A Perspective on the Limits of Computation
Oskar Mencer, May 2012

Limits of Computation

Objective: Maximum Performance Computing (MPC). What is the fastest we can compute desired results?

Conjecture: data movement is the real limit on computation.

Maximum Performance Computing (MPC)

Less Data Movement = Less Data + Less Movement

The journey will take us through:
1. Information Theory: Kolmogorov Complexity
2. Optimised Arithmetic: Winograd Bounds
3. Optimisation via Kahneman and Von Neumann
4. Real-World Dataflow Implications and Results

Kolmogorov Complexity (K)

Definition (Kolmogorov): "If a description of string s, d(s), is of minimal length, [...] it is called a minimal description of s. Then the length of d(s), [...] is the Kolmogorov complexity of s, written K(s), where K(s) = |d(s)|."

Of course, K(s) depends heavily on the language L used to describe actions (e.g. Java, Esperanto, an executable file, etc.).

Kolmogorov, A. N. (1965). "Three Approaches to the Quantitative Definition of Information". Problems Inform. Transmission 1 (1): 1–7.

A Maximum Performance Computing Theorem

For a computational task f computing the result r from inputs i, i.e. r = f(i): assuming infinite capacity to compute and remember inside the box f, the time T to compute task f depends on moving the data in and out of the box. Thus, for a machine f with infinite memory and infinitely fast arithmetic, the Kolmogorov complexity K(i + r) defines the fastest way to compute task f.

SABR model:

  dF_t = σ_t F_t^β dW_t
  dσ_t = α σ_t dZ_t
  dW_t · dZ_t = ρ dt

We integrate in time (Euler in the log-forward, Milstein in the volatility):

  ln F_{t+1} = ln F_t − (1/2) (σ_t exp((β−1) ln F_t))² Δt + σ_t exp((β−1) ln F_t) ΔW_t
  σ_{t+1} = σ_t + α σ_t ΔZ_t + (1/2) (α σ_t) α (ΔZ_t² − Δt)

The representation K(σ, F) of the state (σ, F) is critical!

MPC – Bad News
1. Real computers have neither infinite memory nor infinitely fast arithmetic units.
2. Kolmogorov's theorem: K is not a computable function.

MPC – Good News
Today's arithmetic units are fast enough.
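Although K itself is not computable, any lossless compressor yields a computable upper bound on K(s): the compressed length is the size of one valid description of s. A minimal sketch using Python's zlib, where the choice of compressor stands in for the choice of description language L:

```python
import os
import zlib

def compressed_size(s: bytes, level: int = 9) -> int:
    """Length of the zlib-compressed string: an upper bound
    (up to an additive constant for the decompressor) on K(s)."""
    return len(zlib.compress(s, level))

# A highly regular string compresses far below its raw length,
# because a short description ("repeat 'ab' 10000 times") exists...
regular = b"ab" * 10_000

# ...while typical random bytes barely compress at all: with high
# probability no description shorter than the data itself exists.
random_ish = os.urandom(20_000)

assert compressed_size(regular) < len(regular)
assert compressed_size(random_ish) >= len(random_ish) * 0.95
```

The gap between the two cases is exactly the "less data" half of Less Data Movement: a better representation of the same information moves fewer bytes.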
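The SABR discretisation above can be sketched as a single Monte Carlo time step. This is an illustrative sketch, not the production kernel from the slides; the function name `sabr_step` is an assumption, and alpha, beta, rho follow the standard SABR parameter names:

```python
import math
import random

def sabr_step(lnF, sigma, alpha, beta, rho, dt, rng=random):
    """One time step of the scheme above: Euler in the log-forward,
    Milstein in the volatility."""
    sqdt = math.sqrt(dt)
    dW = rng.gauss(0.0, 1.0) * sqdt
    # Correlate the volatility driver with the forward driver: dW.dZ = rho dt
    dZ = rho * dW + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0) * sqdt
    vol = sigma * math.exp((beta - 1.0) * lnF)   # sigma_t * F_t^(beta - 1)
    # Euler step in ln F, with the Ito drift correction -vol^2/2 dt
    lnF_next = lnF - 0.5 * vol * vol * dt + vol * dW
    # Milstein step for d(sigma) = alpha * sigma * dZ
    sigma_next = sigma + alpha * sigma * dZ + 0.5 * (alpha * sigma) * alpha * (dZ * dZ - dt)
    return lnF_next, sigma_next
```

The state carried from step to step is exactly the pair (σ, F) from the slide; how that state is represented (precision, encoding) determines the data moved per step.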
So in practice: Kolmogorov Complexity => Discretisation & Compression => MPC depends on the representation of the problem.

Euclid's Elements: representing a² + b² = c²

17 × 24 = ?

Thinking, Fast and Slow
Daniel Kahneman, Nobel Prize in Economics, 2002

Back to 17 × 24. Kahneman splits thinking into:
• System 1: fast, hard to control ... guesses around 400
• System 2: slow, easier to control ... computes 408

Remembering Fast and Slow

John von Neumann, 1946: "We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding, but which is less quickly accessible."

Consider Computation and Memory Together

Computing f(x) in the range [a, b] with error |E| ≤ 2⁻ⁿ:
• Table: uniform vs non-uniform; how many table entries?
• Table + Arithmetic (+, −, ×, ÷): how many coefficients?
• Arithmetic only (+, −, ×, ÷): polynomial or rational approximation, continued fractions, multi-partite tables.

The underlying hardware/technology changes the optimum.

MPC in Practice: trade off representation, memory and arithmetic.

Limits on Computing + and ×
Shmuel Winograd, 1965

Bounds on addition:
- Binary: O(log n)
- Residue Number System: O(log 2 log α(N))
- Redundant Number System: O(1)

Bounds on multiplication:
- Binary: O(log n)
- Residue Number System: O(log 2 log β(N))
- Using tables: O(2[log n/2] + 2 + [log 2n/2])
- Logarithmic Number System: O(addition)

However, binary and logarithmic numbers are easy to compare; the others are not!

Lesson: if you optimise only a little piece of the computation, the result is useless in practice => you need to optimise ENTIRE programs. Or in other words: abstraction kills performance.

Addition in O(1)

Redundant representation: 2 bits represent 1 binary digit => use counters to reduce the input. (3,2) counters reduce three numbers (a, b, c) to two numbers (out1, out2) so that a + b + c = out1 + out2.

From Theory to Practice

Optimise whole programs and customise the architecture:
• Method, iteration => processor
• Discretisation => storage
• Bit-level representation => customised numerics

Mission impossible?
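The table-versus-arithmetic trade-off above can be illustrated at the simplest point on the spectrum: a uniform table of first-degree polynomial coefficients plus one multiply-add per evaluation. A hedged sketch (function names are illustrative); halving the segment width roughly quarters the error, so table entries trade directly against the error bound |E| ≤ 2⁻ⁿ:

```python
import math

def build_table(f, a, b, entries):
    """Precompute (value, slope) per uniform segment: a small table plus
    one multiply-add then replaces a full evaluation of f."""
    h = (b - a) / entries
    return [(f(a + i * h), (f(a + (i + 1) * h) - f(a + i * h)) / h)
            for i in range(entries)], h

def lookup(table, h, a, x):
    """Table + arithmetic: one index computation, one load, one fma."""
    i = min(int((x - a) / h), len(table) - 1)
    y0, slope = table[i]
    return y0 + slope * (x - a - i * h)

# 256 entries for sin on [0, 1]: piecewise-linear error is about h^2/8,
# i.e. already well below 2^-16 for this smooth function.
table, h = build_table(math.sin, 0.0, 1.0, 256)
err = max(abs(lookup(table, h, 0.0, x / 1000) - math.sin(x / 1000))
          for x in range(1000))
```

Shrinking the table and raising the polynomial degree (or vice versa) is exactly the memory-versus-arithmetic balance the slide describes; the hardware decides which side is cheaper.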
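The (3,2)-counter reduction above can be sketched with integer bitwise operations: each bit position independently produces a sum bit and a carry bit, so no carry chain forms and the depth is O(1) regardless of word length. A minimal software model (in hardware each line is one gate level per bit):

```python
def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """Carry-save addition: a column of (3,2) counters reducing
    three numbers to two while preserving their sum."""
    out1 = a ^ b ^ c                            # per-bit sum (odd parity)
    out2 = ((a & b) | (b & c) | (a & c)) << 1   # per-bit majority = carry,
                                                # shifted into the next column
    return out1, out2

s, t = csa(13, 7, 5)
assert s + t == 13 + 7 + 5   # the invariant: a + b + c == out1 + out2
```

One ordinary (carry-propagating) addition is still needed at the very end to convert the redundant pair back to binary, which is why redundant numbers are cheap to add but hard to compare.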
Optimise Whole Programs with Finite Resources

System 1: x86 cores with low-latency memory.
System 2: flexible memory + logic with high-throughput memory.
Balance computation and memory.

The ideal System 2 is a production line.

8 Maxeler DFEs replacing 1,900 Intel CPU cores, presented by ENI at the Annual SEG Conference, 2010. [Chart: equivalent CPU cores, up to about 2,000, versus number of MAX2 cards (1, 4, 8) at 15 Hz, 30 Hz, 45 Hz and 70 Hz peak frequency, compared to 32 3 GHz x86 cores parallelised using MPI.] 100 kW of Intel cores => 1 kW of Maxeler Dataflow Engines.

Example: Sparse Matrix Computations
O. Lindtjorn et al., Hot Chips 2010

Given matrix A and vector b, find vector x in Ax = b. The CPU version does not scale beyond six x86 CPU cores; the Maxeler solution achieves 20–40x in 1U. [Chart: speedup per 1U node, up to about 60x, versus compression ratio from 0 to 10, for the GREE0A and 1new01 matrices.] Domain-specific address and data encoding (patent pending).

Example: JP Morgan Derivatives Pricing
O. Mencer, S. Weston, Journal on Concurrency and Computation, July 2011

• Compute value and risk of complex credit derivatives.
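The patented domain-specific encoding is not described in the slides, but the principle that compressing the matrix reduces data movement can be sketched with the standard Compressed Sparse Row encoding (an assumption standing in for the actual scheme):

```python
def to_csr(dense):
    """Compressed Sparse Row: store only the non-zeros plus their
    addresses. The encoding shrinks what the memory system must
    stream per matrix-vector multiply, which is the bottleneck."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def spmv(csr, x):
    """Sparse matrix-vector product y = A @ x over the CSR encoding."""
    values, col_idx, row_ptr = csr
    return [sum(values[k] * x[col_idx[k]]
                for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(len(row_ptr) - 1)]

A = [[4.0, 0.0, 0.0],
     [0.0, 3.0, 1.0],
     [0.0, 0.0, 2.0]]
assert spmv(to_csr(A), [1.0, 1.0, 1.0]) == [4.0, 4.0, 2.0]
```

The chart's x-axis makes the same point: the higher the compression ratio of the encoding, the higher the achieved speedup, because the computation is limited by bytes streamed, not by arithmetic.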
• Moving the overnight run to real-time intra-day.
• Reported speedup: 220–270x (8 hours => 2 minutes).
• Power consumption per node drops from 250 W to 235 W.

See the JP Morgan talk at Stanford on YouTube; search "weston maxeler".

Maxeler loop flow graphs for JP Morgan credit derivatives: whole-program transformation options. Maxeler data flow graph for JP Morgan interest-rates Monte Carlo acceleration. Example: a data flow graph generated by MaxCompiler with 4,866 static dataflow cores in one chip.

Maxeler Dataflow Engines (DFEs)

• High-Density DFEs: Intel Xeon CPU cores and up to 6 DFEs with 288 GB of RAM.
• The Dataflow Appliance: dense compute with 8 DFEs, 384 GB of RAM and dynamic allocation of DFEs to CPU servers with zero-copy RDMA access.
• The Low-Latency Appliance: Intel Xeon CPUs and 1–2 DFEs with direct links to up to six 10 Gbit Ethernet connections.
• MaxWorkstation: desktop dataflow development system.
• MaxRack: 10, 20 or 40 node rack systems integrating compute, networking & storage.
• Dataflow Engines: 48 GB DDR3, high-speed connectivity and dense configurable logic.
• MaxCloud: hosted, on-demand, scalable accelerated compute.
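A data flow graph like the MaxCompiler example above behaves as a production line: each operation is a pipeline stage that data streams through, so all stages work concurrently on successive values. A loose software analogue using Python generators (illustrative only; this is not MaxCompiler's API):

```python
def source(xs):
    """Entry of the pipeline: streams values in, one per 'cycle'."""
    for x in xs:
        yield x

def scale(stream, k):
    """One dataflow node: multiply each passing value by k."""
    for x in stream:
        yield k * x

def offset(stream, c):
    """Another node: add c to each passing value."""
    for x in stream:
        yield x + c

# Compose the graph: while offset handles element i, scale can already
# work on element i+1. Throughput is set by how fast data moves through
# the line, not by the cost of any single operation.
result = list(offset(scale(source(range(5)), 3), 1))
assert result == [3 * x + 1 for x in range(5)]
```

In a DFE the same idea is laid out spatially: thousands of such nodes are instantiated in silicon at once, which is what makes "4,866 static dataflow cores in one chip" possible.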