A Perspective on the Limits of Computation
Oskar Mencer
May 2012
Limits of Computation
Objective: Maximum Performance Computing (MPC)
What is the fastest we can compute desired results?
Conjecture:
Data movement is the real limit on computation.
Maximum Performance Computing (MPC)
Less Data Movement = Less Data + Less Movement
The journey will take us through:
1. Information Theory: Kolmogorov Complexity
2. Optimised Arithmetic: Winograd Bounds
3. Optimisation via Kahneman and Von Neumann
4. Real World Dataflow Implications and Results
Kolmogorov Complexity (K)
Definition (Kolmogorov):
“If a description of string s, d(s), is of minimal length, […]
it is called a minimal description of s. Then the length of
d(s), […] is the Kolmogorov complexity of s, written K(s),
where K(s) = |d(s)|”
Of course K(s) depends heavily on the description language L
(e.g. Java, Esperanto, an executable file, etc.);
by the invariance theorem, changing L shifts K(s) by at most a constant.
Kolmogorov, A.N. (1965). "Three Approaches to the Quantitative Definition of Information". Problems Inform. Transmission 1 (1): 1–7.
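K itself is not computable, but any lossless compressor yields a computable upper bound: the compressed length plus the constant-size decompressor. A minimal sketch, assuming zlib as the description language L (my illustrative choice, not part of the talk):

```python
import hashlib
import zlib

def k_upper_bound(s: bytes) -> int:
    """Length of a zlib description of s: an upper bound on K(s),
    up to the constant cost of the decompressor itself."""
    return len(zlib.compress(s, 9))

# A highly regular string has a short description...
regular = b"ab" * 500                      # 1000 bytes
# ...while pseudo-random bytes barely compress at all.
chunks, h = [], b"seed"
while sum(len(c) for c in chunks) < 1000:
    h = hashlib.sha256(h).digest()         # deterministic pseudo-random data
    chunks.append(h)
random_ish = b"".join(chunks)[:1000]

print(k_upper_bound(regular))     # tens of bytes
print(k_upper_bound(random_ish))  # close to (or above) 1000 bytes
```

The gap between the two bounds is exactly the "less data" lever of MPC: regular inputs admit a short representation.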
A Maximum Performance Computing Theorem
For a computational task f, computing the result r,
given inputs i, i.e. task f: r = f( i ):
[Diagram: i → box f → r]
Assuming infinite capacity to compute and remember
inside box f, the time T to compute task f depends on moving
the data in and out of the box.
Thus, for a machine f with infinite memory and
infinitely fast arithmetic, Kolmogorov complexity K(i+r) defines
the fastest way to compute task f.
SABR model:
dF_t = σ_t F_t^β dW_t
dσ_t = α σ_t dZ_t
⟨dW, dZ⟩ = ρ dt
We integrate in time (Euler in log-forward, Milstein in volatility):
ln F_{t+1} = ln F_t − ½ (σ_t exp((β−1) ln F_t))² Δt + σ_t exp((β−1) ln F_t) ΔW_t
σ_{t+1} = σ_t + α σ_t ΔZ_t + ½ α² σ_t (ΔZ_t² − Δt)
[Diagram: logic block updating the state (σ, F)]
The representation K(σ,F) of the state (σ,F) is critical!
MPC– Bad News
1. Real computers do not have either infinite memory or
infinitely fast arithmetic units.
2. Kolmogorov's theorem: K is not a computable function.
MPC – Good News
Today’s arithmetic units are fast enough.
So in practice...
Kolmogorov Complexity => Discretisation & Compression
=> MPC depends on the Representation of the Problem.
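As a concrete illustration of discretisation (my own toy example, not from the talk): quantising doubles in [0,1) to 12-bit fixed point moves 4x less data, at a known, bounded error.

```python
import struct

FRAC_BITS = 12  # 12 fractional bits: rounding error <= 2^-13

def to_fixed(x: float) -> int:
    """Round x in [0,1) to a 12-bit fixed-point integer."""
    return round(x * (1 << FRAC_BITS))

def from_fixed(q: int) -> float:
    return q / (1 << FRAC_BITS)

values = [i / 1000.0 for i in range(1000)]
# 8 bytes per double vs 2 bytes per fixed-point sample
raw = struct.pack("<%dd" % len(values), *values)
packed = struct.pack("<%dH" % len(values), *(to_fixed(v) for v in values))

err = max(abs(v - from_fixed(to_fixed(v))) for v in values)
print(len(raw), len(packed), err)
```

Choosing the representation (here, how many fractional bits) is exactly the tradeoff between data moved and accuracy delivered.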
Euclid's Elements, representing a² + b² = c²
17 × 24 = ?
Thinking Fast and Slow
Daniel Kahneman
Nobel Prize in Economics, 2002
back to 17 × 24
Kahneman splits thinking into:
System 1: fast, hard to control ... estimates about 400
System 2: slow, easier to control ... computes 408
Remembering Fast and Slow
John von Neumann, 1946:
“We are forced to recognize the
possibility of constructing a
hierarchy of memories, each of
which has greater capacity than
the preceding, but which is less
quickly accessible.”
Consider Computation and Memory Together
Computing f(x) in the range [a,b] with error |E| ≤ 2⁻ⁿ — three options:
- Table only: uniform vs non-uniform sampling; number of table entries
- Table + arithmetic (+,-,×,÷): how many coefficients
- Arithmetic only (+,-,×,÷): polynomial or rational approximation; continued fractions; multi-partite tables
The underlying hardware/technology changes the optimum.
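A quick numeric illustration of this tradeoff (sin(x) on [0, π/2], my own toy example): a pure table needs far more entries than a table plus one multiply-add for the same error.

```python
import math

A, B = 0.0, math.pi / 2

def sin_by_table(x, n):
    """Table only: nearest of n uniform midpoint samples."""
    h = (B - A) / n
    i = min(int((x - A) / h), n - 1)
    return math.sin(A + (i + 0.5) * h)   # stands in for a precomputed entry

def sin_by_table_plus_arith(x, n):
    """Table + arithmetic: n intervals plus one linear interpolation."""
    h = (B - A) / n
    i = min(int((x - A) / h), n - 1)
    x0 = A + i * h
    y0, y1 = math.sin(x0), math.sin(x0 + h)  # two table reads
    return y0 + (y1 - y0) * (x - x0) / h

xs = [A + k * (B - A) / 10000 for k in range(10000)]
err_table = max(abs(math.sin(x) - sin_by_table(x, 64)) for x in xs)
err_interp = max(abs(math.sin(x) - sin_by_table_plus_arith(x, 64)) for x in xs)
print(err_table, err_interp)
```

With the same 64 entries, the error drops from O(h) to O(h²): spending a little arithmetic buys orders of magnitude in table size, which is why the optimum shifts with the relative cost of memory and arithmetic.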
MPC in Practice
Tradeoff Representation, Memory and Arithmetic
Limits on Computing + and ×
Shmuel Winograd, 1965
Bounds on Addition:
- Binary: O(log n)
- Residue Number System: O(log 2log α(N))
- Redundant Number System: O(1)
Bounds on Multiplication:
- Binary: O(log n)
- Residue Number System: O(log 2log β(N))
- Using tables: O(2[log n/2] + 2 + [log 2n/2])
- Logarithmic Number System: O(addition)
However, binary and log numbers are easy to compare; the others are not!
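A sketch of why multiplication in the Logarithmic Number System costs only an addition (positive reals only; sign and zero handling omitted, as is the interpolation table a hardware LNS would use):

```python
import math

def lns(x: float) -> float:
    """Encode a positive real as its base-2 logarithm."""
    return math.log2(x)

def lns_mul(a: float, b: float) -> float:
    # Multiplying values is adding exponents: multiplication at the
    # cost of one addition.
    return a + b

def lns_decode(a: float) -> float:
    return 2.0 ** a

product = lns_decode(lns_mul(lns(3.0), lns(5.0)))
print(product)
# Addition of two LNS numbers, by contrast, needs the awkward
# log2(1 + 2^(b-a)) correction term -- that is the hard operation.
```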
Lesson: If you optimize only a little piece of the computation, the result is
useless in practice => Need to optimize ENTIRE programs.
Or in other words: abstraction kills performance.
Addition in O(1)
Redundant: 2 bits represent 1 binary digit
=> use counters to reduce the input
[Diagram: (3,2) counter with inputs a, b, c and outputs out1, out2]
(3,2) counters reduce three numbers (a, b, c) to two numbers (out1, out2)
so that a + b + c = out1 + out2.
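The (3,2) counter is a full adder applied bitwise: every bit position is reduced independently, so the depth is constant regardless of word length. A sketch on Python integers:

```python
def carry_save(a: int, b: int, c: int) -> tuple[int, int]:
    """Reduce three addends to two in O(1) gate depth per bit."""
    s = a ^ b ^ c                               # per-bit sum, no carry chain
    carry = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, shifted
    return s, carry

out1, out2 = carry_save(13, 7, 9)
print(out1 + out2)  # 29 == 13 + 7 + 9
```

Only the final out1 + out2 needs a real (carry-propagating) adder; a tree of such counters sums many operands with a single slow addition at the end.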
From Theory to Practice
Optimise Whole Programs
[Diagram: customise architecture and numerics across the whole stack — method, iteration, processor, discretisation, storage, bit-level representation]
Mission Impossible?
Optimise Whole Programs with Finite Resources
[Diagram: SYSTEM 1 (x86 cores) backed by low-latency memory; SYSTEM 2 (flexible memory+logic) backed by high-throughput memory]
Balance computation and memory.
The Ideal System 2 is a Production Line
[Diagram: SYSTEM 1 (x86 cores) backed by low-latency memory; SYSTEM 2 (flexible memory+logic) backed by high-throughput memory, organised as a production line]
Balance computation and memory.
8 Maxeler DFEs replacing 1,900 Intel CPU cores
presented by ENI at the Annual SEG Conference, 2010
[Chart: equivalent CPU cores (baseline: 32 parallelised 3GHz x86 cores using MPI) vs number of MAX2 cards (1, 4, 8), at peak frequencies of 15Hz, 30Hz, 45Hz and 70Hz; rising to roughly 2,000 equivalent cores at 8 cards]
100kWatts of Intel cores => 1kWatt of Maxeler Dataflow Engines
Example: Sparse Matrix Computations
O. Lindtjorn et al, HotChips 2010
Given matrix A, vector b, find vector x in Ax = b.
The CPU implementation does not scale beyond six x86 CPU cores.
Maxeler solution: 20-40x speedup in 1U.
[Chart: speedup per 1U node (up to ~60x) vs compression ratio (0 to 10) for matrices GREE0A and 1new01]
Domain-specific address and data encoding (*patent pending)
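The patented encoding itself is not public; as an illustrative stand-in, delta-plus-varint encoding of sparse-matrix column indices shows how domain knowledge (indices are sorted and clustered) shrinks the data that must be moved:

```python
def encode_cols(cols: list[int]) -> bytes:
    """Delta-encode sorted column indices, then varint-pack the deltas."""
    out = bytearray()
    prev = 0
    for c in cols:
        d = c - prev
        prev = c
        while d >= 0x80:                 # 7 payload bits per byte,
            out.append((d & 0x7F) | 0x80)  # high bit = "more bytes follow"
            d >>= 7
        out.append(d)
    return bytes(out)

def decode_cols(data: bytes) -> list[int]:
    cols, prev, d, shift = [], 0, 0, 0
    for byte in data:
        d |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += d
            cols.append(prev)
            d, shift = 0, 0
    return cols

# Clustered indices, e.g. one row of a banded matrix:
cols = [1000 + i for i in range(0, 200, 2)]
enc = encode_cols(cols)
print(len(enc), 4 * len(cols))  # far fewer bytes than 4-byte indices
```

Higher compression ratio means fewer bytes streamed per non-zero, which is why speedup on the chart grows with the compression ratio.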
Example: JP Morgan Derivatives Pricing
O Mencer, S Weston, Journal on Concurrency and Computation, July 2011.
• Compute value and risk of complex credit derivatives.
• Move the overnight run to realtime, intra-day computation.
• Reported speedup: 220-270x (8 hours => 2 minutes).
• Power consumption per node drops from 250W to 235W.
See JP Morgan talk at Stanford on Youtube, search “weston maxeler”
Maxeler Loop Flow Graphs
for JP Morgan Credit Derivatives
Whole Program Transformation Options
Maxeler Data Flow Graph for JP Morgan Interest Rates
Monte Carlo Acceleration
Example: a dataflow graph generated by MaxCompiler — 4,866 static dataflow cores in 1 chip.
Maxeler Dataflow Engines (DFEs)
Dataflow Engines: 48GB DDR3, high-speed connectivity and dense configurable logic.
High Density DFEs: Intel Xeon CPU cores and up to 6 DFEs with 288GB of RAM.
The Dataflow Appliance: dense compute with 8 DFEs, 384GB of RAM and dynamic allocation of DFEs to CPU servers with zero-copy RDMA access.
The Low Latency Appliance: Intel Xeon CPUs and 1-2 DFEs with direct links to up to six 10Gbit Ethernet connections.
MaxWorkstation: desktop dataflow development system.
MaxRack: 10, 20 or 40 node rack systems integrating compute, networking & storage.
MaxCloud: hosted, on-demand, scalable accelerated compute.