Cache-Oblivious Algorithms
Guido Drovandi
Tecniche Algoritmiche per Grafi e Reti
Dipartimento di Informatica e Automazione
Università degli Studi “Roma Tre”
RAM Model
• Uniform access time
• Infinite memory
[Figure: CPU connected directly to MEMORY]
Real Architecture
[Figure: memory hierarchy: CPU, L1, L2, Main Memory, Disk]
• Hierarchical structure
– Levels farther from the CPU are slower but larger
– Blocks of data are exchanged between adjacent levels
Memory Access Cost
Memory cost = latency + (block size / bandwidth)

Level         Latency      CPU Cycles
Register                   1
L1                         1-4
L2                         5-20
Main Memory   70-200 ns    50-200
Disk          2-7 ms       10^6
I/O Bottleneck
• Memory access is often the bottleneck in
massive-data applications
• Examples:
– Web
– Weather
– Financial
– Maps
Two-Level Model
[Figure: CPU (registers) connected to a Cache of finite size, connected to a Main Memory of infinite size; data moves between cache and main memory in blocks]
B = block size
M = cache size
M/B = number of blocks in the cache
Memory Model
• Cost:
– number of memory transfers (memory/cache)
• Goal:
– minimize the number of memory transfers
• Parameters:
– N: number of elements in the input
– M: number of elements that fit in the cache
– B: number of elements that fit in a block
Memory Model
• Cache-Aware
– M and B are known and used in algorithms
• Cache-Oblivious
– M and B are unknown parameters
Is a 2-level hierarchy a good memory model?
Principles of Locality
• Spatial locality:
– When the data at position P is accessed, the data in
the neighborhood of P should be accessed next
• Temporal locality:
– When the same data must be used more than once,
it should be reused as much as possible before
other data is retrieved
Example
[Figure: memory access patterns over an array with cells 0 to 14; panels: Sequential, Stride-2, Stride-4, Random]
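The effect of these access patterns can be estimated by counting block loads. The following is a minimal sketch, assuming a hypothetical cache that holds a single block of `block_size` elements; the name `count_block_loads` and its parameters are illustrative and do not come from the slides.

```c
#include <stddef.h>

/* Counts block loads when every element of an n-element array is
   visited in passes of the given stride (0, s, 2s, ..., then
   1, 1+s, ...). Assumption: the cache holds exactly one block of
   block_size elements, so touching a different block is a miss. */
size_t count_block_loads(size_t n, size_t stride, size_t block_size)
{
    size_t loads = 0;
    size_t cached = (size_t)-1;            /* no block cached yet */

    for (size_t start = 0; start < stride && start < n; start++) {
        for (size_t i = start; i < n; i += stride) {
            size_t block = i / block_size;
            if (block != cached) {         /* miss: fetch the block */
                cached = block;
                loads++;
            }
        }
    }
    return loads;
}
```

Under these assumptions, with n = 16 and blocks of 4 elements, stride 1 loads 4 blocks, stride 2 loads 8, and stride 4 loads 16, that is, one load per access.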
Scanning Algorithm
• Scanning algorithm
– for (i=0;i<N;i++)
f(a[i]);
• Complexity in RAM model
– Time(N) = O(N)
• How many memory transfers?
– MT(N) = O(N/B+1)
Scanning Algorithm
[Figure: an array of N elements in main memory spanning blocks of size B, with the worst case marked]
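The "+1" in the scanning bound comes from arrays that do not start on a block boundary. The sketch below illustrates this under the assumption that memory is split into aligned blocks of `block_size` elements; `scan_transfers` and its parameters are hypothetical names.

```c
#include <stddef.h>

/* Number of blocks overlapped by the n elements starting at index
   `first`, when memory is split into aligned blocks of block_size
   elements. A scan that reads each element once loads each of these
   blocks once. */
size_t scan_transfers(size_t first, size_t n, size_t block_size)
{
    if (n == 0)
        return 0;
    size_t first_block = first / block_size;
    size_t last_block = (first + n - 1) / block_size;
    return last_block - first_block + 1;
}
```

With 16 elements and blocks of 4, an aligned array (`first = 0`) touches 4 blocks, while the same array shifted to `first = 3` touches 5, which matches N/B + 1.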
Array-Reversal Algorithm
• Array-Reversal algorithm
– for (i=0;i<N/2;i++)
swap(a[i],a[N-i-1]);
• Complexity in RAM model
– Time(N) = O(N)
• How many memory transfers?
– MT(N) = O(N/B+1)
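The reversal loop behaves like two simultaneous scans, one from each end. The following sketch counts block loads under the assumption of a hypothetical cache with two block slots (one per scan); `reverse_count_loads` is an illustrative name, not part of the slides.

```c
#include <stddef.h>

/* Reverses a[0..n) in place and returns the number of block loads.
   Assumption: the cache has two slots, each holding one block of
   block_size elements; a block is loaded only if it is in neither. */
size_t reverse_count_loads(int *a, size_t n, size_t block_size)
{
    size_t loads = 0;
    size_t slot0 = (size_t)-1, slot1 = (size_t)-1;  /* cached block ids */

    for (size_t i = 0; i < n / 2; i++) {
        size_t j = n - i - 1;
        size_t bi = i / block_size, bj = j / block_size;

        if (bi != slot0 && bi != slot1) { slot0 = bi; loads++; }
        if (bj != slot0 && bj != slot1) { slot1 = bj; loads++; }

        int tmp = a[i];                 /* swap(a[i], a[N-i-1]) */
        a[i] = a[j];
        a[j] = tmp;
    }
    return loads;
}
```

For an aligned array of 16 elements and blocks of 4, this model loads each of the 4 blocks exactly once.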
Binary-Search Algorithm
• Binary-Search algorithm
• Complexity in RAM model
– Time(N) = Time(N/2) + O(1) = O(log_2(N))
• How many memory transfers?
– MT(N) = MT(N/2) + O(1) = O(log_2(N) - log_2(B)) = O(log_2(N/B))
Binary-Search Algorithm
[Figure: a block of size B within the searched array]
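The slide for binary search gives no code, so the following is only a sketch of a standard binary search, instrumented to count how often a probe lands in a different block than the previous probe. It assumes a single-block cache; `binary_search_count` and its parameters are hypothetical names.

```c
#include <stddef.h>

/* Searches the sorted array a[0..n) for key. Returns its index, or -1
   if it is absent. *loads receives the number of probes that touch a
   block other than the one touched by the previous probe (assumption:
   the cache holds one block of block_size elements). */
long binary_search_count(const int *a, size_t n, int key,
                         size_t block_size, size_t *loads)
{
    size_t lo = 0, hi = n;              /* search window is [lo, hi) */
    size_t cached = (size_t)-1;

    *loads = 0;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (mid / block_size != cached) {
            cached = mid / block_size;
            (*loads)++;
        }
        if (a[mid] == key)
            return (long)mid;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}
```

Once the search window shrinks to within one block, the remaining probes cost nothing extra, which is where the log_2(N) - log_2(B) term comes from.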
Merge Sort
• First approach: Cache-Oblivious
– We do not know B and M
• How many memory transfers?
[Figure: merge sort example on the array 3 2 7 0 1 6 4 5]
MT(N) = 2 MT(N/2) + O(N/B) = O((N/B) log_2(N/B))
Merge Sort
• What happens with a different base case?
– MT(1) = 1 gives MT(N) = O((N/B) log_2(N))
– MT(B) = 1 gives MT(N) = O((N/B) log_2(N/B))
– MT(M) = M/B gives MT(N) = O((N/B) log_2(N/M))
Recurrence
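[Figure: recursion tree of height h = log_2(N/M); the root has size N, its children have size N/2, down to N/M leaves of size M; each level costs N/B memory transfers]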
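The influence of the base case can be checked numerically by evaluating the recurrence directly. This is a sketch only: it takes the O(N/B) merge term to be exactly N/B, and `merge_sort_transfers` with its parameters is a hypothetical helper.

```c
/* Evaluates MT(n) = 2 MT(n/2) + n/block, with MT(n) = base_cost as
   soon as n <= base_size. Doubles are used so that n/block stays
   exact for power-of-two inputs smaller than a block. */
double merge_sort_transfers(double n, double block,
                            double base_size, double base_cost)
{
    if (n <= base_size)
        return base_cost;
    return 2.0 * merge_sort_transfers(n / 2.0, block, base_size, base_cost)
           + n / block;
}
```

For example, with N = 2^20, B = 16 and M = 1024, the three base cases MT(1) = 1, MT(B) = 1 and MT(M) = M/B can be compared by calling the function with `(1, 1)`, `(16, 1)` and `(1024, 64)` as the last two arguments; the result shrinks as the base case grows.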
Merge Sort
• Second approach: Cache-Aware
– We know B and M
• How can we exploit B and M?
[Figure: merge sort example on the array 3 2 7 0 1 6 4 5]
Merge Sort
• M/B-way Merge Sort
• MT(N) = (M/B) MT(NB/M) + O(N/B)
= O((N/B) log_{M/B}(N/B))
[Figure: M/B-way merge example on the array 3 2 7 0 1 6 4 5]
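The merge step of an M/B-way merge sort can be sketched as a k-way merge with k = M/B. The code below is a minimal illustration, not the slides' implementation: it picks the minimum by a linear scan over the run heads, and the names `kway_merge`, `runs`, `len` and `pos` are assumptions.

```c
#include <stddef.h>

/* Merges k sorted runs into out and returns the number of elements
   written. runs[r] points to run r, len[r] is its length, and pos is
   caller-provided scratch space for k cursors. */
size_t kway_merge(const int *const *runs, const size_t *len,
                  size_t *pos, size_t k, int *out)
{
    size_t written = 0;

    for (size_t r = 0; r < k; r++)
        pos[r] = 0;

    for (;;) {
        size_t best = k;                /* k means "all runs exhausted" */
        for (size_t r = 0; r < k; r++) {
            if (pos[r] < len[r] &&
                (best == k || runs[r][pos[r]] < runs[best][pos[best]]))
                best = r;
        }
        if (best == k)
            break;
        out[written++] = runs[best][pos[best]++];
    }
    return written;
}
```

In the cache-aware setting the point of choosing k = M/B is that one block per run fits in the cache at the same time, so each run is read as a scan.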
Matrix Multiplication
[Figure: matrix A, stored in row-major order, times matrix B, stored in column-major order, equals C]
Memory transfers:
– O(N^2/B) if 3N^2 ≤ M
– O(N^3/B) if N^2 > M
Matrix Multiplication
| A11 A12 |   | B11 B12 |   | A11 x B11 + A12 x B21   A11 x B12 + A12 x B22 |
| A21 A22 | x | B21 B22 | = | A21 x B11 + A22 x B21   A21 x B12 + A22 x B22 |
• Divide the matrix until it fits in cache
• New subproblems:
– 8 multiplications
– 4 summations (O(N^2/B) memory transfers)
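The divide-and-conquer scheme above can be sketched as follows. This is an illustration under stated assumptions, not the slides' code: matrices are square with a power-of-two size, stored row-major with row stride `ld` (not in the recursive layout), and the four summations are folded into the recursion by accumulating into C. The name `matmul_rec` is hypothetical.

```c
#include <stddef.h>

/* Computes C += A * B for n x n matrices (n a power of two) stored
   row-major with row stride ld. Each call splits the operands into
   quadrants and performs the 8 recursive multiplications. */
void matmul_rec(const double *a, const double *b, double *c,
                size_t n, size_t ld)
{
    if (n == 1) {
        c[0] += a[0] * b[0];
        return;
    }
    size_t h = n / 2;
    const double *a11 = a, *a12 = a + h;
    const double *a21 = a + h * ld, *a22 = a + h * ld + h;
    const double *b11 = b, *b12 = b + h;
    const double *b21 = b + h * ld, *b22 = b + h * ld + h;
    double *c11 = c, *c12 = c + h;
    double *c21 = c + h * ld, *c22 = c + h * ld + h;

    matmul_rec(a11, b11, c11, h, ld); matmul_rec(a12, b21, c11, h, ld);
    matmul_rec(a11, b12, c12, h, ld); matmul_rec(a12, b22, c12, h, ld);
    matmul_rec(a21, b11, c21, h, ld); matmul_rec(a22, b21, c21, h, ld);
    matmul_rec(a21, b12, c22, h, ld); matmul_rec(a22, b22, c22, h, ld);
}
```

The caller is expected to zero C first. No cache parameter appears in the code: the recursion reaches a subproblem size that fits in cache regardless of M and B.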
Matrix Multiplication
 1  2  5  6
 3  4  7  8
 9 10 13 14
11 12 15 16
• Layout of a 4x4 matrix (the numbers give the order of the cells in memory)
• Each recursive submatrix is stored
in consecutive cells of array
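The numbering in the 4x4 layout corresponds to interleaving the bits of the row and column indices. The sketch below computes the 0-based position for a given cell; it assumes 16-bit row and column indices, and the names `spread_bits` and `z_index` are illustrative.

```c
#include <stdint.h>

/* Spreads the low 16 bits of x so that bit k moves to bit 2k. */
static uint32_t spread_bits(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* 0-based position of cell (row, col) in the bit-interleaved layout:
   column bits occupy the even bit positions, row bits the odd ones. */
uint32_t z_index(uint32_t row, uint32_t col)
{
    return (spread_bits(row) << 1) | spread_bits(col);
}
```

Adding 1 to `z_index(row, col)` reproduces the numbers in the 4x4 layout, for example cell (2, 3) maps to 13 + 1 = 14.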
Matrix Multiplication
• “One of the practical disadvantages of
bit-interleaved layouts is that index
calculations on conventional
microprocessors can be costly”
• “A deficiency we hope that processor
architects will remedy”
(Frigo, Leiserson, Prokop, Ramachandran)
Matrix Multiplication
• Memory transfers:
– MT(N) = 8 MT(N/2) + O(N^2/B)
– MT(c√M) = O(M/B)
– MT(N) = 8(8(...(8 MT(N/2^s)))) + O(N^2/B)
= 8^s MT(c√M) + O(N^2/B), where N/2^s = c√M, so 8^s = O(N^3/M^{3/2})
= O(N^3/(B√M) + N^2/B)
Back to Binary Search
• Can we improve the bound O(log_2(N/B))?
• Remember that we do not know B and M
• Hints:
– Recursive layout as for matrix
multiplication
– Binary search sounds like binary tree
Static Search Tree
[Figure: search tree cut at half height, h = log_2(N)/2; the nodes are numbered in layout order]
Each subtree has √N nodes
Static Search Tree
[Figure: the search tree partitioned into subtrees, each marked with a block of size B]
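Van Emde Boas Layout
[Figure: a tree of height h is split into a top subtree R of height h/2 and bottom subtrees C1, ..., Cn of height h/2; in memory, R is stored first, followed by C1, ..., Cn]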
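The recursive layout in the figure can be sketched for a complete binary tree whose nodes are identified by their heap (BFS) index, with the root at index 1. This is a minimal illustration: the function name `veb_order` is hypothetical, height counts levels, and giving the extra level to the bottom subtrees when the height is odd is an assumption, since conventions differ.

```c
#include <stddef.h>

/* Appends to out[] the heap indices (root = 1, children 2i and 2i+1)
   of the subtree with `height` levels rooted at `root`, in van Emde
   Boas order: first the top subtree R, then the bottom subtrees
   C1, ..., Cn from left to right. Returns the updated element count. */
size_t veb_order(size_t root, unsigned height, size_t *out, size_t count)
{
    if (height == 0)
        return count;
    if (height == 1) {
        out[count++] = root;
        return count;
    }
    unsigned top = height / 2;          /* levels in the top subtree */
    unsigned bottom = height - top;     /* levels in each bottom subtree */

    count = veb_order(root, top, out, count);

    /* Roots of the bottom subtrees: descendants of root at depth top. */
    size_t first = root << top;
    for (size_t i = 0; i < ((size_t)1 << top); i++)
        count = veb_order(first + i, bottom, out, count);
    return count;
}
```

Calling `veb_order(1, 4, out, 0)` on a 15-node tree fills `out` with the storage order of the nodes; the position of a node in `out` is its address in the laid-out array.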
Static Search Tree
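• The recursive subtrees that fit in a block have between
√B and B nodes, so their height is at least
log_2(B)/2 = log_2(√B)
• The length of a root-to-leaf path is log_2(N)
• Each such subtree may straddle two blocks, so a search
incurs at most 2 (log_2(N) / log_2(√B)) = 4 log_B(N)
memory transfers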
Sorting
• Cache Aware:
– MT(N) = Θ((N/B) log_{M/B}(N/B))
• Cache Oblivious:
– MT(N) = Θ((N/B) log_2(N/M))
– Can we improve this bound?
Funnel Sort
[Figure: a funnel merging sorted input sequences into one sorted output]
Funnel Sort
• K-Funnel
– merges K sorted lists of total size K^3
– needs (K^3/B) log_{M/B}(K^3/M) memory transfers
• How can we use these funnels?
– Hint: Merge-Sort
Funnel Sort
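[Figure: N^{1/3}-way sort: the input is split into N^{1/3} segments of size N^{2/3}, which are sorted recursively and then merged]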
Funnel Sort
MT(N) = O((N/B) log_{M/B}(N/B)) + N^{1/3} MT(N^{2/3})
= O((N/B) log_{M/B}(N/B) + (N/B) log_{M/B}(N^{2/3}/B))
+ N^{5/9} MT(N^{4/9})
≤ O((1 + 2/3) (N/B) log_{M/B}(N/B))
+ N^{5/9} MT(N^{4/9})
= O((N/B) log_{M/B}(N/B))
K-Funnel
[Figure: a K-funnel with K inputs and an output buffer of size K^3; √K bottom √K-funnels feed K^{1/2} buffers of size K^{3/2}, which are merged by a top √K-funnel]
K-Funnel
• Space(K) = (1 + √K) Space(√K) + O(K^2) = O(K^2)
• Consider a J-Funnel that fits in ¼ of the
cache
– cJ^2 ≤ M/4 implies J ≤ √M/2
– J input buffers fit in ½ of the cache:
• suppose M ≥ B^2
• JB ≤ (√M/2) √M = M/2
K-Funnel
• Loading an entire J-Funnel into cache costs
O(M/(4B) + J) = O(J^2/B + J) = O(J^3/B)
• Extracting J^3 elements costs O(J^3/B)
• Each element pays O(1/B) to enter a
buffer of size at least J^3
K-Funnel
• Each element enters O(log(K) / log(J))
= O(log_M(K)) buffers (of size at least J^3)
• MT(K) = O((K^3/B) log_M(K) + K)
= O((K^3/B) log_{M/B}(K) + K)
(M = Ω(B^2))
= O((K^3/B) log_{M/B}(K/B) + K)
= O((K^3/B) log_{M/B}(K^3/B) + K)
(K = Ω(B^2))
Bounds on Graphs
                      Cache-Oblivious          Cache-Aware
List Ranking          O(Sort(V))               O(Sort(V))
Euler Tour            O(Sort(V))               O(Sort(V))
Spanning Tree / MST   O(Sort(E) log log V)     O(Sort(E) log log (VB/E))
Bounds on Graphs
                      Cache-Oblivious                  Cache-Aware
Undirected BFS        O(V + Sort(E))                   O(ST(E) + Sort(E) + √(VE/B))
Directed BFS / DFS    O((V + E/B) log V + Sort(E))     O((V + E/B) log V + Sort(E))
Undirected SSSP       O(V + (E/B) log(E/B))            O(V + (E/B) log(E/B))
References
• E. D. Demaine
Cache-Oblivious Algorithms and Data Structures
• P. Kumar
Cache Oblivious Algorithms
• A. Aggarwal, J. S. Vitter
The input/output complexity of sorting and related
problems, 1988
• M. Frigo, C. E. Leiserson, H. Prokop, S. Ramachandran
Cache-Oblivious Algorithms, 1999
• L. Arge, M. A. Bender, E. D. Demaine, B. Holland-Minkley, J. I. Munro
An Optimal Cache-Oblivious Priority Queue And Its
Application To Graph Algorithms, 2007
• G. S. Brodal, R. Fagerberg, U. Meyer, N. Zeh
Cache-Oblivious Data Structures and Algorithms For
Undirected Breadth-First Search and Shortest
Paths, 2004
• M. A. Bender, E. D. Demaine, M. Farach-Colton
Cache-Oblivious B-Trees, 2000