Cache-Oblivious Algorithms

Guido Drovandi
Tecniche Algoritmiche per Grafi e Reti
Dipartimento di Informatica e Automazione
Università degli Studi “Roma Tre”

RAM Model
• Uniform access time
• Infinite memory
[Figure: a CPU connected to a single MEMORY]

Real Architecture
[Figure: memory hierarchy CPU, L1, L2, Main Memory, Disk]
• Hierarchical structure
  – Levels farther from the CPU are slower but larger
  – Blocks of data are exchanged between adjacent levels

Memory Access Cost
• Memory cost = latency + block size / bandwidth

              Register   L1    L2     Main Memory   Disk
  Latency                             70-200 ns     2-7 ms
  CPU cycles  1          1-4   5-20   50-200        $10^6$

I/O Bottleneck
• Memory access is often the bottleneck in applications that process massive data
• Examples:
  – Web
  – Weather
  – Financial
  – Maps

Two-Level Model
[Figure: CPU (registers), a cache of finite size M made of M/B blocks of size B, and a main memory of infinite size organized in blocks of size B]

Memory Model
• Cost:
  – number of memory transfers between main memory and cache
• Goal:
  – minimize the number of memory transfers
• Parameters:
  – N: number of elements in the input
  – M: number of elements that fit in the cache
  – B: number of elements that fit in a block

Memory Model
• Cache-Aware
  – M and B are known and used by the algorithm
• Cache-Oblivious
  – M and B are unknown parameters
• Is a two-level hierarchy a good memory model?

Principles of Locality
• Spatial locality:
  – An algorithm that accesses the data at position P should next access data in the neighborhood of P
• Temporal locality:
  – An algorithm that needs the same data more than once should reuse it as much as possible before fetching other data

Example
[Figure: access orders over an array with indices 0 to 14: sequential, stride-2, stride-4, and random]

Scanning Algorithm
• Scanning algorithm
  – for (i = 0; i < N; i++) f(a[i]);
• Complexity in RAM model
  – $Time(N) = O(N)$
• How many memory transfers?
  – $MT(N) = O(N/B + 1)$

Scanning Algorithm
[Figure: an array of N elements in main memory divided into blocks of size B, with the worst case highlighted]

Array-Reversal Algorithm
• Array-Reversal algorithm
  – for (i = 0; i < N/2; i++) swap(a[i], a[N-i-1]);
• Complexity in RAM model
  – $Time(N) = O(N)$
• How many memory transfers?
  – $MT(N) = O(N/B + 1)$

Binary-Search Algorithm
• Binary-Search algorithm
• Complexity in RAM model
  – $Time(N) = Time(N/2) + O(1) = O(\log_2 N)$
• How many memory transfers?
  – $MT(N) = MT(N/2) + O(1) \approx O(\log_2 N - \log_2 B)$

Binary-Search Algorithm
[Figure: binary search over an array divided into blocks of size B]

Merge Sort
• First approach: Cache-Oblivious
  – We do not know B and M
• How many memory transfers?
  – $MT(N) = 2\,MT(N/2) + O(N/B) = O\left(\frac{N}{B}\log_2\frac{N}{B}\right)$
[Figure: merge sort on the array 3 2 7 0 1 6 4 5]

Merge Sort
• What happens with another base case?
  – $MT(1) = 1$
    • $MT(N) = O\left(\frac{N}{B}\log_2 N\right)$
  – $MT(B) = 1$
    • $MT(N) = O\left(\frac{N}{B}\log_2\frac{N}{B}\right)$
  – $MT(M) = M/B$
    • $MT(N) = O\left(\frac{N}{B}\log_2\frac{N}{M}\right)$

Recurrence
[Figure: recursion tree of height $h = \log_2(N/M)$: the root has size N, its children have size N/2, and the N/M leaves have size M; every level costs N/B memory transfers]

Merge Sort
• Second approach: Cache-Aware
  – We know B and M
• How can we exploit B and M?
[Figure: merge sort on the array 3 2 7 0 1 6 4 5]

Merge Sort
• M/B-way Merge Sort
• $MT(N) = \frac{M}{B}\,MT\left(\frac{NB}{M}\right) + O(N/B) = O\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right)$
[Figure: multiway merge on the array 3 2 7 0 1 6 4 5]

Matrix Multiplication
[Figure: matrix A × matrix B = C, with A stored in row-major order and B in column-major order]
• Memory transfers:
  – $O(N^2/B)$ if $3N^2 \le M$
  – $O(N^3/B)$ if $N^2 > M$

Matrix Multiplication
$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \times \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11} \times B_{11} + A_{12} \times B_{21} & A_{11} \times B_{12} + A_{12} \times B_{22} \\ A_{21} \times B_{11} + A_{22} \times B_{21} & A_{21} \times B_{12} + A_{22} \times B_{22} \end{pmatrix}$$
• Divide the matrices until the subproblem fits in cache
• New subproblems:
  – 8 multiplications
  – 4 summations ($O(N^2/B)$ memory transfers)

Matrix Multiplication
   1  2  5  6
   3  4  7  8
   9 10 13 14
  11 12 15 16
• Layout of a 4x4 matrix
• Each recursive submatrix is stored in consecutive cells of the array

Matrix Multiplication
• “One of the practical disadvantages of bit-interleaved layouts is that index calculations on conventional microprocessors can be costly”
• “A deficiency we hope that processor architects will remedy”
Frigo, Leiserson, Prokop, Ramachandran

Matrix Multiplication
• Memory transfers:
  – $MT(N) = 8\,MT(N/2) + O(N^2/B)$
  – $MT(c\sqrt{M}) = O(M/B)$
  – $MT(N) = 8(8(\ldots(8\,MT(N/2^s)))) + O(N^2/B) = 8^s\,MT(c\sqrt{M}) + O(N^2/B) = O\left(\frac{N^3}{B\sqrt{M}} + \frac{N^2}{B}\right)$
  – Here $N/2^s = c\sqrt{M}$, so $8^s = (N/(c\sqrt{M}))^3$

Back to Binary Search
• Can we improve the bound $O(\log_2(N/B))$?
• Remember that we do not know B and M
• Hints:
  – Recursive layout as for matrix multiplication
  – Binary search sounds like binary tree

Static Search Tree
[Figure: complete binary tree of height $h = \log_2 N$ cut at half height; each resulting subtree has $\sqrt{N}$ nodes, and the nodes are numbered in layout order]

Static Search Tree
van Emde Boas Layout
[Figure: a tree of height h is split into a top subtree R of height h/2 and bottom subtrees $C_1, \ldots, C_n$ of height h/2; memory stores R, then $C_1, \ldots, C_n$, and each block of size B holds consecutive cells]

Static Search Tree
• A recursive subtree of size at most B has height at least $\log_2(B)/2 = \log_2\sqrt{B}$
• The length of a root-to-leaf path is $\log_2 N$
• Each such subtree may straddle two blocks, so a search incurs at most $2\,\log_2 N / \log_2\sqrt{B} = 4\log_B N$ memory transfers

Sorting
• Cache-Aware:
  – $MT(N) = \Theta\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right)$
• Cache-Oblivious:
  – $MT(N) = \Theta\left(\frac{N}{B}\log_2\frac{N}{M}\right)$
  – Can we improve this bound?

Funnel Sort
• K-Funnel
  – merges K sorted lists of total size $K^3$
  – needs $O\left(\frac{K^3}{B}\log_{M/B}\frac{K^3}{B} + K\right)$ memory transfers
• How can we use these funnels?
  – Hint: Merge-Sort

Funnel Sort
[Figure: the input is split into $N^{1/3}$ segments, each segment is sorted recursively, and the sorted segments are merged by an $N^{1/3}$-way funnel]

Funnel Sort
$$\begin{aligned}
MT(N) &= O\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right) + N^{1/3}\,MT(N^{2/3}) \\
&= O\left(\frac{N}{B}\log_{M/B}\frac{N}{B} + \frac{N}{B}\log_{M/B}\frac{N^{2/3}}{B}\right) + N^{5/9}\,MT(N^{4/9}) \\
&\le O\left(\left(1 + \frac{2}{3}\right)\frac{N}{B}\log_{M/B}\frac{N}{B}\right) + N^{5/9}\,MT(N^{4/9}) \\
&= O\left(\frac{N}{B}\log_{M/B}\frac{N}{B}\right)
\end{aligned}$$

K-Funnel
[Figure: a K-funnel with K inputs and an output buffer of size $K^3$, built from $\sqrt{K}$ bottom $\sqrt{K}$-funnels, $K^{1/2}$ intermediate buffers of size $K^{3/2}$, and one top $\sqrt{K}$-funnel]

K-Funnel
• $Space(K) = (1 + \sqrt{K})\,Space(\sqrt{K}) + O(K^2) = O(K^2)$
• Consider a J-funnel that fits in 1/4 of the cache
  – $cJ^2 \le M/4 \Rightarrow J \le \sqrt{M}/2$
  – One block from each of the J input buffers fits in 1/2 of the cache:
    • suppose $M \ge B^2$
    • $JB \le \frac{\sqrt{M}}{2}\cdot\sqrt{M} = M/2$

K-Funnel
• Loading an entire J-funnel into cache costs $O(M/(4B) + J) = O(J^2/B + J) = O(J^3/B)$
• Extracting $J^3$ elements costs $O(J^3/B)$
• Each element pays $O(1/B)$ to enter a buffer of size at least $J^3$

K-Funnel
• Each element enters $O(\log K / \log J) = O(\log_M K)$ buffers (of size at least $J^3$)
• Memory transfers:
$$\begin{aligned}
MT(K) &= O\left(\frac{K^3}{B}\log_M K + K\right) \\
&= O\left(\frac{K^3}{B}\log_{M/B} K + K\right) \qquad (M = \Omega(B^2)) \\
&= O\left(\frac{K^3}{B}\log_{M/B}\frac{K}{B} + K\right) \\
&= O\left(\frac{K^3}{B}\log_{M/B}\frac{K^3}{B} + K\right) \qquad (K = \Omega(B^2))
\end{aligned}$$

Bounds on Graphs
                        Cache-Oblivious                    Cache-Aware
  List Ranking          $O(Sort(V))$                       $O(Sort(V))$
  Euler Tour            $O(Sort(V))$                       $O(Sort(V))$
  Spanning Tree / MST   $O(Sort(E)\,\log\log V)$           $O\left(Sort(E)\,\log\log\frac{VB}{E}\right)$

Bounds on Graphs
                        Cache-Oblivious                                Cache-Aware
  Undirected BFS        $O(V + Sort(E))$                               [garbled in source: VE O St ( E ) + So( E ) + B]
  Directed BFS / DFS    $O\left((V + \frac{E}{B})\log V + Sort(E)\right)$   $O\left((V + \frac{E}{B})\log V + Sort(E)\right)$
  Undirected SSSP       [garbled in source: E VE O log B B]            [garbled in source: E VE O log B B]

References
• E. D. Demaine, Cache-Oblivious Algorithms and Data Structures
• P. Kumar, Cache Oblivious Algorithms
• A. Aggarwal, J. S. Vitter, The input/output complexity of sorting and related problems, 1988
• M. Frigo, C. E. Leiserson, H. Prokop, S. Ramachandran, Cache-Oblivious Algorithms, 1999
• L. Arge, M. A. Bender, E. D. Demaine, B. Holland-Minkley, J. I. Munro, An Optimal Cache-Oblivious Priority Queue and Its Application to Graph Algorithms, 2007
• G. S. Brodal, R. Fagerberg, U. Meyer, N. Zeh, Cache-Oblivious Data Structures and Algorithms for Undirected Breadth-First Search and Shortest Paths, 2004
• M. A. Bender, E. D. Demaine, M. Farach-Colton, Cache-Oblivious B-Trees, 2000
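
Appendix: a sketch of the recursive matrix multiplication

The divide-and-conquer scheme from the Matrix Multiplication slides (eight recursive multiplications on quadrants) can be sketched in C as below. This is an illustration under stated assumptions, not the implementation analyzed by Frigo et al.: the names co_matmul and matmul_rec are hypothetical, N is assumed to be a power of two, the matrices are stored in plain row-major order rather than in the recursive layout shown on the slides, the recursion runs down to 1x1 blocks, and the four summations are folded into the recursive calls by accumulating directly into C. Note that the code never refers to M or B.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Recursive step (hypothetical helper): C += A * B on n x n blocks.
 * The blocks live inside larger row-major arrays whose rows have
 * length ld, so element (i, j) of a block is at offset i * ld + j.
 */
static void matmul_rec(const double *a, const double *b, double *c,
                       size_t n, size_t ld)
{
    if (n == 1) {
        /* Base case: scalar multiply-accumulate. */
        c[0] += a[0] * b[0];
        return;
    }

    size_t h = n / 2;

    /* Top-left corners of the four quadrants of each matrix. */
    const double *a11 = a,          *a12 = a + h;
    const double *a21 = a + h * ld, *a22 = a + h * ld + h;
    const double *b11 = b,          *b12 = b + h;
    const double *b21 = b + h * ld, *b22 = b + h * ld + h;
    double *c11 = c,          *c12 = c + h;
    double *c21 = c + h * ld, *c22 = c + h * ld + h;

    /* Eight recursive multiplications; sums accumulate in place. */
    matmul_rec(a11, b11, c11, h, ld);
    matmul_rec(a12, b21, c11, h, ld);
    matmul_rec(a11, b12, c12, h, ld);
    matmul_rec(a12, b22, c12, h, ld);
    matmul_rec(a21, b11, c21, h, ld);
    matmul_rec(a22, b21, c21, h, ld);
    matmul_rec(a21, b12, c22, h, ld);
    matmul_rec(a22, b22, c22, h, ld);
}

/* Computes C = A * B for n x n row-major matrices, n a power of two. */
void co_matmul(const double *a, const double *b, double *c, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);

    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    matmul_rec(a, b, c, n, n);
}
```

For example, with a = {1, 2, 3, 4} and b = {5, 6, 7, 8} read as 2x2 row-major matrices, co_matmul(a, b, c, 2) fills c with {19, 22, 43, 50}. A practical version would probably stop the recursion at a small fixed block size to limit call overhead; that choice does not require knowing M or B.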