Data-Parallel Algorithms and Data Structures
John Owens, UC Davis

Programming a GPU for Graphics
• Application specifies geometry -> rasterized
• Each fragment is shaded with a SIMD program
• Shading can use values from texture memory
• Image can be used as a texture on future passes

Programming a GPU for GP Programs (Old School)
• Draw a screen-sized quad
• Run a SIMD program over each fragment
• "Gather" is permitted from texture memory
• Resulting buffer can be treated as a texture on the next pass

Programming a GPU for GP Programs (New School)
• Define a computation domain over elements
• Run a SIMD program over each element/thread
• Global memory supports both gather and scatter
• Resulting buffer can be used in other general-purpose kernels or as a texture

One Slide Summary of Today
• GPUs are great at running many closely coupled but independent threads in parallel
  – The programming model is SPMD: single program, multiple data
• GPU computing boils down to:
  – Define a computation domain that generates many parallel threads
    • This is the data structure
  – Iterate in parallel over that computation domain, running a program over all threads
    • This is the algorithm

Outline
• Data Structures
  – GPU Memory Model
  – Taxonomy
• Algorithmic Building Blocks
  – Sample Application
  – Map
  – Gather & Scatter
  – Reductions
  – Scan (parallel prefix)
  – Sort, search, …

Memory Hierarchy
• CPU and GPU memory hierarchy:
  – Disk
  – CPU Main Memory
  – CPU Caches
  – CPU Registers
  – GPU Main (Video) Memory
  – GPU Caches (Texture, Shared)
  – GPU Constant Registers
  – GPU Temporary Registers

Big Picture: GPU Memory Model
• GPUs are a mix of:
  – Historical, fixed-function capabilities
  – Newer, flexible, programmable capabilities
• Fixed-function:
  – Known access patterns and behaviors
  – Accelerated by special-purpose hardware
• Programmable:
  – Unknown access patterns
  – Generality is good
• The memory model must account for both
  – Consequence: ample special-purpose functionality
  – Consequence: restricting flexibility may improve performance
    » Take advantage of special-purpose functionality!

Taxonomy of Data Structures
• Dense arrays
  – Address translations: n-dimension to m-dimension
• Sparse arrays
  – Static sparse arrays (e.g. sparse matrix)
  – Dynamic sparse arrays (e.g. page tables)
• Adaptive data structures
  – Static adaptive data structures (e.g. k-d trees)
  – Dynamic adaptive data structures (e.g. adaptive multiresolution grids)
• Non-indexable structures (e.g. stack)
• Details? Lefohn et al.'s Glift (TOG Jan. '06), GPGPU survey

GPU Memory Model
• More restricted memory access than the CPU
  – Allocate/free memory only before computation
  – Transfers to and from the CPU are explicit
  – The GPU is controlled by the CPU; it can't initiate transfers, access disk, etc.
• To generalize …
  – GPUs are better at accessing data structures
  – CPUs are better at building data structures
  – As CPU-GPU bandwidth improves, consider doing data structure tasks on their "natural" processor

GPU Memory Model
• Limited memory access during computation (kernel)
  – Registers (per fragment/thread)
    • Read/write
  – Local memory (shared among threads)
    • Does not exist in general
    • CUDA allows access to shared memory between threads
  – Global memory (historical)
    • Read-only during computation
    • Write-only at end of computation (precomputed address)
  – Global memory (new)
    • Allows general scatter/gather (read/write)
      – No collision rules!
      – Exposed in AMD R520+ GPUs, NVIDIA G80+ GPUs
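The following is a minimal CUDA sketch, not part of the original slides, illustrating the memory spaces just listed: per-thread registers, CUDA shared memory between the threads of a block, and global memory with general gather and scatter. The kernel name, array names, and launch configuration are hypothetical.

#include <cuda_runtime.h>

// Hypothetical kernel: launch with 256 threads per block to match the tile size.
__global__ void gather_scatter_demo(const float *in,   // global memory: gather source
                                    float *out,        // global memory: scatter target
                                    const int *map,    // computed write addresses (a permutation)
                                    int n)
{
    __shared__ float tile[256];                        // shared among the threads of one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // thread index, held in a register

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;        // gather from global memory into shared
    __syncthreads();                                   // after this, threads in the block can
                                                       // read each other's values through 'tile'
    if (i < n)
        out[map[i]] = tile[threadIdx.x];               // scatter: write to a computed address;
                                                       // no collision rules, so 'map' must not
                                                       // map two elements to the same slot
}

Because global memory imposes no collision rules, correctness of the scatter depends entirely on the caller ensuring that 'map' holds distinct addresses.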
Data-Parallel Algorithms
• The GPU is a data-parallel processor
  – Data-parallel kernels of applications can be accelerated on the GPU
• Efficient algorithms require efficient building blocks
• This talk: data-parallel building blocks
  – Map
  – Gather & Scatter
  – Reduce
  – Scan
• Next talk (Naga Govindaraju): sorting and searching

Sample Motivating Application
• How bumpy is a surface that we represent as a grid of samples?
• Algorithm:
  – Loop over all elements
  – At each element, compare the value of that element to the average of its neighbors ("difference"). Square that difference.
  – Now sum up all those differences.
    • But we don't want to sum all the diffs that are 0.
    • So only sum up the non-zero differences.
  – This is a fake application, so don't take it too seriously.
• (Picture courtesy http://www.artifice.com)

Sample Motivating Application
for all samples:
  neighbors[x,y] = 0.25 * ( value[x-1,y] + value[x+1,y] + value[x,y+1] + value[x,y-1] )
  diff = (value[x,y] - neighbors[x,y])^2
result = 0
for all samples where diff != 0:
  result += diff
return result

The Map Operation
• Given:
  – Array or stream of data elements A
  – Function f(x)
• map(A, f) applies f(x) to all a_i ∈ A
• GPU implementation is straightforward
  – Graphics: A is a texture, the a_i are texels
  – General-purpose: A is a computation domain, the a_i are elements in that domain
  – A pixel shader/kernel implements f(x) and reads a_i as x
  – (Graphics): draw a quad with as many pixels as texels in A with the f(x) pixel shader active
  – Output is stored in another texture/buffer

Scatter vs. Gather
• Gather: p = a[i]
  – Vertex or fragment programs
• Scatter: a[i] = p
  – Historically vertex programs only, but exposed in fragment programs in the latest GPUs

Parallel Reductions
• Given:
  – Binary associative operator ⊕ with identity I
  – Ordered set s = [a_0, a_1, …, a_(n-1)] of n elements
• reduce(⊕, s) returns a_0 ⊕ a_1 ⊕ … ⊕ a_(n-1)
• Example: reduce(+, [3 1 7 0 4 1 6 3]) = 25
• Reductions are common in parallel algorithms
  – Common reduction operators are +, ×, min, and max
  – Note that floating point is only pseudo-associative

Parallel Reductions on the GPU
• 1D parallel reduction:
  – Add the two halves of the texture together, repeatedly …
  – … until we're left with a single row of texels
  – N -> N/2 -> N/4 -> … -> 1
  – O(log₂ N) steps, O(N) work

Multiple 1D Parallel Reductions
• Can run many reductions in parallel
  – Use a 2D texture and reduce one dimension
  – MxN -> MxN/2 -> MxN/4 -> … -> Mx1
  – O(log₂ N) steps, O(MN) work

2D Reductions
• Like a 1D reduction, only reduce in both directions simultaneously
  – Note: can add more than 2x2 elements per pixel
    • Trade per-pixel work for step complexity
    • Best performance depends on the specific GPU (cache, etc.)
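Below is a minimal CUDA sketch, not from the slides, of the 1D reduction just described: each pass folds the upper half of the array onto the lower half, halving the active size until one element remains, in O(log₂ N) passes and O(N) work. Kernel and variable names are hypothetical, and n is assumed to be a power of two.

#include <cuda_runtime.h>

__global__ void add_halves(float *data, int half)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < half)
        data[i] += data[i + half];     // fold the upper half onto the lower half
}

// Host-side driver: after the loop, data[0] holds the sum of all n elements.
void reduce_sum(float *d_data, int n)  // d_data is a device pointer; n is a power of two
{
    for (int half = n / 2; half >= 1; half /= 2) {
        int threads = 256;
        int blocks = (half + threads - 1) / threads;
        add_halves<<<blocks, threads>>>(d_data, half);
    }
}

Because all the kernel launches go to the same stream, each pass sees the previous pass's results before it runs.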
Stream Compaction
• Input: a stream of 1s and 0s: [1 0 1 1 0 0 1 0]
• Operation: "sum up all elements before you"
• Output: scatter addresses for the "1" elements: [0 1 1 2 3 3 3 4]
• Note that the scatter addresses for the "1" elements are packed!

Parallel Scan (aka Prefix Sum)
• Given:
  – Binary associative operator ⊕ with identity I
  – Ordered set s = [a_0, a_1, …, a_(n-1)] of n elements
• scan(⊕, s) returns [I, a_0, (a_0 ⊕ a_1), …, (a_0 ⊕ a_1 ⊕ … ⊕ a_(n-2))]
• Example: scan(+, [3 1 7 0 4 1 6 3]) = [0 3 4 11 11 15 16 22]
• (From Blelloch, 1990, "Prefix Sums and Their Applications")

Applications of Scan
• Radix sort
• Quicksort
• String comparison
• Lexical analysis
• Stream compaction
• Sparse matrices
• Polynomial evaluation
• Solving recurrences
• Tree operations
• Histograms

A Naive Parallel Scan Algorithm
• log(n) iterations; for i from 0 to log₂(n) - 1 (stride = 2^i):
  – Specify the domain from 2^i to n. Element k computes v_out = v[k] + v[k - 2^i].
  – Due to ping-pong, specify a 2nd domain from 2^(i-1) to 2^i with a simple pass-through shader v_out = v_in.
• Note: with a graphics API we can't read and write the same texture, so we must "ping-pong" between two buffers. Not necessary with newer APIs.
• Trace for the example input (strides 1, 2, 4):
  – T0 (input):      3  1  7  0  4  1  6  3
  – Stride 1 -> T1:  3  4  8  7  4  5  7  9
  – Stride 2 -> T0:  3  4 11 11 12 12 11 14
  – Stride 4 -> Out: 3  4 11 11 15 16 22 25
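The following is a minimal CUDA sketch, not from the slides, of the naive scan traced above. It ping-pongs between two buffers as the graphics-API version must (in CUDA the ping-pong is simply a convenient way to avoid read/write hazards), and it produces the inclusive scan shown in the trace. The kernel and buffer names are hypothetical.

#include <cuda_runtime.h>
#include <utility>   // std::swap

__global__ void scan_step(const float *in, float *out, int stride, int n)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    if (k >= stride)
        out[k] = in[k] + in[k - stride];   // element k adds the value 'stride' positions away
    else
        out[k] = in[k];                    // pass-through for elements below the stride
}

// Host-side driver: log2(n) passes, O(n log n) additions in total.
// Returns the buffer that ends up holding the inclusive scan.
float *naive_scan(float *d_a, float *d_b, int n)   // two device buffers of n floats; input in d_a
{
    for (int stride = 1; stride < n; stride *= 2) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scan_step<<<blocks, threads>>>(d_a, d_b, stride, n);
        std::swap(d_a, d_b);               // ping-pong: the newest result is now in d_a
    }
    return d_a;
}

As the next slide notes, this version is step-efficient but not work-efficient; the balanced-tree scan below does O(n) work instead.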
A Naive Parallel Scan Algorithm
• Algorithm given in more detail by Horn ['05]
• Step-efficient, but not work-efficient
  – O(log n) steps, but O(n log n) adds
  – The sequential version is O(n)
  – A factor of log(n) hurts: 20x for 10^6 elements!
• Dig into the parallel algorithms literature for a better solution
  – See Blelloch 1990, "Prefix Sums and Their Applications"

Balanced Trees
• A common parallel algorithms pattern
  – Build a balanced binary tree on the input data and sweep it to and from the root
  – The tree is conceptual, not an actual data structure
• For scan:
  – Traverse down from leaves to root, building partial sums at internal nodes of the tree
    • The root holds the sum of all leaves
  – Traverse back up the tree, building the scan from the partial sums
  – 2 log n passes, O(n) work

Balanced Tree Scan
• "Balanced tree": O(n) computation, O(n) space
• First build sums in place up the tree: "reduce"
• Traverse back down and use the partial sums to generate the scan: "downsweep"
• See Mark Harris's article in GPU Gems 3
• (Figure courtesy Shubho Sengupta)

Scan with Scatter
• Scatter in pixel shaders makes algorithms that use scan easier to implement
  – NVIDIA CUDA and AMD CTM enable this
  – The NVIDIA CUDA SDK includes an example implementation of the scan primitive
  – http://www.gpgpu.org/scan-gpugems3
• The CUDA Parallel Data Cache improves efficiency
  – All steps executed in a single kernel
  – Threads communicate through shared memory
  – Drastically reduces the bandwidth bottleneck!

Parallel Sorting
• Given an unordered list of elements, produce a list ordered by key value
  – Kernel: compare and swap
• The GPU's constrained programming environment limits viable algorithms
  – Bitonic merge sort [Batcher 68]
  – Periodic balanced sorting networks [Dowd 89]
• Recent research results are impressive
  – Naga Govindaraju will cover the algorithms in detail

Binary Search
• Find a specific element in an ordered list
• Implement just like the CPU algorithm
  – Finds the first element of a given value v
  – If v does not exist, find the next smallest element > v
• The search is sequential, but many searches can be executed in parallel
  – The number of pixels drawn determines the number of searches executed in parallel
    • 1 pixel == 1 search
• For details see:
  – "A Toolkit for Computation on GPUs". Ian Buck and Tim Purcell. In GPU Gems, Randy Fernando, ed. 2004.

Summary
• Think parallel!
  – Effective programming on GPUs requires mapping computation and data structures into parallel formulations
• Future thoughts
  – Definite need for:
    • Efficient implementations of parallel primitives
    • Research on what is the right set of primitives
  – How will high-bandwidth connections between the CPU and GPU change algorithm and data structure design?

References
• "A Toolkit for Computation on GPUs". Ian Buck and Tim Purcell. In GPU Gems, Randy Fernando, ed. 2004.
• "Parallel Prefix Sum (Scan) with CUDA". Mark Harris, Shubhabrata Sengupta, John D. Owens. In GPU Gems 3, Hubert Nguyen, ed. Aug. 2007.
• "Stream Reduction Operations for GPGPU Applications". Daniel Horn. In GPU Gems 2, Matt Pharr, ed. 2005.
• "Glift: Generic, Efficient, Random-Access GPU Data Structures". Aaron Lefohn et al. ACM TOG, Jan. 2006.