Download A Survey of General-Purpose Computation on Graphics Hardware

GPGPU CS 446: Real-Time Rendering & Game Technology David Luebke University of Virginia Demo • Today: Matthew Rodgers • That’s it! 2 David Luebke News • New job: NVIDIA Research • New e-mail: [email protected] 3 David Luebke Final Topic: GPGPU • GPGPU: “General Purpose GPU Computing” • Active, exciting area of research and development • A personal interest of mine • Following slides taken from a recent talk I co-presented in Dublin – Accompanying this paper 4 David Luebke A Survey of General-Purpose Computation on Graphics Hardware John Owens David Luebke University of California, Davis University of Virginia with Naga Govindaraju, Mark Harris, Jens Krüger, Aaron Lefohn, Tim Purcell Introduction • The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor • Programmability • Precision • Power • GPGPU: an emerging field seeking to harness GPUs for general-purpose computation 6 Motivation: Computational Power • GPUs are fast… • 3.0 GHz dual-core Pentium4: 24.6 GFLOPS • NVIDIA GeForceFX 7800: 165 GFLOPs • 1066 MHz FSB Pentium Extreme Edition : 8.5 GB/s • ATI Radeon X850 XT Platinum Edition: 37.8 GB/s • GPUs are getting faster, faster • CPUs: 1.4× annual growth • GPUs: 1.7×(pixels) to 2.3× (vertices) annual growth 7 Motivation: Computational Power Thanks to Ian Buck 8 Motivation: Computational Power Thanks to John Owens 9 Motivation: Flexible and Precise • Modern GPUs are deeply programmable • Programmable pixel, vertex, video engines • Solidifying high-level language support • Modern GPUs support high precision • 32 bit floating point throughout the pipeline • High enough for many (not all) applications 10 Motivation: The Potential of GPGPU • In short: • The power and flexibility of GPUs makes them an attractive platform for general-purpose computation • Example applications range from in-game physics simulation to conventional computational science • Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor 11 Problems: Difficult To Use • GPUs designed for & driven by video games • Programming model unusual • Programming idioms tied to computer graphics • Programming environment tightly constrained • Underlying architectures are: • Inherently parallel • Rapidly evolving (even in basic feature set!) • Largely secret • Can’t simply “port” CPU code! 12 Problems: Not A Panacea • GPUs are fast because they are specialized • Poorly suited to sequential, “pointer-chasing” code • Missing support for some basic functionality • E.g. integers, bitwise operations, indexed write • More on limitations & difficulties later 13 STAR Goals • Detailed & useful survey of general-purpose computing on graphics hardware • Hardware and software developments behind GPGPU • Building blocks: techniques for mapping generalpurpose computation to the GPU • Applications: important applications of GPGPU • A comprehensive GPGPU bibliography (current through Summer 2005…) 14 Architectural Considerations The past: 1987 20 MIPS CPU 1987 [courtesy Anant Agarwal] 16 The future: 2007 1 Billion Transistors 2007 [courtesy Anant Agarwal] 17 Today’s VLSI Capability: Keys to High Performance 0.5mm 64-bit FPU (to scale) 50pJ/FLOP 90nm Chip $200 1GHz 1 clock 12mm [courtesy Pat Hanrahan] 18 For High Performance, We Must … 0.5mm 64-bit FPU (to scale) 1. Exploit 50pJ/FLOP Ample Computation! 90nm Chip $200 1GHz 1 clock 12mm [courtesy Pat Hanrahan] 19 For High Performance, We Must … 0.5mm 64-bit FPU (to scale) 1. Exploit 50pJ/FLOP Ample Computation! 90nm Chip $200 1GHz 1 clock 2. Require Efficient Communication! 12mm [courtesy Pat Hanrahan] 20 Stream Programming Abstraction • Let’s think about our problem in a new way • Goal: SW programming model that matches today’s VLSI • Streams • Collection of data records • All data is expressed in streams • Kernels stream kernel stream • Inputs/outputs are streams • Perform computation on streams • Can be chained together 21 Why Streams? • Ample computation by exposing parallelism • Streams expose data parallelism • Multiple stream elements can be processed in parallel • Pipeline (task) parallelism • Multiple tasks can be processed in parallel • Kernels yield high arithmetic intensity • Efficient communication • Producer-consumer locality • Predictable memory access pattern • Optimize for throughput of all elements, not latency of one • Processing many elements at once allows latency hiding 22 GPU: Special-Purpose Graphics Hardware • Task-parallel organization • Assign each task to processing unit • Hardwire each unit to specific task - huge performance advantage! • Provides ample computation resources • Efficient communication patterns • Dominant graphics architecture [ATI Flipper – 51M T] 23 The Rendering Pipeline Application Compute 3D geometry Make calls to graphics API Transform geometry from 3D to 2D (in parallel) Geometry Rasterization Generate fragments from 2D geometry (in parallel) Combine fragments into image Composite GPU 24 The Programmable Rendering Pipeline Application Geometry Compute 3D geometry Make calls to graphics API Transform geometry from 3D to 2D; vertex programs (Vertex) Rasterization (Fragment) Generate fragments from 2D geometry; fragment programs Composite Combine fragments into image GPU 25 NVIDIA GeForce 6800 3D Pipeline Vertex Triangle Setup Z-Cull Shader Instruction Dispatch Fragment L2 Tex Fragment Crossbar Memory Partition Memory Partition Composite Memory Partition Memory Partition Courtesy Nick Triantos, NVIDIA 26 Characteristics of Graphics Apps • • • • • • Lots of arithmetic Lots of parallelism Simple control Multiple stages Feed forward pipelines Latency-tolerant / deep pipelines • What other applications have these characteristics? Application Command Geometry Rasterization Fragment Composite Display 27 Today’s Graphics Pipeline Application Command Geometry Rasterization Fragment Composite Display • Graphics is well suited to: • The stream programming model • Stream hardware organizations • GPUs are a commodity stream processor! • What if we could apply these techniques to more generalpurpose problems? • GPUs should excel at tasks that require: • Ample computation • Regular computation • Efficient communication 28 Programming a GPU for Graphics • Application specifies geometry  rasterized • Each fragment is shaded w/ SIMD program • Shading can use values from texture memory • Image can be used as texture on future passes 29 Programming a GPU for GP Programs • Draw a screen-sized quad  stream • Run a SIMD kernel over each fragment • “Gather” is permitted from texture memory • Resulting buffer can be treated as texture on next pass 30 Feedback • Each algorithm step depend on the results of previous steps • Each time step depends on the results of the previous time step 31 CPU-GPU Analogies . CPU . . Grid[i][j]= x; . . . Array Write GPU = Render to Texture 32 CPU-GPU Analogies CPU Stream / Data Array = Memory Read = GPU Texture Texture Sample 33 Kernels CPU GPU Kernel / loop body / algorithm step = Fragment Program 34 Scatter vs. Gather • Grid communication • Grid cells share information 35 Computational Resources Inventory • Programmable parallel processors • Vertex & Fragment pipelines • Rasterizer • Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants • Texture unit • Read-only memory interface • Render to texture • Write-only memory interface 36 Vertex Processor • Fully programmable (SIMD / MIMD) • Processes 4-vectors (RGBA / XYZW) • Capable of scatter but not gather • Can change the location of current vertex • Cannot read info from other vertices • Can only read a small constant memory • Latest GPUs: Vertex Texture Fetch • Random access memory for vertices • Gather (But not from the vertex stream itself) 37 Fragment Processor • • • • Fully programmable (SIMD) Processes 4-component vectors (RGBA / XYZW) Random access memory read (textures) Capable of gather but not scatter • RAM read (texture fetch), but no RAM write • Output address fixed to a specific pixel • Paper details ways to synthesize scatter • Typically more useful than vertex processor • More fragment pipelines than vertex pipelines • Direct output (fragment processor is at end of pipeline) 38 Building Blocks & Applications GPGPU Building Blocks • The STAR covers the following fundamental techniques & computational building blocks: • Flow control (a very fundamental building block) • Stream operations • Data structures • Differential equations & linear algebra • Data queries • I’ll discuss each in a bit more detail 40 Flow control • Surprising number of issues on GPUs • Main themes: • Avoid branching when possible • Move branching earlier in the pipeline when possible • Largely SIMD  coherent branching most efficient • Mechanisms: • Rasterized geometry • Z-cull • Occlusion query 41 Domain Decomposition • Avoid branches where outcome is fixed • One region is always true, another false • Separate FPs for each region, no branches • Example: boundaries 42 Z-Cull • In early pass, modify depth buffer • Write depth=0 for pixels that should not be modified by later passes • Write depth=1 for rest • Subsequent passes • Enable depth test (GL_LESS) • Draw full-screen quad at z=0.5 • Only pixels with previous depth=1 will be processed • Can also use early stencil test • Note: Depth replace disables ZCull 43 Pre-computation • Pre-compute anything that will not change every iteration! • Example: arbitrary boundaries • When user draws boundaries, compute texture containing boundary info for cells • Reuse that texture until boundaries modified • Combine with Z-cull for higher performance! 44 Stream Operations • Several stream operations in GPGPU toolkit: • Map: apply a function to every element in a stream • Reduce: use a function to reduce a stream to a smaller stream (often 1 element) • Scatter/gather: indirect read and write • Filter: select a subset of elements in a stream • Sort: order elements in a stream • Search: find a given element, nearest neighbors, etc 45 Data Structures • GPU memory model, iteration, virtualization • Dense arrays (== textures) • Sparse arrays • Static sparsity: pack rows or bands into textures, vertex arrays • Dynamic sparsity: multidimensional page tables 46 Data Structures • Adaptive data structures • Static: packed uniform grid, stackless k-d tree • Dynamic: mipmap of page tables, tree-based address translators • Non-indexable structures • Open & important problem: stacks, priority queues, sets, linked lists, hash tables, k-d tree construction… 47 Differential Equations • Early & common application of GPGPU • Ordinary differential equations (ODEs): commonly used in particle systems • Partial differential equations (PDEs): well suited to GPU, especially when solved on dense grids • Closely related to… 48 Linear Algebra • Major application of GPGPU • Ubiquitous in science, engineering, visual simulation • Several approaches explored in early GPGPU work • Memory layout a key consideration • Pack vectors into 2D textures • Split matrices, e.g. by columns (dense), bands (banded), into vertex array (sparse) 49 Data Queries • Relational database operations on the GPU • Predicates, Boolean combinations, aggregations • Join operations accelerated by sorting on join key • Uses special-purpose depth & stencil hardware extensively • Attributes of data record stored in (multiple) color channels of (multiple) textures 50 Applications • STAR discusses the following broad GPGPU application areas: • Physically-based simulation • Signal & image processing • Global illumination • Geometric computing • Databases & data mining 51 Physically-Based Simulation • Early work (pre-fully programmable GPU): • Cellular automata, texture & blending modes • Now • Finite difference & finite element methods • Particle system simulations • Most popular topic • Fluids! Incompressible flow, clouds, boiling, etc. • Also-ran: mass-spring dynamics for cloth 52 Signal & Image Processing • Segmentation: identify features in 2D or 3D • Common example: identify surface of 3D feature (tumor, blood vessel, etc) in medical scan • Level-set deformation methods evolve an isosurface • Need sparse solution methods for efficient solution • GPGPU bonus: easy to integrate in volume renderer • Computer vision: • Image projection, compositing, rectification • Fast stereo depth extraction 53 Signal & Image Processing cont. • Image processing • Image registration, motion reconstruction, computed tomography (CT), tone mapping • Core Image, Core Video • Signal processing • FFT: an interesting case! Memory bandwidth limited, lack of writable cache harms performance 54 Global Illumination • Ray tracing • Seminal GPGPU papers • Close to the heart of graphics • Earliest complex data structures, control flow • Analysis to inform future hardware design • Comparison to current efficient CPU implementation • Key insights • Ray-triangle intersection maps well to pixel hardware • Rest of the ray tracing pipeline can also be expressed as a stream computation 55 Global Illumination cont. • Photon mapping • Even more involved data structures • Introduced stencil routing (scatter), k-nearest neighbor search • Radiosity • Iterative LA solvers • Subsurface scattering • Lots of work on using GPUs to accelerate/ approximate scattering algorithms 56 Geometric Computing • Lots of image-space geometric computations • CSG operations • Distance fields • Collision detection • Sorting for transparency • Shadow generation • Heavy use of stencil & depth hardware 57 Databases & Data Mining • GPU strengths are useful • Memory bandwidth • Parallel processing • Accelerating SQL queries – 10x improvement • Also well suited for stream mining • Continuous queries on streaming data instead of one-time queries on static database 58 Close-Up: Linear Algebra Linear Algebra Data Structures Vector representation – Textures 2D Textures RGBAbest textures are weeven can really do better rock • Per-pixel vs. per-vertex operations – 6 gigapixels/second vs. 0.7 gigavertices/second – Efficient texture memory cache – Texture read-write access 1 1 N N 60 For details go: http://wwwcg.in.tum.de Representation (cont.) Dense Matrix representation – Treat a dense matrix as a set of column vectors – Again, store these vectors as 2D textures i N Vectors Matrix N 2D-Textures N ... ... N N 1 1 i i ... ... N N 61 For details go: http://wwwcg.in.tum.de Representation (cont.) Banded Sparse Matrix representation – Treat a banded matrix as a set of diagonal vectors – Combine opposing vectors to save space i Matrix 2 Vectors 2 2D-Textures 1 N 2 N N-i N 1 2 62 For details go: http://wwwcg.in.tum.de Operations 1 • Vector-Vector Operations – Reduced to 2D texture operations – Coded in pixel shaders • Example: Vector1 + Vector2  Vector3 Vector 1 Vector 2 Vector 3 Render Texture Static quad TexUnit 0 Pass through Vertex Shader TexUnit 1 return tex0 + tex1 Pixel Shader 63 For details go: http://wwwcg.in.tum.de Operations 2 (reduce) Reduce operation for scalar products original Texture st 1 pass nd 2 pass ... ... ... ... Reduce m x n region in fragment shader 64 For details go: http://wwwcg.in.tum.de Operations In depth example: Vector / Banded-Matrix Multiplication A b x = 65 For details go: http://wwwcg.in.tum.de Example (cont.) Vector / Banded-Matrix Multiplication A A b b x = 66 For details go: http://wwwcg.in.tum.de Example (cont.) Compute the result in 2 Passes: A Pass 1 2 b b‘ x = 67 For details go: http://wwwcg.in.tum.de Conclusions Moving Forward … • What works well now? • What doesn’t work well now? • What will improve in the future? • What will continue to be difficult? 69 What Runs Well on GPUs? • GPUs win when … • Limited data reuse Memory BW Cache BW P4 3GHz 6 GB/s 44 GB/s NV GF 6800 36 GB/s -- • High arithmetic intensity: Defined as math operations per memory op • Attacks the memory wall - are all mem ops necessary? • Common error: Not comparing against optimized CPU implementation 70 Arithmetic Intensity Historical growth rates (per year): • Compute: 71% • DRAM bandwidth: 25% GFLOPS • DRAM latency: 5% 7x Gap GFloats/sec R300 R360 R420 [courtesy Ian Buck] 71 Arithmetic Intensity GeForce 7800 GTX Pentium 4 3.0 GHz GPU wins when… • Arithmetic intensity  Segment 3.7 ops per word 11 GFLOPS [courtesy Ian Buck] 72 Memory Bandwidth GPU wins when… • Streaming memory bandwidth  SAXPY  FFT GeForce 7800 GTX Pentium 4 3.0 GHz [courtesy Ian Buck] 73 Memory Bandwidth • Streaming Memory System • Optimized for sequential performance • GPU cache is limited • Optimized for texture filtering • Read-only • Small GeForce 7800 GTX Pentium 4 • Local storage • CPU >> GPU [courtesy Ian Buck] 74 What Will (Hopefully) Improve? • Orthogonality • Instruction sets • Features • Tools • Stability • Interfaces, APIs, libraries, abstractions • Necessary as graphics and GPGPU converge! 75 What Won’t Change? • Rate of progress • Precision (64b floating point?) • Parallelism • Won’t sacrifice performance • Difficulty of programming parallel hardware • … but APIs and libraries may help • Concentration on entertainment apps 76 Top Ten GPGPU Top Ten • The Killer App • Programming models and tools • GPU in tomorrow’s computer? • Data conditionals • Relationship to other parallel hw/sw • Managing rapid change • • • • in hw/sw (roadmaps) Performance evaluation and cliffs Philosophy of faults and lack of precision Broader toolbox for computation / data structures Wedding graphics and GPGPU techniques 78

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download A Survey of General-Purpose Computation on Graphics Hardware