On Using Graphics Hardware for Scientific Computing
Stan Tomov
June 23, 2006

Outline
• Motivation
• Literature review
• The graphics pipeline
• Programmable GPUs
• Some application examples
• Performance results
• Conclusion

Motivation

Table 1. GPU vs CPU in rendering polygons. The GPU (Quadro2 Pro) is approximately 30 times faster than the CPU (Pentium III, 1 GHz) in rendering polygonal data of various sizes.

                  Frames per second using
  Problem size    OpenGL (GPU)    Mesa (CPU)
  11,540          189             8.01
  47,636          52              1.71
  193,556         13              0.44
  780,308         3               0.12

• High flops count (currently 200 GFlops, single precision)
  (picture: from the GPU Gems 2 book)
• Competitive price/performance (less than 1 cent per MFlop)
• Performance doubling every 6 months
• Continuously increasing functionality and programmability
  – Realistic games require more complicated physics

Literature review
Using graphics hardware for non-graphics applications (just a few examples):
• Cellular automata
• Reaction-diffusion simulation (Mark Harris, University of North Carolina)
• Matrix multiply (E. Larsen and D. McAllister, University of North Carolina)
• Lattice Boltzmann (Wei Li, Xiaoming Wei, and Arie Kaufman, Stony Brook)
• CG and multigrid (J. Bolz et al., Caltech, and N. Goodnight et al., University of Virginia)
• Convolution (University of Stuttgart)
• BLAS 1,2; FFT; certain eigensolvers; etc.
• See also GPGPU's homepage: http://www.gpgpu.org/

Typical performance results reported (by the middle of 2003):
• Significant speedups of GPU vs CPU are reported if the GPU performs low-precision computations (30 to 60 times; depends on the configuration)
  – integers (8- or 12-bit arithmetic), 16-bit floating point
• Vendor advertisements about very high performance assume low-precision arithmetic
• NCSA, University of Illinois assembled a $50,000 supercomputer out of 70 PlayStation 2 consoles, which could theoretically deliver 0.5 trillion operations/second
• The GPU's 32-bit flops performance is comparable to the CPU's (maybe 2-4 times faster, depending on application and configuration)

The graphics pipeline
• GeForce 256 (August 1999)
  – allowed a certain degree of programmability
  – before: fixed-function pipeline
• GeForce 3 (February 2001)
  – considered the first fully programmable GPU
• GeForce 4
  – partial 16-bit floating point arithmetic
• NV30
  – 32-bit floating point
• Cg
  – high-level programming language
• GPUs are on their way to becoming programmable stream processors
  (picture: from the GPU Gems 2 book)
• Stream formulation of the graphics pipeline: all data viewed as streams and computation as kernels
• Streaming
  – Efficient computation (enables efficient parallelism; deep pipeline)
  – Efficient communication (efficient off-chip communication; intermediate results kept on chip; deep pipelining allows a high degree of latency tolerance)

Programmable GPUs (in particular NV30)
• GPU programming model: streaming
  – Naturally addresses parallelism and communication
  – Easy when the problem maps well
• Support floating point operations
• Vertex program
  – Replaces fixed-function pipeline for vertices
  – Manipulates single vertex data
  – Executes for every vertex
• Fragment program
  – Similar to vertex program but for pixels
• Programming in Cg:
  – High-level language; looks like C; portable
  – Cg programs are compiled to assembly code

Block Diagram of GeForce FX
3 vertex and 8 pixel processors
Latest NVIDIA card: dual-GPU GeForce 7950 GX2 with 32 vertex and 96 pixel processors
• AGP 8x graphics bus bandwidth: 2.1 GB/s
• Local memory bandwidth: 16 GB/s
• Chip officially clocked at 500 MHz
• Vertex processor:
  – executes vertex shaders or emulates fixed transformations and lighting (T&L)
• Pixel processor:
  – executes pixel shaders or emulates fixed shaders
  – 2 int & 1 float ops or 2 texture accesses per clock cycle
• Texture & color interpolators
  – interpolate texture coordinates and color values
• Performance (on processing 4D vectors):
  – Vertex ops/sec: 1.5 Gops
  – Pixel ops/sec: 8 Gops (int), or 4 Gops (float)
Source: Hardware at Digit-Life.com, NVIDIA GeForce FX, or "Cinema show started", November 18, 2002.
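
The peak rates quoted for the GeForce FX can be reproduced from the processor counts and the clock rate. The following is a minimal Python sketch of that arithmetic, not something taken from the talk. The function name `peak_gops` is hypothetical, the rate of one 4D-vector operation per vertex processor per clock is an assumption inferred from the 1.5 Gops figure, and counting each 4D-vector float operation as 4 flops is also an assumption.

```python
# Figures taken from the block diagram slide.
CLOCK_HZ = 500e6          # chip officially clocked at 500 MHz
VERTEX_PROCESSORS = 3
PIXEL_PROCESSORS = 8


def peak_gops(processors, ops_per_clock, clock_hz=CLOCK_HZ):
    """Peak 4D-vector operations per second, in Gops (hypothetical helper)."""
    return processors * ops_per_clock * clock_hz / 1e9


# Assumption: each vertex processor issues one 4D-vector op per clock.
vertex_gops = peak_gops(VERTEX_PROCESSORS, 1)

# From the slide: a pixel processor does 2 int ops or 1 float op per clock.
pixel_int_gops = peak_gops(PIXEL_PROCESSORS, 2)
pixel_float_gops = peak_gops(PIXEL_PROCESSORS, 1)

# Assumption: one 4D-vector float op counts as 4 scalar flops.
pixel_float_gflops = 4 * pixel_float_gops

print(vertex_gops, pixel_int_gops, pixel_float_gops, pixel_float_gflops)
```

Under these assumptions the pixel pipeline peaks at 16 GFlops, which would be consistent with the Ising result reported later in the talk (7 GFlops, 44% of theoretical maximum), since 7 / 16 is about 0.44.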
Summary of CPU vs GPU
• General vs specialized hardware
  – CPUs have more complex control hardware
  – GPUs can have hardware acceleration for specific tasks
• Sequential vs parallel programming models
  – In general, CPUs don't have the GPU's level of data parallelism (though some may be available: Intel's SSE and PowerPC's AltiVec instruction sets)
• Memory latency vs bandwidth optimization

Some application examples
• Monte Carlo simulations
  – Used in a variety of simulations in physics, finance, chemistry, etc.
  – Based on probability and statistics; use random numbers
  – A classical example: compute the area of a circle
  – Computation of expected values: $E(F) = \sum_{i=1}^{N} F(S_i) P(S_i)$
  – N can be very large: on a 1024 x 1024 lattice of particles, with every particle modeled to have k states, $N = k^{1024^2}$
  – Random number generation. We used a linear congruential type generator: $R(n) = (a \cdot R(n-1) + b) \bmod N$
• Monte Carlo simulations
  – Ising model
    • Simplified model for magnets
    • Evolve the system into "higher probability" states and compute expected values as an average over only those states
  – Percolation
    • In studies of disease spreading, flow in porous media, forest fire propagation, clustering, etc.
• Lattice Boltzmann method
  – Simulates fluid flow; particles are allowed to move and collide on a lattice

Some performance results
• saxpy on 512 x 512 (x 4) vectors: 1 GFlop
  – speed limited by GPU memory bandwidth (16 GB/s)
• sin, cos, exp, log: 20 times faster than on a Pentium 4, 2.8 GHz
  – hardware accelerated, but of low accuracy
• Ising model: 7 GFlops
  – 44% of theoretical maximum
  – On a fragment program compiled to 109 assembly instructions

Conclusions
• What to expect from future GPGPUs?
• Can GPGPUs influence future computer systems? (HPC and, consequently, our models of software development: is IBM's Cell processor already an example?)
• Current trends:
  – CPU → multi-core
  – GPU → more powerful streaming model (gather, scatter, conditional streams, reduction, etc.), more CPU functionality
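
The classical Monte Carlo example from the application slides, computing the area of a circle, can be combined with the linear congruential generator R(n) = (a * R(n-1) + b) mod N in a few lines. The following is a minimal CPU-side Python sketch, not the fragment program used in the talk. The names `lcg` and `estimate_circle_area` are hypothetical, and the constants a = 1664525, b = 1013904223 and modulus 2^32 are an assumption (a commonly used parameter set), since the slides do not give the values that were used.

```python
def lcg(seed, a=1664525, b=1013904223, m=2**32):
    """Yield floats in [0, 1) from R(n) = (a * R(n-1) + b) mod m.

    The constants are assumed; the talk does not state its parameters.
    """
    r = seed
    while True:
        r = (a * r + b) % m
        yield r / m


def estimate_circle_area(samples, seed=12345):
    """Estimate the area of the unit circle by sampling the square [-1, 1]^2."""
    rng = lcg(seed)
    inside = 0
    for _ in range(samples):
        x = 2.0 * next(rng) - 1.0
        y = 2.0 * next(rng) - 1.0
        if x * x + y * y <= 1.0:
            inside += 1
    # The square has area 4; the fraction of hits scales it to the circle.
    return 4.0 * inside / samples


if __name__ == "__main__":
    print(estimate_circle_area(100_000))
```

With 100,000 samples the estimate should land close to pi, and the statistical error shrinks roughly as one over the square root of the sample count. On a GPU each sample would typically map to an independent fragment, which is why this kind of computation fits the streaming model well.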
