High performance computing on the GPU: NVIDIA G80 and CUDA
Won-Ki Jeong, Ross Whitaker
SCI Institute, University of Utah

GPGPU
• General-purpose computation on the GPU
– Started in the computer graphics community
– Maps computational problems onto the graphics rendering pipeline
Courtesy Jens Krueger and Aaron Lefohn

Why GPU for Computing?
• The GPU is fast
– Massively parallel
• CPU: ~4 cores @ 3.0 GHz (Intel Quad Core)
• GPU: ~128 processors @ 1.35 GHz (NVIDIA GeForce 8800 GTX)
– High memory bandwidth
• CPU: 21 GB/s
• GPU: 86 GB/s
– Simple architecture optimized for compute-intensive tasks
• Programmable
– Shaders, NVIDIA CUDA, ATI CTM
• High-precision floating-point support
– 32-bit IEEE 754 floating point
– 64-bit floating point will be available in early 2008

Why GPU for Computing? (cont.)
• Inexpensive supercomputing
– Two NVIDIA Tesla D870: 1 TFLOPS
• GPU hardware performance increases faster than CPU performance
– Trend: a simple, scalable architecture; interaction of clock speed, cache, and memory (bandwidth)
[Figure: GFLOPS over time for successive GPU generations. Legend: G80GL = Quadro FX 5600, G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800. Courtesy NVIDIA]

GPU is for Parallel Computing
• CPU
– Large caches and sophisticated flow control minimize latency for arbitrary memory accesses in serial processing
• GPU
– Simple flow control and limited caches; more transistors devoted to parallel computation
– High arithmetic intensity hides memory latency
[Figure: CPU vs. GPU die layout — the CPU spends its transistors on control logic and cache, the GPU on ALUs. Courtesy NVIDIA]

GPU-friendly Problems
• High arithmetic intensity
– Computation must offset memory latency
• Coherent data access (e.g., structured grids)
– Maximizes memory bandwidth
• Data-parallel processing
– The same computation over large datasets (SIMD)
• E.g., convolution with a fixed kernel, PDEs
• Jacobi updates (read and write data streams are kept separate)

Traditional GPGPU Model
• GPU as a streaming processor (SIMD)
– Memory: textures
– Computation kernel: vertex/fragment shaders
– Programming: graphics APIs (OpenGL, DirectX), Cg, HLSL
• Example
– Render a screen-sized quad with texture mapping using a fragment shader
[Figure: graphics pipeline — Texture → Vertex Processor → Rasterizer → Fragment Processor → Framebuffer]

Problems of the Traditional GPGPU Model
• Software limitations
– High learning curve
– Graphics API overhead
– Inconsistency across APIs
– Debugging is difficult
• Hardware limitations
– No general memory access (no scatter)
• B = A[i] : gather (supported)
• A[i] = B : scatter (not supported)
– No integer/bitwise operations
– Memory access can be a bottleneck
• Coherent memory access is needed for cache performance

NVIDIA G80 and CUDA
• A new HW/SW architecture for computing on the GPU
– The GPU as a massively parallel multithreaded machine: one step beyond the streaming model
– New hardware features
• Unified shaders (ALUs)
• Flexible memory access
• Fast, user-controllable on-chip memory
• Integer and bitwise operations
– New software features
• An extended C programming language and compiler
• Debugging support (through emulation)

GPU: Highly Parallel Coprocessor
• The GPU is a coprocessor that
– Has its own DRAM memory
– Communicates with the host (CPU) through a bus (PCI Express)
– Runs many threads in parallel
• GPU threads
– Extremely lightweight (almost no cost for creation or context switch)
– The GPU needs at least several thousand threads for full efficiency

Programming Model: SPMD + SIMD
• Hierarchy
– Device = grids
– Grid = blocks
– Block = warps
– Warp = threads
• A single kernel runs on multiple blocks (SPMD)
• A single instruction is executed on multiple threads (SIMD)
– Warp size determines SIMD granularity (G80: 32 threads)
• Synchronization within a block using shared memory
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks, and each block a 2D array of threads. Courtesy NVIDIA]
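To make the hierarchy concrete, the following is a minimal sketch (not taken from the slides) of how block and thread coordinates combine into a global 2D index: each thread updates one element of a width x height array. The kernel name, the scale factor, and the 16x16 launch shape are illustrative assumptions.

    // Each thread scales one element of a width x height array.
    __global__ void scaleKernel(float* data, float s, int width, int height)
    {
        // Global 2D index assembled from block and thread coordinates.
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)   // guard partial blocks at the array edges
            data[y * width + x] *= s;
    }

    // Host side: cover the array with 16x16-thread blocks.
    // dim3 block(16, 16);
    // dim3 grid((width + 15) / 16, (height + 15) / 16);
    // scaleKernel<<< grid, block >>>(d_data, 2.0f, width, height);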
Hardware Implementation: A Set of SIMD Processors
• Device = a set of multiprocessors
• Multiprocessor = a set of 32-bit SIMD processors
[Figure: a device contains multiprocessors 1..N; each multiprocessor contains processors 1..M driven by one instruction unit. Courtesy NVIDIA]

Memory Model
• Each thread can:
– Read/write per-thread registers
– Read/write per-thread local memory
– Read/write per-block shared memory
– Read/write per-grid global memory
– Read only per-grid constant memory
– Read only per-grid texture memory
• The host can read/write global, constant, and texture memory
[Figure: memory hierarchy — registers and local memory per thread, shared memory per block, global/constant/texture memory per grid. Courtesy NVIDIA]

Hardware Implementation: Memory Architecture
• Device memory (DRAM)
– Slow (200~300 cycles)
– Holds local, global, constant, and texture memory
• On-chip memory
– Fast (1 cycle)
– Registers, shared memory, constant/texture caches
[Figure: each multiprocessor holds registers, shared memory, and constant/texture caches on chip, backed by device memory. Courtesy NVIDIA]

Memory Access Strategy
1. Copy data from global to shared memory
2. Synchronize
3. Compute (iterate)
4. Synchronize
5. Copy data from shared back to global memory

Execution Model
• Each thread block is executed by a single multiprocessor
– Synchronized using shared memory
• Many thread blocks are assigned to a single multiprocessor
– Executed concurrently in a time-sharing fashion
– Keeps the GPU as busy as possible
• Running many threads in parallel hides DRAM memory latency
– Global memory access: 200~300 cycles

CUDA
• A C-extension programming language
– No graphics API
• Flattens the learning curve
• Better performance
– Supports debugging tools
• Extensions / API
– Function type qualifiers: __global__, __device__, __host__
– Variable type qualifiers: __shared__, __constant__
– cudaMalloc(), cudaFree(), cudaMemcpy(), …
– __syncthreads(), atomicAdd(), …
• Program types
– Device program (kernel): runs on the GPU
– Host program: runs on the CPU and launches device programs

Example: Vector Addition Kernel

    // Pair-wise addition of vector elements
    // One thread per addition
    __global__ void vectorAdd(float* iA, float* iB, float* oC)
    {
        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        oC[idx] = iA[idx] + iB[idx];
    }

Courtesy NVIDIA

Example: Vector Addition Host Code

    float* h_A = (float*) malloc(N * sizeof(float));
    float* h_B = (float*) malloc(N * sizeof(float));
    // ... initialize h_A and h_B

    // allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc( (void**) &d_A, N * sizeof(float) );
    cudaMalloc( (void**) &d_B, N * sizeof(float) );
    cudaMalloc( (void**) &d_C, N * sizeof(float) );

    // copy host memory to device
    cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

    // execute the kernel on N/256 blocks of 256 threads each
    vectorAdd<<< N/256, 256 >>>( d_A, d_B, d_C );

Courtesy NVIDIA
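The host code above stages the inputs and launches the kernel but never touches shared memory. The following is a minimal sketch (not from the slides) of the global → shared → global pattern from the Memory Access Strategy slide, applied to a 3-point running average; BLOCK, the kernel name, and the assumption that n is a multiple of BLOCK are illustrative.

    #define BLOCK 256

    // Assumes n is a multiple of BLOCK, so every thread is in range.
    __global__ void smoothKernel(const float* in, float* out, int n)
    {
        // Shared tile with one halo element on each side.
        __shared__ float tile[BLOCK + 2];
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int lid = threadIdx.x + 1;

        // 1. Copy data from global to shared memory (halo clamped at array ends).
        tile[lid] = in[gid];
        if (threadIdx.x == 0)
            tile[0] = (gid == 0) ? in[0] : in[gid - 1];
        if (threadIdx.x == blockDim.x - 1)
            tile[lid + 1] = (gid == n - 1) ? in[n - 1] : in[gid + 1];

        // 2. Synchronize: every load must finish before any thread reads a neighbor.
        __syncthreads();

        // 3. Compute on the fast on-chip copy, then
        // 4./5. write the result back to global memory.
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
    }

    // Launch, mirroring the vector addition example:
    // smoothKernel<<< n / BLOCK, BLOCK >>>( d_in, d_out, n );

The __syncthreads() barrier is what the "Synchronize" steps in the Memory Access Strategy slide refer to: it keeps threads from reading a neighbor's tile entry before that entry has been loaded.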
Compiling CUDA
• nvcc
– Compiler driver
– Invokes cudacc, g++, cl
• PTX (Parallel Thread eXecution)
– Intermediate GPU code, e.g.:

    ld.global.v4.f32  {$f1,$f3,$f5,$f7}, [$r9+0];
    mad.f32           $f1, $f5, $f3, $f1;

[Figure: compilation flow — a C/C++ CUDA application goes through NVCC, which emits CPU code and PTX code; a PTX-to-target compiler then produces target code for G80 and other GPUs. Courtesy NVIDIA]

Debugging
• Emulation mode
– CUDA code can be compiled and run in emulation mode (nvcc -deviceemu)
– No device or driver is needed
– Each device thread is emulated with a host thread
– Host functions can be called from device code (e.g., printf)
– Host debugging features are supported (breakpoints, inspection, etc.)
• Hardware debugging will be available in late 2007

Optimization Tips
• Avoid shared memory bank conflicts
– The shared memory space is split into 16 banks
• Each bank is 4 bytes (32 bits) wide
• Addresses are assigned to banks in round-robin fashion
– Any non-overlapping parallel bank access completes in a single memory operation
• Coalesce global memory accesses
– Contiguous memory addresses are fast
• a = b[thread_id]; // coalesced
• a = b[2*thread_id]; // non-coalesced

CUDA-Enabled GPUs / OS
• Supported OS
– MS Windows, Linux
• Supported HW
– NVIDIA GeForce 8800 series
– NVIDIA Quadro FX 5600/4600
– NVIDIA Tesla series
Courtesy NVIDIA

ATI CTM (Close To Metal)
• Similar to CUDA PTX
– A set of native device instructions
• No compiler support
• Limited programming environment

Example: Fast Iterative Method
• CUDA implementation
– Tile size: 4x4x4
– Update active tiles
• Neighbor access
– Manage the active list
• Parallel reduction

Coalesced Global Memory Access
• Reordering
– Each tile is stored in global memory in a contiguous memory region
[Figure: non-coalesced vs. coalesced tile layouts in global memory]

Update Active Tile
• Compute the new solution
– Copy a tile and its neighbor pixels (left, right, top, bottom) to shared memory
– Avoid bank conflicts

Manage Active List
• Active list
– A simple 1D integer array of active tile indices
– Need to know which tiles are NOT converged
– E.g., for 16 tiles the active points yield the active tile list {1,2,5,7,8,9,11,13,14}

Parallel Reduction of Convergence in Tiles
• Reduce each tile's per-point flags (T = converged, F = not converged) to a single per-tile flag (see the sketch after the wrap-up)
[Figure: a tile in 1D view with per-point T/F convergence flags]

Wrap-up
• GPU computing is promising
– Many scientific computing problems are parallelizable
– More consistency/stability in HW/SW
• Streaming architectures are here to stay (and increasingly so)
• The industry trend is toward multi-/many-core processors
– Better support/tools (easier to learn and maintain)
• Issues
– No industry-wide standard
– The market is driven by the gaming industry
– Not every problem is suitable for GPUs
– Algorithms/software must be re-engineered
– Future performance growth?
• Impact on the data-analysis/interpretation workflow

Questions?
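As a concrete reading of the parallel reduction slide, here is a minimal sketch (an assumption, not the authors' implementation) that ANDs the per-point convergence flags of one 4x4x4 tile (64 points, one thread block per tile) down to a single per-tile flag; the names and the flag encoding (1 = converged, 0 = not converged) are illustrative.

    #define TILE_THREADS 64   // one thread per point of a 4x4x4 tile

    __global__ void reduceConvergence(const int* converged, int* tileConverged)
    {
        __shared__ int flags[TILE_THREADS];
        int tid = threadIdx.x;

        // Load this thread's convergence flag into shared memory.
        flags[tid] = converged[blockIdx.x * blockDim.x + tid];
        __syncthreads();

        // Tree reduction: logical AND of all flags in the tile.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                flags[tid] = flags[tid] && flags[tid + stride];
            __syncthreads();
        }

        // Thread 0 writes one flag per tile.
        if (tid == 0)
            tileConverged[blockIdx.x] = flags[0];
    }

A tile whose reduced flag is 0 still contains unconverged points and therefore stays on the active list for the next iteration.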