Parallel Concept and Hardware Architecture
CUDA Programming Model Overview
Yukai Hung
Department of Mathematics
National Taiwan University

Parallel Concept Overview
making your program faster

Parallel Computing Goals
Solve problems in less time
- divide one problem into smaller pieces
- solve the smaller pieces concurrently
- allow bigger problems to be solved
Prepare to parallelize a problem
- represent the algorithm as a Directed Acyclic Graph
- identify the dependencies in the problem
- identify the critical paths in the algorithm
- modify dependencies to shorten the critical paths

Parallel Computing Goals
What is parallel computing?

Amdahl's Law
Speedup of a parallel program is limited by the amount of serial work
- if a fraction s of the program is serial, the speedup on N processors is at most 1/(s+(1-s)/N), which approaches 1/s no matter how many processors are added

Race Condition
Consider the following parallel program
- the threads are almost never executed at exactly the same time, so the interleaving of their instructions varies from run to run

Race Condition
Scenario 1
- the result value R is 2 if the initial value R is 1
Scenario 2
- the result value R is 2 if the initial value R is 1
Scenario 3
- the result value R is 3 if the initial value R is 1

Race Condition with Lock
Solve the race condition by locking
- manage the shared resource between threads
- avoid deadlock and load-imbalance problems

Race Condition with Lock
Guarantee that the executed instruction order is correct
- the locked section falls back to a sequential procedure
- the lock and release procedures have high overhead

Race Condition with Semaphore
Solve the race condition by semaphore
- a multi-valued locking method (an extension of binary locking)
- the instructions in procedures P and V are atomic operations
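The lost-update scenarios above are easiest to see in code. The following is a minimal sketch, written with the CUDA notation introduced later in these slides and not taken from the original deck; the kernel names racyIncrement and safeIncrement and the pointer d_count are illustrative. Every thread adds 1 to the same counter: the unprotected version is a read-modify-write sequence that can lose updates exactly as in scenarios 1 and 2, while the hardware atomic serializes the conflicting accesses, playing the same role as the lock or semaphore.

//illustrative only: many threads update one shared counter
//the plain increment is a read-modify-write sequence, so updates can be lost
__global__ void racyIncrement( int* d_count )
{
    *d_count = *d_count + 1;
}

//the hardware atomic serializes the conflicting updates,
//much like the lock or semaphore described above
__global__ void safeIncrement( int* d_count )
{
    atomicAdd( d_count, 1 );
}

Launched for example as safeIncrement<<<1,256>>>(d_count), the atomic version always adds exactly 256 to the counter; the racy version may add anywhere between 1 and 256.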
Instruction Level Parallelism
Multiple instructions are executed simultaneously
- reorder the instructions carefully to gain efficiency
- the compiler reorders the assembly instructions automatically
[figure: instructions reordered over three steps]

Data Level Parallelism
Multiple data operations are executed simultaneously
- the computational data is separable and independent
- a single operation is repeated over different input data
[figure: sequential procedure versus parallel procedure]

Flynn's Taxonomy
Classification for parallel computers and programs

                       Single Data              Multiple Data
Single Instruction     SISD (single-core CPU)   SIMD (GPU / vector processor)
Multiple Instruction   MISD (very rare)         MIMD (multi-core CPU)

[figures: dataflow diagrams for SISD, SIMD, MISD, and MIMD]

CPU and GPU Hardware Comparison

CPU versus GPU
- Intel Penryn quad-core: 255 mm^2, 0.82B transistors
- NVIDIA GTX280: more than 500 mm^2, 1.4B transistors

CPU versus GPU
[charts: computing GFLOPS and memory bandwidth over time]

CPU versus GPU
- compare the control logic and cache size against the number of ALUs
- compare the clock rates, core counts, and execution latency

General Purpose GPU Computation
- algorithm conversion requires knowledge of the graphics APIs (OpenGL and DirectX)
- converting pixel data into the required data restricts the general usage of algorithms

From the Traditional GPU Architecture to Nowadays

Simplified Graphic Pipeline
- sorting stage: z-buffer collection
- maximum depth: z-cull feedback
- scale up some units
- add framebuffer access; the bottleneck is the FBI unit for memory management
- add programmability with ALU units: programmable geometry and pixel shaders

Simplified Graphic Pipeline
Consider two similar units and special cases on the pipeline
(1) one triangle but lots of pixels: the pixel shader is busy
(2) lots of triangles but only one pixel each: the geometry shader is busy

Iterative Graphic Pipeline
Combine the two units into a unified shader
- scalable between geometry and pixel work
- memory resource management becomes important

Graphic Pipeline Comparison
[figure: software pipeline versus hardware pipeline]

Unified Graphic Architecture
Switch between two modes: graphics mode and CUDA mode
[block diagram, graphics mode: host, input assembler, setup/raster/z-cull, vertex/geometry/pixel thread issue, streaming processors (SP), texture (TF) units, L1/L2 caches, framebuffer (FB)]
[block diagram, CUDA mode: host, input assembler, thread execution manager, thread processors with parallel data caches, texture units, load/store units, global memory]

Thread Streaming Processing
- no data communication between threads, which suits traditional graphics work

Shader Register File/Cache
Separate register files
- strict streaming processing mode
- no data sharing at the instruction level
- dynamic allocation and renaming
- no memory addressing order
- registers overflow into local memory

Thread Streaming Processing
- communication issues between different threads (in the same shader)

Shader Register File/Cache
Shared register files
- an extra memory hierarchy between the shader registers and global memory
- share data within the same shader or between threads in the same block
- synchronize all threads in the shader

Unified Graphic Architecture

CPU and GPU Chip Architecture

Nowadays GPU Applications
Traditional game rendering
- clothing demo and star tales benchmark
- real-time rendering of physical phenomena
- mixed mode for physical phenomenon simulation
Scientific computational usage
- molecular dynamics simulation
- protein folding and n-body simulation
- medical imaging and computational fluid dynamics
CUDA community showcase

Nowadays GPU Features
GPUs are becoming more programmable than before
- only a standard C extension on unified scalable shaders
GPUs now support 32-bit and 64-bit floating point operations
- almost IEEE floating point compliant except for some special values
- lack mantissa denormalization for small floating point numbers
GPUs have much higher memory bandwidth than CPUs
- multiple memory banks driven by the need of high-performance graphics
Massive data-level parallel architecture
- hundreds of thread processors on the chip
- thousands of concurrent threads on the shader
- lightweight thread switching and long-latency memory

General Purpose GPU Environment
CUDA: Compute Unified Device Architecture
- realistic hardware and software GPGPU solution
- minimal set of standard C language extensions
- tool set includes a compiler and software development kits
OpenCL: Open Computing Language
- similar to CUDA from the GPGPU point of view
- supports both CPU and GPU hardware architectures
- executes across heterogeneous platform resources

CUDA Programming Overview

CUDA Programming Model
Integrated host and device application C program
- serial or modestly parallel parts in host C code
- highly parallel parts in device C extension code
(a skeleton of this host/device split is sketched below)
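The code figure on this slide did not survive extraction. Below is a minimal sketch of the host/device split it illustrates, assuming a simple array-scaling task; serialWork, parallelKernel, h_data, d_data, and n are illustrative names and are not from the original slides.

//serial or modestly parallel part: ordinary C code running on the host
void serialWork( float* h_data, int n )
{
    for( int i=0; i<n; i++ )
        h_data[i] = 2.0f*h_data[i];
}

//highly parallel part: a kernel running as many device threads
__global__ void parallelKernel( float* d_data, int n )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;

    if( idx < n )
        d_data[idx] = 2.0f*d_data[idx];
}

//inside main the two are mixed freely, for example
//  serialWork( h_data, n );
//  parallelKernel<<< (n+255)/256, 256 >>>( d_data, n );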
CUDA Programming Model
What is the compute device?
- a coprocessor to the host
- has its own device memory space
- runs many active threads in parallel
What is the difference between CPU and GPU threads?
- GPU threads are extremely lightweight
- GPU threads have almost no creation overhead
- the GPU needs more than 1000 threads for full occupancy
- a multi-core CPU can execute or create only a few threads at a time

CUDA Programming Model
A kernel is a function executed on the GPU
- all threads execute the same kernel function
- all threads can take different control paths
- each thread has its own local ID to control itself
  float x=data[threadID]
  float y=kernel_func(x)

CUDA Thread Hierarchy
The GPU can create numerous threads concurrently
- all threads are divided into blocks
- all blocks are grouped into one grid
- a kernel is executed as a grid of blocks of threads

CUDA Transparent Scalability
[figure: the same twelve thread blocks scheduled two at a time, four at a time, or more, depending on how many multiprocessors the device provides]

CUDA Thread Hierarchy
Identify the local and global thread index
- each block has its own local blockID in the same grid
- each thread has its own local threadID in the same block
- the global thread index is evaluated from blockID and threadID
[figure: three blocks of seven threads; local thread indices 0-6 in blocks 0, 1, and 2 map to global indices 0-20]

CUDA Thread Hierarchy
CUDA thread hierarchy summary: a grid of blocks of threads

CUDA Programming Model
A kernel executes as a grid of thread blocks
- different kernels can execute with different grid and block sizes
- each block has its blockID and each thread has its threadID
[figure: kernel 1 launched with 1D blocks of four threads, kernel 2 with 2D blocks of 2x4 threads]

CUDA Memory Hierarchy
Block of threads
- registers, local memory, shared memory
Grid of blocks
- registers, local memory, shared memory
- texture memory, constant memory, global memory
CUDA memory hierarchy summary
[figure: the full memory hierarchy, from per-thread registers up to the device memory visible to the host]

Start from a Simple Vector-Addition Example

Vector-Addition Simple Example
Global memory or device memory
- contents are visible to all threads
- used to communicate between host and device
- readable and writable by all threads
- reads and writes have long memory access latency
We will focus on the global memory for now

Vector-Addition Simple Example
cudaMalloc(...)
- allocates space in the global memory
- requires the address of a pointer and the space size
cudaFree(...)
- frees space in the global memory
- requires the pointer to the space to free

Vector-Addition Simple Example
cudaMemcpy(...)
- transfers data between host and device
- requires source and destination pointers
- requires the data transfer direction and size
  cudaMemcpyHostToHost
  cudaMemcpyHostToDevice
  cudaMemcpyDeviceToHost
  cudaMemcpyDeviceToDevice
(an error-checking pattern for these runtime calls is sketched below)
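All of the runtime calls above return a status code that the examples on these slides do not check. The following is a minimal checking helper, not part of the original slides; checkCuda is an illustrative name, while cudaError_t, cudaSuccess, and cudaGetErrorString are standard CUDA runtime API.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

//print the runtime error message and stop if a CUDA call failed
void checkCuda( cudaError_t status, const char* msg )
{
    if( status!=cudaSuccess )
    {
        printf( "%s: %s\n", msg, cudaGetErrorString(status) );
        exit(1);
    }
}

//usage with the calls introduced above, for example
//  checkCuda( cudaMalloc((void**)&d_A,N*sizeof(float)), "cudaMalloc d_A" );
//  checkCuda( cudaMemcpy(d_A,h_A,N*sizeof(float),cudaMemcpyHostToDevice), "copy h_A" );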
Vector-Addition Simple Example
//allocate and initialize the host memory space
float *h_A;
float *h_B;
float *h_C;
h_A=(float*)malloc(N*sizeof(float));
h_B=(float*)malloc(N*sizeof(float));
h_C=(float*)malloc(N*sizeof(float));

//allocate the device memory space
float *d_A;
float *d_B;
float *d_C;
cudaMalloc((void**)&d_A,N*sizeof(float));
cudaMalloc((void**)&d_B,N*sizeof(float));
cudaMalloc((void**)&d_C,N*sizeof(float));

//copy raw data from host to device memory
cudaMemcpy(d_A,h_A,N*sizeof(float),cudaMemcpyHostToDevice);
cudaMemcpy(d_B,h_B,N*sizeof(float),cudaMemcpyHostToDevice);

Vector-Addition Simple Example
//execute the kernel on N/256 blocks of 256 threads
//(this assumes N is a multiple of 256; a bounds-checked variant is sketched after the built-in variables below)
vectorAddition<<< N/256,256 >>>(d_A,d_B,d_C);

//copy result data from device to host memory
cudaMemcpy(h_C,d_C,N*sizeof(float),cudaMemcpyDeviceToHost);

//free the host memory space
free(h_A);
free(h_B);
free(h_C);

//free the device memory space
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);

Vector-Addition Simple Example
//compute vector addition C=A+B
//each thread performs one pair-wise addition
__global__ void vectorAddition( float* d_A, float* d_B, float* d_C )
{
    int globalIdx;

    //compute the thread global index
    globalIdx = blockIdx.x*blockDim.x+threadIdx.x;

    //each thread performs its own element addition
    d_C[globalIdx] = d_A[globalIdx]+d_B[globalIdx];

    return;
}

Vector-Addition Simple Example

Declaration     Call from    Execute on
__global__      host         device
__device__      device       device
__host__        host         host

Vector-Addition Simple Example
Restrictions on a global function
- the only available return type is void
Restrictions on a device function
- no function pointers
- no recursive functions
- no variable number of arguments
- no static variables declared inside the function

Vector-Addition Simple Example
//compute vector addition C=A+B
//each thread performs one pair-wise addition

//forward declaration so the kernel can call the device function
__device__ float addElement( float input_a, float input_b );

__global__ void vectorAddition( float* d_A, float* d_B, float* d_C )
{
    int globalIdx;

    //compute the thread global index
    globalIdx = blockIdx.x * blockDim.x + threadIdx.x;

    //each thread performs its own element addition
    d_C[globalIdx] = addElement(d_A[globalIdx],d_B[globalIdx]);

    return;
}

__device__ float addElement( float input_a, float input_b )
{
    return (input_a+input_b);
}

Vector-Addition Simple Example
CUDA built-in structures

struct dim3
{
    int x;
    int y;
    int z;
};

struct uint3
{
    unsigned int x;
    unsigned int y;
    unsigned int z;
};

Vector-Addition Simple Example
CUDA built-in variables
- used inside global or device functions

gridDim   (dim3 variable)   records the dimension of the current grid
blockIdx  (uint3 variable)  records the block index in the current grid
blockDim  (dim3 variable)   records the dimension of the current block
threadIdx (uint3 variable)  records the thread index in the current block
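The launch above uses exactly N/256 blocks, which only covers every element when N is a multiple of 256. A common variant, sketched below as an assumption rather than as part of the original slides, rounds the block count up and passes the length n so the kernel can bounds-check with the built-in variables just described; vectorAdditionBounded is an illustrative name.

//kernel with a bounds check: threads past the end of the vectors do nothing
__global__ void vectorAdditionBounded( float* d_A, float* d_B, float* d_C, int n )
{
    int globalIdx = blockIdx.x*blockDim.x + threadIdx.x;

    if( globalIdx<n )
        d_C[globalIdx] = d_A[globalIdx]+d_B[globalIdx];
}

//round the block count up so that every element is covered
int threadsPerBlock = 256;
int blocksPerGrid   = (N+threadsPerBlock-1)/threadsPerBlock;
vectorAdditionBounded<<< blocksPerGrid,threadsPerBlock >>>(d_A,d_B,d_C,N);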
Matrix-Addition Simple Example
//allocate and initialize the host memory space
float *h_A;
float *h_B;
float *h_C;
h_A=(float*)malloc(N*N*sizeof(float));
h_B=(float*)malloc(N*N*sizeof(float));
h_C=(float*)malloc(N*N*sizeof(float));

//allocate the device memory space
float *d_A;
float *d_B;
float *d_C;
cudaMalloc((void**)&d_A,N*N*sizeof(float));
cudaMalloc((void**)&d_B,N*N*sizeof(float));
cudaMalloc((void**)&d_C,N*N*sizeof(float));

//copy raw data from host to device memory
cudaMemcpy(d_A,h_A,N*N*sizeof(float),cudaMemcpyHostToDevice);
cudaMemcpy(d_B,h_B,N*N*sizeof(float),cudaMemcpyHostToDevice);

Matrix-Addition Simple Example
dim3 gridSize;
dim3 blockSize;

//set the grid size and block size
gridSize.x=N/16;   blockSize.x=16;
gridSize.y=N/16;   blockSize.y=16;

//execute the kernel on an (N/16 x N/16) grid of (16 x 16) thread blocks
matrixAddition<<<gridSize,blockSize>>>(d_A,d_B,d_C);

//copy result data from device to host memory
cudaMemcpy(h_C,d_C,N*N*sizeof(float),cudaMemcpyDeviceToHost);

//free the host memory space
free(h_A);
free(h_B);
free(h_C);

//free the device memory space
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);

Matrix-Addition Simple Example
//compute matrix addition C=A+B
//each thread performs one pair-wise addition
__global__ void matrixAddition( float* d_A, float* d_B, float* d_C )
{
    int msize;
    int rowIdx;
    int colIdx;

    //compute the matrix size
    msize = gridDim.x*blockDim.x;

    //compute the thread global index
    rowIdx = blockIdx.y*blockDim.y+threadIdx.y;
    colIdx = blockIdx.x*blockDim.x+threadIdx.x;

    //each thread performs its own element addition
    d_C[rowIdx*msize+colIdx] = d_A[rowIdx*msize+colIdx]
                             + d_B[rowIdx*msize+colIdx];

    return;
}