GPUs as Computational Units

The performance increase of microprocessors has slowed since 2003, due to energy-consumption and heat-dissipation issues that limit both the clock frequency and the amount of productive work that can be performed in each clock period within a single CPU. Virtually all microprocessor vendors have switched to designs where multiple processing units, referred to as processor cores, are used in each chip to increase processing power. Since 2003, the semiconductor industry has settled on two main trajectories for designing microprocessors [Hwu 2008]: the multicore and the many-core trajectories.

The multicore trajectory seeks to maintain the execution speed of sequential programs while moving to multiple cores. Multicore processors began as two-core designs, and each new semiconductor process generation seems to double the number of cores. The Intel Core i7 microprocessor has four processor cores, each of which is an out-of-order, multiple-instruction-issue processor implementing the full x86 instruction set.

The many-core trajectory focuses more on the execution throughput of parallel applications. Many-core processors began with a large number of much smaller cores, and, once again, the number of cores doubles with each generation. The NVIDIA GeForce GTX 295 graphics processing unit (GPU) has 480 cores, each of which is a heavily multithreaded, in-order, single-instruction-issue processor that shares its control logic and instruction cache with seven other cores. As of 2009, the ratio between many-core GPUs and multicore CPUs in peak floating-point throughput is about 10 to 1 (1 teraflops versus 100 gigaflops) [David B. Kirk, Wen-mei W. Hwu 2010]. This large performance gap between current standard CPUs and GPUs motivates moving the computationally intensive parts of applications to the GPU for execution.

NVIDIA has developed a compiler/libraries/runtime SDK called CUDA that enables programmers to readily access this data-parallel computation model and develop applications for it. The CUDA project was announced together with the G80 in November 2006, and a public beta of the CUDA SDK was released in February 2007. Version 1.0 was timed to the rollout of the Tesla solutions in June 2007, based on the G80 and designed for the high-performance computing market. At the end of that year came the CUDA 1.1 beta, which added new features even though it was a minor release.

A CUDA program consists of one or more phases that are executed on either the host (CPU) or a device such as a GPU. The NVIDIA C compiler (nvcc) separates out the code targeted at the device and compiles it, while the host code is left to the native compiler. A function targeted at the device is called a kernel and is invoked by a large number of threads to exploit data parallelism. For example, a straightforward 1000 x 1000 matrix multiplication kernel would generate 1 000 000 threads when it is invoked. It is worth noting that CUDA threads are of much lighter weight than CPU threads.

In CUDA, the host (CPU) and devices (GPUs) have separate memory spaces. This reflects the reality that devices are hardware cards that come with their own DRAM. In order to execute a kernel on a device, the programmer needs to allocate memory on the device and transfer the pertinent data from host memory to the allocated device memory. Similarly, after device execution, the programmer needs to transfer the result data back to host memory and free up the device memory that is no longer needed.
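A minimal sketch of this host-side flow is shown below; the function runOnDevice, the placeholder kernel someKernel and the launch configuration are our own illustrative assumptions, while cudaMalloc, cudaMemcpy and cudaFree are the actual runtime calls involved.

#include <cuda_runtime.h>

// Placeholder kernel standing in for real work: a plain element-wise copy.
__global__ void someKernel(const float* in, float* out, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) out[i] = in[i];
}

// Allocate, copy in, launch, copy out, free.
void runOnDevice(const float* hostIn, float* hostOut, int N)
{
    size_t bytes = N * sizeof(float);
    float *devIn = 0, *devOut = 0;

    cudaMalloc((void**)&devIn,  bytes);                           // allocate device memory
    cudaMalloc((void**)&devOut, bytes);

    cudaMemcpy(devIn, hostIn, bytes, cudaMemcpyHostToDevice);     // host -> device

    someKernel<<<(N + 255) / 256, 256>>>(devIn, devOut, N);       // launch the kernel

    cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost);   // device -> host

    cudaFree(devIn);                                              // release device memory
    cudaFree(devOut);
}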
The CUDA runtime system provides API functions, such as cudaMalloc, cudaMemcpy and cudaFree, to perform this type of memory management. The host can transfer data to the device in different manners so that the kernel can access it in different fashions. There are two hardware caches, the texture cache and the constant cache, which serve different read-only access patterns, and there is also on-chip shared memory for structured random access to device memory and for inter-thread communication.

Let's take the red pill and enter the grid

In CUDA, a kernel function specifies the code to be executed by all threads during a parallel phase; this is commonly referred to as the SIMD (Single Instruction, Multiple Data) execution model. The __global__ keyword is CUDA-specific and indicates that the function declared after it is a kernel that can be called from host functions to generate a grid of threads on a device. In general, CUDA extends C function declarations with three qualifier keywords:

Keyword                          Executed on the    Callable from the
__device__ float foo(...)        device             device
__host__   float bar(...)        host               host
__global__ void foobar(...)      device             host

It is worth noticing that a __global__ function has to return void and is itself responsible for storing any results in global memory in the GPU's DRAM. Note also that a function can be declared as

__host__ __device__ float profileMe(...) { ... }

In this case the function may be executed on the device or on the host depending on how it is invoked, since it is callable from both.

As a canonical example of vector arithmetic we take an arbitrary expression, which in Fortran or Matlab would be written as

C = 1./sqrt(exp(sin(A + cos(B))))

The C and CUDA versions follow (a sketch of how the kernel is launched from the host is given at the end of this section):

__host__ __device__ float elementwiseWork(float a, float b)
{
    return rsqrt(exp(sin(a + cos(b))));
}

__global__ void gpuArithmetics(float* A, float* B, float* C, int N)
{
    // Global thread index for a 2D grid of 1D thread blocks.
    unsigned i = blockDim.x*gridDim.x*blockIdx.y
               + blockDim.x*blockIdx.x
               + threadIdx.x;
    if (i < N)
        C[i] = elementwiseWork(A[i], B[i]);
}

void cpuArithmetics(float* A, float* B, float* C, int N)
{
    for (int i = 0; i < N; i++)
        C[i] = elementwiseWork(A[i], B[i]);
}

Profiling this example shows that the device path and the host path suffer from different bottlenecks on our test machine while producing nearly bitwise-identical results. The bottleneck on the host is the CPU computation; the GPU's bottleneck is its memory bandwidth of 10 GB/s. However, while this example scales perfectly from one to two CPU cores, the GPU version on 32 cores is 220 times faster than the CPU version on two cores. This might make one believe that each single GPU core is more than 220*2/32 ≈ 14 times faster than a single CPU core, which of course is far from true: the GPU uses simplified floating-point arithmetic, so the actual number of clock cycles needed per operation is an order of magnitude smaller. Hence the result is also less accurate, but not much less. With random vector elements from the range [0, 1] in the above example, the maximum difference between the CPU and GPU versions is comparable to the machine epsilon, and the average relative error is smaller still, which is explained by most elements (70%) being bitwise identical. The accuracy tradeoff, combined with a theoretical peak FLOPS that is an order of magnitude larger on the GPU, allows real-world applications to be sped up by a factor of 100. This makes CUDA capable of addressing problems with new approaches.
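As referenced above, the following sketch shows how gpuArithmetics might be launched from the host; the helper launchGpuArithmetics, the block size of 256 and the use of a 1D grid are our illustrative assumptions rather than the configuration used in the measurements above.

#include <cuda_runtime.h>

// Illustrative host-side launch of the gpuArithmetics kernel defined above.
void launchGpuArithmetics(float* devA, float* devB, float* devC, int N)
{
    int threadsPerBlock = 256;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;

    // A 1D grid suffices as long as 'blocks' fits in one grid dimension;
    // the kernel's index formula also handles a 2D grid (blockIdx.y)
    // for larger problem sizes.
    gpuArithmetics<<<blocks, threadsPerBlock>>>(devA, devB, devC, N);
    cudaDeviceSynchronize();   // wait for the kernel to finish
}

Kernel launches are asynchronous, so the explicit synchronization is mainly useful when timing the kernel from the host.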
One of the applications we are developing using CUDA is a signal-processing tool based on methods that were previously disregarded due to the large amount of data they produce. Using CUDA we can stream the computations and the results of user-applied operations; data is streamed in real time to and from audio devices, and to the screen. In the near future we look forward to the next generation of compute chips, which will comply with the IEEE standard for floating-point arithmetic and thereby make the errors above identical to zero. This effectively means that GPU kernels can be verified against previous CPU implementations, making it even easier to port applications and to develop new applications that use the GPU for heavy computations. There are also several promising tools and libraries being developed to further ease GPGPU development. At Addiva, we are constantly on the edge of these new opportunities, and we see a bright future for GPGPU development.

Robin Persson, Head of Development and Linux Team Leader at Addiva.
Johan Gustafsson, GPGPU developer and associate member of the Addiva Linux team.