GPUs as Computational Units
The performance increase of microprocessors has slowed since 2003, due to energy-consumption and heat-dissipation issues that have limited both the increase of the clock frequency and the amount of productive work that can be performed in each clock period within a single CPU. Virtually all microprocessor vendors have switched to designs where multiple processing units, referred to as processor cores, are used in each chip to increase the processing power. Since 2003, the semiconductor industry has settled on two main trajectories for designing microprocessors [Hwu 2008]: the multicore and the many-core trajectories.
The multicore trajectory seeks to maintain the execution speed of sequential programs
while moving into multiple cores. The multicores began as two-core processors and
each new semiconductor process generation seems to double the number of cores. The
Intel Core i7 microprocessor has four processor cores, each of which is an out-of-order,
multiple instruction issue processor implementing the full x86 instruction set.
The many-core trajectory focuses more on the execution throughput of parallel applications. The many-cores began as a large number of much smaller cores, and, once again, the number of cores doubles with each generation. An example is the NVIDIA GeForce GTX 295 graphics processing unit (GPU) with 480 cores, each of which is a heavily multithreaded, in-order, single-instruction-issue processor that shares its control logic and instruction cache with seven other cores.
As of 2009, the ratio between many-core GPUs and multicore CPUs in peak floating-point calculation throughput is about 10 to 1 (1 teraflops versus 100 gigaflops) [David B. Kirk, Wen-mei W. Hwu 2010].
This large performance gap between current standard CPUs and current standard GPUs motivates moving the computationally intensive parts of applications to the GPU for execution.
NVIDIA has developed a compiler/libraries/runtime SDK called CUDA that enables programmers to readily access the data-parallel computation model and develop applications for it. The CUDA project was announced together with the G80 in November 2006. A public beta version of the CUDA SDK was released in February 2007. Version 1.0 was timed to the rollout of the Tesla solutions in June 2007, which were based on the G80 and designed for the high-performance computing market. Later, at the end of the year, came the CUDA 1.1 beta, which added new features even though it was a minor release.
A CUDA program consists of one or more phases that are executed on either the host (CPU) or a device such as a GPU. The NVIDIA C compiler (nvcc) separates out the code targeted at the device and compiles it, while the host code is left to the native compiler.
A function targeted at the device is called a kernel and is invoked by a large number of threads to exploit data parallelism. For example, a straightforward 1000 x 1000 matrix multiplication kernel would generate 1 000 000 threads when it is invoked, one per output element; a launch configuration that achieves this is sketched below. It is worth noting that CUDA threads are of much lighter weight than CPU threads.
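As a minimal, hedged sketch of such a launch (the kernel name matrixMulKernel, the device pointers dP, dM, dN and the 16 x 16 block size are illustrative assumptions, not taken from the article):

// One thread per output element of a 1000 x 1000 matrix product.
int width = 1000;
dim3 block(16, 16);                               // 256 threads per block
dim3 grid((width + 15) / 16, (width + 15) / 16);  // 63 x 63 blocks, roughly 1 000 000 threads
matrixMulKernel<<<grid, block>>>(dP, dM, dN, width);  // assumed kernel and device pointers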
In CUDA, the host (CPU) and devices (GPUs) have separate memory spaces. This reflects the reality that devices are hardware cards that come with their own DRAM. In order to execute a kernel on a device, the programmer needs to allocate memory on the device and transfer the pertinent data from host memory to the allocated device memory. Similarly, after device execution, the programmer needs to transfer the result data back to host memory and free up the device memory that is no longer needed. The CUDA runtime system provides API functions to perform this type of memory management. The host can transfer data to the device in different manners, so that the kernel can access it in different fashions. There are two hardware caches, the "texture cache" and the "constant cache", for different read-only access patterns to device memory. There is also "shared memory", a small on-chip memory used for structured random access and inter-thread communication.
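A minimal sketch of this allocate/copy/compute/copy-back/free pattern with the runtime API (the host arrays hA, hB, hC and the element count N are illustrative assumptions):

size_t bytes = N * sizeof(float);
float *dA, *dB, *dC;                                // device pointers
cudaMalloc((void**)&dA, bytes);                     // allocate device memory
cudaMalloc((void**)&dB, bytes);
cudaMalloc((void**)&dC, bytes);
cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);  // host -> device
cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
// ... launch a kernel that reads dA and dB and writes dC ...
cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  // device -> host
cudaFree(dA); cudaFree(dB); cudaFree(dC);           // release device memory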
Let’s take the red-pill and enter the grid
In CUDA, a kernel function specifies the code to be executed by all threads during a parallel phase. One commonly refers to this as the SIMD execution model, Single Instruction Multiple Data.
The "__global__" keyword is CUDA-specific and indicates that the function declared after it is a kernel and that it can be called from host functions to generate a grid of threads on a device. In general, CUDA extends C function declarations with three qualifier keywords:
Keyword                          Executed on the    Callable from the
__device__ float foo(...)        device             device
__host__ float bar(...)          host               host
__global__ void foobar(...)      device             host
It is worth noticing that a __global__ function has to return void and is itself responsible for storing any results in global memory in the GPU's DRAM. Note also that a function can be declared with both qualifiers, as in
__host__ __device__ float profileMe(...) { ... }
In this case the function may be executed on the device or on the host, depending on how it is invoked, since it is callable from both the host and the device. As a canonical example of vector arithmetics we take an arbitrary expression such as
C_i = 1 / sqrt(exp(sin(A_i + cos(B_i)))),
which would be expressed in Fortran or Matlab as
C = 1./sqrt(exp(sin(A + cos(B))))
The C and CUDA versions follow:
__host__ __device__ float elementwiseWork(float a, float b) {
    // rsqrt(x) computes the reciprocal square root, 1/sqrt(x)
    return rsqrt(exp(sin(a + cos(b))));
}

__global__ void gpuArithmetics(float* A, float* B, float* C, int N)
{
    // Linear element index over a 2D grid of 1D thread blocks
    unsigned i = blockDim.x*gridDim.x*blockIdx.y + blockDim.x*blockIdx.x + threadIdx.x;
    if (i < N)                                  // guard against excess threads
        C[i] = elementwiseWork(A[i], B[i]);
}

void cpuArithmetics(float* A, float* B, float* C, int N)
{
    for (int i = 0; i < N; i++)
        C[i] = elementwiseWork(A[i], B[i]);
}
Profiling this example shows that the device path and the host path suffer from different bottlenecks on our test machine, while providing bitwise identical results. The bottleneck on the host is the CPU computation; the GPU's bottleneck is its memory bandwidth of 10 GB/s. However, while this example scales perfectly from one to two CPU cores, the GPU version on 32 cores is 220 times faster than the CPU version on two cores. This might make one believe that each single GPU core is more than 220*2/32 = 14 times faster than a single CPU core, which of course is far from true: the GPU uses simplified floating-point arithmetic, so the actual number of clock cycles needed per operation is an order of magnitude smaller. Hence the result is also less accurate, but not by much. With random vector elements from the range [0, 1] in the above example, the maximum difference between the CPU and GPU versions is comparable to the single-precision machine epsilon, and the average relative error is smaller still, which is explained by most elements (70%) being bitwise identical.
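A minimal sketch of the kind of host-side comparison implied here (the arrays hostC_gpu and hostC_cpu, holding the copied-back GPU result and the CPU reference, are illustrative names, and the usual stdio/math headers are assumed):

float maxAbsDiff = 0.0f;
double sumRelErr = 0.0;
int identical = 0;
for (int i = 0; i < N; i++) {
    float diff = fabsf(hostC_gpu[i] - hostC_cpu[i]);
    if (diff > maxAbsDiff) maxAbsDiff = diff;
    if (hostC_cpu[i] != 0.0f) sumRelErr += diff / fabsf(hostC_cpu[i]);
    if (hostC_gpu[i] == hostC_cpu[i]) identical++;     // bitwise identical elements
}
printf("max abs diff %g, mean rel err %g, %d of %d elements identical\n",
       maxAbsDiff, sumRelErr / N, identical, N);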
The accuracy tradeoff, combined with a theoretical peak FLOPS an order of magnitude larger on the GPU, allows real-world applications to be sped up by a factor of 100. This makes CUDA capable of addressing problems with new approaches. One of the applications we're developing using CUDA is a signal processing tool based on methods that were previously disregarded due to the large amount of data produced. Using CUDA we can stream the computations and the results of user-applied operations. Data is streamed in real time to and from audio devices, and to the screen. In the near future we're looking forward to the next generation of computing chips, which comply with the IEEE standards for floating-point arithmetic and would make the errors above identical to zero. This effectively means that GPU kernels can be verified against previous CPU implementations, thus making it even easier to port applications and to develop new applications that use the GPU for heavy computations. There are also several promising tools and libraries being developed to further ease GPGPU development. At Addiva, we're constantly tripping on the edge of these new opportunities and we see a bright future for GPGPU development.
Robin Persson, Head of Development and Linux Team Leader at Addiva.
Johan Gustafsson, GPGPU developer and associate member of the Addiva Linux team.