Implementing MASS on GPU
The demand for faster applications drives both researchers and the industry to look for new
ways of improving performance. Graphics applications such as rendering and ray tracing, and
non-graphics applications such as weather prediction, would be unusable without significant
thought put into reducing their runtimes through parallelism and multiple processing units.
The key to such performance improvement is taking advantage of data parallelism, task
parallelism, or both. For example, a single-threaded rendering program can be written to draw
one pixel at a time. Data parallelism can be applied here to perform the same operation (draw
pixel) on every pixel in the program, with the work running in parallel across multiple
threads, thereby reducing the total time it takes to complete (Nickolls & Dally, 2010).
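The per-pixel example above can be sketched in CUDA as follows. This is only an illustration of data parallelism, not code from the MASS library: the names draw_pixels and render, the 16x16 block size, and the gradient used as a stand-in for real shading work are all assumptions, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-pixel operation: each GPU thread draws exactly one pixel.
__global__ void draw_pixels(unsigned char* image, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // A horizontal gradient stands in for real shading work.
        int denom = width > 1 ? width - 1 : 1;
        image[y * width + x] = static_cast<unsigned char>((255 * x) / denom);
    }
}

// Host helper (assumes width > 0 and height > 0): allocate the image on
// the GPU, draw every pixel in parallel, and copy the result back.
std::vector<unsigned char> render(int width, int height) {
    size_t bytes = static_cast<size_t>(width) * height;
    unsigned char* d_image = nullptr;
    cudaMalloc(&d_image, bytes);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    draw_pixels<<<grid, block>>>(d_image, width, height);

    std::vector<unsigned char> image(bytes);
    // cudaMemcpy waits for the kernel on the default stream to finish.
    cudaMemcpy(image.data(), d_image, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_image);
    return image;
}

int main() {
    std::vector<unsigned char> image = render(64, 48);
    std::printf("first=%d last=%d\n", image[0], image[64 * 48 - 1]);
    return 0;
}
```

The single-threaded version would loop over x and y; here the loop disappears and each (x, y) pair is handled by its own thread.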
Background on MASS
Multi-Agent Spatial Simulation (MASS) is a parallel computing library whose design is based
on multiple agents, each acting as a simulation entity residing in a network-independent array
that spans multiple computing nodes. The application space of the MASS library covers the
single-program, multiple-data programming model, i.e. applications that employ parallel
algorithms. Example applications of the MASS library include molecular dynamics, Schrodinger's
wave equation, Fourier's heat equation, battle games, and computational fluid dynamics.
The MASS library is a collection of APIs that abstract the low-level parallelism required to
speed up an application written using the library functions. Harnessing the power of GPUs,
these low-level libraries spread their computation across thousands of parallel threads
that can be executed on multiple computing devices that house these GPUs.
Programmable GPU APIs have been available since 2001 (Nickolls & Dally, 2010), at first only
for graphics applications, with NVIDIA's GeForce 3, which could be programmed using OpenGL
and DirectX 8. In 2006, NVIDIA introduced the first unified graphics and computing GPU
architecture, programmable in C with CUDA as well as through DirectX 10 and OpenGL. OpenCL
and DirectCompute are other GPU APIs whose programming models are similar to CUDA's.
Fall 2011
OpenCL was originally developed by Apple Inc. and has been adopted by GPU manufacturers
such as Intel, AMD, NVIDIA, and ARM. The OpenCL programming model is built around support
for a heterogeneous set of GPUs, while NVIDIA's CUDA API is mainly compatible with GPUs
made by NVIDIA. In this study, NVIDIA's CUDA is the API of choice for implementing MASS.
CUDA was developed by NVIDIA in 2006 and is simply an extension to the C programming
language that enables functions to be defined to run on NVIDIA GPUs. More recently, as of
CUDA 4.1, advances in the CUDA API may make it possible in the near future to run on GPUs
from other manufacturers such as AMD and Intel. CUDA has also been used to write thousands of
applications (Owens et al., 2008), and much of the scientific research on GPU computing
uses CUDA as its GPU API of choice.
Implementing CallAll() using CUDA
CallAll() executes a user-specified function on every place in Places. The implementation
uses CUDA's API to spawn GPU threads, one per place, that run in parallel across all places.
The output of running the test program is given below. It shows the result of
parallel GPU threads used to update the index of each Place in Places.
Performance Evaluation
In terms of performance, the CallAll() implementation described above spawns one thread for
each place, thereby updating the index for each place simultaneously. For a larger array of
places, e.g. 500 Place elements, each Place element is modified by calling
set_place_indexes(), which updates that Place's index. Therefore, for N < Places->size, the
performance of running the function set_place_indexes() on every place in Places is O(1).
However, for a function sample_function() with a performance of O(N), the performance of
running this function on CUDA with CallAll() for all Places will be O(N): the invocations run
concurrently rather than one after another, so the total cost is that of a single invocation,
assuming the GPU can run all of the threads at the same time.
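The one-thread-per-place dispatch discussed above can be sketched in CUDA as follows. This is a minimal sketch under stated assumptions, not the study's actual implementation: the Place struct, the body of set_place_indexes(), the call_all() wrapper, and the block size of 256 are all hypothetical, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical Place type; the real MASS Place class is not shown here.
struct Place {
    int index;
};

// Per-place user function. The name comes from the text, the body is assumed.
__device__ void set_place_indexes(Place* place, int i) {
    place->index = i;
}

// Kernel behind this sketch of CallAll(): thread i handles place i.
__global__ void call_all_kernel(Place* places, int size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {  // threads past the end of the array do nothing
        set_place_indexes(&places[i], i);
    }
}

// Host-side wrapper: copy the places to the GPU, launch one thread per
// place, and copy the updated places back.
void call_all(std::vector<Place>& places) {
    if (places.empty()) return;
    int size = static_cast<int>(places.size());
    size_t bytes = places.size() * sizeof(Place);

    Place* d_places = nullptr;
    cudaMalloc(&d_places, bytes);
    cudaMemcpy(d_places, places.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (size + threads - 1) / threads;  // round up
    call_all_kernel<<<blocks, threads>>>(d_places, size);

    cudaMemcpy(places.data(), d_places, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_places);
}

int main() {
    std::vector<Place> places(500, Place{-1});
    call_all(places);
    std::printf("places[0]=%d places[499]=%d\n",
                places[0].index, places[499].index);
    return 0;
}
```

With 500 places and 256 threads per block, the launch uses 2 blocks (512 threads), and the bounds check leaves the last 12 threads idle.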
Despite the substantial gains that can be achieved by using GPUs to implement the
MASS library functions, there are some drawbacks, which are described below.
No support for automatic memory allocation on GPU
MASS functions can be implemented to run on the GPU using GPU APIs such as CUDA or
OpenCL. However, these APIs require memory to be allocated on the GPU beforehand for every
function that needs to execute on the GPU device. This poses a significant challenge for a
MASS library user who wants to take advantage of the GPU's power without having to rewrite
an existing CPU function to run on the GPU. The limitation arises because memory is not
automatically allocated on the GPU when a GPU API function is called. Automatic memory
allocation is supported for C/C++ on the CPU, but CUDA and OpenCL (both extensions of C) do
not provide support for automatic memory allocation on the GPU. This makes it difficult to
write a generic C function that would have a fair chance of executing on the GPU without
knowing in advance how much memory the function would use.
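The explicit steps that this limitation forces on the caller can be sketched as follows. The kernel double_values, the wrapper double_on_gpu, and the CUDA_CHECK macro are hypothetical names introduced for illustration; only cudaMalloc, cudaMemcpy, cudaFree, cudaGetLastError, and cudaGetErrorString are actual CUDA runtime calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            std::fprintf(stderr, "CUDA error: %s\n",               \
                         cudaGetErrorString(err));                 \
            std::exit(EXIT_FAILURE);                               \
        }                                                          \
    } while (0)

// Hypothetical user function, rewritten by hand as a GPU kernel.
__global__ void double_values(float* data, size_t count) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < count) {
        data[i] *= 2.0f;
    }
}

// The caller must supply `count`: nothing on the GPU side infers how
// much device memory the function needs.
void double_on_gpu(float* host_data, size_t count) {
    if (count == 0) return;
    size_t bytes = count * sizeof(float);
    float* d_data = nullptr;

    // 1. Allocate device memory of a size known in advance.
    CUDA_CHECK(cudaMalloc(&d_data, bytes));
    // 2. Copy the input from host to device.
    CUDA_CHECK(cudaMemcpy(d_data, host_data, bytes, cudaMemcpyHostToDevice));

    // 3. Launch the kernel, one thread per element.
    unsigned int threads = 256;
    unsigned int blocks =
        static_cast<unsigned int>((count + threads - 1) / threads);
    double_values<<<blocks, threads>>>(d_data, count);
    CUDA_CHECK(cudaGetLastError());

    // 4. Copy the result back and 5. release the device memory.
    CUDA_CHECK(cudaMemcpy(host_data, d_data, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_data));
}

int main() {
    std::vector<float> values = {1.0f, 2.5f, -4.0f};
    double_on_gpu(values.data(), values.size());
    std::printf("%g %g %g\n", values[0], values[1], values[2]);
    return 0;
}
```

Every step except the kernel launch depends on the byte count, which is why a generic wrapper cannot run an arbitrary CPU function on the GPU without first being told how much memory that function touches.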
Client has to know how to program in target GPU API
The missing support for automatic memory allocation described above means that the
developer has to learn how to program in the target API in order to have a specific function
run on the GPU. This might not be such a bad thing considering the minimal learning curve
required to perform parallelization with GPU APIs like CUDA. However, it goes against the
fundamental design of the MASS library, which requires users to extend MASS by writing CPU
functions that would "invisibly" run on the installed GPU.
This independent study investigated the use of GPUs to significantly improve the performance
of the MASS library functions by introducing a low-level implementation layer on the GPU
that achieves parallelism through lightweight GPU threads. Developers implementing applications using
MASS library functions can take advantage of GPU parallelism without having to worry about
implementing parallel algorithms. The major limitation of the CUDA API and other GPU APIs
like OpenCL in implementing GPGPU programs is the inability to implicitly allocate memory
on the GPU for client programs that need to run a GPU function. One way to overcome this
limitation is to implement a preprocessor that would convert a CPU function into a GPU
function and also parse all the data structures needed for the computation so that memory is
allocated before this function is run on the GPU.
Nickolls, J., & Dally, W. J. (2010). The GPU computing era. IEEE Micro, 30(2), 56-69.
Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., & Phillips, J. C. (2008). GPU
computing. Proceedings of the IEEE, 96(5), 879-899.