APARAPI
Using GPU/APUs to accelerate Java Workloads
Gary Frost
AMD
PMTS Java Runtime Team
1 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
AGENDA
The age of heterogeneous computing is here
The supercomputer in your desktop/laptop
Why Java?
Current GPU programming options for Java developers
Are developers likely to adopt existing Java OpenCL/CUDA bindings?
Aparapi
– What it does
– How it does it
Performance
Examples/Demos
Challenges
Future work
QA
2 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
THE AGE OF HETEROGENEOUS COMPUTE IS HERE
GPUs originally developed to accelerate graphics operations
Early adopters repurposed GPU for ‘general compute’ by performing ‘unnatural acts’ with shader APIs
OpenGL allows shaders/textures to be compiled and executed via extensions
OpenCL/GLSL/CUDA standardizes/formalizes how to express GPU compute and simplifies host programming.
New programming models emerging and lowering/removing barriers to adoption
3 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
THE SUPERCOMPUTER IN YOUR DESKTOP
Some interesting tidbits from http://www.top500.org/
– November 2000
“ASCI White is new #1 with 4.9 TFlops on the Linpack"
http://www.top500.org/lists/2000/11
– November 2002
“3.2 TFlops are needed to enter the top 10”
http://www.top500.org/lists/2002/11
– May 2011
AMD Radeon 6990: 5.1 TFlops single-precision performance
http://www.amd.com/us/products/desktop/graphics/amd-radeon-hd-6000/hd-6990/Pages/amd-radeon-hd-6990-overview.aspx#3
4 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
WHY JAVA?
One of the most widely used programming languages
– http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
Established in domains likely to benefit from heterogeneous compute
– BigData, Search, Hadoop+Pig, Finance, GIS, Oil & Gas
Even if applications are not implemented in Java, they may still run on the Java Virtual Machine (JVM)
– JRuby, JPython, Scala, Clojure, Quercus(PHP)
Acts as a good proxy/indicator for enablement of other runtimes/interpreters
– JavaScript, Flash, .NET, PHP, Python, Ruby, Dalvik?
[Pie chart: TIOBE index language share for Java, C, C++, C#, PHP, Objective C, Python and others]
5 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
GPU PROGRAMMING OPTIONS FOR JAVA PROGRAMMERS
Most Java GPU APIs require coding the kernel in a domain-specific language (OpenCL, GLSL, or CUDA)
// OpenCL kernel code
__kernel void squares(__global const float *in, __global float *out){
int gid = get_global_id(0);
out[gid] = in[gid] * in[gid];
}
As well as writing the Java ‘host’ CPU-based code to:
– Select/initialize execution device
– Compile 'Kernel' for a selected device
– Allocate or define memory buffers for args /parameters
– Enqueue/Send arg buffers to device
– Execute the kernel
– Read results buffers back from the device
6 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
ARE DEVELOPERS LIKELY TO ADOPT EXISTING JAVA OPENCL/CUDA BINDINGS?
Some will
– Early adopters
– Prepared to learn new languages
– Motivated to squeeze all the performance they can from available compute devices
– Prepared to implement algorithms both in Java and in CUDA/OpenCL
Most won’t
– OpenCL/CUDA C99 heritage likely to disenfranchise Java developers
Either walked away from C/C++ a while back or possibly never encountered it at all (due to CS education shifts)
OpenCL/CUDA expose a low-level memory model that is alien to developers who learned to trust the JVM to ‘do the right thing’
Who pays for retraining of Java developers?
– Notion of writing code twice (once for Java execution, again for the GPU/APU) is alien
Where’s my ‘write once, run anywhere’?
7 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
USING JOCL OPENCL JAVA BINDINGS
import static org.jocl.CL.*;
import org.jocl.*;

public class Sample {
   public static void main(String args[]) {
      // Create input and output data
      int size = 10;
      float inArr[] = new float[size];
      float outArr[] = new float[size];
      for (int i=0; i<size; i++) {
         inArr[i] = i;
      }
      Pointer in = Pointer.to(inArr);
      Pointer out = Pointer.to(outArr);

      // Obtain the platform IDs and initialize the context properties
      cl_platform_id platforms[] = new cl_platform_id[1];
      clGetPlatformIDs(1, platforms, null);
      cl_context_properties contextProperties = new cl_context_properties();
      contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);

      // Create an OpenCL context
      cl_context context = clCreateContextFromType(contextProperties,
         CL_DEVICE_TYPE_CPU, null, null, null);

      // Obtain the cl_device_id for the first device
      cl_device_id devices[] = new cl_device_id[1];
      clGetContextInfo(context, CL_CONTEXT_DEVICES,
         Sizeof.cl_device_id, Pointer.to(devices), null);

      // Create a command-queue
      cl_command_queue commandQueue =
         clCreateCommandQueue(context, devices[0], 0, null);

      // Allocate the memory objects for the input and output data
      cl_mem inMem = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
         Sizeof.cl_float * size, in, null);
      cl_mem outMem = clCreateBuffer(context, CL_MEM_READ_WRITE,
         Sizeof.cl_float * size, null, null);

      // Create the program from the source code
      cl_program program = clCreateProgramWithSource(context, 1, new String[]{
         "__kernel void sampleKernel("      +
         "      __global const float *in," +
         "      __global float *out){"     +
         "   int gid = get_global_id(0);"  +
         "   out[gid] = in[gid] * in[gid];"+
         "}"
      }, null, null);

      // Build the program
      clBuildProgram(program, 0, null, null, null, null);

      // Create and extract a reference to the kernel
      cl_kernel kernel = clCreateKernel(program, "sampleKernel", null);

      // Set the arguments for the kernel
      clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(inMem));
      clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(outMem));

      // Execute the kernel: the parallel equivalent of
      //    for (int i=0; i<size; i++) { outArr[i] = inArr[i] * inArr[i]; }
      clEnqueueNDRangeKernel(commandQueue, kernel,
         1, null, new long[]{size}, null, 0, null, null);

      // Read the output data back into outArr
      clEnqueueReadBuffer(commandQueue, outMem, CL_TRUE, 0,
         size * Sizeof.cl_float, out, 0, null, null);

      // Release kernel, program, and memory objects
      clReleaseMemObject(inMem);
      clReleaseMemObject(outMem);
      clReleaseKernel(kernel);
      clReleaseProgram(program);
      clReleaseCommandQueue(commandQueue);
      clReleaseContext(context);

      // Print the results
      for (float f : outArr) {
         System.out.printf("%5.2f, ", f);
      }
   }
}
8 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
WHAT IS APARAPI AND HOW IS IT DIFFERENT FROM EXISTING JAVA GPU BINDINGS?
An API for expressing data parallel workloads in Java
– Developer extends a Kernel base class and compiles to Java bytecode using existing tool chain
A runtime component capable of converting Java bytecode to OpenCL for execution on a GPU, or of executing on the host via a Java thread pool
[Flow diagram]
class MyKernel extends Kernel{
   @Override public void run(){
   }
}
is compiled with javac to MyKernel.class; at runtime Aparapi asks: does the platform support OpenCL? can the bytecode be converted to OpenCL?
– Yes: convert the bytecode to OpenCL and execute the OpenCL Kernel on the GPU
– No: execute the Kernel using a Java thread pool
13 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
CONSIDER AN EMBARRASSINGLY PARALLEL USE CASE
We will convert the ‘square example’ (embarrassingly parallel) to Aparapi
– Calculate square[0..size] for a given input in[0..size]
final int[] square= new int[size];
final int[] in = new int[size];
// populating in[0..size] omitted
for (int i=0; i<size; i++){
   square[i] = in[i] * in[i];
}
// ideally: parallel-for (int i=0; i<size; i++){ square[i] = in[i] * in[i]; }
Ideally we could indicate that the body of the loop need not be executed sequentially.
It would be great if we could add a parallel-for construct to the Java language.
But we want to avoid modifying the language, compiler or toolchain.
14 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
REFACTORING OUR EXAMPLE TO USE APARAPI
final int[] square= new int[size];
final int[] in = new int[size]; // populating in[0..size] omitted
for (int i=0; i<size; i++){
square[i] = in[i] * in[i];
}
new Kernel(){
@Override public void run(){
int i = getGlobalID();
square[i] = in[i]*in[i];
}
}.execute(size);
15 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
EXPRESSING DATA PARALLEL IN APARAPI
What happens when we call execute(n)?

Kernel kernel = new Kernel(){
   @Override public void run(){
      int i=getGlobalID();
      square[i]=in[i]*in[i];
   }
};
kernel.execute(size);

[Decision flow for execute(size)]
– Is this the first execution? If so, decide: do we have OpenCL? does the platform support OpenCL? can the bytecode be converted to OpenCL?
– If all answers are yes: convert the bytecode to OpenCL and execute the OpenCL Kernel on the GPU
– If any answer is no: execute the Kernel using a Java thread pool
16 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
FIRST CALL OF KERNEL.EXECUTE(SIZE) WHEN OPENCL/GPU IS AVAILABLE
– Reload classfile via classloader extracting methods and fields
– For run() method and all methods reachable from run() method …
Convert method bytecode to an IR (expression trees, conditional loop constructs)
– More on how we do this later…
Maintain a list of field accesses and types (read/write/read+write)
– Class fields ultimately represent args passed to OpenCL
– Create and Compile OpenCL for all reachable methods
– Lock accessed primitive arrays so the garbage collector doesn’t move them around
– For each field/primitive that is read by the generated code enqueue write to GPU and/or set arg
– Execute OpenCL Kernel
– For each primitive array that is written by the generated code enqueue read from GPU
– Unlock accessed primitive arrays
– Results now available in Java application
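As a rough sketch of the first bullet above (a hypothetical helper, not Aparapi's actual code), the raw bytes of the kernel's class can be re-read through its classloader and handed to the bytecode parser:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

public class ClassBytes {
   // Fetch the raw .class bytes for a class through its own classloader.
   static byte[] of(Class<?> clazz) throws Exception {
      String resource = clazz.getName().replace('.', '/') + ".class";
      InputStream is = clazz.getClassLoader().getResourceAsStream(resource);
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      for (int n = is.read(buf); n != -1; n = is.read(buf)) {
         bytes.write(buf, 0, n);
      }
      is.close();
      return bytes.toByteArray();  // ready to be parsed for methods, fields and bytecode
   }
}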
17 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
SUBSEQUENT CALLS OF KERNEL.EXECUTE(SIZE) WHEN OPENCL/GPU IS AVAILABLE
– Lock accessed primitive arrays so the garbage collector doesn’t move them around
– For each field/primitive that is read by the generated code enqueue write to GPU and/or set arg
– Execute OpenCL Kernel
– For each primitive array that is written by the generated code enqueue read from GPU
– Unlock accessed primitive arrays
– Results now available in Java application
18 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
KERNEL.EXECUTE(SIZE) WHEN OPENCL/GPU IS NOT AN OPTION
– Create a thread pool
One thread per core.
– Clone Kernel one per thread
Each Kernel holds state for one thread
– Each thread iterates 0..(size/threads)
Updates globalId, localId, groupSize, globalSize, etc. state on its Kernel instance.
Executes run() method on Kernel instance.
– Wait for all threads to complete
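A hedged sketch of this fallback scheme, using a made-up SeqKernel stand-in rather than Aparapi's real Kernel class:

// Illustrative only: one clone per thread, each clone walks its own range of global ids.
abstract class SeqKernel implements Cloneable {
   int globalId;
   int getGlobalID() { return globalId; }
   public abstract void run();
   public void execute(int size) {
      int cores = Runtime.getRuntime().availableProcessors();
      int chunk = size / cores;                          // assume size % cores == 0 for brevity
      Thread[] threads = new Thread[cores];
      for (int t = 0; t < cores; t++) {
         final SeqKernel kernelClone;
         try { kernelClone = (SeqKernel) clone(); }      // per-thread copy holding its own state
         catch (CloneNotSupportedException e) { throw new RuntimeException(e); }
         final int start = t * chunk, end = start + chunk;
         (threads[t] = new Thread(new Runnable() {
            @Override public void run() {
               for (int id = start; id < end; id++) {
                  kernelClone.globalId = id;             // update the clone's globalId
                  kernelClone.run();                     // execute the kernel body for this id
               }
            }
         })).start();
      }
      for (Thread thread : threads) {
         try { thread.join(); }                          // wait for all threads to complete
         catch (InterruptedException e) { Thread.currentThread().interrupt(); }
      }
   }
}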
19 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
BYTECODE PRIMER
Variable length instructions
Access to immediate values
– Mostly constant pool and local variable table indexes
Stack based execution
– IMUL: multiply two integers from stack and push result
– …, <op2>, <op1> => [ IMUL ] => …, <op1*op2>
– Sometimes the types and number of operands cannot be determined from the bytecode alone. We need to decode from the ConstantPool
Some surprising omissions
– Store 0 in a local variable or field? (3+ bytes)
– Instead we push 0, then pop into a local variable (4+ bytes)
20 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
BYTECODE PRIMER: FILE FORMAT
ClassFile {
u4 magic;
u2 minor_version;
u2 major_version;
u2 constant_pool_count;
cp_info constant_pool[constant_pool_count-1];
u2 access_flags;
u2 this_class;
u2 super_class;
u2 interfaces_count;
u2 interfaces[interfaces_count];
u2 fields_count;
field_info fields[fields_count];
u2 methods_count;
method_info methods[methods_count];
u2 attributes_count;
attribute_info attributes[attributes_count];
}
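As a small illustrative sketch (not part of Aparapi), the first few fields of that layout can be read directly with a DataInputStream:

import java.io.DataInputStream;
import java.io.FileInputStream;

public class ClassFileHeader {
   public static void main(String[] args) throws Exception {
      DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
      int magic = in.readInt();                          // u4 magic: 0xCAFEBABE
      int minor = in.readUnsignedShort();                // u2 minor_version
      int major = in.readUnsignedShort();                // u2 major_version
      int cpCount = in.readUnsignedShort();              // u2 constant_pool_count
      System.out.printf("magic=%08X version=%d.%d constant_pool_count=%d%n",
            magic, major, minor, cpCount);
      in.close();
   }
}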
21 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
CONSTANTPOOL
Each class has one ConstantPool
A list of constant values used to describe the class
Not all entries represent source artifacts
Pool is made up of one or more Entries containing:
– primitive types (int, float, double, long)
– Double and Longs take two slots
– Strings (UTF8)
– Class/Method/Field/Interface descriptors
These descriptors contain grouped references to other slots.
So a method descriptor will reference the slot containing the Class definition, the slot containing the name of the method (utf8/String), and the slot containing the signature (utf8/String), say “(Ljava/lang/String;I)[F”
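For example, a descriptor of “(Ljava/lang/String;I)[F” would describe a method shaped like this hypothetical one:

class DescriptorExample {
   // descriptor: (Ljava/lang/String;I)[F  (takes a String and an int, returns float[])
   float[] samples(String name, int count) {
      return new float[count];
   }
}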
22 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
ATTRIBUTES
The various sections of the classfile will contain sets of ‘attributes’
– Each attribute has a name, a length and a sequence of bytes (the value)
Think ‘HashMap<String, Pair<int, ?>>’
– Class/top level attributes include the name of the sourcefile, the generic signature information etc.
– Attributes can be nested
– Allows new features to be added to the classfile without violating the original spec
Field sections have lists of attributes
– Generic signature etc...
Method sections have lists of attributes
– Generic signature etc...
– One of the attributes of a Method is a ‘Code’ attribute
This contains the sequence of bytecodes representing the method body
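Continuing the DataInputStream sketch from the classfile-format slide (still hypothetical, reusing the same java.io imports, and assuming the stream is positioned at the start of an attribute), a single attribute_info entry can be read like this:

// Illustrative only: reads one attribute_info entry.
static byte[] readAttribute(java.io.DataInputStream in) throws java.io.IOException {
   int attributeNameIndex = in.readUnsignedShort();  // u2 index of the name in the constant pool
   int attributeLength = in.readInt();               // u4 length of the value that follows
   byte[] value = new byte[attributeLength];
   in.readFully(value);                              // raw bytes; a 'Code' attribute holds the method's bytecodes
   return value;
}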
23 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
A BYTECODE TOUR
public void run() {
   int total = 0;
   for (int i = 0; i < 100; i++) {
      if (i%10==0 && i%4==0) {
         total++;
      }
   }
   System.out.println(total);
}

javap –c MyClass

0: iconst_0
1: istore_1
2: iconst_0
3: istore_2
4: goto 26
7: iload_2
8: bipush 10
10: irem
11: ifne 23
14: iload_2
15: iconst_4
16: irem
17: ifne 23
20: iinc 1, 1
23: iinc 2, 1
26: iload_2
27: bipush 100
29: if_icmplt 7
32: getstatic #15; //Field java/lang/System.out:Ljava/io/PrintStream;
35: iload_1
36: invokevirtual #21; //Method java/io/PrintStream.println:(I)V
39: return
24 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
A BYTECODE TOUR …
(The following slides highlight regions of the same listing; the annotations are collected here.)
– Instructions 0-1 (iconst_0, istore_1) store 0 in var slot 1 (total = 0)
– Instructions 2-3 (iconst_0, istore_2) store 0 in var slot 2 (i = 0); executed once
– Instruction 4 (goto 26) branches to instruction #26; this loop-control layout is Oracle javac style (Eclipse javac places the condition at the top and an unconditional branch at the bottom)
– Instructions 7-20 are the “loop body”; 7-17 are the “condition control”, where the logical && operators result in ‘short circuit’ branches (the two ifne 23 tests)
– Instruction 20 (iinc 1, 1) is the “conditional body” (total++)
– Instruction 23 (iinc 2, 1) increments var slot 2 by 1 (i++)
– Instructions 26-29 (iload_2, bipush 100, if_icmplt 7) branch back to instruction 7 if var slot 2 < 100
– Instructions 32-36 load total and invoke System.out.println(total); 39 returns
35 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
LETS LOOK AT AN EXAMPLE
Lets ‘fold’ the following instructions
0: iload_2
1: iload_1
2: iadd
3: iconst_2
4: idiv
5: ireturn
– Start with an empty list, head and tail pointing to NULL
– iload_2 consumes 0 stack operands: create a new iload_2 and make it the tail of the list
– iload_1 consumes 0 stack operands: create a new iload_1 and add it to the tail of the existing list
– iadd consumes 2 stack operands: create a new iadd, remove the tail from the list (adjusting tail) and make it operand[1] of iadd, then remove the new tail and make it operand[0] of iadd
– iconst_2 consumes 0 stack operands: create a new iconst_2 and add it to the tail
– idiv consumes 2 stack operands: create a new idiv, remove the tail and make it operand[1] of idiv, then remove the new tail and make it operand[0] of idiv
– ireturn consumes 1 stack operand: create a new ireturn and move the existing tail to operand[0]
[Diagrams on these slides show the list collapsing into a single expression tree: ireturn(idiv(iadd(iload_2, iload_1), iconst_2))]
42 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
THE RESULT
After parsing we determine that this is a single return statement
For reference here is the source
public int mid(int _min, int _max){
return((_max+_min)/2);
}
When we apply this approach to more complex methods we end up with a linked list of instructions
which represent the ‘roots’ of expressions or statements.
Essentially we end up with a list comprised of conditionals, goto’s, assignments and return statements.
All branch targets are ‘roots’
From this we can fairly easily recognize larger level structures (for/while/if/else)
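A minimal sketch of that folding pass under simplified assumptions (a made-up Instruction type with a fixed operand count, not Aparapi's internal representation):

import java.util.ArrayList;
import java.util.List;

class Instruction {
   final String name;
   final Instruction[] operands;      // filled in as stack operands are folded
   Instruction(String name, int consumes) {
      this.name = name;
      this.operands = new Instruction[consumes];
   }
}

class Folder {
   // Fold a linear bytecode sequence into expression trees; the returned list
   // holds only the roots (statements, branches, returns).
   static List<Instruction> fold(List<Instruction> bytecode) {
      List<Instruction> roots = new ArrayList<Instruction>();
      for (Instruction op : bytecode) {
         for (int i = op.operands.length - 1; i >= 0; i--) {
            op.operands[i] = roots.remove(roots.size() - 1); // consume from the tail
         }
         roots.add(op);                                      // op becomes the new tail
      }
      return roots;
   }
}

Feeding it iload_2, iload_1, iadd, iconst_2, idiv, ireturn leaves a single root equivalent to ireturn(idiv(iadd(iload_2, iload_1), iconst_2)).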
43 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
COMPARING APARAPI TO EXISTING JAVA OPENCL/CUDA APIS
Task                                                        Existing GPU APIs   Aparapi
Learn OpenCL/CUDA                                           DIFFICULT           N/A
Locate potential data parallel opportunities                MEDIUM              MEDIUM
Refactor existing code/data structures                      MEDIUM              MEDIUM
Create Kernel code                                          DIFFICULT           EASY
Create code to coordinate execution and buffer transfers    MEDIUM              N/A
Identify GPU performance bottlenecks                        DIFFICULT           DIFFICULT
Iterate code/debug algorithm logic                          DIFFICULT           MEDIUM
Solve build/deployment issues                               DIFFICULT           MEDIUM
44 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
EXPRESSING DATA PARALLEL IN JAVA WITH APARAPI BY EXTENDING KERNEL
class SquareKernel extends Kernel{
   final int[] in, square;
   public SquareKernel(final int[] in, final int[] square){
      this.in = in;
      this.square = square;
   }
   @Override public void run(){
      int i=getGlobalID();
      square[i]=in[i]*in[i];
   }
}

For more complex scenarios the developer will likely explicitly extend the Kernel base class.

int []square = new int[size];
int []in = new int[size]; // populating in[0..size] omitted
SquareKernel squareKernel = new SquareKernel(in, square);
squareKernel.execute(size);
45 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
WITHOUT APARAPI: JAVA 'CLASSIC' MULTITHREADED SOLUTION
final int[] square = new int[size];
final int[] in = new int[size]; // populating in[0..size] omitted
final int cores = Runtime.getRuntime().availableProcessors();
final int chunk = size/cores; // lets assume size % cores == 0 !
Thread[] threads = new Thread[cores];
for(int core=0; core<cores; core++){
   final int start = core*chunk;
   (threads[core] = new Thread(new Runnable(){
      @Override public void run(){
         for (int i=start; i<start+chunk; i++)
            square[i] = in[i]*in[i];
      }
   })).start();
}
for(Thread thread:threads)
   thread.join();
46 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
WITHOUT APARAPI:… USING JAVA'S NEW EXECUTOR FRAMEWORK
final int[] square = new int[size];
final int[] in = new int[size]; // populating in[0..size] omitted
final int cores = Runtime.getRuntime().availableProcessors();
final int chunk = size/cores; // lets just assume size % cores == 0 ! :)
ExecutorService executor = Executors.newFixedThreadPool(cores);
for (int core = 0; core < cores; core++) {
   final int start = core*chunk;
   executor.execute(new Runnable(){
      public void run(){
         for (int i=start; i<start+chunk; i++)
            square[i] = in[i]*in[i];
      }
   });
}
executor.shutdown();
executor.awaitTermination(60L, TimeUnit.SECONDS);
47 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
EXPRESSING DATA PARALLEL IN JAVA WITH APARAPI
class SquareKernel extends Kernel{
   private int[] in, square;
   @Override public void run(){
      int i=getGlobalID();
      square[i]=in[i]*in[i];
   }
   public int[] square(int in[]){
      this.in = in;
      square = new int[in.length];
      execute(in.length);
      return(square);
   }
}

The base execute(n) method can be encapsulated to provide a more natural API.

int []in = new int[size]; // populating in[0..size] omitted
SquareKernel kernel = new SquareKernel();
int[] square = kernel.square(in);
48 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
EXPRESSING DATA PARALLELISM IN APARAPI USING PROPOSED JAVA 8 LAMBDAS
JSR 335 ‘Project Lambda’ proposes addition of ‘lambda’ expressions to Java 8.
http://cr.openjdk.java.net/~briangoetz/lambda/lambda-state-3.html
How we expect Aparapi will make use of the proposed extensions
final int [] square = new int[size];
final int [] in = new int[size]; // populating in[0..size] omitted
Kernel.execute(size, #{ i -> square[i] = in[i]*in[i]; });
49 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
HOW APARAPI EXECUTES ON THE GPU
At runtime Aparapi converts bytecode to OpenCL
OpenCL compiler converts OpenCL to device specific ISA for GPU/APU
GPU comprised of multiple SIMD (Single Instruction Multiple Data) cores
SIMDs benefit from having multiple execution streams operating the same instructions on different data
– Think single program counter shared across multiple threads
– All SIMDs executing at the same time (in lock-step)
new Kernel(){
@Override public void run(){
int i = getGlobalID();
int temp= in[i]*2;
temp = temp+1;
out[i] = temp;
}
}.execute(4)
i=0:                i=1:                i=2:                i=3:
int temp=in[0]*2    int temp=in[1]*2    int temp=in[2]*2    int temp=in[3]*2
temp=temp+1         temp=temp+1         temp=temp+1         temp=temp+1
out[0]=temp         out[1]=temp         out[2]=temp         out[3]=temp
50 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
DEVELOPER RESPONSIBLE FOR ENSURING PROBLEM IS DATA PARALLEL
Data dependencies can violate the ‘in any order’ guideline
for (int i=1; i< 100; i++){
out[i] = out[i-1]+in[i];
}
new Kernel(){ @Override public void run(){
int i = getGlobalID();
out[i] = out[i-1]+in[i];
}}.execute(100);
out[i-1] refers to a value resulting from a previous iteration which may not have been evaluated yet.
Mutating shared data problematic or can require use of atomic constructs
for (int i=0; i< 100; i++){
sum += in[i];
}
new Kernel(){ @Override public void run(){
int i = getGlobalID();
sum+= in[i];
}}.execute(100);
sum += x causes a race condition.
Almost certainly will not be atomic when translated to OpenCL
Actually not even atomic in multi-threaded Java
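As a Java-only illustration of that last point (not Aparapi code), making the accumulation safe across plain Java threads would itself require an atomic, e.g. AtomicInteger:

import java.util.concurrent.atomic.AtomicInteger;

public class AtomicSum {
   public static void main(String[] args) throws InterruptedException {
      final int[] in = new int[100];
      for (int i = 0; i < in.length; i++) in[i] = i;
      final AtomicInteger sum = new AtomicInteger(0);   // a plain 'int sum' would race
      Thread[] threads = new Thread[4];
      for (int t = 0; t < threads.length; t++) {
         final int start = t * 25;
         (threads[t] = new Thread(new Runnable(){
            @Override public void run(){
               for (int i = start; i < start + 25; i++) {
                  sum.addAndGet(in[i]);                 // atomic read-modify-write
               }
            }
         })).start();
      }
      for (Thread thread : threads) thread.join();
      System.out.println(sum.get());                    // always 4950
   }
}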
51 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
SOMETIMES WE CAN REFACTOR TO RECOVER SOME PARALLELISM
for (int i=0; i< 100; i++){
sum += in[i];
}
// naive conversion: every work item races on sum
new Kernel(){ @Override public void run(){
   int i = getGlobalID();
   sum += in[i];
}}.execute(100);

// refactored: compute partial sums in parallel, then reduce sequentially
for (int n=0; n<10; n++){
   for (int i=0; i<10; i++){
      partial[n] += data[n*10+i];
   }
}
for (int i=0; i< 10; i++){
   sum += partial[i];
}

new Kernel(){
   @Override public void run(){
      int n = getGlobalID();
      for (int i=0; i<10; i++)
         partial[n] += data[n*10+i];
   }
}.execute(10);
for (int i=0; i< 10; i++){
   sum += partial[i];
}
52 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
TRY TO AVOID BRANCHING WHEREVER POSSIBLE
SIMD performance impacted when code contains branches
– To stay in lockstep SIMDs must process both the 'then' and 'else' blocks
– Use result of 'condition' to predicate instructions (conditionally mask to a no-op)
new Kernel(){
@Override public void run(){
int i = getGlobalID();
int temp= in[i]*2;
if (i%2==0)
temp = temp+1;
else
temp = temp -1;
out[i] = temp;
}
}.execute(4)
i=0:                  i=1:                  i=2:                  i=3:
int temp=in[0]*2      int temp=in[1]*2      int temp=in[2]*2      int temp=in[3]*2
<c> = (0%2==0)        <c> = (1%2==0)        <c> = (2%2==0)        <c> = (3%2==0)
if <c>  temp=temp+1   if <c>  temp=temp+1   if <c>  temp=temp+1   if <c>  temp=temp+1
if <!c> temp=temp-1   if <!c> temp=temp-1   if <!c> temp=temp-1   if <!c> temp=temp-1
out[0]=temp           out[1]=temp           out[2]=temp           out[3]=temp
53 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
AVOIDING DIVERGENCE
Sometimes it is more efficient to process unnecessary data to avoid conditionals
for (int i=0; i< 65536; i++)
   if (i%64 == 0)
      out[i] = 0;
   else
      out[i] = in[i];

// process every element, then fix up every 64th element, avoiding the per-element conditional
for (int i=0; i< 65536; i++)
   out[i] = in[i];
for (int i=0; i< 65536; i+=64)
   out[i] = 0;

new Kernel(){ @Override public void run(){ int i=getGlobalID();
   out[i] = in[i];
}}.execute(65536);
new Kernel(){ @Override public void run(){ int i=getGlobalID();
   out[i*64] = 0;
}}.execute(65536/64);

We can often adjust the range and add offsets to avoid boundary checks

for (int i=0; i< 65536; i++)
   if (i!=0 && i!=65535)
      out[i] = (in[i-1]+in[i]+in[i+1])/3;

new Kernel(){ @Override public void run(){ int i=getGlobalID();
   out[i+1] = (in[i]+in[i+1]+in[i+2])/3;
}}.execute(65534);
54 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
CHARACTERISTICS OF IDEAL DATA PARALLEL WORKLOADS
Looping over large arrays of primitives
– 32/64 bit data types preferred
– Without data dependencies between iterations
– Each iteration contains sequential code (few branches)
Good balance between data size (low) and compute (high)
– Transfer of data to/from the GPU can be costly
– Trivial compute often not worth the transfer cost
– May still benefit, by freeing up CPU for other work
– Order of iteration unimportant
[Chart: compute (y-axis) vs. data size (x-axis); the ideal region is high compute on data sizes that fit within GPU memory]
55 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
APARAPI NBODY DEMO
NBody is a common OpenCL/CUDA benchmark/demo
Determine the positions of N bodies, calculating the gravitational effect that each body has on every other body
– C++/C version shipped with AMD Stream SDK
Essentially an N^2 problem
– If we double the number of bodies, we perform four times the positional calculations
The following charts compare
– Naïve Java version (single loop)
– Aparapi version using Java Thread Pool
– Aparapi version running on the GPU (ATI Radeon ™ 5870)
56 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
APARAPI NBODY DEMO
NBODY DEMO
57 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
APARAPI NBODY PERFORMANCE (FRAME RATE VS. NUMBER OF BODIES)
[Chart: frames per second (y-axis, 0 to 450) vs. # of bodies (x-axis, 1k to 128k) for Java Single Thread, Aparapi Thread Pool and Aparapi GPU]
58 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
APARAPI NBODY PERFORMANCE: CALCULATIONS PER SEC VS. NUMBER OF BODIES
[Chart: position calculations per µs (y-axis, 0 to 6000) vs. # of bodies (x-axis, 1k to 128k) for Java Single Thread, Aparapi Thread Pool and Aparapi GPU]
59 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
APARAPI MANDEL DEMO
MANDEL DEMO
60 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
APARAPI EXTENSIONS FOR ITERATING OVER KERNEL EXECUTIONS
Added explicit buffer management for algorithms which iterate over kernel executions
int [] buffer = new int[HUGE];
int [] unusedBuffer = new int[HUGE];
Kernel k = new Kernel(){
@Override public void run(){
// mutates buffer contents
// no reference to unusedBuffer
}
};
for (int i=0; i< 1000; i++){
//Transfer buffer to GPU
k.execute(HUGE);
//Transfer buffer from GPU
}
Aparapi can/does analyze kernel methods and generates optimized host buffer transfer requests at runtime.
Aparapi has no knowledge of buffer accesses from the enclosing loop, so it MUST be conservative and assume that the buffer is modified between invocations.
This results in unnecessary buffer copies (in this case 1000 of each) to and from the device.
61 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
APARAPI EXTENSIONS FOR ITERATING OVER KERNEL EXECUTIONS
– With explicit buffer management we can refactor the code to this
int [] buffer = new int[HUGE];
Kernel k = new Kernel(){
   @Override public void run(){
      // mutates buffer contents
   }
};
k.setExplicit();
k.put(buffer);
for (int i=0; i< 1000; i++){
   k.execute(HUGE);
}
k.get(buffer);

The developer can take control and coordinate when/if transfers take place.
62 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
PROPOSED APARAPI ENHANCEMENTS: ALLOW ACCESS TO ARRAYS OF OBJECTS
Allow automatic extraction of buffers from arrays/collections of objects.
– A Java developer implementing 'nbody' problem would probably define a class for each particle
public class Particle{
int x, y, z;
String name;
Color color;
// other Particle specific state
}
– .. and would expect to be able to create a Kernel to calculate positions for an array of particles
Particle[] particles = new Particle[1024];
ParticleKernel kernel = new ParticleKernel(particles);
while(displaying){
kernel.execute(particles.length);
//update display of particles
}
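ParticleKernel is not spelled out here; a hypothetical shape, assuming object-array access were supported, might be:

class ParticleKernel extends Kernel {
   final Particle[] particles;
   ParticleKernel(Particle[] particles) {
      this.particles = particles;
   }
   @Override public void run() {
      int i = getGlobalID();
      // update particles[i].x, particles[i].y, particles[i].z from the other bodies
   }
}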
– Unfortunately Aparapi would currently fail to convert the above kernel to OpenCL and would fall back to using a Thread Pool.
63 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
PROPOSED APARAPI ENHANCEMENTS: ALLOW ACCESS TO ARRAYS OF OBJECTS
Aparapi currently not ‘Object Friendly’ and the ideal code will need to be refactored to use primitive arrays.
int[] x = new int[1024];
int[] y = new int[1024];
int[] z = new int[1024];
Color[] color = new Color[1024];
String[] name = new String[1024];
Positioner.position(x, y, z);
64 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
PROPOSED APARAPI ENHANCEMENTS: ALLOW ACCESS TO ARRAYS OF OBJECTS
In our initial Open Source release we intend to allow arrays of objects to be accessed.
At runtime Aparapi will automatically copy any accessed fields into temporary primitive arrays.
The OpenCL kernel will be passed these primitive copies.
On completion content from the primitive buffers will be pushed back into the original objects
This will allow us to use any array based collection (ArrayList/Vector) from kernels
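A hedged sketch of that gather/scatter idea (the helper name is made up; this is not Aparapi's implementation):

// Illustrative only: copy accessed fields into primitive arrays, run, then copy back.
static void executeOnPrimitives(Particle[] particles) {
   int n = particles.length;
   int[] x = new int[n], y = new int[n], z = new int[n];
   for (int i = 0; i < n; i++) {   // gather accessed fields into temporary primitive arrays
      x[i] = particles[i].x;
      y[i] = particles[i].y;
      z[i] = particles[i].z;
   }
   // ... the OpenCL kernel is handed x[], y[] and z[] ...
   for (int i = 0; i < n; i++) {   // scatter results back into the objects on completion
      particles[i].x = x[i];
      particles[i].y = y[i];
      particles[i].z = z[i];
   }
}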
65 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
FUTURE WORK
Sync with ‘project lambda’ (Java 8) and allow kernels to be expressed as lambda expressions.
More work on automatically extracting buffer transfers across object collections
Hand more explicit control to ‘power users’
– Explicit buffer (or sub buffer) transfers
– Expose local memory and barriers
Evaluating Open Source
– Aiming for Q3 Open Source release of Aparapi
– License TBD, probably BSD variant
– Need to decide where to host
– http://code.google.com , http://sourceforge.net/ or http://www.java.net
– Enable and encourage community contributions
66 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
SIMILAR INTERESTING/RELATED WORK
Tidepowerd
– Offers a similar solution for .NET
– NVIDIA cards only at present
http://www.tidepowerd.com/
java-gpu
– An open source project for extracting kernels from nested loops
– Extracts code structure from bytecode
– Creates CUDA behind the scenes
http://code.google.com/p/java-gpu/
GRAPHITE-OpenCL: Generate OpenCL Code from Parallel Loops (for GCC)
– Auto detect data parallel loops in gcc compiler and generate OpenCL + host code for the loop
http://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=2010-GCC-Summit-Proceedings.pdf
67 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
SUMMARY
APUs/GPUs offer unprecedented performance for the appropriate workload
Don’t assume everything can/should execute on the APU/GPU
Look for ‘Islands of parallel in a sea of sequential’
Aparapi provides an ideal framework for executing data-parallel code on the GPU
Please participate in the upcoming Aparapi Open Source community
Download and experiment with Aparapi
– http://developer.amd.com/aparapi
68 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions
and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF
EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in
this presentation are for informational purposes only and may be trademarks of their respective owners.
© 2011 Advanced Micro Devices, Inc.
69 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
70 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
A SEQUENTIAL VERSION OF KERNEL BASE CLASS
public abstract class Kernel {
   private int gid = 0;
   protected int getGlobalID(){
      return(gid);
   }
   public abstract void run();
   public void execute(int size){
      for (gid=0; gid<size; gid++)
         run();
   }
}

new Kernel(){
   @Override public void run(){
      int i = getGlobalID();
      square[i] = in[i]*in[i];
   }
}.execute(size);
72 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011