APARAPI
Using GPU/APUs to accelerate Java Workloads
Gary Frost
AMD
PMTS Java Runtime Team
1 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
AGENDA
The age of heterogeneous computing is here
The supercomputer in your desktop/laptop
Why Java?
Current GPU programming options for Java developers
Are developers likely to adopt existing Java OpenCL/CUDA bindings?
Aparapi
– What it does
– How it does it
Performance
Examples/Demos
Challenges
Future work
QA
2 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
THE AGE OF HETEROGENEOUS COMPUTE IS HERE
GPUs originally developed to accelerate graphics operations
Early adopters repurposed GPU for ‘general compute’ by performing ‘unnatural acts’ with shader APIs
OpenGL allows shaders/textures to be compiled and executed via extensions
OpenCL/GLSL/CUDA standardizes/formalizes how to express GPU compute and simplifies host programming.
New programming models emerging and lowering/removing barriers to adoption
3 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
THE SUPERCOMPUTER IN YOUR DESKTOP
Some interesting tidbits from http://www.top500.org/
– November 2000
“ASCI White is new #1 with 4.9 TFlops on the Linpack"
http://www.top500.org/lists/2000/11
– November 2002
“3.2 TFlops are needed to enter the top 10”
http://www.top500.org/lists/2002/11
– May 2011
AMD Radeon 6990: 5.1 TFlops single-precision performance
http://www.amd.com/us/products/desktop/graphics/amd-radeon-hd-6000/hd-6990/Pages/amd-radeon-hd-6990-overview.aspx#3
4 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
WHY JAVA?
One of the most widely used programming languages
– http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
Established in domains likely to benefit from heterogeneous compute
– BigData, Search, Hadoop+Pig, Finance, GIS, Oil & Gas
Even if applications are not implemented in Java, they may still run on the Java Virtual Machine (JVM)
– JRuby, JPython, Scala, Clojure, Quercus(PHP)
Acts as a good proxy/indicator for enablement of other runtimes/interpreters
– JavaScript, Flash, .NET, PHP, Python, Ruby, Dalvik?
[Pie chart: TIOBE index language share for Java, C, C++, C#, PHP, Objective C, Python and others]
5 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
GPU PROGRAMMING OPTIONS FOR JAVA PROGRAMMERS
Most Java GPU APIs require coding the kernel in a domain-specific language (OpenCL, GLSL, or CUDA)
// OpenCL kernel code
__kernel void squares(__global const float *in, __global float *out){
int gid = get_global_id(0);
out[gid] = in[gid] * in[gid];
}
As well as writing the Java ‘host’ CPU-based code to:
– Select/initialize execution device
– Compile 'Kernel' for a selected device
– Allocate or define memory buffers for args /parameters
– Enqueue/Send arg buffers to device
– Execute the kernel
– Read results buffers back from the device
6 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
ARE DEVELOPERS LIKELY TO ADOPT EXISTING JAVA OPENCL/CUDA BINDINGS?
Some will
– Early adopters
– Prepared to learn new languages
– Motivated to squeeze all the performance they can from available compute devices
– Prepared to implement algorithms both in Java and in CUDA/OpenCL
Most won’t
– OpenCL/CUDA C99 heritage likely to disenfranchise Java developers
Either walked away from C/C++ a while back or possibly never encountered it at all (due to CS education shifts)
OpenCL/CUDA expose a low-level memory model that is alien to developers who learned to trust the JVM to ‘do the right thing’
Who pays for retraining of Java developers?
– Notion of writing code twice (once for Java execution, again for the GPU/APU) is alien
Where’s my ‘write once, run anywhere’?
7 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
USING JOCL OPENCL JAVA BINDINGS
import static org.jocl.CL.*;
import org.jocl.*;

public class Sample {
   public static void main(String args[]) {
      // Create input and output data
      int size = 10;
      float inArr[] = new float[size];
      float outArr[] = new float[size];
      for (int i=0; i<size; i++) {
         inArr[i] = i;
      }
      Pointer in = Pointer.to(inArr);
      Pointer out = Pointer.to(outArr);

      // Obtain the platform IDs and initialize the context properties
      cl_platform_id platforms[] = new cl_platform_id[1];
      clGetPlatformIDs(1, platforms, null);
      cl_context_properties contextProperties = new cl_context_properties();
      contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);

      // Create an OpenCL context
      cl_context context = clCreateContextFromType(contextProperties,
         CL_DEVICE_TYPE_CPU, null, null, null);

      // Obtain the cl_device_id for the first device
      cl_device_id devices[] = new cl_device_id[1];
      clGetContextInfo(context, CL_CONTEXT_DEVICES,
         Sizeof.cl_device_id, Pointer.to(devices), null);

      // Create a command-queue
      cl_command_queue commandQueue =
         clCreateCommandQueue(context, devices[0], 0, null);

      // Allocate the memory objects for the input and output data
      cl_mem inMem = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
         Sizeof.cl_float * size, in, null);
      cl_mem outMem = clCreateBuffer(context, CL_MEM_READ_WRITE,
         Sizeof.cl_float * size, null, null);

      // Create the program from the source code
      cl_program program = clCreateProgramWithSource(context, 1, new String[]{
         "__kernel void sampleKernel("      +
         "      __global const float *in," +
         "      __global float *out){"     +
         "   int gid = get_global_id(0);"  +
         "   out[gid] = in[gid] * in[gid];"+
         "}"
      }, null, null);

      // Build the program
      clBuildProgram(program, 0, null, null, null, null);

      // Create and extract a reference to the kernel
      cl_kernel kernel = clCreateKernel(program, "sampleKernel", null);

      // Set the arguments for the kernel
      clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(inMem));
      clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(outMem));

      // Execute the kernel: the parallel equivalent of
      //    for (int i=0; i<size; i++) { outArr[i] = inArr[i] * inArr[i]; }
      clEnqueueNDRangeKernel(commandQueue, kernel,
         1, null, new long[]{size}, null, 0, null, null);

      // Read the output data back into outArr
      clEnqueueReadBuffer(commandQueue, outMem, CL_TRUE, 0,
         size * Sizeof.cl_float, out, 0, null, null);

      // Release kernel, program, and memory objects
      clReleaseMemObject(inMem);
      clReleaseMemObject(outMem);
      clReleaseKernel(kernel);
      clReleaseProgram(program);
      clReleaseCommandQueue(commandQueue);
      clReleaseContext(context);

      // Print the results
      for (float f : outArr) {
         System.out.printf("%5.2f, ", f);
      }
   }
}
8 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
WHAT IS APARAPI AND HOW IS IT DIFFERENT FROM EXISTING JAVA GPU BINDINGS?
An API for expressing data parallel workloads in Java
– Developer extends a Kernel base class and compiles to Java bytecode using existing tool chain
A runtime component capable of converting Java bytecode to OpenCL for execution on a GPU, or of executing on the host via a Java thread pool
[Flow diagram]
class MyKernel extends Kernel{
   @Override public void run(){
   }
}
is compiled with javac to MyKernel.class; at runtime Aparapi asks: does the platform support OpenCL? can the bytecode be converted to OpenCL?
– Yes: convert the bytecode to OpenCL and execute the OpenCL Kernel on the GPU
– No: execute the Kernel using a Java thread pool
13 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
CONSIDER AN EMBARRASSINGLY PARALLEL USE CASE
We will convert the ‘square example’ (embarrassingly parallel) to Aparapi
– Calculate square[0..size] for a given input in[0..size]
final int[] square= new int[size];
final int[] in = new int[size];
// populating in[0..size] omitted
for (int i=0; i<size; i++){
   square[i] = in[i] * in[i];
}
// ideally: parallel-for (int i=0; i<size; i++){ square[i] = in[i] * in[i]; }
Ideally we could indicate that the body of the loop need not be executed sequentially.
It would be great if we could add a parallel-for construct to the Java language.
But we want to avoid modifying the language, compiler or toolchain.
14 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
REFACTORING OUR EXAMPLE TO USE APARAPI
final int[] square= new int[size];
final int[] in = new int[size]; // populating in[0..size] omitted
for (int i=0; i<size; i++){
square[i] = in[i] * in[i];
}
new Kernel(){
@Override public void run(){
int i = getGlobalID();
square[i] = in[i]*in[i];
}
}.execute(size);
15 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
EXPRESSING DATA PARALLEL IN APARAPI
What happens when we call execute(n)?

Kernel kernel = new Kernel(){
   @Override public void run(){
      int i=getGlobalID();
      square[i]=in[i]*in[i];
   }
};
kernel.execute(size);

[Decision flow for execute(size)]
– Is this the first execution? If so, decide: do we have OpenCL? does the platform support OpenCL? can the bytecode be converted to OpenCL?
– If all answers are yes: convert the bytecode to OpenCL and execute the OpenCL Kernel on the GPU
– If any answer is no: execute the Kernel using a Java thread pool
16 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
FIRST CALL OF KERNEL.EXECUTE(SIZE) WHEN OPENCL/GPU IS AVAILABLE
– Reload classfile via classloader extracting methods and fields
– For run() method and all methods reachable from run() method …
Convert method bytecode to an IR (expression trees, conditional loop constructs)
– More on how we do this later…
Maintain a list of field accesses and types (read/write/read+write)
– Class fields ultimately represent args passed to OpenCL
– Create and Compile OpenCL for all reachable methods
– Lock accessed primitive arrays so the garbage collector doesn’t move them around
– For each field/primitive that is read by the generated code enqueue write to GPU and/or set arg
– Execute OpenCL Kernel
– For each primitive array that is written by the generated code enqueue read from GPU
– Unlock accessed primitive arrays
– Results now available in Java application
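As a rough sketch of the first bullet above (a hypothetical helper, not Aparapi's actual code), the raw bytes of the kernel's class can be re-read through its classloader and handed to the bytecode parser:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

public class ClassBytes {
   // Fetch the raw .class bytes for a class through its own classloader.
   static byte[] of(Class<?> clazz) throws Exception {
      String resource = clazz.getName().replace('.', '/') + ".class";
      InputStream is = clazz.getClassLoader().getResourceAsStream(resource);
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      for (int n = is.read(buf); n != -1; n = is.read(buf)) {
         bytes.write(buf, 0, n);
      }
      is.close();
      return bytes.toByteArray();  // ready to be parsed for methods, fields and bytecode
   }
}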
17 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
SUBSEQUENT CALLS OF KERNEL.EXECUTE(SIZE) WHEN OPENCL/GPU IS AVAILABLE
– Lock accessed primitive arrays so the garbage collector doesn’t move them around
– For each field/primitive that is read by the generated code enqueue write to GPU and/or set arg
– Execute OpenCL Kernel
– For each primitive array that is written by the generated code enqueue read from GPU
– Unlock accessed primitive arrays
– Results now available in Java application
18 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
KERNEL.EXECUTE(SIZE) WHEN OPENCL/GPU IS NOT AN OPTION
– Create a thread pool
One thread per core.
– Clone Kernel one per thread
Each Kernel holds state for one thread
– Each thread iterates 0..(size/threads)
Updates globalId, localId, groupSize, globalSize, etc. state on its Kernel instance.
Executes run() method on Kernel instance.
– Wait for all threads to complete
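A hedged sketch of this fallback scheme, using a made-up SeqKernel stand-in rather than Aparapi's real Kernel class:

// Illustrative only: one clone per thread, each clone walks its own range of global ids.
abstract class SeqKernel implements Cloneable {
   int globalId;
   int getGlobalID() { return globalId; }
   public abstract void run();
   public void execute(int size) {
      int cores = Runtime.getRuntime().availableProcessors();
      int chunk = size / cores;                          // assume size % cores == 0 for brevity
      Thread[] threads = new Thread[cores];
      for (int t = 0; t < cores; t++) {
         final SeqKernel kernelClone;
         try { kernelClone = (SeqKernel) clone(); }      // per-thread copy holding its own state
         catch (CloneNotSupportedException e) { throw new RuntimeException(e); }
         final int start = t * chunk, end = start + chunk;
         (threads[t] = new Thread(new Runnable() {
            @Override public void run() {
               for (int id = start; id < end; id++) {
                  kernelClone.globalId = id;             // update the clone's globalId
                  kernelClone.run();                     // execute the kernel body for this id
               }
            }
         })).start();
      }
      for (Thread thread : threads) {
         try { thread.join(); }                          // wait for all threads to complete
         catch (InterruptedException e) { Thread.currentThread().interrupt(); }
      }
   }
}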
19 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
BYTECODE PRIMER
Variable length instructions
Access to immediate values
– Mostly constant pool and local variable table indexes
Stack based execution
– IMUL: multiply two integers from stack and push result
– …, <op2>, <op1> => [ IMUL ] => …, <op1*op2>
– Sometimes the types and number of operands cannot be determined from the bytecode alone. We need to decode from the ConstantPool
Some surprising omissions
– Store 0 in a local variable or field? (3+ bytes)
– Instead we push 0, then pop into a local variable (4+ bytes)
20 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
BYTECODE PRIMER: FILE FORMAT
ClassFile {
u4 magic;
u2 minor_version;
u2 major_version;
u2 constant_pool_count;
cp_info constant_pool[constant_pool_count-1];
u2 access_flags;
u2 this_class;
u2 super_class;
u2 interfaces_count;
u2 interfaces[interfaces_count];
u2 fields_count;
field_info fields[fields_count];
u2 methods_count;
method_info methods[methods_count];
u2 attributes_count;
attribute_info attributes[attributes_count];
}
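As a small illustrative sketch (not part of Aparapi), the first few fields of that layout can be read directly with a DataInputStream:

import java.io.DataInputStream;
import java.io.FileInputStream;

public class ClassFileHeader {
   public static void main(String[] args) throws Exception {
      DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
      int magic = in.readInt();                          // u4 magic: 0xCAFEBABE
      int minor = in.readUnsignedShort();                // u2 minor_version
      int major = in.readUnsignedShort();                // u2 major_version
      int cpCount = in.readUnsignedShort();              // u2 constant_pool_count
      System.out.printf("magic=%08X version=%d.%d constant_pool_count=%d%n",
            magic, major, minor, cpCount);
      in.close();
   }
}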
21 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
CONSTANTPOOL
Each class has one ConstantPool
A list of constant values used to describe the class
Not all entries represent source artifacts
Pool is made up of one or more Entries containing:
– primitive types (int, float, double, long)
– Double and Longs take two slots
– Strings (UTF8)
– Class/Method/Field/Interface descriptors
These descriptors contain grouped references to other slots.
So a method descriptor will reference the slot containing the Class definition, the slot containing the name of the method (utf8/String), and the slot containing the signature (utf8/String), say “(Ljava/lang/String;I)[F”
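For example, a descriptor of “(Ljava/lang/String;I)[F” would describe a method shaped like this hypothetical one:

class DescriptorExample {
   // descriptor: (Ljava/lang/String;I)[F  (takes a String and an int, returns float[])
   float[] samples(String name, int count) {
      return new float[count];
   }
}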
22 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
ATTRIBUTES
The various sections of the classfile will contain sets of ‘attributes’
– Each attribute has a name, a length and a sequence of bytes (the value)
Think ‘HashMap<String, Pair<int, ?>>’
– Class/top level attributes include the name of the sourcefile, the generic signature information etc.
– Attributes can be nested
– Allows new features to be added to the classfile without violating the original spec
Field sections have lists of attributes
– Generic signature etc...
Method sections have lists of attributes
– Generic signature etc...
– One of the attributes of a Method is a ‘Code’ attribute
This contains the sequence of bytecodes representing the method body
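Continuing the DataInputStream sketch from the classfile-format slide (still hypothetical, reusing the same java.io imports, and assuming the stream is positioned at the start of an attribute), a single attribute_info entry can be read like this:

// Illustrative only: reads one attribute_info entry.
static byte[] readAttribute(java.io.DataInputStream in) throws java.io.IOException {
   int attributeNameIndex = in.readUnsignedShort();  // u2 index of the name in the constant pool
   int attributeLength = in.readInt();               // u4 length of the value that follows
   byte[] value = new byte[attributeLength];
   in.readFully(value);                              // raw bytes; a 'Code' attribute holds the method's bytecodes
   return value;
}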
23 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
A BYTECODE TOUR
public void run() {
   int total = 0;
   for (int i = 0; i < 100; i++) {
      if (i%10==0 && i%4==0) {
         total++;
      }
   }
   System.out.println(total);
}

javap –c MyClass

0: iconst_0
1: istore_1
2: iconst_0
3: istore_2
4: goto 26
7: iload_2
8: bipush 10
10: irem
11: ifne 23
14: iload_2
15: iconst_4
16: irem
17: ifne 23
20: iinc 1, 1
23: iinc 2, 1
26: iload_2
27: bipush 100
29: if_icmplt 7
32: getstatic #15; //Field java/lang/System.out:Ljava/io/PrintStream;
35: iload_1
36: invokevirtual #21; //Method java/io/PrintStream.println:(I)V
39: return
24 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
A BYTECODE TOUR …
(The following slides highlight regions of the same listing; the annotations are collected here.)
– Instructions 0-1 (iconst_0, istore_1) store 0 in var slot 1 (total = 0)
– Instructions 2-3 (iconst_0, istore_2) store 0 in var slot 2 (i = 0); executed once
– Instruction 4 (goto 26) branches to instruction #26; this loop-control layout is Oracle javac style (Eclipse javac places the condition at the top and an unconditional branch at the bottom)
– Instructions 7-20 are the “loop body”; 7-17 are the “condition control”, where the logical && operators result in ‘short circuit’ branches (the two ifne 23 tests)
– Instruction 20 (iinc 1, 1) is the “conditional body” (total++)
– Instruction 23 (iinc 2, 1) increments var slot 2 by 1 (i++)
– Instructions 26-29 (iload_2, bipush 100, if_icmplt 7) branch back to instruction 7 if var slot 2 < 100
– Instructions 32-36 load total and invoke System.out.println(total); 39 returns
35 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
LETS LOOK AT AN EXAMPLE
Lets ‘fold’ the following instructions
0: iload_2
1: iload_1
2: iadd
3: iconst_2
4: idiv
5: ireturn
– Start with an empty list, head and tail pointing to NULL
– iload_2 consumes 0 stack operands: create a new iload_2 and make it the tail of the list
– iload_1 consumes 0 stack operands: create a new iload_1 and add it to the tail of the existing list
– iadd consumes 2 stack operands: create a new iadd, remove the tail from the list (adjusting tail) and make it operand[1] of iadd, then remove the new tail and make it operand[0] of iadd
– iconst_2 consumes 0 stack operands: create a new iconst_2 and add it to the tail
– idiv consumes 2 stack operands: create a new idiv, remove the tail and make it operand[1] of idiv, then remove the new tail and make it operand[0] of idiv
– ireturn consumes 1 stack operand: create a new ireturn and move the existing tail to operand[0]
[Diagrams on these slides show the list collapsing into a single expression tree: ireturn(idiv(iadd(iload_2, iload_1), iconst_2))]
42 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
THE RESULT
After parsing we determine that this is a single return statement
For reference here is the source
public int mid(int _min, int _max){
return((_max+_min)/2);
}
When we apply this approach to more complex methods we end up with a linked list of instructions
which represent the ‘roots’ of expressions or statements.
Essentially we end up with a list comprised of conditionals, goto’s, assignments and return statements.
All branch targets are ‘roots’
From this we can fairly easily recognize larger level structures (for/while/if/else)
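A minimal sketch of that folding pass under simplified assumptions (a made-up Instruction type with a fixed operand count, not Aparapi's internal representation):

import java.util.ArrayList;
import java.util.List;

class Instruction {
   final String name;
   final Instruction[] operands;      // filled in as stack operands are folded
   Instruction(String name, int consumes) {
      this.name = name;
      this.operands = new Instruction[consumes];
   }
}

class Folder {
   // Fold a linear bytecode sequence into expression trees; the returned list
   // holds only the roots (statements, branches, returns).
   static List<Instruction> fold(List<Instruction> bytecode) {
      List<Instruction> roots = new ArrayList<Instruction>();
      for (Instruction op : bytecode) {
         for (int i = op.operands.length - 1; i >= 0; i--) {
            op.operands[i] = roots.remove(roots.size() - 1); // consume from the tail
         }
         roots.add(op);                                      // op becomes the new tail
      }
      return roots;
   }
}

Feeding it iload_2, iload_1, iadd, iconst_2, idiv, ireturn leaves a single root equivalent to ireturn(idiv(iadd(iload_2, iload_1), iconst_2)).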
43 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
COMPARING APARAPI TO EXISTING JAVA OPENCL/CUDA APIS
Task                                                        Existing GPU APIs   Aparapi
Learn OpenCL/CUDA                                           DIFFICULT           N/A
Locate potential data parallel opportunities                MEDIUM              MEDIUM
Refactor existing code/data structures                      MEDIUM              MEDIUM
Create Kernel code                                          DIFFICULT           EASY
Create code to coordinate execution and buffer transfers    MEDIUM              N/A
Identify GPU performance bottlenecks                        DIFFICULT           DIFFICULT
Iterate code/debug algorithm logic                          DIFFICULT           MEDIUM
Solve build/deployment issues                               DIFFICULT           MEDIUM
44 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
EXPRESSING DATA PARALLEL IN JAVA WITH APARAPI BY EXTENDING KERNEL
class SquareKernel extends Kernel{
   final int[] in, square;
   public SquareKernel(final int[] in, final int[] square){
      this.in = in;
      this.square = square;
   }
   @Override public void run(){
      int i=getGlobalID();
      square[i]=in[i]*in[i];
   }
}

For more complex scenarios the developer will likely explicitly extend the Kernel base class.

int []square = new int[size];
int []in = new int[size]; // populating in[0..size] omitted
SquareKernel squareKernel = new SquareKernel(in, square);
squareKernel.execute(size);
45 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
WITHOUT APARAPI: JAVA 'CLASSIC' MULTITHREADED SOLUTION
final int[] square = new int[size];
final int[] in = new int[size]; // populating in[0..size] omitted
final int cores = Runtime.getRuntime().availableProcessors();
final int chunk = size/cores; // lets assume size % cores == 0 !
Thread[] threads = new Thread[cores];
for(int core=0; core<cores; core++){
   final int start = core*chunk;
   (threads[core] = new Thread(new Runnable(){
      @Override public void run(){
         for (int i=start; i<start+chunk; i++)
            square[i] = in[i]*in[i];
      }
   })).start();
}
for(Thread thread:threads)
   thread.join();
46 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
WITHOUT APARAPI:… USING JAVA'S NEW EXECUTOR FRAMEWORK
final int[] square = new int[size];
final int[] in = new int[size]; // populating in[0..size] omitted
final int cores = Runtime.getRuntime().availableProcessors();
final int chunk = size/cores; // lets just assume size % cores == 0 ! :)
ExecutorService executor = Executors.newFixedThreadPool(cores);
for (int core = 0; core < cores; core++) {
   final int start = core*chunk;
   executor.execute(new Runnable(){
      public void run(){
         for (int i=start; i<start+chunk; i++)
            square[i] = in[i]*in[i];
      }
   });
}
executor.shutdown();
executor.awaitTermination(60L, TimeUnit.SECONDS);
47 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
EXPRESSING DATA PARALLEL IN JAVA WITH APARAPI
class SquareKernel extends Kernel{
   private int[] in, square;
   @Override public void run(){
      int i=getGlobalID();
      square[i]=in[i]*in[i];
   }
   public int[] square(int in[]){
      this.in = in;
      square = new int[in.length];
      execute(in.length);
      return(square);
   }
}

The base execute(n) method can be encapsulated to provide a more natural API.

int []in = new int[size]; // populating in[0..size] omitted
SquareKernel kernel = new SquareKernel();
int[] square = kernel.square(in);
48 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
EXPRESSING DATA PARALLELISM IN APARAPI USING PROPOSED JAVA 8 LAMBDAS
JSR 335 ‘Project Lambda’ proposes addition of ‘lambda’ expressions to Java 8.
http://cr.openjdk.java.net/~briangoetz/lambda/lambda-state-3.html
How we expect Aparapi will make use of the proposed extensions
final int [] square = new int[size];
final int [] in = new int[size]; // populating in[0..size] omitted
Kernel.execute(size, #{ i -> square[i] = in[i]*in[i]; });
49 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
HOW APARAPI EXECUTES ON THE GPU
At runtime Aparapi converts bytecode to OpenCL
OpenCL compiler converts OpenCL to device specific ISA for GPU/APU
GPU comprised of multiple SIMD (Single Instruction Multiple Data) cores
SIMDs benefit from having multiple execution streams operating the same instructions on different data
– Think single program counter shared across multiple threads
– All SIMDs executing at the same time (in lock-step)
new Kernel(){
@Override public void run(){
int i = getGlobalID();
int temp= in[i]*2;
temp = temp+1;
out[i] = temp;
}
}.execute(4)
i=0:                i=1:                i=2:                i=3:
int temp=in[0]*2    int temp=in[1]*2    int temp=in[2]*2    int temp=in[3]*2
temp=temp+1         temp=temp+1         temp=temp+1         temp=temp+1
out[0]=temp         out[1]=temp         out[2]=temp         out[3]=temp
50 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
DEVELOPER RESPONSIBLE FOR ENSURING PROBLEM IS DATA PARALLEL
Data dependencies can violate the ‘in any order’ guideline
for (int i=1; i< 100; i++){
out[i] = out[i-1]+in[i];
}
new Kernel(){ @Override public void run(){
int i = getGlobalID();
out[i] = out[i-1]+in[i];
}}.execute(100);
out[i-1] refers to a value resulting from a previous iteration which may not have been evaluated yet.
Mutating shared data problematic or can require use of atomic constructs
for (int i=0; i< 100; i++){
sum += in[i];
}
new Kernel(){ @Override public void run(){
int i = getGlobalID();
sum+= in[i];
}}.execute(100);
sum += x causes a race condition.
Almost certainly will not be atomic when translated to OpenCL
Actually not even atomic in multi-threaded Java
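As a Java-only illustration of that last point (not Aparapi code), making the accumulation safe across plain Java threads would itself require an atomic, e.g. AtomicInteger:

import java.util.concurrent.atomic.AtomicInteger;

public class AtomicSum {
   public static void main(String[] args) throws InterruptedException {
      final int[] in = new int[100];
      for (int i = 0; i < in.length; i++) in[i] = i;
      final AtomicInteger sum = new AtomicInteger(0);   // a plain 'int sum' would race
      Thread[] threads = new Thread[4];
      for (int t = 0; t < threads.length; t++) {
         final int start = t * 25;
         (threads[t] = new Thread(new Runnable(){
            @Override public void run(){
               for (int i = start; i < start + 25; i++) {
                  sum.addAndGet(in[i]);                 // atomic read-modify-write
               }
            }
         })).start();
      }
      for (Thread thread : threads) thread.join();
      System.out.println(sum.get());                    // always 4950
   }
}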
51 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
SOMETIMES WE CAN REFACTOR TO RECOVER SOME PARALLELISM
for (int i=0; i< 100; i++){
sum += in[i];
}
// naive conversion: every work item races on sum
new Kernel(){ @Override public void run(){
   int i = getGlobalID();
   sum += in[i];
}}.execute(100);

// refactored: compute partial sums in parallel, then reduce sequentially
for (int n=0; n<10; n++){
   for (int i=0; i<10; i++){
      partial[n] += data[n*10+i];
   }
}
for (int i=0; i< 10; i++){
   sum += partial[i];
}

new Kernel(){
   @Override public void run(){
      int n = getGlobalID();
      for (int i=0; i<10; i++)
         partial[n] += data[n*10+i];
   }
}.execute(10);
for (int i=0; i< 10; i++){
   sum += partial[i];
}
52 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
TRY TO AVOID BRANCHING WHEREVER POSSIBLE
SIMD performance impacted when code contains branches
– To stay in lockstep SIMDs must process both the 'then' and 'else' blocks
– Use result of 'condition' to predicate instructions (conditionally mask to a no-op)
new Kernel(){
@Override public void run(){
int i = getGlobalID();
int temp= in[i]*2;
if (i%2==0)
temp = temp+1;
else
temp = temp -1;
out[i] = temp;
}
}.execute(4)
i=0:                  i=1:                  i=2:                  i=3:
int temp=in[0]*2      int temp=in[1]*2      int temp=in[2]*2      int temp=in[3]*2
<c> = (0%2==0)        <c> = (1%2==0)        <c> = (2%2==0)        <c> = (3%2==0)
if <c>  temp=temp+1   if <c>  temp=temp+1   if <c>  temp=temp+1   if <c>  temp=temp+1
if <!c> temp=temp-1   if <!c> temp=temp-1   if <!c> temp=temp-1   if <!c> temp=temp-1
out[0]=temp           out[1]=temp           out[2]=temp           out[3]=temp
53 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
AVOIDING DIVERGENCE
Sometimes it is more efficient to process unnecessary data to avoid conditionals
for (int i=0; i< 65536; i++)
   if (i%64 == 0)
      out[i] = 0;
   else
      out[i] = in[i];

// process every element, then fix up every 64th element, avoiding the per-element conditional
for (int i=0; i< 65536; i++)
   out[i] = in[i];
for (int i=0; i< 65536; i+=64)
   out[i] = 0;

new Kernel(){ @Override public void run(){ int i=getGlobalID();
   out[i] = in[i];
}}.execute(65536);
new Kernel(){ @Override public void run(){ int i=getGlobalID();
   out[i*64] = 0;
}}.execute(65536/64);

We can often adjust the range and add offsets to avoid boundary checks

for (int i=0; i< 65536; i++)
   if (i!=0 && i!=65535)
      out[i] = (in[i-1]+in[i]+in[i+1])/3;

new Kernel(){ @Override public void run(){ int i=getGlobalID();
   out[i+1] = (in[i]+in[i+1]+in[i+2])/3;
}}.execute(65534);
54 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
CHARACTERISTICS OF IDEAL DATA PARALLEL WORKLOADS
Looping over large arrays of primitives
– 32/64 bit data types preferred
– Without data dependencies between iterations
– Each iteration contains sequential code (few branches)
Good balance between data size (low) and compute (high)
– Transfer of data to/from the GPU can be costly
– Trivial compute often not worth the transfer cost
– May still benefit, by freeing up CPU for other work
– Order of iteration unimportant
[Chart: compute (y-axis) vs. data size (x-axis); the ideal region is high compute on data sizes that fit within GPU memory]
55 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
APARAPI NBODY DEMO
NBody is a common OpenCL/CUDA benchmark/demo
Determine the positions of N bodies, calculating the gravitational effect that each body has on every other body
– C++/C version shipped with AMD Stream SDK
Essentially an N^2 problem
– If we double the number of bodies, we perform four times the positional calculations
The following charts compare
– Naïve Java version (single loop)
– Aparapi version using Java Thread Pool
– Aparapi version running on the GPU (ATI Radeon ™ 5870)
56 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
APARAPI NBODY DEMO
NBODY DEMO
57 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
APARAPI NBODY PERFORMANCE (FRAME RATE VS. NUMBER OF BODIES)
[Chart: frames per second (y-axis, 0 to 450) vs. # of bodies (x-axis, 1k to 128k) for Java Single Thread, Aparapi Thread Pool and Aparapi GPU]
58 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
APARAPI NBODY PERFORMANCE: CALCULATIONS PER SEC VS. NUMBER OF BODIES
[Chart: position calculations per µs (y-axis, 0 to 6000) vs. # of bodies (x-axis, 1k to 128k) for Java Single Thread, Aparapi Thread Pool and Aparapi GPU]
59 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
APARAPI MANDEL DEMO
MANDEL DEMO
60 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
APARAPI EXTENSIONS FOR ITERATING OVER KERNEL EXECUTIONS
Added explicit buffer management for algorithms which iterate over kernel executions
int [] buffer = new int[HUGE];
int [] unusedBuffer = new int[HUGE];
Kernel k = new Kernel(){
@Override public void run(){
// mutates buffer contents
// no reference to unusedBuffer
}
};
for (int i=0; i< 1000; i++){
//Transfer buffer to GPU
k.execute(HUGE);
//Transfer buffer from GPU
}
Aparapi can/does analyze kernel methods and generates optimized host buffer transfer requests at runtime.
Aparapi has no knowledge of buffer accesses from the enclosing loop, so it MUST be conservative and assume that the buffer is modified between invocations.
This results in unnecessary buffer copies (in this case 1000 of each) to and from the device.
61 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
APARAPI EXTENSIONS FOR ITERATING OVER KERNEL EXECUTIONS
– With explicit buffer management we can refactor the code to this
int [] buffer = new int[HUGE];
Kernel k = new Kernel(){
   @Override public void run(){
      // mutates buffer contents
   }
};
k.setExplicit();
k.put(buffer);
for (int i=0; i< 1000; i++){
   k.execute(HUGE);
}
k.get(buffer);

The developer can take control and coordinate when/if transfers take place.
62 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
PROPOSED APARAPI ENHANCEMENTS: ALLOW ACCESS TO ARRAYS OF OBJECTS
Allow automatic extraction of buffers from arrays/collections of objects.
– A Java developer implementing 'nbody' problem would probably define a class for each particle
public class Particle{
int x, y, z;
String name;
Color color;
// other Particle specific state
}
– .. and would expect to be able to create a Kernel to calculate positions for an array of particles
Particle[] particles = new Particle[1024];
ParticleKernel kernel = new ParticleKernel(particles);
while(displaying){
kernel.execute(particles.length);
//update display of particles
}
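ParticleKernel is not spelled out here; a hypothetical shape, assuming object-array access were supported, might be:

class ParticleKernel extends Kernel {
   final Particle[] particles;
   ParticleKernel(Particle[] particles) {
      this.particles = particles;
   }
   @Override public void run() {
      int i = getGlobalID();
      // update particles[i].x, particles[i].y, particles[i].z from the other bodies
   }
}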
– Unfortunately Aparapi would currently fail to convert the above kernel to OpenCL and would fall back to using a Thread Pool.
63 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
PROPOSED APARAPI ENHANCEMENTS: ALLOW ACCESS TO ARRAYS OF OBJECTS
Aparapi currently not ‘Object Friendly’ and the ideal code will need to be refactored to use primitive arrays.
int[] x = new int[1024];
int[] y = new int[1024];
int[] z = new int[1024];
Color[] color = new Color[1024];
String[] name = new String[1024];
Positioner.position(x, y, z);
64 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
PROPOSED APARAPI ENHANCEMENTS: ALLOW ACCESS TO ARRAYS OF OBJECTS
In our initial Open Source release we intend to allow arrays of objects to be accessed.
At runtime Aparapi will automatically copy any accessed fields into temporary primitive arrays.
The OpenCL kernel will be passed these primitive copies.
On completion content from the primitive buffers will be pushed back into the original objects
This will allow us to use any array based collection (ArrayList/Vector) from kernels
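A hedged sketch of that gather/scatter idea (the helper name is made up; this is not Aparapi's implementation):

// Illustrative only: copy accessed fields into primitive arrays, run, then copy back.
static void executeOnPrimitives(Particle[] particles) {
   int n = particles.length;
   int[] x = new int[n], y = new int[n], z = new int[n];
   for (int i = 0; i < n; i++) {   // gather accessed fields into temporary primitive arrays
      x[i] = particles[i].x;
      y[i] = particles[i].y;
      z[i] = particles[i].z;
   }
   // ... the OpenCL kernel is handed x[], y[] and z[] ...
   for (int i = 0; i < n; i++) {   // scatter results back into the objects on completion
      particles[i].x = x[i];
      particles[i].y = y[i];
      particles[i].z = z[i];
   }
}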
65 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
FUTURE WORK
Sync with ‘project lambda’ (Java 8) and allow kernels to be expressed as lambda expressions.
More work on automatically extracting buffer transfers across object collections
Hand more explicit control to ‘power users’
– Explicit buffer (or sub buffer) transfers
– Expose local memory and barriers
Evaluating Open Source
– Aiming for Q3 Open Source release of Aparapi
– License TBD, probably BSD variant
– Need to decide where to host
– http://code.google.com , http://sourceforge.net/ or http://www.java.net
– Enable and encourage community contributions
66 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
SIMILAR INTERESTING/RELATED WORK
Tidepowerd
– Offers a similar solution for .NET
– NVIDIA cards only at present
http://www.tidepowerd.com/
java-gpu
– An open source project for extracting kernels from nested loops
– Extracts code structure from bytecode
– Creates CUDA behind the scenes
http://code.google.com/p/java-gpu/
GRAPHITE-OpenCL: Generate OpenCL Code from Parallel Loops (for GCC)
– Auto detect data parallel loops in gcc compiler and generate OpenCL + host code for the loop
http://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=2010-GCC-Summit-Proceedings.pdf
67 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
SUMMARY
APUs/GPUs offer unprecedented performance for the appropriate workload
Don’t assume everything can/should execute on the APU/GPU
Look for ‘Islands of parallel in a sea of sequential’
Aparapi provides an ideal framework for executing data-parallel code on the GPU
Please participate in the upcoming Aparapi Open Source community
Download and experiment with Aparapi
– http://developer.amd.com/aparapi
68 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions
and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF
EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in
this presentation are for informational purposes only and may be trademarks of their respective owners.
© 2011 Advanced Micro Devices, Inc.
69 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
70 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011
A SEQUENTIAL VERSION OF KERNEL BASE CLASS
public abstract class Kernel {
   private int gid = 0;
   protected int getGlobalID(){
      return(gid);
   }
   public abstract void run();
   public void execute(int size){
      for (gid=0; gid<size; gid++)
         run();
   }
}

new Kernel(){
   @Override public void run(){
      int i = getGlobalID();
      square[i] = in[i]*in[i];
   }
}.execute(size);
72 | APARAPI : Accelerating Java workloads via GPU | HJUG | May 25th , 2011