
Cluster Computing with Java Threads
Philip J. Hatcher
University of New Hampshire

Collaborators
• UNH/Hyperion
– Mark MacBeth and Keith McGuigan
• ENS-Lyon/DSM-PM2
– Gabriel Antoniu, Luc Bougé and Raymond Namyst

Focus
• Use Java “as is” for high-performance computing
– support computationally intensive applications
– utilize parallel computing hardware

Outline
• Our Vision
• Java Threads
• The PM2 Run-time Environment
• Hyperion: Java Threads on Clusters
• Evaluation
• Related Work
• Conclusions

Why Java?
• Soon to be ubiquitous!
– use of Java is growing very rapidly
• Designed for portability:
– develop programs on your desktop
– run programs on a distant cluster

Why Java?
• Explicitly parallel!
– includes a threaded programming model
• Relaxed memory model
– consistency model aids an implementation on distributed-memory parallel computers

Unique Opportunity
• Use Java to bring parallelism to the “masses”
• Let’s not miss it!
• But, programmers will not accept syntax or model changes

Open Question
• Parallelism via Java access to distributed-computing techniques?
– e.g. RMI (remote method invocation)
• Or, parallelism via Java threads?

That is, ...
• Does a user prefer to view a cluster as a collection of distinct machines?
• Or, does a user prefer to view a cluster as a “black box” that will simply run Java code faster?

Are you “in a box”?

Or, are you “thinking outside of the box”?

Climb out of the box!
• Use Java threads “as is” to program clusters of computers.
• Program for the threaded Java virtual machine.
• Allow the implementation to handle the details of executing in a cluster.

Java Threads
• Threads are objects.
• The class java/lang/Thread contains all of the methods for initializing, running, suspending, querying and destroying threads.

java/lang/Thread methods
• Thread() - constructor for thread object.
• start() - start the thread executing.
• run() - method invoked by ‘start’.
• stop(), suspend(), resume(), join(), yield().
• setPriority().

Java Synchronization
• Java uses monitors, which protect a region of code by allowing only one thread at a time to execute it.
• Monitors utilize locks.
• There is a lock associated with each object.

synchronized keyword
• synchronized ( Exp ) Block
• public class Q { synchronized void put(…) { … } }

java/lang/Object methods
• wait() - the calling thread, which must hold the lock for the object, is placed in a wait set associated with the object. The lock is then released.
• notify() - an arbitrary thread in the wait set of this object is awakened and then competes again to get the lock for the object.
• notifyAll() - all waiting threads are awakened.

Shared-Memory Model
• Java threads execute in a virtual shared memory.
• All threads are able to access all objects.
• But threads may not access each other’s stacks.

Java Memory Consistency
• A variant of release consistency.
• Threads can keep locally cached copies of objects.
• Consistency is provided by requiring that:
– a thread's object cache be flushed upon entry to a monitor.
– local modifications made to cached objects be transmitted to the central memory when a thread exits a monitor.

PM2: A Distributed, Multithreaded Runtime Environment
• Thread library: Marcel
– User-level
– Supports SMP
– POSIX-like
– Preemptive thread migration
• Communication library: Madeleine
– Portable: BIP, SISCI/SCI, MPI, TCP, PVM
– Efficient

Marcel          Context Switch   Create
SMP             0.250 µs         2 µs
Non-SMP         0.120 µs         0.55 µs

Thread Migration
SCI/SISCI       24 µs
BIP/Myrinet     75 µs

Madeleine       SCI/SISCI   BIP/Myrinet
Latency         6 µs        8 µs
Bandwidth       70 MB/s     125 MB/s

DSM-PM2: Architecture
[Figure: DSM-PM2 layers (DSM Protocol Policy, DSM Protocol lib, DSM Page Manager, DSM Comm) built on PM2 (Madeleine Comms, Marcel Threads).]
• DSM comm:
– send page request
– send page
– send invalidate request
– …
• DSM page manager:
– set/get page owner
– set/get page access
– add/remove to/from copyset
– ...
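
Returning to the monitor operations from the Java Synchronization slides, the following is a minimal sketch of how synchronized, wait() and notifyAll() fit together. Only the class name Q and the method put() come from the slides; the bounded capacity, the get() method and the QDemo driver are assumptions added for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical bounded queue. Q and put() are named on the slide;
// the capacity, get() and QDemo are illustrative assumptions.
class Q {
    private final Deque<Integer> items = new ArrayDeque<>();
    private final int capacity;

    Q(int capacity) {
        this.capacity = capacity;
    }

    // A synchronized method holds this object's lock for its whole body.
    synchronized void put(int value) throws InterruptedException {
        while (items.size() == capacity) {
            wait();        // release the lock and join this object's wait set
        }
        items.addLast(value);
        notifyAll();       // wake all waiters; each must re-acquire the lock
    }

    synchronized int get() throws InterruptedException {
        while (items.isEmpty()) {
            wait();
        }
        int value = items.removeFirst();
        notifyAll();
        return value;
    }
}

class QDemo {
    // One producer thread puts 1..n; the calling thread consumes and sums.
    static int sumThroughQueue(int n) {
        Q q = new Q(2);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) {
                    q.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        try {
            int sum = 0;
            for (int i = 0; i < n; i++) {
                sum += q.get();
            }
            producer.join();
            return sum;
        } catch (InterruptedException e) {
            throw new IllegalStateException("interrupted while consuming", e);
        }
    }
}
```

The wait() calls sit inside while loops because a thread that has been awakened must compete for the lock again and may find that the condition no longer holds.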
DSM-PM2: Performance
Operation/Protocol        SISCI/SCI   BIP/Myrinet   TCP/Myrinet
Page fault                18          56            56
Fault handling            1           2             2
Transmitting request      17          30            190
Processing request        1           2             2
Sending back 4 kB page    85          134           412
Installing the page       12          24            24
Total                     134 µs      248 µs        686 µs
• SCI cluster has 450 MHz Pentium II nodes
• Myrinet cluster has 200 MHz Pentium Pro nodes

Hyperion
• Executes threaded Java programs on clusters.
• Built on top of PM2 and DSM-PM2.
– Provides both portability and efficiency

Reversing the Bytecode Stream
• Conventionally, users “pull” bytecode to their machines for local execution.
• Our vision:
– users develop their high-performance Java programs using the Java toolset on their desktop.
– they then “push” the resulting bytecode to a Hyperion server for high-performance cycles.

Supporting High Performance
• Utilizes a bytecode-to-C translator.
• Parallel execution via spreading of Java threads across nodes of the cluster.
• Java threads implemented as lightweight threads using PM2 library.

Compiling Java
• Hyperion is designed for computationally intensive applications, so the small overhead of translating bytecode is not important.
• Translating to C allows us to leverage the native C compiler and optimizer.

General Hyperion Overview
[Figure: compilation pipeline. prog.java → javac (Sun's Java compiler) → prog.class (bytecode) → java2c (instruction-wise translation) → prog.[ch] → gcc -O6, linked with libs (runtime libraries) → prog.]

The Hyperion Run-Time System
• Collection of modules to allow “plug-and-play” implementations:
– inter-node communication
– threads
– memory and synchronization
– etc

Hyperion Internal Structure
[Figure: Hyperion modules (load balancer, thread subsystem, native Java API, memory subsystem, comm. subsystem) layered on the PM2 API (pm2_rpc, pm2_thread_create, etc.), which sits on the PM2 DSM subsystem, thread subsystem and comm. subsystem.]

Thread and Object Allocation
• Currently, threads are allocated to processors in round-robin fashion.
• Currently, an object is allocated to the processor that holds the thread that is creating the object.
• Currently, DSM-PM2 is used to implement the Java memory model.

Hyperion’s DSM API
• loadIntoCache
• invalidateCache
• updateMainMemory
• get
• put

DSM Implementation
• Node-level caches.
• Page-based and home-based protocol.
• Log modifications made to remote objects.
• Use explicit in-line checks in get/put.
• Each node allocates objects from a different range of the virtual address space.

Details
• Objects are aligned on 64-byte boundaries.
• An object reference is the address of the base of the object.
• The bottom 6 bits of the ref can be used to store the node number of the object’s home.

More details
• loadIntoCache checks the 6 bits to see if an object is remote.
• If so, and if not already locally cached, DSM-PM2 is used to load the page(s) containing the object.
• When a remote object is cached, a bit is turned on in its header.

Yet more details
• The put primitive checks the header bit to see if a modification should be logged.
• updateMainMemory sends the logged changes to the home node.

Evaluation
• Minimal-cost map-coloring application.
• Branch-and-bound algorithm.
• 64 threads, each with its own priority queue.
• Current best solution is shared.
• Problem size: 29 eastern-most states of USA with 4 colors of differing costs.

Experimental Setting
• Two Linux 2.2 clusters:
– eight 200 MHz Pentium Pro processors connected by Myrinet switch and using MPI over BIP.
– four 450 MHz Pentium II processors connected by a SCI network and using SISCI.
• gcc 2.7.2.3 with -O6

Performance Results
[Chart: time (sec) versus number of nodes (1, 2, 4, 8) for 450MHz/SCI and 200MHz/BIP.]

Parallelizability
[Chart: parallelizability versus number of nodes (1, 2, 4, 8) for 200MHz/BIP, 450MHz/SCI and Ideal.]

Baseline Performance
• Compared serial Java to serial C for map-coloring application.
• Each program has a single queue and a single thread.
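
Looking back at the inline-check design in the Details slides, the reference encoding can be sketched as follows. Because objects are aligned on 64-byte boundaries, the bottom 6 bits of a base address are always zero and can carry a home node number from 0 to 63. This is a sketch under stated assumptions: the class TaggedRef and its method names are hypothetical, and addresses are modeled as Java long values rather than the pointers a bytecode-to-C runtime would use.

```java
// Hypothetical helper illustrating the 6-bit home-node tag.
// Names and the use of long for addresses are assumptions, not Hyperion's code.
final class TaggedRef {
    static final long ALIGNMENT = 64;             // objects sit on 64-byte boundaries
    static final long NODE_MASK = ALIGNMENT - 1;  // bottom 6 bits: 0x3F

    private TaggedRef() {
    }

    // Combine an aligned base address with the home node number (0..63).
    static long make(long baseAddress, int homeNode) {
        if ((baseAddress & NODE_MASK) != 0) {
            throw new IllegalArgumentException("base address is not 64-byte aligned");
        }
        if (homeNode < 0 || homeNode > NODE_MASK) {
            throw new IllegalArgumentException("home node must fit in 6 bits");
        }
        return baseAddress | homeNode;
    }

    // Extract the home node stored in the bottom 6 bits.
    static int homeNode(long ref) {
        return (int) (ref & NODE_MASK);
    }

    // Recover the object's base address by clearing the tag bits.
    static long baseAddress(long ref) {
        return ref & ~NODE_MASK;
    }

    // The kind of test an inline check would perform before touching an object.
    static boolean isRemote(long ref, int localNode) {
        return homeNode(ref) != localNode;
    }
}
```

For example, TaggedRef.make(0x1000L, 5) yields a reference whose home node is 5 and whose base address is still 0x1000; a check of this form on every access is the per-access cost that the inline-check variant pays.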
Serial Java versus Serial C
[Chart: time (sec) for C, Java, Java v2 and Java v3.]
• Java v2: DSM checks disabled
• Java v3: DSM and array-bound checks disabled
• Executing on a single 450 MHz Pentium II

Inline checks are expensive!
• Genericity of DSM-PM2 allows an alternative implementation.
• Use page-fault detection rather than an inline check to detect a non-local object.

Using Page Faults: details
• An object reference is the address of the base of the object.
• loadIntoCache does nothing.
• DSM-PM2 is used to handle page faults generated by the get/put primitives.

More details
• When an object is allocated, its address is appended to a list attached to the page that contains its header.
• When a page is loaded on a remote node, the list is used to turn on the header bit for all object headers on the page.
• The put primitive uses the header bit in the same manner as the inline-check version.

Inline Check versus Page Fault
• IC has higher overhead for accessing objects (either local or locally cached).
• PF has higher overhead (signal handling and memory protection) for loading a page into the cache.

IC versus PF: serial map-coloring
[Chart: time (sec) for C, Java IC, Java PF, Java IC v2, Java PF v2, Java IC v3 and Java PF v3.]
• Java XX v2: DSM checks disabled
• Java XX v3: DSM and array-bound checks disabled
• Executing on a single 450 MHz Pentium II

IC versus PF: parallel map-coloring
[Chart: time (sec) versus number of nodes (1, 2, 4) for IC and PF.]
• Executing on 450MHz/SCI cluster.

Related Work
• Java/MPI: cluster nodes are explicit
• Java/RMI: ditto
• Remote objects via RMI: nearly transparent
– e.g. JavaParty, Do!
• Distributed interpreters
– e.g. Java/DSM, MultiJav, cJVM

Conclusions
• Approach is clean: Java “as is”
• Approach is promising
– good parallelizability for map-coloring
– need better scalar compilation
  • e.g. array bound-check removal
– need further parallel application studies
  • are thread/object placement heuristics sufficient for programmers to write efficient programs?
