Cluster Computing with Java Threads
Philip J. Hatcher
University of New Hampshire

Collaborators
• UNH/Hyperion
  – Mark MacBeth and Keith McGuigan
• ENS-Lyon/DSM-PM2
  – Gabriel Antoniu, Luc Bougé and Raymond Namyst

Focus
• Use Java “as is” for high-performance computing
  – support computationally intensive applications
  – utilize parallel computing hardware

Outline
• Our Vision
• Java Threads
• The PM2 Run-time Environment
• Hyperion: Java Threads on Clusters
• Evaluation
• Related Work
• Conclusions

Why Java?
• Soon to be ubiquitous!
  – use of Java is growing very rapidly
• Designed for portability:
  – develop programs on your desktop
  – run programs on a distant cluster

Why Java?
• Explicitly parallel!
  – includes a threaded programming model
• Relaxed memory model
  – consistency model aids an implementation on distributed-memory parallel computers

Unique Opportunity
• Use Java to bring parallelism to the “masses”
• Let’s not miss it!
• But programmers will not accept syntax or model changes

Open Question
• Parallelism via Java access to distributed-computing techniques?
  – e.g. RMI (remote method invocation)
• Or parallelism via Java threads?

That is, ...
• Does a user prefer to view a cluster as a collection of distinct machines?
• Or does a user prefer to view a cluster as a “black box” that will simply run Java code faster?

Are you “in a box”?
Or are you “thinking outside of the box”?

Climb out of the box!
• Use Java threads “as is” to program clusters of computers.
• Program for the threaded Java virtual machine.
• Allow the implementation to handle the details of executing in a cluster.

Java Threads
• Threads are objects.
• The class java/lang/Thread contains all of the methods for initializing, running, suspending, querying and destroying threads.

java/lang/Thread methods
• Thread() - constructor for thread object.
• start() - start the thread executing.
• run() - method invoked by ‘start’.
• stop(), suspend(), resume(), join(), yield().
• setPriority().

Java Synchronization
• Java uses monitors, which protect a region of code by allowing only one thread at a time to execute it.
• Monitors utilize locks.
• There is a lock associated with each object.

synchronized keyword
• synchronized ( Exp ) Block
• public class Q { synchronized void put(…) { … } }

java/lang/Object methods
• wait() - the calling thread, which must hold the lock for the object, is placed in a wait set associated with the object. The lock is then released.
• notify() - an arbitrary thread in the wait set of this object is awakened and then competes again to get the lock for the object.
• notifyAll() - all waiting threads are awakened.

Shared-Memory Model
• Java threads execute in a virtual shared memory.
• All threads are able to access all objects.
• But threads may not access each other’s stacks.

Java Memory Consistency
• A variant of release consistency.
• Threads can keep locally cached copies of objects.
• Consistency is provided by requiring that:
  – a thread's object cache be flushed upon entry to a monitor.
  – local modifications made to cached objects be transmitted to the central memory when a thread exits a monitor.

PM2: A Distributed, Multithreaded Runtime Environment
• Thread library: Marcel
  – User-level
  – Supports SMP
  – POSIX-like
  – Preemptive thread migration
• Communication library: Madeleine
  – Portable: BIP, SISCI/SCI, MPI, TCP, PVM
  – Efficient

                    Context Switch    Create
  SMP               0.250 µs          2 µs
  Non-SMP           0.120 µs          0.55 µs

  Thread Migration
  SCI/SISCI         24 µs
  BIP/Myrinet       75 µs

                    SCI/SISCI         BIP/Myrinet
  Latency           6 µs              8 µs
  Bandwidth         70 MB/s           125 MB/s

DSM-PM2: Architecture
• DSM comm:
  – send page request
  – send page
  – send invalidate request
  – …
• DSM page manager:
  – set/get page owner
  – set/get page access
  – add/remove to/from copyset
  – ...
[Diagram: DSM-PM2 (DSM Protocol Policy, DSM Protocol lib, DSM Page Manager, DSM Comm) layered on PM2 (Madeleine Comms, Marcel Threads)]
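The thread and monitor operations listed in the Java Threads and Java Synchronization slides can be seen together in a small producer/consumer program. The following is a minimal sketch, not code from Hyperion: the class name MonitorDemo, the capacity of 4, and the runDemo helper are assumptions made for illustration. Only the class name Q and its synchronized put method come from the slide. It uses only java.lang.Thread (start, join) and the java.lang.Object monitor methods (wait, notifyAll).

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class MonitorDemo {
    /** Bounded queue protected by its own monitor lock (illustrative only). */
    static class Q {
        private final Deque<Integer> items = new ArrayDeque<>();
        private final int capacity;

        Q(int capacity) {
            this.capacity = capacity;
        }

        synchronized void put(int value) throws InterruptedException {
            while (items.size() == capacity) {
                wait();                 // joins the wait set and releases the lock
            }
            items.addLast(value);
            notifyAll();                // wake any thread waiting in get()
        }

        synchronized int get() throws InterruptedException {
            while (items.isEmpty()) {
                wait();
            }
            int value = items.removeFirst();
            notifyAll();                // wake any thread waiting in put()
            return value;
        }
    }

    /** Runs one producer and one consumer; returns the sum the consumer saw. */
    static long runDemo(int count) throws InterruptedException {
        Q q = new Q(4);                 // capacity 4 is an arbitrary choice
        long[] sum = new long[1];
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= count; i++) {
                    q.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < count; i++) {
                    sum[0] += q.get();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        consumer.start();
        producer.join();
        consumer.join();                // after join, the consumer's writes are visible here
        return sum[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo(100));
    }
}
```

Nothing in this program names a machine or a node, which is the point of the “as is” approach: the same source could in principle be handed to any implementation of the threaded Java virtual machine.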
DSM-PM2: Performance

  Operation/Protocol        SISCI/SCI   BIP/Myrinet   TCP/Myrinet
  Page fault                    18           56            56
  Fault handling                 1            2             2
  Transmitting request          17           30           190
  Processing request             1            2             2
  Sending back 4 kB page        85          134           412
  Installing the page           12           24            24
  Total                        134 µs       248 µs        686 µs

• SCI cluster has 450 MHz Pentium II nodes
• Myrinet cluster has 200 MHz Pentium Pro nodes

Hyperion
• Executes threaded Java programs on clusters.
• Built on top of PM2 and DSM-PM2.
  – Provides both portability and efficiency

Reversing the Bytecode Stream
• Conventionally, users “pull” bytecode to their machines for local execution.
• Our vision:
  – users develop their high-performance Java programs using the Java toolset on their desktop.
  – they then “push” the resulting bytecode to a Hyperion server for high-performance cycles.

Supporting High Performance
• Utilizes a bytecode-to-C translator.
• Parallel execution via spreading of Java threads across nodes of the cluster.
• Java threads implemented as lightweight threads using the PM2 library.

Compiling Java
• Hyperion is designed for computationally intensive applications, so the small overhead of translating bytecode is not important.
• Translating to C allows us to leverage the native C compiler and optimizer.

General Hyperion Overview
[Diagram: prog.java → javac (Sun's Java compiler) → prog.class (bytecode) → java2c (instruction-wise translation) → prog.[ch] → gcc -O6, linked with the runtime libraries (libs) → prog]

The Hyperion Run-Time System
• Collection of modules to allow “plug-and-play” implementations:
  – inter-node communication
  – threads
  – memory and synchronization
  – etc.

Hyperion Internal Structure
[Diagram: Hyperion modules (Load balancer, Native Java API, Thread subsystem, Memory subsystem, Comm. subsystem) built on the PM2 API (pm2_rpc, pm2_thread_create, etc.), which sits on PM2 (DSM subsystem, Thread subsystem, Comm. subsystem)]

Thread and Object Allocation
• Currently, threads are allocated to processors in round-robin fashion.
• Currently, an object is allocated to the processor that holds the thread that is creating the object.
• Currently, DSM-PM2 is used to implement the Java memory model.

Hyperion’s DSM API
• loadIntoCache
• invalidateCache
• updateMainMemory
• get
• put

DSM Implementation
• Node-level caches.
• Page-based and home-based protocol.
• Log modifications made to remote objects.
• Use explicit in-line checks in get/put.
• Each node allocates objects from a different range of the virtual address space.

Details
• Objects are aligned on 64-byte boundaries.
• An object reference is the address of the base of the object.
• The bottom 6 bits of the reference can be used to store the node number of the object’s home.

More details
• loadIntoCache checks the 6 bits to see if an object is remote.
• If so, and if not already locally cached, DSM-PM2 is used to load the page(s) containing the object.
• When a remote object is cached, a bit is turned on in its header.

Yet more details
• The put primitive checks the header bit to see if a modification should be logged.
• updateMainMemory sends the logged changes to the home node.

Evaluation
• Minimal-cost map-coloring application.
• Branch-and-bound algorithm.
• 64 threads, each with its own priority queue.
• Current best solution is shared.
• Problem size: the 29 easternmost states of the USA, with 4 colors of differing costs.

Experimental Setting
• Two Linux 2.2 clusters:
  – eight 200 MHz Pentium Pro processors connected by a Myrinet switch and using MPI over BIP.
  – four 450 MHz Pentium II processors connected by an SCI network and using SISCI.
• gcc 2.7.2.3 with -O6

Performance Results
[Chart: time (sec) versus number of nodes (1, 2, 4, 8) for the 450MHz/SCI and 200MHz/BIP clusters]

Parallelizability
[Chart: parallelizability versus number of nodes (1, 2, 4, 8); series: 200MHz/BIP, 450MHz/SCI, Ideal]

Baseline Performance
• Compared serial Java to serial C for the map-coloring application.
• Each program has a single queue and a single thread.
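The reference encoding and modification logging described in the DSM slides above can be made concrete with a small simulation. This is a hedged sketch, not Hyperion's implementation: the class DsmSketch, the Node class, the makeRef, homeOf and allocate helpers, and the method signatures are all assumptions. It works at object granularity with in-process maps, whereas Hyperion uses DSM-PM2 to move whole pages between nodes, and presence in the cache map stands in for the header bit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DsmSketch {
    // 64-byte alignment leaves the low 6 bits of an address free for a node number.
    static final long NODE_MASK = 63L;

    /** Packs the home-node number into the low 6 bits of an aligned address. */
    static long makeRef(long alignedAddress, int homeNode) {
        if ((alignedAddress & NODE_MASK) != 0 || homeNode < 0 || homeNode > NODE_MASK) {
            throw new IllegalArgumentException("need a 64-byte-aligned address and a 6-bit node number");
        }
        return alignedAddress | homeNode;
    }

    static int homeOf(long ref) {
        return (int) (ref & NODE_MASK);
    }

    /** One simulated cluster node (hypothetical; objects are plain int arrays). */
    static class Node {
        final int id;
        final Map<Long, int[]> home = new HashMap<>();   // objects whose home is this node
        final Map<Long, int[]> cache = new HashMap<>();  // local copies of remote objects
        final List<long[]> modLog = new ArrayList<>();   // {ref, field, value} per logged write

        Node(int id) {
            this.id = id;
        }

        long allocate(long alignedAddress, int fieldCount) {
            long ref = makeRef(alignedAddress, id);
            home.put(ref, new int[fieldCount]);
            return ref;
        }

        /** Copies a remote object into the cache; a stand-in for a DSM-PM2 page load. */
        void loadIntoCache(long ref, Node[] cluster) {
            if (homeOf(ref) != id && !cache.containsKey(ref)) {
                cache.put(ref, cluster[homeOf(ref)].home.get(ref).clone());
            }
        }

        int get(long ref, int field) {
            int[] cached = cache.get(ref);
            return (cached != null ? cached : home.get(ref))[field];
        }

        /** Writes a field; writes to cached (remote) objects are also logged. */
        void put(long ref, int field, int value) {
            int[] cached = cache.get(ref);
            if (cached != null) {
                cached[field] = value;
                modLog.add(new long[] {ref, field, value});
            } else {
                home.get(ref)[field] = value;
            }
        }

        /** Sends the logged changes to each object's home node, then clears the log. */
        void updateMainMemory(Node[] cluster) {
            for (long[] m : modLog) {
                cluster[homeOf(m[0])].home.get(m[0])[(int) m[1]] = (int) m[2];
            }
            modLog.clear();
        }

        void invalidateCache() {
            cache.clear();
        }
    }

    public static void main(String[] args) {
        Node[] cluster = {new Node(0), new Node(1)};
        long ref = cluster[0].allocate(0x1000L, 2);
        cluster[1].loadIntoCache(ref, cluster);
        cluster[1].put(ref, 0, 42);
        System.out.println(cluster[0].get(ref, 0));  // home copy not yet updated
        cluster[1].updateMainMemory(cluster);
        System.out.println(cluster[0].get(ref, 0));  // home copy now holds the logged write
    }
}
```

In terms of the Java memory model slides, updateMainMemory corresponds to what must happen when a thread exits a monitor, and invalidateCache to what must happen when it enters one.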
Serial Java versus Serial C
[Bar chart: time (sec) for C, Java, Java v2 and Java v3]
• Java v2: DSM checks disabled
• Java v3: DSM and array-bound checks disabled
• Executing on a single 450 MHz Pentium II

Inline checks are expensive!
• Genericity of DSM-PM2 allows an alternative implementation.
• Use page-fault detection rather than inline checks to detect non-local objects.

Using Page Faults: details
• An object reference is the address of the base of the object.
• loadIntoCache does nothing.
• DSM-PM2 is used to handle page faults generated by the get/put primitives.

More details
• When an object is allocated, its address is appended to a list attached to the page that contains its header.
• When a page is loaded on a remote node, the list is used to turn on the header bit for all object headers on the page.
• The put primitive uses the header bit in the same manner as the inline-check version.

Inline Check versus Page Fault
• IC has higher overhead for accessing objects (either local or locally cached).
• PF has higher overhead (signal handling and memory protection) for loading a page into the cache.

IC versus PF: serial map-coloring
[Bar chart: time (sec) for C, Java IC, Java PF, Java IC v2, Java PF v2, Java IC v3 and Java PF v3]
• Java XX v2: DSM checks disabled
• Java XX v3: DSM and array-bound checks disabled
• Executing on a single 450 MHz Pentium II

IC versus PF: parallel map-coloring
[Chart: time (sec) versus number of nodes (1, 2, 4) for IC and PF]
• Executing on 450MHz/SCI cluster.

Related Work
• Java/MPI: cluster nodes are explicit
• Java/RMI: ditto
• Remote objects via RMI: nearly transparent
  – e.g. JavaParty, Do!
• Distributed interpreters
  – e.g. Java/DSM, MultiJav, cJVM

Conclusions
• Approach is clean: Java “as is”
• Approach is promising
  – good parallelizability for map-coloring
  – need better scalar compilation
    • e.g. array bound-check removal
  – need further parallel application studies
    • are thread/object placement heuristics sufficient for programmers to write efficient programs?
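To show what a Java “as is” program of the kind used in the evaluation might look like, here is a minimal sketch of a threaded branch-and-bound map colorer in which each worker thread owns a priority queue and all workers share the current best cost through a monitor. It is not the benchmark code. The four-region graph, the color costs, and every class and method name (MapColoringSketch, Best, State, Worker, solve) are assumptions chosen to keep the example small.

```java
import java.util.PriorityQueue;

public class MapColoringSketch {
    // Hypothetical toy instance, not the 29-state problem from the evaluation.
    static final int REGIONS = 4;
    static final int[][] EDGES = {{0, 1}, {0, 2}, {1, 2}, {2, 3}};
    static final int[] COLOR_COST = {1, 2, 3, 4};   // must be sorted ascending

    static boolean adjacent(int a, int b) {
        for (int[] e : EDGES) {
            if ((e[0] == a && e[1] == b) || (e[0] == b && e[1] == a)) return true;
        }
        return false;
    }

    /** Current best solution cost, shared by all workers and guarded by a monitor. */
    static class Best {
        private int cost = Integer.MAX_VALUE;
        synchronized int get() { return cost; }
        synchronized void offer(int candidate) { if (candidate < cost) cost = candidate; }
    }

    /** Partial coloring of regions 0 .. depth-1. */
    static class State {
        final int[] colors;
        final int depth;
        final int cost;

        State(int[] colors, int depth, int cost) {
            this.colors = colors;
            this.depth = depth;
            this.cost = cost;
        }

        boolean conflicts(int color) {
            for (int r = 0; r < depth; r++) {
                if (colors[r] == color && adjacent(r, depth)) return true;
            }
            return false;
        }

        State extend(int color) {
            int[] next = colors.clone();
            next[depth] = color;
            return new State(next, depth + 1, cost + COLOR_COST[color]);
        }
    }

    /** Worker thread with its own priority queue, cheapest partial coloring first. */
    static class Worker extends Thread {
        private final PriorityQueue<State> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(a.cost, b.cost));
        private final Best best;

        Worker(Best best) { this.best = best; }

        void seed(State s) { queue.add(s); }   // call only before start()

        @Override
        public void run() {
            while (!queue.isEmpty()) {
                State s = queue.poll();
                // Bound: assume every uncolored region gets the cheapest color.
                int lowerBound = s.cost + (REGIONS - s.depth) * COLOR_COST[0];
                if (lowerBound >= best.get()) continue;          // prune
                if (s.depth == REGIONS) { best.offer(s.cost); continue; }
                for (int c = 0; c < COLOR_COST.length; c++) {
                    if (!s.conflicts(c)) queue.add(s.extend(c));
                }
            }
        }
    }

    /** Deals the first region's color choices to workers round-robin, then runs them. */
    static int solve(int threads) throws InterruptedException {
        Best best = new Best();
        Worker[] workers = new Worker[threads];
        for (int i = 0; i < threads; i++) workers[i] = new Worker(best);
        State root = new State(new int[REGIONS], 0, 0);
        for (int c = 0; c < COLOR_COST.length; c++) {
            workers[c % threads].seed(root.extend(c));
        }
        for (Worker w : workers) w.start();
        for (Worker w : workers) w.join();
        return best.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(solve(4));
    }
}
```

The only shared mutable object is Best, so under the memory model described earlier each read or update of the bound passes through a monitor entry and exit. The code says nothing about where threads or objects are placed.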