Experience with a Cluster JVM
Philip J. Hatcher
University of New Hampshire
[email protected]

Acknowledgements
• UNH students – Mark MacBeth and Keith McGuigan
• PM2 team – a very effective and enjoyable collaboration

Traditional Parallel Programming
• Parallel programming supported by using a serial language plus a "bag on the side".
  – e.g. Fortran plus MPI
• Parallel programming supported by extending a serial language.
  – e.g. High Performance Fortran

My History
• I spent years studying data-parallel extensions to C, such as C*.
  – Users never really accepted the extensions.
  – They found them too complex.
  – They wanted standard, well-integrated solutions.

Java is a good thing!
• Java is explicitly parallel!
  – The language includes a threaded programming model.
• Java employs a relaxed memory model.
  – The consistency model aids an implementation on distributed-memory parallel computers.

Java Threads
• Threads are objects.
• The class java.lang.Thread contains all of the methods for initializing, running, suspending, querying and destroying threads.

java.lang.Thread methods
• Thread() – constructor for a thread object.
• start() – start the thread executing.
• run() – the method invoked by 'start'.
• stop(), suspend(), resume(), join(), yield().
• setPriority().

Java Synchronization
• Java uses monitors, which protect a region of code by allowing only one thread at a time to execute it.
• Monitors utilize locks.
• There is a lock associated with each object.

synchronized keyword
• synchronized ( Exp ) Block
• public class Q { synchronized void put(…) { … } }

java.lang.Object methods
• wait() – the calling thread, which must hold the lock for the object, is placed in a wait set associated with the object. The lock is then released.
• notify() – an arbitrary thread in the wait set of this object is awakened and then competes again to get the lock for the object.
• notifyAll() – all waiting threads are awakened.

Shared-Memory Model
• Java threads execute in a virtual shared memory.
• All threads are able to access all objects.
• But threads may not access each other's stacks.

Java Memory Consistency
• A variant of release consistency.
• Threads can keep locally cached copies of objects.
• Consistency is provided by requiring that:
  – a thread's object cache be flushed upon entry to a monitor.
  – local modifications made to cached objects be transmitted to the central memory when a thread exits a monitor.

Problems with Java Threads
• Java support for threads is very low level.
• The Java memory model is not very well understood.

Threads API
• No condition variables.
• No semaphores.
• No barriers.
• No collective operations on thread groups (e.g. sum reduction).
• No parallel collections.

So…
• Using low-level operations can be difficult and error-prone.
• Everyone is "re-inventing the wheel" as they struggle to construct higher-level abstractions.

Java Specification Request 166
• Expert Group formed 01/23/02.
• Goal is to provide java.util.concurrent:
  – atomic variables
  – special-purpose locks, barriers, semaphores and condition variables
  – queues and related collections for multithreaded use
  – thread pools

Java Memory Model
• Most programmers did not read Chapter 17 of the Java Language Specification.
• Those who did read it did not fully understand it.
• Lots of code has been written that is not portable.

For example,
• The Java Grande Forum distributes multithreaded Java benchmarks.
• These benchmarks utilize a barrier method implemented with volatile variables and "busy waiting" (sketched below).
• However, the benchmarks assume that when a volatile variable is set, all of memory will also be made consistent. Not true!
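The following is a minimal, hypothetical sketch of a volatile busy-wait barrier in the style the slide criticizes; it is not the actual Java Grande Forum code, and the class and field names are invented for illustration. It shows why the assumption above fails: threads leaving the spin loop rely on the volatile accesses alone to make earlier ordinary writes visible, which the original Java memory model does not guarantee.

```java
// One-shot spin barrier built from a volatile counter and busy waiting,
// in the style the slides criticize (hypothetical sketch, not the actual
// Java Grande Forum source).
public class SpinBarrier {
    private final int parties;
    private volatile int arrived = 0;   // volatile so spinning threads re-read it

    public SpinBarrier(int parties) {
        this.parties = parties;
    }

    public void await() {
        synchronized (this) {           // atomic increment of the arrival count
            arrived++;
        }
        while (arrived < parties) {     // busy wait until everyone has arrived
            Thread.yield();
        }
        // Pitfall: callers assume that, once they exit this loop, every ordinary
        // (non-volatile) write made by other threads before the barrier is visible
        // to them. The original Java memory model does not promise that a volatile
        // access flushes all of memory, so on a relaxed implementation (such as a
        // cluster JVM) the data "protected" by this barrier can be stale.
    }
}
```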
Implementors also struggled…
• In June 2000, IBM researchers suggested my cluster JVM violated the JMM, but could not cite an example.
• In July 2000, I produced a "proof" of correctness.
• In June 2001, a counter-example was found.
• The problem concerns properly handling "improperly synchronized" programs.

Java Specification Request 133
• Expert Group formed 06/12/01.
• Goal is to re-specify the Java memory model:
  – Maintain relaxed consistency.
  – Loosen the implementation requirements for handling "improperly synchronized" programs.
  – Fix ambiguities and holes.
• The current draft is still "rough sledding"!

Cluster Implementation of Java
• A single JVM running on a cluster of machines.
• The nodes of the cluster are transparent.
• Multithreaded applications exploit the multiple processors of the cluster.

Hyperion
• A cluster implementation of Java developed at the University of New Hampshire.
• Currently built on top of the PM2 distributed, multithreaded runtime environment from ENS-Lyon.

General Hyperion Overview
• Compilation pipeline: prog.java –[javac (Sun's Java compiler)]→ prog.class (bytecode) –[java2c (instruction-wise translation)]→ prog.[ch] –[gcc -O6, linked with the runtime libraries]→ prog

The Hyperion Run-Time System
• A collection of modules to allow "plug-and-play" implementations:
  – inter-node communication
  – threads
  – memory and synchronization
  – etc.

Thread and Object Allocation
• Currently, threads are allocated to processors in round-robin fashion.
• Currently, an object is allocated to the processor that holds the thread that is creating the object.
• Currently, DSM-PM2 is used to implement the Java memory model.

Hyperion Internal Structure
• Hyperion modules: load balancer, thread subsystem, native Java API, memory subsystem, communication subsystem.
• These are built on the PM2 API (pm2_rpc, pm2_thread_create, etc.).
• PM2 provides the DSM subsystem, thread subsystem, and communication subsystem.

PM2: A Distributed, Multithreaded Runtime Environment
• Thread library: Marcel
  – user-level, supports SMP, POSIX-like, preemptive threads, efficient migration
  – context switch: 0.250 µs (SMP), 0.120 µs (non-SMP); thread create: 2 µs (SMP), 0.55 µs (non-SMP)
  – thread migration: 24 µs (SCI/SISCI), 75 µs (BIP/Myrinet)
• Communication library: Madeleine
  – portable: BIP, SISCI/SCI, MPI, TCP, PVM
  – latency: 6 µs (SCI/SISCI), 8 µs (BIP/Myrinet); bandwidth: 70 MB/s (SCI/SISCI), 125 MB/s (BIP/Myrinet)

DSM-PM2: Architecture
• Layered design: a DSM protocol policy and DSM protocol library on top of the DSM page manager and DSM communication modules, which in turn sit on Madeleine communications and Marcel threads (PM2).
• DSM comm:
  – send page request
  – send page
  – send invalidate request
  – …
• DSM page manager:
  – set/get page owner
  – set/get page access
  – add/remove to/from copyset
  – …

DSM Implementation
• Node-level caches.
• Page-based and home-based protocol.
• Use page faults to detect remote objects.
• Log modifications made to remote objects.
• Each node allocates objects from a different range of the virtual address space.

Benchmarking
• Two Linux 2.2 clusters:
  – twelve 200 MHz Pentium Pro processors connected by a Myrinet switch and using BIP.
  – six 450 MHz Pentium II processors connected by an SCI network and using SISCI.
• gcc 2.7.2.3 with -O6.
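For reference, here is a minimal, hypothetical sketch of a multithreaded Pi-by-intervals program of the kind measured in the results that follow; the class structure and names are invented for illustration and this is not the actual benchmark source. Each thread computes a partial sum over its share of the 50M intervals, and the partial results are combined after join().

```java
// Multithreaded Pi-by-intervals sketch (illustrative only). Each worker
// integrates 4/(1+x^2) over a cyclic share of the intervals; the partial
// sums are combined after join(), so inter-thread communication happens
// only at thread start and termination.
public class Pi {
    static final int INTERVALS = 50000000;          // 50M intervals, as in the results

    static class Worker extends Thread {
        final int id, nthreads;
        double partial;                              // written only by this thread

        Worker(int id, int nthreads) { this.id = id; this.nthreads = nthreads; }

        public void run() {
            double h = 1.0 / INTERVALS;
            double sum = 0.0;
            for (int i = id; i < INTERVALS; i += nthreads) {
                double x = h * (i + 0.5);            // midpoint of interval i
                sum += 4.0 / (1.0 + x * x);
            }
            partial = sum * h;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int nthreads = Integer.parseInt(args[0]);
        Worker[] workers = new Worker[nthreads];
        for (int i = 0; i < nthreads; i++) {
            workers[i] = new Worker(i, nthreads);
            workers[i].start();
        }
        double pi = 0.0;
        for (int i = 0; i < nthreads; i++) {
            workers[i].join();                       // join() makes 'partial' visible
            pi += workers[i].partial;
        }
        System.out.println("pi = " + pi);
    }
}
```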
[Results charts: execution time in seconds vs. number of nodes (1–12) on both clusters (200 MHz/BIP and 450 MHz/SCI) for Pi (50M intervals), Jacobi (1024x1024), Traveling Salesperson (17 cities), All-pairs Shortest Path (2K nodes), and Barnes-Hut (16K bodies).]

Current Work
• Comparing Hyperion to mpiJava.
• mpiJava is a set of JNI wrappers to MPI.
• Using the Java Grande Forum benchmarks.
• mpiJava is implemented on top of the single-node version of Hyperion.
• This controls for the quality of the bytecode implementation.

Problems with JGF Benchmarks
• Written with SMP hardware in mind.
  – bogus synchronization.
  – all data allocated by one thread.
• SMP is not the right model!

An Alternative Model
• The programmer should be aware of the memory hierarchy.
• Do not require a "magic" implementation.
• The thread is the correct level of abstraction:
  – If an object was created by a thread, then the object is "near" the thread.
  – Otherwise the object might be "far" from the thread.

Efficiency and Portability
• Will not hurt on SMP hardware and may even help.
• The implementation can be straightforward.
• "Magic" implementations are also possible.
• Encourages portability across different implementations and hardware.

Other Lessons Learned
• in-line checks vs. page faults
• network reactivity
• System.arraycopy

In-line Checks vs. Page Faults
• An earlier version of Hyperion used in-line checks to detect remote objects.
• For our benchmarks, using page faults was always better.
• Local accesses are free.
• Remote accesses are more expensive.
• But most accesses are local!

Network Reactivity
• A fetch of a remote object is implemented by an asynchronous message to the home node.
• The message is handled by a service thread on the home node.
• When the message arrives, the service thread needs to be scheduled.
• We need integration of the network layer and the thread scheduler.

Short-term Solution
• Over-synchronize:
  – use a BSP programming style
  – distinct phases for communication and computation
  – phases separated by barrier synchronization
  – so only the service thread is ready to run during the communication phase

System.arraycopy
• The native implementation can transmit data in units that are bigger than a page.
• It requires an in-line check, but the check is usually amortized over a large amount of data.

Conclusions
• Java threads is an attractive vehicle for parallel programming.
• Is Java serial execution fast enough?
  – Need true multi-dimensional arrays?
• Need a clarified memory model.
• Need an extended thread API.
• Programmers need to be aware of the memory hierarchy.
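As a closing illustration of the System.arraycopy point above, here is a small hypothetical sketch (the array names and sizes are invented) contrasting an element-by-element loop, which on a cluster JVM pays a remote-object check or page fault per access, with a single bulk copy whose one in-line check is amortized over the whole transfer and which the native implementation can move in units larger than a page.

```java
// Illustrative contrast for the System.arraycopy discussion (names and sizes
// are made up, not taken from the Hyperion benchmarks).
public class CopyDemo {
    public static void main(String[] args) {
        double[] src = new double[1 << 20];
        double[] dst = new double[1 << 20];

        // Element-by-element copy: simple, but one access (and on a cluster JVM
        // potentially one remote-object check) per element.
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }

        // Bulk copy: a single call whose one check is amortized over the whole
        // array, letting a native implementation transmit the data in large units.
        System.arraycopy(src, 0, dst, 0, src.length);
    }
}
```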