“Towards an SSI for HP Java”
Francis Lau, The University of Hong Kong
With contributions from C.L. Wang, Ricky Ma, and W.Z. Zhu
ICPP-HPSECA03, 7/10/2003

Cluster Coming of Age
• HPC
  – Cluster the de facto standard equipment
  – Grid?
• Clusters
  – Fortran or C + MPI the norm
  – 99% on top of bare-bone Linux or the like
  – OK if the application is embarrassingly parallel and regular

Cluster for the Mass
• Two modes:
  – For number crunching in Grande-type applications (superman)
  – As a CPU farm to support high-throughput computing (poor man)
• Example workloads:
  – Commercial: Data Mining, Financial Modeling, Oil Reservoir Simulation, Seismic Data Processing, Vehicle and Aircraft Simulation
  – Government: Nuclear Stockpile Stewardship, Climate and Weather, Satellite Image Processing, Forces Modeling
  – Academic: Fundamental Physics (particles, relativity, cosmology), Biochemistry, Environmental Engineering, Earthquake Prediction

Cluster Programming
• Auto-parallelization tools have limited success
• Parallelization a chore but “have to do it” (or let’s hire someone)
• Optimization for performance not many users’ cup of tea
  – Partitioning and parallelization
  – Mapping
  – Remapping (experts?)

Amateur Parallel Programming
• Common problems
  – Poor parallelization: few large chunks or many small chunks
  – Load imbalances: large and small chunks
• Meeting the amateurs half-way
  – They do crude parallelization
  – System does the rest: mapping/remapping (automatic optimization)
  – And I/O?

Automatic Optimization
• “Feed the fat boy with two spoons, and a few slim ones with one spoon”
• But load information could be elusive
• Need smart runtime supports
• Goal is to achieve high performance with good resource utilization and load balancing
• Large chunks that are single-threaded are a problem

The Good “Fat Boys”
• Large chunks that span multiple nodes
• Must be a program with multiple execution “threads”
• Threads can be in different nodes – program expands and shrinks
• Threads/programs can roam around – dynamic migration
• This encourages fine-grain programming
(Diagram: a cluster node hosting part of an “amoeba” program that spans several nodes)

Mechanism and Policy
• Mechanism for migration
  – Traditional process migration
  – Thread migration
• Redirection of I/O and messages
• Object sharing between nodes for threads
• Policy for good dynamic load balancing
  – Message traffic a crucial parameter
  – Predictive
• Towards the “single system image” ideal

Single System Image
• If the user does only crude parallelization and the system does the rest …
• If processes/threads can roam, and processes expand/shrink …
• If I/O (including sockets) can be at any node anytime …
• We achieve at least 50% of SSI – the rest is difficult
(SSI facets: Single Entry Point, File System, Virtual Networking, I/O and Memory Space, Process Space, Management / Programming View, …)

Bon Java!
• Java (for HPC) in good hands
  – JGF Numerics Working Group, IBM Ninja, …
  – JGF Concurrency/Applications Working Group (benchmarking, MPI, …)
  – The workshops
• Java has many advantages (vs. Fortran and C/C++)
• Performance not an issue any more
• Threads as first-class citizens!
• JVM can be modified
• “Java has the greatest potential to deliver an attractive productive programming environment spanning the very broad range of tasks needed by the Grande programmer” – The Java Grande Forum Charter
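The “threads as first-class citizens” point underpins the rest of the talk: the user does only a crude decomposition into ordinary Java threads, and a suitable runtime is then free to place or migrate them. A minimal plain-JDK sketch of that style follows; it uses nothing specific to the systems presented later, and the chunk count and problem size are arbitrary.

    // Crude decomposition into many ordinary threads; a DJVM could place or
    // migrate them transparently. Plain JDK code, nothing system-specific.
    public class PartialSums {
        public static void main(String[] args) throws InterruptedException {
            final int chunks = 16;               // crude decomposition by the user
            final int n = 1000000;
            final long[] partial = new long[chunks];

            Thread[] workers = new Thread[chunks];
            for (int c = 0; c < chunks; c++) {
                final int id = c;
                workers[c] = new Thread(() -> {  // each chunk is an ordinary thread
                    long sum = 0;
                    for (int i = id; i < n; i += chunks) sum += i;
                    partial[id] = sum;
                });
                workers[c].start();
            }
            long total = 0;
            for (int c = 0; c < chunks; c++) {
                workers[c].join();               // join gives the needed visibility
                total += partial[c];
            }
            System.out.println(total);           // n*(n-1)/2 = 499999500000
        }
    }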
Process vs. Thread Migration
• Process migration easier than thread migration
  – Threads are tightly coupled
  – They share objects
• Two styles to explore
  – Process, MPI (“distributed computing”)
  – Thread, shared objects (“parallel computing”)
  – Or combined
• Boils down to messages vs. distributed shared objects

Two Projects @ HKU
• M-JavaMPI – “M” for “Migration”
  – Process migration
  – I/O redirection
  – Extension to grid
  – No modification of JVM and MPI
• JESSICA – “Java-Enabled Single System Image Computing Architecture”
  – By modifying JVM
  – Thread migration, amoeba mode
  – Global object space, I/O redirection
  – JIT mode (Version 2)

Design Choices
• Bytecode instrumentation
  – Insert code into programs, manually or via pre-processor
• JVM extension
  – Make thread state accessible from the Java program
  – Non-transparent
  – Modification of JVM is required
• Checkpointing the whole JVM process
  – Powerful but heavy penalty
• Modification of JVM
  – Runtime support
  – Totally transparent to the applications
  – Efficient but very difficult to implement

M-JavaMPI
• Support transparent Java process migration and provide communication redirection services
• Communication using MPI
• Implemented as a middleware on top of a standard JVM
• No modifications of JVM and MPI
• Checkpointing the Java process + code insertion by preprocessor

System Architecture (layers, top to bottom)
• Java MPI program (Java bytecode)
• Preprocessing layer (inserts exception handlers) – Java .class files are modified by inserting an exception handler in each method of each class; the handler is used to restore the process state
• Java API / Java-MPI API – provides the MPI wrapper for the Java program
• Migration layer (save and restore process) – process and object information are saved and restored using object serialization, reflection and exceptions, through JVMDI (the debugger interface in Java 2), which is used to retrieve and restore the process state
• JVM
• Restorable MPI layer – provides restorable MPI communication through MPI daemons
• Native MPI – supports low-latency and high-bandwidth data communication
• OS
• Hardware
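The migration layer above captures the data part of a process with standard Java object serialization. The following self-contained sketch shows only that mechanism; the WorkerState class and its fields are invented for illustration, and JVMDI, the preprocessor and the MPI daemons are outside its scope.

    // Capturing and restoring object state with plain Java serialization,
    // the same building block the migration layer relies on for process data.
    import java.io.*;
    import java.util.Arrays;

    public class SerializationSketch {
        static class WorkerState implements Serializable {
            int iteration;
            double[] partialResult;
            WorkerState(int iteration, double[] partialResult) {
                this.iteration = iteration;
                this.partialResult = partialResult;
            }
        }

        static byte[] capture(WorkerState s) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(s);                 // source node: save process data
            }
            return bos.toByteArray();
        }

        static WorkerState restore(byte[] bytes) throws IOException, ClassNotFoundException {
            try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                return (WorkerState) ois.readObject();   // destination node: rebuild it
            }
        }

        public static void main(String[] args) throws Exception {
            WorkerState before = new WorkerState(42, new double[] {1.5, 2.5});
            WorkerState after = restore(capture(before));
            System.out.println(after.iteration + " " + Arrays.toString(after.partialResult));
        }
    }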
Preprocessing
• Bytecode is modified before being passed to the JVM for execution
• “Restoration functions” are inserted as exception handlers, in the form of encapsulated “try-catch” statements
• Re-arrangement of bytecode, and addition of local variables

The Layers
• Java-MPI API layer
• Restorable MPI layer
  – Provides restorable MPI communications
  – No modification of the MPI library
• Migration layer
  – Captures and saves the execution state of the migrating process in the source node, and restores the execution state of the migrated process in the destination node
  – Cooperates with the Restorable MPI layer to reconstruct the communication channels of the parallel application

State Capturing and Restoring
• Program code: re-used in the destination node
• Data: captured and restored by using the object serialization mechanism
• Execution context: captured by using JVMDI and restored by inserted exception handlers
• Eager (all) strategy: for each frame, local variables, referenced objects, the name of the class and class method, and the program counter are saved using object serialization

State Capturing using JVMDI
Original class:

    public class A {
        int a;
        char b;
        …
    }

After preprocessing, each method carries a restoration handler (pseudocode):

    public class A {
        try {
            …
        } catch (RestorationException e) {
            a  = saved value of local variable a;
            b  = saved value of local variable b;
            pc = saved value of the program counter when the program was suspended;
            jump to the location where the program was suspended;
        }
    }

Message Redirection Model
• An MPI daemon in each node supports message passing between distributed Java processes
• IPC between the Java program and the MPI daemon in the same node through shared memory and semaphores (client-server)
(Diagram: on each node the Java program talks to its Java-MPI API, which uses IPC to reach the local MPI daemon, linked with the native MPI library; the daemons communicate via MPI over the network)
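To make the capture-and-resume idea concrete, here is a runnable Java-level analogy. It is only an analogy: M-JavaMPI performs the transformation on bytecode, obtains the saved values through JVMDI rather than from a map, and jumps back to the suspended location instead of re-entering a loop. All names below are invented.

    // Java-level analogy of restoring saved locals and resuming a computation;
    // the loop index plays the role of the saved program counter.
    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;

    public class RestorationSketch {
        // Hypothetical container for the state captured at the source node.
        static class SavedFrame implements Serializable {
            final Map<String, Object> locals = new HashMap<>();
        }

        static long sumOfSquares(int n, SavedFrame saved) {
            int i = 0;
            long acc = 0;
            if (saved != null) {                     // restoration path (plays the role
                i   = (Integer) saved.locals.get("i");   // of the inserted catch block)
                acc = (Long) saved.locals.get("acc");
            }
            for (; i < n; i++) {
                acc += (long) i * i;
            }
            return acc;
        }

        public static void main(String[] args) {
            // Pretend the process was suspended at i == 500 with a partial sum.
            long partialSum = 0;
            for (int i = 0; i < 500; i++) partialSum += (long) i * i;
            SavedFrame saved = new SavedFrame();
            saved.locals.put("i", 500);
            saved.locals.put("acc", partialSum);

            long resumed = sumOfSquares(1000, saved);
            long direct  = sumOfSquares(1000, null);
            System.out.println(resumed == direct);   // true: restoration is transparent
        }
    }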
Process Migration Steps
(Participants: migration layer at the source node, MPI daemon at the source node, MPI daemon at the destination node, migration client at the destination node)
• Suspend the user process
• Capture the process state
• Send a migration request
• Broadcast migration information to all MPI daemons
• Send buffered messages
• JVM and process quit
• Notify the MPI daemon of the completion of capturing
• Send a notification message (and the captured process data, if a central file system is not used)
• Start an instance of the JVM with the JVMDI client at the destination node
• Send notification of the readiness of the captured process data
• The process is restarted and suspended
• Restoration of the execution state starts
• Execution of the migrated process is restored
(Legend in the original diagram: migration events and event triggers, split between the source node and the destination node)

Experiments
• PC cluster
  – 16-node cluster
  – 300 MHz Pentium II with 128MB of memory
  – Linux 2.2.14 with Sun JDK 1.3.0
  – 100Mb/s Fast Ethernet
• All Java programs executed in interpreted mode

Bandwidth: PingPong Test
• Native MPI: 10.5 MB/s
• Direct Java-MPI binding: 9.2 MB/s
• Restorable MPI layer: 7.6 MB/s
(Figure: bandwidth (MB/s) vs. message size (bytes) for native MPI, direct Java-MPI binding, and migratable Java-MPI)

Latency: PingPong Test
• Native MPI: 0.2 ms
• Direct Java-MPI binding: 0.23 ms
• Restorable MPI layer: 0.26 ms
(Figure: latency (s) vs. message size (bytes) for native MPI, direct Java-MPI binding, and migratable Java-MPI)

Migration Cost: Capturing and Restoring Objects
(Figure: time spent (ms, log scale) vs. data size (number of integers, 100 to 1,000K) for capturing time and restoring time)

Migration Cost: Capturing and Restoring Frames
(Figure: time spent (ms) vs. number of frames (0 to 600) for capture time and restore time)

Application Performance
• PI calculation
• Recursive ray-tracing
• NAS integer sort
• Parallel SOR

Time spent in calculating PI and ray-tracing with and without the migration layer
(Figure: execution time (sec) vs. number of nodes (1 to 8) for PI and ray-tracing, each with and without the migration layer)

Execution time of the NAS program with different problem sizes (16 nodes)

    Problem size           Without M-JavaMPI (sec)    With M-JavaMPI (sec)      Overhead of M-JavaMPI
    (no. of integers)      Total   Comp    Comm       Total   Comp    Comm      Total   Comm
    Class S: 65536         0.023   0.009   0.014      0.026   0.009   0.017     13%     21%
    Class W: 1048576       0.393   0.182   0.212      0.424   0.182   0.242     7.8%    14%
    Class A: 8388608       3.206   1.545   1.66       3.387   1.546   1.840     5.6%    11%

No noticeable overhead is introduced in the computation part, while the communication part shows an overhead of about 10-20%.

Time spent in executing SOR using different numbers of nodes with and without the migration layer
(Figure: execution time (sec) vs. number of nodes (1 to 8) for SOR with and without the migration layer)
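For reference, the SOR benchmark used in these measurements is the classic successive over-relaxation stencil. Below is a minimal sequential Java kernel of that kind; the grid size, relaxation factor and iteration count are arbitrary choices, and the benchmarked versions distribute the rows over MPI processes or Java threads.

    // Minimal sequential SOR kernel (Gauss-Seidel sweep with over-relaxation).
    public class SorSketch {
        public static void main(String[] args) {
            final int n = 258;                 // 256x256 interior plus a boundary ring
            final double omega = 1.25;         // relaxation factor (arbitrary choice)
            double[][] g = new double[n][n];
            for (int i = 0; i < n; i++) {      // simple boundary condition
                g[i][0] = 1.0;
                g[i][n - 1] = 1.0;
            }
            for (int iter = 0; iter < 100; iter++) {
                for (int i = 1; i < n - 1; i++) {
                    for (int j = 1; j < n - 1; j++) {
                        double stencil = (g[i - 1][j] + g[i + 1][j]
                                        + g[i][j - 1] + g[i][j + 1]) / 4.0;
                        g[i][j] += omega * (stencil - g[i][j]);
                    }
                }
            }
            System.out.println("g[128][128] = " + g[128][128]);
        }
    }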
Cost of Migration
Time spent in executing the SOR program on an array of size 256x256 without and with one migration during the execution:

    No. of nodes    No migration (sec)    One migration (sec)
    1               1013                  1016
    2               518                   521
    4               267                   270
    6               176                   178
    8               141                   144

Cost of Migration
• Time spent in migration (in seconds) for different applications:

    Application     Average migration time (sec)
    PI              2
    Ray-tracing     3
    NAS             2
    SOR             3

Dynamic Load Balancing
• A simple test
  – The SOR program was executed using six nodes in an unevenly loaded environment, with one of the nodes executing a computationally intensive program
• Without migration: 319 s
• With migration: 180 s

In Progress
• M-JavaMPI in JIT mode
• Develop system modules for automatic dynamic load balancing
• Develop system modules for effective fault-tolerant supports

Java Virtual Machine
• Class loader
  – Loads class files (application class files and Java API class files)
• Interpreter
  – Executes bytecode
• Runtime compiler
  – Converts bytecode to native code
(Diagram: the class loader produces bytecode, which is either executed by the interpreter or translated into native code by the runtime compiler)

A Multithreaded Java Program

    public class ProducerConsumerTest {
        public static void main(String[] args) {
            CubbyHole c = new CubbyHole();
            Producer p1 = new Producer(c, 1);
            Consumer c1 = new Consumer(c, 1);
            p1.start();
            c1.start();
        }
    }

(Diagram: threads in the JVM – each thread has its own stack of frames and program counter; code lives in the Java method area, objects in the heap; the class loader and execution engine complete the picture)

Java Memory Model
(How to maintain memory consistency between threads)
• Each thread (T1, T2) has a per-thread working memory; the heap area in main memory holds the master copy of each variable
• T1 loads the variable from main memory into its working memory before use
• The variable is modified in T1’s working memory
• When T1 performs an unlock, the variable is flushed back to main memory
• When T2 performs a lock and then uses the variable, it will be loaded from main memory

Problems in Existing DJVMs
• Mostly based on interpreters
  – Simple but slow
• Layered design using a distributed shared memory system (DSM) cannot be tightly coupled with the JVM
  – JVM runtime information cannot be channeled to the DSM
  – False sharing if a page-based DSM is employed
  – Page faults block the whole JVM
• Programmer has to specify the thread distribution – lack of transparency
  – Need to rewrite multithreaded Java applications
  – No dynamic thread distribution (preemptive thread migration) for load balancing

Related Work
• Method shipping: IBM cJVM
  – Like remote method invocation (RMI): when accessing object fields, the proxy redirects the flow of execution to the node where the object’s master copy is located
  – Executed in interpreter mode
  – Load balancing problem: affected by the object distribution
• Page shipping: Rice U. Java/DSM, HKU JESSICA
  – Simple; the GOS is supported by some page-based distributed shared memory (e.g., TreadMarks, JUMP, JiaJia)
  – JVM runtime information can’t be channeled to the DSM
  – Executed in interpreter mode
• Object shipping: Hyperion, Jackal
  – Leverage some object-based DSM
  – Executed in native mode: Hyperion translates Java bytecode to C; Jackal compiles Java source code directly to native code
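The ProducerConsumerTest slide earlier shows only main(). For readers who want to run it, here is a completion in the style of the classic Java tutorial producer-consumer example; the CubbyHole, Producer and Consumer bodies are our assumption, not the authors' code. The synchronized methods are also where the memory-model rules above apply: acquiring the lock forces the working copy to be re-read from main memory, and releasing it flushes the copy back.

    // Runnable completion of the ProducerConsumerTest fragment (assumed bodies).
    public class ProducerConsumerTest {
        static class CubbyHole {
            private int contents;
            private boolean available = false;
            synchronized int get() {
                while (!available) {
                    try { wait(); } catch (InterruptedException ignored) {}
                }
                available = false;
                notifyAll();
                return contents;
            }
            synchronized void put(int value) {
                while (available) {
                    try { wait(); } catch (InterruptedException ignored) {}
                }
                contents = value;
                available = true;
                notifyAll();
            }
        }
        static class Producer extends Thread {
            private final CubbyHole hole; private final int id;
            Producer(CubbyHole c, int id) { this.hole = c; this.id = id; }
            public void run() {
                for (int i = 0; i < 5; i++) hole.put(i);
            }
        }
        static class Consumer extends Thread {
            private final CubbyHole hole; private final int id;
            Consumer(CubbyHole c, int id) { this.hole = c; this.id = id; }
            public void run() {
                for (int i = 0; i < 5; i++)
                    System.out.println("Consumer " + id + " got " + hole.get());
            }
        }
        public static void main(String[] args) {
            CubbyHole c = new CubbyHole();
            Producer p1 = new Producer(c, 1);
            Consumer c1 = new Consumer(c, 1);
            p1.start();
            c1.start();
        }
    }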
Distributed Java Virtual Machine (DJVM)
• JESSICA2: a distributed Java virtual machine (DJVM) spanning multiple cluster nodes can provide a true parallel execution environment for multithreaded Java applications, with a single-system-image illusion to Java threads
(Diagram: Java threads created in a program run on top of a Global Object Space that spans the OSes of all cluster nodes, connected by a high-speed network)

JESSICA2 Main Features
• Transparent Java thread migration
  – Runtime capturing and restoring of thread execution context
  – No source code modification; no bytecode instrumentation (preprocessing); no new API introduced
  – Enables dynamic load balancing on clusters
• Operated in Just-In-Time (JIT) compilation mode
• Global Object Space (GOS)
  – A shared global heap spanning all cluster nodes
  – Adaptive object home migration protocol
  – I/O redirection

Transparent Thread Migration in JIT Mode
• Simple for interpreters (e.g., JESSICA)
  – The interpreter sits in the bytecode decoding loop, which can be stopped upon checking a migration flag
  – The full state of a thread is available in the data structures of the interpreter
  – No register allocation
• JIT-mode execution makes things complex (JESSICA2)
  – Native code has no clear bytecode boundary
  – How to deal with machine registers?
  – How to organize the stack frames (all are in native form now)?
  – How to make extracted thread states portable and recognizable by the remote JVM?
  – How to restore the extracted states (rebuild the stack frames) and restart the execution in native form?
  – Need to modify the JIT compiler to instrument native code

An Overview of JESSICA2 Java Thread Migration
(Diagram: (1) the thread scheduler, guided by the load monitor, alerts a thread; (2) the migration manager performs stack analysis and stack capturing, parsing the frames (PC and locals) on the source node; (3) execution is restored on the destination node, where (4a) objects are accessed through the GOS heap and (4b) methods are loaded into the method area from NFS)

Essential Functions
• Migration points selection
  – At the start of a loop, basic block or method
• Register context handler
  – Spill dirty registers at the migration point without invalidation, so that native code can continue to use the registers
  – Use a register-recovering stub at the restoring phase
• Variable type deduction
  – Spill types in stacks using compression
• Java frames linking
  – Discover consecutive Java frames

Dynamic Thread State Capturing and Restoring in JESSICA2
• The JIT pipeline (bytecode verifier, migration point selection, register allocation, bytecode translation, code generation, linking and constant resolution) is extended to: 1. add migration checking, 2. add object checking, 3. add type and register spilling
• The generated native code contains checks and spills such as:

    cmp mflag, 0
    jz ...
    cmp obj[offset], 0
    jz ...
    mov 0x110182, slot
    ...

• Capturing scans the native thread stack (Java frames and C frames) and reaches objects through the global object access path; restoring rebuilds register contents from the spilled slots via a register-recovering stub:

    mov slot1 -> reg1
    mov slot2 -> reg2
    ...

How to Maintain Memory Consistency in a Distributed Environment?
(Diagram: threads T1-T8 run against separate heaps on different cluster nodes connected by a high-speed network; their heaps must behave as one)
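At the Java level, the effect of the inserted migration checks can be pictured as follows. This is only an analogy and every name in it is invented: the real system emits the equivalent native code (the cmp mflag / jz sequence above), spills locals and types into the compiled frame rather than into a map, and rebuilds the frame on the remote JVM.

    // Java-level analogy of a migration check inserted at a loop head.
    import java.util.HashMap;
    import java.util.Map;

    public class MigrationCheckSketch {
        static volatile boolean migrationRequested = false;   // plays the role of "mflag"

        // State spilled at a migration point so a remote JVM could rebuild the frame.
        static Map<String, Object> spillFrame(int i, double acc) {
            Map<String, Object> frame = new HashMap<>();
            frame.put("i", i);                  // local variables with their values
            frame.put("acc", acc);
            frame.put("pc", "loop-head");       // bytecode position of the migration point
            return frame;
        }

        static double compute(int n) {
            double acc = 0.0;
            for (int i = 0; i < n; i++) {       // migration point: start of the loop
                if (migrationRequested) {
                    Map<String, Object> frame = spillFrame(i, acc);
                    // here the thread would be captured and restarted remotely;
                    // in this sketch we just print the spilled state and carry on
                    System.out.println("would migrate with frame " + frame);
                    migrationRequested = false;
                }
                acc += 1.0 / (i + 1);
            }
            return acc;
        }

        public static void main(String[] args) {
            migrationRequested = true;          // trigger one spill for demonstration
            System.out.println(compute(1000000));
        }
    }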
Embedded Global Object Space (GOS)
• Take advantage of JVM runtime information for optimization (e.g., object types, accessing threads, etc.)
• Use a threaded I/O interface inside the JVM for communication to hide the latency – non-blocking GOS access
• OO-based (per-object granularity) to reduce false sharing
• Home-based, compliant with the JVM memory model (“lazy release consistency”)
• Master heap (home objects) and cache heap (local and cached objects): reduce object access latency

Object Cache
(Diagram: each JVM keeps a master heap area for the objects it is home for and a cache heap area for cached copies, both reached through hash tables by the local Java threads; together the heaps form the global heap)

Adaptive Object Home Migration
• Definition
  – The “home” of an object = the JVM that holds the master copy of the object
• Problem
  – Cache objects need to be flushed and re-fetched from the home whenever synchronization happens
• Adaptive object home migration
  – If the number of accesses from one thread dominates the total number of accesses to an object, the object’s home is migrated to the node where that thread is running

I/O Redirection
• Timer
  – Use the time in the master node as the standard time
  – Calibrate the time in the worker nodes when they register with the master node
• File I/O
  – Use half a word of the “fd” as the node number
  – Open: for read, check the local node first, then the master node; for write, go to the master node
  – Read/write: go to the node specified by the node number in the fd
• Network I/O
  – Connectionless send: do it locally
  – Others: go to the master

Experimental Setting
• Modified Kaffe Open JVM version 1.0.6
• Linux PC clusters
  1. Pentium II PCs at 540MHz (Linux 2.2.1 kernel) connected by Fast Ethernet
  2. HKU Gideon 300 cluster (for the ray-tracing demo)

Parallel Ray Tracing on JESSICA2
(Using 64 nodes of the Gideon 300 cluster; Linux 2.4.18-3 kernel, Red Hat 7.3)
• 64 nodes: 108 seconds
• 1 node: 4420 seconds (~1 hour)
• Speedup = 4402/108 = 40.75

Micro Benchmarks
• Time breakdown of thread migration (PI calculation): capture time, parsing time, native thread creation, resolution of methods, frame setup time
(Figure: breakdown chart of these components)

Java Grande Benchmark
(Figure: Java Grande benchmark results on a single node, Kaffe 1.0.6 JIT vs. JESSICA2, over benchmarks including Barrier, ForkJoin, Sync, Crypt, LUFact, SOR, Series and SparseMatmult)

SPECjvm98 Benchmark
• “M-”: migration mechanism disabled; “M+”: migration enabled
• “I-”: pseudo-inlining disabled; “I+”: pseudo-inlining enabled
(Figure: SPECjvm98 results under these configurations)

JESSICA2 vs JESSICA (CPI)
(Figure: execution time (ms) of the CPI program (50,000,000 iterations) on 2, 4 and 8 nodes, JESSICA vs. JESSICA2)

Application Performance
(Figure: speedup vs. number of nodes (2, 4, 8) for CPI, TSP, Raytracer and nBody, against the linear-speedup line)

Effect of Adaptive Object Home Migration (SOR)
(Figure: SOR execution time (ms) on 2, 4 and 8 nodes with adaptive object home migration enabled vs. disabled)

Work in Progress
• New optimization techniques for GOS
• Incremental distributed GC
• Load balancing module
• Enhanced single I/O space to benefit more real-life applications
• Parallel I/O support
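A small sketch of the adaptive home-migration heuristic described a few slides back: per-object access counters decide when the master copy should move to the node whose thread dominates the accesses. The threshold values and class names are our assumptions; in JESSICA2 this bookkeeping lives inside the GOS rather than in application code, and a real implementation would need proper synchronization around the counters.

    // Access counting and home migration decision for one shared object.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    public class HomeMigrationSketch {
        static final double DOMINANCE = 0.8;     // fraction of accesses from one node
        static final long MIN_SAMPLES = 100;     // don't decide on too few accesses

        static class ObjectHome {
            volatile int homeNode;
            final Map<Integer, LongAdder> accessesByNode = new ConcurrentHashMap<>();
            long total;

            ObjectHome(int homeNode) { this.homeNode = homeNode; }

            // Would be called by the GOS on every access to the object.
            void recordAccess(int node) {
                accessesByNode.computeIfAbsent(node, n -> new LongAdder()).increment();
                if (++total % 64 == 0) maybeMigrate();   // check only occasionally
            }

            private void maybeMigrate() {
                if (total < MIN_SAMPLES) return;
                for (Map.Entry<Integer, LongAdder> e : accessesByNode.entrySet()) {
                    long n = e.getValue().sum();
                    if (n >= DOMINANCE * total && e.getKey() != homeNode) {
                        homeNode = e.getKey();   // move the master copy to this node
                        accessesByNode.clear();  // restart the statistics
                        total = 0;
                        return;
                    }
                }
            }
        }

        public static void main(String[] args) {
            ObjectHome o = new ObjectHome(0);
            for (int i = 0; i < 200; i++) o.recordAccess(i % 10 == 0 ? 0 : 3);
            System.out.println("home is now node " + o.homeNode);   // expected: 3
        }
    }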
Conclusion
• Effective HPC for the mass
  – They supply the (parallel) program, the system does the rest
  – Let’s hope for parallelizing compilers
  – Small- to medium-grain programming
  – SSI the ideal
  – Java the choice
  – Poor man mode too
• Thread distribution and migration feasible
• Overhead reduction
  – Advances in low-latency networking
  – Migration as an intrinsic function (JVM, OS, hardware)
• Grid and pervasive computing

Some Publications
• W.Z. Zhu, C.L. Wang, and F.C.M. Lau, “A Lightweight Solution for Transparent Java Thread Migration in Just-in-Time Compilers”, ICPP 2003, Taiwan, October 2003.
• W.J. Fang, C.L. Wang, and F.C.M. Lau, “On the Design of Global Object Space for Efficient Multi-threading Java Computing on Clusters”, Parallel Computing, to appear.
• W.Z. Zhu, C.L. Wang, and F.C.M. Lau, “JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support”, CLUSTER 2002, Chicago, September 2002, 381-388.
• R. Ma, C.L. Wang, and F.C.M. Lau, “M-JavaMPI: A Java-MPI Binding with Process Migration Support”, CCGrid 2002, Berlin, May 2002.
• M.J.M. Ma, C.L. Wang, and F.C.M. Lau, “JESSICA: Java-Enabled Single-System-Image Computing Architecture”, Journal of Parallel and Distributed Computing, Vol. 60, No. 10, October 2000, 1194-1222.

THE END. And Thanks!