JIT-Compiler-Assisted Distributed Java Virtual Machine
Wenzhang Zhu, Cho-Li Wang, Weijian Fang and Francis C. M. Lau
The Systems Research Group
Department of Computer Science and Information Systems
The University of Hong Kong
Presented by Cho-Li Wang
TCHPC 2004, Taiwan, Mar, 2004

Outline
- Distributed Java Virtual Machine (DJVM)
- Design tradeoffs
- Related work
- JESSICA2 DJVM
- JIT-compiler-assisted dynamic thread migration
- Global Object Space (GOS) for location-transparent object access
- Experimental results + a demo
- Conclusion & future work

Distributed Java Virtual Machine (DJVM)

    import java.util.*;

    // Each worker sums 0..n-1 in its own thread.
    class worker extends Thread {
        private long n;

        public worker(long N) { n = N; }

        public void run() {
            long sum = 0;
            for (long i = 0; i < n; i++) sum += i;
            System.out.println("N=" + n + " Sum=" + sum);
        }
    }

    public class test {
        static final int N = 100;

        public static void main(String args[]) {
            worker[] w = new worker[N];
            Random r = new Random();
            // Note: nextLong() may be negative (empty loop) or very large.
            for (int i = 0; i < N; i++) w[i] = new worker(r.nextLong());
            for (int i = 0; i < N; i++) w[i].start();
            // Wait for all workers to finish.
            try {
                for (int i = 0; i < N; i++) w[i].join();
            } catch (Exception e) {}
        }
    }

- A distributed Java Virtual Machine (DJVM) consists of a group of extended JVMs running on a distributed environment to support true parallel execution of a multithreaded Java application.
- A DJVM provides all the JVM services, compliant with the Java language specification.
- A DJVM provides the illusion that the program is running on a single (yet more powerful) machine: Single System Image (SSI).
(Diagram labels: Java threads; Single System Image; DJVM with bytecode execution engine, heap, thread and class components spanning several JVMs.)

Design Tradeoffs of a DJVM
(Diagram labels: Thread Sched, Exec Engine, Heap.)
- How to manage the threads?
    - Distributed thread scheduling
    - Initial thread placement vs. migration
- How to store the data?
    - Object store: a global heap shared by threads?
    - Memory consistency: Java memory model?
    - Can an off-the-shelf DSM be used? Or others?
- How to process the bytecode?
    - Execution engine: interpretation, Just-in-Time (JIT) compilation, or static compilation
    - High performance?

Related work
- cJVM (IBM Haifa Research)
    - Interpreter mode execution
    - Embedded OO-based DSM (proxy)
    - Diagram labels: remote creation, interpreter, embedded OO-based DSM (proxy)
- JAVA/DSM (Rice University)
    - Interpreter mode execution
    - Heap built on top of a page-based DSM
    - Diagram labels: manual distribution, interpreter, page-based DSM
- JESSICA (HKU)
    - Thread migration
    - Interpreter mode execution
    - Heap built on top of a page-based DSM
    - Diagram labels: transparent migration, interpreter, page-based DSM
- Jackal, Hyperion
    - Static compilation
    - Link to an object-based DSM
    - Diagram labels: remote creation, static compilation, OO-based DSM

JESSICA2 (Java-Enabled Single-System-Image Computing Architecture)
(Diagram: a multithreaded Java program runs on one master JESSICA2 JVM and several worker JESSICA2 JVMs, all in JIT compiler mode; threads migrate between them as portable Java frames.)
- Global Object Space: a shared global heap spanning all cluster nodes

JESSICA2 Main Features
- Cluster-aware bytecode execution engine (JITEE)
    - JVM operated in Just-In-Time (JIT) compilation mode
    - Cluster-aware: global naming scheme for threads, objects, ...
- JIT-compiler-assisted dynamic thread migration
    - Runtime capturing and restoring of thread execution context
    - No source code modification; no bytecode instrumentation (preprocessing); no new API introduced
    - Enables dynamic load balancing
- Global Object Space (GOS)
    - Provides location-transparent object access for threads
    - Tightly integrated with the JVM
    - Memory consistency: compliant with the Java Memory Model (JMM)
    - Various optimizing schemes: adaptive migrating home, synchronized method shipping, object pushing
- I/O redirection

JESSICA2 thread migration (in a JIT-enabled JVM)
- RTC: Raw Thread Context
- BTC: Bytecode-oriented Thread Context = thread id + Java frames (class name, method signature, PC, operand stack pointer, local vars, ...)
- Transformation of the RTC into the BTC happens directly inside the JIT compiler
(Diagram: on the source node, the load monitor (1) alerts the thread scheduler and migration manager, which perform stack analysis and stack capturing to turn the thread's RTC frames into a BTC; (2) the frames are sent to the destination node, where frame parsing against the JVM method area and PC rebuilds the RTC and (3) execution is restored.)

Thread Stack Transformation
Raw Thread Context (RTC) on the source node (%esp: stack pointer):
    %esp:    0x00000000
    %esp+4:  0x082ca809
    %esp+8:  0x08225400
    %esp+12: 0x08266bc0
Stack capturing produces the Bytecode-oriented Thread Context (BTC):
    Frames{
    method CPI::run()V@111
    local=13;stack=0;
    var:
    arg0:CPI, 33, 0x8225400
    local1: [D; 33, 0x8266bc0@2
    local2: int, 2;
    ...
(Annotations: "CPI::run()V" is the method id and "@111" the bytecode program counter; "[" means array and "D" means double; "33" and "@2" are marked as node ids.)
Stack restoration rebuilds the RTC on the destination node:
    %esp:    0x00000000
    %esp+4:  0x086243c
    %esp+8:  0x08623200
    %esp+12: 0x08293010
    ...
    %eax = 0x08623200
    %ebx = 0x08293010

Thread State Capturing: Details
- Migration points: (1) head of a basic block (loop); (2) before a method invocation
- Construct control flow graph
- JIT stages (diagram labels): bytecode verifier, invoke, bytecode translation, intermediate code
- Instrumentation steps:
    1. Add migration checking code (cmp mflag,0)
    2. Add object checking (local or remote object)
    3. Add type and register spilling
- Further JIT stages (diagram labels): code generation, Global Object Space, native code, linking & constant resolution
- Java frame detection on the raw thread stack: Java frames vs. C frames

Restoring: Dynamic Register Patching (on i386 Architecture)
- Small code stubs rebuild the register context, one per frame:
    frame 1: reg1 <- value1; jmp restore_point1
    frame 0: reg1 <- value1; reg2 <- value2; jmp restore_point0
- Compiled methods (native code) contain the restore points:
    Method1(){ ... restore_point1: }
    Method0(){ ... restore_point0: }
- Stack layout (in the direction of stack growth): bootstrap frame, trampoline frame, frame 0, frame 1; each frame holds %ebp and a return address
    bootstrap(){ trampoline(); closing handler(); }
- %ebp: i386 frame pointer
- "Ret Addr": return address of the current function call

Global Object Space (GOS)
- Provides a global heap abstraction for the DJVM
- Home-based object coherence protocol, compliant with the Java Memory Model (JMM)
    - OO-based to reduce false sharing
- Non-blocking communication
    - Uses a threaded I/O interface inside the JVM for communication to hide latency
- Adaptive object home migration mechanism
    - Takes advantage of JVM runtime information for optimization
- Optimizations: home migration, synchronized method shipping, object pushing

Experimental environment
- HKU Gideon 300 Linux cluster: 300 P4 PCs (2 GHz, 512 MB RAM, 40 GB disk)
- Network: 312-port Foundry FastIron 1500 non-blocking switch (100 Mbit/s)
- Kaffe JVM version 1.0.6; Linux kernel 2.4.18-3 (Red Hat 7.3)

Migration overhead during normal execution (SPECJVM98 benchmark)

    Benchmark    Time (seconds)                   Space (native code/bytecode)
                 No migration   Migration         No migration   Migration
    compress     11.31          11.39 (+0.71%)    6.89           7.58 (+10.01%)
    jess         30.48          30.96 (+1.57%)    6.82           8.34 (+22.29%)
    raytrace     24.47          24.68 (+0.86%)    7.47           8.49 (+13.65%)
    db           35.49          36.69 (+3.38%)    7.01           7.63 (+8.84%)
    javac        38.66          40.96 (+5.95%)    6.74           8.72 (+29.38%)
    mpegaudio    28.07          29.28 (+4.31%)    7.97           8.53 (+7.03%)
    mtrt         24.91          25.05 (+0.56%)    7.47           8.49 (+13.65%)
    jack         37.78          37.90 (+0.32%)    6.95           8.38 (+20.58%)
    Average                     (+2.21%)                         (+15.68%)

Migration overhead analysis
Overall migration latency (2-10 ms):

    Program (frame #)   LT(1)   CPI(1)   ASP(1)   N-Body(8)   SOR(2)
    Latency (ms)        4.997   2.680    4.678    10.803      8.467

Migration time breakdown (LT program):

    Frame #        1       2       4       6       8       10
    Var #          4       15      37      59      81      103
    Size (B)       201     417     849     1281    1713    2145
    Capture (us)   202     266     410     495     605     730
    Parse (us)     235     253     447     526     611     724
    Create (us)    360     360     360     360     360     360
    Compile (us)   478     575     847     1,169   1,451   1,720
    Build (us)     7       11      14      16      21      28
    Total (us)     1,282   1,465   2,078   2,566   3,048   3,562

GOS Optimizations (using 4 PCs)
(Figure: stacked bars, 0% to 100%, split into Obj, Syn and Comp, for ASP, SOR, Nbody and TSP under each of the configurations NO, H, HS and HSP.)
- NO = no optimizations
- H = home migration
- HS = home migration + synchronized method shipping
- HSP = HS + object pushing

Application benchmark
(Figure: speedup, 0 to 10, against number of nodes (2, 4, 8) for CPI, TSP, Raytracer and nBody, with a linear-speedup reference line.)

JESSICA2 vs JESSICA (CPI)
(Figure: execution time in ms, 0 to 250,000, of CPI with 50,000,000 iterations on JESSICA and JESSICA2 at 2, 4 and 8 nodes.)

Parallel Ray Tracing (using 64 nodes of the Gideon 300 cluster)
- Linux 2.4.18-3 kernel (Red Hat 7.3)
- 64 nodes: 108 seconds
- 1 node: 4402 seconds (about 1.2 hours)
- Speedup = 4402/108 ≈ 40.76

Demo Execution Steps
1. Create the display panel
2. Start the ray tracing program on node 26 with 8 threads
3. Add two more nodes: 27 and 28
4. Add 5 more nodes: 29, 30, 31, 32, 33

Conclusions
- Dynamic Java thread migration makes true parallel execution of Java threads possible and enables dynamic load balancing.
- Runtime ("Just-In-Time") code instrumentation for thread state capturing and restoring is feasible.
- An embedded GOS layer can take advantage of the JVM runtime information to reduce communication overhead.

Advantages of native code instrumentation
- Lightweight
    - Re-uses JIT compiler internal data structures and control flow analysis functions
    - Instrumented native code is more efficient than instrumented bytecode
- Transparent
    - No source code modification
    - No new API introduced
    - No preprocessing

Future work
- Advanced thread migration mechanism without overhead during normal execution
- Incremental distributed GC
- Enhanced single I/O space to benefit more real-life applications
- Parallel I/O support

Thanks
JESSICA2 webpage: http://www.csis.hku.hk/~clwang/projects/JESSICA2.html
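
Illustration: migration check at a loop head
The migration check (cmp mflag,0) and the bytecode-oriented thread context from the thread migration slides can be imitated at source level. This is only a rough sketch under stated assumptions, not how JESSICA2 works internally: JESSICA2 inserts the check into JIT-generated native code and needs no source changes. The names MigrationPointSketch, Frame, PC_LOOP_HEAD, PC_RETURN and the alertAt parameter are invented here, and the local slot layout is hypothetical. The loop is the summation loop of the worker example.

```java
final class MigrationPointSketch {
    /** Hypothetical, much simplified BTC frame: class name, method signature, pc, local slots. */
    static final class Frame {
        final String className;
        final String methodSignature;
        final int pc;
        final long[] locals; // invented layout: slot 0 = n, slot 1 = i, slot 2 = sum

        Frame(String className, String methodSignature, int pc, long[] locals) {
            this.className = className;
            this.methodSignature = methodSignature;
            this.pc = pc;
            this.locals = locals.clone();
        }
    }

    static final int PC_LOOP_HEAD = 0; // invented label: migration point at the loop head
    static final int PC_RETURN = 1;    // invented label: the method has finished

    /** Stand-in for the flag tested by the inserted check (cmp mflag,0). */
    static volatile boolean mflag = false;

    /** Fresh frame for summing 0..n-1, as in the worker example. */
    static Frame initial(long n) {
        return new Frame("worker", "run()V", PC_LOOP_HEAD, new long[] {n, 0, 0});
    }

    /**
     * Runs or resumes the loop from the state in f. When i reaches alertAt the
     * flag is raised, simulating a load-monitor alert; the check at the loop
     * head then captures the state and returns instead of continuing.
     */
    static Frame run(Frame f, long alertAt) {
        long n = f.locals[0], i = f.locals[1], sum = f.locals[2];
        while (i < n) {
            if (i == alertAt) mflag = true; // simulated alert (assumption of this sketch)
            if (mflag) {                    // migration check at the loop head
                mflag = false;
                return new Frame(f.className, f.methodSignature, PC_LOOP_HEAD,
                                 new long[] {n, i, sum});
            }
            sum += i;
            i++;
        }
        return new Frame(f.className, f.methodSignature, PC_RETURN, new long[] {n, i, sum});
    }

    public static void main(String[] args) {
        Frame captured = run(initial(10), 4); // "source node": stops at i = 4
        System.out.println("captured i=" + captured.locals[1] + " sum=" + captured.locals[2]);
        Frame finished = run(captured, -1);   // "destination node": resumes, no further alert
        System.out.println("final sum=" + finished.locals[2]);
    }
}
```

The captured frame holds everything needed to resume, so it could in principle be serialized and sent to another JVM. In this sketch the resume is explicit in the source, whereas the slides describe restoring the registers and jumping to a restore point inside the compiled method.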