JIT-Compiler-Assisted Distributed Java Virtual Machine

Wenzhang Zhu, Cho-Li Wang, Weijian Fang and Francis C. M. Lau
Department of Computer Science and Information Systems, The University of Hong Kong
Presented by Cho-Li Wang
TCHPC 2004, Taiwan, March 2004

Outline
- Distributed Java Virtual Machine
- Design tradeoffs
- Related work
- JESSICA2 features
- Experimental results
- Conclusion and future work
- A raytracing demo

Distributed Java Virtual Machine (DJVM)

    import java.util.*;

    // Each worker thread sums the integers below its own bound n.
    class worker extends Thread {
        private long n;
        public worker(long N) { n = N; }
        public void run() {
            long sum = 0;
            for (long i = 0; i < n; i++) sum += i;
            System.out.println("N=" + n + " Sum=" + sum);
        }
    }

    public class test {
        static final int N = 100;
        public static void main(String args[]) {
            worker[] w = new worker[N];
            Random r = new Random();
            // Create all threads, start them, then wait for every one to finish.
            for (int i = 0; i < N; i++) w[i] = new worker(r.nextLong());
            for (int i = 0; i < N; i++) w[i].start();
            try {
                for (int i = 0; i < N; i++) w[i].join();
            } catch (Exception e) {}
        }
    }

A distributed Java Virtual Machine (DJVM) consists of a group of extended JVMs running in a distributed environment to support true parallel execution of a multithreaded Java application.
A DJVM provides all the JVM services, in compliance with the Java language specification, as if the application were running on a single machine: a Single System Image (SSI).
[Figure: Java threads run on a DJVM that presents a single system image over several JVMs. Labels: bytecode execution engine, heap, thread, class.]

Design Tradeoffs of a DJVM
[Figure labels: thread scheduler, execution engine, heap.]
- How to manage the threads?
  - Distributed thread scheduling
  - Initial placement vs. thread migration
- How to store the data?
  - Distributed heap (object store)
  - Java memory model (memory consistency)
  - Can an off-the-shelf DSM be used as the heap?
- How to process the bytecode?
  - Execution engine: interpretation, Just-in-Time (JIT) compilation or static compilation

Related work
- cJVM (IBM Haifa Research): remote thread creation; interpreter-mode execution; embedded object-based DSM (proxy) with built-in object caching
- JAVA/DSM (Rice University): manual thread distribution; interpreter-mode execution; heap built on top of a page-based DSM
- JESSICA (HKU): transparent thread migration; interpreter-mode execution; heap built on top of a page-based DSM
- Jackal, Hyperion: remote thread creation; static compilation; linked to an object-based DSM

JESSICA2 (Java-Enabled Single-System-Image Computing Architecture)
[Figure: a multithreaded Java program runs on one master and several worker JESSICA2 JVMs on top of a Global Object Space. Labels: thread migration, JIT compiler mode, portable Java frame.]

JESSICA2 Main Features
- Transparent Java thread migration
  - Runtime capturing and restoring of the thread execution context
  - No source code modification, no bytecode instrumentation (preprocessing), no new API
  - Enables dynamic load balancing on clusters
- JIT compiler-based execution engine (JITEE)
  - Operates in Just-In-Time (JIT) compilation mode
  - Cluster-aware
- Global Object Space
  - A shared global heap spanning all cluster nodes
  - Provides location-transparent object access
  - Adaptive migrating-home protocol for memory consistency, plus various optimizing schemes
- I/O redirection

JESSICA2 thread migration (in a JIT-enabled JVM)
- RTC: Raw Thread Context
- BTC: Bytecode-oriented Thread Context (thread id, frames, class names, method signature, PC, operand stack pointer, local variables, ...)
- The RTC is transformed into the BTC directly inside the JIT compiler
[Figure: on the source node an alert is raised (1), and stack analysis and stack capturing turn the thread's RTC frames into BTC frames (2); on the destination node, frame parsing rebuilds the RTC and execution is restored (3). Other labels: thread scheduler, load monitor, migration manager, JVM method area, PC.]

Thread Stack Transformation
Raw Thread Context (RTC), shown on both the source and the destination node:

    %esp:    0x00000000
    %esp+4:  0x082ca809
    %esp+8:  0x08225400
    %esp+12: 0x08266bc0
    ...
    %eax = 0x08623200
    %ebx = 0x08293100

Bytecode-oriented Thread Context (BTC), produced by stack capturing and consumed by stack restoration:

    Frames{
      method CPI::run()V@111
      local=13;stack=0;
      var:
        arg0:CPI, 33, 0x8225400
        local1: [D; 33, 0x8266bc0@2
        local2: int, 2;
      ...

Details
- Bytecode verifier
- Bytecode translation: construct the control flow graph
- Migration points: method invocations (INVOKESTATIC, INVOKESPECIAL, INVOKEVIRTUAL and INVOKEINTERFACE) and the head of a basic block
- Intermediate code and register allocation
- Restore: registers are rebuilt from variables (mov var1->reg1, mov var2->reg2, ...)
- Instrumentation: migration checking, object checking (for the Global Object Space), type spilling (to deduce variable types) and non-destructive register spilling
- Code generation, native code, linking and constant resolution
- Migration point selection, with the register-to-variable (reg : var) mapping
- Java frame detection: the raw thread stack contains both Java frames and C frames

Example of native code instrumentation
[Figure: the instrumented native code shown on this slide is not recoverable from the source.]

Optimization on migration points: pseudo-inlining
- Purpose: eliminate the cost of unnecessarily inserted migration points
- General idea: delete migration points (M-points) before the invocation of a small method

Dynamic Register Patching
Thread stack (the figure marks the direction of stack growth and the %ebp and return address entries of the frames):
- bootstrap frame
- trampoline frame
- frame 0, patched with: reg1 <- value1; reg2 <- value2; jmp restore_point0
- frame 1, patched with: reg1 <- value1; jmp restore_point1
Compiled methods:

    Method0() { ... restore_point0: }
    Method1() { ... restore_point1: }
    bootstrap() { trampoline(); closing handler(); }

Advantages of native code instrumentation
- Lightweight
  - Reuses the JIT compiler's internal data structures and control flow analysis functions
  - No need to include debugging information in Java class files
  - Instrumented native code is more efficient than instrumented bytecode
- Transparent
  - No source code modification
  - No new API introduced
  - No preprocessing

Global Object Space (GOS)
- Provides a global heap abstraction for the DJVM
- Home-based object coherence protocol, compliant with the JVM memory model
  - Object-based, to reduce false sharing
- Non-blocking communication
  - Uses a threaded I/O interface inside the JVM for communication, to hide latency
- Adaptive object home migration mechanism
  - Takes advantage of JVM runtime information for optimization

GOS runtime data structure
[Figure: layout of master and cache objects.]
- Master object: object header, cache pointer, object data
- Cache object: object header, cache pointer, cache data
- Cache header: master host id, master address, class, cache object list
- Cache data list entries: thread id, status, cache data, next

Experimental environment
- HKU Gideon 300 Linux cluster: 300 P4 PCs (2 GHz, 512 MB RAM, 40 GB disk)
- Network: 312-port Foundry FastIron 1500 non-blocking switch (100 Mbit/s)

Migration overhead during normal execution (SPECJVM98 benchmark)

                Time (seconds)                   Space (native code/bytecode)
    Benchmark   No migration   Migration         No migration   Migration
    compress    11.31          11.39 (+0.71%)    6.89           7.58 (+10.01%)
    jess        30.48          30.96 (+1.57%)    6.82           8.34 (+22.29%)
    raytrace    24.47          24.68 (+0.86%)    7.47           8.49 (+13.65%)
    db          35.49          36.69 (+3.38%)    7.01           7.63 (+8.84%)
    javac       38.66          40.96 (+5.95%)    6.74           8.72 (+29.38%)
    mpegaudio   28.07          29.28 (+4.31%)    7.97           8.53 (+7.03%)
    mtrt        24.91          25.05 (+0.56%)    7.47           8.49 (+13.65%)
    jack        37.78          37.90 (+0.32%)    6.95           8.38 (+20.58%)
    Average                    (+2.21%)                         (+15.68%)

Migration overhead analysis

Overall migration latency:

    Program (frame #)   LT (1)   CPI (1)   ASP (1)   N-Body (8)   SOR (2)
    Latency (ms)        4.997    2.680     4.678     10.803       8.467

Migration time breakdown (LT program):

    Frame #         1       2       4       6       8       10
    Var #           4       15      37      59      81      103
    Size (B)        201     417     849     1281    1713    2145
    Capture (us)    202     266     410     495     605     730
    Parse (us)      235     253     447     526     611     724
    Create (us)     360     360     360     360     360     360
    Compile (us)    478     575     847     1,169   1,451   1,720
    Build (us)      7       11      14      16      21      28
    Total (us)      1,282   1,465   2,078   2,566   3,048   3,562

GOS Optimizations (using 4 PCs)
[Figure: stacked percentage bars (0% to 100%) with components Obj, Syn and Comp for ASP, SOR, Nbody and TSP, each under the settings NO, H, HS and HSP.]
- NO = no optimizations
- H = home migration
- HS = home migration + synchronized method shipping
- HSP = HS + object pushing

JESSICA2 vs JESSICA (CPI)
[Figure: execution time in ms (axis 0 to 250,000) of CPI with 50,000,000 iterations on 2, 4 and 8 nodes, for JESSICA and JESSICA2.]

Application benchmark
[Figure: speedup (axis 0 to 10) against the number of nodes (2, 4, 8) for CPI, TSP, Raytracer and nBody, compared with linear speedup.]

Parallel Ray Tracing (using 64 nodes of the Gideon 300 cluster)
- Linux 2.4.18-3 kernel (Redhat 7.3)
- 64 nodes: 108 seconds
- 1 node: 4402 seconds (about 1.2 hours)
- Speedup = 4402/108 ≈ 40.76

Conclusions
- Transparent Java thread migration in the JIT compiler enables high-performance execution of multithreaded Java applications on clusters
- An embedded GOS layer can take advantage of JVM runtime information to reduce communication overhead

Future work
- Advanced thread migration mechanism without overhead during normal execution (finished)
- Incremental distributed GC
- Enhanced single I/O space to benefit more real-life applications
- Parallel I/O support

Thanks
JESSICA2 webpage: http://www.csis.hku.hk/~clwang/projects/JESSICA2.html
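
Supplementary sketch
The thread migration slides list what a bytecode-oriented thread context (BTC) carries: thread id, frames, class names, method signature, PC, operand stack pointer and local variables. The Java sketch below is only an illustration of how such a machine-independent context could be modelled and rendered as text for shipping to another node. It is not JESSICA2's actual data layout: the class names (BtcSketch, ThreadContext, Frame, Local), the text format, the thread id 7 and the placeholder reference "object#33" are all assumptions made for this example. Only the frame CPI::run()V@111 with stack=0 and an int local holding 2 is taken from the slide.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model of a bytecode-oriented thread context; all names are hypothetical. */
class BtcSketch {

    /** One local variable slot with its deduced type and a machine-independent value. */
    static final class Local {
        final int slot;
        final String type;   // JVM type descriptor, e.g. "I" for int
        final String value;  // primitive value or a global object reference, as text

        Local(int slot, String type, String value) {
            this.slot = slot;
            this.type = type;
            this.value = value;
        }
    }

    /** One Java frame: the method, the bytecode PC to resume at, and its typed locals. */
    static final class Frame {
        final String className;
        final String methodSignature;
        final int bytecodePc;
        final int operandStackDepth;
        final List<Local> locals = new ArrayList<>();

        Frame(String className, String methodSignature, int bytecodePc, int operandStackDepth) {
            this.className = className;
            this.methodSignature = methodSignature;
            this.bytecodePc = bytecodePc;
            this.operandStackDepth = operandStackDepth;
        }

        /** Adds a local variable and returns this frame so calls can be chained. */
        Frame local(int slot, String type, String value) {
            locals.add(new Local(slot, type, value));
            return this;
        }
    }

    /** The whole context: a thread id plus its frames in the order they were added. */
    static final class ThreadContext {
        final long threadId;
        final List<Frame> frames = new ArrayList<>();

        ThreadContext(long threadId) {
            this.threadId = threadId;
        }
    }

    /** Renders the context as text; no raw addresses or register names appear in it. */
    static String encode(ThreadContext ctx) {
        StringBuilder sb = new StringBuilder("thread " + ctx.threadId + "\n");
        for (Frame f : ctx.frames) {
            sb.append(f.className).append("::").append(f.methodSignature)
              .append('@').append(f.bytecodePc)
              .append(" stack=").append(f.operandStackDepth).append('\n');
            for (Local l : f.locals) {
                sb.append("  local").append(l.slot).append(": ")
                  .append(l.type).append(" = ").append(l.value).append('\n');
            }
        }
        return sb.toString();
    }

    /** Builds a one-frame context loosely modelled on the CPI::run() frame from the slide. */
    static ThreadContext example() {
        ThreadContext ctx = new ThreadContext(7); // thread id chosen arbitrarily
        ctx.frames.add(new Frame("CPI", "run()V", 111, 0)
                .local(0, "LCPI;", "object#33")   // hypothetical global object reference
                .local(2, "I", "2"));
        return ctx;
    }

    public static void main(String[] args) {
        System.out.print(encode(example()));
    }
}
```

The point of the design is that everything in the encoded form is expressed at the bytecode level (method, bytecode PC, typed variables), so a destination node could in principle rebuild its own native frames from it, whereas a raw context of stack addresses and register values is only meaningful on the node that produced it.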