Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support Wenzhang Zhu, Cho-Li Wang, Francis Lau The Systems Research Group Department of Computer Science and Information Systems The University of Hong Kong HKU JESSICA Project JESSICA: “Java-Enabled Single-SystemImage Computing Architecture” : Project started in 1996. First version (JESSICA1) in 1999. A middleware that runs on top of the standard UNIX/Linux operating system to support parallel execution of multithreaded Java applications in a cluster of computers. JESSICA hides the physical boundaries between machines and makes the cluster appear as a single computer to applications -- a single-system image (SSI). Special feature : preemptive thread migration which allows a thread to freely move between machines. Part of the RGC’s Area of Excellence project in 1999-2002. JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 2 JESSICA Team Members Supervisors: Dr. Francis C.M. Lau Dr. Cho-Li Wang Research Students: Ph.D: Wenzhang Zhu (Thread Migration) Ph.D: WeiJian Fang (Global Heap) M.Phil: Zoe Ching Han Yu (Distributed Garbage Collection) Ph.D: Benny W. L. Cheung (Software Distributed Shared Memory) Graduated: Matchy Ma (JESSICA1) JESSICA Team Members The Systems Research Group JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 3 Outline Introduction of Cluster Computing Motivations Related works JESSICA2 features Performance Analysis Conclusion & Future works JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 4 What’s a cluster ? A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected standalone/complete computers cooperatively working together as a single, integrated computing resource – IEEE TFCC. My definition : a HPC system that integrates mainstream commodity components to process large-scale problems low cost, self-made, yet powerful. JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 5 Cluster Computer Architecture Programming Environment (Java, C, MPI, HPF, DSM) Management & Monitoring & Job Scheduling Cluster Applications (Web, Storage, Computing, Rendering, Financing..) Single System Image Infrastructure Availability Infrastructure OS OS OS OS Node Node Node Node High-Speed LAN (Fast/Gigabit Ethernet, SCI, Myrinet) JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 6 Single System Image (SSI) ? JESSICA Project: Java-Enabled Single-SystemImage Computing Architecture A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource. Ultimate Goals of SSI : makes the cluster appear like a single machine to the user, to applications, and to the network. Single Entry Point, Single File System, Single Virtual Networking, Single I/O and Memory Space, Single Process Space, Single Management / Programming View … JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 7 Top 500 computers by “classification” (June 2002) (Source: http://www.top500.org/ ) Count Share Rmax [GF/s] Rpeak [GF/s] Processors MPP 224 44.8 % 104899.21 168829.00 111104 Constellations 187 37.4 % 40246.50 59038.00 33828 Cluster 80 16 % 37596.16 69774.00 50181 SMP 9 1.8 % 39208.10 44875.00 6056 Total 500 100 % 221949.97 342516.00 201169 MPP Constellation Cluster SMP Massively Parallel Processor E.g., cluster of HPCs Cluster of PCs Symmetric Multiprocessor About the TOP500 List: 1. 2. 3. the 500 most powerful computer systems installed in the world. Compiled twice a year since June 1993 Ranked by their performance on the LINPACK Benchmark JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 8 #1 Supercomputer: NEC’s Earth Simulator Linpack : 35.86 Tflop/s (Tera FLOPS = 1012 floating point operations per second = 450 x Pentium 4 PCs) Interconnect: Single stage crossbar (1800 miles of cable) 83,000 copper cables, 16 GB/s cross section bandwidth Built by NEC, 640 processor nodes, each consisting of 8 vector processors, total of 5120 processors, 40 TFlop/s peak, and 10 TB memory. Area of computer = 4 tennis courts, 3 floors (Source: NEC) JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 9 Other Supercomputers in TOP500 #2 #3 Supercomputer: ASCI Q 7.7 TF/s Linpack performance. Los Alamos National Laboratory, U.S. HP Alphserver SC (375 x 32-way multiprocessors, total 11,968 processors), 12 terabytes of memory and 600 terabytes of disk storage #4: IBM ASCI White (U.S.) 8,192 copper microprocessors (IBM SP POWER3), and contains 160 trillion bytes (TB) of memory with more than 160 TB of IBM disk storage capacity; Linpack: 7.22 Tflops. Located at Lawrence Livermore National Laboratory. 512-node, 16-way symmetric multiprocessor. Covers an area the size of two basketball courts, weighs 106 tons. 2,000 miles of copper wiring. Cost: US$110 million. JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 10 TOP500 Nov 2002 List 2 new PC clusters made the TOP 10: #5 is a Linux NetworX/Quadrics cluster at Lawrence Livermore National Laboratory. #8 is a HPTi/Myrinet cluster at the Forecast Systems Laboratory at NOAA. A total of 55 Intel based and 8 AMD based PC clusters are in the TOP500. The number of clusters in the TOP500 grew again to a total of 93 systems. JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 11 Poor Man’s Cluster HKU Ostrich Cluster 32 x 733 MHz Pentium III PCs, 384MB Memory Hierarchical Ethernet-based network : four 24port Fast Ethernet switches + one 8-port Gigabit Ethernet backbone switch) JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 12 Rich Man’s Cluster Computational Plant (C-Plant cluster) 1536 Compaq DS10L 1U servers (466 MHz Alpha 21264 (EV6) microprocessor, 256 MB ECC SDRAM) Each node contains a 64-bit, 33 MHz Myrinet network interface card (1.28 Gbps/s) connected to a 64-port Mesh64 switch. 48 cabinets, each of which contains 32 nodes (48x32=1536) JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 13 The HKU Gideon 300 Cluster (Operating in mid Oct. 2002) Linpack performance: 355 Gflops #175 in TOP500 (Nov. 2002 List) 300 PCs (2.0GHz Pentium 4, 512MB DDR mem, 40GB disk, Linux OS) connected by a 312-port Foundry FastIron 1500 (Fast Ethernet) switch JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 14 Building Gideon 300 JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 15 JESSICA2 : Introduction Research Goal High Performance Java Computing using Clusters Why Java? The dominant language for server-side programming. More than 2 million Java developers [CNETAsia: 06/2002] Platform independent: “Compile once, run anywhere” Code mobility (i.e., dynamic class loading) and data mobility (i.e., object serialization). Built-in multithreading support at language level (parallel programming using MPI, PVM, RMI, RPC, HPF, DSM is difficult) Why cluster? Large scale server-side applications need high-performance multithreaded programming supports A cluster provides a scalable hardware platform for true parallel execution. JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 16 Java Virtual Machine Application Class File Java API Class File Class Loader Loads class files Class loader Interpreter Executes bytecode Runtime Compiler Bytecode 0a0b0c0d0c6262431 c1d688662a0b0c0d0 c1334514726522723 Interpreter 01010101000101110 10101011000111010 10110011010111011 Runtime compiler Converts bytecode to native code Native code JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 17 Threads in JVM Stack Frame Stack Frame JESSICA2, CSIS, HKU } } Java Method Area (Code) PC object public class ProducerConsumerTest { public static void main(String[] args) { CubbyHole c = new CubbyHole(); Producer p1 = new Producer(c, 1); Consumer c1 = new Consumer(c, 1); p1.start(); c1.start(); Thread 3 Thread 2 Thread 1 A Multithreaded Java Program Heap (Data) Class loader Execution Engine object HKJU, Dec. 18, 2002 Class files 18 Java Memory Model (How to maintain memory consistency between threads) T1 Load variable from main memory to Variable is modified working memory Upon in T1’sTworking 1 performs before Upon Tuse. performsis unlock, memory. 2 variable lock, variable writtenflush to When Tback 2 uses in working main memory variable, it memory will be loaded from main memory T2 Per-Thread working memory Main memory Garbage Bin Object Variable Threads: T1, T2 JESSICA2, CSIS, HKU Heap Area master copy Threads in a JVM HKJU, Dec. 18, 2002 20 Distributed Java Virtual Machine (DJVM) JESSICA2: A distributed Java Virtual Machine (DJVM) spanning multiple cluster nodes can provide a true parallel execution environment for multithreaded Java applications with a Single System Image illusion to Java threads. Java Threads created in a program Global Object Space OS PC OS PC OS PC OS PC High Speed Network JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 21 Problems in Existing DJVMs Mostly based on interpreters Simple but slow Layered design using distributed shared memory system (DSM) can’t be tightly coupled with JVM JVM runtime information can’t be channeled to DSM False sharing if page-based DSM is employed Page faults block the whole JVM Programmer specifies thread distribution lacks of transparency Need to rewrite multithreaded Java applications No dynamic thread distribution (preemptive thread migration) for load balancing. JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 22 Related Work Method shipping: IBM cJVM Like remote method invocation (RMI) : when accessing object fields, the proxy redirects the flow of execution to the node where the object's master copy is located. Executed in Interpreter mode. Load balancing problem : affected by the object distribution. Page shipping : Rice U. Java/DSM, HKU JESSICA Simple. GOS was supported by some page-based Distributed Shared Memory (e.g., TreadMarks, JUMP, JiaJia) JVM runtime information can’t be channeled to DSM. Executed in Interpreter mode. Object shipping: Hyperion, Jackal Leverage some object-based DSM Executed in native mode: Hyperion: translate Java bytecode to C. Jackal: compile Java source code directly to native code JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 23 JESSICA2 Main Features Transparent Java thread migration Runtime capturing and restoring of thread execution context. No source code modification. No bytecode instrumenting (preprocessing) No new API introduced. Enable dynamic load balancing on clusters JESSICA2 Transparent migration JIT GOS Operated in Just-In-Time (JIT) compilation Mode Global Object Space A shared global heap spanning all cluster nodes Adaptive object home migration protocol I/O redirection JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 25 JESSICA2 Architecture Java Bytecode or Source Code public class ProducerConsumerTest { public static void main(String[] args) { CubbyHole c = new CubbyHole(); Producer p1 = new Producer(c, 1); Consumer c1 = new Consumer(c, 1); Migration } threads Master JVM } er atio r g Mi rd no JITEE worker JVM threads p1.start(); c1.start(); Migration worker JVM JITEE threads Migration Portable Java Frames JITEE Global Object Space Load monitor Host Manager Host Manager OS OS OS Hardware Hardware Hardware Host Manager ... OS Hardware Communication Network JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 26 Transparent Thread Migration in JIT Mode Simple for interpreters (e.g. JESSICA) Interpreter sits in the bytecode decoding loop which can be stopped upon a migration flag checking The full state of a thread is available in the data structure of interpreter No register allocation JIT mode execution makes things complex (JESSICA2) Native code has no clear bytecode boundary How to deal with machine registers? How to organize the stack frames (all are in native form now) ? How to make extracted thread states portable and recognizable by the remote JVM ? How to restore the extracted states (rebuild the stack frames) and restart the execution in native form ? Need to modify JIT compiler to instrument native codes JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 27 An overview of JESSICA2 Java thread migration Thread Frame parsing GOS (heap) Frames Frames Frames Frame PC JVM (4a) Object Access Frame Method Area GOS (heap) PC (4b) Load method from NFS Source node Destination node Load Monitor JESSICA2, CSIS, HKU Migration Manager Method Area (2) Stack analysis Stack capturing Thread Scheduler (1) Alert (3) Restore execution HKJU, Dec. 18, 2002 29 What are those functions? Migration points selection Delayed to the head of loop basic block or method Register context handler Spill dirty registers at migration point without invalidation so that native codes can continue the use of registers Use register recovering stub at restoring phase Variable type deduction Spill type in stacks using compression Java frames linking Discover consecutive Java frames JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 30 Dynamic Thread State Capturing and Restoring in JESSICA2 migration point Bytecode verifier migration point Selection (Restore) register allocation bytecode translation 1. Add migration checking 2. Add object checking 3. Add type & register spilling Intermediate Code Register recovering reg slots invoke cmp mflag,0 jz ... cmp obj[offset],0 jz ... mov 0x110182, slot ... code generation Native Code Linking & Constant Resolution (Capturing) Global Object Access Native stack scanning Java frame mov slot1->reg1 mov slot2->reg2 ... C frame Frame JESSICA2, CSIS, HKU Native thread stack HKJU, Dec. 18, 2002 31 How to Maintain Memory Consistency in a Distributed Environment ? T1 T2 T3 T4 Heap OS PC T5 T6 T7 T8 Heap OS PC OS PC OS PC High Speed Network JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 32 Embedded Global Object Space (GOS) Main Features: Take advantage of JVM runtime information for optimization (e.g. object types, accessing threads, etc.) Use threaded I/O interface inside JVM for communication to hide the latency Non-blocking GOS access OO based to reduce false sharing Home-based, compliant with JVM Memory Model (“Lazy Release Consistency”) Master Heap (home objects) and Cache Heap (local and cached objects) : reduce object access latency JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 33 Object Cache JVM Cache Heap Area Hash table Java thread Hash table Java thread Master Heap Area Java thread Java thread Hash table JVM Master Heap Area Hash table Cache Heap Area Global Heap JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 34 Adaptive object home migration Definition “home” of an object = the JVM that holds the master copy of an object Problems cache objects need to be flushed and re-fetched from the home whenever synchronization happens Adaptive object home migration if # of accesses from a thread dominates the total # of accesses to an object, the object home will be migrated to the node where the thread is running JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 35 I/O redirection Timer Use the time in Master node as the standard time Calibrate the time in worker node when they register to master node File I/O Use half word of fd as node number Open file For read, check local first, then master node For write, go to master node Read/Write Go to the node specified by the node number in fd Network I/O Connectionless send: do locally Others, go to master JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 36 Experimental Setting Modified Kaffe Open JVM version 1.0.6 Linux PC Clusters: 1. 2. Pentium II PCs at 540MHz (Linux 2.2.1 kernel) Connected by Fast Ethernet HKU Gideon 300 Cluster (RayTracing) JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 37 Parallel Ray Tracing on JESSICA2 (Running at 64-node Gideon 300 cluster) Linux 2.4.18-3 kernel (Redhat 7.3) 64 nodes: 108 seconds 1 node: 3430 seconds (~ 1 hour) Speedup = 4402/108=40.75 JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 38 Conclusions Transparent Java thread migration in JIT compiler enable the high-performance execution of multithreaded Java application on clusters while keeping the merits of Java JVM approach => dynamic class loading Just-in-Time compilation for speed An embedded GOS layer can take advantage of the JVM runtime information to reduce communication overhead JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 45 Thanks HKU SRG: http://www.srg.csis.hku.hk/ JESSICA2 Webpage: http://www.csis.hku.hk/~clwang/proj ects/JESSICA2.html JESSICA2, CSIS, HKU HKJU, Dec. 18, 2002 46