A JVM for the Barrelfish Operating System
2nd Workshop on Systems for Future Multi-core Architectures (SFMA'12)
Martin Maas (University of California, Berkeley)
Ross McIlroy (Google Inc.)
10 April 2012, Bern, Switzerland

Introduction
- Future multi-core architectures will presumably...
  - ...have a larger number of cores
  - ...exhibit a higher degree of diversity
  - ...be increasingly heterogeneous
  - ...have no cache-coherence/shared memory
- These changes (arguably) require new approaches for operating systems: e.g. Barrelfish, fos, Tessellation, ...
- Barrelfish's approach: treat the machine's cores as nodes in a distributed system, communicating via message-passing.
- But: how do we program such a system uniformly?
- How do we exploit performance on all configurations?
- How do we structure executables for these systems?

Introduction
- Answer: managed language runtime environments (e.g. the Java Virtual Machine or the Common Language Runtime).
- Advantages over a native programming environment:
  - Single-system image
  - Transparent migration of threads
  - Dynamic optimisation and compilation
  - Language extensibility
- We investigate the challenges of bringing up a JVM on Barrelfish, comparing two different approaches:
  - Conventional shared-memory approach
  - Distributed approach in the style of Barrelfish

Outline
1. The Barrelfish Operating System
2. Implementation Strategy
   - Shared-memory approach
   - Distributed approach
3. Performance Evaluation
4. Discussion & Conclusions
5. Future Work

The Barrelfish Operating System
- Barrelfish is based on the multikernel model: it treats a multi-core machine as a distributed system.
- Communication happens through a lightweight message-passing library.
- Global state is replicated rather than shared.
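The replicated-state idea can be sketched in Java (the class and message shape here are illustrative, not Barrelfish's actual API): each "core" keeps a private replica of a piece of global state and learns about updates only through messages in its inbox, never through shared memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Toy model of the multikernel: replicate global state, update it via messages. */
public class Multikernel {

    static class Core extends Thread {
        volatile int replica = -1;                        // private replica of global state
        final BlockingQueue<Integer> inbox = new LinkedBlockingQueue<>();

        @Override public void run() {
            try {
                int msg;
                while ((msg = inbox.take()) >= 0)         // negative value = shutdown
                    replica = msg;                        // apply the replicated update
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** "Updating global state" = broadcasting a message; returns every core's view. */
    static List<Integer> broadcast(int nCores, int value) throws InterruptedException {
        List<Core> cores = new ArrayList<>();
        for (int i = 0; i < nCores; i++) { Core c = new Core(); c.start(); cores.add(c); }
        for (Core c : cores) c.inbox.put(value);          // replicate, don't share
        for (Core c : cores) c.inbox.put(-1);             // shut every core down
        List<Integer> views = new ArrayList<>();
        for (Core c : cores) { c.join(); views.add(c.replica); }
        return views;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(broadcast(4, 7));              // every replica converges to 7
    }
}
```

No core ever reads another core's replica directly; agreement on the new value is reached purely by message delivery, which is the property Barrelfish relies on when cache-coherent shared memory is unavailable.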
[Figure: Barrelfish architecture - one CPU driver per core in kernel mode; monitors and application dispatchers in user mode; an application spans cores as a domain with one dispatcher per core.]

Implementation
- Running real-world Java applications would require bringing up a full JVM (e.g. the Jikes RVM) on Barrelfish.
- A full JVM stresses the memory system (virtual memory is fully managed by the JVM), and Barrelfish lacked the necessary features (e.g. page-fault handling, a file system).
- It would also have distracted from understanding the core challenges.
- Approach: implement a rudimentary Java bytecode interpreter that provides just enough functionality to run standard Java benchmarks (the Java Grande Benchmark Suite).
- It supports 198 out of 201 bytecode instructions (all except wide, goto_w and jsr_w), inheritance, strings, arrays, threads, ...
- No garbage collection, JIT, exception handling, dynamic linking or class loading, reflection, ...

Shared-memory vs. distributed approach
[Figure: in the shared-memory approach, all JVM instances in a domain operate on a single shared heap of objects; in the distributed approach, each jvm-node owns part of the heap, and operations such as putfield, moving an object, or running a function on another node are request/acknowledgement message pairs between JVM0..3.]

The distributed approach
[Figure: message sequences between jvm-node0..3 - an object lookup sends requests to the other nodes and blocks until the owning node responds; a virtual method call is forwarded via invoke to the object's home node, which performs its getfields locally and sends the return value back.]

Performance Evaluation
- Performance evaluation using the sequential and parallel Java Grande benchmarks (mostly Section 2 - compute kernels).
- Performed on a 48-core AMD Magny-Cours (Opteron 6168): four 2x6-core processors, 8 NUMA nodes (8 GB RAM each).
- The shared-memory version is evaluated on Linux (using numactl to pin cores) and on Barrelfish.
- The distributed version is evaluated only on Barrelfish.
- Performance is compared to an industry-standard JVM (OpenJDK 1.6.0), with and without JIT compilation.

Single-core (sequential) performance
- Consistently within a factor of 2-3 of OpenJDK without JIT.
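Much of that factor of 2-3 is dispatch overhead: an interpreter pays at least a load and a branch per bytecode. A minimal sketch of such a switch-based interpreter loop, handling only a tiny subset of real JVM opcodes (the opcode values match the JVM specification; everything else, such as classes and frames, is omitted for illustration):

```java
/** Minimal switch-based bytecode interpreter over a small operand stack. */
public class TinyInterpreter {
    // Opcode values as defined by the JVM specification.
    static final int ICONST_2 = 0x05, ICONST_3 = 0x06, IADD = 0x60,
                     IMUL = 0x68, IRETURN = 0xAC;

    /** Interpret a bytecode sequence and return the value left by ireturn. */
    static int run(int[] code) {
        int[] stack = new int[16];
        int sp = 0, pc = 0;
        while (true) {
            int op = code[pc++];
            switch (op) {                               // dispatch: one branch per opcode
                case ICONST_2: stack[sp++] = 2; break;  // push constant 2
                case ICONST_3: stack[sp++] = 3; break;  // push constant 3
                case IADD:     stack[sp - 2] += stack[sp - 1]; sp--; break;
                case IMUL:     stack[sp - 2] *= stack[sp - 1]; sp--; break;
                case IRETURN:  return stack[--sp];
                default: throw new IllegalStateException("unhandled opcode " + op);
            }
        }
    }

    public static void main(String[] args) {
        // (2 + 3) * 3 as stack code: push 2, push 3, add, push 3, multiply, return
        int[] code = { ICONST_2, ICONST_3, IADD, ICONST_3, IMUL, IRETURN };
        System.out.println(run(code));                  // prints 15
    }
}
```

A JIT compiler removes this per-instruction dispatch entirely by emitting native code, which is why OpenJDK with JIT is another order of magnitude faster still.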
Figure 4.9: Application benchmarks on a single core (SizeA), execution time in s:

Benchmark      OpenJDK (JIT)  OpenJDK (No JIT)  JVM (Linux)  JVM (Barrelfish)
SparseMatmult   0.77            6.24            12.85        14.42
SOR             1.02           20.01            43.02        40.47
Series         11.78           17.89            18.22        21.08
LUFact          0.13            3.45             9.74         9.85
HeapSort        0.39            5.43            16.19        16.34
FFT             4.4            27               54.89        67.29
Crypt           0.28            4.83             8.55         8.75

Performance of the shared-memory approach
- Uses the parallel sparse matrix multiplication Java Grande benchmark JGFSparseMatmultBenchSizeB.
- Scales to 48 cores as expected (relative to OpenJDK).
[Figure: execution time of JGFSparseMatmultBenchSizeB (log scale) over 1-48 cores for the Barrelfish JVM (shared memory) on Barrelfish and Linux, OpenJDK (No JIT) and OpenJDK.]

Performance of the shared-memory approach
- Quasi-linear speed-up implies a large interpreter overhead.
- The overhead on Barrelfish presumably stems from agreement protocols.
[Figure 4.13: average speed-up of the shared-memory approach on JGFSparseMatmultBenchSizeB over 1-48 cores, Linux vs. Barrelfish (shared).]

Performance of the distributed approach
- The distributed approach is orders of magnitude slower than the shared-memory approach.
- Sparse matrix multiplication is a difficult benchmark for this implementation: 7 pairs of messages are exchanged for each iteration of the kernel (vs. almost no communication in the shared-memory version).
- The overhead is arguably caused by inter-core communication (150-600 cycles) and message handling in Barrelfish.

Cores   Run-time in s   σ (standard deviation)
1          2.70         0.002
2        458            7.891
3        396            3.545
4        402            7.616
5        444            2.128
6        514            36.77
7       1764            247.7
8       2631            335.9
16      9334            (only executed once)
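To see why the kernel is so communication-heavy, consider a toy cost model (illustrative only, not the actual implementation) in which every access to data whose home is another jvm-node costs one request/response message pair. Even a simplified sparse y[row[i]] += val[i] * x[col[i]] iteration, with all arrays homed on node 0, pays several pairs; the talk's exact count of 7 depends on implementation details not modelled here.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy cost model: remote accesses cost one request/response message pair. */
public class MessageCost {

    static class DistributedHeap {
        final Map<String, Integer> home = new HashMap<>(); // object -> home node
        int messagePairs = 0;

        void place(String obj, int node) { home.put(obj, node); }

        /** An access by 'node': free if local, one message pair if remote. */
        void access(String obj, int node) {
            if (home.get(obj) != node) messagePairs++;     // request + response
        }
    }

    /** One y[row[i]] += val[i] * x[col[i]] iteration run on node 1,
     *  with the whole working set homed on node 0. */
    static int kernelIterationPairs() {
        DistributedHeap heap = new DistributedHeap();
        for (String arr : new String[] { "val", "row", "col", "x", "y" })
            heap.place(arr, 0);                            // everything lives on node 0
        int node = 1;
        heap.access("val", node);                          // load val[i]
        heap.access("row", node);                          // load row[i]
        heap.access("col", node);                          // load col[i]
        heap.access("x",   node);                          // load x[col[i]]
        heap.access("y",   node);                          // load y[row[i]]
        heap.access("y",   node);                          // store y[row[i]]
        return heap.messagePairs;
    }

    public static void main(String[] args) {
        System.out.println(kernelIterationPairs() + " message pairs per iteration");
    }
}
```

At 150-600 cycles per inter-core message, even this handful of pairs per iteration dwarfs the few arithmetic instructions the kernel actually performs.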
Performance of the distributed approach
- The results show that, without optimisation, the distributed approach is too slow to be feasible, at least for this benchmark.
- Measuring the completion time of threads on different cores shows a performance limitation due to inter-core message passing:
- All data "lives" on the same node (core #0): while the thread running on the home node of the working set (jvm-node0) completes very quickly, threads on other cores take orders of magnitude longer (Figure 4.14).
- Cores 0-5 are within a single processor; cores 6 & 7 are off-chip. Communication with cores on other chips (#6 and #7) is significantly more expensive than on-chip communication (Figure 4.3).

Figure 4.14: Run-times of individual threads, execution time in s:

Thread (running on jvm-node*)   Execution time in s
#0                                  0.36
#1                                453.24
#2                                452.06
#3                                453.52
#4                                452.58
#5                                470.05
#6                              2,386.94
#7                              2,478.17

Discussion & Future Work
- Preliminary results show that future work should focus on reducing the message-passing overhead and the number of messages.
- How can these overheads be alleviated?
  - Reduce inter-core communication: caching of objects and arrays, similar to a directory-based MSI cache-coherence protocol.
  - Reduce message-passing latency: hardware support for message-passing (e.g. running on the Intel SCC).
- Additional areas of interest:
  - Garbage collection on such a system.
  - Relocation of objects at run-time.
  - Logical partitioning of objects.
- Future work should investigate bringing up the Jikes RVM on Barrelfish, focussing on these aspects.

Questions?
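The object caching proposed above could behave like a directory-based MSI cache-coherence protocol. A minimal sketch (hypothetical, not the Barrelfish JVM's actual design): each node tracks a Modified/Shared/Invalid state per cached object, so that repeated reads of an object fetched once stop costing message pairs.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of MSI-style object caching on one jvm-node. */
public class MsiCache {
    enum State { MODIFIED, SHARED, INVALID }

    final Map<String, State> cache = new HashMap<>();  // this node's cache state
    int messagePairs = 0;                              // coherence traffic incurred

    State state(String obj) { return cache.getOrDefault(obj, State.INVALID); }

    /** Read: only an INVALID entry must fetch the object from its home node. */
    void read(String obj) {
        if (state(obj) == State.INVALID) {
            messagePairs++;                            // fetch from home -> SHARED
            cache.put(obj, State.SHARED);
        }
    }

    /** Write: SHARED or INVALID entries must obtain exclusive ownership first. */
    void write(String obj) {
        if (state(obj) != State.MODIFIED)
            messagePairs++;                            // upgrade & invalidate other copies
        cache.put(obj, State.MODIFIED);
    }

    public static void main(String[] args) {
        MsiCache node = new MsiCache();
        for (int i = 0; i < 1000; i++) node.read("x"); // 1000 reads cost one fetch
        node.write("x");                               // plus one upgrade message
        System.out.println(node.messagePairs + " message pairs for 1001 accesses");
    }
}
```

Compared with the uncached model, where every remote access costs a pair, coherent caching trades a small amount of protocol state for a dramatic reduction in message count on read-heavy workloads like SparseMatmult.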