A JVM for the Barrelfish Operating System
2nd Workshop on Systems for Future Multi-core Architectures
(SFMA’12)
Martin Maas (University of California, Berkeley)
Ross McIlroy (Google Inc.)
10 April 2012, Bern, Switzerland
Introduction
- Future multi-core architectures will presumably...
  - ...have a larger number of cores
  - ...exhibit a higher degree of diversity
  - ...be increasingly heterogeneous
  - ...have no cache-coherence/shared memory
- These changes (arguably) require new approaches for Operating Systems: e.g. Barrelfish, fos, Tessellation, ...
- Barrelfish’s approach: treat the machine’s cores as nodes in a distributed system, communicating via message-passing.
- But: how to program such a system uniformly?
- How to exploit performance on all configurations?
- How to structure executables for these systems?
Introduction
- Answer: Managed Language Runtime Environments (e.g. Java Virtual Machine, Common Language Runtime)
- Advantages over a native programming environment:
  - Single-system image
  - Transparent migration of threads
  - Dynamic optimisation and compilation
  - Language extensibility
- Investigate challenges of bringing up a JVM on Barrelfish, comparing two different approaches:
  - Conventional shared-memory approach
  - Distributed approach in the style of Barrelfish
Outline
1. The Barrelfish Operating System
2. Implementation Strategy
   - Shared-memory approach
   - Distributed approach
3. Performance Evaluation
4. Discussion & Conclusions
5. Future Work
The Barrelfish Operating System
- Barrelfish is based on the Multikernel Model: treats the multi-core machine as a distributed system.
- Communication through a lightweight message-passing library.
- Global state is replicated rather than shared.

[Figure: Barrelfish architecture. Each core runs its own CPU Driver in kernel mode; Monitors, Dispatchers and application Domains run in user mode on top.]
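The replicated-state idea above can be sketched in plain Java (a toy model with invented names, not Barrelfish code): every "core" owns a private replica of some state, and all changes arrive as messages through an inbox rather than through shared writes.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of the multikernel idea: each "core" holds a private key-value
// replica and applies updates it receives as messages (no shared writes).
public class MultikernelSketch {
    static final class Update {
        final String key; final int value;
        Update(String key, int value) { this.key = key; this.value = value; }
    }

    static final class CoreNode implements Runnable {
        final BlockingQueue<Update> inbox = new LinkedBlockingQueue<>();
        final Map<String, Integer> replica = new ConcurrentHashMap<>();
        volatile boolean running = true;

        void deliver(Update u) { replica.put(u.key, u.value); } // apply one replicated update

        public void run() {
            while (running) {
                try {
                    Update u = inbox.poll(10, TimeUnit.MILLISECONDS);
                    if (u != null) deliver(u);
                } catch (InterruptedException e) { return; }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        CoreNode a = new CoreNode(), b = new CoreNode();
        Thread ta = new Thread(a), tb = new Thread(b);
        ta.start(); tb.start();
        Update u = new Update("x", 42);
        a.inbox.put(u); b.inbox.put(u);   // "broadcast" the update to both replicas
        Thread.sleep(100);
        a.running = false; b.running = false;
        ta.join(); tb.join();
        System.out.println(a.replica.get("x") + " " + b.replica.get("x")); // prints "42 42"
    }
}
```

The point of the design is that both replicas converge through messages alone; nothing in the sketch ever reads or writes another node's state directly.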
Implementation
- Running real-world Java applications would require bringing up a full JVM (e.g. the Jikes RVM) on Barrelfish.
  - Stresses the memory system (virtual memory is fully managed by the JVM); Barrelfish lacked necessary features (e.g. page fault handling, file system).
  - Would have distracted from understanding the core challenges.
- Approach: implementation of a rudimentary Java bytecode interpreter that provides just enough functionality to run standard Java benchmarks (Java Grande Benchmark Suite).
- Supports 198 out of 201 bytecode instructions (all except wide, goto_w and jsr_w), inheritance, Strings, arrays, threads, ...
- No garbage collection, JIT, exception handling, dynamic linking or class loading, reflection, ...
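To illustrate what such an interpreter core does (a minimal sketch, not the actual implementation; the opcode values are the real ones from the JVM specification, the surrounding machinery is invented), a stack-based dispatch loop for a few integer bytecodes might look like:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of a stack-based bytecode dispatch loop. Opcode values
// match the JVM specification (iconst_1 = 0x04, iadd = 0x60, ...), but
// everything else (no frames, no constant pool) is simplified.
public class TinyInterpreter {
    static final int ICONST_0 = 0x03, ICONST_1 = 0x04, ICONST_2 = 0x05;
    static final int IADD = 0x60, IMUL = 0x68, IRETURN = 0xAC;

    static int execute(int[] code) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0;
        while (true) {
            int opcode = code[pc++];
            switch (opcode) {
                case ICONST_0: stack.push(0); break;
                case ICONST_1: stack.push(1); break;
                case ICONST_2: stack.push(2); break;
                case IADD: { int b = stack.pop(), a = stack.pop(); stack.push(a + b); break; }
                case IMUL: { int b = stack.pop(), a = stack.pop(); stack.push(a * b); break; }
                case IRETURN: return stack.pop();
                default: throw new IllegalStateException("unsupported opcode " + opcode);
            }
        }
    }

    public static void main(String[] args) {
        // Computes (1 + 2) * 2
        int[] code = { ICONST_1, ICONST_2, IADD, ICONST_2, IMUL, IRETURN };
        System.out.println(execute(code)); // prints 6
    }
}
```

The real interpreter extends this shape to 198 instructions plus objects, method frames and threads; the dispatch structure stays the same.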
Shared memory vs. Distributed approach

[Figure: In the shared-memory approach, a single JVM domain accesses all objects (obj A-D) directly on a shared heap. In the distributed approach, per-core JVM nodes (JVM0-JVM3) each hold some of the objects and coordinate via messages such as "run func on", "invoke"/"return", "putfield"/"putfield ack" and "move object"/"move object ack".]
The distributed approach

[Figure: Message sequences between jvm-node0 ... jvm-node3. Object lookup: a node sends "obj request" to the other nodes and blocks until the node holding the object replies with "obj request response". Method call: "invoke virtual" is sent to the object's node, which may perform "getfield"/"getfield response" exchanges before sending "invoke return".]
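The remote field access in this protocol can be sketched as code (a toy model with invented names, not the actual implementation): a getfield on a remote object becomes a request message to the object's home node, and the requester blocks until the response arrives.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of the distributed object-access idea: every object lives on a
// home node, and a getfield on a remote object becomes a blocking
// request/response message pair.
public class DistributedGetfield {
    static final class HomeNode implements Runnable {
        final Map<String, Integer> heap = new ConcurrentHashMap<>();       // fields homed here
        final BlockingQueue<String> requests = new LinkedBlockingQueue<>();
        final BlockingQueue<Integer> responses = new LinkedBlockingQueue<>();
        volatile boolean running = true;

        public void run() {
            while (running) {
                try {
                    String field = requests.poll(10, TimeUnit.MILLISECONDS);
                    if (field != null) responses.put(heap.get(field)); // "getfield response" (field assumed present)
                } catch (InterruptedException e) { return; }
            }
        }
    }

    // Caller side: send "getfield" and block until the response arrives.
    static int getfieldRemote(HomeNode home, String field) throws InterruptedException {
        home.requests.put(field);
        return home.responses.take();
    }

    public static void main(String[] args) throws Exception {
        HomeNode home = new HomeNode();
        home.heap.put("objA.x", 17);
        Thread t = new Thread(home);
        t.start();
        int x = getfieldRemote(home, "objA.x");
        home.running = false;
        t.join();
        System.out.println(x); // prints 17
    }
}
```

A real implementation also needs the broadcast object lookup and per-requester reply channels; this sketch assumes a single requester and that the field exists on the home node. It does show why each field access costs a full message round trip.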
Performance Evaluation
- Performance evaluation using the sequential and parallel Java Grande Benchmarks (mostly Section 2 - compute kernels).
- Performed on a 48-core AMD Magny-Cours machine (Opteron 6168).
  - Four 2x6-core processors, 8 NUMA nodes (8GB RAM each).
- Evaluation of the shared-memory version on Linux (using numactl to pin threads to cores) and Barrelfish.
- Evaluation of the distributed version only on Barrelfish.
- Compared performance to an industry-standard JVM (OpenJDK 1.6.0) with and without JIT compilation.
Single-core (sequential) performance
- Consistently within a factor of 2-3 of OpenJDK without JIT.

Figure 4.9: Application benchmarks on a single core (SizeA), execution time in s:

Benchmark     | OpenJDK (JIT) | OpenJDK (No JIT) | JVM (Linux) | JVM (Barrelfish)
SparseMatmult |          0.77 |             6.24 |       12.85 |            14.42
SOR           |          1.02 |            20.01 |       43.02 |            40.47
Series        |         11.78 |            17.89 |       18.22 |            21.08
LUFact        |          0.13 |             3.45 |        9.74 |             9.85
HeapSort      |          0.39 |             5.43 |       16.19 |            16.34
FFT           |          4.4  |            27    |       54.89 |            67.29
Crypt         |          0.28 |             4.83 |        8.55 |             8.75
Performance of the shared-memory approach
- Using the parallel sparse matrix multiplication Java Grande benchmark JGFSparseMatmultBenchSizeB.
- Scales to 48 cores as expected (relative to OpenJDK).

[Figure: JGFSparseMatmultBenchSizeB, execution time in s (log scale, 0.1-100) vs. number of cores (0-50), for the Barrelfish JVM (Barrelfish, shared memory), the Barrelfish JVM (Linux), OpenJDK (No JIT) and OpenJDK.]
Performance of the shared-memory approach
- Quasi-linear speed-up implies large interpreter overhead.
- Barrelfish overhead presumably from agreement protocols.

[Figure 4.13: Average speed-up of the shared-memory approach on JGFSparseMatmultBenchSizeB, speed-up vs. number of cores (0-50), for Linux and Barrelfish (Shared).]
Performance of the distributed approach
- Distributed approach is orders of magnitude slower than the shared-memory approach.
- Sparse Matrix Multiplication is a difficult benchmark for this implementation: 7 pairs of messages for each iteration of the kernel (almost no communication for shared-memory).
- Overhead arguably caused by inter-core communication (150-600 cycles) and message handling in Barrelfish.

Cores | Run-time in s | σ (standard deviation)
    1 |          2.70 | 0.002
    2 |           458 | 7.891
    3 |           396 | 3.545
    4 |           402 | 7.616
    5 |           444 | 2.128
    6 |           514 | 36.77
    7 |          1764 | 247.7
    8 |          2631 | 335.9
   16 |          9334 | (only executed once)
Performance of the distributed approach
- The results show that without optimisation, the distributed approach is too slow to be feasible, at least for this benchmark.
- Measuring the completion time of threads on different cores shows the performance limitation due to inter-core communication.
- All data "lives" on the same node (Core #0): the thread running on the home node of the working set (jvm-node0) completes very quickly, while threads on other cores take orders of magnitude longer (Figure 4.14).
- Cores 0-5 are within a single processor, cores 6 & 7 are off-chip: communication with cores on other chips (#6 and #7) is significantly more expensive than on-chip communication.

Figure 4.14: Run-times of individual threads

Thread (running on jvm-node*) | Execution time in s
#0                            |                0.36
#1                            |              453.24
#2                            |              452.06
#3                            |              453.52
#4                            |              452.58
#5                            |              470.05
#6                            |            2,386.94
#7                            |            2,478.17
Discussion & Future Work
- Preliminary results show that future work should focus on reducing message-passing overhead and number of messages.
- How can these overheads be alleviated?
  - Reduce inter-core communication: caching of objects and arrays, like a directory-based MSI cache-coherence protocol.
  - Reduce message-passing latency: hardware support for message-passing (e.g. running on the Intel SCC).
- Additional areas of interest:
  - Garbage Collection on such a system.
  - Relocation of objects at run-time.
  - Logical partitioning of objects.
- Future work should investigate bringing up the Jikes RVM on Barrelfish, focussing on these aspects.
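The proposed directory-based caching could look roughly like the following (a sketch with invented names; the actual design is explicitly future work). A home node would keep one directory entry per object, tracking which nodes hold a cached copy, and a write would first invalidate all other copies, as in an MSI cache-coherence protocol:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of one directory entry in a directory-based MSI-style cache
// protocol for remote objects. A read adds the reading node to the sharer
// set; a write invalidates every other cached copy and leaves the writer
// with the only (Modified) copy.
public class MsiDirectoryEntry {
    enum State { MODIFIED, SHARED, INVALID }

    State state = State.INVALID;
    final Set<Integer> sharers = new HashSet<>(); // ids of nodes caching the object

    void read(int nodeId) {
        sharers.add(nodeId);
        state = State.SHARED;
    }

    // Returns the nodes whose cached copies must be invalidated before the write.
    Set<Integer> write(int nodeId) {
        Set<Integer> toInvalidate = new HashSet<>(sharers);
        toInvalidate.remove(nodeId);
        sharers.clear();
        sharers.add(nodeId);
        state = State.MODIFIED;
        return toInvalidate;
    }

    public static void main(String[] args) {
        MsiDirectoryEntry entry = new MsiDirectoryEntry();
        entry.read(0);
        entry.read(1);                        // nodes 0 and 1 now share a copy
        Set<Integer> inv = entry.write(2);    // node 2 writes: 0 and 1 must be invalidated
        System.out.println(entry.state + " invalidate=" + inv.size()); // prints "MODIFIED invalidate=2"
    }
}
```

For simplicity this sketch treats even a single reader as Shared and omits downgrading a Modified copy on a remote read; a real protocol would handle both, and would piggyback the invalidation messages on the existing inter-node channels.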
Questions?