“Towards an SSI for HP Java”
Francis Lau
The University of Hong Kong
With contributions from C.L. Wang, Ricky Ma, and W.Z. Zhu
ICPP-HPSECA03, 7/10/2003
Cluster Coming of Age
• HPC
  – Cluster the de facto standard equipment
  – Grid?
• Clusters
  – Fortran or C + MPI the norm
  – 99% on top of bare-bone Linux or the like
  – OK if application is embarrassingly parallel and regular
Cluster for the Mass
• Two modes:
  – For number crunching in Grande-type applications (superman)
  – As a CPU farm to support high-throughput computing (poor man)
• Application domains shown on the slide:
  – Commercial: data mining, financial modeling, oil reservoir simulation, seismic data processing, vehicle and aircraft simulation
  – Government: nuclear stockpile stewardship, climate and weather, satellite image processing, forces modeling
  – Academic: fundamental physics (particles, relativity, cosmology), biochemistry, environmental engineering, earthquake prediction
Cluster Programming
• Auto-parallelization tools have limited success
• Parallelization a chore but “have to do it” (or let’s hire someone)
• Optimization for performance not many users’ cup of tea
  – Partitioning and parallelization
  – Mapping
  – Remapping (experts?)
Amateur Parallel Programming
• Common problems
  – Poor parallelization: few large chunks or many small chunks
  – Load imbalances: large and small chunks
• Meeting the amateurs half-way
  – They do crude parallelization
  – System does the rest: mapping/remapping (automatic optimization)
  – And I/O?
Automatic Optimization
• “Feed the fat boy with two spoons, and a few slim ones with one spoon”
• But load information could be elusive
• Need smart runtime support
• Goal is to achieve high performance with good resource utilization and load balancing
• Large chunks that are single-threaded are a problem
The Good “Fat Boys”
• Large chunks that span multiple nodes
• Must be a program with multiple execution “threads”
• Threads can be in different nodes – program expands and shrinks
• Threads/programs can roam around – dynamic migration
• This encourages fine-grain programming
[Figure: an “amoeba” of threads stretching across several cluster nodes.]
Mechanism and Policy
• Mechanism for migration
  – Traditional process migration
  – Thread migration
• Redirection of I/O and messages
• Object sharing between nodes for threads
• Policy for good dynamic load balancing
  – Message traffic a crucial parameter
  – Predictive
• Towards the “single system image” ideal
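The predictive policy itself is not spelled out on the slide. As a loose illustration only (all names, weights, and the scoring formula are assumptions for this sketch, not the actual JESSICA/M-JavaMPI policy), a destination node might be scored from CPU load and observed message traffic like this:

// Toy illustration of a placement policy that weighs both CPU load and
// message traffic, as the slide suggests. Everything here is assumed.
import java.util.List;

final class NodeStats {
    final String host;
    final double cpuLoad;         // recent CPU utilization, 0.0 - 1.0
    final double msgMBytesPerSec; // observed message traffic involving this node

    NodeStats(String host, double cpuLoad, double msgMBytesPerSec) {
        this.host = host;
        this.cpuLoad = cpuLoad;
        this.msgMBytesPerSec = msgMBytesPerSec;
    }
}

final class PlacementPolicy {
    // Lower is better: prefer lightly loaded nodes with little competing
    // message traffic.
    static double score(NodeStats n) {
        return 0.7 * n.cpuLoad + 0.3 * n.msgMBytesPerSec;
    }

    static NodeStats pickDestination(List<NodeStats> candidates) {
        NodeStats best = candidates.get(0);
        for (NodeStats n : candidates) {
            if (score(n) < score(best)) {
                best = n;
            }
        }
        return best;
    }
}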
Single System Image
• If user does only crude parallelization and system does the rest …
• If processes/threads can roam, and processes expand/shrink …
• If I/O (including sockets) can be at any node anytime …
• We achieve at least 50% of SSI
  – The rest is difficult
[SSI layers shown on the slide: single entry point, file system, virtual networking, I/O and memory space, process space, management/programming view, …]
Bon Java!
• Java (for HPC) in good hands
  – JGF Numerics Working Group, IBM Ninja, …
  – JGF Concurrency/Applications Working Group (benchmarking, MPI, …)
  – The workshops
• Java has many advantages (vs. Fortran and C/C++)
• Performance not an issue any more
• Threads as first-class citizens!
• JVM can be modified
• “Java has the greatest potential to deliver an attractive productive programming environment spanning the very broad range of tasks needed by the Grande programmer” – The Java Grande Forum Charter
Process vs. Thread Migration
• Process migration easier than thread migration
  – Threads are tightly coupled
  – They share objects
• Two styles to explore
  – Process, MPI (“distributed computing”)
  – Thread, shared objects (“parallel computing”)
  – Or combined
• Boils down to messages vs. distributed shared objects
Two Projects @ HKU
• M-JavaMPI – “M” for “Migration”
  – Process migration
  – I/O redirection
  – Extension to grid
  – No modification of JVM and MPI
• JESSICA – “Java-Enabled Single System Image Computing Architecture”
  – By modifying JVM
  – Thread migration, Amoeba mode
  – Global object space, I/O redirection
  – JIT mode (Version 2)
Design Choices
• Bytecode instrumentation
  – Insert code into programs, manually or via pre-processor
• JVM extension
  – Make thread state accessible from Java program
  – Non-transparent
  – Modification of JVM is required
• Checkpointing the whole JVM process
  – Powerful but heavy penalty
• Modification of JVM
  – Runtime support
  – Totally transparent to the applications
  – Efficient but very difficult to implement
M-JavaMPI
• Support transparent Java process migration and provide communication redirection services
• Communication using MPI
• Implemented as a middleware on top of standard JVM
• No modifications of JVM and MPI
• Checkpointing the Java process + code insertion by preprocessor
System Architecture
[Layered architecture, top to bottom:]
• Java MPI program (Java bytecode)
• Preprocessing layer (inserts exception handlers): Java .class files are modified by inserting an exception handler in each method of each class; the handler is used to restore the process state.
• Java API and Java-MPI API: provide the MPI wrapper for Java programs.
• Migration layer (saves and restores the process): process and object information are saved and restored using object serialization, reflection, and exceptions through JVMDI.
• JVMDI (the debugger interface in Java 2): used to retrieve and restore the process state.
• JVM
• Restorable MPI layer: provides restorable MPI communication through MPI daemons.
• Native MPI: supports low-latency and high-bandwidth data communication.
• OS
• Hardware
Preprocessing
• Bytecode is modified before passing to JVM for execution
• “Restoration functions” are inserted as exception handlers, in the form of encapsulated “try-catch” statements
• Re-arrangement of bytecode, and addition of local variables
The Layers
• Java-MPI API layer
• Restorable MPI layer
  – Provides restorable MPI communications
  – No modification of MPI library
• Migration layer
  – Captures and saves the execution state of the migrating process in the source node, and restores the execution state of the migrated process in the destination node
  – Cooperates with the Restorable MPI layer to reconstruct the communication channels of the parallel application
State Capturing and Restoring
• Program code: re-used in the destination node
• Data: captured and restored by using the object serialization mechanism
• Execution context: captured by using JVMDI and restored by inserted exception handlers
• Eager (all) strategy: for each frame, local variables, referenced objects, the name of the class and class method, and program counter are saved using object serialization
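As a rough sketch of the eager strategy (the record layout and names are hypothetical, not M-JavaMPI's actual classes), each captured frame could be a serializable record holding exactly the items the slide lists:

// Hypothetical illustration of the eager (all) capture strategy:
// one serializable record per Java stack frame.
import java.io.Serializable;

final class FrameState implements Serializable {
    final String className;        // name of the class
    final String methodName;       // name of the class method
    final int programCounter;      // bytecode index where the thread was suspended
    final Object[] localVariables; // local variables and referenced objects

    FrameState(String className, String methodName,
               int programCounter, Object[] localVariables) {
        this.className = className;
        this.methodName = methodName;
        this.programCounter = programCounter;
        this.localVariables = localVariables;
    }
}

The whole execution context would then be an array of such records streamed through standard object serialization to the destination node.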
State Capturing using JVMDI
Original class:

public class A {
    int a;
    char b;
    …
}

After preprocessing (shown schematically; the try-catch actually wraps each method body at the bytecode level):

public class A {
    try {
        …
    } catch (RestorationException e) {
        a = saved value of local variable a;
        b = saved value of local variable b;
        pc = saved value of the program counter when the program was suspended;
        jump to the location where the program was suspended;
    }
}
Message Redirection Model
• MPI daemon in each node to support message passing between distributed Java processes
• IPC between the Java program and the MPI daemon in the same node through shared memory and semaphores (client-server)
[Figure: on each node, the Java program's Java-MPI API talks to a local MPI daemon (linked with the native MPI library) via IPC; the daemons carry out the actual MPI communication over the network.]
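A rough sketch of this redirection (hypothetical class and method names, not M-JavaMPI's real API): the Java side never calls MPI directly, every operation is handed to the node-local daemon, which owns the native MPI endpoints and so survives a restart of the JVM process on another node.

// Illustrative sketch only. The daemon performs the native MPI calls and
// can buffer messages addressed to a process that is currently migrating.
final class MpiRequest {
    enum Kind { SEND, RECV }
    final Kind kind;
    final int peerRank;
    final int tag;
    final byte[] payload; // null for RECV

    MpiRequest(Kind kind, int peerRank, int tag, byte[] payload) {
        this.kind = kind;
        this.peerRank = peerRank;
        this.tag = tag;
        this.payload = payload;
    }
}

/** Stand-in for the shared-memory/semaphore channel to the local MPI daemon. */
interface DaemonChannel {
    void submit(MpiRequest request);
    byte[] awaitReply(MpiRequest request);
}

final class JavaMpi {
    private final DaemonChannel daemon;

    JavaMpi(DaemonChannel daemon) { this.daemon = daemon; }

    void send(byte[] payload, int destRank, int tag) {
        daemon.submit(new MpiRequest(MpiRequest.Kind.SEND, destRank, tag, payload));
    }

    byte[] recv(int srcRank, int tag) {
        return daemon.awaitReply(new MpiRequest(MpiRequest.Kind.RECV, srcRank, tag, null));
    }
}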
Process Migration Steps
[Flowchart on the slide; the actors are the migration layer and the MPI daemon on the source node, and the MPI daemon and the migration client on the destination node. Legend: migration events, event triggers.]
• Migration layer (source node): suspends the user process, captures the process state, sends the migration request, and notifies its MPI daemon of the completion of capturing; the JVM and the process then quit.
• MPI daemon (source node): broadcasts the migration information to all MPI daemons, sends buffered messages, and sends a notification message (and the captured process data, if a central file system is not used).
• MPI daemon (destination node): starts an instance of the JVM with the JVMDI client and sends notification of the readiness of the captured process data.
• Migration client (destination node): the process is restarted and suspended, restoration of the execution state starts, and execution of the migrated process is resumed.
Experiments
• PC cluster
  – 16-node cluster
  – 300 MHz Pentium II with 128 MB of memory
  – Linux 2.2.14 with Sun JDK 1.3.0
  – 100 Mb/s Fast Ethernet
• All Java programs executed in interpreted mode
Bandwidth: PingPong Test
[Chart: bandwidth (MB/s) vs. message size (bytes) for native MPI, direct Java-MPI binding, and migratable Java-MPI.]
• Native MPI: 10.5 MB/s
• Direct Java-MPI binding: 9.2 MB/s
• Restorable MPI layer: 7.6 MB/s
Latency: PingPong Test
[Chart: latency (s) vs. message size (bytes) for native MPI, direct Java-MPI binding, and migratable Java-MPI.]
• Native MPI: 0.2 ms
• Direct Java-MPI binding: 0.23 ms
• Restorable MPI layer: 0.26 ms
Migration Cost: capturing and restoring objects
[Chart: time spent (ms, log scale) capturing and restoring objects vs. data size (number of integers).]
Migration Cost: capturing and restoring frames
[Chart: capture time and restore time (ms) vs. number of frames (0 to 600).]
Application Performance
• PI calculation
• Recursive ray-tracing
• NAS integer sort
• Parallel SOR
[Chart: time spent (execution time in seconds) in calculating PI and in ray-tracing, with and without the migration layer, versus the number of nodes (1 to 8).]
Execution time of the NAS program with different problem sizes (16 nodes)

Problem size         | Without M-JavaMPI (sec) | With M-JavaMPI (sec)    | Overhead (%)
(no. of integers)    | Total   Comp    Comm    | Total   Comp    Comm    | Total   Comm
Class S: 65536       | 0.023   0.009   0.014   | 0.026   0.009   0.017   | 13%     21%
Class W: 1048576     | 0.393   0.182   0.212   | 0.424   0.182   0.242   | 7.8%    14%
Class A: 8388608     | 3.206   1.545   1.660   | 3.387   1.546   1.840   | 5.6%    11%

No noticeable overhead is introduced in the computation part, while the communication part shows an overhead of about 10-20%.
[Chart: time spent (execution time in seconds) in executing SOR using different numbers of nodes (1 to 8), with and without the migration layer.]
Cost of Migration
Time spent in executing the SOR program on an array of size 256x256, without and with one migration during the execution:

No. of nodes | No migration (sec) | One migration (sec)
1            | 1013               | 1016
2            | 518                | 521
4            | 267                | 270
6            | 176                | 178
8            | 141                | 144
Cost of Migration
• Time spent in migration (in seconds) for different applications:

Application  | Average migration time (sec)
PI           | 2
Ray-tracing  | 3
NAS          | 2
SOR          | 3
Dynamic Load Balancing
• A simple test
  – The SOR program was executed using six nodes in an unevenly loaded environment, with one of the nodes executing a computationally intensive program
• Without migration: 319 s
• With migration: 180 s
In Progress
– M-JavaMPI in JIT mode
– Develop system modules for automatic dynamic load balancing
– Develop system modules for effective fault-tolerant support
Java Virtual Machine
[Diagram: application class files and Java API class files flow through the class loader into the JVM; bytecode is either interpreted or compiled at runtime into native code.]
• Class loader
  – Loads class files
• Interpreter
  – Executes bytecode
• Runtime compiler
  – Converts bytecode to native code
A Multithreaded Java Program
[Diagram: threads in the JVM, each with its own stack frames and PC, sharing the Java method area (code) and the heap (data); the class loader and execution engine operate on the class files.]

public class ProducerConsumerTest {
    public static void main(String[] args) {
        CubbyHole c = new CubbyHole();
        Producer p1 = new Producer(c, 1);
        Consumer c1 = new Consumer(c, 1);
        p1.start();
        c1.start();
    }
}
Java Memory Model
(How to maintain memory consistency between threads)
[Diagram: threads T1 and T2, each with its own per-thread working memory, and main memory holding the master copy of each variable in the heap area.]
• T1 loads the variable from main memory into its working memory; the variable is modified in T1's working memory before use.
• When T1 performs an unlock, the variable is written back to main memory.
• When T2 performs a lock, its working memory is flushed; when T2 then uses the variable, it will be loaded from main memory.
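A small Java example of the rule the slide illustrates (this is standard Java memory model behaviour, not JESSICA-specific code): the lock and unlock actions of a synchronized block are what force a thread's working copy to be reconciled with main memory.

// Standard Java memory model behaviour, shown for illustration.
// Without the synchronized blocks (lock/unlock on the same monitor),
// the reading thread would be allowed to keep using a stale working
// copy of `value` instead of reloading it from main memory.
public class SharedCounter {
    private int value;                    // master copy lives in main memory
    private final Object lock = new Object();

    public void increment() {             // writer thread (T1)
        synchronized (lock) {             // lock: refresh working copy
            value++;
        }                                 // unlock: write back to main memory
    }

    public int read() {                   // reader thread (T2)
        synchronized (lock) {             // lock: forces a fresh load
            return value;
        }
    }
}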
Problems in Existing DJVMs
• Mostly based on interpreters
  – Simple but slow
• Layered design using a distributed shared memory system (DSM) → cannot be tightly coupled with the JVM
  – JVM runtime information cannot be channeled to the DSM
  – False sharing if a page-based DSM is employed
  – Page faults block the whole JVM
• Programmer to specify thread distribution → lack of transparency
  – Need to rewrite multithreaded Java applications
  – No dynamic thread distribution (preemptive thread migration) for load balancing
Related Work
• Method shipping: IBM cJVM
  – Like remote method invocation (RMI): when accessing object fields, the proxy redirects the flow of execution to the node where the object's master copy is located.
  – Executed in interpreter mode.
  – Load balancing problem: affected by the object distribution.
• Page shipping: Rice U. Java/DSM, HKU JESSICA
  – Simple. GOS was supported by a page-based distributed shared memory (e.g., TreadMarks, JUMP, JiaJia).
  – JVM runtime information can't be channeled to the DSM.
  – Executed in interpreter mode.
• Object shipping: Hyperion, Jackal
  – Leverage some object-based DSM.
  – Executed in native mode: Hyperion translates Java bytecode to C; Jackal compiles Java source code directly to native code.
Distributed Java Virtual Machine (DJVM)
JESSICA2: a distributed Java Virtual Machine (DJVM) spanning multiple cluster nodes can provide a true parallel execution environment for multithreaded Java applications, with a Single System Image illusion to Java threads.
[Diagram: Java threads created in a program are spread across cluster nodes connected by a high-speed network, sharing a Global Object Space.]
JESSICA2 Main Features
• Transparent Java thread migration
  – Runtime capturing and restoring of thread execution context
  – No source code modification; no bytecode instrumentation (preprocessing); no new API introduced
  – Enables dynamic load balancing on clusters
• Operated in Just-In-Time (JIT) compilation mode
• Global Object Space (GOS)
  – A shared global heap spanning all cluster nodes
  – Adaptive object home migration protocol
  – I/O redirection
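For illustration, this is the kind of ordinary multithreaded program the slide refers to: it uses only java.lang.Thread, so, per the slide, JESSICA2 can distribute and migrate its threads without any source changes or special API. The program itself is just an assumed example, not taken from the JESSICA2 benchmarks.

// Plain Java threads, no new API: each worker sums a strided range and
// the main thread joins them and combines the partial results.
public class ParallelSum {
    static class Worker extends Thread {
        final int id;
        long sum;
        Worker(int id) { this.id = id; }
        public void run() {
            for (long n = id; n < 1000000L; n += 4) {
                sum += n;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Worker[] workers = new Worker[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Worker(i);
            workers[i].start();          // under JESSICA2 these threads may
        }                                // end up on different cluster nodes
        long total = 0;
        for (Worker w : workers) {
            w.join();                    // join provides the needed visibility
            total += w.sum;
        }
        System.out.println("sum = " + total);
    }
}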
Transparent Thread Migration in JIT Mode
• Simple for interpreters (e.g., JESSICA)
  – The interpreter sits in the bytecode decoding loop, which can be stopped upon checking a migration flag
  – The full state of a thread is available in the data structures of the interpreter
  – No register allocation
• JIT-mode execution makes things complex (JESSICA2)
  – Native code has no clear bytecode boundary
  – How to deal with machine registers?
  – How to organize the stack frames (all are in native form now)?
  – How to make extracted thread states portable and recognizable by the remote JVM?
  – How to restore the extracted states (rebuild the stack frames) and restart the execution in native form?
  – Need to modify the JIT compiler to instrument native code
An Overview of JESSICA2 Java Thread Migration
[Diagram: source node and destination node, each running a JVM with its method area and GOS (heap); the components involved include the thread scheduler, load monitor, and migration manager. Steps: (1) alert the thread to be migrated; (2) stack analysis and stack capturing of its frames and PC on the source node; (3) frame parsing and restoration of execution on the destination node; (4a) object access through the GOS; (4b) loading of methods from NFS.]
Essential Functions
• Migration point selection
  – At the start of a loop, basic block, or method
• Register context handler
  – Spill dirty registers at the migration point without invalidation, so that native code can continue using registers
  – Use a register-recovering stub at the restoring phase
• Variable type deduction
  – Spill types in stacks using compression
• Java frame linking
  – Discover consecutive Java frames
Dynamic Thread State Capturing and Restoring in JESSICA2
[Diagram of the modified JIT compilation pipeline: bytecode verifier → migration point selection → register allocation → bytecode translation, which (1) adds migration checking, (2) adds object checking, and (3) adds type and register spilling → intermediate code → code generation → native code, with linking and constant resolution.]
• Capturing: native stack scanning walks the native thread stack, distinguishing Java frames from C frames; the checks added at a migration point in the generated native code look like

    cmp mflag, 0
    jz ...
    cmp obj[offset], 0
    jz ...
    mov 0x110182, slot
    ...

• Restoring: a register-recovering stub reloads spilled stack slots into registers before execution continues

    mov slot1 -> reg1
    mov slot2 -> reg2
    ...

• Global object access is handled through the GOS.
How to Maintain Memory Consistency in a Distributed Environment?
[Diagram: threads T1-T8 spread over several cluster nodes, each node with its own heap and OS, connected by a high-speed network; the per-node heaps must appear as one consistent memory to the threads.]
Embedded Global Object Space (GOS)
• Takes advantage of JVM runtime information for optimization (e.g., object types, accessing threads, etc.)
• Uses the threaded I/O interface inside the JVM for communication to hide the latency → non-blocking GOS access
• Object-based to reduce false sharing
• Home-based, compliant with the JVM memory model (“lazy release consistency”)
• Master heap (home objects) and cache heap (local and cached objects): reduces object access latency
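A rough sketch of the home-based access path the slide describes (class and interface names are hypothetical; JESSICA2 implements this inside the JVM, not in Java application code): objects are served from the local master or cache heap when possible and fetched from their home node otherwise, with cached copies discarded at synchronization points.

// Illustrative sketch of a home-based global object space.
import java.util.HashMap;
import java.util.Map;

interface Transport {
    Object fetchFromHome(int homeNodeId, long objectId);
}

final class GlobalObjectSpace {
    private final int myNodeId;
    private final Map<Long, Object> masterHeap = new HashMap<Long, Object>(); // home objects
    private final Map<Long, Object> cacheHeap  = new HashMap<Long, Object>(); // cached copies
    private final Transport transport;

    GlobalObjectSpace(int myNodeId, Transport transport) {
        this.myNodeId = myNodeId;
        this.transport = transport;
    }

    /** Read access: local master copy, then cached copy, else fetch from the home node. */
    Object access(long objectId, int homeNodeId) {
        if (homeNodeId == myNodeId) {
            return masterHeap.get(objectId);        // we are the home: master copy
        }
        Object cached = cacheHeap.get(objectId);
        if (cached == null) {
            cached = transport.fetchFromHome(homeNodeId, objectId);
            cacheHeap.put(objectId, cached);        // keep until the next acquire
        }
        return cached;
    }

    /** Lazy release consistency: cached copies are discarded when a lock is acquired. */
    void onMonitorEnter() {
        cacheHeap.clear();
    }
}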
Object Cache
[Diagram: the global heap is made up of each JVM's master heap area (home objects) and cache heap area (cached copies), indexed by hash tables and accessed by the local Java threads.]
Adaptive object home migration
• Definition
  – “Home” of an object = the JVM that holds the master copy of the object
• Problem
  – Cached objects need to be flushed and re-fetched from the home whenever synchronization happens
• Adaptive object home migration
  – If the number of accesses from one thread dominates the total number of accesses to an object, the object's home is migrated to the node where that thread is running
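A toy sketch of this decision rule (the names and the 0.8 "dominates" threshold are assumptions for the sketch, not JESSICA2's actual values; the real protocol lives inside the JVM):

// Counts per-node accesses to one object and migrates the object's home
// when a single node clearly dominates, so its accesses become local.
import java.util.HashMap;
import java.util.Map;

final class HomeMigrationMonitor {
    private final Map<Integer, Long> accessesByNode = new HashMap<Integer, Long>();
    private long totalAccesses;
    private int homeNodeId;

    HomeMigrationMonitor(int initialHomeNodeId) {
        this.homeNodeId = initialHomeNodeId;
    }

    /** Record one access and migrate the home if one node dominates. */
    void recordAccess(int nodeId) {
        Long old = accessesByNode.get(nodeId);
        long count = (old == null) ? 1L : old + 1L;
        accessesByNode.put(nodeId, count);
        totalAccesses++;

        if (nodeId != homeNodeId && count > 0.8 * totalAccesses) {
            homeNodeId = nodeId;   // the master copy moves to this node
        }
    }

    int home() { return homeNodeId; }
}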
I/O redirection
• Timer
  – Use the time on the master node as the standard time
  – Calibrate the time on worker nodes when they register with the master node
• File I/O
  – Use half a word of the “fd” as the node number (see the sketch below)
  – Open file: for read, check locally first, then the master node; for write, go to the master node
  – Read/Write: go to the node specified by the node number in the fd
• Network I/O
  – Connectionless send: do it locally
  – Others: go to the master
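A minimal sketch of the fd encoding (the slide only says "half a word"; the 16/16-bit split and the class name are assumptions made here):

// Pack the owning node number into one half of a 32-bit descriptor so
// read/write calls can be redirected to the node that opened the file.
final class GlobalFd {
    static int encode(int nodeId, int localFd) {
        return (nodeId << 16) | (localFd & 0xFFFF);
    }

    static int nodeOf(int globalFd) {
        return globalFd >>> 16;       // half word used as the node number
    }

    static int localFdOf(int globalFd) {
        return globalFd & 0xFFFF;     // per-node descriptor
    }
}

A read or write on such a descriptor is then forwarded to nodeOf(fd), matching the slide's rule of going to the node specified by the node number in the fd.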
Experimental Setting
• Modified Kaffe Open JVM version 1.0.6
• Linux PC clusters
  1. Pentium II PCs at 540 MHz (Linux 2.2.1 kernel) connected by Fast Ethernet
  2. HKU Gideon 300 cluster (for the ray-tracing demo)
Parallel Ray Tracing on JESSICA2
(Using 64 nodes of the Gideon 300 cluster; Linux 2.4.18-3 kernel, Redhat 7.3)
• 1 node: 4420 seconds (~1 hour)
• 64 nodes: 108 seconds
• Speedup = 4402/108 = 40.75
Micro Benchmarks
Time breakdown of thread migration (PI calculation):
[Chart components: capture time, parsing time, native thread creation, resolution of methods, frame setup time.]
Java Grande Benchmark
[Chart: Java Grande benchmark results on a single node, comparing Kaffe 1.0.6 JIT and JESSICA2 across Barrier, ForkJoin, Sync, Crypt, LUFact, SOR, Series, and SparseMatmult.]
SPECjvm98 Benchmark
Legend: “M+”: enabling migration; “M-”: disabling the migration mechanism; “I+”: enabling pseudo-inlining; “I-”: disabling pseudo-inlining.
JESSICA2 vs JESSICA (CPI)
[Chart: execution time (ms) of CPI (50,000,000 iterations) on JESSICA and JESSICA2 for 2, 4, and 8 nodes.]
Application Performance
[Chart: speedup of CPI, TSP, Raytracer, and nBody on 2, 4, and 8 nodes, compared against linear speedup.]
Effect of Adaptive Object Home Migration (SOR)
[Chart: SOR execution time (ms) on 2, 4, and 8 nodes, with adaptive object home migration disabled vs. enabled.]
Work in Progress
• New optimization techniques for GOS
• Incremental distributed GC
• Load balancing module
• Enhanced single I/O space to benefit more real-life applications
• Parallel I/O support
Conclusion
• Effective HPC for the mass
  – They supply the (parallel) program, the system does the rest
  – Let's hope for parallelizing compilers
  – Small- to medium-grain programming
  – SSI the ideal
  – Java the choice
  – Poor man mode too
• Thread distribution and migration feasible
• Overhead reduction
  – Advances in low-latency networking
  – Migration as intrinsic function (JVM, OS, hardware)
• Grid and pervasive computing
Some Publications
• W.Z. Zhu, C.L. Wang, and F.C.M. Lau, “A Lightweight Solution for Transparent Java Thread Migration in Just-in-Time Compilers,” ICPP 2003, Taiwan, October 2003.
• W.J. Fang, C.L. Wang, and F.C.M. Lau, “On the Design of Global Object Space for Efficient Multi-threading Java Computing on Clusters,” Parallel Computing, to appear.
• W.Z. Zhu, C.L. Wang, and F.C.M. Lau, “JESSICA2: A Distributed Java Virtual Machine with Transparent Thread Migration Support,” CLUSTER 2002, Chicago, September 2002, 381-388.
• R. Ma, C.L. Wang, and F.C.M. Lau, “M-JavaMPI: A Java-MPI Binding with Process Migration Support,” CCGrid 2002, Berlin, May 2002.
• M.J.M. Ma, C.L. Wang, and F.C.M. Lau, “JESSICA: Java-Enabled Single-System-Image Computing Architecture,” Journal of Parallel and Distributed Computing, Vol. 60, No. 10, October 2000, 1194-1222.
THE END
And Thanks!