Cluster Computing with
Java Threads
Philip J. Hatcher
University of New Hampshire
[email protected]
Collaborators
• UNH/Hyperion
– Mark MacBeth and Keith McGuigan
• ENS-Lyon/DSM-PM2
– Gabriel Antoniu, Luc Bougé and Raymond
Namyst
Focus
• Use Java “as is” for high-performance
computing
– support computationally intensive
applications
– utilize parallel computing hardware
Outline
• Our Vision
• Java Threads
• The PM2 Run-time Environment
• Hyperion: Java Threads on Clusters
• Evaluation
• Related Work
• Conclusions
Why Java?
• Soon to be ubiquitous!
– use of Java is growing very rapidly
• Designed for portability:
– develop programs on your desktop
– run programs on a distant cluster
Why Java?
• Explicitly parallel!
– includes a threaded programming model
• Relaxed memory model
– consistency model aids an implementation
on distributed-memory parallel computers
Unique Opportunity
• Use Java to bring parallelism to the
“masses”
• Let’s not miss it!
• But, programmers will not accept syntax
or model changes
Open Question
• Parallelism via Java access to distributed-computing techniques?
– e.g. RMI (remote method invocation)
• Or, parallelism via Java threads?
That is, ...
• Does a user prefer to view a cluster as a
collection of distinct machines?
• Or, does a user prefer to view a cluster
as a “black box” that will simply run Java
code faster?
Are you “in a box”?
Or, are you “thinking outside
of the box”?
Climb out of the box!
• Use Java threads “as is” to program
clusters of computers.
• Program for the threaded Java virtual
machine.
• Allow the implementation to handle the
details of executing in a cluster.
Java Threads
• Threads are objects.
• The class java/lang/Thread contains all of
the methods for initializing, running,
suspending, querying and destroying
threads.
java/lang/Thread methods
• Thread() - constructor for thread object.
• start() - start the thread executing.
• run() - method invoked by ‘start’.
• stop(), suspend(), resume(), join(),
yield().
• setPriority().
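As a minimal sketch of the methods listed above (not code from the presentation), the following program subclasses Thread, overrides run(), and uses start() and join() to sum an array in parallel. The class names SumWorker and ThreadDemo are hypothetical.

```java
// Illustrative sketch only: SumWorker and ThreadDemo are hypothetical names.
class ThreadDemo {
    // Splits the array into nThreads slices, one worker thread per slice.
    static long parallelSum(int[] data, int nThreads) {
        SumWorker[] workers = new SumWorker[nThreads];
        int chunk = (data.length + nThreads - 1) / nThreads;
        for (int t = 0; t < nThreads; t++) {
            int from = Math.min(data.length, t * chunk);
            int to = Math.min(data.length, from + chunk);
            workers[t] = new SumWorker(data, from, to);
            workers[t].start();                 // begins executing run() on a new thread
        }
        long total = 0;
        for (SumWorker w : workers) {
            try {
                w.join();                       // wait for the worker to terminate
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
            total += w.partial;                 // safe to read after join()
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(parallelSum(data, 4));   // prints 500500
    }
}

class SumWorker extends Thread {
    private final int[] data;
    private final int from, to;
    long partial;                               // result of this worker's slice

    SumWorker(int[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    public void run() {                         // the method invoked as a result of start()
        long s = 0;
        for (int i = from; i < to; i++) s += data[i];
        partial = s;
    }
}
```

The program names no cluster node anywhere, which is the property the presentation relies on: placement of threads is left to the implementation.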
Java Synchronization
• Java uses monitors, which protect a
region of code by allowing only one
thread at a time to execute it.
• Monitors utilize locks.
• There is a lock associated with each
object.
synchronized keyword
• synchronized ( Exp ) Block
• public class Q {
      synchronized void put(…) {
          …
      }
  }
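Both forms of the keyword can be sketched as follows. This is an illustrative example under assumed names (SyncDemo, increment, incrementOther), not code from the presentation: the method form locks the receiver object, and the statement form locks whatever object the expression evaluates to.

```java
// Illustrative sketch only: all names here are hypothetical.
class SyncDemo {
    private int count = 0;
    private int other = 0;
    private final Object lock = new Object();

    // Method form: the lock of 'this' is held for the duration of the call.
    synchronized void increment() { count++; }

    synchronized int count() { return count; }

    // Statement form, synchronized ( Exp ) Block: locks the object Exp evaluates to.
    void incrementOther() {
        synchronized (lock) { other++; }
    }

    int other() {
        synchronized (lock) { return other; }
    }

    // Runs nThreads threads that each perform perThread increments of both counters.
    static int[] run(int nThreads, int perThread) {
        SyncDemo d = new SyncDemo();
        Thread[] threads = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    d.increment();
                    d.incrementOther();
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return new int[] { d.count(), d.other() };
    }

    public static void main(String[] args) {
        int[] r = run(4, 10000);
        System.out.println(r[0] + " " + r[1]);   // prints 40000 40000
    }
}
```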
java/lang/Object methods
• wait() - the calling thread, which must
hold the lock for the object, is placed in a
wait set associated with the object. The
lock is then released.
• notify() - an arbitrary thread in the wait
set of this object is awakened and then
competes again for the object's lock.
• notifyAll() - all waiting threads are awakened.
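A bounded queue is a common way to illustrate these methods. The sketch below is an assumption-laden example rather than code from the presentation: BoundedQ and WaitNotifyDemo are hypothetical names, and the queue uses notifyAll() so that producers and consumers can share a single wait set.

```java
// Illustrative sketch only: BoundedQ and WaitNotifyDemo are hypothetical names.
class WaitNotifyDemo {
    // A producer thread puts 1..n into the queue; the caller consumes and sums them.
    static long transfer(int n, int capacity) {
        BoundedQ q = new BoundedQ(capacity);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) q.put(i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        long sum = 0;
        try {
            for (int i = 0; i < n; i++) sum += q.get();
            producer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(transfer(100, 4));   // prints 5050
    }
}

class BoundedQ {
    private final int[] items;
    private int head = 0;
    private int size = 0;

    BoundedQ(int capacity) { items = new int[capacity]; }

    synchronized void put(int x) throws InterruptedException {
        while (size == items.length) wait();    // lock is released while waiting
        items[(head + size) % items.length] = x;
        size++;
        notifyAll();                            // wake any thread waiting in get()
    }

    synchronized int get() throws InterruptedException {
        while (size == 0) wait();               // re-check the condition after waking
        int x = items[head];
        head = (head + 1) % items.length;
        size--;
        notifyAll();                            // wake any thread waiting in put()
        return x;
    }
}
```

The wait() calls sit inside while loops because an awakened thread must compete for the lock again, and the condition may no longer hold by the time it acquires it.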
Shared-Memory Model
• Java threads execute in a virtual shared
memory.
• All threads are able to access all objects.
• But threads may not access each other’s
stacks.
Java Memory Consistency
• A variant of release consistency.
• Threads can keep locally cached copies of
objects.
• Consistency is provided by requiring that:
– a thread's object cache be flushed upon entry
to a monitor.
– local modifications made to cached objects
be transmitted to the central memory when a
thread exits a monitor.
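The two rules can be made concrete with a toy, single-threaded simulation. This is only a model under stated assumptions, not how a JVM or Hyperion is implemented: ThreadCache, monitorEnter and monitorExit are hypothetical names, the central memory is a plain map, and the sketch writes pending modifications back before discarding cached copies so that nothing is lost.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative toy model only: all names here are hypothetical.
class ConsistencyDemo {
    /** Per-thread cache in front of a shared "central memory" map. */
    static class ThreadCache {
        private final Map<String, Integer> central;
        private final Map<String, Integer> cached = new HashMap<>();
        private final Map<String, Integer> dirty = new HashMap<>();

        ThreadCache(Map<String, Integer> central) { this.central = central; }

        int read(String name) {
            Integer v = cached.get(name);
            if (v == null) {                    // miss: fetch from central memory
                v = central.getOrDefault(name, 0);
                cached.put(name, v);
            }
            return v;
        }

        void write(String name, int value) {
            cached.put(name, value);
            dirty.put(name, value);             // remembered until the next write-back
        }

        void monitorEnter() {
            writeBack();                        // assumption of this sketch: keep pending writes
            cached.clear();                     // rule 1: flush the cache on monitor entry
        }

        void monitorExit() {
            writeBack();                        // rule 2: transmit modifications on monitor exit
        }

        private void writeBack() {
            central.putAll(dirty);
            dirty.clear();
        }
    }

    // Returns {value A sees before entering a monitor, value A sees after entering}.
    static int[] demo() {
        Map<String, Integer> central = new HashMap<>();
        central.put("x", 0);
        ThreadCache a = new ThreadCache(central);
        ThreadCache b = new ThreadCache(central);

        a.read("x");                            // A caches x = 0
        b.monitorEnter();
        b.write("x", 1);
        b.monitorExit();                        // central memory now holds x = 1

        int stale = a.read("x");                // still served from A's cache
        a.monitorEnter();
        int fresh = a.read("x");                // re-fetched after the flush
        a.monitorExit();
        return new int[] { stale, fresh };
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println(r[0] + " " + r[1]);  // prints 0 1
    }
}
```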
PM2: A Distributed, Multithreaded
Runtime Environment
• Thread library: Marcel
– User-level
– Supports SMP
– POSIX-like
– Preemptive thread migration
• Communication library: Madeleine
– Portable: BIP, SISCI/SCI, MPI, TCP, PVM
– Efficient

                  Context Switch   Create
SMP               0.250 µs         2 µs
Non-SMP           0.120 µs         0.55 µs

Thread Migration
SCI/SISCI         24 µs
BIP/Myrinet       75 µs

                  Latency          Bandwidth
SCI/SISCI         6 µs             70 MB/s
BIP/Myrinet       8 µs             125 MB/s
DSM-PM2: Architecture
• DSM comm:
– send page request
– send page
– send invalidate request
– …
• DSM page manager:
– set/get page owner
– set/get page access
– add/remove to/from copyset
– ...
[Figure: DSM-PM2 layers, top to bottom: DSM Protocol Policy; DSM Protocol lib; DSM Page Manager and DSM Comm. DSM-PM2 sits on PM2, which comprises Madeleine (comms) and Marcel (threads).]
DSM-PM2: Performance
Operation / Protocol        SISCI/SCI   BIP/Myrinet   TCP/Myrinet
Page fault                  18 µs       56 µs         56 µs
Fault handling              1 µs        2 µs          2 µs
Transmitting request        17 µs       30 µs         190 µs
Processing request          1 µs        2 µs          2 µs
Sending back 4 kB page      85 µs       134 µs        412 µs
Installing the page         12 µs       24 µs         24 µs
Total                       134 µs      248 µs        686 µs
• SCI cluster has 450 MHz Pentium II nodes
• Myrinet cluster has 200 MHz Pentium Pro nodes
Hyperion
• Executes threaded Java programs on
clusters.
• Built on top of PM2 and DSM-PM2.
– Provides both portability and efficiency
Reversing the Bytecode
Stream
• Conventionally, users “pull” bytecode to
their machines for local execution.
• Our vision:
– users develop their high-performance Java
programs using the Java toolset on their
desktop.
– they then “push” the resulting bytecode to a
Hyperion server for high-performance cycles.
Supporting High Performance
• Utilizes a bytecode-to-C translator.
• Parallel execution via spreading of Java
threads across nodes of the cluster.
• Java threads implemented as lightweight
threads using PM2 library.
Compiling Java
• Hyperion designed for computationally
intensive applications, so small overhead
of translating bytecode is not important.
• Translating to C allows us to leverage the
native C compiler and optimizer.
General Hyperion Overview
[Figure: compilation pipeline. prog.java → javac (Sun's Java compiler) → prog.class (bytecode) → java2c (instruction-wise translation) → prog.[ch] → gcc -O6, linked with libs (runtime libraries) → prog]
The Hyperion Run-Time System
• Collection of modules to allow “plug-and-play” implementations:
– inter-node communication
– threads
– memory and synchronization
– etc
Hyperion Internal Structure
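[Figure: Hyperion internal structure. Hyperion modules: load balancer, thread subsystem, native Java API, memory subsystem, comm. subsystem. These sit on the PM2 API (pm2_rpc, pm2_thread_create, etc.), beneath which PM2 provides the DSM subsystem, thread subsystem, and comm. subsystem.]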
Thread and Object Allocation
• Currently, threads are allocated to processors
in round-robin fashion.
• Currently, an object is allocated to the
processor that holds the thread that is
creating the object.
• Currently, DSM-PM2 is used to implement the
Java memory model.
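The two placement rules above amount to very little code. The following sketch is illustrative only: PlacementDemo and its methods are hypothetical names, and threads are assumed to be numbered in creation order starting from zero.

```java
import java.util.Arrays;

// Illustrative sketch only: names and the thread numbering are assumptions.
class PlacementDemo {
    // Round-robin: thread i runs on node i mod numNodes.
    static int nodeForThread(int threadIndex, int numNodes) {
        return threadIndex % numNodes;
    }

    // An object's home is the node holding the thread that creates it.
    static int homeForObject(int creatingThreadIndex, int numNodes) {
        return nodeForThread(creatingThreadIndex, numNodes);
    }

    // Counts how many of numThreads threads land on each node.
    static int[] threadsPerNode(int numThreads, int numNodes) {
        int[] counts = new int[numNodes];
        for (int t = 0; t < numThreads; t++) {
            counts[nodeForThread(t, numNodes)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 64 threads on 8 nodes: prints [8, 8, 8, 8, 8, 8, 8, 8]
        System.out.println(Arrays.toString(threadsPerNode(64, 8)));
    }
}
```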
Hyperion’s DSM API
• loadIntoCache
• invalidateCache
• updateMainMemory
• get
• put
DSM Implementation
• Node-level caches.
• Page-based and home-based protocol.
• Log modifications made to remote objects.
• Use explicit in-line checks in get/put.
• Each node allocates objects from a
different range of the virtual address
space.
Details
• Objects are aligned on 64-byte boundaries.
• An object reference is the address of the
base of the object.
• The bottom 6 bits of the ref can be used to
store the node number of the object’s home.
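Because 64-byte alignment leaves the low 6 bits of every base address zero, a node number from 0 to 63 fits there. The arithmetic can be sketched as below, using a long to stand in for a pointer; RefBits and its method names are hypothetical and do not come from Hyperion.

```java
// Illustrative sketch only: a long stands in for a native pointer.
class RefBits {
    static final long ALIGNMENT = 64;               // objects are 64-byte aligned
    static final long NODE_MASK = ALIGNMENT - 1;    // 0x3F: the bottom 6 bits

    // Packs a home-node number into the unused low bits of an aligned address.
    static long encode(long baseAddress, int node) {
        if ((baseAddress & NODE_MASK) != 0) {
            throw new IllegalArgumentException("address is not 64-byte aligned");
        }
        if (node < 0 || node > NODE_MASK) {
            throw new IllegalArgumentException("node number must fit in 6 bits");
        }
        return baseAddress | node;
    }

    static int homeNode(long ref) {
        return (int) (ref & NODE_MASK);
    }

    static long baseAddress(long ref) {
        return ref & ~NODE_MASK;
    }

    public static void main(String[] args) {
        long ref = encode(0x1000L, 5);
        System.out.println(Long.toHexString(ref));              // prints 1005
        System.out.println(homeNode(ref));                      // prints 5
        System.out.println(Long.toHexString(baseAddress(ref))); // prints 1000
    }
}
```

Six bits bound the scheme at 64 nodes.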
More details
• loadIntoCache checks the 6 bits to see if
an object is remote.
• If so, and if not already locally cached,
DSM-PM2 is used to load the page(s)
containing the object.
• When a remote object is cached, a bit is
turned on in its header.
Yet more details
• The put primitive checks the header bit
to see if a modification should be logged.
• updateMainMemory sends the logged
changes to the home node.
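The interplay of the primitives described above can be simulated in a few dozen lines. The sketch below is a toy model, not Hyperion's runtime: it works per object rather than per page, uses maps in place of memory, and lets get and put call loadIntoCache themselves. The class names, the cachedCopy field standing in for the header bit, and the address ranges are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative toy model only: not Hyperion's actual implementation.
class DsmSketch {
    static final long NODE_MASK = 63;               // bottom 6 bits hold the home node

    static class Obj {
        final int[] fields;
        final boolean cachedCopy;                   // stands in for the header bit
        Obj(int[] fields, boolean cachedCopy) {
            this.fields = fields;
            this.cachedCopy = cachedCopy;
        }
    }

    static class Node {
        final int id;
        final Node[] cluster;                       // all nodes, indexed by node number
        final Map<Long, Obj> home = new HashMap<>();    // objects homed on this node
        final Map<Long, Obj> cache = new HashMap<>();   // cached copies of remote objects
        final List<long[]> log = new ArrayList<>();     // {ref, field, value} per remote write
        private long next;

        Node(int id, Node[] cluster) {
            this.id = id;
            this.cluster = cluster;
            this.next = (long) (id + 1) << 20;      // each node uses its own address range
            cluster[id] = this;
        }

        long allocate(int nFields) {
            long ref = next | id;                   // aligned base plus home-node bits
            next += 64;
            home.put(ref, new Obj(new int[nFields], false));
            return ref;
        }

        boolean isRemote(long ref) { return (ref & NODE_MASK) != id; }

        void loadIntoCache(long ref) {              // inline check on the 6 bits
            if (isRemote(ref) && !cache.containsKey(ref)) {
                Node h = cluster[(int) (ref & NODE_MASK)];
                cache.put(ref, new Obj(h.home.get(ref).fields.clone(), true));
            }
        }

        private Obj resolve(long ref) {
            loadIntoCache(ref);
            return isRemote(ref) ? cache.get(ref) : home.get(ref);
        }

        int get(long ref, int field) { return resolve(ref).fields[field]; }

        void put(long ref, int field, int value) {
            Obj o = resolve(ref);
            o.fields[field] = value;
            if (o.cachedCopy) log.add(new long[] { ref, field, value });
        }

        void updateMainMemory() {                   // send logged changes to the home nodes
            for (long[] m : log) {
                Node h = cluster[(int) (m[0] & NODE_MASK)];
                h.home.get(m[0]).fields[(int) m[1]] = (int) m[2];
            }
            log.clear();
        }

        void invalidateCache() { cache.clear(); }
    }

    // Returns {value seen remotely, home value before update, home value after update}.
    static int[] demo() {
        Node[] cluster = new Node[2];
        Node n0 = new Node(0, cluster);
        Node n1 = new Node(1, cluster);
        long ref = n0.allocate(1);
        n0.put(ref, 0, 7);                          // local write: not logged
        int seen = n1.get(ref, 0);                  // node 1 caches the object
        n1.put(ref, 0, 42);                         // remote write: logged
        int before = n0.get(ref, 0);
        n1.updateMainMemory();
        int after = n0.get(ref, 0);
        return new int[] { seen, before, after };
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println(r[0] + " " + r[1] + " " + r[2]);   // prints 7 7 42
    }
}
```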
Evaluation
• Minimal-cost map-coloring application.
• Branch-and-bound algorithm.
• 64 threads, each with its own priority queue.
• Current best solution is shared.
• Problem size: 29 eastern-most states of USA
with 4 colors of differing costs.
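For readers unfamiliar with the benchmark, a serial branch-and-bound search for a minimal-cost coloring can be sketched as follows. This is a simplified, single-queue, single-thread illustration on a made-up four-region map, not the evaluated program; the parallel version described above gives each of 64 threads its own priority queue and shares only the current best solution. All names are hypothetical.

```java
import java.util.PriorityQueue;

// Illustrative serial sketch only: not the benchmark used in the evaluation.
class MapColoring {
    // Partial assignment: regions 0..next-1 are colored; cost sums their color costs.
    static final class State implements Comparable<State> {
        final int[] colors;
        final int next;
        final int cost;

        State(int[] colors, int next, int cost) {
            this.colors = colors;
            this.next = next;
            this.cost = cost;
        }

        @Override
        public int compareTo(State other) {
            return Integer.compare(cost, other.cost);   // cheapest partial solution first
        }
    }

    // Returns the minimal total cost, or Integer.MAX_VALUE if no coloring exists.
    static int minCost(boolean[][] adjacent, int[] colorCost) {
        int n = adjacent.length;
        int best = Integer.MAX_VALUE;                   // current best complete solution
        PriorityQueue<State> queue = new PriorityQueue<>();
        queue.add(new State(new int[n], 0, 0));
        while (!queue.isEmpty()) {
            State s = queue.poll();
            if (s.cost >= best) continue;               // bound: cannot improve on best
            if (s.next == n) {
                best = s.cost;
                continue;
            }
            for (int c = 0; c < colorCost.length; c++) { // branch on the next region's color
                if (conflicts(adjacent, s, c)) continue;
                int cost = s.cost + colorCost[c];
                if (cost >= best) continue;
                int[] colors = s.colors.clone();
                colors[s.next] = c;
                queue.add(new State(colors, s.next + 1, cost));
            }
        }
        return best;
    }

    // True if color c for region s.next clashes with an already colored neighbor.
    static boolean conflicts(boolean[][] adjacent, State s, int c) {
        for (int v = 0; v < s.next; v++) {
            if (adjacent[s.next][v] && s.colors[v] == c) return true;
        }
        return false;
    }

    // Builds a symmetric adjacency matrix from a list of borders.
    static boolean[][] graph(int n, int[][] borders) {
        boolean[][] adjacent = new boolean[n][n];
        for (int[] b : borders) {
            adjacent[b[0]][b[1]] = true;
            adjacent[b[1]][b[0]] = true;
        }
        return adjacent;
    }

    public static void main(String[] args) {
        // Made-up map: regions 0, 1, 2 border each other; region 3 borders region 2.
        boolean[][] adjacent = graph(4, new int[][] { {0, 1}, {0, 2}, {1, 2}, {2, 3} });
        System.out.println(minCost(adjacent, new int[] { 1, 2, 3, 4 }));   // prints 7
    }
}
```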
Experimental Setting
• Two Linux 2.2 clusters:
– eight 200 MHz Pentium Pro processors
connected by Myrinet switch and using MPI
over BIP.
– four 450 MHz Pentium II processors
connected by a SCI network and using
SISCI.
• gcc 2.7.2.3 with -O6
Performance Results
[Figure: execution time in seconds (axis 0 to 700) versus number of nodes (1, 2, 4, 8) for the 450MHz/SCI and 200MHz/BIP clusters.]
Parallelizability
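[Figure: parallelizability (axis 0 to 10) versus number of nodes (1, 2, 4, 8) for the 200MHz/BIP and 450MHz/SCI clusters, with the ideal line for comparison.]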
Baseline Performance
• Compared serial Java to serial C for map-coloring application.
• Each program has single queue, single
thread.
Serial Java versus Serial C
[Figure: bar chart of execution time in seconds (axis 0 to 350) for C, Java, Java v2, and Java v3.]
• Java v2: DSM checks disabled
• Java v3: DSM and array-bound checks disabled
• Executing on a single 450 MHz Pentium II
Inline checks are expensive!
• Genericity of DSM-PM2 allows an
alternative implementation.
• Use page-fault detection rather than
inline check to detect non-local object.
Using Page Faults: details
• An object reference is the address of the
base of the object.
• loadIntoCache does nothing.
• DSM-PM2 is used to handle page faults
generated by the get/put primitives.
More details
• When an object is allocated, its address is
appended to a list attached to the page
that contains its header.
• When a page is loaded on a remote node,
the list is used to turn on the header bit for
all object headers on the page.
• The put primitive uses the header bit in the
same manner as inline-check version.
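The per-page object list can be sketched as a small bookkeeping structure. The simulation below is illustrative only: it folds the home side and the remote side into one object, uses a set of addresses in place of real header bits, and assumes 4 kB pages. PageListSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative toy model only: not Hyperion's actual data structures.
class PageListSketch {
    static final long PAGE_SIZE = 4096;             // assumption: 4 kB pages

    // Page number -> addresses of the objects whose header lies on that page.
    private final Map<Long, List<Long>> objectsOnPage = new HashMap<>();
    // Addresses whose header bit is on (stands in for a bit in each object header).
    private final Set<Long> headerBitOn = new HashSet<>();

    // Called at allocation time: append the address to its page's list.
    void recordAllocation(long address) {
        objectsOnPage.computeIfAbsent(address / PAGE_SIZE, p -> new ArrayList<>()).add(address);
    }

    // Called when a page is loaded on a remote node: mark every object on it.
    void onPageLoadedRemotely(long page) {
        for (long address : objectsOnPage.getOrDefault(page, new ArrayList<>())) {
            headerBitOn.add(address);
        }
    }

    // The put primitive consults the header bit to decide whether to log a write.
    boolean shouldLog(long address) {
        return headerBitOn.contains(address);
    }

    public static void main(String[] args) {
        PageListSketch s = new PageListSketch();
        s.recordAllocation(4096);                   // page 1
        s.recordAllocation(4160);                   // page 1
        s.recordAllocation(8192);                   // page 2
        s.onPageLoadedRemotely(1);
        System.out.println(s.shouldLog(4096));      // prints true
        System.out.println(s.shouldLog(4160));      // prints true
        System.out.println(s.shouldLog(8192));      // prints false
    }
}
```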
Inline Check versus Page Fault
• IC has higher overhead for accessing
objects (either local or locally cached).
• PF has higher overhead (signal handling
and memory protection) for loading a
page into the cache.
IC versus PF: serial map-coloring
[Figure: bar chart of execution time in seconds (axis 0 to 350) for C, Java IC, Java PF, Java IC v2, Java PF v2, Java IC v3, and Java PF v3.]
• Java XX v2: DSM checks disabled
• Java XX v3: DSM and array-bound checks disabled
• Executing on a single 450 MHz Pentium II
IC versus PF: parallel map-coloring
[Figure: execution time in seconds (axis 0 to 300) versus number of nodes (1, 2, 4) for IC and PF.]
• Executing on 450MHz/SCI cluster.
Related Work
• Java/MPI: cluster nodes are explicit
• Java/RMI: ditto
• Remote objects via RMI: nearly transparent
– e.g. JavaParty, Do!
• Distributed interpreters
– e.g. Java/DSM, MultiJav, cJVM
Conclusions
• Approach is clean: Java “as is”
• Approach is promising
– good parallelizability for map-coloring
– need better scalar compilation
• e.g. array bound-check removal
– need further parallel application studies
• are thread/object placement heuristics
sufficient for programmers to write efficient
programs?