Optimization of Java-Like
Languages for Parallel and
Distributed Environments
Kathy Yelick
U.C. Berkeley
Computer Science Division
http://www.cs.berkeley.edu/~yelick/talks.html
What this tutorial is about
• Language and compiler support for:
  • Performance
  • Programmability
  • Scalability
  • Portability
• Some of this is specific to the Java language
(not the JVM), but much of it applies to other
parallel languages
Titanium
Titanium will be used as an example
• Based on Java
• Has Java’s syntax, safety, memory management, etc.
• Replaces Java’s thread model with static threads (SPMD)
• Other extensions for performance and parallelism
• Optimizing compiler
• Compiles to C (and from there to executable)
• Synchronization analysis
• Various optimizations
• Portable
• Runs on uniprocessors, shared memory, and clusters
Organization
• Can we use Java for high performance on
  • 1 processor machines?
    • Java commercial compilers on some scientific applications
    • Java the language, compiled to native code (via C)
    • Extensions of Java to improve performance
  • 10-100 processor machines?
  • 1K-10K processor machines?
  • 100K-1M processor machines?
SciMark Benchmark
• Numerical benchmark for Java, C/C++
• Five kernels:
  • FFT (complex, 1D)
  • Successive Over-Relaxation (SOR)
  • Monte Carlo integration (MC)
  • Sparse matrix multiply
  • Dense LU factorization
• Results are reported in Mflops
• Download and run on your machine from:
• http://math.nist.gov/scimark2
• C and Java sources also provided
Roldan Pozo, NIST, http://math.nist.gov/~Rpozo
SciMark: Java vs. C
(Sun UltraSPARC 60)
[Bar chart: Mflops, C vs. Java, for FFT, SOR, MC, Sparse, and LU; y-axis 0-90]
* Sun JDK 1.3 (HotSpot), javac -O; Sun cc -O; SunOS 5.7
Roldan Pozo, NIST, http://math.nist.gov/~Rpozo
SciMark: Java vs. C
(Intel PIII 500MHz, Win98)
[Bar chart: Mflops, C vs. Java, for FFT, SOR, MC, Sparse, and LU; y-axis 0-120]
* Sun JDK 1.2, javac -O; Microsoft VC++ 5.0, cl -O; Win98
Roldan Pozo, NIST, http://math.nist.gov/~Rpozo
Can we do better without the JVM?
• Pure Java with a JVM (and JIT)
• Within 2x of C and sometimes better
• OK for many users, even those using high end
machines
• Depends on quality of both compilers
• We can try to do better using a traditional
compilation model
• E.g., Titanium compiler at Berkeley
• Compiles Java extension to C
• Does not optimize Java arrays or for loops (prototype)
Java Compiled by Titanium Compiler
Performance on a Pentium IV (1.5GHz)
[Bar chart: Mflops (0-450) for Java, C (gcc -O6), Ti, and Ti -nobc on Overall, FFT, SOR, MC, Sparse, and LU]
Java Compiled by Titanium Compiler
Performance on a Sun Ultra 4
[Bar chart: Mflops (0-70) for Java, C, Ti, and Ti -nobc on Overall, FFT, SOR, MC, Sparse, and LU]
Language Support for Performance
• Multidimensional arrays
• Contiguous storage
• Support for sub-array operations without copying
• Support for small objects
• E.g., complex numbers
• Called “immutables” in Titanium
• Sometimes called “value” classes
• Unordered loop construct
• Programmer specifies that iterations are independent
• Eliminates need for dependence analysis – short term solution? Used by vectorizing compilers.
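As a sketch of how these features look in Titanium (a minimal example with hypothetical names; n is assumed to be declared elsewhere):

immutable class Complex {          // "immutable" (value) class: stored unboxed,
  public double real, imag;        // e.g. inside arrays, with no object overhead
  public Complex(double r, double i) { real = r; imag = i; }
}

double [2d] grid = new double [[0, 0] : [n-1, n-1]];  // contiguous 2D Titanium array
foreach (p in grid.domain()) {     // unordered loop: iterations declared independent,
  grid[p] = 2.0 * grid[p];         // so no dependence analysis is needed
}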
HPJ Compiler from IBM
• HPJ Compiler from IBM Research
• Moreira et al.
• Program using Array classes which use
contiguous storage
• e.g. A[i][j] becomes A.get(i,j)
• No new syntax (worse for programming, but better
portability – any Java compiler can be used)
• Compiler for IBM machines, exploits hardware
• e.g., Fused Multiply-Add
• Result: 85+% of Fortran on RS/6000
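The array-class idea can be sketched as follows (a hypothetical Array2D class for illustration only, not IBM's actual Array package):

class Array2D {
  private final double[] data;     // one contiguous block, row-major
  private final int rows, cols;

  Array2D(int rows, int cols) {
    this.rows = rows;
    this.cols = cols;
    this.data = new double[rows * cols];
  }

  // A[i][j] becomes A.get(i, j): plain Java syntax, but the layout is
  // dense and free of the row-pointer indirection of Java's double[][]
  double get(int i, int j) { return data[i * cols + j]; }
  void set(int i, int j, double v) { data[i * cols + j] = v; }
}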
Java vs. Fortran Performance
[Bar chart: Mflops (0-250), Java vs. Fortran, on the MATMULT, BSOM, and SHALLOW benchmarks]
* IBM RS/6000 67MHz POWER2 (266 Mflops peak), AIX Fortran, HPJC
Organization
• Can we use Java for high performance on
  • 1 processor machines?
  • 10-100 processor machines?
    • A correctness model
    • Cycle detection for reordering analysis
    • Synchronization analysis
  • 1K-10K processor machines?
  • 100K-1M processor machines?
Parallel Programming
Parallel programming models and languages are distinguished primarily by:
1. How parallel processes/threads are created
  • Statically at program startup time
    • The SPMD model, 1 thread per processor
  • Dynamically during program execution (as in Java)
    • Through fork statements or other features
2. How the parallel threads communicate
  • Through message passing (send/receive)
  • By reading and writing to shared memory
(Implicit parallelism is not included here.)
Two Problems
• Compiler writers would like to move code
around
• The hardware folks also want to build hardware
that dynamically moves operations around
• When is reordering correct?
• Because the programs are parallel, there are more
restrictions, not fewer
• The reason is that we have to preserve semantics of what
may be viewed by other processors
Sequential Consistency
• Given a set of executions from n processors, each
defines a total order Pi.
• The program order is the partial order given by the union
of these Pi ’s.
• The overall execution is sequentially consistent if there
exists a correct total order that is consistent with the
program order.
Example (two processors):
  P1: write x = 1;  write y = 3;  read x → 1
  P2: read y → 0;   read z → 2;   read y → 3
When this is serialized, the read and write semantics must be preserved.
Sequential Consistency Intuition
• Sequential consistency says that:
• The compiler may only reorder operations if another
processor cannot observe it.
• Writes (to variables that are later read) cannot result in
garbage values being written.
• The program behaves as if processors take turns
executing instructions
• Comments:
• In a legal execution, there will typically be many possible total orders, limited only by the reads and writes to shared variables
• This is what you get if all reads and writes go to a single shared memory, and accesses are serialized at the memory cell
How Can Sequential Consistency Fail?
• The compiler saves a value in a register across multiple
read accesses
• This “moves” the later reads to the point of the first one
• The compiler saves a value in a register across writes
• This “moves” the write until the register is written back from the
standpoint of other processors.
• The compiler performs common subexpression elimination
• As if the later expression reads are all moved to the first
• Once contiguous in the instruction stream, they are merged
• The compiler performs other code motion
• The hardware has a write buffer
• Reads may by-pass writes in the buffer (to/from different variables)
• Some write buffers are not FIFO
• The hardware may have out-of-order execution
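A minimal Java sketch of the first two failure modes (hypothetical producer/consumer code):

class SpinExample {
  static int data = 0, flag = 0;    // shared, not volatile, no locks

  static void producer() {
    data = 1;
    flag = 1;        // compiler or hardware may reorder this with the write to data
  }

  static void consumer() {
    // the compiler may keep flag in a register, effectively moving every
    // later read to the first one, so this loop can spin forever; even if
    // it exits, the read of data may still return the stale value 0
    while (flag == 0) { /* spin */ }
    int d = data;
  }
}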
Weaker Correctness Models
• Many systems use weaker memory models:
• Sun has TSO, PSO, and RMO
• Alpha has its own model
• Some languages do as well
• Java also has its own, currently undergoing redesign
• C spec is mostly silent on threads – very weak on
memory mapped I/O
• These are variants on the following, sequential
consistency under proper synchronization:
• All accesses to shared data must be protected by a lock,
which must be a primitive known to the system
• Otherwise, all bets are off (extreme)
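A minimal Java sketch of what "proper synchronization" means here: every access to the shared data goes through a lock primitive the system knows about (Java's synchronized):

class Counter {
  private int value = 0;                  // shared data, touched only under the lock

  public synchronized void increment() {  // synchronized: a lock known to the system
    value = value + 1;
  }

  public synchronized int read() {
    return value;
  }
}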
Why Don’t Programmers Care?
• If these popular languages have used these weak models
successfully, then what is wrong?
• They don’t worry about what they don’t understand
• Many people use compilers that are not very aggressive about
reordering
• The hardware reordering is non-deterministic, and may happen very
infrequently in practice
• Architecture community is way ahead of us in worrying
about these problems.
• Open problem: A hardware simulator and/or Java (or C)
compiler that reorders things in the “worst possible way”
Using Software to Mask Hardware
• Recall our two problems:
1. Compiler writers would like to move code around
2. The hardware folks also want to build hardware that
dynamically moves operations around
• The second can be viewed as compiler problem
• Weak memory models come with extra primitives, usually called fences or memory barriers
  • Write fence: wait for all outstanding writes from this processor to complete
  • Read fence: do not issue any read pre-fetches before this point
Use of Memory Fences
• Memory fences can turn a particular memory
model into sequential consistency under proper
synchronization:
• Add a read-fence to acquire lock operation
• Add a write fence to release lock operation
• In general, a language can have a stronger model than the machine it runs on, if the compiler is clever
• The language may also have a weaker model, if
the compiler does any optimizations
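Schematically (readFence and writeFence are hypothetical placeholders for whatever barrier primitives the machine provides; this is a sketch, not a real API):

// acquire: take the lock, then fence so later reads are not prefetched earlier
void acquire(Lock l) {
  l.lock();
  readFence();    // reads after this point must not have been issued before it
}

// release: fence so all earlier writes complete, then free the lock
void release(Lock l) {
  writeFence();   // wait for outstanding writes from this processor to complete
  l.unlock();
}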
Aside: Volatile
• Because Java and C have weak memory models
at the language level, they give programmers a
tool: volatile variables
• These variables should not be kept in registers
• Operations should not be reordered
• Should have mem fences around accesses
• General problem
• This is a big hammer which may be unnecessary
• No fine-grained control over particular accesses or
program phases (static notion)
• To get SC using volatile, many variables must be volatile
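For the flag/data idiom, the big-hammer version looks like this (a sketch; under the revised Java memory model the volatile flag also orders the surrounding accesses):

class VolatileFlag {
  static volatile int flag = 0;   // never cached in a register; acts as an ordering point
  static int data = 0;

  static void producer() {
    data = 1;
    flag = 1;
  }

  static void consumer() {
    while (flag == 0) { /* spin */ }
    int d = data;       // sees 1 under the revised model
  }
}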
How Can Compilers Help?
• To implement a stronger model on a weaker one:
• Figure out what can legally be reordered
• Do optimizations under these constraints
• Generate necessary fences in resulting code
• Open problem: Can this be used to give Java a
sequentially consistent semantics?
• What about C?
Compiler Analysis Overview
• When compiling sequential programs, compute
dependencies:
  x = expr1;              y = expr2;
  y = expr2;      ==>     x = expr1;
Valid if y is not in expr1 and x is not in expr2 (roughly)
• When compiling parallel code, we need to consider
accesses by other processors.
Initially flag = data = 0
  Proc A:            Proc B:
  data = 1;          while (flag == 0);
  flag = 1;          ... = ...data...;
Cycle Detection
• Processors define a "program order" on accesses from the same thread
  • P is the union of these total orders
• The memory system defines an "access order" on accesses to the same variable
  • A is the access order (read/write and write/write pairs)
Example (the flag/data code above):
  Proc A: write data → write flag        Proc B: read flag → read data
• A violation of sequential consistency is a cycle in P ∪ A [Shasha & Snir]
Cycle Analysis Intuition
• Definition is based on execution model, which
allows you to answer the question: Was this
execution sequentially consistent?
• Intuition:
• Time cannot flow backwards
• Need to be able to construct total order
• Examples (all variables initially 0)
  Proc A: write data 1      Proc B: read flag 1       → cycle: not sequentially consistent
          write flag 1              read data 0
  Proc A: write data 1      Proc B: read data 1       → no cycle: sequentially consistent
          write flag 1              read flag 0
Cycle Detection Generalization
• Generalizes to arbitrary numbers of variables and
processors
[Figure: a cycle through the accesses write x, read y, write y, read y, write x spread across several processors]
• Cycles may be arbitrarily long, but it is sufficient to
consider only minimal cycles with 1 or 2 consecutive
stops per processor
• Can simplify the analysis by assuming all processors
run a copy of the same code
Static Analysis for Cycle Detection
• Approximate P by the control flow graph
• Approximate A by undirected “conflict” edges
• Bi-directional edge between accesses to the same variable in
which at least one is a write
• It is still correct if the conflict edge set is a superset of the reality
[Figure: two code fragments with the accesses read x, write z, write y, read y, write z; undirected conflict edges connect accesses to the same variable where at least one is a write]
• Let the “delay set” D be all edges from P that are
part of a minimal cycle
• The execution order of D edges must be preserved; other P edges may be reordered (modulo usual rules about serial code)
Cycle Detection in Practice
• Cycle detection was implemented in a prototype
version of the Split-C and Titanium compilers.
• Split-C version used many simplifying assumptions.
• Titanium version had too many conflict edges.
• What is needed to make it practical?
• Finding possibly-concurrent program blocks
• Use SPMD model rather than threads to simplify
• Or apply data race detection work for Java threads
• Compute conflict edges
• Need good alias analysis
• Reduce size by separating shared/private variables
• Synchronization analysis
Synchronization Analysis
• Enrich language with synchronization primitives
• Lock/Unlock or “synchronized” blocks
• Post/Wait or Wait/Notify on condition variables
• Global barriers: all processors wait at barrier
• Compiler can exploit understanding of
synchronization primitives to reduce cycles
• Note: use of language primitives for synchronization may
aid in optimization, but “rolling your own” is still correct
Edge Ordering
• Post-Wait operations on a variable can be ordered
    post c → wait c → …
• Although correct to treat these as shared memory
accesses, we can get leverage by ordering them
• Then turn edges
  • ? → post c into delay edges
  • wait c → ? into delay edges
• And orient corresponding conflict edges
Edge Deletion
• In SPMD programs, the most common form of
synchronization is global barrier
…
barrier
barrier
…
• If we add to the delay set edges of the form
  • ? → barrier
  • barrier → ?
  then we can remove the corresponding conflict edges
Synchronization in Cycle Detection
• Iterative algorithm
• Compute delay set restrictions in which at least one
operation is a synchronization operation
• Perform edge orientation and deletion
• Compute delay set on remaining conflict edges
• Two important details
• For locks (and synchronized) we need good alias
information about the lock variables. (Conservative would
probably work…)
• For barriers, need to line up corresponding barriers
Static Analysis for Barriers
• Lining up barriers is needed for cycle detection.
• Mis-aligned barriers are also a source of bugs
inside branches or loops.
• Includes other global communication primitives: barrier, broadcast, reductions
• Titanium uses barrier analysis, based on the
idea of single variables and methods:
• A “single” method is one called by all procs
public single static void allStep(...)
• A “single” variable has same value on all procs
int single timestep = 0;
Single Analysis
• The underlying requirement is that barriers only
match the same textual instance
• Complication from conditionals:
if (this processor owns some data) {
compute on it
barrier
}
• Hence the use of “single” variables in Titanium
• If a conditional or loop block contains a barrier,
all processors must execute it
• expressions in such loop headers, if statements, etc. must contain only single variables
Single Variable Example in Titanium
• Barriers and single in N-body Simulation
class ParticleSim {
  public static void main (String [] argv) {
    int single allTimestep = 0;
    int single allEndTime = 100;
    for (; allTimestep < allEndTime; allTimestep++) {
      // read all particles and compute forces on mine
      computeForces(…);
      Ti.barrier();
      // write to my particles using new forces
      spreadParticles(…);
      Ti.barrier();
    }
  }
}
• Single methods are automatically inferred, variables not
Some Open Problems
• What is the right semantic model for shared
memory parallel languages?
• Is cycle detection practical on real languages?
  • How well can synchronization be analyzed?
  • Aliases between non-synchronizing variables?
  • Can we distinguish between shared and private data?
  • What is the complexity on real applications?
• Analysis in programs with dynamic thread
creation
Organization
• Can we use Java for high performance on a
  • 1 processor machine?
  • 10-100 processor machine?
  • 1K-10K processor machine?
    • Programming model landscape
    • Global address space language support
    • Optimizing local pointers
    • Optimizing remote pointers
  • 100K-1M processor machine?
Programming Models at Scale
• Large scale machines are mostly
• Clusters of uniprocessors or SMPs
• Some have hardware support for remote memory access
• Shmem in Cray T3E
• GM layer in Myrinet
• DSM on SGI Origin 2K
• Yet most programs are written in:
• SPMD model
• Message passing
• Can we use a simpler, shared memory model?
• On Origin, yes, but what about large machines?
Global Address Space
• To run shared memory programs on distributed memory
hardware, we replace references (pointers) by global
ones:
  • May point to remote data
  • Useful in building large, complex data structures
  • Easy to port shared-memory programs (functionality is correct)
  • Uniform programming model across machines
  • Especially true for clusters of SMPs
• Usual implementation
• Each reference contains:
• Processor id (or process id on cluster of SMPs)
• And a memory address on that processor
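A sketch of what such a reference might carry (an illustration of the idea, not Titanium's actual runtime representation):

class GlobalRef {
  int proc;      // processor id (or process id on a cluster of SMPs)
  long addr;     // memory address within that processor

  boolean isLocal() {              // the check paid on every dereference
    return proc == Ti.thisProc();
  }
}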
Use of Global / Local
• Global pointers are more expensive than local
• When data is remote, it turns into a remote read or write, which is a message of some kind
• When the data is not remote, there is still an overhead
• space (processor number + memory address)
• dereference time (check to see if local)
• Conclusion: not all references should be global
-- use normal references when possible.
Explicit Approach to Global/Local
• A common approach in parallel languages is to
distinguish between local and global (“possibly
remote”) pointers in the language.
• Two variations are:
• Make global the default – nice for porting shared memory
programs
• Make local the default – nice for calling libraries on a
single processor that were built for uniprocessor
• Titanium uses global as the default, with local declarations in important sections
Global Address Space
• Processes allocate locally
• References can be passed
to other processes
class C { int val; ... }
C gv;        // global pointer
C local lv;  // local pointer

if (thisProc() == 0) {
  lv = new C();
}
gv = broadcast lv from 0;
gv.val = ...;
... = gv.val;
[Figure: process 0 and the other processes each have lv and gv pointers and a LOCAL HEAP; after the broadcast, every process's gv points to the object allocated in process 0's local heap, and only process 0's lv is set]
Local Pointer Analysis
• Compiler can infer locals using Local
Qualification Inference
[Bar chart "Effect of LQI": running time (sec) before ("Original") and after LQI for the applications cannon, lu, sample, gsrb, and poisson]
• Data structures must be well partitioned
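A sketch of the kind of code LQI handles (hypothetical fragment; n is assumed to be declared): grid has the default global pointer type but only ever refers to locally allocated data, so the compiler can add the local qualifier and drop the ownership check on every access.

double [1d] grid = new double [0:n-1];   // allocated on this processor
for (int i = 1; i < n - 1; i++) {
  // every access stays inside the locally allocated array, so LQI can
  // treat grid as if it were declared: double [1d] local grid = ...;
  grid[i] = 0.5 * (grid[i-1] + grid[i+1]);
}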
Remote Accesses
• What about remote accesses? In this case, the
cost of the storage and extra check is small
relative to the message cost.
• Strategies for reducing remote accesses:
• Use non-blocking writes – do not wait for them to be performed
• Use prefetching for reads – ask before data is needed
• Aggregate several accesses to the same processor
together
• All of these involve reordering or the potential
for reordering
Communication Optimizations
• Data on an old machine, UCB NOW, using a simple subset of C
[Chart: time (normalized) for the communication optimizations described above]
Example communication costs
• Latency (α) and bandwidth (β) measured in units of flops
• β measured per 8-byte word

Machine               Year    α        β      Mflop rate per proc
CM-5                  1992    1900     20       20
IBM SP-1              1993    5000     32      100
Intel Paragon         1994    1500      2.3     50
IBM SP-2              1994    7000     40      200
Cray T3D (PVM)        1994    1974     28       94
UCB NOW               1996    2880     38      180
UCB Millennium        2000   50000    300      500
SGI Power Challenge   1995    3080     39      308
SUN E6000             1996    1980      9      180
SGI Origin 2K         2000    5000     25      500
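As a worked example of reading this table (assuming the usual cost model in which sending n words takes roughly α + n·β): on the UCB NOW, a message of 100 eight-byte words costs about 2880 + 100 × 38 = 6680 flop-times, so the latency term alone is worth roughly 2880 floating-point operations of lost computation.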
Organization
• Can we use Java for high performance on a
  • 1 processor machine?
  • 10-100 processor machine?
  • 1K-10K processor machine?
  • 100K-1M processor machine?
    • Kinds of machines
    • Open problems
Future Machines
• IBM is building a 1M processor Blue Gene
machine
• Expect a processor failure per day
• Would like to run 1 job for a year
• “The grid” is made from harnessing unused
cycles across the internet
• Need to kill job if owner wants to use the machine
• Frequent failures
• All of our high performance programming
models assume the machine works
Possible Software Model
• System hides some faults at each layer
Over-partitioned applications (Java,Titanium,…)
Uniform machine (dynamic load balancing)
Performance faults (process pairs, checkpoints)
Fail-stop faults
Byzantine faults
• Lower levels send “hints” upward
• Lower level has control, but upper level can optimize
References
• Serial Java performance:
• Roldan Pozo, Jose Moreira et al, Titanium group
• Java memory models
• Bill Pugh, Jaejin Lee
• Cycle analysis
• Dennis Shasha and Marc Snir, Arvind Krishnamurthy and Kathy
Yelick, Jaejin Lee and Sam Midkiff and David Padua
• Synchronization analysis
• Data race detection: many people
• Barriers: Alex Aiken and David Gay
• Global pointers
• See UPC, Split-C, AC, CC++, Titanium, and others
• Local Qualification Inference: Ben Liblit and Alex Aiken
• Non-blocking communication
• Active messages, Global Arrays (PNL), and others
Summary
• Opportunities to improve programmability
• Simplify programmers model (e.g., Java with sequential
consistency)
• Solve harder compiler problem (use it on “the grid”)
• Basic requirements understood, but not
• Usability in practice on real applications
• Interaction with other analyses
• Complexity
• Current and future machines are harder
• More processors, more levels of hierarchy
• Less reliable overall, because of scale
Backup Slides
Outline
• Java-like Languages
• Language support for performance
• Optimizations
• Compilation models
• Parallel
  • Machine models
  • Language models
  • Memory models
  • Analysis
• Distributed
• Remote access
• Failures
Data from Dan
• Origin 2000 (128 CPU configuration):
  • local memory latency: 300 ns
  • remote memory latency: 900 ns avg.
  • bandwidth: 160 MB/sec per CPU
  • CPU: MIPS R10000 195 MHz (390 MFLOPS) or 250 MHz (500 MFLOPS)
  • note: the hardware supports up to 4 outstanding non-blocking references to remote cache lines (SGI obviously agrees with you)
• Millennium cluster:
  • CPU: 4-way Intel P3-700
  • AMUDP performance (100 Mbit half-duplex switched ethernet, kernel UDP driver):
Data from Millennium Home page
• Poweredge 2-way SMPs (500 MHz Pentium IIIs)
running Linux 2.2.5
• Each SMP has a Lanai 7.2 card:
• Round trip time: 32-33 microseconds for small messages
• BW: 59.5 MB/s for 16 KB msgs
• Gap (time between msg sends in steady state): 18-19
microseconds
• Page: Dec 1999
Value of optimizations
Also for I/O (Dan’s stuff)
Parallel Language Landscape
• Two axes (2-d grid)
• Parallelism (control) model
• Static (SPMD)
• Dynamic (threads)
• Communication/Sharing model
• Shared memory
• Global address space
• Message passing
• In the 2-100 processor range, one can buy
shared memory machines
Parallel Language Landscape
• Implicitly parallel (serial semantics)
• Sequential – compiler too hard
• Data parallel – compiler too hard
• Explicitly parallel (parallel semantics)
• OpenMP – compiler too hard (for large machines)
• Threads – the sweet spot
• People use it (java, vector supers)
• Message passing (e.g., MPI) – programming too hard
The Economics of High Performance
• The failure (or delay) of compilers for data parallel languages in the 90s means that most programs for large scale machines are written in MPI
• Programming community is elite
• Many applications with parallelism don't use it, because it's too hard
Backup Slides II
Titanium Group
• Susan Graham
• Katherine Yelick
• Paul Hilfinger
• Phillip Colella (LBNL)
• Alex Aiken
• Greg Balls (SDSC)
• Peter McCorquodale (LBNL)
• Andrew Begel
• Dan Bonachea
• Tyson Condie
• David Gay
• Ben Liblit
• Chang Sun Lin
• Geoff Pike
• Siu Man Yau
Target Problems
• Many modeling problems in astrophysics,
biology, material science, and other areas
require
• Enormous range of spatial and temporal scales
• To solve interesting problems, one needs:
• Adaptive methods
• Large scale parallel machines
• Titanium is designed for methods with
• Structured grids
• Locally-structured grids (AMR)
Common Requirements
• Algorithms for numerical PDE computations are
  • communication intensive
  • memory intensive
• AMR makes these harder
  • more small messages
  • more complex data structures
  • most of the programming effort is debugging the boundary cases
  • locality and load balance trade-off is hard
A Little History
• Most parallel programs are written using
explicit parallelism, either:
• Message passing with a SPMD model
• Usually for scientific applications with C++/Fortran
• Scales easily
• Shared memory with threads in C or Java
• Usually for non-scientific applications
• Easier to program
• Take the best features of both for Titanium
Why Java for Scientific Computing?
• Computational scientists use increasingly
complex models
• Popularized C++ features: classes, overloading, pointer-based data structures
• But C++ is very complicated
• easy to lose performance and readability
• Java is a better C++
• Safe: strongly typed, garbage collected
• Much simpler to implement (research vehicle)
• Industrial interest as well: IBM's High Performance Java (HPJ)
Summary of Features Added to Java
• Multidimensional arrays with iterators
• Immutable ("value") classes
• Templates
• Operator overloading
• Scalable SPMD parallelism
• Global address space
• Checked Synchronization
• Zone-based memory management
• Scientific Libraries
Lecture Outline
• Language and compiler support for
uniprocessor performance
• Immutable classes
• Multidimensional Arrays
• Foreach
• Language support for ease of programming
• Language support for parallel computation
• Applications and application-level libraries
• Summary and future directions
Java: A Cleaner C++
• Java is an object-oriented language
• classes (no standalone functions) with methods
• inheritance between classes
• Documentation on web at java.sun.com
• Syntax similar to C++
class Hello {
  public static void main (String [] argv) {
    System.out.println("Hello, world!");
  }
}
• Safe: strongly typed, auto memory management
• Titanium is (almost) strict superset
Sequential Performance
Ultrasparc:      C/C++/Fortran    Java Arrays    Titanium Arrays    Overhead
DAXPY                 1.4s            6.8s            1.5s              7%
3D multigrid           12s              -              22s             83%
2D multigrid          5.4s              -             6.2s             15%
EM3D                  0.7s            1.8s            1.0s             42%

Pentium II:      C/C++/Fortran    Java Arrays    Titanium Arrays    Overhead
DAXPY                 1.8s              -             2.3s             27%
3D multigrid         23.0s              -            20.0s            -13%
2D multigrid          7.3s              -             5.5s            -25%
EM3D                  1.0s              -             1.6s             60%
Performance results from 98; new IR and optimization
framework almost complete.
Lecture Outline
• Language and compiler support for
uniprocessor performance
• Language support for ease of programming
• Templates
• Operator overloading
Example later
• Language support for parallel computation
• Applications and application-level libraries
• Summary and future directions
Lecture Outline
• Language and compiler support for
uniprocessor performance
• Language support for parallel computation
  • SPMD execution
  • Barriers and single
  • Explicit Communication
  • Implicit Communication (global and local references)
  • More on Single
  • Synchronized methods and blocks (as in Java)
• Applications and application-level libraries
• Summary and future directions
SPMD Execution Model
• Java programs can be run as Titanium, but the
result will be that all processors do all the work
• E.g., parallel hello world
class HelloWorld {
  public static void main (String [] argv) {
    System.out.println("Hello from proc " + Ti.thisProc());
  }
}
• Any non-trivial program will have
communication and synchronization
SPMD Model
• All processors start together and execute same code, but
not in lock-step
• Basic control done using
• Ti.numProcs() total number of processors
• Ti.thisProc() number of executing processor
• Bulk-synchronous style
read all particles and compute forces on mine
Ti.barrier();
write to my particles using new forces
Ti.barrier();
• This is neither message passing nor data-parallel
Explicit Communication: Broadcast
• Broadcast is a one-to-all communication
broadcast <value> from <processor>
• For example:
int count = 0;
int allCount = 0;
if (Ti.thisProc() == 0) count = computeCount();
allCount = broadcast count from 0;
• The processor number in the broadcast must be
single; all constants are single.
• The allCount variable could be declared single.
Example of Data Input
• Same example, but reading from keyboard
• Shows use of Java exceptions
int myCount = 0;
int allCount = 0;
if (Ti.thisProc() == 0)
  try {
    DataInputStream kb = new DataInputStream(System.in);
    myCount = Integer.valueOf(kb.readLine()).intValue();
  } catch (Exception e) {
    System.err.println("Illegal Input");
  }
allCount = broadcast myCount from 0;
Example: A Distributed Data Structure
• Data can be accessed
across processor
boundaries
[Figure: Proc 0 and Proc 1 each hold their own local_grids, while all_grids on each processor holds references to the grids on both processors]
Example: Setting Boundary Conditions
foreach (l in local_grids.domain()) {
  foreach (a in all_grids.domain()) {
    local_grids[l].copy(all_grids[a]);
  }
}
Explicit Communication: Exchange
• To create shared data structures
• each processor builds its own piece
• pieces are exchanged (for object, just exchange pointers)
• Exchange primitive in Titanium
int [1d] single allData;
allData = new int [0:Ti.numProcs()-1];
allData.exchange(Ti.thisProc()*2);
• E.g., on 4 procs, each will have a copy of allData: 0 2 4 6
Building Distributed Structures
• Distributed structures are built with exchange:
class Boxed {
  public Boxed (int j) { val = j; }
  public int val;
}
Object [1d] single allData;
allData = new Object [0:Ti.numProcs()-1];
allData.exchange(new Boxed(Ti.thisProc()));
Distributed Data Structures
• Building distributed arrays:
RectDomain <1> single allProcs = [0:Ti.numProcs()-1];
RectDomain <1> myParticleDomain = [0:myPartCount-1];
Particle [1d] single [1d] allParticle =
  new Particle [allProcs][1d];
Particle [1d] myParticle =
  new Particle [myParticleDomain];
allParticle.exchange(myParticle);
• Now each processor has array of pointers,
one to each processor’s chunk of particles
Lecture Outline
• Language and compiler support for uniprocessor
performance
• Language support for ease of programming
• Language support for parallel computation
• Applications and application-level libraries
  • Gene sequencing application
  • Heart simulation
  • AMR elliptic and hyperbolic solvers
  • Scalable Poisson for infinite domains
  • Several smaller benchmarks: EM3D, MatMul, LU, FFT, Join
• Summary and future directions
Unstructured Mesh Kernel
• EM3D: Relaxation on a 3D unstructured mesh
• Speedup on Ultrasparc SMP
• Simple kernel: mesh not partitioned.
[Chart: em3d speedup (0 to 8) on 1, 2, 4, and 8 processors]
AMR Poisson
• Poisson Solver [Semenzato, Pike, Colella]
  • 3D AMR
  • finite domain
  • variable coefficients
  • multigrid across levels
[Figure: AMR grid hierarchy with Level 0, Level 1, and Level 2]
• Performance of Titanium implementation
  • Sequential multigrid performance +/- 20% of Fortran
  • On a fixed, well-balanced problem of 8 patches, each 72^3
  • parallel speedups of 5.5 on 8 processors
Scalable Poisson Solver
• MLC for Finite-Differences by Balls and Colella
• Poisson equation with infinite boundaries
  • arise in astrophysics, some biological systems, etc.
• Method is scalable
  • Low communication
• Performance on SP2 (shown) and T3E
  • scaled speedups
  • nearly ideal (flat)
• Currently 2D and non-adaptive
[Chart: Time/fine-patch-iter/proc (0 to 1.2) vs. number of processors (1, 4, 16) for problem sizes 129x129/65x65, 129x129/33x33, 257x257/129x129, and 257x257/65x65]
AMR Gas Dynamics
• Developed by McCorquodale and Colella
• Merge with Poisson underway for self-gravity
• 2D Example (3D supported)
  • Mach-10 shock on solid surface at oblique angle
• Future: Self-gravitating gas dynamics package
Distributed Array Libraries
• There are some “standard” distributed array
libraries associated with Titanium
• Hides the details of exchange, indirection within
the data structure, etc.
• Libraries benefit from support for templates
Distributed Array Library Fragment
template <class T, int single arity> public class DistArray {
  RectDomain <arity> single rd;
  T [arity d][arity d] subMatrices;
  RectDomain <arity> [arity d] single subDomains;
  ...
  /* Sets the element at p to value */
  public void set (Point <arity> p, T value) {
    getHomingSubMatrix (p) [p] = value;
  }
}

template DistArray <double, 2> single A =
  new template DistArray <double, 2> ([[0, 0] : [aHeight, aWidth]]);
Immersed Boundary Method (future)
• Immersed boundary method [Peskin,MacQueen]
• Used in heart model, platelets, and others
• Currently uses FFT for Navier-Stokes solver
• Begun effort to move solver and full method into
Titanium
Implementation
• Strategy
• Titanium into C
• Solaris or Posix threads for SMPs
• Lightweight communication for MPPs/Clusters
• Status: Titanium runs on
  • Solaris or Linux SMPs and uniprocessors
  • Berkeley NOW
  • SDSC Tera, SP2, T3E (also NERSC)
  • SP3 port underway
Using Titanium on NPACI Machines
• Send mail to us if you are interested
[email protected]
• Has been installed in individual accounts
• t3e and BH: upgrade needed
• On uniprocessors and SMPs
• available from the Titanium home page
• http://www.cs.berkeley.edu/projects/titanium
• other documentation available as well
Calling Other Languages
• We have built interfaces to
• PETSc : scientific library for finite element applications
• Metis: graph partitioning library
• KeLP: starting work on this
• Two issues with cross-language calls
• accessing Titanium data structures (arrays) from C
• possible because Titanium arrays have same format on inside
• having a common message layer
• Titanium is built on lightweight communication
Future Plans
• Improved compiler optimizations for scalar code
• large loops are currently +/- 20% of Fortran
• working on small loop performance
• Packaged solvers written in Titanium
• Elliptic and hyperbolic solvers, both regular and adaptive
• New application collaboration
• Peskin and McQueen (NYU) with Colella (LBNL)
• Immersed boundary method, currently use for heart
simulation, platelet coagulation, and others
Backup Slides
Example: Domain
• Domains in general are not rectangular
• Built using set operations
  • union, +
  • intersection, *
  • difference, -
• Example is red-black algorithm

Point<2> lb = [0, 0];
Point<2> ub = [6, 4];
RectDomain<2> r = [lb : ub : [2, 2]];
...
Domain<2> red = r + (r + [1, 1]);
foreach (p in red) {
  ...
}

[Figure: r covers (0, 0) to (6, 4) with stride [2, 2]; r + [1, 1] covers (1, 1) to (7, 5); their union is the domain red]
Example using Domains and foreach
• Gauss-Seidel red-black computation in multigrid
void gsrb() {
  boundary (phi);
  for (domain<2> d = red; d != null;
       d = (d == red ? black : null)) {
    foreach (q in d)                     // unordered iteration
      res[q] = ((phi[n(q)] + phi[s(q)] + phi[e(q)] + phi[w(q)]) * 4
                + (phi[ne(q)] + phi[nw(q)] + phi[se(q)] + phi[sw(q)])
                - 20.0*phi[q] - k*rhs[q]) * 0.05;
    foreach (q in d) phi[q] += res[q];
  }
}
Recent Progress in Titanium
• Distributed data structures built with global refs
• communication may be implicit, e.g.: a[j] = a[i].dx;
• use extensively in AMR algorithms
• Runtime layer optimizes
• bulk communication
• bulk I/O
• Runs on
• t3e, SP2, and Tera
• Compiler analysis optimizes
• global references converted to local ones when possible
Consistency Model
• Titanium adopts the Java memory consistency
model
• Roughly: accesses to shared variables that are not synchronized have undefined behavior.
• Use synchronization to control access to shared
variables.
• barriers
• synchronized methods and blocks
Compiler Techniques Outline
• Analysis and Optimization of parallel code
  • Tolerate network latency: Split-C experience
  • Hardware trends and reordering
  • Semantics: sequential consistency
  • Cycle detection: parallel dependence analysis
  • Synchronization analysis: parallel flow analysis
• Summary and future directions
Parallel Optimizations
• Titanium compiler performs parallel
optimizations
• communication overlap and aggregation
• New analyses
• synchronization analysis: the parallel analog to control
flow analysis for serial code [Gay & Aiken]
• shared variable analysis: the parallel analog to
dependence analysis [Krishnamurthy & Yelick]
• local qualification inference: automatically inserts local
qualifiers [Liblit & Aiken]
Split-C Experience: Latency Overlap
• Titanium borrowed ideas from Split-C
• global address space
• SPMD parallelism
• But, Split-C had non-blocking accesses built in to
tolerate network latency on remote read/write
int *global p;
x := *p;            /* get */
*p := 3;            /* put */
sync;               /* wait for my puts/gets */
• Also one-way communication
*p :- x;            /* store */
all_store_sync;     /* wait globally */
• Conclusion: useful, but complicated
Sources of Memory/Comm. Overlap
• Would like compiler to introduce put/get/store.
• Hardware also reorders
  • out-of-order execution
  • write buffered with read by-pass
  • non-FIFO write buffers
  • weak memory models in general
• Software already reorders too
• register allocation
• any code motion
• System provides enforcement primitives
• e.g., memory fence, volatile, etc.
• tend to be heavy weight and have unpredictable performance
• Can the compiler hide all this?
End of Compiling Parallel Code