Optimization of Java-Like Languages for Parallel and Distributed Environments
Kathy Yelick
U.C. Berkeley Computer Science Division
http://www.cs.berkeley.edu/~yelick/talks.html

What this tutorial is about
• Language and compiler support for:
  • Performance
  • Programmability
  • Scalability
  • Portability
• Some of this is specific to the Java language (not the JVM), but much of it applies to other parallel languages

Titanium
• Titanium will be used as an example
• Based on Java
  • Has Java's syntax, safety, memory management, etc.
  • Replaces Java's thread model with static threads (SPMD)
  • Other extensions for performance and parallelism
• Optimizing compiler
  • Compiles to C (and from there to executable)
  • Synchronization analysis
  • Various optimizations
• Portable
  • Runs on uniprocessors, shared memory machines, and clusters

Organization
• Can we use Java for high performance on
  • 1 processor machines?
    • Java commercial compilers on some scientific applications
    • Java the language, compiled to native code (via C)
    • Extensions of Java to improve performance
  • 10-100 processor machines?
  • 1K-10K processor machines?
  • 100K-1M processor machines?

SciMark Benchmark
• Numerical benchmark for Java and C/C++
• Five kernels:
  • FFT (complex, 1D)
  • Successive Over-Relaxation (SOR)
  • Monte Carlo integration (MC)
  • Sparse matrix multiply
  • Dense LU factorization
• Results are reported in Mflops
• Download and run on your machine from http://math.nist.gov/scimark2
  • C and Java sources also provided
Roldan Pozo, NIST, http://math.nist.gov/~Rpozo

SciMark: Java vs. C (Sun UltraSPARC 60)
[Chart: MFlops for C vs. Java on FFT, SOR, MC, Sparse, LU; scale 0-90]
* Sun JDK 1.3 (HotSpot), javac -O; Sun cc -O; SunOS 5.7
Roldan Pozo, NIST, http://math.nist.gov/~Rpozo

SciMark: Java vs. C (Intel PIII 500MHz, Win98)
[Chart: MFlops for C vs. Java on FFT, SOR, MC, Sparse, LU; scale 0-120]
* Sun JDK 1.2, javac -O; Microsoft VC++ 5.0, cl -O; Win98
Roldan Pozo, NIST, http://math.nist.gov/~Rpozo
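To make the comparison concrete, here is a minimal SOR-style relaxation sweep in plain Java – an illustrative sketch in the spirit of the SciMark kernel, not the NIST source (the grid size, iteration count, and relaxation factor are invented):

    // Illustrative sketch: SOR-style relaxation on a 2D grid.
    public class SorSketch {
      // One in-place Gauss-Seidel sweep with over-relaxation factor omega.
      static void sweep(double[][] g, double omega) {
        for (int i = 1; i < g.length - 1; i++)
          for (int j = 1; j < g[i].length - 1; j++) {
            double avg = 0.25 * (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]);
            g[i][j] += omega * (avg - g[i][j]);   // relax toward neighbor average
          }
      }
      public static void main(String[] args) {
        double[][] grid = new double[64][64];
        grid[32][32] = 1.0;                       // a point source
        for (int it = 0; it < 100; it++) sweep(grid, 1.25);
        System.out.println(grid[32][32]);
      }
    }

Note that in standard Java the rows of grid are separate heap objects; the contiguous, truly multidimensional arrays discussed below are one of the extensions Titanium adds.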
Can we do better without the JVM?
• Pure Java with a JVM (and JIT)
  • Within 2x of C and sometimes better
  • OK for many users, even those using high-end machines
  • Depends on the quality of both compilers
• We can try to do better using a traditional compilation model
  • E.g., the Titanium compiler at Berkeley
    • Compiles a Java extension to C
    • Does not optimize Java arrays or for loops (prototype)

Java Compiled by Titanium Compiler
[Chart: MFlops on a Pentium IV (1.5GHz), scale 0-450, comparing java, C (gcc -O6), Ti, and Ti -nobc on Overall, FFT, SOR, MC, Sparse, LU]

Java Compiled by Titanium Compiler
[Chart: MFlops on a Sun Ultra 4, scale 0-70, comparing Java, C, Ti, and Ti -nobc on Overall, FFT, SOR, MC, Sparse, LU]

Language Support for Performance
• Multidimensional arrays
  • Contiguous storage
  • Support for sub-array operations without copying
• Support for small objects
  • E.g., complex numbers
  • Called "immutables" in Titanium
  • Sometimes called "value" classes
• Unordered loop construct
  • Programmer specifies that iterations are independent
  • Eliminates need for dependence analysis – a short-term solution? Used by vectorizing compilers.

HPJ Compiler from IBM
• HPJ Compiler from IBM Research
  • Moreira et al.
• Program using Array classes which use contiguous storage
  • e.g. A[i][j] becomes A.get(i,j)
• No new syntax (worse for programming, but better portability – any Java compiler can be used)
• Compiler for IBM machines, exploits hardware
  • e.g., fused multiply-add
• Result: 85+% of Fortran on the RS/6000

Java vs. Fortran Performance
[Chart: Mflops for Fortran vs. Java on MATMULT, BSOM, SHALLOW; scale 0-250]
* IBM RS/6000 67MHz POWER2 (266 Mflops peak), AIX Fortran, HPJC

Organization
• Can we use Java for high performance on
  • 1 processor machines?
  • 10-100 processor machines?
    • A correctness model
    • Cycle detection for reordering analysis
    • Synchronization analysis
  • 1K-10K processor machines?
  • 100K-1M processor machines?

Parallel Programming
• Parallel programming models and languages are distinguished primarily by:
  1. How parallel processes/threads are created
    • Statically, at program startup time
      • The SPMD model: 1 thread per processor
    • Dynamically, during program execution (as in Java)
      • Through fork statements or other features
  2. How the parallel threads communicate
    • Through message passing (send/receive)
    • By reading and writing to shared memory
• Implicit parallelism is not included here
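To make the two creation models concrete, here is a hedged sketch in plain Java (class and variable names invented): Java's native model forks threads dynamically, while an SPMD program starts a fixed set of threads, one per processor, that all run main() and differ only in their index.

    public class Models {
      public static void main(String[] args) throws InterruptedException {
        // Dynamic creation: a thread forked mid-execution, Java-style.
        Thread worker = new Thread(() -> System.out.println("forked worker"));
        worker.start();
        worker.join();

        // SPMD sketch: an SPMD runtime would launch main() once per
        // processor; code branches only on the thread's index.
        int myProc = 0;     // stand-in for Titanium's Ti.thisProc()
        int numProcs = 1;   // stand-in for Ti.numProcs()
        if (myProc == 0) System.out.println("proc 0 of " + numProcs);
      }
    }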
Two Problems
• Compiler writers would like to move code around
• The hardware folks also want to build hardware that dynamically moves operations around
• When is reordering correct?
  • Because the programs are parallel, there are more restrictions, not fewer
  • The reason is that we have to preserve the semantics of what may be viewed by other processors

Sequential Consistency
• Given a set of executions from n processors, each defines a total order Pi.
• The program order is the partial order given by the union of these Pi's.
• The overall execution is sequentially consistent if there exists a correct total order that is consistent with the program order.

    Processor 1       Processor 2
    write x = 1       read y → 0
    write y = 3       read z → 2
    read x → 1        read y → 3

  When this is serialized, the read and write semantics must be preserved.

Sequential Consistency Intuition
• Sequential consistency says that:
  • The compiler may only reorder operations if another processor cannot observe it.
  • Writes (to variables that are later read) cannot result in garbage values being written.
  • The program behaves as if processors take turns executing instructions.
• Comments:
  • In a legal execution, there will typically be many possible total orders – limited only by the reads and writes to shared variables
  • This is what you get if all reads and writes go to a single shared memory, and accesses are serialized at the memory cell

How Can Sequential Consistency Fail?
• The compiler saves a value in a register across multiple read accesses
  • This "moves" the later reads to the point of the first one
• The compiler saves a value in a register across writes
  • This "moves" the write until the register is written back, from the standpoint of other processors
• The compiler performs common subexpression elimination
  • As if the later expression reads are all moved to the first
  • Once contiguous in the instruction stream, they are merged
• The compiler performs other code motion
• The hardware has a write buffer
  • Reads may bypass writes in the buffer (to/from different variables)
  • Some write buffers are not FIFO
• The hardware may have out-of-order execution
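A minimal sketch of the first failure mode in Java-like code (variable names invented): if the compiler keeps flag in a register, the later reads are effectively moved to the point of the first one, and the loop can spin forever even after another processor sets the flag.

    // Shared: int flag = 0;  -- written to 1 by another processor.

    // What the programmer wrote:
    while (flag == 0) { /* spin */ }

    // What a compiler that registerizes flag effectively produces:
    int r = flag;        // one real read of memory...
    while (r == 0) { }   // ...then an endless loop on the stale register copy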
Weaker Correctness Models
• Many systems use weaker memory models:
  • Sun has TSO, PSO, and RMO
  • Alpha has its own model
• Some languages do as well
  • Java also has its own, currently undergoing redesign
  • The C spec is mostly silent on threads – very weak on memory-mapped I/O
• These are variants on the following, sequential consistency under proper synchronization:
  • All accesses to shared data must be protected by a lock, which must be a primitive known to the system
  • Otherwise, all bets are off (extreme)

Why Don't Programmers Care?
• If these popular languages have used these weak models successfully, then what is wrong?
  • They don't worry about what they don't understand
  • Many people use compilers that are not very aggressive about reordering
  • The hardware reordering is non-deterministic, and may happen very infrequently in practice
• The architecture community is way ahead of us in worrying about these problems.
• Open problem: a hardware simulator and/or Java (or C) compiler that reorders things in the "worst possible way"

Using Software to Mask Hardware
• Recall our two problems:
  1. Compiler writers would like to move code around
  2. The hardware folks also want to build hardware that dynamically moves operations around
• The second can be viewed as a compiler problem
• Weak memory models come with extra primitives, usually called fences or memory barriers
  • Write fence: wait for all outstanding writes from this processor to complete
  • Read fence: do not issue any read pre-fetches before this point

Use of Memory Fences
• Memory fences can turn a particular memory model into sequential consistency under proper synchronization:
  • Add a read fence to the acquire-lock operation
  • Add a write fence to the release-lock operation
• In general, a language can have a stronger model than the machine it runs on, if the compiler is clever
• The language may also have a weaker model, if the compiler does any optimizations

Aside: Volatile
• Because Java and C have weak memory models at the language level, they give programmers a tool: volatile variables
  • These variables should not be kept in registers
  • Operations on them should not be reordered
  • Accesses should be surrounded by memory fences
• General problem
  • This is a big hammer which may be unnecessary
  • No fine-grained control over particular accesses or program phases (a static notion)
  • To get SC using volatile, many variables must be volatile

How Can Compilers Help?
• To implement a stronger model on a weaker one:
  • Figure out what can legally be reordered
  • Do optimizations under these constraints
  • Generate the necessary fences in the resulting code
• Open problem: Can this be used to give Java a sequentially consistent semantics?
  • What about C?

Compiler Analysis Overview
• When compiling sequential programs, compute dependencies:

    x = expr1;          y = expr2;
    y = expr2;    ⇒     x = expr1;

  Valid if y is not in expr1 and x is not in expr2 (roughly)
• When compiling parallel code, we need to consider accesses by other processors:

    Initially flag = data = 0
    Proc A               Proc B
    data = 1;            while (flag == 0);
    flag = 1;            ... = ...data...;

Cycle Detection
• Processors define a "program order" on accesses from the same thread
  • P is the union of these total orders
• The memory system defines an "access order" on accesses to the same variable
  • A is the access order (read/write and write/write pairs)

    Proc A               Proc B
    write data           read flag
    write flag           read data

  (P edges run down each column; A edges connect write flag with read flag and write data with read data)
• A violation of sequential consistency is a cycle in P ∪ A [Shasha & Snir]
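As a concrete reading of the P and A edges, the flag/data program again, annotated in comments (the annotations are explanatory, not code):

    // Shared, initially: data = 0, flag = 0
    // Proc A:                 Proc B:
    //   data = 1;               while (flag == 0) { }   // read flag
    //   flag = 1;               ... = data;             // read data
    //
    // P edges: write data -> write flag (A);  read flag -> read data (B)
    // A edges: write flag <-> read flag;  write data <-> read data
    //
    // The bad outcome (flag read as 1, data read as 0) orients the A edges as
    //   write flag -> read flag  and  read data -> write data,
    // which together with the two P edges forms a cycle in P ∪ A,
    // so no sequentially consistent interleaving produces it.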
Cycle Analysis Intuition
• The definition is based on an execution model, which allows you to answer the question: Was this execution sequentially consistent?
• Intuition:
  • Time cannot flow backwards
  • Need to be able to construct a total order
• Examples (all variables initially 0):

    Proc A            Proc B              Proc A            Proc B
    write data = 1    read flag → 1       write data = 1    read data → 1
    write flag = 1    read data → 0       write flag = 1    read flag → 0
    (not sequentially consistent)         (sequentially consistent)

Cycle Detection Generalization
• Generalizes to arbitrary numbers of variables and processors
  [Figure: a cycle through write x, read y, write y, read y, write x spanning several processors]
• Cycles may be arbitrarily long, but it is sufficient to consider only minimal cycles with 1 or 2 consecutive stops per processor
• Can simplify the analysis by assuming all processors run a copy of the same code

Static Analysis for Cycle Detection
• Approximate P by the control flow graph
• Approximate A by undirected "conflict" edges
  • A bidirectional edge between accesses to the same variable in which at least one is a write
  • It is still correct if the conflict edge set is a superset of the reality
  [Figure: conflict edges connecting read x, write z, write y, read y, write z across two threads]
• Let the "delay set" D be all edges from P that are part of a minimal cycle
  • The execution order of D edges must be preserved; other P edges may be reordered (modulo the usual rules about serial code)

Cycle Detection in Practice
• Cycle detection was implemented in prototype versions of the Split-C and Titanium compilers.
  • The Split-C version used many simplifying assumptions.
  • The Titanium version had too many conflict edges.
• What is needed to make it practical?
  • Finding possibly-concurrent program blocks
    • Use the SPMD model rather than threads to simplify
    • Or apply data race detection work for Java threads
  • Computing conflict edges
    • Need good alias analysis
    • Reduce the size by separating shared and private variables
  • Synchronization analysis

Synchronization Analysis
• Enrich the language with synchronization primitives
  • Lock/Unlock or "synchronized" blocks
  • Post/Wait or Wait/Notify on condition variables
  • Global barriers: all processors wait at the barrier
• The compiler can exploit its understanding of synchronization primitives to reduce cycles
• Note: use of language primitives for synchronization may aid in optimization, but "rolling your own" is still correct

Edge Ordering
• Post and Wait operations on a variable can be ordered:

    post c  ──→  wait c

• Although it is correct to treat these as shared memory accesses, we can get leverage by ordering them
  • Turn edges of the form  x → post c  into delay edges
  • Turn edges of the form  wait c → x  into delay edges
  • And orient the corresponding conflict edges

Edge Deletion
• In SPMD programs, the most common form of synchronization is the global barrier:

    ...            ...
    barrier        barrier
    ...            ...

• If we add to the delay set edges of the form
  • x → barrier
  • barrier → x
  then we can remove the corresponding conflict edges
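A hedged sketch of the payoff in Titanium-style code (compute() and neighbor are invented names): once every access before the barrier is ordered before every access after it, conflict edges that cross the barrier cannot lie on a cycle, so they can be deleted and each phase optimized as if it were serial.

    // All processors execute this SPMD fragment.
    a[Ti.thisProc()] = compute();   // phase 1: write my slot of a shared array

    Ti.barrier();   // delay edges: every access above precedes the barrier,
                    // and the barrier precedes every access below, on all procs

    double x = a[neighbor];         // phase 2: read another processor's slot
    // The write/read conflict on a[] crosses the barrier, so it cannot be
    // part of a minimal cycle; within each phase the compiler may reorder.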
Synchronization in Cycle Detection
• Iterative algorithm
  • Compute delay set restrictions in which at least one operation is a synchronization operation
  • Perform edge orientation and deletion
  • Compute the delay set on the remaining conflict edges
• Two important details
  • For locks (and synchronized) we need good alias information about the lock variables. (Conservative would probably work…)
  • For barriers, we need to line up corresponding barriers

Static Analysis for Barriers
• Lining up barriers is needed for cycle detection.
• Mis-aligned barriers inside branches or loops are also a source of bugs.
• Includes other global communication primitives: barrier, broadcast, reductions
• Titanium uses barrier analysis, based on the idea of single variables and methods:
  • A "single" method is one called by all procs
      public single static void allStep(...)
  • A "single" variable has the same value on all procs
      int single timestep = 0;

Single Analysis
• The underlying requirement is that barriers only match the same textual instance
• Complication from conditionals:

    if (this processor owns some data) {
      compute on it
      barrier
    }

• Hence the use of "single" variables in Titanium
• If a conditional or loop block contains a barrier, all processors must execute it
  • Expressions in such loop headers, if statements, etc. must contain only single variables

Single Variable Example in Titanium
• Barriers and single in an N-body simulation:

    class ParticleSim {
      public static void main (String [] argv) {
        int single allTimestep = 0;
        int single allEndTime = 100;
        for (; allTimestep < allEndTime; allTimestep++) {
          // read all particles and compute forces on mine
          computeForces(…);
          Ti.barrier();
          // write to my particles using new forces
          spreadParticles(…);
          Ti.barrier();
        }
      }
    }

• Single methods are automatically inferred; variables are not
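Conversely, a minimal sketch of the bug this analysis rejects (illustrative code, with an invented helper): the guard is not single-valued, so processors disagree about whether the branch, and therefore the barrier, executes.

    // BROKEN: Ti.thisProc() is not single, so only processor 0 takes
    // this branch and reaches the barrier -- the others never arrive.
    if (Ti.thisProc() == 0) {
      initializeGlobals();   // invented helper
      Ti.barrier();          // deadlock
    }
    // The fix: hoist the barrier out of the conditional, or guard the
    // block with a single-valued expression so all procs agree.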
Some Open Problems
• What is the right semantic model for shared memory parallel languages?
• Is cycle detection practical on real languages?
  • How well can synchronization be analyzed?
  • Aliases between non-synchronizing variables?
  • Can we distinguish between shared and private data?
  • What is the complexity on real applications?
• Analysis in programs with dynamic thread creation

Organization
• Can we use Java for high performance on a
  • 1 processor machine?
  • 10-100 processor machine?
  • 1K-10K processor machine?
    • Programming model landscape
    • Global address space language support
    • Optimizing local pointers
    • Optimizing remote pointers
  • 100K-1M processor machine?

Programming Models at Scale
• Large scale machines are mostly
  • Clusters of uniprocessors or SMPs
  • Some have hardware support for remote memory access
    • Shmem on the Cray T3E
    • GM layer in Myrinet
    • DSM on the SGI Origin 2K
• Yet most programs are written in:
  • The SPMD model
  • Message passing
• Can we use a simpler, shared memory model?
  • On the Origin, yes, but what about large machines?

Global Address Space
• To run shared memory programs on distributed memory hardware, we replace references (pointers) with global ones:
  • May point to remote data
  • Useful in building large, complex data structures
  • Easy to port shared-memory programs (functionality is correct)
  • Uniform programming model across machines
  • Especially true for clusters of SMPs
• Usual implementation: each reference contains
  • A processor id (or process id on a cluster of SMPs)
  • And a memory address on that processor

Use of Global / Local
• Global pointers are more expensive than local ones
  • When the data is remote, a dereference turns into a remote read or write, which is a message of some kind
  • When the data is not remote, there is still an overhead
    • space (processor number + memory address)
    • dereference time (check to see if local)
• Conclusion: not all references should be global – use normal references when possible.

Explicit Approach to Global/Local
• A common approach in parallel languages is to distinguish between local and global ("possibly remote") pointers in the language.
• Two variations are:
  • Make global the default – nice for porting shared memory programs
  • Make local the default – nice for calling libraries on a single processor that were built for a uniprocessor
• Titanium uses global as the default, with local declarations in important sections

Global Address Space
• Processes allocate locally
• References can be passed to other processes

    class C { int val; ... }
    C gv;          // global pointer
    C local lv;    // local pointer
    if (thisProc() == 0) {
      lv = new C();
    }
    gv = broadcast lv from 0;
    gv.val = ...;
    ... = gv.val;

  [Figure: process 0's lv and gv point into its local heap; after the broadcast, every process's gv points at the object in process 0's heap]

Local Pointer Analysis
• The compiler can infer local pointers using Local Qualification Inference (LQI)
  [Chart: running time (sec), original vs. after LQI, for the applications cannon, lu, sample, gsrb, poisson; scale 0-250]
• Data structures must be well partitioned
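A hedged sketch of what the inference buys, in Titanium-style syntax (names invented; the cast is what a programmer could write by hand where LQI cannot prove locality):

    // As written: gp has the global type, so every dereference pays the
    // wide representation plus an "is it local?" check.
    C gp = myLocalPiece();       // actually local, but typed global

    // After LQI (or an explicit, runtime-checked narrowing cast):
    C local lp = (C local) gp;
    lp.val = 42;                 // compiles to an ordinary load/store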
Remote Accesses
• What about remote accesses? In this case, the cost of the storage and the extra check is small relative to the message cost.
• Strategies for reducing remote accesses:
  • Use non-blocking writes – do not wait for them to be performed
  • Use prefetching for reads – ask before the data is needed
  • Aggregate several accesses to the same processor together
• All of these involve reordering, or the potential for reordering

Communication Optimizations
[Chart: time (normalized) with and without communication optimizations; data from an old machine, the UCB NOW, using a simple subset of C]

Example communication costs
• Latency (α) and bandwidth (β) measured in units of flops; β measured per 8-byte word

    Machine               Year    α        β      Mflop rate per proc
    CM-5                  1992    1900     20     20
    IBM SP-1              1993    5000     32     100
    Intel Paragon         1994    1500     2.3    50
    IBM SP-2              1994    7000     40     200
    Cray T3D (PVM)        1994    1974     28     94
    UCB NOW               1996    2880     38     180
    UCB Millennium        2000    50000    300    500
    SGI Power Challenge   1995    3080     39     308
    SUN E6000             1996    1980     9      180
    SGI Origin 2K         2000    5000     25     500

Organization
• Can we use Java for high performance on a
  • 1 processor machine?
  • 10-100 processor machine?
  • 1K-10K processor machine?
  • 100K-1M processor machine?
    • Kinds of machines
    • Open problems

Future Machines
• IBM is building a 1M processor Blue Gene machine
  • Expect a processor failure per day
  • Would like to run 1 job for a year
• "The grid" is made by harnessing unused cycles across the internet
  • Need to kill a job if the owner wants to use the machine
  • Frequent failures
• All of our high performance programming models assume the machine works

Possible Software Model
• The system hides some faults at each layer:
  • Over-partitioned applications (Java, Titanium, …)
  • Uniform machine (dynamic load balancing)
  • Performance faults (process pairs, checkpoints)
  • Fail-stop faults
  • Byzantine faults
• Lower levels send "hints" upward
• The lower level has control, but the upper level can optimize

References
• Serial Java performance: Roldan Pozo, Jose Moreira et al., Titanium group
• Java memory models: Bill Pugh, Jaejin Lee
• Cycle analysis: Dennis Shasha and Marc Snir; Arvind Krishnamurthy and Kathy Yelick; Jaejin Lee, Sam Midkiff, and David Padua
• Synchronization analysis
  • Data race detection: many people
  • Barriers: Alex Aiken and David Gay
• Global pointers: see UPC, Split-C, AC, CC++, Titanium, and others
  • Local Qualification Inference: Ben Liblit and Alex Aiken
• Non-blocking communication: Active Messages, Global Arrays (PNL), and others

Summary
• Opportunities to improve programmability
  • Simplify the programmer's model (e.g., Java with sequential consistency)
  • Solve a harder compiler problem (use it on "the grid")
• Basic requirements are understood, but not
  • Usability in practice on real applications
  • Interaction with other analyses
  • Complexity
• Current and future machines are harder
  • More processors, more levels of hierarchy
  • Less reliable overall, because of scale

Backup Slides
Outline
• Java-like languages
  • Language support for performance
  • Optimizations
  • Compilation models
• Parallel
  • Machine models
  • Language models
  • Memory models
  • Analysis
• Distributed
  • Remote access
  • Failures

Data from Dan
• Origin 2000 (128 CPU configuration):
  • local memory latency: 300 ns
  • remote memory latency: 900 ns
  • avg. bandwidth: 160 MB/sec per CPU
  • CPU: MIPS R10000, 195 MHz (390 MFLOPS) or 250 MHz (500 MFLOPS)
  • note the hardware supports up to 4 outstanding non-blocking references to remote cache lines (SGI obviously agrees with you)
• Millennium cluster:
  • CPU: 4-way Intel P3-700
  • AMUDP performance (100Mbit half-duplex switched ethernet, kernel UDP driver):

Data from Millennium Home page
• PowerEdge 2-way SMPs (500 MHz Pentium IIIs) running Linux 2.2.5
• Each SMP has a Lanai 7.2 card:
  • Round trip time: 32-33 microseconds for small messages
  • BW: 59.5 MB/s for 16 KB msgs
  • Gap (time between msg sends in steady state): 18-19 microseconds
• Page: Dec 1999

Value of optimizations
[Chart: no recoverable data]

Also for I/O (Dan's stuff)
[Chart: no recoverable data]

Parallel Language Landscape
• Two axes (a 2-d grid)
  • Parallelism (control) model
    • Static (SPMD)
    • Dynamic (threads)
  • Communication/sharing model
    • Shared memory
    • Global address space
    • Message passing
• In the 2-100 processor range, one can buy shared memory machines

Parallel Language Landscape
• Implicitly parallel (serial semantics)
  • Sequential – compiler too hard
  • Data parallel – compiler too hard
• Explicitly parallel (parallel semantics)
  • OpenMP – compiler too hard (for large machines)
  • Threads – the sweet spot
    • People use it (Java, vector supers)
  • Message passing (e.g., MPI) – programming too hard

The Economics of High Performance
• The failure (or delay) of compilers for data parallel languages in the 90s means most programs for large scale machines are written in MPI
• The programming community is elite
• Many applications with parallelism don't use it, because it's too hard

Backup Slides II

Titanium Group
• Susan Graham, Katherine Yelick, Paul Hilfinger, Phillip Colella (LBNL), Alex Aiken
• Greg Balls (SDSC), Peter McCorquodale (LBNL)
• Andrew Begel, Dan Bonachea, Tyson Condie, David Gay, Ben Liblit, Chang Sun Lin, Geoff Pike, Siu Man Yau

Target Problems
• Many modeling problems in astrophysics, biology, material science, and other areas require
  • an enormous range of spatial and temporal scales
• To solve interesting problems, one needs:
  • Adaptive methods
  • Large scale parallel machines
• Titanium is designed for methods with
  • Structured grids
  • Locally-structured grids (AMR)
Common Requirements
• Algorithms for numerical PDE computations are
  • communication intensive
  • memory intensive
• AMR makes these harder
  • more small messages
  • more complex data structures
  • most of the programming effort is in the boundary cases
  • the locality and load balance trade-off is hard
  • debugging

A Little History
• Most parallel programs are written using explicit parallelism, either:
  • Message passing with an SPMD model
    • Usually for scientific applications with C++/Fortran
    • Scales easily
  • Shared memory with threads, in C or Java
    • Usually for non-scientific applications
    • Easier to program
• Titanium takes the best features of both

Why Java for Scientific Computing?
• Computational scientists use increasingly complex models
  • Popularized C++ features: classes, overloading, pointer-based data structures
• But C++ is very complicated
  • easy to lose performance and readability
• Java is a better C++
  • Safe: strongly typed, garbage collected
  • Much simpler to implement (research vehicle)
• Industrial interest as well: IBM's HPJ

Summary of Features Added to Java
• Multidimensional arrays with iterators
• Immutable ("value") classes
• Templates
• Operator overloading
• Scalable SPMD parallelism
• Global address space
• Checked synchronization
• Zone-based memory management
• Scientific libraries

Lecture Outline
• Language and compiler support for uniprocessor performance
  • Immutable classes
  • Multidimensional arrays
  • Foreach
• Language support for ease of programming
• Language support for parallel computation
• Applications and application-level libraries
• Summary and future directions

Java: A Cleaner C++
• Java is an object-oriented language
  • classes (no standalone functions) with methods
  • inheritance between classes
• Documentation on the web at java.sun.com
• Syntax similar to C++

    class Hello {
      public static void main (String [] argv) {
        System.out.println("Hello, world!");
      }
    }

• Safe: strongly typed, automatic memory management
• Titanium is an (almost) strict superset
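As a taste of the added features, a hedged sketch of an immutable ("value") class, following Titanium's syntax as best the slides suggest (the Complex class itself is invented for illustration). Because immutables are unboxed, an array of Complex can be laid out as contiguous pairs of doubles rather than as pointers to heap objects:

    // Sketch of a Titanium immutable class: instances behave like values,
    // with no references, no subclassing, and no mutation after construction.
    public immutable class Complex {
      public double real;
      public double imag;
      public Complex(double r, double i) { real = r; imag = i; }
      public Complex add(Complex c) {
        return new Complex(real + c.real, imag + c.imag);
      }
    }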
Sequential Performance
• C/C++/Fortran vs. Titanium arrays (overhead is Titanium relative to C/Fortran):

    Sun Ultrasparc     C/C++/Fortran    Titanium arrays    Overhead
    DAXPY              1.4s             1.5s               7%
    3D multigrid       12s              22s                83%
    2D multigrid       5.4s             6.2s               15%
    EM3D               0.7s             1.0s               42%

    Intel Pentium II   C/C++/Fortran    Titanium arrays    Overhead
    DAXPY              1.8s             2.3s               27%
    3D multigrid       23.0s            20.0s              -13%
    2D multigrid       7.3s             5.5s               -25%
    EM3D               1.0s             1.6s               60%

  (The Java-array column is only partially legible in the source: 6.8s and 1.8s on the Ultrasparc.)
• Performance results are from '98; a new IR and optimization framework is almost complete.

Lecture Outline
• Language and compiler support for uniprocessor performance
• Language support for ease of programming
  • Templates
  • Operator overloading (example later)
• Language support for parallel computation
• Applications and application-level libraries
• Summary and future directions

Lecture Outline
• Language and compiler support for uniprocessor performance
• Language support for parallel computation
  • SPMD execution
  • Barriers and single
  • Explicit communication
  • Implicit communication (global and local references)
  • More on single
  • Synchronized methods and blocks (as in Java)
• Applications and application-level libraries
• Summary and future directions

SPMD Execution Model
• Java programs can be run as Titanium, but the result will be that all processors do all the work
• E.g., parallel hello world:

    class HelloWorld {
      public static void main (String [] argv) {
        System.out.println("Hello from proc " + Ti.thisProc());
      }
    }

• Any non-trivial program will have communication and synchronization

SPMD Model
• All processors start together and execute the same code, but not in lock-step
• Basic control is done using
  • Ti.numProcs() – total number of processors
  • Ti.thisProc() – number of the executing processor
• Bulk-synchronous style:

    read all particles and compute forces on mine
    Ti.barrier();
    write to my particles using new forces
    Ti.barrier();

• This is neither message passing nor data-parallel

Explicit Communication: Broadcast
• Broadcast is a one-to-all communication:

    broadcast <value> from <processor>

• For example:

    int count = 0;
    int allCount = 0;
    if (Ti.thisProc() == 0) count = computeCount();
    allCount = broadcast count from 0;

• The processor number in the broadcast must be single; all constants are single.
• The allCount variable could be declared single.
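Putting the last two slides together, a hedged end-to-end sketch (computeCount() is from the slide; doStep() and the loop are invented for illustration):

    class BroadcastDemo {
      public static void main (String [] argv) {
        // Every processor runs main(); only proc 0 computes the value.
        int count = 0;
        if (Ti.thisProc() == 0) count = computeCount();

        // After the broadcast every processor holds proc 0's value.
        // Declaring the result single records that all procs agree on it,
        // so it may safely control a loop containing barriers.
        int single allCount = broadcast count from 0;

        for (int single i = 0; i < allCount; i++) {
          doStep(i);       // invented per-iteration work
          Ti.barrier();    // same textual barrier on all processors
        }
      }
    }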
Example of Data Input
• Same example, but reading from the keyboard
• Shows use of Java exceptions:

    int myCount = 0;
    int single allCount = 0;
    if (Ti.thisProc() == 0)
      try {
        DataInputStream kb = new DataInputStream(System.in);
        myCount = Integer.valueOf(kb.readLine()).intValue();
      } catch (Exception e) {
        System.err.println("Illegal Input");
      }
    allCount = broadcast myCount from 0;

Example: A Distributed Data Structure
• Data can be accessed across processor boundaries
  [Figure: proc 0 and proc 1 each hold their own local_grids, plus an all_grids array referring to grids on both processors]

Example: Setting Boundary Conditions

    foreach (l in local_grids.domain()) {
      foreach (a in all_grids.domain()) {
        local_grids[l].copy(all_grids[a]);
      }
    }

Explicit Communication: Exchange
• To create shared data structures
  • each processor builds its own piece
  • pieces are exchanged (for objects, just exchange pointers)
• Exchange primitive in Titanium:

    int [1d] single allData;
    allData = new int [0:Ti.numProcs()-1];
    allData.exchange(Ti.thisProc()*2);

• E.g., on 4 procs, each will have a copy of allData: 0 2 4 6

Building Distributed Structures
• Distributed structures are built with exchange:

    class Boxed {
      public Boxed (int j) { val = j; }
      public int val;
    }

    Object [1d] single allData;
    allData = new Object [0:Ti.numProcs()-1];
    allData.exchange(new Boxed(Ti.thisProc()));

Distributed Data Structures
• Building distributed arrays:

    RectDomain <1> single allProcs = [0:Ti.numProcs-1];
    RectDomain <1> myParticleDomain = [0:myPartCount-1];
    Particle [1d] single [1d] allParticle =
        new Particle [allProcs][1d];
    Particle [1d] myParticle =
        new Particle [myParticleDomain];
    allParticle.exchange(myParticle);

• Now each processor has an array of pointers, one to each processor's chunk of particles
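A hedged sketch of how the exchanged array is then used (the loop body and accumulateForce() are invented): every processor can reach every other processor's chunk through allParticle, and dereferencing a non-local entry becomes communication.

    // After the exchange, allParticle[p] refers to processor p's chunk,
    // which is remote for p != Ti.thisProc().
    foreach (p in allParticle.domain()) {
      Particle [1d] chunk = allParticle[p];      // possibly remote reference
      foreach (i in chunk.domain()) {
        accumulateForce(myParticle, chunk[i]);   // read theirs, update mine
      }
    }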
Lecture Outline
• Language and compiler support for uniprocessor performance
• Language support for ease of programming
• Language support for parallel computation
• Applications and application-level libraries
  • Gene sequencing application
  • Heart simulation
  • AMR elliptic and hyperbolic solvers
  • Scalable Poisson for infinite domains
  • Several smaller benchmarks: EM3D, MatMul, LU, FFT, Join
• Summary and future directions

Unstructured Mesh Kernel
• EM3D: relaxation on a 3D unstructured mesh
• Simple kernel: mesh not partitioned
  [Chart: em3d speedup on an Ultrasparc SMP, roughly linear from 1 to 8 processors]

AMR Poisson
• Poisson solver [Semenzato, Pike, Colella]
  • 3D AMR
  • finite domain
  • variable coefficients
  • multigrid across levels (levels 0, 1, 2)
• Performance of the Titanium implementation
  • Sequential multigrid performance +/- 20% of Fortran
  • On a fixed, well-balanced problem of 8 patches, each 72^3:
    • parallel speedups of 5.5 on 8 processors

Scalable Poisson Solver
• MLC for finite differences by Balls and Colella
• Poisson equation with infinite boundaries
  • arises in astrophysics, some biological systems, etc.
• Method is scalable
  • low communication
• Performance on the SP2 (shown) and T3E
  • scaled speedups
  • nearly ideal (flat)
  [Chart: time/fine-patch-iter/proc vs. 1-16 processors for problem sizes 129x129/65x65, 129x129/33x33, 257x257/129x129, 257x257/65x65]
• Currently 2D and non-adaptive

AMR Gas Dynamics
• Developed by McCorquodale and Colella
• Merge with Poisson underway, for self-gravity
• 2D example (3D supported)
  • Mach-10 shock on a solid surface at an oblique angle
• Future: self-gravitating gas dynamics package

Distributed Array Libraries
• There are some "standard" distributed array libraries associated with Titanium
• They hide the details of exchange, indirection within the data structure, etc.
• Libraries benefit from support for templates

Distributed Array Library Fragment

    template <class T, int single arity>
    public class DistArray {
      RectDomain <arity> single rd;
      T [arity d][arity d] subMatrices;
      RectDomain <arity> [arity d] single subDomains;
      ...
      /* Sets the element at p to value */
      public void set (Point <arity> p, T value) {
        getHomingSubMatrix (p) [p] = value;
      }
    }

    template DistArray <double, 2> single A =
        new template DistArray <double, 2> ([[0, 0] : [aHeight, aWidth]]);
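A hedged usage sketch for the fragment above (the point and values are invented): set routes each update to the sub-matrix owning the point, so callers never see the exchange or the indirection, and a write may silently become remote.

    Point<2> p = [3, 7];   // invented point inside [0,0]..[aHeight,aWidth]
    A.set(p, 1.5);         // getHomingSubMatrix picks the owning piece;
                           // a remote write if [3,7] lives on another proc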
Immersed Boundary Method (future)
• Immersed boundary method [Peskin, MacQueen]
• Used in the heart model, platelets, and others
• Currently uses an FFT for the Navier-Stokes solver
• Begun effort to move the solver and the full method into Titanium

Implementation
• Strategy
  • Titanium into C
  • Solaris or Posix threads for SMPs
  • Lightweight communication for MPPs/clusters
• Status: Titanium runs on
  • Solaris or Linux SMPs and uniprocessors
  • Berkeley NOW
  • SDSC Tera, SP2, T3E (also NERSC)
  • SP3 port underway

Using Titanium on NPACI Machines
• Send mail to us if you are interested: [email protected]
• Has been installed in individual accounts
  • t3e and BH: upgrade needed
• On uniprocessors and SMPs
  • available from the Titanium home page
  • http://www.cs.berkeley.edu/projects/titanium
  • other documentation available as well

Calling Other Languages
• We have built interfaces to
  • PETSc: scientific library for finite element applications
  • Metis: graph partitioning library
  • KeLP: starting work on this
• Two issues with cross-language calls
  • accessing Titanium data structures (arrays) from C
    • possible because Titanium arrays have the same format on the inside
  • having a common message layer
    • Titanium is built on lightweight communication

Future Plans
• Improved compiler optimizations for scalar code
  • large loops are currently +/- 20% of Fortran
  • working on small-loop performance
• Packaged solvers written in Titanium
  • elliptic and hyperbolic solvers, both regular and adaptive
• New application collaboration
  • Peskin and McQueen (NYU) with Colella (LBNL)
  • Immersed boundary method, currently used for heart simulation, platelet coagulation, and others

Backup Slides

Example: Domain
• Domains in general are not rectangular
• Built using set operations
  • union, +
  • intersection, *
  • difference, -
• Example is a red-black algorithm:

    Point<2> lb = [0, 0];
    Point<2> ub = [6, 4];
    RectDomain<2> r = [lb : ub : [2, 2]];
    ...
    Domain<2> red = r + (r + [1, 1]);
    foreach (p in red) {
      ...
    }

  [Figure: r covers the even points of (0,0)..(6,4); r + [1,1] covers the odd points of (1,1)..(7,5); red is their union]
Example using Domains and foreach
• Gauss-Seidel red-black computation in multigrid:

    void gsrb() {
      boundary(phi);
      for (domain<2> d = res; d != null;
           d = (d == red ? black : null)) {
        foreach (q in d)   // unordered iteration
          res[q] = ((phi[n(q)] + phi[s(q)] + phi[e(q)] + phi[w(q)])*4
                  + (phi[ne(q)] + phi[nw(q)] + phi[se(q)] + phi[sw(q)])
                  - 20.0*phi[q] - k*rhs[q]) * 0.05;
        foreach (q in d) phi[q] += res[q];
      }
    }

Recent Progress in Titanium
• Distributed data structures built with global refs
  • communication may be implicit, e.g.: a[j] = a[i].dx;
  • used extensively in AMR algorithms
• Runtime layer optimizes
  • bulk communication
  • bulk I/O
• Runs on the t3e, SP2, and Tera
• Compiler analysis optimizes
  • global references are converted to local ones when possible

Consistency Model
• Titanium adopts the Java memory consistency model
• Roughly: accesses to shared variables that are not synchronized have undefined behavior
• Use synchronization to control access to shared variables:
  • barriers
  • synchronized methods and blocks

Compiler Techniques Outline
• Analysis and optimization of parallel code
  • Tolerate network latency: Split-C experience
  • Hardware trends and reordering
  • Semantics: sequential consistency
  • Cycle detection: parallel dependence analysis
  • Synchronization analysis: parallel flow analysis
• Summary and future directions

Parallel Optimizations
• The Titanium compiler performs parallel optimizations
  • communication overlap and aggregation
• New analyses
  • synchronization analysis: the parallel analog of control flow analysis for serial code [Gay & Aiken]
  • shared variable analysis: the parallel analog of dependence analysis [Krishnamurthy & Yelick]
  • local qualification inference: automatically inserts local qualifiers [Liblit & Aiken]

Split-C Experience: Latency Overlap
• Titanium borrowed ideas from Split-C
  • global address space
  • SPMD parallelism
• But Split-C had non-blocking accesses built in, to tolerate network latency on remote read/write:

    int *global p;
    x := *p;           /* get */
    *p := 3;           /* put */
    sync;              /* wait for my puts/gets */

• Also one-way communication:

    *p :- x;           /* store */
    all_store_sync;    /* wait globally */

• Conclusion: useful, but complicated

Sources of Memory/Comm. Overlap
• Would like the compiler to introduce put/get/store.
• Hardware also reorders
  • out-of-order execution
  • write buffers with read by-pass
  • non-FIFO write buffers
  • weak memory models in general
• Software already reorders too
  • register allocation
  • any code motion
• The system provides enforcement primitives
  • e.g., memory fence, volatile, etc.
  • these tend to be heavyweight and have unpredictable performance
• Can the compiler hide all this?

End of Compiling Parallel Code