MPJ meets Gadget
A Java Code for Cosmological Simulations?
Bryan Carpenter
OMII, University of Southampton
Southampton SO17 1BJ, UK
March 24 2006
[email protected]
1
Contents

Java for Scientific Computing – background
MPJ Express overview
The Gadget 2 code
Making a Java version of Gadget
2
Acknowledgements

Work done in collaboration with Mark Baker and Aamir Shafi.
Most of the work was done by the latter.
3
MPJ Background
4
Java History

The Java language grabbed public attention in 1995, with the release of the HotJava experimental Web browser, and its later incorporation into the Netscape browser.
Very suddenly, Java became one of the most important programming languages in the industry.
Within a year or so of its release, some people were suggesting that Java might be good for high performance scientific computing.
• A workshop on Java for Science and Engineering Computation was held at Syracuse University in late 1996 – a precursor of the subsequent Java Grande activities.
5
Java for High Performance Computing?

Various people (Java Grande, 1997-2003, RIP) have argued that in general Java provides a better programming platform than its precursors.
• In the parallel computing world there has been a long history of novel language concepts to support parallelism, multithreading, etc.
• Java seems to incorporate at least some of these ideas, and it has the benefit that it is a “mainstream” language.

In general Java encourages better software engineering and is more portable than, say, Fortran or C.
• There is a huge body of supporting software for Java.
6
Performance

To enable safe execution of code on multiple platforms, Java programs are translated to portable byte code instructions for the Java Virtual Machine (JVM).
Modern JVMs perform compilation from byte code to native machine code on the fly, as the Java program is executed.
• Typical JVMs implement many of the most important kinds of optimization used by compilers for “traditional” programming languages.
• Adaptive compilation may even allow optimizations that are impractical for static compilers, because run-time information is available.
• Quality of compilation is comparable to many C or Fortran compilers.

But Java has safety features that may limit performance.
• Difficulty of managing memory explicitly, exact exception processing, …
7
MPJ Background

MPI was introduced in June 1994 as a standard message passing API for parallel scientific computing:
• Language bindings for C, C++, and Fortran.

The Java Grande Message Passing Workgroup defined Java bindings in 1998.
Previous efforts follow two approaches:
• Pure Java approaches:
  - Remote Method Invocation (RMI),
  - Java Sockets,
• Java Native Interface (JNI) approaches:
  - mpiJava – a Java wrapper to the platform's native MPI.
8
MPJ Goals

Modular design: unify earlier approaches by pluggable communication devices.

Support the high-level functionality of MPI-1.2 in pure Java code, as far as possible:
• Point-to-point communications,
• Collective communications,
• Groups, communicators, and contexts,
• Derived datatypes – buffering API.

Open source distribution:
http://dsg.port.ac.uk/projects/mpj
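
As an illustration of the mpiJava-style point-to-point API that MPJ Express implements, a minimal sketch might look like this (method names follow the mpiJava binding; treat the exact signatures as an assumption rather than a definitive MPJ Express reference):

    import mpi.MPI;

    // Minimal MPJ point-to-point sketch: rank 0 sends an array of doubles to
    // rank 1. Class and method names follow the mpiJava-style binding
    // (mpi.MPI, Comm.Send/Recv); treat exact signatures as an assumption.
    public class HelloMPJ {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();

            double[] buf = new double[4];
            if (rank == 0) {
                buf = new double[] { 1.0, 2.0, 3.0, 4.0 };
                // Send(buffer, offset, count, datatype, destination, tag)
                MPI.COMM_WORLD.Send(buf, 0, buf.length, MPI.DOUBLE, 1, 99);
            } else if (rank == 1) {
                MPI.COMM_WORLD.Recv(buf, 0, buf.length, MPI.DOUBLE, 0, 99);
                System.out.println("Rank 1 received " + buf.length + " doubles");
            }

            MPI.Finalize();
        }
    }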
9
MPJ Design

Layered architecture (top to bottom):
• MPJ API
• MPJ collective communications (high level)
• MPJ point-to-point communications (base level)
• mpjdev (MPJ device level)
• xdev pluggable devices: niodev (Java NIO), smpdev (threads API), gmdev (JNI), and native MPI (via JNI)
• Java Virtual Machine (JVM)
• Hardware (NIC, memory, etc.)
10
MPJ Demos

There are a few existing demos for MPJ, inherited from the mpiJava project, e.g.
• A 2d Fluid Flow example.
• Some Monte Carlo simulations of 2d spin systems from condensed matter physics.

These have nice GUIs, but aren’t very representative of real codes used by computational scientists.
11
MPJ demo: CFD – inviscid flow
12
MPJ demo: Q-state Potts model
13
Gadget
14
Gadget-2

Gadget-2 is a free-software, production code for cosmological N-body (and hydrodynamic) computations.
http://www.mpa-garching.mpg.de/gadget
• Written by Volker Springel, of the Max Planck Institute for Astrophysics, Garching.

It is written in the C language and is already parallelized using MPI.
Versions have been used in various research papers in the astrophysics literature, including the Millennium Simulation.
15
Millennium Simulation

See the paper Simulating the Joint Evolution of Quasars, Galaxies and their Large-Scale Distribution by Springel et al. on the Gadget home page.
Follows the evolution of 10^10 dark matter “particles” from the early Universe (z = 127) to the current day.
Performed on 512 nodes of an IBM p690.
• Used 1 TB of distributed memory.
• 350,000 CPU hours – 28 days elapsed time.
• Floating point performance around 0.2 TFLOPS.

Around 20 TB of total data generated.
16
Dynamics in Gadget

Gadget is “just” simulating the movement of (a lot of) representative particles under the influence of Newton’s law of gravity.
• Plus some hydrodynamic forces, but these don’t affect the dominant Dark Matter.

Co-moving coordinates take account of the expansion of the Universe according to General Relativity, but otherwise it is basically classical mechanics.
A classical N-body problem.
17
Gadget Main Loop

This is a slightly simplified view of the Gadget code:

    … Initialize …
    while (not done) {
        move_particles() ;              // update positions
        domain_Decomposition() ;
        compute_accelerations() ;
        advance_and_find_timesteps() ;  // update velocities
    }

• Most of the interesting work happens in compute_accelerations() (and domain_Decomposition() – see later).
18
Computing Forces

The function compute_accelerations() must compute the forces experienced by all particles.
• In particular it must compute the gravitational force.

Because gravity is a long-range force, every particle affects every other. Naively, the total cost is O(N^2).
• With N ≈ 10^10, this is quite infeasible.

Need some kind of approximation.
• Intuitive approach: when computing the force on particle i, group together particles in selected regions distant from i, and treat the groups as single particles, located at their centres of mass.
19
Barnes-Hut Tree

First divide the cubical region of 3d space into 2^3 = 8 regions, halving each dimension.
For every sub-region that contains any particles, divide again recursively to make an octree, until “leaf” regions have at most one particle.
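
A minimal sketch of this recursive construction in Java (the class and field names here – Particle, Node, insert – are invented for illustration and are not Gadget’s data structures):

    // Illustrative Barnes-Hut octree; the class and field names are invented
    // for this sketch and do not correspond to Gadget's data structures.
    class Particle {
        double x, y, z, mass;
        long key;            // space-filling-curve key, used later in the domain decomposition sketch
    }

    class Node {
        double cx, cy, cz, half;        // cube centre and half-width of this region
        double totalMass, mx, my, mz;   // total mass and centre of mass, filled in by a separate pass (not shown)
        Particle particle;              // the single particle in a leaf, or null
        Node[] children;                // 8 octants, allocated when the node is subdivided

        Node(double cx, double cy, double cz, double half) {
            this.cx = cx; this.cy = cy; this.cz = cz; this.half = half;
        }

        // Insert a particle, subdividing until each leaf holds at most one particle.
        // (Coincident particles would recurse forever; a real code guards against that.)
        void insert(Particle p) {
            if (children == null && particle == null) {   // empty leaf
                particle = p;
                return;
            }
            if (children == null) {                       // occupied leaf: subdivide it
                children = new Node[8];
                Particle old = particle;
                particle = null;
                childFor(old).insert(old);
            }
            childFor(p).insert(p);                        // descend into the right octant
        }

        // Pick (and lazily create) the octant – one of 2^3 = 8 – containing p.
        private Node childFor(Particle p) {
            int i = (p.x > cx ? 1 : 0) | (p.y > cy ? 2 : 0) | (p.z > cz ? 4 : 0);
            if (children[i] == null) {
                double q = half / 2;
                children[i] = new Node(cx + (((i & 1) != 0) ? q : -q),
                                       cy + (((i & 2) != 0) ? q : -q),
                                       cz + (((i & 4) != 0) ? q : -q), q);
            }
            return children[i];
        }
    }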
20
Two Dimensional Example

Picture borrowed from www.cs.berkeley.edu/~demmel/cs267/lecture26/lecture26.html
21
Barnes-Hut Force Computation

To compute the force on a particle i, traverse the tree starting from the root:
• if a node n is “distant from” i, just add the contribution to the force on i from the centre of mass of n – no need to visit the children of n;
• if a node n is “close to” i, visit the children of n and recurse.

Everything hinges on the definition of distant from/close to.
• The basic idea is that a node representing some region of space is distant from a particle i if the angle θ it subtends at i is smaller than a threshold opening angle.

[Diagram: node n subtending the opening angle θ at particle i]
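
A sketch of this walk, reusing the illustrative Node and Particle classes from the previous sketch (Gadget’s softening and cell-opening criteria are more elaborate; this only shows the basic recursion):

    // Illustrative tree walk with an opening-angle test, reusing Node/Particle
    // from the sketch above (including the cached totalMass and centre of mass
    // mx, my, mz). Gadget's softening and opening criteria are more elaborate.
    class TreeWalk {
        static final double THETA = 0.5;    // opening-angle threshold
        static final double G = 6.674e-11;  // gravitational constant (SI units)

        // Returns the force vector on particle p due to the subtree rooted at n.
        static double[] force(Node n, Particle p) {
            if (n == null || n.totalMass == 0) return new double[3];
            double dx = n.mx - p.x, dy = n.my - p.y, dz = n.mz - p.z;
            double r = Math.sqrt(dx * dx + dy * dy + dz * dz);
            if (r == 0) return new double[3];             // skip self-interaction

            // "Distant": the cell's angular size (width / distance) is below THETA,
            // or it is a leaf – treat the node as a point mass at its centre of mass.
            if (n.children == null || (2 * n.half) / r < THETA) {
                double f = G * n.totalMass * p.mass / (r * r * r);
                return new double[] { f * dx, f * dy, f * dz };
            }

            // "Close": open the node and recurse into its children.
            double[] total = new double[3];
            for (Node child : n.children) {
                double[] fc = force(child, p);
                total[0] += fc[0]; total[1] += fc[1]; total[2] += fc[2];
            }
            return total;
        }
    }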
22
Complexity

On average, the number of nodes “opened” to compute the force on i is O(log N), as opposed to visiting O(N) particles in the naïve algorithm.
A huge win when N ≈ 10^10.
23
Domain Decomposition

Need to divide space and/or the particle set into domains, where each domain is handled by a single processor.
The problem is that we can’t just divide space evenly, because some regions will have many more particles than others – poor load balancing.
Conversely, we can’t just divide the particles evenly, because particles move through space, and we want to keep physically close particles on the same processor, as far as practical – a communication problem.
24
Peano-Hilbert Curve

Warren and Salmon originally suggested using a space-filling curve:

Picture borrowed from
http://www.mpa-garching.mpg.de/gadget/gadget2-paper.pdf
25
Peano-Hilbert Key

Gadget applies the recursion 20 times, logically dividing space into up to 2^20 × 2^20 × 2^20 cells on the Peano-Hilbert curve.
We can then label each cell by its location along the Peano-Hilbert curve – the 2^60 possible locations comfortably fit into a 64-bit word.
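
The true Peano-Hilbert ordering rotates the octants at every level of the recursion; as a simpler stand-in that shows how three 20-bit cell coordinates pack into a single 60-bit key, here is a Morton (Z-order) interleaving sketch (illustrative only – Gadget itself computes genuine Peano-Hilbert keys):

    // Illustration only: Morton (Z-order) interleaving of three 20-bit cell
    // coordinates into a 60-bit key. Gadget uses the Peano-Hilbert ordering,
    // which additionally rotates the octants at every level of the recursion.
    class PHKey {
        static long mortonKey(int ix, int iy, int iz) {
            long key = 0;
            for (int bit = 19; bit >= 0; bit--) {          // 20 levels of recursion
                key = (key << 3)
                    | (((long) (ix >> bit) & 1) << 2)
                    | (((long) (iy >> bit) & 1) << 1)
                    |  ((long) (iz >> bit) & 1);
            }
            return key;                                    // fits in the low 60 bits of a long
        }
    }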
26
Decomposition based on P-H Curve

Because the number of cells >> the number of particles, segments of the linear Peano-Hilbert curve are sparsely populated.
Sort the particles by their Peano-Hilbert key, then divide them evenly into P domains (see the sketch after this list).
• Intuitively – stretch out the P-H curve with the particles dotted along it; segment it into P parts where each part has the same number of particles.

Characteristics of this decomposition:
• Good load balancing.
• Domains are simply connected and quite “compact” in real space, because particles that are close along the P-H curve are close in real space (the converse is often but not always true).
• Domains have a relatively simple mapping to BH octree nodes.
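
In serial form the splitting step itself is simple; here is a sketch, assuming each particle already carries its precomputed key and ignoring the distributed nature of Gadget’s (approximate) sort:

    import java.util.Arrays;
    import java.util.Comparator;

    // Sketch of the splitting step in serial form: sort the particles by their
    // space-filling-curve key and cut the sorted list into P equal pieces.
    // Gadget instead performs an approximate *distributed* sort.
    class DomainSplit {
        // Returns start indices: domain d owns particles[start[d] .. start[d+1]-1].
        static int[] domainStarts(Particle[] particles, int P) {
            Arrays.sort(particles, Comparator.comparingLong(p -> p.key));
            int[] start = new int[P + 1];
            for (int d = 0; d <= P; d++) {
                start[d] = (int) ((long) d * particles.length / P);
            }
            return start;
        }
    }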
27
Distribution of BH Tree in Gadget

Ibid.
28
Distributed Representation of Tree

Every processor holds a copy of the root nodes, and a copy of all child nodes down to the point where all particles of a node are held on a single remote processor. Remotely held nodes are called pseudo-particles.
To compute the force on a single local target particle, traverse the tree from the root as usual, and accumulate contributions from locally held particles.
Build an export list containing the target particle and the hosts of pseudo-particles encountered in the walk (a rough sketch follows below).
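
A rough, purely illustrative sketch of the kind of bookkeeping involved (the entry type and the way the owning rank is discovered are invented, not Gadget’s actual structures):

    import java.util.ArrayList;
    import java.util.List;

    // Purely illustrative bookkeeping for the export list; the entry type and
    // the ownership lookup are invented here and are not Gadget's.
    class ExportList {
        static class Entry {
            final int targetIndex;   // index of the local target particle
            final int ownerRank;     // rank owning the pseudo-particle's region
            Entry(int targetIndex, int ownerRank) {
                this.targetIndex = targetIndex;
                this.ownerRank = ownerRank;
            }
        }

        final List<Entry> entries = new ArrayList<>();

        // Called from the tree walk whenever a pseudo-particle node is reached.
        void noteExport(int targetIndex, int ownerRank) {
            entries.add(new Entry(targetIndex, ownerRank));
        }
    }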
29
Communication

After the local computation for all target particles, process the export list and send the list of local target particles to all hosts that own pseudo-particle nodes needed for those particles.
All processors do another tree walk to compute their contributions to remotely owned (from their point of view) target particles.
These contributions are returned to the original processor, and added into the accelerations for the target particles.
30
Other Communications

Notably:
• The Domain Decomposition itself requires (in principle) a distributed sort of the particle list.
• Gadget approximates this sort, but it still involves fairly intricate communications.

In Gadget all communication uses the standard MPI library. It makes fairly extensive use of collective communication routines.
31
Java Gadget
32
History

MPJ Express released 2005.
Realized we needed a realistic exemplar code: both as a demo, and to drive further improvements of our software.
First thoughts that an N-body simulation code would make sense – Summer 2005.
Learned about Gadget in December of that year.
Effort to make a Java version of Gadget started seriously in early February 2006 – expecting many months of work.
Non-trivial simulations running in March.
33
Gadget Code Statistics

Public distribution is around 17,000 lines of C, distributed over about 30 source files.
Dependencies on:
• MPI library for parallelization.
• GNU Scientific Library (but only a handful of functions).
• FFTW – library for parallel Fourier transforms.

Comes with a set of initial conditions files that we can use to test our code, including “colliding galaxies” and “cluster formation”.
34
Translation of Gadget Code

Manually translated, but in the first cut we deliberately keep the Java code as similar to the C as possible.
• e.g. nearly all of Java Gadget is currently implemented as static methods on one enormous class called Main.
• Small supporting classes correspond to the structs of the C code.

Currently not supporting a number of features, including:
• PMTree algorithm (an optimization using Fourier Transforms on a mesh to compute the long-range part of the gravitational force),
• Periodic Boundary Conditions,
• Multiple Initial Conditions files,
• COMPUTE_POTENTIAL_ENERGY, ISOTHERM_EQS, FLEXSTEPS, OUTPUTPOTENTIAL, FORCETEST, MAKEGLASS, PSEUDOSYMMETRIC, …

Not quite complete in a few other ways (“magic numbers”).
35
Handling Dependencies

Replace MPI calls with MPJ calls (!)
FFTW is not needed, because we disable the PMTree algorithm.
The few GSL functions that were needed (numerical quadrature, etc.) were hand-translated to Java.
36
Test Cases

Restrictions aside, we have successfully run the Colliding Galaxies and Cluster Formation example simulations on 2, 4 and 8 processors.
• These use pure Dark Matter – the hydrodynamics code is not yet tested.

Results are indistinguishable from running the original Gadget.
37
Colliding Galaxies
38
Cluster Formation
39
Custom serialization in Java Gadget-2 – Motivation

Java’s built-in object serialization can have a detrimental effect on parallel application performance.
Experiment:
• Compare the performance of sending
  - Byte arrays of 1, 1K, and 1M elements,
  - Object arrays of 1, 1K, and 1M elements, where each object contains exactly one byte.
• Both tests communicate the same amount of data.

Message size   Latency (MPI.BYTE)   Latency (MPI.OBJECT)   Bytes communicated in object case
1              167 us               282 us                 72 (overhead: 71 bytes)
1K             512 us               7745 us                ~17K (overhead: 16K bytes)
1M             .09 milliseconds     10.85 seconds          ~17M (overhead: 16M bytes)
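
A sketch of the kind of comparison involved, using mpiJava-style calls (exact MPJ Express signatures and a proper ping-pong latency measurement are left aside; this is an illustration under those assumptions, not the actual benchmark code):

    import mpi.MPI;

    // Sketch: send N raw bytes with MPI.BYTE versus N one-byte objects with
    // MPI.OBJECT (which goes through Java object serialization). mpiJava-style
    // API assumed; a real latency measurement would use ping-pong round trips.
    public class SerializationBenchmark {
        static class OneByte implements java.io.Serializable { byte b; }

        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            final int N = 1024;                            // elements per message

            byte[] raw = new byte[N];
            OneByte[] boxed = new OneByte[N];
            for (int i = 0; i < N; i++) boxed[i] = new OneByte();

            if (rank == 0) {
                long t0 = System.nanoTime();
                MPI.COMM_WORLD.Send(raw, 0, N, MPI.BYTE, 1, 0);
                long t1 = System.nanoTime();
                MPI.COMM_WORLD.Send(boxed, 0, N, MPI.OBJECT, 1, 1);
                long t2 = System.nanoTime();
                System.out.println("MPI.BYTE:   " + (t1 - t0) / 1000 + " us");
                System.out.println("MPI.OBJECT: " + (t2 - t1) / 1000 + " us");
            } else if (rank == 1) {
                MPI.COMM_WORLD.Recv(raw, 0, N, MPI.BYTE, 0, 0);
                MPI.COMM_WORLD.Recv(boxed, 0, N, MPI.OBJECT, 0, 1);
            }
            MPI.Finalize();
        }
    }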
40
Custom serialization in Java Gadget-2

In the original C Gadget-2, initial conditions are read into arrays of C structs called ParticleData and SphParticleData.
• Particles that need to be exported are copied to a contiguous memory region called CommBuffer.

In the Java version, ParticleData and SphParticleData are object arrays.
• CommBuffer is a contiguous memory region, implemented as an instance of the ByteBuffer class.
• Each primitive field of the object arrays is copied to (packed into) the CommBuffer by the sender and copied from (unpacked) it by the receiver (a sketch follows below).
  - This avoids the overheads of Java’s object serialization.
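
A hedged sketch of the packing idea (the field layout and class name are invented for illustration, not taken from the Java Gadget code):

    import java.nio.ByteBuffer;

    // Illustrative only: packing a particle's fields into a ByteBuffer before
    // communication, and unpacking on the receiving side, avoiding Java object
    // serialization. The field layout (position, velocity, mass) is invented
    // for this sketch and is not Java Gadget's actual CommBuffer layout.
    class CommBufferPacking {
        static void pack(ByteBuffer commBuffer, double[] pos, double[] vel, double mass) {
            for (double x : pos) commBuffer.putDouble(x);   // 3 position components
            for (double v : vel) commBuffer.putDouble(v);   // 3 velocity components
            commBuffer.putDouble(mass);
        }

        static double[] unpack(ByteBuffer commBuffer) {
            double[] fields = new double[7];                // 3 pos + 3 vel + mass
            for (int i = 0; i < fields.length; i++) {
                fields[i] = commBuffer.getDouble();
            }
            return fields;
        }
    }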
41
Initial Benchmarks

Here we are running the simulations defined by the initial conditions files distributed with Gadget.
The numbers of particles are relatively modest, so parallel speedups are less than perfect.
42
43
Interpretation

These are first-cut results – comparing the production C Gadget with a mostly unoptimized Java code.
In this version the Java code appears to be roughly a factor of two slower than C.
It is fairly clear that Java memory usage is very inefficient, and this probably has an adverse effect on numerical performance (poor caching).
Need to investigate whether the Java code can be rewritten to make better use of memory.
44
Possible enhancements to the MPJ API

Support for sending a basic datatype.
• For example, if we want to send a single integer, it currently has to be part of, or copied to, an integer array before communication (see the sketch after this list).

Support for sending from and receiving into a ByteBuffer.
• Our buffering API in MPJ Express allows efficient reuse of ByteBuffers.
• Application developers can get a chunk of memory (a ByteBuffer) using the buffering API, copy their data into it, and use it for communication.
  - MPJ could then use this ByteBuffer directly for the socket send.
• No additional copying.
• Can help reduce latency.
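
For instance, the current workaround for a single value, together with what a hypothetical enhanced call might look like (the enhanced forms do not exist in MPJ Express; an mpiJava-style signature is assumed for the real call, and the fragment below is not a complete program):

    // Current workaround (mpiJava-style call assumed): a single int has to be
    // wrapped in a one-element array before it can be sent.
    int nLocal = 42;                               // illustrative value
    int[] tmp = new int[] { nLocal };              // copy into a 1-element array
    MPI.COMM_WORLD.Send(tmp, 0, 1, MPI.INT, 1 /* dest */, 0 /* tag */);

    // Hypothetical enhanced calls (these do NOT exist in MPJ Express today):
    // MPI.COMM_WORLD.Send(nLocal, MPI.INT, 1, 0);   // send a bare value
    // MPI.COMM_WORLD.Send(byteBuffer, 1, 0);        // send straight from a ByteBuffer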
45