MPJ meets Gadget
A Java Code for Cosmological Simulations?
Bryan Carpenter
OMII, University of Southampton
Southampton SO17 1BJ, UK
March 24 2006
[email protected]
1
Contents

Java for Scientific Computing – background
MPJ Express overview
The Gadget 2 code
Making a Java version of Gadget
2
Acknowledgements

Work done in collaboration with Mark Baker and Aamir Shafi.
Most of the work was done by the latter.
3
MPJ Background
4
Java History

The Java language grabbed public attention in 1995, with the release of the HotJava experimental Web browser, and its later incorporation into the Netscape browser.
Very suddenly, Java became one of the most important programming languages in the industry.
Within a year or so of its release, some people were suggesting that Java might be good for high performance scientific computing.
• A workshop on Java for Science and Engineering Computation was held at Syracuse University in late 1996 – a precursor of the subsequent Java Grande activities.
5
Java for High Performance Computing?

Various people (Java Grande, 1997-2003, RIP) have argued that in general Java provides a better programming platform than its precursors.
• In the parallel computing world there has been a long history of novel language concepts to support parallelism, multithreading, etc.
• Java seems to incorporate at least some of these ideas, and it has the benefit that it is a “mainstream” language.

In general Java encourages better software engineering and is more portable than, say, Fortran or C.
• There is a huge body of supporting software for Java.
6
Performance

To enable safe execution of code on multiple platforms, Java programs are translated to portable byte code instructions for the Java Virtual Machine (JVM).
Modern JVMs perform compilation from byte code to native machine code on the fly, as the Java program is executed.
• Typical JVMs implement many of the most important kinds of optimization used by compilers for “traditional” programming languages.
• Adaptive compilation may even allow optimizations that are impractical for static compilers, because run-time information is available.
• Quality of compilation is comparable to many C or Fortran compilers.

But Java has safety features that may limit performance.
• Difficulty of managing memory explicitly, exact exception processing, …
7
MPJ Background

MPI was introduced in June 1994 as a standard message passing API for parallel scientific computing:
• Language bindings for C, C++, and Fortran.

The Java Grande Message Passing Workgroup defined Java bindings in 1998.
Previous efforts follow two approaches:
• Pure Java approaches:
  - Remote Method Invocation (RMI),
  - Java Sockets,
• Java Native Interface (JNI) approaches:
  - mpiJava – a Java wrapper to the platform's native MPI.
8
MPJ Goals

Modular design: unify earlier approaches by pluggable communication devices.

Support the high-level functionality of MPI-1.2 in pure Java code, as far as possible:
• Point-to-point communications,
• Collective communications,
• Groups, communicators, and contexts,
• Derived datatypes – buffering API.

Open source distribution:
http://dsg.port.ac.uk/projects/mpj
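
As an illustration of the mpiJava-style point-to-point API that MPJ Express implements, a minimal sketch might look like this (method names follow the mpiJava binding; treat the exact signatures as an assumption rather than a definitive MPJ Express reference):

    import mpi.MPI;

    // Minimal MPJ point-to-point sketch: rank 0 sends an array of doubles to
    // rank 1. Class and method names follow the mpiJava-style binding
    // (mpi.MPI, Comm.Send/Recv); treat exact signatures as an assumption.
    public class HelloMPJ {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();

            double[] buf = new double[4];
            if (rank == 0) {
                buf = new double[] { 1.0, 2.0, 3.0, 4.0 };
                // Send(buffer, offset, count, datatype, destination, tag)
                MPI.COMM_WORLD.Send(buf, 0, buf.length, MPI.DOUBLE, 1, 99);
            } else if (rank == 1) {
                MPI.COMM_WORLD.Recv(buf, 0, buf.length, MPI.DOUBLE, 0, 99);
                System.out.println("Rank 1 received " + buf.length + " doubles");
            }

            MPI.Finalize();
        }
    }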
9
MPJ Design

Layered architecture (top to bottom):
• MPJ API
• MPJ collective communications (high level)
• MPJ point-to-point communications (base level)
• mpjdev (MPJ device level)
• xdev pluggable devices: niodev (Java NIO), smpdev (threads API), gmdev (JNI), and native MPI (via JNI)
• Java Virtual Machine (JVM)
• Hardware (NIC, memory, etc.)
10
MPJ Demos

There are a few existing demos for MPJ, inherited from the mpiJava project, e.g.
• A 2d Fluid Flow example.
• Some Monte Carlo simulations of 2d spin systems from condensed matter physics.

These have nice GUIs, but aren’t very representative of real codes used by computational scientists.
11
MPJ demo: CFD – inviscid flow
12
MPJ demo: Q-state Potts model
13
Gadget
14
Gadget-2

Gadget-2 is a free-software, production code for cosmological N-body (and hydrodynamic) computations.
http://www.mpa-garching.mpg.de/gadget
• Written by Volker Springel, of the Max Planck Institute for Astrophysics, Garching.

It is written in the C language and is already parallelized using MPI.
Versions have been used in various research papers in the astrophysics literature, including the Millennium Simulation.
15
Millennium Simulation

See the paper Simulating the Joint Evolution of Quasars, Galaxies and their Large-Scale Distribution by Springel et al. on the Gadget home page.
Follows the evolution of 10^10 dark matter “particles” from the early Universe (z = 127) to the current day.
Performed on 512 nodes of an IBM p690.
• Used 1 TB of distributed memory.
• 350,000 CPU hours – 28 days elapsed time.
• Floating point performance around 0.2 TFLOPS.

Around 20 TB of total data generated.
16
Dynamics in Gadget

Gadget is “just” simulating the movement of (a lot of) representative particles under the influence of Newton’s law of gravity.
• Plus some hydrodynamic forces, but these don’t affect the dominant Dark Matter.

Co-moving coordinates take account of the expansion of the Universe according to General Relativity, but otherwise it is basically classical mechanics.
A classical N-body problem.
17
Gadget Main Loop

This is a slightly simplified view of the Gadget code:

    … Initialize …
    while (not done) {
        move_particles() ;              // update positions
        domain_Decomposition() ;
        compute_accelerations() ;
        advance_and_find_timesteps() ;  // update velocities
    }

• Most of the interesting work happens in compute_accelerations() (and domain_Decomposition() – see later).
18
Computing Forces

The function compute_accelerations() must compute the forces experienced by all particles.
• In particular it must compute the gravitational force.

Because gravity is a long-range force, every particle affects every other. Naively, the total cost is O(N^2).
• With N ≈ 10^10, this is quite infeasible.

Need some kind of approximation.
• Intuitive approach: when computing the force on particle i, group together particles in selected regions distant from i, and treat the groups as single particles, located at their centres of mass.
19
Barnes-Hut Tree

First divide the cubical region of 3d space into 2^3 = 8 regions, halving each dimension.
For every sub-region that contains any particles, divide again recursively to make an octree, until “leaf” regions have at most one particle.
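
A minimal sketch of this recursive construction in Java (the class and field names here – Particle, Node, insert – are invented for illustration and are not Gadget’s data structures):

    // Illustrative Barnes-Hut octree; the class and field names are invented
    // for this sketch and do not correspond to Gadget's data structures.
    class Particle {
        double x, y, z, mass;
        long key;            // space-filling-curve key, used later in the domain decomposition sketch
    }

    class Node {
        double cx, cy, cz, half;        // cube centre and half-width of this region
        double totalMass, mx, my, mz;   // total mass and centre of mass, filled in by a separate pass (not shown)
        Particle particle;              // the single particle in a leaf, or null
        Node[] children;                // 8 octants, allocated when the node is subdivided

        Node(double cx, double cy, double cz, double half) {
            this.cx = cx; this.cy = cy; this.cz = cz; this.half = half;
        }

        // Insert a particle, subdividing until each leaf holds at most one particle.
        // (Coincident particles would recurse forever; a real code guards against that.)
        void insert(Particle p) {
            if (children == null && particle == null) {   // empty leaf
                particle = p;
                return;
            }
            if (children == null) {                       // occupied leaf: subdivide it
                children = new Node[8];
                Particle old = particle;
                particle = null;
                childFor(old).insert(old);
            }
            childFor(p).insert(p);                        // descend into the right octant
        }

        // Pick (and lazily create) the octant – one of 2^3 = 8 – containing p.
        private Node childFor(Particle p) {
            int i = (p.x > cx ? 1 : 0) | (p.y > cy ? 2 : 0) | (p.z > cz ? 4 : 0);
            if (children[i] == null) {
                double q = half / 2;
                children[i] = new Node(cx + (((i & 1) != 0) ? q : -q),
                                       cy + (((i & 2) != 0) ? q : -q),
                                       cz + (((i & 4) != 0) ? q : -q), q);
            }
            return children[i];
        }
    }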
20
Two Dimensional Example

Picture borrowed from www.cs.berkeley.edu/~demmel/cs267/lecture26/lecture26.html
21
Barnes-Hut Force Computation

To compute the force on a particle i, traverse the tree starting from the root:
• if a node n is “distant from” i, just add the contribution to the force on i from the centre of mass of n – no need to visit the children of n;
• if a node n is “close to” i, visit the children of n and recurse.

Everything hinges on the definition of distant from/close to.
• The basic idea is that a node representing some region of space is distant from a particle i if the angle θ it subtends at i is smaller than a threshold opening angle.

[Diagram: node n subtending the opening angle θ at particle i]
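
A sketch of this walk, reusing the illustrative Node and Particle classes from the previous sketch (Gadget’s softening and cell-opening criteria are more elaborate; this only shows the basic recursion):

    // Illustrative tree walk with an opening-angle test, reusing Node/Particle
    // from the sketch above (including the cached totalMass and centre of mass
    // mx, my, mz). Gadget's softening and opening criteria are more elaborate.
    class TreeWalk {
        static final double THETA = 0.5;    // opening-angle threshold
        static final double G = 6.674e-11;  // gravitational constant (SI units)

        // Returns the force vector on particle p due to the subtree rooted at n.
        static double[] force(Node n, Particle p) {
            if (n == null || n.totalMass == 0) return new double[3];
            double dx = n.mx - p.x, dy = n.my - p.y, dz = n.mz - p.z;
            double r = Math.sqrt(dx * dx + dy * dy + dz * dz);
            if (r == 0) return new double[3];             // skip self-interaction

            // "Distant": the cell's angular size (width / distance) is below THETA,
            // or it is a leaf – treat the node as a point mass at its centre of mass.
            if (n.children == null || (2 * n.half) / r < THETA) {
                double f = G * n.totalMass * p.mass / (r * r * r);
                return new double[] { f * dx, f * dy, f * dz };
            }

            // "Close": open the node and recurse into its children.
            double[] total = new double[3];
            for (Node child : n.children) {
                double[] fc = force(child, p);
                total[0] += fc[0]; total[1] += fc[1]; total[2] += fc[2];
            }
            return total;
        }
    }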
22
Complexity

On average, the number of nodes “opened” to compute the force on i is O(log N), as opposed to visiting O(N) particles in the naïve algorithm.
A huge win when N ≈ 10^10.
23
Domain Decomposition

Need to divide space and/or the particle set into domains, where each domain is handled by a single processor.
The problem is that we can’t just divide space evenly, because some regions will have many more particles than others – poor load balancing.
Conversely, we can’t just divide the particles evenly, because particles move through space, and we want to keep physically close particles on the same processor, as far as practical – a communication problem.
24
Peano-Hilbert Curve

Warren and Salmon originally suggested using a space-filling curve:

Picture borrowed from
http://www.mpa-garching.mpg.de/gadget/gadget2-paper.pdf
25
Peano-Hilbert Key

Gadget applies the recursion 20 times, logically dividing space into up to 2^20 × 2^20 × 2^20 cells on the Peano-Hilbert curve.
We can then label each cell by its location along the Peano-Hilbert curve – the 2^60 possible locations comfortably fit into a 64-bit word.
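
The true Peano-Hilbert ordering rotates the octants at every level of the recursion; as a simpler stand-in that shows how three 20-bit cell coordinates pack into a single 60-bit key, here is a Morton (Z-order) interleaving sketch (illustrative only – Gadget itself computes genuine Peano-Hilbert keys):

    // Illustration only: Morton (Z-order) interleaving of three 20-bit cell
    // coordinates into a 60-bit key. Gadget uses the Peano-Hilbert ordering,
    // which additionally rotates the octants at every level of the recursion.
    class PHKey {
        static long mortonKey(int ix, int iy, int iz) {
            long key = 0;
            for (int bit = 19; bit >= 0; bit--) {          // 20 levels of recursion
                key = (key << 3)
                    | (((long) (ix >> bit) & 1) << 2)
                    | (((long) (iy >> bit) & 1) << 1)
                    |  ((long) (iz >> bit) & 1);
            }
            return key;                                    // fits in the low 60 bits of a long
        }
    }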
26
Decomposition based on P-H Curve

Because the number of cells >> the number of particles, segments of the linear Peano-Hilbert curve are sparsely populated.
Sort the particles by their Peano-Hilbert key, then divide them evenly into P domains (see the sketch after this list).
• Intuitively – stretch out the P-H curve with the particles dotted along it; segment it into P parts where each part has the same number of particles.

Characteristics of this decomposition:
• Good load balancing.
• Domains are simply connected and quite “compact” in real space, because particles that are close along the P-H curve are close in real space (the converse is often but not always true).
• Domains have a relatively simple mapping to BH octree nodes.
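
In serial form the splitting step itself is simple; here is a sketch, assuming each particle already carries its precomputed key and ignoring the distributed nature of Gadget’s (approximate) sort:

    import java.util.Arrays;
    import java.util.Comparator;

    // Sketch of the splitting step in serial form: sort the particles by their
    // space-filling-curve key and cut the sorted list into P equal pieces.
    // Gadget instead performs an approximate *distributed* sort.
    class DomainSplit {
        // Returns start indices: domain d owns particles[start[d] .. start[d+1]-1].
        static int[] domainStarts(Particle[] particles, int P) {
            Arrays.sort(particles, Comparator.comparingLong(p -> p.key));
            int[] start = new int[P + 1];
            for (int d = 0; d <= P; d++) {
                start[d] = (int) ((long) d * particles.length / P);
            }
            return start;
        }
    }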
27
Distribution of BH Tree in Gadget

Ibid.
28
Distributed Representation of Tree

Every processor holds a copy of the root nodes, and a copy of all child nodes down to the point where all particles of a node are held on a single remote processor. Remotely held nodes are called pseudo-particles.
To compute the force on a single local target particle, traverse the tree from the root as usual, and accumulate contributions from locally held particles.
Build an export list containing the target particle and the hosts of pseudo-particles encountered in the walk (a rough sketch follows below).
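
A rough, purely illustrative sketch of the kind of bookkeeping involved (the entry type and the way the owning rank is discovered are invented, not Gadget’s actual structures):

    import java.util.ArrayList;
    import java.util.List;

    // Purely illustrative bookkeeping for the export list; the entry type and
    // the ownership lookup are invented here and are not Gadget's.
    class ExportList {
        static class Entry {
            final int targetIndex;   // index of the local target particle
            final int ownerRank;     // rank owning the pseudo-particle's region
            Entry(int targetIndex, int ownerRank) {
                this.targetIndex = targetIndex;
                this.ownerRank = ownerRank;
            }
        }

        final List<Entry> entries = new ArrayList<>();

        // Called from the tree walk whenever a pseudo-particle node is reached.
        void noteExport(int targetIndex, int ownerRank) {
            entries.add(new Entry(targetIndex, ownerRank));
        }
    }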
29
Communication

After the local computation for all target particles, process the export list and send the list of local target particles to all hosts that own pseudo-particle nodes needed for those particles.
All processors do another tree walk to compute their contributions to remotely owned (from their point of view) target particles.
These contributions are returned to the original processor, and added into the accelerations for the target particles.
30
Other Communications

Notably:
• The Domain Decomposition itself requires (in principle) a distributed sort of the particle list.
• Gadget approximates this sort, but it still involves fairly intricate communications.

In Gadget all communication uses the standard MPI library. It makes fairly extensive use of collective communication routines.
31
Java Gadget
32
History

MPJ Express released 2005.
Realized we needed a realistic exemplar code: both as a demo, and to drive further improvements of our software.
First thoughts that an N-body simulation code would make sense – Summer 2005.
Learned about Gadget in December of that year.
Effort to make a Java version of Gadget started seriously in early February 2006 – expecting many months of work.
Non-trivial simulations running in March.
33
Gadget Code Statistics

Public distribution is around 17,000 lines of C, distributed over about 30 source files.
Dependencies on:
• MPI library for parallelization.
• GNU Scientific Library (but only a handful of functions).
• FFTW – library for parallel Fourier transforms.

Comes with a set of initial conditions files that we can use to test our code, including “colliding galaxies” and “cluster formation”.
34
Translation of Gadget Code

Manually translated, but in the first cut we deliberately keep the Java code as similar to the C as possible.
• e.g. nearly all of Java Gadget is currently implemented as static methods on one enormous class called Main.
• Small supporting classes correspond to the structs of the C code.

Currently not supporting a number of features, including:
• PMTree algorithm (an optimization using Fourier Transforms on a mesh to compute the long-range part of the gravitational force),
• Periodic Boundary Conditions,
• Multiple Initial Conditions files,
• COMPUTE_POTENTIAL_ENERGY, ISOTHERM_EQS, FLEXSTEPS, OUTPUTPOTENTIAL, FORCETEST, MAKEGLASS, PSEUDOSYMMETRIC, …

Not quite complete in a few other ways (“magic numbers”).
35
Handling Dependencies

Replace MPI calls with MPJ calls (!)
FFTW is not needed, because we disable the PMTree algorithm.
The few GSL functions that were needed (numerical quadrature, etc.) were hand-translated to Java.
36
Test Cases

Restrictions aside, we have successfully run the Colliding Galaxies and Cluster Formation example simulations on 2, 4 and 8 processors.
• These use pure Dark Matter – the hydrodynamics code is not yet tested.

Results are indistinguishable from running the original Gadget.
37
Colliding Galaxies
38
Cluster Formation
39
Custom serialization in Java Gadget-2 – Motivation

Java’s built-in object serialization can have a detrimental effect on parallel application performance.
Experiment:
• Compare the performance of sending
  - Byte arrays of 1, 1K, and 1M elements,
  - Object arrays of 1, 1K, and 1M elements, where each object contains exactly one byte.
• Both tests communicate the same amount of data.

Message size   Latency (MPI.BYTE)   Latency (MPI.OBJECT)   Bytes communicated in object case
1              167 us               282 us                 72 (overhead: 71 bytes)
1K             512 us               7745 us                ~17K (overhead: 16K bytes)
1M             .09 milliseconds     10.85 seconds          ~17M (overhead: 16M bytes)
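
A sketch of the kind of comparison involved, using mpiJava-style calls (exact MPJ Express signatures and a proper ping-pong latency measurement are left aside; this is an illustration under those assumptions, not the actual benchmark code):

    import mpi.MPI;

    // Sketch: send N raw bytes with MPI.BYTE versus N one-byte objects with
    // MPI.OBJECT (which goes through Java object serialization). mpiJava-style
    // API assumed; a real latency measurement would use ping-pong round trips.
    public class SerializationBenchmark {
        static class OneByte implements java.io.Serializable { byte b; }

        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            final int N = 1024;                            // elements per message

            byte[] raw = new byte[N];
            OneByte[] boxed = new OneByte[N];
            for (int i = 0; i < N; i++) boxed[i] = new OneByte();

            if (rank == 0) {
                long t0 = System.nanoTime();
                MPI.COMM_WORLD.Send(raw, 0, N, MPI.BYTE, 1, 0);
                long t1 = System.nanoTime();
                MPI.COMM_WORLD.Send(boxed, 0, N, MPI.OBJECT, 1, 1);
                long t2 = System.nanoTime();
                System.out.println("MPI.BYTE:   " + (t1 - t0) / 1000 + " us");
                System.out.println("MPI.OBJECT: " + (t2 - t1) / 1000 + " us");
            } else if (rank == 1) {
                MPI.COMM_WORLD.Recv(raw, 0, N, MPI.BYTE, 0, 0);
                MPI.COMM_WORLD.Recv(boxed, 0, N, MPI.OBJECT, 0, 1);
            }
            MPI.Finalize();
        }
    }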
40
Custom serialization in Java Gadget-2

In the original C Gadget-2, initial conditions are read into arrays of C structs called ParticleData and SphParticleData.
• Particles that need to be exported are copied to a contiguous memory region called CommBuffer.

In the Java version, ParticleData and SphParticleData are object arrays.
• CommBuffer is a contiguous memory region, implemented as an instance of the ByteBuffer class.
• Each primitive field of the object arrays is copied to (packed into) the CommBuffer by the sender and copied from (unpacked) it by the receiver (a sketch follows below).
  - This avoids the overheads of Java’s object serialization.
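
A hedged sketch of the packing idea (the field layout and class name are invented for illustration, not taken from the Java Gadget code):

    import java.nio.ByteBuffer;

    // Illustrative only: packing a particle's fields into a ByteBuffer before
    // communication, and unpacking on the receiving side, avoiding Java object
    // serialization. The field layout (position, velocity, mass) is invented
    // for this sketch and is not Java Gadget's actual CommBuffer layout.
    class CommBufferPacking {
        static void pack(ByteBuffer commBuffer, double[] pos, double[] vel, double mass) {
            for (double x : pos) commBuffer.putDouble(x);   // 3 position components
            for (double v : vel) commBuffer.putDouble(v);   // 3 velocity components
            commBuffer.putDouble(mass);
        }

        static double[] unpack(ByteBuffer commBuffer) {
            double[] fields = new double[7];                // 3 pos + 3 vel + mass
            for (int i = 0; i < fields.length; i++) {
                fields[i] = commBuffer.getDouble();
            }
            return fields;
        }
    }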
41
Initial Benchmarks

Here we are running the simulations defined by the initial conditions files distributed with Gadget.
The numbers of particles are relatively modest, so parallel speedups are less than perfect.
42
43
Interpretation

These are first-cut results – comparing the production C Gadget with a mostly unoptimized Java code.
In this version the Java code appears to be roughly a factor of two slower than C.
It is fairly clear that Java memory usage is very inefficient, and this probably has an adverse effect on numerical performance (poor caching).
Need to investigate whether the Java code can be rewritten to make better use of memory.
44
Possible enhancements to the MPJ API

Support for sending a basic datatype.
• For example, if we want to send a single integer, it currently has to be part of, or copied to, an integer array before communication (see the sketch after this list).

Support for sending from and receiving into a ByteBuffer.
• Our buffering API in MPJ Express allows efficient reuse of ByteBuffers.
• Application developers can get a chunk of memory (a ByteBuffer) using the buffering API, copy their data into it, and use it for communication.
  - MPJ could then use this ByteBuffer directly for the socket send.
• No additional copying.
• Can help reduce latency.
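
For instance, the current workaround for a single value, together with what a hypothetical enhanced call might look like (the enhanced forms do not exist in MPJ Express; an mpiJava-style signature is assumed for the real call, and the fragment below is not a complete program):

    // Current workaround (mpiJava-style call assumed): a single int has to be
    // wrapped in a one-element array before it can be sent.
    int nLocal = 42;                               // illustrative value
    int[] tmp = new int[] { nLocal };              // copy into a 1-element array
    MPI.COMM_WORLD.Send(tmp, 0, 1, MPI.INT, 1 /* dest */, 0 /* tag */);

    // Hypothetical enhanced calls (these do NOT exist in MPJ Express today):
    // MPI.COMM_WORLD.Send(nLocal, MPI.INT, 1, 0);   // send a bare value
    // MPI.COMM_WORLD.Send(byteBuffer, 1, 0);        // send straight from a ByteBuffer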
45