ScicomP 9, Bologna, March 23-26, 2004

Using LAPI and MPI-2 in an N-body cosmological code on IBM SP
M. Comparato, U. Becciani, C. Gheller, V. Antonuccio
INAF Osservatorio Astrofisico di Catania

Outline
• The N-body project at OACT
• The FLY code
• Performance analysis with LAPI and MPI-2
• Questions

FLY Project
• People: V. Antonuccio, U. Becciani, M. Comparato, D. Ferro
• Funds: included in the project “Problematiche Astrofisiche Attuali ed Alta Formazione nel Campo del Supercalcolo”, funded by MIUR with more than 500,000 Euros + 170,000 Euros
• INAF provides grants on MPP systems at CINECA
• Resources: IBM SP, SGI Origin systems, Cray T3E

INAF Astrophysical Observatory of Catania: IBM SP POWER3
• 24 processors at 222 MHz
• Global RAM memory: 48 GB
• Disk space: 254 GB (72.8 GB HD per node + 36.2 GB HD on the control workstation)
• Network topology: SPS scalable Omega switch with FastEthernet node interconnection
• Bandwidth: 300 MB/s peak bi-directional transfer rate
• Programming languages: C, C++, Fortran 90
• Parallel paradigms: OpenMP, MPI, LAPI

INAF Astrophysical Observatory of Catania: new system, IBM POWER4 p650
• 8 processors at 1.1 GHz
• Global RAM memory: 16 GB (2 GB per processor)
• Disk array: 254 GB
• L2 cache: 1.5 MB; L3 cache: 128 MB

Gravitational N-body problem: the cosmological simulation
The N-body technique makes it possible to perform cosmological simulations that describe the evolution of the universe. With this method, matter is represented as a set of particles:
• each particle is characterized by mass, position and velocity,
• the only force considered is gravity,
• the evolution of the system is obtained by numerically integrating the equations of motion over a proper time interval.

Gravitational N-body problem: the cosmological simulation
• The direct interaction (P-P) method is conceptually the simplest.
• … but it scales as O(N²), which makes it impossible to run simulations with more than N ≈ 10⁵ particles.
• To overcome this problem, tree- and mesh-based algorithms have been developed; these scale as O(N log N) and O(N) respectively.
• Even so, only supercomputers, and parallel codes, allow the user to run simulations with N ≥ 10⁷ particles.

FLY: parallel tree N-body code for cosmological applications
• Based on the Barnes-Hut algorithm (J. Barnes & P. Hut, Nature, 324, 1986)
• Fortran 90 parallel code
• High-performance code for MPP/SMP architectures using a one-side communication paradigm: SHMEM, LAPI
• Runs on Cray T3E, SGI Origin and IBM SP systems
• Typical simulations require 350 MB of RAM per 1 million particles

Gravitational N-body problem: the Barnes-Hut algorithm
The particles evolve according to the laws of Newtonian physics:

    d²x_i/dt² = Σ_{j≠i} G m_j d_ij / |d_ij|³,  where d_ij = x_j − x_i

Considering a distant region with total mass M and center of mass at distance d_i,cm from particle i, the force component on the particle may be computed as:

    Σ_j G m_j d_ij / |d_ij|³ ≈ G M d_i,cm / |d_i,cm|³ + higher-order multipole terms

Gravitational N-body problem: tree formation
2D domain decomposition and tree structure: starting from the root, each sub-domain is split until only one body (a leaf) is contained in each box.

Gravitational N-body problem: force computation
Two phases:
• tree walk procedure
• force computation
The force on any particle is computed as the sum of the forces from nearby particles plus the forces from distant cells, whose mass distributions are approximated by multipole series truncated, typically, at the quadrupole order. During the tree walk, if CellSize / d_i,cm ≤ θ the cell is marked and its multipole expansion is used.

FLY block diagram
SYSTEM INITIALIZATION → TREE FORMATION and BARRIER → FORCE COMPUTATION (TREE INSPECTION, ACC. COMPONENTS) and BARRIER → UPDATE POSITIONS and BARRIER → STOP

Parallel implementation of FLY: data distribution
Two main data structures:
• particles
• tree
Particles are distributed in blocks, such that each processor has the same number of contiguous bodies. E.g. with 4 processors, PE0 holds bodies 1, 2, 3, …, PE1 holds the next contiguous block, and so on.
The tree structure is distributed in a cyclic way, such that each processor has the same number of cells. E.g. with 4 processors, PE0 holds cells 1, 5, 9, 13, …, PE1 holds cells 2, 6, 10, 14, …, and so on.

Parallel implementation of FLY: work distribution
• Each processor calculates the force for its local particles.
• To do that, the whole tree structure (which is distributed among the processors) must be accessed asynchronously (one-side communications are required).
• This leads to a huge communication overhead.

FLY: tips and tricks
To reduce the problems related to communication overhead, we have implemented several “tricks”:
• Dynamical load balancing: processors help each other
• Grouping: close particles have the same interaction with far distributions of mass
• Data buffering

FLY: data buffering
Free RAM segments are dynamically allocated to store remote data (tree cell properties and remote bodies) already accessed during the tree walk procedure.
Performance improvement (16 million bodies on a Cray T3E, 32 PEs, 156 MB per PE):
• without buffering, each PE executes 700 GET operations for each local body;
• with buffering, each PE executes only 8-25 GET operations for each local body.

Why FLYing from LAPI to MPI-2
• LAPI is a proprietary (IBM) parallel programming library
• Implementing FLY using MPI-2 improves the portability of our code
• The RMA calls introduced in MPI-2 make the porting simple, since there is a direct correspondence between the basic functions: lapi_get(…) ↔ mpi_get(…) and lapi_put(…) ↔ mpi_put(…)
• MPI-2 does not have an atomic fetch-and-increment call

MPI-2 synchronization
• MPI_Win_fence mechanism:
  - when we can separate non-RMA accesses from RMA accesses
  - when all the processors access remote data at the same time
• MPI_Win_lock / MPI_Win_unlock mechanism:
  - when we need the data just after the call
  - when only one processor accesses remote data

MPI-2 synchronization
• The FLY algorithm requires continuous asynchronous access to remote data
• Passive-target synchronization is therefore needed
• We have to use the lock/unlock mechanism.
MPI-2 drawback
Unfortunately, lock and unlock are usually not implemented efficiently (or are not implemented at all):
• LAM: not implemented
• MPICH: not implemented
• MPICH2: waiting for the next release
• IBM AIX: poor performance
• IBM Turbo MPI2: testing phase

FLY 3.3
Problem: poor performance of the lock/unlock calls.
Workaround: rewrite portions of the code (where possible) to separate non-RMA accesses from RMA accesses, in order to use the fence calls.
Result: the MPI-2 version runs 2 times faster.
Why not port these changes back to the LAPI version too? This is how FLY 3.3 was born.

FLY 3.2 vs FLY 3.3 (2M particles test)
[Charts: timings in seconds on 1-64 processors for FLY 3.2 and FLY 3.3. Static part: tree generation, cell properties, … Dynamic part: interaction list, force computation, …]

FLY 3.2 vs FLY 3.3 (2M particles test)
[Charts: total simulation time in seconds on 1-64 processors, and scalability, (t_n × n) / t_1, normalized on the number of processors, for FLY 3.2 and FLY 3.3.]

FLY 3.3 vs FLY MPI-2 (2M particles test)
[Charts: timings in seconds on 1-64 processors for FLY 3.3 and FLY MPI. Static part: tree generation, cell properties, … Dynamic part: interaction list, force computation, …]

Conclusions
Present:
• a low-performance MPI-2 version of FLY (for now)
• a more scalable LAPI version of FLY
Future:
• TurboMPI2
• MPICH2 (porting to Linux clusters)
• an OO interface to hydrodynamic codes (FLASH)