Computational Methods
in Astrophysics
Dr Rob Thacker (AT319E)
thacker@ap
Today's Lecture

Distributed Memory Computing I

Key concepts:
- Differences between shared & distributed memory
- Message passing
- A few network details

General comment: the overall computing model has not changed in decades, but the APIs have…
API Evolution

- From the 80s through to the early 2000s much HPC evolution was driven by the math & physics communities
  - Notable focus on regular arrays and data structures
  - Big forums, working on standards etc.
- Starting in the 2000s, the growth of data analytics and computational biology introduced different requirements
  - C++, Java or Python
  - Irregular data, able to start designs from scratch
Shared vs distributed memory

- The key difference is data decomposition
  - Commonly called "domain decomposition"
- Numerous possible ways to break up the data space
  - Each has different compromises in terms of the required communication patterns that result
  - The comms pattern determines the overall complexity of the parallel code
- The decomposition can be handled in implicit or explicit ways
Parallel APIs from the decomposition-communication perspective

[Figure: APIs placed on two axes, decomposition (implicit vs explicit) and communication (implicit vs explicit). The message passing APIs (MPI, PVM) and SHMEM require explicit communication; CAF, UPC, HPF and OpenMP make the communication implicit, with OpenMP also leaving the decomposition implicit and working on shared memory only. APIs with explicit decomposition operate effectively on distributed memory architectures.]
Message Passing

- The concept of sequential processes communicating via messages was developed by Hoare in the 70s
  - Hoare, CAR, Comm ACM, 21, 666 (1978)
- Each process has its own local memory store
- Remote data needs are served by passing messages containing the desired data
- Naturally carries over to distributed memory architectures
- Two ways of expressing message passing:
  - Coordination of message passing at the language level (e.g. Occam)
  - Calls to a message passing library
Two types of message passing

- Point-to-point (one-to-one)
- Broadcast (one-to-all, all-to-all)

Broadcast versus point-to-point

[Figure: a broadcast (one-to-all) is a collective operation involving a group of processes (Process 1 sends to Processes 2, 3 and 4); a point-to-point (one-to-one) message is a non-collective operation involving a pair of processes.]
Message passing APIs

- Message passing APIs dominate
  - Often reflect underlying hardware design
  - Legacy codes can frequently be converted more easily
  - Allow explicit management of the memory hierarchy
- Message Passing Interface (MPI) is the predominant API
- Parallel Virtual Machine (PVM) is an earlier API that possesses some useful features over MPI
  - Useful paradigm for heterogeneous systems; there's even a Python version
  - http://www.csm.ornl.gov/pvm/
PVM – An overview

- API can be traced back to 1989(!)
  - Geist & Sunderam developed the experimental version
- Daemon based
  - Each host runs a daemon that controls resources
  - Processes can be dynamically created and destroyed
- Each user may actively configure their host environment
  - PVM console
- Process groups for domain decomposition
  - PVM group server controls this aspect
- Limited number of collective operations
  - Barriers, broadcast, reduction
- Roughly 40 functions in the API
PVM API and programming model

- PVM most naturally fits a master-worker model
  - Master process responsible for I/O
  - Workers are spawned by the master
  - Each process has a unique identifier
- Messages are typed and tagged
  - System is aware of the data-type, allowing easy portability across a heterogeneous network
- Messages are passed via a three phase process:
  - Clear (initialize) buffer
  - Pack buffer
  - Send buffer
Example code

   tid = pvm_mytid();
   if (tid == source) {                      /* Sender */
      bufid = pvm_initsend(PvmDataDefault);
      info  = pvm_pkint(&i1, 1, 1);
      info  = pvm_pkfloat(vec1, 2, 1);
      info  = pvm_send(dest, tag);
   }
   else if (tid == dest) {                   /* Receiver */
      bufid = pvm_recv(source, tag);
      info  = pvm_upkint(&i2, 1, 1);
      info  = pvm_upkfloat(vec2, 2, 1);
   }
MPI – An overview

- API can be traced back to 1992
  - First unofficial meeting of the MPI Forum at Supercomputing 92
- Mechanism for creating processes is not specified within the API
  - Different mechanism on different platforms
  - MPI 1.x standard does not allow for creating or destroying processes
- Process groups central to the parallel model
  - 'Communicators'
- Richer set of collective operations than PVM
- Derived data-types an important advance
  - Can specify a data-type to control the pack-unpack step implicitly
- 125 functions in the API (v1.0)
MPI API and programming model

- More naturally a true SPMD type programming model
  - Oriented toward HPC applications
  - Master-worker model can still be implemented effectively
- As for PVM, each process has a unique identifier
- Messages are typed, tagged and flagged with a communicator
- Messaging can be a single stage operation (see the sketch below)
  - Can send specific variables without the need for packing
  - Packing is still an option
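As a point of comparison with the three-phase PVM example earlier, here is a minimal sketch (not from the lecture) of what single-stage messaging looks like in C: the (buffer, count, datatype) triple replaces PVM's clear/pack/send sequence. The helper names and arguments are illustrative only.

   #include <mpi.h>

   /* Illustrative sketch: a typed array is sent in one call, no packing step. */
   void send_vector(double *vec, int n, int dest, int tag)
   {
       MPI_Send(vec, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
   }

   void recv_vector(double *vec, int n, int source, int tag)
   {
       MPI_Status status;
       MPI_Recv(vec, n, MPI_DOUBLE, source, tag, MPI_COMM_WORLD, &status);
   }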
Remote Direct Memory Access

- Message passing involves a number of expensive operations:
  - CPUs must be involved (possibly the OS kernel too)
  - Buffers are often required
- RDMA cuts down on the CPU overhead
  - CPU sets up channels for the DMA engine to write directly to the buffer and avoid constantly taxing the CPU
  - Frequently discussed under the "zero-copy" euphemism
- Message passing APIs have been designed around this concept (but usually called remote memory access)
  - Cray SHMEM
RDMA illustrated

[Figure: Host A and Host B each have a CPU, a memory/buffer, and a NIC with an RDMA engine; data moves directly between the memory buffers via the NICs, bypassing the CPUs.]
Networking issues

- Networks have played a profound role in the evolution of parallel APIs
- Examining network fundamentals in more detail
  - Provides a better understanding of programming issues
  - Explains reasons for library design (especially RDMA)
OSI network model

- Grew out of a 1982 attempt by ISO to develop Open Systems Interconnect (too many vendor proprietary protocols at that time)
  - Motivated from a theoretical rather than practical standpoint
- System of layers taken together = protocol stack
  - Each layer communicates with its peer layer on the remote host
- Proposed stack was too complex and had too much freedom: not adopted
  - e.g. the X.400 email standard required several books of definitions
- Simplified Internet TCP/IP protocol stack eventually grew out of the OSI model
  - e.g. the SMTP email standard takes a few pages
Conceptual structure of OSI network

- Layer 7. Application (http, ftp, …)
- Layer 6. Presentation (data std)
- Layer 5. Session (application)
- Layer 4. Transport (TCP, UDP, ...)
- Layer 3. Network (IP, …)
- Layer 2. Data link (Ethernet, …)
- Layer 1. Physical (signal)

[In the original figure the upper-level layers are grouped as handling data transfer and the lower-level layers as handling routing.]
Internet Protocol Suite

- Protocol stack on which the internet runs
  - Occasionally called the TCP/IP protocol stack
- Doesn't map perfectly to the OSI model
  - OSI model lacks richness at lower levels
  - Motivated by engineering rather than concepts
- Higher levels of the OSI model were mapped into a single application layer
- Expanded some layering concepts within the OSI model (e.g. internetworking was added to the network layer)
Internet Protocol Suite

- "Layer 7" Application: e.g. FTP, HTTP, DNS
- Layer 4. Transport: e.g. TCP, UDP, RTP, SCTP
- Layer 3. Network: IP
- Layer 2. Data link: e.g. Ethernet, token ring
- Layer 1. Physical: e.g. T1, E1
Internet Protocol (IP)

- Data-oriented protocol used by hosts for communicating data across a packet-switched internetwork
- Addressing and routing are handled at this level
  - IP sends and receives data between two IP addresses
  - Data segment = packet (or datagram)
- Packet delivery is unreliable – packets may arrive corrupted, duplicated, out of order, or not at all
  - Lack of delivery guarantees allows fast switching
IP Addressing

- On an ethernet network, routing at the data link layer occurs between 6 byte MAC (Media Access Control) addresses
- IP adds its own configurable address scheme on top of this
  - 4 byte address, expressed as 4 decimals in the range 0-255
  - Note 0 and 255 are both reserved numbers
  - The division of the numbers determines the network number versus the node; subnet masks determine how they are divided (see the sketch below)
- Classes of networks are described by the first number in the IP address and the number of network addresses
  - [192:255].35.91.* = class C network (254 hosts), subnet mask 255.255.255.0
  - [128:191].132.*.* = class B network (65,534 hosts), subnet mask 255.255.0.0
  - [1:126].*.*.* = class A network (16 million hosts), subnet mask 255.0.0.0
  - Note the 35.91 in the class C example and the 132 in the class B example can be different, but are filled in to show how the network address is defined
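An illustrative sketch (not from the lecture) of how a subnet mask splits a 4-byte address into network and node parts, using the class C example above; the host value 17 is made up for the example.

   #include <stdio.h>
   #include <stdint.h>

   int main(void)
   {
       uint32_t addr = (192u << 24) | (35u << 16) | (91u << 8) | 17u; /* 192.35.91.17 */
       uint32_t mask = 0xFFFFFF00u;                                   /* 255.255.255.0 */

       uint32_t network = addr & mask;    /* 192.35.91.0 : the network number */
       uint32_t host    = addr & ~mask;   /* 17          : the node on that network */

       printf("network = %u.%u.%u.%u, host = %u\n",
              (network >> 24) & 0xFF, (network >> 16) & 0xFF,
              (network >> 8) & 0xFF,  network & 0xFF, host);
       return 0;
   }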
Transmission Control Protocol (TCP)

- TCP is responsible for division of the application's data stream, error correction and opening the channel (port) between applications
- Applications send a byte stream to TCP
- TCP divides the byte stream into appropriately sized segments (set by the MTU* of the IP layer)
- Each segment is given two sequence numbers to enable the byte stream to be reconstructed
- Each segment also has a checksum to ensure correct packet delivery
- Segments are passed to the IP layer for delivery

*maximum transfer unit
"Hi, I'd like to hear a TCP joke."
"Hello, would you like to hear a TCP joke?"
"Yes, I'd like to hear a TCP joke."
"OK, I'll tell you a TCP joke."
"Ok, I will hear a TCP joke."
"Are you ready to hear a TCP joke?"
"Yes, I am ready to hear a TCP joke."
"Ok, I am about to send the TCP joke. It will last 10 seconds, it has
two characters, it does not have a setting, it ends with a punchline."
"Ok, I am ready to get your TCP joke that will last 10 seconds, has
two characters, does not have an explicit setting, and ends with a
punchline."
"I'm sorry, your connection has timed out. Hello, would you like to
hear a TCP joke?"
Humour:
TCP joke
UDP: Alternative to TCP

- UDP = User Datagram Protocol
- Only adds a checksum and multiplexing capability – the limited functionality allows a streamlined implementation: faster than TCP
- No confirmation of delivery
- Unreliable protocol: if you need reliability you must build on top of this layer
- Suitable for real-time applications where error correction is irrelevant (e.g. streaming media, voice over IP)
- DNS and DHCP both use UDP
Encapsulation of layers

[Figure: the application data is wrapped with a TCP header at the transport layer, then an IP header at the network layer, then an Ethernet header at the data link layer, each layer encapsulating the one above it.]
Link Layer

- For high performance clusters the link layer frequently determines the networking above it
- All high performance interconnects emulate IP
- Each data link thus brings its own networking layer with it
Overview of interconnect fabrics

- Broadly speaking, interconnects break down into two camps: commodity vs specialist
  - Commodity: gigabit ethernet (cost < $50 per port)
  - Specialist: everything else (cost > $200 per port)
- Specialist interconnects primarily provide two features over gigabit:
  - Higher bandwidth
  - Lower message latency
10 Gigabit Ethernet

- Expected to become commodity any year now (estimates still in the range of $1000 per port)
- A lot of the early implementations were from companies with HPC backgrounds, e.g. Myrinet, Mellanox
- The problem has always been finding a technological driver outside HPC – few people need a GB/s out of their desktop
Infiniband

- The Infiniband (open) standard is designed to cover many arenas, from database servers to HPC
  - 237 systems in the Top500 (Nov. 2015)
  - Has essentially become commoditized
- Serial bus; bandwidth can be added by adding more channels ("lanes") and increasing the channel speed
  - 14 Gb/s per-lane data rate option now available
  - 56 Gb/s (14*4) ports now common, higher available
  - Necessary for "fat nodes" with lots of cores
- 600 Gb/s projected for 2017
History of MPI

- Many different message passing standards circa 1992
  - Most designed for high performance distributed memory systems
- Following SC92 the MPI Forum was started
  - Open participation encouraged (e.g. the PVM working group was asked for input)
  - Goal was to produce as portable an interface as possible
  - Vendors included but not given control – specific hardware optimizations were avoided
  - Web address: http://www.mpi-forum.org
- MPI-1 standard released 1994
- Forum reconvened in 1995-97 to define MPI-2
  - Fully functional MPI-2 implementations did not appear until 2002 though
- Reference guide is available for download
  - http://www.netlib.org/utk/papers/mpi-book/mpi-book.ps
C vs FORTRAN interface

- As much effort as possible was extended to keep the interfaces similar
- Only significant difference is that C functions return their value as the error code (see the sketch below)
  - FORTRAN versions pass a separate argument
- Arguments to C functions may be more strongly typed than the FORTRAN equivalents
  - FORTRAN interface relies upon integers
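A small sketch (not from the lecture) of the difference just described; the helper name and arguments are illustrative.

   #include <mpi.h>

   /* In C the call returns the error code directly. The Fortran binding
      instead returns it through a trailing integer argument:
         call MPI_SEND(buf, count, MPI_DOUBLE_PRECISION, dest, 0,
        +              MPI_COMM_WORLD, ierr)                           */
   void checked_send(double *buf, int count, int dest)
   {
       int err = MPI_Send(buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
       if (err != MPI_SUCCESS)
           MPI_Abort(MPI_COMM_WORLD, err);
   }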
MPI Communication model

- Messages are typed and tagged
- Don't need to explicitly define a buffer
  - Specify the start point of a message using a memory address
  - Packing interface available if necessary (MPI_PACK datatype); the interface is provided if you want to use it
- Communicators (process groups) are a vital component of the MPI standard
  - Specifying a destination process must include the specific process group
- Messages must therefore specify:
  (address, count, datatype, destination, tag, communicator)
  - The (address, count, datatype) triple defines the message data; the remaining variables define the message envelope
MPI-2

- Significant advance over the 1.2 standard
- Defines a remote memory access (RMA) interface
  - Two modes of operation
  - Active target: all processes participate in a single communication phase (although point-to-point messaging is allowed)
  - Passive target: individual processes participate in point-to-point messaging
- Parallel I/O
- Dynamic process management (MPI_SPAWN)
Missing pieces

- MPI-1 did not specify how processes start
  - PVM defined its own console
  - Start-up is done using a vendor/open source supplied package
  - MPI-2 defines mpiexec – a standardized start-up routine
- Standard buffer interface is implementation specific
- Process groups are static – they can only be created or destroyed, not altered
- No mechanism for obtaining details about the hosts involved in the computation
Getting started: enrolling & exiting from the MPI environment

- Every program must initialize by executing MPI_INIT(ierr) or int MPI_Init(int *argc, char ***argv)
  - argc, argv are historical hangovers in the C version and may be set to NULL
  - Default communicator is MPI_COMM_WORLD
- Determine the process id by calling MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  - Note PVM essentially puts enrollment and id resolution into one call
- Determine the total number of processes via MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
- To exit, processes must call MPI_FINALIZE(ierr)
Minimal MPI program

C version:

   #include "mpi.h"
   #include <stdio.h>

   int main( int argc, char *argv[] )
   {
       int myid;
       MPI_Init( &argc, &argv );
       MPI_Comm_rank(MPI_COMM_WORLD, &myid);
       printf( "Hello, world from %d\n", myid);
       MPI_Finalize();
       return 0;
   }

Fortran version:

         program main
         include 'mpif.h'
         integer ierr, myid
         call MPI_INIT( ierr )
         call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
         print *, 'Hello, world from ', myid
         call MPI_FINALIZE( ierr )
         end

Normally execute by: mpirun -np 4 my_program

Output:
   Hello, world from 2
   Hello, world from 1
   Hello, world from 0
   Hello, world from 3
Compiling MPI codes

- Some implementations (e.g. MPICH) define additional wrappers for the compiler:
  - mpif77, mpif90 for F77, F90
  - mpicc, mpicxx for C/C++
- Code is then compiled using mpif90 (e.g.) rather than f90; libraries are linked in automatically
- Usually the best policy when machine specific libraries are required
- Linking can always be done by hand
What needs to go in a message?

Things that need specifying:
- How will "data" be described? – specification
- How will processes be identified? – where?
- How will the receiver recognize/screen messages? – tagging
- What will it mean for these operations to complete? – confirmed completion
MPI Basic (Blocking) Send

MPI_SEND (start, count, datatype, dest, tag, comm)

- The message buffer is described by (start, count, datatype).
- The target process is specified by dest, which is the rank of the target process in the communicator specified by comm.
- When this function returns, the data has been delivered to the system and the buffer can be reused. The message may not have been received by the target process.

From Bill Gropp's slides

A short example of a matched send/receive pair follows.
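A minimal sketch (not from the original slides) of a matched blocking send/receive using the envelope described above; it assumes at least two processes, and the value and tag are illustrative.

   #include <mpi.h>
   #include <stdio.h>

   int main(int argc, char *argv[])
   {
       int rank, value = 0;
       MPI_Status status;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       if (rank == 0) {
           value = 42;
           MPI_Send(&value, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
       } else if (rank == 1) {
           MPI_Recv(&value, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
           printf("rank 1 received %d\n", value);
       }

       MPI_Finalize();
       return 0;
   }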
Subtleties of point-to-point messaging

Case 1:
   Process A: MPI_Send(B); MPI_Recv(B)
   Process B: MPI_Send(A); MPI_Recv(A)
This kind of communication is `unsafe'. Whether it works correctly is dependent upon whether the system has enough buffer space.

Case 2:
   Process A: MPI_Recv(B); MPI_Send(B)
   Process B: MPI_Recv(A); MPI_Send(A)
This code leads to a deadlock, since MPI_Recv blocks execution until it is completed.

Case 3:
   Process A: MPI_Send(B); MPI_Recv(B)
   Process B: MPI_Recv(A); MPI_Send(A)
You should always try to write communication patterns like this: a send is matched by a recv.
Buffered Mode communication

- Buffered sends avoid the issue of whether enough internal buffering is available
  - The programmer explicitly defines buffer space sufficient to allow all messages to be sent
  - MPI_Bsend has the same semantics as MPI_Send
  - MPI_Buffer_attach(buffer,size,ierr) must be called to define the buffer space (see the sketch below)
- Frequently better to rely on non-blocking communication though
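A minimal sketch (not from the lecture) of attaching a user buffer before a buffered send; the helper name and sizing with MPI_Pack_size/MPI_BSEND_OVERHEAD follow common practice but are illustrative here.

   #include <mpi.h>
   #include <stdlib.h>

   void buffered_send(double *data, int count, int dest)
   {
       int size;
       /* space for one message plus the per-message overhead MPI requires */
       MPI_Pack_size(count, MPI_DOUBLE, MPI_COMM_WORLD, &size);
       size += MPI_BSEND_OVERHEAD;

       void *buf = malloc(size);
       MPI_Buffer_attach(buf, size);

       /* completes locally, using the attached buffer if needed */
       MPI_Bsend(data, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);

       /* detach waits until buffered messages have been delivered to the system */
       MPI_Buffer_detach(&buf, &size);
       free(buf);
   }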
Non-blocking communication

- Helps alleviate two issues:
  1. Blocking communication can potentially starve a process for data while it could be doing useful work
  2. Problems related to buffering are circumvented, since the user must explicitly ensure the buffer is available
- MPI_Isend adds a handle to the subroutine call which is later used to determine whether the operation has succeeded
- MPI_Irecv is the matching non-blocking receive operation
- MPI_Test can be used to detect whether the send/receive has completed
  - The handle is used to identify which particular message
- MPI_Wait is used to wait for an operation to complete
- MPI_Waitall is used to wait for a series of operations to complete
  - An array of handles is used
(See the sketch after this list.)
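A minimal sketch (not from the slides) of the non-blocking calls listed above; the helper name and arguments are illustrative, and it assumes two ranks exchanging one buffer each.

   #include <mpi.h>

   void exchange_nonblocking(double *sendbuf, double *recvbuf, int n, int partner)
   {
       MPI_Request reqs[2];
       MPI_Status  stats[2];

       /* post both operations, then return immediately */
       MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
       MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);

       /* ... overlap with useful computation here ... */

       /* both handles must complete before sendbuf is reused or recvbuf is read */
       MPI_Waitall(2, reqs, stats);
   }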
Solutions to deadlocking

- If sends and receives need to be matched, use MPI_Sendrecv:
   Process A: MPI_Sendrecv(B)
   Process B: MPI_Sendrecv(A)
- Non-blocking versions, Isend and Irecv, will prevent deadlocks:
   Process A: MPI_Isend(B); MPI_Irecv(B); MPI_Waitall
   Process B: MPI_Isend(A); MPI_Irecv(A); MPI_Waitall
- Advice: use buffered mode sends (Ibsend) so you know for sure that buffer space is available
(See the sketch after this list.)
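A minimal sketch (not from the slides) of the MPI_Sendrecv pattern: the matched send and receive are issued as one call, so neither neighbour can deadlock the other. The helper name and arguments are illustrative.

   #include <mpi.h>

   /* each rank exchanges n doubles with a partner rank */
   void exchange_sendrecv(double *sendbuf, double *recvbuf, int n, int partner)
   {
       MPI_Status status;
       MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, partner, 0,
                    recvbuf, n, MPI_DOUBLE, partner, 0,
                    MPI_COMM_WORLD, &status);
   }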
Other sending modes

- Synchronous send (MPI_Ssend)
  - Only returns when the receiver has started receiving the message
  - On return indicates that the send buffer can be reused, and also that the receiver has started processing the message
  - Non-local communication mode: dependent upon the speed of remote processing
- (Receiver) Ready send (MPI_Rsend)
  - Used to eliminate unnecessary handshaking on some systems
  - If posted before the receiver is ready then the outcome is undefined (dangerous!)
  - Semantically, Rsend can be replaced by a standard send
Collective Operations

- Collectives apply to all processes within a given communicator
- Three main categories:
  - Data movement (e.g. broadcast)
  - Synchronization (e.g. barrier)
  - Global reduction operations
- All processes must have a matching call
- Size of data sent must match size of data received
- Unless specifically a synchronization function, these routines do not imply synchronization
- Blocking mode only – but unaware of the status of remote operations
- No tags are necessary
Collective Data Movement

- Types of data movement:
  - Broadcast (one to all, or all to all)
  - Gather (collect to a single process)
  - Scatter (send from one processor to all)

MPI_Bcast(buff,count,datatype,root,comm,ierr)

[Figure: MPI_Bcast copies the data item A0 held by the root processor into the buffer of every processor.]
Gather/scatter

- MPI_Gather(sendbuf,sendcount,sendtype,recvbuf,recvcount,recvtype,root,comm,ierr)
  - MPI_Scatter has the same semantics
  - Note MPI_Allgather removes the root argument and all processes receive the result
  - Think of it as a gather followed by a broadcast
- MPI_Alltoall(sendbuf,sendcount,sendtype,recvbuf,recvcount,recvtype,comm,ierr)
  - Each process sends a set of distinct data elements to the others – useful for transposing a matrix

[Figure: MPI_Scatter distributes the elements A0, A1, A2, A3 held by the root across the processors, one element each; MPI_Gather is the inverse operation.]

A short sketch of the scatter/gather pair follows.
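A minimal sketch (not from the slides) of the scatter/gather pair: the root distributes equal chunks, each rank works on its chunk, and the root collects the results. The helper name, chunk handling and workflow are illustrative.

   #include <mpi.h>

   void scatter_work_gather(double *full, double *chunkbuf, int chunk,
                            int root, MPI_Comm comm)
   {
       /* full is only significant on the root process */
       MPI_Scatter(full, chunk, MPI_DOUBLE,
                   chunkbuf, chunk, MPI_DOUBLE, root, comm);

       /* ... each process operates on its own chunkbuf here ... */

       MPI_Gather(chunkbuf, chunk, MPI_DOUBLE,
                  full, chunk, MPI_DOUBLE, root, comm);
   }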
Global Reduction Operations

- Plenty of operations covered:

  Name of Operation   Action
  MPI_MAX             maximum
  MPI_MIN             minimum
  MPI_SUM             sum
  MPI_PROD            product
  MPI_LAND            logical and
  MPI_BAND            bit-wise and
  MPI_LOR             logical or
  MPI_BOR             bit-wise or
  MPI_LXOR            logical xor
  MPI_BXOR            bit-wise xor
  MPI_MAXLOC          maximum value and location
  MPI_MINLOC          minimum value and location
Reductions

- MPI_REDUCE(sendbuf,recvbuf,count,datatype,op,root,comm,ierr)
  - Result is stored in the root process
  - All members must call MPI_Reduce with the same root, op, and count
- MPI_Allreduce(sendbuf,recvbuf,count,datatype,op,comm,ierr)
  - All members of the group receive the answer (see the sketch below)
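A minimal sketch (not from the slides) of a global sum with MPI_Allreduce; the helper name is illustrative.

   #include <mpi.h>

   /* every rank contributes a local value; every rank receives the total
      (with MPI_Reduce only the root would hold the result) */
   double global_sum(double local)
   {
       double total;
       MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
       return total;
   }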
Example using collectives

- Numerically integrate
  ∫_0^1 4/(1 + x^2) dx = 4(arctan(1) − arctan(0)) = π
- Parallel algorithm: break up the integration region and sum separately over processors
  - Combine all values at the end
- Very little communication required
  - Number of pieces
  - Return of the values calculated
Example using broadcast and reduce

c     compute pi by integrating f(x) = 4/(1 + x**2)
c
c     each process:
c       - receives the # of intervals used in the apprxn
c       - calculates the areas of its rectangles
c       - synchronizes for a global summation
c     process 0 prints the result and the time it took

      program main
      include 'mpif.h'

      double precision PIX
      parameter (PIX = 4*atan(1.0))
      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr

c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

      print *, "Process ", myid, " of ", numprocs, " is alive"

      if (myid .eq. 0) then
         print *, "Enter the number of intervals: (0 to quit)"
         read(5,*) n
      endif

      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

c     check for n > 0
      IF (N.GT.0) THEN

c        calculate the interval size
         h = 1.0d0 / n
         sum = 0.0d0
         do 20 i = myid + 1, n, numprocs
            x = h * (dble(i) - 0.5d0)
            sum = sum + f(x)
 20      continue
         mypi = h * sum

c        collect all the partial sums
         call MPI_REDUCE(mypi, pi, 1,
     +        MPI_DOUBLE_PRECISION, MPI_SUM, 0,
     +        MPI_COMM_WORLD, ierr)

c        process 0 prints the result
         if (myid .eq. 0) then
            write(6, 97) pi, abs(pi - PIX)
 97         format(' pi is approximately: ',
     +           F18.16, ' Error is: ', F18.16)
         endif

      ENDIF

      call MPI_FINALIZE(ierr)
      stop
      end
Summary

- MPI is a very rich instruction set
- User defined standard
- Multiple communication modes
- Can program a wide variety of problems using a handful of calls
Next lecture

- More advanced parts of the MPI-1 API
  - User defined data-types
  - Cartesian primitives for meshes