DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING
CHAPTER 6: MESSAGE PASSING PROGRAMMING AND MPI

Distributed Memory Programming Paradigm
Message passing programming is targeted at distributed memory machines (although it can be emulated on shared memory machines).
- A set of processors, each with its own memory.
- Communication between processors via a network.
- Each processor runs a copy of the same executable: Single Program, Multiple Data (SPMD).
  - Written in a conventional programming language.
  - With its own memory address space, aka 'local', 'private', 'non-global'.
  - Passing of data between processors is done using library subroutine calls.
SPMD is a special case of the Multiple Instruction, Multiple Data (MIMD) model. MIMD allows a different executable to be run on each processor, but the SPMD model is simpler, more portable, and hardly ever a limitation.

Message Passing Paradigm
- Data movement is specified explicitly by the programmer.
- Data is propagated via packets of messages.
- Communication types:
  - Point to point: the most basic, with one processor sending data and one processor receiving. The building block for all other communications.
  - Collective: involving more than two processors, e.g.:
    - One processor broadcasting data to other processors.
    - Synchronisation, or barrier.
    - Reduction operations, e.g. global sum.
- Collective communications should be faster than user implementations built from point to point (optimised implementations).
- Communications libraries allow the programmer to specify a virtual interconnect topology, which abstracts over the physical connectivity of the parallel computer.
- The programmer must ensure correctness of the code (every send must have a matching receive, avoidance of deadlock, etc.).

Message Passing Programs
In some cases the parallel program requires a completely different algorithm, so the code will look very different from the sequential code.
However, in many cases message passing codes use standard domain decomposition parallelisation and look very similar to sequential codes, except for:
- Some code to distribute arrays across processors.
- Some code to pass data between processors where necessary, e.g. exchange array values at the edges of domains, do global sums, etc.
The rest of the code is mostly the same, basically executing the same sequential algorithm, except that it acts only on the local subset of the array. In the latter case it is straightforward to use the same code on sequential and parallel machines, with preprocessing directives to select the sequential or parallel version. Alternatively, the code can be run sequentially by just using the parallel version on 1 processor, but there is an overhead.

Types of Message Passing Programs
- Embarrassingly Parallel
  - Virtually no communication required.
  - Easily load balanced (even dynamically).
  - Expect perfect speedup.
- Regular and Synchronous
  - Easily statically load balanced.
  - Expect good speedup for local communication.
  - Expect reasonable speedup for non-local communication.
- Irregular and/or Asynchronous
  - Difficult to load balance; often requires dynamic load balancing.
  - Communications overhead usually large.
  - Hard to get good speedup (but possible with some applications).
  - Usually cannot be done efficiently using data parallel programming; must use message passing to get the best performance.

Embarrassingly Parallel Problems
- Each element of an array can be processed independently of the others.
- No communication required, except to combine the final results.
- Static load balancing is usually trivial: any kind of distribution (e.g. scattered) can be used, since communication is not a factor.
- Dynamic load balancing can be done using a task farm approach:
  - A master (or controller) process sends parcels of work to each of a set of workers (processors).
  - Workers return results when they are done, and request more work when idle.
This approach gives automatic dynamic load balancing as long as:
1.
the problem can be split into lots of parcels of work;
2. the parcels are not too small, to minimise communications overhead;
3. the parcels are not too large, so that the other processors are not left waiting for one processor once they are done.
Expect perfect speedup.

Regular and Synchronous Problems
- Regular algorithm and data structures.
- Synchronous (or loosely synchronous) communications.
- Use standard domain decomposition. Usually requires: local communication (edges of domain); collective communication (combine final results); and (sometimes) non-local communication.
As long as the amount of data per processor is large enough that computation time is much greater than communication time, expect:
- Good speedup for local communication.
- Reasonable speedup for non-local communication.
For problems with local communication:
- Computation goes like the volume of the grid (or array) on each processor.
- Communication goes like the edge (perimeter or surface) of the grid on each processor.
- Comms overhead ~ Comm / Comp ~ 1 / grid length.
So expect reasonable speedup as long as the size of the array is much larger than the number of processors.

Irregular and/or Asynchronous Problems
- Irregular algorithm and/or data structures.
- Cannot be implemented efficiently except using message passing.
- Usually a high communications overhead.
- Complex asynchronous (and usually non-local) communication requires careful coding.
- May require careful load balancing, and dynamic repartitioning of data between processors.
- Difficult to get good speedup, but possible for some problems.
In some cases most of the problem is regular and only part of it is irregular, which makes things easier, but message passing is still needed instead of (or as well as) data parallel. HPF allows calls to extrinsic functions, which may be written in MPI, to support these mixed regular and irregular applications.

Message Passing Libraries
Long and noble history (since the early 1980s).
Main idea: abstraction from machine topology; routines to send, receive and broadcast messages between processors without having to manage the routing through the particular machine topology.
- The user sees a library for an existing language (e.g. C or Fortran), not (usually) extensions to the language.
- Speed, not portability, was historically the main criterion (but the two are not necessarily exclusive).
- N(implementations) ~ N(vendors, institutes):
  - Machine specific: NX, MPL, CMMD, CS-Tools.
  - Portable: CROS, Express, P4, Chameleon, CHIMP, PARMACS, PICL, PVM, SHMEM, TITCH, TINY, VERTEX, ZIPCODE.
- Many good ideas in them. Basic functionality is the same; the interfaces are similar but annoyingly different.
- PVM (Parallel Virtual Machine) had been the de facto standard in recent times, but focussed on distributed computing (more so after the introduction of MPI).

Message Passing Interface
- In 1994 a standardised Message Passing Interface (MPI) was established by an international committee (the MPI Forum), prompted by the success of the HPF Forum in creating a standard data parallel language.
- Interfaces to both C and Fortran; other languages could be supported (C++ bindings are provided in version 2, Java bindings are being investigated).
- Different languages have slightly different syntax for the interface, e.g. error values are returned as integer return codes in C, and as extra parameters in Fortran.
- MPI provides only a specification, which can be implemented on any parallel computer or workstation network. Some implementations of this specification are:
  - Versions for many different machines, e.g. MPICH, LAM, CHIMP, MPI-F.
  - Versions from vendors for specific machines, e.g. IBM SP2.
- The initial specification (MPI-1) did not handle some aspects such as parallel I/O and distributed computing support, but most of these issues are addressed in the latest version (MPI-2).

Reasons for Using MPI
- Standardization: MPI is the only message passing library which can be considered a standard.
It is supported on virtually all HPC platforms and has practically replaced all previous message passing libraries.
- Portability: there is no need to modify your source code when you port your application to a different platform that supports (and is compliant with) the MPI standard.
- Performance opportunities: vendor implementations should be able to exploit native hardware features to optimize performance. For more information about MPI performance see the MPI Performance Topics tutorial.
- Functionality: over 115 routines are defined in MPI-1 alone.
- Availability: a variety of implementations are available, both vendor and public domain.

Programming Model
MPI lends itself to virtually any distributed memory parallel programming model. In addition, MPI is commonly used to implement (behind the scenes) some shared memory models, such as Data Parallel, on distributed memory architectures.
Hardware platforms:
- Distributed memory: originally, MPI was targeted at distributed memory systems.
- Shared memory: as shared memory systems became more popular, particularly SMP / NUMA architectures, MPI implementations for these platforms appeared.
- Hybrid: MPI is now used on just about any common parallel architecture, including massively parallel machines, SMP clusters, workstation clusters and heterogeneous networks.
All parallelism is explicit: the programmer is responsible for correctly identifying parallelism and implementing parallel algorithms using MPI constructs. The number of tasks dedicated to running a parallel program is static; new tasks cannot be dynamically spawned during run time (MPI-2 addresses this issue).

MPI Preliminaries
MPI provides support for:
- Point-to-point message passing.
- Collective communications.
- Communicators to group messages and processes.
- Inquiry routines to query the environment.
- Constants and datatypes.
Typographical conventions:
- All MPI identifiers are prefixed by 'MPI_'. C routine names may contain lower case, as in 'MPI_Init'.
- Fortran routines SHOUT AT YOU IN UPPER CASE, as in 'MPI_INIT'.
- MPI constants are all in upper case, e.g. 'MPI_FLOAT' is an MPI C datatype.
- C routines are actually integer functions which return a status code. Fortran routines really are subroutines, and return a status code as an argument. You are strongly advised to always check these status codes.
The number of processors used is specified on the command line, when running the MPI loader that loads the MPI program onto the processors, to avoid hardcoding it into the program.

MPI Routines
MPI has 129 different routines, but most programs can be written using about a dozen simple routines:
- #include "mpi.h" in C; include 'mpif.h' in Fortran.
- MPI_INIT: initialize the MPI computation.
- MPI_FINALIZE: terminate the MPI computation.
- MPI_COMM_SIZE: determine the number of processes.
- MPI_COMM_RANK: determine the process number.
- MPI_SEND: send a message.
- MPI_RECV: receive a message.
- MPI_BARRIER: synchronize.
- MPI_BCAST: broadcast the same data to all procs.
- MPI_GATHER: get data from all procs.
- MPI_SCATTER: send different data to all procs.
- MPI_REDUCE: combine data from all procs onto a single proc.
- MPI_ALLREDUCE: combine data from all procs onto all procs.
Reduction operations include +, *, AND, OR, min, max.
This is a general overview of these routines; check the MPI specification, MPI programming books and courses, or the manuals for the implementation you are using, for the exact usage of each routine.

Point to Point Communications
Communication between two processors: one to send, and one to receive the data.
Information required to specify the message:
- Identification of the sender processor.
- Identification of the destination processor.
- Type of the data, e.g. integers, floats, characters.
- Number of data elements to send (a good idea if sending arrays).
- Where the data to send lives (usually a pointer).
- Where the received data should be stored (again a pointer).
Two different communication operations:
- Blocking: the sender and receiver wait until the communication is complete.
- Non-blocking: send and receive are handled 'offline'; the processor continues on.
In addition, four send modes:
- Synchronous: waits for the recipient to complete.
- Buffered: copies the message data into a buffer.
- Standard: MPI decides whether the message should be buffered.
- Ready: a send is only started when a matching receive has been initiated.

Collective Communications
Key concept: Groups.
- Groups are sets of processors that communicate with each other in a certain way.
- Communications can take place amongst Group members without cross-talk.
- Rather like definable datatypes, Groups allow a better mapping between the language and the problem. They also allow construction of libraries that are guaranteed not to interfere with communications within the user program, since all their communications are confined to a particular Group.
- MPI implements Groups using data objects called Communicators.
- MPI_COMM_SIZE gives the number of processes (or processors, if there is one process per processor) in a communicator.
- MPI defines a special Communicator, MPI_COMM_WORLD, for the Group of all processes (or processors). Subgroups are sub-divided from the world Group.
- Each Group member is identified by a number, its Rank, enumerated 0, ..., (number in Group - 1).

Header File
Required for all programs/routines which make MPI library calls.
- C include file: #include "mpi.h"
- Fortran include file: include 'mpif.h'

Format of MPI Calls (C Binding)
Format: rc = MPI_Xxxxx(parameter, ... )
Example: rc = MPI_Bsend(&buf,count,type,dest,tag,comm)
The error code is returned as "rc": MPI_SUCCESS if successful.

General MPI Program Structure

Communicators and Groups
MPI uses objects called communicators and groups to define which collection of processes may communicate with each other. Most MPI routines require you to specify a communicator as an argument. Use MPI_COMM_WORLD whenever a communicator is required: it is the predefined communicator that includes all of your MPI processes.
Rank
Within a communicator, every process has its own unique integer identifier, assigned by the system when the process initializes. A rank is sometimes also called a "task ID". Ranks are contiguous and begin at zero. They are used by the programmer to specify the source and destination of messages, and are often used conditionally by the application to control program execution (if rank == 0 do this / if rank == 1 do that).

MPI_Init
Initializes the MPI execution environment. This function must be called in every MPI program, must be called before any other MPI function, and must be called only once in an MPI program. For C programs, MPI_Init may be used to pass the command line arguments to all processes, although this is not required by the standard and is implementation dependent.
MPI_Init (&argc,&argv)
MPI_INIT (ierr)

MPI_Comm_size
Determines the number of processes in the group associated with a communicator. Generally used within the communicator MPI_COMM_WORLD to determine the number of processes being used by your application.
MPI_Comm_size (comm,&size)
MPI_COMM_SIZE (comm,size,ierr)

MPI_Comm_rank
Determines the rank of the calling process within the communicator. Initially, each process will be assigned a unique integer rank between 0 and (number of processes - 1) within the communicator MPI_COMM_WORLD. This rank is often referred to as a task ID. If a process becomes associated with other communicators, it will have a unique rank within each of these as well.
MPI_Comm_rank (comm,&rank)
MPI_COMM_RANK (comm,rank,ierr)

MPI_Abort
Terminates all MPI processes associated with the communicator. In most MPI implementations it terminates ALL processes regardless of the communicator specified.
MPI_Abort (comm,errorcode)
MPI_ABORT (comm,errorcode,ierr)

MPI_Get_processor_name
Returns the processor name, and the length of the name. The buffer for "name" must be at least MPI_MAX_PROCESSOR_NAME characters in size. What is returned into "name" is implementation dependent, and may not be the same as the output of the "hostname" or "host" shell commands.
MPI_Get_processor_name (&name,&resultlength)
MPI_GET_PROCESSOR_NAME (name,resultlength,ierr)

MPI_Initialized
Indicates whether MPI_Init has been called; returns flag as either logical true (1) or false (0). MPI requires that MPI_Init be called once and only once by each process. This may pose a problem for modules that want to use MPI and are prepared to call MPI_Init if necessary; MPI_Initialized solves this problem.
MPI_Initialized (&flag)
MPI_INITIALIZED (flag,ierr)

MPI_Wtime
Returns the elapsed wall clock time in seconds (double precision) on the calling processor.
MPI_Wtime ()
MPI_WTIME ()

MPI_Wtick
Returns the resolution in seconds (double precision) of MPI_Wtime.
MPI_Wtick ()
MPI_WTICK ()

MPI_Finalize
Terminates the MPI execution environment. This function should be the last MPI routine called in every MPI program; no other MPI routines may be called after it.
MPI_Finalize ()
MPI_FINALIZE (ierr)

Hello World in MPI

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int ierror, rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        printf("Hello World!\n");
    }
    ierror = MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (ierror != MPI_SUCCESS) {
        MPI_Abort(MPI_COMM_WORLD, ierror);
    }
    printf("I am %d out of %d.\n", rank, size);
    MPI_Finalize();
    return 0;
}

You must call the initialize and finalize routines. argc and argv are required only for the C language.
Note the difference between a single output from the program (Hello World!) and output from each of the processors.

Sending a Message: blocking, standard mode send

int ierror, message[message_len], dest_rank, tag = 1;
...
ierror = MPI_Send(message, message_len, MPI_INT, dest_rank, tag, MPI_COMM_WORLD);

- The message to send.
- Message length.
- Type of each data element.
- Rank of the destination processor.
- Tag, for filtering out unwanted messages.
- The communicator for the sender and receiver group.

Every SEND must have a corresponding RECEIVE (and vice versa), or else there will be deadlock: processors will wait forever to send (or receive) messages that never get received (or sent). Deadlock can also occur if message buffers overflow; the MPI implementation should handle this.

Receiving a Message: blocking receive

int ierror, store[store_len], source_rank, tag = 1;
MPI_Status status;
...
ierror = MPI_Recv(store, store_len, MPI_INT, source_rank, tag, MPI_COMM_WORLD, &status);

- Where to store the message, and its size.
- Type of each data element.
- Rank of the source processor.
- The same tag as used by the sender processor.
- The communicator for the sender and receiver group.
- Data structure of the received message (provides the message length, so it can be compared to what is expected).

If communication is regular, e.g. each processor sends a message to its right while receiving one from its left (a shift), the separate SEND and RECV calls can be combined into a single SENDRECV (send and receive) call.

Global Summation: global sum, using a reduction function

float part_sum = 1.0, global_sum;
int ierror, root_rank = 0;
...
ierror = MPI_Reduce(&part_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, root_rank, MPI_COMM_WORLD);

- Partial and global sum variables.
- Number of elements in the partial sum (= 1).
- Datatype of the element to sum.
- Reduction operation.
- Rank (processor) where results will be deposited.
- Communicator for the participating processors.

MPI_REDUCE places the result on the specified processor; MPI_ALLREDUCE places the result on all processors.
Other reduction operations include MPI_MAX, MPI_MIN, MPI_LAND (logical AND), MPI_LXOR (logical exclusive OR), etc. Also useful are MPI_MINLOC and MPI_MAXLOC, which return the location of the minimum or maximum values in the data structure.