Computational Methods
in Astrophysics
Dr Rob Thacker (AT319E)
thacker@ap
Today's Lecture

Distributed Memory Computing I

Key concepts:
- Differences between shared & distributed memory
- Message passing
- A few network details

General comment: the overall computing model has not changed in decades, but the APIs have…
API Evolution

- From the 80s through to the early 2000s much HPC evolution was driven by the math & physics communities
  - Notable focus on regular arrays and data structures
  - Big forums, working on standards etc.
- Starting in the 2000s, the growth of data analytics and computational biology introduced different requirements
  - C++, Java or Python
  - Irregular data, able to start designs from scratch
Shared vs distributed memory

- The key difference is data decomposition
  - Commonly called "domain decomposition"
- Numerous possible ways to break up the data space
  - Each has different compromises in terms of the required communication patterns that result
  - The comms pattern determines the overall complexity of the parallel code
- The decomposition can be handled in implicit or explicit ways
Parallel APIs from the decomposition-communication perspective

[Figure: APIs placed on two axes, decomposition (implicit vs explicit) and communication (implicit vs explicit). The message passing APIs (MPI, PVM) and SHMEM require explicit communication; CAF, UPC, HPF and OpenMP make the communication implicit, with OpenMP also leaving the decomposition implicit and working on shared memory only. APIs with explicit decomposition operate effectively on distributed memory architectures.]
Message Passing

- The concept of sequential processes communicating via messages was developed by Hoare in the 70s
  - Hoare, CAR, Comm ACM, 21, 666 (1978)
- Each process has its own local memory store
- Remote data needs are served by passing messages containing the desired data
- Naturally carries over to distributed memory architectures
- Two ways of expressing message passing:
  - Coordination of message passing at the language level (e.g. Occam)
  - Calls to a message passing library
Two types of message passing

- Point-to-point (one-to-one)
- Broadcast (one-to-all, all-to-all)

Broadcast versus point-to-point

[Figure: a broadcast (one-to-all) is a collective operation involving a group of processes (Process 1 sends to Processes 2, 3 and 4); a point-to-point (one-to-one) message is a non-collective operation involving a pair of processes.]
Message passing APIs

- Message passing APIs dominate
  - Often reflect underlying hardware design
  - Legacy codes can frequently be converted more easily
  - Allow explicit management of the memory hierarchy
- Message Passing Interface (MPI) is the predominant API
- Parallel Virtual Machine (PVM) is an earlier API that possesses some useful features over MPI
  - Useful paradigm for heterogeneous systems; there's even a Python version
  - http://www.csm.ornl.gov/pvm/
PVM – An overview

- API can be traced back to 1989(!)
  - Geist & Sunderam developed the experimental version
- Daemon based
  - Each host runs a daemon that controls resources
  - Processes can be dynamically created and destroyed
- Each user may actively configure their host environment
  - PVM console
- Process groups for domain decomposition
  - PVM group server controls this aspect
- Limited number of collective operations
  - Barriers, broadcast, reduction
- Roughly 40 functions in the API
PVM API and programming model

- PVM most naturally fits a master-worker model
  - Master process responsible for I/O
  - Workers are spawned by the master
  - Each process has a unique identifier
- Messages are typed and tagged
  - System is aware of the data-type, allowing easy portability across a heterogeneous network
- Messages are passed via a three phase process:
  - Clear (initialize) buffer
  - Pack buffer
  - Send buffer
Example code

   tid = pvm_mytid();
   if (tid == source) {                      /* Sender */
      bufid = pvm_initsend(PvmDataDefault);
      info  = pvm_pkint(&i1, 1, 1);
      info  = pvm_pkfloat(vec1, 2, 1);
      info  = pvm_send(dest, tag);
   }
   else if (tid == dest) {                   /* Receiver */
      bufid = pvm_recv(source, tag);
      info  = pvm_upkint(&i2, 1, 1);
      info  = pvm_upkfloat(vec2, 2, 1);
   }
MPI – An overview

- API can be traced back to 1992
  - First unofficial meeting of the MPI Forum at Supercomputing 92
- Mechanism for creating processes is not specified within the API
  - Different mechanism on different platforms
  - MPI 1.x standard does not allow for creating or destroying processes
- Process groups central to the parallel model
  - 'Communicators'
- Richer set of collective operations than PVM
- Derived data-types an important advance
  - Can specify a data-type to control the pack-unpack step implicitly
- 125 functions in the API (v1.0)
MPI API and programming model

- More naturally a true SPMD type programming model
  - Oriented toward HPC applications
  - Master-worker model can still be implemented effectively
- As for PVM, each process has a unique identifier
- Messages are typed, tagged and flagged with a communicator
- Messaging can be a single stage operation (see the sketch below)
  - Can send specific variables without the need for packing
  - Packing is still an option
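As a point of comparison with the three-phase PVM example earlier, here is a minimal sketch (not from the lecture) of what single-stage messaging looks like in C: the (buffer, count, datatype) triple replaces PVM's clear/pack/send sequence. The helper names and arguments are illustrative only.

   #include <mpi.h>

   /* Illustrative sketch: a typed array is sent in one call, no packing step. */
   void send_vector(double *vec, int n, int dest, int tag)
   {
       MPI_Send(vec, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
   }

   void recv_vector(double *vec, int n, int source, int tag)
   {
       MPI_Status status;
       MPI_Recv(vec, n, MPI_DOUBLE, source, tag, MPI_COMM_WORLD, &status);
   }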
Remote Direct Memory Access

- Message passing involves a number of expensive operations:
  - CPUs must be involved (possibly the OS kernel too)
  - Buffers are often required
- RDMA cuts down on the CPU overhead
  - CPU sets up channels for the DMA engine to write directly to the buffer and avoid constantly taxing the CPU
  - Frequently discussed under the "zero-copy" euphemism
- Message passing APIs have been designed around this concept (but usually called remote memory access)
  - Cray SHMEM
RDMA illustrated

[Figure: Host A and Host B each have a CPU, a memory/buffer, and a NIC with an RDMA engine; data moves directly between the memory buffers via the NICs, bypassing the CPUs.]
Networking issues

- Networks have played a profound role in the evolution of parallel APIs
- Examining network fundamentals in more detail
  - Provides a better understanding of programming issues
  - Explains reasons for library design (especially RDMA)
OSI network model

- Grew out of a 1982 attempt by ISO to develop Open Systems Interconnect (too many vendor proprietary protocols at that time)
  - Motivated from a theoretical rather than practical standpoint
- System of layers taken together = protocol stack
  - Each layer communicates with its peer layer on the remote host
- Proposed stack was too complex and had too much freedom: not adopted
  - e.g. the X.400 email standard required several books of definitions
- Simplified Internet TCP/IP protocol stack eventually grew out of the OSI model
  - e.g. the SMTP email standard takes a few pages
Conceptual structure of OSI network

- Layer 7. Application (http, ftp, …)
- Layer 6. Presentation (data std)
- Layer 5. Session (application)
- Layer 4. Transport (TCP, UDP, ...)
- Layer 3. Network (IP, …)
- Layer 2. Data link (Ethernet, …)
- Layer 1. Physical (signal)

[In the original figure the upper-level layers are grouped as handling data transfer and the lower-level layers as handling routing.]
Internet Protocol Suite

- Protocol stack on which the internet runs
  - Occasionally called the TCP/IP protocol stack
- Doesn't map perfectly to the OSI model
  - OSI model lacks richness at lower levels
  - Motivated by engineering rather than concepts
- Higher levels of the OSI model were mapped into a single application layer
- Expanded some layering concepts within the OSI model (e.g. internetworking was added to the network layer)
Internet Protocol Suite

- "Layer 7" Application: e.g. FTP, HTTP, DNS
- Layer 4. Transport: e.g. TCP, UDP, RTP, SCTP
- Layer 3. Network: IP
- Layer 2. Data link: e.g. Ethernet, token ring
- Layer 1. Physical: e.g. T1, E1
Internet Protocol (IP)

- Data-oriented protocol used by hosts for communicating data across a packet-switched internetwork
- Addressing and routing are handled at this level
  - IP sends and receives data between two IP addresses
  - Data segment = packet (or datagram)
- Packet delivery is unreliable – packets may arrive corrupted, duplicated, out of order, or not at all
  - Lack of delivery guarantees allows fast switching
IP Addressing

- On an ethernet network, routing at the data link layer occurs between 6 byte MAC (Media Access Control) addresses
- IP adds its own configurable address scheme on top of this
  - 4 byte address, expressed as 4 decimals in the range 0-255
  - Note 0 and 255 are both reserved numbers
  - The division of the numbers determines the network number versus the node; subnet masks determine how they are divided (see the sketch below)
- Classes of networks are described by the first number in the IP address and the number of network addresses
  - [192:255].35.91.* = class C network (254 hosts), subnet mask 255.255.255.0
  - [128:191].132.*.* = class B network (65,534 hosts), subnet mask 255.255.0.0
  - [1:126].*.*.* = class A network (16 million hosts), subnet mask 255.0.0.0
  - Note the 35.91 in the class C example and the 132 in the class B example can be different, but are filled in to show how the network address is defined
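An illustrative sketch (not from the lecture) of how a subnet mask splits a 4-byte address into network and node parts, using the class C example above; the host value 17 is made up for the example.

   #include <stdio.h>
   #include <stdint.h>

   int main(void)
   {
       uint32_t addr = (192u << 24) | (35u << 16) | (91u << 8) | 17u; /* 192.35.91.17 */
       uint32_t mask = 0xFFFFFF00u;                                   /* 255.255.255.0 */

       uint32_t network = addr & mask;    /* 192.35.91.0 : the network number */
       uint32_t host    = addr & ~mask;   /* 17          : the node on that network */

       printf("network = %u.%u.%u.%u, host = %u\n",
              (network >> 24) & 0xFF, (network >> 16) & 0xFF,
              (network >> 8) & 0xFF,  network & 0xFF, host);
       return 0;
   }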
Transmission Control Protocol (TCP)

- TCP is responsible for division of the application's data stream, error correction and opening the channel (port) between applications
- Applications send a byte stream to TCP
- TCP divides the byte stream into appropriately sized segments (set by the MTU* of the IP layer)
- Each segment is given two sequence numbers to enable the byte stream to be reconstructed
- Each segment also has a checksum to ensure correct packet delivery
- Segments are passed to the IP layer for delivery

*maximum transfer unit
"Hi, I'd like to hear a TCP joke."
"Hello, would you like to hear a TCP joke?"
"Yes, I'd like to hear a TCP joke."
"OK, I'll tell you a TCP joke."
"Ok, I will hear a TCP joke."
"Are you ready to hear a TCP joke?"
"Yes, I am ready to hear a TCP joke."
"Ok, I am about to send the TCP joke. It will last 10 seconds, it has
two characters, it does not have a setting, it ends with a punchline."
"Ok, I am ready to get your TCP joke that will last 10 seconds, has
two characters, does not have an explicit setting, and ends with a
punchline."
"I'm sorry, your connection has timed out. Hello, would you like to
hear a TCP joke?"
Humour:
TCP joke
UDP: Alternative to TCP

- UDP = User Datagram Protocol
- Only adds a checksum and multiplexing capability – the limited functionality allows a streamlined implementation: faster than TCP
- No confirmation of delivery
- Unreliable protocol: if you need reliability you must build on top of this layer
- Suitable for real-time applications where error correction is irrelevant (e.g. streaming media, voice over IP)
- DNS and DHCP both use UDP
Encapsulation of layers

[Figure: the application data is wrapped with a TCP header at the transport layer, then an IP header at the network layer, then an Ethernet header at the data link layer, each layer encapsulating the one above it.]
Link Layer

- For high performance clusters the link layer frequently determines the networking above it
- All high performance interconnects emulate IP
- Each data link thus brings its own networking layer with it
Overview of interconnect fabrics

- Broadly speaking, interconnects break down into two camps: commodity vs specialist
  - Commodity: gigabit ethernet (cost < $50 per port)
  - Specialist: everything else (cost > $200 per port)
- Specialist interconnects primarily provide two features over gigabit:
  - Higher bandwidth
  - Lower message latency
10 Gigabit Ethernet

- Expected to become commodity any year now (estimates still in the range of $1000 per port)
- A lot of the early implementations were from companies with HPC backgrounds, e.g. Myrinet, Mellanox
- The problem has always been finding a technological driver outside HPC – few people need a GB/s out of their desktop
Infiniband

- The Infiniband (open) standard is designed to cover many arenas, from database servers to HPC
  - 237 systems in the Top500 (Nov. 2015)
  - Has essentially become commoditized
- Serial bus; bandwidth can be added by adding more channels ("lanes") and increasing the channel speed
  - 14 Gb/s per-lane data rate option now available
  - 56 Gb/s (14*4) ports now common, higher available
  - Necessary for "fat nodes" with lots of cores
- 600 Gb/s projected for 2017
History of MPI

- Many different message passing standards circa 1992
  - Most designed for high performance distributed memory systems
- Following SC92 the MPI Forum was started
  - Open participation encouraged (e.g. the PVM working group was asked for input)
  - Goal was to produce as portable an interface as possible
  - Vendors included but not given control – specific hardware optimizations were avoided
  - Web address: http://www.mpi-forum.org
- MPI-1 standard released 1994
- Forum reconvened in 1995-97 to define MPI-2
  - Fully functional MPI-2 implementations did not appear until 2002 though
- Reference guide is available for download
  - http://www.netlib.org/utk/papers/mpi-book/mpi-book.ps
C vs FORTRAN interface

- As much effort as possible was extended to keep the interfaces similar
- Only significant difference is that C functions return their value as the error code (see the sketch below)
  - FORTRAN versions pass a separate argument
- Arguments to C functions may be more strongly typed than the FORTRAN equivalents
  - FORTRAN interface relies upon integers
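A small sketch (not from the lecture) of the difference just described; the helper name and arguments are illustrative.

   #include <mpi.h>

   /* In C the call returns the error code directly. The Fortran binding
      instead returns it through a trailing integer argument:
         call MPI_SEND(buf, count, MPI_DOUBLE_PRECISION, dest, 0,
        +              MPI_COMM_WORLD, ierr)                           */
   void checked_send(double *buf, int count, int dest)
   {
       int err = MPI_Send(buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
       if (err != MPI_SUCCESS)
           MPI_Abort(MPI_COMM_WORLD, err);
   }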
MPI Communication model

- Messages are typed and tagged
- Don't need to explicitly define a buffer
  - Specify the start point of a message using a memory address
  - Packing interface available if necessary (MPI_PACK datatype); the interface is provided if you want to use it
- Communicators (process groups) are a vital component of the MPI standard
  - Specifying a destination process must include the specific process group
- Messages must therefore specify:
  (address, count, datatype, destination, tag, communicator)
  - The (address, count, datatype) triple defines the message data; the remaining variables define the message envelope
MPI-2

- Significant advance over the 1.2 standard
- Defines a remote memory access (RMA) interface
  - Two modes of operation
  - Active target: all processes participate in a single communication phase (although point-to-point messaging is allowed)
  - Passive target: individual processes participate in point-to-point messaging
- Parallel I/O
- Dynamic process management (MPI_SPAWN)
Missing pieces

- MPI-1 did not specify how processes start
  - PVM defined its own console
  - Start-up is done using a vendor/open source supplied package
  - MPI-2 defines mpiexec – a standardized start-up routine
- Standard buffer interface is implementation specific
- Process groups are static – they can only be created or destroyed, not altered
- No mechanism for obtaining details about the hosts involved in the computation
Getting started: enrolling & exiting from the MPI environment

- Every program must initialize by executing MPI_INIT(ierr) or int MPI_Init(int *argc, char ***argv)
  - argc, argv are historical hangovers in the C version and may be set to NULL
  - Default communicator is MPI_COMM_WORLD
- Determine the process id by calling MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  - Note PVM essentially puts enrollment and id resolution into one call
- Determine the total number of processes via MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
- To exit, processes must call MPI_FINALIZE(ierr)
Minimal MPI program

C version:

   #include "mpi.h"
   #include <stdio.h>

   int main( int argc, char *argv[] )
   {
       int myid;
       MPI_Init( &argc, &argv );
       MPI_Comm_rank(MPI_COMM_WORLD, &myid);
       printf( "Hello, world from %d\n", myid);
       MPI_Finalize();
       return 0;
   }

Fortran version:

         program main
         include 'mpif.h'
         integer ierr, myid
         call MPI_INIT( ierr )
         call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
         print *, 'Hello, world from ', myid
         call MPI_FINALIZE( ierr )
         end

Normally execute by: mpirun -np 4 my_program

Output:
   Hello, world from 2
   Hello, world from 1
   Hello, world from 0
   Hello, world from 3
Compiling MPI codes

- Some implementations (e.g. MPICH) define additional wrappers for the compiler:
  - mpif77, mpif90 for F77, F90
  - mpicc, mpicxx for C/C++
- Code is then compiled using mpif90 (e.g.) rather than f90; libraries are linked in automatically
- Usually the best policy when machine specific libraries are required
- Linking can always be done by hand
What needs to go in a message?

Things that need specifying:
- How will "data" be described? – specification
- How will processes be identified? – where?
- How will the receiver recognize/screen messages? – tagging
- What will it mean for these operations to complete? – confirmed completion
MPI Basic (Blocking) Send

MPI_SEND (start, count, datatype, dest, tag, comm)

- The message buffer is described by (start, count, datatype).
- The target process is specified by dest, which is the rank of the target process in the communicator specified by comm.
- When this function returns, the data has been delivered to the system and the buffer can be reused. The message may not have been received by the target process.

From Bill Gropp's slides

A short example of a matched send/receive pair follows.
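A minimal sketch (not from the original slides) of a matched blocking send/receive using the envelope described above; it assumes at least two processes, and the value and tag are illustrative.

   #include <mpi.h>
   #include <stdio.h>

   int main(int argc, char *argv[])
   {
       int rank, value = 0;
       MPI_Status status;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       if (rank == 0) {
           value = 42;
           MPI_Send(&value, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
       } else if (rank == 1) {
           MPI_Recv(&value, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
           printf("rank 1 received %d\n", value);
       }

       MPI_Finalize();
       return 0;
   }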
Subtleties of point-to-point messaging

Case 1:
   Process A: MPI_Send(B); MPI_Recv(B)
   Process B: MPI_Send(A); MPI_Recv(A)
This kind of communication is `unsafe'. Whether it works correctly is dependent upon whether the system has enough buffer space.

Case 2:
   Process A: MPI_Recv(B); MPI_Send(B)
   Process B: MPI_Recv(A); MPI_Send(A)
This code leads to a deadlock, since MPI_Recv blocks execution until it is completed.

Case 3:
   Process A: MPI_Send(B); MPI_Recv(B)
   Process B: MPI_Recv(A); MPI_Send(A)
You should always try to write communication patterns like this: a send is matched by a recv.
Buffered Mode communication

- Buffered sends avoid the issue of whether enough internal buffering is available
  - The programmer explicitly defines buffer space sufficient to allow all messages to be sent
  - MPI_Bsend has the same semantics as MPI_Send
  - MPI_Buffer_attach(buffer,size,ierr) must be called to define the buffer space (see the sketch below)
- Frequently better to rely on non-blocking communication though
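A minimal sketch (not from the lecture) of attaching a user buffer before a buffered send; the helper name and sizing with MPI_Pack_size/MPI_BSEND_OVERHEAD follow common practice but are illustrative here.

   #include <mpi.h>
   #include <stdlib.h>

   void buffered_send(double *data, int count, int dest)
   {
       int size;
       /* space for one message plus the per-message overhead MPI requires */
       MPI_Pack_size(count, MPI_DOUBLE, MPI_COMM_WORLD, &size);
       size += MPI_BSEND_OVERHEAD;

       void *buf = malloc(size);
       MPI_Buffer_attach(buf, size);

       /* completes locally, using the attached buffer if needed */
       MPI_Bsend(data, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);

       /* detach waits until buffered messages have been delivered to the system */
       MPI_Buffer_detach(&buf, &size);
       free(buf);
   }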
Non-blocking communication

- Helps alleviate two issues:
  1. Blocking communication can potentially starve a process for data while it could be doing useful work
  2. Problems related to buffering are circumvented, since the user must explicitly ensure the buffer is available
- MPI_Isend adds a handle to the subroutine call which is later used to determine whether the operation has succeeded
- MPI_Irecv is the matching non-blocking receive operation
- MPI_Test can be used to detect whether the send/receive has completed
  - The handle is used to identify which particular message
- MPI_Wait is used to wait for an operation to complete
- MPI_Waitall is used to wait for a series of operations to complete
  - An array of handles is used
(See the sketch after this list.)
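A minimal sketch (not from the slides) of the non-blocking calls listed above; the helper name and arguments are illustrative, and it assumes two ranks exchanging one buffer each.

   #include <mpi.h>

   void exchange_nonblocking(double *sendbuf, double *recvbuf, int n, int partner)
   {
       MPI_Request reqs[2];
       MPI_Status  stats[2];

       /* post both operations, then return immediately */
       MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
       MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);

       /* ... overlap with useful computation here ... */

       /* both handles must complete before sendbuf is reused or recvbuf is read */
       MPI_Waitall(2, reqs, stats);
   }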
Solutions to deadlocking

- If sends and receives need to be matched, use MPI_Sendrecv:
   Process A: MPI_Sendrecv(B)
   Process B: MPI_Sendrecv(A)
- Non-blocking versions, Isend and Irecv, will prevent deadlocks:
   Process A: MPI_Isend(B); MPI_Irecv(B); MPI_Waitall
   Process B: MPI_Isend(A); MPI_Irecv(A); MPI_Waitall
- Advice: use buffered mode sends (Ibsend) so you know for sure that buffer space is available
(See the sketch after this list.)
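A minimal sketch (not from the slides) of the MPI_Sendrecv pattern: the matched send and receive are issued as one call, so neither neighbour can deadlock the other. The helper name and arguments are illustrative.

   #include <mpi.h>

   /* each rank exchanges n doubles with a partner rank */
   void exchange_sendrecv(double *sendbuf, double *recvbuf, int n, int partner)
   {
       MPI_Status status;
       MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, partner, 0,
                    recvbuf, n, MPI_DOUBLE, partner, 0,
                    MPI_COMM_WORLD, &status);
   }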
Other sending modes

- Synchronous send (MPI_Ssend)
  - Only returns when the receiver has started receiving the message
  - On return indicates that the send buffer can be reused, and also that the receiver has started processing the message
  - Non-local communication mode: dependent upon the speed of remote processing
- (Receiver) Ready send (MPI_Rsend)
  - Used to eliminate unnecessary handshaking on some systems
  - If posted before the receiver is ready then the outcome is undefined (dangerous!)
  - Semantically, Rsend can be replaced by a standard send
Collective Operations

- Collectives apply to all processes within a given communicator
- Three main categories:
  - Data movement (e.g. broadcast)
  - Synchronization (e.g. barrier)
  - Global reduction operations
- All processes must have a matching call
- Size of data sent must match size of data received
- Unless specifically a synchronization function, these routines do not imply synchronization
- Blocking mode only – but unaware of the status of remote operations
- No tags are necessary
Collective Data Movement

- Types of data movement:
  - Broadcast (one to all, or all to all)
  - Gather (collect to a single process)
  - Scatter (send from one processor to all)

MPI_Bcast(buff,count,datatype,root,comm,ierr)

[Figure: MPI_Bcast copies the data item A0 held by the root processor into the buffer of every processor.]
Gather/scatter

- MPI_Gather(sendbuf,sendcount,sendtype,recvbuf,recvcount,recvtype,root,comm,ierr)
  - MPI_Scatter has the same semantics
  - Note MPI_Allgather removes the root argument and all processes receive the result
  - Think of it as a gather followed by a broadcast
- MPI_Alltoall(sendbuf,sendcount,sendtype,recvbuf,recvcount,recvtype,comm,ierr)
  - Each process sends a set of distinct data elements to the others – useful for transposing a matrix

[Figure: MPI_Scatter distributes the elements A0, A1, A2, A3 held by the root across the processors, one element each; MPI_Gather is the inverse operation.]

A short sketch of the scatter/gather pair follows.
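A minimal sketch (not from the slides) of the scatter/gather pair: the root distributes equal chunks, each rank works on its chunk, and the root collects the results. The helper name, chunk handling and workflow are illustrative.

   #include <mpi.h>

   void scatter_work_gather(double *full, double *chunkbuf, int chunk,
                            int root, MPI_Comm comm)
   {
       /* full is only significant on the root process */
       MPI_Scatter(full, chunk, MPI_DOUBLE,
                   chunkbuf, chunk, MPI_DOUBLE, root, comm);

       /* ... each process operates on its own chunkbuf here ... */

       MPI_Gather(chunkbuf, chunk, MPI_DOUBLE,
                  full, chunk, MPI_DOUBLE, root, comm);
   }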
Global Reduction Operations

- Plenty of operations covered:

  Name of Operation   Action
  MPI_MAX             maximum
  MPI_MIN             minimum
  MPI_SUM             sum
  MPI_PROD            product
  MPI_LAND            logical and
  MPI_BAND            bit-wise and
  MPI_LOR             logical or
  MPI_BOR             bit-wise or
  MPI_LXOR            logical xor
  MPI_BXOR            bit-wise xor
  MPI_MAXLOC          maximum value and location
  MPI_MINLOC          minimum value and location
Reductions

- MPI_REDUCE(sendbuf,recvbuf,count,datatype,op,root,comm,ierr)
  - Result is stored in the root process
  - All members must call MPI_Reduce with the same root, op, and count
- MPI_Allreduce(sendbuf,recvbuf,count,datatype,op,comm,ierr)
  - All members of the group receive the answer (see the sketch below)
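A minimal sketch (not from the slides) of a global sum with MPI_Allreduce; the helper name is illustrative.

   #include <mpi.h>

   /* every rank contributes a local value; every rank receives the total
      (with MPI_Reduce only the root would hold the result) */
   double global_sum(double local)
   {
       double total;
       MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
       return total;
   }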
Example using collectives

- Numerically integrate
  ∫_0^1 4/(1 + x^2) dx = 4(arctan(1) − arctan(0)) = π
- Parallel algorithm: break up the integration region and sum separately over processors
  - Combine all values at the end
- Very little communication required
  - Number of pieces
  - Return of the values calculated
Example using broadcast and reduce

c     compute pi by integrating f(x) = 4/(1 + x**2)
c
c     each process:
c       - receives the # of intervals used in the apprxn
c       - calculates the areas of its rectangles
c       - synchronizes for a global summation
c     process 0 prints the result and the time it took

      program main
      include 'mpif.h'

      double precision PIX
      parameter (PIX = 4*atan(1.0))
      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr

c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

      print *, "Process ", myid, " of ", numprocs, " is alive"

      if (myid .eq. 0) then
         print *, "Enter the number of intervals: (0 to quit)"
         read(5,*) n
      endif

      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

c     check for n > 0
      IF (N.GT.0) THEN

c        calculate the interval size
         h = 1.0d0 / n
         sum = 0.0d0
         do 20 i = myid + 1, n, numprocs
            x = h * (dble(i) - 0.5d0)
            sum = sum + f(x)
 20      continue
         mypi = h * sum

c        collect all the partial sums
         call MPI_REDUCE(mypi, pi, 1,
     +        MPI_DOUBLE_PRECISION, MPI_SUM, 0,
     +        MPI_COMM_WORLD, ierr)

c        process 0 prints the result
         if (myid .eq. 0) then
            write(6, 97) pi, abs(pi - PIX)
 97         format(' pi is approximately: ',
     +           F18.16, ' Error is: ', F18.16)
         endif

      ENDIF

      call MPI_FINALIZE(ierr)
      stop
      end
Summary

- MPI is a very rich instruction set
- User defined standard
- Multiple communication modes
- Can program a wide variety of problems using a handful of calls
Next lecture

- More advanced parts of the MPI-1 API
  - User defined data-types
  - Cartesian primitives for meshes