On Optimizing Collective Communication
Avi Purkayastha (UT Texas Advanced Computing Center)
Ernie Chan, Marcel Heimlich, Robert van de Geijn (UT Computer Science)
Cluster 2004, Sept 20-24, San Diego, CA

Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future work

Model of Parallel Computation
• Target architectures
  – distributed-memory parallel architectures
• Indexing
  – p nodes, indexed 0 ... p−1
  – each node has one computational processor
  – often logically viewed as a linear array: 0 1 2 3 4 5 6 7 8

Model of Parallel Computation
• Logically fully connected
  – a node can send directly to any other node
• Communicating between nodes
  – a node can simultaneously receive and send
• Network conflicts
  – sending over a path between two nodes that is already occupied by another message

Model of Parallel Computation
• Cost of communication
  – sending a message of length n between any two nodes costs α + nβ
  – α is the startup cost (latency)
  – β is the per-item transmission cost (bandwidth)
• Cost of computation
  – the cost to perform one arithmetic operation is γ
  – reduction operations: sum, prod, min, max

Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future work

Collective Communications
• Broadcast
• Reduce(-to-one)
• Scatter
• Gather
• Allgather
• Reduce-scatter
• Allreduce

Lower Bounds (Latency)
• Broadcast: log(p) α
• Reduce(-to-one): log(p) α
• Scatter/Gather: log(p) α
• Allgather: log(p) α
• Reduce-scatter: log(p) α
• Allreduce: log(p) α

Lower Bounds (Bandwidth)
• Broadcast: nβ
• Reduce(-to-one): nβ
• Scatter/Gather: ((p−1)/p) nβ
• Allgather: ((p−1)/p) nβ
• Reduce-scatter: ((p−1)/p) nβ + ((p−1)/p) nγ
• Allreduce: 2((p−1)/p) nβ + ((p−1)/p) nγ

Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future work

Motivating Example
• We will illustrate the different types of algorithms and implementations using the reduce-scatter operation.
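For readers unfamiliar with the running example: reduce-scatter reduces (say, sums) the vectors contributed by all p nodes and leaves each node with one piece of the reduced result. The sketch below shows these semantics through MPI's standard interface; the buffer sizes, the use of MPI_SUM, and the equal-sized pieces are illustrative assumptions, not details from the talk.

```c
/* Semantics of reduce-scatter: every node contributes a length-n vector;
 * the element-wise sum is formed and node i receives piece i of it. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int n = 8 * p;                                     /* total vector length */
    double *sendbuf = malloc(n * sizeof(double));
    double *recvbuf = malloc((n / p) * sizeof(double));
    int *recvcounts = malloc(p * sizeof(int));
    for (int i = 0; i < n; i++) sendbuf[i] = rank;     /* node i contributes all i's */
    for (int i = 0; i < p; i++) recvcounts[i] = n / p; /* equal pieces */

    MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts,
                       MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Every element of the result equals 0 + 1 + ... + (p-1) = p(p-1)/2. */
    printf("node %d holds recvbuf[0] = %g\n", rank, recvbuf[0]);

    free(sendbuf); free(recvbuf); free(recvcounts);
    MPI_Finalize();
    return 0;
}
```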
A building block approach to library implementation
• Short-vector case
• Long-vector case
• Hybrid algorithms

Short-vector case
• Primary concern:
  – algorithms must have low latency cost
• Secondary concerns:
  – algorithms must work for an arbitrary number of nodes
    • in particular, not just for power-of-two numbers of nodes
  – algorithms should avoid network conflicts
    • not absolutely necessary, but nice if possible

Minimum-Spanning Tree based algorithms
• We will show how the following building blocks:
  – broadcast/reduce
  – scatter/gather
  can be implemented using minimum spanning trees embedded in the logical linear array while attaining
  – minimal latency
  – implementation for arbitrary numbers of nodes
  – no network conflicts

General principles
• The message starts on one node, the root.
• Divide the logical linear array in half.
• Send the message to the half of the network that does not contain the root; the receiving node becomes the root of that half.
• Continue recursively in each of the two halves.
• The demonstrated technique applies directly to
  – broadcast
  – scatter
• The technique can be applied in reverse to
  – reduce
  – gather
• Does it avoid network conflicts? Yes, on linear arrays. (A code sketch of this recursion appears after the recap below.)

Reduce-scatter (short vector)
• Implemented as a reduce followed by a scatter.
[Figure: before/after views of the reduce step, with the contributions of all nodes summed element-wise onto the root.]

Cost of Minimum-Spanning Tree Reduce
• log(p) steps, each costing α + nβ + nγ, for a total of
  log(p) (α + nβ + nγ)
• Notice: attains the lower bound for the latency component.

Scatter
[Figure: before/after views of the scatter step.]

Cost of Minimum-Spanning Tree Scatter
• Assumption: power-of-two number of nodes.
  Σ_{k=1}^{log(p)} (α + (n/2^k) β) = log(p) α + ((p−1)/p) nβ
• Notice: attains the lower bounds for both the latency and bandwidth components.

Cost of Reduce/Scatter Reduce-scatter
• Assumption: power-of-two number of nodes.
  log(p) (α + nβ + nγ) + log(p) α + ((p−1)/p) nβ
    = 2 log(p) α + (log(p) + (p−1)/p) nβ + log(p) nγ
• Notice: attains the lower bound for neither the latency nor the bandwidth component.

Recap
• Broadcast: log(p) (α + nβ)
• Reduce: log(p) (α + nβ + nγ)
• Scatter: log(p) α + ((p−1)/p) nβ
• Gather: log(p) α + ((p−1)/p) nβ
• Allgather: 2 log(p) α + (log(p) + (p−1)/p) nβ
• Reduce-scatter: 2 log(p) α + (log(p) + (p−1)/p) nβ + log(p) nγ
• Allreduce: 2 log(p) α + log(p) n (2β + γ)
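The following is a minimal sketch of the recursive halving just described, written as an MPI broadcast over ranks left..right. The function name mst_bcast, the choice of which node in the other half receives, and the use of blocking sends are my assumptions, not the library's actual code.

```c
#include <mpi.h>

/* Minimum-spanning tree broadcast: the root holds the message; split the
 * range in half, send across to the half not containing the root, and
 * recurse. All ranks in [left, right] must call this collectively. */
static void mst_bcast(void *buf, int count, MPI_Datatype type,
                      int root, int left, int right, MPI_Comm comm)
{
    if (left == right) return;              /* a single node: done */

    int me, mid = (left + right) / 2;
    MPI_Comm_rank(comm, &me);

    /* One node in the half not containing the root receives the message
     * and becomes the root of that half. */
    int dest = (root <= mid) ? right : left;
    if (me == root)
        MPI_Send(buf, count, type, dest, 0, comm);
    else if (me == dest)
        MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);

    /* Continue recursively in the half containing this node. */
    if (me <= mid)
        mst_bcast(buf, count, type, (root <= mid) ? root : dest,
                  left, mid, comm);
    else
        mst_bcast(buf, count, type, (root <= mid) ? dest : root,
                  mid + 1, right, comm);
}
```

Called collectively as mst_bcast(buf, count, MPI_DOUBLE, 0, 0, p−1, MPI_COMM_WORLD), it completes in ⌈log₂(p)⌉ steps, matching the log(p)(α + nβ) broadcast cost in the recap; applying the same pattern in reverse (receive, then combine) yields the reduce.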
A building block approach to library implementation
• Short-vector case
• Long-vector case
• Hybrid algorithms

Long-vector case
• Primary concerns:
  – algorithms must have low cost due to vector length
  – algorithms must avoid network conflicts
• Secondary concern:
  – algorithms must work for an arbitrary number of nodes
    • in particular, not just for power-of-two numbers of nodes

Long-vector building blocks
• We will show how the following building blocks:
  – allgather/reduce-scatter
  can be implemented using "bucket" algorithms while attaining
  – minimal cost due to the length of the vectors
  – implementation for arbitrary numbers of nodes
  – no network conflicts
• A logical ring can be embedded in a physical linear array.

General principles
• Send subvectors of data around the ring at each step, like buckets, until all of the data has been collected (or reduced). (A code sketch appears after the recap below.)

Reduce-scatter (long vector)
[Figure: before/after views of the bucket reduce-scatter, with partial sums circulating around the ring.]

Cost of Bucket Reduce-scatter
• p−1 steps, each costing α + (n/p)β + (n/p)γ, for a total of
  (p−1) α + ((p−1)/p) nβ + ((p−1)/p) nγ
• Notice: attains the lower bounds for the bandwidth and computation components.

Recap
• Reduce-scatter: (p−1) α + ((p−1)/p) n (β + γ)
• Allgather: (p−1) α + ((p−1)/p) nβ
• Scatter: log(p) α + ((p−1)/p) nβ
• Gather: log(p) α + ((p−1)/p) nβ
• Reduce: (p−1+log(p)) α + ((p−1)/p) n (2β + γ)
• Allreduce: 2(p−1) α + ((p−1)/p) n (2β + γ)
• Broadcast: (log(p)+p−1) α + 2((p−1)/p) nβ
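Below is a minimal sketch of the bucket reduce-scatter on a logical ring, for summation with p dividing n. The function name bucket_reduce_scatter and the chunk-indexing scheme are my assumptions, not the library's actual code.

```c
#include <mpi.h>
#include <stdlib.h>

/* Bucket (ring) reduce-scatter: the vector in buf is viewed as p chunks
 * of n/p elements. In each of the p-1 steps, every node passes one chunk
 * of partial sums to its right neighbor and accumulates the chunk
 * arriving from its left neighbor. On return, node r holds the fully
 * reduced chunk r (stored in chunk r of buf). Assumes p divides n. */
static void bucket_reduce_scatter(double *buf, int n, MPI_Comm comm)
{
    int r, p;
    MPI_Comm_rank(comm, &r);
    MPI_Comm_size(comm, &p);
    int chunk = n / p;
    int right = (r + 1) % p, left = (r - 1 + p) % p;
    double *tmp = malloc(chunk * sizeof(double));

    for (int s = 0; s < p - 1; s++) {
        int send_idx = (r - 1 - s + 2 * p) % p;  /* chunk passed to the right    */
        int recv_idx = (r - 2 - s + 2 * p) % p;  /* chunk arriving from the left */
        MPI_Sendrecv(buf + send_idx * chunk, chunk, MPI_DOUBLE, right, 0,
                     tmp, chunk, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
        /* Fold the arriving partial sums into our own contribution; this
         * chunk is what we pass on in the next step. */
        for (int i = 0; i < chunk; i++)
            buf[recv_idx * chunk + i] += tmp[i];
    }
    free(tmp);
}
```

Each step sends n/p items and performs n/p additions, giving the (p−1)α + ((p−1)/p) n(β + γ) cost above; since every node sends only to its ring neighbor, there are no network conflicts on a linear array.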
A building block approach to library implementation
• Short-vector case
• Long-vector case
• Hybrid algorithms

Hybrid algorithms (intermediate-length case)
• Algorithms must balance latency, cost due to vector length, and network conflicts.

General principles
• View the p nodes as a two-dimensional mesh
  – p = r × c, with row and column dimensions
• Perform the operation in each dimension
  – e.g., reduce-scatter within columns, followed by reduce-scatter within rows
• Many different combinations of short- and long-vector algorithms are possible.
• Generally, try to reduce the vector length to be used in the next dimension.
• More than two dimensions can be used.

Example: 2D Scatter/Allgather Broadcast
• Scatter in columns
• Scatter in rows
• Allgather in rows
• Allgather in columns

Cost of 2D Scatter/Allgather Broadcast
  (log(p) + r + c − 2) α + 2((p−1)/p) nβ

Cost comparison
• Option 1:
  – MST broadcast in columns
  – MST broadcast in rows
  log(p) α + log(p) nβ
• Option 2:
  – scatter in columns
  – MST broadcast in rows
  – allgather in columns
  (log(p) + c − 1) α + (log(r)/c + 2(c−1)/c) nβ
• Option 3:
  – scatter in columns
  – scatter in rows
  – allgather in rows
  – allgather in columns
  (log(p) + r + c − 2) α + 2((p−1)/p) nβ
• (A worked instance of these formulas appears after the conclusions.)

Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future work

Testbed Architecture
• Cray-Dell PowerEdge Linux Cluster (Lonestar)
  – Texas Advanced Computing Center, J. J. Pickle Research Campus, The University of Texas at Austin
  – 856 3.06 GHz Xeon processors and 410 Dell dual-processor PowerEdge 1750 compute nodes
  – 2 GB of memory per compute node
  – Myrinet-2000 switch fabric
  – mpich-gm library 1.2.5..12a

Performance Results
• Log-log graphs show performance better than linear-linear graphs.
• One processor per node was used.
• In the graphs:
  – 'send-min' is the minimum time to send between nodes
  – 'MPI' is the MPICH implementation
  – 'short' is the minimum-spanning tree algorithm
  – 'long' is the bucket algorithm
  – 'hybrid' is an 8-dimensional hybrid algorithm
    • the seven higher dimensions use the 'long' algorithm
    • the lowest dimension uses the 'short' algorithm
  – 'long-short' is the long-vector algorithm using short-vector implementations as building blocks

Conclusions and Future work
• We have a working prototype of an optimal collective communication library, using optimal algorithms for short, intermediate, and long vector lengths.
• We need to obtain heuristics for the cross-over points between the short, hybrid, and long vector algorithms, independent of architecture.
• We need to complete this approach for ALL MPI data types and operations.
• We need to generalize the approach, or develop heuristics, for large SMP nodes or small n-way node clusters.
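To make the cross-over behavior concrete, the broadcast cost comparison above can be instantiated for p = 64 nodes arranged as r = c = 8 (the numbers are mine, chosen for illustration; logs are base 2):

```latex
\begin{align*}
\text{Option 1:}\;& \log(p)\,\alpha + \log(p)\,n\beta &&= 6\alpha + 6\,n\beta\\
\text{Option 2:}\;& (\log(p)+c-1)\,\alpha + \left(\tfrac{\log(r)}{c} + \tfrac{2(c-1)}{c}\right) n\beta &&= 13\alpha + 2.125\,n\beta\\
\text{Option 3:}\;& (\log(p)+r+c-2)\,\alpha + \tfrac{2(p-1)}{p}\,n\beta &&= 20\alpha + 1.969\,n\beta
\end{align*}
```

Option 1 is cheapest for small n (below roughly 1.8 α/β), Option 3 for large n (above roughly 45 α/β), and Option 2 in between: exactly the cross-over structure that the heuristics called for above must capture.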