On Optimizing Collective Communication

UT-Texas Advanced Computing Center
Avi Purkayastha

UT-Computer Science
Ernie Chan, Marcel Heimlich, Robert van de Geijn

Cluster 2004, Sept 20-24
San Diego, CA
Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future work
Model of Parallel Computation
• Target Architectures
– distributed memory parallel architectures
• Indexing
– p nodes
– indexed 0 … p – 1
– each node has one computational processor
[Figure: nine nodes, indexed 0 through 8, arranged in a linear array]
• often logically viewed as a linear array
Model of Parallel Computation
• Logically Fully Connected
– a node can send directly to any other node
• Communicating Between Nodes
– a node can simultaneously receive and send
• Network Conflicts
– occur when a message must be sent over a path between
two nodes that is already completely occupied by another message
Model of Parallel Computation
• Cost of Communication
– sending a message of length n between any two nodes
α + n β
• α is the startup cost (latency)
• β is the per-item transmission cost (bandwidth)
• Cost of Computation
– cost to perform an arithmetic operation is γ
• reduction operations
– sum
– prod
– min
– max
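To make the model concrete, here is a worked reading of the cost formula with illustrative numbers (our assumption, not from the slides), written in LaTeX:

% Assumed values: alpha = 50 microseconds, beta = 4 ns/item, n = 10^6 items.
\[
  T(n) = \alpha + n\beta
       = 50\,\mu\mathrm{s} + 10^{6} \cdot 4\,\mathrm{ns}
       \approx 4.05\,\mathrm{ms}
\]
% For long vectors the n*beta term dominates; for short ones, alpha does.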
Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future work
Collective Communications
• Broadcast
• Reduce(-to-one)
• Scatter
• Gather
• Allgather
• Reduce-scatter
• Allreduce
Lower Bounds (Latency)
• Broadcast: log(p) α
• Reduce(-to-one): log(p) α
• Scatter/Gather: log(p) α
• Allgather: log(p) α
• Reduce-scatter: log(p) α
• Allreduce: log(p) α
Lower Bounds (Bandwidth)
• Broadcast: n β
• Reduce(-to-one): n β
• Scatter/Gather: ((p−1)/p) n β
• Allgather: ((p−1)/p) n β
• Reduce-scatter: ((p−1)/p) n β
• Allreduce: 2 ((p−1)/p) n β
Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future work
Motivating Example
• We will illustrate the different types of
algorithms and implementations using the
Reduce-scatter operation.
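For reference, here is a short, self-contained C/MPI program (our illustration; the slides show no code) demonstrating what Reduce-scatter computes, via the standard MPI_Reduce_scatter interface:

/* Each node contributes a vector of p doubles; the element-wise sum
 * of all contributions is computed, and node i receives block i of
 * the result (here: one element per node). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int me, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double *x = malloc(p * sizeof(double));
    int *counts = malloc(p * sizeof(int));
    for (int i = 0; i < p; i++) { x[i] = me + i; counts[i] = 1; }

    double y;   /* node me receives element me of the summed vector */
    MPI_Reduce_scatter(x, &y, counts, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    printf("node %d: y = %g\n", me, y);   /* = sum over j of (j + me) */

    free(x); free(counts);
    MPI_Finalize();
    return 0;
}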
A building block approach to library
implementation
• Short-vector case
• Long-vector case
• Hybrid algorithms
Short-vector case
• Primary concern:
– algorithms must have low latency cost
• Secondary concerns:
– algorithms must work for arbitrary number of
nodes
• in particular, not just for power-of-two numbers of nodes
– algorithms should avoid network conflicts
• not absolutely necessary, but nice if possible
Minimum-Spanning Tree based
algorithms
• We will show how the following building
blocks:
– broadcast/reduce
– scatter/gather
• can be implemented using minimum spanning trees embedded in
the logical linear array while attaining
– minimal latency
– implementation for arbitrary numbers of nodes
– no network conflicts
General principles
• message starts on one processor
General principles
• divide logical linear array in half
General principles
• send message to the half of the network that
does not contain the current node (root) that
holds the message
General principles
• continue recursively in each of the two halves
General principles
• The demonstrated technique directly applies to
– broadcast
– scatter
• The technique can be applied in reverse to
– reduce
– gather
General principles
• This technique can be used to implement the
following building blocks:
– broadcast/reduce
– scatter/gather
• Using a minimum spanning tree embedded in the
logical linear array while attaining
– minimal latency
– implementation for arbitrary numbers of nodes
– no network conflicts?
• Yes, on linear arrays
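A minimal sketch in C/MPI of this recursive-halving broadcast (our illustration, not the authors' library code; the function name is ours); it works for an arbitrary number of nodes:

#include <mpi.h>

void mst_bcast(void *buf, int count, MPI_Datatype type,
               int root, MPI_Comm comm)
{
    int me, p;
    MPI_Comm_rank(comm, &me);
    MPI_Comm_size(comm, &p);

    int lo = 0, hi = p;                /* current subrange [lo, hi) */
    while (hi - lo > 1) {
        int mid = (lo + hi) / 2;       /* split the subrange in half */
        if (root < mid) {
            /* root is in the left half: it sends to node `mid`,
               which becomes the root of the right half */
            if (me == root)
                MPI_Send(buf, count, type, mid, 0, comm);
            else if (me == mid)
                MPI_Recv(buf, count, type, root, 0, comm,
                         MPI_STATUS_IGNORE);
            if (me < mid) hi = mid;
            else { lo = mid; root = mid; }
        } else {
            /* root is in the right half: it sends to node `lo`,
               which becomes the root of the left half */
            if (me == root)
                MPI_Send(buf, count, type, lo, 0, comm);
            else if (me == lo)
                MPI_Recv(buf, count, type, root, 0, comm,
                         MPI_STATUS_IGNORE);
            if (me < mid) { hi = mid; root = lo; }
            else lo = mid;
        }
    }
}

Each round halves the subrange containing the calling node, so the loop runs ⌈log(p)⌉ times at a cost of α + n β per round, matching the log(p) (α + n β) broadcast cost in the recap below.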
Reduce-scatter (short vector)
• implemented as a Reduce followed by a Scatter
Reduce
[Figure: Reduce, before and after: the vectors on all nodes are summed (+) element-wise into a single result vector on the root]
Cost of Minimum-Spanning Tree Reduce

log(p) (α + n β + n γ)

(number of steps × cost per step)

Notice: attains lower bound for latency component
Scatter
[Figure: Scatter, before and after: the root's vector is divided into blocks, one delivered to each node]
Cost of Minimum-Spanning Tree Scatter
• Assumption: power-of-two number of nodes

Σ_{k=1}^{log(p)} (α + (n / 2^k) β) = log(p) α + ((p−1)/p) n β

Notice: attains lower bound for latency and bandwidth components
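For completeness, the geometric series behind this bound collapses as follows (a standard identity, written in LaTeX):

\[
  \sum_{k=1}^{\log_2 p}\Bigl(\alpha + \frac{n}{2^k}\beta\Bigr)
  = \log_2(p)\,\alpha + n\beta\sum_{k=1}^{\log_2 p} 2^{-k}
  = \log_2(p)\,\alpha + n\beta\Bigl(1 - \frac{1}{p}\Bigr)
  = \log_2(p)\,\alpha + \frac{p-1}{p}\,n\beta
\]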
Cost of Reduce/Scatter Reduce-scatter
• Assumption: power-of-two number of nodes

log(p) (α + n β + n γ)    (reduce)
+ log(p) α + ((p−1)/p) n β    (scatter)
= 2 log(p) α + (log(p) + (p−1)/p) n β + log(p) n γ

Notice: does not attain lower bound for latency or bandwidth components
Recap
• Reduce: log(p) (α + n β + n γ)
• Reduce-scatter: 2 log(p) α + log(p) n (β + γ) + ((p−1)/p) n β
• Scatter: log(p) α + ((p−1)/p) n β
• Gather: log(p) α + ((p−1)/p) n β
• Broadcast: log(p) (α + n β)
• Allreduce: 2 log(p) α + log(p) n (2β + γ)
• Allgather: 2 log(p) α + log(p) n β + ((p−1)/p) n β
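As a sanity check (our composition, consistent with the recap above), the Allreduce entry is just a Reduce followed by a Broadcast:

\[
  \underbrace{\log(p)(\alpha + n\beta + n\gamma)}_{\text{reduce}}
  + \underbrace{\log(p)(\alpha + n\beta)}_{\text{broadcast}}
  = 2\log(p)\,\alpha + \log(p)\,n\,(2\beta + \gamma)
\]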
A building block approach to library
implementation
• Short-vector case
• Long-vector case
• Hybrid algorithms
Long-vector case
• Primary concern:
– algorithms must have low cost due to vector length
– algorithms must avoid network conflicts
• Secondary concerns:
– algorithms must work for arbitrary number of
nodes
• in particular, not just for power-of-two numbers of nodes
Long-vector building blocks
• We will show how the following building
blocks:
– allgather/reduce-scatter
• Can be implemented using “bucket”
algorithms while attaining
– minimal cost due to length of vectors
– implementation for arbitrary numbers of nodes
– no network conflicts
• A logical ring can be embedded in a
physical linear array
General principles
• This technique can be used to implement the
following building blocks:
– allgather/reduce-scatter
• Send subvectors of data around the ring at each step
until all data is collected, like a “bucket”
– minimal cost due to length of vectors
– implementation for arbitrary numbers of nodes
– no network conflicts
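A minimal sketch in C/MPI of the bucket reduce-scatter for summing vectors of doubles (our illustration, not the authors' library code; the function name is ours and n is assumed divisible by p for brevity):

#include <mpi.h>

void bucket_reduce_scatter(double *buf, double *tmp, int n, MPI_Comm comm)
{
    int me, p;
    MPI_Comm_rank(comm, &me);
    MPI_Comm_size(comm, &p);
    int blk   = n / p;                 /* one block of the result per node */
    int right = (me + 1) % p;          /* ring neighbors embedded in */
    int left  = (me - 1 + p) % p;      /* the logical linear array   */

    /* p-1 steps; each step costs alpha + (n/p) beta + (n/p) gamma */
    for (int s = 0; s < p - 1; s++) {
        int sb = (me - s + p) % p;         /* block to pass rightward  */
        int rb = (me - s - 1 + 2 * p) % p; /* block to accumulate into */
        MPI_Sendrecv(buf + sb * blk, blk, MPI_DOUBLE, right, 0,
                     tmp, blk, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < blk; i++)      /* local reduction */
            buf[rb * blk + i] += tmp[i];
    }
}

After p−1 steps, node me holds the fully reduced block (me+1) mod p; a relabeling or final shift yields the canonical layout in which node i owns block i.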
Reduce-scatter
[Figure: Reduce-scatter, before and after: each node ends up with one fully summed (+) block of the result vector]
Cost of Bucket Reduce-scatter

(p−1) (α + (n/p) β + (n/p) γ)

(number of steps × cost per step)

= (p−1) α + ((p−1)/p) n β + ((p−1)/p) n γ

Notice: attains lower bound for bandwidth and computation components
Recap
• Reduce-scatter: (p−1) α + ((p−1)/p) n (β + γ)
• Scatter: log(p) α + ((p−1)/p) n β
• Gather: log(p) α + ((p−1)/p) n β
• Allgather: (p−1) α + ((p−1)/p) n β
• Reduce: (p−1 + log(p)) α + ((p−1)/p) n (2β + γ)
• Allreduce: 2(p−1) α + ((p−1)/p) n (2β + γ)
• Broadcast: (log(p) + p−1) α + 2 ((p−1)/p) n β
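As a sanity check (our composition, consistent with the recap above), the Allreduce entry is a bucket Reduce-scatter followed by a bucket Allgather:

\[
  \underbrace{(p-1)\alpha + \tfrac{p-1}{p}\,n\,(\beta + \gamma)}_{\text{reduce-scatter}}
  + \underbrace{(p-1)\alpha + \tfrac{p-1}{p}\,n\beta}_{\text{allgather}}
  = 2(p-1)\,\alpha + \tfrac{p-1}{p}\,n\,(2\beta + \gamma)
\]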
A building block approach to library
implementation
• Short-vector case
• Long-vector case
• Hybrid algorithms
Hybrid algorithms
(Intermediate length case)
• Algorithms must balance latency, cost due to
vector length, and network conflicts
General principles
• View p nodes as a 2-dimensional mesh
– p = r x c, row and column dimensions
• Perform operation in each dimension
– reduce-scatter within columns, followed by
reduce-scatter within rows
General principles
• Many different combinations of short- and
long-vector algorithms are possible
• Generally try to reduce the vector length before
moving on to the next dimension
• Can use more than 2 dimensions
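A minimal sketch in C/MPI of this 2D view (our illustration; the helper name and signature are ours): split the nodes into row and column communicators, then run a collective in each dimension in turn.

#include <mpi.h>

void make_mesh_comms(MPI_Comm comm, int c,   /* c columns; r = p / c rows */
                     MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int me;
    MPI_Comm_rank(comm, &me);
    int my_row = me / c;               /* row-major view of the array */
    int my_col = me % c;

    /* Same color -> same communicator; the key orders ranks within it. */
    MPI_Comm_split(comm, my_row, my_col, row_comm);
    MPI_Comm_split(comm, my_col, my_row, col_comm);
}

A hybrid reduce-scatter would then run a reduce-scatter over *col_comm followed by one over *row_comm, with each node carrying only its surviving subvector into the second stage.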
Example: 2D Scatter/Allgather Broadcast
• Scatter in columns
• Scatter in rows
• Allgather in rows
• Allgather in columns
Cost of 2D Scatter/Allgather Broadcast

(log(p) + r + c − 2) α + 2 ((p−1)/p) n β
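Our step-by-step check of this cost (assuming MST scatters and bucket allgathers on an r × c mesh with p = r c), in LaTeX:

\[
  \underbrace{\log(r)\alpha + \tfrac{r-1}{r}\,n\beta}_{\text{scatter in columns}}
  + \underbrace{\log(c)\alpha + \tfrac{c-1}{c}\,\tfrac{n}{r}\beta}_{\text{scatter in rows}}
  + \underbrace{(c-1)\alpha + \tfrac{c-1}{c}\,\tfrac{n}{r}\beta}_{\text{allgather in rows}}
  + \underbrace{(r-1)\alpha + \tfrac{r-1}{r}\,n\beta}_{\text{allgather in columns}}
\]
\[
  = (\log(p) + r + c - 2)\,\alpha + 2\,\tfrac{p-1}{p}\,n\beta
\]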
Cost comparison
• Option 1:
– MST Broadcast in columns
– MST Broadcast in rows
log(p) α + log(p) n β
• Option 2:
– Scatter in columns
– MST Broadcast in rows
– Allgather in columns
(log(p) + c − 1) α + ((2(c−1) + log(r)) / c) n β
• Option 3:
– Scatter in columns
– Scatter in rows
– Allgather in rows
– Allgather in columns
(log(p) + r + c − 2) α + 2 ((p−1)/p) n β
Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future work
Testbed Architecture
• Cray-Dell PowerEdge Linux Cluster
– Lonestar
• Texas Advanced Computing Center
– J. J. Pickle Research Campus
» The University of Texas at Austin
– 856 3.06 GHz Xeon processors and
410 Dell dual-processor PowerEdge 1750 compute nodes
– 2 GB of memory per compute node
– Myrinet-2000 switch fabric
– mpich-gm library 1.2.5..12a
Performance Results
• Log-log graphs
– show performance trends better than linear-linear graphs
• Used one processor per node
• Graphs
– ‘send-min’ is the minimum time to send between nodes
– ‘MPI’ is the MPICH implementation
– ‘short’ is the minimum-spanning-tree algorithm
– ‘long’ is the bucket algorithm
– ‘hybrid’ is an 8-dimensional hybrid algorithm
• seven higher dimensions with the ‘long’ algorithm
• lowest dimension with the ‘short’ algorithm
– ‘long-short’ is the long-vector algorithm using short-vector
implementations as building blocks
Conclusions and Future work
• Have a working prototype for an optimal collective
communication library, using optimal algorithms for
short, intermediate and long vector lengths.
• Need to obtain heuristics for cross-over points
between short, hybrid and long vector algorithms,
independent of architecture.
• Need to complete this approach for ALL MPI data
types and operations.
• Need to generalize the approach, or develop heuristics,
for large SMP nodes or small n-way node clusters.