LogP Model
Motivation
• BSP Model
– Limited by the bandwidth of the network (g) and the load on each PE
– Requires a large load per superstep
• Need Better Models for Portable Algorithms
• Converging Hardware
– Independent from Network Topology
– Programming Models
• Assumption
– Number of data elements much bigger than number of PEs
Parameters
• L: Latency
– delay on the network
• o: Overhead on PE
• g: gap
– minimum interval between consecutive messages (due to bandwidth)
• P: Number of PEs
Note:
– L, o, g: independent of P and of node distances
– Message length: short messages
– L, o, g are per word or per message of fixed length
– k-word message: sent as k short messages (k*o overhead)
– L is independent of message length
Parameters (continued)
• Bandwidth: 1/g * unit message length
• Capacity: at most ceil(L/g) messages in transit to or from each PE at any time
• Send to receive total time: L + 2o
• if o >> g, ignore o
– Similar to BSP except no synchronization step
– No communication/computation overlapping
– Speed-up factor at most two
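The quantities on the two parameter slides can be collected in a small calculator. This is a minimal sketch, not part of the original slides: the function name `logp_basics` and the dictionary keys are hypothetical, and the k-message time assumes the sender issues consecutive short messages every max(g, o) time units.

```python
import math

def logp_basics(L, o, g, k=1):
    """Derived LogP quantities for one sender/receiver pair (illustrative)."""
    return {
        # Short messages a PE can inject per time unit (bandwidth = 1/g).
        "messages_per_time_unit": 1 / g,
        # Messages that can be in transit to or from one PE at a time.
        "capacity": math.ceil(L / g),
        # End-to-end time of one short message: send o + latency L + receive o.
        "one_message": L + 2 * o,
        # k short messages: the last one leaves (k-1)*max(g, o) after the first.
        "k_messages": (k - 1) * max(g, o) + L + 2 * o,
        # Overhead paid by the sender for a k-word message (k*o).
        "sender_overhead": k * o,
    }

if __name__ == "__main__":
    # Parameter values taken from the broadcast slide: L=6, g=4, o=2.
    print(logp_basics(L=6, o=2, g=4, k=3))
```

With the broadcast slide's values (L=6, o=2, g=4), a single short message takes L + 2o = 10 time units end to end.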
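Broadcast
Optimal broadcast tree
[Figure: optimal broadcast tree for P=8, L=6, g=4, o=2. P0 holds the datum at time 0; the other seven PEs receive it at times 10, 14, 18, 20, 22, 24, and 24. The arrows in the figure are annotated with o, L, and g.]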
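The receive times in the broadcast figure can be reproduced with a greedy simulation in which every PE that holds the datum keeps forwarding it as fast as the gap allows. The sketch below is an illustration under assumptions, not code from the slides: the name `broadcast_times` is hypothetical, and a PE is assumed to start its first send at the moment its receive overhead completes.

```python
import heapq

def broadcast_times(P, L, o, g):
    """Return the sorted times at which each of P PEs holds the datum."""
    step = max(g, o)            # spacing between consecutive sends of one PE
    send_to_recv = L + 2 * o    # end-to-end time of one short message
    times = [0]                 # the root (P0) holds the datum at time 0
    # Heap of the earliest arrival time each informed PE can still produce.
    heap = [send_to_recv]       # root's first send arrives at L + 2o
    while len(times) < P:
        t = heapq.heappop(heap)                 # next PE becomes informed at t
        times.append(t)
        heapq.heappush(heap, t + step)          # same sender's next message
        heapq.heappush(heap, t + send_to_recv)  # first send of the new PE
    return times

if __name__ == "__main__":
    print(broadcast_times(P=8, L=6, o=2, g=4))
    # [0, 10, 14, 18, 20, 22, 24, 24]
```

For P=8, L=6, g=4, o=2 the simulation yields the same set of times as the figure, with the last PE informed at time 24.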
Optimal Sum
• Given time T, how many items can we add?
• Approach: recursive
– At the root, if T <= L+2o, use a single PE (it can add T+1 items)
– If T > L+2o:
• The root must have the result ready at time T,
• so a sender must have its partial sum ready at T - L - 2o - 1 (the final 1 is the root's addition)
• Recursively construct the sum tree at the sender
• If T - g > L+2o, the root can also receive another partial sum: build a further subtree as if the root's deadline were T - g
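The recursive construction can be turned into a small recurrence. The following is a sketch under stated assumptions rather than the authors' algorithm: one addition costs one time unit, each received partial sum costs the receiver o + 1 (receive overhead plus one addition), g >= o + 1 so the receiver keeps up with arrivals, and the number of PEs is not limited. The name `max_items` is hypothetical.

```python
from functools import lru_cache

def max_items(T, L, o, g):
    """Upper bound on inputs summable by time T under the assumptions above."""
    @lru_cache(maxsize=None)
    def n(t):
        best = t + 1            # a single PE adds t+1 local items in t steps
        from_children = 0
        k = 0                   # number of partial sums this PE receives
        while True:
            # Deadline of the (k+1)-th child: its sum must be sent so that
            # it arrives, is received, and is added before time t - k*g.
            child_deadline = t - (L + 2 * o + 1) - k * g
            if child_deadline < 0:
                break
            k += 1
            from_children += n(child_deadline)
            local = t - k * (o + 1) + 1   # items this PE still adds locally
            if local < 1:
                break
            best = max(best, local + from_children)
        return best
    return n(T)

if __name__ == "__main__":
    print(max_items(T=28, L=5, o=2, g=4))
```

Taking the maximum over k reflects that accepting one more child is only worthwhile when its subtree contributes more items than the o + 1 local additions the receiver gives up.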
Applications
FFT on the Butterfly network
– Data Placement
• cyclic layout: the first log(n/P) iterations need only local communication, the last log P are global
• blocked layout: the first log P iterations need global communication, the remaining are local
• hybrid: start cyclic; after log(n/P) iterations, re-map to blocked so that the remaining iterations are also local
Communication time: g * (n/P^2) * (P-1) + L
(each PE holds n/P data, and 1/P of it goes to each other PE)
Total time is within a factor of (1 + g/log n) of optimal
– All-to-all communication schedule
• Approach 1: every PE sends to PE1, then PE2, … => bottleneck at PE1, then PE2, in this order
• Approach 2 (staggered re-map): no congestion
– PE1 sends to PE2, PE3, …
– PE2 sends to PE3, PE4, etc.
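The difference between the two all-to-all schedules can be checked by counting how many messages target the same PE in one step. This is an illustrative sketch: the function names are hypothetical and PEs are numbered from 0 rather than from 1 as on the slide.

```python
from collections import Counter

def naive_schedule(P):
    """Step s: every PE sends to PE s, so all traffic hits one destination."""
    return [[(src, s) for src in range(P) if src != s] for s in range(P)]

def staggered_schedule(P):
    """Step s: PE i sends to PE (i + s) mod P, so destinations are distinct."""
    return [[(src, (src + s) % P) for src in range(P)] for s in range(1, P)]

def max_receiver_load(schedule):
    """Largest number of messages aimed at a single PE in any one step."""
    return max(
        max(Counter(dst for _, dst in step).values())
        for step in schedule
    )

if __name__ == "__main__":
    P = 8
    print("naive:", max_receiver_load(naive_schedule(P)))         # P - 1
    print("staggered:", max_receiver_load(staggered_schedule(P)))  # 1
```

Both schedules deliver the same P(P-1) messages; only the order differs, which is why the staggered version avoids the bottleneck at no extra cost.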
Implementation on CM5
• CM-5:
– 33 MHz
– Fat trees
– Global control for scan/prefix/broadcast
– One CM-5 PE: 3.2 MFLOPS
– FFT on local data: 2.8 - 2.2 MFLOPS (cache effect)
• Each cycle:
– multiply and add: 4.5 us
– o: 2 us
– L: 6 us
– g: 4 us
– load and store overhead per cycle: 1 us
• Communication time: (n/P) * max(1 us + 2o, g) + L
• Bottleneck: processing and overhead, not bandwidth
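The two communication-time expressions (the model's re-map time from the FFT slide and the CM-5 estimate above) can be compared numerically. The sketch below plugs in the CM-5 values from this slide; the problem size n=1024 and P=8 are arbitrary illustrative choices, and the function names are hypothetical.

```python
def remap_time_logp(n, P, L, g):
    """Model re-map time: g * (n/P^2) * (P-1) + L, in the units of g and L."""
    return g * (n / P**2) * (P - 1) + L

def remap_time_cm5(n, P, L, o, g, load_store=1.0):
    """CM-5 estimate: (n/P) * max(load_store + 2o, g) + L, in microseconds."""
    return (n / P) * max(load_store + 2 * o, g) + L

if __name__ == "__main__":
    n, P = 1024, 8                # illustrative sizes, not from the slides
    L, o, g = 6.0, 2.0, 4.0       # microseconds, from the CM-5 slide
    print(remap_time_logp(n, P, L, g))
    print(remap_time_cm5(n, P, L, o, g))
    # Per element, 1 us + 2o = 5 us exceeds g = 4 us, so processing and
    # overhead, not bandwidth, bound the re-map.
```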
LU decomposition
• Data arrangement critical
Matching the model with real machines
Average Distance
Topology independence usually works for n=1024 nodes:
the average distance and the maximum distance do not differ much.
Potential Concerns
• Algorithmic concern
– Theory?
– Too complex?
• Communication concerns
– how to exploit trivial communication such as local exchange
– topology dependencies?
Comparison with BSP
• Length of superstep
• Messages are not usable until the next superstep
• Special hardware for synchronization
• If the ratio of virtual to physical processors is large, context switching may be expensive