LogP Model

Motivation
• BSP model
  – Limited to the bandwidth of the network (g) and the load per PE
  – Requires a large load per superstep
• Need better models for portable algorithms
• Converging hardware
  – Independent from network topology
  – Programming models
• Assumption
  – Number of data elements much bigger than the number of PEs

Parameters
• L: latency, the delay on the network
• o: overhead on a PE
• g: gap, the minimum interval between consecutive messages (due to bandwidth)
• P: number of PEs
Note: L, o, g are independent of P and of node distances
Message length: short messages
  – L, o, g are per word or per message of fixed length
  – A k-word message is sent as k short messages (k*o overhead)
  – L is independent of message length

Parameters (continued)
• Bandwidth: 1/g * unit message length
• Number of messages in transit to or from each PE: at most L/g
• Send-to-receive total time: L + 2o
• If o >> g, ignore g
  – Similar to BSP except no synchronization step
  – No communication/computation overlapping
  – Speed-up factor at most two

Broadcast
• Optimal broadcast tree for P = 8, L = 6, g = 4, o = 2
• [Figure: broadcast tree over P0 to P7, each node labelled with the time it receives the datum: 0, 10, 14, 18, 20, 22, 24, 24; the o, L and g intervals are marked]

Optimal Sum
• Given time T, how many items can we add?
• Approach: recursive
  – At the root, if T <= L + 2o, use a single PE (it can add T + 1 items)
  – If T > L + 2o:
    • The root should have the data ready at T,
    • and the sender must have its sum ready at T - L - 2o - 1
    • Recursively construct the sum tree at the sender
    • If T - g > L + 2o, the root can also receive further data: compute that sum recursively with T - g as the root's time bound
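The receive times in the broadcast example (P = 8, L = 6, g = 4, o = 2) can be reproduced with a small simulation. This is only a sketch under stated assumptions: every PE that holds the datum keeps forwarding it as early as the model allows, a message takes L + 2o from the start of the send to the end of the receive, and consecutive sends from one PE are max(g, o) apart. The function names `point_to_point` and `broadcast_times` are illustrative and not taken from the slides.

```python
import heapq


def point_to_point(L, o):
    """Send-to-receive time of one short message: o (send) + L (network) + o (receive)."""
    return L + 2 * o


def broadcast_times(P, L, o, g):
    """Times at which each of P PEs holds the datum under greedy forwarding.

    Assumption: an informed PE starts its first send immediately and then
    sends again every max(g, o) time units.
    """
    step = max(g, o)              # spacing between consecutive sends of one PE
    times = [0]                   # the root holds the datum at time 0
    heap = [point_to_point(L, o)]  # earliest pending delivery of each informed PE
    while len(times) < P:
        d = heapq.heappop(heap)   # next PE becomes informed at time d
        times.append(d)
        heapq.heappush(heap, d + step)                  # same sender, next delivery
        heapq.heappush(heap, d + point_to_point(L, o))  # new PE's first delivery
    return times


if __name__ == "__main__":
    print(broadcast_times(P=8, L=6, o=2, g=4))
    # [0, 10, 14, 18, 20, 22, 24, 24]
```

With the slide's parameters the last PE is reached at time 24, which agrees with the values listed for the broadcast tree.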
Applications

FFT on the Butterfly network
– Data placement
  • Cyclic layout: first log(n/P) steps communicate locally, last log P globally
  • Blocked layout: first log P steps communicate globally, the remaining steps locally
  • Hybrid: after log(n/P) iterations, re-map from cyclic to blocked so that the remaining steps are also local
    Communication time: g * (n/P^2) * (P-1) + L
    (each PE has n/P data, and 1/P of it goes to each other PE)
    Total time is within a factor (1 + g/log n) of optimal
– All-to-all communication schedule
  • Approach 1: each PE sends to PE1, PE2, … in this order => bottleneck at PE1, then PE2, and so on
  • Approach 2 (staggered re-map): no congestion
    – PE1 sends to PE2, PE3, …
    – PE2 sends to PE3, PE4, etc.

Implementation on CM-5
• CM-5:
  – 33 MHz
  – Fat trees
  – Global control for scan/prefix/broadcast
  – One CM-5: 3.2 MFLOPs
  – FFT on local data: 2.8 - 2.2 MFLOPs (cache effect)
• Each cycle:
  – multiply and add: 4.5 us
  – o: 2 us
  – L: 6 us
  – g: 4 us
  – load and store overhead per cycle: 1 us
• Communication time: (n/P) * max(1 us + 2o, g) + L
• Bottleneck: processing and overhead, not bandwidth

LU decomposition
• Data arrangement is critical

Matching the model with real machines
• Average distance: the topology-independent assumption usually works for n = 1024 nodes
• The average distance and the maximum distance do not differ much

Potential Concerns
• Algorithmic concerns
  – Theory?
  – Too complex?
• Communication concerns
  – How to use trivial communication such as local exchange
  – Topology dependencies?

Comparison with BSP
• Length of superstep
• Message not usable until the next step
• Special hardware for synchronization
• If the virtual/physical ratio is large, context switching may be expensive
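The two all-to-all schedules and the two communication-time formulas from the FFT slides can be compared with a short sketch. The helper names (`naive_schedule`, `staggered_schedule`, `max_load`, `remap_time`, `cm5_comm_time`) and the problem size used in the example are assumptions made for illustration; only the formulas and the CM-5 parameter values (o = 2 us, L = 6 us, g = 4 us, 1 us load/store) come from the slides. PEs are numbered from 0 here.

```python
from collections import Counter


def naive_schedule(P):
    """Approach 1: every PE walks destinations 0, 1, 2, ... in the same order.

    schedule[s][i] is the destination of PE i at step s.
    """
    order = [[d for d in range(P) if d != i] for i in range(P)]
    return [[order[i][s] for i in range(P)] for s in range(P - 1)]


def staggered_schedule(P):
    """Approach 2: at step s, PE i sends to PE (i + s) mod P."""
    return [[(i + s) % P for i in range(P)] for s in range(1, P)]


def max_load(schedule):
    """Largest number of messages aimed at a single PE in any one step."""
    return max(max(Counter(step).values()) for step in schedule)


def remap_time(n, P, g, L):
    """Re-map cost from the slides: g * (n/P^2) * (P-1) + L."""
    return g * (n / P**2) * (P - 1) + L


def cm5_comm_time(n, P, o, g, L, load_store=1.0):
    """CM-5 estimate from the slides: (n/P) * max(load_store + 2o, g) + L."""
    return (n / P) * max(load_store + 2 * o, g) + L


if __name__ == "__main__":
    print(max_load(naive_schedule(4)), max_load(staggered_schedule(4)))
    # hypothetical problem size: n = 2**20 points on P = 128 PEs, times in us
    print(remap_time(2**20, 128, g=4, L=6))
    print(cm5_comm_time(2**20, 128, o=2, g=4, L=6))
```

In the staggered schedule each step is a permutation of the PEs, so no destination receives more than one message per step. With the CM-5 values, 1 us + 2o = 5 us exceeds g = 4 us, which is consistent with the slide's remark that processing and overhead, not bandwidth, are the bottleneck.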