Programming for Performance
CS433, Spring 2001
Laxmikant Kale

Causes of performance loss
• If each processor is rated at k MFLOPS and there are p processors, why don't we see k*p MFLOPS of performance?
– There are several causes.
– Each must be understood separately,
– but they interact with each other in complex ways:
• the solution to one problem may create another;
• one problem may mask another, which then manifests itself under other conditions (e.g. increased p).

Causes
• Sequential: cache performance
• Communication overhead
• Algorithmic overhead ("extra work")
• Speculative work
• Load imbalance
• (Long) critical paths
• Bottlenecks

Algorithmic overhead
• Parallel algorithms may have a higher operation count.
• Example: parallel prefix (also called "scan").
– How do we parallelize this?
    B[0] = A[0];
    for (i = 1; i < N; i++)
        B[i] = B[i-1] + A[i];

Parallel prefix: continued
• How do we do this operation in parallel?
– It seems inherently sequential.
– Recursive doubling algorithm
– Operation count: N*log(P)
• A better algorithm (sketched below):
– Take the blocking of the data into account.
– Each processor computes its local (prefix) sums, participates in a parallel algorithm to obtain the sum of all elements to its left, and then adds that sum to all its elements.
– Operation count: N + log(P) + N, i.e. a doubling of the sequential operation count.
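A minimal sketch of this blocked scheme, assuming MPI (it relies on the MPI-2 routine MPI_Exscan); the function name local_block_prefix, the use of double, and the assumption that each processor already holds a block of n elements are illustrative choices, not part of the lecture:

    #include <mpi.h>

    /* Blocked parallel prefix over P processors, each holding n elements in a[].
       1) local prefix sums            (~n ops per processor, N in total)
       2) exclusive scan of block sums (~log P steps)
       3) add the "left sum" offset    (~n ops per processor, N in total)
       Total ~2N + O(log P) operations, matching the N + log(P) + N count above. */
    void local_block_prefix(double *a, int n, MPI_Comm comm)
    {
        int rank;
        double left_sum = 0.0;

        MPI_Comm_rank(comm, &rank);

        for (int i = 1; i < n; i++)              /* step 1: local scan */
            a[i] += a[i - 1];

        /* step 2: sum of the block totals of all lower-ranked processors */
        MPI_Exscan(&a[n - 1], &left_sum, 1, MPI_DOUBLE, MPI_SUM, comm);
        if (rank == 0)
            left_sum = 0.0;                      /* MPI_Exscan leaves rank 0's result undefined */

        for (int i = 0; i < n; i++)              /* step 3: shift the local prefixes */
            a[i] += left_sum;
    }

On rank 0 the offset is zero, so step 3 could be skipped there; it is kept for clarity.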
Bottleneck
• Consider the "primes" program (or the "pi" program).
– What happens when we run it on 1000 PEs?
• How to eliminate bottlenecks — two structures are useful in most such cases:
– spanning trees: organize the processors in a tree;
– hypercube-based dimensional exchange.

Communication overhead
• Components:
– per-message and per-byte costs
– sending, receiving and network costs
– capacity constraints
• Grainsize analysis:
– How much computation is done per message?
– Computation-to-communication ratio

Communication overhead: examples
• Usually one must reorganize the data or the work to reduce communication.
• Combining communication also helps.

Communication overhead: cost model
• Communication delay: the time interval between sending on one processor and receipt on another: time = alpha + beta*N.
• Communication overhead: the time a processor is held up (both the sender and the receiver are held up): again of the form alpha + beta*N.
• Typical values: alpha = 10-100 microseconds, beta = 2-10 ns per byte.

Grainsize control
• A simple definition of grainsize:
– the amount of computation per message
– Problem: this does not distinguish short messages from long ones.
• More realistic:
– the computation-to-communication ratio:
    computation time / (alpha + beta*N) for one message

Example: matrix multiplication
• How do we parallelize this?
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)        // C[i][j] is initially 0
            for (k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];

A simple algorithm
• Distribute A by rows and B by columns.
– Any processor can request a row of A and get it (in two messages); the same for a column of B.
– Distribute the work of computing each element of C using some load-balancing scheme,
• so it works even on machines with varying processor capabilities (e.g. timeshared clusters).
– What is the computation-to-communication ratio?
• For each element: 2N operations, 2 messages of N bytes.
• Ratio: 2N / (2*alpha + 2*N*beta) = 2N*0.01 / (2*10 + 2*0.002*N), taking 0.01 microseconds per operation and the alpha and beta values above.

A better algorithm
• Store A as a collection of row-bunches,
– each bunch holding g rows;
– the same for B's columns.
• Each object now computes a g x g section of C (see the sketch below).
• Computation-to-communication ratio:
– 2*g*g*N operations,
– 2 messages of g*N bytes each,
– alpha ratio: 2*g*g*N / 2; beta ratio: g.
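A minimal sketch of the work done by one such object, assuming it has already received its g rows of A and g columns of B; the flat data layout and the names compute_block, rowsA, colsB and Cblk are illustrative, not the lecture's code:

    /* Compute the g x g block of C owned by one object.
       rowsA: g rows of A, row-major, rowsA[i*N + k]             (one message of g*N values)
       colsB: g columns of B, stored column-wise, colsB[j*N + k] (one message of g*N values)
       Cblk : g*g results, Cblk[i*g + j]
       Work : g*g*N multiply-adds = 2*g*g*N flops. */
    void compute_block(int g, int N,
                       const double *rowsA, const double *colsB, double *Cblk)
    {
        for (int i = 0; i < g; i++)
            for (int j = 0; j < g; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += rowsA[i*N + k] * colsB[j*N + k];
                Cblk[i*g + j] = sum;
            }
    }

With the cost model above, the per-message (alpha) ratio grows as g*g*N and the per-byte (beta) ratio as g, so larger blocks amortize both components of the overhead, at the price of having fewer, larger pieces to balance.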
Alpha vs beta
• The per-message cost is significantly larger than the per-byte cost,
– by a factor of several thousand.
• So several optimizations are possible that trade a larger beta (per-byte) cost for a smaller alpha (per-message) cost, i.e. they send fewer, larger messages.
• Applications of this idea (examined in the last two lectures):
– message combining;
– complex communication patterns: each-to-all, ...

Example: each-to-all communication
• Each processor wants to send N bytes, as a distinct message, to every other processor.
• Simple implementation: cost per processor is alpha*P + N*beta*P.
– What are typical values?

Programming for performance: steps
• Select/design a parallel algorithm.
• Decide on a decomposition.
• Select a load-balancing strategy.
• Plan the communication structure.
• Examine the synchronization needs:
– global synchronizations, critical paths

Design philosophy
• Parallel algorithm design:
– Ensure good performance (total operation count).
– Generate sufficient parallelism.
– Avoid/minimize "extra work".
• Decomposition:
– Break the problem into many small pieces:
• the smallest grain that sufficiently amortizes the overhead.

Design principles, continued
• Load balancing:
– Select a static, dynamic, or quasi-dynamic strategy.
• measurement-based vs. prediction-based load estimation
– Principle: it is acceptable to let a processor idle, but avoid overloading one (think about why).
• Reduce communication overhead:
– algorithmic reorganization (changing the mapping);
– message combining;
– efficient communication libraries.

Design principles: synchronization
• Eliminate unnecessary global synchronization.
– If T(i,j) is the time of the i-th phase on the j-th PE:
• with synchronization: sum_i ( max_j { T(i,j) } )
• without: max_j ( sum_i { T(i,j) } )
• Critical paths:
– Look for long chains of dependences.
– Draw timeline pictures with the dependences.

Diagnosing performance problems
• Tools:
– back-of-the-envelope (i.e. simple) analysis;
– post-mortem analysis with performance logs:
• visualization of performance data,
• automatic analysis,
• phase-by-phase analysis (a program may have many phases).
• What to measure:
– load distribution, (communication) overhead, idle time;
– their averages, max/min, and variances;
– profiling: time spent in individual modules/subroutines.

Diagnostic techniques
• Tell-tale signs:
– max load >> average, and the number of PEs above the average is >> 1: load imbalance.
– max load >> average, and the number of PEs above the average is ~ 1: a possible bottleneck (if there is a dependence on it).
– the profile shows the total time in a routine f growing as the number of PEs grows: algorithmic overhead.
– communication overhead: obvious from the measurements.

Communication optimization
• Example problem from an earlier lecture: molecular dynamics.
– Each processor, assumed to house just one cell, needs to send 26 short messages to "neighboring" processors.
– Assume, for each send and each receive: alpha = 10 us, beta = 2 ns.
– Time spent (notice: 26 sends and 26 receives):
• 26 * 2 * (10 us) = 520 us
– If there is more than one cell per PE, multiply this number accordingly!
– Can this be improved? How?

Message combining
• If there are multiple cells per processor:
– the neighbors of a cell may be on the same neighboring processor;
– the neighbors of two different cells may be on the same processor;
– so combine the messages going to the same processor (a sketch follows below).
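A minimal sketch of that combining step, assuming an MPI setting; the Item type, the fixed MAX_* bounds, and the dest_of callback are illustrative assumptions (a real MD code would use its own atom and cell data structures), and the matching receives and bounds checks are omitted for brevity:

    #include <mpi.h>

    #define MAX_DEST  32        /* illustrative bound on neighboring processors     */
    #define MAX_ITEMS 4096      /* illustrative bound on items sent to one neighbor */

    typedef struct { double x, y, z; int id; } Item;   /* e.g. one migrating atom */

    /* Send each destination processor ONE combined message instead of one
       message per (cell, neighbor) pair: the per-message alpha cost is paid
       once per neighboring processor rather than once per item. */
    void send_combined(const Item *items, int n,
                       int (*dest_of)(const Item *),   /* maps an item to its destination rank */
                       MPI_Comm comm)
    {
        static Item bucket[MAX_DEST][MAX_ITEMS];
        int count[MAX_DEST] = {0};
        int dests[MAX_DEST], ndest = 0;

        for (int i = 0; i < n; i++) {                  /* bucket the items by destination */
            int d = dest_of(&items[i]);
            int b = 0;
            while (b < ndest && dests[b] != d) b++;
            if (b == ndest) dests[ndest++] = d;
            bucket[b][count[b]++] = items[i];
        }
        for (int b = 0; b < ndest; b++)                /* one message per destination */
            MPI_Send(bucket[b], count[b] * (int)sizeof(Item), MPI_BYTE,
                     dests[b], 0, comm);
    }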
Communication optimization I
• Take advantage of the structure of the communication, and communicate in stages. If my coordinates are (x, y, z):
– send to (x+1, y, z) everything that goes to (x+1, *, *);
– send to (x-1, y, z) everything that goes to (x-1, *, *);
– wait for the messages from the x neighbors, then send each y neighbor a combined message (and similarly for z).
• A total of 6 messages instead of 26.
• The price is an apparently longer critical path.

Communication optimization II
• Send all migrating atoms to processor 0.
– Let processor 0 sort them out and send one message to each processor.
– This works acceptably if the number of processors is small;
– otherwise processor 0 becomes a bottleneck.

Communication optimization III
• The generalized problem: each-to-all, individualized messages.
• Apply all of the previously learned techniques.

Intro to load balancing
• Example: 500 processors, 50,000 units of work.
• What should the objective of load balancing be? (A worked hint follows below.)
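As a hint at the answer, the figure of merit is the finish time of the most loaded processor, not the average load. The toy calculation below is an illustration added here, not part of the lecture; the 110-unit maximum load is an arbitrary assumed value:

    #include <stdio.h>

    int main(void)
    {
        const int P = 500;                  /* processors                                    */
        const int W = 50000;                /* equal-cost work units                         */
        double avg = (double)W / P;         /* 100 units per processor if perfectly balanced */
        double max = 110.0;                 /* assume the most loaded processor gets 110     */

        /* The parallel completion time is proportional to the maximum load,
           so efficiency = average / maximum, regardless of how the remaining
           processors share the rest of the work. */
        printf("average load = %.1f, max load = %.1f, efficiency = %.0f%%\n",
               avg, max, 100.0 * avg / max);
        return 0;
    }

So the objective is to keep the maximum load close to the average: it is fine to leave a processor slightly idle, as long as no processor is overloaded.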