Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Programming for Performance Laxmikant Kale CS 433 Causes of performance loss • If each processor is rated at k MFLOPS, and there are p processors, why don’t we see k.p MFLOPS performance? – Several causes, – Each must be understood separately – but they interact with each other in complex ways • Solution to one problem may create another • One problem may mask another, which manifests itself under other conditions (e.g. increased p). Causes • • • • • • • Sequential: cache performance Communication overhead Algorithmic overhead (“extra work”) Speculative work Load imbalance (Long) Critical paths Bottlenecks Algorithmic overhead • Parallel algorithms may have a higher operation count • Example: parallel prefix (also called “scan”) – How to parallelize this? B[0] = A[0]; for (I=1; I<N; I++) B[I] = B[I-1]+A[I]; Parallel Prefix: continued • How to this operation in parallel? – Seems inherently sequential – Recursive doubling algorithm – Operation count: log(P) . N • A better algorithm: – Take blocking of data into account – Each processor calculate its sum, then participates in a prallel algorithm to get sum to its left, and then adds to all its elements – N + log(P) +N: doubling of op. Count Bottleneck • Consider the “primes” program (or the “pi”) – What happens when we run it on 1000 pes? • How to eliminate bottlenecks: – Two structures are useful in most such cases: • Spanning trees: organize processors in a tree • Hypercube-based dimensional exchange Communication overhead • Components: – per message and per byte – sending, receiving and network – capacity constraints • Grainsize analysis: – How much computation per message – Computation-to-communication ratio Communication overhead examples • Usually, must reorganize data or work to reduce communication • Combining communication also helps • Examples: Communication overhead Communication delay: time interval between sending on one processor to receipt on another: time = a + b. N Communication overhead: the time a processor is held up (both sender and receiver are held up): again of the form a+ bN Typical values: a = 10 - 100 microseconds, b: 2-10 ns Grainsize control • A Simple definition of grainsize: – Amount of computation per message – Problem: short message/ long message • More realistic: – Computation to communication ratio Example: matrix multiplication • How to parallelize this? For (I=0; I<N; I++) For (J=0; j<N; J++) // c[I][j] ==0 For(k=0; k<N; k++) C[I][J] += A[I][K] * B[K][J]; A simple algorithm: • Distribute A by rows, B by columns – So,any processor can request a row of A and get it (in two messages). Same for a col of B, – Distribute the work of computing each element of C using some load balancing scheme • So it works even on machines with varying processor capabilities (e.g. timeshared clusters) – What is the computation-toc-mmunication ratio? • For each object: 2.N ops, 2 messages with N bytes A better algorithm: • Store A as a collection row-bunches – each bunch stores g rows – Same of B’s columns • Each object now computes a gxg section of C • Comp to commn ratio: – 2*g*g*N ops – 2 messages, gN bytes each – alpha ratio: 2g*g*N/2, beta ratio: g Alpha vs beta • The per message cost is significantly larger than per byte cost – factor of several thousands – So, several optimizations are possible that trade off : get larger beta cost for smaller alpha – I.e. send fewer messages – Applications of this idea: • Message combining • Complex communication patterns: each-to-all, .. Example: • Each to all communication: – each processor wants to send N bytes, distinct message to each other processor – Simple implementation: alpha*P + N * beta *P • typical values? Programming for performance: steps • • • • • Select/design Parallel algorithm Decide on Decomposition Select Load balancing strategy Plan Communication structure Examine synchronization needs – global synchronizations, critical paths Design Philosophy: • Parallel Algorithm design: – Ensure good performance (total op count) – Generate sufficient parallelism – Avoid/minimize “extra work” • Decomposition: – Break into many small pieces: • Smallest grain that sufficiently amortizes overhead Design principles: contd. • Load balancing – Select static, dynamic, or quasi-dynamic strategy • Measurement based vs prediction based load estimation – Principle: let a processor idle but avoid overloading one (think about this) • Reduce communication overhead – Algorithmic reorganization (change mapping) – Message combining – Use efficient communication libraries Design principles: Synchronization • Eliminate unnecessary global synchronization – If T(i,j) is the time during i’th phase on j’th PE • With synch: sum ( max {T(i,j)}) • Without: max { sum(T (i,j) } • Critical Paths: – Look for long chains of dependences • Draw timeline pictures with dependences Diagnosing performance problems • Tools: – Back of the envelope (I.e. simple) analysis – Post-mortem analysis, with performance logs • Visualization of performance data • Automatic analysis • Phase-by-phase analysis (prog. may have many phases) – What to measure • load distribution, (commun.) overhead, idle time • Their averages, max/min, and variances • Profiling: time spent in individual modules/subroutines Diagnostic technniques • Tell-tale signs: – max load >> average, and # Pes > average is >>1 • Load imbalance – max load >> average, and # Pes > average is ~ 1 • Possible bottleneck (if there is dependence) – profile shows increase in total time in routine f with increase in Pes: algorithmic overhead – Communication overhead: obvious