Programming for Performance
CS433
Spring 2001
Laxmikant Kale
Causes of performance loss
• If each processor is rated at k MFLOPS, and there are p processors,
why don’t we see k·p MFLOPS of performance?
– There are several causes
– Each must be understood separately,
– but they interact with each other in complex ways
• A solution to one problem may create another
• One problem may mask another, which manifests itself under other
conditions (e.g. increased p)
2
Causes
• Sequential: cache performance
• Communication overhead
• Algorithmic overhead (“extra work”)
• Speculative work
• Load imbalance
• (Long) Critical paths
• Bottlenecks
3
Algorithmic overhead
• Parallel algorithms may have a higher operation count
• Example: parallel prefix (also called “scan”)
– How to parallelize this?
B[0] = A[0];
for (i = 1; i < N; i++)
    B[i] = B[i-1] + A[i];
4
Parallel Prefix: continued
• How to do this operation in parallel?
– It seems inherently sequential
– Recursive doubling algorithm
– Operation count: log(P) · N
• A better algorithm:
– Take the blocking of data into account
– Each processor calculates the sum of its own block, then participates
in a parallel prefix over the block sums to obtain the sum of all elements
to its left, and then adds that offset to all its elements (see the sketch below)
– N + log(P) + N operations: roughly a doubling of the operation count
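A minimal sketch of this blocked approach, assuming MPI-style message passing
(the slides do not prescribe a particular library); each processor holds a
block of n_local elements:

#include <mpi.h>

/* Illustrative sketch only; MPI is an assumption, not prescribed by the slides.
   b[] receives the prefix sums of this processor's block of a[], offset by
   the total of all elements on processors to the left.                     */
void block_prefix(const double *a, double *b, int n_local, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Step 1: sequential prefix over the local block (about N ops in total). */
    b[0] = a[0];
    for (int i = 1; i < n_local; i++)
        b[i] = b[i-1] + a[i];

    /* Step 2: parallel (exclusive) prefix over the P block totals:
       left_sum = sum of all elements on processors to my left.              */
    double block_total = b[n_local - 1], left_sum = 0.0;
    MPI_Exscan(&block_total, &left_sum, 1, MPI_DOUBLE, MPI_SUM, comm);
    if (rank == 0)
        left_sum = 0.0;   /* MPI_Exscan leaves rank 0's result undefined */

    /* Step 3: add the offset to every local element (another N ops in total). */
    for (int i = 0; i < n_local; i++)
        b[i] += left_sum;
}

Steps 1 and 3 account for the extra N operations relative to the sequential loop.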
5
Bottleneck
• Consider the “primes” program (or the “pi” program)
– What happens when we run it on 1000 PEs?
• How to eliminate bottlenecks:
– Two structures are useful in most such cases:
• Spanning trees: organize the processors in a tree (see the sketch below)
• Hypercube-based dimensional exchange
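For example, in the primes program each PE can pass its partial count up a
binary spanning tree instead of sending it straight to PE 0. A minimal sketch,
assuming MPI point-to-point calls (the slides do not name a library):

#include <mpi.h>

/* Illustrative sketch; MPI is an assumption.  Combine per-PE prime counts up
   a binary spanning tree rooted at PE 0: each PE receives from at most two
   children and sends once to its parent, so no single PE handles P-1
   incoming messages.                                                       */
long tree_reduce(long my_count, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    for (int c = 1; c <= 2; c++) {           /* children: 2*rank+1, 2*rank+2 */
        int child = 2 * rank + c;
        if (child < p) {
            long partial;
            MPI_Recv(&partial, 1, MPI_LONG, child, 0, comm, MPI_STATUS_IGNORE);
            my_count += partial;
        }
    }
    if (rank != 0)                           /* forward to parent (rank-1)/2 */
        MPI_Send(&my_count, 1, MPI_LONG, (rank - 1) / 2, 0, comm);

    return my_count;                         /* the total, valid on PE 0 */
}

(A library reduction such as MPI_Reduce does essentially this internally;
the point here is the tree structure.)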
6
Communication overhead
• Components:
– per message and per byte
– sending, receiving and network
– capacity constraints
• Grainsize analysis:
– How much computation per message
– Computation-to-communication ratio
7
Communication overhead examples
• Usually, must reorganize data or work to reduce communication
• Combining communication also helps
• Examples:
8
Communication overhead
Communication delay: the time interval between a send on one
processor and the receipt on another:
time = a + b·N (for an N-byte message)
Communication overhead: the time a processor is held up (both the
sender and the receiver are held up): again of the form a + b·N
Typical values: a = 10–100 microseconds, b = 2–10 ns per byte
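A quick sanity check with these typical values (taking a = 10 µs and b = 2 ns):
a 1000-byte message costs roughly 10 µs + 2 ns × 1000 = 12 µs, so the
per-message term dominates until messages reach several kilobytes
(a/b = 5000 bytes with these numbers).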
9
Grainsize control
• A simple definition of grainsize:
– Amount of computation per message
– Problem: does not distinguish short messages from long ones
• More realistic:
– Computation-to-communication ratio
– computation time / (a + b·N) for one message
10
Example: matrix multiplication
• How to parallelize this?
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)     // C[i][j] is initialized to 0
    for (k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];
11
A simple algorithm:
• Distribute A by rows, B by columns
– So, any processor can request a row of A and get it (in two
messages). Same for a column of B.
– Distribute the work of computing each element of C using some
load balancing scheme
• So it works even on machines with varying processor capabilities
(e.g. timeshared clusters)
– What is the computation-to-communication ratio?
• For each object: 2N ops, 2 messages of N bytes each
• ratio ≈ (2N × 0.01 µs) / (2a + 2N·b) = 0.02N / (20 + 0.004N),
taking 0.01 µs per operation, a = 10 µs, b = 2 ns
12
A better algorithm:
• Store A as a collection of row-bunches
– each bunch stores g rows
– Same for B’s columns
• Each object now computes a g×g section of C (see the sketch below)
• Computation-to-communication ratio:
– 2·g·g·N ops
– 2 messages, g·N bytes each
– alpha ratio: 2·g·g·N / 2 = g·g·N ops per message; beta ratio: 2·g·g·N / (2·g·N) = g ops per byte
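A minimal sketch of the per-object work (the data layout and names here are
illustrative assumptions, not from the slides): with its g rows of A and g
columns of B received in two messages, an object fills in its g×g block of C:

/* Illustrative sketch; layout and names are assumptions.
   rowsA: g rows of A, each of length N (row-major, g*N values)
   colsB: g columns of B, stored one column of N values after another
   Cblk:  the g x g block of C owned by this object (row-major)          */
void compute_block(const double *rowsA, const double *colsB,
                   double *Cblk, int g, int N)
{
    for (int i = 0; i < g; i++)
        for (int j = 0; j < g; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)       /* 2N ops per C element */
                sum += rowsA[i * N + k] * colsB[j * N + k];
            Cblk[i * g + j] = sum;            /* g*g elements: 2*g*g*N ops */
        }
}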
13
Alpha vs beta
• The per-message cost is significantly larger than the per-byte cost
– by a factor of several thousand
– So, several optimizations are possible that trade off:
• a larger beta (per-byte) cost in return for a smaller alpha (per-message) cost
– i.e. send fewer, larger messages
– Applications of this idea:
• Examined in the last two lectures
14
Programming for performance: steps
• Select/design parallel algorithm
• Decide on decomposition
• Select load balancing strategy
• Plan communication structure
• Examine synchronization needs
– global synchronizations, critical paths
15
Design Philosophy:
• Parallel Algorithm design:
– Ensure good performance (total op count)
– Generate sufficient parallelism
– Avoid/minimize “extra work”
• Decomposition:
– Break into many small pieces:
• Smallest grain that sufficiently amortizes overhead
16
Design principles: contd.
• Load balancing
– Select static, dynamic, or quasi-dynamic strategy
• Measurement based vs prediction based load estimation
– Principle: it is better to let a processor idle than to overload one
• (think about this)
• Reduce communication overhead
– Algorithmic reorganization (change mapping)
– Message combining
– Use efficient communication libraries
17
Design principles: Synchronization
• Eliminate unnecessary global synchronization
– If T(i,j) is the time spent in the i’th phase on the j’th PE:
• With synchronization after each phase: sum over i of ( max over j { T(i,j) } )
• Without: max over j of ( sum over i of T(i,j) ) (see the small example below)
• Critical Paths:
– Look for long chains of dependences
• Draw timeline pictures with dependences
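A small made-up example of the difference: with two PEs and two phases where
T(1,1) = 3, T(1,2) = 1, T(2,1) = 1, T(2,2) = 3, a barrier after each phase
gives max{3,1} + max{1,3} = 6 time units, while without the barrier each PE
simply runs its two phases back to back and the program finishes in
max{3+1, 1+3} = 4.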
18
Diagnosing performance problems
• Tools:
– Back of the envelope (i.e. simple) analysis
– Post-mortem analysis, with performance logs
• Visualization of performance data
• Automatic analysis
• Phase-by-phase analysis (prog. may have many phases)
– What to measure
• load distribution, (commun.) overhead, idle time
• Their averages, max/min, and variances
• Profiling: time spent in individual modules/subroutines
19
Diagnostic techniques
• Tell-tale signs:
– max load >> average, and the number of PEs above the average is >> 1:
• Load imbalance
– max load >> average, and the number of PEs above the average is ~ 1:
• Possible bottleneck (if there is a dependence)
– Profile shows the total time in routine f increasing with the number of PEs:
• Algorithmic overhead
– Communication overhead: obvious
20
Communication Optimization
• Example problem from earlier lecture: Molecular Dynamics
– Each processor, assumed to house just one cell, needs to send 26
short messages to its “neighboring” processors
– Assume for each send and each receive: alpha = 10 µs, beta = 2 ns
– Time spent (note: 26 sends and 26 receives):
• 26 × 2 × 10 µs = 520 µs
– If there is more than one cell on each PE, multiply this number accordingly!
– Can this be improved? How?
21
Message combining
• If there are multiple cells per processor:
– Several neighbors of one cell may live on the same neighboring processor
– Neighbors of two different local cells may also be on the same processor
– So, combine all messages going to the same processor into one (see the sketch below)
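A minimal sketch of this combining (buffer sizes, names, and the use of MPI
here are illustrative assumptions, not from the slides):

#include <mpi.h>
#include <string.h>

#define MAX_PES   1024
#define BUF_BYTES 16384

/* Illustrative sketch; sizes, names and MPI are assumptions.              */
static char outbuf[MAX_PES][BUF_BYTES];   /* one outgoing buffer per PE   */
static int  outlen[MAX_PES];              /* bytes queued for each PE     */

/* Instead of sending a short message right away, append a cell's data
   to the buffer of the destination processor.                            */
void queue_for_pe(int dest_pe, const void *cell_data, int nbytes)
{
    memcpy(outbuf[dest_pe] + outlen[dest_pe], cell_data, nbytes);
    outlen[dest_pe] += nbytes;
}

/* Once all cells have queued their data, send one combined message per
   destination: the per-message cost (alpha) is paid once per PE, not
   once per cell-to-cell interaction.                                     */
void flush_all(MPI_Comm comm, int npes)
{
    for (int pe = 0; pe < npes; pe++)
        if (outlen[pe] > 0) {
            MPI_Send(outbuf[pe], outlen[pe], MPI_BYTE, pe, 0, comm);
            outlen[pe] = 0;
        }
}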
22
Communication Optimization I
• Take advantage of the structure of communication, and do
communication in stages:
– If my coordinates are (x, y, z):
• Send to (x+1, y, z) everything that goes to (x+1, *, *)
• Send to (x-1, y, z) everything that goes to (x-1, *, *)
• Wait for the messages from the x neighbors, then
• Send the y neighbors a combined message (and likewise for z)
– A total of 6 messages instead of 26 (see the sketch below)
– Apparently a longer critical path
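A minimal sketch of this staged exchange, assuming a periodic 3-D MPI
Cartesian communicator and one cell per processor (both are assumptions;
the slides do not prescribe MPI):

#include <mpi.h>

/* Illustrative sketch; MPI and the periodic Cartesian layout are assumptions.
   buf must have room for 27 cells; on entry buf[0..cell_doubles) holds this
   processor's own cell data.                                               */
void staged_exchange(MPI_Comm cart, double *buf, int cell_doubles)
{
    int held = 1;                             /* cells currently held      */
    for (int dim = 0; dim < 3; dim++) {       /* x, then y, then z         */
        int lo, hi;
        MPI_Cart_shift(cart, dim, 1, &lo, &hi);

        int n = held * cell_doubles;          /* doubles held so far       */
        /* Send everything held to the + neighbor, receive from the - side. */
        MPI_Sendrecv(buf, n, MPI_DOUBLE, hi, 0,
                     buf + n, n, MPI_DOUBLE, lo, 0, cart, MPI_STATUS_IGNORE);
        /* Send everything held to the - neighbor, receive from the + side. */
        MPI_Sendrecv(buf, n, MPI_DOUBLE, lo, 1,
                     buf + 2 * n, n, MPI_DOUBLE, hi, 1, cart, MPI_STATUS_IGNORE);
        held *= 3;                            /* 1 -> 3 -> 9 -> 27 cells   */
    }
}

The same total amount of data is sent as before, but the per-message cost
alpha is paid only 6 times instead of 26; the price is that the three stages
must happen in sequence, which is the “apparently longer critical path”
noted above.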
23
Communication Optimization II
• Send all migrating atoms to processor 0
– Let processor 0 sort them out and send 1 message to each processor
– Works ok if the number of processors is small
• Otherwise, bottleneck at 0
24
Communication Optimization III
• Generalized problem:
– Each to all, individualized messages
– Apply all previously learned techniques
25
Intro to Load Balancing
• Example: 500 processors, 50000 units of work
• What should the objective of load balancing be?
26
Causes of performance loss
• If each processor is rated at k MFLOPS, and there are p processors,
why don’t we see k·p MFLOPS of performance?
– There are several causes
– Each must be understood separately,
– but they interact with each other in complex ways
• A solution to one problem may create another
• One problem may mask another, which manifests itself under other
conditions (e.g. increased p)
27
Causes
• Sequential: cache performance
• Communication overhead
• Algorithmic overhead (“extra work”)
• Speculative work
• Load imbalance
• (Long) Critical paths
• Bottlenecks
28
Algorithmic overhead
• Parallel algorithms may have a higher operation count
• Example: parallel prefix (also called “scan”)
– How to parallelize this?
B[0] = A[0];
for (i = 1; i < N; i++)
    B[i] = B[i-1] + A[i];
29
Parallel Prefix: continued
• How to do this operation in parallel?
– It seems inherently sequential
– Recursive doubling algorithm
– Operation count: log(P) · N
• A better algorithm:
– Take the blocking of data into account
– Each processor calculates the sum of its own block, then participates
in a parallel prefix over the block sums to obtain the sum of all elements
to its left, and then adds that offset to all its elements
– N + log(P) + N operations: roughly a doubling of the operation count
30
Bottleneck
• Consider the “primes” program (or the “pi” program)
– What happens when we run it on 1000 PEs?
• How to eliminate bottlenecks:
– Two structures are useful in most such cases:
• Spanning trees: organize the processors in a tree
• Hypercube-based dimensional exchange
31
Communication overhead
• Components:
– per message and per byte
– sending, receiving and network
– capacity constraints
• Grainsize analysis:
– How much computation per message
– Computation-to-communication ratio
32
Communication overhead examples
• Usually, must reorganize data or work to reduce communication
• Combining communication also helps
• Examples:
33
Communication overhead
Communication delay: the time interval between a send on one
processor and the receipt on another:
time = a + b·N (for an N-byte message)
Communication overhead: the time a processor is held up (both the
sender and the receiver are held up): again of the form a + b·N
Typical values: a = 10–100 microseconds, b = 2–10 ns per byte
34
Grainsize control
• A simple definition of grainsize:
– Amount of computation per message
– Problem: does not distinguish short messages from long ones
• More realistic:
– Computation-to-communication ratio
35
Example: matrix multiplication
• How to parallelize this?
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)     // C[i][j] is initialized to 0
    for (k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];
36
A simple algorithm:
• Distribute A by rows, B by columns
– So, any processor can request a row of A and get it (in two
messages). Same for a column of B.
– Distribute the work of computing each element of C using
some load balancing scheme
• So it works even on machines with varying processor
capabilities (e.g. timeshared clusters)
– What is the computation-to-communication ratio?
• For each object: 2N ops, 2 messages of N bytes each
37
A better algorithm:
• Store A as a collection of row-bunches
– each bunch stores g rows
– Same for B’s columns
• Each object now computes a g×g section of C
• Computation-to-communication ratio:
– 2·g·g·N ops
– 2 messages, g·N bytes each
– alpha ratio: 2·g·g·N / 2 = g·g·N ops per message; beta ratio: g ops per byte
38
Alpha vs beta
• The per-message cost is significantly larger than the per-byte
cost
– by a factor of several thousand
– So, several optimizations are possible that trade off: a larger
beta (per-byte) cost in return for a smaller alpha (per-message) cost
– i.e. send fewer, larger messages
– Applications of this idea:
• Message combining
• Complex communication patterns: each-to-all, ...
39
Example:
• Each-to-all communication:
– each processor wants to send a distinct N-byte message to each
other processor
– Simple implementation: time ≈ alpha·P + N·beta·P per processor
• typical values? (see below)
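Plugging in the typical values quoted earlier (alpha = 10 µs, beta = 2 ns)
with, say, P = 1000 and N = 1000 bytes: alpha·P = 10 ms and N·beta·P = 2 ms,
i.e. about 12 ms per processor, dominated by the per-message term. This is
why message combining and structured patterns such as the hypercube
dimensional exchange pay off here.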
40
Programming for performance: steps
• Select/design parallel algorithm
• Decide on decomposition
• Select load balancing strategy
• Plan communication structure
• Examine synchronization needs
– global synchronizations, critical paths
41
Design Philosophy:
• Parallel Algorithm design:
– Ensure good performance (total op count)
– Generate sufficient parallelism
– Avoid/minimize “extra work”
• Decomposition:
– Break into many small pieces:
• Smallest grain that sufficiently amortizes overhead
42
Design principles: contd.
• Load balancing
– Select static, dynamic, or quasi-dynamic strategy
• Measurement based vs prediction based load estimation
– Principle: it is better to let a processor idle than to overload one
(think about this)
• Reduce communication overhead
– Algorithmic reorganization (change mapping)
– Message combining
– Use efficient communication libraries
43
Design principles: Synchronization
• Eliminate unnecessary global synchronization
– If T(i,j) is the time spent in the i’th phase on the j’th PE:
• With synchronization after each phase: sum over i of ( max over j { T(i,j) } )
• Without: max over j of ( sum over i of T(i,j) )
• Critical Paths:
– Look for long chains of dependences
• Draw timeline pictures with dependences
44
Diagnosing performance problems
• Tools:
– Back of the envelope (i.e. simple) analysis
– Post-mortem analysis, with performance logs
• Visualization of performance data
• Automatic analysis
• Phase-by-phase analysis (prog. may have many phases)
– What to measure
• load distribution, (commun.) overhead, idle time
• Their averages, max/min, and variances
• Profiling: time spent in individual modules/subroutines
45
Diagnostic techniques
• Tell-tale signs:
– max load >> average, and the number of PEs above the average is >> 1:
• Load imbalance
– max load >> average, and the number of PEs above the average is ~ 1:
• Possible bottleneck (if there is a dependence)
– Profile shows the total time in routine f increasing with the number of PEs:
• Algorithmic overhead
– Communication overhead: obvious
46