Performance and Scalability of
Parallel Systems
• Evaluation
  – sequential: run time (execution time)
    • Ts = T(Insize)
  – parallel: run time (start -> last PE ends)
    • Tp = T(Insize, p, arch.)
    • cannot be evaluated in isolation from the parallel architecture
  – parallel system: par. alg. + par. architecture
• Metrics: evaluate the performance of a parallel system
  – scalability: ability of a par. alg. to achieve performance
    gains proportional to the number of PEs
Performance Metrics
• Run-Time: Ts, Tp
• Speedup:
  – how much performance is gained by running the
    application on "p" identical processors
  • S = Ts / Tp
  – Ts: fastest sequential algorithm for solving the same problem
  – If the fastest alg. is not known yet (only a lower bound is known),
    or is known but has constants so large that it cannot practically be
    implemented,
    then: take the fastest known sequential algorithm that can be practically
    implemented
  – Speedup: a relative metric
• Alg. for adding "n" numbers on "n" processors
  (Hypercube, n = p = 2^k)
  – Ts = Θ(n)
  – Tp = Θ(log n)
  – S = Θ(n / log n)
• Efficiency: measure of how effectively the problem is solved on p
  processors
  – E ∈ [0, 1)
  – E = S / p
  – if p = n, E = Θ(1 / log n)
• Cost
  – Cseq(fast) = Ts      Cpar = p·Tp
  – cost-optimal: Cpar ~ Cseq(fast)
  – not cost-optimal: if p = n, fine granularity; if p < n, coarse granularity
  – Cseq(fast) = Θ(n)    Cpar = Θ(n log n)    E = Θ(1 / log n)
[Figure: adding 16 numbers on 16 processors of a hypercube.
(a) Initial data distribution and the first communication step
(b) Second communication step
(c) Third communication step
(d) Fourth communication step
(e) Accumulation of the sum at processor 0 after the final communication]
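To make these metrics concrete, here is a minimal Python sketch (not from the original slides) that evaluates Ts, Tp, S, E, and the parallel cost p·Tp for the adding-n-numbers-on-n-processors example, assuming unit cost per addition and per nearest-neighbour communication:

```python
import math

def metrics_adding_n_on_n(n):
    """Adding n numbers on p = n processors of a hypercube (n a power of 2).

    Assumes one time unit per addition and per nearest-neighbour message,
    the same model used in the Scalability slides below.
    """
    Ts = n - 1                 # best sequential algorithm: n - 1 additions
    Tp = 2 * math.log2(n)      # log n addition steps + log n communication steps
    S = Ts / Tp                # speedup
    E = S / n                  # efficiency with p = n
    Cpar = n * Tp              # parallel cost p * Tp = Theta(n log n)
    return Ts, Tp, S, E, Cpar

for n in (16, 256, 4096):
    Ts, Tp, S, E, Cpar = metrics_adding_n_on_n(n)
    print(f"n={n:5d}  Ts={Ts:5d}  Tp={Tp:5.1f}  S={S:7.1f}  E={E:5.3f}  p*Tp={Cpar:9.0f}")
```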
Effects of Granularity on Cost-Optimality
• Assume: n initial PEs
  – If p = # of physical PEs, then each PE simulates n/p virtual PEs,
    and the computation at each PE increases by a factor of n/p
  – Note: Even if p < n, this does not necessarily give a
    cost-optimal alg.
[Figure: four processors simulating 16 processors to add 16 numbers.
(a) Four processors simulating the first communication step of 16 processors
(b) Four processors simulating the second communication step of 16 processors
(c) Simulation of the third step in two substeps
(d) Simulation of the fourth step
(e) Final result]
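A minimal, illustrative sketch (the helper name is mine) of the block mapping suggested by the figure: each of the p physical PEs simulates a contiguous block of n/p virtual PEs, so its work per simulated step grows by a factor of n/p:

```python
# Illustrative block mapping of n virtual PEs onto p physical PEs (as in the
# figure above): virtual PE i is simulated by physical PE i // (n // p).
def physical_owner(i, n, p):
    return i // (n // p)

n, p = 16, 4
for phys in range(p):
    owned = [i for i in range(n) if physical_owner(i, n, p) == phys]
    print(f"physical PE {phys} simulates virtual PEs {owned}")
```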
Adding “n” numbers on “p” processors
(Hypercube p<n)
• n = 2^k, p = 2^w    e.g., n = 16, p = 4
  – comput: Θ(n/p)
  – comput + communication: Θ((n/p) log p)
  – par. exec. time: Tpar = Θ((n/p) log p)
  – Cpar = p·Θ((n/p) log p) = Θ(n log p)
  – Cseq(fast) = Θ(n)  => not cost-optimal
A Cost Optimal Algorithm
  – comput: Θ(n/p)
  – comput + communication: Θ(n/p + log p)
  – Tpar = Θ(n/p + log p)
  – Cpar = p·Tpar = Θ(n)
  – Cseq(fast) = Θ(n)
  => Cost-optimal
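The difference between the two mappings can be seen numerically with a rough sketch (constants dropped, my own test values): the naive simulation has Tpar ~ (n/p)·log p and cost ~ n log p, while the cost-optimal mapping has Tpar ~ n/p + log p and cost ~ n:

```python
import math

def naive_sim_time(n, p):
    # Scaled-down simulation of the p = n algorithm: Tpar ~ (n/p) * log2 p
    return (n / p) * math.log2(p)

def cost_optimal_time(n, p):
    # Add n/p numbers locally, then combine p partial sums in log2 p steps
    return n / p + math.log2(p)

n, p = 2**20, 64
for name, Tpar in (("naive simulation", naive_sim_time(n, p)),
                   ("cost-optimal mapping", cost_optimal_time(n, p))):
    print(f"{name:22s} Tpar ~ {Tpar:10.1f}   p*Tpar ~ {p * Tpar:12.1f}   Ts ~ {n}")
```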
• (a) if the algorithm is cost-optimal:
  • p = # of physical PEs
  • each PE simulates n/p virtual PEs
  – Then, if the overall communication does not grow by more than a factor
    of n/p:
  • => the total parallel run-time grows by at most a factor of n/p:
       Tcomp + Tcomm = Ttotal = Tp(p PEs) = (n/p)·Tp(n PEs)
  • => Cpar(p PEs) = p·Tp(p PEs) = p·(n/p)·Tp(n PEs) = n·Tp(n PEs) = Cpar(n PEs)
  • => the new algorithm using p processors (each simulating n/p virtual PEs)
       is cost-optimal (p < n)
• (b) if the algorithm is not cost-optimal for p = n:
  • if we increase granularity
  – => the new algorithm using p processors (p < n) may still not be
    cost-optimal
  – e.g., adding "n" numbers on "p" processors (Hypercube: p < n)
    • n = 2^k  (n = 16)
    • p = 2^w  (p = 4)
    • each virtual PE is simulated by a physical PE
    • the first log p (2 steps) of the log n (4 steps) in the original
      algorithm are simulated in (n/p)·log p time
    • the remaining steps do not require communication (the PEs that
      continue to communicate in the original algorithm are
      simulated by the same PE here)
The Role of Mapping Computations Onto
Processors in Parallel Algorithm Design
• For a cost-optimal parallel algorithm E = Θ(1)
• If a par. alg. on p = n processors is not cost-optimal
  (or cost-nonoptimal), it does not follow that for p < n you can find a
  cost-optimal algorithm
• Even if you find a cost-optimal alg. for p < n, it does not follow that you
  have found the algorithm with the best parallel run-time
• Performance (parallel run-time) depends on (1)
  #processors and (2) data mapping
• Parallel run-time of the same problem (problem
  size) depends on the mapping of virtual PEs onto
  physical PEs
• Performance depends strongly on the data
  mapping onto a coarse-grain parallel computer
  – e.g., multiplying an n*n matrix by a vector on a p-processor
    hypercube
  – parallel FFT on a hypercube w/ cut-through routing
  – W = # of computation steps => Pmax = W
  – For Pmax: each PE executes one step of the algorithm
  – For p < W: each PE executes a larger number of steps
• The choice of the best algorithm to perform local
  computations depends on #PEs
• The optimal algorithm for solving a problem on an
  arbitrary #PEs cannot be obtained from the most
  fine-grained par. alg.
• The analysis of a fine-grained par. alg. may not
  reveal important facts such as:
  – if the message is short (one word only) => the transfer time
    between 2 PEs is the same for store-and-forward and
    cut-through routing
  – if the message is long => cut-through is faster than store-and-forward
  – performance on a hypercube and on a mesh is identical with cut-through
    routing
  – perf. on a mesh with store-and-forward is worse
• Design
  – devise the par. alg. for the finest grain
  – map the data onto the PEs
  – describe the alg. implementation on an arbitrary #
    of PEs
  – Variables:
    • problem size
    • # PEs
Scalability
• Adding n numbers on a p-processor hypercube
  – assume 1 unit of time for:
    • adding 2 numbers
    • communicating with a connected PE
  – adding n/p numbers locally: n/p - 1
  – p partial sums added in log p steps
    • for each sum: 1 addition and 1 communication => 2 log p
  • Tp = n/p - 1 + 2 log p
  • Tp ~ n/p + 2 log p
  • Ts = n - 1 ~ n
  • S = n / (n/p + 2 log p) = np / (n + 2p log p)  => S(n, p)
  • E = S/p = n / (n + 2p log p)  => E(n, p) can be computed for
    any pair n & p
  • As p ↑, to keep increasing S we must increase n; otherwise E ↓
  • For a large problem size, S and E are higher, but they still drop with p
• Scalability of a parallel system is a measure of its
  capability to increase speedup in proportion to the
  number of processors
• Efficiency of adding n numbers on a p processor
hypercube
– for the cost-optimal alg:
    • S = np / (n + 2p log p)
    • E = S/p = n / (n + 2p log p) = E(n, p)

      n      p=1    p=4    p=8    p=16   p=32
      64     1.0    .80    .57    .33    .17
      192    1.0    .92    .80    .60    .38
      320    1.0    .95    .87    .71    .50
      512    1.0    .97    .91    .80    .62
    • n = Θ(p log p) keeps E = 0.80 constant:
      • for n=64  & p=4:  n = 8p log p
      • for n=192 & p=8:  n = 8p log p
      • for n=512 & p=16: n = 8p log p
    • Conclusion:
      – for adding n numbers on p processors w/ the cost-optimal
        algorithm:
        • the alg. is cost-optimal if n = Ω(p log p)
        • the alg. is scalable if n increases proportionally to Θ(p log p) as p
          is increased
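The table and the n = 8p log p observation can be reproduced with a short Python sketch of E(n, p) = n / (n + 2p log2 p):

```python
import math

def efficiency(n, p):
    """E(n, p) = n / (n + 2 p log2 p) for adding n numbers on a p-PE hypercube."""
    return 1.0 if p == 1 else n / (n + 2 * p * math.log2(p))

ns, ps = (64, 192, 320, 512), (1, 4, 8, 16, 32)
print("    n" + "".join(f"   p={p:<4d}" for p in ps))
for n in ns:
    print(f"{n:5d}" + "".join(f"   {efficiency(n, p):6.2f}" for p in ps))

# n = 8 p log2 p keeps E at 0.80 as p grows (the isoefficiency curve for E = 0.8)
for p in (4, 8, 16):
    n = 8 * p * math.log2(p)
    print(f"p={p:2d}  n={n:5.0f}  E={efficiency(n, p):4.2f}")
```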
• Problem size
– for matrix multiply
    • input n => O(n^3);  n' = 2n => O(n'^3) = O(8n^3)
  – for matrix addition
    • input n => O(n^2);  n' = 2n => O(n'^2) = O(4n^2)
• Doubling the size of the problem means
  performing twice the amount of computation
• Computation step: assume it takes 1 time unit
  – message start-up time, per-word transfer time, and per-hop time can be
    normalized w.r.t. the unit computation time
  => W = Ts
    • for the fastest sequential algorithm on a sequential computer
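A tiny sketch (operation counts are illustrative, leading constants ignored) of why "doubling the input dimension" means very different amounts of work for the two kernels above, which is why problem size W is measured in computation steps:

```python
def matmul_ops(n):
    return n ** 3      # classical n x n matrix multiply: Theta(n^3) operations

def matadd_ops(n):
    return n ** 2      # n x n matrix addition: Theta(n^2) operations

n = 256
print("multiply ratio when n doubles:", matmul_ops(2 * n) / matmul_ops(n))  # 8.0
print("addition ratio when n doubles:", matadd_ops(2 * n) / matadd_ops(n))  # 4.0
```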
Overhead Function
• ideal: E=1 & S=p
• reality: E<1 & S<p
• overhead function:
– the time collectively spent by all processors in addition
to that required by the fastest known sequential
algorithm to solve the same problem on a single PE
    T0 = T0(W, p) = p·Tp - W
  – for the cost-optimal alg. of adding n numbers on a p-processor
    hypercube:
    Ts = W = n
    Tp = n/p + 2 log p
    T0 = p·(n/p + 2 log p) - n = 2p log p
Isoefficiency Function
  Tp = T(W, T0, p)
  T0 = p·Tp - W  =>  Tp = (W + T0(W, p)) / p
  S = Ts / Tp = W / Tp = W·p / (W + T0(W, p))
  E = S / p = W / (W + T0(W, p)) = 1 / (1 + T0(W, p)/W)
• If W = ct and p ↑ => E ↓
• If p = ct and W ↑ => E ↑ for scalable parallel systems
  => we need E = ct
• E.g. 1
  – p ↑, W must grow exponentially with p => poorly scalable, since we need
    to increase the problem size very much to obtain good speedups
• E.g. 2
  – p ↑, W grows linearly with p => highly scalable, since the speedup is
    proportional to the number of processors
  E = 1 / (1 + T0(W, p)/W)
  => T0(W, p)/W = (1 - E)/E
  => W = (E / (1 - E))·T0(W, p)
  for a given E, let K = E / (1 - E)
  => W = K·T0(W, p)
• This function dictates the growth rate of W required to keep
  E constant as p increases
• The isoefficiency function does not exist for unscalable parallel systems,
  because E cannot be kept constant as p increases,
  no matter how much or how fast W increases.
Overhead Function
• Adding n numbers on p processors hypercube:
cost-optimal
  Ts = n
  Tp = n/p + 2 log p
  T0 = p·Tp - Ts = p·(n/p + 2 log p) - n = 2p log p
Isoefficiency Function
  T0 = 2p log p = T0(p)
  W = K·T0(W, p) = 2K·p log p
  => the asymptotic isoefficiency function is Θ(p log p)
  meaning:
  (1) increasing #PE from p to p' > p requires the problem size to be
      increased by a factor of (p' log p') / (p log p)
      to have the same efficiency as on p processors.
  (2) increasing #PE from p to p' > p by a factor of p'/p requires the
      problem size to grow by a factor of (p' log p') / (p log p) to
      increase the speedup by p'/p.
• Here the communication overhead is an exclusive
  function of p: T0 = T0(p)
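A short sketch of this isoefficiency relation, W = K·T0(p) with K = E/(1 - E), showing how fast the problem size must grow to hold E = 0.8 for the adding-n-numbers example (the values match the efficiency table earlier):

```python
import math

def required_W(p, E=0.8):
    """Problem size needed to hold efficiency E when T0(p) = 2 p log2 p."""
    K = E / (1 - E)
    return K * 2 * p * math.log2(p)

for p in (4, 8, 16, 32, 64):
    print(f"p={p:3d}  W needed for E=0.8: {required_W(p):8.0f}")
```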
In general
  T0 = T0(W, p)
  W = K·T0(W, p)
• E = ct needs the ratio T0 / W
  to stay fixed as p ↑ and W ↑ in order to obtain
  nondecreasing efficiency (E' >= E)
  => T0 should not grow faster than W
• If T0 has multiple terms, we balance W against each
  term of T0 and compute the respective
  isoefficiency function for the individual terms.
• The component of T0 that requires the problem
  size to grow at the highest rate with respect to p
  determines the overall asymptotic isoefficiency of
  the parallel system.
  E.g.:  T0 = p^(3/2) + p^(3/4)·W^(3/4)
  – balance against the 1st term:  W = K·p^(3/2) = Θ(p^(3/2))
  – balance against the 2nd term:  W = K·p^(3/4)·W^(3/4)
    => W^(1/4) = K·p^(3/4)  =>  W = K^4·p^3 = Θ(p^3)
• Take the higher of the 2 rates to ensure E does not
  decrease: the problem size needs to grow as Θ(p^3) (the overall
  asymptotic isoefficiency)
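A quick numeric sanity check (numbers are my own) that the dominant term really dictates the required growth rate: with W ~ p^3 the ratio T0/W stays bounded, while W ~ p^(3/2) does not suffice:

```python
def overhead(W, p):
    # Hypothetical overhead function from the example above
    return p ** 1.5 + p ** 0.75 * W ** 0.75

for p in (16, 64, 256, 1024):
    for label, W in (("W = p^(3/2)", p ** 1.5), ("W = p^3    ", p ** 3)):
        print(f"p={p:5d}  {label}  T0/W = {overhead(W, p) / W:10.3f}")
```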
Isoefficiency Functions
• Captures the characteristics of the parallel algorithm
  and the parallel architecture
• predicts the impact on performance as #PE ↑
• characterizes the amount of parallelism in a parallel
  algorithm
• Study of parallel alg. behavior due to HW changes
  – speed of the PEs
  – communication channels
Cost-optimality and Isoefficiency
  Ts / (p·Tp) = ct
  => p·Tp = Θ(W)
  T0 = p·Tp - W
  => W + T0(W, p) = Θ(W)
  => T0(W, p) = O(W)
  => W = Ω(T0(W, p))
A parallel system is cost-optimal iff its overhead function
does not grow asymptotically faster than the problem size.
Relationship between Cost-optimality and
Isoefficiency Function
• E.g. add “n” numbers on “p” processors hypercube
  – (a) non-cost-optimal version:
    W = O(n) = n
    Tp = O((n/p)·log p)
    T0 = p·Tp - W = Θ(n log p)
    => W = K·Θ(n log p)  (cannot be satisfied for a fixed K as p grows,
       so E cannot be maintained)
  – (b) cost-optimal version:
    W = O(n) = n
    Tp = O(n/p + log p)
    p·Tp = Θ(n + p log p)  =>  T0 = Θ(p log p)
    => W = K·(p log p)
  The problem size should grow at least as p log p s.t. the
  parallel algorithm is scalable.
Isoefficiency Function
• Determines the ease with which a parallel system
  can maintain a constant efficiency and thus
  achieve speedups increasing in proportion to the
  number of processors
• A small isoefficiency function means that small
  increments in the problem size are sufficient for
  the efficient utilization of an increasing number of
  processors => indicates the parallel system is
  highly scalable
• A large isoeff. function indicates a poorly scalable
  parallel system
• The isoefficiency function does not exist for
  unscalable parallel systems, because in such
  systems the efficiency cannot be kept at any
  constant value as p ↑, no matter how fast the
  problem size is increased.
Lower bound on the isoefficiency
• Small isoefficiency function => high scalability
  – for a problem of size W: Pmax <= W for a cost-optimal
    system (if Pmax > W, some PEs are idle)
  – if W < Θ(p), then as p ↑, at some point #PE > W => E ↓
  – => asymptotically: W = Ω(p)
    • the problem size must increase at least proportionally to Θ(p) to
      maintain a fixed efficiency
    • Ω(p) is the asymptotic lower bound on the isoefficiency function;
      it is reached when Pmax = Θ(W), so the isoeff. function for an ideally
      scalable parallel system is W = Θ(p)
Degree of Concurrency & Isoefficiency
Function
• Max # of tasks that can be executed simultaneously
  at any time: C(W)
• independent of the parallel architecture
  – no more than C(W) processors can be employed
    effectively
Effect of Concurrency on Isoefficiency
Function
• E.g.:
  W = Θ(n^3),  p = Θ(n^2)
  => C(W) = Θ(W^(2/3))
  => at most Θ(W^(2/3)) processors can be used efficiently
  => given p, the problem size should be at least Θ(p^(3/2)) to use them all
• The isoefficiency due to concurrency is Θ(p) only if the degree of
  concurrency of the parallel algorithm is Θ(W)
• If the degree of concurrency of an algorithm is less
  than Θ(W), then the isoefficiency function due to
  concurrency is worse, i.e., greater than Θ(p)
• Overall isoeff. function of a parallel system:
  Isoeff_system = max(Isoeff_concurrency, Isoeff_communication, Isoeff_overhead)
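Using the example's numbers, a tiny sketch of how the degree of concurrency caps the usable processor count (C(W) = W^(2/3), so employing p processors needs W of at least about p^(3/2)):

```python
def degree_of_concurrency(W):
    return W ** (2 / 3)          # C(W) when W = Theta(n^3) and p = Theta(n^2)

def min_problem_size(p):
    return p ** 1.5              # smallest W with C(W) >= p

for p in (64, 256, 1024):
    W = min_problem_size(p)
    print(f"p={p:5d}  need W >= {W:8.0f}  (then C(W) = {degree_of_concurrency(W):6.0f})")
```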
Sources of Parallel Overhead
• The overhead function characterizes a parallel
system
• Given the overhead function T0 = T0(W, p),
  we can express Tp, S, E, and p·Tp as functions fi(W, p)
• The overhead function encapsulates all causes of
  inefficiency of a parallel system, due to the:
  – algorithm
  – architecture
  – alg.-architecture interaction
Sources of Overhead
• Interprocessor communication
  – each PE spends: tcomm
  – overall interproc. communication: p·tcomm
• Load imbalance
  – idle vs. busy PEs
    • idle PEs contribute to overhead
• Extra computation
  – redundant computation
  – W: for the best sequential alg.
    W': for a sequential alg. that is easily parallelizable
    W' - W contributes to overhead
  – W = Ws + Wp => Ws is executed by 1 PE only => (p-1)·Ws
    contributes to overhead
Minimum Execution Time
(assume p is not a constraint)
• Tp = Tp(W, p)
• For a given W, Tp^min is obtained from dTp/dp = 0,
  which gives the p0 for which Tp = Tp^min
• E.g., adding n numbers on a hypercube:
  Tp = n/p + 2 log p
  dTp/dp = -n/p^2 + 2/p = 0  =>  p0 = n/2
  Tp^min = 2 log n
• Cost-sequential: Θ(n)
• Cost-parallel: p0·Tp^min = (n/2)·2 log n = Θ(n log n)
  – not cost-optimal: running this alg. at Tp^min is not cost-optimal,
    even though the alg. itself is cost-optimal (for a suitable p)
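A quick numeric sketch (scanning powers of two only) that locates Tp^min for Tp = n/p + 2 log2 p and compares it with the analytic p0 ~ n/2 and Tp^min ~ 2 log2 n above; the discrete minimiser can differ from n/2 while reaching the same Tp^min:

```python
import math

def Tp(n, p):
    return n / p + 2 * math.log2(p)

n = 1024
p0 = min((2 ** k for k in range(1, int(math.log2(n)) + 1)), key=lambda p: Tp(n, p))
print("numeric :  p0 =", p0, "  Tp =", Tp(n, p0))
print("analytic:  p0 ~", n // 2, "  Tp_min ~", 2 * math.log2(n))
```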
• Derive a lower bound for Tp s.t. the parallel cost is optimal
  Tp^cost_opt -- par. run-time s.t. the cost is optimal
              -- W fixed
  If the isoeff. function is Θ(f(p)),
  then a problem of size W can be executed cost-optimally only
  if W = Ω(f(p))
  => p = O(f^-1(W)) is required for a cost-optimal solution
  Tp for a cost-optimal alg. is Θ(W/p),
  since p·Tp = Θ(W)  =>  Tp = Θ(W/p)
  p = O(f^-1(W))  =>
  Tp^cost_opt = Ω(W / f^-1(W))
Minimum Cost-optimal Time for Adding n
Numbers on a Hypercube
• Isoefficiency function:
  T0 = p·Tp - W
  Tp = n/p + 2 log p  =>
  T0 = p·(n/p + 2 log p) - n = 2p log p
  W = K·T0 = 2K·p log p = Θ(p log p)
• if W = n = f(p) = p log p,
  then log n = log p + log log p ~ log p
• n = f(p) = p log p,  p = f^-1(n)
  n = p log p  =>  p = n / log p ~ n / log n
  => f^-1(n) ~ n / log n
  => f^-1(W) = Θ(n / log n)
• The cost-optimal solution requires p = O(f^-1(W)):
  for a cost-optimal solution, p = Θ(n / log n)
• For p = Θ(n / log n):
  Tp = Tp^cost_opt = n/p + 2 log p
  => Tp^cost_opt = log n + 2 log(n / log n) = 3 log n - 2 log log n = Θ(log n)
• Note:
  Tp^min = Θ(log n)
  Tp^cost_opt = Θ(log n)
  => Tp^cost_opt = Θ(Tp^min)
  The cost-optimal solution is the best asymptotic solution
  in terms of execution time.
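A short numeric check (my own test values) that p ~ n / log2 n indeed gives Tp close to 3 log2 n - 2 log2 log2 n:

```python
import math

def Tp(n, p):
    return n / p + 2 * math.log2(p)

for n in (2 ** 10, 2 ** 16, 2 ** 20):
    p = round(n / math.log2(n))                                   # p = Theta(n / log n)
    predicted = 3 * math.log2(n) - 2 * math.log2(math.log2(n))    # Theta(log n)
    print(f"n=2^{int(math.log2(n)):2d}  p={p:7d}  Tp={Tp(n, p):7.2f}  predicted={predicted:7.2f}")
```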
• Parallel system where Tp^cost_opt ≠ Θ(Tp^min):
  T0 = p^(3/2) + p^(3/4)·W^(3/4)
  Tp = (W + T0) / p = W/p + p^(1/2) + W^(3/4) / p^(1/4)
  dTp/dp = 0  =>  p0 = Θ(W)
  Tp^min = Θ(W^(1/2))
  – Isoeff. function:
    W = K·T0  =>  W = K^4·p^3 = Θ(p^3)
    => pmax = Θ(W^(1/3)): the max # of PEs for which the alg. is cost-optimal
  – For p = Θ(W^(1/3)):
    Tp^cost_opt = Θ(W^(2/3))
  => Tp^cost_opt > Tp^min asymptotically
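A closing numeric illustration (constants are arbitrary) of this hypothetical system: near p = Θ(W) the run-time behaves like W^(1/2), while at the largest cost-optimal machine size p = Θ(W^(1/3)) it behaves like W^(2/3):

```python
def Tp(W, p):
    T0 = p ** 1.5 + p ** 0.75 * W ** 0.75   # hypothetical overhead from the example
    return (W + T0) / p

for W in (10 ** 6, 10 ** 9):
    p_min = W                     # p0 = Theta(W) from dTp/dp = 0
    p_opt = round(W ** (1 / 3))   # p_max = Theta(W^(1/3)) from the isoefficiency
    print(f"W={W:>10}  Tp(p~W)={Tp(W, p_min):12.1f}  vs W^0.5={W ** 0.5:10.1f}   "
          f"Tp(p~W^(1/3))={Tp(W, p_opt):12.1f}  vs W^(2/3)={W ** (2/3):12.1f}")
```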