SCALABILITY ANALYSIS

PERFORMANCE AND SCALABILITY OF PARALLEL SYSTEMS

Evaluation
• Sequential: runtime (execution time) T_s = T(input size)
• Parallel: runtime (from start until the last PE finishes) T_p = T(input size, p, architecture)
Note: a parallel algorithm cannot be evaluated in isolation from the parallel architecture.
• Parallel system = parallel algorithm + parallel architecture
• Metrics evaluate the performance of the parallel system.

SCALABILITY: the ability of a parallel algorithm to achieve performance gains proportional to the number of PEs.

PERFORMANCE METRICS

1. Run time. The serial run time (T_s) of a program is the time elapsed between the beginning and the end of its execution on a sequential computer. The parallel run time (T_p) is the time elapsed from the moment a parallel computation starts to the moment the last processor finishes execution.

2. Speedup. How much performance is gained by running the application on p (identical) processors. Speedup is a measure of the relative benefit of solving a problem in parallel:
S = T_s / T_p
where T_s is the run time of the fastest sequential algorithm for solving the same problem. If that algorithm is
– not known yet (only a lower bound is known), or
– known, but with constants so large at run time that it is impractical to implement,
then take the fastest known sequential algorithm that can be practically implemented.
Formally, speedup S is the ratio of the serial run time of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on p processors. Speedup is a relative metric.
Normally S ≤ p; S > p is called superlinear speedup.

Example: adding n numbers on n processors (hypercube)
Initially each processor is assigned one of the numbers to be added; at the end of the computation, one processor stores the sum of all the numbers. Assume n = 16; processors as well as numbers are labeled 0 to 15.
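As a concrete illustration (a minimal Python sketch, not part of the original notes), the log n-step hypercube reduction can be simulated directly; PE labels that differ in exactly one bit are hypercube neighbors:

```python
import math

def hypercube_sum(values):
    """Simulate adding n = 2^k numbers on n hypercube PEs.

    In step d, every PE whose label has bit d set (and bits 0..d-1 clear)
    sends its partial sum to its neighbor across dimension d; after
    k = log2(n) parallel steps, PE 0 holds the total."""
    n = len(values)
    k = int(math.log2(n))
    partial = list(values)
    for d in range(k):
        for pe in range(n):
            if pe & (1 << d) and not pe & ((1 << d) - 1):
                partial[pe ^ (1 << d)] += partial[pe]
    return partial[0], k

total, steps = hypercube_sum(list(range(16)))
print(total, steps)  # 120 in log2(16) = 4 parallel steps
```

For n = p = 16 this performs the Θ(log n) parallel steps against Θ(n) sequential work, giving S = Θ(n / log n).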
The sum of the numbers with consecutive labels i through j is denoted Σ(i..j). For this algorithm:
T_s = Θ(n)
T_p = Θ(log n)
S = Θ(n / log n)    (n = p = 2^k)

Efficiency (E): a measure of how effectively the problem is solved on p processors:
E = S / p,  E ∈ (0, 1]
E measures the fraction of time for which a processor is usefully employed. An ideal parallel system would deliver a speedup equal to p; ideal behavior is not achieved because processors cannot devote 100 percent of their time to the computation of the algorithm.
For the adding example with p = n: E = Θ(1 / log n).

Cost: the cost of solving a problem on a parallel system is the product of the parallel run time and the number of processors:
C_seq_fast = T_s
C_par = p · T_p
The system is cost-optimal when C_par ~ C_seq_fast.
For the adding example with p = n: C_seq_fast = Θ(n) but C_par = Θ(n log n) — not cost-optimal.

Granularity:
• p = n: fine granularity, E = Θ(1 / log n)
• p < n: coarse granularity (scaling down)

EFFECTS OF GRANULARITY ON COST-OPTIMALITY
[Figure: a √n × √n grid of n virtual PEs mapped onto p physical PEs.]
Assume n virtual PEs. If there are p physical PEs, each physical PE simulates n/p virtual PEs, so the computation at each PE increases by a factor of n/p.
Note: even if p < n, this does not necessarily yield a cost-optimal algorithm.

Adding n numbers on p processors (hypercube), p < n: n = 2^k, p = 2^m (e.g. n = 16, p = 4)
• The first log p steps of the original algorithm involve computation plus communication; they are simulated in Θ((n/p) log p) time (the first 8 steps for n = 16, p = 4).
• The remaining steps are computation only: Θ(n/p) (the last 4 steps).
Parallel execution time: T_par = Θ((n/p) log p)
C_par = p · Θ((n/p) log p) = Θ(n log p), while C_seq_fast = Θ(n)
As p increases this grows asymptotically — not cost-optimal.

A cost-optimal algorithm:
• each PE adds its n/p numbers locally: Θ(n/p)
• the p partial sums are combined (computation + communication): Θ(log p)
T_par = Θ(n/p + log p) = Θ(n/p) for n > p log p
C_par = p · T_par = Θ(n) = T_seq_fast = C_seq_fast ⇒ cost-optimal

a) If the algorithm is cost-optimal for p = n:
With p physical PEs, each PE simulates n/p virtual PEs. If the overall communication does not grow by more than a factor of n/p (proper mapping), then the total parallel run time grows by at most a factor of n/p.
T_total = (n/p)(T_comp + T_comm) = (n/p) · T_{p=n}
C_par(p<n) = p · T_{p<n} = p · (n/p) · T_{p=n} = n · T_{p=n} = C_par(p=n)
⇒ the new algorithm using p < n processors is also cost-optimal.

b) If the algorithm is NOT cost-optimal for p = n:
Increasing the granularity does not help in general — the new algorithm using p < n processors may still not be cost-optimal.
Example: adding n numbers on p processors (hypercube), n = 2^k, p = 2^m (e.g. n = 16, p = 4). Each virtual PE i is simulated by physical PE (i mod p). The first log p steps (2 of the log n = 4 steps in the original algorithm) are simulated in (n/p) · log p steps (16/4 · 2 = 8 steps on p = 4 processors). The remaining steps do not require communication (the PEs that continue to communicate in the original algorithm are simulated by the same physical PE here).

THE ROLE OF MAPPING COMPUTATIONS ONTO PROCESSORS IN PARALLEL ALGORITHM DESIGN
• For a cost-optimal parallel algorithm, E = Θ(1).
• If a parallel algorithm on p = n processors is not cost-optimal, it does not follow that a cost-optimal algorithm can be found for p < n.
• Even if a cost-optimal algorithm is found for p < n, it does not follow that it has the best parallel run time.
• Performance (parallel run time) depends on the number of processors and on the data mapping (assignment). The parallel run time for the same problem size depends on how the virtual PEs are mapped onto the physical PEs; performance critically depends on the data mapping onto a coarse-grained parallel computer.
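The cost-optimal coarse-grained scheme for p < n (local sums of n/p numbers, then a log p hypercube reduction) can be sketched the same way; this is an illustrative sketch, with the step count returned so it matches T_p = n/p − 1 + 2 log p:

```python
import math

def block_hypercube_sum(values, p):
    """Cost-optimal adding for p < n: each PE first sums its block of n/p
    numbers locally (n/p - 1 additions), then the p partial sums are
    combined by a log2(p)-step hypercube reduction (1 add + 1 comm each)."""
    n = len(values)
    block = n // p
    partial = [sum(values[i * block:(i + 1) * block]) for i in range(p)]
    k = int(math.log2(p))
    for d in range(k):
        for pe in range(p):
            if pe & (1 << d) and not pe & ((1 << d) - 1):
                partial[pe ^ (1 << d)] += partial[pe]
    steps = (block - 1) + 2 * k   # Tp = n/p - 1 + 2*log2(p)
    return partial[0], steps

print(block_hypercube_sum(list(range(16)), 4))  # (120, 7): 3 local adds + 4 reduction steps
```

With p fixed and n growing, the Θ(n/p) local phase dominates and the cost p·T_p stays Θ(n).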
Example: multiplying an n×n matrix by a vector on a p-processor hypercube — p square blocks vs. p slices of n/p rows.
Example: parallel FFT on a hypercube with cut-through routing.
W computation steps ⇒ P_max = W.
• With P_max processors, each PE executes one step of the algorithm.
• For p < W, each PE executes a larger number of steps.
The choice of the best algorithm for the local computations depends on the number of PEs (how finely the problem is fragmented). An optimal algorithm for solving a problem on an arbitrary number of PEs cannot be obtained from the most fine-grained parallel algorithm, and the analysis of the fine-grained algorithm may not reveal facts that show up in the analysis of the coarse-grained parallel algorithm. Notes:
1) If the message is short (one word only), the transfer time between two PEs is the same for store-and-forward and cut-through routing.
2) If the message is long, cut-through routing is faster than store-and-forward.
3) With cut-through routing, performance on a hypercube and on a mesh is identical.
4) Performance on a mesh with store-and-forward routing is worse.
Design:
1) devise the parallel algorithm for the finest grain;
2) map the data onto the PEs;
3) describe the implementation of the algorithm on an arbitrary number of PEs.
Variables: problem size, number of PEs.

SCALABILITY
S ≤ p; study S(p) and E(p).
Example: adding n numbers on a p-processor hypercube. Assume one unit of time both for adding two numbers and for communicating with a connected PE.
1) Each PE adds its n/p numbers locally: n/p − 1 steps.
2) The p partial sums are added in log p steps; each step is one addition plus one communication: 2 log p.
T_p = n/p − 1 + 2 log p ≈ n/p + 2 log p
T_s = n − 1 ≈ n
S = n / (n/p + 2 log p) = np / (n + 2p log p)
E = S / p = n / (n + 2p log p)
S(n, p) and E(n, p) can be computed for any pair of n and p.
As p increases, S saturates unless n is also increased, and E drops. For larger problem sizes, S and E are higher, but they still drop as p increases; curves of constant efficiency (E = const) can be traced.
Scalability of a parallel system: a measure of its capacity to increase speedup in proportion to the number of processors.
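The closed forms S(n, p) and E(n, p) above are easy to tabulate; a small sketch (logs base 2, matching the hypercube step count):

```python
import math

def Tp(n, p):
    """Model from the notes: n/p local additions + 2*log2(p) for the reduction."""
    return n / p + 2 * math.log2(p)

def speedup(n, p):
    return n / Tp(n, p)            # S = n*p / (n + 2*p*log2 p)

def efficiency(n, p):
    return speedup(n, p) / p       # E = n / (n + 2*p*log2 p)

# E drops as p grows for a fixed n, and recovers when n grows with p:
for n in (64, 512, 4096):
    print(n, [round(efficiency(n, p), 2) for p in (4, 16, 64)])
```

Running this shows the saturation effect directly: each row (fixed n) decreases across p, while each column (fixed p) increases with n.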
Efficiency of adding n numbers on a p-processor hypercube (cost-optimal algorithm):
S = np / (n + 2p log p)
E = n / (n + 2p log p),  E = E(n, p)
E stays constant when n = Ω(p log p); for example, E = 0.80 along n = 8p log p:
• n = 64 for p = 4
• n = 192 for p = 8
• n = 512 for p = 16
Conclusions (adding n numbers on a p-processor hypercube with the cost-optimal algorithm):
• The algorithm is cost-optimal if n = Ω(p log p).
• The algorithm is scalable if n increases proportionally to Θ(p log p) as p is increased.

PROBLEM SIZE
Input size is a misleading measure of problem size:
• matrix multiply: input size n ⇒ O(n^3) work; n' = 2n ⇒ O(n'^3) ≡ O(8n^3)
• matrix addition: input size n ⇒ O(n^2) work; n' = 2n ⇒ O(n'^2) ≡ O(4n^2)
The problem size W is therefore defined as the number of computation steps of the fastest sequential algorithm on a sequential computer, so that doubling the problem size means performing twice the amount of computation. Assuming each computation step takes 1 time unit, W = T_s; message start-up time, per-word transfer time, and per-hop time can be normalized with respect to this unit computation time.

OVERHEAD FUNCTION
E = 1, S = p: ideal.
E < 1, S < p: in reality, because of overhead (interprocessor communication, etc.).
The overhead function is the time collectively spent by all processors in addition to that required by the fastest sequential algorithm to solve the same problem on a single PE:
T_o = T_o(W, p) = p·T_p − W
For the cost-optimal algorithm of adding n numbers on a p-processor hypercube:
T_s = W = n
T_p = n/p + 2 log p
T_o = p(n/p + 2 log p) − n = 2p log p

ISOEFFICIENCY FUNCTION
Parallel execution time is a function of the problem size, the overhead function, and the number of processors:
T_p = T(W, T_o, p) = (W + T_o(W, p)) / p
T_o = p·T_p − W
S = T_s / T_p = W / T_p = Wp / (W + T_o(W, p))
E = S / p = W / (W + T_o(W, p)) = 1 / (1 + T_o(W, p)/W)
• If W = const and p increases, then E decreases.
• If p = const and W increases, then E increases for scalable parallel systems.
For scalable, effective systems we need E = const.
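A quick numeric check (sketch) that the overhead of the cost-optimal adding algorithm is independent of the problem size and equals 2p log p:

```python
import math

def overhead(W, p):
    """T_o = p*T_p - W with T_p = W/p + 2*log2(p)."""
    Tp = W / p + 2 * math.log2(p)
    return p * Tp - W

# The W/p term cancels against W, leaving T_o = 2*p*log2(p) for any W:
for W in (64, 1024, 10**6):
    print(W, overhead(W, 8))   # always 48.0 = 2 * 8 * log2(8)
```

This is why, for this algorithm, T_o = T_o(p) only: all the useful work scales out of the overhead.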
Example 1: if W must grow exponentially with p to keep E constant, the system is poorly scalable — the problem size must be increased very much to obtain good speedups on more processors.
Example 2: if W needs to grow only linearly with p, the system is highly scalable — the speedup is then proportional to the number of processors.

From E = 1 / (1 + T_o(W, p)/W), keeping E constant (E = ct for a given E) gives
W = (E / (1 − E)) · T_o(W, p)
With K = E / (1 − E) = const:
W = K · T_o(W, p)    (the isoefficiency function)
This function dictates the growth rate of W required to keep E constant as p increases. The isoefficiency function does not exist for unscalable parallel systems, because there E cannot be kept constant as p increases, no matter how much or how fast W increases.

Overhead function (adding n numbers on a p-processor hypercube):
T_s = n
T_p = n/p + 2 log p
T_o = p·T_p − T_s = p(n/p + 2 log p) − n = 2p log p

Isoefficiency function: W = K·T_o(W, p) with T_o = 2p log p (note: here T_o = T_o(p) only):
W = 2Kp log p ⇒ the asymptotic isoefficiency function is Θ(p log p).
Meaning:
1) If the number of PEs is increased from p to p' > p, the problem size has to be increased by a factor (p' log p') / (p log p) to obtain the same efficiency as on p processors.
2) Increasing the number of PEs by a factor p'/p requires the problem size to grow by a factor (p' log p') / (p log p) in order to increase the speedup by p'/p.

Here the communication overhead is an exclusive function of p: T_o = T_o(p). In general T_o = T_o(W, p), and W = K·T_o(W, p) may involve many terms and can be hard to solve for W in terms of p.
E = const requires the ratio T_o/W to stay fixed. As p increases, W must grow to obtain a nondecreasing efficiency (E' ≥ E) ⇒ T_o should not grow faster than W; none of the terms of T_o may grow faster than W. If T_o has multiple terms, we balance W against each term of T_o and compute the respective isoefficiency function for each individual term. The component of T_o that requires the problem size to grow at the highest rate with respect to p determines the overall asymptotic isoefficiency function of the parallel system.
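The isoefficiency relation W = K·T_o with K = E/(1−E) can be verified numerically for the adding example (sketch); for E = 0.80, K = 4, so W = 8p log p, matching the n = 8p log p table earlier:

```python
import math

def iso_W(E, p):
    """Problem size holding efficiency at E: W = K * T_o, K = E/(1-E), T_o = 2*p*log2(p)."""
    K = E / (1 - E)
    return K * 2 * p * math.log2(p)

def efficiency(W, p):
    To = 2 * p * math.log2(p)
    return W / (W + To)          # E = 1 / (1 + T_o/W)

# Growing W as Θ(p log p) holds E constant at 0.80:
for p in (4, 8, 16, 64):
    W = iso_W(0.80, p)
    print(p, round(W), round(efficiency(W, p), 2))
```

For p = 4, 8, 16 this reproduces W = 64, 192, 512 from the notes.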
Example 1: T_o = p^(3/2) + p^(3/4)·W^(3/4)
• W = K·p^(3/2) ⇒ Θ(p^(3/2))
• W = K·p^(3/4)·W^(3/4) ⇒ W^(1/4) = K·p^(3/4) ⇒ W = K^4·p^3 ⇒ Θ(p^3)
Take the higher of the two rates: the isoefficiency function of this example is Θ(p^3).

Example 2: T_o = p^(1/2)·W^(1/2) + p^(3/5) + p^(3/4)·W^(3/4) + p^(3/2) + p
W = K·T_o = K·(p^(1/2)·W^(1/2) + p^(3/5) + p^(3/4)·W^(3/4) + p^(3/2) + p)
Balancing W against each term:
• W = K·p^(1/2)·W^(1/2) ⇒ W = K^2·p = Θ(p)
• W = K·p^(3/5) = Θ(p^(3/5))
• W = K·p^(3/4)·W^(3/4) ⇒ W = K^4·p^3 = Θ(p^3)
• W = K·p^(3/2) = Θ(p^(3/2))
• W = K·p = Θ(p)
Therefore W = Θ(p^3), the highest of all these rates: to ensure that E does not decrease, the problem size needs to grow as Θ(p^3) (the overall asymptotic isoefficiency).

Isoefficiency functions:
• capture characteristics of the parallel algorithm and the parallel architecture;
• predict the impact on performance as the number of PEs increases;
• characterize the amount of parallelism in a parallel algorithm;
• allow studying the behavior of a parallel system under hardware changes (PE speed, communication channel speed, etc.).

COST-OPTIMALITY AND THE ISOEFFICIENCY FUNCTION
Cost-optimality: p·T_p / T_s = const
p·T_p = Θ(W)
W + T_o(W, p) = Θ(W)     (since T_o = p·T_p − W)
T_o(W, p) = O(W), i.e. W = Ω(T_o(W, p))
A parallel system is cost-optimal if and only if its overhead function does not grow asymptotically faster than the problem size.

Relationship between cost-optimality and the isoefficiency function — adding n numbers on a p-processor hypercube:
a) Non-cost-optimal version: W = Θ(n), T_p = Θ((n/p) log p), T_o = p·T_p − W = Θ(n log p).
W = K·Θ(n log p) cannot hold for any constants K and E ⇒ the algorithm is not cost-optimal, the isoefficiency function does not exist, and the system is not scalable.
b) Cost-optimal version: W = Θ(n), T_p = Θ(n/p + log p), T_o = Θ(n + p log p) − Θ(n) = Θ(p log p).
W = K·Θ(p log p) ⇒ W = Ω(p log p): n >> p is required for cost-optimality. The problem size should grow at least as p log p, and with that growth the parallel system is scalable.
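Example 1 can be sanity-checked numerically (sketch): with W growing as p^3 the ratio T_o/W stays bounded (so E can be held constant), while a slower growth such as W = p^(3/2) lets the ratio blow up:

```python
def To(W, p):
    """Overhead function of Example 1."""
    return p**1.5 + p**0.75 * W**0.75

for p in (10, 100, 1000):
    fast = To(p**3, p) / p**3        # bounded: tends to 1 as p grows
    slow = To(p**1.5, p) / p**1.5    # unbounded: grows like p**0.375
    print(p, round(fast, 4), round(slow, 2))
```

Since E = 1/(1 + T_o/W), a bounded ratio corresponds to an efficiency bounded away from zero.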
ISOEFFICIENCY FUNCTION (summary)
The isoefficiency function determines the ease with which a parallel system can maintain a constant efficiency and thus achieve speedups increasing in proportion to the number of processors.
• A small isoefficiency function means that small increments in the problem size are sufficient for the efficient utilization of an increasing number of processors ⇒ the parallel system is highly scalable.
• A large isoefficiency function indicates a poorly scalable parallel system.
• The isoefficiency function does not exist for unscalable parallel systems, because in such systems the efficiency cannot be kept at any constant value as p increases, no matter how fast the problem size is increased.

Lower bound on the isoefficiency function
A small isoefficiency function means higher scalability.
• For a problem of size W, P_max ≤ W for a cost-optimal system (if P_max > W, some PEs are idle).
• If W grows slower than Θ(p), then as p increases the number of PEs eventually exceeds W and E drops.
⇒ Asymptotically, the problem size must grow at least proportionally to Θ(p) to maintain a fixed efficiency: W = Ω(p). Thus Ω(p) is the asymptotic lower bound on the isoefficiency function.
• Since P_max = Θ(W) (p can grow at most as fast as W), the isoefficiency function of an ideal parallel system is W = Θ(p).

Example (which overhead function is most scalable?):
T_o1 = a·p^5·W^(4/5) + b·p^3
T_o2 = c·p^5 + d·p^3 + e·p^2
T_o3 = f·p^5 + g·p^5·W^(4/5) + h·p^(3/4)·W^(3/4)
First compute the isoefficiency function of each overhead function, then take the lowest of the three:
• T_o1: Θ(p^25)
• T_o2: Θ(p^5)
• T_o3: Θ(p^25)
Therefore T_o2, with isoefficiency Θ(p^5), is the most scalable of the three.
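The term-balancing rule generalizes: a term of the form p^a·W^b in T_o, balanced via W = K·p^a·W^b, gives W = Θ(p^(a/(1−b))). A small helper (sketch, using exact fractions to avoid float noise) reproduces the three results above:

```python
from fractions import Fraction as F

def iso_exponent(terms):
    """terms: (a, b) pairs for overhead terms of the form p**a * W**b.
    Balancing W = K * p**a * W**b gives W = Θ(p**(a/(1-b)));
    the dominant (largest) exponent determines the isoefficiency."""
    return max(a / (1 - b) for a, b in terms)

To1 = [(F(5), F(4, 5)), (F(3), F(0))]                 # a*p^5*W^(4/5) + b*p^3
To2 = [(F(5), F(0)), (F(3), F(0)), (F(2), F(0))]      # c*p^5 + d*p^3 + e*p^2
To3 = [(F(5), F(0)), (F(5), F(4, 5)), (F(3, 4), F(3, 4))]
print(iso_exponent(To1), iso_exponent(To2), iso_exponent(To3))  # 25 5 25
```

The p^5·W^(4/5) term is what forces Θ(p^25) for T_o1 and T_o3: a = 5, b = 4/5, so a/(1−b) = 25.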
DEGREE OF CONCURRENCY AND THE ISOEFFICIENCY FUNCTION
The degree of concurrency C(W) is the maximum number of tasks that can be executed simultaneously at any time. It is independent of the parallel architecture; no more than C(W) processors can be employed effectively.

Effect of concurrency on the isoefficiency function — example, Gaussian elimination:
W = Θ(n^3), and at most p = Θ(n^2) processors can be used, so C(W) = Θ(W^(2/3)).
Given p, the problem size must be at least W = Ω(p^(3/2)) to use all p processors ⇒ the isoefficiency due to concurrency is Θ(p^(3/2)).
The isoefficiency function due to concurrency is optimal, i.e. Θ(p), only if the degree of concurrency of the parallel algorithm is Θ(W). If the degree of concurrency is less than Θ(W), the isoefficiency function due to concurrency is worse, i.e. greater than Θ(p).

SOURCES OF OVERHEAD
• Interprocessor communication: each PE spends t_comm; the overall interprocessor communication is p·t_comm (architecture impact).
• Load imbalance: idle vs. busy PEs (idling contributes to overhead). Example: during the sequential part W_s, one PE does useful work while the other p − 1 PEs idle. With W = W_s + W_p, W_s is executed by one PE only ⇒ (p − 1)·W_s contributes to the overhead function.
• Extra computation:
1) redundant computation (e.g. fast Fourier transform);
2) if W is the work of the best sequential algorithm and W' that of a sequential algorithm that is easily parallelizable, then W' − W contributes to overhead.
• Overhead of scheduling.

Overall isoefficiency function of a parallel system:
Isoeff_system = max(Isoeff_concurrency, Isoeff_communication, Isoeff_overhead)

SOURCES OF PARALLEL OVERHEAD (summary)
The overhead function characterizes a parallel system. Given T_o = T_o(W, p), we can express T_p, S, E, and p·T_p (cost) as functions f_i(W, p). The overhead function encapsulates all causes of inefficiency of a parallel system, due to the algorithm, the architecture, and the algorithm–architecture interaction.

MINIMUM EXECUTION TIME (adding n numbers on a hypercube; assume p is not a constraint)
As we increase the number of processors for a given problem size, either the parallel run time continues to decrease and asymptotically approaches a minimum value, or it starts rising after attaining a minimum value.
T_p = T_p(W, p). For a given W, the minimizing processor count p_0 solves dT_p/dp = 0, and T_p(p_0) = T_p^min.
Example: the parallel run time for adding n numbers on a p-processor hypercube is approximated by
T_p = n/p + 2 log p
dT_p/dp = −n/p^2 + 2/p = 0 ⇒ −n + 2p = 0 ⇒ p_0 = n/2
T_p^min = 2 log n
The processor-time product for p = p_0 is p_0·T_p^min = (n/2)·2 log n = Θ(n log n), while the sequential cost is Θ(n) ⇒ running this algorithm at p_0, i.e. for T_p^min, is not cost-optimal — even though the algorithm itself is COST-OPTIMAL (for n = Ω(p log p)).
Next, derive the lower bound on T_p such that the parallel cost is optimal (T_p^cost-opt: the parallel run time at which cost is optimal, for fixed W).
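The stationary point p_0 = n/2 can be confirmed by brute force (sketch; natural log here, consistent with the derivative −n/p^2 + 2/p used in the notes):

```python
import math

def Tp(n, p):
    # n/p local work + 2*log(p) reduction (natural log, matching dTp/dp = -n/p^2 + 2/p)
    return n / p + 2 * math.log(p)

def argmin_p(n):
    """Brute-force the integer p in [1, n] that minimizes Tp(n, p)."""
    return min(range(1, n + 1), key=lambda p: Tp(n, p))

n = 1024
print(argmin_p(n), n // 2)   # the minimizer is exactly p0 = n/2
```

Since T_p(n, p) is convex in p on [1, n], the integer minimizer coincides with the analytic p_0 = n/2 whenever n is even.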
If the isoefficiency function of the system is Θ(f(p)), then a problem of size W can be executed cost-optimally only if
W = Ω(f(p)), i.e. p = O(f^-1(W))   {required for a cost-optimal solution}.
Since p·T_p = Θ(W) for a cost-optimal solution, T_p = Θ(W/p), and with p = Θ(f^-1(W)) the lower bound on the parallel run time for solving a problem of size W cost-optimally is
T_p^cost-opt = Ω(W / f^-1(W)).

MINIMUM COST-OPTIMAL TIME FOR ADDING N NUMBERS ON A HYPERCUBE
A) Isoefficiency function:
T_o = p·T_p − W, with T_p = n/p + 2 log p ⇒ T_o = p(n/p + 2 log p) − n = 2p log p
W = K·T_o = 2Kp log p ⇒ W = Θ(p log p)   {isoefficiency function}
If W = n = f(p) = p log p, then log n = log p + log log p ≈ log p (ignoring the double-logarithmic term).
Inverting n = p log p: p = n / log p ≈ n / log n, so
f^-1(W) = Θ(n / log n).
B) The cost-optimal solution requires p = O(f^-1(W)) ⇒ p = Θ(n / log n) is the maximum processor count for a cost-optimal solution.
For p = n / log n:
T_p = n/p + 2 log p = log n + 2 log(n / log n) = 3 log n − 2 log log n
T_p^cost-opt = Θ(log n)
Note: T_p^min = Θ(log n) as well, so here the cost-optimal solution is also the best asymptotic solution in terms of execution time. T_p^min is attained at p_0 = n/2, whereas T_p^cost-opt is attained at p = n/log n < p_0; nevertheless
T_p^cost-opt = Θ(T_p^min).
Both T_p^min and T_p^cost-opt for adding n numbers on a hypercube are Θ(log n); thus, for this problem, a cost-optimal solution is also the asymptotically fastest solution.
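The closed form T_p^cost-opt = 3 log n − 2 log log n is an exact identity for T_p = n/p + 2 log p evaluated at p = n/log n, which a quick check confirms (sketch, logs base 2):

```python
import math

def Tp(n, p):
    return n / p + 2 * math.log2(p)

for k in (10, 20, 30):
    n = 2.0**k
    p = n / math.log2(n)                 # max cost-optimal processor count, Θ(n/log n)
    lhs = Tp(n, p)
    rhs = 3 * math.log2(n) - 2 * math.log2(math.log2(n))
    print(k, round(lhs, 6), round(rhs, 6))   # lhs == rhs up to float rounding
```

The identity follows directly: n/p = log n, and 2 log(n/log n) = 2 log n − 2 log log n.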
A PARALLEL SYSTEM WHERE T_p^cost-opt > T_p^min
T_o = p^(3/2) + p^(3/4)·W^(3/4)
T_p = (W + T_o)/p = W/p + p^(1/2) + W^(3/4)/p^(1/4)
Setting dT_p/dp = 0:
−W/p^2 + 1/(2·p^(1/2)) − W^(3/4)/(4·p^(5/4)) = 0
Solving for the stationary point yields p^(3/4) = Θ(W^(3/4)), hence
p_0 = Θ(W), and T_p^min = Θ(W^(1/2)) ......... (i)

Isoefficiency function: W = K·T_o; the dominant balance W = K·p^(3/4)·W^(3/4) gives W = K^4·p^3 = Θ(p^3)
⇒ P_max = Θ(W^(1/3)) ......... (ii)   {the maximum number of PEs for which the algorithm is cost-optimal}
For p = Θ(W^(1/3)):
T_p = W/p + p^(1/2) + W^(3/4)/p^(1/4) = Θ(W^(2/3))
⇒ T_p^cost-opt = Θ(W^(2/3)) > T_p^min = Θ(W^(1/2)) asymptotically, as equations (i) and (ii) show.

When deriving T_p^min, it is important to remember that the maximum number of processors that can be utilized is bounded by the degree of concurrency C(W) of the parallel algorithm. It is possible that p_0 > C(W) for a parallel system; in that case the value of p_0 is meaningless, and T_p^min is given by
T_p^min = (W + T_o(W, C(W))) / C(W)

Example (is the parallel algorithm cost-optimal with respect to a given sequential algorithm?):
T_s1 = a·n^2 + b·n + c, with a, b, c ∈ R
T_s2 = a'·n^2·log n + b'·n^2 + c'·n + d', with a', b', c', d' ∈ R
T_p = n^2/p + 64 log p
Cost-optimality requires p·T_p ~ T_s, i.e. T_s / (p·T_p) = const.
p·T_p = n^2 + 64·p·log p = n^2 + Θ(p log p)
• T_s1 = Θ(n^2): the parallel algorithm is cost-optimal w.r.t. T_s1 iff n^2 grows at least as Θ(p log p), i.e. n^2 = Ω(p log p).
• T_s2 = Θ(n^2 log n): with p·T_p = Θ(n^2), T_s2 / (p·T_p) = Θ(log n) ≠ O(1) ⇒ the parallel algorithm is not cost-optimal w.r.t. T_s2.

Example (at the p that minimizes the parallel run time):
dT_p/dp = 0 ⇒ −n^2/p^2 + 64/p = 0 ⇒ p_0 = n^2/64
T_p(p_0) = 64 + 128(log n − log 8) = O(log n)
p·T_p = (n^2/64)·(64 + 128(log n − log 8)) = Θ(n^2 log n)
• w.r.t. T_s1: T_s1/(p·T_p) = Θ(n^2)/Θ(n^2 log n) = O(1/log n) ≠ O(1) ⇒ not cost-optimal w.r.t. T_s1.
• w.r.t. T_s2: T_s2/(p·T_p) = Θ(n^2 log n)/Θ(n^2 log n) = O(1) ⇒ cost-optimal w.r.t. T_s2.

ASYMPTOTIC ANALYSIS OF PARALLEL PROGRAMS
Asymptotic analysis ignores constants and is concerned with the asymptotic behavior of quantities.
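The last comparison can be checked numerically (sketch): at p_0 = n^2/64 the parallel cost p·T_p grows like n^2·log n, so its ratio to T_s2 = Θ(n^2 log n) levels off while its ratio to T_s1 = Θ(n^2) vanishes:

```python
import math

def cost(n):
    """p*Tp evaluated at p0 = n^2/64, for Tp = n^2/p + 64*log2(p)."""
    p = n * n / 64
    return p * (n * n / p + 64 * math.log2(p))

for n in (2**8, 2**12, 2**16):
    r1 = (n * n) / cost(n)                 # Ts1/(p*Tp): tends to 0
    r2 = (n * n * math.log2(n)) / cost(n)  # Ts2/(p*Tp): levels off near a constant
    print(n, round(r1, 4), round(r2, 4))
```

In closed form, cost(n) = n^2·(2 log2 n − 5), so r2 = log2 n / (2 log2 n − 5) → 1/2 and r1 = 1/(2 log2 n − 5) → 0.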
In many cases this yields a clearer picture of the relative merits and demerits of various parallel programs. Consider the problem of sorting a list of n numbers; the fastest serial programs for this problem run in time O(n log n). The objective is to determine which of four parallel sorting algorithms, A1 through A4, is the best.
Perhaps the simplest metric is speed: the algorithm with the lowest T_p is the best. By this metric, algorithm A1 is the best, followed by A3, A4, and A2; the speedups of the algorithms are in the same order. In practice, however, we will rarely have the n^2 processing elements required by algorithm A1, and resource utilization is an important aspect of practical program design. By efficiency, algorithms A2 and A4 are the best, followed by A3 and A1. The costs of algorithms A1 and A3 are higher than the serial run time of n log n, so neither of them is cost-optimal; algorithms A2 and A4 are cost-optimal.