SCALABILITY ANALYSIS
PERFORMANCE AND SCALABILITY
OF PARALLEL SYSTEMS
• Evaluation
  • Sequential: runtime (execution time)
    T_s = T(input size)
  • Parallel: runtime (from start until the last PE finishes)
    T_p = T(input size, p, architecture)
    Note: a parallel algorithm cannot be evaluated in isolation from the
    parallel architecture.
  • Parallel system: parallel algorithm + parallel architecture
• Metrics evaluate the performance of a parallel system.
SCALABILITY: the ability of a parallel algorithm to achieve performance
gains proportional to the number of PEs.
PERFORMANCE METRICS
1. Run time: the serial run time (T_s) of a program is the time
elapsed between the beginning and the end of its execution on
a sequential computer.
The parallel run time (T_p) is the time elapsed from the moment
that a parallel computation starts to the moment that the last
processor finishes execution.
2. Speedup:
 • How much performance is gained by running the application on
   p (identical) processors.
 • Speedup is a measure of the relative benefit of solving a problem in
   parallel.
 Speedup:  S = T_s / T_p
PERFORMANCE METRICS
where
 T_s : runtime of the fastest sequential algorithm for solving the same
 problem.
 If that algorithm is
  – not known yet (only a lower bound is known), or
  – known, but with constants so large that it is impractical to implement,
 then take the fastest known sequential algorithm that can be practically
 implemented.
Formally, speedup S is the ratio of the serial run time of the best
sequential algorithm for solving a problem to the time taken by the
parallel algorithm to solve the same problem on p processors.
 • Speedup is a relative metric.
PERFORMANCE METRICS
Normally S ≤ p; S > p is called superlinear speedup.
Algorithm of adding “n” numbers on “n” processors
(Hypercube)
 • Initially, each processor is assigned one of the numbers to be added
   and, at the end of the computation, one of the processors stores the
   sum of all the numbers.
 • Assuming n = 16, processors as well as numbers are labeled from 0 to
   15. The sum of the numbers with consecutive labels from i to j is
   denoted by Σ(i..j).
 T_s = Θ(n)
 T_p = Θ(log n)
 S = Θ(n / log n)        (n = p = 2^k)
Algorithm of adding “n” numbers on “n” processors
(Hypercube)
 • Efficiency (E): a measure of how effectively the problem is solved on
   p processors:
   E = S / p,   E ∈ (0, 1]
   It measures the fraction of time for which a processor is usefully
   employed.
 • An ideal parallel system can deliver a speedup equal to p. Ideal
   behavior is not achieved because processors cannot devote 100 percent
   of their time to the computation of the algorithm.
 • For adding n numbers on n processors (p = n):
   E = Θ(1 / log n)
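A minimal Python sketch (illustrative, not part of the original slides) that evaluates these metrics for the p = n case, using the unit-time model from the later slides, T_s = n - 1 and T_p = 2 log2 n (one addition plus one communication per hypercube dimension):

    # speedup_efficiency_p_equals_n.py
    # Model: adding n numbers on an n-processor hypercube (n = p = 2**k).
    import math

    def metrics(n):
        ts = n - 1                  # sequential additions
        tp = 2 * math.log2(n)       # log n steps, each = 1 add + 1 communication
        s = ts / tp                 # speedup S = Ts / Tp
        e = s / n                   # efficiency E = S / p, with p = n
        return s, e

    for k in range(4, 11):
        n = 2 ** k
        s, e = metrics(n)
        print(f"n = p = {n:5d}   S = {s:8.1f}   E = {e:.3f}")
    # E shrinks roughly like 1/log n as n (= p) grows, matching E = Theta(1/log n).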
Algorithm of adding “n” numbers on “n”
processors (Hypercube)
Cost: the cost of solving a problem on a parallel system is the product
of the parallel run time and the number of processors used.
 C_seq_fast = T_s
 C_par = p T_p
 The system is cost-optimal when C_seq_fast ~ C_par (same asymptotic order).
For adding n numbers on n processors:
 C_seq_fast = Θ(n)
 C_par = Θ(n log n)   => not cost-optimal
 E = Θ(1 / log n)
 • p = n : fine granularity
 • p < n : coarse granularity (scaling down)
EFFECTS OF GRANULARITY ON COST-OPTIMALITY
[Figure: n virtual PEs arranged as a √n × √n array, simulated by
(equivalent to) p physical PEs]
Assume n virtual PEs.
If there are p physical PEs, then each physical PE simulates n/p virtual
PEs, and the computation at each PE increases by a factor of n/p.
Note: even if p < n, this does not necessarily yield a cost-optimal
algorithm.
Algorithm of adding n numbers on p processors (hypercube), p < n
 n = 2^k, p = 2^m;  e.g. n = 16, p = 4
 - Computation + communication (first 8 steps): Θ((n/p) log p)
 - Computation only (last 4 steps): Θ(n/p)
 Parallel execution time: T_par = Θ((n/p) log p)
 C_par = p Θ((n/p) log p) = Θ(n log p)
 C_seq_fast = Θ(n)
 As p increases, C_par grows asymptotically faster than C_seq_fast
 => not cost-optimal.
 • A cost-optimal algorithm:
   - Each PE first adds its n/p local numbers: compute Θ(n/p)
   - Then the p partial sums are combined: compute + communicate Θ(log p)
   - Total: Θ(n/p + log p)
   - For n > p log p:
     T_par = Θ(n/p)
     C_par = p T_par = Θ(n)
     C_seq_fast = T_seq_fast = Θ(n)
     ⇒ cost-optimal
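A small Python sketch (illustrative, not from the slides) comparing the two versions above under the unit-time assumption: the naive scaled-down simulation with T_par = (n/p) log p versus the cost-optimal version with T_par = n/p + log p, together with their costs p·T_par:

    # cost_comparison.py
    # Compare the naive scaled-down algorithm with the cost-optimal one
    # for adding n numbers on p processors (unit-time model).
    import math

    def naive_tpar(n, p):        # simulate the n-PE algorithm on p PEs
        return (n / p) * math.log2(p)

    def optimal_tpar(n, p):      # local adds first, then combine p partial sums
        return n / p + math.log2(p)

    n = 1 << 20
    for p in (4, 16, 64, 256, 1024):
        t1, t2 = naive_tpar(n, p), optimal_tpar(n, p)
        print(f"p={p:5d}  naive: Tpar={t1:9.0f} cost={p * t1:11.0f}   "
              f"optimal: Tpar={t2:9.0f} cost={p * t2:11.0f}")
    # The naive cost grows like n*log p, while the optimal cost stays Theta(n).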
a) If the algorithm is cost-optimal for p = n:
 • p physical PEs; each physical PE simulates n/p virtual PEs.
 • If the overall communication does not grow by more than a factor of
   n/p (proper mapping), then the total parallel run time grows at most
   by a factor of n/p:
     (n/p) T_comp + (n/p) T_comm = (n/p) T_total = (n/p) T_p^(p=n)
   i.e. T_p^(p<n) ≤ (n/p) T_p^(p=n)
 • Cost with p processors:
     C_par^(p<n) = p T_p^(p<n) ≤ p (n/p) T_p^(p=n) = n T_p^(p=n) = C_par^(p=n)
 • ⇒ the new algorithm using p processors, each simulating n/p virtual
   PEs, is cost-optimal (p < n).
b) If the algorithm is not cost-optimal for p = n:
 • If we increase the granularity, the new algorithm using p processors
   (each simulating n/p virtual PEs, p < n) may still not be cost-optimal.
 Example: adding n numbers on p processors (hypercube),
 n = 2^k, p = 2^m (e.g. n = 16, p = 4).
 • Each virtual PE i is simulated by physical PE (i mod p).
 • The first log p (2) of the log n (4) steps of the original algorithm
   are simulated in (n/p) log p steps (16/4 * 2 = 8 steps on p = 4
   processors).
 • The remaining steps do not require communication (the PEs that
   continue to communicate in the original algorithm are simulated by
   the same physical PE here).
THE ROLE OF MAPPING COMPUTATIONS
ONTO PROCESSORS IN PARALLEL
ALGORITHM DESIGN
For a cost-optimal parallel algorithm: E = Θ(1)
 • If a parallel algorithm on p = n processors is not cost-optimal, it
   does not follow (⇏) that a cost-optimal algorithm can be found for
   p < n.
 • Even if a cost-optimal algorithm is found for p < n, it does not
   follow (⇏) that it has the best possible parallel run time.
 • Performance (parallel run time) depends on:
   - the number of processors
   - the data mapping (assignment)
 • The parallel run time for the same problem (problem size) depends on
   the mapping of the virtual PEs onto the physical PEs.
 • Performance critically depends on the data mapping onto a
   coarse-grained parallel computer.
 • Example: multiplying an n×n matrix by a vector on a p-processor
   hypercube [partitioning into p square blocks vs. p slices of n/p rows].
Parallel FFT on a hypercube with cut-through routing
 • W computation steps ⇒ P_max = W
 • With P_max processors, each PE executes one step of the algorithm.
 • For p < W, each PE executes a larger number of steps.
 • The choice of the best algorithm for the local computations depends on
   the number of PEs (how much granularity is available).
 • An optimal algorithm for solving a problem on an arbitrary number of
   PEs cannot be obtained from the most fine-grained parallel algorithm
   alone; the analysis of the fine-grained algorithm may not reveal
   important facts such as the following.
Analysis of the coarse-grained parallel algorithm. Notes:
 1) If the message is short (one word only), the transfer time between
    two PEs is the same for store-and-forward and cut-through routing.
 2) If the message is long, cut-through routing is faster than
    store-and-forward.
 3) Performance on a hypercube and on a mesh is identical with
    cut-through routing.
 4) Performance on a mesh with store-and-forward routing is worse.
Design:
 1) Devise the parallel algorithm for the finest grain.
 2) Map the data onto the PEs.
 3) Describe the algorithm implementation for an arbitrary number of PEs.
Variables: problem size, number of PEs.
SCALABILITY
 S ≤ p;  S = S(p), E = E(p)
Example: adding n numbers on a p-processor hypercube.
Assume 1 unit of time for adding two numbers or for communicating with a
directly connected PE.
 1) Each PE adds its n/p local numbers:  takes n/p - 1
 2) The p partial sums are added in log p steps
    (each step: 1 addition + 1 communication)  => 2 log p
 T_p = n/p - 1 + 2 log p ≈ n/p + 2 log p
 T_s = n - 1 ≈ n
 S = n / (n/p + 2 log p) = np / (n + 2p log p)
 E = S / p = n / (n + 2p log p)
 => S = S(n,p) and E = E(n,p) can be computed for any pair of n and p.
 • As p increases, to keep increasing S we need to increase n (otherwise
   the speedup saturates) => E decreases.
 • For larger problem sizes S and E are higher, but they still drop as p
   increases.
 • For a scalable system we want to keep E constant (E = ct).
Scalability of a parallel system is a measure of its capacity to increase
speedup in proportion to the number of processors.
Efficiency of adding “n” numbers on a “p”
processor hypercube
For the cost-optimal algorithm:
 S = np / (n + 2p log p)
 E = n / (n + 2p log p)
 E = E(n,p);  n = Ω(p log p)
 E = 0.80 is kept constant by growing n as n = 8p log p:
   n = 64   for p = 4
   n = 192  for p = 8
   n = 512  for p = 16
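A quick Python check (illustrative) of the values above, evaluating E = n / (n + 2p log p) at n = 8p log p:

    # efficiency_table_check.py
    # Verify that growing n as 8*p*log2(p) keeps E = n/(n + 2*p*log2(p)) at 0.80.
    import math

    for p in (4, 8, 16, 32, 64):
        n = 8 * p * math.log2(p)
        e = n / (n + 2 * p * math.log2(p))
        print(f"p = {p:3d}  n = {n:6.0f}  E = {e:.2f}")   # E stays at 0.80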
 Conclusions (adding n numbers on a p-processor hypercube with the
 cost-optimal algorithm):
 • The algorithm is cost-optimal if n = Ω(p log p).
 • The algorithm is scalable if n increases proportionally to Θ(p log p)
   as p is increased.
PROBLEM SIZE
 • Matrix multiplication: input size n => O(n^3) computation;
   n' = 2n ⇒ O(n'^3) ≡ O(8n^3)
 • Matrix addition: input size n => O(n^2) computation;
   n' = 2n ⇒ O(n'^2) ≡ O(4n^2)
 So the input size is a poor measure of problem size. Instead, the
 problem size W is defined as the total amount of computation of the
 fastest sequential algorithm; then doubling the problem size means
 performing twice the amount of computation.
 • Assume a computation step takes 1 time unit. The message start-up
   time, per-word transfer time and per-hop time can be normalized with
   respect to the unit computation time.
 ⇒ W = T_s (of the fastest sequential algorithm on a sequential computer)
Overhead Function
 • Ideal:      E = 1, S = p
 • In reality: E < 1, S < p, because of overhead (interprocessor
   communication, etc.)
 ⇒ Overhead function: the time collectively spent by all processors in
   addition to that required by the fastest sequential algorithm to solve
   the same problem on a single PE.
   T_o = T_o(W,p)
   T_o = p T_p - W
 • For the cost-optimal algorithm of adding n numbers on a p-processor
   hypercube:
   T_s = W = n
   T_p = n/p + 2 log p
   T_o = p(n/p + 2 log p) - n = 2p log p
ISOEFFICIENCY FUNCTION
Parallel execution time is a function of the problem size, the overhead
function, and the number of processors:
 T_p = T(W, T_o, p)
 T_p = (W + T_o(W,p)) / p        (since T_o = p T_p - W)
 Speedup:    S = T_s / T_p = W / T_p = Wp / (W + T_o(W,p))
 Efficiency: E = S / p = W / (W + T_o(W,p))
 E = 1 / (1 + T_o(W,p)/W)
 • If W = constant and p increases, then E decreases.
 • If p = constant and W increases, then E increases (for scalable
   parallel systems).
 • We need E = constant for scalable, effective systems.
Example 1: if, as p increases, W must grow exponentially with p to keep
E constant, then the problem is poorly scalable, since we need to
increase the problem size very much to obtain good speedups.
Example 2: if, as p increases, W needs to grow only linearly with p, then
the problem is highly scalable, since the speedup is proportional to the
number of processors.
From E = 1 / (1 + T_o(W,p)/W):
 W = (E / (1 - E)) T_o(W,p)
 For E = constant, let k = E / (1 - E); then
 W = k T_o(W,p)
This function dictates the growth rate of W required to keep E constant
as p increases (the isoefficiency function).
The isoefficiency function does not exist (∄) in unscalable parallel
systems, because E cannot be kept constant as p increases, no matter how
much or how fast W increases.
Overhead function (adding n numbers on a p-processor hypercube):
 T_s = n
 T_p = n/p + 2 log p
 T_o = p T_p - T_s = p(n/p + 2 log p) - n = 2p log p
Isoefficiency function:
 W = k T_o(W,p)
 T_o = 2p log p   (note: here T_o = T_o(p) only)
 W = 2k p log p
 => the asymptotic isoefficiency function is Θ(p log p)
Meaning:
 1) If the number of PEs is increased from p to p' > p, the problem size
    has to be increased by a factor (p' log p') / (p log p) to have the
    same efficiency as on p processors.
 2) Increasing the number of PEs from p to p' (a factor of p'/p) requires
    the problem size to grow by a factor (p' log p') / (p log p) to
    increase the speedup by p'/p.
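A minimal Python sketch (illustrative) of point 1): grow the problem size by the factor (p' log p')/(p log p) and verify that E = n/(n + 2p log p) stays constant:

    # isoefficiency_scaling.py
    # Keep efficiency constant by growing W = n according to the
    # isoefficiency function Theta(p*log p) of the hypercube add.
    import math

    def efficiency(n, p):
        return n / (n + 2 * p * math.log2(p))

    p, n = 4, 256                       # starting configuration
    print(f"p = {p:4d}  n = {n:8.0f}  E = {efficiency(n, p):.4f}")
    for p_new in (8, 16, 32, 64):
        # grow n by the factor (p' log p') / (p log p)
        n = n * (p_new * math.log2(p_new)) / (p * math.log2(p))
        p = p_new
        print(f"p = {p:4d}  n = {n:8.0f}  E = {efficiency(n, p):.4f}")
    # E stays at its initial value as p grows.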
Here the communication overhead is exclusively a function of p:
T_o = T_o(p). In general T_o = T_o(W,p), and
 W = k T_o(W,p)   (may involve many terms; sometimes hard to solve in
                   terms of p)
 • E = constant requires the ratio T_o/W to stay fixed.
 • As p increases, W must increase to obtain non-decreasing efficiency
   (E' ≥ E) => T_o should not grow faster than W; in fact none of the
   terms of T_o should grow faster than W.
 • If T_o has multiple terms, we balance W against each term of T_o and
   compute the respective isoefficiency functions for the individual
   terms.
 • The component of T_o that requires the problem size to grow at the
   highest rate with respect to p determines the overall asymptotic
   isoefficiency function of the parallel system.
Example 1:
 T_o = p^(3/2) + p^(3/4) W^(3/4)
 Balancing W against the first term:  W = k p^(3/2)            => Θ(p^(3/2))
 Balancing W against the second term: W = k p^(3/4) W^(3/4)
                                       W^(1/4) = k p^(3/4)
                                       W = k^4 p^3             => Θ(p^3)
 → take the higher of the two rates:
 the isoefficiency function of this example is Θ(p^3).
Example 2:
 T_o = p^(1/2) W^(1/2) + p^(3/5) + p^(3/4) W^(3/4) + p^(3/2) + p
 W = K T_o = K (p^(1/2) W^(1/2) + p^(3/5) + p^(3/4) W^(3/4) + p^(3/2) + p)
 Balancing term by term:
   W = K p^(1/2) W^(1/2)   => Θ(p)
   W = K p^(3/5)           => Θ(p^(3/5))
   W = K p^(3/4) W^(3/4)   => Θ(p^3)
   W = K p^(3/2)           => Θ(p^(3/2))
   W = K p                 => Θ(p)
 Therefore W = Θ(p^3) (the highest of all the individual isoefficiency
 functions).
 → To ensure E does not decrease, the problem size needs to grow as
 Θ(p^3) (the overall asymptotic isoefficiency function).
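A small Python sketch (illustrative; it assumes the sympy library is available) that performs the term balancing of Example 1 symbolically:

    # isoefficiency_balance.py
    # Balance W against each term of T_o = p**(3/2) + p**(3/4) * W**(3/4)
    # and report the individual isoefficiency rates.
    import sympy as sp

    W, p, k = sp.symbols('W p k', positive=True)
    terms = [p**sp.Rational(3, 2), p**sp.Rational(3, 4) * W**sp.Rational(3, 4)]

    for t in terms:
        sol = sp.solve(sp.Eq(W, k * t), W)   # solve W = k * term for W
        print(f"W = k * {t}  ->  W = {sol}")
    # -> W = k*p**(3/2) and W = k**4*p**3; the overall isoefficiency is the
    #    faster-growing rate, Theta(p**3).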
Isoefficiency Functions
 • Capture characteristics of the parallel algorithm and architecture.
 • Predict the impact on performance as the number of PEs increases.
 • Characterize the amount of parallelism in a parallel algorithm.
 • Allow study of the behavior of an algorithm (parallel system) under
   hardware changes (speed of PEs, communication channels).
Cost-Optimality and the Isoefficiency Function
 Cost-optimality: T_s / (p T_p) = constant
 p T_p = Θ(W)
 W + T_o(W,p) = Θ(W)        (since T_o = p T_p - W)
 T_o(W,p) = O(W)
 W = Ω(T_o(W,p))
A parallel system is cost-optimal if and only if its overhead function
does not grow asymptotically faster than the problem size.
Relationship between cost-optimality and the isoefficiency function
Example: adding n numbers on a p-processor hypercube.
a) Non-cost-optimal version:
   W = Θ(n)
   T_p = O((n/p) log p)
   T_o = p T_p - W = Θ(n log p)
   W = k Θ(n log p) cannot hold for all k and E (W appears on both sides
   and the right side grows with p);
   the algorithm is not cost-optimal, the isoefficiency function does not
   exist (∄), and the system is not scalable.
b) Cost-optimal version:
   W = Θ(n), W ≈ n
   T_p = O(n/p + log p)
   T_o = Θ(n + p log p) - O(n) = Θ(p log p)
   W = k Θ(p log p)
   W = Ω(p log p), i.e. n >> p is needed for cost-optimality.
   The problem size should grow at least as p log p so that the parallel
   system is scalable.
ISOEFFICIENCY FUNCTION
 • Determines the ease with which a parallel system can maintain a
   constant efficiency and thus achieve speedups increasing in proportion
   to the number of processors.
 • A small isoefficiency function means that small increments in the
   problem size are sufficient for the efficient utilization of an
   increasing number of processors => the parallel system is highly
   scalable.
 • A large isoefficiency function indicates a poorly scalable parallel
   system.
 • The isoefficiency function does not exist for unscalable parallel
   systems, because in such systems the efficiency cannot be kept at any
   constant value as p increases, no matter how fast the problem size is
   increased.
Lower Bound on the Isoefficiency Function
 • A small isoefficiency function => higher scalability.
 → For a problem of size W, at most P_max ≤ W processors can be used
   cost-optimally (if P_max > W, some PEs are idle).
 → If W < Θ(p), i.e. the problem size grows slower than p, then as p
   keeps increasing, at some point #PE > W and E drops sharply.
 => Asymptotically, the problem size must increase at least in proportion
   to Θ(p) to maintain a fixed efficiency:
   W = Ω(p)   (W should grow at least as fast as p).
 Ω(p) is the asymptotic lower bound on the isoefficiency function.
 But P_max = Θ(W)   (p should grow at most as fast as W)
 => the isoefficiency function of an ideal parallel system is W = Θ(p).
Example (which overhead function is most scalable?)
 T_o1 = a p^5 W^(4/5) + b p^3
 T_o2 = c p^5 + d p^3 + e p^2
 T_o3 = f p^5 + g p^5 W^(4/5) + h p^(3/4) W^(3/4)
First calculate the isoefficiency function of all three overhead
functions and then find the lowest of the three:
 isoefficiency of T_o1 = Θ(p^25)
 isoefficiency of T_o2 = Θ(p^5)
 isoefficiency of T_o3 = Θ(p^25)
Therefore T_o2, with isoefficiency Θ(p^5), has the lowest isoefficiency
function and is the most scalable of the three.
Degree of Concurrency & the Isoefficiency Function
 • Degree of concurrency C(W): the maximum number of tasks that can be
   executed simultaneously at any time.
 • It is independent of the parallel architecture.
 • No more than C(W) processors can be employed effectively.
Effect of concurrency on the isoefficiency function
 Example: Gaussian elimination:
   W = Θ(n^3),  P = Θ(n^2)
   C(W) = Θ(W^(2/3)) => at most Θ(W^(2/3)) processors can be used
   efficiently.
   Given p, W = Ω(p^(3/2)) => the problem size should be at least
   Ω(p^(3/2)) to use all of them.
   => The isoefficiency due to concurrency is Θ(p^(3/2)).
 The isoefficiency function due to concurrency is optimal, that is Θ(p),
 only if the degree of concurrency of the parallel algorithm is Θ(W).
SOURCES OF OVERHEAD
 • Interprocessor communication
   - Each PE spends t_comm
   - Overall interprocessor communication: p·t_comm (architecture impact)
 • Load imbalance
   - Idle vs. busy PEs (idling contributes to overhead)
   - Example: W = W_s + W_p, where the sequential part W_s is executed by
     1 PE only:
       1 PE:     W_s          (useful work)
       p-1 PEs:  (p-1) W_s    contribution to the overhead function
 • Extra computation
   1) Redundant computation (e.g. Fast Fourier Transform)
   2) W  – for the best sequential algorithm
      W' – for a sequential algorithm that is easily parallelizable
      W' - W contributes to the overhead.
 • Overhead of scheduling
 • If the degree of concurrency of an algorithm is less than Θ(W), then
   the isoefficiency function due to concurrency is worse, i.e. greater
   than Θ(p).
 • Overall isoefficiency function of a parallel system:
   Isoeff_system = max(Isoeff_concurrency, Isoeff_communication,
                       Isoeff_other_overhead)
SOURCES OF PARALLEL OVERHEAD
 • The overhead function characterizes a parallel system.
 • Given the overhead function T_o = T_o(W,p), we can express T_p, S, E,
   and p·T_p (cost) as functions f_i(W,p).
 • The overhead function encapsulates all causes of inefficiency of a
   parallel system, due to:
   - the algorithm
   - the architecture
   - the algorithm-architecture interaction
MINIMUM EXECUTION TIME
(adding n numbers on a hypercube; assume p is not a constraint)
As we increase the number of processors for a given problem size, either
the parallel run time continues to decrease and asymptotically approaches
a minimum value, or it starts rising after attaining a minimum value.
 T_p = T_p(W,p)
 For a given W, the minimum run time T_p^min is found by solving
   d(T_p)/dp = 0  =>  p_o, the number of processors for which T_p = T_p^min.
Example:
 Approximate the parallel run time for adding n numbers on a p-processor
 hypercube by
   T_p = n/p + 2 log p
 Then
   d(T_p)/dp = -n/p^2 + 2/p = 0
   -n + 2p = 0  =>  p_o = n/2
   T_p^min = 2 log n
 The processor-time product for p = p_o is Θ(n log n):
   cost (sequential): Θ(n)
   cost (parallel):   p_o T_p^min = (n/2) × 2 log n = Θ(n log n)
   => not cost-optimal
 => Running this algorithm at p = p_o (i.e. at T_p^min) is not
 cost-optimal, although the algorithm itself is cost-optimal (for a
 suitable smaller p).
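A short Python check (illustrative) of this example: scan p to locate the minimum of T_p = n/p + 2 log2 p and compare the resulting cost with the sequential cost Θ(n). (With base-2 logarithms the exact minimizer is p = n·ln 2 / 2 ≈ 0.35n rather than exactly n/2, but it is still Θ(n) and gives T_p^min ≈ 2 log2 n.)

    # min_execution_time.py
    # Scan p to locate the minimum of Tp = n/p + 2*log2(p) and compare the
    # cost p*Tp at that point with the sequential cost Theta(n).
    import math

    n = 4096
    best_p, best_tp = min(
        ((p, n / p + 2 * math.log2(p)) for p in range(1, n + 1)),
        key=lambda pair: pair[1],
    )
    print(f"p_o = {best_p} (order n),  Tp_min = {best_tp:.1f},  "
          f"2*log2(n) = {2 * math.log2(n):.1f}")
    print(f"cost p_o*Tp_min = {best_p * best_tp:.0f}  vs  sequential cost n = {n}")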
Derive: a lower bound on T_p such that the parallel cost is optimal.
 T_p^cost-opt : the parallel run time such that the cost is optimal,
 for fixed W.
 • If the isoefficiency function is Θ(f(p)), then a problem of size W can
   be executed cost-optimally only if W = Ω(f(p)), i.e.
   p = O(f^-1(W))   {required for a cost-optimal solution}.
 • Since p T_p = Θ(W) for a cost-optimal solution, the parallel run time
   is T_p = Θ(W/p).
 • With p = Θ(f^-1(W)), the lower bound on the parallel run time for
   solving a problem of size W cost-optimally is:
   T_p^cost-opt = Ω( W / f^-1(W) )
MINIMUM COST-OPTIMAL TIME FOR
ADDING N NUMBERS ON A HYPERCUBE
A) Isoefficiency function:
   T_o = p T_p - W
   T_p = n/p + 2 log p
   => T_o = p(n/p + 2 log p) - n = 2p log p
   W = k T_o = 2k p log p
   W = Θ(p log p)   {isoefficiency function}
   If W = n = f(p) = p log p, then
     log n = log p + log log p
     log n ≈ log p   (ignoring the doubly logarithmic term)
   Inverting n = f(p) = p log p:
     p = n / log p ≈ n / log n
     f^-1(n) = n / log n
     f^-1(W) = Θ(n / log n)
B) The cost-optimal solution:
   p = O(f^-1(W))
   => p = Θ(n / log n) is the maximum number of processors for a
      cost-optimal solution.
   For p = n / log n, T_p = T_p^cost-opt:
     T_p = n/p + 2 log p
     T_p^cost-opt = log n + 2 log(n / log n) = 3 log n - 2 log log n
     T_p^cost-opt = Θ(log n)
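A quick numerical check (illustrative) that running with p = n/log2 n processors gives exactly T_p = 3 log2 n - 2 log2 log2 n, i.e. Θ(log n):

    # cost_optimal_time_check.py
    # Evaluate Tp = n/p + 2*log2(p) at p = n/log2(n) and compare with the
    # closed form 3*log2(n) - 2*log2(log2(n)).
    import math

    for n in (2**10, 2**14, 2**20):
        p = n / math.log2(n)
        tp = n / p + 2 * math.log2(p)
        closed = 3 * math.log2(n) - 2 * math.log2(math.log2(n))
        print(f"n = {n:8d}  Tp = {tp:7.2f}  3log n - 2loglog n = {closed:7.2f}")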
Note:
 T_p^min      = Θ(log n)
 T_p^cost-opt = Θ(log n)
 {the cost-optimal solution is also the best asymptotic solution in terms
  of execution time}
 p_o (for T_p^min) = n/2  >  p (for T_p^cost-opt) = n/log n
 => T_p^cost-opt = Θ(T_p^min)
Both T_p^min and T_p^cost-opt for adding n numbers on a hypercube are
Θ(log n); thus, for this problem, a cost-optimal solution is also the
asymptotically fastest solution.
Parallel system where T_p^cost-opt > T_p^min
 T_o = p^(3/2) + p^(3/4) W^(3/4)
 T_p = (W + T_o) / p = W/p + p^(1/2) + W^(3/4) / p^(1/4)
 Setting d(T_p)/dp = 0:
   -W/p^2 + 1/(2 p^(1/2)) - W^(3/4)/(4 p^(5/4)) = 0
   -W + (1/2) p^(3/2) - (1/4) W^(3/4) p^(3/4) = 0
   p^(3/4) = (1/4) W^(3/4) ± ((1/16) W^(3/2) + 2W)^(1/2)
   p^(3/4) = Θ(W^(3/4))
   => p_o = Θ(W)
   T_p^min = Θ(W^(1/2))  .......... (equation i)
 Isoefficiency function:
   W = k T_o  =>  W = k^4 p^3 = Θ(p^3)   (dominant term)
   => P_max = Θ(W^(1/3))  .......... (equation ii)
      {maximum #PE for which the algorithm is cost-optimal}
 Evaluating T_p = W/p + p^(1/2) + W^(3/4)/p^(1/4) at p = Θ(W^(1/3)):
   => T_p^cost-opt = Θ(W^(2/3))
   => T_p^cost-opt > T_p^min asymptotically
 Equations i and ii show that T_p^cost-opt is asymptotically greater than
 T_p^min.
 • When deriving T_p^min, it is important to be aware that the maximum
   number of processors that can be utilized is bounded by the degree of
   concurrency C(W) of the parallel algorithm.
 • It is possible that p_o > C(W) for a parallel system; in that case the
   value of p_o is meaningless, and T_p^min is given by
   T_p^min = (W + T_o(W, C(W))) / C(W)
Example (is the parallel algorithm cost-optimal with respect to a given
sequential algorithm?)
 Given:
   T_s1 = a n^2 + b n + c                      a, b, c ∈ R
   T_s2 = a' n^2 log n + b' n^2 + c' n + d     a', b', c', d ∈ R
   T_p  = n^2/p + 64 log p
 Cost-optimality requires p T_p ~ T_s, i.e. T_s / (p T_p) = constant.
 We have
   p T_p = n^2 + 64 p log p  ⇒  p T_p = n^2 + Θ(p log p)
 • T_s1 = Θ(n^2): the parallel algorithm is cost-optimal w.r.t. T_s1 iff
   n^2 grows at least as Θ(p log p), i.e. n^2 = Ω(p log p).
 • T_s2 = Θ(n^2 log n): with p T_p = Θ(n^2),
   T_s2 / (p T_p) = Θ(log n) ≠ O(1),
   so the parallel algorithm is not cost-optimal w.r.t. T_s2.
Example (the operating point where the parallel run time is minimum):
 d(T_p)/dp = 0
 ⇒ -n^2/p^2 + 64/p = 0
 ⇒ p_o = n^2/64
 At p = p_o:
   T_p = 64 + 128(log n - log 8) = Θ(log n)
   p T_p = (n^2/64)(64 + 128(log n - log 8)) = Θ(n^2 log n)
 • w.r.t. T_s1:
   T_s1 / (p T_p) = Θ(n^2) / Θ(n^2 log n) = O(1/log n) ≠ O(1)
   Therefore it is not cost-optimal w.r.t. T_s1.
 • w.r.t. T_s2:
   T_s2 / (p T_p) = Θ(n^2 log n) / Θ(n^2 log n) = O(1)
   This is cost-optimal w.r.t. T_s2.
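A small Python sketch (illustrative; the constants a = a' = 1 are hypothetical choices for the leading terms) evaluating T_p = n^2/p + 64 log2 p at p_o = n^2/64 and the two cost-optimality ratios as n grows:

    # cost_optimality_ratios.py
    # At p_o = n^2/64, compare p*Tp with Ts1 ~ n^2 and Ts2 ~ n^2*log n.
    import math

    for n in (2**8, 2**12, 2**16):
        p = n**2 / 64
        tp = n**2 / p + 64 * math.log2(p)
        cost = p * tp
        ts1 = n**2                      # leading term of Ts1 (a = 1 assumed)
        ts2 = n**2 * math.log2(n)       # leading term of Ts2 (a' = 1 assumed)
        print(f"n = {n:6d}  Ts1/(p*Tp) = {ts1/cost:.4f}  Ts2/(p*Tp) = {ts2/cost:.4f}")
    # Ts1/(p*Tp) keeps shrinking (not cost-optimal w.r.t. Ts1), while
    # Ts2/(p*Tp) approaches a constant (cost-optimal w.r.t. Ts2).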
Asymptotic Analysis of Parallel Programs
Asymptotic analysis ignores constants and is concerned with the
asymptotic behavior of quantities. In many cases, this can yield a
clearer picture of the relative merits and demerits of various parallel
programs.
Consider the problem of sorting a list of n numbers, for which the
fastest serial programs run in time O(n log n), and four parallel sorting
algorithms A1-A4. The objective is to determine which of these four
algorithms is the best. Perhaps the simplest metric is speed: the
algorithm with the lowest T_p is the best. By this metric, algorithm A1
is the best, followed by A3, A4, and A2. This is also reflected in the
fact that the speedups of the algorithms are in the same order.
In practice, however, we will rarely have the n^2 processing elements
required by algorithm A1. Resource utilization is an important aspect of
practical program design; in terms of resource utilization, algorithms A2
and A4 are the best, followed by A3 and A1. The costs of algorithms A1
and A3 are higher than the serial runtime of n log n, and therefore
neither of these algorithms is cost-optimal. However, algorithms A2 and
A4 are cost-optimal.