Performance and Scalability of
Parallel Systems
• Evaluation
  – sequential: run time (execution time)
    • Ts = T(Insize)
  – parallel: run time (start -> last PE ends)
    • Tp = T(Insize, p, arch.)
    • cannot be evaluated in isolation from the parallel architecture
  – parallel system: par. alg. + par. architecture
• Metrics: evaluate the performance of a parallel system
  – scalability: ability of a par. alg. to achieve performance
    gains proportional to the number of PEs
Performance Metrics
• Run-Time: Ts, Tp
• Speedup:
  – how much performance is gained by running the
    application on "p" identical processors
  • S = Ts / Tp
  – Ts: fastest sequential algorithm for solving the same problem
  – If the fastest alg. is not known yet (only a lower bound is known),
    or is known but has constants so large that it cannot practically be
    implemented,
    then: take the fastest known sequential algorithm that can be practically
    implemented
  – Speedup: a relative metric
• Alg. for adding "n" numbers on "n" processors
  (Hypercube, n = p = 2^k)
  – Ts = Θ(n)
  – Tp = Θ(log n)
  – S = Θ(n / log n)
• Efficiency: measure of how effectively the problem is solved on p
  processors
  – E ∈ [0, 1)
  – E = S / p
  – if p = n, E = Θ(1 / log n)
• Cost
  – Cseq(fast) = Ts      Cpar = p·Tp
  – cost-optimal: Cpar ~ Cseq(fast)
  – not cost-optimal: if p = n, fine granularity; if p < n, coarse granularity
  – Cseq(fast) = Θ(n)    Cpar = Θ(n log n)    E = Θ(1 / log n)
[Figure: adding 16 numbers on 16 processors of a hypercube.
(a) Initial data distribution and the first communication step
(b) Second communication step
(c) Third communication step
(d) Fourth communication step
(e) Accumulation of the sum at processor 0 after the final communication]
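To make these metrics concrete, here is a minimal Python sketch (not from the original slides) that evaluates Ts, Tp, S, E, and the parallel cost p·Tp for the adding-n-numbers-on-n-processors example, assuming unit cost per addition and per nearest-neighbour communication:

```python
import math

def metrics_adding_n_on_n(n):
    """Adding n numbers on p = n processors of a hypercube (n a power of 2).

    Assumes one time unit per addition and per nearest-neighbour message,
    the same model used in the Scalability slides below.
    """
    Ts = n - 1                 # best sequential algorithm: n - 1 additions
    Tp = 2 * math.log2(n)      # log n addition steps + log n communication steps
    S = Ts / Tp                # speedup
    E = S / n                  # efficiency with p = n
    Cpar = n * Tp              # parallel cost p * Tp = Theta(n log n)
    return Ts, Tp, S, E, Cpar

for n in (16, 256, 4096):
    Ts, Tp, S, E, Cpar = metrics_adding_n_on_n(n)
    print(f"n={n:5d}  Ts={Ts:5d}  Tp={Tp:5.1f}  S={S:7.1f}  E={E:5.3f}  p*Tp={Cpar:9.0f}")
```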
Effects of Granularity on Cost-Optimality
• Assume: n initial PEs
  – If p = # of physical PEs, then each PE simulates n/p virtual PEs,
    and the computation at each PE increases by a factor of n/p
  – Note: Even if p < n, this does not necessarily give a
    cost-optimal alg.
[Figure: four processors simulating 16 processors to add 16 numbers.
(a) Four processors simulating the first communication step of 16 processors
(b) Four processors simulating the second communication step of 16 processors
(c) Simulation of the third step in two substeps
(d) Simulation of the fourth step
(e) Final result]
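A minimal, illustrative sketch (the helper name is mine) of the block mapping suggested by the figure: each of the p physical PEs simulates a contiguous block of n/p virtual PEs, so its work per simulated step grows by a factor of n/p:

```python
# Illustrative block mapping of n virtual PEs onto p physical PEs (as in the
# figure above): virtual PE i is simulated by physical PE i // (n // p).
def physical_owner(i, n, p):
    return i // (n // p)

n, p = 16, 4
for phys in range(p):
    owned = [i for i in range(n) if physical_owner(i, n, p) == phys]
    print(f"physical PE {phys} simulates virtual PEs {owned}")
```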
Adding “n” numbers on “p” processors
(Hypercube p<n)
• n = 2^k, p = 2^w    e.g., n = 16, p = 4
  – comput: Θ(n/p)
  – comput + communication: Θ((n/p) log p)
  – par. exec. time: Tpar = Θ((n/p) log p)
  – Cpar = p·Θ((n/p) log p) = Θ(n log p)
  – Cseq(fast) = Θ(n)  => not cost-optimal
A Cost Optimal Algorithm
  – comput: Θ(n/p)
  – comput + communication: Θ(n/p + log p)
  – Tpar = Θ(n/p + log p)
  – Cpar = p·Tpar = Θ(n)
  – Cseq(fast) = Θ(n)
  => Cost-optimal
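The difference between the two mappings can be seen numerically with a rough sketch (constants dropped, my own test values): the naive simulation has Tpar ~ (n/p)·log p and cost ~ n log p, while the cost-optimal mapping has Tpar ~ n/p + log p and cost ~ n:

```python
import math

def naive_sim_time(n, p):
    # Scaled-down simulation of the p = n algorithm: Tpar ~ (n/p) * log2 p
    return (n / p) * math.log2(p)

def cost_optimal_time(n, p):
    # Add n/p numbers locally, then combine p partial sums in log2 p steps
    return n / p + math.log2(p)

n, p = 2**20, 64
for name, Tpar in (("naive simulation", naive_sim_time(n, p)),
                   ("cost-optimal mapping", cost_optimal_time(n, p))):
    print(f"{name:22s} Tpar ~ {Tpar:10.1f}   p*Tpar ~ {p * Tpar:12.1f}   Ts ~ {n}")
```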
• (a) if the algorithm is cost-optimal:
  • p = # of physical PEs
  • each PE simulates n/p virtual PEs
  – Then, if the overall communication does not grow by more than a factor
    of n/p:
  • => the total parallel run-time grows by at most a factor of n/p:
       Tcomp + Tcomm = Ttotal = Tp(p PEs) = (n/p)·Tp(n PEs)
  • => Cpar(p PEs) = p·Tp(p PEs) = p·(n/p)·Tp(n PEs) = n·Tp(n PEs) = Cpar(n PEs)
  • => the new algorithm using p processors (each simulating n/p virtual PEs)
       is cost-optimal (p < n)
• (b) if the algorithm is not cost-optimal for p = n:
  • if we increase granularity
  – => the new algorithm using p processors (p < n) may still not be
    cost-optimal
  – e.g., adding "n" numbers on "p" processors (Hypercube: p < n)
    • n = 2^k  (n = 16)
    • p = 2^w  (p = 4)
    • each virtual PE is simulated by a physical PE
    • the first log p (2 steps) of the log n (4 steps) in the original
      algorithm are simulated in (n/p)·log p time
    • the remaining steps do not require communication (the PEs that
      continue to communicate in the original algorithm are
      simulated by the same PE here)
The Role of Mapping Computations Onto
Processors in Parallel Algorithm Design
• For a cost-optimal parallel algorithm E = Θ(1)
• If a par. alg. on p = n processors is not cost-optimal
  (or cost-nonoptimal), it does not follow that for p < n you can find a
  cost-optimal algorithm
• Even if you find a cost-optimal alg. for p < n, it does not follow that you
  have found the algorithm with the best parallel run-time
• Performance (parallel run-time) depends on (1)
  #processors and (2) data mapping
• Parallel run-time of the same problem (problem
  size) depends on the mapping of virtual PEs onto
  physical PEs
• Performance depends strongly on the data
  mapping onto a coarse-grain parallel computer
  – e.g., multiplying an n*n matrix by a vector on a p-processor
    hypercube
  – parallel FFT on a hypercube w/ cut-through routing
  – W = # of computation steps => Pmax = W
  – For Pmax: each PE executes one step of the algorithm
  – For p < W: each PE executes a larger number of steps
• The choice of the best algorithm to perform local
  computations depends on #PEs
• The optimal algorithm for solving a problem on an
  arbitrary #PEs cannot be obtained from the most
  fine-grained par. alg.
• The analysis of a fine-grained par. alg. may not
  reveal important facts such as:
  – if the message is short (one word only) => the transfer time
    between 2 PEs is the same for store-and-forward and
    cut-through routing
  – if the message is long => cut-through is faster than store-and-forward
  – performance on a hypercube and on a mesh is identical with cut-through
    routing
  – perf. on a mesh with store-and-forward is worse
• Design
  – devise the par. alg. for the finest grain
  – map the data onto the PEs
  – describe the alg. implementation on an arbitrary #
    of PEs
  – Variables:
    • problem size
    • # PEs
Scalability
• Adding n numbers on a p-processor hypercube
  – assume 1 unit of time for:
    • adding 2 numbers
    • communicating with a connected PE
  – adding n/p numbers locally: n/p - 1
  – p partial sums added in log p steps
    • for each sum: 1 addition and 1 communication => 2 log p
  • Tp = n/p - 1 + 2 log p
  • Tp ~ n/p + 2 log p
  • Ts = n - 1 ~ n
  • S = n / (n/p + 2 log p) = np / (n + 2p log p)  => S(n, p)
  • E = S/p = n / (n + 2p log p)  => E(n, p) can be computed for
    any pair n & p
  • As p ↑, to keep increasing S we must increase n; otherwise E ↓
  • For a large problem size, S and E are higher, but they still drop with p
• Scalability of a parallel system is a measure of its
  capability to increase speedup in proportion to the
  number of processors
• Efficiency of adding n numbers on a p processor
hypercube
– for the cost-optimal alg:
    • S = np / (n + 2p log p)
    • E = S/p = n / (n + 2p log p) = E(n, p)

      n      p=1    p=4    p=8    p=16   p=32
      64     1.0    .80    .57    .33    .17
      192    1.0    .92    .80    .60    .38
      320    1.0    .95    .87    .71    .50
      512    1.0    .97    .91    .80    .62
    • n = Θ(p log p) keeps E = 0.80 constant:
      • for n=64  & p=4:  n = 8p log p
      • for n=192 & p=8:  n = 8p log p
      • for n=512 & p=16: n = 8p log p
    • Conclusion:
      – for adding n numbers on p processors w/ the cost-optimal
        algorithm:
        • the alg. is cost-optimal if n = Ω(p log p)
        • the alg. is scalable if n increases proportionally to Θ(p log p) as p
          is increased
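The table and the n = 8p log p observation can be reproduced with a short Python sketch of E(n, p) = n / (n + 2p log2 p):

```python
import math

def efficiency(n, p):
    """E(n, p) = n / (n + 2 p log2 p) for adding n numbers on a p-PE hypercube."""
    return 1.0 if p == 1 else n / (n + 2 * p * math.log2(p))

ns, ps = (64, 192, 320, 512), (1, 4, 8, 16, 32)
print("    n" + "".join(f"   p={p:<4d}" for p in ps))
for n in ns:
    print(f"{n:5d}" + "".join(f"   {efficiency(n, p):6.2f}" for p in ps))

# n = 8 p log2 p keeps E at 0.80 as p grows (the isoefficiency curve for E = 0.8)
for p in (4, 8, 16):
    n = 8 * p * math.log2(p)
    print(f"p={p:2d}  n={n:5.0f}  E={efficiency(n, p):4.2f}")
```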
• Problem size
– for matrix multiply
    • input n => O(n^3);  n' = 2n => O(n'^3) = O(8n^3)
  – for matrix addition
    • input n => O(n^2);  n' = 2n => O(n'^2) = O(4n^2)
• Doubling the size of the problem means
  performing twice the amount of computation
• Computation step: assume it takes 1 time unit
  – message start-up time, per-word transfer time, and per-hop time can be
    normalized w.r.t. the unit computation time
  => W = Ts
    • for the fastest sequential algorithm on a sequential computer
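A tiny sketch (operation counts are illustrative, leading constants ignored) of why "doubling the input dimension" means very different amounts of work for the two kernels above, which is why problem size W is measured in computation steps:

```python
def matmul_ops(n):
    return n ** 3      # classical n x n matrix multiply: Theta(n^3) operations

def matadd_ops(n):
    return n ** 2      # n x n matrix addition: Theta(n^2) operations

n = 256
print("multiply ratio when n doubles:", matmul_ops(2 * n) / matmul_ops(n))  # 8.0
print("addition ratio when n doubles:", matadd_ops(2 * n) / matadd_ops(n))  # 4.0
```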
Overhead Function
• ideal: E=1 & S=p
• reality: E<1 & S<p
• overhead function:
– the time collectively spent by all processors in addition
to that required by the fastest known sequential
algorithm to solve the same problem on a single PE
    T0 = T0(W, p) = p·Tp - W
  – for the cost-optimal alg. of adding n numbers on a p-processor
    hypercube:
    Ts = W = n
    Tp = n/p + 2 log p
    T0 = p·(n/p + 2 log p) - n = 2p log p
Isoefficiency Function
  Tp = T(W, T0, p)
  T0 = p·Tp - W  =>  Tp = (W + T0(W, p)) / p
  S = Ts / Tp = W / Tp = W·p / (W + T0(W, p))
  E = S / p = W / (W + T0(W, p)) = 1 / (1 + T0(W, p)/W)
• If W = ct and p ↑ => E ↓
• If p = ct and W ↑ => E ↑ for scalable parallel systems
  => we need E = ct
• E.g. 1
  – p ↑, W must grow exponentially with p => poorly scalable, since we need
    to increase the problem size very much to obtain good speedups
• E.g. 2
  – p ↑, W grows linearly with p => highly scalable, since the speedup is
    proportional to the number of processors
  E = 1 / (1 + T0(W, p)/W)
  => T0(W, p)/W = (1 - E)/E
  => W = (E / (1 - E))·T0(W, p)
  for a given E, let K = E / (1 - E)
  => W = K·T0(W, p)
• This function dictates the growth rate of W required to keep
  E constant as p increases
• The isoefficiency function does not exist for unscalable parallel systems,
  because E cannot be kept constant as p increases,
  no matter how much or how fast W increases.
Overhead Function
• Adding n numbers on p processors hypercube:
cost-optimal
  Ts = n
  Tp = n/p + 2 log p
  T0 = p·Tp - Ts = p·(n/p + 2 log p) - n = 2p log p
Isoefficiency Function
  T0 = 2p log p = T0(p)
  W = K·T0(W, p) = 2K·p log p
  => the asymptotic isoefficiency function is Θ(p log p)
  meaning:
  (1) increasing #PE from p to p' > p requires the problem size to be
      increased by a factor of (p' log p') / (p log p)
      to have the same efficiency as on p processors.
  (2) increasing #PE from p to p' > p by a factor of p'/p requires the
      problem size to grow by a factor of (p' log p') / (p log p) to
      increase the speedup by p'/p.
• Here the communication overhead is an exclusive
  function of p: T0 = T0(p)
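A short sketch of this isoefficiency relation, W = K·T0(p) with K = E/(1 - E), showing how fast the problem size must grow to hold E = 0.8 for the adding-n-numbers example (the values match the efficiency table earlier):

```python
import math

def required_W(p, E=0.8):
    """Problem size needed to hold efficiency E when T0(p) = 2 p log2 p."""
    K = E / (1 - E)
    return K * 2 * p * math.log2(p)

for p in (4, 8, 16, 32, 64):
    print(f"p={p:3d}  W needed for E=0.8: {required_W(p):8.0f}")
```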
In general
  T0 = T0(W, p)
  W = K·T0(W, p)
• E = ct needs the ratio T0 / W
  to stay fixed as p ↑ and W ↑ in order to obtain
  nondecreasing efficiency (E' >= E)
  => T0 should not grow faster than W
• If T0 has multiple terms, we balance W against each
  term of T0 and compute the respective
  isoefficiency function for the individual terms.
• The component of T0 that requires the problem
  size to grow at the highest rate with respect to p
  determines the overall asymptotic isoefficiency of
  the parallel system.
  E.g.:  T0 = p^(3/2) + p^(3/4)·W^(3/4)
  – balance against the 1st term:  W = K·p^(3/2) = Θ(p^(3/2))
  – balance against the 2nd term:  W = K·p^(3/4)·W^(3/4)
    => W^(1/4) = K·p^(3/4)  =>  W = K^4·p^3 = Θ(p^3)
• Take the higher of the 2 rates to ensure E does not
  decrease: the problem size needs to grow as Θ(p^3) (the overall
  asymptotic isoefficiency)
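A quick numeric sanity check (numbers are my own) that the dominant term really dictates the required growth rate: with W ~ p^3 the ratio T0/W stays bounded, while W ~ p^(3/2) does not suffice:

```python
def overhead(W, p):
    # Hypothetical overhead function from the example above
    return p ** 1.5 + p ** 0.75 * W ** 0.75

for p in (16, 64, 256, 1024):
    for label, W in (("W = p^(3/2)", p ** 1.5), ("W = p^3    ", p ** 3)):
        print(f"p={p:5d}  {label}  T0/W = {overhead(W, p) / W:10.3f}")
```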
Isoefficiency Functions
• Captures the characteristics of the parallel algorithm
  and the parallel architecture
• predicts the impact on performance as #PE ↑
• characterizes the amount of parallelism in a parallel
  algorithm
• Study of parallel alg. behavior due to HW changes
  – speed of the PEs
  – communication channels
Cost-optimality and Isoefficiency
  Ts / (p·Tp) = ct
  => p·Tp = Θ(W)
  T0 = p·Tp - W
  => W + T0(W, p) = Θ(W)
  => T0(W, p) = O(W)
  => W = Ω(T0(W, p))
A parallel system is cost-optimal iff its overhead function
does not grow asymptotically faster than the problem size.
Relationship between Cost-optimality and
Isoefficiency Function
• E.g. add “n” numbers on “p” processors hypercube
  – (a) non-cost-optimal version:
    W = O(n) = n
    Tp = O((n/p)·log p)
    T0 = p·Tp - W = Θ(n log p)
    => W = K·Θ(n log p)  (cannot be satisfied for a fixed K as p grows,
       so E cannot be maintained)
  – (b) cost-optimal version:
    W = O(n) = n
    Tp = O(n/p + log p)
    p·Tp = Θ(n + p log p)  =>  T0 = Θ(p log p)
    => W = K·(p log p)
  The problem size should grow at least as p log p s.t. the
  parallel algorithm is scalable.
Isoefficiency Function
• Determines the ease with which a parallel system
  can maintain a constant efficiency and thus
  achieve speedups increasing in proportion to the
  number of processors
• A small isoefficiency function means that small
  increments in the problem size are sufficient for
  the efficient utilization of an increasing number of
  processors => indicates the parallel system is
  highly scalable
• A large isoeff. function indicates a poorly scalable
  parallel system
• The isoefficiency function does not exist for
  unscalable parallel systems, because in such
  systems the efficiency cannot be kept at any
  constant value as p ↑, no matter how fast the
  problem size is increased.
Lower bound on the isoefficiency
• Small isoefficiency function => high scalability
  – for a problem of size W: Pmax <= W for a cost-optimal
    system (if Pmax > W, some PEs are idle)
  – if W < Θ(p), then as p ↑, at some point #PE > W => E ↓
  – => asymptotically: W = Ω(p)
    • the problem size must increase at least proportionally to Θ(p) to
      maintain a fixed efficiency
    • Ω(p) is the asymptotic lower bound on the isoefficiency function;
      it is reached when Pmax = Θ(W), so the isoeff. function for an ideally
      scalable parallel system is W = Θ(p)
Degree of Concurrency & Isoefficiency
Function
• Max # of tasks that can be executed simultaneously
  at any time: C(W)
• independent of the parallel architecture
  – no more than C(W) processors can be employed
    effectively
Effect of Concurrency on Isoefficiency
Function
• E.g.:
  W = Θ(n^3),  p = Θ(n^2)
  => C(W) = Θ(W^(2/3))
  => at most Θ(W^(2/3)) processors can be used efficiently
  => given p, the problem size should be at least Θ(p^(3/2)) to use them all
• The isoefficiency due to concurrency is Θ(p) only if the degree of
  concurrency of the parallel algorithm is Θ(W)
• If the degree of concurrency of an algorithm is less
  than Θ(W), then the isoefficiency function due to
  concurrency is worse, i.e., greater than Θ(p)
• Overall isoeff. function of a parallel system:
  Isoeff_system = max(Isoeff_concurrency, Isoeff_communication, Isoeff_overhead)
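Using the example's numbers, a tiny sketch of how the degree of concurrency caps the usable processor count (C(W) = W^(2/3), so employing p processors needs W of at least about p^(3/2)):

```python
def degree_of_concurrency(W):
    return W ** (2 / 3)          # C(W) when W = Theta(n^3) and p = Theta(n^2)

def min_problem_size(p):
    return p ** 1.5              # smallest W with C(W) >= p

for p in (64, 256, 1024):
    W = min_problem_size(p)
    print(f"p={p:5d}  need W >= {W:8.0f}  (then C(W) = {degree_of_concurrency(W):6.0f})")
```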
Sources of Parallel Overhead
• The overhead function characterizes a parallel
system
• Given the overhead function T0 = T0(W, p),
  we can express Tp, S, E, and p·Tp as functions fi(W, p)
• The overhead function encapsulates all causes of
  inefficiency of a parallel system, due to the:
  – algorithm
  – architecture
  – alg.-architecture interaction
Sources of Overhead
• Interprocessor communication
  – each PE spends: tcomm
  – overall interproc. communication: p·tcomm
• Load imbalance
  – idle vs. busy PEs
    • idle PEs contribute to overhead
• Extra computation
  – redundant computation
  – W: for the best sequential alg.
    W': for a sequential alg. that is easily parallelizable
    W' - W contributes to overhead
  – W = Ws + Wp => Ws is executed by 1 PE only => (p-1)·Ws
    contributes to overhead
Minimum Execution Time
(assume p is not a constraint)
• Tp = Tp(W, p)
• For a given W, Tp^min is obtained from dTp/dp = 0,
  which gives the p0 for which Tp = Tp^min
• E.g., adding n numbers on a hypercube:
  Tp = n/p + 2 log p
  dTp/dp = -n/p^2 + 2/p = 0  =>  p0 = n/2
  Tp^min = 2 log n
• Cost-sequential: Θ(n)
• Cost-parallel: p0·Tp^min = (n/2)·2 log n = Θ(n log n)
  – not cost-optimal: running this alg. at Tp^min is not cost-optimal,
    even though the alg. itself is cost-optimal (for a suitable p)
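A quick numeric sketch (scanning powers of two only) that locates Tp^min for Tp = n/p + 2 log2 p and compares it with the analytic p0 ~ n/2 and Tp^min ~ 2 log2 n above; the discrete minimiser can differ from n/2 while reaching the same Tp^min:

```python
import math

def Tp(n, p):
    return n / p + 2 * math.log2(p)

n = 1024
p0 = min((2 ** k for k in range(1, int(math.log2(n)) + 1)), key=lambda p: Tp(n, p))
print("numeric :  p0 =", p0, "  Tp =", Tp(n, p0))
print("analytic:  p0 ~", n // 2, "  Tp_min ~", 2 * math.log2(n))
```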
• Derive a lower bound for Tp s.t. the parallel cost is optimal
  Tp^cost_opt -- par. run-time s.t. the cost is optimal
              -- W fixed
  If the isoeff. function is Θ(f(p)),
  then a problem of size W can be executed cost-optimally only
  if W = Ω(f(p))
  => p = O(f^-1(W)) is required for a cost-optimal solution
  Tp for a cost-optimal alg. is Θ(W/p),
  since p·Tp = Θ(W)  =>  Tp = Θ(W/p)
  p = O(f^-1(W))  =>
  Tp^cost_opt = Ω(W / f^-1(W))
Minimum Cost-optimal Time for Adding n
Numbers on a Hypercube
• Isoefficiency function:
  T0 = p·Tp - W
  Tp = n/p + 2 log p  =>
  T0 = p·(n/p + 2 log p) - n = 2p log p
  W = K·T0 = 2K·p log p = Θ(p log p)
• if W = n = f(p) = p log p,
  then log n = log p + log log p ~ log p
• n = f(p) = p log p,  p = f^-1(n)
  n = p log p  =>  p = n / log p ~ n / log n
  => f^-1(n) ~ n / log n
  => f^-1(W) = Θ(n / log n)
• The cost-optimal solution requires p = O(f^-1(W)):
  for a cost-optimal solution, p = Θ(n / log n)
• For p = Θ(n / log n):
  Tp = Tp^cost_opt = n/p + 2 log p
  => Tp^cost_opt = log n + 2 log(n / log n) = 3 log n - 2 log log n = Θ(log n)
• Note:
  Tp^min = Θ(log n)
  Tp^cost_opt = Θ(log n)
  => Tp^cost_opt = Θ(Tp^min)
  The cost-optimal solution is the best asymptotic solution
  in terms of execution time.
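A short numeric check (my own test values) that p ~ n / log2 n indeed gives Tp close to 3 log2 n - 2 log2 log2 n:

```python
import math

def Tp(n, p):
    return n / p + 2 * math.log2(p)

for n in (2 ** 10, 2 ** 16, 2 ** 20):
    p = round(n / math.log2(n))                                   # p = Theta(n / log n)
    predicted = 3 * math.log2(n) - 2 * math.log2(math.log2(n))    # Theta(log n)
    print(f"n=2^{int(math.log2(n)):2d}  p={p:7d}  Tp={Tp(n, p):7.2f}  predicted={predicted:7.2f}")
```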
• Parallel system where Tp^cost_opt ≠ Θ(Tp^min):
  T0 = p^(3/2) + p^(3/4)·W^(3/4)
  Tp = (W + T0) / p = W/p + p^(1/2) + W^(3/4) / p^(1/4)
  dTp/dp = 0  =>  p0 = Θ(W)
  Tp^min = Θ(W^(1/2))
  – Isoeff. function:
    W = K·T0  =>  W = K^4·p^3 = Θ(p^3)
    => pmax = Θ(W^(1/3)): the max # of PEs for which the alg. is cost-optimal
  – For p = Θ(W^(1/3)):
    Tp^cost_opt = Θ(W^(2/3))
  => Tp^cost_opt > Tp^min asymptotically
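A closing numeric illustration (constants are arbitrary) of this hypothetical system: near p = Θ(W) the run-time behaves like W^(1/2), while at the largest cost-optimal machine size p = Θ(W^(1/3)) it behaves like W^(2/3):

```python
def Tp(W, p):
    T0 = p ** 1.5 + p ** 0.75 * W ** 0.75   # hypothetical overhead from the example
    return (W + T0) / p

for W in (10 ** 6, 10 ** 9):
    p_min = W                     # p0 = Theta(W) from dTp/dp = 0
    p_opt = round(W ** (1 / 3))   # p_max = Theta(W^(1/3)) from the isoefficiency
    print(f"W={W:>10}  Tp(p~W)={Tp(W, p_min):12.1f}  vs W^0.5={W ** 0.5:10.1f}   "
          f"Tp(p~W^(1/3))={Tp(W, p_opt):12.1f}  vs W^(2/3)={W ** (2/3):12.1f}")
```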