Dense Matrix Algorithms
Carl Tropper
Department of Computer Science
McGill University
Dense
• Few zero entries (most entries are non-zero)
• Topics
– Matrix-Vector Multiplication
– Matrix-Matrix Multiplication
– Solving a System of Linear Equations
Introductory ramblings
• Due to their regular structure, parallel
computations involving matrices and vectors
readily lend themselves to data decomposition.
• Typical algorithms rely on input, output, or
intermediate data decomposition.
• Discuss one- and two-dimensional block,
cyclic, and block-cyclic partitionings.
• Use one task per process
Matrix-Vector Multiplication
• Multiply a dense n x n matrix A with an n x 1
vector x to yield an n x 1 vector y.
• The serial algorithm requires n² multiplications
and additions (a serial sketch follows below).
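What the serial cost statement means in code: a minimal sketch in plain C (the function name matvec and the toy main are ours, not from the slides).

```c
#include <stdio.h>

/* y = A*x for a dense n x n matrix A (row-major): n*n multiply-adds. */
static void matvec(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];   /* one multiply and one add */
        y[i] = sum;
    }
}

int main(void)
{
    double A[4] = {1, 2, 3, 4};          /* 2 x 2 example */
    double x[2] = {1, 1}, y[2];
    matvec(2, A, x, y);
    printf("y = [%g, %g]\n", y[0], y[1]);   /* prints y = [3, 7] */
    return 0;
}
```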
Rowwise 1-D Partitioning
One row per process
• Each process starts with only one element of
x, so an all-to-all broadcast is needed to distribute all
the elements of x to all of the processes.
• Process Pi then computes y[i] = Σj A[i,j]·x[j].
• The all-to-all broadcast and the computation
of y[i] both take time Θ(n). Therefore, the
parallel time is Θ(n).
p < n
• Use block 1-D partitioning.
• Each process initially stores n/p complete
rows of the matrix and a portion of the vector
of size n/p.
• The all-to-all broadcast takes place among p
processes and involves messages of size n/p;
it takes time ts log p + tw(n/p)(p−1) ~ ts log p + tw n for
large p.
• This is followed by n/p local dot products.
• The parallel runtime is TP = n²/p + ts log p + tw n
• pTP = n² + p ts log p + p tw n =>
• cost optimal (pTP) if p = O(n).
• An MPI sketch of this scheme follows below.
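A hedged MPI sketch of the block 1-D scheme (our own illustration, not the course's code): the all-to-all broadcast of the vector pieces is expressed with MPI_Allgather, and p is assumed to divide n.

```c
#include <mpi.h>
#include <stdlib.h>

/* Block 1-D matrix-vector multiply: each of the p processes owns n/p
   complete rows of A and n/p elements of x and of y. Assumes p | n. */
static void matvec_1d(int n, const double *A_rows,  /* (n/p) x n local rows  */
                      const double *x_local,        /* n/p local vector part */
                      double *y_local,              /* n/p local result part */
                      MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int nlocal = n / p;

    /* All-to-all broadcast: every process gathers the full vector x. */
    double *x_full = malloc(n * sizeof(double));
    MPI_Allgather(x_local, nlocal, MPI_DOUBLE,
                  x_full, nlocal, MPI_DOUBLE, comm);

    /* n/p local dot products of length n. */
    for (int i = 0; i < nlocal; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A_rows[i * n + j] * x_full[j];
        y_local[i] = sum;
    }
    free(x_full);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int p;
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int n = 4 * p, nlocal = n / p;            /* toy size: 4 rows per process */
    double *A = malloc(nlocal * n * sizeof(double));
    double *x = malloc(nlocal * sizeof(double));
    double *y = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal * n; i++) A[i] = 1.0;   /* A = all ones */
    for (int i = 0; i < nlocal; i++)     x[i] = 1.0;   /* x = all ones */

    matvec_1d(n, A, x, y, MPI_COMM_WORLD);             /* each y[i] == n */

    free(A); free(x); free(y);
    MPI_Finalize();
    return 0;
}
```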
Scalability Analysis
• We know that TO = pTP − W; therefore, we have
TO = ts p log p + tw n p.
• For isoefficiency, we have W = K TO, where K
= E/(1 − E) for the desired efficiency E.
• W = n² = K tw n p from the tw term alone
=> n = K tw p => W = n² = K² tw² p²
• From this, we have W = O(p²) from the tw term.
• There is also a bound on isoefficiency
because of concurrency. In this case, p < n,
therefore W = n² = Ω(p²).
• From these 2 bounds on W, the overall
isoefficiency is W = Θ(p²). (The derivation is
spelled out below.)
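The compressed derivation above, written out (assuming the usual relation E = W/(W + TO)):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
With $W = n^2$ and $T_O = t_s p \log p + t_w n p$,
\[
E \;=\; \frac{W}{W + T_O}
  \;=\; \frac{n^2}{\,n^2 + t_s p \log p + t_w n p\,}.
\]
Isoefficiency requires $W = K\,T_O$ with $K = E/(1-E)$. Taking the $t_w$ term alone,
\[
n^2 = K t_w n p \;\Rightarrow\; n = K t_w p \;\Rightarrow\;
W = n^2 = K^2 t_w^2\, p^2 = \Theta(p^2).
\]
\end{document}
```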
2-D Partitioning (naïve version)
• Begin with a one-element-per-process
partitioning.
• The n x n matrix is partitioned among n²
processors such that each processor owns a
single element.
• The n x 1 vector x is distributed in the last
column of n processors. Each processor has
one element.
2-D Partitioning
• We must first align the vector with the matrix
appropriately.
• The first communication step for the 2-D
partitioning aligns the vector x along the
principal diagonal of the matrix.
• The second step copies the vector elements
from each diagonal process to all the
processes in the corresponding column using
n simultaneous broadcasts among all
processors in the column.
• Finally, the result vector is computed by
performing an all-to-one reduction along each
row.
2-D Partitioning
• Three basic communication operations are
used in this algorithm:
– one-to-one communication to align the vector
along the main diagonal,
– one-to-all broadcast of each vector element
among the n processes of each column, and
– all-to-one reduction in each row.
• Each of these operations takes Θ(log n) time
and the parallel time is Θ(log n) .
• There are n² processes, so the cost (process-time product) is Θ(n² log n); hence, the
algorithm is not cost-optimal.
The less naïve version: fewer than n²
processes
• When using fewer than n² processes, each
process owns an (n/√p) x (n/√p)
block of the matrix.
• The vector is distributed in portions of n/√p
elements in the last process-column only.
• In this case, the message sizes for the
alignment, broadcast, and reduction are all
n/√p.
• The computation is a product of an (n/√p) x
(n/√p) submatrix with a vector of length
n/√p.
Parallel Run Time
• Sending a message of size n/√p to the diagonal
takes time ts + tw n/√p.
• The column-wise one-to-all broadcast takes (ts +
tw n/√p) log √p using the hypercube algorithm.
• The all-to-one reduction takes the same amount of
time.
• Assuming a multiplication and addition take
unit time, each process spends n²/p time
computing.
• TP is on the next page.
Next page
• TP = n²/p {computation}
+ ts + tw n/√p {aligning the vector}
+ (ts + tw n/√p) log √p {column-wise one-to-all broadcast}
+ (ts + tw n/√p) log √p {all-to-one reduction}
• TP ~ n²/p + ts log p + tw (n/√p) log p
(the simplification is written out below)
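How the last line follows, using log √p = (1/2) log p and dropping the lower-order constant term:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\begin{align*}
T_P &= \frac{n^2}{p}
     + \Bigl(t_s + t_w\tfrac{n}{\sqrt p}\Bigr)
     + 2\Bigl(t_s + t_w\tfrac{n}{\sqrt p}\Bigr)\log\sqrt p \\
    &= \frac{n^2}{p} + (1 + \log p)\Bigl(t_s + t_w\tfrac{n}{\sqrt p}\Bigr)
     \;\approx\; \frac{n^2}{p} + t_s\log p + t_w\frac{n}{\sqrt p}\log p .
\end{align*}
\end{document}
```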
Scalability Analysis
• From W = n², the expression for TP, and TO = pTP − W,
we get TO = ts p log p + tw n √p log p.
• As before, find out what each term
contributes to W.
• From the ts term: W = K ts p log p.
• From the tw term: W = n² = K tw n √p log p => n = K tw √p log p =>
n² = K² tw² p log² p =>
• W = K² tw² p log² p
(**)
• Concurrency is n² => p = O(n²) => n² = Ω(p), and
• W = Ω(p).
• The tw term (**) dominates everything =>
• W = Θ(p log² p)
Scalability Analysis
• The maximum number of processes that can be
used cost-optimally for a problem of size W is
determined by p log² p = O(n²).
• After some manipulation, p = O(n²/log² n),
• an asymptotic upper bound on the number of
processes that can be used for a cost-optimal
solution.
• Bottom line: 2-D partitioning is better than 1-D
because:
• it is faster!
• it has a smaller isoefficiency function: you get the same
efficiency on more processes!
Matrix-Matrix multiplication
• The standard serial algorithm takes the
dot product of each row of A with each column of B and
has complexity Θ(n³).
• Can also use a q x q array of blocks, where
each block is (n/q) x (n/q). This yields q³
multiplications and additions of sub-matrices; each sub-matrix
multiplication involves (n/q)³ additions and multiplications.
• Parallelize the q x q blocks algorithm (a serial
sketch of the blocked formulation follows below).
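A minimal serial sketch of the q x q block formulation (assuming q divides n; the function name and the toy main are ours):

```c
#include <stdio.h>
#include <string.h>

/* C = A * B for dense n x n row-major matrices, computed as a q x q grid
   of (n/q) x (n/q) blocks: q^3 block multiply-adds, each costing (n/q)^3
   scalar multiply-adds. Assumes q divides n. */
static void matmul_blocked(int n, int q,
                           const double *A, const double *B, double *C)
{
    int b = n / q;                          /* block size */
    memset(C, 0, (size_t)n * n * sizeof(double));
    for (int I = 0; I < q; I++)             /* block row of C    */
        for (int J = 0; J < q; J++)         /* block column of C */
            for (int K = 0; K < q; K++)     /* inner block index */
                /* C[I,J] += A[I,K] * B[K,J]: one block multiply-add */
                for (int i = I * b; i < (I + 1) * b; i++)
                    for (int j = J * b; j < (J + 1) * b; j++) {
                        double sum = 0.0;
                        for (int k = K * b; k < (K + 1) * b; k++)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] += sum;
                    }
}

int main(void)
{
    enum { N = 4, Q = 2 };
    double A[N * N], B[N * N], C[N * N];
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 1.0; }  /* all ones */
    matmul_blocked(N, Q, A, B, C);
    printf("C[0][0] = %g\n", C[0]);          /* expected 4 (= N) */
    return 0;
}
```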
Simple Parallel Algorithm
• A and B are partitioned into p blocks Ai,j,
Bi,j of size (n/√p) x (n/√p).
• They are mapped onto a √p x √p mesh.
• Pi,j stores Ai,j and Bi,j and computes Ci,j.
• It needs the Ai,k and Bk,j sub-matrices for 0 ≤ k < √p.
• An all-to-all broadcast of A's blocks is done in each
row and of B's blocks in each column.
• Then multiply the A and B blocks.
Scalability
• There are 2 all-to-all broadcasts, over the rows and columns of the process mesh.
• Messages contain submatrices of n²/p elements.
• The communication time is 2(ts log √p + tw (n²/p)(√p − 1))
{a hypercube is assumed}.
• Each process computes Ci,j; this takes √p
multiplications of (n/√p) x (n/√p) submatrices,
taking n³/p time.
• Parallel time TP = n³/p + ts log p + 2 tw n²/√p
• Process-time product = n³ + ts p log p + 2 tw n² √p
• Cost optimal for p = O(n²)
Scalability
• The isoefficiency is O(p^1.5) due to the
bandwidth term tw and concurrency.
• Major drawback: the algorithm is not
memory optimal. Memory is Θ(n² √p), or
√p times the memory of the sequential
algorithm.
Cannon's algorithm
• Idea: schedule the computations of the
processes of the ith row such that, at any
given time, each process uses a different
block Ai,k.
• These blocks can be systematically rotated
among the processes after every submatrix
multiplication so that every process gets a
fresh Ai,k after each rotation.
• Use the same scheme for the columns of B => no
process holds more than one block of each matrix at a time.
• Memory is Θ(n²).
• An MPI sketch of the algorithm follows below.
Cannon shift
Performance
• The maximum shift for a block is √p − 1.
• 2 shifts (row and column) require 2(ts + tw n²/p).
• √p shifts => 2√p(ts + tw n²/p) total communication time.
• The time for multiplying √p matrices of size
(n/√p) x (n/√p) is n³/p.
• TP = n³/p + 2√p(ts + tw n²/p)
• Same cost-optimality condition as the simple
algorithm, and the same isoefficiency function.
• The difference is memory!!
DNS Algorithm
• Simple and Cannon:
– use block 2-D partitioning of the input and output matrices,
– use a maximum of n² processes for n x n matrix multiplication,
– have Ω(n) run time because of the Θ(n³) ops in the serial
algorithm.
• DNS:
– uses up to n³ processes,
– has a run time of Θ(log n) using Ω(n³/log n) processes.
DNS Algorithm
• Assume an n x n x n mesh of processors.
• Move the columns of A and the rows of B and
perform broadcasts.
• Each processor computes a single multiply-add.
• This is followed by an accumulation along the
C dimension.
• Addition along C takes time Θ(log n) =>
• the parallel runtime is Θ(log n).
• This is not cost optimal. It can be made cost
optimal by using n/log n processors along
the direction of accumulation.
Cost-optimal DNS with fewer than n³ processes
• Let p = q³ for q < n.
• Partition the two matrices into blocks of
size (n/q) x (n/q).
• This gives a q x q square array of blocks.
Performance
• The one-to-one communication takes ts + tw(n/q)².
• The one-to-all broadcast takes ts log q + tw(n/q)² for each
matrix.
• The final all-to-one reduction takes ts log q + tw(n/q)² log q.
• Multiplication of the (n/q) x (n/q) submatrices takes
(n/q)³ time.
• TP ~ (n/q)³ + ts log p + tw(n²/p^(2/3)) log p => the cost is
n³ + ts p log p + tw n² p^(1/3) log p.
• The isoefficiency function is Θ(p (log p)³); the derivation is sketched below.
• The algorithm is cost optimal for p = O(n³/(log n)³).
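The isoefficiency claim, spelled out from the dominant tw term with W = n³:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
From the $t_w$ term of the overhead, $T_O \approx t_w\, n^2 p^{1/3} \log p$,
so the isoefficiency condition $W = K\,T_O$ with $W = n^3$ gives
\[
n^3 = K t_w n^2 p^{1/3} \log p
\;\Rightarrow\; n = K t_w p^{1/3} \log p
\;\Rightarrow\; W = n^3 = K^3 t_w^3\, p\,(\log p)^3 = \Theta\!\bigl(p(\log p)^3\bigr).
\]
\end{document}
```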
Linear Equations
Upper Triangular Form
• The idea is to convert the equations into this form
and then back-substitute (i.e., go up the chain).
Principle behind solution
• Can make use of elementary operations
on equations to solve them
• Elementary operations are:
– interchanging two rows, and
– replacing any equation by a linear combination
of any other equation and itself.
Code for Gaussian Elimination
What the code is doing
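The original slides show the textbook's pseudocode as a figure, which is not reproduced here. Below is a minimal serial sketch of the same procedure (no pivoting); the line numbers quoted on the next slide refer to the textbook's code, not to this sketch.

```c
#include <stdio.h>

/* Gaussian elimination without pivoting: reduce A (n x n, row-major) to
   unit upper-triangular form, transforming b alongside. Assumes nonzero
   pivots. */
static void gaussian_eliminate(int n, double *A, double *b)
{
    for (int k = 0; k < n; k++) {
        /* Division step: normalize row k by the pivot A[k][k]. */
        for (int j = k + 1; j < n; j++)
            A[k * n + j] /= A[k * n + k];
        b[k] /= A[k * n + k];
        A[k * n + k] = 1.0;

        /* Elimination step: subtract multiples of row k from rows below. */
        for (int i = k + 1; i < n; i++) {
            for (int j = k + 1; j < n; j++)
                A[i * n + j] -= A[i * n + k] * A[k * n + j];
            b[i] -= A[i * n + k] * b[k];
            A[i * n + k] = 0.0;
        }
    }
}

int main(void)
{
    /* Small example: 2x + y = 5, x + 3y = 10. */
    double A[4] = {2, 1, 1, 3};
    double b[2] = {5, 10};
    gaussian_eliminate(2, A, b);
    /* A is now unit upper triangular; back substitution gives the answer. */
    double y = b[1];
    double x = b[0] - A[0 * 2 + 1] * y;
    printf("x = %g, y = %g\n", x, y);   /* expected x = 1, y = 3 */
    return 0;
}
```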
Complexity of serial Gaussian
• n²/2 divisions (the division step, line 6 of the textbook code)
• n³/3 − n²/2 subtractions and
multiplications (the elimination step, line 12)
• Assuming all ops take unit time, for
large enough n we have W = 2n³/3.
Parallel Gaussian
• Use 1-D Partitioning
• One row per process
1-D Partitioning
Parallel 1-D
• Assume p = n, with each row assigned to a process.
• The first step of the algorithm normalizes the row.
This is a serial operation and takes time (n − k) in the
kth iteration.
• In the second step, the normalized row is
broadcast to all the processes. This takes time
(ts + tw(n − k − 1)) log n.
• Each process can then independently eliminate this row
from its own row. This requires (n − k − 1) multiplications and
subtractions.
• The total parallel time can be computed by summing
over k = 1 … n − 1 (as worked out below): TP = 3/2 n(n−1) + ts n log n + 1/2 tw n(n−1) log n.
• The formulation is not cost optimal because of the tw
term.
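How the closed form arises, counting roughly one time unit for the division and two for each multiply-and-subtract pair (about 3(n − k) computation per iteration) and summing over the n − 1 iterations:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\begin{align*}
T_P &\approx \sum_{k=1}^{n-1}\Bigl[\,3(n-k) + \bigl(t_s + t_w(n-k)\bigr)\log n\,\Bigr] \\
    &= \tfrac{3}{2}\,n(n-1) + t_s (n-1)\log n + \tfrac{1}{2}\,t_w\, n(n-1)\log n
     \;\approx\; \tfrac{3}{2}n(n-1) + t_s n\log n + \tfrac{1}{2}t_w n(n-1)\log n .
\end{align*}
\end{document}
```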
Parallel 1-D with Pipelining
• The (k+1)st iteration starts only after the kth
iteration completes.
• In each iteration, all of the active processes
collaborate.
• This is a synchronous algorithm
• Idea: Implement algorithm so that no process
has to wait for all of its predecessors to finish
their work
• The result is an asynchronous algorithm,
which makes use of pipelining
• Algorithm turns out to be cost-optimal
Pipelining
• During the kth iteration, Pk sends part of the
kth row to Pk+1, which forwards it to Pk+2,
which forwards it on, and so on.
• Pk+1 can perform the elimination step without
waiting for the data to finish its journey to the
bottom of the matrix.
• Idea is to get the maximum overlap of
communication and computation
• If a process has data destined for other processes, it
sends it right away
• If the process can do a computation using the data it has,
it does so
Pipeline for 1-D, 5x5
Pipelining is cost optimal
• The total number of steps in the entire
pipelined procedure is Θ(n).
• In any step, either O(n) elements are
communicated between directly-connected
processes, or a division step is performed on
O(n) elements of a row, or an elimination step
is performed on O(n) elements of a row.
• The parallel time is therefore O(n²).
• Since there are n processes, the cost is O(n³).
• Guess what, it's cost optimal!
Pipelining 1-D with p<n
• The pipelined algorithm can be easily
extended:
• n x n matrix
• n/p rows per process
• Example on the next slide:
• p = 4
• 8 x 8 matrix
Next slide
Analysis
• In the kth iteration, a process with all of its rows
belonging to the active part of the matrix
performs (n − k − 1)(n/p) multiplications and
subtractions during the elimination step.
• Computation dominates communication: at
each iteration, about (n − k − 1) words are
communicated (vs. (n − k − 1)(n/p) computation ops).
• The parallel time is 2(n/p) Σk=0..n−1 (n − k − 1) ~
n³/p (worked out below).
• The algorithm is cost optimal, but the cost is
higher than the sequential run time by a
factor of 3/2.
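The sum and the 3/2 factor, written out (comparing the cost pTP with the serial operation count W ≈ 2n³/3):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\[
T_P \;\approx\; \frac{2n}{p}\sum_{k=0}^{n-1}(n-k-1)
  \;=\; \frac{2n}{p}\cdot\frac{n(n-1)}{2}
  \;\approx\; \frac{n^3}{p},
\qquad
p\,T_P \;\approx\; n^3 \;=\; \frac{3}{2}\cdot\frac{2}{3}n^3 .
\]
So the cost exceeds the serial operation count $W \approx \tfrac{2}{3}n^3$ by a
factor of $3/2$.
\end{document}
```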
Fewer than n processes
• The inefficiency is due to an unbalanced load.
– In the figure on the next slide, 1 process is idle, 1 is
partially active, and 2 are fully active.
• Use a cyclic block distribution to balance the load.
Block and cyclic mappings
2-D Partitioning
• The matrix A is n x n and is mapped onto an n x n mesh:
A[i,j] goes to Pi,j.
• The rest is as before, only the communication
of individual elements takes place between
processors.
• We need a one-to-all broadcast of A[i,k] along the ith
row for k ≤ i < n, and a one-to-all broadcast of
A[k,j] along the jth column for k < j < n.
• Picture on the next slide.
• The result is not cost optimal.
Picture
k = 3 for an 8 x 8 mesh
Pipeline
• If we use synchronous broadcasts, the result
is not cost optimal, so we pipeline the 2-D
algorithm.
• The principle of the pipelined algorithm is the
same: if you can compute or communicate, do
it now, not later.
– Pk,k+1 can divide A[k,k+1] by A[k,k] before A[k,k]
reaches Pk,n−1 {the end of the row}.
– After Pk,j performs the division, it can send the
result down column j without waiting.
• The next slide exhibits the algorithm for 2-D pipelining.
2-D pipelining algorithm
Pipelining-the wave
• The computation and communication for each
iteration move through the mesh from top-left to bottom-right like a wave.
• After the wave corresponding to a certain
iteration passes through a process, the
process is free to perform subsequent
iterations.
• In (g), after the k = 0 wave passes P1,1, it starts the k = 1 iteration by
passing A[1,1] to P1,2.
• Multiple waves that correspond to different
iterations are active simultaneously.
The wave-continued
• If each step (division, elimination, or
communication) is assumed to take constant
time, the front moves a single step in this
time. The front takes Θ(n) time to reach Pn−1,n−1.
• Once the front has progressed past a
diagonal processor, the next front can be
initiated. In this way, the last front passes the
bottom-right corner of the matrix Θ(n) steps
after the first one.
• The parallel time is therefore O(n), which is
cost-optimal.
Fewer than n² processes
• In this case, a processor containing an active
part of the matrix performs n²/p multiplications
and subtractions, and communicates n/√p
words along its row and its column.
• The computation dominates communication
for n >> p.
• The total parallel run time of this algorithm is
(2n²/p) x n = 2n³/p, since there are n iterations.
• Process-time product = (2n³/p) x p = 2n³
• This is three times the serial operation count!
Load imbalance
• Same problem as with the 1-D mapping: an
uneven load distribution.
• Same solution: cyclic partitioning.
Load imbalance and a cyclic solution
Comparison
• The pipelined version takes Θ(n³/p) time on
p processes for both the 1-D and 2-D
versions.
• 2-D partitioning can use more
processes, O(n²), than 1-D partitioning,
O(n), for an n x n matrix => the 2-D version
is more scalable.