Numerical Linear Algebra: An Introduction (High Performance Computing 1)

Levels of multiplication
• vector-vector: a[i]*b[i]
• matrix-vector: A[i,j]*b[j]
• matrix-matrix: A[i,k]*B[k,j]

Matrix-matrix multiplication

    for (i = 0; i < n; i++)
      for (j = 0; j < n; j++) {
        C[i][j] = 0.0;
        for (k = 0; k < n; k++)
          C[i][j] = C[i][j] + A[i][k]*B[k][j];
      }

Note: O(n^3) work (2n^3 flops).

Block matrix
• Serial version: same pseudocode, but interpret i, j as indices of subblocks, and A*B as block matrix multiplication.
• Let n be a power of two and generate a recursive algorithm; the recursion terminates with an explicit formula for the elementary 2x2 multiplications. This allows for parallelism, and with Strassen's formulas the work drops to O(n^2.8). (A sketch of the 2x2 base case appears at the end of these notes.)

Pipeline method
• Pump blocks of A left to right and blocks of B top to bottom; each cell accumulates its block of C. (An MPI sketch of a closely related scheme appears at the end of these notes.)

    recv(&A, P[i,j-1]);   /* block of A arrives from the left */
    recv(&B, P[i-1,j]);   /* block of B arrives from above    */
    C = C + A*B;          /* accumulate the local product     */
    send(&A, P[i,j+1]);   /* pass A on to the right           */
    send(&B, P[i+1,j]);   /* pass B on downward               */

[Figure: a 4x4 grid of cells C[0,0]..C[3,3]; rows A[0,*]..A[3,*] stream in from the left, columns B[*,0]..B[*,3] stream in from the top.]

• A similar method works for matrix-vector multiplication, but you lose some of the cache reuse.

A sense of speed: vector ops

    Loop | Flops per pass | Operation                        | Name
    -----|----------------|----------------------------------|----------------------
    1    | 2              | v1(i) = v1(i) + a*v2(i)          | update
    2    | 8              | v1(i) = v1(i) + sum_k sk*vk(i)   | 4-fold vector update
    3    | 1              | v1(i) = v1(i)/v2(i)              | divide
    4    | 2              | v1(i) = v1(i) + s*v2(ind(i))     | update + gather
    5    | 2              | v1(i) = v2(i) - v3(i)*v1(i-1)    | bidiagonal recurrence
    6    | 2              | s = s + v1(i)*v2(i)              | inner product

Measured rates, in Hockney's parameters (r_inf = asymptotic rate in Mflop/s, n_1/2 = vector length reaching half of it):

         | J90, cft77 (100 ns clock) | T90, 1 processor, cft77 (2.2 ns clock)
    Loop | r_inf | n_1/2             | r_inf | n_1/2
    -----|-------|-------------------|-------|------
    1    | 97    | 19                | 1428  | 159
    2    | 163   | 29                | 1245  | 50
    3    | 21    | 6                 | 219   | 43
    4    | 72    | 21                | 780   | 83
    5    | 4.9   | 2                 | 25    | 9
    6    | 120   | 202               | 474   | 164

Observations
• Simple do loops are not effective.
• Cache and the memory hierarchy are the bottlenecks.
• For better performance: combine loops and minimize memory transfer.

LINPACK
• A library of subroutines for solving linear algebra problems.
• Example: LU decomposition and the accompanying system solve (dgefa and dgesl, respectively).
• Built, in turn, on the BLAS.
• See netlib.org.

BLAS: Basic Linear Algebra Subprograms
• A library of subroutines designed to provide efficient implementations of commonly used linear algebra operations, like dot products, matrix-vector multiplies, and matrix-matrix multiplies.
• The naming convention is not unlike that of other libraries: the first letter indicates precision, and the rest hints (sometimes loosely) at what the routine does, e.g. SAXPY (single-precision a*x plus y), DGEMM (double-precision general matrix-matrix multiply).
• The BLAS are divided into 3 levels: vector-vector (Level 1), matrix-vector (Level 2), and matrix-matrix (Level 3). The biggest speed-up is usually at Level 3. (Hedged example calls for all three levels appear at the end of these notes.)

[Three slides tabulating representative Level 1, Level 2, and Level 3 routines; the tables did not survive extraction.]

How efficient is the BLAS?

    Level   | Routine | Loads/stores  | Float ops | Refs/ops
    --------|---------|---------------|-----------|---------
    Level 1 | SAXPY   | 3N            | 2N        | 3/2
    Level 2 | SGEMV   | MN + N + 2M   | 2MN       | 1/2
    Level 3 | SGEMM   | 2MN + MK + KN | 2MNK      | 2/N

Matrix-vector multiply

    read x(1:n) into fast memory
    read y(1:n) into fast memory
    for i = 1:n
        read row i of A into fast memory
        for j = 1:n
            y(i) = y(i) + A(i,j)*x(j)
    write y(1:n) back to slow memory

• m = number of slow memory refs = n^2 + 3n
• f = number of arithmetic ops = 2n^2
• q = f/m ~ 2
• Matrix-vector multiply is limited by slow memory.

Matrix multiply, unblocked

    for i = 1 to n
        read row i of A into fast memory
        for j = 1 to n
            read C(i,j) into fast memory
            read column j of B into fast memory
            for k = 1 to n
                C(i,j) = C(i,j) + A(i,k)*B(k,j)
            write C(i,j) back to slow memory

Number of slow memory references:

    m = n^3      (read each column of B n times)
      + n^2      (read each row of A once for each i)
      + 2*n^2    (read and write each element of C once)
      = n^3 + 3*n^2

So q = f/m = 2*n^3/(n^3 + 3*n^2) ~= 2 for large n: no improvement over matrix-vector multiply.

Matrix multiply, blocked
Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the block size:

    for i = 1 to N
        for j = 1 to N
            read block C(i,j) into fast memory
            for k = 1 to N
                read block A(i,k) into fast memory
                read block B(k,j) into fast memory
                C(i,j) = C(i,j) + A(i,k)*B(k,j)   {matrix multiply on blocks}
            write block C(i,j) back to slow memory

Number of slow memory references:

    m = N*n^2    (read each block of B N^3 times: N^3 * (n/N)^2)
      + N*n^2    (read each block of A N^3 times)
      + 2*n^2    (read and write each block of C once)
      = (2*N + 2)*n^2

So q = f/m = 2*n^3/((2*N + 2)*n^2) ~= n/N = b for large n.

• So we can improve performance by increasing the block size b; this can be much faster than matrix-vector multiply (q = 2).
• Limit: all three blocks from A, B, C must fit in fast memory (cache), so we cannot make the blocks arbitrarily large: 3*b^2 <= M, hence q ~= b <= sqrt(M/3).
• A C sketch of the blocked algorithm follows.
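Example: blocked multiply in C
A minimal sketch of the blocked algorithm just described. The flat row-major storage and the assumption that b divides n are illustrative simplifications, not part of the original notes.

    #include <stddef.h>

    /* C = C + A*B with b-by-b blocking; A, B, C are n-by-n, row-major.
       Assumes b divides n (illustrative simplification). */
    void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += b)
            for (int jj = 0; jj < n; jj += b)
                for (int kk = 0; kk < n; kk += b)
                    /* C(ii,jj) += A(ii,kk)*B(kk,jj) on b-by-b blocks; all
                       three blocks fit in cache when 3*b*b doubles <= M */
                    for (int i = ii; i < ii + b; i++)
                        for (int k = kk; k < kk + b; k++) {
                            double a = A[(size_t)i*n + k];
                            for (int j = jj; j < jj + b; j++)
                                C[(size_t)i*n + j] += a * B[(size_t)k*n + j];
                        }
    }

Choosing b near sqrt(M/3), with M the fast-memory size in doubles, matches the 3*b^2 <= M limit above; ordering the innermost loops over k then j keeps the accesses to B and C unit-stride.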
More on BLAS
• Industry-standard interface (still evolving).
• Vendors and others supply optimized implementations.
• History:
  – BLAS1 (1970s): vector operations (dot product, saxpy); m = 2*n, f = 2*n, q ~ 1 or less.
  – BLAS2 (mid 1980s): matrix-vector operations; m = n^2, f = 2*n^2, q ~ 2; less overhead, somewhat faster than BLAS1.
  – BLAS3 (late 1980s): matrix-matrix operations (matrix-matrix multiply, etc.); m >= 4*n^2, f = O(n^3), so q can possibly be as large as n, and BLAS3 is potentially much faster than BLAS2.
• Good algorithms use BLAS3 when possible (e.g., LAPACK).
• www.netlib.org/blas, www.netlib.org/lapack

BLAS on an IBM RS6000/590
[Figure: measured Mflop/s versus matrix size n, against the peak speed of 266 Mflops, for BLAS 3 (n-by-n matrix-matrix multiply) vs BLAS 2 (n-by-n matrix-vector multiply) vs BLAS 1 (saxpy on vectors of length n).]
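Example: the three BLAS levels in C
To make the levels and the naming convention concrete, here is a hedged sketch using the C interface to the BLAS (cblas). The routine names and argument orders are the standard cblas ones; the data and the build details are assumptions for illustration (the header name and link flag vary by implementation, e.g. cc demo.c -lopenblas).

    #include <stdio.h>
    #include <cblas.h>

    int main(void)
    {
        double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
        double A[9] = {2, 0, 0,  0, 2, 0,  0, 0, 2};   /* 3x3, row-major */
        double B[9] = {1, 1, 1,  1, 1, 1,  1, 1, 1};
        double C[9] = {0};

        /* Level 1 (vector-vector): y = 2*x + y, then s = x . y */
        cblas_daxpy(3, 2.0, x, 1, y, 1);
        double s = cblas_ddot(3, x, 1, y, 1);

        /* Level 2 (matrix-vector): y = 1.0*A*x + 0.0*y */
        cblas_dgemv(CblasRowMajor, CblasNoTrans, 3, 3, 1.0, A, 3, x, 1, 0.0, y, 1);

        /* Level 3 (matrix-matrix): C = 1.0*A*B + 0.0*C */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    3, 3, 3, 1.0, A, 3, B, 3, 0.0, C, 3);

        printf("dot = %g, y[0] = %g, C[0] = %g\n", s, y[0], C[0]);
        return 0;
    }

The naming convention is visible in the calls: d for double precision, then axpy, dot, gemv, or gemm for the operation.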
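Example: the 2x2 base case of the O(n^2.8) recursion
The recursive block algorithm terminates in an explicit 2x2 formula; the classical choice is Strassen's, which uses seven multiplications instead of eight. A minimal sketch, shown on scalars; in the recursive algorithm each a[i][j] and b[i][j] is an (n/2)-by-(n/2) subblock and each * is a recursive call.

    void strassen_2x2(const double a[2][2], const double b[2][2], double c[2][2])
    {
        double m1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
        double m2 = (a[1][0] + a[1][1]) *  b[0][0];
        double m3 =  a[0][0]            * (b[0][1] - b[1][1]);
        double m4 =  a[1][1]            * (b[1][0] - b[0][0]);
        double m5 = (a[0][0] + a[0][1]) *  b[1][1];
        double m6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
        double m7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);
        c[0][0] = m1 + m4 - m5 + m7;   /* seven multiplies instead of eight */
        c[0][1] = m3 + m5;
        c[1][0] = m2 + m4;
        c[1][1] = m1 - m2 + m3 + m6;
    }

Seven multiplies per halving step gives T(n) = 7*T(n/2) + O(n^2), i.e. O(n^log2(7)) ~ O(n^2.81) work, the O(n^2.8) cited above.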
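Example: a grid-parallel block multiply in MPI
The pipeline slide streams A from the left and B from the top of a process grid. A closely related and easier-to-sketch formulation is Cannon's algorithm, which shifts blocks around a periodic q-by-q grid instead of streaming them in from the edges; this is a swapped-in variant, not the exact pipeline. The block size B_SZ, the made-up input data, and the assumption that the rank count is a perfect square are all illustrative.

    #include <mpi.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define B_SZ 64   /* each rank owns one B_SZ-by-B_SZ block (assumption) */

    /* C += A*B on b-by-b row-major blocks */
    static void block_mm(double *C, const double *A, const double *B, int b)
    {
        for (int i = 0; i < b; i++)
            for (int k = 0; k < b; k++) {
                double a = A[i*b + k];
                for (int j = 0; j < b; j++)
                    C[i*b + j] += a * B[k*b + j];
            }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int nprocs, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        int q = (int)(sqrt((double)nprocs) + 0.5);   /* assume q*q == nprocs */

        /* 2-D periodic process grid */
        int dims[2] = {q, q}, periods[2] = {1, 1}, coords[2];
        MPI_Comm grid;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
        MPI_Comm_rank(grid, &rank);
        MPI_Cart_coords(grid, rank, 2, coords);

        int n = B_SZ * B_SZ;
        double *A = malloc(n * sizeof *A), *B = malloc(n * sizeof *B);
        double *C = calloc(n, sizeof *C);
        for (int i = 0; i < n; i++) { A[i] = rank + 1.0; B[i] = 1.0; }

        /* initial skew: shift A left by the row index, B up by the column index */
        int src, dst;
        MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
        MPI_Sendrecv_replace(A, n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
        MPI_Sendrecv_replace(B, n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);

        /* q steps: multiply, then pass A one cell left and B one cell up */
        int left, right, up, down;
        MPI_Cart_shift(grid, 1, -1, &right, &left);
        MPI_Cart_shift(grid, 0, -1, &down, &up);
        for (int step = 0; step < q; step++) {
            block_mm(C, A, B, B_SZ);
            MPI_Sendrecv_replace(A, n, MPI_DOUBLE, left, 0, right, 0, grid, MPI_STATUS_IGNORE);
            MPI_Sendrecv_replace(B, n, MPI_DOUBLE, up, 0, down, 0, grid, MPI_STATUS_IGNORE);
        }

        printf("rank %d: C[0] = %g\n", rank, C[0]);
        free(A); free(B); free(C);
        MPI_Finalize();
        return 0;
    }

As in the pipeline version, each process does q local block multiplies and only nearest-neighbor communication, so the O(n^3) work is spread over the grid while each block of A and B visits every process that needs it exactly once.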