A functional approach to programming of heterogeneous systems
G. Zumbusch, Institut für Angewandte Mathematik, Friedrich-Schiller-Universität Jena
Workshop PHSP11, Jena, 5-7 October 2011

accelerators
- Nvidia Fermi
- STI Cell processor
- AMD Northern Islands

supercomputers with accelerators
- Tianhe-1A (Tianjin): 7168 x Nvidia Tesla 2050
- Nebulae (Shenzhen): 4640 x Nvidia Tesla 2050
- Tsubame 2 (Tokyo): 4224 x Nvidia Tesla 2050
- Roadrunner (Los Alamos): 12960 x IBM PowerXCell
- Lomonosov (Moscow): 1554 x Nvidia Tesla 2070
- Loewe-CSC (Frankfurt): 778 x AMD Radeon 5870

some history
- vector computers: CDC STAR-100 (1974), 100 MFlop/s; CDC Cyber 205; Cray 1; ...
- parallel vector computers: Cray X-MP (1982), 4 processors at 200 MFlop/s each; Fujitsu, Hitachi, NEC
- massively parallel computers: Intel iPSC (1985), 128 processors at ~100 kFlop/s each; Intel Paragon; Cray T3D; clusters; ...

non-standard parallel architectures
- Tera MTA (1999): hardware multithreading ("hyper-threading")
- Kendall Square Research (1986): all-cache architecture, threading
- Thinking Machines CM-1 (1985): SIMD

accelerators (history)
- Intel 8087 (1980), Weitek 1067 (1981)
- FPGAs: Xilinx (1985), Altera, ...

(parallel) programming?

vectorization
- Fortran 77 (1980-), vectorizing compilers with directives:

    !DIR$ IVDEP INFINITEVL
    DO I = 1,N
      A(B(I)) = A(B(I)) + 1
    ENDDO

    DO I = 1,N
      IF ( A(I) > 0.0 ) THEN
    !DIR$ PROBABILITY_ALMOST_NEVER
        B(I) = B(I) / A(I)
      ENDIF
    ENDDO

- optimized libraries: BLAS-1 (1979-)

    SUBROUTINE _AXPY ( N, ALPHA, X, INCX, Y, INCY )

array languages
- APL (1962), MATLAB (MathWorks, 1984-), Fortran 90 (1992), HPF (1993), Co-array Fortran (1998), ...

    A = [ 1 2 3; 3 4 5; 6 7 8];
    b = [0:2:8];
    c = (A + A') * b(1:3)';

message passing
- Intel NX, P4, Parmacs, ...; PVM (1989); MPI-1 (1994)

    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0)
      MPI_Send(message, strlen(message)+1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    else
      MPI_Recv(message, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);

- single-sided communication: BSP (1990), Cray shmem, MPI-2 (1998)

partitioned global address space languages
- distributed shared memory
- memory hierarchy
- weak memory consistency model
- local/distant memory access
- UPC, Co-array Fortran, X10, Chapel, Fortress

languages
- stream processing: SISAL, Brook

    kernel void k (float a<>, out float b<>, float p) {
      b = a + p;
    }
    float a<1000>;
    float b<1000>;
    streamRead(a, data);
    k(a, b, 3.2f);
    streamWrite(b, result);

- single assignment: SSA, SAC
- functional programming: Scheme, Haskell, ML, Scala, ...

a functional approach
- separate operations and data

a model problem: a textbook example
- Jacobi iteration, matrix notation:

    x^{(m)} = D^{-1} \bigl( b - (A - D)\, x^{(m-1)} \bigr), \quad m = 1, 2, \ldots

- components:

    x_i^{(m)} = \frac{1}{a_{ii}} \Bigl( b_i - \sum_{j \ne i} a_{ij}\, x_j^{(m-1)} \Bigr), \quad i = 1, \ldots, n;\ m = 1, 2, \ldots

- insert the finite-difference matrix:

    x_i^{(m)} = b_i + \tfrac{1}{2} x_{i-1}^{(m-1)} - x_i^{(m-1)} + \tfrac{1}{2} x_{i+1}^{(m-1)}, \quad i = 2, \ldots, n-1;\ m = 1, 2, \ldots
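As a concrete baseline for the transformations that follow, here is a minimal C sketch of this finite-difference update (a zero initial guess, the grid size, the iteration count, and the copy standing in for swap() are illustrative assumptions, not taken from the slides):

    #include <stdio.h>
    #include <string.h>

    #define N 16                 /* grid size, illustrative */
    #define C 10                 /* iteration count, illustrative */

    int main(void) {
        float x[N + 1], y[N + 1];          /* x = x^(m-1), y = x^(m)  */
        memset(x, 0, sizeof x);            /* start from x^(0) = 0    */
        for (int m = 1; m <= C; m++) {
            y[0] = 0.0f;                   /* fixed boundary values   */
            y[N] = 0.0f;
            for (int i = 1; i < N; i++)    /* b_i = 1 in the interior */
                y[i] = 1.0f - x[i] + 0.5f * (x[i - 1] + x[i + 1]);
            memcpy(x, y, sizeof x);        /* swap, here as a copy    */
        }
        printf("x(%d) = %f\n", N / 2, x[N / 2]);
        return 0;
    }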
a textbook example (2)
- component form:

    for m = 1 to c
      xm(1) = 0;
      for i = 2 to n-1
        xm(i) = 1.0 - x(m-1)(i) + .5 * ( x(m-1)(i-1) + x(m-1)(i+1) );
      xm(n) = 0;

- change storage to two arrays x, y:

    for m = 1 to c
      y(1) = 0;
      for i = 2 to n-1
        y(i) = 1.0 - x(i) + .5 * ( x(i-1) + x(i+1) );
      y(n) = 0;
      swap(y, x);

translation to functional style

    x(m, i) (
      if (m==0) a(i)
      else if (i==0) 0.0f
      else if (i==n) 0.0f
      else ( 1.0f - x(m-1,i) + 0.5f * ( x(m-1, i-1) + x(m-1, i+1) ) ) )

- pure functions only: no program states
- no modifiable data, no side effects
- no procedures, no execution order, no storage model

translation to functional style (2)
- (same function as above)
- strict evaluation
- no higher-order functions
- type inference
- local single assignment

grammar (TXL, BNF notation)

    define program
        [repeat fdef]
    end define
    define fdef
        [arg] ( [list arg] ) [expr]
    end define
    define arg
        [opt type] [id]
    end define
    define stmt
        [arg] = [expr] ;
    end define
    define expr
          [expr] [binop] [expr]
        | [unop] [expr]
        | ( [expr] )
        | [opt type] [num]
        | [id] ( [list expr] )
        | 'if ( [expr] ) [expr] 'else [expr]
        | { [repeat stmt] 'return [expr] ; }
    end define

goals: define the data layout, express parallelism

translation to array style
- add types, declarations
- recursive to iterative
- detect loop parallelism

    array<float> x0(0,n), x1(0,n);
    for int m = 1 to c {
      for_parallel int i = 0 to n {
        x1(i) = if (i==0) 0.0
                else if (i==n) 0.0
                else 1.0 - x0(i) + 0.5 * ( x0(i-1) + x0(i+1) );
      }
      swap(x1, x0);
    }

abstract parallelization
- loop tiling
- local arrays y0, y1
- data transfer at y0 = x0, x1 = y1

    array<float> x0(0,n), x1(0,n);
    for int m = 1 to c {
      for_parallel int i0 = 0 to n step n0 {
        array<float> y0(i0-1,i0+n0), y1(i0,i0+n0-1);
        y0 = x0;
        for_parallel int i = i0 to i0+n0-1 {
          y1(i) = if (i==0) 0.0
                  else if (i==n) 0.0
                  else 1.0 - y0(i) + 0.5 * ( y0(i-1) + y0(i+1) );
        }
        x1 = y1;
      }
      swap(x1, x0);
    }

abstract vector processor (same code as above)
- vectors y0, y1
- vector load/store
- split loop into if/else parts
- split loop into aligned vector part + remainder

abstract distributed memory (same code as above)
- eliminate x0, x1
- message passing at y0 = x0
- eliminate the i0 loop

abstract GPU (tile size 1)
- allocate x0, x1 on host & GPU
- the i loop becomes a kernel
- eliminate the i0 loop
- memory I/O at y0 = x0, x1 = y1

    array<float> x0(0,n), x1(0,n);
    for int m = 1 to c {
      for_parallel int i0 = 0 to n {
        array<float> y0(i0-1,i0+1), y1(i0,i0);
        y0 = x0;
        for_parallel int i = i0 to i0 {
          y1(i) = if (i==0) 0.0
                  else if (i==n) 0.0
                  else 1.0 - y0(i) + 0.5 * ( y0(i-1) + y0(i+1) );
        }
        x1 = y1;
      }
      swap(x1, x0);
    }

demo
[Performance plot: Gflop/s (scale 0-10) for n = 65536 on an Intel i7-2600 and a GTX 590; scalar, SSE, AVX, and OpenCL variants, run on 1 process (sequential) and on 2 and 4 MPI processes.]

further optimization
- improve data layout
- improve instruction order
- remove unaligned vector loads, re-order vector components
- fuse several iterations, re-compute tile edges
- tile loops for cache reuse (memory hierarchy)
- use a single vector x(m) plus an auxiliary row
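To make the vector-processor stage tangible, the following C sketch (SSE intrinsics) shows what the generated sweep could look like: a 4-wide vector loop plus a scalar remainder loop. It deliberately uses unaligned loads for the shifted neighbor accesses; the "remove unaligned vector loads" item above is exactly about eliminating these. Sizes and iteration counts are illustrative assumptions:

    #include <stdio.h>
    #include <xmmintrin.h>       /* SSE intrinsics */

    #define N 1024               /* grid size, illustrative */

    /* one sweep over the interior, four floats at a time; the shifted
       neighbor accesses use unaligned loads, which a later optimization
       step would remove */
    static void sweep_sse(const float *x0, float *x1, int n) {
        const __m128 one  = _mm_set1_ps(1.0f);
        const __m128 half = _mm_set1_ps(0.5f);
        int i = 1;
        for (; i + 4 <= n; i += 4) {
            __m128 c = _mm_loadu_ps(&x0[i]);      /* x0(i)   */
            __m128 l = _mm_loadu_ps(&x0[i - 1]);  /* x0(i-1) */
            __m128 r = _mm_loadu_ps(&x0[i + 1]);  /* x0(i+1) */
            __m128 v = _mm_add_ps(_mm_sub_ps(one, c),
                                  _mm_mul_ps(half, _mm_add_ps(l, r)));
            _mm_storeu_ps(&x1[i], v);
        }
        for (; i < n; i++)        /* scalar remainder loop */
            x1[i] = 1.0f - x0[i] + 0.5f * (x0[i - 1] + x0[i + 1]);
        x1[0] = 0.0f;             /* fixed boundary values */
        x1[n] = 0.0f;
    }

    int main(void) {
        static float a[N + 1], b[N + 1];   /* zero-initialized */
        for (int m = 0; m < 5; m++) {      /* a few sweeps, roles swapped */
            sweep_sse(a, b, N);
            sweep_sse(b, a, N);
        }
        printf("x(%d) = %f\n", N / 2, a[N / 2]);
        return 0;
    }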
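Similarly, a minimal C/MPI sketch of the distributed-memory stage, where the data transfer y0 = x0 becomes a ghost-cell (halo) exchange between neighboring ranks; the block size, message tags, and the copy in place of swap() are illustrative assumptions:

    #include <mpi.h>
    #include <string.h>

    #define N0 1024              /* local block size per rank, illustrative */
    #define C  10                /* iterations, illustrative */

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* local tile with one ghost cell on each side; the outer ghost
           cells of the first and last rank hold the fixed boundary value 0 */
        float y0[N0 + 2] = {0.0f}, y1[N0 + 2] = {0.0f};
        MPI_Status st;

        for (int m = 1; m <= C; m++) {
            /* the data transfer y0 = x0 becomes a halo exchange */
            if (rank > 0)
                MPI_Sendrecv(&y0[1], 1, MPI_FLOAT, rank - 1, 0,
                             &y0[0], 1, MPI_FLOAT, rank - 1, 0,
                             MPI_COMM_WORLD, &st);
            if (rank < size - 1)
                MPI_Sendrecv(&y0[N0], 1, MPI_FLOAT, rank + 1, 0,
                             &y0[N0 + 1], 1, MPI_FLOAT, rank + 1, 0,
                             MPI_COMM_WORLD, &st);
            /* local sweep over the tile interior */
            for (int i = 1; i <= N0; i++)
                y1[i] = 1.0f - y0[i] + 0.5f * (y0[i - 1] + y0[i + 1]);
            memcpy(&y0[1], &y1[1], N0 * sizeof(float)); /* swap, as a copy */
        }
        MPI_Finalize();
        return 0;
    }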
outlook
- separate the numerical algorithm, the memory layout, and the parallelization
- elements of functional languages may help
- common code analysis and optimization for distributed memory, kernel off-loading, thread parallelism, and vectorization