A functional approach to programming of heterogeneous systems
G. Zumbusch
Institut für Angewandte Mathematik
Friedrich-Schiller-Universität Jena
Workshop PHSP11, Jena, 5-7 October 2011
accelerators
Nvidia Fermi
STI Cell processor
Photos: Nvidia, Delemon, computerbase.de
AMD Northern Islands
supercomputers with accelerators
Photo: Nvidia
Tianhe-1A (Tianjin): 7168 * Nvidia Tesla 2050
Nebulae (Shenzhen): 4640 * Nvidia Tesla 2050
Tsubame 2 (Tokyo): 4224 * Nvidia Tesla 2050
Roadrunner (Los Alamos): 12960 * IBM PowerXCell 8i
Lomonosov (Moscow): 1554 * Nvidia Tesla 2070
Loewe-CSC (Frankfurt): 778 * AMD Radeon 5870
some history
vector computer
CDC STAR-100 (1974), 100 MFlop/s
CDC Cyber 205
Cray-1, ...
Photo: DKRZ
parallel vector computer
Cray X-MP (1982)
4 processors at 200 MFlop/s each
Fujitsu, Hitachi, NEC
Photo: Rama
massively parallel computer
Intel iPSC (1985)
128 processors at ~100 kFlop/s each
Intel Paragon
Cray T3D,
Cluster, ...
Photo: carrierdetect
non-standard parallel architectures
Tera (1999): MTA, hyper-threading
Kendall Square (1986): all-cache, threading
Thinking Machines CM-1 (1985): SIMD
Photo: D. Armstrong
accelerators
Intel 8087 (1980)
Weitek 1067 (1981)
FPGA: Xilinx (1985), Altera, ...
(parallel) programming?
vectorization
!DIR$ IVDEP INFINITEVL
      DO I = 1,N
        A(B(I)) = A(B(I))+1
      ENDDO

      DO I = 1,N
        IF ( A(I) > 0.0 ) THEN
!DIR$ PROBABILITY_ALMOST_NEVER
          B(I) = B(I)/A(I)
        ENDIF
      ENDDO

SUBROUTINE _AXPY ( N, ALPHA, X, INCX, Y, INCY )
Fortran77 (1980-)
optimized libraries: BLAS-1 (1979-)
array languages
A = [ 1 2 3; 3 4 5; 6 7 8];
b = [0:2:8];
c = (A+A')*b(1:3)';
APL (1962)
MATLAB (MathWorks 1984-)
Fortran90 (1992)
HPF (1993)
Co-array Fortran (1998)
...
message passing
Intel NX
P4, Parmacs, ...
PVM (1989)
MPI-1 (1994)
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)
  MPI_Send(message, strlen(message)+1,
           MPI_CHAR, 1, tag, MPI_COMM_WORLD);
else
  MPI_Recv(message, 20, MPI_CHAR, 0, tag,
           MPI_COMM_WORLD, &status);
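For context, a minimal complete C program around this fragment might look as follows; the message text, the 20-byte buffer, and the two-rank setup are assumptions, not part of the original slide.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    int myrank, tag = 99;
    char message[20];
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        strcpy(message, "Hello, there");   /* assumed payload */
        MPI_Send(message, strlen(message)+1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        MPI_Recv(message, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
        printf("received: %s\n", message);
    }
    MPI_Finalize();
    return 0;
}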
single-sided
BSP (1990)
Cray shmem
MPI-2 (1998)
partitioned global address space languages
distributed shared memory
memory hierarchy
weak memory consistency model
local/distant memory access
UPC
Co-array Fortran
X10
Chapel
Fortress
languages
kernel void k (float a<>,
               out float b<>, float p) {
  b = a + p;
}
float a<1000>;
float b<1000>;
streamRead(a, data);
k(a, b, 3.2f);
streamWrite(b, result);
stream processing: SISAL, Brook
single assignment: SSA, SAC
functional programming: Scheme, Haskell, ML, Scala, ...
a functional approach
separate operations and data
a model problem
a textbook example
Jacobi iteration, matrix notation:
x^{(m)} = D^{-1} ( b - (A - D) x^{(m-1)} ),  m = 1, 2, ...
components:
x_i^{(m)} = \frac{1}{a_{ii}} ( b_i - \sum_{j \neq i} a_{ij} x_j^{(m-1)} ),  i = 1, ..., n;  m = 1, 2, ...
insert finite difference matrix:
x_i^{(m)} = b_i + \frac{1}{2} x_{i-1}^{(m-1)} - x_i^{(m-1)} + \frac{1}{2} x_{i+1}^{(m-1)},  i = 2, ..., n-1;  m = 1, 2, ...
a textbook example (2)
components:
for m = 1 to c
  x(m)(1) = 0;
  for i = 2 to n-1
    x(m)(i) = 1.0 - x(m-1)(i) + .5 * ( x(m-1)(i-1) + x(m-1)(i+1) );
  x(m)(n) = 0;
change storage: x, y
for m = 1 to c
  y(1) = 0;
  for i = 2 to n-1
    y(i) = 1.0 - x(i) + .5 * ( x(i-1) + x(i+1) );
  y(n) = 0;
  swap(y, x);
translation to functional style
x(m, i) =
  (if (m==0) a(i)
   else if (i==0) 0.0f
   else if (i==n) 0.0f
   else ( 1.0f - x(m-1, i) + 0.5f *
          ( x(m-1, i-1) + x(m-1, i+1) ) ) )
pure functions only:
no program state,
no modifiable data,
no side effects,
no procedures,
no execution order,
no storage model
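As a minimal sketch, the pure-function formulation can be written down directly in C; the start vector a() and the global n are assumptions.

/* hedged C sketch of the pure-function form; a() and n are assumed given */
extern int n;
float a(int i);                          /* start vector x(0, i) */

float x(int m, int i) {
    if (m == 0) return a(i);             /* initial iterate */
    if (i == 0 || i == n) return 0.0f;   /* boundary values */
    return 1.0f - x(m-1, i)
         + 0.5f * (x(m-1, i-1) + x(m-1, i+1));
}

Evaluated naively, this recursion recomputes values exponentially often; recovering the iterative, array-based evaluation order is exactly what the translation steps below perform.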
translation to functional style (2)
x(m, i) =
  (if (m==0) a(i)
   else if (i==0) 0.0f
   else if (i==n) 0.0f
   else ( 1.0f - x(m-1, i) + 0.5f *
          ( x(m-1, i-1) + x(m-1, i+1) ) ) )
strict evaluation
no higher order functions
type inference
local single assignment
grammar
define program
    [repeat fdef]
end define

define fdef
    [arg] ( [list arg] ) [expr]
end define

define arg
    [opt type] [id]
end define

define stmt
    [arg] = [expr];
end define

define expr
    [expr] [binop] [expr]
    | [unop] [expr]
    | ( [expr] )
    | [opt type] [num]
    | [id] ( [list expr] )
    | 'if ( [expr] ) [expr] 'else [expr]
    | { [repeat stmt] 'return [expr]; }
end define
TXL, BNF notation
define data layout
express parallelism
translation to array style
add types, declarations
recursive to iterative
detect loop parallelism
array<float> x0(0,n), x1(0,n);
for int m = 1 to c {
  for_parallel int i = 0 to n {
    x1(i) = if (i==0) 0.0
            else if (i==n) 0.0
            else 1.0 - x0(i) + 0.5 * ( x0(i-1) + x0(i+1) );
  }
  swap(x1, x0);
}
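A plain-C reading of this array-style code could look like the following sketch; the buffer length n+1 and the pointer swap are assumptions about the storage model.

/* hedged C sketch: x0, x1 are float arrays of length n+1 */
void jacobi(float *x0, float *x1, int n, int c) {
    for (int m = 1; m <= c; m++) {
        for (int i = 0; i <= n; i++)       /* for_parallel: iterations independent */
            x1[i] = (i == 0 || i == n) ? 0.0f
                  : 1.0f - x0[i] + 0.5f * (x0[i-1] + x0[i+1]);
        float *t = x0; x0 = x1; x1 = t;    /* swap(x1, x0) */
    }
}

Note that after the final swap the most recent iterate sits in the buffer locally named x0.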
abstract parallelization
array<float> x0(0,n), x1(0,n);
for int m = 1 to c {
  for_parallel int i0 = 0 to n step n0 {
    array<float> y0(i0-1,i0+n0), y1(i0,i0+n0-1);
    y0 = x0;
    for_parallel int i = i0 to i0+n0-1 {
      y1(i) = if (i==0) 0.0
              else if (i==n) 0.0
              else 1.0 - y0(i) + 0.5 * ( y0(i-1) + y0(i+1) );
    }
    x1 = y1;
  }
  swap(x1, x0);
}
loop tiling
local arrays y0, y1
data transfer at y0 = x0, x1 = y1
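One possible concrete form of the tiling, sketched in C under the assumption of C99 variable-length arrays and a tile size n0 that need not divide n+1:

/* hedged C sketch: one sweep with tiling and tile-local copies */
void sweep_tiled(const float *x0, float *x1, int n, int n0) {
    for (int i0 = 0; i0 <= n; i0 += n0) {          /* tile loop (parallelizable) */
        float y0[n0+2], y1[n0];                    /* tile-local arrays */
        for (int k = -1; k <= n0; k++)             /* data transfer y0 = x0 */
            y0[k+1] = (i0+k >= 0 && i0+k <= n) ? x0[i0+k] : 0.0f;
        for (int i = 0; i < n0 && i0+i <= n; i++)
            y1[i] = (i0+i == 0 || i0+i == n) ? 0.0f
                  : 1.0f - y0[i+1] + 0.5f * (y0[i] + y0[i+2]);
        for (int i = 0; i < n0 && i0+i <= n; i++)  /* data transfer x1 = y1 */
            x1[i0+i] = y1[i];
    }
}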
abstract vector processor
vectors y0, y1
vector load/store
split loop into if/else
split loop into aligned vector + remainder
array<float> x0(0,n), x1(0,n);
for int m = 1 to c {
  for_parallel int i0 = 0 to n step n0 {
    array<float> y0(i0-1,i0+n0), y1(i0,i0+n0-1);
    y0 = x0;
    for_parallel int i = i0 to i0+n0-1 {
      y1(i) = if (i==0) 0.0
              else if (i==n) 0.0
              else 1.0 - y0(i) + 0.5 * ( y0(i-1) + y0(i+1) );
    }
    x1 = y1;
  }
  swap(x1, x0);
}
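For the aligned interior part of a tile, a hedged SSE sketch in C; y0 and y1 are assumed to be float pointers indexed with the global i as in the pseudocode, n0 a multiple of 4, and the tile strictly interior. The unaligned neighbor loads are what the later "remove unaligned vector load" optimization targets.

#include <xmmintrin.h>   /* SSE intrinsics */

void tile_update(const float *y0, float *y1, int i0, int n0) {
    const __m128 one  = _mm_set1_ps(1.0f);
    const __m128 half = _mm_set1_ps(0.5f);
    for (int i = i0; i < i0 + n0; i += 4) {
        __m128 xm = _mm_loadu_ps(&y0[i-1]);   /* left neighbors, unaligned */
        __m128 xc = _mm_loadu_ps(&y0[i]);
        __m128 xp = _mm_loadu_ps(&y0[i+1]);   /* right neighbors, unaligned */
        __m128 r  = _mm_add_ps(_mm_sub_ps(one, xc),
                               _mm_mul_ps(half, _mm_add_ps(xm, xp)));
        _mm_storeu_ps(&y1[i], r);   /* 1 - y0(i) + 0.5*(y0(i-1)+y0(i+1)) */
    }
}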
abstract distributed memory
array<float> x0(0,n), x1(0,n);
for int m = 1 to c {
  for_parallel int i0 = 0 to n step n0 {
    array<float> y0(i0-1,i0+n0), y1(i0,i0+n0-1);
    y0 = x0;
    for_parallel int i = i0 to i0+n0-1 {
      y1(i) = if (i==0) 0.0
              else if (i==n) 0.0
              else 1.0 - y0(i) + 0.5 * ( y0(i-1) + y0(i+1) );
    }
    x1 = y1;
  }
  swap(x1, x0);
}
eliminate x0, x1
message passing at y0 = x0
eliminate i0 loop
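The data transfer at y0 = x0 then becomes a nearest-neighbor halo exchange. A hedged MPI sketch: rank, nprocs, the layout of the local array y (n0 interior points in y[1..n0], ghost cells in y[0] and y[n0+1]), and the tags are all assumptions.

#include <mpi.h>

void halo_exchange(float *y, int n0, int rank, int nprocs) {
    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;
    MPI_Sendrecv(&y[1],    1, MPI_FLOAT, left,  0,   /* first point -> left  */
                 &y[n0+1], 1, MPI_FLOAT, right, 0,   /* right ghost <- right */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&y[n0],   1, MPI_FLOAT, right, 1,   /* last point -> right  */
                 &y[0],    1, MPI_FLOAT, left,  1,   /* left ghost  <- left  */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}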
abstract GPU
allocate x0, x1 on host & GPU
i loop: kernel
eliminate i0 loop
memory I/O at y0 = x0, x1 = y1
array<float> x0(0,n), x1(0,n);
for int m = 1 to c {
  for_parallel int i0 = 0 to n {
    array<float> y0(i0-1,i0+1), y1(i0,i0);
    y0 = x0;
    for_parallel int i = i0 to i0 {
      y1(i) = if (i==0) 0.0
              else if (i==n) 0.0
              else 1.0 - y0(i) + 0.5 * ( y0(i-1) + y0(i+1) );
    }
    x1 = y1;
  }
  swap(x1, x0);
}
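Since the demo below uses OpenCL, the i loop as a kernel might look like this hedged OpenCL C sketch; one work-item per grid point, with kernel name and signature being assumptions.

__kernel void jacobi_step(__global const float *x0,
                          __global float *x1, const int n) {
    int i = get_global_id(0);              /* one point per work-item */
    if (i > n) return;                     /* guard padded global size */
    x1[i] = (i == 0 || i == n) ? 0.0f
          : 1.0f - x0[i] + 0.5f * (x0[i-1] + x0[i+1]);
}

On the host side, the memory I/O at y0 = x0 and x1 = y1 would correspond to clEnqueueWriteBuffer/clEnqueueReadBuffer calls.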
demo
[bar chart: Gflop/s for scalar, SSE, AVX, and OpenCL variants, run as 1 proc seq, 2 proc MPI, 4 proc MPI; n=65536, i7-2600, GTX590]
further optimization
improve data layout
improve instruction order
remove unaligned vector load
re-order vector components
fuse several iterations
re-compute tile edges
tile loops for cache reuse
memory hierarchy
use single vector x(m) + auxiliary row
outlook
separate numerical algorithm, memory layout, and parallelization
elements of functional languages may help
common code analysis and optimization for distributed memory, kernel off-loading, thread parallelism, and vectorization