Communication Overhead Estimation on Multicores
S. M. Farhad
The University of Sydney
Joint work with
Yousun Ko
Bernd Burgstaller
Bernhard Scholz
Outline
Motivation
Multicore trend
Stream programming
Profiling communication overhead
Related works
Motivation
[Figure: number of cores per chip (1 to 512) versus year (1975 to 2010), from unicore processors (4004 through Pentium 4, Athlon, Itanium) to homogeneous and heterogeneous multicores (e.g. Core2Duo, Niagara, Cell, RAW, NVIDIA G80, PicoChip, AMBRIC), set against programming models ranging from C/C++/Java to stream programming (e.g. CUDA, CTM, Ct, Rapidmind, Peakstream, X10, Fortress). Courtesy: Scott '08]
Stream Programming Paradigm
Programs are expressed as stream graphs
Streams: infinite sequences of data elements
Actors: functions applied to streams
[Figure: an actor with an incoming stream and an outgoing stream]
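As a rough illustration of the paradigm (a Python sketch, not StreamIt code; the actor names `source`, `scale`, and `moving_sum` are hypothetical), streams can be modelled as lazy iterators and actors as functions that consume one stream and produce another:

```python
from itertools import count, islice

def source():
    """Produce an unbounded stream 0, 1, 2, ..."""
    return count(0)

def scale(stream, factor):
    """Actor: multiply every element of the stream by a constant."""
    for x in stream:
        yield x * factor

def moving_sum(stream, window):
    """Actor: emit the sum of each sliding window of `window` elements."""
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) == window:
            yield sum(buf)
            buf.pop(0)

# Stream graph: source -> scale -> moving_sum
graph = moving_sum(scale(source(), 2), 3)

# The stream is infinite, so only a finite prefix is ever materialised.
first_five = list(islice(graph, 5))
print(first_five)
```

Each actor only sees its input and output streams, which is what makes the producer/consumer dependencies explicit.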
Properties of Stream Program
Regular and repeating computation
Independent actors with explicit communication
Producer / consumer dependencies
[Figure: example stream graph with actors AtoD, FMDemod, Splitter, LPF1 to LPF3, HPF1 to HPF3, Joiner, Adder, Speaker]
StreamIt Language
An implementation of stream programming
Hierarchical structure
Each construct has a single input/output stream
[Figure: the StreamIt constructs: filter; pipeline (each stage may be any StreamIt language construct); splitjoin (splitter, parallel computation, joiner); feedback loop (joiner, splitter)]
How to Estimate the Communication Overhead?
Problems in Measuring the Communication Overhead
Reasons:
  Multicores are non-communication-exposed architectures
  Complex cache hierarchy
  Cache coherence protocols
Consequence:
  Cannot directly measure the communication cost
  Estimate the communication cost by measuring the execution time of actors
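Measuring the Communication Overhead of an Edge
\[ C(i,k) = (t'_i - t_i) + (t'_k - t_k) \]
Here t_i and t_k are the execution times of actors i and k when both run on Processor 1 (no communication cost), and t'_i and t'_k are their execution times when i runs on Processor 1 and k on Processor 2 (with communication cost).
[Figure: edge (i, k) with both actors on Processor 1, and with the actors split across Processor 1 and Processor 2]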
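A minimal sketch of this estimate in Python, assuming the reading that the overhead is the extra execution time the two actors show once they are split across processors. The function name and the timings are made up for illustration:

```python
def edge_overhead(t_i, t_k, t_i_split, t_k_split):
    """Estimate the communication overhead of edge (i, k).

    t_i, t_k             -- actor times with both actors on one processor
    t_i_split, t_k_split -- actor times with the actors on different processors
    """
    return (t_i_split - t_i) + (t_k_split - t_k)

# Hypothetical timings in arbitrary units.
overhead = edge_overhead(t_i=10.0, t_k=20.0, t_i_split=12.5, t_k_split=23.0)
print(overhead)
```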
How to Minimize the Required Number of Experiments
Graph coloring
Pipeline: requires 2+1 experiments
[Figure: a pipeline of actors A to F with edges numbered 1 to 5, placed on Processor 1 and Processor 2 in two ways: once with the odd edges across the partition, once with the even edges across the partition]
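The odd/even split for a pipeline can be sketched as follows. This is a Python illustration rather than the authors' tooling: the function names are hypothetical, and reading "2+1" as two cross-partition runs plus one single-processor baseline run is an assumption.

```python
def pipeline_placements(actors):
    """Return two placements (actor -> processor 1 or 2) for a pipeline.

    Edge n connects actors[n-1] and actors[n]. In the "odd" placement the
    odd-numbered edges cross the partition; in the "even" placement the
    even-numbered edges do.
    """
    placements = {}
    for name, parity in (("odd", 1), ("even", 0)):
        proc = 1
        placement = {actors[0]: proc}
        for edge_no in range(1, len(actors)):
            if edge_no % 2 == parity:
                proc = 3 - proc  # switch processor, so this edge crosses
            placement[actors[edge_no]] = proc
        placements[name] = placement
    return placements

def crossing_edges(actors, placement):
    """List the numbers of the edges whose endpoints sit on different processors."""
    return [n for n in range(1, len(actors))
            if placement[actors[n - 1]] != placement[actors[n]]]

actors = ["A", "B", "C", "D", "E", "F"]
placements = pipeline_placements(actors)
odd_crossings = crossing_edges(actors, placements["odd"])
even_crossings = crossing_edges(actors, placements["even"])
print(odd_crossings, even_crossings)
```

Under that reading, every edge of the pipeline crosses the partition in exactly one of the two runs, regardless of the pipeline's length.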
Obs. 1: There is no loop of three actors in a stream graph
[Figure: actors l, i, k placed on Processor 1 and Processor 2]
Obs. 2: There is no interference of adjacent nodes between edges
[Figure: actors A to F distributed over processors P-1 to P-4, shown for the blue colored edges]
Remove Interference
Convert to a line graph
Add interference edges
Use a vertex coloring algorithm
[Figure: stream graph with actors A to F, and its line graph with nodes AB, BC, BD, CE, DE, EF]
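The three steps above can be sketched in Python. The slides do not say which vertex coloring algorithm is used or how interference is decided, so this sketch uses a plain greedy coloring and takes the interference pairs as an explicit input; the single pair passed below is hypothetical and only shows the mechanism.

```python
from itertools import combinations

def line_graph(edges):
    """Nodes are stream-graph edges; two nodes are adjacent if the edges share an actor."""
    adj = {e: set() for e in edges}
    for e, f in combinations(edges, 2):
        if set(e) & set(f):
            adj[e].add(f)
            adj[f].add(e)
    return adj

def add_interference(adj, pairs):
    """Add extra adjacency between edges that must not share a profiling step."""
    for e, f in pairs:
        adj[e].add(f)
        adj[f].add(e)

def greedy_coloring(adj):
    """Give each node the smallest color not used by an already colored neighbor."""
    colors = {}
    for node in adj:
        used = {colors[n] for n in adj[node] if n in colors}
        c = 0
        while c in used:
            c += 1
        colors[node] = c
    return colors

# Stream graph from the slide: edges AB, BC, BD, CE, DE, EF.
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F")]
adj = line_graph(edges)
add_interference(adj, [(("A", "B"), ("C", "E"))])  # hypothetical interference pair
colors = greedy_coloring(adj)
print(colors)
```

Each color class is a set of edges that can be measured in the same profiling step. Greedy coloring does not guarantee the minimum number of colors, so a different algorithm may yield fewer steps.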
Processor Leveling Graph
[Figure: for the blue colored edges, the stream graph with actors A to F and the corresponding processor leveling graph with nodes A; B, C, D, E; F]
Coloring the Processor Leveling Graph
[Figure: the processor leveling graph with nodes A; B, C, D, E; F, colored so that its nodes are assigned to Processor 1 and Processor 2]
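One way to read these two slides, offered as an assumption rather than the authors' stated procedure, is that actors joined by edges not measured in the current step are merged into one node, and the resulting graph is then two-colored to choose processors. A Python sketch with hypothetical function names:

```python
from collections import deque

def leveling_graph(actors, edges, measured):
    """Merge actors joined by non-measured edges; link groups by measured edges."""
    parent = {a: a for a in actors}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for u, v in edges:
        if (u, v) not in measured:
            parent[find(u)] = find(v)
    groups = {}
    for a in actors:
        groups.setdefault(find(a), []).append(a)
    links = {root: set() for root in groups}
    for u, v in measured:
        links[find(u)].add(find(v))
        links[find(v)].add(find(u))
    return groups, links

def two_color(links):
    """Breadth-first 2-coloring; returns None if two processors do not suffice."""
    proc = {}
    for start in links:
        if start in proc:
            continue
        proc[start] = 1
        queue = deque([start])
        while queue:
            g = queue.popleft()
            for h in links[g]:
                if h not in proc:
                    proc[h] = 3 - proc[g]
                    queue.append(h)
                elif proc[h] == proc[g]:
                    return None
    return proc

actors = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F")]
blue = {("A", "B"), ("E", "F")}  # edges measured in this step

groups, links = leveling_graph(actors, edges, blue)
proc = two_color(links)
placement = {a: proc[root] for root, members in groups.items() for a in members}
print(placement)
```

For the blue edges this yields the three nodes shown on the slide (A; B, C, D, E; F), with the middle group on the opposite processor from A and F.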
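Measuring the Communication Cost
[Figure: for the blue colored edges, the stream graph and its processor leveling graph (A; B, C, D, E; F) placed on Processor 1 and Processor 2, with execution times annotated on actors A, B, E, and F]
\[ C(A,B) = (t'_A - t_A) + (t'_B - t_B) \]
\[ C(E,F) = (t'_E - t_E) + (t'_F - t_F) \]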
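Since one profiling step measures several edges at once, the per-edge costs can be derived in a single pass over the measured edges. A small Python sketch, where `step_costs` is a hypothetical helper and all timings are made up:

```python
def step_costs(baseline, profiled, measured_edges):
    """Compute the cost of every edge measured in one profiling step.

    baseline[a] -- execution time of actor a with all actors on one processor
    profiled[a] -- execution time of actor a under this step's placement
    """
    return {
        (u, v): (profiled[u] - baseline[u]) + (profiled[v] - baseline[v])
        for u, v in measured_edges
    }

# Hypothetical timings in arbitrary units for the blue edges (A, B) and (E, F).
baseline = {"A": 4.0, "B": 6.0, "E": 5.0, "F": 3.0}
profiled = {"A": 4.5, "B": 7.0, "E": 5.25, "F": 3.5}

costs = step_costs(baseline, profiled, [("A", "B"), ("E", "F")])
print(costs)
```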
Profiling Performance
Benchmark     Total Edges   Prof. Steps   Steps/Edge (%)   Err (%)
SAR           44            3             7                10
MatrixMult    88            21            24               17
MergeSort     37            4             11               31
FMRadio       21            3             14               24
DCT           28            9             32               14
RadixSort     12            2             17               5
FFT           26            3             12               27
MPEG          56            17            30               15
Channel       22            6             27               11
BeamFormer    39            5             13               13
GM                                        17               15
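The summary row can be cross-checked with a short Python script. It assumes that "GM" denotes the geometric mean and that Steps/Edge is the number of profiling steps as a percentage of the total edges; both readings are consistent with the rounded figures in the table.

```python
import math

# (benchmark, total edges, profiling steps, error %) as listed in the table.
rows = [
    ("SAR", 44, 3, 10), ("MatrixMult", 88, 21, 17), ("MergeSort", 37, 4, 31),
    ("FMRadio", 21, 3, 24), ("DCT", 28, 9, 14), ("RadixSort", 12, 2, 5),
    ("FFT", 26, 3, 27), ("MPEG", 56, 17, 15), ("Channel", 22, 6, 11),
    ("BeamFormer", 39, 5, 13),
]

def geometric_mean(values):
    """Geometric mean computed via the mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

steps_per_edge = [100.0 * steps / total for _, total, steps, _ in rows]
gm_steps = geometric_mean(steps_per_edge)
gm_err = geometric_mean([err for _, _, _, err in rows])
print(round(gm_steps), round(gm_err))
```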
Related Works
[1] Static Scheduling of SDF Programs for DSP [Lee '87]
[2] StreamIt: A language for streaming applications [Thies '02]
[3] Phased Scheduling of Stream Programs [Thies '03]
[4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies '06]
[5] Orchestrating the Execution of Stream Programs on Cell [Scott '08]
[6] Software Pipelined Execution of Stream Programs on GPUs [Udupa '09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa '09]
[8] Orchestration by approximation [Farhad '11]
Questions?