
Communication Overhead Estimation on Multicores

S. M. Farhad
The University of Sydney
Joint work with Yousun Ko, Bernd Burgstaller, and Bernhard Scholz

Outline
- Motivation
- Multicore trend
- Stream programming
- Profiling communication overhead
- Related works

Motivation
[Figure: number of cores per chip (1 to 512) versus year (1975 to 2010) for unicore, homogeneous multicore, and heterogeneous multicore chips, from the 4004 through Cell, Niagara 2, and NVIDIA G80 up to PicoChip and AMBRIC, annotated with programming models such as C/C++/Java, CUDA, X10, Fortress, and stream programming. Courtesy: Scott '08]

Stream Programming Paradigm
- Programs expressed as stream graphs
- Streams: infinite sequences of data elements
- Actors: functions applied to streams
[Figure: a stream feeding an actor, which produces a stream]

Properties of Stream Program
- Regular and repeating computation
- Independent actors with explicit communication
- Producer/consumer dependencies
[Figure: example stream graph with actors AtoD, FMDemod, Splitter, LPF1 to LPF3, HPF1 to HPF3, Joiner, Adder, Speaker]

StreamIt Language
- An implementation of stream programming
- Hierarchical structure
- Each construct has a single input/output stream
[Figure: StreamIt constructs: filter; pipeline (each stage may be any StreamIt language construct); splitjoin (splitter, parallel computation, joiner); feedback loop (joiner, splitter)]

How to Estimate the Communication Overhead?
Problems in Measuring Communication Overhead
- Reasons:
  - Multicores are non-communication-exposed architectures
  - Complex cache hierarchy
  - Cache coherence protocols
- Consequence:
  - Cannot directly measure the communication cost
  - Estimate the communication cost by measuring the execution time of actors

Measuring the Communication Overhead of an Edge
$C(i,k) = (t'_i - t_i) + (t'_k - t_k)$
- No communication cost: actors i and k run on the same processor, with execution times $t_i$ and $t_k$
- With communication cost: actors i and k run on different processors, with execution times $t'_i$ and $t'_k$

How to Minimize the Required Number of Experiments
- Graph coloring
- Pipeline example (actors A to F, edges 1 to 5): requires 2+1 experiments
  - Odd edges across partition
  - Even edges across partition
[Figure: the pipeline partitioned over Processor 1 and Processor 2, once with the odd edges and once with the even edges crossing the partition]

Obs. 1: There is no loop of three actors in a stream graph
[Figure: actors l, i, k placed on Processor 1 and Processor 2]

Obs. 2: There is no interference of adjacent nodes between edges
[Figure: actors A to F placed on processors P-1 to P-4, for blue colored edges]

Remove Interference
- Convert to a line graph
- Add interference edges
- Use vertex coloring algorithm
[Figure: stream graph with actors A to F, and its line graph with vertices AB, BC, BD, CE, DE, EF]

Processor Leveling Graph
[Figure: stream graph with actors A to F, and its processor leveling graph with nodes A; B, C, D, E; F, for blue colored edges]

Coloring the Processor Leveling Graph
[Figure: nodes A; B, C, D, E; F of the processor leveling graph assigned to Processor 1 and Processor 2]

Measuring the Communication Cost
$C(A,B) = (t'_A - t_A) + (t'_B - t_B)$
$C(E,F) = (t'_E - t_E) + (t'_F - t_F)$
[Figure: actors A to F on Processor 1 and Processor 2 with measured execution times, for blue colored edges]

Profiling Performance

Benchmark     Total Edges   Profiling Steps   Steps/Edge (%)   Err (%)
SAR           44            3                 7                10
MatrixMult    88            21                24               17
MergeSort     37            4                 11               31
FMRadio       21            3                 14               24
DCT           28            9                 32               14
RadixSort     12            2                 17               5
FFT           26            3                 12               27
MPEG          56            17                30               15
Channel       22            6                 27               11
BeamFormer    39            5                 13               13
GM                                            17               15

Related Works
[1] Static Scheduling of SDF Programs for DSP [Lee '87]
[2] StreamIt: A language for streaming applications [Thies '02]
[3] Phased Scheduling of Stream Programs [Thies '03]
[4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies '06]
[5] Orchestrating the Execution of Stream Programs on Cell [Scott '08]
[6] Software Pipelined Execution of Stream Programs on GPUs [Udupa '09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa '09]
[8] Orchestration by approximation [Farhad '11]

Questions?
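
Sketch: grouping edges into profiling experiments

The "Remove Interference" steps (line graph, interference edges, vertex coloring) and the per-edge cost from "Measuring the Communication Overhead of an Edge" can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation. The slides do not spell out how interference edges are derived, so `interference` is a hypothetical caller-supplied parameter; the greedy coloring order and the function names `line_graph`, `color_edges`, and `edge_cost` are likewise assumptions. The cost is read as C(i,k) = (t'_i - t_i) + (t'_k - t_k), where t is an actor's execution time with both actors on one processor and t' its time with the edge crossing processors.

```python
from itertools import combinations


def line_graph(edges):
    """Adjacency sets of the line graph: two stream edges are
    neighbors when they share an actor."""
    adj = {e: set() for e in edges}
    for e, f in combinations(edges, 2):
        if set(e) & set(f):
            adj[e].add(f)
            adj[f].add(e)
    return adj


def color_edges(edges, interference=()):
    """Greedy vertex coloring of the line graph.

    `interference` is a hypothetical list of extra (edge, edge) pairs
    that must not share a color; the slides do not define the rule.
    Edges with the same color are profiled in the same experiment.
    """
    adj = line_graph(edges)
    for e, f in interference:
        adj[e].add(f)
        adj[f].add(e)
    color = {}
    for e in edges:  # visit edges in the order given
        used = {color[f] for f in adj[e] if f in color}
        # An edge has fewer than len(edges) neighbors, so a free color exists.
        color[e] = next(c for c in range(len(edges)) if c not in used)
    return color


def edge_cost(t_same, t_split, i, k):
    """C(i,k) = (t'_i - t_i) + (t'_k - t_k)."""
    return (t_split[i] - t_same[i]) + (t_split[k] - t_same[k])


# Six-actor pipeline A..F with edges 1..5, as on the pipeline slide.
pipeline = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
colors = color_edges(pipeline)

# One experiment per color, plus one baseline run on a single processor.
experiments = len(set(colors.values())) + 1

# Made-up timings, for illustration only (not measured data).
t_same = {"A": 10.0, "B": 12.0}
t_split = {"A": 11.5, "B": 14.0}
cost_ab = edge_cost(t_same, t_split, "A", "B")
```

On the pipeline, the odd edges (A-B, C-D, E-F) get one color and the even edges (B-C, D-E) the other, which gives the 2+1 experiments stated on the slide. For graphs with splitjoins, the extra conflicts would have to be passed in through `interference`.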
