Communication Overhead Estimation on Multicores
S. M. Farhad
The University of Sydney
Joint work with Yousun Ko, Bernd Burgstaller, Bernhard Scholz

Outline
  Motivation
  Multicore trend
  Stream programming
  Profiling communication overhead
  Related works

Motivation
  [Figure: number of cores per chip (1 to 512) against year (1975 to 2010). Legend: Unicore, Homogeneous Multicore, Heterogeneous Multicore. Chips plotted include 4004, 8086, Pentium, Power4, Cell, Niagara 2, NVIDIA G80, AMBRIC, and PicoChip. Programming models named on the chart include C/C++/Java, CUDA, CTM, Ct, X10, Fortress, Peakstream, Rapidmind, Rstream, and Accelerator. Chart label: Stream Programming. Courtesy: Scott '08]

Stream Programming Paradigm
  Programs expressed as stream graphs
  Streams: Infinite sequence of data elements
  Actors: Functions applied to streams
  [Figure: stream graph fragment labelled Stream, Actor, Stream]

Properties of Stream Program
  Regular and repeating computation
  Independent actors with explicit communication
  Producer / Consumer dependencies
  [Figure: stream graph with actors AtoD, FMDemod, Splitter, LPF1 to LPF3, HPF1 to HPF3, Joiner, Adder, Speaker]

StreamIt Language
  An implementation of stream programming
  Hierarchical structure
  Each construct has single input/output stream
  Constructs:
    filter
    pipeline (each stage may be any StreamIt language construct)
    splitjoin (splitter, parallel computation, joiner)
    feedback loop (joiner, splitter)

How to Estimate the Communication Overhead?
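Before turning to that question, the stream-graph model from the earlier slides can be made concrete with a small sketch. It is written in Python rather than StreamIt, and the actor names (source, scale, pairwise_sum) are hypothetical, chosen only for illustration. Each actor is a function applied to a stream, and a pipeline is a chain of actors.

```python
from itertools import count, islice


def source():
    """Actor with no input stream: emits 0, 1, 2, ... without end."""
    return count()


def scale(stream, factor):
    """Actor: consumes one element per firing and produces one element."""
    for x in stream:
        yield x * factor


def pairwise_sum(stream):
    """Actor: consumes two elements per firing and produces their sum."""
    it = iter(stream)
    for a, b in zip(it, it):
        yield a + b


# Pipeline: source -> scale -> pairwise_sum (producer/consumer dependencies).
pipeline = pairwise_sum(scale(source(), 3))

# Streams are infinite, so only a finite prefix is taken here.
first = list(islice(pipeline, 3))
print(first)  # [3, 15, 27]
```

Every actor here has a single input and a single output stream, and the only communication is along the edges of the graph. Those edges are what the profiling approach in the following slides assigns a cost to.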
Problems in Measuring Communication Overhead
  Reasons:
    Multicores are not communication-exposed architectures
    Complex cache hierarchy
    Cache coherence protocols
  Consequence:
    Cannot directly measure the communication cost
    Estimate the communication cost by measuring the execution time of actors

Measuring the Communication Overhead of an Edge
  C(i,k) = (t'_i - t_i) + (t'_k - t_k)
  t_i, t_k: execution times with actors i and k both on Processor 1 (no communication cost)
  t'_i, t'_k: execution times with i on Processor 1 and k on Processor 2 (with communication cost)

How to Minimize the Required Number of Experiments
  Pipeline A, B, C, D, E, F with edges numbered 1 to 5
  Graph coloring:
    Odd edges across partition
    Even edges across partition
  Requires 2+1 experiments
  [Figure: the pipeline mapped onto Processor 1 and Processor 2 for each coloring]

Obs. 1: There is no loop of three actors in a stream graph
  [Figure: actors l, i, k placed on Processor 1 and Processor 2]

Obs. 2: There is no interference of adjacent nodes between edges
  [Figure: actors A to F placed on processors P-1 to P-4, for the blue-colored edges]

Remove Interference
  Convert to a line graph
  Add interference edges
  Use vertex coloring algorithm
  [Figure: stream graph with actors A to F and its line graph with nodes AB, BC, BD, CE, DE, EF]

Processor Leveling Graph
  [Figure: stream graph with actors A to F beside its processor leveling graph with nodes A; B, C, D, E; F, for the blue-colored edges]

Coloring the Processor Labelling Graph
  [Figure: nodes A; B, C, D, E; F colored with Processor 1 and Processor 2]

Measuring the Communication Cost
  C(A,B) = (t'_A - t_A) + (t'_B - t_B)
  C(E,F) = (t'_E - t_E) + (t'_F - t_F)
  [Figure: A; B, C, D, E; F mapped onto Processor 1 and Processor 2, for the blue-colored edges]

Profiling Performance
  Benchmark    Total Edges   Profiling Steps   Steps/Edges (%)   Error (%)
  SAR               44              3                  7              10
  MatrixMult        88             21                 24              17
  MergeSort         37              4                 11              31
  FMRadio           21              3                 14              24
  DCT               28              9                 32              14
  RadixSort         12              2                 17               5
  FFT               26              3                 12              27
  MPEG              56             17                 30              15
  Channel           22              6                 27              11
  BeamFormer        39              5                 13              13
  GM (geometric mean)                                 17              15

Related Works
  [1] Static Scheduling of SDF Programs for DSP [Lee '87]
  [2] StreamIt: A language for streaming applications [Thies '02]
  [3] Phased Scheduling of Stream Programs [Thies '03]
  [4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies '06]
  [5] Orchestrating the Execution of Stream Programs on Cell [Scott '08]
  [6] Software Pipelined Execution of Stream Programs on GPUs [Udupa '09]
  [7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa '09]
  [8] Orchestration by approximation [Farhad '11]

Questions?
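Backup sketch: grouping edges into profiling steps

The pipeline example and the edge-cost formula from the profiling slides can be tried out with the short Python sketch below. It rests on several assumptions that are not taken from the slides. Two edges conflict only when they share an actor, so the extra interference edges of the "Remove Interference" slide are left out. The coloring is a plain greedy one. The "+1" in "2+1 experiments" is read as one baseline run with no communication. The two differences in the cost formula are summed. All function names and the timing numbers are hypothetical.

```python
from itertools import combinations


def line_graph(edges):
    """Map each stream-graph edge to the edges that share an actor with it."""
    conflicts = {e: set() for e in edges}
    for e, f in combinations(edges, 2):
        if set(e) & set(f):  # assumption: a shared actor is the only conflict
            conflicts[e].add(f)
            conflicts[f].add(e)
    return conflicts


def greedy_coloring(conflicts):
    """Give each edge the smallest color unused by its conflicting edges."""
    colors = {}
    for e in conflicts:  # edges are visited in insertion order
        used = {colors[n] for n in conflicts[e] if n in colors}
        c = 0
        while c in used:
            c += 1
        colors[e] = c
    return colors


def edge_cost(t_base, t_comm, i, k):
    """C(i,k): extra execution time of i and k when edge (i,k) crosses processors.

    Assumption: the two differences are added.
    """
    return (t_comm[i] - t_base[i]) + (t_comm[k] - t_base[k])


# Pipeline from the slides: A, B, C, D, E, F with five edges.
pipeline = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
steps = greedy_coloring(line_graph(pipeline))

# One profiling run per color, plus one assumed baseline run.
num_experiments = len(set(steps.values())) + 1

# Made-up timings, for illustration only.
cost_ab = edge_cost({"A": 10.0, "B": 20.0}, {"A": 12.5, "B": 23.0}, "A", "B")
print(steps, num_experiments, cost_ab)
```

For the pipeline, the odd edges (A-B, C-D, E-F) end up in one step and the even edges (B-C, D-E) in the other, giving 2+1 experiments under these assumptions. On graphs with splitjoins, the shared-actor rule alone is not sufficient; that is the case the interference edges on the slides are meant to cover.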