Challenges and Opportunities in Heterogeneous Multi-core Era
R. Govindarajan
HPC Lab, SERC, IISc ([email protected])
Garuda-PM, July 2011 (C) RG@SERC, IISc

Overview
• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions

Moore's Law: Transistors
[Figure: transistor-count growth.]

Moore's Law: Performance
• Processor performance doubles every 1.5 years.
[Figure: performance growth trend.]

Overview (recap) – next: Trends in Processor Architecture (Instruction Level Parallelism, Multi-Cores, Accelerators)

Progress in Processor Architecture
• More transistors enable new architectural innovations:
  – multiple-instruction-issue processors: VLIW, superscalar, EPIC
  – more on-chip caches, multiple levels of cache hierarchy, speculative execution, ...
• The era of Instruction Level Parallelism.

Moore's Law: Processor Architecture Roadmap (Pre-2000)
[Figure: ILP-era roadmap, from superscalar processors to EPIC.]

Multicores: The Right Turn
[Figure: the performance trajectory turns from frequency scaling on a single core (1 GHz, 3 GHz, a projected 6 GHz) to core scaling at about 3 GHz (1, 2, 4, 16 cores).]

Era of Multicores (Post-2000)
• Multiple cores in a single die.
• Early efforts utilized the multiple cores for multiple programs.
• Throughput-oriented rather than speedup-oriented!

Moore's Law: Processor Architecture Roadmap (Post-2000)
[Figure: the roadmap continues past superscalar and EPIC designs.]

Progress in Processor Architecture
• More transistors enable new architectural innovations:
  – multiple-instruction-issue processors
  – more on-chip caches
  – multi-cores
  – heterogeneous cores and accelerators: Graphics Processing Units (GPUs), Cell BE, ClearSpeed, Larrabee or SCC, reconfigurable accelerators, ...
• The era of Heterogeneous Accelerators.

Moore's Law: Processor Architecture Roadmap (Post-2000)
[Figure: accelerators join superscalar and EPIC designs on the roadmap.]

Accelerators
[Figure.]

Accelerators: Hype or Reality?
Some Top500 systems (June 2011 list):

  Rank  System             Description                            # Procs.         R_max (TFLOPS)
     2  Tianhe             Xeon + NVIDIA Tesla C2050 GPUs         186,368          2,566
     4  Nebulae (Dawning)  Intel X5650 + NVIDIA Tesla C2050 GPU   55,680 + 64,960  1,271
     7  Tsubame            Xeon + NVIDIA GPU                      73,278           1,192
    10  Roadrunner         Opteron + Cell BE                      6,480 + 12,960   1,105

Accelerators – Cell BE
[Figure.]

Accelerator – Fermi S2050
[Figure.]

How is the GPU connected?
[Figure: CPU cores C0-C3 share a last-level cache (SLC) and an integrated memory controller (IMC) in front of memory; the GPU is attached as a separate device with its own memory.]
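Because the GPU is attached as a discrete device with its own memory (typically over PCI Express), every computation involves explicitly staging data to the device and back. The following minimal host-side CUDA sketch is not from the talk; the kernel name scale and the launch sizes are made up for illustration:

    #include <cuda_runtime.h>

    __global__ void scale(float *a, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* one element per thread */
        if (i < n) a[i] *= s;
    }

    void scale_on_gpu(float *host_a, int n) {
        float *dev_a;
        size_t bytes = n * sizeof(float);
        cudaMalloc(&dev_a, bytes);                                  /* allocate in GPU memory */
        cudaMemcpy(dev_a, host_a, bytes, cudaMemcpyHostToDevice);   /* CPU -> GPU over the interconnect */
        scale<<<(n + 255) / 256, 256>>>(dev_a, 2.0f, n);            /* grid of thread blocks on the GPU */
        cudaMemcpy(host_a, dev_a, bytes, cudaMemcpyDeviceToHost);   /* GPU -> CPU */
        cudaFree(dev_a);
    }

The explicit cudaMemcpy calls are exactly the CPU-GPU transfer and bandwidth-management burden that the Programming Challenges slides below return to.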
Overview (recap) – next: Programming Challenges

Handling the Multi-Core Challenge
• Shared- and distributed-memory programming languages
  – OpenMP
  – MPI
• Other parallel languages (partitioned global address space languages)
  – X10, UPC, Chapel, ...
• Emergence of programming languages for GPUs
  – CUDA
  – OpenCL

GPU Programming: Good News
• Emergence of programming languages for GPUs
  – CUDA
  – OpenCL (open standards)
• Growing collection of code
  – CUDA Zone
  – ISV packages supporting GPUs
• Impressive performance? Yes!
• But what about programmer productivity?

GPU Programming: Boon or Bane?
• Challenges in GPU programming:
  – managing parallelism across SMs and SPMD cores
  – transferring data between CPU and GPU
  – managing CPU-GPU memory bandwidth efficiently
  – using the different levels of memory efficiently (GPU memory, shared memory, constant and texture memory, ...)
  – choosing a buffer layout scheme that keeps all accesses to GPU memory coalesced
  – identifying an appropriate execution configuration for efficient execution
  – synchronization across multiple SMs

The Challenge: Data Parallelism
[Figure: data-parallel programming interfaces for the GPU: OpenCL, CUDA, AMD CAL, Cg.]

Challenge – Level 2: Synergistic Execution on CPU and GPU Cores
[Figure: CPU cores C0-C3 with shared last-level cache (SLC), memory controller (MC) and memory, executing together with the GPU.]

Challenge – Level 3: Synergistic Execution on Multiple Nodes
[Figure: multiple nodes, each with its own memory and NIC, connected through a network switch.]

Data, Thread, and Task Level Parallelism
[Figure: the interfaces involved at each level: OpenMP, OpenCL, SSE, CUDA, CAL/PTX, ..., MPI.]

Overview (recap) – next: Data and Task Level Parallelism on Multi-cores (CUDA for Clusters)

Our Approach
[Figure: programs written in OpenMP, MPI, streaming languages, array languages (MATLAB) or CUDA/OpenCL feed a compiler/runtime system that maps them for synergistic execution on multiple heterogeneous cores: multicores, SSE units, GPUs, Cell BE and other accelerators.]

CUDA on Other Platforms?
• CUDA programs are data parallel.
• Clusters of multi-cores typically run task-parallel applications.
• CUDA for multi-core SMPs: M-CUDA.
• CUDA on a cluster of multi-core nodes?

CUDA on Clusters: Motivation
• CUDA is a natural way to write data-parallel programs.
• Large code base (CUDA Zone).
• Emergence of OpenCL.
• How can we leverage these on (existing) clusters without GPUs?
Challenges:
• Mismatch in parallelism granularity.
• Synchronization and context switches can be more expensive!

CUDA on Multi-Cores (M-CUDA)
• Fuses the threads of a thread block into a single thread.
• An OpenMP parallel for executes thread blocks in parallel.
• Proper synchronization across kernels.
• Loop-transformation techniques for efficient code.
• Shared memory and synchronization are inherent on a shared-memory multi-core!
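To make the M-CUDA transformation concrete, here is a rough sketch of the kind of code such a source-to-source translation can produce for a simple vector-add kernel. It is a simplification under stated assumptions (1-D configuration, no __syncthreads or shared-memory handling) and not M-CUDA's actual output:

    /* Original CUDA kernel, shown for reference:
     *   __global__ void vadd(float *c, float *a, float *b, int n) {
     *       int i = blockIdx.x * blockDim.x + threadIdx.x;
     *       if (i < n) c[i] = a[i] + b[i];
     *   }
     */

    /* After the M-CUDA-style transformation: the threads of a block are fused
     * into a serial loop, and an OpenMP parallel for runs thread blocks in
     * parallel on the CPU cores. */
    static void vadd_block(float *c, float *a, float *b, int n,
                           int blockIdx_x, int blockDim_x) {
        for (int threadIdx_x = 0; threadIdx_x < blockDim_x; threadIdx_x++) {
            int i = blockIdx_x * blockDim_x + threadIdx_x;
            if (i < n) c[i] = a[i] + b[i];
        }
    }

    void vadd_launch(float *c, float *a, float *b, int n,
                     int gridDim_x, int blockDim_x) {
        #pragma omp parallel for                 /* thread blocks run in parallel */
        for (int bx = 0; bx < gridDim_x; bx++)
            vadd_block(c, a, b, n, bx, blockDim_x);
    }

Because the fused block runs as ordinary code on a shared-memory multi-core, CUDA's per-block shared memory and synchronization need no special machinery, which is the "inherent" property the slide points to.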
CUDA-For-Clusters
• CUDA kernels naturally express parallelism
  – at the block level and at the thread level.
• The hardware offers multiple levels of parallelism too:
  – across multiple nodes
  – and across multiple cores within a node.
• A runtime system performs this work partitioning at multiple granularities.
• There is little communication between blocks (in data-parallel programs).
• Some mechanism is still needed for
  – a "global shared memory" across multiple nodes
  – synchronization across nodes.
• A 'lightweight' software distributed shared memory (DSM) layer provides the necessary abstraction.

CUDA-For-Clusters (contd.)
[Figure: the thread blocks B0-B7 of a CUDA program are turned by source-to-source transforms into a C program with block-level code specification; an MPI + OpenMP work-distribution runtime then runs B0,B1 on node N1, B2,B3 on N2, B4,B5 on N3 and B6,B7 on N4 over the interconnect, on top of a software distributed shared memory.]
• Consistency only at kernel boundaries.
• Relaxed consistency model for global data.
• Lazy update.
• CUDA memory-management functions are linked to the DSM functions.
• The runtime system ensures partitioning of tasks across cores.
• OpenMP and MPI.
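The sketch below shows, schematically, how such a runtime might split thread blocks first across nodes with MPI and then across cores within a node with OpenMP. The static partitioning, the function names and the stub run_block are assumptions for illustration, not the actual CUDA-For-Clusters runtime:

    #include <mpi.h>
    #include <stdio.h>

    /* Stand-in for the block-level C code produced by the source-to-source
     * transform: one call executes one (fused) thread block. */
    static void run_block(int block_id) {
        printf("executing block %d\n", block_id);
    }

    static void launch_on_cluster(int num_blocks) {
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Static partition of the block range across nodes. */
        int per_node = (num_blocks + nprocs - 1) / nprocs;
        int begin = rank * per_node;
        int end = begin + per_node;
        if (end > num_blocks) end = num_blocks;

        /* Within a node, blocks are spread across the cores. */
        #pragma omp parallel for schedule(dynamic)
        for (int b = begin; b < end; b++)
            run_block(b);

        /* Kernel boundary: the software DSM would make global data consistent
         * here; a plain barrier stands in for that step in this sketch. */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        launch_on_cluster(64);                   /* e.g., a kernel with 64 thread blocks */
        MPI_Finalize();
        return 0;
    }

Since data-parallel kernels communicate little between blocks, only the kernel-boundary consistency step needs the network, which is what makes the relaxed-consistency, lazy-update DSM workable.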
Experimental Setup
• Benchmarks from the Parboil and NVIDIA CUDA SDK suites.
• 8-node Xeon cluster, each node with two quad-core sockets (2 x 4 cores).
• Baseline: M-CUDA on a single node (8 cores).

Experimental Results
[Chart: speedups with lazy update over the single-node M-CUDA baseline.]

Overview (recap) – next: Exploiting Data, Task and Pipeline Parallelism (Stream Programs on GPU)

Stream Programming Model
• A higher-level programming model in which nodes represent computation and channels represent communication (producer/consumer relations) between them.
• Exposes pipeline parallelism and task-level parallelism.
• Synchronous Data Flow (SDF), stream flow graphs, StreamIt, Brook, ...
• Compiling techniques exist for rate-optimal, buffer-optimal, software-pipelined schedules.
• Goal: mapping such applications to accelerators such as GPUs and the Cell BE.

StreamIt Example: Program 2 – Band Equalizer
[Figure: a signal source feeds a duplicate splitter; two bandpass filter + amplifier branches are merged by a combiner.]

Challenges on GPUs
• Work distribution between the multiprocessors
  – GPUs have hundreds of processors (SMs and SIMD units)!
• Exploiting task-level and data-level parallelism
  – scheduling across the multiprocessors
  – multiple concurrent threads in an SM to exploit DLP.
• Determining the execution configuration (number of threads for each filter) that minimizes execution time.
• Lack of synchronization mechanisms between the multiprocessors of the GPU.
• Managing CPU-GPU memory bandwidth efficiently.

Stream Graph Execution
[Figure: a stream graph A -> B -> C -> D executed as a software-pipelined schedule on SM1-SM4 over time steps 0-7; multiple instances of each filter (A1, A2, ..., D4) give data parallelism, overlapping stages give pipeline parallelism, and different filters on different SMs give task parallelism.]

Our Approach
• Multithreading
  – identify a good execution configuration to exploit the right amount of data parallelism.
• Memory
  – an efficient buffer layout scheme ensures all accesses to GPU memory are coalesced.
• Task partitioning between the GPU and the CPU cores.
• Work scheduling and processor (SM) assignment
  – takes communication bandwidth restrictions into account.

Compiler Framework
[Figure: the StreamIt program is compiled into profiling code; profile runs feed configuration selection; task partitioning and instance partitioning are performed by either an ILP partitioner or a heuristic partitioner; modulo scheduling and code generation then emit CUDA code + C code.]

Experimental Results on Tesla
[Chart: speedups on a Tesla (y-axis 0-20; individual benchmarks exceed 32x, 52x and 65x) for three configurations: Mostly GPU, Syn-Heur and Syn-ILP.]

Overview (recap) – next: Data Parallelism on GPUs and Task Parallelism on CPUs (MATLAB on CPU/GPU)

Compiling MATLAB to Heterogeneous Machines
• MATLAB is an array language extensively used for scientific computation.
• It expresses data parallelism
  – well suited for acceleration on GPUs.
• Current solutions (Jacket, GPUmat) require user annotation to identify "GPU-friendly" regions.
• Our compiler, MEGHA (MATLAB Execution on GPU-based Heterogeneous Architectures), is fully automatic.

Compiler Overview
• The frontend constructs an SSA intermediate representation (IR) from the input MATLAB code.
• Type inference is performed on the SSA IR
  – needed because MATLAB is dynamically typed.
• The backend identifies "GPU-friendly" kernels, decides where to run them and inserts the required data transfers.
[Figure: frontend passes (code simplification, SSA construction, type inference) followed by backend passes (kernel identification, mapping and scheduling, data-transfer insertion, code generation).]

Backend: Kernel Identification
• Kernel identification groups IR statements into sets (kernels) on which mapping and scheduling decisions are made.
• Modeled as a graph clustering problem.
• Several costs and benefits are taken into account while forming kernels:
  – register utilization (cost)
  – memory utilization (cost)
  – intermediate-array elimination (benefit)
  – kernel-invocation overhead reduction (benefit).

Backend: Scheduling and Transfer Insertion
• Assignment and scheduling are performed using a heuristic algorithm
  – assigns each kernel to the processor that minimizes its completion time
  – uses a variant of list scheduling.
• Data transfers for dependencies within a basic block are inserted during scheduling.
• Transfers for inter-basic-block dependencies are inserted using a combination of data-flow analysis and edge splitting.
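As an illustration only (this is not MEGHA's generated code), an element-wise MATLAB statement such as C = A .* B + s can be identified as a single GPU kernel, eliminating the intermediate array that A .* B would otherwise need, with the backend placing host-to-device and device-to-host transfers around the call. All names below are assumptions for this sketch:

    #include <cuda_runtime.h>

    /* Fused kernel for the MATLAB statement  C = A .* B + s  (n elements). */
    __global__ void mul_add(float *C, const float *A, const float *B,
                            float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) C[i] = A[i] * B[i] + s;       /* no temporary array for A .* B */
    }

    void run_mul_add(float *C, const float *A, const float *B, float s, int n) {
        float *dA, *dB, *dC;
        size_t bytes = n * sizeof(float);
        cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
        /* Transfers inserted because A and B are defined on the CPU side and
         * C is used on the CPU side after the kernel. */
        cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);
        mul_add<<<(n + 255) / 256, 256>>>(dC, dA, dB, s, n);
        cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }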
Results
[Chart: speedup over MATLAB for bscholes, clos, fdtd, nb1d and nb3d, comparing GPUmat, CPU-only and CPU+GPU generated code; most bars lie in the 0-14x range, with two off-scale values of 64x and 113x.]
Benchmarks were run on a machine with an Intel Xeon 2.83 GHz quad-core processor and a Tesla S1070. Generated code was compiled using nvcc version 2.3; MATLAB version 7.9 was used.

Overview (recap) – next: Conclusions

Summary
• Multi-cores and heterogeneous accelerators present tremendous research opportunities in
  – architecture
  – high-performance computing
  – compilers
  – programming languages and models.
• The challenge is PPP:
  – exploiting Performance
  – keeping programmer Productivity high
  – Portability of code across platforms.

Thank You!!