Challenges and Opportunities in Heterogeneous Multi-core Era
R. Govindarajan
HPC Lab, SERC, IISc
[email protected]
Garuda-PM July 2011
© RG@SERC, IISc
Overview
• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Moore’s Law: Transistors
Moore’s Law: Performance
Processor performance doubles every 1.5 years
Overview
• Introduction
• Trends in Processor Architecture
– Instruction Level Parallelism
– Multi-Cores
– Accelerators
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Progress in Processor Architecture
• More transistors → New architecture innovations
– Multiple Instruction Issue processors
• VLIW
• Superscalar
• EPIC
– More on-chip caches, multiple levels of cache hierarchy, speculative execution, …
Era of Instruction Level Parallelism
Moore’s Law: Processor Architecture Roadmap (Pre-2000)
[Figure: ILP processor roadmap – superscalar processors leading to EPIC]
Multicores: The Right Turn
[Figure: performance roadmap – instead of pushing a single core from 1 GHz toward 6 GHz, the curve turns to 3 GHz parts with 1, 2, 4, and 16 cores]
Era of Multicores (Post-2000)
• Multiple cores in a single die
• Early efforts utilized multiple cores for multiple programs
• Throughput-oriented rather than speedup-oriented!
Moore’s Law: Processor Architecture Roadmap (Post-2000)
[Figure: processor architecture roadmap extended from superscalar and EPIC into multi-cores]
Progress in Processor Architecture
• More transistors → New architecture innovations
– Multiple Instruction Issue processors
– More on-chip caches
– Multi-cores
– Heterogeneous cores and accelerators
• Graphics Processing Units (GPUs)
• Cell BE, ClearSpeed
• Larrabee or SCC
• Reconfigurable accelerators, …
Era of Heterogeneous Accelerators
Moore’s Law: Processor Architecture Roadmap (Post-2000)
[Figure: the roadmap now adds heterogeneous accelerators beyond superscalar and EPIC]
Accelerators
Accelerators: Hype or Reality?
Some Top500 Systems (June 2011 List)

Rank  System              Description                        # Procs.          R_max (TFLOPS)
2     Tianhe              Xeon + Nvidia Tesla C2050 GPUs     186,368           2,566
4     Nebulae (Dawning)   Intel X5650 + Nvidia Tesla C2050   55,680 + 64,960   1,271
7     Tsubame             Xeon + Nvidia GPU                  73,278            1,192
10    Roadrunner          Opteron + Cell BE                  6,480 + 12,960    1,105
Accelerators – Cell BE
Accelerator – Fermi S2050
How is the GPU connected?
[Figure: CPU chip with cores C0–C3, a shared last-level cache (SLC), and an integrated memory controller (IMC) in front of memory; the GPU sits outside this on-chip hierarchy]
Overview
• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Handling the Multi-Core Challenge
• Shared and Distributed Memory Programming Languages
– OpenMP
– MPI
• Other Parallel Languages (Partitioned Global Address Space languages)
– X10, UPC, Chapel, …
• Emergence of Programming Languages for GPUs
– CUDA
– OpenCL
(A minimal sketch contrasting these models is shown below.)
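As a minimal, hedged illustration of the gap between these models (not taken from the talk), the same element-wise vector addition is written below once with OpenMP on a shared-memory multi-core and once as a CUDA kernel. Names such as vecAddKernel and the 256-thread block size are illustrative assumptions.

    #include <cuda_runtime.h>
    #include <omp.h>

    // OpenMP version: the runtime splits loop iterations across CPU cores.
    void vec_add_cpu(const float *a, const float *b, float *c, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }

    // CUDA version: each GPU thread handles one element; the grid supplies the loop.
    __global__ void vecAddKernel(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    void vec_add_gpu(const float *d_a, const float *d_b, float *d_c, int n) {
        int threads = 256;                          // threads per block
        int blocks  = (n + threads - 1) / threads;  // enough blocks to cover n elements
        vecAddKernel<<<blocks, threads>>>(d_a, d_b, d_c, n);
    }

In the CUDA case the programmer must additionally allocate device memory and copy data explicitly (cudaMalloc/cudaMemcpy), which is part of the productivity question raised next.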
GPU Programming: Good News
• Emergence of programming languages for GPUs
– CUDA
– OpenCL – an open standard
• Growing collection of code base
– CUDA Zone
– Packages supporting GPUs from ISVs
• Impressive performance – Yes!
• What about programmer productivity?
GPU Programming: Boon or Bane?
• Challenges in GPU programming
– Managing parallelism across SMs and SPMD cores
– Transfer of data between CPU and GPU
– Managing CPU–GPU memory bandwidth efficiently
– Efficient use of the different levels of memory (GPU device memory, shared memory, constant and texture memory, …)
– Efficient buffer layout schemes to ensure all accesses to GPU memory are coalesced
– Identifying an appropriate execution configuration for efficient execution
– Synchronization across multiple SMs
(The sketch below illustrates the execution-configuration and coalescing points.)
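To make two of these points concrete, here is a small hedged sketch (not from the slides) of the choices a CUDA programmer faces: a hand-picked execution configuration and a struct-of-arrays layout that keeps global-memory accesses coalesced. The layout and the 256-thread block size are illustrative assumptions, not recommendations from the talk.

    #include <cuda_runtime.h>

    // Struct-of-arrays layout: consecutive threads touch consecutive addresses,
    // so each warp's loads from x[] (and y[]) coalesce into few memory transactions.
    // x and y are assumed to point to device memory.
    struct PointsSoA { float *x; float *y; };

    __global__ void scaleKernel(PointsSoA p, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) {
            p.x[i] *= s;                                 // coalesced: thread i reads element i
            p.y[i] *= s;
        }
    }

    void scale(PointsSoA p, float s, int n) {
        // "Execution configuration": threads per block and blocks per grid.
        // Too few threads underutilize the SMs; too many registers or too much
        // shared memory per block limits occupancy. 256 is a common starting
        // point that is then tuned per kernel.
        dim3 block(256);
        dim3 grid((n + block.x - 1) / block.x);
        scaleKernel<<<grid, block>>>(p, s, n);
    }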
The Challenge: Data Parallelism
[Figure: data-parallel programming interfaces – OpenCL, CUDA, AMD CAL, Cg]
Challenge – Level 2: Synergistic Execution on CPU & GPU Cores
[Figure: CPU cores C0–C3 sharing a last-level cache (SLC) and memory controller (MC) in front of memory, executing alongside the GPU]
Challenge – Level 3: Synergistic Execution on Multiple Nodes
[Figure: multiple nodes, each with its own memory and NIC, connected through a network switch]
Data, Thread, & Task Level Parallelism
[Figure: programming interfaces at each level – OpenMP, OpenCL, SSE, CUDA, CAL/PTX, and MPI]
Overview
• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
– CUDA for Clusters
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Our Approach
[Figure: programs written in OpenMP, MPI, streaming languages, array languages (MATLAB), CUDA/OpenCL, and other parallel languages are fed to a compiler/runtime system for synergistic execution on multiple heterogeneous cores – multi-cores with SSE, GPUs, Cell BE, and other accelerators]
CUDA on Other Platforms?
• CUDA programs are data parallel
• Clusters of multi-cores are typically used for task-parallel applications
• CUDA for multi-core SMPs (M-CUDA)
• CUDA on clusters of multi-core nodes?
CUDA on Clusters: Motivation
• CUDA is a natural way to write data-parallel programs
• Large code base (CUDA Zone)
• Emergence of OpenCL
• How can we leverage these on (existing) clusters without GPUs?
Challenges:
• Mismatch in parallelism granularity
• Synchronization and context switches can be more expensive!
CUDA on Multi-Cores (M-CUDA)
• Fuses the threads in a thread block into a single (CPU) thread
• OpenMP parallel for executes thread blocks in parallel
• Proper synchronization across kernels
• Loop transformation techniques for efficient code
• Shared memory and synchronization are inherent on a shared-memory CPU!
(A hedged sketch of this transformation follows below.)
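The sketch below is a hedged reconstruction of the spirit of the transformation (not M-CUDA's actual output) for a simple kernel: the implicit loop over threads in a block becomes an explicit serial loop, and an OpenMP parallel for runs the thread blocks across the CPU cores.

    #include <omp.h>

    // Original CUDA kernel (conceptually):
    //   __global__ void add(const float *a, const float *b, float *c, int n) {
    //       int i = blockIdx.x * blockDim.x + threadIdx.x;
    //       if (i < n) c[i] = a[i] + b[i];
    //   }
    //
    // M-CUDA-style translation: thread blocks -> outer parallel loop,
    // threads within a block -> inner serial loop.
    void add_mcuda(const float *a, const float *b, float *c, int n,
                   int gridDimX, int blockDimX) {
        #pragma omp parallel for                 // one thread block per CPU thread
        for (int blockIdxX = 0; blockIdxX < gridDimX; ++blockIdxX) {
            for (int threadIdxX = 0; threadIdxX < blockDimX; ++threadIdxX) {
                int i = blockIdxX * blockDimX + threadIdxX;
                if (i < n) c[i] = a[i] + b[i];
            }
        }
        // A __syncthreads() inside the block would require splitting the inner
        // loop at each barrier, which is where the loop transformation
        // techniques mentioned above come in.
    }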
CUDA-For-Clusters
• CUDA kernels naturally express parallelism
– Block-level and thread-level
• Multiple levels of parallelism in the hardware too!
– Across multiple nodes
– And across multiple cores within a node
• A runtime system performs this work partitioning at multiple granularities
• Little communication between blocks (in data-parallel programs)
• Need some mechanism for
– “Global shared memory” across multiple nodes!
– Synchronization across nodes
• A ‘lightweight’ software distributed shared memory (DSM) layer provides the necessary abstraction
(A hedged sketch of the block distribution across nodes follows below.)
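A hedged sketch of the runtime idea (an assumed structure, not the actual CUDA-For-Clusters code): thread blocks are first divided across MPI ranks (nodes), and each rank then spreads its share of blocks across its cores with OpenMP. The software DSM behind the "global memory" is elided; run_block is a placeholder for the per-block code produced by the source-to-source transform.

    #include <mpi.h>
    #include <omp.h>

    // Placeholder: serialized per-block code produced by the source-to-source
    // transform (as in M-CUDA).
    void run_block(int blockIdxX /*, kernel arguments ... */);

    void launch_on_cluster(int gridDimX) {
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        // Contiguous partition of the grid across nodes.
        int per_node = (gridDimX + nprocs - 1) / nprocs;
        int first = rank * per_node;
        int last  = (first + per_node < gridDimX) ? first + per_node : gridDimX;

        #pragma omp parallel for              // this node's blocks spread over its cores
        for (int b = first; b < last; ++b)
            run_block(b);

        // Kernel boundary: this is where the software DSM would make writes to
        // "global" data visible to all nodes (consistency only at kernel boundaries).
        MPI_Barrier(MPI_COMM_WORLD);
    }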
CUDA-For-Clusters
[Figure: a CUDA program with blocks B0–B7 passes through source-to-source transforms into a C program with block-level code; an MPI + OpenMP work-distribution runtime assigns block pairs (B0,B1), (B2,B3), (B4,B5), (B6,B7) to four nodes N1–N4, each with two processors and its own RAM, connected by an interconnect on top of a software distributed shared memory]
• Consistency only at kernel boundaries
• Relaxed consistency model for global data
• Lazy update
• CUDA memory management functions are linked to the DSM functions
• Runtime system ensures partitioning of tasks across cores
– OpenMP and MPI
Experimental Setup
• Benchmarks from the Parboil and Nvidia CUDA SDK suites
• 8-node Xeon cluster, each node with dual-socket quad-core processors (2 × 4 cores)
• Baseline: M-CUDA on a single node (8 cores)
Experimental Results
[Figure: speedup results with lazy update]
Overview
• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
– Stream Programs on GPU
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Stream Programming Model
• A higher-level programming model where nodes represent computation and channels represent communication (producer/consumer relations) between them
• Exposes pipelined parallelism and task-level parallelism
• Synchronous Data Flow (SDF), Stream Flow Graph, StreamIt, Brook, …
• Compilation techniques for achieving rate-optimal, buffer-optimal, software-pipelined schedules
• Mapping applications to accelerators such as GPUs and Cell BE
StreamIt Example Program: 2-Band Equalizer
[Figure: stream graph – a signal source feeds a duplicate splitter; each branch is a bandpass filter followed by an amplifier, and a combiner joins the two branches]
(A hedged plain-code sketch of this graph follows below.)
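Below is a hedged host-code sketch of the same graph (this is not StreamIt syntax): the splitter duplicates the input stream, each branch applies a band-pass filter followed by an amplifier, and the combiner adds the two branches back together. band_pass here is only a stand-in for a real filter.

    #include <vector>
    using std::vector;

    // Stand-in band-pass filter: a real equalizer would apply an FIR/IIR filter
    // parameterized by the band's cut-off frequencies; this stub just copies
    // the signal so the sketch stays self-contained.
    vector<float> band_pass(const vector<float> &in, float /*lo*/, float /*hi*/) {
        return in;
    }

    vector<float> amplify(vector<float> s, float gain) {
        for (float &x : s) x *= gain;
        return s;
    }

    // 2-band equalizer: duplicate-split -> (band-pass + amplifier) x 2 -> combine.
    vector<float> equalizer2(const vector<float> &src,
                             float lo1, float hi1, float g1,
                             float lo2, float hi2, float g2) {
        vector<float> band1 = amplify(band_pass(src, lo1, hi1), g1);   // branch 1
        vector<float> band2 = amplify(band_pass(src, lo2, hi2), g2);   // branch 2
        vector<float> out(src.size());
        for (size_t i = 0; i < out.size(); ++i)
            out[i] = band1[i] + band2[i];                              // combiner
        return out;
    }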
Challenges on GPUs
• Work distribution between the multiprocessors
– GPUs have hundreds of processors (SMs and SIMD units)!
• Exploiting task-level and data-level parallelism
– Scheduling across the multiprocessors
– Multiple concurrent threads per SM to exploit DLP
• Determining the execution configuration (number of threads for each filter) that minimizes execution time
• Lack of synchronization mechanisms between the multiprocessors of the GPU
• Managing CPU–GPU memory bandwidth efficiently
Stream Graph Execution
[Figure: software-pipelined execution of a stream graph with filters A, B, C, D on SM1–SM4 over time steps 0–7; multiple instances of each filter (A1, A2, …) illustrate data parallelism, overlapping producer and consumer filters across steps illustrates pipeline parallelism, and running different filters on different SMs illustrates task parallelism]
Our Approach
• Multithreading
– Identify a good execution configuration to exploit the right amount of data parallelism
• Memory
– Efficient buffer layout scheme to ensure all accesses to GPU memory are coalesced
• Task partitioning between GPU and CPU cores
• Work scheduling and processor (SM) assignment problem
– Takes into account communication bandwidth restrictions
Compiler Framework
[Figure: compilation flow – a StreamIt program is instrumented (generate code for profiling), profile runs are executed, and a configuration is selected; either an ILP partitioner or a heuristic partitioner then performs instance partitioning, task partitioning, and modulo scheduling, and code generation emits CUDA code plus C code]
Experimental Results on Tesla
[Figure: results comparing the Mostly-GPU, Syn-Heur, and Syn-ILP configurations on a 0–20 scale; some bars exceed the scale and are annotated > 32x, > 52x, and > 65x]
Overview
• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
– MATLAB on CPU/GPU
• Conclusions
Compiling MATLAB to Heterogeneous Machines
• MATLAB is an array language extensively used for scientific computation
• Expresses data parallelism
– Well suited for acceleration on GPUs
• Current solutions (Jacket, GPUmat) require user annotation to identify “GPU friendly” regions
• Our compiler, MEGHA (MATLAB Execution on GPU-based Heterogeneous Architectures), is fully automatic
(A hedged illustration of the kind of kernel such a compiler might emit follows below.)
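As a hedged illustration (not MEGHA's actual output), a MATLAB element-wise statement such as C = A .* B + s expresses data parallelism directly, and a MATLAB-to-CUDA compiler could lower it to a kernel of roughly this shape; the kernel name and launch parameters are assumptions.

    #include <cuda_runtime.h>

    // Hypothetical kernel a MATLAB-to-CUDA compiler might emit for  C = A .* B + s
    // (element-wise multiply-add over arrays of n elements, all in device memory).
    __global__ void elemwise_mul_add(const float *A, const float *B, float *C,
                                     float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            C[i] = A[i] * B[i] + s;
    }

    void run_elemwise(const float *dA, const float *dB, float *dC, float s, int n) {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        elemwise_mul_add<<<blocks, threads>>>(dA, dB, dC, s, n);
    }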
Compiler Overview
• The frontend constructs an SSA intermediate representation (IR) from the input MATLAB code
• Type inference is performed on the SSA IR
– Needed because MATLAB is dynamically typed
• The backend identifies “GPU friendly” kernels, decides where to run them, and inserts the required data transfers
[Figure: compiler phases – frontend: code simplification, SSA construction, type inference; backend: kernel identification, mapping and scheduling, data transfer insertion, code generation]
Backend: Kernel Identification
• Kernel identification identifies sets of IR statements (kernels) on which mapping and scheduling decisions are made
• Modeled as a graph clustering problem
• Takes into account several costs and benefits while forming kernels
– Register utilization (cost)
– Memory utilization (cost)
– Intermediate array elimination (benefit)
– Kernel invocation overhead reduction (benefit)
Backend: Scheduling and Transfer Insertion
• Assignment and scheduling are performed using a heuristic algorithm
– Assigns each kernel to the processor that minimizes its completion time
– Uses a variant of list scheduling
• Data transfers for dependencies within a basic block are inserted during scheduling
• Transfers for inter-basic-block dependencies are inserted using a combination of data-flow analysis and edge splitting
(A hedged sketch of the completion-time heuristic follows below.)
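A hedged sketch of the kind of earliest-completion-time heuristic described above (an assumed structure, not the compiler's actual code): each kernel, taken in topological order, is placed on whichever processor (CPU or GPU) finishes it soonest, charging a transfer cost when a predecessor's result lives on the other processor.

    #include <vector>
    #include <algorithm>

    struct Kernel {
        std::vector<int> preds;   // indices of kernels this one depends on
        double cost[2];           // estimated run time on CPU (0) and GPU (1)
        double xfer;              // cost to move its inputs across the PCIe link
    };

    // Returns, for each kernel, the processor (0 = CPU, 1 = GPU) it is assigned to.
    // Kernels are assumed to be given in topological order (list scheduling).
    std::vector<int> schedule(const std::vector<Kernel> &kernels) {
        std::vector<int> place(kernels.size());
        std::vector<double> finish(kernels.size());
        double ready[2] = {0.0, 0.0};                 // when each processor is free

        for (size_t k = 0; k < kernels.size(); ++k) {
            double best = 0.0; int best_p = 0;
            for (int p = 0; p < 2; ++p) {
                double start = ready[p];
                for (int d : kernels[k].preds) {      // wait for predecessors, plus a
                    double avail = finish[d];         // transfer if they ran on the
                    if (place[d] != p)                // other processor
                        avail += kernels[k].xfer;
                    start = std::max(start, avail);
                }
                double end = start + kernels[k].cost[p];
                if (p == 0 || end < best) { best = end; best_p = p; }
            }
            place[k] = best_p;
            finish[k] = best;
            ready[best_p] = best;
        }
        return place;
    }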
Results
[Figure: speedup over MATLAB for the benchmarks bscholes, clos, fdtd, nb1d, and nb3d, comparing GPUmat, CPU-only, and CPU+GPU code; two bars exceed the 0–14 scale and are annotated 64 and 113]
Benchmarks were run on a machine with an Intel Xeon 2.83 GHz quad-core processor and a Tesla S1070. Generated code was compiled using nvcc version 2.3. MATLAB version 7.9 was used.
Overview
• Introduction
• Trends in Processor Architecture
• Programming Challenges
• Data and Task Level Parallelism on Multi-cores
• Exploiting Data, Task and Pipeline Parallelism
• Data Parallelism on GPUs and Task Parallelism on CPUs
• Conclusions
Summary
• Multi-cores and heterogeneous accelerators present tremendous research opportunities in
– Architecture
– High Performance Computing
– Compilers
– Programming Languages & Models
• The challenge is PPP: exploiting Performance while keeping programmer Productivity high and ensuring Portability of code across platforms
Thank You !!