Copperhead: A Python-like Data Parallel Language & Compiler
Bryan Catanzaro, UC Berkeley
Michael Garland, NVIDIA Research
Kurt Keutzer, UC Berkeley
Universal Parallel Computing Research Center
University of California, Berkeley
Intro to CUDA
Overview
Multicore/Manycore
SIMD
Programming with millions of threads
2/28
The CUDA Programming Model
CUDA is a recent programming model, designed for
Manycore architectures
Wide SIMD parallelism
Scalability
CUDA provides:
A thread abstraction to deal with SIMD
Synchronization & data sharing between small groups of
threads
CUDA programs are written in C + extensions
OpenCL is inspired by CUDA, but HW & SW vendor neutral
Programming model essentially identical
3/28
Multicore and Manycore
[Figure: a multicore CPU alongside a manycore GPU]
Multicore: yoke of oxen
Each core optimized for executing a single thread
Manycore: flock of chickens
Cores optimized for aggregate throughput, deemphasizing
individual performance
4/28
Multicore & Manycore, cont.
Specifications          | Core i7 960                                   | GTX285
Processing Elements     | 4 cores, 4-way SIMD @ 3.2 GHz                 | 30 cores, 8-way SIMD @ 1.5 GHz
Resident Threads (max)  | 4 cores, 2 threads, 4-width SIMD: 32 strands  | 30 cores, 32 SIMD vectors, 32-width SIMD: 30720 strands
SP GFLOP/s              | 102                                           | 1080
Memory Bandwidth        | 25.6 GB/s                                     | 159 GB/s
Register File           | -                                             | 1.875 MB
Local Store             | -                                             | 480 kB
5/28
SIMD: Neglected Parallelism
It is difficult for a compiler to exploit SIMD
How do you deal with sparse data & branches?
Many languages (like C) are difficult to vectorize
Fortran is somewhat better
Most common solution:
Either forget about SIMD
▪ Pray the autovectorizer likes you
Or instantiate intrinsics (assembly language)
Requires a new code version for every SIMD extension
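As an illustrative sketch (not from the original slides), here is saxpy hand-vectorized with 4-wide SSE intrinsics; supporting AVX's 8-wide registers would mean rewriting the loop with a different intrinsic family, which is exactly the maintenance burden described above.

#include <xmmintrin.h>   // SSE intrinsics

// Hand-vectorized saxpy: out[i] = a*x[i] + y[i], 4 floats at a time.
void saxpy_sse(float a, const float* x, const float* y, float* out, int n) {
    __m128 va = _mm_set1_ps(a);                      // broadcast a into a vector
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
    for (; i < n; ++i)                               // scalar tail
        out[i] = a * x[i] + y[i];
}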
6/28
What to do with SIMD?
[Figure: 4-way SIMD vs. 16-way SIMD execution]
Neglecting SIMD in the future will be more expensive
AVX: 8-way SIMD, Larrabee: 16-way SIMD
This problem composes with thread level parallelism
7/28
CUDA
CUDA addresses this problem by abstracting both SIMD
and task parallelism into threads
The programmer writes a serial, scalar thread with the
intention of launching thousands of threads
Being able to launch 1 Million threads changes the
parallelism problem
It’s often easier to find 1 Million threads than 32: just look
at your data & launch a thread per element
CUDA is designed for Data Parallelism
Not coincidentally, data parallelism is the only way for
most applications to scale to 1000(+) way parallelism
8/28
Hello World
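The code for this slide is not preserved in this extraction; below is a minimal, illustrative CUDA program in the spirit of a "hello world" (kernel name and sizes are assumptions, not the original slide's code). It launches roughly one thread per array element, and each thread runs a short, serial, scalar function.

#include <cstdio>

// Illustrative sketch: one thread per element, each thread is scalar code.
__global__ void add_one(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;                           // about one million elements
    float* d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));
    int block = 256;
    int grid = (n + block - 1) / block;              // enough blocks to cover n
    add_one<<<grid, block>>>(d_data, n);             // launch ~1M threads
    cudaDeviceSynchronize();
    cudaFree(d_data);
    printf("launched %d threads\n", grid * block);
    return 0;
}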
9/28
CUDA Summary
CUDA is a programming model for manycore
processors
It abstracts SIMD, making it easy to use wide SIMD
vectors
It provides good performance on today’s GPUs
In the near future, CUDA-like approaches will map well
to many processors & GPUs
CUDA encourages SIMD friendly, highly scalable
algorithm design and implementation
10/28
A Parallel Scripting Language
What is a scripting language?
Lots of opinions on this
I’m using an informal definition:
▪ A language where performance is happily traded for productivity
Weak performance requirement of scalability
▪ “My code should run faster tomorrow”
What is the analog of today’s scripting languages for manycore?
11/28
Data Parallelism
Assertion: Scaling to 1000 cores requires data
parallelism
Accordingly, manycore scripting languages will be data
parallel
They should allow the programmer to express data
parallelism naturally
They should compose and transform the parallelism to
fit target platforms
12/28
Warning: Evolving Project
Copperhead is still in embryo
We can compile a few small programs
Lots more work to be done in both language definition
and code generation
Feedback is encouraged
13/28
Copperhead = Cu + python
Copperhead is a subset of Python, designed
for data parallelism
Why Python?
Extant, well-accepted, high-level scripting language
▪ Free simulator(!!)
Already understands things like map and reduce
Comes with a parser & lexer
The current Copperhead compiler takes a subset of
Python and produces CUDA code
Copperhead is not CUDA-specific, but the current compiler is
14/28
Copperhead is not Pure Python
Copperhead is not for arbitrary Python code
Most features of Python are unsupported
Copperhead is compiled, not interpreted
Connecting Python code & Copperhead code will
require binding the programs together, similar to
Python-C interaction
Copperhead is statically typed
[Figure: Copperhead programs are a subset of Python programs]
15/28
Saxpy: Hello world
def saxpy(a, x, y):
return map(lambda xi, yi: a*xi + yi, x, y)
Some things to notice:
Types are implicit
▪ The Copperhead compiler uses a Hindley-Milner type system
with typeclasses similar to Haskell
▪ Typeclasses are fully resolved in CUDA via C++ templates
Functional programming:
▪ map, lambda (or equivalent in list comprehensions)
▪ you can pass functions around to other functions
▪ Closure: the variable ‘a’ is free in the lambda function, but
bound to the ‘a’ in its enclosing scope
16/28
Type Inference, cont.
c = a + b
(+) : (Num0, Num0) → Num0
Before unification: c : A145, a : A207, b : A52
After unification:  c : Num52, a : Num52, b : Num52
Copperhead includes function templates for intrinsics like
add, subtract, map, scan, gather
Expressions are mapped against templates
Every variable starts out with a unique generic type, then
types are resolved by union find on the abstract syntax tree
Tuple and function types are also inferred
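As a minimal sketch (hypothetical names, not the compiler's actual templates), an intrinsic such as add can be written so that the generic Num type variable becomes a C++ template parameter, resolved to a concrete type like float when the expression is instantiated:

// Hypothetical intrinsic template: Num is resolved at instantiation time.
template<typename Num>
__device__ Num op_add(Num a, Num b) { return a + b; }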
17/28
Data parallelism
Copperhead computations are organized around data
parallel arrays
map performs a “forall” for each element in an array
Accesses must be local
Accessing non-local elements is done explicitly
shift, rotate, or gather
No side effects allowed
18/28
Copperhead primitives
map
reduce
Scans:
scan, rscan, segscan, rsegscan
exscan, exrscan, exsegscan, exrsegscan
Shuffles:
shift, rotate, gather, scatter
19/28
Implementing Copperhead
def saxpy(a, x, y):
return map(lambda xi, yi: a*xi + yi, x, y)
The Copperhead compiler is written in Python
Python provides its own Abstract Syntax Tree
Type inference, code generation, etc. are done by walking the AST
Module(
None,
Stmt(
Function(
None,
'saxpy',
['a', 'x', 'y'],
0,
None,
Stmt(
Return(
CallFunc(
Name('map'),
Lambda(
['xi', 'yi'],
0,
Add(
Mul(
Name('a'),
Name('xi')
),
Name('yi')
)
),
Name('x'),
Name('y'),
None,
None
)
)
)
)
)
)
20/28
Compiling Copperhead to CUDA
Every Copperhead function creates at least one CUDA
device function
Top level Copperhead functions create a CUDA global
function, which orchestrates the device function calls
The global function takes care of allocating shared
memory and returning data (storing it to DRAM)
Global synchronizations are implemented through
multiple phases
All intermediate arrays & plumbing handled by Copperhead
compiler
21/28
Saxpy Revisited
def saxpy(a, x, y):
return map(lambda xi, yi: a*xi + yi, x, y)
template<typename Num> __device__ Num lambda0(Num xi, Num yi, Num a) {
    return ((a * xi) + yi);
}

template<typename Num> __device__ void saxpy0Dev(Array<Num> x, Array<Num> y, Num a, uint _globalIndex, Num& _returnValueReg) {
    Num _xReg, _yReg;
    if (_globalIndex < x.length) _xReg = x[_globalIndex];
    if (_globalIndex < y.length) _yReg = y[_globalIndex];
    if (_globalIndex < x.length) _returnValueReg = lambda0<Num>(_xReg, _yReg, a);
}

template<typename Num> __global__ void saxpy0(Array<Num> x, Array<Num> y, Num a, Array<Num> _returnValue) {
    uint _blockMin = IMUL(blockDim.x, blockIdx.x);
    uint _blockMax = _blockMin + blockDim.x;
    uint _globalIndex = _blockMin + threadIdx.x;
    Num _returnValueReg;
    saxpy0Dev(x, y, a, _globalIndex, _returnValueReg);
    if (_globalIndex < _returnValue.length) _returnValue[_globalIndex] = _returnValueReg;
}
22/28
Phases
[Figure: a reduction executes in two phases (phase 0, phase 1); a scan executes in three phases (phase 0, phase 1, phase 2)]
23/28
Copperhead to CUDA, cont.
B = reduce(map(A))
D = reduce(map(C))
[Figure: the two computations scheduled into phase 0 and phase 1]
Compiler schedules computations into phases
Right now, this composition is done greedily
Compiler tracks global and local availability of all variables
and creates a phase boundary when necessary
Fusing work into phases is important for performance
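As a rough, hypothetical illustration (not actual Copperhead compiler output), a fused map+reduce phase might look like the kernel below: the map is applied while loading, each block reduces its tile in shared memory, and a later phase combines the per-block partial sums.

// Sketch of a fused map+reduce phase 0 (assumes blockDim.x is a power of two;
// launch with blockDim.x * sizeof(Num) bytes of dynamic shared memory).
template<typename Num>
__global__ void mapReducePhase0(const Num* A, Num* blockSums, int n) {
    extern __shared__ unsigned char smemRaw[];
    Num* smem = reinterpret_cast<Num*>(smemRaw);

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Apply the map (squaring, as a stand-in) while loading the tile.
    smem[tid] = (i < n) ? A[i] * A[i] : Num(0);
    __syncthreads();

    // Tree reduction within the block.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = smem[0];   // one partial per block
}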
24/28
Copperhead to CUDA, cont.
Shared memory used only for communicating between
threads
Caching unpredictable accesses (gather)
Accessing elements with a uniform stride (shift & rotate)
Each device function returns its intermediate results
through registers
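For example (a minimal sketch, not Copperhead-generated code), a shift by one can be served out of shared memory by staging each block's tile plus a single halo element, so every thread reads its right neighbor without a second global load.

// Sketch of shift(x, 1): launch with (blockDim.x + 1) * sizeof(Num) bytes
// of dynamic shared memory.
template<typename Num>
__global__ void shiftPlusOne(const Num* x, Num* out, int n, Num fill) {
    extern __shared__ unsigned char smemRaw[];
    Num* tile = reinterpret_cast<Num*>(smemRaw);     // blockDim.x + 1 elements

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    if (i < n) tile[tid] = x[i];
    if (tid == 0) {                                  // load the halo element
        int halo = blockIdx.x * blockDim.x + blockDim.x;
        tile[blockDim.x] = (halo < n) ? x[halo] : fill;
    }
    __syncthreads();

    if (i < n) out[i] = (i + 1 < n) ? tile[tid + 1] : fill;
}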
25/28
Split
def split(input, value):
    flags = map(lambda a: 1 if a <= value else 0, input)          # phase 0
    notFlags = map(lambda a: not a, flags)                        # phase 0
    leftPositions = exscan(lambda a, b: a + b, 0, flags)          # phases 0-2
    rightPositions = exrscan(lambda a, b: a + b, 0, notFlags)     # phases 0-2
    positions = map(lambda a, b, flag:                            # phase 2
                    a if flag else len(input) - b - 1,
                    leftPositions, rightPositions, flags)
    return scatter(input, positions)                              # phase 2
This code is decomposed into 3 phases
Copperhead compiler takes care of intermediate
variables
Copperhead compiler uses shared memory for
temporaries used in scans here
Everything else is in registers
26/28
Interpreting to Copperhead
If the interpreter harvested dynamic type information, it
could use the Copperhead compiler as a backend
Fun project – see what kinds of information could be
gleaned from the Python interpreter at runtime to
figure out what should be compiled via Copperhead to a
manycore chip
27/28
Future Work
Finish support for the basics
Compiler transformations
Nested data parallelism flattening
▪ segmented scans
Retargetability
Threading Building Blocks/OpenMP/OpenCL
Bridge Python and Copperhead
Implement real algorithms with Copperhead
Vision/Machine Learning, etc.
28/28