GPU Computing: Pervasive Massively Multithreaded Processors
Michael C. Shebanow
Sr. Architecture Manager, GPUs
Agenda
● GPU Computing
● Tesla Products
● CUDA
● SM: Thread Multiprocessor
● Application Performance Model
© NVIDIA Corporation 2008
GPU Computing
● GPU = graphics processing unit
  ● NVIDIA’s latest products = 88xx/98xx series
  ● Accelerates graphics
● GPU Computing
  ● Using GPUs for general-purpose computing other than graphics
© NVIDIA Corporation 2008
GPU Computing is Pervasive
● Huge numbers of deployed parallel computing engines
  ● NVIDIA has shipped >70M CUDA-capable GPUs to date
  ● NVIDIA is currently shipping more than 1M CUDA-capable GPUs per week
  ● Wide range of products, from laptops to servers
  ● Coming soon: cell phones, PDAs, …
© NVIDIA Corporation 2008
GPU Computing Key Concepts
● Hardware (HW) thread management
  ● HW thread launch and monitoring
  ● HW thread switching
  ● Tens of thousands of lightweight, concurrent threads
  ● Real threads: PC, private registers, …
● SIMT execution model
● Multiple memory scopes
  ● Per-thread private memory
  ● Per-thread-block shared memory
  ● Global memory
● Using threads to hide memory latency
● Coarse-grain thread synchronization
© NVIDIA Corporation 2008
SIMT Multithreaded Execution
● SIMT: Single-Instruction Multi-Thread executes one instruction across many independent threads
● Warp: a set of 32 parallel threads that execute a SIMT instruction
● SIMT provides easy single-thread scalar programming with SIMD efficiency
● Hardware implements zero-overhead warp and thread scheduling
● SIMT threads can execute independently
  ● A SIMT warp diverges and converges when threads branch independently
  ● Best efficiency and performance when threads of a warp execute together
[Figure: SIMT instruction scheduler issuing warp instructions over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96]
© NVIDIA Corporation 2008
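As a hypothetical illustration of divergence (not from the original slides): even and odd lanes take different paths below, so each 32-thread warp executes both paths serially and then reconverges.

// Hypothetical example of branch divergence inside a warp: even and odd
// lanes take different paths, so the warp executes both paths serially
// and reconverges at the end of the if/else.
__global__ void divergeExample(const float* in, float* out)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i % 2 == 0)
        out[i] =  in[i] * 2.0f;    // taken by even lanes
    else
        out[i] = -in[i];           // taken by odd lanes
}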
Multiple Memory Scopes
● Per-thread private memory
  ● Each thread has its own local memory
  ● Stacks, other private data
● Per-thread-block shared memory
  ● Small memory close to the processor, low latency
  ● Allocated per thread block
● Main memory
  ● GPU frame buffer
  ● Can be accessed by any thread in any thread block
[Figure: memory hierarchy: per-thread local memory, per-block shared memory, and per-device global memory shared by sequential kernel grids (Kernel 0, Kernel 1, …)]
© NVIDIA Corporation 2008
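A minimal kernel sketch (illustrative, not from the slides; assumes blocks of 256 threads) touching all three scopes: a per-thread local value, a per-block __shared__ array, and per-device global memory.

// Hypothetical example showing the three memory scopes in one kernel.
__global__ void scopesExample(const float* g_in, float* g_out)   // g_in/g_out: per-device global memory
{
    __shared__ float s_tile[256];                  // per-block shared memory (assumes blockDim.x == 256)
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    float t = 2.0f * g_in[i];                      // 't' is per-thread private (register / local memory)
    s_tile[threadIdx.x] = t;                       // visible only to threads of this block
    __syncthreads();                               // make all writes visible within the block
    int neighbor = (threadIdx.x + 1) % blockDim.x;
    g_out[i] = s_tile[threadIdx.x] + s_tile[neighbor];   // result written back to global memory
}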
Hiding LOAD Latency
● Principle: Little’s Law: N = λ L
  ● N = “number in flight”
  ● λ = arrival rate
  ● L = memory latency
● Arrival rate λ is the product of:
  ● Desired execution rate (IPC)
  ● Density of LOAD instructions (%)
● N = # of threads needed to cover latency L
[Figure: timeline of LOAD requests and responses issued by warp threads W0T0, W0T1, W0T2, …, WkT29, WkT30, WkT31]
© NVIDIA Corporation 2008
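A worked example with assumed numbers (illustrative only, not from the slides): if the SM targets an execution rate of 1 instruction per clock, 25% of instructions are LOADs, and memory latency is 400 clocks, then λ = 1 × 0.25 = 0.25 loads/clock and N = λ L = 0.25 × 400 = 100 loads must be in flight, i.e. roughly 100 threads, each with one outstanding LOAD, are needed to cover the latency.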
Hiding LOAD Latency with Fewer Threads
● Use batching
  ● Group independent LOADs together
● Modified law: N = λ L / B
  ● B = batch size

// batch size 3 example
float *d_A, *d_B, *d_C;
float a, b, c, result;
a = *d_A; b = *d_B; c = *d_C;   // three independent loads issued back-to-back
result = a * b + c;             // first use of a loaded value

● The values ‘a’, ‘b’, and ‘c’ are loaded independently before being used
● The implication is that we can execute 3 loads from one thread before the first use (‘a’ in this case) causes a stall
© NVIDIA Corporation 2008
Thread Synchronization
● Barrier synchronization among threads of a block
  ● Fast single-instruction barrier in Tesla GPUs
    void __syncthreads();
  ● Synchronizes all threads in a thread block
  ● Once all threads have reached this point, kernel execution resumes normally
  ● Use before reading shared memory written by another thread in the same block
● Global synchronization between dependent kernels
  ● Waits for all thread blocks of a kernel grid to complete
  ● Fast synchronization and kernel launch in Tesla GPUs
© NVIDIA Corporation 2008
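A short illustrative kernel (a hypothetical sketch, not from the slides; assumes blocks of 256 threads) showing the pattern above: threads write shared memory, hit the barrier, then read values written by other threads of the same block.

// Hypothetical example: per-block sum reduction in shared memory.
// __syncthreads() is required before each step that reads values written
// by other threads of the same block.
__global__ void blockSum(const float* in, float* blockSums)
{
    __shared__ float s_data[256];                 // assumes blockDim.x == 256 (a power of two)
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    s_data[threadIdx.x] = in[i];
    __syncthreads();                              // all elements loaded into shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s_data[threadIdx.x] += s_data[threadIdx.x + stride];
        __syncthreads();                          // partial sums visible to all threads
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = s_data[0];        // one result per thread block
}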
Tesla Series Products
The Tesla 8-Series Processor
NVIDIA’s 1st Generation CUDA Processor
● 681 million transistors
● 518 Gigaflops
● 128 cores (SPs), 12,288 threads max
● 384-bit, 800 MHz GDDR3
● 76 GB/sec peak
© NVIDIA Corporation 2008
The Tesla 10-Series Processor
NVIDIA’s 2nd Generation CUDA Processor
● 1.4 billion transistors
● 1 Teraflop
● 240 cores (SPs), 30,720 threads max
● 512-bit, 800 MHz GDDR3
● 102 GB/sec peak
© NVIDIA Corporation 2008
Tesla C1060 Computing Processor
Processor: 1 x Tesla T10
Number of cores: 240
Core clock: 1.33 GHz
On-board memory: 4.0 GB
Memory bandwidth: 102 GB/sec peak
Memory I/O: 512-bit, 800 MHz GDDR3
Form factor: Full ATX, 4.736” x 10.5”, dual-slot wide
System I/O: PCIe x16 Gen2
Typical power: 160 W
© NVIDIA Corporation 2008
Tesla S1070 1U System
Processors: 4 x Tesla T10
Number of cores: 960
Core clock: 1.5 GHz
Performance: 4 Teraflops
Total system memory: 16.0 GB (4.0 GB per T10)
Memory bandwidth: 408 GB/sec peak (102 GB/sec per T10)
Memory I/O: 2048-bit, 800 MHz GDDR3 (512-bit per T10)
Form factor: 1U (EIA 19” rack)
System I/O: 2 PCIe x16 Gen2
Typical power: 700 W
© NVIDIA Corporation 2008
Example Speedups
● 146X: Interactive visualization of volumetric white matter connectivity
● 36X: Ionic placement for molecular dynamics simulation on GPU
● 149X: Financial simulation of LIBOR model with swaptions
● 47X: GLAME@lab, an M-script API for linear algebra operations on GPU
● 18X: Transcoding HD video stream to H.264
● 17X: Simulation in Matlab using .mex file CUDA function
● 100X: Astrophysics N-body simulation
● 20X: Ultrasound medical imaging for cancer diagnostics
● 24X: Highly optimized object-oriented molecular dynamics
● 30X: Cmatch exact string matching to find similar proteins and gene sequences
© NVIDIA Corporation 2008
Example: Fluid Simulation
CUDA port of: Jos Stam, "Stable Fluids", in SIGGRAPH 99 Conference Proceedings, Annual Conference Series, August 1999, pp. 121-128.
© NVIDIA Corporation 2008
CUDA N-Body Simulation
● 16K bodies at 44 FPS
  ● 44 frames/s × 16K² interactions/frame ≈ 10B interactions/s
  ● × 20 FLOPS/interaction = 240 GFLOP/s
● ≈ 50x a tuned CPU implementation on an Intel Core 2 Duo
● GeForce 8800 GTX GPU: highly parallel, high arithmetic intensity
© NVIDIA Corporation 2008
N-Body Physics on CUDA
● All-pairs gravitational N-body physics of 16,384 stars
● 240 GFLOPS on NVIDIA GeForce 8800 – see GPU Gems 3
© NVIDIA Corporation 2008
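A highly simplified sketch of the all-pairs force accumulation each thread performs for its body (illustrative only; the tuned GPU Gems 3 version adds shared-memory tiling, and the kernel/parameter names here are hypothetical).

// Hypothetical all-pairs gravity sketch: one thread accumulates the acceleration of one body.
// pos[j].w holds the body's mass; the gravitational constant is absorbed into the masses.
__global__ void nbodyNaive(const float4* pos, float4* accel, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    float4 pi = pos[i];
    float3 a = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {
        float4 pj = pos[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float distSqr = dx*dx + dy*dy + dz*dz + 1e-6f;   // softening avoids divide-by-zero
        float invDist = rsqrtf(distSqr);
        float s = pj.w * invDist * invDist * invDist;    // m_j / r^3
        a.x += dx * s;  a.y += dy * s;  a.z += dz * s;
    }
    accel[i] = make_float4(a.x, a.y, a.z, 0.0f);
}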
CUDA
CUDA Programming Model
● Minimal extension of the C and C++ languages
● Write a serial program that calls parallel kernels
● Serial portions execute on the host CPU
● A kernel executes as parallel threads on the GPU device
  ● Kernels may be simple functions or full programs
  ● Many threads execute each kernel
[Figure: NVCC compilation flow: a C/CUDA application is split into CPU code and virtual PTX code; the PTX-to-target compiler then generates physical target code for each GPU (G80, …, GTX)]
© NVIDIA Corporation 2008
CUDA Thread Model
● Thread
  ● Computes result elements
  ● threadIdx is the thread id number
● Thread Block
  ● Computes a result data block
  ● 1 to 512 threads per thread block
  ● blockIdx is the block id number
● Grid of Blocks
  ● Computes many result blocks
  ● 1 to many blocks per grid
● Sequential Grids
  ● Compute sequential problem steps
[Figure: thread hierarchy: threads t0 t1 t2 … tm form a thread block; blocks 0, 1, 2, …, n form a grid]
© NVIDIA Corporation 2008
CUDA: Basics
● Declaration specifiers to indicate where things live
__global__ void KernelFunc(...);   // kernel callable from host
__device__ void DeviceFunc(...);   // function callable on device
__device__ int  GlobalVar;         // variable in device memory
__shared__ int  SharedVar;         // in per-block shared memory
● Special variables for thread identification in kernels
dim3 threadIdx;  dim3 blockIdx;  dim3 blockDim;
● Extended function invocation syntax for parallel kernel launch
KernelFunc<<<500, 128>>>(...);     // 500 blocks, 128 threads each
● Intrinsics that expose specific operations in kernel code
__syncthreads();                   // barrier synchronization
© NVIDIA Corporation 2008
CUDA: Extended Features
● Standard mathematical functions
  sinf, powf, atanf, ceil, etc.
● Built-in vector types
  float4, int4, uint4, etc. for dimensions 1..4
● Atomics
  atomicAdd(int *pmem, int value), etc.
  add, sub, min, max, and, or, xor, ...
● Texture accesses in kernels
  texture<float,2> my_texture;                 // declare texture reference
  float4 texel = texfetch(my_texture, u, v);   // fetch a texel
© NVIDIA Corporation 2008
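A small illustrative kernel (a sketch, not from the slides; assumes bins was zero-initialized on the host) using atomicAdd so that threads hitting the same bin do not lose increments. 32-bit global atomics require compute capability 1.1 or later.

// Hypothetical example: 256-bin histogram built with global-memory atomics.
__global__ void histogram256(const unsigned char* data, int n, unsigned int* bins)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // atomic read-modify-write on the shared bin
}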
CUDA Memory Management

// allocate host memory
unsigned int numBytes = N * sizeof(float);
float* h_A = (float*) malloc(numBytes);

// allocate device memory
float* d_A = 0;
cudaMalloc((void**)&d_A, numBytes);

// copy data from host to device
cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

// copy data from device back to host
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);

// free device memory
cudaFree(d_A);
© NVIDIA Corporation 2008
Example: Vector Addition Kernel

// Device code
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

// Host code
int main()
{
    // Run N/256 blocks of 256 threads each
    vecAdd<<< N/256, 256 >>>(d_A, d_B, d_C);
}
© NVIDIA Corporation 2008
Example: Host code for vecAdd

// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
© NVIDIA Corporation 2008
Example #2: Adding matrices with 2D grids

CPU C program:

void addMatrix(float *a, float *b, float *c, int N)
{
    int i, j, index;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            index = i + j * N;
            c[index] = a[index] + b[index];
        }
    }
}

void main()
{
    .....
    addMatrix(a, b, c, N);
}

CUDA C program:

__global__
void addMatrix(float *a, float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N)
        c[index] = a[index] + b[index];
}

void main()
{
    .....
    dim3 dimBlk(blocksize, blocksize);
    dim3 dimGrd(N/dimBlk.x, N/dimBlk.y);
    addMatrix<<<dimGrd, dimBlk>>>(a, b, c, N);
}
© NVIDIA Corporation 2008
The SM: Thread Multiprocessor
Tesla 10-Series Architecture
● Scales parallel performance 2X beyond the Tesla 8-Series
● 240 multithreaded thread processors at 1.5 GHz, 1 TFLOPS peak
[Figure: Tesla T10 block diagram: the host CPU and system memory connect through a bridge to the GPU's work distribution unit, which feeds 10 clusters (SMCs) of streaming multiprocessors; each SM contains an I-Cache, MT issue, C-Cache, SPs, SFUs, a DP unit, and shared memory, with per-cluster texture units and Tex L1 caches; an interconnection network links the clusters to ROP/L2/DRAM partitions]
© NVIDIA Corporation 2008
T10 Multithreaded Multiprocessor (SM)
● Scalar register-based ISA
● Multithreaded instruction unit
  ● 1024 threads, hardware multithreaded
  ● 32 SIMT warps of 32 threads
  ● Independent thread execution
  ● Hardware thread scheduling
● 8 SP thread processors
  ● IEEE 754 32-bit floating point
  ● 32-bit and 64-bit integer
  ● 16K 32-bit registers
● 2 SFU special function units
  ● RCP, RSQRT, EXP2, LOG2, SIN, COS
  ● 2 SFUs per SM yields ¼ instruction throughput
  ● Accuracy ranges from 22.5 to 24.0 bits
● 1 DP double-precision unit
  ● IEEE 754 64-bit floating point
  ● Fused multiply-add
  ● Full-speed denormalized operands and results
● 16KB shared memory
  ● Concurrent threads share data
  ● Low-latency load/store
[Figure: SM within a TPC: geometry controller, SMC, and three SMs, each with I-Cache, MT issue, C-Cache, 8 SPs, SFUs, DP unit, and shared memory, sharing a texture unit with Tex L1]
© NVIDIA Corporation 2008
SM Conceptual Block Diagram
[Figure: SM pipeline: an instruction cache and fetch unit fill per-warp instruction buffers (Warp 0 … Warp K, each with its own PC); a scheduler unit selects one warp per cycle and issues a single instruction (SI) to the multithreaded (MT) datapath of register files, ALUs, and the load/store unit (LSU)]
© NVIDIA Corporation 2008
Application Performance

Limiter Theory
● The SM is a form of queuing system
● Use “limiter theory” to predict SM performance
● There are three types of limits on the performance of the SM:
  ● Bandwidth resource limiters
  ● Per-thread-block space limiters
  ● Per-thread space limiters
● The most constraining limiter is called the critical limiter
  ● Critical limiter = min(all limiters)
[Figure: SM conceptual block diagram, repeated from the previous slide]
© NVIDIA Corporation 2008
Bandwidth Limiters
● Thread blocks arrive at some rate λTB
● Threads are composed of some distribution of operations
● Each arriving thread block of S threads contributes a distribution of operations to be performed
● Per operation type, the offered load, or BW DEMAND, is the product of:
  ● Thread block arrival rate λTB
  ● # of threads S in a block
  ● Operation density δ in each thread
● BW CAPACITY is an upper bound on BW DEMAND

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}
[Figure: histogram of per-thread operation counts for vecAdd across FADD, IADD, IMUL, LOAD, and STORE]
© NVIDIA Corporation 2008
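A worked example with assumed numbers (illustrative only, not from the slides): if thread blocks of S = 256 threads arrive at λTB = 0.01 blocks/clock and each thread contains δ = 2 LOAD operations, the offered LOAD demand is λTB × S × δ = 0.01 × 256 × 2 ≈ 5.1 LOADs/clock; if the load/store path can only accept, say, 1 LOAD per clock, then LOAD bandwidth is the critical limiter for this kernel.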
Space Limiters
● The SM also has space resources. Examples:
  ● Finite limit on warp count
  ● Finite limit on register file space
  ● Finite limit on shared memory size
● Space resources:
  ● Allocated on thread block launch
  ● Deallocated on thread block completion
  ● Consumption computed using Little’s Law (N = λL)
● Thread latency (L)
  ● Complex computation
  ● Varies with memory behavior
[Figure: warps 0 through 31 resident on the SM, grouped into thread blocks (Thread Block 0, Thread Block 1, …), each block holding its own allocation of registers and shared memory]
© NVIDIA Corporation 2008
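A worked example with assumed per-thread costs (illustrative only): with the SM's 16K 32-bit registers, a kernel whose threads each use 32 registers can keep at most 16,384 / 32 = 512 threads (16 warps) resident; if it also uses 8 KB of shared memory per 256-thread block, the 16 KB of shared memory allows only 2 resident blocks (again 512 threads), so registers and shared memory are equally constraining in this case.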
Limitations of Limiter Theory
● Limiter theory assumes uniform workloads
● It breaks down under “traffic jam” behavior
● Limiter theory is an OK 1st-order approximation
[Figure: vecAdd kernel, its per-thread operation histogram, and the SM block diagram, repeated from the previous slides]
© NVIDIA Corporation 2008
Implications of Limiter Theory
● Kernel code has to pay careful attention to operation “mix”
  ● Math-to-memory operation ratios, for example
  ● Do not want to bottleneck on one function unit, leaving other units idling
  ● Ideal: all units equally critical
● Don’t “freeway jam” kernel code
  ● Making thread blocks so large that only a few execute on the SM at a time is a bad idea
  ● “Bunching” operations of a similar type in one section of a kernel will aggravate the problem
  ● Ideal: lots of small thread blocks with a uniform distribution of operation densities
● Focus on space resource consumption
  ● Ideal: use as few resources as necessary to “load the SM”
© NVIDIA Corporation 2008
Final Thoughts
● Threads are free
  ● A common mistake in GPU Computing kernels is to make threads do too much
  ● Keep them short and sweet
  ● Example: one thread per vector element
  ● HW provides LOTS of them (10s of thousands)
  ● HW launch => near-zero overhead to create them
  ● HW context switching => near-zero overhead scheduling
● Barriers are cheap
  ● Single instruction
  ● HW synchronization of thread blocks
  ● Partition kernel code into producer-consumer stages
  ● DON’T use spin locks!
  ● Partition on results, not sources
© NVIDIA Corporation 2008
Questions?
[email protected]
http://www.nvidia.com/CUDA
http://www.nvidia.com/TESLA