COMP5426
Parallel and Distributed Computing
Parallel Architectures
Basic Functional Units
[Figure: basic functional units of a computer - the central processing unit, main memory, and input/output connected by the system's interconnection, with peripherals attached via communication lines.]
Implicit Parallelism
Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude).
Higher levels of device integration have made a large number of transistors available.
How best to utilize these resources?
Conventionally, these resources are used to build multiple functional units and to execute multiple instructions in the same cycle (instruction-level parallelism, ILP):
• Pipelining
• Superscalar
Pipelining
Pipelining overlaps the various stages of instruction execution to improve performance.
[Figure: pipeline timing diagram - instructions I1, I2, I3 over time, with the fetch (F) and execute (E) stages of successive instructions overlapping.]
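A standard back-of-the-envelope illustration (not from the slides): with the two stages shown, n instructions complete in roughly n + 1 cycles instead of the 2n cycles an unpipelined design would need; more generally, a k-stage pipeline finishes n instructions in about k + n - 1 cycles rather than k × n, approaching a k-fold speedup for long instruction streams.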
Pipelining
Limitations:
• The speed of a pipeline is eventually limited by its slowest stage; pushing performance therefore means more stages, i.e. very deep pipelines.
• However, in typical program traces every fifth or sixth instruction is a conditional jump, which requires very accurate branch prediction.
• The penalty of a misprediction grows with the depth of the pipeline, since a larger number of in-flight instructions has to be flushed.
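A rough illustrative calculation (assumed numbers, not from the slides): if one instruction in five is a branch and 10% of branches are mispredicted on a 20-stage pipeline, the average misprediction cost is about 0.2 × 0.1 × 20 = 0.4 cycles per instruction, so CPI climbs from an ideal 1.0 to roughly 1.4 before any other stalls are counted.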
Superscalar
Superscalar processors provide multiple redundant functional units within each CPU so that multiple instructions can be executed on separate data items concurrently.
• Early designs: two ALUs and a single FPU.
• Modern designs have more; e.g. the PowerPC 970 includes four ALUs and two FPUs, as well as two SIMD units.
Superscalar
The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate group of instructions to execute concurrently.
Very Long Instruction Word (VLIW) processors instead rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
Superscalar
Limitations:
• The degree of intrinsic parallelism in the instruction stream, i.e. the limited amount of instruction-level parallelism available.
• The complexity and time cost of the dispatcher and the associated dependency-checking logic.
The performance of the system as a whole will suffer if it cannot keep all of the units fed with instructions.
Implicit Parallelism
Factors that affect performance:
• True data dependency: the result of one operation is an input to the next (see the sketch after this list).
• Resource dependency: two operations require the same resource.
• Branch dependency: scheduling instructions across conditional branch statements cannot be done deterministically a priori.
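A minimal C sketch of a true data dependency versus independent operations (function names are illustrative, not from the slides):

    // dependent(): a true data dependency chain; b needs a, so the two
    // statements cannot be issued in the same cycle.
    int dependent(int x, int y) {
        int a = x + y;
        int b = a * 2;
        return b;
    }

    // independent(): no dependency between c and d, so a superscalar core
    // is free to issue both operations concurrently.
    int independent(int x, int y) {
        int c = x + y;
        int d = x - y;
        return c + d;
    }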
Multicore Technology
Power and Heat: Intel Embraces Multicore
May 17, 2004 … Intel, the world's largest chip maker, publicly
acknowledged that it had hit a ''thermal wall'' on its microprocessor line.
As a result, the company is changing its product strategy and disbanding
one of its most advanced design groups. Intel also said that it would
abandon two advanced chip development projects …
Now, Intel is embarked on a course already adopted by some of its major
rivals: obtaining more computing power by stamping multiple processors
on a single chip rather than straining to increase the speed of a single
processor … Intel's decision to change course and embrace a ''dual
core'' processor structure shows the challenge of overcoming the effects
of heat generated by the constant on-off movement of tiny switches in
modern computers … some analysts and former Intel designers said that
Intel was coming to terms with escalating heat problems so severe they
threatened to cause its chips to fracture at extreme temperatures…
New York Times, May 17, 2004
Generic Multicore chip
[Figure: generic multicore chip - several processors, each with its own on-chip memory, sharing a global memory.]
• Handful of processors, each supporting ~1 hardware thread
• On-chip memory near the processors (cache, RAM, or both)
• Shared global memory space (external cache and DRAM)
Intel Core i7
GPU
GPU
All PCs have a GPU - the main chip inside a computer that calculates and generates the positioning of graphics on the screen.
• Games typically render tens of thousands of triangles at 60 fps.
• Screen resolution is typically 1600 x 1200 and every pixel is recalculated each frame.
• This corresponds to processing 1600 x 1200 x 60 = 115,200,000 pixels per second.
• GPUs are designed to make these operations fast.
New types of GPUs contain multiple cores that utilise hardware multithreading and SIMD.
IBM Cell Broadband Engine
• Multicore
• Single powerful thread per core
• Thread parallel
• Explicit communication
• Explicit synchronization
Cell Processor
• PPE – Power Processing Element
• SPE – Synergistic Processing Element
• SPU – Synergistic Processing Unit
• LS – Local Store
• MFC – Memory Flow Controller
• EIB – Element Interconnect Bus
Cell Processor – the PPE
More or less a standard scalar processor:
• Accesses main memory through load and store instructions
• Standard L1 and L2 caches
• Capable of running scalar (non-vectorized) code fast
• Capable of running a standard operating system, e.g., Linux
• Capable of executing IEEE floating-point arithmetic (double and single precision)
Cell Processor – the SPE
A pure vector processor:
• Only executes code from its local memory
• Only operates on data in its local memory
• Accesses main memory and the local memories of other SPEs only through DMA messages
• Loads and stores only 128-bit vectors
• Only operates on 128-bit vectors (scalar instructions are emulated in software)
• Only supports a single thread, with a register file of 128 vector registers at its disposal
NVIDIA GTX 280
The GPU (a set of multiprocessors) executes many thread blocks.
• Each thread block consists of many threads.
• Within a thread block, threads are grouped into warps.
Each thread has:
• Per-thread registers
• Per-thread memory (in DRAM)
Each thread block has:
• Per-thread-block shared memory
Global memory (DRAM) is accessible to all threads (a small worked example follows).
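A small worked example of this hierarchy (numbers assumed, not from the slides): launching 4 thread blocks of 128 threads each creates 4 × 128 = 512 threads, and each block's 128 threads execute as 128 / 32 = 4 warps.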
NVIDIA GTX 280
• SM – Streaming Multiprocessor (more or less a core)
• SP – Streaming Processor ("scalar processor core", also known as "thread processor")
• Register file
• Shared memory
• Constant cache (read-only for the SM)
• Texture cache (read-only for the SM)
GTX 280: Streaming Multiprocessor
• Eight scalar processors (thread processors) with one instruction issue logic (SIMD)
• Long vectors (32 threads = 1 warp)
• Massively multithreaded (512 scalar hardware threads = 16 warps)
• Huge register file (8192 scalar registers shared among all threads)
• Can load and store data to and from shared memory
• Can load and store data directly to and from main memory
General-Purpose Computing on GPUs
Idea:
• Potential for very high performance at low cost
• The architecture is well suited to certain kinds of parallel applications (data parallel)
Early challenges:
• Architectures were highly customized to graphics problems (e.g., vertex and fragment processors)
• Programmed using graphics-specific programming models or libraries
Recent trends:
• Some convergence between commodity processors and GPUs and their associated parallel programming models
CUDA
Compute Unified Device Architecture, one of the first programming systems to support heterogeneous architectures.
A data-parallel programming interface to the GPU:
• The data to be operated on is discretized into independent partitions of memory.
• Each thread performs roughly the same computation on a different partition of the data.
• When this fits the problem, the parallelization is easy to express and very efficient.
The programmer expresses (a minimal code sketch follows this list):
• Thread programs (kernels) to be launched on the GPU, and how to launch them
• Data organization and movement between the host and the GPU
• Synchronization, memory management, …
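A minimal CUDA sketch of these ideas, assuming a simple vector addition (function and variable names such as vecAdd and addOnGpu are illustrative, not from the slides): each thread computes one element, and the host code handles data movement and the kernel launch.

    #include <cuda_runtime.h>

    // Kernel (thread program): every thread applies the same operation
    // to a different element of the data.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard the tail block
            c[i] = a[i] + b[i];
    }

    // Host side: allocate device memory, move data, launch, move results back.
    void addOnGpu(const float *a, const float *b, float *c, int n) {
        float *dA, *dB, *dC;
        size_t bytes = n * sizeof(float);
        cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        vecAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);
        cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }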
CUDA Software Stack
CUDA Hardware Model
CUDA Memory Model
Each thread can:
• Read/write per-thread registers
• Read/write per-thread local memory
• Read/write per-block shared memory
• Read/write per-grid global memory
• Read only per-grid constant memory
• Read only per-grid texture memory
The host can read/write global, constant, and texture memory (a short kernel sketch follows).
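A hedged sketch of how some of these memory spaces appear in CUDA source (the kernel name scaleAndCopy and the block size of 256 are assumptions, not from the slides):

    // Per-grid constant memory: read-only inside kernels,
    // written by the host with cudaMemcpyToSymbol().
    __constant__ float scale;

    // Assumes a launch with at most 256 threads per block.
    __global__ void scaleAndCopy(const float *in, float *out, int n) {
        __shared__ float tile[256];                    // per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = 0.0f;                                // local scalar, held in per-thread registers
        if (i < n) x = in[i] * scale;                  // read per-grid global and constant memory
        tile[threadIdx.x] = x;                         // write per-block shared memory
        __syncthreads();                               // make the block's writes visible
        if (i < n) out[i] = tile[threadIdx.x];         // write per-grid global memory
    }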
CUDA Programming Model
The GPU is viewed as a compute device that:
• is a coprocessor to the CPU, or host
• has its own device memory
• runs many threads in parallel
Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads.
CUDA Programming Model
Future Computer Systems
Challenges of Using Multicore and GPU
Flynn’s Taxonomy
Classification by instruction stream and data stream:
• SISD (single instruction, single data): uniprocessors
• SIMD (single instruction, multiple data): processor arrays, pipelined vector processors
• MISD (multiple instruction, single data): rarely used
• MIMD (multiple instruction, multiple data): multiprocessors, multicomputers
Types of Parallelism
Flynn's categories: SISD (uniprocessors), SIMD (array or vector processors), MISD (rarely used), and MIMD (multiprocessors or multicomputers).
Johnson's expansion further divides MIMD by memory organization (global vs. distributed memory) and by communication mechanism (shared variables vs. message passing):
• GMSV – global memory, shared variables: shared-memory multiprocessors
• GMMP – global memory, message passing: rarely used
• DMSV – distributed memory, shared variables: distributed shared memory
• DMMP – distributed memory, message passing: distributed-memory multicomputers
SIMD
Single Instruction Multiple Data architecture:
• A single instruction can operate on multiple data elements in parallel.
• Relies on the highly structured nature of the underlying computations.
• Data parallelism.
• Widely applied in multimedia processing (e.g., graphics, image and video).
SIMD
MIMD
MIMD architectures can be classified (based on memory organization) into shared-memory or distributed-memory architectures, or a combination of both, i.e. distributed shared memory (a brief code sketch of the two communication styles follows).
• Shared-memory multiprocessors communicate through a common memory.
• Distributed-memory multicomputers communicate through communication links.
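A compact, hedged sketch of the two communication styles in C (the pthread and MPI calls are standard APIs, but the functions shown are illustrative and not from the slides):

    #include <pthread.h>
    #include <mpi.h>

    /* Shared memory: threads read and write the same variable;
       communication is implicit, synchronization (the mutex) is explicit. */
    int shared_sum = 0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    void *add_one(void *arg) {
        pthread_mutex_lock(&lock);
        shared_sum += 1;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    /* Distributed memory: each process owns its data; communication is an
       explicit message, and the matching send/receive also synchronizes. */
    void exchange(int rank) {
        int value = rank;
        if (rank == 0)
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }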
Parallel Architectures
[Figure: shared-memory multiprocessor - processors P1 … Pn connected to memory modules through an interconnection network.]
SHARED-MEMORY MULTIPROCESSOR: all the processors in the system can directly access all memory locations in the system.
Parallel Architectures
[Figure: distributed-memory multicomputer - processors P1 … Pn, each with its own local memory, connected by an interconnection network.]
DISTRIBUTED-MEMORY MULTICOMPUTERS:
• Processors have direct access only to their local memory.
• Communication and synchronization take place via messages.
Parallel Architectures
[Figure: distributed shared-memory multiprocessor - processors P1 … Pn, each with local memory, connected by an interconnection network.]
DISTRIBUTED SHARED-MEMORY MULTIPROCESSOR: all the processors in the system can access all local memory locations.
Parallel Architectures
Shared Memory:
• Explicit global data structure
• Decomposition of work is independent of data layout
• Communication is implicit
• Explicit synchronization (needed to avoid race conditions and over-writing)
Distributed Memory:
• Implicit global data structure
• Decomposition of data determines assignment of work
• Communication is explicit
• Synchronization is implicit (data is buffered until received)
Architecture of Interest
Cloud Computing
Cloud computing describes a new business model for the supply, consumption, and delivery of IT services over the Internet.
NIST definition:
• Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
Cloud computing promotes availability and consists of:
• Five essential characteristics
• Four deployment models
• Three service models
Cloud Computing
Five essential characteristics:
• On-demand self-service
• Broad network access
• Resource pooling
• Rapid elasticity
• Measured service
Four deployment models:
• Private cloud, community cloud, public cloud, and hybrid cloud
Cloud Computing
Three service models (layered, from end users down):
• Software as a Service (SaaS)
• Platform as a Service (PaaS)
• Infrastructure as a Service (IaaS)
Cloud Computing
Cloud computing is a style of computing in which
dynamically scalable and often virtualized
resources are provided as a service over the
Internet.
Users need not have knowledge of, expertise in,
or control over the technology infrastructure in
the "cloud" that supports them.
Characteristics:
• Virtual: software, databases, Web servers, operating systems, storage and networking are all provided as virtual servers.
• XaaS: Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS).
Cloud Computing
Motivation: efficient resource use
• Utilization of typical data centers: below 10-30%
• Average lifetime of servers: approximately 3 years (CapEx)
• Excessive operating costs (OpEx): staffing, maintenance (HW & SW), and energy (both for powering and cooling)
Cost Efficiency: Provider’s Perspective
Cost reductions:
• Economies of scale prevail - cloud service providers can achieve 75%-80% cost reductions through bulk purchases.
• Efficient resource management practices: utilization improvement (server consolidation) and automated processes (reduction in staffing cost).
Profit increases:
• Increase in market demand
• Quality of service (performance)
Cost Efficiency: Provider’s Perspective
Smoothing/flattening out workloads by effectively managing demand and capacity.
[Figure: three plots of resources vs. time (days), each showing demand against capacity, illustrating how workload smoothing brings demand closer to provisioned capacity.]
Cost Efficiency: User’s Perspective
Elasticity: utilization may often be bursty.
[Figure: number of EC2 instances over time - about 25,000 registered users on Monday 14/April/2008; 50,000 users and 400 instances by Tuesday 15/April/2008; 250,000 users and 3,500 instances by Friday 18/April/2008.]
Cost Efficiency: User’s Perspective
Dynamic provisioning (pay-as-you-go).
[Figure: two plots of capacity, supply, and demand over time, contrasting fixed capacity with supply that tracks demand under dynamic provisioning.]
Cloud Computing
Virtualization (VMware and Xen):
• Highly customized user environments
• Isolation from malicious software
• Seamless migration of services across physical machines
[Figure: traditional stack (applications on a single operating system on hardware) versus virtualized stack (applications on multiple guest operating systems running on a hypervisor on hardware).]
References
Jack Dongarra (U. Tenn.) CS 594 slides
http://www.cs.utk.edu/~dongarra/WEBPAGES/cs594-2010.htm