Motivation for Parallelism

The speed of an application is determined by more than just processor speed:
Memory speed
Disk speed
Network speed
...
Multiprocessors typically improve the aggregate speeds:
Memory bandwidth is improved by separate memories.
Multiprocessors usually have more aggregate cache memory.
Each processor in a cluster can have its own disk and network adapter, improving aggregate speeds.
Motivation for Parallelism

Constraints on the location of data:
Huge data sets could be difficult, expensive, or otherwise infeasible to store in a central location.
Distributed data and parallel processing is a practical solution.
Communication enables parallel applications:
Harnessing the computing power of distributed systems over the Internet is a popular example of parallel processing (SETI, Folding@home, ...).

Types of Parallelism
Instruction Level Parallelism (ILP)

Instructions near each other in an instruction stream could be independent.
These can then execute in parallel, either partially (pipelining) or fully (superscalar).
Hardware is needed for dependency tracking.
The amount of ILP available per instruction stream is limited.
ILP is usually considered an implicit parallelism since the hardware automatically exploits it without programmer/compiler intervention.
Programmers/compilers could transform applications to expose more ILP.
ILP Example: Loop Unrolling

for( i = 0; i < 64; i++ ) {
    vec[i] = a * vec[i];
}

This loop has one sequence of load, multiply, store per iteration.
The amount of ILP is very limited.

for( i = 0; i < 64; i += 4 ) {
    vec[i+0] = a * vec[i+0];
    vec[i+1] = a * vec[i+1];
    vec[i+2] = a * vec[i+2];
    vec[i+3] = a * vec[i+3];
}

4-fold loop unrolling increases the amount of ILP exploitable by the hardware:
four independent load, multiply, store sequences per iteration.
Types of Parallelism
Task Level Parallelism (TLP)

Several instruction streams are independent.
Parallel execution of the streams is possible.
Much more coarse-grain parallelism compared with ILP.
Typically involves the programmer and/or compiler to:
decompose the application into tasks,
enforce dependencies,
and expose parallelism.
Some experimental techniques, such as speculative multithreading, are aimed at removing these burdens from the programmer.

TLP Example: Quicksort

QuickSort( A, B ):
if( B - A < 10 ) {
    /* Base Case: Use fast sort */
    FastSort( A, B );
} else {
    /* Continue Recursively */
    Partition [A,B] into [A,C] and [C+1,B];   /* Task X */
    QuickSort( A, C );                        /* Task Y */
    QuickSort( C+1, B );                      /* Task Z */
}

Both Y and Z depend on X but are mutually independent.
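The slides stop at the pseudocode; as a rough illustration only, the sketch below expresses tasks X, Y, and Z with OpenMP tasks. The helper names fast_sort and partition_range and the use of OpenMP are assumptions made for this sketch, not part of the original material (compile with OpenMP support, e.g. -fopenmp).

#include <stdio.h>

/* Base case ("FastSort"): simple insertion sort on v[a..b]. */
static void fast_sort(int *v, int a, int b)
{
    for (int i = a + 1; i <= b; i++) {
        int key = v[i], j = i - 1;
        while (j >= a && v[j] > key) {
            v[j + 1] = v[j];
            j--;
        }
        v[j + 1] = key;
    }
}

/* Task X: partition v[a..b] into v[a..c] and v[c+1..b] (Hoare scheme). */
static int partition_range(int *v, int a, int b)
{
    int pivot = v[a + (b - a) / 2];
    int i = a - 1, j = b + 1;
    for (;;) {
        do { i++; } while (v[i] < pivot);
        do { j--; } while (v[j] > pivot);
        if (i >= j)
            return j;
        int t = v[i]; v[i] = v[j]; v[j] = t;
    }
}

static void quicksort(int *v, int a, int b)
{
    if (b - a < 10) {
        fast_sort(v, a, b);                  /* base case */
    } else {
        int c = partition_range(v, a, b);    /* Task X */
        #pragma omp task                     /* Task Y */
        quicksort(v, a, c);
        #pragma omp task                     /* Task Z */
        quicksort(v, c + 1, b);
        #pragma omp taskwait  /* Y and Z run in parallel; both followed X */
    }
}

int main(void)
{
    int v[] = { 9, 3, 7, 1, 8, 2, 6, 5, 4, 0, 15, 13, 11, 14, 10, 12 };
    int n = (int)(sizeof v / sizeof v[0]);

    #pragma omp parallel   /* create a thread team ...           */
    #pragma omp single     /* ... and let one thread spawn tasks */
    quicksort(v, 0, n - 1);

    for (int i = 0; i < n; i++)
        printf("%d ", v[i]);
    printf("\n");
    return 0;
}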
Types of Parallelism
Data Parallelism (DP)

In many applications a collection of data is transformed in such a way that the operations on each element are largely independent of the others.
A typical scenario is when we apply the same instruction to a collection of data.
Example: adding two arrays
The same operation (+) applied to a collection of data:
4+5=9, 3+6=9, 6+2=8, 5+4=9, 4+3=7, 7+1=8, 3+4=7
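As a rough sketch (not from the slides), the element-wise addition above can be written as a loop whose iterations are completely independent; with OpenMP (an assumption here, compile with e.g. -fopenmp), a single directive lets those iterations be divided among processors or cores.

#include <stdio.h>

#define N 7

int main(void)
{
    /* The pairs from the slide's figure. */
    int a[N] = { 4, 3, 6, 5, 4, 7, 3 };
    int b[N] = { 5, 6, 2, 4, 3, 1, 4 };
    int c[N];

    /* Each iteration touches only its own element, so the iterations are
     * independent and can run on different processors without coordination. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    for (int i = 0; i < N; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    return 0;
}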
Superscalar and OoO execution

Scalar processors can issue one instruction per cycle.
Superscalar processors can issue more than one instruction per cycle.
A common feature in most modern processors.
By replicating functional units and adding hardware to detect and track instruction dependencies, a superscalar processor takes advantage of Instruction Level Parallelism.
A related technique (applicable to both scalar and superscalar processors) is out-of-order (OoO) execution.
Instructions are reordered (by hardware) for better utilization of pipeline(s).
An excellent example of the use of extra transistors to speed up execution without programmer intervention.
Naturally limited by the available ILP.
Also severely limited by the hardware complexity of dependency checking.
In practice: 2-way superscalar architectures are common, more than 4-way is unlikely.
Vector Processors

Vector processors refer to a previously common supercomputer architecture where a vector is a basic memory abstraction.
Vector: a 1D array of numbers
Examples: Cray 1, IBM 3090/VF.
Example: add two vectors
Scalar solution: loop through the vectors and add each scalar element
Repeated address translations
Branches
Vector solution: add the vectors via a vector addition instruction
Address translations once
No branches
Independent operations enable:
deeper pipelines,
higher clock frequencies, and
multiple concurrent functional units (e.g., SIMD units)
Single Instruction Multiple Data (SIMD)

Several functional units executing the same instruction on different data streams simultaneously and synchronized.
Suitable architecture for many data parallel applications:
Matrix Computations
Graphics Processing
Image Analysis
...
Found primarily in common microprocessors and GPUs:
SIMD instruction extensions such as MMX, SSE, AltiVec.
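As a rough sketch of what such extensions look like in practice (assuming an x86 processor with SSE and a compiler that provides <xmmintrin.h>; none of this code is from the slides), the loop below adds two float arrays four elements per SIMD instruction:

#include <stdio.h>
#include <xmmintrin.h>

/* Add two float arrays: four elements per SSE instruction, with a scalar
 * cleanup loop for any remainder. */
static void add_arrays(float *c, const float *a, const float *b, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);          /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(&b[i]);          /* load 4 floats from b */
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb)); /* one add, 4 results   */
    }
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}

int main(void)
{
    float a[6] = { 1, 2, 3, 4, 5, 6 };
    float b[6] = { 10, 20, 30, 40, 50, 60 };
    float c[6];

    add_arrays(c, a, b, 6);
    for (int i = 0; i < 6; i++)
        printf("%g ", c[i]);
    printf("\n");
    return 0;
}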
Vector Processors

Vector processors are suitable for a certain range of applications.
Traditional, scalar, processor designs have been successful over a larger spectrum of applications.
Economic realities favor large clusters of commodity processors or small-scale SMPs.

Shared Memory Multiprocessor

Multiprocessors where all processors share a single address space are commonly called Shared Memory Multiprocessors.
By another name, Symmetric MultiProcessor (SMP).
They can be classified by the access times different processors have to different memory areas:
Uniform Memory Access (UMA): each processor has the same access time.
Non-Uniform Memory Access (NUMA): some memory is closer to a processor; access time is higher to distant memory.
Furthermore, their caches could be coherent or not.
CC-UMA: Cache-Coherent Uniform Memory Access
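A single address space means every thread sees the same memory without any explicit data transfer. The sketch below is an illustration only (assuming a POSIX system with pthreads; link with -lpthread): two threads fill different halves of one shared array, and the main thread reads the result directly.

#include <stdio.h>
#include <pthread.h>

#define N 8
static int data[N];              /* one copy, visible to every thread */

static void *fill_half(void *arg)
{
    int start = *(int *)arg;     /* 0 or N/2 */
    for (int i = start; i < start + N / 2; i++)
        data[i] = i * i;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    int lo = 0, hi = N / 2;

    pthread_create(&t1, NULL, fill_half, &lo);
    pthread_create(&t2, NULL, fill_half, &hi);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    for (int i = 0; i < N; i++)
        printf("%d ", data[i]);  /* main thread reads what both wrote */
    printf("\n");
    return 0;
}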
Bus-Based UMA MP

[Figure: processor chips, each with a private cache, connected over a shared bus to a single memory.]

NUMA MP

[Figure: processor chips with private caches, each attached to a local memory, connected by an interconnect.]
Multicore

When several processor cores are physically located in the same processor socket we refer to it as a multicore processor.
Both Intel and AMD now have quad-core (4 cores) processors in their product portfolios.
A new desktop computer today is definitely a multiprocessor.
Multicores usually have a single address space and are cache-coherent.
They are very similar to SMPs, but they
typically share one or more levels of cache,
have more favorable inter-processor/core communication speed.

[Figure: a multicore chip with four cores, each with a private L1 cache, sharing an L2 cache and the connection to memory.]
Multicore

Multicore chips have multiple benefits:
Higher peak performance
Power consumption control: some cores can be turned off.
Production yield increase: 8-core chips with a defective core sold with one core disabled.
...but also some potential drawbacks:
Memory bandwidth per core is limited
Lower peak performance per thread
Physical limits such as the number of pins
Some inherently sequential applications may actually run slower.

Distributed Memory Multiprocessor

In contrast to a single address space, machines with multiple private memories are commonly called Distributed Memory Machines.
Data is exchanged between memories via messages communicated over a dedicated network.
When the processor/memory pairs are physically separated, such as on different boards or in different casings, such machines are called Clusters.
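On a distributed memory machine the exchange happens through an explicit message-passing library; MPI is the de facto standard. The sketch below is an illustration only (assuming an MPI implementation such as Open MPI or MPICH; build with mpicc and run with mpirun -np 2): it sends one integer from the private memory of rank 0 to rank 1 over the network.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                          /* lives in rank 0's private memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d via a message\n", value);
    }

    MPI_Finalize();
    return 0;
}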
Cluster

[Figure: several nodes, each with a processor, cache, and memory, connected by a network.]

Clusters: Past and Present

In the past, clusters were exclusively high-end machines with
custom supercomputer processors, and
custom high-performance interconnects.
These machines were very expensive and therefore limited to research and big corporations.
From the '90s onwards it has become increasingly common to build clusters from off-the-shelf components:
commodity processors, and
commodity interconnects (e.g. Ethernet with switches).
The economic benefits and programmer familiarity with commodity components far outweigh the performance issues.
This has helped "democratize" supercomputing:
many corporations and universities have clusters today.
Networks

Access to memory on other nodes is very expensive.
Data must be transferred over a relatively high-latency, low-bandwidth network.
Algorithms with low data locality will suffer.
High synchronization requirements will also degrade performance for the same reason.
The network design is a tradeoff between conflicting goals:
Maximum bandwidth and low latency: full connectivity
Low cost and power consumption: tree network
Switch-based networks are common today; other examples of topologies include rings, meshes, hypercubes, and trees.
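To make the tradeoff concrete: full connectivity needs one link per pair of nodes, n(n-1)/2 in total, while a tree connects n nodes with only n-1 links. These counts are standard graph properties rather than figures from the slides; the small sketch below just prints them for a few node counts.

#include <stdio.h>

int main(void)
{
    for (int n = 4; n <= 64; n *= 2) {
        long full = (long)n * (n - 1) / 2;  /* every node wired to every other */
        long tree = n - 1;                  /* minimal connected topology      */
        printf("%2d nodes: full connectivity = %4ld links, tree = %2ld links\n",
               n, full, tree);
    }
    return 0;
}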