More on Parallel Computing
Spring Semester 2005
Geoffrey Fox
Community Grids Laboratory
Indiana University
505 N Morton, Suite 224
Bloomington IN
[email protected]
Computational Science
What is Parallel Architecture?
• A parallel computer is any old collection of processing
elements that cooperate to solve large problems fast
– from a pile of PCs to a shared-memory multiprocessor
• Some broad issues:
– Resource Allocation:
• how large a collection?
• how powerful are the elements?
• how much memory?
– Data access, Communication and Synchronization
• how do the elements cooperate and communicate?
• how are data transmitted between processors?
• what are the abstractions and primitives for cooperation?
– Performance and Scalability
• how does it all translate into performance?
• how does it scale?
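A standard way (not on the original slide) to quantify the last two questions: if T(1) is the time on one processor and T(p) the time on p processors, then
    speedup    S(p) = T(1) / T(p)
    efficiency E(p) = S(p) / p
Amdahl's law bounds the speedup by S(p) <= 1 / (f + (1 - f)/p), where f is the fraction of the work that cannot be parallelized.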
Parallel Computers -- Classic Overview
• Parallel computers allow several CPUs to contribute to a
computation simultaneously.
• For our purposes, a parallel computer has three types of
parts:
Colors Used in
– Processors
Following
pictures
– Memory modules
– Communication / synchronization network
• Key points:
–
–
–
–
All processors must be busy for peak speed.
Local memory is directly connected to each processor.
Accessing local memory is much faster than other memory.
Synchronization is expensive, but necessary for correctness.
Distributed Memory Machines
• Every processor has a
memory others can’t access.
• Advantages:
– Relatively easy to design and
build
– Predictable behavior
– Can be scalable
– Can hide latency of
communication
• Disadvantages:
– Hard to program
– Program and O/S (and
sometimes data) must be
replicated
Communication on Distributed Memory Architecture
(Figure: explicit messages exchanged between processors' memories)
• On distributed memory machines, each chunk of decomposed data resides
in a separate memory space -- a single processor is typically responsible
for both storing and processing that chunk (the owner-computes rule)
• Information needed on the edges for an update must be communicated via
explicitly generated messages
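As a concrete illustration of such explicitly generated messages (a minimal sketch, not taken from these slides; the array size and data are assumed), here is a one-dimensional ghost-cell exchange in C with MPI:

    /* Minimal 1D ghost-cell exchange sketch (illustrative only). */
    #include <mpi.h>

    #define N 1000                        /* interior points per rank (assumed) */

    int main(int argc, char **argv) {
        int rank, size;
        double u[N + 2];                  /* u[0] and u[N+1] are ghost cells */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int i = 0; i <= N + 1; i++) u[i] = (double)rank;   /* dummy data */

        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        /* Send my right edge to the right neighbour while receiving the
           left neighbour's right edge into my left ghost cell, then do
           the mirror-image exchange. */
        MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,
                     &u[0], 1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  1,
                     &u[N + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* ... owner-computes update of u[1..N] would go here ... */

        MPI_Finalize();
        return 0;
    }

Each rank owns u[1..N]; the two MPI_Sendrecv calls fill the ghost cells u[0] and u[N+1] with the neighbouring ranks' edge values before the owner-computes update.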
Distributed Memory Machines -- Notes
• Conceptually, the nCUBE, CM-5, Paragon, SP-2, Beowulf PC cluster, and
BlueGene are quite similar.
• The bandwidth and latency of their interconnects differ
• The network topology is a two-dimensional torus for the Paragon, a
three-dimensional torus for BlueGene, a fat tree for the CM-5, a
hypercube for the nCUBE, and a switch for the SP-2
• To program these machines:
• Divide the problem to minimize the number of messages while retaining
parallelism
• Convert all references to global structures into references to local
pieces (explicit messages convert distant to local variables)
• Optimization: Pack messages together to reduce fixed overhead (almost
always needed) -- see the sketch after this list
• Optimization: Carefully schedule messages (usually done by a library)
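A hedged sketch of the second and third points above (not from the slides; the distribution, sizes, and helper names are illustrative). With a 1D block distribution, a reference to a global index either becomes a local index or must arrive by message from the owning processor, and a whole boundary strip is packed into one message rather than sent value by value:

    /* Illustrative helpers for a 1D block distribution where rank r
       owns global indices [r*nlocal, (r+1)*nlocal). */
    #include <mpi.h>

    /* Which rank owns global index g, and where it lives locally. */
    static int owner(int g, int nlocal)           { return g / nlocal; }
    static int global_to_local(int g, int nlocal) { return g % nlocal; }

    /* Packing: copy a whole boundary column of a 2D block into one
       contiguous buffer and send it as a single message, so the fixed
       per-message overhead (latency) is paid once, not ny times. */
    static void send_boundary_column(const double *a, int nx, int ny,
                                     int col, int dest, MPI_Comm comm) {
        double buf[1024];                         /* assumes ny <= 1024 */
        for (int j = 0; j < ny; j++)
            buf[j] = a[j * nx + col];             /* gather the strip   */
        MPI_Send(buf, ny, MPI_DOUBLE, dest, 0, comm);
    }

With these conventions, a loop written over global indices can be rewritten to run only over the locally owned range, which is the essence of the "global to local" conversion.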
BlueGene/L has Classic Architecture
32,768-node BlueGene/L takes #1 TOP500 position, 29 Sept 2004
70.7 Teraflops
BlueGene/L Fundamentals
• Low-complexity nodes give more flops per transistor and per watt
• 3D interconnect supports many scientific simulations, as nature as we
see it is 3D
1987 MPP
(Figure: 1024-node full system with hypercube interconnect)
Shared-Memory Machines
• All processors access the same
memory.
• Advantages:
– Retain sequential programming
languages such as Java or Fortran
– Easy to program (correctly)
– Can share code and data among
processors
• Disadvantages:
– Hard to program (optimally)
– Not scalable due to bandwidth
limitations in bus
Communication on Shared Memory Architecture
• On a shared memory machine a CPU is responsible for processing a
decomposed chunk of data but not for storing it
• The nature of the parallelism is identical to that for distributed
memory machines, but communication is implicit, as one "just" accesses
memory
Shared-Memory Machines -- Notes
• Interconnection network varies from machine to
machine
• These machines share data by direct access.
– Potentially conflicting accesses must be protected by
synchronization.
– Simultaneous access to the same memory bank will
cause contention, degrading performance.
– Some access patterns will collide in the network (or
bus), causing contention.
– Many machines have caches at the processors.
– All these features make it profitable to have each
processor concentrate on one area of memory that
others access infrequently.
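As a small illustration of the synchronization point above (a sketch, not from the slides; thread and iteration counts are arbitrary), here is a C/pthreads example in which simultaneous updates of a shared counter are protected by a mutex:

    /* Sketch: protecting a shared counter so that simultaneous updates
       from several threads do not conflict (illustrative only). */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NITER    100000

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *work(void *arg) {
        (void)arg;
        for (int i = 0; i < NITER; i++) {
            pthread_mutex_lock(&lock);    /* synchronization is expensive ...   */
            counter++;                    /* ... but necessary for correctness  */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, work, NULL);
        for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);   /* NTHREADS * NITER */
        return 0;
    }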
Distributed Shared Memory Machines
• Combining the
(dis)advantages of shared
and distributed memory
• Lots of hierarchical
designs.
– Typically, “shared memory
nodes” with 4 to 32
processors
– Each processor has a local
cache
– Processors within a node
access shared memory
– Nodes can get data from or
put data to other nodes’
memories
Summary on Communication etc.
• Distributed Shared Memory machines have
communication features of both distributed (messages)
and shared (memory access) architectures
• Note that for distributed memory, the programming model must express
data location (the HPF Distribute command) and the invocation of
messages (MPI syntax)
• For shared memory, one needs to express control (openMP) or processing
parallelism and synchronization -- one needs to make certain that when a
variable is updated, the "correct" version is used by other processors
accessing the variable, and that values living in caches are updated
Seismic Simulation of Los Angeles Basin
• This is a (sophisticated) wave equation similar to the Laplace example;
you divide Los Angeles geometrically and assign a roughly equal number of
grid points to each processor (a small sketch of this division follows)
(Figure: a computer with 4 processors; the problem represented by grid
points and divided into 4 domains)
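A minimal sketch (not from the slides) of "assign a roughly equal number of grid points to each processor" for a one-dimensional strip of the grid; the grid size and processor count are assumed:

    /* Sketch: divide NX grid points as evenly as possible among P
       processors (block decomposition).  Illustrative values only. */
    #include <stdio.h>

    int main(void) {
        const int NX = 1000;   /* grid points (assumed) */
        const int P  = 4;      /* processors            */

        for (int p = 0; p < P; p++) {
            int base  = NX / P, extra = NX % P;
            int count = base + (p < extra ? 1 : 0);          /* points owned */
            int first = p * base + (p < extra ? p : extra);  /* first index  */
            printf("processor %d owns points %d..%d (%d points)\n",
                   p, first, first + count - 1, count);
        }
        return 0;
    }

The same arithmetic can be applied in each coordinate direction for the 2D domains in the figure.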
Communication Must be Reduced
• 4 by 4 regions in each processor
– 16 Green (Compute) and 16 Red (Communicate) Points
• 8 by 8 regions in each processor
– 64 Green and "just" 32 Red Points
• Communication is an edge effect
• Give each processor plenty of memory and increase the region in each
machine
• Large Problems Parallelize Best
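To make the edge effect quantitative (a standard observation, added here): for an n by n region per processor the compute work grows as n^2 while the communicated edge grows as roughly 4n, so the communication-to-computation ratio falls as 4/n. Doubling n from 4 to 8 takes the slide's ratio from 16/16 = 1 down to 32/64 = 0.5, which is why large problems parallelize best.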
Irregular 2D Simulation -- Flow over an Airfoil
• The Laplace grid points become finite element mesh nodal points
arranged as triangles filling space
• All the action (triangles) is near the wing boundary
• Use domain decomposition, but no longer equal area; rather equal
triangle count
Heterogeneous Problems
• Simulation of a cosmological cluster (say 10 million stars)
• Lots of work per star where stars are very close together (may need a
smaller time step)
• Little work per star where the force changes slowly and can be well
approximated by a low-order multipole expansion
Load Balancing Particle Dynamics
• Particle dynamics of this type (irregular with sophisticated force
calculations) always needs complicated decompositions
• Equal-area decompositions, as shown here, lead to load imbalance
(Figure: equal-volume decomposition of a universe simulation on 16
processors; each point is a galaxy or star or ...)
• If one uses simpler algorithms (full O(N^2) forces) or an FFT, then
equal area is best
Reduce Communication
• Consider a geometric problem with 4 processors
• In the top decomposition (Block Decomposition), we divide the domain
into 4 blocks with all points in a given block contiguous
• In the bottom decomposition (Cyclic Decomposition), we give each
processor the same amount of work but divided into 4 separate domains
• edge/area(bottom) = 2 * edge/area(top)
• So minimizing communication implies we keep the points in a given
processor together (see the sketch below)
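A small sketch (not from the slides) of the two ownership rules for a one-dimensional index space; N and P are illustrative:

    /* Sketch: which processor owns global index i under a block versus
       a cyclic decomposition of N indices over P processors. */
    #include <stdio.h>

    int block_owner(int i, int N, int P)  { return i / ((N + P - 1) / P); }
    int cyclic_owner(int i, int P)        { return i % P; }

    int main(void) {
        const int N = 16, P = 4;
        for (int i = 0; i < N; i++)
            printf("i=%2d  block owner=%d  cyclic owner=%d\n",
                   i, block_owner(i, N, P), cyclic_owner(i, P));
        return 0;
    }

Under the block rule neighbouring indices usually live on the same processor, which is what keeps the edge/area ratio low.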
Minimize Load Imbalance
• But this has a flip side. Suppose we are decomposing the seismic wave
problem and all the action is near a particular earthquake fault.
• In the top (Block) decomposition only the white processor does any work
while the other 3 sit idle
– Efficiency is 25% due to load imbalance
• In the bottom (Cyclic) decomposition all the processors do roughly the
same work and so we get good load balance
Parallel Irregular Finite Elements
• Here is a cracked plate; calculating stresses with an equal-area
decomposition leads to terrible results
– All the work is near the crack
(Figure: equal-area decomposition of the plate, one region per processor)
Irregular Decomposition for Crack
(Figure: region assigned to one processor; the workload per processor is
not perfect, but close)
• Concentrating processors near the crack leads to good workload balance
• Equal nodal points per processor -- not equal area -- but to minimize
communication the nodal points assigned to a particular processor are
contiguous
• This is an NP-complete (exponentially hard) optimization problem, but
in practice there are many ways of getting good, though not exact,
decompositions (one common heuristic is sketched below)
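One widely used heuristic for such decompositions (named here as an example; the slides do not specify a method) is recursive coordinate bisection: repeatedly split the point set into two roughly equal halves along alternating coordinate directions until there is one contiguous piece per processor. A minimal 2D sketch with illustrative data:

    /* Sketch of recursive coordinate bisection (RCB): split the points
       in half along x, then each half along y, and so on, until there
       is one piece per processor.  Illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { double x, y; int proc; } Point;

    static int cmp_x(const void *a, const void *b) {
        double d = ((const Point *)a)->x - ((const Point *)b)->x;
        return (d > 0) - (d < 0);
    }
    static int cmp_y(const void *a, const void *b) {
        double d = ((const Point *)a)->y - ((const Point *)b)->y;
        return (d > 0) - (d < 0);
    }

    /* Assign processors first..first+nproc-1 to pts[0..n-1]. */
    static void rcb(Point *pts, int n, int first, int nproc, int axis) {
        if (nproc == 1) {                      /* one piece left: label it */
            for (int i = 0; i < n; i++) pts[i].proc = first;
            return;
        }
        qsort(pts, n, sizeof(Point), axis ? cmp_y : cmp_x);
        int pleft = nproc / 2;
        int nleft = (int)((long)n * pleft / nproc);   /* work proportional to processors */
        rcb(pts,         nleft,     first,         pleft,         !axis);
        rcb(pts + nleft, n - nleft, first + pleft, nproc - pleft, !axis);
    }

    int main(void) {
        enum { N = 32, P = 4 };
        Point pts[N];
        for (int i = 0; i < N; i++) {          /* random cloud of points */
            pts[i].x = rand() / (double)RAND_MAX;
            pts[i].y = rand() / (double)RAND_MAX;
        }
        rcb(pts, N, 0, P, 0);
        for (int i = 0; i < N; i++)
            printf("(%.2f, %.2f) -> processor %d\n", pts[i].x, pts[i].y, pts[i].proc);
        return 0;
    }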
Further Decomposition Strategies
• Not all decompositions are quite the same
• In defending against missile attacks, you track each missile on a
separate node -- geometric again
(Figure: "California gets its independence")
• In playing chess, you decompose the chess tree -- an abstract, not
geometric, space
(Figure: computer chess tree -- current position (node in tree), first
set of moves, opponent's counter moves)
Summary of Parallel Algorithms
• A parallel algorithm is a collection of tasks and a partial
ordering between them.
• Design goals:
– Match tasks to the available processors (exploit parallelism).
– Minimize ordering (avoid unnecessary synchronization points).
– Recognize ways parallelism can be helped by changing
ordering
• Sources of parallelism:
– Data parallelism: updating array elements simultaneously.
– Functional parallelism: conceptually different tasks which
combine to solve the problem. This happens at fine and coarse
grain size
• fine is “internal” such as I/O and computation; coarse is “external”
such as separate modules linked together
Data Parallelism in Algorithms
• Data-parallel algorithms exploit the parallelism inherent in many
large data structures.
– A problem is an (identical) algorithm applied to multiple points in data “array”
– Usually iterate over such “updates”
• Features of Data Parallelism
– Scalable parallelism -- can often get million or more way parallelism
– Hard to express when “geometry” irregular or dynamic
• Note data-parallel algorithms can be expressed by ALL
programming models (Message Passing, HPF like, openMP like)
Functional Parallelism in Algorithms
• Functional parallelism exploits the parallelism between the parts
of many systems.
– Many pieces to work on => many independent operations
– Example: Coarse grain Aeroelasticity (aircraft design)
• CFD(fluids) and CSM(structures) and others (acoustics, electromagnetics etc.)
can be evaluated in parallel
• Analysis:
– Parallelism limited in size -- tens not millions
– Synchronization probably good as parallelism natural from problem and
usual way of writing software
– Web exploits functional parallelism NOT data parallelism
Pleasingly Parallel Algorithms
• Many applications are what is called (essentially) embarrassingly, or
more kindly pleasingly, parallel
• These are made up of independent concurrent components (a Monte Carlo
sketch follows below):
– Each client independently accesses a Web Server
– Each roll of a Monte Carlo dice (random number) is an independent sample
– Each stock can be priced separately in a financial portfolio
– Each transaction in a database is almost independent (a given account
is locked but usually different accounts are accessed at the same time)
– Different parts of Seismic data can be processed independently
• In contrast, points in a finite difference grid (from a differential
equation) cannot be updated independently
• Such problems are often formally data-parallel but can be handled much
more easily -- like functional parallelism
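As an illustration of pleasingly parallel Monte Carlo sampling (a sketch, not from the slides; the sample count and seeds are arbitrary), each MPI rank below draws its own independent samples to estimate pi, and the only communication is one reduction at the end:

    /* Sketch: pleasingly parallel Monte Carlo estimate of pi.  Every
       rank draws independent random samples; the only communication is
       a single reduction at the end.  Illustrative values. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long n = 1000000;            /* samples per rank (assumed)    */
        srand(12345u + 17u * rank);        /* give each rank its own stream */
        long hits = 0;
        for (long i = 0; i < n; i++) {
            double x = rand() / (double)RAND_MAX;
            double y = rand() / (double)RAND_MAX;
            if (x * x + y * y <= 1.0) hits++;
        }

        long total = 0;
        MPI_Reduce(&hits, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi ~= %f\n", 4.0 * total / (double)(n * size));

        MPI_Finalize();
        return 0;
    }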
Parallel Languages
• A parallel language provides an executable notation for
implementing a parallel algorithm.
• Design criteria:
– How are parallel operations defined?
• static tasks vs. dynamic tasks vs. implicit operations
– How is data shared between tasks?
• explicit communication/synchronization vs. shared memory
– How is the language implemented?
• low-overhead runtime systems vs. optimizing compilers
• Usually a language reflects a particular style of expressing parallelism.
• Data parallel expresses the concept of an identical algorithm applied to
different parts of an array
• Message parallel expresses the fact that, at a low level, parallelism
implies information is passed between different concurrently executing
program parts
Data-Parallel Languages
• Data-parallel languages provide an abstract, machine-independent model
of parallelism.
– Fine-grain parallel operations, such as element-wise operations on arrays
– Shared data in large, global arrays with mapping "hints"
– Implicit synchronization between operations
– Partially explicit communication from operation definitions
• Advantages:
– Global operations conceptually simple
– Easy to program (particularly for certain scientific applications)
• Disadvantages:
– Unproven compilers
– Because they express the "problem", they can be inflexible if a new
algorithm is not well expressed by the language
• Examples: HPF
• Originated on SIMD machines where parallel operations are in lock-step,
but generalized (not so successfully, as the compilers are too hard) to
MIMD
Approaches to Parallel Programming
• Data Parallel is typified by CMFortran and its generalization, High
Performance Fortran (HPF), which in previous years we discussed in detail
but this year we will not; see the Source Book for more on HPF
• Typical Data Parallel Fortran statements are full array statements
– B = A1 + A2
– B = EOSHIFT(A, -1)
– Function operations on arrays representing the full data domain
• Message Passing is typified by the later discussion of the Laplace
example; it specifies specific machine actions (i.e., send a message
between nodes), whereas the data parallel model is at a higher level, as
it (tries to) specify a problem feature
• Note: We are always using "data parallelism" at the problem level,
whether the software is "message passing" or "data parallel"
• Data parallel software is translated by a compiler into "machine
language", which is typically message passing on a distributed memory
machine and threads on a shared memory machine (see the sketch below)
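A hedged sketch of what that translation might look like for the shift statement above: on a block-distributed 1D array, B = EOSHIFT(A, -1), i.e. B(i) = A(i-1) with a zero filling the first element, becomes a purely local copy plus one message for the single element that crosses each processor boundary. Sizes and data are illustrative:

    /* Sketch: B = EOSHIFT(A, -1) "compiled" by hand into message
       passing for a block-distributed array.  Illustrative only. */
    #include <mpi.h>
    #include <stdio.h>

    #define NLOCAL 250            /* elements owned per rank (assumed) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double a[NLOCAL], b[NLOCAL];
        for (int i = 0; i < NLOCAL; i++)
            a[i] = rank * NLOCAL + i;            /* global index as dummy data */

        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        /* The only element that crosses a processor boundary: my last
           element goes right, my left neighbour's last element arrives. */
        double from_left = 0.0;                  /* EOSHIFT boundary value */
        MPI_Sendrecv(&a[NLOCAL - 1], 1, MPI_DOUBLE, right, 0,
                     &from_left,     1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        b[0] = from_left;                        /* stays 0.0 on rank 0 (end-off) */
        for (int i = 1; i < NLOCAL; i++)         /* purely local shift            */
            b[i] = a[i - 1];

        printf("rank %d: b[0] = %g\n", rank, b[0]);
        MPI_Finalize();
        return 0;
    }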
Shared Memory Programming Model
• Experts in Java are familiar with this, as it is built into the Java
language through thread primitives
• We take "ordinary" languages such as Fortran, C++, and Java and add
constructs to help compilers divide processing (automatically) into
separate threads
– indicate which DO/for loop instances can be executed in parallel and
where there are critical sections with global variables etc.
• openMP is a recent set of compiler directives supporting this model
(a small sketch follows below)
• This model tends to be inefficient on distributed memory machines, as
optimizations (data layout, communication blocking, etc.) are not natural
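A minimal openMP sketch of the constructs just described (not from the slides; array contents and sizes are arbitrary): a directive marking a for loop whose iterations may run in parallel, and a critical section protecting a global variable:

    /* Sketch: OpenMP directives for a parallel loop and a protected
       update of a global variable (illustrative only). */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    double global_max = 0.0;              /* shared across all threads */

    int main(void) {
        static double a[N];

        #pragma omp parallel for          /* iterations are independent */
        for (int i = 0; i < N; i++)
            a[i] = (i % 97) * 0.5;

        #pragma omp parallel
        {
            double local_max = 0.0;       /* each thread's private maximum */
            #pragma omp for
            for (int i = 0; i < N; i++)
                if (a[i] > local_max) local_max = a[i];

            #pragma omp critical          /* combine results one thread at a time */
            {
                if (local_max > global_max) global_max = local_max;
            }
        }

        printf("max = %f\n", global_max); /* expect 48.0 */
        return 0;
    }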
Structure(Architecture) of Applications - I
• Applications are metaproblems with a mix of module (aka coarse
grain functional) and data parallelism
• Modules are decomposed into parts (data parallelism) and composed
hierarchically into full applications. They can be the
– "10,000" separate programs (e.g. structures, CFD, ...) used in the
design of an aircraft
– the various filters used in Adobe Photoshop or the Matlab image
processing system
– the ocean-atmosphere components in integrated climate
simulation
– The data-base or file system access of a data-intensive
application
– the objects in a distributed Forces Modeling Event Driven
Simulation
Structure(Architecture) of Applications - II
• Modules are “natural” message-parallel components of problem
and tend to have less stringent latency and bandwidth
requirements than those needed to link data-parallel
components
– modules are what HPF needs task parallelism for
– Often modules are naturally distributed whereas parts of data parallel
decomposition may need to be kept on tightly coupled MPP
• Assume that the primary goal of a metacomputing system is to add to
existing parallel computing environments a higher level supporting
module parallelism
– Now if one takes a large CFD problem and divides into a few components,
those “coarse grain data-parallel components” will be supported by
computational grid technology
• Use Java/Distributed Object Technology for modules -- note Java to growing
extent used to write servers for CORBA and COM object systems
Multi-Server Model for Metaproblems
• We have multiple supercomputers in the backend -- one doing a CFD
simulation of airflow, another structural analysis -- while in more
detail you have linear algebra servers (NetSolve), optimization servers
(NEOS), image processing filters (Khoros), databases (NCSA Biology
Workbench), and visualization systems (AVS, CAVEs)
– One runs 10,000 separate programs to design a modern aircraft, which
must be scheduled and linked ...
• All are linked to collaborative information systems in a sea of
middle-tier servers (as on the previous page) to support design, crisis
management, and multi-disciplinary research
Multi-Server Scenario
(Figure: diagram with the following components -- Gateway Control,
Multidisciplinary Control (WebFlow), NEOS Control Optimization,
Agent-based Choice of Compute Engine, Parallel DB Proxy, Database,
Optimization Service, Origin 2000 Proxy, NetSolve Linear Algebra Server,
MPP Matrix Solver, IBM SP2 Proxy, Data Analysis Server, MPP)