Lecture 12
Parallel Computing
PVM (Parallel Virtual Machine)
MPI (Message Passing Interface)
CUDA (Compute Unified Device Architecture)
Parallel C# - Lightweight Parallelism
There are several parallel programming models in common use:
Shared Memory
Threads
Message Passing
Data Parallel
Hybrid
https://computing.llnl.gov/tutorials/parallel_comp/
Parallel Programming Models are Abstract
Although it might not seem apparent, these models are NOT specific to a particular
type of machine or memory architecture. In fact, any of these models can
(theoretically) be implemented on any underlying hardware. Two examples:
1. Shared memory model on a distributed memory machine: Kendall Square Research
(KSR) ALLCACHE approach.
Machine memory was physically distributed, but appeared to the user as a single shared
memory (global address space). Generically, this approach is referred to as "virtual shared
memory". Note: although KSR is no longer in business, there is no reason to suggest that a
similar implementation will not be made available by another vendor in the future.
2. Message passing model on a shared memory machine: MPI on SGI Origin.
The SGI Origin employed the CC-NUMA type of shared memory architecture, where
every task has direct access to global memory. However, the ability to send and receive
messages with MPI, as is commonly done over a network of distributed memory
machines, is not only implemented but is very commonly used.
https://computing.llnl.gov/tutorials/parallel_comp/
Shared Memory Model
In the shared-memory programming model, tasks share a common address space,
which they read and write asynchronously.
Various mechanisms such as locks/semaphores may be used to control access to the
shared memory.
An advantage of this model from the programmer's point of view is that the notion of
data ownership is lacking, so there is no need to specify explicitly the communication of
data between tasks. Program development can often be simplified.
An important disadvantage in terms of performance is that it becomes more difficult to
understand and manage data locality.
Keeping data local to the processor that works on it conserves memory accesses, cache
refreshes and bus traffic that occurs when multiple processors use the same data.
Unfortunately, controlling data locality is hard to understand and beyond the control of
the average user.
https://computing.llnl.gov/tutorials/parallel_comp/
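A minimal sketch of this idea in C, assuming POSIX threads: two threads share the variable counter in a common address space, and a mutex (one of the lock mechanisms mentioned above) serializes their updates so that no increment is lost. The function and variable names here are illustrative only.
#include <pthread.h>
#include <stdio.h>

long counter = 0;                                  /* shared address space */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* controls access to it */

void *work(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* only one task updates at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, NULL);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* 200000: no updates were lost */
    return 0;
}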
Threads
In the threads model of parallel programming, a single process can have multiple,
concurrent execution paths.
Perhaps the simplest analogy that can be used to describe threads is the concept of
a single program that includes a number of subroutines.
Threads are commonly associated with shared memory architectures and operating
systems.
https://computing.llnl.gov/tutorials/parallel_comp/
Two Types of Threads
Unrelated standardization efforts have resulted in two very different implementations
of threads: POSIX Threads and OpenMP.
POSIX Threads
* Library based; requires parallel coding
* Specified by the IEEE POSIX 1003.1c standard (1995).
* C Language only
* Commonly referred to as Pthreads.
* Most HW vendors offer Pthreads in addition to proprietary implementations.
* Very explicit parallelism; requires significant programmer attention to detail.
OpenMP
* Compiler directive based; can use serial code
* Endorsed by HW & SW vendors.
* Portable/multi-platform, including Unix and Windows NT platforms
* Available in C/C++ and Fortran implementations
* Can be very simple to use - provides for "incremental parallelism" (see the short sketch below)
https://computing.llnl.gov/tutorials/parallel_comp/
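A small illustrative sketch of OpenMP's "incremental parallelism" in C: the loop below is ordinary serial code, and the single directive (plus compiling with OpenMP support, e.g. a flag such as -fopenmp) is the only change assumed to make it run in parallel.
#include <stdio.h>
#include <omp.h>

int main(void) {
    double sum = 0.0;

    /* The directive is the only change from the serial version of this loop. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += 1.0 / (i + 1);

    printf("sum = %f (computed with up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}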
Message Passing Model
The message passing model demonstrates the following characteristics:
A set of tasks that use their own local memory during computation. Multiple tasks can
reside on the same physical machine and/or across an arbitrary number of machines.
Tasks exchange data through communications by sending and receiving messages.
Data transfer usually requires cooperative operations to be performed by each process.
For example, a send operation must have a matching receive operation.
https://computing.llnl.gov/tutorials/parallel_comp/
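A minimal MPI sketch in C of a matching send/receive pair (it assumes the program is launched with at least two processes): task 0's MPI_Send is paired with task 1's MPI_Recv, illustrating the cooperative nature of the data transfer.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* The send on task 0 ... */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ... must be matched by a receive on task 1. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Task 1 received %d from task 0\n", value);
    }

    MPI_Finalize();
    return 0;
}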
Data Parallel Model
The data parallel model demonstrates the following characteristics:
Most parallel work focuses on performing operations on a data set. The data set is
typically organized into a common structure, such as an array or cube.
A set of tasks work collectively on the same data structure, however, each task works
on a different partition of the same data structure.
Tasks perform the same operation on their partition of work, for example, "add 4 to
every array element".
On shared memory architectures, all tasks may have access to the data structure
through global memory. On distributed memory architectures, the data structure is
split up and resides as chunks in the local memory of each task.
https://computing.llnl.gov/tutorials/parallel_comp/
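A short sketch of the "add 4 to every array element" example in C, assuming OpenMP is used to divide the loop iterations (the partitions of the data) among threads.
#include <stdio.h>

#define N 16

int main(void) {
    int data[N];
    for (int i = 0; i < N; i++) data[i] = i;

    /* Every task applies the same operation, "add 4 to every array element",
       to its own partition of the array. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        data[i] += 4;

    for (int i = 0; i < N; i++) printf("%d ", data[i]);
    printf("\n");
    return 0;
}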
Hybrid Parallel Model
In this model, any two or more parallel programming models are combined.
Currently, a common example of a hybrid model is the combination of the message
passing model (MPI) with either the threads model (POSIX threads) or the shared
memory model (OpenMP). This hybrid model lends itself well to the increasingly
common hardware environment of networked SMP machines.
Another common example of a hybrid model is combining data parallel with message
passing. Distributed memory architectures use message passing to transmit data
between tasks, transparently to the programmer.
https://computing.llnl.gov/tutorials/parallel_comp/
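A minimal sketch of the MPI + OpenMP hybrid in C: MPI tasks handle communication between nodes, while each task spawns OpenMP threads that share memory within a node. This is only one common way to combine the models.
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;

    /* MPI handles message passing between the (possibly networked) tasks ... */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... while OpenMP threads share memory within each task's node. */
    #pragma omp parallel
    {
        printf("MPI task %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}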
Single Program Multiple Data (SPMD)
SPMD is actually a "high level" programming model that can be built upon any
combination of the previously mentioned parallel programming models.
A single program is executed by all tasks simultaneously.
At any moment in time, tasks can be executing the same or different instructions within
the same program.
SPMD programs usually have logic programmed into them to allow different tasks to
branch or conditionally execute only those parts of the program they are designed to
execute. That is, tasks do not necessarily have to execute the entire program - perhaps
only a portion of it.
All tasks may use different data
https://computing.llnl.gov/tutorials/parallel_comp/
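A minimal SPMD sketch in C using MPI: every task runs the same program but branches on its rank, so each task executes only the portion of the program it is designed to execute.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Only task 0 executes the coordination branch. */
        printf("Task 0 of %d: coordinating\n", size);
    } else {
        /* All other tasks execute the worker branch on their own data. */
        printf("Task %d of %d: working on partition %d\n", rank, size, rank);
    }

    MPI_Finalize();
    return 0;
}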
Multiple Program Multiple Data (MPMD)
Like SPMD, MPMD is actually a "high level" programming model that can be built upon
any combination of the previously mentioned parallel programming models.
MPMD applications typically have multiple executable object files (programs). While
the application is being run in parallel, each task can be executing the same or different
program as other tasks.
All tasks may use different data
https://computing.llnl.gov/tutorials/parallel_comp/
Automatic vs. Manual Parallelization
Designing and developing parallel programs has characteristically been a very manual
process. The programmer is typically responsible for both identifying and actually
implementing parallelism.
Very often, manually developing parallel codes is a time consuming, complex, error-prone
and iterative process.
For a number of years now, various tools have been available to assist the programmer
with converting serial programs into parallel programs. The most common type of tool
used to automatically parallelize a serial program is a parallelizing compiler or preprocessor.
https://computing.llnl.gov/tutorials/parallel_comp/
Parallelizing Compilers
A parallelizing compiler generally works in two different ways:
Fully Automatic
The compiler analyzes the source code and identifies opportunities for parallelism.
The analysis includes identifying inhibitors to parallelism and possibly a cost
weighting on whether or not the parallelism would actually improve performance.
Loops (do, for) are the most frequent target for automatic parallelization.
Programmer Directed
Using "compiler directives" or possibly compiler flags, the programmer explicitly
tells the compiler how to parallelize the code.
Compiler directives may also be used in conjunction with some degree of automatic
parallelization (see the sketch below).
https://computing.llnl.gov/tutorials/parallel_comp/
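An illustrative C sketch of the two cases, assuming OpenMP-style directives for the programmer-directed route; the function and array names are hypothetical.
/* Independent iterations: a good candidate for fully automatic parallelization,
   or for an explicit programmer-supplied directive as shown here. */
void vector_add(int n, const double *a, const double *b, double *c) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Loop-carried dependence: iteration i reads the result of iteration i-1,
   an inhibitor that a parallelizing compiler will (correctly) refuse to parallelize. */
void prefix_sum(int n, double *a) {
    for (int i = 1; i < n; i++)
        a[i] = a[i] + a[i-1];
}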
Issues with Automatic Parallelization
If you are beginning with an existing serial code and have time or budget constraints,
then automatic parallelization may be the answer. However, there are several important
caveats that apply to automatic parallelization:
Wrong results may be produced
Performance may actually degrade
Much less flexible than manual parallelization
Limited to a subset (mostly loops) of code
May actually not parallelize code if the analysis suggests there are
inhibitors or the code is too complex
https://computing.llnl.gov/tutorials/parallel_comp/
Understanding the Problem
Undoubtedly, the first step in developing parallel software is to understand the
problem that you wish to solve in parallel. If you are starting with a serial program, this
necessitates understanding the existing code as well.
Before spending time in an attempt to develop a parallel solution for a problem,
determine whether or not the problem is one that can actually be parallelized.
https://computing.llnl.gov/tutorials/parallel_comp/
Example of a Parallelizable Problem:
Calculate the potential energy for each of several thousand independent
conformations of a molecule. When done, find the minimum energy
conformation.
This problem can be solved in parallel. Each of the molecular conformations is
independently determinable. The calculation of the minimum energy conformation is
also a parallelizable problem.
Example of a Non-parallelizable Problem:
Calculation of the Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, ...) by use of the
formula:
F(n) = F(n-1) + F(n-2)
This is a non-parallelizable problem because the calculation of the Fibonacci
sequence as shown would entail dependent calculations rather than independent
ones. The calculation of the F(n) value uses those of both F(n-1) and F(n-2). These
three terms cannot be calculated independently and therefore, not in parallel.
https://computing.llnl.gov/tutorials/parallel_comp/
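A short C sketch contrasting the two examples (potential_energy and the array names are hypothetical placeholders): the conformation loop has independent iterations, while each Fibonacci iteration depends on the two previous ones.
double potential_energy(double conformation);   /* hypothetical energy function */

/* Parallelizable: each conformation's energy is independent of the others. */
void energies(int n, const double *conformation, double *energy) {
    for (int i = 0; i < n; i++)
        energy[i] = potential_energy(conformation[i]);
}

/* Non-parallelizable as written: F(n) depends on F(n-1) and F(n-2),
   so iteration i cannot start until iterations i-1 and i-2 have finished. */
void fibonacci(int n, long *fib) {
    fib[0] = 1;
    fib[1] = 1;
    for (int i = 2; i < n; i++)
        fib[i] = fib[i-1] + fib[i-2];
}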
Methods to Improve Parallel Algorithm Performance
1. Identify the program's hotspots:
   Know where most of the real work is being done. The majority of scientific and
   technical programs usually accomplish most of their work in a few places.
   Profilers and performance analysis tools can help here.
   Focus on parallelizing the hotspots; ignore sections with little CPU usage.
2. Identify bottlenecks in the program:
   Are there areas that are disproportionately slow?
   It may be possible to restructure the program or use a different algorithm.
3. Identify inhibitors to parallelism. One common class of inhibitor is data dependence,
   as demonstrated by the Fibonacci sequence above.
4. Investigate other algorithms if possible. This may be the single most important
   consideration when designing a parallel application.
https://computing.llnl.gov/tutorials/parallel_comp/
Partitioning
One of the first steps in designing a parallel program is to break the problem into
discrete chunks of work that can be distributed to multiple tasks. This is known as
decomposition or partitioning.
There are two basic ways to partition computational work among parallel tasks:
domain decomposition and functional decomposition.
https://computing.llnl.gov/tutorials/parallel_comp/
Domain Decomposition
In this type of partitioning, the data associated with a problem is decomposed. Each
parallel task then works on a portion of the data.
There are different ways to partition data (a sketch of both follows below):
BLOCK
CYCLIC
https://computing.llnl.gov/tutorials/parallel_comp/
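A small C sketch of the two partitionings, assuming N array elements are distributed over P tasks; the function names are illustrative only.
#include <stdio.h>

/* BLOCK: each task owns one contiguous chunk of the array. */
void block_indices(int N, int P, int rank, int *lo, int *hi) {
    int chunk = (N + P - 1) / P;                 /* ceiling of N/P */
    *lo = rank * chunk;
    *hi = (*lo + chunk < N) ? *lo + chunk : N;   /* exclusive upper bound */
}

/* CYCLIC: elements are dealt out to tasks round-robin, like cards. */
int cyclic_owner(int index, int P) {
    return index % P;
}

int main(void) {
    int lo, hi;
    block_indices(100, 4, 1, &lo, &hi);
    printf("BLOCK:  task 1 of 4 owns elements [%d, %d)\n", lo, hi);
    printf("CYCLIC: element 13 belongs to task %d of 4\n", cyclic_owner(13, 4));
    return 0;
}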
Functional Decomposition
In this approach, the focus is on the computation that is to be performed rather than
on the data manipulated by the computation. The problem is decomposed according
to the work that must be done. Each task then performs a portion of the overall work.
https://computing.llnl.gov/tutorials/parallel_comp/
Ecosystem Modeling
Example of Functional Decomposition
Each program calculates the population of a given group, where each group's growth
depends on that of its neighbors. As time progresses, each process calculates its
current state, then exchanges information with the neighbor populations. All tasks then
progress to calculate the state at the next time step.
https://computing.llnl.gov/tutorials/parallel_comp/
Signal Processing
An audio signal data set is passed through four distinct computational filters. Each filter
is a separate process. The first segment of data must pass through the first filter before
progressing to the second. When it does, the second segment of data passes through
the first filter. By the time the fourth segment of data is in the first filter, all four tasks
are busy.
https://computing.llnl.gov/tutorials/parallel_comp/
IMADS
a tool for functional decomposition
Intelligent Multiple Agent Development System
[Diagram: agents built from Function, Neural Network, and Rule-Based Modules, connected by Inputs and Outputs]
http://concurrent.us
Parallel Virtual Machine
PVM (Parallel Virtual Machine) is a portable message-passing programming system,
designed to link separate host machines to form a "virtual machine" which is a
single, manageable computing resource.
The virtual machine can be composed of hosts of varying types, in physically
remote locations.
PVM applications can be composed of any number of separate processes, or
components, written in a mixture of C, C++ and Fortran.
The system is portable to a wide variety of architectures, including workstations,
multiprocessors, supercomputers and PCs.
PVM is a by-product of ongoing research at several institutions, and is made
available to the public free of charge.
NetLib Bob Manchek http://www.netlib.org/pvm3/
PVM Modes of Operation
Parallel computing using a system such as PVM may be approached from three
fundamental viewpoints, based on the organization of the computing tasks.
Crowd Computing
Tree Computation
Hybrid
http://www.netlib.org/pvm3/book/
Crowd Computing
Crowd computing involves a collection of closely related processes, typically executing
the same code, that perform computations on different portions of the workload, usually
with periodic exchange of intermediate results. This paradigm can be further
subdivided into two categories:
The master-slave (or host-node) model, in which a separate "control" program
termed the master is responsible for process spawning, initialization, collection and
display of results, and perhaps timing of functions.
The slave programs perform the actual computation involved; they either are
allocated their workloads by the master (statically or dynamically) or perform the
allocations themselves.
The node-only model, where multiple instances of a single program execute, with
one process (typically the one initiated manually) taking over the non-computational
responsibilities in addition to contributing to the computation itself.
http://www.netlib.org/pvm3/book/
PVM Mandelbrot
example of Crowd Computing
{Master Mandelbrot algorithm.}

{Initial placement}
for i := 0 to NumWorkers - 1
    pvm_spawn(<worker name>)              {Start up worker i}
    pvm_send(<worker tid>,999)            {Send task to worker i}
endfor

{Receive-send}
while (WorkToDo)
    pvm_recv(888)                         {Receive result}
    pvm_send(<available worker tid>,999)  {Send next task to available worker}
    display result
endwhile

{Gather remaining results.}
for i := 0 to NumWorkers - 1
    pvm_recv(888)                         {Receive result}
    pvm_kill(<worker tid i>)              {Terminate worker i}
    display result
endfor

{Worker Mandelbrot algorithm.}
while (true)
    pvm_recv(999)                           {Receive task}
    result := MandelbrotCalculations(task)  {Compute result}
    pvm_send(<master tid>,888)              {Send result to master}
endwhile
http://www.netlib.org/pvm3/book/
Tree Computation
The second model supported by PVM is termed a tree computation. In this scenario,
processes are spawned in a tree-like manner, thereby establishing a tree-like,
parent-child relationship (as opposed to crowd computations, where a star-like
relationship exists).
This paradigm, although less commonly used, is an extremely natural fit to
applications where the total workload is not known a priori, for example, in
branch-and-bound algorithms, alpha-beta search, and recursive divide-and-conquer
algorithms.
http://www.netlib.org/pvm3/book/
PVM Split-Sort-Merge
example of Tree Computation
{ Spawn and partition list based on a broadcast tree pattern. }
for i := 1 to N, such that 2^N = NumProcs
    forall processors P such that P < 2^i
        pvm_spawn(...)                    {process id P XOR 2^i}
        if P < 2^(i-1) then
            midpt := PartitionList(list)
            {Send list[0..midpt] to P XOR 2^i}
            pvm_send((P XOR 2^i),999)
            list := list[midpt+1..MAXSIZE]
        else
            pvm_recv(999)                 {receive the list}
        endif
    endfor
endfor

{ Sort remaining list. }
Quicksort(list[midpt+1..MAXSIZE])

{ Gather/merge sorted sub-lists. }
for i := N downto 1, such that 2^N = NumProcs
    forall processors P such that P < 2^i
        if P > 2^(i-1) then
            pvm_send((P XOR 2^i),888)     {Send list to P XOR 2^i}
        else
            pvm_recv(888)                 {receive temp list}
            merge templist into list
        endif
    endfor
endfor
code written for a 4-node HyperCube - http://www.netlib.org/pvm3/book/
Hybrid
The third model, which we term hybrid, can be thought of as a combination of the
tree model and crowd model. Essentially, this paradigm possesses an arbitrary
spawning structure: that is, at any point during application execution, the process
relationship structure may resemble an arbitrary and changing graph.
http://www.netlib.org/pvm3/book/
Message Passing Interface
MPI Hello World
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int size, rank;
    int length;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks */
    MPI_Get_processor_name(name, &length);
    printf("Hello MPI! Process %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
http://www.cs.earlham.edu/~lemanal/slides/mpi-slides.pdf
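A program like this would typically be compiled with an MPI wrapper compiler such as mpicc and launched with mpiexec (for example, mpiexec -n 4 ./hello, where the executable name is just a placeholder); the exact commands depend on the MPI installation.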
Compute Unified Device Architecture (CUDA)
to be continued...