Introduction to Parallel Programming and Algorithms
CS599
David Monismith
Assumptions
• You have some basic understanding of
processes and threads
• You have had experience programming with both a plain text editor (such as Notepad) and an IDE
• You have had experience programming in Java
and in creating data structures
• You will be able to pick up C/C++ programming
and learn to use a remote Linux system
relatively quickly
Outline
• Serial Computing
• Moore’s Law
• Power Wall
• Why Parallel Programming?
• Amdahl’s Law
• Flynn’s Taxonomy
• Hello world!
– Thread-based
– Directive-based
– Process-based
Serial Computing
• Instructions executed one at a time.
• Advances in microprocessor technology over
the past ten years have put multiprocessor
chips in almost every new laptop and desktop.
• Importance of concurrency in computation
recognized for many years.
• Serial processors have made use of
concurrency (they have given the appearance
of parallelism) for years.
Moore’s Law - Gordon Moore, Intel Corp.
• Note that Gordon Moore (co-founder of Intel)
coined Moore's law
– The number of transistors on a chip doubles every
18 months
• For some time, processor speeds followed this
law as well, but eventually, a power wall was
reached.
http://en.wikipedia.org/wiki/Moore%27s_law#mediaviewer/File:Transistor_Count_and_Moore%27s_Law_-_2011.svg
Energy Consumption and the Power Wall
• How is power used by the CPU?
• How much power is consumed per unit of computation?
– Note that there is a “power wall” that currently limits the on-chip clock frequency.
– Dynamic power grows with the square of the supply voltage times the clock frequency (P ≈ C·V²·f), and higher clock frequencies generally require higher supply voltages.
• The applied voltage is directly related to power (wattage), which in turn is directly related to heat.
• Faster clocks = more heat (a small worked example follows the reference below)
• Since the late 1990’s/early 2000’s, there has been a push to add more
parallelism on chip rather than higher clock speeds
• Dr. Edward Bosworth provides a more detailed description of the “power wall” at
http://www.edwardbosworth.com/My5155_Slides/Chapter01/ThePowerWall.doc
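As a small illustration of the relation above, here is a minimal C sketch (not part of these slides) that plugs made-up capacitance, voltage, and frequency values into P ≈ C·V²·f; the numbers are placeholders, not measurements of any real chip.

#include <stdio.h>

/* Illustration only: CMOS dynamic power P ~ C * V^2 * f.
 * The capacitance and the voltage/frequency pairs below are
 * invented placeholders chosen only to show the trend. */
int main(void)
{
    double C = 1.0e-9;                /* switched capacitance in farads (assumed) */
    double settings[3][2] = {         /* { voltage (V), frequency (Hz) } pairs    */
        { 1.0, 2.0e9 },
        { 1.2, 3.0e9 },
        { 1.4, 4.0e9 }
    };

    for (int i = 0; i < 3; i++) {
        double V = settings[i][0], f = settings[i][1];
        double P = C * V * V * f;     /* dynamic power in watts */
        printf("V = %.1f V, f = %.1f GHz -> P ~ %.2f W\n", V, f / 1e9, P);
    }
    return 0;
}

Note how the modest voltage increases needed to support higher clock rates make power (and therefore heat) grow much faster than the clock rate itself.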
Why Parallel Computing?
• Ubiquity of parallel hardware/architectures
• Lowered cost of components
• Increased ease of parallel programming
Means of Parallel Computing
• Multiple datapaths
– For example, superscalar systems, additional busses, etc.
• Higher accessibility to storage (e.g. parallel data access)
– PVFS/Lustre Filesystems
• Performance that scales up
– Improved performance both by using better resources on a system (scaling up) and by using more systems (scaling out)
• Threads – lightweight processes
• Processes – running programs
Parallel Computing
• Simply put, using more than one
computational resource to solve a
computational problem.
Lowered Cost of Components
• Xeon E5-2680 - $1745 tray price - 12 CPU cores, 2
threads/core, 2.5GHz, 30MB L3 Cache, 120W
– Source: http://intel.com
• ARM Cortex A9 (AM4379, 32 bit) - $15/unit volume
price - 1 A9 core + 4 PRU-ICSS (Programmable real time
unit and industrial communication subsystem) cores,
1GHz, 256KB L2 Cache, Approx. 1W max
– Sources: http://www.ti.com/product/AM4379/technicaldocuments?dcmp=dsproject&hqs=td&#doctype2
– http://linuxgizmos.com/ti-spins-cortex-a9-sitara-soc/
Examples of Parallel Computing Problems
• Web servers (Amazon.com, Google, etc.)
• Database servers (University data system)
• Graphics and visualization (XSEDE)
• Weather (Forecasting – NOAA Rapid Refresh)
• Biology (DNA/RNA Longest Common Subsequence)
• Physics (Firing a hydrogen ion at a surface of metal atoms)
• Many more . . .
Speedup
• Parallelizing a problem can result in speedup
• Patterson and Hennessy define performance as 1/(execution time) for a given task
• Speedup is Performance A / Performance B
• If PerfA/PerfB = 2, then Machine A is twice as fast as Machine B for this task
• This means the speedup for running this task on Machine A is 2 (a small sketch follows below)
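A minimal C sketch of this definition follows (not part of these slides); the two execution times are made-up example values.

#include <stdio.h>

/* Patterson & Hennessy definition: performance = 1 / execution time,
 * so speedup of A over B = perfA / perfB = timeB / timeA.
 * The times below are invented purely for illustration. */
int main(void)
{
    double time_A = 45.0;               /* seconds on Machine A (assumed) */
    double time_B = 90.0;               /* seconds on Machine B (assumed) */

    double perf_A = 1.0 / time_A;
    double perf_B = 1.0 / time_B;
    double speedup = perf_A / perf_B;   /* equivalently time_B / time_A */

    printf("Speedup of Machine A over Machine B: %.1f\n", speedup);  /* prints 2.0 */
    return 0;
}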
Amdahl’s Law
• But wait! We can’t speed up every task.
• Some tasks are inherently serial and sometimes a group of
parallel tasks must complete before a program can
continue.
• So, there are limitations to parallelization.
• Gene Amdahl noticed that “a program can only run as fast
as its slowest part.”
• When a sequential program is sped up with parallelism, its runtime can improve only down to the runtime of the part that cannot be parallelized.
• Effectively, you can't go any faster than your slowest part.
• The serial portion dictates the highest possible speedup, in both hardware and software.
Computing Parallel Execution Time
New_Execution_Time = Parallelizable_Execution_Time / Parallel_Factor + Sequential_Execution_Time
• Assume a program with a 90s runtime on a sequential processor, of which 80s is parallelizable. The fastest the program could ever run is 10s (10s = 80s/infinity + 10s).
• If we use a dual-core processor the runtime is 50s = 80s/2 + 10s
• If we use a quad-core processor the runtime is 30s = 80s/4 + 10s
• If we use an 8-core processor the runtime is 20s = 80s/8 + 10s
• The speedup using a quad (4) core processor vs. a sequential processor is 90s (seq. runtime) / 30s (quad-core runtime) = 3
• The maximum theoretical speedup is 90s (seq. runtime) / 10s (infinite parallelism) = 9
• Based on Amdahl's law we should make the most commonly used code (i.e., loops) parallel if possible. A small sketch of this formula follows.
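The following minimal C sketch (not part of these slides) evaluates the formula above using the 90s/80s/10s example from this slide.

#include <stdio.h>

/* Amdahl's law as stated on this slide:
 * new_time = parallelizable_time / parallel_factor + sequential_time */
static double new_execution_time(double parallelizable, double sequential,
                                 double parallel_factor)
{
    return parallelizable / parallel_factor + sequential;
}

int main(void)
{
    double parallelizable = 80.0;   /* seconds that can be parallelized */
    double sequential = 10.0;       /* seconds that cannot              */
    double serial_total = parallelizable + sequential;   /* 90 seconds  */
    int cores[] = { 1, 2, 4, 8 };

    for (int i = 0; i < 4; i++) {
        double t = new_execution_time(parallelizable, sequential, cores[i]);
        printf("%d core(s): runtime = %.0fs, speedup = %.2f\n",
               cores[i], t, serial_total / t);
    }

    /* As the parallel factor goes to infinity, only the serial part remains. */
    printf("Maximum theoretical speedup = %.0f\n", serial_total / sequential);
    return 0;
}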
Parallel Patterns (According to Michael McCool of Intel)
• Superscalar Sequences/Task Graphs
• Speculative Selection
• Map
• Gather
• Stencils
• Partition
• Reduce
• Pack
• Scatter
• Histogram
Memory/Disk Speed Issues
• [Graph omitted from the transcript.]
• Graph source: Patterson and Hennessy, Computer Architecture: A Quantitative Approach
• Image source: http://dave.cheney.net/wp-content/uploads/2014/06/Gocon-2014-10.jpg
Flynn’s Taxonomy
• SISD = Single Instruction Single Data = Serial Programming
• SIMD = Single Instruction Multiple Data = Implicit Parallelism
(Instruction/Architecture Level)
• MISD = Multiple Instruction Single Data (Rarely implemented)
• MIMD = Multiple Instruction Multiple Data = Multiprocessor

                         Single Data   Multiple Data
Single Instruction       SISD          SIMD
Multiple Instruction     MISD          MIMD
Flynn’s Taxonomy
• SIMD instructions and architectures allow for
implicit parallelism when writing programs
• To provide a sense of how these work,
pipelines and superscalar architectures are
discussed in the next set of slides
• Our focus, however, will be on MIMD through
the use of processes and threads, and we will
look at examples shortly
Pipelined Instructions
• Divide instructions into
multiple stages
• Allow at most one
instruction to be in one
stage at a time
• Maximum instruction
throughput is equal to the
number of stages
• Memory access is only performed on instructions for which it is required; otherwise, that stage is generally skipped
Instruction Stages
IF = Instruction Fetch
ID = Instruction Decode
EX = Execute
MA = Memory Access
WB = Writeback
Pipelined Instructions
• Successive instructions enter the pipeline one cycle apart, so up to five instructions are in flight at once:

Cycle:     1   2   3   4   5   6   7   8   9   ...
Instr 1:   IF  ID  EX  MA  WB
Instr 2:       IF  ID  EX  MA  WB
Instr 3:           IF  ID  EX  MA  WB
Instr 4:               IF  ID  EX  MA  WB
Instr 5:                   IF  ID  EX  MA  WB
...
Superscalar Architecture
• Allow for more than one pipeline to run at the same
time
• Allows for parallelism, but only provides ideal
speedup if instructions are independent
• Most instructions are, however, not independent (e.g., mathematics and branches), so complex logic may be needed for branch prediction and hazard detection
Superscalar Pipeline
• Two pipelines run side by side, so two instructions are issued per cycle:

Cycle:      1   2   3   4   5   6   7   ...
Instr 1:    IF  ID  EX  MA  WB
Instr 2:    IF  ID  EX  MA  WB
Instr 3:        IF  ID  EX  MA  WB
Instr 4:        IF  ID  EX  MA  WB
Instr 5:            IF  ID  EX  MA  WB
Instr 6:            IF  ID  EX  MA  WB
...
Processes and Threads
• Our main focus for this course
• These exist only at execution time
• They have fast state changes -> in memory and waiting
• A Process
– is a fundamental computation unit
– can have one or more threads
– is handled by the process management module
– requires system resources
Process
• Process (job) - program in execution, ready to
execute, or waiting for execution
• A program is static whereas a process is dynamic.
• We will implement processes using an API called
the Message Passing Interface (MPI)
• MPI will provide us with an abstract layer that will
allow us to create and identify processes without
worrying about the creation of data structures for
sockets or shared memory
• We will discuss process based programming in
detail starting in week 4
Threads
• threads - lightweight processes
– Dynamic component of processes
– Often, many threads are part of a process
• Current OSes support multithreading
– multiple threads (tasks) per process
• Execution of threads is handled more efficiently than that of full-weight processes (although there are other costs).
• At process creation, one thread is created, the "main"
thread.
• Other threads are created from the "main" thread
Threads in C (POSIX)
• Use a function to represent a thread
• This function must have the following format:
void * ThreadName(void * threadArg)
{
    //Do thread work here

    //Cause the thread to exit
    pthread_exit(NULL);
}
POSIX Threads (pthreads)
• pthread_exit(NULL) terminates the
calling thread.
• Even though a void pointer is supposed to be
returned, there is no explicit return statement
within the thread function.
• This function actually provides its return value
through the pthread_exit function.
• Here, NULL is the return value.
pthread_create
• pthread_create does the following: it
takes
• 1) the address of the variable where the identifier of the thread will be stored,
• 2) the thread attributes,
• 3) the name of the function containing the
code for the thread, and
• 4) the parameter passed into the thread as its argument.
pthread_create
• The prototype for this function follows:
int pthread_create(pthread_t * thread,
const pthread_attr_t * attr,
void * (*start_routine) (void *),
void * arg);
• The function then creates the thread, stores its identifier in
the variable thread, starts the thread using
start_routine, and passes in the argument arg.
• See "man pthread_create" on Littlefe.
pthread_exit
• Causes a thread to exit.
• See "man pthread_exit" for more info.
• Especially for information about returning values from a thread (a short sketch follows below).
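The following minimal sketch (not part of these slides) shows a thread handing a result back through pthread_exit and the creating thread collecting it with pthread_join; the function and variable names are hypothetical.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Hypothetical worker: returns a heap-allocated result via pthread_exit. */
void * computeAnswer(void * threadArg)
{
    int * result = malloc(sizeof(int));
    *result = 42;              /* placeholder "computation" */
    pthread_exit(result);      /* this pointer becomes the thread's return value */
}

int main(int argc, char ** argv)
{
    pthread_t worker;
    void * returnValue;

    pthread_create(&worker, NULL, computeAnswer, NULL);
    pthread_join(worker, &returnValue);   /* receives the pointer given to pthread_exit */

    printf("Thread returned %d\n", *(int *)returnValue);
    free(returnValue);
    return 0;
}

Compile with the -pthread flag, just like the other pthreads programs in these slides.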
pthread_t
• pthread_t is the pthread type.
• This type can store a unique identifier to a
thread.
• Note that thread ids are only guaranteed to be
unique within a single process as threads are
specific to a process.
• This means that you should not attempt to pass a
pointer to a thread to a process different from
the one in which the thread was created.
What is OpenMP? (Answers per LLNL)
• Open Multi-Processing
– Provided via open specifications by work between
academia, corporations and government
• Provides a standardized shared-memory parallel programming environment
• A directive- and function-based API
• Parallelism can be achieved with just a few simple directives
OpenMP Threads
• OpenMP Threads are implemented using
directive based programming
• The number of threads is determined using an
environment variable called
OMP_NUM_THREADS
• In C and C++, these are created using #pragma
omp statements
• For example, a block of code that is preceded by #pragma omp parallel would be threaded (a short sketch follows below)
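Here is a minimal sketch (not part of these slides) of checking how many threads the runtime will use; omp_get_max_threads, omp_get_num_threads, and omp_get_thread_num are standard OpenMP runtime calls.

#include <omp.h>
#include <stdio.h>

int main(int argc, char ** argv)
{
    /* Outside a parallel region, only one thread is running. */
    printf("Max threads available: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        /* Every thread in the team executes this block; let one report the team size. */
        if (omp_get_thread_num() == 0)
            printf("Team size inside the parallel region: %d\n", omp_get_num_threads());
    }
    return 0;
}

Compile with gcc -fopenmp (as on the later slide); running with OMP_NUM_THREADS=4 set in the environment should report a team of 4.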
In-class Exercise
• Log in to one of the LittleFe clusters and
complete the Linux tutorial if you have not
completed it in the past.
– Complete and run “Hello, world!” programs for
OpenMP, pthreads, and MPI
• Install Cygwin and Eclipse for Parallel
Application Developers on your laptop
A Simple (Serial) C Program
• Open an editor by typing:
• nano hello_world.c
#include <stdio.h> //Similar to import

int main(int argc, char ** argv)
//argc is the number of command line arguments
//argv is the array of command line arguments (strings)
{
    printf("Hello, world\n");
    return 0;
}
Compilation
• Compile by typing the following:
• gcc hello_world.c -o hello_world.exe
• Notice that you get an executable file that is really in machine
language (not byte code).
• You should see errors if you have made mistakes. You'll see no
output if your program compiles correctly.
• In bash at the $ prompt, type:
• ./hello_world.exe to run
C Program using the MPI Library
#include <stdio.h>  /* printf, scanf, . . . */
#include <stdlib.h> /* atoi, atof, . . ., malloc, calloc, free, . . . */
#include <mpi.h>    /* MPI function calls */

int main(int argc, char ** argv)
{
    int my_rank;    //This process's id
    int number_of_processes;
    int mpi_error_code;

    mpi_error_code = MPI_Init(&argc, &argv);
    mpi_error_code = MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    mpi_error_code = MPI_Comm_size(MPI_COMM_WORLD, &number_of_processes);

    printf("%d of %d: Hello, world!\n", my_rank, number_of_processes);

    mpi_error_code = MPI_Finalize();
    return 0;
}
Compile and Run a C/MPI Program
• Compile with:
mpicc hello_world_mpi.c -o hello_world_mpi.exe
• Run with:
mpirun -n 10 -machinefile machines hello_world_mpi.exe
• Everything between "MPI_Init" and "MPI_Finalize" runs as many times as there are processes
• Copies of each variable are made for each process. Each of these copies is distinct.
Identifying a Process
• A process’s rank identifies it within MPI
• This is retrieved using the MPI_Comm_rank
function and stored in my_rank in the
example program
• The number of processes is retrieved using
the MPI_Comm_size function and stored in
number_of_processes in the example
program
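As a small illustration (not part of these slides) of how the rank is typically used, the sketch below has rank 0 print a summary while every other rank announces its own share of the work; the "work" itself is just a placeholder printf.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char ** argv)
{
    int my_rank;
    int number_of_processes;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &number_of_processes);

    if (my_rank == 0)
        printf("Coordinator: %d processes in this job\n", number_of_processes);
    else
        printf("Worker %d: doing my share of the work\n", my_rank);  /* placeholder work */

    MPI_Finalize();
    return 0;
}

Compile and run exactly like the earlier hello_world_mpi example; each process takes a different branch based solely on its rank.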
Example Program Execution

Process 0:   rank = 0, size = 10
Process 1:   rank = 1, size = 10
Process 2:   rank = 2, size = 10
...
Process 9:   rank = 9, size = 10

Processes each have their own copies of all variables declared within the code. They also have their own memory environment, but they all execute the same code.
A Simple Pthreads Program
#include <stdio.h>
#include <stdint.h>   /* intptr_t, for converting the thread argument safely */
#include <pthread.h>

#define NUM_THREADS 4

void * helloThread(void * tid)
{
    int threadId = (int)(intptr_t)tid;   /* the argument is an integer smuggled in a pointer */
    printf("Hello from thread %d\n", threadId);
    pthread_exit(NULL);
}
A Simple Pthreads Program
int main(int argc, char ** argv)
{
    pthread_t threadList[NUM_THREADS];
    int i;

    for(i = 0; i < NUM_THREADS; i++)
        pthread_create(&threadList[i], NULL, helloThread, (void *)(intptr_t)i);

    for(i = 0; i < NUM_THREADS; i++)
        pthread_join(threadList[i], NULL);

    return 0;
}
Compile and Run a PThreads Program
• Compile with:
• gcc pthreadsExample.c -pthread
-o pthreadsExample.exe
• Run with:
• ./pthreadsExample.exe
A Simple OpenMP Example
// Based upon LLNL Example code from
// https://computing.llnl.gov/tutorials/openMP/
#include <omp.h>
#include <stdio.h>

int main(int argc, char ** argv) {
    int tid;

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Hello from thread number %d\n", tid);
    } //Threads join and the program finishes

    return 0;
}
Compile and Run an OpenMP Program
• Compile with:
• gcc ompExample.c -fopenmp -o
ompExample.exe
• Run with:
• ./ompExample.exe
Referenced Texts
• Patterson & Hennessy, Computer Organization and Design: The Hardware/Software Interface
• Patterson & Hennessy, Computer Architecture:
A Quantitative Approach
• Grama, Gupta, Karypis, and Kumar,
Introduction to Parallel Computing, Second
Edition
Reading Assignment for This Week
• C Programming Slides – see course website –
http://monismith.info/cs599/notes.html
• LLNL pthreads tutorial https://computing.llnl.gov/tutorials/pthreads/
• LLNL OpenMP tutorial https://computing.llnl.gov/tutorials/openMP/