Introduction to Parallel Programming and Algorithms
CS599
David Monismith
Assumptions
• You have some basic understanding of
processes and threads
• You have had experience programming with both a plain text editor (such as Notepad) and an IDE
• You have had experience programming in Java
and in creating data structures
• You will be able to pick up C/C++ programming
and learn to use a remote Linux system
relatively quickly
Outline
• Serial Computing
• Moore’s Law
• Power Wall
• Why Parallel Programming?
• Amdahl’s Law
• Flynn’s Taxonomy
• Hello world!
– Thread-based
– Directive-based
– Process-based
Serial Computing
• Instructions executed one at a time.
• Advances in microprocessor technology over
the past ten years have put multiprocessor
chips in almost every new laptop and desktop.
• Importance of concurrency in computation
recognized for many years.
• Serial processors have made use of
concurrency (they have given the appearance
of parallelism) for years.
Moore’s Law - Gordon Moore, Intel Corp.
• Note that Gordon Moore (co-founder of Intel)
coined Moore's law
– The number of transistors on a chip doubles every
18 months
• For some time, processor speeds followed this
law as well, but eventually, a power wall was
reached.
http://en.wikipedia.org/wiki/Moore%27s_law#mediaviewer/File:Transistor_Count_and_Moore%27s_Law_-_2011.svg
Energy Consumption and the Power Wall
• How is power used by the CPU?
• How much power is consumed per unit of computation?
– Note that there is a “power wall” that currently limits the on-chip clock frequency.
– Dynamic power grows with the square of the supply voltage times the clock frequency (P ≈ C·V²·f), and higher clock frequencies generally require higher supply voltages.
• The applied voltage is directly related to power (wattage), which in turn is directly related to heat.
• Faster clocks = more heat (a small worked example follows the reference below)
• Since the late 1990’s/early 2000’s, there has been a push to add more
parallelism on chip rather than higher clock speeds
• Dr. Edward Bosworth provides a more detailed description of the “power wall” at
http://www.edwardbosworth.com/My5155_Slides/Chapter01/ThePowerWall.doc
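As a small illustration of the relation above, here is a minimal C sketch (not part of these slides) that plugs made-up capacitance, voltage, and frequency values into P ≈ C·V²·f; the numbers are placeholders, not measurements of any real chip.

#include <stdio.h>

/* Illustration only: CMOS dynamic power P ~ C * V^2 * f.
 * The capacitance and the voltage/frequency pairs below are
 * invented placeholders chosen only to show the trend. */
int main(void)
{
    double C = 1.0e-9;                /* switched capacitance in farads (assumed) */
    double settings[3][2] = {         /* { voltage (V), frequency (Hz) } pairs    */
        { 1.0, 2.0e9 },
        { 1.2, 3.0e9 },
        { 1.4, 4.0e9 }
    };

    for (int i = 0; i < 3; i++) {
        double V = settings[i][0], f = settings[i][1];
        double P = C * V * V * f;     /* dynamic power in watts */
        printf("V = %.1f V, f = %.1f GHz -> P ~ %.2f W\n", V, f / 1e9, P);
    }
    return 0;
}

Note how the modest voltage increases needed to support higher clock rates make power (and therefore heat) grow much faster than the clock rate itself.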
Why Parallel Computing?
• Ubiquity of parallel hardware/architectures
• Lowered cost of components
• Increased ease of parallel programming
Means of Parallel Computing
• Multiple datapaths
– For example, superscalar systems, additional busses, etc.
• Higher accessibility to storage (e.g. parallel data access)
– PVFS/Lustre Filesystems
• Performance that scales up
– Improved performance both by using better resources on a system (scaling up) and by using more systems (scaling out)
• Threads – lightweight processes
• Processes – running programs
Parallel Computing
• Simply put, using more than one
computational resource to solve a
computational problem.
Lowered Cost of Components
• Xeon E5-2680 - $1745 tray price - 12 CPU cores, 2
threads/core, 2.5GHz, 30MB L3 Cache, 120W
– Source: http://intel.com
• ARM Cortex A9 (AM4379, 32 bit) - $15/unit volume
price - 1 A9 core + 4 PRU-ICSS (Programmable real time
unit and industrial communication subsystem) cores,
1GHz, 256KB L2 Cache, Approx. 1W max
– Sources: http://www.ti.com/product/AM4379/technicaldocuments?dcmp=dsproject&hqs=td&#doctype2
– http://linuxgizmos.com/ti-spins-cortex-a9-sitara-soc/
Examples of Parallel Computing Problems
• Web servers (Amazon.com, Google, etc.)
• Database servers (University data system)
• Graphics and visualization (XSEDE)
• Weather (Forecasting – NOAA Rapid Refresh)
• Biology (DNA/RNA Longest Common Subsequence)
• Physics (Firing a hydrogen ion at a surface of metal atoms)
• Many more . . .
Speedup
• Parallelizing a problem can result in speedup
• Patterson and Hennessy define performance as 1/(execution time) for a given task
• Speedup is Performance A / Performance B
• If PerfA/PerfB = 2, then Machine A is twice as fast as Machine B for this task
• This means the speedup for running this task on Machine A is 2 (a small sketch follows below)
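A minimal C sketch of this definition follows (not part of these slides); the two execution times are made-up example values.

#include <stdio.h>

/* Patterson & Hennessy definition: performance = 1 / execution time,
 * so speedup of A over B = perfA / perfB = timeB / timeA.
 * The times below are invented purely for illustration. */
int main(void)
{
    double time_A = 45.0;               /* seconds on Machine A (assumed) */
    double time_B = 90.0;               /* seconds on Machine B (assumed) */

    double perf_A = 1.0 / time_A;
    double perf_B = 1.0 / time_B;
    double speedup = perf_A / perf_B;   /* equivalently time_B / time_A */

    printf("Speedup of Machine A over Machine B: %.1f\n", speedup);  /* prints 2.0 */
    return 0;
}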
Amdahl’s Law
• But wait! We can’t speed up every task.
• Some tasks are inherently serial and sometimes a group of
parallel tasks must complete before a program can
continue.
• So, there are limitations to parallelization.
• Gene Amdahl noticed that “a program can only run as fast
as its slowest part.”
• When a sequential program is sped up with parallelism, its runtime can improve only down to the runtime of the part that cannot be parallelized.
• Effectively, you can't go any faster than your slowest part.
• The serial portion dictates the highest possible speedup, in both hardware and software.
Computing Parallel Execution Time
New_Execution_Time = Parallelizable_Execution_Time / Parallel_Factor + Sequential_Execution_Time
• Assume a program with a 90s runtime on a sequential processor, of which 80s is parallelizable. The fastest the program could ever run is 10s (10s = 80s/infinity + 10s).
• If we use a dual-core processor the runtime is 50s = 80s/2 + 10s
• If we use a quad-core processor the runtime is 30s = 80s/4 + 10s
• If we use an 8-core processor the runtime is 20s = 80s/8 + 10s
• The speedup using a quad (4) core processor vs. a sequential processor is 90s (seq. runtime) / 30s (quad-core runtime) = 3
• The maximum theoretical speedup is 90s (seq. runtime) / 10s (infinite parallelism) = 9
• Based on Amdahl's law we should make the most commonly used code (i.e., loops) parallel if possible. A small sketch of this formula follows.
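The following minimal C sketch (not part of these slides) evaluates the formula above using the 90s/80s/10s example from this slide.

#include <stdio.h>

/* Amdahl's law as stated on this slide:
 * new_time = parallelizable_time / parallel_factor + sequential_time */
static double new_execution_time(double parallelizable, double sequential,
                                 double parallel_factor)
{
    return parallelizable / parallel_factor + sequential;
}

int main(void)
{
    double parallelizable = 80.0;   /* seconds that can be parallelized */
    double sequential = 10.0;       /* seconds that cannot              */
    double serial_total = parallelizable + sequential;   /* 90 seconds  */
    int cores[] = { 1, 2, 4, 8 };

    for (int i = 0; i < 4; i++) {
        double t = new_execution_time(parallelizable, sequential, cores[i]);
        printf("%d core(s): runtime = %.0fs, speedup = %.2f\n",
               cores[i], t, serial_total / t);
    }

    /* As the parallel factor goes to infinity, only the serial part remains. */
    printf("Maximum theoretical speedup = %.0f\n", serial_total / sequential);
    return 0;
}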
Parallel Patterns (According to Michael McCool of Intel)
• Superscalar Sequences/Task Graphs
• Speculative Selection
• Map
• Gather
• Stencils
• Partition
• Reduce
• Pack
• Scatter
• Histogram
Memory/Disk Speed Issues
• [Graph omitted from the transcript.]
• Graph source: Patterson and Hennessy, Computer Architecture: A Quantitative Approach
• Image source: http://dave.cheney.net/wp-content/uploads/2014/06/Gocon-2014-10.jpg
Flynn’s Taxonomy
• SISD = Single Instruction Single Data = Serial Programming
• SIMD = Single Instruction Multiple Data = Implicit Parallelism
(Instruction/Architecture Level)
• MISD = Multiple Instruction Single Data (Rarely implemented)
• MIMD = Multiple Instruction Multiple Data = Multiprocessor

                         Single Data   Multiple Data
Single Instruction       SISD          SIMD
Multiple Instruction     MISD          MIMD
Flynn’s Taxonomy
• SIMD instructions and architectures allow for
implicit parallelism when writing programs
• To provide a sense of how these work,
pipelines and superscalar architectures are
discussed in the next set of slides
• Our focus, however, will be on MIMD through
the use of processes and threads, and we will
look at examples shortly
Pipelined Instructions
• Divide instructions into
multiple stages
• Allow at most one
instruction to be in one
stage at a time
• Maximum instruction
throughput is equal to the
number of stages
• Memory access is only performed on instructions for which it is required; otherwise, that stage is generally skipped
Instruction Stages
IF = Instruction Fetch
ID = Instruction Decode
EX = Execute
MA = Memory Access
WB = Writeback
Pipelined Instructions
• Successive instructions enter the pipeline one cycle apart, so up to five instructions are in flight at once:

Cycle:     1   2   3   4   5   6   7   8   9   ...
Instr 1:   IF  ID  EX  MA  WB
Instr 2:       IF  ID  EX  MA  WB
Instr 3:           IF  ID  EX  MA  WB
Instr 4:               IF  ID  EX  MA  WB
Instr 5:                   IF  ID  EX  MA  WB
...
Superscalar Architecture
• Allow for more than one pipeline to run at the same
time
• Allows for parallelism, but only provides ideal
speedup if instructions are independent
• Most instructions are, however, not independent (e.g., mathematics and branches), so complex logic may be needed for branch prediction and hazard detection
Superscalar Pipeline
• Two pipelines run side by side, so two instructions are issued per cycle:

Cycle:      1   2   3   4   5   6   7   ...
Instr 1:    IF  ID  EX  MA  WB
Instr 2:    IF  ID  EX  MA  WB
Instr 3:        IF  ID  EX  MA  WB
Instr 4:        IF  ID  EX  MA  WB
Instr 5:            IF  ID  EX  MA  WB
Instr 6:            IF  ID  EX  MA  WB
...
Processes and Threads
• Our main focus for this course
• These exist only at execution time
• They have fast state changes -> in memory and waiting
• A Process
– is a fundamental computation unit
– can have one or more threads
– is handled by the process management module
– requires system resources
Process
• Process (job) - program in execution, ready to
execute, or waiting for execution
• A program is static whereas a process is dynamic.
• We will implement processes using an API called
the Message Passing Interface (MPI)
• MPI will provide us with an abstract layer that will
allow us to create and identify processes without
worrying about the creation of data structures for
sockets or shared memory
• We will discuss process based programming in
detail starting in week 4
Threads
• threads - lightweight processes
– Dynamic component of processes
– Often, many threads are part of a process
• Current OSes support multithreading
– multiple threads (tasks) per process
• Execution of threads is handled more efficiently than that of full-weight processes (although there are other costs).
• At process creation, one thread is created, the "main"
thread.
• Other threads are created from the "main" thread
Threads in C (POSIX)
• Use a function to represent a thread
• This function must have the following format:
void * ThreadName(void * threadArg)
{
    //Do thread work here

    //Cause the thread to exit
    pthread_exit(NULL);
}
POSIX Threads (pthreads)
• pthread_exit(NULL) terminates the
calling thread.
• Even though a void pointer is supposed to be
returned, there is no explicit return statement
within the thread function.
• This function actually provides its return value
through the pthread_exit function.
• Here, NULL is the return value.
pthread_create
• pthread_create does the following: it
takes
• 1) the address of the variable where the identifier of the thread will be stored,
• 2) the thread attributes,
• 3) the name of the function containing the
code for the thread, and
• 4) the parameter passed into the thread as its argument.
pthread_create
• The prototype for this function follows:
int pthread_create(pthread_t * thread,
const pthread_attr_t * attr,
void * (*start_routine) (void *),
void * arg);
• The function then creates the thread, stores its identifier in
the variable thread, starts the thread using
start_routine, and passes in the argument arg.
• See "man pthread_create" on Littlefe.
pthread_exit
• Causes a thread to exit.
• See "man pthread_exit" for more info.
• Especially for information about returning values from a thread (a short sketch follows below).
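The following minimal sketch (not part of these slides) shows a thread handing a result back through pthread_exit and the creating thread collecting it with pthread_join; the function and variable names are hypothetical.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Hypothetical worker: returns a heap-allocated result via pthread_exit. */
void * computeAnswer(void * threadArg)
{
    int * result = malloc(sizeof(int));
    *result = 42;              /* placeholder "computation" */
    pthread_exit(result);      /* this pointer becomes the thread's return value */
}

int main(int argc, char ** argv)
{
    pthread_t worker;
    void * returnValue;

    pthread_create(&worker, NULL, computeAnswer, NULL);
    pthread_join(worker, &returnValue);   /* receives the pointer given to pthread_exit */

    printf("Thread returned %d\n", *(int *)returnValue);
    free(returnValue);
    return 0;
}

Compile with the -pthread flag, just like the other pthreads programs in these slides.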
pthread_t
• pthread_t is the pthread type.
• This type can store a unique identifier to a
thread.
• Note that thread ids are only guaranteed to be
unique within a single process as threads are
specific to a process.
• This means that you should not attempt to pass a
pointer to a thread to a process different from
the one in which the thread was created.
What is OpenMP? (Answers per LLNL)
• Open Multi-Processing
– Provided via open specifications by work between
academia, corporations and government
• Provides a standardized shared-memory parallel programming environment
• A directive- and function-based API
• Parallelism can be achieved with just a few simple directives
OpenMP Threads
• OpenMP Threads are implemented using
directive based programming
• The number of threads is determined using an
environment variable called
OMP_NUM_THREADS
• In C and C++, these are created using #pragma
omp statements
• For example, a block of code that is preceded by #pragma omp parallel would be threaded (a short sketch follows below)
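Here is a minimal sketch (not part of these slides) of checking how many threads the runtime will use; omp_get_max_threads, omp_get_num_threads, and omp_get_thread_num are standard OpenMP runtime calls.

#include <omp.h>
#include <stdio.h>

int main(int argc, char ** argv)
{
    /* Outside a parallel region, only one thread is running. */
    printf("Max threads available: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        /* Every thread in the team executes this block; let one report the team size. */
        if (omp_get_thread_num() == 0)
            printf("Team size inside the parallel region: %d\n", omp_get_num_threads());
    }
    return 0;
}

Compile with gcc -fopenmp (as on the later slide); running with OMP_NUM_THREADS=4 set in the environment should report a team of 4.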
In-class Exercise
• Log in to one of the LittleFe clusters and
complete the Linux tutorial if you have not
completed it in the past.
– Complete and run “Hello, world!” programs for
OpenMP, pthreads, and MPI
• Install Cygwin and Eclipse for Parallel
Application Developers on your laptop
A Simple (Serial) C Program
• Open an editor by typing:
• nano hello_world.c
#include <stdio.h> //Similar to import

int main(int argc, char ** argv)
//argc is the number of command line arguments
//argv is the array of command line arguments (strings)
{
    printf("Hello, world\n");
    return 0;
}
Compilation
• Compile by typing the following:
• gcc hello_world.c -o hello_world.exe
• Notice that you get an executable file that is really in machine
language (not byte code).
• You should see errors if you have made mistakes. You'll see no
output if your program compiles correctly.
• In bash at the $ prompt, type:
• ./hello_world.exe to run
C Program using the MPI Library
#include <stdio.h>  /* printf, scanf, . . . */
#include <stdlib.h> /* atoi, atof, . . ., malloc, calloc, free, . . . */
#include <mpi.h>    /* MPI function calls */

int main(int argc, char ** argv)
{
    int my_rank;    //This process's id
    int number_of_processes;
    int mpi_error_code;

    mpi_error_code = MPI_Init(&argc, &argv);
    mpi_error_code = MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    mpi_error_code = MPI_Comm_size(MPI_COMM_WORLD, &number_of_processes);

    printf("%d of %d: Hello, world!\n", my_rank, number_of_processes);

    mpi_error_code = MPI_Finalize();
    return 0;
}
Compile and Run a C/MPI Program
• Compile with:
mpicc hello_world_mpi.c -o hello_world_mpi.exe
• Run with:
mpirun -n 10 -machinefile machines hello_world_mpi.exe
• Everything between "MPI_Init" and "MPI_Finalize" runs as many times as there are processes
• Copies of each variable are made for each process. Each of these copies is distinct.
Identifying a Process
• A process’s rank identifies it within MPI
• This is retrieved using the MPI_Comm_rank
function and stored in my_rank in the
example program
• The number of processes is retrieved using
the MPI_Comm_size function and stored in
number_of_processes in the example
program
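As a small illustration (not part of these slides) of how the rank is typically used, the sketch below has rank 0 print a summary while every other rank announces its own share of the work; the "work" itself is just a placeholder printf.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char ** argv)
{
    int my_rank;
    int number_of_processes;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &number_of_processes);

    if (my_rank == 0)
        printf("Coordinator: %d processes in this job\n", number_of_processes);
    else
        printf("Worker %d: doing my share of the work\n", my_rank);  /* placeholder work */

    MPI_Finalize();
    return 0;
}

Compile and run exactly like the earlier hello_world_mpi example; each process takes a different branch based solely on its rank.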
Example Program Execution

Process 0:   rank = 0, size = 10
Process 1:   rank = 1, size = 10
Process 2:   rank = 2, size = 10
...
Process 9:   rank = 9, size = 10

Processes each have their own copies of all variables declared within the code. They also have their own memory environment, but they all execute the same code.
A Simple Pthreads Program
#include <stdio.h>
#include <stdint.h>   /* intptr_t, for converting the thread argument safely */
#include <pthread.h>

#define NUM_THREADS 4

void * helloThread(void * tid)
{
    int threadId = (int)(intptr_t)tid;   /* the argument is an integer smuggled in a pointer */
    printf("Hello from thread %d\n", threadId);
    pthread_exit(NULL);
}
A Simple Pthreads Program
int main(int argc, char ** argv)
{
    pthread_t threadList[NUM_THREADS];
    int i;

    for(i = 0; i < NUM_THREADS; i++)
        pthread_create(&threadList[i], NULL, helloThread, (void *)(intptr_t)i);

    for(i = 0; i < NUM_THREADS; i++)
        pthread_join(threadList[i], NULL);

    return 0;
}
Compile and Run a PThreads Program
• Compile with:
• gcc pthreadsExample.c -pthread
-o pthreadsExample.exe
• Run with:
• ./pthreadsExample.exe
A Simple OpenMP Example
// Based upon LLNL Example code from
// https://computing.llnl.gov/tutorials/openMP/
#include <omp.h>
#include <stdio.h>

int main(int argc, char ** argv) {
    int tid;

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Hello from thread number %d\n", tid);
    } //Threads join and the program finishes

    return 0;
}
Compile and Run an OpenMP Program
• Compile with:
• gcc ompExample.c -fopenmp -o
ompExample.exe
• Run with:
• ./ompExample.exe
Referenced Texts
• Patterson & Hennessy, Computer Organization and Design: The Hardware/Software Interface
• Patterson & Hennessy, Computer Architecture:
A Quantitative Approach
• Grama, Gupta, Karypis, and Kumar,
Introduction to Parallel Computing, Second
Edition
Reading Assignment for This Week
• C Programming Slides – see course website –
http://monismith.info/cs599/notes.html
• LLNL pthreads tutorial https://computing.llnl.gov/tutorials/pthreads/
• LLNL OpenMP tutorial https://computing.llnl.gov/tutorials/openMP/