Parallel Programming
J. H. Wang, May 2, 2017

Outline
• Introduction to Parallel Programming
• Parallel Algorithm Design

Motivation
• "Fast" isn't fast enough
• Faster computers let you tackle larger computations

What's Parallel Programming?
• The use of a parallel computer to reduce the time needed to solve a single computational problem
• A parallel computer is a multiple-processor system, e.g., a multicomputer or a centralized multiprocessor (SMP)
• Programming in a language that lets you explicitly indicate how different portions of the computation may be executed concurrently by different processors
  • MPI: the Message Passing Interface
  • OpenMP: for SMPs

Concurrency
• The first step is to identify operations that may be performed in parallel (concurrently)
• Data dependence graph
  • Vertex u: a task
  • Edge u -> v: task v depends on task u
• Data parallelism: independent tasks applying the same operation to different data elements
• Functional parallelism: independent tasks applying different operations to different data elements
• Pipelined computation: the computation is divided into stages
• Size considerations

An Example of a Data Dependence Graph
(figure from the original slides, not reproduced here)

Programming Parallel Computers
• Parallelizing compilers: compile sequential programs, possibly annotated with compiler directives
• Extending a sequential programming language with functions for the creation, synchronization, and communication of processes (e.g., MPI)
• Adding a parallel programming layer that handles the creation and synchronization of processes and the partitioning of data
• Designing a parallel language, or adding parallel constructs to existing languages

Parallel Algorithm Design
• The task/channel model represents a parallel computation as a set of tasks that interact by sending messages through channels
  • Task: a program, its local memory, and a collection of I/O ports
  • Channel: a message queue connecting one task's output port to another task's input port
  • Sending is asynchronous; receiving is synchronous (the receiving task blocks until the value arrives)
• PCAM (Partitioning, Communication, Agglomeration, Mapping): a design methodology for parallel programs

Partitioning
• Dividing the computation and data into pieces
• Domain decomposition: first divide the data into pieces, then determine how to associate computations with the data (a sketch follows below)
• Functional decomposition: first divide the computation into pieces, then determine how to associate data items with the computations (e.g., pipelining)
• The goal is to identify as many primitive tasks as possible
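To make domain decomposition concrete, here is a minimal sketch, assuming a shared-memory machine and an OpenMP-capable C compiler (e.g., gcc -fopenmp); the array size and the update it applies are illustrative choices, not part of the original slides.

```c
/* Minimal domain-decomposition sketch (illustrative, not from the slides).
 * Build: gcc -fopenmp -std=c99 decomp.c -o decomp */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];

    /* Initialize the data domain. */
    for (int i = 0; i < N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
    }

    /* Domain decomposition: OpenMP splits the index range [0, N)
     * among threads; each thread applies the same operation to its
     * own piece of the data (data parallelism). */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = a[i] + 0.5 * b[i];

    printf("a[0] = %f, using up to %d threads\n", a[0], omp_get_max_threads());
    return 0;
}
```

Each thread receives a piece of the index range, so the same operation runs on different data elements, which is exactly the data parallelism described above.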
Checklist for Partitioning
• There are at least an order of magnitude more primitive tasks than processors
• Redundant computations and redundant data storage are minimized
• Primitive tasks are roughly the same size
• The number of tasks is an increasing function of the problem size

Communication
• Local communication: when a task needs values from a small number of other tasks, we create channels from the tasks supplying the data to the task consuming them (see the MPI sketch at the end of these notes)
• Global communication: when a significant number of primitive tasks must contribute data in order to perform a computation
• Communication is part of the overhead of a parallel algorithm

Checklist for Communication
• Communication operations are balanced among tasks
• Each task communicates with only a small number of neighbors
• Tasks can perform their communications concurrently
• Tasks can perform their computations concurrently

Agglomeration
• Grouping tasks into larger tasks in order to improve performance or simplify programming
• Goals of agglomeration
  • To lower communication overhead by increasing the locality of the parallel algorithm; another way to lower communication overhead is to combine groups of sending and receiving tasks, reducing the number of messages sent
  • To maintain the scalability of the design
  • To reduce software engineering costs

Checklist for Agglomeration
• The agglomeration has increased the locality of the parallel algorithm
• Replicated computations take less time than the communications they replace
• The amount of replicated data is small enough to allow the algorithm to scale
• Agglomerated tasks have similar computational and communication costs
• The number of tasks is an increasing function of the problem size
• The number of tasks is as small as possible, yet at least as great as the number of processors
• The tradeoff between agglomeration and the cost of modifying existing sequential code is reasonable

Mapping
• Assigning tasks to processors
• Goals: to maximize processor utilization and to minimize interprocessor communication
• These goals usually conflict
• Finding an optimal mapping is NP-hard

Checklist for Mapping
• Designs based on one task per processor and on multiple tasks per processor have both been considered
• Both static and dynamic allocation of tasks to processors have been evaluated
• For dynamic allocation, the task allocator is not a bottleneck
• For static allocation, the ratio of tasks to processors is at least 10:1

References
• Ian Foster, Designing and Building Parallel Programs, available online at http://www.mcs.anl.gov/~itf/dbpp/
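As a closing illustration of local communication in the task/channel style, here is a minimal sketch, assuming MPI and a C compiler (build with mpicc, run with mpirun); the ring topology, the rank-valued payload, and the message tags are illustrative assumptions, not from the original slides.

```c
/* Minimal local-communication sketch (illustrative, not from the slides).
 * Each rank owns one value and exchanges it with its two ring neighbors.
 * Build: mpicc ring.c -o ring    Run: mpirun -np 4 ./ring */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;   /* neighbor ranks on a ring */
    int right = (rank + 1) % size;

    double mine = (double)rank;             /* this task's local data   */
    double from_left, from_right;

    /* Each task communicates with only a small number of neighbors, and
     * all exchanges proceed concurrently. MPI_Sendrecv pairs each send
     * with the matching blocking receive, keeping the ring deadlock-free. */
    MPI_Sendrecv(&mine, 1, MPI_DOUBLE, right, 0,
                 &from_left, 1, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&mine, 1, MPI_DOUBLE, left, 1,
                 &from_right, 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %.0f from left and %.0f from right\n",
           rank, from_left, from_right);

    MPI_Finalize();
    return 0;
}
```

Each rank talks only to its two neighbors, matching the communication checklist item that every task should communicate with a small number of neighbors, and the blocking receive mirrors the synchronous-receive semantics of the task/channel model.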