Parallel Programming
J. H. Wang
May 2, 2017

Outline
• Introduction to Parallel Programming
• Parallel Algorithm Design

Motivation
• “Fast” isn’t fast enough
• Faster computers let you tackle larger computations

What Is Parallel Programming?
• The use of a parallel computer to reduce the time needed to solve a single computational problem
• A parallel computer is a multiple-processor system
  • Multicomputers, centralized multiprocessors (SMPs)
• Programming in a language that allows you to explicitly indicate how different portions of the computation may be executed concurrently by different processors
  • MPI: the Message Passing Interface
  • OpenMP: for SMPs

Concurrency
• Identify operations that may be performed in parallel (concurrently)
• Data dependence graph
  • Vertex u: a task
  • Edge u -> v: task v depends on task u
• Data parallelism
  • Independent tasks applying the same operation to different data elements
• Functional parallelism
  • Independent tasks applying different operations to different data elements
• Pipelined computation
  • Computation divided into stages
• Size considerations

An Example of a Data Dependence Graph

Programming Parallel Computers
• Parallelizing compilers
• Sequential programs with compiler directives
• Extending a sequential programming language with parallel functions
  • For creation, synchronization, and communication of processes, e.g., MPI
• Adding a parallel programming layer
  • Creation and synchronization of processes, partitioning of data
• Parallel languages
  • Either a new parallel language or parallel constructs added to an existing language

Parallel Algorithm Design
• The task/channel model represents a parallel computation as a set of tasks that interact by sending messages through channels
  • Task: a program, its local memory, and a collection of I/O ports
  • Channel: a message queue connecting one task’s output port to another task’s input port
  • Sending is asynchronous; receiving is synchronous
• PCAM: a design methodology for parallel programs (Partitioning, Communication, Agglomeration, Mapping)

Partitioning
• Dividing the computation and data into pieces
• Domain decomposition
  • First divide the data into pieces, then determine how to associate computations with the data (see the sketch after this slide)
• Functional decomposition
  • First divide the computation into pieces, then determine how to associate data items with the computations
  • E.g., pipelining
• The goal is to identify as many primitive tasks as possible
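Below is a minimal sketch of domain decomposition and data parallelism in OpenMP. It is not part of the original slides: the array a, the problem size N, and the doubling operation are assumptions chosen purely for illustration. The loop iterations play the role of primitive tasks, and schedule(static) hands each thread one contiguous block of the data, mirroring the "divide the data first, then attach the computation" idea above.

/* Hypothetical example (not from the slides): domain decomposition of an
 * array with OpenMP. Each loop iteration is a primitive task applying the
 * same operation to a different data element (data parallelism).
 * Build with, e.g.:  gcc -fopenmp -O2 decompose.c -o decompose           */
#include <stdio.h>
#include <omp.h>

#define N 1000000                 /* assumed problem size */

int main(void)
{
    static double a[N];

    /* schedule(static) assigns each thread one contiguous block of
     * iterations, i.e., one piece of the decomposed domain.             */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;           /* same operation, different element   */

    printf("used up to %d threads; a[N-1] = %.1f\n",
           omp_get_max_threads(), a[N - 1]);
    return 0;
}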
Checklist for Partitioning
• There are at least an order of magnitude more primitive tasks than processors
• Redundant computations and redundant data storage are minimized
• Primitive tasks are roughly the same size
• The number of tasks is an increasing function of the problem size

Communication
• Local communication
  • When a task needs values from a small number of other tasks, we create channels from the tasks supplying the data to the task consuming them
• Global communication
  • When a significant number of primitive tasks must contribute data in order to perform a computation
• Communication is part of the overhead of a parallel algorithm

Checklist for Communication
• Communication operations are balanced among tasks
• Each task communicates with only a small number of neighbors
• Tasks can perform their communications concurrently
• Tasks can perform their computations concurrently

Agglomeration
• Grouping tasks into larger tasks in order to improve performance or simplify programming
• Goals of agglomeration
  • To lower communication overhead by increasing the locality of the parallel algorithm
  • Another way to lower communication overhead is to combine groups of sending and receiving tasks, reducing the number of messages sent
  • To maintain the scalability of the design
  • To reduce software engineering costs

Checklist for Agglomeration
• The agglomeration has increased the locality of the parallel algorithm
• Replicated computations take less time than the communications they replace
• The amount of replicated data is small enough to allow the algorithm to scale
• Agglomerated tasks have similar computational and communication costs
• The number of tasks is an increasing function of the problem size
• The number of tasks is as small as possible, yet at least as great as the number of processors
• The tradeoff between agglomeration and the cost of modifying existing sequential code is reasonable

Mapping
• Assigning tasks to processors (an end-to-end MPI sketch follows the references below)
• Goal: to maximize processor utilization and minimize interprocessor communication
  • These are usually conflicting goals
  • Finding an optimal mapping is NP-hard

Checklist for Mapping
• Designs based on one task per processor and on multiple tasks per processor have been considered
• Both static and dynamic allocation of tasks to processors have been evaluated
• For dynamic allocation, the task allocator is not a bottleneck
• For static allocation, the ratio of tasks to processors is at least 10:1

References
• Ian Foster, Designing and Building Parallel Programs, available online at http://www.mcs.anl.gov/~itf/dbpp/
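As a closing illustration of the PCAM steps, here is a hedged MPI sketch; it is not taken from the slides or from Foster's text. The problem (summing the integers 0 to N-1), the size N, and the block layout are assumptions made only for illustration: the data are partitioned into contiguous blocks, each block is agglomerated into a single task, one task is mapped to each process, and a single global communication (MPI_Reduce) combines the partial results.

/* Hypothetical end-to-end example (not from the slides): block
 * partitioning, agglomeration into one task per process, static mapping,
 * and one global communication.
 * Build with mpicc and run with, e.g.:  mpiexec -n 4 ./sum               */
#include <stdio.h>
#include <mpi.h>

#define N 1000000                      /* assumed problem size */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Partitioning + agglomeration: process `rank` owns the contiguous
     * block of indices [lo, hi).                                         */
    long lo = (long)rank * N / size;
    long hi = (long)(rank + 1) * N / size;

    double local = 0.0;
    for (long i = lo; i < hi; i++)     /* purely local computation        */
        local += (double)i;

    /* Global communication: combine the partial sums on process 0.       */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of 0..N-1 = %.0f (N = %d)\n", total, N);

    MPI_Finalize();
    return 0;
}

In this layout communication is balanced (each process contributes exactly one value), all local computation proceeds concurrently, and there is one agglomerated task per processor, which is one of the design options the mapping checklist asks you to consider.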