High Performance Computing Lecture 1

Virtues of Good (Parallel) Software
- Concurrency: able to exploit concurrency in the algorithm, problem, and hardware
- Scalability: resilient to increasing processor count
- Locality: more frequent access to local data than to remote data
- Modularity: employs abstraction and modular design

Two Basic Requirements for a Parallel Program
- Safety: produces correct results. The result computed on P processors and on 1 processor must be IDENTICAL.
- Liveness: able to proceed and finish; free of deadlock.

Sources of Overhead
- Execution time: the time that elapses from when the first processor starts executing on the problem to when the last processor completes execution
- Execution time = computation time + communication time + idle time
- Communication / interprocess interaction: usually the main source of overhead
  - T_comm = t_s + t_w*L, where t_s is the message startup time, t_w the transfer time per word, and L the message length in words
  - Minimize the volume and frequency of communications; overlap computation and communication
- Idling: lack of computation or lack of data
  - Load imbalance
  - Synchronization
  - Presence of serial components
  - Wait on remote data
- Replicated computation
  - Communicate or replicate

Speedup & Efficiency
- Relative speedup: the factor by which the execution time is reduced on multiple processors
  - S(p) = T_1/T_p
  - T_1 is the execution time on one processor
  - T_p is the execution time on p processors
- Absolute speedup: same ratio, where T_1 is the uniprocessor time of the best-known sequential algorithm
- Normally S(p) <= p
- Embarrassingly parallel (EP): no communication among CPUs
- Superlinear speedup (S(p) > p): does occur in practice, e.g. through cache effects
- Efficiency: the fraction of time that processors spend doing useful work
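The speedup definition and the message cost model above can be tried on a few numbers. The sketch below is illustrative only: the timings and the values of t_s and t_w are hypothetical, t_s is read as a per-message startup cost and t_w as a per-word transfer cost, and efficiency is taken as speedup divided by processor count, which is the usual convention.

```python
def parallel_metrics(t1, tp, p):
    """Return (speedup, efficiency) for measured run times.

    t1: execution time on one processor
    tp: execution time on p processors
    """
    speedup = t1 / tp          # S(p) = T_1 / T_p
    efficiency = speedup / p   # fraction of time spent on useful work
    return speedup, efficiency


def comm_time(t_s, t_w, total_words, n_messages):
    """Time to send total_words split evenly over n_messages,
    charging T_comm = t_s + t_w * L for each message of L words."""
    words_per_message = total_words / n_messages
    return n_messages * (t_s + t_w * words_per_message)


# Hypothetical measurements, chosen only for illustration.
s, e = parallel_metrics(t1=100.0, tp=15.0, p=8)

# Same data volume, sent as one message or as ten smaller ones.
one_big = comm_time(t_s=50.0, t_w=0.5, total_words=1000, n_messages=1)
many_small = comm_time(t_s=50.0, t_w=0.5, total_words=1000, n_messages=10)

print(f"speedup={s:.2f} efficiency={e:.2f}")
print(f"one message: {one_big}, ten messages: {many_small}")
```

With these assumed values the ten-message version pays the startup cost ten times, which is why reducing the frequency of communication matters as well as reducing its volume.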
Speedup & Efficiency (continued)
- Efficiency: E = S/p = T_1/(p*T_p)
- Parallel cost: p*T_p
- Parallel overhead: T_o = p*T_p - T_1

Amdahl's Law
- S = 1 / ((1 - alpha) + alpha/P)
- alpha: fraction of operations in the serial code that can be parallelized
- P: number of processors
- This is for a fixed problem size
- T_P = alpha*T_1/P + (1 - alpha)*T_1
- S -> 1/(1 - alpha) as P -> infinity
  - alpha = 90%: S -> 10
  - alpha = 99%: S -> 100
  - alpha = 99.9%: S -> 1000
- "Mental block"

Gustafson's Law
- S = (1 - alpha) + alpha*P
- alpha: fraction of time spent on parallel operations in the parallel program
- This is for a scaled problem size, i.e. constant run time
- T_1 = (1 - alpha)*T_P + P*alpha*T_P
- As the problem size increases, the fraction of parallel operations increases

Iso-Efficiency Function
- For a fixed problem size N, as P increases the growth in speedup S slows down or levels off, and efficiency E decreases
- For fixed P, as N increases, S increases and E increases
- As P increases, the problem size N can be increased so that the efficiency is kept constant
- This N(P) for fixed efficiency is called the iso-efficiency function
- The rate of increase of N(P), dN/dP, measures the scalability of a parallel program
  - Smaller rate of increase -> more scalable

Parallel Program Design
- PCAM model (I. Foster): Partitioning, Communication, Agglomeration, Mapping
- Partitioning and communication: concurrency, scalability
- Agglomeration and mapping: locality, performance-related issues

Partitioning
- Decompose the computation to be performed, and the data operated on by this computation, into small tasks
- Purpose: expose opportunities for parallel execution
  - Ignore practical issues such as the number of processors in the target machine
  - Avoid replicating computation and data
- Focus: define a large number of small tasks in order to yield a fine-grained decomposition of the problem
  - A fine-grained decomposition provides the greatest flexibility in terms of potential parallel algorithms
  - Maximize concurrency
- Good partition: divides both the computation associated with a problem and the data this computation operates on
- Domain/data decomposition: first focus on data
  - Partition the data associated with the problem
  - Associate computations with the partitioned data
- Functional decomposition: first focus on computation
  - Decompose the computations to be performed
  - Then deal with the data that the decomposed computations work on

Domain Decomposition
- Decompose the data first, and then the associated computations
  - "Owner computes"
- Outcome: tasks comprising some data and a set of operations on that data
  - Some operations may require data from several tasks -> communication
- Data can be input data, output data, intermediate data, or all of them
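A minimal sketch of domain decomposition with the owner-computes rule follows. It assumes a 1D array split into contiguous blocks and a 3-point averaging operation; the function names and the choice of operation are hypothetical, and the tasks are run one after another in a loop rather than on separate processors. Reads of a neighbouring block's boundary element mark where communication would be needed.

```python
def block_range(n, p, rank):
    """Half-open index range [lo, hi) owned by task `rank` when n items
    are split into p contiguous blocks of nearly equal size."""
    base, extra = divmod(n, p)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def smooth_owned(data, lo, hi):
    """Owner computes: produce results only for indices in [lo, hi).

    Reading data[i - 1] or data[i + 1] outside [lo, hi) stands in for
    receiving a boundary value from a neighbouring task.
    """
    n = len(data)
    out = []
    for i in range(lo, hi):
        left = data[max(i - 1, 0)]        # clamp at the global boundary
        right = data[min(i + 1, n - 1)]
        out.append((left + data[i] + right) / 3.0)
    return out


def smooth_decomposed(data, p):
    """Run the p tasks in sequence and concatenate their owned results."""
    result = []
    for rank in range(p):
        lo, hi = block_range(len(data), p, rank)
        result.extend(smooth_owned(data, lo, hi))
    return result


data = [float(i * i) for i in range(10)]
ranges = [block_range(len(data), 3, r) for r in range(3)]
print(ranges)
print(smooth_decomposed(data, 3) == smooth_decomposed(data, 1))
```

Because every index is computed by exactly one owner with the same arithmetic, the decomposed result matches the single-task result exactly, which is the safety requirement stated earlier in the lecture.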
Domain Decomposition (continued)
- Rule of thumb: focus first on the largest data structure or the data structure accessed most frequently
- Mesh-based problems:
  - Structured mesh: 1D, 2D, 3D decompositions
  - Unstructured mesh: graph partitioning tools such as METIS
- Favor the most aggressive decomposition possible at this stage

Functional Decomposition
- Focus first on the computation to be performed; divide computations into disjoint tasks
- Then consider the data associated with each sub-task
  - Data requirements may be disjoint -> done
  - Data may overlap significantly -> communication; may just as well try domain decomposition
- Provides an alternative way of thinking about the problem; a hybrid decomposition may be best
  - E.g. multi-physics simulations: functional decomposition overall, domain decomposition within each component

Partitioning: Questions to Ask
- Does your partition define more tasks (an order of magnitude more?) than the number of processors of the target machine?
  - No -> reduced flexibility in subsequent stages
- Does your partition avoid redundant computation and storage requirements?
  - No -> may not be scalable to large problems
- Are tasks of comparable size?
  - No -> hard to allocate an equal amount of work to each CPU -> load imbalance
- Does the number of tasks scale with problem size?
  - Ideal: increased problem size -> increase in number of tasks
  - No -> may not be able to solve larger problems with more processors
- Have you identified alternative partitions?
  - Maximize flexibility; try both domain and functional decompositions

Communication
- Purpose:
  - Determine the interaction among tasks
  - Distribute communication operations among many tasks
  - Organize communication operations in a way that permits concurrent execution
- 4 categories of communication:
  - Local/global communication
    - Local: each task communicates with a small set of other tasks (neighbors)
    - Global: each task communicates with many or all other tasks
  - Structured/unstructured communication
    - Structured: a task and its neighbors form a regular structure, such as a grid or tree
    - Unstructured: communication is represented by arbitrary graphs
  - Static/dynamic communication
    - Static: the identity of communication partners does not change over time
    - Dynamic: the identity of partners is determined by data computed at runtime and is highly variable
  - Synchronous/asynchronous communication
    - Synchronous: requires coordination between communication partners
    - Asynchronous: proceeds without cooperation between partners

Task Dependency Graph
- Task dependencies: one task cannot start until some other task(s) finish
  - E.g. the output of one task is the input to another task
- Represented by the task dependency graph:
  - Directed and acyclic
  - Nodes: tasks (task size as the weight of the node)
  - Directed edges: dependencies among tasks
- Degree of concurrency: the number of tasks that can run concurrently
  - Maximum degree of concurrency: the maximum number of tasks that can be executed simultaneously at any given time
  - Average degree of concurrency: the average number of tasks that can run concurrently over the duration of the program
- Critical path: the longest vertex-weighted directed path between any pair of start and finish nodes
  - Critical path length: the sum of vertex weights along the critical path
- Average degree of concurrency = total amount of work / critical path length

Task Interaction Graph
- Even independent tasks may need to interact, e.g. by sharing data
- Interaction graph: captures interaction patterns among tasks
  - Nodes: tasks
  - Edges: communications / interactions
- [Figure: example interaction graph]
- Usually contains the task dependency graph as a sub-graph

Communication: Questions to Ask
- Do all tasks perform the same number of communication operations?
  - Unbalanced communication -> poor scalability
  - Distribute communications equitably
- Does each task communicate only with a small number of neighbors?
  - May need to reformulate global communication in terms of local communication structures
- Can communications proceed concurrently?
- Can computations associated with different tasks proceed concurrently?
  - No -> may need to re-order computations / communications

Agglomeration
- Improve performance: combine tasks to reduce the task interaction strength, increase locality, and increase the computation and communication granularity
- Also determine whether it is worthwhile to replicate data or computation
- Dependent tasks will be combined
- Independent tasks may also be agglomerated to increase granularity
- Goals: reduce communication cost; retain flexibility with respect to scalability and mapping decisions

Increasing Granularity
- Coarse grain usually performs better:
  - Send less data (reduces the volume of communication)
  - Use fewer messages when sending the same amount of data (reduces the frequency of communication)
- Surface-to-volume effects:
  - Communication cost is usually proportional to the surface area of the domain
  - Computation cost is usually proportional to the volume of the domain
  - As task size increases, the amount of communication per unit of computation decreases
  - Higher-dimensional decompositions are usually more efficient than lower-dimensional ones, due to the reduced surface area for a given volume
- Replicate computation:
  - May trade off replicated computation for reduced communication or execution time

Agglomeration: Questions to Ask
- Has agglomeration reduced communication costs by increasing locality?
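The surface-to-volume argument behind these agglomeration choices can be checked with a small calculation. The sketch below makes several assumptions that are not in the lecture: an n x n grid, a nearest-neighbour exchange of one layer of cells per shared edge, an interior task with neighbours on every side, n divisible by the block counts, and a processor count that is a perfect square for the 2D case. The function names are hypothetical.

```python
import math


def strip_cost(n, p):
    """1D decomposition of an n x n grid into p horizontal strips.
    Returns (cells computed, cells exchanged) for an interior task."""
    compute = (n // p) * n   # volume: rows owned times row length
    comm = 2 * n             # surface: one full row to each of 2 neighbours
    return compute, comm


def block_cost(n, p):
    """2D decomposition of an n x n grid into sqrt(p) x sqrt(p) blocks.
    Returns (cells computed, cells exchanged) for an interior task."""
    q = math.isqrt(p)
    if q * q != p:
        raise ValueError("p must be a perfect square for this sketch")
    side = n // q
    compute = side * side    # volume of the block
    comm = 4 * side          # surface: one edge to each of 4 neighbours
    return compute, comm


for n in (1024, 2048):
    sc, sm = strip_cost(n, 16)
    bc, bm = block_cost(n, 16)
    print(f"n={n}: 1D comm/comp={sm / sc:.6f}  2D comm/comp={bm / bc:.6f}")
```

Under these assumptions both decompositions give each task the same amount of computation, the 2D blocks exchange half as many cells as the 1D strips on 16 processors, and doubling n halves the communication-to-computation ratio in both cases.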
Agglomeration: Questions to Ask (continued)
- If computation is replicated, have you verified that the benefits of replication outweigh its costs for a range of problem sizes and processor counts?
- If data is replicated, have you verified that it does not compromise scalability?
- Do the tasks have similar computation and communication costs after agglomeration?
  - Load balance
- Does the number of tasks still scale with problem size?

Mapping
- Map tasks to processors or processes
- If the number of tasks is larger than the number of processors, more than one task may need to be placed on a single processor
- Goal: minimize total execution time
  - Place tasks that execute concurrently on different processors
  - Place tasks that communicate frequently on the same processor
- In the general case the mapping problem is NP-complete; no computationally tractable algorithm is known
- If SPMD-style, one task per processor

Parallel Algorithm Models
- Data parallel model: processors perform similar operations on different data
- Work/task pool model (replicated workers):
  - A pool of tasks and a number of processors
  - A processor can remove a task from the pool and work on it
  - A processor may generate a new task during computation and add it to the pool
- Master-slave/manager-worker model: master processors generate work and allocate it to worker processors
- Pipeline/producer-consumer model: a stream of data passes through a succession of processors, each of which performs some task on it
- Hybrid model: a combination of two or more models
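The work/task pool model listed above can be sketched with Python's standard `queue` and `threading` modules. This is an illustration of the pool structure only, under assumptions that are not in the lecture: the "task" is a hypothetical one (split an integer into unit pieces and count them), and threads in standard CPython do not give real speedup for CPU-bound work.

```python
import queue
import threading


def run_task_pool(initial_tasks, num_workers=4):
    """Process a pool of integer tasks with num_workers threads.

    A task n > 1 is split into two smaller tasks that go back into the
    pool; a task n <= 1 is a leaf whose value is recorded.
    Returns the sum of all leaf values.
    """
    tasks = queue.Queue()
    for t in initial_tasks:
        tasks.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            n = tasks.get()            # remove a task from the pool
            if n is None:              # sentinel: no more work
                tasks.task_done()
                break
            if n > 1:
                # Generate new tasks and add them to the pool.
                tasks.put(n // 2)
                tasks.put(n - n // 2)
            else:
                with lock:
                    results.append(n)
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    tasks.join()                       # wait until every task is processed
    for _ in threads:
        tasks.put(None)                # tell each worker to stop
    for th in threads:
        th.join()
    return sum(results)


total = run_task_pool([5, 3], num_workers=3)
print(total)
```

New tasks are added before the parent task is marked done, so `tasks.join()` cannot return while work is still being generated. In a manager-worker variant, only a designated manager would add tasks to the queue.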
