More on Parallel Computing
Spring Semester 2005
Geoffrey Fox
Community Grids Laboratory
Indiana University
505 N Morton, Suite 224
Bloomington IN
[email protected]
Computational Science
What is Parallel Architecture?
• A parallel computer is any old collection of processing
elements that cooperate to solve large problems fast
– from a pile of PCs to a shared-memory multiprocessor
• Some broad issues:
– Resource Allocation:
• how large a collection?
• how powerful are the elements?
• how much memory?
– Data access, Communication and Synchronization
• how do the elements cooperate and communicate?
• how are data transmitted between processors?
• what are the abstractions and primitives for cooperation?
– Performance and Scalability
• how does it all translate into performance?
• how does it scale?
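A standard way (not on the original slide) to quantify the last two questions: if T(1) is the time on one processor and T(p) the time on p processors, then
    speedup    S(p) = T(1) / T(p)
    efficiency E(p) = S(p) / p
Amdahl's law bounds the speedup by S(p) <= 1 / (f + (1 - f)/p), where f is the fraction of the work that cannot be parallelized.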
Parallel Computers -- Classic Overview
• Parallel computers allow several CPUs to contribute to a
computation simultaneously.
• For our purposes, a parallel computer has three types of
parts:
Colors Used in
– Processors
Following
pictures
– Memory modules
– Communication / synchronization network
• Key points:
–
–
–
–
All processors must be busy for peak speed.
Local memory is directly connected to each processor.
Accessing local memory is much faster than other memory.
Synchronization is expensive, but necessary for correctness.
Distributed Memory Machines
• Every processor has a
memory others can’t access.
• Advantages:
– Relatively easy to design and
build
– Predictable behavior
– Can be scalable
– Can hide latency of
communication
• Disadvantages:
– Hard to program
– Program and O/S (and
sometimes data) must be
replicated
Communication on Distributed Memory Architecture
(Figure: explicit messages exchanged between processors' memories)
• On distributed memory machines, each chunk of decomposed data resides
in a separate memory space -- a single processor is typically responsible
for both storing and processing that chunk (the owner-computes rule)
• Information needed on the edges for an update must be communicated via
explicitly generated messages
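As a concrete illustration of such explicitly generated messages (a minimal sketch, not taken from these slides; the array size and data are assumed), here is a one-dimensional ghost-cell exchange in C with MPI:

    /* Minimal 1D ghost-cell exchange sketch (illustrative only). */
    #include <mpi.h>

    #define N 1000                        /* interior points per rank (assumed) */

    int main(int argc, char **argv) {
        int rank, size;
        double u[N + 2];                  /* u[0] and u[N+1] are ghost cells */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int i = 0; i <= N + 1; i++) u[i] = (double)rank;   /* dummy data */

        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        /* Send my right edge to the right neighbour while receiving the
           left neighbour's right edge into my left ghost cell, then do
           the mirror-image exchange. */
        MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,
                     &u[0], 1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  1,
                     &u[N + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* ... owner-computes update of u[1..N] would go here ... */

        MPI_Finalize();
        return 0;
    }

Each rank owns u[1..N]; the two MPI_Sendrecv calls fill the ghost cells u[0] and u[N+1] with the neighbouring ranks' edge values before the owner-computes update.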
Distributed Memory Machines -- Notes
• Conceptually, the nCUBE, CM-5, Paragon, SP-2, Beowulf PC cluster, and
BlueGene are quite similar.
• The bandwidth and latency of their interconnects differ
• The network topology is a two-dimensional torus for the Paragon, a
three-dimensional torus for BlueGene, a fat tree for the CM-5, a
hypercube for the nCUBE, and a switch for the SP-2
• To program these machines:
• Divide the problem to minimize the number of messages while retaining
parallelism
• Convert all references to global structures into references to local
pieces (explicit messages convert distant to local variables)
• Optimization: Pack messages together to reduce fixed overhead (almost
always needed) -- see the sketch after this list
• Optimization: Carefully schedule messages (usually done by a library)
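A hedged sketch of the second and third points above (not from the slides; the distribution, sizes, and helper names are illustrative). With a 1D block distribution, a reference to a global index either becomes a local index or must arrive by message from the owning processor, and a whole boundary strip is packed into one message rather than sent value by value:

    /* Illustrative helpers for a 1D block distribution where rank r
       owns global indices [r*nlocal, (r+1)*nlocal). */
    #include <mpi.h>

    /* Which rank owns global index g, and where it lives locally. */
    static int owner(int g, int nlocal)           { return g / nlocal; }
    static int global_to_local(int g, int nlocal) { return g % nlocal; }

    /* Packing: copy a whole boundary column of a 2D block into one
       contiguous buffer and send it as a single message, so the fixed
       per-message overhead (latency) is paid once, not ny times. */
    static void send_boundary_column(const double *a, int nx, int ny,
                                     int col, int dest, MPI_Comm comm) {
        double buf[1024];                         /* assumes ny <= 1024 */
        for (int j = 0; j < ny; j++)
            buf[j] = a[j * nx + col];             /* gather the strip   */
        MPI_Send(buf, ny, MPI_DOUBLE, dest, 0, comm);
    }

With these conventions, a loop written over global indices can be rewritten to run only over the locally owned range, which is the essence of the "global to local" conversion.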
BlueGene/L has Classic Architecture
32,768-node BlueGene/L takes #1 TOP500 position, 29 Sept 2004
70.7 Teraflops
BlueGene/L Fundamentals
• Low-complexity nodes give more flops per transistor and per watt
• 3D interconnect supports many scientific simulations, as nature as we
see it is 3D
1987 MPP
(Figure: 1024-node full system with hypercube interconnect)
Shared-Memory Machines
• All processors access the same
memory.
• Advantages:
– Retain sequential programming
languages such as Java or Fortran
– Easy to program (correctly)
– Can share code and data among
processors
• Disadvantages:
– Hard to program (optimally)
– Not scalable due to bandwidth
limitations in bus
Communication on Shared Memory Architecture
• On a shared memory machine a CPU is responsible for processing a
decomposed chunk of data but not for storing it
• The nature of the parallelism is identical to that for distributed
memory machines, but communication is implicit, as one "just" accesses
memory
Shared-Memory Machines -- Notes
• Interconnection network varies from machine to
machine
• These machines share data by direct access.
– Potentially conflicting accesses must be protected by
synchronization.
– Simultaneous access to the same memory bank will
cause contention, degrading performance.
– Some access patterns will collide in the network (or
bus), causing contention.
– Many machines have caches at the processors.
– All these features make it profitable to have each
processor concentrate on one area of memory that
others access infrequently.
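As a small illustration of the synchronization point above (a sketch, not from the slides; thread and iteration counts are arbitrary), here is a C/pthreads example in which simultaneous updates of a shared counter are protected by a mutex:

    /* Sketch: protecting a shared counter so that simultaneous updates
       from several threads do not conflict (illustrative only). */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NITER    100000

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *work(void *arg) {
        (void)arg;
        for (int i = 0; i < NITER; i++) {
            pthread_mutex_lock(&lock);    /* synchronization is expensive ...   */
            counter++;                    /* ... but necessary for correctness  */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, work, NULL);
        for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);   /* NTHREADS * NITER */
        return 0;
    }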
Distributed Shared Memory Machines
• Combining the
(dis)advantages of shared
and distributed memory
• Lots of hierarchical
designs.
– Typically, “shared memory
nodes” with 4 to 32
processors
– Each processor has a local
cache
– Processors within a node
access shared memory
– Nodes can get data from or
put data to other nodes’
memories
Summary on Communication etc.
• Distributed Shared Memory machines have
communication features of both distributed (messages)
and shared (memory access) architectures
• Note that for distributed memory, the programming model must express
data location (the HPF Distribute command) and the invocation of
messages (MPI syntax)
• For shared memory, one needs to express control (openMP) or processing
parallelism and synchronization -- one needs to make certain that when a
variable is updated, the "correct" version is used by other processors
accessing the variable, and that values living in caches are updated
Seismic Simulation of Los Angeles Basin
• This is a (sophisticated) wave equation similar to the Laplace example;
you divide Los Angeles geometrically and assign a roughly equal number of
grid points to each processor (a small sketch of this division follows)
(Figure: a computer with 4 processors; the problem represented by grid
points and divided into 4 domains)
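A minimal sketch (not from the slides) of "assign a roughly equal number of grid points to each processor" for a one-dimensional strip of the grid; the grid size and processor count are assumed:

    /* Sketch: divide NX grid points as evenly as possible among P
       processors (block decomposition).  Illustrative values only. */
    #include <stdio.h>

    int main(void) {
        const int NX = 1000;   /* grid points (assumed) */
        const int P  = 4;      /* processors            */

        for (int p = 0; p < P; p++) {
            int base  = NX / P, extra = NX % P;
            int count = base + (p < extra ? 1 : 0);          /* points owned */
            int first = p * base + (p < extra ? p : extra);  /* first index  */
            printf("processor %d owns points %d..%d (%d points)\n",
                   p, first, first + count - 1, count);
        }
        return 0;
    }

The same arithmetic can be applied in each coordinate direction for the 2D domains in the figure.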
Communication Must be Reduced
• 4 by 4 regions in each processor
– 16 Green (Compute) and 16 Red (Communicate) Points
• 8 by 8 regions in each processor
– 64 Green and "just" 32 Red Points
• Communication is an edge effect
• Give each processor plenty of memory and increase the region in each
machine
• Large Problems Parallelize Best
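To make the edge effect quantitative (a standard observation, added here): for an n by n region per processor the compute work grows as n^2 while the communicated edge grows as roughly 4n, so the communication-to-computation ratio falls as 4/n. Doubling n from 4 to 8 takes the slide's ratio from 16/16 = 1 down to 32/64 = 0.5, which is why large problems parallelize best.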
Irregular 2D Simulation -- Flow over an Airfoil
• The Laplace grid points become finite element mesh nodal points
arranged as triangles filling space
• All the action (triangles) is near the wing boundary
• Use domain decomposition, but no longer equal area; rather equal
triangle count
Heterogeneous Problems
• Simulation of a cosmological cluster (say 10 million stars)
• Lots of work per star where stars are very close together (may need a
smaller time step)
• Little work per star where the force changes slowly and can be well
approximated by a low-order multipole expansion
Load Balancing Particle Dynamics
• Particle dynamics of this type (irregular with sophisticated force
calculations) always needs complicated decompositions
• Equal-area decompositions, as shown here, lead to load imbalance
(Figure: equal-volume decomposition of a universe simulation on 16
processors; each point is a galaxy or star or ...)
• If one uses simpler algorithms (full O(N^2) forces) or an FFT, then
equal area is best
Reduce Communication
• Consider a geometric problem with 4 processors
• In the top decomposition (Block Decomposition), we divide the domain
into 4 blocks with all points in a given block contiguous
• In the bottom decomposition (Cyclic Decomposition), we give each
processor the same amount of work but divided into 4 separate domains
• edge/area(bottom) = 2 * edge/area(top)
• So minimizing communication implies we keep the points in a given
processor together (see the sketch below)
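A small sketch (not from the slides) of the two ownership rules for a one-dimensional index space; N and P are illustrative:

    /* Sketch: which processor owns global index i under a block versus
       a cyclic decomposition of N indices over P processors. */
    #include <stdio.h>

    int block_owner(int i, int N, int P)  { return i / ((N + P - 1) / P); }
    int cyclic_owner(int i, int P)        { return i % P; }

    int main(void) {
        const int N = 16, P = 4;
        for (int i = 0; i < N; i++)
            printf("i=%2d  block owner=%d  cyclic owner=%d\n",
                   i, block_owner(i, N, P), cyclic_owner(i, P));
        return 0;
    }

Under the block rule neighbouring indices usually live on the same processor, which is what keeps the edge/area ratio low.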
Minimize Load Imbalance
• But this has a flip side. Suppose we are decomposing the seismic wave
problem and all the action is near a particular earthquake fault.
• In the top (Block) decomposition only the white processor does any work
while the other 3 sit idle
– Efficiency is 25% due to load imbalance
• In the bottom (Cyclic) decomposition all the processors do roughly the
same work and so we get good load balance
Parallel Irregular Finite Elements
• Here is a cracked plate; calculating stresses with an equal-area
decomposition leads to terrible results
– All the work is near the crack
(Figure: equal-area decomposition of the plate, one region per processor)
Irregular Decomposition for Crack
(Figure: region assigned to one processor; the workload per processor is
not perfect, but close)
• Concentrating processors near the crack leads to good workload balance
• Equal nodal points per processor -- not equal area -- but to minimize
communication the nodal points assigned to a particular processor are
contiguous
• This is an NP-complete (exponentially hard) optimization problem, but
in practice there are many ways of getting good, though not exact,
decompositions (one common heuristic is sketched below)
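One widely used heuristic for such decompositions (named here as an example; the slides do not specify a method) is recursive coordinate bisection: repeatedly split the point set into two roughly equal halves along alternating coordinate directions until there is one contiguous piece per processor. A minimal 2D sketch with illustrative data:

    /* Sketch of recursive coordinate bisection (RCB): split the points
       in half along x, then each half along y, and so on, until there
       is one piece per processor.  Illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { double x, y; int proc; } Point;

    static int cmp_x(const void *a, const void *b) {
        double d = ((const Point *)a)->x - ((const Point *)b)->x;
        return (d > 0) - (d < 0);
    }
    static int cmp_y(const void *a, const void *b) {
        double d = ((const Point *)a)->y - ((const Point *)b)->y;
        return (d > 0) - (d < 0);
    }

    /* Assign processors first..first+nproc-1 to pts[0..n-1]. */
    static void rcb(Point *pts, int n, int first, int nproc, int axis) {
        if (nproc == 1) {                      /* one piece left: label it */
            for (int i = 0; i < n; i++) pts[i].proc = first;
            return;
        }
        qsort(pts, n, sizeof(Point), axis ? cmp_y : cmp_x);
        int pleft = nproc / 2;
        int nleft = (int)((long)n * pleft / nproc);   /* work proportional to processors */
        rcb(pts,         nleft,     first,         pleft,         !axis);
        rcb(pts + nleft, n - nleft, first + pleft, nproc - pleft, !axis);
    }

    int main(void) {
        enum { N = 32, P = 4 };
        Point pts[N];
        for (int i = 0; i < N; i++) {          /* random cloud of points */
            pts[i].x = rand() / (double)RAND_MAX;
            pts[i].y = rand() / (double)RAND_MAX;
        }
        rcb(pts, N, 0, P, 0);
        for (int i = 0; i < N; i++)
            printf("(%.2f, %.2f) -> processor %d\n", pts[i].x, pts[i].y, pts[i].proc);
        return 0;
    }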
Further Decomposition Strategies
• Not all decompositions are quite the same
• In defending against missile attacks, you track each missile on a
separate node -- geometric again
(Figure: "California gets its independence")
• In playing chess, you decompose the chess tree -- an abstract, not
geometric, space
(Figure: computer chess tree -- current position (node in tree), first
set of moves, opponent's counter moves)
Summary of Parallel Algorithms
• A parallel algorithm is a collection of tasks and a partial
ordering between them.
• Design goals:
– Match tasks to the available processors (exploit parallelism).
– Minimize ordering (avoid unnecessary synchronization points).
– Recognize ways parallelism can be helped by changing
ordering
• Sources of parallelism:
– Data parallelism: updating array elements simultaneously.
– Functional parallelism: conceptually different tasks which
combine to solve the problem. This happens at fine and coarse
grain size
• fine is “internal” such as I/O and computation; coarse is “external”
such as separate modules linked together
Data Parallelism in Algorithms
• Data-parallel algorithms exploit the parallelism inherent in many
large data structures.
– A problem is an (identical) algorithm applied to multiple points in data “array”
– Usually iterate over such “updates”
• Features of Data Parallelism
– Scalable parallelism -- can often get million or more way parallelism
– Hard to express when “geometry” irregular or dynamic
• Note data-parallel algorithms can be expressed by ALL
programming models (Message Passing, HPF like, openMP like)
Functional Parallelism in Algorithms
• Functional parallelism exploits the parallelism between the parts
of many systems.
– Many pieces to work on => many independent operations
– Example: Coarse grain Aeroelasticity (aircraft design)
• CFD(fluids) and CSM(structures) and others (acoustics, electromagnetics etc.)
can be evaluated in parallel
• Analysis:
– Parallelism limited in size -- tens not millions
– Synchronization probably good as parallelism natural from problem and
usual way of writing software
– Web exploits functional parallelism NOT data parallelism
Pleasingly Parallel Algorithms
• Many applications are what is called (essentially) embarrassingly, or
more kindly pleasingly, parallel
• These are made up of independent concurrent components (a Monte Carlo
sketch follows below):
– Each client independently accesses a Web Server
– Each roll of a Monte Carlo dice (random number) is an independent sample
– Each stock can be priced separately in a financial portfolio
– Each transaction in a database is almost independent (a given account
is locked but usually different accounts are accessed at the same time)
– Different parts of Seismic data can be processed independently
• In contrast, points in a finite difference grid (from a differential
equation) cannot be updated independently
• Such problems are often formally data-parallel but can be handled much
more easily -- like functional parallelism
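As an illustration of pleasingly parallel Monte Carlo sampling (a sketch, not from the slides; the sample count and seeds are arbitrary), each MPI rank below draws its own independent samples to estimate pi, and the only communication is one reduction at the end:

    /* Sketch: pleasingly parallel Monte Carlo estimate of pi.  Every
       rank draws independent random samples; the only communication is
       a single reduction at the end.  Illustrative values. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long n = 1000000;            /* samples per rank (assumed)    */
        srand(12345u + 17u * rank);        /* give each rank its own stream */
        long hits = 0;
        for (long i = 0; i < n; i++) {
            double x = rand() / (double)RAND_MAX;
            double y = rand() / (double)RAND_MAX;
            if (x * x + y * y <= 1.0) hits++;
        }

        long total = 0;
        MPI_Reduce(&hits, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi ~= %f\n", 4.0 * total / (double)(n * size));

        MPI_Finalize();
        return 0;
    }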
Parallel Languages
• A parallel language provides an executable notation for
implementing a parallel algorithm.
• Design criteria:
– How are parallel operations defined?
• static tasks vs. dynamic tasks vs. implicit operations
– How is data shared between tasks?
• explicit communication/synchronization vs. shared memory
– How is the language implemented?
• low-overhead runtime systems vs. optimizing compilers
• Usually a language reflects a particular style of expressing parallelism.
• Data parallel expresses the concept of an identical algorithm applied to
different parts of an array
• Message parallel expresses the fact that, at a low level, parallelism
implies information is passed between different concurrently executing
program parts
Data-Parallel Languages
• Data-parallel languages provide an abstract, machine-independent model
of parallelism.
– Fine-grain parallel operations, such as element-wise operations on arrays
– Shared data in large, global arrays with mapping "hints"
– Implicit synchronization between operations
– Partially explicit communication from operation definitions
• Advantages:
– Global operations conceptually simple
– Easy to program (particularly for certain scientific applications)
• Disadvantages:
– Unproven compilers
– Because they express the "problem", they can be inflexible if a new
algorithm is not well expressed by the language
• Examples: HPF
• Originated on SIMD machines where parallel operations are in lock-step,
but generalized (not so successfully, as the compilers are too hard) to
MIMD
Approaches to Parallel Programming
• Data Parallel is typified by CMFortran and its generalization, High
Performance Fortran (HPF), which in previous years we discussed in detail
but this year we will not; see the Source Book for more on HPF
• Typical Data Parallel Fortran statements are full array statements
– B = A1 + A2
– B = EOSHIFT(A, -1)
– Function operations on arrays representing the full data domain
• Message Passing is typified by the later discussion of the Laplace
example; it specifies specific machine actions (i.e., send a message
between nodes), whereas the data parallel model is at a higher level, as
it (tries to) specify a problem feature
• Note: We are always using "data parallelism" at the problem level,
whether the software is "message passing" or "data parallel"
• Data parallel software is translated by a compiler into "machine
language", which is typically message passing on a distributed memory
machine and threads on a shared memory machine (see the sketch below)
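A hedged sketch of what that translation might look like for the shift statement above: on a block-distributed 1D array, B = EOSHIFT(A, -1), i.e. B(i) = A(i-1) with a zero filling the first element, becomes a purely local copy plus one message for the single element that crosses each processor boundary. Sizes and data are illustrative:

    /* Sketch: B = EOSHIFT(A, -1) "compiled" by hand into message
       passing for a block-distributed array.  Illustrative only. */
    #include <mpi.h>
    #include <stdio.h>

    #define NLOCAL 250            /* elements owned per rank (assumed) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double a[NLOCAL], b[NLOCAL];
        for (int i = 0; i < NLOCAL; i++)
            a[i] = rank * NLOCAL + i;            /* global index as dummy data */

        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        /* The only element that crosses a processor boundary: my last
           element goes right, my left neighbour's last element arrives. */
        double from_left = 0.0;                  /* EOSHIFT boundary value */
        MPI_Sendrecv(&a[NLOCAL - 1], 1, MPI_DOUBLE, right, 0,
                     &from_left,     1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        b[0] = from_left;                        /* stays 0.0 on rank 0 (end-off) */
        for (int i = 1; i < NLOCAL; i++)         /* purely local shift            */
            b[i] = a[i - 1];

        printf("rank %d: b[0] = %g\n", rank, b[0]);
        MPI_Finalize();
        return 0;
    }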
Shared Memory Programming Model
• Experts in Java are familiar with this, as it is built into the Java
language through thread primitives
• We take "ordinary" languages such as Fortran, C++, and Java and add
constructs to help compilers divide processing (automatically) into
separate threads
– indicate which DO/for loop instances can be executed in parallel and
where there are critical sections with global variables etc.
• openMP is a recent set of compiler directives supporting this model
(a small sketch follows below)
• This model tends to be inefficient on distributed memory machines, as
optimizations (data layout, communication blocking, etc.) are not natural
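A minimal openMP sketch of the constructs just described (not from the slides; array contents and sizes are arbitrary): a directive marking a for loop whose iterations may run in parallel, and a critical section protecting a global variable:

    /* Sketch: OpenMP directives for a parallel loop and a protected
       update of a global variable (illustrative only). */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    double global_max = 0.0;              /* shared across all threads */

    int main(void) {
        static double a[N];

        #pragma omp parallel for          /* iterations are independent */
        for (int i = 0; i < N; i++)
            a[i] = (i % 97) * 0.5;

        #pragma omp parallel
        {
            double local_max = 0.0;       /* each thread's private maximum */
            #pragma omp for
            for (int i = 0; i < N; i++)
                if (a[i] > local_max) local_max = a[i];

            #pragma omp critical          /* combine results one thread at a time */
            {
                if (local_max > global_max) global_max = local_max;
            }
        }

        printf("max = %f\n", global_max); /* expect 48.0 */
        return 0;
    }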
Structure(Architecture) of Applications - I
• Applications are metaproblems with a mix of module (aka coarse
grain functional) and data parallelism
• Modules are decomposed into parts (data parallelism) and composed
hierarchically into full applications. They can be the
– "10,000" separate programs (e.g. structures, CFD, ...) used in the
design of an aircraft
– the various filters used in Adobe Photoshop or the Matlab image
processing system
– the ocean-atmosphere components in integrated climate
simulation
– The data-base or file system access of a data-intensive
application
– the objects in a distributed Forces Modeling Event Driven
Simulation
Structure(Architecture) of Applications - II
• Modules are “natural” message-parallel components of problem
and tend to have less stringent latency and bandwidth
requirements than those needed to link data-parallel
components
– modules are what HPF needs task parallelism for
– Often modules are naturally distributed whereas parts of data parallel
decomposition may need to be kept on tightly coupled MPP
• Assume that the primary goal of a metacomputing system is to add to
existing parallel computing environments a higher level supporting
module parallelism
– Now if one takes a large CFD problem and divides into a few components,
those “coarse grain data-parallel components” will be supported by
computational grid technology
• Use Java/Distributed Object Technology for modules -- note Java to growing
extent used to write servers for CORBA and COM object systems
Multi-Server Model for Metaproblems
• We have multiple supercomputers in the backend -- one doing a CFD
simulation of airflow, another structural analysis -- while in more
detail you have linear algebra servers (NetSolve), optimization servers
(NEOS), image processing filters (Khoros), databases (NCSA Biology
Workbench), and visualization systems (AVS, CAVEs)
– One runs 10,000 separate programs to design a modern aircraft, which
must be scheduled and linked ...
• All are linked to collaborative information systems in a sea of
middle-tier servers (as on the previous page) to support design, crisis
management, and multi-disciplinary research
Multi-Server Scenario
(Figure: diagram with the following components -- Gateway Control,
Multidisciplinary Control (WebFlow), NEOS Control Optimization,
Agent-based Choice of Compute Engine, Parallel DB Proxy, Database,
Optimization Service, Origin 2000 Proxy, NetSolve Linear Algebra Server,
MPP Matrix Solver, IBM SP2 Proxy, Data Analysis Server, MPP)