Parallel Seismic Computing: Theory and Practice
Paul Stoffa
The University of Texas at Austin, Institute for Geophysics
4412 Spicewood Springs Road, Building 600, Austin, Texas 78759-8500
5/23/2017

Introduction
What one should know/learn before starting:
- Why parallel?
- What we want/need to parallelize
- What can and cannot be parallelized
- Ways to parallelize: hardware
- Ways to parallelize: software (MPI/PVM/MYP)
- Ways to parallelize: algorithms

Motivation
To process seismic data faster, using better algorithms, at reduced cost:
- Better algorithms: 3D wave-equation-based prestack migration and inversion
- Faster: days, not months, to complete the processing
- Reduced cost: $10k-$100k per site, not $1M or more

Floating Point Performance
[Chart: floating point performance, 1975-1995, for 8-bit through 64-bit processors: 8087, 80287, 80387, 68881, i860, MIPS R2000, DEC ALPHA, RS6000, Cray 1S, Cray X-MP, Cray 2, Cray Y-MP, Cray Y-MP/C90]

Moore's Law
In 1965, Gordon Moore (co-founder of Intel) was preparing a speech, and when he started to graph data about the growth in memory chip performance he noticed a striking trend:
- Each new chip contained roughly twice as much capacity as its predecessor
- Each chip was released within 18-24 months of the previous chip
- If this trend continued, computing power would rise exponentially over relatively brief periods of time

Moore's Law (Diagram)
[Chart: speed (MIPS) and number of transistors vs. year, 1975-2000, for the 4004, 8080, 8086, 80286, 80386, 80486, and Pentium processors]

Breaking The Speed Limit
- IBM Interlocked Pipelined CMOS: 4.5 GHz
- Intel Willamette: 1.5 GHz
- Intel Pentium III: 1.0 GHz
- AMD Athlon: 1.0 GHz
- Current: 0.8 GHz

What is MPP?
MPP = massively parallel processing:
- Up to thousands of processors
- Physically distributed memory
  - Multicomputers: each processor has a separate address space
  - Multiprocessors: all processors share a common address space
- Processors may be commodity microprocessors or proprietary (i.e., custom) designs
- Processors and memories are connected by a network
- MPPs are scalable: the aggregate memory bandwidth increases with the number of processors

Small-Scale Multiprocessing
[Diagram: a uniprocessor (one P, one M) versus several processors attached to a shared memory]

Massively Parallel Processing
[Diagram: many processor-memory (P-M) pairs joined by an interconnect network]

Why MPP?
Technology reasons:
- Limits to device (transistor) switching speeds
- Limits to signal propagation speeds
- Need for large memory
- Need for scalable systems
Economic reasons:
- Low cost of microprocessors
- Low cost of DRAM
- Leverage of R&D for the workstation market

Problems with MPP
Hardware:
- Commodity microprocessors are not designed for MPP systems
- Slow divide
- Poor single-node efficiency
- Poor cache reuse on large-scale problems
- Low memory bandwidth
- Slow inter-processor communication
- Cost of synchronization
Software:
- MPP software is more complex
- It is difficult to debug and optimize programs
- It is difficult to port applications from conventional systems
General:
- Fundamental limits to parallel performance
- Load balance
- Data locality

Clusters: Why Now?
Commodity pricing of computer workstations, servers, and PCs has reduced the cost per machine dramatically in the last ten years:
- The power and speed of processors have increased significantly during this time
- Cache and DRAM memory per machine have also increased
- It is now possible to "cluster" machines to create a parallel machine using Ethernet as the communication vehicle
- Agreed programming standards exist (PVM and MPI)

Today's Solution
- Use widely available computer capability designed for the general public, e.g. PCs, workstations, servers
- Aggregate these machines into clusters that can be dedicated to the seismic processing task
- Minimize the communication required between processors to keep overhead to a minimum
- Minimize the transfer of data between processors to keep I/O to a minimum

Hardware
- Parallel Computers
- Parallel Processing Concepts
- Memory Design
- Basic Hardware Concepts
- Processor Connectivity
- Amdahl's Law

Parallel Computers
Before commodity pricing and the widespread availability of high-speed components, parallel computers were designed, built, and used for scientific computing. These machines were expensive, hard to program, and of different designs.
Two distinct ways of organizing the processing flow were investigated:
- All processors issue the same instruction at the same time
- The processors act independently of each other

Parallel Processing Concepts
- SIMD (single instruction, multiple data)
  - Used in dedicated applications
  - Fully synchronous
- MIMD (multiple instructions, multiple data)
  - Used in general-purpose applications
  - Independent and asynchronous
  - Can overlap computation and communication
- SPMD (single program, multiple data)
  - Same as MIMD, except that all processors run identical programs

Flynn's Taxonomy
A classification of computer architectures based on the number of streams of instructions and data:
- Single instruction/single data (SISD): a sequential computer
- Multiple instruction/single data (MISD): unusual
- Single instruction/multiple data (SIMD): e.g. the DAP
- Multiple instruction/multiple data (MIMD): multiple autonomous processors simultaneously executing different instructions on different data

Memory Design
Several different memory distribution designs were tried:
- Each processor has its own independent memory bank
- Each processor's independent memory is shared
- All the memory is shared by all the processors
[Diagram: independent memory banks (each P with its own M); shared independent memory banks (every P can reach every M); a single shared memory bank]

Parkinson's Law of Data
"Data expands to fill the space available for storage."
- Buying more memory encourages the use of more memory-intensive techniques
- It has been observed over the last 10 years that the memory usage of evolving systems tends to double roughly once every 18 months
- Fortunately, memory density available for constant dollars also tends to double about once every 12 months (see Moore's Law)

Basic Hardware Concepts
Communication protocols:
- Circuit switched: a physical circuit is established between any two communicating processors
- Packet switched: data communicated between processors is divided into packets and routed along a path, usually including other processors
Routing techniques:
- Store and forward: each packet is stored in the local memory of every processor on its path before being forwarded to the next processor
- Cut through: each packet proceeds directly from one processor to the next along a channel, without being stored in memory
Metrics and characteristics:
- Latency: how long it takes to start sending a message; high latency implies the use of large packets for efficiency
- Bandwidth: the data rate that can be sustained once the message is started
  - Bi-directional bandwidth: most systems can send data in two (or more) directions at once
  - Aggregate bandwidth: the total bandwidth when all processors are sending and receiving data
- Bisection bandwidth: the rate at which data can be transferred from one half of the system to the other

Synchronization
Necessary for all MIMD systems:
- Barriers: processors wait at a barrier in the code until they all reach it
- Critical sections: used to protect code; a critical section of code can only be executed by one processor at a time
- Locks: used to protect data; locked data cannot be accessed until it is unlocked. A lock is a primitive synchronization mechanism that can be used to implement barriers and critical sections
- Synchronization mechanisms should be supported with special hardware (but aren't in many systems)
- Some systems support only barriers

Processor Connectivity
Several ways of organizing the connections between processors were used to optimize inter-processor communication and data exchange efficiency; i.e., the architecture of the processor connections is important. Network topologies are characterized by:
- Diameter: maximum distance between processors
- Degree: number of connections to/from a processor
- Extendability: can an arbitrary-size network be built?

Examples:
- X-Y grid (or 2D grid), e.g. the Intel Paragon [Diagram: 25-node 2D grid]
- Fat tree, e.g. the CM5 [Diagram: 16-node fat tree]
- Butterfly, e.g. Kendall Square [Diagram: 8-node butterfly of processors, switches, and memories]
- Hypercube, e.g. the iPSC/2 [Diagram: 16-node hypercube]
- 2D torus (concept) [Diagram: 25-node 2D torus]
- 3D torus, e.g. the T3D [Diagram: 3D torus network with links in the +X/-X, +Y/-Y, +Z/-Z directions]

Amdahl's Law
The overall speedup of a program is strictly limited by the fraction of the execution time spent in code running in parallel, no matter how many processors are applied.
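This limit is easy to check numerically. A minimal Python sketch (illustrative; not part of the original course material) of the bound developed on the following slides, speedup = 100 / (s + p/N) with s + p = 100%:

```python
def amdahl_speedup(p, n):
    """Speedup bound when p percent of serial run time is parallelizable
    and the program runs on n processors: 100 / (s + p/n), s = 100 - p."""
    s = 100.0 - p
    return 100.0 / (s + p / n)

# With 90% parallel code, 16 processors yield only a 6.4x speedup,
# and no processor count can ever push it past 10x.
```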
If p is the percentage of the time spent (by a serial processor) on parts of the program that can be done in parallel, the theoretical speedup limit is

  100 / (100 - p)

Amdahl's Law for N Processors
If s is the percentage of the time spent (by a serial processor) on the serial parts of a program (thus s + p = 100%), the speedup is

  100 / (s + p/N)

[Diagram: a serial run of time 100 split into s and p; run on a parallel processor the time becomes s + p/N]

Amdahl's Law (Diagram 1)
[Chart: speedup vs. p for N = 16, with the speedup limit; the curve stays low until p approaches 100%]

Amdahl's Law (Diagram 2)
[Chart: speedup vs. N (2 to 1024, log scale) for p = 50% to 100%]

Amdahl's Law Reevaluated
- The steepness of the curve in Diagram 1 implies that few problems will experience even a 10-fold speedup
- The implicit assumption that p is independent of N is virtually never the case
- In reality, the problem size scales with N; i.e., given a more powerful processor, users gain control over grid resolution, number of timesteps, etc., allowing the program to run in some desired amount of time.
It is more realistic to assume that run time, not problem size, is constant.

Amdahl's Law Reevaluated (cont.)
If we use s′ and p′ to represent the percentages of serial and parallel time spent on the parallel system, then we arrive at an alternative to Amdahl's law. The scaled speedup is

  (s′ + N·p′) / 100

[Diagram: a hypothetical serial run of time s′ + N·p′ versus the run on the parallel processor of time s′ + p′ = 100]

Scaled Speedup (Diagram 3)
[Chart: scaled speedup vs. N (2 to 1024, log scale) for p = 50% to 100%; the curves grow nearly linearly with N]

Brief History of Example Machines
- DAP (Distributed Array of Processors)
- MasPar-1
- Thinking Machines CM2 (SIMD)
- Intel iPSC/2 (MIMD hypercube)
- Thinking Machines CM5 (MIMD)
- Cray T3D and T3E (MIMD torus)
- SGI Origin 2000 (MIMD shared memory)
- Microway Alpha Cluster

AMT DAP Hardware
[Diagram: a VME host computer connected through a host computer unit and a master control unit with code store to a 64 × 64 array of processor elements (1-bit processors with 8-bit co-processors), fast I/O, and array memory (at least 32 Kbits per PE)]

The DAP 610C is a SIMD machine.
Its 4096 processor elements simultaneously execute the same instruction on data within their local memory. The processor elements (PEs) are arranged in a square array (64 × 64), each comprising a general-purpose bit-organized processor and an 8-bit-wide co-processor for performing complex functions such as floating point arithmetic.
- Each PE has a local memory which ranges from 32 to 256 Kbits per PE (i.e., 16 to 128 Mbytes in total)
- Each bit-organized processor is connected to its four nearest neighbors in the square array, and to row and column data paths used for rapid broadcast or combination of data

The Active Memory Solution
- Simplicity in programming
- The array structure matches the parallelism in a wide range of problems
- A large number of simple PEs can be provided at reasonable cost
- Specialized functions such as floating point operations or variable-word-length mathematics can be micro-coded from the very general elementary operations provided by the PE
- Overheads are very low, since one control operation (managed by the MCU) simultaneously produces 4096 results (the 64 × 64 array), in contrast to a single result for a conventional processor
[Diagram: a conventional (SISD) machine, with memory, program, and a control unit driving one PE, versus active memory (SIMD), with a master control unit driving 1024 PEs, each with its own memory]

MasPar-1
The MasPar-1 architecture consists of four subsystems:
- The Processor Element (PE) Array
- The Array Control Unit (ACU)
- A UNIX subsystem with standard I/O
- A high-speed I/O subsystem

The PE Array is the computational heart of the MP-1 family architecture and enables it to operate on thousands of data elements in parallel. Each PE is a RISC processor with dedicated data memory and execution logic, and operates at 1.6 MIPS (32-bit integer add) and 73 KFLOPS (average of 32-bit floating point multiply and add). Extensive use of CMOS VLSI allows extremely dense packaging: 32 processors per chip, with forty 32-bit registers per PE.
Using SIMD, thousands of PEs can work on a single problem simultaneously. Each PE can share its own data memory with other processors through a global communications system. PEs receive instructions from the ACU. MP-1 systems support PE arrays with 1,024 to 16,384 PEs. Performance ranges from 1,600 to 26,000 integer MIPS (32-bit add) and from 75 to 1,200 single-precision MFLOPS (average of add and multiply). The aggregate PE Array data memory ranges from 16 MB to 1 GB.
[Diagram: the Processor Element Array; a single PE is a RISC processor with its own data memory]

Thinking Machines CM2 (SIMD)
Thinking Machines provided a very cost-effective hardware platform in their CM2, which was a SIMD machine. Unfortunately it was hard to program. The machine was well suited for explicit finite difference algorithms. Mobil was one of the many companies that used this machine for seismic processing.

CM-2 General Specifications
- Processors: 65,536
- Memory: 512 Mbytes
- Memory bandwidth: 300 Gbits/s
- I/O channels: 8
- Capacity per channel: 40 Mbytes/s
- Max. transfer rate: 320 Mbytes/s

DataVault Specifications
- Storage capacity: 5 or 10 Gbytes
- I/O interfaces: 2
- Transfer rate, burst max.: 40 Mbytes/s
- Aggregate rate: 320 Mbytes/s
The CM-2 has performance in excess of 2500 MIPS and floating point performance above 2500 MFLOPS.

CM-2 12-D Hypercube
Simple 1-bit processors, capable of performing the same calculation, each on its own separate data set, are grouped 16 to a chip, making a total of 4096 chips. The chips are wired together in a network having the shape of a 12-D hypercube: a cube with 2^12 corners, or 4096 chips, each connected to 12 other chips.

Hypercube
A cube of more than three dimensions:
- A single point ("node") can be considered a zero-dimensional cube
- Two nodes joined by a line are a one-dimensional cube
- Four nodes arranged in a square are a two-dimensional cube
- Eight (2^3) nodes are an ordinary three-dimensional cube
[Diagram: zero-, two-, and three-dimensional cubes]
The first true hypercube has 2^4 = 16 nodes and is a four-dimensional shape (a "four-cube"); an N-dimensional cube has 2^N nodes (an "N-cube"). Each node in an N-dimensional cube is directly connected to N other nodes.
The simple, regular geometrical structure and the close relationship between the coordinate system and binary numbers make the hypercube an appropriate topology for a parallel computer interconnection network. The fact that the number of directly connected, "nearest neighbor" nodes increases with the total size of the network is also highly desirable for a parallel computer.

Intel iPSC/2 (MIMD Hypercube)
[Diagram: 16-node hypercube]
This machine was one of the very first parallel machines used for scientific computing.
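The hypercube addressing described above can be made concrete: node addresses are N-bit numbers, and two nodes are neighbors exactly when their addresses differ in a single bit. A small Python sketch (illustrative; not from the original slides):

```python
def hypercube_neighbors(node, n):
    """Neighbors of `node` in an n-dimensional hypercube: node addresses
    are n-bit integers, and flipping any one address bit (XOR with a
    power of two) gives a directly connected node."""
    return [node ^ (1 << bit) for bit in range(n)]

# In the CM-2's 12-D hypercube each chip has 12 neighbors;
# in a 4-cube, node 0 connects to nodes 1, 2, 4, and 8.
```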
At that time no programming standards were agreed upon, and Intel provided software for coordination and control of the processors. Production seismic processing was done using this machine and its successors by Tensor.

nCUBE (MIMD Hypercube)
The nCUBE was another parallel machine that was used for seismic processing. MIT's ERL group had a machine, and Shell used several for production seismic processing. nCUBE provided the specialized software necessary for inter-processor communication and coordination.

nCUBE 2
- Architecture: distributed-memory MIMD hypercube
- Node: based on a 64-bit custom chip and 64-bit floating point unit, running at 7.5 MIPS or 3.3 MFLOPS, with 1-64 Mbytes of memory
- I/O: other I/O boards include an improved graphics board, a parallel I/O subsystem board, and an Open System board; each board can transfer data at a peak rate of 568 Mbytes/s per I/O slot
- Topology: a hypercube with extra I/O boards, although host boards are no longer necessary, as the nCUBE 2 is hosted by a workstation which is able to allocate sub-cubes separately to different users; wormhole routing is implemented in hardware over high-speed DMA communications channels
- Scalability: scales from a 5-dimensional cube (32 nodes) to a 13-dimensional cube (8192 nodes)
- Performance: peak performance is 27 GFLOPS for a 13-dimensional hypercube with 512 Gbytes of memory and a peak I/O bandwidth of 36 Gbytes/s; it would also be able to support 4096 disc drives.
The largest built to date is a 1024-node machine, which has been shown to scale almost linearly on some applications.

Thinking Machines CM5 (MIMD)
Realizing the limitations of the CM2, the CM5 was developed, an easier-to-program MIMD machine. GECO, among other companies, used this machine for production seismic processing.

CM5 Architecture
A coordinated homogeneous array of RISC microprocessors with the following communication mechanisms:
- A low-latency, high-bandwidth communications mechanism that allows each processor to access data stored in other processors
- A fast global synchronization mechanism that allows the entire machine to be brought to a defined state at a specific point in the course of computation

Processing node: each processing node consists of
- a RISC microprocessor
- 32 MB of memory
- an interface to the control and data interconnection networks
- 4 independent vector units, each with a direct 64-bit path to an 8 MB bank of memory
This gives the processing node a double-precision floating point rate of 128 MFLOPS and a memory bandwidth of 512 MB/s.
[Diagram: a SPARC RISC microprocessor handling instruction fetch and program control, connected to the data and control network interfaces over a 64-bit bus, with four floating point vector units and 32 MB of data memory on 64-bit data paths]

Data network topology: the CM5 uses a fat tree topology:
- Each interior node connects four children to two or more parents
- Processing nodes form the leaf nodes of the tree
- As more processing nodes are added, the number of levels of the tree increases, allowing the total bandwidth across the network to increase proportionally to the number of processors
[Diagram: processors and I/O nodes as the leaves of a 16-node fat tree]

Cray T3D Architecture
[Diagram: T3D cabinets attached to disks, tapes, networks, and workstations]

Cray T3D System Configuration
Model    | Processing Elements (PEs) | Total Memory (Gbytes) | Peak Performance (GFLOPS) | Cooling
MCA 32   | 32   | 0.5 or 2  | 4.8   | Air/water
MCA 64   | 64   | 1 or 4    | 9.6   | Air/water
MCA 128  | 128  | 2 or 8    | 19.2  | Air/water
MC 128   | 128  | 2 or 8    | 19.2  | Water
MC 256   | 256  | 4 or 16   | 38.4  | Water
MC 512   | 512  | 8 or 32   | 76.8  | Water
MC 1024  | 1024 | 16 or 64  | 153.6 | Water
MC 2048  | 2048 | 32 or 128 | 307.2 | Water

Cray T3D Latency Hiding
The system design supports three major latency-hiding mechanisms that augment the high-performance system interconnect:
- Pre-fetch queue (PFQ)
- Read-ahead
- Block Transfer Engine (BLT)
The PFQ and read-ahead features are designed to hide latency at the level of a few words; the BLT is designed to hide latency at the level of hundreds or thousands of words.

Cray T3D Fast Synchronization
Cray Research has developed a range of synchronization mechanisms, including:
- Barrier/eureka synchronization
- Fetch-and-increment register
- Atomic swap
- Messaging facility
These mechanisms accommodate both SIMD and MIMD programming styles.

Cray T3D Parallel I/O
The high-bandwidth I/O subsystem, based on Cray Research's Model E I/O subsystem technology, provides access to the full range of Cray Research disk, tape, and network peripherals. Peak performance in the gigabytes-per-second range can be achieved.

Cray T3D Scalability
The Cray T3D system is scalable:
- From 32 to 2048 processors
- From 4.8 to over 300 GFLOPS peak performance
- From 512 Mbytes to 128 Gbytes of memory
- Terabytes of storage capacity
Scalability is used for improving system throughput by increasing the number of job streams, and for significantly increasing the performance of individual user jobs.

Cray T3E
Provides a shared physical address space of up to 2,048 processors over a 3D torus interconnect.
Each node contains an Alpha 21164 processor, system control chip, local memory, and network router. The system logic runs at 75 MHz, and the processor runs at some multiple of this (300 MHz for the first Cray T3E), with sustainable bandwidths of 267 MB/s input and output for every four processors.

Cray T3E Torus Interconnect
[Diagram: 3D torus network with links in the +X/-X, +Y/-Y, +Z/-Z directions]

Cray T3E Node Block Diagram
[Diagram: a 300-450 MHz Alpha 21164 with streams control and E-registers, 64 MB to 2 GB of local memory, and a network router; system logic at 75 MHz]

Node Organization
[Diagram: processing nodes, a redundant node, and an I/O node]

Cray T3D and T3E (Torus)
Cray Research developed the T3D and T3E using a commodity processor, the DEC Alpha chip. The interconnect between processors was based on a 3D torus, which provided a very efficient way to transfer data between the compute nodes. Phillips Petroleum used these machines for prestack 3D depth migration.

SGI Origin 2000
Hardware:
- 8 SGI R10000 250 MHz processors
- 4 MB cache per processor
- 8 GB RAM
- 100 GB local disk space
- 800 MB/s link between nodes
- 100 Mbit/s link to LAN
- 1000 Mbit/s link to the Microway Alpha cluster
Software:
- SGI IRIX64 v6.4
- SGI Message Passing Toolkit for IRIX v1.3
Price: $465,098

SGI Origin 2000 (Cont.)
[Diagram: four nodes; each node holds two 250 MHz processors with 4 MB caches, a HUB, and 2 GB of memory and directory, linked by 800 MB/s CrayLink routers, with XBOW crossbars providing XIO connections, a 100 Mbit/s LAN link, and a 1 Gbit/s link to the Alpha cluster]

SGI Origin 2000: SMP Shared Memory
The Origin 2000 (O2K) became popular because the use of shared memory between the processors made programming easier and minimized data transfers. Routine seismic processing is performed using these multiprocessor machines in parallel, as well as by running independent processing tasks on each processor.
Microway Alpha Cluster
Hardware:
- 8 Alpha 21164 667 MHz processors
- 8 MB cache per processor
- 8 GB RAM
- 72 GB local disk space
- 100 Mbit/s link between nodes
- 100 Mbit/s link to LAN
- 1000 Mbit/s link to the SGI Origin 2000
Software:
- Microsoft Windows NT v4.0 SP4
- MPI/PRO for Windows NT v1.2.3, by MPI Software Technology, Inc.
- Hummingbird NFS Maestro v6.1
Price: $71,915

Microway Alpha Cluster (Cont.)
[Diagram: each node has an Alpha 21264 667 MHz processor with an 8 MB cache (1.2 GB/s at 83 MHz), 1 GB of memory on an 83 MHz PCI bus, an Ultra2 LVD SCSI controller (80 MB/s) with a 9 GB Ultra SCSI disk, and a Gbit NIC; the eight nodes connect through a 3Com 36-port 100 Mbit switch, with 100 Mbit/s links to the LAN and to the SGI]

Basic Software Concepts
Styles of parallel coding:
- Data parallel
- Message passing
- Work sharing

The data parallel style:
- Data parallel is the most natural style for SIMD, but it can also be implemented on MIMD systems
- Uses array syntax for concise coding

The message passing style:
- Message passing is the natural style for multicomputer systems (separate address spaces)
- Uses standard languages plus libraries for communication, global operations, etc.
- Both synchronous and asynchronous communications are possible
- Both node-to-node and broadcast communications are possible

The work sharing style:
- Work sharing is a natural style for a multiprocessor (global address space)
- Work sharing is similar to Cray's autotasking
- Compiler directives specify data distributions
- Loop iterations can be distributed across processors

Comparison of the different styles:
- The data parallel style is relatively high-level, elegant, and concise; the parallelism is fully implicit
- The message passing style is relatively low-level, general, efficient, flexible, and economical of memory.
The parallelism is fully explicit: communication, synchronization, and data decomposition must be specified in complete detail by the programmer.
- The work sharing style is intermediate between the other two styles: communication is implicit, synchronization is explicit (but more general types are allowed than in message passing), and data distribution is implicit (with guidance from the programmer); work sharing can be concise yet efficient, and it is more similar to serial code than the other styles
- A few systems allow mixing styles within a program

Software
Machine-specific software:
- Intel
- nCUBE
- Thinking Machines
Standards:
- High Performance Fortran
- PVM
- MPI

Software layers:
- MYP (my parallel interface)
- PVM (parallel virtual machine)
- MPI (message passing interface)

PVM Standard and Implementation
Developed by:
- University of Tennessee
- Oak Ridge National Laboratory
- Emory University

Message Passing under PVM
Structure of the transfer process:
- FORM MESSAGE (sender): creates a send buffer and assembles the message (possibly from multiple sources)
- SEND MESSAGE (PVM): uses the IPC library or a shared memory implementation
- DEQUEUE MESSAGE (receiver): identifies the message in PVM
- COPY TO DESTINATION (receiver): initiates a block transfer or memory copy

Syntax for sending a message in C
Initialize the send buffer:
  pvm_initsend(int PvmData...)
    PvmDataDefault = 0
    PvmDataRaw = 1
    PvmDataInPlace = 2
Insert data into the send buffer:
  pvm_pkbyte(char *ptr, int num, int strd)
    ptr  - address of the data array to pack
    num  - message length in bytes
    strd - stride or spacing of elements
Send the message:
  pvm_send(int dest, int type)
    dest - id of the destination processor
    type - message type

Syntax for receiving a message in C
Receive a message:
  pvm_recv(int src, int type)
    src  - source processor id (-1 means any source)
    type - message type (-1 means any type accepted)
Extract data from the message:
  pvm_unpkbyte(char *ptr, int num, int strd)
    ptr  - address of the destination array
    num  - message length in bytes
    strd - stride or spacing of elements

Syntax for sending a message in Fortran
Initialize the send buffer:
  call pvmfinitsend(pvminplace, istat)
Insert data into the send buffer:
  call pvmfpack(byte1, data, len, 1, istat)
Send the message:
  call pvmfsend(dest, type, istat)

Syntax for receiving a message in Fortran
Receive a message:
  call pvmfrecv(src, type, istat)
Extract data from the message:
  call pvmfunpack(byte1, data, len, 1, istat)

MPI Standard and Implementation
The MPI standardization effort involved about 40 organizations, mainly from the US and Europe. Most of the major vendors of concurrent computers were involved in MPI, along with researchers from universities, government laboratories, and industry.

Goals
- Design an application programming interface (not necessarily for compilers or a system implementation library)
- Allow efficient communication: avoid memory-to-memory copying, allow overlap of computation and communication, and allow offload to a communication co-processor where available
- Allow for implementations that can be used in a heterogeneous environment
- Allow convenient C and Fortran 77 bindings for the interface
- Assume a reliable communication interface: the user need not cope with communication failures.
Such failures are dealt with by the underlying communication subsystem.
- Define an interface that is not too different from current practice (PVM, NX, Express, p4, etc.) and that provides extensions allowing greater flexibility
- Define an interface that can be implemented on many vendors' platforms, with no significant changes in the underlying communication and system software
- The semantics of the interface should be language independent
- The interface should be designed to allow for thread safety

The standardization process began with the Workshop on Standards for Message Passing in a Distributed Memory Environment, sponsored by the Center for Research on Parallel Computing, held April 29-30, 1992, in Williamsburg, Virginia. At this workshop a working group was established to continue the standardization process.
A preliminary draft proposal, known as MPI1, was put forward by Dongarra, Hempel, Hey, and Walker in November 1992, and a revised version was completed in February 1993.
- Version 1.0: June 1994
- Version 1.1: June 1995
- Version 1.2: April 1997 (MPI 1.2 is part of the MPI-2 document)

- UNIX: MPICH v1.2.0, a portable implementation of MPI, released 12/2/1999
- SGI: MPT for IRIX v1.3; this MPI release implements the MPI 1.2 standard
- MPI/Pro for Windows NT v1.2.3: the full MPI 1.2 specification is currently supported, with MPI 2.0 support to be offered incrementally in the future

MPI Commands in C and Fortran
UNIX, SGI: the include files mpi.h (for C) and mpif.h (for Fortran) contain definitions of MPI objects, e.g., data types, communicators, collective operations, error handles, etc.
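The point-to-point pattern that MPI_Send and MPI_Recv standardize can be sketched in plain Python using threads and queues (purely illustrative: ToyComm is an invented stand-in, and real MPI ranks are separate processes, not threads):

```python
import queue
import threading

# Toy model of MPI-style point-to-point messaging. Each "rank" owns an
# inbox queue; send() enqueues into the destination's inbox, and recv()
# blocks until a tagged message arrives, mirroring MPI_Send/MPI_Recv.
class ToyComm:
    def __init__(self, size):
        self.inbox = [queue.Queue() for _ in range(size)]

    def send(self, buf, dest, tag):
        self.inbox[dest].put((tag, buf))          # cf. MPI_Send

    def recv(self, me, tag):
        got_tag, buf = self.inbox[me].get()       # blocking, cf. MPI_Recv
        assert got_tag == tag
        return buf

comm = ToyComm(size=2)
out = {}

def worker():
    # plays the role of rank 1: receive work, send back the sum
    data = comm.recv(me=1, tag=0)
    comm.send(sum(data), dest=0, tag=1)

t = threading.Thread(target=worker)
t.start()
comm.send([1, 2, 3], dest=1, tag=0)    # rank 0 distributes work
out["total"] = comm.recv(me=0, tag=1)  # rank 0 collects the result (6)
t.join()
```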
MPI/Pro: the file mpi.h refers to include files for C:
- MPIPro_SMP.H
- MPIPro_TCP.H
The file mpif.H refers to include files for Fortran:
- MPIPro_SMP_f.H
- MPIPro_TCP_f.H

Format Comparison
C:
  #include 'mpi.h'
  int MPI_Init(argc, argv)
  int *argc;
  char ***argv;
  Input parameters:
    argc - pointer to the number of arguments
    argv - pointer to the argument vector
Fortran:
  include 'mpif.h'
  MPI_Init(ier)
  Output variables:
    ier - integer - returned error value

MPI Subroutines in Fortran
- MPI_Init
- MPI_Comm_size
- MPI_Comm_rank
- MPI_Barrier
- MPI_Send
- MPI_Recv
- MPI_Bcast
- MPI_Finalize
- more ...

Subroutine MPI_Init(ier)
Initializes the MPI execution environment.
include 'mpif.h'
- Output variables:
  ier - integer - returned error value

Subroutine MPI_Comm_size(comm, size, ier)
Determines the size of the group associated with a communicator.
include 'mpif.h'
- Input variables:
  comm - integer - communicator (handle)
- Output variables:
  size - integer - number of processes in the group of comm
  ier - integer - returned error value

Subroutine MPI_Comm_rank(comm, rank, ier)
Determines the rank of the calling process in the communicator.
include 'mpif.h'
- Input variables:
  comm - integer - communicator (handle)
- Output variables:
  rank - integer - rank of the calling process in comm
  ier - integer - returned error value

Subroutine MPI_Barrier(comm, ier)
Blocks until all processes have reached this routine.
include 'mpif.h'
- Input variables:
  comm - integer - communicator (handle)
- Output variables:
  ier - integer - returned error value

Subroutine MPI_Send(buf, count, datatype, dest, tag, comm, ier)
Performs a basic send.
include 'mpif.h'
- Input variables:
  buf - initial address of the send buffer (choice)
  count - integer - number of elements in the send buffer
  datatype - integer - datatype of each send buffer element (handle)
127 5/23/2017 Subroutine MPI_Send(buf,count, datatype,dest,tag,comm, ier) dest - integer- rank of destination tag - integer- message tag comm - integer- communicator (handle) – Output variables J.19 ier - integer- returned error value 128 5/23/2017 Subroutine MPI_Recv (count, datatype,source,tag,comm, buf,status,ier) Basic receive Include ‘mpif.h’ – Input variables count - integer- max number of elements in receive buffer datatype - integer- datatype of each receive buffer element (handle) J.20 129 5/23/2017 Subroutine MPI_Recv (count, datatype,source,tag,comm, buf,status,ier) source - integer- rank of source tag - integer- message tag comm - integer- communicator (handle) – Output variables buf - integer- initial address of receive buffer (choice) status - integer- status object (status) ier - integer- returned error value J.21 130 5/23/2017 Subroutine MPI_Bcast (buffer, count, datatype, root,comm,ier) Broadcasts a message from ‘root’ to all other processes of the group Include ‘mpif.h’ – Input variables buffer - integer- starting address of buffer (choice) count - integer- number of entries in buffer J.22 131 5/23/2017 Subroutine MPI_Bcast (buffer, count, datatype, root,comm,ier) datatype - integer- data type of buffer (handle) root - integer- rank of broadcast root comm - integer- communicator (handle) – Output variables J.23 ier – integer - returned error value 132 5/23/2017 Subroutine MPI_Finalize (ier) Terminates MPI execution environment Include ‘mpif.h’ – Output variables J.24 ier - integer- returned error value 133 5/23/2017 MYP Philosophy And Implementation A suite of Fortran subroutines that perform the most common tasks in a parallel program All application level software uses these routines to avoid machine specific software and changing ‘standard’ software at the application level K.1 134 5/23/2017 Subroutines can be modified to use any specific parallel subroutines, e.g. 
Software: MYP
Incorporates the following
– PVM (parallel virtual machine)
– MPI (message passing interface)
– more ...

MYP Subroutines in Fortran
myp_open, myp_barrier, myp_close, myp_writer, myp_readr, myp_bcast, myp_g2sum, myp_g1sum, myp_token, myp_transp, myp_copy, myp_rdxfer, myp_wrxfer, myp_message

Subroutine myp_open(iamnode, ier)
Initializes MPI
Include 'mypf.h'
– Output variables
  iamnode - integer - node identification number
  ier - integer - returned error value

      SUBROUTINE myp_open(iamnode, ier)
      INTEGER iamnode, ier
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      CALL mpi_init(ier)
      IF (ier .lt. 0) GOTO 7
      CALL mpi_comm_rank(mpi_comm_world, iamnode, ier)
      IF (ier .lt. 0) GOTO 7
      CALL mpi_comm_size(mpi_comm_world, nodes, ier)
      IF (ier .lt. 0) GOTO 7
      CALL mpi_barrier(mpi_comm_world, ier)
      IF (ier .lt. 0) GOTO 7
c     get length of myname & myp_path for later use,
c     then check if myname is ok
      mytid = iamnode
      myp_path = ''
      myname = 'myp_mpi'
      mygroup = 'mpi_world'
      ncmyname = ichrlen(myname)
      IF (ncmyname .le. 0) GOTO 7
      ncmygroup = ichrlen(mygroup)
c     get my node rank in ascii and save it,
c     5 character length, pre-pend zeros
      ncmyp_path = ichrlen(myp_path)
      WRITE(nodechar, '(i5)') iamnode
      DO j = 1, 5
        IF (nodechar(j:j) .eq. ' ') nodechar(j:j) = '0'
      ENDDO
      ier = 0
      RETURN
    7 ier = -1
      RETURN
      END

Subroutine myp_barrier(mygroup, nodes, ier)
Blocks until all processes reach this routine
Include 'mypf.h'
– Input variables
  mygroup - character*(*) - name of the group
  nodes - integer - number of nodes in the group
– Output variables
  ier - integer - returned error value

      SUBROUTINE myp_barrier(mygroup, nodes, ier)
      INTEGER nodes, ier
      CHARACTER*(*) mygroup
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      CALL mpi_barrier(mpi_comm_world, ier)
      IF (ier .lt. 0) GOTO 5
      ier = 0
      RETURN
    5 ier = -1
      RETURN
      END

Subroutine myp_close(ier)
Terminates the MPI execution environment
Include 'mypf.h'
– Output variables
  ier - integer - returned error value

      SUBROUTINE myp_close(ier)
      INTEGER ier
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      CALL mpi_barrier(mpi_comm_world, ier)
      IF (ier .lt. 0) GOTO 9
      CALL mpi_finalize(ier)
      IF (ier .lt. 0) GOTO 9
      ier = 0
      RETURN
    9 ier = -1
      RETURN
      END

Subroutine myp_writer(to, buf, count, tag, tag1, lasync, ier)
Writes buffer buf to node 'to'
Include 'mypf.h'
– Input variables
  to - integer - receiving node rank
  buf(*) - real - send buffer (choice)
  count - integer - number of elements in send buffer
  tag - integer - message tag
  tag1 - integer - message tag
  lasync - logical
– Output variables
  ier - integer - returned error value

      SUBROUTINE myp_writer(to, buf, count, tag, tag1, lasync, ier)
      INTEGER to, count, tag, tag1, ier
      REAL buf(*)
      LOGICAL lasync
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      CALL mpi_send(buf, count, mpi_real4, to, tag,
     &              mpi_comm_world, ier)
      IF (ier .lt. 0) GOTO 6
      ier = 0
      RETURN
    6 ier = -1
      RETURN
      END

Subroutine myp_readr(from, buf, count, tag, tag1, lasync, ier)
Reads into buffer buf from node 'from'
Include 'mypf.h'
– Input variables
  from - integer - sending node rank
  count - integer - maximum number of elements in receive buffer
  tag - integer - message tag
  tag1 - integer - message tag
  lasync - logical
– Output variables
  buf(*) - real - receive buffer
  ier - integer - returned error value

      SUBROUTINE myp_readr(from, buf, count, tag, tag1, lasync, ier)
      INTEGER from, count, tag, tag1, ier
      REAL buf(*)
      LOGICAL lasync
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      INTEGER status(mpi_status_size)
      CALL mpi_recv(buf, count, mpi_real4, from, tag,
     &              mpi_comm_world, status, ier)
      IF (ier .lt. 0) GOTO 3
      ier = 0
      RETURN
    3 ier = -1
      RETURN
      END

Subroutine myp_bcast(from, buf, count, tag, tag1, lasync, node, ier)
Broadcasts the data in buf to all nodes in the group
Include 'mypf.h'
– Input variables
  from - integer - rank of broadcast root
  buf(*) - real - input buffer (on the root)
  count - integer - number of entries in buffer
  tag - integer - message tag
  tag1 - integer - message tag
  node - integer - node ID
  lasync - logical
– Output variables
  buf(*) - real - output buffer (on the other nodes)
  ier - integer - returned error value

      SUBROUTINE myp_bcast(from, buf, count, tag, tag1, lasync,
     &                     iamnode, ier)
      INTEGER from, count, tag, tag1, iamnode, ier
      REAL buf(*)
      LOGICAL lasync
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      CALL mpi_barrier(mpi_comm_world, ier)
      IF (ier .lt. 0) GOTO 3
      CALL mpi_bcast(buf, count, mpi_real4, from,
     &               mpi_comm_world, ier)
      IF (ier .lt. 0) GOTO 3
      CALL mpi_barrier(mpi_comm_world, ier)
      IF (ier .lt. 0) GOTO 3
      ier = 0
      RETURN
    3 ier = -1
      RETURN
      END

Node To Node Transfer
(diagram: any node i can send and any node j can receive, for any pair of nodes 0, 1, 2, 3, ...)

Subroutine myp_g2sum(x, temp, count, iamnode, ier)
Sums an array from all nodes into an array on node zero; requires 2**m nodes
Include 'mypf.h'
– Input variables
  count - integer - number of entries to transfer
  iamnode - integer - starting node ID
– Output variables
  x(*) - real - result buffer
  temp(*) - real - intermediate buffer
  ier - integer - returned error value

      SUBROUTINE myp_g2sum(x, temp, count, iamnode, ier)
      REAL x(*), temp(*)
      INTEGER count, iamnode, ier
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      INTEGER tag, iinodes, node
      INTEGER ifromnode, itonode
c     number of stages: ndim = log2(nodes)
      iinodes = nodes
      DO ndim = 0, 14
        iinodes = iinodes/2
        IF (iinodes .eq. 0) EXIT
      ENDDO
      tag = 32768
      node = iamnode
      DO j = 1, ndim
        IF (mod(node,2) .eq. 0) THEN
          ifromnode = iamnode + 2**(j-1)
          CALL myp_readr(ifromnode, temp, count, tag, tag,
     &                   .false., ier)
          IF (ier .lt. 0) GOTO 4
          DO ii = 1, count
            x(ii) = x(ii) + temp(ii)
          ENDDO
        ELSE
          itonode = iamnode - 2**(j-1)
          CALL myp_writer(itonode, x, count, tag, tag,
     &                    .false., ier)
          IF (ier .lt. 0) GOTO 4
          ier = 0
          RETURN
        ENDIF
        node = node/2
      ENDDO
      ier = 0
      RETURN
    4 ier = -1
      RETURN
      END

Power Of 2 Global Sum
(diagram, 8 nodes: in stage 1 nodes 1, 3, 5, 7 send to nodes 0, 2, 4, 6; in stage 2 nodes 2 and 6 send to 0 and 4; in stage 3 node 4 sends to 0, leaving the total on node 0 after log2(N) stages)

Subroutine myp_g1sum(x, temp, count, iamnode, ier)
Sums an array from all nodes to node zero
Include 'mypf.h'
– Input variables
  x(*) - real - data buffer
  count - integer - number of entries to transfer
  iamnode - integer - node ID
– Output variables
  temp(*) - real - intermediate buffer
  ier - integer - returned error value

      SUBROUTINE myp_g1sum(x, temp, count, iamnode, ier)
      REAL x(*), temp(*)
      INTEGER count, iamnode, ier
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      INTEGER msend, mrecv
      ifromnode = iamnode - 1
      itonode = iamnode + 1
      IF (itonode .gt. nodes-1) itonode = 0
      IF (ifromnode .lt. 0) ifromnode = nodes - 1
      msend = 32768 + iamnode
      mrecv = 32768 + ifromnode
      IF (iamnode .eq. 0) THEN
        CALL myp_readr(ifromnode, temp, count, mrecv, mrecv,
     &                 .false., ier)
        IF (ier .lt. 0) GOTO 5
        DO ii = 1, count
          x(ii) = x(ii) + temp(ii)
        ENDDO
      ELSE
        IF (iamnode .gt. 1) THEN
          CALL myp_readr(ifromnode, temp, count, mrecv, mrecv,
     &                   .false., ier)
          IF (ier .lt. 0) GOTO 5
          DO ii = 1, count
            temp(ii) = x(ii) + temp(ii)
          ENDDO
          CALL myp_writer(itonode, temp, count, msend, msend,
     &                    .false., ier)
          IF (ier .lt. 0) GOTO 5
        ELSE
          CALL myp_writer(itonode, x, count, msend, msend,
     &                    .false., ier)
          IF (ier .lt. 0) GOTO 5
        ENDIF
      ENDIF
      CALL myp_barrier(mygroup, nodes, ier)
      ier = 0
      RETURN
    5 ier = -1
      RETURN
      END
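The two reduction patterns above can be sketched in plain C, simulating each node's buffer as one array entry (a serial model of the communication pattern only, not MPI code; the function names are illustrative):

```c
#include <assert.h>

/* Power-of-2 tree sum (the myp_g2sum pattern): at each stage the
   "odd" partner sends to the "even" partner 2^(j-1) ranks below it,
   so the total accumulates on node 0 in log2(nodes) stages. */
double g2sum(double *val, int nodes) {
    for (int step = 1; step < nodes; step *= 2)
        for (int r = 0; r < nodes; r += 2 * step)
            if (r + step < nodes)
                val[r] += val[r + step];   /* "receive and add" */
    return val[0];
}

/* Linear chain sum (the myp_g1sum pattern): node 1 sends its buffer
   to node 2, which adds and forwards, until the running total
   reaches node 0, which adds its own buffer last. */
double g1sum(const double *val, int nodes) {
    if (nodes == 1) return val[0];
    double partial = val[1];
    for (int r = 2; r < nodes; r++)
        partial += val[r];        /* each node adds and forwards */
    return val[0] + partial;      /* node 0 finishes the sum */
}
```

Both produce the same total; the tree version needs only log2(N) communication stages while the chain needs N-1, which is why myp_g2sum requires a power-of-two node count.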
Global Sum
(diagram, nodes 0-6: each node adds its buffer to the running partial sum and forwards it along the chain, so the final total arrives at node 0)

Subroutine myp_token(x, temp, count, iamnode, incnode, ier)
Passes a buffer of data, x, between the nodes
Include 'mypf.h'
– Input variables
  x(*) - real - data buffer
  count - integer - number of entries to transfer
  iamnode - integer - node identification number
  incnode - integer - node increment for the token ring
– Output variables
  temp(*) - real - intermediate buffer
  ier - integer - returned error value

      SUBROUTINE myp_token(x, temp, count, iamnode, incnode, ier)
      INTEGER count, iamnode, incnode, ier
      REAL x(*), temp(*)
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      INTEGER msend, mrecv
      INTEGER itonode, ifromnode
      itonode = mod(iamnode + incnode, nodes)
      ifromnode = mod(iamnode - incnode, nodes)
      IF (ifromnode .lt. 0) ifromnode = ifromnode + nodes
      msend = 32768 + 1000*incnode + iamnode
      mrecv = 32768 + 1000*incnode + ifromnode
      IF (iamnode .eq. 0 .or. itonode .eq. ifromnode) THEN
        CALL myp_writer(itonode, x, count, msend, msend,
     &                  .false., ier)
        IF (ier .lt. 0) GOTO 9
        CALL myp_readr(ifromnode, x, count, mrecv, mrecv,
     &                 .false., ier)
        IF (ier .lt. 0) GOTO 9
      ELSE
        DO j = 1, count
          temp(j) = x(j)
        ENDDO
        CALL myp_readr(ifromnode, x, count, mrecv, mrecv,
     &                 .false., ier)
        IF (ier .lt. 0) GOTO 9
        CALL myp_writer(itonode, temp, count, msend, msend,
     &                  .false., ier)
        IF (ier .lt. 0) GOTO 9
      ENDIF
      ier = 0
      RETURN
    9 ier = -1
      RETURN
      END
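The neighbor arithmetic and the net effect of one ring pass can be sketched in C (a serial simulation for illustration; the real routine interleaves a send and a receive on each node):

```c
#include <assert.h>
#include <string.h>

#define MAXN 64

/* Ring neighbors with wraparound, as computed in myp_token:
   send to (rank + inc) mod nodes, receive from (rank - inc) mod
   nodes, kept non-negative. */
int to_node(int rank, int inc, int nodes)   { return (rank + inc) % nodes; }
int from_node(int rank, int inc, int nodes) {
    int f = (rank - inc) % nodes;
    return f < 0 ? f + nodes : f;
}

/* One token-ring pass: every node's buffer moves 'inc' positions
   around the ring; afterwards buf[r] holds what from_node(r) held. */
void ring_pass(double *buf, int nodes, int inc) {
    double old[MAXN];
    memcpy(old, buf, nodes * sizeof(double));
    for (int r = 0; r < nodes; r++)
        buf[to_node(r, inc, nodes)] = old[r];
}
```

Calling the pass nodes/inc times returns every buffer to its owner, which is how a token ring lets each node see every other node's data without a global gather.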
General Token Ring Transfer
(diagram: nodes 0-7 arranged in a ring; with increment 1 each node passes to its nearest neighbor, with increment 2 to the node two positions away)

Subroutine myp_transp(x, temp, count, iamnode, ier)
Transposes the data, x, between the nodes
Include 'mypf.h'
– Input variables
  x(*) - real - data buffer
  count - integer - number of entries to transfer
  iamnode - integer - node ID
– Output variables
  temp(*) - real - intermediate buffer
  ier - integer - returned error value

      SUBROUTINE myp_transp(x, temp, count, iamnode, ier)
      INTEGER count, iamnode, ier
      REAL x(*), temp(*)
      INCLUDE 'mpif.h'
      INCLUDE 'mypf.h'
      INTEGER msend, mrecv
      msend = 32700
      mrecv = 32700
      DO jrecvnode = 0, nodes-2
        jxsend = jrecvnode*count + 1
        DO jsendnode = jrecvnode+1, nodes-1
          jxrecv = jsendnode*count + 1
          IF (iamnode .eq. jsendnode) THEN
c           swap my block jrecvnode with jrecvnode's block jsendnode
            CALL myp_writer(jrecvnode, x(jxsend), count,
     &                      msend, msend, .false., ier)
            IF (ier .lt. 0) GOTO 7
            CALL myp_readr(jrecvnode, x(jxsend), count,
     &                     mrecv, mrecv, .false., ier)
            IF (ier .lt. 0) GOTO 7
          ENDIF
          IF (iamnode .eq. jrecvnode) THEN
            DO j = 1, count
              temp(j) = x(jxrecv+j-1)
            ENDDO
            CALL myp_readr(jsendnode, x(jxrecv), count,
     &                     mrecv, mrecv, .false., ier)
            IF (ier .lt. 0) GOTO 7
            CALL myp_writer(jsendnode, temp, count,
     &                      msend, msend, .false., ier)
            IF (ier .lt. 0) GOTO 7
          ENDIF
        ENDDO
      ENDDO
      ier = 0
      RETURN
    7 ier = -1
      RETURN
      END
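The data movement that the pairwise exchanges above accomplish can be sketched serially in C, with one row of blocks per simulated node (illustrative only, not the Fortran routine itself; each block would hold 'count' reals):

```c
#include <assert.h>

#define N 4   /* number of simulated nodes (assumed for the sketch) */

/* x[node][blk] is the blk-th block held by 'node'.  The multinode
   transpose swaps block ownership: for each node pair
   (jrecvnode, jsendnode) with jsendnode > jrecvnode, the sender's
   block jrecvnode is exchanged with the receiver's block jsendnode,
   so that afterwards new[i][j] = old[j][i]. */
void transpose_blocks(int x[N][N]) {
    for (int jr = 0; jr <= N - 2; jr++)
        for (int js = jr + 1; js <= N - 1; js++) {
            int t = x[js][jr];
            x[js][jr] = x[jr][js];
            x[jr][js] = t;
        }
}
```

After the exchange, node i holds the i-th block of every node in node order, which is exactly the distributed matrix transpose needed before, e.g., an FFT along the other axis.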
Transpose
(diagram: a 6x5 array a(i,j) distributed by rows across the nodes is redistributed by columns, e.g. node 0 goes from holding a11-a25 to holding a11-a62)
Multinode transposes can be done in memory or out of memory (via disk), with or without a fast Fourier transform

Subroutine pvmcopy(i01, ..., i20, n, ier)
Broadcasts up to 20 units to the nodes
Assumes default integers, or 4 bytes per unit
Include 'mypf.h'
– Input variables
  i01, ..., i20 - integer - units to be broadcast
  n - integer - number of units
– Output variables
  ier - integer - returned error value

Subroutine pvm_rdxfer(issd, trace, temp, nt, krec, llasync, llssd, iamnode, jnode, ier)
Node 'iamnode' reads data from disk and sends it to node 'jnode'
Include 'mypf.h', 'mypio.h'
– Input variables
  issd - integer - logical unit number to read from
  nt - integer - number of samples
  krec - integer - record number
  llasync - logical
  llssd - integer
  iamnode - integer - reading node ID
  jnode - integer - receiving node ID
– Output variables
  trace(*) - real - buffer to read into
  temp(*) - real - transfer buffer
  ier - integer - returned error value

Subroutine pvm_wrxfer(issd, trace, temp, nt, krec, llasync, llssd, iamnode, jnode, ier)
Writes data from node 'jnode' to disk
Include 'mypf.h', 'mypio.h'
– Input variables
  issd - integer - logical unit number to write to
  nt - integer - number of samples to be written
  krec - integer - record number
  llasync - logical
  llssd - integer
  iamnode - integer - node ID
  jnode - integer - sending node ID
– Output variables
  trace(*) - real - receiving buffer
  temp(*) - real - transfer buffer
  ier - integer - returned error value

Subroutine myp_message(mesg, ier)
Broadcasts a message to all nodes
Include 'mypf.h'
– Input variables
  mesg - integer - message to be distributed to all nodes
– Output variables
  ier - integer - returned error value

Algorithms
There are several ways to parallelize seismic processing
– Task distribution: independent processing tasks are distributed among the processors
– Domain decomposition: each processor works on a unique part of the survey or image space
– Data decomposition: each processor works on a unique part of the data

Task distribution
Develop a pipeline of independent tasks that are handled by multiple processors
One or more tasks can be done by each processor depending on the time required, e.g. CDP gather -> NMO -> stack -> migration

Task distribution (cont.)
Advantage
– existing serial codes are easily adapted with minimal rewriting
Disadvantages
– it is difficult to balance the load between processors, so efficiency is reduced
– all data must move from processor to processor

Domain decomposition
Each part of the image space is developed by a processor
Divide the image space into sub-volumes that can be processed independently, e.g.
– finite-difference RTM, where the computational grid in x, y, z is divided between the processors
– Kirchhoff pre-stack depth migration

(diagram: the nx x ny x nz image volume is split in depth, with slab nz1 on node 0, nz2 on node 1, ..., nzk on node k)

Domain decomposition (cont.)
Advantage
– the image volume is generally smaller than the data volume, so large problems can be solved in memory
Disadvantages
– depending on the algorithm, exchange of data at the borders of the sub-volumes between the processors may be necessary during the computation (e.g. in RTM at each time step)
– all input data (in general) must be processed by every processor
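The border exchange mentioned above is the usual ghost-cell pattern; a serial C sketch for a 1-D depth split (the node count and slab size are assumed for illustration, not taken from the RTM code):

```c
#include <assert.h>

#define NODES 4
#define NZ 8   /* interior depth points per sub-volume (assumed) */

/* Each simulated node owns NZ interior points (indices 1..NZ) plus
   one ghost cell at each end (indices 0 and NZ+1).  Before a stencil
   update (e.g. an RTM time step), every node's border values are
   copied into its neighbors' ghost cells. */
void exchange_halos(double grid[NODES][NZ + 2]) {
    for (int n = 0; n + 1 < NODES; n++) {
        /* node n+1's bottom ghost gets node n's last interior point */
        grid[n + 1][0] = grid[n][NZ];
        /* node n's top ghost gets node n+1's first interior point */
        grid[n][NZ + 1] = grid[n + 1][1];
    }
}
```

This exchange is what makes domain decomposition communication-bound for stencil methods: it must happen every time step, but only a surface (not a volume) of data moves.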
Data decomposition
The data are distributed among the processors
Divide the data into subsets that can be processed independently, e.g.
– in phase-shift migration, constant-frequency planes
– in plane-wave migration, constant plane-wave volumes
– in Kirchhoff migration, constant-offset sections

Data decomposition (cont.)
Decomposed seismic data volumes or sections can be processed by each node independently, with minimal communication overhead
Seismic data can be imaged using
– shot records
– CDPs
– constant offsets
– constant plane waves
– frequency slices

Plane wave volumes
(diagram: plane-wave volumes Px1,Py1 on node 0, Px2,Py2 on node 1, etc., each imaged independently; applies to NMO, time migration, split-step Fourier depth migration, and Kirchhoff depth migration)

Offset volumes
(diagram: constant-offset volumes Ox1,Oy1 on node 0, Ox2,Oy2 on node 1, etc., each imaged independently; applies to NMO, time migration, and Kirchhoff depth migration)

Data decomposition (cont.)
Advantage
– if the data decomposition results in independent sub-volumes, no exchanges of data between the processors are required during the algorithm
Disadvantage
– the independent data volumes may still be large and may not fit in a single processor's memory, requiring an out-of-memory solution

Which approach?
Which approach to use in designing the parallel program is based on:
– the algorithm: some are easier to parallelize than others
– data structure and size: number of sources and receivers, frequencies, plane waves, etc.
– domain size: image size in x, y, and z
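A data decomposition often reduces to a round-robin assignment of the independent subsets (frequency planes, plane waves, offset sections) to processors; a small C sketch with assumed counts:

```c
#include <assert.h>

/* Round-robin owner of data subset i (e.g. a frequency slice) among
   'nodes' processors: each node then loops only over the subsets it
   owns, with no communication during the imaging loop. */
int owner(int i, int nodes) { return i % nodes; }

/* Number of subsets a given rank receives out of nsub total:
   the first (nsub mod nodes) ranks get one extra. */
int my_count(int rank, int nodes, int nsub) {
    return nsub / nodes + (rank < nsub % nodes ? 1 : 0);
}
```

The counts differ by at most one across ranks, which is why this decomposition load-balances well when the subsets cost roughly the same to process.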
Implementation
Nearly all seismic imaging algorithms can be implemented in a parallel mode
Some require new coding, while others can simply use serial code run on each processor

Existing serial code
Shot-record migrations (RTM, split-step, or implicit finite differences) can be run on each processor independently
The resulting images can then be summed at the end; no communication between processors and no data exchanges are required

Modifications to existing serial code
Any migration method that uses the data organized as independent subsets, such as frequency planes, plane waves, etc.
Do loops over the independent data subsets are distributed across the processors

Test code
Initialize myp (myp_open)
Test myp_writer & myp_readr
Test myp_bcast
Test myp_g2sum
Test myp_g1sum
Test myp_token
Test myp_transp
Finalize and close
(flow: initialize myp -> for each routine, prepare data, call it, check the result -> finalize & close)

      PROGRAM myp_test
      INTEGER iamnode, ier
      PARAMETER (mxnodes=32, nx=mxnodes*1000)
      REAL x(nx), temp(nx)
      CHARACTER*80 logfil
      INCLUDE 'mypf.h'
c     start up myp, get my node id & my tid
      CALL myp_open(iamnode, ier)
      IF (ier .lt. 0) THEN
        WRITE(*,*) 'myp_open error'
        STOP
      ENDIF
c     log file
      log = 32
      logfil = myp_path(1:ncmyp_path)//'myp_log.'//nodechar
      OPEN(log, FILE=logfil, STATUS='unknown', FORM='formatted')
      WRITE(log,*) 'iamnode & nodes=', iamnode, nodes

c     do a sequence of xfers & returns: define data & send data
      ifromnode = 0
      IF (iamnode .eq. ifromnode) THEN
        DO jnode = 1, nodes-1
          DO j = 1, nx
            x(j) = jnode + 1
          ENDDO
          jrec = jnode
          jrecid = jnode
          CALL myp_writer(jnode, x, nx, jrec, jrecid, .false., ier)
          IF (ier .lt. 0) GOTO 77
        ENDDO
      ELSE
        CALL myp_readr(ifromnode, x, nx, iamnode, jdum, .false., ier)
        IF (ier .lt. 0) GOTO 77
c       check & post results
        DO j = 1, nx
          IF (abs(x(j)-float(iamnode+1)) .gt. 1.e-5) THEN
            WRITE(log,*) 'myp_readr/myp_writer error from node ',
     &                   ifromnode
            GOTO 20
          ENDIF
        ENDDO
        WRITE(log,*) 'myp_readr/writer: ok, iamnode & ifrom=',
     &               iamnode, ifromnode
   20   CONTINUE
      ENDIF

c     now let ifromnode read them back & check
      IF (iamnode .ne. ifromnode) THEN
        jrec = iamnode
        jrecid = iamnode
        CALL myp_writer(ifromnode, x, nx, jrec, jrecid, .false., ier)
        IF (ier .lt. 0) GOTO 77
      ELSE
        DO jnode = 1, nodes-1
          CALL myp_readr(jnode, x, nx, jnode, jdum, .false., ier)
          IF (ier .lt. 0) GOTO 77
          DO j = 1, nx
            IF (abs(x(j)-float(jnode+1)) .gt. 1.e-6) THEN
              WRITE(log,*) 'myp_readr/myp_writer error from node ',
     &                     jnode
              GOTO 25
            ENDIF
          ENDDO
          WRITE(log,*) 'myp_readr/writer: ok, iamnode & jnode=',
     &                 iamnode, jnode
   25     CONTINUE
        ENDDO
      ENDIF

c     do a broadcast xfer & return: define data & send data
      ifromnode = 0
      IF (iamnode .eq. ifromnode) FORALL(j=1:nx) x(j) = 1000.
      jrec = 1000
      jrecid = 1000
      CALL myp_bcast(ifromnode, x, nx, jrec, jrecid, .false.,
     &               iamnode, ier)
      IF (ier .lt. 0) GOTO 77
c     get the data back & check
      IF (iamnode .ne. ifromnode) THEN
        jrec = iamnode
        jrecid = iamnode
        CALL myp_writer(ifromnode, x, nx, jrec, jrecid, .false., ier)
        IF (ier .lt. 0) GOTO 77
      ELSE
        DO jnode = 1, nodes-1
          FORALL(j=1:nx) x(j) = 0.
          CALL myp_readr(jnode, x, nx, jnode, jdum, .false., ier)
          IF (ier .lt. 0) GOTO 77
          DO j = 1, nx
            IF (abs(x(j)-1000.) .gt. 1.e-5) THEN
              WRITE(log,*) 'myp_bcast: error from node ', jnode
              GOTO 30
            ENDIF
          ENDDO
          WRITE(log,*) 'myp_bcast: ok, iamnode= ', iamnode
   30     CONTINUE
        ENDDO
      ENDIF

c     do a 2**m global sum: define data
      DO jnode = 0, nodes-1
        IF (iamnode .eq. jnode) FORALL(j=1:nx) x(j) = jnode + 1
      ENDDO
c     get sum -> node 0
      CALL myp_g2sum(x, temp, nx, iamnode, ier)
      IF (ier .lt. 0) GOTO 77
c     check & post results
      xtest = 0.
      DO j = 1, nodes
        xtest = xtest + j
      ENDDO
      IF (iamnode .eq. 0) THEN
        DO j = 1, nx
          IF (abs(x(j)-xtest) .gt. 1.e-5) THEN
            WRITE(log,*) 'myp_g2sum error: iamnode, xtest, x=',
     &                   iamnode, xtest, x(1)
            GOTO 10
          ENDIF
        ENDDO
        WRITE(log,*) 'myp_g2sum: ok, iamnode & sum= ', iamnode, xtest
   10   CONTINUE
      ENDIF

c     do a linear global sum: define data
      DO jnode = 0, nodes-1
        IF (iamnode .eq. jnode) FORALL(j=1:nx) x(j) = jnode + 1
      ENDDO
c     get sum -> node 0
      CALL myp_g1sum(x, temp, nx, iamnode, ier)
      IF (ier .lt. 0) GOTO 77
c     check & post results
      xtest = 0.
      DO j = 1, nodes
        xtest = xtest + j
      ENDDO
      IF (iamnode .eq. 0) THEN
        DO j = 1, nx
          IF (abs(x(j)-xtest) .gt. 1.e-5) THEN
            WRITE(log,*) 'myp_g1sum error: iamnode, xtest, x=',
     &                   iamnode, xtest, x(1)
            GOTO 15
          ENDIF
        ENDDO
        WRITE(log,*) 'myp_g1sum: ok, iamnode & sum= ', iamnode, xtest
   15   CONTINUE
      ENDIF

c     token ring pass of the buffer from 0->1, 1->2, ..., nodes-1->0
c     define the data from the node id
      DO incnode = 1, nodes-1, nodes
        FORALL(j=1:nx) x(j) = iamnode + 10
c       pass the data around the nodes displaced by incnode
        CALL myp_token(x, temp, nx, iamnode, incnode, ier)
        IF (ier .ne. 0) GOTO 77
c       check the result
        ifrom = iamnode - incnode
        IF (ifrom .lt. 0) ifrom = ifrom + nodes
        xtest = ifrom + 10
        DO j = 1, nx
          IF (abs(x(j)-xtest) .gt. 1.e-5) THEN
            WRITE(log,*) 'myp_token error: iamnode,xtest,x,incnode=',
     &                   iamnode, xtest, x(1), incnode
            GOTO 35
          ENDIF
        ENDDO
        WRITE(log,*) 'myp_token: ok, iamnode & inc= ', iamnode, incnode
   35   CONTINUE
      ENDDO

c     multinode transpose test; for test purposes define the data as
c     iamnode*nbuf+jx-1, jx=1,nbuf, plus jbuf*10000, jbuf=1,nodes:
c     input : node 0: 10000-10099, 20000-20099, 30000-30099, ...
c             node 1: 10100-10199, 20100-20199, 30100-30199, ...
c     output: node 0: 10000-10099, 10100-10199, 10200-10299, ...
c             node 1: 20000-20099, 20100-20199, 20200-20299, ... etc.
c     define the partial buffer as nbuf=100; it must satisfy
c     nbuf*nodes <= nx
      nbuf = 100
      DO jbuf = 1, nodes
        jsa = (jbuf-1)*nbuf
        FORALL(jx=1:nbuf) x(jsa+jx) = jbuf*10000 + iamnode*nbuf + jx - 1
      ENDDO
      CALL myp_transp(x, temp, nbuf, iamnode, ier)
      IF (ier .ne. 0) GOTO 77
c     check transpose results
      DO j = 1, nodes*nbuf
        test = (iamnode+1)*10000 + j - 1
        IF (abs(x(j)-test) .gt. 1.e-5) THEN
          WRITE(log,*) 'myp_transp: error test, x(j)=', test, x(j),
     &                 iamnode
          GOTO 40
        ENDIF
      ENDDO
      WRITE(log,*) 'myp_transp: ok, iamnode= ', iamnode
   40 CONTINUE

c     program finished: exit myp
      CALL myp_close(iermyp)
      STOP
   77 WRITE(log,*) 'a myp_ error occurred, ier, iermyp=', ier, iermyp
      CALL myp_close(iermyp)
      STOP
      END

Testing results

Seismic algorithms
Xtnmo, Tpnmo, Xttime, Tptime, Xtkirch, Tpkirch, Tp3d, more ...

Xtnmo flowchart

Xtnmo code

x-t data
Input seismic data: shot gathers in the x-t domain
– number of shots: 1000
– number of offsets/shot: 60
– number of samples/trace (time/depth): 2000
Migration results: CIG gathers
– number of CIGs: 1000
– number of traces/CIG: 60
– number of samples: 1000

τ-p data
Input seismic data: shot gathers in the τ-p domain
– number of shots: 1000
– number of rays/shot: 61
– number of samples/trace: 2000
Migration results: CIG gathers
– number of CIGs: 1000
– number of rays/CIG: 61
– number of samples/trace (time/depth): 1000

Timing diagrams
– τ-p time vs. number of CPUs; τ-p time speedup
– x-t time vs. number of CPUs; x-t time speedup
– τ-p plane-wave split-step vs. number of CPUs; τ-p plane-wave split-step speedup
τ-p time vs. number of CPUs
(chart: run time in seconds vs. number of nodes, 1-8; SGI, cluster w/network, and new cluster)

τ-p time speedup
(chart: speedup as ratio to 1 CPU vs. number of nodes, 1-8; SGI and cluster w/network)

x-t time vs. number of CPUs
(chart: run time in seconds vs. number of nodes, 1-8; SGI and cluster w/network)

x-t time speedup
(chart: speedup as ratio to 1 CPU vs. number of nodes, 1-8; SGI and cluster w/network)

τ-p plane-wave split-step vs. number of CPUs
(chart: run time in seconds vs. number of nodes, 1-8; SGI, cluster w/network, and new cluster)

τ-p plane-wave split-step speedup
(chart: speedup as ratio to 1 CPU vs. number of nodes, 1-8; SGI and cluster w/network)

Conclusions
The Beowulf Alpha cluster is a cost-effective solution for parallel seismic imaging
– performance is 'comparable' to the SGI Origin 2000 for the imaging algorithms tested
– cost is $71,915 vs. $465,098 for the SGI Origin 2000
– a price/performance improvement of 6.5

Conclusions (cont.)
The network interconnect from the cluster to the large UNIX network is not a performance bottleneck
Better cluster 'management' tools are needed to facilitate use
A true 64-bit cluster operating system is needed

Summary
State what has been learned
Define ways to apply training
Request feedback on the training session

Where to Get More Information
Other training sessions
List books, articles, electronic sources
Consulting services, other sources