CS 350: Principles of Parallel Computing
LECTURE 1
Applied Parallel Computing (2002 manuscript) by Dr. Yuefan Deng & Dr. Stephen Tse
All rights reserved.
1. Introduction

Computer accounts: Every registered student will get an account on the QC
Cluster (send an email to [email protected] to request your account). In
your email, please state the following:
1. Name (first and last names)
2. CS350 account username for the Parallel Machine
3. An active e-mail address


Projects: All four projects should be done on the Parallel Machine. All
but Project 4 will be collected three weeks after being assigned; Project 4
will have four weeks.
Lecture notes: At the start of the week, notes for the following two
lectures will be emailed to registered students.
Networked computing system in the Computer Science Lab (NSB A131)
2. Parallel Computing: What? Why? and How?
2.1 What:
Parallel computing is defined as "simultaneous processing by more than one
processing unit on a single application".
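As a minimal, concrete illustration of "more than one processing unit on a single application", here is a message-passing sketch in C. The lecture does not prescribe a particular library at this point; MPI is assumed below purely for illustration.

    #include <mpi.h>
    #include <stdio.h>

    /* Each process is one "processing unit"; together they run one application. */
    int main(int argc, char **argv) {
        int rank, size;

        MPI_Init(&argc, &argv);                /* start the parallel runtime        */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which processing unit am I?       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many units run this program?  */

        printf("process %d of %d: working simultaneously on one application\n",
               rank, size);

        MPI_Finalize();
        return 0;
    }

Built with an MPI compiler wrapper (e.g., mpicc) and launched on several processors (e.g., mpirun -np 4 ./a.out), all of the processes execute at the same time; that simultaneity is what the definition above refers to.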
2.2 Why:
2.2.1 Improve response time (stock markets, hospitals, battlefields, etc.)
2.2.2 Increase the total amount of work done in a given time (fluids, QCD,
proteins, meteorology, materials: finer grids, larger domains, and more
parameters, etc.)
2.2.3 Cost-effectiveness: roughly 10X compared with mainframes (everyone enjoys
more money)
Typical Times

Machines                  | Grand Challenge Problems               | Moderate Problems
Applications              | Protein folding, QCD, Turbulence,      | 2D CFD, Simple designs
                          | Weather                                |
Sub-Pflop                 | O(1) Hrs                               | NA
Tflop Machine             | O(10) Hrs                              | O(1) Sec
1K-Node Beowulf Cluster   | O(1) Wks                               | O(1) Minute
High-End Workstation      | O(10) Yrs                              | O(1) Hrs
PC With 2 GHz Pentium     | O(100) Yrs                             | O(1) Days

2.2.4 Problems Requiring Parallel Computing
2.2.4.1 Prediction of Weather, Climate, and Global Change
2.2.4.2 Materials Science
2.2.4.3 Semiconductor Design
2.2.4.4 Superconductivity
2.2.4.5 Structural Biology
2.2.4.6 Drug Design
2.2.4.7 Human Genome
2.2.4.8 QCD
2.2.4.9 Astronomy
2.2.4.10 Transportation
2.2.4.11 Vehicle Signature---Military
2.2.4.12 Turbulence
2.2.4.13 Vehicle Dynamics
2.2.4.14 Nuclear Fusion
2.2.4.15 Combustion Systems
2.2.4.16 Oil and Gas Recovery
2.2.4.17 Ocean Science
2.2.4.18 Speech
2.2.4.19 Vision
2.2.4.20 Undersea Surveillance
2.2.5 Best Application Areas
2.2.5.1 Aerodynamics: Aircraft design, air-breathing propulsion, advanced sensors.
2.2.5.2 Applied mathematics: Fluid dynamics, turbulence, differential equations, numerical analysis, global optimization, numerical linear algebra.
2.2.5.3 Biology and engineering: Simulation of genetic compounds, neural networks, structural biology, conformation, drug design, protein folding, human genome.
2.2.5.4 Chemistry and engineering: Polymer simulation, reaction rate prediction.
2.2.5.5 Computer science: Simulation of devices and circuits, VLSI, artificial intelligence.
2.2.5.6 Electrical engineering: Electromagnetic scattering.
2.2.5.7 Geosciences: Oil and seismic exploration, enhanced oil and gas recovery.
2.2.5.8 Materials research: Material property prediction, modeling of new materials, superconductivity.
2.2.5.9 Mechanical engineering: Structural analysis, combustion simulation.
2.2.5.10 Meteorology: Weather forecasting, climate modeling.
2.2.5.11 Oceanography: Global ocean modeling.
2.2.5.12 Physics: Astrophysics, evolution of galaxies and the nature of black holes; particle physics, quark interactions, properties of new particles; plasma physics, fusion reaction modeling; nuclear physics, weapon design and modeling, remediation of nuclear contamination.
2.2.5.13 Others: Information superhighway, optical processing, and transportation.
2.2.6 Basic Physics Equations
2.2.6.1 Classical Mechanics---Newton's Second Law
2.2.6.2 Electrodynamics---Maxwell Equations
2.2.6.3 Quantum Dynamics---Schrödinger Equation
2.2.6.4 Statistical Mechanics
2.2.6.5 Quantum Chromodynamics---Yang-Mills Equation
(In future lectures, I'll gradually introduce these equations and analyze
the relevant structures for parallel computing.)
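For reference, standard textbook forms of these equations are sketched below in LaTeX; these are the conventional statements, and the later lectures may use different (e.g., discretized) forms.

    % Classical mechanics: Newton's second law
    \mathbf{F} = m\,\frac{d^{2}\mathbf{x}}{dt^{2}}

    % Electrodynamics: Maxwell equations (SI units)
    \nabla\cdot\mathbf{E} = \rho/\varepsilon_{0}, \quad
    \nabla\cdot\mathbf{B} = 0, \quad
    \nabla\times\mathbf{E} = -\partial\mathbf{B}/\partial t, \quad
    \nabla\times\mathbf{B} = \mu_{0}\mathbf{J} + \mu_{0}\varepsilon_{0}\,\partial\mathbf{E}/\partial t

    % Quantum dynamics: Schrödinger equation
    i\hbar\,\frac{\partial\psi}{\partial t} = \hat{H}\,\psi

    % Statistical mechanics: the canonical partition function (a central quantity)
    Z = \sum_{i} e^{-E_{i}/(k_{B}T)}

    % Quantum chromodynamics: Yang-Mills field equation
    D_{\mu}F^{\mu\nu} = J^{\nu}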
2.3 How? The subject of the entire semester.
3. Parallel Computers: Hardware Issues
3.1 Characterization
3.1.1 Number and properties of compute processors
3.1.2 Network topologies and communication bandwidth
3.1.3 Instruction and data streams
3.1.4 Processor-memory connectivity
3.1.5 Memory size and I/O
3.2 Node architectures
3.2.1 Simplicity in the processing units
3.2.2 Conventional, off-the-shelf, mass-produced processors (rather than specially developed processors, as was done for Cray processors and IBM mainframe CPUs)
3.2.3 The main processors used in parallel machines include: the i860 in the iPSC/860, Intel Delta, Transtech, Meiko, and Paragon; SPARC in the CM-5; DEC's Alpha in the DECmpp and Cray T3D; the RS/6000 in the SP-1 and SP-2; and Weitek in the MasPar, etc.
3.3 Network topology
3.3.1 1D array
3.3.2 2D mesh
3.3.3 3D hypercube
3.3.4 Ring
3.3.5 Four-level, complete binary tree
3.3.6 Bus and switches
3.3.7 Cross-bar network
3.4 Network performance measurement
3.4.1 Connectivity: the multiplicity of links between any two processors.
3.4.2 Diameter: the maximum distance between two processors, i.e., the number of hops between the two most distant processors.
3.4.3 Bisection bandwidth: the number of bits that can be transmitted in parallel multiplied by the bisection width, which is defined as the minimum number of communication links that must be removed to divide the network into two equal partitions.
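The short C sketch below (not part of the lecture) evaluates two of these measures, diameter and bisection width, for three of the topologies in Section 3.3, using the standard formulas: a p-node ring has diameter floor(p/2) and bisection width 2; a sqrt(p) x sqrt(p) mesh without wraparound has diameter 2(sqrt(p)-1) and bisection width sqrt(p); a d-dimensional hypercube has diameter d and bisection width 2^(d-1). The bisection bandwidth is then the bisection width times the number of bits each link carries in parallel.

    #include <math.h>
    #include <stdio.h>

    /* Diameter and bisection width of a p-node ring. */
    static void ring_metrics(int p, int *diameter, int *bisection_width) {
        *diameter = p / 2;               /* longest shortest path: halfway around    */
        *bisection_width = 2;            /* cutting the ring in half removes 2 links */
    }

    /* Metrics for a sqrt(p) x sqrt(p) 2D mesh (p a perfect square, no wraparound). */
    static void mesh2d_metrics(int p, int *diameter, int *bisection_width) {
        int side = (int)(sqrt((double)p) + 0.5);
        *diameter = 2 * (side - 1);      /* corner-to-opposite-corner path           */
        *bisection_width = side;         /* one column of links crosses the cut      */
    }

    /* Metrics for a d-dimensional hypercube with 2^d nodes. */
    static void hypercube_metrics(int d, int *diameter, int *bisection_width) {
        *diameter = d;                   /* one hop per differing address bit        */
        *bisection_width = 1 << (d - 1); /* half the nodes each keep one cut link    */
    }

    int main(void) {
        int diam, bw;

        ring_metrics(64, &diam, &bw);
        printf("64-node ring:  diameter = %2d, bisection width = %d\n", diam, bw);

        mesh2d_metrics(64, &diam, &bw);
        printf("8x8 mesh:      diameter = %2d, bisection width = %d\n", diam, bw);

        hypercube_metrics(6, &diam, &bw);   /* 2^6 = 64 nodes */
        printf("6-D hypercube: diameter = %2d, bisection width = %d\n", diam, bw);

        return 0;
    }

(Link with the math library, e.g., -lm.)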
3.5 Instruction and Data Streams
3.5.1 Based on the nature of their instruction and data streams, parallel computers can be classified as follows (Flynn's taxonomy):

Instruction stream:      | Single | Multiple
Data stream: Single      | SISD   | MISD
Data stream: Multiple    | SIMD   | MIMD

3.5.1.1 SISD: e.g., workstations and single-CPU PCs (disappearing)
3.5.1.2 SIMD: Easy to use for simple problems (CM-1/2, ...)
3.5.1.3 MISD: Rare
3.5.1.4 MIMD: The trend (Paragon, IBM SP-1/2, Cray T3D, ...)
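In practice, MIMD machines are most often programmed in the SPMD (single program, multiple data) style: one program runs on every processor, but each processor follows its own control path over its own data. A minimal sketch, again assuming MPI only for illustration:

    #include <mpi.h>
    #include <stdio.h>

    /* SPMD: one source program, but each process takes its own instruction path. */
    int main(int argc, char **argv) {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* rank 0 coordinates: a different code path than the workers */
            printf("rank 0: collecting and reporting results\n");
        } else {
            /* other ranks compute: same executable, different control flow and data */
            printf("rank %d: computing my share of the data\n", rank);
        }

        MPI_Finalize();
        return 0;
    }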
3.6 Processor-memory connectivity
3.6.1 Distributed memory
3.6.2 Shared memory
3.6.3 Distributed shared memory
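To make the distinction concrete: on a shared-memory node, all threads address one common memory, as in the short sketch below (OpenMP is assumed here only as an example of shared-memory programming). On a distributed-memory machine, the same update would be done on per-process arrays, with any values a process does not own exchanged by explicit messages, as in the MPI sketches above. Distributed shared memory provides the shared-memory view on top of physically distributed memories.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000

    int main(void) {
        static double a[N];   /* one array, visible to every thread (shared memory) */
        int i;

        #pragma omp parallel for      /* the threads split the loop iterations      */
        for (i = 0; i < N; i++)
            a[i] = 2.0 * i;

        printf("a[%d] = %.1f, computed by up to %d threads\n",
               N - 1, a[N - 1], omp_get_max_threads());
        return 0;
    }

(Compile with the compiler's OpenMP flag, e.g., -fopenmp.)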
3.7 The Top 10 Supercomputers (TOP500 list of 11/2001)

Rank | Manufacturer | Computer | Rmax (GFlops) | Installation Site | Country | Year | Processors | Rpeak (GFlops) | Nmax | Nhalf
1 | IBM | ASCI White, SP Power3 375 MHz | 7226.00 | Lawrence Livermore National Laboratory | USA | 2000 | 8192 | 12288 | 518096 | 179000
2 | Compaq | AlphaServer SC ES45/1 GHz | 4059.00 | Pittsburgh Supercomputing Center | USA | 2001 | 3024 | 6048 | 525000 | 105000
3 | IBM | SP Power3 375 MHz 16 way | 3052.00 | NERSC/LBNL | USA | 2001 | 3328 | 4992 | 371712 | 102400
4 | Intel | ASCI Red | 2379.00 | Sandia National Labs | USA | 1999 | 9632 | 3207 | 362880 | 75400
5 | IBM | ASCI Blue-Pacific SST, IBM SP 604e | 2144.00 | Lawrence Livermore National Laboratory | USA | 1999 | 5808 | 3868 | 431344 | --
6 | Compaq | AlphaServer SC ES45/1 GHz | 2096.00 | Los Alamos National Laboratory | USA | 2001 | 1536 | 3072 | 390000 | 71000
7 | Hitachi | SR8000/MPP | 1709.10 | University of Tokyo | Japan | 2001 | 1152 | 2074 | 141000 | 16000
8 | SGI | ASCI Blue Mountain | 1608.00 | Los Alamos National Laboratory | USA | 1998 | 6144 | 3072 | 374400 | 138000
9 | IBM | SP Power3 375 MHz | 1417.00 | Naval Oceanographic Office (NAVOCEANO) | USA | 2000 | 1336 | 2004 | 374000 | --
10 | IBM | SP Power3 375 MHz 16 way | 1293.00 | Deutscher Wetterdienst | Germany | 2001 | 1280 | -- | -- | --
3.8 Cost-Effectiveness of Parallel Computers
3.8.1 Performance/Cost

System in | Number of processors | Performance in MF | Cost ($) | Cost-Performance Ratio (MF/$)
1997 | 128 | 40,000 | 250,000 | 0.16
2001 | 128 | 300,000 | 250,000 | 1.20
2001 | 1700 | 4,000,000 | 3,500,000 | 1.20
3.8.2 Benchmarking Definitions (the NAS Parallel Benchmark codes and classes)

Code | Benchmark name
EP | Embarrassingly parallel
MG | Multigrid
CG | Conjugate gradient
FT | 3-D FFT PDE
LU | LU solver
SP | Pentadiagonal solver
BT | Block tridiagonal solver

Problem-size classes: A, B, C
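To illustrate the "embarrassingly parallel" (EP) category, the sketch below estimates pi by Monte Carlo sampling: each process draws its own random points, and no communication is needed until one final reduction. This is a generic illustration, not the NAS EP kernel itself, and MPI is assumed only for illustration.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        const long n_per_proc = 1000000;   /* samples per process (illustrative)    */
        long i, local_hits = 0, total_hits = 0;
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        srand(1234 + rank);                /* independent streams, no communication */
        for (i = 0; i < n_per_proc; i++) {
            double x = rand() / (double)RAND_MAX;
            double y = rand() / (double)RAND_MAX;
            if (x * x + y * y <= 1.0)
                local_hits++;
        }

        /* the only communication: a single reduction at the very end */
        MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("pi ~= %f from %ld samples on %d processes\n",
                   4.0 * total_hits / (double)(n_per_proc * size),
                   n_per_proc * size, size);

        MPI_Finalize();
        return 0;
    }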
3.9 Comparison of Human Brain with Supercomputers

Attribute | Typical Human Brain | Intel Pentium 4 (at 1.5 GHz)
Total number of neurons / transistors | 20-50 billion neurons | 42 million transistors
Neurons in cerebral cortex | ~20 billion | --
Number of synapses | 10^14 (2,000-5,000 per neuron) | --
Weight | ~1.5 kg | 0.3 kg
Power consumption | ~40 W (4 nW/neuron) | ~55 W (1 mW/transistor)
Percentage of body | 2% of weight, 0.04-0.07% of cells, 20-44% of power consumption | --
Genetic code influence | 1 bit per 10,000-1,000,000 synapses | --
Atrophy/death of neurons | 50,000 per day (ages 20-75) | --
Sleep requirement | 30% | 0%
Normal operating temperature | 37±2 °C | 15-85 °C
Maximum firing frequency of neuron | 250-2,000 Hz (0.5-4 ms intervals) | 1.5 GHz
Signal propagation speed inside axon | 90 m/s sheathed, <0.1 m/s unsheathed | Speed of light
Processing of complex stimuli | 0.5 s or 100-1,000 firings | Long time
3.10 Supercomputer Design Issues
3.10.1 Processors: Advanced pipelining, instruction-level parallelism, reduction of branch penalties with dynamic hardware prediction and scheduling.
3.10.2 Networks: Internetworking, and connecting processors to the networks.
3.10.3 Processor-memory connectivity: Caching, reduction of cache misses and the miss penalty, design of memory hierarchies, virtual memory, centralized shared-memory architectures, distributed shared-memory architectures, and synchronization.
3.10.4 Storage systems: Types and performance of storage devices, bus-connected storage devices, storage area networks, RAID, reliability, and availability.
3.11 Gordon Bell's 11 Rules of Supercomputer Design
3.11.1 Performance, performance, performance. People buy supercomputers for performance. Performance, within a broad price range, is everything.
3.11.2 Everything matters. The use of the harmonic mean for reporting performance on the Livermore Loops severely penalizes machines that run poorly on even one loop, and it gives little credit for loops that run significantly faster than the others. Since the Livermore Loops were designed to simulate the real computational load mix at Livermore Labs, there can be no holes in performance when striving for high performance on this realistic mix of computational loads.
3.11.3
Scalars matter the most. A well-designed vector unit will
probably be fast enough to make scalars the limiting factor.
Even if scalar operations can be issued efficiently, high latency
through a pipelined floating point unit such as the VPU can be
deadly in some applications.
3.11.4
Provide as much vector performance as price allows. Peak
vector performance is primarily determined by bus bandwidth
in some circumstances, and the use of vector registers in others.
Thus the bus was designed to be as fast as practical using a
cost-effective mix of TTL and ECL logic, and the VRF was
designed to be as large and flexible as possible within cost
limitations. Gordon Bell's rule of thumb is that each vector unit
must be able to produce at least two results per clock tick to
have acceptably high performance.
3.11.5
Avoid holes in the performance space. This is an amplification
of rule 2. Certain specific operations may not occur often in an
"average" application. But in those applications where they
occur, lack of high speed support can significantly degrade
performance.
3.11.6
Place peaks in performance. Marketing sells machines as much
as, or more than, technical excellence does. Benchmark and
specification wars are inevitable. Therefore the most important
inner loops or benchmarks for the targeted market should be
identified, and inexpensive methods should be used to increase
performance. It is vital that the system can be called the
"World's Fastest", even though only on a single program. A
typical way that this is done is to build special optimizations
into the compiler to recognize specific benchmark programs.
3.11.7
Provide a decade of addressing. Computers never have enough
address space. History is full of examples of computers that
have run out of memory addressing space for important
applications while still relatively early in their life (e.g., the
PDP-8, the IBM System 360, and the IBM PC). Ideally, a
system should be designed to last for 10 years without running
out of memory address space for the maximum amount of
memory that can be installed. Since dynamic RAM chips tend
to quadruple in size every three years, this means that the
address space should contain 7 bits more than required to
address installed memory on the initial system.
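A quick check of the arithmetic behind the "7 bits" figure, assuming memory demand tracks DRAM capacity (quadrupling every three years over a ten-year life):

    4^{10/3} = 2^{2 \cdot 10/3} = 2^{20/3} \approx 2^{6.7}

so roughly seven extra address bits cover ten years of memory growth.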
3.11.8
Make it easy to use. The "dusty deck" syndrome, in which users
want to reuse FORTRAN code written two or three decades
earlier, is rampant in the supercomputer world. Supercomputers
with parallel processors and vector units are expected to run this
code efficiently without any work on the part of the
programmer. While this may not be entirely realistic, it points
out the issue of making a complex system easy to use.
Technology changes too quickly for customers to have time to
become an expert on each and every machine version.
3.11.9
Build on others' work.
3.11.10
Design for the next one, and then do it again. In a small startup
company, resources are always scarce, and survival depends on
shipping the next product on schedule. It is often difficult to
look beyond the current design, yet this is vital for long term
success. Extra care must be taken in the design process to plan
ahead for future upgrades. The best way to do this is to start
designing the next generation before the current generation is
complete, using a pipelined hardware design process. Also, be
resigned to throwing away the first design as quickly as
possible.
3.11.11
Have slack resources. Expect the unexpected. No matter how
good the schedule, unscheduled events will occur. It is vital to
have spare resources available to deal with them, even in a
startup company with little extra manpower or capital.