Download CS267: Introduction - Electrical Engineering & Computer Sciences

CS267/E233 Applications of Parallel Computers Lecture 1: Introduction Kathy Yelick [email protected] www.cs.berkeley.edu/~yelick/cs267_spr07 01/17/2007 CS267-Lecture 1 1 Why powerful computers are parallel circa 1991-2006 01/17/2007 CS267-Lecture 1 2 Tunnel Vision by Experts • “I think there is a world market for maybe five computers.” - Thomas Watson, chairman of IBM, 1943. • “There is no reason for any individual to have a computer in their home” - Ken Olson, president and founder of Digital Equipment Corporation, 1977. • “640K [of memory] ought to be enough for anybody.” - Bill Gates, chairman of Microsoft,1981. • “On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it.” - Ken Kennedy, CRPC Directory, 1994 01/17/2007 CS267-Lecture 1 Slide source: Warfield et al. 3 Technology Trends: Microprocessor Capacity Moore’s Law 2X transistors/Chip Every 1.5 years Called “Moore’s Law” Microprocessors have become smaller, denser, and more powerful. Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. Slide source: Jack Dongarra 01/17/2007 CS267-Lecture 1 4 Microprocessor Transistors per Chip • Growth in transistors per chip • Increase in clock rate 100,000,000 1000 10,000,000 1,000,000 i80386 i80286 100,000 R3000 R2000 100 Clock Rate (MHz) Transistors R10000 Pentium 10 1 i8086 10,000 i8080 i4004 1,000 1970 1975 1980 1985 1990 1995 2000 2005 Year 01/17/2007 CS267-Lecture 1 0.1 1970 1980 1990 2000 Year 5 Impact of Device Shrinkage • What happens when the feature size (transistor size) shrinks by a factor of x ? • Clock rate goes up by x because wires are shorter - actually less than x, because of power consumption • Transistors per unit area goes up by x2 • Die size also tends to increase - typically another factor of ~x • Raw computing power of the chip goes up by ~ x4 ! - typically x3 is devoted either on-chip - parallelism: hidden parallelism such as ILP - locality: caches 01/17/2007 CS267-Lecture 1 6 But there are limiting forces Manufacturing costs and yield problems limit use of density • Moore’s 2nd law (Rock’s law): costs go up Demo of 0.06 micron CMOS Source: Forbes Magazine • Yield -What percentage of the chips are usable? -E.g., Cell processor (PS3) is sold with 7 out of 8 “on” to improve yield 01/17/2007 CS267-Lecture 1 7 Power Density Limits Serial Performance 01/17/2007 CS267-Lecture 1 8 More Limits: How fast can a serial computer be? 1 Tflop/s, 1 Tbyte sequential machine r = 0.3 mm • Consider the 1 Tflop/s sequential machine: - Data must travel some distance, r, to get from memory to CPU. - To get 1 data element per cycle, this means 1012 times per second at the speed of light, c = 3x108 m/s. Thus r < c/1012 = 0.3 mm. • Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area: - Each bit occupies about 1 square Angstrom, or the size of a small atom. • No choice but parallelism 01/17/2007 CS267-Lecture 1 9 Revolution is Happening Now • Chip density is continuing increase ~2x every 2 years - Clock speed is not - Number of processor cores may double instead • There is little or no hidden parallelism (ILP) to be found • Parallelism must be exposed to and managed by software Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond) 01/17/2007 CS267-Lecture 1 10 Why Parallelism (2007)? • These arguments are no long theoretical • All major processor vendors are producing multicore chips - Every machine will soon be a parallel machine - All programmers will be parallel programmers??? • New software model - Want a new feature? Hide the “cost” by speeding up the code first - All programmers will be performance programmers??? • Some may eventually be hidden in libraries, compilers, and high level languages - But a lot of work is needed to get there • Big open questions: - What will be the killer apps for multicore machines - How should the chips be designed, and how will they be programmed? 01/17/2007 CS267-Lecture 1 11 Outline all • Why powerful computers must be parallel processors Including your laptop • Large important problems require powerful computers Even computer games • Why writing (fast) parallel programs is hard • Principles of parallel computing performance • Structure of the course 01/17/2007 CS267-Lecture 1 12 Why we need powerful computers 01/17/2007 CS267-Lecture 1 13 Units of Measure in HPC • High Performance Computing (HPC) units are: - Flop: floating point operation - Flops/s: floating point operations per second - Bytes: size of data (a double precision floating point number is 8) • Typical sizes are millions, billions, trillions… Mega Mflop/s = 106 flop/sec Mbyte = 220 = 1048576 ~ 106 bytes Giga Tera Peta Exa Zetta Gflop/s = 109 flop/sec Tflop/s = 1012 flop/sec Pflop/s = 1015 flop/sec Eflop/s = 1018 flop/sec Zflop/s = 1021 flop/sec Gbyte = 230 ~ 109 bytes Tbyte = 240 ~ 1012 bytes Pbyte = 250 ~ 1015 bytes Ebyte = 260 ~ 1018 bytes Zbyte = 270 ~ 1021 bytes Yotta Yflop/s = 1024 flop/sec Ybyte = 280 ~ 1024 bytes • See www.top500.org for current list of fastest machines 01/17/2007 CS267-Lecture 1 14 Simulation: The Third Pillar of Science • Traditional scientific and engineering paradigm: 1) Do theory or paper design. 2) Perform experiments or build system. • Limitations: - • Too difficult -- build large wind tunnels. Too expensive -- build a throw-away passenger jet. Too slow -- wait for climate or galactic evolution. Too dangerous -- weapons, drug design, climate experimentation. Computational science paradigm: 3) Use high performance computer systems to simulate the phenomenon - Base on known physical laws and efficient numerical methods. 01/17/2007 CS267-Lecture 1 15 Some Particularly Challenging Computations • Science - Global climate modeling Biology: genomics; protein folding; drug design Astrophysical modeling Computational Chemistry Computational Material Sciences and Nanosciences • Engineering - Semiconductor design Earthquake and structural modeling Computation fluid dynamics (airplane design) Combustion (engine design) Crash simulation • Business - Financial and economic modeling - Transaction processing, web services and search engines • Defense - Nuclear weapons -- test by simulations - Cryptography 01/17/2007 CS267-Lecture 1 16 Economic Impact of HPC • Airlines: - System-wide logistics optimization systems on parallel systems. - Savings: approx. $100 million per airline per year. • Automotive design: - Major automotive companies use large systems (500+ CPUs) for: - CAD-CAM, crash testing, structural integrity and aerodynamics. - One company has 500+ CPU parallel system. - Savings: approx. $1 billion per company per year. • Semiconductor industry: - Semiconductor firms use large systems (500+ CPUs) for - device electronics simulation and logic validation - Savings: approx. $1 billion per company per year. • Securities industry: - Savings: approx. $15 billion per year for U.S. home mortgages. 01/17/2007 CS267-Lecture 1 17 $5B World Market in Technical Computing 1998 1999 2000 2001 2002 2003 100% Other Technical Management and Support Simulation 90% Scientific Research and R&D 80% Mechanical Design/Engineering Analysis 70% Mechanical Design and Drafting 60% Imaging 50% Geoscience and Geoengineering 40% Electrical Design/Engineering Analysis Economics/Financial 30% Digital Content Creation and Distribution 20% Classified Defense 10% Chemical Engineering Biosciences 0% Source: IDC 2004, from NRC Future of Supercomputing Report 01/17/2007 CS267-Lecture 1 18 Global Climate Modeling • Problem Problem is to compute: f(latitude, longitude, elevation, time)  temperature, pressure, humidity, wind velocity • Approach: - Discretize the domain, e.g., a measurement point every 10 km - Devise an algorithm to predict weather at time t+dt given t • Uses: - Predict major events, e.g., El Nino - Use in setting air emissions standards Source: http://www.epm.ornl.gov/chammp/chammp.html 01/17/2007 CS267-Lecture 1 19 Global Climate Modeling Computation • One piece is modeling the fluid flow in the atmosphere - Solve Navier-Stokes equations - Roughly 100 Flops per grid point with 1 minute timestep • Computational requirements: - To match real-time, need 5 x 1011 flops in 60 seconds = 8 Gflop/s - Weather prediction (7 days in 24 hours)  56 Gflop/s - Climate prediction (50 years in 30 days)  4.8 Tflop/s - To use in policy negotiations (50 years in 12 hours)  288 Tflop/s • To double the grid resolution, computation is 8x to 16x • State of the art models require integration of atmosphere, ocean, sea-ice, land models, plus possibly carbon cycle, geochemistry and more • Current models are coarser than this 01/17/2007 CS267-Lecture 1 20 High Resolution Climate Modeling on NERSC-3 – P. Duffy, et al., LLNL 01/17/2007 CS267-Lecture 1 21 A 1000 Year Climate Simulation • Demonstration of the Community Climate Model (CCSM2) • A 1000-year simulation shows long-term, stable representation of the earth’s climate. • 760,000 processor hours used • Temperature change shown • Warren Washington and Jerry Meehl, National Center for Atmospheric Research; Bert Semtner, Naval Postgraduate School; John Weatherly, U.S. Army Cold Regions Research and Engineering Lab Laboratory et al. • http://www.nersc.gov/news/science/bigsplash2002.pdf 01/17/2007 CS267-Lecture 1 22 Climate Modeling on the Earth Simulator System  Development of ES started in 1997 in order to make a comprehensive understanding of global environmental changes such as global warming.  Its construction was completed at the end of February, 2002 and the practical operation started from March 1, 2002  35.86Tflops (87.5% of the peak performance) is achieved in the Linpack benchmark (world’s fastest machine from 2002-2004).  26.58Tflops was obtained by a global atmospheric circulation code. 01/17/2007 CS267-Lecture 1 23 Astrophysics: Binary Black Hole Dynamics • Massive supernova cores collapse to black holes. • At black hole center spacetime breaks down. • Critical test of theories of gravity – General Relativity to Quantum Gravity. • Indirect observation – most galaxies have a black hole at their center. • Gravity waves show black hole directly including detailed parameters. • Binary black holes most powerful sources of gravity waves. • Simulation extraordinarily complex – evolution disrupts the spacetime ! 01/17/2007 CS267-Lecture 1 24 01/17/2007 CS267-Lecture 1 25 Heart Simulation • Problem is to compute blood flow in the heart • Approach: - Modeled as an elastic structure in an incompressible fluid. - The “immersed boundary method” due to Peskin and McQueen. - 20 years of development in model - Many applications other than the heart: blood clotting, inner ear, paper making, embryo growth, and others - Use a regularly spaced mesh (set of points) for evaluating the fluid • Uses - Current model can be used to design artificial heart valves - Can help in understand effects of disease (leaky valves) - Related projects look at the behavior of the heart during a heart attack - Ultimately: real-time clinical work 01/17/2007 CS267-Lecture 1 26 Heart Simulation Calculation The involves solving Navier-Stokes equations - 64^3 was possible on Cray YMP, but 128^3 required for accurate model (would have taken 3 years). - Done on a Cray C90 -- 100x faster and 100x more memory - Until recently, limited to vector machines - Needs more features: - Electrical model of the heart, and details of muscles, E.g., - Chris Johnson - Andrew McCulloch - Lungs, circulatory systems 01/17/2007 CS267-Lecture 1 27 Heart Simulation Animation of lower portion of the heart 01/17/2007 CS267-Lecture 1 Source: www.psc.org 28 Parallel Computing in Data Analysis • Finding information amidst large quantities of data • General themes of sifting through large, unstructured data sets: - Has there been an outbreak of some medical condition in a community? - Which doctors are most likely involved in fraudulent charging to medicare? - When should white socks go on sale? - What advertisements should be sent to you? • Data collected and stored at enormous speeds (Gbyte/hour) - remote sensor on a satellite - telescope scanning the skies - microarrays generating gene expression data - scientific simulations generating terabytes of data - NSA analysis of telecommunications 01/17/2007 CS267-Lecture 1 29 Performance on Linpack Benchmark www.top500.org 01/17/2007 CS267-Lecture 1 30 01/17/2007 CS267-Lecture 1 31 Why writing (fast) parallel programs is hard 01/17/2007 CS267-Lecture 1 32 Principles of Parallel Computing • Finding enough parallelism (Amdahl’s Law) • Granularity • Locality • Load balance • Coordination and synchronization • Performance modeling All of these things makes parallel programming even harder than sequential programming. 01/17/2007 CS267-Lecture 1 33 “Automatic” Parallelism in Modern Machines • Bit level parallelism - within floating point operations, etc. • Instruction level parallelism (ILP) - multiple instructions execute per clock cycle • Memory system parallelism - overlap of memory operations with computation • OS parallelism - multiple jobs run in parallel on commodity SMPs Limits to all of these -- for very high performance, need user to identify, schedule and coordinate parallel tasks 01/17/2007 CS267-Lecture 1 34 Finding Enough Parallelism • Suppose only part of an application seems parallel • Amdahl’s law - let s be the fraction of work done sequentially, so (1-s) is fraction parallelizable - P = number of processors Speedup(P) = Time(1)/Time(P) <= 1/(s + (1-s)/P) <= 1/s • Even if the parallel part speeds up perfectly performance is limited by the sequential part 01/17/2007 CS267-Lecture 1 35 Overhead of Parallelism • Given enough parallel work, this is the biggest barrier to getting desired speedup • Parallelism overheads include: - cost of starting a thread or process - cost of communicating shared data - cost of synchronizing - extra (redundant) computation • Each of these can be in the range of milliseconds (=millions of flops) on some systems • Tradeoff: Algorithm needs sufficiently large units of work to run fast in parallel (I.e. large granularity), but not so large that there is not enough parallel work 01/17/2007 CS267-Lecture 1 36 Locality and Parallelism Conventional Storage Proc Hierarchy Cache L2 Cache Proc Cache L2 Cache Proc Cache L2 Cache L3 Cache L3 Cache Memory Memory Memory • Large memories are slow, fast memories are small • Storage hierarchies are large and fast on average • Parallel processors, collectively, have large, fast cache - the slow accesses to “remote” data we call “communication” • Algorithm should do most work on local data 01/17/2007 CS267-Lecture 1 potential interconnects L3 Cache Processor-DRAM Gap (latency) CPU “Moore’s Law” Processor-Memory Performance Gap: (grows 50% / year) DRAM DRAM 7%/yr. 100 10 1 µProc 60%/yr. 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 Performance 1000 Time 01/17/2007 CS267-Lecture 1 38 Load Imbalance • Load imbalance is the time that some processors in the system are idle due to - insufficient parallelism (during that phase) - unequal size tasks • Examples of the latter - adapting to “interesting parts of a domain” - tree-structured computations - fundamentally unstructured problems • Algorithm needs to balance load 01/17/2007 CS267-Lecture 1 39 Measuring Performance 01/17/2007 CS267-Lecture 1 40 Improving Real Performance Peak Performance grows exponentially, a la Moore’s Law  In 1990’s, peak performance increased 100x; in 2000’s, it will increase 1000x 1,000 But efficiency (the performance relative to the hardware peak) has declined  was 40-50% on the vector supercomputers of 1990s now as little as 5-10% on parallel supercomputers of today Close the gap through ...   Mathematical methods and algorithms that achieve high performance on a single processor and scale to thousands of processors More efficient programming models and tools for massively parallel supercomputers 01/17/2007 CS267-Lecture 1 100 Teraflops  Peak Performance Performance Gap 10 1 Real Performance 0.1 1996 2000 2004 41 Performance Levels • Peak advertised performance (PAP) - You can’t possibly compute faster than this speed • LINPACK - The “hello world” program for parallel computing - Solve Ax=b using Gaussian Elimination, highly tuned • Gordon Bell Prize winning applications performance - The right application/algorithm/platform combination plus years of work • Average sustained applications performance - What one reasonable can expect for standard applications When reporting performance results, these levels are often confused, even in reviewed publications 01/17/2007 CS267-Lecture 1 42 Performance on Linpack Benchmark www.top500.org 01/17/2007 CS267-Lecture 1 43 Performance Levels (for example on NERSC3) • Peak advertised performance (PAP): 5 Tflop/s • LINPACK (TPP): 3.05 Tflop/s • Gordon Bell Prize winning applications performance : 2.46 Tflop/s - Material Science application at SC01 • Average sustained applications performance: ~0.4 Tflop/s - Less than 10% peak! 01/17/2007 CS267-Lecture 1 44 Course Organization 01/17/2007 CS267-Lecture 1 45 Course Mechanics • This class is listed as both a CS and Engineering class • Normally a mix of CS, EE, and other engineering and science students • This class seems to be about: - 15 grads + 4 undergrads - X% CS, EE, Other (BioPhys, BioStat, Civil, Mechanical, Nuclear) • For final projects we encourage interdisciplinary teams - This is the way parallel scientific software is generally built 01/17/2007 CS267-Lecture 1 46 Rough Schedule of Topics • Parallel Programming Models and Machines - Shared Memory and Multithreading - Distributed Memory and Message Passing - Data parallelism • Parallel languages and libraries - Shared memory threads and OpenMP - MPI - Languages (UPC) • “Seven Dwarfs” of Scientific Computing - Dense Linear Algebra Structured grids Particle methods Sparse matrices Spectral methods Unstructured Grids Spectral methods (FFTs, etc) • Applications: biology, climate, combustion, astrophysics, … • Work: 3 homeworks + 1 final project 01/17/2007 CS267-Lecture 1 47 Lecture Scribes • Each student should scribe ~2 lectures • Next lecture is on single processor performance (and architecture) • Sign up on list to choose 1 topic in first part of the course 01/17/2007 CS267-Lecture 1 48 Reading Materials • Some on-line texts: - Demmel’s notes from CS267 Spring 1999, which are similar to 2000 and 2001. However, they contain links to html notes from 1996. - http://www.cs.berkeley.edu/~demmel/cs267_Spr99/ - Simon’s notes from Fall 2002 - http://www.nersc.gov/~simon/cs267/ - Ian Foster’s book, “Designing and Building Parallel Programming”. - http://www-unix.mcs.anl.gov/dbpp/ • Potentially useful texts: - “Sourcebook for Parallel Computing”, by Dongarra, Foster, Fox, .. - A general overview of parallel computing methods - “Performance Optimization of Numerically Intensive Codes” by Stefan Goedecker and Adolfy Hoisie - This is a practical guide to optimization, mostly for those of you who have never done any optimization • More pointers will be on the web page 01/17/2007 CS267-Lecture 1 49 What you should get out of the course In depth understanding of: • When is parallel computing useful? • Understanding of parallel computing hardware options. • Overview of programming models (software) and tools. • Some important parallel applications and the algorithms • Performance analysis and tuning 01/17/2007 CS267-Lecture 1 50 Administrative • Information Instructors: - Kathy Yelick, 777 Soda, [email protected] - TA: Marghoob Mohiyuddin ([email protected]) - Office hours TBD • Lecture notes are based on previous semester notes: - Jim Demmel, David Culler, David Bailey, Bob Lucas, Kathy Yelick and Horst Simon - Much material stolen from others (with sources noted) • Most class material and lecture notes are at: - http://www.cs.berkeley.edu/~yelick/cs267_spr07 01/17/2007 CS267-Lecture 1 51 Extra slides 01/17/2007 CS267-Lecture 1 52 Transaction Processing (mar. 15, 1996) 25000 other Throughput (tpmC) 20000 Tandem Himalaya IBM PowerPC 15000 DEC Alpha SGI PowerChallenge HP PA 10000 5000 0 0 20 40 60 80 100 120 Processors • Parallelism is natural in relational operators: select, join, etc. • Many difficult issues: data partitioning, locking, threading. 01/17/2007 CS267-Lecture 1 53 SIA Projections for Microprocessors 1000 100 Feature Size (microns) 10 Transistors per chip x 106 1 0.1 2010 2007 2004 2001 1998 0.01 1995 Feature Size (microns) & Million Transistors per chip Compute power ~1/(Feature Size)3 Year of Introduction based on F.S.Preston, 1997 01/17/2007 CS267-Lecture 1 54 Much of the Performance is from Parallelism Thread-Level Parallelism? Instruction-Level Parallelism Bit-Level Parallelism Name 01/17/2007 CS267-Lecture 1 55 Performance on Linpack Benchmark www.top500.org 100000 Earth Simulator 10000 ASCI White ASCI Red 1000 Rmax max Rmax mean Rmax min Rmax 100 System 10 04 03 Ju n 03 D ec 02 Ju n 02 D ec 01 Ju n 01 D ec 00 Ju n 00 D ec 99 Ju n 99 D ec 98 Ju n 98 D ec 97 Ju n 97 D ec 96 Ju n 96 D ec 95 Ju n 95 D ec 94 Ju n 94 D ec 93 Ju n D ec Ju n 93 1 0.1 Nov 2004: IBM Blue Gene L, 70.7 Tflops Rmax 01/17/2007 CS267-Lecture 1 56

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download CS267: Introduction - Electrical Engineering & Computer Sciences