Lecture 1: Introduction to High Performance Computing

Grand challenge problem
A grand challenge problem is one that cannot be solved in a reasonable amount of time with today's computers.

Weather Forecasting
Cells of size 1 mile x 1 mile x 1 mile => whole global atmosphere is about 5 x 10^8 cells.
If each calculation requires 200 Flops => 10^11 Flops in one time step.
To forecast the weather over 7 days using 1-minute intervals (about 10^4 time steps), a computer operating at 100 Mflop/s (10^8 Flop/s) would take 10^7 seconds, or over 100 days.
To perform the calculation in 10 minutes would require a computer operating at 1.7 Tflop/s (1.7 x 10^12 Flop/s).

Some Grand Challenge Applications
Science
• Global climate modeling
• Astrophysical modeling
• Biology: genomics; protein folding; drug design
• Computational chemistry
• Computational material sciences and nanosciences
Engineering
• Crash simulation
• Semiconductor design
• Earthquake and structural modeling
• Computational fluid dynamics (airplane design)
• Combustion (engine design)
Business
• Financial and economic modeling
• Transaction processing, web services and search engines
Defense
• Nuclear weapons -- test by simulations
• Cryptography

Units of High Performance Computing
Speed:
  1 Mflop/s = 1 Megaflop/s = 10^6  Flop/second
  1 Gflop/s = 1 Gigaflop/s = 10^9  Flop/second
  1 Tflop/s = 1 Teraflop/s = 10^12 Flop/second
  1 Pflop/s = 1 Petaflop/s = 10^15 Flop/second
Capacity:
  1 MB = 1 Megabyte = 10^6  Bytes
  1 GB = 1 Gigabyte = 10^9  Bytes
  1 TB = 1 Terabyte = 10^12 Bytes
  1 PB = 1 Petabyte = 10^15 Bytes

Moore's Law
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
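The weather-forecasting arithmetic can be checked in a few lines. This is a sketch of the back-of-the-envelope calculation only; it assumes the standard form of this example (a 7-day forecast with 1-minute time steps, about 10^4 steps in total), which is what reproduces the quoted 10^7 seconds and 1.7 Tflop/s.

```python
# Back-of-the-envelope cost of the weather-forecasting grand challenge.
# Figures: 5e8 cells, 200 Flops per cell per step, 7-day forecast,
# 1-minute time steps (the standard form of this example).

cells = 5e8            # ~1 mile^3 cells for the whole global atmosphere
flops_per_cell = 200   # floating point operations per cell per time step

flops_per_step = cells * flops_per_cell   # 10^11 Flops per time step
steps = 7 * 24 * 60                       # 7 days at 1-minute intervals

total_flops = flops_per_step * steps      # ~10^15 Flops in total

serial_rate = 1e8                         # a 100 Mflop/s machine
serial_seconds = total_flops / serial_rate
print(f"serial time: {serial_seconds:.1e} s")        # ~1e7 s, over 100 days

target_seconds = 10 * 60                  # want the forecast in 10 minutes
required_rate = total_flops / target_seconds
print(f"required rate: {required_rate:.1e} Flop/s")  # ~1.7e12 = 1.7 Tflop/s
```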
Moore's Law holds also for performance and capacity

                              1945 (ENIAC)   2002 (Laptop)
Vacuum tubes / transistors    18,000         6,000,000,000
Weight (kg)                   27,200         0.9
Size (m^3)                    68             0.0028
Power (watts)                 20,000         60
Cost ($)                      4,630,000      1,000
Memory (bytes)                200            1,073,741,824
Performance (Flop/s)          800            5,000,000,000

Peak Performance
A contemporary RISC processor delivers about 10% of its peak performance.
Two primary reasons behind this low efficiency:
• IPC inefficiency
• Memory inefficiency

Instructions per cycle (IPC) inefficiency
Today the theoretical IPC is 4-6. Detailed analysis for a spectrum of applications indicates that the average IPC is 1.2-1.4, so ~75% of the potential performance is not used.

Reasons for IPC inefficiency
Latency
• Waiting for access to memory or other parts of the system.
Overhead
• Extra work that has to be done to manage program concurrency and parallel resources, in addition to the real work you want to perform.
Starvation
• Not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources.
Contention
• Delays due to fighting over which task gets to use a shared resource next. Network bandwidth is a major constraint.

Memory Hierarchy: the Processor-Memory Problem
Processors issue instructions roughly every nanosecond; DRAM can be accessed roughly every 100 nanoseconds. The gap is growing:
• processors are getting faster by about 60% per year
• DRAM is getting faster by about 7% per year

How fast can a serial computer be?
Consider a 1 Tflop/s sequential machine:
• data must travel some distance, r, to get from memory to the CPU
• to get 1 data element per cycle, this must happen 10^12 times per second
• at the speed of light, c = 3 x 10^8 m/s, this requires r < c / 10^12 = 0.3 mm
For 1 TB of storage in a 0.3 mm x 0.3 mm area:
• each word occupies about 3 Angstroms x 3 Angstroms, the size of a small atom
So, we need parallel computing!
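The speed-of-light argument can be reproduced directly; a minimal sketch of the arithmetic, using only the figures in the text:

```python
# Physical limit on a 1 Tflop/s *serial* machine: one memory word must reach
# the CPU every cycle, and no signal travels faster than light.

c = 3e8        # speed of light, m/s
rate = 1e12    # 1 Tflop/s => 10^12 memory accesses per second

r = c / rate   # farthest the data can live from the CPU, in meters
print(f"max memory distance: {r * 1e3:.1f} mm")      # 0.3 mm

# Now pack 1 TB (10^12 words) of storage into that r x r square:
area_per_word = r ** 2 / 1e12     # m^2 available per word
side = area_per_word ** 0.5       # linear size of one word's cell
print(f"word cell side: {side * 1e10:.0f} Angstroms")  # ~3 A, a small atom
```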
High Performance Computers
In the 1980s:
• 1 x 10^6 floating point ops/sec (Mflop/s)
• scalar based
In the 1990s:
• 1 x 10^9 floating point ops/sec (Gflop/s)
• vector & shared memory computing
Today:
• 1 x 10^12 floating point ops/sec (Tflop/s)
• highly parallel, distributed processing, message passing

What is a Supercomputer?
A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.

Top500 Computers
Over the last 10 years the range of the Top500 has increased faster than Moore's law:
1993
• #1 = 59.7 Gflop/s
• #500 = 422 Mflop/s
2004
• #1 = 70 Tflop/s
• #500 = 850 Gflop/s

Top500 List, June 2005

     Manuf.  Computer    Installation Site        Country  Year  Rmax (Tflop/s)  #proc
1    IBM     BlueGene/L  LLNL                     USA      2005  136.8           65,536
2    IBM     BlueGene/L  IBM Watson Res. Center   USA      2005  91.3            40,960
3    SGI     Altix       NASA                     USA      2004  51.9            10,160
4    NEC     Vector      Earth Simulator Center   Japan    2002  35.9            5,120
5    IBM     Cluster     Barcelona Supercomp. C.  Spain    2005  27.9            4,800

Performance Development

Increasing CPU Performance: Manycore Chip
Composed of hybrid cores:
• some general purpose
• some graphics
• some floating point

What is Next?
A board composed of multiple manycore chips sharing memory; a rack composed of multiple boards; a room full of these racks. Millions of cores: exascale systems (10^18 Flop/s).

Moore's Law Reinterpreted
The number of cores per chip doubles every 2 years, while clock speed decreases (not increases); the number of threads of execution therefore also doubles every 2 years.
• We need to deal with systems with millions of concurrent threads.

Performance Projection

Directions
Move toward shared memory:
• SMPs and distributed shared memory
• shared address space with deep memory hierarchy
Clustering of shared memory machines for scalability.
Efficiency of message passing and data parallel programming:
• MPI and HPF

Future of HPC
Yesterday's HPC is today's mainframe is tomorrow's workstation.
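The claim that the Top500 range outpaced Moore's law can be checked from the 1993 and 2004 figures quoted above; a small sketch, taking the 18-month doubling period from the earlier Moore's Law slide:

```python
# Compare Top500 growth (1993 -> 2004) with Moore's-law doubling
# every 18 months. All performance figures come from the text.
from math import log2

years = 2004 - 1993                 # 11 years
no1_growth = 70e12 / 59.7e9         # #1 system: 59.7 Gflop/s -> 70 Tflop/s
no500_growth = 850e9 / 422e6        # #500 system: 422 Mflop/s -> 850 Gflop/s
moore_growth = 2 ** (years / 1.5)   # doubling every 18 months

print(f"#1   grew {no1_growth:7.0f}x, doubling every "
      f"{years / log2(no1_growth):.2f} years")
print(f"#500 grew {no500_growth:7.0f}x, doubling every "
      f"{years / log2(no500_growth):.2f} years")
print(f"Moore's law over {years} years: {moore_growth:.0f}x")
```

Both ends of the list doubled roughly every year, versus Moore's 1.5 years.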