HTMT-class Latency Tolerant Parallel Architecture for Petaflops-scale Computation
Dr. Thomas Sterling
California Institute of Technology and NASA Jet Propulsion Laboratory
October 1, 1999

5/23/2017 Dr. Thomas Sterling - HTMT Petaflops Architecture 3

[Figure: application areas surrounding "Basic Algorithms & Numerical Methods"]
Rational Drug Design, Nanotechnology, Biomolecular Dynamics, Fracture Mechanics, Crystallography, Diffraction, Inversion Problems, Atomic Scattering, Condensed Matter, Electronic Structure, Population Genetics, Transportation Systems, Plasma Processing, Chemical Reactors, Cloud Physics, Monte Carlo, Raster Graphics, Pattern Matching, Neutron Transport, Boilers, Multimedia Collaboration Tools, Scientific Visualization, ODE, Structural Mechanics, Weather and Climate, Seismic Processing, Multibody Dynamics, Geophysical Fluids, Aerodynamics, Fields, Ecosystems, Economics Models, Orbital Mechanics, Astrophysics, Electromagnetics, Intelligent Search, Computer Algebra, Databases, Magnet Design, Data Mining, CAD, Intelligent Agents, Automated Deduction, CVD, Multiphase Flow, Cryptography, Computer Vision, Virtual Prototypes, PDE, Symbolic Processing, Genome Processing, Virtual Reality, Reaction-Diffusion, CFD, Nuclear Structure, Radiation Transport, Graph Theoretic, Discrete Events, Air Traffic Control, Computational Steering, Flow in Porous Media, Pipeline Flows, n-body, Economics, Electrical Grids, Signal Processing, Reservoir Modelling, Biosphere/Geosphere, Distribution Networks, Fourier Methods, VLSI Design, QCD, Neural Networks, Combustion, Quantum Chemistry, Manufacturing Systems, Military Logistics, Data Assimilation, Actinide Chemistry, Cosmology, Phylogenetic Trees, MRI Imaging, Molecular Modelling, Chemical Dynamics, Tomographic Reconstruction, Number Theory

A 10 Gflops Beowulf
• Center for Advanced Computing Research, California Institute of Technology
• 172 Intel Pentium Pro microprocessors

Emergence of Beowulf Clusters

[Figure: book cover] 1st printing: May 1999; 2nd printing: Aug. 1999; MIT Press

Beowulf Scalability

INTEGRATED SMP - WDM
[Diagram labels]
• DRAM - 4 GBYTES - HIGHLY INTERLEAVED
• MULTI-LAMBDA AON
• CROSS BAR, coherence, 640 GBYTES/SEC
• 2nd LEVEL CACHE, 96 MBYTES (one per core)
• 64 bytes wide, 160 GBytes/sec (one path per core)
• VLIW/RISC CORE, 24 GFLOPS, 6 GHz (multiple cores)

COTS PetaFlop System
[Diagram labels]
• 128 die/box, 4 CPU/die
• Boxes 1 to 64 connected through an ALL-OPTICAL SWITCH
• I/O
• Multi-Die Multi-Processor
• 10 meters = 50 ns delay

COTS PetaFlops System
• 8192 Dies (4 CPU/die minimum)
• Each Die is 120 GFlops
• 1 PetaFlop Peak
• Power 8192 x 200 Watts = 1.6 MegaWatts
• Extra Main Memory >3 MegaWatts (512 TBytes)
• 15.36 TFlops/Rack (128 die)
• 30 KWatts/Rack - thus 64 racks - 30 inch
• Common System I/O
• 2 Level Main Memory
• Optical Interconnect
  – OC768 Channels (40 GHz)
  – 128 Channels per Die (DWDM) - 5.12 THz
  – ALL Optical Switching
• Bisection Bandwidth of 50 TBytes/sec
  – 15 TFlops/rack * .1 bytes/flop/sec * 32 racks
• Rack Bandwidth - 15 TFlops * .1 = 12 THz

The SIA CMOS Roadmap
[Chart: MB per DRAM chip, logic transistors per chip (M), and uP clock (MHz), on a log scale from 1 to 100,000, versus year of technology availability (1997, 1999, 2001, 2003, 2006, 2009, 2012)]
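
The headline numbers on the COTS PetaFlops System slide follow from a handful of per-die figures. The sketch below simply redoes that arithmetic; the variable names are illustrative, and reading "32 racks" as half of the 64 racks (the bisection) is an assumption, not something the slide states.

```python
# Per-die and per-rack figures quoted on the COTS PetaFlops System slide.
DIES = 8192
GFLOPS_PER_DIE = 120
WATTS_PER_DIE = 200
DIES_PER_RACK = 128
BYTES_PER_FLOP = 0.1  # bandwidth-to-compute ratio used on the slide

peak_pflops = DIES * GFLOPS_PER_DIE / 1e6           # GFlops -> PFlops
die_power_mw = DIES * WATTS_PER_DIE / 1e6           # W -> MW
racks = DIES // DIES_PER_RACK
tflops_per_rack = DIES_PER_RACK * GFLOPS_PER_DIE / 1e3

# Rack bandwidth: TFlops * bytes/flop gives TBytes/s; times 8 gives Tbit/s.
rack_bw_tbytes = tflops_per_rack * BYTES_PER_FLOP
rack_bw_tbits = rack_bw_tbytes * 8

# Bisection estimate, using the slide's rounded 15 TFlops/rack and
# assuming the "32 racks" term is half of the machine.
bisection_tbytes = 15 * BYTES_PER_FLOP * (racks // 2)

print(f"peak        : {peak_pflops:.3f} PFlops")
print(f"die power   : {die_power_mw:.2f} MW")
print(f"racks       : {racks}")
print(f"per rack    : {tflops_per_rack:.2f} TFlops, {rack_bw_tbits:.1f} Tbit/s")
print(f"bisection   : {bisection_tbytes:.0f} TBytes/s")
```

The computed values land close to the slide's rounded figures (about 1 PFlops, 1.6 MW, 64 racks, roughly 12 Tbit/s per rack, roughly 50 TBytes/s bisection).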

Requirements for High End Systems
• Bulk capabilities
  – performance
  – storage capacities
  – throughput/bandwidth
  – cost, power, complexity
• Efficiency
  – overhead
  – latency
  – contention
  – starvation/parallelism
• Usability
  – generality
  – programmability
  – reliability

Points of Inflection in the History of Computing
• Heroic Era (1950)
  – technology: vacuum tubes, mercury delay lines, pulse transformers
  – architecture: accumulator based
  – model: von Neumann, sequential instruction execution
  – examples: Whirlwind, EDSAC
• Mainframe (1960)
  – technology: transistors, core memory, disk drives
  – architecture: register bank based
  – model: virtual memory
  – examples: IBM 7090, PDP-1

Points of Inflection in the History of Computing
• Supercomputers (1980)
  – technology: ECL, semiconductor integration, RAM
  – architecture: pipelined
  – model: vector
  – example: Cray-1
• Massively Parallel Processing (1990)
  – technology: VLSI, microprocessor
  – architecture: MIMD
  – model: Communicating Sequential Processes, message passing
  – examples: TMC CM-5, Intel Paragon
• ? (2000)

HTMT Objectives
• Scalable architecture with high sustained performance in the presence of disparate cycle times and latencies
• Exploit diverse device technologies to achieve substantially superior operating point
• Execution model to simplify parallel system programming and expand generality and applicability

Hybrid Technology MultiThreaded Architecture
[Diagram components: 3D Mem, DRAM PIM, OPTICAL SWITCH, SRAM PIM, RSFQ Nodes]
[Diagram function labels]
• Compress/Decompress
• Spectral Transforms
• Data Structure Initializations
• "In the Memory" Operations
[Diagram component: I/O FARM]
[Diagram function labels, continued]
• Compress/Decompress
• ECC/Redundancy
• Compress/Decompress
• Routing
• RSFQ Thread Management
• Context Percolation
• Scatter/Gather Indexing
• Pointer chasing
• Push/Pull Closures
• Synchronization Activities

Storage Capacity by Subsystem
2007 Design Point

Subsystem        Unit Storage   # of Units   Total Storage
CRAM             32 KB          16 K         512 MB
SRAM             64 MB          16 K         1 TB
DRAM             512 MB         32 K         16 TB
HRAM             10 GB          128 K        1 PB
Primary Disk     100 GB         100 K        10 PB
Secondary Disk   100 GB         100 K        10 PB
Tape             1 TB           6K x 20      120 PB

HTMT Strategy
• High performance
  – Superconductor RSFQ logic
  – Data Vortex optical interconnect network
  – PIM smart memory
• Low power
  – Superconductor RSFQ logic
  – Optical holographic storage
  – PIM smart memory

HTMT Strategy (cont)
• Low cost
  – reduce wire count through chip-to-chip fiber
  – reduce processor count through x100 clock speed
  – reduce memory chips by 3/2 holographic memory layer
• Efficiency
  – processor level multithreading
  – smart memory managed second stage context pushing multithreading
  – fine grain regular & irregular data parallelism exploited in memory
  – high memory bandwidth and low latency ops through PIM
  – memory to memory interactions without processor intervention
  – hardware mechanisms for synchronization, scheduling, data/context migration, gather/scatter
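
The Storage Capacity by Subsystem table earlier in this section can be cross-checked by multiplying unit storage by unit count for the four memory tiers. This is only a sketch: it assumes binary units (1 KB = 1024 bytes, "16 K" units = 16 x 1024), which the slide does not state, and the names below are illustrative.

```python
# Binary unit sizes; the slide does not say whether it means binary or
# decimal units, so this is an assumption.
KIB, MIB, GIB, TIB, PIB = (2 ** (10 * i) for i in range(1, 6))
K = 1024  # assumption: "16 K" units means 16 * 1024

# (unit storage in bytes, number of units) for the memory tiers in the table.
MEMORY_TIERS = {
    "CRAM": (32 * KIB, 16 * K),
    "SRAM": (64 * MIB, 16 * K),
    "DRAM": (512 * MIB, 32 * K),
    "HRAM": (10 * GIB, 128 * K),
}

def total_bytes(tier):
    """Total capacity of one tier: unit storage times unit count."""
    unit_bytes, units = MEMORY_TIERS[tier]
    return unit_bytes * units

for tier in MEMORY_TIERS:
    print(f"{tier}: {total_bytes(tier) / TIB:g} TiB")
```

Under these assumptions CRAM, SRAM and DRAM reproduce the table exactly (512 MB, 1 TB, 16 TB), while HRAM comes out at 1.25 PB, which the table rounds to 1 PB.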

HTMT Strategy (cont)
• Programmability
  – Global shared name space
  – hierarchical parallel thread flow control model
    • no explicit processor naming
  – automatic latency management
    • automatic processor load balancing
    • runtime fine grain multithreading
    • automatic context pushing for process migration (percolation)
  – configuration transparent, runtime scalable

RSFQ Roadmap (VLSI Circuit Clock Frequency)
[Chart: clock frequency (100 MHz to 1 THz, log scale) versus year (1998, 2001, 2004, 2007, 2010). Curve labels: low-Tc (4-5 K) at 3.5, 1.5, 0.8, 0.4 and 0.25 um; high-Tc (65-77 K) marked "??"; SIA Forecast at 0.25, 0.13 and 0.07 um; optical lithography; e-beam lithography marked "??"]

RSFQ Building Block
[Circuit diagram: L1, JJ1, JJ2]

Advantages
• X100 clock speeds achievable
• X100 power efficiency advantage
• Easier fabrication
• Leverage semiconductor fabrication tools
• First technology to encounter ultra-high speed operation

Superconductor Processor
• 100 GHz clock, 33 GHz inter-chip
• 0.8 micron Niobium on Silicon
• 100K gates per chip
• 0.05 watts per processor
• 100 kW per Petaflops

FUNCTIONALITY AND CAPABILITY
(1 petaflops machine, Yr. 2006, design COOL-0)

1. Technology Assumptions

(a) Chip
Min JJ size                 0.8 µm
Min runner width            1.5 µm
Nb layers                   8+1 (4 wires)
Junction density            1M/cm² logic, 3M/cm² memory
Runner pitch (in 1 layer)   5 µm
Chip size                   2×2 cm²
Contact pin pitch           100×100 µm²

(b) CMCM
Size                        20×20 cm²
Nb layers                   4+1 (2 wires)
Runner width                3 µm
Runner pitch (in 1 layer)   8 µm

(c) CPCB
Size                        54 cm (max diam)
Metallic layers             10+1 (5 wires)
Runner pitch                100 µm

6. COOL 0 System as a Whole

SPELLs Total          4K               12K chips    40 BJJs
CRAM Total            4 Gbytes         16K chips    160 BJJs
CNET Total            24,576 nodes     2K chips     8 BJJs
COOL 0 Grand Total    512 CMCMs        160 CPCBs    1.0 Pflops

I/O Bandwidth         4 Pbytes/s
Physical Size         0.5 m³
Dissipated Power      250 W @ 4 K
Refrigeration Power   100 kW

Data Vortex Optical Interconnect

DATA VORTEX LATENCY DISTRIBUTION
[Chart: number of messages (0 to 120x10^3) versus number of hops (0 to 100) for network height = 1024, with one curve for 22% active input ports and one for 100% active input ports]

Single-mode rib waveguides on silicon-on-insulator wafers‡
[Cross-section labels: optical mode, SiO2 cladding, Si, buried oxide, Si substrate]
• Hybrid sources and detectors
• Mix of CMOS-like and 'micromachining'-type processes for fabrication
‡ e.g.:
R A Soref, J Schmidtchen & K Petermann, IEEE J. Quantum Electron. 27 p.1971 (1991)
A Rickman, G T Reed, B L Weiss & F Namavar, IEEE Photonics Technol. Lett. 4 p.633 (1992)
B Jalali, P D Trinh, S Yegnanarayanan & F Coppinger, IEE Proc. Optoelectron. 143 p.307 (1996)

PIM Provides Smart Memory
[Diagram labels: Memory Stack, Sense Amps, Decode, Node Logic, Basic Silicon Macro, Single Chip]
• Merge logic and memory
• Integrate multiple logic/mem stacks on single chip
• Exposes high intrinsic memory bandwidth
• Reduction of memory access latency
• Low overhead for memory oriented operations
• Manages data structure manipulation, context coordination and percolation

Multithreaded PIM DRAM
• Multithreaded Control of PIM Functions
  – multiple operation sequences with low context switching overhead
  – maximize memory utilization and efficiency
  – maximize processor and I/O utilization
  – multiple banks of row buffers to hold data, instructions, and addr
  – data parallel basic operations at row buffer
  – manages shared resources such as FP
• Direct PIM to PIM Interaction
  – memory communicates with memory within and across chip boundaries without external control processor intervention by "parcels"
  – exposes fine grain parallelism intrinsic to vector and irregular data structures
  – e.g. pointer chasing, block moves, synchronization, data balancing
[Diagram labels: Memory Stack, Row Registers, Row Buffers, Boolean ALU, GP-ALU, Context Registers, FP, Node Logic, Memory Bus I/F (PCI), Hi Speed Links (Firewire)]

Silicon Budget for HTMT DRAM PIM
• Designed to provide proper balance of memory & support for fiber bandwidth
  – Different Vortex configurations => different #s
• In 2004, 16 TB = 4096 groups of 64 chips
• Each Chip:
[Chip diagram labels: four 32MB Memory blocks, FtPt and ASAP units, SuperScalar Core, HRAM & Vortex Output Interface, Fiber WDM Optical Receiver]
[Pie chart "Logic By Area": 15.9%, 50.8%, 33.3%]

Holographic 3/2 Memory

Performance Scaling
                     1998       2001       2004
Module capacity      1 Gbit     1 GB       10 GB
Number of modules               10^5       10^5
Access time          1 ms       100 µs     10 µs
Readout bandwidth    1 Gb/s     .1 PB/s    1 PB/s
Record bandwidth     1 Mb/s     1 GB/s     .1 PB/s

Advantages
• petabyte memory
• competitive cost
• 10 µs access time
• low power
• efficient interface to DRAM
Disadvantages
• recording rate is slower than the readout rate for LiNbO3
• recording must be done in GB chunks
• long term trend favors DRAM unless new materials and lasers are used

[Figure: SIDE VIEW of the cryostat with 77 K and 4 K stages, 50 W, Fiber/Wire Interconnects; labeled dimensions 1.4 m, 1 m, 0.3 m, 1 m, 3 m, 0.5 m]

[Figure: SIDE VIEW of the installation. Labels: Nitrogen, Helium, 77 K, 4 K, 50 W, Fiber/Wire Interconnects, Hard Disk Array (40 cabinets), Tape Silo Array (400 Silos), Front End Computer Server, Console, Cable Tray Assembly, 220 Volts (two supplies), WDM Source, Generator, 980 nm Pumps, Optical Amplifiers (20 cabinets); labeled dimensions 3 m, 3 m, 0.5 m]

HTMT Facility (Top View)
[Floor plan: labeled dimensions 15 m, 27 m, 27 m, 25 m; Cryogenics Refrigeration Room]

Floor Area
1. HTMT                    1,000
2. Server                    250
3. Pump/MG                 3,000
4. Laser 980               1,000
5. Disk Farm (80)          1,600
6. Tape Robot Farm (20)    4,000
7. Operator Room           1,000
TOTAL = 11,850 sq ft

Power Dissipation by Subsystem
Petaflops Design Point

Subsystem          Unit Type   Unit Power   # of Units   Total Power
Cryostat/Cooling   System      400 kW       1            400 kW
SRAM               PIM         5 W          16 K         80 kW
WDM source/amps    Port        15 W         4 K          62 kW
Data Vortex        Subnet      2 kW         128          258 kW
DRAM               PIM         625 mW       32 K         20 kW
HRAM               HRAM        100 mW       128 K        13 kW
Primary Disk       Disk        15 W         100 K        1500 kW
Tape               Silo        1 kW         20           20 kW
Server             Machine     100 kW       1            100 kW
TOTAL                                                    2.4 MW
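
The Power Dissipation by Subsystem table above can be checked row by row and in total. The sketch below is illustrative only: it assumes "K" means 1024 for the PIM, port and HRAM counts and 1000 for the disk count (the reading that reproduces 1500 kW exactly), neither of which the slide spells out.

```python
# (subsystem, unit power in W, number of units, total stated on the slide in W)
K = 1024  # assumption for chip/port counts; the disk row uses 100,000
POWER_ROWS = [
    ("Cryostat/Cooling", 400e3, 1,       400e3),
    ("SRAM",             5.0,   16 * K,  80e3),
    ("WDM source/amps",  15.0,  4 * K,   62e3),
    ("Data Vortex",      2e3,   128,     258e3),
    ("DRAM",             0.625, 32 * K,  20e3),
    ("HRAM",             0.1,   128 * K, 13e3),
    ("Primary Disk",     15.0,  100_000, 1500e3),
    ("Tape",             1e3,   20,      20e3),
    ("Server",           100e3, 1,       100e3),
]

stated_total_w = sum(stated for _, _, _, stated in POWER_ROWS)
computed_total_w = sum(unit * count for _, unit, count, _ in POWER_ROWS)

for name, unit, count, stated in POWER_ROWS:
    print(f"{name:18s} computed {unit * count / 1e3:8.1f} kW, stated {stated / 1e3:8.1f} kW")
print(f"stated rows sum to {stated_total_w / 1e6:.2f} MW, "
      f"unit x count gives {computed_total_w / 1e6:.2f} MW")
```

Both sums come to roughly 2.45 MW, consistent with the slide's 2.4 MW total, and the primary disk row alone accounts for about 60% of it.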

Subsystem Interfaces
2007 Design Point

Subsystem     Interface to   Wires/Port   Speed/Wire (bps)   #ports     Aggregate BW (Byte/s)   Wire count   Type of IF
RSFQ          SRAM           16000        20.0E+9            512        20.5E+15                8.2E+6       wire
SRAM          RSFQ           1000         2.0E+9             8000       2.0E+15                 8.0E+6       TBD
SRAM          Data Vortex    1000         2.0E+9             8000       2.0E+15                 8.0E+6       wire
Data Vortex   SRAM           1            640.0E+9           2048       163.8E+12               2.0E+3       fiber
Data Vortex   DRAM           1            640.0E+9           2048       163.8E+12               2.0E+3       fiber
DRAM          Data Vortex    1000         1.0E+9             33000      4.1E+15                 33.0E+6      wire
DRAM          HRAM           1000         1.0E+9             33000      4.1E+15                 33.0E+6      wire
DRAM          Server         1            800.0E+6           1000       100.0E+9                1.0E+3       wire
Server        DRAM           1            800.0E+6           1000       100.0E+9                1.0E+3       (fiber channel)
Server        Disk           1            800.0E+6           1000       100.0E+9                1.0E+3       (fiber channel)
Server        Tape           1            800.0E+6           200        20.0E+9                 200.0E+0     (fiber channel)
HRAM          DRAM           800          100.0E+6           1.00E+05   1.0E+15                 80.0E+6      wire

• Same colors indicate a connection between subsystems
• Horizontal lines group interfaces within a subsystem

Getting Efficiency
• Contention:
  – hardware for bandwidth, logic throughput, hardware arbitration
• Latency:
  – multithreaded processor with hardware context switching
  – "percolation" for proactive prestaging of executables
    • PIM-DRAM & PIM-SRAM provides smart data oriented mechanisms
• Overhead:
  – hardware context switching
  – in PIM smart synchronization and context management
  – proactive percolation performed in PIM
• Starvation:
  – dynamic load balancing
  – high speed processor for reduced parallelism
  – expose/exploit fine grain parallelism
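
The aggregate bandwidth and wire count columns of the Subsystem Interfaces table appear to follow from the first three numeric columns: bits per second summed over every wire of every port, divided by 8 to get bytes. The sketch below checks that reading against three rows; the function names are hypothetical and the formula is inferred from the table, not stated on the slide.

```python
def aggregate_bw_bytes(wires_per_port, bps_per_wire, ports):
    """Aggregate bandwidth in bytes/s: all wires on all ports, 8 bits per byte."""
    return wires_per_port * bps_per_wire * ports / 8

def wire_count(wires_per_port, ports):
    """Total number of wires (or fibers) across all ports."""
    return wires_per_port * ports

# (subsystem, interface to, wires/port, speed/wire in bps, ports)
rows = [
    ("RSFQ",        "SRAM", 16000, 20.0e9,  512),
    ("Data Vortex", "SRAM", 1,     640.0e9, 2048),
    ("HRAM",        "DRAM", 800,   100.0e6, 100_000),
]

for src, dst, wires, bps, ports in rows:
    bw = aggregate_bw_bytes(wires, bps, ports)
    print(f"{src} -> {dst}: {bw:.3e} Byte/s over {wire_count(wires, ports):.1e} wires")
```

For these rows the results match the tabulated 20.5E+15, 163.8E+12 and 1.0E+15 Byte/s after rounding.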

Multilevel Multithreaded Execution Model
• Extend latency hiding of multithreading
• Hierarchy of logical thread
  – Delineates threads and thread ensembles
  – Action sequences, state, and precedence constraints
• Fine grain single cycle thread switching
  – Processor level, hides pipeline and time of flight latency
• Coarse grain context "percolation"
  – Memory level, in memory synchronization
  – Ready contexts move toward processors, pending contexts towards big memory

Tera MTA Friends

Percolation of Active Tasks
• Multiple stage latency management methodology
• Augmented multithreaded resource scheduling
• Hierarchy of task contexts
• Coarse-grain contexts coordinate in PIM memory
• Ready contexts migrate to SRAM under PIM control releasing threads for scheduling
• Threads pushed into SRAM/CRAM frame buffers
• Strands loaded in register banks on space available basis
[Diagram labels: Strands Stored in Regs; Threads Stored in SRAM; Contexts Stored in DRAM]

HTMT Percolation Model
[Diagram labels: CRYOGENIC AREA; DMA to CRAM; start; done; Split-Phase Synchronization to SRAM; C-Buffer; A-Queue; I-Queue; D-Queue; T-Queue; Parcel Dispatcher & Dispenser; Parcel Assembly & Disassembly; Re-Use; Parcel Invocation & Termination; Run Time System; SRAM-PIM; DMA to DRAM-PIM]

HTMT Execution Model
[Diagram labels: "Contexts" in SRAM; Data Structures; "Contexts" in CRAM; VORTEX; CNET; SPELL; DRAM PIMs; SRAM PIMs]

DRAM PIM Functions
• Initialize data structures
• Stride thru regular data structures, transferring to/from SRAM
• Pointer chase thru linked data structures
• "Join-like" operations
• Reorderings
• Prefix operations
• I/O transfer management
  – DMA, compress/decompress, ...
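
Three of the DRAM PIM functions listed above (striding through regular data, pointer chasing, and prefix operations) can be illustrated in a few lines. This is a software sketch of the operations themselves, not of the PIM hardware: the function names, the list-based "memory", and the use of -1 as an end-of-list marker are all assumptions made for the example.

```python
from itertools import accumulate

def strided_gather(memory, start, stride, count):
    """Collect `count` elements spaced `stride` apart into one dense block,
    as would be shipped to SRAM in a single transfer."""
    return [memory[start + i * stride] for i in range(count)]

def pointer_chase(values, next_index, head):
    """Follow a linked list stored as parallel arrays until the -1 sentinel,
    returning the values in list order."""
    block, node = [], head
    while node != -1:
        block.append(values[node])
        node = next_index[node]
    return block

def prefix_sum(block):
    """Inclusive running sum over a block."""
    return list(accumulate(block))

memory = list(range(100, 116))                    # 16 words: 100..115
print(strided_gather(memory, 1, 4, 3))            # every 4th word from index 1

values = [10, 20, 30, 40]
next_index = [2, -1, 3, 1]                        # list order: 0 -> 2 -> 3 -> 1
chased = pointer_chase(values, next_index, head=0)
print(chased, prefix_sum(chased))
```

The point of doing this in memory is that only the dense result block crosses the interconnect, rather than one round trip per element or per link.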

SRAM PIM Functions
• Initiate Gather/Scatter to/from DRAM
• Recognize when sufficient operands arrive in SRAM context block
• Enqueue/Dequeue SRAM block addresses
• Initiate DMA transfers to/from CRAM context block
• Signal SPELL re task initiation
• Prefix operations like Flt Pt Sum

StrawMan Prototype for Phase 4

Subsystem     Unit Capability     Number of Units   Total Capability
Processors    100 Gflops          128               10 Tflops
CRAM          8 Kbytes            512               4 Mbytes
SRAM          1 Mbyte/1 proc.     16K               16 Gbytes
Data Vortex   4 Gbits/s/8         4K in             128 Tbits/s
DRAM          8 Mbyte/4 proc.     64K               512 Gbytes
HRAM          1 Gbyte             8K                8 Tbytes

[Figure: SIDE VIEW of the cryostat with 77 K and 4 K stages, 50 W, Fiber/Wire Interconnects; labeled dimensions 1.4 m, 1 m, 0.3 m, 1 m, 3 m, 0.5 m]
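
Two of the SRAM PIM functions listed in this section, recognizing when sufficient operands have arrived in a context block and enqueueing the block address, amount to a counting synchronization. The sketch below is one hypothetical way to express that logic in software; the class, the slot-keyed operand store and the queue are illustrative and are not taken from the HTMT design.

```python
from collections import deque

class SramContextBlock:
    """A context block that becomes ready once all its operand slots are filled."""

    def __init__(self, block_addr, operands_needed):
        self.block_addr = block_addr
        self.operands_needed = operands_needed
        self.operands = {}  # slot number -> value

    def is_ready(self):
        return len(self.operands) >= self.operands_needed

    def deliver(self, slot, value):
        """Store one operand; return True only on the arrival that completes the block."""
        was_ready = self.is_ready()
        self.operands[slot] = value
        return self.is_ready() and not was_ready

def on_operand_arrival(block, slot, value, ready_queue):
    """Enqueue the block address exactly once, when its last operand arrives."""
    if block.deliver(slot, value):
        ready_queue.append(block.block_addr)

ready_queue = deque()
block = SramContextBlock(block_addr=0x40, operands_needed=3)
for slot, value in [(0, 1.5), (2, 2.5), (1, 4.0)]:
    on_operand_arrival(block, slot, value, ready_queue)
print([hex(addr) for addr in ready_queue])
```

In the scheme the slides describe, a dequeued block address would then trigger the DMA transfer into a CRAM context block and the signal to SPELL; those steps are left out here.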