Ralph K. Cavin, III
March 18, 2009, Brussels

• Is there a Carnot-like theorem for computation?
  ◦ e.g., a limit on the rate of information throughput per unit of power consumed?
• The MIND architecture benchmarking activity for novel devices
• Memory Architectures
• Inference Architectures

• Chose a simple one-bit, four-instruction processor
• All transistors operate at ~kT switching energy
• Interconnects dissipate energy at ~kT per gate length
• Transistor average fan-out is three

[Figure: block diagram of the processor with per-block transistor counts (2-4 decoder, program counter, 2-bit counter, memory, ALU, CPU). Total: 314 devices]

Von Neumann threshold; Joyner tiling: $n = 314$
$A_{min} = n \cdot 8a^2 = 314 \cdot 8a^2 \approx 2500\,a^2 = 50a \times 50a$
$a_{min} = 1.5$ nm, so $A_{min} \approx 75\ \text{nm} \times 75\ \text{nm}$

Operational energy of the Minimal Turing Machine
$E_{op} = \frac{9}{2}\, n\, k_B T \ln 2 \approx 980\, k_B T/\text{cycle} \approx 4 \times 10^{-18}$ J/cycle
Per full CPU operation: $E_{op} \approx 3 \times 4 \times 10^{-18}\ \text{J/cycle} \approx 10^{-17}$ J/operation

Devices: 314
Area: 75 nm × 75 nm
Device density: $5.6 \times 10^{12}$ cm$^{-2}$
Energy per cycle: $4 \times 10^{-18}$ J/cycle
Time per cycle: ~2 ps
Power: 2 µW
Power density: ~30 kW/cm$^2$
BITS = density x freq. = $10^{14}$ bit/s
MIPS: $2 \times 10^{5}$

[Figure: instructions per second (MIPS) versus BIT, the maximum binary throughput (bit/s), on logarithmic axes. Labeled points include the brain at $10^{19}$ bit/s, $10^{8}$ MIPS and 30 W, and a $10^{6}$ W/cm$^2$ annotation. Sources: The Intel Microprocessor Quick Reference Guide and TSCP Benchmark Scores]

• The Minimal Turing Machine lies on a different performance trajectory from conventional computers
  ◦ It has the slope to meet brain performance
• More detailed physics-based analysis is needed
  ◦ System thermodynamics of computation
    Carnot’s equivalent for a Computational Engine?
    Lessons from Biological Computation?
• Candidates for beyond-CMOS nano-electronics should be evaluated in the context of system scaling
  ◦ e.g. spintronic minimal Turing Machine?

NRI Focus Centers
Kerry Bernstein, IBM
February 2009 Update
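The minimal Turing Machine figures in the Cavin slides above can be cross-checked with a few lines of arithmetic. The sketch below is only a plausibility check, not the author's calculation: it assumes room temperature (300 K, not stated in the slides), reads the per-operation energy as three cycles per operation, and takes binary throughput as device count times clock frequency. The function name and dictionary keys are illustrative.

```python
import math

K_B = 1.380649e-23   # Boltzmann constant, J/K
T_KELVIN = 300.0     # assumption: room temperature, not stated in the slides


def minimal_tm_estimates(n_devices=314, a_nm=1.5, t_cycle_s=2e-12, cycles_per_op=3):
    """Recompute the headline minimal Turing Machine figures from the slide inputs."""
    # Joyner tiling: each device occupies 8 a^2
    area_nm2 = n_devices * 8 * a_nm ** 2
    area_cm2 = area_nm2 * 1e-14            # 1 nm^2 = 1e-14 cm^2
    # Energy per cycle: (9/2) n kB T ln 2
    e_cycle_j = 4.5 * n_devices * K_B * T_KELVIN * math.log(2)
    power_w = e_cycle_j / t_cycle_s
    return {
        "side_nm": math.sqrt(area_nm2),
        "density_per_cm2": n_devices / area_cm2,
        "e_cycle_kT": e_cycle_j / (K_B * T_KELVIN),
        "e_cycle_j": e_cycle_j,
        # assumption: three cycles per full CPU operation
        "e_op_j": cycles_per_op * e_cycle_j,
        "power_w": power_w,
        "power_density_w_cm2": power_w / area_cm2,
        # assumption: throughput = number of devices x clock frequency
        "bit_per_s": n_devices / t_cycle_s,
        "mips": 1.0 / (t_cycle_s * cycles_per_op) / 1e6,
    }


if __name__ == "__main__":
    for name, value in minimal_tm_estimates().items():
        print(f"{name:>22}: {value:.3g}")
```

Under these assumptions the side length comes out near 75 nm and the energy near 980 kT per cycle. The power works out to about 2 µW, which is the value consistent with the quoted ~30 kW/cm2 power density.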
1. Short Term: Switches that supplement CMOS and are CMOS-compatible, supporting performance via hardware acceleration
2. Long Term: Switches that replace CMOS for general-purpose high-performance compute applications

1) CMOS is not going away anytime soon. Charge (the state variable) and the MOSFET (the fundamental switch) will remain the preferred HPC solution until new switches appear as the long-term replacement in 10-20 years.
2) Hardware accelerators execute selected functions faster than software performing them on the CPU. Accelerators are responsible for substantial improvements in throughput.
3) Alternative switches often exhibit emergent, idiosyncratic behavior. We should exploit it. Certain physical behaviors may emulate selected HPC instruction sequences. Some operations may be superior to digital solutions.
4) New switches may improve high-utility accelerators. The shorter-term supplemental solution (5-15 years) improves or replaces accelerators “built in CMOS and designed for CMOS”, either on-chip, on-planar, or on-3D-stack.

Hierarchical Benchmarking
(table columns: Level, Metric, The Good, The Bad; the notes in parentheses are the table's remarks on each metric)
• Device
  ◦ CV/I (Charge-Based)
  ◦ Ft (Specific to Clked)
  ◦ Ion/Ioff (Current-Based)
• Circuit
  ◦ NAND2 Dly, Pwr, Energy, Area (Incomplete)
  ◦ E-D-A Product (Optimization)
  ◦ P-D-A Product (Optimization)
  ◦ Logical Effort (Constrained)
• Architecture
  ◦ MIPS, IPC (Equivalent T/P; Synchronous)
  ◦ Ops/Joule (Energy Efficiency; Discrete Ops)
  ◦ SpecInt (Industry Stds; New Capability)

1) Derive values for the conventional quantitative ITRS benchmarks shown in Benchmark 1.
2) Derive quantitative values and qualitative entries for the architecture benchmarks shown in Benchmark 2a and 2b.
3) Identify specific logic operations performed elegantly by your switch: where physical device behavior complements the desired logic operation. Determine the equivalent IPC and power of that function performed in the new switch, as shown in the Benchmark 3 example. Determine the actual IPC and Operations/Watt had the function been performed via software in the CPU.
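Step 3 above compares one function run on a new-switch accelerator against the same function run in software on the CPU. A minimal sketch of that bookkeeping follows. It assumes that "equivalent IPC" means crediting the accelerator with the instruction count the software version would have retired, which is one reading of the slides rather than a stated definition. The class name, field names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FunctionRun:
    """One execution of a fixed function (e.g. an encryption kernel)."""
    name: str
    instructions: int   # CPU instructions the function is worth in software
    cycles: int         # clock cycles actually spent
    energy_j: float     # energy actually spent, in joules

    @property
    def ipc(self) -> float:
        # For the accelerator this is the "equivalent IPC" (assumed definition)
        return self.instructions / self.cycles

    @property
    def ops_per_joule(self) -> float:
        return self.instructions / self.energy_j


def compare(software: FunctionRun, accelerator: FunctionRun) -> dict:
    """Return the accelerator's gain over software for both metrics."""
    return {
        "ipc_gain": accelerator.ipc / software.ipc,
        "efficiency_gain": accelerator.ops_per_joule / software.ops_per_joule,
    }


# Hypothetical measurements, for illustration only
software = FunctionRun("software on CPU", 10_000, 8_000, 2e-6)
accelerator = FunctionRun("new-switch accelerator", 10_000, 500, 1e-7)

if __name__ == "__main__":
    for run in (software, accelerator):
        print(f"{run.name}: IPC={run.ipc:.2f}, ops/J={run.ops_per_joule:.3g}")
    print(compare(software, accelerator))
```

Holding the instruction count fixed for both runs is what makes the comparison apples-to-apples: only cycles and energy differ between the two implementations.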
Benchmark 1: Device Metrics
• Defined by ITRS ERD Working Group
• Captures Fundamental Device Properties

Benchmark 2: Architectural Metrics
2a. Quantitative Communication Metrics
◦ Area of die/host accessible within 1 switch delay
◦ No. of switches accessible within 1 switch delay: (Accessible area within one switch delay) / (Area of 1 switch)
◦ Sq BW / unit area: (Channels x Freq)X x (Channels x Freq)Y
◦ Sq Comm Channels (NX x NY) per unit area
◦ Mem. delay / Logic delay
2b. Qualitative CMOS Compatibility Values
◦ Clocking infrastructure and locality
◦ Memory requirements and compatibility
◦ Scalability

Logic Metrics
(table with one column per circuit: 32-bit adder, inverter with FO4, NAND2 FO1)
◦ Delay
◦ Power
◦ Energy
◦ Area
◦ No. of switches
◦ Generic noise immunity (dB)
◦ Generic logical effort
◦ Comp. density (MIPS / no. of devices)
◦ PETE1 (EDDA)
◦ PETE2 (PDA)
◦ Reconfigurability or library dimensions
◦ Logic execution / architecture
◦ Specific logic function performed well
◦ Useful specific physical behaviors

Azad Naeemi, Georgia Tech
Delay versus Length for Various Transport Modes
New state variables will impact communication and fan-out

Transport mechanism: information token
◦ Diffusion: direct excitons, indirect excitons, spin, pseudospin
◦ Drift: indirect excitons, spin, pseudospin
◦ Ballistic transport (Fermi velocity): spin, pseudospin
◦ Spin wave: spin
◦ EM waves: photons, plasmons

[Table: switches (BTBTFET, MQCA, Quantum) against applications (H.264, Comp, Crypto, ...); each cell gives the Equivalent IPC, MIPS/Watt, and Ops/Joule of the switch in that application]
Compare apples-to-apples, independent of particular strength

Matching Logic Functions & New Switch Behaviors
Popular Accelerators: Encrypt / Decrypt; Compr / Decompr; Reg. Expression Scan; Discrete COS Trnsfrm; Bit Serial Operations; H.264 Std Filtering; DSP, A/D, D/A; Viterbi Algorithms; Image, Graphics
New Switch Ideas: Single Spin, Spin Domain, Tunnel-FETs, NEMS,
MQCA, Molecular, Bio-inspired, CMOL, Excitonics

Example: Cryptography Hardware Acceleration
• Operations required: Rotate, Byte Alignment, EXORs, Multiply, Table Lookup
• Circuits used in accelerator: Transmission Gates (“T-Gates”)
• New Switch Opportunity: A number of new switches (e.g. T-FETs) don’t have thermionic barriers, so they won’t suffer from CMOS pass-gate VT drop, body effect, or source-follower delay.
• Potential Opportunity: Replace 4 T-Gate MOSFETs with 1 low-power switch.
Bernstein, 1/25/09

• Example of HPC hardware accelerator contribution to power, area, instruction retirement rate, and energy efficiency improvement.
• The Purdue Emerging Technology Evaluator (PETE) metric is a convolution of power/energy, delay, and area.
• IPC and Ops/nJ provide an apples-to-apples comparison of new switches.

Paul Franzon
Department of Electrical and Computer Engineering
919.515.7351

Goal:
◦ Determine research needs for a ~2015 1000-petaflop computer, and smaller equivalents
Major Conclusions:
◦ Major Challenge #1: Power efficiency
  Communication overhead in computation
◦ Major Challenge #2: Resiliency
  Completing computation in the presence of permanent and transient faults
◦ Major Challenge #3: Performance scaling
  Performance scaling is limited by software, communications bisection bandwidth, and memory speed

Critical Needs:
◦ Reduced-power SRAM replacements
  45 nm L1 cache: 3.6 pJ/bit
  Note: re-architecting in 3D can save ~50%
  What is the potential for an ERD to reduce this to 0.3 pJ/bit?
  Note: Would require low swing on bit lines, while retaining speed and a low SET rate
◦ Reduced-power switched interconnect
  Esp. packet-routed interconnect (NOC)
  What is the potential for a memory-style ERD to be used for fast switchable interconnect?
  Flash devices can do this for static reconfiguration, BUT faster switching devices will be needed for dynamic reconfiguration

Blue Gene system reliability:
◦ Most of the DRAM failures are due to DIMM socket failures, not device failures
◦ Critical need: Sub-system level checkpointing and roll-back
ERD requirement:
◦ Tightly embedded Flash-like state “capture” memory for checkpointing
◦ Requirements:
  Tightly embedded, e.g. shadow registers, with minimum process change
  Slow read/write OK
  ~10 M writes minimum extrinsic reliability requirement

1. Metrics for cache replacement
◦ Energy/bit for read & write, 64 kbit array
◦ Speed for 64 kbit array
◦ Area for 64 kbit array
◦ SEU rate
◦ Added process complexity
2. Metrics for programmed routability
◦ Stage delay for 2x2 switch box
◦ Energy/bit for routing through 2x2 switch box
◦ Area for 2x2 switch box
◦ Configuration change speed for 2x2 switch box
◦ Added process and design complexity
3. Metrics for local check-pointing memory
◦ Read/write energy/bit
◦ Delay per bit for write
◦ Area per bit
◦ Write lifetime
◦ Added process and design complexity

In future computing, both general purpose and application specific, the bottleneck is not in logic operations but in memory, communications, and reliability.
Opportunities arise for memory-style devices to solve these bottlenecks:
◦ Low-power SRAM replacement
◦ Ultra-low-swing, routable interconnect replacement
◦ Local non-volatile memory as an aid to resiliency

The Memory Wall for multi-core
In general-purpose multi-core processors, the tradeoff for L1-L3 between memory bandwidth and memory size is dramatic.
◦ At constant BW, two cores may require as much as 8x the memory of one core
◦ At 2x BW, two cores require only about 2x the memory of a single-core system
◦ Kerry Bernstein, “New Dimensions in Performance”, Feb. 2009

Workshop: “Technology Maturity for Adaptive, Massively Parallel Computing”, March 2009, Portland, Oregon
http://www.technologydashboard.com/adaptivecomputing/
◦ General theme: Inference Architectures and Technology
◦ Karlheinz Meier, U. Heidelberg, “VLSI Implementation of Very Large Scale Neuromorphic Circuits – Achievements, Challenges, Hopes”
◦ Progress in architectures is being made, but many technology challenges remain (complexity).
◦ Can Emerging Research Devices accelerate realization of Inference Architectures?

• Continue work on ERD Architectural Benchmarking
  ◦ Work with the NRI MIND benchmarking effort
• Develop a section on memory architectures for Emerging Research Memories
• Look at the role of ERD/ERM in novel architectures where unique properties can provide substantial leverage, e.g. inference architectures
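As a small worked illustration of the memory-energy metric that a section on Emerging Research Memories would need, the sketch below revisits the cache figures quoted in the Franzon slides: 3.6 pJ/bit for a 45 nm L1 cache, a ~50% saving from 3D re-architecting, and a 0.3 pJ/bit ERD target. The sustained access bandwidth is a hypothetical value chosen only to turn energy per bit into watts; it does not come from the slides.

```python
BASELINE_PJ_PER_BIT = 3.6   # 45 nm L1 cache, from the slides
THREE_D_SAVING = 0.5        # "~50%" saving from 3D re-architecting, from the slides
TARGET_PJ_PER_BIT = 0.3     # ERD target posed as a question in the slides
ASSUMED_BITS_PER_S = 1e12   # hypothetical sustained access bandwidth


def array_power_w(energy_pj_per_bit: float, bits_per_second: float) -> float:
    """Access power in watts for a given energy per bit and bit rate."""
    return energy_pj_per_bit * 1e-12 * bits_per_second


options = {
    "45 nm SRAM": BASELINE_PJ_PER_BIT,
    "3D re-architected SRAM": BASELINE_PJ_PER_BIT * (1 - THREE_D_SAVING),
    "ERD target": TARGET_PJ_PER_BIT,
}

if __name__ == "__main__":
    for label, pj in options.items():
        power = array_power_w(pj, ASSUMED_BITS_PER_S)
        factor = BASELINE_PJ_PER_BIT / pj
        print(f"{label}: {pj:.1f} pJ/bit, {power:.2f} W, {factor:.0f}x vs baseline")
```

The point of the comparison is that 3D re-architecting alone covers only a factor of two, so most of the gap to the ERD target would have to come from the device itself.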