
Building Modern Integrated Systems: A Cross-cut Approach
Vladimir Stojanović
Integrated Systems Group, Massachusetts Institute of Technology

Acknowledgments
• Devices: Rajeev Ram, Miloš Popović, Tsu-Jae King Liu
• Architecture: Krste Asanović, Christopher Batten, Ajay Joshi
• Circuits: Elad Alon, Dejan Marković
• Post-docs and students:
  – Devices: Jason Orcutt, Jeffrey Shainline, Anatoly Khilo, Jie Sun, Cheryl Sorace, Mark Wade, Eugen Zgraggen, Jaeseok Jeon, Rhesa Nathanael, Hei Kam, Stevan Urosevic, Josh Wang, Ekaterina Kononov
  – Circuits: Michael Georgas, Jonathan Leu, Ben Moss, Chen Sun, Yu-Hsin Chen, Fred Chen, Byungsub Kim, Hossein Fariborzi, Matthew Spencer, Chengcheng Wang, Kevin Dwan
  – Architecture: Yunsup Lee, Yong-Jin Kwon, Scott Beamer, Chen Sun, Yu-Hsin Chen, Imran Shamim
• Micron Technology: Roy Meade, Gurtej Sandhu, Reha Bafrali, Emanuele Baracchi, Miri Baruch, Erez Conforti, Paul Daley, Eyal Friedman, Harel Frish, Dana Haran, Lilach Makrabi, Maxim Rabinovitch, Matt Ross, Yoel Shetrit, Zvi Sternberg, Ofer Tehar-Zahav
• DARPA MTO
• IBM Trusted Foundry
• Texas Instruments: Dennis Buss and Tom Bonifield
• Intel Corporation: Ian Young and Alex Kern

Chip design needs radical changes
• Already have more devices than can use at once
• Limited by power density and bandwidth
• "The Processor is the new Transistor" [Rowen]
• 1000s of processor cores and accelerators per die
[Chip photos: Intel 4004 (1971): 4-bit processor, 2312 transistors, ~100 KIPS, 10 micron PMOS, 11 mm² chip; Intel Knights Corner: 50 cores, 200 threads; Oracle T5: 16 cores, 128 threads; Nvidia Fermi: 540 CUDA cores; IBM Power 7: 8 cores, 32 threads]

Subthreshold leakage: Game over for CMOS
[Plots: normalized energy/cycle vs. Vdd (V), with curves Etotal, Edynamic, Eleak; normalized energy/op vs. 1/throughput (ps/op) when scaling Vdd & VT]
• CMOS circuits have well-defined minimum energy
  – Caused by leakage and finite sub-threshold slope
  – Need to balance leakage and active energy
  – Limits energy-efficiency, regardless of how slowly the circuit runs

Wire and I/O scaling problems
[Plots: I/O, best electrical links: energy cost (pJ/b) vs. data rate (Gb/s), labels: Chip2Chip, Backplane, Loss ~20-25 dB, Loss ~10 dB; on-chip wires: copper resistivity]
• Increased wire resistivity makes wire caps scale very slowly
• Can't get both energy-efficiency and high data rate in I/O

Bandwidth, pin count and power scaling
[Plot: package pin count, from 2-4 cores to 256 cores]
• Need 16k pins in 2017 for HPC*
• 2 TFlop/s, 1 Byte/Flop, signal pins @ 20 Gb/s/link
* More than half of the pins are for power supply

Emerging devices can help
• Energy-efficient computation and communication
• CMOS – need cross-cut approach to keep scaling performance
• Post-CMOS – need cross-cut approach to guide new devices/systems
[Diagram: cross-cut design stack: Network & µArchitecture; Design Optimization; Communications (Eq., Mod, Coding); Circuit modeling, Characterization; Circuits & Logic (Tx, Rx, Ctrl, Meas); Interconnect and switch technology (Cu, MOSFET, Si-Photonics). Inset plot: Energy/Bit (pJ/Bit) vs. Data Rate Density (Gbps/um) for equalized links with 30 mV, 50 mV and 90 mV eyes and for a repeated link]

Monolithic Si-Photonics for core-to-core and core-to-DRAM networks
[Images: supercomputers; embedded apps]
• Si-photonics in advanced CMOS and DRAM process
• NO costly process changes
• Bandwidth density – need dense WDM
• Energy-efficiency – need monolithic integration

Monolithic CMOS photonic integration
[Cross-section figures: thin BOX SOI CMOS electronics; bulk CMOS electronics; label: <150 nm SiO2]

Si and polySi waveguide formation

Integrated photonic interconnects
• Each λ carries one bit of data
• Bandwidth density achieved through DWDM
• Energy-efficiency achieved through low-loss optical components and tight integration

Single channel link tradeoffs
[Plots, parameters: loss 10 dB and 15 dB; Rx cap 5 fF and 25 fF]

Resonance sensitivity
[Figures: direct thermal tuning; wafer-level ring variation data from our Micron designs]
• Process and temperature shift resonances
• Direct thermal tuning cost prohibitive
Georgas CICC 2011, Sun NOCS 2012

Smarter wavelength tuning
• Nearest channel tuning + reshuffling
• Utilize systematic global mismatch and temperature shifts
• Electrical backend enables dense WDM
• Helps reduce tuning costs by more than 10x
Georgas CICC 2011, Sun NOCS 2012

Need to optimize carefully
Assumptions: 512 Gb/s aggregate throughput, 32nm CMOS
• Laser energy increases with data-rate
  – Limited Rx sensitivity
  – Modulation more expensive -> lower extinction ratio
• Tuning costs decrease with data-rate
• Moderate data rates most energy-efficient
Georgas CICC 2011

DWDM link efficiency optimization
• Optimize for min energy-cost
• Bandwidth density dominated by circuit and photonics area (not coupler pitch)
  – 10x better than electrical bump limit
  – 200x better than electrical package pin limit

Many architectural studies show promise
[Shacham'07] [Petracca'08] [Koka'08-10] [Joshi'09] [Pan'09] [Batten'08] [Vantrease'08] [Psota'07] [Kirman'06] [Beamer'10]

Photonic memory interface – leveraging optical bandwidth density
Important concepts:
• Power/message switching (only to active DRAM chip in DRAM cube/super DIMM)
• Vertical die-to-die coupling (minimizes cabling - 8 dies per DRAM cube)
• Command distributed electrically (broadcast)
• Data photonic (single writer multiple readers)
[Diagram labels: processor die with laser in CPU, memory scheduler, MC 1 to MC K, MC 16; Super DIMM to Super DIMM K; DRAM cube 1 to DRAM cube 4; (cube 1, die 1) to (cube 1, die 8); cmd, Dwr, Drd; die-die switch; modulator bank; receiver/PD bank; tunable filterbank; through silicon via; through silicon via hole]
• Enables energy-efficient throughput and capacity scaling per memory channel
Beamer ISCA 2010

Laser power guiding effectiveness
• Enables capacity scaling per channel and significant savings in laser energy
Beamer ISCA 2010

Optimizing DRAM with photonics
[Figure: floorplan, P1 and P4]
Beamer ISCA 2010

Design Space Exploration of Networks Tool
DSENT – A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks Modeling
Chen NOCS 2012
[Diagram of DSENT:]
• Model parameters: Nin, Nout, fclock, ...
• Technology parameters: Process, VDD, Wmin, T, ...
• User-defined models: Arbiter, Router, Mesh Network, Multiplexer, Crossbar, Repeated Link, Electrical Clos, Decoder, Buffers, Optical Link, Photonic Clos
• Support models: Standard Cells, Optical Link Components, Technology Characterization
• Tools: Timing Optimization, Expected Transitions, Delay, Optical Link Optimization
• Outputs: Area, Non-Data-Dependent Power, Data-Dependent Energy
Available for download at: https://sites.google.com/site/mitdsent/
Kurian IPDPS 2012

Significant integration activity, but hybrid and older processes …
[Figure labels: 130nm thick BOX SOI (twice); Bulk CMOS; Backend monolithic; groups: [IBM] [Luxtera/Oracle/Kotura] [Many schools] [Intel] [HP] [Watts/Sandia/MIT] [Lipson/Cornell] [Kimerling/MIT]]

EOS Platform for Monolithic CMOS photonic integration
Joint work with Ram and Popović
[Timeline from 2007 to 2011 with test chips: 90 nm bulk CMOS (IBM cmos9sf); 65 nm bulk CMOS (Texas Instruments); 32 nm bulk CMOS (Texas Instruments); 45 nm SOI CMOS (IBM 12SOI). Inset plot: Transmission (dB) vs. Frequency (GHz)]
• Create integration platform to accelerate technology development and adoption

EOS Platform: Fabricated in IBM 45nm SOI
Orcutt et al., Optics Express, 2012
• 3 x 3 mm die
• 45nm thin BOX SOI technology (used for Power 7 and Cell processors)
• 3M transistors
• 400 pads
• ARM standard cells and custom link circuits

EOS performance summary
• Fiber-to-chip grating couplers with 3.5 dB insertion loss
• Waveguides under 4dB/cm propagation loss
• 10 dB extinction optical modulators
• 8 channel wavelength division multiplexing filter bank with <-20 dB cross talk
• SiGe photodetectors
• All integrated with electronic circuits

Integration of photonics into VLSI tools
[Flow diagram:]
• Photonic device p-cell -> layout of photonics -> abstract -> LEF
• Layout of circuit blocks -> abstract -> LEF
• LEF of standard cells, I/O pads (provided by ARM)
• Chip-level Verilog (instantiation of .LEF macros and connectivity)
• Floorplan (macro placement, power grid, routing constraints)
• Technology files
• SOC Encounter place and route -> place&routed layout
• Custom photonics-friendly auto-fill

modulator.LEF (excerpt shown on the slide):

    VERSION 5.6 ;
    BUSBITCHARS "[]" ;
    DIVIDERCHAR "/" ;
    MACRO block_electronic_etch_row_1
      CLASS BLOCK ;
      ORIGIN -208 -1794 ;
      FOREIGN block_electronic_etch_row_1 208 1794 ;
      SIZE 2488 BY 165 ;
      SYMMETRY X Y R90 ;
      PIN heater_a_1
        DIRECTION INOUT ;
        USE SIGNAL ;
        PORT
          LAYER ua ;
          RECT 431 1870.5 436.5 1882 ;
        END
      END heater_a_1
      ...
      OBS
        LAYER m1 ;
        RECT 208 1794 2696 1959 ;
        ...
      END
    END block_electronic_etch_row_1
    END LIBRARY

Platform organization

A full electro-optical test setup
[Photo labels: fiber positioners, microscope, DUT, chip board, HS clocks, control board, FPGA, USB to laptop]

Best waveguide losses ever reported in a sub-100nm production CMOS line
• Body-Si waveguides: 3-4dB/cm loss
• Poly waveguides: 50dB/cm loss
• Waveguide widths listed: 470nm, 700nm, 700nm
• Body-Si ring Q factor
  – 227k @ 1280nm
  – 112k @ 1550nm

Exceptional dimensional control in 45nm node
[Figure: filterbank with input, through and drop1 to drop8 ports; 250 GHz spacing; >20 dB isolation; 30 GHz bandwidth]
• 8-wavelength filterbank results
  – Filter channels fabricated in order
  – Less than 1nm variation
• Excellent channel isolation (>20dB at 250GHz spacing)

Integrated delta-sigma heater control
• ~10mW required to retune all 8 rings
• Tuning efficiency 130µW/K
• Thermal tuning BW ~1MHz
• Tuning control overhead negligible
• On fully substrate removed die (45nm SOI)
• 100µW/K on deep-trench bulk die (Micron)

Current-sensing optical data receiver
• Receiver detects photo current
• 50fJ/b, µA sensitivities, 3-5Gb/s
Georgas ESSCIRC 2011, JSSC 2012

Optical modulator design
Shainline, Popović
Carrier-injection device
At 1550nm:
• Extinction ratio 19dB
• 45GHz 3dB optical bw
At 1280nm:
• Extinction ratio 9dB
• 60GHz 3dB optical bw

Optical modulator – electrical tests
• Carrier-lifetime 2-3ns
• 200MHz electrical bandwidth
• Diffusion time constant affected by
  – Recombination time
  – Drift conditions

Modulator driver sub-bit pre-emphasis
• Partial forward bias at 0-bit key to fast operation

Modulator driver heads
• Split-supply used for sub-bit pre-emphasis
• Use core and I/O voltage – no regulators

Fastest modulation in 45nm process
• 2.5Gb/s modulation
• 1.2pJ/bit
• 3dB insertion loss
• 3dB extinction ratio
Moss ISSCC 2013

Power and pins required for 10TFlops/s
• 80Tb/s sustained bandwidth assuming 1B/Flop
[Plot: total memory channel power (W) vs. # socket pins required for memory channels; groups: DDR4, GDDR5, LPDDR, HMC; series: DDR3-1333 4GB, DDR4-2667 8GB, GDDR5, Mobile LPDDR2-1066, Mobile LPDDRX-1666, Mobile LPDDRX 2017, HMC-Gen1, HMC-Gen2, POEM PIM, POEM Phase 1, POEM Phase 2, POEM Post-phase 2]

Improving computation efficiency
[Diagram: the cross-cut design stack from "Emerging devices can help", with NEMS relay added under interconnect and switch technology: Cu, MOSFET, Si-Photonics, NEMS relay]

Nano-electro-mechanical (NEM) relays
Joint work with T-J. King Liu, E. Alon and D. Marković (UCB, UCLA)
[Figure: relay schematic with gate, gate oxide, body, drain, source and channel, cross-section A-A'; dimensions 30 µm, 27.5 µm, 90 nm]
• Nearly ideal switching characteristics:
  – Low on-state resistance (Ron <10kΩ)
  – Infinite off-state resistance
  – Zero off-state leakage
• But, relatively slow actuation

Why not use relays to compute? - Need to compare at block level
[Figure labels: 4 gate delays; NEMS: 12 relays, 1 mechanical delay]
• Delay comparison vs. CMOS
  – Single mechanical delay vs. several electrical gate delays
  – For reasonable load, NEMS delay unaffected by fan-out/fan-in
• Area comparison vs. CMOS
  – Larger individual devices
  – But often need fewer devices to implement same function
F. Chen et al., "Integrated Circuit Design with NEM Relays," ICCAD 2008

Scaled NEMS vs. CMOS adders
[Plot: Energy/op vs. Delay/op across Vdd, annotated 9x and 10x]
• Compare vs. Sklansky CMOS adder*
• 90nm technology
• 30x less capacitance
  – Lower device Cg, Cd
  – Fewer devices
• 2.4x lower Vdd
• No leakage energy
• For similar area: >9x lower E/op, >10x greater delay
• Scaled relays limited by contact surface energy - 2aJ for 90nm litho – 50x better than 90nm CMOS
*D. Patil et al., "Robust Energy-Efficient Adder Topologies," in Proc. 18th IEEE Symp. on Computer Arithmetic (ARITH'07).

Contact resistance - Feedback from system level
[Plot: Energy/op vs. Delay/op across Vdd & CL]
• Very low contact R not critical
• Good news for reliability…
• Can build test platforms that work

CLICKR technology development platform: NEM relay-based circuits
ISSCC 2010 – TD Award
F. Chen et al., ISSCC 2010
M. Spencer et al., JSSC Jan'11

Towards more complex designs
[Figures: relay circuit schematics, panels (a) to (d), with inputs A0 to A6, outputs Y0 to Y2 and Kill/Generate labels; layout dimensions 8mm and 700μm; measured output vs. input code; plot of Energy/op (fJ) vs. Delay (ns) for 16-bit multipliers: OTCT (90nm), Dadda/HC (45nm), scaled MEM relay, 16X parallel]
• Energy-benefit preserved even in more complex functions
• Multiplier building block: 7:3 compressor
• 98 relays – largest working relay circuit to date
Fariborzi ASSCC 2011

NEM Relay VLSI design infrastructure
[Flow diagram: device p-cell, Spectre Verilog-A model, schematic, layout, Verilog, logic synthesis, place & route, LVS, DRC]
• Verilog-A model and logic synthesis created for NEMS technology
• The flow supports multiple device designs and foundries

Toward full systems - NEM Relay scaling
• UC Berkeley: 1 µm litho
• SEMATECH: 0.25 µm litho
• 1st prototype: 120 µm x 150 µm
• Scaled relay: 15 µm x 15 µm
[Die photo, 9mm x 9mm: test devices, 8-bit adders, 4-bit and 2-bit adders, 2-bit accumulator, SRAM, flip-flops, DRAMs, 7:3 compressor, 4-bit DAC, 4-bit ADCs, oscillators]

Toward full systems - NEM Relay scaling
• SEMATECH: 0.25 µm litho
• Scaled relay: 15 µm x 15 µm
• World's smallest relay multiplier circuit
[Figures: relay circuit schematic with inputs A0 to A6 and outputs Y0 to Y2, 46 6T relays; Y0, Y1 and Y2 output (V) vs. time (s) by input code; output code vs. input code, test result against correct result]

Summary
• Cross-layer modeling and design key to continued system performance scaling
  – Fast design-space exploration
  – Feedback to all layers of design hierarchy
• Building early technology development platforms
  – Feedback to device and circuit designers
  – Spearhead adoption
• EOS Platform designed for multi-project wafer runs
  – Best end-of-line passives in sub-100nm process (3-4dB/cm loss)
  – 50 fJ/b receivers with µA sensitivities
  – Record high tuning efficiency with undercut ~ 25µW/K
  – First modulation demonstrated in 45nm process
  – Ported to bulk Micron process
• CLICKR Platform designed for multiple foundries and devices
  – Energy-gains preserved for larger blocks
  – Largest working relay circuits demonstrated
  – Designs moving toward scaled devices and full VLSI systems
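
Worked example: memory bandwidth arithmetic. The bandwidth slides quote 1 Byte/Flop, 20 Gb/s per link, and 80 Tb/s for 10 TFlop/s. The sketch below reproduces only that arithmetic. The function names are illustrative, and the link count assumes each link carries the full 20 Gb/s in one direction, which the slides do not state. It does not attempt to reproduce the 16k pin estimate, which also depends on the power-pin share and signaling style.

```python
import math


def required_bandwidth_tbps(tflops: float, bytes_per_flop: float) -> float:
    """Sustained memory bandwidth in Tb/s for a given compute rate in TFlop/s."""
    return tflops * bytes_per_flop * 8.0  # 8 bits per byte


def links_needed(bandwidth_tbps: float, link_rate_gbps: float) -> int:
    """Links of the given rate needed to carry the bandwidth (assumed one-directional)."""
    return math.ceil(bandwidth_tbps * 1000.0 / link_rate_gbps)


if __name__ == "__main__":
    for tflops in (2.0, 10.0):
        bw = required_bandwidth_tbps(tflops, bytes_per_flop=1.0)
        n = links_needed(bw, link_rate_gbps=20.0)
        print(f"{tflops:g} TFlop/s at 1 B/Flop: {bw:g} Tb/s, {n} links at 20 Gb/s")
```

Under these assumptions the 10 TFlop/s case gives the 80 Tb/s quoted on the slide; the resulting link counts are lower bounds before differential signaling, power pins, or protocol overhead are considered.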
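
Worked example: energy per bit to average power. The link slides quote energy per bit together with data rate (1.2 pJ/bit at 2.5 Gb/s for the modulator, 50 fJ/b at 3-5 Gb/s for the receiver). A minimal sketch of the conversion to average power follows, using the identity pJ/bit × Gb/s = mW. The function name is illustrative, and the resulting power figures are derived here rather than quoted from the slides.

```python
def average_power_mw(energy_pj_per_bit: float, rate_gbps: float) -> float:
    """Average power in mW, since 1 pJ/bit * 1 Gb/s = 1e-12 J * 1e9 /s = 1 mW."""
    return energy_pj_per_bit * rate_gbps


if __name__ == "__main__":
    # Modulator figures from the slides: 1.2 pJ/bit at 2.5 Gb/s.
    print(f"modulator: {average_power_mw(1.2, 2.5):.2f} mW")
    # Receiver figures from the slides: 50 fJ/b (0.05 pJ/b), evaluated at 5 Gb/s.
    print(f"receiver:  {average_power_mw(0.05, 5.0):.2f} mW")
```

The same conversion applies to any per-bit energy figure in the deck, provided the energy and the data rate refer to the same channel.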
