The Interconnect Problem: From Emerging Devices to Energy-efficient Networks
Vladimir Stojanović
Integrated Systems Group, Massachusetts Institute of Technology

High-speed links needed everywhere
[Figure: links between backbone router, rack, and PC or console]
MIT ISG

What makes it challenging
[Figure: high-speed link chip; channel response, attenuation [dB] vs. frequency [GHz], for 9" FR4, 26" FR4, 9" FR4 with via stub, and 26" FR4 with via stub. Source: Rambus]
• Requires sophisticated equalization circuits

Chip-to-chip I/O scaling problem
[Figure: #I/O pads, off-chip fclk, and aggregate bandwidth (with fit y = 10800x^-2.1), normalized to the 90 nm node, vs. technology (nm). Source: Ken Yang, UCLA]
[Figure: energy/bit (pJ) for 2-PAM vs. technology (µm), with trend y = 399.17x^1.157]
• Bandwidth need grows faster than energy/bit drops
• Creates exponentially increasing I/O power consumption
• In power-constrained systems (like processors and anything inside the box) this limits the available bandwidth

Parallel off-chip links
[Figure: Tx data through a linear transmit equalizer (anticausal and causal taps) and the channel to sampled data; signal and noise spectrum [dBV] vs. frequency [GHz]; receiver pre-amp with offset and comparator]
• Often share clock generation and synchronization
• Limited equalization (few taps)
• Most power burned to drive the 50 Ω line
• Current-mode
  - 200 mV swing (4 mW off 1 V supply)
  - Data rate independent
  - With receiver and pre-driver, at 10 Gb/s energy budget 500 fJ/bit
• Voltage-mode
  - 2 pJ/bit state-of-the-art (dynamic power)
  - Can possibly scale to 500 fJ/bit, but not much further

Convergence of platforms
• Only way to meet future system feature set, design cost, power, and performance requirements is by programming a processor array
  - Multiple parallel general-purpose processors (GPPs)
  - Multiple application-specific processors (ASPs)
• IBM Cell: 1 GPP (2 threads), 8 ASPs
• Intel Network Processor (IXP2800): 1 GPP core, 16 ASPs (128 threads)
• Sun Niagara: 8 GPP cores (32 threads)
• Picochip DSP: 1 GPP core, 248 ASPs
• Cisco CRS-1: 188 Tensilica GPPs
• Intel 4004 (1971): 4-bit processor, 2312 transistors, ~100 KIPS, 10 micron PMOS, 11 mm2 chip
• 1000s of processor cores per die
• "The Processor is the new Transistor" [Rowen]
[Figure: block diagrams of the example chips]

On-chip interconnect: A system perspective
• FPGAs, multi-core/network processors, SoCs: using more routing resources
  - Xilinx Virtex 4
  - IBM Cell: 8 cores
  - Cisco CRS-1: 192 cores
  - Clearspeed CSX600: 64-96 cores
• Example: MIT RAW numbers
  - Power consumption: clock 26%, network 36%, computation 39%
• Existing architectures
  - Designed to relieve the interconnect scaling problem (latency, capacitance, loss)
  - But, interconnects will consume nearly 50% of chip power
• Power and area efficiency becoming critical

Electronic shared memory switches
[Figure: Sun Niagara, IBM Cell, and Intel Terascale chips with DRAM interfaces and DIMMs; network types labeled crossbar, bus, mesh, rings]
• Centralized switch (bus, crossbar)
  - Fast, but scales poorly
  - Power hungry
• Distributed switch (e.g. mesh)
  - Slow, power hungry
• Other on-chip networks (fat tree, torus, etc.)
  - Compromise between density, power and latency/data rate
• Memory interfaces
  - Power and density limit

The problem spans many layers
• Interconnect problem: on-chip network, off-chip I/O
• Network architecture
• Design optimization
• Communication (Eq., Mod, Coding)
• Link modeling, characterization
• Circuits: Tx, Rx, Ctrl, Meas
• Interconnect technology: Cu, CNTs, Si-Photonics
[Figure: energy/bit (pJ/bit) vs. data rate density (Gbps/um) for equalized links with 30 mV, 50 mV, and 90 mV eye and for repeated links; receiver circuit (pre-amp with offset, comparator); interconnect cross-section [IBM]]

Repeaters in On-chip Networks
• Intel 80-core Terascale
• Application benchmarks, architectural decision: Guo, SLIP 2006; Jin, HPCA 2007
• Network, repeated interconnect, circuit + wire parameters
[Figure: wire width and space (um) vs. throughput density (Gbps/um) for equalized and repeated links; energy/bit (pJ/bit) vs. data rate density (Gbps/um) for a repeated 10 mm wire]

Equalized Interconnect in On-chip Networks
• Intel 80-core Terascale network
• Models and tools
  - No performance and power models
  - No modeling/tool framework
• Challenges
  - Joint optimization: circuit + wire
[Figure: equalized link schematic (transmit bits, FFE weights w1 ... wnFFE, driver Wdrv, wire width and spacing, receiver threshold Vth, DFE tap h1, receive bits); energy/bit (pJ/bit) vs. data rate density (Gbps/um) for equalized links with 30 mV, 50 mV, and 90 mV eye and for repeated links]

Equalized Interconnect
• No equalization
• Feed Forward Equalization (FFE)
• FFE + Decision Feed Back Equalization (DFE)
[Figure: block diagrams of the three options, with delay elements D, FFE weights w0, w1, w2, and DFE tap y1]

Joint Optimization Problem: Communication + Circuits + Wires
• Large design space (>400k points)
  - Circuit parameters (WLCM, Vp, Vs)
  - Wire dimensions (W, S)
  - Sample delay (Td)
  - Equalization coefficients (W1, W2, W3, y1)
[Figure: link schematic annotated with voltages (Vs, Vp), FFE coefficients, driver size, wire sizes, sample delay Td, and DFE coefficient]
B. Kim and Vladimir Stojanovic, "Equalized Interconnects for On-Chip Networks: Modeling and Optimization Framework," IEEE ICCAD 2007

Connecting Performance and Power Models
• Channel = wire + devices & circuit
• Need accurate channel model
  - Transfer function
• Eye estimation
  - Sample time
  - Equalization coefficients
[Figure: pulse response, voltage (V) vs. time (ns), with main tap and DFE tap marked; transfer function T(f) magnitude vs. frequency (GHz), equation vs. SPICE; estimated eye; link schematic annotated with the design parameters]
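
A minimal sketch of the eye-estimation step, under stated assumptions: it uses the standard peak-distortion (1-norm) bound on a sampled pulse response and treats DFE taps as perfectly cancelling the post-cursors they cover. The function name, the pulse values, and the one-sample-per-bit convention are hypothetical illustrations, not the formulation used in the tool described in these slides.

```python
def worst_case_eye_margin(pulse, main_index, dfe_taps=0):
    """Return the worst-case (peak-distortion, 1-norm) voltage margin at the sampler.

    pulse      : sampled pulse response in volts, one sample per bit time
    main_index : index of the main cursor (set by the chosen sample delay)
    dfe_taps   : number of post-cursor samples assumed perfectly cancelled by DFE
    """
    main = pulse[main_index]
    residual_isi = 0.0
    for i, sample in enumerate(pulse):
        offset = i - main_index
        if offset == 0:
            continue  # the main cursor is signal, not interference
        if 0 < offset <= dfe_taps:
            continue  # post-cursors within DFE reach are subtracted at the receiver
        residual_isi += abs(sample)
    return main - residual_isi


# Hypothetical pulse response (volts): one pre-cursor, main tap, two post-cursors.
pulse = [0.01, 0.06, 0.03, 0.01]
margin_no_dfe = worst_case_eye_margin(pulse, main_index=1)
margin_one_tap_dfe = worst_case_eye_margin(pulse, main_index=1, dfe_taps=1)
print(f"no DFE: {margin_no_dfe * 1e3:.0f} mV, 1-tap DFE: {margin_one_tap_dfe * 1e3:.0f} mV")
```

In this made-up example a single DFE tap removes the largest post-cursor, so the margin grows by that tap's amplitude. Sweeping main_index stands in for the sample-time choice listed on the slide.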

Result: M8, 10mm Repeater and equalized interconnect
[Figure: energy/bit (pJ/bit) vs. data rate density (Gbps/um) for equalized links with 30 mV, 50 mV, and 90 mV eye and for repeated links]
[Figure: worst-case eye and LMSE method; energy/bit (pJ/bit) vs. data rate density (Gbps/um) for NORM1 and LMSE, with and without crosstalk]
• Equalized interconnect: 10x more energy efficient for the same data rate density and latency
• Crosstalk doubles energy cost

Model Verification
[Figure: energy/bit (pJ/bit) vs. data rate density (Gb/s/um) for Ebtot, EbactiveDrv, EbPre, Ebw, and EbscDrv; solid line: this tool, dashed line: SPICE]
• Auto-generated SPICE netlist based on the parameters
• SPICE: 190 sec
• Formula-based: 6.5 msec

On-chip Link Test-chip
• 90 nm test chip, IBM process
• 8 Gb/s max
• 4 Gb/s/um
• 0.5 pJ/bit @ 8 Gb/s
  - 400 fJ/bit Tx
  - 100 fJ/bit TIA+DFE
• DDR muxing
• Eye > 100 mV
• Trades off data-rate density and energy-efficiency

What else can we do?
• Try new device technology
  - Carbon Nanotubes
  - Silicon Photonics

Estimating the impact of CNTs
• Create models for circuits
• CNT process characterization
• Extract tradeoff curves: performance, power, area
• Make informed architecture choices
[Figure: wire RC model with RDRV, CDRV, RW, CW, L, CLOAD; wire cross-section with W, S, T, H, P and capacitances CC, CP]

Effective Resistivity of CNTs
[Equation: effective resistivity ρEFF in terms of d, k, RF, Rcont, and L]
[Figure: CNT circuit model with RCONT/2, RF/2, LK, 4CQ, CE]
[Figure: resistivity of CNTs vs. Cu; effective resistivity (µΩ-cm) vs. technology node (nm) for Cu, ideal CNT, and CNT with L = 10x, 100x, and 1000x min. pitch; CNT Rcont = 50 kΩ]
• CNT bundles that are densely packed, fully and ideally contacted have a resistivity ~7x lower than Cu at 22 nm
• Non-ideal contact resistance amortized over length of nanotube
  - Insignificant for lengths > ~1000 gate pitches

Effective Resistivity: Non-idealities
[Equation: ρEFF in terms of h/4e^2, Rcont, L, L0, d, and k, where L0 is proportional to d]
[Figure: resistivity contours, CNT/Cu (22 nm), at 0.125, 0.25, 0.5, and 1.0; k, fraction of contacted metallic CNTs, vs. d, CNT diameter (nm); 33% metallic marked]
• Current growth limitations result in only ~2x lower resistivity than copper

Scaling Interconnect: Design Space
• Consider rescaling the interlayer dielectric (ILD) stackup
• Scale width: maintain minimum wire pitch (P = S + W = constant)
• Increase ILD height (H), decreasing wire thickness (T): maintain constant wire bandwidth
• Copper or CNT vias, copper or CNT interconnects
• Integration of vertically aligned CNTs closer to realization
[Figure: wire cross-sections showing width and ILD height scaling]

CNTs Require Rescaling
[Figure: optimal W (in WMIN) vs. wire length (in min. wire pitch) at FO = 3 for Cu (45 nm), Cu (22 nm), CNT (45 nm), and CNT (22 nm); min. width in Cu and the ranges of suboptimal CNT wire lengths at 22 nm and 45 nm are marked]
• Range of sub-optimally sized wires is greater if CNTs are used with the same cross-section as copper

Energy vs. Delay: W & H Scaling
[Figure: energy (fJ) vs. delay (ps) with FO increasing from 0.5 to 8, for Cu vias + Cu wires, CNT vias + Cu wires, Cu vias + CNT wires, and CNT vias + CNT wires; curves for Cu with W=1 and H=1, 2, 4 and for CNT with W=1, H=1 and W=0.5, H=1, 2, 4; min. delay H and min. energy H marked]
• Scaling both width and height results in almost 33% energy savings for the same delay
F. Chen, et al., "Scaling and Evaluation of Carbon Nanotube Interconnects for VLSI Applications," ACM Nanonets Symposium 2007

Characterization Limitations
Measurement limitations:
• Large impedance mismatch (VNA Z0 = 50 Ω vs. Z0 ~ 20 kΩ)
  - measurement variance > measured value
• Large test parasitics
  - limit bandwidth of input signals
  - need to de-embed, much larger than CNT
Physical limitations:
• Cumbersome, difficult, unrepeatable test setups
• Limited CMOS integration
• Limited number of measurements

CNT Characterization Test-chip
• Horizontally grown SWCNTs
  - Grow CNTs on separate substrate
  - Transfer to test chip using resist
  - Use ball bonder or metal deposition to make final pad contact
• Vertically grown CNTs
  - Create compatible CNT masks that align with test interface
  - Create contacts at the end of CNT chips
  - Align and mate using optical IR aligner
[Figure: cross-sections showing metallic underlayer, metal cap, passivation, metal alloy, UBM, pad metal, CNTs, and insulating scaffold]

Extracting CNT Data
• Any pair of transceivers acts like a bidirectional link/equivalent-time scope
• R ~ 20 kΩ
• Sweep time and voltage to measure step response waveforms
• Extract S-parameters, delay, edge rates of both CMOS and CML signals

What else can we do?
• Try new device technology
  - Carbon Nanotubes
  - Silicon Photonics

Manycore interconnect bottlenecks
[Figure: logical view; processors (P) with routers (R) on an electrical on-chip network, processor request and response networks, memory controllers, and an electrical memory interface to DRAM DIMMs]
• Slow on-chip mesh
• On-chip memory controllers
  - Power and density limited memory BW
  - At most 3-6 Tb/s in next few years
• Need to move them off-chip
  - Use Si-Photonics

Optical system architecture
[Figure: physical and logical views; processor request and response networks, optical access points, core waveguides, switch matrix, DIMM waveguides, hub chips with memory controllers and hub request/response networks; the electrical mesh is marked "Likely Bottleneck!"]
• Message path
  - Electrical mesh to reach the appropriate access point
  - Core waveguide to the switch matrix
  - Statically routed through the switch matrix
  - DIMM waveguide and optic fiber to reach hub chip
  - Routed through hub chip to correct DRAM chip
  - Returns through the separate response networks
• Si-photonic network
  - Faster, more energy-efficient communication across the chip
  - Exports memory controllers and switching to DRAM hub chips
  - Overcomes power and density limitations on memory bandwidth

Optical multi-group system architecture
[Figure: physical and logical views; Group 0 and Group 1, each with its own processor request and response networks, connected through the optical network to the hub request and response networks]
• Multi-group architecture
  - Can help alleviate on-chip interconnect bottleneck
  - Potentially 40-80 Tb/s
• Breaks the single on-chip electrical mesh into several groups
  - Each with its own smaller mesh
  - Each group still has one AP for each DIMM and thus can access all of memory
  - Since there are more APs, each AP is narrower (uses fewer λs)
• Uses optical network as a very efficient global crossbar
• Hub networks now include a crossbar and arbitration

Traffic generator modeling
• GUPS (Giga-Updates Per Second) access pattern
[Figure: average latency per request (cycles) vs. total offered bandwidth (bytes/cycle) for 256 cores: electrical (3 Tb/s mem BW), electrical (6 Tb/s mem BW), and optical with 16 groups; curves annotated 6 GUPS, 17 GUPS, and approximately 100 GUPS]
• For rough comparison, the Cray X1E with 248 cores (1.13 GHz) sustains 1.8 GUPS, while the Cray Red Storm/XT3 cluster with 12,960 cores (2.4 GHz) sustains 30 GUPS
http://icl.cs.utk.edu/hpcc

Link components
[Figure: optical link between multi-core chip and memory buffer chip, with labels:
  - 64 x λ laser source, coupler (1 dB)
  - Core 1:32 splitter (0.2 dB/split)
  - 64 Tx (0.01 dB/ring), insertion (1 dB)
  - On-chip waveguide, ~0.9 cm and 2 cm (1 dB/cm)
  - 64 crossings (0.05 dB/cross.)
  - 64 filter pairs (through loss 0.01 dB/ring), drop (2.5 dB/drop)
  - Coupler (1 dB), SM fiber 10 cm, coupler (1 dB)
  - Memory buffer chip, 2 cm (1 dB/cm): 63 Tx-Rx pairs (through loss 0.01 dB/ring), drop (2.5 dB/drop), 64 Rx (0.01 dB/ring), PD (0.1 dB)]

Optical power budget
Component                   Preliminary design   Power loss           Optimized design   Power loss
Coupler loss                1 dB/coupler         3 dB                 1 dB/coupler       3 dB
Splitter loss               0.2 dB/split         1 dB                 0.2 dB/split       1 dB
Non-linearity               1 dB                 1 dB                 1 dB               1 dB
Through loss                0.01 dB/ring         3.17 dB              0.01 dB/ring       3.17 dB
Modulator insertion loss    1 dB                 1 dB                 0.5 dB             0.5 dB
Crossing loss               0.2 dB/crossing      12.8 dB              0.05 dB/crossing   3.2 dB
On-chip waveguide loss      5 dB/cm              20 dB                1 dB/cm            4 dB
Off-chip waveguide loss     0.5e-5 dB/cm         ~0 dB                0.5e-5 dB/cm       ~0 dB
Drop loss                   2.5 dB/drop          5 dB                 1.5 dB/drop        3 dB
Photodetector loss          0.1 dB               0.1 dB               0.1 dB             0.1 dB
Receiver sensitivity        -20 dBm              -20 dBm              -20 dBm            -20 dBm
Power per wavelength                             26.07 dBm (0.40 W)                      -1.03 dBm (0.78 mW)
Power required at source                         3.3 kW                                  6.38 W

Data transmission latency
Component                                     Latency
Serializer/Deserializer (50 ps each)          50 ps
Modulator driver latency                      108 ps
Through latency (2.5 ps/adjacent channel)     7.5 ps
Drop latency (20 ps/drop)                     60 ps
Waveguide latency (106.7 ps/cm)               427 ps
SM fiber latency (48.3 ps/cm)                 483 ps
Photodetector+TIA latency                     200 ps
Total latency                                 1.385 ns
• Total latency 14 bit times
  - Less than 4 clock cycles (with 2.5 GHz core clock)
• Almost equally split among fiber, waveguide, and Tx/Rx circuits

Electrical back-end
[Figure: transmitter and receiver back-end with per-block loads and energies; global clock (load + source) per link: 4 fF (2 fJ/bit) optical, 40 fF (20 fJ/bit); PLL 20 fJ/bit; 6-bit DC DAC; 5-10 GHz; local clock 16 fF (8 fJ/bit); further blocks labeled 20 fJ/bit, 10 fF (2.5 fJ/bit), 40 fJ/bit, 60 fF (15 fJ/bit), 10 fF (2.5 fJ/bit), 40 fF (20 fJ/bit)]
• Power < 250 fJ/bit
• Area 0.02% for 4000 links
  - ~200 cells (0.5 um x 0.2 um)
  - 20 um^2 per link

MIT Eos1 65 nm test chip
• Texas Instruments standard 65 nm bulk CMOS process
• First ever photonic chip in sub-100nm CMOS
• Automated photonic device layout
• Monolithic integration with electrical modulator drivers
• 2 mm x 2 mm die
  - 16 ring modulators
  - 8 modulator drivers
  - 84 ring filters
  - ~10 cm of waveguides
  - waveguide crossings
  - 8 MZ modulators
  - 12 photo detectors
[Figure: die photo with labels: two-ring filter, ring modulator, digital driver, vertical coupler grating, one-ring filter, photo detector, paperclips, waveguide crossings, M-Z test structures, 4 ring filter banks]

65 nm lithography great for rings!
[Figure: ring filter layout in Cadence with dimensions 286 nm, 266 nm, 300 nm, and 6.5 µm; poly-Si heater and poly-Si ring; SEM photos after dielectric etch]

Frequency response of 4-channel filter
• FSR is ~16 nm (2.7 THz)
• The scans are 1268 nm - 1276 nm at 0.05 nm intervals
• 120 GHz spacing
  - 5 nm ring radius step
• Sensitivity
  - 1 THz/nm poly H
  - 30 GHz/nm poly W
• First 100 GHz spaced bank in sub-100nm, first in bulk CMOS, first in poly-Si
  - Enables 60-120 wavelengths/waveguide (>200 Gb/s/um data rate density)
[Figure: filter bank photo with channels 1-4, input, cleave, and image window marked]

Conclusion
• Interconnects are very complex microcommunication systems
• Cross-layer design approach is needed to solve the on-chip and off-chip interconnect problem
• Equalized interconnects extend copper
  - 10x in energy-efficiency with proper circuit-wire co-design
• CNTs further improve the performance by 2x
  - With proper bundle resizing and ILD scaling
• Si Photonics improves the throughput by 10-20x
  - Unifies on-chip and off-chip interconnects
  - Requires a different network architecture (electrical for switching, optical for energy-efficient transport)

Acknowledgments
• Franz Kaertner, Rajeev Ram, Judy Hoyt, Krste Asanovic, Henry Smith, Erich Ippen, Martin Schmidt
• Jason Orcutt, Anatoly Khilo, Ben Moss, Charles Holzwarth, Hanqing Lee, Christopher Batten, Ajay Joshi, Fred Chen, Byungsub Kim
• DARPA UNIC program: PM Jag Shah
• Texas Instruments: Dennis Buss and Tom Bonifield
• FCRP Interconnect Focus Center
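
Worked check of the optical power budget
As a rough check of the optimized-design column in the optical power budget, the sketch below adds the listed losses to the receiver sensitivity and converts the result from dBm to mW. The dictionary keys, function names, and the choice to treat the off-chip waveguide loss as exactly 0 dB are assumptions made for illustration; only the loss values themselves come from the slides.

```python
# Loss entries (dB) from the optimized-design "Power loss" column of the budget table.
optimized_losses_db = {
    "coupler": 3.0,
    "splitter": 1.0,
    "non_linearity": 1.0,
    "through": 3.17,
    "modulator_insertion": 0.5,
    "crossing": 3.2,
    "on_chip_waveguide": 4.0,
    "off_chip_waveguide": 0.0,  # listed as ~0 dB; taken as 0 here (assumption)
    "drop": 3.0,
    "photodetector": 0.1,
}
RECEIVER_SENSITIVITY_DBM = -20.0


def required_power_dbm(losses_db, sensitivity_dbm):
    """Power needed per wavelength: receiver sensitivity plus every loss on the path."""
    return sensitivity_dbm + sum(losses_db.values())


def dbm_to_mw(power_dbm):
    """Convert dBm to milliwatts (0 dBm is 1 mW)."""
    return 10.0 ** (power_dbm / 10.0)


per_wavelength_dbm = required_power_dbm(optimized_losses_db, RECEIVER_SENSITIVITY_DBM)
per_wavelength_mw = dbm_to_mw(per_wavelength_dbm)
print(f"{per_wavelength_dbm:.2f} dBm = {per_wavelength_mw:.2f} mW per wavelength")
```

The losses sum to 18.97 dB, which gives the -1.03 dBm per wavelength shown in the table, or just under 0.79 mW. The source power then scales with the number of wavelengths the laser has to feed, which this sketch does not model.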