Prediction of High-Performance On-Chip Global Interconnection

Yulei Zhang (1), Xiang Hu (1), Alina Deutsch (2), A. Ege Engin (3), James F. Buckwalter (1), and Chung-Kuan Cheng (1)
(1) Dept. of ECE, UC San Diego, La Jolla, CA
(2) IBM T. J. Watson Research Center, Yorktown Heights, NY
(3) Dept. of ECE, San Diego State Univ., San Diego, CA

Outline
- Introduction
- On-Chip Global Interconnection
  - Overview: structures, tradeoffs
  - Interconnect schemes
  - Global wire modeling
  - Performance analysis
- Design Methodologies for T-line schemes
- Prediction of Performance Metrics
  - Technology trend
  - Current approaches
  - Experimental settings
  - Performance metrics comparison and scaling trend
    - Latency
    - Energy per bit
    - Throughput
  - Signal Integrity
- Conclusion

Introduction: Performance Impact
- Interconnect delay determines the system performance [ITRS08]
  - 542ps for 1mm minimum-pitch Cu global wire w/o repeater @ 45nm
  - ~150ps for 10 levels of FO4 delay @ 45nm
[Ho2001] "Future of Wire"

Introduction: Power Dissipation
- Interconnects consume a significant portion of power
  - 1-2 orders of magnitude larger than gates
  - Half of the dynamic power is dissipated on repeaters to minimize latency [Zhang07]
- Wires consume 50% of total dynamic power for a 0.13um microprocessor [Magen04]
  - About 1/3 is burned on the global wires.

Introduction: Different Approaches and Our Contributions
- Different approaches
  - Repeater insertion approach
    - Pros: High throughput density.
    - Cons: Overhead in terms of power consumption and wiring complexity.
  - T-line approach [Zhang09]
    - Pros: Low latency.
    - Cons: Low throughput density due to low bandwidth and large wire dimension.
  - Equalized T-line approach [Zhang08]
    - Pros: Low power, low noise, higher throughput than single-ended.
    - Cons: The area overhead brought by passive components.
- We explore different global interconnection structures and compare their performance metrics across multiple technology nodes.
- Contributions:
  - A simple linear model
  - A general design framework
  - A complete prediction and comparison

Organization of On-Chip Global Interconnections

Multi-Dimensional Design Consideration
- Preliminary analysis results assuming 65nm CMOS process.
- Application-oriented choice
  - Low latency: T-TL or UT-TL (single-ended T-lines)
  - High throughput: R-RC
  - Low power: PE-TL or UE-TL (differential T-lines)
  - Low noise: PE-TL or UE-TL (differential T-lines)
  - Low area/cost: R-RC
- For each architecture, the more area the pentagon covers, the better the overall performance.

On-Chip Global Interconnect Schemes (1)
- Repeated RC wires (R-RC)
  - R-RC structure
    - Repeater size / length of segments
    - Adopt previous design methodology [Zhang07]
- Un-Terminated and Terminated T-Line (UT-TL and T-TL)
  - UT-TL structure
    - Full swing at wire-end
    - Tapered inverter chain as TX
  - T-TL structure
    - Optimize eye-height at wire-end
    - Non-tapered inverter chain as TX

On-Chip Global Interconnect Schemes (2)
- Un-Equalized and Passive-Equalized T-Line (UE-TL and PE-TL)
  - Driver side: tapered differential driver
  - Receiver side: termination resistance, sense amplifier (SA) + inverter chain
  - Passive equalizer: parallel RC network
  - Design constraint: enough eye-opening (50mV) needed at the wire-end

Global Wire Modeling: Single-Ended & Differential On-Chip T-lines
- Orthogonal layers replaced by ground planes -> 2D cap extraction, accurate when loading density is high.
- Top-layer thick wires used -> dimensions are maintained as technology scales.
- LC-mode behavior dominant
- Determine the bit rate
  - Smallest wire dimensions that satisfy the eye constraint
- Notice PE-TL needs narrower wire -> equalization helps to increase density.
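
As a rough illustration of why the parallel RC passive equalizer in PE-TL helps, the sketch below treats the equalizer as a series element feeding a resistive termination and ignores the transmission line and driver resistance entirely. The topology, the function name, the 25 GHz evaluation point (assuming NRZ signaling at 50Gbps), and the reuse of the Rd, Cd and RL values quoted in the backup slides for the 50Gbps example are all assumptions made for illustration, not the authors' model.

```python
import math

def equalizer_gain(freq_hz, r_d, c_d, r_load):
    """Return |V_load / V_in| for a parallel R-C placed in series with r_load.

    Assumed (hypothetical) topology: Z_eq = r_d || 1/(j*w*c_d) in series
    with a purely resistive load. The T-line itself is not modeled.
    """
    omega = 2.0 * math.pi * freq_hz
    z_eq = r_d / (1.0 + 1j * omega * r_d * c_d)  # parallel R-C impedance
    return abs(r_load / (r_load + z_eq))

# Component values quoted for the 50Gbps example in the backup slides.
R_D = 350.0       # ohm
C_D = 0.38e-12    # farad
R_LOAD = 107.69   # ohm

dc_gain = equalizer_gain(0.0, R_D, C_D, R_LOAD)
high_freq_gain = equalizer_gain(25e9, R_D, C_D, R_LOAD)  # assumed NRZ Nyquist point
zero_freq_hz = 1.0 / (2.0 * math.pi * R_D * C_D)         # where the boost begins

print(f"DC gain: {dc_gain:.3f}")
print(f"Gain at 25 GHz: {high_freq_gain:.3f}")
print(f"Equalizer zero: {zero_freq_hz / 1e9:.2f} GHz")
```

In this simplified picture the network attenuates low frequencies more than high ones, which is the usual way a passive equalizer offsets the high-frequency loss of a wire.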
Global Wire Modeling: RC wires and T-lines
- RC wire modeling
  - Distributed Π model composed of wire resistance and capacitance
  - Closed-form equations [Sim03] to calculate 2D wire capacitance
- T-line 2D-R(f)L(f)C parameter extraction
  - Figures: 2D-C extraction template; 2D-R(f)L(f) extraction template
- T-line modeling
  - R(f)L(f)C tabular model -> transient simulation to estimate eye-height.
  - Synthesized compact circuit model [Kopcsay02] -> study signal integrity issue.

Performance Analysis: Definitions
- Normalized delay (unit: ps/mm)
  - Propagation delay includes wire delay and gate delay.
- Normalized energy per bit (unit: pJ/m)
  - Bit rate is assumed to be the inverse of propagation delay for RC wires
- Normalized throughput (unit: Gbps/um)

Performance Analysis: Latency
- Variables: technology-defined parameters
  - Supply voltage: Vdd (unit: V)
  - Dielectric constant: ε_r
  - Min-sized inverter FO4 delay: τ (unit: ps)
- R-RC structure (min-d)
  - r0 is roughly constant
  - c_nmos ∝ 1/S, r_w ∝ S^2, c_w ∝ ε_r
  - FO4 delay scales w/ scaling factor S: τ ∝ 1/S
  - Increasing w/ technology scaling!
- T-line structures
  - Sum of wire delay and TX delay
  - Wire delay set by ε_r
  - TX delay improves w/ FO4 delay
  - Decreasing w/ technology scaling!

Performance Analysis: Energy per Bit
- Same variables as defined before
- R-RC structure (min-d)
  - Energy decreases w/ technology scaling!
- T-line structures
  - Sum of power consumed on wire and TX.
  - Power of T-line ∝ Vdd^2
  - Power of TX circuit ∝ f·C·Vdd^2
  - Energy decreases w/ larger slope!!
- Annotations on the model terms
  - FO4 delay reduces exponentially
  - Constant!
  - Vdd reduces as technology scales
  - ε_r reduces as technology scales

Performance Analysis: Throughput
- Same variables as defined before
- R-RC structure (min-d)
  - Assuming wire pitch ∝ 1/S
  - FO4 delay reduces exponentially
  - Throughput increases by 20% per generation!
- T-line structures
  - TX bandwidth ∝ 1/τ
  - Neglect the minor change of wire pitch
  - K1 = 0 for UT-TL
  - FO4 delay reduces exponentially
  - Throughput increases by 43% per generation!!
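
The per-generation throughput growth quoted above compounds across nodes. A minimal sketch, assuming the 20% (R-RC) and 43% (T-line) figures apply uniformly to each of the four steps from 90nm to 22nm; the function name and the normalization to a starting value of 1.0 are illustrative choices, not part of the original model:

```python
def project(start_value, growth_per_generation, generations):
    """Compound a fixed fractional growth rate over a number of generations."""
    return start_value * (1.0 + growth_per_generation) ** generations

NODES_NM = [90, 65, 45, 32, 22]
GENERATIONS = len(NODES_NM) - 1  # four node-to-node steps

# Relative throughput at 22nm, normalized to 1.0 at 90nm (assumed uniform growth).
rrc_gain = project(1.0, 0.20, GENERATIONS)
tline_gain = project(1.0, 0.43, GENERATIONS)

print(f"R-RC throughput gain 90nm -> 22nm: {rrc_gain:.2f}x")
print(f"T-line throughput gain 90nm -> 22nm: {tline_gain:.2f}x")
```

Under these assumptions the T-line schemes roughly quadruple their normalized throughput over the range while R-RC roughly doubles, which is why a scheme that starts behind can overtake at later nodes.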
Design Framework for On-Chip T-line Schemes
- Proposed framework can be applied to design UT-TL/T-TL/UE-TL/PE-TL by changing wire configuration and circuit structure.
- Different optimization routines (LP/ILP/SQP, etc.) can be adopted according to the problem formulation.

Experimental Settings
- Design objective: min-d
- Technology nodes: 90nm-22nm
- Five different global interconnection structures
- Wire length: 5mm
- Parameter extraction
  - 2D field solver CZ2D from EIP tool suite of IBM
  - Tabular model or synthesized model
- Transistor models
  - Predictive transistor model from [Uemura06]
  - Synopsys level 3 MOSFET model tuned according to ITRS roadmap
- Simulation
  - HSPICE 2005
- Modeling and optimization
  - Linear or non-linear regression/SQP routine
  - MATLAB 2007

Performance Metric: Normalized Delay, Results and Comparison
- Technology trends
  - R-RC ↑
  - T-line schemes ↓
- T-line structures
  - Outperform R-RC beyond 90nm
  - Single-ended: lowest delay
- At 22nm node
  - R-RC: 55ps/mm
  - T-lines: 8ps/mm (85% reduction)
  - Speed of light: 5ps/mm
- Linear model
  - < 6% average percent error

Performance Metric: Normalized Energy per Bit, Results and Comparison
- Technology trends
  - R-RC and T-lines ↓
  - T-lines reduce more quickly
- T-line structures
  - Outperform R-RC beyond 45nm
  - Differential: lowest energy.
  - Single-ended similar to R-RC.
  - T-TL > UT-TL
- At 22nm node
  - R-RC: 100pJ/m
  - Single-ended: 60% reduction
  - Differential: 96% reduction
- Linear model
  - < 12% average percent error
  - Error for T-TL and PE-TL due to RL and passive equalizers.
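
The 22nm figures above can be cross-checked with simple percent-reduction arithmetic. A small sketch; the helper names are illustrative only, and the inputs are the values quoted on the two results slides:

```python
def percent_reduction(baseline, value):
    """Percentage by which value is lower than baseline."""
    return 100.0 * (baseline - value) / baseline

def apply_reduction(baseline, reduction_percent):
    """Value obtained after reducing baseline by reduction_percent."""
    return baseline * (1.0 - reduction_percent / 100.0)

# Normalized delay at 22nm: R-RC 55 ps/mm vs T-lines 8 ps/mm.
delay_reduction = percent_reduction(55.0, 8.0)

# Normalized energy at 22nm: R-RC 100 pJ/m, with the quoted reductions.
single_ended_energy = apply_reduction(100.0, 60.0)
differential_energy = apply_reduction(100.0, 96.0)

print(f"Delay reduction: {delay_reduction:.1f}%")
print(f"Single-ended energy: {single_ended_energy:.0f} pJ/m")
print(f"Differential energy: {differential_energy:.0f} pJ/m")
```

The delay result rounds to the 85% quoted on the slide, and the energy reductions correspond to roughly 40 pJ/m for single-ended and 4 pJ/m for differential T-lines.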
Performance Metric: Normalized Throughput, Results and Comparison
- Technology trends
  - R-RC and T-lines ↑
  - T-lines increase more quickly
- T-line structures
  - Outperform R-RC beyond 32nm
  - Differential better than single-ended
- At 22nm node
  - R-RC: 12Gbps/um
  - T-TL: 30% improvement
  - UE-TL: 75% improvement
  - PE-TL: ~2X of R-RC
- Linear model
  - < 7% average percent error

Signal Integrity: Single-Ended T-lines
- Figure: worst-case switching pattern for peak noise simulation (panels: using w.c. pattern; using single or multiple PRBS patterns)
- UT-TL structure
  - 380mV peak noise at 1V supply voltage w/ 7ps rise time
  - SI could be a big issue as supply voltage drops
- T-TL less sensitive to noise
  - At the same rise time, ~50% reduction of peak noise
  - Peak noise ↓ as technology scales

Signal Integrity: Differential T-lines
- Figure: worst-case switching pattern for peak noise simulation
- More reliable
  - Termination resistance
  - Common-mode noise reduction
- Peak noise
  - Within ~10mV range
- Eye-heights
  - UE-TL
    - Eye reduces as bit rate ↑
    - Harder to meet constraint.
  - PE-TL
    - > 70mV eye even at 22nm node
    - Equalization does help!

Conclusion
- Compare five different global interconnections in terms of latency, energy per bit, throughput and signal integrity from 90nm to 22nm.
- A simple linear model is provided to link
  - Architecture-level performance metrics
  - Technology-defined parameters
- Some observations from experimental results
  - T-line structures have the potential to replace R-RC at future nodes
  - Differential T-lines are better than single-ended
    - Low-power/High-throughput/Low-noise
  - Equalization could be utilized for on-chip global interconnection
    - Higher throughput density, improved signal integrity
    - Even w/ lower energy dissipation (passive equalization)

Thank you! Q&A

Back Up Slides

Introduction: Technology Trend
- On-chip interconnect scaling
  - Dimension shrinks
  - Wire resistance increases -> RC delay
  - Increasing capacitive coupling -> delay, power, noise, etc.
- Performance of global wires decreases w/ technology scaling.

Wire category   Parameter       90nm     45nm     22nm
M1 wire         Rw (kohm/mm)    1.914    8.860    34.827
M1 wire         Cw (pF/mm)      0.183    0.157    0.129
Global wire     Rw (kohm/mm)    0.532    2.970    11.000
Global wire     Cw (pF/mm)      0.205    0.179    0.151

Table: Scaling trend of PUL wire resistance and capacitance
Figure: Copper resistivity versus wire width

Design Methodology: Single-Ended T-lines
- 2D frequency-dependent tabular model
- Single-ended; inverter chains
- Inverter size, number of stages, Rload (if any)
- SPICE simulation to evaluate
- SPICE simulation to check in-plane crosstalk, etc.
- Optimization routine:
  1. Optimal cycle time
  2. Sweep for optimal inverter chain

Design Methodology: Differential T-lines
- 2D frequency-dependent tabular model
- Differential lines; SA-based TX
- Wire width; driver impedance; RC equalizer (if any); termination resistance
- Closed-form equation-based model
- Evaluation based on models
- SPICE simulation to check in-plane crosstalk, etc.
- Optimization routine:
  1. Binary search for wire width
  2. SQP for other variable optimization

Effects of Driver Impedance and Termination Resistance
- Lowering driver impedance improves eye
- Eye reduces as frequency goes up
- Optimal termination resistance

Effects of Driver Impedance and Termination Resistance on Step Response
- Figure label: optimal Rload
- Larger driver impedance leads to slower rise edge and lower saturation voltage
- Larger termination resistance causes sharper rise edge but with larger reflection

Crosstalk Effects
- Three different PRBS input patterns, min-ddp solutions
- T-line Scheme A: delay increased by 9.6%, power increased by 37%
- T-line Scheme B: delay increased by 2%, power increased by 25.7%
- Figure labels: @Wire output 820mV; @Wire output 750mV; @Inverter chain output 3.6ps; @Inverter chain output 6.9ps

Transceiver Design
- Sense amplifier (SA)
  - Double-tail latch-type [Schinkel 07]
  - Optimize sizing to minimize SA delay
  - Figure: double-tail latch-type voltage sense amp.
- Inverter chain
  - Number of stages: fixed to 6
  - Sizing of each inverter
    - RS: output resistance of inverter chain
    - Sweep the 1st inverter size to minimize the total transceiver delay for given [Veye, RS]
- SA transistor sizes @45nm tech node
  - M1/M3: 45nm/45nm
  - M2/M4: 250nm/45nm
  - M5/M6: 180nm/45nm
  - M7/M8: 280nm/45nm
  - M9: 495nm/45nm
  - M10/M11: 200nm/45nm
  - M12: 1.58um/45nm

Transceiver Modeling
- Driver side
  - Voltage source Vs with output resistance Rs
  - Vs: full-swing pulse signal with rise time Tr = 0.1Tc
  - Rs: output resistance of the last inverter in the chain.
- Receiver side
  - Extract look-up table for TX delay and power
  - Fit the table using non-linear closed-form formula
  - The relative error is within 2% for fitting models
- Figures: transceiver delay map at 45nm node; histogram of fitting errors at 45nm node; transceiver power map at 45nm node

Bit-rate: 50Gbps
Rs=11.06ohm, Rd=350ohm, Cd=0.38pF, RL=107.69ohm

Conclusion (cont'd)

Low-Latency Application (ps/mm)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     3/35    1/42    1/46    1/55    1/55
UT-TL    5/15    5/13    5/10    5/9     5/8
T-TL     5/15    5/13    5/10    5/9     5/8
UE-TL    1/37    3/25    3/16    3/12    5/8
PE-TL    1/37    3/25    3/16    3/12    5/8

Low-Energy Application (pJ/m)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     2/150   2/140   1/130   1/100   1/100
UT-TL    3/140   3/110   3/70    3/50    2/40
T-TL     1/260   1/200   2/100   2/60    3/40
UE-TL    4/60    4/36    4/20    4/10    5/4
PE-TL    5/26    5/16    5/8     5/5     5/2

High-Throughput Application (Gbps/um)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     5/5     5/6     3/8     3/10    2/12
UT-TL    2/3.3   1/3.3   1/3.3   1/3.3   1/3.3
T-TL     1/3     2/3.4   2/6     2/9     3/16
UE-TL    3/3     3/5     4/9     4/13    4/21
PE-TL    4/4     4/5.3   5/9     5/15    5/24

Low-Noise Application (score only)
Scheme   90nm    65nm    45nm    32nm    22nm
R-RC     1       1       1       1       1
UT-TL    1       1       1       1       1
T-TL     3       3       3       3       3
UE-TL    5       5       4       4       4
PE-TL    4       4       5       5       5

Item in the tables: score/value. Score: the higher, the better in terms of the given metric; max. score is 5. In the original slides, the best structure in each column is marked in red.

Future Works
- Explore novel global signaling schemes for high throughput and low energy dissipation.
  - Design, optimize > 50Gbps on-chip interconnection schemes
- Architecture-level study to identify trade-offs
  - Wire configuration
    - Dimension optimization, ground plane, etc.
  - Un-interrupted architectures
    - Equalization implementation, TX/RX choice
  - Distributed architectures
    - Active or passive compensation (RC equalizers, other networks, etc.)
- Novel high-speed transceiver circuitry design
- Develop analysis and optimization capability to aid co-design and co-optimization of wire and transceiver circuit
- Fabrication to verify analysis and demonstrate feasibility

Related Publications

[Repeated RC Wire]
1. L. Zhang, H. Chen, B. Yao, K. Hamilton, and C. K. Cheng, "Repeated on-chip interconnect analysis and evaluation of delay, power and bandwidth metrics under different design goals," IEEE International Symposium on Quality Electronic Design, 2007, pp. 251-256.

[Un-Terminated/Terminated T-Line]
2. Y. Zhang, L. Zhang, A. Deutsch, G. A. Katopis, D. M. Dreps, J. F. Buckwalter, E. S. Kuh, and C. K. Cheng, "Design Methodology of High Performance On-Chip Global Interconnect Using Terminated Transmission-Line," IEEE International Symposium on Quality Electronic Design, 2009, pp. 451-458.

[Passive-Equalized T-Line]
3. Y. Zhang, L. Zhang, A. Tsuchiya, M. Hashimoto, and C. K. Cheng, "On-chip high performance signaling using passive compensation," IEEE International Conference on Computer Design, 2008, pp. 182-187.
4. Y. Zhang, L. Zhang, A. Deutsch, G. A. Katopis, D. M. Dreps, J. F. Buckwalter, E. S. Kuh, and C. K. Cheng, "On-chip bus signaling using passive compensation," IEEE Electrical Performance of Electronic Packaging, 2008, pp. 33-36.
5. L. Zhang, Y. Zhang, A. Tsuchiya, M. Hashimoto, E. Kuh, and C. K. Cheng, "High performance on-chip differential signaling using passive compensation for global communication," Asia and South Pacific Design Automation Conference, 2009, pp. 385-390.

[Overview and Comparison]
6. Y. Zhang, X. Hu, A. Deutsch, A. E. Engin, J. F. Buckwalter, and C. K. Cheng, "Prediction of High-Performance On-Chip Global Interconnection," ACM Workshop on System Level Interconnection Prediction, 2009.