PPGEE ’08
Reliability in Nanometer Technologies – Problems and Solutions
Dr.-Ing. Frank Sill
Department of Electrical Engineering, Federal University of Minas Gerais, Av. Antônio Carlos 6627, CEP: 31270-010, Belo Horizonte (MG), Brazil
[email protected]
http://www.cpdee.ufmg.br/~frank/

Agenda
Motivation
Failures in Nanometer Technologies
Techniques to Increase Reliability
Shadow Transistors

Copyright Sill, 2008. PPGEE ’08, Reliability.

Motivation
Reliability is important for:
– Normal users
– Companies
– Medical applications
– Cars
– Air / space
– Environment
– …

Motivation
[Figure: transistor count in millions (Northwood 55, Prescott 125, Yonah 151, Wolfdale 410) and technology node (130 nm, 90 nm, 65 nm, 45 nm) versus year, 2002 to 2008]
The probability of failures increases due to:
– Increasing transistor count
– Shrinking technology

Dimensions
[Figure: length scales from centimeters down to a “65 nm” transistor. Sources: „Spektrum der Wissenschaften“; Intel]

Failures in Nanometer Technologies

Process Failures
Occur during the production phase
Caused by:
– Process variations
– Particles
– …
Source: Mak

Sub-wavelength Lithography
[Figure: lithography wavelength (365 nm, 248 nm, 193 nm, 13 nm EUV) and technology generation (180 nm, 130 nm, 90 nm, 65 nm, 45 nm, 32 nm) versus year, 1980 to 2020, with the gap between the two marked. Source: Mark Bohr, Intel]

Field-dependent Aberrations
CELL_A(X1, Y1) ≠ CELL_A(X0, Y0) ≠ CELL_A(X2, Y2)
[Figure: a big chip imaged through the lens onto the wafer plane, with cell A placed at (X1, Y1), (X0, Y0), and (X2, Y2)]
Center: minimal aberrations
Edge: high aberrations
Source: R. Pack, Cadence

Varying Line Width
[Figure: line width (nm) as a function of wafer X and wafer Y position. Source: Zhou, 2001]

Random Dopant Fluctuations
Cause Vth variations
[Figure: mean number of dopant atoms (10 to 10000) versus technology node (1000 nm to 32 nm), uniform and non-uniform. Source: Borkar, Intel]

Power Density
[Figure: power density (W/cm²) versus year, 1970 to 2010, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, P4, and Prescott, compared with a hot plate, a nuclear reactor, a rocket nozzle, and the Sun’s surface. Source: Moore, ISSCC 2003]

Temperature Variation
[Figure: power map and on-die temperature]
Power density is not uniformly distributed across the chip
Silicon is not a good heat conductor
Maximum junction temperature is determined by hot spots
Impact on packaging and cooling
Source: Borkar, Intel

Temperature Variation cont’d
[Figure: Power4 server chip. Source: Devgan, ICCAD’03]

Temperature Variation cont’d
[Figure: delay (s) and drain current IDS (pA) versus temperature (°C)]
Threshold voltage Vth changes with temperature
→ Drain-source current changes
→ Delay changes
Source: Burleson, UMASS, 2007

Supply Voltage Drop
[Figure. Source: Trester, 2005]

Failures Through Increasing Delay
[Figure: logic between two flip-flops (FF) driven by the clock (Clk)]
Data are processed before the clock phase is over
VDD↓, Temp.↑, ... → data processing takes longer than the clock phase (logic too slow!)
→ Wrong data in the next clock phase!
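
The timing failure described above can be illustrated with a small numerical sketch. It is not taken from the slides: it assumes the widely used alpha-power-law delay model (delay proportional to VDD / (VDD − Vth)^α), and every number in it (threshold voltage, α, nominal path delay, clock period) is hypothetical. Setup time and temperature effects are ignored.

```python
def relative_delay(vdd, vth, alpha=1.3):
    """Gate delay up to a constant factor (assumed alpha-power-law model)."""
    if vdd <= vth:
        raise ValueError("vdd must be larger than vth")
    return vdd / (vdd - vth) ** alpha


def path_delay(nominal_delay, vdd, vdd_nom=1.0, vth=0.3, alpha=1.3):
    """Scale a path delay known at vdd_nom to another supply voltage."""
    scale = relative_delay(vdd, vth, alpha) / relative_delay(vdd_nom, vth, alpha)
    return nominal_delay * scale


def meets_timing(delay, clock_period):
    """True if the logic settles before the clock phase is over."""
    return delay <= clock_period


if __name__ == "__main__":
    T_CLK = 1.0  # clock period in ns (hypothetical)
    D_NOM = 0.9  # path delay at nominal VDD in ns (hypothetical)
    for vdd in (1.0, 0.9, 0.85):
        d = path_delay(D_NOM, vdd)
        status = "ok" if meets_timing(d, T_CLK) else "wrong data in next clock phase"
        print(f"VDD = {vdd:.2f} V  delay = {d:.3f} ns  {status}")
```

Under these assumed values the path still meets timing at nominal supply but fails once the supply drops to 0.85 V, which is the situation the slide labels "logic too slow".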

Soft Errors
In the 1970s it was observed that DRAMs occasionally flip bits for no apparent reason
Ultimately linked to alpha particles and cosmic rays
Collisions with particles create electron-hole pairs in the substrate
These carriers are collected on dynamic nodes, disturbing the voltage
Source: Automotive 7-8, 2004

Soft Errors cont’d
The internal state of a node flips for a short time
This leads to wrong data unless the error is masked by:
– Logic: the wrong input does not lead to a wrong output
– Electrical: the pulse is attenuated by the following gates
– Timing: data based on the pulse reach the flip-flop after the clock transition

Electromigration
Transport of material caused by the gradual movement of ions in a conductor
One of the major failure mechanisms in interconnects
Lifetime (mean time to failure) is proportional to the width and thickness of the metal lines and inversely proportional to the current density
[Figure: top view of Metal 1 with a void, a whisker, and a hillock; cross-section view with Metal 1, thick oxide, and Metal 2]
Source: Plusquellic, UMBC

Electromigration cont’d
[Figures: void in a 0.45 µm Al-0.5%Cu line (Source: IMM-Bologna); whiskers in Sn (Source: EPA Centre); hillocks in ZnSn (Source: Ku & Lin, 2007)]

Time-Dependent Dielectric Breakdown (TDDB)
Tunneling currents wear out the gate oxide
→ Creation of a conducting path between gate and substrate, drain, or source
Depends on the electric field across the gate oxide, the temperature (exponentially), and the gate oxide thickness (exponentially)
Also: abrupt damage due to extreme overvoltage (e.g. electrostatic discharge)
Source: Pey & Tung

Variability Trends
[Figure: % variability (0 to 70) of Vdd, Vth, performance, power, and Lgate versus technology node, 90 nm to 28 nm. Source: Burleson, UMASS, 2007]

Variability Trends cont’d
[Figure: relative soft error rate (SER) per chip, logic and memory, versus technology, 180 nm to 16 nm. Source: Borkar, Intel]

Variability Trends cont’d
Frequency and sub-threshold leakage variations
[Figure: normalized frequency versus normalized leakage (Isub) for ~1000 samples at 130 nm; frequency varies by ~30 %, leakage power by ~5-10X. Source: Borkar, Intel]

Variability Trends cont’d
Increasing probability of gate-oxide breakdown
[Figure: reliability (Weibull slope β) versus gate oxide thickness (nm). Source: Kauerauf, EDL, 2002]
[Figure: current density Jox versus technology, 180 nm to 22 nm, annotated “high-k?”. Source: Borkar, Intel]

Future Designs
100 BT integration capacity (100 billion transistors)
Billions unusable (variations)
Some will fail over time
Intermittent failures
Source: Borkar, Intel

Approaches to Increase Reliability

Failure Measurement
Reliability R(t):
– Probability that a system performs as desired until time t
– Example: R(tx) = 0.8 → 80 % chance that the system is still running at time tx
Mean Time To Failure (MTTF):
– Average time that a system runs until it fails
Failure rate λ:
– Probability that the system fails in a given time interval
$R(t) = e^{-\lambda t}$
$\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt = \frac{1}{\lambda}$

Bathtub Failure Model
[Figure: failure rate versus time]
Infant mortality (1-40 weeks): declining failure rate, based on latent reliability defects
Normal lifetime (7-15 years): constant failure rate, based on TDDB, EM, hot electrons, …
Wearout period: increasing failure rate, based on TDDB, EM, etc.

Classification
Failures are either permanent or temporary
Permanent: defects, wearout, out-of-range parameters, EM, TDDB, ...
Temporary:
– Intermittent: process variations, infant mortality, random dopant fluctuation, ...
– Transient, radiation: soft errors
– Transient, non-radiation: power supply, coupling, operation peaks
Source: Mitra, 2007

The Whole System Counts!

Triple Module Redundancy (TMR)
[Figure: the input feeds logic L and two copies of logic L; their outputs A, B, and C feed a voter that drives the output]

Triple Module Redundancy: Voter
Hardware realization of a 1-bit majority voter: OUT = AB + AC + BC
Requires 2 gate delays
A B C OUT
1 1 0 1
0 0 1 0
0 1 0 0
0 1 1 1
: :

Triple Module Redundancy cont’d
[Figure: reliability (0 to 1.0) versus time for TMR and for simplex (only 1 module)]
Note: for a constant module failure rate
After a certain time, the reliability of the TMR system is lower than that of the simplex system
Why: after some time, the probability that 2 modules are wrong is higher than the probability that 2 modules are working!
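
The voter equation and the TMR crossover can be checked with a short sketch. It assumes a perfect voter and independent modules, each with reliability R(t) = e^(−λt). The system then works as long as at least two modules work, so R_TMR = R³ + 3R²(1 − R) = 3R² − 2R³, which drops below R once R < 0.5, that is, after t = ln 2 / λ. The function names and the value of λ are illustrative only.

```python
import math


def majority(a, b, c):
    """1-bit majority voter: OUT = AB + AC + BC."""
    return (a & b) | (a & c) | (b & c)


def r_simplex(t, lam):
    """Reliability of a single module with constant failure rate lam."""
    return math.exp(-lam * t)


def r_tmr(t, lam):
    """Reliability of TMR with a perfect voter: at least 2 of 3 modules work."""
    r = r_simplex(t, lam)
    return 3 * r**2 - 2 * r**3


def crossover_time(lam):
    """Time after which TMR is less reliable than simplex (R = 0.5)."""
    return math.log(2) / lam


if __name__ == "__main__":
    lam = 0.1  # failures per year (hypothetical)
    t_x = crossover_time(lam)
    print(f"crossover at t = {t_x:.2f} years")
    for t in (0.5 * t_x, t_x, 2 * t_x):
        print(f"t = {t:6.2f}  simplex = {r_simplex(t, lam):.3f}  TMR = {r_tmr(t, lam):.3f}")
```

Integrating R_TMR under the same assumptions gives an MTTF of 5/(6λ), which is shorter than the simplex MTTF of 1/λ: TMR improves reliability early in the mission, not the mean lifetime.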

Self Adaptive Design
Extend the idea of clock domains to adaptive power domains
Tackle static process variations and slowly varying timing variations
Control VDD, Vth (indirectly by body bias), and fclk by calibration at power-on
[Figure: a test module applies test inputs to the module, evaluates the responses, and sets fclk, VDD, and VBB]

Self Adaptive Design: Example
21 submodules per die
Applying 0.5 V forward and reverse body biasing (FBB/RBB), each in steps of 32 mV
[Figure: accepted dies per frequency bin without body bias (noBB), with adaptive body bias (ABB), and with within-die ABB. Source: Borkar, Intel]
For a given frequency and power density:
– 100 % yield with ABB
– 97 % of dies in the highest frequency bin with ABB for within-die variability

Razor Flip-Flop
For uncertainty- and variation-tolerant design
Razor methodology:
– Voltage-scaling methodology based on real-time detection and correction of circuit timing errors
– Use the actual hardware to check for errors
– Latch the input data twice: once on the clock edge, and then a little later
– If the data are not the same, you are going too fast
Source: Austin, Computer Magazine, 2004

Razor Flip-Flop cont’d
[Figure: Razor flip-flop between logic stage n and logic stage n+1, consisting of a MUX, the main flip-flop (CLK), a shadow latch (CLK_delayed), and a comparator that generates Error; timing diagram of CLK, CLK_delayed, D, Error, and Q for Instr 1 and Instr 2. Source: Austin, 2004]

Shadow Transistor Approach

TDDB Model
TDDB between gate and channel
Model: gate-to-channel resistance RGC; transistor width W split into W1 and W2 with W = W1 + W2
[Figure: for an inverter, 65nm-BPTM: Vout/VDD (0 % to 100 %) and relative delay (0 to 20) versus decreasing RGC (kΩ)]
Based on: Segura et al., “A Detailed Analysis of GOS Defects in MOS Transistors: Testing Implications at Circuit Level”, 1995.

TDDB Model cont’d
TDDB between gate and source/drain
Model: resistances RGD and RGS; transistor width W
[Figure: for an inverter, 65nm-BPTM: Vout/VDD (0 % to 100 %) versus decreasing resistance (kΩ)]
Based on: Segura et al., “A Detailed Analysis of GOS Defects in MOS Transistors: Testing Implications at Circuit Level”, 1995.

Shadow Transistors
1. Insertion of additional transistors in parallel to vulnerable transistors → shadow transistors (ST)
[Figure: for an inverter, 65nm-BPTM: relative delay (0 to 10) and Vout/VDD (0 % to 100 %) versus decreasing RGC (kΩ), with shadow transistors (w/ ST) and without (wo/ ST)]

Shadow Transistors cont’d
2. Application of H-Vt/To transistors with:
– Higher threshold voltage
– Thicker gate oxide
→ Less vulnerable to TDDB
MTTF (Mean Time To Failure):
$\mathrm{MTTF} \propto 10^{t_{ox}/0.22}$
$\frac{\mathrm{MTTF}_{\text{H-Vt/To}}}{\mathrm{MTTF}_{\text{L-Vt/To}}} = 10^{0.15/0.22} \approx 4.81$
Sources: Srinivasan, “RAMP: A Model for Reliability Aware Microprocessor Design”; Stathis, J., “Reliability Limits for the Gate Insulator in CMOS Technology”

Shadow Transistors cont’d
3. Selective insertion of shadow transistors in parallel to vulnerable transistors:
– Component reliability depends on activity, state, temperature, size, fabrication, …
– The most vulnerable components can be identified
New approach:
– Estimation of stress factors
– Determination of component reliability
– Adding redundancy only at the most vulnerable components (netlist modification)
Advantage: lower area, power, and delay penalty compared to complete redundancy or random insertion [Sri04]
Shadow transistors are only added in parallel to the most vulnerable devices.
Source: [Sri04] Sirisantana, D&T, 2004

Shadow Transistors cont’d
Advantages:
– Increased reliability with respect to TDDB
– H-Vt/To: reliability increases by ~5x (for Δtox = 0.15 nm)
– Remarkable increase of system lifetime
Drawbacks:
– Higher input capacitance → higher delay and dynamic power dissipation
– Area increase
Remarks:
– Only slight improvements for gate-drain/source breakdown
– H-Vt/To has to be supported by the technology

ST – Improvement MTTF
[Figure: improvement of MTTF with regard to TDDB (0 % to 20 %) for c17, c432, c499, c880, c1355, c1908, c2670, c3540, c5315, c6288, and c7552 with ≈ 23 % additional transistors, our algorithm versus random insertion; insertion of L-Vt/To shadow transistors]

ST – Improvement MTTF (H-Vt/To)
[Figure: improvement of MTTF with regard to TDDB (0 % to 250 %) for c432, c499, c880, c1355, c1908, c2670, c3540, c5315, c6288, and c7552, for SPth = 30 and SPth = 55; insertion of H-Vt/To shadow transistors]

Take Home Messages
Integrated circuits face several kinds of failures
Decreasing structure sizes create more failure sources
Future designs should (have to) be failure tolerant
Possible approaches:
– Triple Module Redundancy (TMR)
– Self-adapting designs
– Razor flip-flops
– Shadow transistors
There’s still a lot to do!

Thank you!
[email protected]