Chapter 8
Coping with Physical Failures, Soft Errors, and Reliability Issues
EE141 System-on-Chip Test Architectures, Ch. 8 – Physical Failures

What is this chapter about?
  Gives an overview of, and promising solutions to, the causes of manufacturing defects and soft errors
  Focus on:
    Signal integrity
    Defect-based tests
    Process sensors and adaptive design
    Soft errors
      BISER
      Circuit-level approaches
    Defect and error tolerance

Coping with Physical Failures, Soft Errors, and Reliability Issues
  Introduction
  Signal Integrity
  Manufacturing Defects, Process Variations, and Reliability
  Soft Errors
  Defect and Error Tolerance
  Concluding Remarks

Introduction
  Defects
    Random defects – caused by manufacturing imperfections; occur in random places
    Systematic defects – caused by process or manufacturing variations
  Defect level (DL) is a function of process yield (Y) and fault coverage (FC):
    $DL = 1 - Y^{(1 - FC)}$

Concept of Signal Integrity
  Signal integrity is the ability of a signal to generate correct responses in a circuit.
  A signal with good integrity stays within safe margins for its voltage amplitude and transition time.

Basic Concept of Integrity Loss
  Integrity loss (IL): any portion of a signal that exceeds the amplitude-safe and time-safe margins.
    $IL = \sum_i \int_{b_i}^{e_i} |V_i - f(t)| \, dt$
  where $V_i$ is one of the acceptable amplitude levels and $[b_i, e_i]$ is a time frame during which integrity loss occurs.

Sources of Integrity Loss
  Interconnects
  Power supply noise
  Process variations

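As a rough illustration of the defect-level relation on the Introduction slide, the following Python sketch evaluates $DL = 1 - Y^{(1 - FC)}$ for a few fault coverages. The function name `defect_level`, the 50% yield, and the conversion to defective parts per million (DPPM) are assumptions chosen for illustration, not values from the slides.

```python
def defect_level(process_yield: float, fault_coverage: float) -> float:
    """Return DL = 1 - Y**(1 - FC), with Y and FC given as fractions."""
    if not 0.0 < process_yield <= 1.0:
        raise ValueError("process_yield must be in (0, 1]")
    if not 0.0 <= fault_coverage <= 1.0:
        raise ValueError("fault_coverage must be in [0, 1]")
    return 1.0 - process_yield ** (1.0 - fault_coverage)


if __name__ == "__main__":
    # Hypothetical process with 50% yield; sweep the fault coverage.
    for fc in (0.90, 0.99, 1.00):
        dl = defect_level(0.50, fc)
        print(f"Y=0.50  FC={fc:.2f}  DL={dl * 1e6:.0f} DPPM")
```

With full fault coverage the exponent is zero and the defect level vanishes; with no coverage the defect level equals the fraction of defective parts, 1 - Y.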
Integrity Loss Sensors/Monitors (1)
  Current Sensor
    Current sensors are often used to detect the completion of asynchronous circuits.

Integrity Loss Sensors/Monitors (2)
  Power Supply Noise Sensor
    The voltage $V_x$ depends on the power/ground bounces: the higher the power supply noise (PSN), the longer the propagation and the higher the voltage $V_x$.

Integrity Loss Sensors/Monitors (3)
  Noise Detector (ND) Sensor
    The ND sensor is designed to detect integrity loss due to voltage violations.

Integrity Loss Sensors/Monitors (4)
  Integrity Loss Sensor (ILS)
    The integrity loss sensor is a delay violation sensor.

Integrity Loss Sensors/Monitors (5)
  Jitter Monitor
    Jitter is often defined as the time deviation of a signal from its ideal location in time.

Integrity Loss Sensors/Monitors (6)
  A ring oscillator can work as a process variation sensor.
    The variation of delay caused by PV-faults in any of the inverters in the loop results in a deviation in the frequency of the oscillator, which can be detected.
    $f_{RO} = \frac{1}{N_{inv} T_{inv}}$
    $f_{RO} = \frac{1}{N_{inv} \, \dfrac{V_{dd} C_{Load}}{\left( \frac{\mu \varepsilon W}{2 T_{ox} L_{eff}} \right) (V_{GS} - V_t)^2 (1 + K V_{DS})}}$
    where $N_{inv}$ is an odd number of inverters and $T_{inv}$ is the delay of one inverter.

Readout Architectures (1)
  BIST-Based Architecture
    [Figure: BIST architecture and readout circuitry]
    When a noise or delay violation occurs (flag = 1), the contents of all scan cells are scanned out through Sout for further reliability and diagnosis analysis.

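The ring-oscillator relation $f_{RO} = 1/(N_{inv} T_{inv})$ from the process variation sensor slide can be sketched in Python as follows. As an assumption made here to model a PV-fault in a single inverter, the sketch sums per-stage delays, which reduces to $N_{inv} T_{inv}$ when all stages are identical. The function names, the 20 ps stage delay, and the 5% tolerance are hypothetical.

```python
from typing import Sequence


def ring_oscillator_frequency(stage_delays: Sequence[float]) -> float:
    """Frequency of an inverter ring, f_RO = 1 / (sum of stage delays).

    With identical stages this equals 1 / (N_inv * T_inv).
    """
    if len(stage_delays) % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of inverters")
    if any(delay <= 0 for delay in stage_delays):
        raise ValueError("stage delays must be positive")
    return 1.0 / sum(stage_delays)


def relative_shift(nominal: Sequence[float], measured: Sequence[float]) -> float:
    """Relative frequency deviation of the measured ring from the nominal one."""
    f_nom = ring_oscillator_frequency(nominal)
    f_meas = ring_oscillator_frequency(measured)
    return (f_meas - f_nom) / f_nom


def has_pv_fault(shift: float, tolerance: float = 0.05) -> bool:
    """Flag a process-variation fault when the shift exceeds the tolerance."""
    return abs(shift) > tolerance


if __name__ == "__main__":
    nominal = [20e-12] * 5            # five inverters, 20 ps each (assumed)
    slow = [20e-12] * 4 + [30e-12]    # one inverter slowed by a PV-fault
    shift = relative_shift(nominal, slow)
    print(f"relative shift = {shift:+.3f}, PV-fault flagged: {has_pv_fault(shift)}")
```

In hardware the frequency would be read out by a counter rather than computed; the comparison against a tolerance band is the part this sketch is meant to show.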
Readout Architectures (2)
  Scan-Based Architecture
    At the driving side of an interconnect, a pattern generation boundary-scan cell (PGBSC) is used to generate test patterns.
    At the receiving side of the interconnect, an observation boundary-scan cell (OBSC) is used to detect integrity loss.

Readout Architectures (3)
  Basic Concept of PV-Test Architecture
    On-chip ring oscillators (ROs) with counters, embedded in a test chip, are used to detect process variation by measuring the ROs' frequency shifts.

Manufacturing Defects, Process Variations, and Reliability
  100% single stuck-at fault coverage cannot guarantee perfect product quality, because there are remaining defects that are:
    Timing-dependent
    Sequence-dependent
    Attributed to timing-dependent, non-single-stuck-at faults

A Defect-Based Test Architecture
  [Figure: defect-based test architecture. Block labels: RTL, Synthesis, Library, Layout, RC Extraction, Modeling, Timing Analysis, Path Extractor, Critical Path List, Defect-Based Fault Enumeration, Physical Faults, Fault Mapping, Logical Fault List, Gate-level Netlist, Fault List, ATPG, Defect-Based ATPG, Defect-Based Fault Simulator, Structural Tests, Functional Tests, Defect-Based Tests]

Defect-Based Tests
  Small delay defect tests
  Bridge defect tests
  N-detect tests
  IDDQ tests
  MinVDD tests
  VLV tests

Reliability Stress
  Concept of infant mortality
  Methods to screen infant mortality
    Method I – Burn-in
      $ttf = C \, e^{E_A / (kT)}$
      where $ttf$ is the time to failure, $C$ is a constant, $E_A$ is the activation energy (eV), $k$ is Boltzmann's constant, and $T$ is the absolute temperature.
    Method II – Elevated voltage stress

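The burn-in relation $ttf = C \, e^{E_A/(kT)}$ implies that raising the temperature shortens the time to failure, which is why burn-in exposes infant mortality quickly. The Python sketch below computes the ratio of the time to failure at a use temperature to that at a stress temperature. The function names, the 0.7 eV activation energy, and the two temperatures are assumptions for illustration only.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K


def time_to_failure(c: float, activation_energy_ev: float, temperature_k: float) -> float:
    """ttf = C * exp(E_A / (k * T)), with T as an absolute temperature in kelvin."""
    if temperature_k <= 0:
        raise ValueError("temperature must be an absolute temperature above 0 K")
    return c * math.exp(activation_energy_ev / (BOLTZMANN_EV_PER_K * temperature_k))


def acceleration_factor(activation_energy_ev: float, t_use_k: float, t_stress_k: float) -> float:
    """Ratio ttf(T_use) / ttf(T_stress); the constant C cancels out."""
    return (time_to_failure(1.0, activation_energy_ev, t_use_k)
            / time_to_failure(1.0, activation_energy_ev, t_stress_k))


if __name__ == "__main__":
    # Assumed values: E_A = 0.7 eV, use at 55 C, burn-in at 125 C.
    af = acceleration_factor(0.7, 55.0 + 273.15, 125.0 + 273.15)
    print(f"burn-in acceleration factor: {af:.1f}")
```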
Redundancy and Memory Repair
  Redundancy: spare rows, columns, or blocks
  Repair schemes:
    Pellston Technology [Wuu 2005]: if repeated errors are detected, disable the cache line (set the "not to use" bit)
    Perform memory BIST at the new operating conditions; exclude failing cells and resize the cache (the cache can grow or shrink, depending on whether the new conditions are more favourable or worse)

Process Sensors and Adaptive Design
  Compared with traditional test structures placed on the scribe lines, additional process sensors are embedded on-chip.
  On-chip process sensors:
    Process variation sensor
    Thermal sensor
    Dynamic voltage scaling

Process Variation Sensor
  Ring oscillators: many factors can affect the frequency of a ring oscillator, such as process variation, temperature, and voltage.
  Analog process variation sensor: the analog circuit is sensitive to different process parameters.
  Neither can report the process variation at a specific spot on the die, and neither is likely to allow the data to be extracted and analyzed in real time.

Thermal Sensor
  On-chip thermal sensors are the last defence against a system crash or permanent damage to the chip.
  [Figure 8.14: Thermal sensor example]

Dynamic Voltage Scaling
  [Figure 8.15: Dynamic voltage scaling scheme. Frequency (between f_MIN and f_MAX) and Vcc (between VIDmin and VIDnom) are plotted against time after a DVS request for a frequency change. Transitions 1 and 3 are in the range of 100s of ps; transitions 2 and 4 are in the range of 100s of μs.]

Dynamic Voltage Scaling (cont'd)
  Use sleep transistors and dynamic biasing to save power
  Use the adaptive test method for smart binning

Soft Errors
  Introduction
  Sources of Soft Errors and SER Trends
  Coping with Soft Errors

Introduction
  Soft errors
    Soft errors are transient single-event upsets (SEUs) caused by various types of radiation.
    Cosmic radiation is the major source of soft errors, especially in memories.
    Terrestrial radiation is another source of soft errors.

Sources of Soft Errors and SER Trends
  If a glitch is induced at the junction (red label) in a memory element, its state can be reversed.
  [Figure 8.16: Induced soft error on an SRAM cell]

Sources of Soft Errors and SER Trends
  Logic circuits are less susceptible to these glitches than memories, for the following reasons:
    The glitch must be of sufficient strength to propagate from the location of the strike.
    The glitch needs a functionally sensitized path to be latched.
    The glitch must arrive at a latch during its latching window.
  [Figure 8.18: Masking factors of soft errors in combinational logic]

Coping with Soft Errors
  As chips are susceptible to soft errors, many soft error protection schemes targeting chip designs have been proposed:
    Fault tolerance
    Error-resilient microarchitectures
    Soft error mitigation

Fault Tolerance
  Removing the source of soft errors to improve the reliability of a chip.
  Three fundamental fault tolerance schemes:
    Hardware (spatial) redundancy – assumes that defects and radiation particles will hit only one specific device and not another device
    Time (temporal) redundancy – assumes that a radiation strike will not happen on the same circuitry again at a slightly later time
    Information redundancy – uses an error-detecting code or error-correcting code to represent information contents

Fault Tolerance
  Common fault tolerance schemes used in high-reliability systems:
    Duplicate and compare – used in mainframes and high-end servers
    Triple modular redundancy – used for systems that cannot fail
    Redundant multithreading – runs redundant copies of a thread and compares their results

Error-Resilient Microarchitectures
  Two representative error-resilient processor microarchitectures:
    DIVA
    Razor
  DIVA
    Dynamic Implementation Verification Architecture (DIVA)
    DIVA checker – a smaller and simpler shadow processor; contains a functional checker stage (CHK), a commit stage (CT), and a watchdog timer (WT)
    DIVA core – the main processor, which fetches, decodes, and executes instructions, holding their speculative results in the reorder buffer (ROB)

Error-Resilient Microarchitectures
  Razor
    Dynamic voltage scaling (DVS) is one of the most effective and widely used methods for power-aware computing.
    The key idea of Razor is to tune the supply voltage by monitoring errors during circuit operation; this is accomplished with a shadow unit, but the shadow unit has been pushed all the way down into a Razor flip-flop. This Razor flip-flop is shown in Figure 8.21a.

Error-Resilient Microarchitectures
  [Figure 8.21(a): Schematic of the Razor flip-flop. Labels: logic stages L1 and L2, main flip-flop, shadow latch, comparator; signals clk, clk_del, D1, Q1, Error, Error_L]

Error-Resilient Microarchitectures
  Razor
    A reduced-overhead Razor flip-flop with a metastability detection circuit is illustrated in Figure 8.21b.
  [Figure 8.21(b): Reduced-overhead Razor flip-flop with metastability detection circuit. Labels: shadow latch, metastability detector (Inv_n, Inv_p); signals clk, clk_b, clk_del, clk_del_b, D, Q, Error_L]

Soft Error Mitigation
  Soft error mitigation techniques give a design partial immunity to potential soft errors at a significantly lower cost than fault tolerance schemes.
  There are three soft error mitigation methods:
  (1) Built-In Soft-Error Resilience (BISER)
    BISER, proposed in [Mitra 2005], allows the scan design to protect a device from soft errors during normal operation.

Soft Error Mitigation
  Figure 8.22 shows the BISER scan cell design, which reduces the impact of soft errors affecting storage elements by more than 20 times.
  [Figure 8.22: Built-in soft-error resilience (BISER) scan cell. Labels: scan portion, system flip-flop, C-element, keeper; latches LA, LB, PH1, PH2; signals SI, SO, SCA, SCB, CAPTURE, UPDATE, D, CLK, TEST, Q, O1, O2]

Soft Error Mitigation
  Circuit-level approaches
  (2) Gate resizing for soft error mitigation [Zhou 2006] is based on physical-level design modifications.
    Figure 8.23 illustrates the effect of gate resizing on the amplitude and width of a 0-to-1 transient at the output of a gate.
  [Figure 8.23: Effect of gate resizing on the amplitude/width of SETs [Zhou 2006]]

Soft Error Mitigation
  Circuit-level approaches
  (3) Netlist transformation for soft error mitigation [Almukhaizim 2006] is based on logic-level design modifications.
  [Figure 8.24: Example of rewiring to reduce the soft error failure rate]

Defect and Error Tolerance
  Defect tolerance
    Insert redundant circuitry in a circuit under test.
    The circuit can continue correct operation in the presence of defects.
  Error tolerance
    Allow the circuit to continue acceptable operation in the presence of errors.

Random Spot Defects
  Assume a design consists of N submodules.
  Each submodule has n unique positions where a defect would cause it to fail its tests.
  D defects are uniformly distributed over the submodules.
  The number of defects in any submodule is independent of the number of defects in other submodules.

Defect Probability
  The probability that an arbitrary position on a submodule is associated with a defect is:
    $p = D / (nN)$
  The probability of having d defects in a given submodule is:
    $P(d) = C(n, d) \, p^d (1 - p)^{n - d}$, where $C(n, d) = n! / (d! \, (n - d)!)$

Poisson Distribution
  Since d is binomially distributed, the average number of defects in an arbitrary submodule is:
    $E(d) = \lambda = np = D / N$
  For large n and small p, the binomial distribution can be approximated by the Poisson distribution:
    $P(d) = \frac{e^{-\lambda} \lambda^d}{d!}$

Example
  Assume a submodule is equally likely to be defect-free or defective. Then
    $P(d = 0) = e^{-\lambda} \lambda^0 / 0! = e^{-\lambda} = 0.5$, so $\lambda = 0.693$.
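The relations on the preceding slides can be checked numerically. The Python sketch below solves $Y = P(0) = e^{-\lambda}$ for $\lambda$, evaluates the Poisson probability of exactly d defects, and compares it with the binomial form for a large n and small p. The function names and the choice n = 10000 are assumptions for illustration.

```python
import math


def failure_rate_from_yield(submodule_yield: float) -> float:
    """Solve Y = P(0) = exp(-lambda) for lambda."""
    if not 0.0 < submodule_yield <= 1.0:
        raise ValueError("yield must be in (0, 1]")
    return -math.log(submodule_yield)


def poisson_pmf(d: int, lam: float) -> float:
    """P(d) = exp(-lambda) * lambda**d / d!"""
    return math.exp(-lam) * lam ** d / math.factorial(d)


def binomial_pmf(n: int, d: int, p: float) -> float:
    """P(d) = C(n, d) * p**d * (1 - p)**(n - d)"""
    return math.comb(n, d) * p ** d * (1.0 - p) ** (n - d)


if __name__ == "__main__":
    lam = failure_rate_from_yield(0.5)  # equally likely defect-free or defective
    print(f"lambda = {lam:.3f}")
    n = 10_000                          # assumed number of defect positions
    for d in range(4):
        print(d, f"{poisson_pmf(d, lam):.4f}", f"{binomial_pmf(n, d, lam / n):.4f}")
```

Printing both columns side by side shows how closely the Poisson approximation tracks the binomial distribution when n is large and p is small.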
  Effective yield can increase significantly if the system can accept some defective submodules.

Probability of Having Exactly d Defects in a Submodule as a Function of Yield (Y) for Various Values of Failure Rate λ

  d    λ=0.105  λ=0.223  λ=0.357  λ=0.511  λ=0.693  λ=0.916  λ=1.204  λ=1.609  λ=2.303
  0    Y=0.90   Y=0.80   Y=0.70   Y=0.60   Y=0.50   Y=0.40   Y=0.30   Y=0.20   Y=0.10
  1    0.09     0.18     0.25     0.31     0.35     0.37     0.36     0.32     0.23
  2             0.02     0.04     0.08     0.12     0.17     0.22     0.26     0.27
  3                      0.01     0.01     0.03     0.05     0.09     0.14     0.20
  4                                                 0.01     0.03     0.06     0.12
  5                                                          0.01     0.02     0.05
  6                                                                            0.02
  7                                                                            0.01

Defect Tolerance
  Used to be called redundancy repair
  A typical defect-tolerant design is shown on the left
  [Figure: three identical modules (M) connected to a switch]
    Two spares (identical modules)
    A switch used to select one module

Error Tolerance
  The main objective of error tolerance is to increase the effective yield of a process by identifying defective but acceptable chips.
  This relies on the development of:
    An accurate method to estimate error rate
    An effective method to predict yield

Fault-Oriented Test Methodology
  Enhance effective yield based on error-rate analysis
    Estimate the error rate of each modeled fault
    A set of acceptable faults is identified based on their error rates
  [Figure labels: IC Fabrication, Fault Ranking, Testing, Acceptable Chips, Unacceptable Chips]

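The fault-oriented flow above can be sketched as a two-step filter: first pick the faults whose estimated error rate is tolerable, then accept a chip only if every fault detected on it belongs to that set. This Python sketch is a hypothetical illustration; the fault names, the error rates, the threshold, and the function names are all assumptions, and the slides do not specify how error rates are estimated.

```python
from typing import Dict, Iterable, Set


def acceptable_faults(error_rates: Dict[str, float], threshold: float) -> Set[str]:
    """Return the modeled faults whose estimated error rate is within the threshold."""
    return {fault for fault, rate in error_rates.items() if rate <= threshold}


def classify_chip(detected_faults: Iterable[str], acceptable: Set[str]) -> str:
    """Classify a chip from the set of faults detected during testing."""
    detected = set(detected_faults)
    if not detected:
        return "good"
    # A faulty chip is acceptable only if all of its faults are acceptable.
    return "acceptable" if detected <= acceptable else "unacceptable"


if __name__ == "__main__":
    # Hypothetical estimated error rates per modeled fault.
    rates = {"f1": 0.001, "f2": 0.020, "f3": 0.300}
    ok = acceptable_faults(rates, threshold=0.05)
    for chip_faults in ([], ["f1"], ["f1", "f2"], ["f2", "f3"]):
        print(chip_faults, "->", classify_chip(chip_faults, ok))
```

Counting the chips classified as "good" or "acceptable" against the total gives one way to express the effective yield that error tolerance aims to raise.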
Error-Oriented Test Methodology
  Focus on errors produced by defective chips rather than on modeled faults
  Error-rate estimation
    Estimate the error rates of these chips
    Determine the acceptability of the faulty chips from the estimated results
  [Figure labels: IC Fabrication, Testing, Good Chips, Bad Chips, Error-Rate Estimation, Estimated Error Rate, Classification Based on Estimated Error Rate, Acceptable Chip Set 1, Acceptable Chip Set 2, …, Unacceptable Chips]

Concluding Remarks
  Circuit errors can be caused by manufacturing defects and soft errors.
  Design for Manufacturability (DFM) – fault avoidance schemes to cope with physical failures caused by signal integrity, defects, and process variations during manufacturing.
  Design for Reliability (DFR) – error resilience and defect tolerance circuitry embedded on-chip to tolerate soft errors and manufacturing defects.