Download Soft Errors - CS Course Webpages

Chapter 8
Coping with Physical Failures,
Soft Errors, and
Reliability Issues
EE141
System-on-Chip
Test Architectures
1
Ch. 8 – Physical Failures - P. 1
What is this chapter about?

Gives an overview of the causes of manufacturing defects and soft errors, and of promising solutions for coping with them

Focus on
 Signal Integrity
 Defect-Based Tests
 Process Sensors and Adaptive Design
 Soft Errors
– BISER
– Circuit-Level Approaches
 Defect and Error Tolerance
Coping with Physical Failures, Soft Errors,
and Reliability Issues






 Introduction
 Signal Integrity
 Manufacturing Defects, Process Variations, and Reliability
 Soft Errors
 Defect and Error Tolerance
 Concluding Remarks
Introduction

Defects
 Random defects
– Caused by manufacturing imperfections and occur in random places
 Systematic defects
– Caused by process or manufacturing variations
Defect level (DL) is a function of process yield (Y) and fault coverage (FC):
$DL = 1 - Y^{(1 - FC)}$
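A minimal sketch of the defect level relation, assuming Y and FC are given as fractions between 0 and 1. The function name `defect_level` is illustrative and not part of the original slides.

```python
def defect_level(process_yield: float, fault_coverage: float) -> float:
    """Return DL = 1 - Y**(1 - FC), with Y and FC given as fractions."""
    if not (0.0 < process_yield <= 1.0):
        raise ValueError("process_yield must be in (0, 1]")
    if not (0.0 <= fault_coverage <= 1.0):
        raise ValueError("fault_coverage must be in [0, 1]")
    return 1.0 - process_yield ** (1.0 - fault_coverage)


if __name__ == "__main__":
    # Higher fault coverage lowers the fraction of shipped defective parts.
    for fc in (0.90, 0.99, 1.00):
        print(f"Y=0.50, FC={fc:.2f} -> DL={defect_level(0.5, fc):.4f}")
```

With full fault coverage (FC = 1) the exponent is zero, so the defect level drops to zero regardless of yield.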
Concept of Signal Integrity
Signal integrity is the ability of a signal to generate correct responses in a
circuit.
A signal with good integrity stays within safe margins for its voltage amplitude
and transition time.
Basic Concept of Integrity Loss
Integrity loss: any portion of a signal that exceeds the amplitude-safe and time-safe margins.

$IL = \sum_i \left( \int_{b_i}^{e_i} \left| V_i - f(t) \right| \, dt \right)$
where $V_i$ is one of the acceptable amplitude levels and $[b_i, e_i]$ is a time frame during which integrity loss occurs.
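The integral can be approximated numerically. The sketch below assumes the signal f(t) is available as sampled points and that each violation window is supplied as a hypothetical tuple (b_i, e_i, V_i); it uses the trapezoidal rule and is an illustration only, not the sensors' actual method.

```python
def integrity_loss(times, samples, windows):
    """Approximate IL = sum_i integral_{b_i}^{e_i} |V_i - f(t)| dt.

    times, samples: sampled waveform f(t) (assumed input format).
    windows: list of (b_i, e_i, V_i) tuples (assumed input format).
    """
    total = 0.0
    for begin, end, v_ref in windows:
        # Keep only the samples that fall inside this violation window.
        pts = [(t, abs(v_ref - f)) for t, f in zip(times, samples)
               if begin <= t <= end]
        # Trapezoidal rule over consecutive samples in the window.
        for (t0, y0), (t1, y1) in zip(pts, pts[1:]):
            total += 0.5 * (y0 + y1) * (t1 - t0)
    return total


if __name__ == "__main__":
    t = [0.0, 1.0, 2.0, 3.0]
    f = [1.0, 1.2, 1.2, 1.0]          # overshoot of 0.2 V between t=1 and t=2
    print(integrity_loss(t, f, [(1.0, 2.0, 1.0)]))
```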
Sources of Integrity Loss
Interconnects
 Power Supply Noise
 Process Variations

Integrity Loss Sensors/Monitors (1)

Current Sensor
Current sensors are often used to detect the completion
of asynchronous circuits.

Integrity Loss Sensors/Monitors (2)

Power Supply Noise Sensor
The voltage V_x depends on the power/ground bounce: the higher the power supply noise (PSN), the longer the propagation delay and the higher the voltage V_x will be.

Integrity Loss Sensors/Monitors (3)
 Noise Detector (ND) Sensor
The ND sensor is designed to detect integrity loss due to voltage violations.

Integrity Loss Sensors/Monitors (4)

Integrity Loss Sensor (ILS)

The integrity loss sensor is a delay violation sensor.
Integrity Loss Sensors/Monitors (5)

Jitter Monitor
Jitter is often defined as the time deviation of a signal
from its ideal location in time.

Integrity Loss Sensors/Monitors (6)


A ring oscillator can work as a Process Variation Sensor
The variation of delay caused by PV-faults in any of the
inverters in the loop results in deviation in the frequency
of the oscillator, which can be detected.
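$f_{RO} = \dfrac{1}{N_{inv} T_{inv}}$, where $N_{inv}$ is an odd number of inverters and $T_{inv}$ is the delay of one inverter.
$f_{RO} \approx \dfrac{1}{N_{inv} \cdot \dfrac{V_{dd} C_{Load}}{\left( \dfrac{\mu \varepsilon W}{2 T_{ox} L_{eff}} \right) (V_{GS} - V_t)^2 (1 + K V_{DS})}}$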
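A small sketch of how a frequency shift could expose a process variation, using the slide's relation f_RO = 1 / (N_inv · T_inv). The function names and the example delays are assumptions for illustration; note that some texts define the period as 2 · N_inv · t_pd, which differs only in how the per-inverter delay is defined.

```python
def ring_oscillator_frequency(n_inv: int, t_inv: float) -> float:
    """f_RO = 1 / (N_inv * T_inv); N_inv must be odd for the loop to oscillate."""
    if n_inv < 1 or n_inv % 2 == 0:
        raise ValueError("n_inv must be a positive odd number")
    return 1.0 / (n_inv * t_inv)


def frequency_deviation(f_measured: float, f_nominal: float) -> float:
    """Relative deviation of the measured frequency from the nominal one."""
    return (f_measured - f_nominal) / f_nominal


if __name__ == "__main__":
    nominal = ring_oscillator_frequency(5, 20e-12)   # 20 ps per inverter (assumed)
    slow = ring_oscillator_frequency(5, 22e-12)      # inverters slowed by variation
    print(f"nominal={nominal:.3e} Hz, deviation={frequency_deviation(slow, nominal):+.1%}")
```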
Readout Architectures (1)

BIST-Based Architecture
BIST Architecture

Readout Circuitry
When a noise or delay violation occurs (flag=1), the
contents of all scan cells are then scanned out through
Sout for further reliability and diagnosis analysis.
Readout Architectures (2)

Scan-Based Architecture

At the driving side of an interconnect, a pattern generation boundary-scan cell (PGBSC) is used to generate test patterns. At the receiving side of the interconnect, an observation boundary-scan cell (OBSC) is used to detect integrity loss.
Readout Architectures (3)

Basic Concept of PV-Test Architecture

On-chip ROs with counters, embedded in a test chip, are used to detect process variation by measuring the ROs' frequency shifts.
Manufacturing Defects, Process Variations,
and Reliability

100% single stuck-at fault coverage cannot guarantee
perfect product quality, because there are remaining
defects that are:
Timing-dependent
Sequence-dependent
Attributed to timing-dependent, non-single-stuck-at faults
Structural Tests
 A Defect-Based Test Architecture
[Figure: defect-based test architecture. Blocks: RTL, Library, Synthesis, Gate-level Netlist, Layout, RC Extraction, Modeling, Timing Analysis, Path Extractor, Critical Path List, Physical Faults, Defect-Based Fault Enumeration, Fault Mapping, Logical Fault List, Fault List, ATPG, Structural Tests, Functional Tests, Defect-Based Fault Simulator, Defect-Based ATPG, Defect-Based Tests]
Defect-Based Tests
 Small Delay Defect Tests
 Bridge Defect Tests
 N-Detect Tests
 I_DDQ Tests
 MinVDD Tests
 VLV Tests

Reliability Stress
Concept of Infant Mortality
 Methods to screen infant mortality

Method I - Burn-in
$ttf = C \cdot e^{E_A / (kT)}$
where ttf is the time to failure, C is a constant, $E_A$ is the activation energy (eV), k is Boltzmann's constant, and T is the absolute temperature.
Method II - Elevated Voltage Stress
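The burn-in relation can be used to estimate how much an elevated temperature shortens time to failure. The sketch below takes the ratio of ttf at use and stress temperatures, so the constant C cancels; the activation energy and temperatures in the usage lines are assumed example values, not figures from the slides.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K


def time_to_failure(c: float, e_a: float, temp_k: float) -> float:
    """ttf = C * exp(E_A / (k * T)), with E_A in eV and T in kelvin."""
    return c * math.exp(e_a / (BOLTZMANN_EV_PER_K * temp_k))


def acceleration_factor(e_a: float, t_use_k: float, t_stress_k: float) -> float:
    """Ratio ttf(use) / ttf(stress); the constant C cancels out."""
    return time_to_failure(1.0, e_a, t_use_k) / time_to_failure(1.0, e_a, t_stress_k)


if __name__ == "__main__":
    # Assumed example: E_A = 0.7 eV, use at 55 C, burn-in at 125 C.
    af = acceleration_factor(0.7, 55 + 273.15, 125 + 273.15)
    print(f"acceleration factor: {af:.1f}")
```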
Redundancy and Memory Repair
 Redundancy:
 Spare rows, columns, or blocks
 Repair schemes:
 Pellston Technology [Wuu 2005]: if repeated errors are detected, disable the cache line (set its “not to use” bit)
 Perform memory BIST at new operating conditions; exclude failing cells and resize the cache (the cache can grow or shrink, depending on whether the new conditions are more or less favourable)
Process Sensors and Adaptive design

Compared with traditional test structures placed on the scribe lines, additional process sensors are embedded on-chip.

On-Chip Process Sensors:
 Process Variation Sensor
 Thermal Sensor
 Dynamic Voltage Scaling
Process variation Sensor

Ring oscillators:
Many factors can affect the frequency of the ring oscillator such as
process variation, temperature and voltage.

Analog Process Variation Sensor:
The analog circuit will be sensitive to different process parameters.
Neither can report the process variation at a specific spot on the die, and neither is likely to allow the data to be extracted and analyzed in real time.
Thermal Sensor
On-chip thermal sensors are the last defence
to prevent system crash or permanent
damage to the chip.
 Thermal sensor example:

[Figure 8.14: Thermal sensor example. Labels: currents I1, I2, I3; resistors R1, R2; voltages Vb, Vc, ΔVf, Vref_diode, Vref-1 … Vref-n, Vref_TTLEVEL; MUX; comparators; Tx Detect]
Dynamic Voltage Scaling

[Figure 8.15: Dynamic voltage scaling scheme. Frequency (f_MIN to f_MAX) and Vcc (VIDmin to VIDnom) plotted against time after a DVS frequency-change request. Transitions 1 and 3 are in the range of 100s of ps; transitions 2 and 4 are in the range of 100s of μs.]
Dynamic Voltage Scaling (cont’d)
Use sleep transistors and dynamic biasing to
save power
 Use the adaptive test method for smart
binning

Soft Errors

Introduction

Sources of Soft Errors and SER Trends

Coping with Soft Errors
Introduction

Soft errors
 Soft errors are transient single-event upsets (SEUs) caused by various types of radiation.
 Cosmic radiation is the major source of soft errors, especially in memories.
 Terrestrial radiation is another source of soft errors.
Sources of Soft Errors and SER Trends

If a glitch is induced at the junction (red label) in a memory element, its state can be reversed.
Figure 8.16: Induced soft error on an SRAM cell
Sources of Soft Errors and SER Trends

Logic circuits are less susceptible to these glitches
than memories for the following reasons.
The glitch must be of sufficient strength to propagate from
the location of the strike.
The glitch needs to have a functionally sensitized path to be
latched.
The glitch must arrive at a latch during its latching window.
Figure 8.18: Masking factors of soft errors in
combinational logic
Coping with Soft Errors

As chips are susceptible to soft errors, many soft
error protection schemes targeting chip designs have
been proposed.
 Fault tolerance
 Error-resilient microarchitectures
 Soft error mitigation
Fault Tolerance


Removing the source of soft errors to improve the
reliability of a chip.
Three fundamental fault tolerance schemes:
 Hardware (spatial) redundancy
– assumes that defects and radiation particles will hit only one specific device and not another device
 Time (temporal) redundancy
– assumes that a radiation strike will not hit the same circuitry again at a slightly later time
 Information redundancy
– uses an error-detecting code or error-correcting code to represent the information contents
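As an illustration of information redundancy, the sketch below uses a single even-parity bit, one of the simplest error-detecting codes. It is an assumed example and not a scheme described in the slides; it detects any single-bit upset but cannot correct it.

```python
def add_even_parity(bits):
    """Append a parity bit so the total number of 1s in the word is even."""
    return list(bits) + [sum(bits) % 2]


def parity_ok(word):
    """Return True if the stored word still has even parity."""
    return sum(word) % 2 == 0


if __name__ == "__main__":
    word = add_even_parity([1, 0, 1, 1])
    print(word, parity_ok(word))       # stored word passes the check
    word[2] ^= 1                       # emulate a single-event upset
    print(word, parity_ok(word))       # the single-bit flip is detected
```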
Fault Tolerance

Common fault tolerance schemes used in high-reliability systems
 Duplicate and compare
– used in mainframes and high-end servers
 Triple modular redundancy
– used for systems that cannot fail
 Redundant multithreading
– runs redundant copies of a thread and compares their results
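The first two schemes can be sketched behaviourally on integer-encoded outputs. The function names are illustrative; the bitwise majority vote is the standard way to model a TMR voter, while a real design would implement it in hardware.

```python
def duplicate_and_compare(a: int, b: int):
    """Return (output, error_flag): a mismatch between the two copies signals an error."""
    return a, a != b


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three module outputs (TMR voter)."""
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    good = 0b1010
    upset = 0b0110                      # one copy corrupted by a soft error
    print(duplicate_and_compare(good, upset))      # detects, cannot correct
    print(bin(majority_vote(good, good, upset)))   # masks the faulty copy
```

Duplicate and compare only detects a mismatch, whereas the three-way vote masks a single faulty copy.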
Error-Resilient Microarchitectures

Two representative error-resilient processor
microarchitectures
 DIVA
 Razor

DIVA
 Dynamic Implementation Verification Architecture (DIVA)
 DIVA Checker
– a smaller and simpler shadow processor
– contains a functional checker stage (CHK), a commit stage (CT), and a watchdog timer (WT)
 DIVA Core
– the main processor that fetches, decodes, and executes instructions, holding their speculative results in the reorder buffer (ROB)
Error-Resilient Microarchitectures

Razor
 Dynamic voltage scaling (DVS) is one of the most effective and widely used methods for power-aware computing.
 The key idea of Razor is to tune the supply voltage by monitoring the error rate during circuit operation; this is accomplished with a shadow unit, but this shadow unit has been pushed all the way down into a Razor flip-flop. This Razor flip-flop is shown in Figure 8.21a.
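A behavioural sketch of the idea, under stated assumptions: the main flip-flop samples the data at the clock edge, the shadow latch samples it on a delayed clock, and a mismatch raises an error and restores the shadow value. The voltage-tuning policy (`tune_voltage`, its step size and target rate) is a hypothetical controller added for illustration only.

```python
def razor_flip_flop(d_at_clk: int, d_at_clk_del: int):
    """Return (q, error) for one cycle of a Razor flip-flop model."""
    main_q = d_at_clk            # main flip-flop: sampled at the clock edge
    shadow = d_at_clk_del        # shadow latch: sampled on the delayed clock
    error = main_q != shadow     # comparator flags a late-arriving transition
    return (shadow if error else main_q), error


def tune_voltage(vdd: float, error_rate: float, target_rate: float,
                 step: float = 0.01) -> float:
    """Hypothetical policy: raise Vdd if errors exceed the target, else lower it."""
    return vdd + step if error_rate > target_rate else vdd - step


if __name__ == "__main__":
    print(razor_flip_flop(1, 1))   # data met timing: no error
    print(razor_flip_flop(0, 1))   # data arrived late: error, shadow value restored
    print(tune_voltage(1.00, error_rate=0.05, target_rate=0.01))
```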
Error-Resilient Microarchitectures
[Figure 8.21(a): Schematic of the Razor flip-flop. Labels: logic stages L1 and L2, main flip-flop (D1, Q1, clk), shadow latch (clk_del), 0/1 multiplexer, comparator, Error_L, Error]
Error-Resilient Microarchitectures

Razor
A reduced-overhead Razor flip-flop with a metastability detection circuit is illustrated in Figure 8.21b.
[Figure 8.21(b): Reduced overhead Razor flip-flop with metastability detection circuit. Labels: D, Q, clk, clk_b, clk_del, clk_del_b, 0/1 multiplexer, shadow latch, metastability detector (Inv_n, Inv_p), Error_L]
Soft Error Mitigation



Soft error mitigation techniques give a design partial immunity to potential soft errors at a significantly lower cost than fault tolerance schemes.
There are three soft error mitigation methods:
(1) Built-In Soft-Error Resilience (BISER)
BISER, proposed in [Mitra 2005], allows the scan design to be reused to protect a device from soft errors during normal operation.
Soft Error Mitigation

Figure 8.22 shows the BISER scan cell design that
reduces the impact of soft errors affecting storage
elements by more than 20 times.
[Figure 8.22: Built-in soft-error resilience (BISER) scan cell. Labels: scan portion (latches LA, LB), system flip-flop (latches PH2, PH1), C-element, keeper; signals SI, SO, SCA, SCB, CAPTURE, UPDATE, TEST, D, CLK, O1, O2, Q]
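The BISER cell in Figure 8.22 combines two latch outputs (O1, O2) through a C-element with a keeper. The sketch below models a generic, non-inverting C-element as an assumption: the output follows the inputs when they agree and holds its previous value when they disagree, which is how a single upset latch is filtered out. A real cell may invert the output; that detail is not modelled here.

```python
def c_element(o1: int, o2: int, previous_q: int) -> int:
    """Generic C-element with keeper (assumed non-inverting for simplicity)."""
    if o1 == o2:
        return o1            # inputs agree: output follows them
    return previous_q        # inputs disagree: keeper holds the old value


if __name__ == "__main__":
    q = c_element(1, 1, previous_q=0)    # both latches hold 1
    print(q)
    q = c_element(1, 0, previous_q=q)    # soft error flips one latch
    print(q)                             # output is unchanged
```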
Soft Error Mitigation

Circuit-level approaches
(2) Gate resizing for soft error mitigation [Zhou 2006] is
based on physical-level design modifications.
Figure 8.23 illustrates the effect of gate resizing on the
amplitude and width of a 0-to-1 transient at the output of a
gate.
Figure 8.23: Effect of gate resizing on the
amplitude/width of SETs [Zhou 2006]
Soft Error Mitigation

Circuit-level approaches
(3) Netlist transformation for soft error mitigation
[Almukhaizim 2006] is based on logic-level design
modifications.
Figure 8.24: Example of rewiring to reduce the soft error failure rate
Defect and Error Tolerance

Defect Tolerance
 Insert redundant circuitry into a circuit under test
 The circuit can then continue correct operation in the presence of defects.

Error Tolerance
 Allow the circuit to continue acceptable operation
in the presence of errors
Random Spot defects
 Assume a design consists of N submodules.
 Each submodule has n unique positions where a defect would cause it to fail its tests.
 D defects are uniformly distributed over the submodules.
 The number of defects in any submodule is independent of the number of defects in other submodules.
Defect Probability

Probability that an arbitrary position on a
submodule is associated with a defect is:
$p = D / (nN)$

Probability of having d defects in a given submodule is:
$P(d) = C(n,d) \, p^d (1-p)^{n-d}$
where
$C(n,d) = n! / (d! \, (n-d)!)$
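A short sketch of these two formulas, using the slide's symbols D, n, N, p, and d. The function names and the numeric values in the usage lines are assumptions chosen only for illustration.

```python
from math import comb


def defect_position_probability(D: int, n: int, N: int) -> float:
    """p = D / (n * N): chance that an arbitrary position carries a defect."""
    return D / (n * N)


def prob_d_defects(d: int, n: int, p: float) -> float:
    """Binomial P(d) = C(n, d) * p**d * (1 - p)**(n - d)."""
    return comb(n, d) * p ** d * (1.0 - p) ** (n - d)


if __name__ == "__main__":
    p = defect_position_probability(D=50, n=100, N=10)   # assumed example values
    for d in range(4):
        print(d, round(prob_d_defects(d, 100, p), 4))
```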
Poisson Distribution

Because P(d) is binomially distributed, the average number of defects in an arbitrary submodule is:
$E(d) = \lambda = np = D / N$

For large n and small p, the binomial distribution can be approximated by the Poisson distribution:
$P(d) = e^{-\lambda} \lambda^d / d!$
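The quality of the approximation can be checked numerically. This sketch compares the binomial and Poisson probabilities for a large n and small p with the same mean λ = np; the chosen n and λ are assumed example values.

```python
from math import comb, exp, factorial


def binomial_pmf(d: int, n: int, p: float) -> float:
    """Exact binomial probability of d defects among n positions."""
    return comb(n, d) * p ** d * (1.0 - p) ** (n - d)


def poisson_pmf(d: int, lam: float) -> float:
    """Poisson approximation P(d) = exp(-lam) * lam**d / d!."""
    return exp(-lam) * lam ** d / factorial(d)


if __name__ == "__main__":
    n, lam = 10_000, 1.0          # assumed example: large n, small p = lam / n
    for d in range(4):
        print(d, round(binomial_pmf(d, n, lam / n), 5), round(poisson_pmf(d, lam), 5))
```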
Example
 Assume a submodule is equally likely to be defect-free or defective:
$P(d = 0) = e^{-\lambda} \lambda^0 / 0! = e^{-\lambda} = 0.5$
 Thus, $\lambda = \ln 2 \approx 0.693$.
 Effective yield can increase significantly if the system can accept some defective submodules.
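The example can be reproduced in a few lines. The sketch below recovers λ from the defect-free probability and then, as an assumed interpretation of "accepting some defective submodules", sums the Poisson probabilities up to a tolerated number of defects per submodule.

```python
from math import exp, factorial, log


def failure_rate_from_yield(y: float) -> float:
    """Solve P(d = 0) = exp(-lam) = Y for lam."""
    return -log(y)


def effective_yield(lam: float, max_defects: int) -> float:
    """Probability that a submodule has at most max_defects defects (assumed acceptance rule)."""
    return sum(exp(-lam) * lam ** d / factorial(d) for d in range(max_defects + 1))


if __name__ == "__main__":
    lam = failure_rate_from_yield(0.5)
    print(f"lambda = {lam:.3f}")
    print(f"yield if one defect is tolerated = {effective_yield(lam, 1):.3f}")
```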
Probability of Having Exactly d Defects at a Submodule as a Function of Yield (Y) for Various Values of Failure Rate λ

d   λ=0.105  λ=0.223  λ=0.357  λ=0.511  λ=0.693  λ=0.916  λ=1.204  λ=1.609  λ=2.303
0   Y=0.90   Y=0.80   Y=0.70   Y=0.60   Y=0.50   Y=0.40   Y=0.30   Y=0.20   Y=0.10
1   0.09     0.18     0.25     0.31     0.35     0.37     0.36     0.32     0.23
2   -        0.02     0.04     0.08     0.12     0.17     0.22     0.26     0.27
3   -        -        0.01     0.01     0.03     0.05     0.09     0.14     0.20
4   -        -        -        -        -        0.01     0.03     0.06     0.12
5   -        -        -        -        -        -        0.01     0.02     0.05
6   -        -        -        -        -        -        -        -        0.02
7   -        -        -        -        -        -        -        -        0.01
Defect Tolerance
 Used to be called redundancy repair
 A typical defect-tolerant design is shown in the figure
 Two spares (identical modules)
 A switch used to select one module
[Figure: three identical modules M connected to a switch]
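To see why spares help, the sketch below estimates the probability that at least one of the identical modules is defect-free. It assumes independent module yields and a defect-free switch, neither of which is stated in the slides, so treat it as a rough upper bound.

```python
def yield_with_spares(module_yield: float, spares: int) -> float:
    """P(at least one good module) = 1 - (1 - Y)**(spares + 1), assuming independence."""
    if not (0.0 <= module_yield <= 1.0) or spares < 0:
        raise ValueError("module_yield must be in [0, 1] and spares >= 0")
    return 1.0 - (1.0 - module_yield) ** (spares + 1)


if __name__ == "__main__":
    for k in range(3):
        print(f"{k} spare(s): {yield_with_spares(0.5, k):.3f}")
```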
Error Tolerance
 The main objective of error tolerance is to increase the effective yield of a process by identifying defective but acceptable chips
 This relies on the development of
 An accurate method to estimate error rate
 An effective method to predict yield
Fault-Oriented Test Methodology

Enhance effective yield based on error-rate
analysis
 Estimate error rate of each modeled fault
 A set of acceptable faults is identified based on
their error rates
[Figure: fault-oriented test flow. Blocks: IC Fabrication, Fault Ranking, Testing, Acceptable Chips, Unacceptable Chips]
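A minimal sketch of the fault-oriented flow, assuming each modeled fault already has an estimated error rate and that a chip is acceptable when every fault it exhibits belongs to the acceptable set. The fault names, rates, and threshold are hypothetical.

```python
def acceptable_faults(error_rates: dict, threshold: float) -> set:
    """Faults whose estimated error rate does not exceed the threshold."""
    return {fault for fault, rate in error_rates.items() if rate <= threshold}


def chip_is_acceptable(detected_faults, acceptable: set) -> bool:
    """A chip passes if all of its detected faults are acceptable ones."""
    return set(detected_faults) <= acceptable


if __name__ == "__main__":
    rates = {"f1": 0.001, "f2": 0.20, "f3": 0.004}   # hypothetical estimates
    ok = acceptable_faults(rates, threshold=0.01)
    print(sorted(ok))
    print(chip_is_acceptable(["f1", "f3"], ok), chip_is_acceptable(["f2"], ok))
```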
Error-Oriented Test Methodology
 Focus on errors produced by defective chips rather than on modeled faults
 Estimate the error rates of these chips
 Determine the acceptability of the faulty chips from the estimated results
[Figure: error-oriented test flow. Blocks: IC Fabrication, Testing, Good Chips, Bad Chips, Error-Rate Estimation, Estimated Error Rate, Classification Based on Estimated Error Rate, Acceptable Chip Set 1, Acceptable Chip Set 2, …, Unacceptable Chips]
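The classification step can be sketched as binning a bad chip by its estimated error rate. The ascending threshold list is an assumed input; the slides do not specify how the acceptable sets are bounded.

```python
from bisect import bisect_left


def classify_chip(error_rate: float, thresholds) -> str:
    """Place a bad chip into 'Acceptable Chip Set k' or 'Unacceptable Chips'.

    thresholds: ascending upper bounds, one per acceptable set (assumed input).
    """
    k = bisect_left(thresholds, error_rate)   # first set whose bound covers the rate
    if k < len(thresholds):
        return f"Acceptable Chip Set {k + 1}"
    return "Unacceptable Chips"


if __name__ == "__main__":
    bounds = [0.001, 0.01]                    # hypothetical set boundaries
    for rate in (0.0005, 0.005, 0.02):
        print(rate, "->", classify_chip(rate, bounds))
```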
Concluding Remarks



Circuit Errors can be caused by manufacturing
defects and soft errors.
Design for Manufacturability (DFM) – Fault avoidance
schemes to cope with physical failures caused by
signal integrity, defects, and process variations during
manufacturing.
Design for Reliability (DFR) – Embedded error
resilience and defect tolerance circuitry on-chip to
tolerate soft errors and manufacturing defects.