1. Introduction
1. Faults and their manifestation (4)
2. Analysis of faults (12)
3. Classification of tests (5)
4. Fault coverage requirements (3)
5. Test economics (4)
1.1 Faults and their manifestation
Definition of the terms: Failure, Error and Fault
Failure: A system failure is present when the service of the system
differs from the expected service
A failure is caused by an error
Error: There is an error in the system when its state differs from the
state required to deliver the expected service
An error is caused by a fault
Fault: A fault is present when there is a physical difference between
the correct system and the current system
1.2 Faults and their manifestation: Example
Example: A car cannot be used due to a flat tire
Failure: The car cannot be driven due to a flat tire
I.e., the service differs from the expected service
The failure is caused by an error
Error: The air pressure has an erroneous state
An error is caused by a fault
Fault: A puncture, causing an erroneous air-pressure-state
I.e., the puncture is the difference between the correct system and
the current system
Note: A fault may not immediately result in a failure; e.g., as will be
the case with a slowly leaking tire
1.3 Fault manifestation
According to the way faults manifest themselves in time, they
can be divided into permanent and non-permanent faults
Permanent fault: Affects the system’s functional behavior permanently
Permanent faults are also referred to as solid or hard faults
Examples: Broken wires, functional design errors, etc.
Non-permanent fault: Affects the system’s functional behavior only part
of the time
1.4 Non-permanent faults
Non-permanent faults are only present part of the time
• They occur at random moments and affect the system behavior for
  finite periods of time
• Therefore, their detection and localization are difficult
These faults consist of the following groups:
• Transient faults
  – Caused by environmental conditions
  – They are also referred to as soft errors
  Examples: cosmic rays, α-particles, temperature, pressure, vibration
• Intermittent faults
  – Caused by non-environmental conditions
  Examples: Loose connections, deteriorating or aging components
2.1 Analysis of faults
The following topics explain this subject
• Analyze the frequency of occurrence of faults
• Analyze system failure rate over its lifetime
• Show failure rates of series and parallel systems
• Explain physical and electrical causes of faults
  – These are referred to as failure mechanisms
2.2 Frequency of occurrence of faults (1)
Can be explained using reliability theory
The point in time t at which a fault occurs can be considered a random
variable u
The probability of a failure before time t, F(t), is the unreliability of the
system: $F(t) = P(u \le t)$
The reliability of a system, R(t), is the probability of a correctly
functioning system at time t: $R(t) = 1 - F(t)$, or alternatively:
$R(t) = \dfrac{\text{\# of components surviving at time } t}{\text{\# of components at time } 0}$
It is assumed that:
$F(0) = 0$: Initially the system will be operable
$F(\infty) = 1$: Ultimately the system will fail
$F(t) + R(t) = 1$: System is either operable or failing
2.3 Frequency of occurrence of faults (2)
The derivative of F(t), f(t), is called the failure probability density
function:
$f(t) = \dfrac{dF(t)}{dt} = -\dfrac{dR(t)}{dt}$
Hence:
$F(t) = \int_0^t f(\tau)\,d\tau$ and $R(t) = \int_t^\infty f(\tau)\,d\tau$
The failure rate, z(t), is defined as the conditional probability that the
system fails during the period (t, t+Δt), given that the system was
operational at time t:
$z(t) = \lim_{\Delta t \to 0} \dfrac{F(t + \Delta t) - F(t)}{\Delta t} \cdot \dfrac{1}{R(t)} = \dfrac{dF(t)}{dt} \cdot \dfrac{1}{R(t)} = \dfrac{f(t)}{R(t)}$
Alternatively, z(t) can be expressed as follows:
$z(t) = \dfrac{\text{\# of failing components per unit time at time } t}{\text{\# of surviving components at time } t}$
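The count-based expressions for R(t), F(t) and z(t) can be tried out on a small survivor table. The sketch below is only an illustration: the function name `empirical_rates`, the sampling interval `dt` and the counts are hypothetical and do not come from the slides.

```python
def empirical_rates(survivors, dt=1.0):
    """Estimate R(t), F(t) and z(t) from survivor counts.

    survivors[i] is the number of components still working at time i * dt.
    Returns one (t, R, F, z) tuple per interval start.
    """
    n0 = survivors[0]
    rows = []
    for i in range(len(survivors) - 1):
        alive = survivors[i]
        if alive == 0:
            break  # no survivors left, so z(t) is undefined
        r = alive / n0                                    # R(t)
        failing_per_unit_time = (alive - survivors[i + 1]) / dt
        z = failing_per_unit_time / alive                 # z(t)
        rows.append((i * dt, r, 1.0 - r, z))              # F(t) = 1 - R(t)
    return rows


# Hypothetical counts: 10% of the survivors fail in every time unit.
counts = [1000, 900, 810, 729]
for t, r, f, z in empirical_rates(counts):
    print(f"t={t:.0f}  R={r:.3f}  F={f:.3f}  z={z:.3f}")
```

With these made-up counts R(t) keeps falling while z(t) stays the same in every interval, which is the constant-failure-rate situation.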
2.4 Frequency of occurrence of faults (3)
R(t) can be expressed in terms of z(t) as follows:
$\int_0^t z(\tau)\,d\tau = \int_0^t \dfrac{f(\tau)}{R(\tau)}\,d\tau = -\int_{R(0)}^{R(t)} \dfrac{dR}{R} = -\ln\dfrac{R(t)}{R(0)}$
or,
$R(t) = R(0)\,e^{-\int_0^t z(\tau)\,d\tau}$
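The relation between R(t) and z(t) can be checked numerically. The following is a minimal sketch, assuming R(0) = 1 and a simple midpoint rule for the integral; the function name `reliability_from_rate` and the linearly increasing example rate are hypothetical.

```python
import math


def reliability_from_rate(z, t, steps=10_000):
    """Return R(t) = exp(-integral of z from 0 to t), assuming R(0) = 1.

    z is a function giving the failure rate at a point in time; the
    integral is approximated with the midpoint rule.
    """
    dt = t / steps
    integral = sum(z((i + 0.5) * dt) for i in range(steps)) * dt
    return math.exp(-integral)


# Hypothetical wear-out behavior: the failure rate grows linearly with time.
def rising_rate(u):
    return 0.0002 * u


# The integral of 0.0002*u from 0 to 100 equals 1, so R(100) = e^-1.
print(reliability_from_rate(rising_rate, 100.0))
print(math.exp(-1.0))
```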

The average lifetime of a system can be expressed as the
mathematical expectation of t:
$E(t) = \int_0^\infty t \cdot f(t)\,dt$
For a non-maintained system, this average lifetime is called the Mean Time To Failure,
MTTF. Using partial integration, and assuming $\lim_{T \to \infty} T \cdot R(T) = 0$:
$\mathrm{MTTF} = \lim_{T \to \infty} \left( \left[ -t \cdot R(t) \right]_0^T + \int_0^T R(t)\,dt \right) = \int_0^\infty R(t)\,dt$
2.5 Frequency of occurrence of faults (4)
Given a system with the following reliability: $R(t) = e^{-\lambda t}$
The failure rate, z(t), of that system is computed below, and has a
constant value λ:
$z(t) = \dfrac{f(t)}{R(t)} = \dfrac{dF(t)}{dt} \Big/ R(t) = \dfrac{d(1 - e^{-\lambda t})}{dt} \Big/ e^{-\lambda t} = \lambda e^{-\lambda t} / e^{-\lambda t} = \lambda$
Assuming failures occur randomly with a constant rate λ, the MTTF
can be expressed as
$\mathrm{MTTF} = \int_0^\infty e^{-\lambda t}\,dt = \dfrac{1}{\lambda}$
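The result MTTF = 1/λ can be cross-checked by integrating R(t) numerically. This is a rough sketch only: the value of λ is a made-up example, the function name `mttf_numeric` is hypothetical, and the infinite upper limit is truncated at 20/λ, where R(t) is negligible.

```python
import math


def mttf_numeric(reliability, t_max, steps=100_000):
    """Approximate MTTF as the integral of R(t) from 0 to t_max (trapezoidal rule)."""
    dt = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    for i in range(1, steps):
        total += reliability(i * dt)
    return total * dt


lam = 0.002  # hypothetical constant failure rate, in failures per hour


def exponential_reliability(t):
    return math.exp(-lam * t)


mttf = mttf_numeric(exponential_reliability, t_max=20.0 / lam)
print(f"numeric MTTF = {mttf:.3f} h, 1/lambda = {1.0 / lam:.3f} h")
```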

[Figure: Example of R(t) and F(t) for the Dutch male population over the years 1976–1980. Note: the number of people over 100 years old is too small.]
2.6 Frequency of occurrence of faults (5)
[Figure: R(t), F(t), z(t) and f(t) of the Dutch male population. Notes: the infant mortality rate is marked; z(t) and f(t) increase between ages 18 and 20 due to driving accidents.]
2.7 Failure rate over product lifetime (1)
A well-known graphical representation of the failure rate, z(t), is the
bathtub curve. It consists of three regions:
1. Infant mortality
Failures in this region are termed infant mortalities. They are
attributed to poor quality due to variations in the production
process
2. Working life; Constant failure rate: z(t) = λ
Failures are considered to occur randomly in time
3. Wear out; Increasing failure rate
This represents the end-of-life period of a system
[Figure: z(t) of Dutch males]
It should be clear that a system
should be shipped after it has passed
the infant mortality period, in order
to reduce the # of field returns.
2.8 Failure rate over product lifetime (2)
Shipping a system after the infant mortality period can be done by:
1. Aging the system for that period (this can be several months)
2. Aging the system under stress
– This accelerates the aging process
An important stress condition is increased temperature: Burn-In
The accelerating effect of temperature follows Arrhenius’ equation:
$\lambda_{T_2} = \lambda_{T_1} \cdot e^{E_a (1/T_1 - 1/T_2) / k}$
• T1 and T2 are absolute temperatures (in kelvin, K)
• λT1 and λT2 are the failure rates at T1 and T2, respectively
• Ea is the activation energy; a constant expressed in electron-volts, eV
• k is Boltzmann’s constant: k = 8.617 × 10^-5 eV/K
The equation shows that the failure rate is exponentially dependent on
the temperature
2.9 Failure rate over product lifetime (3)
Example of use of Arrhenius equation
Assume Burn-In takes place at 150 °C = 423 K; i.e., T2 = 423
Note: Room temperature is 30 °C = 303 K; i.e., T1 = 303
Given that the Ea for the targeted failure rate is: Ea = 0.6 eV
Then the acceleration factor is:
$\lambda_{T_2} / \lambda_{T_1} = e^{0.6 \cdot (1/303 - 1/423) / (8.617 \times 10^{-5})} \approx 678$
This means that the 150 °C temperature stress reduces the aging time by
a factor of 678.
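The burn-in example can be reproduced with a few lines of code. The sketch below simply evaluates Arrhenius' equation with the numbers from the example; the function and constant names are chosen here for illustration.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann's constant in eV/K


def acceleration_factor(ea_ev, t1_kelvin, t2_kelvin):
    """Ratio of the failure rates at T2 and T1 according to Arrhenius' equation."""
    exponent = ea_ev * (1.0 / t1_kelvin - 1.0 / t2_kelvin) / BOLTZMANN_EV_PER_K
    return math.exp(exponent)


# Values from the example: Ea = 0.6 eV, T1 = 303 K, T2 = 423 K.
af = acceleration_factor(0.6, 303.0, 423.0)
print(round(af))  # 678
```

Because the factor depends exponentially on Ea, the same burn-in temperature accelerates failure mechanisms with a higher activation energy much more strongly.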
Note: Every failure mechanism has its typical Ea value

Failure mechanism               Ea: Activation energy
Corrosion of metallization      0.3 – 0.6 eV
Electrolytic corrosion          0.8 – 1.0 eV
Electromigration                0.4 – 0.8 eV
Bonding (purple plague)         1.0 – 2.2 eV
Ionic contamination             0.5 – 1.0 eV
Alloying (contact migration)    1.7 – 1.8 eV
2.10 Failure rates of series and parallel systems
A series system is a system of which all components have to be
operational in order for the system to be operational
Consider that the system consists of n components with reliability
Ri(t), then the reliability of the system, Rs(t), is: $R_s(t) = \prod_{i=1}^{n} R_i(t)$
It can be shown that $z_s(t) = \sum_{i=1}^{n} z_i(t)$
A parallel system is a system which is operational as long as at least one of
its n components is operational. The unreliability is:
$F_p(t) = \prod_{i=1}^{n} F_i(t)$
The reliability is: $R_p(t) = 1 - \prod_{i=1}^{n} F_i(t)$
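The series and parallel formulas translate directly into code. The following is a minimal sketch with hypothetical function names and made-up component reliabilities; like the product formulas themselves, it assumes that the components fail independently of each other.

```python
import math


def series_reliability(component_reliabilities):
    """All components must work: the product of the R_i."""
    return math.prod(component_reliabilities)


def parallel_reliability(component_reliabilities):
    """At least one component must work: 1 minus the product of the F_i."""
    return 1.0 - math.prod(1.0 - r for r in component_reliabilities)


# Hypothetical example: two components, each with R_i = 0.9 at some time t.
parts = [0.9, 0.9]
print(f"series:   {series_reliability(parts):.2f}")    # 0.81
print(f"parallel: {parallel_reliability(parts):.2f}")  # 0.99
```

Putting components in series lowers the system reliability below that of the weakest component, while putting them in parallel raises it above that of the best one.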
2.11 Failure mechanisms
Failure mechanisms describe the physical and electrical
causes for faults. They can be divided into 3 classes:
1. Electrical stress
Poor design leading to electrical overstress, or careless handling
causing static damage
2. Intrinsic failure mechanisms
Inherent to the semiconductor material itself.
Examples: Crystal defects, dislocations and processing defects
3. Extrinsic failure mechanisms
Originate in the packaging and interconnection process
Examples: Poor bonding, corrosion, etc.
2.12 Failure mechanisms

Failure mechanism class         Failure mechanism
Electrical stress               Electrical overstress
                                Electrostatic discharge
Intrinsic failure mechanisms    Gate oxide breakdown
                                Ionic contamination
                                Surface charge spreading
                                Charge effects
                                • Slow trapping
                                • Hot electrons
                                • Secondary slow trapping
                                Piping
                                Dislocations
Extrinsic failure mechanisms    Packaging
                                Metallization
                                • Corrosion
                                • Electromigration
                                • Contact migration
                                • Microcracks
                                Bonding (purple plague)
                                Die attachment failure
                                Particle contamination
                                Radiation
                                • External
                                • Intrinsic
3.1 Classification of tests
A test is a procedure which allows one to distinguish
between good and bad parts
Tests can be classified according to:
1. The technology they are designed for
2. The parameters they measure
3. The purpose for which the test results are used
4. The test application method
3.2 Technology aspects
The type of test depends heavily on the technology of the
circuit to be tested:
1. Analog tests
The domain of input and output signal values is analog; i.e., they can
take on any value within a given range (Ex.: a range of 0 – 5 V)
Analog tests aim at determining the values of analog parameters
such as voltage and current levels, frequency response, bandwidth,
etc. The generation of the input stimuli and the measurement of the
responses is inherently imprecise. Therefore, a range of values is
used to determine the operational correctness
2. Digital tests
The input and output signals are digital (0 or 1); hence, precise. The
tests are called logical or digital tests.
3. Mixed signal tests
The domain of either the input or the output values is analog, while
the other is digital. Typically used for testing digital-to-analog and
analog-to-digital converters
3.3 Measured parameter aspects
The nature of the measured parameter can be:
1. Logical: Logical tests aim at detecting faults causing a change in
the logical behavior of the system (a 0 is expected, while a 1 is
measured)
2. Electrical: Electrical tests measure the values of electrical
parameters (voltage and current levels) as well as their behavior
over time; they can be divided into Parametric and Dynamic tests
Parametric tests
Are concerned with the external behavior of the circuit
Ex.: Voltage & current levels & delays on the input & output
pins
– DC parametric tests are concerned with the time-independent
properties of the input and output values
IDDQ tests are a special class of DC parametric tests; they are
concerned with the leakage currents during the quiescent state of
the circuit
– AC parametric tests are concerned with the time-dependent
properties of the input and output values
Dynamic tests aim at faults which are time-dependent and
internal to the chip
3.4 Purpose of test results
The most obvious use of the test results is to distinguish between
good and bad parts. This can be done with a test which detects
faults. In case of repair, a test capable of locating faults is
required.
Testing can be done during normal use of the system; referred to as
concurrent testing; for example, parity checking is a simple form of
concurrent testing. Alternatively, non-concurrent tests cannot be
performed during normal use of the system, because they do not
preserve the application data. However, they usually have a higher
fault detection capability.
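Parity checking, mentioned above as a simple concurrent test, can be sketched in a few lines. This is an illustration only: it assumes even parity over a small data word, and the function names and bit values are hypothetical.

```python
def parity_bit(bits):
    """Even-parity bit: chosen so that the total number of 1s becomes even."""
    return sum(bits) % 2


def passes_parity_check(word):
    """A stored word (data bits plus parity bit) is accepted if its number of 1s is even."""
    return sum(word) % 2 == 0


data = [1, 0, 1, 1]
stored = data + [parity_bit(data)]   # the word as written during normal operation

single_error = stored.copy()
single_error[2] ^= 1                 # one bit flips, e.g. due to a transient fault

double_error = single_error.copy()
double_error[0] ^= 1                 # a second bit flips as well

print(passes_parity_check(stored))        # True
print(passes_parity_check(single_error))  # False: the error is detected
print(passes_parity_check(double_error))  # True: two flipped bits go unnoticed
```

The check runs on the application data itself, so it can be done during normal use. The last case shows the limited fault detection capability of such a simple scheme.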
Design-for-Testability (DFT) includes extra circuitry on the to-be-tested chip; it allows non-concurrent tests to be performed faster
and/or with a higher fault coverage.
Built-in-Self Test (BIST) includes extra circuitry on the to-be-tested
chip, to the extent that the complete test function can be performed
on chip, without external tester support.
3.5 Test application methods
Tests can also be classified according to the way the test
stimuli are applied and the test responses are evaluated
• External test: Automatic Test Equipment ‘ATE’ is used to apply the
  test stimuli and evaluate the test responses
  At the board level the stimuli can be applied:
  – Via the regular board connectors
    Allows for a simple interface with the ATE and for at-speed
    testing. However, not all circuits are easy to reach. Manual test
    program design is required; such tests are called functional tests
  – Via a special fixture (set of connectors)
    That way each component's pins become accessible. Structural tests,
    which can be generated automatically, can now be used.
• Internal test (BIST)
  The ATE function is completely integrated on the to-be-tested chip.
  This requires extra silicon area; however, no ATE is required and
  the chip can be tested at speed.
4.1 Fault coverage requirements (1)
Given a chip with potential defects, the question can be raised: how
extensive do the tests have to be?
This question can be answered in terms of the chip's defect level and the
yield of the fabrication process.
• Defect Level ‘DL’ is the fraction of the parts passing all tests that are nevertheless bad
  – Values for DL are usually expressed in Parts Per Million ‘PPM’
• Process Yield ‘Y’ is the fraction of the manufactured parts that is
  fault free. The exact value is hard to establish. Therefore, Y is approximated
  as follows: $Y \approx \dfrac{\text{\# of non-defective parts}}{\text{total \# of parts}}$
• Fault Coverage ‘FC’ is a measure of the quality of a test. It is defined
  as: $FC = \dfrac{\text{actual \# of detected faults}}{\text{total \# of faults}}$
In practice it is impossible to have a complete test (FC=1), because of:
1. Imperfect fault modeling: An actual fault may not correspond with a
modeled fault
2. Data dependency of faults (e.g., the carry function in an ALU)
3. Testability limitations (e.g., ATE pin and/or speed limitations)
4.2 Fault coverage requirements (2)
Because tests may not be complete, a defective chip may pass the tests.
Assume that a chip has exactly n Stuck-At Faults ‘SAFs’
– A SA0 fault causes a 0 value on a line; a SA1 fault causes a 1 value
Let m be the number of detected faults ($m \le n$)
Assume that the probability of a fault is independent of the occurrence
of another fault (i.e., there is no fault clustering) and that all faults
are equally likely with probability p
Assume that: A is the event that a part is free of defects, and B that a
part has been tested for m defects while none were found. Then:
• The Fault Coverage of a test is defined as: $FC = m / n$
• The Process Yield is defined as: $Y = (1 - p)^n = P(A)$
• $P(A \cap B) = P(A) = (1 - p)^n$ and $P(B) = (1 - p)^m$
• $P(A \mid B) = P(A \cap B) / P(B) = (1 - p)^n / (1 - p)^m = (1 - p)^{n(1 - m/n)} = Y^{(1 - FC)}$
• DL can now be expressed as: $DL = 1 - P(A \mid B) = 1 - Y^{(1 - FC)}$
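To make the ratio FC = m/n concrete, the sketch below counts detected stuck-at faults on a toy circuit. Everything in it is hypothetical and chosen only for illustration: the circuit y = (a AND b) OR c, the line names, and the test vectors. It assumes a single stuck-at fault at a time.

```python
LINES = ["a", "b", "c", "d", "y"]  # d = a AND b, y = d OR c


def simulate(a, b, c, fault=None):
    """Evaluate y = (a AND b) OR c; fault is a (line, stuck_value) pair or None."""
    def apply_fault(line, value):
        if fault is not None and fault[0] == line:
            return fault[1]  # the line is stuck at this value
        return value

    a = apply_fault("a", a)
    b = apply_fault("b", b)
    c = apply_fault("c", c)
    d = apply_fault("d", a & b)
    return apply_fault("y", d | c)


def fault_coverage(tests):
    """FC = m / n: detected stuck-at faults divided by all modeled stuck-at faults."""
    faults = [(line, value) for line in LINES for value in (0, 1)]  # n = 10
    detected = [
        f for f in faults
        if any(simulate(*t) != simulate(*t, fault=f) for t in tests)
    ]
    return len(detected) / len(faults)


two_vectors = [(1, 1, 0), (0, 0, 0)]
print(fault_coverage(two_vectors))  # 0.7: SA1 on a, SA1 on b and SA0 on c are missed
```

A fault counts as detected when at least one test vector makes the faulty output differ from the fault-free output. On such a tiny circuit more vectors quickly close the gap, which is not representative of a real chip.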
4.3 Fault coverage requirements (3)
DL is expressed as (see figure): $DL = 1 - P(A \mid B) = 1 - Y^{(1 - FC)}$
For large values of Y (i.e., a manufacturing process with a high yield), it
approaches a straight line
Example: Assume a manufacturing process with Y = 0.5 and a FC =
0.8, then: $DL = 1 - 0.5^{(1 - 0.8)} = 0.1295$
This means that 12.95% of the shipped parts are defective!
If a DL = 200 PPM (i.e., DL = 0.0002) is required, given Y = 0.5, then:
$FC = 1 - \log(1 - DL) / \log Y = 0.99971$. This is a FC of 99.971%
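Both directions of the defect-level relation are easy to evaluate in code. The sketch below reuses the numbers of the example; the function names are chosen here for illustration.

```python
import math


def defect_level(process_yield, fault_coverage):
    """DL = 1 - Y^(1 - FC)."""
    return 1.0 - process_yield ** (1.0 - fault_coverage)


def required_fault_coverage(process_yield, target_defect_level):
    """FC = 1 - log(1 - DL) / log(Y), the relation above solved for FC."""
    return 1.0 - math.log(1.0 - target_defect_level) / math.log(process_yield)


print(f"DL = {defect_level(0.5, 0.8):.4f}")                # 0.1294, quoted above as 0.1295
print(f"FC = {required_fault_coverage(0.5, 0.0002):.5f}")  # 0.99971
```

The required fault coverage drops when the yield improves, because fewer defective parts reach the test in the first place.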
5.1 Test economics
Repair cost during the product phases
A move from one product phase to the next
causes the volume of parts and the test &
repair cost to increase by a factor of 10
This is the rule-of-ten
Economics and liability of testing. Good tests
• reduce test & repair cost (see above rule-of-ten)
• can reduce development time & time-to-market
• can reduce field maintenance costs
• reduce personal injury and law suits
There is an optimum in test development
cost and its contribution to profit:
Too many tests require a long test
development time and a high test cost
5.2 Total profit
The life time of a product has several economic phases
• The development phase
– Product design takes place
– No income; only expenses
– Area under zero-line is development cost
• The market growth phase
– Market acceptance increases with time
• The market decline phase
– Product becomes less attractive
– Market share decreases
– Price may have to be reduced
The total profit over the life time of a product is the area above the
zero-line (revenue) – area below the zero-line (development cost)
In case of a delay ‘D’ in product development, the development cost is
higher, while the revenue is reduced, because the obsolescence
point will not change
5.3 Product development delay cost
Assuming M is the maximum market growth, which is reached after
time W, the revenue lost due to a delay D (hatched area) can be
computed as follows:
• The Expected Revenue ‘ER’ is: $ER = \tfrac{1}{2} \cdot 2W \cdot M = W \cdot M$
• The Revenue of the Delayed Product ‘RDP’ is:
  $RDP = \tfrac{1}{2} \cdot (2W - D) \cdot \left( \dfrac{W - D}{W} \cdot M \right)$
• The Lost Revenue ‘LR’ is: $LR = ER - RDP$
  $LR = W \cdot M - \dfrac{2W^2 - 3D \cdot W + D^2}{2W} \cdot M = ER \cdot \dfrac{D \cdot (3W - D)}{2W^2}$
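The lost-revenue formula can be checked against the two triangle areas it is derived from. In the sketch below the values for W, D and M are made up (months and an arbitrary revenue unit), and the function names are hypothetical.

```python
def lost_revenue_from_areas(w, d, m):
    """LR = ER - RDP, computed from the two triangle areas."""
    expected_revenue = 0.5 * (2 * w) * m                  # ER = W * M
    delayed_revenue = 0.5 * (2 * w - d) * ((w - d) / w * m)  # RDP
    return expected_revenue - delayed_revenue


def lost_revenue_closed_form(w, d, m):
    """LR = ER * D * (3W - D) / (2 * W^2)."""
    return (w * m) * d * (3 * w - d) / (2 * w ** 2)


# Hypothetical numbers: market peak after W = 24 months, delay D = 3 months, M = 1.
w, d, m = 24.0, 3.0, 1.0
lr = lost_revenue_from_areas(w, d, m)
print(lr, lost_revenue_closed_form(w, d, m))
print(f"fraction of expected revenue lost: {lr / (w * m):.1%}")
```

With these assumed numbers a delay of one eighth of W costs roughly 18% of the expected revenue, which illustrates why the delay enters the result so strongly.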
5.4 Life-cycle cost
The cost of a product over its life time, consists of:
1. The design cost
This typically is on the order of 5% of the product cost
2. The manufacturing cost
This is the cost associated with the production and sales of the
product
3. The maintenance cost
The cost associated with repair, calibration, etc.
This may be the largest cost factor
Note: Product life is 30 years; e.g., for a telephone exchange