Stuck-at Fault Diagnosis
HT NO: 13N41D5709
CHAPTER – 1
FUNDAMENTALS
1.1 Field-programmable Gate Array:
A Field-programmable Gate Array (FPGA) is a configurable integrated circuit
that can implement an arbitrary logic design. An FPGA can be repeatedly programmed to
permit fast and inexpensive prototyping, product development, and design error
correction.
Prototyping, or emulation, is the modeling and simulation of all aspects of a
logic design whose final implementation is not FPGA-based, while product development
is the design, analysis, and test of a logic design whose final implementation is
FPGA-based. Because fabrication is required for Application-specific Integrated Circuits
(ASICs), the development time (time to production) of ASIC-based designs is greater
than a year and the Non-recurring Engineering (NRE) costs are tens of millions of
dollars.
However, because FPGA-based designs do not require fabrication (an FPGA is
simply programmed with a logic design), a much shorter development time and much
lower NRE results, which has led to the current popularity of FPGAs for product
development and prototyping. FPGA-based products have the additional benefit that
design errors discovered after product release can be corrected by simply reprogramming
the FPGA with the corrected design, called field-upgrading. Note that ASICs remain better
suited for high-volume designs, since the ASIC NRE cost, amortized per chip, is
typically less than the higher per-unit cost of an FPGA.
The size, functionality, and performance of an FPGA vary both within and
between device families (a series of differently sized FPGAs that share a common
architecture). A particular device family and size is chosen for implementation based on
the specifications of a logic design.
Although internal clock frequencies are slower than those for ASICs, the arrayed
structure, configurable logic elements, and abundant routing resources of an FPGA
permit highly parallel computation that yields increased throughput. Applications in
networking, Digital Signal Processing (DSP), graphics, and cryptography, among others,
can be efficiently implemented.
1.2 FPGA Manufacturing:
For correct functionality, the resources used to implement a logic design must be
defect-free. A defect is a physical imperfection, introduced during manufacturing, that
causes an FPGA to function incorrectly. A device is tested to detect defects, thereby
ensuring an Acceptable Quality Level (AQL), a measure of the number of defective
devices that escape manufacturing test (lower AQL means fewer defective outgoing
devices).
Although desired, there is no guarantee that a device is defect-free because it
passes manufacturing test: tests are imperfect and unable to detect all defects.
FPGA test is the responsibility of the manufacturer. An application-independent
FPGA is one whose configuration may change throughout its lifetime.
Because the manufacturer cannot know which logic and routing resources of an
application-independent FPGA a customer will use, and because literally billions of
configurations are possible (all of which cannot be tested), special techniques and
tools are required for test. In contrast, an application-dependent FPGA is one whose
configuration remains unchanged throughout its lifetime, useful for low- to
medium-volume or cost-sensitive logic designs.
Although most devices that fail manufacturing test are discarded, some are
diagnosed to determine the cause of failure. The information obtained from diagnosis
is used to make manufacturing improvements to increase yield. Yield, a measure of the
number of defect-free manufactured devices, is the hallmark of a good
manufacturing process: high yield means few defective devices.
However, FPGA diagnosis is difficult and time-consuming, primarily due to the
complexity of the configurable interconnection network. Like FPGA test, special
techniques and tools are required for diagnosis.
1.3 Defects and Faults:
As stated, a defect is a physical imperfection introduced during manufacturing
that causes device failure. However, the physical nature of a defect, which can have a
very complicated behavior, does not permit direct mathematical treatment of test and
diagnosis.
Thus, the logical effect of a defect is modeled by a fault, and detection or
identification of defects is accomplished by detection or identification of faults (with
the exception of defect-based test, which is the detection of defects by measuring
some device parameter, for example a particular current magnitude or voltage level).
Note that a flaw is also a physical imperfection; however, it causes device failure
only after some period of operation (a flaw degrades into a defect). Flaws, and the
screen required to eliminate them, are not discussed in this dissertation.
There are several common fault models: stuck-at, bridge, and delay, among
others. A stuck-at fault is a fixed logic value on a circuit node, either stuck-at 0 or
stuck-at 1, regardless of the value driven on that node.
A bridge fault is the logical behavior of a bridge, for example wired-AND or
wired-OR, caused by a short between signal lines that should not be connected.
Finally, a delay fault is an additional delay on some path or through some gate in a
circuit that causes timing failures.
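The stuck-at model in particular lends itself to direct simulation. A minimal sketch, using a hypothetical two-gate circuit (y = (a AND b) OR c) with made-up node names, shows how a fault is "detected" when the faulty and fault-free outputs differ for some input vector:

```python
# Hypothetical circuit: y = (a AND b) OR c. A stuck-at fault forces a node to a
# fixed logic value regardless of the value driven on it.
def evaluate(a, b, c, stuck=None):
    stuck = stuck or {}
    ab = stuck.get("ab", a & b)    # internal node, possibly stuck
    return stuck.get("y", ab | c)  # output node, possibly stuck

# a=1, b=1, c=0 drives node "ab" to 1, so "ab" stuck-at-0 changes the output:
assert evaluate(1, 1, 0) == 1
assert evaluate(1, 1, 0, stuck={"ab": 0}) == 0                  # fault detected
assert evaluate(0, 0, 1) == evaluate(0, 0, 1, stuck={"ab": 0})  # not detected by this vector
```

The last assertion illustrates why tests are imperfect: a vector that does not both activate the fault and propagate its effect to an output cannot detect it.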
Throughout this dissertation an attempt is made not to mix the usage of the
terms defect and fault. For example, in Chap. 3 the detection of delay faults, not
delay defects, is discussed, since the cause of the delay is irrelevant there.
1.4 Test:
Delay Fault Test:
The objective of test is to detect defects. Detecting delay faults, which cause
timing failures in an otherwise functioning FPGA, is difficult due to the slow test
speeds of contemporary Automated Test Equipment (ATE). At slow test speeds, path
slack (the difference between the time at which a signal must be stable and the time
at which the signal on a path actually stabilizes) is so large that even a delay fault
may not cause a failure, that is, may not drive the path slack negative.
However, the test configurations implement long paths of interconnects and logic
elements, which (1) fail if either one large delay fault (spot defect) or many small
delay faults (process variation) are present along the path (which may or may not be
desired), and (2) increase the likelihood of fault masking, or the inability to detect a
fault due to the presence of another.
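The slack argument can be sketched numerically; all delays below are hypothetical illustrative values, not measurements from any particular device:

```python
def path_slack(arrival_ns, required_ns):
    """Slack: time by which the signal stabilizes before it must be stable.
    Negative slack means a timing failure."""
    return required_ns - arrival_ns

# Hypothetical path: 5 ns fault-free delay, 7 ns with a 2 ns delay fault.
faulty_delay = 7.0
at_speed_period = 6.0    # clock period the design must actually meet
slow_ate_period = 50.0   # period at which slow ATE applies test vectors

assert path_slack(faulty_delay, at_speed_period) < 0   # fails at speed
assert path_slack(faulty_delay, slow_ate_period) > 0   # escapes the slow test
```

This is why a delay fault that would break the fielded design can pass a slow ATE test untouched.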
Bridge Fault Test:
A bridge is a short between signal lines that should not be connected, which
usually causes a fault current when activated (bridged circuit nodes are driven to
opposite logic values). This fault current can be observed by measuring the
steady-state current drawn by a device, IDDQ. IDDQ test is important for FPGAs because of the
large number of transmission-gate multiplexers in the interconnection network, in
which certain bridge faults can only be detected via IDDQ test.
However, the effectiveness of IDDQ test is limited due to the shrinking
geometries of deep sub-micron processes. Scaling of threshold voltages has caused
transistor leakage currents to increase, thereby exponentially increasing IDDQ but not
correspondingly increasing fault currents. It is shown that only 5 configurations, and
thus 5 IDDQ measurements (one measurement per configuration), can detect all
interconnect bridge faults in contemporary FPGAs. Additionally, the fault current
detectability (the size of the fault current compared to the magnitude of IDDQ) can be
increased by partitioning, or dividing, the nets of a test configuration into subsets,
each of which corresponds to a new, smaller test configuration.
Test Time Reduction:
The configuration time (the time required to program an FPGA with a
test configuration) consumes the majority of total test time, and therefore
dominates test cost. To significantly decrease the test time of an FPGA, a reduction
must be made in either the number of test configurations or the configuration time.
Several test configuration generation techniques have been proposed to
alleviate this test time problem by reducing the number of test configurations. Those
that consider faults in the interconnection network derive a reduced number of test
configurations using interconnect modeling and graph traversal algorithms. Rather
than reduce the number of test configurations, test time can instead be shortened by
reducing the configuration time (the time required to program an FPGA under test with
each test configuration).
However, one prior configuration-time technique assumes the configuration memory
is structured as a single scan chain. Furthermore, the specially designed test
configuration that is internally shifted within an FPGA detects only faults in the
logic resources.
Because the presented technique reduces only the configuration time, it can be
used in conjunction with those methods that reduce the number of test
configurations. A small amount of Design for Test (DFT) hardware, several 2-bit
multiplexers (R − 3 in total for an R × C FPGA), is added to the existing FPGA
hardware.
These multiplexers, which take advantage of the regular structure of both an
FPGA and each test configuration, permit parallel programming of test configuration
data, achieving a configuration time reduction of approximately 40%. By reducing
the configuration time, the number of test configurations can be increased to achieve
a more thorough FPGA test without increasing test cost.
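The trade-off can be sketched with a simple cost model; the millisecond figures below are hypothetical, chosen only to show the arithmetic:

```python
def total_test_time(n_configs, t_config_ms, t_apply_ms):
    """Total test time: per-configuration programming time dominates vector time."""
    return n_configs * (t_config_ms + t_apply_ms)

# Hypothetical times: 100 ms to program each configuration, 1 ms to apply its vectors.
base = total_test_time(n_configs=20, t_config_ms=100.0, t_apply_ms=1.0)
reduced = total_test_time(n_configs=20, t_config_ms=60.0, t_apply_ms=1.0)  # ~40% faster programming
assert reduced < base
# Or keep the original time budget and run more configurations for a more thorough test:
assert int(base // 61.0) > 20
```

The second assertion shows the alternative the text describes: the same time budget now accommodates more configurations, and hence a more thorough test, at no extra cost.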
1.5 Diagnosis:
Bridge Fault Location:
The objective of diagnosis is to locate or identify defects. Fault location is an
important part of fault diagnosis since it is usually necessary to locate a fault to
identify the underlying defect. Because the majority of an FPGA is its interconnection
network and due to the high density of the physical layout in deep sub-micron process
technologies, locating faults in the interconnection network is an important concern in
FPGAs.
By iteratively partitioning the nets of the test configuration for which the
FPGA failed differential IDDQ test and independently testing the FPGA with each
partition, fault-free nets can be progressively eliminated until only a faulty net
(one of the bridged nets) remains. The other bridged net is found by applying the
location process a second time using a complementary test configuration.
Very few configurations and IDDQ measurements are required, logarithmic in
the number of nets in the test configuration. Furthermore, because of the simplicity
of partitioning (accomplished by device configuration alone), the process can be easily
automated.
Automated Fault Diagnosis:
When the type of defect present is not known (unknown cause of failure), both
identification and location must be conducted. A common method is stuck-at fault
diagnosis (the stuck-at fault model is again chosen primarily for its simplicity), which
uses stuck-at faults to model the behavior of a defect, and identifies the smallest
possible set of fault candidates, or faults that could explain the failure.
With prior techniques, moreover, the set of test configurations and test vectors must be
regenerated for each FPGA architecture. Compared to the ad hoc diagnosis seen
in practice, which requires several days or weeks to diagnose a single device, the
diagnosis time of the presented method is reduced to only several hours; compared to
formal diagnosis techniques, the presented technique is capable of diagnosing any
fault detectable by the manufacturing test configurations, and is applicable to any
FPGA architecture in production.
CHAPTER – 2
FPGA Structure
2.1 Overview:
An FPGA is a two-dimensional array of blocks that are joined by an
interconnection network. Logic Blocks (LBs) contain the configurable hardware for
logic design implementation. Input/output Blocks (IOBs) are used for primary inputs
and outputs. Multiplier blocks (MBs) contain hardware to implement specialized
arithmetic.
Finally, RAM blocks (BRAMs) contain user-addressable memory arrays. Since
most blocks in an FPGA are logic blocks and very few are multiplier or RAM blocks,
an FPGA is basically a regular array of logic blocks.
The interconnection network consists of interconnects, switch matrices or
multiplexers, and buffers. An interconnect is a wire segment. A Switch Matrix (SM),
or programmable multiplexer, is used to join blocks.
Buffers are used in the same manner as they are for any interconnection network.
There are only a few types of tiles: a logic block and its associated switch matrix
make an LB tile while an input/output block and its associated switch matrix make an
IOB tile (for simplicity multiplier blocks and RAM blocks are not discussed). Local
routing (interconnects) is included in each tile.
An FPGA is organized as an array of R rows and C columns. One row, containing
C tiles, spans the entire width of the device. Similarly, one column, containing R tiles,
spans the entire height of the device.
Figure shows a simplified 5 × 5 FPGA with logic blocks in the center and
input/output blocks on the periphery: the upper left half shows the tile representation
while the lower right half shows the switch matrices, blocks, and a subset of the
interconnects.
Figure : 5 × 5 FPGA (only some interconnects shown)
2.2 Logic Resources:
A logic block contains the combinational and sequential elements needed to
implement arbitrary logic functions. In the Xilinx Virtex architecture—the same basic
architecture of most contemporary FPGAs—a logic block is subdivided into two smaller
units, called slices. Each slice contains two Look-up Tables (LUTs) and two bistables, as
well as some carry logic and several small multiplexers.
A single look-up table can implement any 4-input combinational logic function;
multiple look-up tables can be combined to implement larger functions. A look-up table
is a small array of memory, 16 × 1 bits, that stores the truth table of a 4-input Boolean
function (2^4 = 16 values). A bistable, used to implement sequential functions, can be
configured to operate as either a D latch or a D flip-flop.
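The LUT-as-memory idea can be shown directly: the four inputs form a 4-bit address into the 16-entry table, and the stored bit is the function value. A sketch (the XOR function is an arbitrary example):

```python
# A 4-input LUT is a 16x1 memory: the inputs address the table, the stored bit
# is the function value.
def make_lut(func):
    """Build the 16-entry truth table of a 4-input Boolean function."""
    return [func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1) for i in range(16)]

def lut_read(table, a, b, c, d):
    return table[(a << 3) | (b << 2) | (c << 1) | d]

xor4 = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)
assert len(xor4) == 16                  # 2**4 = 16 stored values
assert lut_read(xor4, 1, 0, 1, 1) == 1  # 1^0^1^1 = 1
```

Programming the FPGA amounts to writing these 16 bits into the LUT's configuration memory; any 4-input function is just a different table.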
2.3 Routing Resources:
A net is the concatenation of several interconnects, or wire segments, that form a
path between blocks. In contemporary FPGAs, interconnects are hierarchically
organized by length and direction. In the Virtex architecture, there are three lengths,
single, hex, and long, each routed either horizontally or vertically for a total of six
types. Single lines interconnect adjacent switch matrices in all four directions.
Hex lines interconnect every third and sixth switch matrix in all four directions.
Finally, long lines span the entire height or width of a device. The majority of an
FPGA is its interconnection network, constituting up to 80% of the die area and up to
8 metal layers.
Routing of nets is accomplished by switch matrices, which either join
interconnects to block inputs or outputs, or other interconnects. The configurability
of a switch matrix is achieved by Programmable Interconnect Points (PIPs). A PIP is
a pass transistor controlled by an associated memory cell that determines whether the
transistor is on (conducting) or off (non-conducting).
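A switch matrix can be modeled as a set of PIPs, each one a connection controlled by a single configuration memory cell. A minimal sketch with hypothetical wire and pin names:

```python
# A PIP is a pass transistor controlled by one configuration memory cell; a
# switch matrix is the collection of PIPs that are programmed "on".
class SwitchMatrix:
    def __init__(self):
        self.pips = {}  # (src, dst) -> memory cell value (True = conducting)

    def program(self, src, dst, on=True):
        self.pips[(src, dst)] = on

    def connected(self, src, dst):
        return self.pips.get((src, dst), False)

sm = SwitchMatrix()
sm.program("single_n", "lb_in0")  # hypothetical interconnect and pin names
assert sm.connected("single_n", "lb_in0")
assert not sm.connected("single_n", "lb_in1")  # unprogrammed PIP stays off
```

Routing a net is then a sequence of such programmed PIPs joining interconnects from a source block output to a destination block input.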
CHAPTER – 3
Delay Fault Test
3.1 Motivation and Previous Work
Detecting delay faults is difficult due to the slow test speeds of contemporary Automated Test
Equipment (ATE); however, it is important for two reasons: a delay fault can cause timing
failures in an otherwise functioning device, and it may result in an early-life failure (the device
prematurely stops functioning) if it is caused by a resistive open, a partially conducting circuit
path.
A study of microprocessors (manufactured in high volume) found that 58% of
customer-returned devices had some type of open defect, a significant portion of which were resistive
opens. The additional load increases the RC delay of the path under test, resulting in a negative
path slack if a resistive open is present.
Although effective, the method is only applicable to FPGAs whose switch matrices and
PIPs are not implemented as multiplexers (the XC3000 and XC4000 architectures are not
multiplexer-based; the Virtex, Virtex-II, and Spartan-3 architectures are multiplexer-based).
Long paths consisting of both logic elements and interconnects are configured to have
approximately equal fault-free propagation delays. A signal transition is simultaneously driven at
the input of each path, and the difference in the propagation delays is measured at the outputs by
means of an on-chip oscillator.
However, test configurations that contain long paths of interconnects and logic elements
will fail if either one large delay fault (spot defect) or many small delay faults (process shift) are
present along the path, which may or may not be desired. Additionally, the probability of fault
masking, or the inability to detect a fault due to the presence of another, is higher for longer paths.
3.2 Delay Fault Detection:
The presented delay fault test is applicable to all contemporary FPGA architectures. Several
paths under test, consisting of a set of interconnect and PIPs between two logic blocks, are
configured to have equal fault-free propagation delays. A race between the signals on the paths is
started at the first logic block and the difference in transition times of the signals at the second
logic block is observed.
If the difference is above a predetermined threshold (one signal transitions much later than
the others), a delay fault is detected. By configuring the shortest possible paths under test that do
not include logic elements (PIPs and interconnects only), the probability of fault masking (the
result of a delay fault on multiple paths) is minimized. The tiled structure of contemporary FPGAs
and the hierarchical organization of the interconnects easily facilitate the configuration of short
paths with nearly equal fault-free propagation delays.
Figure: Iterative Logic Array
A set is the group of paths under test between a pair of logic blocks: set_i is the group of 2N
paths between LB_(i−1) and LB_i (although not necessary, 2N, rather than N, is chosen to make the
number of paths even and simplify the analysis). The test of an FPGA, which is configured to
contain one or more ILAs, begins by applying a signal transition to the first logic block of each
ILA, LB_0. This causes LB_0 to create a race on its output set, set_1, that propagates to the next logic
block, LB_1. LB_1 subsequently observes the difference in transition times between the fastest and
slowest signals on set_1: if the difference is small, then no delay fault is present and LB_1 creates a
race on its output set, set_2; if the difference is large, then a delay fault is present and LB_1 outputs a
special fail signal (not a signal race) that propagates to the output of the ILA (primary output). In
this manner every path in every set along the ILA is tested. Multiple delay faults in each set can
be detected, provided at least one path in the set is fault-free.
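The per-set decision rule is just a comparison of the spread of arrival times against a threshold. A sketch with hypothetical arrival times (ns):

```python
def delay_fault_in_set(arrival_ns, threshold_ns):
    """A set fails when the slowest signal lags the fastest by more than the threshold."""
    return max(arrival_ns) - min(arrival_ns) > threshold_ns

# Hypothetical arrivals for 2N = 4 nominally equal-delay paths in one set:
fault_free = [5.0, 5.1, 4.9, 5.05]
one_slow = [5.0, 5.1, 4.9, 7.2]  # one path carries a delay fault
assert not delay_fault_in_set(fault_free, threshold_ns=1.0)
assert delay_fault_in_set(one_slow, threshold_ns=1.0)
```

Note the rule compares paths against each other, not against an absolute delay, which is why the paths in a set must have nearly equal fault-free delays, and why at least one fault-free path per set is needed as a reference.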
Bridge Faults :
To configure paths of equal fault-free propagation delay in an FPGA, equal-length wire
segments that are routed alongside each other are used. Therefore, the effect of a bridge fault
between paths under test must be considered. By driving the X_i paths to the logic value opposite
that of the Y_i paths, and interleaving the interconnects of the two groups such that each X_ij path
runs adjacent to a Y_ik path, as shown in the figure, all bridge faults between adjacent paths can
be detected.
Figure : Interleaved Paths
A bridge fault that causes the signal on at least one of the bridged paths in a set to transition
(for example a wired-AND or wired-OR bridge fault) creates an artificial race on that set that is
detected as a delay fault of infinite delay by the succeeding logic block. For example, given a
wired-AND bridge fault between paths X_ij and Y_ik, if X_ij is driven to logic-0 and Y_ik is driven to
logic-1, the logic value of Y_ik will be forced to logic-0 (0 · 1 = 0): Y_ik transitions from 1 to 0
while X_ij remains at logic-0. Similarly, given a wired-OR bridge fault between paths X_ij and
Y_ik, if X_ij is driven to logic-0 and Y_ik is driven to logic-1, the logic value of X_ij will be forced
to logic-1 (0 + 1 = 1): X_ij transitions from 0 to 1 while Y_ik remains at logic-1. Note that any
bridge fault that causes a signal transition on any bridged path will be detected.
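The two bridge behaviors reduce to AND/OR of the driven values, so the worked examples above can be checked directly:

```python
# Both bridged nodes read the same resolved value under each bridge model.
def wired_and(x, y):
    return x & y, x & y

def wired_or(x, y):
    return x | y, x | y

# Drive X to 0 and Y to 1 (opposite values activate the bridge):
assert wired_and(0, 1) == (0, 0)  # Y is forced 1 -> 0: an artificial transition
assert wired_or(0, 1) == (1, 1)   # X is forced 0 -> 1: an artificial transition
```

In either case exactly one of the bridged paths is forced to transition, creating the artificial race the succeeding logic block flags as an infinite-delay fault.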
3.3 Partitioning:
The magnitude of the total current (the background leakage current, plus the interconnect
leakage current, plus a possible fault current) increases with FPGA size due to the increased
number of transistors (PIPs). Unfortunately, the larger number of PIPs that contribute to the
leakage between opposite-valued interconnects (IDDQint) causes the total current to increase at a
much faster rate than the reference current. This difference, further aggravated by the shrinking
geometries of deep sub-micron process technologies, causes the signature current to increase,
making it more difficult to detect a bridge fault.
This scaling problem can be overcome by partitioning, or dividing, the nets (used
interconnects) of a test configuration for a large FPGA into subsets such that the total current for
each partition is roughly the same as it would be for a small FPGA (make the number of nets in
each partition roughly equal to the number of nets in a test configuration for a small FPGA).
Consequently, the signature current becomes invariant with respect to FPGA size.
Consider an FPGA and a test configuration whose used interconnects compose the set S.
This configuration, which tests for bridge faults between any interconnect in S and any interconnect
not in S, is partitioned into N smaller configurations, each testing only a subset of the
interconnects in S for bridge faults. Thus, instead of measuring one large total current for the
original configuration, N smaller total currents are measured, one for each of the N smaller
configurations.
For simplicity and without loss of generality, assume that each partition contains 1/Nth of
the interconnects in S. Because each partition also contains approximately 1/Nth the number of
PIPs, each interconnect leakage current is reduced by N compared to the interconnect leakage
current that results when the configuration is not partitioned.
Since the reference current does not change whether an FPGA is partitioned or not
(the reference current is still measured only once while all interconnects are driven to logic-1), each
signature current is also reduced by N. Therefore, the detectability ratio increases: the fault
current is more easily detected. Note that the partitions of S need not be disjoint: the only
requirement is that every interconnect of S belongs to at least one partition. Additionally, N
threshold values are now required.
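The scaling argument can be made concrete with a toy model; the current values below are hypothetical (chosen as powers of two so the arithmetic is exact), not characteristics of any real device:

```python
def detectability(n_nets, leak_per_net_uA, fault_current_uA, n_partitions=1):
    """Fault-current detectability: fault current relative to the interconnect
    leakage seen in one measurement. Partitioning divides the leakage among
    measurements; the fault current in the faulty partition does not shrink."""
    leakage_uA = (n_nets // n_partitions) * leak_per_net_uA
    return fault_current_uA / leakage_uA

# Hypothetical: 1024 nets leaking 2 uA each, against a 1000 uA fault current.
whole = detectability(1024, 2.0, 1000.0)
split = detectability(1024, 2.0, 1000.0, n_partitions=8)
assert split == 8 * whole  # partitioning into N subsets scales detectability by N
```

Each of the N measurements sees only 1/N of the interconnect leakage, while the full fault current still appears in whichever partition contains the bridge, which is exactly the detectability-ratio increase described above.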
CHAPTER – 4
Diagnosis
4.1 Bridge fault location:
Fault location is an important part of fault diagnosis since it is usually necessary to locate
a fault to identify the underlying defect. Because of the abundant logic and routing resources in an
FPGA (most of which remain unused during normal operation), a manufacturing defect can be
tolerated by locating the resulting fault, and avoiding the fault when mapping a logic design to the
device. Note that defect tolerance describes the fault location and fault avoidance process
implemented by the manufacturer prior to device operation while fault tolerance describes the
process implemented by the device during operation (for example concurrent error detection).
Fault Search
During each iteration of the fault location process, the nets of the fault-free partitions
(partitions that do not display an elevated signature current) are eliminated from the set S.
Consequently, two criteria must be met during each iteration of the fault search. First, to ensure
the iterative process converges to locate a bridged net, no net in S can be included in all partitions.
Second, without knowing the expected magnitude of the signature current for each partition, the
partitions of S must be balanced such that the interconnect leakage current for each partition is
approximately equal.
To first order this balancing is accomplished simply by equalizing the number of nets in each
partition; more precise balancing can be accomplished by explicitly equalizing the number of
PIPs, although experimental results show this is unnecessary. Therefore, simply by comparing the
signature currents of the two partitions during a particular iteration, the fault-free partition can be
identified and its nets eliminated from S. This reduced set S, containing approximately half
the nets it did in the previous iteration, is subsequently repartitioned, signature currents are
obtained, and the nets of a new, smaller fault-free partition are eliminated.
The figure shows the general fault location algorithm. In general, if S, which initially contains M
nets (|S| = M), is divided into N partitions during each iteration, then N⌈log_N M⌉ + 1
configurations and current measurements are required to locate the faulty net (the +1 is for the single
reference current measurement and the corresponding configuration). To locate multiple bridge
faults, the location algorithm is applied independently to each partition that displays an elevated
signature current during an iteration (multiple fault currents). The number of configurations and
current measurements remains logarithmic, provided a constant number of bridge faults are to be
located.
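For N = 2 the search is a binary search over the nets. A sketch with hypothetical net names, where `measure` stands in for configuring a partition and comparing its signature current against a threshold:

```python
import math

def locate_faulty_net(nets, measure):
    """Iteratively halve the candidate set; measure(partition) returns True if
    the partition shows an elevated signature current (contains the bridged net)."""
    s, measurements = list(nets), 0
    while len(s) > 1:
        half = s[: len(s) // 2]
        measurements += 2  # one signature current per partition, two partitions
        s = half if measure(half) else s[len(s) // 2:]
    return s[0], measurements

nets = [f"net{i}" for i in range(64)]  # hypothetical test-configuration nets
found, n_meas = locate_faulty_net(nets, lambda p: "net37" in p)
assert found == "net37"
assert n_meas == 2 * math.ceil(math.log2(64))  # N*ceil(log_N M), plus one reference
```

Sixty-four nets are resolved in six halvings (12 signature currents plus the single reference measurement), matching the N⌈log_N M⌉ + 1 count above.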
Finally, if location of the other net of a bridge fault is desired, for example to obtain more
information about the fault for diagnosis, the location algorithm is applied a second time to the
FPGA using an inverse test configuration, whose used interconnects are the unused interconnects
of the original configuration and vice versa. For example, given the original test configuration of
the figure (A, B, C, D, and E are entire nets, not individual interconnects), S = {B, D} and the
bridged net B is identified as the faulty net.
To also locate the bridged net C, the inverse test configuration with S′ = {A, C, E} is used,
as shown in the figure. Note that the second bridged net can also be located using a second test
configuration in which only the interconnects adjacent to the faulty net are used; however,
because layout information is required, the simpler inverse test configuration is preferred. For
example, a test configuration with S′ = {A, C} is sufficient to locate the bridged net C: net E
can be neglected because it is not adjacent to net B and thus unlikely to be bridged to net B.
(a) Original Test Configuration: nets B and D used; nets A, C, and E unused; bridge fault between nets B and C.
(b) Inverse Test Configuration: nets A, C, and E used; nets B and D unused.
Figure: Location of Both Nets of a Bridge Fault
4.2 Automated fault diagnosis:
Fault diagnosis is the process of determining the cause of a device failure, important
because the information obtained can be used to make manufacturing improvements to increase
yield. A common method is stuck-at fault diagnosis, which uses stuck-at faults to model the
behavior of a defect and identifies the smallest possible set of fault candidates, or faults that
could explain the failure (explain the output response of the defective device). Although the
stuck-at fault model is not an accurate model of real defects, for example a bridge, it is appealing
for diagnosis for several reasons. First, the simplicity of the stuck-at fault model permits
simple mathematical treatment of the faults considered during diagnosis.
Stuck-at Fault Diagnosis:
The main (and most challenging) part of previous diagnosis techniques is creating test
configurations, which is done because the existing manufacturing test configurations are not well
suited for diagnosis (they typically have very few primary outputs). The number of primary
outputs of a manufacturing test configuration is kept low for two main reasons: to reduce
pin-count requirements of the ATE, and to permit each test configuration to be used for differently
sized FPGAs that have different packages and pin-outs (in many test configurations all logic
blocks are simply joined to form a single long ILA that has a single primary output). In contrast,
the presented diagnosis technique is not based on generating a set of test configurations, but
instead takes advantage of the configurability of an FPGA to make use of the already existing
manufacturing test configurations (overcoming the problem of few primary outputs).
There are two primary benefits to using the existing manufacturing test configurations: the
diagnosis process is (1) very fast, and (2) applicable to any FPGA in production regardless of size or
architecture (provided manufacturing test configurations exist to test devices before being sold to
customers). First, compared to the ad hoc diagnosis done in practice, the diagnosis time is reduced
from days or weeks to only hours by eliminating the entire configuration generation process. Second,
compared to the diagnosis techniques that generate test configurations in advance, no additional
configurations must be created to diagnose FPGAs of different size or architecture.
Diagnosis Process:
The diagnosis of an FPGA consists of two main procedures:
(1) Identify fault candidates, and
(2) Systematically eliminate fault candidates.
Table lists all required steps in more detail. Fault candidates are identified in steps 1 through 4
and 7 (step 7 appears later in the process for implementation reasons, discussed in the fan-in cone
logic extraction section) and fault candidates are eliminated in steps 5, 6, 8, and 9. The entire
process (except step 10) is fully automated, producing a small set of fault candidates that are each
verified in step 10.
Table: Automated Fault Diagnosis Process

Step  Description                   Purpose
1     Device test                   Determine passing and failing test configurations
2     Readback verification         Eliminate failing configurations with faulty readbacks
3     Bistable location             Determine first failing bistables
4     Fan-in cone logic extraction  Determine set of logic fault candidates
5     Logic fault commonality       Eliminate logic fault candidates not common to all bistable-located configurations
6     Logic fault consistency       Eliminate logic fault candidates shown not to exist by passing configurations
7     Fan-in cone PIP extraction    Determine set of routing fault candidates
8     Routing fault commonality     Eliminate routing fault candidates not common to all bistable-located configurations
9     Routing fault consistency     Eliminate routing fault candidates shown not to exist by passing configurations
10*   Fault verification            Eliminate remaining fault candidates by independently testing for each

*Not an automated step.
Step 1: Device Test
To detect as many defects as possible during manufacturing test, the set of manufacturing
test configurations is developed to have a high fault coverage, defined as the percentage of faults
that can be detected out of all faults considered. A fault list, or list of faults to be detected (usually
all considered faults), guides this configuration generation process.
A particular test configuration is designed to detect a specific set of faults in the fault list;
however, as a byproduct the test configuration can also detect other faults. Consequently, the test
configuration is fault graded to determine all detectable faults, which are then dropped from the
fault list for the test configurations that remain to be generated.
Since it is known exactly which faults can be detected by each test configuration (each
was fault graded during manufacturing test configuration generation), the set of fault candidates
for the presented diagnosis process can be determined as the faults detectable by those test
configurations for which the FPGA failed (the manufacturing test configurations used to diagnose
the devices were fault graded for stuck-at faults on logic element inputs, logic element outputs,
and interconnects). Therefore, the first step in diagnosing an FPGA is to test the device with all,
or at least many, of these manufacturing test configurations to determine for which the device
fails and for which it passes.
This information is not collected during manufacturing test because an FPGA is rejected
immediately after it first fails (to reduce test time and therefore test cost); to avoid using
expensive ATE during this diagnosis step, inexpensive ATE can be used, as was done to obtain
the experimental results. A failing test configuration is one for which an FPGA fails, while a
passing test configuration is one for which an FPGA passes. Although more fault candidates are
identified when the device is tested with more test configurations (there are more failing
configurations), the number of fault candidates that can be eliminated also increases (there are
more passing configurations). Therefore, a finer resolution—fewer fault candidates—ultimately
results by maximizing the number of test configurations with which the device is tested.
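The candidate-identification idea in this step can be sketched as a set union over the fault-grading results. The configuration names and fault labels below are illustrative stand-ins, not values from the actual manufacturing test set:

```python
# Hypothetical fault-grading results: each manufacturing test
# configuration maps to the set of faults it is known to detect.
fault_grading = {
    "cfg_a": {"f1", "f2", "f3"},
    "cfg_b": {"f2", "f4"},
    "cfg_c": {"f5"},
}
failing = {"cfg_a", "cfg_b"}      # configurations for which the device failed

# The initial fault candidate set is the union of the faults detectable
# by every failing configuration.
candidates = set()
for cfg in failing:
    candidates |= fault_grading[cfg]

print(sorted(candidates))  # -> ['f1', 'f2', 'f3', 'f4']
```

Passing configurations (here cfg_c) contribute nothing to the candidate set at this stage; they are used later for elimination.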
Step 2: Readback Verification
After device test has identified the set of fault candidates (literally millions of candidates), the
next step is to systematically reduce this set. The first method is to locate the first bistable (or
bistables) in the FPGA to capture a faulty value—the first failing bistable —for each failing test
configuration (different bistables may be identified for each test configuration because each test
configuration either implements a different logic function or is a different mapping of the same
logic function to the FPGA). Bistable location is done using the special DFT hardware of
contemporary FPGAs that enables readback.
Readback is the serial scanning out of both the configuration data and the state of all
bistables in an FPGA (conceptually the inverse of configuration), which permits the state of the
device to be captured at any time. However, because the device being diagnosed is faulty,
readback may not function correctly; readback functionality must therefore be verified.
The result of capturing the device state via readback is a readback file, which contains
both the configuration memory bits and the bistable data bits of an FPGA. Any bit of a readback
file may be erroneous, or faulty. To continue the diagnosis process, it is necessary to determine
(1) which bits are faulty, and (2) how the faulty bit behaves (random or stuck-at). To identify a
readback bit that is stuck at some logic value (a stuck-at bit), a readback file is obtained from the
faulty device (the device being diagnosed) immediately after it is programmed with a test
configuration and compared to the readback file of a known-good device obtained in the same
manner. Because the state of both devices should match, any bits that differ between the faulty
and known-good read back files are faulty. If the logic value of those faulty bits does not change
when additional readback files of the faulty device are obtained (again immediately after
configuration), then those faulty bits are stuck-at bits. Stuck-at 1 bits are identified by
programming both devices with a special all-zero configuration (everything initialized to logic-0)
while stuck-at 0 bits are identified by programming both devices with a special all-ones
configuration (everything initialized to logic-1). Bits whose logic value changes when additional
readback files of the faulty device are obtained are random bits (readback is done in the exact
same manner each time, therefore each subsequent readback file should match the previous).
Faulty readback bits do not necessarily mean that the FPGA cannot be diagnosed using the
presented process: most readback bits will not be used to conduct bistable location anyway (only
about 2-3% of readback bits are meaningful in any given configuration). Only if there is a
stuck-at 0 or stuck-at 1 readback bit that corresponds to a bistable used in a particular failing test
configuration (the readback bit represents the data stored in that bistable) is that test configuration
discarded as unusable for locating the first failing bistable. A random readback bit does
not cause a test configuration to be discarded since a method of majority voting can be used
during bistable location.
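A minimal sketch of this bit classification, assuming readback files are modeled as equal-length bit strings; the function name and data are illustrative, and a real flow would use the dedicated all-zero and all-ones configurations described above:

```python
def classify_bits(good, faulty_readbacks):
    """Compare repeated readbacks of a faulty device against a known-good
    readback; return bit positions classified as stuck-at or random."""
    result = {"stuck-at-0": set(), "stuck-at-1": set(), "random": set()}
    first = faulty_readbacks[0]
    for i, g in enumerate(good):
        values = {rb[i] for rb in faulty_readbacks}
        if len(values) > 1:
            result["random"].add(i)       # value changes between readbacks
        elif first[i] != g:               # constant but differs from known-good
            result["stuck-at-0" if first[i] == "0" else "stuck-at-1"].add(i)
    return result

# Bit 1 is constantly wrong (stuck-at 1); bit 2 flips between readbacks (random).
print(classify_bits("1010", ["1110", "1100"]))
```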
Step 3: Bistable Location
When a fault is activated, a faulty response may be observed on a primary output; however,
fault activation may have occurred many clock periods before the faulty primary output is
observed (this is what makes the manufacturing test configurations difficult to use for diagnosis).
By obtaining readback files at various times during the test of an FPGA, the first bistable (or
bistables) to capture a faulty value (the result of fault activation) can be determined, thereby using
the bistable as a test observation point. Because the fault must lie in the fan-in cone of this first
failing bistable, which is defined as all logic and routing resources on any path traced backwards
from the input of that bistable to the output of another bistable or a primary input, a large number
of fault candidates can be eliminated (those candidates not in that fan-in cone can be eliminated).
Since each test configuration either implements a different logic function or is a different
mapping of the same logic function to the FPGA, each may use different logic resources and may
have a different routing of paths; therefore, a different first failing bistable may be identified for
each failing test configuration. To determine the first failing bistable, readback files of the faulty
FPGA are compared to those of a known-good FPGA, both programmed with the same test
configuration and both tested with the same test vectors, V vectors in total. If the fault has not
been activated after V vectors are applied, then the state of each device will match; once the fault
is activated, the state of each device will not match.
The number of vectors, V, to apply to each device is determined via divide-and-conquer.
First, V is set to one half the total number of vectors for the test configuration (one half of Vtot).
These V vectors, the first Vtot/2, are then applied to both devices, after which readback files are
obtained for both devices and compared: if they match, the number of vectors is increased by half
(V = V + Vtot/4); if they do not match, the number of vectors is reduced by half (V = V − Vtot/4).
Figure shows the full binary search algorithm to identify the first failing bistable (or bistables);
given Vtot vectors for a particular test configuration, the process requires ⌈log2 Vtot⌉ iterations.
Note that because FPGA programming and readback are slow, bistable location consumes most of
the time required for the presented diagnosis process. Diagnosis time is thus proportional to the
number of failing test configurations for which the first failing bistable is located, called
bistable-located test configurations.
for each failing test configuration
    set vectors V = ⌈Vtot/2⌉                          # start search at midpoint
    for k = 1 to ⌈log2 Vtot⌉                          # iterate
        test both devices with V vectors
        obtain readbacks for both devices
        if readbacks match
            increase vectors to V = V + ⌊Vtot/2^(k+1)⌋    # increase search space
        if readbacks do not match
            decrease vectors to V = V − ⌈Vtot/2^(k+1)⌉    # decrease search space

Figure: Bistable Location Algorithm
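The search in the figure can be sketched as runnable code. Here `readbacks_match` is a stand-in (an assumption for illustration) for testing both devices with V vectors and comparing readbacks, with the fault modeled as activating at a fixed vector index; the loop converges on the largest V whose readback still matches, so the fault activates on the next vector:

```python
import math

def locate_first_failure(v_tot, fault_vector):
    # Stand-in for "apply the first v vectors to both devices and compare
    # readbacks": the device states match until the fault activates.
    def readbacks_match(v):
        return v < fault_vector

    v = math.ceil(v_tot / 2)                      # start search at midpoint
    for k in range(1, math.ceil(math.log2(v_tot)) + 1):
        if readbacks_match(v):
            v += v_tot // 2 ** (k + 1)            # increase search space
        else:
            v -= math.ceil(v_tot / 2 ** (k + 1))  # decrease search space
    return v  # largest vector count whose readback still matches

print(locate_first_failure(v_tot=64, fault_vector=37))  # -> 36
```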
Although a known-good device can be tested ahead of time and the known-good readback
files can be stored, there is no reason to do so: no delay is incurred by testing the known-good
device in parallel with the faulty device (test both devices at the same time). Furthermore, it is
impractical to store all possible readback files for the known-good device because the required
disk space is impractically large: one readback file would be needed to store the state of the
known-good device after each additional vector is applied (one readback file for V = 1, one for
V = 2, up to one for V = Vtot), where each readback file is several megabytes in size.
However, if it was determined by the readback verification step that random readback bits
exist, then the faulty device can be tested multiple times with V vectors (an odd number of times)
and multiple readback files can be obtained per iteration. The 'correct' value of each random
readback bit during a particular iteration can be determined by a majority vote between the
multiple readback files and a 'correct' faulty readback file can be composed for comparison to the
known-good readback file. This majority vote technique was successfully used for both
diagnosed devices; it was found that 3 tests (configuration and application of V vectors) and
readbacks per iteration were sufficient.
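The vote itself is straightforward; a minimal sketch, assuming readback files are modeled as equal-length bit strings:

```python
def majority_readback(readbacks):
    """Compose a 'correct' readback by bitwise majority vote over an odd
    number of readback files (given as equal-length bit strings)."""
    assert len(readbacks) % 2 == 1, "use an odd number of readbacks"
    half = len(readbacks) // 2
    return "".join(
        "1" if sum(rb[i] == "1" for rb in readbacks) > half else "0"
        for i in range(len(readbacks[0]))
    )

# Bit 1 differs in the second readback and bit 2 in the third; the vote
# recovers the consistent value in each position.
print(majority_readback(["1010", "1110", "1000"]))  # -> 1010
```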
If a majority vote for random readback bits is not done then the bistable location algorithm
may identify an incorrect first failing bistable: the random readback bit may result in an incorrect
decision (increase or decrease in the number of vectors, V ) during readback comparison. Note
that as the number of tests and readbacks of the faulty device increases, the probability that the
bistable location algorithm will identify an incorrect first failing bistable decreases (unless the
logic value of the random readback bit has an equal likelihood of being either logic-0 or logic-1).
Additionally, because the faulty device is programmed multiple times and multiple readback files
are obtained (per iteration), the total diagnosis time increases (configuration and readback are
slow).
Step 4: Fan-in Cone Logic Extraction
By definition, the first failing bistable (or bistables) for each bistable-located test
configuration is the first bistable to capture a faulty value; all other bistables must have captured
fault-free values. Thus, the fan-in cone of each first failing bistable, defined as all paths traced
backwards from the input of the first failing bistable, through only routing and combinational
elements, to the output of a previous bistable (called a last passing bistable) or a primary input,
can be used to eliminate a large number of fault candidates since the fault must lie in this fan-in
cone. Figure shows the fan-in cone of a first failing bistable. Note that the output from the last
passing bistable and both the data and clock inputs to the first failing bistable are also potentially
faulty, since a fault on each could cause the bistable to capture a faulty value.
[Figure omitted: last passing bistables drive combinational logic and routing feeding the first
failing bistable; the fan-in cone enclosing these elements contains the fault candidates.]
Figure: First Failing Bistable Fan-in Cone
First, each bistable-located test configuration is converted from its Hardware Description
Language (HDL) syntax into graph notation to permit traversal of the fan-in paths of a particular
first failing bistable. For the experiments, all test configurations were originally specified in
Xilinx Description Language (XDL). In an XDL description, logic elements are explicitly
identified but routing elements (PIPs and interconnects) are only implicitly identified (discussed
in the fan-in cone PIP extraction section). Consequently, to simplify the implementation of the
diagnosis process, only logic fault candidates are extracted from the fan-in cone. Thus, in the
presented diagnosis process, logic fault candidates are considered first, followed by routing fault
candidates.
In the graph notation for a given test configuration, each vertex is an input or output of a
logic element and each edge is a net. To determine the set of logic fault candidates, each fan-in
path of the first failing bistable is traversed backwards to the last passing bistable (or primary
input), recording each vertex along the way.
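This backward traversal can be sketched over a toy netlist graph; the vertex names and graph encoding below are hypothetical, chosen only to illustrate the stopping conditions:

```python
def fanin_cone(drivers, bistables, primary_inputs, first_failing):
    """Walk backwards from the first failing bistable, collecting every
    vertex until a bistable output or primary input is reached.

    drivers maps each vertex to the list of vertices driving it."""
    cone, stack = set(), [first_failing]
    while stack:
        v = stack.pop()
        for d in drivers.get(v, []):
            if d in cone:
                continue
            cone.add(d)                       # d is a logic fault candidate
            if d not in bistables and d not in primary_inputs:
                stack.append(d)               # keep tracing through comb. logic
    return cone

# Toy netlist: ff_pass -> g1 -> g2 -> ff_fail, with primary input in_a
# also feeding g2.
drivers = {"ff_fail": ["g2"], "g2": ["g1", "in_a"], "g1": ["ff_pass"]}
print(sorted(fanin_cone(drivers, {"ff_pass"}, {"in_a"}, "ff_fail")))
# -> ['ff_pass', 'g1', 'g2', 'in_a']
```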
Step 5: Logic Fault Commonality
Up to this point each test configuration has been considered individually, each used to
individually test the FPGA, identifying a set of logic fault candidates for that particular test
configuration. If only a single fault is assumed to exist (one fault in the FPGA being diagnosed),
then there must be a logic fault candidate that is common to every failing test configuration (the
single fault must be the cause of failure for each test configuration). Thus, logic fault candidates
not common to all bistable-located test configurations can be eliminated. The number of logic
fault candidates that can be eliminated is proportional to the number of bistable-located test
configurations. For both diagnosed devices, it was found that bistable-locating only 2-3% of the
failing test configurations resulted in enough fault candidate elimination for successful diagnosis.
The aggregate set of logic fault candidates is the union of all logic fault candidates
identified by each bistable-located test configuration. If elimination yields a very small aggregate
set of logic fault candidates, say no more than 10 (chosen because it is a sufficiently small
number of fault candidates to be easily verified by the fault verification step), then routing fault
candidates need not be considered (the routing fault steps should still be conducted if routing
faults are suspected). If elimination yields an empty aggregate set of logic fault
candidates, then
(1) there are multiple logic faults,
(2) there is one or more routing faults,
(3) there are both logic and routing faults,
or (4) there is a defect in the faulty device that is not modeled by a single stuck-at fault.
So as not to eliminate possible logic faults (if multiple faults or an unmodeled defect exist), the
empty aggregate set of logic fault candidates is restored to contain all logic fault candidates and
the diagnosis process is continued.
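The commonality rule and the empty-set restoration reduce to set operations; the fault labels below are illustrative:

```python
# Hypothetical candidate sets from each bistable-located configuration.
per_config = [
    {"f1", "f2", "f3"},   # from bistable-located configuration 1
    {"f2", "f3", "f4"},   # from bistable-located configuration 2
    {"f2", "f5"},         # from bistable-located configuration 3
]

# Single-fault assumption: only candidates common to every configuration
# survive; an empty intersection triggers restoration of the full union.
common = set.intersection(*per_config)
aggregate = common if common else set.union(*per_config)
print(sorted(aggregate))  # -> ['f2']
```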
Step 6: Logic Fault Consistency
Test configurations for which the faulty device passes (passing test configurations) are as
important to the diagnosis process as those for which the device fails (failing test configurations),
since they identify fault-free resources. Logic fault candidates shown not to exist by these passing
test configurations can be eliminated. The amount of elimination is proportional to both the
number of passing configurations and the fault coverage of each.
Similar to the logic fault commonality step, if elimination yields a very small aggregate set of
logic fault candidates (say no more than 10), then routing fault candidates need not be considered.
In contrast, if all logic fault candidates are eliminated (all logic faults are shown not to exist by
the passing test configurations), then either (1) the fault being diagnosed is a routing fault, or (2)
there is an unmodeled defect in the faulty device. If, after the following routing fault steps are
conducted, no routing fault can be identified, the empty aggregate logic fault candidate set can
be restored to contain all logic fault candidates and fault verification (discussed in the fault
verification section) can be conducted on each (if the set is sufficiently small). For the diagnosed
device in Sec. 7.3 whose logic fault candidate set became empty after the logic fault consistency
step, restoration was not required: a routing fault was successfully identified.
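The consistency rule is a set difference against the fault-grading results of the passing configurations; labels below are illustrative:

```python
aggregate = {"f2", "f5", "f7"}                  # surviving candidates
detectable_by_passing = [{"f5"}, {"f7", "f9"}]  # fault-graded passing configs

# A candidate that a passing configuration would have detected cannot
# be the fault, so it is eliminated.
for detected in detectable_by_passing:
    aggregate -= detected
print(sorted(aggregate))  # -> ['f2']
```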
Step 7: Fan-in Cone PIP Extraction
In the presented diagnosis process routing faults are considered only if too many logic fault
candidates remain (a large number of logic faults were not eliminated). To determine the routing
fault candidates, each fan-in path of a particular first failing bistable (an edge in the graph
notation discussed in the fan-in cone logic extraction section) is expanded to reveal all the PIPs
and interconnects that compose that path. Note that it is not necessary to include fan-out paths of
a net. For example, consider the configuration shown in the figure, in which each interconnect is
denoted as I and each PIP as I → J, I ≠ J. The fan-in cone of the first failing bistable consists of
the path {A → B, B → C, C → D}, but the entire net also contains the fan-out path {B → X,
X → Y, Y → Z}: this fan-out path can be ignored because no fault exists along it (if a fault did
exist then the bistable at the end of that path would also be a first failing bistable).
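The example can be made concrete by encoding each PIP as a (driver, sink) pair of interconnects; tracing backwards from the interconnect feeding the first failing bistable naturally never enters the fan-out branch. The encoding is an illustrative assumption (each sink is driven by a single PIP):

```python
# The example net, with each PIP I -> J written as a (driver, sink) pair.
net_pips = [("A", "B"), ("B", "C"), ("C", "D"),
            ("B", "X"), ("X", "Y"), ("Y", "Z")]

def fanin_pips(pips, sink):
    """Trace backwards from the interconnect feeding the first failing
    bistable; fan-out branches are never reached."""
    by_sink = {s: d for d, s in pips}   # each sink has a single driver
    path, node = [], sink
    while node in by_sink:
        driver = by_sink[node]
        path.append((driver, node))
        node = driver
    return path

print(fanin_pips(net_pips, "D"))
# -> [('C', 'D'), ('B', 'C'), ('A', 'B')], fan-out {B->X, X->Y, Y->Z} ignored
```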
Because the XDL specification of a test configuration describes a net by the set of PIPs it
contains and only implicitly identifies the interconnects (PIP I → J implicitly identifies
interconnects I and J), it is sufficient to extract only the PIPs in the fan-in cone of each first
failing bistable. Thus, to determine the set of routing fault candidates, each fan-in path of the first
failing bistable is traversed backwards to the last passing bistable (or primary input), recording
each PIP along the way. Note that the routing fault candidates correspond to stuck-at faults on the
input and output of each PIP (each input or output of a PIP is an interconnect, logic element input,
or logic element output).
Step 8: Routing Fault Commonality
Elimination of routing fault candidates via routing fault commonality is analogous to
elimination of logic fault candidates via logic fault commonality. Again, if only a single fault is
assumed to exist (one fault in the FPGA being diagnosed), then there must be a routing fault
candidate that is common to every failing test configuration. Thus, routing fault candidates not
common to all bistable-located test configurations can be eliminated. The number of routing fault
candidates that can be eliminated is proportional to the number of bistable-located test
configurations.
The aggregate set of routing fault candidates is the union of all routing fault candidates
identified by each bistable-located test configuration. If elimination yields a very small aggregate
set of routing fault candidates (say no more than 10), then diagnosis can immediately conclude
with the fault verification step.
Step 9: Routing Fault Consistency
The final automated step in the presented diagnosis process is to eliminate routing fault
candidates shown not to exist by the passing test configurations. Since a PIP implicitly identifies
two interconnects, if it is shown to be fault-free then the two interconnects it joins must also be
fault-free (for example if PIP I → J passes in some test configuration, then the PIP I → J and
interconnects I and J are fault-free, and can be eliminated as routing fault candidates). The
number of routing fault candidates that can be eliminated is proportional to both the number of
passing configurations and the fault coverage of each.
If all logic fault candidates were eliminated after the logic fault commonality and consistency
steps (an empty aggregate set of logic fault candidates resulting from the single fault assumption),
then the only way all routing fault candidates can be eliminated during this step is if there is an
unmodeled defect in the faulty device. If this occurs, the aggregate set of routing fault candidates
is restored to contain all routing fault candidates and fault verification is conducted on each (if
the set is sufficiently small).
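The PIP-implies-interconnect rule from this step can be sketched directly; the string encoding of candidates is an illustrative assumption:

```python
candidates = {"pip:C->D", "int:C", "int:D", "pip:W->C", "int:W"}
passing_pips = [("C", "D")]   # PIPs shown fault-free by passing configs

# A fault-free PIP I->J also clears the two interconnects I and J it joins.
for i, j in passing_pips:
    candidates -= {f"pip:{i}->{j}", f"int:{i}", f"int:{j}"}
print(sorted(candidates))  # -> ['int:W', 'pip:W->C']
```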
Step 10: Fault Verification
The fault verification step identifies which of the remaining fault candidates (those fault
candidates not eliminated) are actually faulty. However, unlike the preceding steps, fault
verification is not easily automated, since it involves generating a new test configuration (or
multiple configurations) that independently tests each fault candidate (if possible). Because this
is most easily done by hand, the fault verification step was not automated and is therefore
not discussed further.
A new test configuration can be created by modifying an existing manufacturing test
configuration—one that already tests for the particular fault candidate that is to be verified.
The manufacturing test configuration is modified to contain only the resource identified by the
fault candidate (for example a PIP or a look-up table) and a minimal amount of resources that
permit control and observation of the logic value at the fault candidate location (for example PIPs
and interconnects shown to be fault-free by passing test configurations).
Control can be achieved by creating a path from a primary input to the fault candidate location
while observation can be achieved by creating a path from the fault candidate location to a
primary output. Note that control and observation can also be achieved by creating a path from
some bistable to the fault candidate location, creating a path from the fault candidate location to
another bistable, and using readback to control and observe the bistable inputs and outputs,
respectively.
Verifying a fault candidate is done by independently testing for that fault—controlling and
observing the logic value at the fault location while not activating any of the other fault
candidates. Thus, when a fault candidate is tested (one at a time, each with its own test
configuration) and a failure is observed, and assuming all other elements used to create the
control and observation paths are fault-free, the failure can only be the result of that fault
candidate and not of any others (or of an unmodeled defect). Therefore, a fault
candidate verified in this manner is said to be an actual fault (no longer a candidate).
Although logic fault verification is simply done by independently testing for each logic fault
candidate, routing fault verification requires a little more care since both PIPs and interconnects
must be considered.
For example, let the aggregate set of routing fault candidates of an FPGA with a single
routing fault be PIPs B → C, C → D, and W → C, identified via bistable location of the two
manufacturing test configurations shown in the figure (potentially-faulty PIPs are bold). Clearly the
common PIP C → D may be faulty; however, either interconnect C or D may instead be faulty.
Thus, new test configurations used for routing fault candidate verification must test for PIPs and
interconnects independently.
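A simple heuristic for ordering this verification, mirroring the Device 2 result below where one interconnect was common to 3 of 9 suspect PIPs, is to count interconnect occurrences among the remaining PIPs and test the most shared one first (a sketch with illustrative data):

```python
from collections import Counter

# Remaining potentially-faulty PIPs, each as a pair of interconnects.
suspect_pips = [("B", "C"), ("C", "D"), ("W", "C")]

# Count how often each interconnect appears among the suspect PIPs and
# test the most shared one first.
counts = Counter(i for pip in suspect_pips for i in pip)
print(counts.most_common(1)[0])  # -> ('C', 3)
```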
Logic Fault Diagnosis
Device 1 failed roughly 30% of the several hundred manufacturing test configurations (device
test, step 1), of which 5 were determined to be usable for bistable location (readback verification,
step 2). The same first failing bistable was identified for each test configuration (bistable location,
step 3), resulting in 34 logic fault candidates (fan-in cone logic extraction, step 4). After logic
fault candidate elimination (steps 5 and 6) only 9 logic fault candidates remained; therefore, the
routing fault steps were skipped and fault verification (step 10) was conducted.
Eight of the 9 logic fault candidates were stuck-at 0 and stuck-at 1 faults on the 4 address
inputs of a look-up table while the ninth was a stuck-at 1 fault on the write select enable (ws en)
input of the same look-up table. By testing for each logic fault independently (fault verification,
step 10), it was shown that only the ninth fault candidate is actually a fault.
4.3 Routing Fault Diagnosis:
Device 2 failed roughly 25% of the several hundred manufacturing test configurations (device
test, step 1), of which 4 were determined to be usable for bistable location (readback verification,
step 2). Four different first failing bistable s were identified (bistable location, step 3), resulting in
273 logic fault candidates (fan-in cone logic extraction, step 4). After logic fault candidate
elimination (steps 5 and 6), no logic fault candidates remained; therefore, it was assumed that the
fault being diagnosed is a routing fault.
The fan-in cones of the 4 first failing bistables identified 83 routing fault candidates (fan-in
cone PIP extraction, step 7), of which 27 were common to all bistable-located test configurations
(routing fault commonality, step 8). Only 9 routing fault candidates were not eliminated by the
passing test configurations (routing fault consistency, step 9), yielding a small set of fault
candidates for fault verification.
One interconnect was common to 3 of the 9 potentially-faulty PIPs; therefore, it was logical to
test this interconnect first during fault verification (step 10), which revealed that it was stuck-at 1.
Furthermore, independently testing each of the other routing fault candidates showed that they did
not correspond to actual faults.
CHAPTER – 5
BUILT IN SELF TEST
5.1 INTRODUCTION:
We present a Built-In Self-Test (BIST) approach able to detect and accurately diagnose all
single and practically all multiple faulty programmable logic blocks (PLBs) in Field
Programmable Gate Arrays (FPGAs) with maximum diagnostic resolution. Unlike conventional
BIST, FPGA BIST does not involve any area overhead or performance degradation. We also
identify and solve the problem of testing configuration multiplexers that was either ignored or
incorrectly solved in most previous work. We introduce the first diagnosis method for multiple
faulty PLBs; for any faulty PLB, we also identify its internal faulty modules or modes of
operation. Our accurate diagnosis provides the basis for both failure analysis used for yield
improvement and for any repair strategy used for fault-tolerance in reconfigurable systems. We
present experimental results showing detection and identification of faulty PLBs in actual
defective FPGAs. Our BIST architecture is easily scalable.
Advances in VLSI technology have resulted in more defects affecting the delays in the circuit,
thus increasing the importance of delay-fault testing. The current FPGA manufacturing testing
practice relies on configuring the FPGA with many designs and running them at speed. This is
useful for speed-binning, but it may not be a comprehensive delay-fault test. One of the main
difficulties in the problem of delay-fault testing for FPGAs originates in the fact that users’
circuits are not known when the chip is manufactured. While for an Application-Specific
Integrated Circuit (ASIC) the target clock frequency is known before the chip is fabricated, it is
impossible for an FPGA manufacturer to test the circuits that will be implemented on each device.
Previous work in FPGA delay-fault testing considered only the testing of the user’s circuit.
However, such an approach is problematic both for FPGA manufacturers, who want to ship
defect-free chips, and for users, who want to purchase defect-free parts (otherwise, system-level
debugging becomes very difficult if design problems are intermixed with manufacturing faults).
Testing only the user’s circuit is also unsafe for on-line testing, because new faults are equally
likely to occur in spare resources or in the currently unused portion of the operational part of the
system. These dormant faults may accumulate and have a detrimental effect on the system
reliability since these spare and unused resources will be used in fault-tolerant applications to
replace defective resources. Thus to guarantee a reliable operation, an on-line test must
completely check all the system resources, including spares.
In testing FPGAs, we can take advantage of their unique properties - reconfigurability, partial
reconfigurability, and regular structure - to achieve features that cannot be realized in ASIC
testing. Unlike in ASICs, where BIST requires large area overhead and some performance
degradation, FPGA BIST can be done with no area overhead and no delay penalty. In FPGAs, we
can pseudo-exhaustively test both the programmable logic blocks (PLBs) and the programmable
interconnect network, during both off-line and on-line testing with reasonable test application
times. Pseudo-exhaustive testing is practically complete and removes the need to evaluate the test
quality. Thus, pseudo-exhaustive FPGA BIST approaches do not require either Automatic Test
Pattern Generation (ATPG) or fault simulation, which are computationally expensive tasks for
ASICs. Accurate fault diagnosis is difficult to achieve in ASICs, but it can be done efficiently in
FPGAs. In ASICs, fault tolerance for manufacturing faults cannot be accomplished without
massive redundancy, but FPGAs allow efficient fault tolerance without dedicated spare resources
as a result of the ability to reconfigure around faults.
Our first attempt to perform FPGA delay-fault testing was to run our BIST configurations for
logic and interconnect at the specified clock rate for the target FPGA. However, we quickly
learned that this approach cannot work without significantly increasing the number of BIST
configurations that must be applied to the FPGA. The number of configurations is the dominant
factor determining the total test time, because the configuration download time is several orders
of magnitude larger than the test-pattern application time. To minimize the total number of
configurations, our BIST techniques try to test as many resources as possible within the same
configuration. For off-line PLB test, this requires distributing the patterns from the test-pattern
generator (TPG) to many PLBs under test using long signal paths. Similarly, for interconnect test,
the wires under test are long, connecting many wire segments and programmable switches. Under
these conditions, we must run the BIST configurations at a frequency much lower than that used
in normal operation, so most delay-faults will not be detected by these tests. An alternative
would be to reduce the amount of logic and/or interconnect that is under test during any given
BIST configuration to allow BIST execution at a much higher clock frequency; however, this will
require many more configurations, and ultimately significant increases in testing time and cost.
5.2 THE BIST ARCHITECTURE:
The strategy of our FPGA BIST approach is to configure groups of PLBs as test pattern
generators (TPGs) and output response analyzers (ORAs), and another group as blocks under test
(BUTs), as illustrated in Figure 1a. The BUTs are then repeatedly reconfigured to test them in all
their modes of operation. We refer to the test process that occurs for one configuration as a test
phase. A test session is a sequence of test phases that completely test the BUTs in all of their
modes of operation. Once the BUTs have been tested, the roles of the PLBs are reversed so that
in the next test session the previous BUTs become TPGs or ORAs, and vice versa. Since half of
the PLBs are BUTs during each test session, we need only two test sessions to test all PLBs in
the FPGA. Figures 1b and 1c show the floorplans for the two test sessions, called NS and SN.
The name of a session denotes the direction of the flow of test patterns during that
session. The floorplan for SN is obtained by flipping the floorplan for NS around the horizontal
axis shown as a dotted line in the middle of the array. This changes the roles of the PLBs so that
every PLB is under test in one of the two test sessions. Note that all BUTs are tested in parallel
and that patterns may be applied at-speed.
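The two-session scheme can be sketched as follows. The eight-row role assignment below is a hypothetical stand-in, not the paper's exact floorplan, but it shows the invariant that matters: SN is the NS floorplan flipped about the horizontal axis, half of the PLBs are BUTs in each session, and the two BUT sets together cover every PLB:

```python
# Hypothetical per-row roles for an 8-row array in session NS.
NS = ['TPG', 'BUT', 'ORA', 'BUT', 'ORA', 'BUT', 'ORA', 'BUT']
# Session SN: the NS floorplan flipped about the horizontal mid-line.
SN = list(reversed(NS))

def covers_all_plbs(session_a, session_b):
    """True if every row is a block under test in at least one session."""
    return all(a == 'BUT' or b == 'BUT' for a, b in zip(session_a, session_b))
```

Only two sessions are ever needed, regardless of array size, because the flip makes the two BUT sets complementary.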
Figure 1: The BIST architecture. a) TPG, BUT, and ORA connections (BIST Start/Reset and
BIST Done signals, Pass/Fail scan chain through the ORAs); b) floorplan for the first test
session (NS).
Each test phase consists of the following steps: 1) reconfigure the FPGA, 2) initiate the test
sequence, 3) generate test patterns, 4) analyze output responses, and 5) read the test results. In
step 1, the test controller (ATE for wafer/package testing; CPU or maintenance processor for
board/system testing) interacts with the FPGA(s) under test to reconfigure the logic by retrieving
a BIST configuration from the configuration storage (ATE memory; disk) and loading it into the
FPGA(s). The test controller also initializes the TPGs, BUTs, and ORAs and initiates the BIST
sequence and reads the subsequent Pass/Fail results. Steps 3 and 4 are concurrently performed by
the BIST logic within the device. After the board or system-level BIST is complete, the test
controller must reconfigure the FPGA for its normal system function; hence the normal device
configuration must be stored along with the BIST configurations. The test application time is
dominated by the FPGA reconfiguration time. Since the total test and diagnosis time is a major
factor in system down time (and hence system availability) and cost, an important goal of our BIST
approach is to minimize the number of configurations used for test and diagnosis.
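The five per-phase steps can be sketched as a controller loop. The three callables are placeholders for whatever interface the ATE or maintenance processor actually provides; they are assumptions for illustration:

```python
def run_test_session(phases, load_configuration, run_bist, read_results):
    """Drive one test session: for each phase, reconfigure (step 1),
    run the BIST sequence (steps 2-4), and read Pass/Fail (step 5)."""
    results = []
    for phase in phases:
        load_configuration(phase)            # step 1: dominates test time
        run_bist(phase)                      # steps 2-4: on-chip, concurrent
        results.append(read_results(phase))  # step 5: retrieve Pass/Fail
    return results
```

The structure makes the cost model explicit: the on-chip steps run at speed, while step 1 repeats a serial configuration load once per phase.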
The figure below illustrates the typical structure of a PLB, consisting of a memory block that can
function as a look-up table (LUT) or RAM, several flip-flops (FFs), and multiplexing output
logic. The LUT/RAM block may also contain special-purpose logic for arithmetic functions
(counters, adders, multipliers, etc.). The RAM may be configured in various modes of operation
(synchronous, asynchronous, single-port, dual-port, etc.). The FFs can also be configured as
latches, and may have programmable clock-enable, preset/clear, and data-selector functions. Our
strategy relies on pseudo-exhaustive testing, which in this context means that every subcircuit of
a PLB is tested with exhaustive patterns in each one of its modes of operation [1]. The memory
block is checked with RAM test sequences which are exhaustive for faults specific to RAMs [12].
Note that all three subcircuits of a PLB are easily controllable and observable from the PLB’s I/O
pins, and that exhaustive testing of every module is feasible since the number of inputs is
reasonably small.
This results in practically complete fault coverage without explicit fault model assumptions
and without fault simulation. For example, all single and multiple stuck-at faults, as well as all
faults (of any type) that do not increase the number of states, are guaranteed to be detected; the
overwhelming majority of faults that do increase the number of states are also detected. Thus for
any practical purpose the PLB logic test is complete. In every test phase, a BUT is configured in a
different mode of operation; hence its pseudo-exhaustive test is spread across the phases of a test session.
Figure: Typical PLB structure (LUT/RAM block, FFs, and output logic).
The figure below shows the structure of an ORA comparing four pairs of BUT outputs. The signal
BuOi (BdOi) is the i-th output from the BUT above (below) the ORA. The FF stores
the result of the comparison, and the feedback loop latches the first mismatch in the FF. This
represents a compression of the results, since any number of mismatches in the same phase
translates into one error. The result FFs of all ORAs are connected to form a scan chain
(indicated by the dotted lines in Figure 1a). In every test phase, the scan chain is also tested, to
ensure the integrity of the test results. Our previous BIST approach [40] recorded only one
Pass/Fail result for every row of ORAs. However, storing the test results of each ORA cell
significantly improves the diagnosis resolution achievable by this architecture, as we will show
in Section 5. For an N×N FPGA, the number of ORA cells is NORA = (N^2/2) - N.
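The ORA's compare-and-latch behavior and the ORA-cell count follow directly from the description above. The sketch below is an illustrative software model, not the hardware itself:

```python
def ora_result(bu_outputs, bd_outputs):
    """Latched Pass/Fail FF of one ORA: True once any pair mismatches.
    Any number of mismatches in a phase compresses into one error bit."""
    mismatch = False
    for bu, bd in zip(bu_outputs, bd_outputs):
        mismatch = mismatch or (bu != bd)  # feedback loop keeps it latched
    return mismatch

def ora_cell_count(n):
    """ORA cells in an N x N array, per the formula above: N^2/2 - N."""
    return n * n // 2 - n
```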
Figure: Integrated ORA/scan cell, comparing BuO1-BuO4 against BdO1-BdO4, with scan-in from
the previous ORA and TDI/TCK scan control.
Testing Configuration Multiplexers:
A configuration multiplexer (MUX) is a commonly used hardware mechanism that selects
sub circuits for various modes of operation. A configuration MUX is controlled by
configuration memory bits to select one input to be connected to its output. In Figure 4a, assume
that we set the configuration bit S to 0 to connect V0 to X. Then the subcircuit producing V1
disappears from the circuit model seen by the user. This is correct from a design viewpoint,
because the value V1 can no longer affect X in the current configuration. But from a testing
viewpoint, any test for the MUX needs to set V0 and V1 to complementary values. In
general, for a MUX with k inputs, if V is the value of the selected input, all the other k-1 inputs
should be set to the complement of V.
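The required stimulus for a k-input configuration MUX can be sketched as follows (an expository helper, not part of the BIST hardware): when the selected input carries value v, every inactive input must carry the complement, so that a wrong selection or a short would flip the observed output.

```python
def mux_test_inputs(k, selected, v):
    """Values to drive on all k MUX inputs for one test:
    the selected input gets v, every inactive input gets 1 - v."""
    return [v if i == selected else 1 - v for i in range(k)]
```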
Our solution relies on separately configuring the invisible logic so that it will generate the
proper values needed for the inactive MUX inputs. Then we “overlay” the resulting
configuration files over the main configuration file with the active logic, and we “merge” them
without changing any MUX setting done in the main configuration. This process is conceptually
simple, but its implementation requires knowledge of the FPGA configuration stream structure.
Note that in most previous work dealing with testing FPGAs, the problem of testing a
configuration MUX is either not addressed or it is “solved” functionally, by connecting every
input in turn to the output, and providing both 0 and 1 values to the selected input. However, the
invisible logic driving the inactive inputs is completely ignored. Hence prior claims of
“complete testing” may not be valid since the testing of every configuration MUX in the FPGA
is likely to be incomplete. Methods that did provide complete tests for configuration
multiplexers used models that did not remove the invisible logic. But such models
cannot be used with the existing FPGA CAD tools to generate configuration files.
Testing Memory Blocks:
For the LUT/RAM block of a PLB, we first test its RAM mode of operation. We configure
the TPGs to apply a march test, which detects, among other faults, all the stuck-at faults in
memory cells, as well as all faults in the address and read/write circuitry of the RAM. We
rely on this RAM-mode test as the major test of the memory block, so that subsequent tests
for different modes of operation of the same block do not need to retarget the already
detected faults. When we test the LUT operation, we configure alternating XOR/XNOR
functions for the LUT outputs during one test phase, and alternating XNOR/XOR functions
during a second test phase. For any combinational function of n inputs, the TPG will apply
all 2^n vectors.
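The LUT-mode test described above can be sketched as a software model: program the LUT as XOR (or XNOR in the second phase) and apply all 2^n vectors, comparing against the expected parity. This is an illustrative checker, not the on-chip TPG/ORA logic:

```python
def exhaustive_lut_test(lut, n, invert=False):
    """Apply all 2**n vectors to an n-input LUT modeled as a function of a
    bit tuple; invert=False checks the XOR phase, invert=True the XNOR phase."""
    for v in range(2 ** n):
        bits = tuple((v >> i) & 1 for i in range(n))
        expected = (sum(bits) % 2) ^ (1 if invert else 0)
        if lut(bits) != expected:
            return False  # a hardware ORA would latch this mismatch
    return True
```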
Unlike dedicated RAM BIST logic in an ASIC, in an FPGA the BIST logic disappears after the RAM test
session. All embedded RAMs may be tested in parallel, possibly sharing the BIST controller.
If the FPGA has several identical RAM modules, we can feed them with the same patterns,
and use comparators to check mismatches between corresponding outputs.
Scalability of the BIST Approach:
Our BIST architecture has a very regular, easily scalable structure, automatically
generated by a simple procedure (that also does algorithmic placement and routing), based
on the dimensions (N×N) of the FPGA array. Since an ORA compares the outputs of its two
neighbor BUTs, all signals from BUTs to ORAs use only local routing resources, which are
oblivious to the size of the FPGA. Global routing is used to distribute the patterns generated
by TPGs to BUTs. Ignoring fanout load limitations, adding rows and columns to an array of
PLBs will just extend the length of the vertical and horizontal global lines fed by TPG
outputs; hence the usage of the global routing resources required for distributing the TPG
patterns does not change with the FPGA size. In the ORCA FPGA, we use the bidirectional
drivers available in the local routing surrounding a PLB to redistribute incoming TPG
signals, thus avoiding fanout overload. The following analysis is for FPGAs where such
drivers are not available. If k is the number of cells used by one TPG (k=4 in our
implementation), a TPG row has NTPG = N/k TPGs. (For simplicity we assume that N is
always a multiple of k.) We divide the BUTs into NTPG subsets of k columns and we feed
each such subset from its closest two TPGs. Then each TPG output drives kN/2 BUTs. This
shows that the loading grows only linearly with N, which is the square root of the size of the
FPGA. Nevertheless, if the loading may not grow beyond a given limit Lmax, then the largest N
for which the BIST architecture is feasible is Nmax = 2Lmax/k (note that Lmax depends on
the BIST clock frequency). When N grows over this limit, we divide the FPGA into four
quadrants, such that it is feasible to implement the BIST architecture separately in each
quadrant. For example, if Nmax=8, then a 16×16 FPGA is divided into four 8×8 quadrants.
All quadrants are tested concurrently. However, the scan chains of each quadrant will be
connected into a single scan chain, so that the result retrieval time grows linearly with the
size of the FPGA.
We emphasize that increasing N does not affect the number of test sessions, which is
always two. The number of test configurations (phases) depends only on the structure and
the modes of operation of the PLB, and it is independent of N (the dependence shown in
Figure 6 of [20] for the BIST method is a result of incorrect assumptions). Since all BUTs
are tested in parallel, the BIST execution time is also independent of the size of the FPGA.
In addition to the time for scanning out the results, the only test-time dependence on the size
of the array is the reconfiguration time for each test phase. This is inherent in all
currently available FPGAs, for which configuration loading is a serial process. One previously
described method has a test time that does not depend on N, but it assumes a parallel loading
mechanism that is not featured in existing FPGAs.
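The fanout analysis above can be sketched numerically. This assumes the reconstructed formulas (fanout kN/2 per TPG output, largest feasible array Nmax = 2Lmax/k) and is for FPGAs without local bidirectional drivers:

```python
def tpg_fanout(n, k=4):
    """BUTs driven by one TPG output when no bidirectional drivers exist:
    a subset of k columns with N/2 BUTs per column gives k*N/2 loads."""
    return k * n // 2

def quadrants_needed(n, l_max, k=4):
    """1 if a single BIST floorplan fits within the loading limit L_max,
    else 4 (the array is split into quadrants, each tested concurrently)."""
    n_max = 2 * l_max // k  # largest feasible N for this loading limit
    return 1 if n <= n_max else 4
```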
5.3 Boundary Scan Access:
The boundary-scan Test Access Port (TAP) allows FPGAs to be reconfigured and tested in-system without requiring additional I/O
pins. The Lucent [26] and the Xilinx [43] architectures also provide user-defined access to
the FPGA core, which we use to control the ORA scan chain. Altera FPGAs [5] do not
feature user-defined TAP instructions, but the equivalent functionality may be implemented
in test mode at the cost of several test-dedicated I/O pins.
Reconfiguring the FPGA with the BIST test phases, initiating the BIST sequence, and
reading the BIST results (steps 1, 2, and 5 of each test phase) are performed using the TAP.
Access to our BIST architecture requires the ability to control the BIST Start/Reset to
initialize the TPGs and ORAs and start the BIST sequence, as well as the ability to read the
BIST Done and ORA Pass/Fail results from the BIST circuitry. This access must be made
within the confines of the typical FPGA boundary-scan circuitry architecture. For example,
unlike boundary-scan cells for external interconnect testing, the Shift/Capture and Update
control signals are not made available to the core of the FPGA by the TAP circuitry in most
FPGAs. Instead, the only internal signals provided by the TAP for user-defined internal scan
chains are TCK, TDI, a port to send data out on TDO, and an internal enable signal (TEN).
TEN is active when the user-defined scan-chain instruction is decoded by the TAP controller
and remains active until a different instruction is loaded into the TAP instruction register.
Additional considerations in selecting a boundary-scan access method for final
implementation include total test time, PLB overhead, routability, and diagnostic resolution.
A detailed analysis of four methods for access to the FPGA BIST architecture via the
boundary-scan interface was given in [13]. In this section, we describe the method selected
for the final implementation of our BIST approach.
5.4 BIST-Based Fault Detection:
In this section we show that our BIST approach achieves complete fault detection for
single faulty PLBs, and practically complete fault detection for multiple faulty PLBs. A
faulty PLB in a TPG or in an ORA may not produce an error if its fault does not affect the
operation of the TPG or ORA. Thus we will rely on detection of PLB faults only when a
faulty PLB is under test (configured as a BUT).
Claim 1: Any single faulty PLB is guaranteed to be detected.
Proof: When the faulty PLB is a BUT, it receives the correct pseudo-exhaustive patterns
from a fault-free TPG in every one of its modes of operation. The outputs of the faulty BUT are
compared with a fault-free BUT fed by a fault-free TPG, and the comparison and error latching
are done by a fault-free ORA. (Since the scanning mechanism of the ORA result register is
tested as part of every test phase, we can assume that errors propagate through a fault-free
scan chain.) Thus the faulty PLB is detected.
The more difficult question is whether we can have multiple faulty PLBs that mask each
other, so that together they escape detection. Although, in general, we cannot claim that any
possible combination of faulty PLBs will be detected, we can identify many large classes of
multiple faulty PLBs whose detection can be guaranteed. In the following, when we analyze
the detectability of a group G of faulty PLBs, we implicitly assume that all the other PLBs
are fault-free (in other words, G is not a subset of a larger group of faulty PLBs). G is
detected when any of its PLBs is detected.
Claim 2: Any group of faulty PLBs in the same row is guaranteed to be detected.
Proof: Except the blocks in a TPG, the only interaction among PLBs located in different
columns is provided by the scan register connecting the ORA result flip-flops. However, the
scan operation of the ORA result register is also tested by the technique described in
Section 3. Hence, faults in BUT and ORA PLBs in the same row cannot mask each other.
Faulty PLBs in rows 1 or N may not be detected when they are part of a TPG (for example,
when two faulty TPGs feeding pairs of BUTs compared by the same ORAs generate
identical patterns), but they will be detected when configured as BUTs. The next result deals
with the detection of faulty PLBs residing in the same column, called the faulty column. A
middle row refers to any row except 1 and N.
Claim 3: Any group of faulty PLBs in the middle rows of the same column that has at
least two adjacent fault-free PLBs, is guaranteed to be detected.
Note that to escape detection, there must be at least N/2 faulty PLBs in the group, because
otherwise the faulty column will have at least two adjacent fault-free PLBs. This is a very
restrictive condition: any group of fewer than N/2 faulty PLBs in the middle rows of a column
is guaranteed to be detected. Furthermore, the faulty PLBs configured as BUTs must
produce identical output responses during every test phase. The reason Claim 3 restricts faulty
PLBs to the middle rows of the faulty column is to guarantee that the TPGs in rows 1 and N are
fault-free and thus generate all test patterns. Otherwise, it is theoretically possible that the group
of faulty PLBs escape detection, even if the faulty column has two adjacent fault-free PLBs; for
this to happen, all of the following conditions must be satisfied: 1) the faulty PLB in row 1 or N
must change the patterns produced by one TPG, and, 2) the faulty TPG must skip all the patterns
that detect every faulty BUT bordering two adjacent fault-free PLBs, and, 3) the fault-free BUTs
must have identical responses for every pair of different vectors produced by the two TPGs
(otherwise errors will appear at all the ORAs in the fault-free columns). Clearly, this set of
conditions is so restrictive that it is unlikely to ever occur in practice. Hence in practice,
columns that include faulty PLBs in rows 1 or N will also be detected. (The above analysis
shows that the claim made in [20] that BIST-based approaches “do not detect all double
faulty PLBs” is incorrect.)
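The structural condition of Claim 3 can be checked mechanically. The sketch below is an expository helper (not part of the BIST hardware) that tests whether a faulty column retains two adjacent fault-free PLBs, with rows numbered 1 through N:

```python
def has_two_adjacent_fault_free(faulty_rows, n):
    """Claim 3 condition: True if the column keeps at least one pair of
    vertically adjacent fault-free PLBs, which guarantees detection."""
    faulty = [row in faulty_rows for row in range(1, n + 1)]
    return any(not faulty[i] and not faulty[i + 1] for i in range(n - 1))
```

For N=8, faults in every other row (the pathological pattern discussed next) are exactly the case where the condition fails.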
As an example, consider a faulty column with N=8 where no pair of fault-free PLBs
is adjacent. Consider the test session in which all 4 faulty PLBs are BUTs. In addition,
assume that their faults are functionally equivalent, so that no mismatches will be produced.
If these faults are not activated in the other test session when the faulty PLBs in the middle
rows serve as ORAs, and the PLB in row 8 is part of a TPG, then this multiple fault will
escape detection. However, this situation can be characterized as “pathological,” since it has
an extremely low probability of occurrence: 1) half of the PLBs in the same column must be
faulty; 2) these PLBs must reside in every other row; 3) their faults must be functionally
equivalent; and 4) these faults must not be detected when these PLBs are configured
as ORAs or TPG cells. The probability of occurrence is very low, because we need the AND
of four conditions which are very unlikely by themselves. Because masking cannot occur
among faults in the middle rows of different columns, we can derive from Claim 3 the
following more general result.
Claim 4: Any group of faulty PLBs in the middle rows of the FPGA, such that every
faulty column has at least two adjacent fault-free PLBs, is guaranteed to be detected.
Like Claim 3, Claim 4 can also be extended in practice to cover faulty PLBs in rows 1 and
N. Figure 7 illustrates an interesting group of 4 faulty PLBs that can escape detection if the
following conditions are all satisfied: 1) PLBs X and Y have the same position in the first and
the second TPG in row 1; 2) PLBs V and Z have the same position in the first and the second
TPG in row 8; 3) the faults in X and Y are equivalent; 4) the faults in V and Z are
equivalent; 5) the TPGs in row 1 skip all the patterns that detect faults in V and Z; and 6)
the TPGs in row 8 skip all the patterns that detect faults in X and Y. The faulty PLBs escape
detection because in every test session, the patterns generated by the two TPGs are identical (so
no mismatches are detected at any ORAs), and they miss the faults in the faulty BUTs.
Clearly, this would be another pathological situation.
Figure 7: Pathological case that escapes fault detection (faulty PLBs X and Y in the row-1
TPGs, and faulty PLBs V and Z in the row-8 TPGs).
Although we cannot guarantee that the two test sessions illustrated in Figure 1 will detect
any possible combination of faulty PLBs, it appears that the conditions that allow a group of
faulty PLBs to escape detection are so restrictive, that they are very unlikely to occur in
practice. Therefore we can conclude that, in practice, any combination of faulty PLBs will be
detected.
But if needed, we can enhance the BIST sequence to guarantee the detection of any
combination of faulty PLBs by adding the two test sessions shown in Figure 8. These are
obtained by rotating the test sessions of Figure 1 by 90 degrees, so that the flow of test patterns is
horizontal instead of vertical. If a group of faulty PLBs did escape detection in the first two
test sessions, this was caused by masking among faulty PLBs in the same column. But the
rotation in effect interchanges the rows and the columns, so it destroys the masking
relations because the faulty PLBs now interact as if they were in the same row. Hence a group
that escapes detection in the first two sessions is guaranteed to be detected in at least one of
the additional two sessions.
Figure 8: Additional ("horizontal") test sessions. a) Test session WE; b) test session EW.
5.5 BIST-Based Diagnosis
In this section, we present the use of the BIST approach in diagnosing an FPGA that
failed the tests provided by the two test sessions shown in Figure 1. In any fault location
procedure, maximum diagnostic resolution is achieved when faults are isolated within an
equivalence class containing all the faults that produce the observed response. If every
equivalence class has only one fault, we say that the fault is uniquely diagnosed, that is, there
is no other fault which can produce the same response. In our case a fault is one faulty PLB
that may have any number of internal faults, and the response is obtained at the outputs of the
ORAs. We begin by assuming a single faulty PLB in the FPGA, then we analyze the case of
multiple faulty PLBs, and we also discuss locating faults inside a defective PLB.
Diagnosis Within A Faulty PLB:
During each test session, the errors (failing ORA indications) are actually recorded for
every test phase. This allows us to identify the failing mode(s) of operation of the faulty PLB
and its faulty internal module(s). For example, consider the test phases developed for the
ORCA 2C and 2CA series FPGA given in Table 2. The first 9 phases are used to test the
various modes of operations of the PLBs in the ORCA 2C series, while the complete set of
14 test phases are used to test the ORCA 2CA series. Phases 1-4 test the LUT portion of the
PLB, phases 5-9 test the FFs, and phases 10-14 test additional LUT modes of operation
present only in ORCA 2CA series. For example, if only phase 10 fails, then we know that
only the logic used exclusively to implement the multiplier is faulty. If only phases 5 and 7
fail, then the most likely cause is faulty connection(s) between the LUT and the flip-flops.
Such accurate diagnosis is extremely useful in failure-mode analysis and yield improvement.
This accuracy also provides the basis for reusing partially defective PLBs in
fault-tolerant applications or adaptive computing applications. For example, if only phases 5
through 9 fail, then the LUT is fault-free (since it passed exhaustive tests in each of its
modes of operation), and this PLB with defective flip-flops can be safely used to implement
any combinational logic function.
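The phase-to-module mapping described above can be sketched as a small lookup. The grouping follows the text (phases 1-4 exercise the LUT, 5-9 the FFs, 10-14 the 2CA-only LUT modes); the module labels are illustrative names, not identifiers from the tool flow:

```python
# Map each ORCA 2CA test phase to the PLB module it exercises.
PHASE_TO_MODULE = {}
for p in range(1, 5):
    PHASE_TO_MODULE[p] = 'LUT'
for p in range(5, 10):
    PHASE_TO_MODULE[p] = 'FF'
for p in range(10, 15):
    PHASE_TO_MODULE[p] = 'LUT (2CA-only modes)'

def faulty_modules(failing_phases):
    """Candidate faulty PLB modules implied by the set of failing phases."""
    return sorted({PHASE_TO_MODULE[p] for p in failing_phases})
```

For instance, a PLB failing only phases 5 through 9 implicates only the FFs, so its LUT can still be reused for combinational logic.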
Table 2: Summary of BIST Phases for ORCA 2C and 2CA series Logic Blocks

Phase No. | FF/Latch | Set/Reset | Clock | Clk Enable | Flip-Flop Data In | LUT Mode | No. Outs
1 | - | - | - | - | - | Asynchronous RAM | 4
2 | - | - | - | - | - | Adder/subtracter | 5
3 | - | - | - | - | - | 5-variable MUX | 4
4 | - | - | - | - | - | 5-variable XOR | 4
5 | Flip-Flop | async. reset | falling edge | active low | LUT output | Count up | 5
6 | Flip-Flop | async. set | falling edge | enabled | PLB input | Count up/down | 5
7 | Latch | sync. set | active low | active high | LUT output | Count down | 5
8 | Flip-Flop | sync. reset | rising edge | active low | PLB input | 4-variable | 4
9 | Latch | - | active high | active low | dynamic select | 4-variable | 4
10 | - | - | - | - | - | Multiplier | 5
11 | - | - | - | - | - | Greater/equal to Comp | 5
12 | - | - | - | - | - | Not equal to Comp | 5
13 | - | - | - | - | - | Synchronous RAM | 4
14 | - | - | - | - | - | Dual port RAM | 4

Figure: ORA with greater diagnostic resolution, comparing BuO1/BdO1 through BuO4/BdO4,
with scan-in from the previous ORA and TDI/TCK scan control.
6. IMPLEMENTATION
We have also implemented the approach in Xilinx Spartan series FPGAs. Figure 10a
shows the delay-fault BIST implementation as it would reside in a STAR, taken from the
Xilinx FPGA Editor. The TPG is at the top of the STAR, and the ORA logic (OR and
NAND gates to create OSC) at the bottom. The ORA counter is a 6-bit counter constructed
from three PLBs, each implementing a 2-bit counter.
The OSC output drives the clock inputs of the ORA counter. This implementation is
for off-line testing, as the global reset generated after configuration is used to initiate the
BIST sequence. The global reset cannot be used during on-line testing, since it would reset
the system function; for on-line testing, the BIST sequence is initiated via the boundary-scan
interface. The TPG is constructed from a shift register with the input to the shift register tied
to a logic 1 and the FFs clocked by the internal 8MHz oscillator.
The role of the 2-bit shift register is to allow some time for the FPGA to stabilize
after downloading the configuration before measuring delays; this was a precautionary
measure and may not be necessary. Once configuration of the FPGA is complete, the global
reset sets the two FFs to logic 0, and two 8MHz clock cycles later a 0/1 transition will start
propagating on the four PUTs. The PUTs travel through an equal number of CIPs, including
the switch boxes, prior to entering the ORA logic. At the conclusion of the BIST sequence,
the contents of the ORA counter FFs are obtained via a configuration memory readback
operation in this Spartan implementation, as opposed to the scan-chain-based boundary-scan
access we used for the ORCA FPGA.
Figure 10: Delay-fault BIST implementation in the Spartan FPGA. a) Xilinx FPGA Editor
layout; b) TPG implementation (2-bit shift register cleared by the global reset and clocked by
the 8 MHz internal clock, driving the PUTs); c) ORA implementation (OSC formed from the F,
G, and H LUTs clocks three 2-bit counters chained through carry-in/carry-out to form the
6-bit ORA counter).
In this section we present the results of the implementation of our BIST-based
diagnostic approach using ORCA 2C15A FPGAs, and discuss our experience with testing
actual devices. The 2C15A requires the full set of 14 test phases summarized in Table 2 for each
of the two test sessions. Each test phase for the 2C15A requires 220,800 bits of memory to store
the configuration for that test phase, for a total storage requirement of 0.77 Mbytes of test
configurations for both test sessions. The total FPGA test time, including all
reconfigurations, is less than one second, using a 10 MHz boundary-scan clock.
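The storage figure quoted above follows directly from the per-phase configuration size: 220,800 bits per phase, 14 phases per session, two sessions.

```python
bits_per_phase = 220_800      # configuration bits per 2C15A test phase
phases_per_session = 14
sessions = 2
total_bytes = bits_per_phase * phases_per_session * sessions // 8
megabytes = total_bytes / 1_000_000   # 772,800 bytes, i.e. ~0.77 Mbytes
```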
7. RESULTS
We were provided with five 2C15A devices by Lucent Technologies Microelectronics
Group in Allentown, PA. From manufacturing test results, two FPGAs were known to be
fault-free and three were known to be defective. All three defective devices failed our test,
and the two fault-free devices passed. Analyzing the results of the defective FPGAs, our
PLB diagnosis procedure MULTICELLO reported inconsistencies for Chip1 and Chip2, and
identified one faulty PLB in Chip3 (in row 3 and column 18).
Test applied | Chip1 | Chip2 | Chip3
Off-line PLB test (this paper) | FAIL | FAIL | FAIL
Off-line routing test [41] | FAIL | FAIL | PASS
On-line PLB test [2][3] | P&F | FAIL | FAIL

Table 3: Summary of testing faulty FPGAs
To validate these results, we retested the defective chips with the off-line BIST for
programmable interconnect described in [41], and with the on-line BIST for PLBs (the Roving
STARs approach) presented in [2][3]. No application was programmed in the FPGAs for the on-line
test. The on-line BIST tests each PLB twice, once in a vertical STAR and once in a
horizontal STAR. Table 3 summarizes the results of all tests.
The failures of the routing BIST in Chip1 and Chip2 show that the inconsistencies
reported by MULTICELLO when trying to locate defective PLBs in these devices are caused by
the presence of interconnect faults.
For Chip1, the on-line PLB test passed all vertical roving tests, but failed some of the
horizontal roving tests. Since these tests involve the same PLBs but different routing
resources, we concluded that Chip1 has only interconnect faults that also cause its off-line
PLB test to fail.
Chip2 fails every test and has both faulty PLBs and faulty interconnect. Chip3 passed
the interconnect test, and the on-line diagnosis method identified the same faulty
PLB as MULTICELLO. The faulty PLB in Chip3 failed phases 5-9, and by using one
additional configuration that separated the FFs from the LUT, we were able to determine that
the fault affects only the FFs and the LUT is fault-free.
This defective PLB has been successfully reused to implement combinational logic
functions in a fault-tolerant application. Since faulty FPGAs are difficult to obtain, we have
also performed verification of our BIST approach using a fault emulator that injects faults in
the FPGA by changing configuration bits.
8. CONCLUSIONS
In this paper, we have described a BIST approach for programmable logic blocks in
SRAM-based FPGAs. Our approach is applicable at any level of testing and, unlike
conventional BIST, it does not introduce any area overhead or delay penalty. Every PLB is
pseudo-exhaustively tested in all its modes of operation, so the tests are practically complete
for any logic fault model.
We have shown that in practice, our method detects any combination of faulty PLBs.
Our architecture facilitates the testing of all PLBs in only two test sessions, and the number
of test configurations is also independent of the size of the FPGA. Its regular structure allows
easy scalability with the size of the FPGA and algorithmic generation of placement and
routing data for every test phase based only on the size of the array.
We also presented the first diagnosis algorithm that can accurately locate any single and
most multiple faulty PLBs with maximum diagnostic resolution. The multiple faulty PLBs
that cannot be diagnosed correspond to very restrictive situations unlikely to occur in practice.
Our method can also identify defective subcircuits inside a PLB.
This diagnostic information can then be used for yield enhancement in the
manufacturing process or for repair strategies in fault-tolerant applications. We have
successfully verified our approach with faults injected using a fault emulator and with actual
defective FPGAs. An open problem to be investigated in the future is diagnosis under a
mixed-fault model, that is, locating faults in a chip that has both faulty PLBs and faulty
interconnect.
9. REFERENCES
1. M. Abramovici, C. Stroud, S. Wijesuriya, C. Hamilton, and V. Verma, “Using Roving
STARs for On-Line Testing and Diagnosis of FPGAs in Fault-Tolerant Applications,”
Proc. IEEE Intn’l. Test Conf., 1999, pp. 973-982.
2. M. Abramovici, C. Stroud, B. Skaggs, and J. Emmert, “Improving On-Line BIST-Based
Diagnosis for Roving STARs,” Proc. IEEE Intn’l. On-Line Test Workshop, 2000, pp. 31-39.
3. M. Abramovici, J. Emmert, and C. Stroud, “Roving STARs: An Integrated Approach to
On-Line Testing, Diagnosis, and Fault Tolerance for FPGAs in Adaptive Computing
Systems,” Proc. Third NASA/DoD Workshop on Evolvable Hardware, 2001, pp. 73-92.
4. M. Abramovici and C. Stroud, “BIST-Based Test and Diagnosis of FPGA Logic Blocks,”
IEEE Trans. on VLSI, Vol. 9, No. 1, pp. 159-172, Feb., 2001.
5. J. Emmert, C. Stroud, B. Skaggs, and M. Abramovici, “Dynamic Fault Tolerance in
FPGAs via Partial Reconfiguration,” Proc. IEEE Symp. on Field-Programmable Custom
Computing Machines, 2000, pp. 165-174.
6. J. Emmert, S. Baumgart, P. Kataria, A. Taylor, C. Stroud, M. Abramovici, “On-line Fault
Tolerance for FPGA Interconnect with Roving STARs,” IEEE Intn’l. Symp. on Defect
and Fault Tolerance in VLSI Systems, 2001, pp. 445-454.
7. Lattice Semiconductor Co., http://www.latticesemi.com/products/fpga.
8. E. McCluskey, “Verification Testing - A Pseudoexhaustive Test Technique,” IEEE Trans.
on Computers, Vol. C-33, No. 6, pp. 541-546, June 1984.
9. S. D’Angelo, C. Metra, and G. Sechi, “Transient and Permanent Fault Diagnosis for
FPGA-Based Systems,” Proc. IEEE International Symp. on Defect and Fault Tolerance in
VLSI Systems, pp. 330-338, November 1999.
10. J. Emmert, C. Stroud, B. Skaggs, and M. Abramovici, “Dynamic Fault Tolerance in
FPGAs via Partial Reconfiguration,” Proc. 8th Annual IEEE Symp. on Field-Programmable
Custom Computing Machines, April 2000.
11. A. van de Goor, Testing Semiconductor Memories: Theory and Practice, John Wiley and
Sons, 1991.
12. C. Hamilton, G. Gibson, S. Wijesuriya, and C. Stroud, “Enhanced BIST-Based
Diagnosis of FPGAs via Boundary Scan Access,” Proc. IEEE VLSI Test Symp.,
pp. 413-418, May 1999.
13. F. Hanchek and S. Dutt, “Methodologies for Tolerating Logic and Interconnect Faults in
FPGAs,” IEEE Trans. on Computers, pp. 15-33, Jan. 1998.
14. I.G. Harris and R. Tessier, “Interconnect Testing in Cluster-Based FPGA Architectures,”
Proc. Design Automation Conf., June 2000.
15. F. Hatori et al., “Introducing Redundancy in Field Programmable Gate Arrays,” Proc.
IEEE Custom Integrated Circuits Conf., pp. 7.1.1-7.1.4, 1993.
16. W. K. Huang and F. Lombardi, “An Approach to Testing Programmable/Configurable
Field Programmable Gate Arrays,” Proc. IEEE VLSI Test Symp., pp. 450-455, 1996.
17. W. K. Huang, F. J. Meyer, N. Park, and F. Lombardi, “Testing Memory Modules in
SRAM-based Configurable FPGAs,” Proc. IEEE International Workshop on Memory
Technology, Design and Testing, August 1997.
18. W. K. Huang, F. J. Meyer, X. Chen, and F. Lombardi, “Testing Configurable LUT-Based FPGAs,” IEEE Trans. on VLSI Systems, Vol. 6, No. 2, pp. 276-283, June 1998.
19. W. K. Huang, F. J. Meyer, and F. Lombardi, “An Approach for Detecting Multiple
Faulty FPGA Logic Blocks”, IEEE Trans. on Computers, Vol. 49, No. 1, pp. 48-54,
2000.
20. T. Inoue, S. Miyazaki, and H. Fujiwara, “Universal Fault Diagnosis for Lookup Table
FPGAs,” IEEE Design & Test of Computers, Vol. 15, No. 1, pp. 39-44, Jan. 1998.
21. C. Jordan and W. P. Marnane, “Incoming Inspection of FPGAs,” Proc. European Test
Conf., pp. 371-377, 1993.
22. J. L. Kelly and P. A. Ivey, “Defect Tolerant SRAM Based FPGAs,” Proc. International
Conf. on Computer Design, pp. 479-482, 1994.
23. V. Lakamraju and R. Tessier, “Tolerating Operational Faults in Cluster-based FPGAs,”
Proc. ACM/SIGDA International Symp. on FPGAs, pp. 187-194, Feb. 2000.
24. F. Lombardi, D. Ashen, X. Chen, and W. K. Huang, “Diagnosing Programmable
Interconnect Systems for FPGAs,” Proc. ACM/SIGDA International Symp. on
FPGAs, pp. 100-106, Feb. 1996.