Stuck-at Fault Diagnosis
HT NO: 13N41D5709
CHAPTER – 1
FUNDAMENTALS
1.1 Field-programmable Gate Array:
A Field-programmable Gate Array (FPGA) is a configurable integrated circuit
that can implement an arbitrary logic design. An FPGA can be repeatedly programmed to
permit fast and inexpensive prototyping, product development, and design error
correction.
Prototyping, or emulation, is the modeling and simulation of all aspects of a
logic design whose final implementation is not FPGA-based, while product development
is the design, analysis, and test of a logic design whose final implementation is
FPGA-based. Because fabrication is required for Application-specific Integrated Circuits
(ASICs), the development time (time to production) of ASIC-based designs is greater
than a year and the Non-recurring Engineering (NRE) costs are tens of millions of
dollars.
However, because FPGA-based designs do not require fabrication (an FPGA is
simply programmed with a logic design), a much shorter development time and much
lower NRE results, which has led to the current popularity of FPGAs for product
development and prototyping. FPGA-based products have the additional benefit that
design errors discovered after product release can be corrected by simply reprogramming
the FPGA with the corrected design, called field-upgrading. Note that ASICs remain better
suited for high-volume designs, since the ASIC NRE cost, amortized per chip, is
typically less than the higher per-unit cost of an FPGA.
The size, functionality, and performance of an FPGA vary both within and
between device families (a series of differently sized FPGAs that share a common
architecture). A particular device family and size is chosen for implementation based on
the specifications of a logic design.
Although internal clock frequencies are slower than those for ASICs, the arrayed
structure, configurable logic elements, and abundant routing resources of an FPGA
permit highly parallel computation that yields increased throughput. Applications in
networking, Digital Signal Processing (DSP), graphics, and cryptography, among others,
can be efficiently implemented.
1.2 FPGA Manufacturing:
For correct functionality, the resources used to implement a logic design must be
defect-free. A defect is a physical imperfection, introduced during manufacturing, that
causes an FPGA to function incorrectly. A device is tested to detect defects, thereby
ensuring an Acceptable Quality Level (AQL), a measure of the number of defective
devices that escape manufacturing test (lower AQL means fewer defective outgoing
devices).
Although desired, there is no guarantee that a device is defect-free because it
passes manufacturing test: tests are imperfect and unable to detect all defects.
FPGA test is the responsibility of the manufacturer. An application-independent
FPGA is one whose configuration may change throughout its lifetime.
Because the manufacturer cannot know which logic and routing resources of an
application-independent FPGA a customer will use, and because literally billions of
configurations are possible (all of which cannot be tested), special techniques and
tools are required for test. In contrast, an application-dependent FPGA is one whose
configuration remains unchanged throughout its lifetime, useful for low- to
medium-volume or cost-sensitive logic designs.
Although most devices that fail manufacturing test are discarded, some are
diagnosed to determine the cause of failure. The information obtained from diagnosis
is used to make manufacturing improvements to increase yield. Yield, a measure of the
number of defect-free manufactured devices, is the hallmark of a good
manufacturing process: high yield means few defective devices.
However, FPGA diagnosis is difficult and time-consuming, primarily due to the
complexity of the configurable interconnection network. Like FPGA test, special
techniques and tools are required for diagnosis.
1.3 Defects and Faults:
As stated, a defect is a physical imperfection introduced during manufacturing
that causes device failure. However, the physical nature of a defect, which can have a
very complicated behavior, does not permit direct mathematical treatment of test and
diagnosis.
Thus, the logical effect of a defect is modeled by a fault, and detection or
identification of defects is accomplished by detection or identification of faults (with
the exception of defect-based test, which is the detection of defects by measuring
some device parameter, for example a particular current magnitude or voltage level).
Note that a flaw is also a physical imperfection; however, it causes device failure
only after some period of operation (a flaw degrades into a defect). Flaws, and the
screen required to eliminate them, are not discussed in this dissertation.
There are several common fault models: stuck-at, bridge, and delay, among
others. A stuck-at fault is a fixed logic value on a circuit node, either stuck-at 0 or
stuck-at 1, regardless of the value driven on that node.
A bridge fault is the logical behavior of a bridge, for example wired-AND or
wired-OR, caused by a short between signal lines that should not be connected.
Finally, a delay fault is an additional delay on some path or through some gate in a
circuit that causes timing failures.
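The stuck-at model in particular lends itself to direct simulation. A minimal sketch, using a hypothetical two-gate circuit (y = (a AND b) OR c) with made-up node names, shows how a fault is "detected" when the faulty and fault-free outputs differ for some input vector:

```python
# Hypothetical circuit: y = (a AND b) OR c. A stuck-at fault forces a node to a
# fixed logic value regardless of the value driven on it.
def evaluate(a, b, c, stuck=None):
    stuck = stuck or {}
    ab = stuck.get("ab", a & b)    # internal node, possibly stuck
    return stuck.get("y", ab | c)  # output node, possibly stuck

# a=1, b=1, c=0 drives node "ab" to 1, so "ab" stuck-at-0 changes the output:
assert evaluate(1, 1, 0) == 1
assert evaluate(1, 1, 0, stuck={"ab": 0}) == 0                  # fault detected
assert evaluate(0, 0, 1) == evaluate(0, 0, 1, stuck={"ab": 0})  # not detected by this vector
```

The last assertion illustrates why tests are imperfect: a vector that does not both activate the fault and propagate its effect to an output cannot detect it.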
Throughout this dissertation an attempt is made not to mix the usage of the
terms defect and fault. For example, in Chap. 3 the detection of delay faults, not
delay defects, is discussed, since the cause of the delay is irrelevant there.
1.4 Test:
Delay Fault Test:
The objective of test is to detect defects. Detecting delay faults, which cause
timing failures in an otherwise functioning FPGA, is difficult due to the slow test
speeds of contemporary Automated Test Equipment (ATE). At slow test speeds, path
slack (the difference between the time at which a signal must be stable and the time
at which the signal on a path actually stabilizes) is so large that even a delay fault
may not cause a failure, that is, may not drive the path slack negative.
However, the test configurations implement long paths of interconnects and logic
elements, which (1) fail if either one large delay fault (spot defect) or many small
delay faults (process variation) are present along the path (which may or may not be
desired), and (2) increase the likelihood of fault masking, or the inability to detect a
fault due to the presence of another.
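The slack argument can be sketched numerically; all delays below are hypothetical illustrative values, not measurements from any particular device:

```python
def path_slack(arrival_ns, required_ns):
    """Slack: time by which the signal stabilizes before it must be stable.
    Negative slack means a timing failure."""
    return required_ns - arrival_ns

# Hypothetical path: 5 ns fault-free delay, 7 ns with a 2 ns delay fault.
faulty_delay = 7.0
at_speed_period = 6.0    # clock period the design must actually meet
slow_ate_period = 50.0   # period at which slow ATE applies test vectors

assert path_slack(faulty_delay, at_speed_period) < 0   # fails at speed
assert path_slack(faulty_delay, slow_ate_period) > 0   # escapes the slow test
```

This is why a delay fault that would break the fielded design can pass a slow ATE test untouched.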
Bridge Fault Test:
A bridge is a short between signal lines that should not be connected, which
usually causes a fault current when activated (bridged circuit nodes are driven to
opposite logic values). This fault current can be observed by measuring the
steady-state current drawn by a device, IDDQ. IDDQ test is important for FPGAs because of the
large number of transmission-gate multiplexers in the interconnection network, in
which certain bridge faults can only be detected via IDDQ test.
However, the effectiveness of IDDQ test is limited due to the shrinking
geometries of deep sub-micron processes. Scaling of threshold voltages has caused
transistor leakage currents to increase, thereby exponentially increasing IDDQ but not
correspondingly increasing fault currents. It is shown that only 5 configurations, and
thus 5 IDDQ measurements (one measurement per configuration), can detect all
interconnect bridge faults in contemporary FPGAs. Additionally, the fault current
detectability (the size of the fault current compared to the magnitude of IDDQ) can be
increased by partitioning, or dividing, the nets of a test configuration into subsets,
each of which corresponds to a new, smaller test configuration.
Test Time Reduction:
The configuration time (the time required to program an FPGA with a
test configuration) consumes the majority of total test time, and therefore
dominates test cost. To significantly decrease the test time of an FPGA, a reduction
must be made in either the number of test configurations or the configuration time.
Several test configuration generation techniques have been proposed to
alleviate this test time problem by reducing the number of test configurations. Those
that consider faults in the interconnection network derive a reduced number of test
configurations using interconnect modeling and graph traversal algorithms. Rather
than reduce the number of test configurations, test time can instead be shortened by
reducing the configuration time (the time required to program an FPGA under test with
each test configuration).
However, one prior configuration-time technique assumes the configuration memory
is structured as a single scan chain. Furthermore, the specially designed test
configuration that is internally shifted within an FPGA detects only faults in the
logic resources.
Because the presented technique reduces only the configuration time, it can be
used in conjunction with those methods that reduce the number of test
configurations. A small amount of Design for Test (DFT) hardware, several 2-bit
multiplexers (R − 3 in total for an R × C FPGA), is added to the existing FPGA
hardware.
These multiplexers, which take advantage of the regular structure of both an
FPGA and each test configuration, permit parallel programming of test configuration
data, achieving a configuration time reduction of approximately 40%. By reducing
the configuration time, the number of test configurations can be increased to achieve
a more thorough FPGA test without increasing test cost.
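The trade-off can be sketched with a simple cost model; the millisecond figures below are hypothetical, chosen only to show the arithmetic:

```python
def total_test_time(n_configs, t_config_ms, t_apply_ms):
    """Total test time: per-configuration programming time dominates vector time."""
    return n_configs * (t_config_ms + t_apply_ms)

# Hypothetical times: 100 ms to program each configuration, 1 ms to apply its vectors.
base = total_test_time(n_configs=20, t_config_ms=100.0, t_apply_ms=1.0)
reduced = total_test_time(n_configs=20, t_config_ms=60.0, t_apply_ms=1.0)  # ~40% faster programming
assert reduced < base
# Or keep the original time budget and run more configurations for a more thorough test:
assert int(base // 61.0) > 20
```

The second assertion shows the alternative the text describes: the same time budget now accommodates more configurations, and hence a more thorough test, at no extra cost.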
1.5 Diagnosis:
Bridge Fault Location:
The objective of diagnosis is to locate or identify defects. Fault location is an
important part of fault diagnosis since it is usually necessary to locate a fault to
identify the underlying defect. Because the majority of an FPGA is its interconnection
network and due to the high density of the physical layout in deep sub-micron process
technologies, locating faults in the interconnection network is an important concern in
FPGAs.
By iteratively partitioning the nets of the test configuration for which the
FPGA failed differential IDDQ test and independently testing the FPGA with each
partition, fault-free nets can be progressively eliminated until only a faulty net
(one of the bridged nets) remains. The other bridged net is found by applying the
location process a second time using a complementary test configuration.
Very few configurations and IDDQ measurements are required, logarithmic in
the number of nets in the test configuration. Furthermore, because of the simplicity
of partitioning (accomplished by device configuration alone), the process can be easily
automated.
Automated Fault Diagnosis:
When the type of defect present is not known (unknown cause of failure), both
identification and location must be conducted. A common method is stuck-at fault
diagnosis (the stuck-at fault model is again chosen primarily for its simplicity), which
uses stuck-at faults to model the behavior of a defect, and identifies the smallest
possible set of fault candidates, or faults that could explain the failure.
With prior techniques, moreover, the set of test configurations and test vectors must be
regenerated for each FPGA architecture. Compared to the ad hoc diagnosis seen
in practice, which requires several days or weeks to diagnose a single device, the
diagnosis time of the presented method is reduced to only several hours; compared to
formal diagnosis techniques, the presented technique is capable of diagnosing any
fault detectable by the manufacturing test configurations, and is applicable to any
FPGA architecture in production.
CHAPTER – 2
FPGA Structure
2.1 Overview:
An FPGA is a two-dimensional array of blocks that are joined by an
interconnection network. Logic Blocks (LBs) contain the configurable hardware for
logic design implementation. Input/output Blocks (IOBs) are used for primary inputs
and outputs. Multiplier blocks (MBs) contain hardware to implement specialized
arithmetic.
Finally, RAM blocks (BRAMs) contain user-addressable memory arrays. Since
most blocks in an FPGA are logic blocks and very few are multiplier or RAM blocks,
an FPGA is basically a regular array of logic blocks.
The interconnection network consists of interconnects, switch matrices or
multiplexers, and buffers. An interconnect is a wire segment. A Switch Matrix (SM),
or programmable multiplexer, is used to join blocks.
Buffers are used in the same manner as they are for any interconnection network.
There are only a few types of tiles: a logic block and its associated switch matrix
make an LB tile while an input/output block and its associated switch matrix make an
IOB tile (for simplicity multiplier blocks and RAM blocks are not discussed). Local
routing (interconnects) is included in each tile.
An FPGA is organized as an array of R rows and C columns. One row, containing
C tiles, spans the entire width of the device. Similarly, one column, containing R tiles,
spans the entire height of the device.
Figure shows a simplified 5 × 5 FPGA with logic blocks in the center and
input/output blocks on the periphery: the upper left half shows the tile representation
while the lower right half shows the switch matrices, blocks, and a subset of the
interconnects.
Figure : 5 × 5 FPGA (only some interconnects shown)
2.2 Logic Resources:
A logic block contains the combinational and sequential elements needed to
implement arbitrary logic functions. In the Xilinx Virtex architecture—the same basic
architecture of most contemporary FPGAs—a logic block is subdivided into two smaller
units, called slices. Each slice contains two Look-up Tables (LUTs) and two bistables, as
well as some carry logic and several small multiplexers.
A single look-up table can implement any 4-input combinational logic function;
multiple look-up tables can be combined to implement larger functions. A look-up table
is a small array of memory, 16 × 1 bits, that stores the truth table of a 4-input Boolean
function (2^4 = 16 values). A bistable, used to implement sequential functions, can be
configured to operate as either a D latch or a D flip-flop.
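The LUT-as-memory idea can be shown directly: the four inputs form a 4-bit address into the 16-entry table, and the stored bit is the function value. A sketch (the XOR function is an arbitrary example):

```python
# A 4-input LUT is a 16x1 memory: the inputs address the table, the stored bit
# is the function value.
def make_lut(func):
    """Build the 16-entry truth table of a 4-input Boolean function."""
    return [func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1) for i in range(16)]

def lut_read(table, a, b, c, d):
    return table[(a << 3) | (b << 2) | (c << 1) | d]

xor4 = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)
assert len(xor4) == 16                  # 2**4 = 16 stored values
assert lut_read(xor4, 1, 0, 1, 1) == 1  # 1^0^1^1 = 1
```

Programming the FPGA amounts to writing these 16 bits into the LUT's configuration memory; any 4-input function is just a different table.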
2.3 Routing Resources:
A net is the concatenation of several interconnects, or wire segments, that form a
path between blocks. In contemporary FPGAs, interconnects are hierarchically
organized by length and direction. In the Virtex architecture, there are three lengths,
single, hex, and long, each routed either horizontally or vertically for a total of six
types. Single lines interconnect adjacent switch matrices in all four directions.
Hex lines interconnect every third and sixth switch matrix in all four directions.
Finally, long lines span the entire height or width of a device. The majority of an
FPGA is its interconnection network, constituting up to 80% of the die area and up to
8 metal layers.
Routing of nets is accomplished by switch matrices, which either join
interconnects to block inputs or outputs, or other interconnects. The configurability
of a switch matrix is achieved by Programmable Interconnect Points (PIPs). A PIP is
a pass transistor controlled by an associated memory cell that determines whether the
transistor is on (conducting) or off (non-conducting).
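A switch matrix can be modeled as a set of PIPs, each one a connection controlled by a single configuration memory cell. A minimal sketch with hypothetical wire and pin names:

```python
# A PIP is a pass transistor controlled by one configuration memory cell; a
# switch matrix is the collection of PIPs that are programmed "on".
class SwitchMatrix:
    def __init__(self):
        self.pips = {}  # (src, dst) -> memory cell value (True = conducting)

    def program(self, src, dst, on=True):
        self.pips[(src, dst)] = on

    def connected(self, src, dst):
        return self.pips.get((src, dst), False)

sm = SwitchMatrix()
sm.program("single_n", "lb_in0")  # hypothetical interconnect and pin names
assert sm.connected("single_n", "lb_in0")
assert not sm.connected("single_n", "lb_in1")  # unprogrammed PIP stays off
```

Routing a net is then a sequence of such programmed PIPs joining interconnects from a source block output to a destination block input.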
CHAPTER – 3
Delay Fault Test
3.1 Motivation and Previous Work
Detecting delay faults is difficult due to the slow test speeds of contemporary Automated Test
Equipment (ATE); however, it is important for two reasons: a delay fault can cause timing
failures in an otherwise functioning device, and it may result in an early-life failure (the device
prematurely stops functioning) if it is caused by a resistive open, a partially conducting circuit
path.
A study of microprocessors (manufactured in high volume) found that 58% of
customer-returned devices had some type of open defect, a significant portion of which were resistive
opens. The additional load increases the RC delay of the path under test, resulting in a negative
path slack if a resistive open is present.
Although effective, the method is only applicable to FPGAs whose switch matrices and
PIPs are not implemented as multiplexers (the XC3000 and XC4000 architectures are not
multiplexer-based; the Virtex, Virtex-II, and Spartan-3 architectures are multiplexer-based).
Long paths consisting of both logic elements and interconnects are configured to have
approximately equal fault-free propagation delays. A signal transition is simultaneously driven at
the input of each path, and the difference in the propagation delays is measured at the outputs by
means of an on-chip oscillator.
However, test configurations that contain long paths of interconnects and logic elements
will fail if either one large delay fault (spot defect) or many small delay faults (process shift) are
present along the path, which may or may not be desired. Additionally, the probability of fault
masking, or the inability to detect a fault due to the presence of another, is higher for longer paths.
3.2 Delay Fault Detection:
The presented delay fault test is applicable to all contemporary FPGA architectures. Several
paths under test, consisting of a set of interconnect and PIPs between two logic blocks, are
configured to have equal fault-free propagation delays. A race between the signals on the paths is
started at the first logic block and the difference in transition times of the signals at the second
logic block is observed.
If the difference is above a predetermined threshold (one signal transitions much later than
the others), a delay fault is detected. By configuring the shortest possible paths under test that do
not include logic elements (PIPs and interconnects only), the probability of fault masking (the
result of a delay fault on multiple paths) is minimized. The tiled structure of contemporary FPGAs
and the hierarchical organization of the interconnects easily facilitate the configuration of short
paths with nearly equal fault-free propagation delays.
Figure: Iterative Logic Array
A set is the group of paths under test between a pair of logic blocks: set_i is the group of 2N
paths between LB_(i−1) and LB_i (although not necessary, 2N, rather than N, is chosen to make the
number of paths even and simplify the analysis). The test of an FPGA, which is configured to
contain one or more ILAs, begins by applying a signal transition to the first logic block of each
ILA, LB_0. This causes LB_0 to create a race on its output set, set_1, that propagates to the next logic
block, LB_1. LB_1 subsequently observes the difference in transition times between the fastest and
slowest signals on set_1: if the difference is small, then no delay fault is present and LB_1 creates a
race on its output set, set_2; if the difference is large, then a delay fault is present and LB_1 outputs a
special fail signal (not a signal race) that propagates to the output of the ILA (primary output). In
this manner every path in every set along the ILA is tested. Multiple delay faults in each set can
be detected, provided at least one path in the set is fault-free.
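The per-set decision rule is just a comparison of the spread of arrival times against a threshold. A sketch with hypothetical arrival times (ns):

```python
def delay_fault_in_set(arrival_ns, threshold_ns):
    """A set fails when the slowest signal lags the fastest by more than the threshold."""
    return max(arrival_ns) - min(arrival_ns) > threshold_ns

# Hypothetical arrivals for 2N = 4 nominally equal-delay paths in one set:
fault_free = [5.0, 5.1, 4.9, 5.05]
one_slow = [5.0, 5.1, 4.9, 7.2]  # one path carries a delay fault
assert not delay_fault_in_set(fault_free, threshold_ns=1.0)
assert delay_fault_in_set(one_slow, threshold_ns=1.0)
```

Note the rule compares paths against each other, not against an absolute delay, which is why the paths in a set must have nearly equal fault-free delays, and why at least one fault-free path per set is needed as a reference.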
Bridge Faults :
To configure paths of equal fault-free propagation delay in an FPGA, equal-length wire
segments that are routed alongside each other are used. Therefore, the effect of a bridge fault
between paths under test must be considered. By driving the X_i paths to the logic value opposite
that of the Y_i paths, and interleaving the interconnects of the two groups such that each X_ij path
runs adjacent to a Y_ik path, as shown in the figure, all bridge faults between adjacent paths can
be detected.
Figure : Interleaved Paths
A bridge fault that causes the signal on at least one of the bridged paths in a set to transition
(for example a wired-AND or wired-OR bridge fault) creates an artificial race on that set that is
detected as a delay fault of infinite delay by the succeeding logic block. For example, given a
wired-AND bridge fault between paths X_ij and Y_ik, if X_ij is driven to logic-0 and Y_ik is driven to
logic-1, the logic value of Y_ik will be forced to logic-0 (0 · 1 = 0): Y_ik transitions from 1 to 0
while X_ij remains at logic-0. Similarly, given a wired-OR bridge fault between paths X_ij and
Y_ik, if X_ij is driven to logic-0 and Y_ik is driven to logic-1, the logic value of X_ij will be forced
to logic-1 (0 + 1 = 1): X_ij transitions from 0 to 1 while Y_ik remains at logic-1. Note that any
bridge fault that causes a signal transition on any bridged path will be detected.
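The two bridge behaviors reduce to AND/OR of the driven values, so the worked examples above can be checked directly:

```python
# Both bridged nodes read the same resolved value under each bridge model.
def wired_and(x, y):
    return x & y, x & y

def wired_or(x, y):
    return x | y, x | y

# Drive X to 0 and Y to 1 (opposite values activate the bridge):
assert wired_and(0, 1) == (0, 0)  # Y is forced 1 -> 0: an artificial transition
assert wired_or(0, 1) == (1, 1)   # X is forced 0 -> 1: an artificial transition
```

In either case exactly one of the bridged paths is forced to transition, creating the artificial race the succeeding logic block flags as an infinite-delay fault.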
3.3 Partitioning:
The magnitude of the total current (the background leakage current, plus the interconnect
leakage current, plus a possible fault current) increases with FPGA size due to the increased
number of transistors (PIPs). Unfortunately, the larger number of PIPs that contribute to the
leakage between opposite-valued interconnects (IDDQint) causes the total current to increase at a
much faster rate than the reference current. This difference, further aggravated by the shrinking
geometries of deep sub-micron process technologies, causes the signature current to increase,
making it more difficult to detect a bridge fault.
This scaling problem can be overcome by partitioning, or dividing, the nets (used
interconnects) of a test configuration for a large FPGA into subsets such that the total current for
each partition is roughly the same as it would be for a small FPGA (make the number of nets in
each partition roughly equal to the number of nets in a test configuration for a small FPGA).
Consequently, the signature current becomes invariant with respect to FPGA size.
Consider an FPGA and a test configuration whose used interconnects compose the set S.
This configuration, which tests for bridge faults between any interconnect in S and any interconnect
not in S, is partitioned into N smaller configurations, each testing only a subset of the
interconnects in S for bridge faults. Thus, instead of measuring one large total current for the
original configuration, N smaller total currents are measured, one for each of the N smaller
configurations.
For simplicity and without loss of generality, assume that each partition contains 1/Nth of
the interconnects in S. Because each partition also contains approximately 1/Nth the number of
PIPs, each interconnect leakage current is reduced by N compared to the interconnect leakage
current that results when the configuration is not partitioned.
Since the reference current does not change whether an FPGA is partitioned or not
(the reference current is still measured only once while all interconnects are driven to logic-1), each
signature current is also reduced by N. Therefore, the detectability ratio increases: the fault
current is more easily detected. Note that the partitions of S need not be disjoint: the only
requirement is that every interconnect of S belongs to at least one partition. Additionally, N
threshold values are now required.
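The scaling argument can be made concrete with a toy model; the current values below are hypothetical (chosen as powers of two so the arithmetic is exact), not characteristics of any real device:

```python
def detectability(n_nets, leak_per_net_uA, fault_current_uA, n_partitions=1):
    """Fault-current detectability: fault current relative to the interconnect
    leakage seen in one measurement. Partitioning divides the leakage among
    measurements; the fault current in the faulty partition does not shrink."""
    leakage_uA = (n_nets // n_partitions) * leak_per_net_uA
    return fault_current_uA / leakage_uA

# Hypothetical: 1024 nets leaking 2 uA each, against a 1000 uA fault current.
whole = detectability(1024, 2.0, 1000.0)
split = detectability(1024, 2.0, 1000.0, n_partitions=8)
assert split == 8 * whole  # partitioning into N subsets scales detectability by N
```

Each of the N measurements sees only 1/N of the interconnect leakage, while the full fault current still appears in whichever partition contains the bridge, which is exactly the detectability-ratio increase described above.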
CHAPTER – 4
Diagnosis
4.1 Bridge fault location:
Fault location is an important part of fault diagnosis since it is usually necessary to locate
a fault to identify the underlying defect. Because of the abundant logic and routing resources in an
FPGA (most of which remain unused during normal operation), a manufacturing defect can be
tolerated by locating the resulting fault, and avoiding the fault when mapping a logic design to the
device. Note that defect tolerance describes the fault location and fault avoidance process
implemented by the manufacturer prior to device operation while fault tolerance describes the
process implemented by the device during operation (for example concurrent error detection).
Fault Search
During each iteration of the fault location process, the nets of the fault-free partitions
(partitions that do not display an elevated signature current) are eliminated from the set S.
Consequently, two criteria must be met during each iteration of the fault search. First, to ensure
the iterative process converges to locate a bridged net, no net in S can be included in all partitions.
Second, without knowing the expected magnitude of the signature current for each partition, the
partitions of S must be balanced such that the interconnect leakage current for each partition is
approximately equal.
To first order this balancing is accomplished simply by equalizing the number of nets in each
partition; more precise balancing can be accomplished by explicitly equalizing the number of
PIPs, although experimental results show this is unnecessary. Therefore, simply by comparing the
signature currents of the two partitions during a particular iteration, the fault-free partition can be
identified and its nets eliminated from S. This reduced set S, containing approximately half
the nets it did in the previous iteration, is subsequently repartitioned, signature currents are
obtained, and the nets of a new, smaller fault-free partition are eliminated.
The figure shows the general fault location algorithm. In general, if S, which initially contains M
nets (|S| = M), is divided into N partitions during each iteration, then N⌈log_N M⌉ + 1
configurations and current measurements are required to locate the faulty net (the +1 is for the single
reference current measurement and the corresponding configuration). To locate multiple bridge
faults, the location algorithm is applied independently to each partition that displays an elevated
signature current during an iteration (multiple fault currents). The number of configurations and
current measurements remains logarithmic, provided a constant number of bridge faults are to be
located.
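For N = 2 the search is a binary search over the nets. A sketch with hypothetical net names, where `measure` stands in for configuring a partition and comparing its signature current against a threshold:

```python
import math

def locate_faulty_net(nets, measure):
    """Iteratively halve the candidate set; measure(partition) returns True if
    the partition shows an elevated signature current (contains the bridged net)."""
    s, measurements = list(nets), 0
    while len(s) > 1:
        half = s[: len(s) // 2]
        measurements += 2  # one signature current per partition, two partitions
        s = half if measure(half) else s[len(s) // 2:]
    return s[0], measurements

nets = [f"net{i}" for i in range(64)]  # hypothetical test-configuration nets
found, n_meas = locate_faulty_net(nets, lambda p: "net37" in p)
assert found == "net37"
assert n_meas == 2 * math.ceil(math.log2(64))  # N*ceil(log_N M), plus one reference
```

Sixty-four nets are resolved in six halvings (12 signature currents plus the single reference measurement), matching the N⌈log_N M⌉ + 1 count above.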
Finally, if location of the other net of a bridge fault is desired, for example to obtain more
information about the fault for diagnosis, the location algorithm is applied a second time to the
FPGA using an inverse test configuration, whose used interconnects are the unused interconnects
of the original configuration and vice versa. For example, given the original test configuration of
the figure (A, B, C, D, and E are entire nets, not individual interconnects), S = {B, D} and the
bridged net B is identified as the faulty net.
To also locate the bridged net C, the inverse test configuration with S′ = {A, C, E} is used,
as shown in the figure. Note that the second bridged net can also be located using a second test
configuration in which only the interconnects adjacent to the faulty net are used; however,
because layout information is required, the simpler inverse test configuration is preferred. For
example, a test configuration with S′ = {A, C} is sufficient to locate the bridged net C: net E
can be neglected because it is not adjacent to net B and thus unlikely to be bridged to net B.
(a) Original Test Configuration: nets B and D used; nets A, C, and E unused; bridge fault between nets B and C.
(b) Inverse Test Configuration: nets A, C, and E used; nets B and D unused.
Figure: Location of Both Nets of a Bridge Fault
4.2 Automated fault diagnosis:
Fault diagnosis is the process of determining the cause of a device failure, important
because the information obtained can be used to make manufacturing improvements to increase
yield. A common method is stuck-at fault diagnosis, which uses stuck-at faults to model the
behavior of a defect and identifies the smallest possible set of fault candidates, or faults that
could explain the failure (explain the output response of the defective device). Although the
stuck-at fault model is not an accurate model of real defects, for example a bridge, it is appealing
for diagnosis for several reasons. First, the simplicity of the stuck-at fault model permits
simple mathematical treatment of the faults considered during diagnosis.
Stuck-at Fault Diagnosis:
The main (and most challenging) part of previous diagnosis techniques is creating test
configurations, which is done because the existing manufacturing test configurations are not well
suited for diagnosis (they typically have very few primary outputs). The number of primary
outputs of a manufacturing test configuration is kept low for two main reasons: to reduce
pin-count requirements of the ATE, and to permit each test configuration to be used for differently
sized FPGAs that have different packages and pin-outs (in many test configurations all logic
blocks are simply joined to form a single long ILA that has a single primary output). In contrast,
the presented diagnosis technique is not based on generating a set of test configurations, but
instead takes advantage of the configurability of an FPGA to make use of the already existing
manufacturing test configurations (overcoming the problem of few primary outputs).
There are two primary benefits to using the existing manufacturing test configurations: the
diagnosis process is (1) very fast, and (2) applicable to any FPGA in production regardless of size or
architecture (provided manufacturing test configurations exist to test devices before being sold to
customers). First, compared to the ad hoc diagnosis done in practice, the diagnosis time is reduced
from days or weeks to only hours by eliminating the entire configuration generation process. Second,
compared to the diagnosis techniques that generate test configurations in advance, no additional
configurations must be created to diagnose FPGAs of different size or architecture.
Diagnosis Process:
The diagnosis of an FPGA consists of two main procedures:
(1) Identify fault candidates, and
(2) Systematically eliminate fault candidates.
Table lists all required steps in more detail. Fault candidates are identified in steps 1 through 4
and 7 (step 7 appears later in the process for implementation reasons, discussed in the fan-in cone
logic extraction section) and fault candidates are eliminated in steps 5, 6, 8, and 9. The entire
process (except step 10) is fully automated, producing a small set of fault candidates that are each
verified in step 10.
Table: Automated Fault Diagnosis Process

Step  Description                   Purpose
1     Device test                   Determine passing and failing test configurations
2     Readback verification         Eliminate failing configurations with faulty readbacks
3     Bistable location             Determine first failing bistables
4     Fan-in cone logic extraction  Determine set of logic fault candidates
5     Logic fault commonality       Eliminate logic fault candidates not common to all bistable-located configurations
6     Logic fault consistency       Eliminate logic fault candidates shown not to exist by passing configurations
7     Fan-in cone PIP extraction    Determine set of routing fault candidates
8     Routing fault commonality     Eliminate routing fault candidates not common to all bistable-located configurations
9     Routing fault consistency     Eliminate routing fault candidates shown not to exist by passing configurations
10*   Fault verification            Eliminate remaining fault candidates by independently testing for each

*Not an automated step.
Step 1: Device Test
To detect as many defects as possible during manufacturing test, the set of manufacturing
test configurations is developed to have a high fault coverage, defined as the percentage of faults
that can be detected out of all faults considered. A fault list, or list of faults to be detected (usually
all considered faults), guides this configuration generation process.
A particular test configuration is designed to detect a specific set of faults in the fault list;
however, as a byproduct the test configuration can also detect other faults. Consequently, the test
configuration is fault graded to determine all detectable faults, which are then dropped from the
fault list for the test configurations that remain to be generated.
Since it is known exactly which faults can be detected by each test configuration (each
was fault graded during manufacturing test configuration generation), the set of fault candidates
for the presented diagnosis process can be determined as the faults detectable by those test
configurations for which the FPGA failed (the manufacturing test configurations used to diagnose
the devices were fault graded for stuck-at faults on logic element inputs, logic element outputs,
and interconnects). Therefore, the first step in diagnosing an FPGA is to test the device with all,
or at least many, of these manufacturing test configurations to determine for which the device
fails and for which it passes.
This information is not collected during manufacturing test because an FPGA is rejected
immediately after it first fails (to reduce test time and therefore test cost); to avoid using
expensive ATE during this diagnosis step, inexpensive ATE can be used, as was done to obtain
the experimental results. A failing test configuration is one for which an FPGA fails, while a
passing test configuration is one for which an FPGA passes. Although more fault candidates are
identified when the device is tested with more test configurations (there are more failing
configurations), the number of fault candidates that can be eliminated also increases (there are
more passing configurations). Therefore, a finer resolution—fewer fault candidates—ultimately
results by maximizing the number of test configurations with which the device is tested.
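The candidate-identification idea in this step can be sketched as a set union over the fault-grading results. The configuration names and fault labels below are illustrative stand-ins, not values from the actual manufacturing test set:

```python
# Hypothetical fault-grading results: each manufacturing test
# configuration maps to the set of faults it is known to detect.
fault_grading = {
    "cfg_a": {"f1", "f2", "f3"},
    "cfg_b": {"f2", "f4"},
    "cfg_c": {"f5"},
}
failing = {"cfg_a", "cfg_b"}      # configurations for which the device failed

# The initial fault candidate set is the union of the faults detectable
# by every failing configuration.
candidates = set()
for cfg in failing:
    candidates |= fault_grading[cfg]

print(sorted(candidates))  # -> ['f1', 'f2', 'f3', 'f4']
```

Passing configurations (here cfg_c) contribute nothing to the candidate set at this stage; they are used later for elimination.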
Step 2: Readback Verification
After device test has identified the set of fault candidates (literally millions of candidates), the
next step is to systematically reduce this set. The first method is to locate the first bistable (or
bistables) in the FPGA to capture a faulty value—the first failing bistable —for each failing test
configuration (different bistables may be identified for each test configuration because each test
configuration either implements a different logic function or is a different mapping of the same
logic function to the FPGA). Bistable location is done using the special DFT hardware of
contemporary FPGAs that enables readback.
Readback is the serial scanning out of both the configuration data and the state of all
bistables in an FPGA (conceptually the inverse of configuration), which permits the state of the
device to be captured at any time. However, because the device being diagnosed is faulty,
readback may not function correctly; readback functionality must therefore be verified.
The result of capturing the device state via readback is a readback file, which contains
both the configuration memory bits and the bistable data bits of an FPGA. Any bit of a readback
file may be erroneous, or faulty. To continue the diagnosis process, it is necessary to determine
(1) which bits are faulty, and (2) how the faulty bit behaves (random or stuck-at). To identify a
readback bit that is stuck at some logic value (a stuck-at bit), a readback file is obtained from the
faulty device (the device being diagnosed) immediately after it is programmed with a test
configuration and compared to the readback file of a known-good device obtained in the same
manner. Because the state of both devices should match, any bits that differ between the faulty
and known-good read back files are faulty. If the logic value of those faulty bits does not change
when additional readback files of the faulty device are obtained (again immediately after
configuration), then those faulty bits are stuck-at bits. Stuck-at 1 bits are identified by
programming both devices with a special all-zero configuration (everything initialized to logic-0)
while stuck-at 0 bits are identified by programming both devices with a special all-ones
configuration (everything initialized to logic-1). Bits whose logic value changes when additional
readback files of the faulty device are obtained are random bits (readback is done in the exact
same manner each time, therefore each subsequent readback file should match the previous).
Faulty readback bits do not necessarily mean that the FPGA cannot be diagnosed using the
presented process: most readback bits will not be used to conduct bistable location anyway (only
about 2-3% of readback bits are meaningful in any given configuration). Only if there is a
stuck-at 0 or stuck-at 1 readback bit that corresponds to a bistable used in a particular failing test
configuration (the readback bit represents the data stored in that bistable) is that test configuration
discarded as unusable for locating the first failing bistable. A random readback bit does
not cause a test configuration to be discarded since a method of majority voting can be used
during bistable location.
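A minimal sketch of this bit classification, assuming readback files are modeled as equal-length bit strings; the function name and data are illustrative, and a real flow would use the dedicated all-zero and all-ones configurations described above:

```python
def classify_bits(good, faulty_readbacks):
    """Compare repeated readbacks of a faulty device against a known-good
    readback; return bit positions classified as stuck-at or random."""
    result = {"stuck-at-0": set(), "stuck-at-1": set(), "random": set()}
    first = faulty_readbacks[0]
    for i, g in enumerate(good):
        values = {rb[i] for rb in faulty_readbacks}
        if len(values) > 1:
            result["random"].add(i)       # value changes between readbacks
        elif first[i] != g:               # constant but differs from known-good
            result["stuck-at-0" if first[i] == "0" else "stuck-at-1"].add(i)
    return result

# Bit 1 is constantly wrong (stuck-at 1); bit 2 flips between readbacks (random).
print(classify_bits("1010", ["1110", "1100"]))
```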
Step 3: Bistable Location
When a fault is activated, a faulty response may be observed on a primary output; however,
fault activation may have occurred many clock periods before the faulty primary output is
observed (this is what makes the manufacturing test configurations difficult to use for diagnosis).
By obtaining readback files at various times during the test of an FPGA, the first bistable (or
bistables) to capture a faulty value (the result of fault activation) can be determined, thereby using
the bistable as a test observation point. Because the fault must lie in the fan-in cone of this first
failing bistable, which is defined as all logic and routing resources on any path traced backwards
from the input of that bistable to the output of another bistable or a primary input, a large number
of fault candidates can be eliminated (those candidates not in that fan-in cone can be eliminated).
Since each test configuration either implements a different logic function or is a different
mapping of the same logic function to the FPGA, each may use different logic resources and may
have a different routing of paths; therefore, a different first failing bistable may be identified for
each failing test configuration. To determine the first failing bistable, readback files of the faulty
FPGA are compared to those of a known-good FPGA, both programmed with the same test
configuration and both tested with the same test vectors, V vectors in total. If the fault has not
been activated after V vectors are applied, then the state of each device will match; once the fault
is activated, the state of each device will not match.
The number of vectors, V, to apply to each device is determined via divide-and-conquer.
First, V is set to one half the total number of vectors for the test configuration (one half of Vtot).
These V vectors, the first Vtot/2, are then applied to both devices, after which readback files are
obtained for both devices and compared: if they match, the number of vectors is increased by half
(V = V + Vtot/4); if they do not match, the number of vectors is reduced by half (V = V − Vtot/4).
Figure shows the full binary search algorithm to identify the first failing bistable (or bistables);
given Vtot vectors for a particular test configuration, the process requires ⌈log2 Vtot⌉ iterations.
Note that because FPGA programming and readback are slow, bistable location consumes most of
the time required for the presented diagnosis process. Diagnosis time is thus proportional to the
number of failing test configurations for which the first failing bistable is located, called
bistable-located test configurations.
for each failing test configuration
    set vectors V = ⌈Vtot/2⌉                          # start search at midpoint
    for k = 1 to ⌈log2 Vtot⌉                          # iterate
        test both devices with V vectors
        obtain readbacks for both devices
        if readbacks match
            increase vectors to V = V + ⌊Vtot/2^(k+1)⌋    # increase search space
        if readbacks do not match
            decrease vectors to V = V − ⌈Vtot/2^(k+1)⌉    # decrease search space

Figure: Bistable Location Algorithm
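The search in the figure can be sketched as runnable code. Here `readbacks_match` is a stand-in (an assumption for illustration) for testing both devices with V vectors and comparing readbacks, with the fault modeled as activating at a fixed vector index; the loop converges on the largest V whose readback still matches, so the fault activates on the next vector:

```python
import math

def locate_first_failure(v_tot, fault_vector):
    # Stand-in for "apply the first v vectors to both devices and compare
    # readbacks": the device states match until the fault activates.
    def readbacks_match(v):
        return v < fault_vector

    v = math.ceil(v_tot / 2)                      # start search at midpoint
    for k in range(1, math.ceil(math.log2(v_tot)) + 1):
        if readbacks_match(v):
            v += v_tot // 2 ** (k + 1)            # increase search space
        else:
            v -= math.ceil(v_tot / 2 ** (k + 1))  # decrease search space
    return v  # largest vector count whose readback still matches

print(locate_first_failure(v_tot=64, fault_vector=37))  # -> 36
```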
Although a known-good device can be tested ahead of time and the known-good readback
files can be stored, there is no reason to do so: no delay is incurred by testing the known-good
device in parallel with the faulty device (test both devices at the same time). Furthermore, it is
impractical to store all possible readback files for the known-good device because the required
disk space is impractically large: one readback file would be needed to store the state of the
known-good device after each additional vector is applied (one readback file for V = 1, one for
V = 2, up to one for V = Vtot), where each readback file is several megabytes in size.
However, if it was determined by the readback verification step that random readback bits
exist, then the faulty device can be tested multiple times with V vectors (an odd number of times)
and multiple readback files can be obtained per iteration. The 'correct' value of each random
readback bit during a particular iteration can be determined by a majority vote between the
multiple readback files and a 'correct' faulty readback file can be composed for comparison to the
known-good readback file. This majority vote technique was successfully used for both
diagnosed devices; it was found that 3 tests (configuration and application of V vectors) and
readbacks per iteration were sufficient.
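The vote itself is straightforward; a minimal sketch, assuming readback files are modeled as equal-length bit strings:

```python
def majority_readback(readbacks):
    """Compose a 'correct' readback by bitwise majority vote over an odd
    number of readback files (given as equal-length bit strings)."""
    assert len(readbacks) % 2 == 1, "use an odd number of readbacks"
    half = len(readbacks) // 2
    return "".join(
        "1" if sum(rb[i] == "1" for rb in readbacks) > half else "0"
        for i in range(len(readbacks[0]))
    )

# Bit 1 differs in the second readback and bit 2 in the third; the vote
# recovers the consistent value in each position.
print(majority_readback(["1010", "1110", "1000"]))  # -> 1010
```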
If a majority vote for random readback bits is not done then the bistable location algorithm
may identify an incorrect first failing bistable: the random readback bit may result in an incorrect
decision (increase or decrease in the number of vectors, V ) during readback comparison. Note
that as the number of tests and readbacks of the faulty device increases, the probability that the
bistable location algorithm will identify an incorrect first failing bistable decreases (unless the
logic value of the random readback bit has an equal likelihood of being either logic-0 or logic-1).
Additionally, because the faulty device is programmed multiple times and multiple readback files
are obtained (per iteration), the total diagnosis time increases (configuration and readback are
slow).
Step 4: Fan-in Cone Logic Extraction
By definition, the first failing bistable (or bistables) for each bistable-located test
configuration is the first bistable to capture a faulty value; all other bistables must have captured
fault-free values. Thus, the fan-in cone of each first failing bistable, defined as all paths traced
backwards from the input of the first failing bistable, through only routing and combinational
elements, to the output of a previous bistable (called a last passing bistable) or a primary input,
can be used to eliminate a large number of fault candidates since the fault must lie in this fan-in
cone. Figure shows the fan-in cone of a first failing bistable. Note that the output from the last
passing bistable and both the data and clock inputs to the first failing bistable are also potentially
faulty, since a fault on each could cause the bistable to capture a faulty value.
[Figure omitted: last passing bistables drive combinational logic and routing feeding the first
failing bistable; the fan-in cone enclosing these elements contains the fault candidates.]
Figure: First Failing Bistable Fan-in Cone
First, each bistable-located test configuration is converted from its Hardware Description
Language (HDL) syntax into graph notation to permit traversal of the fan-in paths of a particular
first failing bistable. For the experiments, all test configurations were originally specified in
Xilinx Description Language (XDL). In an XDL description, logic elements are explicitly
identified but routing elements (PIPs and interconnects) are only implicitly identified (discussed
in the fan-in cone PIP extraction section). Consequently, to simplify the implementation of the
diagnosis process, only logic fault candidates are extracted from the fan-in cone. Thus, in the
presented diagnosis process, logic fault candidates are considered first, followed by routing fault
candidates.
In the graph notation for a given test configuration, each vertex is an input or output of a
logic element and each edge is a net. To determine the set of logic fault candidates, each fan-in
path of the first failing bistable is traversed backwards to the last passing bistable (or primary
input), recording each vertex along the way.
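This backward traversal can be sketched over a toy netlist graph; the vertex names and graph encoding below are hypothetical, chosen only to illustrate the stopping conditions:

```python
def fanin_cone(drivers, bistables, primary_inputs, first_failing):
    """Walk backwards from the first failing bistable, collecting every
    vertex until a bistable output or primary input is reached.

    drivers maps each vertex to the list of vertices driving it."""
    cone, stack = set(), [first_failing]
    while stack:
        v = stack.pop()
        for d in drivers.get(v, []):
            if d in cone:
                continue
            cone.add(d)                       # d is a logic fault candidate
            if d not in bistables and d not in primary_inputs:
                stack.append(d)               # keep tracing through comb. logic
    return cone

# Toy netlist: ff_pass -> g1 -> g2 -> ff_fail, with primary input in_a
# also feeding g2.
drivers = {"ff_fail": ["g2"], "g2": ["g1", "in_a"], "g1": ["ff_pass"]}
print(sorted(fanin_cone(drivers, {"ff_pass"}, {"in_a"}, "ff_fail")))
# -> ['ff_pass', 'g1', 'g2', 'in_a']
```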
Step 5: Logic Fault Commonality
Up to this point each test configuration has been considered individually, each used to
individually test the FPGA, identifying a set of logic fault candidates for that particular test
configuration. If only a single fault is assumed to exist (one fault in the FPGA being diagnosed),
then there must be a logic fault candidate that is common to every failing test configuration (the
single fault must be the cause of failure for each test configuration). Thus, logic fault candidates
not common to all bistable-located test configurations can be eliminated. The number of logic
fault candidates that can be eliminated is proportional to the number of bistable-located test
configurations. For both diagnosed devices, it was found that bistable-locating only 2-3% of the
failing test configurations resulted in enough fault candidate elimination for successful diagnosis.
The aggregate set of logic fault candidates is the union of all logic fault candidates
identified by each bistable-located test configuration. If elimination yields a very small aggregate
set of logic fault candidates, say no more than 10 (chosen because it is a sufficiently small
number of fault candidates to be easily verified by the fault verification step), then routing fault
candidates need not be considered (the routing fault steps should still be conducted if routing
faults are suspected). If elimination yields an empty aggregate set of logic fault
candidates, then
(1) there are multiple logic faults,
(2) there is one or more routing faults,
(3) there are both logic and routing faults,
or (4) there is a defect in the faulty device that is not modeled by a single stuck-at fault.
So as not to eliminate possible logic faults (if multiple faults or an unmodeled defect exist), the
empty aggregate set of logic fault candidates is restored to contain all logic fault candidates and
the diagnosis process is continued.
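The commonality rule and the empty-set restoration reduce to set operations; the fault labels below are illustrative:

```python
# Hypothetical candidate sets from each bistable-located configuration.
per_config = [
    {"f1", "f2", "f3"},   # from bistable-located configuration 1
    {"f2", "f3", "f4"},   # from bistable-located configuration 2
    {"f2", "f5"},         # from bistable-located configuration 3
]

# Single-fault assumption: only candidates common to every configuration
# survive; an empty intersection triggers restoration of the full union.
common = set.intersection(*per_config)
aggregate = common if common else set.union(*per_config)
print(sorted(aggregate))  # -> ['f2']
```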
Step 6: Logic Fault Consistency
Test configurations for which the faulty device passes (passing test configurations) are as
important to the diagnosis process as those for which the device fails (failing test configurations),
since they identify fault-free resources. Logic fault candidates shown not to exist by these passing
test configurations can be eliminated. The amount of elimination is proportional to both the
number of passing configurations and the fault coverage of each.
Similar to the logic fault commonality step, if elimination yields a very small aggregate set of
logic fault candidates (say no more than 10), then routing fault candidates need not be considered.
In contrast, if all logic fault candidates are eliminated (all logic faults are shown not to exist by
the passing test configurations), then either (1) the fault being diagnosed is a routing fault, or (2)
there is an unmodeled defect in the faulty device. If, after the following routing fault steps are
conducted, no routing fault can be identified, the empty aggregate logic fault candidate set can
be restored to contain all logic fault candidates and fault verification (discussed in the fault
verification section) can be conducted on each (if the set is sufficiently small). For the diagnosed
device in Sec. 7.3 whose logic fault candidate set became empty after the logic fault consistency
step, restoration was not required: a routing fault was successfully identified.
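The consistency rule is a set difference against the fault-grading results of the passing configurations; labels below are illustrative:

```python
aggregate = {"f2", "f5", "f7"}                  # surviving candidates
detectable_by_passing = [{"f5"}, {"f7", "f9"}]  # fault-graded passing configs

# A candidate that a passing configuration would have detected cannot
# be the fault, so it is eliminated.
for detected in detectable_by_passing:
    aggregate -= detected
print(sorted(aggregate))  # -> ['f2']
```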
Step 7: Fan-in Cone PIP Extraction
In the presented diagnosis process routing faults are considered only if too many logic fault
candidates remain (a large number of logic faults were not eliminated). To determine the routing
fault candidates, each fan-in path of a particular first failing bistable (an edge in the graph
notation discussed in the fan-in cone logic extraction section) is expanded to reveal all the PIPs
and interconnects that compose that path. Note that it is not necessary to include fan-out paths of
a net. For example, consider the configuration shown in the figure, in which each interconnect is
denoted as I and each PIP as I → J, I ≠ J. The fan-in cone of the first failing bistable consists of
the path {A → B, B → C, C → D}, but the entire net also contains the fan-out path {B → X,
X → Y, Y → Z}: this fan-out path can be ignored because no fault exists along it (if a fault did
exist then the bistable at the end of that path would also be a first failing bistable).
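The example can be made concrete by encoding each PIP as a (driver, sink) pair of interconnects; tracing backwards from the interconnect feeding the first failing bistable naturally never enters the fan-out branch. The encoding is an illustrative assumption (each sink is driven by a single PIP):

```python
# The example net, with each PIP I -> J written as a (driver, sink) pair.
net_pips = [("A", "B"), ("B", "C"), ("C", "D"),
            ("B", "X"), ("X", "Y"), ("Y", "Z")]

def fanin_pips(pips, sink):
    """Trace backwards from the interconnect feeding the first failing
    bistable; fan-out branches are never reached."""
    by_sink = {s: d for d, s in pips}   # each sink has a single driver
    path, node = [], sink
    while node in by_sink:
        driver = by_sink[node]
        path.append((driver, node))
        node = driver
    return path

print(fanin_pips(net_pips, "D"))
# -> [('C', 'D'), ('B', 'C'), ('A', 'B')], fan-out {B->X, X->Y, Y->Z} ignored
```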
Because the XDL specification of a test configuration describes a net by the set of PIPs it
contains and only implicitly identifies the interconnects (PIP I → J implicitly identifies
interconnects I and J), it is sufficient to extract only the PIPs in the fan-in cone of each first
failing bistable. Thus, to determine the set of routing fault candidates, each fan-in path of the first
failing bistable is traversed backwards to the last passing bistable (or primary input), recording
each PIP along the way. Note that the routing fault candidates correspond to stuck-at faults on the
input and output of each PIP (each input or output of a PIP is an interconnect, logic element input,
or logic element output).
Step 8: Routing Fault Commonality
Elimination of routing fault candidates via routing fault commonality is analogous to
elimination of logic fault candidates via logic fault commonality. Again, if only a single fault is
assumed to exist (one fault in the FPGA being diagnosed), then there must be a routing fault
candidate that is common to every failing test configuration. Thus, routing fault candidates not
common to all bistable-located test configurations can be eliminated. The number of routing fault
candidates that can be eliminated is proportional to the number of bistable-located test
configurations.
The aggregate set of routing fault candidates is the union of all routing fault candidates
identified by each bistable-located test configuration. If elimination yields a very small aggregate
set of routing fault candidates (say no more than 10), then diagnosis can immediately conclude
with the fault verification step.
Step 9: Routing Fault Consistency
The final automated step in the presented diagnosis process is to eliminate routing fault
candidates shown not to exist by the passing test configurations. Since a PIP implicitly identifies
two interconnects, if it is shown to be fault-free then the two interconnects it joins must also be
fault-free (for example if PIP I → J passes in some test configuration, then the PIP I → J and
interconnects I and J are fault-free, and can be eliminated as routing fault candidates). The
number of routing fault candidates that can be eliminated is proportional to both the number of
passing configurations and the fault coverage of each.
If all logic fault candidates were eliminated after the logic fault commonality and consistency
steps (an empty aggregate set of logic fault candidates resulting from the single fault assumption),
then the only way all routing fault candidates can be eliminated during this step is if there is an
unmodeled defect in the faulty device. If this occurs, the aggregate set of routing fault candidates
is restored to contain all routing fault candidates and fault verification is conducted on each (if
the set is sufficiently small).
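The PIP-implies-interconnect rule from this step can be sketched directly; the string encoding of candidates is an illustrative assumption:

```python
candidates = {"pip:C->D", "int:C", "int:D", "pip:W->C", "int:W"}
passing_pips = [("C", "D")]   # PIPs shown fault-free by passing configs

# A fault-free PIP I->J also clears the two interconnects I and J it joins.
for i, j in passing_pips:
    candidates -= {f"pip:{i}->{j}", f"int:{i}", f"int:{j}"}
print(sorted(candidates))  # -> ['int:W', 'pip:W->C']
```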
Step 10: Fault Verification
The fault verification step identifies which of the remaining fault candidates (those fault
candidates not eliminated) are actually faulty. However, unlike the preceding steps, fault
verification is not easily automated, since it involves generating a new test configuration (or
multiple configurations) that independently tests each fault candidate (if possible). Because this
is most easily done by hand, the fault verification step was not automated and is therefore
not discussed further.
A new test configuration can be created by modifying an existing manufacturing test
configuration—one that already tests for the particular fault candidate that is to be verified.
The manufacturing test configuration is modified to contain only the resource identified by the
fault candidate (for example a PIP or a look-up table) and a minimal amount of resources that
permit control and observation of the logic value at the fault candidate location (for example PIPs
and interconnects shown to be fault-free by passing test configurations).
Control can be achieved by creating a path from a primary input to the fault candidate location
while observation can be achieved by creating a path from the fault candidate location to a
primary output. Note that control and observation can also be achieved by creating a path from
some bistable to the fault candidate location, creating a path from the fault candidate location to
another bistable, and using readback to control and observe the bistable inputs and outputs,
respectively.
Verifying a fault candidate is done by independently testing for that fault—controlling and
observing the logic value at the fault location while not activating any of the other fault
candidates. Thus, when a fault candidate is tested (one at a time, each with its own test
configuration) and a failure is observed, and assuming all other elements used to create the
control and observation paths are fault-free, the failure can only be the result of that fault
candidate and not of any others (or of an unmodeled defect). Therefore, a fault
candidate verified in this manner is said to be an actual fault (no longer a candidate).
Although logic fault verification is simply done by independently testing for each logic fault
candidate, routing fault verification requires a little more care since both PIPs and interconnects
must be considered.
For example, let the aggregate set of routing fault candidates of an FPGA with a single
routing fault be PIPs B → C, C → D, and W → C, identified via bistable location of the two
manufacturing test configurations shown in the figure (potentially-faulty PIPs are bold). Clearly the
common PIP C → D may be faulty; however, either interconnect C or D may instead be faulty.
Thus, new test configurations used for routing fault candidate verification must test for PIPs and
interconnects independently.
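A simple heuristic for ordering this verification, mirroring the Device 2 result below where one interconnect was common to 3 of 9 suspect PIPs, is to count interconnect occurrences among the remaining PIPs and test the most shared one first (a sketch with illustrative data):

```python
from collections import Counter

# Remaining potentially-faulty PIPs, each as a pair of interconnects.
suspect_pips = [("B", "C"), ("C", "D"), ("W", "C")]

# Count how often each interconnect appears among the suspect PIPs and
# test the most shared one first.
counts = Counter(i for pip in suspect_pips for i in pip)
print(counts.most_common(1)[0])  # -> ('C', 3)
```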
Logic Fault Diagnosis
Device 1 failed roughly 30% of the several hundred manufacturing test configurations (device
test, step 1), of which 5 were determined to be usable for bistable location (readback verification,
step 2). The same first failing bistable was identified for each test configuration (bistable location,
step 3), resulting in 34 logic fault candidates (fan-in cone logic extraction, step 4). After logic
fault candidate elimination (steps 5 and 6) only 9 logic fault candidates remained; therefore, the
routing fault steps were skipped and fault verification (step 10) was conducted.
Eight of the 9 logic fault candidates were stuck-at 0 and stuck-at 1 faults on the 4 address
inputs of a look-up table while the ninth was a stuck-at 1 fault on the write select enable (ws en)
input of the same look-up table. By testing for each logic fault independently (fault verification,
step 10), it was shown that only the ninth fault candidate is actually a fault.
4.3 Routing Fault Diagnosis:
Device 2 failed roughly 25% of the several hundred manufacturing test configurations (device
test, step 1), of which 4 were determined to be usable for bistable location (readback verification,
step 2). Four different first failing bistable s were identified (bistable location, step 3), resulting in
273 logic fault candidates (fan-in cone logic extraction, step 4). After logic fault candidate
elimination (steps 5 and 6), no logic fault candidates remained; therefore, it was assumed that the
fault being diagnosed is a routing fault.
The fan-in cones of the 4 first failing bistables identified 83 routing fault candidates (fan-in
cone PIP extraction, step 7), of which 27 were common to all bistable-located test configurations
(routing fault commonality, step 8). Only 9 routing fault candidates were not eliminated by the
passing test configurations (routing fault consistency, step 9), yielding a small set of fault
candidates for fault verification.
One interconnect was common to 3 of the 9 potentially-faulty PIPs; therefore, it was logical to
test this interconnect first during fault verification (step 10), which revealed that it was stuck-at 1.
Furthermore, independently testing each of the other routing fault candidates showed that they did
not correspond to actual faults.
CHAPTER – 5
BUILT IN SELF TEST
5.1 INTRODUCTION:
We present a Built-In Self-Test (BIST) approach able to detect and accurately diagnose all
single and practically all multiple faulty programmable logic blocks (PLBs) in Field
Programmable Gate Arrays (FPGAs) with maximum diagnostic resolution. Unlike conventional
BIST, FPGA BIST does not involve any area overhead or performance degradation. We also
identify and solve the problem of testing configuration multiplexers that was either ignored or
incorrectly solved in most previous work. We introduce the first diagnosis method for multiple
faulty PLBs; for any faulty PLB, we also identify its internal faulty modules or modes of
operation. Our accurate diagnosis provides the basis for both failure analysis used for yield
improvement and for any repair strategy used for fault-tolerance in reconfigurable systems. We
present experimental results showing detection and identification of faulty PLBs in actual
defective FPGAs. Our BIST architecture is easily scalable.
Advances in VLSI technology have resulted in more defects affecting the delays in the circuit,
thus increasing the importance of delay-fault testing. The current FPGA manufacturing testing
practice relies on configuring the FPGA with many designs and running them at speed. This is
useful for speed-binning, but it may not be a comprehensive delay-fault test. One of the main
difficulties in the problem of delay-fault testing for FPGAs originates in the fact that users’
circuits are not known when the chip is manufactured. While for an Application-Specific
Integrated Circuit (ASIC) the target clock frequency is known before the chip is fabricated, it is
impossible for an FPGA manufacturer to test the circuits that will be implemented on each device.
Previous work in FPGA delay-fault testing considered only the testing of the user’s circuit.
However, such an approach is problematic both for FPGA manufacturers, who want to ship
defect-free chips, and for users, who want to purchase defect-free parts (otherwise, system-level
debugging becomes very difficult if design problems are intermixed with manufacturing faults).
Testing only the user’s circuit is also unsafe for on-line testing, because new faults are equally
likely to occur in spare resources or in the currently unused portion of the operational part of the
system. These dormant faults may accumulate and have a detrimental effect on the system
reliability since these spare and unused resources will be used in fault-tolerant applications to
replace defective resources. Thus to guarantee a reliable operation, an on-line test must
completely check all the system resources, including spares.
In testing FPGAs, we can take advantage of their unique properties - reconfigurability, partial
reconfigurability, and regular structure - to achieve features that cannot be realized in ASIC
testing. Unlike in ASICs, where BIST requires large area overhead and some performance
degradation, FPGA BIST can be done with no area overhead and no delay penalty. In FPGAs, we
can pseudo-exhaustively test both the programmable logic blocks (PLBs) and the programmable
interconnect network, during both off-line and on-line testing with reasonable test application
times. Pseudo-exhaustive testing is practically complete and removes the need to evaluate the test
quality. Thus, pseudo-exhaustive FPGA BIST approaches do not require either Automatic Test
Pattern Generation (ATPG) or fault simulation, which are computationally expensive tasks for
ASICs. Accurate fault diagnosis is difficult to achieve in ASICs, but it can be done efficiently in
FPGAs. In ASICs, fault tolerance for manufacturing faults cannot be accomplished without
massive redundancy, but FPGAs allow efficient fault tolerance without dedicated spare resources
as a result of the ability to reconfigure around faults.
Our first attempt to perform FPGA delay-fault testing was to run our BIST configurations for
logic and interconnect at the specified clock rate for the target FPGA. However, we quickly
learned that this approach cannot work without significantly increasing the number of BIST
configurations that must be applied to the FPGA. The number of configurations is the dominant
factor determining the total test time, because the configuration download time is several orders
of magnitude larger than the test-pattern application time. To minimize the total number of
configurations, our BIST techniques try to test as many resources as possible within the same
configuration. For off-line PLB test, this requires distributing the patterns from the test-pattern
generator (TPG) to many PLBs under test using long signal paths. Similarly, for interconnect test,
the wires under test are long, connecting many wire segments and programmable switches. Under
these conditions, we must run the BIST configurations at a frequency much lower than that used
in normal operation, so most delay-faults will not be detected by these tests. An alternative
would be to reduce the amount of logic and/or interconnect that is under test during any given
BIST configuration to allow BIST execution at a much higher clock frequency; however, this will
require many more configurations, and ultimately significant increases in testing time and cost.
5.2 THE BIST ARCHITECTURE:
The strategy of our FPGA BIST approach is to configure groups of PLBs as test pattern
generators (TPGs) and output response analyzers (ORAs), and another group as blocks under test
(BUTs), as illustrated in Figure 1a. The BUTs are then repeatedly reconfigured to test them in all
their modes of operation. We refer to the test process that occurs for one configuration as a test
phase. A test session is a sequence of test phases that completely test the BUTs in all of their
modes of operation. Once the BUTs have been tested, the roles of the PLBs are reversed so that
in the next test session the previous BUTs become TPGs or ORAs, and vice versa. Since half of
the PLBs are BUTs during each test session, we need only two test sessions to test all PLBs in
the FPGA. Figures 1b and 1c show the floorplans for the two test sessions, called NS and SN.
The name of a session denotes the direction of the flow of test patterns during that
session. The floorplan for SN is obtained by flipping the floorplan for NS around the horizontal
axis shown as a dotted line in the middle of the array. This changes the roles of the PLBs so that
every PLB is under test in one of the two test sessions. Note that all BUTs are tested in parallel
and that patterns may be applied at-speed.
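The two-session scheme can be sketched as follows. The eight-row role assignment below is a hypothetical stand-in, not the paper's exact floorplan, but it shows the invariant that matters: SN is the NS floorplan flipped about the horizontal axis, half of the PLBs are BUTs in each session, and the two BUT sets together cover every PLB:

```python
# Hypothetical per-row roles for an 8-row array in session NS.
NS = ['TPG', 'BUT', 'ORA', 'BUT', 'ORA', 'BUT', 'ORA', 'BUT']
# Session SN: the NS floorplan flipped about the horizontal mid-line.
SN = list(reversed(NS))

def covers_all_plbs(session_a, session_b):
    """True if every row is a block under test in at least one session."""
    return all(a == 'BUT' or b == 'BUT' for a, b in zip(session_a, session_b))
```

Only two sessions are ever needed, regardless of array size, because the flip makes the two BUT sets complementary.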
Figure 1: The BIST architecture. a) TPG, BUT, and ORA connections (BIST Start/Reset and
BIST Done signals, Pass/Fail scan chain through the ORAs); b) floorplan for the first test
session (NS).
Each test phase consists of the following steps: 1) reconfigure the FPGA, 2) initiate the test
sequence, 3) generate test patterns, 4) analyze output responses, and 5) read the test results. In
step 1, the test controller (ATE for wafer/package testing; CPU or maintenance processor for
board/system testing) interacts with the FPGA(s) under test to reconfigure the logic by retrieving
a BIST configuration from the configuration storage (ATE memory; disk) and loading it into the
FPGA(s). The test controller also initializes the TPGs, BUTs, and ORAs and initiates the BIST
sequence and reads the subsequent Pass/Fail results. Steps 3 and 4 are concurrently performed by
the BIST logic within the device. After the board or system-level BIST is complete, the test
controller must reconfigure the FPGA for its normal system function; hence the normal device
configuration must be stored along with the BIST configurations. The test application time is
dominated by the FPGA reconfiguration time. Since the total test and diagnosis time is a major
factor in system down time (and hence system availability) and cost, an important goal of our BIST
approach is to minimize the number of configurations used for test and diagnosis.
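The five per-phase steps can be sketched as a controller loop. The three callables are placeholders for whatever interface the ATE or maintenance processor actually provides; they are assumptions for illustration:

```python
def run_test_session(phases, load_configuration, run_bist, read_results):
    """Drive one test session: for each phase, reconfigure (step 1),
    run the BIST sequence (steps 2-4), and read Pass/Fail (step 5)."""
    results = []
    for phase in phases:
        load_configuration(phase)            # step 1: dominates test time
        run_bist(phase)                      # steps 2-4: on-chip, concurrent
        results.append(read_results(phase))  # step 5: retrieve Pass/Fail
    return results
```

The structure makes the cost model explicit: the on-chip steps run at speed, while step 1 repeats a serial configuration load once per phase.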
The figure below illustrates the typical structure of a PLB, consisting of a memory block that can
function as a look-up table (LUT) or RAM, several flip-flops (FFs), and multiplexing output
logic. The LUT/RAM block may also contain special-purpose logic for arithmetic functions
(counters, adders, multipliers, etc.). The RAM may be configured in various modes of operation
(synchronous, asynchronous, single-port, dual-port, etc.). The FFs can also be configured as
latches, and may have programmable clock-enable, preset/clear, and data-selector functions. Our
strategy relies on pseudo-exhaustive testing, which in this context means that every subcircuit of
a PLB is tested with exhaustive patterns in each one of its modes of operation [1]. The memory
block is checked with RAM test sequences which are exhaustive for faults specific to RAMs [12].
Note that all three subcircuits of a PLB are easily controllable and observable from the PLB’s I/O
pins, and that exhaustive testing of every module is feasible since the number of inputs is
reasonably small.
This results in practically complete fault coverage without explicit fault model assumptions
and without fault simulation. For example, all single and multiple stuck-at faults, as well as all
faults (of any type) that do not increase the number of states, are guaranteed to be detected; the
overwhelming majority of faults that do increase the number of states are also detected. Thus for
any practical purpose the PLB logic test is complete. In every test phase, a BUT is configured in a
different mode of operation; hence its pseudo-exhaustive test is spread across the phases of a test session.
Figure: Typical PLB structure (LUT/RAM block, FFs, and output logic).
The figure below shows the structure of an ORA comparing four pairs of BUT outputs. The signal
BuOi (BdOi) is the i-th output from the BUT above (below) the ORA. The FF stores
the result of the comparison, and the feedback loop latches the first mismatch in the FF. This
represents a compression of the results, since any number of mismatches in the same phase
translates into one error. The result FFs of all ORAs are connected to form a scan chain
(indicated by the dotted lines in Figure 1a). In every test phase, the scan chain is also tested, to
ensure the integrity of the test results. Our previous BIST approach [40] recorded only one
Pass/Fail result for every row of ORAs. However, storing the test results of each ORA cell
significantly improves the diagnosis resolution achievable by this architecture, as we will show
in Section 5. For an N×N FPGA, the number of ORA cells is NORA = (N^2/2) - N.
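The ORA's compare-and-latch behavior and the ORA-cell count follow directly from the description above. The sketch below is an illustrative software model, not the hardware itself:

```python
def ora_result(bu_outputs, bd_outputs):
    """Latched Pass/Fail FF of one ORA: True once any pair mismatches.
    Any number of mismatches in a phase compresses into one error bit."""
    mismatch = False
    for bu, bd in zip(bu_outputs, bd_outputs):
        mismatch = mismatch or (bu != bd)  # feedback loop keeps it latched
    return mismatch

def ora_cell_count(n):
    """ORA cells in an N x N array, per the formula above: N^2/2 - N."""
    return n * n // 2 - n
```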
Figure: Integrated ORA/scan cell, comparing BuO1-BuO4 against BdO1-BdO4, with scan-in from
the previous ORA and TDI/TCK scan control.
Testing Configuration Multiplexers:
A configuration multiplexer (MUX) is a commonly used hardware mechanism that selects
sub circuits for various modes of operation. A configuration MUX is controlled by
configuration memory bits to select one input to be connected to its output. In Figure 4a, assume
that we set the configuration bit S to 0 to connect V0 to X. Then the subcircuit producing V1
disappears from the circuit model seen by the user. This is correct from a design viewpoint,
because the value V1 can no longer affect X in the current configuration. But from a testing
viewpoint, any test for the MUX needs to set V0 and V1 to complementary values. In
general, for a MUX with k inputs, if V is the value of the selected input, all the other k-1 inputs
should be set to the complement of V.
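The required stimulus for a k-input configuration MUX can be sketched as follows (an expository helper, not part of the BIST hardware): when the selected input carries value v, every inactive input must carry the complement, so that a wrong selection or a short would flip the observed output.

```python
def mux_test_inputs(k, selected, v):
    """Values to drive on all k MUX inputs for one test:
    the selected input gets v, every inactive input gets 1 - v."""
    return [v if i == selected else 1 - v for i in range(k)]
```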
Our solution relies on separately configuring the invisible logic so that it will generate the
proper values needed for the inactive MUX inputs. Then we “overlay” the resulting
configuration files over the main configuration file with the active logic, and we “merge” them
without changing any MUX setting done in the main configuration. This process is conceptually
simple, but its implementation requires knowledge of the FPGA configuration stream structure.
Note that in most previous work dealing with testing FPGAs, the problem of testing a
configuration MUX is either not addressed or it is “solved” functionally, by connecting every
input in turn to the output, and providing both 0 and 1 values to the selected input. However, the
invisible logic driving the inactive inputs is completely ignored. Hence prior claims of
“complete testing” may not be valid since the testing of every configuration MUX in the FPGA
is likely to be incomplete. Methods that did provide complete tests for configuration
multiplexers used models that did not remove the invisible logic. But such models
cannot be used with the existing FPGA CAD tools to generate configuration files.
Testing Memory Blocks:
For the LUT/RAM block of a PLB, we first test its RAM mode of operation. We configure
the TPGs to apply a march test, which detects, among other faults, all the stuck-at faults in
memory cells, as well as all faults in the address and read/write circuitry of the RAM. We
rely on this RAM-mode test as the major test of the memory block, so that subsequent tests
for different modes of operation of the same block do not need to retarget the already
detected faults. When we test the LUT operation, we configure alternating XOR/XNOR
functions for the LUT outputs during one test phase, and alternating XNOR/XOR functions
during a second test phase. For any combinational function of n inputs, the TPG will apply
all 2^n vectors.
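The LUT-mode test described above can be sketched as a software model: program the LUT as XOR (or XNOR in the second phase) and apply all 2^n vectors, comparing against the expected parity. This is an illustrative checker, not the on-chip TPG/ORA logic:

```python
def exhaustive_lut_test(lut, n, invert=False):
    """Apply all 2**n vectors to an n-input LUT modeled as a function of a
    bit tuple; invert=False checks the XOR phase, invert=True the XNOR phase."""
    for v in range(2 ** n):
        bits = tuple((v >> i) & 1 for i in range(n))
        expected = (sum(bits) % 2) ^ (1 if invert else 0)
        if lut(bits) != expected:
            return False  # a hardware ORA would latch this mismatch
    return True
```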
Unlike dedicated RAM BIST logic in an ASIC, in an FPGA the BIST logic disappears after the RAM test
session. All embedded RAMs may be tested in parallel, possibly sharing the BIST controller.
If the FPGA has several identical RAM modules, we can feed them with the same patterns,
and use comparators to check mismatches between corresponding outputs.
Scalability of the BIST Approach:
Our BIST architecture has a very regular, easily scalable structure, automatically
generated by a simple procedure (that also does algorithmic placement and routing), based
on the dimensions (N×N) of the FPGA array. Since an ORA compares the outputs of its two
neighbor BUTs, all signals from BUTs to ORAs use only local routing resources, which are
oblivious to the size of the FPGA. Global routing is used to distribute the patterns generated
by TPGs to BUTs. Ignoring fanout load limitations, adding rows and columns to an array of
PLBs will just extend the length of the vertical and horizontal global lines fed by TPG
outputs; hence the usage of the global routing resources required for distributing the TPG
patterns does not change with the FPGA size. In the ORCA FPGA, we use the bidirectional
drivers available in the local routing surrounding a PLB to redistribute incoming TPG
signals, thus avoiding fanout overload. The following analysis is for FPGAs where such
drivers are not available. If k is the number of cells used by one TPG (k=4 in our
implementation), a TPG row has NTPG = N/k TPGs. (For simplicity we assume that N is
always a multiple of k.) We divide the BUTs into NTPG subsets of k columns and we feed
each such subset from its closest two TPGs. Then each TPG output drives kN/2 BUTs. This
shows that the loading grows only linearly with N, which is the square root of the size of the
FPGA. Nevertheless, if the loading may not grow beyond a given limit Lmax, then the largest N
for which the BIST architecture is feasible is Nmax = 2Lmax/k (note that Lmax depends on
the BIST clock frequency). When N grows over this limit, we divide the FPGA into four
quadrants, such that it is feasible to implement the BIST architecture separately in each
quadrant. For example, if Nmax=8, then a 16×16 FPGA is divided into four 8×8 quadrants.
All quadrants are tested concurrently. However, the scan chains of each quadrant will be
connected into a single scan chain, so that the result retrieval time grows linearly with the
size of the FPGA.
We emphasize that increasing N does not affect the number of test sessions, which is
always two. The number of test configurations (phases) depends only on the structure and
the modes of operation of the PLB, and it is independent of N (the dependence shown in
Figure 6 of [20] for the BIST method is a result of incorrect assumptions). Since all BUTs
are tested in parallel, the BIST execution time is also independent of the size of the FPGA.
In addition to the time for scanning out the results, the only test-time dependence on the size
of the array is the reconfiguration time for each test phase. This is inherent in all
currently available FPGAs, for which configuration loading is a serial process. One previously
described method has a test time that does not depend on N, but it assumes a parallel loading
mechanism that is not featured in existing FPGAs.
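The fanout analysis above can be sketched numerically. This assumes the reconstructed formulas (fanout kN/2 per TPG output, largest feasible array Nmax = 2Lmax/k) and is for FPGAs without local bidirectional drivers:

```python
def tpg_fanout(n, k=4):
    """BUTs driven by one TPG output when no bidirectional drivers exist:
    a subset of k columns with N/2 BUTs per column gives k*N/2 loads."""
    return k * n // 2

def quadrants_needed(n, l_max, k=4):
    """1 if a single BIST floorplan fits within the loading limit L_max,
    else 4 (the array is split into quadrants, each tested concurrently)."""
    n_max = 2 * l_max // k  # largest feasible N for this loading limit
    return 1 if n <= n_max else 4
```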
5.3 Boundary Scan Access:
The boundary-scan Test Access Port (TAP) allows FPGAs to be reconfigured and tested in-system without requiring additional I/O
pins. The Lucent [26] and the Xilinx [43] architectures also provide user-defined access to
the FPGA core, which we use to control the ORA scan chain. Altera FPGAs [5] do not
feature user-defined TAP instructions, but the equivalent functionality may be implemented
in test mode at the cost of several test-dedicated I/O pins.
Reconfiguring the FPGA with the BIST test phases, initiating the BIST sequence, and
reading the BIST results (steps 1, 2, and 5 of each test phase) are performed using the TAP.
Access to our BIST architecture requires the ability to control the BIST Start/Reset to
initialize the TPGs and ORAs and start the BIST sequence, as well as the ability to read the
BIST Done and ORA Pass/Fail results from the BIST circuitry. This access must be made
within the confines of the typical FPGA boundary-scan circuitry architecture. For example,
unlike boundary-scan cells for external interconnect testing, the Shift/Capture and Update
control signals are not made available to the core of the FPGA by the TAP circuitry in most
FPGAs. Instead, the only internal signals provided by the TAP for user-defined internal scan
chains are TCK, TDI, a port to send data out on TDO, and an internal enable signal (TEN).
TEN is active when the user-defined scan-chain instruction is decoded by the TAP controller
and remains active until a different instruction is loaded into the TAP instruction register.
Additional considerations in selecting a boundary-scan access method for final
implementation include total test time, PLB overhead, routability, and diagnostic resolution.
A detailed analysis of four methods for access to the FPGA BIST architecture via the
boundary-scan interface was given in [13]. In this section, we describe the method selected
for the final implementation of our BIST approach.
5.4 BIST-Based Fault Detection:
In this section we show that our BIST approach achieves complete fault detection for
single faulty PLBs, and practically complete fault detection for multiple faulty PLBs. A
faulty PLB in a TPG or in an ORA may not produce an error if its fault does not affect the
operation of the TPG or ORA. Thus we will rely on detection of PLB faults only when a
faulty PLB is under test (configured as a BUT).
Claim 1: Any single faulty PLB is guaranteed to be detected.
Proof: When the faulty PLB is a BUT, it receives the correct pseudo-exhaustive patterns
from a fault-free TPG in every one of its modes of operation. The outputs of the faulty BUT are
compared with a fault-free BUT fed by a fault-free TPG, and the comparison and error latching
are done by a fault-free ORA. (Since the scanning mechanism of the ORA result register is
tested as part of every test phase, we can assume that errors propagate through a fault-free
scan chain.) Thus the faulty PLB is detected.
The more difficult question is whether we can have multiple faulty PLBs that mask each
other, so that together they escape detection. Although, in general, we cannot claim that any
possible combination of faulty PLBs will be detected, we can identify many large classes of
multiple faulty PLBs whose detection can be guaranteed. In the following, when we analyze
the detectability of a group G of faulty PLBs, we implicitly assume that all the other PLBs
are fault-free (in other words, G is not a subset of a larger group of faulty PLBs). G is
detected when any of its PLBs is detected.
Claim 2: Any group of faulty PLBs in the same row is guaranteed to be detected.
Proof: Except the blocks in a TPG, the only interaction among PLBs located in different
columns is provided by the scan register connecting the ORA result flip-flops. However, the
scan operation of the ORA result register is also tested by the technique described in
Section 3. Hence, faults in BUT and ORA PLBs in the same row cannot mask each other.
Faulty PLBs in rows 1 or N may not be detected when they are part of a TPG (for example,
when two faulty TPGs feeding pairs of BUTs compared by the same ORAs generate
identical patterns), but they will be detected when configured as BUTs. The next result deals
with the detection of faulty PLBs residing in the same column, called the faulty column. A
middle row refers to any row except 1 and N.
Claim 3: Any group of faulty PLBs in the middle rows of the same column that has at
least two adjacent fault-free PLBs, is guaranteed to be detected.
Note that to escape detection, there must be at least N/2 faulty PLBs in the group, because
otherwise the faulty column will have at least two adjacent fault-free PLBs. This is a very
restrictive condition: any group of fewer than N/2 faulty PLBs in the middle rows of a column
is guaranteed to be detected. Furthermore, the faulty PLBs configured as BUTs must
produce identical output responses during every test phase. The reason Claim 3 restricts faulty
PLBs to the middle rows of the faulty column is to guarantee that the TPGs in rows 1 and N are
fault-free and thus generate all test patterns. Otherwise, it is theoretically possible that the group
of faulty PLBs escape detection, even if the faulty column has two adjacent fault-free PLBs; for
this to happen, all of the following conditions must be satisfied: 1) the faulty PLB in row 1 or N
must change the patterns produced by one TPG, and, 2) the faulty TPG must skip all the patterns
that detect every faulty BUT bordering two adjacent fault-free PLBs, and, 3) the fault-free BUTs
must have identical responses for every pair of different vectors produced by the two TPGs
(otherwise errors will appear at all the ORAs in the fault-free columns). Clearly, this set of
conditions is so restrictive that it is unlikely to ever occur in practice. Hence in practice,
columns that include faulty PLBs in rows 1 or N will also be detected. (The above analysis
shows that the claim made in [20] that BIST-based approaches “do not detect all double
faulty PLBs” is incorrect.)
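The structural condition of Claim 3 can be checked mechanically. The sketch below is an expository helper (not part of the BIST hardware) that tests whether a faulty column retains two adjacent fault-free PLBs, with rows numbered 1 through N:

```python
def has_two_adjacent_fault_free(faulty_rows, n):
    """Claim 3 condition: True if the column keeps at least one pair of
    vertically adjacent fault-free PLBs, which guarantees detection."""
    faulty = [row in faulty_rows for row in range(1, n + 1)]
    return any(not faulty[i] and not faulty[i + 1] for i in range(n - 1))
```

For N=8, faults in every other row (the pathological pattern discussed next) are exactly the case where the condition fails.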
As an example, consider a faulty column with N=8 where no pair of fault-free PLBs
is adjacent. Consider the test session in which all 4 faulty PLBs are BUTs. In addition,
assume that their faults are functionally equivalent, so that no mismatches will be produced.
If these faults are not activated in the other test session when the faulty PLBs in the middle
rows serve as ORAs, and the PLB in row 8 is part of a TPG, then this multiple fault will
escape detection. However, this situation can be characterized as “pathological,” since it has
an extremely low probability of occurrence: 1) half of the PLBs in the same column must be
faulty; 2) these PLBs must reside in every other row; 3) their faults must be functionally
equivalent; and 4) these faults must not be detected when these PLBs are configured
as ORAs or TPG cells. The probability of occurrence is very low, because we need the AND
of four conditions which are very unlikely by themselves. Because masking cannot occur
among faults in the middle rows of different columns, we can derive from Claim 3 the
following more general result.
Claim 4: Any group of faulty PLBs in the middle rows of the FPGA, such that every
faulty column has at least two adjacent fault-free PLBs, is guaranteed to be detected.
Like Claim 3, Claim 4 can also be extended in practice to cover faulty PLBs in rows 1 and
N. Figure 7 illustrates an interesting group of 4 faulty PLBs that can escape detection if the
following conditions are all satisfied: 1) PLBs X and Y have the same position in the first and
the second TPG in row 1; 2) PLBs V and Z have the same position in the first and the second
TPG in row 8; 3) the faults in X and Y are equivalent; 4) the faults in V and Z are
equivalent; 5) the TPGs in row 1 skip all the patterns that detect faults in V and Z; and 6)
the TPGs in row 8 skip all the patterns that detect faults in X and Y. The faulty PLBs escape
detection because in every test session, the patterns generated by the two TPGs are identical (so
no mismatches are detected at any ORAs), and they miss the faults in the faulty BUTs.
Clearly, this would be another pathological situation.
Figure 7: Pathological case that escapes fault detection (faulty PLBs X and Y in the row-1
TPGs, and faulty PLBs V and Z in the row-8 TPGs).
Although we cannot guarantee that the two test sessions illustrated in Figure 1 will detect
any possible combination of faulty PLBs, it appears that the conditions that allow a group of
faulty PLBs to escape detection are so restrictive, that they are very unlikely to occur in
practice. Therefore we can conclude that, in practice, any combination of faulty PLBs will be
detected.
But if needed, we can enhance the BIST sequence to guarantee the detection of any
combination of faulty PLBs by adding the two test sessions shown in Figure 8. These are
obtained by rotating the test sessions of Figure 1 by 90 degrees, so that the flow of test patterns is
horizontal instead of vertical. If a group of faulty PLBs did escape detection in the first two
test sessions, this was caused by masking among faulty PLBs in the same column. But the
rotation in effect interchanges the rows and the columns, so it destroys the masking
relations because the faulty PLBs now interact as if they were in the same row. Hence a group
that escapes detection in the first two sessions is guaranteed to be detected in at least one of
the additional two sessions.
Figure 8: Additional ("horizontal") test sessions. a) Test session WE; b) test session EW.
5.5 BIST-Based Diagnosis
In this section, we present the use of the BIST approach in diagnosing an FPGA that
failed the tests provided by the two test sessions shown in Figure 1. In any fault location
procedure, maximum diagnostic resolution is achieved when faults are isolated within an
equivalence class containing all the faults that produce the observed response. If every
equivalence class has only one fault, we say that the fault is uniquely diagnosed, that is, there
is no other fault which can produce the same response. In our case a fault is one faulty PLB
that may have any number of internal faults, and the response is obtained at the outputs of the
ORAs. We begin by assuming a single faulty PLB in the FPGA, then we analyze the case of
multiple faulty PLBs, and we also discuss locating faults inside a defective PLB.
Diagnosis Within A Faulty PLB:
During each test session, the errors (failing ORA indications) are actually recorded for
every test phase. This allows us to identify the failing mode(s) of operation of the faulty PLB
and its faulty internal module(s). For example, consider the test phases developed for the
ORCA 2C and 2CA series FPGA given in Table 2. The first 9 phases are used to test the
various modes of operations of the PLBs in the ORCA 2C series, while the complete set of
14 test phases are used to test the ORCA 2CA series. Phases 1-4 test the LUT portion of the
PLB, phases 5-9 test the FFs, and phases 10-14 test additional LUT modes of operation
present only in ORCA 2CA series. For example, if only phase 10 fails, then we know that
only the logic used exclusively to implement the multiplier is faulty. If only phases 5 and 7
fail, then the most likely cause is faulty connection(s) between the LUT and the flip-flops.
Such accurate diagnosis is extremely useful in failure-mode analysis and yield improvement.
This accuracy also provides the basis for reusing partially defective PLBs in
fault-tolerant applications or adaptive computing applications. For example, if only phases 5
through 9 fail, then the LUT is fault-free (since it passed exhaustive tests in each of its
modes of operation), and this PLB with defective flip-flops can be safely used to implement
any combinational logic function.
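The phase-to-module mapping described above can be sketched as a small lookup. The grouping follows the text (phases 1-4 exercise the LUT, 5-9 the FFs, 10-14 the 2CA-only LUT modes); the module labels are illustrative names, not identifiers from the tool flow:

```python
# Map each ORCA 2CA test phase to the PLB module it exercises.
PHASE_TO_MODULE = {}
for p in range(1, 5):
    PHASE_TO_MODULE[p] = 'LUT'
for p in range(5, 10):
    PHASE_TO_MODULE[p] = 'FF'
for p in range(10, 15):
    PHASE_TO_MODULE[p] = 'LUT (2CA-only modes)'

def faulty_modules(failing_phases):
    """Candidate faulty PLB modules implied by the set of failing phases."""
    return sorted({PHASE_TO_MODULE[p] for p in failing_phases})
```

For instance, a PLB failing only phases 5 through 9 implicates only the FFs, so its LUT can still be reused for combinational logic.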
Table 2: Summary of BIST Phases for ORCA 2C and 2CA series Logic Blocks

Phase No. | FF/Latch | Set/Reset | Clock | Clk Enable | Flip-Flop Data In | LUT Mode | No. Outs
1 | - | - | - | - | - | Asynchronous RAM | 4
2 | - | - | - | - | - | Adder/subtracter | 5
3 | - | - | - | - | - | 5-variable MUX | 4
4 | - | - | - | - | - | 5-variable XOR | 4
5 | Flip-Flop | async. reset | falling edge | active low | LUT output | Count up | 5
6 | Flip-Flop | async. set | falling edge | enabled | PLB input | Count up/down | 5
7 | Latch | sync. set | active low | active high | LUT output | Count down | 5
8 | Flip-Flop | sync. reset | rising edge | active low | PLB input | 4-variable | 4
9 | Latch | - | active high | active low | dynamic select | 4-variable | 4
10 | - | - | - | - | - | Multiplier | 5
11 | - | - | - | - | - | Greater/equal to Comp | 5
12 | - | - | - | - | - | Not equal to Comp | 5
13 | - | - | - | - | - | Synchronous RAM | 4
14 | - | - | - | - | - | Dual port RAM | 4

Figure: ORA with greater diagnostic resolution, comparing BuO1/BdO1 through BuO4/BdO4,
with scan-in from the previous ORA and TDI/TCK scan control.
6. IMPLEMENTATION
We have also implemented the approach in Xilinx Spartan series FPGAs. Figure 10a
shows the delay-fault BIST implementation as it would reside in a STAR, taken from the
Xilinx FPGA Editor. The TPG is at the top of the STAR, and the ORA logic (OR and
NAND gates to create OSC) at the bottom. The ORA counter is a 6-bit counter constructed
from three PLBs, each implementing a 2-bit counter.
The OSC output drives the clock inputs of the ORA counter. This implementation is
for off-line testing, as the global reset generated after configuration is used to initiate the
BIST sequence. The global reset cannot be used during on-line testing, since it would reset
the system function; for on-line testing, the BIST sequence is initiated via the boundary-scan
interface. The TPG is constructed from a shift register with the input to the shift register tied
to a logic 1 and the FFs clocked by the internal 8MHz oscillator.
The role of the 2-bit shift register is to allow some time for the FPGA to stabilize
after downloading the configuration before measuring delays; this was a precautionary
measure and may not be necessary. Once configuration of the FPGA is complete, the global
reset sets the two FFs to logic 0, and two 8MHz clock cycles later a 0/1 transition will start
propagating on the four PUTs. The PUTs travel through an equal number of CIPs, including
the switch boxes, prior to entering the ORA logic. At the conclusion of the BIST sequence,
the contents of the ORA counter FFs are obtained via a configuration memory readback
operation in this Spartan implementation, as opposed to the scan-chain-based boundary-scan
access we used for the ORCA FPGA.
Figure 10: Delay-fault BIST implementation in the Spartan FPGA. a) Xilinx FPGA Editor
layout; b) TPG implementation (2-bit shift register cleared by the global reset and clocked by
the 8 MHz internal clock, driving the PUTs); c) ORA implementation (OSC formed from the F,
G, and H LUTs clocks three 2-bit counters chained through carry-in/carry-out to form the
6-bit ORA counter).
In this section we present the results of the implementation of our BIST-based
diagnostic approach using ORCA 2C15A FPGAs, and discuss our experience with testing
actual devices. The 2C15A requires the full set of 14 test phases summarized in Table 2 for each
of the two test sessions. Each test phase for the 2C15A requires 220,800 bits of memory to store
the configuration for that test phase, for a total storage requirement of 0.77 Mbytes of test
configurations for both test sessions. The total FPGA test time, including all
reconfigurations, is less than one second, using a 10 MHz boundary-scan clock.
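The storage figure quoted above follows directly from the per-phase configuration size: 220,800 bits per phase, 14 phases per session, two sessions.

```python
bits_per_phase = 220_800      # configuration bits per 2C15A test phase
phases_per_session = 14
sessions = 2
total_bytes = bits_per_phase * phases_per_session * sessions // 8
megabytes = total_bytes / 1_000_000   # 772,800 bytes, i.e. ~0.77 Mbytes
```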
7. RESULTS
We were provided with five 2C15A devices by Lucent Technologies Microelectronics
Group in Allentown, PA. From manufacturing test results, two FPGAs were known to be
fault-free and three were known to be defective. All three defective devices failed our test,
and the two fault-free devices passed. Analyzing the results of the defective FPGAs, our
PLB diagnosis procedure MULTICELLO reported inconsistencies for Chip1 and Chip2, and
identified one faulty PLB in Chip3 (in row 3 and column 18).
Test applied | Chip1 | Chip2 | Chip3
Off-line PLB test (this paper) | FAIL | FAIL | FAIL
Off-line routing test [41] | FAIL | FAIL | PASS
On-line PLB test [2][3] | P&F | FAIL | FAIL

Table 3: Summary of testing faulty FPGAs
To validate these results, we retested the defective chips with the off-line BIST for
programmable interconnect described in [41], and with the on-line BIST for PLBs (the Roving
STARs approach) presented in [2][3]. No application was programmed in the FPGAs for the on-line
test. The on-line BIST tests each PLB twice, once in a vertical STAR and once in a
horizontal STAR. Table 3 summarizes the results of all tests.
The failures of the routing BIST in Chip1 and Chip2 show that the inconsistencies
reported by MULTICELLO when trying to locate defective PLBs in these devices are caused by
the presence of interconnect faults.
For Chip1, the on-line PLB test passed all vertical roving tests, but failed some of the
horizontal roving tests. Since these tests involve the same PLBs but different routing
resources, we concluded that Chip1 has only interconnect faults that also cause its off-line
PLB test to fail.
Chip2 fails every test and has both faulty PLBs and faulty interconnect. Chip3 passed
the interconnect test, and the on-line diagnosis method identified the same faulty
PLB as MULTICELLO. The faulty PLB in Chip3 failed phases 5-9, and by using one
additional configuration that separated the FFs from the LUT, we were able to determine that
the fault affects only the FFs and the LUT is fault-free.
This defective PLB has been successfully reused to implement combinational logic
functions in a fault-tolerant application. Since faulty FPGAs are difficult to obtain, we have
also performed verification of our BIST approach using a fault emulator that injects faults in
the FPGA by changing configuration bits.
8. CONCLUSIONS
In this paper, we have described a BIST approach for programmable logic blocks in
SRAM-based FPGAs. Our approach is applicable at any level of testing and, unlike
conventional BIST, it does not introduce any area overhead or delay penalty. Every PLB is
pseudo-exhaustively tested in all its modes of operation, so the tests are practically complete
for any logic fault model.
We have shown that in practice, our method detects any combination of faulty PLBs.
Our architecture facilitates the testing of all PLBs in only two test sessions, and the number
of test configurations is also independent of the size of the FPGA. Its regular structure allows
easy scalability with the size of the FPGA and algorithmic generation of placement and
routing data for every test phase based only on the size of the array.
We also presented the first diagnosis algorithm that can accurately locate any single and
most multiple faulty PLBs with maximum diagnostic resolution. The multiple faulty PLBs
that cannot be diagnosed correspond to very restrictive situations unlikely to occur in practice.
Our method can also identify defective subcircuits inside a PLB.
This diagnostic information can then be used for yield enhancement in the
manufacturing process or for repair strategies in fault-tolerant applications. We have
successfully verified our approach with faults injected using a fault emulator and with actual
defective FPGAs. An open problem to be investigated in the future is diagnosis under a
mixed-fault model, that is, locating faults in a chip that has both faulty PLBs and faulty
interconnect.
9. REFERENCES
1. M. Abramovici, C. Stroud, S. Wijesuriya, C. Hamilton, and V. Verma, “Using Roving
STARs for On-Line Testing and Diagnosis of FPGAs in Fault-Tolerant Applications,”
Proc. IEEE Intn’l. Test Conf., 1999, pp. 973-982.
2. M. Abramovici, C. Stroud, B. Skaggs, and J. Emmert, “Improving On-Line BIST-Based
Diagnosis for Roving STARs,” Proc. IEEE Intn’l. On-Line Test Workshop, 2000, pp. 31-39.
3. M. Abramovici, J. Emmert, and C. Stroud, “Roving STARs: An Integrated Approach to
On-Line Testing, Diagnosis, and Fault Tolerance for FPGAs in Adaptive Computing
Systems,” Proc. Third NASA/DoD Workshop on Evolvable Hardware, 2001, pp. 73-92.
4. M. Abramovici and C. Stroud, “BIST-Based Test and Diagnosis of FPGA Logic Blocks,”
IEEE Trans. on VLSI, Vol. 9, No. 1, pp. 159-172, Feb., 2001.
5. J. Emmert, C. Stroud, B. Skaggs, and M. Abramovici, “Dynamic Fault Tolerance in
FPGAs via Partial Reconfiguration,” Proc. IEEE Symp. on Field-Programmable Custom
Computing Machines, 2000, pp. 165-174.
6. J. Emmert, S. Baumgart, P. Kataria, A. Taylor, C. Stroud, M. Abramovici, “On-line Fault
Tolerance for FPGA Interconnect with Roving STARs,” IEEE Intn’l. Symp. on Defect
and Fault Tolerance in VLSI Systems, 2001, pp. 445-454.
7. Lattice Semiconductor Co., http://www.latticesemi.com/products/fpga.
8. E. McCluskey, “Verification Testing - A Pseudoexhaustive Test Technique,” IEEE Trans.
on Computers, Vol. C-33, No. 6, pp. 541-546, June 1984.
9. S. D’Angelo, C. Metra, and G. Sechi, “Transient and Permanent Fault Diagnosis for
FPGA-Based Systems,” Proc. IEEE International Symp. on Defect and Fault Tolerance in
VLSI Systems, pp. 330-338, November 1999.
10. J. Emmert, C. Stroud, B. Skaggs, and M. Abramovici, “Dynamic Fault Tolerance in
FPGAs via Partial Reconfiguration,” Proc. 8th Annual IEEE Symp. on Field-Programmable
Custom Computing Machines, April 2000.
11. A. van de Goor, Testing Semiconductor Memories: Theory and Practice, John Wiley and
Sons, 1991.
12. C. Hamilton, G. Gibson, S. Wijesuriya, and C. Stroud, “Enhanced BIST-Based
Diagnosis of FPGAs via Boundary Scan Access,” Proc. IEEE VLSI Test Symp.,
pp. 413-418, May 1999.
13. F. Hanchek and S. Dutt, “Methodologies for Tolerating Logic and Interconnect Faults in
FPGAs,” IEEE Trans. on Computers, pp. 15-33, Jan. 1998.
14. I.G. Harris and R. Tessier, “Interconnect Testing in Cluster-Based FPGA Architectures,”
Proc. Design Automation Conf., June 2000.
15. F. Hatori et al., “Introducing Redundancy in Field Programmable Gate Arrays,” Proc.
IEEE Custom Integrated Circuits Conf., pp. 7.1.1-7.1.4, 1993.
16. W. K. Huang and F. Lombardi, “An Approach to Testing Programmable/Configurable
Field Programmable Gate Arrays,” Proc. IEEE VLSI Test Symp., pp. 450-455, 1996.
17. W. K. Huang, F. J. Meyer, N. Park, and F. Lombardi, “Testing Memory Modules in
SRAM-based Configurable FPGAs,” Proc. IEEE International Workshop on Memory
Technology, Design and Testing, August 1997.
18. W. K. Huang, F. J. Meyer, X. Chen, and F. Lombardi, “Testing Configurable LUT-Based FPGAs,” IEEE Trans. on VLSI Systems, Vol. 6, No. 2, pp. 276-283, June 1998.
19. W. K. Huang, F. J. Meyer, and F. Lombardi, “An Approach for Detecting Multiple
Faulty FPGA Logic Blocks”, IEEE Trans. on Computers, Vol. 49, No. 1, pp. 48-54,
2000.
20. T. Inoue, S. Miyazaki, and H. Fujiwara, “Universal Fault Diagnosis for Lookup Table
FPGAs,” IEEE Design & Test of Computers, Vol. 15, No. 1, pp. 39-44, Jan. 1998.
21. C. Jordan and W. P. Marnane, “Incoming Inspection of FPGAs,” Proc. European Test
Conf., pp. 371-377, 1993.
22. J. L. Kelly and P. A. Ivey, “Defect Tolerant SRAM Based FPGAs,” Proc. International
Conf. on Computer Design, pp. 479-482, 1994.
23. V. Lakamraju and R. Tessier, “Tolerating Operational Faults in Cluster-based FPGAs,”
Proc. ACM/SIGDA International Symp. on FPGAs, pp. 187-194, Feb. 2000.
24. F. Lombardi, D. Ashen, X. Chen, and W. K. Huang, “Diagnosing Programmable
Interconnect Systems for FPGAs,” Proc. ACM/SIGDA International Symp. on
FPGAs, pp. 100-106, Feb. 1996.