
2007 IEEE NSREC Short Course
Section II
SEE Mitigation Strategies for Digital Circuit Design Applicable to ASIC and FPGAs

Fernanda Lima Kastensmidt
Computer Science Department - PPGC
Universidade Federal do Rio Grande do Sul (UFRGS)
Porto Alegre – RS – Brazil

Table of Contents

1. Radiation Effects on Digital ICs
1.1 Charge Collection Mechanism in MOS devices
1.2 Single Event Effects in Digital ICs
2. Radiation Hardening by Design: Strategies for ASICs
2.1 Layout- and Electrical-level based techniques
2.1.1 Bulk Built-in Current Sensors
2.1.2 Transistor Resizing for Charge Dissipation
2.2 Logic-level based techniques
2.2.1 Hardware redundancy techniques
2.2.2 Time redundancy techniques
2.2.3 Mixed Hardware and Time Redundancy Techniques
2.2.3 Hardened Memory Cells
2.2.4 Error Correcting Code (ECC)
2.3 Architectural level based techniques
2.4 Area and Performance Tradeoffs Summary
3. Radiation Effects on FPGAs
3.1 Antifuse-based FPGAs
3.2 SRAM-based FPGAs
4. Radiation Hardening by Design: Strategies for SRAM-based FPGAs
4.1 Scrubbing
4.2 Triple Modular Redundancy
4.3 Duplication with Comparison with Concurrent Error Detection
4.4 Placement and Routing Issues
4.4.1 Solutions based on Placement and Routing
4.4.2 Solutions based on Voting Adjustments
4.5 Partial Triple Modular Redundancy
5. Final Remarks
References

1. Radiation Effects on Digital ICs

Fault-tolerance is defined as a set of techniques to provide a service capable of fulfilling the system function in spite of (a limited number of) faults. Fault-tolerance on semiconductor devices has been meaningful since upsets were first experienced in space applications several years ago.
Since then, interest in fault-tolerant techniques that keep integrated circuits (ICs) operational in such hostile environments has increased, driven by the many applications of radiation-tolerant circuits, such as space missions, satellites, high-energy physics experiments and others. Spacecraft systems include a large variety of analog and digital components that are potentially sensitive to radiation, and therefore fault-tolerant techniques must be used to ensure reliability. In addition, because of the continuous evolution of semiconductor fabrication technology in terms of transistor geometry shrinking, power supply, speed, and logic density, as presented in the International Technology Roadmap for Semiconductors (ITRS) [72], fault-tolerance is becoming a concern for circuits operating at ground level as well. As stated in [32], [36], [59], [24], [21], [26] and [25], drastic device shrinking, power supply reduction, and increasing operating speeds significantly reduce the noise margins and thus increase the threats that very deep submicron (VDSM) ICs face from the various internal sources of noise. This process is now approaching a point where it will be unfeasible to produce ICs that are free from these effects. Consequently, fault-tolerance is no longer a concern exclusively for space designers but also for designers of next-generation products, which must cope with errors at ground level due to the advanced technology.

1.1 Charge Collection Mechanism in MOS devices

The radiation environment is composed of various particles generated by sun activity, as presented by [7]. The particles can be classified into two major types: (1) energetic particles such as electrons, protons and heavy ions, and (2) electromagnetic radiation (photons), which can be x-ray, gamma ray, or ultraviolet light.
The main sources of energetic particles that contribute to radiation effects are protons and electrons trapped in the Van Allen belts, heavy ions trapped in the magnetosphere, galactic cosmic rays and solar flares. The charged particles interact with the silicon atoms, causing excitation and ionization of atomic electrons. At ground level, neutrons are the most frequent cause of upsets, as shown by [57, 58]. Neutrons are created by cosmic ion interactions with the oxygen and nitrogen in the upper atmosphere. The neutron flux is strongly dependent on key parameters such as altitude, latitude and longitude. High-energy neutrons interact with the material, generating free electron-hole pairs. Low-energy neutrons interact with a certain type of boron present in semiconductor material, creating other particles, as shown by [9]. Alpha particles are secondary particles emitted from interactions with radioactive impurities present in the device itself or in the packaging materials, and they are the greatest concern. Materials are chosen to minimize the emission of alpha particles. However, this does not eliminate the problem completely.

As an energetic particle traverses the material of interest, it deposits energy along its path, as shown in figure 1-1. This energy is measured as a linear energy transfer (LET), which is defined as the amount of energy deposited per unit of distance traveled, normalized to the material's density. It is usually expressed in MeV-cm²/mg. The ionized track contains equal numbers of electrons and holes. The total number of charges is proportional to the LET of the incoming particle.

Figure 1-1. Silicon substrate ionization due to an energetic particle hit

The sensitive sites are the surroundings of the reverse-biased drain junctions of a transistor biased in the off state, as explained by [22].
If an energetic particle passes through the pn-junction of a CMOS transistor in the off state, a short is momentarily created between the substrate and the struck drain terminal. The collected charge produces a transient current pulse that lasts until the deposited charge disappears by recombination or is conducted away via open current paths to VDD or ground, returning the logic node to its original state. Figure 1-2 shows charge collection occurring in the drain junction of the p-channel transistor. Originally the node held the value ‘0’. As current flows through the pn-junction of the struck transistor, from the bulk connected to VDD to the drain, the transistor in the on-state (the n-channel transistor in figure 1-2) conducts a current that attempts to balance the current induced by the particle strike. If the collected charge induced by the particle strike is high enough that the on-transistor cannot balance the current before the node capacitance is charged, a voltage change at the node will occur. This voltage change lasts until the charge is conducted away by the current fed through the on-transistor.

Figure 1-2. Charge collection mechanism in an inverter gate

The collected charge (Qc) depends on the energy and ion type, as well as the path length over which the charge is collected. It correlates with the linear energy transfer (LET) value of the energetic particle, as shown:

$$Q_c = \frac{L_{th}\, T\, d\, e}{X} \qquad (1)$$

where Qc is the collected charge, Lth is the threshold effective LET (in MeV-cm²/mg), T is the device thickness (in microns), d is the material density (2.32 g/cm³ for Si), e is the electronic charge (1.602 x 10^-7 pC), and X is the energy needed to create one electron-hole pair (3.6 eV in Si).
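As a rough numerical check of equation (1), the sketch below evaluates the collected charge for silicon using the constants quoted in the text. The function and constant names are hypothetical, and the unit-conversion factor (MeV to eV, g to mg, µm to cm) is an assumption made explicit here because the text mixes units without stating the conversion.

```python
# Constants for silicon as quoted in the text.
SI_DENSITY_G_CM3 = 2.32         # d, material density
ELECTRON_CHARGE_PC = 1.602e-7   # e, electronic charge in pC
PAIR_ENERGY_EV = 3.6            # X, energy per electron-hole pair

# Assumed unit conversion: MeV -> eV (1e6), g -> mg (1e3), um -> cm (1e-4).
UNIT_FACTOR = 1e6 * 1e3 * 1e-4


def collected_charge_pc(let_mev_cm2_mg: float, thickness_um: float = 1.0) -> float:
    """Collected charge in pC from equation (1), for LET in MeV-cm2/mg and T in um."""
    pairs = (let_mev_cm2_mg * thickness_um * SI_DENSITY_G_CM3 * UNIT_FACTOR
             / PAIR_ENERGY_EV)          # number of electron-hole pairs
    return pairs * ELECTRON_CHARGE_PC   # pairs times charge per pair


if __name__ == "__main__":
    for let in (1.0, 5.0, 40.0, 97.0):
        print(f"LET = {let:5.1f} MeV-cm2/mg -> Qc = {collected_charge_pc(let):.4f} pC")
```

Under these assumptions, the charge collected per unit LET and per micron comes out close to 0.0103 pC, so an LET of about 97 MeV-cm²/mg over 1 µm corresponds to roughly 1 pC.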
Replacing d, e and X in equation (1) gives:

$$Q_c = 1.03 \times 10^{-2}\, (L_{th}\, T)\ \text{pC for Si} \qquad (2)$$

Considering T ≈ 1 µm as a reasonable order of magnitude for conventional logic circuits and LET from 5 to 40 MeV-cm²/mg, the critical charge values obtained from equation (2) range from about 50 fC to 410 fC. These numbers agree with those published by [23]: in silicon, an LET of 97 MeV-cm²/mg corresponds to a charge deposition of 1 pC/µm.

At the electrical SPICE level, the charge deposition mechanism can be modeled by a double exponential current pulse at the particle strike site, as presented by [45]:

$$I_P(t) = I_0 \left( e^{-t/\tau_\alpha} - e^{-t/\tau_\beta} \right) \qquad (3)$$

where I0 is approximately the maximum charge collection current, τα is the collection time constant of the junction and τβ is the time constant for initially establishing the ion track. In circuit simulations and modeling, τβ is assumed to be much smaller than τα, while τα is used as a variable parameter, as shown by [75] and by [22]. SPICE transient analysis is performed by injecting a double exponential current pulse as given by (3), with the values of I0 and τα used as the variable parameters to determine the minimum charge Qc corresponding to a given τα. The double exponential model given by (3) has been shown to be adequate for studying the soft error mechanism at the circuit simulation level [45].

Depending on the fabrication details and the electrical characteristics of each sensitive node (capacitance and resistance), different shapes of current transients can be observed, as shown by [23, 25]. Figure 1-3 illustrates a double exponential current with a corresponding amount of charge Qi. The width of the induced transient voltage pulse depends on the energy of the incident particle, the charge stored at the affected node and the charge collection efficiency of the affected junction.
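The double exponential pulse of equation (3) can be explored numerically with a minimal sketch such as the one below. The function names and the example parameter values (1 mA, 200 ps, 50 ps) are hypothetical and chosen only for illustration. Integrating (3) from zero to infinity gives a total charge of I0 (τα − τβ), which the sketch compares against a simple trapezoidal integration. With the current in mA and time in ns, the charge comes out in pC.

```python
import math


def double_exp_current(t_ns: float, i0_ma: float,
                       tau_alpha_ns: float, tau_beta_ns: float) -> float:
    """Current of equation (3) in mA at time t (ns); zero before the strike."""
    if t_ns < 0.0:
        return 0.0
    return i0_ma * (math.exp(-t_ns / tau_alpha_ns) - math.exp(-t_ns / tau_beta_ns))


def total_charge_pc(i0_ma: float, tau_alpha_ns: float, tau_beta_ns: float) -> float:
    """Closed-form integral of equation (3) over t >= 0 (mA * ns = pC)."""
    return i0_ma * (tau_alpha_ns - tau_beta_ns)


def integrated_charge_pc(i0_ma: float, tau_alpha_ns: float, tau_beta_ns: float,
                         t_end_ns: float, steps: int = 20000) -> float:
    """Trapezoidal integration of the pulse from 0 to t_end_ns."""
    dt = t_end_ns / steps
    total = 0.0
    prev = double_exp_current(0.0, i0_ma, tau_alpha_ns, tau_beta_ns)
    for k in range(1, steps + 1):
        cur = double_exp_current(k * dt, i0_ma, tau_alpha_ns, tau_beta_ns)
        total += 0.5 * (prev + cur) * dt
        prev = cur
    return total


if __name__ == "__main__":
    # Hypothetical pulse: I0 = 1 mA, tau_alpha = 200 ps, tau_beta = 50 ps.
    print(total_charge_pc(1.0, 0.2, 0.05))
    print(integrated_charge_pc(1.0, 0.2, 0.05, t_end_ns=5.0))
```

In a parameter sweep of the kind described in the text, I0 and τα would be varied and the resulting charge compared with the node response. This sketch covers only the current source, not the circuit.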
So, according to the electrical characteristics of the struck node, such as resistance and capacitance, transient voltage pulses of different amplitude and duration are generated. There are equation models, such as the ones proposed by [85], that represent the voltage pulse generated in each sensitive node according to parameters such as I0, τα, τβ, node capacitance and resistance. The duration of the transient voltage pulse in nanometer technologies usually ranges from a few hundred picoseconds to a few nanoseconds. As discussed by [42], in designs working at GHz frequencies, some transient voltage pulses may last for a few clock periods.

Figure 1-3. The effect of a transient current pulse modeled as a double exponential current with a certain amount of charge in two different circuit nodes.

Once the values of I0, τα, and τβ are determined for a given technology and particles of interest, any circuit designed in that technology may be evaluated at the circuit level by modeling the charge deposition mechanism by (3). The values of I0, τα, and τβ for a given technology may be obtained by device simulation as well as from closed form expressions, as presented by [75], [22] and [26]. There is a minimal amount of charge able to create a transient current pulse in a certain node, which is known as the critical charge. Very often it is important to obtain the critical charge of a circuit in order to define the environment that the circuit is hardened to. In the next section, the effects of energetic particle ionization are explained for combinational and sequential circuits.

1.2 Single Event Effects in Digital ICs

A single particle can hit either the combinational logic or the sequential logic in the silicon, generating a soft error, as discussed by [19] and [1]. Figure 1-4 illustrates a typical circuit topology found in nearly all sequential circuits.
The data from the first latch is typically released to the combinational logic on a falling or rising clock edge, at which time logic operations are performed. The output of the combinational logic reaches the second latch sometime before the next falling or rising clock edge. At this clock edge, whatever data happens to be present at its input (and meeting the setup and hold times) is stored within the latch.

Figure 1-4. The occurrence of transient faults in combinational and sequential logic

When a charged particle strikes one of the sensitive nodes of a memory cell, such as a drain in an off-state transistor, it generates a transient current pulse that can turn on the gate of the opposite transistor. The effect can produce an inversion in the stored value, in other words, a bit flip in the memory cell. Memory cells have two stable states, one that represents a stored ‘0’ and one that represents a stored ‘1’. In each state, two transistors are turned on and two are turned off (sensitive nodes). A bit-flip in the memory element occurs when an energetic particle causes the state of the transistors in the circuit to reverse, as discussed by [8] and [39]. This effect is called Single Event Upset (SEU) and it is one of the major concerns in digital circuits because memory cells are usually designed with very compact transistors that present high soft error sensitivity (low critical charge). The SEU phenomenon is illustrated in figure 1-5.

Figure 1-5. Single Event Upset (SEU) in a static memory cell: (a) static memory cell with a particle strike; (b) the induced transient pulse flips the original stored value

When a charged particle hits the combinational logic block, it also generates a transient current pulse.
This phenomenon is called single event transient (SET), as presented in [39]. If the logic propagates the induced transient pulse, then the SET will eventually appear at the input of a latch, where it may be interpreted as a valid signal. Whether or not the SET gets stored as real data depends on the temporal relationship between its arrival time and the falling or rising edge of the clock. The transient pulse generated by the charge deposition mechanism might not be captured by a memory cell because it could be logically, electrically or latching-window masked, as discussed by [74] and [51].

Logical masking occurs when the input stimuli hold controlling values in the logic path in such a way that the SET cannot be propagated to the outputs. Figure 1-6(a) exemplifies logical masking. Note that the output holds the value one, independently of the SET value, because the NAND gate has one of its inputs at logical zero and the NOR gate in the SET path consequently has one of its inputs at logical one. Electrical masking occurs if the pulse is attenuated as it propagates through the logic chain and fades out before it reaches the registered output, as shown in figure 1-6(b). If a SET is neither logically nor electrically masked, it is interpreted as a valid signal at the register input and it can be captured by the memory element according to the latching window (usually based on the setup time and hold time of the memory element), as in figure 1-6(c). Once a SET is captured, a wrong value is stored in the register, provoking a soft error.

Figure 1-6. Single Event Transient (SET) in a combinational circuit: (a) logical masking; (b) electrical masking; (c) latch-window masking

As a result, the rate at which SETs get latched as errors depends on the operating frequencies and the logic structure of the circuit.
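The dependence of latching-window masking on clock frequency can be illustrated with a deliberately simplified sketch. It assumes, as a modeling choice not taken from the text, that a SET is captured only when the pulse covers the whole window from setup time before the clock edge to hold time after it, and that the pulse arrival time is uniformly distributed over the clock period. All names and numbers are hypothetical.

```python
def set_is_latched(pulse_start_ns: float, pulse_width_ns: float,
                   clk_edge_ns: float, t_setup_ns: float, t_hold_ns: float) -> bool:
    """True if the pulse spans the whole setup/hold window (assumed capture rule)."""
    window_start = clk_edge_ns - t_setup_ns
    window_end = clk_edge_ns + t_hold_ns
    pulse_end = pulse_start_ns + pulse_width_ns
    return pulse_start_ns <= window_start and pulse_end >= window_end


def latch_probability(pulse_width_ns: float, clk_period_ns: float,
                      t_setup_ns: float, t_hold_ns: float) -> float:
    """Fraction of uniformly distributed arrival times that lead to capture."""
    window = t_setup_ns + t_hold_ns
    p = (pulse_width_ns - window) / clk_period_ns
    return min(1.0, max(0.0, p))   # clip to a valid probability


if __name__ == "__main__":
    # Same 0.5 ns pulse, two hypothetical clock periods (1 GHz and 2 GHz).
    for period in (1.0, 0.5):
        p = latch_probability(0.5, period, t_setup_ns=0.05, t_hold_ns=0.05)
        print(f"clock period {period} ns -> capture probability {p:.2f}")
```

Under this model the same pulse is more likely to be captured at the shorter clock period, which matches the qualitative statement above. Real capture behavior also depends on pulse amplitude and on the latch design.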
Further, since the inherent delay of MOS transistors is decreasing with rapid technology scaling, the frequencies at which circuits are operated are continuously increasing. This increases the probability of SETs getting latched as errors. In addition, as the process technology shrinks and the supply voltage decreases, the charge stored at logic circuit nodes reduces roughly according to Qnode = Cnode × Vdd, which is the main reason for the increased sensitivity of nodes to radiation-induced upsets, as Qc can be larger than Qnode more often. Additional reasons are the reduction in electrical and timing masking. The impact of electrical masking decreases with technology scaling. This is due to shorter gate delays and reduced logic depth between pipeline registers. The reduction in timing masking is a consequence of higher operating frequencies, which increase the probability of a SET pulse being latched. Thus, in Very Deep Sub-Micron (VDSM) technologies, soft errors in logic circuits are becoming a significant reliability problem. In [60], [88], [56], [66], the probability of a SET becoming a SEU is discussed. The analysis of SETs is very complex in large circuits composed of many paths. Techniques such as the timing analysis presented by [4], [88], [51], [55], and [20] can be applied to analyze the probability of a SET in the combinational logic being stored by a memory cell or resulting in an error in the design operation. Other techniques based on formal binary decision diagrams are also proposed in [87].

Multiple bit upsets (MBU) are also becoming a concern because of process technology shrinking. An MBU can appear due to a SET in a node with fan-out higher than one, as shown in figure 1-7, or from double node ionization due to the angle of incidence of the particle, as shown in figure 1-8, which is more common in highly dense memory arrays.

Figure 1-7. Multiple Bit Upset due to a single SET

Figure 1-8. Multiple Bit Upset due to an incident angle of the particle

In summary, it is mandatory to investigate techniques able to tolerate SETs and SEUs in integrated circuits. In the next sections, a set of fault tolerant techniques for integrated circuits is discussed. The limitations of each technique are addressed. There is always a tradeoff in finding the most reliable technique with minimum area and performance impact. In addition, according to the target design and application, fault tolerant techniques can be applied at many different steps of the design flow, as presented in the following sections.

2. Radiation Hardening by Design: Strategies for ASICs

Modifications in the fabrication process technology can reduce the amount of collected charge, but the reduction is not sufficient to avoid SET occurrence. The results published by [21] indicate that significant transients can be generated in both bulk and SOI technologies at fairly low LETs. In bulk technologies these transients can be quite large for technologies of 100 nm and below, with durations of nearly 1 ns at LET above 50 MeV-cm²/mg. Consequently, soft error mitigation techniques must still be applied at different levels of the circuit design flow to ensure reliability.

Figure 2-1 represents the sequence of events that may occur once an energetic particle hits the substrate, provoking ionization, as discussed previously. The ionization track generates a set of electron-hole pairs that creates a transient current that is injected or extracted at that node. According to the amplitude and duration of this current pulse, a transient voltage pulse may appear at the hit node. This is characterized as the FAULT. There is a FAULT LATENCY period that defines the time needed for that fault to become an ERROR in the circuit.
This will only occur if the transient voltage pulse changes the logic state of a storage element (flip-flop), generating a bit-flip. This bit-flip may generate an error if the content of the flip-flop is used for a certain operation. But from the application point of view, it is not certain that this error will manifest as a FAILURE in the system. There is also an ERROR LATENCY that defines the time needed for that error to become a failure in the system.

For each phase a different fault tolerant technique can be used. Modern circuits may need fault-tolerance at many different levels to ensure reliability. For example, at the ionization and transient current generation phase, sensors can be built in the silicon substrate to detect ionization currents. The idea at this point is to notify the system that ionization has occurred. Once a transient voltage pulse is generated, temporal filtering can be applied to detect the transient pulse in time. However, temporal filtering has limitations, which are presented later in this manuscript. To mitigate bit-flips, hardware redundancy and error correcting codes can be used to correct the data. To correct an error, it is possible to use self-checking blocks with recovery mechanisms or recomputation to restore the correct data. Finally, spare chips may be used to guarantee operation of the system if a failure occurs.

Figure 2-1. Sequence of events from ionization to failure and a set of fault tolerant techniques applied at different times.
A set of techniques able to tolerate this entire sequence of events is analyzed in this manuscript:

• Layout and Electrical level based techniques:
  o Built-in sensors for ionization detection
  o Transistor resizing for charge dissipation
• Logic-level based techniques:
  o Hardware redundancy for majority voting
  o Time redundancy for temporal filtering
  o Error correcting codes for detection and correction of bit-flips in memory elements
  o Hardened memory cell for bit-flip avoidance
• Architectural level based techniques:
  o Recomputation

It is important to point out that there is always some penalty to be paid when protecting circuits against upsets. Each technique may present a combination of area overhead, performance penalty and power dissipation increase. The challenge is to select the most suitable techniques for the target circuit application in order to meet the area, time and power constraints, as well as the soft error hardness needed. In the next sections, a set of techniques is presented, continuing the discussion in [38].

2.1 Layout- and Electrical-level based techniques

2.1.1 Bulk Built-in Current Sensors

Built-in current sensors (BICS) have been used for permanent fault detection, where the permanent faults typically originate from imperfections in the integrated circuit fabrication process, as presented in [5]. It is well known that stuck-at faults can change the amount of current consumed by a circuit, so BICS connected to the power lines can detect current variations and consequently report the occurrence of permanent faults. However, soft errors, which are one of the major concerns nowadays, have a transient effect and consequently do not produce current variations at the power lines that can be distinguished from any other circuit activity. The source of the effect is a transient ionization that can only be seen at the hit node or at the bulk region.
For this reason, BICS connected to the power lines cannot help with soft error detection as they are, but BICS connected to the bulk region can sense the ionization. As discussed above, during normal circuit operation the current flowing between a reverse-biased drain junction and the bulk is negligible compared to the current peak induced by an energetic particle hit. Consequently, it is cost-effective to connect a BICS to the bulk of a circuit instead of connecting it to the power lines.

The bulk-BICS works as a monitor that senses the current at the bulk terminal. During normal operation, the current in the bulk is approximately zero. Only the leakage current flows through the biased junction, which is still very low compared to the current generated by charged particles. So, when a charged particle generates a current in the bulk, it is very clear to the bulk-BICS that a SET has happened. Figure 2-2(a) shows the bulk-BICS connected to an integrated circuit as proposed by [29]. For the bulk-BICS approach, it is necessary to have a dedicated BICS in each type of well (N-well, P-well). Consequently, one BICS design is used for PMOS transistors in the N-well and another BICS design is used for NMOS transistors in the P-well. In addition, the possibility of distinguishing upsets that occur in the PMOS region (BICS-P output) from the ones in the NMOS region (BICS-N output) can help to precisely map the faulty region in the circuit design. Each bulk-BICS can detect ionizations in a certain number of transistors, where this number is determined by the designer considering the SET-detection sensitivity. For a circuit with n transistors, i bulk-BICS are needed, where each bulk-BICS is connected to n/i transistors. Figure 2-2(b) depicts the connection of the bulk-BICS to the body ties.
The circuit itself is connected to the power lines (at the transistor sources), while the body ties are connected to VDD or ground through the bulk-BICS.

Figure 2-2. The N-BICS and P-BICS connected to the bulk of an integrated circuit, as presented in [29]: (a) schematic of the bulk-BICS; (b) bulk-BICS sensors placed at the silicon substrate, with the body-tie connected to VDD or ground through the bulk-BICS. In the case of N-well, the body-ties are connected to VDD through the bulk-BICS, while in the case of P-well, the body-ties are connected to ground through the bulk-BICS.

The bulk-BICS can be calibrated to detect ionizations that can produce transient current pulses at the struck node. A SET is assumed to occur if the voltage of the logic gate output node changes by more than VDD/2. This bulk-BICS technique presents a conservative approach to SET detection. Figure 2-3 shows the timing diagram of a SET detected by the bulk-BICS. Once a SET occurs at any moment in a clock cycle, the bulk-BICS detects this SET after a certain delay, called the SET detection time, which depends on the amount of area (transistors) protected by that bulk-BICS and on the size of the SET. The more intense the SET (large I0 and τα), the faster the bulk-BICS detects it. The more transistors connected to one BICS, the larger the capacitance associated with that connection and, consequently, the longer the SET detection time. Once a SET is detected, the output of the bulk-BICS is raised, which notifies control logic in the circuit to apply some fault tolerant technique to tolerate the detected SET and to reset the bulk-BICS.

Figure 2-3. Bulk-BICS time diagram for detection and reset (signals shown: clk, SET, bulk-BICS, bulk-BICS_ctrl, reset_BICS)

2.1.2 Transistor Resizing for Charge Dissipation

Digital circuits have different resistance and capacitance values at each gate node according to its fan-out and gate logic type. Consequently, each node presents a distinct critical charge (Qcrit), which is the minimum collected charge needed to provoke a SET or SEU at that node. When a soft error analysis tool is used (such as the ones referred to previously), the probability of SET occurrence in a certain design is evaluated. So, it is possible to identify the most sensitive nodes, which are the ones that present a higher chance of propagating a SET to the outputs (low logic and electrical masking) and a low Qcrit.

The idea of transistor resizing is to enlarge the width of some transistors in order to increase the capacitance of the most sensitive nodes, so that the critical charge of those nodes is increased. It is not desirable to increase the capacitance of all nodes because this would make the circuit slow and power hungry. So, it is important to analyze the circuit sensitivity to soft errors in order to choose the nodes that are going to be modified. Some recent works, such as the ones published by [89], [17], and [20], showed the variation of Qcrit as a function of the transistor channel widths and have therefore presented results on decreasing SET sensitivity by applying transistor resizing. Transistor resizing can also be replaced by gate duplication, as proposed by [56]. Figure 2-4 presents an example of a circuit with its three nodes most sensitive to SET. By eliminating the chance of a SET occurrence in these nodes, the sensitivity to SET of the entire circuit reduces by 50%.
In order to determine the size of the transistors (node capacitance and resistance) that is able to mitigate a certain range of energetic particles, the model equations for the calculation of the critical charge of a node and for SET generation [85] can be used. The challenges of this technique are: (a) meeting the circuit timing requirements when increasing the transistor sizing of the most sensitive nodes and (b) finding a transistor size with a critical charge that is able to avoid SET for a range of LET. It is clear that this method is suitable for low LET, such as alpha particle LET, which is around 2 MeV-cm²/mg, and neutrons up to 2 MeV-cm²/mg.

Figure 2-4. Transistor Resizing

2.2 Logic-level based techniques

The logic-level based techniques are all fault tolerant techniques that can be easily applied at the gate level to tolerate soft errors (SET in combinational circuits and SEU in sequential circuits). They can be applied in hardware description languages such as VHDL and Verilog or at the schematic description level. The techniques presented are based on hardware and time redundancy, hardened memory cells, and error correction codes for information redundancy. As discussed in the next sections, some of these techniques are able to mitigate both SET and SEU, others only SET and others only SEU.

2.2.1 Hardware redundancy techniques

Redundancy has long been used successfully to detect and vote out errors in logic. The first basic approach is duplication with comparison (DWC), where the module is replicated and the outputs are compared. If the outputs mismatch, an error is detected. Of course, some errors can be masked by the application, so the error is only detected when it manifests as a wrong output value.
This scheme can be used for both combinational and sequential logic, for SET and SEU detection respectively, as presented in figure 2-5. It can also be applied to the entire circuit. It is common to have two processors executing the same task in order to detect errors in one of the two chips. However, the comparator is the key circuit, because it is expected both to detect the error and to be immune to errors itself. Comparators can usually be designed with larger transistors in order to be less sensitive to upsets, and they can also be duplicated.

Figure 2-5. Duplication with comparison scheme

However, duplication with comparison can only notify the circuit that an error is present; it cannot indicate which module or piece of logic contains the error. A self-checking circuit can be used to detect an error in a specific module, for example by parity checking in arithmetic logic functions. In this case, a hot backup approach can be used, as illustrated in figure 2-6(a). There is a main module (module 0) and a spare module (module 1). By default, the output receives the module 0 output. If the self-checking block detects an error in this module, the output receives the module 1 output, which is supposed to be fault free. On the other hand, the self-checking block can be very difficult to design, and very often the checker has the same complexity as the block that it must check. So, the duplication with backup approach is also very commonly used, as shown in figure 2-6(b). Module 0 and module 1 work in tandem and their outputs are continuously compared. If an upset is detected, the output receives the module 2 output, which is the spare module and is supposed to be fault free. The only problem is how to ensure that the spare module is fault free. To overcome this issue, modular redundancy with majority voters (MAJ voters) can be used.
Figure 2-6. Hot backup and Duplication with backup approaches: (a) Hot backup approach; (b) Duplication with backup approach

In order to be able to detect the error and vote the correct output, n redundant elements are necessary, where n is typically an odd number equal to or larger than 3. This approach is called N-modular redundancy (N-MR). Triple modular redundancy (TMR) is the most common approach. It requires three modules working in tandem and a majority voter (MAJ voter) to vote the correct output. When an upset occurs, it is expected that at least two out of the three outputs are correct, so the voter can decide the correct output. Figure 2-7 illustrates this approach used for sequential elements (flip-flops).

Figure 2-7. Triple Modular Redundancy (TMR) in the sequential logic and the majority voter (MAJ)

However, there are two main limitations in soft error protection when using TMR only in the sequential logic, as presented in figure 2-8 and figure 2-9. The first limitation is that a SET in the combinational logic can be stored in all three flip-flops at the same time, which makes the majority voter choose a wrong output (figure 2-8). The second limitation arises when the SET occurs in the majority voter. This SET can be propagated and latched by the three flip-flops later in the circuit, as presented in figure 2-9. In both cases, the MAJ voter chooses the wrong output because 3 out of 3 values are wrong.

Figure 2-8. SET propagation in a TMR scheme in the sequential logic

Figure 2-9. SET propagation in the majority voter (MAJ)

In order to solve this problem, full TMR is proposed. In this case, the combinational logic and the voters are also triplicated, as shown in figure 2-10.
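The voting behaviour described above can be sketched in a few lines of Python. This is only an illustrative model, not a circuit description: the function names are hypothetical, and the voter is written in the usual digital sum-of-products form. It shows a single SEU being voted out, and the failure case in which all three flip-flops latch the same SET.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority vote (sum-of-products form of a MAJ voter)."""
    return (a & b) | (a & c) | (b & c)


def tmr_register_output(stored):
    """Vote the contents of three redundant flip-flops (hypothetical helper)."""
    ff0, ff1, ff2 = stored
    return majority(ff0, ff1, ff2)


correct = 1

# SEU in one flip-flop: two of the three copies still hold the correct value,
# so the voter output equals the correct value.
seu_case = [correct, correct ^ 1, correct]
print("SEU in one flip-flop ->", tmr_register_output(seu_case))

# SET in shared (non-triplicated) combinational logic: all three flip-flops
# latch the same wrong value, so the voter cannot recover the correct one.
set_case = [correct ^ 1] * 3
print("SET latched by all three ->", tmr_register_output(set_case))
```

Because the vote is bitwise, the same function can be applied to multi-bit register values held as integers.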
If a SEU occurs in one of the flip-flops, the MAJ voter chooses the correct output, as shown in figure 2-10(a), and at the next clock the correct output can be loaded into the flip-flop, clearing the SEU. If a SET occurs in one of the combinational logic blocks, the SET may be captured by only one of the flip-flops and the MAJ voter will be able to choose the correct output, as represented in figure 2-10(b). If a SET occurs in one of the MAJ voters, the voter output will show the transient for a short period of time, as shown in figure 2-10(c). But since the whole circuit is triplicated, only one redundant part is affected and the SET will be voted out at the next MAJ voter. Usually the three voter outputs can be connected outside the chip, as shown in figure 2-10(d). This scheme acts as a kind of analog voter. So, even if an upset occurs in one of the voters, the currents at the output will provide the correct output.

TMR presents two weaknesses: (a) it does not protect against double faults simultaneously affecting different redundant modules, whose probability of occurrence has increased in nanometer technologies, as discussed by [70]; and (b) a single fault in the last voter itself can generate undetected errors, as shown by [43]. Even when TMR is used for the voters as well, a fourth voter is always needed to choose the correct output of the circuit. This last voter in the chain, even with a lower probability of producing an error due to a SET, will always be subject to this problem.

Figure 2-10. Full Triple Modular Redundancy (TMR) with self-recovery: (a) SEU Mitigation; (b) SET Mitigation in the logic; (c) SET in the voters; (d) Output voter (the three on-chip voter outputs are joined on the board to form OUT)

In [71], an analog voter has been proposed to ensure complete tolerance against SET in TMR solutions.
This voter, shown in figure 2-11, uses an analog comparator, instead of the traditional digital sum of products, to decide the output value. The robustness of the proposed analog majority voter relies on three main points: duplicated input nodes, well-dimensioned transistors in the analog comparator and output transistors always in the on state.

Figure 2-11. Full Triple Modular Redundancy (TMR) with Analog Majority Voter (labels: module 0, module 1, module 2, VDD/2, majority logic voter)

Consequently, if a SET occurs at one of the inverter transistors, there is always another transistor, in parallel, to ensure the correct value. The same happens if a SET occurs inside the analog comparator logic: the transistors are sized in such a way that the SET is not able to turn other transistors on or off, so the correct value is held at the output. Finally, if a SET occurs at the output node, that transistor is already conducting a current, and the additional current generated by the transient pulse does not change the output state and, therefore, does not harm the operation of the circuit. A more detailed analysis of the behavior of an analog comparator in the presence of SETs can be found in [48].

The schematic diagram of the analog comparator used to implement the MAJ voter and the dimensions of the transistors, implemented in a 32 nm technology as proposed by [46], are shown in figure 2-12. The six inverters connecting the inputs to the comparator, shown in figure 2-11, were implemented using PMOS transistors with W = 144 nm and L = 32 nm, and NMOS transistors with W = 80 nm and L = 32 nm.

Figure 2-12. The schematic of the Analog Comparator used in the Analog Majority Voter and the W/L ratio of each transistor for the 32nm CMOS technology using the predictive model from Berkeley (transistors M1 to M7; nodes Vin, Vref and Out)

The analog majority voter can also be used as the basic logic gate to implement Boolean functions.
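To illustrate how a Boolean function can be built from majority gates and inverters alone, the following Python sketch maps a one-bit full adder to three majority operations. The decomposition used here is the well-known textbook one and is an assumption for illustration; it is not necessarily the exact mapping proposed in [46], and it models only the logic function, not the analog voter circuit.

```python
from itertools import product


def maj(a: int, b: int, c: int) -> int:
    """2-out-of-3 majority of single-bit inputs."""
    return (a & b) | (a & c) | (b & c)


def inv(x: int) -> int:
    """Single-bit inverter."""
    return x ^ 1


def full_adder_maj(a: int, b: int, cin: int):
    """One-bit full adder using only majority gates and inverters."""
    cout = maj(a, b, cin)                 # carry is the majority of the inputs
    inner = maj(a, b, inv(cin))
    total = maj(inv(cout), inner, cin)    # sum bit
    return total, cout


# Exhaustive check against ordinary integer addition.
for a, b, cin in product((0, 1), repeat=3):
    total, cout = full_adder_maj(a, b, cin)
    assert 2 * cout + total == a + b + cin
print("majority-gate full adder matches a + b + cin for all 8 input combinations")
```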
As presented in [46], any combinational logic can be mapped to a tree of analog majority voters, which makes each node of the circuit robust to SET. By using this technique, the circuit can tolerate multiple SETs. Figure 2-13 illustrates a one-bit full adder mapped to analog MAJ voters.

Figure 2-13. One-bit Full Adder implemented by only Analog Majority Voters (inputs A, B, Cin, their complements and the constants 0 and 1; outputs sum and cout)

In summary, hardware redundancy techniques such as full TMR for combinational and sequential logic, and the solution of mapping Boolean logic functions to analog majority voters, can protect the circuit against SET and SEU. The drawback of these techniques is the area overhead and, consequently, the power dissipation.

2.2.2 Time redundancy techniques

Time redundancy techniques process the data at different times, which allows the detection of faults [47]. In the simplest scheme, two flip-flops, controlled by a clock and a delayed clock, latch the combinational output at two different times, which allows the detection of a SET, as shown in figure 2-14. The delay (d) must be chosen according to the duration of the SET that must be detected. Suppose that the longest SET that can occur in this circuit has a duration of 600 ps. Then the delay (d) must be at least 600 ps to allow one flip-flop to capture the correct data while the other one captures the SET. This scheme can only detect a SET; it cannot vote the correct output.

Figure 2-14. Time redundancy scheme for SET detection

Full time redundancy is when the output is captured at three different times, which allows a majority voter to choose the correct output.
In this case, the output of the combinational logic is latched at three different moments: the clock edge of the second latch is shifted by the time delay d and the clock of the third latch is shifted by the time delay 2d. A voter chooses the correct value, as shown in figure 2-15.

Figure 2-15. Full time redundancy scheme

This technique is also able to vote out a SET that occurs in a MAJ voter, as presented in figure 2-16(a), provided that the MAJ voter is not the very last one (figure 2-16(b)).

Figure 2-16. SET propagation in the MAJ in time redundancy schemes: (a) SET in MAJ voters; (b) SET in the very last MAJ voter

However, this time redundancy technique based on clock delay does not work for SET mitigation in nanometer technologies when the SET has a long pulse duration compared to the clock period. Figure 2-17 shows the problem in a time diagram. Consider a 90 nm technology working at 1 GHz. The maximum delay between two registers separated by combinational logic is 1 ns, which is the clock period. For this technology, as discussed previously, SETs can vary from a few hundred picoseconds to nanoseconds. Suppose that SET pulses of up to 600 ps must be tolerated. Then the clock delay d must be at least 600 ps. The first clock (clk) occurs at time t = 0, the second clock (clk+d) occurs at t = 600 ps and the third clock (clk+2d) occurs at t = 1,200 ps. After the data is stored in all three latches, the MAJ voter chooses the correct output, which also adds a propagation delay. Consequently, 1,200 ps plus the MAJ voter propagation delay are needed to vote out the SET, not counting the combinational logic propagation delay. The new achieved frequency is less than half of the original frequency. This shows that this method is suitable only for SETs with a duration not higher than 10% of the clock period.
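The arithmetic of this example can be reproduced with a small Python sketch. It rests on simplifying assumptions that are not in the text: the minimum clock period is modelled as the plain sum of the combinational delay, the 2d clock skew and the voter delay; setup and hold times are ignored; the combinational delay is taken as the full original 1 ns period; and the 100 ps voter delay is a purely hypothetical value.

```python
def full_time_redundancy_period_ps(set_width_ps: float,
                                   t_logic_ps: float,
                                   t_voter_ps: float) -> float:
    """Minimum clock period (ps) under the simplified model described above."""
    d = set_width_ps            # the clock delay must cover the longest SET
    return t_logic_ps + 2 * d + t_voter_ps


def frequency_mhz(period_ps: float) -> float:
    """Convert a period in picoseconds to a frequency in MHz."""
    return 1e6 / period_ps


original_period_ps = 1000.0     # 1 GHz clock from the example
period_ps = full_time_redundancy_period_ps(
    set_width_ps=600.0,             # SET width to tolerate, from the example
    t_logic_ps=original_period_ps,  # assumption: logic fills the original period
    t_voter_ps=100.0,               # hypothetical voter propagation delay
)

print(f"new minimum period: {period_ps:.0f} ps")
print(f"new frequency: {frequency_mhz(period_ps):.1f} MHz "
      f"(original: {frequency_mhz(original_period_ps):.1f} MHz)")
```

Under these assumptions the protected circuit runs at less than half the original frequency, which is consistent with the conclusion of the example.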
For nanometer technologies, however, it is well known that SETs are of the same order of magnitude as the clock period, as discussed by [21], [26] and [42]. So, new time redundancy techniques must be investigated.

Figure 2-17. SET propagation in the MAJ in time redundancy schemes: (a) short duration SET; (b) long duration SET (waveforms: clk, clk+d, clk+2d, comb, ffp0, ffp1, ffp2, MAJ; period T covering MAJ + comb delays)

2.2.3 Mixed Hardware and Time Redundancy Techniques

Mixed hardware and time redundancy techniques attempt to mitigate SET and SEU by using the best characteristics of hardware and time redundancy, in order to achieve both a lower area overhead and a lower performance penalty.

The code word state preserving (CWSP) technique proposed by [2, 3], illustrated in figure 2-18, is an example of mixed hardware and time redundancy. From the hardware redundancy point of view, it has redundant combinational logic and extra transistors in the very last gate stages. From the time redundancy point of view, the output is only transmitted to the flip-flop input when both combinational logic outputs agree. So, if a SET occurs in one of the combinational logic blocks, the flip-flop input has a high impedance value for as long as the SET lasts.

Figure 2-18. Code Word State Preserving technique for SET mitigation: (a) General CWSP scheme; (b) Example of logic gates with extra transistors to block the SET propagation

This technique does not need voters or comparators, but it has an asynchronous behavior, because it is not possible to determine when both combinational outputs are ready to be stored in the flip-flop. Consequently, the clock period cannot be fixed.
Also, the very last transistors of the logic are sensitive to SET, so they must be sized in order to be less sensitive than the others. Figure 2-19 presents the time diagram of this technique, showing its limitations for SETs with long pulse duration. Note that in figure 2-19(b) the flip-flop will store the high impedance value, as the two combinational output values do not yet agree with each other.

Figure 2-19. Time diagram of CWSP approach: (a) short duration SET; (b) long duration SET (waveforms: clk, SET, a, a*, out, with out at `Z` during the SET)

The technique proposed by [50] uses hardware redundancy for the combinational logic and for the register circuit, with a C-element and a keeper circuit at the latch outputs, as shown in figure 2-20. If a SEU occurs in one of the latches, as illustrated in figure 2-20(a), the C-element does not propagate the upset and the keeper element is able to maintain the output value. Note that the C-element in this case inverts the values stored in the latches. This ensures the correct value at the output (OUT) in the presence of a SEU.

If an upset occurs in the combinational logic, then one latch registers the SET while the other one registers the correct value, as shown in figure 2-20(b). While Clock is equal to one, there is a time when the C-element propagates the correct value and a time when the C-element blocks the propagation because the latch values mismatch. This phenomenon is illustrated in figure 2-21(a). However, for a long pulse SET, the C-element may never propagate the correct value, as the latches may keep diverging for the entire period in which the clock is equal to one, as seen in figure 2-21(a) on the left. In this case, the output (OUT) may hold the value of a previous clock cycle, which compromises the synchronization of the circuit. The redundant logic can be replaced by a delay (time filtering), as shown in figure 2-20(c).
In this case, the two latches may store the SET at different times, as represented in figure 2-21(b). The problem occurs when a long pulse SET happens, because in this case the two latches can hold the SET value at the same time, making the C-element propagate the wrong value, which will then be kept by the keeper circuit. In summary, this technique works well for SEU but it cannot properly protect against SETs.

Figure 2-20. Hardware and Time redundancy with C-element [50]: (a) SEU in the latches; (b) SET in the combinational logic when using logic duplication; (c) SET in the combinational logic when using time filtering

This technique, like the ones presented previously, is inadequate for long duration SET pulses. When long SET pulses occur, the output holds the previous value or even the wrong value, which can compromise the synchronization of the circuit. A solution for mitigating long SET pulses may be based on recomputation combined with a low cost technique able to detect SET and SEU faults.

Figure 2-21. Time diagram from the technique presented in [50]: (a) Time diagram for SET in the combinational logic when using logic duplication for short and long SET pulses; (b) Time diagram for SET in the combinational logic when using time filtering for short and long SET pulses

2.2.3 Hardened Memory Cells

Memory elements can be protected against SEU (bit-flip) by modifying their original design with extra resistors or transistors that are able to recover the stored value if an upset strikes one of the drains of a transistor in the “off” state. These cells are called hardened memory cells, and they can avoid the occurrence of a SEU by design, according to the particle charge and flux.

In order to better understand how these hardened memory cells work, let us start with the analysis of a standard static memory cell composed of 6 transistors. When a memory cell holds a value, it has two transistors in the “on” state and two transistors in the “off” state; consequently, there are always two SEU sensitive nodes in the cell. When a particle strikes one of these nodes, the energy transferred by the particle can cause a transistor to switch “on”. This event will flip the value stored in the memory. If a resistor is inserted between the output of one of the inverters and the input of the other one, the signal can be delayed for long enough to avoid the bit flip.

The SEU tolerant memory cell protected by resistors, proposed by [83] for ASICs and by [68] for FPGAs, was the first proposed solution to this problem (figure 2-22(a)). The decoupling resistor slows the regenerative feedback response of the cell, so the cell can discriminate between an upset caused by a voltage transient pulse and a real write signal. It provides a high silicon density; for example, the gate resistor can be built using two levels of polysilicon. The main drawbacks are temperature sensitivity, performance vulnerability at low temperatures, and an extra mask in the fabrication process for the gate resistor. However, a transistor controlled by the bulk can also implement the resistor, avoiding the extra mask in the fabrication process.
In this case, the gate resistor layout has a small impact on the circuit density.

Memory cells can also be protected by an appropriate feedback devoted to restoring the data when it is corrupted by an ion hit. The main problems are the placement of the extra transistors in the feedback in order to restore the upset, and the influence of the new sensitive nodes. Examples of this method are the IBM hardened memory cells proposed by [69] in figure 2-22(b), HIT cells in figure 2-22(c) proposed by [12, 79], DICE cells in figure 2-22(d) proposed by [14] and NASA memory cells proposed by [84, 44, 15], represented in figure 2-22(e). The main advantages of this method are temperature, voltage supply and technology process independence, and good SEU immunity. The main drawback is the silicon area overhead due to the extra transistors and their extra size.

Figure 2-22. Examples of SEU hardened cells: (a) Resistor memory cell; (b) IBM hardened memory cell; (c) HIT memory cell; (d) DICE hardened memory cell; (e) NASA hardened memory cell

2.2.4 Error Correcting Code (ECC)

The error correcting code technique is based on information redundancy and is used to mitigate SEU in integrated circuits, as discussed by [18]. It is usually used in memory arrays, but it can also be applied to registers or other small memory structures in microprocessors, for instance. Designers can implement ECC detection and correction in hardware or in software. [73] compares the reliability of ECC implemented at these two levels.
The simplest error correcting codes can correct single-bit errors and detect double-bit errors, while more complex ones can detect or correct multi-bit errors. Examples include Hamming code, BCH code, Reed-Solomon code, Reed-Muller code, binary Golay code, convolutional code, and others. Simple codes are usually implemented in hardware using extra memory bits and encoding/decoding circuitry.

Figure 2-23 illustrates 8-bit data being written to and read from a register. If a SEU occurs and there is no information redundancy, an error occurs but it is not detected, which can lead to catastrophic consequences in the circuit. If a parity bit is added to the stored data, it is possible to detect an error when the parity bits mismatch. For many applications, it is not enough to detect the error; it is also necessary to correct it. For those, it is possible to use an ECC with encoder and decoder blocks. The encoder block creates a set of check bits that help to identify the error position, and the decoder block is then able to restore the correct value.

Figure 2-23. Error Correcting Code Principle (three write/read cases: no redundancy, where the error goes undetected; parity bit, where the parity does not match and the error is detected; encoder and decoder, where the error is corrected)

An example of a widely used ECC is the Hamming code [31] in its simplest version. It is an error-detecting and error-correcting binary code that can detect all single- and double-bit errors and correct all single-bit errors (SEC-DED). This coding method is recommended for systems with low probabilities of multiple errors in a single data structure (e.g., only a single bit error in a byte of data).
The code satisfies the relation $2^k \geq m + k + 1$, where m+k is the total number of bits in the coded word, m is the number of information bits in the original word, and k is the number of check bits in the coded word. Following this relation, the Hamming code can correct all single-bit errors in n-bit words, and detect double-bit errors when an overall parity check bit is used.

The Hamming code implementation is composed of a combinational block responsible for encoding the data (encoder block), extra bits in the word that indicate the parity (extra latches or flip-flops) and another combinational block responsible for decoding the data (decoder block). The encoder block calculates the check bits and can be implemented by a set of n-input XOR gates. The decoder block is more complex than the encoder block, because it must not only detect the fault but also correct it. It is basically composed of the same logic used to compute the check bits plus a decoder that indicates the address of the bit that contains the upset. The decoder block can also be composed of a set of n-input XOR gates and some AND and INVERTER gates.

The encoder block calculates the check bits, which are placed in the coded word at positions 1, 2, 4, …, $2^{k-1}$. For example, for 8-bit data, 4 check bits (p1, p2, p3, p4) are necessary for the Hamming code to be able to detect and correct a single-bit error (SEC-SED). Figure 2-24 shows a 12-bit coded word (m=8 and k=4) with the check bits p1, p2, p3 and p4 located at positions 1, 2, 4 and 8 respectively. The check bits are able to indicate the position of the error. The check bit p1 creates even parity for the bit group {1, 3, 5, 7, 9, 11}. The check bit p2 creates even parity for the bit group {2, 3, 6, 7, 10, 11}. Similarly, p3 creates even parity for the bit group {4, 5, 6, 7, 12}. Finally, the check bit p4 creates even parity for the bit group {8, 9, 10, 11, 12}.
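The encoding and correction steps described above can be sketched in Python for the 12-bit coded word (m=8, k=4). The bit groups are taken directly from the text; the function names and the list-based word representation are assumptions made for illustration, and the sketch models only the logic, not a hardware implementation.

```python
# Bit groups covered by each check bit, keyed by the check bit position
# (1-based positions in the 12-bit coded word, as listed in the text).
GROUPS = {
    1: (1, 3, 5, 7, 9, 11),
    2: (2, 3, 6, 7, 10, 11),
    4: (4, 5, 6, 7, 12),
    8: (8, 9, 10, 11, 12),
}
DATA_POSITIONS = (3, 5, 6, 7, 9, 10, 11, 12)


def group_parity(word, group):
    """XOR of the bits of `word` (index 0 unused) at the given positions."""
    parity = 0
    for pos in group:
        parity ^= word[pos]
    return parity


def encode(data_bits):
    """Place 8 data bits in a 12-bit word and set the check bits for even parity."""
    word = [0] * 13                      # index 0 unused, positions 1..12
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        word[pos] = bit
    for check_pos, group in GROUPS.items():
        # The check bit is still 0 here, so this is the parity of the data bits.
        word[check_pos] = group_parity(word, group)
    return word[1:]


def syndrome(coded):
    """Return the error position (1..12), or 0 when every group has even parity."""
    word = [0] + list(coded)
    return sum(check_pos for check_pos, group in GROUPS.items()
               if group_parity(word, group))


def correct(coded):
    """Flip the bit indicated by the syndrome; returns (corrected word, syndrome)."""
    fixed = list(coded)
    s = syndrome(fixed)
    if 1 <= s <= 12:
        fixed[s - 1] ^= 1
    return fixed, s


data = [1, 0, 1, 1, 0, 0, 1, 0]
coded = encode(data)
upset = list(coded)
upset[6] ^= 1                            # inject a single-bit upset at position 7
fixed, s = correct(upset)
print("syndrome:", s, "corrected:", fixed == coded)
```

A syndrome of zero means that no single-bit error was found; any other value is the position of the flipped bit, which is why the check bits sit at the power-of-two positions.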
Encoder block: check bits generation
P1 = W3 xor W5 xor W7 xor W9 xor W11
P2 = W3 xor W6 xor W7 xor W10 xor W11
P3 = W5 xor W6 xor W7 xor W12
P4 = W9 xor W10 xor W11 xor W12

Decoder block: syndromes
Syndrome P1 = P1 xor W3 xor W5 xor W7 xor W9 xor W11
Syndrome P2 = P2 xor W3 xor W6 xor W7 xor W10 xor W11
Syndrome P3 = P3 xor W5 xor W6 xor W7 xor W12
Syndrome P4 = P4 xor W9 xor W10 xor W11 xor W12

Decoder block: mask generation
Syndrome (P4 P3 P2 P1)    Mask (P1 P2 W3 ... W11 W12)
0000                      no error
0001                      100000000000
0010                      010000000000
0011                      001000000000
0100                      000100000000
0101                      000010000000
0110                      000001000000
0111                      000000100000
1000                      000000010000
1001                      000000001000
1010                      000000000100
1011                      000000000010
1100                      000000000001

Figure 2-24. Hamming code check bits generation for an 8-bit word, 12-bit coded word

Hamming code can protect structures such as registers, register files and memories, as presented in figure 2-25. Depending on the organization, one encoder and one decoder can be multiplexed in time to protect many registers and memory elements.

Hamming code increases area by requiring additional storage cells (check bits), plus the encoder and the decoder blocks. For an n-bit word, approximately $\log_2(2n)$ additional storage cells are needed. However, the encoder and decoder blocks may add a more significant area increase, due to the extra XOR gates. Regarding performance, the delay of the encoder and decoder blocks is added to the critical path. The delay becomes more critical as the number of bits in the coded word increases. The number of XOR gates in series is directly proportional to the number of bits in the coded word.

Figure 2-25. ECC in memory elements with a feedback refreshing to clean up the SEUs: (a) Registers protected by ECC; (b) Memory protected by ECC

In [41], a microcontroller protected by Hamming code was proposed. The ECC was implemented in the datapath, memory arrays and control logic. Figure 2-26 shows the microcontroller datapath.

Figure 2-26. Example of a microcontroller datapath protected by ECC (ROM and RAM memories, PC, AD and ALU with encoder, decoder and refreshing blocks; all the registers are 12-bit, coded by Hamming code)

The limitation of the Hamming code is that it cannot correct double bit upsets, which can be very important for very deep sub-micron technologies, especially in memories, because of the high density of the cells [65]. Other codes must be investigated to cope with multiple bit upsets, whose probability of occurrence is increasing with the advance of technology, as shown in [13].

Reed-Solomon [31] is an error-correcting coding system that was devised to address the issue of correcting multiple errors. It has a wide range of applications in digital communications and storage. Reed-Solomon codes are used to correct errors in many systems, including storage devices, wireless or mobile communications, high-speed modems and others. Reed-Solomon (RS) encoding and decoding is commonly carried out in software, but an efficient RS code implementation in hardware was presented by [53, 54] to protect memories against multiple SEUs.

When using ECC, it is appropriate to implement an interleaving technique, which means that the bits of a word protected by the same check bits must not be placed physically next to each other. This technique helps ensure that upsets in two nearest-neighbor memory cells do not fall in the same check word, which would otherwise produce a multiple-bit upset in a word protected by a single-bit ECC.
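The interleaving idea can be sketched with a simple column mapping in Python. The layout below is a hypothetical one chosen for illustration (it is not the organization proposed in [54]): a physical row holds several coded words, and bit j of word w is stored at column j * words_per_row + w, so that physically adjacent cells always belong to different words.

```python
def physical_column(word_index: int, bit_index: int, words_per_row: int) -> int:
    """Column of a given bit of a given word in an interleaved row (assumed layout)."""
    return bit_index * words_per_row + word_index


def owner(column: int, words_per_row: int):
    """Inverse mapping: (word_index, bit_index) stored at a physical column."""
    return column % words_per_row, column // words_per_row


WORDS_PER_ROW = 4      # hypothetical row organization
BITS_PER_WORD = 12     # e.g. the 12-bit Hamming coded word

# An upset hitting two neighbouring cells corrupts one bit in each of two
# different words, so each word still sees only a single-bit error.
col = physical_column(word_index=2, bit_index=5, words_per_row=WORDS_PER_ROW)
print("cells", col, "and", col + 1, "belong to words",
      owner(col, WORDS_PER_ROW)[0], "and", owner(col + 1, WORDS_PER_ROW)[0])
```

With this mapping, an upset spanning up to words_per_row adjacent cells touches each word at most once, at the cost of longer routing between the bits of a word.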
[54] proposed a memory interleaving organization in which two ECCs are used, Reed-Solomon and Hamming code, to ensure correction in the presence of massive multiple upsets (figure 2-27).

Figure 2-27. Example of interleaving in a memory protected by two ECCs [54]

2.3 Architectural level based techniques

Whenever the effect of a fault is detected at the circuit architecture level, the circuit is already computing with some error, and in this case computational recovery is mandatory. Current microprocessors already maintain checkpoints across tens of instructions for the purpose of speculation recovery, as discussed by [82]. This makes fault recovery suitable for current microprocessors. There are two general principles for recovery: forward and backward.

Forward error recovery means detecting an error and continuing on in time, while attempting to mitigate the effects of the faults that may have caused the error. It implies the constructive use of redundancy. For example, temporally or spatially replicated messages may be averaged or compared to compensate for lost or corrupted data.

Backward error recovery means detecting an error and retracting back to an earlier system state or time. It includes the use of error detection (by redundancy, comparing pairs or error-detecting codes) so that evasive action can be taken. It subsumes rollback to an earlier version of the data or to an earlier system state. Backward error recovery may also include fail-safe, fail-stop and graceful degradation modes that may yield a safe state, degraded performance or a less complete functional behavior. There can be problems if backward recovery is used in real-time systems. One problem is that interactions with the environment cannot be undone. Another problem is how to meet the timing requirements.
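Backward error recovery can be sketched in Python as a checkpoint, detect and retry loop. Everything in this sketch is an assumption made for illustration: the state is a dictionary, error detection is done by comparing a value with a duplicated shadow copy, and the transient fault is injected artificially on the first execution of the step.

```python
import copy


def run_with_rollback(state, steps, detect_error, max_retries=3):
    """Execute steps in order; on a detected error, restore the checkpoint and retry."""
    for step in steps:
        checkpoint = copy.deepcopy(state)        # save the last error-free state
        for _attempt in range(max_retries + 1):
            step(state)
            if not detect_error(state):
                break                            # step completed without error
            state.clear()                        # rollback to the checkpoint
            state.update(copy.deepcopy(checkpoint))
        else:
            raise RuntimeError("error persists after retries (not transient)")
    return state


class FlakyAdd:
    """Adds 5 to an accumulator and its shadow copy; the first call is corrupted."""

    def __init__(self):
        self.calls = 0

    def __call__(self, state):
        self.calls += 1
        state["acc"] += 5
        state["shadow"] += 5
        if self.calls == 1:
            state["acc"] ^= 0x4                  # injected transient bit flip


def mismatch(state):
    """Duplication-with-comparison style detection on the two copies."""
    return state["acc"] != state["shadow"]


step = FlakyAdd()
final = run_with_rollback({"acc": 0, "shadow": 0}, [step], mismatch)
print(final, "executions:", step.calls)
```

The retry costs one extra execution of the step, which illustrates why meeting timing requirements is the main difficulty of backward recovery in real-time systems.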
In summary, both forward and backward recovery approaches have advantages and drawbacks. Forward recovery can meet timing requirements, but it needs inherent redundancy for error detection and redundant computation. Backward recovery needs only error detection and not inherent redundancy of the entire computation, because once an error is detected the computation is performed again by the same hardware; however, it is sometimes hard to meet timing requirements with this approach. So the challenge is to detect the fault effect efficiently.

One of the first options is the concept of processing and checking in parallel the outputs of a system for only a subset of its possible inputs, also called fingerprinting, as presented by [6]. It can be applied to the general case of a circuit that must be hardened against soft errors, providing tolerance against transient faults caused by pulses that affect parts of the circuit, even when the duration of the transient pulse is longer than the delay of several gates. In [49], concurrent error detection techniques for combinational logic blocks are also presented. Figure 2-28 illustrates the general concept of using infrastructure IP to check logic-block integrity. Checking a logic function requires a predictor block to compute the input signature and a checker to compare the output and input signatures. The challenge is in designing and implementing the most efficient blocks while preserving performance and keeping cost down.

Figure 2-28. Fingerprinting Technique (blocks: inputs, function f, output characteristic prediction, checker; signals: output, error)

In contrast with other proposed solutions based on checker circuits, when fingerprinting is applied the random checker does not provide full fault detection (figure 2-29).
In this case, the random checker performs some of the functions of the main circuit on only a small set of possible inputs, and can statistically detect errors at the output with a given probability. The main goal of this approach is to provide an acceptable level of fault detection using a circuit that is significantly smaller than the main circuit under inspection, thereby keeping the area overhead low. The underlying concept is generic and can be adopted for several different applications or circuits; the subset of inputs, the operations performed by the checker, and the performance, area, and power overheads vary according to the application.

Figure 2-29. Random checker technique (blocks: inputs, main circuit, random checker, output, error)

Typically, embedded systems in safety-critical applications use watchdog schemes, which under worst-case conditions detect an erroneous behavior only after a long series of clock cycles. The error is subsequently repaired by interrupt and retry. For applications that are also time-critical, such methods are too slow. If a rollback is done several clock cycles after error detection, the system may have had time to write large amounts of wrong data back into the system memory before compensation. Micro rollback tries to recognize an error condition very early and provides error correction within a few clock cycles. In [27], a micro rollback scheme based on two separate processors was proposed, in which the backup processor (trailer) is delayed by one or two clock cycles but performs exactly the same operations as the master processor. Micro rollback restores the last error-free processor state by reloading all register contents and re-executes the erroneous instruction on the master processor. In this scheme, the trailer processor holds register contents long enough to re-establish the original state.
Such an approach is based on the implicit assumption that the trailer processor is self-checking in order to identify the faulty device. The trailer also acts as a backup in case of transient faults. Figure 2-30 exemplifies micro rollback, as presented by [30].

Figure 2-30. Micro-rollback Example

2.4 Area and Performance Tradeoffs Summary

Each technique discussed above has a different area and performance overhead. It is possible to choose the most efficient one, or to combine them, to meet the fault tolerance requirement of each type of application and system platform. Table 2-1 shows the area and performance of each presented technique when implemented in an 8-bit adder design with registered inputs and output, which contains 294 transistors for the combinational logic and 384 transistors for the 32 master-slave flip-flops. The area results are computed for a 90nm technology (PTM model). The performance is given as a general equation based on the setup and hold times and propagation delay of the flip-flops (tpffp), the propagation delay of the adder (tplogic) and the propagation delays of the added blocks, such as comparators (tpcomparator), voters (tpvoters), encoding (tpenc) and decoding (tpdec) blocks and others.

Table 2-1. Comparison of Area and Performance in an 8-bit adder case-study circuit with registered inputs and outputs protected by SEE mitigation techniques.
Fault tolerance technique: Unprotected circuit
• Area: combinational logic: 294 transistors; sequential logic: 384 transistors. Area = 584.24 µm2
• Performance: Delay = tpffp + tplogic + tpffp
• Capability: None, only inherent masking

Fault tolerance technique: Entire circuit protected by Duplication with Comparison (DWC)
• Area: 2x combinational logic: 588 transistors; 2x sequential logic: 768 transistors; comparator for the 16-bit output: 156 transistors. Area = 1,300.32 µm2 (+122%)
• Performance: Delay = tpffp + tplogic + tpffp + tpcomparator
• Capability: SEU and SET detection

Fault tolerance technique: Triple Modular Redundancy (TMR) in the entire circuit with single voter at the output
• Area: 3x combinational logic: 882 transistors; 3x sequential logic: 1152 transistors; majority voters for the 16-bit output: 288 transistors. Area = 1,996.92 µm2 (+241%)
• Performance: Delay = tpffp + tplogic + tpffp + tpvoter
• Capability: SEU and SET correction, but the final voter can be upset

Fault tolerance technique: TMR in the entire circuit with triple voter at the output
• Area: 3x combinational logic: 882 transistors; 3x sequential logic: 1152 transistors; 3x majority voters for the 16-bit output: 864 transistors. Area = 2,492.28 µm2 (+326%)
• Performance: Delay = tpffp + tplogic + tpffp + tpvoter
• Capability: SEU and SET correction

Fault tolerance technique: Time redundancy in the output of the combinational logic with TMR in the registers
• Area: 1x combinational logic: 294 transistors; 3x sequential logic: 1152 transistors; 2x majority voters for the 8-bit inputs: 288 transistors; 1x majority voters for the 16-bit output: 288 transistors; delay (δ) considered as 16 transistors (chains of inverters). Area = 1,752.68 µm2 (+199%)
• Performance: Delay = tpffp + tplogic + δ + δ + tpffp + tpvoter
• Capability: SEU and SET can be corrected; the added delay (δ) must be chosen according to the SET pulse width

Fault tolerance technique: Built-in Current Sensors in the combinational and sequential logic
• Area: combinational logic: 294 transistors; sequential logic: 384 transistors; 33 Bulk-BICS: 1 for each 21 transistors. Area = 789.79 µm2 (+35%)
• Performance: Delay = tpffp + tplogic + tpffp
• Capability: SEU and SET detection

Fault tolerance technique: Hardened memory cells in the registers
• Area: combinational logic: 294 transistors; 2x sequential logic: 768 transistors. Area = 913.32 µm2 (+56%)
• Performance: Delay = tpffp* + tplogic + tpffp*, where tpffp* is the delay of the hardened flip-flop
• Capability: SEU correction only. No SET detection or correction

Fault tolerance technique: Error Correction Code, such as Hamming code, in the input and output of the registers
• Area: combinational logic: 294 transistors; sequential logic: 384 + (4 + 4 + 5) parity bits x 12 transistors; 2x 8-bit encoding: 144 transistors; 2x 8-bit decoding: 288 transistors; 1x 16-bit encoding: 144 transistors; 1x 16-bit decoding: 288 transistors. Area = 1,460.28 µm2 (+150%)
• Performance: Delay = tpenc + tpffp + tpdec + tplogic + tpenc + tpffp + tpdec
• Capability: SEU correction only. No SET detection or correction

Fault tolerance technique: Recomputing with Shifted or Swapped operands
• Area: combinational logic: 294 transistors; sequential logic: 384 transistors; 2x 8-bit multiplexers 2:1: 64 transistors; comparator for the 16-bit output: 156 transistors. Area = 772.28 µm2 (+32%)
• Performance: Delay = 2 x (tpffp + tpmux + tplogic + tpffp) + tpcomp
• Capability: SEU and SET detection

3. Radiation Effects on FPGAs

Field-Programmable Gate Arrays (FPGAs) are configurable integrated circuits based on a regular structure with high logic density, which can be customized by the end user to realize different designs. The FPGA architecture is based on an array of logic blocks and interconnections customizable by programmable switches. Several different programming technologies are used to implement the programmable switches. There are three types of such programmable switch technologies currently in use:
• SRAM, where the programmable switch is usually a pass transistor or multiplexer controlled by the state of a SRAM bit (SRAM-based FPGAs)
• Antifuse, where an electrically programmable switch forms a low-resistance path between two metal layers (antifuse-based FPGAs)
• EPROM, EEPROM or FLASH cell, where the switch is a floating-gate transistor that can be turned off by injecting charge onto the floating gate.

Customizations based on SRAM are volatile. This means that SRAM-based FPGAs can be reprogrammed as many times as necessary at the work site and that they lose their contents when the memories are not connected to the power supply. The antifuse customizations are non-volatile, so they hold the customized content even when not connected to the power supply, and they can be programmed just once. Each FPGA has a particular architecture. Programmable logic companies such as Xilinx, Actel, Aeroflex (licensed for Quicklogic FPGAs), Atmel and Honeywell (licensed for Atmel FPGAs) offer radiation tolerant FPGA families. Each company uses different mitigation techniques to better take into account the architecture characteristics.

3.1 Antifuse-based FPGAs

The Actel RTAX-S family is an example of antifuse-based FPGAs for space applications. It consists of a regular matrix of combinational cells (C-cells) and sequential cells (R-cells) surrounded by regular routing channels, as shown in figure 3-1.
All the customizations of the routing, the C-cells and the R-cells are done by an antifuse element (programmable switch). Results from radiation ground testing have shown that the programmable switches, based on either ONO (oxide-nitride-oxide) or MIM (metal-insulator-metal) technology, are tolerant to ionization and total dose effects [81]. Therefore, the customizable routing is not sensitive to SEU; only the flip-flops used to implement the user's sequential logic are sensitive to SEU.

Figure 3-1. ACTEL: RTAX-S device

The R-cell is composed of a Triple Modular Redundancy (TMR) flip-flop, or DFF, with a wired-or voter at the output, as presented in figure 3-2. This makes the R-cell robust to SEUs. However, at high frequency operation, SETs can be observed [11]. Due to the number of transistors contained in an R-cell, there are several points susceptible to Single Event Transients (SETs). As discussed previously, some of these SETs may propagate through the logic and be captured by the R-cell, where all three DFFs share the same data, clock, enable, and reset lines. Because of this, a glitch appearing on one of these lines during a clock edge will most likely appear as the same value to all of the DFFs and will not be correctly mitigated. As the system clock frequency increases, so does the probability of capturing the SET. As the number of levels of combinatorial logic between DFFs increases, the probability of generating a SET increases. The user may protect the C-cells by using high-level mitigation techniques in the description of the design (TMR, duplication and others).

Figure 3-2. ACTEL: RTAX-S device

In [11], radiation test results for an Actel RTAX-S device showed the influence of the frequency on the error cross-section. The case-study architectures (shift registers) are illustrated in figure 3-3.
The number of logic levels between two flip-flops was chosen as 0, 4, or 8 inverter gates.

Figure 3-3. Shift registers implemented at ACTEL: RTAX-S device for radiation ground testing [11]

At each LET, several tests were performed at various frequencies on all of the shift register string types [11]. As the frequency increased, the error cross-section increased, as seen in figure 3-4. This is due to the probability of SET propagation and capture. A shift register string containing hardened (TMR) DFFs and combinatorial logic between these hardened flip-flops should present errors only when SETs in the combinational logic are captured by the TMR DFFs. The higher the frequency, the higher the probability of capturing the SET.

Figure 3-4. ACTEL: RTAX-S device test when using the shift register with 8 inverters between flip-flops [11]

The RadHard Eclipse FPGA is another example of antifuse-based FPGAs. It is provided by Aeroflex, which uses QuickLogic Corporation’s licensed ESP (Embedded Standard Products) technology. Its architecture is also composed of a regular matrix of configurable logic cells, each containing logic and flip-flops, surrounded by a regular routing matrix, as illustrated in figure 3-5. All the customizations are done by a programmable switch called the ViaLink connector. The device is fabricated on a 0.25µm five-layer metal ViaLink CMOS process. The CLB flip-flops are SEU-hardened, which makes the CLB robust to SEU as well. However, the CLB logic can be susceptible to SETs that propagate through the logic and are captured by one of the flip-flops. High-level fault tolerant techniques can be implemented to mitigate SETs in designs synthesized into these types of FPGAs too.

Figure 3-5. RadHard Eclipse FPGA from Aeroflex

3.2 SRAM-based FPGAs

SRAM-based FPGAs are very attractive due to their high density, high performance, low NRE (Non-Recurring Engineering) cost, fast turnaround time and reconfigurability. For space and remote applications, SRAM-based FPGAs offer additional benefits by allowing in-orbit design changes thanks to reconfigurability, which can reduce the mission cost by correcting errors or improving system performance after launch. In addition, the same circuitry can be used with different configurations at different stages of a mission, reducing weight and power requirements. Also, if part of an FPGA fails, the circuitry can be reprogrammed to make use of the remaining functional portions of the chip.

Xilinx FPGAs have an array composed of configurable logic blocks (CLBs) surrounded by programmable input/output blocks (IOBs), all interconnected by a hierarchy of fast and versatile routing resources. Each CLB has a set of look-up tables (LUTs), multiplexers and flip-flops, which are divided into slices. A LUT is a logic structure able to implement a Boolean function as a truth table. The CLBs provide the functional elements for constructing logic, while the IOBs provide the interface between the package pins and the CLBs. The CLBs are interconnected through a general routing matrix (GRM) that comprises an array of routing switches located at the intersections of horizontal and vertical routing channels. The FPGA matrix also has dedicated memory blocks called Block SelectRAMs, clock DLLs for clock-distribution delay compensation and clock domain control, and other components that vary according to the FPGA family.

Virtex devices are quickly programmed by loading a configuration bitstream (collection of configuration bits) into the device. The device functionality can be changed at any time by loading a new bitstream.
The bitstream is divided into frames and contains all the information needed to configure the programmable storage elements in the matrix, located in the look-up tables (LUTs) and flip-flops, the CLB configuration cells and the interconnections. Figure 3-6 shows a general Xilinx FPGA architecture, where each matrix tile is a configurable logic block (CLB) with the logic slices and the general routing matrix (GRM). The characteristics of the CLB logic and slices vary with the FPGA family.

Due to the evolution of process technology, FPGAs are in the nanometer technology era. As shown in figure 3-7, the latest families, Virtex4 and Virtex5, are fabricated in 90 nm and 65 nm, respectively [86]. This evolution has allowed high logic integration. Nowadays it is possible to implement millions of gates and data memory in a single FPGA. In addition, there are families with embedded hard-core microprocessors, such as the VirtexII-Pro family with a PowerPC connected to the customizable array. The CLBs and interconnection structures have also evolved in the past decade (figure 3-8).

Figure 3-6. Example of SRAM-based FPGA architecture based on regular array

Figure 3-7. Evolution of Xilinx FPGA families in the last decade

Figure 3-8. CLB logic evolution in the last decade

CLBs used to contain a small number of 4-input LUTs, where each LUT can implement any 4-input Boolean logic function, as in the Virtex family. Nowadays a CLB can contain a large number of 4-input LUTs, as in the Virtex4 family, or even 6-input LUTs, where each LUT can implement any 6-input Boolean logic function, as in the recently released Virtex5 family. The interconnection structures located in the GRM have also improved in the last decade, reducing delay and increasing the performance of the implemented designs.
All this evolution has increased the interest in using SRAM-based FPGAs for a wide range of applications, but at the same time it has made it necessary to analyze carefully the soft error susceptibility of these highly complex structures. The effect of soft errors in the FPGA architecture on the implemented designs must be evaluated in order to implement efficient fault tolerant techniques.

In FPGAs, a soft error has a peculiar effect on the user logic design, since the combinational and sequential logic are mapped into the programmable architecture. Recall that in an ASIC, the effect of a soft error in either the combinational or the sequential logic is transient; the only variation is the duration of the fault. A fault in the combinational logic creates a transient logic pulse (SET) at a node, which can propagate through the logic according to the logic delay and topology. In other words, a SET in the combinational logic may or may not be latched by a flip-flop placed at the combinational logic output. Faults in the sequential logic (SEUs) manifest themselves as bit flips, which remain in the flip-flop until the next load.

On the other hand, in a SRAM-based FPGA, both the user’s combinational and sequential logic are implemented by customizable logic memory cells, in other words SRAM cells, as represented in figure 3-9. SEUs can occur in all SRAM cells, for example in the ones that configure the LUTs, control the CLB configuration, control the routing (GRM) and others. When a SEU occurs in a memory cell that configures a LUT, it flips one of the stored values, modifying the implemented combinational logic. This fault has a permanent effect on the user logic, and it can only be corrected at the next load of the configuration bitstream, when the LUT is configured again with the original Boolean function defined by the user.
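The persistent nature of a LUT upset can be illustrated with a small behavioral model. The sketch below is an assumption-laden toy: a 4-input LUT is modeled as a 16-entry list of configuration bits, the address ordering (first input as most significant bit) is chosen arbitrarily, and the helper names are hypothetical.

```python
from itertools import product


def lut_eval(config_bits, inputs):
    """Evaluate a LUT: the inputs form an address into the stored truth table."""
    index = 0
    for bit in inputs:                 # assumed ordering: inputs[0] is the MSB
        index = (index << 1) | bit
    return config_bits[index]


def inject_seu(config_bits, position):
    """Return a copy of the configuration with one SRAM cell flipped."""
    flipped = list(config_bits)
    flipped[position] ^= 1
    return flipped


golden = [0] * 15 + [1]            # 4-input AND: only address 1111 stores a 1
upset = inject_seu(golden, 7)      # SEU in the cell addressed by inputs 0,1,1,1

# The implemented function is now wrong for every evaluation of that input
# vector, and stays wrong until the configuration is written again.
wrong_vectors = [v for v in product((0, 1), repeat=4)
                 if lut_eval(upset, v) != lut_eval(golden, v)]

reloaded = list(golden)            # reloading the bitstream restores the function
```

Unlike a SET in an ASIC gate, nothing in the upset LUT times out: the error disappears only when the original bits are reloaded.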
When a SEU occurs in a memory cell that controls the CLB configuration, as shown in figure 3-9, the multiplexer controlled by the affected memory cell changes its connection, and the original connection is undone. This also has a permanent effect, which can be mapped to an open or a short circuit in the user combinational logic implemented by the FPGA. The fault is also corrected at the next load of the configuration bitstream, when the original configuration is loaded into the CLB control memory cells.

Figure 3-9. SEU Sensitive Bits in the CLB Slice (labels: user design logic with inputs A, B, C, D and signals E1, E2, E3, clk and F, mapped to the LUTs and SRAM configuration cells of the FPGA CLB slice; one configuration bit marked as upset)

When a SEU occurs in a memory cell that controls the routing (GRM), as shown in figure 3-10, it may affect the multiplexer or the pass transistors responsible for the connections between logic blocks. The SEU can result in open and short circuits in the logic. This also has a permanent effect, which can be mapped to an open or a short circuit in the user combinational logic implemented by the FPGA. The fault is also corrected at the next load of the configuration bitstream, when the original configuration is loaded into the affected memory cells. When an upset occurs in a CLB flip-flop or in the embedded memory, it has a transient effect, because the bit flip can be corrected at the next load of the flip-flop or at the next write to the memory. All these effects are discussed in more detail in [34].

Figure 3-10. Examples of upsets in the SRAM-based FPGA architecture in the general routing matrix (GRM)

Radiation tests performed on Xilinx FPGAs, presented by [10], [62, 63], [78] and [80], show the effects of SEU on the design application and confirm the necessity of using fault-tolerant techniques for space applications.
A fault-tolerant system designed into SRAM-based FPGAs must be able to cope with the peculiarities mentioned in this section, such as transient and permanent effects of a SEU in the combinational logic, short and open circuits in the design connections, and bit flips in the flip-flops and memory cells.

Results presented by [62, 63] show multiple bit upsets (MBUs) in Virtex SRAM-based FPGAs. These results are very relevant because they determine the probability that an MBU overcomes the mitigation techniques applied in these devices. Results show that MBU events are not as common in the Virtex family; most Virtex resources have 10% MBU events compared to VirtexII and Virtex4. The only resources in all three families that do not follow these patterns are the BRAM blocks, because of their high density. Figure 3-11 (a and b) shows the normalized percentage of MBU events by resource [62, 63]. The normalized percentages are determined by the ratio of the number of MBU events to all events for the resource. A comparison of the normalized values indicates that IOBs are very sensitive to MBUs. For the Virtex-II and Virtex-II Pro families, IOBs are nearly as sensitive as CLBs to MBUs. Five-bit and larger events were observed in Virtex-4. In summary, due to technology scaling, [62, 63] have shown that MBUs are 27–33 times more common in the Virtex-II and Virtex-II Pro families than in the earlier Virtex family. MBU events are nearly three times more likely in the Virtex-4 family (fabricated in a 90nm process technology) than in the Virtex-II and Virtex-II Pro families (fabricated in a 130nm process technology) and 69 times more likely than in the Virtex family (fabricated in a 220nm process technology).

Figure 3-11. Percentage of MBU events in all events induced by heavy ion radiation for each resource in the Xilinx FPGAs: (a) Virtex family, in 0.22µm process technology [63]; (b) VirtexII family, in 0.13µm process technology [63]

Concerning single event effects, a set of the results presented by [28] is shown in figure 3-12. The graph in figure 3-12(a) shows the upset sensitivity for any physical bit in the configuration bitstream, largely dominated by the configuration logic blocks (CLBs). The data for the configuration logic blocks (CLBs) and BlockRAM (BRAM) are shown separately. The Virtex-4 data look very much like those of the Virtex-II Pro. The Block RAM cells (open symbols) have a small but consistently higher susceptibility than the CLBs (filled symbols) in the knee region of the curve on a per-bit basis.

In addition to single event upsets (SEUs), complex devices like the Virtex-4 are susceptible to single-event-functional-interrupt (SEFI) modes. These are upsets to a control circuit that disable large portions of the device's function. From studies of prior Virtex FPGA generations we might expect to see SEFI modes involving the power-on-reset circuit (POR), failures of the JTAG or SelectMap communication ports, or others. A possible configuration clock (CCLK) upset observed in the Virtex-II Pro device was the only Virtex SEFI mode yet seen that required a power cycle to recover. All other modes could be recovered by simply reloading the configuration. At this writing, we have studied only the POR SEFI and looked for modes requiring a power cycle. SEFI results for the POR are shown in figure 3-12(b), from [28].

Figure 3-12. Virtex-4 static SEU cross sections for three device types [28]: (a) BRAM and CLB sensitive parts: VirtexII versus Virtex4; (b) SEFI sensitivity for the Power-On-Reset (POR)

Note that there is also the possibility of single event transients (SETs) in the combinational logic used to build the CLB, such as the input and output multiplexers used to control part of the routing and the LUTs, as shown in figure 3-13. The evaluation of SET propagation in a design implemented in a FPGA relies on the analysis of the SET propagation from a LUT through a chain of pass transistors and multiplexers along the routing until it reaches a CLB flip-flop. The sensitivity of each node must be evaluated according to its capacitance and logic connection.

Figure 3-13. SET propagation in SRAM-based FPGA

Figure 3-14 shows a LUT structure based on pass transistors and a multiplexer also based on a pass transistor tree. Both structures have a valid path that is defined by the inputs at a given time. The nodes sensitive to SET are the drains of all transistors that are in the off state. For the selected paths in figure 3-14 and the stored values, there are a few sensitive points, as indicated. In the case of the LUT that is propagating a ‘0’ value, only SETs that charge the node need to be analyzed. These are the ones generated by ionization in the drain of the off-state PMOS transistors placed on the same selected path.

Figure 3-14. SET propagation in the internal LUT and routing multiplexers

For the multiplexer that is propagating a ‘1’ value, only SETs that discharge the node need to be analyzed. These are the ones generated by ionization in the drain of the off-state NMOS transistors placed on the selected path.

4. Radiation Hardening by Design: Strategies for SRAM-based FPGAs

Designers can protect the design at the high-level description (VHDL or Verilog) level by using some sort of redundancy targeting the FPGA architecture. The most popular high-level SEU mitigation technique used nowadays to protect designs synthesized in SRAM-based FPGAs is TMR combined with scrubbing. Xilinx has released a tool called X-TMR that automatically implements TMR in the user description, but users can also implement TMR in their designs themselves. However, due to the high area overhead of TMR, some alternative solutions have been proposed in recent years, so the user has the flexibility to implement duplication and self-checking techniques instead of TMR. These techniques may compromise the fault tolerance at some point, but the final result may be acceptable for a set of applications.

In this way, it is possible to use a commercial FPGA part to implement the design, with the soft error mitigation technique applied to the design description before it is synthesized into the FPGA. The user has the flexibility of choosing the fault-tolerant technique and consequently the overheads in terms of area, performance and power dissipation. Figure 4-1 exemplifies the design flow of a general circuit implemented in a FPGA. One very important step of the design flow is the validation of the fault tolerance technique, which is usually done by fault injection. The original bitstream configured into the FPGA can be modified by a circuit, or by a tool on the computer, by flipping the bits of the bitstream one at a time. Each flip emulates a SEU in the configuration memory cells. The output of the design under test (DUT) can be constantly monitored to analyze the effect of the injected fault on the design. If an error is detected, the implemented fault tolerant technique is not robust against that specific fault (SEU) in that target configuration memory bit.
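The bit-by-bit fault injection loop described above can be sketched in software as follows. This is only a toy model under stated assumptions: the "bitstream" is an 8-bit list, the DUT is a hypothetical design in which bits 0 to 3 hold a 2-input LUT and bits 4 to 7 are unused, and the golden reference is simply the same design evaluated with the unmodified bitstream.

```python
from itertools import product


def dut(bitstream, a, b):
    """Toy design under test: bits 0-3 are a 2-input LUT, bits 4-7 are unused."""
    return bitstream[(a << 1) | b]


def fault_injection_campaign(golden_bitstream, design, n_inputs=2):
    """Flip each configuration bit, one at a time, and compare against the golden run."""
    vectors = list(product((0, 1), repeat=n_inputs))
    golden_out = [design(golden_bitstream, *v) for v in vectors]
    sensitive = []
    for position in range(len(golden_bitstream)):
        faulty = list(golden_bitstream)
        faulty[position] ^= 1                      # emulate one SEU
        faulty_out = [design(faulty, *v) for v in vectors]
        if faulty_out != golden_out:               # error observed at the output
            sensitive.append(position)
    return sensitive


golden = [0, 1, 1, 0, 0, 0, 0, 0]   # XOR stored in the LUT, four unused bits
sensitive_bits = fault_injection_campaign(golden, dut)
```

In this toy case only the four LUT bits are reported as sensitive, which mirrors how a real campaign separates the configuration bits that matter for a given design from those that do not.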
It is possible to inject faults in all the configuration bits and to analyze the most critical parts of the design [67]. This can help guide designers in the early stages of development to choose the most appropriate fault tolerant design, even before any radiation ground testing. An entire fault injection campaign can take from a few hours to days, depending on the number of bits to be flipped and the connection to the fault injection control circuit. When the entire system (fault injection control + DUT + golden designs) is implemented at the hardware level (board), avoiding communication with the computer, the process is sped up by orders of magnitude.

Figure 4-1. FPGA mitigation design flow by editing the design hardware description language and the fault injection approach used to validate the design.

As discussed in [37], configuration logic blocks (CLBs), which are composed of look-up tables (LUTs) for logic generation, storage elements, multiplexers, and carry logic, together with the customizable routing, account for by far the largest number of configurable bits in each device. However, FPGA devices also contain important functional blocks that can be upset by radiation, and once this occurs the effects can be catastrophic. Consequently, the susceptibility of these functional blocks must also be analyzed and mitigation techniques must be applied.
Examples are: Digital Clock Managers (DCMs), which provide phase-locked, skew-corrected clock signals to all parts of the chip; Phase-Matched Clock Dividers (PMCDs), which offer additional frequency division options; the configuration controller circuit; the power-on reset (POR) circuitry; Input/Output Blocks (IOBs), which implement 28 common single-ended or differential (in pairs) I/O standards with digitally controlled impedance; XtremeDSP (DSP48) slices, each containing a dedicated 18x18-bit multiplier, adder, and 48-bit accumulator; and other specialized blocks. Table 4-1 presents a summary of SEE issues and possible SEU mitigation solutions presented in [37].

Table 4-1. Representative Xilinx Virtex Family Potential Types of Device SEE Sensitivity from [37]

FPGA component parts and SEE issues:
• Configuration Memory: Single and multiple bit errors corrupting circuit operation, causing bus conflicts (current creep), etc.
• Configuration Controller: Improper device configuration can occur if hit during configuration/reconfiguration
• CLB: Logic hits and propagated upsets caused by transients
• BRAM: Memory upsets in user area
• Half-latches: Sensitive structure used in configuration/routing
• POR: SEUs on POR can cause inadvertent reboot of device
• IOB: SEUs can cause false outputs to other devices or inputs to logic
• DCM: Can cause clock errors that spread across clock cycles
• DSP: Hard IP that is unhardened that can cause single event functional interrupts (SEFIs) or data errors
• MGT: Gigabit transceivers. Hits in logic can cause bursts or SEFIs. O/w bit errors in data stream
• PPC: Hard IP that is unhardened. SEFIs are prime concern
• SEL: Higher current condition that is potentially damaging

Possible SEU mitigations (in column order): Scrubbing; Partial reconfiguration; Partitioned design; Multiple chip voting (redundancy by using multiple devices); Triple modular redundancy (TMR); Acceptable error rates; TMR; Error Detection and Correction (EDAC) scrubbing; Removal of half-latches from design; Multiple chip voting (redundancy by using multiple devices); Leverage immune config. memory cell; Evaluate input SET propagation; TMR; Temporal TMR; TMR; Temporal TMR; TMR; Protocol re-writes; TMR or software task redundancy; No mitigation other than substrate addition (epi), circumvention techniques possible.

4.1 Scrubbing

It is important to note that hardware redundancy by itself is not sufficient to avoid errors in the FPGA; it is mandatory to reload the bitstream constantly to avoid the accumulation of faults. This continuous reloading of the bitstream is called scrubbing. Scrubbing, as explained in Xilinx Application Notes 138 and 151, allows a system to repair bit flips in the configuration memory without disrupting its operation. This covers the memory cells that configure the LUTs, the ones that control the routing (GRM) and the CLB customization. Configuration scrubbing prevents the build-up of multiple configuration faults and reduces the time during which an invalid circuit configuration is allowed to operate. Scrubbing does not refresh the contents of the CLB flip-flops or of the embedded memories (the Block SelectRAMs). Scrubbing is performed through the Virtex SelectMAP interface. Furthermore, systems must employ configuration scrubbing with redundancy-based mitigation techniques such as TMR before any reliability enhancement is observed. Without scrubbing, the build-up of multiple faults would eventually break the redundancy. It is recommended to scrub at least 10X faster than the worst-case SEU rate. When the FPGA is in this mode, an external oscillator generates the configuration clock that drives the FPGA and the PROM that contains the “gold” bitstream. At each clock cycle new data are available on the PROM data pins. The frequency at which scrubbing must be performed depends on the particle flux and the cross-section of the device.
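The dependence of the scrub rate on flux and cross-section can be made concrete with a back-of-the-envelope estimate. The sketch below assumes the usual first-order model in which the device upset rate is the product of particle flux, per-bit cross-section and number of configuration bits, and then applies the 10X margin recommended above. The function names and all numeric values are hypothetical placeholders, not data for any real device or environment.

```python
def upset_rate(flux, cross_section_per_bit, n_bits):
    """First-order SEU rate in upsets per second.

    flux: particles / (cm^2 * s); cross_section_per_bit: cm^2 / bit.
    """
    return flux * cross_section_per_bit * n_bits


def max_scrub_period(rate, margin=10.0):
    """Longest scrub period (s) that is still `margin` times faster than the upset rate."""
    return 1.0 / (margin * rate)


# Hypothetical placeholder values, chosen only to give round numbers.
flux = 1e-2                 # particles / (cm^2 * s)
sigma_bit = 1e-8            # cm^2 / bit
config_bits = 10_000_000    # configuration bits in the device

rate = upset_rate(flux, sigma_bit, config_bits)   # about 1e-3 upsets/s
period = max_scrub_period(rate)                   # about 100 s between scrubs
```

With these placeholder numbers one upset is expected roughly every 1000 s, so a complete scrub cycle about every 100 s would satisfy the 10X guideline.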
For system-on-chip (SoC) platforms, the Hardware Internal Configuration Access Port (HWICAP) module can also be used to reconfigure parts of the configuration matrix from inside the FPGA, controlled by the embedded processor (hard-core PowerPC or soft-core MicroBlaze). The ICAP is able to load and configure partial bitstreams without interrupting the application. It implements a subset of the SelectMAP interface and it generates no noise during reconfiguration. The ICAP module is connected to the embedded processor by the available local bus (OPB), and the EDK tool can be used for that task.

4.2 Triple Modular Redundancy

Triple Modular Redundancy (TMR) is a well-known fault-tolerant technique for avoiding errors in integrated circuits. The TMR scheme uses three identical logic blocks performing the same task in tandem, with corresponding outputs being compared through majority voters (MAJ). Since all the customizable memory cells are sensitive to soft errors, single points of failure must be avoided inside the FPGA. Consequently, all the inputs and outputs must also be triplicated. If an upset occurs in one of the IO cells or in the routing that connects the logic from/to the IO blocks, there are two other redundant inputs or outputs able to ensure the correct value. The voter is also triplicated because, if one voter fails, there are two voters able to maintain the correct value in two redundant logic parts. This full TMR, also called X-TMR, is especially suitable for protecting designs synthesized in SRAM-based Field Programmable Gate Arrays (FPGAs), as proposed by [16].

Figure 4-2 shows the full TMR that must be applied to the user design logic before it is synthesized into the Xilinx FPGA. The CLB flip-flops (user sequential logic) are triplicated with triple majority voters (MAJ) and a feedback connection that is able to restore the correct data of the flip-flops.
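The refresh behaviour of the triplicated flip-flop can be sketched in a few lines. The model below is a behavioural simplification and not the Xilinx implementation: each of the three flip-flops reloads the output of its own majority voter on every clock edge, so an upset in one flip-flop is overwritten at the next edge. The class and method names are hypothetical.

```python
def maj(a, b, c):
    """Majority of three bits."""
    return (a & b) | (a & c) | (b & c)


class TMRRegister:
    """Three flip-flops, each refreshed through its own majority voter."""

    def __init__(self, value=0):
        self.q = [value, value, value]

    def voted(self):
        # The three voters all see the same three flip-flop outputs.
        return [maj(*self.q) for _ in range(3)]

    def clock(self, d=None):
        """Clock edge: load the triple d if given, else feed back the voted value."""
        self.q = list(d) if d is not None else self.voted()

    def upset(self, index):
        """Model an SEU that flips one flip-flop."""
        self.q[index] ^= 1


reg = TMRRegister(1)
reg.upset(1)            # flip-flop contents are now [1, 0, 1]
before = reg.voted()    # the voters still agree on the correct value
reg.clock()             # the feedback path rewrites the upset flip-flop
print(before, reg.q)
```

A second upset arriving before the next clock edge in a different flip-flop would defeat the vote, which is one reason the number and placement of voters matters.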
The feedback connection is important because, as seen above, scrubbing is not able to restore the correct value of a CLB flip-flop. The majority voter defines the correct output as two out of three input values, as defined in the truth table in figure 4-2. The very last output voter, which can be placed at the output of CLB flip-flops or combinational logic blocks, is different from the MAJ voter. Note that there are three output signals that go to the IO pads. Each one is controlled by a tri-state buffer. The redundant logic part that holds an error should not pass the error to the output, so the tri-state buffer should block the faulty redundant part. The TMR output voter drives the tri-state buffer control based on one reference value (the input coming from one of the redundant logic parts) and the other two inputs coming from the other redundant logic parts, as shown in the truth table in figure 4-2. Each voter can be implemented in a LUT.

Figure 4-2. TMR implemented in FPGA. (Diagram: triplicated inputs feed the redundant logic tr0, tr1 and tr2 and the TMR flip-flops clocked by clk0, clk1 and clk2; three TMR output voters, O_voter, control the buffers 3-state_0, 3-state_1 and 3-state_2 that drive the output package pins.)

MAJ voter truth table (LUT: 00010111_00010111), columns R0 R1 R2 MAJ:
0 0 0 0
0 0 1 0
0 1 0 0
0 1 1 1
1 0 0 0
1 0 1 1
1 1 0 1
1 1 1 1

TMR output voter truth table (LUT: 00011000_00011000), columns REF, the two other redundant inputs, O_voter:
0 0 0 0
0 0 1 0
0 1 0 0
0 1 1 1
1 0 0 1
1 0 1 0
1 1 0 0
1 1 1 0

Since all the outputs are connected together outside the FPGA device, this connection works as an analog voter where the majority prevails; so, even if one output voter fails, the output can still show the correct value.
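The two LUT contents quoted in figure 4-2 can be checked by enumerating the truth tables. The sketch below assumes that each 8-character pattern lists the outputs for inputs 000 to 111 with the first input as the most significant bit, and that the pattern is repeated because a fourth LUT input is unused. The Boolean expression used for the output voter is a reading of the truth table, and the function names are hypothetical.

```python
from itertools import product


def maj(r0, r1, r2):
    """Majority voter: 1 when at least two of the three inputs are 1."""
    return (r0 & r1) | (r0 & r2) | (r1 & r2)


def output_voter(ref, other_a, other_b):
    """Tri-state control: 1 when the two other domains agree with each
    other but disagree with the reference domain (block the reference)."""
    return int(other_a == other_b and ref != other_a)


def truth_column(fn):
    """Output column of a 3-input function for inputs 000, 001, ..., 111."""
    return "".join(str(fn(*bits)) for bits in product((0, 1), repeat=3))


maj_lut = truth_column(maj)
ovoter_lut = truth_column(output_voter)
print(f"MAJ     LUT: {maj_lut}_{maj_lut}")
print(f"O_voter LUT: {ovoter_lut}_{ovoter_lut}")
```

Under these assumptions the two patterns reproduce the strings shown in the figure.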
4.3 Duplication with Comparison and Concurrent Error Detection

The TMR technique is a suitable solution for FPGAs because it provides full hardware redundancy, including the user’s combinational and sequential logic, the routing, and the I/O pads. However, this full hardware redundancy comes with penalties in area, I/O pad count and power dissipation. Many applications can accept the limitations of the TMR approach, but some cannot. Aiming to reduce the pin overhead of a full hardware redundancy implementation (TMR) while still coping with permanent upset effects, a technique based on duplication with comparison (DWC) and concurrent error detection (CED) was proposed by [34] (figure 4-3). The CED must be able to identify the fault-free redundant logic part. The CED should have a smaller area than the redundant logic block in order to present a reduced area overhead compared to the X-TMR.

Figure 4-3. Duplication with Comparison and Concurrent Error Detection technique (DWC-CED) for SRAM-based FPGAs. (Diagram: redundant logic blocks dr0 and dr1, TMR flip-flops with MAJ voters (LUT: 00010111_00010111), and CED logic able to detect which redundant logic (dr0 or dr1) is fault free.)

The CED scheme can be based on time redundancy. In this case, the computation is performed twice, with the input operands presented in two different ways, in order to detect permanent faults. During the first computation, at the first clock cycle, the operands are used directly in the combinational block and the result is stored for further comparison.
During the second computation, at the second clock cycle, the operands are modified prior to use, in such a way that errors resulting from permanent faults in the combinational logic are different in the first calculation than in the second and can be detected when the results are compared. These modifications are seen as encoder and decoder processes, and they depend on the characteristics of the logic block. The general scheme is presented in figure 4-4.

Figure 4-4. General time redundancy scheme for permanent fault detection

Figure 4-5 shows the scheme proposed for an arithmetic module, for instance a multiplier. There are two multiplier modules: mult_dr0 and mult_dr1. Multiplexers at the inputs provide either normal or shifted operands. The outputs computed from the normal operands are always stored in a sample register, one for each module. Each output goes directly to the input of the user’s TMR register: module dr0 connects to register tr0 and module dr1 connects to register tr1. Register tr2 receives the output of the module that does not have any fault. By default, the circuit starts by passing the output of module dr0. A comparator at the outputs of registers dr0 and dr1 indicates when the outputs mismatch (Hc). If Hc=0, no error is found and the circuit continues to operate normally. If Hc=1, an error is flagged and the operands need to be recomputed using the RESO (recomputing with shifted operands) method to detect which module is faulty. The detection takes one clock cycle. While the circuit performs the detection, the user’s TMR register holds its previous value. When the fault-free module is found, register tr2 receives the output of this module, and it continues to receive this output until the next chip reconfiguration (fault correction).
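The detection flow described above can be sketched behaviourally. The model below is an illustration under stated assumptions, not the circuit of [34]: the permanent fault is modelled as one stuck output bit of a multiplier, the encoder is a left shift of operand A by one bit, the decoder is a right shift of the product, and operand width limits are ignored. All function names are hypothetical.

```python
def make_multiplier(stuck_bit=None, stuck_value=0):
    """Return a multiplier; optionally one output bit is stuck (permanent fault)."""
    def mult(a, b):
        p = a * b
        if stuck_bit is not None:
            mask = 1 << stuck_bit
            p = (p | mask) if stuck_value else (p & ~mask)
        return p
    return mult


def reso_mismatch(mult, a, b, shift=1):
    """True if the normal and the shifted-operand computations disagree."""
    normal = mult(a, b)
    shifted = mult(a << shift, b) >> shift   # encode operand, decode result
    return normal != shifted


def dwc_ced(mult_dr0, mult_dr1, a, b):
    """Return (Hc, index of the module whose output is passed to tr2)."""
    hc = mult_dr0(a, b) != mult_dr1(a, b)
    if not hc:
        return hc, 0                         # default: pass dr0
    tc0 = reso_mismatch(mult_dr0, a, b)
    tc1 = reso_mismatch(mult_dr1, a, b)
    if tc0 and not tc1:
        return hc, 1                         # dr0 is faulty, pass dr1
    if tc1 and not tc0:
        return hc, 0                         # dr1 is faulty, pass dr0
    return hc, None                          # fault could not be isolated


good = make_multiplier()
bad = make_multiplier(stuck_bit=2, stuck_value=0)
print(dwc_ced(good, bad, 13, 11), dwc_ced(bad, good, 13, 11))
```

Shifting the operand moves the product by one bit position, so the stuck bit corrupts a different bit of the decoded result than of the normal result, which is what exposes the faulty module.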
Figure 4-5. Case-study CED based on encoder and decoder for arithmetic logic blocks. (Diagram: operands A and B reach modules dr0 and dr1 through encoders selected by multiplexers controlled by ST0 and ST1; decoders and comparators produce Hc, Tc0 and Tc1, from which the CED indicates the fault-free module.)

4.4 Placement and Routing Issues

The problem of using fault tolerance techniques based on redundancy and majority voters is that one must ensure that an SEU cannot affect more than one redundant domain of the design [40], [64]. If an SEU is able to affect two domains of a redundant design, the majority voter is not able to choose the correct result out of three, and errors can appear at the design output. The only way a single fault can affect more than one redundant domain is by upsetting the SRAM cells that control the routing connections. Upsets in the routing are the main concern, as 90% of the SRAM cells inside the FPGA are responsible for routing control. The main effects of an upset in the routing are open lines and short circuits between distinct lines, as discussed previously. The probability of SEUs in the routing upsetting more than one redundant domain depends on the logic placement and on the number of majority voters in the design.

Figure 4-6 shows two examples of upsets in the routing. Upset “a” connects two signals from the same redundant domain, which does not generate an error in the TMR output, because the outermost voters will vote out the upset effect. However, upset “b” may provoke an error in the TMR output, because it connects two signals from distinct redundant logic blocks, affecting two out of the three redundant domains of the TMR. In the next sections three solutions will be discussed to improve reliability in this matter.

Figure 4-6. SEU in the routing affecting two distinct redundant logic parts

4.4.1 Solutions based on Placement and Routing

Dedicated floorplanning for each redundant part of the TMR can reduce the probability of upsets in the routing affecting two or more logic modules, but it may not be sufficient, since placement can be too complex in some cases. Remember that each time voters are included there are connections between the redundant parts, which makes it impossible to place the redundant logic parts very far away from each other with no connections at all (figure 4-7). One solution is the Reliability-Oriented Place and Route algorithm (RoRA) proposed by [76, 77], a place and route algorithm for SRAM-based FPGAs that enforces particular placement and routing rules in order to harden circuits mapped on SRAM-based FPGAs against SEUs in their configuration memory cells.

Routing duplication can also be a solution to improve reliability in TMR. In [33], a method is proposed to duplicate the routing locally inside the CLB to avoid problems with open and short circuits provoked by SEUs in the routing.

Figure 4-7. Majority voter placement in the TMR approach

4.4.2 Solutions based on Voting Adjustments

The first voting adjustment was proposed by [35]. It consists of partitioning the logic in order to add more voter stages to the circuit. If the redundant logic parts tr0, tr1 and tr2 (represented in figure 4-6 after the TMR register with voters and refresh) are partitioned into smaller logic blocks with voters, a connection between signals from distinct redundant parts can be voted by different voters. This logic partition by voters is represented in figure 4-8. Notice that now the upset “b” cannot provoke an error in the TMR output, which increases the robustness of the TMR in the presence of routing upsets without having to rely on floorplanning. The problem is to find the size of the logic partition that achieves the best robustness.
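The effect of an extra voter stage on upset “b” can be illustrated with a toy model. The sketch below assumes a two-stage datapath in which each stage is an inverter, and it models the routing short as an inversion of one stage-1 signal in domain 0 and of one stage-2 signal in domain 1. These modelling choices and all names are hypothetical simplifications; a real routing short behaves in a more complex way.

```python
def maj(a, b, c):
    """Majority of three bits."""
    return (a & b) | (a & c) | (b & c)


def run_tmr(x, faults, partitioned):
    """Evaluate a two-stage TMR datapath for a 1-bit input x.

    faults      -- set of (stage, domain) pairs whose stage output is inverted
    partitioned -- if True, a voter stage is inserted between stage 1 and stage 2
    """
    stage1 = [(x ^ 1) ^ int((1, d) in faults) for d in range(3)]
    if partitioned:
        voted = maj(*stage1)
        stage1 = [voted, voted, voted]
    stage2 = [(stage1[d] ^ 1) ^ int((2, d) in faults) for d in range(3)]
    return maj(*stage2)


upset_b = {(1, 0), (2, 1)}   # touches two distinct redundant domains
for x in (0, 1):
    golden = run_tmr(x, set(), partitioned=False)
    plain = run_tmr(x, upset_b, partitioned=False)
    split = run_tmr(x, upset_b, partitioned=True)
    print(x, golden, plain, split)
```

In this model the unpartitioned design sees two corrupted domains at the final voter, while the partitioned design sees only one corrupted signal at each voter stage.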
If the logic is partitioned into very small blocks, the number of voters increases dramatically, causing an overly costly TMR implementation. The objective is to find the best partition in terms of area cost, performance and robustness. The results presented by [35] suggest that there is a trade-off between the logic partition of the throughput logic (and consequently the number of voters) and the number of routing upsets that could provoke an error in the TMR. Contrary to what was expected, a larger number of voters does not always mean larger protection against upsets. There is an optimal logic partition for each circuit that can reduce the propagation of the upset effect in the routing.

Figure 4-8. Triple Modular Redundancy (TMR) scheme with logic partition in the FPGA

4.5 Partial Triple Modular Redundancy

A partial TMR mitigation strategy was proposed in [52, 61]. It is based on the idea that some parts of the circuit are more critical than others and that not all logic blocks need to be protected by TMR. Some of the non-critical blocks can be corrected by scrubbing alone from time to time. The idea is that sensitive configuration bits can be separated into two categories, called “persistent” and “non-persistent” [52, 61], shown in figure 4-9. A non-persistent configuration bit is a sensitive configuration bit that causes a functional error in the design; however, once the bit is repaired through configuration scrubbing, the design returns to normal operation and eventually all previously induced functional errors disappear. No additional intervention is required to return the circuit to normal functionality. A persistent configuration bit is a sensitive configuration bit that also causes a functional error; however, after the upset configuration bit is repaired through configuration scrubbing, the FPGA circuit does not return to normal operation.
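The two categories defined above can be contrasted with a small simulation. The sketch below assumes a hypothetical fault effect: while the configuration bit is upset, an increment block adds 2 instead of 1. The same block is used once as feed-forward logic and once inside a counter (a feedback structure). The function and parameter names are illustrative only.

```python
def simulate(cycles, upset_at, scrub_at):
    """Return per-cycle error flags (feed_forward, feedback).

    The configuration is corrupted for upset_at <= t < scrub_at and is
    repaired by scrubbing from cycle scrub_at onwards.
    """
    golden_count = 0
    faulty_count = 0
    ff_errors, fb_errors = [], []
    for t in range(cycles):
        config_ok = not (upset_at <= t < scrub_at)
        step = 1 if config_ok else 2
        # Feed-forward use: the output depends only on the current input t.
        ff_errors.append((t + step) != (t + 1))
        # Feedback use: the counter state accumulates the faulty increments.
        golden_count += 1
        faulty_count += step
        fb_errors.append(faulty_count != golden_count)
    return ff_errors, fb_errors


ff, fb = simulate(cycles=8, upset_at=2, scrub_at=4)
print("feed-forward errors:", ff)
print("feedback errors:    ", fb)
```

In this model the feed-forward output is correct again as soon as the configuration is repaired, whereas the counter stays wrong after scrubbing because the erroneous state is kept in its register.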
Persistent errors arise because the upset leaves erroneous values stored in flip-flops inside feedback loops, which cannot be corrected by scrubbing. In this case, a global reset is needed to return the circuit to a proper state and to normal operation. This global reset takes the circuit offline for the time needed to reset the circuit and start up in normal operating mode.

In [52, 61], it is proposed that the feedback structures of the design should be mitigated first because they are more critical. Any logic feeding into the feedback structures should follow, since it contributes to the state of the design and thus to the persistence. The feed-forward logic (the non-persistent circuit components) does not contribute to the persistence of a design and should be mitigated last. The extent of the mitigation depends on the application and on the expected mean time between failures (MTBF).

Figure 4-9. Example of non-persistent and persistent upset defined in [61].

5. Final Remarks

This manuscript has explored fault-tolerant techniques to protect integrated circuits against soft errors. A set of hardening-by-design solutions for application specific integrated circuits (ASICs) and for field programmable gate arrays (FPGAs) was presented and discussed. The main challenge for ASICs is to have techniques able to work with the new paradigm for nanometer technologies: the occurrence of transient pulses with duration longer than the cycle time of the circuit, which may affect one or more bits of the circuit output, and the occurrence of multiple faults, which together make most of the currently known mitigation techniques obsolete. For memory cells, traditional solutions such as ECCs and hardened memory cells can still be applied, but taking into account the probability of multiple upsets.
The main challenge for FPGAs is to characterize the sensitivity of the user design to soft errors once the design is implemented in the SRAM-based FPGA, and to define the most efficient redundancy that can be applied within a limited area. The effect of soft errors in a user logic design synthesized in an SRAM-based FPGA was analyzed in detail. Triple Modular Redundancy (TMR) and Duplication with Comparison and Concurrent Error Detection (DWC-CED) techniques were presented. The issues concerning the placement and routing of redundant blocks inside the FPGA were also discussed, and some solutions were proposed.

In summary, there is no hardening-by-design solution that is totally efficient for all types of circuits, applications and environments. It is important to characterize very well the sensitivity to soft errors of the target design and application, and then to choose a set of fault-tolerant solutions that will work properly in that design. The ideal solution for a reliable system may combine solutions applied at different steps of the design process: layout constraints, transistor-level redundancy, logic-level solutions, recomputation and system-level approaches.

References

[1] Alexandrescu, D., Anghel, L., Nicolaidis, M., “New methods for evaluating the impact of single event transients in VDSM ICs”, in IEEE International Symposium on Defect And Fault Tolerance in VLSI Systems, DFT, 17., 2002. p. 99-107.
[2] Anghel, A., Nicolaidis, M., “Cost Reduction and Evaluation of a Temporary Faults Detecting Technique”, in Proc. DATE, IEEE Computer Society, 2000, p. 591-598.
[3] Anghel, L., Alexandrescu, D., Nicolaidis, M., “Evaluation of a soft error tolerance technique based on time and/or space redundancy”, in the Proceedings of Symposium on Integrated Circuits and Systems Design, SBCCI, 13., 2000. Proceedings… Los Alamitos : IEEE Computer Society, 2000. p. 237-242.
[4] Asadi, G., Tahoori, M., “An Accurate SER Estimation Method Based on Propagation Probability Design”, in Proceedings of Automation and Test in Europe Conference, DATE, 2005.
[5] Athan, S., Landis, D., Al-Arian, S., “A Novel Built-in Current Sensor for IDDQ Testing of Deep Submicron CMOS ICs”, in Proceedings of 14th VLSI Test Symposium, 1996. pp 118-123.
[6] Austin, T. “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design”, in MICRO32 - Proceedings of the 32nd ACM/IEEE International Symposium on Microarchitecture, pages 196-207, Los Alamitos, CA, November, 1999.
[7] Barth, J., “Applying Computer Simulation Tools to Radiation Effects Problems”, in: IEEE Nuclear Space Radiation Effects Conference Short Course, NSREC, 1997.
[8] Baumann, R., “The impact of technology scaling on soft error rate performance and limits to the efficacy of error correction”, Electron Devices Meeting, 2002. IEDM '02. Digest. International, Dec., 2002, p. 329-332.
[9] Baumann, R.; Smith, E., “Neutron-induced boron fission as a major source of soft errors in deep submicron SRAM devices”, in: Proceedings of IEEE International Reliability Physics Symposium, 38., IEEE Computer Society, 2000.
[10] Berg, M., “Fault Tolerance Implementation within SRAM Based FPGA Design Based upon the Increased Level of Single Event Upset Susceptibility”, in IEEE International On-line Test Symposium, IOLTS, 2006. pp. 89-91.
[11] Berg, M., Wang, J.J., Ladbury, R., Buchner, S., Kim, H., Howard, J., LaBel, K., Phan, A., Irwin, T., Friendlich, M., “An Analysis of Single Event Upset Dependencies on High Frequency and Architectural Implementations within Actel RTAX-S Family Field Programmable Gate Arrays”, IEEE Transactions On Nuclear Science, VOL. 53, NO. 6, Dec., 2006. p. 3569-3574.
[12] Bessot, D.; Velazco, R., “Design of SEU-hardened CMOS memory cells: the HIT Cell”, in European Conference on Radiation and Its Effects on Components and Systems, RADECS, 2., 1993. p. 563-570.
[13] Buchner, S.; Campbell, A.; Meehan, T.; Clark, K.; Mcmorrow, D.; Dyer, C.; Sanderson, C.; Comber, C.; Kuboyama, S., “Investigation of Single-Ion Multiple-Bit Upsets in Memories on Board a Space Experiment”, IEEE Transactions on Nuclear Science, Vol. 47, Issue 3, pp. 705-711, June 2000.
[14] Calin, T., Nicolaidis, M., Velazco, R., “Upset hardened memory design for submicron CMOS technology”, IEEE Transactions on Nuclear Science, New York, v.43, n.6, p. 2874-2878, Dec. 1996.
[15] Canaris, J.; Whitaker, S., “Circuit techniques for the radiation environment of space”, in the Proceedings of Custom Integrated Circuits Conference, 1995. p. 77-80.
[16] Carmichael, C., Triple Module Redundancy Design Techniques for Virtex® Series FPGA: Application Notes 197. San Jose, USA: Xilinx, 2000.
[17] Cazeaux, J., Rossi, D., Omaña, M., Metra, C., Chatterjee, A., “On Transistor Level Gate Sizing for Increased Robustness to Transient Faults”, in Proceedings of International On-line Test Symposium, IOLTS, 2005.
[18] Chen, C., Hsiao, M., “Error-Correcting Codes for Semiconductor Memory Applications: A State-of-the-Art Review,” IBM J. Res. Develop., Vol. 28, pp. 124-134, Mar. 1984.
[19] Crain, S. et al., “Analog and digital single-event effects experiments in space”, IEEE Transactions on Nuclear Science, New York, v.48, n.6, Dec. 2001.
[20] Dhillon, Y., Diril, A., Chatterjee, A., Singh, A., “Analysis and Optimization of Nanometer CMOS Circuits for Soft-Error Tolerance”, IEEE Transactions On Very Large Scale Integration (VLSI) Systems, VOL. 14, NO. 5, May 2006. pp. 514-524.
[21] Dodd, P. E., et al., “Production and propagation of Single-Event Transients in High-Speed Digital Logic ICs”, IEEE Transactions on Nuclear Science, Vol 51, No 6, Part 2, IEEE Computer Society, Los Alamitos, CA, December 2004, pp 3278-3284.
[22] Dodd, P. E., Massengill, L. W. “Basic Mechanism and Modeling of Single-Event Upset in Digital Microelectronics”, IEEE Transaction on Nuclear Science, vol. 50, June, 2003, pp. 583-602.
[23] Dodd, P., “Physics-Based Simulation of Single-Event Effects”, IEEE Transactions On Device And Materials Reliability, VOL. 5, NO. 3, Sept. 2005. pp.343-457.
[24] Dupont, E.; Nicolaidis, M.; Rohr, P., “Embedded robustness IPs for transient-error-free ICs”. IEEE Design & Test of Computers, New York, v.19, n.3, May-June 2002, p. 54-68.
[25] Ferlet-Cavrois, V. et al., “Statistical Analysis of the Charge Collected in SOI and Bulk Devices Under Heavy Ion and Proton Irradiation - Implications for Digital SETs”, IEEE Transactions on Nuclear Science, Vol 53, No 6, Part 1, IEEE Computer Society, Los Alamitos, CA, December 2006, pp 3242-3252.
[26] Gadlage, M. J., Schrimpf, R. D., Benedetto, J. M., Eaton, P. H., Mavis, D. G., Sibley, M., Avery, K., and Turflinger, T. L., “Single Event Transient Pulsewidths in Digital Microcircuits”, IEEE Transactions on Nuclear Sciences, Vol. 51, No 6, Part 2, IEEE Computer Society, Los Alamitos, CA, December 2004, pp. 3285-3290.
[27] Galke, C., Pflanz, M., Vierhaus, H., “On-line Detection and Compensation of Transient Errors in Processor Pipeline-Structures”, in Proceedings of the International On-line Test Symposium, IOLTS, 2002.
[28] George, J., Koga, R., Swift, G., Allen, G., Carmichael, C., Tseng, C., “Single Event Upsets in Xilinx Virtex-4 FPGA Devices”, IEEE Radiation Effects Data Workshop, 2006, p.109 – 114.
[29] Henes, E., Vieira, M., Ribeiro, I., Wirth, G., Kastensmidt, F. L., “Using Bulk Built-in Current Sensors in Combinational and Sequential Logic to Detect Soft Errors”, IEEE Micro, IEEE Computer Society, Sept.-Oct., p. 10-18, 2006.
[30] Hertwig, A., Hellebrand, S., Wunderlich, H., “Fast Self-Recovering Controllers”, in Proceedings of 16th IEEE VLSI Test Symposium, 1998.
[31] Houghton, A. D. “The Engineer’s Error Coding Handbook”. London: Chapman & Hall, 1997.
[32] Johnston, A., “Scaling and Technology Issues for Soft Error Rates”, in Proceedings of 4th Annual Research Conference on Reliability, Stanford University, October 2000.
[33] Kastensmidt, F. L., Kinzel Filho, C., Carro, L., “Improving Reliability of SRAM Based FPGAs by Inserting Redundant Routing”, IEEE Transactions on Nuclear Science, New York, v. 53, n. 4, 2006. p. 2060-2068.
[34] Kastensmidt, F., Neuberger, G., Carro, L., Reis, R., Hentschke, R., “Designing Fault-Tolerant Techniques for SRAM-based FPGAs”, IEEE: Design and Test of Computers (D&T), v.21, n.6, Dec., 2004.
[35] Kastensmidt, F., Sterpone, L., Carro, L., Sonza Reorda, M., “On the Optimal Design of Triple Modular Redundancy Logic for SRAM-based FPGAs”, in the Proceedings of Design Automation and Test in Europe (DATE), IEEE, 2005.
[36] Label, K. et al., “A roadmap for NASA's radiation effects research in emerging microelectronics and photonics”, in Proceedings of IEEE Aerospace Conference, 2000, IEEE Computer Society, 2000. p. 535-545.
[37] LaBel, K., Berg, M., Black, D., Robinson, W., Jordan, A., “Trade Space Involved with Single Event Upset (SEU) and Transient (SET) Handling of Field Programmable Gate Array (FPGA) Based Systems”, 2006 Workshop on Hardened Electronics and Radiation Technology, HEART, 2006.
[38] Lacoe, R., “CMOS Scaling Design Principles and Hardening-by-Design Methodologies,” IEEE NSREC Short Course, 2003.
[39] Leray, J., “Earth and Space Single-Events in Present and Future Electronics”, in European Conference on Radiation and Its Effects on Components and Systems, RADECS, 6., 2001. Short Course. IEEE Computer Society, 2001.
[40] Lima, F., Carmichael, C., Fabula, J., Padovani, R., Reis, R., “A fault injection analysis of Virtex® FPGA TMR design methodology”, in European Conference on Radiation and Its Effects on Components and Systems, RADECS, 2001. pp. 275-282.
[41] Lima, F., Cota, E., Carro, L., Lubaszewski, M., Reis, R., Velazco, R., Rezgui, S., “Designing a radiation hardened 8051-like micro-controller”, in Proceedings of IEEE Symposium on Integrated Circuits and Systems Design, SBCCI, 13., 2000. pp. 255-260.
[42] Lisboa, C. A., Erigson, M. I., Carro, L., “System Level Approaches for Mitigation of Long Duration Transient Faults in Future Technologies”, in Proceedings of the 12th IEEE European Test Symposium, ETS, 2007.
[43] Lisbôa, C. A., Schüler, E., Carro, L., “Going Beyond TMR for Protection Against Multiple Faults”, in Proceedings of the 18th Symposium on Integrated Circuits and Systems Design, SBCCI, 2005, pp. 80-85.
[44] Liu, M.N., Whitaker, S., “Low power SEU immune CMOS memory circuits”, IEEE Transactions on Nuclear Science, New York, v.39, n.6, p. 1679-1684, Dec. 1992.
[45] Messenger, G. C., “Collection of Charge on Junction Nodes from Ion Tracks”, IEEE Transactions on Nuclear Sciences, vol. NS-29, pp. 2024-2031, 1982.
[46] Michels, A., Petroli, L., Lisboa, C. L., Kastensmidt, F. L., Carro, L., “SET Fault Tolerant Combinational Circuits Based on Majority Logic”, in Proceedings of IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, DFT, 2006.
[47] Mikkola, E., Vermeire, B., Barnaby, H. J., Parks, H. G., and Borhani, K., “SET Tolerant CMOS Comparator”, IEEE Transactions on Nuclear Science, vol. 51, no. 6, pp 3609-3614, IEEE Computer Society, New York-London, December, 2004.
[48] Mitra, S., Zhang, M., Waqas, S., Seifert, N., Gill, B., Kim, K., “Combinational Logic Soft Error Correction”, in Proceedings of International Test Conference, ITC, 2006.
[49] Mitra, S., Mccluskey, E., “Which Concurrent Error Detection Scheme To Choose?”, in: IEEE International Test Conference, ITC, 2002.
[50] Mohanram, K., “Soft Error Failure Rate Estimation in Combinational Logic Circuits”, Proceedings of the 6th Latin-American Test Workshop, Salvador, Brazil, 2005. pp.181-186.
[51] Morgan, K., Caffrey, M., Graham, P., Johnson, E., Pratt, B., Wirthlin, M., “SEU-induced persistent error propagation in FPGAs”, IEEE Transactions on Nuclear Science, December 2005.
[52] Neuberger, G., Kastensmidt, F. L., Carro, L., Reis, R. “A Multiple Bit Upset Tolerant SRAM Memory”, Transactions on Design Automation of Electronic Systems, TODAES, v.8, 2003. pp.577-590.
[53] Neuberger, G., Kastensmidt, F. L., Reis, R., “Designing an Automatic Technique for Optimization of Reed-Solomon Codes to Improve Fault-tolerance in Memories”, IEEE Design and Test of Computers, USA, v. 22, n. 1, 2005. p.50-58.
[54] Neves, C., Henes Neto, E. C., Ribeiro, I., Wirth, G., Kastensmidt, F. L., Guntzel, J., “Automatic Evaluation of Single Event Transient Propagation in CMOS Logic Circuits Based on Topological Timing Analysis”, in Proceedings of Latin-American Test Workshop, LATW, 2006. p. 49-54.
[55] Nicolaidis, M., "Time redundancy based soft-error tolerance to rescue nanometer technologies", in Proc. VLSI Test Symposium, IEEE Computer Society, 1999, pp. 86-94.
[56] Nieuwland, A., Jasarevic, S., Jerin, G., Combinational Logic Soft Error Analysis and Protection. In Proceedings of IEEE International On-line Testing Symposium, IOLTS 2006.
[57] Normand, E., “Correlation of in-flight neutron dosimeter and SEU measurements with atmospheric neutron model”, IEEE Transactions on Nuclear Science, New York, v.48, n.6, p. 1996-2003, Dec. 2001.
[58] Normand, E., “Single event upset at ground level”, IEEE Transactions on Nuclear Science, New York, v.43, n.6, p. 2742-2750, Dec. 1996.
[59] O'bryan, M. et al., “Compendium of Single Event Effects Results for Candidate Spacecraft Electronics for NASA”, in IEEE Nuclear and Space Radiation Effects Conference, NSREC, 2006.
[60] Omana, M., Papasso, G., Rossi, D., Metra, C., “A Model for Transient Fault Propagation in Combinatorial Logic”, in Proceedings of the 9th IEEE International On-Line Testing Symposium, IOLTS, 2003.
[61] Pratt, B., Caffrey, M., Graham, P., Morgan, K., Wirthlin, M., “Improving FPGA Design Robustness with Partial TMR”, 44th Annual IEEE International Reliability Physics Symposium Proceedings, 2006. p. 226 – 232.
[62] Quinn, H., Graham, P., "Terrestrial-Based Radiation Upsets: A Cautionary Tale," IEEE Symposium on Field-Programmable Custom Computing Machines, 2005.
[63] Quinn, H.; Graham, P.; Krone, J.; Caffrey, M.; Rezgui, S., “Radiation-induced multi-bit upsets in SRAM-based FPGAs”, in IEEE Transactions on Nuclear Science, Vol. 52, Issue 6, Dec. 2005. pp. 2455 – 2461.
[64] Rebaudengo, M., Sonza Reorda, M., Violante, M., “Simulation-based Analysis of SEU effects of SRAM-based FPGAs”, in the Proceeding of Field Programmable Logic, FPL, 2002. Los Alamitos : IEEE Computer Society, 2002. p. 607-615.
[65] Reed, R. A., Carts, M. A., Marshall, P. W., Musseau, O., Mcnulty, P. J., Roth, D. R., Buchner, S., Melinger, J., Corbiere, T., “Heavy Ion and Proton Induced Single Event Multiple Upsets”, IEEE Transactions on Nuclear Science, Vol. 44, Issue 6, pp. 2224-2229, December 1997.
[66] Rejimon T., Bhanja, S., “A Timing-Aware Probabilistic Model for Single-Event-Upset Analysis”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, VOL. 14, NO. 10, Oct. 2006, pp.1130-1139.
[67] Reorda, S., Sterpone, L., Violante, M., “Efficient estimation of SEU effects in SRAM-based FPGAs”, 11th IEEE International On-Line Testing Symposium, 2005. p. 54 – 59.
[68] Rockett, L. R., “A design based on proven concepts of an SEU-immune CMOS configurable data cell for reprogrammable FPGAs”, Microelectronics Journal, Elsevier, v.32, p. 99-111, 2000.
[69] Rockett, L. R., “An SEU-hardened CMOS data latch design”, IEEE Transactions on Nuclear Science, New York, v.35, n.6, p. 1682-1687, Dec. 1988.
[70] Rossi, D., Omaña, M., Toma, F. and Metra, C., “Multiple Transient Faults in Logic: An Issue for Next Generation ICs ?”, in Proceedings of the 20th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT 2005), IEEE Computer Society, Los Alamitos, CA, October 2005, pp. 352-360.
[71] Schüler, E., Carro, L., “Reliable Circuits Design Using Analog Components”, in Proceedings of the 11th Annual IEEE International Mixed-Signals Testing Workshop – IMSTW 2005, Volume 1, IEEE Computer Society, Cannes, June 27-29, 2005, pp 166-170.
[72] Semiconductor Industry Association. International Technology Roadmap for Semiconductors – ITRS 2005, last access May 25, 2006. http://www.itrs.net/Common/2005ITRS/Home2005.htm.
[73] Shirvani, P., Saxena, N., Mccluskey, E., “Software Implemented EDAC Protection Against SEUs”, Center for Reliable Computing, May 2001.
[74] Shivakumar, P. et al. “Modelling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic”. In: International Conference on Dependable Systems and Networks. 2002.
[75] Srinivasan, G. R., “Modeling the Cosmic-Ray-Induced Soft-Error Rate in Integrated Circuits: An Overview”. IBM Journal of Research and Development, Vol. 40, No. 1, 1996, pp. 77-90.
[76] Sterpone, L., Reorda, M.S., Violante, M., “RoRA: a reliability-oriented place and route algorithm for SRAM-based FPGAs”, Research in Microelectronics and Electronics, 2005, Volume 1, 2005. p.173 – 176.
[77] Sterpone, L., Violante, M., “A design flow for protecting FPGA-based system against single event upsets”, in Proceedings of the 20th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT’05), October 3-5, 2005, pp. 436-444.
[78] Swift, G. M., Rezgui, S., George, J., Carmichael, C., “Dynamic Testing of Xilinx Virtex-II Field Programmable Gate Array (FPGA) Input/Output Blocks (IOBs)”, IEEE Transactions on Nuclear Science, Vol. 51, No. 6, 2004.
[79] Velazco, R. et al., “Two CMOS memory cells suitable for the design of SEU-tolerant VLSI circuits”, IEEE Transactions on Nuclear Science, New York, v.41, n.6, p. 2229-2234, Dec. 1994.
[80] “Virtex-II Static Characterization”, Xilinx Single Event Effects Consortium, 2004, http://parts.jpl.nasa.gov/docs/swift/virtex2_0104.pdf
[81] Wang, J. J., RTAXS Single Event Effects Test Rep., Aug. 2004 [available on-line at http://www.actel.com/documents/RTAXS_SEE_Report.pdf]
[82] Wang, N., Patel, S., “ReStore: Symptom Based Soft Error Detection in Microprocessors”, in Proceedings of the International Conference on Dependable Systems and Networks, DSN, 2005.
[83] Weaver, H., et al., “An SEU Tolerant Memory Cell Derived from Fundamental Studies of SEU Mechanisms in SRAM”, IEEE Transactions on Nuclear Science, New York, v.34, n.6, Dec. 1987.
[84] Whitaker, S., Canaris, J., Liu, K., “SEU hardened memory cells for a CCSDS Reed-Solomon encoder”, IEEE Transactions on Nuclear Science, New York, v.38, n.6, p. 1471-1477, Dec. 1991.
[85] Wirth, G., Vieira, M., Henes, E., Kastensmidt, F. L., “Modeling the sensitivity of CMOS circuits to radiation induced single event transients”, Microelectronics Reliability, Elsevier, 2007.
[86] Xilinx Inc., Virtex® Series Datasheets and Application Notes, www.xilinx.com, 2006.
[87] Zhang, B., Wang, W., Orshansky, M., “FASER: Fast Analysis of Soft Error Susceptibility for Cell-Based Designs”, Workshop on System Effects of Logic Soft Errors, SELSE, 2005.
[88] Zhang, M., Shanbhag, N., “Soft-Error-Rate-Analysis (SERA) Methodology”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 25, No. 10, Oct. 2006, pp. 2140-2155.
[89] Zhou, Q. et al., “Transistor Sizing for Radiation Hardening”, in Proceedings of IRPS, 2004, pp. 310-315.
