2007 IEEE NSREC
Short Course
Section II
SEE Mitigation Strategies for
Digital Circuit Design Applicable
to
ASIC and FPGAs
Fernanda Lima Kastensmidt
Universidade Federal do Rio Grande do Sul (UFRGS)
NSREC’07 Short course
Fernanda Lima Kastensmidt
1
SEE Mitigation Strategies for Digital Circuit Design
Applicable to ASIC and FPGAs
Fernanda Lima Kastensmidt
Computer Science Department - PPGC
Universidade Federal do Rio Grande do Sul (UFRGS)
Porto Alegre – RS – Brazil
[email protected]
Table of Contents
1. Radiation Effects on Digital ICs
1.1 Charge Collection Mechanism in MOS devices
1.2 Single Event Effects in Digital ICs
2. Radiation Hardening by Design: Strategies for ASICs
2.1 Layout- and Electrical-level based techniques
2.1.1 Bulk Built-in Current Sensors
2.1.2 Transistor Resizing for Charge Dissipation
2.2 Logic-level based techniques
2.2.1 Hardware redundancy techniques
2.2.2 Time redundancy techniques
2.2.3 Mixed Hardware and Time Redundancy Techniques
2.2.3 Hardened Memory Cells
2.2.4 Error Correcting Code (ECC)
2.3 Architectural level based techniques
2.4 Area and Performance Tradeoffs Summary
3. Radiation Effects on FPGAs
3.1 Antifuse-based FPGAs
3.2 SRAM-based FPGAs
4. Radiation Hardening by Design: Strategies for SRAM-based FPGAs
4.1 Scrubbing
4.2 Triple Modular Redundancy
4.3 Duplication with Comparison with Concurrent Error Detection
4.4 Placement and Routing Issues
4.4.1 Solutions based on Placement and Routing
4.4.2 Solutions based on Voting Adjustments
4.5 Partial Triple Modular Redundancy
5. Final Remarks
References
1. Radiation Effects on Digital ICs
Fault tolerance is defined as a set of techniques to provide a service capable of
fulfilling the system function in spite of (a limited number of) faults. Fault tolerance in
semiconductor devices has been a concern since upsets were first experienced in space
applications several years ago. Since then, interest in fault-tolerant
techniques that keep integrated circuits (ICs) operational in such hostile
environments has increased, driven by the many applications of radiation-tolerant
circuits, such as space missions, satellites, high-energy physics experiments and others.
Spacecraft systems include a large variety of analog and digital components that are
potentially sensitive to radiation, and therefore fault-tolerant techniques must be used to
ensure reliability.
In addition, because of the continuous evolution of semiconductor fabrication
technology in terms of transistor geometry shrinking, power
supply, speed, and logic density, as presented in the International Technology Roadmap for
Semiconductors (ITRS) [72], fault tolerance is becoming a matter of concern for
circuits operating at ground level as well. As stated in [32], [36], [59], [24], [21], [26] and
[25], drastic device shrinking, power supply reduction, and increasing operating speeds
significantly reduce the noise margins and thus increase the threats that very deep
submicron (VDSM) ICs face from the various internal sources of noise. This process is
now approaching a point where it will be unfeasible to produce ICs that are free from
these effects. Consequently, fault tolerance is no longer a concern exclusively for space
designers but also for designers of next-generation products, which must cope with errors
at ground level due to the advanced technology.
1.1 Charge Collection Mechanism in MOS devices
The radiation environment is composed of various particles generated by sun
activity, as presented by [7]. The particles can be classified as two major types: (1)
energetic particles such as electrons, protons and heavy ions, and (2) electromagnetic
radiation (photons), which can be x-ray, gamma ray, or ultraviolet light. The main
sources of energetic particles that contribute to radiation effects are protons and electrons
trapped in the Van Allen belts, heavy ions trapped in the magnetosphere, galactic cosmic
rays and solar flares. The charged particles interact with the silicon atoms causing
excitation and ionization of atomic electrons.
At ground level, neutrons are the most frequent cause of upsets, as shown by
[57, 58]. Neutrons are created by cosmic ion interactions with the oxygen and nitrogen in
the upper atmosphere. The neutron flux is strongly dependent on key parameters such as
altitude, latitude and longitude. There are high-energy neutrons, which interact with the
material generating free electron-hole pairs, and low-energy neutrons, which
interact with a certain type of boron present in semiconductor material, creating other
particles, as shown by [9]. Alpha particles are secondary particles emitted from
interactions with radioactive impurities present in the device itself or in the packaging
materials, and they are the greatest concern. Materials are selected to minimize the emission of
alpha particles; however, this does not eliminate the problem completely.
As an energetic particle traverses the material of interest, it deposits energy along
its path, as shown in figure 1-1. This energy is measured as a linear energy transfer
(LET), which is defined as the amount of energy deposited per unit of distance traveled,
normalized to the material's density. It is usually expressed in MeV-cm2/mg. The ionized
track contains equal numbers of electrons and holes. The total number of charges is
proportional to the LET of the incoming particle.
Figure 1-1. Silicon substrate ionization due to an energetic particle hit
The sensitive sites are the surroundings of the reverse-biased drain junctions of a
transistor biased in the off state, as explained by [22]. If an energetic particle passes
through the pn-junction of a CMOS transistor in the off state, a short is momentarily
created between the substrate and the struck drain terminal. The collected charge
produces a transient current pulse that lasts until the deposited charge
disappears by recombination or is conducted away via open current paths to VDD or
ground, returning the logic node to its original state.
Figure 1-2 shows charge collection occurring at the drain junction of the p-channel
transistor. Originally the node held the value ‘0’. As current flows through the
pn-junction of the struck transistor, between the bulk connected to VDD and the drain, the
transistor in the on state (n-channel transistor in figure 1-2) conducts a current that
attempts to balance the current induced by the particle strike. If the collected charge
induced by the particle strike is high enough that the on-transistor cannot balance the
current before the node capacitance is charged, a voltage change at the node will occur.
This voltage change lasts until the charge is conducted away by the current fed through
the on-transistor.
[Figure 1-2: inverter schematic with the struck transistor off and the complementary transistor on; labels: transient current, transient voltage pulse, Vout]
Figure 1-2. Charge Collection Mechanism in inverter gate
The collected charge (Qc) depends on the energy and ion type,
as well as the path length over which the charge is collected. It correlates with the
linear energy transfer (LET) value of the energetic particle, as shown:
$$Q_c = \frac{L_{th}\, T\, d\, e}{X} \qquad (1)$$
where Qc is the collected charge, Lth is the threshold effective LET (in MeV-cm2/mg),
T is the device thickness (in microns), d is the material density (2.32
g/cm3 for Si), e is the electronic charge (1.602 x 10^-7 pC), and X is the energy needed to
create one electron-hole pair (3.6 eV in Si). Replacing d, e and X in equation (1)
gives:
$$Q_c = 1.03 \times 10^{-2}\, (L_{th}\, T)\ \text{pC for Si} \qquad (2)$$
Consider T of about 1 µm as a reasonable order of magnitude for conventional logic
circuits and LET from 5 to 40 MeV-cm2/mg. The charge values obtained from
equation (2) then range from about 50 fC to 410 fC. These numbers agree with those published by [23]: in
silicon, an LET of 97 MeV-cm2/mg corresponds to a charge deposition of 1 pC/µm.
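As a quick numerical check of equations (1) and (2), the following minimal sketch recomputes the collected charge from the constants quoted in the text. The function and constant names are illustrative choices, not part of the original material.

```python
# Collected charge Qc from equation (1), using the constants quoted in the text.
E_CHARGE_PC = 1.602e-7       # electronic charge, in pC
SI_DENSITY_MG_CM3 = 2.32e3   # 2.32 g/cm^3 for Si, expressed in mg/cm^3
PAIR_ENERGY_EV = 3.6         # energy to create one electron-hole pair in Si, in eV


def collected_charge_pc(let_mev_cm2_mg, thickness_um):
    """Return Qc in pC for an LET in MeV-cm2/mg and a thickness in microns."""
    thickness_cm = thickness_um * 1e-4
    # LET * density * path length gives the deposited energy in MeV; convert to eV.
    deposited_ev = let_mev_cm2_mg * SI_DENSITY_MG_CM3 * thickness_cm * 1e6
    pairs = deposited_ev / PAIR_ENERGY_EV
    return pairs * E_CHARGE_PC


if __name__ == "__main__":
    for let in (5, 40, 97):
        qc = collected_charge_pc(let, 1.0)
        print(f"LET = {let} MeV-cm2/mg, T = 1 um -> Qc = {qc:.3f} pC")
```

With T = 1 µm, an LET of 1 MeV-cm2/mg yields about 1.03 x 10^-2 pC, which is the coefficient of equation (2), and an LET of 97 MeV-cm2/mg yields about 1 pC.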
At the electrical SPICE level, the charge deposition mechanism can be modeled
by a double exponential current pulse at the particle strike site, as presented by [45]:
$$I_P(t) = I_0 \left( e^{-t/\tau_\alpha} - e^{-t/\tau_\beta} \right) \qquad (3)$$
where I0 is approximately the maximum charge collection current, τα is the
collection time constant of the junction and τβ is the time constant for initially
establishing the ion track. In the circuit simulations and modeling, τβ is assumed to be
much smaller than τα, while τα is used as a variable parameter, as shown by [75] and by
[22]. SPICE transient analysis is performed injecting a double exponential current pulse
as given by (3), with the values of I0 and τα being used as the variable parameters to
determine the minimum charge QC corresponding to a given τα. The double exponential
model as given by (3) is proven to be adequate to study the soft error mechanism at the
circuit simulation level [45].
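The double exponential model in (3) can be sketched numerically as follows. Integrating (3) from zero to infinity gives a total charge of I0 (τα - τβ), which is how a target charge can be turned into a pulse amplitude for injection. The time constants and the 100 fC charge used below are hypothetical values chosen only for illustration, and the function names are not from the original text.

```python
import math


def double_exp_current(t, i0, tau_alpha, tau_beta):
    """Current of equation (3) at time t (seconds); zero before the strike."""
    if t < 0:
        return 0.0
    return i0 * (math.exp(-t / tau_alpha) - math.exp(-t / tau_beta))


def total_charge(i0, tau_alpha, tau_beta):
    """Closed-form integral of equation (3) from 0 to infinity."""
    return i0 * (tau_alpha - tau_beta)


def i0_for_charge(charge, tau_alpha, tau_beta):
    """Amplitude I0 that makes the pulse carry the requested charge."""
    return charge / (tau_alpha - tau_beta)


def integrate_charge(i0, tau_alpha, tau_beta, t_end, steps=100_000):
    """Trapezoidal integration of the pulse, as a check on the closed form."""
    dt = t_end / steps
    total = 0.0
    prev = double_exp_current(0.0, i0, tau_alpha, tau_beta)
    for k in range(1, steps + 1):
        cur = double_exp_current(k * dt, i0, tau_alpha, tau_beta)
        total += 0.5 * (prev + cur) * dt
        prev = cur
    return total


if __name__ == "__main__":
    tau_a, tau_b = 200e-12, 50e-12          # hypothetical time constants
    i0 = i0_for_charge(100e-15, tau_a, tau_b)  # hypothetical 100 fC pulse
    print(f"I0 = {i0 * 1e3:.3f} mA")
    print(f"numerical charge = {integrate_charge(i0, tau_a, tau_b, 20 * tau_a) * 1e15:.2f} fC")
```

In a SPICE flow the same waveform would be supplied as a current source at the struck node; the sketch only reproduces the arithmetic behind the choice of I0 and τα.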
Depending on the fabrication details and the electrical characteristics of each
sensitive node (capacitance and resistance), different shapes of current transients can be
observed as shown by [23, 25]. Figure 1-3 illustrates a double exponential current with a
corresponding amount of charge Qi. The width of the induced transient voltage pulse
depends on the energy of the incident particle, the charge stored at the affected node
and the charge collection efficiency of the affected junction. So, according to the
electrical characteristics of the struck node, such as resistance and capacitance, transient
voltage pulses of different amplitude and duration are generated.
There are analytical models, such as the ones proposed by [85], that represent the generated
voltage pulse at each sensitive node according to parameters such as I0, τα, τβ, node
capacitance and resistance. Usually the duration of the transient voltage pulse in
nanometer technologies ranges from a few hundred picoseconds to a few nanoseconds.
As discussed by [42], in designs working at GHz frequencies, some transient voltage
pulses may last for a few clock periods.
[Figure 1-3: plot of current versus time; labels: Qdrift, Qdiffusion, charge Qi]
Figure 1-3. The effect of a transient current pulse modeled as a double exponential
current with a certain amount of charge in two different circuit nodes.
Once the values of I0, τα, and τβ are determined for a given technology and
particles of interest, any circuit designed in that technology may be evaluated at the
circuit level by modeling the charge deposition mechanism by (3). The values of I0, τα,
and τβ for a given technology may be obtained by device simulation as well as from
closed form expressions, as presented by [75], [22] and [26]. There is a minimal
amount of charge able to create a transient pulse at a certain node, which is
known as the critical charge. Very often it is important to obtain the critical charge of a
circuit in order to define the environment that the circuit is hardened to. In the next
section, the effects of energetic particle ionization are explained in combinational and
sequential circuits.
1.2 Single Event Effects in Digital ICs
A single particle can hit either the combinational logic or the sequential logic in
the silicon generating a soft error, as discussed by [19] and [1]. Figure 1-4 illustrates a
typical circuit topology found in nearly all sequential circuits. The data from the first
latch is typically released to the combinatorial logic on a falling or rising clock edge, at
which time logic operations are performed. The output of the combinatorial logic reaches
the second latch sometime before the next falling or rising clock edge. At this clock edge,
whatever data happens to be present at its input (and meeting the setup and hold times) is
stored within the latch.
[Figure 1-4: combinational logic block placed between two sequential logic stages]
Figure 1-4. The occurrence of transient faults in combinational and sequential logic
When a charged particle strikes one of the sensitive nodes of a memory cell, such
as a drain in an off state transistor, it generates a transient current pulse that can turn on
the gate of the opposite transistor. The effect can produce an inversion in the stored
value, in other words, a bit flip in the memory cell. Memory cells have two stable states,
one that represents a stored ‘0’ and one that represents a stored ‘1.’ In each state, two
transistors are turned on and two are turned off (sensitive nodes). A bit-flip in the
memory element occurs when an energetic particle causes the state of the transistors in
the circuit to reverse, as discussed by [8] and [39]. This effect is called Single Event
Upset (SEU) and it is one of the major concerns in digital circuits because usually
memory cells are designed with very compact transistors that present high soft error
sensitivity (low critical charge). The SEU phenomenon is illustrated in figure 1-5.
[Figure 1-5 panels: (a) Static memory cell with a particle strike; (b) The induced transient pulse flips the original stored value. Labels: off transistors, stored values 1 and 0, transient current, transient voltage pulse]
Figure 1-5. Single Event Upset (SEU) in a static memory cell
When a charged particle hits the combinational logic block, it also generates a
transient current pulse. This phenomenon is called single event transient (SET), as
presented in [39]. If the logic propagates the induced transient pulse, then the SET will
eventually appear at the input of a latch, where it may be interpreted as a valid signal.
Whether or not the SET gets stored as real data depends on the temporal relationship
between its arrival time and the falling or rising edge of the clock.
The transient pulse generated by the charge deposition mechanism might not be
captured by a memory cell because it could be logically, electrically or latching-window
masked, as discussed by [74] and [51]. Logical masking occurs when the input stimuli
hold controlling values in the logical path in such a way that the SET cannot be
propagated to the outputs. Figure 1-6(a) exemplifies this logical masking. Note that the
output holds the value one, independently of the SET value, because the nand gate has one
of the inputs at logical zero and the nor gate present in the SET path consequently has
one of the inputs at logical one. Electrical masking occurs if the pulse is attenuated as it
propagates through the logic chain and fades out before it reaches the registered output,
as shown in figure 1-6(b). If a SET is neither logically nor electrically masked, it is
interpreted as a valid signal at the register input and it can be captured by the memory
element according to the latching window (usually based on the setup time and hold time
of the memory element), figure 1-6(c). Once a SET is captured, a wrong value will be
stored in the register, provoking a soft error.
[Figure 1-6 panels: (a) Logical Masking; (b) Electrical Masking; (c) Latch-window Masking. Each panel shows a logic path with inputs e0, e1, e2, a3 and a registered output Q; panel (c) also shows clk]
Figure 1-6. Single Event Transient (SET) in a combinational circuit
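Logical masking can be illustrated with a small gate-level sketch. The circuit below is a hypothetical path (a nand gate feeding a nor gate, followed by an inverter) chosen to mirror the description of the nand and nor gates in the text; it is not a transcription of the figure, and all function names are illustrative.

```python
def nand(a, b):
    return 1 - (a & b)


def nor(a, b):
    return 1 - (a | b)


def inv(a):
    return 1 - a


def circuit(e0, e1, set_node):
    """Hypothetical path: Q = NOT(NOR(NAND(e0, e1), set_node)).

    set_node is the logic value seen on the node struck by the particle.
    """
    return inv(nor(nand(e0, e1), set_node))


def is_logically_masked(e0, e1):
    """A SET is logically masked if flipping the struck node never changes Q."""
    return circuit(e0, e1, 0) == circuit(e0, e1, 1)


if __name__ == "__main__":
    for e0 in (0, 1):
        for e1 in (0, 1):
            print(f"e0={e0} e1={e1} masked={is_logically_masked(e0, e1)}")
```

When either nand input is zero, the nand output is one, the nor gate sees a controlling one and the output stays at one whatever the struck node does. Only with both inputs at one does the transient reach Q.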
As a result, the rate at which SETs get latched as errors depends on the operating
frequencies and the logic structure of the circuit. Further, since the inherent delay of
MOS transistors is decreasing with rapid technology scaling, the frequencies at which
circuits are operated are continuously increasing. This increases the probability of SETs
getting latched as errors. In addition, as the process technology shrinks and supply
voltage decreases, the charge stored at logic circuit nodes reduces roughly according to
Qnode = Cnode × Vdd, which is the main reason for the increased sensitivity of nodes to
radiation-induced upsets, as Qc can be larger than Qnode more often. Additional reasons
are the reduction in electrical and timing masking. The impact of electrical masking
decreases with technology scaling. This is due to shorter gate delays and reduced
logic depth between pipeline registers. The reduction in timing masking is a consequence
of higher operating frequencies, which increase the probability of a SET pulse being
latched. Thus, in Very Deep Sub-Micron (VDSM) technologies soft errors in logic
circuits are becoming a significant reliability problem.
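The frequency dependence of timing masking can be made concrete with a first-order estimate. The model below is an assumption introduced here for illustration, not one given in the text: a SET is taken to be latched only if it covers the whole latching window (setup plus hold time), and its arrival time is taken to be uniformly distributed over the clock period. All names and numbers are hypothetical.

```python
def latch_probability(pulse_width, window, clock_period):
    """First-order chance that a SET is captured by a memory element.

    Assumed model: the pulse must span the full latching window, and its
    arrival time is uniform over one clock period. All arguments share the
    same time unit (for example picoseconds).
    """
    if pulse_width <= window:
        return 0.0  # pulse too short to cover the latching window
    return min(1.0, (pulse_width - window) / clock_period)


if __name__ == "__main__":
    width_ps, window_ps = 500, 100  # hypothetical SET width and setup+hold window
    for period_ps in (4000, 2000, 1000, 500):
        p = latch_probability(width_ps, window_ps, period_ps)
        print(f"clock period {period_ps} ps -> latch probability {p:.2f}")
```

Under this assumption, halving the clock period doubles the chance that a given SET is latched, until the pulse is long enough to be captured at every cycle.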
In [60], [88], [56], [66], the probability of a SET becoming a SEU is discussed.
The analysis of SETs is very complex in large circuits composed of many paths.
Techniques such as timing analysis, presented by [4], [88], [51], [55], and [20], can be
applied to analyze the probability of a SET in the combinational logic being stored by a
memory cell or resulting in an error in the design operation. Other techniques based on
formal binary decision diagrams are also proposed in [87].
Multiple bit upsets (MBU) are also becoming a concern because of the process
technology shrinking. MBU can appear due to SETs in nodes with fan-out higher than
one as shown in figure 1-7; or from double node ionizations due to angle of incidence of
the particle, as shown in figure 1-8, which is more common in highly dense memory
arrays.
[Figure 1-7: logic circuit with inputs a0 to a5, outputs y0 and y1, and flip-flops Q0 and Q1]
Figure 1-7. Multiple Bit Upset due to a single SET
Figure 1-8. Multiple Bit Upset due to an incident angle of the particle
In summary, it is mandatory to investigate techniques able to tolerate SETs and
SEUs in integrated circuits. In the next sections, a set of fault-tolerant techniques for
integrated circuits is discussed, and the limitations of each technique are addressed. There is
always a trade-off between finding the most reliable technique and keeping the area and
performance impact to a minimum. In addition, according to the target design and application,
fault-tolerant techniques can be applied at many different steps of the design flow, as
presented in the following sections.
2. Radiation Hardening by Design: Strategies for ASICs
Modifications in the fabrication process technology can reduce the amount of
collected charge, but the reduction is not sufficient to avoid the SET occurrence. The
results published by [21] indicate that significant transients can be generated in both bulk
and SOI technologies at fairly low LETs. In bulk technologies these transients can be
quite large for technologies of 100nm and below, with durations of nearly 1 ns at LET
above 50 MeV-cm2/mg. Consequently, soft error mitigation techniques must still be
applied at different levels of the circuit design flow to ensure reliability.
Figure 2-1 represents the sequence of events that may occur once an energetic
particle hits the substrate, provoking ionization, as discussed previously. The
ionization track generates a set of electron-hole pairs that creates a transient current that
is injected or extracted at that node. According to the amplitude and duration of this
current pulse, a transient voltage pulse may appear at the hit node. This is characterized
as the FAULT. There is a FAULT LATENCY period that defines the time needed for that
fault to become an ERROR in the circuit. This will only occur if this transient voltage
pulse changes the logic state of a storage element (flip-flop), generating a bit-flip. This bit-flip
may generate an error if the content of this flip-flop is used for a certain operation. But
from the application point of view, it is not certain that this error will manifest as a FAILURE
in the system. There is also an ERROR LATENCY that defines the time needed for that
error to become a failure in the system. For each phase a different fault-tolerant
technique can be used. Modern circuits may need fault tolerance at many different levels
to ensure reliability.
For example, at the ionization and transient current generation phase, sensors can
be built in the silicon substrate to detect ionization currents. The idea at this point is to
notify the system that ionization has occurred. Once a transient voltage pulse is
generated, temporal filtering can be applied to detect the transient pulse in time.
However, the limitations of temporal filtering will be presented later on in this
manuscript. To mitigate the bit-flips, hardware redundancy and error correcting codes can
be used to correct the data. To correct an error, it is possible to use self-checking blocks
with recovery mechanisms or recomputation to restore the correct data. Finally, spare
chips may be used to guarantee operation of the system if a failure occurs.
Figure 2-1. Sequence of events from ionization to failure and a set of fault tolerant
techniques applied at different times.
A set of techniques able to tolerate this entire sequence of events is analyzed in
this manuscript:
• Layout and Electrical level based techniques:
  o Built-in sensors for ionization detection
  o Transistor resizing for charge dissipation
• Logic-level based techniques:
  o Hardware redundancy for majority voting
  o Time redundancy for temporal filtering
  o Error correcting codes for detection and correction of bit-flips in memory elements
  o Hardened memory cell for bit-flip avoidance
• Architectural level based techniques:
  o Recomputation
It is important to point out that there is always some penalty to be paid when
protecting circuits against upsets. Each technique may present a combination of area
overhead, performance penalty and power dissipation increase. The challenge is to select
the most suitable techniques for the target circuit application in order to meet the area,
time and power constraints, as well as the soft error hardness needed. In the next sections,
a set of techniques are presented continuing the discussion done by [38].
2.1 Layout- and Electrical-level based techniques
2.1.1 Bulk Built-in Current Sensors
Built-in current sensors (BICS) have been used for permanent fault detection, where the
permanent faults typically originate from imperfections in the integrated circuit
fabrication process, as presented in [5]. It is well known that stuck-at faults can change
the amount of current consumed by a circuit, so BICS connected to the power lines can
detect current variations and consequently report the occurrence of permanent faults.
However, soft errors, which are one of the major concerns nowadays, have a transient
effect and consequently do not produce current variations at the power lines that can
be distinguished from any other circuit activity. The source of the effect is a transient
ionization that can only be seen at the hit node or at the bulk region. For this reason,
BICS connected to the power lines cannot help with soft error detection as is, but BICS
connected to the bulk region can sense the ionization.
As discussed above, during normal circuit operation the current flowing between
a reverse-biased drain junction and bulk is negligible compared to the current peak
induced by an energetic particle hit. Consequently, it is cost-effective to consider a
BICS connected to the bulk of a circuit, instead of connecting it to the power lines of the
circuit. The bulk-BICS works as a monitor that senses the current at the bulk terminal.
During normal operation, the current in the bulk is approximately zero. Only the leakage
current flows through the reverse-biased junction, which is still very low compared to the current
generated by charged particles. So, when a charged particle generates a current in the
bulk, it is very clear to the bulk-BICS that a SET has happened.
Figure 2-2 (a) shows the bulk-BICS connected to an integrated circuit as proposed
by [29]. For the bulk-BICS approach, it is necessary to have a dedicated BICS in each
type of well (N-well, P-well). Consequently, one BICS design is used for PMOS
transistors in the N-well and another BICS design is used for NMOS transistors in the
P-well. In addition, the possibility of distinguishing upsets that occur in the PMOS region
(BICS-P output) from the ones in the NMOS region (BICS-N output) can help to
precisely map the faulty region in the circuit design. Each bulk-BICS can detect
ionizations in a certain number of transistors, where this number is determined by the
designer considering the SET-detection sensitivity. A circuit with n
transistors that uses i bulk-BICS has each bulk-BICS connected to
n/i transistors. Figure 2-2 (b) depicts the connection of the bulk-BICS to the body ties.
The circuit itself is connected to the power lines (at the transistor sources), while the body
ties are connected to VDD or ground through the bulk-BICS.
[Figure 2-2 panels: (a) Schematic of the Bulk-BICS, showing the circuit design between Vdd' and Gnd', the BICS-P (transistors p1 to p6, RST) and the BICS-N (transistors n1 to n6, nRST); (b) Bulk-BICS sensors placed at the silicon substrate, body-tie is connected to VDD or ground through the bulk-BICS]
Figure 2-2. The N-BICS and P-BICS connected to the bulk of an integrated circuit,
as presented in [29].
In the case of N-well, the body-ties are connected to VDD through the bulk-BICS,
while in the case of P-well, the body-ties are connected to ground through the bulk-BICS.
The bulk-BICS can be calibrated to detect ionizations that can produce
transient current pulses at the struck node. A SET is assumed to occur if the voltage of
the logic gate output node changes by more than VDD/2. This bulk-BICS technique
presents a conservative approach for SET detection.
Figure 2-3 shows the temporal diagram of a SET detected by the bulk-BICS.
Once a SET occurs at any moment in a clock cycle, the bulk-BICS detects this SET after
a certain delay, called SET detection time, that depends on the amount of area
(transistors) protected by that bulk-BICS and on the size of the SET. The more intense
the SET (large I0 and τα), the faster the SET detection by the bulk-BICS occurs. The more
transistors connected to one BICS, the larger the capacitance associated with that
connection and consequently, the SET detection time is longer.
Once a SET is detected, the output of the bulk-BICS is raised, which notifies a
control logic in the circuit to perform some fault tolerant technique to tolerate the
detected SET and to reset the bulk-BICS.
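[Figure 2-3: timing diagram with signals clk, SET (with the Vdd/2 level marked), bulk-BICS (with the detection delay marked), bulk-BICS_ctrl and reset_BICS; events numbered 1 to 3]
Figure 2-3. Bulk-BICS time diagram for detection and reset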
2.1.2 Transistor Resizing for Charge Dissipation
Digital circuits have different resistance and capacitance values at each
gate node according to its fan-out and gate logic type. Consequently, each node presents a
distinct critical charge (Qcrit), which is the minimum collected charge needed to provoke a
SET or SEU at that node. When a soft error analysis tool is used (such as the ones
referred to previously), the probability of SET occurrence in a certain design is evaluated.
So, it is possible to identify the most sensitive nodes, which are the ones that present a
higher chance of propagating a SET to the outputs (low logic and electrical masking) and
a low Qcrit.
The idea of transistor resizing is to enlarge the width of some transistors in order to
increase the capacitance of the most sensitive nodes in such a way that the node critical
charge is increased. It is not desirable to increase all node capacitances because this
would make the circuit slow and power hungry. So, it is important to analyze the
circuit sensitivity to soft errors in order to choose the nodes that are going to be modified.
Some recent works, such as the ones published by [89], [17], and [20], showed the
variation of Qcrit as a function of the transistor channel widths and therefore have
presented results on the decrease of SET sensitivity obtained by applying transistor resizing.
Transistor resizing can also be replaced by gate duplication, as proposed by [56].
Figure 2-4 presents an example of a circuit with the three most sensitive nodes to SET.
By eliminating the chance of a SET occurrence in these nodes, the sensitivity to SET of
the entire circuit is reduced by 50%. In order to determine the size of the transistors (node
capacitance and resistance) that is able to mitigate a certain range of energetic particles,
the model equations applied for the calculation of the node critical charge and for the
SET generation [85] can be used. The challenges of this technique are: (a) keeping the
circuit timing requirements when increasing the transistor sizing of the most sensitive nodes
and (b) finding a transistor size with a critical charge that is able to avoid SET for a range
of LET. It is clear that this method is suitable for low LET such as alpha particle LET,
which is around 2 MeV-cm2/mg and neutrons up to 2 MeV-cm2/mg.
[Figure 2-4: gate-level circuit with inputs A to F and output Z; the most sensitive nodes are marked]
Figure 2-4. Transistor Resizing
2.2 Logic-level based techniques
The logic-level based techniques are all fault-tolerant techniques that can be easily
applied at the gate level to tolerate soft errors (SET in combinational circuits and SEU in
sequential circuits). The logic-level based techniques can be applied in hardware
description languages such as VHDL and Verilog or at the schematic description
level.
The techniques presented are based on hardware and time redundancy,
hardened memory cells, and error correction codes for information redundancy. As will be
discussed in the next sections, some of these techniques are able to mitigate SET and
SEU, others only SET and others only SEU.
2.2.1 Hardware redundancy techniques
Redundancy has long been used successfully to detect and vote out logic errors.
The first basic approach is duplication with comparison (DWC), where the module
is replicated and the outputs are compared. If the outputs mismatch, an error is detected.
Of course, some errors can be masked by the application, so an error is only detected
when it produces a wrong output value.
This scheme can be used in both combinational and sequential logic, for SET and
SEU detection respectively, as presented in figure 2-5. It can also be applied to the
entire circuit. It is common to have two processors executing the same task in order to
detect errors in one of the two chips.
However, the comparator is the key circuit, because it is expected to detect the error
and to be immune to errors itself. Comparators can usually be designed with larger
transistors in order to be less sensitive to upsets, and they can also be duplicated.
Figure 2-5. Duplication with comparison scheme
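The duplication with comparison scheme can be sketched behaviorally as follows. The 8-bit
adder and the bit-flip fault are hypothetical stand-ins for a module and an upset; the point
is only that the comparator raises a flag on a mismatch without knowing which copy is wrong.

```python
def duplicate_with_compare(module, inputs, fault=None):
    """Run two copies of `module`; `fault` optionally corrupts the output of copy 1."""
    out0 = module(inputs)
    out1 = module(inputs)
    if fault is not None:
        out1 = fault(out1)
    error = out0 != out1  # comparator: flags a mismatch, cannot tell which copy is wrong
    return out0, error

def adder(operands):
    """Hypothetical 8-bit module used as the replicated logic."""
    return (operands[0] + operands[1]) & 0xFF

def flip_bit3(value):
    """Models a single upset on bit 3 of one copy."""
    return value ^ 0b1000

print(duplicate_with_compare(adder, (5, 7)))             # (12, False)
print(duplicate_with_compare(adder, (5, 7), flip_bit3))  # (12, True)
```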
However, duplication with comparison can only signal that an error is present; it
cannot indicate which module or piece of logic holds the error. A self-checking
circuit can be used to detect an error in a specific module, for example parity checking
in arithmetic logic functions. In this case, a hot backup approach can be used, as
illustrated in figure 2-6(a). There is the main module (module 0) and the spare module
(module 1). By default, the output receives the module 0 output. If the self-checking
block detects an error in this module, the output receives the module 1 output, which is
supposed to be fault free. On the other hand, the self-checking block can be very difficult
to design, and very often the checker has the same complexity as the block that it must
check. So, the duplication with backup approach is also very commonly used, as shown in
figure 2-6(b). Module 0 and module 1 work in tandem and their outputs are continuously
compared. If an upset is detected, the output receives the output of module 2, which is
the spare module and is supposed to be fault free. The only problem is how to ensure
that the spare module is fault free. To overcome this issue, modular redundancy with
majority voters (MAJ voters) can be used.
Figure 2-6. Hot backup and Duplication with backup approaches: (a) Hot backup approach
(module 0 with a self-checking block and spare module 1); (b) Duplication with backup
approach (module 0 and module 1 with spare module 2)
In order to detect the error and vote the correct output, n redundant elements are
necessary, where n is typically an odd number equal to or larger than 3. This
approach is called N-modular redundancy (N-MR). Triple modular redundancy
(TMR) is the most common approach. It requires three modules working in tandem and a
majority voter (MAJ voter) to vote the correct output. When an upset occurs, it is
expected that at least two out of three outputs are correct, so the voter can decide the
correct output. Figure 2-7 illustrates this approach used for sequential elements
(flip-flops).
Figure 2-7. Triple Modular Redundancy (TMR) in the sequential logic and the
majority voter (MAJ)
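The majority vote itself is the usual 2-out-of-3 sum of products. A minimal sketch, applied
bitwise to three copies of a stored word (the 4-bit value and the flipped bit are arbitrary
example data):

```python
def maj(a, b, c):
    """Bitwise 2-out-of-3 majority: the sum-of-products form of the voter."""
    return (a & b) | (a & c) | (b & c)

stored = 0b1011
upset = stored ^ 0b0100  # one of the three copies with a single bit flipped

print(bin(maj(stored, stored, upset)))  # 0b1011
```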
However, there are two main limitations in soft error protection when using only
TMR in the sequential logic, as presented in figure 2-8 and figure 2-9. The first limitation
is that a SET in the combinational logic can be stored in all three flip-flops at the same
time, which makes the majority voter choose a wrong output (figure 2-8). The second
limitation arises when the SET occurs in the majority voter. This SET can be propagated and
latched by the three flip-flops later in the circuit, as presented in figure 2-9. In both
cases, the MAJ voter chooses the wrong output because 3 out of 3 values are wrong.
Figure 2-8. SET propagation in a TMR scheme in the sequential logic
Figure 2-9. SET propagation in the majority voter (MAJ)
In order to solve this problem, full TMR has been proposed. In this case, the
combinational logic and the voters are also triplicated, as shown in figure 2-10. If a SEU
occurs in one of the flip-flops, the MAJ voter chooses the correct output, as shown in
figure 2-10(a), and at the next clock the correct output is loaded into the flip-flop,
clearing the SEU. If a SET occurs in one of the combinational logic blocks, the SET may
be captured by only one of the flip-flops and the MAJ voter will be able to choose the
correct output, as represented in figure 2-10(b). If a SET occurs in one of the MAJ voters,
the voter output will show the transient for a short period of time, as shown in figure
2-10(c). But since the whole circuit is triplicated, only one redundant part is affected and
the SET will be voted out at the next MAJ voter. Usually the three voter outputs can be
connected outside the chip, as shown in figure 2-10(d). This scheme acts as a kind of analog
voter: even if an upset occurs in one of the voters, the currents at the output will
provide the correct output.
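The self-recovery of full TMR after a SEU can be sketched as a behavioral model. The three
register copies, the 4-bit increment used as next-state logic and the injected bit flip are
assumptions made for the example; each redundant domain has its own voter and its own copy
of the logic.

```python
def maj(a, b, c):
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)

def clock_tick(regs, logic):
    """One clock of a full TMR stage: three voters feed three copies of the logic."""
    voted = [maj(*regs) for _ in range(3)]  # one voter per redundant domain
    return [logic(v) for v in voted]        # each domain reloads its own flip-flop

def increment(x):
    """Hypothetical 4-bit next-state logic."""
    return (x + 1) & 0xF

regs = [5, 5, 5]
regs[1] ^= 0b0010                   # SEU in the flip-flop of domain 1
print(regs)                         # [5, 7, 5]
regs = clock_tick(regs, increment)
print(regs)                         # [6, 6, 6]: the upset is cleared at the next clock
```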
TMR presents two weaknesses: (a) it does not protect against double faults
simultaneously affecting different redundant modules, whose probability of
occurrence has increased in nanometer technologies, as discussed by [70]; and (b) a
single fault in the last voter itself can generate undetected errors, as shown by [43]. Even
when the voters are also triplicated, a fourth voter is always needed to choose the
final output of the circuit. This last voter in the chain, even with a lower probability of
producing an error due to a SET, will always be subject to this problem.
Figure 2-10. Full Triple Modular Redundancy (TMR) with self-recovery: (a) SEU Mitigation;
(b) SET Mitigation in the logic; (c) SET in the voters; (d) Output voter (the three voter
outputs of the chip are connected at board level to form OUT)
In [71], an analog voter was proposed to ensure complete tolerance against
SET in TMR solutions. This voter, shown in figure 2-11, uses an analog comparator,
instead of the traditional digital sum of products, to decide the output value. The
robustness of the proposed analog majority voter relies on three main points: duplicated
input nodes, well-dimensioned transistors in the analog comparator and output transistors
always in the on state.
Figure 2-11. Full Triple Modular Redundancy (TMR) with Analog Majority Voter (modules 0, 1
and 2 feed the majority logic voter, an analog comparator referenced to VDD/2)
Consequently, if a SET occurs at one of the inverter transistors, there is always
another transistor, in parallel, to ensure the correct value. The same happens if a SET
occurs inside the analog comparator logic; the transistors are set in a way that the SET
will not be able to turn on or off other transistors, holding the correct value at the output.
Finally, if a SET occurs at the output node, that transistor is already conducting a current
and the additional current generated by the transient pulse does not change the output
state and, therefore, does not harm the operation of the circuit. A more detailed analysis
of the behavior of an analog comparator in the presence of SETs can be found in [48].
The schematic diagram of the analog comparator used to implement the MAJ
voter and the dimensions of the transistors, implemented using a 32 nm technology as
proposed by [46], are shown in figure 2-12. The six inverters connecting the inputs to the
comparator, shown in figure 2-11, were implemented using PMOS transistors with
W=144 nm and L=32 nm, and NMOS transistors with W=80 nm and L=32 nm.
Figure 2-12. The schematic of the Analog Comparator used in the Analog Majority
Voter (transistors M1 to M7, inputs Vin and Vref, output Out) and the W/L ratio of each
transistor for the 32nm CMOS technology using the predictive model from Berkeley.
The analog majority voter can also be used as the basic logic gate to implement
Boolean functions. As presented in [46], any combinational logic can be mapped to
a tree of analog majority voters, which makes each node of the circuit robust to SET. By
using this technique, the circuit can tolerate multiple SETs. Figure 2-13 illustrates a
one-bit full adder mapped to the analog MAJ voters.
Figure 2-13. One-bit Full Adder implemented by only Analog Majority Voters (five MAJ voters
with inputs A, B, Cin, !A, !Cin and constants 0 and 1, producing sum and cout)
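To illustrate that a full adder can be built from majority functions alone, the sketch below
uses a well-known three-voter mapping with inverted signals. This is not necessarily the
five-voter mapping of figure 2-13, and it models only the Boolean function, not the analog
voter circuit.

```python
from itertools import product

def maj(a, b, c):
    """2-out-of-3 majority of single bits."""
    return (a & b) | (a & c) | (b & c)

def full_adder_maj(a, b, cin):
    """One-bit full adder built from three majority functions and inversions."""
    cout = maj(a, b, cin)
    inner = maj(a, b, 1 - cin)         # majority with the carry-in inverted
    total = maj(1 - cout, inner, cin)  # sum bit
    return total, cout

# Exhaustive check against ordinary addition.
for a, b, cin in product((0, 1), repeat=3):
    s, c = full_adder_maj(a, b, cin)
    assert 2 * c + s == a + b + cin
print("all 8 input combinations match")
```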
In summary, hardware redundancy techniques such as full TMR for
combinational and sequential logic and the solution of mapping Boolean logic functions
to analog majority voters can protect the circuit against SET and SEU. The drawback of
these techniques is the area overhead and consequently power dissipation.
2.2.2 Time redundancy techniques
Time redundancy techniques are solutions able to process the data at different
times, which allows the detection of faults [47]. The simplest scheme uses two
flip-flops, controlled by a clock and a delayed clock, to latch the combinational
output at two different times, which allows the detection of a SET, as shown in figure
2-14. The delay (d) must be chosen according to the SET duration that must be
detected. Suppose that the longest SET that can occur in this circuit has a duration
of 600ps. Then the delay (d) must be at least 600ps to allow one flip-flop to capture the
correct data while the other one captures the SET. This scheme can only detect a SET; it
cannot vote the correct output.
Figure 2-14. Time redundancy scheme for SET detection
In full time redundancy, the output is captured at three different times, which
allows a majority voter to choose the correct output. In this case, the output of the
combinational logic is latched at three different moments, where the clock edge of the
second latch is shifted by the time delay d and the clock of the third latch is shifted by
the time delay 2d. A voter chooses the correct value, as shown in figure 2-15.
Figure 2-15. Full time redundancy scheme
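A behavioral sketch of the full time redundancy scheme is given below. The SET is modeled,
as an assumption, as an interval during which the combinational output is inverted; the
three latches sample at clk, clk+d and clk+2d and a majority vote follows. Times are in
picoseconds and the pulse positions are example values.

```python
def maj(a, b, c):
    """2-out-of-3 majority of single bits."""
    return (a & b) | (a & c) | (b & c)

def sample(correct, set_start, set_width, t):
    """Combinational output at time t: inverted while the SET pulse is active."""
    in_pulse = set_start <= t < set_start + set_width
    return correct ^ 1 if in_pulse else correct

def time_redundant_capture(correct, set_start, set_width, d, t_clk=0):
    """Latch the output at clk, clk+d and clk+2d, then vote."""
    samples = [sample(correct, set_start, set_width, t_clk + k * d) for k in range(3)]
    return samples, maj(*samples)

# A SET shorter than d corrupts one sample only, so the vote is correct.
print(time_redundant_capture(1, set_start=-100, set_width=600, d=600))   # ([0, 1, 1], 1)
# A SET longer than d corrupts two samples, so the vote is wrong.
print(time_redundant_capture(1, set_start=-100, set_width=1300, d=600))  # ([0, 0, 1], 0)
```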
This technique is also able to vote out a SET that occurs in a MAJ voter, as presented
in figure 2-16(a), provided that the MAJ voter is not the very last one, figure 2-16(b).
(a) SET in MAJ voters
(b) SET in the very last MAJ voter
Figure 2-16. SET propagation in the MAJ in time redundancy schemes
However, this time redundancy technique based on clock delay does not work for
SET mitigation in nanometer technologies when the SET pulse duration is large
compared to the clock period. Figure 2-17 shows the problem in a time diagram. Let us
consider a 90nm technology working at 1 GHz. The maximum delay between two
registers separated by combinational logic is 1ns, which is the clock period. For
this technology, as discussed previously, SETs can vary from a few hundred picoseconds
to nanoseconds. Suppose that SET pulses up to 600ps must be tolerated. The
clock delay d must then be at least 600ps. The first clock (clk) occurs at time t=0, the
second clock (clk+d) occurs at t=600ps and the third clock (clk+2d) occurs at time
t=1,200ps. After storing the data in all three latches, the MAJ voter chooses the correct
output, which also adds a propagation delay. Consequently, 1,200ps plus the MAJ voter
propagation delay are needed to vote out the SET, not counting the
combinational logic propagation delay. The new achievable frequency is less than half of
the original frequency. This shows that this method is suitable only for SETs with a
duration not higher than 10% of the clock period. However, it is well known that in
nanometer technologies SET durations are of the same order of magnitude as the clock
period, as discussed by [21], [26] and [42]. So, new time redundancy techniques must be
investigated.
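The timing penalty in this example can be worked out numerically. The clock period and the
delay d come from the text; the voter delay of 100ps is a hypothetical value, since the text
does not give one, and the combinational delay is taken at its 1ns maximum.

```python
def voted_output_time_ps(d_ps, t_maj_ps):
    """Time after the first clock edge at which the voted value is available."""
    return 2 * d_ps + t_maj_ps

clock_period_ps = 1000  # 1 GHz, from the text
d_ps = 600              # longest SET to tolerate, from the text
t_maj_ps = 100          # hypothetical MAJ voter propagation delay

extra_ps = voted_output_time_ps(d_ps, t_maj_ps)
new_period_ps = clock_period_ps + extra_ps  # combinational delay plus capture and vote
frequency_ratio = clock_period_ps / new_period_ps

print(extra_ps)                    # 1300
print(new_period_ps)               # 2300
print(round(frequency_ratio, 3))   # 0.435, i.e. less than half the original frequency
```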
Figure 2-17. SET propagation in the MAJ in time redundancy schemes: (a) short duration SET;
(b) long duration SET (time diagrams of clk, clk+d, clk+2d, comb, ffp0, ffp1, ffp2 and MAJ
over the period T, including the MAJ + comb delays)
2.2.3 Mixed Hardware and Time Redundancy Techniques
Mixed hardware and time redundancy techniques attempt to mitigate SET and
SEU by using the best characteristics of hardware and time redundancy in order to achieve a
lower area overhead and, at the same time, a lower performance penalty.
The code word state preserving (CWSP) technique proposed by [2, 3], illustrated in
figure 2-18, is an example of mixed hardware and time redundancy. From the hardware
redundancy point of view, it has redundant combinational logic and extra transistors in
the very last gate stages. From the time redundancy point of view, the output is only
transmitted to the flip-flop input when both combinational logic outputs agree. So, if a
SET occurs in one of the combinational logic blocks, the flip-flop input has a high
impedance value for as long as the SET lasts.
Figure 2-18. Code Word State Preserving technique for SET mitigation: (a) General CWSP
scheme (two combinational logic blocks with outputs a, b and a*, b*, the CWSP stage and
clk+delay); (b) Example of logic gates with extra transistors to block the SET propagation
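The behavior seen at the flip-flop input can be sketched as follows. This is only a
behavioral model of the agreement condition described in the text, with the string "Z"
standing for the high impedance value; it does not model the transistor-level gates of
figure 2-18(b).

```python
def cwsp_output(a, a_star):
    """Value at the flip-flop input: driven only when both logic copies agree."""
    return a if a == a_star else "Z"

print(cwsp_output(1, 1))  # 1
print(cwsp_output(0, 0))  # 0
print(cwsp_output(1, 0))  # Z (a SET is active on one of the copies)
```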
This technique does not need voters or comparators but it has an asynchronous
behavior because it is not possible to determine when both combinational outputs are
ready to be stored in the flip-flop. Consequently, the clock period cannot be fixed. Also,
the very last transistors of the logic are sensitive to SET, so they must be sized in order to
be less sensitive than the others.
Figure 2-19 presents the time diagram of this technique showing the limitations of
this technique for SET with long pulse duration. Note that in figure 2-19(b) the flip-flop
will store the high impedance value as the two combinational output values are not yet
agreeing with each other.
Figure 2-19. Time diagram of CWSP approach: (a) short duration SET; (b) long duration SET
(signals clk, a, a* and out, where out shows the high impedance value `Z`)
The technique proposed by [50] uses hardware redundancy for the
combinational logic and for the register circuit, with a C-element and a keeper circuit at
the latch outputs, as shown in figure 2-20. If a SEU occurs in one of the latches, as
illustrated in figure 2-20(a), the C-element does not propagate the upset and the keeper
element is able to maintain the output value. Note that the C-element in this case inverts
the values stored in the latches. This ensures the correct value at the output (OUT) in
the presence of SEU.
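The C-element with keeper described above can be sketched behaviorally. The class name and
the initial output value are illustrative choices; the inversion follows the note in the
text that the C-element inverts the latched values.

```python
class InvertingCElement:
    """C-element with keeper: the output changes only when both inputs agree."""

    def __init__(self, out=0):
        self.out = out

    def update(self, a, b):
        if a == b:
            self.out = a ^ 1  # inputs agree: propagate the inverted value
        return self.out       # inputs differ: the keeper holds the previous value

c = InvertingCElement(out=0)
print(c.update(1, 1))  # 0: both latches hold 1
print(c.update(1, 0))  # 0: one latch upset, output held by the keeper
print(c.update(0, 0))  # 1: both latches hold 0
```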
If an upset occurs in the combinational logic, one latch registers the SET
while the other one registers the correct value, as shown in figure 2-20(b). While Clock is
equal to one, there is a time interval in which the C-element propagates the correct value
and another in which the C-element blocks the propagation because the latch values
mismatch. This phenomenon is illustrated in figure 2-21(a). However, for a long SET pulse,
the C-element may never propagate the correct value, as the latches may keep diverging for
the entire period when the clock is equal to one, as seen in figure 2-21(a) on the left. In
this case, the output (OUT) may hold a value from a previous clock cycle, which compromises
the synchronization of the circuit.
The redundant logic can be replaced by a delay (time filtering), as shown in figure
2-20(c). In this case, the two latches may store the SET at different times, as represented
in figure 2-21(b). The problem occurs when a long SET pulse happens, because in this
case the two latches can hold the SET value at the same time, making the C-element
propagate the wrong value, which will be kept by the keeper circuit. In summary, this
technique works well for SEU but it cannot properly protect against SETs.
Figure 2-20. Hardware and Time redundancy with C-element [50]: (a) SEU in the latches;
(b) SET in the combinational logic when using logic duplication; (c) SET in the
combinational logic when using time filtering
This technique, as the ones presented previously, is inadequate for long duration
pulse SETs. When long SET pulses occur, the output holds the previous value or even the
wrong value, which can compromise the synchronization of the circuit. A solution for
mitigating long SET pulses may be based on recomputation combined with a low cost
technique able to detect the SET and SEU faults.
Figure 2-21. Time diagram from the technique presented in [50]: (a) Time diagram for SET in
the combinational logic when using logic duplication for short and long SET pulses (signals
clock, c_out0, c_out1 and OUT); (b) Time diagram for SET in the combinational logic when
using time filtering for short and long SET pulses (signals clock, c_out, delayed c_out and
OUT). The diagrams mark where the C-element propagates the correct value, where it
propagates the wrong value, and where the keeper holds the previous value.
2.2.3 Hardened Memory Cells
Memory elements can be protected against SEU (bit-flip) by modifying their
original design with extra resistors or transistors, able to recover the stored value if an
upset strikes one of the drains of a transistor in “off” state. These cells are called
hardened memory cells, and they can avoid the occurrence of a SEU by design, according
to the particle charge and flux.
In order to better understand how these hardened memory cells work, let us start
with the analysis of a standard static memory cell composed of 6 transistors. When a
memory cell holds a value, it has two transistors in “on” state and two transistors in “off”
state; consequently there are always two SEU sensitive nodes in the cell. When a particle
strikes one of these nodes, the energy transferred by the particle can cause a transistor
to switch “on”. This event will flip the value stored in the memory. If a resistor is inserted
between the output of one of the inverters and the input of the other one, the signal can be
delayed long enough to avoid the bit flip.
The SEU tolerant memory cell protected by resistors proposed by [83] for ASICs
and by [68] for FPGAs was the first proposed solution to this matter, figure 2-22(a). The
decoupling resistor slows the regenerative feedback response of the cell, so the cell can
discriminate between an upset caused by a voltage transient pulse and a real write signal.
It provides a high silicon density, for example, the gate resistor can be built using two
levels of polysilicon. The main drawbacks are temperature sensitivity, performance
vulnerability in low temperatures, and an extra mask in the fabrication process for the
gate resistor. However, a transistor controlled by the bulk can also implement the resistor
avoiding the extra mask in the fabrication process. In this case, the gate resistor layout
has a small impact in the circuit density.
Memory cells can also be protected by an appropriate feedback devoted to restore
the data when it is corrupted by an ion hit. The main problems are the placement of the
extra transistors in the feedback in order to restore the upset and the influence of the new
sensitive nodes. Examples of this method are IBM hardened memory cells proposed by
[69] in figure 2-22(b), HIT cells in figure 2-22(c) proposed by [12, 79], DICE cells in
figure 2-22(d) proposed by [14] and NASA memory cells proposed by [84, 44, 15],
represented in figure 2-22(e). The main advantages of this method are temperature,
voltage supply and technology process independence, and good SEU immunity. The
main drawback is silicon area overhead that is due to the extra transistors and their extra
size.
Figure 2-22. Examples of SEU hardened cells: (a) Resistor memory cell; (b) IBM hardened
memory cell; (c) HIT memory cell; (d) DICE hardened memory cell; (e) NASA hardened memory
cell
2.2.4 Error Correcting Code (ECC)
The error correcting code technique is based on information redundancy and is used
to mitigate SEU in integrated circuits, as discussed by [18]. It is usually used in memory
arrays, but it can also be applied to registers or other small memory structures in
microprocessors, for instance. Designers can implement ECC detection and correction in
hardware or software; [73] compares the reliability of ECC implemented at these two
levels. The simplest error correcting codes can correct single-bit errors and
detect double-bit errors, while more complex ones can detect or correct multi-bit errors.
Examples include Hamming code, BCH code, Reed-Solomon code, Reed-Muller code,
Binary Golay code, convolutional code, and others. Simple codes are usually
implemented in hardware using extra memory bits and encoding/decoding circuitry.
Figure 2-23 illustrates 8-bit data being written to and read from a register. If a
SEU occurs and there is no information redundancy, an error occurs but it is not detected,
which can lead to catastrophic consequences in the circuit. If a parity bit is added to the
stored data, it is possible to detect an error when the parity bits mismatch. For many
applications, it is not enough to detect the error; it is also necessary to correct it. For
those, it is possible to use an ECC code with encoder and decoder blocks. The encoder block
creates a set of check bits that help to identify the error position, and the
decoder block is then able to restore the correct value.
Figure 2-23. Error Correcting Code Principle (an 8-bit word written to and read from a
register in three cases: no redundancy, where the error is not detected; a parity bit P,
where the parity does not match and the error is detected; an encoder and a decoder, where
the error is corrected)
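The parity case can be sketched in a few lines. The data value and the flipped bit positions
are arbitrary example choices; even parity is assumed.

```python
def even_parity(value):
    """Parity bit that makes the total number of 1s even."""
    return bin(value).count("1") % 2

data = 0b11111111
stored_parity = even_parity(data)  # written together with the data

single_upset = data ^ 0b00010000   # SEU flips one stored bit
double_upset = data ^ 0b00011000   # two bits flipped

print(even_parity(single_upset) != stored_parity)  # True: error detected
print(even_parity(double_upset) != stored_parity)  # False: a double error goes unnoticed
```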
An example of a widely used ECC is the Hamming code [31] in its simplest
version. It is an error-detecting and error-correcting binary code that can detect all
single- and double-bit errors and correct all single-bit errors (SEC-DED). This coding
method is recommended for systems with low probabilities of multiple errors in a single data
structure (e.g., only a single bit error in a byte of data). The code satisfies the relation
2^k ≥ m+k+1, where m+k is the total number of bits in the coded word, m is the number of
information bits in the original word, and k is the number of check bits in the coded
word. Following this relation, the Hamming code can correct all single-bit errors on n-bit
words and detect double-bit errors when an overall parity check bit is used.
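The number of check bits follows directly from the standard Hamming condition
2^k ≥ m+k+1. A short sketch that finds the smallest k for a given number of information
bits m (the function name is illustrative):

```python
def hamming_check_bits(m):
    """Smallest k such that 2**k >= m + k + 1 for m information bits."""
    k = 1
    while 2 ** k < m + k + 1:
        k += 1
    return k

for m in (4, 8, 16, 32, 64):
    print(m, hamming_check_bits(m))  # 4 3, 8 4, 16 5, 32 6, 64 7
```

For the 8-bit example used in this section the result is k=4, which gives the 12-bit coded
word.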
The hamming code implementation is composed of a combinational block
responsible for encoding the data (encoder block), inclusion of extra bits in the word that
indicate the parity (extra latches or flip-flops) and another combinational block
responsible for decoding the data (decoder block). The encoder block calculates the
check bits and it can be implemented by a set of n-input XOR gates. The decoder block is
more complex than the encoder block, because it needs not only to detect the fault, but it
must also correct it. It is basically composed of the same logic used to compose the check
bits plus a decoder that will indicate the bit address that contains the upset. The decoder
block can also be composed of a set of n-input XOR gates and some AND and
INVERTER gates.
The encoder block calculates the check bits, which are placed in the coded word at
positions 1, 2, 4, …, 2^(k-1). For example, for 8-bit data, 4 check bits (p1, p2, p3, p4) are
necessary so that the Hamming code is able to detect and correct a single-bit error
(SEC-SED). Figure 2-24 demonstrates a 12-bit coded word (m=8 and k=4) with the check bits
p1, p2, p3 and p4 located at positions 1, 2, 4 and 8 respectively. The check bits are able to
inform the position of the error. The check bit p1 creates even parity for the bit group {1,
3, 5, 7, 9, 11}. The check bit p2 creates even parity for the bit group {2, 3, 6, 7, 10, 11}.
Similarly, p3 creates even parity for the bit group {4, 5, 6, 7, 12}. Finally, the check
bit p4 creates even parity for the bit group {8, 9, 10, 11, 12}.
Encoder block: check bits generation
P1 = W3 xor W5 xor W7 xor W9 xor W11
P2 = W3 xor W6 xor W7 xor W10 xor W11
P3 = W5 xor W6 xor W7 xor W12
P4 = W9 xor W10 xor W11 xor W12

Decoder block: syndromes
Syndrome P1 = P1 xor W3 xor W5 xor W7 xor W9 xor W11
Syndrome P2 = P2 xor W3 xor W6 xor W7 xor W10 xor W11
Syndrome P3 = P3 xor W5 xor W6 xor W7 xor W12
Syndrome P4 = P4 xor W9 xor W10 xor W11 xor W12

Decoder block: mask generation
Syndrome (P4P3P2P1)    Mask (P1P2W3…W11W12)
0000                   no error
0001                   100000000000
0010                   010000000000
0011                   001000000000
0100                   000100000000
0101                   000010000000
0110                   000001000000
0111                   000000100000
1000                   000000010000
1001                   000000001000
1010                   000000000100
1011                   000000000010
1100                   000000000001

Figure 2-24. Hamming code check bits generation for an 8-bit word, 12-bit coded
word
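The encoder, syndrome and mask equations of figure 2-24 can be exercised with a small
software model. The dictionary-based word representation (position 1 to 12 mapped to a bit)
and the function names are choices made for this sketch; the parity groups are those of the
figure.

```python
# Parity groups from figure 2-24: check bit position -> data positions it covers.
GROUPS = {1: (3, 5, 7, 9, 11), 2: (3, 6, 7, 10, 11), 4: (5, 6, 7, 12), 8: (9, 10, 11, 12)}
DATA_POSITIONS = (3, 5, 6, 7, 9, 10, 11, 12)

def encode(data_bits):
    """Map 8 data bits to a 12-bit coded word (dict: position -> bit)."""
    word = dict(zip(DATA_POSITIONS, data_bits))
    for p, group in GROUPS.items():
        word[p] = sum(word[i] for i in group) % 2  # even parity check bit
    return word

def decode(word):
    """Return (corrected data bits, syndrome); the syndrome is the error position."""
    syndrome = 0
    for p, group in GROUPS.items():
        if (word[p] + sum(word[i] for i in group)) % 2:
            syndrome += p
    fixed = dict(word)
    if 1 <= syndrome <= 12:
        fixed[syndrome] ^= 1  # apply the correction mask
    return [fixed[i] for i in DATA_POSITIONS], syndrome

data = [1, 0, 1, 1, 0, 0, 1, 0]
word = encode(data)
word[6] ^= 1            # SEU at position 6 of the coded word
print(decode(word))     # ([1, 0, 1, 1, 0, 0, 1, 0], 6)
```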
Hamming code can protect structures such as registers, register files and
memories, as presented in figure 2-25. Depending on the organization, one encoder and
one decoder can be multiplexed in time to protect many registers and memory elements.
Hamming code increases area by requiring additional storage cells (check bits), plus the
encoder and the decoder blocks. For an n-bit word, approximately log2(2n) additional
storage cells are needed. However, the encoder and decoder blocks may add a more significant
area increase, due to the extra XOR gates. Regarding performance, the delay of the
encoder and decoder blocks is added to the critical path. The delay becomes more critical as
the number of bits in the coded word increases. The number of XOR gates in series is
directly proportional to the number of bits in the coded word.
Figure 2-25. ECC in memory elements with a feedback refreshing to clean up the
SEUs: (a) Registers protected by ECC; (b) Memory protected by ECC (data words and check bits
with encoder, decoder and refreshing logic on the WR and RD paths)
In [41], a microcontroller protected by Hamming code was proposed. The ECC
was implemented in the datapath, memory arrays and control logic. Figure 2-26 shows
the microcontroller datapath.
Figure 2-26. Example of a microcontroller datapath protected by ECC (ROM and RAM memories,
PC, AD and ALU with encoder, decoder and refreshing blocks; all the registers are 12-bit,
coded by Hamming Code)
The limitation of the Hamming code is that it cannot correct double-bit upsets, which
can be very important in very deep sub-micron technologies, especially in memories,
because of the high density of the cells [65]. Other codes must be investigated to
cope with multiple bit upsets, whose probability of occurrence is increasing with the
advance in technology, as shown in [13]. Reed-Solomon [31] is an error-correcting coding
system that was devised to address the issue of correcting multiple errors. It has a wide
range of applications in digital communications and storage. Reed-Solomon codes are
used to correct errors in many systems, including storage devices, wireless or mobile
communications, high-speed modems and others. Reed-Solomon (RS) encoding and
decoding is commonly carried out in software, but an efficient RS code implementation
in hardware was presented by [53, 54] to protect memories against multiple SEUs.
When using ECC, it is appropriate to implement an interleaving technique, which
means that the bits of a word protected by the same check bits must not be placed
physically next to each other. This technique helps ensure that upsets in two
nearest-neighbor memory cells do not fall in the same check word, which would otherwise
produce a multiple bit upset in a word protected by a single-bit-correcting ECC. [54]
proposed a memory interleaving organization in which two ECCs are used, Reed-Solomon and
Hamming code, to ensure correction in the presence of massive multiple upsets, figure 2-27.
Figure 2-27. Example of interleaving in a memory protected by two ECCs [54]
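The effect of interleaving can be illustrated with a generic bit-interleaved placement. This
is a simple sketch of the principle, not the memory organization of [54]; the mapping and
the function names are assumptions made for the example.

```python
def physical_column(word_index, bit_index, n_words):
    """Bit-interleaved placement: bit j of every word sits side by side."""
    return bit_index * n_words + word_index

def owner(column, n_words):
    """Inverse mapping: (word_index, bit_index) of the cell stored in `column`."""
    return column % n_words, column // n_words

def upsets_per_word(first_column, burst_length, n_words):
    """Count how many cells of each word are hit by a burst of adjacent upsets."""
    counts = {}
    for col in range(first_column, first_column + burst_length):
        word_index, _ = owner(col, n_words)
        counts[word_index] = counts.get(word_index, 0) + 1
    return counts

# With 4 interleaved words, 4 adjacent upsets hit each word only once,
# so a single-error-correcting code per word can still correct them.
print(upsets_per_word(first_column=5, burst_length=4, n_words=4))  # {1: 1, 2: 1, 3: 1, 0: 1}
```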
2.3 Architectural level based techniques
Whenever the effect of a fault is detected at the circuit architecture level, the
circuit is already computing with some error, and in this case a computational recovery is
mandatory. Current microprocessors already maintain checkpoints
across tens of instructions for speculation recovery purposes, as discussed by [82].
This makes fault recovery suitable for today's microprocessors.
There are two general principles for recovery: forward and backward. Forward
error recovery means detecting an error and continuing on in time, while attempting to
mitigate the effects of the faults that may have caused the error. It implies the
constructive use of redundancy. For example, temporally or spatially replicated
messages, may be averaged or compared to compensate for lost or corrupted data.
Backward error recovery means detecting an error and retracting back to an earlier
system state or time. It includes the use of error detection (by redundancy, comparing
pairs or error-detecting codes) so evasive action can be taken. It subsumes rollback to an
earlier version of data or to an earlier system state. Backward error recovery may also
include fail-safe, fail-stop and graceful degradation modes that may yield a safe state,
degraded performance or a less complete functional behavior. There can be problems if
backward recovery is used in real-time systems. One problem is that interactions with the
environment cannot be undone. Another problem is how to meet the time requisites.
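Backward error recovery can be sketched as a checkpoint taken before each step and a
rollback when an error is detected. The state dictionary, the step function and the fault
sequence below are hypothetical; the transient is modeled as corrupting only the first
execution, so the re-execution on the same hardware succeeds.

```python
import copy

def run_with_rollback(state, steps, detect_error, max_retries=3):
    """Execute steps; on a detected error, restore the checkpoint and retry."""
    for step in steps:
        checkpoint = copy.deepcopy(state)  # saved before the step runs
        for _ in range(max_retries + 1):
            candidate = step(copy.deepcopy(checkpoint))
            if not detect_error(candidate):
                state = candidate          # commit the result
                break
        else:
            raise RuntimeError("unrecoverable error")
    return state

faults = iter([True, False, False])  # hypothetical transient: first execution corrupted

def add_one(s):
    s["acc"] += 1
    s["bad"] = next(faults, False)   # stands in for an error-detection flag
    return s

final = run_with_rollback({"acc": 0, "bad": False}, [add_one, add_one], lambda s: s["bad"])
print(final["acc"])  # 2
```

The retry loop also shows why timing is hard to guarantee: each detected error costs at
least one extra execution of the step.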
In summary, both forward and backward recovery approaches have advantages
and drawbacks. Forward recovery can meet timing requirements, but it needs inherent
redundancy for error detection and redundant computation. Backward recovery needs
only error detection, without inherent redundancy of the entire computation, because once
an error is detected the computation is performed again by the same
hardware; however, it is sometimes hard to meet timing requirements with this approach.
So the challenge is to detect the fault effect efficiently. One of the first options is
the concept of processing and checking in parallel the outputs of a system for only a
subset of its possible inputs, also called fingerprinting, as presented by [6]. It can be
applied to the general case of a circuit that must be hardened against soft errors, thus
providing tolerance against transient faults caused by pulses that affect parts of the
circuit, even when the duration of the transient pulse is longer than the delay of several
gates. In [49], concurrent error detection techniques for combinational logic blocks
are also presented. Figure 2-28 illustrates the general concept of using
infrastructure IP to check logic-block integrity. Checking a logic function requires a
predictor block to compute the input signature and a checker to compare the output and
input signatures. The challenge is in designing and implementing the most efficient
blocks while preserving performance and keeping cost down.
[Figure 2-28 block labels: inputs, Function f, Output Characteristic Prediction, checker, output, error]
Figure 2-28. Fingerprinting Technique
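The predictor and checker structure of Figure 2-28 can be sketched in Python for an adder. The choice of a mod-3 residue as the predicted output characteristic is an assumption made here for illustration; it is not claimed to be the scheme of [6] or [49]. The function names and the `fault_mask` argument are hypothetical.

```python
def adder(a, b, fault_mask=0):
    """8-bit adder with 9-bit result; fault_mask flips output bits (transient)."""
    return ((a + b) & 0x1FF) ^ fault_mask


def predict_residue(a, b):
    """Predictor: output characteristic (residue mod 3) derived from inputs only."""
    return ((a % 3) + (b % 3)) % 3


def checker(output, predicted):
    """Checker: True when the output signature disagrees with the prediction."""
    return (output % 3) != predicted


# A single flipped output bit changes the sum by a power of two, which is never
# a multiple of 3, so this check flags every single-bit output error.
error_free = checker(adder(100, 57), predict_residue(100, 57))
error_seen = checker(adder(100, 57, fault_mask=0x08), predict_residue(100, 57))
```

The predictor works on 2-bit residues instead of full operands, so it is much smaller than the adder it checks. The price is that some multi-bit errors (those that change the sum by a multiple of 3) escape detection.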
In contrast with other proposed solutions based on checker circuits, when
fingerprinting is applied the random checker does not provide full fault detection (figure
2-29). In this case, the random checker performs some of the functions of the main circuit
only on a small set of possible inputs, so it detects errors at the output statistically,
with a given probability. The main goal of this approach is to provide an acceptable level
of fault detection using a circuit that is significantly smaller than the main circuit under
inspection, thereby providing low area overhead. The underlying concept
is generic and can be adopted for several different applications or circuits; the
subset of inputs, the operations performed by the checker, and the performance, area, and
power overheads vary according to the application.
[Figure 2-29 block labels: inputs, main circuit, random checker, output, error]
Figure 2-29. Random checker technique
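The statistical nature of the random checker can be illustrated with a small Python sketch. The checked subset (inputs whose two low bits of `a` are zero), the adder used as main circuit and all names are assumptions chosen for illustration only.

```python
def main_circuit(a, b, fault_mask=0):
    """Main circuit under inspection: 8-bit adder with an optional output fault."""
    return ((a + b) & 0x1FF) ^ fault_mask


def random_checker(a, b, output):
    """Checks only inputs whose two low bits of `a` are zero; True on error."""
    if a & 0x3:                 # input outside the checked subset: no verdict
        return False
    return output != ((a + b) & 0x1FF)


def coverage(fault_mask):
    """Fraction of all 8-bit input pairs for which the fault is detected."""
    detected = sum(
        random_checker(a, b, main_circuit(a, b, fault_mask))
        for a in range(256)
        for b in range(256)
    )
    return detected / (256 * 256)


def detection_probability(p, n):
    """Probability of at least one detection in n independent operations."""
    return 1.0 - (1.0 - p) ** n


p_single = coverage(fault_mask=0x10)
p_after_8 = detection_probability(p_single, 8)
```

With this subset a fault is caught on one operation in four. If the fault stays active over several operations with independent inputs, the chance of at least one detection grows quickly, which is why a small checker can still give an acceptable detection level.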
Typically, embedded systems in safety-critical applications use watchdog
schemes, which under worst-case conditions detect an erroneous behavior only after a
long series of clock cycles. Subsequently, the error is repaired by interrupt and retry.
For applications that are also time-critical, such methods are too slow. If a rollback is
done several clock cycles after error detection, the system may have had time to write
large amounts of wrong data back into the system memory before compensation.
Micro rollback tries to recognize an error condition very early and provides error
correction within a few clock cycles. In [27], a micro rollback scheme was proposed
based on two separate processors, in which the backup processor (trailer) is delayed by
one or two clock cycles but performs exactly the same operations as the master processor.
Micro rollback restores the last error-free processor state by reloading all register
contents and re-executes the erroneous instruction on the master processor. In this
scheme, the trailer processor holds register contents long enough to re-establish the
original status. Such an approach is based on the implicit assumption that the trailer
processor is self-checking in order to identify the faulty device. The trailer also acts as a
backup in case of transient faults. Figure 2-30 exemplifies the micro-rollback, as
presented in [30].
Figure 2-30. Micro-rollback Example
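The master and trailer interaction can be sketched in Python as follows. This is a simplified model and not the design of [27] or [30]: the register file is a dictionary, every instruction is an addition, the trailer is assumed fault-free, and an error is detected by comparing the master with the trailer. All names are hypothetical.

```python
def execute(regs, instr):
    """Apply one (dest, src1, src2) add instruction; return the new register file."""
    dest, src1, src2 = instr
    new_regs = dict(regs)
    new_regs[dest] = (regs[src1] + regs[src2]) & 0xFF
    return new_regs


def run_with_micro_rollback(program, initial_regs, faults):
    """faults maps an instruction index to an XOR mask that hits the master once."""
    master = dict(initial_regs)
    trailer = dict(initial_regs)      # trailer lags the master by one instruction
    rollbacks = 0
    for index, instr in enumerate(program):
        checkpoint = dict(trailer)    # last error-free state, held by the trailer
        master = execute(master, instr)
        master[instr[0]] ^= faults.get(index, 0)   # transient upset in the master
        trailer = execute(trailer, instr)          # trailer repeats the instruction
        if master != trailer:                      # mismatch: error detected
            master = execute(checkpoint, instr)    # reload registers, re-execute
            rollbacks += 1
    return master, rollbacks


program = [("r2", "r0", "r1"), ("r3", "r2", "r2")]
start = {"r0": 3, "r1": 4, "r2": 0, "r3": 0}
final_regs, rollbacks = run_with_micro_rollback(program, start, faults={0: 0x10})
```

The corrupted value never reaches the second instruction, because the rollback happens within the same step in which the upset occurred.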
2.4 Area and Performance Tradeoffs Summary
Each technique discussed previously presents a different area and
performance overhead. It is possible to choose the most efficient one or combine them to
achieve the fault tolerance requirement for each type of application and system platform.
Table 2-1 shows the area and performance of each presented technique when
implemented in an 8-bit adder design with registered inputs and output, which contains
294 transistors to implement the combinational logic and 384 transistors to implement the
32 master-slave flip-flops. The area results are computed for a 90 nm technology (PTM
model). The performance is shown as a general equation based on the setup and
hold times and propagation delay of the flip-flops (tpffp), the propagation delay of the
adder (tplogic) and the propagation delays of the added blocks such as comparators
(tpcomparator), voters (tpvoter), encoding (tpenc) and decoding (tpdec) blocks and
others.
Table 2-1. Comparison of Area and Performance in an 8-bit adder case-study
circuit with registered inputs and outputs protected by SEE mitigation techniques.

Technique: Unprotected circuit
  Area:
    • Combinational logic: 294 transistors
    • Sequential logic: 384 transistors
    Area = 584.24 µm²
  Performance: Delay = tpffp + tplogic + tpffp
  Capability: None, only inherent masking

Technique: Entire circuit protected by Duplication with Comparison (DWC)
  Area:
    • 2x Combinational logic: 588 transistors
    • 2x Sequential logic: 768 transistors
    • Comparator for the 16-bit output: 156 transistors
    Area = 1,300.32 µm² (+122%)
  Performance: Delay = tpffp + tplogic + tpffp + tpcomparator
  Capability: SEU and SET detection

Technique: Triple Modular Redundancy (TMR) in the entire circuit with single voter at the output
  Area:
    • 3x Combinational logic: 882 transistors
    • 3x Sequential logic: 1152 transistors
    • Majority voters for the 16-bit output: 288 transistors
    Area = 1,996.92 µm² (+241%)
  Performance: Delay = tpffp + tplogic + tpffp + tpvoter
  Capability: SEU and SET correction, but the final voter can be upset

Technique: TMR in the entire circuit with triple voter at the output
  Area:
    • 3x Combinational logic: 882 transistors
    • 3x Sequential logic: 1152 transistors
    • 3x Majority voters for the 16-bit output: 864 transistors
    Area = 2,492.28 µm² (+326%)
  Performance: Delay = tpffp + tplogic + tpffp + tpvoter
  Capability: SEU and SET correction

Technique: Time redundancy in the output of the combinational logic with TMR in the registers
  Area:
    • 1x Combinational logic: 294 transistors
    • 3x Sequential logic: 1152 transistors
    • 2x Majority voters for the 8-bit inputs: 288 transistors
    • 1x Majority voters for the 16-bit output: 288 transistors
    • Considering delay (δ) as 16 transistors (chains of inverters)
    Area = 1,752.68 µm² (+199%)
  Performance: Delay = tpffp + tplogic + δ + δ + tpffp + tpvoter
  Capability: SEU and SET can be corrected; the added delay (δ) must be chosen
    according to the SET pulse width

Technique: Built-in Current Sensors in the combinational and sequential logic
  Area:
    • Combinational logic: 294 transistors
    • Sequential logic: 384 transistors
    • 33 Bulk-BICS: 1 for each 21 transistors
    Area = 789.79 µm² (+35%)
  Performance: Delay = tpffp + tplogic + tpffp
  Capability: SEU and SET detection

Technique: Hardened memory cells in the registers
  Area:
    • Combinational logic: 294 transistors
    • 2x Sequential logic: 768 transistors
    Area = 913.32 µm² (+56%)
  Performance: Delay = tpffp* + tplogic + tpffp*
    (tpffp* = delay of the hardened flip-flop)
  Capability: SEU correction only. No SET detection or correction

Technique: Error Correction Code such as Hamming code in the input and output of the registers
  Area:
    • Combinational logic: 294 transistors
    • Sequential logic: 384 + (4 + 4 + 5) parity bits x 12 transistors
    • 2x 8-bit encoding: 144 transistors
    • 2x 8-bit decoding: 288 transistors
    • 1x 16-bit encoding: 144 transistors
    • 1x 16-bit decoding: 288 transistors
    Area = 1,460.28 µm² (+150%)
  Performance: Delay = tpenc + tpffp + tpdec + tplogic + tpenc + tpffp + tpdec
  Capability: SEU correction only. No SET detection or correction

Technique: Recomputing with Shifted or Swapped operands
  Area:
    • Combinational logic: 294 transistors
    • Sequential logic: 384 transistors
    • 2x 8-bit Multiplexers 2:1: 64 transistors
    • Comparator for the 16-bit output: 156 transistors
    Area = 772.28 µm² (+32%)
  Performance: Delay = 2 x (tpffp + tpmux + tplogic + tpffp) + tpcomp
  Capability: SEU and SET detection
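The delay expressions of Table 2-1 can be compared numerically once delay values are chosen. The Python sketch below evaluates them; every numeric value is a placeholder picked only for illustration (none comes from the source), and tpcomp is assumed to be the same comparator delay that the table also calls tpcomparator.

```python
def technique_delays(tpffp, tplogic, tpcomp, tpvoter, delta,
                     tpenc, tpdec, tpmux, tpffp_hard):
    """Evaluate the delay expressions of Table 2-1 (all values in picoseconds)."""
    return {
        "unprotected": tpffp + tplogic + tpffp,
        "dwc": tpffp + tplogic + tpffp + tpcomp,
        "tmr": tpffp + tplogic + tpffp + tpvoter,
        "time_redundancy": tpffp + tplogic + delta + delta + tpffp + tpvoter,
        "bulk_bics": tpffp + tplogic + tpffp,
        "hardened_cells": tpffp_hard + tplogic + tpffp_hard,
        "hamming_ecc": tpenc + tpffp + tpdec + tplogic + tpenc + tpffp + tpdec,
        "recomputing": 2 * (tpffp + tpmux + tplogic + tpffp) + tpcomp,
    }


# Placeholder delays in picoseconds (assumptions, not measured data).
delays = technique_delays(tpffp=100, tplogic=600, tpcomp=120, tpvoter=80,
                          delta=200, tpenc=90, tpdec=150, tpmux=40,
                          tpffp_hard=140)

for name, delay in sorted(delays.items(), key=lambda item: item[1]):
    print(f"{name:16s} {delay:5d} ps")
```

Whatever values are used, the structure of the equations already shows the trend: spatial redundancy (DWC, TMR) adds only one comparator or voter delay, while recomputing more than doubles the cycle time in exchange for its small area.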
3. Radiation Effects on FPGAs
Field-Programmable Gate Arrays (FPGAs) are configurable integrated circuits
based on a regular structure with high logic density, which can be customized by the end
user to realize different designs. The FPGA architecture is based on an array of logic
blocks and interconnections customizable by programmable switches.
Several different programming technologies are used to implement the
programmable switches. Three types of programmable switch technology are
currently in use:
• SRAM, where the programmable switch is usually a pass transistor or
multiplexer controlled by the state of an SRAM bit (SRAM-based FPGAs).
• Antifuse, where an electrically programmable switch forms a low-resistance
path between two metal layers (antifuse-based FPGAs).
• EPROM, EEPROM or FLASH cell, where the switch is a floating-gate
transistor that can be turned off by injecting charge onto the floating gate.
Customizations based on SRAM are volatile. This means that SRAM-based
FPGAs can be reprogrammed as many times as necessary at the work site and that they
lose their contents when the memories are not connected to the power
supply. The antifuse customizations are non-volatile, so they hold the customized
content even when not connected to the power supply, and they can be programmed only
once. Each FPGA has a particular architecture. Programmable logic companies such as
Xilinx, Actel, Aeroflex (licensed for QuickLogic FPGAs), Atmel and Honeywell (licensed
for Atmel FPGAs) offer radiation-tolerant FPGA families. Each company uses different
mitigation techniques to better take into account the architecture characteristics.
3.1 Antifuse-based FPGAs
The Actel RTAX-S family is an example of antifuse-based FPGAs for space
applications. It consists of a regular matrix of combinational (C-cells) and
sequential (R-cells) cells surrounded by regular routing channels, as shown in figure 3-1.
All the customizations of the routing and of the C-cells and R-cells are done by an antifuse
element (programmable switch). Results from radiation ground testing have shown that
the programmable switches, based on either ONO (oxide-nitride-oxide) or MIM
(metal-insulator-metal) technology, are tolerant to ionization and total dose effects [81].
Therefore, the customizable routing is not sensitive to SEU; only the flip-flops used to
implement the sequential logic of the user design are sensitive to SEU.
Figure 3-1. ACTEL: RTAX-S device
The R-cell is composed of a Triple Modular Redundancy (TMR) flip-flop, or DFF,
with a wired-or voter at the output, as presented in figure 3-2. This makes the R-cell
robust to SEUs. However, at high frequency operation, SETs can be observed [11].
Because of the number of transistors in an R-cell, several points are susceptible
to Single Event Transients (SETs).
As discussed previously, some of these SETs may be propagated through the
logic and captured by the R-cell, where all three DFFs share the same data, clock, enable,
and reset lines. Because of this, a glitch appearing on one of these lines during a clock
edge will most likely appear as the same value to all of the DFFs and will not be
mitigated. As the system clock frequency increases, so does the probability of capturing
the SET. As the number of levels of combinatorial logic between DFFs increases, the
probability of generating a SET increases. The user may protect the C-cells by using
high-level mitigation techniques in the description of the design (TMR, duplication and
others).
Figure 3-2. ACTEL: RTAX-S device
In [11], radiation tests performed on an Actel RTAX-S device showed the
influence of the frequency on the error cross-section. The case-study architectures (shift
registers) are illustrated in figure 3-3. The number of logic levels between two flip-flops
was chosen as 0, 4, or 8 inverter gates.
Figure 3-3. Shift registers implemented at ACTEL: RTAX-S device for radiation
ground testing [11]
At each LET, several tests were performed at various frequencies on all of the shift
register string types [11]. As the frequency increased, the error cross-section increased, as
seen in figure 3-4. This is due to the probability of SET propagation and capture. A shift
register string containing hardened (TMR) DFFs and combinatorial logic between these
hardened flip-flops should present errors only when SETs in the combinational logic are
captured by the TMR DFFs. The higher the frequency, the higher the probability of
capturing the SET.
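The frequency dependence can be made concrete with a first-order latching-window model. This model is an assumption introduced here for illustration and is not taken from [11]: a transient is captured if it overlaps the setup and hold window around a clock edge, so the capture probability is roughly (pulse width + setup + hold) divided by the clock period. The timing values are placeholders.

```python
def set_capture_probability(pulse_width_ns, freq_mhz, setup_ns=0.1, hold_ns=0.1):
    """First-order probability that a SET arriving at a flip-flop input is latched."""
    period_ns = 1000.0 / freq_mhz
    window_ns = pulse_width_ns + setup_ns + hold_ns
    return min(1.0, window_ns / period_ns)


# Placeholder pulse width of 0.8 ns evaluated at three clock frequencies.
probabilities = {f: set_capture_probability(0.8, f) for f in (15, 75, 150)}
```

Under this model the capture probability scales linearly with the clock frequency until it saturates, which agrees qualitatively with the rising cross-section described above.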
Figure 3-4. ACTEL: RTAX-S device test when using the shift register with 8
inverters between flip-flops [11]
The RadHard Eclipse FPGA is another example of an antifuse-based FPGA. It is
provided by Aeroflex, which uses QuickLogic Corporation’s licensed ESP (Embedded
Standard Products) technology. Its architecture is also composed of a regular matrix of
configurable logic cells, each composed of logic and flip-flops, surrounded by a regular
routing matrix, as illustrated in figure 3-5. All the customizations are done by a
programmable switch called the ViaLink connector. The device is fabricated in a 0.25 µm
five-layer-metal ViaLink CMOS process.
The CLB flip-flops are SEU-hardened, which makes the CLB robust to
SEU as well. However, the CLB logic can be susceptible to SETs that propagate
through the logic and are captured by one of the flip-flops. High-level fault-tolerant
techniques can also be implemented to mitigate SETs in designs synthesized into these
types of FPGAs.
Figure 3-5. RadHard Eclipse FPGA from Aeroflex
3.2 SRAM-based FPGAs
SRAM-based FPGAs are very attractive due to high density, high performance,
low NRE (Non-Recurring Engineering) cost, fast turnaround time and reconfigurability.
For space and remote applications, SRAM-based FPGAs can offer additional
benefits by allowing in-orbit design changes thanks to reconfigurability, which can
reduce the mission cost by correcting errors or improving system performance after
launch. In addition, the same circuitry can be used with different configurations at
different stages of a mission, reducing weight and power requirements. Also, if part of an
FPGA fails, the circuitry can be reprogrammed to make use of the remaining functional
portions of the chip.
Xilinx FPGAs have an array of configurable logic blocks (CLBs)
surrounded by programmable input/output blocks (IOBs), all interconnected by a
hierarchy of fast and versatile routing resources. Each CLB has a set of look-up tables
(LUTs), multiplexers and flip-flops, which are divided into slices. A LUT is a logic
structure able to implement a Boolean function as a truth table. The CLBs provide the
functional elements for constructing logic while the IOBs provide the interface between
the package pins and the CLBs. The CLBs are interconnected through a general routing
matrix (GRM) that comprises an array of routing switches located at the intersections of
horizontal and vertical routing channels. The FPGA matrix also has dedicated memory
blocks called Block SelectRAMs, clock DLLs for clock-distribution delay compensation
and clock domain control, and other components that vary according to the FPGA family.
Virtex devices are quickly programmed by loading a configuration bitstream
(collection of configuration bits) into the device. The device functionality can be changed
at any time by loading a new bitstream. The bitstream is divided into frames and
contains all the information needed to configure the programmable storage elements in
the matrix: those located in the look-up tables (LUTs) and flip-flops, the CLB
configuration cells and the interconnections.
Figure 3-6 shows a general Xilinx FPGA architecture, where each matrix tile is a
configurable logic block (CLB) with the logic slices and the general routing matrix
(GRM). The characteristics of the CLB logic and slices vary with the
FPGA family.
Owing to process technology evolution, FPGAs have entered the nanometer
era. As shown in figure 3-7, the latest families, Virtex4 and Virtex5, are fabricated in 90
nm and 65 nm, respectively [86]. This evolution has allowed high logic integration.
It is now possible to implement millions of gates and data memory in a single
FPGA. In addition, there are families with embedded hard microprocessor cores, such as
the VirtexII-Pro family with a PowerPC connected to the customizable array. The CLBs and
interconnection structures have also evolved in the past decade (figure 3-8).
Figure 3-6. Example of SRAM-based FPGA architecture based on regular array
Figure 3-7. Evolution of Xilinx FPGA families in the last decade
Figure 3-8. CLB logic evolution in the last decade
CLBs used to contain a small number of 4-input LUTs, each able to
implement any 4-input Boolean function, as for example in the Virtex family.
Today a CLB can contain a large number of 4-input LUTs, as in the Virtex4 family, or
even 6-input LUTs, each able to implement any 6-input Boolean function,
as in the most recently released Virtex5 family. The interconnection structures located in
the GRM have also improved in the last decade, reducing delay and increasing the
performance of the implemented designs.
All this evolution has increased the interest in using SRAM-based FPGAs for a
wide range of applications, but at the same time it has made it necessary to analyze
carefully the soft error susceptibility of these highly complex structures. The effect of soft
errors in the FPGA architecture on the implemented designs must be evaluated in order to
implement efficient fault-tolerant techniques.
In FPGAs, a soft error has a peculiar effect on the user logic design, since the
combinational and sequential logic are mapped into the programmable architecture.
Remember that in an ASIC, the effect of a soft error in either the combinational or the
sequential logic is transient; the only variation is the duration of the fault. A fault in
the combinational logic creates a transient logic pulse (SET) in a node that can propagate
through the logic according to the logic delay and topology. In other words,
a SET in the combinational logic may or may not be latched by a flip-flop placed at
the combinational logic output. Faults in the sequential logic (SEU) manifest themselves
as bit flips, which remain in the flip-flop until the next input load.
On the other hand, in an SRAM-based FPGA, both the user’s combinational and
sequential logic are implemented by customizable logic memory cells, in other words,
SRAM cells, as represented in figure 3-9. SEUs can occur in all SRAM cells, for example
in the ones that configure the LUTs, control the CLB configuration, control the routing
(GRM) and others.
When a SEU occurs in a memory cell that configures the LUT, it flips one of the
stored values, modifying the implemented combinational logic. This fault has a permanent
effect on the user logic and can only be corrected at the next load of the configuration
bitstream, when the LUT is configured again with the original Boolean function
defined by the user.
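A LUT upset can be modeled in a few lines of Python. The class below is a simplified sketch: the 16 configuration cells of a 4-input LUT are a list, the input-to-index ordering is an assumption, and `reconfigure` stands in for reloading the bitstream. None of the names correspond to a vendor API.

```python
class LUT4:
    """4-input look-up table whose truth table lives in 16 SRAM configuration cells."""

    def __init__(self, truth_table):
        self.golden = list(truth_table)   # contents defined by the bitstream
        self.cells = list(truth_table)    # live SRAM cells, exposed to upsets

    def evaluate(self, a, b, c, d):
        return self.cells[(a << 3) | (b << 2) | (c << 1) | d]

    def upset(self, index):
        self.cells[index] ^= 1            # SEU: one configuration bit flips

    def reconfigure(self):
        self.cells = list(self.golden)    # reload of the configuration bitstream


and4 = LUT4([0] * 15 + [1])               # 4-input AND function
and4.upset(5)                             # flips the entry for A,B,C,D = 0,1,0,1
wrong_now = and4.evaluate(0, 1, 0, 1)
wrong_later = and4.evaluate(0, 1, 0, 1)   # still wrong: the effect is persistent
and4.reconfigure()
fixed = and4.evaluate(0, 1, 0, 1)
```

Unlike a transient in ASIC combinational logic, the wrong output reappears every time the affected input combination is applied, until the configuration is reloaded.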
When a SEU occurs in a memory cell that controls the CLB configuration, as
shown in figure 3-9, the multiplexer controlled by the affected memory cell changes its
connection, and the original connection is undone. This also has a permanent effect,
which can be mapped to an open or a short circuit in the user combinational logic
implemented by the FPGA. The fault will also be corrected at the next load of the
configuration bitstream, when the original configuration is loaded to the CLB control
memory cells.
[Figure 3-9 labels: user design logic (E1, E2, E3, clk) mapped to an FPGA CLB slice; LUT with F inputs A, B, C, D; SRAM configuration cells, one of them marked "upset"]
Figure 3-9. SEU Sensitive Bits in the CLB Slice
When a SEU occurs in a memory cell that controls the routing (GRM), as shown
in figure 3-10, it may affect the multiplexer or the pass transistors responsible for
connecting the logic. The SEU has a permanent effect that can be mapped to an open or a
short circuit in the user combinational logic implemented by the FPGA. The fault will
also be corrected at the next load of the configuration bitstream, when the original
configuration is reloaded into the affected memory cells.
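The effect of a routing upset can be sketched with a toy routing multiplexer in Python. The net names reuse the E1, E2, E3 labels of figure 3-9; the 4:1 multiplexer, its unconnected fourth input and the function name are assumptions for illustration.

```python
def route(sources, select_bits):
    """Routing multiplexer: configuration bits pick which source net drives the sink."""
    index = 0
    for bit in select_bits:
        index = (index << 1) | bit
    return sources[index]


sources = ["E1", "E2", "E3", None]        # None models an unconnected input
golden_select = [0, 1]                    # intended connection: E2 drives the sink

intended = route(sources, golden_select)
wrong_net = route(sources, [0, 0])        # upset in the second bit: E1 instead of E2
open_net = route(sources, [1, 1])         # upset in the first bit: nothing connected
```

One flipped select bit either connects the sink to a different net or leaves it undriven, which is how a single configuration upset maps to a short or an open circuit in the user logic.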
When an upset occurs in a CLB flip-flop or in the embedded memory, it has a
transient effect, because the bit-flip is corrected at the next load of the flip-flop or
when new data is stored in the memory. All these effects are discussed in more
detail in [34].
Figure 3-10. Examples of upsets in the SRAM-based FPGA architecture in the
general routing matrix (GRM)
Radiation tests performed on Xilinx FPGAs, presented in [10], [62, 63], [78] and
[80], show the effects of SEU on the design application and confirm the need for
fault-tolerant techniques in space applications. A fault-tolerant system designed into
SRAM-based FPGAs must be able to cope with the peculiarities mentioned in this
section, such as transient and permanent effects of a SEU in the combinational logic, short
and open circuits in the design connections, and bit flips in the flip-flops and memory cells.
Results presented in [62, 63] show multiple bit upsets (MBUs) in Virtex SRAM-based
FPGAs. These results are very relevant because they determine the probability that an
MBU overcomes the mitigation techniques applied in these devices. Results show that MBU
events are not as common in the Virtex family; most Virtex resources have about 10% of
the MBU events observed in VirtexII and Virtex4. The only resource in all three families
that does not follow this pattern is the BRAM, because of its high density. Figure
3-11 (a and b) shows the normalized percentage of MBU events by resource [62, 63].
The normalized percentages are determined by the ratio of the number of MBU
events to all events for the resource. A comparison of the normalized values indicates that
IOBs are very sensitive to MBUs. For the Virtex-II and Virtex-II Pro families, IOBs are
nearly as sensitive as CLBs to MBUs. Events of five bits and larger were observed in
Virtex-4. In summary, [62, 63] have shown that, due to technology scaling,
MBUs are 27–33 times more common in the Virtex-II and Virtex-II Pro families than in
the earlier Virtex family. MBU events are nearly three times more likely in the Virtex-4
family (fabricated in a 90 nm process) than in the Virtex-II and Virtex-II Pro
families (fabricated in a 130 nm process) and 69 times more likely than in the
Virtex family (fabricated in a 220 nm process).
(a) Virtex family, in 0.22µm process technology [63]
(b) VirtexII family, in 0.13µm process technology [63]
Figure 3-11. Percentage of MBU events in all events induced by heavy ion radiation
for each resource in the Xilinx FPGAs
Concerning single event effects, a set of the results presented in [28] is shown in
figure 3-12. The graph in figure 3-12(a) shows the upset sensitivity for any physical
bit in the configuration bitstream, largely dominated by the configurable logic blocks
(CLBs). The data for the CLBs and BlockRAM (BRAM) are
shown separately. The Virtex-4 data look very much like those of the Virtex-II Pro. The
BlockRAM cells (open symbols) have a small but consistently higher susceptibility than
the CLBs (filled symbols) in the knee region of the curve on a per-bit basis.
In addition to single event upsets (SEUs), complex devices like the Virtex-4 are
susceptible to single-event functional interrupt (SEFI) modes. These are upsets to a
control circuit that disable large portions of the device's function. From studies of prior
Virtex FPGA generations, SEFI modes can be expected to involve the power-on-reset
(POR) circuit, failures of the JTAG or SelectMAP communication ports, or others. A
possible configuration clock (CCLK) upset observed in the Virtex-II Pro device was the
only Virtex SEFI mode seen so far that required a power cycle to recover. All other modes
could be recovered by simply reloading the configuration. At the time of writing, [28] had
studied only the POR SEFI and looked for modes requiring a power cycle. SEFI results
for the POR are shown in figure 3-12(b), from [28].
(a) BRAM and CLB sensitive parts: VirtexII versus Virtex4
(b) SEFI sensitivity for the Power-On-Reset (POR)
Figure 3-12. Virtex-4 static SEU cross sections for three device types [28]
Note that single event transients (SETs) are also possible in the
combinational logic used to build the CLB, such as the input and output multiplexers used
to control part of the routing and the LUTs, as shown in figure 3-13.
Evaluating SET propagation in a design implemented in an FPGA relies on the
analysis of the SET propagation from a LUT through a chain of pass transistors and
multiplexers along the routing until it reaches a CLB flip-flop. The sensitivity of each node
must be evaluated according to its capacitance and logic connection.
Figure 3-13. SET propagation in SRAM-based FPGA
Figure 3-14 shows a LUT structure based on pass transistors and a multiplexer
also based on a pass-transistor tree. Both structures have a valid path that is defined by
the inputs at a given time. The nodes sensitive to SET are the drains of all transistors that
are in the off state. For the selected paths in figure 3-14 and the stored values, there are few
sensitive points, as indicated. In the case of the LUT that is propagating a ‘0’ value, only
SETs that charge the node need to be analyzed. These are the ones generated by
ionization in the drain of the off-state PMOS transistors placed on the same selected
path.
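[Figure 3-14 labels: SRAM LUT with inputs A, B, C, D and stored configuration values; SRAM routing multiplexer with stored select values; propagated values ‘0’ and ‘1’ on the selected paths]
Figure 3-14. SET propagation in the internal LUT and routing multiplexers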
For the multiplexer that is propagating a ‘1’ value, only SETs that discharge the
node need to be analyzed. These are the ones generated by ionization in the drain of the
off-state NMOS transistors placed on the selected path.
4. Radiation Hardening by Design: Strategies for SRAM-based FPGAs
Designers can protect the design at the high-level description (VHDL or Verilog)
level by using some sort of redundancy targeting the FPGA architecture. The most
popular high-level SEU mitigation technique used today to protect designs
synthesized in SRAM-based FPGAs is TMR combined with scrubbing. Xilinx has
released a tool called X-TMR that automatically implements TMR in the user
description, but users can also implement TMR in their designs themselves. However,
due to the high area overhead of TMR, some alternative solutions have been proposed
in recent years, so the user has the flexibility to implement duplication and
self-checking techniques instead of TMR. These techniques may compromise the fault
tolerance to some extent, but the final result may be acceptable for a set of applications.
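The core of TMR is a bitwise majority voter over three redundant copies. The Python sketch below models it on 16-bit words; the example value and the masks are arbitrary and serve only as illustration.

```python
def majority(a, b, c):
    """Bitwise 2-out-of-3 majority vote over three redundant words."""
    return (a & b) | (a & c) | (b & c)


word = 0xBEEF

# One upset copy is outvoted by the two good copies.
single_fault = majority(word, word ^ 0x0040, word)

# Upsets in different bits of different copies are still masked bit by bit.
two_faults_different_bits = majority(word ^ 0x0001, word ^ 0x0040, word)

# The same bit upset in two copies defeats the vote.
two_faults_same_bit = majority(word ^ 0x0040, word ^ 0x0040, word)
```

The last case shows why TMR is combined with scrubbing: upsets that are allowed to accumulate in the configuration memory can eventually corrupt the same signal in two of the three domains.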
In this way, it is possible to use a commercial FPGA part to implement the design,
with the soft error mitigation technique applied to the design description before it is
synthesized into the FPGA. The user has the flexibility of choosing the fault-tolerant
technique and consequently the overheads in terms of area, performance and power
dissipation. Figure 4-1 exemplifies the design flow of a general circuit implemented in an
FPGA.
One very important step of the design flow is the validation of the fault tolerance
technique, which is usually done by fault injection. The original bitstream configured into
the FPGA can be modified by a circuit or by a tool on the computer that flips the
bits of the bitstream, one at a time. Each flip emulates a SEU in the configuration memory
cells. The output of the design under test (DUT) can be constantly monitored to analyze
the effect of the injected fault on the design. If an error is detected, the
fault-tolerant technique implemented is not robust for that specific fault (SEU) in that
target configuration memory bit.
It is possible to inject faults in all the configuration bits and to analyze the most
critical parts of the design [67]. This can help guide designers in the early stages of the
development process to choose the most appropriate fault-tolerant design, even before
any radiation ground testing. An entire fault injection campaign can take from a few
hours to days, depending on the number of bits to be flipped and on the
connection to the fault injection control circuit. When the entire system (fault injection
control + DUT + golden designs) is implemented at the hardware level (board), avoiding
communication with the computer, the process is sped up by orders of magnitude.
Figure 4-1. FPGA mitigation design flow by editing the design hardware description
language and the fault injection approach used to validate the design.
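The fault injection loop described above can be sketched in Python on a toy design. This is only a software model under stated assumptions: the "bitstream" is the list of LUT configuration bits, routing bits are not modeled, and the TMR voter is assumed to be fault-free. All function names are hypothetical.

```python
import itertools


def lut_eval(cells, inputs):
    """Read the LUT entry addressed by the input bits."""
    index = 0
    for bit in inputs:
        index = (index << 1) | bit
    return cells[index]


def dut_plain(config, inputs):
    """Unprotected design: a single LUT."""
    return lut_eval(config, inputs)


def dut_tmr(config, inputs):
    """TMR design: three LUT copies followed by a (fault-free) majority voter."""
    n = len(config) // 3
    votes = [lut_eval(config[i * n:(i + 1) * n], inputs) for i in range(3)]
    return int(sum(votes) >= 2)


def fault_injection_campaign(dut, golden_config, n_inputs):
    """Flip each configuration bit in turn; return the bits that cause an error."""
    vectors = list(itertools.product((0, 1), repeat=n_inputs))
    expected = [dut(golden_config, v) for v in vectors]
    critical_bits = []
    for bit in range(len(golden_config)):
        faulty = list(golden_config)
        faulty[bit] ^= 1                      # emulate one SEU in the bitstream
        if [dut(faulty, v) for v in vectors] != expected:
            critical_bits.append(bit)
    return critical_bits


xor2 = [0, 1, 1, 0]
critical_plain = fault_injection_campaign(dut_plain, xor2, n_inputs=2)
critical_tmr = fault_injection_campaign(dut_tmr, xor2 * 3, n_inputs=2)
```

In the unprotected design every configuration bit is critical, while no single flip in the TMR version reaches the output. A real campaign would also find critical bits in the routing and in the voters, which this model leaves out.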
As discussed in [37], configurable logic blocks (CLBs), which are composed of
lookup tables (LUTs) for logic generation, storage elements, multiplexers, and carry
logic, together with the customizable routing account for by far the largest number of
configurable bits in each device. However, FPGA devices contain other important
functional blocks that can also be upset by radiation, and when this occurs the effects can
be catastrophic. Consequently, the susceptibility of these functional blocks must also be
analyzed and mitigation techniques must be applied. Examples are: Digital Clock
Managers (DCMs), which provide phase-locked, skew-corrected clock signals to all parts
of the chip; Phase-Matched Clock Dividers (PMCDs), which offer additional frequency
division options; the configuration controller circuit; the power-on-reset (POR) circuitry;
Input/Output Blocks (IOBs), which implement 28 common single-ended or differential (in
pairs) I/O standards with digitally controlled impedance; XtremeDSP (DSP48) slices, each
containing a dedicated 18x18-bit multiplier, adder, and 48-bit accumulator; and other
specialized blocks. Table 4-1 presents a summary of SEE issues and possible SEU
mitigation solutions that have been presented in [37].
Table 4-1. Representative Xilinx Virtex Family Potential Types of Device
SEE Sensitivity from [37]

FPGA component: Configuration Memory
  SEE issues: Single and multiple bit errors corrupting circuit operation,
    causing bus conflicts (current creep), etc.
  Possible SEU mitigations: Scrubbing; Partial reconfiguration; Partitioned design

FPGA component: Configuration Controller
  SEE issues: Improper device configuration can occur if hit during
    configuration/reconfiguration
  Possible SEU mitigations: Multiple chip voting (Redundancy by using multiple devices)

FPGA component: CLB
  SEE issues: Logic hits and propagated upsets caused by transients
  Possible SEU mitigations: Triple modular redundancy (TMR); Acceptable error rates

FPGA component: BRAM
  SEE issues: Memory upsets in user area
  Possible SEU mitigations: TMR; Error Detection and Correction (EDAC) scrubbing

FPGA component: Half-latches
  SEE issues: Sensitive structure used in configuration/routing
  Possible SEU mitigations: Removal of half-latches from design

FPGA component: POR
  SEE issues: SEUs on POR can cause inadvertent reboot of device
  Possible SEU mitigations: Multiple chip voting (Redundancy by using multiple devices)

FPGA component: IOB
  SEE issues: SEUs can cause false outputs to other devices or inputs to logic
  Possible SEU mitigations: Leverage Immune Config. Memory cell; Evaluate input SET
    propagation

FPGA component: DCM
  SEE issues: Can cause clock errors that spread across clock cycles
  Possible SEU mitigations: TMR; Temporal TMR

FPGA component: DSP
  SEE issues: Hard IP that is unhardened that can cause single event functional
    interrupts (SEFIs) or data errors
  Possible SEU mitigations: TMR; Temporal TMR

FPGA component: MGT
  SEE issues: Gigabit transceivers. Hits in logic can cause bursts or SEFIs. O/w bit
    errors in data stream
  Possible SEU mitigations: TMR; Protocol re-writes

FPGA component: PPC
  SEE issues: Hard IP that is unhardened. SEFIs are prime concern
  Possible SEU mitigations: TMR or software task redundancy

FPGA component: SEL
  SEE issues: Higher current condition that is potentially damaging
  Possible SEU mitigations: No mitigation other than substrate addition (epi).
    Circumvention techniques possible
4.1 Scrubbing
Hardware redundancy by itself is not sufficient to avoid errors in the FPGA: the
bitstream must be reloaded continuously to avoid the accumulation of faults. This
continuous reloading of the bitstream is called scrubbing. Scrubbing, as explained in
Xilinx Application Notes 138 and 151, allows a system to repair bit-flips in the
configuration memory without disrupting its operation. The configuration memory includes
the memory cells that configure the LUTs, the cells that control the routing (the general
routing matrix, GRM) and the cells that hold the CLB customization. Configuration scrubbing
prevents the build-up of multiple configuration faults and reduces the time during which
an invalid circuit configuration is allowed to operate. Scrubbing does not refresh the
contents of the CLB flip-flops or of the embedded memories (the Block SelectRAMs).
Scrubbing is performed through the Virtex SelectMAP interface. Furthermore, systems that
rely on redundancy-based mitigation techniques such as TMR must employ configuration
scrubbing before any reliability enhancement is observed: without scrubbing, the build-up
of multiple faults would eventually break the redundancy.
It is recommended to scrub at least 10 times faster than the worst-case SEU rate.
While the FPGA is being scrubbed, an external oscillator generates the configuration
clock that drives the FPGA and the PROM that contains the “gold” bitstream. At each clock
cycle new data are available on the PROM data pins. The frequency at which scrubbing must
be performed depends on the particle flux and on the cross-section of the device.
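As a rough illustration of the 10X rule, the sketch below estimates the configuration
upset rate as flux times per-bit cross-section times number of configuration bits, and
derives the longest scrub period that is still 10 times shorter than the mean time
between upsets. The function names and all numeric values are hypothetical placeholders
chosen for the example; they are not data for any real device or environment.

```python
def upset_rate(flux, sigma_bit, n_bits):
    """Expected upsets per second.

    flux      : particles / (cm^2 * s)   (assumed worst-case value)
    sigma_bit : cross-section per configuration bit, in cm^2 / bit
    n_bits    : number of configuration bits in the device
    """
    return flux * sigma_bit * n_bits


def max_scrub_period(rate, margin=10.0):
    """Longest scrub period (s) that is `margin` times shorter than 1 / rate."""
    if rate <= 0:
        raise ValueError("upset rate must be positive")
    return 1.0 / (margin * rate)


# Placeholder numbers, for illustration only.
FLUX = 1.0e-2        # particles / (cm^2 * s)
SIGMA_BIT = 1.0e-8   # cm^2 / bit
N_BITS = 1_000_000   # configuration bits

rate = upset_rate(FLUX, SIGMA_BIT, N_BITS)
period = max_scrub_period(rate)
print(f"upset rate ~ {rate:.1e} upsets/s, scrub at least every {period:.0f} s")
```

With these placeholder values the mean time between upsets is about 10^4 s, so the
bitstream would be reloaded at least once every 10^3 s.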
For system-on-chip (SoC) platforms, the Hardware Internal Configuration Access
Port (HWICAP) module can also be used to reconfigure parts of the configuration matrix
from inside the FPGA, under the control of the embedded processor (hard-core PowerPC or
soft-core MicroBlaze). The ICAP is able to load partial bitstreams and configure them
without interrupting the application. It implements a subset of the SelectMAP interface
and it generates no noise during reconfiguration. The ICAP module is connected to the
embedded processor through the available local bus (OPB), and the EDK tool can be used
for that task.
4.2 Triple Modular Redundancy
Triple Modular Redundancy (TMR) is a well-known fault tolerant technique for
avoiding errors in integrated circuits. The TMR scheme uses three identical logic blocks
performing the same task in tandem, with corresponding outputs being compared through
majority voters (MAJ). Since all the customizable memory cells are sensitive to soft
errors, single points of failure must be avoided inside the FPGA. Consequently, all the
inputs and outputs must also be triplicated. If an upset occurs in one of the IO cells or
in the routing that connects the logic from/to the IO blocks, there are two other
redundant inputs or outputs able to ensure the correct value. The voter is also
triplicated because, if one voter fails, there are two other voters able to maintain the
correct value in two of the redundant logic parts.
This full TMR, also called X-TMR, is especially suitable for protecting designs
synthesized in SRAM-based Field Programmable Gate Arrays (FPGAs), as proposed by
[16]. Figure 4-2 shows the full TMR that must be applied to the user design logic before
it is synthesized into the Xilinx FPGA. The CLB flip-flops (user sequential logic) are
triplicated, with triple majority voters (MAJ) and a feedback connection that is able to
restore the correct data in the flip-flops. This setup is important because scrubbing is
not able to restore the correct value of a CLB flip-flop. The majority voter defines the
correct output as the value held by at least two of its three inputs, as defined in the
truth table in figure 4-2. The very last output voter, which can be placed at the output
of CLB flip-flops or combinational logic blocks, is different from the MAJ voter. Note
that there are three output signals that go to the IO pads. Each one is controlled by a
tri-state buffer. The redundant logic part that holds an error should not pass the error
to the output, so the tri-state buffer should block the faulty redundant part. The TMR
output voter drives the tri-state buffer control based on one reference value (the input
coming from one of the redundant logic parts) and the other two inputs coming from the
other redundant logic parts, as shown in the truth table in figure 4-2. Each voter can be
implemented in a LUT.
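The two voter functions described above can be sketched in a few lines. The code below
is only a behavioral model: it enumerates the eight input combinations of each voter and
prints the resulting truth-table column, which can be compared with the 8-bit LUT
patterns listed in figure 4-2 (each pattern is repeated twice there to fill a 16-bit
LUT). The function names are hypothetical, and the polarity of the output voter (1 means
that the tri-state buffer blocks the pad driver) is an assumption inferred from the truth
table, not something stated explicitly in the text.

```python
from itertools import product


def maj(r0, r1, r2):
    """Majority voter: 1 when at least two of the three inputs are 1."""
    return (r0 & r1) | (r0 & r2) | (r1 & r2)


def output_voter(ref, other_a, other_b):
    """Tri-state control for one pad (assumed polarity: 1 = block the driver).

    The driver is blocked only when the reference domain disagrees with
    both of the other redundant domains, i.e. when it is the faulty one.
    """
    return int(ref != other_a and ref != other_b)


def lut_string(fn):
    """Truth-table column of a 3-input function, rows ordered 000 ... 111."""
    return "".join(str(fn(*bits)) for bits in product((0, 1), repeat=3))


print("MAJ     :", lut_string(maj))
print("O_voter :", lut_string(output_voter))
```

Under these assumptions the two printed columns are 00010111 and 00011000, matching the
halves of the LUT contents given in the figure.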
[Figure 4-2: block diagram of the TMR scheme in the FPGA. Triplicated inputs from the
package pins feed the redundant logic domains tr0, tr1 and tr2 through TMR flip-flops.
Each TMR flip-flop is built from three CLB flip-flops (clocked by clk0, clk1 and clk2)
with MAJ voters. Three TMR output voters (O_voter) control the tri-state buffers
3-state_0, 3-state_1 and 3-state_2 that drive the output package pin. Truth tables: MAJ
voter, inputs R0 R1 R2, LUT: 00010111_00010111; TMR output voter, inputs REF and the two
other redundant signals, LUT: 00011000_00011000.]
Figure 4-2. TMR implemented in FPGA
Since all the outputs are connected together outside the FPGA device, this
connection works as an analog voter where the majority prevails. As a result, even if one
output voter fails, the output still shows the correct value.
4.3 Duplication with Comparison and Concurrent Error Detection
The TMR technique is a suitable solution for FPGAs because it provides full
hardware redundancy, including the user’s combinational and sequential logic, the
routing, and the I/O pads. However, this full hardware redundancy comes with penalties
in area, I/O pad count and power dissipation. Many applications can accept the
limitations of the TMR approach but some cannot. Aiming to reduce the pin overhead of a
full hardware redundancy implementation (TMR), while at the same time coping with
permanent upset effects, a technique based on duplication with comparison (DWC) and
concurrent error detection (CED) was proposed by [34] (figure 4-3). The CED must be able
to identify which redundant logic part is fault free. The CED should occupy a smaller
area than the redundant logic block, so that the area overhead is lower than that of
X-TMR.
[Figure 4-3: block diagram of the DWC-CED scheme in the FPGA. Inputs from the package
pins feed the duplicated redundant logic domains dr0 and dr1 through TMR flip-flops. CED
blocks (logic able to detect which redundant logic, dr0 or dr1, is fault free) follow the
redundant logic and drive the output package pins. Each TMR flip-flop is built from three
CLB flip-flops (clocked by clk0, clk1 and clk2) with MAJ voters. Truth table: MAJ voter,
inputs R0 R1 R2, LUT: 00010111_00010111.]
Figure 4-3. Duplication with Comparison and Error Concurrent Detection technique (DWC-CED) for SRAM-based FPGAs
The CED scheme can be based on time redundancy. In this case, the computation is
performed twice, in two different ways, in order to detect permanent faults. During the
first computation, at the first clock cycle, the operands are used directly in the
combinational block and the result is stored for later comparison. During the second
computation, at the second clock cycle, the operands are modified prior to use, in such a
way that errors resulting from permanent faults in the combinational logic differ between
the first and the second calculation and can be detected when the results are compared.
These modifications are seen as encoder and decoder processes and they depend on the
characteristics of the logic block. The general scheme is presented in figure 4-4.
Figure 4-4. General Time redundancy scheme for permanent fault detection
Figure 4-5 shows the scheme proposed for an arithmetic module, for instance a
multiplier. There are two multiplier modules: mult_dr0 and mult_dr1. Multiplexers at the
inputs provide either normal or shifted operands. The outputs computed from the normal
operands are always stored in a sample register, one for each module. Each output goes
directly to the input of the user’s TMR register: module dr0 connects to register tr0 and
module dr1 connects to register tr1. Register tr2 receives the output of the module that
does not have any fault. By default, the circuit starts by passing the output of module
dr0. A comparator at the outputs of registers dr0 and dr1 indicates when the outputs
mismatch (Hc). If Hc=0, no error is found and the circuit continues to operate normally.
If Hc=1, an error is detected and the operation needs to be recomputed using the RESO
(recomputing with shifted operands) method to identify the faulty module. The detection
takes one clock cycle. While the circuit performs the detection, the user’s TMR register
holds its previous value. When the fault-free module is found, register tr2 receives the
output of this module and it continues to receive this output until the next chip
reconfiguration (fault correction).
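The detection sequence described above can be sketched behaviorally as follows. This is a
minimal software model, not the circuit of [34]: the permanent fault is modeled as one
product bit stuck at 0, the encoder shifts operand A left by one bit and the decoder
shifts the product right by one bit, and the names make_multiplier, reso_check and
dwc_ced are hypothetical. Operand widths are unbounded here, whereas a hardware
implementation needs an extra bit to hold the shifted operand.

```python
def make_multiplier(stuck_at_zero_bit=None):
    """Model a multiplier; optionally force one product bit to 0 (permanent fault)."""
    def mult(a, b):
        product = a * b
        if stuck_at_zero_bit is not None:
            product &= ~(1 << stuck_at_zero_bit)
        return product
    return mult


def reso_check(mult, a, b):
    """True when the normal and the shifted-operand computations agree."""
    normal = mult(a, b)
    shifted = mult(a << 1, b) >> 1  # encoder: A shifted left; decoder: result shifted right
    return normal == shifted


def dwc_ced(mult_dr0, mult_dr1, a, b):
    """Return (value passed to register tr2, name of the faulty module or None)."""
    out0, out1 = mult_dr0(a, b), mult_dr1(a, b)
    hc = out0 != out1               # comparator between the two modules
    if not hc:
        return out0, None           # no mismatch: keep passing dr0
    tc0 = reso_check(mult_dr0, a, b)
    tc1 = reso_check(mult_dr1, a, b)
    if tc0 and not tc1:
        return out0, "dr1"
    if tc1 and not tc0:
        return out1, "dr0"
    raise RuntimeError("fault could not be located with this operand pair")


good = make_multiplier()
bad = make_multiplier(stuck_at_zero_bit=2)
print(dwc_ced(good, bad, 3, 5))
print(dwc_ced(bad, good, 3, 5))
```

In both calls the mismatch is detected, the module with the stuck bit is identified, and
the product of the fault-free module (15) is the value selected for register tr2.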
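[Figure 4-5: block diagram of the CED for an arithmetic block. Operands A and B reach the
modules dr0 and dr1 either directly or through encoders, selected by multiplexers
controlled by ST0 and ST1. The module outputs pass through decoders and comparators (=)
that generate the signals Hc, Tc0 and Tc1, from which the CED selects the fault-free
module.]
Figure 4-5. Case-study CED based on encoder and decoder for arithmetic logic blocks.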
4.4 Placement and Routing Issues
The problem of using fault tolerance techniques based on redundancy and
majority voters is that one must ensure that an SEU cannot affect more than one redundant
domain of the design [40], [64]. If an SEU is able to affect two domains of a redundant
design, the majority voter is not able to choose the correct result out of three, and
errors can appear at the design output. The only way a single fault can affect more than
one redundant domain is by upsetting the SRAM cells that control the routing connections.
Upsets in the routing represent the main concern, as 90% of the SRAM cells
inside the FPGA are responsible for routing control. The main effects of an upset in the
routing are open lines and short circuits between distinct lines, as discussed
previously. The probability of SEUs in the routing upsetting more than one redundant
domain depends on the logic placement and the number of majority voters in the design.
In figure 4-6, there are two examples of upsets in the routing. Upset “a” connects
two signals from the same redundant domain, which does not generate an error in the
TMR output, because the outermost voters will vote out the upset effect. However, upset
“b” may provoke an error in the TMR output, because it connects two signals from distinct
redundant logic blocks, affecting two out of three redundant domains of the TMR. In the
next sections three solutions will be discussed to improve reliability in this matter.
Figure 4-6. SEU in the routing affecting two distinct redundant logic parts
4.4.1 Solutions based on Placement and Routing
Dedicated floorplanning for each redundant part of the TMR can reduce the
probability of upsets in the routing affecting two or more logic modules, but it may not
be sufficient, since placement can be too complex in some cases. Remember that each time
voters are included, there are connections between the redundant parts, which makes it
impossible to place the redundant logic parts very far away from each other with no
connections at all (figure 4-7). One solution is the Reliability-Oriented Place and Route
algorithm (RoRA) proposed by [76, 77], a place and route algorithm for SRAM-based FPGAs
that applies specific placement and routing rules in order to harden every circuit mapped
on SRAM-based FPGAs against SEUs in their configuration memory cells. Routing duplication
can also be a solution to improve reliability in TMR. In [33], a method is proposed to
duplicate the routing locally inside the CLB to avoid problems with open and short
circuits provoked by SEUs in the routing.
Figure 4-7. Majority voter placement in the TMR approach
4.4.2 Solutions based on Voting Adjustments
The first voting adjustment was proposed by [35]. It consists of partitioning the
logic in order to add more voter stages to the circuit. If the redundant logic parts tr0,
tr1 and tr2 (represented in figure 4-6 after the TMR register with voters and refresh)
are partitioned into smaller logic blocks with voters, a connection between signals from
distinct redundant parts can be voted out by different voters. This logic partition by
voters is represented in figure 4-8. Notice that now the upset “b” cannot provoke an
error in the TMR output, which increases the robustness of the TMR in the presence of
routing upsets without relying on floorplanning. The problem is to find the size of the
logic blocks that achieves the best robustness. If the logic is partitioned into very
small blocks, the number of voters increases dramatically, causing an overly costly TMR
implementation. The objective is to find the best partition in terms of area cost,
performance and robustness.
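The effect of adding a voter stage can be illustrated with a very small behavioral
model. The sketch below is a simplification made for this example only: each redundant
domain is a chain of two one-bit logic blocks (inverters), a routing upset is modeled as a
bit flip on the output of a given block in a given domain, and a single maj() call stands
in for the triplicated voters. The names run_tmr, UPSET_A and UPSET_B are hypothetical.

```python
def maj(a, b, c):
    """Majority voter over three bits."""
    return (a & b) | (a & c) | (b & c)


def block(x):
    """Placeholder logic block: a single inverter."""
    return x ^ 1


def run_tmr(x, faults, partitioned):
    """Evaluate a two-block TMR chain.

    faults      : set of (block_index, domain) pairs whose output bit is flipped
    partitioned : when True, a voter stage is inserted after the first block
    """
    domains = [x, x, x]
    for s in range(2):
        domains = [block(v) ^ int((s, d) in faults) for d, v in enumerate(domains)]
        if partitioned and s == 0:
            voted = maj(*domains)      # intermediate voter stage
            domains = [voted] * 3
    return maj(*domains)               # outermost voter


UPSET_A = {(0, 0), (1, 0)}  # two signals of the same domain (tr0)
UPSET_B = {(0, 0), (1, 1)}  # signals of two distinct domains (tr0 and tr1)

for x in (0, 1):
    golden = block(block(x))
    print(x,
          run_tmr(x, UPSET_A, partitioned=False) == golden,
          run_tmr(x, UPSET_B, partitioned=False) == golden,
          run_tmr(x, UPSET_B, partitioned=True) == golden)
```

In this model upset “a” is always masked, upset “b” defeats the unpartitioned TMR, and
the same upset “b” is masked once the two affected signals are separated by a voter
stage, because each voter then sees only one faulty domain.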
The results presented by [35] suggest that there is a trade-off between the
partition of the throughput logic (and consequently the number of voters) and the number
of routing upsets that could provoke an error in the TMR. Contrary to what was expected,
a larger number of voters does not always mean greater protection against upsets. There
is an optimal logic partition for each circuit that can reduce the propagation of the
upset effect in the routing.
Figure 4-8. Triple Modular Redundancy (TMR) scheme with logic partition in the
FPGA
4.5 Partial Triple Modular Redundancy
A partial TMR mitigation strategy was proposed in [52, 61]. It is based on the
idea that some parts of the circuit are more critical than others, so not all logic
blocks need to be protected by TMR. Some of the non-critical blocks can be corrected by
periodic scrubbing alone.
The idea rests on the observation that sensitive configuration bits can be separated
into two categories, called “persistent” and “non-persistent” [52, 61], shown in figure
4-9. A non-persistent configuration bit is a sensitive configuration bit that causes a
functional error in the design when upset; once it is repaired through configuration
scrubbing, the design returns to normal operation and all previously induced functional
errors eventually disappear. No additional intervention is required to return the circuit
to normal functionality. A persistent configuration bit is a sensitive configuration bit
that also causes a functional error when upset; however, after the upset configuration
bit is repaired through configuration scrubbing, the FPGA circuit does not return to
normal operation. This is due to errors stored in flip-flops inside feedback loops, which
cannot be corrected by scrubbing. In this case, a global reset is needed to return the
circuit to a proper state, or normal operation. This global reset takes the circuit
offline for the time needed to reset the circuit and start up in normal operating mode.
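The difference between the two categories can be reproduced with a toy simulation. The
sketch below is an illustrative assumption, not the method of the cited works: a
feed-forward circuit computes input + 1, a feedback circuit is a counter that adds 1 to
its own register, and an upset configuration bit makes the adder add 2 instead of 1 until
scrubbing repairs it. The function name simulate and its parameters are hypothetical.

```python
def simulate(cycles, upset_cycles, has_feedback):
    """Return the cycles in which the output differs from the fault-free output.

    upset_cycles : cycles during which the configuration bit is corrupted
                   (scrubbing repairs it afterwards)
    has_feedback : True for a counter (state in a feedback loop),
                   False for a purely feed-forward adder
    """
    state = 0         # register inside the feedback loop
    golden_state = 0  # same register in a fault-free circuit
    wrong = []
    for t in range(cycles):
        step = 2 if t in upset_cycles else 1  # corrupted logic adds 2 instead of 1
        if has_feedback:
            state += step
            golden_state += 1
            out, golden = state, golden_state
        else:
            out, golden = t + step, t + 1     # input at cycle t is t itself
        if out != golden:
            wrong.append(t)
    return wrong


print("feed-forward:", simulate(10, {3, 4}, has_feedback=False))
print("feedback    :", simulate(10, {3, 4}, has_feedback=True))
```

The feed-forward circuit is wrong only while the bit is upset (non-persistent), whereas
the counter stays wrong after the repair because the error is held in its register
(persistent) until a reset restores the state.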
In [52, 61], it is proposed that the feedback structures of the design should be
mitigated first because they are more critical. Any logic feeding into the feedback
structures should follow, since it contributes to the state of the design and thus to the
persistence. The feed-forward logic (the non-persistent circuit components) does not
contribute to the persistence of a design and should be mitigated last. The extent of the
mitigation depends on the application and on the expected mean time between failures
(MTBF).
Figure 4-9. Example of non-persistent and persistent upset defined at [61].
5. Final Remarks
This manuscript has explored fault tolerant techniques to protect integrated
circuits against soft errors. A set of hardening by design solutions for application specific
circuits (ASICs) and for field programmable gate arrays (FPGAs) was presented and
discussed.
The main challenge for ASICs is to have techniques able to work with the new
paradigm for nanometer technologies: the occurrence of transient pulses with duration
longer than the cycle time of the circuits, which may affect one or more bits of the
circuit output, and of multiple faults, which together make most of the currently known
mitigation techniques obsolete. For memory cells, the traditional solutions such as ECCs
and hardened memory cells can still be applied, but the probability of multiple upsets
must be taken into account.
The main challenge for FPGAs is to characterize the user design sensitivity to soft
errors once the design is implemented in the SRAM-based FPGA, and to define the most
efficient redundancy that can be applied within limited area resources. The effect of
soft errors in a user logic design synthesized in an SRAM-based FPGA was analyzed in
detail. Triple Modular Redundancy (TMR) and Duplication with Comparison and Concurrent
Error Detection (DWC-CED) techniques were presented. Also, issues related to the
placement and routing of redundant blocks inside the FPGAs were discussed and some
solutions were presented.
In summary, there is no hardening by design solution that is totally efficient for
all types of circuits, applications and environments. It is important to characterize
very well the sensitivity to soft errors of your target design and application, and then
choose a set of fault tolerant solutions that will work properly in your design. The
ideal solution for a reliable system may combine solutions applied at different steps of
your design process: layout constraints, transistor-level redundancy, logic-level
solutions, recomputation and system-level approaches.
References
[1]
Alexandrescu, D., Anghel, L., Nicolaidis, M., “New methods for evaluating the
impact of single event transients in VDSM ICs”, in IEEE International
Symposium on Defect And Fault Tolerance in VLSI Systems, DFT, 17., 2002. p.
99-107.
[2]
Anghel, A., Nicolaidis, M., “Cost Reduction and Evaluation of a Temporary
Faults Detecting Technique”, in Proc. DATE, IEEE Computer Society, 2000, p.
591-598.
[3]
Anghel, L., Alexandrescu, D., Nicolaidis, M., “Evaluation of a soft error tolerance
technique based on time and/or space redundancy”, in the Proceedings of
Symposium on Integrated Circuits and Systems Design, SBCCI, 13., 2000.
Proceedings… Los Alamitos : IEEE Computer Society, 2000. p. 237-242.
[4]
Asadi, G., Tahoori, M., “An Accurate SER Estimation Method Based on
Propagation Probability Design”, in Proceedings of Automation and Test in
Europe Conference, DATE, 2005.
[5]
Athan, S., Landis, D., Al-Arian, S., “A Novel Built-in Current Sensor for IDDQ
Testing of Deep Submicron CMOS ICs”, in Proceedings of 14th VLSI Test
Symposium, 1996. pp 118-123.
[6]
Austin, T. “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture
Design”, in MICRO32 - Proceedings of the 32nd ACM/IEEE International
Symposium on Microarchitecture, pages 196-207, Los Alamitos, CA, November,
1999.
[7]
Barth, J., “Applying Computer Simulation Tools to Radiation Effects Problems”,
in: IEEE Nuclear Space Radiation Effects Conference Short Course, NSREC,
1997.
[8]
Baumann, R., “The impact of technology scaling on soft error rate performance
and limits to the efficacy of error correction”, Electron Devices Meeting, 2002.
IEDM '02. Digest. International, Dec., 2002, p. 329-332.
[9]
Baumann, R.; Smith, E., “Neutron-induced boron fission as a major source of soft
errors in deep submicron SRAM devices”, in: Proceedings of IEEE International
Reliability Physics Symposium, 38., IEEE Computer Society, 2000.
[10]
Berg, M., “Fault Tolerance Implementation within SRAM Based FPGA Design
Based upon the Increased Level of Single Event Upset Susceptibility”, in IEEE
International On-line Test Symposium, IOLTS, 2006. pp. 89-91.
[11]
Berg, M., Wang, J.J., Ladbury, R., Buchner, S., Kim, H., Howard, J., LaBel, K.,
Phan, A., Irwin, T., Friendlich, M., “An Analysis of Single Event Upset
Dependencies on High Frequency and Architectural Implementations within Actel
RTAX-S Family Field Programmable Gate Arrays”, IEEE Transactions On
Nuclear Science, VOL. 53, NO. 6, Dec., 2006. p. 3569- 3574.
[12]
Bessot, D.; Velazco, R., “Design of SEU-hardened CMOS memory cells: the HIT
Cell”, in European Conference on Radiation and Its Effects on Components and
Systems, RADECS, 2., 1993. p. 563-570.
[13]
Buchner, S.; Campbell, A.; Meehan, T.; Clark, K.; Mcmorrow, D.; Dyer, C.;
Sanderson, C.; Comber, C.; Kuboyama, S., “Investigation of Single-Ion Multiple-Bit
Upsets in Memories on Board a Space Experiment”, IEEE Transactions on
Nuclear Science, Vol. 47, Issue 3, pp. 705-711, June 2000.
[14]
Calin, T., Nicolaidis, M., Velazco, R., “Upset hardened memory design for
submicron CMOS technology”, IEEE Transactions on Nuclear Science, New
York, v.43, n.6, p. 2874 -2878, Dec. 1996.
[15]
Canaris, J.; Whitaker, S., “Circuit techniques for the radiation environment of
space”, in the Proceedings of Custom Integrated Circuits Conference, 1995. p. 77-80.
[16]
Carmichael, C., Triple Module Redundancy Design Techniques for Virtex®
Series FPGA: Application Notes 197. San Jose, USA: Xilinx, 2000.
[17]
Cazeaux, J., Rossi, D., Omaña, M., Metra, C., Chatterjee, A., “On Transistor
Level Gate Sizing for Increased Robustness to Transient Faults”, in Proceedings
of International On-line Test Symposium, IOLTS, 2005.
[18]
Chen, C., Hsiao, M., “Error-Correcting Codes for Semiconductor Memory
Applications: A State-of-the-Art Review,” IBM J. Res. Develop., Vol. 28, pp.
124-134, Mar. 1984.
[19]
Crain, S. et al., “Analog and digital single-event effects experiments in space”,
IEEE Transactions on Nuclear Science, New York, v.48, n.6, Dec. 2001.
[20]
Dhillon, Y., Diril, A., Chatterjee, A., Singh, A., “Analysis and Optimization of
Nanometer CMOS Circuits for Soft-Error Tolerance”, IEEE Transactions On
Very Large Scale Integration (VLSI) Systems, VOL. 14, NO. 5, May 2006. pp.
514-524.
[21]
Dodd, P. E., et al., “Production and propagation of Single-Event Transients in
High-Speed Digital Logic ICs”, IEEE Transactions on Nuclear Science, Vol 51,
No 6, Part 2, IEEE Computer Society, Los Alamitos, CA, December 2004, pp
3278-3284.
[22]
Dodd, P. E., Massengill, L. W. “Basic Mechanism and Modeling of Single-Event
Upset in Digital Microelectronics”, IEEE Transaction on Nuclear Science, vol.
50, June, 2003, pp. 583-602.
[23]
Dodd, P., “Physics-Based Simulation of Single-Event Effects”, IEEE Transactions
On Device And Materials Reliability, VOL. 5, NO. 3, Sept. 2005. pp.343-457.
[24]
Dupont, E.; Nicolaidis, M.; Rohr, P., “Embedded robustness IPs for transient-error-free
ICs”. IEEE Design & Test of Computers, New York, v.19, n.3, May-June 2002, p. 54-68.
[25]
Ferlet-Cavrois, V. et al., “Statistical Analysis of the Charge Collected in SOI and
Bulk Devices Under Heavy Ion and Proton Irradiation - Implications for Digital
SETs”, IEEE Transactions on Nuclear Science, Vol 53, No 6, Part 1, IEEE
Computer Society, Los Alamitos, CA, December 2006, pp 3242-3252.
[26]
Gadlage, M. J., Schrimpf, R. D., Benedetto, J. M., Eaton, P. H., Mavis, D. G.,
Sibley, M., Avery, K., and Turflinger, T. L., “Single Event Transient Pulsewidths
in Digital Microcircuits”, IEEE Transactions on Nuclear Sciences, Vol. 51, No 6,
Part 2, IEEE Computer Society, Los Alamitos, CA, December 2004, pp. 3285-3290.
[27]
Galke, C., Pflanz, M., Vierhaus, H., “On-line Detection and Compensation of
Transient Errors in Processor Pipeline-Structures”, in Proceedings of the
International On-line Test Symposium, IOLTS, 2002.
[28]
George, J., Koga, R., Swift, G., Allen, G., Carmichael, C., Tseng, C., “Single
Event Upsets in Xilinx Virtex-4 FPGA Devices”, IEEE Radiation Effects Data
Workshop, 2006, p.109 – 114.
[29]
Henes, E., Vieira, M., Ribeiro, I., Wirth, G., Kastensmidt, F. L., “Using Bulk
Built-in Current Sensors in Combinational and Sequential Logic to Detect Soft
Errors”, IEEE Micro, IEEE Computer Society, v. Set-Ou, p. 10-18, 2006.
[30]
Hertwig, A., Hellebrand, S., Wunderlich, H., “Fast Self-Recovering Controllers”,
in Proceedings of 16th IEEE VLSI Test Symposium, 1998.
[31]
Houghton, A. D. “The Engineer’s Error Coding Handbook”. Londres: Chapman
& Hall, 1997.
[32]
Johnston, A., “Scaling and Technology Issues for Soft Error Rates”, in
Proceedings of 4th Annual Research Conference on Reliability, Stanford
University, October 2000.
[33]
Kastensmidt, F. L., Kinzel Filho, C., Carro, L., “Improving Reliability of SRAM
Based FPGAs by Inserting Redundant Routing”, IEEE Transactions on Nuclear
Science, New York, v. 53, n. 4, 2006. p. 2060-2068.
[34]
Kastensmidt, F., Neuberger, G., Carro, L., Reis, R., Hentschke, R., “Designing
Fault-Tolerant Techniques for SRAM-based FPGAs”, IEEE Design and Test of
Computers (D&T), v.21, n.6, Dec., 2004.
[35]
Kastensmidt, F., Sterpone, L., Carro, L., Sonza Reorda, M., “On the Optimal
Design of Triple Modular Redundancy Logic for SRAM-based FPGAs”, in the
Proceedings of Design Automation and Test in Europe (DATE), IEEE, 2005.
[36]
Label, K. et al., “A roadmap for NASA's radiation effects research in emerging
microelectronics and photonics”, in Proceedings of IEEE Aerospace Conference,
2000, IEEE Computer Society, 2000. p. 535-545.
[37]
LaBel, K., Berg, M., Black, D., Robinson, W., Jordan, A., “Trade Space Involved
with Single Event Upset (SEU) and Transient (SET) Handling of Field
Programmable Gate Array (FPGA) Based Systems”, 2006 Workshop on
Hardened Electronics and Radiation Technology, HEART, 2006.
[38]
Lacoe, R., “CMOS Scaling Design Principles and Hardening-by-Design
Methodologies,” IEEE NSREC Short Course, 2003.
[39]
Leray, J., “Earth and Space Single-Events in Present and Future Electronics”, in
European Conference on Radiation and Its Effects on Components and Systems,
RADECS, 6., 2001. Short Course. IEEE Computer Society, 2001.
[40]
Lima, F., Carmichael, C., Fabula, J., Padovani, R., Reis, R., “A fault injection
analysis of Virtex® FPGA TMR design methodology”, in European Conference
on Radiation and Its Effects on Components and Systems, RADECS, 2001. pp.
275-282.
[41]
Lima, F., Cota, E., Carro, L., Lubaszewski, M., Reis, R., Velazco, R., Rezgui, S.,
“Designing a radiation hardened 8051-like micro-controller”, in Proceedings of
IEEE Symposium on Integrated Circuits and Systems Design, SBCCI, 13., 2000.
pp. 255-260.
[42]
Lisboa, C. A., Erigson, M. I., Carro, L., “System Level Approaches for Mitigation
of Long Duration Transient Faults in Future Technologies”, in Proceedings of the
12th IEEE European Test Symposium, ETS, 2007.
[43]
Lisbôa, C. A., Schüler, E., Carro, L., “Going Beyond TMR for Protection Against
Multiple Faults”, in Proceedings of the 18th Symposium on Integrated Circuits
and Systems Design, SBCCI, 2005, pp. 80-85.
[44]
Liu, M.N., Whitaker, S., “Low power SEU immune CMOS memory circuits”,
IEEE Transactions on Nuclear Science, New York, v.39, n.6, p. 1679-1684, Dec.
1992.
[45]
Messenger, G. C., “Collection of Charge on Junction Nodes from Ion Tracks”,
IEEE Transactions on Nuclear Sciences, vol. NS-29, pp. 2024-2031, 1982.
[46]
Michels, A., Petroli, L., Lisboa, C. L., Kastensmidt, F. L., Carro, L., “SET Fault
Tolerant Combinational Circuits Based on Majority Logic”, in Proceedings of
IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems,
DFT, 2006.
[47]
Mikkola, E., Vermeire, B. Barnaby, H. J., Parks, H. G., and Borhani, K., “SET
Tolerant CMOS Comparator”, IEEE Transactions on Nuclear Science, vol. 51,
no. 6, pp 3609-3614, IEEE Computer Society, New York-London, December,
2004.
[48]
Mitra, S., Zhang, M., Waqas, S., Seifert, N., Gill, B., Kim, K., “Combinational
Logic Soft Error Correction”, in Proceedings of International Test Conference,
ITC, 2006.
[49]
Mitra, S., Mccluskey, E., “Which Concurrent Error Detection Scheme To
Choose?”, in: IEEE International Test Conference, ITC, 2002.
[50]
Mohanram, K., “Soft Error Failure Rate Estimation in Combinational Logic
Circuits”, Proceedings of the 6th Latin-American Test Workshop, Salvador.,
Brazil, 2005. pp.181-186.
[51]
Morgan, K., Caffrey, M., Graham, P., Johnson, E., Pratt, B., Wirthlin, M.,
“SEU-induced persistent error propagation in FPGAs”, IEEE Transactions on Nuclear
Science, December 2005.
[52]
Neuberger, G. , Kastensmidt, F. L., Carro, L., Reis, R. “A Multiple Bit Upset
Tolerant SRAM Memory”, Transactions on Design Automation of Electronic
Systems, TODAES, v.8, 2003. pp.577-590.
[53]
Neuberger, G., Kastensmidt, F. L., Reis, R., “Designing an Automatic Technique
for Optimization of Reed-Solomon Codes to Improve Fault-tolerance in
Memories”, IEEE Design and Test of Computers, USA, v. 22, n. 1, 2005. p.50-58.
[54]
Neves, C., Henes Neto, E. C., Ribeiro, I., Wirth, G., Kastensmidt, F. L., Guntzel,
J., “Automatic Evaluation of Single Event Transient Propagation in CMOS Logic
Circuits Based on Topological Timing Analysis”, in Proceedings of Latin-American
Test Workshop, LATW, 2006. p. 49-54.
[55]
Nicolaidis, M., "Time redundancy based soft-error tolerance to rescue nanometer
technologies", in Proc. VLSI Test Symposium, IEEE Computer Society, 1999, pp.
86-94.
[56]
Nieuwland, A., Jasarevic, S., Jerin, G., “Combinational Logic Soft Error Analysis
and Protection”, in Proceedings of IEEE International On-line Testing Symposium,
IOLTS 2006.
[57]
Normand, E., “Correlation of in-flight neutron dosimeter and SEU measurements
with atmospheric neutron model”, IEEE Transactions on Nuclear Science, New
York, v.48, n.6, p. 1996-2003, Dec. 2001.
[58]
Normand, E., “Single event upset at ground level”, IEEE Transactions on Nuclear
Science, New York, v.43, n.6, p. 2742 -2750, Dec. 1996.
[59]
O'bryan, M. et al., “Compendium of Single Event Effects Results for Candidate
Spacecraft Electronics for NASA”, in IEEE Nuclear and Space Radiation Effects
Conference, NSREC, 2006.
[60]
Omana, M., Papasso, G., Rossi, D., Metra, C., “A Model for Transient Fault
Propagation in Combinatorial Logic”, in Proceedings of the 9th IEEE
International On-Line Testing Symposium, IOLTS, 2003.
[61]
Pratt, B., Caffrey, M., Graham, P., Morgan, K., Wirthlin, M., “Improving FPGA
Design Robustness with Partial TMR”, 44th Annual IEEE International
Reliability Physics Symposium Proceedings, 2006. p. 226 – 232.
[62]
Quinn, H., Graham, P., "Terrestrial-Based Radiation Upsets: A Cautionary Tale,"
IEEE Symposium on Field-Programmable Custom Computing Machines, 2005.
[63]
Quinn, H.; Graham, P.; Krone, J.; Caffrey, M.; Rezgui, S., “Radiation-induced
multi-bit upsets in SRAM-based FPGAs”, in IEEE Transactions on Nuclear
Science, Vol. 52, Issue 6, Dec. 2005. pp. 2455 – 2461.
[64]
Rebaudengo, M., Sonza Reorda, M., Violante, M., “Simulation-based Analysis of
SEU effects of SRAM-based FPGAs”, in the Proceeding of Field Programmable
Logic, FPL, 2002. Los Alamitos : IEEE Computer Society, 2002. p. 607-615.
[65]
Reed, R. A., Carts, M. A., Marshall, P. W.;Musseau, O., Mcnulty, P. J., Roth, D.
R., Buchner, S., Melinger, J., Corbiere, T., “Heavy Ion and Proton Induced Single
Event Multiple Upsets”, IEEE Transactions on Nuclear Science, Vol. 44, Issue 6,
pp. 2224-2229, December 1997.
[66]
Rejimon, T., Bhanja, S., “A Timing-Aware Probabilistic Model for Single-Event-Upset
Analysis”, IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, Vol. 14, No. 10, Oct. 2006, pp. 1130-1139.
[67]
Reorda, S., Sterpone, L., Violante, M., “Efficient estimation of SEU effects in
SRAM-based FPGAs”, 11th IEEE International On-Line Testing Symposium,
2005. p. 54 – 59.
[68]
Rockett, L. R., “A design based on proven concepts of an SEU-immune CMOS
configurable data cell for reprogrammable FPGAs”, Microelectronics Journal,
Elsevier, v.32, p. 99-111, 2000.
[69]
Rockett, L. R., “An SEU-hardened CMOS data latch design”, IEEE Transactions
on Nuclear Science, New York, v.35, n.6, p. 1682-1687, Dec. 1988.
[70]
Rossi, D., Omaña, M., Toma, F. and Metra, C., “Multiple Transient Faults in
Logic: An Issue for Next Generation ICs ?”, in Proceedings of the 20th IEEE
International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT
2005), IEEE Computer Society, Los Alamitos, CA, October 2005, pp. 352-360.
[71]
Schüler, E., Carro, L., “Reliable Circuits Design Using Analog Components”, in
Proceedings of the 11th Annual IEEE International Mixed-Signals Testing
Workshop – IMSTW 2005, Volume 1, IEEE Computer Society, Cannes, June 27-29,
2005, pp. 166-170.
[72]
Semiconductor Industry Association. International Technology Roadmap for
Semiconductors – ITRS 2005, last access May 25, 2006.
http://www.itrs.net/Common/2005ITRS/Home2005.htm.
[73]
Shirvani, P., Saxena, N., McCluskey, E., “Software Implemented EDAC
Protection Against SEUs”, Center for Reliable Computing, May 2001.
[74]
Shivakumar, P. et al., “Modelling the Effect of Technology Trends on the Soft
Error Rate of Combinational Logic”. In: International Conference on Dependable
Systems and Networks. 2002.
[75]
Srinivasan, G. R., “Modeling the Cosmic-Ray-Induced Soft-Error Rate in
Integrated Circuits: An Overview”. IBM Journal of Research and Development,
Vol. 40, No. 1, 1996, pp. 77-90.
[76]
Sterpone, L., Reorda, M.S., Violante, M., “RoRA: a reliability-oriented place and
route algorithm for SRAM-based FPGAs”, Research in Microelectronics and
Electronics, 2005, Volume 1, 2005. p.173 – 176.
[77]
Sterpone, L., Violante, M., “A design flow for protecting FPGA-based systems
against single event upsets”, in the Proceedings of 20th IEEE International
Symposium on Defect and Fault Tolerance in VLSI Systems (DFT’05), October 3-5,
2005, pp. 436-444.
[78]
Swift, G. M., Rezgui, S., George, J., Carmichael, C., “Dynamic Testing of Xilinx
Virtex-II Field Programmable Gate Array (FPGA) Input/Output Blocks (IOBs)”,
IEEE Transactions on Nuclear Science, VOL. 51, NO. 6, 2004.
[79]
Velazco, R. et al., “Two CMOS memory cells suitable for the design of SEU-tolerant
VLSI circuits”, IEEE Transactions on Nuclear Science, New York, v.41,
n.6, p. 2229–2234, Dec. 1994.
[80]
"Virtex-II Static Characterization", Xilinx Single Event Effects Consortium,
2004, http://parts.jpl.nasa.gov/docs/swift/virtex2_0104.pdf
[81]
Wang, J.J., RTAXS Single Event Effects Test Rep., Aug. 2004 [available on-line
at http://www.actel.com/documents/RTAXS_SEE_Report.pdf]
[82]
Wang, N., Patel, S. ReStore: Symptom Based Soft Error Detection in
Microprocessors. In Proceedings of the International Conference on Dependable
Systems and Networks, DSN, 2005.
[83]
Weaver, H., et al., “An SEU Tolerant Memory Cell Derived from Fundamental
Studies of SEU Mechanisms in SRAM”, IEEE Transactions on Nuclear Science,
New York, v.34, n.6, Dec. 1987.
[84]
Whitaker, S., Canaris, J., Liu, K., “SEU hardened memory cells for a CCSDS
Reed-Solomon encoder”, IEEE Transactions on Nuclear Science, New York,
v.38, n.6, p. 1471-1477, Dec. 1991.
[85]
Wirth, G., Vieira, M., Henes, E., Kastensmidt, F. L. “Modeling the sensitivity of
CMOS circuits to radiation induced single event transients”. Microelectronics
Reliability. Elsevier, 2007.
[86]
Xilinx Inc. Virtex® Series Datasheets and Application Notes, www.xilinx.com,
2006.
[87]
Zhang, B., Wang, W., Orshansky, M., “FASER: Fast Analysis of Soft Error
Susceptibility for Cell-Based Designs”, Workshop on System Effects of Logic
Soft Errors, SELSE, 2005.
[88]
Zhang, M., Shanbhag, N., “Soft-Error-Rate-Analysis (SERA) Methodology”,
IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, VOL. 25, NO. 10, Oct. 2006. pp.2140-2155.
[89]
Zhou, Q. et al., ''Transistor Sizing for Radiation Hardening'', in Proceedings of
IRPS, 2004. pp. 310-315.