Computer Engineering
Self Repair Technology
for Logic Circuits
Architecture, Overhead and Limitations
Heinrich T. Vierhaus
BTU Cottbus
Computer Engineering Group
CREDES / ZUSYS / DAAD Summer School 2011, Tallinn
Outline
1. Introduction: Nano Structure Problems
2. The Problem of Wear-Out
3. Repair for Memory and FPGAs
4. Basic Logic Repair Strategies & Structures
5. Test and Repair Administration
6. De-Stressing Strategies
7. Cost, Overhead, Single Points of Failure
8. Summary and Conclusions
1. Introduction
A bunch of new problems from nano structures ...
Nanoelectronic Problems
Lithography:
The wavelength used to „map“ structural information from
masks to wafers is larger (4 times or more) than the minimum
structural features (193 nm versus 90 / 65 / 45 nm).
Adaptation of layouts for correction of mapping faults.
Statistical Parameter Variations:
The number of atoms in MOS-transistor channels becomes so
small that statistical variations of doping densities have an impact
on device parameters such as threshold voltages.
New Problems with Nano-Technologies
[Figure: lithography setup with light source (wavelength: 193 nm), mask (reticle), resist, wafer and exposed resist. Feature size: down to 28 nm.]
Layout Correction
Modified layout
for compensation
of mapping faults
Compensation is critical and non-ideal
Faults are not random but correlated!
Requires fast fault diagnosis
Doping Fluctuations in MOS Transistors
[Figure: MOS transistor cross-section with poly-Si gate, two n regions, individual doping atoms and p-substrate.]
Density and distribution of doping atoms
cause shifts in transistor threshold voltages!
Nanostructure Problems
Individual device characteristics such as Vth are more dependent
on statistical variations of underlying physical features such
as doping profiles.
Primary Relevance: Yield
A significant share of basic devices will be „out of specs“ and need
replacement by backup elements for yield improvement after
production.
Primary Relevance: Yield
Smaller features mean higher stress (field strength, current
density) and also foster new mechanisms of early wear-out.
Primary Relevance: Lifetime
Transient error recognition and compensation „in time“ is becoming a must due
to e. g. charged particles that can discharge circuit nodes.
Primary Relevance: Dependability
Fault Tolerant Computing
A fault event can be handled at three levels (scope labels in the figure: universal, specific, very specific):
- Software-based fault detection & compensation: works only for transient faults!
- HW logic & RT-level detection & compensation: typically works for transient and permanent faults!
- Transistor- and switch-level compensation: typically works for specific types of transient faults only!
2. Wear-Out Problems and Mechanisms
Structures on ICs used to live longer than either their application
or even their users. Not any more ...
IC Structures May Get Tired
„Wear-out“ effects in nano-electronic ICs are likely to appear much earlier,
causing a lot of problems for dependable long-term applications!
Fault Effects on ICs
[Figure: IC cross-section with Metal 1 to Metal 3, via, polyimide (low-k), field oxide, gate oxide (high-k), n and p regions and n-well. Annotated fault effects:]
- Metal migration
- Low-k insulator deterioration
- Transistor deterioration (HCI, NBTI), eventually gate oxide shorts!
Wear-Out Mechanisms
Metal Migration:
Metal atoms (Al, Cu) tend to migrate under high current density and high temperature.
Stress Migration:
Migration effects may be enhanced under mechanical stress conditions.
Effect:
Migration in metal lines and vias may actually cause line interrupts. The effect is
partly reversible by changing current directions.
Metal Migration
[Figure: a metal wire under high current density between neighboring wires, shown new and after some time in operation. Labels: voids (holes), open defect, short.]
Vias are especially prone to such defects.
The effect is reversible by reversing the direction of current flow!
Transistor Degradation
Negative Bias Temperature Instability (NBTI): Reduced switching speed
for p-channel MOS transistors that have operated under long-time constant
negative gate bias. The effect is partly reversible.
Hot Carrier Injection (HCI): Reduced switching speed for n-channel MOS
transistors, induced by positive gate bias and frequent switching.
Not reversible.
Gate Oxide Deterioration: Induced by high field strength. Not reversible.
Dielectric Breakdown: Insulating layers between metal lines may break down,
causing shorts between signal lines.
Design technology including a prospective „life time budget“!!
Management of Wear-Out by
„Fault Tolerant Computing“?
Built-in fault tolerance and error compensation are needed in nano-technologies anyway for the management of transient faults.
Wear-out induced faults may show up as „intermittent“ faults first,
which become more and more frequent.
Faults in synchronous circuits and systems are detected „by clock cycle“.
Hence, for many types of fault tolerant architecture, the detection does not
even recognize whether the fault is permanent or not.
Triple Modular Redundancy
[Figure: an input signal feeds Execution Units 1, 2 and 3; a comparator / voter delivers the result out (majority) and an error detect signal.]
Can detect and compensate almost any type of fault
Overhead about 200-300 %, additional signal delays
The voter itself is not covered but must be a „self checking checker“
Standard (by law) in avionics applications!
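The majority voting described above can be sketched in a few lines of Python. This is only an illustration: the function name `tmr_vote` and the error-flag convention are assumptions, not taken from the slides, and the voter is modeled as fault-free, which real hardware cannot guarantee.

```python
from typing import Tuple

def tmr_vote(a: int, b: int, c: int) -> Tuple[int, bool]:
    """Bitwise 2-out-of-3 majority vote over three unit outputs.

    Returns (voted_result, error_detected). error_detected is True
    whenever the three outputs are not all identical.
    """
    # Each result bit is 1 if at least two of the three inputs have a 1 there.
    voted = (a & b) | (a & c) | (b & c)
    error = not (a == b == c)
    return voted, error

# Unit 2 delivers a corrupted word; the majority masks the fault.
result, error = tmr_vote(0b1011, 0b1111, 0b1011)
```

The vote masks any fault confined to one unit, whether transient or permanent; two units failing on the same bit would win the vote with a wrong value.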
Error Detecting / Correcting Codes
[Figure: data and its signature pass through transmission / storage; a signature comparison raises a fault detect signal and drives error correction of the data.]
Often applicable to 1- or 2-bit faults only
Often limited to certain fault models (uni-directional)
Becomes expensive if applied to computational units
Can TMR and Codes Compensate
Permanent Faults?
Fault / error detection circuitry typically works on a clock-cycle basis.
It does not „know“ if a fault is transient or permanent.
A permanent fault is a fault event that occurs in several to many successive
clock cycles repeatedly.
Error correction technology can detect and compensate such permanent faults
as well as transient faults.
A critical condition occurs if transient faults occur on top of
permanent faults. Then the superposition of fault effects is likely to
exceed the system‘s fault handling capacity.
System components that run actively „in parallel“ suffer from the same
wear-out effects. Therefore there is an increase in dependability before
wear-out limits, but no significant life time extension!
Redundancy and Wear-Out
During the normal life time of the system, duplication or triplication
can enhance reliability significantly. But area and power consumption
are also roughly tripled.
And by the end of normal operating time (out of fuel / steam) all three
systems will fail shortly one after the other !!
Reliability enhancement is not equal to life time extension !!
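A textbook calculation illustrates the point. The sketch below assumes a constant failure rate `lam` per unit, independent failures and a perfect voter; none of these assumptions come from the slides, and real wear-out (a failure rate that rises with age) would make the effect stronger. Under these assumptions a TMR system is more reliable than a single unit only while the unit reliability stays above 0.5, and its mean time to failure is actually shorter.

```python
import math

def r_simplex(t: float, lam: float) -> float:
    """Reliability of a single unit with constant failure rate lam."""
    return math.exp(-lam * t)

def r_tmr(t: float, lam: float) -> float:
    """Reliability of 2-out-of-3 redundancy with a perfect voter."""
    r = r_simplex(t, lam)
    return 3 * r**2 - 2 * r**3

lam = 1.0                         # failure rate, arbitrary time unit
t_cross = math.log(2) / lam       # beyond this time TMR is worse than simplex
mttf_simplex = 1 / lam            # integral of exp(-lam * t)
mttf_tmr = 5 / (6 * lam)          # integral of 3r^2 - 2r^3 = 3/(2 lam) - 2/(3 lam)

for t in (0.1, t_cross, 2.0):
    print(f"t={t:.3f}  simplex={r_simplex(t, lam):.4f}  tmr={r_tmr(t, lam):.4f}")
```

Early in life the triplicated system is clearly better; late in life all three units are likely to be worn, and the majority vote no longer helps.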
Self Repair?
A fault event can be handled at three levels (scope labels in the figure: universal, specific, very specific):
- Software-based fault detection & compensation: works only for transient faults!
- HW logic & RT-level detection & compensation: typically works for transient and permanent faults!
- Transistor- and switch-level compensation: typically works for specific types of transient faults only!
Self Repair for permanent faults!
3. Repair for Memory and FPGAs
Compensation of transient faults is not enough.
Some technologies for transient compensation can handle
permanent faults, too, but not in the long run and not with
additional transient faults!
Memory Test & Repair
[Figure: memory array with line address, lines, columns, read / write lines and a spare column.]
Memory Test & Repair (2)
[Figure: memory array with line address, lines, columns, read / write lines and a spare column, extended by a memory BIST controller.]
... is already state-of-the-art!
FPGA-based Self Repair
In-System FPGA Repair
Repair Mechanism: Row/Line-Shift
[Figure: array of CLB rows with occupied CLBs, a row with a faulty CLB, and a reserve row of CLBs.]
Little Overhead for the re-configuration process
Loss of many “good” CLBs for every fault
Distributed Backup CLBs
[Figure: CLB array with legend: faulty CLB, functionally occupied CLB, selected non-occupied CLB, replacement CLB (reserve).]
Minimum loss of functional CLBs
High effort for re-wiring requires massive „embedded“
computing power (32-bit CPU, 500 MHz)
Self Repair within FPGA Basic Blocks
Heterogeneous repair strategies required (memory, logic)
Logic blocks may use methods known from memory BISR
Additional repair strategies are necessary for logic elements
The basic overhead of FPGAs versus standard logic
(a factor of about 10) is increased further.
Repair strategies for logic may use some features already
used in FPGAs (e. g. switched interconnects).
Structure of a CLB Slice
[Figure: CLB slice with program input, FF / logic / SRAM inputs, input MUX, logic field with redundant row, output MUX, and FF / logic / SRAM outputs.]
FPGAs for a Solution?
The granularity of re-configurable logic blocks (CLBs)
in most FPGAs is on the order of several thousand transistors.
Replacement strategies must be placed on a granularity of
blocks in the area of 100-500 transistors for fault densities
between 0.01 % and 0.1 %.
An efficient FPGA repair mechanism requires detailed fault diagnosis
plus specific repair schemes, which cannot be kept as pre-computed
reconfiguration schemes.
Computation of specific repair schemes requires „in-system
EDA“ (re-placement and routing) with a massive demand
for computing power.
There is no source of such „always available“ computing power.
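The granularity argument can be made concrete with a rough probability estimate. The sketch below is built on assumptions that are not in the slides: transistor faults are independent with probability `p_fault`, blocks are grouped as k = 3 functional blocks plus 1 spare, and such a group stays usable as long as at most one of its blocks is faulty (switches and control logic are ignored). All function names are hypothetical.

```python
def p_block_faulty(n_transistors: int, p_fault: float) -> float:
    """Probability that a block contains at least one faulty transistor."""
    return 1.0 - (1.0 - p_fault) ** n_transistors

def p_group_repairable(n_transistors: int, p_fault: float, k: int = 3) -> float:
    """Probability that a group of k functional blocks plus 1 spare
    has at most one faulty block, i.e. can still be repaired."""
    q = p_block_faulty(n_transistors, p_fault)
    blocks = k + 1
    return (1 - q) ** blocks + blocks * q * (1 - q) ** (blocks - 1)

for n in (100, 500, 5000):          # block size in transistors
    for p in (1e-4, 1e-3):          # fault density 0.01 % and 0.1 %
        print(f"n={n:5d}  p={p:.4f}  "
              f"block faulty={p_block_faulty(n, p):.3f}  "
              f"group repairable={p_group_repairable(n, p):.3f}")
```

With blocks of several thousand transistors and a fault density of 0.1 %, almost every block is faulty and a single spare cannot help, which is why the replaceable blocks have to be much smaller than a typical CLB.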
Self-Repairing FPGA ?
[Figure: reconfigurable logic as an array of CLBs and WBs, with blocks labeled New-Config. Memory, Virtual CPU, Config. Scheme and Program.]
Advanced FPGA Structures
[Figure: FPGA array of CLBs and WBs with embedded CPU, ALU and MULT blocks.]
... are only partly re-configurable for performance reasons!
FPGA / CPLD Repair
Looks pretty easy at first glance because of regular
architecture!
Requires lines / columns of switches for configuration at
inputs and between AND / OR matrices.
Requires additional programmability of cross-points
by double-gate transistor as in EEPROMs or Flash memory.
Not fully compatible with standard CMOS
Limited number of (re-) configurations
Floating gate (FAMOS) transistors are fault-sensitive!
4. Basic Logic Repair Strategies
Repair techniques that replace failing building blocks by redundant
elements from a „silent“ storage are not new.
IBM has been selling such computer systems specifically for
applications in banks for decades.
But always with few (2-10) backup elements (CPUs) assuming
a small number of failures (< 10) within years.
Mainframes
.. will often contain „redundant“ CPUs for eventual fault
compensation. But one faulty transistor then „costs“ a whole CPU,
limiting the fault handling to a few (about 10) permanent fault cases.
Granularity of Replacement
[Figure: expected fault density (1 out of ...) versus granularity in transistors, on an axis from 10^0 to 10^6 with the marks transistor, gate, FPGA macro block, cores and CPU. Regions labeled: hardly explored (logic), block-level replacement (e. g. FPGAs), core replacement (e. g. CPU).]
Repair Overhead versus Element Loss
[Figure: repair procedure overhead and functioning elements lost, plotted against the size of replaced blocks (granularity, 1 to 10M). Regions labeled: prohibitive overhead, new methods and architectures, prohibitive fault density.]
Built-in Self Repair (BISR)
BISR is well understood for highly regular structures such as embedded
memory blocks.
BISR essentially depends on built-in self test (BIST) with high
diagnostic resolution.
Steps of fault / redundancy management: fault detection, fault diagnosis, fault isolation, redundancy allocation.
Redundancy management must monitor faults, replacements, available redundancy and
must also re-establish a „working“ system state after power-down states.
Levels of Repair
Transistors - Switch Level
Replace transistors or transistor groups
Losses by reconfiguration (switched-off „good“ devices): Potentially small (20 - 50 %) for transistor faults
Overhead for test and diagnosis: Very high
Gate Level
Replace gates or logic cells
Losses by reconfiguration: Medium (60 to 90 %) for single transistor faults
Overhead for test and diagnosis: High
Macro-Block Level
Replace functional macros (ALU, FPU, CPU)
Losses by reconfiguration: High, 99 % or more
Overhead for test and diagnosis: Maybe acceptable
Repair overhead will dominate reliability!
The Fault Isolation Problem
[Figure: a driver feeding Load 1 and Load 2, with a gate short at one of the loads.]
GND-shorts of input gates affect the whole fan-in
network and make redundancy obsolete!!
Block-Level Repair
[Figure: a block of four AND gates (&) connected through switching elements (SE).]
Blocks of logic / RT elements (gates and larger) contain
a redundant element each that can replace a faulty unit.
Switching Concept (1)
[Figure: switching states 1 and 2. Inputs and outputs are routed through switches to Functional Blocks 1 to 3 and a Replacement Block; one block is connected to Test in / Test out.]
Switching Concept (2)
[Figure: switching states 3 and 4. Inputs and outputs are routed through switches to Functional Blocks 1 to 3 and a Replacement Block; one block is connected to Test in / Test out.]
A Regular Switching Scheme
The scheme is regular and scalable by nature, comprising always k functional
blocks of the same nature plus 1 additional block for backup.
Building blocks are separated by (pass-) transistor switches at inputs and
outputs, providing a full isolation of a faulty block.
Always 2 additional pass-transistors between two functional blocks.
The reconfiguration scheme is regular in shifting functionality between
blocks, which results in a simple scheme of administration.
The functional access to the „spare“ block can be used for testing purposes.
In any state of (re-) configuration, the potentially „faulty“ block is connected
to test input / output terminals.
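The shifting rule can be sketched as a small mapping from logical functions to physical blocks. The numbering below is an assumption for illustration (physical blocks 0..k, with block k as the spare); the slides do not give an explicit encoding, and the function names are hypothetical.

```python
from typing import List

def block_for_function(i: int, excluded: int, k: int) -> int:
    """Physical block that serves logical function i (0 <= i < k)
    while physical block `excluded` (0 <= excluded <= k) is isolated."""
    if not 0 <= i < k or not 0 <= excluded <= k:
        raise ValueError("index out of range")
    # Functions in front of the isolated block stay in place;
    # all others shift by one position towards the spare.
    return i if i < excluded else i + 1

def configuration(excluded: int, k: int) -> List[int]:
    """Mapping of all k logical functions for one configuration state."""
    return [block_for_function(i, excluded, k) for i in range(k)]

k = 3
for excluded in range(k, -1, -1):
    print(f"isolated block {excluded}: functions -> {configuration(excluded, k)}")
```

With k = 3 this yields four configurations; in each one exactly one physical block is left out of functional use and could be attached to the test terminals.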
Overhead Depending on Block Size
Transistors
Basic Element    Functional   Backup   Norm. switch   Ext. switch
3/4 2-NAND           12          4          18            24
3/4 2-AND            18          6          18            24
3/4 2-XOR            18          6          18            24
H-Adder              36         12          24            30
F-Adder              90         30          30            36
For small basic blocks, the switches make up the essential overhead (200 %)!
For larger basic blocks, the overhead can be reduced to about 30-50 %
... not counting test- and administration overhead!
Extract larger basic units from seemingly irregular logic netlists!!
Overhead
Transistors per RLB (3 functional units)
Basic Block   Functional   Backup   Switches (min. / ext.)   Overhead
2-NAND            12          4           18 / 24              230 %
2-AND             18          6           18 / 24              160 %
XOR               18          6           18 / 24              160 %
Half Adder        36         12           24 / 30              116 %
Full Adder        90         30           30 / 36               73 %
8-bit ALU       4500       1500          168 / 224              38 %
5. Test and Repair Administration
[Figure: two ways to organize test and repair administration. Centralized control: test generator, test analyzer, system monitoring, and a configurator and status memory serving the RLBs; annotation: may be faulty! De-centralized test and control: logic RLBs, each with its own BIST and Conf. unit.]
Blocks, Switching, Administration
[Figure: local versus remote (re-)configuration. In both cases, columns of switches connect F-Units and Red.-Units; Conf.-Units and decoders are driven by a global control unit.]
Combining Test and Re-Configuration
[Figure: a test input drives the logic under test and a reference; a compare unit on test out raises fault detect; a config. memory / counter provides the next state.]
Test and Administration
Test is done by comparison with reference outputs. The system is run
through states of re-configuration with the same input test pattern applied.
At test, a functional unit is always removed from normal operation and connected
to test I/Os.
Each of the elements in a block is testable via specific test inputs.
In case of a „fault detect“, the system is fixed in the current status.
Such a procedure of self-test and self-reconfiguration can run at every
system start-up, avoiding a central „fault memory“.
[Figure: inputs, input switches, Functional Blocks 1 to n, replacement block, output switches and outputs, with Test in / Test out, a decoder (fix at fault), state register, fault indicator, test clock, self-test circuit and fault flag.]
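The start-up procedure can be sketched as a loop over the configuration states. Everything named here is hypothetical: blocks are modeled as Python callables, the last block is taken to be the spare, and the comparison uses a software reference function in place of the hardware reference outputs.

```python
from typing import Callable, Sequence, Tuple

Pattern = Tuple[int, int]
Block = Callable[[Pattern], int]

def startup_self_test(blocks: Sequence[Block],
                      reference: Block,
                      patterns: Sequence[Pattern]) -> Tuple[int, bool]:
    """Step through the configuration states; in state s, block s is
    isolated and connected to the test I/Os. The first state in which
    the isolated block disagrees with the reference is kept ("fix at
    fault"). Returns (isolated_block, fault_flag)."""
    spare = len(blocks) - 1
    for state, block in enumerate(blocks):
        if any(block(p) != reference(p) for p in patterns):
            return state, True       # faulty block stays out of use
    return spare, False              # no fault: spare remains unused

def nand(p: Pattern) -> int:
    return 1 - (p[0] & p[1])

def stuck_at_0(p: Pattern) -> int:
    return 0                         # models a permanent output fault

PATTERNS = [(0, 0), (0, 1), (1, 0), (1, 1)]
isolated, fault_flag = startup_self_test(
    [nand, stuck_at_0, nand, nand], nand, PATTERNS)
```

Because the result is recomputed at every start-up, no non-volatile fault memory is needed; the sketch, like a single-spare scheme, handles only one faulty block per group.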
Controller for (Re-) Configuration
[Figure: controller for an RLB with BISR clock and reset, state bits s1 to s4, a decoder generating the control bits for the switches, fault signals f1 to f3, comparison of test in against a reference, a scan path (scan in / out), and fault, test and freset signals.]
Controller minimum complexity: 80 transistors (3 + 1 configuration)
A controller may drive one or several re-configurable blocks in parallel, depending
on their size
Local Interconnects
The block-based repair scheme so far cannot cover faults on wires between
re-configurable blocks.
For small basic blocks (such as logic gates) the majority of
wiring is between re-configurable units and not covered.
For larger (RT-level) basic blocks the majority of wiring
is within basic blocks and covered.
Schemes that can also cover inter-block wiring are possible,
but require FPGA-like configurable switching and complex switching schemes.
Essentials of the Repair Scheme
Logic self repair is feasible at cost below triple modular
redundancy (TMR).
There is a trade-off between the size of the reconfigurable
logic blocks (RLBs) and the maximum tolerable fault density.
Administration, not redundancy, makes the critical overhead.
Efforts can be saved by administrating several RLBs in
parallel.
Low-level interconnects between RLBs make for the essential
„single point of failure“ in the repair scheme!
6. De-Stressing
[Figure: component failure rates (10^-4 to 10^-1) versus system life time (t1 to t4), comparing the failure curve without de-stressing and the failure curve with de-stressing.]
The Purpose of De-Stressing
Building blocks in digital systems of equal type may be more or
less heavily used.
Blocks running with the highest dynamic load and at the highest
temperature are candidates for early failure.
Using otherwise „silent“ resources to relieve such units from stress
periodically may benefit the overall life time of the system.
The re-configuration scheme developed for repair may also serve
such purpose with slight modifications.
..and the scheme must be compatible with repair architectures !
The Scheme of De-Stressing
[Figure: states 0 to 3 of a group with building blocks BB1, BB2, BB3 and the replacement block RB. Tasks 1 to 3 with low, medium and heavy load are moved between the blocks from state to state; the block without a task serves as backup or is under test.]
A better initial distribution of tasks and stress makes a better re-distribution.
Repair capabilities can be preserved.
But: De-stressing may need re-organisation within an active system, while repair
has been off-line so far!
Modified Control Scheme
For de-stressing, functions have to be shifted while the system
is in „hot“ operation.
As long as all building blocks are fully functional, running two
functional blocks in parallel serving the same inputs and outputs
is possible.
With a total of k building blocks (including the spare one) there are
k „stable“ states of re-configuration (for k = 4: 1 normal, 3 repairs) and (k-1)
intermediate states for „handover“ in case of de-stressing.
There are no extra switches necessary, but an additional overhead
in state management and state decoding.
FSM including Transitional States
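[Figure: state diagram with the stable states 0, 1, 2, 3 and the transitional states 0/1, 1/2, 2/3; transitions are labeled with the control signal tr (tr = 1, tr = 0).]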
If a „flying“ transition between repair states becomes necessary,
the control logic will have seven states instead of four!
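A minimal sketch of such a seven-state controller follows. The transition rule is an assumption, since the state diagram is not fully recoverable from the transcript: here `tr = 1` moves a stable state into the following hand-over state, and `tr = 0` completes the hand-over into the next stable state. The names `STATES` and `next_state` are hypothetical.

```python
# Stable states "0".."3" and transitional hand-over states "0/1", "1/2", "2/3".
STATES = ["0", "0/1", "1", "1/2", "2", "2/3", "3"]

def is_stable(state: str) -> bool:
    return "/" not in state

def next_state(state: str, tr: int) -> str:
    """One clock step of the (assumed) control FSM."""
    idx = STATES.index(state)
    if is_stable(state):
        # Start a hand-over only on tr = 1 and only if a successor exists.
        if tr == 1 and idx < len(STATES) - 1:
            return STATES[idx + 1]
        return state
    # In a hand-over state two blocks run in parallel until tr drops to 0.
    return STATES[idx + 1] if tr == 0 else state

# Walk from state 0 to state 1 through the hand-over state 0/1.
trace = ["0"]
for tr in (1, 1, 0):
    trace.append(next_state(trace[-1], tr))
```

For k = 4 blocks this gives k stable plus k - 1 transitional states, seven in total, matching the count on the slide.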
Control Logic Functionality
Test access to each of the four basic blocks is possible through the extra test access.
With a test input pattern applied, the RBB is run through the 4 states.
If a BB or the RB is found to be faulty through the test access, the control
is fixed in this state. The faulty block is then not in functional use.
The controller has a „fault“ flag, which indicates the
status of „backup in use“.
Once an RBB has a fault detected, it cannot be used
for de-stressing operations.
As long as an RBB has no fault detected, it can activate
the re-configuration for de-stressing with an extra
control signal, which makes the FSM run through the
scheme of extended logic states for „hot“ re-configuration.
[Figure: three BBs and one RB between Test in and Test out.]
Extended Control Logic
[Figure: reconfigurable block (RB) with Test in and Test out („1“ for fault detect); gating logic sets a fault flag flip-flop (FF) with FF reset; an FSM with clock, tr, test and FSM reset inputs drives a decoder that generates the switch control signals.]
7. Overhead and Limitations
BISR requires additional overhead.
The inevitable extra circuitry used for fault administration is
not fault-free by definition.
But we can assume that such circuitry, if fabricated correctly,
is not in heavy use all the time and will exhibit much reduced
failure from stress.
Memory cells used for repair state administration are prone to
transient fault effects from particle radiation.
With suitable state encoding (1-out-of-n code), a parity check
can be applied.
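Why parity suffices for a 1-out-of-n code can be shown in a few lines: a valid code word has exactly one bit set and therefore odd parity, so any single bit flip produces a word with zero or two bits set and even parity. The helper names below are illustrative assumptions.

```python
def one_hot(state: int, n: int) -> int:
    """Encode state 0..n-1 as a 1-out-of-n word (bit `state` set)."""
    if not 0 <= state < n:
        raise ValueError("state out of range")
    return 1 << state

def parity_ok(word: int) -> bool:
    """Valid 1-out-of-n words have an odd number of set bits."""
    return bin(word).count("1") % 2 == 1

n = 4
word = one_hot(2, n)                               # 0b0100
upsets = [word ^ (1 << bit) for bit in range(n)]   # every single-bit flip
detected = [not parity_ok(w) for w in upsets]
```

Every single-bit upset is detected. A double upset that clears the set bit and sets another one yields a different valid code word and passes the check, so parity alone does not cover multiple faults.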
Overhead
Overhead factors:
- Number and size of redundant elements,
- Number of switches for (re-)configuration,
- Test and fault diagnosis,
- Control logic,
- Extra overhead for system management.
Cost / Overhead
(3 functional blocks plus 1 backup in RLB)
Basic Block   Trans. funct.   Trans. backup   Switch Trans.   Contr.* Unit Tr.   Overhead %
2-NAND          3 * 4               4              30            81 / 200        960 / 3600
H-Adder         3 * 12             12              40            81 / 200        369 / 700
F-Adder         3 * 30             30              50            81 / 200        179 / 311
2-bit ALU       3 * 352           352             140            81 / 200        54.2 / 65.5
4-bit ALU       3 * 699           699             180            81 / 200        45.8 / 51.5
8-bit ALU       3 * 1367         1367             260            81 / 200        41.6 / 44.5
* with / without extensions for de-stressing, controller design
optimized for supervision by parity control.
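The overhead figures can be approximated with a simple ratio. The formula below is inferred from the numbers and is not stated on the slides: overhead = (backup + switch + controller transistors) / functional transistors, with 3 functional blocks per RLB. It is also an assumption that the larger controller (200 transistors) is the one with de-stressing extensions. Under these assumptions the formula reproduces the listed values for the adder and ALU rows; the second 2-NAND value on the slide (3600) is not reproduced by it.

```python
# (transistors per basic block, switch transistors), copied from the table above
ROWS = {
    "2-NAND":    (4, 30),
    "H-Adder":   (12, 40),
    "F-Adder":   (30, 50),
    "2-bit ALU": (352, 140),
    "4-bit ALU": (699, 180),
    "8-bit ALU": (1367, 260),
}
CONTROLLER = {"repair": 81, "repair+destress": 200}   # hypothetical labels

def overhead_percent(block: int, switches: int, controller: int, k: int = 3) -> float:
    """Overhead relative to the k functional blocks, in percent.
    The single backup block has the same size as one functional block."""
    functional = k * block
    return 100.0 * (block + switches + controller) / functional

for name, (block, switches) in ROWS.items():
    plain = overhead_percent(block, switches, CONTROLLER["repair"])
    destress = overhead_percent(block, switches, CONTROLLER["repair+destress"])
    print(f"{name:10s} {plain:7.1f} % / {destress:7.1f} %")
```

The fixed controller cost dominates for gate-sized blocks and becomes marginal for ALU-sized blocks, where the 33 % for the backup block is the largest share.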
Sources of Overhead
Overhead in %
Basic Block   Complexity (trans.)   redund.   switches   control   ctrl/destr.
2-NAND               4                 33        250        675       1666
H-Adder             12                 33        111        225        555
F-Adder             30                 33         55         90        222
2Bit ALU           352                 33         13        7.6       18.9
4Bit ALU           699                 33        8.5        3.8        9.5
8Bit ALU          1367                 33        6.2          2        4.8
Switches and control overhead dominate; a reasonable lower bound
for the complexity of basic blocks is around 100-200 transistors.
Overhead and Block Size
[Figure: overhead in % (axis marks 10, 33, 100, 1000) versus basic block size in transistors (10 to 10^4), with one curve for self repair and one for self repair plus de-stressing.]
The Switching Problem (1)
[Figure: three arrangements of pass-transistor switches with their switch control signals. The first compensates „always on“, the second compensates „always off“, the third compensates both „always on“ and „always off“.]
... always in one single transistor.
Single Points of Failure
[Figure: a transistor switch between the signal wiring and a reconfigurable logic block (RLB), driven by switch control from the config. control network, with three fault locations marked.]
1: short gate - signal input
2: short gate - block input
3: channel short
Pass Transistor Faults
Short
A short condition between the signal input (Usign) and the control
input (Uctrl) may be solved by designing the gate input line (Rbr)
as a fuse. Then one additional transistor is needed as a „power sink“.
Blowing Fuses
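[Figure: pass transistor between sin and sout with a gate short; the control input (CTL in) reaches the gate through a fuse, and VDDhigh together with a power-sink transistor is used to blow the fuse.]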
8. Summary and Conclusions
Logic self-repair is not impossible, but not cheap either.
The lower bound for logic blocks is about 100 transistors.
Experience shows that most logic designs „yield“ some potential
for logic extraction.
Repair technologies work even (much) better for regular processor
architectures such as VLIW processors.
In real-life designs, a large part of the system (memory: 50-90 %,
functional units: 10-40 %) is regular. Only a small fraction is truly
„irregular“ and needs higher overhead.
No such strategy yet for analog and mixed signal circuits !
Real Embedded Systems
[Figure: block diagram of an embedded system with two CPUs (each with data path and control), caches, memory control, memory, DSP and a mixed-signal / RF part.]
.. only a small fraction of the real system is truly irregular and needs
„expensive“ logic repair !
Regular Processor Architectures
[Figure: processor with control logic (needs logic BISR), register file and multiple parallel processing units (Add, Mult).]
Regular processor structures with multiple parallel units need
expensive logic (self-) repair only for their control logic. Reconfiguration
of data-path elements can be arranged by software, which does not have wear-out !
Design for Repairability
[Figure: design flow. Starting from the RT netlist: extract obvious regular blocks and compose RT-RLBs; find and extract regular entities in the random logic and compose gate-level RLBs; a random rest logic remains; compose the RLB control scheme and control circuitry; estimate reliability; done.]
This is the END !
Thank you for not falling asleep !
(I would have....)