Merging BIST and Configurable Computing Technology to Improve Availability in Space Applications
Eduardo Bezerra (1), Fabian Vargas (2), Michael Paul Gough (3)
(1, 3) Space Science Centre, School of Engineering, University of Sussex, Brighton, BN1 9QT, England
(1, 2) Electrical Engineering Dept., Catholic University - PUCRS, 90619-900 Porto Alegre, Brazil
1st IEEE Latin American Test Workshop - LATW'00.
Marina Palace Hotel, March 13-15, 2000.
Rio de Janeiro, Brazil
Agenda
1. Motivation: Important concerns about the design of
reconfigurable systems for space applications
2. System Description Overview
3. SEU Prevention Strategies
3.1. Refresh Operation in a TMR-FPGA System
3.2. Periodic Refresh Without FPGA Replication
3.3. Signature Analysis-Driven Refresh Without FPGA Replication
3.4. Signature Analysis With Continuous Readback Execution
4. Masking Connectivity Faults
5. Numerical Analysis of the CCM Node in Two Modes of Operation
6. Expected Performance
7. Conclusions & Future Work
1. Motivation:
Important concerns of computer designers for space applications:
• Power consumption, area usage, weight, and dependability (availability, reliability, and testability).
Main characteristics and drawbacks:
• Application-specific systems (requirements change frequently from application to application), which makes them very expensive.
Possible solution:
• Use of configurable devices: designers can load a different HW configuration suited to each new application, without the need for changes in the whole board layout (application-dependent solution).
Drawback:
• SW development for this kind of HW is in most cases very difficult (e.g., complex data structures).
In the past few years:
• Many approaches have been devoted to improving the dependability of reconfigurable computer systems, mainly based on traditional strategies (i.e., those of microprocessor-based systems).
Radiation causes Single-Event Upsets (SEUs) in memory elements:
• Processor latches and cache memory cells are sensitive to SEUs.
• FPGAs store their logic and routing configuration in latches.
[Figure: (a) cross-section of an NFET (source, gate, drain at 5 V, body, p substrate) crossed by an ion track, with charge collected by drift, funneling, and diffusion; (b) current versus time (nsec.), showing a prompt component (drift + funneling) followed by a delayed component (diffusion).]
Fig. 1. Illustration of the charge collection mechanism that causes single-event upset :
(a) particle strike and charge generation;
(b) current pulse shape generated in the n+p junction during the collection of the
charge.
2. System Description Overview :
Fig. 2. Block diagram of the proposed system: (a) network architecture; (b) basic CCM (Configurable Computer Module) node.
[Figure (a), labels: User 1 ... User n, ground station, CCM (TC/TM), on-board network bus, CCM 1 ... CCM n, shared RAM, on-board instrument processing board.]
Legend:
1 - Protocol for on-board communication
2 - ESA standard protocol
3 - Configurable interface
[Figure (b), labels: FPGA A (processing element), FPGA B (control), configuration manager & readback, flash memory (configuration bitstream), serial PROM, RAM, emergency recovery bitstream, optional.]
3. SEU Prevention Strategies
3.1. Refresh Operation in a Triple Modular Redundancy (TMR) FPGA System
[Figure, labels: serial EPROM, configuration bitstreams, FPGAs, readback bitstreams (user registers, user logic, routing), voter, error signal, start refresh signals.]
Fig. 3. A TMR FPGA system.
- Three FPGAs are configured with the same bitstream (TMR) and operate in synchronism.
- A controller reads the three FPGA bitstreams, bit after bit; if there are no differences, correct functioning with no SEU occurrence is assumed.
- The check is executed continuously, using the FPGA readback feature during normal FPGA operation.
Drawbacks:
- HW overhead (TMR),
- Total loss of data measurement.
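The bit-after-bit comparison described above can be sketched in software as follows. This is a minimal behavioural model, not the controller used by the authors: the function name `check_readback` and the list-of-bits representation are assumptions, and reporting which FPGA disagrees goes slightly beyond the slide, which only states that differences are detected.

```python
from typing import Optional, Sequence, Tuple


def check_readback(a: Sequence[int], b: Sequence[int],
                   c: Sequence[int]) -> Optional[Tuple[int, int]]:
    """Compare three readback bitstreams bit after bit.

    Returns None when all three agree. Otherwise returns (bit_index, fpga),
    where fpga (0, 1 or 2) is the device that disagrees with the other two
    at the first differing bit.
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("readback bitstreams must have equal length")
    for i, (x, y, z) in enumerate(zip(a, b, c)):
        if x == y == z:
            continue
        # With single-bit values, exactly one stream differs from the other two.
        if x == y:
            return i, 2
        if x == z:
            return i, 1
        return i, 0
    return None


golden = [1, 0, 1, 1, 0, 0, 1, 0]
upset = list(golden)
upset[5] ^= 1  # simulate an SEU flipping one configuration bit in the second FPGA
print(check_readback(golden, golden, golden))
print(check_readback(golden, upset, golden))
```

A non-None result would be the condition that raises the error signal and starts a refresh.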
3.2. Periodic Refresh Without FPGA Replication
    counter <= counter + 1;
    if counter = 0 then
        PROG <= '0';  -- counter has rolled over (reset): start refresh
    else
        PROG <= '1';
    end if;

[Figure: FPGA with three application processes, a 15 Hz clock, the start refresh signal on the PRG pin, and the configuration bitstream.]
Fig. 4. Using a counter to start the refresh operation.
- A 15 Hz clock increments the 19-bit counter,
- Every 20 hours, the counter resets, which leads to FPGA reconfiguration.
Drawback: the refresh occurs periodically, even if no SEU has occurred (system availability may be seriously affected).
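The relation between counter width, clock rate, and refresh interval is plain arithmetic and can be checked with a short sketch. The function name `rollover_hours` is an assumption made for illustration. At 15 Hz a 19-bit counter wraps after about 9.7 hours and a 20-bit counter after about 19.4 hours, so the 20-hour figure is closer to a 20-bit count (one possible reading, not confirmed by the slides, is a counter indexed from bit 19 down to bit 0).

```python
def rollover_hours(width_bits: int, clock_hz: float) -> float:
    """Hours between two wrap-arounds of a free-running binary counter."""
    return (2 ** width_bits) / clock_hz / 3600.0


for width in (19, 20):
    print(f"{width}-bit counter at 15 Hz wraps every "
          f"{rollover_hours(width, 15.0):.2f} h")
```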
3.3. Signature Analysis-Driven Refresh Without FPGA Replication
A signature analysis (LFSR/PSG) method is used to identify when an FPGA refresh is
necessary.
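[Figure: block diagram with FPGA A, FPGA B, the LFSR/PSG process, flash memory, the 15 Hz clock, the system clock, the PRG pins, and the readback pin; the numbered steps 1 to 4 include "Start readback?", "Readback", and "Refresh?".]
Fig. 5. The LFSR/PSG approach.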
- The LFSR/PSG process, written in VHDL, has two operating modes:
(a) LFSR mode: clocked by the 15 Hz signal (a 19-bit LFSR with a primitive polynomial counts up to 20 h). When the LFSR output matches a given seed, the process switches to PSG mode.
(b) PSG mode: the LFSR/PSG process runs at speed, as a parallel signature generator.
Drawback: the HW required is slightly larger than in the previous clock/counter approach.
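The timing behaviour of the LFSR mode can be modelled with a few lines of code. The slides say only that the polynomial is primitive, so the taps below (x^19 + x^18 + x^17 + x^14 + 1, mask 0x72000) are an assumption taken from commonly published tap tables, and the Galois form and function names are illustrative. A maximal-length 19-bit LFSR returns to its seed after 2^19 - 1 clocks, which at 15 Hz is about 9.7 hours.

```python
WIDTH = 19
TAP_MASK = 0x72000  # assumed taps: x^19 + x^18 + x^17 + x^14 + 1


def lfsr_step(state: int) -> int:
    """Advance a right-shifting Galois LFSR by one clock."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= TAP_MASK
    return state


def clocks_until_match(seed: int = 1) -> int:
    """Count the clocks until the LFSR output matches its seed again."""
    state = lfsr_step(seed)
    clocks = 1
    while state != seed:
        state = lfsr_step(state)
        clocks += 1
    return clocks


period = clocks_until_match()
print(period, "clocks,", round(period / 15.0 / 3600.0, 2), "h at 15 Hz")
```

Compared with a binary counter, the LFSR needs only a shift and a few XOR gates per clock, and the same register can be reused for signature generation.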
3.4. Signature Analysis With Continuous Readback Execution
• In the previous strategy, the test for SEU occurrences is executed periodically: the LFSR is used to start the readback operation and to compact the configuration bitstream each time.
• Another option is to execute the readback continuously, as it does not affect normal FPGA operation.
• Advantage: lower HW overhead. Part of the LFSR/PSG process becomes unnecessary: the internal 15 Hz clock used to start the readback process on FPGA A, and the circuit used for clock signal switching, are eliminated.
• Alternatively, the 15 Hz clock could be used in a different process to control the self-refresh of FPGA B.
• This strategy saves space on FPGA B and allows the integrity of FPGA A to be verified more frequently.
• Drawback: power consumption is slightly higher than in the LFSR/PSG approach, due to the continuous readback operation of FPGA A.
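Compacting a readback bitstream into a signature and comparing it with a reference can be sketched as below. This is a behavioural model under stated assumptions: the 19-bit register width and tap mask 0x72000 are assumed (the slides do not give them), data are absorbed one byte per clock in the manner of a multiple-input signature register, and the names `signature` and `needs_refresh` are hypothetical.

```python
TAP_MASK = 0x72000  # assumed 19-bit feedback taps, not given in the slides


def signature(readback: bytes, seed: int = 0) -> int:
    """Compact a readback bitstream into a 19-bit signature."""
    state = seed
    for byte in readback:
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= TAP_MASK
        state ^= byte  # absorb eight readback bits in parallel
    return state


def needs_refresh(readback: bytes, golden_signature: int) -> bool:
    """True when the readback signature differs from the reference."""
    return signature(readback) != golden_signature


golden = bytes(range(64))       # stand-in for a fault-free configuration
reference = signature(golden)
upset = bytearray(golden)
upset[10] ^= 0x04               # simulate an SEU flipping one bit
print(needs_refresh(golden, reference), needs_refresh(bytes(upset), reference))
```

Because the register update is linear and invertible, a single flipped bit always changes the signature; multiple upsets can in principle alias to the reference value, with a probability that shrinks as the register widens.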
4. Masking Connectivity Faults
Reliability improvements in the processing elements are worthless
if the correctness of the input data is not guaranteed.
Goal: mask faults in the external FPGA pins and in the internal FPGA routing resources.
[Figure: Sensor 1, Sensor 2, and Sensor 3 each feed replicated inputs of the FPGA, which contains the kernel and the application process.]
Fig. 6. Using replicated inputs/voter to mask connectivity faults.
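A voter over replicated inputs can be illustrated with a bitwise 2-out-of-3 majority function. This is a generic sketch of the voting idea, not the circuit of Fig. 6: the function name `vote` and the use of integer words to stand for the three copies of an input are assumptions.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three copies of the same input word."""
    return (a & b) | (a & c) | (b & c)


reading = 0b1010
# One faulty line per copy, each on a different bit position, is still masked.
print(bin(vote(reading ^ 0b0001, reading ^ 0b0010, reading ^ 0b0100)))
```

The vote masks a fault on any single pin or routing path per bit position; two copies failing on the same bit would defeat it.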
5. Numerical Analysis of the CCM Node in Two Modes of Operation
First situation: the 3 flash memories hold 3 different configuration bitstreams (CBs).
- This scenario represents a real reconfigurable computing system, because the FPGA functionality can be altered on-the-fly, according to the application requirements.
- From the fault-tolerance point of view it is not a good approach: in case of an SEU occurrence in one of the flash memories, the respective application has to stop and wait for a good CB to be uploaded from the ground station.
Second situation: the 3 flash memories hold the same CB, which characterises a TMR system. The vote is executed, implicitly, by FPGA B.
- This test strategy cannot locate the fault: it is not possible to tell whether the problem is in the flash memory or in the FPGA.
- In any case, FPGA A is reconfigured with a CB from another flash memory. If the error persists, then the diagnosis is a permanent fault in FPGA A, and the module has to be by-passed. On the other hand, if no error is detected with the new CB, then the respective flash memory is considered faulty, and it needs to be refreshed in order to try to clear any occurrence of SEUs.
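The diagnosis procedure of the second situation can be written down as a small decision function. This is a sketch of the logic as described in the text, with assumed names: `error_after_reconfig` is a hypothetical hook that reconfigures FPGA A from a given flash memory and reports whether the error is still detected, and picking the next flash memory in round-robin order is an arbitrary choice.

```python
from enum import Enum
from typing import Callable, Tuple


class Diagnosis(Enum):
    FLASH_UPSET = "flash memory faulty: refresh it"
    FPGA_PERMANENT = "permanent fault in FPGA A: by-pass the module"


def diagnose(error_after_reconfig: Callable[[int], bool],
             suspect_flash: int, n_flash: int = 3) -> Tuple[Diagnosis, int]:
    """Run after an error was detected while FPGA A used `suspect_flash`.

    Returns the diagnosis and the index of the flash memory that was tried.
    """
    other = (suspect_flash + 1) % n_flash  # any other flash memory would do
    if error_after_reconfig(other):
        return Diagnosis.FPGA_PERMANENT, other
    return Diagnosis.FLASH_UPSET, other


# Stub hooks standing in for the real hardware:
print(diagnose(lambda flash: True, 0))        # error persists with the new CB
print(diagnose(lambda flash: flash == 0, 0))  # error disappears with the new CB
```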
[Plot: reliability (0.7 to 1) versus time in hours (1 to 3000) for the non-redundant (R1) and redundant (R2) configurations.]
Fig. 7. The reliability responses for the two situations.
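The slides give only the plotted curves, not the model behind them. As a rough illustration of how a non-redundant and a redundant arrangement can be compared, the sketch below uses the textbook formulas for a single unit with a constant failure rate and for a 2-out-of-3 (TMR) arrangement with a perfect voter and no repair. These formulas, the function names, and the failure rate of 1e-4 per hour are assumptions for illustration, not values from the paper.

```python
import math


def r_simplex(t_hours: float, failure_rate: float) -> float:
    """Reliability of a single unit with a constant failure rate."""
    return math.exp(-failure_rate * t_hours)


def r_tmr(t_hours: float, failure_rate: float) -> float:
    """Reliability of 2-out-of-3 redundancy, perfect voter, no repair."""
    r = r_simplex(t_hours, failure_rate)
    return 3 * r ** 2 - 2 * r ** 3


RATE = 1e-4  # hypothetical failures per hour
for t in (10, 100, 1000, 3000):
    print(t, round(r_simplex(t, RATE), 4), round(r_tmr(t, RATE), 4))
```

Under these assumptions the redundant curve stays above the non-redundant one until t = ln 2 / rate, after which unrepaired TMR becomes the less reliable of the two; periodic refresh is what keeps a real system on the favourable side of that point.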
6. Expected Performance
Application program: auto-correlation (ACF) processing of particle count pulses
as a means of studying processes occurring in near-Earth plasmas.
DS87C520 [8051 family] (Assembly) vs. FPGA (VHDL)

Process      Microcontroller     FPGA            Rate
Process 1    4,518T              1T              4,518 times faster
Process 2    8T .. 36T           1T              8 to 36 times faster
Process 3    18T .. 1,018T       1T .. 68T       18 to 14.97 times faster
Process 4    1,240T              48T             25.8 times faster
Process 5    1,334T .. 3,438T    132T .. 143T    10.11 to 24.0 times faster
Process 6    11,116T             288T            38.6 times faster

Table 1. Performance comparison for the case study (clock cycles).
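The rate column can be reproduced by dividing the microcontroller cycle counts by the FPGA cycle counts, pairing the lower bounds with each other and the upper bounds with each other. The sketch below copies the cycle counts from Table 1; the dictionary layout and the function name `speedups` are assumptions. The pairing explains why Process 3 reads "18 to 14.97": the first figure is 18/1 and the second is 1,018/68.

```python
# (microcontroller cycles, FPGA cycles), each as a (low, high) pair from Table 1.
TABLE_1 = {
    "Process 1": ((4518, 4518), (1, 1)),
    "Process 2": ((8, 36), (1, 1)),
    "Process 3": ((18, 1018), (1, 68)),
    "Process 4": ((1240, 1240), (48, 48)),
    "Process 5": ((1334, 3438), (132, 143)),
    "Process 6": ((11116, 11116), (288, 288)),
}


def speedups(name: str) -> tuple:
    """Return the (low-bound, high-bound) cycle-count ratios for a process."""
    (mcu_lo, mcu_hi), (fpga_lo, fpga_hi) = TABLE_1[name]
    return mcu_lo / fpga_lo, mcu_hi / fpga_hi


for name in TABLE_1:
    lo, hi = speedups(name)
    print(f"{name}: {lo:.2f} to {hi:.2f} times fewer clock cycles")
```

These ratios compare clock cycles only; the wall-clock gain also depends on the clock frequencies of the two devices, which the table does not give.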
7. Conclusions & Future Work
• This paper introduced the use of a BIST technique and traditional fault-tolerance strategies together with configurable computing technology to improve the availability of on-board computers used in space applications:
  - a network architecture for spacecraft instruments was presented;
  - test and fault-tolerance strategies to detect and fix/tolerate SEU occurrences were analysed;
  - a technique to mask connectivity faults was also proposed;
  - expected strategy performance was estimated.
The strategies described here deserve a deeper investigation, in order to be
used in the design of a fault-tolerant on-board instrument processing
system, entirely based on configurable computing.
The next step will be the implementation of a prototype to determine the
feasibility of the test and fault-tolerant strategies proposed here.