Space Science Centre, School of Engineering, University of Sussex, England
Electrical Engineering Dept., Catholic University - PUCRS, Porto Alegre, Brazil

Merging BIST and Configurable Computing Technology to Improve Availability in Space Applications

Eduardo Bezerra 1, Fabian Vargas 2, Michael Paul Gough 3

1, 3 Space Science Centre, University of Sussex, Brighton, BN1 9QT, England
1, 2 Catholic University - PUCRS, 90619-900 Porto Alegre - Brazil

1st IEEE Latin American Test Workshop - LATW’00. Marina Palace Hotel, March 13-15, 2000. Rio de Janeiro, Brazil

Agenda

1. Motivation: Important concerns about the design of reconfigurable systems for space applications
2. System Description Overview
3. SEU Prevention Strategies
   3.1. Refresh Operation in a TMR-FPGA System
   3.2. Periodic Refresh Without FPGA Replication
   3.3. Signature Analysis-Driven Refresh Without FPGA Replication
   3.4. Signature Analysis With Continuous Readback Execution
4. Masking Connectivity Faults
5. Numerical Analysis of the CCM Node in Two Modes of Operation
6. Expected Performance
7. Conclusions & Future Work

1. Motivation:

Important concerns of computer designers for space applications:
• Power consumption, area usage, weight, and dependability (availability, reliability, and testability).

Main Characteristics & Drawbacks:
• Application-specific systems (requirements change frequently from application to application): very expensive systems!

Possible Solution:
• Use of configurable devices: this allows designers to have a different HW configuration suited to each new application, without the need for changes in the whole board layout (application-dependent solution).

Drawback:
• SW development for this kind of HW is in most cases very difficult (e.g., complex data structures).
In the past few years:
• Many approaches have been devoted to improving the dependability features of reconfigurable computer systems, mainly based on traditional strategies (i.e., those of microprocessor-based systems).

Radiation causes Single-Event Upsets (SEUs) in memory elements:
• Processor latches and cache memory cells are sensitive to SEUs.
• FPGAs store logic/routing in latches.

[Figure: (a) cross-section of an n-channel FET (gate, source S at 0 V, drain D at 5 V, n+ regions, p substrate) struck by an ion, with labels for the ion track, electron drift, funneling and diffusion; (b) collected current versus time (nsec.), showing a prompt (drift + funneling) component and a delayed (diffusion) component.]
Fig. 1. Illustration of the charge collection mechanism that causes single-event upset: (a) particle strike and charge generation; (b) current pulse shape generated in the n+p junction during the collection of the charge.

2. System Description Overview:

[Figure (a), labels: User 1, User 2, ..., User n; CCM 1, CCM 2, ..., CCM n; On-board network bus; CCM (TC/TM); Shared RAM; On-board instrument processing board; Ground station.]
Legend:
1 - Protocol for on-board communication
2 - ESA standard protocol
3 - Configurable interface

[Figure (b), labels: CCM (Configurable Computer Module); FPGA B (control) (configuration manager & readback); Flash memory (configuration bitstream); serial PROM; FPGA A (processing element); RAM (emergency recovery bitstream); (optional).]
Fig. 2. Block diagram of the proposed system: (a) Network architecture. (b) Basic CCM node.

3. SEU Prevention Strategies

3.1. Refresh Operation in a Triple Modular Redundancy (TMR) FPGA System

[Figure, labels: Configuration bitstreams; Readback bitstreams (user registers, user logic, routing); FPGA; voter; Error signal; Serial EPROM; Start refresh signals.]
Fig. 3. A TMR FPGA system.

- 3 FPGAs are configured with the same bitstream (TMR) and operate in synchronism.
- A controller reads the 3 FPGA bitstreams, bit by bit; if there are no differences, correct functioning with no SEU occurrence is assumed.
- The comparison is executed continuously (using the FPGA readback feature, during normal FPGA operation).

Drawbacks:
- HW overhead (TMR),
- Total loss of measurement data.

3.2. Periodic Refresh Without FPGA Replication

    counter <= counter + 1;
    if counter = 0 then
        PROG <= '0';    -- counter has reset: start refresh
    else
        PROG <= '1';
    end if;

[Figure, labels: 15 Hz; Start refresh signal; PRG pin; Application process (x3); Configuration bitstream; FPGA.]
Fig. 4. Using a counter to start the refresh operation.

- A 15 Hz clock increments the 19-bit counter,
- Every 20 hours the counter resets, which leads to FPGA reconfiguration.

Drawback: the refresh is performed periodically, even if no SEU has occurred (system availability may be seriously affected).

3.3. Signature Analysis-Driven Refresh Without FPGA Replication

A signature analysis (LFSR/PSG) method is used to identify when an FPGA refresh is necessary.

[Figure, labels: 15 Hz; System clock; LFSR/PSG; Flash memory; PRG pin; Refresh?; Readback; Start readback?; readback pin; FPGA B; FPGA A; connections numbered 1 to 4.]
Fig. 5. The LFSR/PSG approach.

- LFSR/PSG process created in VHDL, with 2 operating modes:
  (a) LFSR mode: driven by the 15 Hz clock signal (a 19-bit LFSR with a primitive polynomial counts up to 20 h).
  (b) PSG mode: entered when the LFSR output matches a given seed; the LFSR/PSG process then runs at speed (parallel signature generator).

Drawback:
- The HW required is slightly higher than in the previous clock/counter approach.

3.4. Signature Analysis With Continuous Readback Execution

In the previous strategy, the test for SEU occurrences is executed periodically. The LFSR is used to start the readback operation and to compact the configuration bitstream each time. Another option for the test is to execute the readback continuously, as it does not affect the normal FPGA operation.
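The compaction step described above can be illustrated with a small behavioural model. The following is a minimal Python sketch, not the authors' VHDL process: the tap set, the seed, the function names and the stand-in bitstream are all assumptions chosen for illustration. It folds a serial readback bitstream into a 19-bit signature and requests a refresh when the signature differs from a stored reference.

```python
WIDTH = 19
# Assumed taps for a degree-19 maximal-length polynomial
# (x^19 + x^18 + x^17 + x^14 + 1, or its reciprocal, depending on
# the bit-numbering convention). The slides do not name the polynomial.
TAPS = (19, 18, 17, 14)
MASK = (1 << WIDTH) - 1


def step(state: int, data_bit: int = 0) -> int:
    """Advance the register one clock, folding in one readback bit."""
    feedback = 0
    for tap in TAPS:
        feedback ^= (state >> (tap - 1)) & 1
    return ((state << 1) | (feedback ^ data_bit)) & MASK


def signature(bits, seed: int = 1) -> int:
    """Compact a serial bitstream into a WIDTH-bit signature."""
    state = seed
    for bit in bits:
        state = step(state, bit & 1)
    return state


def needs_refresh(readback_bits, golden_signature: int) -> bool:
    """True when the readback signature differs from the reference."""
    return signature(readback_bits) != golden_signature


if __name__ == "__main__":
    # Hypothetical stand-in for a configuration bitstream.
    golden_bits = [(i * 7 + 3) % 5 % 2 for i in range(1000)]
    golden = signature(golden_bits)

    upset_bits = list(golden_bits)
    upset_bits[123] ^= 1  # emulate a single-event upset

    print(needs_refresh(golden_bits, golden))  # False
    print(needs_refresh(upset_bits, golden))   # True
```

Because the bit shifted out of the register always takes part in the feedback, the update is invertible, so a single flipped bit always changes the final signature; multi-bit errors can alias with probability of roughly 2^-19. Note also that a maximal-length 19-bit register cycles through 2^19 - 1 = 524,287 states, which at 15 Hz is about 9.7 hours; how the 20-hour figure quoted in the slides is reached is not stated.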
Advantage: reduced HW overhead (part of the LFSR/PSG process is no longer needed: the internal 15 Hz clock used to start the readback process on FPGA A, and the circuit used for clock signal switching, are eliminated). Alternatively, the 15 Hz clock could be used in a different process to control the FPGA B self-refreshing activity. This strategy saves space on FPGA B and allows the integrity of FPGA A to be verified more frequently.

Drawback: power consumption is slightly higher than in the LFSR/PSG approach, due to the continuous readback operation of FPGA A.

4. Masking Connectivity Faults

Reliability improvements in the processing elements are worthless if the correctness of the input data is not guaranteed.

Goal: mask faults in the external FPGA pins and in the internal FPGA routing resources.

[Figure, labels: Sensor 1; Sensor 2; Sensor 3; Kernel; Application process; FPGA.]
Fig. 6. Using replicated inputs/voter to mask connectivity faults.

5. Numerical Analysis of the CCM Node in Two Modes of Operation

First situation: the 3 flash memories hold 3 different configuration bitstreams (CBs).
- This scenario represents a real reconfigurable computing system, because the FPGA functionality can be altered on the fly, according to the application requirements.
- From the fault-tolerance point of view it is not a good approach: if an SEU occurs in one of the flash memories, the respective application has to stop and wait for a good CB to be uploaded from the ground station.

Second situation: the 3 flash memories hold the same CB, which characterises a TMR system. The vote is executed, implicitly, by FPGA B.
- This test strategy is not capable of fault location: it is not possible to identify whether the problem was in the flash memory or in the FPGA.
- In any case, FPGA A is reconfigured with a CB from another flash memory. If the error persists, then the diagnosis is a permanent fault in FPGA A, and the module has to be bypassed.
On the other hand, if no error is detected with the new CB, then the respective flash memory is considered faulty, and it needs to be refreshed in order to try to clear any SEUs.

[Plot: reliability (axis from 0.7 to 1) versus time (1 to 3000 hours) for the non-redundant (R1) and redundant (R2) configurations.]
Fig. 7. The reliability responses for the two situations.

6. Expected Performance

Application program: auto-correlation function (ACF) processing of particle count pulses as a means of studying processes occurring in near-Earth plasmas.

DS87C520 [8051 family] (Assembly) x FPGA (VHDL)

Process      Microcontroller      FPGA            Rate
Process 1    4,518T               1T              4,518 times faster
Process 2    8T .. 36T            1T              8 to 36 times faster
Process 3    18T .. 1018T         1T .. 68T       18 to 14.97 times faster
Process 4    1,240T               48T             25.8 times faster
Process 5    1,334T .. 3,438T     132T .. 143T    10.11 to 24.0 times faster
Process 6    11,116T              288T            38.6 times faster

Table 1. Performance comparison for the case study (clock cycles).

7. Conclusions & Future Work

This paper introduced the use of a BIST technique and traditional fault-tolerance strategies together with configurable computing technology to improve the availability of on-board computers used in space applications.
- a network architecture for spacecraft instruments was presented;
- test and fault-tolerance strategies to detect and fix/tolerate SEU occurrences were analysed;
- a technique to mask connectivity faults was also proposed;
- expected strategy performance was estimated.

The strategies described here deserve a deeper investigation, in order to be used in the design of a fault-tolerant on-board instrument processing system, entirely based on configurable computing.

The next step will be the implementation of a prototype to determine the feasibility of the test and fault-tolerance strategies proposed here.
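As a supplement to the reliability comparison of Section 5 (Fig. 7), the difference between a non-redundant and a triplicated element can be sketched with the textbook TMR formula. This is a minimal Python sketch under stated assumptions, not necessarily the model behind the R1 and R2 curves: it assumes a constant failure rate, independent failures and a perfect voter, and the failure-rate value is hypothetical.

```python
import math


def reliability_single(failure_rate: float, hours: float) -> float:
    """R(t) = exp(-lambda * t) for one element (assumed exponential model)."""
    return math.exp(-failure_rate * hours)


def reliability_tmr(failure_rate: float, hours: float) -> float:
    """Textbook TMR with a perfect voter: at least 2 of 3 elements work."""
    r = reliability_single(failure_rate, hours)
    return 3 * r ** 2 - 2 * r ** 3


if __name__ == "__main__":
    FAILURE_RATE = 1e-4  # hypothetical failures per hour, for illustration only
    for hours in (1, 10, 100, 1000, 3000):
        r1 = reliability_single(FAILURE_RATE, hours)
        r2 = reliability_tmr(FAILURE_RATE, hours)
        print(f"{hours:5d} h  non-redundant={r1:.4f}  TMR={r2:.4f}")
```

Under these assumptions, TMR is more reliable than a single element only while the single-element reliability stays above 0.5; the two are equal at exactly 0.5, and beyond that point the redundant arrangement is worse.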