Analytical Approach for Soft Error Rate Estimation of SRAM-Based FPGAs
Ghazanfar (Hossein) Asadi
Test & Reliability Group (TRG)
Department of Electrical & Computer Engineering
Northeastern University
2004 MAPLD/221

Outline
• Problem Statement & Motivation
• Soft Errors Background & Previous Work
• Error Models in FPGAs
• SER Estimation
• Experimental Results
• Summary & Conclusions

Problem Statement
• Estimating the soft error rate in FPGAs
    • The probability of system failure due to soft errors
    • For a given mapped design
• Mean time to manifest a corrupted configuration bit
    • To primary outputs or flip-flops

Motivation
• Need for soft error rate estimation
    • Exponential growth of vulnerable bits due to Moore's law
    • High cost of error-tolerant schemes
    • To make appropriate cost/reliability trade-offs
        • Where to put redundancy
• Why an analytical method?
    • Previous work: fault injection
        • Time-consuming / incomplete / expensive
        • Needs a physical prototype board
        • Cannot be used in design phases

Background: Error Definitions
• Soft errors: intermittent malfunctions of the hardware
    • Not reproducible
• Energetic particles → single event upsets (SEUs) → soft errors → (may cause) system failure

Previous Work
• Based on fault injection (FI)
    1. Inject fault
    2. Run several workloads
    3. Compare results with the fault-free circuit
• Exhaustive FI is very time-consuming
    • Select some candidate locations for FI
    • Analysis based on statistics

Previous Work (Cont.)
• Radiation-based fault injection
    • Expensive & not commonly used
    • Needs physical implementation
    • Can damage the prototype board (hard error)
• Simulation-based fault injection
    • Cannot be used during design phases
• Bit-stream alteration
    • Needs physical implementation
    • Bridging errors may lead to hard errors

Error Models in FPGAs
• Memory resources:
    • User bits: flip-flops, RAMs, …
    • Configuration bits: mux select bits, LUT bits, …
• User bits → transient errors
• Config. bits → permanent errors

Error Models in FPGAs (Cont.)
[Figure: SEU (bit flip) in a configuration memory cell of a Virtex (Xilinx) device, showing a LUT with inputs F1 to F4, a flip-flop, BlockRAM, and routing between E1, E2, E3. Annotations: "Bit flip: permanent error, corrected by reconfiguration"; "Bit flip: transient error, can be corrected at the next load"; "Short or open circuit: corrected by reconfiguration". © Lima (DAC03)]

Error Models in FPGAs (Cont.)
• Transient errors
    • User flip-flops, logic gates, block RAMs
• Permanent errors (all configuration bits)
    • Routing:
        • MUX select bits
        • PIP: short/open
        • Buffer: on/off
    • LUT
    • Control/clocking bits

Error Models in FPGAs (Cont.)
• Only permanent errors are considered
• Configuration bits comprise more than
    • 99% of all memory elements, excluding RAM blocks
    • 95% of all memory elements, including RAM blocks

Device     # of Config. Bits   # of Flip-Flops   Ratio
XCV50                559,200             3,996     99%
XCV400             2,546,048            22,812     99%
XCV800             4,715,616            43,872     99%
XCV1000            6,127,744            56,832     99%

SER Estimation
• Traversing structural paths [Asadi04]
    • From fault sites to primary outputs (POs)
[Figure: propagation paths from an SEU site to POs and a flip-flop (FF). On-path signals are drawn as thick lines, off-path signals as thin lines.]

SER Estimation in ASIC Designs
• $S(n)$: system failure probability (SFP) vector
    • $S_i$: SFP given that node $i$ is erroneous
    • $n$: total number of fault sites
• Experiments on ISCAS89 show:
    • Three orders of magnitude faster than random-input simulation
    • Average accuracy: 97%

FPGA vs. ASIC in SER Estimation
• ASIC: transient errors
    • Only requires propagation probability
• FPGA: both transient & permanent errors
    • Transient errors: the same as for ASICs
    • Permanent errors: need activation as well
• Nodes have different error rates in FPGAs
    • Fault sites: all nodes
[Figure: example netlist with nodes A, B, C and nets n1, n2]

SER Estimation of FPGAs: Steps
• Compute permanent error rates for all nodes
    • $PR_i$: the permanent error rate of node $i$
• Compute the netlist failure probability vector
    • $N_i$: failure probability given that node $i$ is erroneous
• System failure rate vector: $S = PR \cdot N$ (element by element)
    • $S_i = PR_i \, N_i$
    • $n$: total number of fault sites

How to Compute $N_i$?
• Open & stuck-at errors:
    $N_i = [SP_i \, PP_i(0) + (1 - SP_i) \, PP_i(1)] = PP_i$
    • $PP_i$: propagation probability (the method used for ASICs)
• Bridging wired-AND error (nets $i$ and $j$):
    $N_i = [SP_i (1 - SP_j) \, PP_i(0)] + [(1 - SP_i) \, SP_j \, PP_j(0)]$
• Bridging wired-OR error (nets $i$ and $j$):
    $N_i = [SP_i (1 - SP_j) \, PP_j(1)] + [(1 - SP_i) \, SP_j \, PP_i(1)]$
    • $SP$: signal probability, used as the activation probability

How to Compute $PR_i$?
• $PR(n)$: permanent error rate vector
    • $PR_i = r \times f$
    • $r$: raw error rate of an SRAM cell
    • $f$: number of all possible errors at node $i$
    • $n$: total number of fault sites
• Example (from the figure, nodes A and B): $PR_{AB} = 6 \times r$

System Failure Rate
• For the first clock:
    $SFR = 1 - \prod_{i=1}^{n} (1 - S_i)$
    • The same probability is valid for the next clock cycles
• For $c$ clock cycles:
    $SFR = 1 - \prod_{i=1}^{n} \left(1 - PR_i \left[1 - (1 - N_i)^{c}\right]\right)$
    • $c$: number of clocks checking the state of the circuit after the particle hit
    • For $c = 1$ this reduces to the first-clock expression, since $S_i = PR_i \, N_i$

Error List
• Mux open
• PIP open
• Buffer off
• A bit flip in a LUT
• Control bit flip

Experimental Setup
• Xilinx Virtex 300 (XCV300)
• Xilinx Design Language (XDL)
• Benchmark: some ISCAS89 circuits
• $r$ = raw failure rate for an SRAM cell
    • $r$ = 0.01 FIT/bit
• 1000 clocks executed for each SEU
• Platform: Sun Solaris Ultra-10
    • 256 MB main memory

Results: Sensitive Bits
Number of sensitive SRAM bits for each part

Circuit             s27   s298   s344   s349   s382   s386
Routing              64    459    536    650    807    714
LUT                  68    418    392    520    712    660
Control/Clocking     40    140    168    187    207    160
Total               172   1017   1096   1357   1726   1534

Results: Manifestation Time
Mean Time To Manifest (MTTM) errors to outputs, in clock cycles

Circuit             s27    s298    s344    s349    s382    s386
Routing            2.07    2.86    2.58    2.91    3.30    3.82
LUT               14.49   20.75   17.33   20.48   22.08   30.07
Control/Clocking   1.18    1.31    1.36    1.40    1.40    1.77

Results: SFR & Estimation Time
System failure rate and estimation time (number of clock cycles: 1000)

Circuit             s27    s298    s344    s349    s382    s386
SFR (FIT)          1.71    9.87    9.99   12.77   16.04   12.11
SP Time (sec)      0.15    0.76    0.91    1.09    1.25    1.05
SFR Time (sec)     0.02    0.09    0.13    0.14    0.19    0.25
Total Time (sec)   0.17    0.85    1.04    1.23    1.44    1.30

• SP Time: signal probability computation time
• SFR Time: system failure rate computation time
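The estimation steps on the slides above (computing $N_i$, $PR_i = r \times f$, and the system failure rate) can be sketched in a few lines of Python. This is only an illustration, not the author's tool: the function names and all numeric inputs are hypothetical, `r` is treated as a dimensionless per-interval upset probability rather than a FIT value, and the $c$-cycle expression is assumed to read $SFR = 1 - \prod_i (1 - PR_i[1 - (1 - N_i)^c])$.

```python
from math import prod


def n_open_or_stuck(sp, pp0, pp1):
    """N_i for an open or stuck-at error: SP_i*PP_i(0) + (1-SP_i)*PP_i(1)."""
    return sp * pp0 + (1.0 - sp) * pp1


def n_bridge_and(sp_i, sp_j, pp_i0, pp_j0):
    """N_i for a wired-AND bridge between nets i and j (slide formula)."""
    return sp_i * (1.0 - sp_j) * pp_i0 + (1.0 - sp_i) * sp_j * pp_j0


def n_bridge_or(sp_i, sp_j, pp_i1, pp_j1):
    """N_i for a wired-OR bridge between nets i and j (slide formula)."""
    return sp_i * (1.0 - sp_j) * pp_j1 + (1.0 - sp_i) * sp_j * pp_i1


def system_failure_rate(pr, n, c=1):
    """SFR after c clock cycles; c=1 gives 1 - prod(1 - PR_i * N_i)."""
    return 1.0 - prod(1.0 - p * (1.0 - (1.0 - q) ** c) for p, q in zip(pr, n))


# Hypothetical inputs, chosen only to exercise the formulas.
r = 1e-3                      # assumed upset probability of one SRAM cell
f = [6, 2, 1]                 # assumed number of possible errors per node
pr = [r * k for k in f]       # PR_i = r * f

n_vec = [
    n_open_or_stuck(0.5, 0.4, 0.8),
    n_bridge_and(0.5, 0.5, 0.2, 0.6),
    n_bridge_or(0.5, 0.5, 0.4, 0.8),
]

sfr_1 = system_failure_rate(pr, n_vec, c=1)
sfr_1000 = system_failure_rate(pr, n_vec, c=1000)
print(f"SFR after 1 clock: {sfr_1:.6f}, after 1000 clocks: {sfr_1000:.6f}")
```

As $c$ grows, each factor $1 - (1 - N_i)^c$ approaches 1 for any node with $N_i > 0$, so the result tends toward $1 - \prod_i (1 - PR_i)$: given enough cycles, every activated permanent error eventually manifests.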
Summary & Conclusions
• A new approach for SER estimation for SRAM-based FPGAs
    • No physical implementation required
    • Can be used in early design stages
    • Very fast estimation (at most 1.44 s per benchmark circuit in the reported results)
    • Can cover all possible faults
• Mean time to manifest errors to outputs:
    • MTTM(control/clocking) < MTTM(routing)
    • MTTM(routing) < MTTM(LUT)

Appendix & Backup

Background: Soft Error Origin
• The main sources in terrestrial conditions: alpha particles & neutrons
• A soft error occurs if the hitting particles generate more charge than Qcrit
• Critical charge (Qcrit): the minimum charge needed to flip the value stored in the cell

Exponential Increase of Soft Errors
[Figure: exp(-Qcrit/Qs) trend with technology scaling for SRAM and latch cells, plotted from 0 to 1 over the 600 nm, 350 nm, 250 nm, 180 nm, 130 nm, 100 nm, 70 nm, and 50 nm nodes (Shivakumar, DSN 2002)]
• Qcrit: the critical charge (depends on characteristics of the circuit)
• Qs: the charge collection efficiency of a particle strike on the device
• Particles of lower energies occur far more frequently

Background: Definitions
• How to express soft error rate (SER)
    • MTBF (Mean Time Between Failures)
    • FIT (Failure-in-Time): 1 failure in a billion hours
    • 1 year MTBF = 114,155 FIT

Background: Definitions
• Failure definition:
    (a) Propagation of an erroneous value to at least one flip-flop or primary output, or
    (b) Propagation of an erroneous value to at least one primary output
• Definition (a) is compatible with (b) if there is no redundant flip-flop in the circuit

Failure Error Rate of LUT
• LUT as a complex gate
    • To reduce the number of nodes
• $P(t_x)$: the probability of $O = t_x$
• LUT failure rate:
    $S_O = [AP(t_0) + AP(t_1) + \dots + AP(t_{15})] \cdot r \cdot N_O = r \cdot N_O$
    • The second equality holds because the sixteen $AP$ terms sum to 1
[Figure: 4-input LUT with inputs F1 to F4, table entries t0 to t15, and output O]

Xilinx Virtex FPGA Model
[Figure: Virtex device model showing CLBs (logic blocks), IO muxes, switch matrices (SM), line segments, and IOBs]

CLB Architecture
[Figure: CLB architecture]

Error Models in FPGAs (Cont.)
• Config. bits:
    • Care bits: all 1s and some of the 0s
    • Don't-care bits: some of the 0s

Error Models: PIP Short/Open
• 1→0: causes an open
• 0→1: may cause a short or bridging error
• Stuck-open: permanently OFF
• Stuck-closed: permanently ON
[Figure: stuck-open and stuck-closed errors in a switch matrix with terminals N1 to N3, E1 to E3, W1 to W3, and S1 to S3, including two bridging-error cases]

Error Models (Cont.)
• Buffer on/off
    • Tri-state buffers
    • Used in IOBs
• LUT (look-up table)
[Figure: tri-state buffer in the on and off states, and a LUT with inputs F1 to F4 and output O with example bit values]
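The LUT error model can be illustrated with a small Python sketch. It treats a 4-input LUT as 16 configuration bits t0 to t15 addressed by F1 to F4, flips one bit to mimic an SEU, and lists the input combinations whose output changes. The function names, the bit ordering (F1 as the least significant address bit), and the assumption of statistically independent inputs in `activation_probability` are illustrative choices, not taken from the slides.

```python
from itertools import product


def lut_index(f1, f2, f3, f4):
    """Address of the table entry selected by the inputs (F1 assumed LSB)."""
    return f1 | (f2 << 1) | (f3 << 2) | (f4 << 3)


def lut_output(table, f1, f2, f3, f4):
    """Output O of a 4-input LUT whose 16 configuration bits are `table`."""
    return table[lut_index(f1, f2, f3, f4)]


def flip_bit(table, k):
    """Return a copy of the LUT contents with configuration bit k flipped."""
    faulty = list(table)
    faulty[k] ^= 1
    return faulty


def differing_inputs(golden, faulty):
    """All (F1, F2, F3, F4) combinations where the two LUTs disagree."""
    return [bits for bits in product((0, 1), repeat=4)
            if lut_output(golden, *bits) != lut_output(faulty, *bits)]


def activation_probability(k, sp):
    """Probability that the inputs address entry k.

    Assumes independent inputs, with sp[m] = P(F(m+1) = 1).
    """
    p = 1.0
    for m in range(4):
        p *= sp[m] if (k >> m) & 1 else 1.0 - sp[m]
    return p


and4 = [0] * 15 + [1]          # 4-input AND: only entry t15 is 1
faulty = flip_bit(and4, 5)     # hypothetical SEU in entry t5
changed = differing_inputs(and4, faulty)

sp = [0.5, 0.3, 0.7, 0.9]      # hypothetical input signal probabilities
total_ap = sum(activation_probability(k, sp) for k in range(16))
print(changed, total_ap)
```

A single flipped bit corrupts the output for exactly one input combination, so the error is activated only when that entry is addressed. Summed over all 16 entries the activation probabilities add up to 1, which matches the simplification of the LUT failure rate to $r \cdot N_O$.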