CS137: Electronic Design Automation
Day 9: October 17, 2005
Fault Detection
CALTECH CS137 Fall2005 -- DeHon

Today
• Faults in Logic
• Error Detection Schemes
• Optimization Problem

Problem
• Gates, wires, memories:
  – built out of physical media
  – may fail

Device Physics
• Represent a 1 or 0 with charge
  – On a gate, in a memory
• Charge may be disrupted
  – α-particle (other ionizing particles)
  – Ground bounce
  – Noise coupling
  – Tunneling
  – Thermal noise
  – Behavior of individual electrons is statistical

DRAMs
• Small cells
• Store charge dynamically on capacitor
• Store about 50,000 electrons
• Must be refreshed
  – Data leaks away through parasitic resistance
• α-particle can be 1,000,000 carriers?

System Reliability
• Each device fails with probability $P_{fail}$
• Have $N$ components in system
• All must work for the system to work
• $P_{sys} = (1 - P_{fail})^N$
• Expanding: $P_{sys} = 1 - N P_{fail} + \binom{N}{2} P_{fail}^2 - \binom{N}{3} P_{fail}^3 + \dots$
• If $N P_{fail} \ll 1$, the $N P_{fail}$ term dominates the higher-order terms:
  – $P_{sys} \approx 1 - N P_{fail}$
  – $P_{sysfail} \approx N P_{fail}$

Modern System
• 100 million to 1 billion transistors
  – Not to mention wiring…
• > GHz ⇒ > 1 billion transitions / sec.
• $N = 10^{18}$ per second…
• $P_{sys} \approx 1 - N P_{fail}$

As we scale?
• $N$ increases
• Charge/gate decreases
  – Fewer electrons
  – Higher probability they wander
  – Greater variability in behavior
• Voltage levels decrease
  – Smaller barriers
• Greater variability in device parameters
• ⇒ $P_{fail}$ increases, with $P_{sys} \approx 1 - N P_{fail}$

Exacerbated at Nanoscale
• Small numbers of dopants (10s)
  – High variability
• Small numbers of electrons (10-1000s?)
  – High variability
  – Highly susceptible to noise
• Small number of molecules
  – May break, decay…

What do we do about it?
• Tolerate faulty components
• Detect faults
  – Not do anything bad
  – Try it again
• If statistically unlikely error,
  – high likelihood it won't recur.
• …Focus on detection…

Detect Faults
• Key Idea: redundancy
• Include enough redundancy in computation
  – Can tell that an error occurred

What kind of redundancy can we use?
• Multiple copies of logic
• Compute something about result
  – Parity on number of outputs
  – Count of number of 1's in output

Error Detection
[figure]

What do we protect against?
• Any n errors
  – Worst-case selection of errors

Single Error Detection
• If $P_{fail}$ small:
  – No error: $(1 - P_{fail})^N \approx 1 - N P_{fail}$
  – One error: $N P_{fail} (1 - P_{fail})^{N-1} \approx N P_{fail}$
  – Two errors: $[N(N-1)/2] \, P_{fail}^2 \, (1 - P_{fail})^{N-2}$
• Probability of an error going undetected, for $N P_{fail} \ll 1$:
  – Goes from $N P_{fail}$ to $(N P_{fail})^2$

Single Error Detection (Example)
• $N = 10^{10}$, $P_{fail} = 10^{-20}$, so $N P_{fail} = 10^{-10} \ll 1$
• Without detection: ~$10^{10}$ cycles MTTF (Mean Time To Failure)
  – At 1 GHz = 10 s
• With detection: $(N P_{fail})^2 = 10^{-20}$ ⇒ $10^{20}$ cycles MTTUF (Mean Time To Undetected Fault)
  – At 1 GHz = $10^{11}$ s ≈ 3000 years

Detection Overhead
• …but: Correction and detection circuitry increase circuit size.
• $N_{detect} > N_{logic}$
• $N_{detect} = c \, N_{logic}$
• Probability of an error going undetected goes from $N P_{fail}$ to $(c N P_{fail})^2$
• To come out ahead, want: $c^2 \ll 1/(N P_{fail})$
• Example: $c = 3$, $N = 10^{10}$, $P_{fail} = 10^{-20}$
  – $(c N P_{fail})^2 = 9 \times 10^{-20}$ ⇒ ~$10^{19}$ cycles MTTUF
  – At 1 GHz = $10^{10}$ s ≈ 300 years

Detection Overhead
• …but: Correction and detection circuitry increase circuit size.
• $N_{detect} > N_{logic}$
• $N_{detect} = c \, N_{logic}$
• Probability of an error going undetected goes from $N P_{fail}$ to $(c N P_{fail})^2$
• To come out ahead, want: $c^2 \ll 1/(N P_{fail})$
• Example: $c = 3$, $N = 3 \times 10^{10}$, $P_{fail} = 10^{-11}$
  – $N P_{fail} = 0.3$
  – $(c N P_{fail})^2 = 0.81$: worse
  – Neither workable!

Reliability Tuning
• Want $N P_{fail}$ small
  – Want: $(c N P_{fail})^2$ very small
• Idea:
  – Guard subsystems independently
  – Make $N_s$ suitably small
  – Smaller probability there is a double error localized in this small subsystem
• That is: as long as compartmentalization guarantees very small $(c N_s P_{fail})^2$:
  – can reduce to single detect case.

Guarding Subsystems
[figure]

Composing Subsystems
• $P_{sysundetect} = (N_{sys}/N_s) \, P_{subundetect}$
• $P_{subundetect} = (c N_s P_{fail})^2$
• $P_{sysundetect} = (N_{sys}/N_s) (c N_s P_{fail})^2$
• $P_{sysundetect} = N_{sys} N_s (c P_{fail})^2$
• Extremes:
  – $N_s = N_{sys}$: no benefit
  – $N_s = 1$: maximum benefit, a factor of $N_{sys}$ [in practice $c = f(N_s)$]

Composing Subsystems
• $P_{sysundetect} = N_{sys} N_s (c P_{fail})^2$
• Example: $c = 3$, $N_{sys} = 3 \times 10^{10}$, $P_{fail} = 10^{-11}$, $N_s = 10^3$
  – $3 \times 10^{10} \times 10^3 \times (3 \times 10^{-11})^2$
  – $= 3^3 \times 10^{-9}$
  – $\approx 3 \times 10^{-8}$ ($\ll 0.81$)
• Still < 1 s MTTUF …

Problem
• Motivates problem:
  – Generate logic capable of detecting any single error

Terminology
• Fault-secure: under a fault, the system never produces an incorrect code word
  – Either produces correct result
  – Or detects the error
• Self-testing: for every fault, there is some input that produces a non-code-word output
  – That detects the error
• Totally Self Checking: system is both fault-secure and self-testing.
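The MTTF and MTTUF figures on the reliability slides above can be reproduced with a few lines of arithmetic. The following is a minimal Python sketch, not part of the original lecture: the function names and the 1 GHz `CLOCK_HZ` constant are assumptions chosen for illustration, and the undetected-error formulas are the slide approximations, which strictly hold only when N·Pfail ≪ 1.

```python
import math

CLOCK_HZ = 1e9  # assumed 1 GHz clock, as in the slide examples


def system_failure_prob(n, p_fail):
    """1 - (1 - p_fail)**n, computed stably for very small p_fail."""
    return -math.expm1(n * math.log1p(-p_fail))


def undetected_prob(n, p_fail, c=1.0):
    """Slide approximation (c * N * Pfail)^2 with detection overhead c."""
    return (c * n * p_fail) ** 2


def composed_undetected_prob(n_sys, n_sub, p_fail, c):
    """Slide formula Nsys * Ns * (c * Pfail)^2 for guarded subsystems."""
    return n_sys * n_sub * (c * p_fail) ** 2


def mean_seconds_between(p_event_per_cycle):
    """Mean time between events, assuming one trial per clock cycle."""
    return 1.0 / p_event_per_cycle / CLOCK_HZ


# Example from the slides: N = 1e10, Pfail = 1e-20.
print("P(system fails) :", system_failure_prob(1e10, 1e-20))
print("MTTUF, c=1  [s] :", mean_seconds_between(undetected_prob(1e10, 1e-20)))
print("MTTUF, c=3  [s] :", mean_seconds_between(undetected_prob(1e10, 1e-20, c=3)))

# Less reliable devices: N = 3e10, Pfail = 1e-11, c = 3.
print("whole system    :", undetected_prob(3e10, 1e-11, c=3))
p_comp = composed_undetected_prob(3e10, 1e3, 1e-11, 3)
print("Ns = 1e3 groups :", p_comp)
print("MTTUF       [s] :", mean_seconds_between(p_comp))
```

Varying `n_sub` in `composed_undetected_prob` shows the linear dependence on subsystem size; note that the sketch holds `c` fixed, whereas the slides point out that in practice c depends on Ns.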

Duplication
[figure]
• Detects any single fault (even in checker)

Duplication
• $N$ original gates
• Duplicate: $+ N$
• $O$ outputs
  – $O$ xors
  – $(O/2) \times 2 \times 2 = 2O$ ors
  – Total $3O$ gates
• Total: $2N + 3O$
• $O < N$
• $2 < c < 5$

Duplication
• Total: $2N + 3O$
• $O < N$
• Rent's Rule: $O \sim k N^p$
  – $p < 1$
• Total: $2N + 3kN^p$
• $c(N) = 2 + 3k/N^{1-p}$
  – $N$ small ⇒ 5
  – $N$ large ⇒ 2

Duplication with PLA
[figure: Logic, Duplicate]

PLA Duplication
• $N$ product terms in original
• $N$ in duplicate
• $2O$ product terms for matching
• $O \le N$
• $2 < c < 4$

Can we do better?
• Seems like overkill to compute twice?

Idea
• Encode so outputs have some checkable property
  – E.g. parity

Will this work?
[figure: Original Logic; Extra cubes for parity; parity]

Problem
• Single fault may produce multiple output errors

How Fix?
• How do we fix?

No Logic Sharing
• No sharing
• Single fault affects single output

Parity Checking
• To check parity
  – Need xor tree on outputs/parity
  – $[(O+1)/2] \times 2 \times 2 = 2(O+1)$ xors
• For PLA
  – xor would blow up
  – Wrap multiple times
  – 2 product terms per xor
  – $4O$ product terms

nanoPLA Wrapped xor
[figure]
• Note: two planes here just for buffering/inversion

Better or Worse than Dual?
(product-term counts; checking not included)

Design   Ins  Outs  Orig Pterms  Parity  Dual
add4       9     5          135     283   240
ex1010    10    10          284     880   568
inc        7     9           29      53    58
misex1     8     7           12      40    24
rd73       7     3            7      20    10
rd84       8     4          255     389   441
sao2      10     4            9      31    14
squar5     5     8           25      38    49
z5xp1      7    10           63      96   125

Can we allow sharing?
• When?
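The "Problem" and "No Logic Sharing" slides can be made concrete with a toy two-output PLA. The sketch below is an illustration only, not the lecture's tool flow: the PLA model, the names `evaluate` and `undetected_faults`, and the fault model (a single product term stuck at 0, with the parity-prediction logic assumed fault-free) are all assumptions made for this example.

```python
from itertools import product


def and_ab(a, b):
    return a & b


def a_not_b(a, b):
    return a & (1 - b)


def not_a_b(a, b):
    return (1 - a) & b


def evaluate(terms, outputs, inputs, stuck_at_0=None):
    """Evaluate a PLA: each output is the OR of the listed product terms."""
    vals = [t(*inputs) for t in terms]
    if stuck_at_0 is not None:
        vals[stuck_at_0] = 0  # inject the fault on one product term
    return [int(any(vals[i] for i in out)) for out in outputs]


def undetected_faults(terms, outputs):
    """List (term, input) pairs where outputs are wrong but parity matches."""
    misses = []
    for fault in range(len(terms)):
        for inputs in product((0, 1), repeat=2):
            good = evaluate(terms, outputs, inputs)
            bad = evaluate(terms, outputs, inputs, stuck_at_0=fault)
            predicted_parity = sum(good) % 2  # assumed fault-free parity logic
            if bad != good and sum(bad) % 2 == predicted_parity:
                misses.append((fault, inputs))
    return misses


# y0 = a, y1 = b, with the product term a&b shared by both outputs.
shared_terms = [and_ab, a_not_b, not_a_b]
shared_outputs = [[0, 1], [0, 2]]

# Same functions, but each output gets its own copy of a&b.
unshared_terms = [and_ab, a_not_b, and_ab, not_a_b]
unshared_outputs = [[0, 1], [2, 3]]

print("shared  :", undetected_faults(shared_terms, shared_outputs))
print("unshared:", undetected_faults(unshared_terms, unshared_outputs))
```

With sharing, a fault on the common term flips both outputs at input (1, 1), so the output parity is unchanged and the error escapes; with the term duplicated, every fault in this model changes exactly one output and is caught.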

Multiple Parity Groups
• Can share with different parity groups
• Common error flagged in both groups

Multi-Parity Group Compare
(product-term counts; Mparity = multi-parity result (AMD); checking not included)

Design   Mparity  Orig  Parity  Dual
add4         209   135     283   240
ex1010       822   284     880   568
inc           44    29      53    58
misex1        25    12      40    24
rd73          10     7      20    10
rd84         402   255     389   441
sao2          17     9      31    14
squar5        36    25      38    49
z5xp1        103    63      96   125

grps (values as listed in the source; row alignment not recoverable): 4 2 6 6 7 1 9 5 9

Best Results from Winter2004 CS137
(product-term counts; checking not included)

Design   class  AMD  Orig  Parity  Dual
add4       193  209   135     283   240
ex1010          822   284     880   568
inc              44    29      53    58
misex1      23   25    12      40    24
rd73        *8   10     7      20    10
rd84       385  402   255     389   441
sao2        13   17     9      31    14
squar5      34   36    25      38    49
z5xp1           103    63      96   125

Better or Worse than Dual?
[figure]
• Typical results from Mitra [ITC2002]
  – Multi-level gate mapping to LSI std. cell library
  – (parity here includes multiple parity)

Admin
• Assignment #2 due Friday
• Wednesday reading online
• Friday reading handout

Big Ideas
• Low-level physics imperfect
  – Statistical, noisy
• Larger number of devices ⇒ greater likelihood of faults
• Redundancy
• Self-checking circuits
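As a concrete illustration of the Multiple Parity Groups slide, the sketch below assigns outputs to parity groups so that no product term feeds two outputs in the same group. This is a hypothetical example, not the course's or AMD's algorithm: the greedy assignment is a swapped-in heuristic, the names `greedy_parity_groups` and `single_fault_detectable` are invented here, and "at most one output per group per term" is used as a simple sufficient condition for catching a single product-term fault.

```python
from itertools import combinations


def conflicts(term_fanout):
    """Pairs of outputs that share at least one product term."""
    pairs = set()
    for outs in term_fanout.values():
        pairs.update(combinations(sorted(outs), 2))
    return pairs


def greedy_parity_groups(outputs, term_fanout):
    """Give each output the lowest group holding no output it shares with."""
    clash = conflicts(term_fanout)
    group_of = {}
    for out in outputs:
        used = {
            g for other, g in group_of.items()
            if tuple(sorted((other, out))) in clash
        }
        group = 0
        while group in used:
            group += 1
        group_of[out] = group
    return group_of


def single_fault_detectable(term_fanout, group_of):
    """True if every term feeds at most one output per parity group."""
    for outs in term_fanout.values():
        groups = [group_of[o] for o in outs]
        if len(groups) != len(set(groups)):
            return False
    return True


# Hypothetical PLA: term t0 is shared by outputs y0 and y1.
term_fanout = {"t0": ["y0", "y1"], "t1": ["y0"], "t2": ["y1"], "t3": ["y2"]}
outputs = ["y0", "y1", "y2"]

groups = greedy_parity_groups(outputs, term_fanout)
print("groups:", groups)
print("safe with these groups:", single_fault_detectable(term_fanout, groups))

one_group = {o: 0 for o in outputs}
print("safe with one group   :", single_fault_detectable(term_fanout, one_group))
```

Fewer groups mean fewer parity bits to compute and check, while more groups allow more sharing in the original logic; that trade-off is what the product-term counts in the tables above compare.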