Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Blade – A Path Towards Average-Case Silicon via Bundled-Data Resilient Design Peter A. Beerel CHIP in Bahia, Salvador Brazil Sept 4th, 2015 Collaborations and Acknowledgements USC (USA): Dylan Hand, Fei Huang, Ramy Tadros, Alan Huang, Yang Zhang, Mel Breuer, Danlei Chen, Weizhe Hua, Huimei Cheng, Austin Katzin, Yuqi Song, Zhe Liu, Jun He Tsinghua (China): Benmao Cheng BITS (India): Ajay Singhvi PUCRS (Brazil): Ney Calazans, Matheus Moreira, Guilherme Heck, Leandro Heck, Matheus Gibiluka USCS (Brazil): Frederico Butzke MOTIVATION | 2 Motivation: Delay Overheads Cycle time of clocked logic Logic Time PVT margin Clock margin Logic gates Flip-flop alignment [Dreslinksi et al., IEEE Proc. 2010] Traditional synchronous design suffers from increased margins • Worse at low and near-threshold regions MOTIVATION | 2 Motivation: Data Dependent Delays Average Delay Cycle time of clocked logic Logic gates Worst-Case Data Data delay in Plasma CPU Number of Operations Logic Time Worst-Case Delay Delay Slower Delay variation due to data is rarely exploited in traditional designs MOTIVATION | 3 Outline Resiliency Design and its Pitfalls Blade Design, Operation, and Special Cells Blade Analysis and Optimization Blade CAD Flow and Case Study Conclusions and On-Going Work OUTLINE | 5 Resilient Design CLK = 1 0 Sequential Gate Logic Shadow Sequential 1 0 Error Detection Next Pipeline Stage To Control Logic Allow and detect timing errors in datapath • Correct via architectural replay or gating/pausing clock Effective approaches have been elusive! RESILIENT DESIGN | 5 Metastability Issue Shadow FF Latch CLK Logic X Next Pipeline Stage To Control Logic X [Bubble Razor, Fojtik, 2013] The Problem [Beer et al, 2014] • Flop metastability can propagate to ERR signal • Metastability in control path can cause system failure RESILIENT DESIGN | 6 Handling Metastability CLK Logic Latch Transition Detector X Next Pipeline Stage 1 or 0 FFControl FF Logic To Synchronizer [RazorII, Das, 2009] Transition detector may exhibit metastability • Error signal must go through synchronizer, increasing delay • Uses architectural replay to recover from errors RESILIENT DESIGN | 7 Hold Time Concerns Ring Oscillator Clock Generator CLK = 1 Data In STOP MS Detector Comb. Logic In FF CLK = 0 Internal Data Latch Out [SafeRazor, Cannizzaro, 2014] Relies on latch for error correction Hold times are problematic RESILIENT DESIGN | 8 Resiliency Landscape Design Template Sync / Async MTBF Safe Avoids Replay Logic Hold Low Error Time Penalty Robust Bubble Razor Sync No Yes Yes Yes Razor II Sync Yes No No No SafeRazor Async Yes* Yes No Yes Blade Async Yes Yes Yes Yes Blade combines the best features of past resiliency schemes RESILIENT DESIGN | 9 Outline Resiliency Design and its Pitfalls Blade Design, Operation, and Special Cells Blade Analysis and Optimization Blade CAD Flow and Case Study Conclusions and On-Going Work OUTLINE | 11 Blade Template Asynchronous Controller δ Reconfigurable Delay Lines Combinational Logic L.data CLK L.ack L.req Blade Controller RE.req RE.ack Δ R.ack R.req Sample LE.req LE.ack Err 2 EDL Error Detection Logic Blade Stage R.data Error Detecting Latch Single Rail Datapath BLADE TEMPLATE | 11 Δ Δ EDL Err δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 12 Δ Δ EDL Err = 0 δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 12 Δ Δ EDL Err = 1 δ Combinational Logic Error Detection Logic Sample Controller B CLK Controller A Sample CLK Blade Template Operation Err EDL Error Detection Logic Send request speculatively before data is guaranteed stable Timing errors delay handshaking signals BLADE TEMPLATE | 12 Δ Δ EDL Error Detection Logic Err = 0 δ Combinational Logic Sample Controller B CLK Controller A Sample CLK Positive Hold Margins Err EDL Error Detection Logic BLADE TEMPLATE | 13 Δ Δ EDL Error Detection Logic Err δ Combinational Logic Sample Controller B CLK Controller A Sample CLK Positive Hold Margins Err EDL Error Detection Logic Handshaking delays create positive hold margin! BLADE TEMPLATE | 13 Error Detection Logic From other Q-Flops tTD delay X In D Latch +C CE = 1 0 { Err0 = 0 { CLK = 01 0 Err1 = 1 0 Sample = 1 Blade Controller Q-Flop Out Q EDL Error Detection Logic C-element stores error signal, which is sampled by Q-Flop BLADE TEMPLATE | 14 Metastable-Safe Operation Err0 = 10 From other Q-Flops tTD delay X In D Latch +C CE = X 0 Some time later… { { CLK = 01 Err1 = 0 0 Sample = 1 Blade Controller Q-Flop Out Q EDL Error Detection Logic Q-Flop prevents metastability propagation to control path BLADE TEMPLATE | 15 Overhead Reduction tcomp tTD From other Q-Flops delay delay X C + { From other C-elements In D Latch Err0 { Sample CLK { Err1 Blade Controller Q-Flop Out Q EDL Error Detection Logic 4-Input C-element Each Q-Flop collects output covers 12 from 3 EDLs EDLs 4-Input OR gate collects output from 4 C-elements C-element and OR gates amortize overhead over many EDLs BLADE TEMPLATE | 16 Controller Implementation L.req L.ack LE.req LE.ack goR δ Left R.req R.ack RE.req RE.ack Right edi goL goD Bottom edo Δ delay Blade Controller Sample Δ CLK Err[1] Err[0] Three part burst-mode state machine • Implemented using 3D [Yun, 1992] BLADE TEMPLATE | 17 Controller Implementation goR+ R.ack- / goD+ R.req+ L.req+ / LE.req+ δ L.req L.ack LE.ack+ / goR+ LE.req LE.ack goL- / L.ackRE.req+ Err[1]+ / edo+ Left goR edi+ / LE.ackRE.ack+ goD- edo- goD+ / CLK+ Err[1]- edi- / RE.req+ Err[0]+ / RE.ack+ goD- Err[1]- edi- / Right edi+ / / goRRE.ack- goD- edogoD- delay- / SampleErr[0]- / goD RE.req- Err[1]+ / edo+ L.req- / LE.reqgoL+ / Lack+ goL edo delay+ / Bottom delay+ / goR- R.ack+ / goD+ goL+ Sample+ CLKgoL- Sample+ CLKdelay R.reqΔ Blade Controller Sample goD- delay/ Sample- Err[0]- / CLK goD+ /Err[1] CLK+ R.req R.ack RE.req RE.ack edi RE.req- Err[0]+ / Δ RE.ack- goD- Err[0] Three part burst-mode state machine • Implemented using 3D [Yun, 1992] BLADE TEMPLATE | 17 Delay Elements: MUX-DE Most popular programmable Delay Elements (DEs) Delay controlled by number of buffers in the signal path Binary codewords N. Mahapatra et al., “Comparison and Analysis of Delay Elements,” in MWSCAS, 2002, pp. 473–476. DELAY ELEMENTS | 18 Delay Elements: One-Hot DCCS Lengths of the current source transistors increase linearly • (1L, 2L, 3L, ..., nL) Replicate inverting stages to balance rise-fall delays For a particular codeword, only one of the current source transistors is ON [Singhvi et al., ISVLSI 2015] DELAY ELEMENTS | 19 Delay Elements: Delay Shift Inverter Simple inverter structure 3 flavors for back-biasing • For RBB, PBIAS > 1V and NBIAS < 0V • Use flavor (b) Amount of delay shift controlled by two factors: • Inverter sizing • Magnitude of back-bias voltage [Singhvi et al., ISVLSI 2015] DELAY ELEMENTS | 20 Delay Lines and Voltage Scaling Voltage Scaled Delay Ratio • Delay at nominal voltage over delay at near threshold voltage The problem • VSDR changes with codeword • Forces needed margin Solution • Replicate and size designs to reduce margins *Delays range: 300ps to 1.4ns *Sized to match rise/fall delays [Singhvi et al, VLSID 2015] DELAY ELEMENTS | 23 Comparison of Characteristics * Energy vs Quantization Error Trade-offs Exist Can be Exploited *DQE: Delay Quantization Error *EPT: Energy Per Transition [Singhvi et al., ISVLSI 2015] DELAY ELEMENTS | 21 Delay Line Quantification Effects Delay Recall: 𝑑 = 𝛿 + p ∗ Δ δ p*Δ Quantification effects reduced due to inherent tradeoff between nominal delayδ and error penalty p * Δ DELAY LINE QUANTIZATION | 25 Delay Line Quantification in BD Delay Bundled Data: 𝑑 = 𝛿 δ 𝑑 Linear relationship between delay line quantization and average stage delay in Bundled Data DELAY LINE QUANTIZATION | 26 Error Detecting Latch: TBTD Transition Detector (TD) and Error Latch (EL) TD EL Formed by a transition detector and an error latch (an asymmetric C-Element) [Bowman et al, ISSCC 2007] ERROR DETECTING LATCH | 27 Error Detecting Latch: Glitch Sensitivity Transistor size XOR • OX-TD+EL Use static C-element • SOX-TD+EL Analyze for glitch sensitivity twin Increased sensitivity Enables safe operation even in the presence of glitches [Moreira et al., ISQED 2015] ERROR DETECTING LATCH | 28 Outline Resiliency Design and its Pitfalls Blade Design, Operation, and Special Cells Blade Analysis and Optimization Blade CAD Flow and Case Study Conclusions and On-Going Work OUTLINE | 29 Resiliency Performance Benefit Nominal Delay Worst-Case Delay Number of Operations Data delay in Plasma CPU Δ δ Delay Key Question: How do we set δ to optimize performance ANALYSIS | 30 Frequency Probability Delay Models Normal Distribution Logic Delay Real World Distribution Probability Logic Delay Log-Normal Distribution Logic Delay Our approach • Analyze the performance of Blade for a variety of delay models DELAY MODELS | 31 Probability Optimal Average-Case Performance Probability of error (p) σ Probability μ δ δ+Δ PR( d ≤ δ ) • • • • C : Clock Period / Cycle Time EC : Effective Clock Period p : Probability of error 𝑑 : Average delay of Blade stage 𝐶 =𝛿+Δ 𝑑 =𝛿+p∗Δ 1-PR( d ≤ δ ) δ Definitions δ+Δ Optimal performance achieved by minimizing 𝑑 *Assumes backward latency is hidden via latch retiming DELAY MODELS | 32 Optimal Probability of Error - popt popt observations • Varies between 20% and 35% for lognormal distributions • Significantly higher than in sync resiliency Higher Variance • Constant for normal distributions! DELAY MODELS | 33 Proof of constant popt Assume worst case delay per stage is constant 𝐾 = 𝛿 + Δ Worst case delay is set by mean, variance, and SER 𝐾 =𝜇+𝑚∗𝜎 Systematic Error Rate (ξ) sets the worst-case delay perpstage, For−1 Normal 2𝜎[erf 1 −distribution: 2𝑝 ] + 𝜇] + ∗ KK 𝑁 1 𝛿−𝜇 𝜉= 1 − 𝑝 = [1 + erf1 − 𝑃𝑅] 𝑑 ≤ 𝐶 Recall: 𝑑 = 𝛿 + 𝑝 ∗ Δ = 1 − 𝑝 ∗ 𝛿 + 𝑝 ∗ 𝐾 Rewrite: 𝑑 = 1 − 𝑝 [ 2 zero : 2𝜎 Minimize 𝑑 by taking derivative and setting it equal to 𝑚 𝛿−𝜇 = 𝑓(𝜉) 𝜕𝑑 = 1+𝑦 𝜕𝑝 1 − 2𝑝 = erf 2𝜎 −1 𝜕erf 𝑦 Taking inverse error function of both sides: 2 − 2 erf −1 𝑦 𝛿+−𝑚 =0 𝜇 𝜕𝑦 erf −1 1 − 2𝑝 = 2𝜎 𝑦 = 1 −𝛿2𝑝 = 2𝜎[erf −1 1 − 2𝑝 ] + 𝜇 Note y and p are independent of σ and μ! m depends only on 𝜉 Implication: Tuning of delay line may target fixed probability! DELAY MODELS | 34 (%) Optimal Size of Resiliency Window - Δ Blade supports maximum Δ of 50% of clock cycle Optimal Δ is larger for designs with high-variance! OPTIMIZATION | 35 Comparison to Sync Resiliency N-Stage Rings Synchronous F F • 𝐸𝐶 = 𝛿 + 𝑝𝑜𝑝𝑡 ∗ 𝛥 M S EDL Blade S EDL • 𝐸𝐶 = 𝐶 2 − 1 − 𝑝 M EDL Bubble Razor [Zhang,2014] EDL • EC set by systematic error rate F F 2𝑁 COMPARISON | 36 Comparison to Sync Resiliency Synchronous Synchronous • EC set by systematic error rate Bubble Razor [Zhang,2014] • 𝐸𝐶 = 𝐶 2 − 1 − 𝑝 Blade BR 35% 23% Blade 2𝑁 Normal Distribution • 𝐸𝐶 = 𝛿 + 𝑝𝑜𝑝𝑡 ∗ 𝛥 COMPARISON | 36 Comparison to Sync Resiliency Synchronous • EC set by systematic error rate Bubble Razor [Zhang,2014] • 𝐸𝐶 = 𝐶 2 − 1 − 𝑝 Blade 2𝑁 Normal Distribution • 𝐸𝐶 = 𝛿 + 𝑝𝑜𝑝𝑡 ∗ 𝛥 COMPARISON | 36 Outline Resiliency Design and its Pitfalls Blade Design, Operation, and Special Circuits Blade Analysis and Optimization Blade CAD Flow and Case Study Conclusions and On-Going Work OUTLINE | 37 Automatic Conversion Flow Convert single-clock sync RTL design to Blade Re-uses synchronous EDA tools and libraries Sync Synthesis FF to Latch Conversion Latch Retiming Add EDL + Async Control Simulation Async Netlist RTL Spec Seamless integration into existing flows Blade Design Flow CASE STUDY | 38 Synthesis Sync Synthesis FF to Latch Conversion Latch Retiming Add EDL + Async Control Simulation CLK RTL Spec Sync Synthesis FF to Latch Conversion Latch Retiming Add EDL + Async Control Async Netlist FF FF Synthesize sync RTL design using standardSimulation EDA tools Blade Design Flow CASE STUDY | 39 Latches Sync Synthesis FF to Latch Conversion Latch Retiming Add EDL + Async Control Simulation CLK2 CLK1 M S M S Replace flip flops with master-slave latches Two-phase non-overlapping clocking CASE STUDY | 40 Latch Retiming Sync Synthesis FF to Latch Conversion Latch Retiming Add EDL + Async Control Simulation CLK2 CLK1 M S S M Retime latches to spread logic delay across stages Allow time borrowing to reduce area overhead CASE STUDY | 41 EDL Insertion Sync Synthesis FF to Latch Conversion Latch Retiming Add EDL + Async Control Simulation CLK2 CLK1 M EDL TB S M EDL S TB Replace non-TB latches with error detecting latches Not all latches need be error detecting CASE STUDY | 42 Async Control Blade Controller EDL Sync Synthesis FF to Latch Conversion Blade Controller TB Latch Retiming Blade Controller EDL Add EDL + Async Control Simulation Blade Controller TB Remove synchronous clock trees Add Blade controllers and delay lines CASE STUDY | 43 Simulation Blade Controller EDL Sync Synthesis Blade Controller TB FF to Latch Conversion Latch Retiming Add EDL + Async Control Blade Controller Simulation Blade Controller EDL TB Back annotated SDF simulation using final netlist CASE STUDY | 44 Resynthesis EDL EDL A D Critical Path EDL Latch EDL Logic B >δ Delay < EDL C E Correlated EDL Latch EDL <δ Delay > F Add set_max_delay from A to F = δ CASE STUDY | 45 Brute Force Resynthesis Best run reduced error rate by 39% and EDLs by 27% (-1.9% area) Best Run 122 Correlated EDLs! Evaluated hundreds of resynthesis runs • Each run sets a max delay constraint to a single latch Chose result that led to largest reduction in area and error rate CASE STUDY | 46 Case Study: Plasma MIPS OpenCore 3-stage pipeline 28nm FDSOI @ 666MHz (w/ ideal clock and Vdd) Type Count Combinational 11,740 Buf/Inv 1,683 Seq. (Non-RF) 531 RF 2,048 Total 14,319 http://opencores.org/project,plasma CASE STUDY | 47 Case Study: Area and Performance 2% 24% AND/OR Trees C-elements Q-Flops 24% Plasma running CoreMark Delay Elements Controllers 6% 12% FF to Latch + EDL 32% Overall area overhead is 8.4% Performance increases • Average: 19% • Peak: 42% CASE STUDY | 48 Performance Comparison with Margins Must add margins for PVT variation, clock skew / jitter, and/or aging • Synchronous frequency degraded to accommodate Margin in Blade design is only imposed when an error occurs • A 30% error rate reduces impact of margin by ~70% Margin Synchronous Blade (30% ER) % Advantage 0% 666MHz 800MHz 20% 15% 566MHz 764MHz 35% 30% 466Mhz 728MHz 56% CASE STUDY | 49 Outline Resiliency Design and its Pitfalls Blade Design, Operation, and Special Circuits Blade Analysis and Optimization Blade CAD Flow and Case Study Conclusions and On-Going Work OUTLINE | 50 Other Related Work Canary Circuits [Sato, 2007] • Removes some PVT margins • But cannot take advantage of data dependency Bundled Data Designs [Sutherland’89, Nowick’97] • Speculative completion sensing exploits some data dependency • Margins impact performance on every cycle • No observability of errors Soft Mousetrap [Liu, 2013] • Hold time constraints remain difficult to meet CONCLUSIONS | 51 Conclusions Blade Template • Achieves higher performance by exploiting data dependency • Benefits from average vs worst-case MS resolution times • Reduces impact of margins for PVT variations • Enables voltage scaling for power savings • Supported by design flow w/ commercial EDA tools Plasma Case Study • Highlights design and CAD techniques for area efficiency • Achieves 19% increase in performance with 8.4% area overhead CONCLUSIONS | 52 On-Going Work Design • Further design and analysis of efficient DEs and EDLs Automated Flow • Support designs specified in System Verilog CSP • Develop average-case-driven re-synthesis tools • Automated resilient-aware PnR Flow Testing Strategy • On-line tuning of delay lines • Screen chips with variations too large to correct Applications • Encryption Designs (Triple DES and Light Weight Crypto) ON-GOING WORK | 53 Blade Publications (2015) D. Hand, M. Moreira, H.-H. Huang, D. Chen, F. Butzke, Z. Li, M. Gibiluka, M. A. Breuer, N. L. V. Calazans, P. A. Beerel: Blade - A Timing Violation Resilient Asynchronous Template. ASYNC 2015: 21-28 D. Hand, H.-H. Huang, B. Cheng, Y. Zhang, M. Moreira, M. A. Breuer, N. L. V. Calazans, P. A. Beerel: Performance Optimization and Analysis of Blade Designs under Delay Variability. ASYNC 2015: 61-68 Y. Zhang, L. S. Heck, . M. Moreira, D. Zar, M. A. Breuer, N. L. V. Calazans, P. A. Beerel: Design and Analysis of Testable Mutual Exclusion Elements. ASYNC 2015: 124-131 M. Moreira, D. Hand, P. A. Beerel, N. Calazans: TDTB error detecting latches: Timing violation sensitivity analysis and optimization. ISQED 2015: 379-383 G. Heck, L. S. Heck, A. Singhvi, M. Moreira, P. A. Beerel, N. L. V. Calazans: Analysis and Optimization of Programmable Delay Elements for 2-Phase Bundled-Data Circuits. VLSI Design 2015: 321-326. A. Singhvi, M. T. Moreira, R. N. Tadros, N. L. V. Calazans, P. A. Beerel: A Fine-Grained, Uniform, Energy-Efficient Delay Element for FD-SOI Technologies, ISLVSI 2015 P. A. Beerel, N. Calazans: A Path Towards Average Case Silicon using Resilient Bundled Data Design, ECCTD 2015 (Invited) ON-GOING WORK | 55 The Blade Team++ Asynchronous Circuit in Industry 3D Network on chips (STMicroelectronics) STMicroelectronics WIOMING 3D-IC Ethernet Switches (Intel) Ultra high-speed FPGAs (Achronix) Achronix FPGA. 1.7 M LUTs. 2.1 Gbps IO Low-power smart cards (Tiempo) Tiempo TAM16 Clockless 16-bit Microcontroller Neuromorphic Computing (IBM) Internet of Things (???) IBM True North Multi-Chip Neuromorphic System Fulcrum/Intel Ethernet Switch Chip Obrigado! Do not miss.... 22nd IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC) MAY 8 - 11, 2016 PORTO ALEGRE, BRAZIL Homepage: http://www.inf.pucrs.br/async2016 We hope to see you there!