Increasing Reliability of Performance-critical Pipeline Structures
Niranjan Soundararajan
Advisors: Vijaykrishnan Narayanan, Anand Sivasubramaniam
Computer Systems Lab (CSL), Microsystems Design Lab (MDL)
Computer Science and Engineering, The Pennsylvania State University

Reliability – Increasing Importance
• Decreasing transistor size
• More transistors
• Power/temperature hotspots
• Increasing market segments
[Figure label: Hardware Reliability]

Performance-critical pipeline structures
[Figure: pipeline diagram. Front end: BHT, BTB, Icache, Inst Fetch, Decode, Alloc, RAT. Back end: Issue Queue, ALU, Load/Store Queue, Dcache, Reorder Buffer, ARF, instruction retirement. Annotations: out-of-order entry activity, back-to-back wakeup, multi-width pipeline, clock frequency increase.]

Transistor Failure
[Figure: failure rate vs. time, with regions labeled Manufacturing Defects, Random Errors, and Wearout.]
• Solutions to address impact of process variations on issue queue
• Solutions to reduce nonuniform aging due to NBTI, HCE on microprocessor structures
• Soft-error impact of DVFS on vulnerability of GALS architectures
• Bounding vulnerability of processor structures to provide reliability guarantees

Outline
• Motivation
• Contributions
• Vulnerability bounding mechanisms
• Other solutions
– Impact of DVFS on architectural vulnerability of GALS architectures
– Address process variations in issue queue
– Mitigate NBTI, HCE degradation in structures
• Conclusion and Future work

Introduction to Soft Errors
[Figure: transistor cross-section (n+, p, n+ regions) showing a particle strike and the resulting charge.]
• Strike creates electron-hole pairs that can be absorbed by source/diffusion areas of the transistor to change the state of the device
Source: M. Tahoori

Impact of Soft Errors
• Severity of soft error rates
– In 2003, Fujitsu released SPARC64 with 80% of 200,000 latches covered by transient fault protection
• Single Event Upset (SEU) model
[Figure: relative soft error rate increase vs. chip feature size, 180 nm down to 16 nm.]
• Metrics
– MTBF: Mean Time Between Failures
– FIT: Failure in Time; 1 FIT = 1 failure in a billion hours.
FITeff = FITraw * AVF
Source: Shekhar Borkar, Intel 2004

Architectural Vulnerability Factor (AVF)
[Figure: instruction stream (LD A, BR, ST B, ADD, ST B) leading to user-visible output. Architecturally Correct Execution (ACE) instructions are distinguished from unACE instructions such as dead stores and wrong-path instructions.]
AVF:
- Fraction of bits in a structure vulnerable to soft errors
- ACE bits / (ACE bits + unACE bits)
- Fn (Size, Time)

AVF: Why is it important to micro-architects?
[Figure: design flow: System Specification, Architectural Design, Logic Synthesis, Circuit Design, Physical Design, Fabrication and Packaging. AVF per structure comes from the architectural stages; FITraw comes from the later stages.]
• System Reliability = ∑ (FITraw * AVF)

State of the Art
• Microprocessor design: multi-dimensional problem involving Performance, Power and Reliability
• Transient fault tolerance: performance overhead
– Simultaneous Redundant Threading (SRT)
– Lockstepping
• Optimization techniques: single point in Performance-Reliability space
– Parashar et al., ISCA'04
– Gomaa et al., ISCA'05
– Parashar et al., ASPLOS'06
– Reddy et al., ASPLOS'06

Micro-architectural Reliability Knob
[Figure: reliability vs. performance, ranging from "More Reliable, Less Performance" to "Less Reliable, More Performance", with FITrequired and the ideal solution marked.]
• FITeff = FITraw * AVF
• FITraw and AVF being constants
• FITraw inflexible
• Tune AVF to meet specifications
"Challenge for computer architects is not to provide absolute guarantees in reliability, but rather how to provide the adequate amount of reliability at the lowest cost for the target market segment"
Architecture Design for Soft Errors – Shubu Mukherjee, Intel

Contributions
• First work that provides microarchitectural knobs to satisfy processor reliability budgets for transient faults
• Proactive and reactive mechanisms to monitor and bound vulnerabilities of processor structures at cycle-level granularity

AVF Monitoring: Reorder Buffer/Physical Register File
[Figure: pipeline with Fetch, Decode, RAT, ARF, Issue Queue, ALU, Reorder Buffer (PRF), Commit.]
Reorder Buffer (ROB)
1. Large pipeline structure holding a number of instructions
2.
Each instruction spends a significant percentage of its lifetime in the ROB
[Figure labels: pipeline in-order, pipeline out-of-order, pipeline in-order]

AVF Monitoring Mechanism
• Reorder Buffer (ROB): N entries, each entry B bits, result R bits
• Entry fields are filled either at dispatch or at writeback (WB)
• Events: Dispatch, Writeback, Commit, Mis-speculation

Vulnerability Control via Throttling (VCT)
[Figure: N-entry reorder buffer between Dispatch and Writeback.]
• Entire entry ACE at dispatch
• Stall dispatch and writeback
• Size = Fn (AVF Bound)
• Writeback cannot be stalled

VCT Performance
[Figure: average performance w.r.t. single thread (0 to 1) vs. AVF bounds (0% to 100%, high integrity to low integrity). Series: VCT.]

Advantages of a Reactive Bounding Mechanism
• Reorder Buffer AVF bound exceeded
• Verify results early
• Accounting of writebacks
• Mis-speculated instructions

Simultaneous Redundant Threading (SRT): Importance of Selective Redundancy
[Figure: pipeline with Fetch, Decode, two RAT/ARF pairs, ISQ, ALU, Reorder Buffer (PRF).]
• Redundant thread after primary thread
• Result verification reduces AVF
• Redundant execution protects entire pipeline
• AVF goes down

Vulnerability Control via Selective Redundancy (VCSR) Infrastructure
[Figure: pipeline with Fetch, Decode, two RAT/ARF pairs, ISQ, ALU, Reorder Buffer (ROB), Result Buffer. A greedy heuristic is triggered when the AVF bound is exceeded.]

VCSR Performance
[Figure: average performance w.r.t. single thread (0.4 to 1) vs. AVF bounds (0% to 100%, high integrity to low integrity). Series: VCSR, SRT, VCT.]

Optimizations: Primary Thread Out-of-Order Commit
[Figure: pipeline with Fetch, Decode, two RAT/ARF pairs, ISQ, ALU, Reorder Buffer (PRF).]
• Non-compacting Reorder Buffer
• Reduces AVF
• Performance boost since fewer instructions are re-executed
• Writeback – Commit: ROB AVF affected
• Sec.
Thread maintains architected state
[Figure label: Result Buffer]

VCH with OOO Commit Performance
[Figure: average performance w.r.t. single thread (0.4 to 1) vs. AVF bounds (0% to 100%, high integrity to low integrity). Series: VCH(OOO), SRT, VCSR, VCT.]

Impact of vulnerability bounding
• Per-cycle vulnerability bounds, guaranteeing FIT rates are met
• Future Work
– Looking at developing a system-level AVF monitoring and bounding infrastructure

Outline
• Motivation
• Contributions
• Vulnerability bounding mechanisms
• Summary of other works
– Impact of DVFS on architectural vulnerability of GALS architectures
– Address process variations in issue queue
– Mitigate NBTI, HCE degradation in structures
• Conclusion and Future work

Need for vulnerability analysis in GALS Architectures
• Multiple domains, each driven by individual clocks
– Need for global clock network avoided
• GALS enables fine-grained VF scaling tuned to individual domains
– DVFS provides high performance per watt
• DVFS algorithms for GALS architectures are studied w.r.t. IPC per watt
• Voltage scaling affects FITraw, frequency scaling affects AVF
Reliability impact ignored
• Impact on AVF due to applying different DVFS algorithms
• Help designers choose DVFS algorithms meeting reliability requirements

AVF impact across algorithms
• Significant AVF variations when applying different algorithms
• Most DVFS algorithms lead to worse AVF than Non-DVFS
[Figure: normalized AVF of the issue queue (0.7 to 1.5, lower is better) for Threshold, AD, ModAD, PI, and Greedy; 38% variation.]

Process Variation (PV) - Introduction
Process Variation: variation in characteristics between two identically designed circuits
• Performance and power impact significant
• Lack of
predictability in timing characteristics leads to loss of yield
• Definite need to address PV at circuit and microarchitectural level
[Figure: taxonomy of process variation. Categories: dynamic, static, random, systematic. Sources listed: aging, thermal effects, dose, RDF, sub-wavelength lithography, overlay.]
[Figure: mean number of dopant atoms vs. technology node (nm).]
[Figure: lithography wavelength (365nm, 248nm, 193nm, 13nm EUV) vs. technology generation (180nm to 32nm) over 1980 to 2020, showing the gap between the two.]
Source: J. Tschanz et al., DAC 2005

Contributions
• Study the impact of PV on the Issue Queue of a microprocessor
• PV-unaware design has about 21% performance degradation w.r.t. Non-PV design
• PV is a non-deterministic phenomenon. Design-time static partitioning not possible. Our solution enables the fast and slow entries to co-exist
• Instruction steering and sub-component switching schemes to reduce the impact of PV
• Performance loss is about 1.3% w.r.t. Non-PV design

Issue Queue Entry
[Figure: issue queue entry fields (Opcode, V, R, Tag, Operand, Dest Tag) with forwarding comparison against Tag1 to Tag N, select logic, and the dispatch write, forwarding write, and issue read paths. Timeline t to t+3: alloc logic (alloc stalls dispatch when ISQ is full), instruction waits for ready operands, select inst.]
[Timeline labels, continued: ready instruction issue (valid bit reset), dispatch write (valid bit set), forwarding (operand ready bit set).]

Results
• Stalls reduced w.r.t. specific activity
• Operand- and port-switching further reduce stalls to a minimum
[Figure: IPC (1.2 to 1.45) for Non-PV, Shutdown, MCD, and PV-Aware, annotated with differences of 12%, 7.3%, and 1.3%.]

Increasing impact of transistor wearout
[Figure: failure rate vs. time with infant mortality, event-related (random) failures, and device wear-out regions; useful life (years) shrinks with decreasing technology. Source: Intel]
• Transistor lifetime decreasing with newer technologies
• Conservative guardbands impact performance
• System longevity affects revenue
– More than 50% of organizations have machines older than 10 years (poll by Gartner Research)
Source: J. Blome, Micro 2007

Contributions
• NBTI, HCE impact increasing in upcoming technologies
• Conventional collapsing issue queues have unwanted instruction movement across entries
– Collapsing required for age-based selection
• Round-Robin scheme to provide restricted collapsing
• Restricted collapsing balances switching activity, not losing much of age-based selection

Implementation
[Figure: evaluation flow with these components: SPEC2K benchmark (100M instructions); Simplescalar architectural simulator [ISQ]; capture of Rd / Wr / Sw / Data probabilities per cell; transistor-level degradation model; HSpice (32nm, 380K), 10-year degradation; read delay degradation.]
• Typically, solutions look at worst-case probabilities that might rarely occur

Results
• Performance: 1% reduction
[Figure: IPC (1.6 to 1.7), Conventional vs. Round Robin.]
• Read delay: 32% reduction
[Figure: read delay degradation (%), 0 to 18, Conventional vs. Round Robin.]

Conclusion
• Growing reliability concern
"Pop culture of reliability has arrived" - Dr.
Phil Emma, IBM [Architecture Design for Soft Errors]
• Work looks at increasing the fault tolerance in the back end
– Soft errors
– Process variation
– Wearout

Current Work
• Multi-core designs have come to prominence
• While caches have ECC, the multiple pipelines involve structures holding data
– ECC is hard
– Total vulnerability to soft errors increases
• Study the impact on AVF of different structures in a multi-core environment

Future Work
• Multi-core
– Cores increase, market segments increase
– ILP vs TLP vs clock frequency increase
– Application/hardware sense best configuration
• Reconfigurable hardware
– Defect tolerance
– Verification time increasing
– "Firmware update" to control functionality

Backup slides

DVFS Algorithms
• Threshold
– VF scaling uses fixed thresholds. Preset thresholds affect algorithm efficiency
• Attack-Decay (AD)
– Based on util. in adjacent intervals. Attack whenever there is a big util. change. Otherwise decay. Greedy nature affects efficiency
• Modified Attack-Decay (ModAD)
– Attack phase modified to correspond to util. change. Large VF swing can affect performance per watt
• PI
– $\mu_k = \mu_{k-1} + K_I (q'_k - q_{ref}) + K_P (q'_k - q'_{k-1})$, $f_k = \mu_k / \mathrm{IPC}$
• Greedy
– Sample and Hold phase.
VF scaling based on ED² of past 2 intervals

Vulnerability Efficiency
[Figure: vulnerability efficiency per DVFS algorithm (lower is better); 40% variation.]
• Non-DVFS has the best vulnerability efficiency
– On average, AD and PI provide the best vulnerability efficiency

Round Robin scheme
[Figure: issue queue with Head, PseudoHead (PH), and Tail pointers, New Inst input, per-entry Clk and Ctrl Bit signals, a Collapse Control Vector of bit values, and Later Entries.]

Reliability Issues of Importance
• Solutions that are robust but overhead-aware as well

Contributions
• Bounding vulnerability of processor structures to provide reliability guarantees
• Study impact of DVFS on vulnerability of GALS architectures
• Solutions to reduce non-uniform aging due to NBTI, HCE on microprocessor structures
• Solutions to address impact of process variations on issue queue
[Figure: hardware failure taxonomy. Labels: Permanent, Temporary, Transient, Intermittent, Radiation, Non-Radiation, Wearout, Soft Errors, Process variation, Power supply. Source: ISCA 2005 tutorial]

Results
[Figure: average performance w.r.t. single thread (0.4 to 1) vs. AVF bounds (0% to 100%, high integrity to low integrity). Series: SR with T(OOO), SRT, SR, Throttling (T).]

PV-aware steering - OptiSteer
[Figure: non-collapsing issue queue. The RAT supplies Op, STag1, STag2, DTag; alloc logic assigns an ISQ entry through an ISQ entry id decoder and demux; each entry has a Slow Entry Bit; a Stall Optimization Table indexed by source tags (STag1, STag2) and dest tag produces the STALL signal.]

Intra-Entry Variation schemes: Operand- and Port-Switching
[Figure: issue queue entry (Op, STag1, Operand1, STag2, DTag; V, Opcode, R, Tag, Operand, Dest Tag) with issue read and dispatch write ports. Port switch changes the port used; operand switch swaps the positions of STag1 and STag2 within the entry.]

Timeline of ISQ activities
[Figure: timeline labels: select inst. ready, port switch, slow issue read, select inst.]
[Timeline labels, continued, t to t+3: ready instruction issue (fewer instructions selected, valid bit reset); alloc logic with port switch (alloc stalls dispatch, ISQ full); operand switch and slow dispatch write; dispatch write (valid bit set); forwarding (operand ready bit set); SOT fill; instruction waits for ready operands; SOT value required; forwarding stall.]

Conventional Collapsing ISQ
[Figure: issue queue entries 0, 1, 2, ... N between Head and Tail, with collapsing logic driving per-entry Clk and Ctrl Bit signals.]
• Age-ordering for instruction selection

Round Robin scheme
[Figure: issue queue with Head, PseudoHead, and Tail pointers, New Inst input, Clk and Ctrl Bit signals, and collapse paths.]

NBTI/HCE
• NBTI
– Traps due to negative voltage at gate (input "0")
– Dominant in PMOS transistor
– Increased when holding same data for long periods
• HCE
– Traps due to high electric field near the drain
– Dominant in NMOS transistor
– Increased when switching activity is high
• Vth shift accumulates over time, affects timing

Contributions
• Global solutions
– Body biasing: frequency boost increases leakage. Non-ideal for Issue Queue
– Time-borrowing: absorbing clock jitter and skew becomes difficult
• Structure-specific solutions
– Solutions for register file and caches
• Issue Queue: performance-determining structure, operation combines CAM, SRAM cells
• PV is a non-deterministic phenomenon. Our solution enables the fast and slow entries to co-exist
• Instruction steering and sub-component switching schemes are proposed to reduce the impact of PV

Results
[Figure: IPC (1 to 1.5) for NonPV, PV-unAware, SpeedSteer, and OptiSteer. Data labels in the extraction: 1.43, 1.36, 1.31, 1.43, 1.42, 1.14.]

Throughput comparison
[Figure: throughput comparison; 10.5% relative decrease.]

Switching Activity
[Figure: switching activity.]

Wearout phenomena
• NBTI, HCE impact increasing in upcoming technologies
[Figure panels: Negative Bias Temperature Instability, Hot Carrier Effects (transistor cross-section with G, D, S, B terminals, N+ regions, P-well, and gate currents), Electro-Migration.]
Sources: A. Tiwari, Micro 2008; S. Sapatnekar, ISQED 2006
Source: J. Blome,
Micro 2007
[Figure panel: Oxide Breakdown]
• Factors: temperature, switching activity, data (gate bias), Vdd, current density

Optimizations – Vulnerability Control Hybrid
[Figure: pipeline with Fetch, Decode, two RAT/ARF pairs, ISQ, ALU, Reorder Buffer (PRF).]
• Reduces bottleneck in in-order units like Result Buffer
• Dispatch bandwidth not effectively utilized

Microprocessor Design: Multi-Dimensional Problem
• Data sensitivity
– Application dependent
• Microprocessor design: performance not single dimension
– Power
– Thermal effects
– Reliability
• Dimension order driven by market
– Aircraft, Health-care: Reliability
– Embedded: Power, Thermal
– Desktops, Game Consoles: Performance

INTEGRITY LEVEL of APPLICATION DOMAIN
Application          Data Integrity Requirement   Market Volume   Examples
Low Integrity        Low                          Huge            Consumer Electronics
Moderate Integrity   Moderate                     Large           Present-day Automotive
High Integrity       Very High                    Moderate        Enterprise Server
Safety Critical      Very High                    Small           Flight Control
Source: Mitigation of Transient Faults at the System Level – the TTA approach. Hermann Kopetz, SELSE 2006

GALS Architecture
• Domains driven by individual clocks
– Domain is internally synchronous
• Careful tuning of global clock distribution network is avoided
– Better frequency scaling
• Different domains interact through FIFO buffers
• GALS enables fine-grained VF scaling tuned to individual domains
• DVFS: high performance per watt
[Figure: pipeline partitioned into clock domains 1 to 6, covering Fetch, Decode, Rename, Reg File, Reg Read, Int ISQ, FP ISQ, Mem ISQ, Exec, Write Back, D-cache, Retire.]

Contributions
• DVFS algorithms for GALS architectures are studied w.r.t. IPC per watt
• Voltage scaling affects FITraw, frequency scaling affects AVF
Reliability impact ignored
• Impact on architectural vulnerability due to applying different DVFS algorithms
• Characterize the Vulnerability Efficiency (AVF*Watts/IPC) of DVFS algorithms
• Help designers choose DVFS algorithms meeting reliability requirements
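
The relations quoted in this deck (FITeff = FITraw * AVF, system reliability as the sum of FITraw * AVF over structures, and vulnerability efficiency as AVF*Watts/IPC) can be illustrated with a short Python sketch. This is not code from the presentation: the Structure class, the function names, and every numeric value are hypothetical, chosen only to show how the quantities combine. The max_avf_for_budget helper is one assumed way to read "tune AVF to meet specifications", not the bounding mechanism the slides describe.

```python
from __future__ import annotations

from dataclasses import dataclass

HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure in a billion hours


@dataclass
class Structure:
    """One pipeline structure; the values used below are hypothetical."""
    name: str
    fit_raw: float  # raw soft-error rate of the structure, in FIT
    avf: float      # architectural vulnerability factor, 0.0 to 1.0


def effective_fit(s: Structure) -> float:
    """FITeff = FITraw * AVF for a single structure."""
    return s.fit_raw * s.avf


def system_fit(structures: list[Structure]) -> float:
    """System-level rate: sum of FITraw * AVF over all structures."""
    return sum(effective_fit(s) for s in structures)


def mtbf_hours(fit: float) -> float:
    """Convert a FIT rate into mean time between failures, in hours."""
    return HOURS_PER_FIT_UNIT / fit


def vulnerability_efficiency(avf: float, watts: float, ipc: float) -> float:
    """Vulnerability efficiency = AVF * Watts / IPC (lower is better)."""
    return avf * watts / ipc


def max_avf_for_budget(fit_required: float, fit_raw: float) -> float:
    """Largest AVF that keeps FITraw * AVF within a required FIT budget.

    Assumption for illustration: FITraw is fixed, so AVF is the only knob.
    """
    return min(1.0, fit_required / fit_raw)


if __name__ == "__main__":
    # Hypothetical numbers, not measurements from the presentation.
    structures = [
        Structure("ROB", fit_raw=100.0, avf=0.4),
        Structure("ISQ", fit_raw=50.0, avf=0.2),
    ]
    total = system_fit(structures)
    print(f"system FIT: {total:.1f}")
    print(f"MTBF (hours): {mtbf_hours(total):.3e}")
    print(f"AVF bound for ROB at a 30 FIT budget: "
          f"{max_avf_for_budget(30.0, 100.0):.2f}")
    print(f"vulnerability efficiency: "
          f"{vulnerability_efficiency(0.4, 20.0, 1.6):.2f}")
```

Because FITraw is treated as fixed in this sketch, lowering the FIT budget only tightens the AVF bound, which mirrors the deck's point that AVF is the quantity a microarchitect can tune.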