DESIGN AND EVALUATION OF HYBRID FAULT-DETECTION SYSTEMS
Qing Xu, Kevin Wang

OUTLINE
- Background
- Motivation
- Key Ideas
- Introduction to CRAFT
- Summary and Discussion Points

BACKGROUND
Smaller and faster transistors
- Lower threshold voltage
- Tighter noise margins
- Less reliable
Transient faults
- Example: an alpha particle flips a bit (0 to 1)
- Result: incorrect program execution
Recovery: redundancy
- Hardware only
- Software only
[Figure: the same program shown twice to illustrate redundancy]
    int main() { cout << "Hello\n"; }
    int main() { cout << "Hello\n"; }

MOTIVATION AND GOAL
- Lower hardware area and cost
- Better reliability and performance
- Hybrid solution

KEY IDEA: COMPILER ASSISTED FAULT TOLERANCE (CRAFT)
Characteristics:
- Based on a software technique
- Minimal hardware adaptations
- Takes advantage of both software and hardware solutions
Benefits:
- Nearly perfect reliability
- Low performance degradation
- Low hardware cost
[Figure labels: Hardware, Software]

CRAFT: HYBRID OF EXISTING METHODS
Hardware methods
- Redundant Multithreading Technique (RMT)
- Error Correcting Codes (ECC)
- Advantages: almost-perfect fault coverage, low performance cost
Software methods
- Software Implemented Fault Tolerance (SWIFT)
- Error Detection by Duplicated Instructions (EDDI)
- Advantages: high fault coverage, modest performance cost, zero hardware cost

EXISTING METHOD: HARDWARE RMT
- RMT makes use of SMT resources through loosely synchronized redundant threads
- Components not covered by redundant execution must employ alternative techniques, such as Error Correcting Codes (ECC)
[Figure: Redundant Multithreading (RMT) with an original thread and a checker thread]

EXISTING METHOD: SOFTWARE SWIFT
- A compiler-based transformation
- The store instruction is the synchronization point
- Assumes that ECC guards the correctness of the memory subsystem

    Original code          SWIFT code
    ld  r3 = [r4]          ld  r3 = [r4]
                           mov r3' = r3
    add r1 = r2, r3        add r1 = r2, r3
                           add r1' = r2', r3'
                           br  Fault, r1 != r1'
                           br  Fault, r2 != r2'
                           br  Fault, r3 != r3'
    st  m[r1] = r2         st  m[r1] = r2

CRAFT: SUITE OF THREE DETECTION SYSTEMS
Preliminaries
- Assume the Single Event Upset (SEU) fault model
- Architecturally Correct Execution (ACE)
- Detected Unrecoverable Error (DUE)
- Silent Data Corruption (SDC)
List of the suite:
1. Checking Store Buffer (CSB)
2. Load Value Queue (LVQ)
3. CSB + LVQ

SUITE 1: CHECKING STORE BUFFER (CSB)
Problem to improve:
- SWIFT is vulnerable to faults in the time interval between the validation and the use of a register value
[Figure: timeline from "validated values" to "use of validated values"; the interval between them is labeled "vulnerable to faults"]
Solution:
- Add a store buffer to perform the checks

CSB: IMPLEMENTATION
Basic idea: commit a store only when the two copies of the store data match
Method: create a CSB to keep track of all original and duplicated store instructions
Compiler duplicates stores:
    st  [r1]  = r2     becomes     st1 [r1]  = r2
                                   st2 [r1'] = r2'

    CSB #   Address   Value   Validated
    0       --        --      --
    1       --        --      --
    2       0xFF      0x8     N, then Y
    3       0xEE      0x1     N

- Insn duplicate #1 (address 0xFF, value 0x8): the store value checks out; send to MEM
- Insn duplicate #2 (address 0xEE, value 0x2): no match; not OK to go to MEM
- If the table fills up, the result is a structural hazard

CSB: ADVANTAGES / DISADVANTAGES
Advantages
- Checking is implemented at the hardware level
- Validation code is no longer needed, which reduces code size
- Store instructions are no longer synchronization points (as they are in SWIFT)
- Exploits more dynamic scheduling
Disadvantages
- Additional compiler requirement: the distance between duplicated instructions should not exceed the size of the CSB

SUITE 2: LOAD VALUE QUEUE (LVQ)
Problem to improve:
- SWIFT has a window of vulnerability between the load instruction and the value duplication
[Figure: timeline from "loading values" to "copying values"; the interval between them is labeled "vulnerable to faults"]
Solution:
- Add a load value queue

LVQ: IMPLEMENTATION PROCEDURE
Basic idea: duplicate loads to enable redundant computation
Method: the LVQ provides redundant load instruction execution
Compiler duplicates loads:
    ld  r2 = [r1]      becomes     ld1 r2  = [r1]
                                   ld2 r2' = [r1']

    LVQ #   Address   Value
    0       --        --
    1       --        --
    2       0xAA      0x2
    3       --        --

- ld insn: address 0xAA, value 0x2
- ld insn duplicate: address 0xAA, value 0x2

LVQ: ADVANTAGES / DISADVANTAGES
Advantages
- Reduces the window of vulnerability by issuing a duplicated load instruction
- Keeps memory traffic low by bypassing the load value
Disadvantages
- Extra hardware is needed to enforce that loads and their duplicates access the same entry in the LVQ

SUITE 3: CSB + LVQ
- Adds both the CSB and the LVQ simultaneously to software-only solutions like SWIFT

EXPERIMENTAL EVALUATION
Evaluation method (performance vs. reliability):
- Inject randomly chosen faults into a detailed microarchitectural simulation
- Track each chosen bit flip until completion of the program
- Analyze the final result to determine:
  - How much SDC is converted to DUE
  - How much work (# of applications) the program completed before encountering an SDC

EXPERIMENTAL EVALUATION
Results: measures the # of applications the program completed before encountering an SDC

    Implementation   Performance
    CSB              Enables better performance, as it eliminates scheduling constraints
    LVQ              Impact varies by benchmark

SUMMARY AND CONCLUSION
CRAFT, as compared to a software-only technique:
- Execution time reduced by 5%
- SDC to DUE conversion rate increased by 75%
CRAFT, as compared to a hardware-only technique:
- Significantly reduced area overhead
- Comparable reliability maintained
A hybrid technique can provide better reliability at relatively low cost

DISCUSSION POINTS
- CRAFT detects a fault when the CSB is clogged
- Tradeoff between detection latency and more flexible scheduling?
- Recovery method?
- Evaluation in terms of coverage?
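
The SWIFT slide shows the duplicate-then-check pattern only as pseudo-assembly. A minimal Python sketch of the same sequence may make the control flow easier to follow. It is an illustration under stated assumptions, not the compiler transformation itself: the function name `swift_store`, the dictionary used as memory, and the `flip_r1_bit` parameter (a stand-in for an injected single-bit transient fault) are all hypothetical.

```python
class FaultDetected(Exception):
    """Raised when an original value and its shadow copy disagree."""


def swift_store(memory, r2, r4, r2_shadow, r4_shadow, flip_r1_bit=None):
    """Follow the SWIFT column of the slide: load, copy, compute twice, check, store.

    r4_shadow is accepted for symmetry only; the slide does not check r4.
    """
    r3 = memory[r4]                      # ld  r3 = [r4]
    r3_shadow = r3                       # mov r3' = r3
    r1 = r2 + r3                         # add r1 = r2, r3
    r1_shadow = r2_shadow + r3_shadow    # add r1' = r2', r3'

    if flip_r1_bit is not None:
        # Hypothetical transient fault: flip one bit of r1 after it is computed.
        r1 ^= 1 << flip_r1_bit

    # br Fault, r1 != r1' / r2 != r2' / r3 != r3'
    for name, original, shadow in (("r1", r1, r1_shadow),
                                   ("r2", r2, r2_shadow),
                                   ("r3", r3, r3_shadow)):
        if original != shadow:
            raise FaultDetected(f"{name} differs from its shadow copy")

    memory[r1] = r2                      # st m[r1] = r2
    return r1
```

With `memory = {4: 10}`, calling `swift_store(memory, 7, 4, 7, 4)` stores 7 at address 17. Passing `flip_r1_bit=0` raises `FaultDetected` before the store reaches memory, which is the role the store plays as a synchronization point.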
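
The commit rule on the CSB implementation slide (a store goes to memory only when both copies match) can be modeled in a few lines. The sketch below is a simplified software model, not the hardware design: the class and method names are hypothetical, duplicates are assumed to arrive in the same order as the originals, and a full buffer is reported by returning `False` so that the caller can stall.

```python
class FaultDetected(Exception):
    """Raised when a duplicated store disagrees with the original store."""


class CheckingStoreBuffer:
    """Toy model of a CSB: entries are [address, value, validated], oldest first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def store_original(self, address, value):
        """Record the first copy (st1). Returns False on a structural hazard."""
        if len(self.entries) >= self.capacity:
            return False  # buffer full: the pipeline would have to stall
        self.entries.append([address, value, False])
        return True

    def store_duplicate(self, address, value):
        """Check the second copy (st2) against the oldest unvalidated entry."""
        for entry in self.entries:
            if not entry[2]:
                if entry[0] != address or entry[1] != value:
                    raise FaultDetected("store copies do not match")
                entry[2] = True
                return
        raise FaultDetected("duplicate store has no pending original")

    def commit(self, memory):
        """Send validated entries at the head of the buffer to memory, in order."""
        committed = 0
        while self.entries and self.entries[0][2]:
            address, value, _ = self.entries.pop(0)
            memory[address] = value
            committed += 1
        return committed
```

Replaying the slide's table: after `store_original(0xFF, 0x8)` and `store_original(0xEE, 0x1)`, the call `store_duplicate(0xFF, 0x8)` validates the first entry and `commit` writes it to memory, while `store_duplicate(0xEE, 0x2)` raises `FaultDetected` and the second store never leaves the buffer.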
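
The LVQ slides say that the duplicated load receives its value from the queue instead of accessing memory again. A rough Python model of that bypass follows. It is a sketch under assumptions: the names are hypothetical, loads and their duplicates are assumed to be issued in the same order, and a full queue is modeled by raising `RuntimeError` where hardware would stall.

```python
from collections import deque


class FaultDetected(Exception):
    """Raised when a duplicated load does not match the queued original load."""


class LoadValueQueue:
    """Toy model of an LVQ: (address, value) pairs, oldest first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()

    def load_original(self, memory, address):
        """Perform the real memory access (ld1) and queue the loaded value."""
        if len(self.entries) >= self.capacity:
            raise RuntimeError("LVQ full: the original load would stall")
        value = memory[address]
        self.entries.append((address, value))
        return value

    def load_duplicate(self, address):
        """Serve the redundant load (ld2) from the queue, without touching memory."""
        if not self.entries:
            raise FaultDetected("duplicate load has no queued original")
        queued_address, value = self.entries.popleft()
        if queued_address != address:
            raise FaultDetected("load addresses do not match")
        return value
```

Because the duplicate is served from the queue, both copies observe the same value even if memory changes between the two loads, and memory traffic is not doubled. An address mismatch between the two copies surfaces as `FaultDetected`.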