CS 7810 Lecture 25
DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design
T. Austin
Proceedings of MICRO-32, November 1999

Redundancy
• If a processor's output is error-prone, reliability can be provided with redundancy
• One checker can detect errors. For recovery, we may need another checker or some other form of redundancy
[Figure: Input Program, Primary Core, Checker Core, second Checker Core, Verify & Commit]

Why Redundancy?
• Soft Errors: A high-energy particle can strike a device and deposit enough charge to flip the value
  - Cosmic rays
  - Alpha particles
• Soft Errors: voltage spikes or noise
  - Crosstalk
  - di/dt
  - Lower voltages
• Allows unverified or aggressively clocked primary cores
  - Functionally incorrect core: some corner case slips through
  - Electrically incorrect core: high temperature causes a circuit to miss its timing constraint

DIVA Microarchitecture
[Figure: BPred, I-$, Dec/Ren, IQ, Rename Regs, ALU, D-$, Arch Regs]
Example: LR3 + LR7 → LR15, with operand values 4 and 8 and result 12
• Storage Check: read LR3 and LR7 from Arch Regs and confirm they equal 4 and 8
• ALU Check: add 4 + 8 and confirm it equals 12
• If both checks succeed, write 12 into LR15

Microarchitecture Details
• Instructions are fed to checker in order during commit
• The logic and storage checks detect errors in ALUs and datapath
• The checker core is a simple in-order pipeline – easy to design and verify
• An error in an earlier stage (LR3 instead of LR2) can be detected by also adding a rename/decode stage to the checker
• In-order core has no stalls (need bypass for register file) – no data dependences, cache misses, branch mispredicts
• Contention for register file and data cache can degrade primary thread
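The storage check and ALU check from the LR3 + LR7 → LR15 example can be sketched in a few lines of Python. This is only an illustration of the idea, not the DIVA hardware: the `CommittedInstr` record, the `check_and_commit` function, and the dictionary used for the architected registers are hypothetical names, and only an add operation is modeled.

```python
from dataclasses import dataclass


@dataclass
class CommittedInstr:
    """Hypothetical record the primary core hands to the checker at commit."""
    src1: str      # logical source register, e.g. "LR3"
    src2: str
    dst: str       # logical destination register
    src1_val: int  # operand values the primary core actually used
    src2_val: int
    result: int    # result the primary core computed


def check_and_commit(instr, arch_regs):
    """Run both checks; write the result to arch_regs only if both pass."""
    # Storage check: the operands the primary core used must match
    # the architected register state.
    storage_ok = (arch_regs[instr.src1] == instr.src1_val
                  and arch_regs[instr.src2] == instr.src2_val)
    # ALU check: recompute the operation from the supplied operands
    # (assumption: every instruction is an add in this sketch).
    alu_ok = (instr.src1_val + instr.src2_val) == instr.result
    if storage_ok and alu_ok:
        arch_regs[instr.dst] = instr.result
        return True
    return False  # error detected; recovery is outside this sketch


if __name__ == "__main__":
    regs = {"LR3": 4, "LR7": 8, "LR15": 0}
    print(check_and_commit(CommittedInstr("LR3", "LR7", "LR15", 4, 8, 12), regs))
```

Because the ALU check uses the operand values supplied by the primary core, it has no data dependences of its own; the storage check is what ties those values back to the architected state.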

Recovery
• The architected register file and data cache are ECC protected – when an error is detected, it is assumed that checker and architected state are correct
• Primary core is re-started from faulting instruction
• A fault in the primary core may result in deadlock: e.g. instruction that produces R5 is waiting for R5 to be produced (instead of R4)
• A timeout in the checker signals an error

Redundant Multi-Threading
• Execute two threads in parallel (CMP or SMT) – each thread maintains its own register state
• Threads execute as in a conventional processor, except
  - trailing thread commits after verifying result
  - leading thread commits stores to a buffer – these get written to cache/memory only after verification
  - load values of the leading thread are sent to trailing thread, so trailing thread never accesses data cache
  - branch outcomes are also sent to trailing thread
[Figure: Leading Thread and Trailing Thread, with labels "Reg results, load values, branch outcomes" and "Store values"]

Fault Model
• A single error in either core can be detected
• Since loads are not replicated, the load/store datapath must be ECC protected
• For recovery, a second checker thread is required
• ECC in the checker register file will enable recovery in most cases without a second checker

RMT on SMT/CMP
+ SMT does not require inter-core traffic – values can be read from shared register file/data cache
– Single thread performance may be degraded
– Each redundant instr executes on high-power pipeline
+ Trailing CMP core can be a simple in-order processor → low power/area overheads
+ Trailing core's frequency can be independently controlled
+ Heterogeneous CMP where cores can be dynamically employed for throughput/reliability
+ Lower probability for errors

Parallelization of Trailing Thread
[Figure: Sequential Thread alongside Parallel Threads 1 to 4]
• Is it more power-efficient to execute the verification thread in parallel?
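One rough way to reason about the power question for a parallelized trailing thread is a first-order model: dynamic power per core proportional to C·V²·f, plus a per-core leakage term. The sketch below assumes ideal parallel speedup (n cores at f/n match one core at f) and leakage that scales linearly with supply voltage. The function name, parameters, and default values are all hypothetical and in arbitrary units.

```python
def trailing_power(n_cores, v_scale=1.0, c_eff=1.0, v_nom=1.0,
                   f_nom=1.0, leak_per_core=0.3):
    """Return (dynamic, leakage) power for n trailing cores.

    Assumptions (illustrative only):
      - the n cores together match the throughput of one core at f_nom,
        so each runs at f_nom / n_cores;
      - dynamic power per core is c_eff * V^2 * f;
      - leakage per core is leak_per_core at nominal voltage and scales
        linearly with the voltage scaling factor.
    """
    f = f_nom / n_cores
    v = v_nom * v_scale
    dynamic = n_cores * c_eff * v ** 2 * f
    leakage = n_cores * leak_per_core * v_scale
    return dynamic, leakage


if __name__ == "__main__":
    for label, args in [("1 core, nominal", (1, 1.0)),
                        ("4 cores, frequency-scaled", (4, 1.0)),
                        ("4 cores, frequency+voltage-scaled", (4, 0.7))]:
        dyn, leak = trailing_power(*args)
        print(f"{label}: dynamic={dyn:.2f} leakage={leak:.2f}")
```

Under these assumptions, frequency scaling alone leaves total dynamic power unchanged while leakage grows with the core count; lowering the voltage as well is what reduces dynamic power.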

Parallelization of Trailing Thread
• If the trailing cores are frequency-scaled, dynamic power does not change, but leakage power increases
• If the trailing cores are frequency-and-voltage scaled, dynamic power decreases, and leakage power increases

Error Types
Acronyms!!
• MTTF & MTBF: Mean time to/between failures
• Errors are either SDC (silent data corruption) or DUE (detected unrecoverable errors)
Many errors get masked:
• ACE bits: these bits are required for architecturally correct execution
• un-ACE bits: these bits do not affect the final output
• AVF: architectural vulnerability factor (the percentage of time/space that a structure holds ACE state)

Partial Coverage
• RMT covers faults in the entire core (almost!)
• If that is too expensive, provide error coverage in specific structures to reduce error probabilities
• Are there ways to ensure that an instruction spends less time in architecturally vulnerable structures?
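The AVF definition above (the fraction of time/space that a structure holds ACE state) can be made concrete with a small calculation. This is a minimal sketch: the function name `avf` and the per-cycle ACE bit counts are hypothetical inputs, as if produced by some simulator, not data from the lecture.

```python
def avf(ace_bits_per_cycle, total_bits):
    """Fraction of bit-cycles in a structure that hold ACE state.

    ace_bits_per_cycle: number of ACE bits resident in the structure
                        in each observed cycle (hypothetical trace).
    total_bits:         capacity of the structure in bits.
    """
    cycles = len(ace_bits_per_cycle)
    if cycles == 0 or total_bits <= 0:
        raise ValueError("need at least one cycle and a positive bit count")
    if any(b < 0 or b > total_bits for b in ace_bits_per_cycle):
        raise ValueError("ACE bit count must lie between 0 and total_bits")
    return sum(ace_bits_per_cycle) / (total_bits * cycles)


if __name__ == "__main__":
    # A 64-bit structure observed for 4 cycles.
    print(avf([64, 32, 0, 32], total_bits=64))
```

A structure that is full of ACE bits half the time and empty otherwise has the same AVF as one that is always half full, which is why reducing the time instructions spend in a structure lowers its vulnerability.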
