Slide 1: Overcoming Hard-Faults in High-Performance Microprocessors
I2PC Talk, presented by Amin Ansari, Sept 15, 2011
University of Michigan, Electrical Engineering and Computer Science

Slide 2: Significance of Reliability
[Figure: mission-critical systems (full-authority digital engine control, engine or brake control units, financial analysis and transactions) are equipped with techniques such as triple modular redundancy, watchdog timers, error correction codes, and fault-tolerant scheduling (e.g., HP Tandem NonStop, IBM z series), while commodity desktop or server processors rely on cheaper mechanisms such as core disabling, ECC, and RAID.]

Slide 3: Hard-Faults
Main sources:
o Manufacturing defects
o Process-variation-induced failures
o In-field wearout
o Ultra-low-power operation
They have a direct impact on:
o Manufacturing yield
o Performance
o Lifetime throughput
o Dependability of semiconductor parts

Slide 4: Manufacturing Defects
Happen due to:
o Silicon crystal defects
o Random particles on the wafer
o Fabrication impreciseness
ITRS: one defect expected per five 100mm² dies
o A real threat to yield

Slide 5: Challenges with High-Performance µPs
Protecting high-performance µPs (e.g., AMD Phenom, IBM POWER7, Intel Nehalem) against hard-faults is more challenging:
o They contain billions of transistors with complex connectivity and many pipeline stages, so fine-grained redundancy is not cost-effective, and the growing number of transistors per core makes core disabling wasteful.
o They operate on the most aggressive V/F curve: higher clock frequency, voltage, and temperature increase operational stress and accelerate aging, and force the use of high V/F guard-bands.
o They have large on-chip caches, where the bit-cell with the worst timing characteristics dictates the voltage and frequency of the entire SRAM array.

Slide 6: Outline
Objective: overcome hard-faults in high-performance µPs with comprehensive, low-cost solutions that protect both the on-chip caches and the non-cache parts of the core.
o Archipelago [HPCA'11] protects on-chip caches against: near-threshold failures, process variation, wearout, and defects.
o Necromancer [ISCA'10, IEEE Micro'10] protects the general core area (non-cache parts) against: manufacturing defects and wearout failures.

Slide 7: NT Operation: SRAM Bit-Error-Rate
[Figure: SRAM bit-error rate versus Vdd; extremely fast growth in the failure rate with decreasing Vdd.]

Slide 8: Our Goal
o Enable DVS to push the core's Vdd down into the ultra-low-voltage region (< 650mV) while preserving the correct functionality of the on-chip caches.
o Propose a highly flexible, fault-tolerant cache architecture that can efficiently tolerate these SRAM failures.
o Minimize our overheads in high-power mode.

Slide 9: Archipelago (AP)
This particular cache keeps 6 of its 8 lines functional: by forming autonomous islands, AP gives up only a single line per island as a sacrificial line.
[Figure: 8 cache lines divided into data chunks and partitioned into Island 1 and Island 2, each island containing data lines plus one sacrificial line.]
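The island idea rests on a chunk-level test that the next slide formalizes as a "collision": two lines can share a group only if no chunk position is faulty in both, so a single sacrificial line can patch every member. A minimal sketch in Python, assuming a hypothetical fault map with one boolean per data chunk:

```python
# Minimal sketch of the chunk-level collision test behind island formation.
# The fault-map format is an assumption: one boolean per data chunk,
# True meaning that chunk is faulty.

def collide(fault_map_a: list[bool], fault_map_b: list[bool]) -> bool:
    """Two lines collide if some chunk position is faulty in both."""
    return any(a and b for a, b in zip(fault_map_a, fault_map_b))

# Example with 4 chunks per line: A and B are collision-free, so one
# sacrificial line can supply chunk 1 for A and chunk 0 for B;
# A and C both have chunk 1 faulty, so they collide.
line_a = [False, True, False, False]
line_b = [True, False, False, False]
line_c = [False, True, True, False]
assert not collide(line_a, line_b)
assert collide(line_a, line_c)
```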
Slide 10: Baseline AP Architecture
o Two lines have a collision if they have at least one faulty data chunk in the same position (the blue and orange lines in the figure are collision-free).
o There should be no collision between the lines within a group [Group 3 (G3) contains the green, blue, and orange lines].
o Two types of lines: data lines and sacrificial lines.
o Added modules: a memory map (10T), a fault map (10T), and a MUXing layer.
[Figure: the input address indexes the memory map, which directs the access to a line in the first or second bank; the fault map drives the MUXing layer to route around faulty chunks using the group's sacrificial line (S), yielding a functional line.]

Slide 11: AP with Relaxed Group Formation
o Sacrificial lines do not contribute to the effective capacity, so we want to minimize the total number of groups.
[Figure: group formation before and after relaxation, with sacrificial lines (S) placed in the first and second banks.]

Slide 12: Semi-Sacrificial Lines
o A semi-sacrificial line guarantees the parallel access; in contrast to a sacrificial line, it also contributes to the effective cache capacity.
[Figure: an accessed line in the first bank is patched through the MUXing layer by a sacrificial or semi-sacrificial line in the second bank.]

Slide 13: AP with Semi-Sacrificial Lines
[Figure: the baseline AP architecture extended with semi-sacrificial lines across ways (way0/way1); the memory map, fault map, and MUXing layer combine chunks into a functional block.]

Slide 14: AP Configuration
We model the problem as a graph:
o Each node is a line of the cache.
o An edge connects two nodes when there is no collision between them.
A collision-free group forms a clique, so group formation amounts to finding cliques. To maximize the number of functional lines, we need to minimize the number of groups:
o minimum clique cover (MCC)
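Minimum clique cover is NP-hard in general, so the configuration step would rely on a heuristic in practice. Below is a hypothetical first-fit greedy sketch (not necessarily the paper's algorithm): place each line in the first group it is pairwise collision-free with, otherwise open a new group. Since each group gives up one line as its sacrificial line, fewer groups translate directly into more effective capacity; the example on the next slide shows one resulting assignment.

```python
# Hypothetical first-fit greedy heuristic for group (island) formation.

def collide(a: list[bool], b: list[bool]) -> bool:  # as in the earlier sketch
    return any(x and y for x, y in zip(a, b))

def form_groups(fault_maps: list[list[bool]]) -> list[list[int]]:
    groups: list[list[int]] = []  # each group is a list of line indices
    for line, fmap in enumerate(fault_maps):
        for group in groups:
            # A group stays a clique in the collision-free graph only if the
            # new line is pairwise collision-free with every member.
            if all(not collide(fmap, fault_maps[m]) for m in group):
                group.append(line)
                break
        else:
            groups.append([line])  # no compatible group: start a new island
    return groups
```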
Slide 15: AP Configuration Example
[Figure: a ten-line example across two banks. Group 1 contains lines 1, 4, and 8, with line 9 as its sacrificial line; Group 2 contains lines 2, 5, 7, and 10, with line 3 as its sacrificial line; line 6 is disabled. In the collision graph, each island/group forms a clique.]

Slide 16: Operation Modes
High-power mode (AP is turned off):
o There are no non-functional lines in this case.
o Clock gating reduces the dynamic power of AP's SRAM structures.
Low-power mode:
o During boot time in low-power mode, BIST scans the cache for potentially faulty cells, forms the groups, and configures the hardware.
o The processor then switches back to high-power mode.

Slide 17: Minimum Achievable Vdd
[Figure: minimum achievable Vdd.]

Slide 18: Performance Loss
o One extra cycle of latency for the L1 and two extra cycles for the L2.
[Figure: resulting performance loss.]

Slide 19: Comparison with Alternative Methods
[Figure: cache area overhead (%) versus power at the minimum Vdd, normalized to Archipelago, for the recently proposed 10T cell [Verma, ISSCC'08], ZC [Ansari, MICRO'09], BF [Wilkerson, ISCA'08], ECC-2, SECDED, row redundancy, a conventional cache, and AP. The 10T cell costs 66% area overhead; annotations mark schemes that disable 25% and 9% of the cache capacity.]

Slide 20: Archipelago: Summary
o DVS is widely used to deal with high power dissipation, but the minimum achievable voltage is bounded by the SRAM structures.
o We proposed a highly flexible cache architecture to tolerate failures when operating in the near-threshold region.
o Using our approach, the processor's Vdd can be reduced to 375mV, with 79% dynamic and 51% leakage power savings at < 10% area and performance overheads.

Slide 21: Outline
o Archipelago [HPCA'11] protects on-chip caches against: near-threshold failures, process variation, wearout, and defects.
o Necromancer [ISCA'10, IEEE Micro'10] protects the general core area (non-cache parts) against: manufacturing defects and wearout failures.

Slide 22: Necromancer (NM)
o There are proper techniques to protect caches; to maintain an acceptable level of yield, the processing cores need to be protected as well, which is more challenging due to their inherent irregularity.
o Given a CMP system, Necromancer utilizes a dead core (i.e., a core with a hard-fault) to do useful work and enhances system throughput.

Slide 23: Impact of Hard-Faults on Program Execution
o More than 40% of the injected faults cause an immediate (less than 10K committed instructions) architectural state mismatch; thus, a faulty core cannot be trusted to provide correct functionality even for short periods of program execution.
[Figure: distribution of injected hard-faults that manifest as architectural state mismatches across different latencies, measured as the number of committed instructions before a mismatch happens when starting from a valid architectural state.]
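One way such mismatch latencies could be measured (a sketch of a hypothetical fault-injection harness, not the authors' infrastructure): replay the faulty run alongside a golden run and count committed instructions until the architectural state first diverges.

```python
# Hypothetical measurement of mismatch latency in a fault-injection study.

def mismatch_latency(golden: list[dict], faulty: list[dict]) -> int | None:
    """Each list element is an architectural-state snapshot at one commit."""
    for committed, (g, f) in enumerate(zip(golden, faulty)):
        if g != f:
            return committed  # instructions committed before the mismatch
    return None  # no architectural state mismatch in the traced window
```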
Slide 24: Relaxing Absolute Correctness Constraint
o Similarity Index (SI): the percentage of PCs that match between the faulty and the golden execution, sampled at 1K-instruction intervals.
o For an SI threshold of 90%, in more than 85% of cases the dead core can successfully commit at least 100K instructions before its execution differs by more than 10%.
[Figure: distribution of the injected faults' similarity indices across different mismatch latencies.]

Slide 25: Using the Undead Core to Generate Hints
o The execution behavior of a dead core coarsely matches the intact program execution for long time periods. How can we exploit the program execution on the dead core? By accelerating the execution of another core!
o We extract useful information from the execution of the program on the dead core and send this information (hints) to the other core, the animator core, which runs the same program.
[Figure: the undead core (containing a hard-fault) sends hints that boost the animator core's performance.]

Slide 26: Opportunities for Acceleration
o Perfect hints: perfect branch prediction and no L1 cache misses.
o In most cases, by providing perfect hints to the simpler cores (EV4, EV5, and EV4 (OoO)), these cores can achieve a performance comparable to that achieved by a 6-issue OoO EV6.
[Figure: IPC of several Alpha cores, normalized to the EV4's IPC, ordered by increasing complexity/resources.]

Slide 27: Necromancer Architecture
A robust heterogeneous core coupling execution technique.
The undead core:
o Executes the same program to provide hints for the animator core (AC); it works as "an external run-ahead engine for the AC".
o A 6-issue OoO EV6 in our evaluation (FET DEC REN DIS EXE MEM COM pipeline), with hint gathering at the commit stage.
o It can proceed on data L2 misses, and its dirty D$ lines are dropped when they need to be replaced.
The animator core:
o An older core with the same ISA and fewer resources; a 2-issue OoO EV4 in our evaluation (FE DE RE DI EX ME CO pipeline).
o Treats the cache hints as prefetching information and handles the exceptions in the NM coupled cores.
Hints and communication:
o I$ hints: PCs of committed instructions. D$ hints: addresses of committed loads/stores. Branch prediction hints: BP updates. No communication is needed for L2 warm-up: beneath the private L1-Inst and L1-Data caches, the cores share a read-only L2 cache.
o A single queue carries the hints; most communication flows from the undead core to the animator core, except the resynchronization signal and the hint-disabling information. The PC and the architectural registers are used for resynchronization.
o Fuzzy hint disabling, based on continuous monitoring of hint effectiveness and cache fingerprints.
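In software terms, the hint traffic might be modeled as below. The record layout, queue depth, and function names are illustrative assumptions (the real design is a hardware queue between the two cores); the release check mirrors the age-tag rule of the branch-prediction example on the next slide.

```python
# A software model of the hint stream from the undead core to the animator
# core: three hint classes, an age tag, and a single bounded queue.
from collections import deque
from dataclasses import dataclass

@dataclass
class Hint:
    kind: str     # "icache" (PC of a committed instruction),
                  # "dcache" (address of a committed load/store), or
                  # "bp" (a branch predictor update)
    age: int      # committed-instruction count when the hint was generated
    payload: int  # the PC or data address being hinted

hint_queue: deque[Hint] = deque(maxlen=64)  # single bounded queue (depth assumed)

def release_hints(queue: deque[Hint], committed: int, window: int) -> list[Hint]:
    """Release hints whose age tag falls within the release window of the
    animator core's committed-instruction count
    (age tag <= committed instructions + release window size)."""
    released = []
    while queue and queue[0].age <= committed + window:
        released.append(queue.popleft())
    return released
```

On the receiving side, the animator core would treat released I$/D$ hints as prefetch information and BP hints as predictor updates, per the slide above.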
Slide 28: Example: Branch Prediction Hints
o Hint format: type, age tag, PC*, NPC.
o A hint is released once its age tag ≤ the number of committed instructions + the BP release window size.
o The AC's original branch predictor (a tournament predictor) is augmented with an NM predictor fed by the hints; saturating counters (SC1, SC2) track the prediction outcomes, and when a counter exceeds a threshold, the corresponding hints are disabled.
[Figure: hint gathering at the undead core's commit stage, the hint queue, and the hint distribution and disabling logic at the animator core.]

Slide 29: NM Design for CMP Systems
[Figure: NM design for CMP systems.]

Slide 30: Impact of Hard-Fault Location
[Figure: impact of the hard-fault location.]

Slide 31: Overheads
[Figure: area and power overheads for 1-, 2-, 4-, 8-, and 16-core systems (y-axis 0-18%), broken down into NM-specific structures in the undead core, interconnection wires plus the hint queue, NM-specific structures in the animator core, and the animator core itself (net overhead).]

Slide 32: Performance Gain
[Figure: performance gain; the annotated bars mark 88% and 71%.]

Slide 33: Necromancer: Summary
o Enhancing system throughput by exploiting dead cores.
o Necromancer leverages a set of microarchitectural techniques to provide: intrinsically robust hints; fine- and coarse-grained hint disabling; online monitoring of hint effectiveness; and dynamic state resynchronization between the cores.
o Applying Necromancer to a 4-core CMP: on average, 88% of the original performance of the undead core can be retrieved, with modest area and power overheads of 5.3% and 8.5%.

Slide 34: Takeaways
o Mission-critical and conventional reliability solutions are too expensive for modern high-performance processors.
o AP: low-cost cache protection against the major reliability threats in nanometer technologies.
o For the processing core, redundancy is costly; NM is an alternative that utilizes dead cores.
o To achieve efficient, reliable solutions, we need runtime adaptability, a high degree of re-configurability, and fine-grained spare substitution.

Slide 35: Thank You