On Modeling the Lifetime Reliability of Homogeneous Manycore Systems
Lin Huang and Qiang Xu
CUhk REliable computing laboratory (CURE)
The Chinese University of Hong Kong

Integrated Circuit (IC) Product Reliability
IC errors can be broadly classified into two categories
● Soft errors
  • Do not fundamentally damage the circuits
● Hard errors
  • Permanent once manifest
  • E.g., time-dependent dielectric breakdown (TDDB) in the gate oxides, electromigration (EM) and stress migration (SM) in the interconnects, and thermal cycling (TC)

Manycore Systems
State-of-the-art computing systems have started to employ multiple cores on a single die
● General-purpose processors, multi-digital signal processor systems
● Power efficiency
● Short time-to-market
[Figures: example manycore chips. Source: Intel. Source: Nvidia.]

Problem Formulation
To model the lifetime reliability of homogeneous manycore systems using a load-sharing nonrepairable k-out-of-n: G system with general failure distributions
Key features
● k-out-of-n: G system: provides fault tolerance
● Load-sharing: each embedded core carries only part of the load assigned by the operating system
● Nonrepairable: embedded cores are integrated on a single silicon die
● General failure distribution: embedded cores age in operation

Queueing Model for Task Allocation
Embedded cores execute tasks independently, and one core can perform at most one task at a time
Consider a manycore system composed of a set of identical embedded cores
● The cores are divided into a set of active cores, a set of spare cores, and a set of faulty cores
A general-purpose parallel processing system with a central queue and bulk arrivals is modeled as a queueing system
The probability that a given active core is occupied by tasks (also called its utilization) is computed from this queueing model
Target systems
● Gracefully degrading systems
● Standby redundant systems
[Figure: applications arrive at rate λa at a central task allocation queue that feeds the processor cores, which are grouped into sets S1, S2, and S3]

Lifetime Reliability of Entire System – Gracefully Degrading System
A functioning manycore system may contain a varying number of good cores
Consider the probability that the system has a given number of good cores at a given time; the system reliability can be expressed in terms of these probabilities
The Mean Time to Failure (MTTF) of the system then follows from the system reliability
These probabilities are determined case by case using conditional probabilities
What remains is how to compute those conditional probabilities, which requires the behavior of a single processor core

Behavior of Single Processor Core
States of cores
● Spare mode – cold standby
● Active mode
  • Processing state
  • Wait state – warm standby
The reliability functions in the different states have the same shape but different scale parameters
[Figure: core states: Processing and Wait (Warm Standby) within Active mode, and Spare (Cold Standby)]

Lifetime Reliability of A Single Core – Gracefully Degrading System
Define a core's accumulated time in a certain state at a given time as how long it has spent in that state up to that time
[Figure: calculation of the accumulated time for each core]

Theorem 1: Suppose a manycore system with the gracefully degrading scheme has experienced a sequence of core failures, ordered by their occurrence times. For any core that has survived until a given time, the theorem gives
● its accumulated time in the processing state up to that time
● its accumulated time as warm standby up to that time

Recall that the reliability functions in the wait and processing states have the same shape but different scale parameters
● General reliability function
● Reliability function in the processing state
● Reliability function in the wait state
● Relationships between the three, determined by the scale parameters

Consider a subdivision of the time axis into alternating wait and processing intervals
By the continuity of the reliability function, the reliability of a core at a given time is expressed through its accumulated time in the processing state and its accumulated time in the wait state
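The "same shape, different scale" idea can be illustrated with a small sketch. The slides' own formulas did not survive, so everything here is an assumption: the sketch uses a cumulative-exposure model in which time spent in each state is divided by that state's scale parameter, and a Weibull reliability function with a shared shape parameter. The function names (`effective_age`, `weibull_reliability`, `conditional_survival`) and all numeric values are hypothetical.

```python
import math


def effective_age(t_processing, t_wait, scale_processing, scale_wait):
    """Assumed cumulative-exposure model: normalize the accumulated time
    in each state by that state's scale parameter and add the results."""
    return t_processing / scale_processing + t_wait / scale_wait


def weibull_reliability(age, shape):
    """Unit-scale Weibull reliability, R(x) = exp(-x**shape)."""
    return math.exp(-(age ** shape))


def conditional_survival(age_now, age_later, shape):
    """Probability of surviving to age_later given survival to age_now."""
    return weibull_reliability(age_later, shape) / weibull_reliability(age_now, shape)


# Hypothetical core: 0.5 years processing, 1.0 year waiting,
# with the wait state aging four times more slowly than processing.
age = effective_age(0.5, 1.0, scale_processing=1.0, scale_wait=4.0)
print("effective age:", age)
print("reliability  :", weibull_reliability(age, shape=2.0))
```

With shape 1 the Weibull reduces to the exponential distribution, and the conditional survival depends only on the age increment. That is the memoryless behavior that hides aging effects.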
Theorem 2: Given a gracefully degrading manycore system that has experienced a sequence of core failures at known times, the theorem gives the probability that a certain core survives at a given time, provided that it has survived until an earlier time

Lifetime Reliability of Entire System – Standby Redundant System
A standby redundant system is functioning if it contains at least a required number of good cores; of these, a fixed number are configured as active cores and the remaining ones are spares
Again, the key point is to compute the lifetime reliability of a single core

Lifetime Reliability of A Single Core – Standby Redundant System
Define a core's birth time as the time point when it is configured as an active one

Theorem 3: In a standby redundant manycore system, for any core with a given birth time that has survived until a given time, the theorem gives
● its accumulated time in the processing state up to that time
● its accumulated time as warm standby up to that time

Theorem 4: In a manycore system with the standby redundant scheme, the theorem gives the probability that a certain core with a given birth time survives at a given time

Experimental Setup
Lifetime distributions
● Exponential
● Weibull
● Linear failure rate
System parameters
● A manycore system consisting of a fixed number of cores is considered

Misleading Results Caused by the Exponential Assumption
Sojourn Time (years)

Tolerable  Redundancy  0-Failure  1-Failure  2-Failure  3-Failure  4-Failure  Expected
failures   Scheme      State      State      State      State      State      lifetime
0          -           0.2188     -          -          -          -          0.2188
1          Degrading   0.2121     0.2188     -          -          -          0.4309
1          Standby     0.2188     0.2188     -          -          -          0.4376
2          Degrading   0.2059     0.2121     0.2188     -          -          0.6368
2          Standby     0.2188     0.2188     0.2188     -          -          0.6564
3          Degrading   0.2000     0.2059     0.2121     0.2188     -          0.8368
3          Standby     0.2188     0.2188     0.2188     0.2188     -          0.8752
4          Degrading   0.1944     0.2000     0.2059     0.2121     0.2188     1.0312
4          Standby     0.2188     0.2188     0.2188     0.2188     0.2188     1.0940

Lifetime Reliability for Non-Exponential Lifetime Distribution
[Figure: (a) Weibull distribution; (b) linear failure rate distribution]

Detailed Results for Gracefully Degrading System
Sojourn Time (years)

                     Tolerable  0-Failure  1-Failure  2-Failure  3-Failure  4-Failure  Expected
Distribution         failures   State      State      State      State      State      lifetime
Weibull              0          2.2039     -          -          -          -          2.2039
Weibull              1          2.2153     0.5573     -          -          -          2.7726
Weibull              2          2.2260     0.5600     0.3055     -          -          3.0915
Weibull              3          2.2359     0.5626     0.3142     0.1040     -          3.2167
Weibull              4          2.2452     0.5649     0.2988     0.0955     0.0820     3.2864
Linear Failure Rate  0          1.8572     -          -          -          -          1.8572
Linear Failure Rate  1          1.8463     1.1367     -          -          -          2.9830
Linear Failure Rate  2          1.8354     1.1325     0.8926     -          -          3.8605
Linear Failure Rate  3          1.8243     1.1282     0.8798     0.6941     -          4.5264
Linear Failure Rate  4          1.8133     1.1237     0.8762     0.7055     0.6269     5.1456

The Impact of Workload

Comparison Between Gracefully Degrading System and Standby Redundant System
Labels: Warm Standby, Hot Standby, Cold Standby

                             Redundancy
Distribution                 Scheme
Weibull              2       Degrading   1.5039  1.8232  2.1497  2.2930  2.4265  2.6258
Weibull              2       Standby     1.5314  1.8227  2.1133  2.2488  2.3484  2.5309
Weibull              4       Degrading   1.5046  1.8521  2.2305  2.4432  2.5771  2.8376
Weibull              4       Standby     1.5577  1.8545  2.1715  2.3103  2.4266  2.6261
Linear Failure Rate  2       Degrading   1.9115  2.3197  2.7070  2.8697  3.0105  3.2424
Linear Failure Rate  2       Standby     1.9608  2.3314  2.7330  2.8851  3.0091  3.2146
Linear Failure Rate  4       Degrading   2.1348  2.7122  3.3642  3.6529  3.9385  4.3590
Linear Failure Rate  4       Standby     2.3008  2.7899  3.4307  3.6015  3.8588  4.1881

Conclusion
State-of-the-art CMOS technology enables chip-level manycore processors
The lifetime reliability of such large circuits is a major concern
We propose a comprehensive analytical model to estimate the lifetime reliability of manycore systems
Experimental results demonstrate the effectiveness of the proposed model

Thank You for Your Attention!
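Backup: a quick consistency check on the sojourn-time results for the exponential case. The last value in each row appears to be the sum of the per-state sojourn times, which fits the usual view that the expected lifetime is the total time spent in functioning states. The sketch below only re-adds the numbers reported above; the variable and function names are hypothetical, and keying the rows by the number of tolerable failures is an interpretation of the table layout.

```python
# Sojourn times (years) for the gracefully degrading scheme under the
# exponential assumption, keyed by the assumed number of tolerable failures.
degrading_sojourn = {
    0: [0.2188],
    1: [0.2121, 0.2188],
    2: [0.2059, 0.2121, 0.2188],
    3: [0.2000, 0.2059, 0.2121, 0.2188],
    4: [0.1944, 0.2000, 0.2059, 0.2121, 0.2188],
}

# Expected lifetimes reported in the same rows.
degrading_reported = {0: 0.2188, 1: 0.4309, 2: 0.6368, 3: 0.8368, 4: 1.0312}


def expected_lifetime(sojourn_times):
    """Total time spent across all functioning states."""
    return sum(sojourn_times)


def max_discrepancy(sojourn, reported):
    """Largest gap between a summed row and its reported total."""
    return max(abs(expected_lifetime(sojourn[r]) - reported[r]) for r in sojourn)


print("largest gap (years):", max_discrepancy(degrading_sojourn, degrading_reported))
```

In the standby rows every state has the same sojourn time of 0.2188 years, so the same check reduces to a multiplication by the number of states.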