On Modeling the Lifetime Reliability
of Homogeneous Manycore Systems
Lin Huang and Qiang Xu
CUhk REliable computing laboratory (CURE)
The Chinese University of Hong Kong
Integrated Circuit (IC) Product Reliability
 IC errors can be broadly classified into two categories
● Soft errors
• Do not fundamentally damage the circuits
● Hard errors
• Permanent once manifest
• E.g., time dependent dielectric breakdown (TDDB) in the gate oxides,
electromigration (EM) and stress migration (SM) in the interconnects, and
thermal cycling (TC)
Manycore Systems
 State-of-the-art computing systems have started to employ multiple
cores on a single die
● General-purpose processors, multi-digital signal processor systems
● Power-efficiency
● Short time-to-market
[Figures. Sources: Intel, Nvidia]
Problem Formulation
 To model the lifetime reliability of homogeneous manycore systems
using a load-sharing nonrepairable k-out-of-n: G system with
general failure distributions
 Key features
● k-out-of-n: G systems: to provide fault tolerance
● Load-sharing: each embedded core carries only part of the load
assigned by the operating system
● Nonrepairable: embedded cores are integrated on a single silicon die
● General failure distribution: embedded cores age in operation
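As a simplified illustration of the k-out-of-n: G idea only, the sketch below computes the probability that at least `k` of `n` cores are still good. It assumes independent, identical cores with a fixed survival probability `p`, so it deliberately ignores the load-sharing and aging effects that the model on this slide adds. The function name and the numbers are hypothetical.

```python
from math import comb


def k_out_of_n_reliability(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent cores work.

    Assumption: every core works with the same probability p,
    independently of the others (no load sharing, no aging).
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical example: 4 good cores required out of 6, each alive with p = 0.9.
r_system = k_out_of_n_reliability(4, 6, 0.9)
```

With these numbers the two redundant cores raise the system reliability well above that of a 4-out-of-4 system (`0.9 ** 4`).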
Queueing Model for Task Allocation
 Embedded cores execute tasks independently and one core can
perform at most one task at a time
 Consider a manycore system composed of a set of identical
embedded cores
● The sets of active cores, spare cores, and faulty cores
[Figure: Applications (λa), Central Task Allocation Queue, Processor Cores (Set S1; Sets S2, S3)]
Queueing Model for Task Allocation
 A general-purpose parallel processing system with a central queue
and bulk arrivals is modeled as a
queueing system
 The probability that a certain active core is occupied by tasks (also
called utilization) is computed as
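A minimal sketch of how such a utilization could be computed, assuming that λa in the figure is the arrival rate of applications, that each application brings a batch of tasks, and that every active core serves tasks at the same rate. It uses the generic work-conservation identity for a stable multi-server queue and is not necessarily the expression used on the slide. All parameter names and numbers are hypothetical.

```python
def core_utilization(arrival_rate: float, mean_batch_size: float,
                     service_rate: float, active_cores: int) -> float:
    """Long-run fraction of time one active core is busy.

    Assumptions: batches arrive at `arrival_rate`, each batch holds
    `mean_batch_size` tasks on average, one core completes tasks at
    `service_rate`, and work is spread evenly over `active_cores`.
    """
    rho = arrival_rate * mean_batch_size / (service_rate * active_cores)
    if rho >= 1.0:
        raise ValueError("queue is unstable: offered load exceeds capacity")
    return rho


# Hypothetical example: the same workload on 8 cores and on 7 cores.
rho_8 = core_utilization(2.0, 3.0, 1.0, 8)
rho_7 = core_utilization(2.0, 3.0, 1.0, 7)
```

Under these assumptions the per-core utilization rises when a core is lost, which is the load-sharing effect: the surviving cores carry the same total workload.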
 Target system
● Gracefully degrading systems
● Standby redundant systems
Lifetime Reliability of Entire System
– Gracefully Degrading System
 A functioning manycore system may contain
cores
 Let
be the probability that the system has
cores at time
 The system reliability can therefore be expressed as
good
active
 Thus, the Mean Time to Failure (MTTF) of the system can be
written as
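The MTTF of any system equals the integral of its reliability function over time, MTTF = ∫₀^∞ R(t) dt. The sketch below evaluates that integral numerically with the trapezoidal rule. The function name, the finite integration horizon, and the exponential test case are assumptions chosen for illustration; the system reliability function of the model would be plugged in instead.

```python
import math
from typing import Callable


def mttf(reliability: Callable[[float], float], horizon: float,
         steps: int = 100_000) -> float:
    """Approximate the integral of R(t) from 0 to `horizon`.

    Assumption: R(t) is negligible beyond `horizon`, so truncating
    the infinite integral there introduces little error.
    """
    h = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for i in range(1, steps):
        total += reliability(i * h)
    return total * h


# Sanity check with an exponential lifetime, whose MTTF is 1 / rate.
rate = 0.5
estimate = mttf(lambda t: math.exp(-rate * t), horizon=100.0)
```

For the exponential case the estimate should be close to `1 / rate`, which gives a quick check on the step size and horizon.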
Lifetime Reliability of Entire System
– Gracefully Degrading System
 To determine
●
•
●
• Conditional probability
•
● For any
• Conditional probability
•
 The remaining question is how to compute
Behavior of Single Processor Core
 States of cores
● Spare mode – cold standby
● Active mode
• Processing state
• Wait state – warm standby
 The reliability functions in the wait and processing states have the
same shape but different scale parameters
● E.g.,
[Figure: core states. Active, comprising Processing and Wait (Warm Standby); Spare (Cold Standby)]
Lifetime Reliability of A Single Core
– Gracefully Degrading System
 Define accumulated time in a certain state at time
as how long it spends in such a state up to time
[Figure: one row per core]
 Calculation
Lifetime Reliability of A Single Core
– Gracefully Degrading System
 Theorem 1 Suppose a manycore system with a gracefully degrading
scheme has experienced core failures, in the order of occurrence
time at
, respectively, for any core that has survived until
time
● its accumulated time in the processing state up to time
● its accumulated time as warm standby up to time
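One plausible way to compute such accumulated times is sketched below. It assumes that the per-core utilization is constant between two consecutive core failures, so that a surviving core spends that fraction of each interval in the processing state and the rest in the wait state. This is an illustrative assumption, not the theorem's own expression, and all names and numbers are hypothetical.

```python
from typing import Sequence, Tuple


def accumulated_times(failure_times: Sequence[float],
                      utilizations: Sequence[float],
                      t: float) -> Tuple[float, float]:
    """Return (processing_time, wait_time) of a surviving core at time t.

    failure_times: sorted occurrence times of the core failures before t.
    utilizations[j]: assumed utilization of each active core after j failures,
                     so it needs len(failure_times) + 1 entries.
    """
    if len(utilizations) != len(failure_times) + 1:
        raise ValueError("need one utilization per failure count")
    boundaries = [0.0, *failure_times, t]
    processing = 0.0
    wait = 0.0
    for j in range(len(boundaries) - 1):
        span = boundaries[j + 1] - boundaries[j]
        processing += utilizations[j] * span
        wait += (1.0 - utilizations[j]) * span
    return processing, wait


# Hypothetical example: failures at t = 1 and t = 3, observed at t = 4.
proc, wait = accumulated_times([1.0, 3.0], [0.5, 0.6, 0.75], 4.0)
```

The two accumulated times always add up to the elapsed time, since a core in a gracefully degrading system is either processing or waiting while it is alive.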
Lifetime Reliability of A Single Core
– Gracefully Degrading System
 Recall that the reliability functions in wait and processing states
have the same shape but different scale parameters
● General reliability function, abbreviated as
● Reliability function in processing state, denoted as
● Reliability function in wait state, denoted as
● Relationships: and
Lifetime Reliability of A Single Core
– Gracefully Degrading System
 A subdivision of the time: wait, processing, wait
 By the continuity of the reliability function, we have
[Equation terms annotated: accumulated time in the processing state; accumulated time in the wait state]
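One way to make the continuity argument concrete is sketched below, assuming Weibull lifetimes with a common shape and different scales in the two states. At each state switch the core's age is converted to the equivalent age that gives the same reliability in the new state, so the reliability function has no jump. Under this assumption only the two accumulated times matter, not the order of the intervals. The Weibull choice and all names are assumptions for illustration.

```python
import math


def weibull_reliability(t: float, scale: float, shape: float) -> float:
    """Weibull reliability R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def reliability_with_state_switching(t_processing: float, t_wait: float,
                                     scale_processing: float,
                                     scale_wait: float,
                                     shape: float) -> float:
    """Reliability of a core given its accumulated time in each state.

    Wait time is converted to an equivalent processing age by the ratio
    of the scale parameters, which keeps R continuous at every switch.
    """
    equivalent_age = t_processing + t_wait * scale_processing / scale_wait
    return weibull_reliability(equivalent_age, scale_processing, shape)


# Hypothetical example: 2 years processing, 3 years waiting.
r = reliability_with_state_switching(2.0, 3.0, scale_processing=5.0,
                                     scale_wait=20.0, shape=2.0)
```

When one of the accumulated times is zero the function reduces to the plain Weibull reliability of the other state, which is a quick consistency check on the conversion.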
Lifetime Reliability of A Single Core
– Gracefully Degrading System
 Theorem 2 Given a gracefully degrading manycore system that has
experienced core failures which occur at
respectively,
the probability that a certain core survives at time
provided
that it has survived until time is given by
where
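The conditional survival probability in such a statement rests on the identity P(survive to t | survived to s) = R(t) / R(s) for s ≤ t. The sketch below applies it to a single reliability function of calendar time, which is a simplification: the theorem itself works with accumulated times in the processing and wait states. The names and numbers are hypothetical.

```python
import math
from typing import Callable


def conditional_survival(reliability: Callable[[float], float],
                         s: float, t: float) -> float:
    """P(core survives to t | it has survived to s) = R(t) / R(s)."""
    if t < s:
        raise ValueError("t must not be earlier than s")
    return reliability(t) / reliability(s)


def exponential(u: float) -> float:
    return math.exp(-0.5 * u)          # constant failure rate, no aging


def weibull(u: float) -> float:
    return math.exp(-((u / 4.0) ** 2))  # shape 2: failure rate grows with age


# Survival over the next 2 time units, for a new core and for an aged core.
exp_new = conditional_survival(exponential, 0.0, 2.0)
exp_old = conditional_survival(exponential, 3.0, 5.0)
wb_new = conditional_survival(weibull, 0.0, 2.0)
wb_old = conditional_survival(weibull, 3.0, 5.0)
```

For the exponential lifetime the new and aged cores have the same conditional survival, whereas for the aging Weibull lifetime the aged core is less likely to survive the same interval.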
Lifetime Reliability of Entire System
– Standby Redundant System
 A standby redundant system is functioning if it contains at least
good cores, among which are configured as active ones and the
remaining are spares
 To determine
● Again, the key point is to compute
Lifetime Reliability of A Single Core
– Standby Redundant System
 Define a core’s birth time as the time point when it is configured
as an active one
 Theorem 3 In a standby redundant manycore system, for any core
with birth time that has survived until time
● its accumulated time in the processing state up to time
● its accumulated time as warm standby up to time
Lifetime Reliability of A Single Core
– Standby Redundant System
 Theorem 4 In a manycore system with a standby redundant scheme,
the probability that a certain core with birth time survives at time
is given by
where
Experimental Setup
 Lifetime distributions
● Exponential
● Weibull
● Linear failure rate
 System parameters
●
●
 Consider a manycore system
consisting of
cores
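For reference, the three lifetime distributions named above have the standard reliability functions sketched below. The parameter names and values are hypothetical, since the parameter settings of the experiments are not given here.

```python
import math


def exponential_reliability(t: float, rate: float) -> float:
    """Constant failure rate h(t) = rate."""
    return math.exp(-rate * t)


def weibull_reliability(t: float, scale: float, shape: float) -> float:
    """Failure rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return math.exp(-((t / scale) ** shape))


def linear_failure_rate_reliability(t: float, a: float, b: float) -> float:
    """Failure rate h(t) = a + b * t, so R(t) = exp(-(a*t + b*t**2 / 2))."""
    return math.exp(-(a * t + 0.5 * b * t * t))


# Hypothetical parameters: both aging models reduce to the exponential
# distribution in a special case (shape = 1, or slope b = 0).
r_exp = exponential_reliability(2.0, rate=0.25)
r_wb = weibull_reliability(2.0, scale=4.0, shape=1.0)
r_lfr = linear_failure_rate_reliability(2.0, a=0.25, b=0.0)
```

A Weibull shape above 1 or a positive slope `b` gives a failure rate that increases with age, which is what distinguishes these models from the exponential one.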
Misleading Results Caused by the Exponential Assumption
Sojourn Time (years)

Redundant  Redundancy  0-Failure  1-Failure  2-Failure  3-Failure  4-Failure  Expected
Cores      Scheme      State      State      State      State      State      Lifetime
0          —           0.2188     —          —          —          —          0.2188
1          Degrading   0.2121     0.2188     —          —          —          0.4309
1          Standby     0.2188     0.2188     —          —          —          0.4376
2          Degrading   0.2059     0.2121     0.2188     —          —          0.6368
2          Standby     0.2188     0.2188     0.2188     —          —          0.6564
3          Degrading   0.2000     0.2059     0.2121     0.2188     —          0.8368
3          Standby     0.2188     0.2188     0.2188     0.2188     —          0.8752
4          Degrading   0.1944     0.2000     0.2059     0.2121     0.2188     1.0312
4          Standby     0.2188     0.2188     0.2188     0.2188     0.2188     1.0940

Expected Lifetime: expected lifetime of the system (years)
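In the table above, each expected lifetime equals the sum of the sojourn times in the failure states of that row. The sketch below checks this for the two rows with sojourn times in all five states, using the values from the table; the variable names are hypothetical.

```python
# Sojourn times (years) in the 0- to 4-failure states, taken from the table.
degrading = [0.1944, 0.2000, 0.2059, 0.2121, 0.2188]
standby = [0.2188, 0.2188, 0.2188, 0.2188, 0.2188]

# The expected lifetime is the total time spent in all functioning states.
lifetime_degrading = sum(degrading)   # about 1.0312 years
lifetime_standby = sum(standby)       # about 1.0940 years
```

Under the exponential assumption the standby scheme shows the same sojourn time in every state, because a memoryless core does not age.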
Lifetime Reliability for Non-Exponential Lifetime Distribution
[Figure: (a) Weibull Distribution; (b) Linear Failure Rate Distribution]
Detailed Results for Gracefully Degrading System
Sojourn Time (years)

Distribution         Redundant  0-Failure  1-Failure  2-Failure  3-Failure  4-Failure  Expected
                     Cores      State      State      State      State      State      Lifetime
Weibull              0          2.2039     —          —          —          —          2.2039
Weibull              1          2.2153     0.5573     —          —          —          2.7726
Weibull              2          2.2260     0.5600     0.3055     —          —          3.0915
Weibull              3          2.2359     0.5626     0.3142     0.1040     —          3.2167
Weibull              4          2.2452     0.5649     0.2988     0.0955     0.0820     3.2864
Linear Failure Rate  0          1.8572     —          —          —          —          1.8572
Linear Failure Rate  1          1.8463     1.1367     —          —          —          2.9830
Linear Failure Rate  2          1.8354     1.1325     0.8926     —          —          3.8605
Linear Failure Rate  3          1.8243     1.1282     0.8798     0.6941     —          4.5264
Linear Failure Rate  4          1.8133     1.1237     0.8762     0.7055     0.6269     5.1456
The Impact of Workload
Comparison Between Gracefully Degrading
System and Standby Redundant System
Distribution         Redundant  Redundancy  Hot      Warm Standby                    Cold
                     Cores      Scheme      Standby                                  Standby
Weibull              2          Degrading   1.5039   1.8232  2.1497  2.2930  2.4265  2.6258
Weibull              2          Standby     1.5314   1.8227  2.1133  2.2488  2.3484  2.5309
Weibull              4          Degrading   1.5046   1.8521  2.2305  2.4432  2.5771  2.8376
Weibull              4          Standby     1.5577   1.8545  2.1715  2.3103  2.4266  2.6261
Linear Failure Rate  2          Degrading   1.9115   2.3197  2.7070  2.8697  3.0105  3.2424
Linear Failure Rate  2          Standby     1.9608   2.3314  2.7330  2.8851  3.0091  3.2146
Linear Failure Rate  4          Degrading   2.1348   2.7122  3.3642  3.6529  3.9385  4.3590
Linear Failure Rate  4          Standby     2.3008   2.7899  3.4307  3.6015  3.8588  4.1881
Conclusion
 State-of-the-art CMOS technology enables chip-level manycore
processors
 The lifetime reliability of such large circuits is a major concern
 We propose a comprehensive analytical model to estimate the
lifetime reliability of manycore systems
 Some experimental results are shown to demonstrate the
effectiveness of the proposed model
Thank You for Your Attention!