CA226 — Advanced
Computer Architecture
Stephen Blott <[email protected]>
Hardware Failures …
But first:
• we need to talk a little bit about probability
Note to self…
1. The Monty Hall problem [http://en.wikipedia.org/wiki/Monty_Hall_problem]
2. The two-children problem [http://www.maa.org/external_archive/devlin/devlin_04_10.html]
3. The terrorist problem [http://en.wikipedia.org/wiki/Base_rate_fallacy]
4. The geometric distribution [http://en.wikipedia.org/wiki/Geometric_distribution]
Note
Of these four topics,
only the last is directly relevant to today’s material.
The Exponential Distribution
The exponential distribution:
• `f_lambda(x)\ =\ lambda e^{-lambda x}`
In which:
• `lambda` is known as the rate
• `e` is the base of natural logarithms
about `2.71828...`
For Poisson processes, `f_lambda(x)` is:
• the probability density for `x` units of time elapsing until the next event of interest occurs
• e.g. the relative likelihood that a disk fails after about 10,000 hours
Exponential Distribution — Properties
Mean:
• `1/lambda`
Note
The mean is inversely proportional to the rate
(which is intuitively correct).
So, it is easy to convert between means and rates
(and we’ll be doing a fair amount of that).
Probability of exceeding a value:
• `P(x>t)\ =\ e^{-lambda t}`
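A minimal Python sketch of the density and the tail probability above. The function names and the disk figure (mean lifetime of `10^6` hours) are illustrative assumptions, not part of the course material. The numerical integration is only there to show that the density and the tail formula agree.

```python
import math

def exp_pdf(x, lam):
    """Density f_lambda(x) = lambda * e^(-lambda * x), for x >= 0."""
    return lam * math.exp(-lam * x)

def prob_exceeds(t, lam):
    """Closed form P(X > t) = e^(-lambda * t)."""
    return math.exp(-lam * t)

def prob_within(t, lam, steps=10_000):
    """P(X <= t), by midpoint-rule integration of the density over [0, t]."""
    width = t / steps
    return sum(exp_pdf((i + 0.5) * width, lam) for i in range(steps)) * width

lam = 1 / 1_000_000   # hypothetical disk: mean lifetime of 10^6 hours
t = 10_000            # hours

# The two values printed should agree closely.
print(prob_exceeds(t, lam), 1 - prob_within(t, lam))
```

With these assumed numbers the disk outlives 10,000 hours with probability of about 0.99.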
Multiple Exponential Distributions
Assume two (independent) exponentially-distributed random variables `X_1`
and `X_2` with rates `lambda_1` and `lambda_2`:
• `P(min(X_1,X_2) > t)\ =\ e^{-(lambda_1 + lambda_2)t}`
So:
• the smallest (or first) of two exponentially-distributed random events is itself
exponentially distributed
• and the corresponding rate is just the sum of the individual rates
Recall that `P(x>t)\ =\ e^{-lambda t}`.
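The claim that rates add can be checked empirically. The sketch below is a simulation under assumed rates (0.5 and 1.5 are arbitrary choices), not a proof: it draws pairs of exponential samples, keeps the smaller of each pair, and compares the observed mean with `1/(lambda_1 + lambda_2)`.

```python
import random

def simulate_min_mean(rates, trials=100_000, seed=1):
    """Average of min(X_1, ..., X_n) over many independent draws."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(trials):
        total += min(rng.expovariate(r) for r in rates)
    return total / trials

rates = [0.5, 1.5]               # hypothetical lambda_1 and lambda_2
observed = simulate_min_mean(rates)
predicted = 1.0 / sum(rates)     # mean of an exponential with the summed rate

print(observed, predicted)
```

The observed mean should land close to the predicted 0.5, with the gap shrinking as `trials` grows.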
Example — A Shopkeeper
Events:
1. mean time until a customer arrives is 5 minutes
2. mean time until the phone rings is 20 minutes
Assuming these are exponentially distributed:
• what is the mean time until either of these events occurs?
Answer
Rates:
1. `\ 1/{5\ "minutes"}`
2. `\ 1/{20\ "minutes"}`
Combined rate:
• `4/20 + 1/20\ =\ 5/20\ =\ 1/{4\ "minutes"}`
Mean time to the first of these events:
• 4 minutes
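The shopkeeper calculation can be sketched in a few lines of Python. The helper name `mean_time_to_first` is made up for this illustration. It converts each mean to a rate, sums the rates, and converts back.

```python
def mean_time_to_first(mean_times):
    """Mean time until the first of several independent exponential events."""
    combined_rate = sum(1.0 / m for m in mean_times)   # rates add
    return 1.0 / combined_rate                         # mean is 1 / rate

# Customer every 5 minutes on average, phone every 20 minutes on average.
first_event = mean_time_to_first([5, 20])
print(first_event)
```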
Multiple Exponential Distributions
More generally (and obviously):
• `P(min(X_1,X_2,...,X_n) > t)\ =\ e^{-(lambda_1 + lambda_2 + ... + lambda_n)t}`
Again, recalling that `P(x>t)\ =\ e^{-lambda t}`.
Aside
The exponential distribution (and its discrete counterpart, the geometric
distribution):
• is the only memoryless probability distribution (among continuous and discrete distributions, respectively)
Because exponential distributions are entirely characterised by their mean:
• they are often defined by just stating their mean
(or their half-life)
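A short numerical illustration of the memoryless property and of the half-life, in Python. The rate used (one failure per 100,000 hours) is an arbitrary assumption. Memorylessness says that `P(X > s + t | X > s) = P(X > t)`, and the half-life is `ln(2)` times the mean.

```python
import math

def survival(t, lam):
    """P(X > t) = e^(-lambda * t)."""
    return math.exp(-lam * t)

def conditional_survival(s, t, lam):
    """P(X > s + t | X > s): survive t more hours, given s hours already survived."""
    return survival(s + t, lam) / survival(s, lam)

lam = 1 / 100_000                 # hypothetical rate: mean lifetime of 10^5 hours
half_life = math.log(2) / lam     # time by which half the population has failed

# Both values are the same: the past 50,000 hours make no difference.
print(conditional_survival(50_000, 10_000, lam), survival(10_000, lam))
print(half_life)
```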
Why is this relevant?
Many computer hardware components have exponentially distributed times to failure:
• and, for those that don’t:
the exponential distribution is nevertheless a reasonable first approximation
And…
Given:
• `P(min(X_1,X_2,...,X_n) > t)\ =\ e^{-(lambda_1 + lambda_2 + ... + lambda_n)t}`
we can reason about failure rates of complex (multi-component) systems without
knowing too much about the details of the exponential distribution itself
Failures
A system is in one of two states:
• functioning or not functioning
Transitions between these states are:
• failures and restorations
Metrics — MTTF
MTTF:
• mean time to failure
Examples:
1. MTTF is (perhaps) 1,000,000 hours for some hard disk
2. MTTF is (perhaps) 100,000 hours for a fan
Metrics — Failure Rate
The failure rate:
• is just the reciprocal of the MTTF
Examples:
1. if MTTF is `10^6` hours, then failure rate is `1//10^6` per hour
2. if MTTF is `10^5` hours, then failure rate is `1//10^5` per hour
MTTF — Example
What is the MTTF of a two-component system composed of:
1. a hard disk with an MTTF of 1,000,000 hours
2. and a fan with an MTTF of 100,000 hours?
If we assume failures are independent and exponentially distributed, then:
• The means are `10^{6}` and `10^{5}`
• So the rates are `10^{-6}` and `10^{-5}`
• Since they’re exponentially distributed, we add these to get the overall rate:
`10^{-6} + 10^{-5} = 1.1 times 10^{-5}`
• Giving us the MTTF as the reciprocal of the rate:
`1/{1.1 times 10^{-5}} ~~ 90909` hours
Metrics — Failure Rate
The failure rate:
• is often measured in failures per `10^9` (billion) hours
(this is known as FIT, for failures in time)
1 FIT is one failure, on average, every 114,155 years.
Examples (from previous slides):
1. for the disk, rate of `10^9//10^6\ =\ 1000\ "FIT"`
2. for the fan, rate of `10^9//10^5\ =\ 10000\ "FIT"`
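The conversions between MTTF and FIT are simple enough to sketch in Python. The function names below are illustrative only. Because FIT values are rates, they can be added for independent, exponentially distributed failures, which reproduces the disk-plus-fan MTTF from the earlier example.

```python
HOURS_PER_BILLION = 10**9
HOURS_PER_YEAR = 24 * 365

def mttf_to_fit(mttf_hours):
    """Failures per 10^9 hours for a component with the given MTTF."""
    return HOURS_PER_BILLION / mttf_hours

def fit_to_mttf(fit):
    """MTTF in hours for a component with the given FIT rating."""
    return HOURS_PER_BILLION / fit

disk_fit = mttf_to_fit(10**6)
fan_fit = mttf_to_fit(10**5)
combined_fit = disk_fit + fan_fit          # rates add

print(disk_fit, fan_fit, combined_fit)
print(round(fit_to_mttf(combined_fit)))    # MTTF of disk and fan together, in hours
print(round(fit_to_mttf(1) / HOURS_PER_YEAR))   # years per failure at 1 FIT
```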
Restorations
When a system fails, it must be repaired
Metrics — MTTR
MTTR:
• mean time to repair
Examples — if a power unit fails:
• it may take 24 hours (say) for it to be replaced
• or perhaps 168 hours (one week)
Metrics — MTBF
MTBF:
• mean time between failures
• `"MTTF" + "MTTR"`
Metrics — Availability
Availability:
• the proportion of time during which service is satisfactorily delivered
• `"MTTF" / {"MTTF" + "MTTR"}`
Availability is usually quoted as a percentage.
Availability — Example
If:
• MTTF is `10^5` hours
• MTTR is 168 hours
Then:
• availability is `10^5/{10^5+168}\ =\ 99.83%`
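The availability formula translates directly into Python. This is a small sketch with a made-up function name. The downtime-per-year figure is an extra derived quantity, assuming a 365-day year.

```python
def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

a = availability(10**5, 168)
downtime_hours_per_year = (1 - a) * 24 * 365

print(f"{a:.2%}")                         # availability as a percentage
print(round(downtime_hours_per_year, 1))  # expected hours of downtime per year
```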
Systems
Computer systems consist of a number of components:
• e.g. processor, memory, bus, disk, fan, etc.
If components fail independently and individually have exponentially distributed lifetimes:
• then so too does the system as a whole (taking the system to fail as soon as any one component fails)
Example
Assume a disk subsystem with the following components:
• 10 disks, each rated at `1 xx 10^6` MTTF (hours)
• 1 ATA controller, `5 xx 10^5` MTTF
• 1 power supply, `2 xx 10^5` MTTF
• 1 fan, `2 xx 10^5` MTTF
• 1 ATA cable, `1 xx 10^6` MTTF
Assuming exponentially-distributed lifetimes and independent failures:
• calculate the MTTF of the disk subsystem as a whole
Example — Failure rate of system …
Failure rate of system:
`\ 10 xx 1/{1 xx 10^6}\ +\ 1/{5 xx 10^5}\ +\ 2 xx 1/{2 xx 10^5}\ +\ 1/{1 xx 10^6}`
`\ \ = \ {10 + 2 + 10 + 1} / {10^6}`
`\ \ = \ 23 / {10^6\ "hours"}`
Or:
• `23 / {10^6\ "hours"} xx 10^9 = 23000\ "FIT"`
Example — MTTF of the system as a whole?
The MTTF of the system as a whole is the inverse of the failure rate:
`\ \ 1 / "failure rate of system"`
`\ \ =\ 43500\ "hours"` (just under five years, approx.)
Example — Availability of system as a whole …
Assume any failed component will be repaired in 24 hours.
Availability:
`\ \ "MTTF" / {"MTTF" + "MTTR"}`
`\ \ =\ 43500 / { 43500 + 24 }\ =\ 99.945%`
Alternatively, assume any failed component will be repaired in 168 hours (one
week).
Availability:
`\ \ "MTTF" / {"MTTF" + "MTTR"}`
`\ \ =\ 43500 / { 43500 + 168 }\ =\ 99.615%`
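The whole disk-subsystem example can be reproduced with a short Python sketch. The data layout (a list of count and MTTF pairs) is just one convenient assumption. The script uses the unrounded MTTF (about 43478 hours) rather than 43500, which gives the same availability figures to three decimal places.

```python
# (count, MTTF in hours) for each kind of component in the disk subsystem
components = [
    (10, 1_000_000),   # disks
    (1, 500_000),      # ATA controller
    (1, 200_000),      # power supply
    (1, 200_000),      # fan
    (1, 1_000_000),    # ATA cable
]

def system_failure_rate(parts):
    """Sum of the individual failure rates (failures per hour)."""
    return sum(count / mttf for count, mttf in parts)

rate = system_failure_rate(components)
fit = rate * 10**9      # failures per billion hours
mttf = 1.0 / rate       # hours

print(round(fit), round(mttf))
for mttr in (24, 168):  # repair within a day, or within a week
    print(mttr, f"{mttf / (mttf + mttr):.3%}")
```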
Key Points
This calculation is possible (and simple) because:
• of the assumptions of exponential distributions and independent failures
• of the simplicity of combining exponential distributions
• of the mean and the rate being merely the reciprocal of one another
Improving Reliability
One common approach to improving reliability is:
• redundancy
Improving Reliability — Example
What would be the MTTF of a power supply and its availability if:
• the MTTF of an individual power supply is `2xx10^5` hours
• we add an additional (redundant) power supply, and
• the MTTR for the power supply unit is 24 hours?
Well, let’s see …
Mean time to individual power supply failure:
• `"MTTF"_{"individual"}\ =\ 2xx10^5\ "hours"`
Mean time to any power supply failure:
• `"MTTF"_{"any"}\ =\ {"MTTF"_{"individual"}}/2`
And …
Probability of second failure before first failure is repaired (approximately, since the MTTR is much shorter than the MTTF):
• `{"MTTR"_{"individual"}}/{"MTTF"_{"individual"}}`
MTTF of power supply pair:
• `"MTTF"_{"any"} times 1/{"probability of second failure before repair"}`
• `{"MTTF"_{"any"}}/{"probability of second failure before repair"}`
• `{{"MTTF"_{"individual"}}/2} / {({"MTTR"_{"individual"}}/{"MTTF"_{"individual"}})}`
• `{"MTTF"_"individual"^2} / {2xx"MTTR"_{"individual"}}`
So, let’s try that out …
Using the values from the previous example:
• `{"MTTF"_"individual"^2} / {2xx"MTTR"_{"individual"}}`
• `{ (2xx10^5)^2} / {2xx24} \ =\ 83xx10^7` hours
So the MTTF of the pair is about 4167 times that of a single power supply:
• redundancy works!
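The steps of the redundant-pair argument can be sketched in Python as follows. The function names are hypothetical. The sketch also compares the approximation `MTTR/MTTF` with the exact exponential probability `1 - e^{-MTTR/MTTF}` of the surviving unit failing within one repair period, to show that the approximation is very close when the MTTR is much shorter than the MTTF.

```python
import math

def pair_mttf(mttf, mttr):
    """Approximate MTTF of a redundant pair: MTTF^2 / (2 * MTTR)."""
    mttf_any = mttf / 2        # mean time until either unit fails
    p_second = mttr / mttf     # approx. chance the survivor fails during the repair
    return mttf_any / p_second

def p_second_exact(mttf, mttr):
    """Exact exponential probability that the survivor fails within mttr hours."""
    return -math.expm1(-mttr / mttf)   # 1 - e^(-mttr/mttf), computed accurately

single = 2e5                   # MTTF of one power supply, in hours
pair = pair_mttf(single, 24)   # repair within 24 hours

print(f"{pair:.3g} hours")
print(f"{pair / single:.0f} times a single supply")
print(24 / single, p_second_exact(single, 24))
```

With these values the pair lasts a few thousand times longer than a single supply, provided failures really are independent and repairs really do complete within the MTTR.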
Now …
What would be the effect of adding a third power supply?
Third Power Supply …
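Extending the argument used for the pair (the first failure arrives at three times the individual rate, and then both remaining supplies must fail before repair completes):
• `"MTTF"_{"triple"}\ ~~\ {"MTTF"_"individual"^3} / {6 xx "MTTR"_{"individual"}^2}`
Assuming an MTTR of 24 hours:
• `{8xx10^15} / { 6 xx 24^2} \ ~~ \ 2.3 xx 10^12\ "hours"`
• or … failures occur, on average, about 260 million years apart
Assuming an MTTR of about 2 months (1428 hours):
• `{8xx10^15} / { 6 xx 1428^2} \ ~~ \ 6.5 xx 10^8\ "hours"`
• or … failures occur, on average, about 75 thousand years apart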
Note …
The analysis above applies to any form of redundancy:
• mirrored disks
• hot spare
Google use three replicas of all user data:
• hopefully in storage systems with independent failures
Data Centres — 1
Assume:
• a service provider runs `10^6` (one million) servers
• the MTTF for a server is `17520` hours (two years)
What is the MTTF for any server?
`1.0512` minutes!
Data Centres — 2
Assume:
• a service provider runs `10^6` (one million) servers
• the MTTF for a server is `35040` hours (four years)
What is the MTTF for any server?
`2.1024` minutes!
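Both data-centre figures come from the same rule: the fleet's failure rate is the sum of a million identical server rates. Here is a minimal Python sketch, with an illustrative function name, assuming independent and exponentially distributed server failures.

```python
def minutes_between_failures(server_mttf_hours, n_servers):
    """Mean time, in minutes, until some server in the fleet fails."""
    fleet_rate_per_hour = n_servers / server_mttf_hours   # rates add
    return 60.0 / fleet_rate_per_hour

for mttf_hours in (17_520, 35_040):   # two-year and four-year server MTTF
    print(mttf_hours, minutes_between_failures(mttf_hours, 10**6))
```

Doubling the quality of each server only doubles the gap between failures, which is why large fleets have to be designed to tolerate failure rather than to avoid it.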
So…
If you’re running large numbers of machines:
• expect failures
they’re normal and common
regardless of the quality of your hardware
Overall — Potential Problems?
Assumptions:
• failures may not be independent
• failures may not be exponentially distributed
Note
Note to self: the bathtub curve.
Last year’s exam…
An ax consists of a handle and a blade:
• Assume that the MTTF of a handle is 100 hours and the MTTF of a blade is 250
hours.
Further assume that failures are exponentially distributed.
• Calculate the MTTF of an ax as a whole.
If the MTTR of an ax is one hour:
• Calculate the (system) availability of such axes.
Express your answer as a percentage.
Done