CA226: Advanced Computer Architecture
Stephen Blott

Hardware Failures …
But first:
• we need to talk a little bit about probability

Note to self…
1. The Monty Hall problem [http://en.wikipedia.org/wiki/Monty_Hall_problem]
2. The two-children problem [http://www.maa.org/external_archive/devlin/devlin_04_10.html]
3. The terrorist problem [http://en.wikipedia.org/wiki/Base_rate_fallacy]
4. The geometric distribution [http://en.wikipedia.org/wiki/Geometric_distribution]
Note: Of these four topics, only the last is directly relevant to today’s material.

The Exponential Distribution
The exponential distribution:
• `f_lambda(x)\ =\ lambda e^{-lambda x}`
In which:
• `lambda` is known as the rate
• `e` is the base of natural logarithms, about `2.71828...`
For Poisson processes, `f_lambda(x)` is:
• the probability density of the time `x` that elapses until the next event of interest occurs
• e.g. the density, at 10,000 hours, of the time until a disk fails

Exponential Distribution: Properties
Mean:
• `1/lambda`
Note: The mean is the reciprocal of the rate (which is intuitively correct: the higher the rate, the shorter the wait). So, it is easy to convert between means and rates (and we’ll be doing a fair amount of that).
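A minimal Python sketch of the density and its mean may help here. It numerically integrates `x f_lambda(x)` to check that the mean comes out as `1/lambda`. The function names, the midpoint-rule integration and the cut-off at 40 means are illustrative assumptions, not part of the course material; the 10,000-hour mean simply reuses the figure from the disk example.

```python
import math


def exponential_pdf(x: float, rate: float) -> float:
    """Density f_lambda(x) = lambda * e^(-lambda * x), for x >= 0."""
    return rate * math.exp(-rate * x)


def numeric_mean(rate: float, steps: int = 200_000) -> float:
    """Approximate the mean (the integral of x * f(x)) with the midpoint rule."""
    upper = 40.0 / rate  # assumption: the tail beyond 40 means is negligible
    width = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * width  # midpoint of the i-th strip
        total += x * exponential_pdf(x, rate) * width
    return total


mean_hours = 10_000.0    # illustrative mean time to the next event
rate = 1.0 / mean_hours  # converting a mean to a rate

print(numeric_mean(rate))  # very close to 10000
print(1.0 / rate)          # converting the rate back to a mean
```

The two printed values should agree closely, which is all the conversion between means and rates relies on.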
Exponential Distribution: Properties
Probability of exceeding a value:
• `P(x>t)\ =\ e^{-lambda t}`

Multiple Exponential Distributions
Assume two (independent) exponentially-distributed random variables `X_1` and `X_2` with rates `lambda_1` and `lambda_2`:
• `P(min(X_1,X_2) > t)\ =\ e^{-(lambda_1 + lambda_2)t}`
• this follows from independence: `P(min(X_1,X_2) > t)\ =\ P(X_1 > t) P(X_2 > t)\ =\ e^{-lambda_1 t} e^{-lambda_2 t}`
So:
• the smallest (or first) of two exponentially-distributed random events is itself exponentially distributed
• and the corresponding rate is just the sum of the individual rates
Recall that `P(x>t)\ =\ e^{-lambda t}`.

Example: A Shopkeeper
Events:
1. mean time until a customer arrives is 5 minutes
2. mean time until the phone rings is 20 minutes
Assuming these are independent and exponentially distributed:
• what is the mean time until either of these events occurs?

Answer
Rates:
1. `\ 1/{5\ "minutes"}`
2. `\ 1/{20\ "minutes"}`
Combined rate:
• `4/20 + 1/20\ =\ 5/20\ =\ 1/{4\ "minutes"}`
Mean time to the first of these events:
• 4 minutes

Multiple Exponential Distributions
More generally (and obviously):
• `P(min(X_1,X_2,...,X_n) > t)\ =\ e^{-(lambda_1 + lambda_2 + ... + lambda_n)t}`
Again, recalling that `P(x>t)\ =\ e^{-lambda t}`.

Aside
The exponential distribution (and its discrete version, the geometric distribution):
• are the only memoryless probability distributions (continuous and discrete, respectively)
Because exponential distributions are entirely characterised by their mean:
• they are often defined by just stating their mean (or their half life)

Why is this relevant?
Many computer hardware failures are exponentially distributed:
• and, for those that aren’t, the exponential distribution is nevertheless a reasonable first approximation

And…
Given:
• `P(min(X_1,X_2,...,X_n) > t)\ =\ e^{-(lambda_1 + lambda_2 + ... + lambda_n)t}`
we can reason about failure rates of complex (multi-component) systems without knowing too much about the details of the exponential distribution itself

Failures
A system is in one of two states:
• functioning or not functioning
Transitions between these states are:
• failures and restorations

Metrics: MTTF
MTTF:
• mean time to failure
Examples:
1. MTTF is (perhaps) 1,000,000 hours for some hard disk
2. MTTF is (perhaps) 100,000 hours for a fan

Metrics: Failure Rate
The failure rate:
• is just the reciprocal of the MTTF
Examples:
1. if MTTF is `10^6` hours, then failure rate is `1//10^6` per hour
2. if MTTF is `10^5` hours, then failure rate is `1//10^5` per hour

MTTF: Example
What is the MTTF of a two-component system composed of:
1. a hard disk with an MTTF of 1,000,000 hours
2. and a fan with an MTTF of 100,000 hours?
If we assume failures are independent and exponentially distributed, then:
• The means are `10^{6}` and `10^{5}` hours
• So the rates are `10^{-6}` and `10^{-5}` per hour
• Since they’re exponentially distributed, we add these to get the overall rate: `10^{-6} + 10^{-5} = 1.1 times 10^{-5}` per hour
• Giving us the MTTF as the reciprocal of the rate: `1/{1.1 times 10^{-5}} = 90909` hours (approx.)

Metrics: Failure Rate
The failure rate:
• is often measured in failures per `10^9` (billion) hours (this is known as FIT, for failures in time)
1 FIT is one failure every 114,155 years (approx.).
Examples (from previous slides):
1. for the disk, rate of `10^9//10^6\ =\ 1000\ "FIT"`
2. for the fan, rate of `10^9//10^5\ =\ 10000\ "FIT"`

Restorations
When a system fails, it must be repaired

Metrics: MTTR
MTTR:
• mean time to repair
Examples, if a power unit fails:
• it may take 24 hours (say) for it to be replaced
• or perhaps 168 hours (one week)

Metrics: MTBF
MTBF:
• mean time between failures
• `"MTTF" + "MTTR"`

Metrics: Availability
Availability:
• the proportion of time during which service is satisfactorily delivered
• `"MTTF" / {"MTTF" + "MTTR"}`
Availability is usually quoted as a percentage.

Availability: Example
If:
• MTTF is `10^5` hours
• MTTR is 168 hours
Then:
• availability is `10^5/{10^5+168}\ =\ 99.83%`

Systems
Computer systems consist of a number of components:
• e.g. processor, memory, bus, disk, fan, etc.
If components individually have exponentially distributed lifetimes (and fail independently):
• then so too does the system as a whole, where the system fails as soon as any one component fails

Example
Assume a disk subsystem with the following components:
• 10 disks, each rated at `1 xx 10^6` MTTF (hours)
• 1 ATA controller, `5 xx 10^5` MTTF
• 1 power supply, `2 xx 10^5` MTTF
• 1 fan, `2 xx 10^5` MTTF
• 1 ATA cable, `1 xx 10^6` MTTF
Assuming exponentially-distributed lifetimes and independent failures:
• calculate the MTTF of the disk subsystem as a whole

Example: Failure rate of system …
Failure rate of system:
`\ 10 xx 1/{1 xx 10^6}\ +\ 1/{5 xx 10^5}\ +\ 2 xx 1/{2 xx 10^5}\ +\ 1/{1 xx 10^6}`
`\ \ = \ {10 + 2 + 10 + 1} / {10^6}`
`\ \ = \ 23 / {10^6\ "hours"}`
Or:
• `23 / {10^6\ "hours"} xx 10^9 = 23000\ "FIT"`

Example: MTTF of the system as a whole?
The MTTF of the system as a whole is the inverse of the failure rate:
`\ \ 1 / "failure rate of system"`
`\ \ =\ {10^6\ "hours"} / 23`
`\ \ =\ 43500\ "hours"` (approx.; just under five years)

Example: Availability of system as a whole …
Assume any failed component will be repaired in 24 hours.
Availability:
`\ \ "MTTF" / {"MTTF" + "MTTR"}`
`\ \ =\ 43500 / { 43500 + 24 }\ =\ 99.945%`

Example: Availability of system as a whole …
Alternatively, assume any failed component will be repaired in 168 hours (one week).
Availability:
`\ \ "MTTF" / {"MTTF" + "MTTR"}`
`\ \ =\ 43500 / { 43500 + 168 }\ =\ 99.615%`

Key Points
This calculation is possible (and simple) because:
• of the assumptions of exponential distributions and independent failures
• of the simplicity of combining exponential distributions
• of the mean and the rate being merely the reciprocals of one another

Improving Reliability
One common approach to improving reliability is:
• redundancy

Improving Reliability: Example
What would be the MTTF of a power supply and its availability if:
• the MTTF of an individual power supply is `2xx10^5` hours
• we add an additional (redundant) power supply, and
• the MTTR for the power supply unit is 24 hours?

Well, let’s see …
Mean time to individual power supply failure:
• `"MTTF"_{"individual"}\ =\ 2xx10^5\ "hours"`
Mean time to any power supply failure:
• `"MTTF"_{"any"}\ =\ {"MTTF"_{"individual"}}/2`

And …
Probability of second failure before first failure is repaired (approximately, since the MTTR is much smaller than the MTTF):
• `{"MTTR"_{"individual"}}/{"MTTF"_{"individual"}}`

And …
MTTF of power supply pair:
• `"MTTF"_{"any"} times 1/{"probability of second failure before repair"}`
• `{"MTTF"_{"any"}}/{"probability of second failure before repair"}`
• `{{"MTTF"_{"individual"}}/2} / {({"MTTR"_{"individual"}}/{"MTTF"_{"individual"}})}`
• `{"MTTF"_"individual"^2} / {2xx"MTTR"_{"individual"}}`

So, let’s try that out …
Using the values from the previous example:
• `{"MTTF"_"individual"^2} / {2xx"MTTR"_{"individual"}}`
• `{ (2xx10^5)^2} / {2xx24} \ =\ 83xx10^7` hours (approx.)
So the MTTF of the pair
is about 4167 times that of a single power supply:
• redundancy works!

Now …
What would be the effect of adding a third power supply?

Third Power Supply …
Assuming an MTTR of 24 hours:
• `{8xx10^15} / { 9 xx 24} \ = \ 37 xx 10^12\ "hours"`
• or … failures occur, on average, about 4 billion years apart
Assuming an MTTR of about 2 months:
• `{8xx10^15} / { 9 xx 1428} \ = \ 622 xx 10^9\ "hours"`
• or … failures occur, on average, about 71 million years apart

Note …
The analysis above applies to any form of redundancy:
• mirrored disks
• hot spare
Google use three replicas of all user data:
• hopefully in storage systems with independent failures

Data Centres (1)
Assume:
• a service provider runs `10^6` (one million) servers
• the MTTF for a server is `17520` hours (two years)
What is the MTTF for any server (that is, the mean time until some server fails)?
`1.0512` minutes!

Data Centres (2)
Assume:
• a service provider runs `10^6` (one million) servers
• the MTTF for a server is `35040` hours (four years)
What is the MTTF for any server?
`2.1024` minutes!

So…
If you’re running large numbers of machines:
• expect failures: they’re normal and common, regardless of the quality of your hardware

Overall: Potential Problems?
Assumptions:
• failures may not be independent
• failures may not be exponentially distributed
Note: Note to self: the bathtub curve.

Last year’s exam…
An ax consists of a handle and a blade:
• Assume that the MTTF of a handle is 100 hours and the MTTF of a blade is 250 hours. Further assume that failures are independent and exponentially distributed.
• Calculate the MTTF of an ax as a whole.
If the MTTR of an ax is one hour:
• Calculate the (system) availability of such axes.
Express your answer as a percentage.

Done
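As a recap, the calculations worked through in these slides (add the failure rates, invert to get the MTTF, then apply `"MTTF" / {"MTTF" + "MTTR"}`) can be sketched in Python. This is only a sketch under the same assumptions as the slides: independent, exponentially distributed failures. The function names are hypothetical, chosen here for illustration, and the numbers are those of the disk subsystem and power supply examples.

```python
def series_mttf(mttfs):
    """MTTF of a system that fails as soon as any one component fails.

    Assumes independent, exponentially distributed lifetimes, so the
    component failure rates (1 / MTTF) simply add.
    """
    return 1.0 / sum(1.0 / m for m in mttfs)


def fit(mttf_hours):
    """Failure rate expressed in FIT (failures per 10^9 hours)."""
    return 1e9 / mttf_hours


def availability(mttf, mttr):
    """Proportion of time the system is functioning."""
    return mttf / (mttf + mttr)


def redundant_pair_mttf(mttf, mttr):
    """The slides' approximation for a redundant pair: MTTF^2 / (2 * MTTR)."""
    return mttf ** 2 / (2.0 * mttr)


# Disk subsystem: 10 disks, ATA controller, power supply, fan, ATA cable.
subsystem = [1e6] * 10 + [5e5, 2e5, 2e5, 1e6]
mttf = series_mttf(subsystem)

print(round(mttf))                            # about 43478 hours
print(round(fit(mttf)))                       # about 23000 FIT
print(f"{availability(mttf, 24):.3%}")        # repair within 24 hours
print(f"{availability(mttf, 168):.3%}")       # repair within one week
print(f"{redundant_pair_mttf(2e5, 24):.3g}")  # hours, for the power supply pair
```

The sketch keeps the unrounded MTTF (about 43478 hours) rather than the rounded 43500 used on the slides; the availabilities agree to the three decimal places quoted.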