BSc/HND IETM Week 8 - Some Probability Distributions

The aim

When we looked at the histogram a few weeks ago, we were looking at frequency distributions. These showed how occurrences of a particular value (or a particular range of values) of some variable (say x) are distributed across the total range of values which x can adopt. It is equally possible (and very easy!) to convert such frequency distributions into probability distributions, such that the probability of encountering some particular value (or range of values) of x is plotted on the vertical axis, rather than the number of occurrences of that value of x. There are a few standard forms of such distributions, which make analysis rather easy - so long as the data really do fit the chosen form. We shall look at two of these standard forms, the normal and the negative exponential distributions.

Probability distributions from frequency distributions

Say that our previously-mentioned (and, sadly, hypothetical) optional unit for your course, ‘Flower Arranging for Engineers’, becomes extremely popular. In fact, it becomes so popular that it is studied by 208 students, from all the various BSc courses in the School. In an effort to analyse the performance of the students, so as to determine if any improvements to the unit are required, we might decide to plot a histogram of the final marks obtained. As we know, this is a frequency distribution, and might be obtained from the following summary of the students’ scores, as shown:

    Mark Scored (%)   Frequency (No. of students)
    0-9.9               1
    10-19.9             4
    20-29.9             8
    30-39.9            17
    40-49.9            47
    50-59.9            53
    60-69.9            39
    70-79.9            25
    80-89.9            11
    90-100              3

[Figure: histogram of mark (per cent, horizontal axis, 0 to 100) against frequency (no. of students, vertical axis), with bar heights 1, 4, 8, 17, 47, 53, 39, 25, 11 and 3.]

Frequency polygons

The first step in the conversion is to change from the histogram to what is called a frequency polygon.
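As a quick sketch of this construction in code (my own illustration, not part of the original notes), the polygon vertices are simply the mid-points of the marks classes in the table above, paired with their frequencies, with the line dropping to zero at both ends of the range:

```python
# Frequency polygon vertices for the marks histogram (illustrative sketch):
# each vertex sits at the mid-point of its class interval.
bounds = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50),
          (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
freqs = [1, 4, 8, 17, 47, 53, 39, 25, 11, 3]

midpoints = [(lo + hi) / 2 for lo, hi in bounds]   # 5.0, 15.0, ..., 95.0

# The polygon joins (midpoint, frequency) pairs, reaching the zero axis
# at 0 and 100 per cent, since no mark can fall outside that range.
polygon = [(0, 0)] + list(zip(midpoints, freqs)) + [(100, 0)]
print(polygon[:3])   # [(0, 0), (5.0, 1), (15.0, 4)]
```

Plotting these points joined by straight lines gives the polygon described below.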
This is simply a line graph, joining the centres of each of the chosen data intervals. At the ends, our frequency polygon reaches the zero axis as shown, since no student can obtain less than zero or more than 100 per cent. In situations where this doesn’t apply, it is conventional to terminate the polygon on the zero axis, half way through the next interval.

c:\ken\lects\ietm8.doc 1 5/6/2017

[Figure: frequency polygon joining the mid-points of the histogram intervals, with vertices at frequencies 1, 4, 8, 17, 47, 53, 39, 25, 11 and 3 against mark (per cent, 0 to 100).]

Over to you: Sketch such a frequency polygon for the histogram of ages of the population, which is in the week 5 notes.

Where the histogram has unequal intervals, the procedure is as shown below (the histogram bars and the polygon lines are not normally shown on the same plot). For bars of normal width (e.g. the first three bars in the example below), simply join up the centre values as above. For other bars, the frequency polygon must be drawn so that the line passes through the mid-point of the exposed side of the histogram bar (points ‘A’, ‘B’ and ‘C’ below), so that the shaded areas ‘gained’ and ‘lost’ automatically cancel out, and the total area under the plot stays the same (it is the area of the histogram bars which matters).

[Figure: histogram with unequal interval widths; the first three bars are of normal width, and the polygon passes through points ‘A’, ‘B’ and ‘C’ on the exposed sides of the wider bars.]

Probability distributions

It is very easy to obtain probability distributions from diagrams such as those above. All that is necessary is to divide each frequency by the total number of (in this case) students, to obtain the probability of any individual student, selected at random, obtaining a mark in a particular range. For example, to convert the histogram on page 1, or the frequency polygon on page 2, into probability distributions, simply divide every number on the vertical axis (and therefore also the numbers written on the plots) by 208. Thus, the vertical axes would now be calibrated in probabilities from zero to 53/208 = 0.255.
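The division described above can be sketched in a few lines of Python (my own illustration, using the marks table from page 1):

```python
# Converting the frequency distribution into a probability distribution:
# divide each frequency by the total number of students.
freqs = [1, 4, 8, 17, 47, 53, 39, 25, 11, 3]
total = sum(freqs)                      # 208 students in all

probs = [f / total for f in freqs]

print(round(max(probs), 3))             # 0.255  (= 53/208, the tallest bar)
print(round(sum(probs), 6))             # 1.0 - the probabilities sum to one
```

The final line anticipates the ‘Over to you’ exercise below: the individual probabilities always sum to 1.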
The probability of any given student obtaining a mark in the range 40 to 49.9 per cent will be 47/208 = 0.226. The probability of a student scoring 90 per cent or more will be 3/208 = 0.0144, etc.

Over to you: Calculate the remaining probabilities for the various mark ranges, and add them up - what do you discover?

The normal distribution

It is not very surprising that the marks distribution (frequency or probability) looks like the diagrams above. In a fair examination, taken by a large number of students, we would expect that only a few students would obtain either abysmally low marks or astronomically high marks. We would expect the majority of marks to be ‘somewhere in the middle’, with a ‘tail’ at both the low and the high ends of the range. This is what we see above.

Several real-life situations fit this general form of distribution, where it is most likely that results will be clustered around the centre of some range, with outlying values tailing off towards the ends of the range. Wisniewski, in his ‘Foundation’ text, uses an example based on the distributions of the weights of breakfast cereal packed by machines into boxes. There should ideally always be the stated amount in a box but, inevitably, some boxes will be lighter, and some heavier. There will be the odd ‘rogue’ boxes a long way from the mean.

To make it easier to cope with such situations, they are often assumed to fit a standardised probability distribution, called the normal distribution. By doing this, it is possible to use standard printed tables to make predictions such as (for example): how many students would be expected to score less than 40 per cent? To allow standard tables to be used, we need to assume a certain fixed shape of probability distribution, and we also need to define it in terms of mean and standard deviation. We cannot define it in terms of actual data values (e.g.
examination marks, or weight of cereal in a box), otherwise we would need a different set of tables for every new problem.

The normal distribution curve is actually defined by a rather unpleasant formula (but we don’t need to use it, as we are going to use tables which have been derived from it by someone else). If the variable in which we are interested is x (e.g. a mark in per cent, or the weight of cereal in a box in kg), the mean value of x is x̄ and the standard deviation of the data set is σx, then:

    P(x) = [1 / (σx √(2π))] e^( −(x − x̄)² / (2σx²) )

The resulting plot of P(x) as x varies is a ‘bell-shaped’ curve, as shown below.

Notes:

1) Firstly, notice that the horizontal axis has been plotted in terms of a normalised variable z, and not in units of x. This follows from our desire to get a distribution whose results are independent of the data units. The normalisation is very easy to do. Firstly, we subtract the mean value x̄ from all the x values we might have plotted along the horizontal axis. This has the effect of replacing the mean value with zero, and therefore of shifting the vertical axis to that value, as shown. Next, we divide all the resulting values by the standard deviation of the data set, so that the horizontal axis actually becomes calibrated in ‘standard deviations’ either side of the mean. So, we see that z = (x − x̄) / σx.

[Figure: the normal distribution curve, P(x) from 0 to 0.4 on the vertical axis, plotted against z from −4 to +4 on the horizontal axis; z = 0 corresponds to the mean of x, and z is the number of standard deviations of x from its mean value.]

2) The curve above is therefore effectively plotted for a data set with a mean value of x̄ = 0 and a standard deviation of σx = 1. If you put these values into the nasty equation of the normal distribution, together with the particular value x = 0 (the mean), you will find that P(0) = 0.399, as shown at the centre of the curve.
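The ‘unpleasant formula’ is easy to check numerically. A minimal sketch (my own illustration, not part of the notes) that evaluates the curve and reproduces P(0) = 0.399 for the standardised case:

```python
import math

def normal_pdf(x, mean, sd):
    """The normal distribution curve P(x) for a data set with the
    given mean and standard deviation."""
    return (1 / (sd * math.sqrt(2 * math.pi))) * \
           math.exp(-(x - mean) ** 2 / (2 * sd ** 2))

# Standardised curve: mean 0, standard deviation 1, evaluated at the mean.
print(round(normal_pdf(0, 0, 1), 3))   # 0.399, as at the centre of the curve

# Far from the mean, the exponential term drives P(x) towards zero.
print(normal_pdf(10, 0, 1) < 1e-20)    # True
```

The second print confirms Note (2)'s observation that the curve tails off towards zero far from the mean.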
If you make x very large (either positively or negatively) with respect to the mean value, then the exponential term tends to zero, giving P(x very much greater or smaller than the mean) ≈ 0, also agreeing with the plot and therefore tending to confirm its correctness.

3) If you estimate the area under the curve (by crudely ‘counting squares’, for example) you will find that it is equal to 1. Therefore, the probability that a value of x will fall somewhere under the curve is 1. Actually, the area shown in the diagram (and hence the probability) is very slightly less than 1, because the plot continues in both directions, getting closer and closer to the horizontal axis all the time.

Estimating probabilities from the normal distribution curve

Point (3) above tells us how we can use this curve. If we know the mean and standard deviation of our x values (our data), we can ask questions about various areas of interest under the normal distribution curve. These will be the probabilities we require, and can be looked up from tables. Any textbook statistics section dealing with the normal distribution will contain such tables. An approximate version calculated by EXCEL is given below.

Example

Say that a large set of examination results has a mean of 55 per cent, and a standard deviation of 15 per cent. How many students would we expect to fail the examination (if we define a failure as obtaining less than 40 per cent), and how many students would we expect to get a first-class result (defined as obtaining 70 per cent or more)?

Firstly, we must remember that we are assuming that our data are normally-distributed. If they are not, then the results will be approximate only. One indication that a data set might at least stand a chance of being normally-distributed is if the mean and median are the same value (i.e. there should be zero skew, as the normal distribution is symmetrical).
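The zero-skew check just mentioned is easy to automate. A small sketch (my own illustration, using made-up marks, not data from the notes) comparing mean and median:

```python
import statistics

# A roughly symmetric set of marks clusters around the middle: mean == median.
symmetric = [40, 45, 50, 55, 55, 60, 65, 70]
print(statistics.mean(symmetric), statistics.median(symmetric))   # both 55

# A skewed set pulls the mean away from the median, warning us that a
# normal-distribution assumption may be poor.
skewed = [40, 42, 44, 46, 48, 50, 90]
print(statistics.mean(skewed) > statistics.median(skewed))        # True
```

Equal mean and median does not prove normality, of course, but a large difference between them is a quick warning sign.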
    SD from mean   Area    |  SD from mean   Area
    2.00           0.0227  |  0.95           0.1710
    1.95           0.0256  |  0.90           0.1840
    1.90           0.0287  |  0.85           0.1976
    1.85           0.0321  |  0.80           0.2118
    1.80           0.0359  |  0.75           0.2266
    1.75           0.0400  |  0.70           0.2419
    1.70           0.0445  |  0.65           0.2578
    1.65           0.0495  |  0.60           0.2742
    1.60           0.0548  |  0.55           0.2911
    1.55           0.0606  |  0.50           0.3085
    1.50           0.0668  |  0.45           0.3263
    1.45           0.0735  |  0.40           0.3446
    1.40           0.0807  |  0.35           0.3632
    1.35           0.0885  |  0.30           0.3821
    1.30           0.0968  |  0.25           0.4013
    1.25           0.1056  |  0.20           0.4207
    1.20           0.1150  |  0.15           0.4404
    1.15           0.1250  |  0.10           0.4602
    1.10           0.1356  |  0.05           0.4801
    1.05           0.1468  |  0.00           0.5000
    1.00           0.1586  |

Assuming a normally-distributed result, the bell-shaped curve, with our values of interest, appears below. Since the horizontal axis is calibrated in terms of z = ‘standard deviations’ from the mean, these are worked out using the mean and standard deviation in the formula in Note (1) above, as follows:

    z40 = (40 − 55)/15 = −1 (in other words, 40 per cent is 1 standard deviation below the mean)
    z70 = (70 − 55)/15 = +1 (in other words, 70 per cent is 1 standard deviation above the mean)

The probability tables actually give us the shaded areas of the plot directly, so all we need do is to look them up. For the shaded area on the right, the z value is 1.00. From the tables, the corresponding probability is 0.1587. This means that we might expect 15.87 per cent of students to get ‘firsts’.

The area at the other end of the curve looks to be a problem, as z = −1, but there are no negative values for z in the table. However, because the curve is symmetrical, we only need positive values. The probability of a student falling into the area to the left of z = −1 is identical to that of him or her falling into the area to the right of z = +1. The probability of ‘failure’ is therefore also 0.1587.
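The table values can be reproduced without looking anything up, using the standard normal tail area. A sketch of the example above (using Python’s math.erf, my choice of tool rather than the notes’ EXCEL table):

```python
import math

def tail_area(z):
    """Area under the standard normal curve to the right of z
    (the 'Area' column of the table)."""
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))

mean, sd = 55, 15
z70 = (70 - mean) / sd                  # +1: one SD above the mean
z40 = (40 - mean) / sd                  # -1: one SD below the mean

print(round(tail_area(z70), 4))         # 0.1587: probability of a 'first'
# By symmetry, the failure probability equals the right-hand tail at +1.
print(round(tail_area(-z40), 4))        # 0.1587: probability of failure
```

This also shows why the printed table needs no negative z values: symmetry makes the left-hand tail at −z equal to the right-hand tail at +z.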
[Figure: the bell-shaped curve with both tails shaded. The right-hand shaded area, to the right of z for x = 70 per cent (z = +1), is the probability that a student gets a ‘first’; the left-hand shaded area, to the left of z for x = 40 per cent (z = −1), is the probability that a student fails. The point z = 0 corresponds to x = 55 per cent, the mean.]

Note that we could also find the probability that a student will score less than 70 per cent (say). The tables give us the area to the right of 70 per cent (z = 1, remember) as 0.1587. The total area under the curve is 1, so the area to the left of z = 1 is 1 − 0.1587 = 0.8413 (so we would expect 84.13 per cent of students not to get a first-class result).

Over to you: Assuming the examination results data given on page 1 to be normally-distributed, what is the probability of a student obtaining a first-class result? What is the probability of a student obtaining more than 40 per cent? Discuss the differences between the page 1 data set and a normal distribution.

NOTE that this data set has a different mean and standard deviation from that just used in our example, so you’ll need to calculate them. We have not yet discovered how to do this for data grouped into classes, rather than a simple list of data points. Here’s how.

From the frequency distribution plots on pages 1 or 2, call the frequency values f1, f2, ..., fn, where n is the number of classes (or ‘bins’) chosen for the plot. So, in our case, n = 10, f1 = 1, f2 = 4, f3 = 8, etc. Next, work out the mid-points of the data classes (‘bins’) used, and call them m1, m2, ..., mn. So, in our case, these m values are 5 per cent, 15 per cent, 25 per cent, etc. The mean and standard deviation of the distributed data are then taken to be given by:

    x̄ = ( Σ fi mi ) / ( Σ fi )    and    σx = √[ ( Σ fi mi² ) / ( Σ fi ) − x̄² ]

(where each sum runs over i = 1 to n). A tabular approach will be easiest, as usual.

The negative exponential distribution

To cover a wider range of real-world situations, more ‘standardised’ probability distributions are required. The other one we shall briefly look at is the negative-exponential distribution.
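The grouped-data formulas just given translate directly into code. A sketch (my own illustration, applied to the page 1 marks table):

```python
import math

freqs = [1, 4, 8, 17, 47, 53, 39, 25, 11, 3]     # f1..f10 from page 1
mids  = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]  # class mid-points m1..m10

n_total = sum(freqs)                                       # Σf = 208
mean = sum(f * m for f, m in zip(freqs, mids)) / n_total   # Σfm / Σf
var  = sum(f * m * m for f, m in zip(freqs, mids)) / n_total - mean ** 2
sd   = math.sqrt(var)                                      # √(Σfm²/Σf − x̄²)

print(round(mean, 2))   # 55.38 per cent
print(round(sd, 2))     # 16.43 (rounding the mean to 55.38 first gives 16.45)
```

Working at full precision gives a slightly different standard deviation from hand calculation, because the hand method rounds the mean before squaring it.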
This is also sometimes called a ‘failure-rate’ curve, because it tends to describe how components fail with time (but it also has other uses, as we shall see).

If a certain quantity of components is manufactured and put into service, it is reasonable to assume that they will all eventually fail (maybe after many, many years). The probability of any one of the components failing during a given time period might well depend on how many components are left in service. In other words, with a large number of components, we might expect several to fail during a given time period; but with a much smaller number of components, we should expect fewer failures over the same time period. For example, if 1000 components are put into service, knowing something about their reliability, we might expect 10 to fail in the first three years. However, if only 5 components were put into service, we certainly would not expect 10 to fail in the first three years!

The example above is rather silly, but it suggests a better way of viewing such problems. Maybe we should expect a certain proportion of the components to fail over a given time period. For example, if 10 out of 1000 components fail in three years, that is 1 per cent. Perhaps we might therefore expect 1 per cent of 5 such components (i.e. most likely, none) to fail over three years too?

So, to formalise this kind of idea, we follow this reasoning:

Choose to measure time t in the best units for the problem (seconds, months, years, etc.). Technically, the unit chosen should be short compared with the expected lifetime of a component, so that any given component is expected to last for many time units.

Let λ be the failure rate, that is, the proportion of components expected to fail in one time unit. This means that λ must have ‘dimensions’ of (1/time). In the example above, we said that 1 per cent of components might fail in three years so, in that case, the failure rate λ = 0.01/3 (proportion per year).
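The failure-rate arithmetic above can be sketched in a couple of lines (my own illustration of the 1-per-cent-in-three-years example):

```python
# Failure rate λ from 'a proportion of 0.01 fails in 3 years':
lam = 0.01 / 3                 # per year - note the 'dimensions' of 1/time

# Expected number of failures among N components over a (short) time t:
N, t = 1000, 3                 # 1000 components, 3 years
expected_failures = N * lam * t
print(round(expected_failures, 6))   # 10.0, matching the 10-in-1000 example
```

The product λt is dimensionless, which is what allows it to be read as a probability of failure over the period.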
This can also be viewed as a probability - there is a probability of 0.01/3 that any given component will fail in a given period of one year. Therefore, to find the proportion of components expected to fail over a time t (measured in our chosen units), we need the quantity λt. This is now dimensionless - it is actually the probability that any given component will fail over the stated time period. Again, this period should technically be short compared with the expected lifetime of the component.

It follows that, if we start off with N components, then after the time t has passed, we will expect the actual number of failures to be Nλt. We can now state the rate of change of the number of components as follows (it is negative, because the number decreases as time passes):

    (change in the number of components) / (time period) = − (number of failures) / (time period) = − NλΔt / Δt = −λN

This is called a differential equation and would normally be written

    dN/dt = −λN

in which the quantity dN/dt is to be interpreted as ‘a very small change in N divided by the very small change in t over which it occurs’. The solution of this equation, to discover how the remaining number of components n might look over long periods of time, belongs to the branch of maths called calculus. It turns out that:

    n = N e^(−λt)

We can plot this negative exponential function as the following curve relating the remaining number of components n to time:

[Figure: plot of n = Ne^(−λt); the vertical axis is the multiple of the initial number of components N remaining (0 to 1), the horizontal axis is time in units (0 to 6).]

This plot effectively shows the number of components we expect to remain in service as time passes - just multiply the vertical scale by N (the initial number of components), and read off the plot the number expected to remain at any time of interest. To find the number of components which we expect to have failed at any time, subtract the plot value from N. Other things fit such a curve too. One example is radioactive decay.
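The n = Ne^(−λt) curve is easy to tabulate as code (a sketch with illustrative values of N and λ, not taken from the notes):

```python
import math

def remaining(N, lam, t):
    """Expected number of components still in service after time t,
    from n = N * e^(-λt)."""
    return N * math.exp(-lam * t)

N, lam = 1000, 0.5            # e.g. 1000 components, λ = 0.5 per time unit
for t in range(0, 7):
    print(t, round(remaining(N, lam, t)))   # decays towards zero, as in the plot

# Failures to date are found by subtracting from N:
print(N - round(remaining(N, lam, 2)))      # components failed by t = 2
```

Multiplying the curve value by N and subtracting from N reproduces the read-off-the-plot procedure described above.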
As a radioactive substance decays, it emits matter as radiation, thus reducing the amount of matter remaining. The remaining matter fits a curve similar to that above. Another distribution similar to the above is used in predicting the duration of conversations on a telephone network.

Over to you: 3000 light bulbs are put into service in a large office complex. After 200 hours of use, 500 have failed. How many bulbs might we expect to have failed after (i) 800 hours and (ii) 3000 hours?

Summary

In this session we have seen how probability distributions can be plotted, and used to make predictions from data sets. We looked at two of the standard probability distributions, the normal and negative exponential distributions, and saw how these can be used to forecast results in a standardised manner, so long as our data fit the chosen distribution.

Ken Dutton, November 1998
ND table by Bill Barraclough and EXCEL, November 2001

BSc IETM Week 8 - Probability Distributions - ‘solutions’

Page 2 ‘Over to you’: Histogram data for 1991 were plotted as shown below. The frequency polygon is added to the plot. “Non-standard width compensation” is used as shown at ‘A’, ‘B’, ‘C’ and ‘D’ (and I did them in that order). Termination at the ‘high’ end is questionable - what’s the maximum allowable age?!! It would have been unfortunate if the line through ‘C’ missed the ‘6997’ bar altogether - let’s hope nobody asks! (I guess the answer would be that some approximation of merging the first two bars would have to be made.)

[Figure: histogram of age in years (horizontal axis, 0 to 100) against frequency in millions of people (vertical axis), with the frequency polygon added; bar values include 3766, 6997, 12383, 11974, 9448, 7955 and 3748, and the width-compensation points are marked ‘A’, ‘B’, ‘C’ and ‘D’.]

Page 3 ‘Over to you’: Dividing 1, 4, 8, 17, 47, 53, 39, 25, 11, 3 by 208 gives:

0.00481, 0.0192, 0.0384, 0.0817, 0.2260, 0.2548, 0.1875, 0.1202, 0.0529, 0.0144

These sum to 1, give or take rounding error. This should always happen, so the area under such a probability distribution curve is always 1.
Page 4, Note (3): The area under the curve seems to be roughly 10 ‘squares’. The units are 1 by 0.1, so the area = 1.

Page 6 ‘Over to you’: The tabular approach, as recommended in the notes, is probably easiest for calculating the mean and S.D. Note the trick of getting fi·mi² by multiplying the already-obtained fi·mi values by mi, rather than doing the squaring and multiplying again. I did the values below the hard way (not using Excel), so there may be errors (space before and after in case you want to do an OHP just of the table - I shan’t bother):

    f       m     fm      fm²
    1       5       5        25
    4      15      60       900
    8      25     200      5000
    17     35     595     20825
    47     45    2115     95175
    53     55    2915    160325
    39     65    2535    164775
    25     75    1875    140625
    11     85     935     79475
    3      95     285     27075
    SUMS  208           11520    694200

So mean = 11520/208 = 55.38 per cent, and S.D. = √(694200/208 − 55.38²) = 16.45 per cent.

To find P(more than 40%), we need to normalise 40 by subtracting the mean and dividing by the S.D., to get −0.935. We can’t look this up in the table as it’s negative, but the same area under the curve would be obtained as 1 minus the value for +0.935. The required value is approx. 0.175 from the table, so the required answer is 1 − 0.175 = 0.825, equivalent to 82.5 per cent getting more than 40%.

The main difference from a normal distribution is likely to be the ‘step’ at 40 per cent, and the resulting positive skew, due to trying to pass people!!

Page 8 ‘Over to you’: 500 failures implies 500 = Nλt, so λ = 500/3000/200 = 8.333×10⁻⁴ (per hour). At 800 hr, the number working = 3000e^(−0.0008333×800) = 1540, so the number of failures = 1460. At 3000 hr, the number working = 3000e^(−0.0008333×3000) = 246, so the number of failures = 2754.

Ken
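The light-bulb working above can be checked in a few lines (a sketch following the notes’ approximation, which takes λ from the linear failure count Nλt rather than from the exponential curve itself):

```python
import math

N = 3000                       # bulbs put into service
lam = 500 / (N * 200)          # per hour, from 500 failures in 200 hours (Nλt)

def failed(t):
    """Expected number of failures by time t, using n = N * e^(-λt)."""
    return N - N * math.exp(-lam * t)

print(round(failed(800)))      # 1460 failures after 800 hours
print(round(failed(3000)))     # 2754 failures after 3000 hours
```

Note the approximation: 200 hours is not all that short compared with the bulbs’ lifetimes here, so taking λ = 500/(3000 × 200) slightly understates the true rate; the notes accept this for simplicity.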