MATH 2441 Probability and Statistics for Biological Sciences

The Normal Distribution

David W. Sabo (1999)

The so-called normal probability distribution (sometimes also called the Gaussian distribution, after the name of the mathematician who first studied it extensively) is a continuous probability distribution. It turns out to be the most important probability distribution in statistics because of some very fundamental mathematical results in probability theory.

We will use the symbol $x$ to represent a generic normal random variable. The probability density function contains two constants, denoted $\mu$ and $\sigma$, for reasons that will become clear shortly. The formula for the density function is

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)} \tag{ND-1} \]

The graph of this function has the familiar bell shape, centered on, and symmetric with respect to, the line $x = \mu$.

[Figure: bell-shaped density curve plotted against x, centered at x = μ]

The value of $\sigma$ determines the width of the bell. Small values of $\sigma$ give very tall narrow bells, whereas larger values of $\sigma$ give shorter flatter bells:

[Figure: three bell-shaped density curves, all centered at the same μ, with different widths]

All three of these "bells" represent a normal distribution with the same value of $\mu$, but different values of $\sigma$. You can see from formula (ND-1) that the height of the bell at the peak, where $x = \mu$, is inversely proportional to $\sigma$. This means that the lowest broadest bell in the figure has a value of $\sigma$ about three times as big as does the tallest narrowest bell. Although it may be difficult to gauge this precisely from the figure, the area under each of the three bells is the same (and equal to 1 when appropriate physical units of measurement are used).
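The two claims just made about formula (ND-1), that the peak height is inversely proportional to $\sigma$ and that the area under every bell is 1, can be checked numerically. The following is only a minimal Python sketch: the function names `normal_pdf` and `area_under_pdf`, the trapezoid rule, and the sample values of $\sigma$ are illustrative assumptions and are not part of the course material.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density (ND-1) of a normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def area_under_pdf(mu, sigma, half_width=10.0, steps=20000):
    """Trapezoid-rule estimate of the area between mu - half_width*sigma
    and mu + half_width*sigma (the tails beyond that are negligible)."""
    lo = mu - half_width * sigma
    hi = mu + half_width * sigma
    h = (hi - lo) / steps
    total = 0.5 * (normal_pdf(lo, mu, sigma) + normal_pdf(hi, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(lo + i * h, mu, sigma)
    return total * h


if __name__ == "__main__":
    mu = 0.0  # assumed common mean for the three bells
    for sigma in (1.0, 2.0, 3.0):  # assumed sample widths
        peak = normal_pdf(mu, mu, sigma)
        # peak * sigma is the same for every sigma, so the peak is proportional to 1/sigma
        print(f"sigma={sigma}: peak={peak:.4f}  peak*sigma={peak * sigma:.4f}  "
              f"area={area_under_pdf(mu, sigma):.6f}")
```

In this sketch the product `peak * sigma` comes out the same for each bell (it equals $1/\sqrt{2\pi}$), which is the inverse proportionality described above.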
As with all continuous random variables, probabilities for the normal random variable are calculated as areas underneath this probability density curve, and so, mathematically, the computation of probabilities involves the evaluation of a definite integral:

\[ \Pr(a \le x \le b) = \int_{a}^{b} f(x)\,dx = \frac{1}{\sigma\sqrt{2\pi}} \int_{a}^{b} e^{-(x-\mu)^2/(2\sigma^2)}\,dx \tag{ND-2} \]

Here, $a$, $b$, $\mu$, and $\sigma$ stand for specific numerical values that would be known before you could attempt to compute the probability. Note the following general properties:

(i.) the probability density curve is defined for $x$ running from $-\infty$ to $+\infty$, though when the value of $x$ gets quite a bit different from the value of $\mu$, the negative exponent becomes large enough that the value of the density function is negligible for practical purposes.

(ii.) the total area under this curve for $x$ running from $-\infty$ to $+\infty$ is 1, with most of this area occurring in the near vicinity of $x = \mu$.

(iii.) the probability density function is symmetric about the vertical line at $x = \mu$. This line divides the bell into two mirror-image halves, each with an area of 0.5.

Using formulas (RV-7a) and (RV-7b) from the preceding document, we find that the mean value for the normal distribution works out to be precisely the quantity denoted above as $\mu$, and the variance works out to be the quantity denoted above as $\sigma^2$, hence the use of these symbols in the defining formula (ND-1). Thus, the formula for a normal distribution probability density function (which is equivalent to the formula for the relative frequency distribution of a normally-distributed population) requires specification of the mean and variance of that distribution: the values of $\mu$ and $\sigma^2$ as numerical parameters.

Unfortunately, even when the numbers for $a$, $b$, $\mu$, and $\sigma$ are plugged into this formula, the definite integral is too difficult to calculate exactly by hand (though approximation formulas of more than adequate accuracy are known, and are simple enough to build into even relatively inexpensive hand-held electronic calculators).
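To make the remark about approximation concrete, here is a hedged Python sketch that evaluates the integral in (ND-2) in two ways: by composite Simpson's rule applied directly to the density, and through the error function `math.erf` from Python's standard library. The function names and the sample values of $a$, $b$, $\mu$, and $\sigma$ are assumptions chosen for illustration; this is not the approximation formula used inside any particular calculator.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density (ND-1)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


def prob_between_simpson(a, b, mu, sigma, n=1000):
    """Approximate the integral (ND-2) with composite Simpson's rule
    on n subintervals (n is bumped to the next even number if odd)."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma)
    for i in range(1, n):
        weight = 4 if i % 2 else 2  # Simpson weights alternate 4, 2, 4, ...
        total += weight * normal_pdf(a + i * h, mu, sigma)
    return total * h / 3.0


def prob_between_erf(a, b, mu, sigma):
    """The same probability expressed with the error function."""
    def cumulative(t):
        return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))
    return cumulative(b) - cumulative(a)


if __name__ == "__main__":
    # Assumed sample values: one standard deviation either side of the mean.
    a, b, mu, sigma = 8.0, 12.0, 10.0, 2.0
    print(f"Simpson: {prob_between_simpson(a, b, mu, sigma):.6f}")
    print(f"erf:     {prob_between_erf(a, b, mu, sigma):.6f}")
```

Both routes agree to many decimal places for these sample values, which is the sense in which the integral is easy for a machine even though it has no elementary closed form.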
Because this integral cannot be evaluated exactly by hand, statisticians have come to rely on printed tables of values for it. We need to simplify the situation somewhat first though. We need the normal probability tables to be able to handle any values of $a$, $b$, $\mu$, and $\sigma$ that may arise. In principle at least, $a$, $b$, and $\mu$ can have values anywhere between $-\infty$ and $+\infty$, and $\sigma$ can have values anywhere between 0 and $+\infty$.

Suppose we set up pages of the tabulation so that the value of 'a' varies down the rows, and the value of 'b' varies across columns. Thus one page of the table would give probabilities for each combination of values of $a$ and $b$ for a particular pair of values of $\mu$ and $\sigma$. (These would be pretty big pages because we'd need one row for each value of 'a' between $-\infty$ and $+\infty$, and one column for each value of 'b' between $-\infty$ and $+\infty$.) Now, for each value of $\mu$, we'd need a separate page for each possible value of $\sigma$ (between 0 and $+\infty$), forming a rather large book. Then we'd need one book like that for each possible value of $\mu$ between $-\infty$ and $+\infty$. So, at first look, it seems like we might be facing having to set up a library of an infinite number of books, each with an infinite number of pages, with each page having an infinite number of rows and columns, in order to tabulate any possible value of the definite integral above that may be required.

Fortunately, there are ways of reducing the required amount of information considerably. In fact, we'll now describe how to get it down to a single standard size page, and you won't even need a magnifying glass!

Simplification #1: The Standard Normal Probability Distribution

A very simple substitution in (ND-2) removes both $\mu$ and $\sigma$ from the integrand (reducing our requirement for a library of an infinite number of volumes, each with an infinite number of pages which are infinitely large, to just one infinitely large page! We'll whittle that page down next.) One of the main methods you used in MATH 1441 for calculating integrals was the method of substitution.
We will apply it here, by making a substitution for $x$ based on the formula:

\[ z = \frac{x-\mu}{\sigma} \quad\Longleftrightarrow\quad x = \mu + \sigma z \tag{ND-3} \]

You should recognize $z$ here as the so-called standard score associated with $x$. Substituting for $x$ in (ND-2), recognizing that $dx = \sigma\,dz$, and noting that

\[ x = a \;\Rightarrow\; z = \frac{a-\mu}{\sigma} \qquad\text{and}\qquad x = b \;\Rightarrow\; z = \frac{b-\mu}{\sigma}, \]

we get

\[ \Pr(a \le x \le b) = \frac{1}{\sigma\sqrt{2\pi}} \int_{a}^{b} e^{-(x-\mu)^2/(2\sigma^2)}\,dx = \frac{1}{\sqrt{2\pi}} \int_{(a-\mu)/\sigma}^{(b-\mu)/\sigma} e^{-z^2/2}\,dz = \Pr\!\left(\frac{a-\mu}{\sigma} \le z \le \frac{b-\mu}{\sigma}\right) \tag{ND-4} \]

When you compare the second integral in (ND-4) with the integral in (ND-2), you see that $z$ is itself a normally distributed random variable: it is the normally distributed random variable that has $\mu = 0$ and $\sigma = 1$. Condensing (ND-4) somewhat, we get:

\[ \Pr(a \le x \le b) = \Pr\!\left(\frac{a-\mu_x}{\sigma_x} \le z \le \frac{b-\mu_x}{\sigma_x}\right) \tag{ND-5} \]

We've put subscripts, $x$, on the $\mu$ and $\sigma$ in this formula to emphasize that these stand for the mean and standard deviation of the probability distribution of $x$.

Formula (ND-5) can be used to get any probabilities for any normally distributed random variable as long as we can calculate probabilities for that one special normally distributed random variable: $z$. Because probabilities of all normally-distributed random variables are ultimately related to probabilities of this variable $z$, it is called the standard normal random variable, and its probability distribution is called the standard normal probability distribution. The symbol 'z' will be reserved exclusively in this course to stand for the standard normal random variable. (Some more mathematically rigorous books on probability and statistics use the upper case symbol, Z, to represent the standard normal random variable, and the lower case symbol, z, to represent generic values of Z. The distinction between the name of the variable and values of the variable will usually be clear enough in this course that we don't have to complicate our notation to that extent.)

Notice that the standard normal random variable has properties analogous to those listed above for normally-distributed random variables in general:

(i.) the probability density curve is defined for all values of $z$ between $-\infty$ and $+\infty$, though in most areas of application, it is considered to have values which are zero for practical purposes for values of $z$ much greater than +3 or much less than -3 (this is important). The area under the entire curve is exactly 1.

(ii.) the vertical line $z = 0$ partitions the curve into two mirror-image halves, each having an area of exactly 0.5.

[Figure: standard normal density curve plotted against z, centered at z = 0]

Illustration:

Formula (ND-5) is so fundamental to nearly everything that follows in this course that it is absolutely essential you understand its meaning. As an initial example, suppose we define the random variable:

x = mass in grams of an apple of a particular variety

and suppose we have independent information that

-x is approximately normally distributed. (We say "approximately" because the normal distribution is a mathematical idealization of a common occurrence in nature. However, real populations of things in nature probably never exactly match this mathematical idealization, and even if they did by chance, we don't really have any way of proving that something matches a mathematical ideal exactly. In this course, we will often insert the word "approximately" as done here to remind ourselves of this limitation of applied mathematics. However, we will perform calculations using the exact normal probability distribution formulas.)
-x has a mean value, $\mu$, of 250 g
-x has a standard deviation, $\sigma$, of 25 g.

Then, formula (ND-5) says:

\[ \Pr(225 \le x \le 275) = \Pr\!\left(\frac{225-250}{25} \le z \le \frac{275-250}{25}\right) = \Pr(-1 \le z \le 1) \]

That is, the area under the probability density curve for $x$ between x = 225 g and x = 275 g is the same as the area under the standard normal probability density curve between z = -1 and z = +1. You can see this from actual graphs if we adjust the horizontal scales appropriately.
[Figure: density curve of x (horizontal axis marked at 200, 250, 300) shown alongside the standard normal density curve (horizontal axis marked at z = -2, 0, 2)]

Simplification #2: Using the Properties of the Standard Normal Distribution

It still looks like we need a table that gives probabilities of the form $\Pr(a \le z \le b)$ for all values of 'a' between $-\infty$ and $+\infty$ and all values of 'b' between $-\infty$ and $+\infty$. However, we can do three things to reduce the number of entries actually necessary in a useful table.

First, as already noted, the values of the probability density function are negligible for $z$ less than about -3 or -4, and for $z$ greater than about 3 or 4. Most reference books fix these two cutoffs at -3 and +3, respectively. So that you can see why this is adequate, the table provided in this course goes from z = -4.1 to z = +4.1. This restriction gets rid of part of the infinity of rows and columns it looked like we needed.

Secondly, we can make use of the symmetry of the distribution to express any probability, $\Pr(a \le z \le b)$, in terms of probabilities of the form $\Pr(0 \le z \le b)$. This procedure will be illustrated by several examples below. Notice that this gets rid of one of the two constants that can take on different values in the problem. In effect, this means our table only has to handle one value of 'a', namely a = 0. That means that we can use the entire sheet to display probabilities corresponding to different values of 'b'.

Finally, in practical terms, we don't really need $\Pr(0 \le z \le b)$ for every possible value of $b$ between 0 and, say, 4. As far as working with tables is concerned, most people are comfortable with being able to get probabilities for values of $b$ rounded to two decimal places. Thus, in a 10-column wide table, we need only 30 or 40 rows: we're down to the one page table promised earlier.

The resulting table is shown on the next page. It is known as a Standard Normal Probability Table or a z-table.
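The layout just described (first decimal of $b$ down the rows, second decimal across ten columns) can be reproduced in a few lines of code. The following Python sketch is offered only as an illustration of how such a table could be generated; it relies on the identity $\Pr(0 \le z \le b) = \tfrac{1}{2}\,\mathrm{erf}(b/\sqrt{2})$, and the function names `table_entry` and `print_z_table` are hypothetical. It is not a description of how the printed table in this document was produced.

```python
import math


def table_entry(b):
    """Pr(0 <= z <= b) for the standard normal variable, for b >= 0."""
    return 0.5 * math.erf(b / math.sqrt(2.0))


def print_z_table(last_row=4.1):
    """Print rows b = 0.0, 0.1, ..., last_row with columns for the
    second decimal place 0.00 through 0.09."""
    print("   b " + "".join(f"{c / 100.0:>9.2f}" for c in range(10)))
    n_rows = int(round(last_row * 10)) + 1
    for r in range(n_rows):
        base = r / 10.0
        cells = "".join(f"{table_entry(base + c / 100.0):>9.4f}" for c in range(10))
        print(f"{base:5.2f}{cells}")


if __name__ == "__main__":
    print_z_table()
```

For the largest values of $b$ a four-decimal format no longer distinguishes neighbouring entries, which is presumably why printed tables switch to five decimals near the bottom.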
Although there are other ways of preparing tables of standard normal probabilities, their principles of organization are similar enough to the table given here that if you master the use of the table we present, you should be able to adapt to other variations with little difficulty.

Determination of Standard Normal Probabilities

We will illustrate the use of the z-table with a number of examples. The overall strategy in each case is as follows:

(i.) make a rough sketch of the probability/area to be calculated. This is useful in organizing your thoughts.

(ii.) express the required area in terms of sums or differences of areas of regions of the form $0 \le z \le b$.

(iii.) look up the areas of the components in the z-table, and compute the required overall area.

The body of the table contains the values of $\Pr(0 \le z \le b)$. The first two figures of $b$ are listed down the left edge of the table, labeling the rows. The second decimal place of $b$ is given across the top of the table, labeling the columns.

Example 1: Find $\Pr(0 \le z \le 1.57)$.

Solution: This probability has the standard form, $\Pr(0 \le z \le b)$, with b = 1.57, so we can read its value straight from the table with no further computation. Locate the row labeled 1.50 on the left, and move across the row to the column labeled 0.07 at the top. The table entry there is 0.4418. Thus, we conclude that

\[ \Pr(0 \le z \le 1.57) \approx 0.4418. \]

[Sketch: area under the standard normal curve between z = 0 and z = 1.57]

Example 2: Find $\Pr(z \le 2.43)$.

Solution: The region of interest here is shown in the sketch. It is the area under the probability density curve starting at z = 2.43 on the right, and extending leftwards to $-\infty$. You see that this overall region can be regarded as the combination of two parts:

-the part to the left of z = 0, which is half the bell, and so has an area of 0.5
-the part between z = 0 and z = 2.43, which is $\Pr(0 \le z \le 2.43)$, of the form available from the table.

[Sketch: area under the standard normal curve to the left of z = 2.43]

Consulting the row labeled 2.40 and the column labeled 0.03 in the standard normal probability table, we find that $\Pr(0 \le z \le 2.43) \approx 0.4925$.
Thus, as a final result, we get

\[ \Pr(z \le 2.43) = 0.5 + \Pr(0 \le z \le 2.43) \approx 0.5 + 0.4925 = 0.9925. \]

Table of Standard Normal Distribution Probabilities Pr(0 < z < b) (MATH 2441/99)

   b     0.00     0.01     0.02     0.03     0.04     0.05     0.06     0.07     0.08     0.09
0.00   0.0000   0.0040   0.0080   0.0120   0.0160   0.0199   0.0239   0.0279   0.0319   0.0359
0.10   0.0398   0.0438   0.0478   0.0517   0.0557   0.0596   0.0636   0.0675   0.0714   0.0753
0.20   0.0793   0.0832   0.0871   0.0910   0.0948   0.0987   0.1026   0.1064   0.1103   0.1141
0.30   0.1179   0.1217   0.1255   0.1293   0.1331   0.1368   0.1406   0.1443   0.1480   0.1517
0.40   0.1554   0.1591   0.1628   0.1664   0.1700   0.1736   0.1772   0.1808   0.1844   0.1879
0.50   0.1915   0.1950   0.1985   0.2019   0.2054   0.2088   0.2123   0.2157   0.2190   0.2224
0.60   0.2257   0.2291   0.2324   0.2357   0.2389   0.2422   0.2454   0.2486   0.2517   0.2549
0.70   0.2580   0.2611   0.2642   0.2673   0.2704   0.2734   0.2764   0.2794   0.2823   0.2852
0.80   0.2881   0.2910   0.2939   0.2967   0.2995   0.3023   0.3051   0.3078   0.3106   0.3133
0.90   0.3159   0.3186   0.3212   0.3238   0.3264   0.3289   0.3315   0.3340   0.3365   0.3389
1.00   0.3413   0.3438   0.3461   0.3485   0.3508   0.3531   0.3554   0.3577   0.3599   0.3621
1.10   0.3643   0.3665   0.3686   0.3708   0.3729   0.3749   0.3770   0.3790   0.3810   0.3830
1.20   0.3849   0.3869   0.3888   0.3907   0.3925   0.3944   0.3962   0.3980   0.3997   0.4015
1.30   0.4032   0.4049   0.4066   0.4082   0.4099   0.4115   0.4131   0.4147   0.4162   0.4177
1.40   0.4192   0.4207   0.4222   0.4236   0.4251   0.4265   0.4279   0.4292   0.4306   0.4319
1.50   0.4332   0.4345   0.4357   0.4370   0.4382   0.4394   0.4406   0.4418   0.4429   0.4441
1.60   0.4452   0.4463   0.4474   0.4484   0.4495   0.4505   0.4515   0.4525   0.4535   0.4545
1.70   0.4554   0.4564   0.4573   0.4582   0.4591   0.4599   0.4608   0.4616   0.4625   0.4633
1.80   0.4641   0.4649   0.4656   0.4664   0.4671   0.4678   0.4686   0.4693   0.4699   0.4706
1.90   0.4713   0.4719   0.4726   0.4732   0.4738   0.4744   0.4750   0.4756   0.4761   0.4767
2.00   0.4772   0.4778   0.4783   0.4788   0.4793   0.4798   0.4803   0.4808   0.4812   0.4817
2.10   0.4821   0.4826   0.4830   0.4834   0.4838   0.4842   0.4846   0.4850   0.4854   0.4857
2.20   0.4861   0.4864   0.4868   0.4871   0.4875   0.4878   0.4881   0.4884   0.4887   0.4890
2.30   0.4893   0.4896   0.4898   0.4901   0.4904   0.4906   0.4909   0.4911   0.4913   0.4916
2.40   0.4918   0.4920   0.4922   0.4925   0.4927   0.4929   0.4931   0.4932   0.4934   0.4936
2.50   0.4938   0.4940   0.4941   0.4943   0.4945   0.4946   0.4948   0.4949   0.4951   0.4952
2.60   0.4953   0.4955   0.4956   0.4957   0.4959   0.4960   0.4961   0.4962   0.4963   0.4964
2.70   0.4965   0.4966   0.4967   0.4968   0.4969   0.4970   0.4971   0.4972   0.4973   0.4974
2.80   0.4974   0.4975   0.4976   0.4977   0.4977   0.4978   0.4979   0.4979   0.4980   0.4981
2.90   0.4981   0.4982   0.4982   0.4983   0.4984   0.4984   0.4985   0.4985   0.4986   0.4986
3.00   0.4987   0.4987   0.4987   0.4988   0.4988   0.4989   0.4989   0.4989   0.4990   0.4990
3.10   0.4990   0.4991   0.4991   0.4991   0.4992   0.4992   0.4992   0.4992   0.4993   0.4993
3.20   0.4993   0.4993   0.4994   0.4994   0.4994   0.4994   0.4994   0.4995   0.4995   0.4995
3.30   0.4995   0.4995   0.4995   0.4996   0.4996   0.4996   0.4996   0.4996   0.4996   0.4997
3.40  0.49966  0.49968  0.49969  0.49970  0.49971  0.49972  0.49973  0.49974  0.49975  0.49976
3.50  0.49977  0.49978  0.49978  0.49979  0.49980  0.49981  0.49981  0.49982  0.49983  0.49983
3.60  0.49984  0.49985  0.49985  0.49986  0.49986  0.49987  0.49987  0.49988  0.49988  0.49989
3.70  0.49989  0.49990  0.49990  0.49990  0.49991  0.49991  0.49992  0.49992  0.49992  0.49992
3.80  0.49993  0.49993  0.49993  0.49994  0.49994  0.49994  0.49994  0.49995  0.49995  0.49995
3.90  0.49995  0.49995  0.49996  0.49996  0.49996  0.49996  0.49996  0.49996  0.49997  0.49997
4.00  0.49997  0.49997  0.49997  0.49997  0.49997  0.49997  0.49998  0.49998  0.49998  0.49998
4.10  0.49998  0.49998  0.49998  0.49998  0.49998  0.49998  0.49998  0.49998  0.49999  0.49999

Example 3: Find $\Pr(z \ge -1.75)$.

Solution: From the sketch, we see that what is required is the area under the probability density curve from z = -1.75 rightwards to $z = +\infty$. This area can be split into two pieces:

-the region from z = 0 to $z = +\infty$, which is half the bell, and so represents an area of 0.5 units.
-the region from z = -1.75 to z = 0, which has area $\Pr(-1.75 \le z \le 0)$.

[Sketch: area under the standard normal curve to the right of z = -1.75]

Now, the standard normal probability table does not give probabilities of this sort directly. However, since the distribution is symmetric about the vertical line z = 0, we know that

\[ \Pr(-1.75 \le z \le 0) = \Pr(0 \le z \le 1.75) \]

But the probability on the right-hand side here is in the standard form of a table entry. From the row labeled 1.70 and the column labeled 0.05, we get that $\Pr(0 \le z \le 1.75) \approx 0.4599$. Thus, we conclude

\[ \Pr(z \ge -1.75) = 0.5 + \Pr(0 \le z \le 1.75) \approx 0.5 + 0.4599 = 0.9599. \]

Example 4: Find $\Pr(-0.87 \le z \le 1.26)$.

Solution: The sketch shows the area that is required. It can be viewed as made up of two parts:

-the region between z = 0 and z = 1.26, which has an area equal to $\Pr(0 \le z \le 1.26) \approx 0.3962$ from the table
-the region between z = -0.87 and z = 0. By symmetry, the area of this region is the same as the area of the region between z = 0 and z = +0.87. But this is just $\Pr(0 \le z \le 0.87) \approx 0.3078$ from the table.

[Sketch: area under the standard normal curve between z = -0.87 and z = 1.26]

Thus, the required final answer is

\[ \Pr(-0.87 \le z \le 1.26) = \Pr(-0.87 \le z \le 0) + \Pr(0 \le z \le 1.26) = \Pr(0 \le z \le 0.87) + \Pr(0 \le z \le 1.26) \approx 0.3078 + 0.3962 = 0.7040 \]

Example 5: Find $\Pr(0.42 \le z \le 1.31)$.

Solution: The region of interest in this case is entirely to the right side of the line of symmetry at z = 0. You can see from the sketch, however, that it is possible to write this probability as the difference of two areas:

\[ \Pr(0.42 \le z \le 1.31) = \Pr(0 \le z \le 1.31) - \Pr(0 \le z \le 0.42) \approx 0.4049 - 0.1628 = 0.2421 \]

[Sketch: area under the standard normal curve between z = 0.42 and z = 1.31]

Example 6: Find $\Pr(z \ge 1.27)$.

Solution: From the sketch, we see that

\[ \Pr(0 \le z \le 1.27) + \Pr(z \ge 1.27) = 0.5 \]

because together these two terms include the right half of the bell. Thus,

\[ \Pr(z \ge 1.27) = 0.5 - \Pr(0 \le z \le 1.27) \approx 0.5 - 0.3980 = 0.1020. \]

[Sketch: area under the standard normal curve to the right of z = 1.27]

These last six examples cover pretty well all of the variations possible in calculating standard normal probabilities. Again, you may encounter standard normal probability tables organized in different ways than the table given in this document.
In very general terms, the strategies for using them parallel the strategies illustrated above.

Having mastered the task of determining standard normal probabilities, it is now straightforward to use formula (ND-5) to compute probabilities for any other normal distribution. We illustrate this with a few short examples.

Example 7: Suppose that the amount of vegetable soup dispensed by a machine into containers at a food processing plant is an approximately normally-distributed random variable with a mean of 356 g and a standard deviation of 23 g. Determine the probability that a container selected at random will contain between 350 and 375 g of soup.

Solution: If we define

x = grams of soup dispensed by the machine into a container

then, according to this problem, $x$ is an approximately normally-distributed random variable with a mean, $\mu$, of 356 g, and a standard deviation, $\sigma$, of 23 g. The question is then asking for the probability:

\[ \Pr(350 \le x \le 375) \]

Using formula (ND-5), we first get:

\[ \Pr(350 \le x \le 375) = \Pr\!\left(\frac{350-356}{23} \le z \le \frac{375-356}{23}\right) \approx \Pr(-0.26 \le z \le 0.83) = \Pr(0 \le z \le 0.26) + \Pr(0 \le z \le 0.83) \approx 0.1026 + 0.2967 = 0.3993 \]

[Sketch: area under the standard normal curve between z = -0.26 and z = 0.83]

Thus, the probability that a randomly selected container will contain between 350 g and 375 g of soup is 0.3993. This means as well that approximately 39.93% of the containers have between 350 g and 375 g of soup in them.

Example 8: Suppose that the label on these soup containers states that they contain 340 g of soup. What percentage of the containers will actually contain less soup than stated on the label?

Solution: The percentage of containers with less than 340 g of soup is the same thing as the probability that a randomly selected container will contain less than 340 g of soup. Thus, defining the random variable $x$ in the same way as was done in Example 7, we see that the question is really asking us to determine

\[ \Pr(x < 340) \]

But

\[ \Pr(x < 340) = \Pr\!\left(z < \frac{340-356}{23}\right) \approx \Pr(z < -0.70) \]

[Sketch: area under the standard normal curve to the left of z = -0.70]

From the sketch, you can see that

\[ \Pr(z < -0.70) = 0.5 - \Pr(0 \le z \le 0.70) \approx 0.5 - 0.2580 = 0.2420 \]

Thus, since the probability of a randomly selected container containing less than 340 g of soup is 0.2420, we conclude that 24.20% of the containers contain less than 340 g of soup.

Example 9: A technologist has a procedure for preparing culture tubes for screening tests. The specifications require that the amount of nutrient solution be between 9.75 ml and 10.25 ml. Studies indicate that the procedure results in a somewhat random amount of nutrient solution placed in each tube, which is approximately normally distributed with a mean of 10.05 ml and a standard deviation of 0.18 ml. What percentage of these culture tubes do not conform to the nutrient solution amount specifications?

Solution: This question is asking for the percentage of culture tubes that contain either less than 9.75 ml of nutrient solution or more than 10.25 ml of nutrient solution. Again, the percentage of culture tubes containing a certain amount of solution is equivalent to the probability of finding that amount of nutrient solution in a randomly selected culture tube, so this is really a probability problem. To start, define

x = the ml of nutrient solution in a randomly selected culture tube

Then, according to the problem, $x$ is an approximately normally distributed random variable with a mean of 10.05 ml and a standard deviation of 0.18 ml. We can compute:

\[ \Pr(x < 9.75) = \Pr\!\left(z < \frac{9.75-10.05}{0.18}\right) \approx \Pr(z < -1.67) = 0.5 - \Pr(0 \le z \le 1.67) \approx 0.5 - 0.4525 = 0.0475 \]

If you have trouble following these calculations, try sketching a diagram, as was done in the earlier examples. Similarly,

\[ \Pr(x > 10.25) = \Pr\!\left(z > \frac{10.25-10.05}{0.18}\right) \approx \Pr(z > 1.11) = 0.5 - \Pr(0 \le z \le 1.11) \approx 0.5 - 0.3665 = 0.1335 \]

From these two results, we see that 4.75% of the culture tubes will have less than 9.75 ml of solution, and 13.35% of the tubes will have more than 10.25 ml of solution.
Thus, a total of 4.75% + 13.35% = 18.1% of the culture tubes will be outside of the specifications relating to amount of nutrient solution.

Remark

Microsoft Excel provides the functions NORMSDIST() for calculating standard normal probabilities, and NORMDIST() for calculating cumulative probabilities for general normal distributions. Specifically, for b ≥ 0,

NORMSDIST(b) = Pr(-∞ < z ≤ b) = 0.5 + Pr(0 < z ≤ b)

Thus, for a given non-negative value 'b', Excel's function NORMSDIST(b) gives a result which is 0.5 greater than the corresponding entry in the table of standard normal probabilities included in this document. Excel's NORMSDIST() function returns what are truly cumulative standard normal probabilities. You should be able to figure out how to use values of the NORMSDIST() function to compute standard normal probabilities for any of the variations illustrated in Examples 1 - 6 above. The reason we opted for a table based on Pr(0 < z ≤ b) for hand calculations here is our opinion that such a table reduces the amount of arithmetic necessary in most instances (unless you are willing to expand the table to include values of 'b' ranging over negative as well as positive numbers).

The NORMDIST() function in Excel is a bit more complicated, but again gives a true cumulative probability:

NORMDIST(b, μ, σ, True) = Pr(-∞ < x ≤ b)

where x is a normally distributed random variable with mean μ and standard deviation σ. For a cumulative probability such as this, the fourth parameter must always be the word 'True'. (If you specify 'False' in this position, NORMDIST gives the value of the probability density function, f(b), rather than a cumulative probability.) Recall formula (RV-6)

\[ \Pr(a < x < b) = F(b) - F(a) \tag{RV-6} \]

from the preceding document, which indicates how to calculate other probabilities from values of cumulative probabilities such as produced by NORMDIST().
Obviously, if you were setting up normal probability calculations in an Excel spreadsheet (or other spreadsheet applications that have similar functionality), you would use these built-in functions to get values of required probabilities rather than reading the probabilities from a printed table and typing them in as numbers.
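The same approach carries over to other software. As a hedged illustration, the Python sketch below mimics the two Excel functions using `statistics.NormalDist` from Python's standard library (version 3.8 or later) and reworks Examples 7 to 9 through formula (RV-6). The wrapper names `normsdist`, `normdist`, and `prob_between` are hypothetical helpers chosen to echo the Excel names; they are not Excel's or Python's own API.

```python
from statistics import NormalDist


def normsdist(b):
    """Analogue of Excel's NORMSDIST(b): Pr(-inf < z <= b) for the standard normal z."""
    return NormalDist().cdf(b)


def normdist(b, mu, sigma, cumulative=True):
    """Analogue of Excel's NORMDIST(b, mu, sigma, True/False):
    cumulative probability if cumulative is True, else the density f(b)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(b) if cumulative else dist.pdf(b)


def prob_between(a, b, mu, sigma):
    """Formula (RV-6): Pr(a < x < b) = F(b) - F(a)."""
    return normdist(b, mu, sigma) - normdist(a, mu, sigma)


if __name__ == "__main__":
    # Example 7: soup containers, mean 356 g, standard deviation 23 g
    print(f"Example 7: {prob_between(350, 375, 356, 23):.4f}")
    # Example 8: less than the 340 g stated on the label
    print(f"Example 8: {normdist(340, 356, 23):.4f}")
    # Example 9: culture tubes outside 9.75 ml to 10.25 ml
    outside = normdist(9.75, 10.05, 0.18) + (1.0 - normdist(10.25, 10.05, 0.18))
    print(f"Example 9: {outside:.4f}")
```

The results differ slightly in the third or fourth decimal place from the hand calculations above, because the table method rounds each standard score to two decimal places before the lookup while the software works with the unrounded value.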