Chapter 6: Some Continuous Probability Distributions

[Figure: diagram of a population, a sample taken from it, and inference from the sample back to the population]

Again, PDFs are population quantities which give us information about the distribution of items in the population. There are many PDFs which are used to understand probabilities associated with random variables. A few of these PDFs apply to multiple real-life situations. These PDFs are described next.

From this chapter, it is important to learn the following:
o What are these PDFs which can be used for multiple situations
o When can these PDFs be used
o The means and variances for random variables with these PDFs

All PDFs in this chapter will be for continuous random variables.

2005 Christopher R. Bilder

6.1: Continuous Uniform Distribution

The simplest PDF for continuous random variables is when the probability of observing a particular range of values for X is the same for all equal length ranges! Since the probabilities are the same, this PDF is called the uniform PDF.

The Uniform PDF – Let X be a random variable on the interval [A,B]. The uniform PDF is

$$f(x;A,B) = \begin{cases} \dfrac{1}{B-A} & \text{for } A \le x \le B \\ 0 & \text{otherwise} \end{cases}$$

Notes:
o We examined this PDF at the beginning of Section 3.3!
o The parameters, A and B, control the location of the PDF.
o In general, this is what a graph of the PDF looks like.
  [Figure: uniform PDF f(x;A,B) plotted against x; a rectangle of height 1/(B-A) between A and B]
o The area under the curve is 1. Since the PDF looks like a rectangle, we can take base×height = (B-A)[1/(B-A)] to find the area is 1.

Example: Uniform distribution with A=1 and B=4 (uniform.xls)

[Figure: Uniform PDF; x-axis: x from 0 to 5; y-axis: f(x) from 0 to 0.35]

Areas underneath the curve correspond to probabilities. For example, P(1<X<3) = 0.67. How could I find this using calculus? Note the blue lines on the x-axis should be extended to the end of the plot.

Theorem 6.1 – The mean and variance of a random variable X with a uniform PDF are

$$E(X) = \mu = \frac{A+B}{2} \quad \text{and} \quad Var(X) = \sigma^2 = \frac{(B-A)^2}{12}$$

Proving these are homework!
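A quick numerical check of the uniform example can be sketched in Python. This is only an illustration, not part of uniform.xls: the function names below are made up for this sketch, and it evaluates the base×height area and the formulas stated in Theorem 6.1 rather than proving them.

```python
def uniform_pdf(x, a, b):
    """Height of the uniform PDF on [a, b]; zero outside the interval."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def uniform_prob(lo, hi, a, b):
    """P(lo < X < hi) as base * height, clipping the range to [a, b]."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0.0) * (1.0 / (b - a))


def uniform_mean_var(a, b):
    """Mean and variance as stated in Theorem 6.1."""
    return (a + b) / 2.0, (b - a) ** 2 / 12.0


A, B = 1.0, 4.0
print(round(uniform_prob(1, 3, A, B), 2))  # 0.67, matching the example
print(uniform_mean_var(A, B))              # (2.5, 0.75)
```

The same `uniform_prob` call with lo=A and hi=B returns 1, which is the "area under the curve is 1" note.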
6.2: Normal Distribution

This is the main PDF that we will be using since it occurs in many applications.

Normal PDF – Let X be a random variable with mean E(X)=μ and Var(X)=σ². The normal PDF is

$$f(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \quad \text{for } -\infty < x < \infty$$

Notes:
o The parameters, μ and σ, control the location and scale of the distribution, respectively. These are the population mean and standard deviation! Thus, a nice simplification with the normal PDF is that the mean and standard deviation can be represented easily as parameters in the function.
o In most realistic applications, μ and σ will not be known and we will need to estimate them. How to do this will be discussed in future chapters.
o The book denotes f(x;μ,σ) by n(x;μ,σ).
o Terminology: Suppose X is a random variable with a normal PDF. One can shorten how this is said by saying X is a normal random variable.
o In general, this is what a graph of the distribution looks like.
  [Figure: bell-shaped normal PDF, f(x) plotted against x, centered at μ with maximum height 1/(√(2π)σ)]
o The curve graphed is the set of connected points (x, f(x)).
o The PDF is centered at μ (symmetric about μ). Thus, P(X>μ) = P(X<μ) = 0.5. The parameter μ is often called a location parameter since it gives the central location of the PDF.
o The area under the curve is 1.
o The left and right sides of the curve extend out to -∞ and +∞ without touching the x-axis (although they get very close). Note the plot above may be a little misleading with respect to this. The left and right sides of the PDF are often called the "tails" of the PDF.
o σ controls the scale of the PDF. The larger σ, the more spread out the PDF (large variability). The smaller σ, the less spread out the PDF (small variability). Below are three normal PDFs demonstrating this.

[Figure: Normal PDF Example; x-axis: x (MPG) from 20 to 30; y-axis: f(x) from 0 to 0.7; curves for (μ, σ) = (24.3, 0.6), (24.3, 1.3), and (23.1, 0.6)]

A VERY IMPORTANT specific case of a normal PDF is the standard normal PDF. This PDF has μ=0 and σ=1.
Therefore,

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}} \quad \text{for } -\infty < x < \infty$$

Typically, "Z" is used instead of "X" to denote a standard normal random variable. This will be discussed more later.

Showing $\int_{-\infty}^{\infty} f(x)\,dx = 1$ is not as easy as it was in Chapter 3. The proof involves making a transformation to polar coordinates. Pages 104-5 of Casella and Berger's (1990) textbook show the proof (this book is used for STAT 882).

Example: Interactive normal PDFs (normal_dist.xls)

This file is constructed to help you visualize the normal probability distribution. For example, below is the normal PDF for μ=50 and σ=3.

[Figure: normal PDF for μ=50 and σ=3 from normal_dist.xls]

Experiment on your own using different values of μ and σ to see changes in the distribution. Make sure you understand the following:
o What happens when μ is increased or decreased?
o What happens when σ is increased or decreased?
o Where is the highest point on the distribution? What is this highest point?

Also in the file are examples of how to use the NORMDIST( ) and NORMINV( ) Excel functions, which will be discussed in detail in Section 6.3.

Below is the proof showing that E(X) = μ. A similar proof can be done to show Var(X) = σ² (see p. 146 of the book).

$$E(X) = \int_{-\infty}^{\infty} x\, \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx$$

6.3-6.4: Areas Under the Normal Curve and Applications of the Normal Distribution

Example: Grand Am (grand_am_normal.xls)

Suppose that it is reasonable to assume a Grand Am's MPG has a normal PDF with a mean MPG of μ=24.3 and a standard deviation of σ=0.6. Let X denote the MPG for one tank of gas. Answer the following questions.

1) Find the probability that a randomly selected Grand Am gets less than 23 MPG for one tank of gas.

We need to find P(X<23) = F(23). This is the area to the left of the red line underneath the PDF.

[Figure: Grand Am Normal PDF Example; x-axis: x (MPG) from 20 to 30; y-axis: f(x) from 0 to 0.7; curve for μ=24.3 and σ=0.6 with a red line at x=23]
This probability can be found by:

$$\int_{-\infty}^{23} \frac{1}{0.6\sqrt{2\pi}}\, e^{-\frac{(x-24.3)^2}{2(0.6)^2}}\, dx$$

Using Maple without evaluating at the limits of integration, we get:

> assume(sigma>0);
> f:=1/(sqrt(2*Pi)*0.6)*exp(-(x-24.3)^2/(2*0.6^2));
    f := .8333333335 sqrt(2) e^(-1.388888889 (x~ - 24.3)^2) / sqrt(Pi)
> int(f,x);
    .4999999998 erf(1.178511302 x~ - 28.63782464)

Notice that a capital P is used in Pi. See http://mathworld.wolfram.com/Erf.html for more information on the erf() function.

Using Maple with the limits of integration, we get:

> int(f,x=-infinity..23);
    .0151301397

where it uses numerical approximations for the last integral.

To make finding probabilities easier, many software packages (and calculators) have special functions which do the integration for X in some interval. In Excel, the NORMDIST(x, μ, σ, TRUE) function finds F(x) for a normal random variable with mean μ and standard deviation σ. For this example, use

NORMDIST(23,24.3,0.6,TRUE)

This results in 0.0151. Chris Malone's Excel Instructions website contains help for this function at http://www.statsclass.com/excel/tables/prob_values.html#prob_n. The web page shows another way to use the function through a window-based format.

Side note: To find the probability in Maple using its specialized functions, you can use the following code:

> with(stats);
    [ anova, describe, fit, importdata, random, statevalf, statplots, transform ]
> statevalf[cdf,normald[24.3,0.6]](23);
    .01513014001

2) Suppose σ is increased to σ=1.3. What do you expect to happen to P(X<23)?

The Excel function is NORMDIST(23,24.3,1.3,TRUE)

[Figure: Grand Am Normal PDF Example; x-axis: x (MPG) from 20 to 30; y-axis: f(x) from 0 to 0.7; curve for μ=24.3 and σ=1.3]

3) Suppose σ=0.6 again, but μ is decreased to μ=23.1. What do you expect to happen to P(X<23)?

The Excel function is NORMDIST(23,23.1,0.6,TRUE)
[Figure: Grand Am Normal PDF Example; x-axis: x (MPG) from 20 to 30; y-axis: f(x) from 0 to 0.7; curve for μ=23.1 and σ=0.6]

Below is a nice comparative graph for the 3 examples above.

[Figure: Grand Am Normal PDF Example; x-axis: x (MPG) from 20 to 30; y-axis: f(x) from 0 to 0.7; curves for (μ, σ) = (24.3, 0.6), (24.3, 1.3), and (23.1, 0.6)]

4) Suppose σ=0.6 and μ=24.3 again. What is P(23<X<25)?

The probability needs to be broken up since the NORMDIST( ) function only finds probabilities in the form of F(x).

P(23<X<25) = P(X<25) – P(X<23) = F(25) – F(23).

This can be found with the Excel functions:

NORMDIST(25,24.3,0.6,TRUE)-NORMDIST(23,24.3,0.6,TRUE)

The probability is 0.8632.

[Figure: Grand Am Normal Probability Distribution Example for μ=24.3, σ=0.6; x-axis: X from 20 to 30; y-axis: f(X) from 0 to 0.7; shaded area P(23<X<25)]

5) Suppose σ=0.6 and μ=24.3 again. What is P(X>23)?

6) Suppose σ=0.6 and μ=24.3 again. What is P(X<23 or X>25)?

7) What is the minimum MPG required for a car to be in the top 5% of all Grand Ams? Suppose σ=0.6 and μ=24.3 again.

This problem requires going in the opposite direction. We are now given a probability and need to find the corresponding "x" that works for P(X>x)=0.05. In terms of integration, we are trying to find x in the equation below:

$$0.05 = \int_{x}^{\infty} \frac{1}{0.6\sqrt{2\pi}}\, e^{-\frac{(y-24.3)^2}{2(0.6)^2}}\, dy$$

Equivalently,

$$0.95 = \int_{-\infty}^{x} \frac{1}{0.6\sqrt{2\pi}}\, e^{-\frac{(y-24.3)^2}{2(0.6)^2}}\, dy$$

Notice the limits of integration are written in terms of y. This is done to avoid the confusion of integrating from "x=x to ∞".

The x value can be found by using Excel's NORMINV(area, μ, σ) function where area=P(X<x). Be careful! Notice that the area is for P(X<x), not P(X>x). The x value can be found with the Excel function:

NORMINV(0.95,24.3,0.6)

Therefore, P(X>25.29)=0.05. See http://www.statsclass.com/excel/tables/crit_values.html#crit_n for more information about this function. Note that
we will eventually use these types of values as "critical points" in hypothesis testing.

Here are other ways to find the value of x in Maple:

> with(stats);
    [ anova, describe, fit, importdata, random, statevalf, statplots, transform ]
> statevalf[icdf,normald[24.3,0.6]](0.95);
    25.28691218
> f:=1/(sqrt(2*Pi)*0.6)*exp(-(y-mu)^2/(2*sigma^2));
    f := .8333333335 sqrt(2) e^(-1/2 (y - mu)^2 / sigma^2) / sqrt(Pi)
> solve(0.95 = eval(int(f, y = -infinity..x), [mu=24.3, sigma=0.6]), x);
    25.28691217

Example: Grading (grade_bell.xls)

Suppose the set of test #2 grades in the class has a normal distribution with μ=73% and σ=8%. Let X be a student's grade. Answer the following.

1) What is the probability that a randomly chosen student in the class received a grade of 90% or better?

[Figure: Grading Normal PDF Example; x-axis: x (Grade) from 50 to 100; y-axis: f(x) from 0 to 0.14]

Let X be a normal random variable with μ=73% and σ=8%. Find P(X>90). Thus, we need to find

$$P(X > 90) = \int_{90}^{\infty} \frac{1}{8\sqrt{2\pi}}\, e^{-\frac{(x-73)^2}{2(8)^2}}\, dx$$

The Excel function is 1-NORMDIST(90,73,8,TRUE) and the answer is 0.0168.

2) What percentage of students scored between 70% and 90%?

[Figure: Grading Normal PDF Example; x-axis: x (Grade) from 50 to 100; y-axis: f(x) from 0 to 0.14]

The Excel function is NORMDIST(90,73,8,TRUE)-NORMDIST(70,73,8,TRUE) and the answer is 0.6294.

3) Suppose that your instructor curves the test #2 grades and that ONLY the top 10% of test scores receive A's. Would a student be better off with a test #2 grade of 81% (still with μ=73% and σ=8%) or a grade of 68% on a different test #2 that has a normal distribution with μ=62% and σ=3%?

[Figure: Grading Normal PDF Example; x-axis: x (Grade) from 50 to 100; y-axis: f(x) from 0 to 0.14; curves for (μ, σ) = (73, 8) and (62, 3)]

Find the top 10% of the scores for each situation.

For μ=73% and σ=8%, find x for P(X>x)=0.10.
The Excel function to find this is NORMINV(0.9,73,8) and the answer is 83.25.

For μ=62% and σ=3%, find x for P(X>x)=0.10. The Excel function to find this is NORMINV(0.9,62,3) and the answer is 65.84.

A student would prefer the second test since an A would be received: 68% is above the 65.84% cutoff, while 81% is below the 83.25% cutoff.

Rule of thumb for the number of standard deviations all data lies from its mean:

In Chapter 4, we discussed that approximately all data lies within μ±2σ or μ±3σ. We also discussed the more formal expression of this using Chebyshev's Rule. Examine what happens if our data comes from a normal PDF. The end result is what is often called the Empirical Rule.

Example: Standard normal distribution template (stand_norm_prob.xls)

Let Z be a random variable with a standard normal PDF. Thus, μ=0 and σ=1. All of these results apply for μ≠0 and σ≠1 also.

Below are three screen captures that show a standard normal PDF. The distributions show the area within 1, 2, and 3 standard deviations of the mean.

[Figures: three screen captures of the standard normal PDF with the area within ±1, ±2, and ±3 standard deviations of the mean shaded]

Notice how large the probability is that Z is within 2 or 3 standard deviations of the mean!

Reminder about P(X=x)=0

What is P(X=x)? It is 0 since X is a continuous random variable. To see why this is true, consider this proof by example. Let Z be a standard normal random variable. The following table of probabilities can then be constructed.

                  Probability
P(0.95<Z<1.05)    0.0242
P(0.98<Z<1.02)    0.0096
P(0.99<Z<1.01)    0.0049
P(0.99<Z<1)       0.0024
P(1<Z<1.01)       0.0025
P(Z=1)            0

Notice the probability gets smaller and smaller as the interval gets smaller. Eventually, the probability will become 0.

Remember that $P(a < X < b) = \int_a^b f(x)\,dx$ for some PDF f(x) where X is a continuous random variable. When a=b, then $P(a < X < a) = \int_a^a f(x)\,dx = 0$.

Standard normal PDF

Probabilities associated with the standard normal PDF have been tabulated.
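The empirical-rule areas and the shrinking-interval argument above can also be checked outside of Excel. The sketch below is an illustration only: it uses `statistics.NormalDist` from the Python standard library in place of NORMDIST( ), and the helper name `within` is made up for this example.

```python
from statistics import NormalDist

# Standard normal random variable Z (mean 0, standard deviation 1).
Z = NormalDist(mu=0.0, sigma=1.0)


def within(k):
    """P(-k < Z < k): area within k standard deviations of the mean."""
    return Z.cdf(k) - Z.cdf(-k)


for k in (1, 2, 3):
    print(k, round(within(k), 4))  # 0.6827, 0.9545, 0.9973

# Intervals shrinking around z = 1: the probability heads toward 0.
intervals = [(0.95, 1.05), (0.98, 1.02), (0.99, 1.01), (1.0, 1.0)]
interval_probs = [Z.cdf(b) - Z.cdf(a) for a, b in intervals]
for (a, b), p in zip(intervals, interval_probs):
    print(a, b, p)
```

The last interval has a=b, so its probability is exactly 0, which is the P(X=x)=0 reminder in numerical form.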
Example: Standard normal distribution tables (stand_norm_table.xls)

Before there were readily accessible software packages or calculators with functions for the normal PDF, people used tables based on the standard normal PDF in order to find probabilities associated with ANY normal PDF. Table A.3 on p.670-1 of the book is one of these tables. It provides F(z), the CDF for a standard normal random variable Z. I am using Z here because this is the common practice when discussing standard normal random variables. Thus, Table A.3 gives probabilities such as the one shown below.

[Figure: standard normal PDF with the area P(Z<z) shaded]

Below is an excerpt from the table contained in stand_norm_table.xls.

z      0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
-3.4   0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0003  0.0002
-3.3   0.0005  0.0005  0.0005  0.0004  0.0004  0.0004  0.0004  0.0004  0.0004  0.0003
-3.2   0.0007  0.0007  0.0006  0.0006  0.0006  0.0006  0.0006  0.0005  0.0005  0.0005
-3.1   0.0010  0.0009  0.0009  0.0009  0.0008  0.0008  0.0008  0.0008  0.0007  0.0007
-3.0   0.0013  0.0013  0.0013  0.0012  0.0012  0.0011  0.0011  0.0011  0.0010  0.0010
-2.9   0.0019  0.0018  0.0018  0.0017  0.0016  0.0016  0.0015  0.0015  0.0014  0.0014
-2.8   0.0026  0.0025  0.0024  0.0023  0.0023  0.0022  0.0021  0.0021  0.0020  0.0019
-2.7   0.0035  0.0034  0.0033  0.0032  0.0031  0.0030  0.0029  0.0028  0.0027  0.0026
-2.6   0.0047  0.0045  0.0044  0.0043  0.0041  0.0040  0.0039  0.0038  0.0037  0.0036
-2.5   0.0062  0.0060  0.0059  0.0057  0.0055  0.0054  0.0052  0.0051  0.0049  0.0048
-2.4   0.0082  0.0080  0.0078  0.0075  0.0073  0.0071  0.0069  0.0068  0.0066  0.0064
-2.3   0.0107  0.0104  0.0102  0.0099  0.0096  0.0094  0.0091  0.0089  0.0087  0.0084
-2.2   0.0139  0.0136  0.0132  0.0129  0.0125  0.0122  0.0119  0.0116  0.0113  0.0110
-2.1   0.0179  0.0174  0.0170  0.0166  0.0162  0.0158  0.0154  0.0150  0.0146  0.0143
-2.0   0.0228  0.0222  0.0217  0.0212  0.0207  0.0202  0.0197  0.0192  0.0188  0.0183
-1.9   0.0287  0.0281  0.0274  0.0268  0.0262  0.0256  0.0250  0.0244  0.0239  0.0233

The row gives z to one decimal place and the column gives the second decimal place. This table uses the NORMDIST(z, 0, 1, TRUE)
function to find P(Z<z). For example,
P(Z<-3.41) = 0.0003,
P(Z<-3.03) = 0.0012,
P(Z<-2.57) = 0.0051.

Why are we concerned with this table of standard normal probabilities?

A simple transformation can be made from ANY normal PDF to the standard normal PDF using the following formula:

$$Z = \frac{X - \mu}{\sigma}$$

where X is a normal random variable with mean μ and standard deviation σ, and Z is a standard normal random variable with mean 0 and standard deviation 1. Therefore, using this one table, we can find all normal PDF probabilities WITHOUT Excel or other means.

Example: Grand Am (grand_am_normal.xls)

Suppose that it is reasonable to assume a Grand Am's MPG has a normal PDF with a mean MPG of μ=24.3 and a standard deviation of σ=0.6. Let X denote the MPG for one tank of gas. Answer the following questions.

1) Find the probability that a randomly selected Grand Am gets less than 23 MPG for one tank of gas.

We need to find P(X<23) = F(23). This is the area to the left of the red line underneath the PDF.

[Figure: Grand Am Normal PDF Example; x-axis: x (MPG) from 20 to 30; y-axis: f(x) from 0 to 0.7; curve for μ=24.3 and σ=0.6]

The function, NORMDIST(23,24.3,0.6,TRUE), can be used in Excel to find the probability to be 0.0151.

Using the tables,

$$P(X<23) = P\left(\frac{X-\mu}{\sigma} < \frac{23-24.3}{0.6}\right) = P(Z<-2.1667) \approx P(Z<-2.17) = 0.0150$$

2) Suppose σ is increased to σ=1.3. What do you expect to happen to P(X<23)?

The function, NORMDIST(23,24.3,1.3,TRUE), can be used to find the probability to be 0.1587.

[Figure: Grand Am Normal PDF Example; x-axis: x (MPG) from 20 to 30; y-axis: f(x) from 0 to 0.7; curve for μ=24.3 and σ=1.3]

Using the tables,

$$P(X<23) = P\left(\frac{X-\mu}{\sigma} < \frac{23-24.3}{1.3}\right) = P(Z<-1) = 0.1587$$

3) Suppose σ=0.6 again, but μ is decreased to μ=23.1. What do you expect to happen to P(X<23)?

The function, NORMDIST(23,23.1,0.6,TRUE), can be used to find the probability to be 0.4338.
Using the tables,

$$P(X<23) = P\left(\frac{X-\mu}{\sigma} < \frac{23-23.1}{0.6}\right) = P(Z<-0.1667) \approx P(Z<-0.17) = 0.4325$$

4) Suppose σ=0.6 and μ=24.3 again. What is P(23<X<25)?

The function, NORMDIST(25,24.3,0.6,TRUE)-NORMDIST(23,24.3,0.6,TRUE), can be used to find the probability to be 0.8632.

Using the tables,

$$P(23<X<25) = P\left(\frac{23-24.3}{0.6} < \frac{X-\mu}{\sigma} < \frac{25-24.3}{0.6}\right) = P(-2.1667<Z<1.1667) \approx P(-2.17<Z<1.17)$$

$$= P(Z<1.17) - P(Z<-2.17) = 0.8790 - 0.0150 = 0.8640$$

5) Suppose σ=0.6 and μ=24.3 again. What is P(X>23)?

6) Suppose σ=0.6 and μ=24.3 again. What is P(X<23 or X>25)?

7) What MPG is required for a car to be in the top 5% of all Grand Ams? Suppose σ=0.6 and μ=24.3 again.

The x value for P(X>x)=0.05 was found with the Excel function =NORMINV(0.95,24.3,0.6). This produced P(X>25.29)=0.05.

Using the tables,

$$P(X>x) = P\left(\frac{X-\mu}{\sigma} > \frac{x-24.3}{0.6}\right) = P(Z>z) = 0.05$$

Note that P(Z<z)=0.95 produces z≈1.64. Then

$$P\left(\frac{X-\mu}{\sigma} > \frac{x-24.3}{0.6}\right) = P(Z>1.64) \approx 0.05 \quad \text{and} \quad \frac{x-24.3}{0.6} = 1.64$$

Thus, x=25.284. Therefore, P(X>25.284)≈0.05.

Observing a sample from a population characterized by a normal PDF

Suppose a population can be characterized by a normal PDF. What characteristics would you expect for a sample taken from that population?

Example: MPG (gen_norm.xls)

MPG example from before: X is a normal random variable with μ=E(X)=24.3 and σ=√Var(X)=0.6.

Suppose 1,000 different x's are observed. In other words, a sample of 1,000 is taken from the population.

Questions:
1) What would you expect the average value of the 1,000 observed x's to be approximately?
2) What range would you expect most of the x's to fall within?

Observed values of a normal random variable can also be generated in the same way as was done in Chapters 3 and 5. Excel also has a specific normal PDF option in the Random Number Generation window. The file, gen_norm.xls, gives an example of using the window below. More directions are available at Chris Malone's Excel help website at http://www.statsclass.com/excel/misc/norm_dist.html.
[Figure: Excel Random Number Generation window with the normal distribution option selected]

In this case, 1 variable with 1,000 observed values is generated. The mean μ=24.3 and standard deviation σ=0.6 are used to coincide with the Grand Am example. The seed number gives Excel a random place to start when generating these observed values. I can use this seed number again and generate the exact same data!

Below is part of the results.

                     population   sample
mean                 24.3         24.32106
standard deviation   0.6          0.596285

classes   Bin    Frequency
22.6      22.6   0
22.8      22.8   3
23        23     11
23.2      23.2   23
23.4      23.4   23
23.6      23.6   53
23.8      23.8   83
24        24     110
24.2      24.2   113
24.4      24.4   121
24.6      24.6   111
24.8      24.8   136
25        25     93
25.2      25.2   55
25.4      25.4   37
25.6      25.6   16
25.8      25.8   5
26        26     4
26.2      26.2   1
26.4      26.4   1
26.6      26.6   1
26.8      26.8   0
          More   0

MPG (first observed values shown):
23.74819, 23.48401, 24.59496, 24.19059, 24.32339, 23.11923, 24.47166, 24.62146, 24.72662, 24.14857, 23.62592, 23.90335, 24.57064, 24.05448, 23.71645, 25.07131, 24.32436, 24.1024, 24.62523, 24.28823, 24.7843, 24.04441, 24.54797, 24.26204, 23.68477, 24.76348, 24.05258, 24.38633, 23.26503, 23.78286, 24.09225, 23.61932, 23.32867, 23.02444, 23.91397, 24.42597, 24.42067

Notes:
o Notice how close μ and σ are to the sample mean and standard deviation.
o The sample standard deviation is calculated as

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

  where $\bar{x}$ is the sample mean and $x_i$ for i=1,…,n is the ith observed value. An explanation for why this formula is used will be given in Chapter 8.
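The sample mean and the sample standard deviation formula in the notes above can be sketched in Python. This is an illustration under stated assumptions, not a reproduction of gen_norm.xls: Python's `random` module uses a different generator than Excel, the seed 2005 is arbitrary, and `sample_sd` is a name made up for this sketch, so the simulated values will not match the spreadsheet.

```python
import math
import random


def sample_sd(xs):
    """Sample standard deviation with the n - 1 divisor."""
    n = len(xs)
    xbar = sum(xs) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))


# A fixed seed makes the simulated sample reproducible, as with Excel's seed.
rng = random.Random(2005)
sample = [rng.gauss(24.3, 0.6) for _ in range(1000)]

xbar = sum(sample) / len(sample)
s = sample_sd(sample)
print(xbar, s)  # both should be close to mu = 24.3 and sigma = 0.6
```

Rerunning with the same seed gives the same 1,000 values; changing the seed gives a different sample whose mean and standard deviation should still be near 24.3 and 0.6.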
Here is an example of how to simulate a sample from a normal PDF using Maple:

> randomize(1514);
    1514
> data:=stats[random, normald[24.3, 0.6]](100);
    data := 25.36372908, 24.96025314, 24.44663243, 25.27318122, 23.94262355,
    23.69829609, 23.96992063, 23.72400640, 24.06923492, 24.38832186,
    24.54452405, 23.47219191, 24.51653894, 24.22545826, 24.58063212,
    24.40056631, 24.22519976, 24.73647509, 23.04956592, 24.94875357,
    24.02254401, 24.35341391, 24.67885308, 24.81796173, 23.60716054,
    24.15571156, 24.48549168, 23.84686372, 25.62993784, 24.95907390,
    24.13187013, 24.40491872, 25.04623787, 23.81147131, 23.04161664,
    25.57549338, 23.34059716, 24.46719408, 24.23062843, 23.80346201,
    25.20382342, 23.72508178, 23.35185260, 23.99842442, 24.55421301,
    24.06936962, 23.50756715, 24.22223306, 24.28139128, 24.47253728,
    24.50969275, 25.31179898, 24.28610049, 25.12040542, 24.19990788,
    24.44507802, 24.74360918, 23.27915406, 24.06653251, 23.30621749,
    25.09714835, 24.48445061, 24.16122685, 24.30883191, 24.04085590,
    25.60826660, 24.69746294, 25.02525917, 23.45013063, 25.15368420,
    24.63990563, 24.43778592, 24.58547842, 24.56049376, 24.74861908,
    24.27232784, 24.39745116, 25.13232101, 24.51272238, 24.41097845,
    24.23063511, 24.52780462, 24.38724182, 24.04812858, 24.40825270,
    23.34198654, 23.33174552, 24.58956375, 24.34240361, 24.66322075,
    25.63674860, 23.75350627, 24.17714648, 24.47851759, 23.47378351,
    23.65460645, 25.20231330, 23.90859335, 24.26972911, 24.41964871
> evalf(stats[describe,mean]([data]),4);
    24.30
> evalf(stats[describe, standarddeviation]([data]),4);
    .5871

Page 6.35 of the notes (the partial Excel results for the 1,000 observed values) shows one possible frequency distribution for the sample. This gives information about how often observed values fell into chosen classes. In Excel, I originally entered the values in the "classes" column.
After a few steps, Excel automatically generates a frequency distribution. One needs to be VERY careful with interpreting what Excel gives. Below is another representation of it:

classes              Frequency
≤22.6                0
>22.6 and ≤22.8      3
>22.8 and ≤23        11
>23 and ≤23.2        23
>23.2 and ≤23.4      23
>23.4 and ≤23.6      53
>23.6 and ≤23.8      83
>23.8 and ≤24        110
>24 and ≤24.2        113
>24.2 and ≤24.4      121
>24.4 and ≤24.6      111
>24.6 and ≤24.8      136
>24.8 and ≤25        93
>25 and ≤25.2        55
>25.2 and ≤25.4      37
>25.4 and ≤25.6      16
>25.6 and ≤25.8      5
>25.8 and ≤26        4
>26 and ≤26.2        1
>26.2 and ≤26.4      1
>26.4 and ≤26.6      1
>26.6 and ≤26.8      0
>26.8                0

Thus, 136 sampled values are greater than 24.6 and less than or equal to 24.8.

Why were these classes chosen? There is more than one set of classes which can be used. Here are some guidelines:
a) Find the minimum and maximum observed values. You can use the MIN() and MAX() functions in Excel to do this.
b) Choose classes which are of equal size.
c) Choose the classes between the minimum and maximum values which make sense relative to the data set. You may need to try a few different sets until you think the frequency distribution represents the data well.
d) Note that 1, 2, or 3 classes do not work!

The frequency distribution is often plotted. This plot is called a histogram. Below is the histogram created by Excel.

[Figure: Histogram of 1,000 MPG observed values; x-axis: x = MPG with classes from 22.6 to 26.6 and "More"; y-axis: Frequency from 0 to 160]

Does the histogram have a similar shape to the normal PDF with μ=24.3 and σ=0.6? If so, a normal PDF approximation to the distribution of MPG would be appropriate.

[Figure: Grand Am Normal PDF Example; x-axis: x (MPG) from 20 to 30; y-axis: f(x) from 0 to 0.7; curve for μ=24.3 and σ=0.6]

Below is an outline of the steps to find the frequency distribution and histogram for this example.
General information about how to find a frequency distribution and histogram is available at http://www.statsclass.com/excel/graphs/histogram.html.

1) Find the minimum and maximum values.

   min   22.71475   =MIN(B10:B1009)
   max   26.46672   =MAX(B10:B1009)

2) In an empty area in the spreadsheet, create a column of classes.

   classes: 22.6, 22.8, 23, 23.2, 23.4, 23.6, 23.8, 24, 24.2, 24.4, 24.6, 24.8, 25, 25.2, 25.4, 25.6, 25.8, 26, 26.2, 26.4, 26.6, 26.8

3) Select TOOLS > DATA ANALYSIS from the main Excel menu bar.

4) Select HISTOGRAM and OK from the DATA ANALYSIS window.

5) The HISTOGRAM window will then appear. In the window, do the following:
   a) Input the cell range of the 1,000 observed values in the INPUT RANGE.
   b) Input the cell range of the classes into the BIN RANGE.
   c) Select an OUTPUT RANGE where the corresponding frequency distribution should start. I usually specify the first cell to the right of my classes.
   d) Select the CHART OUTPUT option to have a histogram created.
   e) Select OK to have the frequency distribution and the histogram created! Below is what my spreadsheet looks like immediately after OK is selected.

   [Figure: screen capture of the spreadsheet immediately after OK is selected]

6) Edit the histogram so that it looks nicer:

   [Figure: Histogram of 1,000 MPG observed values; x-axis: x = MPG with classes from 22.6 to 26.6 and "More"; y-axis: Frequency from 0 to 160]

Chris Malone has created a spreadsheet called data_summary.xls, which can be used when one wants to determine if a normal PDF approximation is appropriate. Below is the spreadsheet result when used with the 1,000 MPG observed values.

[Figure: data_summary.xls output for the 1,000 MPG observed values, including a histogram with a normal PDF curve overlaid]

The curve drawn on the histogram is a normal PDF with mean 24.3211 and standard deviation of 0.5963. Thus, the sample mean and standard deviation are substituted in for the population mean and standard deviation. You are not responsible for knowing how this plot was created, but you will need to be able to use the spreadsheet.
There are also other summary measures displayed (box plot and dot plot) which may be discussed in future chapters.

From the results in data_summary.xls, does a normal PDF approximation for MPG seem appropriate? Explain.

Please see p. 18-19 of the book for more information about frequency distributions and histograms.

Validity of the normal PDF assumption

All of the probabilities found using the normal PDF ASSUME the normal PDF is the correct PDF for the random variable. What if this assumption is incorrect? The probabilities found using this assumption are WRONG!

Example: Grand Am (grand_am.xls)

Suppose X really has a uniform distribution with A=22.3 and B=26.3. Then P(X<23) is base×height = 0.7×0.25 = 0.175. With the normal assumption of μ=24.3 and σ=0.6, the probability was found to be 0.0151.

[Figure: Grand Am Normal and Uniform PDF Example; x-axis: x (MPG) from 20 to 30; y-axis: f(x) from 0 to 0.7; curves for the normal PDF with mean=24.3 and s.d.=0.6 and the uniform PDF with A=22.3, B=26.3]

How does one know when the normal PDF assumption is valid? Rarely, if ever, will it be 100% correct. If a sample from the population is possible, construct a histogram of the observed values and check to see if it has the shape of a normal PDF. In addition, calculate the sample mean and variance to see if they are close to the population mean and variance (if they are known). If the histogram does have a similar shape to a normal PDF and the sample and population mean and variance are about the same (if the population values are known), then the normal PDF assumption is a reasonable approximation.

Suppose a histogram was constructed and the data did not appear to come from a normal or other known PDF. What can you do? You can still use the normal PDF with the sample mean provided the sample size is large enough. The central limit theorem is used here in order to make a normal PDF approximation.
Chapter 8 talks about this in detail.

6.5: Normal Approximation to the Binomial

Skip!

Theorem 6.2: Note that if X is a binomial random variable with mean μ = E(X) = np and variance Var(X) = σ² = np(1-p), then the limiting form of the PDF for

$$Z = \frac{X - np}{\sqrt{np(1-p)}}$$

as n→∞ is the standard normal PDF.

Another way this can be worded is that X can be approximated by a normal random variable with mean np and variance np(1-p). Thus, as the number of trials increases, Z increasingly becomes more like a standard normal random variable.

This information will be used in Section 9.10.

6.6: Gamma and Exponential Distributions

We have already been using the Gamma and Exponential PDFs! These PDFs are often used in survival and reliability analysis. For example, these PDFs are used for modeling lifetimes of individuals or manufactured products.

Definition 6.2: The gamma function is defined by

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1} e^{-x}\, dx \quad \text{for } \alpha > 0$$

Notes:
o When α is a positive integer, Γ(α) = (α-1)!; for example, Γ(3) = (3-1)! = 2! = 2×1 = 2.
o Through integrating by parts, one can show Γ(α) = (α-1)Γ(α-1).
o Γ(1/2) = √π.
o In Maple, this is represented by the GAMMA() function where GAMMA needs to be in capital letters. For example,

  > GAMMA(3);
      2

Gamma PDF: The continuous random variable X has a gamma PDF, with parameters α and β, if its PDF is given by

$$f(x) = \begin{cases} \dfrac{1}{\beta^{\alpha}\Gamma(\alpha)}\, x^{\alpha-1} e^{-x/\beta} & \text{for } x > 0 \\ 0 & \text{otherwise} \end{cases}$$

where α>0 and β>0.

Notes:
o In most realistic applications, α and β will not be known and we will need to estimate them. How to do this will be discussed in future chapters.
o α controls the shape of the PDF since it mostly influences the "peakedness" of the PDF.
o β controls the scale of the PDF since most of its influence is on the spread of the PDF.
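As a rough numerical companion to the definitions above, the gamma function notes and the gamma PDF can be sketched in Python. This is only an illustration: `gamma_pdf` and `integrate` are names made up for this sketch, the values α=2 and β=3 are arbitrary choices, and the midpoint rule is a simple stand-in for exact integration (it assumes α≥1 so the PDF is bounded near 0).

```python
import math


def gamma_pdf(x, alpha, beta):
    """Gamma PDF with shape alpha > 0 and scale beta > 0; zero for x <= 0."""
    if x <= 0:
        return 0.0
    return x ** (alpha - 1) * math.exp(-x / beta) / (beta ** alpha * math.gamma(alpha))


def integrate(f, lo, hi, n=20000):
    """Midpoint-rule approximation to the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


# Gamma function notes: Gamma(3) = 2! and Gamma(1/2) = sqrt(pi).
print(math.gamma(3), math.gamma(0.5), math.sqrt(math.pi))

# The PDF should integrate to (approximately) 1; the upper limit 200
# is far enough into the tail for these illustrative parameter values.
alpha, beta = 2.0, 3.0
area = integrate(lambda x: gamma_pdf(x, alpha, beta), 0.0, 200.0)
print(area)
```

Trying a larger β with the same α spreads the PDF out, while the computed area should stay near 1.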
In Maple, the gamma PDF can be programmed in as

    > assume(x>0);
    > assume(alpha>0);
    > assume(beta>0);
    > about(x, alpha, beta);
    Originally x, renamed x~:
      is assumed to be: RealRange(Open(0),infinity)
    Originally alpha, renamed alpha~:
      is assumed to be: RealRange(Open(0),infinity)
    Originally beta, renamed beta~:
      is assumed to be: RealRange(Open(0),infinity)

    > f(x):=1/(beta^alpha*GAMMA(alpha))*x^(alpha-1)*exp(-x/beta);
      f(x~) := x~^(alpha~ - 1) exp(-x~/beta~) / (beta~^alpha~ GAMMA(alpha~))
    > simplify(int(f(x),x=0..infinity));
      1

There are easier ways to use the gamma PDF in Maple that will be discussed later.

Below are a few comparative plots (gamma.xls). Notice the x- and y-axis scales are fixed for comparative purposes. Values of X could be greater than 24!

[Figure: seven gamma PDF plots, each with x from 0 to 24 and f(x) from 0 to 1. Panel titles:
 $\alpha=1$, $\beta=1$, $\mu=1$, $\sigma^2=1$
 $\alpha=1$, $\beta=2$, $\mu=2$, $\sigma^2=4$
 $\alpha=1$, $\beta=3$, $\mu=3$, $\sigma^2=9$
 $\alpha=2$, $\beta=1$, $\mu=2$, $\sigma^2=2$
 $\alpha=4$, $\beta=1$, $\mu=4$, $\sigma^2=4$
 $\alpha=4$, $\beta=2$, $\mu=8$, $\sigma^2=16$
 $\alpha=2.5$, $\beta=2.5$, $\mu=6.25$, $\sigma^2=15.625$]

Questions:
o What happens if $\alpha$ and/or $\beta$ are increased?
o What happens if $\alpha$ and/or $\beta$ are decreased?
o Why would someone want to use different values of $\alpha$ and/or $\beta$?

Theorem 6.3: The mean and variance of the gamma PDF are:

$$E(X) = \mu = \alpha\beta \quad \text{and} \quad \mathrm{Var}(X) = \sigma^2 = \alpha\beta^2.$$

pf:

$$\begin{aligned}
E(X) &= \int_0^\infty x \,\frac{1}{\beta^{\alpha}\Gamma(\alpha)}\, x^{\alpha-1} e^{-x/\beta}\,dx \\
&= \frac{1}{\beta^{\alpha}\Gamma(\alpha)} \int_0^\infty x^{(\alpha+1)-1} e^{-x/\beta}\,dx \\
&= \frac{\beta^{\alpha+1}\Gamma(\alpha+1)}{\beta^{\alpha}\Gamma(\alpha)} \int_0^\infty \frac{1}{\beta^{\alpha+1}\Gamma(\alpha+1)}\, x^{(\alpha+1)-1} e^{-x/\beta}\,dx
\end{aligned}$$

Notice that $\dfrac{1}{\beta^{\alpha+1}\Gamma(\alpha+1)}\, x^{(\alpha+1)-1} e^{-x/\beta}$ is a gamma PDF with $\alpha+1$ and $\beta$ as its parameters! Thus,

$$\int_0^\infty \frac{1}{\beta^{\alpha+1}\Gamma(\alpha+1)}\, x^{(\alpha+1)-1} e^{-x/\beta}\,dx = 1$$

and, using $\Gamma(\alpha+1) = \alpha\Gamma(\alpha)$,

$$E(X) = \frac{\beta^{\alpha+1}\Gamma(\alpha+1)}{\beta^{\alpha}\Gamma(\alpha)} \cdot 1 = \frac{\beta\,\alpha\,\Gamma(\alpha)}{\Gamma(\alpha)} = \alpha\beta.$$

A similar proof can be done for the variance.

Maple code:

    > E(X):=simplify(int(x*f(x), x=0..infinity));
      E(X) := alpha~ beta~
    > Var(X):=simplify(int((x-E(X))^2*f(x), x=0..infinity));
      Var(X) := alpha~ beta~^2

Examine what happens to the PDF as the values of $\mu$ and $\sigma^2$ change in the gamma PDF plots shown earlier.

Example: Distribution of lifetimes (gamma_actuary.xls)

Let X be a random variable denoting the lifetime of a person in a particular population. An actuary uses the PDF for X below to model the lifetimes of all people in this population:

$$f(x) = \frac{1}{15^2}\, x\, e^{-x/15} \quad \text{for } x > 0.$$

For this example, $\beta = 15$ and $\alpha = 2$.

[Figure: "Gamma PDF for actuary example", with x from 0 to 150 and f(x) from 0 to 0.03.]

In Maple, the plot is

    > plot(eval(f(x),[alpha=2,beta=15]), x=0..150, title="Gamma PDF, alpha=2, beta=15", labels=["x", "f(x)"]);

This particular PDF may not be realistic for what we would commonly perceive to be the distribution of lifetimes in the United States.

Questions:

What are the mean and variance?

The mean and variance are $\mu = \alpha\beta = 2 \cdot 15 = 30$ and $\sigma^2 = \alpha\beta^2 = 2 \cdot 15^2 = 450$. Thus, one would expect to live 30 years on average for this population.

What is the probability a person in the population lives longer than 80 years?

The probability can be found from

$$P(X > 80) = \int_{80}^\infty \frac{1}{15^2}\, x\, e^{-x/15}\,dx.$$

Notice that integration by parts would be needed here. If the integration is done in Maple,

    > P(X>80):=int(eval(f(x), [alpha=2,beta=15]), x=80..infinity);
      P(80 < X) := 19/3 exp(-16/3)
    > evalf(P(X>80),4);
      .03059

Also, note that P(X>80) = 1 - P(X<80) = 1 - F(80). Thus, the CDF can be used to find the probability. The GAMMADIST(x, $\alpha$, $\beta$, TRUE) function in Excel can simply be used here. Thus,

    =1-GAMMADIST(80,2,15,TRUE)

results in a value of 0.0306.
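The actuary calculation can also be checked without symbolic integration. The sketch below is in Python rather than the Maple or Excel used in these notes. It relies on a standard closed form that applies only when $\alpha$ is a positive integer: $P(X > x) = e^{-x/\beta} \sum_{k=0}^{\alpha-1} (x/\beta)^k / k!$. For $\alpha = 2$ and $\beta = 15$ this reduces to $(1 + 80/15)e^{-80/15} = (19/3)e^{-16/3}$, the expression Maple reports. The function name `gamma_survival_integer_shape` is an illustrative assumption.

```python
import math

def gamma_survival_integer_shape(x, alpha, beta):
    """P(X > x) for a gamma PDF whose shape alpha is a positive integer."""
    if x <= 0:
        return 1.0
    ratio = x / beta
    # Finite sum that results from repeated integration by parts.
    partial_sum = sum(ratio ** k / math.factorial(k) for k in range(alpha))
    return math.exp(-ratio) * partial_sum

alpha, beta = 2, 15
mean = alpha * beta            # 30
variance = alpha * beta ** 2   # 450
p_over_80 = gamma_survival_integer_shape(80, alpha, beta)

print(mean, variance, round(p_over_80, 4))  # 30 450 0.0306
```

For non-integer $\alpha$ this shortcut does not apply, and a numerical CDF routine (such as the Excel or Maple functions in these notes) would be needed instead.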
Using the stats package in Maple,

    > 1-stats[statevalf,cdf,gamma[2,15]](80);
      .0305770166

What is the median lifetime?

The value c needs to be found such that the probability of living less than c years is 0.5. Then we could use

$$P(X < c) = \int_0^c \frac{1}{15^2}\, x\, e^{-x/15}\,dx = 0.5$$

and solve for c. If the integration and solving is done in Maple,

    > solve(int(eval(f(x),[alpha=2, beta=15]), x=0..c) = 0.5, c);
      -11.52058571, 25.17520485

Of course, the positive value for c would be the answer.

The GAMMAINV(prob., $\alpha$, $\beta$) function can be used in Excel to find c. Thus,

    =GAMMAINV(0.5,2,15)

results in c = 25.18.

Using the stats package in Maple,

    > stats[statevalf,icdf, gamma[2,15]](0.5);
      25.17520485

There are a few important special cases of the gamma PDF. One of them is the exponential PDF.

Exponential PDF: The continuous random variable X has an exponential PDF, with parameter $\beta$, if its PDF is given by

$$f(x) = \begin{cases} \dfrac{1}{\beta}\, e^{-x/\beta} & \text{for } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

where $\beta > 0$.

Notes:
o This is the gamma PDF with $\alpha = 1$.
o In most realistic applications, $\beta$ will not be known and it will need to be estimated. How to do this will be discussed in future chapters.
o $\beta$ controls the scale of the PDF since most of its influence is on the spread of the PDF.
o In general, this is what a plot of the PDF looks like.

[Figure: plot of f(x) against x; the curve starts at height $1/\beta$ at x = 0 and decreases toward 0 as x increases.]

The height of the curve at a point $x_o$ is $f(x_o) = \dfrac{1}{\beta}\, e^{-x_o/\beta}$. Notice that when $x_o = 0$,

$$f(0) = \frac{1}{\beta}\, e^{-0/\beta} = \frac{1}{\beta} \cdot 1 = \frac{1}{\beta}$$

since $e^0 = 1$.

Theorem 6.3: The mean and variance for the exponential PDF are:

$$E(X) = \mu = \beta \quad \text{and} \quad \mathrm{Var}(X) = \sigma^2 = \beta^2.$$

pf: See the Chapter 4 examples with tire tread wear. Substitute $\beta$ in for 30. Also, see the proof used with the gamma PDF earlier and set $\alpha = 1$.

Example: Tire life (tire_wear.xls from Chapter 3)

The number of miles an automobile tire lasts before it reaches a critical point in tread wear can be represented by a PDF. Let X = the number of miles (in thousands) an automobile is driven before it reaches the critical tread wear point for one tire.
Suppose the PDF for X is

$$f(x) = \begin{cases} \dfrac{1}{\beta}\, e^{-x/\beta} & \text{for } x \ge 0 \\ 0 & \text{for } x < 0 \end{cases}$$

In Chapters 3 and 4, we used $\beta = 30$. Remember that we found in Chapter 4 that $E(X) = \mu = 30$ and $\mathrm{Var}(X) = \sigma^2 = 30^2$!

In the spreadsheet, different values of $\beta$ can be entered into the cell to see how it affects the PDF. Below is a screen capture of the spreadsheet. Note that the line on the plot should extend past x=225.

[Figure: screen capture of the tire_wear.xls spreadsheet.]

Questions:
o What happens if $\beta$ is increased? Explain why $\beta$ has this effect relative to it being called a "scale" parameter, $E(X) = \beta$, and $\mathrm{Var}(X) = \beta^2$.
o What happens if $\beta$ is decreased? Explain why $\beta$ has this effect relative to it being called a "scale" parameter, $E(X) = \beta$, and $\mathrm{Var}(X) = \beta^2$.
o Why would someone want to use different values of $\beta$?

Find the probability that a randomly selected tire will last (will not get to the critical tread wear point) longer than 30,000 miles.

In Chapter 3, we found the probability through integration:

$$P(X > 30) = \int_{30}^\infty \frac{1}{30}\, e^{-x/30}\,dx = e^{-1} \approx 0.3679$$

Using the relationship between the gamma and exponential PDFs, we can use the GAMMADIST() function:

    =1-GAMMADIST(30,1,30,TRUE)

Notes:
o Remember that if FALSE is used instead of TRUE in the function, then f(x) is given as a result (the height of the curve).
o Excel also has a function specifically for the exponential PDF: EXPONDIST(x,1/beta,TRUE), which finds F(x). Please note that 1/beta corresponds to what Excel defines as $\lambda$. Thus, Excel uses a PDF of

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

To find P(X>30), note that P(X>30) = 1 - P(X<30) = 1 - F(30). In Excel,

    =1-EXPONDIST(30,1/30,TRUE)

To avoid confusion with $\lambda = 1/\beta$, I recommend using the GAMMADIST() function instead.

Find the number of miles (in thousands) c such that 0.95 of all tires will reach the critical tread wear point before c.

In Chapter 3, we found the value of c as a solution to

$$0.95 = P(X < c) = \int_0^c \frac{1}{30}\, e^{-x/30}\,dx$$

The value of c was $30\ln(20) \approx 89.87$.
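Both tire calculations follow from the closed-form exponential CDF, $F(x) = 1 - e^{-x/\beta}$ for $x \ge 0$, and its inverse, $c = -\beta \ln(1 - p)$. A minimal sketch in Python (not the Excel or Maple used in these notes) is given below. The function names `exponential_cdf` and `exponential_quantile` are illustrative assumptions.

```python
import math

def exponential_cdf(x, beta):
    """F(x) = P(X < x) for an exponential PDF with scale parameter beta."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-x / beta)

def exponential_quantile(p, beta):
    """Value c with F(c) = p, found by solving 1 - exp(-c/beta) = p for c."""
    return -beta * math.log(1.0 - p)

beta = 30  # thousands of miles, as in the tire example
p_longer_than_30 = 1.0 - exponential_cdf(30, beta)
c_95 = exponential_quantile(0.95, beta)

print(round(p_longer_than_30, 4))  # 0.3679
print(round(c_95, 2))              # 89.87
```

Note that the functions take the scale $\beta$ directly, matching the parameterization of these notes, rather than the rate $\lambda = 1/\beta$ that Excel's EXPONDIST() expects.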
We can use the relationship between the gamma and exponential PDFs to find the same answer with

    =GAMMAINV(0.95,1,30)

Example: Exponential distribution with $\beta = 10/3$ (exp.xls)

[Figure: "Exponential PDF", with x from 0 to 18 and f(x) from 0 to 0.35.]

To find the probability P(2<X<4), find the area underneath that part of the plot. Note that P(2<X<4) = P(X<4) - P(X<2) = F(4) - F(2), since the area between 2 and 4 equals the area to the left of 4 minus the area to the left of 2.

The Excel functions to find the probability are

    =GAMMADIST(4,1,10/3,TRUE) - GAMMADIST(2,1,10/3,TRUE)

and the answer is 0.2476.

Final notes:
o Go back to Chapter 3 and examine example_sample_tire.xls. Notice how choosing $\beta = 30$ results in a very good fit of the PDF to the sampled values displayed in the histogram!
o If you are an engineering major, I recommend examining the Weibull PDF in Section 6.10.
o The chi-square PDF in Section 6.8 is an often used PDF which we will discuss later in the course.
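Returning to the exp.xls example with $\beta = 10/3$, the statement that P(2<X<4) is both an area under the PDF and a difference of CDF values can be checked numerically. The Python sketch below compares a midpoint-rule approximation of the area with F(4) - F(2). The function names and the number of steps are illustrative assumptions, and the midpoint rule is only one of many ways to approximate the area.

```python
import math

def exponential_pdf(x, beta):
    """Height of the exponential PDF with scale beta at x."""
    return math.exp(-x / beta) / beta if x >= 0 else 0.0

def exponential_cdf(x, beta):
    """F(x) = 1 - exp(-x/beta) for x >= 0, and 0 otherwise."""
    return 1.0 - math.exp(-x / beta) if x >= 0 else 0.0

def midpoint_area(f, lower, upper, steps=10_000):
    """Approximate the area under f between lower and upper."""
    width = (upper - lower) / steps
    return width * sum(f(lower + (i + 0.5) * width) for i in range(steps))

beta = 10 / 3
area_2_to_4 = midpoint_area(lambda x: exponential_pdf(x, beta), 2, 4)
cdf_difference = exponential_cdf(4, beta) - exponential_cdf(2, beta)

print(round(area_2_to_4, 4), round(cdf_difference, 4))  # 0.2476 0.2476
```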