Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Looking Ahead The Normal Distribution I Chapter 4 introduces the normal distribution as a probability distribution. I Chapter 5 culminates in the central limit theorem, the primary theoretical justification for most of the methods of statistical inference in the remainder of the textbook. Bret Larget Departments of Botany and of Statistics University of Wisconsin—Madison Statistics 371 23rd September 2005 The Normal Distribution The Normal Density Normal curves have the following bell-shaped, symmetric density. The Normal Distribution is the most important distribution of continuous random variables. I The normal density curve is the famous symmetric, bell-shaped curve. I The central limit theorem is the reason that the normal curve is so important. Essentially, many statistics that we calculate from large random samples will have approximate normal distributions (or distributions derived from normal distributions), even if the distributions of the underlying variables are not normally distributed. I The Central Limit Theorem is the basis of most of the methods of statistical inference we will study in the last half of the course. 1 y −µ 2 1 f (y ) = √ e− 2 ( σ ) σ 2π Parameters: The parameters of a normal curve are the mean µ and the standard deviation σ. Here is an example of a normal curve with µ = 100 and σ = 20. Normal Distribution mu = 100 , sigma = 20 Probability Density I 50 −3 100 −2 −1 0 150 1 2 3 Standardization The Standard Normal Table I All normal curves have the same essential shape. I Every normal curve can be drawn in exactly the same manner, just by changing labels on the axis. I The standard normal curve is the normal curve with mean µ = 0 and standard deviation σ = 1. I The standardization formula is Z= Y −µ σ I Every problem that asks for an area under a normal curve is solved by first finding an equivalent problem for the standard normal curve. I The justification for this comes from calculus. An area under a normal curve is a definite integral. The integral is simplified by using the standardization formula. The 68–95–99.7 Rule I The standard normal table lists the area to the left of z under the standard normal curve for each value from −3.49 to 3.49 by 0.01 increments. I The normal table is on the inside front cover of your textbook. I Numbers in the margins represent z. I Numbers in the middle of the table are areas to the left of z. R can do this for general values of z, and R can do the standardization for you. Example Calculations For every normal curve, 68% of the area is within one SD of the mean, 95% of the area is within two SDs of the mean, and 99.7% of the area is within three SDs of the mean. Normal Distribution mu = 0 , sigma = 1 Normal Distribution mu = 0 , sigma = 1 P( X < −1 ) = 0.1587 P( X > 1 ) = 0.1587 −4 −2 −3 −2 0 −1 0 2 1 2 4 3 P( X < −2 ) = 0.0228 P( X > 2 ) = 0.0228 −4 −2 −3 −2 0 −1 Suppose that egg shell thickness is normally distributed with a mean of 0.381 mm and a standard deviation of 0.031 mm. Here are a large number of example calculations. P( −3 < X < 3 ) = 0.9973 0 2 1 2 4 3 Probability Density P( −2 < X < 2 ) = 0.9545 Probability Density Probability Density P( −1 < X < 1 ) = 0.6827 Normal Distribution mu = 0 , sigma = 1 P( X < −3 ) = 0.0013 P( X > 3 ) = 0.0013 −4 −2 −3 −2 0 −1 0 2 1 2 4 3 Example Area Calculation Example Area Calculation Area to the left. Find the proportion of eggs with shell thickness less than 0.34 mm. Area between two values. Find the proportion of eggs with shell thickness between 0.34 and 0.36 mm. Normal Distribution mu = 0.381 , sigma = 0.031 P( X < 0.34 ) = 0.093 P( X > 0.34 ) = 0.907 0.25 0.30 −3 0.35 −2 −1 0.40 0 0.45 1 2 P( 0.34 < X < 0.36 ) = 0.1561 Probability Density Probability Density Normal Distribution mu = 0.381 , sigma = 0.031 0.50 P( X < 0.34 ) = 0.093 P( X > 0.36 ) = 0.7509 0.25 3 0.30 −3 Example Area Calculation 0.35 −2 0.35 −1 3 0.40 0 0.45 1 2 0.50 3 P( 0.32 < X < 0.4 ) = 0.7055 Probability Density Probability Density P( X > 0.36 ) = 0.7509 −2 2 Normal Distribution mu = 0.381 , sigma = 0.031 P( X < 0.36 ) = 0.2491 −3 1 0.50 Area outside two values. Find the proportion of eggs with shell thickness smaller than 0.32 mm or greater than 0.40 mm. Normal Distribution mu = 0.381 , sigma = 0.031 0.30 0 0.45 Example Area Calculation Area to the right. Find the proportion of eggs with shell thickness more than 0.36 mm. 0.25 −1 0.40 P( X < 0.32 ) = 0.0245 P( X > 0.4 ) = 0.27 0.25 0.30 −3 0.35 −2 −1 0.40 0 0.45 1 2 0.50 3 Example Area Calculation Quantiles Central area. Find the proportion of eggs with shell thickness within 0.05 mm of the mean. Normal Distribution mu = 0.381 , sigma = 0.031 Quantile calculations ask you to use the normal table backwards. You know the area but need to find the point or points on the horizontal axis. Probability Density P( 0.331 < X < 0.431 ) = 0.8932 P( X < 0.331 ) = 0.0534 P( X > 0.431 ) = 0.0534 0.25 0.30 −3 0.35 −2 −1 0.40 0 0.45 1 2 0.50 3 Example Area Calculation Example Quantile Calculations Two-tail area. Find the proportion of eggs with shell thickness more than 0.07 mm from the mean. Percentile. What is the 90th percentile of the egg shell thickness distribution? Normal Distribution mu = 0.381 , sigma = 0.031 Probability Density P( 0.311 < X < 0.451 ) = 0.9761 P( X < 0.311 ) = 0.012 P( X > 0.451 ) = 0.012 Probability Density Normal Distribution mu = 0.381 , sigma = 0.031 P( X < 0.4207 ) = 0.9 z = 1.28 0.25 0.30 −3 0.35 −2 −1 0.40 0 0.45 1 2 0.50 3 0.25 0.30 −3 0.35 −2 −1 0.40 0 0.45 1 2 0.50 3 Example Quantile Calculations Using R Upper cut-off point. What value cuts off the top 15% of egg shell thicknesses? Normal Distribution mu = 0.381 , sigma = 0.031 You can use R functions pnorm and qnorm for the previous calculations. Area to the left. Find the proportion of eggs with shell thickness less than 0.34 mm. > pnorm(0.34, 0.381, 0.031) Probability Density [1] 0.09298744 Area to the right. Find the proportion of eggs with shell thickness more than 0.36 mm. P( X < 0.4131 ) = 0.85 > 1 - pnorm(0.36, 0.381, 0.031) [1] 0.75093 z = 1.04 0.25 0.30 −3 0.35 −2 −1 0.40 0 0.45 1 2 0.50 Area between two values. Find the proportion of eggs with shell thickness between 0.34 and 0.36 mm. > pnorm(0.36, 0.381, 0.031) - pnorm(0.34, 0.381, 0.031) 3 [1] 0.1560825 Example Quantile Calculations More R Calculations Central cut-off points. The middle 75% egg shells have thicknesses between which two values? Normal Distribution mu = 0.381 , sigma = 0.031 Area outside two values. Find the proportion of eggs with shell thickness smaller than 0.32 mm or greater than 0.40 mm. > pnorm(0.32, 0.381, 0.031) + 1 - pnorm(0.4, 0.381, 0.031) [1] 0.2945190 Probability Density Central area. Find the proportion of eggs with shell thickness within 0.05 mm of the mean. > 1 - 2 * pnorm(0.381 - 0.05, 0.381, 0.031) P( X < 0.4167 ) = 0.875 [1] 0.8932345 Two-tail area. Find the proportion of eggs with shell thickness more than 0.07 mm from the mean. z = 1.15 0.25 0.30 0.35 0.40 0.45 0.50 > 2 * pnorm(0.381 - 0.07, 0.381, 0.031) [1] 0.02394164 −3 −2 −1 0 1 2 3 More R Calculations What is a Normal Probability Plot? Percentile. What is the 90th percentile of the egg shell thickness distribution? I A normal probability plot is a plot of the sorted sample data versus somethng close to the expected z-score for the corresponding rank of a random normal sample of the same size. I For example, the expected z-score of the minumum from a random sample of size 10 from a normal population is about −1.54. > qnorm(0.9, 0.381, 0.031) [1] 0.4207281 Upper cut-off point. What value cuts off the top 15% of egg shell thicknesses? > qnorm(0.85, 0.381, 0.031) [1] 0.4131294 Central cut-off points. The middle 75% egg shells have thicknesses between which two values? I If the plotted points are close to a straight line, there is evidence that the distribution is close to normal. > qnorm(c(0.25/2, 1 - 0.25/2), 0.381, 0.031) I If the plotted points are far from a straight line, there is evidence of non-normality. [1] 0.3453392 0.4166608 −2 −1 0 1 2 −2 −1 0 1 Theoretical Quantiles Theoretical Quantiles Histogram of x Histogram of x Histogram of x −1 0 x 1 2 2 Frequency 6 −2 −1 0 x 1 2 0 2 4 6 8 8 12 10 −2 1 2 2 0 −2 −2 1 −1 Sample Quantiles 1 0 −1 Sample Quantiles 2 1 0 −1 0 4 There is no easy answer to the question, but a normal probability plot is much more informative than the result of a test. −1 Theoretical Quantiles Frequency I −2 2 It is generally better to make a plot that sheds light on the question, “Is the data so far from normality as to bias a method that assumes normality?” Normal Q−Q Plot 0 I Sample Quantiles If given data, there are tests for normality, but there are reasons not to do these. −2 I Normal Q−Q Plot 15 Sometimes we can rely on the central limit theorem. Normal Q−Q Plot 10 I All three plots are normal with n = 50. 5 A standard question begins, “Assuming that variable Y has a normal distribution,. . . ”. But how do we know the distribution is approximately normal? Frequency I Normal Example 0 Normal Probability Plots −2 −1 0 x 1 2 Normal Example The Continuity Correction All three plots are normal with n = 500. 0 1 2 3 −3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 Theoretical Quantiles Theoretical Quantiles Histogram of x Histogram of x Histogram of x 0 1 2 These calculations are more accurate when the boundaries of normal calculations match midpoints between possible values of the discrete distribution. I Approximations to the binomial p distribution use a normal curve with µ = np and σ = np(1 − p). 3 100 I 50 Frequency 0 −1 I 0 Frequency 20 40 60 80 60 40 20 −2 The normal distribution can also be used to compute approximate probabilities of discrete distributions. 150 Theoretical Quantiles 0 −3 I −3 −3 −1 −1 Sample Quantiles 3 2 1 −1 Sample Quantiles 2 1 −1 0 Sample Quantiles −3 −2 80 −3 Frequency Normal Q−Q Plot 1 2 3 4 Normal Q−Q Plot 3 Normal Q−Q Plot 3 −3 −2 −1 x 0 1 2 3 −4 −2 x 0 2 4 x Non-normal Example When approximating a binomial p probability, the approximation is usually pretty good when np(1 − p) > 3. Example The first plot is not normal, the second two are with n = 50. Normal Q−Q Plot Normal Q−Q Plot 1 −2 −1 0 1 0.5 2 −2 −1 0 1 Theoretical Quantiles Histogram of x Histogram of x Histogram of x 6 x 8 6 4 Frequency 0 2 0 2 4 6 8 −2 −1 0 x 1 2 −2 −1 0 x Algebraically, Pr {3 ≤ Y ≤ 7} = Pr {2 < Y < 8} 8 10 Theoretical Quantiles 4 Here is an example with the binomial distribution with n = 15 and p = 0.4 where we want to compute Pr {3 ≤ Y ≤ 7}. 2 Theoretical Quantiles 12 2 I −1.5 2 Frequency 0 −0.5 Sample Quantiles 1.5 0.5 −1.5 0 0 2 4 6 8 Frequency −0.5 Sample Quantiles 8 6 4 2 Sample Quantiles 0 −1 12 −2 I 1.5 Normal Q−Q Plot 1 2 I Which end points should we use? I We get a better answer if we use 2.5 and 7.5. Example Example (cont.) I The exact calculation is: Pr {3 ≤ Y ≤ 7} = 7 X y =3 15! . (0.4)y (0.6)15−y = 0.7598 y !(15 − y )! If Y has a binomial probability with n = 500 and p = 0.2, what is the probability of 110 or more successes? Exact: > 1 - pbinom(109, 500, 0.2) [1] 0.1443539 I I The approximate calculation uses µ = 6 and σ = 1.9. ¾ ½ 7−6 3−6 ≤Z ≤ Pr {3 ≤ Y ≤ 7} ≈ Pr 1.9 1.9 . = 0.7524 This approximation is more accurate than using 3 and 7 or 2 and 8 as end points. Continuity Correction Here is a picture that shows why the continuity correction helps. 0.15 0.10 0.05 0.00 0.00 0.05 0.10 0.15 0.20 Approximate = 0.7529 0.20 Exact = 0.7598 0 2 4 6 8 10 12 14 0 2 4 6 8 10 12 14 Normal approximation: > mu = 500 * 0.2 > sigma = sqrt(500 * 0.2 * 0.8) > 1 - pnorm(109.5, mu, sigma) [1] 0.1440878 With R, do the exact calculation. With a calculator and normal table, use the normal approximation.