Stat 3000 – Statistics for Scientists and Engineers
Dr. Corcoran, Fall 2005

IV. The Normal Distribution

The normal distribution (a.k.a. the Gaussian distribution or “bell curve”) is by far the best-known probability distribution. Its discovery has had such a far-reaching impact on modeling quantitative phenomena across the physical, social, and biological sciences that its founder has even found his way onto a major currency (before the Euro).

Utility of the Normal Distribution

The normal distribution has such broad applicability in part because

• phenomena in the natural world that result from the interaction of many environmental and genetic factors tend to follow the normal distribution (e.g., height, weight, measurable intelligence).
• the sums and averages of random samples have distributions that look roughly normal – as the sample size gets larger, the normal approximation gets better. This result is known as the Central Limit Theorem. As we will discuss later, this applies even to samples of categorical variables!

Characteristics of the Normal Distribution

[Figure: normal density curve, with the horizontal axis marked at µ and at one, two, and three σ on either side]

• Symmetric (about the mean), unimodal, bell-shaped. If X ~ N(µ, σ²) – where µ is the mean and σ² is the variance – then the density function of X is given by

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}, \quad \text{for } -\infty < x < \infty.$$

• Of the subjects in a normally distributed population, 68.3% lie within one standard deviation of the mean, 95.4% lie within 2 s.d.’s, and 99.7% lie within 3 s.d.’s.

Computing Probabilities using the Normal Distribution

Recall that for a continuous random variable X with probability density function (pdf) f(x), the value f(x) is not P(X = x). That is, the pdf does not yield probabilities directly, as a discrete probability mass function does.
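A quick numerical illustration of this point, offered as a minimal sketch rather than part of the original handout. It uses `statistics.NormalDist` from the Python standard library; the narrow distribution with σ = 0.1 is an arbitrary assumption, chosen only so that the density height exceeds 1. The height of the density at a point is not a probability, whereas the area over an interval is.

```python
from statistics import NormalDist

# Illustrative assumption: a narrow normal distribution with mean 0 and s.d. 0.1.
narrow = NormalDist(mu=0.0, sigma=0.1)

# The density at the mean is larger than 1, so it cannot be a probability.
height_at_mean = narrow.pdf(0.0)


def interval_probability(dist, a, b):
    """P(a < X < b), computed as the area under the density via the CDF."""
    return dist.cdf(b) - dist.cdf(a)


# Area within one standard deviation of the mean.
within_one_sd = interval_probability(narrow, -0.1, 0.1)

print(f"density at the mean: {height_at_mean:.3f}")
print(f"P(-0.1 < X < 0.1):   {within_one_sd:.3f}")
```

The same `interval_probability` helper (a hypothetical name) applied to a standard normal over (–2, 2) and (–3, 3) gives the two- and three-s.d. areas.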
Technically, for a continuous random variable X, P(X = x) = 0. However, we can compute probabilities over intervals of X – that is, the probability that X lies between two numbers a and b is equal to the area under the density curve between a and b.

[Figure: density curve with the points a and b marked on the horizontal axis]

To this point, we have computed areas under a density curve by using integration. However, since the normal density (i) cannot be integrated in closed form and (ii) is used by researchers with access to modern computing tools, probabilities based on the normal distribution are obtained using tables or computer software.

A normal probability table looks something like what is shown on the last pages of this handout (reproduced from pages 881-882 of your text). Such a table is based on the standard normal distribution, that is, the normal distribution with mean 0 and variance 1.

Using this table, what is the probability that a randomly sampled N(0,1) variable is less than 1.34?
Less than –0.28?
Between –2.54 and 1.68?
For what x does P(Z < x) = 0.975?

Standardizing a N(µ, σ²) Random Variable

How do we compute probabilities for a normal distribution with arbitrary mean and variance using a standard normal table?

Another important property of the normal distribution is that if X ~ N(µ, σ²), then any linear function of X is also normally distributed. That is, if Y = a + bX for arbitrary constants a and b (with b ≠ 0), then Y ~ N(a + bµ, b²σ²).

If we define Z = (X – µ)/σ, then (using the notation above) a = –µ/σ and b = 1/σ, so that Z ~ N(0,1). Computing Z is called “standardizing” X. Once we’ve converted X into standard units, we can compute probabilities over intervals of X by using the standard normal – or “Z” – distribution.

Example IV.A

Data from a study of king crabs on Kodiak Island, AK (carried out by the Alaska Department of Fish and Game) show that male crab length is normally distributed with a mean of 134.7 mm and a standard deviation of 25.5 mm.

What proportion of the male crab population on Kodiak Island is less than 140 mm?
What proportion is between 100 and 140 mm?
What is the probability that a randomly selected male crab will measure at least 170 mm?
What is the 75th percentile of this population? The 99th percentile?

Sums of Normally Distributed Random Variables

Yet another interesting feature of the normal distribution is that sums of independent normally distributed variables are also normally distributed. Suppose we have two independent random variables X₁ and X₂ such that X₁ ~ N(µ₁, σ₁²) and X₂ ~ N(µ₂, σ₂²), and we define Y = c₁X₁ + c₂X₂, where c₁ and c₂ are constants. Then Y ~ N(c₁µ₁ + c₂µ₂, c₁²σ₁² + c₂²σ₂²).

Distribution of the Sample Mean

Recall from our discussion of random variables that if we sample n subjects X₁,…,Xₙ at random from a population with an underlying expected value of µ and variance of σ², then the expectation of the sample mean X̄ is µ, and the variance of X̄ is σ²/n. From the previous slide, we can see further that if the sample comes from a normally distributed population, then X̄ ~ N(µ, σ²/n).

Example IV.B

Consider again the population of Kodiak crabs discussed in Example IV.A. Suppose that we randomly sample 20 specimens from this population.

What is the probability that the sample mean will lie between 124.7 and 144.7 mm?
Compute an interval centered at the mean µ such that a sample average of 20 male crabs will lie within that interval with 95% probability.
What sample size is required to reduce the total width of this interval to 20 mm?

The Central Limit Theorem

Suppose that we have a random sample X₁, X₂,…, Xₙ from some distribution with mean µ and variance σ². If n is sufficiently large, then the sample mean X̄ is approximately distributed as N(µ, σ²/n). This is true even if the underlying population is not normal – the approximation improves as n grows. We refer to this result as the Central Limit Theorem, or CLT. It is one of the most remarkable results in mathematical statistics.

The CLT applies even to samples from some discrete distributions, including the binomial and Poisson distributions.

Example IV.C

Let’s carry out an experiment to demonstrate the power of the Central Limit Theorem. Suppose we have a random variable Xᵢ that represents the ith flip of a coin, for i = 1,…,n. We will assume Xᵢ = 1 for heads and Xᵢ = 0 for tails. The mass function for Xᵢ is given by

x            0    1
P(Xᵢ = x)    ½    ½

Hence, in this case µ = ½ and σ² = ½(1 – ½) = ¼. Note that the sample mean X̄ is just the proportion of flips out of n tries that turn up heads.

Note further that the underlying distribution is not at all normal: it’s a binary distribution with just two points of probability mass, at zero and one. The density of the normal distribution, by contrast, is continuous, unimodal, bell-shaped, symmetric, and defined over the entire real line. However, the CLT claims that if n is sufficiently large, the distribution of X̄ will be approximately normal. What will the mean and variance of this distribution be?

Take a coin, and flip it 30 times. Record the results of the flips in order!
What is the proportion of your first 10 flips that were heads?
What is the proportion of the first 20 flips (the first 10 combined with the second 10) that were heads?
What is the proportion of all 30 that were heads?

Plot below the distribution of sample proportions for the whole class, with n = 10, n = 20, and n = 30:

[Blank plot, n = 10: horizontal axis from 0.0 to 1.0 in steps of 0.1]
[Blank plot, n = 20: horizontal axis from 0.05 to 0.95 in steps of 0.10]
[Blank plot, n = 30: horizontal axis from 0.05 to 0.95 in steps of 0.10]

Answer the following questions:

How does the shape of the distribution of the sample proportion change as n gets larger?
What is the estimated mean (for the whole class) of the distribution of the sample proportion X̄ when n = 10? When n = 20? When n = 30?
What is the estimated variance (for the whole class) of the distribution of the sample proportion X̄ when n = 10? When n = 20? When n = 30?
Compare the estimated means and variances to our predictions.

Approximating the Binomial and Poisson Distributions

In light of the CLT, it’s not surprising that the normal distribution can provide a fairly accurate approximation – under certain (not necessarily uncommon) circumstances – of binomial and Poisson probabilities. Can you explain why?

For example, the plots below superimpose normal curves on binomial distributions with different values of n and p.
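As a numerical companion to this comparison, here is a minimal sketch that measures how far a normal approximation strays from exact binomial probabilities. It uses only the Python standard library. The continuity correction (evaluating the normal CDF at k + 0.5) and the helper names are assumptions of this sketch and are not taken from the handout.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def normal_approx_cdf(k, n, p):
    """Approximate P(X <= k) using N(np, np(1-p)).

    Assumption of this sketch: a continuity correction of +0.5 is applied.
    """
    approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    return approx.cdf(k + 0.5)


def max_abs_error(n, p):
    """Largest gap between the exact and approximate CDF over k = 0..n."""
    return max(
        abs(binomial_cdf(k, n, p) - normal_approx_cdf(k, n, p))
        for k in range(n + 1)
    )


# The (n, p) pairs are the ones named in the handout's plots.
for n, p in [(10, 0.50), (10, 0.10), (100, 0.10)]:
    print(f"n = {n:3d}, p = {p:.2f}: largest CDF error = {max_abs_error(n, p):.4f}")
```

Comparing the printed errors across the three settings gives one way to approach the question of which binomial distributions the normal curve fits best.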
For what sorts of binomial distributions will the normal distribution prove more accurate?

[Figure: three panels of binomial distributions with superimposed normal curves: n = 10, p = 0.50; n = 10, p = 0.10; n = 100, p = 0.10]

Example IV.D

What are the mean and variance of the number of sunbathing lizards in Example III.E? Use a normal approximation to compute P(X ≥ 10), where X represents the number of lizards observed in the sun out of the 60 lizards sampled.

Example IV.E

What are the mean and variance of the number of ship damage incidents over a period of 5 years in Example III.H? Use a normal approximation to compute the probability that a given ship is damaged at least once during its next 5 years of service.

How accurate is the normal approximation in these last two examples?

Resources on the Web

There are many applets available via the internet that demonstrate the Central Limit Theorem. A simple example using dice is found at http://www.amstat.org/publications/jse/v6n3/applets/CLT.html. You can find a more interesting example at http://www.ruf.rice.edu/~lane/stat_sim/index.html.

You can also easily access web-based CDF calculators for the normal distribution, as well as for other distributions related to the normal (more on those later). For example, this website computes probabilities for a variety of common distributions: http://www.stat.berkeley.edu/~stark/Java/Html/ProbCalc.htm.

The Chi-Square Distribution

A related distribution that we will use later this semester is the chi-square distribution. If Z is a standard normal random variable, then Z² is a chi-square random variable with 1 degree of freedom, denoted χ₁². The sum of n independent χ₁² random variables follows a chi-square distribution with n degrees of freedom, denoted by χₙ².
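The construction just described can be checked by simulation. The sketch below is illustrative only: it builds chi-square draws as sums of squared independent standard normal draws, using the Python standard library. The comparison values (mean n and variance 2n) are standard facts about the chi-square distribution that this handout does not state; the seed, sample size, and function name are arbitrary assumptions.

```python
import random
from statistics import fmean, variance


def simulate_chi_square(df, rng):
    """One draw from chi-square(df): the sum of df squared independent N(0,1) draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))


rng = random.Random(3000)  # arbitrary seed, used only for reproducibility
df = 5
draws = [simulate_chi_square(df, rng) for _ in range(20_000)]

print(f"sample mean:     {fmean(draws):.2f}  (theory: {df})")
print(f"sample variance: {variance(draws):.2f}  (theory: {2 * df})")
print(f"smallest draw:   {min(draws):.3f}  (a sum of squares is never negative)")
```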
A chi-square random variable takes only nonnegative values, and its distribution is positively skewed. For example, the pdf for the chi-square distribution with five degrees of freedom looks something like this:

[Figure: pdf of the chi-square distribution with 5 degrees of freedom]

The t Distribution

Another distribution related to the normal – and one upon which we will rely heavily – is the t distribution. If Z is a standard normal random variable, and X² is an independent χₙ² random variable, then the random variable

$$T = \frac{Z}{\sqrt{X^2/n}}$$

follows a t distribution with n degrees of freedom.

A t distribution actually looks quite similar to the standard normal distribution: its mean is zero, and it is unimodal, bell-shaped, and symmetric. One distinction is that the variability of the t distribution is slightly greater than that of the Z distribution. As n gets very large, however, the t distribution converges to (i.e., is nearly indistinguishable from) a Z distribution.

The F Distribution

The third related distribution that we will use is the F distribution. If U and V are independent χₙ² and χₘ² random variables, respectively, then the variable

$$F = \frac{U/n}{V/m}$$

follows an F distribution with n and m degrees of freedom. We denote this distribution by $F_{n,m}$. The F distribution takes only nonnegative values and is positively skewed. For example, the pdf for the $F_{5,10}$ distribution looks something like this:

[Figure: pdf of the F distribution with 5 and 10 degrees of freedom]

Example IV.F

Note that we cannot tabulate the χ², t, and F distributions in the same way that we do for the Z distribution – there are infinitely many distributions in each of these families (one for each value of the degrees of freedom). Instead of areas under the curve, then, you are given tables in your textbook that contain quantiles from a given χ², t, or F distribution.
For example, the χ² table in the back of your book looks something like what you see on the following slide. Each row corresponds to a value for the degrees of freedom, and each column corresponds to a right-tail area. Hence, the value that cuts off a right-tail area of 0.95 under the χ₆² distribution is 1.635. We denote this by $\chi^2_{0.95,6}$.

Chi-square Table

Example IV.F, cont’d

Find the values of $\chi^2_{0.025,20}$ and $\chi^2_{0.10,11}$.
Find the values of $t_{0.05,15}$ and $t_{0.025,30}$.
Find the values of $F_{0.05,5,10}$ and $F_{0.10,9,20}$.

In Review…

Let’s take a breath and summarize some very important points.

We now have laid a foundation that allows us to describe and analyze data. From this point forward, we will focus on sampling data and making inferences about the underlying population based on that sample.

We denote our sample by X₁, X₂,…, Xₙ. The underlying mean for this population is µ and the variance is σ².

In practice, we are often interested in µ and σ², although we don’t know what they are. That’s why we’re gathering the data. We will therefore focus much attention on inferring something about population quantities (such as µ, for example) based on the sampled data.

TAKE THESE FACTS WITH YOU!

1. X̄ is a random variable: its distribution has a mean of µ and a variance of σ²/n.
2. If the underlying population is normally distributed, then X̄ is normally distributed.
3. Even if the underlying population is not normally distributed, the Central Limit Theorem tells us that for sufficiently large sample size n, X̄ will be approximately normally distributed.

Standard Normal Table
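Entries of a standard normal table can also be regenerated numerically. The sketch below is a hedged illustration using `statistics.NormalDist` from the Python standard library. The function names `phi`, `prob_below`, and `table_rows` are hypothetical, and the table layout (rows by tenths, columns by hundredths) is an assumption about how such tables are usually arranged, not a copy of the textbook's table. It also shows the standardizing step Z = (X – µ)/σ with the mean and standard deviation from Example IV.A.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def phi(z):
    """Cumulative probability P(Z < z) for Z ~ N(0, 1), i.e. one table entry."""
    return STANDARD_NORMAL.cdf(z)


def prob_below(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma^2): standardize first, then look up Z."""
    z = (x - mu) / sigma
    return phi(z)


def table_rows(z_start, z_stop):
    """Rows of a z-table for nonnegative z.

    Each row holds z to one decimal place; its ten entries add 0.00..0.09.
    """
    rows = []
    tenths = round(z_start * 10)
    while tenths <= round(z_stop * 10):
        base = tenths / 10
        rows.append((base, [phi(base + h / 100) for h in range(10)]))
        tenths += 1
    return rows


# Print a small excerpt of the table, z = 1.0 to 1.9.
print("   z " + " ".join(f"{h / 100:6.2f}" for h in range(10)))
for base, entries in table_rows(1.0, 1.9):
    print(f"{base:4.1f} " + " ".join(f"{e:.4f}" for e in entries))

# Inverse lookup: the z for which P(Z < z) = 0.975.
z_975 = STANDARD_NORMAL.inv_cdf(0.975)
print(f"P(Z < z) = 0.975 at z = {z_975:.2f}")

# Standardizing with the numbers from Example IV.A (mean 134.7 mm, s.d. 25.5 mm).
print(f"P(length < 140 mm) = {prob_below(140, 134.7, 25.5):.4f}")
```

Software replaces the table lookup but not the reasoning: the standardizing step inside `prob_below` is the same one carried out by hand before consulting the printed table.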