Stat 5100 – Linear Regression and Time Series
Dr. Corcoran, Spring 2012

II. The Normal Distribution

The normal distribution (a.k.a. the Gaussian distribution or "bell curve") is by far the best-known random distribution. Its discovery has had such a far-reaching impact on the modeling of quantitative phenomena across the physical, social, and biological sciences that its founder even found his way onto a major currency (before the Euro).

Utility of the Normal Distribution

The normal distribution has such broad applicability in part because
• phenomena in the natural world that result from the interaction of many environmental and genetic factors tend to follow the normal distribution (e.g., height, weight, measurable intelligence).
• the sums and averages of random samples have distributions that look roughly normal – as the sample size gets larger, the normal approximation gets better. This result is known as the Central Limit Theorem. As we will discuss later, this applies even to samples of categorical variables!

Characteristics of the Normal Distribution

• Symmetric (about the mean), unimodal, bell-shaped. If X ~ N(μ, σ²) – where μ is the mean and σ² is the variance – then the density function of X is given by

  f(x | μ, σ²) = (1 / (σ√(2π))) exp{ −(x − μ)² / (2σ²) },  for −∞ < x < ∞.

• Of the subjects in a normally distributed population, 68.3% lie within one standard deviation of the mean, 95.4% lie within 2 s.d.'s, and 99.7% lie within 3 s.d.'s.

Computing Probabilities using the Normal Distribution

Recall that for a continuous random variable X with probability density function (pdf) f(x), one cannot compute P(X = x). That is, the pdf does not yield probabilities the way a discrete probability mass function does.
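The 68.3 / 95.4 / 99.7 figures above can be checked numerically. A minimal sketch, assuming SciPy is available (this code is illustrative and not part of the original slides); because these probabilities are the same for every N(μ, σ²), they can be computed on the standard normal:

```python
# Sketch (assumption: SciPy is installed; not part of the original slides).
from scipy.stats import norm  # the standard normal N(0, 1)

# P(mu - k*sigma < X < mu + k*sigma) does not depend on mu or sigma,
# so it can be computed from the standard normal CDF.
for k in (1, 2, 3):
    prob = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} s.d. of the mean: {prob:.3f}")
```

The three printed probabilities match the 68.3%, 95.4%, and 99.7% figures quoted in the slide.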
Technically, for a continuous random variable X, P(X = x) = 0. However, we can compute probabilities over intervals of X – that is, the probability that X lies between two numbers a and b is equal to the area under the density curve between a and b.

[Figure: shaded area under the density curve between a and b]

To this point, we have computed areas under a density curve by using integration. However, since the normal density cannot be integrated in closed form, probabilities based on the normal distribution are instead obtained using tables or computer software.

A normal probability table looks something like what is shown in the handout (reproduced from the Kutner text for this course). Such a table is based on the standard normal distribution, i.e., the normal distribution with mean zero and variance 1.

Using this table, what is the probability that a randomly sampled N(0,1) variable is less than 1.34? Less than –0.28? Between –2.54 and 1.68? For what x does P(Z < x) = 0.975?

Standardizing a N(μ,σ²) Random Variable

How do we compute probabilities for a normal distribution with arbitrary mean and variance using a standard normal table? Another unique aspect of the normal distribution is that if we have X ~ N(μ,σ²), then any linear function of X is also normally distributed. That is, if we have Y = a + bX for arbitrary constants a and b, then Y ~ N(a + bμ, b²σ²).

If we define Z = (X – μ)/σ, then (using the notation above) a = –μ/σ and b = 1/σ, so that Z ~ N(0,1). Computing Z is called "standardizing" X. Once we've converted X into standard units, we can compute probabilities over intervals of X by using the standard normal – or "Z" – distribution.
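As a software alternative to the printed table, the four table questions above can be answered directly. A sketch assuming SciPy (illustrative; not part of the course materials):

```python
# Sketch (assumes SciPy; a software substitute for the standard normal table).
from scipy.stats import norm

print(norm.cdf(1.34))                    # P(Z < 1.34)
print(norm.cdf(-0.28))                   # P(Z < -0.28)
print(norm.cdf(1.68) - norm.cdf(-2.54))  # P(-2.54 < Z < 1.68)
print(norm.ppf(0.975))                   # the x with P(Z < x) = 0.975
```

Here `norm.cdf` plays the role of the table's body (areas to the left of a cutoff), while `norm.ppf` inverts it; the last line recovers the familiar value 1.96.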
Example II.A

Data from a study of king crabs on Kodiak Island, AK (carried out by the Alaska Department of Fish and Game) show that male crab length is normally distributed with a mean of 134.7 mm and a standard deviation of 25.5 mm.

What proportion of the male crab population on Kodiak Island is less than 140 mm? What proportion is between 100 and 140 mm? What is the probability that a randomly selected male crab will measure at least 170 mm? What is the 75th percentile of this population? The 99th percentile?

Sums of Normally Distributed Random Variables

Yet another interesting feature of the normal distribution is that sums of independent normally distributed variables are also normally distributed. Suppose we have two independent random variables X1 and X2 such that X1 ~ N(μ1, σ1²) and X2 ~ N(μ2, σ2²), and we define Y such that Y = c1X1 + c2X2, where c1 and c2 are constants. Then Y ~ N(c1μ1 + c2μ2, c1²σ1² + c2²σ2²).

Distribution of the Sample Mean

Recall from our discussion of random variables that if we sample n subjects X1,…,Xn at random from a population with an underlying expected value of μ and variance of σ², then the expectation of the distribution of the sample mean X̄ is μ, and the variance of X̄ is σ²/n. From the previous slide, we can see further that if the sample comes from a normally distributed population, then X̄ ~ N(μ, σ²/n).

Example II.B

Consider again the population of Kodiak crabs discussed in Example II.A. Suppose that we randomly sample 20 specimens from this population. What is the probability that the sample mean will lie between 124.7 and 144.7 mm?
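As a hedged illustration (assuming SciPy; the computed values below come from the code, not from the slides), Examples II.A and II.B can be worked either by standardizing or, equivalently, by parameterizing the normal distribution directly:

```python
# Sketch (assumes SciPy; numbers are computed, not quoted from the slides).
from math import sqrt
from scipy.stats import norm

mu, sigma = 134.7, 25.5               # male crab length, mm (Example II.A)
crab = norm(loc=mu, scale=sigma)      # X ~ N(mu, sigma^2)

print(crab.cdf(140))                  # P(X < 140)
print(crab.cdf(140) - crab.cdf(100))  # P(100 < X < 140)
print(1 - crab.cdf(170))              # P(X >= 170)
print(crab.ppf(0.75), crab.ppf(0.99)) # 75th and 99th percentiles

# Example II.B: the mean of n = 20 crabs is N(mu, sigma^2 / 20).
xbar = norm(loc=mu, scale=sigma / sqrt(20))
print(xbar.cdf(144.7) - xbar.cdf(124.7))  # P(124.7 < Xbar < 144.7)
```

Note that `crab.cdf(140)` is exactly the standardized lookup Φ((140 − 134.7)/25.5); the library simply performs the standardization internally.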
Compute an interval centered at the mean μ such that a sample average of 20 male crabs will lie within that interval with 95% probability. What sample size is required to reduce the total width of this interval to 20 mm?

The Central Limit Theorem

Suppose that we have a sample X1, X2, …, Xn from some distribution with mean μ and variance σ². If n is sufficiently large, then the sample mean X̄ ~ N(μ, σ²/n), approximately. This is true even if the underlying population is not normal – the approximation improves for relatively larger n. We refer to this result as the Central Limit Theorem, or CLT. It represents one of the most remarkable results in mathematical statistics. The CLT applies even to samples from some categorical distributions (e.g., including the binomial and Poisson distributions).

Resources on the Web

There are many applets available via the internet that demonstrate the Central Limit Theorem. A simple example using dice is found at
http://www.amstat.org/publications/jse/v6n3/applets/CLT.html
You can find a more interesting example at
http://www.ruf.rice.edu/~lane/stat_sim/index.html
You can also easily access web-based CDF calculators for the normal distribution, as well as for other distributions related to the normal (more on those in a moment). For example, this website computes probabilities for a variety of common distributions:
http://www.stat.berkeley.edu/users/stark/Java/Html/ProbCalc.htm

The Chi-Square Distribution

A related distribution that we will use later on during this semester is the chi-square distribution.
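The CLT can also be illustrated offline with a short simulation. This sketch assumes NumPy; the exponential population and the settings n = 50 with 10,000 replications are arbitrary choices made here for illustration:

```python
# Sketch (assumes NumPy; population and sizes chosen arbitrarily to illustrate).
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 10_000

# Draw 'reps' samples of size n from a clearly non-normal population:
# the Exponential distribution with mean 1 (which also has sigma = 1).
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# Per the CLT, the sample means should behave roughly like N(1, 1/n).
print(means.mean())  # close to mu = 1
print(means.std())   # close to sigma / sqrt(n) = 0.1414
```

A histogram of `means` would look bell-shaped even though the exponential population itself is strongly right-skewed.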
If Z is a standard normal random variable, then Z² is a chi-square random variable with 1 degree of freedom, denoted χ₁². The sum of n independent χ₁² random variables follows a chi-square distribution with n degrees of freedom, denoted by χₙ².

A chi-square random variable has a range that is nonnegative, and its distribution is positively skewed. For example, the pdf for the chi-square distribution with five degrees of freedom looks something like this:

[Figure: pdf of the χ₅² distribution]

The t Distribution

Another distribution related to the normal – and one upon which we will rely heavily – is the t distribution. If Z is a standard normal random variable, and X² is an independent χₙ² random variable, then the random variable

  T = Z / √(X²/n)

follows a t distribution with n degrees of freedom.

A t distribution actually looks quite similar to the standard normal distribution: its mean is zero, and it is unimodal, bell-shaped, and symmetric. One distinction is that the variability of the t distribution is slightly greater than that of the Z distribution. As n gets very large, however, the t distribution converges to (i.e., becomes nearly indistinguishable from) a Z distribution.

The F Distribution

The third related distribution that we will use is the F distribution. If U and V are independent χₙ² and χₘ² random variables, respectively, then the variable

  F = (U/n) / (V/m)

follows an F distribution with n and m degrees of freedom. We denote this distribution by Fₙ,ₘ. The F distribution has a range that is nonnegative, and its distribution is positively skewed. For example, the pdf for the F₅,₁₀ distribution looks something like this:

[Figure: pdf of the F₅,₁₀ distribution]
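The construction T = Z / √(X²/n) can be verified by simulation. A sketch assuming NumPy and SciPy (the choice n = 10 and the replication count are arbitrary):

```python
# Sketch (assumes NumPy/SciPy): build T = Z / sqrt(chi2_n / n) by simulation
# and compare it to the t distribution with n = 10 degrees of freedom.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)
n, reps = 10, 200_000

z = rng.standard_normal(reps)                               # N(0,1) draws
chi2_n = (rng.standard_normal((reps, n)) ** 2).sum(axis=1)  # sums of n squared N(0,1)'s
T = z / np.sqrt(chi2_n / n)

# The t_n variance is n / (n - 2) = 1.25 > 1, reflecting the extra variability
# of the t distribution relative to Z noted above.
print(T.var())
print(np.quantile(T, 0.95), t.ppf(0.95, n))  # empirical vs. exact 95th percentile
```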
Example II.F

Note that we cannot tabulate the χ², t, and F distributions in the same way that we do for the Z distribution – there are an infinite number of distributions in each of these families (as many as there are values for the degrees of freedom).

Instead of areas under the curve, then, you are given tables in your textbook that contain quantiles from a given χ², t, or F distribution. For example, examine the χ² table in the back of your book (on page 663). Each row corresponds to a value for the degrees of freedom, and each column corresponds to a right-tail area. Hence, the quantile of the χ₆² distribution with a right-tail area of 0.95 is 1.635. We will denote this – consistent with the text – by χ²(0.05; 6).

Example II.F, cont'd

Find the values of χ²(0.975; 20) and χ²(0.90; 11). Find the values of t(0.95; 15) and t(0.975; 30). Find the values of F(0.95; 5, 10) and F(0.90; 9, 20).

A few points of review:

Given a random sample X1,…,Xn, with E(Xi) = μ and Var(Xi) = σ²:
1. X̄ is a random variable: its distribution has a mean of μ and a variance of σ²/n.
2. If the underlying population is normally distributed, then X̄ is normally distributed.
3. Even if the underlying population is not normally distributed, the Central Limit Theorem tells us that for sufficiently large sample size n, X̄ will be approximately normally distributed.
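Returning to Example II.F, the tabled quantiles can be checked with software. A sketch assuming SciPy, whose `ppf(A, df)` returns the A-th percentile and so matches the text's χ²(A; df), t(A; df), and F(A; df1, df2) notation:

```python
# Sketch (assumes SciPy). ppf(A, df) is the A-th percentile, matching the
# chi2(A; df), t(A; df), and F(A; df1, df2) notation used in the text.
from scipy.stats import chi2, t, f

print(chi2.ppf(0.975, 20), chi2.ppf(0.90, 11))
print(t.ppf(0.95, 15), t.ppf(0.975, 30))
print(f.ppf(0.95, 5, 10), f.ppf(0.90, 9, 20))
```

These values should agree with the textbook tables to the precision printed there.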