MSc QT: Statistics Part II
Statistical Inference (Weeks 3 and 4)

Sotiris Migkos
Department of Economics, Mathematics and Statistics
Malet Street, London WC1E 7HX
September 2015

MSc Economics & MSc Financial Economics (FT & PT2)
MSc Finance & MSc Financial Risk Management (FT & PT2)
MSc Financial Engineering (FT & PT1)
PG Certificate in Econometrics (PT1)

Contents

Introduction

1 Sampling Distributions
  Literature
  1.1 Introduction
  1.2 Sampling Distributions
  1.3 Sampling Distributions Derived from the Normal
      1.3.1 Chi-Square
      1.3.2 Student-t
      1.3.3 F-distribution
  Problems
  Proofs

2 Large Sample Theory
  Literature
  2.1 Law of Large Numbers
  2.2 The Central Limit Theorem
  2.3 The Normal Approximation to the Binomial Distribution
  Problems
  Proofs

3 Estimation
  Literature
  3.1 Introduction
  3.2 Evaluation Criteria for Estimators
  3.3 Confidence Intervals
  Problems

4 Hypothesis Testing
  Literature
  4.1 Introduction
  4.2 The Elements of a Statistical Test
  4.3 Duality of Hypothesis Testing and Confidence Intervals
  4.4 Attained Significance Levels: P-Values
  4.5 Power of the Test
  Problems

A Exercise Solutions
  A.1 Sampling Distributions
  A.2 Large Sample Theory
  A.3 Estimation
  A.4 Hypothesis Testing
Introduction

Course Content

This part of the course consists of five lectures, followed by a closed-book exam. The topics covered are

1. Sampling distributions.
2. Large sample theory.
3. Estimation.
4. Hypothesis testing.

The last lecture will cover exercises based on the topics above.

Textbooks

Lecture notes are provided; however, these notes are not a substitute for a textbook. The required textbook for this part of the course is:

• Wackerly, D., Mendenhall, W. and Scheaffer, R. (2008). Mathematical Statistics with Applications, 7th ed., Cengage. (Henceforth WMS)

Students who desire a more advanced treatment of the material might want to consider:

• Casella, G. and Berger, R. (2008). Statistical Inference, 2nd ed., Duxbury Press. (Henceforth CB)
• Rice, J. (2006). Mathematical Statistics and Data Analysis, 3rd ed., Cengage. (Henceforth R)

Furthermore, the following books are recommended for students who plan to take further courses in econometrics. The appendices of these books also contain summaries of the material covered in this class.

• Greene, W. (2011). Econometric Analysis, 7th ed., Prentice-Hall. (Henceforth G)
• Verbeek, M. (2012). A Guide to Modern Econometrics, 4th ed., Wiley. (Henceforth V)

Online Resources

The primary resources for this part of the course are contained in this syllabus. Further resources can be found online at www.ems.bbk.ac.uk/for_students/presess/ and on the course page of the virtual learning environment Moodle (login via moodle.bbk.ac.uk/).

Instructor

The instructor for this part of the course is

• Sotiris Migkos, [email protected]

Chapter 1  Sampling Distributions

Literature

Required Reading
• WMS, Chapter 7.1 – 7.2

Recommended Further Reading
• WMS, Chapters 4 and 12
• CB, Chapter 5.1 – 5.4
• R, Chapters 6 – 7

1.1 Introduction

A statistical investigation normally starts with some measures of interest of a distribution. The totality of elements about which some information is desired is called a population. Often we use only a small proportion of a population, known as a sample, because it is impractical to gather data on the whole population. We measure the attributes of this sample and draw conclusions or make policy decisions based on the data obtained. That is, with statistical inference we estimate the unknown parameters underlying the statistical distribution of the sample. We can then measure their precision, test hypotheses about them, and use them to generate forecasts.

Definition 1.1 Population
A population (of size N), x_1, x_2, ..., x_N, is the totality of elements that we are interested in. The numerical characteristics of a population are called parameters. Parameters are often denoted by Greek letters such as θ.

Definition 1.2 Sample
A sample (of size n) is a set of random variables, X_1, X_2, ..., X_n, that are drawn from the population. The realization of the sample is denoted by x_1, ..., x_n.

The method of sampling, sometimes known as the design of the experiment, will affect the structure of the data that you measure, and thus the amount of information and the likelihood of observing a certain sample outcome. The type of sample you collect may have profound effects on the way you can make inferences based on that sample. For the moment we will concern ourselves only with the most basic of sampling methods: simple random sampling.
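As a concrete, if artificial, illustration of simple random sampling, the short Python sketch below draws a sample from a hypothetical finite population and reports a few sample attributes. The population, its size, and the sample size are invented for illustration only; numpy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# A hypothetical finite population of N = 10,000 incomes (invented numbers).
N = 10_000
population = rng.lognormal(mean=10.0, sigma=0.5, size=N)

# Simple random sampling: every element has the same chance of being selected.
n = 50
sample = rng.choice(population, size=n, replace=False)

print("population mean :", population.mean())
print("sample mean     :", sample.mean())
print("sample std. dev.:", sample.std(ddof=1))
```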
Definition 1.3 Random Sample The random variables X1 , · · · , Xn are called a random sample of size n from the population f (x) if X1 , · · · , Xn are mutually independent random variables and the marginal pdf or pmf of each Xi is the same function f (x). Alternatively X1 , · · · , Xn are called independent and identically distributed variables with pdf of pmf f (x). This is commonly abbreviated to iid random variables. The joint density of the realized xi ’s in a random sample sample has the form: f (x1 , x2 , · · · , xn ) = n Y fXi (xi ) = n Y f (xi ) (by independence) i=1 (by identicality). (1.1) i=1 Of course, in economics and finance one normally does not have much control on how the data is collected and the data at hand is often time-series data, which is in most cases neither independent nor identically distributed. Although addressing these issues is extremely important in empirical analysis, this course will ignore such considerations to focus on the basic issues. 1.2 Sampling Distributions When drawing a sample from a population, a researcher is normally interested in reducing the data into some summary measures. Any well-defined measure may be expressed as a function of the realized values of the sample. As the function will be based on a vector of random variables, the function itself, called a statistic, will be a random variable as well. Definition 1.4 Statistic and Sampling Distribution Let X1 , . . . , Xn be a sample of size n and T (x1 , . . . , xn ) be a real-valued or vector-valued function whose domain includes the sample space of (X1 , . . . , Xn ), that does not include any unknown parameters, then the random variable X = T (x1 , . . . , xn ) is called a statistic. The probability distribution of a statistic is called the sampling distribution of X. The analysis of these statistics and their sampling distributions is at the very core of econometrics. As the definition of a statistic is very broad, it can include a wide range of different measures. The most two common statistics are probably the mean X and the sample variance S 2 . Other examples include order statistics such as the smallest observation in the sample, X(1) , the largest observation in the sample, X(n) , and the median, X(n/2) ; correlations, Corr(X, Y), and covariances, Cov(X, Y), between two sequences of random variables are also common statistics. Statistics do not need to be scalar, but may also be vector-valued, returning for instance all the unique values observed in the sample or all the order statistics of the sample. 1. Sampling Distributions 3 Note the important difference between the sampling distribution which measures the probability distribution of the statistic T (x1 , . . . , xn ) and the distribution of the population, which measures the marginal distribution of each Xi . The following two sections consider the sampling distributions of the two most important statistics, the sample mean and the sample variance, on the assumption that the sample is drawn from a normal population. The main features of these sampling distributions are summarized by the following theorem. Theorem 1.1 The sample mean and the sample variance of a random normal sample have the following three characteristics: 1. E[X] = µ, and X has the sampling distribution X ∼ N(µ, σ2 /n), 2. E[S 2 ] = σ2 , and σ2 has the sampling distribution (n − 1)S 2 /σ2 ∼ χ2n−1 , 3. X and S 2 are independent random variables. 
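Theorem 1.1 can be checked with a small Monte Carlo sketch. The code below, which assumes numpy and uses arbitrary illustrative values for µ, σ, and n, repeatedly draws normal samples and compares the simulated moments of the sample mean and of (n − 1)S²/σ² with the values the theorem predicts, and also looks at the correlation between the two statistics.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
mu, sigma, n, reps = 5.0, 2.0, 10, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)              # sample means
s2 = samples.var(axis=1, ddof=1)         # sample variances
u = (n - 1) * s2 / sigma**2              # should follow a chi-square with n - 1 df

print("E[xbar]   sim:", xbar.mean(), " theory:", mu)
print("Var[xbar] sim:", xbar.var(),  " theory:", sigma**2 / n)
print("E[U]      sim:", u.mean(),    " theory:", n - 1)
print("Var[U]    sim:", u.var(),     " theory:", 2 * (n - 1))
print("corr(xbar, S^2):", np.corrcoef(xbar, s2)[0, 1])   # close to 0, consistent with independence
```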
As some of the most common statistics, such as the sample mean and sample total are linear combinations of the individual sample points, the following theorem is of great value in determining the sampling distribution of statistics. Theorem 1.2 If X1 , . . . , Xn are random variables with defined means,E(Xi ) = µi , and defined variances, Var(Xi ) = σ2i ; then a linear combination of those random variables, Z =a+ n X bi Xi , (1.2) i=1 will have the following mean and variance: E(Z) = a + Var(Z) = n X i=1 n n X X [bi E(Xi )], (1.3) [bi b jCov(Xi X j )] (1.4) i=1 j=1 if the Xi are independent, the variance reduces to Var(Z) = n h X i b2i Var(Xi ) . (1.5) i=1 Sample Mean Corollary 1.1 If X1 , . . . , Xn is a random sample drawn from a population with mean µ and 4 QT 2015: Statistical Inference variance σ2 . Using theorem 1.2 it can be shown that the the mean of this sample, n X X n = n−1 Xi , (1.6) i=1 will have expectation E X n = µ, (1.7) and variance Var X n = σ2X = σ2 n n . (1.8) Let us consider how a sampling distribution may look like. As an example take the case of the sample mean X n of a random sample drawn from a normally distributed population. Combined with the knowledge that linear combinations of normal variates are also normally distributed, the sampling distribution of X will be equal to ! σ2 X n ∼ N µ, . (1.9) n We can now go one step further and calculate the standardized sample mean. Subtracting the expected value, which is the population mean µ, and dividing by the (asymptotic) standard error creates a random variable with a standard normal distribution: Z= Xn − µ √ ∼ N(0, 1). σ/ n (1.10) Of course, in reality one does not generally know σ, in which case it is common practice to replace it with it’s sample counterpart S , which will give the following sampling distribution: Z= Xn − µ √ ∼ tn . S/ n (1.11) The details on why the sampling distribution changes from from a normal to a t-distribution are discussed in the next section. Sample Variance Corollary 1.2 If X1 , . . . , Xn is a random sample drawn from a population with mean µ and variance σ2 , then the sample variance S n2 = (n − 1)−1 n X i=1 2 Xi − X n , (1.12) 1. Sampling Distributions will have the following expectation: E S 2 = σ2 . 5 (1.13) Note that to calculate the sample variance we divide by n − 1 and not n, proof of this is provided at the end of this chapter. If the sample is random and drawn from a normal population, then it can also be shown that the sampling distribution is as follows: (n − 1) S2 ∼ χ2n−1 . σ2 (1.14) An intuition of this result is provided in WMS; the proof can be found in e.g. Casella and Berger, chapter 5. Finite Population Correction As a short distraction, notice that if the whole population is sampled, the estimation error of the sample mean will be, logically, equal to zero. Similarly, if a large proportion of the population is sampled, without replacement, the standard error calculated above will over-estimate the true standard error. In such cases, the standard error should be adjusted using a so-called finite population correction. Taking the standard error of the sample mean as an example: σX = 1 − ! n−1 σ √ . N −1 n (1.15) When the sampling fraction n/N approaches zero, then the correction will approach 1. So for most applications, σ σX ≈ √ , n (1.16) which is the definition of the standard error as given in the previous section. For most samples considered, the sampling fraction will be very small. 
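A tiny numerical sketch (illustrative numbers only, numpy assumed) shows how quickly the correction factor in equation (1.15) approaches one as the sampling fraction n/N shrinks:

```python
import numpy as np

def correction(n, N):
    """Finite population correction factor 1 - (n - 1) / (N - 1) from the notes."""
    return 1.0 - (n - 1) / (N - 1)

N = 1_000_000                              # hypothetical population size
for n in (100, 1_000, 10_000, 500_000):
    print(f"sampling fraction n/N = {n / N:>6.1%}: correction = {correction(n, N):.4f}")
```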
Thus, the finite sample correction will be neglected throughout most of this syllabus. 1.3 Sampling Distributions Derived from the Normal The normal distribution plays a central role in econometrics and statistics, for reasons that we will explore in more depth in the next chapter. However, there are a number of other distributions that feature as sampling distributions for various (test) statistics. As it turns out, the three most common of these distributions can actually be derived from the normal distribution. 1.3.1 Chi-Square 6 QT 2015: Statistical Inference Definition 1.5 Chi-Square distribution P Let Zi ∼ iidN(0, 1). The distribution of U = ni=1 Zi2 is called the chi-square (χ2 ) distribution, with n degrees of freedom. This is denoted with χ2n . Notice that the definition above implies that If U1 , U2 , . . . , Un are independent chi-square ranP dom variables with 1 degree of freedom, the distribution of V = Ui will be a chi-square distribution with n degrees of freedom. Also, for large degrees of freedom n the chi-square distribution will converge to a normal distribution, but this convergence is relatively slow. The moment generation function of a χ2n distribution is M(t) = (1 − 2t)−n/2 . (1.17) This implies that if V ∼ χ2n , then E(Vn ) = n, and (1.18) Var(Vn ) = 2n. (1.19) Like the other distributions that are derived from the normal distribution, the chi-square distribution often appears as the distribution of a test statistic. For instance, testing for the joint significance of two (or more) independent normally distributed variables. If Za ∼ N(µa , σa ) and Zb ∼ N(µb , σb ) and V is defined as Za − µa V2 = σa !2 Zb − µb + σb !2 , (1.20) then V ∼ χ22 (Remember that (Z − µ)/σ ∼ N(0, 1) ). Also, if X1 , X2 , . . . , Xn is a sequence of independent normally distributed variables, then the estimated variance (n − 1) S2 ∼ χ2n−1 . σ2 (1.21) 1.3.2 Student-t Definition 1.6 Student t distribution Let Z ∼ N(0, 1) and Un ∼ χ2n , with Z and Un independent, then Tn = √ Z , Un /n (1.22) will have a t distribution with n degrees of freedom, often denoted by tn . The mean an variance of a t-distribution with n degrees of freedom is E(T n ) = 0, and n , n > 2. Var(T ) = n−2 (1.23) (1.24) 1. Sampling Distributions 7 Like the normal, the expected value of the t-distribution is 0, and the distribution is symmetric around it’s mean, implying that f (t) = f (−t). In contrast to the normal, the t distribution has more probability mass in it’s tails, a property called fat-tailness. As the degrees of freedom, n, increases the tails become lighter. Indeed in appearance the student t distribution is very similar to the normal distribution; actually in the limit n −→ ∞ the t distribution converges in distribution to a standard normal distribution. Already for values of n as small as 20 or 30, the t distribution is very similar to a standard normal. √ Remember that for a random sample drawn from a normal distribution, Z = (X − µ)/(σ/ n) ∼ N(0, 1). However√in reality we do not have information about σ; thus we normally substitute the √ sample estimate (S 2 ) = S for σ. Thus T = (X − µ)/(S / n) will have a t-distribution (proof of this can be found at the end of the chapter). 1.3.3 F-distribution Definition 1.7 F distribution Let Un ∼ χ2n and Vm ∼ χ2m , and let Un and Vm be independent from each other, then Wn,m = Un /n , Vm /m (1.25) will have a F distribution with m and n degrees of freedom, often denoted by Fn,m . 
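Before turning to the moments of these distributions, a short simulation sketch can make the three constructions concrete: χ², t, and F variates are built directly from standard normals and compared with scipy's reference distributions. This assumes numpy and scipy are available; the degrees of freedom are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
reps, n, m = 200_000, 5, 8

Z = rng.standard_normal((reps, n + m + 1))

chi2_n = (Z[:, :n] ** 2).sum(axis=1)            # sum of n squared N(0,1) variates -> chi2 with n df
t_n = Z[:, n] / np.sqrt(chi2_n / n)             # N(0,1) / sqrt(chi2_n / n) -> t with n df
chi2_m = (Z[:, n + 1:] ** 2).sum(axis=1)        # an independent chi2 with m df
f_nm = (chi2_n / n) / (chi2_m / m)              # F with (n, m) degrees of freedom

# Compare a simulated upper-tail probability with the exact one from scipy.
for name, sim, dist in [("chi2", chi2_n, stats.chi2(n)),
                        ("t",    t_n,    stats.t(n)),
                        ("F",    f_nm,   stats.f(n, m))]:
    q = dist.ppf(0.95)
    print(f"{name:>4}: P(X > 95% quantile) simulated = {(sim > q).mean():.3f} (theory 0.050)")
```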
The mean an variance of an F-distribution with n and m degrees of freedom is m m−2 m 2 n + m − 2 Var(Fn,m ) = 2 m − 2 n(m − 4) E(Fn,m ) = , m>2 (1.26) , m > 4. (1.27) Under specific circumstances, the F distribution converges to either a t or a χ2 distribution. Particularly 2 F1,m = tm , (1.28) and d nFn,m −→ χ2n . (1.29) The F-distribution often appears when investigating variances. Recall that the standardized variance of a normal sample will have a Chi-square distribution. Hence the ratio of two variances of independent samples can be expressed as a F-distribution. [nS 12 /σ21 ]/n [mS 22 /σ22 ]/m = Un /n = Fn,m Vm /m where the degrees of freedom are a function of the two sample sizes: n = n1 − 1 and n = n2 − 1. 8 QT 2015: Statistical Inference Problems 1. The fill of a bottle of soda, dispensed by a certain machine, is normally distributed with mean µ = 100 and variance σ2 = 9 (measured in centiliters). (a) Calculate the probability that a single bottle of soda contains less than 98cl (b) Calculate the probability that a random sample of 9 soda bottles contains, on average, less than 98cl. (c) How does your answer in (b) change if the variance of 9 was an estimate (i.e. S 2 = 9), rather than a population parameter. 2. Let X1 , X2 , ..., Xm and Y1 , Y2 , ..., Yn be two normally distributed independent random samples, with Xi ∽ N µ1 , σ21 and Yi ∽ N µ2 , σ22 . Suppose that µ1 = µ2 = 10, σ21 = 2, σ22 = 2.5, and m = n. (a) Find E(X) and Var(X). (b) Find E(X − Y) and Var(X − Y). (c) Find the sample size n, such that σ(X−Y) = 0.1. 3. Let S 12 and S 22 be sample variance of two random samples drawn from a normal population with population variance σ2 = 15. Let the sample size be n = 11. h i (a) find a such that Pr S 12 ≤ a = 0.95 i h (b) find b such that Pr S 12 /S 22 ≤ b = 0.95 4. Let Z1 , Z2 , Z3 , Z4 be a sequence of independent standard normal variables. Derive distributions for the following random variables. (a) X1 = Z1 + Z2 + Z3 + Z4 . (b) X2 = Z12 + Z22 + Z32 + Z42 . Z12 . (Z22 + Z32 + Z42 )/3 Z1 (d) X4 = q √ . Z22 + Z32 + Z42 / 3 (c) X3 = Proofs Proof 1.1 To prove that the sample variance S 2 has expectation σ2 , note that n−1 S n2 = P 2 n i=1 Xi − X n = P n−1 . 2 2 Xi − nX n 1. Sampling Distributions Therefore, by taking expectations we get n−1 2 E(S n ) = E P 2 Xi2 − nX n n−1 = P h i 2 E Xi2 − nE[X n ] Recall that Var(Z) = E(Z 2 ) − E(Z)2 , so E(Xi2 ) = σ2 + µ2 and 2 E(X n ) = n−1 σ2 + µ2 . Substitute to get E(S n2 ) = σ2 n + n−1 2 = σ n−1 = σ2 . µ2 n−1 − n n−1 σ2 + µ2 √ Proof 1.2 To prove T = (X − µ)/(S / n) ∼ t(n − 1) rewrite T : √ X−µ (X − µ)/(σ/ n) = √ √ √ S / n (S / n)/(σ/ n) Z = S /σ Z = p S 2 /σ2 Z = q (n−1)S 2 /σ2 n−1 = q Z , U n−1 n−1 where X−µ Z= √ ∼ N(0, 1) and σ/ n S2 Un−1 = (n − 1) 2 ∼ χ2n−1 . σ 9 10 QT 2015: Statistical Inference Thus X−µ √ ∼ tn−1 S n/ n Chapter 2 Large Sample Theory Literature Required Reading • WMS, Chapter 7.3 – 7.6 Recommended Further Reading • CB, Chapter 5.5 • R, Chapter 5 2.1 Law of Large Numbers In many situations it is not possible to derive exact distributions of statistics with the use of a random sample of observations. This problem disappears, in most cases, if the sample size is large, because we can derive an approximate distribution. Hence the need for large sample or asymptotic distribution theory. Two of the main results of large sample theory are the Law of Large Numbers (LLN), discussed in this section, and the Central Limit Theory, described in the next section. 
As large sample theory builds heavily on the notion of limits, let us first define what they are. Definition 2.1 Limit of a sequence Suppose a1 , a2 , ...., an constitute a sequence of real numbers. If there exists a real number a such that for every real ǫ > 0, there exists an integer N(ǫ) with the property that for all n > N(ǫ), we have | an − a |< ǫ, then we say that a is the limit of the sequence {an } and write limn−→∞ an = a. Intuitively, if an lies in an ǫ neighborhood of a (a − ǫ, a + ǫ) for all n > N(ǫ), then a said to be the limit of the sequence {an }. Examples of limits are !# " 1 = 1, and (2.1) lim 1 + n−→∞ n a n lim 1 + = ea . (2.2) n−→∞ n The notion of convergence is easily extended to that of a function f (x). 11 12 QT 2015: Statistical Inference Definition 2.2 Limit of a function The function f (x) has the limit A at the point x0 , if for every ǫ > 0 there exists a δ(ǫ) > 0 such that | f (x) − A |< ǫ whenever 0 <| x − x0 |< δ(ǫ) One of the core principles in statistics is that the sample estimator will converge to the the ‘true’ value when the sample gets larger. For instance, if a coin is flipped enough times, the proportion of times it comes up tails should get very close to 0.5. The Law of Large Numbers is a formalization of this notion. Weak Law of Large Numbers The concept of convergence in probability can be used to show that, under very general conditions, the sample mean converges to the population mean, a result that is known as The Weak Law of Large Numbers (WLLN). This property of convergence is also referred to a consistency, will will be treated in more detail in the next chapter. Theorem 2.1 (Weak Law of Large Numbers) Let X1 , X2 , . . . , Xn be iid random variables with P E(Xi ) = µ and Var(Xi ) = σ2 < ∞. Define X n = n−1 ni=1 Xi . Then for every ǫ > 0, lim Pr(|X n − µ| < ǫ) = 1; n−→∞ that is, X n converges in probability to µ As stated, the weak law of large numbers relies on the notion of convergence in probability. This type of convergence is relatively weak and so normally not too hard to verify. Definition 2.3 Convergence in Probability if lim Pr[| Xn − x |≥ ǫ] = 0 for all ǫ > 0, n−→∞ the sequence of random variables Xn is said to converge in probability to the real number x . We write p Xn −→ x or plimXn = x. Convergence in probability implies that it becomes less and less likely that the random variable (Xn − x) lies the outside the interval (−ǫ, +ǫ) as the sample size gets larger and larger. There exist different equivalent definitions of convergence in probability. Some equivalent definitions are given below: 1. limn−→∞ Pr[|Xn − x| < ǫ] = 1, ǫ > 0. 2. Given ǫ > 0 and δ > 0, there exists N(ǫ, δ) such that Pr[| Xn − x |> ǫ] < δ, for all n > N. 3. Pr[| Xn − x |< ǫ] > 1 − δ , for all n > N, that is, Pr[| XN+1 − x |< ǫ] > 1 − δ, Pr[| XN+2 − x |< ǫ] > 1 − δ, and so on. 2. Large Sample Theory p 13 p Theorem 2.2 If Xn −→ X and Yn −→ Y, then p (a) (Xn + Yn ) −→ (X + Y), p (b) (Xn Yn ) −→ XY, and p (c) (Xn /Yn ) −→ X/Y (if Yn , Y , 0). p p Theorem 2.3 If g(·) is a continuous function, then Xn −→ X implies that g(Xn ) −→ g(X). In other words, convergence in probability is preserved under continuous transformations. Strong Law of Large Numbers Like in the case of convergence in probability, almost sure convergence can be used to prove the convergence (almost surely) of the sample mean to the population mean. This stronger result is known as the the Strong Law of Large Numbers (SLLN). 
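Before formalizing almost sure convergence, here is a small simulation sketch of the weak law at work: the running proportion of heads in a sequence of fair coin flips settles around 0.5 as n grows, and the probability that the sample mean lies within ε of 0.5 gets close to one. The seed, sample sizes, and ε are arbitrary; numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(seed=4)

flips = rng.integers(0, 2, size=100_000)        # iid Bernoulli(0.5) coin flips
running_mean = flips.cumsum() / np.arange(1, flips.size + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>6}: sample proportion of heads = {running_mean[n - 1]:.4f}")

# The WLLN says Pr(|Xbar_n - 0.5| < eps) -> 1; estimate this across many replications.
eps, reps, n = 0.01, 2_000, 10_000
xbars = rng.integers(0, 2, size=(reps, n)).mean(axis=1)
print("estimated Pr(|Xbar - 0.5| <", eps, ") =", np.mean(np.abs(xbars - 0.5) < eps))
```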
Definition 2.4 Almost Sure Convergence if Pr lim Xn = x = 1, n−→∞ the sequence of random variables Xn is said to converge almost surely to the real number x. and is written as a.s. Xn −→ x. In other words, almost sure convergence implies that the sequence Xn may not converge everywhere to x, but the points where it does not converge form a set of measure zero in the probability sense. More formally, given ǫ, and δ > 0, there exists N such that Pr[| XN+1 − x |< ǫ, | XN+2 − x |< ǫ, . . .] > (1 − δ), that is, the probability of these events jointly occurring can be made arbitrarily a.s close to 1. Xn is said to converge almost surely to the random variable X if (Xn − X) −→ 0. Do not be fooled by the similarity between the definitions of almost sure convergence and convergence in probability. Although they look the same, convergence in probability is much weaker than almost sure convergence. For almost sure convergence to happen, the Xn must converge for all point in the sample space (that have a strictly positive probability). For convergence in probability all that is needed is for the likelihood of convergence to increase as the sequence gets larger. Theorem 2.4 (Strong Law of Large Numbers) Let X1 , X2 , . . . , Xn be iid random variables with E(Xi ) = µ and Var(Xi ) = σ2 < ∞. 14 QT 2015: Statistical Inference P Define X n = n−1 ni=1 Xi . Then for every ǫ > 0, Pr lim X n − µ < ǫ = 1 ; (2.3) n−→∞ that is, the strong law of large numbers states that X n converges almost surely to µ: a.s. X n − µ −→ 0. (2.4) The SLLN applies under fairly general conditions; some sufficient cases are outlined below. a.s. ′ Theorem 2.5 If the X s are iid, then a necessary and sufficient condition for X n − µ −→ 0 is that E |Xi − µ| < ∞ for all i. ′ Theorem 2.6 (Kolmogorov’s Theorem on SLLN) If the X s are independent (but not neces P∞ a.s. 2 sarily identical) with finite variances, and if n=1 Var(Xn )/n < ∞, then X n − EX n −→ 0. A third form of point-wise convergence is the concept of convergence in mean. Definition 2.5 Convergence in Mean (r) The sequence of random variables Xn is said to converge in mean of order (r) to x (r ≥ 1), and (r) designated Xn −→ x, if E[ | Xn − x |r ] exists and limn−→∞ E[ | Xn − x |r ] = 0, that is, if r th moment of the difference tends to zero. The most commonly used version is mean squared convergence, which is when r = 2. For example, the sample mean (X n ) converges in mean square to µ, because Var(X n ) = E[(X n − µ)2 ] = (σ2 /n) tends to zero as n goes to infinity. Like convergence almost surely, convergence in Mean (r) is a stronger concept than convergence in probability. 2.2 The Central Limit Theorem Perhaps the most important theorem in large sample theory is the central limit theorem, which implies, under quite general conditions, that the standardized mean of a sequence of random variables (for example the sample mean) converges in distribution to a standard normal distribution, even though the population is not normal. Thus, even if we did not know the statistical distribution of the population from which a sample is drawn, we can approximate quite well the distribution of the sample mean by the normal distribution by having a large sample. In order to establish this result, we rely on the concept of convergence in distribution. Definition 2.6 Convergence in Distribution Let {Xn } be a a sequence of random variables whose CDF is Fn (x), and let the CDF F X (x) correspond to the random variable X. 
We say that Xn converges in distribution to if lim Fn (x) = F X (x) n−→∞ 2. Large Sample Theory 15 at all points x at which F X (x) is continuous. This can be written as d Xn −→ X Sometimes, convergence in distribution is also referred to as convergence in law. Intuitively, convergence in distribution occurs when the distribution of Xn comes closer and closer to that of X as n increased indefinitely. Thus, F X (x) can be taken to be an approximation to the distribution of Xn when n is large. The following relations hold for convergence in distribution: d p Theorem 2.7 If Xn −→ X and Yn −→ c, where c is a non-zero constant, then d (a) (Xn + Yn ) −→ (X + c), and d (b) (Xn /Yn ) −→ (X/c). Using the definition of convergence in distribution we can now introduce formally one version of the Central Limit Theorem. Theorem 2.8 (Central Limit Theorem) Let X1 , X2 , ..., Xn be iid random variables with mean E(Xi ) = µ and a finite variance σ2 < ∞. Define the standardized sample mean, X n − E(X n ) Zn = q Var(X n ) Then, under a variety of alternative assumptions d Zn −→ N(0, 1). (2.5) 2.3 The Normal Approximation to the Binomial Distribution The Bernoulli Distribution The Bernoulli distribution is a binary distribution, with only two possible outcomes: success (X = 1) with probability p and failure (X = 0) with probability q = 1 − p. The probability density of a Bernoulli is Pr(X = x|p) = px (1 − p)1−x ; x = 0, 1. (2.6) for X = 0, 1(failure, success) and 0 ≤ p ≤ 1. The mean and variance of a Bernoulli distribution are given as: E(X) = p, Var(X) = p(1 − p) = pq. (2.7) (2.8) 16 QT 2015: Statistical Inference The Binomial Distribution The Binomial distribution can be thought of as a sequence of iid Bernoulli rv of length n. ! n x Pr(X = x|n, p) = p (1 − p)n−x x n! = px (1 − p)n−x . x! (n − x)! (2.9) x = 0, 1, ..., n (X is the number of success in n trials) 0 ≤ p ≤ 1. The mean and variance of a binomial distribution are given as: E(x) = np. (2.10) Var(x) = npq. (2.11) Example 2.1 Assume a student is given a test with 10 true-false questions. Also assume that the student is totally unprepared for the test and guesses the answer to every question. What is the probability that the student will answer 7 or more questions correctly? Let X is the number of questions answered correctly. The test represents a binomial experiment with n = 10, p = 1/2. So X ∼ Bin(n = 10, p = 1/2). Pr(x ≥ 7) = Pr(x = 7) + Pr(x = 8) + Pr(x = 9) + Pr(x = 10) ! !k !10−k X ! !10 10 10 X 1 10 1 10 1 = = 2 k 2 k 2 k=7 k=7 = 0.172. The Normal Approximation For large sample size n and number of successes k, it becomes cumbersome to calculate the exact probabilities of the binomial. However, we can obtain approximate probabilities by invoking CLT. As stated before, a Binomial(n,p), can be thought of as n independent Bernoulli trails, with success probability p. Consequently, when n is large, the sample average of the Bernoulli trails n 1X Xi = X, n i=1 will be approximately normal with mean E(X) = p and variance Var(X) = p(1 − p)/n. Thus p X−p p(1 − p)/n ∼ N(0, 1) Even for fairly low numbers of n and k the normal approximation is surprisingly accurate. Wackerly provides the useful rule of thumb that the the approximation is adequate if n>9 larger of p and q smaller of p and q (2.12) 2. Large Sample Theory 17 Example 2.2 Consider again the student from example 2.1. What would be the approximate probability? Pr(x ≥ 7) = Pr(x/10 ≥ 0.7) Define x = x/10 x− p 0.7 − p Pr (x ≥ 0.7) = Pr p ≥ p p(1 − p)/n p(1 − p)/n ! 
0.2 = Pr Z ≥ √ 0.025 = Pr (Z ≥ 1.26) = 0.104 If we compare the approximate probability of 0.104 with the exact probability of 0.172 from the previous exercise, it becomes clear that there may be a substantial approximation error. However, as n gets larger, this approximation error becomes progressively smaller. Problems 1. let X1 , X2 , . . . , Xn be an independent sample (i.e. independent but not identically distributed), P with E(Xi ) = µi and Var(Xi ) = σ2i . Also, let n−1 ni=1 µi −→ µ. P Show that if n− 2 ni=1 σ2i −→ 0, then X −→ µ in probability. 2. The service times for customers coming through a checkout counter in a retail store are independent random variables with mean 1.5 minutes and variance 1.0. Use CLT to approximate the probability that 100 customers can be serviced in less than 2 hours of total service time. 3. Suppose that a measurement has mean µ and variance σ2 = 25. Let X be the average of n such independent measurements. If we are interested in measuring the sample mean with a degree of precision such that 95% of the time the sample mean lies within 1.5 units (in the absolute sense) from the true population mean, how large should we make our sample size? In on other words how large should n be so that Pr(|X − µ| < 1.5) = 0.95 ? Proofs The Weak Law of Large numbers can be proven by use of Chebychev’s Inequality. Proof 2.1 (Weak Law of Large Numbers) The Weak Law of Large numbers can be proven by use of Chebychev’s Inequality: E g(X) , ǫ > 0. Pr g(X) ≥ ǫ ≤ ǫ 18 QT 2015: Statistical Inference For instance, let g(X) be |X − E(X)|, in this case Chebychev’s inequality reduces to Pr [|X − E(X)| ≥ ǫ] ≤ E[|X − E(X)|] . ǫ Using Chebychev’s inequality; for every ǫ > 0 we have 2 E X−µ Pr X − E X ≥ ǫ ≤ , ǫ2 with E X−µ ǫ2 2 = = σ2 ) = 0 we have n−→∞ nǫ 2 h i lim Pr X − E(X) ≥ ǫ = 0. As lim ( n−→∞ Var X ǫ2 σ2 nǫ 2 . Chapter 3 Estimation Literature Required Reading • WMS, Chapters 8 & 9.1 – 9.3 Recommended Further Reading • R, Sections 8.6 – 8.8 • CB, Chapters 7, 9, & 10.1. 3.1 Introduction The purpose of statistics is to use the information contained in a sample to make inference about the parameters of the population that the sample is taken from. To key to making good inference about the parameters is to have a good estimation procedure that produces good estimates of the quantities of interest. Definition 3.1 Estimator An estimator is a rule for calculating an estimate of a target parameter based on the information from a sample. To indicate the link between an estimator and it’s target parameter, say θ, the estimator is normally denoted by adding a hat: θ̂. A point estimation procedure uses the information in the sample to arrive at a single number that is intended to be close to the true value of the target parameter in the population. For example, the sample mean Pn Xi X = i=1 (3.1) n is one possible point estimator of the population mean µ. There may be more than one estimator for a population parameter. The sample median, X(n/2) , for example might be another estimator for the population mean. Alternatively one might provide a range of values as estimates for the mean, for example the range from 0.10 to 0.35. This case is referred to as interval estimation. 19 20 QT 2015: Statistical Inference 3.2 Evaluation Criteria for Estimators As there are often multiple point estimators available for any given parameter it is important to develop some evaluation criteria to judge the performance of each estimator and compare their relative effectiveness. 
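The kind of comparison this section has in mind can be mimicked in a short Monte Carlo sketch: below, the sample mean and the sample median are both used to estimate µ of a normal population, and their simulated bias and mean squared error (both defined formally in what follows) are reported. The setup and numbers are purely illustrative and numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(seed=5)
mu, sigma, n, reps = 10.0, 2.0, 25, 50_000

samples = rng.normal(mu, sigma, size=(reps, n))
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

for name, est in (("sample mean", means), ("sample median", medians)):
    bias = est.mean() - mu
    mse = np.mean((est - mu) ** 2)
    print(f"{name:>13}: bias = {bias:+.4f}, MSE = {mse:.4f}")
# Both estimators are (roughly) unbiased here, but the mean has the smaller MSE.
```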
The three most important criteria used in economics and finance are: unbiasedness, efficiency, and consistency. Unbiasedness Definition 3.2 Unbiasedness An estimator θ̂ is called unbiased estimator of θ if E(θ̂) = θ. The bias of an estimator is given by b(θ) = E(θ̂) − θ. Definition 3.3 Asymptotic Unbiasedness √ If an estimator has the property that Var(θ̂) and n(θ̂n − θ) tend to zero as the sample size increases, then it is said to be asymptotically unbiased. Efficiency Definition 3.4 Mean Square Error (MSE) A commonly used measure of the adequacy of an estimator is E[(θ̂ − θ)2 ], which is called the mean square error ( MSE). It is a measure of how close θ̂ is, on average, to the true θ. The MSE can be decomposed into two parts: MS E = E[(θ̂ − θ)2 ] = E[(θ̂ − E(θ̂) + E(θ̂) − θ)2 ] = Var(θ̂) + bias2 (θ). (3.2) Definition 3.5 Relative Efficiency Let θ̂1 and θ̂2 be two alternative estimators of θ. Then the ratio of the respective MS Es, E[(θ̂1 − θ)2 ]/E[(θ̂2 − θ)2 ], is called the relative efficiency of θ̂1 with respect to θ̂2 . Consistency Definition 3.6 Consistency An estimator θ̂ is consistent if the sequence θ̂n converges to θ in the limit, i.e. θ̂ → θ. There are different types of consistency, corresponding to different versions of the law of large numbers. Examples are: p 1. θ̂n −→ θ (Weak Consistency) (2) 2. θ̂n −→ θ (Squared-error Consistency) 3. Estimation 21 a.s. 3. θ̂n −→ θ (Strong Consistency) A sufficient condition for weak consistency is that 1. The estimator is asymptotically unbiased 2. The variance of the estimator goes to zero as n → ∞ 3.3 Confidence Intervals An interval estimator is a estimation rule that specifies two numbers that form the endpoints of an interval, θ̂L and θ̂H . A good interval estimator is chosen such that (i) it will contain the target parameter θ most of the time and (ii) the interval chosen is as small as possible. Of course, as the estimators are random variables one or both of the endpoints of the interval will vary from sample to sample, so one cannot guarantee with certainty that the parameter will lie inside the interval but we can be fairly confident; as such interval estimators are often referred to as confidence intervals. The probability (1 − α) that θ will lie in the confidence interval is called the confidence level and the upper and lower endpoints are called, respectively, the upper and lower confidence limits Definition 3.7 Confidence Interval Let θ̂L and θ̂H be interval estimators of θ s.t. Pr(θ̂L ≤ θ ≤ θ̂H ) = 1 − α, then the interval [θ̂L , θ̂H ] is called the two-sided (1 − α) × 100% confidence interval. Normally the interval is chosen such that on each side α/2 falls outside the confidence interval. In addition to two sided confidence intervals it is also possible to form single sided confidence intervals. If θ̂L is chosen s.t. Pr(θ̂L ≤ θ) = 1 − α, h then the interval θ̂L , ∞ is the lower confidence interval. Additionally if θ̂H is chosen such that Pr(θ ≤ θ̂H ) = 1 − α, i the interval −∞, θ̂H is the upper confidence interval. Pivotal Method A useful method for finding the endpoints of confidence intervals is the pivotal method, which relies on finding a pivotal quantity Definition 3.8 Pivotal Quantity The random variable Q = q(X1 , . . . , Xn ) is said to be a pivotal quantity if the distribution of Q is independent from θ. For example for a random sample drawn from N(µ, 1) the random variable Q = X−µ 1/n is a pivotal quantity since Q ∼ N(0, 1). 
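A quick way to see what "the distribution of Q does not depend on θ" means in practice is to simulate the pivot for several values of µ and check that its distribution stays the same. The sketch below does this for the N(µ, 1) case, reading the pivot as Q = (X̄ − µ)/(1/√n) = √n (X̄ − µ); numpy is assumed and the values of µ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=6)
n, reps = 20, 100_000

for mu in (-3.0, 0.0, 7.5):                       # arbitrary illustrative values of mu
    samples = rng.normal(mu, 1.0, size=(reps, n))
    q = np.sqrt(n) * (samples.mean(axis=1) - mu)  # pivotal quantity for a N(mu, 1) population
    print(f"mu = {mu:>4}: mean(Q) = {q.mean():+.3f}, var(Q) = {q.var():.3f}")
# The moments (indeed the whole distribution) of Q are the same N(0, 1) whatever mu is.
```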
For the more general case of a random sample drawn from N(µ, σ2 ) the pivotal quantity associated with µ̂ will be Q = X−µ S /n , where S is the sample estimate of the standard deviation, as Q ∼ tn−1 22 QT 2015: Statistical Inference Pr(q1 ≤ Q ≤ q2 ) is unaffected by a change of scale or a translation of Q. That is if Pr (q1 ≤ Q ≤ q2 ) = (1 − α) Pr (a + bq1 ≤ a + bQ ≤ a + bq2 ) = (1 − α) (3.3) (3.4) Thus, if we know the pdf of Q, it may be possible to use the operations of addition and multiplication to find out the desired confidence interval. Let’s take as an example a sample drawn from a normal population with known variance. To build a confidence interval around the mean the pivotal quantity of interest is Q = X ∼ N(µ, 1/n) ∼ N(0, 1). (3.5) To find the confidence limits µ̂L and µ̂H s.t. Pr (µ̂L ≤ µ ≤ µ̂H ) = 1 − α, we start with finding the confidence limits q1 and q2 of our pivotal quantity s.t. ! x−µ Pr q1 ≤ √ ≤ q2 = 1 − α. 1/ n (3.6) (3.7) After we have found q1 and q2 , we can manipulate the probability to find expressions for µ̂L and µ̂H . ! ! x−µ 1 1 Pr q1 ≤ √ ≤ q2 = Pr √ q1 ≤ x − µ ≤ √ q2 1/ n n n ! 1 1 = Pr √ q1 − x ≤ −µ ≤ √ q2 − x n n ! 1 1 (3.8) = Pr x − √ q2 ≤ µ ≤ x − √ q1 . n n So, 1 µ̂L = x − √ q2 n 1 µ̂H = x − √ q1 n (3.9) (3.10) and " 1 1 x − √ q2 , x − √ q1 n n # (3.11) is the (1 − α)100% confidence interval for µ. Constructing Confidence Intervals Confidence Intervals for the Mean of a Normal Population Consider the case of a sample drawn from a normal population where both µ and σ2 are unknown. We know that Q= x−µ S √ n ∼ t(n−1) . (3.12) 3. Estimation 23 As the distribution of Q does not depend on any unknown parameters, Q is a pivotal quantity. We start with finding the confidence limits q1 and q2 of the pivotal quantity. As a t-distribution is symmetrical (just like the normal distribution), we can simplify the problem somewhat as it can be shown that q2 = −q1 = q. So we need to find a number q s.t. x−µ (3.13) Pr −q ≤ s ≤ q = 1 − α. √ n which reduces to finding q s.t. Pr (Q ≥ q) = α . 2 (3.14) After we have retrieved q = t α2 ,(n−1) , we manipulate the quantities inside the probability to find ! s s (3.15) Pr x − q √ ≤ µ ≤ x + q √ = 1 − α. n n To obtain the confidence interval " # s s x − t( α ,(n−1)) √ , x + t α2 ,(n−1) √ 2 n n (3.16) Example 3.1 Consider a sample drawn from a normal population with unknown mean and variance. Let n = 10, x = 3.22, s = 1.17, (1 − α) = 0.95. Filling in the numbers in the formula " # s s x − t α2 ,(n−1) √ , x + t α2 ,(n−1) √ . n n The 95% CI for µ equals, " # (2.262)(1.17) (2.262)(1.17) , 3.22 + 3.22 − = [2.38, 4.06] . √ √ 10 10 (3.17) Confidence Intervals for the Variance of a Normal Population To find the confidence interval of the variance of a normal population, we start again with finding an appropriate pivotal quantity. In this case recall that Q = (n − 1) S2 ∼ χ2(n−1) . σ2 (3.18) Note that the distribution of Q does not depend on any unknown parameters, hence Q is a pivotal quantity. Therefore we can find limits q1 and q2 such that Pr (q1 ≤ Q ≤ q2 ) = 1 − α. (3.19) This is slightly more tricky as the Chi-square distribution is not symmetric. It is standard to select the thresholds such that α (3.20) Pr (Q ≤ q1 ) = Pr (Q ≥ q2 ) = . 2 24 QT 2015: Statistical Inference After retrieving q1 = χ21−α/2,(n−1) and q2 = χ2α/2,(n−1) we manipulate the expression to find Pr (q1 ≤ Q ≤ q2 ) = Pr q1 ≤ (n − 1) S2 ≤ q2 σ2 ! ! S2 s2 2 = Pr (n − 1) ≤ σ ≤ (n − 1) . q2 q1 2 2 So, (n − 1) Sq2 , (n − 1) Sq1 is a 100(1 − α)100% CI for σ2 . 
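The same intervals can be computed directly from summary statistics; the sketch below reproduces the mean interval of Example 3.1 and applies the variance formula just derived, using scipy.stats for the t and chi-square quantiles (scipy is assumed to be available). Example 3.2 below works the variance interval by hand.

```python
from scipy import stats

def normal_mean_ci(xbar, s, n, alpha=0.05):
    """Two-sided (1 - alpha) CI for mu: normal population, unknown variance."""
    q = stats.t.ppf(1 - alpha / 2, df=n - 1)
    half = q * s / n**0.5
    return xbar - half, xbar + half

def normal_var_ci(s, n, alpha=0.05):
    """Two-sided (1 - alpha) CI for sigma^2: normal population."""
    q_hi = stats.chi2.ppf(1 - alpha / 2, df=n - 1)
    q_lo = stats.chi2.ppf(alpha / 2, df=n - 1)
    return (n - 1) * s**2 / q_hi, (n - 1) * s**2 / q_lo

# Numbers from Example 3.1: n = 10, xbar = 3.22, s = 1.17, alpha = 0.05.
print("CI for mu     :", normal_mean_ci(3.22, 1.17, 10))   # roughly [2.38, 4.06]
print("CI for sigma^2:", normal_var_ci(1.17, 10))          # roughly [0.65, 4.56]
```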
(3.21) Example 3.2 As in the example of the previous sample, let n = 10, x = 3.22, s = 1.17, (1−α) = 0.95. The 95 percent CI for σ2 is # " s2 s2 , (n − 1) , (n − 1) q2 q1 with q2 = χ20.025, (9) = 19.02 and q1 = χ20.975, (9) = 2.70. so the 95% CI equals " # 1.172 1.172 9× ,9 × = [0.65, 4.56] . 19.02 2.70 (3.22) Problems 1. Let X1 , X2 , . . . , Xn be a random sample with mean µ and variance σ2 . Consider the following estimators: (i) µ̂1 = (ii) µ̂2 = (iii) µ̂3 = X1 +Xn 2 X1 4 Pn + Pn−1 1 i=2 Xi 2 (n−2) i=1 Xi n+k + Xn 4 where 0 < k ≤ 3. (iv) µ̂4 = X (a) Explain for each estimator whether they are unbiased and/or consistent. (b) Find the efficiency of µ̂1 , µ̂2 , and µ̂3 relative to µ̂4 . Assume n = 36, σ2 = 20, µ = 15, and k = 3. 2. Consider the case in which two estimators are available for some parameter, θ. Suppose that E(θ̂1 ) = E(θ̂2 ) = θ, Var(θ̂1 ) = σ21 , and Var(θ̂2 ) = σ22 . Consider now a third estimator, θ̂3 , defined as θ̂3 = aθ̂1 + (1 − a)θ̂2 . How should a constant a be chosen in order to minimise the variance of θ̂3 ? (a) Assume that θ̂1 and θ̂2 are independent. 3. Estimation 25 (b) Assume that θ̂1 and θ̂2 are not independent but are such that Cov(θ̂1 , θ̂2 ) = γ , 0. 3. Consider a random sample drawn from a normal population with unknown mean and variance. You have the following information about the sample: n = 21, x = 10.15, and s = 2.34. Let α = 0.10 throughout this question. (a) Calculate the (1 − α) two-sided, upper, and lower confidence intervals for µ. (b) Calculate the (1 − α) two-sided, upper, and lower confidence intervals for σ2 . (c) Calculate the (1 − α) two-sided, upper, and lower confidence intervals for σ. 26 QT 2015: Statistical Inference Chapter 4 Hypothesis Testing Literature Required Reading • WMS, Chapter 10 Recommended Further Reading • R, Sections 9.1 – 9.3 • CB, Chapter 8. • G, Chapter 5. 4.1 Introduction Think for a second about a courtroom drama. A defendant is led down the aisle, the prosecution lays out all the evidence, and at the end the judge has to weigh the evidence and make his verdict: innocent or guilty. In many ways a legal trial follows the same logic as a statistical hypothesis test. The testing of statistical hypotheses on unknown parameters of a probability model is one of the most important steps of any empirical study. Examples of statistical hypothesis that are tested in economics include • The comparison of two alternative models, • The evaluation of the effects of a policy change, • The testing of the validity of an economic theory. 4.2 The Elements of a Statistical Test Broadly speaking there are two main approaches to hypothesis testing: the classical approach and the Bayesian approach. The approach followed in this chapter is the classical approach, which is most widely used in econometrics. The classical approach is best described by the Neyman-Pearson 27 28 QT 2015: Statistical Inference methodology; it can be roughly described as a decision rule that follows the logic: ‘What type of data will lead me to reject the hypothesis?’ A decision rule that selects one of the inferences ‘reject the null hypothesis’ or ‘do not reject the null hypothesis’ is called a statistical test. Any statistical test of hypotheses is composed of the same three essential components: 1. Selecting a null hypothesis, H0 , and an alternative hypothesis, H1 , 2. Choosing a test statistic, 3. Defining the rejection region. 
Null and Alternative Hypotheses A hypothesis can be thought of as a binary partition of the parameter space Θ into two sets, Θ0 and Θ1 such that Θ0 ∩ Θ1 = ⊘ and Θ0 ∪ Θ1 = Θ. (4.1) The set Θ0 is called the null hypothesis, denoted by H0 . The set Θ1 is called the alternative hypothesis, denoted by H1 or Ha . Take as example a political poll. Let’s assume that the current prime minister declares that he has got the support of more than half the population and we do not believe him. To test his statement we randomly select 100 voters and ask them if they approve of the prime minister. We can now formulate a null and alternative hypothesis. Let the null hypothesis be that the prime minister is correct, in that case the proportion of people supporting the prime minister will be at least 0.5, so H0 : θ ≥ 0.5. (4.2) Conversely if the prime minister is wrong then the alternative is true H1 : θ < 0.5. (4.3) Note that this partitioning of the null and alternative is done such that there is no value for θ that lies both in the domain of the null and the alternative and the union of the null and the alternative contains all possible values that θ can take. Often the null hypothesis in the above case is simplified: we are really only interested in the endpoint of the interval described by the null hypothesis, in this case the point θ = 0.5, so often the null is written instead as H0 : θ = 0.5, (4.4) where it is implicit that any value for θ larger than 0.5 is covered by this hypothesis by the way the alternative is formulated. The above example outlines what is known as a single sided hypothesis as the alternative hypothesis lies to one side of the null hypothesis. Alternatively one can specify a two sided hypothesis such as H0 : θ = 0.5 vs. H1 : θ , 0.5. (4.5) In this case the alternative hypothesis includes values for θ that lie on both sides of the postulated null hypothesis. 4. Hypothesis Testing 29 Test Statistic Once the null and alternative hypothesis have been defined a procedure needs to be developed to decide whether the null hypothesis is a reasonable one. This test procedure usually contains a sample statistic T (x) called the test statistic, which summarizes the ‘evidence’ against the null hypothesis. Generally the test statistic is chosen such that it’s limiting distribution is known. Take again the example of the popularity poll of the prime minister. We can exploit the fact that (i) the sample consists of an iid sequence of Bernoulli RV and (ii) CLT to show that approximately ! θ(1 − θ) θ̂ = X ∼ N θ, (4.6) n If we standardize θ̂ and fill in our hypothesized value θ0 = 0.5 for θ we can create the test statistic. Z(x) = X − 0.5 ∼ N(0, 1). 0.25/100 (4.7) Note that Z(x) does not rely on any unknown quantities and its limiting distribution is known. Rejection Region After a test statistic T has been selected, the researcher needs to define a range of values of T for which the test procedure recommends the rejection of the null. This range is called the rejection region or the critical region. Conversely the range of values for T in which the null is not rejected is called the acceptance region. The cut-off point(s) that indicate the boundary between the rejection region and the acceptance region is called the critical value. Going back to the example of the popularity poll, we could create the protocol: if the test statistic T is lower than the critical value τcrit = −2 I reject the null H0 : θ = 0.5 in favour of the alternative H1 : θ < 0.5. 
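A minimal sketch of this decision protocol, assuming the 100 poll responses are coded 0/1 in a numpy array and reading the standard error in equation (4.7) as √(0.25/100) = 0.05; the simulated responses and the provisional ±2 cut-off are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
responses = rng.integers(0, 2, size=100)   # hypothetical 0/1 approval answers

n = responses.size
theta0 = 0.5
se0 = np.sqrt(theta0 * (1 - theta0) / n)   # standard error under the null (0.05 here)

z = (responses.mean() - theta0) / se0      # test statistic as in equation (4.7)
tau_crit = -2.0                            # the provisional critical value from the text

print("z =", round(z, 3))
print("reject H0: theta >= 0.5" if z < tau_crit else "do not reject H0")
```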
In this case the rejection region consists of the set RR = {t < −2} and the acceptance region of the set AR = {t ≥ −2}. To find the right critical value is an interesting problem. In the above example, we know that ˆ lower than 0.5 (and hence a test statistic lower than 0) is evidence against finding any value for theta the null hypothesis. But how low should we set our threshold exactly? In order to better understand this dilemma lets first assume the decision rule fixed and evaluate the possible outcomes of our statistical test. Hopefully our test arrives at the correct conclusion: reject the null when it is not true or not rejecting it when it is indeed true. However there is the possibility that an erroneous conclusion has been made and one of two types of errors has been committed: Type I error : Rejecting H0 when it is true Type II error: Not Rejecting H0 when it is false Now that we have identified the two correct outcomes and two errors we can commit, we can associate probabilities with these events. Definition 4.1 Size of the test(α) The probability of rejecting H0 when it is actually true (ie. committing a type I error) is called the size of the test. Sometimes it is also called the level of significance of the test. This probability is usually denoted as α. 30 QT 2015: Statistical Inference Table 4.1: Decision outcomes and their associated probabilities H0 rejected H0 not rejected H0 true α Type I error Level / Size (1 − α) H0 false (1 − β) β Type II error Operating Char. Power Common sizes that are used in hypothesis testing are α = 0.10, α = 0.05, and α = 0.01. Definition 4.2 Power of the test (1 − β) The probability of rejecting H0 when it is false is called the power of the test. This probability is normally denoted as (1 − β). Definition 4.3 Operating Characteristic (β) The probability of not rejecting H0 when it is actually false (ie. committing a type II error) is known as the operating characteristic. This probability is usually denoted as β. This concept is widely used in statistical quality control theory. Table 4.2 below summarizes the probabilities. Ideally a test is chosen such that both the probability of a type I error,α, and the probability of a type II error,β, are as low as possible.However, practically this is impossible because, given some fixed sample, reducing α increases β: there is a trade-off between the two. The only way to decrease both α and β is to increase the sample size, something that is often not feasible. The classical decision procedure therefore chooses an acceptable value for the level α. Note that in small samples the empirical size associated with a critical value of a test statistic is often larger than the asymptotic size because the approximation of the limiting distribution might not yet be very good. Thus if a researcher is not careful he risks choosing a test which rejects the null hypothesis more often than he realizes. So then how do we select the critical value τcrit after fixing α? Let’s consider once more our popularity contest. Recall that the test statistic associated with the hypothesis that θ = 0.5 was (X − 0.5)/(0.25/100) ∼ N(0, 1). Let’s say that we are willing to reject the null hypothesis if there is less than 2.5% probability of committing a type I error, ie. α = 0.025. Since we know the limiting distribution of T we can find the value τc rit such that Pr[T < τcrit | θ = 0.5] = α = 0.025. (4.8) This value can be found by looking up the CDF of a standard normal: Pr(T ≥ τ) = 1−0.025 = 0.975; in this case τ = −1.96. 
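The same lookup can be done in one line with scipy's inverse normal CDF (scipy assumed):

```python
from scipy import stats

alpha = 0.025
tau_crit = stats.norm.ppf(alpha)   # lower-tail critical value of a standard normal
print(round(tau_crit, 2))          # -1.96
```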
In any case, we have now found the relevant critical value, and can define the rejection region as RR = {t < −1.96} and the acceptance region as AR = {t ≥ −1.96}. If we map 4. Hypothesis Testing 31 the critical value of the test statistic back to a proportion, this translates to θcrit = θ − 1.96 × se = 0.5 − 1.96 × 0.05 = 0.402; ie. we can reject the null (θ = 0.5) at the 2.5% level if we find a sample mean lower than 0.402. If a two sided test of the form H0 : θ = 0.5 vs. H1 : θ , 0.5 would have been considered, the rejection region would have consisted of two parts: RR = {t : t < τl or t > τu }. Where for a symmetric distribution like the normal τu = −τl = τ which reduces the rejection region to RR = {T : |t| > τ}. Using the data from the popularity poll, we can easily construct a two-sided rejection region for the hypothesis H0 : θ = 0.5 vs. H1 : θ , 0.5 at the 5% level by realizing that 5% / 2 = 2.5%. Hence the critical values for the two-sided test will be −1.96 and 1.96, with the associated rejection region: RR = {t : |t| > 1.96}. Example 4.1 Consider the hypothetical example in which a subject is asked to draw, 20 times, a card from a suit of 52 cards and identify, without looking, the suit (hearts, diamonds, clubs, spades). Let T be the number of correct identifications. Let the null hypothesis random guesses with the alternative being that the person has extrasensory ability (also called ESP). If the maximum level of the test is set at α = 0.05, what should be the decision rule and associated rejection region? T ∼ binomial(20, 0.25). Find τ0.05 such that Pr[T > τ0.05 | π = 0.25] ≤ 0.05. P[T ≥ 8 | π = 0.25] = 0.102 > 0.05 and P[T ≥ 9 | π = 0.25] = 0.041 < 0.05. Thus the critical value of this test is τ0.05 = 9 and the rejection region equals RR : t ≥ 9. Common Large-Sample Tests Many hypothesis tests are based around test statistics that are approximately normal by virtue of the CLT, such as sample means X. We can exploit this fact to construct a test statistic that is commonly encountered in econometrics. Z= θ̂ − θ0 ∼ N(0, 1). σθ̂ (4.9) The standard error is often replaced with its sample estimate S /n which results in the following test statistic T = θ̂ − θ0 ∼ t(n−1) . S /n (4.10) with associated two-sided rejection region RR : {t : |t| > τα/2 } or RR : {θ̂ : θ̂ < θ − τα/2 σθ̂ or θ̂ > θ + τα/2 σθ̂ }. (4.11) 32 QT 2015: Statistical Inference 4.3 Duality of Hypothesis Testing and Confidence Intervals i h Recall the concept of a (1 − α) two-sided confidence interval θ̂l , θ̂h as an interval that contains the true parameter θ with probability (1 − α). Also recall that if the sampling distribution of θ is approximately normal then the (1 − α) confidence interval is given by θ̂ ± zα/2 σθ̂ , (4.12) with σθ̂ the standard error of the estimator and zα/2 the value such that Pr(Z > zα/2 ) = α/2. Note the strong similarity with this confidence interval and the test statistic plus associated rejection region of a two sided hypothesis test described in the previous section. This is no coincidence. Consider again the two-sided rejection region for a test with level α from the previous section: RR : {z : |z| > zα/2 }. The complement of the rejection region, RR, is the acceptance region AR : {z : |z| ≤ zα/2 } which maps onto the parameter space as do not reject (‘accept’) null hypothesis at level α if the estimate lies in the interval θ0 ± zα/2 σθ̂ . (4.13) Restated, for all θ0 that lie in the interval θ̂ − zα/2 σθ̂ ≥ θ0 ≥ θ̂ + zα/2 σθ̂ . 
(4.14) the estimate θ̂ will lie inside the acceptance region and the null hypothesis cannot be rejected at level α. This interval is, as you will notice, exactly equal to the (1 − α) confidence interval outlined above. Thus the duality between confidence intervals and hypothesis testing: if the hypothesized value θ0 lies inside the (1 − α) confidence interval, one cannot reject the null hypothesis H0 : θ̂ = θ0 vs. H1 : θ̂ , θ0 at level α; if θ0 does not lie in the confidence interval then the null can be rejected at level α. A similar statement can be made for upper and lower single sided confidence intervals. Notice that any value inside the confidence interval would be an ‘acceptable’ value for the null hypothesis, in the sense that it cannot be rejected with a hypothesis test of level α. This explains why in statistics we usually only talk about rejecting the null vs. not rejecting the null, rather than saying we ‘accept’ the null. Even if we do not reject the null we recognize that there are probably many other values for θ that would be acceptable and we should be hesitant to make statements about a single θ being the single true value. Likewise we do not commonly ‘accept’ the alternative when we reject the null hypothesis are there are usually many potential values the paramater θ can take under the alternative. 4.4 Attained Significance Levels: P-Values Recall that the most common method of selecting a critical value for the test statistic and determining the rejection region is fixing the level of the test α. Of course we would like to have α as small as possible as it denotes the probability of committing a type I error. However, as discussed, choosing a low α comes at the cost of increasing β, the probability of a type II error. Choosing the correct value of α is thus important, but also rather arbitrary. While one researcher would be happy to conduct a test with level α = 0.10 another would insist upon only testing with levels lower than, say, α = 0.05. Furthermore, the levels of tests are often fixed at 10%, 5%, or 1% not as a result of long deliberations, but rather out of custom and tradition. 4. Hypothesis Testing 33 There is a way to partially sidestep this issue of selecting the right value for α by reporting the attained significance level or p-value. For example let T be a test statistic for the hypothesis H0 : θ = θ0 vs. H1 : θ > θ0 . If the realized value of the test statistic is t, based on our sample, then the p-value is calculated as the probability pval = Pr[T > t | θ0 ]. (4.15) Definition 4.4 p-value The attained significance level, or p-value, is the smallest level of significance α at which the null hypothesis can be reject given the observed sample. The advantage of reporting a p-value, rather than fixing the level of the test yourself is that it permits each of your readers to draw their own conclusion about the strength of your results. The procedures for finding p-values are very similar to those of finding the critical value of a test statistic. However, instead of fixing the probability α and finding the critical value of the test statistic τ, we now fix the value of the test statistic t and find the associated probability pval. Example 4.2 A financial analyst believes that firms experience positive stock returns upon the announcement that they are targeted for a takeover. 
Example 4.2 A financial analyst believes that firms experience positive stock returns upon the announcement that they are targeted for a takeover. To test this hypothesis he has collected a data set comprising 300 takeover announcements, with an average abnormal return of r̂ = 1.5% on the announcement date and a standard error of 0.5%. Calculate the p-value for the test of H0 : r = 0 vs. H1 : r > 0, where r denotes the expected abnormal return.

Invoking the CLT, the natural test statistic for this hypothesis is

Z = (r̂ − 0) / S r̂ ∼ t(299) ≈ N(0, 1).

The value of the test statistic in this sample equals z = 1.5/0.5 = 3. Looking up the value 3 in the standard normal table yields pval = Pr[Z > 3] = 0.0013. Thus the p-value of this test is 0.13%, implying that we can easily reject the null hypothesis of no announcement effect at the 10%, 5%, or 1% level.

4.5 Power of the Test

In the previous sections we have primarily focused on the probability α of committing a type I error. However, it is at least as important for a test to have a low probability β of committing a type II error. Remember that a type II error is committed if the testing procedure fails to reject the null when it is in fact false. In econometrics, rather than looking directly at β, many statistical tests are evaluated by its complement (1 − β): the probability that a statistical test rejects the null when it is indeed false. This probability (1 − β) is called the power of the test.

Before we can calculate the power of a test, two issues need to be addressed. Firstly, recall that the alternative hypothesis often contains a large range of potential values for θ. For instance, in the one-sided hypothesis H0 : θ = θ0 vs. H1 : θ > θ0, all values of θ larger than θ0 are included in the alternative. However, the power will normally not be the same for all the different values included in Θ1. Therefore the power of a test is often evaluated at a specific value under the alternative, say θ = θ1.

Secondly, as we have focused on type I errors and the associated α's, we have only considered what the sampling distribution looks like under the assumption that the null hypothesis is correct. This sampling distribution is referred to as the null distribution. However, if we want to make statements about the power of the test (or type II errors), then we have to consider what the sampling distribution of θ̂ looks like when θ = θ1. That is, we evaluate the sampling distribution under that specific alternative.

Consider once more the one-sided hypothesis H0 : θ = θ0 vs. H1 : θ > θ0 with associated test statistic T and critical value τα. For a specific alternative θ = θ1 (with θ1 > θ0) the power of the test can be calculated as the conditional probability

(1 − β) = Pr[T > τα | θ = θ1].    (4.16)

Note that the main difference with the definition of α,

α = Pr[T > τα | θ = θ0],    (4.17)

is that the probability is conditioned on the alternative hypothesis being true, rather than on the assumption that the null hypothesis is true.
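A minimal sketch of such a power calculation for a normally distributed estimator (Python with scipy assumed; all numbers are made up): the critical value is fixed under the null as in (4.17), and the power is then the probability beyond it under the alternative, as in (4.16).

```python
from scipy.stats import norm

# Illustrative values only: null value, specific alternative, standard error, level.
theta0, theta1, se, alpha = 0.0, 0.3, 0.1, 0.05

# Critical value on the parameter scale, fixed under the null distribution, cf. (4.17)
theta_crit = theta0 + norm.ppf(1 - alpha) * se

# Power: probability of landing beyond the critical value under the alternative, cf. (4.16)
power = norm.sf((theta_crit - theta1) / se)
print(f"critical value = {theta_crit:.3f}, power at theta1 = {power:.3f}")
```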
Example 4.3 Many American high-school students take the SAT (scholastic aptitude test). The average SAT score for mathematics is 633. Consider the following test: a school is rated 'excellent' if its students obtain an average SAT score of more than 650 (assume a class size of 40). School X believes that its own students have an expected SAT score of 660 with a standard deviation of 113. School X therefore feels it should be rated excellent; what is the probability that the school will actually be rated 'excellent'? This problem is really all about the power of the test.

Realize first that we can describe the above as a hypothesis test of the form H0 : θ = 633 vs. H1 : θ > 633 with rejection region RR = {θ̂ : θ̂ > 650}. Because we are looking at the power of the test, we have to consider the alternative distribution, not the null distribution. In this case the school wants to evaluate the test at the specific alternative θ1 = 660 and find the probability Pr[θ̂ > 650 | θ1 = 660] = (1 − β), which is the power of the test evaluated at θ1.

Invoking the CLT we have, under the alternative distribution, Z = (θ̂ − 660)/(113/√40) ∼ N(0, 1). We can use this to rewrite the probability above as

(1 − β) = Pr[Z ≥ z650].

Filling in the numbers we find that

z650 = (650 − 660)/(113/√40) ≈ −0.56.

Looking up z650 = −0.56 in the standard normal table yields the probability (1 − β) = 0.71.

Asymmetry of the Null and Alternative Hypotheses

As should be clear by now, there is an asymmetry between the null and the alternative hypothesis. The testing procedure outlined above focuses heavily on the null hypothesis, 'favouring' it over the alternative: the decision rule and test statistic are built around the null distribution and the probability of falsely rejecting the null hypothesis, and the conclusion drawn is mainly about the null (reject the null, do not reject the null). The test only rejects the null if there is a lot of evidence against it, even if the test has low power. Therefore, the decision as to which hypothesis is the null and which is the alternative is not merely a mathematical one, but depends on context and custom. There are no hard and fast rules on how to choose the null over the alternative, but often the 'logical' null can be deduced from one of the following principles.

• Sometimes we have good information about the sampling distribution under one of the two hypotheses, but not really about what it looks like under the other. In this case it is standard to choose the 'simpler' hypothesis, the one whose distribution we know, as the null hypothesis. For example, if you are interested in whether a certain sample is drawn from a normal population, you know what the distribution looks like under the null (i.e. normal), but have no clue what it might look like under the alternative (exponential, χ², something else?), so the natural null is to assume normality.

• Sometimes the consequences of falsely rejecting one hypothesis are much graver than those of falsely rejecting the other. In that case we should choose the former as the null hypothesis. For example, if you have to judge the safety of a bridge, it is more harmful to wrongly reject the hypothesis that the bridge is unsafe (potentially killing many people) than to wrongly reject the hypothesis that the bridge is safe (which may cost money on spurious repairs). In this case the null should be: the bridge is deemed unsafe, unless proven otherwise.

• In scientific investigations it is common to approach the research question with a certain level of scepticism. If a new medicine is introduced, the appropriate null hypothesis is that it does not perform better than the current drug on the market. If you evaluate the effect of an economic policy, the natural null hypothesis is that it had no effect whatsoever. In both cases you put the burden of evidence on the new medicine/theory/policy.
Problems

1. The output voltage for a certain electric circuit is specified to be 130. A sample of 40 independent readings on the voltage for this circuit gave a sample mean of 128.6 and a standard deviation of 2.1.

(a) Test the hypothesis that the average output voltage is 130 against the alternative that it is less than 130, using a test with level α = 0.05.

(b) If the average voltage falls as low as 128, serious consequences may occur. Calculate the probability of committing a type II error for H1 : V = 128, given the decision rule outlined in (a).

2. Let Y1, Y2, ..., Yn be a random sample of size n = 20 from a normal distribution with unknown mean µ and known variance σ² = 5. We wish to test H0 : µ ≤ 7 versus H1 : µ > 7.

(a) Find the uniformly most powerful test with significance level 0.05.

(b) For the test in (a), find the power at each of the following alternative values for µ: µ1 = 7.5, µ1 = 8.0, µ1 = 8.5, and µ1 = 9.0.

3. In a study to assess various effects of using a female model in automobile advertising, each of 100 male subjects was shown photographs of two automobiles matched for price, colour, and size but of different makes. Fifty of the subjects (group A) were shown automobile 1 with a female model and automobile 2 with no model. Both automobiles were shown without the model to the other 50 subjects (group B). In group A, automobile 1 (shown with the model) was judged to be more expensive by 37 subjects. In group B, automobile 1 was judged to be more expensive by 23 subjects. Do these results indicate that using a female model increases the perceived cost of an automobile? Find the associated p-value and indicate your conclusion for an α = 0.05 level test.

Appendix A

Exercise Solutions

A.1 Sampling Distributions

1. (a) As the population is normally distributed, we have (Xi − µ)/σ = Z ∼ N(0, 1). Here Z = (98 − 100)/3 = −0.67. Looking up Z = −0.67 in the standard normal table gives Pr[Xi ≤ 98] = Pr[Z ≤ −0.67] = 0.2514.

(b) As the population is normally distributed, we have (X̄ − µ)/(σ/√n) = Z ∼ N(0, 1). Here Z = (98 − 100)/(3/3) = −2. Looking up Z = −2 in the standard normal table gives Pr[X̄ ≤ 98] = Pr[Z ≤ −2] = 0.0228.

(c) If we use the sample variance to calculate the standard error, rather than the population variance, the resulting sampling distribution will be t_{n−1} rather than standard normal. That is, (X̄ − µ)/(S/√n) = T ∼ t_{n−1}, and Pr[X̄ ≤ 98] = Pr[T ≤ −2] = 0.0403. (If you use the Student-t tables, you will only be able to establish that 0.025 < Pr < 0.05.)

2. (a) E(X̄) = µ1, Var(X̄) = n⁻¹σ1².

(b) E(X̄ − Ȳ) = 0, Var(X̄ − Ȳ) = n⁻¹σ1² + m⁻¹σ2² = n⁻¹(σ1² + σ2²) = 4.5/n.

(c) Setting σ_{X̄−Ȳ} = 0.1 gives √(4.5/n) = 0.1, so n = 4.5/(0.1)² = 450.

3. (a) Find a such that Pr[S1² ≤ a] = 0.95. We know that (n − 1)S1²/σ1² ∼ χ²_{n−1}, so with χ²_{(10),0.05} = 18.307 (the upper 5% point):

Pr[(n − 1)S1²/σ1² ≤ 18.307] = Pr[10 × S1²/15 ≤ 18.307] = Pr[S1² ≤ 18.307 × 15/10] = Pr[S1² ≤ 27.46] = 0.95.

(b) Find b such that Pr[S1²/S2² ≤ b] = 0.95. We know that (S1²/σ1²)/(S2²/σ2²) ∼ F_{(n1−1, n2−1)}, so with F_{(n1−1, n2−1),0.05} = 2.98:

Pr[(S1²/15)/(S2²/15) ≤ 2.98] = Pr[S1²/S2² ≤ 2.98] = 0.95.

4. (a) Σ_{i=1}^{4} Zi ∼ N(0, 4)

(b) Σ_{i=1}^{4} Zi² ∼ χ²_4

(c) Z1² / (Σ_{i=2}^{4} Zi²/3) ∼ F_{1,3}

(d) Z1 / √(Σ_{i=2}^{4} Zi²/3) ∼ t_3
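For readers who want to double-check the table values used in these solutions, here is a small sketch (Python with scipy assumed; the t-quantile check uses df = 8, which is what the quoted 0.0403 in solution 1(c) corresponds to):

```python
from scipy.stats import norm, t, chi2, f

# Spot-check of the table values used in the solutions above (scipy assumed).
print(norm.cdf(-2))                  # about 0.0228, solution 1(b)
print(t.cdf(-2, df=8))               # about 0.0403, solution 1(c); df = 8 implied by that value
print(chi2.ppf(0.95, df=10))         # 18.307, the upper 5% point of chi-square(10)
print(f.ppf(0.95, dfn=10, dfd=10))   # about 2.98, the upper 5% point of F(10, 10)
```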
A.2 Large Sample Theory

1. First we show that the expectation of the sample mean equals the average of the population means:

E(X̄) = n⁻¹ E[X1 + . . . + Xn] = n⁻¹ Σ_{i=1}^{n} µi = µ.

The variance of the sample mean is

Var(X̄) = (1/n²) Σ_i σi².

Next we use Chebyshev's inequality to establish that

Pr[(X̄n − µ)² > ε²] ≤ Var(X̄n)/ε² = (Σ_i σi²)/(n²ε²) → 0,

which concludes the proof.

2. Approximate Pr[S100 < 120] with Xi ∼ CDF(1.5, 1). The CLT states that

(Sn − E(Sn)) / √(σ²_{Sn}) →d N(0, 1), with σ²_{Sn} = nσ².

Thus z = (120 − 150)/√100 = −3, and Pr(Z < −3) = 1 − 0.9987 ≈ 0.13%.

3. Again, use the CLT to approximate the sampling distribution with a normal distribution. As the variance is known, the sampling distribution is approximately

(X̄ − µ) / (σ/√n) ∼ N(0, 1),

with σ² = 25. Next we look up the quantile associated with an exceedance probability of 0.05/2 = 0.025: z_{0.025} = 1.96. So we solve

1.5 / (σ/√n) = 1.96,
√n = (1.96/1.5) × 5,
n = (1.96²/1.5²) × 25 ≈ 42.7.

Normally n would be rounded up, in this case to 43, to ensure that the probability is at least the desired level.

A.3 Estimation

1. (a) (i) µ̂1 = (X1 + Xn)/2 is unbiased, but inconsistent.

Unbiasedness: E[µ̂1] = E[(X1 + Xn)/2] = (E X1 + E Xn)/2 = 2µ/2 = µ.

Consistency: E(µ̂1) → µ, but Var(µ̂1) = σ²/2 → σ²/2 ≠ 0, so µ̂1 does not converge in probability to µ.

(ii) µ̂2 = X1/4 + (1/2) Σ_{i=2}^{n−1} Xi/(n − 2) + Xn/4 is unbiased, but inconsistent.

Unbiasedness:

E(µ̂2) = E[(X1 + Xn)/4] + E[(1/2) Σ_{i=2}^{n−1} Xi/(n − 2)] = µ/2 + (1/2)(n − 2)µ/(n − 2) = µ.

Consistency: E(µ̂2) → µ, but Var(µ̂2) = σ²/8 + (1/4)σ²/(n − 2) → σ²/8 ≠ 0, so µ̂2 does not converge in probability to µ.

(iii) µ̂3 = Σ_{i=1}^{n} Xi/(n + k), 0 < k ≤ 3, is biased, but consistent.

Unbiasedness:

E(µ̂3) = (n + k)⁻¹ Σ_{i=1}^{n} E(Xi) = nµ/(n + k) = µ − kµ/(n + k) ≠ µ.

Consistency:

lim_{n→∞} E(µ̂3) = lim_{n→∞} nµ/(n + k) = µ, and Var(µ̂3) = nσ²/(n + k)² → 0, so µ̂3 → µ in probability.

(iv) µ̂4 = X̄ is both unbiased and consistent; see the lecture notes.

(b) Relative efficiency of the sample mean with respect to each alternative estimator, with MSEi = Var(µ̂i) + bias²(µ̂i). For n = 36, σ² = 20, µ = 15, and k = 3 we have

(i) MSE1 = σ²/2 = 10
(ii) MSE2 = σ²/8 + (1/4)σ²/(n − 2) = 45/17
(iii) MSE3 = nσ²/(n + k)² + (kµ/(n + k))² = 80/169 + 225/169 = 305/169
(iv) MSE4 = σ²/n = 20/36

So the relative efficiency of the sample mean (µ̂4) is

(i) MSE4/MSE1 = 0.056
(ii) MSE4/MSE2 = 0.210
(iii) MSE4/MSE3 = 0.308

2. Note that θ̂3 is unbiased:

E(θ̂3) = E(aθ̂1 + (1 − a)θ̂2) = aθ + (1 − a)θ = θ.

The variance of θ̂3 is

σ3² = Σ_i Σ_j bi bj γi,j = a²σ1² + (1 − a)²σ2² + 2a(1 − a)γ.

Consider the general case with γ unconstrained. To minimize σ3², solve

argmin_a [a²σ1² + (1 − a)²σ2² + 2a(1 − a)γ]:

∂σ3²/∂a = 2aσ1² − 2(1 − a)σ2² + 2(1 − 2a)γ = 2a(σ1² + σ2² − 2γ) − 2(σ2² − γ) = 0,

so that, noting γ = ρσ1σ2,

(a) with γ = 0: a = σ2²/(σ1² + σ2²);

(b) in general: a = (σ2² − γ)/(σ1² + σ2² − 2γ).
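A quick numerical check of the optimal weight derived in solution 2 (Python with numpy assumed; the variances and the covariance of the two estimators are made-up values):

```python
import numpy as np

# Illustrative values only: Var(theta1_hat), Var(theta2_hat), Cov(theta1_hat, theta2_hat).
s1_sq, s2_sq, gamma = 2.0, 1.0, 0.5

def var_combined(a):
    # Var(a * theta1_hat + (1 - a) * theta2_hat)
    return a**2 * s1_sq + (1 - a)**2 * s2_sq + 2 * a * (1 - a) * gamma

a_star = (s2_sq - gamma) / (s1_sq + s2_sq - 2 * gamma)   # closed-form optimum from above
grid = np.linspace(0, 1, 1001)
a_grid = grid[np.argmin(var_combined(grid))]             # brute-force check on a grid

print(f"closed-form a* = {a_star:.3f}, grid minimum at a = {a_grid:.3f}")
```

Both approaches give a = 0.25 for these values, confirming the closed-form expression.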
3. (a) 90% confidence intervals for the mean (x̄ = 10.15, s = 2.34, n = 21):

(i) Two-sided: [x̄ ± t_{0.05,20} × s/√n], with t_{0.05,20} = 1.725 and s/√n = 2.34/√21 = 0.51:

CI = [10.15 ± 1.725 × 0.51] = [9.27, 11.03].

(ii) Upper: (−∞, x̄ + t_{0.10,20} × s/√n], with t_{0.10,20} = 1.325:

CI_H = (−∞, 10.83].

(iii) Lower: [x̄ − t_{0.10,20} × s/√n, ∞):

CI_L = [9.47, ∞).

(b) 90% confidence intervals for the variance (s² = 2.34² = 5.48):

(i) Two-sided: χ²_{0.05,20} = 31.41, χ²_{0.95,20} = 10.85:

CI = [20 × 5.48/31.41, 20 × 5.48/10.85] = [3.49, 10.10].

(ii) Upper: χ²_{0.90,20} = 12.44: CI_H = [0, 8.80].

(iii) Lower: χ²_{0.10,20} = 28.41: CI_L = [3.86, ∞).

(c) To find the 90% confidence intervals for the standard deviation, take the square roots of the endpoints of the intervals for the variance:

(i) Two-sided: [1.87, 3.18]
(ii) Upper: [0, 2.97]
(iii) Lower: [1.96, ∞)

A.4 Hypothesis Testing

1. (a) Test H0 : ν ≥ 130 vs. H1 : ν < 130 with α = 0.05; n = 40, ν̂ = 128.6, s = 2.1. Using the CLT we can construct a test statistic with a known sampling distribution:

z = (ν̂ − ν)/(σ/√n) ∼ N(0, 1), or, with the estimated standard error, t = (ν̂ − ν)/(s/√n) ∼ t(n − 1).

Here

t_ν̂ = (128.6 − 130)/(2.1/√40) = −1.4/0.33 = −4.24.

Rejection region: t_ν̂ < t_{0.05}, with t_{0.05}(39) ≈ t_{0.05}(40) = −1.684 (compare z_{0.05} = −1.645). Since −4.24 < −1.684, reject H0: ν is significantly lower than 130 at the 5% level.

(b) Decision rule: reject H0 if (V̂ − 130)/(2.1/√40) < −1.684, i.e. if V̂ < 129.44. The probability of a type II error is then

P[V̂ ≥ 129.44 | ν = 128] = P[(V̂ − 128)/0.33 ≥ (129.44 − 128)/0.33],

and (129.44 − 128)/0.33 = 4.36; looking this up in a t-table or z-table shows that this probability is well below 0.1%.

2. Consider H0 : µ ≤ 7 vs. H1 : µ > 7, with Yi ∼ N(µ, 5) and n = 20.

(a) Uniformly most powerful test:

argmax_{m_crit} P(µ̂ > m_crit | µ1) subject to P(µ̂ > m_crit | µ0) ≤ α,

i.e. set m_crit such that P(µ̂ > m_crit | µ0) = 0.05. By the CLT we know that

(µ̂ − µ)/(σ/√n) ∼ N(0, 1), with z_{0.95} = 1.645.

Thus

(m_crit − 7)/√(5/20) = 1.645, so m_crit = 7 + 1.645 × √(5/20) = 7.8225,

i.e. rejection region: reject if µ̂ > 7.8225.

(b) Find the power of the test, (1 − β) = P(µ̂ > m_crit | µ1), when the alternative takes on the following values (again using the CLT for the sampling distribution):

µ1 = 7.5: (7.8225 − 7.5)/0.5 = 0.645, so (1 − β) = 0.26.
µ1 = 8.0: (7.8225 − 8.0)/0.5 = −0.355, so (1 − β) = 0.64.
µ1 = 8.5: (7.8225 − 8.5)/0.5 = −1.355, so (1 − β) = 0.91.
µ1 = 9.0: (7.8225 − 9.0)/0.5 = −2.355, so (1 − β) = 0.99.

3. In effect there are two random samples, each a sequence of Bernoulli trials with n = 50 and some parameter φ ∈ [0, 1]. Setting up the null and alternative hypotheses yields

H0 : φ1 ≤ φ2 vs. H1 : φ1 > φ2, or alternatively H0 : (φ1 − φ2) ≤ 0 vs. H1 : (φ1 − φ2) > 0.

A Bernoulli distribution has mean φ and variance φ(1 − φ). Remember that the setup of this test is strongly reminiscent of exercise 7, implying that Var(φ̂1 − φ̂2) = σ1²/n1 + σ2²/n2 and that, by the CLT, the difference is approximately normally distributed. Replacing the population variances with their sample equivalents yields the following test statistic, which has a t-distribution with approximately (n1 + n2 − 2) degrees of freedom:¹

t = ((φ̂1 − φ̂2) − 0) / √((s1² + s2²)/n).

Filling in the numbers yields

t = (0.74 − 0.46) / √([0.74(1 − 0.74) + 0.46(1 − 0.46)]/50) = 2.982.

Checking the t-table, the p-value is less than 0.005. Since the p-value is less than 0.05, reject H0. Alternatively, the critical value is τ_{0.95} = 1.661; since t > τ_{0.95}, reject H0. In both cases the conclusion is that, indeed, the inclusion of a female model significantly increases the probability that a car is perceived to be more expensive. Note also that the 95% upper confidence interval under the null is (−∞, 0.156]; observing (φ̂1 − φ̂2) = 0.28 falls outside these confidence bounds, which again leads to the conclusion that H0 can be rejected.
¹ To be more precise, the degrees of freedom are estimated by

(σ1²/n1 + σ2²/n2)² / [ (σ1²/n1)²/(n1 − 1) + (σ2²/n2)²/(n2 − 1) ].
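A minimal sketch that reproduces the numbers in solution 3, including the footnoted degrees-of-freedom formula (Python with scipy assumed):

```python
from math import sqrt
from scipy.stats import t

# Data from solution 3: 37/50 and 23/50 judged automobile 1 more expensive.
n1 = n2 = 50
p1, p2 = 37 / 50, 23 / 50                 # 0.74 and 0.46

se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 50)
t_stat = (p1 - p2) / se                   # about 2.98

# Degrees of freedom via the footnoted formula, with sample variances plugged in
v1, v2 = p1 * (1 - p1) / n1, p2 * (1 - p2) / n2
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

p_value = t.sf(t_stat, df)                # one-sided p-value
print(f"t = {t_stat:.3f}, df = {df:.1f}, p-value = {p_value:.4f}")
```

The resulting p-value is well below 0.005, matching the table-based conclusion above.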