CHAPTER 3 CLASSICAL LINEAR REGRESSION MODELS

Key words: Classical linear regression, Conditional heteroskedasticity, Conditional homoskedasticity, F-test, GLS, Hypothesis testing, Model selection criterion, OLS, $R^2$, t-test

Abstract: In this chapter, we introduce the classical linear regression theory, including the classical model assumptions, the statistical properties of the OLS estimator, the t-test and the F-test, as well as the GLS estimator and related statistical procedures. This chapter serves as a starting point from which we will develop the modern econometric theory.

3.1 Assumptions

Suppose we have a random sample $\{Z_t\}_{t=1}^n$ of size $n$, where $Z_t = (Y_t, X_t')'$, $Y_t$ is a scalar, $X_t = (1, X_{1t}, X_{2t}, \ldots, X_{kt})'$ is a $(k+1)\times 1$ vector, $t$ is an index (either a cross-sectional unit or a time period) for observations, and $n$ is the sample size. We are interested in modelling the conditional mean $E(Y_t|X_t)$ using an observed realization (i.e., a data set) of the random sample $\{Y_t, X_t'\}'$, $t = 1,\ldots,n$.

Notations:
(i) $K \equiv k+1$ is the number of regressors, consisting of $k$ economic variables and an intercept.
(ii) The index $t$ may denote an individual unit (e.g., a firm, a household, a country) for cross-sectional data, or a time period (e.g., day, week, month, year) in a time series context.

We first list and discuss the assumptions of the classical linear regression theory.

Assumption 3.1 [Linearity]:
$$Y_t = X_t'\beta^0 + \varepsilon_t, \quad t = 1,\ldots,n,$$
where $\beta^0$ is a $K\times 1$ unknown parameter vector, and $\varepsilon_t$ is an unobservable disturbance.

Remarks:
(i) In Assumption 3.1, $Y_t$ is the dependent variable (or regressand), $X_t$ is the vector of regressors (or independent variables, or explanatory variables), and $\beta^0$ is the regression coefficient vector.
When the linear model is correctly specified for the conditional mean $E(Y_t|X_t)$, i.e., when $E(\varepsilon_t|X_t) = 0$, the parameter $\beta^0 = \partial E(Y_t|X_t)/\partial X_t$ is the marginal effect of $X_t$ on $Y_t$.
(ii) The key notion of linearity in the classical linear regression model is that the regression model is linear in $\beta^0$ rather than in $X_t$.
(iii) Does Assumption 3.1 imply a causal relationship from $X_t$ to $Y_t$? Not necessarily. As Kendall and Stuart (1961, Vol. 2, Ch. 26, p. 279) point out, "a statistical relationship, however strong and however suggestive, can never establish causal connection: our ideas of causation must come from outside statistics ultimately, from some theory or other." Assumption 3.1 only implies a predictive relationship: given $X_t$, can we predict $Y_t$ linearly?
(iv) Matrix notation: Denote
$$Y = (Y_1,\ldots,Y_n)', \quad n\times 1;$$
$$\varepsilon = (\varepsilon_1,\ldots,\varepsilon_n)', \quad n\times 1;$$
$$X = (X_1,\ldots,X_n)', \quad n\times K.$$
Note that the $t$-th row of $X$ is $X_t' = (1, X_{1t},\ldots,X_{kt})$. With these matrix notations, we have a compact expression for Assumption 3.1:
$$\underset{n\times 1}{Y} = \underset{(n\times K)(K\times 1)}{X\beta^0} + \underset{n\times 1}{\varepsilon}.$$

The second assumption is a strict exogeneity condition.

Assumption 3.2 [Strict Exogeneity]:
$$E(\varepsilon_t|X) = E(\varepsilon_t|X_1,\ldots,X_t,\ldots,X_n) = 0, \quad t = 1,\ldots,n.$$

Remarks:
(i) Among other things, this condition implies correct model specification for $E(Y_t|X_t)$. This is because Assumption 3.2 implies $E(\varepsilon_t|X_t) = 0$ by the law of iterated expectations.
(ii) Assumption 3.2 also implies $E(\varepsilon_t) = 0$ by the law of iterated expectations.
(iii) Under Assumption 3.2, we have $E(X_s\varepsilon_t) = 0$ for any pair $(t,s)$, where $t,s \in \{1,\ldots,n\}$. This follows because
$$E(X_s\varepsilon_t) = E[E(X_s\varepsilon_t|X)] = E[X_s E(\varepsilon_t|X)] = E(X_s\cdot 0) = 0.$$
Note that (i) and (ii) imply $\mathrm{cov}(X_s,\varepsilon_t) = 0$ for all $t,s \in \{1,\ldots,n\}$.
(iv) Because $X$ contains the regressors $\{X_s\}$ for both $s \le t$ and $s > t$, Assumption 3.2 essentially requires that the error $\varepsilon_t$ not depend on the past and future values of the regressors if $t$ is a time index.
This rules out dynamic time series models, for which $\varepsilon_t$ may be correlated with future values of the regressors (because the future values of the regressors depend on current shocks), as illustrated in the following example.

Example 1: Consider a so-called AutoRegressive AR(1) model
$$Y_t = \beta_0 + \beta_1 Y_{t-1} + \varepsilon_t = X_t'\beta + \varepsilon_t, \quad t = 1,\ldots,n, \quad \{\varepsilon_t\} \sim \text{i.i.d.}(0,\sigma^2),$$
where $X_t = (1, Y_{t-1})'$. Obviously, $E(X_t\varepsilon_t) = E(X_t)E(\varepsilon_t) = 0$, but $E(X_{t+1}\varepsilon_t) \ne 0$. Thus we have $E(\varepsilon_t|X) \ne 0$, and so Assumption 3.2 does not hold. In Chapter 5, we will consider linear regression models with dependent observations, which will include this example as a special case. In fact, the main reason for imposing Assumption 3.2 is to obtain a finite sample distribution theory. For a large sample theory (i.e., an asymptotic theory), the strict exogeneity condition will not be needed.

(v) In econometrics, there are alternative definitions of strict exogeneity. For example, one definition assumes that $\varepsilon_t$ and $X$ are independent. An example is that $X$ is nonstochastic. This rules out conditional heteroskedasticity (i.e., $\mathrm{var}(\varepsilon_t|X)$ depending on $X$). In Assumption 3.2, we still allow for conditional heteroskedasticity, because we do not assume that $\varepsilon_t$ and $X$ are independent. We only assume that the conditional mean $E(\varepsilon_t|X)$ does not depend on $X$.

(vi) Question: What happens to Assumption 3.2 if $X$ is nonstochastic?
Answer: If $X$ is nonstochastic, Assumption 3.2 becomes $E(\varepsilon_t|X) = E(\varepsilon_t) = 0$. An example of nonstochastic $X$ is $X_t = (1, t, \ldots, t^k)'$. This corresponds to a time-trend regression model
$$Y_t = X_t'\beta^0 + \varepsilon_t = \sum_{j=0}^k \beta_j^0 t^j + \varepsilon_t.$$

(vii) Question: What happens to Assumption 3.2 if $Z_t = (Y_t, X_t')'$ is an independent random sample (i.e., $Z_t$ and $Z_s$ are independent whenever $t \ne s$, although $Y_t$ and $X_t$ may not be independent)?
Answer: When $\{Z_t\}$ is i.i.d., Assumption 3.2 becomes
$$E(\varepsilon_t|X) = E(\varepsilon_t|X_1, X_2, \ldots, X_t, \ldots, X_n) = E(\varepsilon_t|X_t) = 0.$$
In other words, when $\{Z_t\}$ is i.i.d., $E(\varepsilon_t|X) = 0$ is equivalent to $E(\varepsilon_t|X_t) = 0$.

Assumption 3.3 [Nonsingularity]: The minimum eigenvalue of the $K\times K$ square matrix $X'X = \sum_{t=1}^n X_t X_t'$ satisfies
$$\lambda_{\min}(X'X) \to \infty \text{ as } n \to \infty$$
with probability one.

Question: How do we find the eigenvalues of a square matrix $A$?
Answer: By solving the equation $\det(A - \lambda I) = 0$, where $\det(\cdot)$ denotes the determinant of a square matrix, and $I$ is an identity matrix with the same dimension as $A$.

Remarks:
(i) Assumption 3.3 rules out multicollinearity among the $(k+1)$ regressors in $X_t$. We say that there exists multicollinearity among the $X_t$ if, for all $t \in \{1,\ldots,n\}$, the variable $X_{jt}$ for some $j \in \{0,1,\ldots,k\}$ is a linear combination of the other $K-1$ column variables.
(ii) This condition implies that $X'X$ is nonsingular, or equivalently, that $X$ must be of full rank $K = k+1$. Thus, we need $K \le n$; that is, the number of regressors cannot be larger than the sample size.
(iii) It is well known that eigenvalues can be used to summarize the information contained in a matrix (recall the popular principal component analysis). Assumption 3.3 implies that new information must become available as the sample size $n \to \infty$ (i.e., $X_t$ should not merely repeat the same values as $t$ increases).
(iv) Intuitively, if there is no variation in the values of $X_t$, it will be difficult to determine the relationship between $Y_t$ and $X_t$ (indeed, the purpose of classical linear regression is to investigate how a change in $X$ causes a change in $Y$). In a certain sense, one may call $X'X$ the "information matrix" of the random sample $X$, because it is a measure of the information contained in $X$. The degree of variation in $X'X$ will affect the precision of parameter estimation for $\beta^0$.

Question: Why can the eigenvalues of $X'X$ be used as a measure of the information contained in $X$?

Assumption 3.4 [Spherical Error Variance]:
(a) [conditional homoskedasticity] $E(\varepsilon_t^2|X) = \sigma^2 > 0$, $t = 1,\ldots,n$;
(b) [conditional serial uncorrelatedness] $E(\varepsilon_t\varepsilon_s|X) = 0$, $t \ne s$, $t,s \in \{1,\ldots,n\}$.

Remarks:
(i) We can write Assumption 3.4 compactly as $E(\varepsilon_t\varepsilon_s|X) = \sigma^2\delta_{ts}$, where $\delta_{ts} = 1$ if $t = s$ and $\delta_{ts} = 0$ otherwise. In mathematics, $\delta_{ts}$ is called the Kronecker delta function.
(ii) $\mathrm{var}(\varepsilon_t|X) = E(\varepsilon_t^2|X) - [E(\varepsilon_t|X)]^2 = E(\varepsilon_t^2|X) = \sigma^2$ given Assumption 3.2. Similarly, we have $\mathrm{cov}(\varepsilon_t,\varepsilon_s|X) = E(\varepsilon_t\varepsilon_s|X) = 0$ for all $t \ne s$.
(iii) By the law of iterated expectations, Assumption 3.4 implies $\mathrm{var}(\varepsilon_t) = \sigma^2$ for all $t = 1,\ldots,n$, the so-called unconditional homoskedasticity. Similarly, $\mathrm{cov}(\varepsilon_t,\varepsilon_s) = 0$ for all $t \ne s$.
(iv) A compact expression for Assumptions 3.2 and 3.4 together is: $E(\varepsilon|X) = 0$ and $E(\varepsilon\varepsilon'|X) = \sigma^2 I$, where $I \equiv I_n$ is an $n\times n$ identity matrix.
(v) Assumption 3.4 does not imply that $\varepsilon_t$ and $X$ are independent. It allows the possibility that the conditional higher order moments (e.g., skewness and kurtosis) of $\varepsilon_t$ depend on $X$.

3.2 OLS Estimation

Question: How do we estimate $\beta^0$ using an observed data set generated from the random sample $\{Z_t\}_{t=1}^n$, where $Z_t = (Y_t, X_t')'$?

Definition [OLS estimator]: Suppose Assumptions 3.1 and 3.3 hold. Define the sum of squared residuals (SSR) of the linear regression model $Y_t = X_t'\beta + u_t$ as
$$SSR(\beta) \equiv (Y - X\beta)'(Y - X\beta) = \sum_{t=1}^n (Y_t - X_t'\beta)^2.$$
Then the Ordinary Least Squares (OLS) estimator $\hat\beta$ is the solution to
$$\hat\beta = \arg\min_{\beta\in\mathbb{R}^K} SSR(\beta).$$

Remark: $SSR(\beta)$ is the sum of squared model errors $\{u_t = Y_t - X_t'\beta\}$ over the observations, with equal weighting for each $t$.

Theorem 1 [Existence of OLS]: Under Assumptions 3.1 and 3.3, the OLS estimator $\hat\beta$ exists and
$$\hat\beta = (X'X)^{-1}X'Y = \left(\sum_{t=1}^n X_t X_t'\right)^{-1}\sum_{t=1}^n X_t Y_t = \left(\frac{1}{n}\sum_{t=1}^n X_t X_t'\right)^{-1}\frac{1}{n}\sum_{t=1}^n X_t Y_t.$$
The latter expression will be useful for our asymptotic analysis in subsequent chapters.
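Theorem 1's closed form can be computed directly. The sketch below (simulated data; all variable names and parameter values are illustrative, not from the text) forms $X'X$, checks that its minimum eigenvalue is positive and large, in the spirit of Assumption 3.3, and then solves the normal equations $(X'X)\hat\beta = X'Y$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 2
beta_true = np.array([1.0, 2.0, -0.5])       # (intercept, beta1, beta2); illustrative

# Design matrix X with an intercept column: X_t = (1, X_1t, X_2t)'.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
eps = rng.normal(0.0, 1.0, n)
Y = X @ beta_true + eps

# Assumption 3.3: the minimum eigenvalue of X'X should grow with n.
XtX = X.T @ X
lam_min = np.linalg.eigvalsh(XtX).min()      # eigvalsh: X'X is symmetric
assert lam_min > 0                           # X'X nonsingular, X has full rank K

# OLS estimator: beta_hat = (X'X)^{-1} X'Y, computed by solving the
# normal equations rather than inverting X'X explicitly.
beta_hat = np.linalg.solve(XtX, X.T @ Y)
print(beta_hat)                              # close to beta_true for moderate n
```

Solving the normal equations with `np.linalg.solve` is preferred to forming $(X'X)^{-1}$ explicitly for numerical stability; the two give the same estimator in exact arithmetic.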
Remark: Suppose $Z_t = (Y_t, X_t')'$, $t = 1,\ldots,n$, is an independent and identically distributed (i.i.d.) random sample of size $n$. Compare the population MSE criterion
$$MSE(\beta) = E(Y_t - X_t'\beta)^2$$
and the sample MSE criterion
$$\frac{SSR(\beta)}{n} = \frac{1}{n}\sum_{t=1}^n (Y_t - X_t'\beta)^2.$$
By the weak law of large numbers (WLLN), we have
$$\frac{SSR(\beta)}{n} \xrightarrow{p} E(Y_t - X_t'\beta)^2 = MSE(\beta)$$
and their minimizers satisfy
$$\hat\beta = \left(\frac{1}{n}\sum_{t=1}^n X_t X_t'\right)^{-1}\frac{1}{n}\sum_{t=1}^n X_t Y_t \xrightarrow{p} [E(X_t X_t')]^{-1}E(X_t Y_t) = \beta^*.$$
Here $SSR(\beta)$, after being scaled by $n^{-1}$, is the sample analogue of $MSE(\beta)$, and the OLS estimator $\hat\beta$ is the sample analogue of the best LS approximation coefficient $\beta^*$.

Proof of the theorem: Using the formula that, for vectors $a$ and $\beta$,
$$\frac{\partial(a'\beta)}{\partial\beta} = a,$$
we have
$$\frac{dSSR(\beta)}{d\beta} = \frac{d}{d\beta}\sum_{t=1}^n (Y_t - X_t'\beta)^2 = \sum_{t=1}^n 2(Y_t - X_t'\beta)\frac{\partial}{\partial\beta}(Y_t - X_t'\beta) = -2\sum_{t=1}^n X_t(Y_t - X_t'\beta) = -2X'(Y - X\beta).$$
The OLS estimator must satisfy the first order condition (FOC):
$$-2X'(Y - X\hat\beta) = 0, \quad \text{i.e.,} \quad X'Y - (X'X)\hat\beta = 0.$$
It follows that $(X'X)\hat\beta = X'Y$. By Assumption 3.3, $X'X$ is nonsingular. Thus,
$$\hat\beta = (X'X)^{-1}X'Y.$$
Checking the second order condition (SOC), the $K\times K$ Hessian matrix
$$\frac{\partial^2 SSR(\beta)}{\partial\beta\,\partial\beta'} = -2\sum_{t=1}^n \frac{\partial}{\partial\beta'}\left[(Y_t - X_t'\beta)X_t\right] = 2X'X$$
is positive definite given $\lambda_{\min}(X'X) > 0$. Thus, $\hat\beta$ is a global minimizer. Note that for the existence of $\hat\beta$ we only need $X'X$ to be nonsingular, which is implied by the condition $\lambda_{\min}(X'X) \to \infty$ as $n \to \infty$; the existence itself does not require $\lambda_{\min}(X'X) \to \infty$.

Remarks:
(i) $\hat Y_t \equiv X_t'\hat\beta$ is called the fitted value (or predicted value) for observation $Y_t$, and $e_t \equiv Y_t - \hat Y_t$ is the estimated residual (or prediction error) for observation $Y_t$. Note that
$$e_t = Y_t - \hat Y_t = (X_t'\beta^0 + \varepsilon_t) - X_t'\hat\beta = \varepsilon_t - X_t'(\hat\beta - \beta^0),$$
i.e., the residual is the true regression disturbance plus an estimation error, where the true disturbance $\varepsilon_t$ is unavoidable, and the estimation error $X_t'(\hat\beta - \beta^0)$ can be made smaller when a larger data set is available (so that $\hat\beta$ becomes closer to $\beta^0$).
(ii) The FOC implies that the estimated residual vector $e = Y - X\hat\beta$ is orthogonal to the regressors $X$ in the sense that
$$X'e = \sum_{t=1}^n X_t e_t = 0.$$
This is a consequence of the very nature of OLS, as implied by the FOC of $\min_{\beta\in\mathbb{R}^K} SSR(\beta)$. It always holds, no matter whether $E(\varepsilon_t|X) = 0$ (recall that we have not imposed Assumption 3.2). Note that if $X_t$ contains an intercept, then $X'e = 0$ implies $\sum_{t=1}^n e_t = 0$.

Some useful identities

To investigate the statistical properties of $\hat\beta$, we first state some useful lemmas.

Lemma: Under Assumptions 3.1 and 3.3, we have:
(i) $X'e = 0$;
(ii) $\hat\beta - \beta^0 = (X'X)^{-1}X'\varepsilon$;
(iii) Define the $n\times n$ projection matrix $P = X(X'X)^{-1}X'$ and $M = I_n - P$. Then both $P$ and $M$ are symmetric (i.e., $P = P'$ and $M = M'$) and idempotent (i.e., $P^2 = P$, $M^2 = M$), with $PX = X$ and $MX = 0$;
(iv) $SSR(\hat\beta) = e'e = Y'MY = \varepsilon'M\varepsilon$.

Proof:
(i) The result follows immediately from the FOC of the OLS estimator.
(ii) Because $\hat\beta = (X'X)^{-1}X'Y$ and $Y = X\beta^0 + \varepsilon$, we have
$$\hat\beta - \beta^0 = (X'X)^{-1}X'(X\beta^0 + \varepsilon) - \beta^0 = (X'X)^{-1}X'\varepsilon.$$
(iii) $P$ is idempotent because
$$P^2 = PP = [X(X'X)^{-1}X'][X(X'X)^{-1}X'] = X(X'X)^{-1}X' = P.$$
Similarly we can show $M^2 = M$.
(iv) By the definition of $M$, we have
$$e = Y - X\hat\beta = Y - X(X'X)^{-1}X'Y = [I - X(X'X)^{-1}X']Y = MY = M(X\beta^0 + \varepsilon) = MX\beta^0 + M\varepsilon = M\varepsilon,$$
given $MX = 0$. It follows that
$$SSR(\hat\beta) = e'e = (M\varepsilon)'(M\varepsilon) = \varepsilon'M^2\varepsilon = \varepsilon'M\varepsilon,$$
where the last equality follows from $M^2 = M$.

3.3 Goodness of Fit and Model Selection Criteria

Question: How well does the linear regression model fit the data? That is, how well does the linear regression model explain the variation of the observed data on $\{Y_t\}$? We need some criteria or measures to characterize goodness of fit. We first introduce two measures of goodness of fit.
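Before turning to goodness of fit, the identities in the lemma, symmetry, idempotency, $PX = X$, $MX = 0$, and the orthogonality $X'e = 0$, can be verified numerically. This is a sketch with simulated data; the sample size and coefficients are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
Y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

# Projection matrix P = X(X'X)^{-1}X' and its complement M = I - P.
P = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - P

# Symmetry, idempotency, and the annihilation properties of the lemma.
assert np.allclose(P, P.T)          # P = P'
assert np.allclose(P @ P, P)        # P^2 = P
assert np.allclose(M @ M, M)        # M^2 = M
assert np.allclose(P @ X, X)        # PX = X
assert np.allclose(M @ X, 0)        # MX = 0

# Residuals e = MY are orthogonal to the regressors: X'e = 0, so with an
# intercept column the residuals also sum to zero.
e = M @ Y
assert np.allclose(X.T @ e, 0)
print(e.sum())                      # numerically zero
```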
The first measure is called the uncentered squared multi-correlation coefficient $R^2$.

Definition [Uncentered $R^2$]: The uncentered squared multi-correlation coefficient is defined as
$$R^2_{uc} = \frac{\hat Y'\hat Y}{Y'Y} = 1 - \frac{e'e}{Y'Y}.$$

Remark: Interpretation of $R^2_{uc}$: the proportion of the uncentered sample quadratic variation in the dependent variable $\{Y_t\}$ that can be explained by the uncentered sample quadratic variation of the predicted values $\{\hat Y_t\}$. Note that we always have $0 \le R^2_{uc} \le 1$.

Next, we define a closely related measure called the centered $R^2$.

Definition [Centered $R^2$: Coefficient of Determination]: The coefficient of determination is
$$R^2 \equiv 1 - \frac{\sum_{t=1}^n e_t^2}{\sum_{t=1}^n (Y_t - \bar Y)^2},$$
where $\bar Y = n^{-1}\sum_{t=1}^n Y_t$ is the sample mean.

Remarks:
(i) When $X_t$ contains an intercept, we have the following orthogonal decomposition:
$$\sum_{t=1}^n (Y_t - \bar Y)^2 = \sum_{t=1}^n (\hat Y_t - \bar Y + Y_t - \hat Y_t)^2 = \sum_{t=1}^n (\hat Y_t - \bar Y)^2 + \sum_{t=1}^n e_t^2 + 2\sum_{t=1}^n (\hat Y_t - \bar Y)e_t = \sum_{t=1}^n (\hat Y_t - \bar Y)^2 + \sum_{t=1}^n e_t^2,$$
where the cross-product term
$$\sum_{t=1}^n (\hat Y_t - \bar Y)e_t = \sum_{t=1}^n \hat Y_t e_t - \bar Y\sum_{t=1}^n e_t = \hat\beta'\sum_{t=1}^n X_t e_t - \bar Y\sum_{t=1}^n e_t = \hat\beta'(X'e) - \bar Y\cdot 0 = 0.$$
Here we have used the facts that $X'e = 0$ and $\sum_{t=1}^n e_t = 0$ from the FOC of the OLS estimation, together with the fact that $X_t$ contains an intercept (i.e., $X_{0t} = 1$). It follows that
$$R^2 = 1 - \frac{\sum_{t=1}^n e_t^2}{\sum_{t=1}^n (Y_t - \bar Y)^2} = \frac{\sum_{t=1}^n (\hat Y_t - \bar Y)^2}{\sum_{t=1}^n (Y_t - \bar Y)^2},$$
and consequently we have $0 \le R^2 \le 1$.
(ii) If $X_t$ does not contain an intercept, then the orthogonal decomposition identity
$$\sum_{t=1}^n (Y_t - \bar Y)^2 = \sum_{t=1}^n (\hat Y_t - \bar Y)^2 + \sum_{t=1}^n e_t^2$$
no longer holds. As a consequence, $R^2$ may be negative when there is no intercept! This is because the cross-product term $2\sum_{t=1}^n (\hat Y_t - \bar Y)e_t$ may be negative.
(iii) For any given random sample $\{Y_t, X_t'\}'$, $t = 1,\ldots,n$, $R^2$ is nondecreasing in the number of explanatory variables in $X_t$. In other words, the more explanatory variables are added to the linear regression, the higher $R^2$ becomes. This is always true, no matter whether $X_t$ has any true explanatory power for $Y_t$.

Theorem: Suppose $\{Y_t, X_t'\}'$, $t = 1,\ldots,n$, is a random sample, and $R_1^2$ is the centered $R^2$ from the linear regression
$$Y_t = X_t'\beta + u_t,$$
where $X_t = (1, X_{1t},\ldots,X_{kt})'$ and $\beta$ is a $K\times 1$ parameter vector; also, $R_2^2$ is the centered $R^2$ from the extended linear regression
$$Y_t = \tilde X_t'\gamma + v_t,$$
where $\tilde X_t = (1, X_{1t},\ldots,X_{kt}, X_{k+1,t},\ldots,X_{k+q,t})'$ and $\gamma$ is a $(K+q)\times 1$ parameter vector. Then $R_2^2 \ge R_1^2$.

Proof: By definition, we have
$$R_1^2 = 1 - \frac{e'e}{\sum_{t=1}^n (Y_t - \bar Y)^2}, \qquad R_2^2 = 1 - \frac{\tilde e'\tilde e}{\sum_{t=1}^n (Y_t - \bar Y)^2},$$
where $e$ is the estimated residual vector from the regression of $Y$ on $X$, and $\tilde e$ is the estimated residual vector from the regression of $Y$ on $\tilde X$. It suffices to show $\tilde e'\tilde e \le e'e$. Because the OLS estimator $\hat\gamma = (\tilde X'\tilde X)^{-1}\tilde X'Y$ minimizes $SSR(\gamma)$ for the extended model, we have
$$\tilde e'\tilde e = \sum_{t=1}^n (Y_t - \tilde X_t'\hat\gamma)^2 \le \sum_{t=1}^n (Y_t - \tilde X_t'\gamma)^2 \quad \text{for all } \gamma\in\mathbb{R}^{K+q}.$$
Now choose $\gamma = (\hat\beta', 0')'$, where $\hat\beta = (X'X)^{-1}X'Y$ is the OLS estimator from the first regression. It follows that
$$\tilde e'\tilde e \le \sum_{t=1}^n \left(Y_t - \sum_{j=0}^k \hat\beta_j X_{jt} - \sum_{j=k+1}^{k+q} 0\cdot X_{jt}\right)^2 = \sum_{t=1}^n (Y_t - X_t'\hat\beta)^2 = e'e.$$
Hence, we have $R_1^2 \le R_2^2$. This completes the proof.

Question: What is the implication of this result?
Answer: $R^2$ is not a suitable criterion for correct model specification, nor is it a suitable criterion for model selection.

Question: Does a high $R^2$ imply a precise estimation of $\beta^0$?

Question: What are suitable model selection criteria?

Two popular model selection criteria

The KISS principle: Keep It Sophisticatedly Simple! An important idea in statistics is to use a simple model to capture as much of the essential information contained in the data as possible. Below, we introduce two popular model selection criteria that reflect this idea.
Akaike Information Criterion (AIC): A linear regression model can be selected by minimizing the following AIC criterion with a suitable choice of $K$:
$$AIC = \ln(s^2) + \frac{2K}{n} \qquad (\text{goodness of fit} + \text{model complexity}),$$
where $s^2 = e'e/(n-K)$, and $K = k+1$ is the number of regressors.

Bayesian Information Criterion (BIC): A linear regression model can be selected by minimizing the following criterion with a suitable choice of $K$:
$$BIC = \ln(s^2) + \frac{K\ln(n)}{n}.$$

Remark: When $\ln n \ge 2$, BIC gives a heavier penalty for model complexity than AIC, where complexity is measured by the number of estimated parameters (relative to the sample size $n$). As a consequence, BIC will choose a more parsimonious linear regression model than AIC.

Remark: BIC is strongly consistent in the sense that it determines the true model asymptotically (i.e., as $n\to\infty$), whereas with AIC an overparameterized model will emerge no matter how large the sample is. Of course, such properties are not necessarily guaranteed in finite samples.

Remarks: In addition to AIC and BIC, there are other criteria, such as the so-called adjusted $R^2$, denoted $\bar R^2$, that can also be used to select a linear regression model. The adjusted $R^2$ is defined as
$$\bar R^2 = 1 - \frac{e'e/(n-K)}{(Y-\bar Y)'(Y-\bar Y)/(n-1)}.$$
This differs from
$$R^2 = 1 - \frac{e'e}{(Y-\bar Y)'(Y-\bar Y)}.$$
It may be shown that
$$\bar R^2 = 1 - \frac{n-1}{n-K}(1 - R^2).$$
All model selection criteria are structured in terms of the estimated residual variance $\hat\sigma^2$ plus a penalty adjustment involving the number of estimated parameters, and it is in the extent of this penalty that the criteria differ. For more discussion of these and other selection criteria, see Judge et al. (1985, Section 7.5).

Question: Why is it not good practice to use a complicated model?
Answer: A complicated model contains many unknown parameters. Given a fixed amount of data information, parameter estimation will become less precise if more parameters have to be estimated. As a consequence, the out-of-sample forecast for $Y_t$ may become less precise than the forecast of a simpler model.
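As a numerical illustration of the criteria just introduced, the sketch below (a hypothetical simulation; the data-generating values are illustrative) regresses a simulated $Y_t$ on nested sets of regressors where only the first slope is relevant. It shows that $R^2$ never decreases as regressors are added, while the BIC penalty exceeds the AIC penalty whenever $\ln n > 2$:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)
Y = 1.0 + 2.0 * x + rng.normal(size=n)       # true model: intercept and x only

results = {}
for K in (2, 3, 4):
    # Nested regressors (1, x, x^2, x^3): the powers beyond x are irrelevant.
    X = np.column_stack([x ** j for j in range(K)])
    e = Y - X @ np.linalg.solve(X.T @ X, X.T @ Y)
    ssr = e @ e
    s2 = ssr / (n - K)                        # s^2 = e'e/(n-K) as in the text
    R2 = 1.0 - ssr / np.sum((Y - Y.mean()) ** 2)
    aic = np.log(s2) + 2 * K / n
    bic = np.log(s2) + K * np.log(n) / n
    results[K] = (R2, aic, bic)

# R^2 is nondecreasing in K (the theorem above); BIC - AIC = K(ln n - 2)/n > 0
# here since ln(300) is about 5.7.
for K, (R2, aic, bic) in results.items():
    print(K, round(R2, 4), round(aic, 4), round(bic, 4))
```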
The latter may have a larger bias but more precise parameter estimates. Intuitively, a complicated model is too flexible in the sense that it may capture not only systematic components but also features of the data which will not show up again. Thus, it cannot forecast the future well.

3.4 Consistency and Efficiency of OLS

We now investigate the statistical properties of $\hat\beta$. We are interested in addressing the following basic questions: Is $\hat\beta$ a good estimator for $\beta^0$ (consistency)? Is $\hat\beta$ the best estimator (efficiency)? What is the sampling distribution of $\hat\beta$ (normality)?

Question: What is the sampling distribution of $\hat\beta$?
Answer: The distribution of $\hat\beta$ is called the sampling distribution of $\hat\beta$, because $\hat\beta$ is a function of the random sample $\{Z_t\}_{t=1}^n$, where $Z_t = (Y_t, X_t')'$.

Remark: The sampling distribution of $\hat\beta$ is useful for any statistical inference involving $\hat\beta$, such as confidence interval estimation and hypothesis testing.

We first investigate the statistical properties of $\hat\beta$.

Theorem: Suppose Assumptions 3.1-3.4 hold. Then:
(i) [Unbiasedness] $E(\hat\beta|X) = \beta^0$ and $E(\hat\beta) = \beta^0$.
(ii) [Vanishing Variance]
$$\mathrm{var}(\hat\beta|X) = E\left[(\hat\beta - E\hat\beta)(\hat\beta - E\hat\beta)'|X\right] = \sigma^2(X'X)^{-1},$$
and for any $K\times 1$ vector $\tau$ such that $\tau'\tau = 1$, we have $\tau'\mathrm{var}(\hat\beta|X)\tau \to 0$ as $n\to\infty$.
(iii) [Orthogonality between $e$ and $\hat\beta$]
$$\mathrm{cov}(\hat\beta, e|X) = E\{[\hat\beta - E(\hat\beta|X)]e'|X\} = 0.$$
(iv) [Gauss-Markov] $\mathrm{var}(\hat b|X) - \mathrm{var}(\hat\beta|X)$ is positive semi-definite (p.s.d.) for any unbiased estimator $\hat b$ that is linear in $Y$ with $E(\hat b|X) = \beta^0$.
(v) [Residual variance estimator]
$$s^2 = \frac{1}{n-K}\sum_{t=1}^n e_t^2 = e'e/(n-K)$$
is unbiased for $\sigma^2 = E(\varepsilon_t^2)$; that is, $E(s^2|X) = \sigma^2$.

Question: What is a linear estimator $\hat b$?

Remarks:
(a) Parts (i) and (ii) of the theorem together imply that the conditional MSE
$$MSE(\hat\beta|X) = E[(\hat\beta - \beta^0)(\hat\beta - \beta^0)'|X] = \mathrm{var}(\hat\beta|X) + \mathrm{Bias}(\hat\beta|X)\mathrm{Bias}(\hat\beta|X)' = \mathrm{var}(\hat\beta|X) \to 0 \text{ as } n\to\infty,$$
where we have used the fact that $\mathrm{Bias}(\hat\beta|X) \equiv E(\hat\beta|X) - \beta^0 = 0$. Recall that the MSE measures how close an estimator $\hat\beta$ is to the target $\beta^0$.
(b) Part (iv) implies that $\hat\beta$ is the best linear unbiased estimator (BLUE) for $\beta^0$, because $\mathrm{var}(\hat\beta|X)$ is the smallest among all linear unbiased estimators of $\beta^0$. Formally, we can define a related concept for comparing two estimators:

Definition [Efficiency]: An unbiased estimator $\hat\beta$ of parameter $\beta^0$ is more efficient than another unbiased estimator $\hat b$ of parameter $\beta^0$ if $\mathrm{var}(\hat b|X) - \mathrm{var}(\hat\beta|X)$ is p.s.d., i.e., for any $\tau\in\mathbb{R}^K$ such that $\tau'\tau = 1$, we have
$$\tau'\left[\mathrm{var}(\hat b|X) - \mathrm{var}(\hat\beta|X)\right]\tau \ge 0.$$
Implication: Choosing $\tau = (1,0,\ldots,0)'$, for example, we have $\mathrm{var}(\hat b_1) - \mathrm{var}(\hat\beta_1) \ge 0$.

Proof:
(i) Given $\hat\beta - \beta^0 = (X'X)^{-1}X'\varepsilon$, we have
$$E[(\hat\beta - \beta^0)|X] = E[(X'X)^{-1}X'\varepsilon|X] = (X'X)^{-1}X'E(\varepsilon|X) = (X'X)^{-1}X'\cdot 0 = 0.$$
(ii) Given $\hat\beta - \beta^0 = (X'X)^{-1}X'\varepsilon$ and $E(\varepsilon\varepsilon'|X) = \sigma^2 I$, we have
$$\mathrm{var}(\hat\beta|X) \equiv E[(\hat\beta - E\hat\beta)(\hat\beta - E\hat\beta)'|X] = E[(\hat\beta - \beta^0)(\hat\beta - \beta^0)'|X] = E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}|X]$$
$$= (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)X(X'X)^{-1} = (X'X)^{-1}X'\sigma^2 I X(X'X)^{-1} = \sigma^2(X'X)^{-1}.$$
Note that Assumption 3.4 is crucial here to obtain this expression for $\mathrm{var}(\hat\beta|X)$. Moreover, for any $\tau\in\mathbb{R}^K$ such that $\tau'\tau = 1$, we have
$$\tau'\mathrm{var}(\hat\beta|X)\tau = \sigma^2\tau'(X'X)^{-1}\tau \le \sigma^2\lambda_{\max}[(X'X)^{-1}] = \frac{\sigma^2}{\lambda_{\min}(X'X)} \to 0,$$
given $\lambda_{\min}(X'X)\to\infty$ as $n\to\infty$ with probability one. Note that the condition $\lambda_{\min}(X'X)\to\infty$ ensures that $\mathrm{var}(\hat\beta|X)$ vanishes as $n\to\infty$.
(iii) Given $\hat\beta - \beta^0 = (X'X)^{-1}X'\varepsilon$, $e = Y - X\hat\beta = MY = M\varepsilon$ (since $MX = 0$), and $E(e) = 0$, we have
$$\mathrm{cov}(\hat\beta, e|X) = E[(\hat\beta - E\hat\beta)(e - Ee)'|X] = E[(\hat\beta - \beta^0)e'|X] = E[(X'X)^{-1}X'\varepsilon\varepsilon'M|X]$$
$$= (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)M = (X'X)^{-1}X'\sigma^2 IM = \sigma^2(X'X)^{-1}X'M = 0.$$
Again, Assumption 3.4 plays a crucial role in ensuring zero correlation between $\hat\beta$ and $e$.
(iv) Consider a linear estimator $\hat b = C'Y$, where $C = C(X)$ is an $n\times K$ matrix depending on $X$. It is unbiased for $\beta^0$ if and only if
$$E(\hat b|X) = C'X\beta^0 + C'E(\varepsilon|X) = C'X\beta^0 = \beta^0 \quad \text{regardless of the value of } \beta^0,$$
which holds if and only if $C'X = I$. Because
$$\hat b - \beta^0 = C'(X\beta^0 + \varepsilon) - \beta^0 = C'X\beta^0 - \beta^0 + C'\varepsilon = C'\varepsilon,$$
the variance of $\hat b$ is
$$\mathrm{var}(\hat b|X) = E[(\hat b - \beta^0)(\hat b - \beta^0)'|X] = E[C'\varepsilon\varepsilon'C|X] = C'E(\varepsilon\varepsilon'|X)C = \sigma^2 C'C.$$
Using $C'X = I$, we now have
$$\mathrm{var}(\hat b|X) - \mathrm{var}(\hat\beta|X) = \sigma^2 C'C - \sigma^2(X'X)^{-1} = \sigma^2[C'C - C'X(X'X)^{-1}X'C] = \sigma^2 C'[I - X(X'X)^{-1}X']C$$
$$= \sigma^2 C'MC = \sigma^2 C'M'MC = \sigma^2(MC)'(MC) = \sigma^2 D'D,$$
where $D = MC$, and this is p.s.d. Here we have used the fact that, for any real-valued matrix $D$, the matrix $D'D$ is always p.s.d. [Question: How can this be shown?]
(v) Now we show $E[e'e/(n-K)] = \sigma^2$. Because $e'e = \varepsilon'M\varepsilon$ and $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, we have
$$E(e'e|X) = E(\varepsilon'M\varepsilon|X) = E[\mathrm{tr}(\varepsilon'M\varepsilon)|X] = E[\mathrm{tr}(\varepsilon\varepsilon'M)|X] \quad [\text{putting } A = \varepsilon'M, B = \varepsilon]$$
$$= \mathrm{tr}[E(\varepsilon\varepsilon'|X)M] = \mathrm{tr}(\sigma^2 IM) = \sigma^2\mathrm{tr}(M) = \sigma^2(n-K),$$
where
$$\mathrm{tr}(M) = \mathrm{tr}(I_n) - \mathrm{tr}(X(X'X)^{-1}X') = \mathrm{tr}(I_n) - \mathrm{tr}(X'X(X'X)^{-1}) = n - K,$$
using $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ again. It follows that
$$E(s^2|X) = \frac{E(e'e|X)}{n-K} = \frac{\sigma^2(n-K)}{n-K} = \sigma^2.$$
This completes the proof.
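The degrees-of-freedom identity $\mathrm{tr}(M) = n - K$ is exact, and the resulting unbiasedness $E(s^2) = \sigma^2$ can be seen in a small Monte Carlo experiment. This is a sketch; the design, $\sigma^2$, and replication count are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 40, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])

P = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - P

# tr(M) = tr(I_n) - tr(X(X'X)^{-1}X') = n - tr((X'X)^{-1}X'X) = n - K,
# exactly the degrees-of-freedom correction in s^2 = e'e/(n-K).
print(np.trace(M))       # n - K = 36

# Monte Carlo check that E(s^2) = sigma^2, holding X fixed across draws.
sigma2, reps = 2.0, 2000
beta = np.array([1.0, 0.5, -0.5, 0.2])
s2_draws = np.empty(reps)
for r in range(reps):
    eps = rng.normal(0.0, np.sqrt(sigma2), n)
    Y = X @ beta + eps
    e = M @ Y            # residuals: e = MY = M eps since MX = 0
    s2_draws[r] = (e @ e) / (n - K)
print(s2_draws.mean())   # close to sigma^2 = 2
```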
Note that the sample residual variance $s^2 = e'e/(n-K)$ is a generalization of the sample variance $S_n^2 = (n-1)^{-1}\sum_{t=1}^n (Y_t - \bar Y)^2$ for the random sample $\{Y_t\}_{t=1}^n$.

3.5 The Sampling Distribution of $\hat\beta$

To obtain the finite sample sampling distribution of $\hat\beta$, we impose a normality assumption on $\varepsilon$.

Assumption 3.5: $\varepsilon|X \sim N(0, \sigma^2 I)$.

Remarks:
(i) Under Assumption 3.5, the conditional pdf of $\varepsilon$ given $X$ is
$$f(\varepsilon|X) = \frac{1}{(\sqrt{2\pi\sigma^2})^n}\exp\left(-\frac{\varepsilon'\varepsilon}{2\sigma^2}\right) = f(\varepsilon),$$
which does not depend on $X$; the disturbance $\varepsilon$ is independent of $X$. Thus, every conditional moment of $\varepsilon$ given $X$ does not depend on $X$.
(ii) Assumption 3.5 implies Assumptions 3.2 ($E(\varepsilon|X) = 0$) and 3.4 ($E(\varepsilon\varepsilon'|X) = \sigma^2 I$).
(iii) The normality assumption may be reasonable for observations that are computed as the averages of outcomes of many repeated experiments, due to the effect of the so-called central limit theorem (CLT). This may occur in physics, for example. In economics, the normality assumption may not always be reasonable. For example, many high-frequency financial time series usually display heavy tails (with a kurtosis larger than 3).

Question: What is the sampling distribution of $\hat\beta$?

Sampling distribution of $\hat\beta$:
$$\hat\beta - \beta^0 = (X'X)^{-1}X'\varepsilon = (X'X)^{-1}\sum_{t=1}^n X_t\varepsilon_t = \sum_{t=1}^n C_t\varepsilon_t,$$
where the weighting vector $C_t = (X'X)^{-1}X_t$ is called the leverage of observation $X_t$.

Theorem [Normality of $\hat\beta$]: Under Assumptions 3.1, 3.3 and 3.5,
$$(\hat\beta - \beta^0)|X \sim N[0, \sigma^2(X'X)^{-1}].$$

Proof: Conditional on $X$, $\hat\beta - \beta^0$ is a weighted sum of the independent normal random variables $\{\varepsilon_t\}$, and so is itself normally distributed.

The corollary below follows immediately.

Corollary [Normality of $R(\hat\beta - \beta^0)$]: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Then for any nonstochastic $J\times K$ matrix $R$, we have
$$R(\hat\beta - \beta^0)|X \sim N[0, \sigma^2 R(X'X)^{-1}R'].$$

Question: What is the role of the nonstochastic matrix $R$?
Answer: $R$ is a selection matrix.
For example, when $R = (1, 0, \ldots, 0)$, we have $R(\hat\beta - \beta^0) = \hat\beta_0 - \beta_0^0$.

Proof: Conditional on $X$, $\hat\beta - \beta^0$ is normally distributed. Therefore, conditional on $X$, the linear combination $R(\hat\beta - \beta^0)$ is also normally distributed, with
$$E[R(\hat\beta - \beta^0)|X] = RE[(\hat\beta - \beta^0)|X] = R\cdot 0 = 0$$
and
$$\mathrm{var}[R(\hat\beta - \beta^0)|X] = E\left[R(\hat\beta - \beta^0)(R(\hat\beta - \beta^0))'|X\right] = RE\left[(\hat\beta - \beta^0)(\hat\beta - \beta^0)'|X\right]R' = R\,\mathrm{var}(\hat\beta|X)R' = \sigma^2 R(X'X)^{-1}R'.$$
It follows that $R(\hat\beta - \beta^0)|X \sim N(0, \sigma^2 R(X'X)^{-1}R')$.

Question: Why would we like to know the sampling distribution of $R(\hat\beta - \beta^0)$?
Answer: This is mainly for confidence interval estimation and hypothesis testing.

Remark: However, $\mathrm{var}[R(\hat\beta - \beta^0)|X] = \sigma^2 R(X'X)^{-1}R'$ is unknown, because $\sigma^2 = \mathrm{var}(\varepsilon_t)$ is unknown. We need to estimate $\sigma^2$.

3.6 Estimation of the Covariance Matrix of $\hat\beta$

Question: How do we estimate $\mathrm{var}(\hat\beta|X) = \sigma^2(X'X)^{-1}$?
Answer: To estimate $\sigma^2$, we can use the residual variance estimator $s^2 = e'e/(n-K)$.

Theorem [Residual Variance Estimator]: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Then we have, for all $n > K$:
(i) $\dfrac{(n-K)s^2}{\sigma^2}\Big|X = \dfrac{e'e}{\sigma^2}\Big|X \sim \chi^2_{n-K}$;
(ii) conditional on $X$, $s^2$ and $\hat\beta$ are independent.

Question: What is a $\chi^2_q$ distribution?

Definition [Chi-square Distribution, $\chi^2_q$]: Suppose $\{Z_i\}_{i=1}^q$ are i.i.d. $N(0,1)$ random variables. Then the random variable
$$\chi^2 = \sum_{i=1}^q Z_i^2$$
follows a $\chi^2_q$ distribution.

Properties of $\chi^2_q$: $E(\chi^2_q) = q$ and $\mathrm{var}(\chi^2_q) = 2q$.

Remarks:
(i) Part (i) of the theorem implies
$$E\left[\frac{(n-K)s^2}{\sigma^2}\Big|X\right] = n-K, \quad \text{i.e.,} \quad \frac{(n-K)E(s^2|X)}{\sigma^2} = n-K.$$
It follows that $E(s^2|X) = \sigma^2$.
(ii) Part (i) also implies
$$\mathrm{var}\left[\frac{(n-K)s^2}{\sigma^2}\Big|X\right] = 2(n-K), \quad \text{i.e.,} \quad \frac{(n-K)^2}{\sigma^4}\mathrm{var}(s^2|X) = 2(n-K),$$
so that
$$\mathrm{var}(s^2|X) = \frac{2\sigma^4}{n-K} \to 0 \text{ as } n\to\infty.$$
(iii) Parts (i) and (ii) together imply that the conditional MSE of $s^2$,
$$MSE(s^2|X) = E[(s^2 - \sigma^2)^2|X] = \mathrm{var}(s^2|X) + [E(s^2|X) - \sigma^2]^2 \to 0.$$
Thus, $s^2$ is a good estimator for $\sigma^2$.
(iv) The independence between $s^2$ and $\hat\beta$ is crucial for obtaining the sampling distributions of the popular t-test and F-test statistics, which will be introduced soon.
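Part (i) of the theorem can be illustrated by simulation before proving it. Drawing fresh normal errors repeatedly while holding $X$ fixed, the statistic $e'e/\sigma^2$ should have mean $n-K$ and variance $2(n-K)$, matching a $\chi^2_{n-K}$ distribution. This is a sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(6)
n, K, sigma2, reps = 30, 3, 1.0, 5000
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# Draw (n-K) s^2 / sigma^2 = e'e / sigma^2 = eps' M eps / sigma^2 repeatedly.
q = np.empty(reps)
for r in range(reps):
    eps = rng.normal(0.0, np.sqrt(sigma2), n)
    e = M @ eps
    q[r] = (e @ e) / sigma2

# A chi-square_{n-K} random variable has mean n-K and variance 2(n-K).
print(q.mean())          # close to n - K = 27
print(q.var())           # close to 2(n - K) = 54
```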
Proof of Theorem:
(i) Because $e = M\varepsilon$, we have
$$\frac{e'e}{\sigma^2} = \frac{\varepsilon'M\varepsilon}{\sigma^2} = \left(\frac{\varepsilon}{\sigma}\right)'M\left(\frac{\varepsilon}{\sigma}\right).$$
In addition, because $\varepsilon|X \sim N(0, \sigma^2 I_n)$, and $M$ is an idempotent matrix with rank $n-K$ (as has been shown before), the quadratic form
$$\frac{e'e}{\sigma^2} = \frac{\varepsilon'M\varepsilon}{\sigma^2}\Big|X \sim \chi^2_{n-K}$$
by the following lemma.

Lemma [Quadratic form of normal random variables]: If $v \sim N(0, I_n)$ and $Q$ is an $n\times n$ nonstochastic symmetric idempotent matrix with rank $q \le n$, then the quadratic form $v'Qv \sim \chi^2_q$.

In our application, we have $v = \varepsilon/\sigma \sim N(0, I)$ and $Q = M$. Since $\mathrm{rank}(M) = n-K$, we have
$$\frac{e'e}{\sigma^2}\Big|X \sim \chi^2_{n-K}.$$

(ii) Next, we show that $s^2$ and $\hat\beta$ are independent. Because $s^2 = e'e/(n-K)$ is a function of $e$, it suffices to show that $e$ and $\hat\beta$ are independent. This follows immediately because both $e$ and $\hat\beta$ are jointly normally distributed and they are uncorrelated: it is well known that, for a joint normal distribution, zero correlation is equivalent to independence.

Question: Why are $e$ and $\hat\beta$ jointly normally distributed?
Answer: We write
$$\begin{bmatrix} e \\ \hat\beta - \beta^0 \end{bmatrix} = \begin{bmatrix} M\varepsilon \\ (X'X)^{-1}X'\varepsilon \end{bmatrix} = \begin{bmatrix} M \\ (X'X)^{-1}X' \end{bmatrix}\varepsilon.$$
Because $\varepsilon|X \sim N(0, \sigma^2 I)$, this linear combination of $\varepsilon$ is also normally distributed conditional on $X$.

Question: Why are these sampling distributions useful in practice?
Answer: They are useful in confidence interval estimation and hypothesis testing.

3.7 Hypothesis Testing

We now use the sampling distributions of $\hat\beta$ and $s^2$ to develop test procedures for hypotheses of interest. We consider testing a linear hypothesis of the form
$$H_0: R\beta^0 = r,$$
where $R$ is a $J\times K$ matrix called the selection matrix, $r$ is a $J\times 1$ vector, and $J$ is the number of restrictions. We assume $J \le K$.

Remark: We will test $H_0$ under correct model specification for $E(Y_t|X_t)$.

Motivations

Example 1 [Reforms have no effect]: Consider the extended production function
$$\ln(Y_t) = \beta_0 + \beta_1\ln(L_t) + \beta_2\ln(K_t) + \beta_3 AU_t + \beta_4 PS_t + \varepsilon_t,$$
where $AU_t$ is a dummy variable indicating whether firm $t$ is granted autonomy, and $PS_t$ is the profit share of firm $t$ with the state.
Suppose we are interested in testing whether autonomy $AU_t$ has an effect on productivity. Then we can write the null hypothesis
$$H_0^a: \beta_3^0 = 0.$$
With $\beta^0 = (\beta_0, \beta_1, \beta_2, \beta_3, \beta_4)'$, this is equivalent to the choice
$$R = (0, 0, 0, 1, 0), \quad r = 0.$$
If we are interested in testing whether profit sharing has an effect on productivity, we can consider the null hypothesis
$$H_0^b: \beta_4^0 = 0.$$
Alternatively, to test whether the production technology exhibits constant returns to scale (CRS), we can write the null hypothesis as
$$H_0^c: \beta_1^0 + \beta_2^0 = 1.$$
This is equivalent to the choice $R = (0, 1, 1, 0, 0)$ and $r = 1$.
Finally, if we are interested in examining the joint effect of both autonomy and profit sharing, we can test the hypothesis that neither autonomy nor profit sharing has an impact:
$$H_0^d: \beta_3^0 = \beta_4^0 = 0.$$
This is equivalent to the choice
$$R = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}, \quad r = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$

Example 3 [Optimal Predictor for the Future Spot Exchange Rate]: Consider
$$S_{t+\tau} = \beta_0 + \beta_1 F_t(\tau) + \varepsilon_{t+\tau}, \quad t = 1,\ldots,n,$$
where $S_{t+\tau}$ is the spot exchange rate at period $t+\tau$, and $F_t(\tau)$ is period $t$'s price of the foreign currency to be delivered at period $t+\tau$. The null hypothesis of interest is that the forward exchange rate $F_t(\tau)$ is an optimal predictor for the future spot rate $S_{t+\tau}$, in the sense that
$$E(S_{t+\tau}|I_t) = F_t(\tau),$$
where $I_t$ is the information set available at time $t$. This is the so-called expectations hypothesis in economics and finance. Given the above specification, this hypothesis can be written as
$$H_0^e: \beta_0^0 = 0,\ \beta_1^0 = 1, \text{ and } E(\varepsilon_{t+\tau}|I_t) = 0.$$
This is equivalent to the choice
$$R = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad r = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$$

Remark: All the examples considered above can be formulated with a suitable specification of $R$, where $R$ is a $J\times K$ matrix in the null hypothesis $H_0: R\beta^0 = r$ and $r$ is a $J\times 1$ vector.

Basic Idea of Hypothesis Testing

To test the null hypothesis
$$H_0: R\beta^0 = r,$$
we can consider the statistic $R\hat\beta - r$ and check whether this difference is significantly different from zero.
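For concreteness, the $R$ and $r$ matrices for hypotheses $H_0^a$, $H_0^c$ and $H_0^d$ above can be written out and the restriction $R\beta - r$ evaluated at an illustrative coefficient vector chosen to satisfy each null (the numerical values below are hypothetical, not estimates):

```python
import numpy as np

# beta = (b0, b1, b2, b3, b4)' for the production function
# ln(Y) = b0 + b1 ln(L) + b2 ln(K) + b3 AU + b4 PS + eps.
beta = np.array([0.3, 0.6, 0.4, 0.0, 0.0])    # illustrative; satisfies all three nulls

# H0a: beta3 = 0 (autonomy has no effect) -- a single restriction, J = 1.
R_a, r_a = np.array([[0, 0, 0, 1, 0]]), np.array([0.0])

# H0c: beta1 + beta2 = 1 (constant returns to scale), J = 1.
R_c, r_c = np.array([[0, 1, 1, 0, 0]]), np.array([1.0])

# H0d: beta3 = beta4 = 0 (joint restriction), J = 2.
R_d = np.array([[0, 0, 0, 1, 0],
                [0, 0, 0, 0, 1]])
r_d = np.array([0.0, 0.0])

for R, r in [(R_a, r_a), (R_c, r_c), (R_d, r_d)]:
    print(R @ beta - r)                       # the statistic R beta - r is zero under H0
```

In practice $\beta$ is replaced by the OLS estimate $\hat\beta$, and the question becomes whether the realization of $R\hat\beta - r$ is significantly different from zero.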
Under $H_0: R\beta^0 = r$, we have
$$R\hat\beta - r = R\hat\beta - R\beta^0 = R(\hat\beta - \beta^0) \to 0 \text{ as } n\to\infty,$$
because $\hat\beta \to \beta^0$ as $n\to\infty$ in terms of MSE. Under the alternative to $H_0$, $R\beta^0 \ne r$, but we still have $\hat\beta \to \beta^0$ in terms of MSE. It follows that
$$R\hat\beta - r = R(\hat\beta - \beta^0) + R\beta^0 - r \to R\beta^0 - r \ne 0 \text{ as } n\to\infty,$$
where the convergence is in terms of MSE. In other words, $R\hat\beta - r$ will converge to a nonzero limit, $R\beta^0 - r$.

Remark: The behavior of $R\hat\beta - r$ differs under $H_0$ and under the alternative hypothesis. Therefore, we can test $H_0$ by examining whether $R\hat\beta - r$ is significantly different from zero.

Question: How large should the magnitude of the absolute value of the difference $R\hat\beta - r$ be in order to claim that it is significantly different from zero?
Answer: We need a decision rule which specifies a threshold value with which we can compare the (absolute) value of $R\hat\beta - r$. Note that $R\hat\beta - r$ is a random variable, so it can take many (possibly infinitely many) values. Given a data set, we only obtain one realization of $R\hat\beta - r$. Whether a realization of $R\hat\beta - r$ is close to zero should be judged using the critical value of its sampling distribution, which depends on the sample size $n$ and the significance level $\alpha\in(0,1)$ one preselects.

Question: What is the sampling distribution of $R\hat\beta - r$ under $H_0$?
Because $R(\hat\beta - \beta^0)|X \sim N(0, \sigma^2 R(X'X)^{-1}R')$, we have that, conditional on $X$,
$$R\hat\beta - r = R(\hat\beta - \beta^0) + R\beta^0 - r \sim N(R\beta^0 - r,\ \sigma^2 R(X'X)^{-1}R').$$

Corollary: Under Assumptions 3.1, 3.3 and 3.5, and $H_0: R\beta^0 = r$, we have for each $n > K$:
$$(R\hat\beta - r)|X \sim N(0, \sigma^2 R(X'X)^{-1}R').$$

Remarks:
(i) $\mathrm{var}(R\hat\beta|X) = \sigma^2 R(X'X)^{-1}R'$ is a $J\times J$ matrix.
(ii) However, $R\hat\beta - r$ cannot be used directly as a test statistic for $H_0$, because $\sigma^2$ is unknown and there is no way to calculate the critical values of the sampling distribution of $R\hat\beta - r$.

Question: How do we construct a feasible (i.e., computable) test statistic?
The forms of test statistics will differ depending on whether we have $J = 1$ or $J > 1$. We first consider the case of $J = 1$.

CASE I: t-Test ($J = 1$)

Recall that we have

$(R\hat{\beta} - r)|X \sim N(0, \sigma^2 R(X'X)^{-1}R').$

When $J = 1$, the conditional variance $\mathrm{var}[(R\hat{\beta} - r)|X] = \sigma^2 R(X'X)^{-1}R'$ is a scalar ($1 \times 1$). It follows that conditional on $X$,

$\frac{R\hat{\beta} - r}{\sqrt{\mathrm{var}[(R\hat{\beta} - r)|X]}} = \frac{R\hat{\beta} - r}{\sqrt{\sigma^2 R(X'X)^{-1}R'}} \sim N(0, 1).$

Question: What is the unconditional distribution of $(R\hat{\beta} - r)/\sqrt{\sigma^2 R(X'X)^{-1}R'}$?

Answer: The unconditional distribution is also $N(0,1)$. However, $\sigma^2$ is unknown, so we cannot use this ratio as a test statistic. We have to replace $\sigma^2$ by $s^2$, which is a good estimator for $\sigma^2$. This gives a feasible (i.e., computable) test statistic

$T = \frac{R\hat{\beta} - r}{\sqrt{s^2 R(X'X)^{-1}R'}}.$

However, the test statistic $T$ is no longer normally distributed. Instead,

$T = \frac{R\hat{\beta} - r}{\sqrt{s^2 R(X'X)^{-1}R'}} = \frac{\dfrac{R\hat{\beta} - r}{\sqrt{\sigma^2 R(X'X)^{-1}R'}}}{\sqrt{\dfrac{(n-K)s^2}{\sigma^2}\Big/(n-K)}} \sim \frac{N(0,1)}{\sqrt{\chi^2_{n-K}/(n-K)}} \sim t_{n-K},$

where $t_{n-K}$ denotes the Student $t$-distribution with $n-K$ degrees of freedom. Note that the numerator and denominator are mutually independent conditional on $X$, because $\hat{\beta}$ and $s^2$ are mutually independent conditional on $X$.

Remarks:

(i) The feasible statistic $T$ is called a t-test statistic.

(ii) What is a Student $t_q$ distribution?

Definition [Student's t-distribution]: Suppose $Z \sim N(0,1)$ and $V \sim \chi^2_q$, and $Z$ and $V$ are independent. Then the ratio

$\frac{Z}{\sqrt{V/q}} \sim t_q.$

(iii) When $q \to \infty$, $t_q \overset{d}{\to} N(0,1)$, where $\overset{d}{\to}$ denotes convergence in distribution. This implies that

$T = \frac{R\hat{\beta} - r}{\sqrt{s^2 R(X'X)^{-1}R'}} \overset{d}{\to} N(0,1)$ as $n \to \infty$.

This result has a very important implication in practice: for a large sample size $n$, it makes no difference whether one uses the critical values from $t_{n-K}$ or from $N(0,1)$.

Question: What is meant by convergence in distribution?
Definition [Convergence in Distribution]: Suppose $\{Z_n, n = 1, 2, ...\}$ is a sequence of random variables/vectors with distribution functions $F_n(z) = P(Z_n \leq z)$, and $Z$ is a random variable/vector with distribution function $F(z) = P(Z \leq z)$. We say that $Z_n$ converges to $Z$ in distribution if the distribution of $Z_n$ converges to the distribution of $Z$ at all continuity points; namely,

$\lim_{n \to \infty} F_n(z) = F(z)$, or $F_n(z) \to F(z)$ as $n \to \infty$,

for any continuity point $z$ (i.e., for any point at which $F(z)$ is continuous). We use the notation $Z_n \overset{d}{\to} Z$. The distribution of $Z$ is called the asymptotic or limiting distribution of $Z_n$.

Implication: In practice, $Z_n$ is a test statistic or a parameter estimator, and its sampling distribution $F_n(z)$ is either unknown or very complicated, while $F(z)$ is known or very simple. As long as $Z_n \overset{d}{\to} Z$, we can use $F(z)$ as an approximation to $F_n(z)$. This gives a convenient procedure for statistical inference. The potential cost is that the approximation of $F_n(z)$ by $F(z)$ may not be good enough in finite samples (i.e., when $n$ is finite). How good the approximation is depends on the data generating process and the sample size $n$.

Example: Suppose $\{\varepsilon_n, n = 1, 2, ...\}$ is an i.i.d. sequence with distribution function $F(z)$. Let $\varepsilon$ be a random variable with the same distribution function $F(z)$. Then $\varepsilon_n \overset{d}{\to} \varepsilon$.
With the obtained sampling distribution for the test statistic $T$, we can now describe a decision rule for testing $H_0$ when $J = 1$.

Decision Rule of the t-Test Based on Critical Values

(i) Reject $H_0: R\beta^0 = r$ at a prespecified significance level $\alpha \in (0,1)$ if

$|T| > C_{t_{n-K}, \alpha/2},$

where $C_{t_{n-K}, \alpha/2}$ is the so-called upper-tailed critical value of the $t_{n-K}$ distribution at level $\alpha/2$, which is determined by

$P\left[t_{n-K} > C_{t_{n-K}, \alpha/2}\right] = \alpha/2,$

or equivalently

$P\left[|t_{n-K}| > C_{t_{n-K}, \alpha/2}\right] = \alpha.$

(ii) Accept $H_0$ at the significance level $\alpha$ if $|T| \leq C_{t_{n-K}, \alpha/2}$.

Remarks:

(a) In testing $H_0$, there exist two types of errors, due to the limited information about the population contained in a given random sample $\{Z_t\}_{t=1}^n$. One possibility is that $H_0$ is true but we reject it. This is called the "Type I error". The significance level $\alpha$ is the probability of making the Type I error. If

$P\left[|T| > C_{t_{n-K}, \alpha/2} \,\big|\, H_0\right] = \alpha,$

we say that the decision rule is a test with size $\alpha$.

(b) The probability $P[|T| > C_{t_{n-K}, \alpha/2}]$ under the alternative is called the power function of a size-$\alpha$ test. When

$P\left[|T| > C_{t_{n-K}, \alpha/2} \,\big|\, H_0 \text{ is false}\right] < 1,$

there exists a possibility that one may accept $H_0$ when it is false. This is called the "Type II error".

(c) Ideally one would like to minimize both the Type I error and the Type II error, but this is impossible for any given finite sample. In practice, one usually presets the level for the Type I error, the so-called significance level, and then minimizes the Type II error. Conventional choices for the significance level are 10%, 5% and 1%.
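A small numerical sketch of this decision rule (the data-generating process below is invented for illustration; the critical value of $t_{n-K}$ is approximated by Monte Carlo from the defining ratio $Z/\sqrt{V/q}$ rather than read from statistical tables):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data satisfying H0: beta_2 = 0 (illustrative numbers only)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
s2 = e @ e / (n - K)                      # s^2 = e'e / (n - K)

R, r = np.array([0.0, 0.0, 1.0]), 0.0     # H0: R beta = r with J = 1
T = (R @ beta_hat - r) / np.sqrt(s2 * (R @ XtX_inv @ R))

# Approximate the two-sided 5% critical value of t_{n-K} by Monte Carlo,
# using the definition t_q = Z / sqrt(V/q), Z ~ N(0,1), V ~ chi2_q independent
q, alpha, n_sim = n - K, 0.05, 200_000
t_draws = rng.normal(size=n_sim) / np.sqrt(rng.chisquare(q, size=n_sim) / q)
crit = np.quantile(np.abs(t_draws), 1 - alpha)

reject = abs(T) > crit                    # decision rule: reject H0 if |T| > C
```

With $q = 197$ degrees of freedom, `crit` should be close to the normal value 1.96, in line with Remark (iii) above.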
Next, we describe an alternative decision rule for testing $H_0$ when $J = 1$, using the so-called p-value of the test statistic $T$.

An Equivalent Decision Rule Based on p-values

Given a data set $z^n = \{y_t, x_t'\}_{t=1}^n$, which is a realization of the random sample $Z^n = \{Y_t, X_t'\}_{t=1}^n$, we can compute a realization (i.e., a number) of the t-test statistic $T$, namely

$T(z^n) = \frac{R\hat{\beta} - r}{\sqrt{s^2 R(x'x)^{-1}R'}}.$

Then the probability

$p(z^n) = P\left[|t_{n-K}| > |T(z^n)|\right]$

is called the p-value (i.e., probability value) of the test statistic $T$ given that $\{Y_t, X_t'\}_{t=1}^n = \{y_t, x_t'\}_{t=1}^n$ is observed, where $t_{n-K}$ is a Student t random variable with $n-K$ degrees of freedom, and $T(z^n)$ is a realization of the test statistic $T = T(Z^n)$.

Intuitively, the p-value is the tail probability that the absolute value of a Student $t_{n-K}$ random variable takes values larger than the absolute value of the test statistic $T(z^n)$. If this probability is very small relative to the significance level, then it is unlikely that the test statistic $T(Z^n)$ follows a Student $t_{n-K}$ distribution. As a consequence, the null hypothesis is likely to be false.

The above decision rule can be described equivalently as follows:

Decision Rule Based on the p-value

(i) Reject $H_0$ at the significance level $\alpha$ if $p(z^n) < \alpha$.

(ii) Accept $H_0$ at the significance level $\alpha$ if $p(z^n) \geq \alpha$.

Question: What are the advantages/disadvantages of using p-values versus using critical values?

Remarks:

(i) p-values are more informative than merely rejecting or accepting the null hypothesis at some prespecified significance level. A p-value is the smallest significance level at which a null hypothesis can be rejected.

(ii) Most statistical software reports p-values of parameter estimates.

Examples of t-Tests

Example 1 [Reforms have no effects (continued)]: We first consider testing the null hypothesis $H_0^a: \beta_3 = 0$, where $\beta_3$ is the coefficient of the autonomy dummy $AU_t$ in the extended production function regression model.
This is equivalent to the selection of $R = (0, 0, 0, 1, 0)$. In this case, we have

$s^2 R(X'X)^{-1}R' = s^2\left[(X'X)^{-1}\right]_{(4,4)} = S^2_{\hat{\beta}_3},$

which is the estimator of $\mathrm{var}(\hat{\beta}_3|X)$. The t-test statistic is

$T = \frac{R\hat{\beta} - r}{\sqrt{s^2 R(X'X)^{-1}R'}} = \frac{\hat{\beta}_3}{S_{\hat{\beta}_3}} \sim t_{n-K}.$

Next, we consider testing the CRS hypothesis $H_0^c: \beta_1 + \beta_2 = 1$, which corresponds to $R = (0, 1, 1, 0, 0)$ and $r = 1$. In this case,

$s^2 R(X'X)^{-1}R' = S^2_{\hat{\beta}_1} + S^2_{\hat{\beta}_2} + 2\,\widehat{\mathrm{cov}}(\hat{\beta}_1, \hat{\beta}_2)$
$\qquad = s^2\left[(X'X)^{-1}\right]_{(2,2)} + s^2\left[(X'X)^{-1}\right]_{(3,3)} + 2s^2\left[(X'X)^{-1}\right]_{(2,3)}$
$\qquad = S^2_{\hat{\beta}_1 + \hat{\beta}_2},$

which is the estimator of $\mathrm{var}(\hat{\beta}_1 + \hat{\beta}_2|X)$. Here, $\widehat{\mathrm{cov}}(\hat{\beta}_1, \hat{\beta}_2)$ is the estimator of $\mathrm{cov}(\hat{\beta}_1, \hat{\beta}_2|X)$, the covariance between $\hat{\beta}_1$ and $\hat{\beta}_2$ conditional on $X$. The t-test statistic is

$T = \frac{R\hat{\beta} - r}{\sqrt{s^2 R(X'X)^{-1}R'}} = \frac{\hat{\beta}_1 + \hat{\beta}_2 - 1}{S_{\hat{\beta}_1 + \hat{\beta}_2}} \sim t_{n-K}.$

CASE II: F-Test ($J > 1$)

Question: What happens if $J > 1$? We first state a useful lemma.

Lemma: If $Z \sim N(0, V)$, where $V = \mathrm{var}(Z)$ is a nonsingular $J \times J$ variance-covariance matrix, then

$Z'V^{-1}Z \sim \chi^2_J.$

Proof: Because $V$ is symmetric and positive definite, we can find a symmetric and invertible matrix $V^{1/2}$ such that $V^{1/2}V^{1/2} = V$ and $V^{-1/2}V^{-1/2} = V^{-1}$. (Question: What is this decomposition called?) Now, define

$Y = V^{-1/2}Z.$

Then we have $E(Y) = 0$ and

$\mathrm{var}(Y) = E\{[Y - E(Y)][Y - E(Y)]'\} = E(YY') = E(V^{-1/2}ZZ'V^{-1/2}) = V^{-1/2}E(ZZ')V^{-1/2} = V^{-1/2}VV^{-1/2} = I.$

It follows that $Y \sim N(0, I)$, and therefore

$Y'Y = Z'V^{-1}Z \sim \chi^2_J.$

Applying this lemma, and using the result that $(R\hat{\beta} - r)|X \sim N[0, \sigma^2 R(X'X)^{-1}R']$ under $H_0$, we have the quadratic form

$(R\hat{\beta} - r)'\left[\sigma^2 R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r) \sim \chi^2_J$

conditional on $X$, or

$\frac{(R\hat{\beta} - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r)}{\sigma^2} \sim \chi^2_J$

conditional on $X$. Because the $\chi^2_J$ distribution does not depend on $X$, we also have this result unconditionally. As in constructing the t-test statistic, we replace $\sigma^2$ by $s^2$:

$\frac{(R\hat{\beta} - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r)}{s^2}.$

The replacement of $\sigma^2$ by $s^2$ renders the distribution of the quadratic form no longer Chi-squared.
Instead, after proper scaling, the quadratic form will follow a so-called F-distribution with degrees of freedom $(J, n-K)$.

Question: Why?

Observe

$\frac{(R\hat{\beta} - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r)}{J s^2} = \frac{(R\hat{\beta} - r)'\left[\sigma^2 R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r)\big/J}{\dfrac{(n-K)s^2}{\sigma^2}\Big/(n-K)} \sim \frac{\chi^2_J/J}{\chi^2_{n-K}/(n-K)} \sim F_{J, n-K},$

where $F_{J,n-K}$ denotes the F-distribution with degrees of freedom $J$ and $n-K$.

Question: What is an $F_{J,n-K}$ distribution?

Definition: Suppose $U \sim \chi^2_p$ and $V \sim \chi^2_q$, and $U$ and $V$ are independent. Then the ratio

$\frac{U/p}{V/q} \sim F_{p,q}$

is said to follow an $F_{p,q}$ distribution with degrees of freedom $(p, q)$.

Question: Why call it the F-distribution?

Answer: It is named after R. A. Fisher, a well-known statistician of the 20th century.

Remarks: Properties of $F_{p,q}$:

(i) If $F \sim F_{p,q}$, then $F^{-1} \sim F_{q,p}$.

(ii) $t^2_q \sim F_{1,q}$, because $t^2_q = \dfrac{[N(0,1)]^2}{\chi^2_q/q} = \dfrac{\chi^2_1/1}{\chi^2_q/q} \sim F_{1,q}$.

(iii) Given any fixed integer $p$, $p \cdot F_{p,q} \overset{d}{\to} \chi^2_p$ as $q \to \infty$.

Relationship (ii) implies that when $J = 1$, using either the t-test or the F-test will deliver the same conclusion. Relationship (iii) implies that the conclusions based on $F_{p,q}$ and on $pF_{p,q}$ with the $\chi^2_p$ approximation will be approximately the same when $q$ is sufficiently large.

We can now define the following F-test statistic:

$F \equiv \frac{(R\hat{\beta} - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r)\big/J}{s^2} \sim F_{J, n-K}.$

Theorem: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Then under $H_0: R\beta^0 = r$, we have

$F = \frac{(R\hat{\beta} - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r)\big/J}{s^2} \sim F_{J, n-K}$

for all $n > K$.

A practical issue now is how to compute the F-statistic. One can of course compute the F-test statistic using the definition of $F$. However, there is a very convenient alternative way to compute it, which we now introduce.

Alternative Expression for the F-Test Statistic

Theorem: Suppose Assumptions 3.1 and 3.3 hold.
Let $SSR_u = e'e$ be the sum of squared residuals from the unrestricted model

$Y = X\beta^0 + \varepsilon.$

Let $SSR_r = \tilde{e}'\tilde{e}$ be the sum of squared residuals from the restricted model

$Y = X\beta^0 + \varepsilon$ subject to $R\beta^0 = r$,

where $\tilde{\beta}$ is the restricted OLS estimator. Then under $H_0$, the F-test statistic can be written as

$F = \frac{(\tilde{e}'\tilde{e} - e'e)/J}{e'e/(n-K)} \sim F_{J, n-K}.$

Remark: This is a convenient test statistic! One only needs to compute the sums of squared residuals of the restricted and unrestricted models in order to compute the F-test statistic.

Proof: Let $\tilde{\beta}$ be the OLS estimator under $H_0$; that is,

$\tilde{\beta} = \arg\min_{\beta \in \mathbb{R}^K} (Y - X\beta)'(Y - X\beta)$ subject to the constraint $R\beta = r$.

We first form the Lagrangian function

$L(\beta, \lambda) = (Y - X\beta)'(Y - X\beta) + 2\lambda'(R\beta - r),$

where $\lambda$ is a $J \times 1$ vector called the Lagrange multiplier vector. We have the following FOC:

$\frac{\partial L(\tilde{\beta}, \tilde{\lambda})}{\partial \beta} = -2X'(Y - X\tilde{\beta}) + 2R'\tilde{\lambda} = 0,$

$\frac{\partial L(\tilde{\beta}, \tilde{\lambda})}{\partial \lambda} = 2(R\tilde{\beta} - r) = 0.$

With the unconstrained OLS estimator $\hat{\beta} = (X'X)^{-1}X'Y$, the first equation of the FOC gives

$\hat{\beta} - \tilde{\beta} = (X'X)^{-1}R'\tilde{\lambda}, \quad \text{so} \quad R(\hat{\beta} - \tilde{\beta}) = R(X'X)^{-1}R'\tilde{\lambda}.$

Hence, the Lagrange multiplier is

$\tilde{\lambda} = \left[R(X'X)^{-1}R'\right]^{-1}R(\hat{\beta} - \tilde{\beta}) = \left[R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r),$

where we have made use of the constraint $R\tilde{\beta} = r$. It follows that

$\hat{\beta} - \tilde{\beta} = (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r).$

Now,

$\tilde{e} = Y - X\tilde{\beta} = Y - X\hat{\beta} + X(\hat{\beta} - \tilde{\beta}) = e + X(\hat{\beta} - \tilde{\beta}).$

It follows, using $X'e = 0$, that

$\tilde{e}'\tilde{e} = e'e + (\hat{\beta} - \tilde{\beta})'X'X(\hat{\beta} - \tilde{\beta}) = e'e + (R\hat{\beta} - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r).$

We therefore have

$(R\hat{\beta} - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r) = \tilde{e}'\tilde{e} - e'e$

and

$F = \frac{(R\hat{\beta} - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r)\big/J}{s^2} = \frac{(\tilde{e}'\tilde{e} - e'e)/J}{e'e/(n-K)}.$

This completes the proof.

Question: What is the interpretation of the Lagrange multiplier $\tilde{\lambda}$?

Answer: Recall that we have obtained the relation

$\tilde{\lambda} = \left[R(X'X)^{-1}R'\right]^{-1}R(\hat{\beta} - \tilde{\beta}) = \left[R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r).$

Thus, $\tilde{\lambda}$ is an indicator of the departure of $R\hat{\beta}$ from $r$: the value of $\tilde{\lambda}$ indicates whether $R\hat{\beta} - r$ is significantly different from zero.

Question: What happens to the distribution of $F$ when $n \to \infty$?
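Before turning to that question, the identity just proved, namely that $\tilde{e}'\tilde{e} - e'e$ equals the quadratic form in $R\hat{\beta} - r$, can be checked numerically on simulated data (the design below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

n, K, J = 100, 4, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - K)

R = np.array([[0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)  # H0: beta_2 = beta_3 = 0
r = np.zeros(J)
d = R @ b - r

# F from the quadratic-form definition
F_quad = (d @ np.linalg.inv(R @ XtX_inv @ R.T) @ d / J) / s2

# F from restricted vs. unrestricted SSR, using the restricted OLS formula
b_r = b - XtX_inv @ R.T @ np.linalg.inv(R @ XtX_inv @ R.T) @ d
e_r = y - X @ b_r
F_ssr = ((e_r @ e_r - e @ e) / J) / (e @ e / (n - K))
```

The two computations agree to machine precision, since the equality is exact algebra, not an approximation.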
Recall that under $H_0$, the quadratic form

$(R\hat{\beta} - r)'\left[\sigma^2 R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r) \sim \chi^2_J.$

We have shown that

$J \cdot F = (R\hat{\beta} - r)'\left[s^2 R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r)$

is no longer $\chi^2_J$ for each finite $n > K$. However, the following theorem shows that as $n \to \infty$, $J \cdot F \overset{d}{\to} \chi^2_J$. In other words, when $n \to \infty$, the limiting distribution of $J \cdot F$ coincides with the distribution of the quadratic form $(R\hat{\beta} - r)'\left[\sigma^2 R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r)$.

Theorem: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Then under $H_0$, we have the Wald test statistic

$W = \frac{(R\hat{\beta} - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r)}{s^2} \overset{d}{\to} \chi^2_J$

as $n \to \infty$.

Remarks:

(i) $W$ is called the Wald test statistic. In the present context, we have $W = J \cdot F$.

(ii) Recall that as $n \to \infty$, the $F_{J,n-K}$ distribution, after being multiplied by $J$, approaches the $\chi^2_J$ distribution. This result has a very important implication in practice: when $n$ is large, the use of $F_{J,n-K}$ and the use of $\chi^2_J$ will deliver the same conclusion on statistical inference.

Proof: The result follows immediately from the so-called Slutsky theorem and the fact that $s^2 \overset{p}{\to} \sigma^2$, where $\overset{p}{\to}$ denotes convergence in probability.

Question: What is the Slutsky theorem? What is convergence in probability?

Slutsky Theorem: Suppose $Z_n \overset{d}{\to} Z$, $a_n \overset{p}{\to} a$ and $b_n \overset{p}{\to} b$ as $n \to \infty$. Then

$a_n + b_n Z_n \overset{d}{\to} a + bZ.$

In our application, we have

$Z_n = \frac{(R\hat{\beta} - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat{\beta} - r)}{\sigma^2}, \quad Z \sim \chi^2_J, \quad a_n = 0, \ a = 0, \ b_n = \sigma^2/s^2, \ b = 1.$

Then by the Slutsky theorem, we have

$W = b_n Z_n \overset{d}{\to} 1 \cdot Z \sim \chi^2_J,$

because $Z_n \overset{d}{\to} Z$ and $b_n \overset{p}{\to} 1$. What we need is to show that $s^2 \overset{p}{\to} \sigma^2$ and then apply the Slutsky theorem.

Question: What is the concept of convergence in probability?

Definition [Convergence in Probability]: Suppose $\{Z_n\}$ is a sequence of random variables and $Z$ is a random variable (including a constant). We say that $Z_n$ converges to $Z$ in probability, denoted $Z_n - Z \overset{p}{\to} 0$ or $Z_n - Z = o_P(1)$, if for every small constant $\epsilon > 0$,

$\lim_{n \to \infty} \Pr\left[|Z_n - Z| > \epsilon\right] = 0, \quad \text{or} \quad \Pr(|Z_n - Z| > \epsilon) \to 0 \text{ as } n \to \infty.$

Interpretation: Define the event $A_n(\epsilon) = \{\omega \in \Omega : |Z_n(\omega)$
$- Z(\omega)| > \epsilon\}$, where $\omega$ is a basic outcome in the sample space $\Omega$. Then convergence in probability says that the probability of the event $A_n(\epsilon)$ may be nonzero for any finite $n$, but such a probability will eventually vanish as $n \to \infty$. In other words, it becomes less and less likely that the difference $|Z_n - Z|$ is larger than any prespecified constant $\epsilon > 0$. Or, we have more and more confidence that the difference $|Z_n - Z|$ will be smaller than $\epsilon$ as $n \to \infty$.

Question: What is the interpretation of $\epsilon$?

Answer: It can be viewed as a prespecified tolerance level.

Remark: When $Z = b$, a constant, we can write $Z_n \overset{p}{\to} b$, and we say that $b$ is the probability limit of $Z_n$, written $b = \mathrm{p}\lim_{n \to \infty} Z_n$.

Next, we introduce another convergence concept which can be conveniently used to show convergence in probability.

Definition [Convergence in Quadratic Mean]: We say that $Z_n$ converges to $Z$ in quadratic mean, denoted $Z_n - Z \overset{q.m.}{\to} 0$ or $Z_n - Z = o_{q.m.}(1)$, if

$\lim_{n \to \infty} E[Z_n - Z]^2 = 0.$

Lemma: If $Z_n - Z \overset{q.m.}{\to} 0$, then $Z_n - Z \overset{p}{\to} 0$.

Proof: By Chebyshev's inequality, we have

$P(|Z_n - Z| > \epsilon) \leq \frac{E[Z_n - Z]^2}{\epsilon^2} \to 0$

for any given $\epsilon > 0$ as $n \to \infty$.

Example: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Does $s^2$ converge in probability to $\sigma^2$?

Solution: Under the given assumptions,

$\frac{(n-K)s^2}{\sigma^2} \sim \chi^2_{n-K},$

and therefore we have $E(s^2) = \sigma^2$ and $\mathrm{var}(s^2) = \dfrac{2\sigma^4}{n-K}$. It follows that $E(s^2 - \sigma^2)^2 = 2\sigma^4/(n-K) \to 0$, so $s^2 \overset{q.m.}{\to} \sigma^2$, and hence $s^2 \overset{p}{\to} \sigma^2$, because convergence in quadratic mean implies convergence in probability.

Question: If $s^2 \overset{p}{\to} \sigma^2$, do we have $s \overset{p}{\to} \sigma$?

Answer: Yes. Why? It follows from the following lemma with the choice $g(s^2) = \sqrt{s^2} = s$.

Lemma [Continuous Mapping]: Suppose $Z_n \overset{p}{\to} Z$, and $g(\cdot)$ is a continuous function. Then $g(Z_n) \overset{p}{\to} g(Z)$.

Proof: Because $g(\cdot)$ is continuous, for any given constant $\epsilon > 0$ there exists $\delta = \delta(\epsilon)$ such that whenever $|Z_n(\omega) - Z(\omega)| < \delta$, we have $|g[Z_n(\omega)] - g[Z(\omega)]| < \epsilon$. Define the events

$A(\delta) = \{\omega : |Z_n(\omega) - Z(\omega)| > \delta\}, \quad B(\epsilon) = \{\omega : |g(Z_n(\omega)) - g(Z(\omega))| > \epsilon\}.$

Then the continuity of $g(\cdot)$ implies $B(\epsilon) \subseteq A(\delta)$. It follows that

$P[B(\epsilon)] \leq P[A(\delta)] \to 0$ as $n \to$
$\infty$, where $P[A(\delta)] \to 0$ given $Z_n \overset{p}{\to} Z$. Because $\epsilon$ and $\delta$ are arbitrary, we have $g(Z_n) \overset{p}{\to} g(Z)$. This completes the proof.

3.8 Applications

We now consider some special but important cases often encountered in economics and finance.

Case 1: Testing for the Joint Significance of Explanatory Variables

Consider the linear regression model

$Y_t = X_t'\beta^0 + \varepsilon_t = \beta_0^0 + \sum_{j=1}^{k} \beta_j^0 X_{jt} + \varepsilon_t.$

We are interested in testing the combined effect of all the regressors except the intercept. The null hypothesis is

$H_0: \beta_j^0 = 0$ for all $1 \leq j \leq k$,

which implies that none of the explanatory variables influences $Y_t$. The alternative hypothesis is

$H_A: \beta_j^0 \neq 0$ for at least some $j \in \{1, ..., k\}$.

One can use the F-test, with $F \sim F_{k, n-(k+1)}$. In fact, the restricted model under $H_0$ is very simple:

$Y_t = \beta_0^0 + \varepsilon_t.$

The restricted OLS estimator is $\tilde{\beta} = (\bar{Y}, 0, ..., 0)'$. It follows that

$\tilde{e} = Y - X\tilde{\beta} = Y - \bar{Y},$

and hence

$\tilde{e}'\tilde{e} = (Y - \bar{Y})'(Y - \bar{Y}).$

Recall the definition of $R^2$:

$R^2 = 1 - \frac{e'e}{(Y - \bar{Y})'(Y - \bar{Y})} = 1 - \frac{e'e}{\tilde{e}'\tilde{e}}.$

It follows that

$F = \frac{(\tilde{e}'\tilde{e} - e'e)/k}{e'e/(n-k-1)} = \frac{\left(1 - \dfrac{e'e}{\tilde{e}'\tilde{e}}\right)\Big/k}{\dfrac{e'e}{\tilde{e}'\tilde{e}}\Big/(n-k-1)} = \frac{R^2/k}{(1-R^2)/(n-k-1)}.$

Thus, it suffices to run one regression, namely the unrestricted model, in this case. We emphasize that this formula is valid only when one is testing $H_0: \beta_j^0 = 0$ for all $1 \leq j \leq k$.

Example 1 [Efficient Market Hypothesis]: Suppose $Y_t$ is the exchange rate return in period $t$, and $I_{t-1}$ is the information available at time $t-1$. Then a classical version of the efficient market hypothesis (EMH) can be stated as follows:

$E(Y_t|I_{t-1}) = E(Y_t).$

To check whether exchange rate changes are unpredictable using the past history of exchange rate changes, we specify a linear regression model:

$Y_t = X_t'\beta^0 + \varepsilon_t,$

where $X_t = (1, Y_{t-1}, ..., Y_{t-k})'$. Under EMH, we have

$H_0: \beta_j^0 = 0$ for all $j = 1, ..., k$.

If the alternative $H_A: \beta_j^0 \neq 0$ for at least some $j \in \{1, ..., k\}$ holds, then exchange rate changes are predictable using past information.

Remarks:

(i) What is the appropriate interpretation if $H_0$ is not rejected?
Note that there exists a gap between the efficiency hypothesis and $H_0$, because the linear regression model is just one of many ways to check EMH. Thus, if $H_0$ is not rejected, at most we can say that no evidence against the efficiency hypothesis is found. We should not conclude that EMH holds.

(ii) Strictly speaking, the current theory (Assumption 3.2: $E(\varepsilon_t|X) = 0$) rules out this application, which is a dynamic time series regression model. However, we will justify in Chapter 5 that

$k \cdot F = \frac{R^2}{(1-R^2)/(n-k-1)} \overset{d}{\to} \chi^2_k$

under conditional homoskedasticity even for a linear dynamic regression model.

(iii) In fact, we can use a simpler version when $n$ is large:

$(n-k-1)R^2 \overset{d}{\to} \chi^2_k.$

This follows from the Slutsky theorem because $R^2 \overset{p}{\to} 0$ under $H_0$. Although Assumption 3.5 is not needed for this result, conditional homoskedasticity is still needed, which rules out autoregressive conditional heteroskedasticity (ARCH) in the time series context.

Example 2 [Consumption Function and Wealth Effect]: Let $Y_t$ = consumption, $X_{1t}$ = labor income, and $X_{2t}$ = liquidity asset wealth. A regression estimation gives

$Y_t = 33.88 - 26.00X_{1t} + 6.71X_{2t} + e_t, \quad R^2 = 0.742, \ n = 25,$
$\qquad [1.77] \quad [-0.74] \qquad\ [0.77]$

where the numbers inside [ ] are t-statistics. Suppose we are interested in whether labor income or liquidity asset wealth has an impact on consumption. We can use the F-test statistic:

$F = \frac{R^2/2}{(1-R^2)/(n-3)} = \frac{0.742/2}{(1-0.742)/(25-3)} = 31.636 \sim F_{2,22}.$

Comparing it with the critical value of $F_{2,22}$ at the 5% significance level, we reject the null hypothesis that neither income nor liquidity asset wealth has an impact on consumption at the 5% significance level.
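The arithmetic of this F statistic is easy to reproduce (a sketch using the figures from the example above):

```python
# F = (R^2 / J) / ((1 - R^2) / (n - K)) with J = 2 restrictions and
# K = 3 estimated parameters (intercept, income, wealth)
R2, n = 0.742, 25
F = (R2 / 2) / ((1 - R2) / (n - 3))
print(round(F, 3))  # -> 31.636
```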
Case 2: Testing for Omitted Variables (or Testing for No Effect)

Suppose $X = (X^{(1)}, X^{(2)})$, where $X^{(1)}$ is an $n \times (k_1+1)$ matrix and $X^{(2)}$ is an $n \times k_2$ matrix. A random vector $X_t^{(2)}$ has no explanatory power for the conditional expectation of $Y_t$ if

$E(Y_t|X_t) = E(Y_t|X_t^{(1)}).$

Alternatively, it has explanatory power for the conditional expectation of $Y_t$ if

$E(Y_t|X_t) \neq E(Y_t|X_t^{(1)}).$

When $X_t^{(2)}$ has explanatory power for $Y_t$ but is not included in the regression, we say that $X_t^{(2)}$ is an omitted random variable or vector.

Question: How to test whether $X_t^{(2)}$ is an omitted variable in the linear regression context?

Consider the restricted model

$Y_t = \beta_0 + \beta_1 X_{1t} + \cdots + \beta_{k_1} X_{k_1 t} + \varepsilon_t.$

Suppose we have $k_2$ additional variables $(X_{(k_1+1)t}, ..., X_{(k_1+k_2)t})$, and so we consider the unrestricted regression model

$Y_t = \beta_0 + \beta_1 X_{1t} + \cdots + \beta_{k_1} X_{k_1 t} + \beta_{k_1+1} X_{(k_1+1)t} + \cdots + \beta_{k_1+k_2} X_{(k_1+k_2)t} + \varepsilon_t.$

The null hypothesis is that the additional variables have no effect on $Y_t$:

$H_0: \beta_{k_1+1} = \beta_{k_1+2} = \cdots = \beta_{k_1+k_2} = 0.$

The alternative is that at least one of the additional variables has an effect on $Y_t$. The F-test statistic is

$F = \frac{(\tilde{e}'\tilde{e} - e'e)/k_2}{e'e/(n - k_1 - k_2 - 1)} \sim F_{k_2, \, n-(k_1+k_2+1)}.$

Question: Suppose we reject the null hypothesis. Then some important explanatory variables are omitted, and they should be included in the regression. On the other hand, if the F-test statistic does not reject the null hypothesis $H_0$, can we say that there is no omitted variable?

Answer: No. There may exist a nonlinear relationship in the additional variables which a linear regression specification cannot capture.

Example 1 [Testing for the Effect of Reforms]: Consider the extended production function

$\ln(Y_t) = \beta_0 + \beta_1 \ln(L_t) + \beta_2 \ln(K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t,$

where $AU_t$ is the autonomy dummy, $PS_t$ is the profit sharing ratio, and $CM_t$ is the dummy for change of manager.
The null hypothesis of interest here is that none of the three reforms has an impact:

$H_0: \beta_3 = \beta_4 = \beta_5 = 0.$

We can use the F-test, with $F \sim F_{3, n-6}$ under $H_0$.

Interpretation: If rejection occurs, there exists evidence against $H_0$. If no rejection occurs, then we find no evidence against $H_0$ (which is not the same as the statement that reforms have no effect). It is possible that the effect of $X_t^{(2)}$ is of a nonlinear form. In this case, we may obtain a zero coefficient for $X_t^{(2)}$, because the linear specification may not be able to capture the nonlinear effect.

Example 2 [Testing for Granger Causality]: Consider two time series $\{Y_t, Z_t\}$, where $t$ is the time index, $I_{t-1}^{(Y)} = \{Y_{t-1}, ..., Y_1\}$ and $I_{t-1}^{(Z)} = \{Z_{t-1}, ..., Z_1\}$. For example, $Y_t$ is GDP growth, and $Z_t$ is money supply growth. We say that $Z_t$ does not Granger-cause $Y_t$ in conditional mean with respect to $I_{t-1} = \{I_{t-1}^{(Y)}, I_{t-1}^{(Z)}\}$ if

$E(Y_t|I_{t-1}^{(Y)}, I_{t-1}^{(Z)}) = E(Y_t|I_{t-1}^{(Y)}).$

In other words, the lagged variables of $Z_t$ have no impact on the level of $Y_t$.

Remark: Granger causality is defined in terms of incremental predictability rather than a real cause-effect relationship. From an econometric point of view, it is a test of omitted variables in a time series context. It was first introduced by Granger (1969).

Question: How to test Granger causality?

Consider now the linear regression model

$Y_t = \beta_0 + \beta_1 Y_{t-1} + \cdots + \beta_p Y_{t-p} + \beta_{p+1} Z_{t-1} + \cdots + \beta_{p+q} Z_{t-q} + \varepsilon_t.$

Under non-Granger causality, we have

$H_0: \beta_{p+1} = \cdots = \beta_{p+q} = 0.$

The F-test statistic $F \sim F_{q, \, n-(p+q+1)}$.

Remark: The current theory (Assumption 3.2: $E(\varepsilon_t|X) = 0$) rules out this application, because it is a dynamic regression model. However, we will justify in Chapter 5 that under $H_0$,

$q \cdot F \overset{d}{\to} \chi^2_q$ as $n \to \infty$

under conditional homoskedasticity even for a linear dynamic regression model.

Example 3 [Testing for Structural Change (or Testing for Regime Shift)]: Consider

$Y_t = \beta_0 + \beta_1 X_{1t} + \varepsilon_t,$

where $t$ is a time index, and $\{X_t\}$ and $\{\varepsilon_t\}$ are mutually independent.
Suppose there exist changes after $t = t_0$; i.e., there exist structural changes. We can consider the extended regression model:

$Y_t = (\beta_0 + \alpha_0 D_t) + (\beta_1 + \alpha_1 D_t)X_{1t} + \varepsilon_t = \beta_0 + \beta_1 X_{1t} + \alpha_0 D_t + \alpha_1 (D_t X_{1t}) + \varepsilon_t,$

where $D_t = 1$ if $t > t_0$ and $D_t = 0$ otherwise. The variable $D_t$ is called a dummy variable, indicating whether an observation belongs to the pre- or post-structural break period. The null hypothesis of no structural change is

$H_0: \alpha_0 = \alpha_1 = 0.$

The alternative hypothesis that there exists a structural change is

$H_A: \alpha_0 \neq 0$ or $\alpha_1 \neq 0.$

The F-test statistic $F \sim F_{2, n-4}$. The idea of such a test was first proposed by Chow (1960).

Case 3: Testing for Linear Restrictions

Example 1 [Testing for CRS]: Consider the extended production function

$\ln(Y_t) = \beta_0 + \beta_1 \ln(L_t) + \beta_2 \ln(K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t.$

We will test the null hypothesis of CRS:

$H_0: \beta_1 + \beta_2 = 1.$

The alternative hypothesis is

$H_A: \beta_1 + \beta_2 \neq 1.$

What is the restricted model under $H_0$? It is given by

$\ln(Y_t) = \beta_0 + \beta_1 \ln(L_t) + (1 - \beta_1)\ln(K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t,$

or equivalently

$\ln(Y_t/K_t) = \beta_0 + \beta_1 \ln(L_t/K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t.$

The F-test statistic $F \sim F_{1, n-6}$.

Remark: Both the t-test and the F-test are applicable for testing CRS.

Example 2 [Wage Determination]: Consider the wage function

$W_t = \beta_0 + \beta_1 P_t + \beta_2 P_{t-1} + \beta_3 U_t + \beta_4 V_t + \beta_5 W_{t-1} + \varepsilon_t,$

where $W_t$ = wage, $P_t$ = price, $U_t$ = unemployment, and $V_t$ = unfilled vacancies. We will test the null hypothesis

$H_0: \beta_1 + \beta_2 = 0, \ \beta_3 + \beta_4 = 0,$ and $\beta_5 = 1.$

Question: What is the economic interpretation of the null hypothesis $H_0$? What is the restricted model?

Under $H_0$, we have the restricted wage equation:

$\Delta W_t = \beta_0 + \beta_1 \Delta P_t + \beta_4 D_t + \varepsilon_t,$

where $\Delta W_t = W_t - W_{t-1}$ is the wage growth rate, $\Delta P_t = P_t - P_{t-1}$ is the inflation rate, and $D_t = V_t - U_t$ is an index of the job market situation (excess job supply). This implies that wage increases depend on the inflation rate and the excess labor supply.
The F-test statistic for $H_0$ is $F \sim F_{3, n-6}$.

Case 4: Testing under Near-Multicollinearity

Example 1: Consider the following estimation results for three separate regressions based on the same data set with $n = 25$. The first is a regression of consumption on income:

$Y_t = 36.74 + 0.832X_{1t} + e_{1t}, \quad R^2 = 0.735,$
$\qquad [1.98] \quad [7.98]$

The second is a regression of consumption on wealth:

$Y_t = 36.61 + 0.208X_{2t} + e_{2t}, \quad R^2 = 0.735,$
$\qquad [1.97] \quad [7.99]$

The third is a regression of consumption on both income and wealth:

$Y_t = 33.88 - 26.00X_{1t} + 6.71X_{2t} + e_t, \quad R^2 = 0.742,$
$\qquad [1.77] \quad [-0.74] \qquad\ [0.77]$

where the numbers inside [ ] are t-statistics. Note that in the first two separate regressions, we find significant t-test statistics for income and wealth, but in the third joint regression, both income and wealth are insignificant. This may be due to the fact that income and wealth are highly multicollinear. To test the hypothesis that neither income nor wealth has an impact on consumption, we can use the F-test:

$F = \frac{R^2/2}{(1-R^2)/(n-3)} = \frac{0.742/2}{(1-0.742)/(25-3)} = 31.636 \sim F_{2,22}.$

This F-test shows that the null hypothesis is firmly rejected at the 5% significance level, because the critical value of $F_{2,22}$ at the 5% level is about 3.44.

3.9 Generalized Least Squares Estimation

Question: The classical linear regression theory crucially depends on the assumption that $\varepsilon|X \sim N(0, \sigma^2 I)$, or equivalently, $\{\varepsilon_t\} \sim$ i.i.d. $N(0, \sigma^2)$ with $\{X_t\}$ and $\{\varepsilon_t\}$ mutually independent. What may happen if some classical assumptions do not hold? Under what conditions are the existing procedures and results still approximately valid?

Assumption 3.5 is unrealistic for many economic and financial data. Suppose Assumption 3.5 is replaced by the following condition:

Assumption 3.6: $\varepsilon|X \sim N(0, \sigma^2 V)$, where $0 < \sigma^2 < \infty$ is unknown and $V = V(X)$ is a known $n \times n$ finite and positive definite matrix.

Question: Is this assumption realistic in practice?

Remarks:

(i) Assumption 3.6 implies that $\mathrm{var}(\varepsilon|X) = E(\varepsilon\varepsilon'|X) = \sigma^2 V = \sigma^2 V(X)$ is known up to a constant $\sigma^2$. It allows for conditional heteroskedasticity of known form.
(ii) It is possible that $V$ is not a diagonal matrix. Thus, $\mathrm{cov}(\varepsilon_t, \varepsilon_s|X)$ may not be zero for $t \neq s$. In other words, Assumption 3.6 allows conditional serial correlation of known form.

(iii) However, the assumption that $V$ is known is still very restrictive from a practical point of view. In practice, $V$ usually has an unknown form.

Question: What are the statistical properties of the OLS estimator $\hat{\beta}$ under Assumption 3.6?

Theorem: Suppose Assumptions 3.1, 3.3 and 3.6 hold. Then

(i) Unbiasedness: $E(\hat{\beta}|X) = \beta^0$.

(ii) Variance: $\mathrm{var}(\hat{\beta}|X) = \sigma^2(X'X)^{-1}X'VX(X'X)^{-1} \neq \sigma^2(X'X)^{-1}$.

(iii) $(\hat{\beta} - \beta^0)|X \sim N\left(0, \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}\right)$.

(iv) $\mathrm{cov}(\hat{\beta}, e|X) \neq 0$ in general.

Remarks:

(i) The OLS estimator $\hat{\beta}$ is still unbiased, and one can show that its variance goes to zero as $n \to \infty$ (see Question 6, Problem Set 03). Thus, it converges to $\beta^0$ in the sense of MSE.

(ii) However, the variance of the OLS estimator $\hat{\beta}$ no longer has the simple expression $\sigma^2(X'X)^{-1}$ under Assumption 3.6. As a consequence, the classical t- and F-test statistics are invalid, because they are based on an incorrect variance-covariance matrix of $\hat{\beta}$: they use the expression $\sigma^2(X'X)^{-1}$ rather than the correct variance formula $\sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$.

(iii) Result (iv) of the theorem implies that even if we can obtain a consistent estimator for $\sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$ and use it to construct tests, we can no longer obtain the Student t-distribution and F-distribution, because the numerator and the denominator defining the t- and F-test statistics are no longer independent.
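The gap between the correct sandwich variance $\sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$ and the naive formula $\sigma^2(X'X)^{-1}$ is easy to see numerically. In the sketch below, $V$ is taken as a known diagonal heteroskedasticity matrix and $\sigma^2 = 1$; all numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
h = 0.2 + 2.0 * X[:, 1] ** 2      # conditional variances: V = diag(h), sigma^2 = 1
V = np.diag(h)

XtX_inv = np.linalg.inv(X.T @ X)
var_naive = XtX_inv                               # sigma^2 (X'X)^{-1}
var_sandwich = XtX_inv @ X.T @ V @ X @ XtX_inv    # sigma^2 (X'X)^{-1} X'VX (X'X)^{-1}

# Under heteroskedasticity the two formulas disagree, here for the slope:
ratio = var_sandwich[1, 1] / var_naive[1, 1]
```

With this design the naive formula understates the slope variance severely, which is exactly why the classical t- and F-statistics are invalid here.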
Proof: (i) Using $\hat{\beta} - \beta^0 = (X'X)^{-1}X'\varepsilon$, we have

$E[(\hat{\beta} - \beta^0)|X] = (X'X)^{-1}X'E(\varepsilon|X) = (X'X)^{-1}X' \cdot 0 = 0.$

(ii)

$\mathrm{var}(\hat{\beta}|X) = E[(\hat{\beta} - \beta^0)(\hat{\beta} - \beta^0)'|X] = E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}|X]$
$\qquad = (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}.$

Note that we cannot further simplify this expression, because $V \neq I$.

(iii) Because

$\hat{\beta} - \beta^0 = (X'X)^{-1}X'\varepsilon = \sum_{t=1}^n C_t\varepsilon_t,$

where $C_t = (X'X)^{-1}X_t$, $\hat{\beta} - \beta^0$ follows a normal distribution conditional on $X$, because it is a sum of normal random variables. As a result,

$(\hat{\beta} - \beta^0)|X \sim N\left(0, \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}\right).$

(iv)

$\mathrm{cov}(\hat{\beta}, e|X) = E[(\hat{\beta} - \beta^0)e'|X] = E[(X'X)^{-1}X'\varepsilon\varepsilon'M|X] = (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)M = \sigma^2(X'X)^{-1}X'VM \neq 0,$

because $X'VM \neq 0$ in general. We can see that it is conditional heteroskedasticity and/or autocorrelation in $\{\varepsilon_t\}$ that causes $\hat{\beta}$ to be correlated with $e$.

Generalized Least Squares (GLS) Estimation

To introduce GLS, we first state a useful lemma.

Lemma: For any symmetric positive definite matrix $V$, we can always write

$V^{-1} = C'C, \quad V = C^{-1}(C')^{-1},$

where $C$ is an $n \times n$ nonsingular matrix.

Question: What is this decomposition called?

Remark: $C$ may not be symmetric.

Consider the original linear regression model:

$Y = X\beta^0 + \varepsilon.$

If we multiply the equation by $C$, we obtain the transformed regression model

$CY = (CX)\beta^0 + C\varepsilon,$

or

$Y^* = X^*\beta^0 + \varepsilon^*,$

where $Y^* = CY$, $X^* = CX$ and $\varepsilon^* = C\varepsilon$. Then the OLS estimator of this transformed model,

$\hat{\beta}^* = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^* = (X'C'CX)^{-1}(X'C'CY) = (X'V^{-1}X)^{-1}X'V^{-1}Y,$

is called the Generalized Least Squares (GLS) estimator.

Question: What is the nature of GLS?

Observe that

$E(\varepsilon^*|X) = E(C\varepsilon|X) = CE(\varepsilon|X) = C \cdot 0 = 0.$

Also, note that

$\mathrm{var}(\varepsilon^*|X) = E[\varepsilon^*\varepsilon^{*\prime}|X] = E[C\varepsilon\varepsilon'C'|X] = CE(\varepsilon\varepsilon'|X)C' = \sigma^2 CVC' = \sigma^2 C[C^{-1}(C')^{-1}]C' = \sigma^2 I.$

It follows from Assumption 3.6 that $\varepsilon^*|X \sim N(0, \sigma^2 I)$. The transformation makes the new error $\varepsilon^*$ conditionally homoskedastic and serially uncorrelated, while maintaining normality.
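A minimal numerical sketch (with illustrative data and a known diagonal $V$) confirms that OLS on the transformed model reproduces the direct GLS formula $(X'V^{-1}X)^{-1}X'V^{-1}Y$:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
h = rng.uniform(0.5, 2.0, size=n)
V = np.diag(h)                        # known heteroskedasticity structure
y = X @ np.array([1.0, 2.0]) + np.sqrt(h) * rng.normal(size=n)

# A matrix C with C'C = V^{-1}: here simply diag(h^{-1/2})
C = np.diag(1.0 / np.sqrt(h))
Xs, ys = C @ X, C @ y                 # transformed model Y* = X* beta + eps*
b_transformed = np.linalg.lstsq(Xs, ys, rcond=None)[0]   # OLS on transformed data

Vinv = np.linalg.inv(V)
b_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)  # direct GLS formula
```

The two estimates coincide, since $(X^{*\prime}X^*)^{-1}X^{*\prime}Y^* = (X'C'CX)^{-1}X'C'CY$ and $C'C = V^{-1}$.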
Intuitively, suppose that for some $t$, $\varepsilon_t$ has a large variance $\sigma_t^2$. The transformation $\varepsilon_t^* = (C\varepsilon)_t$ discounts $\varepsilon_t$ by dividing it by its conditional standard deviation, so that $\varepsilon_t^*$ becomes conditionally homoskedastic. In addition, the transformation also removes possible correlation between $\varepsilon_t$ and $\varepsilon_s$, $t \neq s$. As a consequence, GLS becomes the best linear LS estimator for $\beta^0$ in the sense of the Gauss-Markov theorem.

To appreciate how the transformation by the matrix $C$ removes conditional heteroskedasticity and eliminates serial correlation, we now consider two examples.

Example 1 [Removing Heteroskedasticity]: Suppose

$V = \begin{bmatrix} h_1 & 0 & \cdots & 0 \\ 0 & h_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & h_n \end{bmatrix}.$

Then

$C = \begin{bmatrix} h_1^{-1/2} & 0 & \cdots & 0 \\ 0 & h_2^{-1/2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & h_n^{-1/2} \end{bmatrix}$

and

$\varepsilon^* = C\varepsilon = \begin{bmatrix} \varepsilon_1/\sqrt{h_1} \\ \varepsilon_2/\sqrt{h_2} \\ \vdots \\ \varepsilon_n/\sqrt{h_n} \end{bmatrix}.$

Example 2 [Eliminating Serial Correlation]: Suppose the errors follow a first-order autoregression, $\varepsilon_t = \rho\varepsilon_{t-1} + v_t$ with $\{v_t\}$ i.i.d., so that $V$ is proportional to the matrix with $(t,s)$-th element $\rho^{|t-s|}/(1-\rho^2)$. Then

$\varepsilon^* = C\varepsilon = \begin{bmatrix} \sqrt{1-\rho^2}\,\varepsilon_1 \\ \varepsilon_2 - \rho\varepsilon_1 \\ \vdots \\ \varepsilon_n - \rho\varepsilon_{n-1} \end{bmatrix},$

so that the transformed errors are serially uncorrelated.

Theorem: Under Assumptions 3.1, 3.3 and 3.6:

(i) $E(\hat{\beta}^*|X) = \beta^0$;

(ii) $\mathrm{var}(\hat{\beta}^*|X) = \sigma^2(X^{*\prime}X^*)^{-1} = \sigma^2(X'V^{-1}X)^{-1}$;

(iii) $\mathrm{cov}(\hat{\beta}^*, e^*|X) = 0$, where $e^* = Y^* - X^*\hat{\beta}^*$;

(iv) $\hat{\beta}^*$ is BLUE;

(v) $E(s^{*2}|X) = \sigma^2$, where $s^{*2} = e^{*\prime}e^*/(n-K)$.

Proof: (i) GLS is the OLS estimator of the transformed model. (ii) The transformed model satisfies Assumptions 3.1, 3.3 and 3.5 of the classical regression theory, with $\varepsilon^*|X \sim N(0, \sigma^2 I_n)$. It follows that GLS is BLUE by the Gauss-Markov theorem.

Remarks:

(i) Because $\hat{\beta}^*$ is the OLS estimator of the transformed regression model with i.i.d. $N(0, \sigma^2 I)$ errors, the t-test and F-test are applicable, with the test statistics defined as follows:

$T^* = \frac{R\hat{\beta}^* - r}{\sqrt{s^{*2}R(X^{*\prime}X^*)^{-1}R'}} \sim t_{n-K},$

$F^* = \frac{(R\hat{\beta}^* - r)'\left[R(X^{*\prime}X^*)^{-1}R'\right]^{-1}(R\hat{\beta}^* - r)\big/J}{s^{*2}} \sim F_{J, n-K}.$

When testing whether all coefficients except the intercept are jointly zero, we have $(n-K)R^{*2} \overset{d}{\to} \chi^2_J$.

(ii) Because the GLS estimator $\hat{\beta}^*$ is BLUE and the OLS estimator $\hat{\beta}$ differs from $\hat{\beta}^*$, OLS $\hat{\beta}$ cannot be BLUE.
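The efficiency gain of GLS over OLS under heteroskedasticity can be illustrated by a small Monte Carlo sketch. The design below is made up: it compares the mean squared errors of the two slope estimators when $V = \mathrm{diag}(h_t)$ is known:

```python
import numpy as np

rng = np.random.default_rng(8)

def one_draw(n=100):
    """Return the OLS and GLS slope estimates for one simulated sample."""
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    h = 0.1 + 3.0 * x ** 2                       # known conditional variances
    y = X @ np.array([1.0, 2.0]) + np.sqrt(h) * rng.normal(size=n)
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    w = 1.0 / h                                  # V^{-1} = diag(1/h)
    b_gls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return b_ols[1], b_gls[1]

draws = np.array([one_draw() for _ in range(500)])
mse_ols, mse_gls = ((draws - 2.0) ** 2).mean(axis=0)
```

Across replications the GLS slope estimator has the smaller mean squared error, consistent with its BLUE property under Assumption 3.6.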
Here

$$\hat\beta^* = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^* = (X'V^{-1}X)^{-1}X'V^{-1}Y, \qquad \hat\beta = (X'X)^{-1}X'Y.$$

In fact, the most important message of GLS is the insight it provides into the impact of conditional heteroskedasticity and serial correlation on the estimation of, and inference in, the linear regression model. In practice, GLS is generally not feasible, because the $n \times n$ matrix $V$ in $\mathrm{var}(\varepsilon|X) = \sigma^2 V$ is of unknown form.

Question: What are feasible solutions?

There are two approaches.

(i) First Approach: Adaptive Feasible GLS

In some cases, with additional assumptions, we can replace the unknown $V$ by a nonparametric estimator $\hat V$, obtaining the adaptive feasible GLS

$$\hat\beta_a = (X'\hat V^{-1}X)^{-1}X'\hat V^{-1}Y,$$

where $\hat V$ is an estimator for $V$. Because $V$ is an $n \times n$ unknown matrix and we only have $n$ data points, it is impossible to estimate $V$ consistently using a sample of size $n$ if we do not impose any restriction on the form of $V$. In other words, we have to impose some restrictions on $V$ in order to estimate it consistently. For example, suppose we assume

$$\sigma^2 V = \mathrm{diag}\{\sigma_1^2(X), \ldots, \sigma_n^2(X)\} = \mathrm{diag}\{\sigma^2(X_1), \ldots, \sigma^2(X_n)\},$$

where $\mathrm{diag}\{\cdot\}$ denotes an $n \times n$ diagonal matrix and $\sigma^2(X_t) = E(\varepsilon_t^2|X_t)$ is unknown. The fact that $\sigma^2 V$ is a diagonal matrix can arise when $\mathrm{cov}(\varepsilon_t, \varepsilon_s|X) = 0$ for all $t \neq s$, i.e., when there is no serial correlation. Then we can use the nonparametric kernel estimator

$$\hat\sigma^2(x) = \frac{n^{-1}\sum_{t=1}^n e_t^2\, b^{-1} K\!\left(\frac{x - X_t}{b}\right)}{n^{-1}\sum_{t=1}^n b^{-1} K\!\left(\frac{x - X_t}{b}\right)} \to^p \sigma^2(x),$$

where $e_t$ is the estimated OLS residual, $K(\cdot)$ is a kernel function, namely a prespecified symmetric density function (e.g., $K(u) = (2\pi)^{-1/2}\exp(-\frac{1}{2}u^2)$), and $b = b(n)$ is a bandwidth such that $b \to 0$ and $nb \to \infty$ as $n \to \infty$.

The finite-sample distribution of $\hat\beta_a$ will differ from the finite-sample distribution of $\hat\beta^*$, which assumes that $V$ is known. This is because the sampling errors of the estimator $\hat V$ have some impact on the estimator $\hat\beta_a$.
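A sketch of this adaptive feasible GLS under the diagonal-$V$ assumption, using a Gaussian-kernel Nadaraya-Watson estimate of $\sigma^2(x)$ built from OLS residuals. The data-generating process, the bandwidth rate $b = n^{-1/5}$, and all names are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
x1 = rng.uniform(-2, 2, size=n)
X = np.column_stack([np.ones(n), x1])
sigma2_true = 0.5 + x1 ** 2                # true conditional variance sigma^2(X_t)
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n) * np.sqrt(sigma2_true)

# Step 1: OLS residuals e_t
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_ols

# Step 2: kernel estimate of sigma^2(x) with a Gaussian kernel
b = n ** (-1 / 5)                          # a conventional bandwidth rate, for illustration

def sigma2_hat(x):
    u = (x - x1) / b
    w = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return np.sum(w * e ** 2) / np.sum(w)  # ratio of kernel-weighted sums

s2 = np.array([sigma2_hat(x) for x in x1])

# Step 3: feasible GLS = weighted least squares with weights 1/sigma2_hat(X_t)
W = 1.0 / s2
XtW = X.T * W
beta_fgls = np.linalg.solve(XtW @ X, XtW @ Y)
```

Because $\hat V$ is diagonal here, the feasible GLS reduces to weighted least squares: observations with large estimated conditional variance are downweighted.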
However, under suitable conditions on $\hat V$, $\hat\beta_a$ will share the same asymptotic properties as the infeasible GLS $\hat\beta^*$ (i.e., the MSE of $\hat\beta_a$ is approximately equal to the MSE of $\hat\beta^*$). In other words, the first-stage estimation of $\sigma^2(\cdot)$ has no impact on the asymptotic distribution of $\hat\beta_a$. For more discussion, see Robinson (1988) and Stinchcombe and White (1991).

(ii) Second Approach: Heteroskedasticity-Robust Inference for OLS

Continue to use the OLS estimator $\hat\beta$, but work with the correct formula

$$\mathrm{var}(\hat\beta|X) = \sigma^2 (X'X)^{-1}X'VX(X'X)^{-1}$$

together with a consistent estimator for $\mathrm{var}(\hat\beta|X)$. The classical t-test and F-test cannot be used, because they are based on an incorrect formula for $\mathrm{var}(\hat\beta|X)$. However, modified tests can be obtained by using a consistent estimator of the correct formula for $\mathrm{var}(\hat\beta|X)$. The trick is to estimate $\sigma^2 X'VX$, which is a $K \times K$ unknown matrix, rather than $V$, which is an $n \times n$ unknown matrix. However, only asymptotic distributions can be used in this case.

Question: Suppose we assume

$$E(\varepsilon\varepsilon'|X) = \sigma^2 V = \mathrm{diag}\{\sigma_1^2(X), \ldots, \sigma_n^2(X)\}.$$

As pointed out earlier, this essentially assumes $E(\varepsilon_t\varepsilon_s|X) = 0$ for all $t \neq s$; that is, there is no serial correlation in $\{\varepsilon_t\}$ conditional on $X$. Instead of estimating each $\sigma_t^2(X)$, one can estimate the $K \times K$ matrix $\sigma^2 X'VX$ directly. How can we estimate

$$\sigma^2 X'VX = \sum_{t=1}^n X_t X_t'\, \sigma_t^2(X)?$$

Answer: An estimator is

$$X'D(e)D(e)'X = \sum_{t=1}^n X_t X_t'\, e_t^2,$$

where $D(e) = \mathrm{diag}(e_1, \ldots, e_n)$ is an $n \times n$ diagonal matrix with all off-diagonal elements equal to zero. This is White's (1980) heteroskedasticity-consistent variance-covariance estimator. See more discussion in Chapter 4.

Question: For $J = 1$, do we have

$$\frac{R\hat\beta - r}{\sqrt{R(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}R'}} \sim t_{n-K}?$$

For $J > 1$, do we have

$$(R\hat\beta - r)'\left[R(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r)/J \sim F_{J,\,n-K}?$$

Answer: No. Why? Recall that $\mathrm{cov}(\hat\beta, e|X) \neq 0$ under Assumption 3.6. It follows that $\hat\beta$ and $e$ are not independent. However, when $n \to \infty$, we have:
(i) Case I ($J = 1$):

$$\frac{R\hat\beta - r}{\sqrt{R(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}R'}} \to^d N(0, 1).$$

This can be called a robust t-test.

(ii) Case II ($J > 1$):

$$(R\hat\beta - r)'\left[R(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r) \to^d \chi^2_J.$$

This is a robust Wald test statistic.

The above two feasible solutions are based on the assumption that $E(\varepsilon_t\varepsilon_s|X) = 0$ for all $t \neq s$. When there exists serial correlation of unknown form, an alternative solution is needed. This is discussed in Chapter 6.

3.10 Summary and Conclusion

In this chapter, we have presented the econometric theory for classical linear regression models. We first provided and discussed the set of assumptions on which the classical linear regression model is built. This set of regularity conditions serves as the starting point from which we will develop modern econometric theory for linear regression models.

We derived the statistical properties of the OLS estimator. In particular, we pointed out that $R^2$ is not a suitable model selection criterion, because it is always nondecreasing in the dimension of the regressors. Suitable model selection criteria, such as AIC and BIC, were discussed.

We showed that, conditional on the regressor matrix $X$, the OLS estimator $\hat\beta$ is unbiased, has a vanishing variance, and is BLUE. Under the additional conditional normality assumption, we derived the finite-sample normal distribution of $\hat\beta$, the Chi-squared distribution of $(n-K)s^2/\sigma^2$, as well as the independence between $\hat\beta$ and $s^2$. Many hypotheses encountered in economics can be formulated as linear restrictions on model parameters. Depending on the number of parameter restrictions, we derived the t-test and the F-test.
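The second approach can be sketched end-to-end: compute OLS, form White's (1980) estimator $X'D(e)D(e)'X = \sum_t X_t X_t' e_t^2$, and build the robust Wald statistic. The simulated data and the tested restriction $R\beta^0 = r$ below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
beta0 = np.array([1.0, 0.0, 0.0])                   # slopes truly zero under H0
eps = rng.normal(size=n) * np.sqrt(0.5 + x1 ** 2)   # conditional heteroskedasticity
Y = X @ beta0 + eps

# OLS and residuals
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
e = Y - X @ beta_hat

# White's heteroskedasticity-consistent "meat": sum_t X_t X_t' e_t^2
meat = (X * e[:, None] ** 2).T @ X
avar = XtX_inv @ meat @ XtX_inv                     # sandwich estimate of var(beta_hat | X)

# Robust Wald test of H0: beta_1 = beta_2 = 0  (R beta = r with J = 2)
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
r = np.zeros(2)
diff = R @ beta_hat - r
W = diff @ np.linalg.solve(R @ avar @ R.T, diff)    # asymptotically chi-squared(2) under H0
```

Under $H_0$, $W$ is compared with $\chi^2_2$ critical values; no exact finite-sample $t$ or $F$ distribution is available, consistent with the discussion above.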
In the special case of testing the hypothesis that all slope coefficients are jointly zero, we also derived an asymptotically Chi-squared test based on $R^2$.

When there exist conditional heteroskedasticity and/or autocorrelation, the OLS estimator is still unbiased and has a vanishing variance, but it is no longer BLUE, and $\hat\beta$ and $s^2$ are no longer mutually independent. Under the assumption that the variance-covariance matrix is known up to some scale parameter, one can transform the linear regression model by correcting conditional heteroskedasticity and eliminating autocorrelation, so that the transformed regression model has conditionally homoskedastic and uncorrelated errors. The OLS estimator of this transformed linear regression model is called the GLS estimator, which is BLUE. The t-test and F-test are applicable.

When the variance-covariance structure is unknown, the GLS estimator becomes infeasible. However, if the error in the original linear regression model is serially uncorrelated (as is the case with independent observations across $t$), there are two feasible solutions. The first is to use a nonparametric method to obtain a consistent estimator of the conditional variance $\mathrm{var}(\varepsilon_t|X_t)$, and then obtain a feasible plug-in GLS estimator. The second is to use White's (1980) heteroskedasticity-consistent variance-covariance matrix estimator for the OLS estimator $\hat\beta$. Both of these methods are built on asymptotic theory. When the error of the original linear regression model is serially correlated, a feasible solution is provided in Chapter 6.

References

Robinson, P. (1988)

Stinchcombe, M. and H. White (1991)

ADVANCED ECONOMETRICS

PROBLEM SET 03

1. Consider a bivariate linear regression model

$$Y_t = X_t'\beta^0 + \varepsilon_t, \quad t = 1, \ldots, n,$$

where $X_t = (X_{0t}, X_{1t})' = (1, X_{1t})'$, and $\varepsilon_t$ is a regression error. Let $\hat\rho$ denote the sample correlation between $Y_t$ and $X_{1t}$, namely,

$$\hat\rho = \frac{\sum_{t=1}^n x_{1t} y_t}{\sqrt{\sum_{t=1}^n x_{1t}^2 \sum_{t=1}^n y_t^2}},$$

where $y_t = Y_t - \bar Y$, $x_{1t} = X_{1t} - \bar X_1$, and $\bar Y$ and $\bar X_1$ are the sample means of $\{Y_t\}$ and $\{X_{1t}\}$, respectively.
Show that $R^2 = \hat\rho^2$.

2. Suppose $X_t = Q$ for all $t \geq m$, where $m$ is a fixed integer and $Q$ is a $K \times 1$ constant vector. Do we have $\lambda_{\min}(X'X) \to \infty$ as $n \to \infty$? Explain.

3. The adjusted $R^2$, denoted $\bar R^2$, is defined as follows:

$$\bar R^2 = 1 - \frac{e'e/(n-K)}{(Y - \bar Y)'(Y - \bar Y)/(n-1)}.$$

Show that

$$\bar R^2 = 1 - \frac{n-1}{n-K}\,(1 - R^2).$$

4. Consider the following linear regression model:

$$Y_t = X_t'\beta^0 + u_t, \quad t = 1, \ldots, n, \quad (4.1)$$

where $u_t = \sigma(X_t)\varepsilon_t$, $\{X_t\}$ is a nonstochastic process, and $\sigma(X_t)$ is a positive function of $X_t$ such that

$$\Sigma = \begin{bmatrix} \sigma^2(X_1) & 0 & \cdots & 0 \\ 0 & \sigma^2(X_2) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2(X_n) \end{bmatrix} = \Sigma^{1/2}\Sigma^{1/2},$$

with

$$\Sigma^{1/2} = \begin{bmatrix} \sigma(X_1) & 0 & \cdots & 0 \\ 0 & \sigma(X_2) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma(X_n) \end{bmatrix}.$$

Assume that $\{\varepsilon_t\}$ is i.i.d. $N(0, 1)$. Then $u_t \sim N(0, \sigma^2(X_t))$ independently across $t$. This differs from Assumption 3.5 of the classical linear regression analysis, because now $\{u_t\}$ exhibits conditional heteroskedasticity. Let $\hat\beta$ denote the OLS estimator of $\beta^0$.

(a) Is $\hat\beta$ unbiased for $\beta^0$?

(b) Show that $\mathrm{var}(\hat\beta) = (X'X)^{-1}X'\Sigma X(X'X)^{-1}$.

Consider an alternative estimator

$$\tilde\beta = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}Y = \left[\sum_{t=1}^n \sigma^{-2}(X_t)X_t X_t'\right]^{-1} \sum_{t=1}^n \sigma^{-2}(X_t)X_t Y_t.$$

(c) Is $\tilde\beta$ unbiased for $\beta^0$?

(d) Show that $\mathrm{var}(\tilde\beta) = (X'\Sigma^{-1}X)^{-1}$.

(e) Is $\mathrm{var}(\hat\beta) - \mathrm{var}(\tilde\beta)$ positive semi-definite (p.s.d.)? Which estimator, $\hat\beta$ or $\tilde\beta$, is more efficient?

(f) Is $\tilde\beta$ the Best Linear Unbiased Estimator (BLUE) for $\beta^0$? [Hint: There are several approaches to this question. A simple one is to consider the transformed model

$$Y_t^* = X_t^{*\prime}\beta^0 + \varepsilon_t, \quad t = 1, \ldots, n, \quad (4.2)$$

where $Y_t^* = Y_t/\sigma(X_t)$ and $X_t^* = X_t/\sigma(X_t)$. This model is obtained from model (4.1) by dividing through by $\sigma(X_t)$. In matrix notation, model (4.2) can be written as

$$Y^* = X^*\beta^0 + \varepsilon,$$

where the $n \times 1$ vector $Y^* = \Sigma^{-1/2}Y$ and the $n \times k$ matrix $X^* = \Sigma^{-1/2}X$.]

(g) Construct two test statistics for the null hypothesis of interest $H_0: \beta_2^0 = 0$.
One test is based on $\hat\beta$, and the other is based on $\tilde\beta$. What are the finite-sample distributions of your test statistics under $H_0$? Can you tell which test is better?

(h) Construct two test statistics for the null hypothesis of interest $H_0: R\beta^0 = r$, where $R$ is a $J \times k$ matrix with $J > 0$. One test is based on $\hat\beta$, and the other is based on $\tilde\beta$. What are the finite-sample distributions of your test statistics under $H_0$?

5. Consider the following classical regression model:

$$Y_t = X_t'\beta^0 + \varepsilon_t = \beta_0^0 + \sum_{j=1}^k \beta_j^0 X_{jt} + \varepsilon_t, \quad t = 1, \ldots, n. \quad (5.1)$$

Suppose that we are interested in testing the null hypothesis

$$H_0: \beta_1^0 = \beta_2^0 = \cdots = \beta_k^0 = 0.$$

Then the F-statistic can be written as

$$F = \frac{(\tilde e'\tilde e - e'e)/k}{e'e/(n-k-1)},$$

where $e'e$ is the sum of squared residuals from the unrestricted model (5.1), and $\tilde e'\tilde e$ is the sum of squared residuals from the restricted model

$$Y_t = \beta_0^0 + \varepsilon_t. \quad (5.2)$$

(a) Show that under Assumptions 3.1 and 3.3,

$$F = \frac{R^2/k}{(1 - R^2)/(n-k-1)},$$

where $R^2$ is the coefficient of determination of the unrestricted model (5.1).

(b) Suppose in addition Assumption 3.5 holds. Show that under $H_0$,

$$(n-k-1)R^2 \to^d \chi^2_k.$$

6. Suppose $X'X$ is a $K \times K$ matrix and $V$ is an $n \times n$ matrix, both symmetric and nonsingular, with the minimum eigenvalue $\lambda_{\min}(X'X) \to \infty$ as $n \to \infty$ and $0 < c \leq \lambda_{\min}(V) \leq \lambda_{\max}(V) \leq C < \infty$. Show that for any $\tau \in \mathbb{R}^K$ such that $\tau'\tau = 1$,

$$\tau'\,\mathrm{var}(\hat\beta|X)\,\tau = \sigma^2\, \tau'(X'X)^{-1}X'VX(X'X)^{-1}\tau \to 0$$

as $n \to \infty$. Thus, $\mathrm{var}(\hat\beta|X)$ vanishes as $n \to \infty$ under conditional heteroskedasticity.

7. Suppose a data generating process is given by

$$Y_t = \beta_1^0 X_{1t} + \beta_2^0 X_{2t} + \varepsilon_t = X_t'\beta^0 + \varepsilon_t,$$

where $X_t = (X_{1t}, X_{2t})'$ and Assumptions 3.1-3.4 hold. (For simplicity, we have assumed no intercept.) Denote the OLS estimator by $\hat\beta = (\hat\beta_1, \hat\beta_2)'$.

Suppose $\beta_2^0 = 0$ and we know it. Then we can consider the simpler regression

$$Y_t = \beta_1^0 X_{1t} + \varepsilon_t.$$

Denote the OLS estimator of this simpler regression as $\tilde\beta_1$. Compare the relative efficiency of $\hat\beta_1$ and $\tilde\beta_1$; that is, which estimator is better for $\beta_1^0$?
Give your reasoning in detail.