Download CHAPTER 3 CLASSICAL LINEAR REGRESSION MODELS

Document related concepts

Data assimilation wikipedia , lookup

German tank problem wikipedia , lookup

Regression toward the mean wikipedia , lookup

Choice modelling wikipedia , lookup

Instrumental variables estimation wikipedia , lookup

Bias of an estimator wikipedia , lookup

Regression analysis wikipedia , lookup

Linear regression wikipedia , lookup

Resampling (statistics) wikipedia , lookup

Coefficient of determination wikipedia , lookup

Transcript
CHAPTER 3 CLASSICAL LINEAR
REGRESSION MODELS
Key words: Classical linear regression, Conditional heteroskedasticity, Conditional
homoskedasticity, F -test, GLS, Hypothesis testing, Model selection criterion, OLS, R2 ;
t-test
Abstract: In this chapter, we will introduce the classical linear regression theory, including the classical model assumptions, the statistical properties of the OLS estimator,
the t-test and the F -test, as well as the GLS estimator and related statistical procedures.
This chapter will serve as a starting point from which we will develop the modern econometric theory.
3.1 Assumptions
Suppose we have a random sample fZt gnt=1 of size n; where Zt = (Yt ; Xt0 )0 ; Yt is
a scalar, Xt = (1; X1t ; X2t ; :::; Xkt )0 is a (k + 1) 1 vector, t is an index (either crosssectional unit or time period) for observations, and n is the sample size. We are interested
in modelling the conditional mean E(Yt jXt ) using an observed realization (i.e., a data
set) of the random sample fYt ; Xt0 g0 ; t = 1; :::; n:
Notations:
(i) K
k + 1 is the number of regressors for which there are k economic variables
and an intercept.
(ii) The index t may denote an individual unit (e.g., a …rm, a household, a country)
for cross-sectional data, or denote a time period (e.g., day, weak, month, year) in a time
series context.
We …rst list and discuss the assumptions of the classical linear regression theory.
Assumption 3.1 [Linearity]:
Yt = Xt0
where
0
is a K
0
+ "t ;
t = 1; :::; n;
1 unknown parameter vector, and "t is an unobservable disturbance.
Remarks:
(i) In Assumption 3.1, Yt is the dependent variable (or regresand), Xt is the vector of
regressors (or independent variables, or explanatory variables), and 0 is the regression
1
coe¢ cient vector. When the linear model is correctly for the conditional mean E(Yt jXt ),
@
E(Yt jXt ) is the marginal e¤ect of Xt on
i.e., when E("t jXt ) = 0; the parameter 0 = @X
t
Yt :
(ii) The key notion of linearity in the classical linear regression model is that the
regression model is linear in 0 rather than in Xt :
(iii) Does Assumption 3.1 imply a causal relationship from Xt to Yt ? Not necessarily.
As Kendall and Stuart (1961, Vol.2, Ch. 26, p.279) point out, “a statistical relationship,
however strong and however suggestive, can never establish causal connection. our ideas
of causation must come from outside statistics ultimately, from some theory or other.”
Assumption 3.1 only implies a predictive relationship: Given Xt , can we predict Yt
linearly?
(iv) Matrix Notation: Denote
= (Y1 ; :::; Yn )0 ;
Y
" = ("1 ; :::; "n )0 ;
n
n
X = (X1 ; :::; Xn )0 ;
1;
1;
n
K:
Note that the t-th row of X is Xt0 = (1; X1t ; :::; Xkt ): With these matrix notations, we
have a compact expression for Assumption 3.1:
Y
n
= X
0
1 = (n
+ ";
K)(K
1) + n
1:
The second assumption is a strict exogeneity condition.
Assumption 3.2 [Strict Exogeneity]:
E("t jX) = E("t jX1 ; :::; Xt ; :::; Xn ) = 0;
t = 1; :::; n:
Remarks:
(i) Among other things, this condition implies correct model speci…cation for E(Yt jXt ):
This is because Assumption 3.2 implies E("t jXt ) = 0 by the law of iterated expectations.
(ii) Assumption 3.2 also implies E("t ) = 0 by the law of iterated expectations.
(iii) Under Assumption 3.2, we have E(Xs "t ) = 0 for any (t; s); where t; s 2 f1; :::; ng:
This follows because
E(Xs "t ) = E[E(Xs "t jX)]
= E[Xs E("t jX)]
= E(Xs 0)
= 0:
2
Note that (i) and (ii) imply cov(Xs ; "t ) = 0 for all t; s 2 f1; :::; ng:
(iv) Because X contains regressors fXs g for both s t and s > t; Assumption 3.2
essentially requires that the error "t do not depend on the past and future values of
regressors if t is a time index. This rules out dynamic time series models for which
"t may be correlated with the future values of regressors (because the future values of
regressors depend on the current shocks), as is illustrated in the following example.
Example 1: Consider a so-called AutoRegressive AR(1) model
Yt =
0
+
1 Yt 1
+ "t ;
t = 1; :::; n;
= Xt0 + "t ;
f"t g
i.i.d.(0;
2
);
where Xt = (1; Yt 1 )0 : Obviously, E(Xt "t ) = E(Xt )E("t ) = 0 but E(Xt+1 "t ) 6= 0: Thus,
we have E("t jX) 6= 0; and so Assumption 3.2 does not hold. In Chapter 5, we will
consider linear regression models with dependent observations, which will include this
example as a special case. In fact, the main reason of imposing Assumption 3.2 is to
obtain a …nite sample distribution theory. For a large sample theory (i.e., an asymptotic
theory), the strict exogeneity condition will not be needed.
(v) In econometrics, there are some alternative de…nitions of strict exogeneity. For
example, one de…nition assumes that "t and X are independent. An example is that X
is nonstochastic. This rules out conditional heteroskedasticity (i.e., var("t jX) depends
onn X). In Assumption 3.2, we still allow for conditional heteroskedasticity, because we
do not assume that "t and X are independent. We only assume that the conditional
mean E("t jX) does not depend on X:
(vi) Question: What happens to Assumption 3.2 if X is nonstochastic?
Answer: If X is nonstochastic, Assumption 3.2 becomes
E("t jX) = E("t ) = 0:
An example of nonstochastic X is Xt = (1; t; :::; tk )0 : This corresponds to a time-trend
regression model
Yt = Xt0 0 + "t
k
X
0 j
=
j t + "t :
j=0
(vii) Question: What happens to Assumption 3.2 if Zt = (Yt ; Xt0 )0 is an independent
random sample (i.e., Zt and Zs are independent whenever t 6= s; although Yt and Xt
may not be independent)?
3
Answer: When fZt g is i.i.d., Assumption 3.2 becomes
E("t jX) = E("t jX1 ; X2 ; :::Xt ; :::; Xn )
= E("t jXt )
= 0:
In other words, when fZt g is i.i.d., E("t jX) = 0 is equivalent to E("t jXt ) = 0:
Assumption 3.3 [Nonsingularity]: The minimum eigenvalue of the K
P
matrix X 0 X = nt=1 Xt Xt0 ;
min
K square
(X 0 X) ! 1 as n ! 1
with probability one.
Question: How to …nd eigenvalues of a square matrix A?
Answer: By solving the system of linear equations:
det(A
I) = 0;
where det( ) denotes the determinant of a square matrix, and I is an identity matrix
with the same dimension as A.
Remarks:
(i) Assumption 3.3 rules out multicollinearity among the (k + 1) regressors in Xt : We
say that there exists multicolinearity among the Xt if for all t 2 f1; :::; ng; the variable
Xjt for some j 2 f0; 1; :::; kg is a linear combination of the other K 1 column variables.
(ii) This condition implies that X 0 X is nonsingular, or equivalently, X must be of
full rank of K = k + 1. Thus, we need K n: That is, the number of regressors cannot
be larger than the sample size.
(iii) It is well-known that the eigenvalue can be used to summarize information
contained in a matrix (recall the popular principal component analysis). Assumption
3.3 implies that new information must be available as the sample size n ! 1 (i.e., Xt
should not only have same repeated values as t increases).
(iv) Intuitively, if there are no variations in the values of the Xt ; it will be di¢ cult
to determine the relationship between Yt and Xt (indeed, the purpose of classical linear
regression is to investigate how a change in X causes a change in Y ): In certain sense,
one may call X 0 X the “information matrix” of the random sample X because it is a
4
measure of the information contained in X: The degree of variation in X 0 X will a¤ect
the preciseness of parameter estimation for 0 .
Question: Why can the eigenvalue
in X 0 X?
used as a measure of the information contained
Assumption 3.4 [Spherical error variance]:
(a) [conditional homoskedasticity]:
E("2t jX) =
2
> 0;
t = 1; :::; n;
(b) [conditional serial uncorrelatedness]:
E("t "s jX) = 0;
t 6= s; t; s 2 f1; :::; ng:
Remarks:
(i) We can write Assumption 3.4 as
E("t "s jX) =
2
ts ;
where ts = 1 if t = s and ts = 0 otherwise. In mathematics, ts is called the Kronecker
delta function.
(ii) var("t jX) = E("2t jX) [E("t jX)]2 = E("2t jX) = 2 given Assumption 3.2. Similarly, we have cov("t ; "s jX) = E("t "s jX) = 0 for all t 6= s:
(iii) By the law of iterated expectations, Assumption 3.4 implies that var("t ) = 2 for
all t = 1; :::; n; the so-called unconditional homoskedasticity. Similarly, cov("t ; "s ) = 0
for all t 6= s:
(iv) A compact expression for Assumptions 3.2 and 3.4 together is:
E("jX) = 0 and E(""0 jX) =
2
I;
where I In is a n n identity matrix.
(v) Assumption 3.4 does not imply that "t and X are independent. It allows the
possibility that the conditional higher order moments (e.g., skewness and kurtosis) of "t
depend on X:
3.2 OLS Estimation
Question: How to estimate 0 using an observed data set generated from the random
sample fZt gnt=1 ; where Zt = (Yt ; Xt0 )0 ?
5
De…nition [OLS estimator]: Suppose Assumptions 3.1 and 3.3 hold. De…ne the sum
of squared residuals (SSR) of the linear regression model Yt = Xt0 + ut as
SSR( )
(Y X )0 (Y X )
n
X
=
(Yt Xt0 )2 :
t=1
Then the Ordinary Least Squares (OLS) estimator ^ is the solution to
^ = arg min SSR( ):
2RK
Remark: SSR( ) is the sum of squred model errors fut = Yt
with equal weighting for each t.
Xt0 g for observations,
Theorem 1 [Existence of OLS]: Under Assumptions 3.1 and 3.3, the OLS estimator
^ exists and
^ = (X 0 X) 1 X 0 Y
!
n
X
Xt Xt0
=
t=1
1X
Xt Xt0
n t=1
n
=
1
!
n
X
Xt Yt
t=1
1
1X
Xt Yt :
n t=1
n
The latter expression will be useful for our asymptotic analysis in subsequent chapters.
Remark: Suppose Zt = fYt ; Xt0 g0 ; t = 1; :::; n; is an independent and identically distributed (i.i.d.) random sample of size n. Compare the population MSE criterion
M SE( ) = E(Yt
Xt0 )2
and the sample MSE criterion
SSR( )
1X
=
(Yt
n
n t=1
n
Xt0 )2 :
By the weak law of large numbers (WLLN), we have
SSR( ) p
! E(Yt
n
Xt0 )2
6
M SE( )
and their minimizers
^
=
!
p
n
1X
Xt Xt0
n t=1
[E(Xt Xt0 )]
!
1
1
1X
Xt Yt
n t=1
n
E(Xt Yt )
:
Here, SSR( ); after scaled by n 1 ; is the sample analogue of M SE( ); and the OLS ^
is the sample analogue of the best LS approximation coe¢ cient :
Proof of the Theorem: Using the formula that for both a and
derivative
@(a0 )
= a;
@
we have
n
dSSR( )
d X
=
(Yt Xt0 )2
d
d t=1
n
X
@
=
(Yt
@
t=1
=
n
X
2(Yt
Xt0 )2
Xt0 )
t=1
n
X
Xt (Yt
=
2
=
2X 0 (Y
@
(Yt
@
Xt0 )
Xt0 )
t=1
X ):
The OLS must satisfy the FOC:
X ^ ) = 0;
X 0 (Y X ^ ) = 0;
X 0 Y (X 0 X) ^ = 0:
2X 0 (Y
It follows that
(X 0 X) ^ = X 0 Y:
By Assumption 3.3, X 0 X is nonsingular. Thus,
^ = (X 0 X) 1 X 0 Y:
Checking the SOC, we have the K
K Hessian matrix
n
X
@ 2 SSR( )
@
=
2
Xt0 )Xt ]
0
0 [(Yt
@ @
@
t=1
= 2X 0 X
positive de…nite
7
being vectors, the
given min (X 0 X) > 0: Thus, ^ is a global minimizer. Note that for the existence of ^ ; we
only need that X 0 X is nonsingular, which is implied by the condition that min (X 0 X) !
1 as n ! 1 but it does not require that min (X 0 X) ! 1 as n ! 1:
Remarks:
(i) Y^t
Xt0 ^ is called the …tted value (or predicted value) for observation Yt ; and
et Yt Y^t is the estimated residual (or prediction error) for observation Yt : Note that
et = Yt
= (Xt0
= "t
Y^t
Xt0 ^
0
+ "t )
X 0( ^
0
t
)
= true regression disturbance + estimation error,
0
) can
where the true disturbance "t is unavoidable, and the estimation error Xt0 ( ^
0
^
be made smaller when a larger data set is available (so becomes closer to ).
(ii) FOC implies that the estimated residual e = Y X ^ is orthogonal to regressors
X in the sense that
n
X
0
Xt et = 0:
Xe=
t=1
This is the consequence of the very nature of OLS, as implied by the FOC of min 2RK SSR( ).
It always holds no matter whether E("t jX) = 0 (recall that we do not impose Assumption 3.2). Note that if Xt contains the intercept, then X 0 e = 0 implies nt=1 et = 0:
Some useful identities
To investigate the statistical properties of ^ ; we …rst state some useful lemmas.
Lemma: Under Assumptions 3.1 and 3.3, we have:
(i)
X 0 e = 0;
(ii)
^
(iii) De…ne a n
0
= (X 0 X) 1 X 0 ";
n projection matrix
P = X(X 0 X) 1 X 0
and
M = In
8
P:
Then both P and M are symmetric (i.e., P = P 0 and M = M 0 ) and idempotent (i.e.,
P 2 = P; M 2 = M ), with
P X = X;
M X = 0:
(iv)
SSR( ^ ) = e0 e = Y 0 M Y = "0 M ":
Proof:
(i) The result follows immediately by the FOC of the OLS estimator.
(ii) Because ^ = (X 0 X) 1 X 0 Y and Y = X 0 + "; we have
^
0
= (X 0 X) 1 X 0 (X
0
+ ")
0
= (X 0 X) 1 X 0 ":
(iii) P is idempotent because
P2 = PP
= [X(X 0 X) 1 X 0 ][X(X 0 X) 1 X 0 ]
= X(X 0 X) 1 X 0
= P:
Similarly we can show M 2 = M:
(iv) By the de…nition of M; we have
e = Y
X^
= Y
X(X 0 X) 1 X 0 Y
= [I
X(X 0 X) 1 X 0 ]Y
= MY
0
= M (X
= MX
0
+ ")
+ M"
= M"
given M X = 0: It follows that
SSR( ^ ) = e0 e
= (M ")0 (M ")
= "0 M 2 "
= "0 M ";
9
where the last equality follows from M 2 = M:
3.3 Goodness of …t and model selection criterion
Question: How well does the linear regression model …t the data? That is, how well
does the linear regression model explain the variation of the observed data of fYt g?
We need some criteria or some measures to characterize goodness of …t.
We …rst introduce two measures for goodness of …t. The …rst measure is called the
uncentered squared multi-correlation coe¢ cient R2
De…nition [Uncentered R2 ] : The uncentered squared multi-correlation coe¢ cient is
de…ned as
Y^ 0 Y^
e0 e
2
Ruc
= 0 =1
:
Y Y
Y 0Y
Remarks:
2
: The proportion of the uncentered sample quadratic variation in
Interpretation of Ruc
the dependent variables fYt g that can be explained by the uncentered sample quadratic
2
variation of the predicted values fY^t g. Note that we always have 0 Ruc
1:
Next, we de…ne a closely related measure called Centered R2 :
De…nition [Centered R2 : Coe¢ cient of Determination]: The coe¢ cient of determination
Pn 2
e
2
R
1 Pn t=1 t 2 ;
Y)
t=1 (Yt
P
where Y = n 1 nt=1 Yt is the sample mean.
Remarks:
(i) When Xt contains the intercept, we have the following orthogonal decomposition:
n
X
(Yt
Y )2 =
t=1
n
X
(Y^t
Y + Yt
(Y^t
2
Y^t )2
t=1
=
n
X
Y) +
t=1
+2
n
X
e2t
t=1
n
X
(Y^t
Y )et
t=1
=
n
X
t=1
10
(Y^t
2
Y) +
n
X
t=1
e2t ;
where the cross-product term
n
X
(Y^t
n
X
Y )et =
t=1
Y^t et
Y
t=1
0
= ^
n
X
et
t=1
n
X
Xt et
Y
t=1
n
X
et
t=1
0
= ^ (X 0 e)
Y
n
X
et
t=1
0
= ^ 0
Y 0
= 0;
P
where we have made use of the facts that X 0 e = 0 and nt=1 et = 0 from the FOC of
the OLS estimation and the fact that Xt contains the intercept (i.e., X0t = 1): It follows
that
R
2
and consequently we have
e0 e
1
(Yt Y )2
Pn 2
Pn t=1
Y )2
t=1 et
t=1 (Yt
Pn
=
2
Y)
t=1 (Yt
Pn ^
(Yt Y )2
:
= Pnt=1
Y )2
t=1 (Yt
Pn
R2
0
1:
(ii) If Xt does not contain the intercept, then the orthogonal decomposition identity
n
X
(Yt
Y )2 =
n
X
(Y^t
Y )2 +
e2t
t=1
t=1
t=1
n
X
no longer holds. As a consequence, R2 may be negative when there is no intercept! This
is because the cross-product term
2
n
X
(Y^t
Y )et
t=1
may be negative.
(iii) For any given random sample fYt ; Xt0 g0 ; t = 1; :::; n; R2 is nondecreasing in the
number of explanatory variables Xt : In other words, the more explanatory variables are
11
added in the linear regression, the higher R2 : This is always true no matter whether Xt
has any true explanatory power for Yt :
Theorem: Suppose fYt ; Xt0 g0 ; t = 1; :::; n; is a random sample, R12 is the centered R2
from the linear regression
Yt = Xt0 + ut ;
where Xt = (1; X1t ; :::; Xkt )0 ; and is a K
R2 from the extended linear regression
1 parameter vector; also, R22 is the centered
~ t + vt ;
Yt = X
~ t = (1; X1t ; :::; Xkt ; Xk+1;t ; :::; Xk+q;t )0 ;and
where X
Then R22 R12 :
is a (K + q)
1 parameter vector.
Proof: By de…nition, we have
R12 = 1
R22 = 1
e0 e
;
Y )2
t=1 (Yt
e~0 e~
Pn
;
Y )2
t=1 (Yt
Pn
where e is the estimated residual vector from the regression of Y on X; and e~ is the
~ It su¢ ces to show e~0 e~ e0 e:
estimated residual vector from the regression of Y on X:
~ 0 Y minimizes SSR( ) for the extended model,
~ 1X
~ 0 X)
Because the OLS estimator ^ = (X
we have
n
n
X
X
~ 0 )2 for all 2 RK+q :
~ 0 ^ )2
(Yt X
(Yt X
e~0 e~ =
t
t
t=1
t=1
Now we choose
0
= ( ^ ; 00 )0 ;
where ^ = (X 0 X) 1 X 0 Y is the OLS from the …rst regression. It follows that
!2
k+q
n
k
X
X
X
^ j Xjt
e~0 e~
Yt
0 Xjt
=
t=1
n
X
j=0
(Yt
Xt0 ^ )2
t=1
0
= e e:
Hence, we have R12
R22 : This completes the proof.
Question: What is the implication of this result?
12
j=k+1
Answer: R2 is not a suitable criterion for correct model speci…cation. It is not a suitable
criterion of model selection.
Question: Does a high R2 imply a precise estimation for
0
?
Question: What are the suitable model selection criteria?
Two popular model selection criteria
The
KISS Principle: Keep It Sophistically Simple!
An important idea in statistics is to use a simple model to capture essential information contained in data as much as possible. Below, we introduce two popular model
selection criteria that re‡ect such an idea.
Akaike Information Criterion [AIC]:
A linear regression model can be selected by minimizing the following AIC criterion
with a suitable choice of K :
2K
n
goodness of …t + model complexity
AIC = ln(s2 ) +
where
s2 = e0 e=(n
K);
and K = k + 1 is the number of regressors.
Bayesian Information Criterion [BIC]:
A linear regression model can be selected by minimizing the following criterion with
a suitable choice of K :
K ln(n)
:
BIC = ln(s2 ) +
n
Remark: When ln n 2; BIC gives a heavier penalty for model complexity than AIC,
which is measured by the number of estimated parameters (relative to the sample size
n). As a consequence, BIC will choose a more parsimonious linear regression model than
AIC.
Remark: BIC is strongly consistent in that it determines the true model asymptotically
(i.e., as n ! 1), whereas for AIC an overparameterized model will emerge no matter
13
how large the sample is. Of course, such properties are not necessarily guaranteed in
…nite samples.
Remarks: In addition to AIC and BIC, there are other criteria such as R2 ; the so-called
adjusted R2 that can also be used to select a linear regression model. The adjusted R2
is de…ned as
e0 e=(n K)
R2 = 1
:
(Y Y )0 (Y Y )=(n 1)
This de¤ers from
R2 = 1
e0 e
Y )0 (Y
(Y
Y)
:
It may be shown that
R2 = 1
n
n
1
(1
K
R2 ) :
All model criteria are structured in terms of the estimated residual variance ^ 2 plus a
penalty adjustment involving the number of estimated parameters, and it is in the extent
of this penalty that the criteria di¤er from. For more discussion about these, and other
selection criteria, see Judge et al. (1985, Section 7.5).
Question: Why is it not a good practice to use a complicated model?
Answer: A complicated model contains many unknown parameters. Given a …xed
amount of data information, parameter estimation will become less precise if more parameters have to be estimated. As a consequence, the out-of-sample forecast for Yt may
become less precise than the forecast of a simpler model. The latter may have a larger
bias but more precise parameter estimates. Intuitively, a complicated model is too ‡exible in the sense that it may not only capture systematic components but also some
features in the data which will not show up again. Thus, it cannot forecast futures well.
3.4 Consistency and E¢ ciency of OLS
We now investigate the statistical properties of ^ : We are interested in addressing
the following basic questions:
Is ^ a good estimator for
0
(consistency)?
Is ^ the best estimator (e¢ ciency)?
What is the sampling distribution of ^ (normality)?
14
Question: What is the sampling distribution of ^ ?
Answer: The distribution of ^ is called the sampling distribution of ^ ; because ^ is a
function of the random sample fZt gnt=1 ; where Zt = (Yt ; Xt0 )0 :
Remark: The sampling distribution of ^ is useful for any statistical inference involving
^ ; such as con…dence interval estimation and hypothesis testing.
We …rst investigate the statistical properties of ^ :
Theorem: Suppose Assumptions 3.1-3.4 hold. Then
(i) [Unbiaseness] E( ^ jX) = 0 and E( ^ ) = 0 :
(ii) [Vanishing Variance]
h
var( ^ jX) = E ( ^ E ^ )( ^
2
=
and for any K
1 vector
such that
0
(X 0 X)
0
1
E ^ )0 jX
i
= 1; we have
var( ^ jX) ! 0 as n ! 1:
(iii) [Orthogonality between e and ^ ]
cov( ^ ; ejX) = Ef[ ^
E( ^ jX)]e0 jXg = 0:
(iv) [Gauss-Markov]
var(^bjX)
var( ^ jx) is positive semi-de…nite (p.s.d.)
for any unbiased estimator ^b that is linear in Y with E(^bjX) =
(v) [Residual variance estimator]
s2 =
is unbiased for
2
=
E("2t ):
1
n
K
n
X
e2t = e0 e=(n
0
:
K)
t=1
That is, E(s2 jX) =
2
:
Question: What is a linear estimator ^b?
Remarks:
(a) Both Theorems (i) and (ii) imply that the conditional MSE
0 ^
0 0
M SE( ^ jX) = E[( ^
)(
) jX]
= var( ^ jX) + Bias( ^ jX)Bias( ^ jX)0
= var( ^ jX)
! 0 as n ! 1;
15
0
where we have used the fact that Bias( ^ jX)
E( ^ jX)
= 0: Recall that MSE
0
measures how close an estimator ^ is to the target :
(b) Theorem (iv) implies that ^ is the best linear unbiased estimator (BLUE) for 0
because var( ^ jX) is the smallest among all unbiased linear estimators for 0 :
Formally, we can de…ne a related concept for comparing two estimators:
De…nition [E¢ ciency]: An unbiased estimator ^ of parameter
than another unbiased estimator ^b of parameter 0 if
var(^bjX)
2 RK such that 0 = 1; we have
h
i
0
var(^bjX) var( ^ jX)
0:
Implication: For any
Choosing
var( ^ jX) is p.s.d.
= (1; 0; :::; 0)0 ; for example, we have
var(^b1 )
Proof: (i) Given ^
0
var( ^ 1 )
0:
= (X 0 X) 1 X 0 "; we have
E[( ^
0
)jX] = E[(X 0 X) 1 X 0 "jX]
= (X 0 X) 1 X 0 E("jX)
= (X 0 X) 1 X 0 0
= 0:
(ii) Given ^
0
= (X 0 X) 1 X 0 " and E(""0 jX) = 2 I; we have
h
i
var( ^ jX)
E ( ^ E ^ )( ^ E ^ )0 jX
h
i
0 ^
0 0
^
= E (
)(
) jX
= E[(X 0 X) 1 X 0 ""0 X(X 0 X) 1 jX]
= (X 0 X) 1 X 0 E(""0 jX)X(X 0 X)
= (X 0 X) 1 X 0
2
IX(X 0 X)
=
2
(X 0 X) 1 X 0 X(X 0 X)
=
2
(X 0 X) 1 :
16
1
1
1
0
is more e¢ cient
Note that Assumption 3.4 is crutial here to obtain the expression of
var( ^ jX): Moreover, for any 2 RK such that 0 = 1; we have
0
var( ^ jX)
2 0
=
=
(X 0 X)
0
2
(X 0 X)
1
for
1
2
max [(X
X) 1 ]
2
1
0
min (X X)
! 0
given min (X 0 X) ! 1 as n ! 1 with probability one. Note that the condition that
0
^
min (X X) ! 1 ensures that var( jX) vanishes to zero as n ! 1:
0
(iii) Given ^
= (X 0 X) 1 X 0 "; e = Y X ^ = M Y = M " (since M X = 0); and
E(e) = 0; we have
h
i
cov( ^ ; ejX) = E ( ^ E ^ )(e Ee)0 jX
h
i
0 0
= E (^
)e jX
= E[(X 0 X) 1 X 0 ""0 M jX]
= (X 0 X) 1 X 0 E(""0 jX)M
= (X 0 X) 1 X 0
=
2
2
IM
(X 0 X) 1 X 0 M
= 0:
Again, Assumption 3.4 plays a crutial role in ensuring zero correlation between ^
and e: (iv) Consider a linear estimator
^b = C 0 Y;
where C = C(X) is a n K matrix depending on X: It is unbiased for
the value of 0 if and only if
E(^bjX) = C 0 X
0
= C 0X
0
=
0
+ C 0 E("jX)
:
This follows if and only if
C 0 X = I:
17
0
regardless of
Because
^b = C 0 Y
0
= C 0 (X
= C 0X
0
=
0
+ ")
+ C 0"
+ C 0 ";
the variance of ^b
h
var(^b) = E (^b
0
)(^b
0 0
) jX
= E [C 0 ""0 CjX]
i
= C 0 E(""0 jX)C
= C0
=
2
2
IC
C 0 C:
Using C 0 X = I, we now have
var(^bjX)
var( ^ jX) =
2
C 0C
2
(X 0 X)
1
=
2
[C 0 C
C 0 X(X 0 X) 1 X 0 C]
=
2
C 0 [I
X(X 0 X) 1 X 0 ]C
=
2
C 0M C
=
2
C 0M M C
=
2
C 0M 0M C
=
2
(M C)0 (M C)
=
2
D0 D
n
X
2
Dt Dt0
=
t=1
p.s.d.
where we have used the fact that for any real-valued vector D; the squared matrix D0 D
is always p.s.d. [Question: How to show this?]
(v) Now we show E[e0 e=(n
K)] =
2
: Because e0 e = "0 M " and tr(AB) =tr(BA);
18
we have
E(e0 ejX) = E("0 M "jX)
= E[tr("0 M ")jX]
[putting A = "0 M; B = "]
= E[tr(""0 M )jX]
= tr[E(""0 jX)M ]
= tr( 2 IM )
=
2
tr(M )
=
2
(n
K)
where
tr(M ) = tr(In )
tr(X(X 0 X) 1 X 0 )
= tr(In )
tr(X 0 X(X 0 X) 1 )
= n
K;
using tr(AB) =tr(BA) again. It follows that
E(e0 ejX)
E(s jX) =
n K
2
(n K)
=
(n K)
2
=
:
2
This completes the proof. Note that the sample residual variance s2 = e0 e=(n
a generalization of the sample variance Sn2 = (n
sample fYt gnt=1 :
1)
1
n
t=1 (Yt
K) is
Y )2 for the random
3.5 The Sampling Distribution of ^
To obtain the …nite sample sampling distribution of ^ ; we impose the normality
assumption on ":
Assumption 3.5: "jX
N (0;
2
I):
Remarks:
19
(i) Under Assumption 3.5, the conditional pdf of " given X is
f ("jX) = p
( 2
1
2 )n
"0 "
2 2
exp
= f (");
which does not depend on X; the disturbance " is independent of X: Thus, every conditional moment of " given X does not depend on X:
(ii) Assumption 3.5 implies Assumptions 3.2 (E("jX) = 0) and 3.4 (E(""0 X) = 2 I).
(iii) The normality assumption may be reasonable for observations that are computed
as the averages of the outcomes of many repeated experiments, due to the e¤ect of
the so-called central limit theorem (CLT). This may occur in Physics, for example. In
economics, the normality assumption may not always be reasonable. For example, many
high-frequency …nancial time series usually display heavy tails (with kurtosis larger tha
3).
Question: What is the sampling distribution of ^ ?
Sampling Distribution of ^ :
^
0
= (X 0 X) 1 X 0 "
n
X
0
1
= (X X)
Xt "t
t=1
=
n
X
Ct "t ;
t=1
where the weighting matrix
Ct = (X 0 X) 1 Xt
is called the leverage of observation Xt .
Theorem [Normality of ^ ]: Under Assumptions 3.1, 3.3 and 3.5,
(^
0
)jX
2
N [0;
(X 0 X) 1 ]:
0
Proof: Conditional on X; ^
is a weighted sum of independent normal random
variables f"t g; and so is also normally distributed.
The corollary below follows immediately.
0
Corollary [Normality of R( ^
)]: Suppose Assumptions 3.1, 3.3 and 3.5 hold.
Then for any nonstochastic J K matrix R; we have
R( ^
0
)jX
N [0;
20
2
R(X 0 X) 1 R0 ]:
Question: What is the role of the nonstochastic matrix R?
Answer: R is a selection matrix. For example, when R = (1; 0; :::; 0)0 ; we then have
0
0
R( ^
) = ^0
0:
Proof: Conditional on X; ^
the linear combination R( ^
0
0
E[R( ^
0
is normally distributed. Therefore, conditional on X;
) is also normally distributed, with
)jX] = RE[( ^
0
)jX] = R 0 = 0
and
var[R( ^
0
h
)jX] = E R( ^
h
= E R( ^
h
= RE ( ^
0
)(R( ^
0
)( ^
0
)( ^
= Rvar( ^ jX)R0
2
=
i
))0 jX
i
0 0 0
) R jX
i
0 0
) jX R0
0
R(X 0 X) 1 R0 :
It follows that
R( ^
0
)jX
2
N (0;
R(X 0 X) 1 R0 ):
Question: Why would we like to know the sampling distribution of R( ^
0
)?
Answer: This is mainly for con…dence interval estimation and hypothesis testing.
0
Remark: However, var[R( ^
)jX] =
2
is unknown. We need to estimate 2 :
2
R(X 0 X) 1 R0 is unknown, because var("t ) =
3.6 Estimation of the Covariance Matrix of ^
Question: How to estimate var( ^ jx) =
Answer: To estimate
2
2
(X 0 X) 1 ?
; we can use the residual variance estimator
s2 = e0 e=(n
K):
Theorem [Residual Variance Estimator]: Suppose Assumptions 3.1, 3.3 and 3.5
hold. Then we have for all n > K; (i)
(n
K)s2
2
jX =
e0 e
2
jX
(ii) conditional on X; s2 and ^ are independent.
21
2
n K;
Remarks:
(i) Question: What is a
2
q
distribution?
2
q]
De…nition [Chi-square Distribution,
variables. Then the random variable
2
=
Suppose fZi gqi=1 are i.i.d.N(0,1) random
q
X
Zi2
i=1
2
q
will follow a
distribution.
Properties of
E(
2
q)
var(
2
q
:
= q:
2
q)
= 2q:
Remarks:
(i) Theorem (i) implies
K)s2
(n
E
(n
K)
2
It follows that E(s2 jX) =
2
jX
= n
K:
E(s2 jX) = n
K:
2
:
(ii) Theorem (i) also implies
var
(n
K)s2
(n
K)2
4
jX
= 2(n
K);
var(s2 jX) = 2(n
K);
2
var(s2 jX) =
2
n
4
K
!0
as n ! 1:
(iii) Both Theorems (i) and (ii) imply that the conditional MSE of s2
M SE(s2 jX) = E (s2
2 2
) jX
= var(s2 jX) + [E(s2 jX)
! 0:
22
2 2
]
Thus, s2 is a good estimator for 2 :
(iv) The independence between s2 and ^ is crucial for us to obtain the sampling
distribution of the popular t-test and F -test statistics, which will be introduced soon.
Proof of Theorem: (i) Because e = M "; we have
e0 e
2
=
"0 M "
2
"
=
0
M
"
:
In addition, because "jX
N (0; 2 In ); and M is an idempotent matrix with rank
= n K (as has been shown before), we have the quadratic form
e0 e
2
=
"0 M "
2
2
n K
jX
by the following lemma.
Lemma [Quadratic form of normal random variables]: If v N (0; In ) and Q is
an n n nonstochastic symmetric idempotent matrix with rank q n; then the quadratic
form
2
v 0 Qv
q:
In our application, we have v = "=
we have
N (0; I); and Q = M: Since rank(M ) = n
e0 e
2
K;
2
n K:
jX
(ii) Next, we show that s2 and ^ are independent. Because s2 = e0 e=(n K) is a
function of e; it su¢ ces to show that e and ^ are independent. This follows immediately because both e and ^ are jointly normally distributed and they are uncorrelated.
It is well-known that for a joint normal distribution, zero correlation is equivalent to
independence.
Question: Why are e and ^ jointly normally distributed?
Answer: We write
Because "jX
N (0;
"
2
e
^
0
#
"
#
M"
(X X) 1 X 0 "
"
#
M
=
":
(X 0 X) 1 X 0
=
0
I); the linear combination of
"
#
M
"
(X 0 X) 1 X 0
23
is also normally distributed conditional on X.
Question: Why are these sampling distributions useful in practice?
Answer: They are useful in con…dence interval estimation and hypothesis testing.
3.7 Hypothesis Testing
We now use the sampling distributions of ^ and s2 to develop test procedures for
hypotheses of interest. We consider testing the following linear hypothesis in form of
H0
(J
K)(K
:
R
1) = J
0
= r;
1;
where R is called the selection matrix, and J is the number of restrictions. We assume
J K:
Remark: We wil test H0 under correct model speci…cation for E(Yt jXt ).
Motivations
Example 1 [Reforms have no e¤ect]: Consider the extended production function
ln(Yt ) =
0
+
1
ln(Lt ) +
2
ln(Kt ) +
3 AUt
+
4 P St
+ "t ;
where AUt is a dummy variable indicating whether …rm t is granted autonomy, and P St
is the pro…t share of …rm t with the state.
Suppose we are interested in testing whether autonomy AUt has an e¤ect on productivity. Then we can write the null hypothesis
Ha0 :
0
3
=0
This is equivalent to the choice of:
0
= (
0;
1;
2;
3;
0
4) :
R = (0; 0; 0; 1; 0);
r = 0:
If we are interested in testing whether pro…t sharing has an e¤ect on productivity,
we can consider the null hypothesis
Hb0 :
4
24
= 0:
Alternatively, to test whether the production technology exhibits the constant return
to scale (CRS), we can write the null hypothesis as follows:
0
1
Hc0 :
0
2
+
= 1:
This is equivalent to the choice of R = (0; 1; 1; 0; 0) and r = 1:
Finally, if we are interested in examining the joint e¤ect of both autonomy and pro…t
sharing, we can test the hypothesis that neither autonomy nor pro…t sharing has impact:
0
3
Hd0 :
0
4
=
= 0:
This is equivalent to the choice of
"
0 0 0 1 0
R =
0 0 0 0 1
" #
0
r =
:
0
#
;
Example 3 [Optimal Predictor for Future Spot Exchange Rate]: Consider
St+ =
0
+
1 Ft (
) + "t+ ;
t = 1; :::; n;
where St+ is the spot exchange rate at period t + ; and Ft ( ) is period t’s price for the
foreign currency to be delivered at period t + : The null hypothesis of interest is that
the forward exchange rate Ft ( ) is an optimal predictor for the future spot rate St+ in
the sense that E(St+ jIt ) = Ft ( ); where It is the information set available at time t.
This is actually called the expectations hypothesis in economics and …nance. Given the
above speci…cation, this hypothesis can be written as
He0 :
0
0
= 0;
0
1
= 1;
and E("t+ jIt ) = 0: This is equivalent to the choice of
R=
"
1 0
0 1
#
;r =
"
0
1
#
:
Remark: All examples considered above can be formulated with a suitable speci…cation
of R; where R is a J K matrix in the null hypothesis
H0 : R
0
25
= r;
where r is a J 1 vector.
Basic Idea of Hypothesis Testing
To test the null hypothesis
0
H0 : R
= r;
we can consider the statistic:
R^
r
and check if this di¤erence is signi…cantly di¤erent from zero.
Under H0 : R
0
= r; we have
R^
r = R^ R 0
0
= R( ^
)
! 0 as n ! 1
0
because ^
! 0 as n ! 1 in the term of MSE.
Under the alternative to H0 ; R 0 6= r; but we still have ^
MSE. It follows that
R^
r
= R( ^
! R
0
0
)+R
0
0
! 0 in the term of
r
r 6= 0
as n ! 1; where the convergence is in term of MSE. In other words, R ^ r will converge
to a nonzero limit, R 0 r.
Remark: The behavior of R ^ r is di¤erent under H0 and under the alternative
hypothesis to H0 . Therefore, we can test H0 by examining whether R ^ r is signi…cantly
di¤erent from zero.
Question: How large should the magnitude of the absolute value of the di¤erence R ^ r
be in order to claim that R ^ r is signi…cantly di¤erent from zero?
Answer: We need to a decision rule which speci…es a threshold value with which we
can compare the (absolute) value of R ^ r.
Note that R ^ r is a random variable and so it can take many (possibly an in…nite
number of) values. Given a data set, we only obtain one realization of R ^ r: Whether
a realization of R ^ r is close to zero should be judged using the critical value of its
26
sampling distribution, which depends on the sample size n and the signi…cance level
2 (0; 1) one preselects.
Question: What is the sampling distribution of R ^
r under H0 ?
Because
R( ^
0
)jX
N (0;
2
R(X 0 X) 1 R0 );
we have that conditional on X;
R^
r = R( ^
0
N (R
0
)+R
r;
2
0
r
R(X 0 X) 1 R0 )
Corollary: Under Assumptions 3.1, 3.3 and 3.5, and H0 : R
n > K;
(R ^ r)jX N (0; 2 R(X 0 X) 1 R0 ):
0
= r; we have for each
Remarks:
(i) var(R ^ jX) = 2 R(X 0 X) 1 R0 is a J J matrix.
(ii) However, R ^ r cannot be used as a test statistic for H0 ; because 2 is unknown
and there is no way to calculate the critical values of the sampling distribution of R ^ r:
Question: How to construct a feasible (i.e., computable) test statistic?
The forms of test statistics will di¤er depending on whether we have J = 1 or J > 1:
We …rst consider the case of J = 1:
CASE I: t-Test (J = 1):
Recall that we have
(R ^
N (0;
2
r)jX] =
2
r)jX
R(X 0 X) 1 R0 );
When J = 1; the conditional variance
var[(R ^
is a scalar (1
R(X 0 X) 1 R0
1). It follows that conditional on X; we have
R^ r
R^ r
q
= p
2 R(X 0 X) 1 R0
var[(R ^ r)jX]
N (0; 1):
27
Question: What is the unconditional distribution of
p
R^
r
2 R(X 0 X) 1 R0
?
Answer: The unconditional distribution is also N(0,1).
However,
2
is unknown, so we cannot use the ratio
p
R^
r
2 R(X 0 X) 1 R0
as a test statistic. We have to replace 2 by s2 ; which is a good estimator for
gives a feasible (i.e., computable) test statistic
T =p
R^
r
s2 R(X 0 X) 1 R0
2
: This
:
However, the test statistic T will be no longer normally distributed. Instead,
R^
T = p
r
s2 R(X 0 X) 1 R0
= q
q
tn
R^ r
p
2 R(X 0 X) 1 R0
(n K)s2
2
=(n
K)
N (0; 1)
2
n K =(n
K)
K;
where tn K denotes the Student t-distribution with n K degrees of freedom. Note that
the numerator and denominator are mutually independent conditional on X; because ^
and s2 are mutually independent conditional on X.
Remarks:
(i) The feasible statistic T is called a t-test statistic.
(ii) What is a Student tq distribution?
De…nition [Student’s t-distribution]: Suppose Z
Z and V are independent. Then the ratio
Z
p
V =q
28
tq :
N (0; 1) and V
2
q;
and both
d
d
(iii) When q ! 1; tq ! N (0; 1); where ! denotes convergence in distribution. This
implies that we have
R^
T =p
r
d
s2 R(X 0 X) 1 R0
! N (0; 1) as n ! 1:
This result has a very important implication in practice: For a large sample size n; it
makes no di¤erence to use either the critical values from tn K or from N (0; 1).
Plots of Student t-distributions?
Question: What is meant by convergence in distribution?
De…nition [Convergence in distribution]: Suppose fZn ; n = 1; 2; g is a sequence
of random variables/vectors with distribution functions Fn (z) = P (Zn z); and Z is a
random variable/vector with distribution F (z) = P (Z
z): We say that Zn converges
to Z in distribution if the distribution of Zn converges to the distribution of Z at all
continuity points; namely,
lim Fn (z) = F (z) or
n!1
Fn (z) ! F (z) as n ! 1
for any continuity point z (i.e., for any point at which F (z) is continuous): We use the
d
notation Zn ! Z: The distribution of Z is called the asymptotic or limiting distribution
of Zn :
Implication: In practice, Zn is a test statistic or a parameter estimator, and its sampling distribution Fn (z) is either unknown or very complicated, but F (z) is known or
d
very simple. As long as Zn ! Z; then we can use F (z) as an approximation to Fn (z):
This gives a convenient procedure for statistical inference. The potential cost is that the
approximation of Fn (z) to F (z) may not be good enough in …nite samples (i.e., when n
is …nite). How good the approximation will be depend on the data generating process
and the sample size n:
Example: Suppose f"n ; n = 1; 2; g is an i.i.d. sequence with distribution function
d
F (z): Let " be a random variable with the same distribution function F (z): Then "n ! ":
With the obtained sampling distribution for the test statistic T; we can now describe
a decision rule for testing H0 when J = 1:
Decision Rule of the T-test Based on Critical Values
29
(i) Reject H0 : R
0
= r at a prespeci…ed signi…cance level
jT j > Ctn
K; 2
2 (0; 1) if
;
where Ctn K; is the so-called upper-tailed critical value of the tn
2
; which is determined by
2
h
i
P tn
or equivalently
h
P jtn
K
> Ctn
K j > Ctn
(ii) Accept H0 at the signi…cance level
jT j
=
K; 2
K; 2
if
Ctn
K; 2
i
K
distribution at level
2
= :
:
Remarks:
(a) In testing H0 ; there exist two types of errors, due to the limited information
about the population in a given random sample fZt gnt=1 . One possibility is that H0 is
true but we reject it. This is called the “Type I error”. The signi…cance level is the
probability of making the Type I error. If
h
i
P jT j > Ctn K; jH0 = ;
2
we say that the decision rule is a test with size :
(b) The probability P [jT j > Ctn K; ] is called the power function of a size
2
When
h
i
P jT j > Ctn K; jH0 is false < 1;
test.
2
there exists a possibility that one may accept H0 when it is false. This is called the
“Type II error”.
(c) Ideally one would like to minimize both the Type I error and Type II error, but
this is impossible for any given …nite sample. In practice, one usually presets the level
for Type I error, the so-called signi…cance level, and then minimizes the Type II error.
Conventional choices for signi…cance level are 10%, 5% and 1% respectively.
Next, we describe an alternative decision rule for testing H0 when J = 1; using the
so-called p-value of test statistic T:
An Equivalent Decision Rule Based on p-values
30
n
Given a data set z n = fyt ; x0t g0n
t=1 ; which is a realization of the random sample Z =
fYt ; Xt0 g0n
t=1 ; we can compute a realization (i.e., a number) for the t-test statistic T;
namely
R^ r
T (z n ) = p
:
s2 R(x0 x) 1 R0
Then the probability
p(z n ) = P [jtn
Kj
> jT (z n )j] ;
is called the p-value (i.e., probability value) of the test statistic T given that fYt ; Xt0 gnt=1 =
fyt ; x0t gnt=1 is observed, where tn K is a Student t random variable with n K degrees of
freedom, and T (z n ) is a realization for test statistic T = T (Z n ). Intuitively, the p-value
is the tail probability that the absolute value of a Student tn K random variable take
values larger than the absolute value of the test statistic T (z n ). If this probability is very
small relative to the signi…cance level, then it is unlikely that the test statistic T (Z n )
will follow a Student tn K distribution. As a consequence, the null hypothesis is likely
to be false.
The above decision rule can be described equivalently as follows:
Decision Rule Based on the p-value
(i) Reject H0 at the signi…cance level if p(z n ) < :
(ii) Accept H0 at the signi…cance level if p(z n )
:
Question: What are the advantages/disadvantages of using p-values versus using critical values?
Remarks:
(i) p-values are more informative than only rejecting/accepting the null hypothesis
at some signi…cance level. A p-value is the smallest signi…cance level at which a null
hypothesis can be rejected.
(ii) Most statistical software reports p-values of parameter estimates.
Examples of t-tests
Example 1 [Reforms have no e¤ects (continued.)]
We …rst consider testing the null hypothesis
Ha0 :
3
= 0;
where 3 is the coe¢ cient of the autonomy AUt in the extended production function
regression model. This is equivalent to the selection of R = (0; 0; 0; 1; 0): In this case,
31
we have
s2 R(X 0 X) 1 R0 =
s2 (X 0 X)
1
(4;4)
S ^2
=
which is the estimator of var( ^ 3 jX): The t-test statistic
R^
T = p
r
s2 R(X 0 X) 1 R0
^
= q3
S ^2
3
tn
K:
Next, we consider testing the CRS hypothesis
Hc0 :
1
+
2
= 1;
which corresponds to R = (0; 1; 1; 0; 0) and r = 1: In this case,
s2 R(X 0 X) 1 R0 = S ^2 + S ^2 + 2ĉov( ^ 1 ; ^ 2 )
1
2
=
2
0
s (X X)
+ s
2
+2
1
(2;2)
(X X) 1 (3;3)
s2 (X 0 X) 1 (2;3)
0
= S ^2+ ^ ;
2
which is the estimator of var( ^ 1 + ^ 2 jX): Here, ĉov( ^ 1 ; ^ 2 ) is the estimator for cov( ^ 1 ; ^ 2 jX);
the covariance between ^ 1 and ^ 2 conditional on X:
The t-test statistic is
R^
T = p
r
s2 R(X 0 X) 1 R0
=
^ +^
1
1
2
S^1+^2
tn K :
CASE II: F -testing (J > 1)
Question: What happens if J > 1? We …rst state a useful lemma.
32
Lemma: If Z
matrix, then
N (0; V ); where V = var(Z) is a nonsingular J J variance-covariance
Z 0V
1
2
J:
Z
Proof: Because V is symmetric and positive de…nite, we can …nd a symmetric and
invertible matrix V 1=2 such that
V 1=2 V 1=2 = V;
V
1=2
1=2
V
1
= V
:
(Question: What is this decomposition called?) Now, de…ne
1=2
Y =V
Z:
Then we have E(Y ) = 0; and
var(Y ) = E f[Y
E(Y )]0 g
E(Y )][Y
= E(Y Y 0 )
1=2
= E(V
ZZ 0 V
1=2
= V
1=2
E(ZZ 0 )V
= V
1=2
VV
= V
1=2
V 1=2 V 1=2 V
)
1=2
1=2
= I:
It follows that Y
N (0; I): Therefore, we have
Y 0Y
2
q:
Applying this lemma, and using the result that
(R ^
r)jX
N [0;
2
R(X 0 X) 1 R0 ]
under H0 ; we have the quadratic form
(R ^
r)0 [ 2 R(X 0 X) 1 R0 ] 1 (R ^
2
J
r)
conditional on X; or
(R ^
r)0 [R(X 0 X) 1 R0 ] 1 (R ^
2
conditional on X:
33
r)
2
J
2
J
Because
does not depend on X; therefore, we also have
(R ^
r)0 [R(X 0 X) 1 R0 ] 1 (R ^
2
r)
2
J
unconditionally.
Like in constructing a t-test statistic, we should replace 2 by s2 in the left hand
side:
(R ^ r)0 [R(X 0 X) 1 R0 ] 1 (R ^ r)
:
s2
The replacement of 2 by s2 renders the distribution of the quadratic form no longer
Chi-squared. Instead, after proper scaling, the quadratic form will follow a so-called
F -distribution with degrees of freedom equal to (q; n K).
Question: Why?
Observe
(R ^
(R ^
= J
r)0 [R(X 0 X) 1 R0 ] 1 (R ^ r)
s2
r)0 [R(X 0 X) 1 R0 ] 1 (R ^ r)
=J
2
(n K)s2
2
J
2
n K =(n
J FJ;n
where FJ;n
K
=(n
K)
2
J =J
K)
K;
denotes the F distribution with degrees of J and n
Question: What is a FJ;n
K
De…nition: Suppose U
the ratio
2
p
K distributions.
distribution?
and V
2
q;
and both U and V are independent. Then
U=p
V =q
Fp;q
is called to follow a Fp;q distribution with degrees of freedom (p; q):
Question: Why call it the F -distribution?
Answer: It is named after Professor Fisher, a well-known statistician in the 20th century.
Plots of F -distributions.
34
Remarks:
Properties of Fp;q :
(i) If F Fp;q ; then F
(ii) t2q F1;q :
1
Fq;p :
t2q =
2
1 =1
2 =q
F1;q
2
p
(iii) Given any …xed integer p; p Fp;q !
as q ! 1:
Relationship (ii) implies that when J = 1; using either the t-test or the F -test will
deliver the same conclusion. Relationship (iii) implies that the conclusions based on
Fp;q and on pFp;q using the 2p approximation will be approximately the same when q is
su¢ ciently large.
We can now de…ne the following F -test statistic
F
(R ^
FJ;n
r)0 [R(X 0 X) 1 R0 ] 1 (R ^
s2
K:
r)=J
Theorem: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Then under H0 : R
0
= r, we
have
F =
(R ^
FJ;n
r)0 [R(X 0 X) 1 R0 ] 1 (R ^
s2
r)=J
K
for all n > K:
A practical issue now is how to compute the F -statistic. One can of course compute
the F -test statistic using the de…nition of F . However, there is a very convenient way
to compute the F -test statistic. We now introduce this method.
Alternative Expression for the F -Test Statistic
Theorem: Suppose Assumptions 3.1 and 3.3 hold. Let SSRu = e0 e be the sum of
squared residuals from the unrestricted model
Y =X
35
0
+ ":
Let SSRr = e~0 e~ be the sum of squared residuals from the restricted model
0
Y =X
+"
subject to
R
0
= r;
where ~ is the restricted OLS estimator. Then under H0 ; the F -test statistic can be
written as
(~
e0 e~ e0 e)=J
FJ;n K :
F = 0
e e=(n K)
Remark: This is a convenient test statistic! One only needs to compute SSR in order
to compute the F -test statistic.
Proof: Let ~ be the OLS under H0 ; that is,
~ = arg min (Y
X )0 (Y
2RK
X )
subject to the constraint that R = r: We …rst form the Lagrangian function
X )0 (Y
L( ; ) = (Y
X ) + 2 0 (r
R );
where is a J 1 vector called the Lagrange multiplier vector.
We have the following FOC:
@L( ~ ; ~ )
=
2X 0 (Y X ~ )
@
@L( ~ ; ~ )
= r R ~ = 0:
@
2R0 ~ = 0;
With the unconstrained OLS estimator ^ = (X 0 X) 1 X 0 Y; and from the …rst equation
of FOC, we can obtain
(^
~ ) = (X 0 X) 1 R0 ~ ;
R(X 0 X) 1 R0 ~ =
R( ^ ~ ):
Hence, the Lagrange multiplier
~ =
=
[R(X 0 X) 1 R0 ] 1 R( ^
[R(X 0 X) 1 R0 ] 1 (R ^
36
~ ):
r);
where we have made use of the constraint that R ~ = r: It follows that
^
~ = (X 0 X) 1 R0 [R(X 0 X) 1 R0 ] 1 (R ^
r):
Now,
X~
X ^ + X( ^
e~ = Y
= Y
= e + X( ^
~)
~ ):
It follows that
e~0 e~ = e0 e + ( ^
~ )0 X 0 X( ^
= e0 e + (R ^
~)
r)0 [R(X 0 X) 1 R0 ] 1 (R ^
r):
We have
(R ^
r)0 [R(X 0 X) 1 R0 ] 1 (R ^
r) = e~0 e~
e0 e
and
r)0 [R(X 0 X) 1 R0 ] 1 (R ^
s2
0
0
(~
e e~ e e)=J
:
=
e0 e=(n K)
F =
(R ^
r)=J
This completes the proof.
Question: What is the interpretation for the Lagrange multiplier ~ ?
Answer: Recall that we have obtained the relation that
~ =
[R(X 0 X) 1 R0 ] 1 R( ^
[R(X 0 X) 1 R0 ] 1 (R ^
=
~)
r):
Thus, ~ is an indicator of the departure of R ^ from r: That is, the value of ~ will indicate
whether R ^ r is signi…cantly di¤erent from zero.
Question: What happens to the distribution of F when n ! 1?
Recall that under H0 ; the quadratic form
(R ^
r)0
2
R(X 0 X) 1 R0
37
1
(R ^
r)
2
J:
We have shown that
J F = (R ^
1
r)0 s2 R(X 0 X) 1 R0
(R ^
r)
is no longer 2J for each …nite n > K:
However, the following theorem shows that as n ! 1; the quadratic form J F !d
2
J : In other words, when n ! 1; the limiting distribution of J F coincides with the
distribution of the quadratic form
(R ^
r)0
2
R(X 0 X) 1 R0
1
(R ^
r):
Theorem: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Then under H0 ; we have the
Wald test statistic
(R ^ r)0 [R(X 0 X) 1 R0 ] 1 (R ^ r) d 2
! J
W =
s2
as n ! 1:
Remarks:
(i) W is called the Wald test statistic. In the present context, we have W = J F:
(ii) Recall that when n ! 1; the F -distribution FJ;n K ; after multiplied by J; will
approach the 2J distribution. This result has a very important implication in practice:
When n is large, the use of FJ;n K and the use of 2J will deliver the same conclusion on
statistical inference.
Proof: The result follows immediately by the so-called Slutsky theorem and the fact
p
p
that s2 ! 2 ; where ! denotes convergence in probability.
Question: What is the Slutsky theorem? What is convergence in probability?
d
p
p
Slutsky Theorem: Suppose Zn ! Z; an ! a and bn ! b as n ! 1: Then
d
an + bn Zn ! a + bZ:
In our application, we have
Zn =
Z
(R ^
r)0 [R(X 0 X) 1 R0 ] 1 (R ^
2
2
J;
an = 0;
a = 0;
bn =
2
=s2 ;
b = 1:
38
r)
;
The by the Slutsky theorem, we have
W = bn Zn !d 1 Z
2
J
because Zn !d Z and bn !p 1:
p
What we need is to show that s2 !
2
and then apply the Slutsky theorem.
Question: What is the concept of convergence in probability?
De…nition [Convergence in Probability] Suppose fZn g is a sequence of random
variables and Z is a random variable (including a constant). Then we say that Zn
p
converges to Z in probability, denoted as Zn Z ! 0; or Zn Z = oP (1) if for every
small constant > 0;
Zj > ] = 0; or
lim Pr [jZn
n!1
Pr(jZn
Zj >
) ! 0 as n ! 1:
Interpretation: De…ne the event
An ( ) = f! 2
: jZn (!)
Z(!)j > g;
where ! is a basic outcome in sample space : Then convergence in probability says that
the probability of event An ( ) may be nonzero for any …nite n; but such a probability
will eventually vanish to zero as n ! 1: In other words, it becomes less and less likely
that the di¤erence jZn Zj is larger than a prespeci…ed constant > 0: Or, we have
more and more con…dence that the di¤erence jZn Zj will be smaller than as n ! 1:
Question: What is the interpretation of ?
Answer: It can be viewed as a prespeci…ed tolerance level.
p
Remark: When Z = b, a constant, we can write Zn ! b; and we say that b is the
probability limit of Zn ; written as
b = p lim Zn :
n!1
Next, we introduce another convergence concept which can be conveniently used to show
convergence in probability.
De…nition [Convergence in Quadratic Mean]: We say that Zn converges to Z
in quadratic mean, denoted as Zn Z !q:m 0; or Zn Z = oq:m: (1); if
lim E[Zn
n!1
39
Z]2 = 0:
Lemma: If Zn
q:m:
p
Z ! 0; then Zn
Z ! 0:
Proof: By Chebyshev’s inequality, we have
P (jZn
for any given
Z]2
E[Zn
Zj > )
2
!0
> 0 as n ! 1:
Example: Suppose Assumptions 3.1, 3.3 and 3.5 hold. Does s2 converge in probability
to 2 ?
Solution: Under the given assumptions,
(n
K)
s2
2
n K;
2
4
2 2
and therefore we have E(s2 ) = 2 and var(s2 ) = n2 K : It follows that E(s2
) =
p
2 4 =(n K) ! 0; s2 !q:m: 2 and so s2 ! 2 because convergence in quadratic mean
implies convergence in probability.
Question: If s2 !p
2
; do we have s !p ?
Answer: Yes. Why? It follows from the following lemma with the choice of g(s2 ) =
p
s2 = s:
Lemma: Suppose Zn !p Z; and g( ) is a continuous function. Then
g(Zn ) !p g(Z):
Proof: Because g( ) is continuous, we have that for any given constant
exists = ( ) such that whenever jZn (!) Z(!)j < ; we have
jg[Zn (!)]
g[Z(!)]j < :
De…ne
A( ) = f! : jZn (!)
Z(!)j > g;
B( ) = f! : jg(Zn (!))
Then the continuity of g( ) implies that B( )
P [B( )]
g(Z(!))j > g:
A( ): It follows that
P [A( )] ! 0
40
> 0; there
as n ! 1; where P [A( )] ! 0 given Zn !p Z: Because
and
are arbitrary, we have
g(Zn ) !p g(Z):
This completes the proof.
3.8 Applications
We now consider some special but implication cases often countered in economics
and …nance.
Case 1: Testing for the Joint Signi…cance of Explanatory Variables
Consider a linear regression model
Yt = Xt0
0
0
0
+
=
+ "t
k
X
0
j Xjt
+ "t :
j=1
We are interested in testing the combined e¤ect of all the regressors except the intercept.
The null hypothesis is
H0 :
0
j
= 0 for 1
j
k;
which implies that none of the explanatory variables in‡uences Yt :
The alternative hypothesis is
HA :
0
j
6= 0 at least for some
0
j;
j = 1;
; k:
One can use the F -test and
F
Fk;n
(k+1) :
In fact, the restricted model under H0 is very simple:
0
0
Yt =
The restricted OLS estimator ~ = (Y ; 0;
e~ = Y
+ "t :
; 0)0 : It follows that
X~ = Y
Y:
Y )0 (Y
Y ):
Hence, we have
e~0 e~ = (Y
41
Recall the de…nition of R2 :
R2 = 1
= 1
e0 e
Y )0 (Y
(Y
e0 e
:
e~0 e~
Y)
It follows that
(~
e0 e~ e0 e)=k
e0 e=(n k 1)
0
(1 ee~0 ee~ )=k
= e0 e
=(n k 1)
e~0 e~
R2 =k
=
(1 R2 )=(n k
F =
1)
:
Thus, it su¢ ces to run one regression, namely the unrestricted model in this case. We
emphasize that this formula is valid only when one is testing for H0 : 0j = 0 for all
1 j k:
Example 1 [E¢ cient Market Hypothesis]: Suppose Yt is the exchange rate return
in period t; and It 1 is the information available at time t 1: Then a classical version
of the e¢ cient market hypothesis (EMH) can be stated as follows:
E(Yt jIt 1 ) = E(Yt )
To check whether exchange rate changes are unpredictable using the past history of
exchange rate changes, we specify a linear regression model:
Yt = Xt0
0
+ "t ;
where
Xt = (1; Yt 1 ; :::; Yt k )0 :
Under EMH, we have
H0 :
0
j
= 0 for all j = 1; :::; k:
If the alternative
HA :
0
j
6= 0 at least for some j 2 f1; :::; kg
holds, then exchange rate changes are predictable using the past information.
Remarks:
42
(i) What is the appropriate interpretation if H0 is not rejected? Note that there exists
a gap between the e¢ ciency hypothesis and H0 , because the linear regression model is
just one of many ways to check EMH. Thus, H0 is not rejected, at most we can only
say that no evidence against the e¢ ciency hypothesis is found. We should not conclude
that EMH holds.
(ii) Strictly speaking, the current theory (Assumption 3.2: E("t jX) = 0) rules out
this application, which is a dynamic time series regression model. However, we will
justify in Chapter 5 that
k F =
(1
d
!
R2
R2 )=(n
k
1)
2
k
under conditional homoskedasticity even for a linear dynamic regression model.
(iii) In fact, we can use a simpler version when n is large:
(n
k
d
1)R2 !
2
k:
This follows from the Slutsky theorem because R2 !p 0 under H0 : Although Assumption
3.5 is not needed for this result, conditional homoskedasticity is still needed, which rules
out autoregressive conditional heteroskedasticity (ARCH) in the time series context.
Example 1 [Consumption Function and Wealth E¤ect]: Let Yt = consumption,
X1t = labor income, X2t = liquidity asset wealth. A regression estimation gives
Yt = 33:88
[1:77]
26:00X1t + 6:71X2t + et ;
[ 0:74]
R2 = 0:742; n = 25:
[0:77]
where the numbers inside [ ] are t-statistics.
Suppose we are interested in whether labor income or liquidity asset wealth has
impact on consumption. We can use the F -test statistic,
R2 =2
(1 R2 )=(n 3)
= (0:742=2)=[(1 0:742)=(25
F =
3)]
= 31:636
F2 ;22
Comparing it with the critical value of F2;22 at the 5% signi…cance level, we reject the
null hypothesis that neither income nor liquidity asset has impact on consumption at
the 5% signi…cance level.
43
Case 2: Testing for Omitted Variables (or Testing for No E¤ect)
Suppose X = (X (1) ; X (2) ); where X (1) is a n (k1 + 1) matrix and X (2) is a n k2
matrix:
(2)
A random vector Xt has no explanary power for the conditional expectation of Yt
if
(1)
E(Yt jXt ) = E(Yt jXt ):
Alternatively, it has explanatory power for the conditional expectation of Yt if
(1)
E(Yt jXt ) 6= E(Yt jXt ):
(2)
When Xt has explaining power for Yt but is not included in the regression, we say that
(2)
Xt is an omitted random variable or vector.
(2)
Question: How to test whether Xt
context?
Consider the restricted model
Yt =
0
+
is an omitted variable in the linear regression
1 X1t
+
+
k1 Xk1 t
Suppose we have additional k2 variables (X(k1 +1)t ;
unrestricted regression model
Yt =
0
+
+
1 X1t
+ ::: +
k1 +1 X(k1 +1)t
+
+ "t :
; X(k1 +k2 )t ), and so we consider the
k1 Xk1 t
+
(k1 +k2 ) X(k1 +k2 )t
+ "t :
The null hypothesis is that the additional variables have no e¤ect on Yt : If this is the
case, then
H0 : k1 +1 = k1 +2 =
= k1 +k2 = 0:
The alternative is that at least one of the additional variables has e¤ect on Yt :
The F -Test statistic is
F =
(~
e0 e~ e0 e)=k2
e0 e=(n k1 k2 1)
Fk2 ;n
(k1 +k2 +1) :
Question: Suppose we reject the null hypothesis. Then some important explanatory
variables are omitted, and they should be included in the regression. On the other hand,
if the F -test statistic does not reject the null hypothesis H0 ; can we say that there is no
omitted variable?
44
Answer: No. There may exist a nonlinear relationship for additional variables which a
linear regression speci…cation cannot capture.
Example 1 [Testing for the E¤ect of Reforms]:
Consider the extended production function
Yt =
0
+
+
1
ln(Lt ) +
3 AUt
+
2
4 P St
ln(Kt )
+
5 CMt
+ "t ;
where AUt is the autonomy dummy, P St is the pro…t sharing ratio, and CMt is the
dummy for change of manager. The null hypothesis of interest here is that none of the
three reforms has impact:
H0 : 3 = 4 = 5 = 0:
We can use the F -test, and F
F3;n
6
under H0 :
Interpretation: If rejection occurs, there exists evidence against H0 : If no rejection
occurs, then we …nd no evidence against H0 (which is not the same as the statement
(2)
that reforms have no e¤ect): It is possible that the e¤ect of Xt is of nonlinear form. In
(2)
this case, we may obtain a zero coe¢ cient for Xt ; because the linear speci…cation may
not be able to capture it.
Example 2 [Testing for Granger Causality]: Consider two time series fYt ; Zt g;
where t is the time index, ItY 1 = fYt 1 ; :::; Y1 g and ItZ 1 = fZt 1 ; :::; Z1 g. For example,
Yt is the GDP growth, and Zt is the money supply growth. We say that Zt does not
(Y ) (Z)
Granger-cause Yt in conditional mean with respect to It 1 = fIt 1 ; It 1 g if
(Y )
(Y )
(Z)
E(Yt jIt 1 ; It 1 ) = E(Yt jIt 1 ):
In other words, the lagged variables of Zt have no impact on the level of Yt :
Remark: Granger causality is de…ned in terms of incremental predictability rather
than the real cause-e¤ect relationship. From an econometric point of view, it is a test
of omitted variables in a time series context. It is …rst introduced by Granger (1969).
Question: How to test Granger causality?
Consider now a linear regression model
Yt =
0
+
+
1 Yt 1
p+1 Zt 1
+
+
45
+
+
p Yt p
p+q Zt q
+ "t :
Under non-Granger causality, we have
H0 :
=
p+1
=
p+q
= 0:
The F -test statistic
F
Fq;n
(p+q+1) :
Remark: The current theory (Assumption 3.2: $E(\varepsilon_t|X) = 0$) rules out this application, because it is a dynamic regression model. However, we will justify in Chapter 5 that under $H_0$,
$$qF \stackrel{d}{\to} \chi^2_q$$
as $n \to \infty$ under conditional homoskedasticity, even for a linear dynamic regression model.
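As an illustration, a minimal sketch in Python of this Granger-causality test based on the asymptotic chi-squared version $qF$; the series y and z and the lag orders p and q are hypothetical choices.

import numpy as np
from scipy import stats

def ssr(y, X):
    # Sum of squared OLS residuals from regressing y on X.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return float(e @ e)

def granger_test(y, z, p, q):
    # Test H0: the q lags of z do not help predict y, given p lags of y.
    m = max(p, q)                                   # observations lost to lags
    Y = y[m:]
    ylags = np.column_stack([y[m - j:-j] for j in range(1, p + 1)])
    zlags = np.column_stack([z[m - j:-j] for j in range(1, q + 1)])
    X_r = np.column_stack([np.ones(len(Y)), ylags])          # restricted model
    X_u = np.column_stack([X_r, zlags])                      # unrestricted model
    df2 = len(Y) - (p + q + 1)
    F = ((ssr(Y, X_r) - ssr(Y, X_u)) / q) / (ssr(Y, X_u) / df2)
    W = q * F                                       # asymptotically chi2_q under H0
    return W, stats.chi2.sf(W, q)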
Example 3 [Testing for Structural Change (or Testing for Regime Shift)]: Consider
$$Y_t = \beta_0 + \beta_1 X_{1t} + \varepsilon_t,$$
where $t$ is a time index, and $\{X_t\}$ and $\{\varepsilon_t\}$ are mutually independent. Suppose there exist changes after $t = t_0$, i.e., there exist structural changes. We can consider the extended regression model
$$Y_t = (\beta_0 + \alpha_0 D_t) + (\beta_1 + \alpha_1 D_t)X_{1t} + \varepsilon_t = \beta_0 + \beta_1 X_{1t} + \alpha_0 D_t + \alpha_1 (D_t X_{1t}) + \varepsilon_t,$$
where $D_t = 1$ if $t > t_0$ and $D_t = 0$ otherwise. The variable $D_t$ is called a dummy variable, indicating whether the observation belongs to the pre- or post-structural break period.
The null hypothesis of no structural change is
$$H_0: \alpha_0 = \alpha_1 = 0.$$
The alternative hypothesis that there exists a structural change is
$$H_A: \alpha_0 \neq 0 \text{ or } \alpha_1 \neq 0.$$
The $F$-test statistic
$$F \sim F_{2,\,n-4}.$$
The idea of such a test was first proposed by Chow (1960).
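As an illustration, a minimal sketch in Python of this dummy-variable version of the test; the arrays y and x1 and the (0-based) break index t0 are hypothetical.

import numpy as np
from scipy import stats

def chow_dummy_test(y, x1, t0):
    # F-test of H0: no structural change after period t0.
    n = len(y)
    D = (np.arange(n) > t0).astype(float)            # post-break dummy
    X_r = np.column_stack([np.ones(n), x1])                   # restricted model
    X_u = np.column_stack([np.ones(n), x1, D, D * x1])        # unrestricted model
    def ssr(X):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        e = y - X @ b
        return float(e @ e)
    F = ((ssr(X_r) - ssr(X_u)) / 2) / (ssr(X_u) / (n - 4))
    return F, stats.f.sf(F, 2, n - 4)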
Case 3: Testing for linear restrictions
Example 1 [Testing for CRS]:
Consider the extended production function
$$\ln(Y_t) = \beta_0 + \beta_1 \ln(L_t) + \beta_2 \ln(K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t.$$
We will test the null hypothesis of constant returns to scale (CRS):
$$H_0: \beta_1 + \beta_2 = 1.$$
The alternative hypothesis is
$$H_A: \beta_1 + \beta_2 \neq 1.$$
What is the restricted model under $H_0$? It is given by
$$\ln(Y_t) = \beta_0 + \beta_1 \ln(L_t) + (1-\beta_1)\ln(K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t,$$
or equivalently
$$\ln(Y_t/K_t) = \beta_0 + \beta_1 \ln(L_t/K_t) + \beta_3 AU_t + \beta_4 PS_t + \beta_5 CM_t + \varepsilon_t.$$
The $F$-test statistic
$$F \sim F_{1,\,n-6}.$$
Remark: Both t- and F - tests are applicable to test CRS.
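As an illustration, a minimal sketch in Python of the general F-test of $J$ linear restrictions $R\beta = r$, which specializes to the CRS hypothesis; the design matrix X and its column ordering (constant, ln L, ln K, AU, PS, CM) are assumptions made only for this example.

import numpy as np
from scipy import stats

def restriction_F_test(y, X, R, r):
    # F = (R b - r)'[R (X'X)^{-1} R']^{-1} (R b - r) / (J s^2) ~ F_{J, n-K} under H0.
    n, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = float(e @ e) / (n - K)
    J = R.shape[0]
    d = R @ b - r
    F = float(d @ np.linalg.solve(R @ XtX_inv @ R.T, d)) / (J * s2)
    return F, stats.f.sf(F, J, n - K)

# CRS: H0: beta_1 + beta_2 = 1, with the assumed column ordering above
# R = np.array([[0.0, 1.0, 1.0, 0.0, 0.0, 0.0]]); r = np.array([1.0])
# F, p = restriction_F_test(y, X, R, r)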
Example 2 [Wage Determination]: Consider the wage function
$$W_t = \beta_0 + \beta_1 P_t + \beta_2 P_{t-1} + \beta_3 U_t + \beta_4 V_t + \beta_5 W_{t-1} + \varepsilon_t,$$
where
$W_t$ = wage,
$P_t$ = price,
$U_t$ = unemployment,
$V_t$ = unfilled vacancies.
We will test the null hypothesis
$$H_0: \beta_1 + \beta_2 = 0,\ \ \beta_3 + \beta_4 = 0,\ \text{ and }\ \beta_5 = 1.$$
Question: What is the economic interpretation of the null hypothesis $H_0$?
What is the restricted model? Under $H_0$, we have the restricted wage equation
$$\Delta W_t = \beta_0 + \beta_1 \Delta P_t + \beta_4 D_t + \varepsilon_t,$$
where $\Delta W_t = W_t - W_{t-1}$ is the wage growth, $\Delta P_t = P_t - P_{t-1}$ is the inflation rate, and $D_t = V_t - U_t$ is an index of the job market situation (excess demand for labor). This implies that wage growth depends on the inflation rate and the excess demand for labor.
The $F$-test statistic for $H_0$ is
$$F \sim F_{3,\,n-6}.$$
Case 4: Testing for Near-Multicollinearity
Example 1:
Consider the following estimation results for three separate regressions based on the same data set with $n = 25$. The first is a regression of consumption on income:
$$Y_t = 36.74 + 0.832 X_{1t} + e_{1t}, \qquad R^2 = 0.735,$$
with t-statistics [1.98] and [7.98]. The second is a regression of consumption on wealth:
$$Y_t = 36.61 + 0.208 X_{2t} + e_{2t}, \qquad R^2 = 0.735,$$
with t-statistics [1.97] and [7.99]. The third is a regression of consumption on both income and wealth:
$$Y_t = 33.88 - 26.00 X_{1t} + 6.71 X_{2t} + e_t, \qquad R^2 = 0.742,$$
with t-statistics [1.77], [-0.74] and [0.77].
Note that in the first two separate regressions, we find significant t-test statistics for income and wealth, but in the third joint regression, both income and wealth are insignificant. This may be due to the fact that income and wealth are highly multicollinear. To test whether neither income nor wealth has an impact on consumption, we can use the $F$-test:
$$F = \frac{R^2/2}{(1-R^2)/(n-3)} = \frac{0.742/2}{(1-0.742)/(25-3)} = 31.636 \sim F_{2,22}.$$
This $F$-test shows that the null hypothesis is firmly rejected at the 5% significance level, because the critical value of $F_{2,22}$ at the 5% level is about 3.44, far smaller than the value of the test statistic.
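A quick numerical check of this calculation (a sketch; the numbers are those reported above):

from scipy import stats

n, J, R2 = 25, 2, 0.742
F = (R2 / J) / ((1 - R2) / (n - 3))        # about 31.64
crit = stats.f.ppf(0.95, J, n - 3)         # 5% critical value of F(2, 22), about 3.44
print(F, crit, stats.f.sf(F, J, n - 3))    # statistic, critical value, p-value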
3.9 Generalized Least Squares Estimation
Question: The classical linear regression theory crucially depends on the assumption that $\varepsilon|X \sim N(0, \sigma^2 I)$, or equivalently that $\{\varepsilon_t\} \sim \text{i.i.d. } N(0, \sigma^2)$ and $\{X_t\}$ and $\{\varepsilon_t\}$ are mutually independent. What may happen if some classical assumptions do not hold?
Question: Under what conditions are the existing procedures and results still approximately valid?
Assumption 3.5 is unrealistic for many economic and financial data. Suppose Assumption 3.5 is replaced by the following condition:
Assumption 3.6: $\varepsilon|X \sim N(0, \sigma^2 V)$, where $0 < \sigma^2 < \infty$ is unknown and $V = V(X)$ is a known $n \times n$ finite and positive definite matrix.
Question: Is this assumption realistic in practice?
Remarks:
(i) Assumption 3.6 implies that
$$\text{var}(\varepsilon|X) = E(\varepsilon\varepsilon'|X) = \sigma^2 V = \sigma^2 V(X)$$
is known up to a constant $\sigma^2$. It allows for conditional heteroskedasticity of known form.
(ii) It is possible that $V$ is not a diagonal matrix. Thus, $\text{cov}(\varepsilon_t, \varepsilon_s|X)$ may not be zero. In other words, Assumption 3.6 allows conditional serial correlation of known form.
(iii) However, the assumption that $V$ is known is still very restrictive from a practical point of view. In practice, $V$ usually has an unknown form.
Question: What are the statistical properties of the OLS estimator $\hat\beta$ under Assumption 3.6?
Theorem: Suppose Assumptions 3.1, 3.3 and 3.6 hold. Then
(i) Unbiasedness: $E(\hat\beta|X) = \beta^0$.
(ii) Variance:
$$\text{var}(\hat\beta|X) = \sigma^2 (X'X)^{-1}X'VX(X'X)^{-1} \neq \sigma^2 (X'X)^{-1}.$$
(iii) $(\hat\beta - \beta^0)|X \sim N\!\left(0,\ \sigma^2 (X'X)^{-1}X'VX(X'X)^{-1}\right)$.
(iv) $\text{cov}(\hat\beta, e|X) \neq 0$ in general.
Remarks:
(i) The OLS estimator $\hat\beta$ is still unbiased, and one can show that its variance goes to zero as $n \to \infty$ (see Question 6, Problem Set 03). Thus, it converges to $\beta^0$ in the sense of MSE.
(ii) However, the variance of the OLS estimator $\hat\beta$ no longer has the simple expression $\sigma^2(X'X)^{-1}$ under Assumption 3.6. As a consequence, the classical t- and F-test statistics are invalid, because they are based on an incorrect variance-covariance matrix of $\hat\beta$: they use the expression $\sigma^2(X'X)^{-1}$ rather than the correct variance formula $\sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$.
(iii) Theorem (iv) implies that even if we can obtain a consistent estimator for $\sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$ and use it to construct tests, we can no longer obtain the Student t-distribution and F-distribution, because the numerator and the denominator defining the t- and F-test statistics are no longer independent.
Proof: (i) Using $\hat\beta - \beta^0 = (X'X)^{-1}X'\varepsilon$, we have
$$E[(\hat\beta - \beta^0)|X] = (X'X)^{-1}X'E(\varepsilon|X) = (X'X)^{-1}X' \cdot 0 = 0.$$
(ii)
$$\text{var}(\hat\beta|X) = E[(\hat\beta - \beta^0)(\hat\beta - \beta^0)'|X] = E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}|X] = (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}.$$
Note that we cannot further simplify the expression here because $E(\varepsilon\varepsilon'|X) = \sigma^2 V \neq \sigma^2 I$.
(iii) Because
$$\hat\beta - \beta^0 = (X'X)^{-1}X'\varepsilon = \sum_{t=1}^n C_t \varepsilon_t, \quad \text{where } C_t = (X'X)^{-1}X_t,$$
$\hat\beta - \beta^0$ follows a normal distribution conditional on $X$, because it is a sum of normal random variables. As a result,
$$\hat\beta - \beta^0\,|\,X \sim N\!\left(0,\ \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}\right).$$
(iv)
$$\text{cov}(\hat\beta, e|X) = E[(\hat\beta - \beta^0)e'|X] = E[(X'X)^{-1}X'\varepsilon\varepsilon'M|X] = (X'X)^{-1}X'E(\varepsilon\varepsilon'|X)M = \sigma^2(X'X)^{-1}X'VM \neq 0,$$
because $X'VM \neq 0$ in general. We can see that it is conditional heteroskedasticity and/or autocorrelation in $\{\varepsilon_t\}$ that causes $\hat\beta$ to be correlated with $e$.
Generalized Least Squares (GLS) Estimation
To introduce GLS, we first state a useful lemma.
Lemma: For any symmetric and positive definite matrix $V$, we can always write
$$V^{-1} = C'C, \qquad V = C^{-1}(C')^{-1},$$
where $C$ is an $n \times n$ nonsingular matrix.
Question: What is this decomposition called?
Remark: C may not be symmetric.
Consider the original linear regression model
$$Y = X\beta^0 + \varepsilon.$$
If we multiply the equation by $C$, we obtain the transformed regression model
$$CY = (CX)\beta^0 + C\varepsilon, \quad \text{or} \quad Y^* = X^*\beta^0 + \varepsilon^*,$$
where $Y^* = CY$, $X^* = CX$ and $\varepsilon^* = C\varepsilon$. Then the OLS estimator of this transformed model,
$$\hat\beta^* = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^* = (X'C'CX)^{-1}(X'C'CY) = (X'V^{-1}X)^{-1}X'V^{-1}Y,$$
is called the Generalized Least Squares (GLS) estimator.
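As an illustration, a minimal sketch in Python of the GLS computation, both directly from $V^{-1}$ and via a transformation matrix $C$ obtained from a Cholesky factorization of $V$; the arrays y, X and the known matrix V are hypothetical inputs.

import numpy as np

def gls(y, X, V):
    # beta* = (X' V^{-1} X)^{-1} X' V^{-1} Y, with V known.
    Vi = np.linalg.inv(V)
    return np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)

def gls_via_transform(y, X, V):
    # Equivalent computation: V = L L' (Cholesky), C = L^{-1} so that C'C = V^{-1};
    # transform the data and run OLS on (Y*, X*) = (C Y, C X).
    L = np.linalg.cholesky(V)
    C = np.linalg.inv(L)
    return np.linalg.lstsq(C @ X, C @ y, rcond=None)[0]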
Question: What is the nature of GLS?
Observe that
$$E(\varepsilon^*|X) = E(C\varepsilon|X) = C\,E(\varepsilon|X) = C \cdot 0 = 0.$$
Also, note that
$$\text{var}(\varepsilon^*|X) = E(\varepsilon^*\varepsilon^{*\prime}|X) = E(C\varepsilon\varepsilon'C'|X) = C\,E(\varepsilon\varepsilon'|X)\,C' = \sigma^2 CVC' = \sigma^2 C[C^{-1}(C')^{-1}]C' = \sigma^2 I.$$
It follows from Assumption 3.6 that
$$\varepsilon^*|X \sim N(0, \sigma^2 I).$$
The transformation makes the new error $\varepsilon^*$ conditionally homoskedastic and serially uncorrelated, while maintaining normality. Suppose that for some $t$, $\varepsilon_t$ has a large conditional variance $\sigma_t^2$. The transformation discounts $\varepsilon_t$ by dividing it by its conditional standard deviation, so that $\varepsilon_t^*$ becomes conditionally homoskedastic. In addition, the transformation also removes possible correlation between $\varepsilon_t$ and $\varepsilon_s$, $t \neq s$. As a consequence, GLS becomes the best linear unbiased estimator for $\beta^0$ in the sense of the Gauss-Markov theorem.
To appreciate how the transformation by matrix C removes conditional heteroskedasticity and eliminates serial correlation, we now consider two examples.
Example 1 [Removing Heteroskedasticity]: Suppose
$$V = \begin{pmatrix} h_1 & 0 & \cdots & 0 \\ 0 & h_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & h_n \end{pmatrix}.$$
Then
$$C = \begin{pmatrix} h_1^{-1/2} & 0 & \cdots & 0 \\ 0 & h_2^{-1/2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & h_n^{-1/2} \end{pmatrix}
\quad \text{and} \quad
\varepsilon^* = C\varepsilon = \begin{pmatrix} \varepsilon_1/\sqrt{h_1} \\ \varepsilon_2/\sqrt{h_2} \\ \vdots \\ \varepsilon_n/\sqrt{h_n} \end{pmatrix}.$$
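For instance, a minimal sketch of the corresponding weighted least squares computation, assuming the diagonal elements $h_1, \ldots, h_n$ are known and stored in a vector h:

import numpy as np

def wls(y, X, h):
    # GLS when V = diag(h): divide each observation by sqrt(h_t) and run OLS.
    w = 1.0 / np.sqrt(h)
    return np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]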
Example 2 [Eliminating Serial Correlation]: Suppose $\{\varepsilon_t\}$ follows a stationary AR(1) process with autoregressive coefficient $\rho$, $|\rho| < 1$, so that $V$ has elements $V_{ts} = \rho^{|t-s|}/(1-\rho^2)$. Then
$$\varepsilon^* = C\varepsilon = \begin{pmatrix} \sqrt{1-\rho^2}\,\varepsilon_1 \\ \varepsilon_2 - \rho\varepsilon_1 \\ \vdots \\ \varepsilon_n - \rho\varepsilon_{n-1} \end{pmatrix},$$
which satisfies $\text{var}(\varepsilon^*|X) = \sigma^2 I$, so that the serial correlation in $\{\varepsilon_t\}$ is eliminated.
Theorem: Under Assumptions 3.1, 3.3 and 3.6,
(i) $E(\hat\beta^*|X) = \beta^0$;
(ii) $\text{var}(\hat\beta^*|X) = \sigma^2(X^{*\prime}X^*)^{-1} = \sigma^2(X'V^{-1}X)^{-1}$;
(iii) $\text{cov}(\hat\beta^*, e^*|X) = 0$, where $e^* = Y^* - X^*\hat\beta^*$;
(iv) $\hat\beta^*$ is BLUE;
(v) $E(s^{*2}|X) = \sigma^2$, where $s^{*2} = e^{*\prime}e^*/(n-K)$.
Proof: GLS is the OLS estimator of the transformed model, and the transformed model satisfies Assumptions 3.1, 3.3 and 3.5 of the classical regression theory, with $\varepsilon^*|X \sim N(0, \sigma^2 I_n)$. All the classical results therefore apply to the transformed model; in particular, GLS is BLUE by the Gauss-Markov theorem.
Remarks:
(i) Because $\hat\beta^*$ is the OLS estimator of a transformed regression model with i.i.d. $N(0, \sigma^2 I)$ errors, the t-test and F-test are applicable, and these test statistics are defined as follows:
$$T^* = \frac{R\hat\beta^* - r}{\sqrt{s^{*2}\, R(X^{*\prime}X^*)^{-1}R'}} \sim t_{n-K},$$
$$F^* = \frac{(R\hat\beta^* - r)'[R(X^{*\prime}X^*)^{-1}R']^{-1}(R\hat\beta^* - r)/J}{s^{*2}} \sim F_{J,\,n-K}.$$
When testing whether all coefficients except the intercept are jointly zero, we also have $(n-K)R^{*2} \stackrel{d}{\to} \chi^2_J$, where $R^{*2}$ is the coefficient of determination of the transformed model.
(ii) Because the GLS estimator
$$\hat\beta^* = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^* = (X'V^{-1}X)^{-1}X'V^{-1}Y$$
is BLUE and the OLS estimator $\hat\beta = (X'X)^{-1}X'Y$ differs from $\hat\beta^*$, the OLS estimator $\hat\beta$ cannot be BLUE.
In fact, the most important message of GLS is the insight it provides into the impact of conditional heteroskedasticity and serial correlation on the estimation and inference of the linear regression model. In practice, GLS is generally not feasible, because the $n \times n$ matrix $V$ in $\text{var}(\varepsilon|X) = \sigma^2 V$ is of unknown form.
What are feasible solutions? There are two.
(i) First Approach: Adaptive Feasible GLS
In some cases, with additional assumptions, we can use a nonparametric estimator $\hat V$ in place of the unknown $V$ and obtain the adaptive feasible GLS estimator
$$\hat\beta_a = (X'\hat V^{-1}X)^{-1}X'\hat V^{-1}Y,$$
where $\hat V$ is an estimator for $V$. Because $V$ is an $n \times n$ unknown matrix and we only have $n$ data points, it is impossible to estimate $V$ consistently using a sample of size $n$ if we do not impose any restriction on the form of $V$. In other words, we have to impose some restrictions on $V$ in order to estimate it consistently. For example, suppose we assume
$$\sigma^2 V = \text{diag}\{\sigma_1^2(X), \ldots, \sigma_n^2(X)\} = \text{diag}\{\sigma^2(X_1), \ldots, \sigma^2(X_n)\},$$
where $\text{diag}\{\cdot\}$ denotes an $n \times n$ diagonal matrix and $\sigma^2(X_t) = E(\varepsilon_t^2|X)$ is unknown. The fact that $\sigma^2 V$ is a diagonal matrix can arise when $\text{cov}(\varepsilon_t, \varepsilon_s|X) = 0$ for all $t \neq s$, i.e., when there is no serial correlation. Then we can use the nonparametric kernel estimator
$$\hat\sigma^2(x) = \frac{n^{-1}\sum_{t=1}^n e_t^2\, b^{-1} K\!\left(\frac{x-X_t}{b}\right)}{n^{-1}\sum_{t=1}^n b^{-1} K\!\left(\frac{x-X_t}{b}\right)} \stackrel{p}{\to} \sigma^2(x),$$
where $e_t$ is the estimated OLS residual, $K(\cdot)$ is a kernel function, namely a prespecified symmetric probability density (e.g., $K(u) = (2\pi)^{-1/2}\exp(-\frac{1}{2}u^2)$), and $b = b(n)$ is a bandwidth such that $b \to 0$ and $nb \to \infty$ as $n \to \infty$. The finite sample distribution of $\hat\beta_a$ will be different from the finite sample distribution of $\hat\beta^*$, which assumes that $V$ is known. This is because the sampling errors of the estimator $\hat V$ have some impact on the estimator $\hat\beta_a$. However, under suitable conditions on $\hat V$, $\hat\beta_a$ will share the same asymptotic properties as the infeasible GLS estimator $\hat\beta^*$ (i.e., the MSE of $\hat\beta_a$ is approximately equal to the MSE of $\hat\beta^*$). In other words, the first stage estimation of $\sigma^2(\cdot)$ has no impact on the asymptotic distribution of $\hat\beta_a$. For more discussion, see Robinson (1988) and Stinchcombe and White (1991).
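A minimal sketch of this two-step procedure, under the simplifying assumption that the conditional variance depends on a single scalar regressor (here taken, purely for illustration, to be the second column of X); the Gaussian kernel and the bandwidth b are user choices.

import numpy as np

def adaptive_feasible_gls(y, X, b):
    # Step 1: kernel-smooth the squared OLS residuals to estimate sigma^2(X_t).
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    e2 = (y - X @ beta_ols) ** 2
    x = X[:, 1]                                   # assumed scalar conditioning variable
    u = (x[:, None] - x[None, :]) / b
    Kmat = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    sigma2_hat = (Kmat @ e2) / Kmat.sum(axis=1)   # Nadaraya-Watson estimate at each X_t
    # Step 2: weighted least squares with weights 1 / sigma_hat(X_t).
    w = 1.0 / np.sqrt(sigma2_hat)
    return np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]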
Simulation studies?
(ii) Second Approach
Continue to use the OLS estimator $\hat\beta$, but work with the correct variance formula
$$\text{var}(\hat\beta|X) = \sigma^2(X'X)^{-1}X'VX(X'X)^{-1}$$
and a consistent estimator for $\text{var}(\hat\beta|X)$. The classical definitions of the t- and F-tests cannot be used, because they are based on an incorrect formula for $\text{var}(\hat\beta|X)$. However, modified tests can be obtained by using a consistent estimator for the correct formula for $\text{var}(\hat\beta|X)$. The trick is to estimate $\sigma^2 X'VX$, which is a $K \times K$ unknown matrix, rather than to estimate $V$, which is an $n \times n$ unknown matrix. However, only asymptotic distributions can be used in this case.
Question: Suppose we assume
$$E(\varepsilon\varepsilon'|X) = \sigma^2 V = \text{diag}\{\sigma_1^2(X), \ldots, \sigma_n^2(X)\}.$$
As pointed out earlier, this essentially assumes $E(\varepsilon_t\varepsilon_s|X) = 0$ for all $t \neq s$; that is, there is no serial correlation in $\{\varepsilon_t\}$ conditional on $X$. Instead of estimating $\sigma_t^2(X)$, one can estimate the $K \times K$ matrix $\sigma^2 X'VX$ directly.
How can we estimate
$$\sigma^2 X'VX = \sum_{t=1}^n X_t X_t'\,\sigma_t^2(X)?$$
Answer: An estimator is
$$X'D(e)D(e)'X = \sum_{t=1}^n X_t X_t' e_t^2,$$
where $D(e) = \text{diag}(e_1, \ldots, e_n)$ is an $n \times n$ diagonal matrix with all off-diagonal elements equal to zero. This is called White's (1980) heteroskedasticity-consistent variance-covariance estimator. See more discussion in Chapter 4.
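A minimal sketch in Python of the OLS estimator together with this heteroskedasticity-consistent variance-covariance estimator; y and X are hypothetical data arrays.

import numpy as np

def ols_with_white_cov(y, X):
    # Returns b and (X'X)^{-1} [sum_t X_t X_t' e_t^2] (X'X)^{-1}.
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    meat = (X * (e**2)[:, None]).T @ X            # sum_t X_t X_t' e_t^2
    return b, XtX_inv @ meat @ XtX_inv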
Question: For $J = 1$, do we have
$$\frac{R\hat\beta - r}{\sqrt{R(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}R'}} \sim t_{n-K}?$$
For $J > 1$, do we have
$$(R\hat\beta - r)'\left[R(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r)/J \sim F_{J,\,n-K}?$$
Answer: No. Why? Recall that $\text{cov}(\hat\beta, e|X) \neq 0$ under Assumption 3.6. It follows that $\hat\beta$ and $e$ are not independent.
However, when $n \to \infty$, we have:
(i) Case I ($J = 1$):
$$\frac{R\hat\beta - r}{\sqrt{R(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}R'}} \stackrel{d}{\to} N(0, 1).$$
This can be called a robust t-test.
(ii) Case II ($J > 1$):
$$(R\hat\beta - r)'\left[R(X'X)^{-1}X'D(e)D(e)'X(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r) \stackrel{d}{\to} \chi^2_J.$$
This is a robust Wald test statistic.
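A minimal sketch of the robust Wald test, combining the estimator above with the quadratic form just displayed; the restriction matrix R and vector r are hypothetical inputs.

import numpy as np
from scipy import stats

def robust_wald_test(y, X, R, r):
    # W = (R b - r)'[R Vhat R']^{-1}(R b - r) -> chi2_J under H0, where Vhat is the
    # heteroskedasticity-consistent estimator of var(b | X).
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    V_hat = XtX_inv @ ((X * (e**2)[:, None]).T @ X) @ XtX_inv
    d = R @ b - r
    W = float(d @ np.linalg.solve(R @ V_hat @ R.T, d))
    return W, stats.chi2.sf(W, R.shape[0])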
The above two feasible solutions are based on the assumption that $E(\varepsilon_t\varepsilon_s|X) = 0$ for all $t \neq s$. When there exists serial correlation of unknown form, an alternative solution is needed. This is discussed in Chapter 6.
3.10 Summary and Conclusion
In this chapter, we have presented the econometric theory for the classical linear regression model. We first provide and discuss the set of assumptions on which the classical linear regression model is built. This set of regularity conditions will serve as the starting point from which we will develop modern econometric theory for linear regression models.
We derive the statistical properties of the OLS estimator. In particular, we point out that $R^2$ is not a suitable model selection criterion, because it is always nondecreasing in the dimension of the regressors. Suitable model selection criteria, such as AIC and BIC, are discussed. We show that conditional on the regressor matrix $X$, the OLS estimator $\hat\beta$ is unbiased, has a vanishing variance, and is BLUE. Under the additional conditional normality assumption, we derive the finite sample normal distribution for $\hat\beta$, the Chi-squared distribution for $(n-K)s^2/\sigma^2$, as well as the independence between $\hat\beta$ and $s^2$.
Many hypotheses encountered in economics can be formulated as linear restrictions on model parameters. Depending on the number of parameter restrictions, we derive the t-test and the F-test. In the special case of testing the hypothesis that all slope coefficients are jointly zero, we also derive an asymptotically Chi-squared test based on $R^2$.
When there exist conditional heteroskedasticity and/or autocorrelation, the OLS estimator is still unbiased and has a vanishing variance, but it is no longer BLUE, and $\hat\beta$ and $s^2$ are no longer mutually independent. Under the assumption of a variance-covariance matrix that is known up to a scale parameter, one can transform the linear regression model by correcting conditional heteroskedasticity and eliminating autocorrelation, so that the transformed regression model has conditionally homoskedastic and uncorrelated errors. The OLS estimator of this transformed linear regression model is called the GLS estimator, which is BLUE, and the t-test and F-test are applicable. When the variance-covariance structure is unknown, the GLS estimator becomes infeasible. However, if the error in the original linear regression model is serially uncorrelated (as is the case with independent observations across $t$), there are two feasible solutions. The first is to use a nonparametric method to obtain a consistent estimator for the conditional variance $\text{var}(\varepsilon_t|X_t)$ and then compute a feasible plug-in GLS estimator. The second is to use White's (1980) heteroskedasticity-consistent variance-covariance matrix estimator for the OLS estimator $\hat\beta$. Both methods are built on asymptotic theory. When the error of the original linear regression model is serially correlated, a feasible solution is provided in Chapter 6.
References
Robinson, P. (1988)
Stinchcombe, M. and H. White (1991)
ADVANCED ECONOMETRICS
PROBLEM SET 03
1. Consider a bivariate linear regression model
$$Y_t = X_t'\beta^0 + \varepsilon_t, \quad t = 1, \ldots, n,$$
where $X_t = (X_{0t}, X_{1t})' = (1, X_{1t})'$ and $\varepsilon_t$ is a regression error. Let $\hat\rho$ denote the sample correlation between $Y_t$ and $X_{1t}$, namely,
$$\hat\rho = \frac{\sum_{t=1}^n x_{1t}y_t}{\sqrt{\sum_{t=1}^n x_{1t}^2}\sqrt{\sum_{t=1}^n y_t^2}},$$
where $y_t = Y_t - \bar Y$, $x_{1t} = X_{1t} - \bar X_1$, and $\bar Y$ and $\bar X_1$ are the sample means of $\{Y_t\}$ and $\{X_{1t}\}$. Show that $R^2 = \hat\rho^2$.
2. Suppose $X_t = Q$ for all $t \geq m$, where $m$ is a fixed integer and $Q$ is a $K \times 1$ constant vector. Do we have $\lambda_{\min}(X'X) \to \infty$ as $n \to \infty$? Explain.
3. The adjusted $R^2$, denoted $\bar R^2$, is defined as follows:
$$\bar R^2 = 1 - \frac{e'e/(n-K)}{(Y-\bar Y)'(Y-\bar Y)/(n-1)}.$$
Show that
$$\bar R^2 = 1 - \frac{n-1}{n-K}\,(1 - R^2).$$
4. Consider the following linear regression model
$$Y_t = X_t'\beta^0 + u_t, \quad t = 1, \ldots, n, \qquad (4.1)$$
where
$$u_t = \sigma(X_t)\varepsilon_t,$$
$\{X_t\}$ is a nonstochastic process, and $\sigma(X_t)$ is a positive function of $X_t$ such that
$$\Sigma = \text{diag}\{\sigma^2(X_1), \sigma^2(X_2), \ldots, \sigma^2(X_n)\} = \Sigma^{1/2}\Sigma^{1/2},$$
with
$$\Sigma^{1/2} = \text{diag}\{\sigma(X_1), \sigma(X_2), \ldots, \sigma(X_n)\}.$$
Assume that $\{\varepsilon_t\}$ is i.i.d. $N(0, 1)$. Then $u_t$ is independently distributed as $N(0, \sigma^2(X_t))$. This differs from Assumption 3.5 of the classical linear regression analysis, because now $\{u_t\}$ exhibits conditional heteroskedasticity.
Let $\hat\beta$ denote the OLS estimator for $\beta^0$.
(a) Is $\hat\beta$ unbiased for $\beta^0$?
(b) Show that $\text{var}(\hat\beta) = (X'X)^{-1}X'\Sigma X(X'X)^{-1}$.
Consider an alternative estimator
$$\tilde\beta = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}Y = \left[\sum_{t=1}^n \sigma^{-2}(X_t)X_tX_t'\right]^{-1}\sum_{t=1}^n \sigma^{-2}(X_t)X_tY_t.$$
(c) Is $\tilde\beta$ unbiased for $\beta^0$?
(d) Show that $\text{var}(\tilde\beta) = (X'\Sigma^{-1}X)^{-1}$.
(e) Is $\text{var}(\hat\beta) - \text{var}(\tilde\beta)$ positive semi-definite (p.s.d.)? Which estimator, $\hat\beta$ or $\tilde\beta$, is more efficient?
(f) Is $\tilde\beta$ the Best Linear Unbiased Estimator (BLUE) for $\beta^0$? [Hint: There are several approaches to this question. A simple one is to consider the transformed model
$$Y_t^* = X_t^{*\prime}\beta^0 + \varepsilon_t, \quad t = 1, \ldots, n, \qquad (4.2)$$
where $Y_t^* = Y_t/\sigma(X_t)$ and $X_t^* = X_t/\sigma(X_t)$. This model is obtained from model (4.1) after dividing by $\sigma(X_t)$. In matrix notation, model (4.2) can be written as
$$Y^* = X^*\beta^0 + \varepsilon,$$
where the $n \times 1$ vector $Y^* = \Sigma^{-1/2}Y$ and the $n \times k$ matrix $X^* = \Sigma^{-1/2}X$.]
(g) Construct two test statistics for the null hypothesis of interest $H_0: \beta_2^0 = 0$. One test is based on $\hat\beta$, and the other test is based on $\tilde\beta$. What are the finite sample distributions of your test statistics under $H_0$? Can you tell which test is better?
(h) Construct two test statistics for the null hypothesis of interest $H_0: R\beta^0 = r$, where $R$ is a $J \times k$ matrix with $J > 0$. One test is based on $\hat\beta$, and the other test is based on $\tilde\beta$. What are the finite sample distributions of your test statistics under $H_0$?
5. Consider the following classical regression model
$$Y_t = X_t'\beta^0 + \varepsilon_t = \beta_0^0 + \sum_{j=1}^k \beta_j^0 X_{jt} + \varepsilon_t, \quad t = 1, \ldots, n. \qquad (5.1)$$
Suppose that we are interested in testing the null hypothesis
$$H_0: \beta_1^0 = \beta_2^0 = \cdots = \beta_k^0 = 0.$$
Then the $F$-statistic can be written as
$$F = \frac{(\tilde e'\tilde e - e'e)/k}{e'e/(n-k-1)},$$
where $e'e$ is the sum of squared residuals from the unrestricted model (5.1), and $\tilde e'\tilde e$ is the sum of squared residuals from the restricted model
$$Y_t = \beta_0^0 + \varepsilon_t. \qquad (5.2)$$
(a) Show that under Assumptions 3.1 and 3.3,
$$F = \frac{R^2/k}{(1-R^2)/(n-k-1)},$$
where $R^2$ is the coefficient of determination of the unrestricted model (5.1).
(b) Suppose in addition Assumption 3.5 holds. Show that under $H_0$,
$$(n-k-1)R^2 \stackrel{d}{\to} \chi^2_k.$$
6. Suppose $X'X$ is a $K \times K$ matrix, $V$ is an $n \times n$ matrix, and both $X'X$ and $V$ are symmetric and nonsingular, with the minimum eigenvalue $\lambda_{\min}(X'X) \to \infty$ as $n \to \infty$ and $0 < c \leq \lambda_{\max}(V) \leq C < \infty$. Show that for any $\tau \in \mathbb{R}^K$ such that $\tau'\tau = 1$,
$$\tau'\,\text{var}(\hat\beta|X)\,\tau = \sigma^2\,\tau'(X'X)^{-1}X'VX(X'X)^{-1}\tau \to 0$$
as $n \to \infty$. Thus, $\text{var}(\hat\beta|X)$ vanishes to zero as $n \to \infty$ under conditional heteroskedasticity.
7. Suppose a data generating process is given by
$$Y_t = \beta_1^0 X_{1t} + \beta_2^0 X_{2t} + \varepsilon_t = X_t'\beta^0 + \varepsilon_t,$$
where $X_t = (X_{1t}, X_{2t})'$, and Assumptions 3.1-3.4 hold. (For simplicity, we have assumed no intercept.) Denote the OLS estimator by $\hat\beta = (\hat\beta_1, \hat\beta_2)'$.
Suppose $\beta_2^0 = 0$ and we know it. Then we can consider a simpler regression
$$Y_t = \beta_1^0 X_{1t} + \varepsilon_t.$$
Denote the OLS estimator of this simpler regression by $\tilde\beta_1$.
Please compare the relative efficiency of $\hat\beta_1$ and $\tilde\beta_1$; that is, which estimator is better for $\beta_1^0$? Give your reasoning in detail.