Consistency and asymptotic normality
Class notes for Econ 842
Robert de Jong∗
March 2006
1 Stochastic convergence
The asymptotic theory of minimization estimators relies on various theorems from mathematical statistics. The objective of this section is to explain the main theorems that underpin
the asymptotic theory for minimization estimators.
1.1 Infimum and supremum, and the limit operations

1.1.1 Supremum and infimum
The supremum of a set A, if it exists, is the smallest number y for which x ≤ y for every
x ∈ A. Analogously, the infimum of a set A, if it exists, is the largest number y for which
x ≥ y for every x ∈ A. For an open set such as A = (0, 1), the supremum sup A and the
infimum inf A are well-defined as 1 and 0, respectively. If the supremum is attained for some
element in A, the supremum is also called the maximum; analogously, the minimum of a set
is defined as the infimum if it is attained for some element of A. Therefore, the maximum
and minimum of A = (0, 1) are not defined, because they would have to equal 1 and 0, which
are not elements of (0, 1).
If we define the extended real line as R ∪ {−∞, +∞}, the sup and inf are always defined
on the extended real line.
∗ Department of Economics, Ohio State University, 429 Arps Hall, Columbus, OH 43210, email [email protected]
1.1.2 The limit operation
For a deterministic sequence xn , we say that it “converges to x”, notation limn→∞ xn = x,
if for all ε > 0, there exists an Nε such that |xn − x| < ε for all n ≥ Nε . The “limsup”
(superior limit, “limes superior” in Latin) is defined as
lim sup_{n→∞} x_n = inf_{n≥1} sup_{m≥n} x_m.
The limsup is the “eventual upper limit” of a sequence. For example, consider xn = n−1 .
Then
lim sup_{n→∞} x_n = inf_{n≥1} sup_{m≥n} m^{-1} = inf_{n≥1} n^{-1} = 0,
just as one would expect. From the earlier discussion, it is now clear that the limsup is
always defined on the extended real line. The limsup of a sequence is its "eventual upper
bound": for example, lim sup_{n→∞} sin(n) = 1, while the limit of sin(n) is obviously undefined.
Analogously, we define
lim inf_{n→∞} x_n = sup_{n≥1} inf_{m≥n} x_m.
For x_n = n^{-1}, we obtain
lim inf_{n→∞} x_n = sup_{n≥1} inf_{m≥n} m^{-1} = sup_{n≥1} 0 = 0,
as one would expect from an "eventual lower limit". The limit of a sequence exists if and only
if its limsup and liminf are equal.
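As a quick numerical illustration (a Python sketch added to these notes, not part of the original argument), the tail suprema sup_{m≥n} x_m can be approximated over a long finite stretch; for x_m = sin(m) they stay near 1, consistent with lim sup_{n→∞} sin(n) = 1 even though the limit does not exist.

import numpy as np

# Tail suprema sup_{m >= n} x_m for x_m = sin(m); these are nonincreasing in n
# and their limit is the limsup. The infinite tail is approximated by a long finite stretch.
x = np.sin(np.arange(1, 200001))
for n in [1, 10, 100, 1000, 10000]:
    tail_sup = x[n - 1:].max()   # approximates sup_{m >= n} sin(m)
    print(n, tail_sup)           # values stay close to 1, consistent with limsup = 1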
If, after noting that log(1 + x) ≤ x, we read something like
lim_{n→∞} n^2 log(Φ(n)) ≤ lim_{n→∞} n^2 (Φ(n) − 1) = 0
where Φ(·) denotes the standard normal distribution function, we should realize the following.
In principle, when writing down the left-hand side expression, it is possible for this expression
to be undefined. The correct interpretation of the above equation is that the inequality finds
an upper bound if the expression exists; but formally the existence of the left-hand side
remains unproven. The assertion that the limsup is less than or equal to zero, however, is
always true, because the limsup is always defined on the extended real line.
1.2 Almost sure convergence, convergence in probability, and convergence in distribution

1.2.1 Definitions of stochastic convergence concepts
While for a deterministic sequence Xn it is relatively straightforward to define convergence of
Xn to a number X, the situation becomes more complicated in the case where Xn and/or X
is allowed to be a random variable. There are four types of convergence for random variables
that are used frequently in the econometric literature. They are convergence in distribution,
convergence in probability, convergence in Lp , and almost sure convergence. We say that Xn
converges to X in distribution if
lim_{n→∞} P(X_n ≤ y) = P(X ≤ y)
for all y for which P(X ≤ y) is continuous in y. To see why we want this requirement
to hold only for continuity points of P(X ≤ y), consider X_n = n^{-1}. This is a deterministic
sequence and we should hope that it converges in distribution to X = 0. However, setting
y = 0,
P(X_n ≤ 0) = P(n^{-1} ≤ 0) = 0 ≠ 1 = P(X ≤ 0).
Because the convergence in distribution requirement excludes discontinuity points however,
Xn = n−1 does converge in distribution to zero.
Xn converges in probability to X if for all δ > 0,
lim_{n→∞} P(|X_n − X| > δ) = 0.
We typically write "X_n →^p X" or "plim_{n→∞} X_n = X" to denote convergence in probability.
Xn is said to converge almost surely to X if
P(lim_{n→∞} X_n = X) = 1
which is equivalent to the condition that for all δ > 0,
lim_{n→∞} P(sup_{m≥n} |X_m − X| > δ) = 0.
We then write "X_n →^{a.s.} X" or "lim_{n→∞} X_n = X almost surely". Convergence in L^p of X_n
to X for p ≥ 1 holds if
lim_{n→∞} E|X_n − X|^p = 0.
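The following Python sketch (an illustrative simulation with an arbitrary design, X_n = X + W/n for standard normal X and W, not part of the original notes) estimates P(|X_n − X| > δ) by Monte Carlo; the probability shrinks with n, which is exactly convergence in probability.

import numpy as np

rng = np.random.default_rng(0)
R = 100000                       # Monte Carlo replications
X = rng.standard_normal(R)       # the limit random variable
W = rng.standard_normal(R)
delta = 0.1

for n in [1, 10, 100, 1000]:
    Xn = X + W / n               # X_n = X + W/n converges to X in probability (and a.s., and in L^p)
    print(n, np.mean(np.abs(Xn - X) > delta))   # estimate of P(|X_n - X| > delta), shrinks to 0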
1.2.2 Differences between stochastic convergence concepts
There are subtle differences among these convergence concepts; however, convergence in
distribution is in some ways in a different category from the other stochastic convergence
concepts. The important thing to realize about convergence in distribution is that if Xn
converges in distribution to X, and X is a standard normally distributed random variable, then X_n also
converges in distribution to any random variable Y that is also standard normally distributed. For
example, a sequence Xn of i.i.d. N(0, 1) random variables converges in distribution to a
N(0, 1) random variable as n → ∞. This may be counterintuitive, because Xn will never get
“closer” to any particular random variable; but the important thing to realize is that only
the distribution of Xn needs to converge for convergence in distribution.
While convergence in distribution only says something about distributions, convergence
in probability, almost sure convergence and convergence in Lp are statements implying that
Xn and X are asymptotically “close”. Almost sure convergence implies convergence in
probability, as can be easily seen from my definitions above. Convergence in Lp implies
convergence in probability. This follows from the Markov inequality P(|Y| > δ) ≤ δ^{-1} E|Y|,
because
P(|X_n − X| > δ) = P(|X_n − X|^p > δ^p) ≤ δ^{-p} E|X_n − X|^p.
Convergence in probability implies convergence in distribution, and convergence in distribution to a constant implies convergence in probability to that constant; I will leave this
unproven here. No other relations between concepts hold without further assumptions.
1.2.3 An odd example
A classical example of a sequence Xn that converges in probability to zero but not in Lp is
the following. Let Xn be a random variable that takes the value exp(n) with probability 1/n
and zero otherwise. Then for all δ > 0,
lim_{n→∞} P(|X_n| > δ) ≤ lim_{n→∞} P(X_n = exp(n)) = lim_{n→∞} n^{-1} = 0,
so Xn converges in probability to 0. However, for any p ≥ 1,
E|X_n|^p = exp(np) P(X_n = exp(n)) = n^{-1} exp(np),
so Xn does not converge in Lp . Another example where convergence in probability holds but
convergence in Lp does not is the case where Xn = Y /n where Y is Cauchy. In that case,
we have convergence in probability, but not in Lp , because moments of order 1 and higher
of the Cauchy distribution do not exist.
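Both features of this example are easy to check numerically. The Python sketch below (an added illustration; the grid of n values and δ = 0.5 are arbitrary choices) estimates P(|X_n| > δ) and E|X_n| by simulation.

import numpy as np

rng = np.random.default_rng(0)
R = 200000
delta = 0.5

for n in [5, 10, 20, 40]:
    # X_n = exp(n) with probability 1/n, and 0 otherwise
    Xn = np.where(rng.random(R) < 1.0 / n, np.exp(n), 0.0)
    print(n, np.mean(np.abs(Xn) > delta), Xn.mean())
    # P(|X_n| > delta) = 1/n shrinks, while E|X_n| = exp(n)/n explodes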
If in addition X_n is independent across n, we also have an example of a sequence that
converges in probability, but not almost surely. This is because while X_n →^p 0, sup_{m≥n} X_m
does not converge to zero in probability. After all, sup_{m≥n} X_m ≥ exp(n) if one of the X_m with
m ≥ n is nonzero; that is, sup_{m≥n} X_m ≤ 1 only if X_m = 0 for all m ≥ n. Therefore, for all
0 < δ < 1,
P(sup_{m≥n} |X_m| ≤ δ) ≤ P(∀ m ≥ n: X_m = 0) = Π_{m=n}^{∞} P(X_m = 0) = Π_{m=n}^{∞} (1 − 1/m)
= exp(Σ_{m=n}^{∞} log(1 − 1/m)) = 0.
To show the last equality, note that because log(1 + x) ≤ x for x > −1,
lim_{K→∞} exp(Σ_{m=n}^{K} log(1 − 1/m)) ≤ lim_{K→∞} exp(−Σ_{m=n}^{K} 1/m) = 0
because m−1 is not summable from 1 to infinity.
This implies that Xn converges in probability to zero, but not almost surely.
1.2.4 Averages
Less exotic statistics that we will encounter in econometrics are averages of i.i.d. or weakly
dependent random variables. Consider the following simple example. Let
X_n = n^{-1/2} Σ_{i=1}^n (z_i − Ez_i),
where the z_i are i.i.d. and Ez_i^2 < ∞. Then X_n converges in distribution to a normally
distributed random variable. For each deterministic sequence a_n such that lim_{n→∞} a_n = 0 we
have a_n X_n →^p 0. However, it turns out that a_n X_n →^{a.s.} 0 holds only if a_n (log log n)^{1/2} → 0;
this follows from the so-called law of the iterated logarithm.
1.3 The law of large numbers

1.3.1 Statement of the law of large numbers
Consistency of estimators is derived using a law of large numbers (LLN). The LLN asserts
that
n^{-1} Σ_{i=1}^n (z_i − Ez_i)
converges to zero. It can be shown that if E|zi | < ∞ and zi is i.i.d. (independent and
identically distributed), then the above expression converges to zero almost surely (and
therefore also in probability; the almost sure version is Kolmogorov’s strong law of large
numbers). If E|zi | is not finite, the LLN does not need to hold; a counterexample would be
the Cauchy distribution, for which z̄n is distributed identically to zi . If convergence to zero
holds almost surely, we talk about the strong law of large numbers, while if the convergence to
zero is in probability, we refer to the result as a weak law of large numbers.
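A small simulation (a Python sketch, not part of the original notes) makes the contrast concrete: sample means of i.i.d. draws with a finite mean settle down at the population mean, while sample means of Cauchy draws keep jumping around.

import numpy as np

rng = np.random.default_rng(0)

for n in [100, 10000, 1000000]:
    z_exp = rng.exponential(scale=1.0, size=n)   # E|z_i| < infinity, mean equal to 1
    z_cau = rng.standard_cauchy(size=n)          # E|z_i| does not exist
    print(n, z_exp.mean(), z_cau.mean())
    # the exponential means approach 1; the Cauchy means remain erratic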
1.3.2 Weak LLN under independence
If zi has mean zero, is i.i.d. and Ezi2 < ∞, then
V(n^{-1} Σ_{i=1}^n z_i) = n^{-2} Σ_{i=1}^n V(z_i) = n^{-1} V(z_i) → 0,
so a weak law of large numbers will hold for zi in this case. The condition Ezi2 < ∞ can
be weakened to E|zi | < ∞ by a truncation argument. This can be done as follows. Define
z_i^K = z_i I(|z_i| > K) and z_{iK} = z_i I(|z_i| ≤ K). Then
E|n^{-1} Σ_{i=1}^n z_i| ≤ E|n^{-1} Σ_{i=1}^n (z_{iK} − Ez_{iK})| + E|n^{-1} Σ_{i=1}^n (z_i^K − Ez_i^K)|
≤ (E|n^{-1} Σ_{i=1}^n (z_{iK} − Ez_{iK})|^2)^{1/2} + 2 n^{-1} Σ_{i=1}^n E|z_i^K|
≤ (n^{-1} V(z_{iK}))^{1/2} + 2 E|z_i| I(|z_i| > K)
≤ (n^{-1} K^2)^{1/2} + 2 ∫_{|z|>K} |z| dF(z),
and by taking first the limit as n → ∞ and then the limit as K → ∞, the convergence in
L1 follows.
To prove Kolmogorov’s strong law of large numbers is much more involved; this typically
requires a result known as the three series theorem. I will not pursue that here.
1.3.3 Discussion
There exist many formulations of the LLN that are different from the above. For example,
if zi is uncorrelated, Ezi2 < ∞ and
Var(n^{-1} Σ_{i=1}^n (z_i − Ez_i)) = n^{-2} Σ_{i=1}^n Var(z_i) → 0,
then the weak LLN will also hold, in spite of the fact that the i.i.d. assumption has not
been made. Also, the independence assumption can be relaxed further to allow for random
variables zi that have “fading memory” properties. To illustrate this, note that in general
Var(n^{-1} Σ_{i=1}^n (z_i − Ez_i)) = n^{-2} Σ_{i=1}^n Σ_{s=1}^n cov(z_i, z_s)
≤ 2 n^{-2} Σ_{i=1}^n Σ_{s=i}^n |cov(z_i, z_s)|
≤ 2 n^{-2} Σ_{i=1}^n Σ_{m=0}^{∞} |cov(z_i, z_{i+m})|,
and if the last expression converges to zero somehow, then a weak LLN can be obtained as
well.
Often, alternatives to the classical LLN for i.i.d. variables involve assumptions on the
existence of moments E|z_i|^{1+δ} for some δ > 0.
1.4 The central limit theorem

1.4.1 Statement of the central limit theorem
The central limit theorem (CLT) states that
n^{-1/2} Σ_{i=1}^n (z_i − Ez_i) →^d N(0, Var(z_i)).
If zi is i.i.d. and Ezi2 < ∞, then the above result holds.
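As an illustration (a Python sketch with an arbitrarily chosen exponential distribution for z_i, not part of the original notes), the simulated quantiles of n^{-1/2} Σ_{i=1}^n (z_i − Ez_i) can be compared with those of the normal limit.

import numpy as np

rng = np.random.default_rng(0)
n, R = 500, 20000
# z_i i.i.d. exponential(1): mean 1, variance 1, clearly non-normal
z = rng.exponential(scale=1.0, size=(R, n))
stat = np.sqrt(n) * (z.mean(axis=1) - 1.0)       # n^{-1/2} * sum of (z_i - E z_i)

print(np.quantile(stat, [0.05, 0.5, 0.95]))      # should be close to N(0,1) quantiles
print([-1.645, 0.0, 1.645])                      # N(0,1) reference quantiles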
1.4.2 CLT and characteristic function
The characteristic function of a random variable X is defined as
ψ(r) = E exp(irX),
where i satisfies i2 = −1. There is a one-to-one correspondence between characteristic
functions and distribution functions, and because |ψ(·)| ≤ 1, showing that the characteristic
function converges to the characteristic function of a normal distribution suffices to show
convergence in distribution to a normal. The reasoning for proving a CLT for a mean zero,
i.i.d. sequence zi with a variance σ 2 then starts by observing that
E exp(ir n^{-1/2} Σ_{i=1}^n z_i) = Π_{i=1}^n E exp(ir n^{-1/2} z_i)
by independence. By the Taylor series expansion, exp(x) ≈ 1 + x + 0.5x2 and log(1 + x) ≈ x,
and therefore
E exp(ir n^{-1/2} Σ_{i=1}^n z_i) ≈ Π_{i=1}^n E(1 + ir n^{-1/2} z_i + 0.5 (ir n^{-1/2} z_i)^2)
= Π_{i=1}^n (1 − 0.5 r^2 n^{-1} σ^2) = exp(Σ_{i=1}^n log(1 − 0.5 r^2 n^{-1} σ^2))
≈ exp(−0.5 r^2 σ^2),
which is the characteristic function of the N(0, σ 2 ) distribution. Of course, the analytical
difficulty is to justify the “≈” signs in the above reasoning.
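The approximation can be checked numerically. The Python sketch below (an added illustration; the uniform distribution for z_i is my choice, not the notes') compares the empirical characteristic function of n^{-1/2} Σ_{i=1}^n z_i with exp(−0.5 r^2 σ^2).

import numpy as np

rng = np.random.default_rng(0)
n, R = 400, 20000
z = rng.uniform(-1.0, 1.0, size=(R, n))          # mean 0, variance sigma^2 = 1/3
sigma2 = 1.0 / 3.0
s = z.sum(axis=1) / np.sqrt(n)                   # n^{-1/2} * sum of z_i

for r in [0.5, 1.0, 2.0]:
    ecf = np.mean(np.exp(1j * r * s))            # empirical characteristic function at r
    print(r, ecf.real, np.exp(-0.5 * r**2 * sigma2))   # real parts should nearly agree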
1.4.3 Discussion
For the CLT, lots of alternative specifications are also possible. It is possible to allow for
some heterogeneity of the distributions of zi , and also the requirement of independence can be
relaxed to some type of “fading memory” assumption. “Heterogeneity of the distributions”
of zi refers to the fact that the distributions Fi (x) of zi are not identical for all i.
1.5 Moment conditions
Many articles and books in econometrics contain so-called moment conditions of the type
E|z_i| < ∞,
or
lim sup_{n→∞} n^{-1} Σ_{i=1}^n E|z_i|^2 < ∞,
or assume that for some δ > 0,
E|z_i|^{2+δ} < ∞.
By now, you probably realize that these conditions typically originate from the verification
of conditions necessary for obtaining a law of large numbers or central limit theorem. For
example, n^{-1} Σ_{i=1}^n (z_i − Ez_i) →^{a.s.} 0 if the z_i are i.i.d. and E|z_i| < ∞. So, if somewhere
along the line we need the result n^{-1} Σ_{i=1}^n (z_i − Ez_i) →^{a.s.} 0, we simply impose E|z_i| < ∞
as a regularity condition. The same holds for the central limit theorem. Finally, note that
E|z_i|^p < ∞ is implied by E|z_i|^q < ∞ for some q > p; this is because for 1 ≤ p ≤ q,
(E|X|^p)^{1/p} ≤ (E|X|^q)^{1/q}.
An example of a distribution for which moment conditions can fail to hold is the Student
distribution. If z_i has a Student distribution with k degrees of freedom, then E|z_i|^k does not
exist; remember that a t distribution with one degree of freedom is the Cauchy distribution.
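This failure is visible in simulation. The Python sketch below (an added illustration; the choice k = 3 is mine, not the notes') shows that for t-distributed draws with k degrees of freedom the sample second moment stabilizes while the sample k-th absolute moment keeps drifting.

import numpy as np

rng = np.random.default_rng(0)
k = 3                                            # degrees of freedom; E|z_i|^k does not exist

for n in [10**3, 10**5, 10**7]:
    z = rng.standard_t(df=k, size=n)
    print(n, np.mean(np.abs(z) ** 2), np.mean(np.abs(z) ** k))
    # the second moment (which exists for k = 3) settles near k/(k-2) = 3,
    # while the sample k-th absolute moment keeps drifting upward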
1.6 The dominated convergence theorem
A tool that is frequently used in asymptotic theory in econometrics is the dominated convergence theorem.
Theorem 1 If X_n →^p X and |X_n| ≤ Y a.s. for some random variable Y with EY < ∞,
then
lim_{n→∞} EX_n = EX.
This result is widely used in econometrics; notably, it is used for asymptotic normality of
minimization estimators when the mean value in the Taylor series expansion argument is
replaced by its limit.
1.7 A joint convergence result
The following result plays a key role in asymptotic theory for justifying variance estimation:
Theorem 2 If X_n →^d X and Y_n →^p c for some constant c and f(x, y) is continuous in
both arguments, then f(X_n, Y_n) →^d f(X, c).
Using the above theorem, we can reason as follows. If n^{-1} Σ_{i=1}^n z_i^2 →^p σ^2 and
n^{-1/2} Σ_{i=1}^n z_i →^d N(0, σ^2), then
[n^{-1/2} Σ_{i=1}^n z_i] / [(n^{-1} Σ_{i=1}^n z_i^2)^{1/2}] →^d N(0, 1).
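A Monte Carlo sketch (Python; the Laplace distribution for z_i is an arbitrary choice, not from the notes) of this self-normalized statistic:

import numpy as np

rng = np.random.default_rng(0)
n, R = 500, 20000
z = rng.laplace(loc=0.0, scale=1.0, size=(R, n))   # mean 0, variance sigma^2 = 2

num = z.sum(axis=1) / np.sqrt(n)                   # n^{-1/2} sum z_i  -> N(0, sigma^2)
den = np.sqrt((z ** 2).mean(axis=1))               # (n^{-1} sum z_i^2)^{1/2} -> sigma
t = num / den

print(t.std(), np.quantile(t, [0.05, 0.95]))       # close to 1 and to (-1.645, 1.645)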
1.7.1 A beginner's mistake
However, it is not allowed to reason
[n^{-1/2} Σ_{i=1}^n z_i] / [(n^{-1} Σ_{i=1}^n z_i^2)^{1/2}] = n^{1/2} [n^{-1} Σ_{i=1}^n z_i] / [(n^{-1} Σ_{i=1}^n z_i^2)^{1/2}] → n^{1/2} (0/σ) = 0
as many beginners tend to do. You cannot find a continuous function f(·, ·) that will allow
you to do this trick; you would need f(x, y) = n^{1/2} x/y, but dependence of f on n is not
allowed. The correct reasoning displayed previously used f(x, y) = x/y, and clearly this
is allowed.
1.8 The multivariate case
Often when convergence in distribution of a random vector is shown, a result known as the
Cramér-Wold device is used. The Cramér-Wold device is a tool that can be used to reduce
the problem of showing convergence in distribution of a vector to convergence in distribution
of scalars. It asserts that for a vector X_n of dimension k, X_n →^d X if and only if for all
k-vectors ξ, ξ′X_n →^d ξ′X. This implies that using the Cramér-Wold device, the problem
of showing asymptotic normality of a vector can be translated into a problem involving an
application of the CLT for scalars.
1.9 Why asymptotic normality?
The reason econometricians are interested in establishing asymptotic normality of estimators
is that this property allows us to justify inference procedures such as t-tests and F -tests.
Of course, t-tests that are justified based on asymptotic theory are really asymptotically
standard normal, and F-tests that are justified based on asymptotic theory are really
asymptotically chi-squared. Asymptotic normality of a k-vector X_n, i.e. X_n →^d N(0, V),
implies that if V̂ →^p V,
X_{nj} / (V̂_{jj})^{1/2} →^d N(0, 1),
which will imply validity of t-values. A similar argument can be used to show that F -tests
are asymptotically valid if we have asymptotic normality of our estimator.
1.10 Exercises
1. Assume that xn is a nondecreasing, bounded deterministic sequence of scalars. Show
that limn→∞ xn exists.
2. Assume that X_n is an increasing sequence of random variables such that X_n →^p X.
Show that in this case, X_n →^{a.s.} X also.
3. Assume that X_n and Y_n are sequences of random variables such that X_n →^p X and
Y_n →^p Y. Show that Y_n + X_n →^p X + Y.
4. Assume that X_n and Y_n are sequences of random variables such that X_n →^d X and
Y_n →^d Y. Find an example that shows that Y_n + X_n →^d X + Y does not need to
hold.
5. Assume that X_n is a sequence of random variables such that X_n →^d c for a constant
c. Show that X_n →^p c also.
6. In section 1.2.3, what happens to the “odd example” if the n−1 in its definition is
replaced by n−2 ?
7. Let x_i be i.i.d. and N(0, 1) distributed. Using characteristic functions, show that
n^{-1/2} Σ_{i=1}^n x_i is also N(0, 1) distributed.
8. Formally prove the dominated convergence theorem.
2 Asymptotics of the linear model: the scalar case

2.1 The consistency argument
In this section we will investigate the asymptotic behavior of least squares estimators. The
purpose is only to serve as an introduction to the more general case, where we will only
assume that our estimator is some minimization estimator, for which no closed form solution
is available. To understand the principles involved better, we will focus on the case of a
scalar regressor xi in this section.
In the case of the simple linear model
yi = θ0 xi + εi ,
where xi ∈ R, the closed form solution for the least squares estimator is
θ̂_n = argmin_{θ∈Θ} n^{-1} Σ_{i=1}^n (y_i − θx_i)^2 = [Σ_{i=1}^n y_i x_i] / [Σ_{i=1}^n x_i^2].
The "argmin" is defined here as the value at which the function over which the argmin
is taken attains its minimum. For example, argmin_{x∈R} (1 + x^2) = 0.
Note that this solution for θ̂_n was obtained by differentiation with respect to θ. Next, we
note that, if the true model is correct, it is possible to write
θ̂_n = θ_0 + [Σ_{i=1}^n ε_i x_i] / [Σ_{i=1}^n x_i^2].
After this step, we have to make some assumption on the behavior of xi : is it a deterministic
or a stochastic sequence?
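To make these closed-form expressions concrete, here is a Python sketch (the simulation design is invented for illustration and is not part of the notes) that computes θ̂_n from the formula above and checks the decomposition θ̂_n = θ_0 + Σ ε_i x_i / Σ x_i^2.

import numpy as np

rng = np.random.default_rng(0)
n, theta0 = 1000, 2.0
x = rng.normal(size=n)
eps = rng.normal(size=n)
y = theta0 * x + eps

theta_hat = np.sum(y * x) / np.sum(x ** 2)                     # closed-form least squares estimator
decomposition = theta0 + np.sum(eps * x) / np.sum(x ** 2)      # theta_0 plus the sampling-error term

print(theta_hat, decomposition)                                # identical up to floating point error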
2.1.1 Consistency argument: the deterministic regressor case
First, we assume that xi is deterministic. Then note that
[Σ_{i=1}^n ε_i x_i] / [Σ_{i=1}^n x_i^2] = [n^{-1} Σ_{i=1}^n ε_i x_i] / [n^{-1} Σ_{i=1}^n x_i^2],
and therefore θ̂_n →^{a.s.} θ_0 if
1. n^{-1} Σ_{i=1}^n ε_i x_i →^{a.s.} 0;
2. n^{-1} Σ_{i=1}^n x_i^2 → Q, where Q > 0.
2.1.2 Consistency argument: the random regressor case
Under the assumption that x_i is random, sufficient conditions for θ̂_n →^{a.s.} θ_0 are
1. n^{-1} Σ_{i=1}^n ε_i x_i →^{a.s.} 0;
2. n^{-1} Σ_{i=1}^n x_i^2 →^{a.s.} Q, where Q > 0.
Both conditions follow, by Kolmogorov's law of large numbers, from the conditions
1. (x_i, ε_i) is i.i.d.;
2. Ex_i^2 < ∞;
3. Eε_i x_i = 0;
4. E|ε_i x_i| < ∞.
In the case that ε_i and x_i are independent of each other, conditions 1-4 are implied by
1. (x_i, ε_i) is i.i.d.;
2. Ex_i^2 < ∞;
3. Eε_i = 0;
4. E|ε_i| < ∞.
2.2 The asymptotic normality argument
Finding the limit distribution of θ̂n requires some more work than the simple consistency
result. If we assume that the εi are i.i.d. normal with variance σ 2 and xi is deterministic,
then
[Σ_{i=1}^n ε_i x_i] / [Σ_{i=1}^n x_i^2]
is distributed N(0, σ^2 (Σ_{i=1}^n x_i^2)^{-1}) for any sample size. Therefore, if n^{-1} Σ_{i=1}^n x_i^2 → Q, we
get
n^{1/2} (θ̂_n − θ_0) →^d N(0, σ^2 Q^{-1}).
The asymptotic version of this argument simply replaces the exact normality by asymptotic
normality, and uses a CLT as a replacement for the exact normality assumption.
2.2.1 Asymptotic normality argument: the deterministic regressor case
Sufficient conditions for the asymptotic normality result n^{1/2} (θ̂_n − θ_0) →^d N(0, σ^2 Q^{-1}) are
1. n^{-1/2} Σ_{i=1}^n ε_i x_i →^d N(0, σ^2 Q);
2. n^{-1} Σ_{i=1}^n x_i^2 → Q, where Q > 0.
It is possible to use a central limit theorem for independent, but not identically distributed
random variables to establish 1.
2.2.2 Asymptotic normality argument: the random regressor case
At this point, our interest will be in relaxing the assumption that the xi are deterministic.
If we assume
1. (x_i, ε_i) is i.i.d.;
2. Eε_i x_i = 0;
3. Ex_i^2 < ∞ and Eε_i^2 x_i^2 < ∞;
then by the central limit theorem,
n^{-1/2} Σ_{i=1}^n x_i ε_i →^d N(0, Eε_i^2 x_i^2),
and in addition,
n^{-1} Σ_{i=1}^n x_i^2 →^p Q,
and therefore by Theorem 2,
[n^{-1/2} Σ_{i=1}^n x_i ε_i] / [n^{-1} Σ_{i=1}^n x_i^2] →^d N(0, Q^{-2} Eε_i^2 x_i^2).
Therefore,
n^{1/2} (θ̂_n − θ_0) →^d N(0, Q^{-2} P),
where P = Eε_i^2 x_i^2.
If we assume in addition that ε_i and x_i are independent, then
P = Eε_i^2 x_i^2 = Eε_i^2 Ex_i^2 = σ^2 Q,
and then our result becomes
n^{1/2} (θ̂_n − θ_0) →^d N(0, σ^2 Q^{-1}).
The above result also follows from assuming that E(ε_i^2 | x_i) = σ^2 (a homoskedasticity
assumption). However, please note that asymptotic normality is also obtained without the
homoskedasticity assumption.
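A Monte Carlo check (a Python sketch; the heteroskedastic design below is an invented example, not taken from the notes) of the sandwich-form limit N(0, Q^{-2}P) with random regressors:

import numpy as np

rng = np.random.default_rng(0)
n, R, theta0 = 2000, 5000, 1.0
stats = np.empty(R)

for r in range(R):
    x = rng.normal(size=n)
    eps = np.abs(x) * rng.normal(size=n)         # E(eps^2 | x) = x^2: heteroskedastic errors
    y = theta0 * x + eps
    theta_hat = np.sum(y * x) / np.sum(x ** 2)
    stats[r] = np.sqrt(n) * (theta_hat - theta0)

Q = 1.0                                          # Q = E x_i^2 for x_i standard normal
P = 3.0                                          # P = E eps_i^2 x_i^2 = E x_i^4 = 3 in this design
print(stats.var(), P / Q ** 2)                   # simulated variance vs the theoretical Q^{-2} P = 3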
In the sections that follow, we will always assume that the xi sequence is stochastic.