Chapter 5
Almost sure convergence, convergence in probability and asymptotic normality
In the previous chapter we considered estimators of several different parameters. The hope is that as the sample size increases the estimator should get 'closer' to the parameter of interest. When we say closer we mean converge. In the classical sense the sequence {xk} converges to x (xk → x) if |xk − x| → 0 as k → ∞ (or, for every ε > 0, there exists an n such that for all k > n, |xk − x| < ε). Of course the estimators we have considered are random, that is, for every ω ∈ Ω (the set of all outcomes) we have a different estimate. The natural question to ask is what convergence means for random sequences.
5.1 Modes of convergence
We start by defining different modes of convergence.
Definition 5.1.1 (Convergence)

• Almost sure convergence. We say that the sequence {Xt} converges almost surely to µ if there exists a set M ⊂ Ω, such that P(M) = 1 and for every ω ∈ M we have

Xt(ω) → µ.

In other words, for every ε > 0 there exists an N(ω) such that

|Xt(ω) − µ| < ε,   (5.1)

for all t > N(ω). Note that the above definition is very close to classical convergence. We denote almost sure convergence as Xt →a.s. µ.

An equivalent definition, in terms of probabilities, is: for every ε > 0, Xt →a.s. µ if

P(ω : ∩_{m=1}^∞ ∪_{t=m}^∞ {|Xt(ω) − µ| > ε}) = 0.

It is worth considering briefly what ∩_{m=1}^∞ ∪_{t=m}^∞ {|Xt(ω) − µ| > ε} means. If ∩_{m=1}^∞ ∪_{t=m}^∞ {|Xt(ω) − µ| > ε} ≠ ∅, then there exists an ω* ∈ ∩_{m=1}^∞ ∪_{t=m}^∞ {|Xt(ω) − µ| > ε} such that for some infinite sequence {kj} we have |X_{kj}(ω*) − µ| > ε; this means Xt(ω*) does not converge to µ. Now let A = ∩_{m=1}^∞ ∪_{t=m}^∞ {|Xt(ω) − µ| > ε}; if P(A) = 0, then for 'most' ω the sequence {Xt(ω)} converges.
• Convergence in mean square. We say Xt → µ in mean square (or L2 convergence) if E(Xt − µ)² → 0 as t → ∞.

• Convergence in probability. Convergence in probability cannot be stated in terms of realisations Xt(ω) but only in terms of probabilities. Xt is said to converge to µ in probability (written Xt →P µ) if

P(|Xt − µ| > ε) → 0,   t → ∞.

Often we write this as |Xt − µ| = op(1).

If for some γ ≥ 1 we have

E|Xt − µ|^γ → 0,   t → ∞,

then this implies convergence in probability (to see this, use Markov's inequality).
• Rates of convergence:

(i) We say the stochastic process {Xt} satisfies |Xt − µ| = Op(at) if the sequence {at^{−1}|Xt − µ|} is bounded in probability (defined below). We see from the definition of boundedness that, for all t, the distribution of at^{−1}|Xt − µ| should mainly lie within a certain interval. In general at → 0 as t → ∞. Hence if the distribution of at^{−1}|Xt − µ| lies within a certain interval, then |Xt − µ| will lie within an ever-decreasing one.

(ii) We say the stochastic process {Xt} satisfies |Xt − µ| = op(at) if the sequence {at^{−1}|Xt − µ|} converges in probability to zero.
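The Op(·) rate can be seen in a small simulation (a sketch, not part of the notes; it assumes iid standard normal data and uses the sample mean X̄n as the 'estimator' of µ = 0): the unscaled deviation |X̄n| shrinks to zero, while √n|X̄n| keeps a stable spread, i.e. |X̄n − µ| = Op(n^{−1/2}).

```python
import numpy as np

# Simulation sketch (assumed setting: iid N(0,1) data, sample mean as the
# estimator of mu = 0).  The tail probability P(|Xbar_n| > eps) falls with n
# (convergence in probability), while the 95% quantile of sqrt(n)|Xbar_n|
# remains stable (bounded in probability), illustrating O_p(n^{-1/2}).
rng = np.random.default_rng(0)

def abs_deviations(n, trials=2000):
    """Return |Xbar_n| over independent trials."""
    x = rng.standard_normal((trials, n))
    return np.abs(x.mean(axis=1))

for n in [100, 10_000]:
    d = abs_deviations(n)
    print(n, np.mean(d > 0.05), np.quantile(np.sqrt(n) * d, 0.95))
```

The printed tail probability collapses as n grows, whereas the rescaled quantile stays in the same range for both sample sizes.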
Definition 5.1.2 (Boundedness)

(i) Almost surely bounded. If the random variable X is almost surely bounded, then for a positive sequence {ek}, such that ek → ∞ as k → ∞ (typically ek = 2^k is used), we have

P(ω : ∪_{k=1}^∞ {|X(ω)| ≤ ek}) = 1.

Usually to prove the above we show that

P((ω : ∪_{k=1}^∞ {|X| ≤ ek})^c) = 0.

Since (∪_{k=1}^∞ {|X| ≤ ek})^c = ∩_{k=1}^∞ {|X| > ek} ⊂ ∩_{k=1}^∞ ∪_{m=k}^∞ {|X| > em}, to show the above we show

P(ω : ∩_{k=1}^∞ ∪_{m=k}^∞ {|X(ω)| > em}) = 0.   (5.2)

We note that if (ω : ∩_{k=1}^∞ ∪_{m=k}^∞ {|X(ω)| > em}) ≠ ∅, then there exists an ω* ∈ Ω and an infinite subsequence {kj} where |X(ω*)| > e_{kj}; hence X(ω*) is not bounded (since ek → ∞).

To prove (5.2) we usually use the Borel–Cantelli lemma. This states that if Σ_{k=1}^∞ P(Ak) < ∞, the events {Ak} occur only finitely often with probability one. Applying this to our case, if we can show that Σ_{m=1}^∞ P(ω : |X(ω)| > em) < ∞, then {|X(ω)| > em} happens only finitely often with probability one. Hence Σ_{m=1}^∞ P(ω : |X(ω)| > em) < ∞ implies P(ω : ∩_{k=1}^∞ ∪_{m=k}^∞ {|X(ω)| > em}) = 0, and X is a bounded random variable.

It is worth noting that often we choose the sequence ek = 2^k; in this case Σ_{m=1}^∞ P(ω : |X(ω)| > em) = Σ_{m=1}^∞ P(ω : log|X(ω)| > m log 2) ≤ C E(log|X|). Hence if we can show that E(log|X|) < ∞, then X is bounded almost surely.

(ii) Bounded in probability. A sequence is bounded in probability, written Xt = Op(1), if for every ε > 0 there exists a δ(ε) < ∞ such that P(|Xt| ≥ δ(ε)) < ε. Roughly speaking, this means the sequence is extremely large only with a very small probability, and as the 'largeness' grows the probability declines.
5.2 Ergodicity
To motivate the notion of ergodicity we recall the strong law of large numbers (SLLN). Suppose {Xt}t is an iid random sequence and E(|X0|) < ∞; then by the SLLN we have

(1/n) Σ_{t=1}^n Xt →a.s. E(X0);

for the proof see, for example, Grimmett and Stirzaker (1994). It would be useful to generalise this result and find weaker conditions on {Xt} for the result to still hold true. A simple application is when we want to estimate the mean µ and use (1/n) Σ_{t=1}^n Xt as an estimator of the mean.
It can be shown that if {Xt} is an ergodic process then the above result holds. That is, if {Xt} is an ergodic process then for any function h such that E(h(X0)) < ∞ we have

(1/n) Σ_{t=1}^n h(Xt) →a.s. E(h(X0)).
Note that the result does not say anything about the rate of convergence. Ergodicity is normally defined in terms of measure-preserving transformations. We do not formally define ergodicity here, but needless to say, all ergodic processes are stationary. For the definition of ergodicity and a full treatment see, for example, Billingsley (1995).
However, below we state a result which characterises a general class of ergodic processes.

Theorem 5.2.1 Suppose {Zt} is an ergodic sequence (for example, iid random variables) and g : R^∞ → R is a measurable function (it is really hard to think up non-measurable functions). Then the sequence {Yt}t, where

Yt = g(Zt, Zt−1, . . .),

is an ergodic process.
PROOF. See Stout (1974), Theorem 3.5.8.
Example 5.2.1 (i) The process {Zt}t, where {Zt} are iid random variables, is probably the simplest example of an ergodic sequence.

(ii) A simple example of a time series {Xt} which is not independent but is ergodic is the AR(1) process. We recall that the AR(1) process satisfies the representation

Xt = φXt−1 + ǫt,   (5.3)

where {ǫt}t are iid random variables with E(ǫt) = 0, E(ǫt²) = 1 and |φ| < 1. It has the unique causal solution

Xt = Σ_{j=0}^∞ φ^j ǫt−j.

The solution motivates us to define the function

g(x0, x1, . . .) = Σ_{j=0}^∞ φ^j xj.

Since |φ| < 1, the coefficients {φ^j} are absolutely summable, so g(·) is sufficiently well behaved (in particular, measurable). This implies, by using Theorem 5.2.1, that {Xt} is an ergodic process. We note that if E(ǫ²) < ∞, then E(X²) < ∞.
(iii) The ARCH(p) process {Xt}, defined by Xt = Zt σt where σt² = a0 + Σ_{j=1}^p aj X²_{t−j} with Σ_{j=1}^p aj < 1, is an ergodic stochastic process (we look at this model in a later chapter).
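A quick sanity check of the ergodicity claim for ARCH (a simulation sketch with assumed parameters a0 = 0.5, a1 = 0.3 and iid N(0,1) innovations, not part of the notes): the ergodic average of Xt² should settle near E(X0²) = a0/(1 − a1).

```python
import numpy as np

# Simulation sketch (assumed setting): ARCH(1) process X_t = Z_t*sigma_t,
# sigma_t^2 = a0 + a1*X_{t-1}^2, with iid N(0,1) Z_t.  By ergodicity the
# sample mean of X_t^2 should approach E(X_0^2) = a0/(1 - a1).
rng = np.random.default_rng(1)

def simulate_arch1(n, a0=0.5, a1=0.3, burn=500):
    z = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    for t in range(1, n + burn):
        x[t] = z[t] * np.sqrt(a0 + a1 * x[t - 1] ** 2)
    return x[burn:]

x = simulate_arch1(200_000)
print(np.mean(x**2), 0.5 / (1 - 0.3))   # the two values should be close
```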
Example 5.2.2 (Application) If {Xt} is an AR(1) process with |φ| < 1 and E(ǫt²) < ∞, then by using the ergodic theorem we have

(1/n) Σ_{t=1}^n Xt Xt+k →a.s. E(X0 Xk).
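Example 5.2.2 can be checked by simulation (a sketch with assumed values φ = 0.6 and iid N(0,1) innovations, not part of the notes): the ergodic averages should settle at E(X0 Xk) = φ^k/(1 − φ²).

```python
import numpy as np

# Simulation sketch (assumed setting): AR(1) with phi = 0.6 and iid N(0,1)
# innovations.  The ergodic average (1/n) sum_t X_t X_{t+k} should be close
# to E(X_0 X_k) = phi^k / (1 - phi^2).
rng = np.random.default_rng(2)
phi = 0.6

def simulate_ar1(n, burn=500):
    e = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    for t in range(1, n + burn):
        x[t] = phi * x[t - 1] + e[t]
    return x[burn:]

def ergodic_average(x, k):
    """(1/(n-k)) sum_t X_t X_{t+k}."""
    return np.mean(x * x) if k == 0 else np.mean(x[:-k] * x[k:])

x = simulate_ar1(100_000)
for k in [0, 1, 2]:
    print(k, ergodic_average(x, k), phi**k / (1 - phi**2))
```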
5.3 Sampling properties
Often we will estimate the parameters by maximising (or minimising) a criterion. Suppose we have a criterion Ln(a) (e.g. likelihood, quasi-likelihood, Kullback–Leibler etc.); as an estimator of a0 we use ân, where

ân = arg max_{a∈Θ} Ln(a)

and Θ is the parameter space over which we do the maximisation (minimisation). Typically the true parameter a0 should maximise (minimise) the 'limiting' criterion L.

If this is to be a good estimator, then as the sample size grows the estimator should converge (in some sense) to the parameter we are interested in estimating. As we discussed above, there are various modes in which we can measure this convergence: (i) almost surely, (ii) in probability and (iii) in mean squared error. Usually we show either (i) or (ii) (noting that (i) implies (ii)); in time series it is usually quite difficult to show (iii).
Definition 5.3.1 (i) An estimator ân is said to be an almost surely consistent estimator of a0 if there exists a set M ⊂ Ω, where P(M) = 1, and for all ω ∈ M we have

ân(ω) → a0.

(ii) An estimator ân is said to converge in probability to a0 if for every δ > 0

P(|ân − a0| > δ) → 0,   n → ∞.

To prove either (i) or (ii) usually involves verifying two main things: pointwise convergence and equicontinuity.
5.4 Showing almost sure convergence of an estimator
We now consider the general case where Ln(a) is a 'criterion' which we maximise. Let us suppose we can write Ln as

Ln(a) = (1/n) Σ_{t=1}^n ℓt(a),   (5.4)

where for each a ∈ Θ, {ℓt(a)}t is an ergodic sequence. Let

L(a) = E(ℓt(a));   (5.5)

we assume that L(a) is continuous and has a unique maximum in Θ. We define the estimator ân, where ân = arg max_{a∈Θ} Ln(a).
Definition 5.4.1 (Uniform convergence) Ln(a) is said to converge uniformly to L(a) almost surely if

sup_{a∈Θ} |Ln(a) − L(a)| →a.s. 0.

In other words, there exists a set M ⊂ Ω where P(M) = 1 and for every ω ∈ M,

sup_{a∈Θ} |Ln(ω, a) − L(a)| → 0.
Theorem 5.4.1 (Consistency) Suppose that ân = arg max_{a∈Θ} Ln(a) and a0 = arg max_{a∈Θ} L(a) is the unique maximum. If sup_{a∈Θ} |Ln(a) − L(a)| →a.s. 0 as n → ∞, then ân →a.s. a0 as n → ∞.

PROOF. We note that by definition we have Ln(a0) ≤ Ln(ân) and L(ân) ≤ L(a0). Using these inequalities we have

Ln(a0) − L(a0) ≤ Ln(ân) − L(a0) ≤ Ln(ân) − L(ân).

Therefore from the above we have

|Ln(ân) − L(a0)| ≤ max{|Ln(a0) − L(a0)|, |Ln(ân) − L(ân)|} ≤ sup_{a∈Θ} |Ln(a) − L(a)|.
Hence, since we have uniform convergence, we have |Ln(ân) − L(a0)| →a.s. 0 as n → ∞. Now since L(a) has a unique maximum, |Ln(ân) − L(a0)| →a.s. 0 implies ân →a.s. a0.

We note that directly establishing uniform convergence is not easy. Usually it is done by assuming the parameter space is compact and showing pointwise convergence and stochastic equicontinuity; these three facts imply uniform convergence. Below we define stochastic equicontinuity and show consistency under these conditions.
Definition 5.4.2 The sequence of stochastic functions {fn(a)}n is said to be stochastically equicontinuous if there exists a set M ⊂ Ω where P(M) = 1 such that for every ω ∈ M and ε > 0, there exists a δ > 0 such that

sup_{|a1−a2|≤δ} |fn(ω, a1) − fn(ω, a2)| ≤ ε,

for all n > N(ω).

A sufficient condition for stochastic equicontinuity of fn(a) (which is usually used to prove equicontinuity) is that fn(a) is almost surely Lipschitz continuous. In other words, there exists a random variable K which is almost surely bounded, such that for all ω ∈ M (with P(M) = 1) and all a1, a2 ∈ Θ we have

|fn(ω, a1) − fn(ω, a2)| ≤ K(ω)‖a1 − a2‖.
In the following theorem we state sufficient conditions for almost sure uniform convergence. It is worth noting that this is the Arzelà–Ascoli theorem for random variables.

Theorem 5.4.2 (The stochastic Ascoli lemma) Suppose the parameter space Θ is compact, for every a ∈ Θ we have Ln(a) →a.s. L(a), and Ln(a) is stochastically equicontinuous. Then sup_{a∈Θ} |Ln(a) − L(a)| →a.s. 0 as n → ∞.
We use this theorem to prove the corollary below.

Corollary 5.4.1 Suppose that ân = arg max_{a∈Θ} Ln(a) and a0 = arg max_{a∈Θ} L(a); moreover, L(a) has a unique maximum. If

(i) we have pointwise convergence, that is, for every a ∈ Θ we have Ln(a) →a.s. L(a);

(ii) the parameter space Θ is compact;

(iii) Ln(a) is stochastically equicontinuous;

then ân →a.s. a0 as n → ∞.
We prove Theorem 5.4.2 in the section below, but it can be omitted on first reading.
5.4.1 Proof of Theorem 5.4.2 (the stochastic Ascoli theorem)
We now show that stochastic equicontinuity and almost sure pointwise convergence imply uniform convergence. We note that on its own, pointwise convergence is a much weaker condition than uniform convergence, since for pointwise convergence the rate of convergence can be different for each parameter.

Before we continue, a few technical points. We recall that we are assuming almost sure pointwise convergence. This means for each parameter a ∈ Θ there exists a set Na ⊂ Ω (with P(Na) = 1) such that for all ω ∈ Na, Ln(ω, a) → L(a). In the following lemma we unify this set. That is, we show (using stochastic equicontinuity) that there exists a single set N ⊂ Ω (with P(N) = 1) such that for all ω ∈ N and all a ∈ Θ, Ln(ω, a) → L(a).
Lemma 5.4.1 Suppose the sequence {Ln(a)}n is stochastically equicontinuous and also pointwise convergent (that is, Ln(a) converges almost surely to L(a)). Then there exists a set M̄ ⊂ Ω where P(M̄) = 1 such that for every ω ∈ M̄ and a ∈ Θ we have

|Ln(ω, a) − L(a)| → 0.

PROOF. Enumerate the rationals in the set Θ and call this sequence {ai}i. Then for every ai there exists a set Mai where P(Mai) = 1, such that for every ω ∈ Mai we have |Ln(ω, ai) − L(ai)| → 0. Define M = ∩i Mai; since the number of sets is countable, P(M) = 1, and for every ω ∈ M and every ai we have Ln(ω, ai) → L(ai).

Since we have stochastic equicontinuity, there exists a set M̃ (with P(M̃) = 1) such that for every ω ∈ M̃, {Ln(ω, ·)} is equicontinuous. Let M̄ = M̃ ∩ M; we will show that for all a ∈ Θ and ω ∈ M̄ we have Ln(ω, a) → L(a). By stochastic equicontinuity, for every ω ∈ M̄ and ε/3 > 0, there exists a δ > 0 such that

sup_{|b1−b2|≤δ} |Ln(ω, b1) − Ln(ω, b2)| ≤ ε/3,   (5.6)

for all n > N(ω). Furthermore, by definition of M̄, for every rational ai ∈ Θ and ω ∈ M̄ we have

|Ln(ω, ai) − L(ai)| ≤ ε/3,   (5.7)

for all n > N′(ω). Now for any given a ∈ Θ there exists a rational ai such that ‖a − ai‖ ≤ δ (choosing δ small enough that, by the continuity of L, also |L(a) − L(ai)| ≤ ε/3). Using this, (5.6) and (5.7) we have

|Ln(ω, a) − L(a)| ≤ |Ln(ω, a) − Ln(ω, ai)| + |Ln(ω, ai) − L(ai)| + |L(ai) − L(a)| ≤ ε,

for n > max(N(ω), N′(ω)). To summarise, for every ω ∈ M̄ and a ∈ Θ we have |Ln(ω, a) − L(a)| → 0. Hence we have pointwise convergence for every realisation in M̄.
We now show that equicontinuity implies uniform convergence.

Proof of Theorem 5.4.2. Using Lemma 5.4.1, there exists a set M̄ ⊂ Ω with P(M̄) = 1 on which Ln is equicontinuous and also pointwise convergent. We now show uniform convergence on this set. Choose ε/3 > 0 and let δ be such that for every ω ∈ M̄ we have

sup_{|a1−a2|≤δ} |Ln(ω, a1) − Ln(ω, a2)| ≤ ε/3,   (5.8)

for all n > n(ω). Since Θ is compact it can be covered by a finite number of open sets. Construct the sets {Oi}_{i=1}^p such that Θ ⊂ ∪_{i=1}^p Oi and sup_i sup_{x,y∈Oi} ‖x − y‖ ≤ δ. Let {ai}_{i=1}^p be such that ai ∈ Oi. We note that for every ω ∈ M̄ we have Ln(ω, ai) → L(ai); hence for every ε/3 there exists an ni(ω) such that for all n > ni(ω) we have |Ln(ω, ai) − L(ai)| ≤ ε/3. Therefore, since p is finite (due to compactness), there exists an ñ(ω) = max_{1≤i≤p} ni(ω) such that

max_{1≤i≤p} |Ln(ω, ai) − L(ai)| ≤ ε/3,

for all n > ñ(ω). For any a ∈ Θ, choose the open set Oi such that a ∈ Oi. Using (5.8) we have

|Ln(ω, a) − Ln(ω, ai)| ≤ ε/3,

for all n > n(ω). Altogether this gives (the last term being at most ε/3 by the continuity of L, for δ small enough)

|Ln(ω, a) − L(a)| ≤ |Ln(ω, a) − Ln(ω, ai)| + |Ln(ω, ai) − L(ai)| + |L(ai) − L(a)| ≤ ε,

for all n ≥ max(n(ω), ñ(ω)). We observe that max(n(ω), ñ(ω)) and ε/3 do not depend on a; therefore for all n ≥ max(n(ω), ñ(ω)) we have sup_a |Ln(ω, a) − L(a)| ≤ ε. This gives, for every ω ∈ M̄ (P(M̄) = 1), sup_a |Ln(ω, a) − L(a)| → 0; thus we have almost sure uniform convergence.
5.5 Almost sure convergence of the least squares estimator for an AR(p) process
In Chapter ?? we will consider the sampling properties of many of the estimators defined in Chapter ??. However, to illustrate the consistency result above, we apply it to the least squares estimator of the autoregressive parameters.

To simplify notation we only consider estimators for AR(1) models. Suppose that Xt satisfies Xt = φXt−1 + εt. To estimate φ we use the least squares estimator defined below. Let

Ln(a) = (1/(n−1)) Σ_{t=2}^n (Xt − aXt−1)²;   (5.9)

we use φ̂n as an estimator of φ, where

φ̂n = arg min_{a∈Θ} Ln(a),   (5.10)

where Θ = [−1, 1]. (Theorem 5.4.1 was stated for maximisation; it applies here after replacing Ln by −Ln.)
How can we show that this is consistent?

• In the case of least squares for AR processes, φ̂n has the explicit form

φ̂n = [(1/(n−1)) Σ_{t=2}^n Xt Xt−1] / [(1/(n−1)) Σ_{t=2}^n X²_{t−1}].

Now by just applying the ergodic theorem to the numerator and denominator we get φ̂n →a.s. φ. It is worth noting that the ratio above need not satisfy |φ̂n| ≤ 1, so the explicit estimator can fall outside Θ = [−1, 1].
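The explicit ratio form can be computed directly on simulated data (a sketch with assumed φ = 0.6 and iid N(0,1) innovations, not part of the notes): as n grows the ratio of ergodic averages settles down at the true parameter.

```python
import numpy as np

# Simulation sketch (assumed setting): AR(1) with phi = 0.6, iid N(0,1)
# innovations.  The explicit least squares estimator is the ratio
# sum X_t X_{t-1} / sum X_{t-1}^2, and it converges to phi as n grows.
rng = np.random.default_rng(4)
phi = 0.6

def ls_estimate(n, burn=500):
    e = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    for t in range(1, n + burn):
        x[t] = phi * x[t - 1] + e[t]
    x = x[burn:]
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

for n in [100, 1000, 100_000]:
    print(n, ls_estimate(n))   # estimates approach 0.6
```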
• However, we will tackle the problem in a rather artificial way and assume that the estimator does not have an explicit form, and instead that φ̂n is obtained by minimising Ln(a) using a numerical routine. In general this is the most common way of minimising a likelihood-type criterion (usually explicit solutions do not exist).

• In order to derive the sampling properties of φ̂n we need to study the criterion Ln(a) directly. We will do this now in the least squares case.

We will first show almost sure convergence, which will involve repeated use of the ergodic theorem. We will then demonstrate how to show convergence in probability. We look at almost sure convergence first as it is easier to follow. Note that almost sure convergence implies convergence in probability (but the converse is not necessarily true).
The first thing to do is let

ℓt(a) = (Xt − aXt−1)².

Since {Xt} is an ergodic process (recall Example 5.2.1(ii)), by using Theorem 5.2.1 we have, for each a, that {ℓt(a)}t is an ergodic process. Therefore by using the ergodic theorem we have

Ln(a) = (1/(n−1)) Σ_{t=2}^n ℓt(a) →a.s. E(ℓ0(a)).

In other words, for every a ∈ [−1, 1] we have Ln(a) →a.s. E(ℓ0(a)) (almost sure pointwise convergence).

Since the parameter space [−1, 1] is compact and φ is the unique minimum of L(·) = E(ℓ0(·)) in the parameter space, all that remains is to show stochastic equicontinuity; from this we deduce almost sure uniform convergence.
To show stochastic equicontinuity we expand Ln(a) and use the mean value theorem to obtain

Ln(a1) − Ln(a2) = ∇Ln(ā)(a1 − a2),   (5.11)

where ā ∈ [min(a1, a2), max(a1, a2)] and

∇Ln(ā) = (−2/(n−1)) Σ_{t=2}^n Xt−1(Xt − āXt−1).

Because ā ∈ [−1, 1] we have

|∇Ln(ā)| ≤ Dn,   where Dn = (2/(n−1)) Σ_{t=2}^n (|Xt−1 Xt| + X²_{t−1}).

Since {Xt}t is an ergodic process, {|Xt−1 Xt| + X²_{t−1}}t is an ergodic process. Therefore, if var(ǫ0) < ∞, by using the ergodic theorem we have

Dn →a.s. 2E(|X0 X1| + X0²).

Let D := 2E(|X0 X1| + X0²). Then for any δ* > 0 there exists a set M ⊂ Ω, where P(M) = 1, such that for every ω ∈ M we have

|Dn(ω) − D| ≤ δ*,

for all n > N(ω). Substituting the above into (5.11) we have

|Ln(ω, a1) − Ln(ω, a2)| ≤ Dn(ω)|a1 − a2| ≤ (D + δ*)|a1 − a2|,

for all n ≥ N(ω). Therefore for every ε > 0 there exists a δ := ε/(D + δ*) such that

sup_{|a1−a2|≤δ} |Ln(ω, a1) − Ln(ω, a2)| ≤ ε,

for all n ≥ N(ω). Since this is true for all ω ∈ M, we see that {Ln(a)} is stochastically equicontinuous.
Theorem 5.5.1 Let φ̂n be defined as in (5.10). Then we have φ̂n →a.s. φ.

PROOF. Since {Ln(a)} is almost surely equicontinuous, the parameter space [−1, 1] is compact and we have pointwise convergence Ln(a) →a.s. L(a), by using Theorem 5.4.1 (applied to −Ln) we have φ̂n →a.s. a*, where a* = arg min_{a∈Θ} L(a). Finally we need to show that a* = φ. Since

L(a) = E(ℓ0(a)) = E(X1 − aX0)²,

we see by differentiating L(a) with respect to a that it is minimised at a = E(X0 X1)/E(X0²); hence a* = E(X0 X1)/E(X0²). To show that this is φ, we note that by the Yule–Walker equations

Xt = φXt−1 + ǫt   ⇒   E(Xt Xt−1) = φE(X²_{t−1}) + E(ǫt Xt−1),   where E(ǫt Xt−1) = 0.

Therefore φ = E(X0 X1)/E(X0²), hence φ̂n →a.s. φ.

We note that by using very similar methods we can show strong consistency of the least squares estimator of the parameters in an AR(p) model.
5.6 Convergence in probability of an estimator
We described above almost sure (strong) consistency, where we showed ân →a.s. a0. Sometimes it is not possible to show strong consistency (when ergodicity etc. cannot be verified). Often, as an alternative, weak consistency, where ân →P a0 (convergence in probability), is shown. This requires a weaker set of conditions, which we now describe:

(i) The parameter space Θ should be compact.

(ii) Pointwise convergence: for every a ∈ Θ, Ln(a) →P L(a).

(iii) Equicontinuity in probability: for every ǫ > 0 there exists a δ such that

lim_{n→∞} P(sup_{|a1−a2|≤δ} |Ln(a1) − Ln(a2)| > ǫ) = 0.

If the above conditions are satisfied, we have ân →P a0.
Verifying conditions (ii) and (iii) may look a little daunting, but with the use of Chebyshev's (or Markov's) inequality it can be quite straightforward. For example, suppose we can show that for every a ∈ Θ

E(Ln(a) − L(a))² → 0,   n → ∞.

Then by applying Chebyshev's inequality we have, for every ε > 0,

P(|Ln(a) − L(a)| > ε) ≤ E(Ln(a) − L(a))²/ε² → 0,   n → ∞.

Thus for every a ∈ Θ we have Ln(a) →P L(a).

To show (iii) we often use the mean value theorem on Ln(a). Using the mean value theorem we have

|Ln(a1) − Ln(a2)| ≤ sup_a ‖∇a Ln(a)‖₂ ‖a1 − a2‖.

Now if it can be shown that sup_a ‖∇a Ln(a)‖₂ is bounded in probability (usually we show that E(sup_a ‖∇a Ln(a)‖₂) < ∞), then we have (iii).
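The Chebyshev step can be checked numerically (an iid illustration with assumed Exp(1) terms, not the notes' time series setting): when Ln(a) is an average of n iid terms, E(Ln(a) − L(a))² = var/n, and the estimated tail probability sits below the Chebyshev bound var/(nε²), with both shrinking as n grows.

```python
import numpy as np

# Numerical sketch (assumed setting): L_n is the mean of n iid Exp(1) terms,
# so L = 1 and E(L_n - L)^2 = 1/n.  We compare the Monte Carlo tail
# probability P(|L_n - L| > eps) with the Chebyshev bound (1/n)/eps^2.
rng = np.random.default_rng(5)

def tail_prob_and_bound(n, eps=0.1, trials=5000):
    ln = rng.exponential(1.0, size=(trials, n)).mean(axis=1)
    tail = np.mean(np.abs(ln - 1.0) > eps)
    bound = (1.0 / n) / eps**2
    return tail, bound

for n in [50, 500, 5000]:
    tail, bound = tail_prob_and_bound(n)
    print(n, tail, bound)   # tail <= bound (up to Monte Carlo error)
```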
5.7 Asymptotic normality of an estimator
The first central limit theorem concerned the asymptotic distribution of sums of binary random variables (these have a binomial distribution and can be approximated by a normal distribution). This result was later generalised to sums of iid random variables. From the mid to the late 20th century, several advances were made in extending the result to dependent random variables. These include generalisations to random variables with m-dependence, mixing properties, cumulant conditions, near-epoch dependence etc. In this section we will concentrate on a central limit theorem for martingales. Our reason for choosing this flavour of CLT is that it can be applied in various estimation settings, as it can often be shown that the derivative of a criterion at the true parameter is a martingale.
Let us suppose that

ân = arg max_{a∈Θ} Ln(a),

where

Ln(a) = (1/n) Σ_{t=1}^n ℓt(a),

and for each a ∈ Θ, {ℓt(a)}t are identically distributed random variables.
In this section we shall show asymptotic normality of √n(ân − a0). The reason for normalising by √n is that (ân − a0) →a.s. 0 as n → ∞; hence in terms of distributions it converges towards the point mass at zero. Therefore we need to magnify the difference ân − a0. We can show that (ân − a0) = Op(n^{−1/2}); therefore √n(ân − a0) = Op(1).

We use ∇Ln(a) to denote the partial derivative of Ln(a) with respect to a, that is ∇Ln(a) = (∂Ln(a)/∂a1, . . . , ∂Ln(a)/∂ap). Since ân = arg max Ln(a) (and assuming ân lies in the interior of Θ), we observe that ∇Ln(ân) = 0. Now expanding ∇Ln(ân) about a0 (the true parameter) we have

∇Ln(ân) = ∇Ln(a0) + (ân − a0)∇²Ln(ā)   ⇒   (ân − a0) = −{∇²Ln(ā)}^{−1} ∇Ln(a0),   (5.12)

where ā lies between ân and a0. To show asymptotic normality of √n(ân − a0), first asymptotic normality of ∇Ln(a0) is shown; second, it is shown that ∇²Ln(ā) →P E(∇²ℓ0(a0)); together they yield asymptotic normality of √n(ân − a0). In many cases ∇Ln(a0) is a martingale, hence the martingale central limit theorem is usually applied to show asymptotic normality of ∇Ln(a0). We start by defining a martingale and stating the martingale central limit theorem.
Definition 5.7.1 {Zt ; t = 1, . . . , ∞} are called martingale differences if

E(Zt |Zt−1, Zt−2, . . .) = 0.

An example is the sequence {Xt−1 ǫt}t (which arises in the AR(1) example below): because ǫt and Xt−1, Xt−2, . . . are independent, E(Xt−1 ǫt |Xt−1, Xt−2, . . .) = 0. The sequence {ST}T, where

ST = Σ_{t=1}^T Zt,

is called a martingale if {Zt} are martingale differences.
Let us define ST as

ST = (1/√T) Σ_{t=1}^T Yt,   (5.13)

where Ft = σ(Yt, Yt−1, . . .), E(Yt |Ft−1) = 0 and E(Yt²) < ∞. In the following theorem, adapted from Hall and Heyde (1980), Theorem 3.2 and Corollary 3.1, we show that ST is asymptotically normal.

Theorem 5.7.1 Let {ST}T be defined as in (5.13). Further suppose

(1/T) Σ_{t=1}^T Yt² →P σ²,   (5.14)

where σ² is a finite constant; for all ε > 0,

(1/T) Σ_{t=1}^T E(Yt² I(|Yt| > ε√T)|Ft−1) →P 0   (5.15)

(this is known as the conditional Lindeberg condition); and

(1/T) Σ_{t=1}^T E(Yt² |Ft−1) →P σ².   (5.16)

Then we have

ST →D N(0, σ²).   (5.17)
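The theorem can be seen in action by simulation (a sketch with assumed values, not part of the notes): taking Yt = Xt−1 ǫt for an AR(1) process with φ = 0.6 and iid N(0,1) innovations (the martingale differences used in the next section), ST should be approximately N(0, σ²) with σ² = E(X0²) = 1/(1 − φ²).

```python
import numpy as np

# Simulation sketch (assumed setting): AR(1) with phi = 0.6, iid N(0,1)
# innovations.  Y_t = X_{t-1}*eps_t are martingale differences, and
# S_T = T^{-1/2} sum_t Y_t should be approximately N(0, sigma^2) with
# sigma^2 = E(X_0^2) = 1/(1 - phi^2).
rng = np.random.default_rng(6)
phi = 0.6
sigma2 = 1 / (1 - phi**2)

def martingale_sum(T, burn=200):
    e = rng.standard_normal(T + burn)
    x = np.zeros(T + burn)
    for t in range(1, T + burn):
        x[t] = phi * x[t - 1] + e[t]
    x, e = x[burn:], e[burn:]
    return np.sum(x[:-1] * e[1:]) / np.sqrt(T)   # sum of X_{t-1}*eps_t

draws = np.array([martingale_sum(1000) for _ in range(1000)])
print(np.mean(draws), np.var(draws), sigma2)   # mean near 0, variance near sigma2
```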
5.8 Asymptotic normality of the least squares estimator
In this section we show asymptotic normality of the least squares estimator of the AR(1) parameter defined in (5.9).

We recall that the least squares estimator is φ̂n = arg min_{a∈[−1,1]} Ln(a). Recalling the criterion

Ln(a) = (1/(n−1)) Σ_{t=2}^n (Xt − aXt−1)²,

the first and second derivatives are

∇Ln(a) = (−2/(n−1)) Σ_{t=2}^n Xt−1(Xt − aXt−1),   so that ∇Ln(φ) = (−2/(n−1)) Σ_{t=2}^n Xt−1 ǫt,

and

∇²Ln(a) = (2/(n−1)) Σ_{t=2}^n X²_{t−1}.
Therefore by using (5.12) we have

(φ̂n − φ) = −(∇²Ln)^{−1} ∇Ln(φ).   (5.18)

Since {Xt²} are ergodic random variables, by using the ergodic theorem we have ∇²Ln →a.s. 2E(X0²). This with (5.18) implies

√n(φ̂n − φ) = −(∇²Ln)^{−1} √n ∇Ln(φ),   where (∇²Ln)^{−1} →a.s. (2E(X0²))^{−1}.   (5.19)
To show asymptotic normality of √n(φ̂n − φ), we will show asymptotic normality of √n ∇Ln(φ). We observe that

∇Ln(φ) = (−2/(n−1)) Σ_{t=2}^n Xt−1 ǫt

is (up to a constant) a sum of martingale differences, since E(Xt−1 ǫt |Xt−1) = Xt−1 E(ǫt |Xt−1) = Xt−1 E(ǫt) = 0 (recall Definition 5.7.1). In order to show asymptotic normality of ∇Ln(φ) we will use the martingale central limit theorem.

We now use Theorem 5.7.1 to show that √n ∇Ln(φ) is asymptotically normal, which means we have to verify conditions (5.14)–(5.16). We note in our example that Yt := Xt−1 ǫt, and that the series {Xt−1 ǫt}t is an ergodic process. Furthermore, since for any function g, E(g(Xt−1 ǫt)|Ft−1) = E(g(Xt−1 ǫt)|Xt−1), where Ft = σ(Xt, Xt−1, . . .), we need only condition on Xt−1 rather than the entire sigma-algebra Ft−1.
C1: By using the ergodicity of {Xt−1 ǫt}t we have

(1/n) Σ_{t=1}^n Yt² = (1/n) Σ_{t=1}^n X²_{t−1} ǫt² →P E(X²_{t−1}) E(ǫt²) = σ²,

since Xt−1 and ǫt are independent and E(ǫt²) = 1.
C2: We now verify the conditional Lindeberg condition. We have

(1/n) Σ_{t=1}^n E(Yt² I(|Yt| > ε√n)|Ft−1) = (1/n) Σ_{t=1}^n E(X²_{t−1} ǫt² I(|Xt−1 ǫt| > ε√n)|Xt−1).

We now use the Cauchy–Schwarz inequality for conditional expectations to split X²_{t−1} ǫt² and I(|Xt−1 ǫt| > ε√n). We recall that the Cauchy–Schwarz inequality for conditional expectations is E(Xt Yt |G) ≤ [E(Xt² |G) E(Yt² |G)]^{1/2} almost surely. Therefore

(1/n) Σ_{t=1}^n E(Yt² I(|Yt| > ε√n)|Ft−1) ≤ (1/n) Σ_{t=1}^n [E(X⁴_{t−1} ǫt⁴ |Xt−1) E(I(|Xt−1 ǫt| > ε√n)² |Xt−1)]^{1/2}
= (1/n) Σ_{t=1}^n X²_{t−1} E(ǫt⁴)^{1/2} [E(I(|Xt−1 ǫt| > ε√n)² |Xt−1)]^{1/2}.   (5.20)

We note that rather than the Cauchy–Schwarz inequality we could use a generalisation of it called the Hölder inequality. The Hölder inequality states that if p^{−1} + q^{−1} = 1, then E(XY) ≤ {E(X^p)}^{1/p}{E(Y^q)}^{1/q} (the conditional version also exists). The advantage of using this inequality is that one can reduce the moment assumptions on Xt.

Returning to (5.20), and studying E(I(|Xt−1 ǫt| > ε√n)² |Xt−1), we use that E(I(A)) = P(A) and the Chebyshev inequality to show

E(I(|Xt−1 ǫt| > ε√n)² |Xt−1) = E(I(|Xt−1 ǫt| > ε√n)|Xt−1) = P(|ǫt| > ε√n/|Xt−1|) ≤ X²_{t−1} var(ǫt)/(ε² n).   (5.21)

Substituting (5.21) into (5.20) we have

(1/n) Σ_{t=1}^n E(Yt² I(|Yt| > ε√n)|Ft−1) ≤ (1/n) Σ_{t=1}^n X²_{t−1} E(ǫt⁴)^{1/2} [X²_{t−1} var(ǫt)/(ε² n)]^{1/2}
= (E(ǫt⁴)^{1/2}/(ε n^{3/2})) Σ_{t=1}^n |Xt−1|³ = (E(ǫt⁴)^{1/2}/(ε n^{1/2})) (1/n) Σ_{t=1}^n |Xt−1|³.

If E(ǫt⁴) < ∞, then E(Xt⁴) < ∞; therefore by using the ergodic theorem we have (1/n) Σ_{t=1}^n |Xt−1|³ →a.s. E(|X0|³). Since almost sure convergence implies convergence in probability, and E(ǫt⁴)^{1/2}/(ε n^{1/2}) → 0, we have

(1/n) Σ_{t=1}^n E(Yt² I(|Yt| > ε√n)|Ft−1) →P 0.

Hence condition (5.15) is satisfied.
C3: We need to verify that

(1/n) Σ_{t=1}^n E(Yt² |Ft−1) →P σ².

Since {Xt}t is an ergodic sequence we have

(1/n) Σ_{t=1}^n E(Yt² |Ft−1) = (1/n) Σ_{t=1}^n E(X²_{t−1} ǫt² |Xt−1) = (1/n) Σ_{t=1}^n X²_{t−1} E(ǫt²) = E(ǫt²) (1/n) Σ_{t=1}^n X²_{t−1} →a.s. E(ǫt²) E(X0²) = σ²,

hence we have verified condition (5.16).
Altogether, conditions C1–C3 imply that

(1/√n) Σ_{t=1}^n Xt−1 ǫt →D N(0, σ²).   (5.22)

Hence

√n ∇Ln(φ) = (−2√n/(n−1)) Σ_{t=2}^n Xt−1 ǫt →D N(0, 4σ²).   (5.23)

Recalling (5.19) and that (∇²Ln)^{−1} →a.s. (2E(X0²))^{−1} = (2σ²)^{−1}, we have

√n(φ̂n − φ) = −(∇²Ln)^{−1} √n ∇Ln(φ) →D N(0, (2σ²)^{−2} · 4σ²) = N(0, (σ²)^{−1}).   (5.24)

Since var(ǫt) = 1, we have σ² = E(X0²) = 1/(1 − φ²), so the limiting variance is (σ²)^{−1} = 1 − φ². Thus we have derived the limiting distribution of φ̂n.
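This limiting distribution can be checked by Monte Carlo (a sketch with assumed φ = 0.6 and iid N(0,1) innovations, not part of the notes): across replications, √n(φ̂n − φ) should be centred near zero with variance close to 1 − φ² = 0.64.

```python
import numpy as np

# Monte Carlo sketch (assumed setting): AR(1) with phi = 0.6, iid N(0,1)
# innovations.  Across replications, sqrt(n)*(phi_hat - phi) should have
# mean near 0 and variance near 1 - phi^2 = 0.64.
rng = np.random.default_rng(7)
phi = 0.6

def phi_hat(n, burn=200):
    e = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    for t in range(1, n + burn):
        x[t] = phi * x[t - 1] + e[t]
    x = x[burn:]
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

n = 1000
z = np.array([np.sqrt(n) * (phi_hat(n) - phi) for _ in range(1000)])
print(np.mean(z), np.var(z))   # mean near 0, variance near 0.64
```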
Remark 5.8.1 We recall that

(φ̂n − φ) = −(∇²Ln)^{−1} ∇Ln(φ) = [(2/(n−1)) Σ_{t=2}^n εt Xt−1] / [(2/(n−1)) Σ_{t=2}^n X²_{t−1}],   (5.25)

and that var((2/(n−1)) Σ_{t=2}^n εt Xt−1) = (2/(n−1))² Σ_{t=2}^n var(εt Xt−1) = O(1/n) (the summands are uncorrelated, being martingale differences). This implies

(φ̂n − φ) = Op(n^{−1/2}).

Indeed the result also holds almost surely:

(φ̂n − φ) = O(n^{−1/2}).   (5.26)
The same result is true for autoregressive processes of arbitrary finite order. That is, for an AR(p) process,

√n(φ̂n − φ) →D N(0, Γp^{−1}),   (5.27)

where Γp is the p × p autocovariance matrix with (i, j)th entry E(X0 X_{i−j}).