Chapter 5

Almost sure convergence, convergence in probability and asymptotic normality

In the previous chapter we considered estimators of several different parameters. The hope is that as the sample size increases the estimator gets 'closer' to the parameter of interest; by closer we mean converges. In the classical sense the sequence $\{x_k\}$ converges to $x$ ($x_k \to x$) if $|x_k - x| \to 0$ as $k \to \infty$ (equivalently, for every $\varepsilon > 0$ there exists an $n$ such that for all $k > n$, $|x_k - x| < \varepsilon$). Of course the estimators we have considered are random: for every $\omega \in \Omega$ (the set of all outcomes) we have a different estimate. The natural question to ask is what convergence means for random sequences.

5.1 Modes of convergence

We start by defining different modes of convergence.

Definition 5.1.1 (Convergence)

• Almost sure convergence. We say that the sequence $\{X_t\}$ converges almost surely to $\mu$ if there exists a set $M \subset \Omega$, with $P(M) = 1$, such that for every $\omega \in M$ we have $X_t(\omega) \to \mu$. In other words, for every $\varepsilon > 0$ there exists an $N(\omega)$ such that

  $|X_t(\omega) - \mu| < \varepsilon$,   (5.1)

for all $t > N(\omega)$. Note that the above definition is very close to classical convergence. We denote '$X_t \to \mu$ almost surely' by $X_t \overset{a.s.}{\to} \mu$.

An equivalent definition, in terms of probabilities, is: $X_t \overset{a.s.}{\to} \mu$ if for every $\varepsilon > 0$

  $P(\omega : \cap_{m=1}^{\infty} \cup_{t=m}^{\infty} \{|X_t(\omega) - \mu| > \varepsilon\}) = 0$.

It is worth considering briefly what $\cap_{m=1}^{\infty} \cup_{t=m}^{\infty} \{|X_t(\omega) - \mu| > \varepsilon\}$ means. If this set is not empty, then there exists an $\omega^* \in \cap_{m=1}^{\infty} \cup_{t=m}^{\infty} \{|X_t(\omega) - \mu| > \varepsilon\}$ such that for some infinite sequence $\{k_j\}$ we have $|X_{k_j}(\omega^*) - \mu| > \varepsilon$; this means $X_t(\omega^*)$ does not converge to $\mu$. Now let $A = \cap_{m=1}^{\infty} \cup_{t=m}^{\infty} \{|X_t(\omega) - \mu| > \varepsilon\}$; if $P(A) = 0$, then for 'most' $\omega$ the sequence $\{X_t(\omega)\}$ converges.

• Convergence in mean square. We say $X_t \to \mu$ in mean square (or in $L_2$) if $E(X_t - \mu)^2 \to 0$ as $t \to \infty$.
• Convergence in probability. Convergence in probability cannot be stated in terms of realisations $X_t(\omega)$ but only in terms of probabilities. $X_t$ is said to converge to $\mu$ in probability (written $X_t \overset{P}{\to} \mu$) if, for every $\varepsilon > 0$,

  $P(|X_t - \mu| > \varepsilon) \to 0$ as $t \to \infty$.

Often we write this as $|X_t - \mu| = o_p(1)$. If for some $\gamma \geq 1$ we have $E|X_t - \mu|^{\gamma} \to 0$ as $t \to \infty$, then this implies convergence in probability (to see this, use Markov's inequality).

• Rates of convergence:

(i) We say $|X_t - \mu| = O_p(a_t)$ if the sequence $\{a_t^{-1}|X_t - \mu|\}$ is bounded in probability (defined below). From the definition of boundedness we see that, for all $t$, the distribution of $a_t^{-1}|X_t - \mu|$ should mainly lie within a certain interval. In general $a_t \to 0$ as $t \to \infty$; hence if the distribution of $a_t^{-1}|X_t - \mu|$ lies within a certain interval, then $|X_t - \mu|$ will lie within an ever decreasing one.

(ii) We say $|X_t - \mu| = o_p(a_t)$ if the sequence $\{a_t^{-1}|X_t - \mu|\}$ converges in probability to zero.

Definition 5.1.2 (Boundedness)

(i) Almost surely bounded. The random variable $X$ is almost surely bounded if for a positive sequence $\{e_k\}$, such that $e_k \to \infty$ as $k \to \infty$ (typically $e_k = 2^k$ is used), we have

  $P(\omega : \cup_{k=1}^{\infty} \{|X(\omega)| \leq e_k\}) = 1$.

Usually to prove the above we show that

  $P((\omega : \cup_{k=1}^{\infty} \{|X| \leq e_k\})^c) = 0$.

Since $(\cup_{k=1}^{\infty} \{|X| \leq e_k\})^c = \cap_{k=1}^{\infty} \{|X| > e_k\} \subset \cap_{k=1}^{\infty} \cup_{m=k}^{\infty} \{|X| > e_m\}$, to show the above we show

  $P(\omega : \cap_{k=1}^{\infty} \cup_{m=k}^{\infty} \{|X(\omega)| > e_m\}) = 0$.   (5.2)

We note that if this set is not empty, then there exists an $\omega^* \in \Omega$ and an infinite subsequence $\{k_j\}$ where $|X(\omega^*)| > e_{k_j}$; hence $X(\omega^*)$ is not bounded (since $e_k \to \infty$).

To prove (5.2) we usually use the Borel–Cantelli lemma. This states that if $\sum_{k=1}^{\infty} P(A_k) < \infty$, then the events $\{A_k\}$ occur only finitely often with probability one.
Applying this to our case: if we can show that $\sum_{m=1}^{\infty} P(\omega : |X(\omega)| > e_m) < \infty$, then the events $\{|X(\omega)| > e_m\}$ happen only finitely often with probability one. Hence $P(\omega : \cap_{k=1}^{\infty} \cup_{m=k}^{\infty} \{|X(\omega)| > e_m\}) = 0$ and $X$ is a bounded random variable.

It is worth noting that often we choose the sequence $e_k = 2^k$; in this case $\sum_{m=1}^{\infty} P(\omega : |X(\omega)| > e_m) = \sum_{m=1}^{\infty} P(\omega : \log|X(\omega)| > m\log 2) \leq C E(\log|X|)$. Hence if we can show that $E(\log|X|) < \infty$, then $X$ is bounded almost surely.

(ii) Bounded in probability. A sequence is bounded in probability, written $X_t = O_p(1)$, if for every $\varepsilon > 0$ there exists a $\delta(\varepsilon) < \infty$ such that $P(|X_t| \geq \delta(\varepsilon)) < \varepsilon$. Roughly speaking, this means that the sequence is extremely large only with a very small probability, and as the 'largeness' grows the probability declines.

5.2 Ergodicity

To motivate the notion of ergodicity we recall the strong law of large numbers (SLLN). Suppose $\{X_t\}_t$ is an iid random sequence and $E|X_0| < \infty$; then by the SLLN we have

  $\frac{1}{n}\sum_{t=1}^{n} X_t \overset{a.s.}{\to} E(X_0)$;

for the proof see, for example, Grimmett and Stirzaker (1994). It would be useful to generalise this result and find weaker conditions on $\{X_t\}$ for it to still hold true. A simple application is when we want to estimate the mean $\mu$ and use $\frac{1}{n}\sum_{t=1}^{n} X_t$ as an estimator of the mean. It can be shown that if $\{X_t\}$ is an ergodic process then the above result holds. That is, if $\{X_t\}$ is an ergodic process then for any function $h$ such that $E|h(X_0)| < \infty$ we have

  $\frac{1}{n}\sum_{t=1}^{n} h(X_t) \overset{a.s.}{\to} E(h(X_0))$.

Note that the result does not state anything about the rate of convergence. Ergodicity is normally defined in terms of measure preserving transformations. However, we do not formally define ergodicity here, but needless to say all ergodic processes are stationary. For the definition of ergodicity and a full treatment see, for example, Billingsley (1995).
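The ergodic average above is easy to check numerically. Below is a minimal sketch (the choice of $h$, the seed and the sample sizes are my own arbitrary choices, and the helper name `ergodic_average` is mine): for an iid standard normal sequence, the simplest ergodic process, and $h(x) = x^2$, the running average should settle down to $E(h(X_0)) = E(X_0^2) = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

def ergodic_average(h, x):
    """Running average (1/n) * sum_{t=1}^n h(X_t) for n = 1, ..., len(x)."""
    return np.cumsum(h(x)) / np.arange(1, len(x) + 1)

# iid standard normal sequence: the simplest ergodic process,
# with E(h(X_0)) = E(X_0^2) = 1 for h(x) = x^2
x = rng.standard_normal(100_000)
avg = ergodic_average(lambda v: v**2, x)

for n in (100, 10_000, 100_000):
    print(n, avg[n - 1])  # the average settles near 1 as n grows
```

The same code applies unchanged to a dependent but ergodic sequence; independence is not what drives the convergence.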
However, below we state a result which characterises a general class of ergodic processes.

Theorem 5.2.1 Suppose $\{Z_t\}$ is an ergodic sequence (for example, iid random variables) and $g : \mathbb{R}^{\infty} \to \mathbb{R}$ is a measurable function (it is really hard to think up non-measurable functions). Then the sequence $\{Y_t\}_t$, where $Y_t = g(Z_t, Z_{t-1}, \ldots)$, is an ergodic process.

PROOF. See Stout (1974), Theorem 3.5.8.

Example 5.2.1 (i) The process $\{Z_t\}_t$, where $\{Z_t\}$ are iid random variables, is probably the simplest example of an ergodic sequence.

(ii) A simple example of a time series $\{X_t\}$ which is not independent but is ergodic is the AR(1) process. We recall that the AR(1) process satisfies the representation

  $X_t = \phi X_{t-1} + \varepsilon_t$,   (5.3)

where $\{\varepsilon_t\}_t$ are iid random variables with $E(\varepsilon_t) = 0$, $E(\varepsilon_t^2) = 1$ and $|\phi| < 1$. It has the unique causal solution

  $X_t = \sum_{j=0}^{\infty} \phi^j \varepsilon_{t-j}$.

The solution motivates us to define the function

  $g(x_0, x_1, \ldots) = \sum_{j=0}^{\infty} \phi^j x_j$.

The function $g(\cdot)$ is sufficiently well behaved (it is a limit of continuous functions, thus measurable), which implies, by using Theorem 5.2.1, that $\{X_t\}$ is an ergodic process. We note that if $E(\varepsilon_t^2) < \infty$, then $E(X_t^2) < \infty$.

(iii) The ARCH(p) process $\{X_t\}$ defined by $X_t = Z_t\sigma_t$, where $\sigma_t^2 = a_0 + \sum_{j=1}^{p} a_j X_{t-j}^2$ with $\sum_{j=1}^{p} a_j < 1$, is an ergodic stochastic process (we look at this model in a later chapter).

Example 5.2.2 (Application) If $\{X_t\}$ is an AR(1) process with $|\phi| < 1$ and $E(\varepsilon_t^2) < \infty$, then by using the ergodic theorem we have

  $\frac{1}{n}\sum_{t=1}^{n} X_t X_{t+k} \overset{a.s.}{\to} E(X_0 X_k)$.

5.3 Sampling properties

Often we will estimate the parameters by maximising (or minimising) a criterion. Suppose we have a criterion $L_n(a)$ (e.g. likelihood, quasi-likelihood, Kullback–Leibler, etc.); we use as an estimator of $a_0$ the value $\hat{a}_n$, where

  $\hat{a}_n = \arg\max_{a \in \Theta} L_n(a)$

and $\Theta$ is the parameter space over which we do the maximisation (minimisation). Typically the true parameter $a_0$ should maximise (minimise) the 'limiting' criterion $L$.
If this is to be a good estimator, then as the sample size grows the estimator should converge (in some sense) to the parameter we are interested in estimating. As we discussed above, there are various modes in which we can measure this convergence: (i) almost surely, (ii) in probability and (iii) in mean squared error. Usually we show either (i) or (ii) (noting that (i) implies (ii)); in time series it is usually quite difficult to show (iii).

Definition 5.3.1 (i) An estimator $\hat{a}_n$ is said to be an almost surely consistent estimator of $a_0$ if there exists a set $M \subset \Omega$, with $P(M) = 1$, such that for all $\omega \in M$ we have $\hat{a}_n(\omega) \to a_0$.

(ii) An estimator $\hat{a}_n$ is said to converge in probability to $a_0$ if for every $\delta > 0$

  $P(|\hat{a}_n - a_0| > \delta) \to 0$ as $n \to \infty$.

To prove either (i) or (ii) usually involves verifying two main things: pointwise convergence and equicontinuity.

5.4 Showing almost sure convergence of an estimator

We now consider the general case where $L_n(a)$ is a 'criterion' which we maximise. Let us suppose we can write $L_n$ as

  $L_n(a) = \frac{1}{n}\sum_{t=1}^{n} \ell_t(a)$,   (5.4)

where for each $a \in \Theta$, $\{\ell_t(a)\}_t$ is an ergodic sequence. Let

  $L(a) = E(\ell_t(a))$;   (5.5)

we assume that $L(a)$ is continuous and has a unique maximum in $\Theta$. We define the estimator $\hat{a}_n$, where $\hat{a}_n = \arg\max_{a \in \Theta} L_n(a)$.

Definition 5.4.1 (Uniform convergence) $L_n(a)$ is said to converge uniformly to $L(a)$ almost surely if

  $\sup_{a \in \Theta} |L_n(a) - L(a)| \overset{a.s.}{\to} 0$.

In other words, there exists a set $M \subset \Omega$, with $P(M) = 1$, such that for every $\omega \in M$

  $\sup_{a \in \Theta} |L_n(\omega, a) - L(a)| \to 0$.

Theorem 5.4.1 (Consistency) Suppose that $\hat{a}_n = \arg\max_{a \in \Theta} L_n(a)$ and $a_0 = \arg\max_{a \in \Theta} L(a)$ is the unique maximum. If $\sup_{a \in \Theta} |L_n(a) - L(a)| \overset{a.s.}{\to} 0$ as $n \to \infty$, then $\hat{a}_n \overset{a.s.}{\to} a_0$ as $n \to \infty$.

PROOF. We note that by definition we have $L_n(a_0) \leq L_n(\hat{a}_n)$ and $L(\hat{a}_n) \leq L(a_0)$. Using these inequalities we have

  $L_n(a_0) - L(a_0) \leq L_n(\hat{a}_n) - L(a_0) \leq L_n(\hat{a}_n) - L(\hat{a}_n)$.
Therefore from the above we have

  $|L_n(\hat{a}_n) - L(a_0)| \leq \max\{|L_n(a_0) - L(a_0)|, |L_n(\hat{a}_n) - L(\hat{a}_n)|\} \leq \sup_{a \in \Theta} |L_n(a) - L(a)|$.

Hence, since we have uniform convergence, $|L_n(\hat{a}_n) - L(a_0)| \overset{a.s.}{\to} 0$ as $n \to \infty$, and consequently $|L(\hat{a}_n) - L(a_0)| \leq |L(\hat{a}_n) - L_n(\hat{a}_n)| + |L_n(\hat{a}_n) - L(a_0)| \overset{a.s.}{\to} 0$. Now since $L(a)$ has a unique maximum, $L(\hat{a}_n) \overset{a.s.}{\to} L(a_0)$ implies $\hat{a}_n \overset{a.s.}{\to} a_0$. □

We note that directly establishing uniform convergence is not easy. Usually it is done by assuming the parameter space is compact and showing pointwise convergence and stochastic equicontinuity; these three facts imply uniform convergence. Below we define stochastic equicontinuity and show consistency under these conditions.

Definition 5.4.2 The sequence of stochastic functions $\{f_n(a)\}_n$ is said to be stochastically equicontinuous if there exists a set $M \subset \Omega$, with $P(M) = 1$, such that for every $\omega \in M$ and $\varepsilon > 0$ there exist a $\delta > 0$ and an $N(\omega)$ such that

  $\sup_{|a_1 - a_2| \leq \delta} |f_n(\omega, a_1) - f_n(\omega, a_2)| \leq \varepsilon$,

for all $n > N(\omega)$.

A sufficient condition for stochastic equicontinuity of $f_n(a)$ (which is usually used to prove it) is that $f_n(a)$ is almost surely Lipschitz continuous. In other words, there exists a random variable $K$ which is almost surely bounded, such that for all $\omega \in M$ (with $P(M) = 1$) and all $a_1, a_2 \in \Theta$ we have

  $|f_n(\omega, a_1) - f_n(\omega, a_2)| \leq K(\omega)\|a_1 - a_2\|$.

In the following theorem we state sufficient conditions for almost sure uniform convergence. It is worth noting that this is the Arzelà–Ascoli theorem for random variables.

Theorem 5.4.2 (The stochastic Ascoli lemma) Suppose the parameter space $\Theta$ is compact, for every $a \in \Theta$ we have $L_n(a) \overset{a.s.}{\to} L(a)$, and $L_n(a)$ is stochastically equicontinuous. Then $\sup_{a \in \Theta} |L_n(a) - L(a)| \overset{a.s.}{\to} 0$ as $n \to \infty$.

We use the theorem below.

Corollary 5.4.1 Suppose that $\hat{a}_n = \arg\max_{a \in \Theta} L_n(a)$ and $a_0 = \arg\max_{a \in \Theta} L(a)$; moreover $L(a)$ has a unique maximum. If

(i) we have pointwise convergence, that is for every $a \in \Theta$ we have $L_n(a) \overset{a.s.}{\to} L(a)$;

(ii) the parameter space $\Theta$ is compact;

(iii) $L_n(a)$ is stochastically equicontinuous;
then $\hat{a}_n \overset{a.s.}{\to} a_0$ as $n \to \infty$.

We prove Theorem 5.4.2 in the section below, but it can be omitted on first reading.

5.4.1 Proof of Theorem 5.4.2 (the stochastic Ascoli theorem)

We now show that stochastic equicontinuity and almost sure pointwise convergence imply uniform convergence. We note that on its own, pointwise convergence is a much weaker condition than uniform convergence, since for pointwise convergence the rate of convergence can be different for each parameter.

Before we continue, a few technical points. We recall that we are assuming almost sure pointwise convergence. This means for each parameter $a \in \Theta$ there exists a set $N_a \subset \Omega$ (with $P(N_a) = 1$) such that for all $\omega \in N_a$, $L_n(\omega, a) \to L(a)$. In the following lemma we unify this set. That is, we show (using stochastic equicontinuity) that there exists a single set $\bar{M} \subset \Omega$ (with $P(\bar{M}) = 1$) such that for all $\omega \in \bar{M}$ and all $a \in \Theta$, $L_n(\omega, a) \to L(a)$.

Lemma 5.4.1 Suppose the sequence $\{L_n(a)\}_n$ is stochastically equicontinuous and also pointwise convergent (that is, $L_n(a)$ converges almost surely to $L(a)$). Then there exists a set $\bar{M} \subset \Omega$, with $P(\bar{M}) = 1$, such that for every $\omega \in \bar{M}$ and $a \in \Theta$ we have $|L_n(\omega, a) - L(a)| \to 0$.

PROOF. Enumerate all the rationals in the set $\Theta$ and call this sequence $\{a_i\}_i$. For every $a_i$ there exists a set $M_{a_i}$, with $P(M_{a_i}) = 1$, such that for every $\omega \in M_{a_i}$ we have $|L_n(\omega, a_i) - L(a_i)| \to 0$. Define $M = \cap_i M_{a_i}$; since the number of sets is countable, $P(M) = 1$, and for every $\omega \in M$ and every $a_i$ we have $L_n(\omega, a_i) \to L(a_i)$. Since we have stochastic equicontinuity, there exists a set $\tilde{M}$ (with $P(\tilde{M}) = 1$) such that for every $\omega \in \tilde{M}$, $\{L_n(\omega, \cdot)\}$ is equicontinuous. Let $\bar{M} = \tilde{M} \cap M$; we will show that for all $a \in \Theta$ and $\omega \in \bar{M}$ we have $L_n(\omega, a) \to L(a)$.

By stochastic equicontinuity, for every $\omega \in \bar{M}$ and $\varepsilon/3 > 0$ there exists a $\delta > 0$ such that

  $\sup_{|b_1 - b_2| \leq \delta} |L_n(\omega, b_1) - L_n(\omega, b_2)| \leq \varepsilon/3$,   (5.6)

for all $n > N(\omega)$.
Furthermore, by definition of $\bar{M}$, for every rational $a_i \in \Theta$ and $\omega \in \bar{M}$ we have

  $|L_n(\omega, a_i) - L(a_i)| \leq \varepsilon/3$,   (5.7)

for all $n > N'(\omega)$. Now for any given $a \in \Theta$ there exists a rational $a_i$ such that $\|a - a_i\| \leq \delta$ (taking $\delta$ small enough that, by continuity of $L$, also $|L(a) - L(a_i)| \leq \varepsilon/3$). Using this, (5.6) and (5.7) we have

  $|L_n(\omega, a) - L(a)| \leq |L_n(\omega, a) - L_n(\omega, a_i)| + |L_n(\omega, a_i) - L(a_i)| + |L(a_i) - L(a)| \leq \varepsilon$,

for $n > \max(N(\omega), N'(\omega))$. To summarise: for every $\omega \in \bar{M}$ and $a \in \Theta$ we have $|L_n(\omega, a) - L(a)| \to 0$. Hence we have pointwise convergence for every realisation in $\bar{M}$. □

We now show that equicontinuity implies uniform convergence.

PROOF of Theorem 5.4.2. Using Lemma 5.4.1 we see that there exists a set $\bar{M} \subset \Omega$, with $P(\bar{M}) = 1$, on which $L_n$ is equicontinuous and also pointwise convergent. We now show uniform convergence on this set. Choose $\varepsilon/3 > 0$ and let $\delta$ be such that for every $\omega \in \bar{M}$ we have

  $\sup_{|a_1 - a_2| \leq \delta} |L_n(\omega, a_1) - L_n(\omega, a_2)| \leq \varepsilon/3$,   (5.8)

for all $n > n(\omega)$. Since $\Theta$ is compact it can be covered by a finite number of open sets. Construct the sets $\{O_i\}_{i=1}^{p}$ such that $\Theta \subset \cup_{i=1}^{p} O_i$ and $\sup_i \sup_{x, y \in O_i} \|x - y\| \leq \delta$. Let $\{a_i\}_{i=1}^{p}$ be such that $a_i \in O_i$. We note that for every $\omega \in \bar{M}$ we have $L_n(\omega, a_i) \to L(a_i)$; hence for every $\varepsilon/3$ there exists an $n_i(\omega)$ such that for all $n > n_i(\omega)$ we have $|L_n(\omega, a_i) - L(a_i)| \leq \varepsilon/3$. Therefore, since $p$ is finite (due to compactness), we have

  $\max_{1 \leq i \leq p} |L_n(\omega, a_i) - L(a_i)| \leq \varepsilon/3$,

for all $n > \tilde{n}(\omega) = \max_{1 \leq i \leq p} n_i(\omega)$. For any $a \in \Theta$, choose the open set $O_i$ such that $a \in O_i$. Using (5.8) we have

  $|L_n(\omega, a) - L_n(\omega, a_i)| \leq \varepsilon/3$,

for all $n > n(\omega)$. Altogether (using the continuity of $L$, which for $\delta$ small enough gives $|L(a) - L(a_i)| \leq \varepsilon/3$) this gives

  $|L_n(\omega, a) - L(a)| \leq |L_n(\omega, a) - L_n(\omega, a_i)| + |L_n(\omega, a_i) - L(a_i)| + |L(a_i) - L(a)| \leq \varepsilon$,

for all $n \geq \max(n(\omega), \tilde{n}(\omega))$. We observe that $\max(n(\omega), \tilde{n}(\omega))$ and $\varepsilon$ do not depend on $a$; therefore for all $n \geq \max(n(\omega), \tilde{n}(\omega))$ we have $\sup_a |L_n(\omega, a) - L(a)| \leq \varepsilon$. This gives, for every $\omega \in \bar{M}$ (with $P(\bar{M}) = 1$), $\sup_a |L_n(\omega, a) - L(a)| \to 0$; thus we have almost sure uniform convergence. □
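The conclusion of the stochastic Ascoli lemma can be illustrated numerically on a toy criterion. A minimal sketch (the criterion, grid, seed and sample sizes are my own choices, and `sup_deviation` is an illustrative helper name): take $\ell_t(a) = (X_t - a)^2$ with $X_t$ iid $N(0,1)$, so $L(a) = 1 + a^2$; since $L_n$ is almost surely Lipschitz on the compact set $\Theta = [-1, 1]$, the supremum of $|L_n(a) - L(a)|$ over $\Theta$ shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(-1.0, 1.0, 201)  # compact parameter space Theta = [-1, 1]
L = 1.0 + grid**2                   # limit criterion: E(X - a)^2 = 1 + a^2 for X ~ N(0, 1)

def sup_deviation(n):
    """sup_{a in Theta} |L_n(a) - L(a)| for one realisation of the criterion."""
    x = rng.standard_normal(n)
    m1, m2 = x.mean(), (x**2).mean()
    Ln = m2 - 2.0 * grid * m1 + grid**2  # expanded form of (1/n) sum_t (X_t - a)^2
    return float(np.max(np.abs(Ln - L)))

devs = {n: sup_deviation(n) for n in (100, 10_000, 1_000_000)}
print(devs)  # deviations shrink at roughly the n^{-1/2} rate
```

Evaluating the supremum over a finite grid is itself justified by the Lipschitz property: between grid points the criterion cannot move by more than a constant times the grid spacing.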
5.5 Almost sure convergence of the least squares estimator for an AR(p) process

In Chapter ?? we will consider the sampling properties of many of the estimators defined in Chapter ??. However, to illustrate the consistency result above we apply it to the least squares estimator of the autoregressive parameters. To simplify notation we only consider the estimator for AR(1) models.

Suppose that $X_t$ satisfies $X_t = \phi X_{t-1} + \varepsilon_t$. To estimate $\phi$ we use the least squares estimator defined below. Let

  $L_n(a) = \frac{1}{n-1}\sum_{t=2}^{n} (X_t - aX_{t-1})^2$;   (5.9)

we use $\hat{\phi}_n$ as an estimator of $\phi$, where

  $\hat{\phi}_n = \arg\min_{a \in \Theta} L_n(a)$   (5.10)

and $\Theta = [-1, 1]$. (The results of Section 5.4, stated for maxima, apply verbatim here on replacing $L_n$ by $-L_n$.) How can we show that this is consistent?

• In the case of least squares for AR processes, $\hat{\phi}_n$ has the explicit form

  $\hat{\phi}_n = \dfrac{\frac{1}{n-1}\sum_{t=2}^{n} X_t X_{t-1}}{\frac{1}{n-1}\sum_{t=1}^{n-1} X_t^2}$.

Now by just applying the ergodic theorem to the numerator and denominator we get $\hat{\phi}_n \overset{a.s.}{\to} \phi$. It is worth noting that this ratio is not necessarily less than one in absolute value.

• However, we will tackle the problem in a rather artificial way and assume that the estimator does not have an explicit form, and instead that $\hat{\phi}_n$ is obtained by minimising $L_n(a)$ using a numerical routine. In general this is the most common way of minimising a likelihood function (usually explicit solutions do not exist).

• In order to derive the sampling properties of $\hat{\phi}_n$ we need to study the criterion $L_n(a)$ directly. We will do this now in the least squares case. We will first show almost sure convergence, which will involve repeated use of the ergodic theorem. We will then demonstrate how to show convergence in probability. We look at almost sure convergence as it is easier to follow. Note that almost sure convergence implies convergence in probability (but the converse is not necessarily true).

The first thing to do is let $\ell_t(a) = (X_t - aX_{t-1})^2$. Since $\{X_t\}$ is an ergodic process (recall Example 5.2.1(ii)), by using Theorem 5.2.1 we have, for each $a$, that $\{\ell_t(a)\}_t$ is an ergodic process.
Therefore by using the ergodic theorem we have

  $L_n(a) = \frac{1}{n-1}\sum_{t=2}^{n} \ell_t(a) \overset{a.s.}{\to} E(\ell_t(a)) = E(X_1 - aX_0)^2 =: L(a)$

(the expectation does not depend on $t$ by stationarity). In other words, for every $a \in [-1, 1]$ we have $L_n(a) \overset{a.s.}{\to} L(a)$ (almost sure pointwise convergence). Since the parameter space $[-1, 1]$ is compact and $\phi$ is the unique minimum of $L(\cdot)$ in the parameter space, all that remains is to show stochastic equicontinuity; from this we deduce almost sure uniform convergence.

To show stochastic equicontinuity we expand $L_n(a)$ and use the mean value theorem to obtain

  $L_n(a_1) - L_n(a_2) = \nabla L_n(\bar{a})(a_1 - a_2)$,   (5.11)

where $\bar{a} \in [\min(a_1, a_2), \max(a_1, a_2)]$ and

  $\nabla L_n(\bar{a}) = \frac{-2}{n-1}\sum_{t=2}^{n} X_{t-1}(X_t - \bar{a}X_{t-1})$.

Because $\bar{a} \in [-1, 1]$ we have

  $|\nabla L_n(\bar{a})| \leq D_n$, where $D_n = \frac{2}{n-1}\sum_{t=2}^{n} (|X_{t-1}X_t| + X_{t-1}^2)$.

Since $\{X_t\}_t$ is an ergodic process, $\{|X_{t-1}X_t| + X_{t-1}^2\}$ is also an ergodic process. Therefore, if $\mathrm{var}(\varepsilon_0) < \infty$, by using the ergodic theorem we have

  $D_n \overset{a.s.}{\to} 2E(|X_{t-1}X_t| + X_{t-1}^2) =: D$.

Therefore there exists a set $M \subset \Omega$, with $P(M) = 1$, such that for every $\omega \in M$ and $\delta^* > 0$ we have $|D_n(\omega) - D| \leq \delta^*$ for all $n > N(\omega)$. Substituting the above into (5.11) we have

  $|L_n(\omega, a_1) - L_n(\omega, a_2)| \leq D_n(\omega)|a_1 - a_2| \leq (D + \delta^*)|a_1 - a_2|$,

for all $n \geq N(\omega)$. Therefore for every $\varepsilon > 0$ there exists a $\delta := \varepsilon/(D + \delta^*)$ such that

  $\sup_{|a_1 - a_2| \leq \delta} |L_n(\omega, a_1) - L_n(\omega, a_2)| \leq \varepsilon$,

for all $n \geq N(\omega)$. Since this is true for all $\omega \in M$, we see that $\{L_n(a)\}$ is stochastically equicontinuous.

Theorem 5.5.1 Let $\hat{\phi}_n$ be defined as in (5.10). Then we have $\hat{\phi}_n \overset{a.s.}{\to} \phi$.

PROOF. Since $\{L_n(a)\}$ is stochastically equicontinuous, the parameter space $[-1, 1]$ is compact and we have pointwise convergence $L_n(a) \overset{a.s.}{\to} L(a)$, by using Theorem 5.4.1 (applied to $-L_n$) we have $\hat{\phi}_n \overset{a.s.}{\to} a^*$, where $a^* = \arg\min_{a \in \Theta} L(a)$. Finally we need to show that $a^* = \phi$. Since $L(a) = E(X_1 - aX_0)^2$, by differentiating $L(a)$ with respect to $a$ we see that it is minimised at $a = E(X_0X_1)/E(X_0^2)$; hence $a^* = E(X_0X_1)/E(X_0^2)$.
To show that this is $\phi$, we note that by the Yule–Walker equations

  $X_t = \phi X_{t-1} + \varepsilon_t \;\Rightarrow\; E(X_t X_{t-1}) = \phi E(X_{t-1}^2) + \underbrace{E(\varepsilon_t X_{t-1})}_{=0}$.

Therefore $\phi = E(X_0X_1)/E(X_0^2)$, hence $\hat{\phi}_n \overset{a.s.}{\to} \phi$. □

We note that by using very similar methods we can show strong consistency of the least squares estimator of the parameters in an AR(p) model.

5.6 Convergence in probability of an estimator

We described above almost sure (strong) consistency, where we showed $\hat{a}_n \overset{a.s.}{\to} a_0$. Sometimes it is not possible to show strong consistency (when ergodicity etc. cannot be verified). Often, as an alternative, weak consistency, where $\hat{a}_n \overset{P}{\to} a_0$ (convergence in probability), is shown. This requires a weaker set of conditions, which we now describe:

(i) The parameter space $\Theta$ should be compact.

(ii) Pointwise convergence: for every $a \in \Theta$, $L_n(a) \overset{P}{\to} L(a)$.

(iii) Equicontinuity in probability: for every $\varepsilon > 0$ there exists a $\delta$ such that

  $\lim_{n \to \infty} P\big(\sup_{|a_1 - a_2| \leq \delta} |L_n(a_1) - L_n(a_2)| > \varepsilon\big) = 0$.

If the above conditions are satisfied, we have $\hat{a}_n \overset{P}{\to} a_0$.

Verifying conditions (ii) and (iii) may look a little daunting, but with the use of Chebyshev's (or Markov's) inequality it can be quite straightforward. For example, suppose we can show that for every $a \in \Theta$

  $E(L_n(a) - L(a))^2 \to 0$ as $n \to \infty$.

Then by applying Chebyshev's inequality we have, for every $\varepsilon > 0$,

  $P(|L_n(a) - L(a)| > \varepsilon) \leq \dfrac{E(L_n(a) - L(a))^2}{\varepsilon^2} \to 0$ as $n \to \infty$.

Thus for every $a \in \Theta$ we have $L_n(a) \overset{P}{\to} L(a)$. To show (iii) we often use the mean value theorem on $L_n(a)$:

  $|L_n(a_1) - L_n(a_2)| \leq \sup_a \|\nabla_a L_n(a)\|_2 \|a_1 - a_2\|$.

Now if it can be shown that $\sup_a \|\nabla_a L_n(a)\|_2$ is bounded in probability (usually we show that $E(\sup_a \|\nabla_a L_n(a)\|_2) < \infty$), then we have (iii).
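The consistency in Theorem 5.5.1 is easy to see in a simulation. Below is a minimal sketch (the value $\phi = 0.7$, the seed, the burn-in and the sample sizes are my own arbitrary choices, as are the helper names `simulate_ar1` and `ls_estimate`): the explicit ratio form of $\hat{\phi}_n$ settles down to $\phi$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
phi = 0.7  # true AR(1) parameter, |phi| < 1

def simulate_ar1(n, phi, burn=500):
    """Simulate X_t = phi X_{t-1} + eps_t with iid N(0, 1) innovations."""
    eps = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    for t in range(1, n + burn):
        x[t] = phi * x[t - 1] + eps[t]
    return x[burn:]  # drop the burn-in so the series is close to stationary

def ls_estimate(x):
    """Explicit least squares estimator: sum X_t X_{t-1} / sum X_{t-1}^2."""
    return float(np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2))

ests = {n: ls_estimate(simulate_ar1(n, phi)) for n in (100, 1_000, 100_000)}
print(ests)  # estimates approach phi = 0.7 as n grows
```

Numerically minimising (5.9) over $[-1, 1]$ would of course return the same values; the ratio is simply the closed-form minimiser.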
5.7 Asymptotic normality of an estimator

The first central limit theorem goes back to the asymptotic distribution of sums of binary random variables (these have a binomial distribution, which de Moivre and Laplace showed can be approximated by a normal distribution). This result was later generalised to sums of iid random variables. From the mid to the late 20th century several advances were made in generalising the results to dependent random variables. These include generalisations to random variables which have m-dependence, mixing properties, cumulant properties, near-epoch dependence, etc. In this section we will concentrate on a central limit theorem for martingales. Our reason for choosing this flavour of CLT is that it can be applied in various estimation settings, as it can often be shown that the derivative of a criterion at the true parameter is a martingale.

Let us suppose that

  $\hat{a}_n = \arg\max_{a \in \Theta} L_n(a)$, where $L_n(a) = \frac{1}{n}\sum_{t=1}^{n} \ell_t(a)$,

and for each $a \in \Theta$, $\{\ell_t(a)\}_t$ are identically distributed random variables.

In this section we shall show asymptotic normality of $\sqrt{n}(\hat{a}_n - a_0)$. The reason for normalising by $\sqrt{n}$ is that $(\hat{a}_n - a_0) \overset{a.s.}{\to} 0$ as $n \to \infty$; hence in terms of distributions it converges towards the point mass at zero. Therefore we need to increase the magnitude of the difference $\hat{a}_n - a_0$. We can show that $(\hat{a}_n - a_0) = O_p(n^{-1/2})$; therefore $\sqrt{n}(\hat{a}_n - a_0) = O_p(1)$.

We use $\nabla L_n(a)$ to denote the vector of partial derivatives of $L_n(a)$ with respect to $a$, $\nabla L_n(a) = (\frac{\partial L_n(a)}{\partial a_1}, \ldots, \frac{\partial L_n(a)}{\partial a_p})$. Since $\hat{a}_n = \arg\max L_n(a)$, we observe that $\nabla L_n(\hat{a}_n) = 0$. Now expanding $\nabla L_n(\hat{a}_n)$ about $a_0$ (the true parameter) with the mean value theorem we have

  $\nabla L_n(\hat{a}_n) = \nabla L_n(a_0) + (\hat{a}_n - a_0)\nabla^2 L_n(\bar{a}) \;\Rightarrow\; (\hat{a}_n - a_0) = -\{\nabla^2 L_n(\bar{a})\}^{-1}\nabla L_n(a_0)$,   (5.12)

where $\bar{a}$ lies between $\hat{a}_n$ and $a_0$. To show asymptotic normality of $\sqrt{n}(\hat{a}_n - a_0)$: first, asymptotic normality of $\sqrt{n}\nabla L_n(a_0)$ is shown; second, it is shown that $\nabla^2 L_n(\bar{a}) \overset{P}{\to} E(\nabla^2 \ell_0(a_0))$. Together these yield asymptotic normality of $\sqrt{n}(\hat{a}_n - a_0)$.
In many cases $\nabla L_n(a_0)$ is an average of martingale differences; hence the martingale central limit theorem is usually applied to show asymptotic normality of $\sqrt{n}\nabla L_n(a_0)$. We start by defining a martingale and stating the martingale central limit theorem.

Definition 5.7.1 $\{Z_t; t = 1, 2, \ldots\}$ are called martingale differences if $E(Z_t | Z_{t-1}, Z_{t-2}, \ldots) = 0$. An example is the sequence $\{X_{t-1}\varepsilon_t\}_t$ considered below: because $\varepsilon_t$ and $X_{t-1}, X_{t-2}, \ldots$ are independent, $E(X_{t-1}\varepsilon_t | X_{t-1}, X_{t-2}, \ldots) = 0$. The sequence $\{S_T\}_T$, where

  $S_T = \sum_{t=1}^{T} Z_t$,

is called a martingale if $\{Z_t\}$ are martingale differences.

Let us define $S_T$ as

  $S_T = \frac{1}{\sqrt{T}}\sum_{t=1}^{T} Y_t$,   (5.13)

where $\mathcal{F}_t = \sigma(Y_t, Y_{t-1}, \ldots)$, $E(Y_t | \mathcal{F}_{t-1}) = 0$ and $E(Y_t^2) < \infty$. In the following theorem, adapted from Hall and Heyde (1980), Theorem 3.2 and Corollary 3.1, we show that $S_T$ is asymptotically normal.

Theorem 5.7.1 Let $\{S_T\}_T$ be defined as in (5.13). Further suppose

  $\frac{1}{T}\sum_{t=1}^{T} Y_t^2 \overset{P}{\to} \sigma^2$,   (5.14)

where $\sigma^2$ is a finite constant; for all $\varepsilon > 0$,

  $\frac{1}{T}\sum_{t=1}^{T} E(Y_t^2 I(|Y_t| > \varepsilon\sqrt{T}) | \mathcal{F}_{t-1}) \overset{P}{\to} 0$   (5.15)

(this is known as the conditional Lindeberg condition); and

  $\frac{1}{T}\sum_{t=1}^{T} E(Y_t^2 | \mathcal{F}_{t-1}) \overset{P}{\to} \sigma^2$.   (5.16)

Then we have

  $S_T \overset{D}{\to} N(0, \sigma^2)$.   (5.17)

5.8 Asymptotic normality of the least squares estimator

In this section we show asymptotic normality of the least squares estimator of the AR(1) parameter defined in (5.9). We recall that the least squares estimator is $\hat{\phi}_n = \arg\min_{a \in [-1,1]} L_n(a)$. Recalling the criterion

  $L_n(a) = \frac{1}{n-1}\sum_{t=2}^{n} (X_t - aX_{t-1})^2$,

the first and second derivatives are

  $\nabla L_n(a) = \frac{-2}{n-1}\sum_{t=2}^{n} X_{t-1}(X_t - aX_{t-1})$, so that $\nabla L_n(\phi) = \frac{-2}{n-1}\sum_{t=2}^{n} X_{t-1}\varepsilon_t$,

and

  $\nabla^2 L_n(a) = \frac{2}{n-1}\sum_{t=2}^{n} X_{t-1}^2$.

Therefore by using (5.12) we have

  $(\hat{\phi}_n - \phi) = -(\nabla^2 L_n)^{-1}\nabla L_n(\phi)$.   (5.18)

Since $\{X_t^2\}$ are ergodic random variables, by using the ergodic theorem we have $\nabla^2 L_n \overset{a.s.}{\to} 2E(X_0^2)$. This with (5.18) implies

  $\sqrt{n}(\hat{\phi}_n - \phi) = -(\nabla^2 L_n)^{-1}\sqrt{n}\nabla L_n(\phi)$,   (5.19)

where $(\nabla^2 L_n)^{-1} \overset{a.s.}{\to} (2E(X_0^2))^{-1}$.
To show asymptotic normality of $\sqrt{n}(\hat{\phi}_n - \phi)$, we will show asymptotic normality of $\sqrt{n}\nabla L_n(\phi)$. We observe that

  $\nabla L_n(\phi) = \frac{-2}{n-1}\sum_{t=2}^{n} X_{t-1}\varepsilon_t$

is a sum of martingale differences, since $E(X_{t-1}\varepsilon_t | X_{t-1}) = X_{t-1}E(\varepsilon_t | X_{t-1}) = X_{t-1}E(\varepsilon_t) = 0$ (here we used Definition 5.7.1). In order to show asymptotic normality of $\nabla L_n(\phi)$ we will use the martingale central limit theorem.

We now use Theorem 5.7.1 to show that $\sqrt{n}\nabla L_n(\phi)$ is asymptotically normal, which means we have to verify conditions (5.14)–(5.16). We note in our example that $Y_t := X_{t-1}\varepsilon_t$, and that the series $\{X_{t-1}\varepsilon_t\}_t$ is an ergodic process. Furthermore, since for any function $g$, $E(g(X_{t-1}\varepsilon_t) | \mathcal{F}_{t-1}) = E(g(X_{t-1}\varepsilon_t) | X_{t-1})$, where $\mathcal{F}_t = \sigma(X_t, X_{t-1}, \ldots)$, we need only condition on $X_{t-1}$ rather than the entire sigma-algebra $\mathcal{F}_{t-1}$.

C1: By using the ergodicity of $\{X_{t-1}\varepsilon_t\}_t$ we have

  $\frac{1}{n}\sum_{t=1}^{n} Y_t^2 = \frac{1}{n}\sum_{t=1}^{n} X_{t-1}^2\varepsilon_t^2 \overset{P}{\to} E(X_{t-1}^2)\underbrace{E(\varepsilon_t^2)}_{=1} = \sigma^2$.

C2: We now verify the conditional Lindeberg condition:

  $\frac{1}{n}\sum_{t=1}^{n} E(Y_t^2 I(|Y_t| > \varepsilon\sqrt{n}) | \mathcal{F}_{t-1}) = \frac{1}{n}\sum_{t=1}^{n} E(X_{t-1}^2\varepsilon_t^2 I(|X_{t-1}\varepsilon_t| > \varepsilon\sqrt{n}) | X_{t-1})$.

We now use the Cauchy–Schwarz inequality for conditional expectations to split $X_{t-1}^2\varepsilon_t^2$ and $I(|X_{t-1}\varepsilon_t| > \varepsilon\sqrt{n})$. We recall that the Cauchy–Schwarz inequality for conditional expectations is $E(X_tY_t | \mathcal{G}) \leq [E(X_t^2 | \mathcal{G})E(Y_t^2 | \mathcal{G})]^{1/2}$ almost surely. Therefore

  $\frac{1}{n}\sum_{t=1}^{n} E(Y_t^2 I(|Y_t| > \varepsilon\sqrt{n}) | \mathcal{F}_{t-1})$
  $\leq \frac{1}{n}\sum_{t=1}^{n} \big[E(X_{t-1}^4\varepsilon_t^4 | X_{t-1})\,E(I(|X_{t-1}\varepsilon_t| > \varepsilon\sqrt{n})^2 | X_{t-1})\big]^{1/2}$
  $= \frac{1}{n}\sum_{t=1}^{n} X_{t-1}^2 E(\varepsilon_t^4)^{1/2}\big[E(I(|X_{t-1}\varepsilon_t| > \varepsilon\sqrt{n})^2 | X_{t-1})\big]^{1/2}$.   (5.20)

We note that rather than use the Cauchy–Schwarz inequality we could use a generalisation of it called the Hölder inequality. The Hölder inequality states that if $p^{-1} + q^{-1} = 1$, then $E|XY| \leq \{E|X|^p\}^{1/p}\{E|Y|^q\}^{1/q}$ (a conditional version also exists). The advantage of using this inequality is that one can reduce the moment assumptions on $X_t$.
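The limit these conditions deliver can also be checked by simulation. Below is a minimal sketch (the value $\phi = 0.5$, the seed, the series length and the replication count are my own arbitrary choices, and `score_stat` is an illustrative helper name): for the AR(1) model, the normalised score $n^{-1/2}\sum_t X_{t-1}\varepsilon_t$ should be approximately $N(0, \sigma^2)$ with $\sigma^2 = E(X_0^2) = 1/(1-\phi^2)$.

```python
import numpy as np

rng = np.random.default_rng(3)
phi = 0.5
sigma2 = 1.0 / (1.0 - phi**2)  # sigma^2 = E(X_0^2) for unit innovation variance

def score_stat(n):
    """S_n = n^{-1/2} sum_t X_{t-1} eps_t, a normalised sum of martingale differences."""
    eps = rng.standard_normal(n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]
    return float(np.sum(x[:-1] * eps[1:]) / np.sqrt(n))

reps = np.array([score_stat(1_000) for _ in range(2_000)])
print(reps.mean(), reps.var(), sigma2)  # sample variance should be close to sigma^2 = 4/3
```

Note that the summands $X_{t-1}\varepsilon_t$ are uncorrelated but not independent; the martingale CLT is exactly what justifies the normal limit here.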
Returning to (5.20), and studying $E(I(|X_{t-1}\varepsilon_t| > \varepsilon\sqrt{n})^2 | X_{t-1})$, we use that $I(A)^2 = I(A)$, that $E(I(A)) = P(A)$, and the Chebyshev inequality to show

  $E(I(|X_{t-1}\varepsilon_t| > \varepsilon\sqrt{n})^2 | X_{t-1}) = E(I(|X_{t-1}\varepsilon_t| > \varepsilon\sqrt{n}) | X_{t-1})$
  $= E(I(|\varepsilon_t| > \varepsilon\sqrt{n}/|X_{t-1}|) | X_{t-1}) = P(|\varepsilon_t| > \varepsilon\sqrt{n}/|X_{t-1}|) \leq \dfrac{X_{t-1}^2\mathrm{var}(\varepsilon_t)}{\varepsilon^2 n}$.   (5.21)

Substituting (5.21) into (5.20) we have

  $\frac{1}{n}\sum_{t=1}^{n} E(Y_t^2 I(|Y_t| > \varepsilon\sqrt{n}) | \mathcal{F}_{t-1}) \leq \frac{1}{n}\sum_{t=1}^{n} X_{t-1}^2 E(\varepsilon_t^4)^{1/2}\Big[\dfrac{X_{t-1}^2\mathrm{var}(\varepsilon_t)}{\varepsilon^2 n}\Big]^{1/2} \leq \dfrac{E(\varepsilon_t^4)^{1/2}}{\varepsilon n^{3/2}}\sum_{t=1}^{n} |X_{t-1}|^3 = \dfrac{E(\varepsilon_t^4)^{1/2}}{\varepsilon n^{1/2}}\cdot\frac{1}{n}\sum_{t=1}^{n} |X_{t-1}|^3$.

If $E(\varepsilon_t^4) < \infty$, then $E(X_t^4) < \infty$ (and in particular $E|X_t|^3 < \infty$); therefore by using the ergodic theorem we have $\frac{1}{n}\sum_{t=1}^{n} |X_{t-1}|^3 \overset{a.s.}{\to} E(|X_0|^3)$. Since almost sure convergence implies convergence in probability we have

  $\frac{1}{n}\sum_{t=1}^{n} E(Y_t^2 I(|Y_t| > \varepsilon\sqrt{n}) | \mathcal{F}_{t-1}) \leq \underbrace{\dfrac{E(\varepsilon_t^4)^{1/2}}{\varepsilon n^{1/2}}}_{\to 0}\underbrace{\frac{1}{n}\sum_{t=1}^{n} |X_{t-1}|^3}_{\overset{P}{\to} E(|X_0|^3)} \overset{P}{\to} 0$.

Hence condition (5.15) is satisfied.

C3: We need to verify that $\frac{1}{n}\sum_{t=1}^{n} E(Y_t^2 | \mathcal{F}_{t-1}) \overset{P}{\to} \sigma^2$. Since $\{X_t\}_t$ is an ergodic sequence we have

  $\frac{1}{n}\sum_{t=1}^{n} E(Y_t^2 | \mathcal{F}_{t-1}) = \frac{1}{n}\sum_{t=1}^{n} E(X_{t-1}^2\varepsilon_t^2 | X_{t-1}) = E(\varepsilon_t^2)\underbrace{\frac{1}{n}\sum_{t=1}^{n} X_{t-1}^2}_{\overset{a.s.}{\to} E(X_0^2)} \overset{P}{\to} E(\varepsilon_t^2)E(X_0^2) = \sigma^2$,

hence we have verified condition (5.16).

Altogether conditions C1–C3 imply that

  $\frac{1}{\sqrt{n}}\sum_{t=2}^{n} X_{t-1}\varepsilon_t \overset{D}{\to} N(0, \sigma^2)$,   (5.22)

and hence $\sqrt{n}\nabla L_n(\phi) \overset{D}{\to} N(0, 4\sigma^2)$ (the factor $4$ coming from the $-2$ in $\nabla L_n$). Recalling (5.19), we have

  $\sqrt{n}(\hat{\phi}_n - \phi) = -\underbrace{(\nabla^2 L_n)^{-1}}_{\overset{a.s.}{\to}\,(2E(X_0^2))^{-1}}\underbrace{\sqrt{n}\nabla L_n(\phi)}_{\overset{D}{\to}\,N(0,\,4\sigma^2)}$.   (5.23)

Using that $E(X_0^2) = \sigma^2$, this implies that

  $\sqrt{n}(\hat{\phi}_n - \phi) \overset{D}{\to} N\big(0, \tfrac{4\sigma^2}{(2\sigma^2)^2}\big) = N(0, \sigma^{-2})$.   (5.24)

Since $\sigma^2 = E(X_0^2) = 1/(1 - \phi^2)$, the limiting variance is $\sigma^{-2} = 1 - \phi^2$. Thus we have derived the limiting distribution of $\hat{\phi}_n$.

Remark 5.8.1 We recall that

  $(\hat{\phi}_n - \phi) = -(\nabla^2 L_n)^{-1}\nabla L_n(\phi) = \dfrac{\frac{2}{n-1}\sum_{t=2}^{n}\varepsilon_t X_{t-1}}{\frac{2}{n-1}\sum_{t=2}^{n} X_{t-1}^2}$,   (5.25)

and that $\mathrm{var}\big(\frac{2}{n-1}\sum_{t=2}^{n}\varepsilon_t X_{t-1}\big) = \frac{4}{(n-1)^2}\sum_{t=2}^{n}\mathrm{var}(\varepsilon_t X_{t-1}) = O(\frac{1}{n})$ (the cross terms vanish because $\{\varepsilon_t X_{t-1}\}$ are martingale differences, hence uncorrelated). This implies

  $(\hat{\phi}_n - \phi) = O_p(n^{-1/2})$.   (5.26)

An almost sure rate of the same order (up to a $\log\log n$ factor, by the law of the iterated logarithm) also holds. The same result is true for autoregressive processes of arbitrary finite order.
That is,

  $\sqrt{n}(\hat{\phi}_n - \phi) \overset{D}{\to} N(0, E(\varepsilon_t^2)\Gamma_p^{-1})$,   (5.27)

where $\Gamma_p = \{E(X_0 X_{i-j})\}_{i,j=1,\ldots,p}$ is the $p \times p$ autocovariance matrix of the process (here $E(\varepsilon_t^2) = 1$).
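The limiting distribution derived above can be checked by Monte Carlo in the AR(1) case. A sketch (the value $\phi = 0.7$, the seed, the sample size and the replication count are my own arbitrary choices, and `phi_hat` is an illustrative helper name): the variance of $\sqrt{n}(\hat{\phi}_n - \phi)$ across replications should be close to the limiting variance $\sigma^{-2} = 1 - \phi^2$.

```python
import numpy as np

rng = np.random.default_rng(4)
phi, n, reps = 0.7, 1_000, 2_000
avar = 1.0 - phi**2  # limiting variance sigma^{-2} = 1 - phi^2 = 0.51

def phi_hat(n):
    """Least squares estimate from one simulated AR(1) path."""
    eps = rng.standard_normal(n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]
    return float(np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2))

z = np.sqrt(n) * (np.array([phi_hat(n) for _ in range(reps)]) - phi)
print(z.mean(), z.var(), avar)  # the Monte Carlo variance should be close to 0.51
```

A histogram of `z` against the $N(0, 1 - \phi^2)$ density makes the normal approximation visible; the small negative mean visible in finite samples reflects the well-known downward bias of the least squares estimator, which vanishes at rate $1/n$.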