Chapter 7

Asymptotic Results

This chapter is devoted to asymptotic results. Firstly, consistency is discussed: the problem of when a sequence of estimators $(\hat q_n)_{n \geq 1}$ converges in probability to $q(\theta)$, the quantity to be estimated. Then, techniques based on the Central Limit Theorem are discussed to give conditions under which the maximum likelihood estimators of the canonical parameters of an exponential family are asymptotically normal.

7.1 Consistency

Let $\theta$ be an unknown parameter from a parameter space $\Theta$. Let $(\hat q_n)_{n \geq 1}$ be a sequence of estimators of $q(\theta)$, where $q : \Theta \to \mathbb{R}^p$.

Definition 7.1. The sequence $(\hat q_n)_{n \geq 1}$ is consistent if for all $\theta \in \Theta$ and $\epsilon > 0$,
$$P_\theta\left(|\hat q_n - q(\theta)| \geq \epsilon\right) \stackrel{n \to \infty}{\longrightarrow} 0, \qquad (7.1)$$
where $|\cdot|$ denotes the Euclidean norm. It is said to be uniformly consistent over $K \subseteq \Theta$ (or simply uniformly consistent if $K = \Theta$) if
$$\sup_{\theta \in K} P_\theta\left(|\hat q_n - q(\theta)| \geq \epsilon\right) \stackrel{n \to \infty}{\longrightarrow} 0. \qquad (7.2)$$

7.1.1 The Weak Law of Large Numbers

The simplest example of consistency is convergence of the sample average to the population average.

Theorem 7.2. Let $X_1, \dots, X_n$ be i.i.d. with distribution $P$. Suppose that $E[|X_1|] < +\infty$. Then $\overline{X} \to_P E[X_1] =: \mu$.

Sketch of Proof. Only a sketch of the proof is given, since the result is treated fully in Probability 2. Let $\phi_X(t) = E[e^{itX}]$ denote the characteristic function of the random variable $X$. Since $|\phi_X(t)| \leq 1$ for all $t \in \mathbb{R}$, Taylor's expansion theorem may be applied to give
$$\phi_X(t) = E[1 + itX + o(t)] = 1 + it\mu + o(t).$$
It follows that, for $\overline{X} = \frac{1}{n}(X_1 + \dots + X_n)$,
$$\phi_{\overline{X}}(t) = \prod_{j=1}^n \phi_{X_j/n}(t) = \left(\phi_{X_1}\!\left(\tfrac{t}{n}\right)\right)^n = \left(1 + i\tfrac{t}{n}\mu + o\!\left(\tfrac{t}{n}\right)\right)^n \stackrel{n \to +\infty}{\longrightarrow} e^{it\mu}.$$
This is the characteristic function of the constant random variable $\mu$ and hence, by the Lévy continuity theorem (omitted), $\overline{X} \to_P \mu$.

Without further assumptions, uniform consistency cannot be proved from the assumption $E_\theta[|X_1|] < +\infty$ for each $\theta \in \Theta$ alone; stronger conditions are required. If, in addition, $V_\theta(X_1) < M < +\infty$, where $M$ does not depend on $\theta$, Chebyshev's inequality may be used to prove uniform consistency:
$$P_\theta\left(\left|\overline{X} - \mu(\theta)\right| > \epsilon\right) \leq \frac{1}{\epsilon^2} V_\theta(\overline{X}) \leq \frac{M}{n\epsilon^2}.$$

The main tool to prove consistency will be the law of large numbers. This is natural for a method of moments estimator, where the parameter estimators are functions of moment estimators. We start with a result about functions of estimators of multinomial sampling probabilities. The estimators $\hat p_j$ are sample averages and hence are consistent by the law of large numbers. To prove uniform consistency for a function of these estimators, the function has to be uniformly continuous. This is obtained 'for free' if the parameter space is compact: a continuous function on a compact space is uniformly continuous. In the following theorem for multinomial sampling, the usual parameter space is extended to obtain compactness.

Theorem 7.3. Let $\mathcal{P} = \{(p_1, \dots, p_k) : 0 \leq p_j \leq 1,\ 1 \leq j \leq k,\ \sum_{j=1}^k p_j = 1\}$. Let $P_p$ denote the probability distribution with weights $p = (p_1, \dots, p_k)$ on $\mathcal{X} = (x_1, \dots, x_k)$. Let $X_1, \dots, X_n$ denote a random sample from $P_p$. Let $N_j = \sum_{i=1}^n \mathbf{1}_{x_j}(X_i)$ and $\hat p_{n,j} = \frac{N_j}{n}$ for $j = 1, \dots, k$, and let $\hat p_n = (\hat p_{n,1}, \dots, \hat p_{n,k})$. Let $q : \mathcal{P} \to \mathbb{R}^p$ be continuous. Then $\hat q_n := q(\hat p_n)$ is a uniformly consistent estimator of $q(p)$.

Proof. Note that $E[\hat p_{n,j}] = p_j$ for each $j$ and $V_p(\hat p_{n,j}) = \frac{p_j(1 - p_j)}{n} \leq \frac{1}{4n}$. By Chebyshev's inequality, it follows that for all $p = (p_1, \dots, p_k) \in \mathcal{P}$ and $\delta > 0$,
$$P_p\left(|\hat p_n - p| \geq \delta\right) = P_p\left(\sum_{j=1}^k (\hat p_{n,j} - p_j)^2 \geq \delta^2\right) \leq P_p\left(\bigcup_{j=1}^k \left\{(\hat p_{n,j} - p_j)^2 \geq \tfrac{\delta^2}{k}\right\}\right) \leq \sum_{j=1}^k P_p\left(|\hat p_{n,j} - p_j| \geq \tfrac{\delta}{\sqrt{k}}\right) \leq \frac{k^2}{4n\delta^2}.$$
Because $q$ is continuous and $\mathcal{P}$ is compact, $q$ is uniformly continuous on $\mathcal{P}$. It follows that for every $\epsilon > 0$ there exists a $\delta(\epsilon) > 0$ such that for all $p, p' \in \mathcal{P}$,
$$|p' - p| \leq \delta(\epsilon) \Rightarrow |q(p') - q(p)| \leq \epsilon.$$
It follows that
$$P_p\left(|\hat q_n - q(p)| \geq \epsilon\right) \leq P_p\left(|\hat p_n - p| \geq \delta(\epsilon)\right) \leq \frac{k^2}{4n\delta(\epsilon)^2},$$
and the result follows.
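The following short simulation is not part of the original notes; it is a minimal sketch of the phenomenon in Theorem 7.3. The cell probabilities $(0.5, 0.3, 0.2)$, the functional $q(p) = \max_j p_j$, the tolerance and the sample sizes are illustrative choices only.

# Illustrative sketch (assumed example): Monte Carlo check that the plug-in
# estimator q(p_hat) of a continuous functional of multinomial probabilities
# is consistent, in the spirit of Theorem 7.3.
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])          # true cell probabilities (assumed values)
q_true = p.max()                        # q(p) = max_j p_j is continuous on the simplex
eps, n_reps = 0.05, 2000

for n in [50, 200, 800, 3200]:
    counts = rng.multinomial(n, p, size=n_reps)     # n_reps samples of size n
    p_hat = counts / n                              # \hat p_n for each replicate
    q_hat = p_hat.max(axis=1)                       # plug-in estimator q(\hat p_n)
    freq = np.mean(np.abs(q_hat - q_true) >= eps)   # empirical P(|q_hat - q| >= eps)
    print(f"n = {n:5d}:  P(|q_hat - q| >= {eps}) = {freq:.3f}")
# The exceedance frequency should decrease towards 0 as n grows, which is the
# (uniform) consistency asserted by Theorem 7.3.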
The aim of the discussion that now follows is to establish Theorem 7.6: that, as $n \to +\infty$, the probability of existence of the ML estimator for an exponential family tends to 1, and the sequence of ML estimators from random samples of size $n$ is consistent. This requires Proposition 7.4 and Lemma 7.5, which is a corollary to Theorem 4.5.

Proposition 7.4. Let $X_1, \dots, X_n$ be i.i.d., each with state space $\mathcal{X}$, and let $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ be a regular family of probability distributions over $\mathcal{X}$. Let $g = (g_1, \dots, g_d)$ map $\mathcal{X}$ onto $\mathcal{Y} \subseteq \mathbb{R}^d$. Suppose $E_\theta[|g_j(X_1)|] < +\infty$ for $1 \leq j \leq d$ and all $\theta \in \Theta$. Let $m_j(\theta) = E_\theta[g_j(X_1)]$ and let $q(\theta) = h(m(\theta))$, where $h : \mathcal{Y} \to \mathbb{R}^p$ is a continuous function. Then
$$\hat q = h(\overline{g}) = h\left(\frac{1}{n}\sum_{i=1}^n g(X_i)\right)$$
is a consistent estimate of $q(\theta)$.

Proof. It follows from the weak law of large numbers that
$$\frac{1}{n}\sum_{i=1}^n g(X_i) \to_P E_P[g(X_1)].$$
It is straightforward to establish that if $\overline{Y}_n \to_P Y$ and $h$ is continuous, then $h(\overline{Y}_n) \to_P h(Y)$.

The following result is a corollary to Theorem 4.5.

Lemma 7.5. Suppose that $\mathcal{P} = \{P_\eta : \eta \in \mathcal{E}\}$, where $\mathcal{E}$, the natural parameter space, is open, and $\mathcal{P}$ is the canonical exponential family generated by $(h, T)$, where $T = (T_1, \dots, T_k)$, of rank $k$. Let $C_T$ denote the convex support of the distribution of $T$ under $P_\eta$ (which is the same for all $\eta \in \mathcal{E}$). Let $t_0 = E_\eta[T(X)]$. Then $\hat\eta$, the MLE, exists and is unique if and only if $t_0 \in C_T^0$, the interior of $C_T$.

Proof of Lemma 7.5. Recall Theorem 4.5: if there are $\delta > 0$ and $\epsilon > 0$ such that $t_0$ satisfies
$$\inf_{(c_1, \dots, c_k) : \sum_j c_j^2 = 1} P\left((c, T(X) - t_0) > \delta\right) > \epsilon,$$
then the MLE $\hat\eta$ exists, is unique, and is the solution to the equation $\dot A(\eta) = E_\eta[T(X)] = t_0$. The point $t_0$ belongs to $C^0$, the interior of a convex set $C$, if and only if for every $d \neq 0$, both $\{t : (d, t) > (d, t_0)\} \neq \emptyset$ and $\{t : (d, t) < (d, t_0)\} \neq \emptyset$. The equivalence of Equation (4.1) and Lemma 7.5 follows.

The main result, which is a consequence of Proposition 7.4 together with Lemma 7.5, may now be stated and proved.

Theorem 7.6. Let $\mathcal{P}$ be a canonical exponential family of rank $d$ generated by $T = (T_1, \dots, T_d)$. Let $\eta$ denote the natural parameter, $\mathcal{E}$ the natural parameter space and $A$ the log partition function. Suppose that $\mathcal{E}$ is open. Let $X_1, \dots, X_n$ be a random sample from $P_\eta \in \mathcal{P}$. Let $\hat\eta$ denote the MLE. Then

1. $P_\eta(\hat\eta_{MLE} \text{ exists}) \stackrel{n \to +\infty}{\longrightarrow} 1$.

2. $(\hat\eta_n)_{n \geq 1}$ is consistent.

Proof. It follows from Lemma 7.5 that $\hat\eta(X_1, \dots, X_n)$ exists if and only if $\overline{T}_n := \frac{1}{n}\sum_{j=1}^n T(X_j)$ belongs to the interior $C_T^0$ of the convex support of the distribution of $T$. If $\eta_0$ is the parameter value, then $E_{\eta_0}[T(X_1)]$ belongs to the interior of the convex support by Theorem 4.5, since $\eta_0$ solves the equation $\dot A(\eta_0) = t_0 = E_{\eta_0}[T(X_1)]$. Since $C_T^0$ is open, there exists a ball $S_\delta := \{t : |t - E_{\eta_0}[T(X_1)]| < \delta\} \subset C_T^0$. By the WLLN,
$$\frac{1}{n}\sum_{i=1}^n T(X_i) \to_{P_{\eta_0}} E_{\eta_0}[T(X_1)],$$
from which
$$P_{\eta_0}\left(\frac{1}{n}\sum_{i=1}^n T(X_i) \in C_T^0\right) \geq P_{\eta_0}\left(\left|\frac{1}{n}\sum_{i=1}^n T(X_i) - E_{\eta_0}[T(X_1)]\right| < \delta\right) \stackrel{n \to +\infty}{\longrightarrow} 1.$$
Since $\hat\eta$ is the solution to $\dot A(\eta) = \frac{1}{n}\sum_{i=1}^n T(X_i)$ whenever this equation has a solution, the probability that $\hat\eta$ exists tends to 1, and part 1. follows.

For part 2., by Theorem 3.7, the map $\eta \to \dot A(\eta)$ is $1$-$1$ and continuous on $\mathcal{E}$. It follows that the inverse $\dot A^{-1} : \dot A(\mathcal{E}) \to \mathcal{E}$ is continuous on $S_\delta$, and the result now follows by Proposition 7.4.
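A minimal numerical sketch of Theorem 7.6, not taken from the notes: the Bernoulli model with success probability $p$ is a canonical exponential family with natural parameter $\eta = \log\frac{p}{1-p}$ and $T(x) = x$, and the MLE of $\eta$ exists if and only if $\overline{T}_n$ lies strictly inside the convex support $[0,1]$. The value $p = 0.9$ and the sample sizes are arbitrary illustrative choices.

# Sketch (assumed Bernoulli example): probability that the canonical MLE exists
# tends to 1, and the MLE is consistent, as asserted by Theorem 7.6.
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.9
eta_true = np.log(p_true / (1 - p_true))
n_reps = 5000

for n in [5, 20, 100, 500]:
    x = rng.binomial(1, p_true, size=(n_reps, n))
    t_bar = x.mean(axis=1)
    exists = (t_bar > 0) & (t_bar < 1)                   # MLE exists iff 0 < T_bar < 1
    eta_hat = np.log(t_bar[exists] / (1 - t_bar[exists]))
    print(f"n = {n:4d}: P(MLE exists) = {exists.mean():.3f}, "
          f"mean |eta_hat - eta| = {np.mean(np.abs(eta_hat - eta_true)):.3f}")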
n i=1 T (Xi ), i=1 it follows that the probability that ηb exists tends to For part 2., by Theorem 3.7, the map η → Ȧ(η) is 1 - 1 and continuous on E. It follows that the inverse Ȧ−1 : Ȧ(E) → E is continuous on Sδ and the result now follows by Proposition 7.4. 7.1.2 Consistency of Minimum Contrast Estimators Let P = {Pθ : θ ∈ Θ} be a regular family, let ρ(x, θ) be a contrast function and let X = (X1 , . . . , Xn ) be a random sample from Pθ . Let θb be a minimum contrast estimate that minimises n ρn (X, θ) = 1X ρ(Xi , θ). n i=1 Recall that if ρ is a contrast function, then (by definition) D(θ0 , θ) := Eθ0 [ρ(X1 , θ)] is uniquely minimised at θ = θ0 for all θ0 ∈ Θ. 123 7.1. CONSISTENCY Theorem 7.7. Suppose ) ( n 1 X P θ0 sup 0 (ρ(Xi , θ) − D(θ0 , θ)) −→ n θ∈Θ (7.3) i=1 and inf {D(θ0 , θ) − D(θ0 , θ0 ) : |θ − θ0 | > ǫ} > 0 (7.4) ∀ǫ > 0 then θb is consistent. Proof Firstly, consider the set on which |θb − θ0 | > ǫ. This is contained in the set where |θ − θ0 | > ǫ| and ρn (X, θ) ≤ ρn (X, θ0 ). It follows that Let Pθ0 |θb − θ0 | ≥ ǫ ≤ Pθ0 inf A = inf ( n n 1 X n j=1 (ρ(Xj , θ) − ρ(Xj , θ0 )) : |θ − θ0 | ≥ ǫ 1X (ρ(Xi , θ) − ρ(Xi , θ0 )) : |θ − θ0 | ≥ ǫ n i=1 ) ≤ 0 . (7.5) and B = inf {D(θ0 , θ) − D(θ0 , θ0 ) : |θ − θ0 | ≥ ǫ} . n→+∞ From the hypotheses, Pθ0 (supθ |ρn (X, θ) − D(θ0 , θ)| > δ) −→ 0 and hence, using D(θ0 , θ) − D(θ0 , θ0 ) ≥ 0, that for all δ > 0, Pθ0 (A − B ≤ −δ) ≤ n→+∞ Pθ 0 inf θ:|θ−θ0 |≥ǫ (ρn (X, θ) − D(θ0 , θ)) − (ρn (X, θ0 ) − D(θ0 , θ0 )) ≤ −δ (7.6) −→ 0. Now choose δ = inf θ:|θ−θ0 |>ǫ (D(θ0 , θ) − D(θ0 , θ0 )). Then δ > 0 and from Equation (7.6), it follows directly that the right hand side of (7.5) tends to zero. The following simple and important corollary gives a condition under which the MLE is consistent. Corollary 7.8. Let Θ = {θ1 , . . . , θd } denote a finite parameter space. Suppose that max Eθj [| log p(X1 , θk )|] < +∞ j,k and suppose that the parametrisation is identifiable. Let θb denote the MLE of θ. Then Pθj θb = 6 θj → 0 for all j ∈ {1, . . . , d}. 124 CHAPTER 7. ASYMPTOTIC RESULTS Proof Since the parameter space is discrete and finite, it follows that there is an ǫ > 0 such that Pθj θb 6= θk = Pθj θb − θk ≥ ǫ ∀(j, k). Recall that the MLE estimator is the minimum contrast estimator with contrast function n 1X log p(Xj , θ). n i=1 By Shannon’s lemma 4.2, D(θ0 , θ) is minimised for θ = θ0 for all θ0 ∈ Θ. It follows that only equations (7.3) and (7.4) need to be checked. Equation (7.3) follows from the WLLN. Equation (7.4) follows from Shannon’s lemma. 7.2 The Delta Method Let X1 , . . . , Xn be a random sample from a parent distribution satisfying E[X1 ] = µ and V(X1 ) = σ 2 < +∞. The central limit theorem states that √ n→+∞ L( n(X − µ)) −→ N (0, σ 2 ). The delta method is simply the name given to the application of Taylor’s expansion theorem to obtain the distribution of functions of the sample average. Theorem 7.9 (The Delta Method). Let X1 , . . . , Xn be a random sample, where X1 has state space R, E[X1 ] = µ, V(X1 ) = σ 2 < +∞ and h : R → R a differentiable function. Then n→+∞ 2 √ (7.7) L n(h(X) − h(µ)) −→ N 0, h′ (µ) σ 2 The result follows from the following lemma. Lemma 7.10. Let {Un } be a sequence of real valued random variables and {an } a sequence of constants that satisfies an → +∞ as n → +∞. Suppose that n→+∞ 1. an (Un − u) −→ L V for some constant u ∈ R where V is a well defined random variable, 2. g : R → R is differentiable at u with derivative g ′ (u). 
Then n→+∞ an (g(Un ) − g(u)) −→ g ′ (u)V Proof of Lemma 7.10 Since an → +∞ as n → +∞, it follows that for every δ > 0, P(|Un − u| ≤ δ) → 1. From the definition of a derivative, it follows that for every ǫ > 0, there exists a δ > 0 such that |v − u| ≤ δ ⇒ |g(v) − g(u) − (v − u)g ′ (u)| ≤ ǫ|v − u|. (7.8) 125 7.2. THE DELTA METHOD From this, it follows that P g(Un ) − g(u) − g ′ (u)(Un − u) ≤ ǫ |Un − u| → 1, from which n→+∞ P an (g(Un ) − g(u)) − g ′ (u)(an (Un − u) ≤ ǫ |an (Un − u)| −→ 1. Since an (Un − u) →L V , the result follows. Proof of Theorem 7.9 This follows from the lemma by setting Un = X, an = n1/2 , u = µ and V ∼ N (0, σ 2 ). The delta method can be extended to situations where h : R → R is a twice differentiable function with h′ (µ) = 0, but h′′ (µ) 6= 0. Theorem 7.11 (Second order delta method). Let (Yn ) be a sequence of random variables that satisfy √ n(Yn − µ) →L N (0, σ 2 ). Let h be a function that is twice differentiable and satisfies h′ (µ) = 0, h′′ (µ) 6= 0. Then 2 n→+∞ σ n(h(Yn ) − h(µ)) −→ L h′′ (µ)V 2 where V ∼ χ21 . Proof Similar to the first order delta method; consider the second derivative and recall that if V ∼ χ21 , then V =L Z 2 where Z ∼ N (0, 1). The delta method extends to the multivariate setting. Firstly, Lemma 7.10 extends directly: Lemma 7.12. Let {U n } be random d-vectors and let {an } be a sequence of constants satisfying an → +∞ as n → +∞ and suppose that n→+∞ 1. an (U n − u) −→ L V where V is a random d-vector. (1) 2. g : Rd → Rp has a differential gp×d (u) at u. Then n→+∞ an (g(U n ) − g(u)) −→ L g (1) (u)V . Proof Similar to Lemma 7.10. From this, the multivariate version of the delta method can be stated and proved. Theorem 7.13 (Multivariate delta method). Let Y 1 , . . . , Y n be i.i.d. random d-vectors with well defined expected value µ and covariance matrix Σ. Let h : O → Rp where O is an open subset of Rd . Suppose that h has a well defined differential h(1) (µ), where (1) hij (µ) = ∂hi (µ). ∂xj 126 CHAPTER 7. ASYMPTOTIC RESULTS Then h(Y ) = h(µ) + h(1) (µ) Y − µ + oP (n−1/2 ) (7.9) where oP (n−1/2 ) denotes a quantity V that satisfies: n→+∞ P(n1/2 |V | > ǫ) −→ 0 ∀ǫ > 0 and n→+∞ (7.10) −→ N (0, h(1) (µ)Σh(1)t (µ)). √ The proof follows in the same way as before; let an = n, U n = Y , u = µ and V ∼ N (0, Σ). Then √ n h(Y ) − h(µ) h(1) (µ)V ∼ N (0, h(1) (µ)Σh(1)t (µ)) as required. 7.3 Asymptotic Results for Maximum Likelihood The following result gives the asymptotic distribution for the maximum likelihood estimator of the canonical parameters. Theorem 7.14. Let P be a canonical exponential family of rank d generated by T and suppose that E (the natural parameter space) is open. Let X1 , . . . , Xn be a random sample from Pη ∈ P. Let ηb be the MLE if it exists and equal to a constant vector c otherwise. Then 1. n ηb = η + Ä 2. −1 (η) 1X T (Xi ) − Ȧ(η) n i=1 ! + oPη (n−1/2 ) √ n→+∞ L( n(b η − η)) −→ N (0, I −1 (η)) where oPη (n−1/2 ) denotes a quantity Vn such limn→+∞ Pη (n1/2 |Vn | > ǫ) = 0 for all ǫ > 0. √ Remark The asymptotic variance matrix I −1 (η) of n (b η − η) is the matrix that gives the Cramér Rao lower bound on variances of unbiased estimators of linear combinations of (η1 , . . . , ηd ). This is the asymptotic efficiency property of the ML estimator for exponential families. Proof This is an immediate consequence of the multivariate delta method. Firstly, let n T = 1X T (Xj ), n j=1 then Pη T ∈ Ȧ(E) Theorem 7.13. → 1 and hence Pη ηb = Ȧ−1 (T ) → 1. Now set h = Ȧ−1 and µ = Ȧ(η) in 127 7.3. 
7.3 Asymptotic Results for Maximum Likelihood

The following result gives the asymptotic distribution of the maximum likelihood estimator of the canonical parameters.

Theorem 7.14. Let $\mathcal{P}$ be a canonical exponential family of rank $d$ generated by $T$, and suppose that $\mathcal{E}$ (the natural parameter space) is open. Let $X_1, \dots, X_n$ be a random sample from $P_\eta \in \mathcal{P}$. Let $\hat\eta$ be the MLE if it exists, and equal to a constant vector $c$ otherwise. Then

1. $\hat\eta = \eta + \ddot A^{-1}(\eta)\left(\frac{1}{n}\sum_{i=1}^n T(X_i) - \dot A(\eta)\right) + o_{P_\eta}(n^{-1/2})$,

2. $\mathcal{L}\left(\sqrt{n}(\hat\eta - \eta)\right) \stackrel{n \to +\infty}{\longrightarrow} N(0, I^{-1}(\eta))$,

where $o_{P_\eta}(n^{-1/2})$ denotes a quantity $V_n$ such that $\lim_{n \to +\infty} P_\eta(n^{1/2}|V_n| > \epsilon) = 0$ for all $\epsilon > 0$.

Remark. The asymptotic variance matrix $I^{-1}(\eta)$ of $\sqrt{n}(\hat\eta - \eta)$ is the matrix that gives the Cramér-Rao lower bound on the variances of unbiased estimators of linear combinations of $(\eta_1, \dots, \eta_d)$. This is the asymptotic efficiency property of the ML estimator for exponential families.

Proof. This is an immediate consequence of the multivariate delta method. Firstly, let $\overline{T} = \frac{1}{n}\sum_{j=1}^n T(X_j)$; then $P_\eta(\overline{T} \in \dot A(\mathcal{E})) \to 1$ and hence $P_\eta(\hat\eta = \dot A^{-1}(\overline{T})) \to 1$. Now set $h = \dot A^{-1}$ and $\mu = \dot A(\eta)$ in Theorem 7.13.

The following result is from Analysis 2: let $h : \mathbb{R}^d \to \mathbb{R}^d$ be $1$-$1$ and continuously differentiable on an open neighbourhood $O$ of $x$, and suppose that $Dh(x) := \left(\frac{\partial h_i}{\partial x_j}\right)_{d \times d}$ is non-singular. Then $h^{-1} : h(O) \to O$ is differentiable at $y = h(x)$ and $Dh^{-1}(y) = (Dh(x))^{-1}$. By definition, $D\dot A = \ddot A$, so in Theorem 7.13, $h^{(1)}(\mu) = \ddot A^{-1}(\eta)$. The first statement of the theorem now follows from (7.9). For the second part, note that the covariance matrix of $T(X_1)$ is $\Sigma = \ddot A(\eta)$, from which
$$h^{(1)}(\mu)\Sigma h^{(1)t}(\mu) = \ddot A^{-1}(\eta)\ddot A(\eta)\ddot A^{-1}(\eta) = \ddot A^{-1}(\eta) = I^{-1}(\eta),$$
since the Fisher information of the canonical exponential family is $I(\eta) = \ddot A(\eta)$. Statement 2. now follows from (7.10).

Example 7.1 (Normal Random Sample). Let $X_1, \dots, X_n$ be a $N(\mu, \sigma^2)$ random sample. Note that
$$p(x_1, \dots, x_n; \mu, \sigma) = \frac{1}{(2\pi)^{n/2}}\exp\left\{\frac{\mu}{\sigma^2}\sum_{j=1}^n x_j - \frac{1}{2\sigma^2}\sum_{j=1}^n x_j^2 - n\frac{\mu^2}{2\sigma^2} - n\log\sigma\right\}.$$
Let $\eta_1 = \frac{\mu}{\sigma^2}$ and $\eta_2 = -\frac{1}{2\sigma^2}$; then the model can be re-written in canonical form as
$$p(x_1, \dots, x_n; \eta) = \frac{1}{(2\pi)^{n/2}}\exp\left\{n\left(\eta_1\overline{x} + \eta_2\overline{x^2} + \frac{\eta_1^2}{4\eta_2} + \frac{1}{2}\log(-2\eta_2)\right)\right\}.$$
Then $\overline{T} = (\overline{T}_1, \overline{T}_2)$, where $\overline{T}_1 = \overline{X}$ and $\overline{T}_2 = \overline{X^2}$, is a sufficient statistic for the parameters, and the log partition function is
$$A(\eta) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\left(\log 2 + \log(-\eta_2)\right),$$
so that
$$\ddot A(\eta) = \frac{1}{2\eta_2^2}\begin{pmatrix} -\eta_2 & \eta_1 \\ \eta_1 & 1 - \frac{\eta_1^2}{\eta_2}\end{pmatrix}.$$
Since $E_\eta[\overline{T}_1] = \mu$ and $E_\eta[\overline{T}_2] = \sigma^2 + \mu^2$, it follows from the central limit theorem that
$$\sqrt{n}\begin{pmatrix}\overline{T}_1 - \mu \\ \overline{T}_2 - (\mu^2 + \sigma^2)\end{pmatrix} \stackrel{n \to +\infty}{\longrightarrow}_{\mathcal{L}} N\left(\begin{pmatrix}0\\0\end{pmatrix}, \ddot A(\eta)\right).$$
The maximum likelihood estimators for the normal are
$$\hat\mu = \overline{X} = \overline{T}_1 \quad \text{and} \quad \hat\sigma^2 = \frac{1}{n}\sum_{j=1}^n (X_j - \overline{X})^2 = \overline{T}_2 - \overline{T}_1^2.$$
It follows that $\hat\eta_1 = \frac{\hat\mu}{\hat\sigma^2}$ and $\hat\eta_2 = -\frac{1}{2\hat\sigma^2}$. By the preceding theorem,
$$\sqrt{n}\begin{pmatrix}\hat\eta_1 - \eta_1 \\ \hat\eta_2 - \eta_2\end{pmatrix} \stackrel{n \to +\infty}{\longrightarrow}_{\mathcal{L}} N\left(0, I^{-1}(\eta)\right).$$
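The following simulation sketch checks Theorem 7.14 for Example 7.1; it is not part of the notes. The parameter values $\mu = 1$, $\sigma = 2$ and the replication counts are arbitrary, and the covariance matrix of $T(X) = (X, X^2)$ is computed from the standard normal moment formulas.

# Sketch: empirical covariance of sqrt(n)(eta_hat - eta) versus I^{-1}(eta) = Adotdot(eta)^{-1}.
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 1.0, 2.0
eta = np.array([mu / sigma**2, -1 / (2 * sigma**2)])
# Covariance matrix of T(X) = (X, X^2) for a normal observation, i.e. Adotdot(eta):
Sigma = np.array([[sigma**2,            2 * mu * sigma**2],
                  [2 * mu * sigma**2,   4 * mu**2 * sigma**2 + 2 * sigma**4]])
n, n_reps = 400, 5000

etas = np.empty((n_reps, 2))
for r in range(n_reps):
    x = rng.normal(mu, sigma, size=n)
    s2 = x.var()                                   # MLE of sigma^2 (divides by n)
    etas[r] = [x.mean() / s2, -1 / (2 * s2)]       # (eta1_hat, eta2_hat)

emp_cov = np.cov(np.sqrt(n) * (etas - eta), rowvar=False)
print("empirical covariance of sqrt(n)(eta_hat - eta):\n", np.round(emp_cov, 3))
print("theoretical I^{-1}(eta):\n", np.round(np.linalg.inv(Sigma), 3))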
7.4 Asymptotic Distribution of Minimum Contrast Estimators

The result for the asymptotic distribution of ML estimators of the canonical parameters of exponential families may be extended to a large class of minimum contrast estimators. Let $\theta \in \Theta \subseteq \mathbb{R}^d$, where $\theta = (\theta_1, \dots, \theta_d)$. Let $X = (X_1, \dots, X_n)$ and let $\rho_n(X, \theta)$ be a contrast function based on the random sample, of the form
$$\rho_n(X, \theta) = \frac{1}{n}\sum_{j=1}^n \rho(X_j, \theta), \qquad (7.11)$$
where $\rho$ is a contrast function for a single observation. For example, take $\rho(x, \theta) = -\log p(x, \theta)$ for maximum likelihood estimation, so that $\rho_n$ is the negative log-likelihood scaled by $\frac{1}{n}$. Assume the following:

1. $\rho_n$ is differentiable in $\theta_j$ for each $j = 1, \dots, d$. Let $\hat\theta_n$ denote the minimum contrast estimate; that is, $\hat\theta_n$ satisfies
$$\frac{\partial \rho_n}{\partial \theta_j}(X, \hat\theta_n) = 0, \qquad j = 1, \dots, d. \qquad (7.12)$$
In the case of $\rho_n$ given by Equation (7.11) with $\rho = -\log p$, this is the maximum likelihood estimate.

2. For each $j$ and each $\theta \in \Theta$,
$$E_\theta\left[\frac{\partial \rho_n}{\partial \theta_j}(X, \theta)\right] = 0 \qquad (7.13)$$
and
$$E_\theta\left[|\nabla_\theta \rho_n(X, \theta)|^2\right] < +\infty, \qquad (7.14)$$
where $|\cdot|$ denotes the Euclidean norm.

3. $\rho_n$ is twice differentiable in $\theta$ and satisfies
$$\sum_{j,k} E_\theta\left[\left|\frac{\partial^2 \rho_n}{\partial\theta_j \partial\theta_k}(X, \theta)\right|\right] < +\infty \qquad \forall \theta \in \Theta,$$
and the matrix with entries $E_\theta\left[\frac{\partial^2 \rho_n}{\partial\theta_i \partial\theta_j}(X, \theta)\right]$ is non-singular for each $\theta \in \Theta$.

4. $\hat\theta_n \to_{P_\theta} \theta$ for each $\theta \in \Theta$.

For the fourth of these, in the case of exponential families of full rank, where $\theta = \theta(\eta)$ for a continuous $1$-$1$ mapping $\theta$, $\hat\theta_n \to_{P_\theta} \theta$ by virtue of Theorem 7.14.

Theorem 7.15. Let $\mathcal{P} = \{P_\theta : \theta \in \Theta \subseteq \mathbb{R}^d\}$ be a regular parametric family. Let $X = (X_1, \dots, X_n)$ be a random sample. Suppose that conditions 1., 2., 3. and 4. hold. Let $J$ be the matrix with entries
$$J_{jk}(\theta) = E_\theta\left[\frac{\partial^2 \rho}{\partial\theta_j \partial\theta_k}(X_1, \theta)\right]$$
and let $K$ be the matrix with entries
$$K_{jk}(\theta) = E_\theta\left[\frac{\partial \rho}{\partial\theta_j}(X_1, \theta)\,\frac{\partial \rho}{\partial\theta_k}(X_1, \theta)\right].$$
Then the minimum contrast estimate satisfies
$$\hat\theta_n = \theta - J^{-1}(\theta)\nabla\rho_n(X, \theta) + o_{P_\theta}(n^{-1/2}),$$
so that
$$\mathcal{L}\left(\sqrt{n}(\hat\theta_n - \theta)\right) \stackrel{n \to +\infty}{\longrightarrow} N\left(0, J^{-1}(\theta)K(\theta)J^{-1}(\theta)\right).$$
In the case of the maximum likelihood estimate, $I = J = K$, so that
$$\mathcal{L}\left(\sqrt{n}(\hat\theta_n - \theta)\right) \stackrel{n \to +\infty}{\longrightarrow} N\left(0, I^{-1}(\theta)\right).$$

Proof. Since $\hat\theta_n$ satisfies $\nabla\rho_n(X, \hat\theta_n) = 0$, Taylor's expansion theorem gives, for each $k$,
$$\frac{\partial\rho_n}{\partial\theta_k}(X, \theta) = \frac{\partial\rho_n}{\partial\theta_k}(X, \hat\theta_n) + \sum_j \frac{\partial^2\rho_n}{\partial\theta_k\partial\theta_j}(X, \theta_n^*)\left(\theta_j - \hat\theta_{n,j}\right) = \sum_j \frac{\partial^2\rho_n}{\partial\theta_k\partial\theta_j}(X, \theta_n^*)\left(\theta_j - \hat\theta_{n,j}\right),$$
where $|\theta_{n,j}^* - \theta_j| \leq |\hat\theta_{n,j} - \theta_j|$ for each $j$. It follows from assumption 4. that $\theta_n^* \to_{P_\theta} \theta$ and hence, from the WLLN, that
$$\frac{\partial^2\rho_n}{\partial\theta_j\partial\theta_k}(X, \theta_n^*) \to_{P_\theta} E_\theta\left[\frac{\partial^2\rho}{\partial\theta_j\partial\theta_k}(X_1, \theta)\right] = J_{jk}(\theta).$$
Inverting the expansion gives $\hat\theta_n = \theta - J^{-1}(\theta)\nabla\rho_n(X, \theta) + o_{P_\theta}(n^{-1/2})$. Since $E_\theta\left[\frac{\partial\rho_n}{\partial\theta_k}(X, \theta)\right] = 0$, it follows that $K(\theta)$ is the covariance matrix of $\nabla_\theta\rho(X_1, \theta)$ and hence, from the central limit theorem, that
$$\sqrt{n}\,\nabla\rho_n(X, \theta) \stackrel{n \to +\infty}{\longrightarrow}_{\mathcal{L}} N\left(0, K(\theta)\right).$$
If $V \sim N(0, K(\theta))$, then $J^{-1}(\theta)V \sim N\left(0, J^{-1}(\theta)K(\theta)J^{-1}(\theta)\right)$, from which it follows that
$$\sqrt{n}(\hat\theta_n - \theta) \stackrel{n \to +\infty}{\longrightarrow}_{\mathcal{L}} N\left(0, J^{-1}(\theta)K(\theta)J^{-1}(\theta)\right).$$
The result for maximum likelihood follows directly, since for maximum likelihood $I = J = K$.
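A minimal sketch of the sandwich formula in Theorem 7.15, not from the notes: for the quadratic contrast $\rho(x, \theta) = (x - \theta)^2$ and a location parameter, $J = 2$ and $K = 4\,V(X_1)$, so $J^{-1}KJ^{-1} = V(X_1)$, and the plug-in sandwich standard error should match the Monte Carlo spread of $\hat\theta_n$. The exponential data-generating distribution and all numerical values are assumptions chosen for illustration.

# Sketch: sandwich standard error J^{-1} K J^{-1} / n estimated from data versus
# the Monte Carlo standard deviation of the minimum contrast estimator.
import numpy as np

rng = np.random.default_rng(5)
mean_true = 2.0                     # data drawn from an Exponential with this mean
n, n_reps = 200, 5000

theta_hats, sandwich_ses = [], []
for _ in range(n_reps):
    x = rng.exponential(scale=mean_true, size=n)
    theta_hat = x.mean()                             # minimiser of (1/n) sum (x_i - theta)^2
    J_hat = 2.0                                      # (1/n) sum d^2 rho / d theta^2
    K_hat = np.mean(4 * (x - theta_hat)**2)          # (1/n) sum (d rho / d theta)^2
    sandwich_ses.append(np.sqrt(K_hat / J_hat**2 / n))
    theta_hats.append(theta_hat)

print("Monte Carlo sd of theta_hat:      ", round(float(np.std(theta_hats)), 4))
print("average sandwich standard error:  ", round(float(np.mean(sandwich_ses)), 4))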