Consistency and asymptotic normality

Class notes for Econ 842

Robert de Jong*
March 2006

*Department of Economics, Ohio State University, 429 Arps Hall, Columbus, OH 43210, email [email protected]

1 Stochastic convergence

The asymptotic theory of minimization estimators relies on various theorems from mathematical statistics. The objective of this section is to explain the main theorems that underpin the asymptotic theory for minimization estimators.

1.1 Infimum and supremum, and the limit operations

1.1.1 Supremum and infimum

The supremum of a set A, if it exists, is the smallest number y for which x ≤ y for every x ∈ A. Analogously, the infimum of a set A, if it exists, is the largest number y for which x ≥ y for every x ∈ A. For an open set such as A = (0, 1), the supremum sup A and the infimum inf A are well-defined as 1 and 0, respectively. If the supremum is attained for some element of A, the supremum is also called the maximum; analogously, the minimum of a set is defined as the infimum if it is attained for some element of A. Therefore, the maximum and minimum of A = (0, 1) are not defined, because sup A = 1 and inf A = 0 are not elements of (0, 1) and are therefore not attained. If we define the extended real line as R ∪ {−∞, +∞}, then sup and inf are always defined on the extended real line.

1.1.2 The limit operation

For a deterministic sequence x_n, we say that it "converges to x", notation lim_{n→∞} x_n = x, if for all ε > 0 there exists an N_ε such that |x_n − x| < ε for all n ≥ N_ε. The "limsup" (superior limit, "limes superior" in Latin) is defined as

    lim sup_{n→∞} x_n = inf_{n≥1} sup_{m≥n} x_m.

The limsup is the "eventual upper limit" of a sequence. For example, consider x_n = n^{−1}. Then

    lim sup_{n→∞} x_n = inf_{n≥1} sup_{m≥n} m^{−1} = inf_{n≥1} n^{−1} = 0,

just as one would expect. From the earlier discussion, it is now clear that the limsup is always defined on the extended real line. The limsup of a sequence is its "eventual upper bound", which exists even when the limit does not.
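These tail-supremum definitions can be checked numerically on a truncated sequence. The sketch below (assuming numpy is available) approximates the tail suprema and infima of x_n = (−1)^n + 1/n, a sequence with limsup 1 and liminf −1 but no limit:

```python
import numpy as np

n = np.arange(1, 100001)
x = (-1.0) ** n + 1.0 / n   # limsup = 1, liminf = -1, no limit

# sup_{m >= start} x_m and inf_{m >= start} x_m on the finite grid;
# the tail sup decreases toward 1 while the tail inf sits near -1
for start in (1, 10, 1000):
    tail = x[start - 1:]
    print(start, tail.max(), tail.min())
```

The tail suprema are nonincreasing in the starting index and the tail infima are nondecreasing, which is why the inf over n of the tail sups (and the sup over n of the tail infs) are in fact limits.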
For example, lim sup_{n→∞} sin(n) = 1, while the limit is obviously undefined. Analogously, we define

    lim inf_{n→∞} x_n = sup_{n≥1} inf_{m≥n} x_m.

For x_n = n^{−1}, we obtain

    lim inf_{n→∞} x_n = sup_{n≥1} inf_{m≥n} m^{−1} = sup_{n≥1} 0 = 0,

as one would expect from an "eventual lower limit". The limit of a sequence exists if and only if its limsup and liminf are equal. If, after noting that log(1 + x) ≤ x, we read something like

    lim_{n→∞} n² log(Φ(n)) ≤ lim_{n→∞} n²(Φ(n) − 1) = 0,

where Φ(·) denotes the standard normal distribution function, we should realize the following. In principle, when writing down the left-hand side expression, it is possible for this expression to be undefined. The correct interpretation of the above equation is that the inequality provides an upper bound if the expression exists; formally, however, the existence of the left-hand side remains unproven. The assertion that the limsup is less than or equal to zero, on the other hand, is always true, because the limsup is always defined on the extended real line.

1.2 Almost sure convergence, convergence in probability, and convergence in distribution

1.2.1 Definitions of stochastic convergence concepts

While for a deterministic sequence x_n it is relatively straightforward to define convergence of x_n to a number x, the situation becomes more complicated when X_n and/or X is allowed to be a random variable. There are four types of convergence for random variables that are used frequently in the econometric literature: convergence in distribution, convergence in probability, convergence in L^p, and almost sure convergence. We say that X_n converges to X in distribution if

    lim_{n→∞} P(X_n ≤ y) = P(X ≤ y)

for all y at which P(X ≤ y) is continuous in y. To see why we want this requirement to hold only for continuity points of P(X ≤ y), consider X_n = n^{−1}. This is a deterministic sequence, and we should hope that it converges in distribution to X = 0.
However, setting y = 0,

    P(X_n ≤ 0) = P(n^{−1} ≤ 0) = 0 ≠ P(X ≤ 0) = 1.

Because the convergence in distribution requirement excludes discontinuity points, however, X_n = n^{−1} does converge in distribution to zero. X_n converges in probability to X if for all δ > 0,

    lim_{n→∞} P(|X_n − X| > δ) = 0.

We typically write "X_n →_p X" or "plim_{n→∞} X_n = X" to denote convergence in probability. X_n is said to converge almost surely to X if

    P(lim_{n→∞} X_n = X) = 1,

which is equivalent to the condition that for all δ > 0,

    lim_{n→∞} P(sup_{m≥n} |X_m − X| > δ) = 0.

We then write "X_n →_as X" or "lim_{n→∞} X_n = X almost surely". Convergence in L^p of X_n to X for p ≥ 1 holds if

    lim_{n→∞} E|X_n − X|^p = 0.

1.2.2 Differences between stochastic convergence concepts

There are subtle differences among these convergence concepts; convergence in distribution, however, is in some ways in a different category from the other stochastic convergence concepts. The important thing to realize about convergence in distribution is that if X_n converges in distribution to X, and X is a standard normally distributed random variable, then X_n also converges in distribution to Y, where Y is any other random variable that is also standard normally distributed. For example, a sequence X_n of i.i.d. N(0, 1) random variables converges in distribution to a N(0, 1) random variable as n → ∞. This may be counterintuitive, because X_n will never get "closer" to any particular random variable; the important thing to realize is that only the distribution of X_n needs to converge for convergence in distribution. While convergence in distribution only says something about distributions, convergence in probability, almost sure convergence, and convergence in L^p are statements implying that X_n and X are asymptotically "close". Almost sure convergence implies convergence in probability, as can easily be seen from the definitions above. Convergence in L^p implies convergence in probability.
This follows from the Markov inequality P(Y > δ) ≤ δ^{−1} E|Y|, because

    P(|X_n − X| > δ) = P(|X_n − X|^p > δ^p) ≤ δ^{−p} E|X_n − X|^p.

Convergence in probability implies convergence in distribution, and convergence in distribution to a constant implies convergence in probability to that constant; I will leave this unproven here. No other relations between these concepts hold without further assumptions.

1.2.3 An odd example

A classical example of a sequence X_n that converges in probability to zero but not in L^p is the following. Let X_n be a random variable that takes the value exp(n) with probability 1/n and the value zero otherwise. Then for all δ > 0,

    lim_{n→∞} P(|X_n| > δ) ≤ lim_{n→∞} P(X_n = exp(n)) = lim_{n→∞} n^{−1} = 0,

so X_n converges in probability to 0. However, for any p ≥ 1,

    E|X_n|^p = exp(np) P(X_n = exp(n)) = n^{−1} exp(np) → ∞,

so X_n does not converge in L^p. Another example where convergence in probability holds but convergence in L^p does not is X_n = Y/n, where Y is Cauchy distributed. In that case we have convergence in probability, but not in L^p, because moments of order 1 and higher of the Cauchy distribution do not exist.

If in addition the X_n of the first example are independent across n, we also have an example of a sequence that converges in probability, but not almost surely. This is because while X_n →_p 0, sup_{m≥n} X_m does not converge to zero in probability. After all, sup_{m≥n} X_m ≥ exp(n) if one of the X_m with m ≥ n is nonzero; that is, sup_{m≥n} X_m ≤ 1 only if X_m = 0 for all m ≥ n. Therefore, for all 0 < δ < 1,

    P(sup_{m≥n} |X_m| ≤ δ) ≤ P(X_m = 0 for all m ≥ n)
                           = ∏_{m=n}^{∞} P(X_m = 0) = ∏_{m=n}^{∞} (1 − 1/m)
                           = exp(∑_{m=n}^{∞} log(1 − 1/m)) = 0.

To show the last equality, note that because log(1 + x) ≤ x for x > −1,

    lim_{K→∞} exp(∑_{m=n}^{K} log(1 − 1/m)) ≤ lim_{K→∞} exp(−∑_{m=n}^{K} 1/m) = 0,

because m^{−1} is not summable from 1 to infinity. This implies that X_n converges in probability to zero, but not almost surely.

1.2.4 Averages

Less exotic statistics that we will encounter in econometrics are averages of i.i.d.
or weakly dependent random variables. Consider the following simple example. Let

    X_n = n^{−1/2} ∑_{i=1}^{n} (z_i − Ez_i),

where the z_i are i.i.d. and Ez_i² < ∞. Then X_n converges in distribution to a normally distributed random variable. For each deterministic sequence a_n such that lim_{n→∞} a_n = 0, we have a_n X_n →_p 0. However, it turns out that a_n X_n →_as 0 is only true if a_n (log log n)^{1/2} → 0; this follows from the so-called law of the iterated logarithm.

1.3 The law of large numbers

1.3.1 Statement of the law of large numbers

Consistency of estimators is derived using a law of large numbers (LLN). The LLN asserts that

    n^{−1} ∑_{i=1}^{n} (z_i − Ez_i)

converges to zero. It can be shown that if E|z_i| < ∞ and the z_i are i.i.d. (independent and identically distributed), then the above expression converges to zero almost surely (and therefore also in probability); the almost sure version is Kolmogorov's strong law of large numbers. If E|z_i| is not finite, the LLN need not hold; a counterexample is the Cauchy distribution, for which z̄_n is distributed identically to z_i. If convergence to zero holds almost surely, we speak of a strong law of large numbers, while if the convergence to zero is in probability, we refer to the result as a weak law of large numbers.

1.3.2 Weak LLN under independence

If z_i has mean zero, is i.i.d., and Ez_i² < ∞, then

    V(n^{−1} ∑_{i=1}^{n} z_i) = n^{−2} ∑_{i=1}^{n} V(z_i) = n^{−1} V(z_i) → 0,

so a weak law of large numbers holds for z_i in this case. The condition Ez_i² < ∞ can be weakened to E|z_i| < ∞ by a truncation argument. This can be done as follows. Define z̄_{iK} = z_i I(|z_i| > K) and z_{iK} = z_i I(|z_i| ≤ K). Then

    E|n^{−1} ∑_{i=1}^{n} z_i| ≤ E|n^{−1} ∑_{i=1}^{n} (z_{iK} − Ez_{iK})| + E|n^{−1} ∑_{i=1}^{n} (z̄_{iK} − Ez̄_{iK})|
        ≤ (E|n^{−1} ∑_{i=1}^{n} (z_{iK} − Ez_{iK})|²)^{1/2} + 2n^{−1} ∑_{i=1}^{n} E|z̄_{iK}|
        ≤ (n^{−1} V(z_{iK}))^{1/2} + 2E|z_i|I(|z_i| > K)
        ≤ (n^{−1} K²)^{1/2} + 2 ∫_{|z|>K} |z| dF(z),

and by taking first the limit as n → ∞ and then the limit as K → ∞, the convergence in L¹ follows.
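Numerically, the LLN is easy to observe. A minimal sketch (assuming numpy is available) uses Student t draws with 2 degrees of freedom, which satisfy E|z_i| < ∞ but have infinite variance, so Kolmogorov's LLN applies even though the variance-based argument above does not:

```python
import numpy as np

rng = np.random.default_rng(0)

# t(2) draws: mean zero and E|z_i| < infinity, but Var(z_i) = infinity
for n in (10**2, 10**4, 10**6):
    z = rng.standard_t(df=2, size=n)
    print(n, abs(z.mean()))   # sample mean drifts toward Ez_i = 0
```

With infinite variance the convergence is slower than the usual n^{−1/2} rate, but the sample mean still shrinks toward zero as n grows.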
To prove Kolmogorov's strong law of large numbers is much more involved; it typically requires a result known as the three series theorem. I will not pursue that here.

1.3.3 Discussion

There exist many formulations of the LLN that differ from the above. For example, if the z_i are uncorrelated, Ez_i² < ∞, and

    Var(n^{−1} ∑_{i=1}^{n} (z_i − Ez_i)) = n^{−2} ∑_{i=1}^{n} Var(z_i) → 0,

then the weak LLN also holds, in spite of the fact that the i.i.d. assumption has not been made. Also, the independence assumption can be relaxed further to allow for random variables z_i that have "fading memory" properties. To illustrate this, note that in general

    Var(n^{−1} ∑_{i=1}^{n} (z_i − Ez_i)) = n^{−2} ∑_{i=1}^{n} ∑_{s=1}^{n} cov(z_i, z_s)
        ≤ 2n^{−2} ∑_{i=1}^{n} ∑_{s=i}^{n} |cov(z_i, z_s)|
        ≤ 2n^{−2} ∑_{i=1}^{n} ∑_{m=0}^{∞} |cov(z_i, z_{i+m})|,

and if the last expression converges to zero, then a weak LLN can be obtained as well. Often, alternatives to the classical LLN for i.i.d. variables involve assumptions on the existence of moments E|z_i|^{1+δ} for some δ > 0.

1.4 The central limit theorem

1.4.1 Statement of the central limit theorem

The central limit theorem (CLT) states that

    n^{−1/2} ∑_{i=1}^{n} (z_i − Ez_i) →_d N(0, Var(z_i)).

If the z_i are i.i.d. and Ez_i² < ∞, then the above result holds.

1.4.2 CLT and characteristic function

The characteristic function of a random variable X is defined as ψ(r) = E exp(irX), where i satisfies i² = −1. There is a one-to-one correspondence between characteristic functions and distribution functions, and because |ψ(·)| ≤ 1, showing that the characteristic function converges to the characteristic function of a normal distribution suffices to show convergence in distribution to a normal. The reasoning for proving a CLT for a mean-zero, i.i.d. sequence z_i with variance σ² then starts by observing that

    E exp(irn^{−1/2} ∑_{i=1}^{n} z_i) = ∏_{i=1}^{n} E exp(irn^{−1/2} z_i)

by independence.
By the Taylor series expansions exp(x) ≈ 1 + x + 0.5x² and log(1 + x) ≈ x, we have

    E exp(irn^{−1/2} ∑_{i=1}^{n} z_i) ≈ ∏_{i=1}^{n} E(1 + irn^{−1/2} z_i + 0.5(irn^{−1/2} z_i)²)
        = ∏_{i=1}^{n} (1 − 0.5r²n^{−1}σ²) = exp(∑_{i=1}^{n} log(1 − 0.5r²n^{−1}σ²))
        ≈ exp(−0.5r²σ²),

which is the characteristic function of the N(0, σ²) distribution. Of course, the analytical difficulty lies in justifying the "≈" signs in this reasoning.

1.4.3 Discussion

For the CLT, many alternative sets of conditions are possible. It is possible to allow for some heterogeneity of the distributions of the z_i, and the requirement of independence can also be relaxed to some type of "fading memory" assumption. "Heterogeneity of the distributions" of the z_i refers to the situation where the distributions F_i(x) of the z_i are not identical across i.

1.5 Moment conditions

Many articles and books in econometrics contain so-called moment conditions of the type

    E|z_i| < ∞,  or  lim sup_{n→∞} n^{−1} ∑_{i=1}^{n} E|z_i|² < ∞,

or assume that for some δ > 0, E|z_i|^{2+δ} < ∞. By now, you probably realize that these conditions typically originate from the verification of conditions needed to obtain a law of large numbers or central limit theorem. For example, n^{−1} ∑_{i=1}^{n} (z_i − Ez_i) →_as 0 if the z_i are i.i.d. and E|z_i| < ∞. So, if somewhere along the line we need the result n^{−1} ∑_{i=1}^{n} (z_i − Ez_i) →_as 0, we simply impose E|z_i| < ∞ as a regularity condition. The same holds for the central limit theorem. Finally, note that E|z_i|^p < ∞ is implied by E|z_i|^q < ∞ for some q > p; this is because for 1 ≤ p ≤ q,

    (E|X|^p)^{1/p} ≤ (E|X|^q)^{1/q}.

An example of a distribution for which moment conditions can fail to hold is the Student distribution. If z_i has a Student distribution with k degrees of freedom, then E|z_i|^k does not exist; remember that a t distribution with one degree of freedom is the Cauchy distribution.
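Before moving on, the CLT of Section 1.4 is easy to check by simulation. In the sketch below (assuming numpy is available), even markedly skewed i.i.d. draws such as unit exponentials produce a standardized sample mean that is approximately standard normal:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 400, 10000

# unit exponential draws: Ez_i = 1, Var(z_i) = 1, skewness 2
z = rng.exponential(scale=1.0, size=(reps, n))
t = np.sqrt(n) * (z.mean(axis=1) - 1.0)   # n^{1/2}(zbar_n - Ez_i)

# empirical P(t <= q) versus the standard normal cdf Phi(q)
for q, phi in ((-1.645, 0.05), (0.0, 0.5), (1.645, 0.95)):
    print(q, (t <= q).mean(), phi)
```

The small remaining discrepancies reflect the finite n (the skewness of the exponential enters at the n^{−1/2} order) and Monte Carlo noise.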
1.6 The dominated convergence theorem

A tool that is frequently used in asymptotic theory in econometrics is the dominated convergence theorem.

Theorem 1. If X_n →_p X and |X_n| ≤ Y a.s. for some random variable Y with EY < ∞, then

    lim_{n→∞} EX_n = EX.

This result is widely used in econometrics; notably, it is used in asymptotic normality arguments for minimization estimators when the mean value in the Taylor series expansion argument is replaced by its limit.

1.7 A joint convergence result

The following result plays a key role in asymptotic theory for justifying variance estimation:

Theorem 2. If X_n →_d X and Y_n →_p c for some constant c, and f(x, y) is continuous in both arguments, then f(X_n, Y_n) →_d f(X, c).

Using the above theorem, we can reason as follows. If n^{−1} ∑_{i=1}^{n} z_i² →_p σ² and n^{−1/2} ∑_{i=1}^{n} z_i →_d N(0, σ²), then

    (n^{−1/2} ∑_{i=1}^{n} z_i) / (n^{−1} ∑_{i=1}^{n} z_i²)^{1/2} →_d N(0, 1).

1.7.1 A beginner's mistake

However, it is not allowed to reason

    (n^{−1/2} ∑_{i=1}^{n} z_i) / (n^{−1} ∑_{i=1}^{n} z_i²)^{1/2} = n^{1/2} (n^{−1} ∑_{i=1}^{n} z_i) / (n^{−1} ∑_{i=1}^{n} z_i²)^{1/2} →_p n^{1/2} (0/σ) = 0,

as many beginners tend to do. You cannot find the continuous function f(·,·) that would allow you to do this trick; you would need f(x, y) = n^{1/2} x/y^{1/2}, but dependence of f on n is not allowed. The correct reasoning displayed previously assumed f(x, y) = x/y^{1/2}, and clearly, this is allowed.

1.8 The multivariate case

Often when convergence in distribution of a random vector is shown, a result known as the Cramér–Wold device is used. The Cramér–Wold device is a tool that can be used to reduce the problem of showing convergence in distribution of a vector to showing convergence in distribution of scalars. It asserts that for a vector X_n of dimension k, X_n →_d X if and only if for all k-vectors ξ, ξ′X_n →_d ξ′X. This implies that, using the Cramér–Wold device, the problem of showing asymptotic normality of a vector can be translated into a problem involving an application of the CLT for scalars.

1.9 Why asymptotic normality?
The reason econometricians are interested in establishing asymptotic normality of estimators is that this property allows us to justify inference procedures such as t-tests and F-tests. Of course, t-tests that are justified by asymptotic theory are really asymptotically standard normal, and F-tests that are justified by asymptotic theory are really asymptotically chi-squared. Asymptotic normality of a k-vector X_n, i.e. X_n →_d N(0, V), implies that if V̂ →_p V,

    X_{nj} / (V̂_{jj})^{1/2} →_d N(0, 1),

which implies the validity of t-values. A similar argument can be used to show that F-tests are asymptotically valid if we have asymptotic normality of our estimator.

1.10 Exercises

1. Assume that x_n is a nondecreasing, bounded deterministic sequence of scalars. Show that lim_{n→∞} x_n exists.

2. Assume that X_n is an increasing sequence of random variables such that X_n →_p X. Show that in this case, X_n →_as X also.

3. Assume that X_n and Y_n are sequences of random variables such that X_n →_p X and Y_n →_p Y. Show that X_n + Y_n →_p X + Y.

4. Assume that X_n and Y_n are sequences of random variables such that X_n →_d X and Y_n →_d Y. Find an example showing that X_n + Y_n →_d X + Y does not need to hold.

5. Assume that X_n is a sequence of random variables such that X_n →_d c for a constant c. Show that X_n →_p c also.

6. In Section 1.2.3, what happens to the "odd example" if the n^{−1} in its definition is replaced by n^{−2}?

7. Let the x_i be i.i.d. and N(0, 1) distributed. Using characteristic functions, show that n^{−1/2} ∑_{i=1}^{n} x_i is also N(0, 1) distributed.

8. Formally prove the dominated convergence theorem.

2 Asymptotics of the linear model: the scalar case

2.1 The consistency argument

In this section we will investigate the asymptotic behavior of least squares estimators.
The purpose is only to serve as an introduction to the more general case, where we will only assume that our estimator is some minimization estimator for which no closed-form solution is available. To understand the principles involved better, we will focus in this section on the case of a scalar regressor x_i. In the simple linear model y_i = θ₀x_i + ε_i, where x_i ∈ R, the closed-form solution for the least squares estimator is

    θ̂_n = argmin_{θ∈Θ} n^{−1} ∑_{i=1}^{n} (y_i − θx_i)² = (∑_{i=1}^{n} y_i x_i) / (∑_{i=1}^{n} x_i²).

The "argmin" is defined here as the value at which the function over which the argmin is taken is minimized. For example, argmin_{x∈R}(1 + x²) = 0. Note that this solution for θ̂_n was obtained by differentiation with respect to θ. Next, we note that, if the true model is correct, it is possible to write

    θ̂_n = θ₀ + (∑_{i=1}^{n} ε_i x_i) / (∑_{i=1}^{n} x_i²).

After this step, we have to make some assumption on the behavior of x_i: is it a deterministic or a stochastic sequence?

2.1.1 Consistency argument: the deterministic regressor case

First, we assume that x_i is deterministic. Then note that

    (∑_{i=1}^{n} ε_i x_i) / (∑_{i=1}^{n} x_i²) = (n^{−1} ∑_{i=1}^{n} ε_i x_i) / (n^{−1} ∑_{i=1}^{n} x_i²),

and therefore θ̂_n →_as θ₀ if

1. n^{−1} ∑_{i=1}^{n} ε_i x_i →_as 0;
2. n^{−1} ∑_{i=1}^{n} x_i² → Q, where Q > 0.

2.1.2 Consistency argument: the random regressor case

Under the assumption that x_i is random, sufficient conditions for θ̂_n →_as θ₀ are

1. n^{−1} ∑_{i=1}^{n} ε_i x_i →_as 0;
2. n^{−1} ∑_{i=1}^{n} x_i² →_as Q, where Q > 0.

Both conditions follow, by Kolmogorov's law of large numbers, from the conditions

1. (x_i, ε_i) is i.i.d.;
2. Ex_i² < ∞;
3. Eε_i x_i = 0;
4. E|ε_i x_i| < ∞.

In the case where ε_i and x_i are independent of each other, conditions 1-4 are implied by

1. (x_i, ε_i) is i.i.d.;
2. Ex_i² < ∞;
3. Eε_i = 0;
4. E|ε_i| < ∞.

2.2 The asymptotic normality argument

Finding the limit distribution of θ̂_n requires some more work than the simple consistency result. If we assume that the ε_i are i.i.d.
normal with variance σ² and x_i is deterministic, then

    (∑_{i=1}^{n} ε_i x_i) / (∑_{i=1}^{n} x_i²)

is distributed N(0, σ²(∑_{i=1}^{n} x_i²)^{−1}) for any sample size. Therefore, if n^{−1} ∑_{i=1}^{n} x_i² → Q, we get

    n^{1/2}(θ̂_n − θ₀) →_d N(0, σ²Q^{−1}).

The asymptotic version of this argument simply replaces the exact normality by asymptotic normality, and uses a CLT as a replacement for the exact normality assumption.

2.2.1 Asymptotic normality argument: the deterministic regressor case

Sufficient conditions for the asymptotic normality result n^{1/2}(θ̂_n − θ₀) →_d N(0, σ²Q^{−1}) are

1. n^{−1/2} ∑_{i=1}^{n} ε_i x_i →_d N(0, σ²Q);
2. n^{−1} ∑_{i=1}^{n} x_i² → Q, where Q > 0.

It is possible to use a central limit theorem for independent, but not identically distributed, random variables to establish condition 1.

2.2.2 Asymptotic normality argument: the random regressor case

At this point, our interest will be in relaxing the assumption that the x_i are deterministic. If we assume

1. (x_i, ε_i) is i.i.d.;
2. Eε_i x_i = 0;
3. Ex_i² < ∞ and Eε_i²x_i² < ∞;

then by the central limit theorem,

    n^{−1/2} ∑_{i=1}^{n} x_i ε_i →_d N(0, Eε_i²x_i²),

and in addition,

    n^{−1} ∑_{i=1}^{n} x_i² →_p Q,

and therefore by Theorem 2,

    (n^{−1/2} ∑_{i=1}^{n} x_i ε_i) / (n^{−1} ∑_{i=1}^{n} x_i²) →_d N(0, Q^{−2} Eε_i²x_i²).

Therefore,

    n^{1/2}(θ̂_n − θ₀) →_d N(0, Q^{−2}P), where P = Eε_i²x_i².

If we assume in addition that ε_i and x_i are independent, then P = Eε_i²x_i² = Eε_i² Ex_i² = σ²Q, and our result becomes

    n^{1/2}(θ̂_n − θ₀) →_d N(0, σ²Q^{−1}).

The above result also follows from assuming that E(ε_i²|x_i) = σ² (a homoskedasticity assumption). However, please note that asymptotic normality is also obtained without the homoskedasticity assumption. In the sections that follow, we will always assume that the x_i sequence is stochastic.
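The scalar least squares asymptotics above can be verified by simulation. The sketch below (assuming numpy is available) uses heteroskedastic errors with E(ε_i²|x_i) = x_i², so the Monte Carlo variance of n^{1/2}(θ̂_n − θ₀) should match the robust limit Q^{−2}P rather than σ²Q^{−1}:

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, n, reps = 2.0, 2000, 5000

stat = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)
    eps = np.abs(x) * rng.normal(size=n)   # E(eps_i^2 | x_i) = x_i^2
    y = theta0 * x + eps
    theta_hat = (y * x).sum() / (x * x).sum()   # scalar OLS
    stat[r] = np.sqrt(n) * (theta_hat - theta0)

# Q = Ex_i^2 = 1 and P = E eps_i^2 x_i^2 = Ex_i^4 = 3, so the
# limit distribution is N(0, Q^{-2} P) = N(0, 3), not N(0, sigma^2 Q^{-1}) = N(0, 1)
print(stat.mean(), stat.var())
```

Replacing eps with homoskedastic rng.normal(size=n) draws would instead produce a Monte Carlo variance near σ²Q^{−1} = 1, illustrating how the homoskedasticity assumption changes the limiting variance but not the asymptotic normality itself.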