Statistics 910, #11
Asymptotic Distributions in Time Series
Overview
Standard proofs that establish the asymptotic normality of estimators constructed from random samples (i.e., independent observations) no longer
apply in time series analysis. The usual version of the central limit theorem
(CLT) presumes independence of the summed components, and that’s not
the case with time series. This lecture shows that normality still rules for
asymptotic distributions, but the arguments have to be modified to allow
for correlated data.
1. Types of convergence
2. Distributions in regression (Th A.2, section B.1)
3. Central limit theorems (Th A.3, Th A.4)
4. Normality of covariances, correlations (Th A.6, Th A.7)
5. Normality of parameter estimates in ARMA models (Th B.4)
Types of convergence
Convergence The relevant types of convergence most often seen in statistics and probability are
• Almost surely, almost everywhere, with probability one, w.p. 1:
  $X_n \xrightarrow{a.s.} X$ :  $P\{\omega : \lim_n X_n = X\} = 1$.
• In probability, in measure:
  $X_n \xrightarrow{p} X$ :  $\lim_n P\{\omega : |X_n - X| > \epsilon\} = 0$ for every $\epsilon > 0$.
• In distribution, weak convergence, convergence in law:
  $X_n \xrightarrow{d} X$ :  $\lim_n P(X_n \le x) = P(X \le x)$ at continuity points $x$ of the limit distribution.
• In mean square or $\ell^2$, the variance goes to zero:
  $X_n \xrightarrow{m.s.} X$ :  $\lim_n E(X_n - X)^2 = 0$.
Connections Convergence almost surely (which is much like good old
fashioned convergence of a sequence) implies convergence in probability,
which in turn implies convergence in distribution:
  $\xrightarrow{a.s.} \;\Rightarrow\; \xrightarrow{p} \;\Rightarrow\; \xrightarrow{d}$
Convergence in distribution only implies convergence in probability if
the limiting distribution is a point mass (i.e., the r.v. converges to a constant).
The various types of convergence "commute" with sums, products, and
smooth functions.
Mean square convergence is a bit different from the others; it implies
convergence in probability,
  $\xrightarrow{m.s.} \;\Rightarrow\; \xrightarrow{p}$
which holds by Chebyshev's inequality.
Asymptotically normal (Defn A.5)
  $$\frac{X_n - \mu_n}{\sigma_n} \xrightarrow{d} N(0, 1) \quad\Leftrightarrow\quad X_n \sim AN(\mu_n, \sigma_n^2)$$
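Numerical illustration (not part of the original notes) The AN notation can be checked by simulation. The Python sketch below standardizes sample means of i.i.d. Exponential(1) draws, for which $\mu = \sigma = 1$, and compares a few empirical quantiles of $(\bar X_n - \mu)/(\sigma/\sqrt{n})$ with standard normal quantiles; the sample size, number of replications, and seed are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 10000
mu, sigma = 1.0, 1.0                       # mean and s.d. of Exponential(1)

# standardized sample means: (Xbar_n - mu) / (sigma / sqrt(n))
draws = rng.exponential(scale=1.0, size=(reps, n))
z = (draws.mean(axis=1) - mu) / (sigma / np.sqrt(n))

# empirical quantiles of z versus the corresponding N(0, 1) quantiles
for q, zq in [(0.05, -1.645), (0.50, 0.0), (0.95, 1.645)]:
    print(q, round(float(np.quantile(z, q)), 2), zq)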
Distributions in regression
Not i.i.d. The asymptotic normality of the slope estimates in regression
is not so obvious if the errors are not normal. Normality requires that
we can handle sums of independent, but not identically distributed
r.v.s.
Scalar This example and those that follow consider only scalar estimators to
avoid matrix manipulations that conceal the underlying ideas. The
text illustrates the use of the "Cramér-Wold device" for handling
vector-valued estimators.
Model In scalar form, we observe a sample of independent observations
that follow $Y_i = \alpha + \beta x_i + \epsilon_i$. Assume $Y$ and $\epsilon$ denote random variables,
the $x_i$ are fixed, and the deviations $\epsilon_i$ have mean 0 and variance $\sigma^2$.
Least squares estimators are $\hat\alpha = \bar Y - \hat\beta \bar x$ and
  $$\hat\beta = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2} = \frac{\mathrm{cov}_n(x, y)}{\mathrm{var}_n(x)} .$$
For future reference, write $SSx_n = \sum_{i=1}^n (x_i - \bar x)^2$.
Distribution of $\hat\beta$ We can avoid centering $y$ in the numerator and write
  $$\hat\beta = \frac{\sum_i (x_i - \bar x) y_i}{SSx_n}
            = \frac{\sum_i (x_i - \bar x)(x_i \beta + \epsilon_i)}{SSx_n}
            = \beta + \sum_i w_{ni} \epsilon_i \qquad (1)$$
where the key weights depend on $n$ and have the form
  $$w_{ni} = \frac{x_i - \bar x}{SSx_n} . \qquad (2)$$
Hence, we have written $\hat\beta - \beta$ as a weighted sum of random variables,
and it follows that
  $$E \hat\beta = \beta, \qquad \mathrm{Var}(\hat\beta) = \frac{\sigma^2}{\sum_i (x_i - \bar x)^2} .$$
To get the limiting distribution, we need a version of the CLT that
allows for unequal variances (Lindeberg CLT) and weights that change
with $n$.
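Numerical check (not from the notes) The weight representation (1)-(2) is easy to verify in Python; in this sketch the design $x$, the error scale, and the seed are arbitrary choices. The first print confirms that $\hat\beta - \beta$ equals $\sum_i w_{ni}\epsilon_i$ exactly, and the second compares the Monte Carlo variance of $\hat\beta$ with $\sigma^2/SSx_n$.

import numpy as np

rng = np.random.default_rng(1)
n, alpha, beta, sigma = 200, 1.0, 2.0, 0.5
x = rng.uniform(0, 10, size=n)                      # fixed design (arbitrary)
SSx = ((x - x.mean()) ** 2).sum()
w = (x - x.mean()) / SSx                            # weights w_ni of equation (2)

def slope(y, x):
    return ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()

# one sample: slope minus beta equals the weighted error sum in equation (1)
eps = rng.normal(0, sigma, size=n)
y = alpha + beta * x + eps
print(slope(y, x) - beta, (w * eps).sum())          # agree up to rounding

# many samples: Monte Carlo variance of the slope versus sigma^2 / SSx_n
reps = 5000
bhat = np.array([slope(alpha + beta * x + rng.normal(0, sigma, n), x) for _ in range(reps)])
print(bhat.var(), sigma ** 2 / SSx)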
Bounded leverage Asymptotic normality of $\hat\beta$ requires, in essence, that
all of the observations have roughly equal impact on the fit.
In order that one point not dominate $\hat\beta$, we have to require that
  $$\max_i \frac{(x_i - \bar x)^2}{SSx_n} \to 0 .$$
Lindeberg CLT This theorem allows a triangular array of random variables. Think of sequences as in a lower triangular array, $\{X_{11}\}$, $\{X_{21}, X_{22}\}$,
$\{X_{31}, X_{32}, X_{33}\}$, with mean zero and variances $\mathrm{Var}(X_{ni}) = \sigma_{ni}^2$. Now
let $T_n = \sum_{i=1}^n X_{ni}$ with variance $\mathrm{Var}(T_n) = s_n^2 = \sum_{i=1}^n \sigma_{ni}^2$. Hence,
  $$E\,\frac{T_n}{s_n} = 0, \qquad \mathrm{Var}\,\frac{T_n}{s_n} = 1 .$$
If the so-called Lindeberg condition holds ($I(\text{set})$ is the indicator function of the set),
  $$\forall \delta > 0, \qquad \frac{1}{s_n^2} \sum_{i=1}^n E\left[ X_{ni}^2\, I(\delta s_n < |X_{ni}|) \right] \to 0 \text{ as } n \to \infty, \qquad (3)$$
then
  $$\frac{T_n}{s_n} \xrightarrow{d} N(0, 1), \qquad T_n \sim AN(0, s_n^2) .$$
Regression application In this case, define the random variables $X_{ni} = w_{ni} \epsilon_i$ so that $\hat\beta = \beta + \sum_i X_{ni}$, where the model error $\epsilon_i$ has mean 0
with variance $\sigma^2$ and (as in (2))
  $$w_{ni} = \frac{x_i - \bar x_n}{\sum_i (x_i - \bar x_n)^2}, \qquad s_n^2 = \sigma^2 \sum_i w_{ni}^2 . \qquad (4)$$
Asymptotic normality requires that no one observation have too much
effect on the slope, which we specify by the condition
  $$\frac{\max_i w_{ni}^2}{\sum_{i=1}^n w_{ni}^2} \to 0 \text{ as } n \to \infty . \qquad (5)$$
To see that the condition (5) implies the Lindeberg condition, let's
keep things a bit simpler by assuming that the $\epsilon_i$ share a common
distribution, differing only in scale.
Now define the largest weight $W_n = \max_i |w_{ni}|$ and observe that
  $$\frac{1}{s_n^2} \sum_i w_{ni}^2\, E\left[ \epsilon_i^2\, I(\delta s_n < |w_{ni} \epsilon_i|) \right]
    \;\le\; \frac{1}{s_n^2} \sum_i w_{ni}^2\, E\left[ \epsilon_i^2\, I\!\left(\frac{\delta s_n}{W_n} < |\epsilon_i|\right) \right] . \qquad (6)$$
The condition (5) implies that $s_n / W_n \to \infty$. Since $\mathrm{Var}\,\epsilon_i = \sigma^2$ is
finite, choose $N$ so that for $n > N$ the summands have a common bound (here's
where we use the common distribution of the $\epsilon_i$),
  $$E\left[ \epsilon_1^2\, I\!\left(\frac{\delta s_n}{W_n} < |\epsilon_1|\right) \right] < \eta \quad \forall i,$$
with $\eta \to 0$ as $n \to \infty$. Then the sum (6) is bounded by $\eta / \sigma^2 \to 0$.
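Numerical check (not from the notes) Condition (5) is easy to examine for specific designs; this Python sketch uses two made-up designs, one evenly spaced and one with a single dominant point. The ratio $\max_i w_{ni}^2 / \sum_i w_{ni}^2$ shrinks with $n$ for the former but stays near one for the latter, so the CLT argument above would fail there.

import numpy as np

def leverage_ratio(x):
    w = (x - x.mean()) / ((x - x.mean()) ** 2).sum()
    return (w ** 2).max() / (w ** 2).sum()          # the ratio in condition (5)

for n in (20, 200, 2000):
    even = np.arange(n, dtype=float)                # evenly spaced design
    spiked = np.arange(n, dtype=float)
    spiked[-1] = 50.0 * n                           # one point dominates the spread
    print(n, leverage_ratio(even), leverage_ratio(spiked))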
Central limit theorems for dependent processes
M-dependent process A sequence of random variables $\{X_t\}$ (stochastic
process) is said to be M-dependent if $|t - s| > M$ implies that $X_t$ is
independent of $X_s$,
  $$\{X_t : t \le T\} \;\text{ independent of }\; \{X_s : s > T + M\} . \qquad (7)$$
Variables with time indices $t$ and $s$ that are more than $M$ apart, $|t - s| > M$, are independent.
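Example (an illustration, not in the notes) A moving average of order $M$ built from i.i.d. noise is $M$-dependent by construction, since observations more than $M$ apart share no noise terms. The Python sketch below simulates an MA(2) series with arbitrary weights and prints sample autocorrelations, which are nonzero at lags 1 and 2 and near zero beyond lag 2.

import numpy as np

rng = np.random.default_rng(2)
n, M = 100_000, 2
theta = np.array([1.0, 0.6, 0.3])                   # MA(2) weights (arbitrary)
w = rng.normal(size=n + M)                          # i.i.d. noise
# X_t = sum_{j=0}^{M} theta_j w_{t-j}, so X is 2-dependent
x = sum(theta[j] * w[M - j : n + M - j] for j in range(M + 1))

def acf(x, h):
    xc = x - x.mean()
    return (xc[:-h] * xc[h:]).mean() / xc.var()

for h in range(1, 6):
    print(h, round(acf(x, h), 3))                   # lags 1, 2 nonzero; lags > 2 near 0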
Basic theorem (A.2) This is the favorite method of proof in S&S and
other common textbooks. An up-over-down argument. The idea is to
construct an approximating r.v. $Y_{mn}$ that is similar to $X_n$ (enforced
by condition 3 below), but simpler to analyze in the sense that its
asymptotic distribution, which is controlled by the "tuning parameter"
$m$, is relatively easy to obtain.
Theorem A.2 If
(1) $Y_{mn} \xrightarrow{d} Y_m$ as $n \to \infty$ for every $m$,
(2) $Y_m \xrightarrow{d} Y$ as $m \to \infty$,
(3) $E(X_n - Y_{mn})^2 \to 0$ as $m, n \to \infty$,
then $X_n \xrightarrow{d} Y$.
CLT for M-dependence (A.4) Suppose $\{X_t\}$ is $M$-dependent with covariances $\gamma_j$. The variance of the mean of $n$ observations is then
  $$\mathrm{Var}(\sqrt{n}\,\bar X_n) = n\,\mathrm{Var}\,\bar X_n = \sum_{h=-M}^{M} \frac{n - |h|}{n}\,\gamma_h \;\to\; \sum_{h=-M}^{M} \gamma_h := V_M . \qquad (8)$$
Theorem A.4 If $\{X_t\}$ is an $M$-dependent stationary process with
mean $\mu$ and $V_M > 0$, then
  $$\sqrt{n}(\bar X_n - \mu) \xrightarrow{d} N(0, V_M) \qquad (\bar X_n \sim AN(\mu, V_M / n)) .$$
Proof Assume w.l.o.g. that $\mu = 0$ (recall our prior discussion in
Lecture 4 that we can replace the sample mean by $\mu$) so that the
statistic of interest is $\sqrt{n}\,\bar X_n$. Define the similar statistic $Y_{mn}$
by forming the data into blocks of length $m$ and tossing enough so that
the blocks are independent of each other. Choose $m > 2M$ and let
$r = \lfloor n/m \rfloor$ (the greatest integer not exceeding $n/m$) count the blocks.
Now arrange the data to fill a table with $r$ rows and $m$ columns (pretend for the moment that $n = r m$). The approximation $Y_{mn}$ sums the
first $m - M$ columns of this table,
  $$Y_{mn} = \frac{1}{\sqrt{n}}\Big[ (X_1 + X_2 + \cdots + X_{m-M})
      + (X_{m+1} + X_{m+2} + \cdots + X_{2m-M})
      + (X_{2m+1} + X_{2m+2} + \cdots + X_{3m-M})
      + \cdots
      + (X_{(r-1)m+1} + X_{(r-1)m+2} + \cdots + X_{(r-1)m+m-M}) \Big] \qquad (9)$$
and omits the intervening blocks,
  $$U_{mn} = \frac{1}{\sqrt{n}}\Big[ (X_{m-M+1} + \cdots + X_m)
      + (X_{2m-M+1} + \cdots + X_{2m})
      + (X_{3m-M+1} + \cdots + X_{3m})
      + \cdots
      + (X_{rm-M+1} + \cdots + X_n) \Big] \qquad (10)$$
The idea is that we will get the shape of the distribution from the
variation among the blocks.
Label the row sums that define $Y_{mn}$ in (9) as $Z_i$ so that we can write
  $$Y_{mn} = \frac{1}{\sqrt{n}} (Z_1 + Z_2 + \cdots + Z_r) .$$
Clearly, because the $X_t$ are centered, $E Z_i = 0$ and
  $$\mathrm{Var}\,Z_i = \sum_h (m - M - |h|)\gamma_h := S_{m-M} .$$
Now let's fill in the 3 requirements of Theorem A.2:
(1)
  $$Y_{mn} = \frac{1}{\sqrt{n}} \sum_{i=1}^{r} Z_i = (n/r)^{-1/2} \cdot r^{-1/2} \sum_{i} Z_i .$$
Now observe that $n/r \to m$, and since the $Z_i$ are independent by
construction, the usual CLT shows that
  $$r^{-1/2} \sum_i Z_i \xrightarrow{d} N(0, S_{m-M}) .$$
Hence, we have $Y_{mn} \xrightarrow{d} Y_m := N(0, S_{m-M}/m)$ as $n \to \infty$.
(2) Now let $m \to \infty$. As the number of columns in the table (9)
increases, the variance expression $S_{m-M}/m$ approaches the limiting variance
$V_M$, so $Y_m \xrightarrow{d} N(0, V_M)$.
(3) Last, we have to check that we did not make too rough an approximation when forming $Y_{mn}$. The approximation omits the last $M$
columns (10), so the error is
  $$\sqrt{n}\,\bar X_n - Y_{mn} = \frac{1}{\sqrt{n}} (U_1 + \cdots + U_r)$$
where $U_i = X_{im-M+1} + \cdots + X_{im}$ and $\mathrm{Var}(U_i) = S_M$. Thus,
  $$E\left( \sqrt{n}\,\bar X_n - Y_{mn} \right)^2 = \frac{r}{n} S_M \to \frac{1}{m} S_M$$
as $n \to \infty$, and $S_M / m \to 0$ as $m \to \infty$.
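Numerical check (not part of the notes) Theorem A.4 can be seen in simulation; the sketch below reuses the MA(2) construction from the earlier example, for which $V_M = \sigma^2 (\sum_j \theta_j)^2$ with $\sigma = 1$, and compares the Monte Carlo variance of $\sqrt{n}\,\bar X_n$ with $V_M$. Sample sizes and the seed are arbitrary.

import numpy as np

rng = np.random.default_rng(3)
theta = np.array([1.0, 0.6, 0.3])                   # same MA(2) weights as the earlier example
M, n, reps = 2, 2000, 4000
V_M = theta.sum() ** 2                              # V_M = sigma^2 (sum_j theta_j)^2 with sigma = 1

means = np.empty(reps)
for r in range(reps):
    w = rng.normal(size=n + M)
    x = sum(theta[j] * w[M - j : n + M - j] for j in range(M + 1))
    means[r] = x.mean()

print(np.var(np.sqrt(n) * means), V_M)              # Monte Carlo variance of sqrt(n)*Xbar_n vs V_M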
CLT for stationary processes, Th A.5 In particular, for linear processes
of the form $X_t - \mu = \sum_j \psi_j w_{t-j}$, where the white noise terms are independent with mean 0 and variance $\sigma^2$. The coefficients, as usual, are
absolutely summable, $\sum_j |\psi_j| < \infty$.
Before stating the theorem, note that the variance of the mean of such
a linear process has a particularly convenient form. Following our
familiar manipulations of these sums (and Fatou's lemma!), we define
$V$ as the limiting variance of the mean as $n \to \infty$,
  $$n\,\mathrm{Var}(\bar X_n) = \sum_{h=-n+1}^{n-1} \frac{n - |h|}{n}\,\gamma_h
      \;\to\; \sum_{h=-\infty}^{\infty} \gamma_h
      = \sigma^2 \sum_{h=-\infty}^{\infty} \sum_{j=0}^{\infty} \psi_j \psi_{j+h} \quad (\psi_j = 0,\ j < 0)$$
  $$= \sigma^2 \Big( \sum_j \psi_j \Big)^2 := V . \qquad (11)$$
Theorem A.5 If $\{X_t\}$ is a linear process of this form with $\sum_j \psi_j \ne 0$,
then $\bar X_n \sim AN(\mu, V/n)$.
Proof. Again use the basic three-part theorem A.2. This time, form
the key approximation to $\sqrt{n}(\bar X_n - \mu)$ from the $M$-dependent process
matched up to resemble $X_t$,
  $$X_t^M = \sum_{j=0}^{M} \psi_j w_{t-j} .$$
(1) Form the approximation based on averaging the $M$-dependent process rather than the original, infinite-order stationary process $X_t$,
  $$Y_{Mn} = \sqrt{n}(\bar X_{n,M} - \mu), \qquad \bar X_{n,M} = \frac{1}{n} \sum_{t=1}^{n} X_t^M .$$
Since $X_t^M$ is $M$-dependent, Theorem A.4 implies that as $n \to \infty$,
  $$Y_{Mn} \xrightarrow{d} Y_M \sim N(0, V_M), \qquad V_M = \sum_h \gamma_h^M = \sigma^2 \Big( \sum_{j=0}^{M} \psi_j \Big)^2 ,$$
where the $\gamma_h^M$ are the covariances of $X_t^M$.
(2) As the order M increases, VM → V as defined in (11).
(3) The error of the approximation is the tail sum of the moving average approximation, which is easily bounded since the weights $\psi_j$ are
absolutely summable,
  $$E\left[ \sqrt{n}\left( \bar X_n - \bar X_{n,M} \right) \right]^2
    = E\left[ \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \sum_{j=M+1}^{\infty} \psi_j w_{t-j} \right]^2
    \le \sigma^2 \Big( \sum_{j=M+1}^{\infty} |\psi_j| \Big)^2 \to 0$$
as $M \to \infty$.
Normality of covariances, correlations
Covariances Th A.6 The text shows this for linear processes. We did
the messy part previously.
Correlations Th A.7 This is an exercise in the use of the delta method.
For scalar r.v.s, we have for differentiable functions $g(\cdot)$ that
  $$X \sim AN(\mu, \sigma_n^2) \quad\Rightarrow\quad g(X) \sim AN\big(g(\mu),\ g'(\mu)^2 \sigma_n^2\big) . \qquad (12)$$
The argument is based on the observation that if $g(x) = a + bx$, then
$g(X) \sim N(g(\mu), b^2 \sigma^2)$. The trick is to show that the remainder in the
linear Taylor series expansion of $g(X)$ around $\mu$ is small,
  $$g(X) = g(\mu) + g'(\mu)(X - \mu) + \underbrace{\frac{g''(m)}{2}(X - \mu)^2}_{\text{remainder}} . \qquad (13)$$
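Numerical check (not in the notes) The delta method (12) can be illustrated directly; in this Python sketch the data are i.i.d. $N(\mu, \sigma^2)$ with $\mu = 2$, $\sigma = 1$ (arbitrary choices) and $g(x) = x^2$, so $g'(\mu)^2 \sigma_n^2 = 4\mu^2\sigma^2/n$. The simulation compares that value with the Monte Carlo variance of $g(\bar X_n)$.

import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, reps = 2.0, 1.0, 400, 10000

def g(u):
    return u ** 2                                   # smooth g with g'(mu) = 2 mu

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)   # Xbar_n ~ AN(mu, sigma^2/n)
gx = g(xbar)

print(gx.var(), (2 * mu) ** 2 * sigma ** 2 / n)     # empirical variance vs delta-method variance
print(gx.mean(), g(mu))                             # centers agree up to O(1/n)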
Normality of estimates in ARMA models
Conditional least squares in the AR(1) model (with mean zero) gives
the estimator
  $$\hat\phi = \frac{\sum_{t=1}^{n-1} X_{t+1} X_t}{\sum_{t=1}^{n-1} X_t^2}
           = \frac{\sum_t (\phi X_t + w_{t+1}) X_t}{\sum_{t=1}^{n-1} X_t^2}
           = \phi + \frac{\sum_t w_{t+1} X_t}{\sum_t X_t^2} . \qquad (14)$$
Hence,
  $$\sqrt{n}(\hat\phi - \phi) = \frac{\frac{1}{\sqrt{n}} \sum_t X_t w_{t+1}}{\frac{1}{n} \sum_t X_t^2} .$$
Asymptotic distribution Results for the covariances of stationary processes (Theorem A.6) show that the denominator converges to the
variance $\gamma_0$. For the numerator, the expected value is zero (since $X_t$ and
$w_{t+1}$ are independent) and the variances of the components are
  $$\mathrm{Var}(X_t w_{t+1}) = E X_t^2\, E w_{t+1}^2 = \gamma_0 \sigma^2 .$$
These have equal variance, unlike the counterparts in the analysis of a
regression model with fixed x's. The covariances are zero (owing again
to the independence of the $w_t$),
  $$\mathrm{Cov}(X_t w_{t+1}, X_s w_{s+1}) = 0, \qquad s \ne t .$$
The summands in $\frac{1}{\sqrt{n}} \sum_t X_t w_{t+1}$ are thus uncorrelated (though not
independent) and form a "trivial" stationary process. Though this is
not a linear process as in A.6, comparable results hold owing to the
lack of lingering dependence. (One has to build an approximating $M$-dependent version of the products, say $w_{t+1} X_t^M$; the approximation
works with A.2 again since $X_t$ is a linear process -- see S&S.) This
version of the prior theorem gives (Theorem B.4)
  $$\sqrt{n}(\hat\phi - \phi) \xrightarrow{d} N(0, 1 - \phi^2)$$
where the variance expression comes from
  $$\frac{\sigma^2 \gamma_0}{\gamma_0^2} = \frac{\sigma^2}{\gamma_0} = 1 - \phi^2 .$$
(In the general case, the asymptotic variance is $\sigma^2 \Gamma_p^{-1}$, but this is a
nice way to see that the expression does not depend on $\sigma^2$.)
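Numerical check (not part of the notes) A simulation sketch of the AR(1) result above: the conditional least squares estimate (14) is computed over many replications and the variance of $\sqrt{n}(\hat\phi - \phi)$ is compared with $1 - \phi^2$. The value of $\phi$, the sample size, and the seed are arbitrary choices.

import numpy as np

rng = np.random.default_rng(6)
phi, n, reps = 0.7, 1000, 3000

est = np.empty(reps)
for r in range(reps):
    w = rng.normal(size=n)
    x = np.empty(n)
    x[0] = w[0] / np.sqrt(1 - phi ** 2)             # start at the stationary distribution
    for t in range(1, n):
        x[t] = phi * x[t - 1] + w[t]
    est[r] = (x[1:] * x[:-1]).sum() / (x[:-1] ** 2).sum()   # conditional LS, equation (14)

z = np.sqrt(n) * (est - phi)
print(z.var(), 1 - phi ** 2)                        # Monte Carlo variance vs 1 - phi^2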