Statistics 910, #11
Asymptotic Distributions in Time Series
Overview
Standard proofs that establish the asymptotic normality of estimators constructed from random samples (i.e., independent observations) no longer
apply in time series analysis. The usual version of the central limit theorem
(CLT) presumes independence of the summed components, and that’s not
the case with time series. This lecture shows that normality still rules for
asymptotic distributions, but the arguments have to be modified to allow
for correlated data.
1. Types of convergence
2. Distributions in regression (Th A.2, section B.1)
3. Central limit theorems (Th A.3, Th A.4)
4. Normality of covariances, correlations (Th A.6, Th A.7)
5. Normality of parameter estimates in ARMA models (Th B.4)
Types of convergence
Convergence The relevant types of convergence most often seen in statistics and probability are
• Almost surely, almost everywhere, with probability one, w.p. 1:
  $X_n \xrightarrow{a.s.} X$ :  $P\{\omega : \lim_n X_n = X\} = 1$.
• In probability, in measure:
  $X_n \xrightarrow{p} X$ :  $\lim_n P\{\omega : |X_n - X| > \epsilon\} = 0$ for every $\epsilon > 0$.
• In distribution, weak convergence, convergence in law:
  $X_n \xrightarrow{d} X$ :  $\lim_n P(X_n \le x) = P(X \le x)$ at continuity points $x$ of the limit distribution.
• In mean square or $\ell^2$, the variance goes to zero:
  $X_n \xrightarrow{m.s.} X$ :  $\lim_n E(X_n - X)^2 = 0$.
Connections Convergence almost surely (which is much like good old
fashioned convergence of a sequence) implies convergence in probability,
which in turn implies convergence in distribution:
  $\xrightarrow{a.s.} \;\Rightarrow\; \xrightarrow{p} \;\Rightarrow\; \xrightarrow{d}$
Convergence in distribution only implies convergence in probability if
the limiting distribution is a point mass (i.e., the r.v. converges to a constant).
The various types of convergence "commute" with sums, products, and
smooth functions.
Mean square convergence is a bit different from the others; it implies
convergence in probability,
  $\xrightarrow{m.s.} \;\Rightarrow\; \xrightarrow{p}$
which holds by Chebyshev's inequality.
Asymptotically normal (Defn A.5)
  $$\frac{X_n - \mu_n}{\sigma_n} \xrightarrow{d} N(0, 1) \quad\Leftrightarrow\quad X_n \sim AN(\mu_n, \sigma_n^2)$$
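Numerical illustration (not part of the original notes) The AN notation can be checked by simulation. The Python sketch below standardizes sample means of i.i.d. Exponential(1) draws, for which $\mu = \sigma = 1$, and compares a few empirical quantiles of $(\bar X_n - \mu)/(\sigma/\sqrt{n})$ with standard normal quantiles; the sample size, number of replications, and seed are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 10000
mu, sigma = 1.0, 1.0                       # mean and s.d. of Exponential(1)

# standardized sample means: (Xbar_n - mu) / (sigma / sqrt(n))
draws = rng.exponential(scale=1.0, size=(reps, n))
z = (draws.mean(axis=1) - mu) / (sigma / np.sqrt(n))

# empirical quantiles of z versus the corresponding N(0, 1) quantiles
for q, zq in [(0.05, -1.645), (0.50, 0.0), (0.95, 1.645)]:
    print(q, round(float(np.quantile(z, q)), 2), zq)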
Distributions in regression
Not i.i.d. The asymptotic normality of the slope estimates in regression
is not so obvious if the errors are not normal. Normality requires that
we can handle sums of independent, but not identically distributed
r.v.s.
Scalar This example and those that follow consider only scalar estimators to
avoid matrix manipulations that conceal the underlying ideas. The
text illustrates the use of the "Cramér-Wold device" for handling
vector-valued estimators.
Model In scalar form, we observe a sample of independent observations
that follow $Y_i = \alpha + \beta x_i + \epsilon_i$. Assume $Y$ and $\epsilon$ denote random variables,
the $x_i$ are fixed, and the deviations $\epsilon_i$ have mean 0 and variance $\sigma^2$.
Least squares estimators are $\hat\alpha = \bar Y - \hat\beta \bar x$ and
  $$\hat\beta = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2} = \frac{\mathrm{cov}_n(x, y)}{\mathrm{var}_n(x)} .$$
For future reference, write $SSx_n = \sum_{i=1}^n (x_i - \bar x)^2$.
Distribution of $\hat\beta$ We can avoid centering $y$ in the numerator and write
  $$\hat\beta = \frac{\sum_i (x_i - \bar x) y_i}{SSx_n}
            = \frac{\sum_i (x_i - \bar x)(x_i \beta + \epsilon_i)}{SSx_n}
            = \beta + \sum_i w_{ni} \epsilon_i \qquad (1)$$
where the key weights depend on $n$ and have the form
  $$w_{ni} = \frac{x_i - \bar x}{SSx_n} . \qquad (2)$$
Hence, we have written $\hat\beta - \beta$ as a weighted sum of random variables,
and it follows that
  $$E \hat\beta = \beta, \qquad \mathrm{Var}(\hat\beta) = \frac{\sigma^2}{\sum_i (x_i - \bar x)^2} .$$
To get the limiting distribution, we need a version of the CLT that
allows for unequal variances (Lindeberg CLT) and weights that change
with $n$.
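Numerical check (not from the notes) The weight representation (1)-(2) is easy to verify in Python; in this sketch the design $x$, the error scale, and the seed are arbitrary choices. The first print confirms that $\hat\beta - \beta$ equals $\sum_i w_{ni}\epsilon_i$ exactly, and the second compares the Monte Carlo variance of $\hat\beta$ with $\sigma^2/SSx_n$.

import numpy as np

rng = np.random.default_rng(1)
n, alpha, beta, sigma = 200, 1.0, 2.0, 0.5
x = rng.uniform(0, 10, size=n)                      # fixed design (arbitrary)
SSx = ((x - x.mean()) ** 2).sum()
w = (x - x.mean()) / SSx                            # weights w_ni of equation (2)

def slope(y, x):
    return ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()

# one sample: slope minus beta equals the weighted error sum in equation (1)
eps = rng.normal(0, sigma, size=n)
y = alpha + beta * x + eps
print(slope(y, x) - beta, (w * eps).sum())          # agree up to rounding

# many samples: Monte Carlo variance of the slope versus sigma^2 / SSx_n
reps = 5000
bhat = np.array([slope(alpha + beta * x + rng.normal(0, sigma, n), x) for _ in range(reps)])
print(bhat.var(), sigma ** 2 / SSx)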
Bounded leverage Asymptotic normality of $\hat\beta$ requires, in essence, that
all of the observations have roughly equal impact on the fit.
In order that one point not dominate $\hat\beta$, we have to require that
  $$\max_i \frac{(x_i - \bar x)^2}{SSx_n} \to 0 .$$
Lindeberg CLT This theorem allows a triangular array of random variables. Think of sequences as in a lower triangular array, $\{X_{11}\}$, $\{X_{21}, X_{22}\}$,
$\{X_{31}, X_{32}, X_{33}\}$, with mean zero and variances $\mathrm{Var}(X_{ni}) = \sigma_{ni}^2$. Now
let $T_n = \sum_{i=1}^n X_{ni}$ with variance $\mathrm{Var}(T_n) = s_n^2 = \sum_{i=1}^n \sigma_{ni}^2$. Hence,
  $$E\,\frac{T_n}{s_n} = 0, \qquad \mathrm{Var}\,\frac{T_n}{s_n} = 1 .$$
If the so-called Lindeberg condition holds ($I(\text{set})$ is the indicator function of the set),
  $$\forall \delta > 0, \qquad \frac{1}{s_n^2} \sum_{i=1}^n E\left[ X_{ni}^2\, I(\delta s_n < |X_{ni}|) \right] \to 0 \text{ as } n \to \infty, \qquad (3)$$
then
  $$\frac{T_n}{s_n} \xrightarrow{d} N(0, 1), \qquad T_n \sim AN(0, s_n^2) .$$
Regression application In this case, define the random variables $X_{ni} = w_{ni} \epsilon_i$ so that $\hat\beta = \beta + \sum_i X_{ni}$, where the model error $\epsilon_i$ has mean 0
with variance $\sigma^2$ and (as in (2))
  $$w_{ni} = \frac{x_i - \bar x_n}{\sum_i (x_i - \bar x_n)^2}, \qquad s_n^2 = \sigma^2 \sum_i w_{ni}^2 . \qquad (4)$$
Asymptotic normality requires that no one observation have too much
effect on the slope, which we specify by the condition
  $$\frac{\max_i w_{ni}^2}{\sum_{i=1}^n w_{ni}^2} \to 0 \text{ as } n \to \infty . \qquad (5)$$
To see that the condition (5) implies the Lindeberg condition, let's
keep things a bit simpler by assuming that the $\epsilon_i$ share a common
distribution, differing only in scale.
Now define the largest weight $W_n = \max_i |w_{ni}|$ and observe that
  $$\frac{1}{s_n^2} \sum_i w_{ni}^2\, E\left[ \epsilon_i^2\, I(\delta s_n < |w_{ni} \epsilon_i|) \right]
    \;\le\; \frac{1}{s_n^2} \sum_i w_{ni}^2\, E\left[ \epsilon_i^2\, I\!\left(\frac{\delta s_n}{W_n} < |\epsilon_i|\right) \right] . \qquad (6)$$
The condition (5) implies that $s_n / W_n \to \infty$. Since $\mathrm{Var}\,\epsilon_i = \sigma^2$ is
finite, choose $N$ so that for $n > N$ the summands have a common bound (here's
where we use the common distribution of the $\epsilon_i$),
  $$E\left[ \epsilon_1^2\, I\!\left(\frac{\delta s_n}{W_n} < |\epsilon_1|\right) \right] < \eta \quad \forall i,$$
with $\eta \to 0$ as $n \to \infty$. Then the sum (6) is bounded by $\eta / \sigma^2 \to 0$.
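Numerical check (not from the notes) Condition (5) is easy to examine for specific designs; this Python sketch uses two made-up designs, one evenly spaced and one with a single dominant point. The ratio $\max_i w_{ni}^2 / \sum_i w_{ni}^2$ shrinks with $n$ for the former but stays near one for the latter, so the CLT argument above would fail there.

import numpy as np

def leverage_ratio(x):
    w = (x - x.mean()) / ((x - x.mean()) ** 2).sum()
    return (w ** 2).max() / (w ** 2).sum()          # the ratio in condition (5)

for n in (20, 200, 2000):
    even = np.arange(n, dtype=float)                # evenly spaced design
    spiked = np.arange(n, dtype=float)
    spiked[-1] = 50.0 * n                           # one point dominates the spread
    print(n, leverage_ratio(even), leverage_ratio(spiked))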
Central limit theorems for dependent processes
M-dependent process A sequence of random variables $\{X_t\}$ (stochastic
process) is said to be M-dependent if $|t - s| > M$ implies that $X_t$ is
independent of $X_s$,
  $$\{X_t : t \le T\} \;\text{ independent of }\; \{X_s : s > T + M\} . \qquad (7)$$
Variables with time indices $t$ and $s$ that are more than $M$ apart, $|t - s| > M$, are independent.
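Example (an illustration, not in the notes) A moving average of order $M$ built from i.i.d. noise is $M$-dependent by construction, since observations more than $M$ apart share no noise terms. The Python sketch below simulates an MA(2) series with arbitrary weights and prints sample autocorrelations, which are nonzero at lags 1 and 2 and near zero beyond lag 2.

import numpy as np

rng = np.random.default_rng(2)
n, M = 100_000, 2
theta = np.array([1.0, 0.6, 0.3])                   # MA(2) weights (arbitrary)
w = rng.normal(size=n + M)                          # i.i.d. noise
# X_t = sum_{j=0}^{M} theta_j w_{t-j}, so X is 2-dependent
x = sum(theta[j] * w[M - j : n + M - j] for j in range(M + 1))

def acf(x, h):
    xc = x - x.mean()
    return (xc[:-h] * xc[h:]).mean() / xc.var()

for h in range(1, 6):
    print(h, round(acf(x, h), 3))                   # lags 1, 2 nonzero; lags > 2 near 0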
Basic theorem (A.2) This is the favorite method of proof in S&S and
other common textbooks. An up-over-down argument. The idea is to
construct an approximating r.v. $Y_{mn}$ that is similar to $X_n$ (enforced
by condition 3 below), but simpler to analyze in the sense that its
asymptotic distribution, which is controlled by the "tuning parameter"
$m$, is relatively easy to obtain.
Theorem A.2 If
(1) $Y_{mn} \xrightarrow{d} Y_m$ as $n \to \infty$ for every $m$,
(2) $Y_m \xrightarrow{d} Y$ as $m \to \infty$,
(3) $E(X_n - Y_{mn})^2 \to 0$ as $m, n \to \infty$,
then $X_n \xrightarrow{d} Y$.
CLT for M-dependence (A.4) Suppose $\{X_t\}$ is $M$-dependent with covariances $\gamma_j$. The variance of the mean of $n$ observations is then
  $$\mathrm{Var}(\sqrt{n}\,\bar X_n) = n\,\mathrm{Var}\,\bar X_n = \sum_{h=-M}^{M} \frac{n - |h|}{n}\,\gamma_h \;\to\; \sum_{h=-M}^{M} \gamma_h := V_M . \qquad (8)$$
Theorem A.4 If $\{X_t\}$ is an $M$-dependent stationary process with
mean $\mu$ and $V_M > 0$, then
  $$\sqrt{n}(\bar X_n - \mu) \xrightarrow{d} N(0, V_M) \qquad (\bar X_n \sim AN(\mu, V_M / n)) .$$
Proof Assume w.l.o.g. that $\mu = 0$ (recall our prior discussion in
Lecture 4 that we can replace the sample mean by $\mu$) so that the
statistic of interest is $\sqrt{n}\,\bar X_n$. Define the similar statistic $Y_{mn}$
by forming the data into blocks of length $m$ and tossing enough so that
the blocks are independent of each other. Choose $m > 2M$ and let
$r = \lfloor n/m \rfloor$ (the greatest integer not exceeding $n/m$) count the blocks.
Now arrange the data to fill a table with $r$ rows and $m$ columns (pretend for the moment that $n = r m$). The approximation $Y_{mn}$ sums the
first $m - M$ columns of this table,
  $$Y_{mn} = \frac{1}{\sqrt{n}}\Big[ (X_1 + X_2 + \cdots + X_{m-M})
      + (X_{m+1} + X_{m+2} + \cdots + X_{2m-M})
      + (X_{2m+1} + X_{2m+2} + \cdots + X_{3m-M})
      + \cdots
      + (X_{(r-1)m+1} + X_{(r-1)m+2} + \cdots + X_{(r-1)m+m-M}) \Big] \qquad (9)$$
and omits the intervening blocks,
  $$U_{mn} = \frac{1}{\sqrt{n}}\Big[ (X_{m-M+1} + \cdots + X_m)
      + (X_{2m-M+1} + \cdots + X_{2m})
      + (X_{3m-M+1} + \cdots + X_{3m})
      + \cdots
      + (X_{rm-M+1} + \cdots + X_n) \Big] \qquad (10)$$
The idea is that we will get the shape of the distribution from the
variation among the blocks.
Label the row sums that define $Y_{mn}$ in (9) as $Z_i$ so that we can write
  $$Y_{mn} = \frac{1}{\sqrt{n}} (Z_1 + Z_2 + \cdots + Z_r) .$$
Clearly, because the $X_t$ are centered, $E Z_i = 0$ and
  $$\mathrm{Var}\,Z_i = \sum_h (m - M - |h|)\gamma_h := S_{m-M} .$$
Now let's fill in the 3 requirements of Theorem A.2:
(1)
  $$Y_{mn} = \frac{1}{\sqrt{n}} \sum_{i=1}^{r} Z_i = (n/r)^{-1/2} \cdot r^{-1/2} \sum_{i} Z_i .$$
Now observe that $n/r \to m$, and since the $Z_i$ are independent by
construction, the usual CLT shows that
  $$r^{-1/2} \sum_i Z_i \xrightarrow{d} N(0, S_{m-M}) .$$
Hence, we have $Y_{mn} \xrightarrow{d} Y_m := N(0, S_{m-M}/m)$ as $n \to \infty$.
(2) Now let $m \to \infty$. As the number of columns in the table (9)
increases, the variance expression $S_{m-M}/m$ approaches the limiting variance
$V_M$, so $Y_m \xrightarrow{d} N(0, V_M)$.
(3) Last, we have to check that we did not make too rough an approximation when forming $Y_{mn}$. The approximation omits the last $M$
columns (10), so the error is
  $$\sqrt{n}\,\bar X_n - Y_{mn} = \frac{1}{\sqrt{n}} (U_1 + \cdots + U_r)$$
where $U_i = X_{im-M+1} + \cdots + X_{im}$ and $\mathrm{Var}(U_i) = S_M$. Thus,
  $$E\left( \sqrt{n}\,\bar X_n - Y_{mn} \right)^2 = \frac{r}{n} S_M \to \frac{1}{m} S_M$$
as $n \to \infty$, and $S_M / m \to 0$ as $m \to \infty$.
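Numerical check (not part of the notes) Theorem A.4 can be seen in simulation; the sketch below reuses the MA(2) construction from the earlier example, for which $V_M = \sigma^2 (\sum_j \theta_j)^2$ with $\sigma = 1$, and compares the Monte Carlo variance of $\sqrt{n}\,\bar X_n$ with $V_M$. Sample sizes and the seed are arbitrary.

import numpy as np

rng = np.random.default_rng(3)
theta = np.array([1.0, 0.6, 0.3])                   # same MA(2) weights as the earlier example
M, n, reps = 2, 2000, 4000
V_M = theta.sum() ** 2                              # V_M = sigma^2 (sum_j theta_j)^2 with sigma = 1

means = np.empty(reps)
for r in range(reps):
    w = rng.normal(size=n + M)
    x = sum(theta[j] * w[M - j : n + M - j] for j in range(M + 1))
    means[r] = x.mean()

print(np.var(np.sqrt(n) * means), V_M)              # Monte Carlo variance of sqrt(n)*Xbar_n vs V_M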
CLT for stationary processes, Th A.5 In particular, for linear processes
of the form $X_t - \mu = \sum_j \psi_j w_{t-j}$, where the white noise terms are independent with mean 0 and variance $\sigma^2$. The coefficients, as usual, are
absolutely summable, $\sum_j |\psi_j| < \infty$.
Before stating the theorem, note that the variance of the mean of such
a linear process has a particularly convenient form. Following our
familiar manipulations of these sums (and Fatou's lemma!), we define
$V$ as the limiting variance of the mean as $n \to \infty$,
  $$n\,\mathrm{Var}(\bar X_n) = \sum_{h=-n+1}^{n-1} \frac{n - |h|}{n}\,\gamma_h
      \;\to\; \sum_{h=-\infty}^{\infty} \gamma_h
      = \sigma^2 \sum_{h=-\infty}^{\infty} \sum_{j=0}^{\infty} \psi_j \psi_{j+h} \quad (\psi_j = 0,\ j < 0)$$
  $$= \sigma^2 \Big( \sum_j \psi_j \Big)^2 := V . \qquad (11)$$
Theorem A.5 If $\{X_t\}$ is a linear process of this form with $\sum_j \psi_j \ne 0$,
then $\bar X_n \sim AN(\mu, V/n)$.
Proof. Again use the basic three-part theorem A.2. This time, form
the key approximation to $\sqrt{n}(\bar X_n - \mu)$ from the $M$-dependent process
matched up to resemble $X_t$,
  $$X_t^M = \sum_{j=0}^{M} \psi_j w_{t-j} .$$
(1) Form the approximation based on averaging the $M$-dependent process rather than the original, infinite-order stationary process $X_t$,
  $$Y_{Mn} = \sqrt{n}(\bar X_{n,M} - \mu), \qquad \bar X_{n,M} = \frac{1}{n} \sum_{t=1}^{n} X_t^M .$$
Since $X_t^M$ is $M$-dependent, Theorem A.4 implies that as $n \to \infty$,
  $$Y_{Mn} \xrightarrow{d} Y_M \sim N(0, V_M), \qquad V_M = \sum_h \gamma_h^M = \sigma^2 \Big( \sum_{j=0}^{M} \psi_j \Big)^2 ,$$
where the $\gamma_h^M$ are the covariances of $X_t^M$.
(2) As the order M increases, VM → V as defined in (11).
(3) The error of the approximation is the tail sum of the moving average approximation, which is easily bounded since the weights $\psi_j$ are
absolutely summable,
  $$E\left[ \sqrt{n}\left( \bar X_n - \bar X_{n,M} \right) \right]^2
    = E\left[ \frac{1}{\sqrt{n}} \sum_{t=1}^{n} \sum_{j=M+1}^{\infty} \psi_j w_{t-j} \right]^2
    \le \sigma^2 \Big( \sum_{j=M+1}^{\infty} |\psi_j| \Big)^2 \to 0$$
as $M \to \infty$.
Normality of covariances, correlations
Covariances Th A.6 The text shows this for linear processes. We did
the messy part previously.
Correlations Th A.7 This is an exercise in the use of the delta method.
For scalar r.v.s, we have for differentiable functions $g(\cdot)$ that
  $$X \sim AN(\mu, \sigma_n^2) \quad\Rightarrow\quad g(X) \sim AN\big(g(\mu),\ g'(\mu)^2 \sigma_n^2\big) . \qquad (12)$$
The argument is based on the observation that if $g(x) = a + bx$, then
$g(X) \sim N(g(\mu), b^2 \sigma^2)$. The trick is to show that the remainder in the
linear Taylor series expansion of $g(X)$ around $\mu$ is small,
  $$g(X) = g(\mu) + g'(\mu)(X - \mu) + \underbrace{\frac{g''(m)}{2}(X - \mu)^2}_{\text{remainder}} . \qquad (13)$$
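Numerical check (not in the notes) The delta method (12) can be illustrated directly; in this Python sketch the data are i.i.d. $N(\mu, \sigma^2)$ with $\mu = 2$, $\sigma = 1$ (arbitrary choices) and $g(x) = x^2$, so $g'(\mu)^2 \sigma_n^2 = 4\mu^2\sigma^2/n$. The simulation compares that value with the Monte Carlo variance of $g(\bar X_n)$.

import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, reps = 2.0, 1.0, 400, 10000

def g(u):
    return u ** 2                                   # smooth g with g'(mu) = 2 mu

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)   # Xbar_n ~ AN(mu, sigma^2/n)
gx = g(xbar)

print(gx.var(), (2 * mu) ** 2 * sigma ** 2 / n)     # empirical variance vs delta-method variance
print(gx.mean(), g(mu))                             # centers agree up to O(1/n)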
Normality of estimates in ARMA models
Conditional least squares in the AR(1) model (with mean zero) gives
the estimator
  $$\hat\phi = \frac{\sum_{t=1}^{n-1} X_{t+1} X_t}{\sum_{t=1}^{n-1} X_t^2}
           = \frac{\sum_t (\phi X_t + w_{t+1}) X_t}{\sum_{t=1}^{n-1} X_t^2}
           = \phi + \frac{\sum_t w_{t+1} X_t}{\sum_t X_t^2} . \qquad (14)$$
Hence,
  $$\sqrt{n}(\hat\phi - \phi) = \frac{\frac{1}{\sqrt{n}} \sum_t X_t w_{t+1}}{\frac{1}{n} \sum_t X_t^2} .$$
Asymptotic distribution Results for the covariances of stationary processes (Theorem A.6) show that the denominator converges to the
variance $\gamma_0$. For the numerator, the expected value is zero (since $X_t$ and
$w_{t+1}$ are independent) and the variances of the components are
  $$\mathrm{Var}(X_t w_{t+1}) = E X_t^2\, E w_{t+1}^2 = \gamma_0 \sigma^2 .$$
These have equal variance, unlike the counterparts in the analysis of a
regression model with fixed x's. The covariances are zero (owing again
to the independence of the $w_t$),
  $$\mathrm{Cov}(X_t w_{t+1}, X_s w_{s+1}) = 0, \qquad s \ne t .$$
The summands in $\frac{1}{\sqrt{n}} \sum_t X_t w_{t+1}$ are thus uncorrelated (though not
independent) and form a "trivial" stationary process. Though this is
not a linear process as in A.6, comparable results hold owing to the
lack of lingering dependence. (One has to build an approximating $M$-dependent version of the products, say $w_{t+1} X_t^M$; the approximation
works with A.2 again since $X_t$ is a linear process -- see S&S.) This
version of the prior theorem gives (Theorem B.4)
  $$\sqrt{n}(\hat\phi - \phi) \xrightarrow{d} N(0, 1 - \phi^2)$$
where the variance expression comes from
  $$\frac{\sigma^2 \gamma_0}{\gamma_0^2} = \frac{\sigma^2}{\gamma_0} = 1 - \phi^2 .$$
(In the general case, the asymptotic variance is $\sigma^2 \Gamma_p^{-1}$, but this is a
nice way to see that the expression does not depend on $\sigma^2$.)
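Numerical check (not part of the notes) A simulation sketch of the AR(1) result above: the conditional least squares estimate (14) is computed over many replications and the variance of $\sqrt{n}(\hat\phi - \phi)$ is compared with $1 - \phi^2$. The value of $\phi$, the sample size, and the seed are arbitrary choices.

import numpy as np

rng = np.random.default_rng(6)
phi, n, reps = 0.7, 1000, 3000

est = np.empty(reps)
for r in range(reps):
    w = rng.normal(size=n)
    x = np.empty(n)
    x[0] = w[0] / np.sqrt(1 - phi ** 2)             # start at the stationary distribution
    for t in range(1, n):
        x[t] = phi * x[t - 1] + w[t]
    est[r] = (x[1:] * x[:-1]).sum() / (x[:-1] ** 2).sum()   # conditional LS, equation (14)

z = np.sqrt(n) * (est - phi)
print(z.var(), 1 - phi ** 2)                        # Monte Carlo variance vs 1 - phi^2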