Large Sample Theory
In statistics, we are interested in the properties of particular random variables (or
“estimators”), which are functions of our data. In asymptotic analysis, we focus on
describing the properties of estimators when the sample size becomes arbitrarily large.
The idea is that given a reasonably large dataset, the properties of an estimator even
when the sample size is finite are similar to the properties of an estimator when the
sample size is arbitrarily large.
In these notes we focus on the large sample properties of sample averages formed from
i.i.d. data. That is, assume that $X_i \sim$ i.i.d. $F$, for $i = 1, \ldots, n, \ldots$. Assume $EX_i = \mu$ for all $i$. The sample average after $n$ draws is $\bar X_n \equiv \frac{1}{n}\sum_i X_i$.
We focus on two important sets of large sample results:
(1) Law of large numbers: $\bar X_n \to EX$ as $n \to \infty$.
(2) Central limit theorem: $\sqrt{n}(\bar X_n - EX) \to N(0, \cdot)$. That is, $\sqrt{n}$ times a sample average looks like (in a precise sense to be defined later) a normal random variable as $n$ gets large.
An important endeavor of asymptotic statistics is to show (1) and (2) under various
assumptions on the data sampling process.
Consider a sequence of random variables $Z_1, Z_2, \ldots, Z_n, \ldots$.
Convergence in probability:
$$Z_n \xrightarrow{p} Z \iff \text{for all } \epsilon > 0, \ \lim_{n\to\infty} \text{Prob}\left(|Z_n - Z| < \epsilon\right) = 1.$$
More formally: for all $\epsilon > 0$ and $\delta > 0$, there exists $n_0(\epsilon, \delta)$ such that for all $n > n_0(\epsilon, \delta)$:
$$\text{Prob}\left(|Z_n - Z| < \epsilon\right) > 1 - \delta.$$
Note that the limiting variable Z can also be a random variable. Furthermore, for
convergence in probability, the random variables Z1 , Z2 , . . . and Z should be defined
on the same probability space: if this common sample space is Ω with elements ω,
then the statement of convergence in probability is that:
$$\text{Prob}\left(\omega \in \Omega : |Z_n(\omega) - Z(\omega)| < \epsilon\right) > 1 - \delta.$$
Corresponding to this convergence concept, we have the Weak Law of Large Numbers (WLLN), which is the result that $\bar X_n \xrightarrow{p} \mu$.
Earlier, we had used Chebyshev's inequality to prove a version of the WLLN, under the assumption that the $X_i$ are i.i.d. with mean $\mu$ and variance $\sigma^2$.
Khinchine's law of large numbers only requires that the mean exists (i.e., is finite), but does not require existence of variances.
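As a quick illustration of the WLLN, here is a minimal simulation sketch (Python with numpy; the exponential parent distribution and the sample sizes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 2.0  # true mean of the Exponential parent (illustrative choice)

for n in [10, 100, 10_000, 1_000_000]:
    x = rng.exponential(scale=mu, size=n)   # i.i.d. draws with mean mu
    print(f"n = {n:>9,d}   sample average = {x.mean():.4f}   "
          f"|error| = {abs(x.mean() - mu):.4f}")
```

The absolute error of the sample average shrinks as $n$ grows, as the WLLN suggests.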
Recall Markov's inequality: for positive random variables $Z$, and $p > 0$, $\epsilon > 0$, we have $P(Z > \epsilon) \le E(Z^p)/\epsilon^p$. Take $Z = |X_n - X^*|$. Then if we have
$$E\left[|X_n - X^*|^p\right] \to 0,$$
that is, $X_n$ converges in $p$-th mean to $X^*$, then by Markov's inequality, we also have $P(|X_n - X^*| \ge \epsilon) \to 0$. That is, convergence in $p$-th mean implies convergence in probability.
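A minimal numerical check of this implication (Python sketch; the sequence $X_n = X^* + N(0, 1/n)$ noise, the value of $p$, and $\epsilon$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
eps, p = 0.1, 2          # epsilon and the order of the mean (illustrative)
x_star = 1.0             # limiting value X*

for n in [10, 100, 1000, 10000]:
    xn = x_star + rng.normal(scale=1 / np.sqrt(n), size=200_000)  # draws of X_n
    pth_mean = np.mean(np.abs(xn - x_star) ** p)                  # E|X_n - X*|^p
    prob = np.mean(np.abs(xn - x_star) >= eps)                    # P(|X_n - X*| >= eps)
    print(f"n={n:>6d}  E|Xn-X*|^p={pth_mean:.5f}  "
          f"Markov bound={pth_mean / eps**p:.5f}  P(|Xn-X*|>=eps)={prob:.5f}")
```

Both the $p$-th mean and the exceedance probability shrink, and the probability stays below the Markov bound.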
Some definitions and results:
• Define: the probability limit of a random sequence $Z_n$, denoted $\text{plim}_{n\to\infty} Z_n$, is a non-random quantity $\alpha$ such that $Z_n \xrightarrow{p} \alpha$.
• Stochastic orders of magnitude:
  – "big-$O_p$" (bounded in probability): $Z_n = O_p(n^\lambda)$ $\iff$ for every $\delta > 0$, there exist a finite $\Delta(\delta)$ and $n^*(\delta)$ such that $\text{Prob}\left(\left|\frac{Z_n}{n^\lambda}\right| > \Delta\right) < \delta$ for all $n \ge n^*(\delta)$.
  – "little-$o_p$": $Z_n = o_p(n^\lambda)$ $\iff$ $\frac{Z_n}{n^\lambda} \xrightarrow{p} 0$. In particular, $Z_n = o_p(1)$ $\iff$ $Z_n \xrightarrow{p} 0$.
• Plim operator theorem: Let $Z_n$ be a $k$-dimensional random vector, and $g(\cdot)$ be a function which is continuous at a constant $k$-vector point $\alpha$. Then
$$Z_n \xrightarrow{p} \alpha \implies g(Z_n) \xrightarrow{p} g(\alpha).$$
In other words: $g(\text{plim } Z_n) = \text{plim } g(Z_n)$.
Proof: See Serfling, Approximation Theorems of Mathematical Statistics, p. 24.
Don't confuse this: $\text{plim } \frac{1}{n}\sum_i g(Z_i) = Eg(Z_i)$, from the LLN, but this is distinct from $\text{plim } g\left(\frac{1}{n}\sum_i Z_i\right) = g(EZ_i)$, using the plim operator result. The two are generally not the same; if $g(\cdot)$ is convex, then $\text{plim } \frac{1}{n}\sum_i g(Z_i) \ge \text{plim } g\left(\frac{1}{n}\sum_i Z_i\right)$ (see the simulation sketch after this list).
• A useful intermediate result ("squeezing"): if $Z_n \xrightarrow{p} \alpha$ (a constant) and $Z_n^*$ lies between $Z_n$ and $\alpha$ with probability 1, then $Z_n^* \xrightarrow{p} \alpha$.
(Quick proof: for all $\epsilon > 0$, $|Z_n - \alpha| < \epsilon \Rightarrow |Z_n^* - \alpha| < \epsilon$. Hence $\text{Prob}(|Z_n^* - \alpha| < \epsilon) \ge \text{Prob}(|Z_n - \alpha| < \epsilon) \to 1$.)
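Referring back to the "Don't confuse this" remark in the list above, here is a minimal simulation sketch of that distinction (Python; the convex choice $g(z) = z^2$ and the Exponential parent are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
z = rng.exponential(scale=1.0, size=n)   # EZ = 1, E g(Z) = E Z^2 = 2 for g(z) = z^2

g = lambda v: v ** 2                     # a convex function
mean_of_g = g(z).mean()                  # (1/n) sum_i g(Z_i)  ->  E g(Z_i) = 2
g_of_mean = g(z.mean())                  # g((1/n) sum_i Z_i)  ->  g(E Z_i) = 1

print(f"(1/n) sum g(Z_i) ~ {mean_of_g:.3f}   g((1/n) sum Z_i) ~ {g_of_mean:.3f}")
# By Jensen's inequality the first limit (2) exceeds the second (1).
```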
Convergence almost surely
$$Z_n \xrightarrow{as} Z \iff \text{Prob}\left(\omega : \lim_{n\to\infty} Z_n(\omega) = Z(\omega)\right) = 1.$$
As with convergence in probability, almost sure convergence makes sense when the
random variables Z1 , . . . , Zn , . . . and Z are all defined on the same sample space.
Hence, for each element in the sample space ω ∈ Ω, we can associate the sequence
Z1 (ω), . . . , Zn (ω), . . . as well as the limit point Z(ω).
To understand the probability statement more clearly, assume that all these RVs are
defined on the sample space (Ω, B(Ω), P ).
Define some set-theoretic concepts. Consider a sequence of sets $S_1, S_2, \ldots$, all $\in \mathcal B(\Omega)$.
Unless this sequence is monotonic, it’s difficult to talk about a “limit” of this sequence.
So we define the “liminf” and “limsup” of this set sequence.
$$\liminf_{n\to\infty} S_n \equiv \lim_{n\to\infty}\bigcap_{m=n}^{\infty} S_m = \bigcup_{n=1}^{\infty}\bigcap_{m=n}^{\infty} S_m = \left\{\omega \in \Omega : \omega \in S_n, \ \forall n \ge n_0(\omega)\right\}.$$
Note that the sequence of sets $\bigcap_{m=n}^{\infty} S_m$, for $n = 1, 2, 3, \ldots$ is a non-decreasing sequence of sets. By taking the union, the liminf is thus the limit of this monotone sequence of sets. That is, for all $\omega \in \liminf S_n$, there exists some number $n_0(\omega)$ such that $\omega \in S_n$ for all $n \ge n_0(\omega)$. Hence we can say that $\liminf S_n$ is the set of outcomes which occur "eventually".
$$\limsup_{n\to\infty} S_n \equiv \bigcap_{n=1}^{\infty}\bigcup_{m=n}^{\infty} S_m = \lim_{n\to\infty}\bigcup_{m=n}^{\infty} S_m = \left\{\omega \in \Omega : \text{for every } m,\ \omega \in S_{n_0(\omega,m)} \text{ for some } n_0(\omega,m) \ge m\right\}.$$
Note that the sequence of sets $\bigcup_{m=n}^{\infty} S_m$, for $n = 1, 2, 3, \ldots$ is a non-increasing sequence of sets. Hence, the limsup is the limit of this monotone sequence of sets. The $\limsup S_n$ is the set of outcomes $\omega$ which occur within every tail of the set sequence $S_n$. Hence we say that an outcome $\omega \in \limsup S_n$ occurs "infinitely often". (Note, however, that all outcomes $\omega \in \liminf S_n$ also occur an infinite number of times along the infinite sequence $S_n$.)
Note that
$$\liminf S_n \subset \limsup S_n.$$
Hence
$$P(\limsup S_n) \ge P(\liminf S_n)$$
and
$$P(\limsup S_n) = 0 \implies P(\liminf S_n) = 0, \qquad P(\liminf S_n) = 1 \implies P(\limsup S_n) = 1.$$
Borel-Cantelli Lemma: if $\sum_{i=1}^{\infty} P(S_i) < \infty$ then $P(\limsup_{n\to\infty} S_n) = 0$.
Proof: we have
$$P\left(\limsup_{n\to\infty} S_n\right) = P\left(\lim_{n\to\infty}\bigcup_{m=n}^{\infty} S_m\right) = \lim_{n\to\infty} P\left(\bigcup_{m=n}^{\infty} S_m\right) \le \lim_{n\to\infty}\sum_{m\ge n} P(S_m),$$
which equals zero (by the assumption that $\sum_{i=1}^{\infty} P(S_i) < \infty$).
The equality interchanging the "limit" and "Prob" operations is an application of the Monotone Convergence Theorem, which holds only because the sequence $\bigcup_{m=n}^{\infty} S_m$ is monotone.
To apply this to almost-sure convergence, consider the sequence of sets
$$S_n \equiv \left\{\omega : |Z_n(\omega) - Z(\omega)| > \epsilon\right\},$$
for $\epsilon > 0$. Almost sure convergence is the statement that $P(\limsup S_n) = 0$:
• Then $\bigcup_{m=n}^{\infty} S_m$ denotes the $\omega$'s such that the sequence $|Z_m(\omega) - Z(\omega)|$ exceeds $\epsilon$ somewhere in the tail beyond $n$.
• Then $\lim_{n\to\infty}\bigcup_{m=n}^{\infty} S_m$ denotes all $\omega$'s such that the sequence $|Z_n(\omega) - Z(\omega)|$ exceeds $\epsilon$ in every tail: those $\omega$ for which $Z_n(\omega)$ "escapes" the $\epsilon$-ball around $Z(\omega)$ infinitely often. For these $\omega$'s, $\lim_{n\to\infty} Z_n(\omega) \ne Z(\omega)$.
• For almost-sure convergence, you require the probability of this set to equal zero, i.e. $\Pr\left(\lim_{n\to\infty}\bigcup_{m=n}^{\infty} S_m\right) = 0$.
Corresponding to this convergence concept, we have the Strong Law of Large Numbers (SLLN), which is the result that $\bar X_n \xrightarrow{as} \mu$. Consider that the sample averages are random variables $\bar X_1(\omega), \bar X_2(\omega), \ldots$ defined on the same probability space, say $(\Omega, \mathcal B, P)$, where each $\omega$ indexes a sequence $\bar X_1(\omega), \bar X_2(\omega), \bar X_3(\omega), \ldots$. Consider the set sequence
$$S_n = \left\{\omega : |\bar X_n(\omega) - \mu| > \epsilon\right\}.$$
The SLLN is the statement that $P(\limsup S_n) = 0$.
Proof (sketch; Davidson, pg. 296): For convenience, consider $\mu = 0$. From the above discussion, we see that the two statements $\bar X_n \xrightarrow{as} 0$ and $P(|\bar X_n| > \epsilon, \text{ i.o.}) = 0$ for any $\epsilon > 0$ are equivalent. Therefore, one way of proving a SLLN is to verify that $\sum_{n=1}^{\infty} P(|\bar X_n| > \epsilon) < \infty$ for all $\epsilon > 0$, and then apply the Borel-Cantelli Lemma. As a starting point, we see that Chebyshev's inequality is not good enough by itself; it tells us
$$P(|\bar X_n| > \epsilon) \le \frac{\sigma^2}{n\epsilon^2}; \quad \text{but} \quad \sum_{n=1}^{\infty}\frac{1}{n} = \infty.$$
However, consider the subsequence with indices $n^2$: $\{1, 4, 9, 16, \ldots\}$. Again using Chebyshev's inequality, we have
$$P(|\bar X_{n^2}| > \epsilon) \le \frac{\sigma^2}{n^2\epsilon^2}; \quad \text{and} \quad \sum_{n=1}^{\infty}\frac{1}{n^2\epsilon^2} = \frac{1}{\epsilon^2}(\pi^2/6) \approx \frac{1.64}{\epsilon^2} < \infty.$$
(Another of Euler's greatest hits: $\sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6}$, i.e. the value $\zeta(2)$ of the Riemann zeta function.)
Hence, by the BC Lemma, we have that $\bar X_{n^2} \xrightarrow{as} 0$. Next, examine the "omitted" terms from the sum $\sum_{n=1}^{\infty} P(|\bar X_n| > \epsilon)$. Define $D_{n^2} = \max_{n^2 \le k < (n+1)^2} |\bar X_k - \bar X_{n^2}|$. Note that:
$$\bar X_k - \bar X_{n^2} = \left(\frac{n^2}{k} - 1\right)\bar X_{n^2} + \frac{1}{k}\sum_{t=n^2+1}^{k} X_t$$
and, by independence of the $X_i$, the two terms on the RHS are uncorrelated. Hence
$$\mathrm{Var}(\bar X_k - \bar X_{n^2}) = \left(1 - \frac{n^2}{k}\right)^2\frac{\sigma^2}{n^2} + \frac{k - n^2}{k^2}\sigma^2 = \sigma^2\left(\frac{1}{n^2} - \frac{1}{k}\right) \le \sigma^2\left(\frac{1}{n^2} - \frac{1}{(n+1)^2}\right).$$
By Chebyshev's inequality, then, we have
$$P(D_{n^2} > \epsilon) \le \frac{\sigma^2}{\epsilon^2}\left(\frac{1}{n^2} - \frac{1}{(n+1)^2}\right); \quad \text{so} \quad \sum_n P(D_{n^2} > \epsilon) \le \frac{\sigma^2}{\epsilon^2}\sum_n\left(\frac{1}{n^2} - \frac{1}{(n+1)^2}\right) < \frac{\sigma^2}{\epsilon^2}\sum_n\frac{1}{n^2} < \infty,$$
implying, using the BC Lemma, that $D_{n^2} \xrightarrow{as} 0$.
Now, consider, for $n^2 \le l < (n+1)^2$,
$$|\bar X_l| = |\bar X_l - \bar X_{n^2} + \bar X_{n^2}| \le |\bar X_l - \bar X_{n^2}| + |\bar X_{n^2}| \le D_{n^2} + |\bar X_{n^2}|.$$
By the above discussion, the RHS converges a.s. to 0. Hence, for all integers $l$, we have that $|\bar X_l|$ is bounded by a random variable which converges a.s. to 0. Thus, $\bar X_l \xrightarrow{as} 0$.
Example: $X_i \sim$ i.i.d. $U[0,1]$. Show that $X_{(1:n)} \equiv \min_{i=1,\ldots,n} X_i \xrightarrow{as} 0$.
Take $S_n \equiv \{|X_{(1:n)}| > \epsilon\}$. For all $\epsilon > 0$, $P(|X_{(1:n)}| > \epsilon) = P(X_{(1:n)} > \epsilon) = P(X_i > \epsilon,\ i = 1, \ldots, n) = (1 - \epsilon)^n$.
Hence, $\sum_{n=1}^{\infty}(1 - \epsilon)^n = \frac{1-\epsilon}{\epsilon} < \infty$. So the conclusion follows by the BC Lemma.
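A minimal simulation sketch of this example (Python; the number of paths, path length, and cutoff $\epsilon$ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
eps = 0.05

# Simulate several sample paths of the running minimum X_(1:n) of U[0,1] draws.
for path in range(3):
    x = rng.uniform(size=10_000)
    running_min = np.minimum.accumulate(x)
    print(f"path {path}: min after n=100: {running_min[99]:.4f}, "
          f"after n=10000: {running_min[-1]:.6f}")

# The Borel-Cantelli bound: sum_n (1-eps)^n is finite.
print("sum_n (1-eps)^n =", (1 - eps) / eps)
```

Each path's running minimum is driven toward 0, consistent with almost-sure convergence.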
Theorem: $Z_n \xrightarrow{as} Z \implies Z_n \xrightarrow{p} Z$.
Proof:
$$0 = P\left(\lim_{n\to\infty}\bigcup_{m\ge n}\{\omega : |Z_m(\omega) - Z(\omega)| > \epsilon\}\right) = \lim_{n\to\infty} P\left(\bigcup_{m\ge n}\{\omega : |Z_m(\omega) - Z(\omega)| > \epsilon\}\right) \ge \lim_{n\to\infty} P\left(\{\omega : |Z_n(\omega) - Z(\omega)| > \epsilon\}\right).$$
Convergence in Distribution: $Z_n \xrightarrow{d} Z$.
A sequence of real-valued random variables $Z_1, Z_2, Z_3, \ldots$ converges in distribution to a random variable $Z$ if
$$\lim_{n\to\infty} F_{Z_n}(z) = F_Z(z) \quad \forall z \text{ s.t. } F_Z(z) \text{ is continuous.} \tag{1}$$
This is a statement about the CDFs of the random variables Z, and Z1 , Z2 , . . .. These
random variables do not need to be defined on the same probability space. FZ is also
called the “limiting distribution” of the random sequence Zn .
Alternative definitions of convergence in distribution (“Portmanteau theorem”):
• Letting $F([a,b]) \equiv F(b) - F(a)$ for $a < b$, convergence (1) is equivalent to
$$F_{Z_n}([a,b]) \to F_Z([a,b]) \quad \forall [a,b] \text{ s.t. } F_Z \text{ continuous at } a, b.$$
• For all bounded, continuous functions $g(\cdot)$,
$$\int g(z)\,dF_{Z_n}(z) \to \int g(z)\,dF_Z(z) \iff Eg(Z_n) \to Eg(Z). \tag{2}$$
This definition of distributional convergence is more useful in advanced settings, because it is extendable to settings where $Z_n$ and $Z$ are general random elements taking values in a metric space.
• Levy's continuity theorem: A sequence $\{X_n\}$ of random variables converges in distribution to a random variable $X$ if and only if the sequence of characteristic functions $\{\varphi_{X_n}(t)\}$ converges pointwise (in $t$) to a function $\varphi_X(t)$ which is continuous at the origin. Then $\varphi_X$ is the characteristic function of $X$.
Hence, this ties together convergence of characteristic functions and convergence in distribution. This theorem will be used to prove the CLT later.
• For proofs of most results here, see Serfling, Approximation Theorems of Mathematical Statistics, ch. 1.
Distributional convergence, as defined above, is called “weak convergence”. This is
because there are stronger notions of distributional convergence. One such notion is:
$$\sup_{A \in \mathcal B(\mathbb R)} |P_{Z_n}(A) - P_Z(A)| \to 0,$$
which is convergence in total variation norm.
Example: Multinomial distribution on $[0,1]$. $X_n \in \left\{\frac{1}{n}, \frac{2}{n}, \frac{3}{n}, \ldots, \frac{n}{n} = 1\right\}$, each with probability $\frac{1}{n}$. Consider $A = \mathbb Q$ (the set of rational numbers in $[0,1]$). Here $X_n \xrightarrow{d} U[0,1]$, but $P_{X_n}(\mathbb Q) = 1$ for every $n$ while $P_{U[0,1]}(\mathbb Q) = 0$, so convergence in total variation fails.
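A minimal numerical illustration of this example (Python sketch; the grid sizes are arbitrary). The CDF of the discrete uniform $X_n$ gets uniformly close to the $U[0,1]$ CDF, even though the probability $X_n$ assigns to its (rational) support points never shrinks:

```python
import numpy as np

for n in [10, 100, 1000]:
    support = np.arange(1, n + 1) / n          # {1/n, 2/n, ..., 1}, each with prob 1/n
    z = np.linspace(0, 1, 501)
    cdf_xn = np.searchsorted(support, z, side="right") / n   # CDF of X_n
    cdf_u = z                                                 # CDF of U[0,1]
    print(f"n={n:>5d}  max CDF gap = {np.max(np.abs(cdf_xn - cdf_u)):.4f}  "
          f"P_Xn(support, rationals) = 1.0   P_U(rationals) = 0.0")
```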
Some definitions and results:
• Slutsky Theorem: If $Z_n \xrightarrow{d} Z$, and $Y_n \xrightarrow{p} \alpha$ (a constant), then
(a) $Y_n Z_n \xrightarrow{d} \alpha Z$
(b) $Z_n + Y_n \xrightarrow{d} Z + \alpha$.
• Theorem *:
(a) $Z_n \xrightarrow{p} Z \implies Z_n \xrightarrow{d} Z$
(b) $Z_n \xrightarrow{d} Z \implies Z_n \xrightarrow{p} Z$ if $Z$ is a constant.
Note: convergence in probability implies (roughly speaking) that the random variables $Z_n$ (for $n$ large enough) and $Z$ take values close to each other with high probability. Convergence in distribution need not imply this, only that the CDFs of $Z_n$ and $Z$ are similar.
• $Z_n \xrightarrow{d} Z \implies Z_n = O_p(1)$. Use $\Delta = \max\left(|F_Z^{-1}(1-\epsilon)|, |F_Z^{-1}(\epsilon)|\right)$ in the definition of $O_p(1)$.
Note that the LLN tells us that $\bar X_n \xrightarrow{p,\,as} \mu$, which implies (trivially) that $\bar X_n \xrightarrow{d} \mu$, a degenerate limiting distribution.
This is not very useful for our purposes, because we are interested in knowing (say)
how far X̄n is from µ, which is unknown.
How do we “fix” X̄n , so that it has a non-degenerate limiting distribution?
Central Limit Theorem (Lindeberg-Levy): Let $X_i$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. Then
$$\frac{\sqrt n(\bar X_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1).$$
$\sqrt n$ is also called the "rate of convergence" of the sequence $\bar X_n$. By definition, the rate of convergence is the power $p$ (equivalently, the factor $n^p$) for which $n^p(\bar X_n - \mu)$ converges in distribution to a nondegenerate distribution.
The rate of convergence $\sqrt n$ makes sense: if you blow up $\bar X_n$ by a constant (no matter how big), you still get a degenerate limiting distribution. If you blow up by $n$, then the sequence $S_n \equiv n\bar X_n = \sum_{i=1}^{n} X_i$ will diverge.
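A minimal simulation sketch of the CLT statement (Python; the chi-square parent distribution, the replication count, and the sample sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 3.0, np.sqrt(6.0)      # mean and sd of a chi-square(3) parent
reps = 20_000

for n in [5, 50, 500]:
    x = rng.chisquare(df=3, size=(reps, n))
    z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma     # sqrt(n)(Xbar - mu)/sigma
    # Compare a few tail probabilities with the standard normal values.
    print(f"n={n:>4d}  P(Z<=-1.645)={np.mean(z <= -1.645):.3f} (normal: 0.050)  "
          f"P(Z<=1.96)={np.mean(z <= 1.96):.3f} (normal: 0.975)")
```

Even for a skewed parent distribution, the standardized sample mean's tail probabilities approach the standard normal values as $n$ grows.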
Proof via characteristic functions: Consider $X_i \sim$ i.i.d. with mean $\mu$ and variance $\sigma^2$. Define:
$$Z_n \equiv \frac{\sqrt n(\bar X_n - \mu)}{\sigma} = \sum_i \frac{X_i - \mu}{\sqrt n\,\sigma} \equiv \frac{1}{\sqrt n}\sum_i W_i.$$
Note that the $W_i \equiv (X_i - \mu)/\sigma$ are i.i.d. with mean $EW_i = 0$ and variance $VW_i = EW_i^2 = 1$.
Let $\varphi(t)$ denote the characteristic function. Then
$$\varphi_{Z_n}(t) = \left[\varphi_{W/\sqrt n}(t)\right]^n = \left[\varphi_W(t/\sqrt n)\right]^n.$$
Using the second-order Taylor expansion for $\varphi_W(t/\sqrt n)$ around 0, we have:
$$\varphi_W(t/\sqrt n) = \varphi_W(0) + \frac{t}{\sqrt n}\,i\,EW + \frac{t^2}{2n}\,i^2\,EW^2 + o\!\left(\frac{t^2}{n}\right) = \varphi_W(0) + 0 - \frac{t^2}{2n}\cdot 1 + o\!\left(\frac{t^2}{n}\right).$$
Now we have
$$\log \varphi_{Z_n}(t) = n\log\left[\varphi_W(0) + 0 + \frac{(it)^2}{2n}\cdot 1 + o\!\left(\frac{t^2}{n}\right)\right] = n\log\left[1 + \frac{(it)^2}{2n} + o\!\left(\frac{t^2}{n}\right)\right]$$
$$\approx n\left[\log(1) + \left(\frac{(it)^2}{2n} + o\!\left(\frac{t^2}{n}\right)\right)\cdot\log'(1)\right] = 0 - \frac{t^2}{2} + n\cdot o\!\left(\frac{t^2}{n}\right) \;\to_{n\to\infty}\; -\frac{t^2}{2}.$$
The approximation step comes from Taylor-expanding $\log\{\cdots\}$ around 1. Hence
$$\varphi_{Z_n}(t) \;\to_{n\to\infty}\; \exp\left(\frac{-t^2}{2}\right) = \varphi_{N(0,1)}(t).$$
Since $\varphi_{N(0,1)}(t)$ is continuous at $t = 0$, we have $Z_n \xrightarrow{d} N(0,1)$.
Asymptotic approximations for X̄n
The CLT tells us that $\sqrt n(\bar X_n - \mu)/\sigma$ has a limiting standard normal distribution. We can use this "true" result to say something about the distribution of $\bar X_n$, even when $n$ is finite. That is, we use the CLT to derive an asymptotic approximation for the finite-sample distribution of $\bar X_n$.
The approximation is as follows: starting with the result of the CLT:
$$\sqrt n(\bar X_n - \mu)/\sigma \xrightarrow{d} N(0, 1)$$
we "flip over" the result of the CLT:
$$\bar X_n \overset{a}{\sim} \frac{\sigma}{\sqrt n}N(0,1) + \mu \quad\Rightarrow\quad \bar X_n \overset{a}{\sim} N\!\left(\mu, \frac{1}{n}\sigma^2\right).$$
The notation "$\overset{a}{\sim}$" makes explicit that what is on the RHS is an approximation. Note that $\bar X_n \xrightarrow{d} N(\mu, \frac{1}{n}\sigma^2)$ is definitely not true!
This approximation intuitively makes sense: under the assumptions of the LLCLT, we know that $E\bar X_n = \mu$ and $\mathrm{Var}(\bar X_n) = \sigma^2/n$. What the asymptotic approximation tells us is that the distribution of $\bar X_n$ is approximately normal. (Note that the approximation is exactly right if we assume that $X_i \sim N(\mu, \sigma^2)$ for all $i$.)
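A minimal check of the quality of this approximation (Python sketch; requires numpy and scipy; the exponential parent, cutoff point, and sample sizes are illustrative choices). It compares $P(\bar X_n \le x)$ estimated by simulation with the normal approximation $\Phi\!\left(\frac{x - \mu}{\sigma/\sqrt n}\right)$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
mu = sigma = 1.0                     # Exponential(1): mean 1, sd 1
x_cut = 1.1                          # evaluate P(Xbar_n <= 1.1)

for n in [5, 25, 200]:
    xbar = rng.exponential(scale=mu, size=(50_000, n)).mean(axis=1)
    mc = np.mean(xbar <= x_cut)                              # Monte Carlo estimate
    approx = norm.cdf((x_cut - mu) / (sigma / np.sqrt(n)))   # N(mu, sigma^2/n) approx
    print(f"n={n:>4d}  simulated P = {mc:.4f}   normal approx = {approx:.4f}")
```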
Asymptotic approximations for functions of X̄n
Oftentimes, we are not interested per se in approximating the finite-sample distribution of the sample mean $\bar X_n$, but rather in functions of a sample mean. (Later, you will see that the asymptotic approximations for many statistics and estimators that you run across are derived by expressing them as sample averages.)
Continuous mapping theorem: Let $g(\cdot)$ be a continuous function. Then
$$Z_n \xrightarrow{d} Z \implies g(Z_n) \xrightarrow{d} g(Z).$$
Proof: Serfling, Approximation Theorems of Mathematical Statistics, p. 25.
(Note: you still have to figure out what the limiting distribution of $g(Z)$ is. But if you know $F_Z$, then you can get $F_{g(Z)}$ by the change-of-variables formulas.)
Note that for any linear function $g(\bar X_n) = a\bar X_n + b$, deriving the limiting distribution of $g(\bar X_n)$ is no problem (just use Slutsky's Theorem to get $a\bar X_n + b \overset{a}{\sim} N(a\mu + b, \frac{1}{n}a^2\sigma^2)$). The "problem" in deriving the distribution of $g(\bar X_n)$ arises when $g(\cdot)$ is nonlinear so that the $\bar X_n$ is "inside" the $g$ function:
1. Use the Mean Value Theorem: Let $g$ be a continuous function on $[a,b]$ that is differentiable on $(a,b)$. Then, there exists (at least one) $\lambda \in (a,b)$ such that
$$g'(\lambda) = \frac{g(b) - g(a)}{b - a} \iff g(b) - g(a) = g'(\lambda)(b - a).$$
Using the MVT, we can write
$$g(\bar X_n) - g(\mu) = g'(X_n^*)(\bar X_n - \mu) \tag{3}$$
where $X_n^*$ is an RV strictly between $\bar X_n$ and $\mu$.
2. On the RHS of Eq. (3):
(a) $g'(X_n^*) \xrightarrow{p} g'(\mu)$ by the "squeezing" result, and the plim operator theorem.
(b) If we multiply by $\sqrt n$ and divide by $\sigma$, we can apply the CLT to get $\sqrt n(\bar X_n - \mu)/\sigma \xrightarrow{d} N(0,1)$.
(c) Hence,
$$\sqrt n \cdot \text{RHS} = \sqrt n\left[g(\bar X_n) - g(\mu)\right] \xrightarrow{d} N\left(0, g'(\mu)^2\sigma^2\right), \tag{4}$$
using Slutsky's theorem.
3. Now, in order to get the asymptotic approximation for the distribution of $g(\bar X_n)$, we "flip over" to get
$$g(\bar X_n) \overset{a}{\sim} N\!\left(g(\mu), \frac{1}{n}g'(\mu)^2\sigma^2\right).$$
4. Check that $g$ satisfies the assumptions for these results to go through: continuity and differentiability for the MVT, and continuity of $g'$ at $\mu$ for the plim operator theorem.
Examples: 1/X̄n , exp(X̄n ), (X̄n )2 , etc.
Eq. (4) is a general result, known as the Delta method. For the purposes of
this class, I want you to derive the approximate distributions for g(X̄n ) from first
principles (as we did above).
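A minimal simulation sketch of this approximation for a nonlinear $g$ (Python; the choice $g(x) = e^x$, the Poisson parent, the sample size, and replication count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
mu = var = 2.0                       # Poisson(2): mean 2, variance 2
n, reps = 200, 20_000

xbar = rng.poisson(lam=mu, size=(reps, n)).mean(axis=1)
g_xbar = np.exp(xbar)                                 # g(Xbar_n) with g(x) = e^x

# Delta-method approximation: g(Xbar) ~a N(g(mu), g'(mu)^2 * var / n), g'(mu) = e^mu
approx_sd = np.exp(mu) * np.sqrt(var / n)
print(f"simulated mean {g_xbar.mean():.3f} vs g(mu) = {np.exp(mu):.3f}")
print(f"simulated sd   {g_xbar.std():.3f} vs delta-method sd = {approx_sd:.3f}")
```

The simulated spread of $g(\bar X_n)$ is close to the delta-method standard deviation $e^{\mu}\sqrt{\sigma^2/n}$.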
1 Some extra topics (CAN SKIP)
1.1 Triangular array CLT
The LLCLT assumes independence across observations as well as identically distributed observations. We extend this to the "independent, non-identically distributed" setting.
Lindeberg-Feller CLT: For each $n$, let $Y_{n,1}, \ldots, Y_{n,n}$ be independent random variables with finite (possibly non-identical) means and variances $\sigma^2_{n,i}$, such that
$$(1/C_n^2)\cdot\sum_{i=1}^{n} E\left[|Y_{n,i} - EY_{n,i}|^2\,\mathbf 1\{|Y_{n,i} - EY_{n,i}|/C_n > \epsilon\}\right] \to 0 \quad \text{for every } \epsilon > 0,$$
where $C_n \equiv \left[\sum_{i=1}^{n}\sigma^2_{n,i}\right]^{1/2}$. Then
$$\frac{\sum_{i=1}^{n}(Y_{n,i} - EY_{n,i})}{C_n} \xrightarrow{d} N(0,1).$$
• The sampling framework is known as a "triangular array". Note that the "(n−1)-th observation in the $n$-th sample" ($Y_{n,n-1}$) need not coincide with $Y_{n-1,n-1}$, the "(n−1)-th observation in the $(n-1)$-th sample".
• $C_n$ is just the standard deviation of $\sum_{i=1}^{n} Y_{n,i}$. Hence $\frac{\sum_{i=1}^{n}(Y_{n,i} - EY_{n,i})}{C_n}$ is a "standardized sum".
This is useful for showing asymptotic normality of Least-Squares regression. Let $y = X\beta^0 + \epsilon$, with the $\epsilon_i$ i.i.d. with mean zero and variance $\sigma^2$, and $X$ being an $n \times 1$ vector of covariates (here we have just 1 RHS variable).
The OLS estimator is $\hat\beta = (X'X)^{-1}X'Y$. To apply the LFCLT, we consider the normalized difference between the estimated and true $\beta$'s:
$$\frac{(X'X)^{1/2}}{\sigma}\left(\hat\beta - \beta^0\right) = (1/\sigma)\cdot(X'X)^{-1/2}X'\epsilon \equiv \sum_{i=1}^{n} a_{n,i}\epsilon_i,$$
where $a_{n,i}$ corresponds to the $i$-th component of the $1\times n$ vector $(1/\sigma)\cdot(X'X)^{-1/2}X'$.
So we just need to show that the Lindeberg condition applies to $Y_{n,i} = a_{n,i}\epsilon_i$. Note that $\sum_{i=1}^{n} V(a_{n,i}\epsilon_i) = V\left(\sum_{i=1}^{n} a_{n,i}\epsilon_i\right) = (1/\sigma^2)\cdot V\left((X'X)^{-1/2}X'\epsilon\right) = \sigma^2/\sigma^2 \cdot 1 = 1 = C_n^2$.
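A minimal simulation sketch of this application (Python; the regressor design, error distribution, sample size, and replication count are arbitrary illustrative choices). It checks that the standardized quantity $(X'X)^{1/2}(\hat\beta - \beta^0)/\sigma$ looks standard normal across replications, even with non-normal errors:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps, beta0, sigma = 200, 20_000, 1.5, 2.0

z = np.empty(reps)
for r in range(reps):
    x = rng.uniform(1, 3, size=n)                 # a single regressor
    eps = sigma * (rng.exponential(size=n) - 1)   # non-normal, mean-zero errors
    y = beta0 * x + eps
    beta_hat = (x @ y) / (x @ x)                  # OLS with one regressor, no intercept
    z[r] = np.sqrt(x @ x) * (beta_hat - beta0) / sigma

print(f"mean {z.mean():.3f} (target 0), sd {z.std():.3f} (target 1), "
      f"P(Z<=1.645) = {np.mean(z <= 1.645):.3f} (normal: 0.950)")
```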
1.2 Convergence in distribution vs. a.s.
Combining two results above, we have
$$Z_n \xrightarrow{as} Z \implies Z_n \xrightarrow{d} Z.$$
The converse is not generally true. (Indeed, $\{Z_1, Z_2, Z_3, \ldots\}$ need not be defined on the same probability space, in which case a.s. convergence makes no sense.)
However, consider
$$Z_n^* = F_{Z_n}^{-1}(U), \qquad Z^* = F_Z^{-1}(U); \qquad U \sim U[0,1].$$
Here $F_Z^{-1}(U)$ denotes the quantile function corresponding to the CDF $F(z)$:
$$F^{-1}(\tau) = \inf\{z : F(z) > \tau\}, \qquad \tau \in [0,1].$$
We have $F(F^{-1}(\tau)) = \tau$. (Note that the quantile function is also right-continuous; discontinuity points of the quantile function arise where the CDF function is "flat".)
Then
$$P(Z_n^* \le z) = P(F_{Z_n}^{-1}(U) \le z) = P(U \le F_{Z_n}(z)) = F_{Z_n}(z)$$
so that $Z_n^* \overset{d}{=} Z_n$. (Even though their domains are different!) Similarly, $Z^* \overset{d}{=} Z$. The notation "$\overset{d}{=}$" means "identically distributed".
Moreover, it turns out (quite intuitively) that the convergence $F_{Z_n}^{-1}(U) \to F_Z^{-1}(U)$ fails only at points where $F_Z^{-1}(U)$ is discontinuous (corresponding to flat portions of $F_Z(z)$). Since these points of discontinuity are a countable set, their probability (under $U[0,1]$) is equal to zero, so that $F_{Z_n}^{-1}(U) \to F_Z^{-1}(U)$ for almost all $U$.
So what we have here is a result that for real-valued random variables $Z_n \xrightarrow{d} Z$, we can construct identically distributed variables such that both $Z_n^* \xrightarrow{d} Z^*$ and $Z_n^* \xrightarrow{as} Z^*$. This is called the "Skorokhod construction".
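A minimal numerical sketch of this construction (Python; requires numpy and scipy; the distributions are arbitrary illustrative choices): take $Z_n \sim N(0, 1 + 1/n)$, which converges in distribution to $Z \sim N(0,1)$, and couple them through the same uniform draw $U$ via $Z_n^* = F_{Z_n}^{-1}(U)$, $Z^* = F_Z^{-1}(U)$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
u = rng.uniform(size=5)                # a few common uniform draws U

for n in [2, 10, 100, 10_000]:
    zn_star = norm.ppf(u, scale=np.sqrt(1 + 1 / n))   # F_{Z_n}^{-1}(U)
    z_star = norm.ppf(u)                               # F_Z^{-1}(U)
    print(f"n={n:>6d}  max |Zn* - Z*| = {np.max(np.abs(zn_star - z_star)):.5f}")
```

The coupled draws agree draw-by-draw in the limit, illustrating the pointwise (here, sure) convergence of the constructed variables.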
Skorokhod representation: Let $Z_n$ $(n = 1, 2, \ldots)$ be random elements defined on probability spaces $(\Omega_n, \mathcal B(\Omega_n), P_n)$ with $Z_n \xrightarrow{d} Z$, where $Z$ is defined on $(\Omega, \mathcal B(\Omega), P)$. Then there exist random variables $Z_n^*$ $(n = 1, 2, \ldots)$ and $Z^*$ defined on a common probability space $(\tilde\Omega, \mathcal B(\tilde\Omega), \tilde P)$ such that
$$Z_n^* \overset{d}{=} Z_n \ (n = 1, 2, \ldots); \qquad Z^* \overset{d}{=} Z; \qquad Z_n^* \xrightarrow{as} Z^*.$$
Applications:
• Continuous mapping theorem: $Z_n \xrightarrow{d} Z \Rightarrow Z_n^* \xrightarrow{as} Z^*$; for $h(\cdot)$ continuous, we have $h(Z_n^*) \xrightarrow{as} h(Z^*) \Rightarrow h(Z_n^*) \xrightarrow{d} h(Z^*)$, which implies $h(Z_n) \xrightarrow{d} h(Z)$.
• Building on the above, if $h$ is bounded, then we get
$$Eh(Z_n) = Eh(Z_n^*) \to Eh(Z^*) = Eh(Z)$$
under the bounded convergence theorem, which shows one direction of the "Portmanteau" theorem, Eq. (2).
1.3 Functional Central Limit Theorems
There is a set of distributional convergence results, known as functional CLTs (or Donsker theorems), because they deal with convergence of random functions (or, interchangeably, processes). These are indispensable tools in finance.
One of the simplest random functions is the Wiener process W(t), which is viewed
as a random function on the unit interval t ∈ [0, 1]. This is also known as a Brownian
motion process.
Features of the Wiener process:
1. $W(0) = 0$.
2. Gaussian marginals: $W(t) \overset{d}{=} N(0, t)$; that is,
$$P(W(t) \le a) = \frac{1}{\sqrt{2\pi t}}\int_{-\infty}^{a}\exp\left(\frac{-u^2}{2t}\right)du.$$
3. Independent increments: Define $x_t \equiv W(t)$. For any set $0 \le t_0 \le t_1 \le \ldots \le 1$, the differences
$$x_{t_1} - x_{t_0},\ x_{t_2} - x_{t_1},\ \ldots$$
are independent.
4. Given the two above features, we have that the increments are themselves normally distributed:
$$x_{t_i} - x_{t_{i-1}} \overset{d}{=} N(0, t_i - t_{i-1}).$$
Moreover, from
$$t_1 - t_0 = V(x_{t_1} - x_{t_0}) = E[(x_{t_1} - x_{t_0})^2] = Ex_{t_1}^2 - 2Ex_{t_1}x_{t_0} + Ex_{t_0}^2 = t_1 + t_0 - 2Ex_{t_1}x_{t_0},$$
implying
$$Ex_{t_1}x_{t_0} = \mathrm{Cov}(x_{t_1}, x_{t_0}) = t_0.$$
5. Furthermore, we know that any finite collection $(x_{t_1}, x_{t_2}, \ldots)$ is jointly distributed multivariate normal, with mean 0 and variance matrix $\Sigma$ as described above.
Take the conditions of the LLCLT: $X_1, X_2, \ldots$ are i.i.d. with mean zero and finite variance $\sigma^2$. For a given $n$, define the "partial sum" $S_k = X_1 + X_2 + \cdots + X_k$ for $k \le n$. Now, for $t \in [0,1]$, define the normalized "partial sum process"
$$\tilde S_n(t) = \frac{S_{[tn]}}{\sigma\sqrt n} + (tn - [tn])\cdot\frac{1}{\sigma\sqrt n}X_{[tn]+1}$$
where $[a]$ denotes the largest integer $\le a$. $\tilde S_n(t)$ is a random function on $[0,1]$. (Graph)
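A minimal simulation sketch of a sample path of $\tilde S_n(t)$ (Python; the $\pm 1$ increments, path length, and grid are illustrative choices). For large $n$ such paths behave like Brownian motion paths:

```python
import numpy as np

rng = np.random.default_rng(9)
n, sigma = 5_000, 1.0
t_grid = np.linspace(0, 1, 201)

x = rng.choice([-1.0, 1.0], size=n)          # mean-zero, variance-one increments
s = np.concatenate(([0.0], np.cumsum(x)))    # partial sums S_0, S_1, ..., S_n

k = np.floor(t_grid * n).astype(int)         # [tn]
frac = t_grid * n - k                        # tn - [tn]
x_next = np.where(k < n, x[np.minimum(k, n - 1)], 0.0)   # X_{[tn]+1} (0 past the end)
s_tilde = (s[k] + frac * x_next) / (sigma * np.sqrt(n))  # the interpolated process

print("S~_n(1) =", s_tilde[-1], " (approximately N(0,1) for large n)")
```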
Then the functional CLT is the following result:
Functional CLT: $\tilde S_n \xrightarrow{d} W$.
Note that $\tilde S_n$ is a random element taking values in $C[0,1]$, the space of continuous functions on $[0,1]$. Proving this result is beyond our focus in this class. (See, for instance, Billingsley, Convergence of Probability Measures, ch. 2, section 10.)
Note that the functional CLT implies more than the LLCLT. For example, take $t = 1$. Then $\tilde S_n(1) = \frac{\sum_{i=1}^{n} X_i}{\sigma\sqrt n} = \frac{\sqrt n\,\bar X_n}{\sigma}$. By the FCLT, we have
$$P(\tilde S_n(1) \le a) \to P(W(1) \le a) = P(N(0,1) \le a),$$
which is the same result as the LLCLT.
Similarly, for $k_n < n$ but $\frac{k_n}{n} \to \tau \in [0,1]$, we have
$$\tilde S_n\!\left(\frac{k_n}{n}\right) = \frac{\sum_{i=1}^{k_n} X_i}{\sigma\sqrt n} = \frac{k_n\,\bar X_{k_n}}{\sigma\sqrt n} \xrightarrow{d} N(0, \tau).$$
Also, it turns out, by a functional version of the continuous mapping theorem, we can get convergence results for functionals of the partial-sum process. For instance, $\sup_t \tilde S_n(t) \xrightarrow{d} \sup_t W(t)$, and it turns out
$$P\left(\sup_t W(t) \le a\right) = \frac{2}{\sqrt{2\pi}}\int_{0}^{a}\exp(-u^2/2)\,du; \qquad a \ge 0.$$
(Recall that $W(0) = 0$, so $a < 0$ makes no sense.)
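A minimal simulation check of this formula (Python sketch; requires numpy and scipy; the discretization and cutoff are arbitrary choices). Note that the displayed probability equals $2\Phi(a) - 1$; the sketch compares it with the simulated distribution of $\sup_t \tilde S_n(t)$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(10)
n, reps, a = 1_000, 5_000, 1.0

x = rng.choice([-1.0, 1.0], size=(reps, n))           # variance-one increments
paths = np.cumsum(x, axis=1) / np.sqrt(n)             # discretized S~_n(k/n)
sup_vals = np.maximum(paths.max(axis=1), 0.0)         # sup over the grid (and t=0)

print(f"simulated P(sup <= {a}) = {np.mean(sup_vals <= a):.4f}  "
      f"theory 2*Phi({a})-1 = {2 * norm.cdf(a) - 1:.4f}")
```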