Lecture 5
1 Weak Convergence
We have previously introduced the notions of convergence in probability and almost sure convergence for a sequence of random variables (Xn )n∈N . Both notions require (Xn )n∈N to be
defined on the same probability space. However, we are often more interested in the distributions of (Xn )n∈N , and whether they converge to a limiting distribution. Recall that the
distribution of a random variable X : (Ω, F, P) → (S, S) is defined to be the induced probability measure P ◦ X^{-1} on the measurable space (S, S). What we need therefore is a notion
of convergence of probability measures on a measurable space (S, S). We will require (S, S)
to be a metric space equipped with the Borel σ-algebra. Special cases include R, Rd , or the
space of continuous functions C([0, 1], R) equipped with the supremum norm.
To define the convergence of a sequence of probability measures (µn)n∈N on (S, S) to a limit µ, a naive first attempt would be to require that µn(A) = ∫ 1A dµn → µ(A) = ∫ 1A dµ for all A ∈ S. However, this notion is too strong. Indeed, if µn = δ1/n, the probability measure concentrated at the single point 1/n, then intuitively we expect µn to converge to µ = δ0; however, µn({0}) = 0 ↛ µ({0}) = 1. Therefore we need a weaker notion of convergence
that takes into account such convergence of atomic mass. The solution is to replace the
discontinuous indicator functions 1A by continuous functions.
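This example is easy to check numerically. Below is a minimal Python sketch (the test functions are our own illustrative choices): integrating against δ1/n simply evaluates the integrand at 1/n, so a bounded continuous f satisfies ∫ f dµn → f(0) = ∫ f dµ, while the indicator of {0} does not.

import numpy as np

f_cont = lambda x: np.exp(-x ** 2)        # a bounded continuous test function
f_ind = lambda x: 1.0 if x == 0 else 0.0  # discontinuous indicator of the atom {0}

for n in (1, 10, 100, 1000):
    x = 1.0 / n                 # integrating against delta_{1/n} evaluates at 1/n
    print(n, f_cont(x), f_ind(x))
# f_cont(1/n) -> f_cont(0) = 1, as weak convergence requires, while
# f_ind(1/n) = 0 never approaches f_ind(0) = 1: setwise convergence fails.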
Definition 1.1 [Weak Convergence] Let (S, S) be a metric space equipped with the Borel σ-algebra. A sequence of probability measures (µn)n∈N on (S, S) is said to converge weakly to a probability measure µ on (S, S) (denoted by µn ⇒ µ) if, for every bounded continuous function f : S → R, ∫ f dµn → ∫ f dµ. A sequence of (S, S)-valued random variables (Xn)n∈N is said to converge in distribution, or weakly, to X (denoted by Xn ⇒ X) if the distribution of Xn converges weakly to that of X. The topology induced by weak convergence on M1(S), the space of probability measures on (S, S), is called the weak topology.
Exercise 1.2 Prove that if Xn → X either almost surely or in probability, then Xn ⇒ X.
The following result gives useful equivalent definitions of weak convergence.
Theorem 1.3 [Portmanteau Theorem] Let (µn )n∈N and µ be probability measures on a
metric space (S, S) equipped with the Borel σ-algebra. The following conditions are equivalent:
(i) µn ⇒ µ;
(ii) limn→∞ ∫ f dµn = ∫ f dµ for all bounded and uniformly continuous f : S → R;
(iii) lim supn→∞ µn (F ) ≤ µ(F ) for all closed sets F ⊂ S;
(iv) lim inf n→∞ µn (G) ≥ µ(G) for all open sets G ⊂ S;
(v) limn→∞ µn(A) = µ(A) for all A ∈ S with µ(∂A) = 0, where ∂A := cl(A) ∩ cl(A^c), the intersection of the closures of A and its complement, denotes the boundary of A;
(vi) limn→∞ ∫ f dµn = ∫ f dµ for all bounded real f which are continuous at µ-a.e. x ∈ S.
Equivalently, the above statements can all be formulated in terms of random variables (Xn )n∈N
and X, with probability distributions (µn )n∈N and µ.
As exemplified by µn = δ1/n ⇒ δ0, condition (iii) captures the fact that probability assigned by µn to a closed set F cannot escape from F in the limit, while probability assigned by µn to F^c can asymptotically accumulate at the boundary of F. Similarly, probability assigned by µn to an open set G can asymptotically escape G through its boundary, but probability assigned by µn to G^c cannot escape from G^c.
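Condition (v) can also be observed numerically. In the following Python sketch (the measures and the set A are our own illustrative choices), the uniform measure on the grid {1/n, . . . , 1} converges weakly to Lebesgue measure µ on [0, 1], and µn(A) → µ(A) for a set A whose boundary carries no µ-mass.

import numpy as np

# mu_n = uniform measure on the grid {1/n, 2/n, ..., 1}; mu = Lebesgue on [0, 1].
def mu_n(indicator, n):
    pts = np.arange(1, n + 1) / n
    return indicator(pts).mean()

A = lambda x: (0.25 < x) & (x <= 0.75)  # mu(boundary of A) = mu({0.25, 0.75}) = 0
for n in (10, 100, 10000):
    print(n, mu_n(A, n))                # -> 0.5 = mu((0.25, 0.75])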
Proof. Note that (i)⇒(ii). We show next (ii)⇒(iii). Let ρ denote the metric on S. Given a
closed set F , define
fε(x) := 1 − 1 ∧ (ε^{-1} inf_{y∈F} ρ(x, y)).

Note that |fε(x1) − fε(x2)| ≤ ε^{-1} ρ(x1, x2) for all x1, x2 ∈ S, and hence fε is uniformly continuous. Note also that fε(x) = 1 if x ∈ F and fε(x) = 0 if inf_{y∈F} ρ(x, y) ≥ ε, which imply that fε ↓ 1F as ε ↓ 0. Therefore by the Bounded Convergence Theorem and (ii),

µ(F) = lim_{ε↓0} ∫ fε dµ = lim_{ε↓0} limn→∞ ∫ fε dµn ≥ lim_{ε↓0} lim supn→∞ ∫ 1F dµn = lim supn→∞ µn(F),
which is exactly (iii).
Clearly (iii)⇔(iv) by setting F = Gc .
Next we show that (iii) and (iv) imply (v). If we let A° := A \ cl(A^c) denote the interior of A, then (iii) and (iv) imply

lim supn→∞ µn(A) ≤ lim supn→∞ µn(cl(A)) ≤ µ(cl(A)),
lim inf n→∞ µn(A) ≥ lim inf n→∞ µn(A°) ≥ µ(A°).

If µ(∂A) = 0, then µ(cl(A)) = µ(A°) = µ(A), and we obtain limn→∞ µn(A) = µ(A).
To deduce (vi) from (v), note that by adding and multiplying constants, we may assume
without loss of generality that 0 ≤ f ≤ 1. Then
∫ f dµ = ∫_0^1 µ(Al) dl,  where Al := {x ∈ S : f(x) ≥ l}.
Let Df denote the set of discontinuities of f . Then for each l > 0, ∂Al ⊂ Df ∪ {x : f (x) = l}.
Since there can be at most countably many l > 0 with µ({x : f (x) = l}) > 0, and µ(Df ) = 0
by assumption, there can be at most countably many l > 0 with µ(∂Al ) > 0; for all other
l > 0, by (v), we have µ(Al ) = limn→∞ µn (Al ). Therefore by the Bounded Convergence
Theorem,
∫ f dµ = ∫_0^1 µ(Al) dl = ∫_0^1 limn→∞ µn(Al) dl = limn→∞ ∫_0^1 µn(Al) dl = limn→∞ ∫ f dµn.
Lastly, clearly (vi)⇒(i), which completes the cycle of implications.
For real-valued random variables, weak convergence is often defined alternatively in terms
of the convergence of their distribution functions. The following result shows their equivalence.
Theorem 1.4 [Convergence of Distribution Functions] For each n ∈ N, let Xn be a real-valued random variable with distribution function Fn(x) = P(Xn ∈ (−∞, x]). Then Xn ⇒ X for a real-valued random variable X with distribution function F if and only if Fn(x) → F(x) for each x ∈ R at which F is continuous. We denote the latter convergence by Fn ⇒ F.
Proof. Assume Xn ⇒ X. If we set A = (−∞, x] with x being a point of continuity of F ,
then P(X ∈ ∂A) = P(X = x) = 0. Therefore by the Portmanteau Theorem, P(Xn ∈ A) =
Fn (x) → P(X ∈ A) = F (x), i.e., Fn ⇒ F .
Now assume that Fn ⇒ F . Let f be a bounded continuous function, which we may
assume w.l.o.g. to be non-negative and bounded by 1. We note that F can only have at most
a countable number of points of discontinuity, i.e., x ∈ R with P(X = x) > 0. For any ε > 0, we can then find A > 0 large enough such that F is continuous at −A and A, and

F(−A) + (1 − F(A)) = P(X ≤ −A) + P(X > A) ≤ ε.

Furthermore, by the uniform continuity of f on [−A, A], we can find x0 = −A < x1 < · · · < xk = A such that F is continuous at x0, x1, . . . , xk, and

sup_{x∈[xi−1, xi]} |f(x) − f(xi)| ≤ ε  for all 1 ≤ i ≤ k.   (1.1)
Denote B0 := (−∞, x0 ], Bi := (xi−1 , xi ] for 1 ≤ i ≤ k, and Bk+1 := (xk , ∞). Note that
Fn ⇒ F implies limn→∞ P(Xn ∈ Bi ) = P(X ∈ Bi ) for each 0 ≤ i ≤ k + 1. We then have
lim supn→∞ |E[f(Xn)] − E[f(X)]| = lim supn→∞ |Σ_{i=0}^{k+1} (E[f(Xn)1Bi(Xn)] − E[f(X)1Bi(X)])|
≤ Σ_{i=0}^{k+1} lim supn→∞ |E[f(Xn)1Bi(Xn)] − E[f(X)1Bi(X)]|.   (1.2)
When i = 0 or i = k + 1, we can use the assumption |f | ≤ 1 to bound the summands by
lim supn→∞ (P(Xn ∈ B0) + P(X ∈ B0)) + lim supn→∞ (P(Xn ∈ Bk+1) + P(X ∈ Bk+1))
= 2P(X ∈ B0) + 2P(X ∈ Bk+1) ≤ 2ε.
When 1 ≤ i ≤ k, we can use (1.1) to bound the summands in (1.2) by
lim supn→∞ |E[f(Xn)1Bi(Xn)] − E[f(X)1Bi(X)]|
≤ lim supn→∞ ( |E[(f(Xn) − f(xi))1Bi(Xn)]| + f(xi)|E[1Bi(Xn) − 1Bi(X)]| + |E[(f(X) − f(xi))1Bi(X)]| )
≤ ε lim supn→∞ P(Xn ∈ Bi) + f(xi) lim supn→∞ |P(Xn ∈ Bi) − P(X ∈ Bi)| + ε P(X ∈ Bi)
= 2ε P(X ∈ Bi).
Substituting these bounds into (1.2) then gives
lim supn→∞ |E[f(Xn)] − E[f(X)]| ≤ 2ε + 2ε Σ_{i=1}^{k} P(X ∈ Bi) ≤ 4ε.

Since ε > 0 is arbitrary, this implies that limn→∞ E[f(Xn)] = E[f(X)] for all bounded continuous f, and hence Xn ⇒ X.
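The restriction to continuity points in Theorem 1.4 is already forced by the example µn = δ1/n ⇒ δ0. A minimal Python check (the evaluation points are our choice):

F_n = lambda x, n: float(x >= 1.0 / n)  # distribution function of delta_{1/n}
F = lambda x: float(x >= 0.0)           # distribution function of delta_0

for x in (-0.5, 0.0, 0.5):
    print(x, [F_n(x, n) for n in (1, 10, 1000)], F(x))
# F_n(x) -> F(x) at the continuity points x = -0.5 and x = 0.5, but
# F_n(0) = 0 for all n while F(0) = 1: convergence fails exactly at the
# discontinuity point of F, which Theorem 1.4 permits.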
A useful corollary of the Portmanteau Theorem is the following.
Theorem 1.5 [Continuous Mapping Theorem] Let (Xn )n∈N and X be random variables
defined on a metric space (S, S) with distributions (µn )n∈N and µ. Assume that Xn ⇒ X.
If f : (S, S) → (S′, S′) is continuous at µ-a.e. x ∈ S, then f(Xn) ⇒ f(X), or equivalently, µn ◦ f^{-1} ⇒ µ ◦ f^{-1}.
The Continuous Mapping Theorem is very useful in studying the convergence of stochastic
processes, such as convergence of a sequence of random continuous functions to a Brownian
motion, all regarded as C([0, 1], R)-valued random variables. Weak convergence of the process
implies weak convergence of many useful functionals, such as the supremum, the first time
the process hits a level a > 0, etc.
Proof. Note that for any bounded continuous g : S′ → R, ϕ := g ◦ f is continuous at µ-a.e.
x ∈ S. Therefore by the assumption Xn ⇒ X and Theorem 1.3 (vi), E[ϕ(Xn )] → E[ϕ(X)], or
equivalently, E[g(f (Xn ))] → E[g(f (X))], which implies f (Xn ) ⇒ f (X).
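As an illustration of the preceding remark, the following Python simulation sketch (the parameters are our own choices; Donsker's invariance principle and the reflection-principle formula P(sup_{[0,1]} B ≤ a) = 2Φ(a) − 1 are quoted here, not proved) compares the supremum of a rescaled simple random walk with the law of the supremum of Brownian motion.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, samples, a = 1000, 5000, 1.0

# Rescaled simple random walk, viewed as a C([0,1], R)-valued random variable.
steps = rng.choice([-1.0, 1.0], size=(samples, n))
paths = np.cumsum(steps, axis=1) / np.sqrt(n)
sups = np.maximum(paths.max(axis=1), 0.0)  # supremum functional (paths start at 0)

print(np.mean(sups <= a))   # empirical P(sup of rescaled walk <= a)
print(2 * norm.cdf(a) - 1)  # P(sup_{[0,1]} B <= a), about 0.6827 for a = 1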
Another useful result for deducing the weak convergence of one sequence of random variables from another sequence is:
Theorem 1.6 [Converging Together Lemma] Let (Xn )n∈N , (Yn )n∈N , and X be random
variables defined on a metric space (S, S) with metric ρ(·, ·). Assume that Xn ⇒ X, and
ρ(Xn , Yn ) → 0 in probability. Then Yn ⇒ X.
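A quick numerical illustration of Theorem 1.6 (a Python sketch with laws of our own choosing):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m = 100000
for n in (1, 10, 100):
    X_n = rng.standard_normal(m)                   # X_n ~ N(0,1), so X_n => X trivially
    Y_n = X_n + rng.uniform(-1.0 / n, 1.0 / n, m)  # |X_n - Y_n| <= 1/n -> 0
    print(n, np.mean(Y_n <= 1.0), norm.cdf(1.0))   # empirical F_{Y_n}(1) -> Phi(1)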
Exercise 1.7 Prove Theorem 1.6.
The following result is very useful when working with convergence, because it converts
weak convergence into almost sure convergence. It requires (S, S) to be a Polish space (i.e., a space whose topology is induced by a complete separable metric on S). However, this is not
a serious restriction because there are several other important reasons why we usually work
with Polish spaces in probability theory.
Theorem 1.8 [Skorohod’s Representation Theorem] Let (S, S) be a Polish space. If
(µn )n∈N and µ are probability measures on (S, S) with µn ⇒ µ, then we can construct random
variables (Xn )n∈N and X on a common probability space (Ω, F, P), with distributions (µn )n∈N
and µ, such that Xn → X almost surely. We can even take (Ω, F, P) to be [0, 1] equipped with
the Borel σ-algebra B and Lebesgue measure λ.
See Durrett [1, Chapter 8] for a proof.
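For S = R, the representation can be realized by the standard quantile coupling: on Ω = [0, 1] with Lebesgue measure, set Xn(ω) := inf{x : Fn(x) ≥ ω} and X(ω) := inf{x : F(x) ≥ ω}; if Fn ⇒ F, then Xn(ω) → X(ω) for all but countably many ω. Below is a Python sketch (the Gaussian laws µn = N(1/n, (1 + 1/n)²) ⇒ µ = N(0, 1) are an illustrative choice of ours, with scipy's ppf playing the role of the quantile function):

import numpy as np
from scipy.stats import norm

omega = np.linspace(0.01, 0.99, 9)  # sample points of Omega = [0, 1]
X = norm.ppf(omega)                 # X = F^{-1}(omega) has law N(0, 1)
for n in (1, 10, 1000):
    X_n = norm.ppf(omega, loc=1.0 / n, scale=1 + 1.0 / n)  # X_n = F_n^{-1}(omega)
    print(n, np.abs(X_n - X).max())                        # -> 0 pointwise on Omega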
Exercise 1.9 Use Skorohod’s Representation Theorem to prove the Continuous Mapping
Theorem under the assumption that (S, S) is Polish.
Exercise 1.10 Prove that if (Xn)n∈N are Rd-valued random variables with Xn ⇒ X, then E[‖X‖] ≤ lim inf n→∞ E[‖Xn‖], where ‖·‖ denotes the Euclidean norm on Rd.
2 Relative Compactness
Given a sequence of (S, S)-valued random variables (Xn )n∈N with distributions (µn )n∈N , if we
already have a candidate X with distribution µ for the weak limit, then to prove Xn ⇒ X,
we need to prove E[f (Xn )] → E[f (X)] for all bounded continuous f : (S, S) → (R, B).
However, this is usually not practical because the class of bounded continuous functions is
too large. Instead, we typically find a special class U of functions that are easier to evaluate, such that E[f(Xn)] → E[f(X)] for all f ∈ U still implies Xn ⇒ X. Such a class is called a convergence determining class. For S = R or Rd, one such class is
U := {x ∈ Rd ↦ e^{it·x} ∈ C : t ∈ Rd},   (2.3)

and φ(t) := E[e^{it·X}] defines the so-called characteristic function of X.
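As a numerical illustration of this class (a Python sketch; the uniform increments and the grid of t values are our own choices, and the central limit theorem is the fact being illustrated, not proved), the empirical characteristic functions of normalized sums converge to e^{−t²/2}, the characteristic function of N(0, 1):

import numpy as np

rng = np.random.default_rng(2)
t = np.array([0.5, 1.0, 2.0])
m = 100000
for n in (1, 10, 100):
    U = rng.uniform(-1.0, 1.0, size=(m, n))           # i.i.d., mean 0, variance 1/3
    S = U.sum(axis=1) / np.sqrt(n / 3.0)              # normalized sums with variance 1
    phi_n = np.exp(1j * np.outer(t, S)).mean(axis=1)  # empirical E[e^{itS}]
    print(n, np.abs(phi_n - np.exp(-t ** 2 / 2)).round(3))  # -> 0 at each t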
Often even verifying E[f (Xn )] → E[f (X)] for all f in a convergence determining class U is
not feasible, because such an approach lumps together two separate issues involved in proving
Xn ⇒ X. The first issue is the relative compactness of (µn )n∈N , which ensures that (µn )n∈N
admits subsequential weak limits in the space of probability measures on (S, S). The second
issue is the uniqueness of the subsequential weak limit, namely if µni ⇒ ν1 and µmi ⇒ ν2 along two different subsequences, then ν1 = ν2. If we can show that limn→∞ ∫ f dµn exists in R for every f in a so-called distribution determining class V (i.e., ∫ f dν1 = ∫ f dν2 for all f ∈ V implies ν1 = ν2), then the uniqueness of the subsequential weak limit follows immediately, since ∫ f dν1 = limn→∞ ∫ f dµn = ∫ f dν2. Note that a convergence determining class is automatically a distribution determining class; however, the converse is not true in general.
When (S, S) is an infinite-dimensional space, we typically follow such a two-step approach in proving µn ⇒ µ: first proving the relative compactness of (µn)n∈N, then showing the existence of limn→∞ ∫ f dµn for all f in a distribution determining class V. For S = Rd, it is
often sufficient to work with the characteristic functions defined from the class U in (2.3).
We now give necessary and sufficient conditions for a family of probability measures on
(S, S) to be relatively compact. First we introduce the relevant concepts.
Definition 2.1 [Relative Compactness and Tightness] A family of probability measures
Π on a metric space (S, S) is relatively compact if every sequence (µn )n∈N ∈ Π has a subsequence (µni )i∈N such that µni ⇒ µ, where µ may not be in Π. The family Π is said to be tight
if for all > 0, there is a compact set K such that µ(K) ≥ 1 − for all µ ∈ Π.
Theorem 2.2 [Prohorov’s Theorem] If Π is a tight family of probability measures on a
metric space (S, S) equipped with the Borel σ-algebra, then Π is relatively compact. Conversely,
if Π is relatively compact and (S, S) is a Polish space, then Π is tight.
Prohorov’s Theorem reduces the relative compactness of a family of probability measures on
a metric space (S, S) to its tightness. In a nutshell, the reason tightness implies relative
compactness is because for a compact set K, the space of finite measures on K, denoted by
Mf (K), is the dual of the space C(K) of continuous functions on K by the Riesz Representation Theorem; the weak topology on Mf (K) is in fact the weak∗ topology induced by C(K),
and hence by the Banach-Alaoglu Theorem, the closed unit ball in Mf (K) under the weak∗
topology, which is the subset of Mf (K) with total measure ≤ 1, is relatively compact. A
convergent subsequence of (µn) can then be extracted by requiring that µn converges weakly on Kε for each element of a sequence of compact sets Kε, with inf n∈N µn(Kε) ↑ 1 as ε ↓ 0.
Instead of proving Prohorov's Theorem in its full generality, for which we refer to Durrett [1, Chapter 8] or Billingsley [B, Chapter 1], we will give a proof in the special case S = R. When S = R, tightness of Π is equivalent to the condition that for all ε > 0, there exists A > 0 such that µ([−A, A]) ≥ 1 − ε for all µ ∈ Π. The following result shows that loss of tightness results from
probability mass escaping to ±∞.
Theorem 2.3 [Helly's Selection Theorem] For every sequence of probability measures (µn)n∈N on R with distribution functions (Fn)n∈N, there is a subsequence (Fni)i∈N and a right-continuous non-decreasing F such that Fni(x) → F(x) at all continuity points x of F. Every F obtained this way is the distribution function of a probability measure µ if and only if (µn)n∈N is tight, i.e., for all ε > 0, there is A > 0 such that µn((−A, A]) = Fn(A) − Fn(−A) ≥ 1 − ε for all n ∈ N.
Note that in Theorem 2.3, F defines a measure µ on R with µ((a, b]) = F (b) − F (a) for all
a < b. It is a probability measure if and only if its total mass µ(R) = F (∞) − F (−∞) = 1.
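The failure mode described in Theorem 2.3 can be made concrete with µn = δn (our own minimal example): all mass escapes to +∞, the pointwise limit F ≡ 0 is right-continuous and non-decreasing, but F(∞) − F(−∞) = 0 < 1. A Python check:

F_n = lambda x, n: float(x >= n)  # distribution function of delta_n

for x in (0.0, 10.0, 1000.0):
    print(x, [F_n(x, n) for n in (1, 100, 10 ** 6)])
# For each fixed x, F_n(x) = 0 once n > x, so F_n -> 0 pointwise: no
# subsequential limit is a probability distribution function, reflecting
# the failure of tightness.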
Proof. Let q1 , q2 , . . . be an enumeration of the rationals. Since Fn (q1 ) ∈ [0, 1] for all n ∈ N,
there exists a subsequence (m1,i )i∈N such that Fm1,i (q1 ) → G(q1 ) ∈ R. Similarly, we can find
a further subsequence (m2,i)i∈N along which Fm2,i(q2) → G(q2). Repeating this argument for q3, q4, . . . and taking the diagonal sequence ni := mi,i, we have Fni(x) → G(x) at all rational x. To create a right-continuous function out of G, define F(x) = inf_{y>x, y∈Q} G(y). Note that since each Fni is non-decreasing, so is G on Q, and hence F is non-decreasing on R. To check the right continuity of F, note that for any x ∈ R and xn ↓ x,

lim_{xn↓x} F(xn) = lim_{xn↓x} inf_{y>xn, y∈Q} G(y) = inf_{y>x, y∈Q} G(y) = F(x).
Next we verify that Fni (x) → F (x) at all continuity points x of F . Let x be such a continuity
point. Then for all ε > 0, we can find δ > 0 such that F(x − δ) > F(x) − ε and F(x + δ) < F(x) + ε. By the construction of F, we can find q, r ∈ Q with q < x < r such that G(q) > F(x) − ε and G(r) < F(x) + ε. Since Fni(q) → G(q), Fni(r) → G(r), and Fni(x) is bounded between Fni(q) and Fni(r), we have

lim supi→∞ Fni(x) ≤ lim supi→∞ Fni(r) = G(r) < F(x) + ε,
lim inf i→∞ Fni(x) ≥ lim inf i→∞ Fni(q) = G(q) > F(x) − ε.

Since ε > 0 is arbitrary, this implies that limi→∞ Fni(x) = F(x).
Now suppose that (µn)n∈N is tight, and Fni ⇒ F along a subsequence (ni)i∈N. For any ε > 0, we can find A with Fni(A) − Fni(−A) ≥ 1 − ε for all ni. Since F can have at most countably many points of discontinuity, we can find A′ > A such that ±A′ are points of continuity of F. Then

F(∞) − F(−∞) ≥ F(A′) − F(−A′) = limi→∞ (Fni(A′) − Fni(−A′)) ≥ 1 − ε.

Since ε > 0 can be chosen arbitrarily, F(∞) − F(−∞) = 1, and µ must be a probability
measure.
Conversely, assume that (µn)n∈N is not tight. Then there exist an ε > 0 and a subsequence (nk)k∈N such that for each nk, µnk((−k, k]) ≤ 1 − ε. By going to a further subsequence if necessary, we can assume that Fnk ⇒ F. For any N ∈ N, we can find A > N such that ±A are points of continuity of F. Then

F(A) − F(−A) = limk→∞ (Fnk(A) − Fnk(−A)) = limk→∞ µnk((−A, A]) ≤ 1 − ε.

Since A can be chosen arbitrarily large, we have F(∞) − F(−∞) ≤ 1 − ε, and hence µ is not
a probability measure. This establishes the equivalence between the tightness of (µn )n∈N and
the relative compactness of (µn )n∈N in M1 (R), the space of probability measures on R.
References
[B] P. Billingsley. Convergence of Probability Measures. John Wiley & Sons, 1968.
[1] R. Durrett. Stochastic Calculus. CRC Press, 1996.