Probability Theory
Richard F. Bass
© Copyright 2014 Richard F. Bass
Contents

1 Basic notions
  1.1 A few definitions from measure theory
  1.2 Definitions
  1.3 Some basic facts
  1.4 Independence
2 Convergence of random variables
  2.1 Types of convergence
  2.2 Weak law of large numbers
  2.3 The strong law of large numbers
  2.4 Techniques related to a.s. convergence
  2.5 Uniform integrability
3 Conditional expectations
4 Martingales
  4.1 Definitions
  4.2 Stopping times
  4.3 Optional stopping
  4.4 Doob's inequalities
  4.5 Martingale convergence theorems
  4.6 Applications of martingales
5 Weak convergence
6 Characteristic functions
  6.1 Inversion formula
  6.2 Continuity theorem
7 Central limit theorem
8 Gaussian sequences
9 Kolmogorov extension theorem
10 Brownian motion
  10.1 Definition and construction
  10.2 Nowhere differentiability
11 Markov chains
  11.1 Framework for Markov chains
  11.2 Examples
  11.3 Markov properties
  11.4 Recurrence and transience
  11.5 Stationary measures
  11.6 Convergence
Chapter 1
Basic notions
1.1 A few definitions from measure theory
Given a set X, a σ-algebra on X is a collection A of subsets of X such that
(1) ∅ ∈ A;
(2) if A ∈ A, then Ac ∈ A, where Ac = X \ A;
(3) if A1, A2, . . . ∈ A, then ∪_{i=1}^∞ Ai and ∩_{i=1}^∞ Ai are both in A.
A measure on a pair (X, A) is a function µ : A → [0, ∞] such that
(1) µ(∅) = 0;
(2) µ(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ µ(Ai) whenever the Ai are in A and are pairwise disjoint.
A function f : X → R is measurable if the set {x : f (x) > a} ∈ A for all
a ∈ R.
A property holds almost everywhere, written "a.e.," if the set where it fails has measure 0. For example, f = g a.e. if µ({x : f(x) ≠ g(x)}) = 0.
The characteristic function χ_A of a set A in A is the function that is 1 when x ∈ A and is zero otherwise. A function of the form Σ_{i=1}^n ai χ_{Ai} is called a simple function.
If f is simple and of the above form, we define
∫ f dµ = Σ_{i=1}^n ai µ(Ai).
If f is non-negative and measurable, we define
∫ f dµ = sup{∫ s dµ : 0 ≤ s ≤ f, s simple}.
Provided ∫ |f| dµ < ∞, we define
∫ f dµ = ∫ f^+ dµ − ∫ f^− dµ,
where f^+ = max(f, 0), f^− = max(−f, 0).
1.2 Definitions
A probability or probability measure is a measure whose total mass is one.
Because the origins of probability are in statistics rather than analysis, some
of the terminology is different. For example, instead of denoting a measure
space by (X, A, µ), probabilists use (Ω, F, P). So here Ω is a set, F is called
a σ-field (which is the same thing as a σ-algebra), and P is a measure with
P(Ω) = 1. Elements of F are called events. Elements of Ω are denoted ω.
Instead of saying a property occurs almost everywhere, we talk about properties occurring almost surely, written a.s. Real-valued measurable functions from Ω to R are called random variables and are usually denoted by X or Y or other capital letters. We often abbreviate "random variable" by r.v.
We let A^c = {ω ∈ Ω : ω ∉ A} (called the complement of A) and B − A = B ∩ A^c.
Integration (in the sense of Lebesgue) is called expectation or expected value, and we write E X for ∫ X dP. The notation E [X; A] is often used for ∫_A X dP.
The random variable 1A is the function that is one if ω ∈ A and zero
otherwise. It is called the indicator of A (the name characteristic function in
probability refers to the Fourier transform). Events such as (ω : X(ω) > a)
are almost always abbreviated by (X > a).
Given a random variable X, we can define a probability on R by
PX(A) = P(X ∈ A),  A ⊂ R.  (1.1)
The probability PX is called the law of X or the distribution of X. We define FX : R → [0, 1] by
FX(x) = PX((−∞, x]) = P(X ≤ x).  (1.2)
The function FX is called the distribution function of X.
As an example, let Ω = {H, T}, F all subsets of Ω (there are 4 of them), and P(H) = P(T) = 1/2. Let X(H) = 1 and X(T) = 0. Then PX = (1/2)δ_0 + (1/2)δ_1, where δ_x is point mass at x, that is, δ_x(A) = 1 if x ∈ A and 0 otherwise. FX(a) = 0 if a < 0, 1/2 if 0 ≤ a < 1, and 1 if a ≥ 1.
Proposition 1.1 The distribution function FX of a random variable X satisfies:
(a) FX is nondecreasing;
(b) FX is right continuous with left limits;
(c) limx→∞ FX (x) = 1 and limx→−∞ FX (x) = 0.
Proof. We prove the first part of (b) and leave the others to the reader. If
xn ↓ x, then (X ≤ xn ) ↓ (X ≤ x), and so P(X ≤ xn ) ↓ P(X ≤ x) since P is
a measure.
Note that if xn ↑ x, then (X ≤ xn ) ↑ (X < x), and so FX (xn ) ↑ P(X < x).
Any function F : R → [0, 1] satisfying (a)-(c) of Proposition 1.1 is called a
distribution function, whether or not it comes from a random variable.
Proposition 1.2 Suppose F is a distribution function. There exists a random variable X such that F = FX .
Proof. Let Ω = [0, 1], F the Borel σ-field, and P Lebesgue measure. Define
X(ω) = sup{x : F (x) < ω}. Here the Borel σ-field is the smallest σ-field
containing all the open sets.
We check that FX = F . Suppose X(ω) ≤ y. Then F (z) ≥ ω if z > y.
By the right continuity of F we have F (y) ≥ ω. Hence (X(ω) ≤ y) ⊂ (ω ≤
F (y)).
Suppose ω ≤ F (y). If X(ω) > y, then by the definition of X there exists
z > y such that F (z) < ω. But then ω ≤ F (y) ≤ F (z), a contradiction.
Therefore (ω ≤ F (y)) ⊂ (X(ω) ≤ y).
We then have
P(X(ω) ≤ y) = P(ω ≤ F (y)) = F (y).
In the above proof, essentially X = F^{-1}. However, F may have jumps or be constant over some intervals, so some care is needed in defining X.
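The construction in Proposition 1.2 is the basis of inverse transform sampling: with Ω = [0, 1] and P Lebesgue measure, X(ω) = sup{x : F(x) < ω} has distribution function F. A minimal numerical sketch in Python (the exponential distribution, bisection scheme, and sample size are illustrative choices, not from the text):

```python
import math
import random

def exponential_cdf(x, lam=1.0):
    """F(x) = 1 - exp(-lam x) for x >= 0: a distribution function."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def inverse_transform_sample(cdf, omega, lo=-100.0, hi=100.0, iters=60):
    """Approximate X(omega) = sup{x : F(x) < omega} by bisection,
    using the monotonicity of F."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if cdf(mid) < omega:
            lo = mid
        else:
            hi = mid
    return lo

random.seed(0)
samples = [inverse_transform_sample(exponential_cdf, random.random())
           for _ in range(5000)]
print(sum(samples) / len(samples))  # should be near E X = 1 for lam = 1
```

The bisection step is only a numerical stand-in for the supremum; when F is continuous and strictly increasing, X is just the ordinary inverse F^{-1}.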
Certain distributions or laws are very common. We list some of them.
(a) Bernoulli. A random variable is Bernoulli if P(X = 1) = p, P(X =
0) = 1 − p for some p ∈ [0, 1].
(b) Binomial. This is defined by P(X = k) = (n choose k) p^k (1 − p)^{n−k}, where n is a positive integer, 0 ≤ k ≤ n, and p ∈ [0, 1].
(c) Geometric. For p ∈ (0, 1) we set P(X = k) = (1 − p)p^k. Here k is a nonnegative integer.
(d) Poisson. For λ > 0 we set P(X = k) = e^{−λ} λ^k/k!. Again k is a nonnegative integer.
(e) Uniform. For some positive integer n, set P(X = k) = 1/n for 1 ≤
k ≤ n.
Suppose F is absolutely continuous. This is not the definition, but being
absolutely continuous is equivalent to the existence of a function f such that
F(x) = ∫_{−∞}^x f(y) dy
for all x. We call f = F′ the density of F. Some examples of distributions characterized by densities are the following.
(f) Uniform on [a, b]. Define f(x) = (b − a)^{−1} 1_{[a,b]}(x). This means that if X has a uniform distribution, then
P(X ∈ A) = ∫_A (1/(b − a)) 1_{[a,b]}(x) dx.
(g) Exponential. For x > 0 let f(x) = λe^{−λx}.
(h) Standard normal. Define f(x) = (1/√(2π)) e^{−x²/2}. So
P(X ∈ A) = (1/√(2π)) ∫_A e^{−x²/2} dx.
(i) N(µ, σ²). We shall see later that a standard normal has mean zero and variance one. If Z is a standard normal, then a N(µ, σ²) random variable has the same distribution as µ + σZ. It is an exercise in calculus to check that such a random variable has density
(1/(√(2π) σ)) e^{−(x−µ)²/2σ²}.  (1.3)
(j) Cauchy. Here
f(x) = (1/π) · 1/(1 + x²).

1.3 Some basic facts
We can use the law of a random variable to calculate expectations.
Proposition 1.3 If g is nonnegative or if E |g(X)| < ∞, then
E g(X) = ∫ g(x) PX(dx).
Proof. If g is the indicator of an event A, this is just the definition of PX . By
linearity, the result holds for simple functions. By the monotone convergence
theorem, the result holds for nonnegative functions, and by writing g =
g + − g − , it holds for g such that E |g(X)| < ∞.
If FX has a density f, then PX(dx) = f(x) dx. So, for example, E X = ∫ x f(x) dx and E X² = ∫ x² f(x) dx. (We need E |X| finite to justify this if X is not necessarily nonnegative.) We define the mean of a random variable to be its expectation, and the variance of a random variable is defined by
Var X = E (X − E X)².
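The formulas E X = ∫ x f(x) dx and Var X = E X² − (E X)² are easy to check numerically for a concrete density. A sketch using the exponential density of (g); the value λ = 2 and the Riemann-sum grid are illustrative choices:

```python
import math

lam = 2.0

def f(x):
    """Exponential density f(x) = lam * exp(-lam x), x > 0."""
    return lam * math.exp(-lam * x)

# Riemann sums for E X = int x f(x) dx and E X^2 = int x^2 f(x) dx over (0, 20].
dx = 1e-4
mean = sum(i * dx * f(i * dx) * dx for i in range(1, 200000))
second = sum((i * dx) ** 2 * f(i * dx) * dx for i in range(1, 200000))
var = second - mean ** 2
print(mean, var)  # an exponential law has mean 1/lam and variance 1/lam^2
```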
For example, it is routine to see that the mean of a standard normal is zero
and its variance is one.
Note
Var X = E (X² − 2XE X + (E X)²) = E X² − (E X)².
Another equality that is useful is the following.
Proposition 1.4 If X ≥ 0 a.s. and p > 0, then
E X^p = ∫_0^∞ p λ^{p−1} P(X > λ) dλ.
The proof will show that this equality is also valid if we replace P(X > λ) by P(X ≥ λ).
Proof. Use Fubini's theorem and write
∫_0^∞ p λ^{p−1} P(X > λ) dλ = E ∫_0^∞ p λ^{p−1} 1_{(λ,∞)}(X) dλ = E ∫_0^X p λ^{p−1} dλ = E X^p.
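Proposition 1.4 can be checked numerically on a concrete law. For an exponential random variable with λ = 1 we have P(X > λ) = e^{−λ} and E X^p = Γ(p + 1), which is p! for integer p; the choice p = 3 and the integration grid below are illustrative:

```python
import math

# Check E X^p = int_0^infty p t^{p-1} P(X > t) dt for X exponential(1),
# where P(X > t) = exp(-t) and E X^p = Gamma(p + 1) = p! for integer p.
p = 3
dt = 1e-3
integral = sum(p * (i * dt) ** (p - 1) * math.exp(-i * dt) * dt
               for i in range(1, 40000))   # integrate over (0, 40]
print(integral, math.factorial(p))  # both should be close to 6
```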
We need two elementary inequalities.
Proposition 1.5 (Chebyshev's inequality) If X ≥ 0, then for a > 0,
P(X ≥ a) ≤ E X / a.

Proof. We write
P(X ≥ a) = E [1_{[a,∞)}(X)] ≤ E [(X/a) 1_{[a,∞)}(X)] ≤ E X/a,
since X/a is at least 1 when X ∈ [a, ∞).
If we apply this to X = (Y − E Y)², we obtain
P(|Y − E Y| ≥ a) = P((Y − E Y)² ≥ a²) ≤ Var Y / a².  (1.4)
This special case of Chebyshev’s inequality is sometimes itself referred to as
Chebyshev’s inequality, while Proposition 1.5 is sometimes called the Markov
inequality.
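The Markov form of the inequality can be probed empirically; in fact, for nonnegative samples the empirical tail frequency is bounded by (sample mean)/a deterministically, by the same argument applied to the empirical distribution. A sketch (the exponential samples and the values of a are illustrative choices):

```python
import random

random.seed(1)
n = 100000
xs = [random.expovariate(1.0) for _ in range(n)]  # nonnegative, E X = 1
mean = sum(xs) / n

for a in (1.0, 2.0, 4.0):
    tail = sum(1 for x in xs if x >= a) / n  # empirical P(X >= a)
    print(a, tail, mean / a)  # the tail never exceeds the bound mean / a
```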
The second inequality we need is Jensen’s inequality, not to be confused
with the Jensen’s formula of complex analysis.
Proposition 1.6 Suppose g is convex and X and g(X) are both integrable. Then
g(E X) ≤ E g(X).
Proof. One property of convex functions is that they lie above their tangent
lines, and more generally their support lines. So if x0 ∈ R, we have
g(x) ≥ g(x0 ) + c(x − x0 )
for some constant c. Take x = X(ω) and take expectations to obtain
E g(X) ≥ g(x0 ) + c(E X − x0 ).
Now set x0 equal to E X.
If An is a sequence of sets, define (An i.o.), read "An infinitely often," by
(An i.o.) = ∩_{n=1}^∞ ∪_{i=n}^∞ Ai.
This set consists of those ω that are in infinitely many of the An .
A simple but very important proposition is the Borel-Cantelli lemma. It
has two parts, and we prove the first part here, leaving the second part to
the next section.
Proposition 1.7 (Borel-Cantelli lemma) If Σ_n P(An) < ∞, then P(An i.o.) = 0.
Proof. We have
P(An i.o.) = lim_{n→∞} P(∪_{i=n}^∞ Ai).
However,
P(∪_{i=n}^∞ Ai) ≤ Σ_{i=n}^∞ P(Ai),
which tends to zero as n → ∞.
1.4 Independence
Let us say two events A and B are independent if P(A ∩ B) = P(A)P(B).
The events A1 , . . . , An are independent if
P(Ai1 ∩ Ai2 ∩ · · · ∩ Aij ) = P(Ai1 )P(Ai2 ) · · · P(Aij )
for every subset {i1 , . . . , ij } of {1, 2, . . . , n}.
Proposition 1.8 If A and B are independent, then Ac and B are independent.
Proof. We write
P(Ac ∩ B) = P(B) − P(A ∩ B) = P(B) − P(A)P(B)
= P(B)(1 − P(A)) = P(B)P(Ac ).
We say two σ-fields F and G are independent if A and B are independent
whenever A ∈ F and B ∈ G. The σ-field generated by a random variable X,
written σ(X), is given by {(X ∈ A) : A a Borel subset of R}. Two random
variables X and Y are independent if the σ-field generated by X and the
σ-field generated by Y are independent. We define the independence of n
σ-fields or n random variables in the obvious way.
Proposition 1.8 tells us that A and B are independent if the random
variables 1A and 1B are independent, so the definitions above are consistent.
If f and g are Borel functions and X and Y are independent, then f (X)
and g(Y ) are independent. This follows because the σ-field generated by
f (X) is a sub-σ-field of the one generated by X, and similarly for g(Y ).
Let FX,Y (x, y) = P(X ≤ x, Y ≤ y) denote the joint distribution function
of X and Y. (The comma inside the set means "and.")
Proposition 1.9 FX,Y (x, y) = FX (x)FY (y) if and only if X and Y are independent.
Proof. If X and Y are independent, then 1_{(−∞,x]}(X) and 1_{(−∞,y]}(Y) are independent by the above comments. Using this and the definition of independence, we get FX,Y(x, y) = FX(x)FY(y).
Conversely, if the equality holds, fix y and let My denote the collection of sets A for which P(X ∈ A, Y ≤ y) = P(X ∈ A)P(Y ≤ y). My contains all sets of the form (−∞, x]. It follows by additivity that My contains all sets of the form (x, z], and then, by additivity again, all sets that are finite unions of such half-open, half-closed intervals. Note that the collection of
finite unions of such intervals, A, is an algebra generating the Borel σ-field.
It is clear that My is a monotone class, so by the monotone class lemma,
My contains the Borel σ-field.
For a fixed set A, let MA denote the collection of sets B for which P(X ∈
A, Y ∈ B) = P(X ∈ A)P(Y ∈ B). Again, MA is a monotone class and
by the preceding paragraph contains the σ-field generated by the collection
of finite unions of intervals of the form (x, z], hence contains the Borel sets.
Therefore X and Y are independent.
The following is known as the multiplication theorem.
Proposition 1.10 If X, Y , and XY are integrable and X and Y are independent, then E [XY ] = (E X)(E Y ).
Proof. Consider the random variables in σ(X) (the σ-field generated by X)
and σ(Y ) for which the multiplication theorem is true. It holds for indicators
by the definition of X and Y being independent. It holds for simple random
variables, that is, linear combinations of indicators, by linearity of both sides.
It holds for nonnegative random variables by monotone convergence. And it
holds for integrable random variables by linearity again.
If X1, . . . , Xn are independent, then so are X1 − E X1, . . . , Xn − E Xn. Assuming everything is integrable,
E [(X1 − E X1) + · · · + (Xn − E Xn)]² = E (X1 − E X1)² + · · · + E (Xn − E Xn)²,
using the multiplication theorem to show that the expectations of the cross product terms are zero. We have thus shown
Var (X1 + · · · + Xn) = Var X1 + · · · + Var Xn.  (1.5)
We finish up this section by proving the second half of the Borel-Cantelli
lemma.
Proposition 1.11 Suppose An is a sequence of independent events. If we have Σ_n P(An) = ∞, then P(An i.o.) = 1.

Note that here the An are independent, while in the first half of the Borel-Cantelli lemma no such assumption was necessary.
Proof. Note
P(∪_{i=n}^N Ai) = 1 − P(∩_{i=n}^N Ai^c) = 1 − Π_{i=n}^N P(Ai^c) = 1 − Π_{i=n}^N (1 − P(Ai)).
By the mean value theorem, 1 − x ≤ e^{−x}, so we have that the right hand side is greater than or equal to 1 − exp(−Σ_{i=n}^N P(Ai)). As N → ∞, this tends to 1, so P(∪_{i=n}^∞ Ai) = 1. This holds for all n, which proves the result.
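The dichotomy is easy to see in simulation. Take independent events An with P(An) = 1/n, so Σ_n P(An) = ∞; by Proposition 1.11 infinitely many An occur a.s., and the number of occurrences among A1, . . . , AN grows like Σ_{n≤N} 1/n ≈ log N. (The use of independent uniforms and the cutoff N below are illustrative choices.)

```python
import math
import random

random.seed(2)
N = 100000
# A_n occurs iff an independent uniform falls below P(A_n) = 1/n.
occurrences = sum(1 for n in range(1, N + 1) if random.random() < 1.0 / n)
print(occurrences, math.log(N))  # the count grows like sum of 1/n, about log N
```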
Chapter 2
Convergence of random variables

2.1 Types of convergence
In this section we consider three ways a sequence of random variables Xn can converge.
We say Xn converges to X almost surely if
P(Xn ↛ X) = 0.
Xn converges to X in probability if for each ε > 0,
P(|Xn − X| > ε) → 0
as n → ∞.
Xn converges to X in Lp if
E |Xn − X|p → 0
as n → ∞.
The following proposition shows some relationships among the types of
convergence.
Proposition 2.1 (1) If Xn → X a.s., then Xn → X in probability.
(2) If Xn → X in Lp , then Xn → X in probability.
(3) If Xn → X in probability, there exists a subsequence nj such that Xnj
converges to X almost surely.
Proof. To prove (1), note Xn − X tends to 0 almost surely, so 1_{(−ε,ε)^c}(Xn − X) also converges to 0 almost surely. Now apply the dominated convergence theorem.
(2) comes from Chebyshev's inequality:
P(|Xn − X| > ε) = P(|Xn − X|^p > ε^p) ≤ E |Xn − X|^p/ε^p → 0
as n → ∞.
To prove (3), choose nj larger than n_{j−1} such that P(|Xn − X| > 2^{−j}) < 2^{−j} whenever n ≥ nj. So if we let Aj = (|X_{nj} − X| > 2^{−j}), then P(Aj) ≤ 2^{−j}. By the Borel-Cantelli lemma, P(Aj i.o.) = 0. This implies there exists J (depending on ω) such that if j ≥ J, then ω ∉ Aj. Then for j ≥ J we have |X_{nj}(ω) − X(ω)| ≤ 2^{−j}. Therefore X_{nj}(ω) converges to X(ω) if ω ∉ (Aj i.o.).
Let us give some examples to show there need not be any other implications among the three types of convergence.
Let Ω = [0, 1], F the Borel σ-field, and P Lebesgue measure. Let Xn = e^n 1_{(0,1/n)}. Then clearly Xn converges to 0 almost surely and in probability, but E Xn^p = e^{np}/n → ∞ for any p.
Let Ω be the unit circle, and let P be Lebesgue measure on the circle normalized to have total mass 1. Let tn = Σ_{i=1}^n i^{−1}, and let
An = {e^{iθ} : t_{n−1} ≤ θ < t_n}.
Let Xn = 1_{An}. Any point on the unit circle will be in infinitely many An, so Xn does not converge almost surely to 0. But P(An) = 1/(2πn) → 0, so Xn → 0 in probability and in L^p.
2.2 Weak law of large numbers
Suppose Xn is a sequence of independent random variables. Suppose also that they all have the same distribution, that is, FXn = FX1 for all n. This situation comes up so often it has a name: independent, identically distributed, which is abbreviated i.i.d.
Define Sn = Σ_{i=1}^n Xi. Sn is called a partial sum process. Sn/n is the average value of the first n of the Xi's.
Theorem 2.2 (Weak law of large numbers, WLLN) Suppose the Xi are i.i.d. and E X1² < ∞. Then Sn/n → E X1 in probability.
Proof. Since the Xi are i.i.d., they all have the same expectation, and so E Sn = nE X1. Hence E (Sn/n − E X1)² is the variance of Sn/n. If ε > 0, by Chebyshev's inequality,
P(|Sn/n − E X1| > ε) ≤ Var (Sn/n)/ε² = Σ_{i=1}^n Var Xi/(n²ε²) = nVar X1/(n²ε²) = Var X1/(nε²).  (2.1)
Since E X1² < ∞, then Var X1 < ∞, and the result follows by letting n → ∞.
The weak law of large numbers can be improved greatly; it is enough that
xP(|X1 | > x) → 0 as x → ∞.
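The bound (2.1) says the tail probability P(|Sn/n − E X1| > ε) decays like Var X1/(nε²). A simulation sketch with Bernoulli(1/2) summands (the choice of distribution, ε, and the number of trials are illustrative):

```python
import random

random.seed(3)
eps, trials = 0.1, 2000

def tail_prob(n):
    """Empirical P(|S_n/n - E X_1| > eps) for Bernoulli(1/2) summands."""
    bad = 0
    for _ in range(trials):
        s = sum(random.random() < 0.5 for _ in range(n))
        if abs(s / n - 0.5) > eps:
            bad += 1
    return bad / trials

p10, p100 = tail_prob(10), tail_prob(100)
print(p10, p100)  # the tail probability shrinks as n grows
```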
2.3 The strong law of large numbers
Our aim is the strong law of large numbers (SLLN), which says that Sn /n
converges to E X1 almost surely if E |X1 | < ∞.
The strong law of large numbers is the mathematical formulation of the
law of averages. If one tosses a fair coin over and over, the proportion of
heads should converge to 1/2. Mathematically, if Xi is 1 if the ith toss turns
up heads and 0 otherwise, then we want Sn /n to converge with probability
one to 1/2, where Sn = X1 + · · · + Xn .
Before stating and proving the strong law of large numbers for i.i.d. random variables with finite first moment, we need three facts from calculus.
First, recall that if bn → b are real numbers, then
(b1 + · · · + bn)/n → b.  (2.2)
Second, there exists a constant c1 such that
Σ_{k=n}^∞ 1/k² ≤ c1/n.  (2.3)
(To prove this, recall the proof of the integral test and compare the sum to ∫_{n−1}^∞ x^{−2} dx when n ≥ 2.) Third, suppose a > 1 and kn is the largest integer less than or equal to a^n. Note kn ≥ a^n/2. Then
Σ_{n: kn ≥ j} 1/kn² ≤ Σ_{n: a^n ≥ j} 4/a^{2n} ≤ (4/j²) · 1/(1 − a^{−2})  (2.4)
by the formula for the sum of a geometric series.
We also need two probability estimates.
Lemma 2.3 If X ≥ 0 a.s., then E X < ∞ if and only if
Σ_{n=1}^∞ P(X ≥ n) < ∞.
Proof. Suppose E X is finite. Since P(X ≥ x) increases as x decreases,
Σ_{n=1}^∞ P(X ≥ n) ≤ Σ_{n=1}^∞ ∫_{n−1}^n P(X ≥ x) dx = ∫_0^∞ P(X ≥ x) dx = E X,
which is finite.
If we now suppose Σ_{n=1}^∞ P(X ≥ n) < ∞, write
E X = ∫_0^∞ P(X ≥ x) dx ≤ 1 + Σ_{n=1}^∞ ∫_n^{n+1} P(X ≥ x) dx ≤ 1 + Σ_{n=1}^∞ P(X ≥ n) < ∞.
Lemma 2.4 Let {Xn} be an i.i.d. sequence with each Xn ≥ 0 a.s. and E X1 < ∞. Define
Yn = Xn 1_{(Xn ≤ n)}.
Then
Σ_{k=1}^∞ Var Yk/k² < ∞.
Proof. Since Var Yk ≤ E Yk²,
Σ_{k=1}^∞ Var Yk/k² ≤ Σ_{k=1}^∞ E Yk²/k²
= Σ_{k=1}^∞ (1/k²) ∫_0^k 2x P(Yk > x) dx
= Σ_{k=1}^∞ (1/k²) ∫_0^∞ 1_{(x≤k)} 2x P(Yk > x) dx
≤ Σ_{k=1}^∞ (1/k²) ∫_0^∞ 1_{(x≤k)} 2x P(Xk > x) dx
= ∫_0^∞ Σ_{k=1}^∞ (1/k²) 1_{(x≤k)} 2x P(X1 > x) dx
≤ c1 ∫_0^∞ (1/x) · 2x P(X1 > x) dx
= 2c1 ∫_0^∞ P(X1 > x) dx = 2c1 E X1 < ∞.
We used the fact that the Xk are i.i.d., the Fubini theorem, and (2.3).
Before proving the strong law, let us first show that we cannot weaken the hypotheses.

Theorem 2.5 Suppose the Xi are i.i.d. and Sn/n converges a.s. Then E |X1| < ∞.
Proof. Since Sn/n converges,
Xn/n = Sn/n − (S_{n−1}/(n−1)) · ((n−1)/n) → 0
almost surely. Let An = (|Xn| > n). If Σ_{n=1}^∞ P(An) = ∞, then, using that the An's are independent, the Borel-Cantelli lemma (second part) tells us that P(An i.o.) = 1, which contradicts Xn/n → 0 a.s. Therefore Σ_{n=1}^∞ P(An) < ∞.
Since the Xi are identically distributed, we then have
Σ_{n=1}^∞ P(|X1| > n) < ∞,
and E |X1| < ∞ follows by Lemma 2.3.
We now state and prove the strong law of large numbers.
Theorem 2.6 Suppose {Xi} is an i.i.d. sequence with E |X1| < ∞. Let Sn = Σ_{i=1}^n Xi. Then
Sn/n → E X1, a.s.
Proof. By writing each Xn as Xn^+ − Xn^− and considering the positive and negative parts separately, it suffices to suppose each Xn ≥ 0. Define Yk = Xk 1_{(Xk ≤ k)} and let Tn = Σ_{i=1}^n Yi. The main part of the argument is to prove that Tn/n → E X1 a.s.
Step 1. Let a > 1 and let kn be the largest integer less than or equal to a^n. Let ε > 0 and let
An = (|T_{kn} − E T_{kn}|/kn > ε).
Then
P(An) ≤ Var (T_{kn}/kn)/ε² = Var T_{kn}/(kn² ε²) = Σ_{j=1}^{kn} Var Yj/(kn² ε²).
Then
Σ_{n=1}^∞ P(An) ≤ Σ_{n=1}^∞ Σ_{j=1}^{kn} Var Yj/(kn² ε²) = (1/ε²) Σ_{j=1}^∞ Var Yj Σ_{n: kn ≥ j} 1/kn² ≤ (4(1 − a^{−2})^{−1}/ε²) Σ_{j=1}^∞ Var Yj/j²
by (2.4). By Lemma 2.4, Σ_{n=1}^∞ P(An) < ∞, and by the Borel-Cantelli lemma, P(An i.o.) = 0. This means that for each ω except for those in a null set, there exists N(ω) such that if n ≥ N(ω), then |T_{kn}(ω) − E T_{kn}|/kn < ε. Applying this with ε = 1/m, m = 1, 2, . . ., we conclude
(T_{kn} − E T_{kn})/kn → 0, a.s.
Step 2. Since
E Yj = E [Xj; Xj ≤ j] = E [X1; X1 ≤ j] → E X1
by the dominated convergence theorem as j → ∞, then by (2.2)
E T_{kn}/kn = Σ_{j=1}^{kn} E Yj/kn → E X1.
Therefore T_{kn}/kn → E X1 a.s.
Step 3. If kn ≤ k ≤ k_{n+1}, then
Tk/k ≤ (T_{k_{n+1}}/k_{n+1}) · (k_{n+1}/kn)
since we are assuming that the Xk are non-negative. Therefore
lim sup_{k→∞} Tk/k ≤ aE X1, a.s.
Similarly, lim inf_{k→∞} Tk/k ≥ (1/a)E X1 a.s. Since a > 1 is arbitrary,
Tk/k → E X1, a.s.
Step 4. Finally,
Σ_{n=1}^∞ P(Yn ≠ Xn) = Σ_{n=1}^∞ P(Xn > n) = Σ_{n=1}^∞ P(X1 > n) < ∞
by Lemma 2.3. By the Borel-Cantelli lemma, P(Yn ≠ Xn i.o.) = 0. In particular, Yn − Xn → 0 a.s. By (2.2) we have
(Tn − Sn)/n = Σ_{i=1}^n (Yi − Xi)/n → 0, a.s.,
hence Sn/n → E X1 a.s.
2.4 Techniques related to a.s. convergence
If X1, . . . , Xn are random variables, σ(X1, . . . , Xn) is defined to be the smallest σ-field with respect to which X1, . . . , Xn are measurable. This definition extends to countably many random variables.
If Xi is a sequence of random variables, the tail σ-field is defined by
T = ∩_{n=1}^∞ σ(Xn, X_{n+1}, . . .).
An example of an event in the tail σ-field is (lim sup_{n→∞} Xn > a). Another example is (lim sup_{n→∞} Sn/n > a). The reason for this is that if k < n is fixed,
Sn/n = Sk/n + (Σ_{i=k+1}^n Xi)/n.
The first term on the right tends to 0 as n → ∞. So lim sup Sn/n = lim sup (Σ_{i=k+1}^n Xi)/n, which is in σ(X_{k+1}, X_{k+2}, . . .). This holds for each k. The set (lim sup Sn > a) is easily seen not to be in the tail σ-field.
Theorem 2.7 (Kolmogorov 0-1 law) If the Xi are independent, then the
events in the tail σ-field have probability 0 or 1.
This implies that in the case of i.i.d. random variables, if Sn /n has a limit
with positive probability, then it has a limit with probability one, and the
limit must be a constant.
Proof. Let M be the collection of sets in σ(X_{n+1}, . . .) that are independent of every set in σ(X1, . . . , Xn). M is easily seen to be a monotone class and it contains σ(X_{n+1}, . . . , XN) for every N > n. Therefore M must be equal to σ(X_{n+1}, . . .).
If A is in the tail σ-field, then A is independent of σ(X1 , . . . , Xn ) for each n.
The class MA of sets independent of A is a monotone class, hence is a σ-field
containing σ(X1 , . . . , Xn ) for each n. Therefore MA contains σ(X1 , . . .).
We thus have that the event A is independent of itself, or
P(A) = P(A ∩ A) = P(A)P(A) = P(A)².
This implies P(A) is zero or one.
Next is Kolmogorov’s inequality, a special case of Doob’s inequality.
Proposition 2.8 Suppose the Xi are independent and E Xi = 0 for each i. Then
P(max_{1≤i≤n} |Si| ≥ λ) ≤ E Sn²/λ².
Proof. Let Ak = (|Sk| ≥ λ, |S1| < λ, . . . , |S_{k−1}| < λ). Note the Ak are disjoint and that Ak ∈ σ(X1, . . . , Xk). Therefore Ak is independent of Sn − Sk. Then
E Sn² ≥ Σ_{k=1}^n E [Sn²; Ak] = Σ_{k=1}^n E [(Sk² + 2Sk(Sn − Sk) + (Sn − Sk)²); Ak] ≥ Σ_{k=1}^n E [Sk²; Ak] + 2 Σ_{k=1}^n E [Sk(Sn − Sk); Ak].
Using the independence, E [Sk(Sn − Sk)1_{Ak}] = E [Sk 1_{Ak}] E [Sn − Sk] = 0. Therefore
E Sn² ≥ Σ_{k=1}^n E [Sk²; Ak] ≥ Σ_{k=1}^n λ² P(Ak) = λ² P(max_{1≤k≤n} |Sk| ≥ λ).
Our result is immediate from this.
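An empirical check of Proposition 2.8 with ±1 steps, which have mean zero, so that E Sn² = n by (1.5). The step distribution, n, λ, and the number of trials are illustrative choices:

```python
import random

random.seed(5)
n, lam, trials = 100, 15.0, 5000
exceed = 0
for _ in range(trials):
    s, m = 0, 0
    for _ in range(n):
        s += random.choice((-1, 1))  # X_i = +/-1, E X_i = 0, so E S_n^2 = n
        m = max(m, abs(s))
    if m >= lam:
        exceed += 1

empirical = exceed / trials     # estimate of P(max_i |S_i| >= lam)
bound = n / lam ** 2            # Kolmogorov bound E S_n^2 / lam^2
print(empirical, bound)
```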
We look at a special case of what is known as Kronecker's lemma.

Proposition 2.9 Suppose xi are real numbers and sn = Σ_{i=1}^n xi. If the sum Σ_{j=1}^∞ (xj/j) converges, then sn/n → 0.
Proof. Let bn = Σ_{j=1}^n (xj/j), b0 = 0, and suppose bn → b. As we have seen, this implies (Σ_{i=1}^n bi)/n → b. We have i(bi − b_{i−1}) = xi, so
sn/n = Σ_{i=1}^n i(bi − b_{i−1})/n = (Σ_{i=1}^n i bi − Σ_{i=1}^{n−1} (i+1) bi)/n = bn − (Σ_{i=1}^{n−1} bi/(n−1)) · ((n−1)/n) → b − b = 0.
Lemma 2.10 Suppose Vi is a sequence of independent random variables, each with mean 0. Let Wn = Σ_{i=1}^n Vi. If Σ_{i=1}^∞ Var Vi < ∞, then Wn converges almost surely.
Proof. Choose nj > n_{j−1} such that Σ_{i=nj}^∞ Var Vi < 2^{−3j}. If n > nj, then applying Kolmogorov's inequality shows that
P(max_{nj ≤ i ≤ n} |Wi − W_{nj}| > 2^{−j}) ≤ 2^{−3j}/2^{−2j} = 2^{−j}.
Letting n → ∞, we have P(Aj) ≤ 2^{−j}, where
Aj = (max_{nj ≤ i} |Wi − W_{nj}| > 2^{−j}).
By the Borel-Cantelli lemma, P(Aj i.o.) = 0.
Suppose ω ∉ (Aj i.o.). Let ε > 0. Choose j large enough so that 2^{−j+1} < ε and ω ∉ Aj. If n, m > nj, then
|Wn − Wm| ≤ |Wn − W_{nj}| + |Wm − W_{nj}| ≤ 2^{−j+1} < ε.
Since ε is arbitrary, Wn(ω) is a Cauchy sequence, and hence converges.
We now consider the “three series criterion.” We prove the “if” portion
here and defer the “only if” to Section 20.
Theorem 2.11 Let Xi be a sequence of independent random variables, A > 0, and Yi = Xi 1_{(|Xi| ≤ A)}. Then Σ Xi converges if and only if all of the following three series converge: (a) Σ P(|Xn| > A); (b) Σ E Yi; (c) Σ Var Yi.
Proof of "if" part. Since (c) holds, Σ (Yi − E Yi) converges by Lemma 2.10. Since (b) holds, taking the difference shows Σ Yi converges. Since (a) holds, Σ P(Xi ≠ Yi) = Σ P(|Xi| > A) < ∞, so by Borel-Cantelli, P(Xi ≠ Yi i.o.) = 0. It follows that Σ Xi converges.
2.5 Uniform integrability

Before proceeding to an extension of the SLLN, we discuss uniform integrability. A sequence of random variables is uniformly integrable if
sup_i ∫_{(|Xi| > M)} |Xi| dP → 0
as M → ∞.
Proposition 2.12 Suppose there exists ϕ : [0, ∞) → [0, ∞) such that ϕ is
nondecreasing, ϕ(x)/x → ∞ as x → ∞, and supi E ϕ(|Xi |) < ∞. Then the
Xi are uniformly integrable.
Proof. Let ε > 0 and choose x0 such that x/ϕ(x) < ε if x ≥ x0. If M ≥ x0,
∫_{(|Xi| > M)} |Xi| = ∫ (|Xi|/ϕ(|Xi|)) ϕ(|Xi|) 1_{(|Xi| > M)} ≤ ε ∫ ϕ(|Xi|) ≤ ε sup_i E ϕ(|Xi|).
Proposition 2.13 If Xn and Yn are two uniformly integrable sequences, then
Xn + Yn is also a uniformly integrable sequence.
Proof. Since there exists M such that supn E [|Xn |; |Xn | > M ] < 1 and
supn E [|Yn |; |Yn | > M ] < 1, then supn E |Xn | ≤ M + 1, and similarly for the
Yn .
Note
P(|Xn| + |Yn| > K) ≤ (E |Xn| + E |Yn|)/K ≤ 2(1 + M)/K
by Chebyshev's inequality.
Now let ε > 0 and choose L such that
E [|Xn|; |Xn| > L] < ε
and the same when Xn is replaced by Yn. Then
E [|Xn + Yn|; |Xn + Yn| > K]
≤ E [|Xn|; |Xn + Yn| > K] + E [|Yn|; |Xn + Yn| > K]
≤ E [|Xn|; |Xn| > L] + E [|Xn|; |Xn| ≤ L, |Xn + Yn| > K]
+ E [|Yn|; |Yn| > L] + E [|Yn|; |Yn| ≤ L, |Xn + Yn| > K]
≤ ε + LP(|Xn + Yn| > K) + ε + LP(|Xn + Yn| > K)
≤ 2ε + L · 4(1 + M)/K.
Given ε we have already chosen L. Choose K large enough that 4L(1 + M)/K < ε. We then get that E [|Xn + Yn|; |Xn + Yn| > K] is bounded by 3ε.
The main result we need in this section is Vitali's convergence theorem.

Theorem 2.14 If Xn → X almost surely and the Xn are uniformly integrable, then E |Xn − X| → 0.
Proof. By the above proposition, Xn − X is uniformly integrable and tends
to 0 a.s., so without loss of generality, we may assume X = 0. Let ε > 0 and
choose M such that supn E [|Xn |; |Xn | > M ] < ε. Then
E |Xn | ≤ E [|Xn |; |Xn | > M ] + E [|Xn |; |Xn | ≤ M ] ≤ ε + E [|Xn |1(|Xn |≤M ) ].
The second term on the right goes to 0 by dominated convergence.
Proposition 2.15 Suppose Xi is an i.i.d. sequence and E |X1| < ∞. Then
E |Sn/n − E X1| → 0.
Proof. Without loss of generality we may assume E X1 = 0. By the SLLN,
Sn /n → 0 a.s. So we need to show that the sequence Sn /n is uniformly
integrable.
Pick M such that E [|X1|; |X1| > M] < ε. Pick N = M E |X1|/ε. Then
P(|Sn/n| > N) ≤ E |Sn|/(nN) ≤ E |X1|/N = ε/M.
We used here
E |Sn| ≤ Σ_{i=1}^n E |Xi| = nE |X1|.
We then have
E [|Xi|; |Sn/n| > N] ≤ E [|Xi|; |Xi| > M] + E [|Xi|; |Xi| ≤ M, |Sn/n| > N] ≤ ε + M P(|Sn/n| > N) ≤ 2ε.
Finally,
E [|Sn/n|; |Sn/n| > N] ≤ (1/n) Σ_{i=1}^n E [|Xi|; |Sn/n| > N] ≤ 2ε.
Chapter 3
Conditional expectations
If F ⊆ G are two σ-fields and X is an integrable G measurable random variable, the conditional expectation of X given F, written E [X | F] and read
as “the expectation (or expected value) of X given F,” is any F measurable
random variable Y such that E [Y ; A] = E [X; A] for every A ∈ F. The conditional probability of A ∈ G given F is defined by P(A | F) = E [1A | F].
If Y1 , Y2 are two F measurable random variables with E [Y1 ; A] = E [Y2 ; A]
for all A ∈ F, then Y1 = Y2 , a.s., or conditional expectation is unique up to
a.s. equivalence.
In the case X is already F measurable, E [X | F] = X. If X is independent
of F, E [X | F] = E X. Both of these facts follow immediately from the
definition. For another example, which ties this definition with the one used
in elementary probability courses, if {Ai } is a finite collection of disjoint sets
whose union is Ω, P(Ai ) > 0 for all i, and F is the σ-field generated by the
Ai's, then
P(A | F) = Σ_i (P(A ∩ Ai)/P(Ai)) 1_{Ai}.
This follows since the right-hand side is F measurable and its expectation
over any set Ai is P(A ∩ Ai ).
As an example, suppose we toss a fair coin independently 5 times and
let Xi be 1 or 0 depending whether the ith toss was a heads or tails. Let
A be the event that there were 5 heads and let Fi = σ(X1 , . . . , Xi ). Then
P(A) = 1/32 while P(A | F1 ) is equal to 1/16 on the event (X1 = 1) and 0 on
the event (X1 = 0). P(A | F2 ) is equal to 1/8 on the event (X1 = 1, X2 = 1)
and 0 otherwise.
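The finite-σ-field formula makes this example computable by brute force: enumerate all 2^5 equally likely outcomes and average over each atom of Fi. A purely illustrative sketch (the helper name `cond_prob` is not from the text):

```python
from itertools import product

outcomes = list(product((0, 1), repeat=5))  # all 32 equally likely toss sequences
A = [w for w in outcomes if sum(w) == 5]    # the event "five heads"

def cond_prob(i, prefix):
    """P(A | F_i) on the atom where the first i tosses equal `prefix`."""
    atom = [w for w in outcomes if w[:i] == prefix]
    return sum(1 for w in atom if w in A) / len(atom)

print(len(A) / 32)           # P(A) = 1/32
print(cond_prob(1, (1,)))    # 1/16 on the event (X1 = 1)
print(cond_prob(1, (0,)))    # 0 on (X1 = 0)
print(cond_prob(2, (1, 1)))  # 1/8 on (X1 = 1, X2 = 1)
```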
We have
E [E [X | F]] = E X  (3.1)
because E [E [X | F]] = E [E [X | F]; Ω] = E [X; Ω] = E X.
The following is easy to establish.
Proposition 3.1 (a) If X ≥ Y are both integrable, then E [X | F] ≥ E [Y |
F] a.s.
(b) If X and Y are integrable and a ∈ R, then E [aX + Y | F] = aE [X |
F] + E [Y | F].
It is easy to check that limit theorems such as monotone convergence and dominated convergence have conditional expectation versions, as do Jensen's and Chebyshev's inequalities. Thus, for example, we have the following.
Proposition 3.2 (Jensen’s inequality for conditional expectations) If g is
convex and X and g(X) are integrable,
E [g(X) | F] ≥ g(E [X | F]),
a.s.
A key fact is the following.
Proposition 3.3 If X and XY are integrable and Y is measurable with
respect to F, then
E [XY | F] = Y E [X | F].
(3.2)
Proof. If A ∈ F, then for any B ∈ F,
E [1A E [X | F]; B] = E [E [X | F]; A ∩ B] = E [X; A ∩ B] = E [1A X; B].
Since 1A E [X | F] is F measurable, this shows that (3.2) holds when Y = 1A and A ∈ F. Using linearity and taking limits shows that (3.2) holds whenever Y is F measurable and X and XY are integrable.
Two other equalities follow.
Proposition 3.4 If E ⊆ F ⊆ G, then
E [E [X | F] | E] = E [X | E] = E [E [X | E] | F].
Proof. The right equality holds because E [X | E] is E measurable, hence F
measurable. To show the left equality, let A ∈ E. Then since A is also in F,
E [E [E [X | F] | E]; A] = E [E [X | F]; A] = E [X; A] = E [E [X | E]; A].
Since both sides are E measurable, the equality follows.
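Proposition 3.4 can be sanity-checked by brute force on a finite space. This sketch (an illustration of ours, with an arbitrary hypothetical choice of X) takes E = σ(first toss) and F = σ(first two tosses) on three fair coin tosses and verifies E[E[X | F] | E] = E[X | E] pointwise:

```python
from fractions import Fraction
from itertools import product

# Brute-force check of the tower property E[E[X | F] | E] = E[X | E]
# on three fair coin tosses, E = sigma(first toss) ⊆ F = sigma(first two).
omega = list(product((0, 1), repeat=3))

def cond_exp(X, k, w):
    """E[X | sigma(first k tosses)] at w, under the uniform measure."""
    atom = [v for v in omega if v[:k] == w[:k]]
    return Fraction(sum(X(v) for v in atom), len(atom))

def X(w):                 # an arbitrary (hypothetical) random variable
    return 4 * w[0] + 2 * w[1] + w[2]

for w in omega:
    assert cond_exp(lambda v: cond_exp(X, 2, v), 1, w) == cond_exp(X, 1, w)
```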
To show the existence of E [X | F], we proceed as follows.
Proposition 3.5 If X is integrable, then E [X | F] exists.
Proof. Using linearity, we need only consider X ≥ 0. Define a measure Q
on F by Q(A) = E [X; A] for A ∈ F. This is trivially absolutely continuous
with respect to P|F , the restriction of P to F. Let E [X | F] be the Radon-Nikodym derivative of Q with respect to P|F . The Radon-Nikodym derivative
is F measurable by construction and so provides the desired random variable.
When F = σ(Y ), one usually writes E [X | Y ] for E [X | F]. Notation
that is commonly used (however, we will use it only very occasionally and
only for heuristic purposes) is E [X | Y = y]. The definition is as follows. If
A ∈ σ(Y ), then A = (Y ∈ B) for some Borel set B by the definition of σ(Y ),
or 1A = 1B (Y ). By linearity and taking limits, if Z is σ(Y ) measurable,
Z = f (Y ) for some Borel measurable function f . Set Z = E [X | Y ] and
choose f Borel measurable so that Z = f (Y ). Then E [X | Y = y] is defined
to be f (y).
If X ∈ L2 and M = {Y ∈ L2 : Y is F-measurable}, one can show that
E [X | F] is equal to the projection of X onto the subspace M. We will not
use this in these notes.
Chapter 4
Martingales
4.1
Definitions
In this section we consider martingales. Let Fn be an increasing sequence of
σ-fields. A sequence of random variables Mn is adapted to Fn if for each n,
Mn is Fn measurable.
Mn is a martingale if Mn is adapted to Fn , Mn is integrable for all n, and
E [Mn | Fn−1 ] = Mn−1 , a.s., n = 2, 3, . . . .    (4.1)
If we have E [Mn | Fn−1 ] ≥ Mn−1 a.s. for every n, then Mn is a submartingale. If we have E [Mn | Fn−1 ] ≤ Mn−1 , we have a supermartingale.
Submartingales have a tendency to increase.
Let us take a moment to look at some examples. If Xi is a sequence of
mean zero i.i.d. random variables and Sn is the partial sum process, then
Mn = Sn is a martingale, since E [Mn | Fn−1 ] = Mn−1 + E [Mn − Mn−1 |
Fn−1 ] = Mn−1 + E [Mn − Mn−1 ] = Mn−1 , using independence. If the Xi ’s
have variance one and Mn = Sn^2 − n, then
E [Sn^2 | Fn−1 ] = E [(Sn − Sn−1 )^2 | Fn−1 ] + 2Sn−1 E [Sn | Fn−1 ] − Sn−1^2 = 1 + Sn−1^2 ,
using independence. It follows that Mn is a martingale.
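Both examples can be checked exactly on a small finite space. The sketch below (our own, not from the notes) enumerates all ±1 paths and verifies E[Mk | Fk−1] = Mk−1 for both Sn and Sn^2 − n:

```python
from fractions import Fraction
from itertools import product

# Exact check, on the space of n fair ±1 steps, that both S_k and
# S_k^2 - k satisfy the martingale identity E[M_k | F_{k-1}] = M_{k-1}.
n = 4
paths = list(product((-1, 1), repeat=n))

def S(w, k):                        # partial sum S_k = Y_1 + ... + Y_k
    return sum(w[:k])

def cond_exp(f, k, prefix):
    """E[f | F_k] on the atom of paths beginning with `prefix` (length k)."""
    atom = [w for w in paths if w[:k] == prefix]
    return Fraction(sum(f(w) for w in atom), len(atom))

for k in range(1, n + 1):
    for prefix in product((-1, 1), repeat=k - 1):
        assert cond_exp(lambda w: S(w, k), k - 1, prefix) == S(prefix, k - 1)
        assert cond_exp(lambda w: S(w, k) ** 2 - k, k - 1, prefix) \
            == S(prefix, k - 1) ** 2 - (k - 1)
```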
Another example is the following: if X ∈ L1 and Mn = E [X | Fn ], then
Mn is a martingale.
If Mn is a martingale and Hn is bounded and Fn−1 measurable for each n, it is easy to check that Nn = Σ_{i=1}^n Hi (Mi − Mi−1 ) is also a martingale.
If Mn is a martingale and g(Mn ) is integrable for each n, then by Jensen’s
inequality
E [g(Mn+1 ) | Fn ] ≥ g(E [Mn+1 | Fn ]) = g(Mn ),
or g(Mn ) is a submartingale. Similarly if g is convex and nondecreasing on
[0, ∞) and Mn is a positive submartingale, then g(Mn ) is a submartingale
because
E [g(Mn+1 ) | Fn ] ≥ g(E [Mn+1 | Fn ]) ≥ g(Mn ).
4.2
Stopping times
We next want to talk about stopping times. Suppose we have a sequence
of σ-fields Fi such that Fi ⊂ Fi+1 for each i. An example would be if
Fi = σ(X1 , . . . , Xi ). A random mapping N from Ω to {0, 1, 2, . . .} is called
a stopping time if for each n, (N ≤ n) ∈ Fn . A stopping time is also called
an optional time in the Markov theory literature.
The intuition is that the sequence knows whether N has happened by
time n by looking at Fn . Suppose some motorists are told to drive north on
Highway 99 in Seattle and stop at the first motorcycle shop past the second
realtor after the city limits. So they drive north, pass the city limits, pass
two realtors, and come to the next motorcycle shop, and stop. That is a
stopping time. If they are instead told to stop at the third stop light before
the city limits (and they had not been there before), they would need to
drive to the city limits, then turn around and return past three stop lights.
That is not a stopping time, because they have to go ahead of where they
wanted to stop to know to stop there.
We use the notation a ∧ b = min(a, b) and a ∨ b = max(a, b). The proof of
the following is immediate from the definitions.
Proposition 4.1 (a) Fixed times n are stopping times.
(b) If N1 and N2 are stopping times, then so are N1 ∧ N2 and N1 ∨ N2 .
(c) If Nn is a nondecreasing sequence of stopping times, then so is N =
supn Nn .
(d) If Nn is a nonincreasing sequence of stopping times, then so is N = inf_n Nn .
(e) If N is a stopping time, then so is N + n.
We define FN = {A : A ∩ (N ≤ n) ∈ Fn for all n}.
4.3
Optional stopping
Note that if one takes expectations in (4.1), one has E Mn = E Mn−1 , and
by induction E Mn = E M0 . The theorem about martingales that lies at the
basis of all other results is Doob’s optional stopping theorem, which says that
the same is true if we replace n by a stopping time N . There are various
versions, depending on what conditions one puts on the stopping times.
Theorem 4.2 If N is a bounded stopping time with respect to Fn and Mn
a martingale, then E MN = E M0 .
Proof. Since N is bounded, let K be the largest value N takes. We write
E MN = Σ_{k=0}^K E [MN ; N = k] = Σ_{k=0}^K E [Mk ; N = k].
Note (N = k) is Fj measurable if j ≥ k, so
E [Mk ; N = k] = E [Mk+1 ; N = k] = E [Mk+2 ; N = k] = · · · = E [MK ; N = k].
Hence
E MN = Σ_{k=0}^K E [MK ; N = k] = E MK = E M0 .
This completes the proof.
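Theorem 4.2 can be illustrated exactly by enumeration. In this sketch (a hypothetical check of ours), M is the fair ±1 walk and we stop at the bounded time N ∧ n with N = min{i : Mi = 1}; the stopped expectation is exactly E M0 = 0 for every n:

```python
from fractions import Fraction
from itertools import product

# Exact computation of E[M_{N ∧ n}] for the fair ±1 walk and the
# stopping time N = min{i : M_i = 1}; by Theorem 4.2 (applied to the
# bounded stopping time N ∧ n) this must equal E M_0 = 0.
def expected_stopped_value(n):
    total = Fraction(0)
    for w in product((-1, 1), repeat=n):
        m = 0
        value = None
        for step in w:
            m += step
            if m == 1:              # N has occurred: M_N = 1
                value = 1
                break
        total += value if value is not None else m
    return total / 2 ** n           # each path has probability 2^{-n}
```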
The assumption that N be bounded cannot be entirely dispensed with. For example, let Mn be the partial sums of a sequence of i.i.d. random variables that take the values ±1, each with probability 1/2. If N = min{i : Mi = 1}, we will see later on that N < ∞ a.s., but E MN = 1 ≠ 0 = E M0 . The same proof as that in Theorem 4.2 gives the following corollary.
Corollary 4.3 If N is bounded by K and Mn is a submartingale, then
E MN ≤ E MK .
Also the same proof gives
Corollary 4.4 If N is bounded by K, A ∈ FN , and Mn is a submartingale,
then E [MN ; A] ≤ E [MK ; A].
Proposition 4.5 If N1 ≤ N2 are stopping times bounded by K and M is a
martingale, then E [MN2 | FN1 ] = MN1 , a.s.
Proof. Suppose A ∈ FN1 . We need to show E [MN1 ; A] = E [MN2 ; A]. Define a new stopping time N3 by
N3 (ω) = N1 (ω) if ω ∈ A, and N3 (ω) = N2 (ω) if ω ∉ A.
It is easy to check that N3 is a stopping time, so E MN3 = E MK = E MN2 implies
E [MN1 ; A] + E [MN2 ; Ac ] = E MN2 .
Subtracting E [MN2 ; Ac ] from each side completes the proof.
The following is known as the Doob decomposition.
Proposition 4.6 Suppose Xk is a submartingale with respect to an increasing sequence of σ-fields Fk . Then we can write Xk = Mk + Ak such that Mk
is a martingale adapted to the Fk and Ak is a sequence of random variables
with Ak being Fk−1 -measurable and A0 ≤ A1 ≤ · · · .
Proof. Let ak = E [Xk | Fk−1 ] − Xk−1 for k = 1, 2, . . .. Since Xk is a submartingale, each ak ≥ 0. Then let Ak = Σ_{i=1}^k ai . The fact that the Ak are increasing and measurable with respect to Fk−1 is clear. Set Mk = Xk − Ak . Then
E [Mk+1 − Mk | Fk ] = E [Xk+1 − Xk | Fk ] − ak+1 = 0,
or Mk is a martingale.
Combining Propositions 4.5 and 4.6 we have
Corollary 4.7 Suppose Xk is a submartingale, and N1 ≤ N2 are bounded
stopping times. Then
E [XN2 | FN1 ] ≥ XN1 .
4.4
Doob’s inequalities
The first interesting consequences of the optional stopping theorems are
Doob’s inequalities. If Mn is a martingale, denote Mn∗ = maxi≤n |Mi |.
Theorem 4.8 If Mn is a martingale or a positive submartingale,
P(Mn∗ ≥ a) ≤ E [|Mn |; Mn∗ ≥ a]/a ≤ E |Mn |/a.
Proof. Set Mn+1 = Mn . Let N = min{j : |Mj | ≥ a} ∧ (n + 1). Since | · | is
convex, |Mn | is a submartingale. If A = (Mn∗ ≥ a), then A ∈ FN because
A ∩ (N ≤ j) = (N ≤ n) ∩ (N ≤ j) = (N ≤ j) ∈ Fj .
Since |MN | ≥ a on A, and by Corollary 4.4 applied to the submartingale |Mk | (with K = n + 1),
P(Mn∗ ≥ a) ≤ (1/a) E [|MN |; Mn∗ ≥ a] ≤ (1/a) E [|Mn |; A] ≤ (1/a) E |Mn |.
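Theorem 4.8 can be checked exactly for the fair ±1 walk by enumerating all paths (a sketch of ours; the values of n and a are arbitrary small choices):

```python
from fractions import Fraction
from itertools import product

# Exact check of P(M_n^* >= a) <= E|M_n|/a for the fair ±1 walk,
# enumerating all 2^n paths.
n, a = 8, 3
paths = list(product((-1, 1), repeat=n))

def running_sums(w):
    out, s = [], 0
    for step in w:
        s += step
        out.append(s)
    return out

hit = sum(1 for w in paths if max(abs(s) for s in running_sums(w)) >= a)
lhs = Fraction(hit, len(paths))                        # P(M_n^* >= a)
rhs = Fraction(sum(abs(running_sums(w)[-1]) for w in paths),
               len(paths)) / a                         # E|M_n| / a
```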
For p > 1, we have the following inequality.
Theorem 4.9 If p > 1 and E |Mi |^p < ∞ for i ≤ n, then
E (Mn∗ )^p ≤ (p/(p − 1))^p E |Mn |^p .
Proof. Note Mn∗ ≤ Σ_{i=1}^n |Mi |, hence Mn∗ ∈ Lp . We write, using Theorem 4.8 for the first inequality,
E (Mn∗ )^p = ∫_0^∞ p a^{p−1} P(Mn∗ > a) da ≤ ∫_0^∞ p a^{p−1} E [|Mn | 1_{(Mn∗ ≥ a)} /a] da
= E ∫_0^{Mn∗} p a^{p−2} |Mn | da = (p/(p − 1)) E [(Mn∗ )^{p−1} |Mn |]
≤ (p/(p − 1)) (E (Mn∗ )^p )^{(p−1)/p} (E |Mn |^p )^{1/p} .
The last inequality follows by Hölder’s inequality. Now divide both sides by
the quantity (E (Mn∗ )p )(p−1)/p .
4.5
Martingale convergence theorems
The martingale convergence theorems are another set of important consequences of optional stopping. The main step is the upcrossing lemma. The
number of upcrossings of an interval [a, b] is the number of times a process
crosses from below a to above b.
To be more exact, let
S1 = min{k : Xk ≤ a},
T1 = min{k > S1 : Xk ≥ b},
and
Si+1 = min{k > Ti : Xk ≤ a},
Ti+1 = min{k > Si+1 : Xk ≥ b}.
The number of upcrossings Un before time n is Un = max{j : Tj ≤ n}.
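The recursion for the Si, Ti, and Un amounts to a small scanning routine. The following sketch (our illustration, not from the notes) counts completed upcrossings of [a, b] along a finite path:

```python
# A direct transcription of the definition: scan the path, alternately
# looking for a time S_i with x <= a and then a time T_i with x >= b;
# each completed T_i is one upcrossing of [a, b].
def upcrossings(xs, a, b):
    count, seeking_low = 0, True
    for x in xs:
        if seeking_low and x <= a:
            seeking_low = False      # found the next S_i
        elif not seeking_low and x >= b:
            count += 1               # found T_i: upcrossing completed
            seeking_low = True
    return count
```

For example, the path 2, 0, 3, −1, 5 makes two upcrossings of [0, 2].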
Theorem 4.10 (Upcrossing lemma) If Xk is a submartingale,
E Un ≤ (b − a)−1 E [(Xn − a)+ ].
Proof. The number of upcrossings of [a, b] by Xk is the same as the number of upcrossings of [0, b − a] by Yk = (Xk − a)+ . Moreover Yk is still a submartingale. If we obtain the inequality for the number of upcrossings of the interval [0, b − a] by the process Yk , we will have the desired inequality for upcrossings of X.
So we may assume a = 0. Fix n and define Yn+1 = Yn . This will still be a submartingale. Define the Si , Ti as above, and let Si′ = Si ∧ (n + 1), Ti′ = Ti ∧ (n + 1). Since Ti+1 > Si+1 > Ti , we have T′_{n+1} = n + 1.
We write
E Yn+1 = E Y_{S1′} + Σ_{i=1}^{n+1} E [Y_{Ti′} − Y_{Si′} ] + Σ_{i=1}^{n+1} E [Y_{S′_{i+1}} − Y_{Ti′} ].
All the summands in the third term on the right are nonnegative since Yk is a submartingale. For the jth upcrossing, Y_{Tj′} − Y_{Sj′} ≥ b − a, while Y_{Tj′} − Y_{Sj′} is always greater than or equal to 0. So
Σ_{i=1}^{n+1} (Y_{Ti′} − Y_{Si′} ) ≥ (b − a)Un .
So
E Un ≤ E Yn+1 /(b − a).    (4.2)
This leads to the martingale convergence theorem.
Theorem 4.11 If Xn is a submartingale such that supn E Xn+ < ∞, then
Xn converges a.s. as n → ∞.
Proof. Let U (a, b) = limn→∞ Un , the total number of upcrossings of [a, b]. For each a, b rational, by monotone convergence and (4.2),
E U (a, b) ≤ (b − a)^{−1} sup_n E (Xn − a)+ ≤ (b − a)^{−1} (|a| + sup_n E Xn+ ) < ∞.
So U (a, b) < ∞, a.s. Taking the union over all pairs of rationals a, b, we see
that a.s. the sequence Xn (ω) cannot have lim sup Xn > lim inf Xn . Therefore
Xn converges a.s., although we still have to rule out the possibility of the
limit being infinite. Since Xn is a submartingale, E Xn ≥ E X0 , and thus
E |Xn | = E Xn+ + E Xn− = 2E Xn+ − E Xn ≤ 2E Xn+ − E X0 .
By Fatou’s lemma, E limn |Xn | ≤ supn E |Xn | < ∞, or Xn converges a.s. to
a finite limit.
Corollary 4.12 If Xn is a positive supermartingale or a martingale bounded
above or below, Xn converges a.s.
Proof.
If Xn is a positive supermartingale, −Xn is a submartingale
bounded above by 0. Now apply Theorem 4.11.
If Xn is a martingale bounded above, by considering −Xn , we may assume
Xn is bounded below. Looking at Xn + M for fixed M will not affect the
convergence, so we may assume Xn is bounded below by 0. Now apply the
first assertion of the corollary.
Proposition 4.13 If Xn is a martingale with supn E |Xn |p < ∞ for some
p > 1, then the convergence is in Lp as well as a.s. This is also true when
Xn is a submartingale. If Xn is a uniformly integrable martingale, then the
convergence is in L1 . If Xn → X∞ in L1 , then Xn = E [X∞ | Fn ].
Xn is a uniformly integrable martingale if the collection of random variables Xn is uniformly integrable.
Proof. The Lp convergence assertion follows by using Doob’s inequality
(Theorem 4.9) and dominated convergence. The L1 convergence assertion
follows since a.s. convergence together with uniform integrability implies L1
convergence. Finally, if j < n, we have Xj = E [Xn | Fj ]. If A ∈ Fj ,
E [Xj ; A] = E [Xn ; A] → E [X∞ ; A]
by the L1 convergence of Xn to X∞ . Since this is true for all A ∈ Fj ,
Xj = E [X∞ | Fj ].
4.6
Applications of martingales
One application of martingale techniques is Wald’s identities.
Proposition 4.14 Suppose the Yi are i.i.d. with E |Y1 | < ∞, N is a stopping time with E N < ∞, and N is independent of the Yi . Then E SN = (E N )(E Y1 ), where the Sn are the partial sums of the Yi .
Proof. Sn − n(E Y1 ) is a martingale, so E Sn∧N = E (n ∧ N )E Y1 by optional stopping. The right hand side tends to (E N )(E Y1 ) by monotone
convergence. Sn∧N converges almost surely to SN , and we need to show the
expected values converge.
Note
|Sn∧N | = Σ_{k=0}^∞ |Sn∧k | 1_{(N =k)} ≤ Σ_{k=0}^∞ Σ_{j=1}^{n∧k} |Yj | 1_{(N =k)}
= Σ_{j=1}^n Σ_{k≥j} |Yj | 1_{(N =k)} = Σ_{j=1}^n |Yj | 1_{(N ≥j)} ≤ Σ_{j=1}^∞ |Yj | 1_{(N ≥j)} .
The last expression, using the independence, has expected value
Σ_{j=1}^∞ (E |Yj |) P(N ≥ j) = (E |Y1 |)(E N ) < ∞.
So by dominated convergence, we have E Sn∧N → E SN .
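Wald's identity can be verified exactly for small discrete laws. In this sketch (the distributions are hypothetical, chosen only for illustration), N is independent of the Yi and E SN is computed by full enumeration:

```python
from fractions import Fraction
from itertools import product

# Exact check of E S_N = (E N)(E Y_1) when N is independent of the Y_i.
y_law = {1: Fraction(1, 2), 3: Fraction(1, 2)}                     # law of each Y_i
n_law = {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 2)}  # law of N

e_y = sum(y * p for y, p in y_law.items())
e_n = sum(n * p for n, p in n_law.items())

e_sn = Fraction(0)
for n, pn in n_law.items():
    for ys in product(y_law, repeat=n):       # all values of (Y_1, ..., Y_n)
        p = pn
        for y in ys:
            p *= y_law[y]
        e_sn += p * sum(ys)

assert e_sn == e_n * e_y                      # Wald's identity, exactly
```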
Wald’s second identity is a similar expression for the variance of SN .
Let Sn be your fortune at time n. In a fair casino, E [Sn+1 | Fn ] = Sn . If N is a stopping time, the optional stopping theorem says that E SN = E S0 ; in other words, no matter what stopping time you use and what method of betting, you will do no better on average than ending up with what you started with.
We can use martingales to find certain hitting probabilities.
Proposition 4.15 Suppose the Yi are i.i.d. with P(Y1 = 1) = 1/2, P(Y1 = −1) = 1/2, and Sn the partial sum process. Suppose a and b are positive integers. Then
P(Sn hits −a before b) = b/(a + b).
If N = min{n : Sn ∈ {−a, b}}, then E N = ab.
Proof. Sn^2 − n is a martingale, so E S^2_{n∧N} = E (n ∧ N ). Let n → ∞. The right hand side converges to E N by monotone convergence. Since Sn∧N is bounded in absolute value by a + b, the left hand side converges by dominated convergence to E S_N^2 , which is finite. So E N is finite, hence N is finite almost surely.
Sn is a martingale, so E Sn∧N = E S0 = 0. By dominated convergence,
and the fact that N < ∞ a.s., hence Sn∧N → SN , we have E SN = 0, or
−aP(SN = −a) + bP(SN = b) = 0.
We also have
P(SN = −a) + P(SN = b) = 1.
Solving these two equations for P(SN = −a) and P(SN = b) yields our first result. Since E N = E S_N^2 = a^2 P(SN = −a) + b^2 P(SN = b), substituting gives the second result.
Based on this proposition, if we let a → ∞, we see that P(Nb < ∞) = 1
and E Nb = ∞, where Nb = min{n : Sn = b}.
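The two conclusions of Proposition 4.15 amount to saying that h(x) = (b − x)/(a + b) and g(x) = (a + x)(b − x) solve the first-step (harmonic) equations for the walk started at x. A short sketch (hypothetical values a = 3, b = 5) verifies this:

```python
from fractions import Fraction

# Verify that h(x) = (b - x)/(a + b) and g(x) = (a + x)(b - x) satisfy
#   h(x) = (h(x-1) + h(x+1))/2,      h(-a) = 1, h(b) = 0,
#   g(x) = 1 + (g(x-1) + g(x+1))/2,  g(-a) = g(b) = 0,
# so h(0) = b/(a+b) is the hitting probability and g(0) = ab is E N.
a, b = 3, 5

def h(x):
    return Fraction(b - x, a + b)

def g(x):
    return Fraction((a + x) * (b - x))

assert h(-a) == 1 and h(b) == 0
assert g(-a) == 0 and g(b) == 0
for x in range(-a + 1, b):
    assert h(x) == (h(x - 1) + h(x + 1)) / 2
    assert g(x) == 1 + (g(x - 1) + g(x + 1)) / 2
```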
An elegant application of martingales is a proof of the SLLN. Fix N large.
Let Yi be i.i.d. Let Zn = E [Y1 | Sn , Sn+1 , . . . , SN ]. We claim Zn = Sn /n.
Certainly Sn /n is σ(Sn , · · · , SN ) measurable. If A ∈ σ(Sn , . . . , SN ) for some
n, then A = ((Sn , . . . , SN ) ∈ B) for some Borel subset B of RN −n+1 . Since
the Yi are i.i.d., for each k ≤ n,
E [Y1 ; (Sn , . . . , SN ) ∈ B] = E [Yk ; (Sn , . . . , SN ) ∈ B].
Summing over k and dividing by n,
E [Y1 ; (Sn , . . . , SN ) ∈ B] = E [Sn /n; (Sn , . . . , SN ) ∈ B].
Therefore E [Y1 ; A] = E [Sn /n; A] for every A ∈ σ(Sn , . . . , SN ). Thus Zn =
Sn /n.
Let Xk = ZN −k , and let Fk = σ(SN −k , SN −k+1 , . . . , SN ). Note Fk gets
larger as k gets larger, and by the above Xk = E [Y1 | Fk ]. This shows that
Xk is a martingale (cf. the examples in Section 4.1). By Doob's upcrossing inequality, if U_n^X is the number of upcrossings of [a, b] by X, then E U^X_{N−1} ≤ E X^+_{N−1} /(b − a) ≤ E |Z1 |/(b − a) = E |Y1 |/(b − a). This differs by at most one from the number of upcrossings of [a, b] by Z1 , . . . , ZN . So
the expected number of upcrossings of [a, b] by Zk for k ≤ N is bounded
by 1 + E |Y1 |/(b − a). Now let N → ∞. By Fatou’s lemma, the expected
number of upcrossings of [a, b] by Z1 , . . . is finite. Arguing as in the proof
of the martingale convergence theorem, this says that Zn = Sn /n does not
oscillate.
It is conceivable that |Sn /n| → ∞. But by Fatou's lemma,
E [lim |Sn /n|] ≤ lim inf E |Sn /n| ≤ E |Y1 | < ∞,
since E |Sn | ≤ nE |Y1 |.
Chapter 5
Weak convergence
We will see later that if the Xi are i.i.d. with mean zero and variance one, then Sn /√n converges in the sense
P(Sn /√n ∈ [a, b]) → P(Z ∈ [a, b]),
where Z is a standard normal. If Sn /√n converged in probability or almost surely, then by the Kolmogorov zero-one law it would converge to a constant, contradicting the above. We want to generalize the above type of convergence.
We say Fn converges weakly to F if Fn (x) → F (x) for all x at which
F is continuous. Here Fn and F are distribution functions. We say Xn
converges weakly to X if FXn converges weakly to FX . We sometimes say
Xn converges in distribution or converges in law to X. Probabilities µn
converge weakly if their corresponding distribution functions converge, that is, if Fµn (x) = µn (−∞, x] converges weakly.
An example that illustrates why we restrict the convergence to continuity
points of F is the following. Let Xn = 1/n with probability one, and X = 0
with probability one. FXn (x) is 0 if x < 1/n and 1 otherwise. FXn (x)
converges to FX (x) for all x except x = 0.
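The example can be made concrete (a small sketch of ours, mirroring the text): FXn (0) = 0 for every n while FX (0) = 1, so convergence fails only at the discontinuity x = 0.

```python
# The two distribution functions in the example: X_n = 1/n and X = 0.
def F_n(n, x):
    return 1.0 if x >= 1.0 / n else 0.0

def F(x):
    return 1.0 if x >= 0 else 0.0

# At the discontinuity x = 0: F_n(0) = 0 for every n, yet F(0) = 1.
# At any continuity point of F, F_n(x) -> F(x) as n -> infinity.
```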
Proposition 5.1 Xn converges weakly to X if and only if E g(Xn ) → E g(X)
for all g bounded and continuous.
The idea that E g(Xn ) converges to E g(X) for all g bounded and continuous makes sense for any metric space and is used as a definition of weak convergence for Xn taking values in general metric spaces.
Proof. First suppose E g(Xn ) converges to E g(X). Let x be a continuity
point of F , let ε > 0, and choose δ such that |F (y) − F (x)| < ε if |y − x| < δ.
Choose g continuous such that g is one on (−∞, x], takes values between
0 and 1, and is 0 on [x + δ, ∞). Then FXn (x) ≤ E g(Xn ) → E g(X) ≤
FX (x + δ) ≤ F (x) + ε.
Similarly, if h is a continuous function taking values between 0 and 1
that is 1 on (−∞, x − δ] and 0 on [x, ∞), FXn (x) ≥ E h(Xn ) → E h(X) ≥
FX (x − δ) ≥ F (x) − ε. Since ε is arbitrary, FXn (x) → FX (x).
Now suppose Xn converges weakly to X. We start by making some observations. First, if we have a distribution function, it is increasing and so
the number of points at which it has a discontinuity is at most countable.
Second, if g is a continuous function on a closed bounded interval, it can be
approximated uniformly on the interval by step functions. Using the uniform
continuity of g on the interval, we may even choose the step function so that
the places where it jumps are not in some pre-specified countable set. Our
third observation is that
P(X < x) = lim_{k→∞} P(X ≤ x − 1/k) = lim_{y→x−} FX (y);
thus if FX is continuous at x, then P(X = x) = FX (x) − P(X < x) = 0.
Suppose g is bounded and continuous, and we want to prove that E g(Xn )
→ E g(X). By multiplying by a constant, we may suppose that |g| is bounded
by 1. Let ε > 0 and choose M such that FX (M ) > 1 − ε and FX (−M ) < ε
and so that M and −M are continuity points of FX . We see that
P(Xn ≤ −M ) = FXn (−M ) → FX (−M ) < ε,
and so for large enough n, we have P(Xn ≤ −M ) ≤ 2ε. Similarly, for
large enough n, P(Xn > M ) ≤ 2ε. Therefore for large enough n, E g(Xn )
differs from E (g1[−M,M ] )(Xn ) by at most 4ε. Also, E g(X) differs from
E (g1[−M,M ] )(X) by at most 2ε.
Let h be a step function such that sup|x|≤M |h(x) − g(x)| < ε and h is
0 outside of [−M, M ]. We choose h so that the places where h jumps are
continuity points of F and of all the Fn . Then E (g1[−M,M ] )(Xn ) differs from
E h(Xn ) by at most ε, and the same when Xn is replaced by X.
If we show E h(Xn ) → E h(X), then
lim sup |E g(Xn ) − E g(X)| ≤ 8ε,
n→∞
and since ε is arbitrary, we will be done.
h is of the form Σ_{i=1}^m ci 1_{Ii} , where Ii is an interval, so by linearity it is enough to show
E 1I (Xn ) → E 1I (X)
when I is an interval whose endpoints are continuity points of all the Fn and of F . If the endpoints of I are a < b, then by our third observation above, P(Xn = a) = 0, and the same when Xn is replaced by X and when a is replaced by b. We then have
E 1I (Xn ) = P(a < Xn ≤ b) = FXn (b) − FXn (a)
→ FX (b) − FX (a) = P(a < X ≤ b) = E 1I (X),
as required.
Let us examine the relationship between weak convergence and convergence in probability. The example of Sn /√n shows that one can have weak convergence without convergence in probability.
Proposition 5.2 (a) If Xn converges to X in probability, then it converges
weakly.
(b) If Xn converges weakly to a constant, it converges in probability.
(c) (Slutsky’s theorem) If Xn converges weakly to X and Yn converges weakly
to a constant c, then Xn + Yn converges weakly to X + c and Xn Yn converges
weakly to cX.
Proof. To prove (a), let g be a bounded and continuous function. If
nj is any subsequence, then there exists a further subsequence such that
X(njk ) converges almost surely to X. Then by dominated convergence,
E g(X(njk )) → E g(X). That suffices to show E g(Xn ) converges to E g(X).
For (b), if Xn converges weakly to c,
P(Xn − c > ε) = P(Xn > c + ε) = 1 − P(Xn ≤ c + ε) → 1 − P(c ≤ c + ε) = 0.
We use the fact that if Y ≡ c, then c + ε is a point of continuity for FY . A
similar equation shows P(Xn − c ≤ −ε) → 0, so P(|Xn − c| > ε) → 0.
We now prove the first part of (c), leaving the second part for the reader.
Let x be a point such that x − c is a continuity point of FX . Choose ε so
that x − c + ε is again a continuity point. Then
P(Xn + Yn ≤ x) ≤ P(Xn + c ≤ x + ε) + P(|Yn − c| > ε) → P(X ≤ x − c + ε).
So lim sup P(Xn + Yn ≤ x) ≤ P(X + c ≤ x + ε). Since ε can be as small as
we like and x − c is a continuity point of FX , then lim sup P(Xn + Yn ≤ x) ≤
P(X + c ≤ x). The lim inf is done similarly.
Here is an example where Xn converges weakly but not in probability. Let X1 , X2 , . . . be an i.i.d. sequence with P(X1 = 1) = 1/2 and P(X1 = 0) = 1/2. Since the FXn are all equal, then we have weak convergence.
We claim the Xn do not converge in probability. If they did, we could find a subsequence {nj } such that Xnj converges a.s. Let Aj = (Xnj = 1). These are independent sets, P(Aj ) = 1/2, so Σ_j P(Aj ) = ∞. By the Borel-Cantelli lemma, P(Aj i.o.) = 1, which means that Xnj is equal to 1 infinitely often with probability one. The same argument also shows that Xnj is equal to 0 infinitely often with probability one, which implies that Xnj does not converge almost surely, a contradiction.
We say a sequence of distribution functions {Fn } is tight if for each ε >
0 there exists M such that Fn (M ) ≥ 1 − ε and Fn (−M ) ≤ ε for all n.
A sequence of random variables is tight if the corresponding distribution
functions are tight; this is equivalent to requiring that for each ε > 0 there exist M with P(|Xn | ≥ M ) ≤ ε for all n.
We give an easily checked criterion for tightness.
Proposition 5.3 Suppose there exists ϕ : [0, ∞) → [0, ∞) that is increasing
and ϕ(x) → ∞ as x → ∞. If c = supn E ϕ(|Xn |) < ∞, then the Xn are
tight.
Proof. Let ε > 0. Choose M such that ϕ(x) ≥ c/ε if x > M . Then
P(|Xn | > M ) ≤ ∫ [ϕ(|Xn |)/(c/ε)] 1_{(|Xn |>M )} dP ≤ (ε/c) E ϕ(|Xn |) ≤ ε.
Theorem 5.4 (Helly’s theorem) Let Fn be a sequence of distribution functions that is tight. There exists a subsequence nj and a distribution function
F such that Fnj converges weakly to F .
What could happen is that Xn = n, so that FXn → 0; the tightness
precludes this.
Proof. Let qk be an enumeration of the rationals. Since Fn (qk ) ∈ [0, 1], any subsequence has a further subsequence that converges. Use the diagonalization procedure so that Fnj (qk ) converges for each qk and call the limit F (qk ). F is nondecreasing on the rationals; define F (x) = inf_{qk >x} F (qk ). Then F is right continuous and nondecreasing.
If x is a point of continuity of F and ε > 0, then there exist r and s
rational such that r < x < s and F (s) − F (x) < ε and F (x) − F (r) < ε.
Then
Fnj (x) ≥ Fnj (r) → F (r) > F (x) − ε
and
Fnj (x) ≤ Fnj (s) → F (s) < F (x) + ε.
Since ε is arbitrary, Fnj (x) → F (x).
Since the Fn are tight, there exists M such that Fn (−M ) < ε. Then
F (−M ) ≤ ε, which implies limx→−∞ F (x) = 0. Showing limx→∞ F (x) = 1 is
similar. Therefore F is in fact a distribution function.
Chapter 6
Characteristic functions
We define the characteristic function of a random variable X by ϕX (t) = E e^{itX} for t ∈ R.
Note that ϕX (t) = ∫ e^{itx} PX (dx). So if X and Y have the same law, they have the same characteristic function. Also, if the law of X has a density, that is, PX (dx) = fX (x) dx, then ϕX (t) = ∫ e^{itx} fX (x) dx, so in this case the characteristic function is the same as (one definition of) the Fourier transform of fX .
Proposition 6.1 ϕ(0) = 1, |ϕ(t)| ≤ 1, ϕ(−t) is the complex conjugate of ϕ(t), and ϕ is uniformly continuous.
Proof. Since |eitx | ≤ 1, everything follows immediately from the definitions
except the uniform continuity. For that we write
|ϕ(t + h) − ϕ(t)| = |E ei(t+h)X − E eitX | ≤ E |eitX (eihX − 1)| = E |eihX − 1|.
|eihX − 1| tends to 0 almost surely as h → 0, so the right hand side tends to
0 by dominated convergence. Note that the right hand side is independent
of t.
Proposition 6.2 ϕaX (t) = ϕX (at) and ϕX+b (t) = e^{itb} ϕX (t).
Proof. The first follows from E eit(aX) = E ei(at)X , and the second is similar.
Proposition 6.3 If X and Y are independent, then ϕX+Y (t) = ϕX (t)ϕY (t).
Proof. From the multiplication theorem,
E eit(X+Y ) = E eitX eitY = E eitX E eitY .
Note that if X1 and X2 are independent and identically distributed, then ϕX1−X2 (t) = ϕX1 (t)ϕ−X2 (t) = ϕX1 (t)ϕX2 (−t); since ϕX2 (−t) is the complex conjugate of ϕX2 (t) = ϕX1 (t), this equals |ϕX1 (t)|^2 .
Let us look at some examples of characteristic functions.
(a) Bernoulli: By direct computation, this is pe^{it} + (1 − p) = 1 − p(1 − e^{it} ).
(b) Coin flip (i.e., P(X = +1) = P(X = −1) = 1/2): We have (1/2)e^{it} + (1/2)e^{−it} = cos t.
(c) Poisson:
E e^{itX} = Σ_{k=0}^∞ e^{itk} e^{−λ} λ^k /k! = e^{−λ} Σ_k (λe^{it} )^k /k! = e^{−λ} e^{λe^{it}} = e^{λ(e^{it} −1)} .
(d) Point mass at a: E e^{itX} = e^{ita} . Note that when a = 0, then ϕ ≡ 1.
(e) Binomial: Write X as the sum of n independent Bernoulli random variables Bi . So
ϕX (t) = Π_{i=1}^n ϕBi (t) = [ϕB1 (t)]^n = [1 − p(1 − e^{it} )]^n .
(f) Geometric:
ϕ(t) = Σ_{k=0}^∞ p(1 − p)^k e^{itk} = p Σ_k ((1 − p)e^{it} )^k = p/(1 − (1 − p)e^{it} ).
6.1. INVERSION FORMULA
49
(g) Uniform on [a, b]:
ϕ(t) = (1/(b − a)) ∫_a^b e^{itx} dx = (e^{itb} − e^{ita} )/((b − a)it).
Note that when a = −b this reduces to sin(bt)/bt.
(h) Exponential:
ϕ(t) = ∫_0^∞ λe^{itx} e^{−λx} dx = λ ∫_0^∞ e^{(it−λ)x} dx = λ/(λ − it).
(i) Standard normal:
ϕ(t) = (1/√(2π)) ∫_{−∞}^∞ e^{itx} e^{−x^2 /2} dx.
This can be done by completing the square and then doing a contour integration. Alternately, ϕ′ (t) = (1/√(2π)) ∫_{−∞}^∞ ix e^{itx} e^{−x^2 /2} dx (do the real and imaginary parts separately, and use the dominated convergence theorem to justify taking the derivative inside). Integrating by parts (again doing the real and imaginary parts separately), this is equal to −tϕ(t). The only solution to ϕ′ (t) = −tϕ(t) with ϕ(0) = 1 is ϕ(t) = e^{−t^2 /2} .
(j) Normal with mean µ and variance σ^2 : Writing X = σZ + µ, where Z is a standard normal, then ϕX (t) = e^{iµt} ϕZ (σt) = e^{iµt−σ^2 t^2 /2} .
(k) Cauchy: We have
ϕ(t) = (1/π) ∫ e^{itx} /(1 + x^2 ) dx.
This is a standard exercise in contour integration in complex analysis. The answer is e^{−|t|} .
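A few of the discrete closed forms above can be spot-checked numerically (a sketch of ours; the truncation points for the infinite sums are arbitrary):

```python
import cmath
import math

# E e^{itX} as a finite (or truncated) sum over a discrete law.
def cf(pmf, t):
    return sum(p * cmath.exp(1j * t * x) for x, p in pmf.items())

t = 0.7

# Coin flip on {-1, +1}: cos t.
coin = {-1: 0.5, 1: 0.5}
assert abs(cf(coin, t) - math.cos(t)) < 1e-12

# Poisson(lam), truncated far out in the tail: exp(lam(e^{it} - 1)).
lam = 2.0
pois = {k: math.exp(-lam) * lam**k / math.factorial(k) for k in range(60)}
assert abs(cf(pois, t) - cmath.exp(lam * (cmath.exp(1j * t) - 1))) < 1e-10

# Geometric on {0, 1, ...}: p/(1 - (1-p)e^{it}).
p = 0.3
geo = {k: p * (1 - p)**k for k in range(400)}
assert abs(cf(geo, t) - p / (1 - (1 - p) * cmath.exp(1j * t))) < 1e-8
```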
6.1
Inversion formula
We need a preliminary real variable lemma, and then we can proceed to
the inversion formula, which gives a formula for the distribution function in
terms of the characteristic function.
Lemma 6.4 (a) ∫_0^N (sin(Ax)/x) dx → sgn(A) π/2 as N → ∞.
(b) sup_a |∫_0^a (sin(Ax)/x) dx| < ∞.
Proof. If A = 0, this is clear. The case A < 0 reduces to the case A > 0
by the fact that sin is an odd function. By a change of variables y = Ax, we
reduce to the case A = 1. Part (a) is a standard result in contour integration,
and part (b) comes from the fact that the integral can be written as an
alternating series.
An alternate proof of (a) is the following. e^{−xy} sin x is integrable on {(x, y) : 0 < x < a, 0 < y < ∞}. So
∫_0^a (sin x/x) dx = ∫_0^a ∫_0^∞ e^{−xy} sin x dy dx
= ∫_0^∞ ∫_0^a e^{−xy} sin x dx dy
= ∫_0^∞ [ (e^{−xy} /(y^2 + 1)) (−y sin x − cos x) ]_{x=0}^{x=a} dy
= ∫_0^∞ [ (e^{−ay} /(y^2 + 1)) (−y sin a − cos a) + 1/(y^2 + 1) ] dy
= π/2 − sin a ∫_0^∞ (y e^{−ay} /(y^2 + 1)) dy − cos a ∫_0^∞ (e^{−ay} /(y^2 + 1)) dy.
The last two integrals tend to 0 as a → ∞ since the integrands are bounded by (1 + y)e^{−y} if a ≥ 1.
Theorem 6.5 (Inversion formula) Let µ be a probability measure and let ϕ(t) = ∫ e^{itx} µ(dx). If a < b, then
lim_{T →∞} (1/2π) ∫_{−T}^T ((e^{−ita} − e^{−itb} )/it) ϕ(t) dt = µ(a, b) + (1/2)µ({a}) + (1/2)µ({b}).
The example where µ is point mass at 0, so ϕ(t) = 1, shows that one needs to take a limit, since the integrand in this case is 2 sin t/t, which is not integrable.
Proof. By Fubini,
∫_{−T}^T ((e^{−ita} − e^{−itb} )/it) ϕ(t) dt = ∫_{−T}^T ∫ ((e^{−ita} − e^{−itb} )/it) e^{itx} µ(dx) dt
= ∫ ∫_{−T}^T ((e^{−ita} − e^{−itb} )/it) e^{itx} dt µ(dx).
To justify this, we bound the integrand using the mean value theorem: |(e^{−ita} − e^{−itb} )/it| ≤ b − a.
Expanding e^{−itb} and e^{−ita} using Euler's formula, and using the fact that cos is an even function and sin is odd, we are left with
∫ 2 [ ∫_0^T (sin(t(x − a))/t) dt − ∫_0^T (sin(t(x − b))/t) dt ] µ(dx).
Using Lemma 6.4 and dominated convergence, this tends to
∫ [π sgn(x − a) − π sgn(x − b)] µ(dx).
Dividing by 2π and noting that π sgn(x − a) − π sgn(x − b) equals 2π if a < x < b, equals π if x = a or x = b, and equals 0 otherwise completes the proof.
Theorem 6.6 If ∫ |ϕ(t)| dt < ∞, then µ has a bounded density f and
f (y) = (1/2π) ∫ e^{−ity} ϕ(t) dt.
Proof. We have
µ(a, b) + (1/2)µ({a}) + (1/2)µ({b}) = lim_{T →∞} (1/2π) ∫_{−T}^T ((e^{−ita} − e^{−itb} )/it) ϕ(t) dt
= (1/2π) ∫_{−∞}^∞ ((e^{−ita} − e^{−itb} )/it) ϕ(t) dt
≤ ((b − a)/2π) ∫ |ϕ(t)| dt.
Letting b → a shows that µ has no point masses.
We now write
µ(x, x + h) = (1/2π) ∫ ((e^{−itx} − e^{−it(x+h)} )/it) ϕ(t) dt
= (1/2π) ∫ [ ∫_x^{x+h} e^{−ity} dy ] ϕ(t) dt
= ∫_x^{x+h} [ (1/2π) ∫ e^{−ity} ϕ(t) dt ] dy.
So µ has density f (y) = (1/2π) ∫ e^{−ity} ϕ(t) dt. As in the proof of Proposition 6.1, we see f is continuous.
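Theorem 6.6 can be illustrated numerically for the standard normal, whose characteristic function e^{−t²/2} is integrable (a sketch of ours; the truncation T = 10 and grid size are arbitrary choices):

```python
import math

# Recover the N(0,1) density from phi(t) = e^{-t^2/2} by numerically
# integrating f(y) = (1/2π) ∫ e^{-ity} φ(t) dt.  The integrand is real,
# e^{-t^2/2} cos(ty); a midpoint rule on [-T, T] with T = 10 suffices,
# since the tail of e^{-t^2/2} beyond |t| = 10 is negligible.
def density_via_inversion(y, T=10.0, m=20000):
    h = T / m
    s = 0.0
    for k in range(-m, m):
        t = (k + 0.5) * h
        s += math.exp(-t * t / 2) * math.cos(t * y) * h
    return s / (2 * math.pi)

def normal_density(y):
    return math.exp(-y * y / 2) / math.sqrt(2 * math.pi)
```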
A corollary to the inversion formula is the uniqueness theorem.
Theorem 6.7 If ϕX = ϕY , then PX = PY .
Proof. Let a and b be such that P(X = a) = P(X = b) = 0, and the same
when X is replaced by Y . We have
FX (b) − FX (a) = P(a < X ≤ b) = PX ((a, b]) = PX ((a, b)),
and the same when X is replaced by Y . By the inversion theorem, PX ((a, b))
= PY ((a, b)). Now let a → −∞ along a sequence of points that are continuity
points for FX and FY . We obtain FX (b) = FY (b) if b is a continuity point for FX and FY . Since FX and FY are right continuous, FX (x) = FY (x) for all x.
The following proposition can be proved directly, but the proof using characteristic functions is much easier.
Proposition 6.8 (a) If X and Y are independent, X is a normal with mean a and variance b^2 , and Y is a normal with mean c and variance d^2 , then X + Y is normal with mean a + c and variance b^2 + d^2 .
(b) If X and Y are independent, X is Poisson with parameter λ1 , and Y is Poisson with parameter λ2 , then X + Y is Poisson with parameter λ1 + λ2 .
(c) If Xi are i.i.d. Cauchy, then Sn /n is Cauchy.
Proof. For (a),
ϕX+Y (t) = ϕX (t)ϕY (t) = e^{iat−b^2 t^2 /2} e^{ict−d^2 t^2 /2} = e^{i(a+c)t−(b^2 +d^2 )t^2 /2} .
Now use the uniqueness theorem.
Parts (b) and (c) are proved similarly.
6.2
Continuity theorem
Lemma 6.9 Suppose ϕ is the characteristic function of a probability µ. Then
µ([−2A, 2A]) ≥ A ∫_{−1/A}^{1/A} ϕ(t) dt − 1.
Proof. Note
(1/2T ) ∫_{−T}^T ϕ(t) dt = (1/2T ) ∫_{−T}^T ∫ e^{itx} µ(dx) dt = ∫ [ (1/2T ) ∫_{−T}^T 1_{[−T,T ]} (t) e^{itx} dt ] µ(dx) = ∫ (sin T x)/(T x) µ(dx).
Since | sin(T x)| ≤ 1, we have |(sin T x)/(T x)| ≤ 1/(2T A) if |x| ≥ 2A. Since |(sin T x)/(T x)| ≤ 1, we then have
∫ (sin T x)/(T x) µ(dx) ≤ µ([−2A, 2A]) + ∫_{[−2A,2A]c} (1/(2T A)) µ(dx)
= µ([−2A, 2A]) + (1/(2T A))(1 − µ([−2A, 2A]))
= 1/(2T A) + (1 − 1/(2T A)) µ([−2A, 2A]).
Setting T = 1/A,
(A/2) ∫_{−1/A}^{1/A} ϕ(t) dt ≤ 1/2 + (1/2) µ([−2A, 2A]).
Now multiply both sides by 2.
Proposition 6.10 If µn converges weakly to µ, then ϕn converges to ϕ uniformly on every finite interval.
Proof. Let ε > 0 and choose M large so that µ([−M, M]^c) < ε. Define f to be 1 on [−M, M], 0 on [−M − 1, M + 1]^c, and linear in between. Since ∫ f dµn → ∫ f dµ, then if n is large enough,

∫ (1 − f) dµn ≤ 2ε.

We have

|ϕn(t + h) − ϕn(t)| ≤ ∫ |e^{ihx} − 1| µn(dx)
≤ 2 ∫ (1 − f) dµn + |h| ∫ |x| f(x) µn(dx)
≤ 4ε + |h|(M + 1).

So for n large enough and |h| ≤ ε/(M + 1), we have

|ϕn(t + h) − ϕn(t)| ≤ 5ε,

which implies that the ϕn are equicontinuous. Since also ϕn(t) → ϕ(t) pointwise, the convergence is uniform on finite intervals.
The interesting result of this section is the converse, Lévy’s continuity
theorem.
Theorem 6.11 Suppose µn are probabilities, ϕn (t) converges to a function
ϕ(t) for each t, and ϕ is continuous at 0. Then ϕ is the characteristic
function of a probability µ and µn converges weakly to µ.
Proof. Let ε > 0. Since ϕ is continuous at 0, choose δ small so that

| (1/2δ) ∫_{−δ}^{δ} ϕ(t) dt − 1 | < ε.

Using the dominated convergence theorem, choose N such that

(1/2δ) ∫_{−δ}^{δ} |ϕn(t) − ϕ(t)| dt < ε
if n ≥ N. So if n ≥ N,

(1/2δ) ∫_{−δ}^{δ} ϕn(t) dt ≥ (1/2δ) ∫_{−δ}^{δ} ϕ(t) dt − (1/2δ) ∫_{−δ}^{δ} |ϕn(t) − ϕ(t)| dt ≥ 1 − 2ε.

By Lemma 6.9 with A = 1/δ, for such n,

µn([−2/δ, 2/δ]) ≥ 2(1 − 2ε) − 1 = 1 − 4ε.
This shows the µn are tight.
Let nj be a subsequence such that µnj converges weakly, say to µ. Then
ϕnj (t) → ϕµ (t), hence ϕ(t) = ϕµ (t), or ϕ is the characteristic function of
a probability µ. If µ′ is any subsequential weak limit point of µn, then ϕµ′(t) = ϕ(t) = ϕµ(t); so µ′ must equal µ. Hence µn converges weakly to µ.
We need the following estimate on moments.
Proposition 6.12 If E|X|^k < ∞ for an integer k, then ϕX has a continuous derivative of order k and

ϕ^{(k)}(t) = ∫ (ix)^k e^{itx} PX(dx).

In particular, ϕ^{(k)}(0) = i^k E X^k.
Proof. Write

(ϕ(t + h) − ϕ(t))/h = ∫ ((e^{i(t+h)x} − e^{itx})/h) PX(dx).

The integrand is bounded by |x|. So if ∫ |x| PX(dx) < ∞, we can use dominated convergence to obtain the desired formula for ϕ′(t). As in the proof of Proposition 6.1, we see ϕ′(t) is continuous. We do the case of general k by induction. Evaluating ϕ^{(k)} at 0 gives the particular case.
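The moment formula ϕ^{(k)}(0) = i^k E X^k lends itself to a quick numerical illustration. Below we take the characteristic function of an exponential random variable with parameter 1, ϕ(t) = 1/(1 − it), and recover E X = 1 and E X² = 2 by symmetric difference quotients (a sketch in plain Python; the step size h is an arbitrary choice):

```python
# Characteristic function of an Exp(1) random variable: ϕ(t) = 1/(1 - it).
phi = lambda t: 1 / (1 - 1j * t)

h = 1e-4
# symmetric difference quotients for ϕ'(0) and ϕ''(0)
d1 = (phi(h) - phi(-h)) / (2 * h)
d2 = (phi(h) - 2 * phi(0) + phi(-h)) / h ** 2

EX = (d1 / 1j).real         # ϕ'(0)  = i E X    →  E X  = 1
EX2 = (d2 / 1j ** 2).real   # ϕ''(0) = i² E X²  →  E X² = 2
assert abs(EX - 1.0) < 1e-6
assert abs(EX2 - 2.0) < 1e-6
```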
Here is a converse.
Proposition 6.13 If ϕ is the characteristic function of a random variable
X and ϕ00 (0) exists, then E |X|2 < ∞.
Proof. Note

(e^{ihx} − 2 + e^{−ihx})/h² = −2(1 − cos hx)/h² ≤ 0

and 2(1 − cos hx)/h² converges to x² as h → 0. So by Fatou's lemma,

∫ x² PX(dx) ≤ 2 lim inf_{h→0} ∫ ((1 − cos hx)/h²) PX(dx)
= − lim sup_{h→0} (ϕ(h) − 2ϕ(0) + ϕ(−h))/h² = −ϕ″(0) < ∞.
One nice application of the continuity theorem is a proof of the weak law of large numbers. Its proof is very similar to the proof of the central limit theorem, which we give in the next chapter. Another nice use of characteristic functions and martingales is the following.
Proposition 6.14 Suppose Xi is a sequence of independent random variables and Sn converges weakly. Then Sn converges almost surely.
Proof. We first rule out the possibility that |Sn | → ∞ with positive probability. If this happens with positive probability ε, given M there exists N
depending on M such that if n ≥ N , then P(|Sn | > M ) > ε. Then the limit
law of PSn , say, P∞ , will have P∞ ([−M, M ]c ) ≥ ε for all M , a contradiction.
Let N1 be the set of ω for which |Sn (ω)| → ∞.
Suppose Sn converges weakly to W . Then ϕSn (t) → ϕW (t) uniformly on
compact sets by Proposition 6.10. Since ϕW (0) = 1 and ϕW is continuous,
there exists δ such that |ϕW (t) − 1| < 1/2 if |t| < δ. So for n large, |ϕSn (t)| ≥
1/4 if |t| < δ.
Note

E[e^{itSn} | X1, . . . , Xn−1] = e^{itSn−1} E[e^{itXn} | X1, . . . , Xn−1] = e^{itSn−1} ϕXn(t).

Since ϕSn(t) = Π_{i=1}^n ϕXi(t), it follows that e^{itSn}/ϕSn(t) is a martingale.
Therefore for |t| < δ and n large, e^{itSn}/ϕSn(t) is a bounded martingale, and hence converges almost surely. Since ϕSn(t) → ϕW(t) ≠ 0, then e^{itSn} converges almost surely if |t| < δ.
Let A = {(ω, t) ∈ Ω × (−δ, δ) : e^{itSn(ω)} does not converge}. For each t, we have almost sure convergence, so ∫ 1A(ω, t) P(dω) = 0. Therefore ∫_{−δ}^{δ} ∫ 1A dP dt = 0, and by Fubini, ∫ ∫_{−δ}^{δ} 1A dt dP = 0. Hence almost surely, ∫_{−δ}^{δ} 1A(ω, t) dt = 0. This means there exists a set N2 with P(N2) = 0 such that if ω ∉ N2, then e^{itSn(ω)} converges for almost every t ∈ (−δ, δ). Call the limit, when it exists, L(t).
Fix ω ∉ N1 ∪ N2.

Suppose one subsequence of Sn(ω) converges to Q and another subsequence to R. Then L(a) = e^{iaR} = e^{iaQ} for a.e. a small, which implies R = Q.

Suppose one subsequence converges to R ≠ 0 and another to ±∞. By integrating e^{itSn} and using dominated convergence, we see that

∫_0^a e^{itSn} dt = (e^{iaSn} − 1)/(iSn)

converges. We then obtain

(e^{iaR} − 1)/(iR) = 0

for a.e. a small, which implies R = 0, a contradiction.

Finally suppose one subsequence Snj converges to 0 and another to ±∞. Since (e^{iaSnj} − 1)/(iSnj) → a, we obtain a = 0 for a.e. a small, which is impossible.

We conclude Sn must converge.
Chapter 7
Central limit theorem
The simplest case of the central limit theorem (CLT) is the case when the Xi are i.i.d. with mean zero and variance one, and then the CLT says that Sn/√n converges weakly to a standard normal. We first prove this case.
Lemma 7.1 If cn are complex numbers with cn → c, then

(1 + cn/n)^n → e^c.
Proof. If cn → c, there exists a real number R such that |cn| ≤ R for all n and then |c| ≤ R. We use the fact (from the Taylor series expansion of e^x) that for x > 0

e^x = 1 + x + x²/2! + · · · ≥ 1 + x.

We also use the identity

a^n − b^n = (a − b)(a^{n−1} + b a^{n−2} + b² a^{n−3} + · · · + b^{n−1}),

which implies

|a^n − b^n| ≤ |a − b| n (|a| ∨ |b|)^{n−1},

where |a| ∨ |b| = max(|a|, |b|).

Using this inequality with a = 1 + cn/n and b = 1 + c/n, we have

|(1 + cn/n)^n − (1 + c/n)^n| ≤ |cn − c| (1 + R/n)^{n−1} ≤ |cn − c| (1 + R/n)^n ≤ |cn − c| e^R → 0

as n → ∞.
Now using the inequality with a = 1 + c/n and b = e^{c/n} (so that |a| ∨ |b| ≤ e^{R/n}), we have

|(1 + c/n)^n − (e^{c/n})^n| ≤ |(1 + c/n) − e^{c/n}| n (e^{R/n})^{n−1} ≤ |(1 + c/n) − e^{c/n}| n e^R.

A Taylor series expansion or l'Hôpital's rule shows that n|(1 + c/n) − e^{c/n}| → 0, which proves the lemma.
Lemma 7.2 If f is twice continuously differentiable, then

|f(x + h) − [f(x) + f′(x)h + ½ f″(x)h²]| / h² → 0

as h → 0.
Proof. Write

f(x + h) − [f(x) + f′(x)h + ½ f″(x)h²] = ∫_x^{x+h} [f′(y) − f′(x) − f″(x)(y − x)] dy
= ∫_x^{x+h} ∫_x^y [f″(z) − f″(x)] dz dy.

So the above is bounded in absolute value by

∫_x^{x+h} ∫_x^y A(h) dz dy ≤ A(h) ∫_x^{x+h} h dy ≤ A(h)h²,

where

A(h) = sup_{|w−x|≤h} |f″(w) − f″(x)|.

Since f″ is continuous, A(h) → 0 as h → 0, which proves the lemma.
Theorem 7.3 Suppose the Xi are i.i.d., mean zero, and variance one. Then Sn/√n converges weakly to a standard normal.
Proof. Since X1 has finite second moment, ϕX1 has a continuous second derivative. By Taylor's theorem,

ϕX1(t) = ϕX1(0) + ϕ′X1(0)t + ϕ″X1(0)t²/2 + R(t),

where |R(t)|/t² → 0 as |t| → 0. So

ϕX1(t) = 1 − t²/2 + R(t).

Then

ϕSn/√n(t) = ϕSn(t/√n) = (ϕX1(t/√n))^n = [1 − t²/(2n) + R(t/√n)]^n.

Since t/√n converges to zero as n → ∞, we have nR(t/√n) → 0, and so by Lemma 7.1,

ϕSn/√n(t) → e^{−t²/2}.

Now apply the continuity theorem.
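The theorem is easy to visualize by simulation. The sketch below (plain Python; the choice of step distribution Uniform(−√3, √3), the sample sizes, and the tolerance are ours) compares the empirical distribution of Sn/√n with the standard normal CDF:

```python
import math, random

random.seed(0)

def clt_sample(n):
    """One draw of Sn/√n for i.i.d. Uniform(-√3, √3) steps (mean 0, variance 1)."""
    s3 = math.sqrt(3)
    return sum(random.uniform(-s3, s3) for _ in range(n)) / math.sqrt(n)

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

m, n = 20000, 50
draws = [clt_sample(n) for _ in range(m)]
for x in (0.0, 1.0):
    frac = sum(d <= x for d in draws) / m
    assert abs(frac - Phi(x)) < 0.02   # matches Φ(x) up to sampling error
```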
Let us give another proof of this simple CLT that does not use characteristic functions. For simplicity let Xi be i.i.d. mean zero, variance one random variables with E|Xi|³ < ∞.

Proposition 7.4 With Xi as above, Sn/√n converges weakly to a standard normal.
Proof. Let Y1 , . . . , Yn be i.i.d. standard normal random variables that are
independent of the Xi . Let Z1 = Y2 + · · · + Yn , Z2 = X1 + Y3 + · · · + Yn ,
Z3 = X1 + X2 + Y4 + · · · + Yn , etc.
Let us suppose g ∈ C³ with compact support and let W be a standard normal. Our first goal is to show

|E g(Sn/√n) − E g(W)| → 0.   (7.1)

We have

E g(Sn/√n) − E g(W) = E g(Sn/√n) − E g(Σ_{i=1}^n Yi/√n)
= Σ_{i=1}^n [ E g((Xi + Zi)/√n) − E g((Yi + Zi)/√n) ].
By Taylor's theorem,

g((Xi + Zi)/√n) = g(Zi/√n) + g′(Zi/√n)(Xi/√n) + ½ g″(Zi/√n)(Xi²/n) + Rn,

where |Rn| ≤ ‖g‴‖∞ |Xi|³/n^{3/2}. Taking expectations and using the independence of Xi and Zi,

E g((Xi + Zi)/√n) = E g(Zi/√n) + 0 + (1/2n) E g″(Zi/√n) + E Rn,

since E Xi = 0 and E Xi² = 1. We have a very similar expression for E g((Yi + Zi)/√n). Taking the difference, the first three terms cancel, and

|E g((Xi + Zi)/√n) − E g((Yi + Zi)/√n)| ≤ ‖g‴‖∞ (E|Xi|³ + E|Yi|³)/n^{3/2}.

Summing over i from 1 to n bounds the difference by a constant times n^{−1/2}, and we have (7.1).
By approximating continuous functions with compact support by C³ functions with compact support, we have (7.1) for such g. Since E(Sn/√n)² = 1, the sequence Sn/√n is tight. So given ε we have that there exists M such that P(|Sn/√n| > M) < ε for all n. By taking M larger if necessary, we also have P(|W| > M) < ε. Suppose g is bounded and continuous. Let ψ be a continuous function with compact support that is bounded by one, is nonnegative, and that equals 1 on [−M, M]. By (7.1) applied to gψ,

|E(gψ)(Sn/√n) − E(gψ)(W)| → 0.

However,

|E g(Sn/√n) − E(gψ)(Sn/√n)| ≤ ‖g‖∞ P(|Sn/√n| > M) < ε‖g‖∞,

and similarly

|E g(W) − E(gψ)(W)| < ε‖g‖∞.

Since ε is arbitrary, this proves (7.1) for bounded continuous g. By Proposition 5.1, this proves our proposition.
We give another example of the use of characteristic functions.
Proposition 7.5 Suppose for each n the random variables Xni, i = 1, . . . , n, are i.i.d. Bernoullis with parameter pn. If npn → λ and Sn = Σ_{i=1}^n Xni, then Sn converges weakly to a Poisson random variable with parameter λ.
Proof. We write

ϕSn(t) = [ϕXn1(t)]^n = [1 + pn(e^{it} − 1)]^n = [1 + (npn/n)(e^{it} − 1)]^n → e^{λ(e^{it}−1)}

by Lemma 7.1, since npn(e^{it} − 1) → λ(e^{it} − 1). Now apply the continuity theorem.
A much more general theorem than Theorem 7.3 is the Lindeberg-Feller
theorem.
Theorem 7.6 Suppose for each n, Xni, i = 1, . . . , n, are mean zero independent random variables. Suppose
(a) Σ_{i=1}^n E Xni² → σ² > 0 and
(b) for each ε > 0, Σ_{i=1}^n E[|Xni|²; |Xni| > ε] → 0.
Let Sn = Σ_{i=1}^n Xni. Then Sn converges weakly to a normal random variable with mean zero and variance σ².
Note nothing is said about independence of the Xni for different n.
Let us look at Theorem 7.3 in light of this theorem. Suppose the Yi are i.i.d. and let Xni = Yi/√n. Then

Σ_{i=1}^n E(Yi/√n)² = E Y1²

and

Σ_{i=1}^n E[|Xni|²; |Xni| > ε] = n E[|Y1|²/n; |Y1| > √n ε] = E[|Y1|²; |Y1| > √n ε],

which tends to 0 by the dominated convergence theorem.
If the Yi are independent with mean 0, and

Σ_{i=1}^n E|Yi|³ / (Var Sn)^{3/2} → 0,

then Sn/(Var Sn)^{1/2} converges weakly to a standard normal. This is known as Lyapounov's theorem; we leave the derivation of this from the Lindeberg-Feller theorem as an exercise for the reader.
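For i.i.d. summands the Lyapounov ratio is simple to compute. For example, for Yi uniform on (−√3, √3) (mean zero, variance one), E|Y1|³ = 3√3/4, so the ratio is E|Y1|³/√n → 0 (a sketch in plain Python):

```python
import math

# Lyapounov ratio  Σ_{i=1}^n E|Yi|³ / (Var Sn)^{3/2}  for i.i.d. Uniform(-√3, √3)
EY2 = 1.0                   # Var Y1
EY3 = 3 * math.sqrt(3) / 4  # E|Y1|³ = (1/√3) ∫_0^{√3} y³ dy

def lyapounov_ratio(n):
    return n * EY3 / (n * EY2) ** 1.5   # = E|Y1|³ / √n

assert abs(lyapounov_ratio(100) - EY3 / 10) < 1e-12
assert lyapounov_ratio(10000) < lyapounov_ratio(100) < lyapounov_ratio(1)
```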
Proof. Let ϕni be the characteristic function of Xni and let σni² be the variance of Xni. We need to show

Π_{i=1}^n ϕni(t) → e^{−t²σ²/2}.   (7.2)
Using Taylor series, |e^{ib} − 1 − ib + b²/2| ≤ c|b|³ for a constant c. Also,

|e^{ib} − 1 − ib + b²/2| ≤ |e^{ib} − 1 − ib| + |b|²/2 ≤ c|b|².

If we apply this to a random variable tY and take expectations,

|ϕY(t) − (1 + it E Y − t² E Y²/2)| ≤ c(t² E Y² ∧ |t|³ E|Y|³).

Applying this to Y = Xni (which has mean zero),

|ϕni(t) − (1 − t²σni²/2)| ≤ c E[|t|³|Xni|³ ∧ t²|Xni|²].

The right hand side is less than or equal to

c E[|t|³|Xni|³; |Xni| ≤ ε] + c E[t²|Xni|²; |Xni| > ε]
≤ cε|t|³ E[|Xni|²] + ct² E[|Xni|²; |Xni| > ε].

Summing over i we obtain

Σ_{i=1}^n |ϕni(t) − (1 − t²σni²/2)| ≤ cε|t|³ Σ_i E[|Xni|²] + ct² Σ_i E[|Xni|²; |Xni| > ε].
We need the following inequality: if |ai|, |bi| ≤ 1, then

|Π_{i=1}^n ai − Π_{i=1}^n bi| ≤ Σ_{i=1}^n |ai − bi|.
To prove this, note

Π_i ai − Π_i bi = (an − bn) Π_{i<n} bi + an (Π_{i<n} ai − Π_{i<n} bi)

and use induction.
Note |ϕni(t)| ≤ 1, and |1 − t²σni²/2| ≤ 1 because σni² ≤ ε² + E[|Xni|²; |Xni| > ε] < 1/t² if we take ε small enough and n large enough. So

|Π_{i=1}^n ϕni(t) − Π_{i=1}^n (1 − t²σni²/2)| ≤ cε|t|³ Σ_i E[|Xni|²] + ct² Σ_i E[|Xni|²; |Xni| > ε].
Since sup_i σni² → 0, then log(1 − t²σni²/2) is asymptotic to −t²σni²/2, and so

Π_i (1 − t²σni²/2) = exp( Σ_i log(1 − t²σni²/2) )

is asymptotically equal to

exp( −t² Σ_i σni²/2 ) = e^{−t²σ²/2}.

Since ε is arbitrary, the proof is complete.
We now complete the proof of Theorem 2.11.

Proof of "only if" part of Theorem 2.11. Since Σ Xn converges, then Xn must converge to zero a.s., and so P(|Xn| > A i.o.) = 0. By the Borel–Cantelli lemma, this says Σ P(|Xn| > A) < ∞. Since P(Xn ≠ Yn i.o.) = 0, we also conclude that Σ Yn converges.
√
cn .
LetP
cn = ni=1 Var Yi and suppose
c
→
∞.
Let
Z
=
(Y
−
E
Y
)/
n
nm
m
m
Pn
n
Then m=1 Var Znm = (1/cn ) m=1 Var Ym = 1. If ε > 0, then for n large,
√
we have 2A/ cn < ε. Since |YP
m | ≤ A and hence |E Ym | ≤ A, then |Znm | ≤
√
2A/ cn < ε. It follows
that nm=1 E (|Znm |2 ; |Znm | > ε) = 0 for large n.
Pn
√
By Theorem 7.6, Pm=1 (Ym − E Ym )/ cn converges weakly
P to a√standard
normal. However, nm=1 Ym converges,
and
c
→
∞,
so
Ym / cn must
n
P
√
converge to 0. The quantities
E Ym / cn are nonrandom, so there is no
way the difference can converge to a standard normal, a contradiction. We
conclude cn does not converge to infinity.
Let Vi = Yi − E Yi. Since |Vi| ≤ 2A, E Vi = 0, and Var Vi = Var Yi, which is summable, by the "if" part of the three series criterion, Σ Vi converges. Since Σ Yi converges, taking the difference shows Σ E Yi converges.
Chapter 8
Gaussian sequences
We first prove a converse to Proposition 6.3.
Proposition 8.1 If E ei(uX+vY ) = E eiuX E eivY for all u and v, then X and
Y are independent random variables.
Proof. Let X′ be a random variable with the same law as X, Y′ one with the same law as Y, and X′, Y′ independent. (We let Ω = [0, 1]², P Lebesgue measure, X′ a function of the first variable, and Y′ a function of the second variable defined as in Proposition 1.2.) Then E e^{i(uX′+vY′)} = E e^{iuX′} E e^{ivY′}. Since X, X′ have the same law, they have the same characteristic function, and similarly for Y, Y′. Therefore (X′, Y′) has the same joint characteristic function as (X, Y). By the uniqueness of the Fourier transform, (X′, Y′) has the same joint law as (X, Y), which is easily seen to imply that X and Y are independent.
A sequence of random variables X1, . . . , Xn is said to be jointly normal if there exists a sequence of independent standard normal random variables Z1, . . . , Zm and constants bij and ai such that Xi = Σ_{j=1}^m bij Zj + ai, i = 1, . . . , n. In matrix notation, X = BZ + A. For simplicity, in what follows let us take A = 0; the modifications for the general case are easy. The covariance of two random variables X and Y is defined to be E[(X − E X)(Y − E Y)]. Since we are assuming our normal random variables are mean 0, we can omit the centering at expectations. Given a sequence of mean 0 random variables, we can talk about the covariance matrix, which is Cov(X) = E XXᵗ, where Xᵗ denotes the transpose of the vector X. In the above case, we see Cov(X) = E[(BZ)(BZ)ᵗ] = E[BZZᵗBᵗ] = BBᵗ, since E ZZᵗ = I, the identity.
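The identity Cov(X) = BBᵗ is easy to check by simulation (a sketch in plain Python with a particular 2×2 matrix B of our choosing; random.gauss supplies the independent standard normals):

```python
import random

random.seed(0)

# X = BZ with Z a vector of independent standard normals; Cov(X) = B Bᵗ.
B = [[1.0, 0.0],
     [0.8, 0.6]]

def sample_X():
    Z = [random.gauss(0, 1) for _ in range(2)]
    return [sum(B[i][j] * Z[j] for j in range(2)) for i in range(2)]

m = 50000
xs = [sample_X() for _ in range(m)]
# empirical covariance matrix (the mean is 0, so no centering is needed)
cov = [[sum(x[i] * x[j] for x in xs) / m for j in range(2)] for i in range(2)]

BBt = [[sum(B[i][k] * B[j][k] for k in range(2)) for j in range(2)] for i in range(2)]
for i in range(2):
    for j in range(2):
        assert abs(cov[i][j] - BBt[i][j]) < 0.05
```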
Let us compute the joint characteristic function E e^{iuᵗX} of the vector X, where u is an n-dimensional vector. First, if v is an m-dimensional vector,

E e^{ivᵗZ} = E Π_{j=1}^m e^{ivjZj} = Π_{j=1}^m E e^{ivjZj} = Π_{j=1}^m e^{−vj²/2} = e^{−vᵗv/2},

using the independence of the Z's. Setting v = Bᵗu,

E e^{iuᵗX} = E e^{iuᵗBZ} = e^{−uᵗBBᵗu/2}.
By taking u = (0, . . . , 0, a, 0, . . . , 0) to be a constant times the unit vector
in the jth coordinate direction, we deduce that each of the X’s is indeed
normal.
Proposition 8.2 If the Xi are jointly normal and Cov (Xi , Xj ) = 0 for
i 6= j, then the Xi are independent.
Proof. If Cov(X) = BBᵗ is a diagonal matrix, then the joint characteristic function of the Xi factors, and so by Proposition 8.1 the Xi are independent.
Chapter 9
Kolmogorov extension theorem
The goal of this section is to show how to construct probability measures on
RN = R × R × · · · . We may view RN as the set of sequences (x1 , x2 , . . .) of
elements of R. Given an element x = (x1 , x2 , . . .) of RN , we define τn (x) =
(x1 , . . . , xn ) ∈ Rn . A cylindrical set in RN is a set of the form A × RN , where
A is a Borel subset of Rn for some n ≥ 1. Another way of phrasing this is
to say a cylindrical set is one of the form τn^{−1}(A), where n ≥ 1 and A is a Borel subset of Rn. We furnish RN with the product topology. Recall that this means we take the smallest topology that contains all cylindrical sets.
We use the σ-field on RN generated by the cylindrical sets. Thus the σ-field
we use is the same as the Borel σ-field on RN . We use Bn to denote the Borel
σ-field on Rn .
We suppose that for each n we have a probability measure µn defined on
(Rn , Bn ). The µn are consistent if µn+1 (A × R) = µn (A) whenever A ∈ Bn .
The Kolmogorov extension theorem is the following.
Theorem 9.1 Suppose for each n we have a probability measure µn on
(Rn , Bn ). Suppose the µn are consistent. Then there exists a probability
measure µ on RN such that µ(A × RN ) = µn (A) for all A ∈ Bn .
Proof. Define µ on cylindrical sets by µ(A × RN ) = µm (A) if A ∈ Bm .
By the consistency assumption, µ is well defined. If A0 is the collection of
cylindrical sets, it is easy to see that A0 is an algebra of sets and that µ is
finitely additive on A0 . If we can show that µ is countably additive on A0 ,
69
70
CHAPTER 9. KOLMOGOROV EXTENSION THEOREM
then by the Carathéodory extension theorem, we can extend µ to the σ-field
generated by the cylindrical sets. It suffices to show that whenever An ↓ ∅
with An ∈ A0 , then µ(An ) → 0.
Suppose that An are cylindrical sets decreasing to ∅ but µ(An ) does not
tend to 0; by taking a subsequence we may assume without loss of generality
that there exists ε > 0 such that µ(An ) ≥ ε for all n. We will obtain a
contradiction.
It is possible that An might depend on fewer or more than n coordinates. It will be more convenient if we arrange things so that An depends on exactly n coordinates. We want An = τn^{−1}(Ãn) for some Ãn a Borel subset of Rn. Suppose An is of the form

An = τjn^{−1}(Dn)

for some Dn ⊂ R^{jn}; in other words, An depends on jn coordinates. By letting A0 = RN and replacing our original sequence by A0, . . . , A0, A1, . . . , A1, A2, . . . , A2, . . ., where we repeat each Ai sufficiently many times, we may without loss of generality suppose that jn ≤ n. On the other hand, if jn < n and An = τjn^{−1}(Dn), we may write An = τn^{−1}(D̂n) with D̂n = Dn × R^{n−jn}. Thus we may without loss of generality suppose that An depends on exactly n coordinates.
We set Ãn = τn(An). For each n, choose B̃n ⊂ Ãn so that B̃n is compact and µn(Ãn − B̃n) ≤ ε/2^{n+1}. To do this, first we choose M such that µn(([−M, M]^n)^c) < ε/2^{n+2}, and then we choose a compact subset B̃n of Ãn ∩ [−M, M]^n such that µn(Ãn ∩ [−M, M]^n − B̃n) ≤ ε/2^{n+2}. Let Bn = τn^{−1}(B̃n) and let Cn = B1 ∩ . . . ∩ Bn. Hence Cn ⊂ Bn ⊂ An and Cn ↓ ∅. Since A1 ∩ · · · ∩ An − B1 ∩ · · · ∩ Bn ⊂ ∪_{i=1}^n (Ai − Bi) and A1 ∩ · · · ∩ An = An, we have

µ(Cn) ≥ µ(An) − Σ_{i=1}^n µ(Ai − Bi) ≥ ε/2.

Also C̃n = τn(Cn), the projection of Cn onto Rn, is compact.
We will find x = (x1, . . . , xn, . . .) ∈ ∩n Cn and obtain our contradiction. For each n choose a point y(n) ∈ Cn. The first coordinates of {y(n)}, namely {y1(n)}, form a sequence contained in C̃1, which is compact, hence there is a convergent subsequence {y1(nk)}. Let x1 be the limit point. The first and second coordinates of {y(nk)} form a sequence contained in the compact set C̃2, so a further subsequence {(y1(nkj), y2(nkj))} converges to a point in C̃2. Since {nkj} is a subsequence of {nk}, the first coordinate of the limit is x1. Therefore the limit point of {(y1(nkj), y2(nkj))} is of the form (x1, x2), and this point is in C̃2. We continue this procedure to obtain x = (x1, x2, . . . , xn, . . .). By our construction, (x1, . . . , xn) ∈ C̃n for each n, hence x ∈ Cn for each n, or x ∈ ∩n Cn, a contradiction.
A typical application of this theorem is to construct a countable sequence of independent random variables. First construct X1, . . . , Xn to be a collection of n independent random variables. Let µn be the joint law of (X1, . . . , Xn); it is easy to check that the µn form a consistent family. We use Theorem 9.1 to obtain a probability measure µ on RN. To get random variables out of this, we let Xi(ω) = ωi if ω = (ω1, ω2, . . .). If we set P = µ, then the law of (X1, X2, . . .) under P is µ.
Chapter 10
Brownian motion
10.1 Definition and construction
In this section we construct Brownian motion and define Wiener measure.
Let (Ω, F, P) be a probability space and let B be the Borel σ-field on
[0, ∞). A stochastic process, denoted X(t, ω) or Xt (ω) or just Xt , is a map
from [0, ∞) × Ω to R that is measurable with respect to the product σ-field
of B and F.
Definition 10.1 A stochastic process Xt is a one-dimensional Brownian motion started at 0 if
(1) X0 = 0 a.s.;
(2) for all s ≤ t, Xt −Xs is a mean zero normal random variable with variance
t − s;
(3) the random variables Xri − Xri−1 , i = 1, . . . , n, are independent whenever
0 ≤ r0 ≤ r1 ≤ · · · ≤ rn ;
(4) there exists a null set N such that if ω ∈
/ N , then the map t → Xt (ω) is
continuous.
Brownian motion has a useful scaling property.
Proposition 10.2 If Xt is a Brownian motion started at 0, a > 0, and
Yt = a−1 Xa2 t , then Yt is a Brownian motion started at 0.
73
74
CHAPTER 10. BROWNIAN MOTION
Proof. Clearly Y has continuous paths and Y0 = 0. We observe that
Yt − Ys = a−1 (Xa2 t − Xa2 s ) is a mean zero normal with variance t − s. If
0 = r0 ≤ r1 < · · · < rn , then a2 r0 ≤ a2 r1 < · · · < a2 rn , so the Xa2 ri − Xa2 ri−1
are independent, and hence the Yri − Yri−1 are independent.
Let us show that there exists a Brownian motion. We give the Haar
function construction, which is one of the quickest ways to the construction
of Brownian motion.
For i = 1, 2, . . ., j = 1, 2, . . . , 2^{i−1}, let ϕij be the function on [0, 1] defined by

ϕij(x) = 2^{(i−1)/2} if x ∈ [(2j − 2)/2^i, (2j − 1)/2^i);
ϕij(x) = −2^{(i−1)/2} if x ∈ [(2j − 1)/2^i, 2j/2^i);
ϕij(x) = 0 otherwise.
Let ϕ00 be the function that is identically 1. The ϕij are called the Haar functions. If ⟨·, ·⟩ denotes the inner product in L²([0, 1]), that is, ⟨f, g⟩ = ∫_0^1 f(x)g(x) dx, note the ϕij are orthogonal and have norm 1. It is also easy to see that they form a complete orthonormal system for L²: ϕ00 ≡ 1; 1_{[0,1/2)} and 1_{[1/2,1)} are both linear combinations of ϕ00 and ϕ11; 1_{[0,1/4)} and 1_{[1/4,1/2)} are both linear combinations of 1_{[0,1/2)}, ϕ21, and ϕ22. Continuing in this way, we see that 1_{[k/2^n, (k+1)/2^n)} is a linear combination of the ϕij for each n and each k < 2^n. Since any continuous function can be uniformly approximated by step functions whose jumps are at the dyadic rationals, linear combinations of the Haar functions are dense in the set of continuous functions, which in turn is dense in L²([0, 1]).
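Orthonormality of the Haar functions can be verified numerically on a dyadic grid, where the Riemann sums below are exact (a sketch in plain Python; the grid size N and the range of levels checked are our choices):

```python
N = 2 ** 12  # dyadic grid; the Riemann sums below are exact for these step functions

def haar(i, j):
    """Values of ϕij at the points k/N, k = 0, ..., N-1; haar(0, 0) is ϕ00 ≡ 1."""
    if i == 0:
        return [1.0] * N
    lo, mid, hi = (2 * j - 2) / 2 ** i, (2 * j - 1) / 2 ** i, (2 * j) / 2 ** i
    c = 2 ** ((i - 1) / 2)
    return [c if lo <= k / N < mid else (-c if mid <= k / N < hi else 0.0)
            for k in range(N)]

def inner(f, g):
    """Inner product in L2([0,1]), computed as an average over the grid."""
    return sum(a * b for a, b in zip(f, g)) / N

fns = [((0, 0), haar(0, 0))] + [((i, j), haar(i, j))
                                for i in range(1, 5)
                                for j in range(1, 2 ** (i - 1) + 1)]
for idx1, f in fns:
    for idx2, g in fns:
        want = 1.0 if idx1 == idx2 else 0.0
        assert abs(inner(f, g) - want) < 1e-9
```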
Let ψij(t) = ∫_0^t ϕij(r) dr. Let Yij be a sequence of independent identically distributed standard normal random variables. Set

V0(t) = Y00 ψ00(t),   Vi(t) = Σ_{j=1}^{2^{i−1}} Yij ψij(t), i ≥ 1.
If {ei}, i = 1, . . . , N, is a finite orthonormal set and f = Σ_{i=1}^N ai ei, then

⟨f, ej⟩ = Σ_{i=1}^N ai ⟨ei, ej⟩ = aj,

or

f = Σ_{i=1}^N ⟨f, ei⟩ ei.

If f = Σ_{i=1}^N ai ei and g = Σ_{j=1}^N bj ej, then

⟨f, g⟩ = Σ_{i=1}^N Σ_{j=1}^N ai bj ⟨ei, ej⟩ = Σ_{i=1}^N ai bi,

or

⟨f, g⟩ = Σ_{i=1}^N ⟨f, ei⟩⟨g, ei⟩.
We need one more preliminary. If Z is a standard normal random variable, then Z has density (2π)^{−1/2} e^{−x²/2}. Since

∫ x⁴ e^{−x²/2} dx < ∞,

then E Z⁴ < ∞. We then have

P(|Z| > λ) = P(Z⁴ > λ⁴) ≤ E Z⁴/λ⁴.   (10.1)
Theorem 10.3 Σ_{i=0}^∞ Vi(t) converges uniformly in t a.s. If we call the sum Xt, then Xt is a Brownian motion started at 0.
Proof. Step 1. We first prove convergence of the series. Let

Ai = (|Vi(t)| > i^{−2} for some t ∈ [0, 1]).

We will show Σ_{i=1}^∞ P(Ai) < ∞. Then by the Borel–Cantelli lemma, except for ω in a null set, there exists i0(ω) such that if i ≥ i0(ω), we have sup_t |Vi(t)(ω)| ≤ i^{−2}. This will show Σ_{i=0}^I Vi(t)(ω) converges as I → ∞, uniformly over t ∈ [0, 1]. Moreover, since each ψij(t) is continuous in t, then so is each Vi(t)(ω), and we thus deduce that Xt(ω) is continuous in t.
Now for i ≥ 1 and j1 ≠ j2, for each t at least one of ψij1(t) and ψij2(t) is zero. Also, the maximum value of ψij is 2^{−(i+1)/2}. Hence

P(|Vi(t)| > i^{−2} for some t ∈ [0, 1])
≤ P(|Yij| ψij(t) > i^{−2} for some t ∈ [0, 1], some 1 ≤ j ≤ 2^{i−1})
≤ P(|Yij| 2^{−(i+1)/2} > i^{−2} for some 1 ≤ j ≤ 2^{i−1})
≤ Σ_{j=1}^{2^{i−1}} P(|Yij| 2^{−(i+1)/2} > i^{−2})
= 2^{i−1} P(|Z| > 2^{(i+1)/2} i^{−2}),

where Z is a standard normal random variable. Using (10.1), this is at most 2^{i−1} · E Z⁴ i⁸ 2^{−2(i+1)} = c i⁸ 2^{−i}, and we conclude P(Ai) is summable in i.
Step 2. Next we show that the limit, Xt, satisfies the definition of Brownian motion. It is obvious that each Xt has mean zero and that X0 = 0.

Let D = {k/2^n : n ≥ 0, 0 ≤ k ≤ 2^n}, the dyadic rationals. If s, t ∈ D and s < t, then there exists N such that s = k/2^N, t = ℓ/2^N, and then

E[Xs Xt] = Σ_{i=0}^N Σ_j Σ_{k=0}^N Σ_ℓ E[Yij Ykℓ] ψij(s) ψkℓ(t)
= Σ_{i=0}^N Σ_j ψij(s) ψij(t)
= Σ_{i=0}^N Σ_j ⟨1_{[0,s]}, ϕij⟩ ⟨1_{[0,t]}, ϕij⟩
= ⟨1_{[0,s]}, 1_{[0,t]}⟩ = s.

Thus Cov(Xs, Xt) = s. Applying this also with s replaced by t and t replaced by s, we see that Var Xt = t and Var Xs = s, and therefore

Var(Xt − Xs) = t − 2s + s = t − s.

Clearly Xt − Xs is normal, hence

E e^{iu(Xt−Xs)} = e^{−u²(t−s)/2}.
For general s, t ∈ [0, 1], choose sm < tm in D such that sm → s and tm → t. Using dominated convergence and the fact that X has continuous paths,

E e^{iu(Xt−Xs)} = lim_{m→∞} E e^{iu(Xtm−Xsm)} = lim_{m→∞} e^{−u²(tm−sm)/2} = e^{−u²(t−s)/2}.

This proves (2).
We can work backwards from this to see that if s < t, then

t − s = Var(Xt − Xs) = Var Xt − 2 Cov(Xs, Xt) + Var Xs = s + t − 2 Cov(Xs, Xt),

or Cov(Xs, Xt) = s.
Suppose 0 ≤ r1 < · · · < rn are in D. There exists N such that each ri = mi/2^N. So Xrk = Σ_{i=0}^N Σ_j Yij ψij(rk), and hence the Xrk are jointly normal. If ri < rj, then

Cov(Xri − Xri−1, Xrj − Xrj−1) = (ri − ri) − (ri−1 − ri−1) = 0,

and hence the increments are independent. We therefore have

E e^{i Σ_{k=1}^n uk(Xrk − Xrk−1)} = Π_{k=1}^n E e^{iuk(Xrk − Xrk−1)}.   (10.2)

Now suppose the rk are in [0, 1]; take rkm in D converging to rk with r1m < · · · < rnm. Replacing rk by rkm, letting m → ∞, and using dominated convergence, we have (10.2). This proves independent increments.
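The covariance computation at the heart of Step 2 can be checked numerically: the partial sums Σ_{i≤I} Σ_j ψij(s)ψij(t) should approach s ∧ t. A sketch in plain Python (the truncation level and test points are our choices):

```python
def psi(i, j, t):
    """Schauder function ψij(t) = ∫_0^t ϕij(r) dr.  For i ≥ 1 it is a tent of
    height 2^{-(i+1)/2} on [(2j-2)/2^i, 2j/2^i]; we use psi(0, 1, t) for ψ00(t) = t."""
    if i == 0:
        return t
    lo, mid, hi = (2 * j - 2) / 2 ** i, (2 * j - 1) / 2 ** i, (2 * j) / 2 ** i
    c = 2 ** ((i - 1) / 2)
    if t <= lo or t >= hi:
        return 0.0
    return c * (t - lo) if t <= mid else c * (hi - t)

def partial_cov(s, t, I):
    """Partial sum Σ_{i≤I} Σ_j ψij(s)ψij(t); by Parseval it tends to s ∧ t."""
    total = psi(0, 1, s) * psi(0, 1, t)
    for i in range(1, I + 1):
        for j in range(1, 2 ** (i - 1) + 1):
            total += psi(i, j, s) * psi(i, j, t)
    return total

for s, t in [(0.25, 0.75), (0.3, 0.3), (0.1, 0.9)]:
    assert abs(partial_cov(s, t, 16) - min(s, t)) < 1e-4
```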
The stochastic process Xt induces a measure on C([0, 1]). We say A ⊂ C([0, 1]) is a cylindrical set if

A = {f ∈ C([0, 1]) : (f(r1), . . . , f(rn)) ∈ B}

for some n ≥ 1, r1 ≤ · · · ≤ rn, and B a Borel subset of Rn. For A a cylindrical set, define µ(A) = P(X·(ω) ∈ A), where X is a Brownian motion and X·(ω) is the function t → Xt(ω). We extend µ to the σ-field generated by the cylindrical sets. If B is in this σ-field, then µ(B) = P(X· ∈ B). The probability measure µ is called Wiener measure.
We defined Brownian motion for t ∈ [0, 1]. To define Brownian motion for t ∈ [0, ∞), take a sequence {Xt^n} of independent Brownian motions on [0, 1] and piece them together as follows. Define Xt = Xt^1 for 0 ≤ t ≤ 1. For 1 < t ≤ 2, define Xt = X1 + X^2_{t−1}. For 2 < t ≤ 3, let Xt = X2 + X^3_{t−2}, and so on.
10.2 Nowhere differentiability
We have the following result, which says that except for a set of ω’s that
form a null set, t → Xt (ω) is a function that does not have a derivative at
any time t ∈ [0, 1].
Theorem 10.4 With probability one, the paths of Brownian motion are nowhere differentiable.
Proof. Note that if Z is a normal random variable with mean 0 and variance 1, then

P(|Z| ≤ r) = (1/√(2π)) ∫_{−r}^{r} e^{−x²/2} dx ≤ 2r.   (10.3)
Let M, h > 0 and let

AM,h = {∃s ∈ [0, 1] : |Xt − Xs| ≤ M|t − s| if |t − s| ≤ h},
Bn = {∃k ≤ 2n : |Xk/n − X(k−1)/n| ≤ 4M/n, |X(k+1)/n − Xk/n| ≤ 4M/n, |X(k+2)/n − X(k+1)/n| ≤ 4M/n}.

We check that AM,h ⊂ Bn if n ≥ 2/h. To see this, if ω ∈ AM,h, there exists an s such that |Xt − Xs| ≤ M|t − s| if |t − s| ≤ 2/n; let k/n be the largest multiple of 1/n less than or equal to s. Then

|(k + 2)/n − s| ≤ 2/n   and   |(k + 1)/n − s| ≤ 2/n,

and therefore

|X(k+2)/n − X(k+1)/n| ≤ |X(k+2)/n − Xs| + |Xs − X(k+1)/n| ≤ 2M/n + 2M/n = 4M/n.
Similarly |X(k+1)/n − Xk/n| and |Xk/n − X(k−1)/n| are at most 4M/n.

Using the independent increments property, the stationary increments property, and (10.3),

P(Bn) ≤ 2n sup_{k≤2n} P(|Xk/n − X(k−1)/n| ≤ 4M/n, |X(k+1)/n − Xk/n| ≤ 4M/n, |X(k+2)/n − X(k+1)/n| ≤ 4M/n)
≤ 2n P(|X1/n| ≤ 4M/n, |X2/n − X1/n| ≤ 4M/n, |X3/n − X2/n| ≤ 4M/n)
= 2n P(|X1/n| ≤ 4M/n) P(|X2/n − X1/n| ≤ 4M/n) P(|X3/n − X2/n| ≤ 4M/n)
= 2n (P(|X1/n| ≤ 4M/n))³
≤ cn (4M/√n)³,

which tends to 0 as n → ∞. Hence for each M and h,

P(AM,h) ≤ lim sup_{n→∞} P(Bn) = 0.
This implies that the probability that there exists s ≤ 1 such that

lim sup_{h→0} |Xs+h − Xs|/|h| ≤ M

is zero. Since M is arbitrary, this proves the theorem.
Chapter 11
Markov chains
11.1 Framework for Markov chains
Suppose S is a set with some topological structure that we will use as our
state space. Think of S as being Rd or the positive integers, for example. A
sequence of random variables X0 , X1 , . . ., is a Markov chain if
P(Xn+1 ∈ A | X0 , . . . , Xn ) = P(Xn+1 ∈ A | Xn )
(11.1)
for all n and all measurable sets A. The definition of Markov chain has this intuition: to predict the probability that Xn+1 is in any set, we only need to know where we currently are; how we got there gives no additional information.
Let's make some additional comments. First of all, we previously considered random variables as mappings from Ω to R. Now we want to extend our definition by allowing a random variable to be a map X from Ω to S, where (X ∈ A) is F measurable for all open sets A. This agrees with the definition of random variable in the case S = R.
Although there is quite a theory developed for Markov chains with arbitrary state spaces, we will confine our attention to the case where either S is
finite, in which case we will usually suppose S = {1, 2, . . . , n}, or countable
and discrete, in which case we will usually suppose S is the set of positive
integers.
We are going to further restrict our attention to Markov chains where
P(Xn+1 ∈ A | Xn = x) = P(X1 ∈ A | X0 = x),
that is, where the probabilities do not depend on n. Such Markov chains are
said to have stationary transition probabilities.
Define the initial distribution of a Markov chain with stationary transition
probabilities by µ(i) = P(X0 = i). Define the transition probabilities by
p(i, j) = P(Xn+1 = j | Xn = i). Since the transition probabilities are
stationary, p(i, j) does not depend on n.
In this case we can use the definition of conditional probability given in
undergraduate classes. If P(Xn = i) = 0 for all n, that means we never visit
i and we could drop the point i from the state space.
Proposition 11.1 Let X be a Markov chain with initial distribution µ and
transition probabilities p(i, j). Then
P(Xn = in , Xn−1 = in−1 , . . . , X1 = i1 , X0 = i0 )
= µ(i0 )p(i0 , i1 ) · · · p(in−1 , in ).
(11.2)
Proof. We use induction on n. It is clearly true for n = 0 by the definition of µ(i). Suppose it holds for n; we need to show it holds for n + 1. For simplicity, we will do the case n = 2. Then

P(X3 = i3, X2 = i2, X1 = i1, X0 = i0)
= E[P(X3 = i3 | X0 = i0, X1 = i1, X2 = i2); X2 = i2, X1 = i1, X0 = i0]
= E[P(X3 = i3 | X2 = i2); X2 = i2, X1 = i1, X0 = i0]
= p(i2, i3) P(X2 = i2, X1 = i1, X0 = i0).

Now by the induction hypothesis,

P(X2 = i2, X1 = i1, X0 = i0) = p(i1, i2) p(i0, i1) µ(i0).

Substituting establishes the claim for n = 3.
The above proposition says that the law of the Markov chain is determined
by the µ(i) and p(i, j). The formula (11.2) also gives a prescription for
constructing a Markov chain given the µ(i) and p(i, j).
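The prescription (11.2) is easy to compute with. Below is a minimal sketch; the two-state chain, its initial distribution, and its transition probabilities are made up purely for illustration:

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical two-state chain on S = {0, 1}; mu and p are illustrative only.
mu = {0: F(1, 3), 1: F(2, 3)}
p = {(0, 0): F(1, 2), (0, 1): F(1, 2),
     (1, 0): F(1, 4), (1, 1): F(3, 4)}

def path_probability(path):
    """Evaluate (11.2): P(X_0 = i_0, ..., X_n = i_n) = mu(i_0) p(i_0,i_1) ... p(i_{n-1},i_n)."""
    prob = mu[path[0]]
    for i, j in zip(path, path[1:]):
        prob *= p[(i, j)]
    return prob

# Sanity check: the probabilities of all length-3 paths sum to 1.
total = sum(path_probability(w) for w in product([0, 1], repeat=3))
print(path_probability((0, 1, 1)), total)  # → 1/8 1
```

Using exact fractions makes the check that the path probabilities sum to 1 exact rather than approximate.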
Proposition 11.2 Suppose µ(i) is a sequence of nonnegative numbers with
Σ_i µ(i) = 1, and for each i the sequence p(i, j) is nonnegative and sums to
1. Then there exists a Markov chain with µ(i) as its initial distribution and
p(i, j) as the transition probabilities.
Proof. Define Ω = S^∞ . Let F be the σ-field generated by the cylinder sets
{ω ∈ Ω : ω0 = i0 , . . . , ωn = in }, n ≥ 0, ij ∈ S. An element ω of Ω is a sequence
(i0 , i1 , . . .). Define Xj (ω) = ij if ω = (i0 , i1 , . . .). Define P(X0 = i0 , . . . , Xn =
in ) by (11.2). Using the Kolmogorov extension theorem, one can show that
P can be extended to a probability on Ω.
The above framework is rather abstract, but it is clear that under P the
sequence Xn has initial distribution µ(i); what we need to show is that Xn
is a Markov chain and that

P(Xn+1 = in+1 | X0 = i0 , . . . , Xn = in ) = P(Xn+1 = in+1 | Xn = in ) = p(in , in+1 ).   (11.3)
By the definition of conditional probability, the left hand side of (11.3)
is

P(Xn+1 = in+1 | X0 = i0 , . . . , Xn = in )
   = P(Xn+1 = in+1 , Xn = in , . . . , X0 = i0 ) / P(Xn = in , . . . , X0 = i0 )   (11.4)
   = [µ(i0 ) · · · p(in−1 , in )p(in , in+1 )] / [µ(i0 ) · · · p(in−1 , in )]
   = p(in , in+1 ),

as desired.
To complete the proof we need to show

P(Xn+1 = in+1 , Xn = in ) / P(Xn = in ) = p(in , in+1 ),

or

P(Xn+1 = in+1 , Xn = in ) = p(in , in+1 )P(Xn = in ).   (11.5)
Now

P(Xn = in ) = Σ_{i0 ,...,in−1} P(Xn = in , Xn−1 = in−1 , . . . , X0 = i0 )
            = Σ_{i0 ,...,in−1} µ(i0 ) · · · p(in−1 , in ),

and similarly

P(Xn+1 = in+1 , Xn = in ) = Σ_{i0 ,...,in−1} P(Xn+1 = in+1 , Xn = in , Xn−1 = in−1 , . . . , X0 = i0 )
                          = p(in , in+1 ) Σ_{i0 ,...,in−1} µ(i0 ) · · · p(in−1 , in ).

Equation (11.5) now follows.
Note in this construction that the Xn sequence is fixed and does not
depend on µ or p. Let p(i, j) be fixed. The probability we constructed above
is often denoted Pµ . If µ is point mass at a point i (or x), it is denoted Pi
(or Px ). So we have one probability space, one sequence Xn , but a whole family
of probabilities Pµ .
11.2 Examples
Random walk on the integers
We let Yi be an i.i.d. sequence of random variables, with p = P(Yi = 1)
and 1 − p = P(Yi = −1). Let Xn = X0 + Σ_{i=1}^n Yi . Then the Xn can be viewed
as a Markov chain with p(i, i + 1) = p, p(i, i − 1) = 1 − p, and p(i, j) = 0
if |j − i| ≠ 1. More general random walks on the integers also fit into this
framework. To check that the random walk is Markov,
P(Xn+1 = in+1 | X0 = i0 , . . . , Xn = in )
= P(Xn+1 − Xn = in+1 − in | X0 = i0 , . . . , Xn = in )
= P(Xn+1 − Xn = in+1 − in ),
using the independence, while
P(Xn+1 = in+1 | Xn = in ) = P(Xn+1 − Xn = in+1 − in | Xn = in )
= P(Xn+1 − Xn = in+1 − in ).
Random walks on graphs
Suppose we have n points, and from each point there is some probability
of going to another point. For example, suppose there are 5 points and we
have p(1, 2) = 1/2, p(1, 3) = 1/2, p(2, 1) = 1/4, p(2, 3) = 1/2, p(2, 5) = 1/4,
p(3, 1) = 1/4, p(3, 2) = 1/4, p(3, 3) = 1/2, p(4, 1) = 1, p(5, 1) = 1/2,
p(5, 5) = 1/2. The p(i, j) are often arranged into a matrix:

         [  0    1/2   1/2    0     0  ]
         [ 1/4    0    1/2    0    1/4 ]
    P =  [ 1/4   1/4   1/2    0     0  ]
         [  1     0     0     0     0  ]
         [ 1/2    0     0     0    1/2 ]
Note the rows must sum to 1 since

Σ_{j=1}^5 p(i, j) = Σ_{j=1}^5 P(X1 = j | X0 = i) = P(X1 ∈ S | X0 = i) = 1.
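The row-sum identity is easy to verify mechanically; a sketch entering the 5-state matrix above with exact fractions:

```python
from fractions import Fraction as F

# Transition matrix of the 5-point example; row i holds p(i+1, 1), ..., p(i+1, 5).
P = [
    [F(0), F(1, 2), F(1, 2), F(0), F(0)],
    [F(1, 4), F(0), F(1, 2), F(0), F(1, 4)],
    [F(1, 4), F(1, 4), F(1, 2), F(0), F(0)],
    [F(1), F(0), F(0), F(0), F(0)],
    [F(1, 2), F(0), F(0), F(0), F(1, 2)],
]

# Each row sums to 1, since sum_j p(i, j) = P(X1 in S | X0 = i) = 1.
row_sums = [sum(row) for row in P]
print(row_sums)
```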
Renewal processes
Let Yi be i.i.d. with P(Yi = k) = ak , where the ak are nonnegative and sum
to 1. Let T0 = i0 and Tn = T0 + Σ_{i=1}^n Yi . We think of Yn as the lifetime of
the nth light bulb and Tn as the time when the nth light bulb burns out. (We
replace a light bulb as soon as it burns out.) Let

Xn = min{m − n : Ti = m for some i}.

So Xn is the amount of time after time n until the current light bulb burns
out.
If Xn = j and j > 0, then Ti = n + j for some i but Ti does not equal
n, n + 1, . . . , n + j − 1 for any i. So Ti = (n + 1) + (j − 1) for some i and Ti
does not equal (n + 1), (n + 1) + 1, . . . , (n + 1) + (j − 2) for any i. Therefore
Xn+1 = j − 1. So p(i, i − 1) = 1 if i ≥ 1.
If Xn = 0, then a light bulb burned out at time n; Xn+1 is 0 if the next
light bulb burns out immediately, and is j − 1 if the next light bulb has lifetime j.
The probability of the latter is aj . So p(0, j) = aj+1 . All the other p(i, j)'s are 0.
Branching processes
Consider k particles. At the next time interval, some of them die, and some
of them split into several particles. The probability that a given particle will
split into j particles is given by aj , j = 0, 1, . . ., where the aj are nonnegative
and sum to 1. The behavior of each particle is independent of the behavior
of all the other particles. If Xn is the number of particles at time n, then Xn
is a Markov chain. Let Yi be i.i.d. random variables with P(Yi = j) = aj .
The p(i, j) for Xn are somewhat complicated, and can be defined by
p(i, j) = P(Σ_{m=1}^i Ym = j).
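Since p(i, ·) is the law of a sum of i independent copies of Y1 , it can be computed by repeated convolution; a sketch (the particular offspring distribution aj below is made up):

```python
# Offspring distribution a_j = P(one particle leaves j descendants);
# this particular choice is only an illustration.
a = [0.25, 0.5, 0.25]  # a_0, a_1, a_2

def convolve(u, v):
    """Distribution of the sum of two independent nonnegative integer variables."""
    w = [0.0] * (len(u) + len(v) - 1)
    for m, um in enumerate(u):
        for k, vk in enumerate(v):
            w[m + k] += um * vk
    return w

def transition_row(i):
    """p(i, .) = distribution of Y_1 + ... + Y_i (i-fold convolution of a)."""
    dist = [1.0]  # i = 0: zero particles stay zero
    for _ in range(i):
        dist = convolve(dist, a)
    return dist

print(transition_row(2))  # p(2, j) for j = 0..4
```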
Queues
We will discuss briefly the M/G/1 queue. The M refers to the fact that
the customers arrive according to a Poisson process. So the probability that
the number of customers arriving in a time interval of length t is k is given by
e^{−λt} (λt)^k /k!. The G refers to the fact that the length of time it takes to serve
a customer is given by a distribution that is not necessarily exponential. The
1 refers to the fact that there is 1 server.
Suppose the length of time to serve one customer has distribution function
F with density f . The probability that k customers arrive during the time
it takes to serve one customer is
ak = ∫_0^∞ e^{−λt} (λt)^k / k! · f (t) dt.
Let the Yi be i.i.d. with P(Yi = k − 1) = ak , so that Yi is one less than the
number of customers arriving during the time it takes to serve one customer
(the customer being served leaves the queue). Let Xn+1 = (Xn + Yn+1 )^+ be
the number of customers waiting. Then Xn is a Markov chain with
p(0, 0) = a0 + a1 and p(i, i − 1 + k) = ak if i ≥ 1, k ≥ 0.
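For a concrete service density f the ak can be evaluated numerically. The sketch below assumes exponential service times with rate 2 and λ = 1 (both arbitrary choices); in that special case ak is known to be geometric, which gives a check on the quadrature:

```python
import math

lam, rate = 1.0, 2.0  # arrival rate and assumed exponential service rate

def a_k(k, dt=1e-3, t_max=40.0):
    """Midpoint-rule approximation of a_k = integral of e^{-lam t}(lam t)^k/k! f(t) dt."""
    total, t = 0.0, dt / 2
    while t < t_max:
        f_t = rate * math.exp(-rate * t)  # exponential service density
        total += math.exp(-lam * t) * (lam * t) ** k / math.factorial(k) * f_t * dt
        t += dt
    return total

# Exact answer for exponential service: a_k = (rate/(lam+rate)) (lam/(lam+rate))^k.
exact = [(rate / (lam + rate)) * (lam / (lam + rate)) ** k for k in range(5)]
approx = [a_k(k) for k in range(5)]
print(approx[0], exact[0])
```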
Ehrenfest urns
Suppose we have two urns with a total of r balls, k in one and r − k in
the other. Pick one of the r balls at random and move it to the other urn.
Let Xn be the number of balls in the first urn. Xn is a Markov chain with
p(k, k + 1) = (r − k)/r, p(k, k − 1) = k/r, and p(i, j) = 0 otherwise.
One model for this is to consider two containers of air with a thin tube
connecting them. Suppose a few molecules of a foreign substance are introduced. Then the number of molecules in the first container is like an
Ehrenfest urn. We shall see that all states in this model are recurrent, so
infinitely often all the molecules of the foreign substance will be in the first
urn. Yet there is a tendency towards equilibrium, so on average there will be
about the same number of molecules in each container for all large times.
Birth and death processes
Suppose there are i particles, and the probability of a birth is ai , the
probability of a death is bi , where ai , bi ≥ 0, ai + bi ≤ 1. Setting Xn equal
to the number of particles, then Xn is a Markov chain with p(i, i + 1) = ai ,
p(i, i − 1) = bi , and p(i, i) = 1 − ai − bi .
11.3 Markov properties
Proposition 11.3 Px (Xn+1 = z | X1 , . . . , Xn ) = PXn (X1 = z), Px -a.s.
Proof. The right hand side is measurable with respect to the σ-field generated by Xn . We therefore need to show that if A is in σ(Xn ), then
Px (Xn+1 = z, A) = E x [PXn (X1 = z); A].
A is of the form (Xn ∈ B) for some B ⊂ S, so it suffices to show

Px (Xn+1 = z, Xn = y) = E x [PXn (X1 = z); Xn = y]   (11.6)

and sum over y ∈ B. Note the right hand side of (11.6) is equal to

E x [Py (X1 = z); Xn = y],
while

Py (X1 = z) = Σ_i Py (X0 = i, X1 = z) = Σ_i 1{y} (i)p(i, z) = p(y, z).
Therefore the right hand side of (11.6) is
p(y, z)Px (Xn = y).
On the other hand, the left hand side of (11.6) equals
Px (Xn+1 = z | Xn = y)Px (Xn = y) = p(y, z)Px (Xn = y).
Theorem 11.4
Px (Xn+1 = i1 , . . . , Xn+m = im | X1 , . . . , Xn ) = PXn (X1 = i1 , . . . , Xm = im ).
Proof. Note this is equivalent to
E x [f1 (Xn+1 ) · · · fm (Xn+m ) | X1 , . . . , Xn ] = E Xn [f1 (X1 ) · · · fm (Xm )].
To go one way, we let fk = 1{ik } ; to go the other, we multiply the conditional
probability result by f1 (i1 ) · · · fm (im ) and sum over all possible values of
i1 , . . . , im .
We’ll use induction. The case m = 1 is the previous proposition. Let us
suppose the result holds for m and prove that it holds for m + 1. Let
C = (Xn+1 = i1 , . . . , Xn+m−1 = im−1 )
and
D = (X1 = i1 , . . . , Xm−1 = im−1 ).
Set
ϕ(z) = Pz (X1 = im+1 ).
Let Fn = σ(X1 , . . . , Xn ). We then have
Px (Xn+1 = i1 , . . . , Xn+m = im , Xn+m+1 = im+1 | Fn )
= E x [Px (C, Xn+m = im , Xn+m+1 = im+1 | Fn+m ) | Fn ]
= E x [1C 1{im } (Xn+m )Px (Xn+m+1 = im+1 | Fn+m ) | Fn ]
= E x [1C (ϕ1{im } )(Xn+m ) | Fn ].
By the induction hypothesis, this is equal to
E Xn [1D (ϕ1{im } )(Xm )].
For any y,
E y [1D (ϕ1{im } )(Xm )]
= E y [1D 1{im } (Xm )PXm (X1 = im+1 )]
= E y [1D 1{im } (Xm )Py (Xm+1 = im+1 | Fm )]
= E y [1D 1{im } (Xm )1{im+1 } (Xm+1 )].
The strong Markov property is the same as the Markov property, but
where the fixed time n is replaced by a stopping time N .
Theorem 11.5 If N is a finite stopping time, then
Px (XN +1 = i1 | FN ) = PXN (X1 = i1 ),
and
Px (XN +1 = i1 , . . . , XN +m = im | FN ) = PXN (X1 = i1 , . . . , Xm = im ),
both Px -a.s.
Proof. We will show
Px (XN +1 = j | FN ) = PXN (X1 = j).
Once we have this, we can proceed as in the proof of the previous theorem
to obtain the second result. To show the above equality, we need to show
that if B ∈ FN , then
Px (XN +1 = j, B) = E x [PXN (X1 = j); B].
Recall that since B ∈ FN , then B ∩ (N = k) ∈ Fk . We have
Px (XN +1 = j, B, N = k) = Px (Xk+1 = j, B, N = k)
= E x [Px (Xk+1 = j | Fk ); B, N = k]
= E x [PXk (X1 = j); B, N = k]
= E x [PXN (X1 = j); B, N = k].
(11.7)
Now sum over k; since N is finite, we obtain our desired result.
We will need the following corollary. We use the notation Ty = min{n > 0 :
Xn = y}, the first time the Markov chain hits the state y.
Corollary 11.6 Let U be a finite stopping time. Let V = min{n > U :
Xn = y}, the first time after U that the chain hits y. Then
Px (V < ∞ | FU ) = PXU (Ty < ∞).
Proof. We can write

(V < ∞) = ∪_{n=1}^∞ (V = U + n) = ∪_{n=1}^∞ (XU+1 ≠ y, . . . , XU+n−1 ≠ y, XU+n = y).

By the theorem,

Px (XU+1 ≠ y, . . . , XU+n−1 ≠ y, XU+n = y | FU ) = PXU (X1 ≠ y, . . . , Xn−1 ≠ y, Xn = y)
   = PXU (Ty = n).

Now sum over n.
Another way of expressing the Markov property is through the Chapman-Kolmogorov
equations. Let pn (i, j) = P(Xn = j | X0 = i).
Proposition 11.7 For all i, j, m, n we have

pn+m (i, j) = Σ_{k∈S} pn (i, k)pm (k, j).
Proof. We write

P(Xn+m = j, X0 = i) = Σ_k P(Xn+m = j, Xn = k, X0 = i)
   = Σ_k P(Xn+m = j | Xn = k, X0 = i)P(Xn = k | X0 = i)P(X0 = i)
   = Σ_k P(Xn+m = j | Xn = k)pn (i, k)P(X0 = i)
   = Σ_k pm (k, j)pn (i, k)P(X0 = i).
If we divide both sides by P(X0 = i), we have our result.

Note the resemblance to matrix multiplication. It is clear that if P is the
matrix made up of the p(i, j), then P^n will be the matrix whose (i, j) entry
is pn (i, j).
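The identity P^{n+m} = P^n P^m can be checked directly for the 5-state matrix of the examples section (a sketch; plain nested-loop matrix multiplication, no external libraries):

```python
from fractions import Fraction as F

# 5-state matrix from the examples section, with exact arithmetic.
P = [[F(0), F(1, 2), F(1, 2), F(0), F(0)],
     [F(1, 4), F(0), F(1, 2), F(0), F(1, 4)],
     [F(1, 4), F(1, 4), F(1, 2), F(0), F(0)],
     [F(1), F(0), F(0), F(0), F(0)],
     [F(1, 2), F(0), F(0), F(0), F(1, 2)]]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def power(A, n):
    """A^n by repeated multiplication; entry (i, j) of P^n is p_n(i+1, j+1)."""
    R = A
    for _ in range(n - 1):
        R = matmul(R, A)
    return R

# Chapman-Kolmogorov: P^{n+m} = P^n P^m.
print(power(P, 5) == matmul(power(P, 2), power(P, 3)))  # → True
```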
11.4 Recurrence and transience
Let
Ty = min{i > 0 : Xi = y}.
This is the first time that Xi hits the point y. Even if X0 = y we would have
Ty > 0. We let Tyk be the k-th time that the Markov chain hits y and we set
r(x, y) = Px (Ty < ∞),
the probability starting at x that the Markov chain ever hits y.
Proposition 11.8 Px (Tyk < ∞) = r(x, y)r(y, y)k−1 .
Proof. The case k = 1 is just the definition, so suppose k > 1. Let U = Tyk−1
and let V be the first time the chain hits y after U . Using the corollary to
the strong Markov property,
Px (Tyk < ∞) = Px (V < ∞, Tyk−1 < ∞)
   = E x [Px (V < ∞ | F(Tyk−1)); Tyk−1 < ∞]
   = E x [PX(Tyk−1) (Ty < ∞); Tyk−1 < ∞]
   = E x [Py (Ty < ∞); Tyk−1 < ∞]
   = r(y, y)Px (Tyk−1 < ∞).
We used here the fact that at time Tyk−1 the Markov chain must be at the
point y. Repeating this argument k − 2 times yields the result.
We say that y is recurrent if r(y, y) = 1; otherwise we say y is transient.
Let

N (y) = Σ_{n=1}^∞ 1(Xn =y) .
Proposition 11.9 y is recurrent if and only if E y N (y) = ∞.
Proof. Note

E y N (y) = Σ_{k=1}^∞ Py (N (y) ≥ k) = Σ_{k=1}^∞ Py (Tyk < ∞) = Σ_{k=1}^∞ r(y, y)^k .

We used the fact that N (y) is the number of visits to y and that the number of
visits being at least k is the same as the time of the k-th visit being finite.
Since r(y, y) ≤ 1, the left hand side will be finite if and only if r(y, y) < 1.
Observe that

E y N (y) = Σ_n Py (Xn = y) = Σ_n pn (y, y).
If we consider simple symmetric random walk on the integers, then pn (0, 0)
is 0 if n is odd and equal to (n choose n/2) 2^{−n} if n is even. This is because in order
to be at 0 after n steps, the walk must have had n/2 positive steps and n/2
negative steps; the probability of this is given by the binomial distribution.
Using Stirling's approximation, we see that pn (0, 0) ∼ c/√n for n even, so the
sum diverges, and simple random walk in one dimension is recurrent.
Similar arguments show that simple symmetric random walk is also recurrent in 2 dimensions but transient in 3 or more dimensions.
Proposition 11.10 If x is recurrent and r(x, y) > 0, then y is recurrent
and r(y, x) = 1.
Proof. First we show r(y, x) = 1. Suppose not. Since r(x, y) > 0, there
is a smallest n and y1 , . . . , yn−1 such that p(x, y1 )p(y1 , y2 ) · · · p(yn−1 , y) > 0.
Since this is the smallest n, none of the yi can equal x. Then
Px (Tx = ∞) ≥ p(x, y1 ) · · · p(yn−1 , y)(1 − r(y, x)) > 0,
a contradiction to x being recurrent.

Next we show that y is recurrent. Since r(y, x) > 0 and r(x, y) > 0, there
exist L and K such that pL (y, x) > 0 and pK (x, y) > 0. Then

pL+n+K (y, y) ≥ pL (y, x)pn (x, x)pK (x, y).

Summing over n,

Σ_n pL+n+K (y, y) ≥ pL (y, x)pK (x, y) Σ_n pn (x, x) = ∞.
We say a subset C of S is closed if x ∈ C and r(x, y) > 0 implies y ∈ C.
A subset D is irreducible if x, y ∈ D implies r(x, y) > 0.
Proposition 11.11 Let C be finite and closed. Then C contains a recurrent
state.

Combined with the preceding proposition, if C is in addition irreducible, then
all states in C will be recurrent.
Proof. If not, then for all y ∈ C we have r(y, y) < 1 and

E x N (y) = Σ_{k=1}^∞ Px (N (y) ≥ k) = Σ_{k=1}^∞ Px (Tyk < ∞)
         = Σ_{k=1}^∞ r(x, y)r(y, y)^{k−1} = r(x, y) / (1 − r(y, y)) < ∞.

Since C is finite, Σ_y E x N (y) < ∞. But that is a contradiction, since for
x ∈ C,

Σ_y E x N (y) = Σ_y Σ_n pn (x, y) = Σ_n Σ_y pn (x, y) = Σ_n Px (Xn ∈ C) = Σ_n 1 = ∞.
Theorem 11.12 Let R = {x : r(x, x) = 1}, the set of recurrent states.
Then R = ∪_{i=1}^∞ Ri , where each Ri is closed and irreducible.
Proof. Say x ∼ y if r(x, y) > 0. Since every state is recurrent, x ∼ x and if
x ∼ y, then y ∼ x. If x ∼ y and y ∼ z, then pn (x, y) > 0 and pm (y, z) > 0
for some n and m. Then pn+m (x, z) > 0 or x ∼ z. Therefore we have an
equivalence relation and we let the Ri be the equivalence classes.
Looking at our examples, it is easy to see that in the Ehrenfest urn model
all states are recurrent. For the branching process model, suppose p(x, 0) > 0
for all x. Then 0 is recurrent and all the other states are transient. In the
renewal chain, there are two cases. If {k : ak > 0} is unbounded, all states
are recurrent. If K = max{k : ak > 0}, then {0, 1, . . . , K − 1} are recurrent
states and the rest are transient.
For the queueing model, let µ = Σ_k k ak , the expected number of people
arriving during one customer's service time. We may view this as a branching
process by letting all the customers arriving during one person's service time
be considered the progeny of that customer. It turns out that if µ ≤ 1, then 0 is
recurrent and all other states are also. If µ > 1, all states are transient.
11.5 Stationary measures
A probability µ is a stationary distribution if

Σ_x µ(x)p(x, y) = µ(y).   (11.8)
In matrix notation this is µP = µ, i.e., µ is a left eigenvector corresponding to
the eigenvalue 1. In the case of a stationary distribution, Pµ (X1 = y) = µ(y),
which implies that X1 , X2 , . . . all have the same distribution. We can use
(11.8) when µ is a measure rather than a probability, in which case it is
called a stationary measure. Note

µP^n = (µP )P^{n−1} = µP^{n−1} = · · · = µ.
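One way to find a stationary distribution numerically is to iterate µ → µP until the distribution stops changing; a sketch using the 5-state matrix from the examples section (note that state 4 is never entered, since its column is zero, so its stationary weight is 0; convergence of the iteration is justified by the convergence theorem later in this chapter):

```python
# The 5-state chain from the examples section, entered as floats.
P = [[0.0, 0.5, 0.5, 0.0, 0.0],
     [0.25, 0.0, 0.5, 0.0, 0.25],
     [0.25, 0.25, 0.5, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0, 0.0, 0.5]]

def step(mu):
    """One application of mu -> mu P."""
    return [sum(mu[i] * P[i][j] for i in range(5)) for j in range(5)]

mu = [1.0, 0.0, 0.0, 0.0, 0.0]  # start in state 1
for _ in range(500):
    mu = step(mu)

print(mu)  # numerically stationary: mu P = mu
```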
If we have a random walk on the integers, µ(x) = 1 for all x serves as a
stationary measure. In the case of an asymmetric random walk, p(i, i + 1) = p,
p(i, i − 1) = q = 1 − p with p ≠ q, setting µ(x) = (p/q)^x also works. To
check this, note

µP (x) = Σ_i µ(i)p(i, x) = µ(x − 1)p(x − 1, x) + µ(x + 1)p(x + 1, x)
       = (p/q)^{x−1} p + (p/q)^{x+1} q
       = (p/q)^x q + (p/q)^x p = (p/q)^x = µ(x).
In the Ehrenfest urn model, µ(x) = 2^{−r} (r choose x) works. One way to see this
is that µ is the distribution one gets if one flips r coins and puts a coin in
the first urn when the coin is heads. A transition corresponds to picking a
coin at random and turning it over.
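The binomial stationary distribution for the Ehrenfest chain can be verified directly; a sketch with r = 6 (an arbitrary small choice) and exact arithmetic:

```python
from fractions import Fraction as F
from math import comb

r = 6  # total number of balls

def p(k, j):
    """Ehrenfest transition probabilities."""
    if j == k + 1:
        return F(r - k, r)
    if j == k - 1:
        return F(k, r)
    return F(0)

# Binomial candidate mu(x) = 2^{-r} C(r, x); check sum_x mu(x) p(x, y) = mu(y).
mu = [F(comb(r, x), 2 ** r) for x in range(r + 1)]
mu_P = [sum(mu[x] * p(x, y) for x in range(r + 1)) for y in range(r + 1)]
print(mu_P == mu)  # → True
```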
Proposition 11.13 Let a be recurrent and let T = Ta . Set

µ(y) = E a Σ_{n=0}^{T−1} 1(Xn =y) .

Then µ is a stationary measure.
The idea of the proof is that µ(y) is the expected number of visits to y
by the sequence X0 , . . . , XT −1 while µP is the expected number of visits to
y by X1 , . . . , XT . These should be the same because XT = X0 = a.
Proof. Let p̄n (a, y) = Pa (Xn = y, T > n); the bar distinguishes this from
the n-step transition probability pn (a, y). So

µ(y) = Σ_{n=0}^∞ Pa (Xn = y, T > n) = Σ_{n=0}^∞ p̄n (a, y)

and

Σ_y µ(y)p(y, z) = Σ_y Σ_{n=0}^∞ p̄n (a, y)p(y, z).

First we consider the case z ≠ a. Then

Σ_y p̄n (a, y)p(y, z) = Σ_y Pa (hit y in n steps without first hitting a,
      and then go to z in one step) = p̄n+1 (a, z).

So

Σ_y µ(y)p(y, z) = Σ_n Σ_y p̄n (a, y)p(y, z) = Σ_{n=0}^∞ p̄n+1 (a, z)
                = Σ_{n=0}^∞ p̄n (a, z) = µ(z),

since p̄0 (a, z) = 0.

Second we consider the case z = a. Then

Σ_y p̄n (a, y)p(y, z) = Σ_y Pa (hit y in n steps without first hitting a,
      and then go to z in one step) = Pa (T = n + 1).

Recall Pa (T = 0) = 0, and since a is recurrent, T < ∞. So

Σ_y µ(y)p(y, z) = Σ_n Σ_y p̄n (a, y)p(y, z) = Σ_{n=0}^∞ Pa (T = n + 1)
                = Σ_{n=0}^∞ Pa (T = n) = 1.

On the other hand,

Σ_{n=0}^{T−1} 1(Xn =a) = 1(X0 =a) = 1,

hence µ(a) = 1. Therefore, whether z ≠ a or z = a, we have µP (z) = µ(z).
Finally, we show µ(y) < ∞. If r(a, y) = 0, then µ(y) = 0. If r(a, y) > 0,
then since a is recurrent, r(y, a) > 0 by Proposition 11.10, and we may
choose n with pn (y, a) > 0. Then

1 = µ(a) = Σ_x µ(x)pn (x, a) ≥ µ(y)pn (y, a),

which implies µ(y) < ∞.
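The construction can be checked numerically: compute the quantities Pa (Xn = y, T > n) recursively, sum over n, and verify that the result is a stationary measure with µ(a) = 1 (a sketch on a small made-up 3-state chain; the sum is truncated, which is harmless because Pa (T > n) decays geometrically):

```python
# A hypothetical irreducible 3-state chain.
P = [[0.1, 0.6, 0.3],
     [0.4, 0.2, 0.4],
     [0.5, 0.3, 0.2]]
a = 0  # the distinguished recurrent state

# q[y] = P^a(X_n = y, T > n), starting at n = 0.
q = [1.0 if y == a else 0.0 for y in range(3)]
mu = q[:]
for _ in range(200):  # truncation error is negligible by step 200
    q = [0.0 if z == a else sum(q[y] * P[y][z] for y in range(3))
         for z in range(3)]
    mu = [m + extra for m, extra in zip(mu, q)]

# mu should satisfy mu P = mu and mu(a) = 1.
mu_P = [sum(mu[y] * P[y][z] for y in range(3)) for z in range(3)]
print(mu, mu_P)
```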
We next turn to uniqueness of the stationary distribution. We give the
stationary measure constructed in Proposition 11.13 the name µa . We showed
µa (a) = 1.
Proposition 11.14 If the Markov chain is irreducible and all states are recurrent, then the stationary measure is unique up to a constant multiple.
Proof. Fix a ∈ S. Let µa be the stationary measure constructed above and
let ν be any other stationary measure.
Since ν = νP , then

ν(z) = ν(a)p(a, z) + Σ_{y≠a} ν(y)p(y, z)
     = ν(a)p(a, z) + Σ_{y≠a} ν(a)p(a, y)p(y, z) + Σ_{x≠a} Σ_{y≠a} ν(x)p(x, y)p(y, z)
     = ν(a)Pa (X1 = z) + ν(a)Pa (X1 ≠ a, X2 = z) + Pν (X0 ≠ a, X1 ≠ a, X2 = z).
Continuing,

ν(z) = ν(a) Σ_{m=1}^n Pa (X1 ≠ a, X2 ≠ a, . . . , Xm−1 ≠ a, Xm = z)
          + Pν (X0 ≠ a, X1 ≠ a, . . . , Xn−1 ≠ a, Xn = z)
     ≥ ν(a) Σ_{m=1}^n Pa (X1 ≠ a, X2 ≠ a, . . . , Xm−1 ≠ a, Xm = z).
Letting n → ∞, we obtain

ν(z) ≥ ν(a)µa (z).

We have

ν(a) = Σ_x ν(x)pn (x, a) ≥ ν(a) Σ_x µa (x)pn (x, a) = ν(a)µa (a) = ν(a),

since µa (a) = 1 (see the proof of Proposition 11.13). This means that we have
equality, and so for each n and x either pn (x, a) = 0 or ν(x) = ν(a)µa (x).
Since r(x, a) > 0, then pn (x, a) > 0 for some n. Consequently

ν(x)/ν(a) = µa (x).
Proposition 11.15 If a stationary distribution exists, then µ(y) > 0 implies
y is recurrent.
Proof. If µ(y) > 0, then

∞ = Σ_{n=1}^∞ µ(y) = Σ_{n=1}^∞ Σ_x µ(x)pn (x, y) = Σ_x µ(x) Σ_{n=1}^∞ pn (x, y)
  = Σ_x µ(x) Σ_{n=1}^∞ Px (Xn = y) = Σ_x µ(x)E x N (y)
  = Σ_x µ(x)r(x, y)[1 + r(y, y) + r(y, y)^2 + · · · ].

Since r(x, y) ≤ 1 and µ is a probability measure, this is at most

Σ_x µ(x)(1 + r(y, y) + · · · ) ≤ 1 + r(y, y) + · · · .

Hence r(y, y) must equal 1.
Recall that Tx is the first time to hit x.
Proposition 11.16 If the Markov chain is irreducible and has stationary
distribution µ, then

µ(x) = 1 / E x Tx .
Proof. µ(x) > 0 for some x. If y ∈ S, then r(x, y) > 0 and so pn (x, y) > 0
for some n. Hence

µ(y) = Σ_x µ(x)pn (x, y) > 0.
Hence by Proposition 11.15, all states are recurrent. By the uniqueness of the
stationary measure, µx is a constant multiple of µ, i.e., µx = cµ. Recall

µx (y) = Σ_{n=0}^∞ Px (Xn = y, Tx > n),

and so

Σ_y µx (y) = Σ_y Σ_{n=0}^∞ Px (Xn = y, Tx > n) = Σ_n Σ_y Px (Xn = y, Tx > n)
           = Σ_n Px (Tx > n) = E x Tx .

Thus c = E x Tx . Recalling that µx (x) = 1,

µ(x) = µx (x)/c = 1 / E x Tx .
We make the following distinction for recurrent states. If E x Tx < ∞, then
x is said to be positive recurrent. If x is recurrent but E x Tx = ∞, x is null
recurrent.
An example of null recurrent states is the simple random walk on the
integers. If we let g(y) = E y Tx , the Markov property tells us that

g(y) = 1 + (1/2)g(y − 1) + (1/2)g(y + 1).

Some algebra translates this to

g(y) − g(y − 1) = 2 + g(y + 1) − g(y).
If d(y) = g(y) − g(y − 1), we have d(y + 1) = d(y) − 2. If d(y0 ) is finite for
any y0 , then d(y) will be less than −1 for all y larger than some y1 , which
implies that g(y) will be negative for sufficiently large y, a contradiction. We
conclude g is identically infinite.
Proposition 11.17 Suppose a chain is irreducible.
(a) If there exists a positive recurrent state, then there is a stationary distribution.
(b) If there is a stationary distribution, all states are positive recurrent.
(c) If there exists a transient state, all states are transient.
(d) If there exists a null recurrent state, all states are null recurrent.
Proof. To show (a), suppose x is positive recurrent. We have seen that

µx (y) = E x Σ_{n=0}^{Tx −1} 1(Xn =y)

is a stationary measure. Then

µx (S) = Σ_y µx (y) = E x Σ_{n=0}^{Tx −1} 1 = E x Tx < ∞.

Therefore µ(y) = µx (y)/E x Tx will be a stationary distribution. From the
definition of µx we have µx (x) = 1, hence µ(x) > 0.
For (b), suppose µ(x) > 0 for some x. If y is another state, choose n so
that pn (x, y) > 0, and then from

µ(y) = Σ_x µ(x)pn (x, y)

we conclude that µ(y) > 0. Then 0 < µ(y) = 1/E y Ty , which implies E y Ty <
∞.
We showed that if x is recurrent and r(x, y) > 0, then y is recurrent. So
(c) follows.
Suppose there exists a null recurrent state. If there exists a positive recurrent or transient state as well, then by (a) and (b) or by (c) all states are
positive recurrent or transient, a contradiction, and (d) follows.
11.6 Convergence
Our goal is to show that under certain conditions pn (x, y) → π(y), where π
is the stationary distribution. (In the null recurrent case pn (x, y) → 0.)
Consider a random walk on the set {0, 1}, where with probability one on
each step the chain moves to the other state. Then pn (x, y) = 0 if x ≠ y and
n is even. A less trivial case is the simple random walk on the integers. We
need to eliminate this periodicity.
Suppose x is recurrent, let Ix = {n ≥ 1 : pn (x, x) > 0}, and let dx be the
g.c.d. (greatest common divisor) of Ix . dx is called the period of x.
Proposition 11.18 If r(x, y) > 0, then dy = dx .
Proof. Since x is recurrent, r(y, x) > 0. Choose K and L such that
pK (x, y) > 0 and pL (y, x) > 0. Then

pK+L+n (y, y) ≥ pL (y, x)pn (x, x)pK (x, y),

so taking n = 0, we have pK+L (y, y) > 0, and thus dy divides K + L. If
pn (x, x) > 0, then pK+L+n (y, y) > 0, so dy divides K + L + n, and hence dy
divides n; that is, dy divides every element of Ix . Hence dy divides dx . By
symmetry dx divides dy .
Proposition 11.19 If dx = 1, there exists m0 such that pm (x, x) > 0 whenever m ≥ m0 .
Proof. First of all, Ix is closed under addition: if m, n ∈ Ix ,
pm+n (x, x) ≥ pm (x, x)pn (x, x) > 0.
Secondly, if there exists N such that N, N + 1 ∈ Ix , let m0 = N^2 . If
m ≥ m0 , then m − N^2 = kN + r for some k ≥ 0 and 0 ≤ r < N, and

m = r + N^2 + kN = r(N + 1) + (N − r + k)N ∈ Ix .
Third, pick n0 ∈ Ix and k > 0 such that n0 + k ∈ Ix . If k = 1, we are
done. Since dx = 1, there exists n1 ∈ Ix such that k does not divide n1 .
We have n1 = mk + r for some 0 < r < k. Note (m + 1)(n0 + k) ∈ Ix
and (m + 1)n0 + n1 ∈ Ix . The difference between these two numbers is
(m + 1)k − n1 = k − r < k. So now we have two numbers in Ix differing by
at most k − 1. Repeating at most k times, we get two numbers
in Ix differing by at most 1, and we are done.
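The notions of Ix and period are easy to explore computationally; a sketch on a made-up 3-state chain whose return times to state 0 are 3, 4, 5, . . . , so the period is 1 even though there is no single short loop at 0:

```python
from math import gcd
from functools import reduce

# Adjacency of a hypothetical chain on {0, 1, 2}: 0→1, 1→2, 2→0 and 2→2.
# The self-loop at 2 makes the chain aperiodic.
adj = [[False, True, False],
       [False, False, True],
       [True, False, True]]

def reach(n):
    """Boolean matrix: can we go i → j in exactly n steps?"""
    R = adj
    for _ in range(n - 1):
        R = [[any(R[i][k] and adj[k][j] for k in range(3)) for j in range(3)]
             for i in range(3)]
    return R

# I_x for x = 0: the n with p_n(0, 0) > 0; its gcd is the period d_x.
I_x = [n for n in range(1, 30) if reach(n)[0][0]]
d_x = reduce(gcd, I_x)
print(I_x[:6], d_x)  # returns start at length 3; gcd is 1
```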
We write d for dx . A chain is aperiodic if d = 1.
If d > 1, we say x ∼ y if pkd (x, y) > 0 for some k > 0. We divide S into
equivalence classes S1 , . . . , Sd . Every d steps the chain started in Si is back in
Si . So we look at p′ = pd on Si .
Theorem 11.20 Suppose the chain is irreducible, aperiodic, and has a stationary distribution π. Then pn (x, y) → π(y) as n → ∞.
Proof. The idea is to take two copies of the chain with different starting
distributions, let them run independently until they couple, i.e., hit each
other, and then have them move together. So define

q((x1 , y1 ), (x2 , y2 )) = p(x1 , x2 )p(y1 , y2 )   if x1 ≠ y1 ,
q((x1 , y1 ), (x2 , y2 )) = p(x1 , x2 )              if x1 = y1 , x2 = y2 ,
q((x1 , y1 ), (x2 , y2 )) = 0                        otherwise.
Let Zn = (Xn , Yn ) and T = min{i : Xi = Yi }. We have
P(Xn = y) = P(Xn = y, T ≤ n) + P(Xn = y, T > n)
= P(Yn = y, T ≤ n) + P(Xn = y, T > n),
while
P(Yn = y) = P(Yn = y, T ≤ n) + P(Yn = y, T > n).
Subtracting,
P(Xn = y) − P(Yn = y) ≤ P(Xn = y, T > n) − P(Yn = y, T > n)
≤ P(Xn = y, T > n) ≤ P(T > n).
Using symmetry,
|P(Xn = y) − P(Yn = y)| ≤ P(T > n).
Suppose we let Y0 have distribution π and X0 = x. Then
|pn (x, y) − π(y)| ≤ P(T > n).
It remains to show P(T > n) → 0. To do this, consider another chain
Wn = (Xn , Yn ), where now we take Xn , Yn independent. Define
r((x1 , y1 ), (x2 , y2 )) = p(x1 , x2 )p(y1 , y2 ).
The chain under the transition probabilities r is irreducible. To see this,
there exist K and L such that pK (x1 , x2 ) > 0 and pL (y1 , y2 ) > 0. If M is
large, pL+M (x2 , x2 ) > 0 and pK+M (y2 , y2 ) > 0. So pK+L+M (x1 , x2 ) > 0 and
pK+L+M (y1 , y2 ) > 0, and hence we have rK+L+M ((x1 , y1 ), (x2 , y2 )) > 0.
It is easy to check that π′(a, b) = π(a)π(b) is a stationary distribution for
W . Hence Wn is recurrent, so it will hit (x, x); hence the time to hit
the diagonal {(y, y) : y ∈ S} is finite. However, the distribution of the time
to hit the diagonal is the same as that of T .
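The coupling in this proof can be simulated: run two independent copies started at different states and record when they first meet (a sketch with a fixed seed on a small made-up chain; the tail P(T > n) is what controls |pn (x, y) − π(y)|):

```python
import random

random.seed(1)
P = [[0.1, 0.6, 0.3],
     [0.4, 0.2, 0.4],
     [0.5, 0.3, 0.2]]

def step(i):
    """One transition from state i."""
    return random.choices(range(3), weights=P[i])[0]

def coupling_time(x0, y0):
    """Run X and Y independently until they first agree."""
    x, y, t = x0, y0, 0
    while x != y:
        x, y, t = step(x), step(y), t + 1
    return t

times = [coupling_time(0, 2) for _ in range(10_000)]
mean_T = sum(times) / len(times)
print(mean_T)  # finite; P(T > n) -> 0 drives p_n(x, y) -> pi(y)
```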