Chapter 4
Dependent Random Variables

4.1 Conditioning
One of the key concepts in probability theory is the notion of conditional
probability and conditional expectation. Suppose that we have a probability
space (Ω, F , P ) consisting of a space Ω, a σ-field F of subsets of Ω and a
probability measure on the σ-field F . If we have a set A ∈ F of positive
measure then conditioning with respect to A means we restrict ourselves to
the set A. Ω gets replaced by A. The σ-field F by the σ-field FA of subsets
of A that are in F . For B ⊂ A we define
P (B)
PA (B) =
P (A)
We could achieve the same thing by defining for arbitrary B ∈ F

PA(B) = P(A ∩ B)/P(A)

in which case PA(·) is a measure defined on F as well, but one that is concentrated on A and assigns 0 probability to Ac. The definition of conditional
probability is

P(B|A) = P(A ∩ B)/P(A).
Similarly the conditional expectation of an integrable function
f(ω) given a set A ∈ F of positive probability is defined to be

E{f|A} = (∫_A f(ω) dP) / P(A).
In particular if we take f = χB for some B ∈ F we recover the definition
of conditional probability. In general if we know P (B|A) and P (A) we can
recover P (A ∩ B) = P (A)P (B|A) but we cannot recover P (B). But if we
know P (B|A) as well as P (B|Ac ) along with P (A) and P (Ac ) = 1 − P (A)
then
P (B) = P (A ∩ B) + P (Ac ∩ B) = P (A)P (B|A) + P (Ac )P (B|Ac ).
More generally, if P is a partition of Ω into a finite or even a countable number
of disjoint measurable sets A1, · · · , Aj, · · · , then

P(B) = Σj P(Aj) P(B|Aj).
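The law of total probability over a partition lends itself to a one-line computation. A minimal sketch with invented numbers (three atoms Aj and hypothetical conditional probabilities):

```python
# Law of total probability over a partition {A_j}: P(B) = sum_j P(A_j) P(B|A_j).
# The numbers below are hypothetical, chosen only for illustration.
P_A = [0.5, 0.3, 0.2]          # P(A_j) for the three atoms of the partition
P_B_given_A = [0.1, 0.6, 0.9]  # P(B | A_j)

P_B = sum(pa * pb for pa, pb in zip(P_A, P_B_given_A))
# 0.5*0.1 + 0.3*0.6 + 0.2*0.9 = 0.41 (up to floating point)
assert abs(P_B - 0.41) < 1e-9
```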
If ξ is a random variable taking distinct values {aj } on {Aj } then
P (B|ξ = aj ) = P (B|Aj )
or more generally

P(B|ξ = a) = P(B ∩ {ξ = a}) / P(ξ = a)
provided P (ξ = a) > 0. One of our goals is to seek a definition that makes
sense when P (ξ = a) = 0. This involves dividing 0 by 0 and should involve
differentiation of some kind. In the countable case we may think of P(B|ξ = aj)
as a function fB(ξ) which is equal to P(B|Aj) on ξ = aj. We can rewrite
our definition of

fB(aj) = P(B|ξ = aj)

as

∫_{ξ=aj} fB(ξ) dP = P(B ∩ {ξ = aj})   for each j

or, summing over any arbitrary collection of j's,

∫_{ξ∈E} fB(ξ) dP = P(B ∩ {ξ ∈ E}).
Sets of the form {ξ ∈ E} form a sub σ-field Σ ⊂ F and we can rewrite the
definition as

∫_A fB(ξ) dP = P(B ∩ A)
for all A ∈ Σ. Of course in this case A ∈ Σ if and only if A is a union
of the atoms {ξ = a} of the partition over a finite or countable subcollection
of the possible values of a. Similar considerations apply to the conditional
expectation of a random variable G given ξ. The equation becomes

∫_A g(ξ) dP = ∫_A G(ω) dP

or we can rewrite this as

∫_A g(ω) dP = ∫_A G(ω) dP
for all A ∈ Σ and instead of demanding that g be a function of ξ we demand
that g be Σ measurable which is the same thing. Now the random variable
ξ is out of the picture and rightly so. What is important is the information
we have if we know ξ and that is the same if we replace ξ by a one-to-one
function of itself. The σ-field Σ abstracts that information nicely. So it turns
out that the proper notion of conditioning involves a sub σ-field Σ ⊂ F . If G
is an integrable function and Σ ⊂ F is given we will seek another integrable
function g that is Σ measurable and satisfies

∫_A g(ω) dP = ∫_A G(ω) dP
for all A ∈ Σ. We will prove existence and uniqueness of such a g and call it
the conditional expectation of G given Σ and denote it by g = E[G|Σ].
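When Σ is generated by a finite partition, the conditional expectation can be computed directly: on each atom, g is the P-weighted average of G over that atom. A small sketch (the sample space, weights, and partition are all invented) verifying the defining property ∫_A g dP = ∫_A G dP:

```python
# Conditional expectation given the sigma-field generated by a finite partition:
# on each atom A_j, g = E[G|Sigma] equals the P-weighted average of G over A_j.
# Toy example: Omega = {0,...,5} with hypothetical probabilities and partition.
P = [0.1, 0.2, 0.1, 0.2, 0.3, 0.1]
G = [1.0, 3.0, 2.0, 4.0, 0.0, 6.0]
atoms = [[0, 1], [2, 3], [4, 5]]   # the partition generating Sigma

g = [0.0] * len(P)
for atom in atoms:
    mass = sum(P[w] for w in atom)
    avg = sum(P[w] * G[w] for w in atom) / mass
    for w in atom:
        g[w] = avg                  # g is constant on atoms, hence Sigma measurable

# Defining property: integrals of g and G agree on every atom, hence on every
# A in Sigma (a union of atoms).
for atom in atoms:
    lhs = sum(P[w] * g[w] for w in atom)
    rhs = sum(P[w] * G[w] for w in atom)
    assert abs(lhs - rhs) < 1e-12
```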
The way to prove the above result will take us on a detour. A signed
measure on a measurable space (Ω, F) is a set function λ(·) defined for A ∈ F
which is countably additive but not necessarily nonnegative. Countable
additivity is again in either of the following two equivalent senses:

λ(∪n An) = Σn λ(An)

for any countable collection of disjoint sets in F, or

lim_{n→∞} λ(An) = λ(A)

whenever An ↓ A or An ↑ A.
Examples of such λ can be constructed by taking the difference µ1 − µ2
of two nonnegative measures µ1 and µ2 .
Definition 4.1. A set A ∈ F is totally positive (totally negative) for λ if
for every subset B ∈ F with B ⊂ A, λ(B) ≥ 0 (λ(B) ≤ 0).
Remark 4.1. A measurable subset of a totally positive set is totally positive.
Any countable union of totally positive subsets is again totally positive.
Lemma 4.1. If λ is a countably additive signed measure on (Ω, F), then

sup_{A∈F} |λ(A)| < ∞.
Proof. The key idea in the proof is that, since λ(Ω) is a finite number, if
λ(A) is large so is λ(Ac ) with an opposite sign. In fact, it is not hard to
see that ||λ(A)| − |λ(Ac )|| ≤ |λ(Ω)| for all A ∈ F . Another fact is that if
supB⊂A |λ(B)| and supB⊂Ac |λ(B)| are finite, so is supB |λ(B)|. Now let us
complete the proof. Given a subset A ∈ F with supB⊂A |λ(B)| = ∞, and
any positive number N, there is a subset A1 ∈ F with A1 ⊂ A such that
|λ(A1 )| ≥ N and supB⊂A1 |λ(B)| = ∞. This is obvious because if we pick
a set E ⊂ A with |λ(E)| very large, so will λ(Ec) be. At least one of the
two sets E, Ec will have the second property and we can call it A1. Proceeding
by induction we get a decreasing sequence An with |λ(An)| → ∞, which
contradicts countable additivity.
Lemma 4.2. Given a subset A ∈ F with λ(A) = ℓ > 0, there is a subset
Ā ⊂ A that is totally positive with λ(Ā) ≥ ℓ.
Proof. Let us define m = inf_{B⊂A} λ(B). Since the empty set is included,
m ≤ 0. If m = 0 then A is totally positive and we are done. So let us assume
that m < 0. By the previous lemma m > −∞.
Let us find B1 ⊂ A such that λ(B1) ≤ m/2. Then for A1 = A − B1 we have
A1 ⊂ A, λ(A1) ≥ ℓ and inf_{B⊂A1} λ(B) ≥ m/2. By induction we can find Ak
with A ⊃ A1 ⊃ · · · ⊃ Ak ⊃ · · · , λ(Ak) ≥ ℓ for every k and inf_{B⊂Ak} λ(B) ≥ m/2^k.
Clearly if we define Ā = ∩k Ak, which is the decreasing limit, Ā works.
Theorem 4.3. (Hahn-Jordan Decomposition). A countably additive signed measure λ on (Ω, F) can always be written as λ = µ+ − µ−,
the difference of two nonnegative measures. Moreover µ+ and µ− may be
chosen to be orthogonal, i.e., there are disjoint sets Ω+, Ω− ∈ F such that
µ+(Ω−) = µ−(Ω+) = 0. In fact Ω+ and Ω− can be taken to be subsets of
Ω that are respectively totally positive and totally negative for λ; µ± then
become just the restrictions of λ to Ω±.
Proof. Totally positive sets are closed under countable unions, disjoint or
not. Let us define m+ = sup_A λ(A). If m+ = 0 then λ(A) ≤ 0 for all A and
we can take Ω+ = Φ and Ω− = Ω, which works. Assume that m+ > 0. There
exist sets An with λ(An) ≥ m+ − 1/n and therefore totally positive subsets Ān
of An with λ(Ān) ≥ m+ − 1/n. Clearly Ω+ = ∪n Ān is totally positive and
λ(Ω+) = m+. It is easy to see that Ω− = Ω − Ω+ is totally negative. µ± can
be taken to be the restrictions of λ to Ω±.
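On a finite space the decomposition is transparent: Ω+ is simply the set of points carrying positive mass, and µ± are the restrictions of λ. A sketch with invented point masses:

```python
# Hahn-Jordan decomposition on a finite Omega: Omega+ = {points with positive
# mass} is totally positive, and mu+/mu- are the restrictions of lambda.
# The signed measure below is hypothetical, given by point masses.
lam = {"a": 0.5, "b": -0.2, "c": 0.1, "d": -0.4}

omega_plus = {w for w, v in lam.items() if v > 0}
mu_plus = {w: v for w, v in lam.items() if v > 0}
mu_minus = {w: -v for w, v in lam.items() if v <= 0}

def measure(mu, A):
    return sum(mu.get(w, 0.0) for w in A)

A = {"a", "b", "d"}
# lambda(A) = mu+(A) - mu-(A), and mu+ lives on Omega+, mu- on its complement
assert abs(measure(mu_plus, A) - measure(mu_minus, A)
           - sum(lam[w] for w in A)) < 1e-12
assert measure(mu_plus, set(lam) - omega_plus) == 0.0
```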
Remark 4.2. If λ = µ+ − µ− with µ+ and µ− orthogonal to each other,
then they have to be the restrictions of λ to the totally positive and totally
negative sets for λ and such a representation for λ is unique. It is clear that
in general the representation is not unique because we can add a common µ
to both µ+ and µ− and the µ will cancel when we compute λ = µ+ − µ− .
Remark 4.3. If µ is a nonnegative measure and we define λ by

λ(A) = ∫_A f(ω) dµ = ∫ χA(ω) f(ω) dµ

where f is an integrable function, then λ is a countably additive signed
measure and Ω+ = {ω : f(ω) > 0} and Ω− = {ω : f(ω) < 0}. If we define
f± (ω) as the positive and negative parts of f, then

µ±(A) = ∫_A f±(ω) dµ.
The signed measure λ that was constructed in the preceding remark
enjoys a very special relationship to µ. For any set A with µ(A) = 0, λ(A) = 0,
because the integrand χA(ω)f(ω) is 0 for µ-almost all ω and for all practical
purposes is a function that vanishes identically.
Definition 4.2. A signed measure λ is said to be absolutely continuous with
respect to a nonnegative measure µ, λ << µ in symbols, if whenever µ(A) is
zero for a set A ∈ F it is also true that λ(A) = 0.
Theorem 4.4. (Radon-Nikodym Theorem). If λ << µ then there is an
integrable function f(ω) such that

λ(A) = ∫_A f(ω) dµ                                              (4.1)
for all A ∈ F. The function f is uniquely determined almost everywhere and
is called the Radon-Nikodym derivative of λ with respect to µ. It is denoted by

f(ω) = dλ/dµ.
Proof. The proof depends on the decomposition theorem. We saw that if the
relation (4.1) holds, then Ω+ = {ω : f(ω) > 0}. If we define λa = λ − aµ,
then λa is a signed measure for every real number a. Let us define Ω(a) to
be the totally positive subset of λa. These sets are only defined up to sets of
measure zero, and we can only handle a countable number of sets of measure
0 at one time. So it is prudent to restrict a to the set Q of rational numbers.
Roughly speaking Ω(a) will be the set {f(ω) > a} and we will try to construct
f from the sets Ω(a) by the definition

f(ω) = sup{a ∈ Q : ω ∈ Ω(a)}.

The plan is to check that the function f(ω) defined above works. Since λa
gets more negative as a increases, Ω(a) is ↓ as a ↑. There is trouble
with sets of measure 0 for every comparison between two rationals a1 and
a2. Collect all such troublesome sets (only a countable number) and throw
them away. In other words we may assume without loss of generality that
Ω(a1) ⊂ Ω(a2) whenever a1 > a2. Clearly

{ω : f(ω) > x} = {ω : ω ∈ Ω(y) for some rational y > x} = ∪_{y>x, y∈Q} Ω(y)
and this makes f measurable. If A ⊂ ∩a Ω(a) then λ(A) − aµ(A) ≥ 0 for all
a. If µ(A) > 0, λ(A) has to be infinite, which is not possible. Therefore µ(A)
has to be zero and by absolute continuity λ(A) = 0 as well. On the other
hand if A ∩ Ω(a) = Φ for all a, then λ(A) − aµ(A) ≤ 0 for all a and again
if µ(A) > 0, λ(A) = −∞, which is not possible either. Therefore µ(A), and
by absolute continuity λ(A), are zero. This proves that f(ω) is finite almost
everywhere with respect to both λ and µ. Let us take two real numbers a < b
and consider Ea,b = {ω : a ≤ f(ω) ≤ b}. It is clear that the set Ea,b is in
Ω(a′) and Ωc(b′) for any a′ < a and b′ > b. Therefore for any set A ⊂ Ea,b, by
letting a′ and b′ tend to a and b,

a µ(A) ≤ λ(A) ≤ b µ(A).
Now we are essentially done. Let us take a grid {nh} and consider En =
{ω : nh ≤ f(ω) < (n + 1)h} for −∞ < n < ∞. Then for any A ∈ F and
each n,

λ(A ∩ En) − h µ(A ∩ En) ≤ nh µ(A ∩ En) ≤ ∫_{A∩En} f(ω) dµ
                           ≤ (n + 1)h µ(A ∩ En) ≤ λ(A ∩ En) + h µ(A ∩ En).

Summing over n we have

λ(A) − h µ(A) ≤ ∫_A f(ω) dµ ≤ λ(A) + h µ(A)

proving the integrability of f and, letting h → 0, establishing

λ(A) = ∫_A f(ω) dµ

for all A ∈ F.
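For measures given by point masses on a finite set, the Radon-Nikodym derivative is simply the ratio of the masses. A sketch (with invented measures) verifying the defining relation (4.1) on every subset:

```python
# Discrete Radon-Nikodym derivative: when mu({w}) > 0 for every point w,
# lambda << mu automatically and f(w) = lambda({w}) / mu({w}).
# Both measures below are hypothetical.
from itertools import combinations

mu = {0: 0.25, 1: 0.25, 2: 0.5}
lam = {0: 0.1, 1: 0.3, 2: 0.6}

f = {w: lam[w] / mu[w] for w in mu}     # dlambda/dmu, computed pointwise

# Check lambda(A) = integral of f over A with respect to mu, for every A
points = list(mu)
for r in range(len(points) + 1):
    for A in combinations(points, r):
        assert abs(sum(lam[w] for w in A)
                   - sum(f[w] * mu[w] for w in A)) < 1e-12
```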
Remark 4.4. (Uniqueness). If we have two choices of f, say f1 and f2, their
difference g = f1 − f2 satisfies

∫_A g(ω) dµ = 0

for all A ∈ F. If we take Aε = {ω : g(ω) ≥ ε}, then 0 ≥ ε µ(Aε) and this
implies µ(Aε) = 0 for all ε > 0, or g(ω) ≤ 0 almost everywhere with respect to
µ. A similar argument establishes g(ω) ≥ 0 almost everywhere with respect
to µ. Therefore g = 0 a.e. µ, proving uniqueness.
Exercise 4.1. If f and g are two integrable functions, measurable with respect
to a σ-field B, and if ∫_A f(ω) dP = ∫_A g(ω) dP for all sets A ∈ B0, a field that
generates the σ-field B, then f = g a.e. P.
Exercise 4.2. If λ(A) ≥ 0 for all A ∈ F , prove that f (ω) ≥ 0 almost everywhere.
Exercise 4.3. If Ω is a countable set and µ({ω}) > 0 for each single point
set, prove that any measure λ is absolutely continuous with respect to µ and
calculate the Radon-Nikodym derivative.
Exercise 4.4. Let F(x) be a distribution function on the line with F(0) = 0
and F(1) = 1, so that the probability measure α corresponding to it lives on
the interval [0, 1]. If F(x) satisfies a Lipschitz condition

|F(x) − F(y)| ≤ A|x − y|

then prove that α << m, where m is the Lebesgue measure on [0, 1]. Show
also that 0 ≤ dα/dm ≤ A almost surely.
If ν, λ, µ are three nonnegative measures such that ν << λ and λ << µ,
then show that ν << µ and

dν/dµ = (dν/dλ)(dλ/dµ)   a.e.
Exercise 4.5. If λ, µ are nonnegative measures with λ << µ and dλ/dµ = f,
then show that g is integrable with respect to λ if and only if gf is integrable
with respect to µ and

∫ g(ω) dλ = ∫ g(ω) f(ω) dµ.
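The change-of-measure identity of Exercise 4.5 can be checked mechanically in the discrete setting. A sketch in which all the masses are hypothetical:

```python
# Exercise 4.5 numerically: with f = dlambda/dmu, the integral of g with
# respect to lambda equals the integral of g*f with respect to mu.
mu = {0: 0.2, 1: 0.3, 2: 0.5}
f = {0: 0.5, 1: 2.0, 2: 0.6}               # a density: lambda({w}) = f(w) mu({w})
lam = {w: f[w] * mu[w] for w in mu}

g = {0: 1.0, 1: -2.0, 2: 3.0}
lhs = sum(g[w] * lam[w] for w in mu)        # integral of g d(lambda)
rhs = sum(g[w] * f[w] * mu[w] for w in mu)  # integral of g*f d(mu)
assert abs(lhs - rhs) < 1e-12
```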
Exercise 4.6. Given two nonnegative measures λ and µ, λ is said to be uniformly absolutely continuous with respect to µ on F if for any ε > 0 there
exists a δ > 0 such that for any A ∈ F with µ(A) < δ it is true that λ(A) < ε.
Use the Radon-Nikodym theorem to show that absolute continuity on a σ-field F implies uniform absolute continuity. If F0 is a field that generates the
σ-field F, show by an example that absolute continuity on F0 does not imply
absolute continuity on F. Show however that uniform absolute continuity
on F0 implies uniform absolute continuity, and therefore absolute continuity,
on F.
Exercise 4.7. If F is a distribution function on the line, show that it is absolutely continuous with respect to Lebesgue measure on the line if and only
if for any ε > 0 there exists a δ > 0 such that for an arbitrary finite collection
of disjoint intervals Ij = [aj, bj] with Σj |bj − aj| < δ it follows that

Σj [F(bj) − F(aj)] ≤ ε.
4.2 Conditional Expectation

In the Radon-Nikodym theorem, if λ << µ are two probability distributions
on (Ω, F), we defined the Radon-Nikodym derivative f(ω) = dλ/dµ as an F
measurable function such that

λ(A) = ∫_A f(ω) dµ   for all A ∈ F.
If Σ ⊂ F is a sub σ-field, the absolute continuity of λ with respect to µ on Σ
is clearly implied by the absolute continuity of λ with respect to µ on F. We
can therefore apply the Radon-Nikodym theorem on the measurable space
(Ω, Σ), and we will obtain a new Radon-Nikodym derivative

g(ω) = (dλ/dµ)|_Σ

such that

λ(A) = ∫_A g(ω) dµ   for all A ∈ Σ

and g is Σ measurable. Since the old function f(ω) was only F measurable,
in general, it cannot be used as the Radon-Nikodym derivative for the sub
σ-field Σ. Now if f is an integrable function on (Ω, F, µ) and Σ ⊂ F is a sub
σ-field, we can define λ on F by

λ(A) = ∫_A f(ω) dµ   for all A ∈ F

and recalculate the Radon-Nikodym derivative g for Σ; g will be a Σ
measurable, integrable function such that

λ(A) = ∫_A g(ω) dµ   for all A ∈ Σ.

In other words g is the perfect candidate for the conditional expectation

g(ω) = E[f(·)|Σ].

We have therefore proved the existence of the conditional expectation.
Theorem 4.5. The conditional expectation, as a mapping f → g, has the
following properties.

1. If g = E[f|Σ] then E[g] = E[f]. Also E[1|Σ] = 1 a.e.

2. If f is nonnegative then g = E[f|Σ] is almost surely nonnegative.
3. The map is linear: if a1, a2 are constants,

E[a1 f1 + a2 f2 |Σ] = a1 E[f1 |Σ] + a2 E[f2 |Σ]   a.e.

4. If g = E[f|Σ], then

∫ |g(ω)| dµ ≤ ∫ |f(ω)| dµ.

5. If h is a bounded Σ measurable function, then

E[f h|Σ] = h E[f|Σ]   a.e.

6. If Σ2 ⊂ Σ1 ⊂ F, then

E[ E[f|Σ1] |Σ2] = E[f|Σ2]   a.e.

7. Jensen's Inequality. If φ(x) is a convex function of x, and g = E[f|Σ], then

E[φ(f(ω))|Σ] ≥ φ(g(ω))   a.e.                                   (4.2)

and if we take expectations

E[φ(f)] ≥ E[φ(g)].
Proof. (i), (ii) and (iii) are obvious. For (iv) we note that if dλ = f dµ, then

∫ |f| dµ = sup_{A∈F} λ(A) − inf_{A∈F} λ(A)

and if we replace F by a sub σ-field Σ the right hand side is decreased. (v)
is obvious if h is the indicator function of a set A in Σ. To go from indicator
functions to simple functions to bounded measurable functions is routine.
(vi) is an easy consequence of the definition. Finally (vii) corresponds to
Theorem 1.7 proved for ordinary expectations and is proved analogously.
We note that if f1 ≥ f2 then E[f1|Σ] ≥ E[f2|Σ] a.e. and consequently
E[max(f1, f2)|Σ] ≥ max(g1, g2) a.e. where gi = E[fi|Σ] for i = 1, 2. Since
we can represent any convex function φ as φ(x) = sup_a [ax − ψ(a)], limiting
ourselves to rational a, we have only a countable set of functions to deal with,
and

E[φ(f)|Σ] = E[sup_a [af − ψ(a)] |Σ]
          ≥ sup_a [a E[f|Σ] − ψ(a)]
          = sup_a [a g − ψ(a)]
          = φ(g)

a.e., and after taking expectations,

E[φ(f)] ≥ E[φ(g)].
Remark 4.5. Conditional expectation is a form of averaging, i.e. it is linear,
takes constants into constants and preserves nonnegativity. Jensen's inequality is now a consequence of convexity.
In a somewhat more familiar context, if µ = λ1 × λ2 is a product measure
on (Ω, F) = (Ω1 × Ω2, F1 × F2) and we take Σ = {A × Ω2 : A ∈ F1}, then
for any function f(ω) = f(ω1, ω2), E[f(·)|Σ] = g(ω) where g(ω) = g(ω1) is
given by

g(ω1) = ∫_{Ω2} f(ω1, ω2) dλ2

so that the conditional expectation is just integrating out the unwanted variable
ω2. We can go one step more. If φ(x, y) is the joint density on R² of two
random variables X, Y (with respect to the Lebesgue measure on R²), and
ψ(x) is the marginal density of X given by

ψ(x) = ∫_{−∞}^{∞} φ(x, y) dy

then for any integrable function f(x, y)

E[f(X, Y)|X] = E[f(·, ·)|Σ] = (∫_{−∞}^{∞} f(x, y) φ(x, y) dy) / ψ(x)

where Σ is the σ-field of vertical strips A × (−∞, ∞) with a measurable
horizontal base A.
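The strip formula can be checked numerically. A sketch assuming the particular density φ(x, y) = x + y on the unit square (a valid probability density there, chosen only for illustration) and f(x, y) = y, for which E[Y|X = x] = (x/2 + 1/3)/(x + 1/2) in closed form:

```python
# Conditional expectation from a joint density via the strip formula:
# E[f(X,Y)|X=x] = (integral of f(x,y) phi(x,y) dy) / psi(x),
# with the hypothetical choices phi(x,y) = x + y on [0,1]^2 and f(x,y) = y.
n = 10000
h = 1.0 / n

def cond_exp(x):
    num = den = 0.0
    for i in range(n):
        yy = (i + 0.5) * h           # midpoint rule on [0, 1]
        phi = x + yy                 # density along the vertical strip at x
        num += yy * phi * h          # integral of f(x,y) phi(x,y) dy
        den += phi * h               # psi(x) = integral of phi(x,y) dy
    return num / den

x = 0.5
closed_form = (x / 2 + 1 / 3) / (x + 0.5)
assert abs(cond_exp(x) - closed_form) < 1e-6
```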
Exercise 4.8. If f is already Σ measurable then E[f|Σ] = f. This suggests
that the map f → g = E[f|Σ] is some sort of a projection. In fact if
we consider the Hilbert space H = L2[Ω, F, µ] of all F measurable square
integrable functions with the inner product

⟨f, g⟩_µ = ∫ f g dµ

then

H0 = L2[Ω, Σ, µ] ⊂ H = L2[Ω, F, µ]

and f → E[f|Σ] is seen to be the same as the orthogonal projection from H
onto H0. Prove it.
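A concrete check of the projection picture, assuming Σ is generated by a finite partition (all numbers invented): g = E[f|Σ] makes f − g orthogonal to every Σ-measurable h.

```python
# Conditional expectation as orthogonal projection in L2: with Sigma generated
# by a partition, g = E[f|Sigma] and <f - g, h> = 0 for Sigma-measurable h.
P = [0.1, 0.2, 0.3, 0.4]
f = [2.0, -1.0, 0.5, 3.0]
atoms = [[0, 1], [2, 3]]           # partition generating Sigma

g = [0.0] * 4
for atom in atoms:
    avg = sum(P[w] * f[w] for w in atom) / sum(P[w] for w in atom)
    for w in atom:
        g[w] = avg

h = [5.0, 5.0, -2.0, -2.0]         # arbitrary Sigma-measurable h (constant on atoms)
inner = sum(P[w] * (f[w] - g[w]) * h[w] for w in range(4))
assert abs(inner) < 1e-12          # f - g is orthogonal to L2(Sigma)
```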
Exercise 4.9. If F1 ⊂ F2 ⊂ F are two sub σ-fields of F and X is any integrable function, we can define Xi = E[X|Fi] for i = 1, 2. Show that
X1 = E[X2 |F1 ] a.e.
Conditional expectation is then the best nonlinear predictor if the loss
function is the expected (mean) square error.
4.3 Conditional Probability

We now turn our attention to conditional probability. If we take f = χB(ω),
then E[f|Σ] = P(ω, B) is called the conditional probability of B given Σ. It
is characterized by the property that it is Σ measurable as a function of ω
and for any A ∈ Σ

µ(A ∩ B) = ∫_A P(ω, B) dµ.
Theorem 4.6. P(·, ·) has the following properties.

1. P(ω, Ω) = 1, P(ω, Φ) = 0 a.e.

2. For any B ∈ F, 0 ≤ P(ω, B) ≤ 1 a.e.

3. For any countable collection {Bj} of disjoint sets in F,

P(ω, ∪j Bj) = Σj P(ω, Bj)   a.e.
4. If B ∈ Σ, P(ω, B) = χB(ω) a.e.

Proof. All are easy consequences of properties of conditional expectations.
Property (iii) perhaps needs an explanation. If E[|fn − f|] → 0, then by the
properties of conditional expectation E[|E{fn|Σ} − E{f|Σ}|] → 0. Property
(iii) is an easy consequence of this.
The problem with the above theorem is that every property is valid only
almost everywhere. There are exceptional sets of measure zero for each case.
While each null set, or a countable number of them, can be ignored, we have an
uncountable number of null sets here, and we would like a single null set outside
which all the properties hold. This means constructing a good version of the
conditional probability. It may not always be possible. If possible, such a
version is called a regular conditional probability. The existence of such a
regular version depends on the space (Ω, F) and the sub σ-field Σ being nice.
If Ω is a complete separable metric space and F are its Borel sets, and if Σ
is any countably generated sub σ-field of F, then it is nice enough. We will
prove it in the special case when Ω = [0, 1] is the unit interval and F are the
Borel subsets B of [0, 1]. Σ can be any countably generated sub σ-field of F.
Remark 4.6. In fact the case is not so special. There is a theorem [6] which
states that if (Ω, F) is any complete separable metric space that has an uncountable number of points, then there is a one-to-one measurable map with
a measurable inverse between (Ω, F) and ([0, 1], B). There is no loss of generality in assuming that (Ω, F) is just ([0, 1], B).
Theorem 4.7. Let P be a probability distribution on ([0, 1], B). Let Σ ⊂ B
be a sub σ-field. There exists a family of probability distributions Qx on
([0, 1], B) such that for every A ∈ B, Qx(A) is Σ measurable, and for every B
measurable f,

∫ f(y) Qx(dy) = E^P[f(·)|Σ]   a.e. P.                           (4.3)

If in addition Σ is countably generated, i.e. there is a field Σ0 consisting of a
countable number of Borel subsets of [0, 1] such that the σ-field generated by
Σ0 is Σ, then

Qx(A) = 1A(x)   for all A ∈ Σ.                                  (4.4)
Proof. The trick is not to be too ambitious in the first place, but to try to
construct the conditional expectations

Q(ω, B) = E[χB(ω)|Σ]

only for sets B given by B = (−∞, x) for rational x. We denote this conditional
expectation, which is in fact a conditional probability, by F(ω, x). By the
properties of conditional expectations, for any pair of rationals x < y there
is a null set Ex,y such that for ω ∉ Ex,y

F(ω, x) ≤ F(ω, y).

Moreover for any rational x < 0 there is a null set Nx outside which
F(ω, x) = 0, and similar null sets Nx for x > 1, outside which F(ω, x) = 1.
If we collect all these null sets, of which there are only countably many, and
take their union, we get a null set N ∈ Σ such that for ω ∉ N we have
a family F(ω, x) defined for rational x that satisfies

F(ω, x) ≤ F(ω, y) if x < y are rational,
F(ω, x) = 0 for rational x < 0,
F(ω, x) = 1 for rational x > 1,
P(A ∩ [0, x]) = ∫_A F(ω, x) dP for all A ∈ Σ.
For ω ∉ N and real y we can define

G(ω, y) = lim_{x↓y, x rational} F(ω, x).

For ω ∉ N, G is a right continuous nondecreasing function (a distribution
function) with G(ω, y) = 0 for y < 0 and G(ω, y) = 1 for y ≥ 1. There is
then a probability measure Q̂(ω, B) on the Borel subsets of [0, 1] such that
Q̂(ω, [0, y]) = G(ω, y) for all y. Q̂ is our candidate for the regular conditional
probability. Clearly Q̂(ω, I) is Σ measurable for all intervals I and by standard arguments will continue to be Σ measurable for all Borel sets B ∈ F.
If we check that

P(A ∩ [0, x]) = ∫_A G(ω, x) dP   for all A ∈ Σ

for all 0 ≤ x ≤ 1, then

P(A ∩ I) = ∫_A Q̂(ω, I) dP   for all A ∈ Σ

for all intervals I, and by standard arguments this will extend to finite disjoint
unions of half open intervals that constitute a field, and finally to the σ-field
F generated by that field. To verify that for all real y

P(A ∩ [0, y]) = ∫_A G(ω, y) dP   for all A ∈ Σ,
we start from

P(A ∩ [0, x]) = ∫_A F(ω, x) dP   for all A ∈ Σ,

valid for rational x, and let x ↓ y through rationals. From the countable additivity of P the left hand side converges to P(A ∩ [0, y]), and by the bounded
convergence theorem the right hand side converges to ∫_A G(ω, y) dP, and
we are done.
Finally, from the uniqueness of the conditional expectation, if A ∈ Σ

Q̂(ω, A) = χA(ω)

provided ω ∉ NA, which is a null set that depends on A. We can take a
countable set Σ0 of generators A that forms a field and get a single null set
N such that if ω ∉ N

Q̂(ω, A) = χA(ω)

for all A ∈ Σ0. Since both sides are countably additive measures in A, and as
they agree on Σ0, they have to agree on Σ as well.
Exercise 4.10. (Disintegration Theorem.) Let µ be a probability measure on
the plane R² with a marginal distribution α for the first coordinate. In other
words, α is such that, for any f that is a bounded measurable function of x,

∫_{R²} f(x) dµ = ∫_R f(x) dα.

Show that there exist measures βx depending measurably on x such that
βx[{x} × R] = 1, i.e. βx is supported on the vertical line {(x, y) : y ∈ R},
and µ = ∫_R βx dα. The converse is of course easier: given α and βx we can
construct a unique µ such that µ disintegrates as expected.
4.4 Markov Chains
One of the ways of generating a sequence of dependent random variables is
to think of a system evolving in time. We have time points that are discrete
say T = 0, 1, · · · , N, · · · . The state of the system is described by a point
x in the state space X of the system. The state space X comes with a
natural σ-field of subsets F . At time 0 the system is in a random state and
its distribution is specified by a probability distribution µ0 on (X , F ). At
successive times T = 1, 2, · · · , the system changes its state, and given the past
history (x0, · · · , xk−1) of the states of the system at times T = 0, · · · , k − 1,
the probability that the system finds itself at time k in a subset A ∈ F is given
by πk(x0, · · · , xk−1; A). For each (x0, · · · , xk−1), πk defines a probability
measure on (X , F ) and for each A ∈ F , πk (x0 , · · · , xk−1 ; A ) is assumed to
be a measurable function of (x0 , · · · , xk−1 ), on the space (X k , F k ) which is
the product of k copies of the space (X , F ) with itself. We can inductively
define measures µk on (X k+1, F k+1) that describe the probability distribution
of the entire history (x0 , · · · , xk ) of the system through time k. To go from
µk−1 to µk we think of (X k+1, F k+1) as the product of (X k , F k ) with (X , F )
and construct on (X k+1, F k+1 ) a probability measure with marginal µk−1 on
(X k, F k) and conditionals πk(x0, · · · , xk−1; ·) on the fibers (x0, · · · , xk−1) × X.
This will define µk and the induction can proceed. We may stop at some finite
terminal time N or go on indefinitely. If we do go on indefinitely, we will have
a consistent family of finite dimensional distributions {µk} on (X k+1, F k+1)
and we may try to use Kolmogorov’s theorem to construct a probability
measure P on the space (X ∞ , F ∞ ) of sequences {xj : j ≥ 0} representing
the total evolution of the system for all times.
Remark 4.7. Kolmogorov's theorem requires some assumptions on
(X, F) that are satisfied if X is a complete separable metric space and F
are the Borel sets. However, in the present context, there is a result known
as Tulcea's theorem (see [8]) that proves the existence of a P on (X ∞, F ∞)
for any choice of (X, F), exploiting the fact that the consistent family of
finite dimensional distributions µk arises from well defined successive regular
conditional probability distributions.
An important subclass is generated when the transition probability depends
on the past history only through the current state. In other words
πk (x0 , · · · , xk−1 ; ·) = πk−1,k (xk−1 ; ·).
In such a case the process is called a Markov Process with transition probabilities πk−1,k(·, ·). An even smaller subclass arises when we demand that
πk−1,k(·, ·) be the same for different values of k. A single transition probability π(x, A) and the initial distribution µ0 determine the entire process, i.e.
the measure P on (X ∞, F ∞). Such processes are called time-homogeneous
Markov Processes or Markov Processes with stationary transition probabilities.
Chapman-Kolmogorov Equations. If we have the transition probabilities πk,k+1 of transition from time k to k + 1 of a Markov Chain, it is possible
to obtain directly the transition probabilities from time k to k + ℓ for any
ℓ ≥ 2. We do it by induction on ℓ. Define

πk,k+ℓ+1(x, A) = ∫_X πk,k+ℓ(x, dy) πk+ℓ,k+ℓ+1(y, A)              (4.5)

or equivalently, in a more direct fashion,

πk,k+ℓ+1(x, A) = ∫_X · · · ∫_X πk,k+1(x, dyk+1) · · · πk+ℓ,k+ℓ+1(yk+ℓ, A).
Theorem 4.8. The transition probabilities πk,m(·, ·) satisfy the relations

πk,n(x, A) = ∫_X πk,m(x, dy) πm,n(y, A)                         (4.6)

for any k < m < n, and for the Markov Process defined by the one step
transition probabilities πk,k+1(·, ·), for any n > m

P[xn ∈ A|Σm] = πm,n(xm, A)   a.e.

where Σm is the σ-field of past history up to time m generated by the coordinates x0, x1, · · · , xm.
Proof. The identity is basically algebra. The multiple integral can be carried
out by iteration in any order, and after enough variables are integrated we
get our identity. To prove that the conditional probabilities are given by the
right formula we need to establish

P[{xn ∈ A} ∩ B] = ∫_B πm,n(xm, A) dP

for all B ∈ Σm and A ∈ F. We write

P[{xn ∈ A} ∩ B]
= ∫_{{xn∈A}∩B} dP
= ∫ · · · ∫_{{xn∈A}∩B} dµ(x0) π0,1(x0, dx1) · · · πm−1,m(xm−1, dxm)
      πm,m+1(xm, dxm+1) · · · πn−1,n(xn−1, dxn)
= ∫ · · · ∫_B dµ(x0) π0,1(x0, dx1) · · · πm−1,m(xm−1, dxm)
      ∫ πm,m+1(xm, dxm+1) · · · πn−1,n(xn−1, A)
= ∫ · · · ∫_B dµ(x0) π0,1(x0, dx1) · · · πm−1,m(xm−1, dxm) πm,n(xm, A)
= ∫_B πm,n(xm, A) dP

and we are done.
Remark 4.8. If the chain has stationary transition probabilities, then the
transition probabilities πm,n(x, dy) from time m to time n depend only on
the difference k = n − m and are given by what are usually called the k step
transition probabilities. They are defined inductively by

π^(k+1)(x, A) = ∫_X π^(k)(x, dy) π(y, A)

and satisfy the Chapman-Kolmogorov equations

π^(k+ℓ)(x, A) = ∫_X π^(k)(x, dy) π^(ℓ)(y, A) = ∫_X π^(ℓ)(x, dy) π^(k)(y, A).
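For a finite state space these relations are just matrix algebra: the k step transition probabilities form the k-th power of the one step matrix, and Chapman-Kolmogorov becomes π^(k+ℓ) = π^(k) π^(ℓ). A sketch with an invented 3-state matrix:

```python
# Chapman-Kolmogorov on a finite state space: the k-step transition
# probabilities are the k-th matrix power of the one-step matrix.
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(A, k):
    n = len(A)
    R = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(k):
        R = matmul(R, A)
    return R

pi = [[0.9, 0.1, 0.0],
      [0.2, 0.5, 0.3],
      [0.0, 0.4, 0.6]]            # hypothetical one-step matrix, rows sum to 1

lhs = matpow(pi, 5)                        # pi^(2+3)
rhs = matmul(matpow(pi, 2), matpow(pi, 3)) # pi^(2) pi^(3)
for i in range(3):
    for j in range(3):
        assert abs(lhs[i][j] - rhs[i][j]) < 1e-12
```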
Suppose we have a probability measure P on the product space X × Y × Z
with the product σ-field. The Markov property in this context refers to the
equality

E^P[g(z)|Σx,y] = E^P[g(z)|Σy]   a.e. P                          (4.7)

for bounded measurable functions g on Z, where we have used Σx,y to denote
the σ-field generated by the projection onto X × Y and Σy the corresponding
σ-field generated by the projection onto Y. The Markov property in the reverse
direction is the similar condition for bounded measurable functions f on X:

E^P[f(x)|Σy,z] = E^P[f(x)|Σy]   a.e. P                          (4.8)

They look different, but they are both equivalent to the symmetric condition

E^P[f(x)g(z)|Σy] = E^P[f(x)|Σy] E^P[g(z)|Σy]   a.e. P           (4.9)

which says that given the present, the past and future are conditionally
independent. In view of the symmetry it is sufficient to prove the following:
Proof. Let us fix f and g, and let us denote the common value in (4.7) by ĝ(y).
Then

E^P[f(x)g(z)|Σy] = E^P[ E^P[f(x)g(z)|Σx,y] |Σy]       a.e. P
                 = E^P[ f(x) E^P[g(z)|Σx,y] |Σy]      a.e. P
                 = E^P[ f(x) ĝ(y) |Σy]                a.e. P (by (4.7))
                 = E^P[f(x)|Σy] ĝ(y)                  a.e. P
                 = E^P[f(x)|Σy] E^P[g(z)|Σy]          a.e. P

which is (4.9). Conversely, we assume (4.9) and denote by ḡ(x, y) and ĝ(y)
the expressions on the left and right side of (4.7). Let b(y) be a bounded
measurable function on Y. Then

E^P[f(x) b(y) ḡ(x, y)] = E^P[f(x) b(y) g(z)]
                       = E^P[ b(y) E^P[f(x)g(z)|Σy] ]
                       = E^P[ b(y) E^P[f(x)|Σy] E^P[g(z)|Σy] ]
                       = E^P[ b(y) E^P[f(x)|Σy] ĝ(y) ]
                       = E^P[f(x) b(y) ĝ(y)].

Since f and b are arbitrary, this implies that ḡ(x, y) = ĝ(y) a.e. P.
Let us look at some examples.
1. Suppose we have an urn containing a certain number of balls (nonzero),
some red and others green. A ball is drawn at random and its color
is noted. Then it is returned to the urn along with an extra ball of
the same color. Then a new ball is drawn at random and the process
continues ad infinitum. The current state of the system can be characterized by two integers r, g such that r + g ≥ 1. The initial state of the
system is some r0, g0 with r0 + g0 ≥ 1. The system can go from (r, g)
either to (r + 1, g) with probability r/(r + g) or to (r, g + 1) with probability
g/(r + g). This is clearly an example of a Markov Chain with stationary
transition probabilities.
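A minimal simulation of the urn scheme (the initial state r0 = g0 = 1 is an arbitrary choice):

```python
# Polya urn sketch: state (r, g); a ball is drawn at random and returned
# together with another ball of the same color, so the transition depends
# only on the current state.
import random

def polya_step(r, g, rng):
    if rng.random() < r / (r + g):   # red drawn with probability r/(r+g)
        return r + 1, g
    return r, g + 1

rng = random.Random(0)
r, g = 1, 1                          # hypothetical initial state r0 = g0 = 1
for _ in range(1000):
    r, g = polya_step(r, g, rng)
assert r + g == 1002                 # each step adds exactly one ball
```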
2. Consider a queue for service in a store. Suppose at each of the times
1, 2, · · · , a random number of new customers arrive and join the
queue. If the queue is nonempty at some time, then exactly one
customer will be served and will leave the queue at the next time point.
The distribution of the number of new arrivals is specified by {pj : j ≥ 0},
where pj is the probability that exactly j new customers arrive
at a given time. The numbers of new arrivals at distinct times are
assumed to be independent. The queue length is a Markov Chain on
the state space X = {0, 1, · · · } of nonnegative integers. The transition
probabilities π(i, j) are given by π(0, j) = pj, because there is no service
when nobody is in the queue to begin with and all the new arrivals join
the queue. On the other hand π(i, j) = pj−i+1 if j + 1 ≥ i ≥ 1, because
one person leaves the queue after being served.
3. Consider a reservoir into which water flows. The amount of additional water flowing into the reservoir on any given day is random, and has a distribution α on [0, ∞). The demand is also random for any given day, with a probability distribution β on [0, ∞). We may also assume that the inflows and demands on successive days are random variables ξ_n and η_n, that have α and β for their common distributions and are all mutually independent. We may wish to assume a percentage loss due to evaporation. In any case the storage levels on successive days satisfy the recurrence relation
S_{n+1} = [(1 − p)S_n + ξ_n − η_n]^+
where p is the loss, and we have put in the condition that the outflow equals the demand unless the stored amount is less than the demand, in which case the outflow is the available quantity. The current amount in storage is a Markov process with stationary transition probabilities.
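The one-step update for this storage chain is immediate to write down. A minimal sketch (the function name is our own; the fractional loss p is the evaporation parameter from the text):

```python
def storage_step(s, inflow, demand, p=0.0):
    """One step of the recursion S_{n+1} = [(1 - p) S_n + xi_n - eta_n]^+,
    where p is the fractional evaporation loss."""
    return max(0.0, (1.0 - p) * s + inflow - demand)

# when demand exceeds what is available, the reservoir simply empties
print(storage_step(5.0, 2.0, 10.0, p=0.1))   # 0.0
```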
4. Let X_1, · · · , X_n, · · · be a sequence of independent random variables with a common distribution α. Let S_n = Y + X_1 + · · · + X_n for n ≥ 1 with S_0 = Y, where Y is a random variable independent of X_1, . . . , X_n, . . . with distribution µ. Then S_n is a Markov chain on R with one step transition probability π(x, A) = α(A − x) and initial distribution µ. The n step transition probability is α_n(A − x) where α_n is the n-fold convolution of α. This is often referred to as a random walk.
The last two examples can be described by models of the type
xn = f (xn−1 , ξn )
where xn is the current state and ξn is some random external disturbance. ξn
are assumed to be independent and identically distributed. They could have
two components like inflow and demand. The new state is a deterministic
function of the old state and the noise.
Exercise 4.11. Verify that the first two examples can be cast in the above
form. In fact there is no loss of generality in assuming that ξj are mutually
independent random variables having as common distribution the uniform
distribution on the interval [0, 1].
Given a Markov Chain with stationary transition probabilities π(x, dy) on
a state space (X , F ), the behavior of π (n) (x, dy) for large n is an important
and natural question. In the best situation of independent random variables
π (n) (x, A) = µ(A) are independent of x as well as n. Hopefully after a long
time the Chain will ‘forget’ its origins and π (n) (x, ·) → µ(·), in some suitable
sense, for some µ that does not depend on x. If that happens, then from the relation
π^{(n+1)}(x, A) = ∫ π^{(n)}(x, dy) π(y, A),
we conclude
µ(A) = ∫ π(y, A) dµ(y)   for all A ∈ F
Measures that satisfy the above property, abbreviated as µπ = µ, are called invariant measures for the Markov Chain. If we start with an initial distribution µ which is invariant, then the probability measure P has µ as marginal at every time. In fact P is stationary, i.e. invariant with respect to time translation, and can be extended to a stationary process where time runs from −∞ to +∞.
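For a finite state space an invariant measure can be approximated numerically by iterating µ → µπ, which converges for a nice (irreducible, aperiodic) chain. A minimal sketch, with an illustrative 2 × 2 stochastic matrix of our own choosing:

```python
def invariant_distribution(pi, iters=500):
    """Approximate the invariant measure mu = mu pi of a finite stochastic
    matrix pi (given as a list of rows) by repeatedly applying mu -> mu pi."""
    n = len(pi)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[x] * pi[x][y] for x in range(n)) for y in range(n)]
    return mu

pi = [[0.9, 0.1], [0.2, 0.8]]
print(invariant_distribution(pi))   # close to [2/3, 1/3]
```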
4.5 Stopping Times and Renewal Times
One of the important notions in the analysis of Markov Chains is the idea of stopping times and renewal times. A function
τ(ω) : Ω → {n : n ≥ 0}
is a random variable defined on the set Ω = X^∞ such that for every n ≥ 0 the set {ω : τ(ω) = n} (or equivalently for each n ≥ 0 the set {ω : τ(ω) ≤ n}) is measurable with respect to the σ-field F_n generated by X_j : 0 ≤ j ≤ n. It is not necessary that τ(ω) < ∞ for every ω. Such random variables τ are called stopping times. Examples of stopping times are constant times n ≥ 0, the first visit to a state x, or the second visit to a state x. The important thing is that in order to decide if τ ≤ n, i.e. to know if whatever is supposed to happen did happen before time n, the chain need be observed only up to time n. Examples of τ that are not stopping times are easy to find. The last time a site is visited is not a stopping time, nor is the first time such that
at the next time one is in a state x. An important fact is that the Markov property extends to stopping times. Just as we have σ-fields F_n associated with constant times, we have a σ-field F_τ associated to any stopping time. This is the information we have when we observe the chain up to time τ. Formally
F_τ = {A : A ∈ F_∞ and A ∩ {τ ≤ n} ∈ F_n for each n}
One can check from the definition that τ is F_τ measurable and so is X_τ on the set {τ < ∞}. If τ is the time of first visit to y, then τ is a stopping time and the event that the chain visits a state z before visiting y is F_τ measurable.
Lemma 4.10. (Strong Markov Property.) At any stopping time τ the Markov property holds in the sense that the conditional distribution of X_{τ+1}, · · · , X_{τ+n}, · · · conditioned on F_τ is the same as that of the original chain starting from the state x = X_τ, on the set {τ < ∞}. In other words
P_x{X_{τ+1} ∈ A_1, · · · , X_{τ+n} ∈ A_n | F_τ} = ∫_{A_1} · · · ∫_{A_n} π(X_τ, dx_1) · · · π(x_{n−1}, dx_n)
a.e. on {τ < ∞}.
Proof. Let A ∈ Fτ be given with A ⊂ {τ < ∞}. Then
P_x{A ∩ {X_{τ+1} ∈ A_1, · · · , X_{τ+n} ∈ A_n}}
  = Σ_k P_x{A ∩ {τ = k} ∩ {X_{k+1} ∈ A_1, · · · , X_{k+n} ∈ A_n}}
  = Σ_k ∫_{A∩{τ=k}} ∫_{A_1} · · · ∫_{A_n} π(X_k, dx_{k+1}) · · · π(x_{k+n−1}, dx_{k+n}) dP_x
  = ∫_A ∫_{A_1} · · · ∫_{A_n} π(X_τ, dx_1) · · · π(x_{n−1}, dx_n) dP_x
We have used the fact that if A ∈ Fτ then A ∩ {τ = k} ∈ Fk for every
k ≥ 0.
Remark 4.9. If Xτ = y a.e. with respect to Px on the set τ < ∞, then at
time τ , when it is finite, the process starts afresh with no memory of the past
and will have conditionally the same probabilities in the future as Py . At
such times the process renews itself and these times are called renewal times.
4.6 Countable State Space
From the point of view of analysis a particularly simple situation is when the
state space X is a countable set. It can be taken as the integers {x : x ≥ 1}.
Many applications fall in this category and an understanding of what happens
in this situation will tell us what to expect in general.
The one step transition probability is a matrix π(x, y) with nonnegative entries such that Σ_y π(x, y) = 1 for each x. Such matrices are called stochastic matrices. The n step transition matrix is just the n-th power of the matrix, defined inductively by
π^{(n+1)}(x, y) = Σ_z π^{(n)}(x, z) π(z, y)
To be consistent one defines π (0) (x, y) = δx,y which is 1 if x = y and 0
otherwise. The problem is to analyse the behaviour for large n of π (n) (x, y). A
state x is said to communicate with a state y if π (n) (x, y) > 0 for some n ≥ 1.
We will assume for simplicity that every state communicates with every other
state. Such Markov Chains are called irreducible. Let us first limit ourselves
to the study of irreducible chains. Given an irreducible Markov chain with
transition probabilities π(x, y) we define fn (x) as the probability of returning
to x for the first time at the n-th step assuming that the chain starts from
the state x. Using the convention that P_x refers to the measure on sequences
for the chain starting from x and {Xj } are the successive positions of the
chain
f_n(x) = P_x{X_j ≠ x for 1 ≤ j ≤ n − 1 and X_n = x}
       = Σ_{y_1≠x, ··· , y_{n−1}≠x} π(x, y_1) π(y_1, y_2) · · · π(y_{n−1}, x)
Since the f_n(x) are probabilities of disjoint events, Σ_n f_n(x) ≤ 1. The state x is called transient if Σ_n f_n(x) < 1 and recurrent if Σ_n f_n(x) = 1. The recurrent case is divided into two situations. If we denote by τ_x = inf{n ≥ 1 : X_n = x} the time of first visit to x, then recurrence is P_x{τ_x < ∞} = 1. A recurrent state x is called positive recurrent if
E^{P_x}{τ_x} = Σ_{n≥1} n f_n(x) < ∞
and null recurrent if
E^{P_x}{τ_x} = Σ_{n≥1} n f_n(x) = ∞
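For a small finite chain the first-return probabilities f_n(x) can be computed exactly by propagating the mass that has so far avoided x. The sketch below (our own helper, not from the text) lets one check Σ_n f_n(x) = 1 for a recurrent state and evaluate the expected return time E^{P_x}{τ_x} = Σ_n n f_n(x).

```python
def return_time_distribution(pi, x, nmax):
    """First-return probabilities f_n(x) for a finite chain with transition
    matrix pi: f_n(x) = P_x{X_j != x for 1 <= j < n, X_n = x}."""
    size = len(pi)
    # mass[y]: probability of being at y after k steps without returning to x
    mass = [0.0 if y == x else pi[x][y] for y in range(size)]
    f = [pi[x][x]]                         # f_1(x) = pi(x, x)
    for _ in range(2, nmax + 1):
        f.append(sum(mass[y] * pi[y][x] for y in range(size)))
        mass = [0.0 if z == x else sum(mass[y] * pi[y][z] for y in range(size))
                for z in range(size)]
    return f

pi = [[0.9, 0.1], [0.2, 0.8]]
f = return_time_distribution(pi, 0, 200)
print(sum(f))                                       # recurrent: total mass 1
print(sum((n + 1) * fn for n, fn in enumerate(f)))  # expected return time
```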
Lemma 4.11. If for a (not necessarily irreducible) chain starting from x, the
probability of ever visiting y is positive then so is the probability of visiting y
before returning to x.
Proof. Assume that for the chain starting from x the probability of visiting
y before returning to x is zero. But when it returns to x it starts afresh and
so will not visit y until it returns again. This reasoning can be repeated and
so the chain will have to visit x infinitely often before visiting y. But this
will use up all the time and so it cannot visit y at all.
Lemma 4.12. For an irreducible chain all states x are of the same type.
Proof. Let x be recurrent and y be given. Since the chain is irreducible, for
some k, π (k) (x, y) > 0. By the previous lemma, for the chain starting from x,
there is a positive probability of visiting y before returning to x. After each
successive return to x, the chain starts afresh and there is a fixed positive
probability of visiting y before the next return to x. Since there are infinitely
many returns to x, y will be visited infinitely many times as well. So y is also a recurrent state.
We now prove that if x is positive recurrent then so is y. We saw already that the probability p = P_x{τ_y < τ_x} of visiting y before returning to x is positive. Clearly
E^{P_x}{τ_x} ≥ P_x{τ_y < τ_x} E^{P_y}{τ_x}
and therefore
E^{P_y}{τ_x} ≤ (1/p) E^{P_x}{τ_x} < ∞.
On the other hand we can write
E^{P_x}{τ_y} ≤ ∫_{τ_y<τ_x} τ_x dP_x + ∫_{τ_x<τ_y} τ_y dP_x
            = ∫_{τ_y<τ_x} τ_x dP_x + ∫_{τ_x<τ_y} {τ_x + E^{P_x}{τ_y}} dP_x
            = ∫_{τ_y<τ_x} τ_x dP_x + ∫_{τ_x<τ_y} τ_x dP_x + (1 − p) E^{P_x}{τ_y}
            = ∫ τ_x dP_x + (1 − p) E^{P_x}{τ_y}
by the renewal property at the stopping time τ_x. Therefore
E^{P_x}{τ_y} ≤ (1/p) E^{P_x}{τ_x}.
We also have
E^{P_y}{τ_y} ≤ E^{P_y}{τ_x} + E^{P_x}{τ_y} ≤ (2/p) E^{P_x}{τ_x}
proving that y is positive recurrent.
Transient Case: We have the following theorem regarding transience.
Theorem 4.13. An irreducible chain is transient if and only if
G(x, y) = Σ_{n=0}^∞ π^{(n)}(x, y) < ∞   for all x, y.
Moreover for any two states x and y,
G(x, y) = f(x, y) G(y, y)
and
G(x, x) = 1/(1 − f(x, x))
where f(x, y) = P_x{τ_y < ∞}.
Proof. Each time the chain returns to x there is a probability 1 − f(x, x) of never returning. The number of returns then has the geometric distribution
P_x{exactly n returns to x} = (1 − f(x, x)) f(x, x)^n
and the expected number of returns is given by
Σ_{k=1}^∞ π^{(k)}(x, x) = f(x, x)/(1 − f(x, x)).
The left hand side comes from the calculation
E^{P_x}[Σ_{k=1}^∞ χ_{{x}}(X_k)] = Σ_{k=1}^∞ π^{(k)}(x, x)
and the right hand side from the calculation of the mean of a geometric distribution. Since we count the visit at time 0 as a visit to x, we add 1 to both sides to get our formula. If we want to calculate the expected number of visits to y when we start from x, first we have to get to y, and the probability of that is f(x, y). Then by the renewal property it is exactly the same as the expected number of visits to y starting from y, including the visit at time 0, and that equals G(y, y).
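For the nearest neighbor walk on Z stepping +1 with probability p ≠ 1/2 (a transient example anticipating Example 4.1), both sides of G(x, x) = 1/(1 − f(x, x)) are computable: π^{(2n)}(0, 0) = C(2n, n)(pq)^n with q = 1 − p, and the return probability is known to be f(0, 0) = 2 min(p, q). These closed forms are standard facts, not derived in the text; the check below is only a sketch.

```python
from math import comb

def green_at_origin(p, nmax=200):
    """G(0,0) = sum_n pi^{(n)}(0,0) for the walk on Z stepping +1 w.p. p
    and -1 w.p. q = 1 - p; only even times contribute, and
    pi^{(2n)}(0,0) = C(2n, n) (p q)^n."""
    q = 1.0 - p
    return sum(comb(2 * n, n) * (p * q) ** n for n in range(nmax + 1))

G = green_at_origin(0.8)
print(G)              # close to 1/0.6
print(1.0 - 1.0 / G)  # f(0,0) recovered from G = 1/(1 - f): close to 0.4
```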
Before we study the recurrent behavior we need the notion of periodicity.
For each state x let us define Dx = {n : π (n) (x, x) > 0} to be the set of times
at which a return to x is possible if one starts from x. We define dx to be
the greatest common divisor of Dx .
Lemma 4.14. For any irreducible chain dx = d for all x ∈ X and for each
x, Dx contains all sufficiently large multiples of d.
Proof. Let us define
Dx,y = {n : π (n) (x, y) > 0}
so that Dx = Dx,x . By the Chapman-Kolmogorov equations
π (m+n) (x, y) ≥ π (m) (x, z)π (n) (z, y)
for every z, so that if m ∈ Dx,z and n ∈ Dz,y , then m + n ∈ Dx,y . In
particular if m, n ∈ Dx it follows that m + n ∈ Dx . Since any pair of states
communicate with each other, given x, y ∈ X , there are positive integers n1
and n2 such that n1 ∈ Dx,y and n2 ∈ Dy,x . This implies that with the choice
of ℓ = n_1 + n_2, n + ℓ ∈ D_x whenever n ∈ D_y; similarly n + ℓ ∈ D_y whenever n ∈ D_x. Since ℓ itself belongs to both D_x and D_y, both d_x and d_y divide ℓ. Suppose n ∈ D_x. Then n + ℓ ∈ D_y and therefore d_y divides n + ℓ. Since d_y divides ℓ, d_y must divide n. Since this is true for every n ∈ D_x and d_x is the greatest common divisor of D_x, d_y must divide d_x. Similarly d_x must divide d_y. Hence d_x = d_y. We now complete the proof of the lemma. Let d be the
greatest common divisor of Dx . Then it is the greatest common divisor of a
finite subset n1 , n2 , · · · , nq of Dx and there will exist integers a1 , a2 , · · · , aq
such that
a1 n1 + a2 n2 + · · · + aq nq = d
Some of the a's will be positive and others negative. Separating them out, and remembering that all the n_i are divisible by d, we find two integers md and (m + 1)d that both belong to D_x. If now n = kd with k > m², we can write k = ℓm + r with ℓ ≥ m and remainder r < m. Then
kd = (ℓm + r)d = ℓmd + r(m + 1)d − rmd = (ℓ − r)md + r(m + 1)d ∈ D_x
since ℓ ≥ m > r.
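The period can be computed mechanically as the gcd of the observed return times. A small sketch (helper name ours; nmax is a horizon assumed large enough to expose D_x):

```python
from math import gcd
from functools import reduce

def period(pi, x, nmax=50):
    """gcd of D_x = {n : pi^{(n)}(x, x) > 0}, found from matrix powers."""
    size = len(pi)
    row = [1.0 if y == x else 0.0 for y in range(size)]   # row x of pi^{(0)}
    D = []
    for n in range(1, nmax + 1):
        row = [sum(row[z] * pi[z][y] for z in range(size)) for y in range(size)]
        if row[x] > 0:
            D.append(n)
    return reduce(gcd, D)

flip = [[0.0, 1.0], [1.0, 0.0]]   # shuttles between two states: period 2
lazy = [[0.5, 0.5], [1.0, 0.0]]   # a return in one step is possible: period 1
print(period(flip, 0), period(lazy, 0))
```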
Remark 4.10. For an irreducible chain the common value d is called the period of the chain and an irreducible chain with period d = 1 is called aperiodic.
The simplest example of a periodic chain is one with 2 states and the chain
shuttles back and forth between the two. π(x, y) = 1 if x 6= y and 0 if x = y.
A simple calculation yields π (n) (x, x) = 1 if n is even and 0 otherwise. There
is oscillatory behavior in n that persists. The main theorem for irreducible,
aperiodic, recurrent chains is the following.
Theorem 4.15. Let π(x, y) be the one step transition probability for a recurrent aperiodic Markov chain and let π^{(n)}(x, y) be the n-step transition probabilities. If the chain is null recurrent then
lim_{n→∞} π^{(n)}(x, y) = 0   for all x, y.
If the chain is positive recurrent then of course E^{P_x}{τ_x} = m(x) < ∞ for all x, and in that case
lim_{n→∞} π^{(n)}(x, y) = q(y) = 1/m(y)
exists for all x and y, is independent of the starting point x, and Σ_y q(y) = 1.
The proof is based on
Theorem 4.16. (Renewal Theorem.) Let {f_n : n ≥ 1} be a sequence of nonnegative numbers such that
Σ_n f_n = 1,   Σ_n n f_n = m ≤ ∞
and the greatest common divisor of {n : f_n > 0} is 1. Suppose that {p_n : n ≥ 0} are defined by p_0 = 1 and recursively
p_n = Σ_{j=1}^n f_j p_{n−j}   (4.10)
Then
lim_{n→∞} p_n = 1/m
where if m = ∞ the right hand side is taken as 0.
Proof. The proof is based on several steps.
Step 1. We have inductively p_n ≤ 1. Let a = lim sup_{n→∞} p_n. We can choose a subsequence n_k such that p_{n_k} → a. We can assume without loss of generality that p_{n_k+j} → q_j as k → ∞ for all positive and negative integers j as well. Of course the limit q_0 for j = 0 is a. In relation (4.10) we can pass to the limit along the subsequence and use the dominated convergence theorem to obtain
q_n = Σ_{j=1}^∞ f_j q_{n−j}   (4.11)
valid for −∞ < n < ∞. In particular
q_0 = Σ_{j=1}^∞ f_j q_{−j}   (4.12)
Step 2: Because a = lim sup p_n we can conclude that q_j ≤ a for all j. If we denote by S = {n : f_n > 0}, then (4.12) forces q_{−k} = a for k ∈ S. We can then deduce from equation (4.11) that q_{−k} = a for k = k_1 + k_2 with k_1, k_2 ∈ S. By repeating the same reasoning, q_{−k} = a for k = k_1 + k_2 + · · · + k_ℓ. By Lemma 3.6, because the greatest common factor of the integers in S is 1, there is a k_0 such that for k ≥ k_0 we have q_{−k} = a. We now apply the relation (4.11) again to conclude that q_j = a for all positive as well as negative j.
Step 3: If we add up equation (4.10) for n = 1, · · · , N we get
p_1 + p_2 + · · · + p_N = (f_1 + f_2 + · · · + f_N) + (f_1 + f_2 + · · · + f_{N−1}) p_1 + · · · + (f_1 + f_2 + · · · + f_{N−k}) p_k + · · · + f_1 p_{N−1}
If we denote by T_j = Σ_{i=j}^∞ f_i, we have T_1 = 1 and Σ_{j=1}^∞ T_j = m. We can now rewrite the above as
Σ_{j=1}^N T_j p_{N−j+1} = Σ_{j=1}^N f_j
Step 4: Because p_{N−j+1} → a for every j along the subsequence N = n_k, if Σ_j T_j = m < ∞, we can deduce from the dominated convergence theorem that m a = 1, and we conclude that
lim sup_{n→∞} p_n = 1/m
If Σ_j T_j = ∞, by Fatou's Lemma a = 0. Exactly the same argument applies to the liminf and we conclude that
lim inf_{n→∞} p_n = 1/m
This concludes the proof of the renewal theorem.
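The theorem is easy to watch numerically: iterate the recursion (4.10) for a concrete {f_n} and compare p_n with 1/m. A minimal sketch with f_1 = f_2 = 1/2 (our choice), so that gcd{1, 2} = 1 and m = 3/2:

```python
def renewal_sequence(f, nmax):
    """p_0 = 1 and p_n = sum_{j=1}^n f_j p_{n-j}, as in (4.10); f[j-1] = f_j."""
    p = [1.0]
    for n in range(1, nmax + 1):
        p.append(sum(f[j - 1] * p[n - j] for j in range(1, min(n, len(f)) + 1)))
    return p

p = renewal_sequence([0.5, 0.5], 50)
print(p[50])   # close to 1/m = 2/3
```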
We now turn to the proof of Theorem 4.15.
Proof. If we take a fixed x ∈ X and consider f_n = P_x{τ_x = n}, then f_n and p_n = π^{(n)}(x, x) are related by (4.10) and m = E^{P_x}{τ_x}. In order to apply the renewal theorem we need to establish that the greatest common divisor of S = {n : f_n > 0} is 1. In general if f_n > 0 so is p_n, so the greatest common divisor of S is no smaller than that of {n : p_n > 0}; that by itself does not help us, even though the greatest common divisor of {n : p_n > 0} is 1. On the other hand if f_n = 0 unless n = kd for some k, the relation (4.10) can be used inductively to conclude that the same is true of p_n. Hence both sets have the same greatest common divisor. We can now conclude that
lim_{n→∞} π^{(n)}(x, x) = q(x) = 1/m(x)
On the other hand if f_n(x, y) = P_x{τ_y = n}, then
π^{(n)}(x, y) = Σ_{k=1}^n f_k(x, y) π^{(n−k)}(y, y)
and recurrence implies Σ_{k=1}^∞ f_k(x, y) = 1 for all x and y. Therefore
lim_{n→∞} π^{(n)}(x, y) = q(y) = 1/m(y)
and is independent of x, the starting point. In order to complete the proof we have to establish that
Q = Σ_y q(y) = 1
It is clear by Fatou's lemma that
Σ_y q(y) = Q ≤ 1
By letting n → ∞ in the Chapman-Kolmogorov equation
π^{(n+1)}(x, y) = Σ_z π^{(n)}(x, z) π(z, y)
and using Fatou's lemma we get
q(y) ≥ Σ_z π(z, y) q(z)
Summing with respect to y we obtain
Q ≥ Σ_{z,y} π(z, y) q(z) = Q
and equality holds in this relation. Therefore
q(y) = Σ_z π(z, y) q(z)
for every y, i.e. q(·) is an invariant measure. By iteration
q(y) = Σ_z π^{(n)}(z, y) q(z)
and if we let n → ∞ again, an application of the bounded convergence theorem yields
q(y) = Q q(y)
implying Q = 1 and we are done.
Let us now consider an irreducible Markov Chain with one step transition
probability π(x, y) that is periodic with period d > 1. Let us choose and fix
a reference point x0 ∈ X . For each x ∈ X let Dx0 ,x = {n : π (n) (x0 , x) > 0}.
Lemma 4.17. If n1 , n2 ∈ Dx0 ,x then d divides n1 − n2 .
Proof. Since the chain is irreducible there is an m such that π^{(m)}(x, x_0) > 0.
By the Chapman-Kolmogorov equations π (m+ni ) (x0 , x0 ) > 0 for i = 1, 2.
Therefore m + ni ∈ Dx0 = Dx0 ,x0 for i = 1, 2. This implies that d divides
both m + n1 as well as m + n2 . Thus d divides n1 − n2 .
CHAPTER 4. DEPENDENT RANDOM VARIABLES
132
The residue modulo d of all the integers in D_{x_0,x} is the same and equals some number r(x) satisfying 0 ≤ r(x) ≤ d − 1. By definition r(x_0) = 0. Let us define X_j = {x : r(x) = j}. Then {X_j : 0 ≤ j ≤ d − 1} is a partition of X into disjoint sets with x_0 ∈ X_0.
Lemma 4.18. If x ∈ X, then π^{(n)}(x, y) = 0 unless r(x) + n = r(y) mod d.
Proof. Suppose that x ∈ X and π(x, y) > 0. Then if m ∈ D_{x_0,x}, we have (m + 1) ∈ D_{x_0,y}. Therefore r(x) + 1 = r(y) modulo d. The proof can be completed by induction. The chain marches through {X_j} in a cyclical way, from a state in X_j to one in X_{j+1}.
Theorem 4.19. Let X be irreducible and positive recurrent with period d. Then
lim_{n→∞, n+r(x)=r(y) modulo d} π^{(n)}(x, y) = d/m(y)
Of course
π^{(n)}(x, y) = 0
unless n + r(x) = r(y) modulo d.
Proof. If we replace π by π̃ where π̃(x, y) = π^{(d)}(x, y), then π̃(x, y) = 0 unless both x and y are in the same X_j. The restriction of π̃ to each X_j defines an irreducible aperiodic Markov chain. Since each time step under π̃ is actually d units of time, we can apply the earlier results and we will get, for x, y ∈ X_j for some j,
lim_{k→∞} π^{(kd)}(x, y) = d/m(y)
We note that
π^{(n)}(x, y) = Σ_{1≤m≤n} f_m(x, y) π^{(n−m)}(y, y)
f_m(x, y) = P_x{τ_y = m} = 0 unless r(x) + m = r(y) modulo d
π^{(n−m)}(y, y) = 0 unless n − m = 0 modulo d
Σ_m f_m(x, y) = 1.
The theorem now follows.
Suppose now we have a chain that is not irreducible. Let us collect all
the transient states and call the set Xtr . The complement consists of all the
recurrent states and will be denoted by Xre .
Lemma 4.20. If x ∈ Xre and y ∈ Xtr , then π(x, y) = 0.
Proof. If x is a recurrent state, and π(x, y) > 0, the chain will return to x
infinitely often and each time there is a positive probability of visiting y. By
the renewal property these are independent events and so y will be recurrent
too.
The set of recurrent states X_re can be divided into one or more equivalence classes according to the following procedure. Two recurrent states x and y are in the same equivalence class if f(x, y) = P_x{τ_y < ∞}, the probability of ever visiting y starting from x, is positive. Because of recurrence, if f(x, y) > 0 then f(x, y) = f(y, x) = 1. The restriction of the chain to a single equivalence class is irreducible and possibly periodic. Different equivalence classes could have different periods, some could be positive recurrent and others null recurrent. We can combine all our observations into the following theorem.
Theorem 4.21. If y is transient then Σ_n π^{(n)}(x, y) < ∞ for all x. If y is null recurrent (belongs to an equivalence class that is null recurrent) then π^{(n)}(x, y) → 0 for all x, but Σ_n π^{(n)}(x, y) = ∞ if x is in the same equivalence class or x ∈ X_tr with f(x, y) > 0. In all other cases π^{(n)}(x, y) = 0 for all n ≥ 1. If y is positive recurrent and belongs to an equivalence class with period d with m(y) = E^{P_y}{τ_y}, then for a nontransient x, π^{(n)}(x, y) = 0 unless x is in the same equivalence class and r(x) + n = r(y) modulo d. In such a case,
lim_{n→∞, r(x)+n=r(y) modulo d} π^{(n)}(x, y) = d/m(y).
If x is transient then
lim_{n→∞, n=r modulo d} π^{(n)}(x, y) = f(r, x, y) d/m(y)
where
f(r, x, y) = P_x{X_{kd+r} = y for some k ≥ 0}.
Proof. The only statement that needs an explanation is the last one. The
chain starting from a transient state x may at some time get into a positive
recurrent equivalence class Xj with period d. If it does, it never leaves that
class and so gets absorbed in that class. The probability of this is f (x, y)
where y can be any state in Xj . However if the period d is greater than
1, there will be cyclical subclasses C1 , · · · , Cd of Xj . Depending on which
subclass the chain enters and when, the phase of its future is determined.
There are d such possible phases. For instance, if the subclasses are ordered
in the correct way, getting into C1 at time n is the same as getting into
C2 at time n + 1 and so on. f (r, x, y) is the probability of getting into the
equivalence class in a phase that visits the cyclical subclass containing y at
times n that are equal to r modulo d.
Example 4.1. (Simple Random Walk).
If X = Z^d, the integer lattice in R^d, a random walk is a Markov chain with transition probability π(x, y) = p(y − x) where {p(z)} specifies the probability distribution of a single step. We will assume for simplicity that p(z) = 0 except when z ∈ F, where F consists of the 2d neighbors of 0, and p(z) = 1/(2d) for each z ∈ F. For ξ ∈ R^d the characteristic function p̂(ξ) of p(·) is given by (1/d)(cos ξ_1 + cos ξ_2 + · · · + cos ξ_d). The chain is easily seen to be irreducible, but periodic of period 2. Return to the starting point is possible only after an even number of steps.
π^{(2n)}(0, 0) = (1/2π)^d ∫_{T^d} [p̂(ξ)]^{2n} dξ ≃ C/n^{d/2}.
To see this asymptotic behavior let us first note that the integration can be restricted to the set where |p̂(ξ)| ≥ 1 − δ, i.e. near the 2 points (0, 0, · · · , 0) and (π, π, · · · , π) where |p̂(ξ)| = 1. Since the behaviour is similar at both points let us concentrate near the origin. There
(1/d) Σ_{j=1}^d cos ξ_j ≤ 1 − c Σ_j ξ_j² ≤ exp[−c Σ_j ξ_j²]
for some c > 0, and
[(1/d) Σ_{j=1}^d cos ξ_j]^{2n} ≤ exp[−2nc Σ_j ξ_j²]
and with a change of variables the upper bound is clear. We have a similar lower bound as well. The random walk is recurrent if d = 1 or 2 but transient if d ≥ 3.
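In d = 1 the asymptotics can be seen exactly: π^{(2n)}(0, 0) = C(2n, n)/4^n, and Stirling's formula gives C(2n, n)/4^n ∼ 1/√(πn), which is the C/n^{d/2} decay with d = 1. A quick numerical sketch of this ratio:

```python
from math import comb, pi, sqrt

def p2n_1d(n):
    """pi^{(2n)}(0, 0) for the simple random walk on Z: C(2n, n) / 4^n."""
    return comb(2 * n, n) / 4 ** n

for n in (10, 100, 1000):
    print(n, p2n_1d(n) * sqrt(pi * n))   # ratio tends to 1
```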
Exercise 4.12. If the distribution p(·) is arbitrary, determine when the chain is irreducible and when it is irreducible and aperiodic.
Exercise 4.13. If Σ_z z p(z) = m ≠ 0, conclude that the chain is transient by an application of the strong law of large numbers.
Exercise 4.14. If Σ_z z p(z) = m = 0, and if the covariance matrix, given by Σ_z z_i z_j p(z) = σ_{i,j}, is nondegenerate, show that the transience or recurrence is determined by the dimension as in the case of the nearest neighbor random walk.
Exercise 4.15. Can you make sense of the formal calculation
Σ_n π^{(n)}(0, 0) = Σ_n (1/2π)^d ∫_{T^d} [p̂(ξ)]^n dξ
                 = (1/2π)^d ∫_{T^d} 1/(1 − p̂(ξ)) dξ
                 = (1/2π)^d ∫_{T^d} Real Part [1/(1 − p̂(ξ))] dξ
to conclude that a necessary and sufficient condition for transience or recurrence is the convergence or divergence of the integral
∫_{T^d} Real Part [1/(1 − p̂(ξ))] dξ
with an integrand
Real Part [1/(1 − p̂(ξ))]
that is seen to be nonnegative?
Hint: Consider instead the sum
Σ_{n=0}^∞ ρ^n π^{(n)}(0, 0) = Σ_n (1/2π)^d ∫_{T^d} ρ^n [p̂(ξ)]^n dξ
                            = (1/2π)^d ∫_{T^d} 1/(1 − ρp̂(ξ)) dξ
                            = (1/2π)^d ∫_{T^d} Real Part [1/(1 − ρp̂(ξ))] dξ
for 0 < ρ < 1 and let ρ → 1.
Example 4.2. (The Queue Problem).
In the example of customers arriving, except in the trivial cases of p_0 = 0 or p_0 + p_1 = 1, the chain is irreducible and aperiodic. Since the service rate is at most 1, if the arrival rate m = Σ_j j p_j > 1 then the queue will get longer, and by an application of the law of large numbers it is seen that the queue length will become infinite as time progresses. This is the transient behavior of the queue. If m < 1, one can expect the situation to be stable and there should be an asymptotic distribution for the queue length. If m = 1, it is the borderline case and one should probably expect this to be the null recurrent case. The actual proofs are not hard. In time n the actual number of customers served is at most n, because the queue may sometimes be empty. If {ξ_i : i ≥ 1} are the numbers of new customers arriving at times i and X_0 is the initial number in the queue, then the number X_n in the queue at time n satisfies X_n ≥ X_0 + (Σ_{i=1}^n ξ_i) − n, and if m > 1 it follows from the law of large numbers that lim_{n→∞} X_n = +∞, thereby establishing transience. To prove positive recurrence when m < 1 it is sufficient to prove that the equations
Σ_x q(x) π(x, y) = q(y)
have a nontrivial nonnegative solution such that Σ_x q(x) < ∞. We shall proceed to show that this is indeed the case. Since the equation is linear we can always normalize the solution so that Σ_x q(x) = 1. By iteration
Σ_x q(x) π^{(n)}(x, y) = q(y)
for every n. If lim_{n→∞} π^{(n)}(x, y) = 0 for every x and y, then because Σ_x q(x) = 1 < ∞, by the bounded convergence theorem the left hand side tends to 0 as n → ∞; therefore q ≡ 0 and is trivial. This rules out the transient and the null recurrent cases. In our case π(0, y) = p_y, and π(x, y) = p_{y−x+1} if y ≥ x − 1 and x ≥ 1. In all other cases π(x, y) = 0. The equations for {q_x = q(x)} are then
q_0 p_y + Σ_{x=1}^{y+1} q_x p_{y−x+1} = q_y   for y ≥ 0.   (4.13)
Multiplying equation (4.13) by z^y and summing over y ≥ 0, we get
q_0 P(z) + (1/z) P(z) [Q(z) − q_0] = Q(z)
where P(z) and Q(z) are the generating functions
P(z) = Σ_{x=0}^∞ p_x z^x,   Q(z) = Σ_{x=0}^∞ q_x z^x.
We can solve for Q to get
Q(z)/q_0 = P(z) [1 − (P(z) − 1)/(z − 1)]^{−1}
         = P(z) Σ_{k=0}^∞ [(P(z) − 1)/(z − 1)]^k
         = P(z) Σ_{k=0}^∞ [Σ_{j=1}^∞ p_j (1 + z + · · · + z^{j−1})]^k
which is a power series in z with nonnegative coefficients. If m < 1, we can let z → 1 to get
Q(1)/q_0 = Σ_{k=0}^∞ [Σ_{j=1}^∞ j p_j]^k = Σ_{k=0}^∞ m^k = 1/(1 − m) < ∞
proving positive recurrence.
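Since Q(1) = 1 after normalization, the formula above says q_0 = 1 − m. This is easy to confirm numerically by power iteration on a truncated version of the queue chain (the truncation size and the particular arrival distribution below are our own choices for illustration):

```python
def queue_stationary(p, size=200, iters=2000):
    """Approximate the stationary distribution of the queue chain
    pi(0, j) = p_j, pi(i, j) = p_{j-i+1} for i >= 1, truncated to
    {0, ..., size-1}; transitions past the truncation are dropped and
    the distribution renormalized."""
    q = [0.0] * size
    q[0] = 1.0
    for _ in range(iters):
        new = [0.0] * size
        for i in range(size):
            base = 0 if i == 0 else i - 1   # queue length after one service
            for j, pj in enumerate(p):
                if base + j < size:
                    new[base + j] += q[i] * pj
        total = sum(new)
        q = [v / total for v in new]
    return q

q = queue_stationary([0.5, 0.3, 0.2])   # m = 0.3 + 2 * 0.2 = 0.7 < 1
print(q[0])                             # close to 1 - m = 0.3
```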
The case m = 1 is a little bit harder. The calculations carried out earlier are still valid, and we know in this case that there exists q(x) ≥ 0 such that q(x) < ∞ for each x, Σ_x q(x) = ∞, and
Σ_x q(x) π(x, y) = q(y).
In other words the chain admits an infinite invariant measure. Such a chain cannot be positive recurrent. To see this we note
q(y) = Σ_x π^{(n)}(x, y) q(x)
and if the chain were positive recurrent
lim_{n→∞} π^{(n)}(x, y) = q̃(y)
would exist and Σ_y q̃(y) = 1. By Fatou's lemma
q(y) ≥ Σ_x q̃(y) q(x) = ∞
giving us a contradiction. To decide between transience and null recurrence a more detailed investigation is needed. We will outline a general procedure.
Suppose we have a state x_0 that is fixed and we would like to calculate F_{x_0}(ℓ) = P_{x_0}{τ_{x_0} ≤ ℓ}. If we can do this, then we can answer questions about transience, recurrence etc. If lim_{ℓ→∞} F_{x_0}(ℓ) < 1 then the chain is transient, and otherwise recurrent. In the recurrent case the convergence or divergence of
E^{P_{x_0}}{τ_{x_0}} = Σ_ℓ [1 − F_{x_0}(ℓ)]
determines if it is positive or null recurrent. If we can determine
F_y(ℓ) = P_y{τ_{x_0} ≤ ℓ}
for y ≠ x_0, then for ℓ ≥ 1
F_{x_0}(ℓ) = π(x_0, x_0) + Σ_{y≠x_0} π(x_0, y) F_y(ℓ − 1).
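For a finite chain this recursion, together with F_y(0) = 0, is a direct dynamic program. A minimal sketch (helper name ours); for the two-state chain with all transition probabilities 1/2, the return time to state 0 is geometric and F_0(ℓ) = 1 − 2^{−ℓ}.

```python
def hit_prob(pi, x0, ell):
    """F_y(l) = P_y{tau_{x0} <= l}, computed via the recursion
    F_y(l) = pi(y, x0) + sum_{z != x0} pi(y, z) F_z(l - 1), with F_y(0) = 0."""
    size = len(pi)
    F = [0.0] * size
    for _ in range(ell):
        F = [pi[y][x0] + sum(pi[y][z] * F[z] for z in range(size) if z != x0)
             for y in range(size)]
    return F

pi = [[0.5, 0.5], [0.5, 0.5]]
print(hit_prob(pi, 0, 3)[0])   # F_0(3) = 1 - 2^{-3} = 0.875
```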
We shall outline a procedure for determining, for λ > 0,
U(λ, y) = E_y{exp[−λτ_{x_0}]}.
Clearly U(x) = U(λ, x) satisfies
U(x) = e^{−λ} Σ_y π(x, y) U(y)   for x ≠ x_0   (4.14)
and U(x_0) = 1. One would hope that if we solve these equations then we have our U. This requires uniqueness. Since our U is bounded, in fact by 1, it is sufficient to prove uniqueness within the class of bounded solutions of equation (4.14). We will now establish that any bounded solution U of equation (4.14) with U(x_0) = 1 is given by
U(y) = U(λ, y) = E_y{exp[−λτ_{x_0}]}.
Let us define E_n = {X_1 ≠ x_0, X_2 ≠ x_0, · · · , X_{n−1} ≠ x_0, X_n = x_0}. Then we will prove, by induction, that for any solution U of equation (4.14), with U(λ, x_0) = 1,
U(y) = Σ_{j=1}^n e^{−λj} P_y{E_j} + e^{−λn} ∫_{τ_{x_0}>n} U(X_n) dP_y.   (4.15)
By letting n → ∞ we would obtain
U(y) = Σ_{j=1}^∞ e^{−λj} P_y{E_j} = E^{P_y}{e^{−λτ_{x_0}}}
because U is bounded and λ > 0.
∫_{τ_{x_0}>n} U(X_n) dP_y = e^{−λ} ∫_{τ_{x_0}>n} [Σ_y π(X_n, y) U(y)] dP_y
                          = e^{−λ} P_y{E_{n+1}} + e^{−λ} ∫_{τ_{x_0}>n} [Σ_{y≠x_0} π(X_n, y) U(y)] dP_y
                          = e^{−λ} P_y{E_{n+1}} + e^{−λ} ∫_{τ_{x_0}>n+1} U(X_{n+1}) dP_y
completing the induction argument. In our case, if we take x_0 = 0 and try U_σ(x) = e^{−σx} with σ > 0, then for x ≥ 1
Σ_y π(x, y) U_σ(y) = Σ_{y≥x−1} e^{−σy} p_{y−x+1}
                   = Σ_{y≥0} e^{−σ(x+y−1)} p_y
                   = e^{−σx} e^σ Σ_{y≥0} e^{−σy} p_y = ψ(σ) U_σ(x)
where
ψ(σ) = e^σ Σ_{y≥0} e^{−σy} p_y.
Let us solve e^λ = ψ(σ) for σ, which is the same as solving log ψ(σ) = λ, for λ > 0 to get a solution σ = σ(λ) > 0. Then
U(λ, x) = e^{−σ(λ)x} = E^{P_x}{e^{−λτ_0}}.
We see now that recurrence is equivalent to σ(λ) → 0 as λ ↓ 0, and positive recurrence to σ(λ) being differentiable at λ = 0. The function log ψ(σ) is convex and its slope at the origin is 1 − m. If m > 1 it dips below 0 initially for σ > 0 and then comes back up to 0 for some positive σ_0 before turning positive for good. In that situation lim_{λ↓0} σ(λ) = σ_0 > 0 and that is transience. If m < 1 then log ψ(σ) has a positive slope at the origin and σ'(0) = 1/(log ψ)'(0) = 1/(1 − m) < ∞. If m = 1, then log ψ has zero slope at the origin and σ'(0) = ∞. This concludes the discussion of this problem.
Example 4.3. (The Urn Problem.)
We now turn to a discussion of the urn problem. Here
π(p, q ; p + 1, q) = p/(p+q)   and   π(p, q ; p, q + 1) = q/(p+q)
and π is zero otherwise. In this case the equation
F(p, q) = [p/(p+q)] F(p + 1, q) + [q/(p+q)] F(p, q + 1)   for all p, q,
which will play a role later, has lots of solutions. In particular, F(p, q) = p/(p+q) is one, and for any 0 < x < 1
F_x(p, q) = (1/β(p, q)) x^{p−1} (1 − x)^{q−1},
where
β(p, q) = Γ(p)Γ(q)/Γ(p + q),
is a solution as well. The former is defined on p + q > 0 whereas the latter is defined only on p > 0, q > 0. Actually if p or q is initially 0 it remains so for
ever and there is nothing to study in that case. If f is a continuous function on [0, 1] then
F_f(p, q) = ∫_0^1 F_x(p, q) f(x) dx
is a solution, and if we want we can extend F_f by making it f(1) on q = 0 and f(0) on p = 0. It is a simple exercise to verify
lim_{p,q→∞, p/(p+q)→x} F_f(p, q) = f(x)
for any continuous f on [0, 1]. We will show that the ratio ξ_n = p_n/(p_n + q_n), which is random, stabilizes asymptotically (i.e. has a limit) to a random variable ξ, and if we start from p, q the distribution of ξ is the Beta distribution on [0, 1] with density
F_x(p, q) = (1/β(p, q)) x^{p−1} (1 − x)^{q−1}.
Suppose we have a Markov chain on some state space X with transition probability π(x, y) and U(x) is a bounded function on X that solves

U(x) = ∑_y π(x, y) U(y).

Such functions are called (bounded) harmonic functions for the chain. Consider the random variables ξ_n = U(X_n) for such a harmonic function. The ξ_n are uniformly bounded by the bound for U. If we denote by η_n = ξ_n − ξ_{n−1},
an elementary calculation reveals

E^{P_x}{η_{n+1}} = E^{P_x}{U(X_{n+1}) − U(X_n)} = E^{P_x}{ E^{P_x}{U(X_{n+1}) − U(X_n) | F_n} }

where F_n is the σ-field generated by X_0, · · · , X_n. But

E^{P_x}{U(X_{n+1}) − U(X_n) | F_n} = ∑_y π(X_n, y)[U(y) − U(X_n)] = 0.
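As a concrete instance, the harmonic property of U(p, q) = p/(p + q) for the urn chain of Example 4.3 can be verified exactly in rational arithmetic; the check below is purely illustrative and not part of the argument.

```python
from fractions import Fraction

def step_expectation(U, p, q):
    # E[U(X_{n+1}) | X_n = (p, q)] under the urn transitions
    # pi(p, q; p+1, q) = p/(p+q) and pi(p, q; p, q+1) = q/(p+q).
    return (Fraction(p, p + q) * U(p + 1, q)
            + Fraction(q, p + q) * U(p, q + 1))

def U(p, q):
    return Fraction(p, p + q)

# The one-step expected change vanishes exactly at every state tested:
for p in range(1, 8):
    for q in range(1, 8):
        assert step_expectation(U, p, q) == U(p, q)
print("U(p, q) = p/(p + q) is harmonic for the urn chain")
```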
A similar calculation shows that

E^{P_x}{η_n η_m} = 0

for m ≠ n. If we write

U(X_n) = U(X_0) + η_1 + η_2 + · · · + η_n

this is an orthogonal sum in L^2[P_x], and because U is bounded

E^{P_x}{|U(X_n)|^2} = |U(x)|^2 + ∑_{i=1}^{n} E^{P_x}{|η_i|^2} ≤ C

is bounded in n. Therefore lim_{n→∞} U(X_n) = ξ exists in L^2[P_x] and E^{P_x}{ξ} = U(x). Actually the limit exists almost surely and we will show it when we discuss martingales later. In our example if we take U(p, q) = p/(p + q), as remarked earlier, this is a harmonic function bounded by 1 and therefore

lim_{n→∞} p_n/(p_n + q_n) = ξ

exists in L^2[P_x].
exists in L2 [Px ]. Moreover if we take U(p, q) = Ff (p, q) for some continuous
f on [0, 1], because Ff (p, q) → f (x) as p, q → ∞ and pq → x, U(pn , qn ) has a
limit as n → ∞ and this limit has to be f (ξ). On the other hand
E Pp,q {U(pn , qn )} = U(p0 , q0 ) = Ff (p0 , q0 )
Z 1
1
=
f (x)xp0 −1 (1 − x)q0 −1 dx
β(p0, q0 ) 0
giving us
Z 1
1
f (x) xp−1 (1 − x)q−1 dx
E {f (ξ)} =
β(p, q) 0
thereby identifying the distribution of ξ under Pp,q as the Beta distribution
with the right parameters.
Pp,q
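This identification is easy to observe in simulation. The sketch below (parameters and sample sizes are illustrative choices) runs the urn repeatedly and compares the empirical mean and variance of ξ_n with those of the Beta(p, q) distribution, which for p = 2, q = 3 are p/(p+q) = 2/5 and pq/((p+q)^2 (p+q+1)) = 1/25.

```python
import random

def run_urn(p, q, steps, rng):
    # Urn step: with probability p/(p+q) add one to p, otherwise add
    # one to q; return the final fraction p/(p+q).
    for _ in range(steps):
        if rng.random() < p / (p + q):
            p += 1
        else:
            q += 1
    return p / (p + q)

rng = random.Random(0)
p0, q0 = 2, 3
samples = [run_urn(p0, q0, 1000, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)   # near the Beta(2, 3) values 0.4 and 0.04
```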
Example 4.4. (Branching Process). Consider a population in which each individual member replaces itself at the beginning of each day by a random number of offspring. Every individual has the same offspring distribution, but the numbers of offspring for different individuals are distributed independently of each other. The distribution of the number N of offspring is given by P[N = i] = p_i for i ≥ 0. If there are X_n individuals in the population on a given day, then the number of individuals X_{n+1} present on the next day has the representation

X_{n+1} = N_1 + N_2 + · · · + N_{X_n}
as the sum of X_n independent random variables, each having the offspring distribution {p_i : i ≥ 0}. X_n is seen to be a Markov chain on the set of nonnegative integers. Note that if X_n ever becomes zero, i.e. if every member on a given day produces no offspring, then the population becomes extinct and remains so.
If one uses generating functions, then the transition probabilities π_{i,j} of the chain satisfy

∑_j π_{i,j} z^j = ( ∑_j p_j z^j )^i.
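Equivalently, the i-th row of π is the i-fold convolution of {p_j}; a quick check with an arbitrary illustrative offspring law:

```python
def convolve(a, b):
    # Coefficient list of the product of two generating functions.
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

p = [0.2, 0.5, 0.3]          # illustrative offspring distribution
i = 3                        # number of individuals today

row = [1.0]                  # generating function of the constant 0
for _ in range(i):
    row = convolve(row, p)   # row[j] = pi_{i,j}

z = 0.7
lhs = sum(pij * z ** j for j, pij in enumerate(row))
rhs = sum(pj * z ** j for j, pj in enumerate(p)) ** i
print(lhs, rhs)              # the two sides of the identity agree
```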
What is the long time behavior of the chain?
Let us denote by m the expected number of offspring of any individual, i.e.

m = ∑_{i≥0} i p_i.

Then

E[X_{n+1} | F_n] = m X_n.
1. If m < 1, then the population becomes extinct sooner or later. This is easy to see. Consider

E[ ∑_{n≥0} X_n | F_0 ] = ∑_{n≥0} m^n X_0 = X_0/(1 − m) < ∞.

By an application of Fubini’s theorem, if S = ∑_{n≥0} X_n, then

E[S | X_0 = i] = i/(1 − m) < ∞

proving that P[S < ∞] = 1. In particular

P[ lim_{n→∞} X_n = 0 ] = 1.
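A simulation of the subcritical case agrees with E[S | X_0 = i] = i/(1 − m); the offspring law and sample size below are illustrative choices.

```python
import random

def total_progeny(x0, p, rng):
    # Run the branching chain to extinction; return S = sum_n X_n.
    x, s = x0, x0
    while x > 0:
        x = sum(rng.choices(range(len(p)), weights=p, k=x))
        s += x
    return s

rng = random.Random(1)
p = [0.5, 0.3, 0.2]           # illustrative subcritical law, m = 0.7
x0 = 5
runs = [total_progeny(x0, p, rng) for _ in range(5000)]
print(sum(runs) / len(runs))  # near x0/(1 - m) = 16.66...
```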
2. If m = 1 and p_1 = 1, then X_n ≡ X_0 and the population size never changes, each individual replacing itself every time by exactly one offspring.
3. If m = 1 and p_1 < 1, then p_0 > 0, and there is a positive probability q(i) = q^i that the population becomes extinct when it starts with i individuals. Here q is the probability of the population becoming extinct when we start with X_0 = 1. If we have initially i individuals, each of the i family lines has to become extinct for the entire population to become extinct. The number q must therefore be a solution of the equation

q = P(q)

where P(z) is the generating function

P(z) = ∑_{i≥0} p_i z^i.

If we show that the equation P(z) = z has only the solution z = 1 in 0 ≤ z ≤ 1, then the population becomes extinct with probability 1, although E[S] = ∞ in this case. If P(1) = 1 and P(a) = a for some 0 ≤ a < 1, then by the mean value theorem applied to P(z) − z we must have P′(z) = 1 for some 0 < z < 1. But if 0 < z < 1

P′(z) = ∑_{i≥1} i z^{i−1} p_i < ∑_{i≥1} i p_i = 1,

a contradiction.
4. If m > 1 but p_0 = 0 the problem is trivial. There is no chance of the population becoming extinct. Let us assume that p_0 > 0. The equation P(z) = z has another solution z = q besides z = 1, in the range 0 < z < 1. This is seen by considering the function g(z) = P(z) − z. We have g(0) > 0, g(1) = 0, g′(1) > 0, which implies another root. But g(z) is convex and therefore there can be at most one more root. If we can rule out the possibility of the extinction probability being equal to 1, then this root q must be the extinction probability when we start with a single individual at time 0. Let us denote by q_n the probability of extinction within n days. Then

q_{n+1} = ∑_i p_i q_n^i = P(q_n)

and q_1 < 1. A simple consequence of the monotonicity of P(z) and the inequalities P(z) > z for z < q and P(z) < z for q < z < 1 is that if we start with any a < 1 and iterate q_{n+1} = P(q_n) with q_1 = a, then q_n → q.
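The iteration q_{n+1} = P(q_n) is also a practical way to compute q. A sketch with an illustrative supercritical law, for which the smaller root of P(z) = z can be found by hand (0.5z^2 − 0.75z + 0.25 = 0 gives z = 1/2 or z = 1):

```python
def P(z, p):
    # Offspring generating function P(z) = sum_i p_i z^i.
    return sum(pi * z ** i for i, pi in enumerate(p))

p = [0.25, 0.25, 0.5]   # illustrative law with m = 0.25 + 2*0.5 = 1.25 > 1
q = 0.0
for _ in range(200):    # q_1 = P(0) = p_0 and q_n increases to q
    q = P(q, p)
print(q)                # converges to 1/2, the smaller root of P(z) = z
```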
If the population does not become extinct, one can show that it has
to grow indefinitely. This is best done using martingales and we will
revisit this example later as Example 5.6.
Example 4.5. Let X be the set of integers. Assume that transitions from x
are possible only to x − 1, x, and x + 1. The transition matrix π(x, y) appears
as a tridiagonal matrix with π(x, y) = 0 unless |x − y| ≤ 1. For simplicity let
us assume that π(x, x), π(x, x − 1) and π(x, x + 1) are positive for all x.
The chain is then irreducible and aperiodic. Let us try to solve for
U(x) = Px {τ0 = ∞}
that satisfies the equation
U(x) = π(x, x − 1) U(x − 1) + π(x, x) U(x) + π(x, x + 1) U(x + 1)
for x 6= 0 with U(0) = 0. The equations decouple into a set for x > 0 and a
set for x < 0. If we denote by V (x) = U(x + 1) − U(x) for x ≥ 0, then we
always have
U(x) = π(x, x − 1) U(x) + π(x, x) U(x) + π(x, x + 1) U(x)
so that

π(x, x − 1) V(x − 1) − π(x, x + 1) V(x) = 0

or

V(x)/V(x − 1) = π(x, x − 1)/π(x, x + 1)

and therefore

V(x) = V(0) ∏_{i=1}^{x} π(i, i − 1)/π(i, i + 1)

and

U(x) = V(0) [ 1 + ∑_{y=1}^{x−1} ∏_{i=1}^{y} π(i, i − 1)/π(i, i + 1) ].
If the chain is to be transient we must have, for some choice of V(0), 0 ≤ U(x) ≤ 1 for all x > 0, and this will be possible only if

∑_{y=1}^{∞} ∏_{i=1}^{y} π(i, i − 1)/π(i, i + 1) < ∞

which then becomes a necessary condition for

P_x{τ_0 = ∞} > 0

for x > 0. There is a similar condition on the negative side:

∑_{y=1}^{∞} ∏_{i=1}^{y} π(−i, −i + 1)/π(−i, −i − 1) < ∞.
Transience needs at least one of the two series to converge. Actually the converse is also true. If, for instance, the series on the positive side converges then we get a function U(x) with 0 ≤ U(x) ≤ 1 and U(0) = 0 that satisfies

U(x) = π(x, x − 1) U(x − 1) + π(x, x) U(x) + π(x, x + 1) U(x + 1)

and by iteration one can prove that for each n,

U(x) = ∫_{τ_0 > n} U(X_n) dP_x ≤ P_x{τ_0 > n}

so the existence of a nontrivial U implies transience.
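For a concrete check, consider a chain with constant ratios π(i, i − 1)/π(i, i + 1) = 3/4 for i ≥ 1 (say π(x, x + 1) = 0.4, π(x, x) = 0.3, π(x, x − 1) = 0.3 for x > 0, an illustrative choice). The positive-side series is then geometric and converges, consistent with transience:

```python
def positive_side_series(ratio, terms=10000):
    # Partial sum of sum_{y>=1} prod_{i=1}^{y} pi(i,i-1)/pi(i,i+1),
    # where ratio(i) = pi(i, i-1) / pi(i, i+1).
    total, prod = 0.0, 1.0
    for i in range(1, terms + 1):
        prod *= ratio(i)
        total += prod
    return total

# Constant ratio 0.3/0.4 = 3/4 gives a geometric series with sum 3:
s = positive_side_series(lambda i: 0.3 / 0.4)
print(s)
```

Normalizing by 1 + s, the choice V(0) = 1/4 makes the formula for U(x) increase to 1 as x → ∞.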
Exercise 4.16. Determine the conditions for positive recurrence in the previous example.
Exercise 4.17. We replace the set of integers by the set of nonnegative integers and assume that π(0, y) = 0 for y ≥ 2. Such processes are called birth
and death processes. Work out the conditions in that case.
Exercise 4.18. In the special case of a birth and death process with π(0, 1) = π(0, 0) = 1/2, and for x ≥ 1, π(x, x) = 1/3, π(x, x − 1) = 1/3 + a_x, π(x, x + 1) = 1/3 − a_x with a_x = λ/x^α for large x, find conditions on positive α and real λ for the chain to be transient, null recurrent and positive recurrent.
Exercise 4.19. The notion of a Markov chain makes sense for a finite chain X_0, · · · , X_n. Formulate it precisely. Show that if the chain {X_j : 0 ≤ j ≤ n} is Markov so is the reversed chain {Y_j : 0 ≤ j ≤ n} where Y_j = X_{n−j} for 0 ≤ j ≤ n. Can the transition probabilities of the reversed chain be determined from the transition probabilities of the forward chain? If the forward chain has stationary transition probabilities does the same hold true for the reversed chain? What if we assume that the chain has a finite invariant probability distribution and we initialize the chain to start with an initial distribution which is the invariant distribution?
Exercise 4.20. Consider the simple chain on the nonnegative integers with the following transition probabilities: π(0, x) = p_x for x ≥ 0 with ∑_{x=0}^{∞} p_x = 1. For x > 0, π(x, x − 1) = 1 and π(x, y) = 0 for all other y. Determine conditions on {p_x} in order that the chain may be transient, null recurrent or positive recurrent. Determine the invariant probability measure in the positive recurrent case.
Exercise 4.21. Show that any null recurrent equivalence class must necessarily contain an infinite number of states. In particular any Markov chain with a finite state space has only transient and positive recurrent states and moreover the set of positive recurrent states must be nonempty.