Lecture 8
1 Convergence of Markov chain transition kernels
Theorem 1.1 [Convergence of transition kernels] Let X be an irreducible aperiodic
Markov chain with countable state space S. If the chain is transient or null recurrent, then

    lim_{n→∞} Π^n(x, y) = 0    ∀ x, y ∈ S.    (1.1)

If the chain is positive recurrent with stationary distribution µ, then

    lim_{n→∞} Π^n(x, y) = µ(y)    ∀ x, y ∈ S.    (1.2)
Theorem 1.1 is in fact equivalent to the Renewal Theorem, which was proved last time. When
the Markov chain is positive recurrent, Theorem 1.1 admits a simpler proof by coupling, which
we now explain.
Proof of Theorem 1.1 with positive recurrence: When X is positive recurrent, the
Markov chain admits a unique stationary probability distribution µ. Let X^1 be a copy of the
Markov chain with initial distribution µ. Let X^2 be an independent copy of the Markov chain
with initial distribution δ_x for a fixed x ∈ S. Then we claim that τ := inf{n ≥ 0 : X^1_n =
X^2_n} < ∞ almost surely. If the claim holds, then we can couple X^1 and X^2 by defining two
new Markov chains X̃^1 and X̃^2 on the same probability space such that X̃^i_n = X^i_n for n ≤ τ,
and X̃^i_n = X^1_n for n > τ, i.e., the second Markov chain starts following the trajectory of the
first Markov chain as soon as they meet. Using the strong Markov property of the pair of
independent chains (X^1, X^2), it is clear that X̃^i is equally distributed with X^i. The claim
τ < ∞ almost surely implies that P(X̃^1_n ≠ X̃^2_n) ↓ 0 as n → ∞. Since µ(y) = P(X̃^1_n = y) for all
n ∈ N, while Π^n(x, y) = P(X̃^2_n = y), we have
    (1/2) ∑_{y∈S} |Π^n(x, y) − µ(y)| = (1/2) ∑_{y∈S} |E[1_{X̃^2_n = y} − 1_{X̃^1_n = y}]| ≤ P(X̃^1_n ≠ X̃^2_n) = P(τ > n),
which decreases to 0 as n → ∞. This proves the convergence of Π^n(x, ·) to µ in total variation
distance, which implies (1.2).
To verify that τ < ∞ a.s., we only need to check that (X^1, X^2) defines an irreducible
recurrent Markov chain. The fact that (X^1, X^2) is Markov is clear. By the irreducibility
and aperiodicity of X^1, P_x(X^1_n = y) > 0 for all n sufficiently large, for any given y. By the
independence of X^1 and X^2, it follows that for any two pairs (x_1, x_2) and (y_1, y_2),

    P_{(x_1,x_2)}((X^1_n, X^2_n) = (y_1, y_2)) = P_{x_1}(X^1_n = y_1) P_{x_2}(X^2_n = y_2) > 0

for all n sufficiently large, which implies the irreducibility of (X^1, X^2). Clearly the product
measure µ × µ on S × S is a stationary probability distribution for (X^1, X^2), which implies
that (X^1, X^2) must be a positive recurrent Markov chain. This implies that τ < ∞ a.s.
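The convergence in (1.2) is easy to check numerically on a small example. The following sketch uses a hypothetical 3-state transition matrix, computes µ as the left eigenvector of Π for eigenvalue 1, and verifies that the total variation distance between Π^n(x, ·) and µ is negligible for moderate n:

```python
import numpy as np

# A hypothetical irreducible aperiodic chain on 3 states (rows sum to 1).
Pi = np.array([[0.5, 0.3, 0.2],
               [0.2, 0.6, 0.2],
               [0.3, 0.3, 0.4]])

# Stationary distribution: the left eigenvector of Pi for eigenvalue 1.
w, V = np.linalg.eig(Pi.T)
mu = np.real(V[:, np.argmin(np.abs(w - 1))])
mu /= mu.sum()

# Total variation distance (1/2) sum_y |Pi^n(x, y) - mu(y)| decays to 0.
Pn = np.linalg.matrix_power(Pi, 20)
tv = 0.5 * np.abs(Pn - mu).sum(axis=1)   # one value per starting state x
assert np.all(tv < 1e-6)
```

Here the second-largest eigenvalue modulus of this particular matrix is well below 1, so twenty steps already bring every row of Π^n within 10⁻⁶ of µ in total variation.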
Finally, we give an account of what happens when the Markov chain has period d > 1. A
simple example is the simple random walk on Z^d, which has period 2. The first observation
is that the state space S can be partitioned into d disjoint classes S0 , S1 , · · · , Sd−1 , and the
Markov chain simply marches through these d classes sequentially. Let us make this statement
more precise.
Lemma 1.2 For x, y ∈ S, let D_{x,y} = {n ≥ 0 : Π^n(x, y) > 0}. Then d divides m − n for any
m, n ∈ D_{x,y}.
Proof. By irreducibility, there exists k ∈ N with Π^k(y, x) > 0. Therefore Π^{k+m}(y, y) ≥
Π^k(y, x)Π^m(x, y) > 0 and k + m ∈ D_{y,y}. Similarly k + n ∈ D_{y,y}. Since the period d divides
every element of D_{y,y}, d divides (k + m) − (k + n) = m − n.
By Lemma 1.2, for a fixed x ∈ S, each y ∈ S is associated with an r_y ∈ {0, 1, · · · , d − 1},
where r_y is the residue modulo d of any n ∈ D_{x,y}. Let S_i := {y ∈ S : r_y = i} for i =
0, 1, · · · , d − 1. Then S is the disjoint union of S_0, · · · , S_{d−1}, and the Markov chain marches
through S_0, S_1, · · · , S_{d−1} in this order cyclically.
Theorem 1.3 [Convergence of transition kernels: periodic case] Let X be an irreducible Markov chain with countable state space S and period d > 1. If the chain is transient
or null recurrent, then

    lim_{n→∞} Π^n(x, y) = 0    ∀ x, y ∈ S.    (1.3)

If the chain is positive recurrent with stationary distribution µ, then

    lim_{n→∞, r_x + n ≡ r_y (mod d)} Π^n(x, y) = dµ(y)    ∀ x, y ∈ S.    (1.4)
Proof. Let us consider the transition kernel Π̃ = Π^d. Clearly Π̃(x, y) > 0 only if x, y belong to
the same class S_r for some 0 ≤ r ≤ d − 1. Restricted to each S_r, the associated Markov chain
is irreducible and, furthermore, aperiodic. In fact, it is simply X_n restricted to a d-periodic
subsequence of times. We can then apply Theorem 1.1. The result follows once we observe
that the stationary distribution µ̃_r(·) for the Markov chain on S_r with transition kernel Π̃
must equal dµ(·) restricted to S_r.
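Theorem 1.3 can be illustrated with the simple random walk on the 4-cycle, a period-2 chain whose stationary distribution is uniform, so dµ(y) = 2 · (1/4) = 1/2. A minimal sketch:

```python
import numpy as np

# Simple random walk on the 4-cycle: irreducible with period d = 2,
# stationary distribution mu uniform, mu(y) = 1/4.
Pi = np.zeros((4, 4))
for i in range(4):
    Pi[i, (i + 1) % 4] = 0.5
    Pi[i, (i - 1) % 4] = 0.5

P2n = np.linalg.matrix_power(Pi, 20)   # even n: same residue class as the start
# Along the correct residue class, Pi^n(x, y) -> d * mu(y) = 2 * (1/4) = 1/2:
assert abs(P2n[0, 0] - 0.5) < 1e-9
assert abs(P2n[0, 2] - 0.5) < 1e-9
# Off the class, Pi^n(x, y) = 0, as the chain marches through the classes:
assert P2n[0, 1] == 0 and P2n[0, 3] == 0
```

After an even number of steps the walk started at 0 sits in the even class {0, 2}, and the restricted chain there equilibrates to the doubled stationary weights.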
2 Perron-Frobenius Theorem
When the state space S is finite, everything boils down to the study of the finite-dimensional
transition matrix Π. When Π has positive entries, Perron’s theorem asserts that 1 is the
dominant eigenvalue with a positive eigenvector.
Theorem 2.1 [Perron’s Theorem] Let P be an n×n matrix with all positive entries. Then
P has a dominant eigenvalue λ such that
(i) λ > 0, and its associated eigenvector h has all positive entries.
(ii) λ is a simple eigenvalue.
(iii) Any other eigenvalue κ of P satisfies |κ| < λ.
(iv) P has no other eigenvector with all non-negative entries.
Proof. Let T := {t ≥ 0 : P v ≥ tv for some v ∈ [0, ∞)^n, v ≢ 0}. Then P_{ii} ∈ T since
P e_i ≥ P_{ii} e_i, and T ⊂ [0, ∑_{1≤i,j≤n} P_{ij}] since |P v|_∞ ≤ (∑_{1≤i,j≤n} P_{ij}) |v|_∞. Furthermore, T is a closed
set: if t_k → t with t_k ∈ T and P v_k ≥ t_k v_k, where without loss of generality we may assume
|v_k|_1 = 1, then we can find a subsequence v_{k_i} that converges to a limiting non-negative
vector v_∞ with |v_∞|_1 = 1. Then we see that P v_∞ ≥ t v_∞, which proves that T is closed.
Let λ > 0 be the maximum of T, and let P v ≥ λv for some non-negative vector v ≠ 0.
We claim that in fact P v = λv and v ∈ (0, ∞)^n. Indeed, if P v − λv ≥ 0 and P v − λv ≠ 0,
then by the positivity assumption on the entries of P, we have P²v − λP v > 0, which implies
that there exists some λ′ > λ with P²v − λ′P v ≥ 0. Since P v ≥ 0, this implies that λ′ ∈ T,
contradicting our assumption that λ is the maximum of T. Therefore P v = λv. Since P has
positive entries and v ≢ 0, λv = P v > 0. This proves (i).
To prove (ii), note that if P w = λw for an eigenvector w distinct from any constant
multiple of v, then there exists c ∈ R such that w + cv ≥ 0, w + cv ≢ 0, and w + cv has at
least one component which equals 0. Since P(w + cv) = λ(w + cv) > 0 by the positivity of
P, this creates a contradiction. To conclude that λ is a simple eigenvalue of P, it remains to
rule out the existence of a generalized eigenvector w with eigenvalue λ, i.e., (P − λ)w = cv for
some c ≠ 0. Replacing w by −w if necessary, we can assume c > 0. Replacing w by w + bv if
necessary, we can guarantee that w > 0. Then (P − λ)w = cv > 0 implies that max T > λ, a
contradiction.
To prove (iii), note that if P w = κw for some κ ≠ λ, where κ and w could both be
complex, then

    |κ||w| = |P w| ≤ P |w|,    (2.5)

where |w| = |(w_1, · · · , w_n)| = (|w_1|, · · · , |w_n|). Therefore |κ| ∈ T and hence |κ| ≤ λ. The
inequality in (2.5) is in fact strict, which implies |κ| < λ, unless w = e^{iθ} w_0 for some θ ∈ R and
w_0 ≥ 0, in which case κ ∈ (0, λ]. Therefore there is no other eigenvalue κ with |κ| ≥ λ. This
proves (iii).
By (i)–(iii) applied to P^T, which has the same eigenvalues as P, there exists a w with all
positive entries such that P^T w = λw. Suppose that h is a non-negative eigenvector of P
with eigenvalue λ′ ≠ λ. Then

    λ′⟨w, h⟩ = ⟨w, P h⟩ = ⟨P^T w, h⟩ = λ⟨w, h⟩.

Since λ′ ≠ λ, we have ⟨w, h⟩ = 0, which is not possible if h has all non-negative entries and
h ≢ 0, since w > 0. This proves (iv).
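The conclusions of Perron's Theorem can be checked numerically on a random positive matrix (the size and entry range below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.uniform(0.1, 1.0, size=(5, 5))   # all entries strictly positive

w, V = np.linalg.eig(P)
i = np.argmax(np.abs(w))
lam = w[i]

# (i) the dominant eigenvalue is real and positive, with a positive eigenvector
assert abs(lam.imag) < 1e-10 and lam.real > 0
h = np.real(V[:, i])
h = h if h[0] > 0 else -h                # normalize the sign of the eigenvector
assert np.all(h > 0)

# (iii) every other eigenvalue is strictly smaller in modulus
others = np.delete(w, i)
assert np.all(np.abs(others) < lam.real)
```

The spectral gap here is large: for a positive matrix the dominant eigenvalue is well separated from the rest of the spectrum.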
Remark 2.2 For a transition probability matrix Π, clearly |Πv|_∞ ≤ |v|_∞ for any vector v,
and Π1 = 1. Therefore 1 is an eigenvalue of Π and every other eigenvalue κ of Π has |κ| ≤ 1.
When Π is the transition matrix of an irreducible aperiodic Markov chain, there exists n_0 ∈ N
such that for all n ≥ n_0, Π^n has positive entries. Therefore by Perron’s Theorem, 1 is a simple
dominant eigenvalue of Π^n for n ≥ n_0. It is then easy to see that 1 is also a simple dominant
eigenvalue of Π, since any (generalized) eigenvector of Π with eigenvalue λ must also be
an eigenvector or generalized eigenvector of Π^n with eigenvalue λ^n, and conversely, any
(generalized) eigenvector of Π^n with eigenvalue λ must also be an eigenvector or generalized
eigenvector of Π with eigenvalue λ^{1/n} for some n-th root of λ.
Remark 2.3 Perron’s Theorem applied to the transpose of a positive transition matrix Π
implies the existence of a positive stationary probability distribution µ. The fact that 1 is a
simple dominant eigenvalue of Π^T implies that starting from any probability measure ν on
{1, · · · , n}, (Π^T)^n ν converges to µ exponentially fast.
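The exponential convergence of (Π^T)^n ν in Remark 2.3 is visible in a few lines (the 3-state matrix below is a hypothetical example): the L¹ distance to µ is non-increasing and is driven to zero at a geometric rate governed by the second-largest eigenvalue modulus.

```python
import numpy as np

# A hypothetical positive transition matrix on 3 states (rows sum to 1).
Pi = np.array([[0.4, 0.4, 0.2],
               [0.3, 0.4, 0.3],
               [0.2, 0.4, 0.4]])

# Stationary mu from the eigenvector of Pi^T at eigenvalue 1.
w, V = np.linalg.eig(Pi.T)
mu = np.real(V[:, np.argmin(np.abs(w - 1))])
mu /= mu.sum()

nu = np.array([1.0, 0.0, 0.0])       # start from a point mass
errs = []
for n in range(30):
    errs.append(np.abs(nu - mu).sum())
    nu = Pi.T @ nu                    # one step of (Pi^T)^n nu

# The L^1 error decreases monotonically and is tiny after 30 steps.
assert errs[-1] < 1e-8
assert all(e2 <= e1 + 1e-12 for e1, e2 in zip(errs, errs[1:]))
```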
The case when P is only assumed to be non-negative is covered by Frobenius’s Theorem.
Theorem 2.4 [Frobenius’ Theorem] Let P be an n × n matrix with non-negative entries
which are not all 0. Then P has an eigenvalue λ with the following properties:
(i) λ ≥ 0 and there exists an associated eigenvector with non-negative entries.
(ii) Any other eigenvalue κ of P satisfies |κ| ≤ λ.
(iii) If |κ| = λ, then κ = e^{2πik/m} λ for some k, m ∈ N with m ≤ n.
Remark 2.5 Let us analyze the dominant eigenvalues of the transition matrix Π of a d-periodic
irreducible Markov chain. Note that Π^d is of block diagonal form, where Π^d(i, j) > 0 only if
i and j belong to the same class, and there are exactly d such classes. Restricted to each
class S_i ⊂ {1, · · · , n}, Π^d is an aperiodic transition matrix, which has 1 as a simple dominant
eigenvalue by Perron’s Theorem and Remark 2.2. Therefore Π^d has 1 as the dominant eigenvalue with multiplicity d. Consequently, counting multiplicity, Π has exactly d eigenvalues
with modulus 1, and any other eigenvalue κ of Π has |κ| < 1. We claim that these eigenvalues
are precisely e^{2πik/d} with k = 0, 1, · · · , d − 1. Indeed, (Π^d)^T has d linearly independent eigenvectors with eigenvalue 1, which are just the restrictions of the invariant measure µ of the Markov chain to the d
classes of states S_i, 0 ≤ i ≤ d − 1. Denote the restriction of µ to S_i by µ_i. Then Π^T µ_i = µ_{i+1}
for 0 ≤ i ≤ d − 2 and Π^T µ_{d−1} = µ_0. The subspace V spanned by (µ_i)_{0≤i≤d−1} is preserved by
Π^T, and on V, if we choose the µ_i as basis vectors, then Π^T acts as a cyclic permutation matrix,
whose characteristic polynomial is λ^d − 1. Therefore the set of eigenvalues of Π^T restricted
to V is precisely {e^{2πik/d} : 0 ≤ k ≤ d − 1}. This exhausts the possible eigenvalues of modulus
1 for Π^T, and hence also for Π.
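Remark 2.5 can be verified on a concrete block-cyclic chain. The sketch below builds a hypothetical 6-state chain of period d = 3 (classes {0,1} → {2,3} → {4,5}) and checks that the modulus-1 eigenvalues are exactly the cube roots of unity, with all other eigenvalues strictly inside the unit disk:

```python
import numpy as np

# 6-state chain with period d = 3: classes {0,1} -> {2,3} -> {4,5} -> {0,1}.
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # an arbitrary stochastic 2x2 block
Pi = np.zeros((6, 6))
Pi[0:2, 2:4] = A
Pi[2:4, 4:6] = A
Pi[4:6, 0:2] = A

w = np.linalg.eigvals(Pi)
mod1 = w[np.abs(np.abs(w) - 1) < 1e-9]
# Exactly d eigenvalues of modulus 1: the cube roots of unity e^{2*pi*i*k/3}.
assert len(mod1) == 3
assert np.allclose(sorted(np.angle(mod1)), [-2*np.pi/3, 0.0, 2*np.pi/3])
# All other eigenvalues have modulus strictly less than 1.
rest = w[np.abs(np.abs(w) - 1) >= 1e-9]
assert np.all(np.abs(rest) < 1)
```

Here Π is the Kronecker product of a 3-cycle permutation with A, so its spectrum is the set of products of cube roots of unity with the eigenvalues of A, consistent with the claim.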
For a proof of the Frobenius theorem, see Lax [1, Chapter 16]. There are also infinite-dimensional versions of the Perron-Frobenius Theorem for positive compact operators.
3 Reversible Markov chains
We now consider a special class of Markov chains called reversible Markov chains.
Definition 3.1 [Reversible Markov chains] A Markov chain with countable state space
S and transition matrix Π is called reversible if it admits a stationary measure µ, called a
reversible measure, which satisfies

    µ(x)Π(x, y) = µ(y)Π(y, x)    ∀ x, y ∈ S.    (3.6)
Remark. In the physics literature, (3.6) is called the detailed balance condition. Heuristically,
µ(x)Π(x, y) represents the probability flow from x to y in equilibrium. Condition (3.6) requires
that the probability flow from x to y equals the flow from y to x, which is a sufficient condition
for µ to be stationary. Reversing the flow can be interpreted as reversing time. Therefore the
time evolution of a Markov chain under a reversible measure is invariant under time reversal.
Not all stationary measures are reversible measures, as can be seen for the uniform measure
µ ≡ 1 for an asymmetric simple random walk on Z. Note that the notion of a reversible Markov
chain is always accompanied by a reversible measure.
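The asymmetric-walk example can be checked directly. Below, an asymmetric walk on the 4-cycle stands in for Z as a finite (hypothetical) example: the uniform measure is stationary, yet detailed balance (3.6) fails whenever p ≠ 1/2.

```python
import numpy as np

# Asymmetric walk on the 4-cycle: step +1 w.p. p, step -1 w.p. 1-p.
p = 0.7
Pi = np.zeros((4, 4))
for i in range(4):
    Pi[i, (i + 1) % 4] = p
    Pi[i, (i - 1) % 4] = 1 - p

mu = np.full(4, 0.25)                 # uniform measure
assert np.allclose(mu @ Pi, mu)       # stationary: mu Pi = mu
# ...but detailed balance fails: mu(x)Pi(x,y) != mu(y)Pi(y,x) for p != 1/2,
# i.e., the matrix diag(mu) Pi is not symmetric.
D = np.diag(mu)
assert not np.allclose(D @ Pi, (D @ Pi).T)
```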
Theorem 3.2 [Cycle condition for reversibility] Let X be an irreducible Markov chain
with countable state space S and transition matrix Π. A necessary and sufficient condition
for the existence of a reversible measure for X is that
(i) Π(x, y) > 0 if and only if Π(y, x) > 0.
(ii) For any loop x_0, x_1, · · · , x_n = x_0 with ∏_{i=1}^n Π(x_{i−1}, x_i) > 0, we have

    ∏_{i=1}^n Π(x_{i−1}, x_i) = ∏_{i=1}^n Π(x_i, x_{i−1}).    (3.7)
Proof. Suppose µ is a reversible measure for X with µ ≢ 0. Then by irreducibility and the
stationarity of µ, µ(x) > 0 for all x ∈ S. The detailed balance condition (3.6) clearly implies
(i). Similarly, reversibility implies that the probability flow along the cycle x_0, x_1, · · · , x_n =
x_0, i.e., µ(x_0) ∏_{i=1}^n Π(x_{i−1}, x_i), equals the probability flow along the reversed cycle x_n =
x_0, x_{n−1}, · · · , x_0, which is just µ(x_0) ∏_{i=1}^n Π(x_i, x_{i−1}). This yields (3.7).
Conversely, if Π satisfies (i) and (ii), then for a given x ∈ S, we can define µ(x) = 1,
and for any y ∈ S with a path of states z_0 = x, z_1, · · · , z_n = y connecting x and y such that
Π(z_{i−1}, z_i) > 0, we can define

    µ(y) = ∏_{i=1}^n Π(z_{i−1}, z_i) / Π(z_i, z_{i−1}).

Conditions (i) and (ii) guarantee that our definition of µ is independent of the choice of path
connecting x and y. It is easy to check that µ satisfies the detailed balance condition (3.6).
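The path construction in the converse direction can be carried out concretely for a birth-death chain (the jump probabilities below are hypothetical): build µ along the path 0 → 1 → · · · → y and verify detailed balance.

```python
import numpy as np

# A birth-death chain on {0,...,4} with hypothetical jump probabilities.
up   = [0.6, 0.5, 0.4, 0.3]    # Pi(i, i+1)
down = [0.2, 0.3, 0.4, 0.5]    # Pi(i+1, i)
Pi = np.zeros((5, 5))
for i in range(4):
    Pi[i, i + 1] = up[i]
    Pi[i + 1, i] = down[i]
np.fill_diagonal(Pi, 1 - Pi.sum(axis=1))   # holding prob. makes rows sum to 1

# Build mu along the path 0 -> 1 -> ... -> y as in the proof:
mu = np.ones(5)
for i in range(4):
    mu[i + 1] = mu[i] * up[i] / down[i]

# mu satisfies detailed balance: diag(mu) Pi is symmetric.
D = np.diag(mu)
assert np.allclose(D @ Pi, (D @ Pi).T)
```

Since the state graph of a birth-death chain is a path, there is only one route between any two states, so the construction cannot depend on the choice of path.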
Example 3.3
1. An irreducible birth-death chain is reversible, since Theorem 3.2 (i) follows from irreducibility, and the cycle condition (3.7) is trivially satisfied due to the lack
of cycles. Similarly, any Markov chain on a tree with transitions between neighboring
vertices is reversible.
2. A random walk on a connected graph G = (V, E) with vertex set V and edge set E is a
Markov chain with state space V and transition matrix Π(x, y) = 1/d_x for all y ∈ V with
{x, y} ∈ E, where d_x is the degree of x in G. It is easily seen that µ(x) = d_x is a reversible
measure for the walk, with unit flow across each edge. More generally, if each edge
{x, y} ∈ E is assigned a positive conductance C_{x,y} = C_{y,x}, and Π(x, y) = C_{x,y} / ∑_{z:{x,z}∈E} C_{x,z},
then µ(x) = ∑_{z:{x,z}∈E} C_{x,z} is a reversible measure for the walk, with mass flow C_{x,y}
across each edge {x, y} ∈ E. The properties of such a random walk (e.g., transience or
recurrence) are intimately connected with the electrical properties of the graph G seen as
an electrical network with conductances (C_{x,y})_{{x,y}∈E}.
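The conductance formulas translate directly into code. A short sketch with a hypothetical symmetric conductance matrix C on a 4-vertex graph:

```python
import numpy as np

# Hypothetical conductances on a 4-cycle with one diagonal edge;
# C is symmetric, C[x, y] = 0 means no edge {x, y}.
C = np.array([[0, 2, 0, 1],
              [2, 0, 3, 1],
              [0, 3, 0, 2],
              [1, 1, 2, 0]], dtype=float)

mu = C.sum(axis=1)                 # mu(x) = sum_z C_{x,z}
Pi = C / mu[:, None]               # Pi(x, y) = C_{x,y} / sum_z C_{x,z}

# mu is a reversible measure: mu(x)Pi(x,y) = C_{x,y} = mu(y)Pi(y,x).
assert np.allclose(np.diag(mu) @ Pi, C)
assert np.allclose(np.diag(mu) @ Pi, (np.diag(mu) @ Pi).T)
```

The mass flow µ(x)Π(x, y) across the edge {x, y} is exactly the conductance C_{x,y}, which makes the detailed balance condition immediate.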
We briefly explain the usefulness of reversibility. Let µ be a reversible measure for the
Markov chain with transition matrix Π. Then Π is a self-adjoint operator on the Hilbert
space L²(S, µ) with inner product ⟨f, g⟩_µ := ∑_x f(x)g(x)µ(x). Indeed, formally for any
f, g ∈ L²(S, µ),

    ⟨f, Πg⟩_µ = ∑_{x∈S} f(x) ∑_{y∈S} Π(x, y)g(y)µ(x) = ∑_{x,y∈S} f(x)g(y)Π(y, x)µ(y) = ⟨Πf, g⟩_µ.

All information about the Markov chain is encoded in its transition matrix. The self-adjointness of Π on L²(S, µ) allows one to use spectral theory and variational formulas to
study its spectrum, which is not possible when the Markov chain is not reversible.
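Self-adjointness is easy to verify numerically for a reversible chain. The sketch below reuses the conductance construction (a hypothetical 3-vertex weighted graph) and checks ⟨f, Πg⟩_µ = ⟨Πf, g⟩_µ for random test functions:

```python
import numpy as np

# A reversible chain: random walk on a weighted graph with conductances C.
C = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])
mu = C.sum(axis=1)                 # reversible measure mu(x) = sum_z C_{x,z}
Pi = C / mu[:, None]

rng = np.random.default_rng(1)
f, g = rng.normal(size=3), rng.normal(size=3)

# <f, Pi g>_mu = sum_x f(x) (Pi g)(x) mu(x), and likewise with f, g swapped.
lhs = np.sum(f * (Pi @ g) * mu)
rhs = np.sum((Pi @ f) * g * mu)
assert np.isclose(lhs, rhs)        # Pi is self-adjoint on L^2(S, mu)
```

Equivalently, detailed balance says that diag(µ)Π is a symmetric matrix, which is exactly what the two inner products compare.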
References
[1] P. Lax. Linear Algebra. John Wiley & Sons, Inc. New York, 1997.