Lecture 8

1  Convergence of Markov chain transition kernels

Theorem 1.1 [Convergence of transition kernels] Let $X$ be an irreducible aperiodic Markov chain with countable state space $S$. If the chain is transient or null recurrent, then
\[ \lim_{n\to\infty} \Pi^n(x,y) = 0 \qquad \forall\, x,y \in S. \tag{1.1} \]
If the chain is positive recurrent with stationary distribution $\mu$, then
\[ \lim_{n\to\infty} \Pi^n(x,y) = \mu(y) \qquad \forall\, x,y \in S. \tag{1.2} \]

Theorem 1.1 is in fact equivalent to the Renewal Theorem, which was proved last time. When the Markov chain is positive recurrent, Theorem 1.1 admits a simpler proof by coupling, which we now explain.

Proof of Theorem 1.1 with positive recurrence: When $X$ is positive recurrent, the Markov chain admits a unique stationary probability distribution $\mu$. Let $X^1$ be a copy of the Markov chain with initial distribution $\mu$, and let $X^2$ be an independent copy with initial distribution $\delta_x$ for a fixed $x \in S$. We claim that
\[ \tau := \inf\{n \geq 0 : X^1_n = X^2_n\} < \infty \quad \text{almost surely.} \]
If the claim holds, then we can couple $X^1$ and $X^2$ by defining two new Markov chains $\tilde X^1$ and $\tilde X^2$ on the same probability space such that $\tilde X^i_n = X^i_n$ for $n \leq \tau$, and $\tilde X^i_n = X^1_n$ for $n > \tau$; that is, the second chain starts following the trajectory of the first as soon as they meet. Using the strong Markov property of the pair of independent chains $(X^1, X^2)$, it is clear that $\tilde X^i$ is equally distributed with $X^i$. The claim $\tau < \infty$ a.s. implies that $\mathbb{P}(\tilde X^1_n \neq \tilde X^2_n) \downarrow 0$ as $n \to \infty$. Since $\mu(y) = \mathbb{P}(\tilde X^1_n = y)$ for all $n \in \mathbb{N}$, while $\Pi^n(x,y) = \mathbb{P}(\tilde X^2_n = y)$, we have
\[ \frac{1}{2} \sum_{y\in S} |\Pi^n(x,y) - \mu(y)| = \frac{1}{2} \sum_{y\in S} \Big| \mathbb{E}\big[\mathbf{1}_{\{\tilde X^2_n = y\}} - \mathbf{1}_{\{\tilde X^1_n = y\}}\big] \Big| \leq \mathbb{P}(\tilde X^1_n \neq \tilde X^2_n) = \mathbb{P}(\tau > n), \]
which decreases to 0 as $n \to \infty$. This proves the convergence of $\Pi^n(x,\cdot)$ to $\mu$ in total variation distance, which implies (1.2).

To verify that $\tau < \infty$ a.s., we only need to check that $(X^1, X^2)$ defines an irreducible recurrent Markov chain. The fact that $(X^1, X^2)$ is Markov is clear.
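As a sanity check (not part of the proof), the total variation convergence $\Pi^n(x,\cdot) \to \mu$ of (1.2) is easy to observe numerically on a small positive recurrent chain. The three-state transition matrix below is a hypothetical example chosen only for illustration.

```python
import numpy as np

# A small irreducible aperiodic chain on S = {0,1,2}; rows sum to 1.
# (Hypothetical example, chosen only to illustrate Theorem 1.1.)
Pi = np.array([[0.5, 0.3, 0.2],
               [0.2, 0.5, 0.3],
               [0.3, 0.3, 0.4]])

# Stationary distribution: left Perron eigenvector of Pi, normalized.
evals, evecs = np.linalg.eig(Pi.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu /= mu.sum()

def tv_distance(nu1, nu2):
    """Total variation distance (1/2) * sum |nu1 - nu2|."""
    return 0.5 * np.abs(nu1 - nu2).sum()

row = np.zeros(3); row[0] = 1.0     # start from delta_x with x = 0
dists = []
for n in range(30):
    dists.append(tv_distance(row, mu))
    row = row @ Pi                  # row is Pi^n(x, .) after n steps
print(dists[0], dists[29])
```

The recorded distances shrink geometrically (at a rate governed by the second-largest eigenvalue modulus of $\Pi$), consistent with the coupling bound $\frac{1}{2}\sum_y |\Pi^n(x,y) - \mu(y)| \leq \mathbb{P}(\tau > n)$.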
By the irreducibility and aperiodicity of $X^1$, $\mathbb{P}_x(X^1_n = y) > 0$ for all $n$ sufficiently large, for any given $y$. By the independence of $X^1$ and $X^2$, it follows that for any two pairs $(x_1, x_2)$ and $(y_1, y_2)$,
\[ \mathbb{P}_{(x_1,x_2)}\big((X^1_n, X^2_n) = (y_1, y_2)\big) = \mathbb{P}_{x_1}(X^1_n = y_1)\, \mathbb{P}_{x_2}(X^2_n = y_2) > 0 \]
for all $n$ sufficiently large, which implies the irreducibility of $(X^1, X^2)$. Clearly the product measure $\mu \times \mu$ on $S \times S$ is a stationary probability distribution for $(X^1, X^2)$, which implies that $(X^1, X^2)$ must be a positive recurrent Markov chain. This implies that $\tau < \infty$ a.s.

Finally, we give an account of what happens when the Markov chain has period $d > 1$. A simple example is the simple random walk on $\mathbb{Z}^d$, which has period 2. The first observation is that the state space $S$ can be partitioned into $d$ disjoint classes $S_0, S_1, \dots, S_{d-1}$, and the Markov chain simply marches through these $d$ classes sequentially. Let us make this statement more precise.

Lemma 1.2 For $x, y \in S$, let $D_{x,y} = \{n \geq 0 : \Pi^n(x,y) > 0\}$. Then $d$ divides $m - n$ for any $m, n \in D_{x,y}$.

Proof. By irreducibility, there exists $k \in \mathbb{N}$ with $\Pi^k(y,x) > 0$. Therefore $\Pi^{k+m}(y,y) \geq \Pi^m(x,y)\, \Pi^k(y,x) > 0$, so $k + m \in D_{y,y}$; similarly $k + n \in D_{y,y}$. Since the period $d$ divides every element of $D_{y,y}$, it divides both $k + m$ and $k + n$, and hence divides $m - n$.

By Lemma 1.2, for a fixed $x \in S$, each $y \in S$ is associated with an $r_y \in \{0, 1, \dots, d-1\}$, where $r_y$ is the residue modulo $d$ of any $n \in D_{x,y}$. Let $S_i := \{y \in S : r_y = i\}$ for $i = 0, 1, \dots, d-1$. Then $S$ is the disjoint union of $S_0, \dots, S_{d-1}$, and the Markov chain marches through $S_0, S_1, \dots, S_{d-1}$ in this order cyclically.

Theorem 1.3 [Convergence of transition kernels: periodic case] Let $X$ be an irreducible Markov chain with countable state space $S$ and period $d > 1$. If the chain is transient or null recurrent, then
\[ \lim_{n\to\infty} \Pi^n(x,y) = 0 \qquad \forall\, x,y \in S. \tag{1.3} \]
If the chain is positive recurrent with stationary distribution $\mu$, then
\[ \lim_{\substack{n\to\infty \\ r_x + n \equiv r_y\ (\mathrm{mod}\ d)}} \Pi^n(x,y) = d\,\mu(y) \qquad \forall\, x,y \in S. \tag{1.4} \]

Proof.
Let us consider the transition kernel $\tilde\Pi = \Pi^d$. Clearly $\tilde\Pi(x,y) > 0$ only if $x$ and $y$ belong to the same class $S_r$ for some $0 \leq r \leq d-1$. Restricted to each $S_r$, the associated Markov chain is irreducible and, furthermore, aperiodic; indeed, it is simply $X_n$ restricted to a $d$-periodic subsequence of times. We can then apply Theorem 1.1. The result follows once we observe that the stationary distribution $\tilde\mu_r(\cdot)$ for the Markov chain on $S_r$ with transition kernel $\tilde\Pi$ must equal $d\,\mu(\cdot)$ restricted to $S_r$.

2  Perron-Frobenius Theorem

When the state space $S$ is finite, everything boils down to the study of the finite-dimensional transition matrix $\Pi$. When $\Pi$ has positive entries, Perron's theorem asserts that 1 is the dominant eigenvalue with a positive eigenvector.

Theorem 2.1 [Perron's Theorem] Let $P$ be an $n \times n$ matrix with all positive entries. Then $P$ has a dominant eigenvalue $\lambda$ such that
(i) $\lambda > 0$, and its associated eigenvector $h$ has all positive entries.
(ii) $\lambda$ is a simple eigenvalue.
(iii) Any other eigenvalue $\kappa$ of $P$ satisfies $|\kappa| < \lambda$.
(iv) $P$ has no other eigenvector with all non-negative entries.

Proof. Let $T := \{t \geq 0 : Pv \geq tv \text{ for some } v \in [0,\infty)^n,\ v \not\equiv 0\}$. Then $P_{ii} \in T$, since $P \vec e_i \geq P_{ii} \vec e_i$, and $T \subset \big[0, \sum_{1 \leq i,j \leq n} P_{ij}\big]$, since $|Pv|_\infty \leq \sum_{i,j} P_{ij}\, |v|_\infty$. Furthermore, $T$ is a closed set: if $t_n \to t$ and $P v_n \geq t_n v_n$, where without loss of generality we may assume $|v_n|_1 = 1$, then we can find a subsequence $n_i$ such that $v_{n_i}$ converges to a limiting non-negative vector $v_\infty$ with $|v_\infty|_1 = 1$, and then $P v_\infty \geq t v_\infty$, which proves that $T$ is closed.

Let $\lambda > 0$ be the maximum of $T$, and let $Pv \geq \lambda v$ for some non-negative vector $v \neq 0$. We claim that in fact $Pv = \lambda v$ and $v \in (0,\infty)^n$. Indeed, if $Pv - \lambda v \geq 0$ and $Pv - \lambda v \neq 0$, then by the positivity assumption on the entries of $P$, we have $P^2 v - \lambda P v > 0$, which implies that there exists some $\lambda' > \lambda$ with $P^2 v - \lambda' P v \geq 0$. Since $Pv \geq 0$, this implies that $\lambda' \in T$, contradicting our assumption that $\lambda$ is the maximum of $T$.
Therefore $Pv = \lambda v$. Since $P$ has positive entries and $v \not\equiv 0$, $\lambda v = Pv > 0$. This proves (i).

To prove (ii), note that if $Pw = \lambda w$ for an eigenvector $w$ distinct from any constant multiple of $v$, then there exists $c \in \mathbb{R}$ such that $w + cv \geq 0$, $w + cv \not\equiv 0$, and $w + cv$ has at least one component equal to 0. Since $P(w + cv) = \lambda(w + cv) > 0$ by the positivity of $P$, this creates a contradiction. To conclude that $\lambda$ is a simple eigenvalue of $P$, it remains to rule out the existence of a generalized eigenvector $w$ with eigenvalue $\lambda$, i.e., $(P - \lambda)w = cv$ for some $c \neq 0$. Replacing $w$ by $-w$ if necessary, we may assume $c > 0$; replacing $w$ by $w + bv$ if necessary, we may guarantee that $w > 0$. Then $(P - \lambda)w = cv > 0$ implies that $\max T > \lambda$, a contradiction.

To prove (iii), note that if $Pw = \kappa w$ for some $\kappa \neq \lambda$, where $\kappa$ and $w$ may both be complex, then
\[ |\kappa|\,|w| = |Pw| \leq P|w|, \tag{2.5} \]
where $|w| = |(w_1, \dots, w_n)| = (|w_1|, \dots, |w_n|)$. Therefore $|\kappa| \in T$ and hence $|\kappa| \leq \lambda$. The inequality in (2.5) is in fact strict, which implies $|\kappa| < \lambda$, unless $w = e^{i\theta} w_0$ for some $\theta \in \mathbb{R}$ and $w_0 \geq 0$, in which case $\kappa \in (0, \lambda]$. Therefore there is no other eigenvalue $\kappa$ with $|\kappa| \geq \lambda$. This proves (iii).

By (i)-(iii) applied to $P^T$, which has the same eigenvalues as $P$, there exists a non-negative and non-trivial $w$ such that $P^T w = \lambda w$. Suppose that $h$ is a non-negative eigenvector of $P$ with eigenvalue $\lambda' \neq \lambda$. Then
\[ \lambda' \langle w, h\rangle = \langle w, Ph\rangle = \langle P^T w, h\rangle = \lambda \langle w, h\rangle. \]
Since $\lambda' \neq \lambda$, we have $\langle w, h\rangle = 0$, which is not possible if $h$ has all non-negative entries.

Remark 2.2 For a transition probability matrix $\Pi$, clearly $|\Pi v|_\infty \leq |v|_\infty$ for any vector $v$, and $\Pi \mathbf{1} = \mathbf{1}$. Therefore 1 is an eigenvalue of $\Pi$, and every other eigenvalue $\kappa$ of $\Pi$ has $|\kappa| \leq 1$. When $\Pi$ is the transition matrix of an irreducible aperiodic Markov chain, there exists $n_0 \in \mathbb{N}$ such that for all $n \geq n_0$, $\Pi^n$ has positive entries. Therefore, by Perron's Theorem, 1 is a simple dominant eigenvalue of $\Pi^n$ for $n \geq n_0$.
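Remark 2.2 is easy to illustrate numerically. The chain below is a hypothetical three-state example whose one-step matrix has zero entries; the code finds the first power $n_0$ with $\Pi^{n_0}$ entrywise positive, and checks that the eigenvalue 1 dominates all other eigenvalue moduli.

```python
import numpy as np

# Irreducible aperiodic chain (hypothetical example) with zeros in Pi itself.
Pi = np.array([[0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [0.5, 0.5, 0.0]])

# Find the first n0 with Pi^n0 entrywise positive (it exists by aperiodicity).
power = Pi.copy()
n0 = 1
while not np.all(power > 0):
    power = power @ Pi
    n0 += 1

# Eigenvalue moduli: the leading one is 1, all others are strictly smaller.
mods = np.sort(np.abs(np.linalg.eigvals(Pi)))[::-1]
print(n0)      # first power with all entries positive
print(mods)    # [1, ...smaller moduli...]
```

For this particular matrix the remaining two eigenvalues are complex with modulus $\sqrt{1/2} \approx 0.707$, so 1 is indeed a simple dominant eigenvalue.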
It is then easy to see that 1 is also a simple dominant eigenvalue of $\Pi$ itself, since any (generalized) eigenvector of $\Pi$ with eigenvalue $\lambda$ must also be an eigenvector or generalized eigenvector of $\Pi^n$ with eigenvalue $\lambda^n$, and conversely, any (generalized) eigenvector of $\Pi^n$ with eigenvalue $\lambda$ must also be an eigenvector or generalized eigenvector of $\Pi$ with eigenvalue $\lambda^{1/n}$ for some $n$-th root of $\lambda$.

Remark 2.3 Perron's Theorem applied to the transpose of a positive transition matrix $\Pi$ implies the existence of a stationary positive probability distribution $\mu$. The fact that 1 is a simple dominant eigenvalue of $\Pi^T$ implies that starting from any probability measure $\nu$ on $\{1, \dots, n\}$, $(\Pi^T)^n \nu$ converges to $\mu$ exponentially fast.

The case when $P$ is only assumed to be non-negative is covered by Frobenius's Theorem.

Theorem 2.4 [Frobenius' Theorem] Let $P$ be an $n \times n$ matrix with non-negative entries which are not all 0. Then $P$ has an eigenvalue $\lambda$ with the following properties:
(i) $\lambda > 0$, and there exists an associated eigenvector with non-negative entries.
(ii) Any other eigenvalue $\kappa$ of $P$ satisfies $|\kappa| \leq \lambda$.
(iii) If $|\kappa| = \lambda$, then $\kappa = e^{2\pi i k/m} \lambda$ for some $k, m \in \mathbb{N}$ with $m \leq n$.

Remark 2.5 Let us analyze the dominant eigenvalues of the transition matrix $\Pi$ of a $d$-periodic irreducible Markov chain. Note that $\Pi^d$ is of block diagonal form, where $\Pi^d(i,j) > 0$ only if $i$ and $j$ belong to the same class, and there are exactly $d$ such classes. Restricted to each class $S_i \subset \{1, \dots, n\}$, $\Pi^d$ is an aperiodic transition matrix, which has 1 as a simple dominant eigenvalue by Perron's Theorem and Remark 2.2. Therefore $\Pi^d$ has 1 as the dominant eigenvalue with multiplicity $d$. Consequently, counting multiplicity, $\Pi$ has exactly $d$ eigenvalues with modulus 1, and any other eigenvalue $\kappa$ of $\Pi$ has $|\kappa| < 1$. We claim that these eigenvalues are precisely $e^{2\pi i k/d}$ with $k = 0, 1, \dots, d-1$.
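The claim is easy to check numerically before proving it. The four-state chain below is a hypothetical example of period $d = 3$ with cyclic classes $S_0 = \{0,1\}$, $S_1 = \{2\}$, $S_2 = \{3\}$; its eigenvalues of modulus 1 should be exactly the three cube roots of unity.

```python
import numpy as np

# Hypothetical chain on {0,1,2,3} with period d = 3:
# S0 = {0,1} -> S1 = {2} -> S2 = {3} -> S0, cyclically.
Pi = np.array([[0.0, 0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0, 1.0],
               [0.5, 0.5, 0.0, 0.0]])
d = 3

evals = np.linalg.eigvals(Pi)
on_circle = evals[np.isclose(np.abs(evals), 1.0)]
roots = np.exp(2j * np.pi * np.arange(d) / d)   # e^{2 pi i k / d}

print(len(on_circle))   # exactly d eigenvalues of modulus 1
# largest distance from a modulus-1 eigenvalue to the nearest d-th root of unity
print(max(min(abs(z - r) for r in roots) for z in on_circle))
```

The remaining eigenvalue (here 0) has modulus strictly less than 1, matching Remark 2.5.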
Indeed, $(\Pi^d)^T$ has $d$ linearly independent eigenvectors with eigenvalue 1, which are just the restrictions of the invariant measure $\mu$ of the Markov chain to the $d$ classes of states $S_i$, $0 \leq i \leq d-1$. Denote the restriction of $\mu$ to $S_i$ by $\mu_i$. Then $\Pi^T \mu_i = \mu_{i+1}$ for $0 \leq i \leq d-2$, and $\Pi^T \mu_{d-1} = \mu_0$. The subspace $V$ spanned by $(\mu_i)_{0 \leq i \leq d-1}$ is preserved by $\Pi^T$, and on $V$, if we choose the $\mu_i$ as basis vectors, then $\Pi^T$ acts as a cyclic permutation matrix, whose characteristic polynomial is $\lambda^d - 1$. Therefore the set of eigenvalues of $\Pi^T$ restricted to $V$ is precisely $\{e^{2\pi i k/d} : 0 \leq k \leq d-1\}$. This exhausts the possible eigenvalues of modulus 1 for $\Pi^T$, and hence also for $\Pi$.

For a proof of the Frobenius theorem, see Lax [1, Chapter 16]. There are also infinite-dimensional versions of the Perron-Frobenius Theorem for positive compact operators.

3  Reversible Markov chains

We now consider a special class of Markov chains called reversible Markov chains.

Definition 3.1 [Reversible Markov chains] A Markov chain with countable state space $S$ and transition matrix $\Pi$ is called reversible if it admits a stationary measure $\mu$, called a reversible measure, which satisfies
\[ \mu(x)\,\Pi(x,y) = \mu(y)\,\Pi(y,x) \qquad \forall\, x,y \in S. \tag{3.6} \]

Remark. In the physics literature, (3.6) is called the detailed balance condition. Heuristically, $\mu(x)\Pi(x,y)$ represents the probability flow from $x$ to $y$ in equilibrium. Condition (3.6) requires that the probability flow from $x$ to $y$ equal the flow from $y$ to $x$, which is a sufficient condition for $\mu$ to be stationary. Reversing the flow can be interpreted as reversing time; therefore the time evolution of a Markov chain under a reversible measure is invariant under time reversal. Not all stationary measures are reversible measures, as can be seen from the uniform measure $\mu \equiv 1$ for an asymmetric simple random walk on $\mathbb{Z}$. Note that the notion of a reversible Markov chain is always accompanied by a reversible measure.
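The distinction between stationarity and reversibility can be seen concretely on a finite analogue of the asymmetric walk just mentioned: a biased nearest-neighbour walk on the cycle $\mathbb{Z}_5$ (the bias $p = 0.7$ below is an arbitrary choice). The uniform measure is stationary, but detailed balance fails.

```python
import numpy as np

# Biased walk on the cycle Z_5: step +1 with prob p, -1 with prob 1-p.
# (Hypothetical finite analogue of the asymmetric simple random walk on Z.)
N, p = 5, 0.7
Pi = np.zeros((N, N))
for i in range(N):
    Pi[i, (i + 1) % N] = p
    Pi[i, (i - 1) % N] = 1 - p

mu = np.full(N, 1.0 / N)            # uniform measure on Z_5

print(np.allclose(mu @ Pi, mu))     # stationary: True (Pi is doubly stochastic)
balanced = all(np.isclose(mu[x] * Pi[x, y], mu[y] * Pi[y, x])
               for x in range(N) for y in range(N))
print(balanced)                     # detailed balance (3.6): False
```

Here $\mu(x)\Pi(x,x+1) = p/5$ while $\mu(x+1)\Pi(x+1,x) = (1-p)/5$, so the two probability flows across each edge differ whenever $p \neq 1/2$.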
Theorem 3.2 [Cycle condition for reversibility] Let $X$ be an irreducible Markov chain with countable state space $S$ and transition matrix $\Pi$. A necessary and sufficient condition for the existence of a reversible measure for $X$ is that
(i) $\Pi(x,y) > 0$ if and only if $\Pi(y,x) > 0$;
(ii) for any loop $x_0, x_1, \dots, x_n = x_0$ with $\prod_{i=1}^n \Pi(x_{i-1}, x_i) > 0$, we have
\[ \prod_{i=1}^n \Pi(x_{i-1}, x_i) = \prod_{i=1}^n \Pi(x_i, x_{i-1}). \tag{3.7} \]

Proof. Suppose $\mu$ is a reversible measure for $X$ with $\mu \not\equiv 0$. Then by irreducibility and the stationarity of $\mu$, $\mu(x) > 0$ for all $x \in S$. The detailed balance condition (3.6) clearly implies (i). Similarly, reversibility implies that the probability flow along the cycle $x_0, x_1, \dots, x_n = x_0$, i.e., $\mu(x_0) \prod_{i=1}^n \Pi(x_{i-1}, x_i)$, equals the probability flow along the reversed cycle $x_n = x_0, x_{n-1}, \dots, x_0$, which is just $\mu(x_0) \prod_{i=n}^{1} \Pi(x_i, x_{i-1})$. This yields (3.7).

Conversely, if $\Pi$ satisfies (i) and (ii), then for a given $x \in S$ we can define $\mu(x) = 1$, and for any $y \in S$ with a path of states $z_0 = x, z_1, \dots, z_n = y$ connecting $x$ and $y$ such that $\Pi(z_{i-1}, z_i) > 0$, we can define
\[ \mu(y) = \prod_{i=1}^n \frac{\Pi(z_{i-1}, z_i)}{\Pi(z_i, z_{i-1})}. \]
Conditions (i) and (ii) guarantee that this definition of $\mu$ is independent of the choice of path connecting $x$ and $y$. It is easy to check that $\mu$ satisfies the detailed balance condition (3.6).

Example 3.3
1. An irreducible birth-death chain is reversible, since Theorem 3.2 (i) follows from irreducibility, and the cycle condition (3.7) is trivially satisfied due to the lack of cycles. Similarly, any Markov chain on a tree with transitions between neighboring vertices is reversible.
2. A random walk on a connected graph $G = (V, E)$ with vertex set $V$ and edge set $E$ is a Markov chain with state space $V$ and transition matrix $\Pi(x,y) = 1/d_x$ for all $y \in V$ with $\{x,y\} \in E$, where $d_x$ is the degree of $x$ in $G$. It is easily seen that $\mu(x) = d_x$ is a reversible measure for the walk, with unit flow across each edge.
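The path-product construction in the proof of Theorem 3.2 can be sketched for a birth-death chain as in Example 3.3.1, where the connecting path is unique. The birth and death probabilities below are hypothetical, chosen only so that the resulting rows can be completed to a stochastic matrix by holding probabilities.

```python
import numpy as np

# Birth-death chain on {0,...,4} (hypothetical rates):
# p[i] = Pi(i, i+1), q[i] = Pi(i+1, i); remaining mass stays put.
p = np.array([0.6, 0.5, 0.4, 0.3])
q = np.array([0.4, 0.5, 0.6, 0.7])

N = 5
Pi = np.zeros((N, N))
for i in range(N - 1):
    Pi[i, i + 1] = p[i]
    Pi[i + 1, i] = q[i]
for i in range(N):
    Pi[i, i] = 1.0 - Pi[i].sum()   # holding probability; rows sum to 1

# Construction from Theorem 3.2: mu(0) = 1, and along the unique path
# mu(i+1) = mu(i) * Pi(i, i+1) / Pi(i+1, i).
mu = np.ones(N)
for i in range(N - 1):
    mu[i + 1] = mu[i] * Pi[i, i + 1] / Pi[i + 1, i]

# Verify detailed balance (3.6) for all pairs, hence stationarity of mu.
ok = all(np.isclose(mu[x] * Pi[x, y], mu[y] * Pi[y, x])
         for x in range(N) for y in range(N))
print(ok)
```

Since there are no cycles among distinct states, condition (3.7) is vacuous and the construction cannot be path-dependent, exactly as the example asserts.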
More generally, if each edge $\{x,y\} \in E$ of the graph $G$ above is assigned a positive conductance $C_{x,y} = C_{y,x}$, and
\[ \Pi(x,y) = \frac{C_{x,y}}{\sum_{z : \{x,z\} \in E} C_{x,z}}, \]
then $\mu(x) = \sum_{z : \{x,z\} \in E} C_{x,z}$ is a reversible measure for the walk, with mass flow $C_{x,y}$ across each edge $\{x,y\} \in E$. The properties of such a random walk (e.g., transience or recurrence) are intimately connected with the electrical properties of the graph $G$, seen as an electrical network with conductances $(C_{x,y})_{\{x,y\} \in E}$.

We briefly explain the usefulness of reversibility. Let $\mu$ be a reversible measure for the Markov chain with transition matrix $\Pi$. Then $\Pi$ is a self-adjoint operator on the Hilbert space $L^2(S, \mu)$ with inner product $\langle f, g\rangle_\mu := \sum_x f(x) g(x) \mu(x)$. Indeed, formally, for any $f, g \in L^2(S, \mu)$,
\[ \langle f, \Pi g\rangle_\mu = \sum_{x \in S} f(x) \sum_{y \in S} \Pi(x,y)\, g(y)\, \mu(x) = \sum_{x,y \in S} f(x)\, g(y)\, \Pi(y,x)\, \mu(y) = \langle \Pi f, g\rangle_\mu. \]
All information about the Markov chain is encoded in its transition matrix. The self-adjointness of $\Pi$ on $L^2(S, \mu)$ allows one to use spectral theory and variational formulas to study its spectrum, which is not possible when the Markov chain is irreversible.

References

[1] P. Lax. Linear Algebra. John Wiley & Sons, Inc., New York, 1997.