Miranda Holmes-Cerfon
Applied Stochastic Analysis, Spring 2015

Lecture 2: Markov Chains (I)

Readings
These are only suggested readings, which can provide additional perspective:
• Grimmett and Stirzaker [2001] 6.1, 6.4–6.6
• Koralov and Sinai [2010] 5.1–5.5, pp. 67–78
• For a lively history and gentle introduction, see Hayes [2013].

We will begin talking about stochastic processes, i.e. random functions X(t), where t is a parameter (discrete or continuous) and X(t) is a random variable, for each t, taking values in some state space S (discrete or continuous). First, we will consider Markov chains. Markov chains arise in a great many applications. Some examples include:
• models of physical processes
  – rainfall from day to day
  – neural networks
  – card shuffling
  – population dynamics
  – lineups, e.g. in grocery stores, computer servers, telephone call centers, etc.
  – chemical reactions
  – protein folding
  – baseball statistics
• discretizations of continuous systems
• sampling from high-dimensional systems, e.g. Markov chain Monte Carlo
• data/network analysis
  – clustering
  – speech recognition
  – the PageRank algorithm in Google's search engine

Markov chains were first invented – by Andrei Markov – for a particular application: to analyze the distribution of letters in Russian poetry (Hayes [2013]). He meticulously constructed a list of the frequencies of vowel↔consonant pairs in the first 20,000 letters of Pushkin's poem-novel Eugene Onegin, and showed that from this matrix one could infer the average number of vowels and consonants. When he realized how powerful this idea was, he spent several years developing tools to analyze the properties of such random processes with memory.

2.1 Setup and definitions

In our study of Markov chains we will consider the state space S to be discrete, i.e. finite or countable. Therefore we can let it be a set of integers, as in S = {1, 2, . . . , N} or S = {1, 2, . . .}. Today, we will also consider discrete-time processes. We write X(n) = X_n.
Definition. The process X(t) = X_0, X_1, X_2, . . . is a discrete-time Markov chain if it satisfies the Markov property:

P(X_n = s | X_0 = x_0, X_1 = x_1, . . . , X_{n−1} = x_{n−1}) = P(X_n = s | X_{n−1} = x_{n−1}).   (1)

Definition. The Markov chain X(t) is homogeneous if P(X_{n+1} = j | X_n = i) = P(X_1 = j | X_0 = i), i.e. the transition probabilities do not depend on time. If this is the case, we write p_{ij} = P(X_1 = j | X_0 = i) for the probability to go from i to j in one step. We will only consider homogeneous Markov chains in this course.

Definition. If |S| = N (the state space is finite), we can form the transition matrix P = (p_{ij}). This satisfies:
(i) p_{ij} ≥ 0 ∀ i, j   (the entries are non-negative)
(ii) ∑_j p_{ij} = 1 ∀ i   (the rows sum to 1)
Any matrix that satisfies (i), (ii) above is called a stochastic matrix. (Note: this is not the same as a "random matrix"!)

Examples

1. Random walk on the line
X_n = ∑_{j=1}^n ξ_j, where ξ_j = ±1 with probability 1/2 each. Then p_{i,i+1} = 1/2, p_{i,i−1} = 1/2, and p_{ij} = 0 for j ≠ i ± 1.

2. Random walk on a graph
The most general Markov chain can be pictured as a random walk on a graph, where the walker jumps from node to node, choosing the next edge at random with a probability that depends only on the node it is currently at. This is a convenient way to visualize a Markov chain.

3. Card shuffling
Shuffling a pack of cards can be modeled in many ways, most of which can be mapped to Markov chains. Perhaps the simplest model is this: at each step, take a card from the top of the deck, and put it back in at a random location. A more complicated model is the riffle shuffle: split the deck in two at a random location, then put the halves back together by alternating randomly-sized groups of cards from each half. Such models have been extensively analyzed, and one can prove results about the number of shuffles needed to make the deck close to random again.
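The definitions above are easy to experiment with numerically. Here is a short sketch in Python with NumPy (our choice of language; the reflecting boundaries on the random walk are also our own assumption, made so the state space is finite) that checks conditions (i)–(ii) and samples a path of the chain:

```python
import numpy as np

def is_stochastic(P, tol=1e-12):
    """Check conditions (i)-(ii): non-negative entries, rows summing to 1."""
    P = np.asarray(P, dtype=float)
    return bool((P >= -tol).all() and np.allclose(P.sum(axis=1), 1.0))

def simulate_chain(P, x0, n_steps, rng=None):
    """Sample a path X_0, ..., X_n from a homogeneous chain with matrix P."""
    rng = np.random.default_rng(rng)
    path = [x0]
    for _ in range(n_steps):
        # The next state is drawn from the row of P for the current state.
        path.append(rng.choice(len(P), p=P[path[-1]]))
    return path

# Random walk on {0,...,4}: p_{i,i+1} = p_{i,i-1} = 1/2 in the interior,
# with reflecting endpoints (a finite stand-in for the walk on the line).
P = np.zeros((5, 5))
for i in range(5):
    if i > 0:
        P[i, i - 1] = 0.5
    if i < 4:
        P[i, i + 1] = 0.5
P[0, 1] = P[4, 3] = 1.0

print(is_stochastic(P))             # True
print(simulate_chain(P, 2, 10, rng=0))
```

Each call to `simulate_chain` uses only the current state to choose the next one, which is exactly the Markov property in code.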
For example, it takes seven riffle shuffles to get close to random, but it takes 11 or 12 to get so close that a gambler in a casino cannot exploit the deviations from randomness to win a typical game. See Austin for an accessible introduction, and Aldous and Diaconis [1986] for the mathematical proofs. (I first learned about this in the beautiful Proofs from the Book, by Aigner and Ziegler.)

4. Ehrenfest model of diffusion
Consider a container with a membrane in the middle, and m particles distributed in some way between the left and right sides. At each step, pick one particle at random and move it to the other side. Let X_n = # of particles in the left side at time n. Then X_n is a Markov chain, with transition probabilities p_{i,i+1} = 1 − i/m, p_{i,i−1} = i/m.

5. Ising model, 2d
This model was initially invented to study magnetism. Each square on a 2d lattice with N = m × m sites (the magnet) is assigned a "spin" σ_j ∈ {+1, −1}. Each configuration of spins σ = (σ_1, . . . , σ_N) has an energy

H(σ) = − ∑_{⟨i,j⟩} σ_i σ_j,

where ⟨i, j⟩ means i, j are neighbours on the lattice. Dynamics can be added as follows: at each step, pick a spin uniformly at random, tentatively flip it, and let ∆H be the change in energy of the configuration. If ∆H < 0, accept the move, but otherwise accept it with probability e^{−β∆H}, where β > 0 is a parameter, typically called the "inverse temperature." (Small β means the magnet is hot, large β means it is cold.) This defines a Markov chain with transition probabilities

p_{σ→σ′} = (1/N) min(1, e^{−β∆H}) if σ ≠ σ′ differ by one spin flip,   p_{σ→σ} = 1 − ∑_{σ′≠σ} p_{σ→σ′}.

The Ising model was initially invented to study phase transitions in ferromagnetism, but it has since been used in an enormous variety of applications, ranging from ice to gases to spin glasses to cancer to ion channels to neuroscience to urban segregation. It was completely analytically solved in 1 dimension by Ising in 1924 in his PhD thesis.
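The Ising dynamics described above fit in a few lines of code. This is only an illustrative sketch: the periodic boundary conditions, lattice size, and value of β are our own assumptions, not part of the notes.

```python
import numpy as np

def delta_H(sigma, i, j):
    """Energy change from flipping spin (i, j), with H = -sum over neighbour
    pairs of sigma_i sigma_j and periodic boundaries (our assumption)."""
    m = sigma.shape[0]
    nbr = (sigma[(i - 1) % m, j] + sigma[(i + 1) % m, j]
           + sigma[i, (j - 1) % m] + sigma[i, (j + 1) % m])
    return 2 * sigma[i, j] * nbr

def metropolis_step(sigma, beta, rng):
    """One step of the dynamics above: pick a spin uniformly at random and
    accept the flip with probability min(1, exp(-beta * delta_H))."""
    m = sigma.shape[0]
    i, j = rng.integers(m), rng.integers(m)
    dH = delta_H(sigma, i, j)
    if dH < 0 or rng.random() < np.exp(-beta * dH):
        sigma[i, j] *= -1
    return sigma

rng = np.random.default_rng(1)
sigma = rng.choice([-1, 1], size=(8, 8))
for _ in range(1000):
    metropolis_step(sigma, beta=0.5, rng=rng)
```

Note that moves with ∆H < 0 are always accepted, and other moves are accepted with probability e^{−β∆H} ≤ 1, matching the transition probabilities above.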
In 2 dimensions and higher, there is a competition between spins wanting to align with their neighbours, and the vastly bigger number of states with non-aligned spins. It can be shown that the model exhibits a phase transition at a critical inverse temperature β_c, where the lowest-energy state changes from being ordered at low temperature, to disordered at high temperature.

6. Autoregressive model of order k (AR(k))
Given constants a_1, . . . , a_k ∈ R, let

Y_n = a_1 Y_{n−1} + a_2 Y_{n−2} + . . . + a_k Y_{n−k} + W_n,

where the W_n are i.i.d.¹ random variables. This process is not Markov, because it depends on the past k timesteps. However, we can form a Markov process by defining X_n = (Y_n, Y_{n−1}, . . . , Y_{n−k+1})^T. Then X_n = A X_{n−1} + W_n, where

A = ( a_1  a_2  · · ·  a_{k−1}  a_k )
    ( 1    0    · · ·  0        0   )
    ( 0    1    · · ·  0        0   )
    ( . . .                         )
    ( 0    0    · · ·  1        0   )

and W_n = (W_n, 0, . . . , 0)^T.

¹ i.i.d. = independent, identically distributed

2.2 Evolution of probability

Next question: suppose we know the probability distribution at time t = 0, i.e. the probability of finding the chain in any particular state. What is the distribution at later times? This comes from the following:

Chapman-Kolmogorov Equation. For 0 ≤ m ≤ n,

P(X_n = j | X_0 = i) = ∑_k P(X_n = j | X_m = k) P(X_m = k | X_0 = i).   (2)

Proof.
P(X_n = j | X_0 = i) = ∑_k P(X_n = j, X_m = k | X_0 = i)   (Law of Total Probability)
= ∑_k P(X_n = j | X_m = k, X_0 = i) P(X_m = k | X_0 = i)
(since P(A ∩ B | C) = P(A | B ∩ C) P(B | C), a property of conditional probability)
= RHS, by the Markov property.

This holds whether the state space is finite or countably infinite. Now let's suppose the state space is finite: |S| = N < ∞. Let µ^{(n)} = the probability distribution after n steps, represented as a row vector. Let's calculate µ^{(n)}, assuming we know µ^{(n−1)}:

µ_j^{(n)} = ∑_i P(X_n = j | X_{n−1} = i) P(X_{n−1} = i)   (Law of Total Probability)
= ∑_i µ_i^{(n−1)} p_{ij}.

We obtain the:

Forward Kolmogorov Equation. µ^{(n)} = µ^{(n−1)} P ⟺ µ^{(n)} = µ^{(0)} P^n.   (3)

This shows how to evolve the probability in time.
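Equation (3) is easy to verify numerically. In the sketch below (the two-state matrix is purely illustrative), the step-by-step recursion µ^{(n)} = µ^{(n−1)} P agrees with the closed form µ^{(0)} P^n:

```python
import numpy as np

# A two-state chain (the entries are illustrative, not from the notes).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
mu0 = np.array([1.0, 0.0])          # start in state 0 with certainty

# Step-by-step: mu^(n) = mu^(n-1) P  (row vector times matrix).
mu = mu0.copy()
for _ in range(20):
    mu = mu @ P

# Closed form: mu^(n) = mu^(0) P^n.
mu_direct = mu0 @ np.linalg.matrix_power(P, 20)
print(np.allclose(mu, mu_direct))   # True
print(mu)                           # remains a probability distribution
```

Note that µ^{(n)} stays a probability vector at every step, since P is stochastic.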
If we know the initial probability distribution µ^{(0)}, then we can find the distribution at any later time using powers of the matrix P.

Now consider what happens if we ask for the expected value of some function of the chain: E X_n², E X_n³, E|X_n|, etc. Does this evolve in time in a similar way to (3)? Let f : S → R be a function defined on state space, and let u(k, n) = E_k f(X_n) = E(f(X_n) | X_0 = k). You can think of u(·, n) as a column vector indexed by k. Then u(·, n) evolves in time as:

Backward Kolmogorov Equation. u(·, n + 1) = P u(·, n),   u(k, 0) = f(k).   (4)

The proof is left as an exercise on the homework.

Remark. The backward equation is sometimes written as an equation for v(k, n) = E(f(X_T) | X_n = k), where T is a fixed time and n ≤ T. The equation then takes the form v(·, n) = P v(·, n + 1), v(k, T) = f(k). The equation solves for v "backwards" in time; hence the name.

It is also possible to show that µ^{(n)} v(n) is constant. Indeed,

µ^{(n)} v(n) = ∑_k µ_k^{(n)} v(k, n) = ∑_k P(X_n = k) E(f(X_T) | X_n = k) = E(f(X_T)),

where we used the Law of Total Probability for the last step. The invariance of µ^{(n)} v(n) can sometimes be used to derive the forward equation from the backward equation, and vice-versa; this exercise is left for the reader.²

Remark. Yet another approach is to let P(k, t; l, s) be the transition probability to be in state k at time t, given the system started in state l at time s. One can then derive equations for how P(k, t; l, s) evolves in t (forward equation) and in s (backward equation).

2.3 Stationary probability

One question we might ask is: what happens to the chain after long times? The chain itself does not converge to anything, because it is continually jumping around, but the probability distribution might. If so, we should look for a distribution that doesn't change with time.

Definition. The vector π is called a stationary distribution if
(i) π_j ≥ 0, ∑_j π_j = 1.
(ii) π = πP, i.e. π_j = ∑_i π_i p_{ij} for all j.
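Condition (ii) says π is a left eigenvector of P with eigenvalue 1, which suggests a numerical recipe. The sketch below (which assumes the eigenvalue 1 is simple, as it is for the chains considered later) finds π for the Ehrenfest chain of Example 4; one can check by detailed balance that the answer is the binomial distribution:

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to sum to 1.
    A sketch: assumes the eigenvalue 1 is simple."""
    evals, evecs = np.linalg.eig(P.T)      # eigenvectors of P^T = left eigenvectors of P
    k = np.argmin(np.abs(evals - 1.0))     # index of the eigenvalue closest to 1
    pi = np.real(evecs[:, k])
    return pi / pi.sum()                   # fixes sign and normalization

# Ehrenfest chain with m = 4 particles, states 0..4.
m = 4
P = np.zeros((m + 1, m + 1))
for i in range(m + 1):
    if i < m:
        P[i, i + 1] = 1 - i / m
    if i > 0:
        P[i, i - 1] = i / m

pi = stationary_distribution(P)
print(np.allclose(pi, pi @ P))   # True: pi = pi P
print(pi)                        # binomial(4, 1/2): (1, 4, 6, 4, 1)/16
```

The binomial answer makes physical sense: in steady state each particle is on the left side independently with probability 1/2.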
² Hint: for one direction, write µ^{(n+1)} v(n + 1) = µ^{(n)} v(n) = µ^{(n)} P v(n + 1), rearrange, and argue for conditions under which you can derive the equation for µ^{(n)}.

Remark. Other synonyms you might hear for stationary distribution include invariant measure, invariant distribution, steady-state probability, equilibrium probability, or equilibrium distribution (the latter two are from physics).

Condition (i) says the vector is a probability distribution. Condition (ii) says that if we start with probability π and run the Markov chain, the probability will not change. Some questions we might ask about π include:
– Does it exist?
– Is it unique?
– Does an arbitrary distribution converge to it?

These are actually questions about the eigenvalues of the transition matrix P. Indeed, to satisfy (ii), we need λ = 1 to be an eigenvalue. If an arbitrary distribution is to converge in some way, we need all other eigenvalues to have complex norm less than 1. Note that λ = 1 is clearly an eigenvalue of P, since it has right eigenvector (1, 1, . . . , 1)^T. We need to know whether the corresponding left eigenvector is always positive, and also what the other eigenvalues / generalized eigenvalues are.

Lemma. The spectral radius of P is 1, i.e. ρ(P) = max_λ |λ| = 1, where the max is over all eigenvalues.

Proof. Let η be a left eigenvector with eigenvalue λ. Then λ η_j = ∑_{i=1}^N η_i p_{ij}, so

|λ| ∑_{j=1}^N |η_j| = ∑_{j=1}^N | ∑_{i=1}^N η_i p_{ij} | ≤ ∑_{i,j=1}^N |η_i| p_{ij} = ∑_{i=1}^N |η_i|.

Therefore |λ| ≤ 1.

We can say more than this, if we know something about the structure of the chain.

Definition. A stochastic matrix is irreducible if, for every pair (i, j), there exists an s > 0 such that (P^s)_{ij} > 0. A homogeneous Markov chain is irreducible if it is generated by an irreducible stochastic matrix.

Definition. A stochastic matrix is primitive if there exists some s > 0 such that the s-step transition probabilities are positive for all i, j, i.e. (P^s)_{ij} = p_{ij}^{(s)} > 0 ∀ i, j.
A homogeneous Markov chain is primitive if it is generated by a primitive stochastic matrix. Note that some books (e.g. Koralov and Sinai [2010]) call such a matrix ergodic instead.

Examples
In these examples, an arrow from one node to another means there is a positive transition probability between these nodes in that direction; the actual value of the transition probability is not important.

Theorem (Ergodic Theorem for Markov Chains). Assume the Markov chain is primitive. Then there exists a unique stationary probability distribution π = (π_1, . . . , π_N), with π_j > 0 ∀ j. The n-step transition probabilities converge to π: that is, lim_{n→∞} p_{ij}^{(n)} = π_j.

Remark. The name of this theorem comes from Koralov and Sinai [2010]; it may not be universal. There are many meanings of the word "ergodic;" we will see several variants throughout this course.

Proof. This is a sketch; see Koralov and Sinai [2010] p. 72 for all the details.
– Define a metric d(µ′, µ′′) = (1/2) ∑_{i=1}^N |µ′_i − µ′′_i| on the space of probability distributions. It can be shown that d is a metric, and the space of distributions is complete.
– Show (*): d(µ′Q, µ′′Q) ≤ (1 − α) d(µ′, µ′′), where α = min_{i,j} Q_{ij} and Q is a stochastic matrix.
– Show that µ^{(0)}, µ^{(0)}P, µ^{(0)}P², . . . is a Cauchy sequence. Therefore it converges, so let π = lim_{n→∞} µ^{(n)}.
– Show that π is unique: let π_1, π_2 be two distributions such that π_i = π_i P. Then d(π_1, π_2) = d(π_1 P^s, π_2 P^s) ≤ (1 − α) d(π_1, π_2) by (*), with α = min_{i,j} (P^s)_{ij} > 0. Therefore d(π_1, π_2) = 0.
– Let µ^{(0)} be the probability distribution concentrated at point i. Then µ^{(0)}P^n is the probability distribution (p_{ij}^{(n)}). Therefore, lim_{n→∞} p_{ij}^{(n)} = π_j.

Remark. The proof shows that d(µ^{(0)}P^n, π) ≤ (1 − α)^{−1} β^n, with β = (1 − α)^{1/s} < 1. Therefore the rate of convergence to the stationary distribution is exponential.

Theorem (Law of large numbers for a primitive Markov chain).
Let π be the stationary distribution of a primitive Markov chain, let ν_i^{(n)} be the number of occurrences of state i after n time steps of the chain, i.e. among the values X_0, X_1, . . . , X_n, and let ν_{ij}^{(n)} be the number of values of k, 1 ≤ k ≤ n, for which X_{k−1} = i, X_k = j. Then for any ε > 0,

lim_{n→∞} P( | ν_i^{(n)}/n − π_i | ≥ ε ) = 0,
lim_{n→∞} P( | ν_{ij}^{(n)}/n − π_i p_{ij} | ≥ ε ) = 0.

Proof. For a proof, see Koralov and Sinai [2010], p. 74.

Remark. This implies that the long-time average of any function of the Markov chain approaches the average with respect to the stationary distribution.

Remark. Several of these results hold for irreducible chains, in slightly weaker forms. Irreducible chains also have a unique stationary distribution – this follows from the Perron-Frobenius Theorem. However, it is not true that an arbitrary distribution converges to it; rather, we have that µ^{(0)} P̄^{(n)} → π as n → ∞, where P̄^{(n)} = (1/n) ∑_{k=1}^n P^k. This is because there may be a built-in periodicity, as in the chain with transition matrix

P = ( 0  1 )
    ( 1  0 ).

In this case P^{2n} = I and P^{2n+1} = P, so µ^{(n)} oscillates between two distributions, instead of converging to a fixed limit.

2.4 Detailed balance

What happens if we run a Markov chain backwards? Suppose we take a primitive chain, start it in the stationary distribution, and let it run, to get X_0, X_1, X_2, . . . , X_N, each with distribution X_i ∼ π. Let Y_n = X_{N−n} be the "reversed" chain.

Theorem. Y_0, Y_1, . . . , Y_N is a Markov chain with P(Y_{n+1} = j | Y_n = i) = (π_j/π_i) p_{ji}.

Proof. Calculate the conditional probability, writing k = N − n − 1:

P(Y_{n+1} = j | Y_n = i) = P(Y_{n+1} = j, Y_n = i) / P(Y_n = i) = P(X_k = j, X_{k+1} = i) / P(X_{k+1} = i)
= P(X_{k+1} = i | X_k = j) P(X_k = j) / P(X_{k+1} = i) = p_{ji} π_j / π_i.

We also need to show it's Markov:

P(Y_{n+1} = i_{n+1} | Y_n = i_n, . . . , Y_0 = i_0) = P(Y_k = i_k, 0 ≤ k ≤ n + 1) / P(Y_k = i_k, 0 ≤ k ≤ n)
= P(X_{N−n−1} = i_{n+1}, X_{N−n} = i_n, . . . , X_N = i_0) / P(X_{N−n} = i_n, . . . , X_N = i_0)
= π_{i_{n+1}} p_{i_{n+1},i_n} p_{i_n,i_{n−1}} · · · p_{i_1,i_0} / ( π_{i_n} p_{i_n,i_{n−1}} · · · p_{i_1,i_0} )
= π_{i_{n+1}} p_{i_{n+1},i_n} / π_{i_n},

which depends only on i_n, i_{n+1}.

Question: when does Y have the same transition probabilities as X?

Definition. A Markov chain satisfies detailed balance if

π_i p_{ij} = π_j p_{ji}   ∀ i, j.   (5)

Remark. From the above, we see that when detailed balance holds, the time-reversed chain (when run in the steady-state) has the same transition probabilities as the original. Therefore, the chain is indistinguishable when it is run forwards and backwards.

Notes
• Detailed balance is a very important concept that is widely used in physics and statistics.
• The quantity π_i p_{ij} is the "amount" of probability flowing down edge i → j in each time step, in steady-state. If a system satisfies detailed balance, then the amount of probability flowing down an edge in one direction in steady state equals the amount flowing in the other direction. Therefore, there is no net flux of probability.
• Here is a system that does not satisfy detailed balance: there must be a net flux of probability through the red edges.
• Another way to picture detailed balance is to imagine a turbulent fluid in a box. If the fluid is forced and dissipated isotropically (e.g. by some external heat bath), then there should be no mean flow, on average. However, if we stir the fluid in one direction, then even if it is very turbulent there will be a mean circulation in the box.
• In physics, if a system satisfies detailed balance, then it is called an equilibrium system, or sometimes reversible in equilibrium. Otherwise, it is called a non-equilibrium system. Note that being an equilibrium system is different from being in equilibrium, aka in steady-state – many systems have a stationary distribution π that the probability converges to, but that stationary distribution may still carry non-zero fluxes: the steady-state must satisfy π_j = ∑_i π_i p_{ij}, but it does not necessarily satisfy (5).
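Checking detailed balance numerically amounts to testing whether the flux matrix F_{ij} = π_i p_{ij} is symmetric. A sketch (both example chains are our own illustrations): a reversible birth-death chain, and a deterministic cycle whose uniform stationary distribution carries a net flux around the loop:

```python
import numpy as np

def satisfies_detailed_balance(P, pi, tol=1e-10):
    """Check pi_i p_ij == pi_j p_ji for all pairs (i, j)."""
    F = pi[:, None] * P          # F_ij = pi_i p_ij, the probability flux i -> j
    return bool(np.allclose(F, F.T, atol=tol))

# Reversible example: a birth-death chain on 3 states.
P1 = np.array([[0.5,  0.5,  0.0],
               [0.25, 0.5,  0.25],
               [0.0,  0.5,  0.5]])
pi1 = np.array([0.25, 0.5, 0.25])           # its stationary distribution
print(satisfies_detailed_balance(P1, pi1))  # True

# Non-reversible example: the deterministic cycle 0 -> 1 -> 2 -> 0.
P2 = np.array([[0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [1.0, 0.0, 0.0]])
pi2 = np.array([1/3, 1/3, 1/3])             # stationary, but with net flux
print(satisfies_detailed_balance(P2, pi2))  # False
```

The cycle illustrates the point in the notes above: π₂ is a perfectly good stationary distribution, yet probability circulates in one direction, so the system is not an equilibrium system.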
• In physics, detailed balance is a very important concept, because if there are non-zero fluxes in steady-state, then a system must have forces acting on it. This makes statistical mechanics vastly more complicated. Most of statistical mechanics has been developed to deal with equilibrium systems; the corresponding theory for non-equilibrium systems is only now being developed. One example of a non-equilibrium system that is still very poorly understood is a conducting rod maintained at a hot temperature at one end and a cold one at the other. We all know the steady-state is a linear temperature distribution in the rod, but deriving this from microscopic interactions is still a research topic.
• In statistics, the concept of detailed balance is widely used because when it holds, it is easier to show that a particular distribution is the stationary distribution – see this week's homework, and the lecture on Monte-Carlo methods (Week ∼6).
• Here are some examples of physical systems that do / do not satisfy detailed balance:

No detailed balance
– self-propelled (e.g. swimming) particle
– atmospheric circulation patterns
– plasma with non-Maxwellian velocities
– system in contact with two heat baths, at different temperatures
– snowflake melting in a coffee cup
– sheared crystal

Detailed balance
– particle diffusing in a fluid
– Ising model with the dynamics above
– ideal gas in an insulated box
– system in contact with one heat bath
– covered, insulated coffee cup with liquid/vapour equilibrium
– crystal with no external forces

2.5 Spectral decomposition for a Markov chain that satisfies detailed balance

If P satisfies detailed balance, then it can be symmetrized by a similarity transformation:

V = Λ P Λ^{−1},   (6)

where Λ_{ij} = δ_{ij} √π_i is the matrix with √π on its diagonal. Then

V_{ij} = (√π_i / √π_j) p_{ij} = (√π_j / √π_i) p_{ji}   (by detailed balance)   = V_{ji},

so V is symmetric.
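The symmetrization (6) can be checked numerically. In this sketch we use the Ehrenfest chain with m = 2 (our own choice of example); its Λ P Λ^{−1} comes out symmetric, with real eigenvalues:

```python
import numpy as np

# Ehrenfest chain with m = 2 particles; it satisfies detailed balance,
# with stationary distribution pi = (1, 2, 1)/4.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
pi = np.array([0.25, 0.5, 0.25])

Lam = np.diag(np.sqrt(pi))           # Lambda: sqrt(pi) on the diagonal
V = Lam @ P @ np.linalg.inv(Lam)     # V = Lambda P Lambda^{-1}

print(np.allclose(V, V.T))           # True: V is symmetric
print(np.linalg.eigvalsh(V))         # real eigenvalues, shared with P
```

Since V is symmetric, its eigenvalues (and hence those of P) are real, which is exactly what the argument below exploits. For this periodic chain the spectrum contains −1, consistent with the earlier remark about chains that oscillate rather than converge.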
We know from linear algebra that V has a full set of real eigenvalues λ_j ∈ R, and an orthonormal set of eigenvectors w_j that are both the left and right eigenvectors. Therefore, P also has eigenvalues λ_j, and it has

left eigenvectors ψ_j = Λ w_j,   right eigenvectors φ_j = Λ^{−1} w_j,

i.e. each element (w_j)_i is multiplied by √π_i or (√π_i)^{−1}. We also have that ψ_j = Λ² φ_j, i.e. the left eigenvector equals π times the right eigenvector, elementwise. Note that this is always true for the eigenvectors corresponding to λ = 1, but for a chain that satisfies detailed balance it is true for all other eigenvectors as well.

Suppose that |w_j| = 1. The spectral decomposition of P is given by

P^t = ∑_k λ_k^t φ_k ψ_k^T = ∑_k λ_k^t φ_k φ_k^T Λ².   (7)

In components: p_{ij}^{(t)} = ∑_k λ_k^t (φ_k)_i (ψ_k)_j = ∑_k λ_k^t (φ_k)_i (φ_k)_j π_j.

Remark. Another way to derive (7) is to define an inner product ⟨·, ·⟩_π by ⟨u, v⟩_π = ∑_i u_i v_i π_i. Then we can show that P is self-adjoint with respect to this inner product, i.e. ⟨Pu, v⟩_π = ⟨u, Pv⟩_π, and (7) follows.

Note. The spectral decomposition gives insight into the timescales associated with the Markov chain. We will see an example of this below.

Note. If you truncate (7), then you typically don't get a stochastic matrix back. In fact, if you evolve the probability with the truncated matrix, the probability can even become negative. Therefore the spectral decomposition hides the fact that P is stochastic.

Example. Consider the transition matrix

P = ( 1 − 1/(2m)   1/(2m)    0      0        0          )
    ( 1/2          0         1/2    0        0          )
    ( 0            1/2       0      1/2      0          )
    ( 0            0         1/2    0        1/2        )
    ( 0            0         0      1/(2m)   1 − 1/(2m) )

This describes a particle moving on an energy landscape with 5 sites. It undergoes an unbiased random walk along the middle 3 nodes, but when it hits an endpoint, it tends to stay there for a long time (when m is large), before escaping to possibly visit the other endpoint. The stationary distribution is π = (m, 1, 1, 1, m) Z^{−1}, where Z = 2m + 3. You can check that the chain satisfies detailed balance.
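A short numerical check of this example (with m = 20, one of the cases tabulated below): build P, verify detailed balance, and compute the spectrum to see the gap between the two eigenvalues near 1 and the rest:

```python
import numpy as np

def two_well_chain(m):
    """The 5-state chain from the example: sticky endpoints, flat middle."""
    P = np.diag([0.5] * 4, 1) + np.diag([0.5] * 4, -1)
    P[0, 1] = P[4, 3] = 1 / (2 * m)
    P[0, 0] = P[4, 4] = 1 - 1 / (2 * m)
    return P

m = 20
P = two_well_chain(m)
pi = np.array([m, 1, 1, 1, m]) / (2 * m + 3)    # stationary distribution

# Detailed balance: the flux matrix pi_i p_ij is symmetric.
F = pi[:, None] * P
print(np.allclose(F, F.T))   # True

# Eigenvalues, sorted in descending order. The second one approaches 1
# as m grows, while the rest stay bounded away from it.
lam = np.sort(np.linalg.eigvals(P).real)[::-1]
print(np.round(lam, 2))
```

Running this for several values of m reproduces the table below; the slow eigenvalue λ₁ is the signature of the rare hops between the two sticky endpoints.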
Let's calculate the eigenvalues:

m = 2:    λ = 1, 0.89,  0.5,  −0.14, −0.75
m = 5:    λ = 1, 0.95,  0.62, −0.05, −0.72
m = 20:   λ = 1, 0.99,  0.68, −0.01, −0.71
m = 100:  λ = 1, 0.999, 0.70, −0.03, −0.71

You see there is a spectral gap between λ_1, which approaches 1 as m → ∞, and λ_2, which stays bounded away from 1. What do the left eigenvectors look like? Here is a sketch (for large enough m): the largest eigenvector is the steady-state (as expected), while the 2nd-largest captures transitions between the endpoints. The timescale of these transitions is approximately (1 − λ_1)^{−1} steps. The other eigenvectors describe the diffusive motion across the energetically flat plateau in the middle.

Example. Just for fun, here's an example (from Hayes [2013]) of how Markov chains can be used to generate realistic-looking text. In each of these excerpts, a Markov chain was constructed by considering the frequencies of strings of k letters from the English translation of the novel Eugene Onegin by Pushkin, for k = 1, 3, 5, 7, and was then run from a randomly-generated initial condition. You can see that when k = 3, there are English-looking syllables, when k = 5 there are English-looking words, and when k = 7 the words themselves almost fit together coherently.

References

D. Aldous and P. Diaconis. Shuffling cards and stopping times. American Mathematical Monthly, 93:333–348, 1986.

D. Austin. How many times do I have to shuffle this deck? URL http://www.ams.org/samplings/feature-column/fcarc-shuffle.

G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, 2001.

Brian Hayes. First links in the Markov chain. American Scientist, 101, 2013.

L. B. Koralov and Y. G. Sinai. Theory of Probability and Random Processes. Springer, 2010.