Lecture 5. Markov chains
Mathematical Statistics and Discrete Mathematics
November 16th, 2015

Stochastic processes
• What we have learned so far concerned random variables, interpreted e.g. as measurements of some random system.
• Most (physical) systems evolve in time, and one wants to be able to analyze such systems.
• This can be done by, say, repeated measurements indexed by the time of the measurement.
• This is modelled by objects called stochastic processes.
Main motivating question: how do we analyze random systems evolving in time?
Let Ω be a sample space with a probability measure P on the events contained in Ω. A stochastic process is a collection of random variables {X_t}_{t ∈ T} defined on Ω and indexed by time t ∈ T. If T is countable, we say that the process is a discrete-time stochastic process. Otherwise we say that it is a continuous-time stochastic process.

Markov chains
We will use the notation {X_n} = {X_n}_{n=0}^∞ = X_0, X_1, X_2, ... for a sequence of random variables. We will consider random variables taking values in countable sets (not necessarily subsets of the real numbers). By S we will denote the state space, that is, a countable set in which our random variables take their values.
A sequence of random variables {X_n} defined on the sample space Ω and taking values in a state space S is called a Markov chain if it satisfies the Markov property: for all n and all s_0, s_1, ..., s_n ∈ S,
P(X_n = s_n | X_{n−1} = s_{n−1}, X_{n−2} = s_{n−2}, ..., X_0 = s_0) = P(X_n = s_n | X_{n−1} = s_{n−1}).
One can put the Markov property into words in the following way: the future state of a Markov chain is influenced by its past states only through its present state.

Note that a Markov chain is a discrete-time stochastic process.
A Markov chain is called stationary, or time-homogeneous, if for all n and all s, s′ ∈ S,
P(X_n = s′ | X_{n−1} = s) = P(X_{n+1} = s′ | X_n = s).
The above probability is called the transition probability from state s to state s′.
• A game of tennis between two players can be modelled by a Markov chain X_n which keeps track of the current score. The transition probabilities depend on the skills of the players, and probably also on the score itself.
• A game of chess can be modelled by a Markov chain X_n whose state space is the set of all possible chessboard configurations.

Let {X_n} be a sequence of independent random variables. Then:
• {X_n} is a Markov chain (the Markov property holds here trivially, since the past does not influence the future at all). It is stationary if and only if the variables all have the same distribution.
• {Y_n^{(1)} = max{X_1, X_2, ..., X_n}} is a Markov chain.
• {Y_n^{(2)} = min{X_1, X_2, ..., X_n}} is a Markov chain.
• {Y_n^{(1)} + Y_n^{(2)}} is not a Markov chain.
• {(Y_n^{(1)}, Y_n^{(2)})} is a Markov chain.
• {Y_n = X_1 + X_2 + ... + X_n} is a Markov chain. If the X_i take values in {−1, 1}, it is called a random walk.

One can define Markov chains by using state diagrams:
• The following state diagram describes a Markov chain on the state space S = {A, B, C, D}. [Diagram not reproduced; its transition probabilities, 1/2 and 1/4, are written out in the transition matrix on the "Transition matrix" slide below.]
• The following diagram describes a Markov chain formed from a repeated execution of Experiment 2 from the previous lecture: the next coin toss is always biased towards the outcome of the current toss. [Diagram not reproduced: on the states {0, 1}, the chain repeats the current outcome with probability 2/3 and switches with probability 1/3.]
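As a quick illustration of the second diagram (a sketch added here, not part of the original slides; the function names are illustrative), the following Python snippet simulates the biased-coin chain and estimates its one-step transition frequencies from a long sample path.

```python
import random
from collections import Counter

def step(state):
    """One step of the biased-coin chain: keep the current outcome with
    probability 2/3, switch to the other outcome with probability 1/3."""
    return state if random.random() < 2 / 3 else 1 - state

def simulate(n_steps, start=0):
    """Return a sample path X_0, X_1, ..., X_{n_steps} started at `start`."""
    path = [start]
    for _ in range(n_steps):
        path.append(step(path[-1]))
    return path

random.seed(0)
path = simulate(100_000)
counts = Counter(zip(path, path[1:]))          # observed one-step transitions
for s in (0, 1):
    row_total = counts[(s, 0)] + counts[(s, 1)]
    print(s, [round(counts[(s, t)] / row_total, 3) for t in (0, 1)])
```

With a long enough path the estimated rows approach (2/3, 1/3) and (1/3, 2/3), which is exactly the transition matrix of this chain written out below.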
Transition matrix
We will only deal with stationary Markov chains on finite state spaces, and we will drop the word "stationary" in the statements.
Let {X_n}_{n=0}^∞ be a Markov chain on a finite state space S = {s_1, s_2, ..., s_k}. The transition matrix P is the k × k matrix
P = \begin{pmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,k} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ p_{k,1} & p_{k,2} & \cdots & p_{k,k} \end{pmatrix},
where p_{i,j} = P(X_n = s_j | X_{n−1} = s_i) for all i, j = 1, 2, ..., k.
A matrix P is a transition matrix of some Markov chain if and only if all its entries are non-negative and they sum up to 1 in each row, that is, for every i, \sum_{j=1}^{k} p_{i,j} = 1.
• The matrices
P_1 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \qquad P_2 = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}
define deterministic (cellular automaton) Markov chains on a four-element set.
• The matrices
P_1 = \begin{pmatrix} 0 & 1/2 & 0 & 1/2 \\ 1/2 & 0 & 1/2 & 0 \\ 0 & 1/2 & 0 & 1/2 \\ 1/2 & 0 & 1/2 & 0 \end{pmatrix}, \qquad P_2 = \begin{pmatrix} 0 & 2/3 & 0 & 1/3 \\ 1/3 & 0 & 2/3 & 0 \\ 0 & 1/3 & 0 & 2/3 \\ 2/3 & 0 & 1/3 & 0 \end{pmatrix}
define a symmetric and an asymmetric random walk on a four-element cyclic set.
• Let S = {s_1, s_2, s_3, s_4} = {A, B, C, D} and consider the Markov chain given by the first state diagram above. Then
P = \begin{pmatrix} 1/2 & 1/2 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 \\ 1/4 & 1/4 & 1/4 & 1/4 \\ 0 & 0 & 0 & 1 \end{pmatrix}.
• Let S = {s_1, s_2} = {0, 1} and consider the Markov chain given by the biased-coin diagram above. Then
P = \begin{pmatrix} 2/3 & 1/3 \\ 1/3 & 2/3 \end{pmatrix}.
• Let {X_n} be a sequence of independent rolls of a fair die and let {Y_n = max{X_1, X_2, ..., X_n}}. Then
P = \begin{pmatrix} 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\ 0 & 2/6 & 1/6 & 1/6 & 1/6 & 1/6 \\ 0 & 0 & 3/6 & 1/6 & 1/6 & 1/6 \\ 0 & 0 & 0 & 4/6 & 1/6 & 1/6 \\ 0 & 0 & 0 & 0 & 5/6 & 1/6 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}.

Distribution vector
One can represent the probability distribution of a random variable X taking values in a finite state space S = {s_1, s_2, ..., s_k} by the distribution vector u given by
u = (u_1, u_2, ..., u_k), where u_i = P(X = s_i) for i = 1, 2, ..., k.
We choose u to be a row vector.
Let {X_n} be a Markov chain on S = {s_1, s_2, ..., s_k}, and let u^{(n)} = (u_1^{(n)}, u_2^{(n)}, ..., u_k^{(n)}) be the distribution vector of X_n. Then
u^{(n)} = u^{(n−1)} · P = u^{(n−2)} · P^2 = ··· = u^{(0)} · P^n,
where · denotes matrix multiplication. We call u^{(0)} the initial distribution.
Proof. Note that the events {X_{n−1} = s_1}, {X_{n−1} = s_2}, ..., {X_{n−1} = s_k} are pairwise disjoint and their union is the full sample space Ω. We can hence use the total probability formula for the event {X_n = s_j}. Fix a state s_j. We can write
P(X_n = s_j) = \sum_{i=1}^{k} P(X_n = s_j | X_{n−1} = s_i) P(X_{n−1} = s_i) = \sum_{i=1}^{k} p_{i,j} P(X_{n−1} = s_i) = \sum_{i=1}^{k} p_{i,j} u_i^{(n−1)} = u^{(n−1)} · c_j (the dot product of u^{(n−1)} and c_j),
where c_j is the j-th column vector of P. This agrees with the definition of matrix multiplication.
We say that the chain {X_n} starts in state s_i if the initial distribution satisfies u_i^{(0)} = 1.
The i, j entry of the matrix P^n is
p_{i,j}^{(n)} = P(X_n = s_j | X_0 = s_i).
• Consider a Markov chain {X_n} given by the matrix
P = \begin{pmatrix} 2/3 & 1/3 \\ 1/3 & 2/3 \end{pmatrix}
and started at s_1. What is the distribution vector of X_2, X_4, X_8 and X_{16}? We have u^{(0)} = (1, 0), and
u^{(2)} = u^{(0)} · P^2 = (5/9, 4/9),
u^{(4)} = u^{(0)} · P^4 = (41/81, 40/81),
u^{(8)} = u^{(0)} · P^8 = (3281/6561, 3280/6561),
u^{(16)} = u^{(0)} · P^{16} = (21523361/43046721, 21523360/43046721).
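The distribution vectors in the last example can be checked with a short Python computation using exact fractions (a sketch added here for illustration; the helper vec_mat is not from the lecture).

```python
from fractions import Fraction

def vec_mat(u, P):
    """Row vector times matrix: (u · P)_j = sum_i u_i * p_{i,j}."""
    return [sum(u[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]

P = [[Fraction(2, 3), Fraction(1, 3)],
     [Fraction(1, 3), Fraction(2, 3)]]
u = [Fraction(1), Fraction(0)]        # u^(0) = (1, 0): the chain starts at s1

for n in range(1, 17):
    u = vec_mat(u, P)                 # u^(n) = u^(n-1) · P
    if n in (2, 4, 8, 16):
        print(n, u)
# The printed vectors should match the fractions on the slide,
# e.g. (5/9, 4/9) for n = 2 and (41/81, 40/81) for n = 4.
```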
Absorbing Markov chains
A state s_i ∈ S of a stationary Markov chain is called
• absorbing if p_{i,i} = 1,
• transient if p_{i,i} < 1.
A Markov chain is called absorbing if from any transient state one can reach an absorbing state.
• Let a Markov chain be given by the following transition matrix:
P = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 & 0 \\ 0 & 1/2 & 0 & 1/2 & 0 \\ 0 & 0 & 1/2 & 0 & 1/2 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}.
This is an absorbing Markov chain. It can be interpreted as a mouse jumping on a table of length 1 m and making jumps of 0.5 m to the left or to the right with equal probability: the three middle states are the positions on the table, and the two absorbing states correspond to the mouse having fallen off the table (either on the left-hand side or on the right-hand side).
• Consider the Markov chain on S = {A, B, C, D} given by the first state diagram above. This Markov chain is not absorbing, since the only absorbing state D cannot be reached from the transient states A and B.

The probability that an absorbing Markov chain will eventually end up in one of its absorbing states is 1.
Proof. From each non-absorbing state s_j it is possible to reach an absorbing state. Let m_j be the minimum number of steps required to reach an absorbing state, starting from s_j. Let p_j be the probability that, starting from s_j, the process does not reach an absorbing state in m_j steps. Then p_j < 1. Let m be the largest of the m_j and let p be the largest of the p_j. The probability of not being absorbed in m steps is less than or equal to p, in 2m steps less than or equal to p^2, and so on. Since p < 1, these probabilities tend to 0.
For the interested: this argument is very similar to the one for the infinite monkey theorem, which states that a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will with probability 1 type any given text, such as the complete works of William Shakespeare.

Canonical form of the transition matrix
Let {X_n} be an absorbing Markov chain with l transient states and r absorbing states. Let {s_1, s_2, ..., s_{l+r}} be a numbering of the state space in which all the absorbing states come at the end, and let P be the transition matrix for this particular numbering. Then P is in the canonical form
P = \begin{pmatrix} Q & R \\ 0 & I \end{pmatrix},
where
• I is an r × r identity matrix,
• 0 is an r × l zero matrix,
• Q is an l × l matrix with the transition probabilities between transient states,
• R is an l × r matrix with the transition probabilities from transient to absorbing states.
Note that
P^n = \begin{pmatrix} Q^n & R^{(n)} \\ 0 & I \end{pmatrix}
for some matrix R^{(n)}.

Time to absorption
Let {X_n} be an absorbing Markov chain and let Q be as before. The fundamental matrix is defined to be
N = (I − Q)^{−1}.
The entries n_{i,j} of N satisfy
n_{i,j} = the expected number of visits of {X_n} to the state s_j when started from state s_i.
Note that N = I + Q + Q^2 + ···.
Proof. Fix i and let {X_n} be a Markov chain started at s_i. Let Y_n^j = 1 if X_n = s_j, and Y_n^j = 0 otherwise. Then Y^j = Y_0^j + Y_1^j + ··· is the total number of visits of {X_n} to s_j. Hence
E[Y^j] = E[Y_0^j] + E[Y_1^j] + ··· = I_{i,j} + P(X_1 = s_j | X_0 = s_i) + P(X_2 = s_j | X_0 = s_i) + ··· = I_{i,j} + p_{i,j} + p_{i,j}^{(2)} + ··· = I_{i,j} + q_{i,j} + q_{i,j}^{(2)} + ··· = n_{i,j},
where q_{i,j}^{(n)} is the i, j entry of the matrix Q^n. (Here p_{i,j}^{(n)} = q_{i,j}^{(n)} because both s_i and s_j are transient, and a trajectory between transient states cannot pass through an absorbing state.)

Let t_i = the expected time before absorption when started from s_i, and let t = (t_1, t_2, ..., t_l) be a column vector. Then
t = N · c,
where c is a column vector of 1's.
Proof. Fix i and let {X_n} be a Markov chain started at s_i. Let Z^j be the total number of visits of {X_n} to s_j, and let T_i be the time until absorption for {X_n}. We have T_i = \sum_j Z^j, and hence
t_i = E[T_i] = \sum_j E[Z^j] = \sum_j n_{i,j} = (N · c)_i.
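As a concrete illustration of the fundamental matrix (a sketch added here, assuming NumPy is available; not part of the lecture), one can compute N and t = N·c for the running-maximum-of-a-die chain from the transition matrix examples, whose only absorbing state is 6. Every entry of t should come out as 6, since from any transient state the chain is absorbed exactly when a six is rolled, and that takes 6 rolls on average.

```python
import numpy as np

# Transient states: current maximum 1, ..., 5 (state 6 is absorbing).
# From maximum i the chain stays at i with probability i/6 and jumps to
# each larger maximum j, j < 6, with probability 1/6.
Q = np.zeros((5, 5))
for i in range(5):                      # row i corresponds to maximum i + 1
    Q[i, i] = (i + 1) / 6
    Q[i, i + 1:] = 1 / 6

N = np.linalg.inv(np.eye(5) - Q)        # fundamental matrix N = (I - Q)^(-1)
t = N @ np.ones(5)                      # expected times to absorption, t = N·c

print(np.round(N, 3))
print(t)                                # each entry should be (close to) 6
```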
Probability of absorption
Let s_i be a transient state. We define
b_{i,j} = P(absorption in s_j | X_0 = s_i),
and put B = (b_{i,j})_{1 ≤ i ≤ l, 1 ≤ j ≤ r}. Then
B = N · R.
Proof. We represent (Q^n)_{i,k} as a sum over all possible trajectories of {X_n} of n steps from a transient state s_i to a transient state s_k. All trajectories of n + 1 steps from s_i ending at an absorbing state s_j have to make their first n steps between transient states and their last step from a transient state to the absorbing state s_j. Hence, the sum over all such trajectories is
\sum_k (Q^n)_{i,k} R_{k,j} = (Q^n R)_{i,j}.
If we sum over all possible numbers of steps, we get
B_{i,j} = [(I + Q + Q^2 + ···) R]_{i,j} = (N R)_{i,j}.

Exercise: Compute the vector t and the matrix B for the mouse on a table.
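One possible way to check the exercise numerically (a sketch assuming NumPy; the state ordering and the values quoted in the comments come from this computation, not from the slides): reorder the mouse chain into canonical form, with the three table positions first and the two fallen-off states last, and compute N, t = N·c and B = N·R.

```python
import numpy as np

# Canonical ordering of the mouse chain: transient states are the positions
# 0 m, 0.5 m and 1 m on the table; the absorbing states (fallen off on the
# left, fallen off on the right) come last.
Q = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.0]])
R = np.array([[0.5, 0.0],
              [0.0, 0.0],
              [0.0, 0.5]])

N = np.linalg.inv(np.eye(3) - Q)   # fundamental matrix
t = N @ np.ones(3)                 # expected times to absorption
B = N @ R                          # absorption probabilities

print(t)   # this computation gives (3, 4, 3)
print(B)   # and rows (3/4, 1/4), (1/2, 1/2), (1/4, 3/4)
```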