Miranda Holmes-Cerfon
Applied Stochastic Analysis, Spring 2015
Lecture 2: Markov Chains (I)
Readings These are only suggested readings that can provide additional perspective:
• Grimmett and Stirzaker [2001] 6.1, 6.4-6.6
• Koralov and Sinai [2010] 5.1-5.5, pp.67-78
• For a lively history and gentle introduction, see Hayes [2013].
We will begin talking about stochastic processes, i.e. random functions X(t), where t is a parameter (discrete
or continuous) and X(t) is a random variable, for each t, taking values in some state space S (discrete or
continuous). First, we will consider Markov chains.
Markov chains arise in a great many applications. Some examples include:
• models of physical processes
– rainfall from day-to-day
– neural networks
– card shuffling
– population dynamics
– lineups, e.g. in grocery stores, computer servers, telephone call centers, etc.
– chemical reactions
– protein folding
– baseball statistics
• discretizations of continuous systems
• sampling from high-dimensional systems, e.g. Markov-Chain Monte-Carlo
• data/network analysis
– clustering
– speech recognition
– PageRank algorithm in Google’s search engine.
Markov chains were first invented – by Andrei Markov – for a particular application: to analyze the distribution of letters in Russian poetry (Hayes [2013]). He meticulously constructed a list of the frequencies
of vowel↔consonant pairs in the first 20,000 letters of Pushkin’s poem-novel Eugene Onegin, and showed
that from this matrix one could infer the average number of vowels and consonants. When he realized how
powerful this idea was, he spent several years developing tools to analyze the properties of such random
processes with memory.
2.1 Setup and definitions
In our study of Markov chains we will consider the state space S to be discrete, i.e. finite or countable.
Therefore we can let it be a set of integers, as in S = {1, 2, . . . , N} or S = {1, 2, . . .}. Today, we will also
consider discrete-time processes. We write X(n) = Xn .
Definition. The process X(t) = X_0, X_1, X_2, . . . is a discrete-time Markov chain if it satisfies the Markov
property:

P(X_n = s | X_0 = x_0, X_1 = x_1, . . . , X_{n−1} = x_{n−1}) = P(X_n = s | X_{n−1} = x_{n−1}).        (1)
Definition. The Markov chain X(t) is homogeneous if P(X_{n+1} = j | X_n = i) = P(X_1 = j | X_0 = i), i.e. the
transition probabilities do not depend on time. If this is the case, we write

p_{ij} = P(X_1 = j | X_0 = i)

for the probability to go from i to j in one step.
We will only consider homogeneous Markov chains in this course.
Definition. If |S| = N (the state space is finite), we can form the transition matrix P = (p_{ij}). This satisfies:

(i) p_{ij} ≥ 0   ∀ i, j        (the entries are non-negative)
(ii) ∑_j p_{ij} = 1   ∀ i      (the rows sum to 1)

Any matrix that satisfies (i), (ii) above is called a stochastic matrix. (Note: this is not the same as a “random
matrix”!)
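For concreteness, here is a minimal numerical sketch (Python with numpy; the 3-state matrix is made up for illustration, not taken from the lecture) of checking conditions (i) and (ii):

```python
import numpy as np

# A hypothetical 3-state transition matrix (rows index the current state).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])

def is_stochastic(P, tol=1e-12):
    """Check (i) non-negative entries and (ii) rows summing to 1."""
    return bool(np.all(P >= 0) and np.allclose(P.sum(axis=1), 1.0, atol=tol))

print(is_stochastic(P))  # True
```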
Examples
1. Random walk on the line X_n = ∑_{j=1}^n ξ_j, where the ξ_j = ±1 with probability 1/2 each.
Then p_{i,i+1} = 1/2, p_{i,i−1} = 1/2, p_{i,j} = 0 (j ≠ i ± 1).
2. Random walk on a graph The most general Markov chain can be pictured as a random walk on a graph,
where the walker jumps from node to node, choosing the next edge at random with a probability that
depends only on the node it is currently at. This is a convenient way to visualize a Markov chain:
3. Card shuffling Shuffling a pack of cards can be modeled in many ways, most of which can be mapped
to Markov chains. Perhaps the simplest model is this: at each step, take a card from the top of the
deck, and put it back in at a random location. A more complicated model is the riffle shuffle: split the
deck in two at a random location, then put the halves back together by alternating randomly-sized groups
of cards from each half. Such models have been extensively analyzed and one can prove results about
the number of shuffles needed to make the deck close to random again. For example, it takes seven
riffle shuffles to get close to random, but it takes 11 or 12 to get so close that a gambler in a casino
cannot exploit the deviations from randomness to win a typical game. See Austin for an accessible
introduction, and Aldous and Diaconis [1986] for the mathematical proofs. (I first learned about this
in the beautiful Proofs from the Book, by Aigner and Ziegler.)
2
4. Ehrenfest model of diffusion Consider a container with a membrane in the middle, and m particles distributed in some way between the left and right sides. At each step, pick one particle at
random and move it to the other side. Let X_n = # of particles in
the left side at time n. Then X_n is a Markov chain, with transition
probabilities p_{i,i+1} = 1 − i/m, p_{i,i−1} = i/m.
5. Ising model, 2d This model was initially invented to study magnetism. Each square on a 2d lattice
with N = m × m sites (the magnet) is assigned a “spin” σ_j ∈ {+1, −1}. Each configuration of spins
σ = (σ_1, . . . , σ_N) has an energy

H(σ) = − ∑_{⟨i,j⟩} σ_i σ_j,

where ⟨i, j⟩ means i, j are neighbours on the lattice.
Dynamics can be added as follows: at each step, pick a spin
uniformly at random, tentatively flip it, and let ∆H be
the change in energy of the configuration. If ∆H < 0, accept the move; otherwise accept it with probability
e^{−β ∆H}, where β > 0 is a parameter, typically called the “inverse temperature.” (Small β means the magnet is hot, large
β means it is cold.) A minimal implementation sketch appears after this list.
This defines a Markov chain with transition probabilities

p_{σ→σ′} = (1/N) min(1, e^{−β ∆H})   if σ ≠ σ′ are neighbours (differ by a single spin flip),
p_{σ→σ} = 1 − ∑_{σ′≠σ} p_{σ→σ′}.
The Ising model was initially invented to study phase transitions in ferromagnetism, but it has since
been used in an enormous variety of applications, ranging from ice to gases to spin glasses to
cancer to ion channels to neuroscience to urban segregation. It was completely analytically solved in
1 dimension by Ising in 1924 in his PhD thesis. In 2 dimensions and higher, there is a competition
between spins wanting to align with their neighbours, and the vastly bigger number of states with
non-aligned spins. It can be shown that the model exhibits a phase transition at a critical inverse temperature
β_c, where the typical state changes from being ordered at low temperature to disordered at
high temperature.
6. Autoregressive model of order k (AR(k)) Given constants a_1, . . . , a_k ∈ R, let Y_n = a_1 Y_{n−1} + a_2 Y_{n−2} +
· · · + a_k Y_{n−k} + W_n, where the W_n are i.i.d.¹ random variables.
This process is not Markov, because it depends on the past k timesteps. However, we can form a
Markov process by defining X_n = (Y_n, Y_{n−1}, . . . , Y_{n−k+1})^T. Then

X_n = A X_{n−1} + W_n,

¹ i.i.d. = independent, identically distributed
where

A = ( a_1  a_2  ···  a_{k−1}  a_k )
    ( 1    0    ···  0        0   )
    ( 0    1    ···  0        0   )
    ( ⋮              ⋱        ⋮   )
    ( 0    0    ···  1        0   ),

and W_n = (W_n, 0, . . . , 0)^T.
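Here is a minimal sketch (Python/numpy) of the single-spin-flip dynamics described in Example 5. The lattice size, β, step count, and the choice of periodic boundaries are all arbitrary assumptions made for illustration; note the acceptance rule min(1, e^{−β ∆H}) is exactly the transition probability given above:

```python
import numpy as np

rng = np.random.default_rng(0)
m, beta = 20, 0.5                       # lattice side and inverse temperature (arbitrary)
sigma = rng.choice([-1, 1], size=(m, m))

def delta_H(sigma, i, j):
    """Energy change from flipping spin (i,j), with periodic boundaries."""
    m = sigma.shape[0]
    nbrs = (sigma[(i+1) % m, j] + sigma[(i-1) % m, j]
            + sigma[i, (j+1) % m] + sigma[i, (j-1) % m])
    return 2 * sigma[i, j] * nbrs       # from H = -sum over neighbour pairs

for _ in range(100_000):                # one spin-flip attempt per step
    i, j = rng.integers(m, size=2)
    dH = delta_H(sigma, i, j)
    if dH < 0 or rng.random() < np.exp(-beta * dH):
        sigma[i, j] *= -1               # accept the tentative flip
```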
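And a short sketch (Python/numpy, with made-up AR(3) coefficients) of Example 6, showing the AR(k) recursion in its Markov form X_n = A X_{n−1} + W_n with the companion matrix A above:

```python
import numpy as np

rng = np.random.default_rng(1)
a = np.array([0.5, -0.2, 0.1])          # hypothetical AR(3) coefficients
k = len(a)

# Companion matrix A: first row is (a_1 ... a_k), subdiagonal is 1.
A = np.zeros((k, k))
A[0, :] = a
A[1:, :-1] = np.eye(k - 1)

X = np.zeros(k)                         # X_n = (Y_n, Y_{n-1}, ..., Y_{n-k+1})^T
for _ in range(1000):
    W = np.zeros(k)
    W[0] = rng.standard_normal()        # W_n = (W_n, 0, ..., 0)^T
    X = A @ X + W
print(X[0])                             # the current value Y_n
```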
2.2 Evolution of probability
Next question: suppose we know the probability distribution at time t = 0, i.e. the probability of finding the
chain in any particular state. What is the distribution at later times? This comes from the following:
Chapman-Kolmogorov Equation. For 0 ≤ m ≤ n,

P(X_n = j | X_0 = i) = ∑_k P(X_n = j | X_m = k) P(X_m = k | X_0 = i).        (2)
Proof.

P(X_n = j | X_0 = i) = ∑_k P(X_n = j, X_m = k | X_0 = i)        (Law of Total Probability)
                     = ∑_k P(X_n = j | X_m = k, X_0 = i) P(X_m = k | X_0 = i)
                       (since P(A ∩ B | C) = P(A | B ∩ C) P(B | C), a property of conditional probability)
                     = RHS, by the Markov property.
This holds whether the state space is finite, or countably infinite.
Now let’s suppose the state space is finite: |S| = N < ∞. Let

µ^{(n)} = probability distribution after n steps, represented as a row vector.

Let’s calculate µ^{(n)}, assuming we know µ^{(n−1)}:

µ_j^{(n)} = ∑_i P(X_n = j | X_{n−1} = i) P(X_{n−1} = i)        (Law of Total Probability)
          = ∑_i p_{ij} µ_i^{(n−1)}.
We obtain the:

Forward Kolmogorov Equation.

µ^{(n)} = µ^{(n−1)} P    ⇐⇒    µ^{(n)} = µ^{(0)} P^n.        (3)
This shows how to evolve the probability in time. If we know the initial probability distribution µ (0) , then
we can find it at any later time using powers of the matrix P.
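As a sanity check, here is a sketch (Python/numpy, reusing the made-up 3-state P from the earlier sketch) of evolving µ^{(n)} = µ^{(0)} P^n, together with a numerical check of the Chapman-Kolmogorov identity P^n = P^m P^{n−m}:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])
mu0 = np.array([1.0, 0.0, 0.0])         # start in state 1 with probability 1

n, m = 10, 4
mu_n = mu0 @ np.linalg.matrix_power(P, n)   # forward equation (3)
print(mu_n, mu_n.sum())                     # still a probability distribution

# Chapman-Kolmogorov (2): n-step probabilities factor through time m.
Pn = np.linalg.matrix_power(P, n)
Pm = np.linalg.matrix_power(P, m)
print(np.allclose(Pn, Pm @ np.linalg.matrix_power(P, n - m)))  # True
```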
Now consider what happens if we ask for the expected value of some function of the chain: E X_n^2, E X_n^3, E|X_n|,
etc. Does this evolve in time in a similar way to (3)?
Let f : S → R be a function defined on the state space, and let

u(k, t) = E_k f(X(t)) = E(f(X(t)) | X(0) = k).
You can think of u(·,t) as a column vector indexed by k. Then u(·,t) evolves in time as:
Backward Kolmogorov Equation.

u(·, n + 1) = P u(·, n),    u(k, 0) = f(k).        (4)
The proof is left as an exercise on the homework.
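A numerical sketch of (4) (Python/numpy; f is an arbitrary test function on the made-up 3-state chain): iterating the backward equation gives u(·, n) = P^n f, whose entries are the conditional expectations E(f(X_n) | X_0 = k), and we can compare against a direct Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])
f = np.array([1.0, 4.0, 9.0])           # e.g. f(k) = k^2 on states {1, 2, 3}

n = 5
u = np.linalg.matrix_power(P, n) @ f    # backward equation iterated n times
print(u[0])                             # E(f(X_n) | X_0 = state 1)

# Monte Carlo check, starting from X_0 = state 1 (index 0).
samples = []
for _ in range(100_000):
    k = 0
    for _ in range(n):
        k = rng.choice(3, p=P[k])
    samples.append(f[k])
print(np.mean(samples))                 # approximately u[0]
```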
Remark. The backward equation is sometimes written as an equation for v(k, t) = E(f(X(T)) | X(t) = k),
where T is a fixed time and t ≤ T. The equation then takes the form

v(·, n) = P v(·, n + 1),    v(k, T) = f(k).
The equation solves for v “backwards” in time; hence the name.
It is also possible to show that µ^{(n)} v(n) is constant. Indeed,

µ^{(n)} v(n) = ∑_k µ_k^{(n)} v(k, n) = ∑_k P(X(t) = k) E(f(X(T)) | X(t) = k) = E(f(X(T))),

where we used the Law of Total Probability for the last step. The invariance of µ^{(n)} v(n) can sometimes be
used to derive the forward equation from the backward equation, and vice-versa; this exercise is left for the
reader.²
Remark. Yet another approach is to let P(k, t; l, s) be the transition probability to be in state k at time t, given
the system started in state l at time s. One can then derive equations for how P(k, t; l, s) evolves in t (forward
equation) and in s (backward equation).
2.3 Stationary probability
One question we might ask is: What happens to the chain after long times?
The chain itself does not converge to anything, because it is continually jumping around, but the probability
distribution might. If so, we should look for a distribution that doesn’t change with time.
Definition. The vector π is called a stationary distribution if

(i) π_j ≥ 0, ∑_j π_j = 1.
(ii) π = πP, i.e. π_j = ∑_i π_i p_{ij} for all j.

² Hint: for one direction, write µ^{(n+1)} v(n + 1) = µ^{(n)} v(n) = µ^{(n)} P v(n + 1), rearrange, and argue for conditions under which you can derive the equation for µ^{(n)}.
Remark. Other synonyms you might hear for stationary distribution include invariant measure, invariant
distribution, steady-state probability, equilibrium probability, or equilibrium distribution (the latter two are
from physics).
Condition (i) says the vector is a probability distribution. Condition (ii) says that if we start with probability π
and run the Markov chain, the probability will not change.
Some questions we might ask about π include:
– Does it exist?
– Is it unique?
– Does an arbitrary distribution converge to it?
These are actually questions about the eigenvalues of the transition matrix P. Indeed, to satisfy (ii), we
need λ = 1 to be an eigenvalue. If an arbitrary distribution is to converge in some way, we need all other
eigenvalues to have complex norm less than 1.
Note that λ = 1 is clearly an eigenvalue of P, since it has right eigenvector (1, 1, . . . , 1)T . We need to
know whether the corresponding left eigenvector is always positive, and also what the other eigenvalues /
generalized eigenvalues are.
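Numerically, π can be found as the left eigenvector of P with eigenvalue 1, normalized to sum to 1. A sketch (Python/numpy, with the same made-up 3-state P as before):

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])

# Left eigenvectors of P are right eigenvectors of P^T.
evals, evecs = np.linalg.eig(P.T)
i = np.argmin(np.abs(evals - 1.0))      # locate the eigenvalue 1
pi = np.real(evecs[:, i])
pi /= pi.sum()                          # normalize to a probability distribution

print(pi)
print(np.allclose(pi @ P, pi))          # pi = pi P, condition (ii): True
```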
Lemma. The spectral radius of P is 1, i.e. ρ(P) = max_λ |λ| = 1, where the max is over all eigenvalues.

Proof. Let η be a left eigenvector with eigenvalue λ. Then λ η_j = ∑_{i=1}^N η_i p_{ij}, so

|λ| ∑_{j=1}^N |η_j| = ∑_{j=1}^N | ∑_{i=1}^N η_i p_{ij} | ≤ ∑_{i,j=1}^N |η_i| p_{ij} = ∑_{i=1}^N |η_i|.

Therefore |λ| ≤ 1, and since λ = 1 is an eigenvalue, ρ(P) = 1.
We can say more than this, if we know something about the structure of the chain.
Definition. A stochastic matrix is irreducible if, for every pair (i, j) there exists an s > 0 such that (P^s)_{ij} > 0.
A homogeneous Markov chain is irreducible if it is generated by an irreducible stochastic matrix.

Definition. A stochastic matrix is primitive if there exists some s > 0 such that the s-step transition probabilities
are positive for all i, j, i.e. (P^s)_{ij} = p_{ij}^{(s)} > 0 ∀ i, j. A homogeneous Markov chain is primitive if it is
generated by a primitive stochastic matrix.
Note that some books (e.g. Koralov and Sinai [2010]) call such a matrix ergodic instead.
Examples In these examples, an arrow from one node to another means there is a positive transition
probability between these nodes in that direction; the actual value of the transition probability is
not important.
Theorem (Ergodic Theorem for Markov Chains). Assume the Markov chain is primitive. Then there exists a unique stationary probability distribution π = (π_1, . . . , π_N), with π_j > 0 ∀ j. The n-step transition
probabilities converge to π: that is, lim_{n→∞} p_{ij}^{(n)} = π_j.
Remark. The name of this theorem comes from Koralov and Sinai [2010]; it may not be universal. There
are many meanings of the word “ergodic;” we will see several variants throughout this course.
Proof. This is a sketch; see Koralov and Sinai [2010] p. 72 for all the details.
– Define a metric d(µ′, µ″) = (1/2) ∑_{i=1}^N |µ′_i − µ″_i| on the space of probability distributions. It can be shown
that d is a metric, and the space of distributions is complete.
– Show (*): d(µ′Q, µ″Q) ≤ (1 − α) d(µ′, µ″), where α = min_{i,j} Q_{ij} and Q is a stochastic matrix.
– Show that µ^{(0)}, µ^{(0)}P, µ^{(0)}P², . . . is a Cauchy sequence. Therefore it converges, so let π = lim_{n→∞} µ^{(n)}.
– Show that π is unique: let π_1, π_2 be two distributions such that π_i = π_i P. Then d(π_1, π_2) = d(π_1 P^s, π_2 P^s) ≤
(1 − α) d(π_1, π_2) by (*) applied to Q = P^s. Therefore d(π_1, π_2) = 0.
– Let µ^{(0)} be the probability distribution which is concentrated at point i. Then µ^{(0)} P^n is the probability
distribution (p_{ij}^{(n)})_j. Therefore, lim_{n→∞} p_{ij}^{(n)} = π_j.
Remark. This shows that d(µ^{(0)} P^n, π) ≤ (1 − α)^{−1} β^n, with β = (1 − α)^{1/s} < 1. Therefore the rate of
convergence to the stationary distribution is exponential.
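The exponential convergence is easy to see numerically: for a primitive chain, d(µ^{(0)} P^n, π) decays roughly geometrically. A sketch (Python/numpy, same made-up P) using the metric d from the proof:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])

# Stationary distribution via the left eigenvector with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi /= pi.sum()

mu = np.array([1.0, 0.0, 0.0])
for n in range(1, 11):
    mu = mu @ P
    d = 0.5 * np.abs(mu - pi).sum()     # the metric d from the proof sketch
    print(n, d)                          # decays roughly geometrically in n
```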
Theorem (Law of large numbers for a primitive Markov chain). Let π be the stationary distribution of a
primitive Markov chain, and let ν_i^{(n)} be the number of occurrences of state i after n time steps of the
chain, i.e. among the values X_0, X_1, . . . , X_n. Let ν_{ij}^{(n)} be the number of values of k, 1 ≤ k ≤ n, for which
X_{k−1} = i, X_k = j. Then for any ε > 0,

lim_{n→∞} P( |ν_i^{(n)}/n − π_i| ≥ ε ) = 0,        lim_{n→∞} P( |ν_{ij}^{(n)}/n − π_i p_{ij}| ≥ ε ) = 0.
Proof. For a proof, see Koralov and Sinai [2010], p. 74.
Remark. This implies that the long-time average of any function of the Markov chain approaches the average with respect to the stationary distribution.
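A quick empirical check of the law of large numbers (Python/numpy, on the made-up 3-state chain from earlier sketches): simulate one long trajectory and compare the occupation frequencies ν_i^{(n)}/n with π_i:

```python
import numpy as np

rng = np.random.default_rng(3)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi /= pi.sum()

n = 200_000
counts = np.zeros(3)
k = 0
for _ in range(n):                      # one long trajectory
    k = rng.choice(3, p=P[k])
    counts[k] += 1

print(counts / n)                       # empirical occupation frequencies
print(pi)                               # close to the stationary distribution
```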
Remark. Several of these results hold for irreducible chains, in slightly weaker form. Irreducible
chains also have a unique stationary distribution – this follows from the Perron-Frobenius Theorem. However, it is not true that an arbitrary distribution converges to it; rather, we have that µ^{(0)} P̄^{(n)} → π as n → ∞,
where P̄^{(n)} = (1/n) ∑_{k=1}^n P^k. This is because there may be a built-in periodicity, as in the chain with transition
matrix

P = ( 0  1 )
    ( 1  0 ).

In this case P^{2n} = I, and P^{2n+1} = P, so µ^{(n)} oscillates between two distributions,
instead of converging to a fixed limit.
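The time-averaged convergence for this periodic chain can be checked directly (Python/numpy): µ^{(n)} itself oscillates, but the Cesàro average P̄^{(n)} converges to the matrix with π = (1/2, 1/2) in every row:

```python
import numpy as np

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# Cesaro average P_bar(n) = (1/n) * sum_{k=1}^{n} P^k
n = 1001
Pbar = np.zeros((2, 2))
Pk = np.eye(2)
for _ in range(n):
    Pk = Pk @ P
    Pbar += Pk
Pbar /= n
print(Pbar)                             # approaches [[0.5, 0.5], [0.5, 0.5]]
```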
2.4 Detailed balance
What happens if we run a Markov chain backwards? Suppose we take a primitive chain, start it in the
stationary distribution, and let it run, to get X_0, X_1, X_2, . . . , X_N, each of these with distribution X_i ∼ π. Let
Y_n = X_{N−n} be the “reversed” chain.

Theorem. Y_0, Y_1, . . . , Y_N is a Markov chain with P(Y_{n+1} = j | Y_n = i) = (π_j / π_i) p_{ji}.
Proof. Calculate the conditional probability, writing k = N − n − 1:

P(Y_{n+1} = j | Y_n = i) = P(Y_{n+1} = j, Y_n = i) / P(Y_n = i)
                         = P(X_k = j, X_{k+1} = i) / P(X_{k+1} = i)
                         = P(X_{k+1} = i | X_k = j) P(X_k = j) / P(X_{k+1} = i)
                         = p_{ji} π_j / π_i.
We also need to show it’s Markov:

P(Y_{n+1} = i_{n+1} | Y_n = i_n, . . . , Y_0 = i_0)
    = P(Y_k = i_k, 0 ≤ k ≤ n + 1) / P(Y_k = i_k, 0 ≤ k ≤ n)
    = P(X_{N−n−1} = i_{n+1}, X_{N−n} = i_n, . . . , X_N = i_0) / P(X_{N−n} = i_n, . . . , X_N = i_0)
    = π_{i_{n+1}} p_{i_{n+1}, i_n} p_{i_n, i_{n−1}} · · · p_{i_1, i_0} / (π_{i_n} p_{i_n, i_{n−1}} · · · p_{i_1, i_0})
    = π_{i_{n+1}} p_{i_{n+1}, i_n} / π_{i_n}.

This depends only on i_n, so Y is Markov.
Question: when does Y have the same transition probabilities as X?
Definition. A Markov chain satisfies detailed balance if

π_i p_{ij} = π_j p_{ji}    ∀ i, j.        (5)
Remark. From the above, we see that when detailed balance holds, the time-reversed chain (when run in the
steady-state) has the same transition probabilities as the original. Therefore, the chain is indistinguishable
when it is run forwards and backwards.
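Here is a sketch (Python/numpy) checking detailed balance and the reversed-chain formula on a small chain. The 3-state matrix below is made up; it is a birth-death chain (jumps only between adjacent states), and such chains always satisfy detailed balance:

```python
import numpy as np

# A hypothetical 3-state birth-death chain.
P = np.array([[0.7, 0.3, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi /= pi.sum()

flux = pi[:, None] * P                  # flux[i, j] = pi_i p_ij
print(np.allclose(flux, flux.T))        # detailed balance (5): True

# Reversed-chain transition matrix: q_ij = (pi_j / pi_i) p_ji.
Q = (P.T * pi[None, :]) / pi[:, None]
print(np.allclose(Q, P))                # reversible: Y has the same transitions as X
```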
Notes
• Detailed balance is a very important concept that is widely used in physics and statistics.
• The quantity π_i p_{ij} is the “amount” of probability flowing down edge i → j in each time step, in steady-state. If a system satisfies detailed balance, then the amount of probability flowing down an edge in
one direction in steady state equals the amount flowing in the other direction. Therefore, there is no
net flux of probability.
• Here is a system that does not satisfy detailed balance:
There must be a net flux of probability through the red edges.
• Another way to picture detailed balance is to imagine a turbulent fluid in a box. If the fluid is forced
and dissipated isotropically (e.g. by some external heat bath), then there should be no mean flow, on
average. However, if we stir the fluid in one direction, then even if it is very turbulent there will be a
mean circulation in the box.
• In physics, if a system satisfies detailed balance, then it is called an equilibrium system, or sometimes reversible in equilibrium. Otherwise, it is called a non-equilibrium system. Note that being an
equilibrium system is different from being in equilibrium, a.k.a. in steady-state – many systems have
a stationary distribution π that the probability converges to, but that stationary distribution may carry
non-zero fluxes: the steady-state must satisfy π_i = ∑_j π_j p_{ji}, but it does not necessarily
satisfy (5).
• In physics, detailed balance is a very important concept, because if there are non-zero fluxes in steady-state, then a system must have forces acting on it. This makes statistical mechanics vastly more
complicated. Most of statistical mechanics has been developed to deal with equilibrium systems;
it is only now being developed for non-equilibrium systems. One example of a non-equilibrium
system that is still very poorly understood is a conducting rod that is maintained at a hot temperature
at one end and a cold one at the other. We all know the steady-state is a linear temperature distribution
in the rod, but deriving this from microscopic interactions is still a research topic.
• In statistics, the concept of detailed balance is widely used because when it holds, it is easier to show
that a particular distribution is the stationary distribution – see this week’s homework, and the lecture on
Monte-Carlo methods (Week ∼6).
• Here are some examples of physical systems that do / do not satisfy detailed balance:

No detailed balance:
– self-propelled (e.g. swimming) particle
– atmospheric circulation patterns
– plasma with non-Maxwellian velocities
– system in contact with two heat baths, at different temperatures
– snowflake melting in a coffee cup
– sheared crystal

Detailed balance:
– particle diffusing in a fluid
– Ising model with dynamics above
– ideal gas in an insulated box
– system in contact with one heat bath
– covered, insulated coffee cup with liquid/vapour equilibrium
– crystal with no external forces
2.5 Spectral decomposition for a Markov Chain that satisfies detailed balance
If P satisfies detailed balance, then it can be symmetrized by a similarity transformation:

V = Λ P Λ^{−1},        (6)

where Λ_{ij} = δ_{ij} √π_i is the matrix with √π on its diagonal.
Then V_{ij} = (√π_i / √π_j) p_{ij} = (√π_j / √π_i) p_{ji} (by detailed balance) = V_{ji}, so V is symmetric.
We know from linear algebra that V has a full set of real eigenvalues λ_j ∈ R, and an orthonormal set of eigenvectors w_j that are both the left and right eigenvectors. Therefore, P also has eigenvalues λ_j, and it has

left eigenvectors  ψ_j = Λ w_j,        right eigenvectors  φ_j = Λ^{−1} w_j,

i.e. each element of the vectors, (ψ_j)_i or (φ_j)_i, is the corresponding element of w_j multiplied by √π_i or (√π_i)^{−1}. We also have that ψ_j = Λ² φ_j,
i.e. the left eigenvector equals π times the right eigenvector, componentwise: (ψ_j)_i = π_i (φ_j)_i. Note that this is always true for the eigenvectors
corresponding to λ = 1, but for a chain that satisfies detailed balance it is true for all other eigenvectors as
well.
Suppose that |w_j| = 1. The spectral decomposition of P is given by

P^t = ∑_k λ_k^t φ_k ψ_k^T = ∑_k λ_k^t φ_k φ_k^T Λ².        (7)

In components: p_{ij}^{(t)} = ∑_k λ_k^t (φ_k)_i (ψ_k)_j = ∑_k λ_k^t (φ_k)_i (φ_k)_j π_j.
Remark. Another way to derive (7) is to define an inner product ⟨·,·⟩_π by ⟨u, v⟩_π = ∑_i u_i v_i π_i. Then we can
show that P is self-adjoint with respect to this inner product, i.e. ⟨Pu, v⟩_π = ⟨u, Pv⟩_π, so (7) follows.
Note. The spectral decomposition gives insight into the timescales associated with the Markov chain. We
will see an example of this below.
Note. If you truncate (7), then you typically don’t get a stochastic matrix back. In fact, if you evolve the
probability with the truncated matrix, the probability can even become negative. Therefore the spectral
decomposition hides the fact that P is stochastic.
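A numerical sketch (Python/numpy) of the symmetrization (6) and the decomposition (7), using the same made-up reversible 3-state chain from the detailed-balance sketch:

```python
import numpy as np

P = np.array([[0.7, 0.3, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi /= pi.sum()

Lam = np.diag(np.sqrt(pi))
V = Lam @ P @ np.linalg.inv(Lam)        # similarity transformation (6)
print(np.allclose(V, V.T))              # V is symmetric: True

lam, w = np.linalg.eigh(V)              # real eigenvalues, orthonormal eigenvectors
phi = np.linalg.inv(Lam) @ w            # right eigenvectors of P
psi = Lam @ w                           # left eigenvectors of P

t = 4
Pt = sum(lam[k]**t * np.outer(phi[:, k], psi[:, k]) for k in range(3))
print(np.allclose(Pt, np.linalg.matrix_power(P, t)))   # decomposition (7): True
```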
Example. Consider the transition matrix

    ( 1 − 1/(2m)   1/(2m)   0      0        0          )
    ( 1/2          0        1/2    0        0          )
P = ( 0            1/2      0      1/2      0          )
    ( 0            0        1/2    0        1/2        )
    ( 0            0        0      1/(2m)   1 − 1/(2m) )
This describes a particle moving on an energy landscape with 5 sites. It undergoes an unbiased random walk
along the middle 3 nodes, but when it hits an endpoint, it tends to stay there for a long time (when m is
large), before escaping to possibly visit the other endpoint.
The stationary distribution is π = (m, 1, 1, 1, m) Z^{−1}, where Z = 2m + 3. You can check that the chain satisfies
detailed balance.
Let’s calculate the eigenvalues:

m = 2:      λ = 1,  0.89,   0.5,   −0.14,  −0.75
m = 5:      λ = 1,  0.95,   0.62,  −0.05,  −0.72
m = 20:     λ = 1,  0.99,   0.68,  −0.01,  −0.71
m = 100:    λ = 1,  0.999,  0.70,  −0.03,  −0.71
You see there is a spectral gap between λ1 , which approaches 1 as m → ∞, and λ2 , which appears bounded
away from 1.
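You can reproduce these numbers, and the detailed-balance check, with a short script (Python/numpy):

```python
import numpy as np

def P_matrix(m):
    """The 5-state transition matrix from the example above."""
    e = 1.0 / (2 * m)
    return np.array([[1 - e, e,   0,   0,   0],
                     [0.5,   0,   0.5, 0,   0],
                     [0,     0.5, 0,   0.5, 0],
                     [0,     0,   0.5, 0,   0.5],
                     [0,     0,   0,   e,   1 - e]])

for m in [2, 5, 20, 100]:
    P = P_matrix(m)
    pi = np.array([m, 1, 1, 1, m]) / (2 * m + 3)
    assert np.allclose(pi @ P, pi)                            # stationarity
    flux = pi[:, None] * P
    assert np.allclose(flux, flux.T)                          # detailed balance (5)
    lam = np.sort(np.linalg.eigvals(P).real)[::-1]            # real, since reversible
    print(m, np.round(lam, 3))
```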
What do the left eigenvectors look like? Here is a sketch (for large enough m):
The largest eigenvector is the steady-state (as expected), while the 2nd-largest captures transitions between
the endpoints. The timescale of these transitions is set by λ_1, roughly −1/ log λ_1 ≈ (1 − λ_1)^{−1} steps. The other eigenvectors describe the diffusive
motion across the energetically flat plateau in the middle.
Example. Just for fun, here’s an example (from Hayes [2013]) of how Markov chains can be used to
generate realistic-looking text. In each of these excerpts, a Markov chain was constructed by considering the
frequencies of strings of k letters from the English translation of the novel Eugene Onegin by Pushkin, for
k = 1, 3, 5, 7, and was then run from a randomly-generated initial condition. You can see that when k = 3,
there are English-looking syllables, when k = 5 there are English-looking words, and when k = 7 the words
themselves almost fit together coherently.
References

D. Aldous and P. Diaconis. Shuffling cards and stopping times. American Mathematical Monthly, 93:333–348, 1986.

D. Austin. How many times do I have to shuffle this deck? URL http://www.ams.org/samplings/feature-column/fcarc-shuffle.

G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, 2001.

B. Hayes. First links in the Markov chain. American Scientist, 101, 2013.

L. B. Koralov and Y. G. Sinai. Theory of Probability and Random Processes. Springer, 2010.