MARKOV CHAIN
INTRODUCTION
Most of our study of probability has dealt with independent trials processes. These
processes are the basis of classical probability theory and much of statistics. During our
high school studies, we learned two of the principal theorems for these
processes: the Law of Large Numbers and the Central Limit Theorem.
We have seen that when a sequence of chance experiments forms an independent trials
process, the possible outcomes for each experiment are the same and occur with the same
probability. Further, knowledge of the outcomes of the previous experiments does not
influence our predictions for the outcomes of the next experiment. The distribution for
the outcomes of a single experiment is sufficient to construct a tree and a tree measure for
a sequence of n experiments, and we can answer any probability question about these
experiments by using this tree measure. Modern probability theory studies chance
processes for which the knowledge of previous outcomes influences predictions for
future experiments. In principle, when we observe a sequence of chance experiments, all
of the past outcomes could influence our predictions for the next experiment. For
example, this should be the case in predicting a student's grades on a sequence of exams
in a course. But to allow this much generality would make it very difficult to prove
general results. In 1907, A. A. Markov began the study of an important new type of
chance process. In this process, the outcome of a given experiment can affect the
outcome of the next experiment. This type of process is called a Markov chain.
We describe a Markov chain as follows: We have a set of states, $S = \{s_1, s_2, \ldots, s_r\}$.
The process starts in one of these states and moves successively from one state to
another. Each move is called a step. If the chain is currently in state $s_i$, then it moves to
state $s_j$ at the next step with a probability denoted by $p_{ij}$, and this probability does not
depend upon which states the chain was in before the current state. The probabilities $p_{ij}$
are called transition probabilities. The process can remain in the state it is in, and this
occurs with probability $p_{ii}$. An initial probability distribution, defined on S, specifies the
starting state. Usually this is done by specifying a particular state as the starting state.
Stochastic processes are used to model an extremely wide variety of real-life situations.
A discrete-time stochastic process is an infinite sequence of random variables
$(X_n)_{n \ge 0}$, usually with some features and structures in common, viewed as being
arranged in time order.
Generally, a stochastic process is a series of experiments whose outcome depends on
chance.
Let us assume that we know, for each pair of states i and j and each time t, the
probability $p_{ij}$ that the process is in state j at time $t+1$ given that it is in state i at time t. In
addition, the probability $p_{ij}(t)$ will be assumed not to depend on t. Such a process is
called a Markov chain (discrete time, with a finite set of states), named after its inventor
Andrei Andreyevich Markov (1856-1922).
With these assumptions, we can describe the system by giving the set $\{u_1, u_2, \ldots, u_m\}$ of
possible states $u_i$ and an $m \times m$ matrix P whose entry $p_{ij}$ is the
probability that the process is in state j at time $t+1$ given that it is in state i at time t, for
all t. P is called the transition matrix of the system.
We generally represent P by a directed graph G whose vertices correspond to the m
states and whose arcs correspond to the ordered pairs (i, j) such that $p_{ij} > 0$.
Definition: A Markov chain is a sequence of random variables $(X_n)_{n \ge 0} = X_0, X_1, X_2, \ldots$
with the Markov property, namely that, given the present state, the future and past states
are independent. A homogeneous Markov chain with values in a set V is a sequence of
random variables $(X_n)_{n \ge 0}$ with values in V such that there exists a function
$f : V \times V \to [0, 1]$ such that for all $n, x_0, x_1, \ldots, x_n, y$,
$$P(X_0 = x_0, X_1 = x_1, \ldots, X_n = x_n, X_{n+1} = y) = P(X_0 = x_0, \ldots, X_n = x_n)\, f(x_n, y)$$
with $f(x_n, y) = P(X_{n+1} = y \mid X_n = x_n)$.
Formally, $P(X_{n+1} = y \mid X_0 = x_0, X_1 = x_1, \ldots, X_n = x_n) = P(X_{n+1} = y \mid X_n = x_n)$. The possible
values of the $X_i$ form a countable set S called the state space of the chain. Markov chains are
often described by a directed graph, where the edges are labeled by the probabilities of
going from one state to the other states.
The transition matrix is given by $M_{xy} = f(x, y) = P(X_{t+1} = y \mid X_t = x)$.
Remark: A Markov chain is a sequence of random variables $X_1, X_2, \ldots, X_n$ whose joint
probability factors in a simple pairwise fashion:
$$P(X_1, X_2, \ldots, X_n) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_2) \cdots P(X_n \mid X_{n-1}) = P(X_1) \prod_{t=2}^{n} P(X_t \mid X_{t-1}),$$
and we can prove that the reverse order is also valid:
$$P(X_1, X_2, \ldots, X_n) = P(X_1 \mid X_2)\, P(X_2 \mid X_3) \cdots P(X_{n-1} \mid X_n)\, P(X_n).$$
A Markov chain is applicable to any sequential process with no "memory", i.e. what happens next
depends only on now and not on anything in the past. If the states are $S = \{s_1, s_2, \ldots, s_m\}$,
then the transition matrix, whose (i, j) entry gives the probability of going from state $s_i$ to $s_j$, is
$$T = \begin{pmatrix} P(s_1 \mid s_1) & P(s_2 \mid s_1) & \cdots & P(s_m \mid s_1) \\ P(s_1 \mid s_2) & P(s_2 \mid s_2) & \cdots & P(s_m \mid s_2) \\ \vdots & \vdots & & \vdots \\ P(s_1 \mid s_m) & P(s_2 \mid s_m) & \cdots & P(s_m \mid s_m) \end{pmatrix}$$
for a homogeneous Markov chain.
Theorem: Let P be the transition matrix of a Markov chain. The ij-th entry $p_{ij}^{(n)}$ of the
matrix $P^n$ gives the probability that the Markov chain, starting in state $s_i$, will be in state
$s_j$ after n steps.
Theorem: Let P be the transition matrix of a Markov chain, and let u be the
probability vector which represents the starting distribution. Then the probability that the
chain is in state $s_i$ after n steps is the i-th entry in the vector $u^{(n)} = u P^n$.
Note: if we want to examine the behavior of the chain under the assumption that it starts
in a certain state $s_i$, we simply choose u to be the probability vector with i-th entry equal to 1
and all other entries equal to 0. In fact, a Markov chain is nothing but a random walk
(biased, in general) on a directed, weighted graph (with positive weights, possibly with
loops), where the weights of the arcs are the transition probabilities (hence the condition:
the sum of the weights of the arcs leaving a vertex must be 1). The power $M^n$
has $P_{xy}^{(n)} = P(X_{t+n} = y \mid X_t = x)$ as coefficients (the sum of the weights of the paths from x to y of length
n). If the distribution of $X_0$ is $u = (u_x)_{x \in V}$, then the distribution of $X_n$ is $u^{(n)} = u P^n$
(which is a linear combination of the $P_{xy}^{(n)}$).
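These two theorems translate directly into a few lines of code. A minimal sketch in Python with numpy; the 2-state matrix P below is a made-up illustration, not one of the chains studied here:

import numpy as np

# Hypothetical 2-state transition matrix: row i is the distribution
# of the next state given that the chain is currently in state i.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
u = np.array([1.0, 0.0])           # start deterministically in state 0

n = 10
Pn = np.linalg.matrix_power(P, n)  # (P^n)_ij = n-step transition probabilities
print(Pn)                          # the n-step transition matrix
print(u @ Pn)                      # u^(n) = u P^n, the distribution after n steps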
Remark: To define a Markov chain X, we must normally define the initial distribution at
$t = 0$ (i.e. the law of $X_0$) and the transition matrix. In practice many properties depend
little (or not at all) on the initial distribution, but essentially on the transition matrix;
in that case we speak of the chain X starting from u or the chain X starting from v, if they have the
same transition matrix. (We say that the process they describe is time homogeneous,
which means that the transitional behavior stays the same throughout time.) Basically, a
Markov chain is a model in which the current value (time t) of a variable X taking values
in $\{1, \ldots, N\}$ is fully explained by the value taken by the same variable at
time $t-1$, and it is summarized in a matrix P giving the probability distribution of $X_t$ given
any possible value of $X_{t-1}$:
$$M = P = \big(P(X_t = j \mid X_{t-1} = i)\big)_{1 \le i, j \le N} = \begin{pmatrix} p_{11} & \cdots & p_{1N} \\ \vdots & \ddots & \vdots \\ p_{N1} & \cdots & p_{NN} \end{pmatrix}$$
Each row of P is a probability distribution summing to one. Since the current value is
fully determined by the knowledge of only one past period, this model is said to be of
order 1. This model is used in many different fields, including economics, chemistry,
biology and meteorology.
More generally, a Markov chain of order f, $f > 0$, is a model in which the current value
is explained by all lags up to $t - f$. The transition matrix is then of a larger size.
Definition: We denote by $(X_t)_t$ a probabilistic process over time (so it is a stochastic
process) whose value at any time depends on the outcome of a random
experiment. Thus, at each time t, $X_t$ is a random variable.
If we consider discrete time, we have a discrete-time stochastic process. If we assume
in addition that the random variables can take only a discrete set of values, we call this a
"discrete-time, discrete-space process".
Note: It is quite possible, as in the study of teletraffic, to have a process in continuous
time with a discrete state space.
Definition: X n nIN is a Markov Chain iff
PX n  j X n1  in1 , X n2  in2 ,....., X 1  i1 , X 0  i0   PX n  j X n1  in1 
in other words (it's very simple!) the probability that the chain is in a certain state in
the n-th stage of the process depends only on the state of the process at step n - 1 and not
on previous steps!
Note: A stochastic process satisfies the Markov property if and only if
the conditional probability distribution of future states, given the present state, depends
only on that present state and not on the past states. A process that has this property is called
a "Markov process".
Definition: A "homogeneous Markov chain" is a chain such that the probability of moving
to a given state at the n-th step is independent of the time: the probability distribution
characterizing the next step does not depend on the step number, and at all times the same
transition probabilities characterize the passage from the current step.
We can then define the transition probability from a state i to a state j by
$p_{ij} = P(X_{n+1} = j \mid X_n = i)$.
ERGODIC / IRREDUCIBLE CHAIN:
Definition
A Markov chain is said to be ergodic or irreducible if any state is reachable from any
other state. It is called regular if there is a power $P^k$ of its transition matrix P whose
elements are all strictly positive. A regular chain is ergodic.
Example:
Four balls are spread over two urns. At each step of the process, one ball among the
four is chosen at random, with equal probability, and changes urn. Let $X_k$ be the
number of balls in the first urn after the first k draws. Then the sequence of the $X_k$ forms
a Markov chain. Its transition matrix is given by
$$P = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\[2pt] \frac{1}{4} & 0 & \frac{3}{4} & 0 & 0 \\[2pt] 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 \\[2pt] 0 & 0 & \frac{3}{4} & 0 & \frac{1}{4} \\[2pt] 0 & 0 & 0 & 1 & 0 \end{pmatrix}$$
(states 0, 1, 2, 3, 4: the number of balls in the first urn).
From any state, we can find a path in the graph to any other state. The chain is
ergodic. However, starting, for example, from state 0, the chain will be in state
0, 2, or 4 after an even number of draws, and in state 1 or 3 after an odd number of
draws. The chain is thus not regular, since the entry $(P^n)_{ij}$ is zero each time
$i + j + n$ is odd.
Definition: A Markov chain is called an ergodic chain if it is possible to go from every
state to every state (not necessarily in one move). Ergodic Markov chains are also called
irreducible: the chain is irreducible if each state is accessible from every other state.
Accessibility: a state v is reachable from a state u if the chain, starting from u, has a
strictly positive probability of passing through v, i.e. there exists n such that
$P(X_n = v \mid X_0 = u) > 0$, i.e. there exists a path from u to v in the graph.
Definition: A Markov chain is called a regular chain if some power of the transition
matrix has only positive elements.
In other words, for some n, it is possible to go from any state to any state in exactly n
steps. It is clear from this definition that every regular chain is ergodic. On the other hand,
an ergodic chain is not necessarily regular.
Remark: Any transition matrix that has no zeros determines a regular Markov
chain. However, it is possible for a regular Markov chain to have a transition matrix that
has zeros.
Theorem. Let P be the transition matrix for a regular chain. Then, as $n \to \infty$, the
powers $P^n$ approach a limiting matrix W with all rows the same vector w. The vector w
is a strictly positive probability vector (i.e., the components are all positive and they sum
to one).
Theorem. Let P be a regular transition matrix, let $W = \lim_{n \to \infty} P^n$, let w be the
common row of W, and let c be the column vector all of whose components are 1. Then
(a) $wP = w$, and any row vector v such that $vP = v$ is a constant multiple of w;
(b) $Pc = c$, and any column vector x such that $Px = x$ is a multiple of c.
Lemma: A regular transition matrix is ergodic.
Lemma: The powers of a regular transition matrix converge to a matrix in which each
column is constant (all the entries of a column share a single value).
Definition: Fixed vectors
A row vector w with the property $wP = w$ is called a fixed row vector for P.
Similarly, a column vector x such that $Px = x$ is called a fixed column vector for P.
Remark: Assume that the value at a particular state, say state one, is 1, and then use all
but one of the linear equations from $wP = w$.
This set of equations will have a unique solution, and we can obtain w from this solution
by dividing each of its entries by their sum to give the probability vector w.
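This recipe is easy to check numerically. A minimal sketch (numpy; the regular matrix P below is a made-up example) that, instead of solving the linear system by hand, takes the left eigenvector of P for the eigenvalue 1 and normalizes it:

import numpy as np

P = np.array([[0.6, 0.4],
              [0.2, 0.8]])   # a hypothetical regular transition matrix

# wP = w is equivalent to P^T w^T = w^T, so w is the left eigenvector
# of P associated with the eigenvalue 1 (the largest eigenvalue).
vals, vecs = np.linalg.eig(P.T)
w = np.real(vecs[:, np.argmax(np.real(vals))])  # eigenvector for eigenvalue 1
w = w / w.sum()                                 # divide entries by their sum
print(w)                                        # the fixed probability vector
print(w @ P - w)                                # numerically zero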
Equilibrium
Suppose that our starting vector picks state $s_i$ as a starting state with probability $w_i$,
for all i. Then the probability of being in the various states after n steps is given by
$w P^n = w$, and is the same on all steps. This method of starting provides us with a process
that is called stationary.
Ergodic Markov Chains
Theorem. For an ergodic Markov chain, there is a unique probability vector w such that
$wP = w$ and w is strictly positive. Any row vector v such that $vP = v$ is a multiple of w.
Any column vector x such that $Px = x$ is a constant vector.
The Ergodic Theorem
Theorem. Let P be the transition matrix for an ergodic chain. Let $A_n$ be the matrix
defined by
$$A_n = \frac{I + P + P^2 + \cdots + P^n}{n + 1}.$$
Then $A_n \to W$, where W is a matrix all of whose
rows are equal to the unique fixed probability vector w for P.
Property: An ergodic transition matrix, not necessarily regular, has a single unit
eigenvector, i.e. a unique fixed probability vector.
EXAMPLE:
Julie has a cell phone plan with a fixed monthly allowance of 2 hours. In order to manage
her plan well, she notes that:
- if, during a month, she exceeded her allowance, the probability that she exceeds it the
next month is $\frac{1}{5}$;
- if she did not exceed her allowance, the probability that she exceeds it the next month
is $\frac{2}{5}$.
The question which interests us now is the evolution of the probability that Julie exceeds
her allowance. This type of question lends itself to Markov chains, because the
probability, in a given month, of exceeding the allowance or not depends only on the
probability of the previous month.
The first thing to do with a Markov chain is to determine the states. Here there are
two of them, which we also call events:
- Event A: Julie exceeded her allowance
- Event B: Julie did not exceed her allowance
As these events evolve in time, we write $A_n$ for "Julie exceeded her allowance in
month n" and $B_n$ for "Julie did not exceed her allowance in month n".
The probabilities of these events for a given month n are
- $a_n = p(A_n)$
- $b_n = p(B_n)$
By the definition of probability, $a_n + b_n = 1$, i.e. the sum of the probabilities of the events is
1. In this case, the probability $b_n$ is the complement of $a_n$, which is expressed by
$b_n = 1 - a_n$. So we just need to know the probability of event A for the first month; let us
suppose $a_1 = p(A_1) = \frac{1}{2}$.
We now build the graph associated with this Markov chain, which makes it
possible to express and visualize the probability relations between month n and
month n+1. We then define the transition matrix, which makes it possible to calculate the
probabilities for months n+1, n+2, ..., n+m.
The graph expresses the various events and their relations. We have two events here, so
we draw two points (or circles), one for each event:
A = "exceeded her allowance", B = "did not exceed her allowance".
We must now draw the relations between these events, taking each point
independently. Starting from point A, we have two possibilities: either the
allowance is exceeded again, in which case we draw an arrow that returns to A (since A
represents the event "the allowance was exceeded"); or it is not, in which case we draw an
arrow toward B. Starting from point B, we likewise have two possibilities: if the allowance
is exceeded, we draw an arrow toward A; if it is not, we draw an arrow that returns to B
(since B represents the event "the allowance was not exceeded").
Each arrow represents the evolution of the probability from one month to the next,
and we write beside each arrow the value of its probability.
How do we check that a probabilistic graph is right? The sum of the probabilities of the
arrows departing from a point must be equal to 1.
For point A we have $\frac{1}{5} + \frac{4}{5} = 1$; for point B, $\frac{2}{5} + \frac{3}{5} = 1$. The graph appears to be right.
To get the transition matrix, we seek the probability of going from A to A: we look at the
graph, read off the value of the arrow that goes from A to A, and record it at the
intersection of column A and row A. We do the same for A→B, B→B and B→A, and we
finally obtain the following matrix (column j giving the distribution of next month's state
given state j this month):
$$M = \begin{pmatrix} \frac{1}{5} & \frac{2}{5} \\[2pt] \frac{4}{5} & \frac{3}{5} \end{pmatrix}$$
Let us now calculate the probabilities of exceeding the allowance.
In a Markov chain, the probability vector $P_n$ in month n is written $P_n = (a_n, b_n)$, where
$a_n$, $b_n$, ... represent the various states of the Markov chain. In our case, there are two
states, $a_n$ and $b_n$; for $n = 1$ we have $a_n = a_1 = \frac{1}{2}$, i.e. $P_1 = \left(\frac{1}{2}, \frac{1}{2}\right)$.
The probability at stage n is then given by the relation $P_n = M \cdot P_{n-1}$
(with $P_n$ written as a column vector).
Let us compute $P_2 = M P_1$, which gives us $a_2$ and $b_2$, and consequently the probability
$a_2$ of Julie exceeding her allowance at the end of one month:
$$P_2 = \begin{pmatrix} \frac{1}{5} & \frac{2}{5} \\[2pt] \frac{4}{5} & \frac{3}{5} \end{pmatrix} \begin{pmatrix} \frac{1}{2} \\[2pt] \frac{1}{2} \end{pmatrix} = \begin{pmatrix} 0.3 \\ 0.7 \end{pmatrix}$$
$P_{n+1} = M \cdot P_n$; $P_{n+2} = M^2 \cdot P_n$; ...; $P_{n+m} = M^m \cdot P_n$.
To know the evolution of the probability at the end of m periods, we just need to multiply
by the transition matrix m times.
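As a sketch, the whole computation for Julie's chain takes a few lines of Python (numbers and the column-vector convention from the example above):

import numpy as np

# Column j of M is the distribution of next month's state given state j
# (A = exceeded the allowance, B = did not exceed it).
M = np.array([[1/5, 2/5],
              [4/5, 3/5]])
P = np.array([1/2, 1/2])      # P_1 = (a_1, b_1)

for n in range(2, 8):
    P = M @ P                 # P_n = M P_{n-1}
    print(n, P)               # P_2 = (0.3, 0.7), then convergence toward (1/3, 2/3)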
Example: According to Kemeny, Snell, and Thompson, the Land of Oz is blessed by
many things, but not by good weather. They never have two nice days in a row. If they
have a nice day, they are just as likely to have snow as rain the next day. If they have
snow or rain, they have an even chance of having the same the next day. If there is
change from snow or rain, only half of the time is this a change to a nice day. With this
information we form a Markov chain as follows.
We take as states the kinds of weather R, N, and S. From the above information we
determine the transition probabilities. These are most conveniently represented in a
square array as
$$P = \begin{pmatrix} \frac{1}{2} & \frac{1}{4} & \frac{1}{4} \\[2pt] \frac{1}{2} & 0 & \frac{1}{2} \\[2pt] \frac{1}{4} & \frac{1}{4} & \frac{1}{2} \end{pmatrix}$$
with rows and columns ordered R, N, S.
The entries in the first row of the matrix P represent the probabilities for the various
kinds of weather following a rainy day. Similarly, the entries in the second and third rows
represent the probabilities for the various kinds of weather following nice and snowy
days, respectively. Such a square array is called the matrix of transition probabilities, or
the transition matrix.
We consider the question of determining the probability that, given the chain is
in state i today, it will be in state j two days from now. We denote this probability
by pij2  .we see that if it is rainy today then the event that it is snowy two days from now is
the disjoint union of the following three events: 1) it is rainy tomorrow and snowy two
days from now, 2) it is nice tomorrow and snowy two days from now, and 3) it is snowy
tomorrow and snowy two days from now. The probability of the first of these events is
the product of the conditional probability that it is rainy tomorrow, given that it is rainy
today, and the conditional probability that it is snowy two days from now, given that it is
rainy tomorrow.
Using the transition matrix P, we can write this product as $p_{11} p_{13}$. The other two events
also have probabilities that can be written as products of entries of P. Thus, we have
$$p_{13}^{(2)} = p_{11} p_{13} + p_{12} p_{23} + p_{13} p_{33}.$$
This equation should remind the reader of a dot product of
two vectors; we are dotting the first row of P with the third column of P. This is just what
is done in obtaining the (1, 3)-entry of the product of P with itself. In general, if a Markov
chain has r states, then
$$p_{ij}^{(2)} = \sum_{k=1}^{r} p_{ik} p_{kj}.$$
The following general theorem is easy to prove by
using the above observation and induction. So, the powers of the transition matrix give
us interesting information about the process as it evolves. We shall be particularly interested in
the state of the chain after a large number of steps. The program MatrixPowers
computes the powers of P.
We note that after six days our weather predictions are, to three-decimal-place accuracy, independent of today's weather. The probabilities for the three types of
weather, R, N, and S, are .4, .2, and .4 no matter where the chain started. This
is an example of a type of Markov chain called a regular Markov chain. For this
type of chain, it is true that long-range predictions are independent of the starting
state. Not all chains are regular, but this is an important class of chains.
We now consider the long-term behavior of a Markov chain when it starts in a
state chosen by a probability distribution on the set of states, which we will call a
probability vector. A probability vector with r components is a row vector whose
entries are non-negative and sum to 1. If u is a probability vector which represents
the initial state of a Markov chain, then we think of the ith component of u as
representing the probability that the chain starts in state $s_i$. If we want to examine the
behavior of the chain under the assumption that it starts in a certain state $s_i$, we simply
choose u to be the probability vector with i-th entry equal to 1 and all other entries equal
to 0.
For example, let the initial probability vector u equal $\left(\frac{1}{3}, \frac{1}{3}, \frac{1}{3}\right)$. Then we can calculate
the distribution of the states after three days as $u P^3$.
Find the limiting vector w for the Land of Oz: we require $w_1 + w_2 + w_3 = 1$ and
$$wP = (w_1\ \ w_2\ \ w_3) \begin{pmatrix} \frac{1}{2} & \frac{1}{4} & \frac{1}{4} \\[2pt] \frac{1}{2} & 0 & \frac{1}{2} \\[2pt] \frac{1}{4} & \frac{1}{4} & \frac{1}{2} \end{pmatrix} = (w_1\ \ w_2\ \ w_3),$$
i.e.
$$\begin{cases} \frac{1}{2} w_1 + \frac{1}{2} w_2 + \frac{1}{4} w_3 = w_1 \\[2pt] \frac{1}{4} w_1 + \frac{1}{4} w_3 = w_2 \\[2pt] \frac{1}{4} w_1 + \frac{1}{2} w_2 + \frac{1}{2} w_3 = w_3 \end{cases} \qquad w_1 + w_2 + w_3 = 1.$$
The solution is $w = (0.4\ \ 0.2\ \ 0.4)$.
Another method: set $w_1 = 1$, and then solve the first and second linear equations
from $wP = w$:
$$\begin{cases} \frac{1}{2} + \frac{1}{2} w_2 + \frac{1}{4} w_3 = 1 \\[2pt] \frac{1}{4} + \frac{1}{4} w_3 = w_2 \end{cases}$$
We obtain $(w_1\ \ w_2\ \ w_3) = \left(1\ \ \frac{1}{2}\ \ 1\right)$; dividing by the sum of the entries, $\frac{5}{2}$,
gives back $w = (0.4\ \ 0.2\ \ 0.4)$.
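Both methods can be checked numerically; a quick sketch with numpy, using the Land of Oz matrix above:

import numpy as np

P = np.array([[0.50, 0.25, 0.25],    # R
              [0.50, 0.00, 0.50],    # N
              [0.25, 0.25, 0.50]])   # S

# The powers of a regular matrix converge to W, whose rows all equal w.
print(np.linalg.matrix_power(P, 20))  # every row is close to (0.4, 0.2, 0.4)

w = np.array([0.4, 0.2, 0.4])
print(w @ P)                          # returns w: the fixed probability vector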
HIDDEN MARKOV CHAIN
Example:
Suppose we have a red urn containing one red ball and four black balls, and a
black urn containing three red balls and one black ball. We perform a series
of experiments consisting of drawing a ball with replacement from the urns, as follows:
1) the first draw takes place at random in one of the two urns;
2) after each draw from an urn, the ball is put back into the same urn;
3) after each draw, the next ball is drawn from the urn that has the color of the ball just drawn.
Here the states correspond to the urns and the observation symbols are the colors of the
balls selected: $S = \{s_1, s_2\}$, $O = \{o_1, o_2\}$.
So, writing $p_n$ for the probability that the n-th ball drawn is red and $q_n = 1 - p_n$ for the
probability that it is black,
$$p_{n+1} = \frac{1}{5} p_n + \frac{3}{4} q_n = \frac{1}{5} p_n + \frac{3}{4}(1 - p_n) = -\frac{11}{20} p_n + \frac{3}{4}, \qquad q_{n+1} = \frac{4}{5} p_n + \frac{1}{4}(1 - p_n),$$
i.e.
$$\begin{cases} p_{n+1} = \dfrac{1}{5} p_n + \dfrac{3}{4} q_n \\[4pt] q_{n+1} = \dfrac{4}{5} p_n + \dfrac{1}{4} q_n \end{cases} \qquad (q_n = 1 - p_n).$$
If $\lim_{n \to \infty} p_n = p$, then we must have $p = -\frac{11}{20} p + \frac{3}{4}$, which gives $p = \frac{15}{31}$ and $q = \frac{16}{31}$.
Substituting $\frac{15}{31}$ in the formula, $\frac{15}{31} = -\frac{11}{20} \cdot \frac{15}{31} + \frac{3}{4}$, and subtracting this from
$p_{n+1} = -\frac{11}{20} p_n + \frac{3}{4}$ shows that
$$p_{n+1} - \frac{15}{31} = -\frac{11}{20}\left(p_n - \frac{15}{31}\right),$$
i.e. $\left(p_n - \frac{15}{31}\right)_{n \in \mathbb{N}}$ is a geometric sequence with common ratio $-\frac{11}{20}$.
This sequence converges to zero regardless of the choice of $p_0$.
Second Representation: Probabilistic Graph
Two possible states: the first state, red ball; the second state, black ball.
Four conditional probabilities: the probability of having a red ball knowing that the
previous draw yielded a black ball, etc.
Let us now use $p_n$ and a weighted graph (the probability graph) which displays what is occurring.
The probability graph transition matrix is
$$M = \begin{pmatrix} \frac{1}{5} & \frac{4}{5} \\[2pt] \frac{3}{4} & \frac{1}{4} \end{pmatrix},$$
the state transition matrix with $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$. The recurrences
$$\begin{cases} p_{n+1} = \dfrac{1}{5} p_n + \dfrac{3}{4} q_n \\[4pt] q_{n+1} = \dfrac{4}{5} p_n + \dfrac{1}{4} q_n \end{cases}$$
can then be written in matrix form as $(p_{n+1}\ \ q_{n+1}) = (p_n\ \ q_n)\, M$.
Hence $(p_{n+1}\ \ q_{n+1}) = (p_n\ \ q_n)\, M = (p_0\ \ q_0)\, M^{n+1}$. We can compute $M^n$ using the
binomial formula, an eigenvalue/eigenvector decomposition, or a TI-83: press
2nd x⁻¹ (MATRX), choose EDIT, ENTER, input the values, 2nd MODE (QUIT),
then 2nd x⁻¹, ENTER, and type ^100, ENTER.
$$\lim_{n \to \infty} (p_n\ \ q_n) = (p_0\ \ q_0) \lim_{n \to \infty} M^n = (p_0\ \ q_0) \begin{pmatrix} \frac{15}{31} & \frac{16}{31} \\[2pt] \frac{15}{31} & \frac{16}{31} \end{pmatrix} = \left(\frac{15}{31},\ \frac{16}{31}\right).$$
Note: A Markov chain is a random process on a finite number of states with memoryless
transition probabilities.
Example: consider a simple 3-state Markov model of the weather. We assume that once
a day (e.g., at noon), the weather is observed as being one of the following:
State 1: rain (or snow)
State 2: cloudy
State 3: sunny.
We postulate that the weather on day t is characterized by a single one of these three
states above, and that the matrix A of state transition probabilities is
$$A = (a_{ij}) = \begin{pmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{pmatrix}.$$
Given that the weather on day 1 ($t = 1$) is sunny (state 3), we can ask the question: what
is the probability (according to the model) that the weather for the next 7 days will be
"sun-sun-rain-rain-sun-cloudy-sun"? Stated more formally, we define the observation
sequence O as $O = (s_3, s_3, s_3, s_1, s_1, s_3, s_2, s_3)$ corresponding to $t = 1, 2, 3, \ldots, 8$, and we wish
to determine the probability of O, given the model. This probability can be expressed
(and evaluated) as
$$P(O \mid \text{Model}) = P(s_3)\, P(s_3 \mid s_3)\, P(s_3 \mid s_3)\, P(s_1 \mid s_3)\, P(s_1 \mid s_1)\, P(s_3 \mid s_1)\, P(s_2 \mid s_3)\, P(s_3 \mid s_2)$$
$$= \pi_3\, a_{33} a_{33} a_{31} a_{11} a_{13} a_{32} a_{23} = 1 \cdot 0.8 \cdot 0.8 \cdot 0.1 \cdot 0.4 \cdot 0.3 \cdot 0.1 \cdot 0.2 = 1.536 \times 10^{-4},$$
where $\pi_i = P(q_1 = s_i)$, $1 \le i \le N$, is the initial state probability.
Now, given that the model is in a known state, what is the probability that it stays in that state
for exactly d days? This probability can be evaluated as the probability of the observation
sequence $O = (\underbrace{s_i, s_i, s_i, \ldots, s_i}_{1,\,2,\,\ldots,\,d},\ \underbrace{s_j \ne s_i}_{d+1})$, given the model, which is
$$p_i(d) = P(O \mid \text{Model}, q_1 = s_i) = (a_{ii})^{d-1}(1 - a_{ii}).$$
$p_i(d)$ is the discrete probability density function of the duration d in state i. The expected
number of observations (duration) in a state, conditioned on starting in that state, is
$$E[d_i] = \sum_{d=1}^{\infty} d\, p_i(d) = \sum_{d=1}^{\infty} d\, (a_{ii})^{d-1}(1 - a_{ii}) = \frac{1}{1 - a_{ii}}.$$
So the expected number of consecutive days of sunny weather, according to the model, is
$\frac{1}{0.2} = 5$; for cloudy it is $2.5$; for rain it is $1.67$.
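Both computations above reduce to a few lines of code; a sketch with numpy, using the weather matrix A:

import numpy as np

A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

# O = (s3, s3, s3, s1, s1, s3, s2, s3), written with 0-based state indices.
path = [2, 2, 2, 0, 0, 2, 1, 2]
p = 1.0                                   # pi_3 = 1: day 1 is known to be sunny
for s, t in zip(path, path[1:]):
    p *= A[s, t]                          # multiply the 7 transition probabilities
print(p)                                  # 1.536e-04

# Expected duration in each state: 1 / (1 - a_ii).
print(1 / (1 - np.diag(A)))               # [1.67, 2.5, 5.0] for rain, cloudy, sunny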
Example:
Consider the game of tossing a coin. We have three unfair (biased) coins. Suppose a person
has, say, three coins and is sitting inside a room tossing them in some sequence; the room
is closed, and what you are shown (on a display outside the room) is only the outcomes of
his tosses, TTHTHHTT...; this will be called the observation sequence. You do not
know the sequence in which he is tossing the different coins, nor do you know the bias of
the various coins. For a given observation, the symbols (heads, tails) can be generated
by the first, the second, or the third coin.
A hidden Markov chain models the experiment as follows:
we consider that a state is modeled by a coin and that each state can produce two symbols:
heads or tails. We assume that we can move from one state to another (change of coin)
according to a certain probability distribution and that each coin has its own
probability distribution for generating those symbols (heads and tails). The sequence of
heads and tails is observable (matrix B), but the sequence of states that generates this
sequence is not observable (matrix A). We say that it is hidden.
For example, with states 1, 2, 3 (the coins) and symbols Tail, Head:
$$A = \begin{pmatrix} 0.2 & 0.3 & 0.5 \\ 0.2 & 0.1 & 0.7 \\ 0.7 & 0.2 & 0.1 \end{pmatrix}, \qquad B = \begin{pmatrix} 0.1 & 0.9 \\ 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix} \ (\text{columns: Tail, Head}), \qquad \pi = (0.1,\ 0.5,\ 0.4).$$
Let us suppose we observe $O = \text{TTTHHTHHTTTHTT}$, where T = tail and H = head.
Bayes' theorem states that
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{normalisation factor}}, \quad\text{i.e.}\quad P(\lambda \mid O) = \frac{P(O \mid \lambda)\, P(\lambda)}{P(O)},$$
where O denotes a series of measured or observed data, and λ comprises a set of model
parameters:
- $p(\lambda \mid O)$ is the posterior probability
- $p(O \mid \lambda)$ is the likelihood
- $p(\lambda)$ is the prior probability
- $p(O)$ is the evidence
Find the most likely (optimal) state sequence:
given an observation O and an HMM $\lambda = (A, B, \pi)$, find the state path q most likely to have
been followed in the generation of the observation O by the HMM, i.e. the path for which
$P(q \mid O, \lambda)$ is maximal. This path is determined using the Viterbi algorithm.
HIDDEN MARKOV CHAIN AND PROTEIN CLASSIFICATION
HIDDEN MARKOV MODEL
Introduction
Proteins are large, organic molecules and are among the most important
components in the cells of living organisms. They are more diverse in
structure and function than any other kind of molecule. Enzymes, antibodies,
hormones, transport molecules, hair, skin, muscle, tendons, cartilage, claws,
nails, horns, hooves, and feathers are all made of proteins. Faster and more
sensitive and accurate methods are required to classify these proteins into
families and predict their functions. Many existing protein classification
methods build hidden Markov models (HMMs) and other forms of
profiles/motifs based on multiple alignments. These methods in general
require a large amount of time for building models and also for predicting
functions based on them. Furthermore, they can predict protein functions
only if sequences are sufficiently conserved. When there is very little
sequence similarity, these methods often fail, even if sequences share some
structural similarities. Machine learning methods that have been studied
specifically for a problem of protein classification include HMM and
support vector machine (SVM) methods.
Hidden Markov models (HMMs) are a formal foundation for making
probabilistic models of linear sequence “labeling” problems. They provide a conceptual toolkit that
allows building a model of almost any complexity, just by drawing an
intuitive picture. They are at the heart of a diverse range of programs,
including gene finding, consensus profile searches, multiple sequence
alignment, and regulatory site identification. HMMs are the Legos of
computational sequence analysis.
The hidden Markov model (HMM) is a Markov model in which the states are
hidden. There are several types of HMM. The first distinction is made
according to the nature of the probability density function used to
generate the observations. When the distribution is obtained directly by
counting, the HMM is called discrete. The use of a continuous distribution,
generally approximated by a mixture of Gaussians, leads
to continuous HMMs. There is also a compromise between these two
families, called semi-continuous HMMs. Indeed, the use of
quantification induces a loss of information that can be detrimental to
the models. On the other hand, the use of continuous modeling leads to an
increase in the number of parameters to be estimated. The semi-continuous
HMM is an alternative which optimizes the overall number of model parameters.
There is another typology of HMMs, based on the emission of
observations. Observations are usually produced by the states of the model;
this is called a state-emission model. However, it is also possible to
emit observations when crossing the transitions; this is called an
arc-emission model. The choice is guided by the
application. However, for an equal number of states, arc models
allow a greater number of possibilities for the emission of
observations. When modeling a phenomenon by an arc model, it may be
worthwhile to allow transitions to be crossed without emission of
observations, especially to model the absence of an event.
EXTENSION OF MARKOV MODEL TO HIDDEN MARKOV
MODELS
We now extend the concept of a Markov model to include the case
where the observation is a probabilistic function of the state; i.e., the
resulting model (which is called a hidden Markov model) is a doubly
embedded stochastic process with an underlying stochastic process that is
not observable (it is hidden) and can only be observed through another set
of stochastic processes that produce the sequence of observations.
Example: Suppose a person has, say, three coins and is sitting inside a room
tossing them in some sequence; the room is closed, and what you are shown
(on a display outside the room) is only the outcomes of his tosses,
TTHTHHTT...; this will be called the observation sequence. You do not
know the sequence in which he is tossing the different coins, nor do you
know the bias of the various coins. To appreciate how much the outcome
depends on the individual biasing and the order of tossing the coins, suppose
you are given that the third coin is highly biased to produce heads and all
coins are tossed with equal probability. Then we naturally expect a
far greater number of heads than tails in the output sequence. Now if it is
given that, besides the bias, the probability of going to the third coin (state)
from either the first or the second coin (state) is zero, then, assuming that we
were in the first or second state to begin with, heads and tails will appear
with almost equal probability in spite of the bias. So we see that the output
sequence depends very much on the individual biases, the transition
probabilities between the various states, as well as on which state is chosen to
begin the observations. The three sets, namely the set of individual biases of
the three coins, the set of transition probabilities from one coin to the next,
and the set of initial probabilities of choosing the states, characterize what is
called the HIDDEN MARKOV MODEL (HMM) for this coin-tossing
experiment. HMMs allow you to estimate probabilities of unobserved
events.
Definition: The Hidden Markov Model (HMM) is a variant of a finite state
machine having a set of hidden states, Q, an output alphabet (observations),
O, transition probabilities, A, output (emission) probabilities, B, and initial
state probabilities, Π. The current state is not observable. Instead, each state
produces an output with a certain probability (B). Usually the states, Q, and
outputs, O, are understood, so an HMM is said to be a triple, ( A, B, Π ).
A Hidden Markov Model (HMM) is defined or composed of the following
elements:
- N is the number of hidden states of the model. We denote by
$S = \{s_1, s_2, \ldots, s_N\}$ the set of hidden states. At time t, a state is
represented by $Q_t$ ($Q_t \in S$). The $(Q_t)_{1 \le t \le T}$ are hidden and discrete.
Generally the states are interconnected in such a way that any
state can be reached from any other state (e.g., an ergodic model).
- M is the number of distinct observation symbols per state, i.e. the
number of distinct symbols that can be observed in every state, i.e.
the discrete alphabet size. They are represented by
$V = \{v_1, v_2, \ldots, v_M\}$. At time t, an observable symbol is denoted by
$O_t$ ($O_t \in V$). The $(O_t)_{1 \le t \le T}$ are discrete observation variables. The
observation symbols correspond to the physical output of the
system being modeled.
- A transition matrix of probabilities, denoted $A = (a_{ij})$, where $a_{ij}$ is
the probability of a transition from state i to state j. In a
stationary HMM of first order, this probability does not depend
on t. We define
$$a_{ij} = P(Q_{t+1} = s_j \mid Q_t = s_i) = P(\text{state } q_j \text{ at } t+1 \mid \text{state } q_i \text{ at } t), \quad 1 \le i, j \le N;$$
- A matrix of probability distributions, denoted $B = (b_j(k))$,
associated with the states, where $b_j(k)$ is the probability of observing
the symbol $v_k$ in state $s_j$ at time t. We define the emission
probabilities
$$b_j(k) = P(O_t = v_k \mid Q_t = s_j), \quad 1 \le j \le N,\ 1 \le k \le M, \quad\text{or}\quad b_j(o_t) = P(O_t = o_t \mid Q_t = s_j);$$
- An initial vector of probability distributions $\pi = (\pi_i)$,
where $\pi_i$ is the probability of starting in state i. We define
$\pi_i = P(Q_1 = s_i)$, where $1 \le i \le N$.
An HMM denoted by λ is completely defined by $\lambda = (A, B, \pi)$ (N and M
are implicit in the matrices A, B and the vector π).
The generation of observations by an HMM is done as follows:
1) $t = 1$: choose an initial state $Q_1 = s_i$ with probability $\pi_i$;
2) choose an observation $O_t = v_k$ with probability $b_i(k)$;
3) transition toward a new state $Q_{t+1} = s_j$ with probability $a_{ij}$;
4) set $t = t + 1$; if $t \le T$, return to step 2, otherwise terminate the
procedure (where T is the length of the sequence of observations).
It is obvious from the above that a complete specification of an HMM
requires specification of two model parameters (N and M),
specification of the observation symbols, and specification of the
three probability measures A, B, π. For convenience, we use the
compact notation $\lambda = (A, B, \pi)$ to indicate the complete parameter set
of the model.
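The four steps above can be written directly as a small sampler. A sketch in Python; the 2-state, 2-symbol matrices below are placeholders, not a model from this text:

import numpy as np

rng = np.random.default_rng(0)

A  = np.array([[0.9, 0.1], [0.2, 0.8]])   # a_ij = P(Q_{t+1} = s_j | Q_t = s_i)
B  = np.array([[0.7, 0.3], [0.1, 0.9]])   # b_i(k) = P(O_t = v_k | Q_t = s_i)
pi = np.array([0.5, 0.5])                 # pi_i = P(Q_1 = s_i)

def generate(T):
    """Sample a (states, observations) pair of length T from lambda = (A, B, pi)."""
    q = rng.choice(2, p=pi)               # step 1: initial state, Q_1 = s_i w.p. pi_i
    states, obs = [], []
    for _ in range(T):
        states.append(q)
        obs.append(rng.choice(2, p=B[q])) # step 2: emit O_t = v_k w.p. b_i(k)
        q = rng.choice(2, p=A[q])         # step 3: move to Q_{t+1} = s_j w.p. a_ij
    return states, obs                    # step 4 is the loop itself

print(generate(10))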
Example: (figure omitted) probabilistic parameters of a hidden Markov model:
x — states
y — possible observations
a — state transition probabilities
b — output probabilities
Two assumptions are made by the model.
- The first, called the Markov assumption, is that the sequence of hidden states
is governed by a discrete-time Markov process of order 1; that is to say, the
probability of a hidden state depends only on the previous hidden state of the
sequence, and these dependency probabilities do not change over
time. The current state depends only on the previous state; this
represents the memory of the model: the t-th hidden variable, given
the (t-1)-th hidden variable, is independent of all earlier variables. If we
denote by $Q = (Q_1, Q_2, \ldots, Q_T)$ the sequence of hidden states, then
$$P(Q_t \mid Q_{t-1}, O_{t-1}, Q_{t-2}, O_{t-2}, \ldots, Q_1, O_1) = P(Q_t \mid Q_{t-1}), \quad\text{i.e.}\quad P(Q \mid \lambda) = P(Q_1) \prod_{t=1}^{T-1} P(Q_{t+1} \mid Q_t, \lambda).$$
- The second hypothesis says that the emission probability of a symbol
depends only on the hidden state in which the process is; i.e., the
output observation at time t depends only on the current state and is
independent of previous observations and states: the t-th observation
depends only on the t-th state. If we denote by $O = (O_1, O_2, \ldots, O_T)$ the
sequence of observations, then
$$P(O_t \mid Q_t, Q_{t-1}, O_{t-1}, Q_{t-2}, O_{t-2}, \ldots, Q_1, O_1) = P(O_t \mid Q_t), \quad\text{i.e.}\quad P(O \mid Q, \lambda) = \prod_{t=1}^{T} P(O_t \mid Q_t, \lambda).$$
The likelihood of the observation O relative to the model λ is
$$P(O \mid \lambda) = \sum_{Q} P(O \mid Q, \lambda)\, P(Q \mid \lambda).$$
Graphically, the dependencies of the probabilities can be summarized in a diagram (figure omitted).
Given an HMM and a sequence of observations, we would like to be able to
compute $P(O \mid \lambda)$, the probability of the observation sequence given a model.
This problem can be viewed as one of evaluating how well a model
predicts a given observation sequence, and it thus allows us to choose the most
appropriate model from a set.
The probability of the observations $O = (o_1, o_2, \ldots, o_T)$ for a specific state
sequence $Q = (q_1, q_2, \ldots, q_T)$ is
$$P(O \mid Q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2)\, b_{q_3}(o_3) \cdots b_{q_T}(o_T),$$
where we assumed statistical independence of the observations, and the probability
of the state sequence is
$$P(Q \mid \lambda) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3}\, a_{q_3 q_4} \cdots a_{q_{T-1} q_T}.$$
The joint probability of O and Q, i.e., the probability that O and Q occur
simultaneously, is simply the product of the above two terms,
$P(O, Q \mid \lambda) = P(O \mid Q, \lambda)\, P(Q \mid \lambda)$; so we can calculate the probability of the
observations given the model as
$$P(O \mid \lambda) = \sum_{Q} P(O \mid Q, \lambda)\, P(Q \mid \lambda) = \sum_{q_1, \ldots, q_T} \pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2)\, a_{q_2 q_3} b_{q_3}(o_3) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T).$$
This result allows the evaluation of the probability of O, but evaluating it
directly would be exponential in T.
Note:
1) We are mostly going to consider the special case of ergodic or fully
connected HMMs, in which every state of the model can be reached
(in a single step) from every other state of the model. Strictly
speaking, an ergodic model has the property that every state can be
reached from every other state in a finite number of steps. We will also
consider the case where the underlying "hidden" Markov chain defined by
$P(Q_t \mid Q_{t-1})$ is time-homogeneous, i.e. independent of the time t.
2) The difference between a Markov chain and a hidden Markov model
is in the information known on each state. In a Markov chain, for any
sequence, all state transitions are exactly known– i.e., there is a
unique, known path through the model. In a hidden Markov model,
the state information is hidden from the user.
3) In the case of protein classification, hidden Markov models have a
finite set of states $\{a_1, a_2, \ldots, a_n\}$, including a begin state (where the
sequence begins) and an end state (where the sequence terminates).
Each state has two probabilities associated with it:
• the transition probability $T_{ij}$, or the probability that a state $a_i$ will
transition to another state $a_j$, where $j = 1, \ldots, n$, and
• the emission probability $E(x_j)$, or the probability that a state $a_j$ will
emit a particular symbol x. Emission probabilities are properties only of
HMMs and not of Markov chains.
NOTE: Computing $P(O \mid \lambda)$ — Forward procedure
Consider $\alpha_t(i) = P(O_1, O_2, \ldots, O_t, Q_t = i \mid \lambda)$:
1) initially $\alpha_1(i) = \pi_i\, b_i(O_1)$ for $1 \le i \le N$;
2) for $t = 2, 3, \ldots, T$: $\alpha_t(j) = \left(\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\right) b_j(O_t)$ for $1 \le j \le N$;
3) finally $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$.
So we solve the computation of $P(O \mid \lambda)$ by recursion.
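This recursion transcribes directly into code. A sketch (numpy; observations given as integer symbol indices, A, B, pi as in the notation above):

import numpy as np

def forward(A, B, pi, obs):
    """Return P(O | lambda) using alpha_t(i) = P(O_1..O_t, Q_t = i | lambda)."""
    alpha = pi * B[:, obs[0]]             # step 1: alpha_1(i) = pi_i b_i(O_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # step 2: (sum_i alpha_{t-1}(i) a_ij) b_j(O_t)
    return alpha.sum()                    # step 3: sum_i alpha_T(i)

# Hypothetical toy model, for illustration only.
A  = np.array([[0.9, 0.1], [0.2, 0.8]])
B  = np.array([[0.7, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
print(forward(A, B, pi, [0, 1, 1, 0]))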
NOTE: Computing $P(O \mid \lambda)$ — Backward procedure
Define $\beta_t(i) = P(O_{t+1}, O_{t+2}, \ldots, O_T \mid Q_t = i, \lambda)$:
1) initially $\beta_T(i) = 1$ for $1 \le i \le N$;
2) for $t = T-1, T-2, \ldots, 1$: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$ for $1 \le i \le N$;
3) finally $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(O_1)\, \beta_1(i)$.
We now have another efficient way of computing $P(O \mid \lambda)$.
VITERBI ALGORITHM
We want to find the state sequence $I = (s_{i_1}, s_{i_2}, \ldots, s_{i_n})$ which has the maximum
probability of generating $O = (O_1, O_2, \ldots, O_n)$, i.e. which achieves
$\max_I P(I \mid O, \lambda) = \max_I P(O, I \mid \lambda)$ (the two maximizers coincide, since $P(O \mid \lambda)$
does not depend on I).
Viterbi algorithm:
the observed symbols and the hidden states are in 1-1 correspondence, and
the most likely sequence of states at time t depends
only on t and the most likely sequences at time t-1.
The Viterbi algorithm is used to compute the most probable path (as well as
its probability). It requires knowledge of the parameters of the HMM model
and a particular output sequence, and it finds the state sequence that is most
likely to have generated that output sequence. It works by finding a
maximum over all possible state sequences.
Given observations $O = (O_1, O_2, \ldots, O_T)$, find the state sequence
$Q = (Q_1, Q_2, \ldots, Q_T)$ with greatest
likelihood: $Q^* = \arg\max_Q P(O, Q \mid \lambda) = \arg\max_Q \pi(Q)$, where
$$\pi(Q) = \pi_{Q_1}\, b_{Q_1}(O_1) \prod_{t=2}^{T} a_{Q_{t-1} Q_t}\, b_{Q_t}(O_t).$$
The Viterbi algorithm is an inductive algorithm that allows us to find the
optimal state sequence $Q^*$ efficiently.
Initially, for $1 \le i \le N$:
$$\delta_1(i) = \pi_i\, b_i(O_1), \qquad \psi_1(i) = 0.$$
For $t = 2, 3, \ldots, T$ and $1 \le j \le N$:
$$\delta_t(j) = \max_i \left[\delta_{t-1}(i)\, a_{ij}\right] b_j(O_t), \qquad \psi_t(j) = \arg\max_i \left[\delta_{t-1}(i)\, a_{ij}\right].$$
Finally
$$P^* = \max_i \delta_T(i), \qquad Q_T^* = \arg\max_i \delta_T(i),$$
and we trace back, for $t = T-1, T-2, \ldots, 1$:
$$Q_t^* = \psi_{t+1}(Q_{t+1}^*), \qquad Q^* = (Q_1^*, Q_2^*, \ldots, Q_T^*).$$
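The same recursions, written as a sketch in Python (delta is kept as a full T×N table so that the psi back-pointers can be traced at the end):

import numpy as np

def viterbi(A, B, pi, obs):
    """Most probable state path Q* and its probability P* for lambda = (A, B, pi)."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi   = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                    # delta_1(i) = pi_i b_i(O_1)
    for t in range(1, T):
        cand = delta[t-1][:, None] * A              # cand[i, j] = delta_{t-1}(i) a_ij
        psi[t]   = cand.argmax(axis=0)              # psi_t(j)
        delta[t] = cand.max(axis=0) * B[:, obs[t]]  # delta_t(j)
    q = [int(delta[-1].argmax())]                   # Q*_T
    for t in range(T - 1, 0, -1):                   # trace back
        q.append(int(psi[t][q[-1]]))
    return q[::-1], delta[-1].max()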
Reformulating the optimisation
Recall the likelihood calculation
$$P(O, Q \mid \lambda) = P(O \mid Q, \lambda)\, P(Q \mid \lambda) = \pi_{q_1} b_{q_1}(O_1)\, a_{q_1 q_2} b_{q_2}(O_2)\, a_{q_2 q_3} b_{q_3}(O_3) \cdots$$
Now, taking the negative logarithm of $\pi(Q) = \pi_{Q_1} b_{Q_1}(O_1) \prod_{t=2}^{T} a_{Q_{t-1} Q_t} b_{Q_t}(O_t)$, we get
$$\tilde{\pi}(Q) = -\ln\!\left(\pi_{Q_1} b_{Q_1}(O_1)\right) - \sum_{t=2}^{T} \ln\!\left(a_{Q_{t-1} Q_t}\, b_{Q_t}(O_t)\right).$$
Hence $Q^* = \arg\max_Q P(O, Q \mid \lambda) = \arg\max_Q \pi(Q)$ becomes
$Q^* = \arg\min_Q \tilde{\pi}(Q)$.
In sequence analysis, this method can be used, for example, to predict coding
vs non-coding sequences.
In fact there are often many state sequences that can produce the same
particular output sequence, but with different probabilities. It is possible to
calculate the probability for the HMM model to generate that output
sequence by doing the summation over all possible state sequences. This
also can be done efficiently using the Forward algorithm, which is also a
dynamic programming algorithm.
In sequence analysis, this method can be used, for example, to predict the
probability that a particular DNA region matches the HMM motif (i.e. was
emitted by the HMM model).
Remark
To create an HMM model (i.e. find the most likely set of state transition and
output probabilities of each state), we need a set of (related/aligned) sequences.
No tractable algorithm is known for solving this problem exactly, but a local
maximum likelihood can be derived efficiently using the Baum-Welch
algorithm or the Baldi-Chauvin algorithm. The Baum-Welch algorithm is
an example of a forward-backward algorithm, and is a special case of the
Expectation-Maximization algorithm.
EXAMPLE:
Consider a three-state HMM, with R or B emitted by each state (e.g.,
three urns, each with red or blue balls; R stands for red and B for blue), with
emission probabilities $b_1(R) = \frac{1}{2}$, $b_2(R) = \frac{1}{3}$, $b_3(R) = \frac{3}{4}$, $b_1(B) = \frac{1}{2}$, $b_2(B) = \frac{2}{3}$, and
$b_3(B) = \frac{1}{4}$, i.e.
$$B = \begin{pmatrix} \frac{1}{2} & \frac{1}{2} \\[2pt] \frac{1}{3} & \frac{2}{3} \\[2pt] \frac{3}{4} & \frac{1}{4} \end{pmatrix} \ (\text{columns: R, B}), \qquad A = \begin{pmatrix} 0.3 & 0.6 & 0.1 \\ 0.5 & 0.2 & 0.3 \\ 0.4 & 0.1 & 0.5 \end{pmatrix} \ (\text{state transition matrix}),$$
and initial state probabilities $\pi_i = \frac{1}{3}$. Suppose
we observe the sequence O = RBR; then we can find the "optimal" state
sequence to explain this sequence of observations by running the Viterbi
algorithm by hand:
$$\delta_t(i) = \max_{Q_1, Q_2, \ldots, Q_{t-1}} P(Q_1, Q_2, \ldots, Q_t = i,\ O_1, O_2, \ldots, O_t \mid \lambda),$$
so that here $\delta_1(j) = \pi_j\, b_j(R)$, $\delta_2(j) = \max_{1 \le i \le 3} \left[\delta_1(i)\, a_{ij}\right] b_j(B)$, and
$\delta_3(j) = \max_{1 \le i \le 3} \left[\delta_2(i)\, a_{ij}\right] b_j(R)$.
In the first step, we initialize the probabilities at t = 1 to $\delta_1(j) = \pi_j\, b_j(R)$ for
each $j = 1, 2, 3$. These are $\frac{1}{6}$, $\frac{1}{9}$, $\frac{1}{4}$, respectively.
In the second step, $t = 2$, we first determine $\delta_2(1)$ by considering the three
quantities $\delta_1(i)\, a_{i1}$ for $i = 1, 2, 3$. They are respectively $\frac{1}{6}(0.3)$, $\frac{1}{9}(0.5)$, and
$\frac{1}{4}(0.4)$. The third one is the largest, so according to the algorithm we set
$$\delta_2(1) = \frac{1}{4}(0.4)\left(\frac{1}{2}\right) = 0.05,$$
and remember that the maximum-probability
path to state j = 1 at time t = 2 came from state j = 3 at time t = 1.
Similarly, to determine $\delta_2(2)$ we consider the three quantities $\delta_1(i)\, a_{i2}$ for
$i = 1, 2, 3$, respectively $\frac{1}{6}(0.6)$, $\frac{1}{9}(0.2)$, $\frac{1}{4}(0.1)$, and the first
is the largest, so we set
$$\delta_2(2) = \frac{1}{6}(0.6)\left(\frac{2}{3}\right) \approx 0.0667.$$
Finally, to determine
$\delta_2(3)$ we consider the three quantities $\delta_1(i)\, a_{i3}$ for $i = 1, 2, 3$, respectively
$\frac{1}{6}(0.1)$, $\frac{1}{9}(0.3)$, $\frac{1}{4}(0.5)$, and the third is the largest, so we set
$$\delta_2(3) = \frac{1}{4}(0.5)\left(\frac{1}{4}\right) = 0.03125.$$
In the third step, t = 3, we first determine $\delta_3(1)$ by considering the three
quantities $\delta_2(i)\, a_{i1}$ for $i = 1, 2, 3$. They are respectively $0.05 \cdot 0.3$, $0.0667 \cdot 0.5$, and
$0.03125 \cdot 0.4$. The second is the largest, so according to the algorithm we set
$$\delta_3(1) = 0.0667 \cdot 0.5 \cdot \left(\frac{1}{2}\right) \approx 0.0167,$$
and remember that the maximum-probability
path to state j = 1 at time t = 3 came from state j = 2 at time t = 2.
Similarly, to determine $\delta_3(2)$ we consider the three quantities $\delta_2(i)\, a_{i2}$ for i =
1, 2, 3, respectively $0.05 \cdot 0.6$, $0.0667 \cdot 0.2$, $0.03125 \cdot 0.1$, and the first is the
largest, so we set $\delta_3(2) = 0.05 \cdot 0.6 \cdot \frac{1}{3} = 0.01$. Finally, to determine $\delta_3(3)$ we
consider the three quantities $\delta_2(i)\, a_{i3}$ for i = 1, 2, 3, respectively $0.05 \cdot 0.1$,
$0.0667 \cdot 0.3$, $0.03125 \cdot 0.5$, and the second is the largest, so we set
$$\delta_3(3) = 0.0667 \cdot 0.3 \cdot \frac{3}{4} = 0.015.$$
Since there are only three observations, we can now use the termination step
to determine that the maximum probability for the observations O = RBR is
$P^* \approx 0.0167$, with state path $Q^* = (1, 2, 1)$.
EXAMPLE: A casino has two dice:
Fair die: $P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \frac{1}{6}$
Loaded die: $P(1) = P(2) = P(3) = P(4) = P(5) = \frac{1}{10}$; $P(6) = \frac{1}{2}$
The casino player switches back and forth between the fair and the loaded die about once every
20 turns.
Game:
1. You bet $1
2. You roll (always with a fair die)
3. The casino player rolls (maybe with the fair die, maybe with the loaded die)
4. Highest number wins $2
A sequence of rolls by the casino player:
124552646214614613613666166466163661636616361651561511514612356234
$P(124552646214614613613666166466163661636616361651561511514612356234) \approx 1.3 \times 10^{-35}$.
What portion of the sequence was generated with the fair die, and what
portion with the loaded die? For example:
124552646214614613613 (FAIR) | 6661664661636616366163616 (LOADED) | 51561511514612356234 (FAIR)
The dishonest casino model: two states, FAIR and LOADED. The chain stays in its
current state with probability 0.95 and switches with probability 0.05:
P(F→F) = P(L→L) = 0.95, P(F→L) = P(L→F) = 0.05.
Emission probabilities:
P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6;
P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2.
Let the sequence of rolls be x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4. Then what is
the likelihood of Q = (Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair)? (Say the
initial probabilities are $\pi_0(\text{Fair}) = \pi_0(\text{Loaded}) = \frac{1}{2}$.)
$$P(x, Q \mid \lambda) = \frac{1}{2}\, P(1 \mid \text{Fair})\, P(\text{Fair} \mid \text{Fair})\, P(2 \mid \text{Fair})\, P(\text{Fair} \mid \text{Fair}) \cdots P(4 \mid \text{Fair}) = \frac{1}{2} \left(\frac{1}{6}\right)^{10} (0.95)^9 \approx 5.2 \times 10^{-9}.$$
So the likelihood that the die is fair throughout this run is about $5.2 \times 10^{-9}$.
What is the likelihood of
Q = (Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded)?
$$P(x, Q \mid \lambda) = \frac{1}{2}\, P(1 \mid \text{Loaded})\, P(\text{Loaded} \mid \text{Loaded}) \cdots P(4 \mid \text{Loaded}) = \frac{1}{2} \left(\frac{1}{10}\right)^{8} \left(\frac{1}{2}\right)^{2} (0.95)^9 \approx 7.9 \times 10^{-10}.$$
Therefore, it is somewhat more likely that all the rolls were made with the fair
die than that they were all made with the loaded die. Now let the sequence
of rolls be x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6. What is the likelihood of Q = (F, F, ..., F)?
$$P(x, Q \mid \lambda) = \frac{1}{2}\, P(1 \mid F)\, P(6 \mid F) \cdots P(6 \mid F) = \frac{1}{2} \left(\frac{1}{6}\right)^{10} (0.95)^9 \approx 5.2 \times 10^{-9},$$
the same as before. What is the likelihood of Q = (L, L, ..., L)?
$$P(x, Q \mid \lambda) = \frac{1}{2}\, P(1 \mid L)\, P(6 \mid L) \cdots P(6 \mid L) = \frac{1}{2} \left(\frac{1}{10}\right)^{4} \left(\frac{1}{2}\right)^{6} (0.95)^9 \approx 4.9 \times 10^{-7}.$$
So here it is about 100 times more likely that the die is loaded.
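A sketch reproducing the four path likelihoods above (all-fair vs. all-loaded for each roll sequence):

import numpy as np

p_fair   = np.full(6, 1/6)                     # P(k | F) = 1/6
p_loaded = np.array([.1, .1, .1, .1, .1, .5])  # P(6 | L) = 1/2, others 1/10

def path_likelihood(rolls, emit, stay=0.95, start=0.5):
    """P(rolls, single-state path): initial prob, emissions, and T-1 self-loops."""
    p = start * stay ** (len(rolls) - 1)
    for r in rolls:
        p *= emit[r - 1]
    return p

x1 = [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]
x2 = [1, 6, 6, 5, 6, 2, 6, 6, 3, 6]
for x in (x1, x2):
    print(path_likelihood(x, p_fair), path_likelihood(x, p_loaded))
# x1: ~5.2e-09 (fair) vs ~7.9e-10 (loaded) -- fair more likely
# x2: ~5.2e-09 (fair) vs ~4.9e-07 (loaded) -- loaded ~100x more likely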
EXAMPLE:
Suppose we want to determine the average annual temperature at a particular
location on earth over a series of years. To make it interesting, suppose the
years we are concerned with lie in the distant past, before thermometers
were invented. Since we can't go back in time, we instead look for indirect
evidence of the temperature.
To simplify the problem, we only consider two annual temperatures, "hot"
and "cold".
Suppose that modern evidence indicates that the probability of a hot year
followed by another hot year is 0.7 and the probability that a cold year is
followed by another cold year is 0.6. We'll assume that these probabilities
held in the distant past as well. The information so far can be summarized as
$$A = \begin{pmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix}$$
with rows and columns ordered H, C, where H is "hot" and C is "cold".
Also suppose that current research indicates a correlation between the size of
tree growth rings and temperature. For simplicity, we only consider three
different tree ring sizes, small, medium and large, or S, M and L,
respectively. Finally, suppose that based on available evidence, the
probabilistic relationship between annual temperature and tree ring sizes is
given by
$$B = \begin{pmatrix} 0.1 & 0.4 & 0.5 \\ 0.7 & 0.2 & 0.1 \end{pmatrix}$$
with rows H, C and columns S, M, L.
For this system, the state is the average annual temperature, either H or C.
The transition from one state to the next is a Markov process (of order one),
since the next state depends only on the current state and the fixed
probabilities in A. However, the actual states are "hidden", since we can't
directly observe the temperature in the past.
Although we can't observe the state (temperature) in the past, we can
observe the size of tree rings. From B, tree rings provide us with probabilistic
information regarding the temperature. Since the states are hidden, this type
of system is known as a Hidden Markov Model (HMM). Our goal is to make
effective and efficient use of the observable information so as to gain insight
into various aspects of the Markov process.
0.7
The state transition matrix A  
0.4
0.1
B
0.7
0.3 
and the observation matrix
0.6
0.5
. In this example, suppose that the initial state
0.2 0.1
0.4
distribution, denoted by  is   0.6 0.4 .The matrices  , A and B are row
stochastic, meaning that each element is a probability and the elements of
each row sum to 1, that is, each row is a probability distribution.
Now consider a particular four-year period of interest from the distant past,
for which we observe the series of tree rings S, M, S, L. Letting 0 represent S, 1
represent M and 2 represent L, this observation sequence is $O = (0, 1, 0, 2)$.
We might want to determine the most likely state sequence of the Markov
process given the observations O. That is, we might want to know the most likely
average annual temperatures over the four-year period of interest. Let us define "most
likely" as the state sequence that maximizes the expected number of correct
states. HMMs can be used to find this sequence.
With the observation sequence given above, $O = (0, 1, 0, 2)$, we have $T = 4$,
$N = 2$, $M = 3$, $Q = \{H, C\}$, $V = \{0, 1, 2\}$ (where we let 0, 1, 2 represent "small",
"medium" and "large" tree rings, respectively).
The HMM is denoted by $\lambda = (A, B, \pi)$.
Consider a generic state sequence of length four, $X = (x_0, x_1, x_2, x_3)$,
with corresponding observations $O = (O_0, O_1, O_2, O_3)$. Then $\pi_{x_0}$ is the
probability of starting in state $x_0$, $b_{x_0}(O_0)$ is the probability of initially
observing $O_0$, and $a_{x_0, x_1}$ is the probability of transiting from state $x_0$ to
state $x_1$. Continuing, we see that the probability of the state sequence X is
given by
$$P(X) = \pi_{x_0} b_{x_0}(O_0)\, a_{x_0, x_1} b_{x_1}(O_1)\, a_{x_1, x_2} b_{x_2}(O_2)\, a_{x_2, x_3} b_{x_3}(O_3).$$
With the observation sequence $O = (0, 1, 0, 2)$, we can compute, say,
$$P(HHCC) = 0.6\,(0.1)(0.7)(0.4)(0.3)(0.7)(0.6)(0.1) = 0.000212.$$
Similarly, we can
directly compute the probability of each possible state sequence of length
four, assuming the given observation sequence.
To find the optimal sequence in the HMM sense, we choose the most
probable symbol at each position. To this end we sum the probabilities, in the
list of state sequence probabilities, of those that have an H in the first position.
Doing so, we find the (normalized) probability of H in the first position is
0.18817 and hence the probability of C in the first position is 0.81183. The
HMM therefore chooses the first element of the optimal sequence to be C.
We repeat this for each element of the sequence, obtaining the probabilities
of H and C at each of the four positions, and from these we find that the
optimal sequence, in the HMM sense, is CHCH.
Example:
$\lambda_1 = (\{1, 2, 3\}, \{a, b, c\}, A, B, \pi)$ with $B(i, j) = P(O_t = j \mid Q_t = i)$:
$B(1, a) = 0.6$, $B(1, b) = 0.2$, $B(1, c) = 0.2$, $B(2, a) = 0$, $B(2, b) = 0.5$, $B(2, c) = 0.5$,
$B(3, a) = 0.3$, $B(3, b) = 0$, $B(3, c) = 0.7$, i.e.
$$B = \begin{pmatrix} 0.6 & 0.2 & 0.2 \\ 0 & 0.5 & 0.5 \\ 0.3 & 0 & 0.7 \end{pmatrix};$$
$a_{11} = 0.3$, $a_{12} = 0.2$, $a_{13} = 0.5$, $a_{21} = 0.6$, $a_{22} = 0.1$, $a_{23} = 0.3$, $a_{31} = 0.2$, $a_{32} = 0.4$,
$a_{33} = 0.4$, i.e.
$$A = \begin{pmatrix} 0.3 & 0.2 & 0.5 \\ 0.6 & 0.1 & 0.3 \\ 0.2 & 0.4 & 0.4 \end{pmatrix};$$
$\pi_1 = 1$, $\pi_2 = 0$, $\pi_3 = 0$.
Example:
0.4
State-transition probabilities, A  aij  0.2
0.1
 
0.3
0.6
0.1
0.3 
0.2 
0.8
Given today is sunny (i.e., q1  3 ), what is the probability of “sun-sun-raincloud-cloud-sun” with model  .
PQ    PQ  3,3,1,2,2,3   Pq1  3Pq2  3 q1  3Pq3  1 q2  3
Pq4  2 q3  1Pq5  2 q4  2Pq6  3 q5  2   3a33a31a12a22a23  1 0.80.10.30.60.2  0.00288
where the initial state probability for state i is  i  Pq1  i 
Probability of state i producing an observation Ot is: bi Ot   POt qt  i 
which can be discrete or continuous in o.
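A short Python sketch of this chain-rule computation, assuming the state labelling 1 = rain, 2 = cloud, 3 = sun implied by the sequence above:

import numpy as np

A = np.array([[0.4, 0.3, 0.3],   # 1 = rain, 2 = cloud, 3 = sun (assumed labels)
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

Q = [3, 3, 1, 2, 2, 3]           # sun-sun-rain-cloud-cloud-sun
p = 1.0                          # pi_3 = 1, since q1 = 3 is given
for i, j in zip(Q, Q[1:]):
    p *= A[i - 1, j - 1]         # a_ij with 1-based state labels
print(p)                         # 0.00288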
EXAMPLE
Let's consider the following simple HMM. This model is composed of 2 states, H (high GC content) and L (low GC content). We can, for example, consider that state H characterizes coding DNA while L characterizes non-coding DNA.
The model can then be used to predict the regions of coding DNA in a given sequence. Consider the sequence S = GGCACTGAA. There are several paths through the hidden states (H and L) that lead to the given sequence.
Example: P = LLHHHHLLL
The probability that the HMM produces sequence S through the path P is
p = P_L(0) · P_L(G) · P_LL · P_L(G) · P_LH · P_H(C) · ...
  = 0.5 · 0.2 · 0.6 · 0.2 · 0.4 · 0.3 · ...
There are several such paths through the hidden states that lead to the given sequence S = GGCACTGAA, but they do not have the same probability.
The Viterbi algorithm is a dynamic programming algorithm that allows us to compute the most probable path. Its principle is similar to the DP programs used to align two sequences (e.g. Needleman-Wunsch).
The probability of the most probable path ending in state k with observation "i" at position x is
p_k(i, x) = e_k(i) · max_l ( p_l(x - 1) · p_lk ),
where e_k(i) is the probability of emitting "i" in state k and p_lk is the transition probability from state l to state k.
In our example, the probability of the most probable path ending in state H with observation "A" at the 4th position is:
p_H(A, 4) = e_H(A) · max ( p_H(C, 3) · p_HH , p_L(C, 3) · p_LH )
We can thus compute recursively (from the first to the last element of our sequence) the probability of the most probable path. For the calculations, it is convenient to use the log of the probabilities (rather than the probabilities themselves). Indeed, this allows us to compute sums instead of products, which is more efficient and accurate.
We use here log2(p).
Probability (in log2) that G at the first position was emitted by state H:
p_H(G, 1) = -1 - 1.737 = -2.737
Probability (in log2) that G at the first position was emitted by state L:
p_L(G, 1) = -1 - 2.322 = -3.322
Probability (in log2) that G at the 2nd position was emitted by state H:
p_H(G, 2) = -1.737 + max ( p_H(G, 1) + p_HH , p_L(G, 1) + p_LH )
          = -1.737 + max ( -2.737 - 1 , -3.322 - 1.322 )
          = -5.474 (obtained from p_H(G, 1))
Probability (in log2) that G at the 2nd position was emitted by state L:
p_L(G, 2) = -2.322 + max ( p_H(G, 1) + p_HL , p_L(G, 1) + p_LL )
          = -2.322 + max ( -2.737 - 1 , -3.322 - 0.737 )
          = -6.059 (obtained from p_H(G, 1))
We then compute iteratively the probabilities p_H(i, x) and p_L(i, x) that nucleotide i at position x was emitted by state H or L, respectively. The highest probability obtained for the nucleotide at the last position is the probability of the most probable path. This path can be retrieved by back-tracking, i.e. finding the path which corresponds to the highest probability, -24.49.
The most probable path is HHHLLLLLL. Its probability is 2^(-24.49) ≈ 4.25 × 10^(-8) (remember we used log2(p)).
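The following Python sketch implements this log-space Viterbi recursion. The numeric model values (initial probabilities 0.5/0.5; emissions G, C = 0.3 and A, T = 0.2 in state H, G, C = 0.2 and A, T = 0.3 in state L; transitions H→H = 0.5, H→L = 0.5, L→H = 0.4, L→L = 0.6) are inferred from the log2 numbers above, since the original model figure is not reproduced here; it should recover HHHLLLLLL with log2-probability about -24.49.

import numpy as np

log2 = np.log2
states = ["H", "L"]
init = {"H": log2(0.5), "L": log2(0.5)}
emit = {"H": {"A": log2(0.2), "C": log2(0.3), "G": log2(0.3), "T": log2(0.2)},
        "L": {"A": log2(0.3), "C": log2(0.2), "G": log2(0.2), "T": log2(0.3)}}
trans = {("H", "H"): log2(0.5), ("H", "L"): log2(0.5),
         ("L", "H"): log2(0.4), ("L", "L"): log2(0.6)}

S = "GGCACTGAA"
V = [{k: init[k] + emit[k][S[0]] for k in states}]   # first position
back = []                                            # back-pointers
for x in S[1:]:
    prev, col, ptr = V[-1], {}, {}
    for k in states:
        best = max(states, key=lambda l: prev[l] + trans[(l, k)])
        col[k] = emit[k][x] + prev[best] + trans[(best, k)]
        ptr[k] = best
    V.append(col)
    back.append(ptr)

last = max(states, key=lambda k: V[-1][k])           # best final state
path = [last]
for ptr in reversed(back):                           # back-track
    path.append(ptr[path[-1]])
path.reverse()
print("".join(path), V[-1][last])                    # HHHLLLLLL, about -24.49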
HMM AND SEQUENCE ALIGNMENTS
Sequence alignment is a way of writing one sequence on top of another
where the residues in one position are supposed to have a common
evolutionary origin. If the same letter occurs in both sequences then this
position has been conserved in evolution. If the letters differ it is assumed
that the two derive from an ancestral letter. Similar sequences may have different lengths, which is generally explained through insertions or deletions in sequences. Thus, a letter or a stretch of letters may be paired up with dashes in the other sequence to signify such an insertion or deletion. Since an insertion in one sequence can always be seen as a deletion in the other, one frequently uses the term "indel" to represent this.
One frequently used method for protein classification is a hidden Markov
model. In biological sequence analysis, hidden Markov models are built
based on a multiple alignment. Example:
The alignment of a sequence to a profile HMM. The squares indicate a match state, the diamonds an insert state, and the circles a delete state. The path through the HMM is shown in bold arrows.
Example: A two-state HMM modelling a DNA sequence, with the first state generating AT-rich sequences and the second generating CG-rich sequences. State transitions and their associated probabilities are indicated by arrows, and the emission probabilities for A, C, G, T in each state are indicated below the states. This model generates a state sequence as a Markov chain (middle), and each state generates a symbol according to its own emission probability distribution (bottom). The probability of the sequence is the product of the state-transition and symbol-emission probabilities. For a given observed DNA sequence, the hidden state sequence that generated it, i.e. whether each position is in a CG-rich or an AT-rich segment, can then be inferred.
An HMM can be visualised as a finite state machine. Finite state machines
move through a series of states and produce some kind of
output, either when the machine has reached a particular state
or when it is moving from state to state. The HMM generates
a protein sequence by emitting amino acids as it progresses
through a series of states. Each state has a table of amino
acid emission probabilities, and transition probabilities for
moving from state to state. Transition probabilities define a
distribution over the possible next states.
In general, the multiple alignments are generated from a training set
consisting of positive examples of protein sequences that belong to a certain
functional family sharing a level of sequence similarity.
Example:
A gap is represented by a '–'. Columns 1-3 and 6-10 are "match" columns, while columns 4 and 5 are "insert" columns.
Let us start with a multiple sequence alignment to see the structure of the sequences. Our HMM will then be a probabilistic representation of that multiple alignment.
In the alignment we see that some columns are complete, some are almost complete, and others contain few residues. The most common matches can be used as the match columns of our model, and the deletions and insertions can be modelled as other states. Here is an example of a few aligned sequences, with the core columns of the alignment marked with an *:
AC----ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
***   ***
Given a multiple alignment of protein sequences, "match", "insert", and "delete" states are first identified. If a column of the multiple alignment has less than or equal to fifty percent gaps (i.e., half or more of the sequences emit an amino acid), then it is classified as a "match column" (columns 1-3 and 6-10 in the figure above). A non-gap entry in a match column is a "match state" in the HMM, while a gap in a match column is a "delete state". Delete states are presumed to be modifications that stem from an amino acid sequence losing one or more amino acids in an evolutionary event. The last type of state is the "insert" state. "Insert columns" (columns 4 and 5 in the figure above) are similar to delete states, except that the evolutionary modification to the amino acid sequence is that of gaining amino acids. A non-gap in an insert column is an "insert state", while a gap in an insert column is ignored since it does not represent an event of evolutionary significance.
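A minimal Python sketch of this fifty-percent rule, applied to the small DNA alignment above. Note that column 4 is borderline: it has 2 gaps out of 5 sequences, so the rule as stated classifies it as a match column, whereas the *** marking above treats only columns 1-3 and 7-9 as core; practical tools differ in the exact threshold.

alignment = ["AC----ATG",
             "TCAACTATC",
             "ACAC--AGC",
             "AGA---ATC",
             "ACCG--ATC"]

n_seqs = len(alignment)
for col in range(len(alignment[0])):
    gaps = sum(seq[col] == "-" for seq in alignment)   # gaps in this column
    kind = "match" if gaps <= n_seqs / 2 else "insert"
    print(f"column {col + 1}: {gaps} gap(s) -> {kind}")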
Example:
A hidden Markov model (courtesy [16]) with delete (circle), insert
(diamond),
and match (square) states. Transitions are allowed along each arrow. Delete
and match
states can only be visited once for each position along a path. Delete states
do not emit
any symbols. Insert states are allowed to insert multiple symbols. The
alignment at the
bottom is used to build the model in this example. The sequences begin in
the start state.
Amino acids a1 and a2 are inserted at the beginning of the sequence. A3 and
B1 are the
first matched symbols, followed by a deletion, where B2 is matched with a
gap. A4 is
then matched with B3, b4 is inserted, A5 is matched with B5, and finally the
end state is
reached.
With this sample alignment we have examples of insertions and deletions between the core columns, each of which needs a state in the HMM to represent it. Note that insertions may occur an arbitrary number of times between the match states, while a deletion always replaces a match.
One possible HMM template for building the model is presented in the
following picture:
Each match state (M_j) corresponds to the match on the jth core column. The same applies to each deletion state (D_j). The insertion states are slightly different, because they represent the insertions between the core columns; that is why there is one extra insert state, which makes it possible to represent states before the first and after the last core column.
Our model will be like the one in the picture, with the same length as the core alignment of a multiple alignment for a given set of sequences. We use maximum likelihood to estimate the transition and emission probabilities for each state. The easiest way to obtain the total emission/transition counts is to thread each sequence to be profiled through the model. For each symbol in the aligned sequence, check whether the symbol is in a core column. If it is, increment the transition count into the next match state; if not, go to the insertion state; and if it is a deletion, go to the deletion state and increment that transition. Finally, to calculate the probability of each transition, divide its count by the total count of all transitions leaving the same state. It is important to notice that we have a stopping state, a special state that has no outgoing transitions.
Note that it is important to initialize the model with pseudocounts for each possible transition. Adding pseudocounts makes our model less rigid and avoids overfitting to the training set; otherwise we could end up with zero probabilities for some sequences.
To create the emission probabilities, at each match state you also count which symbols were emitted and increment the corresponding counts. To calculate each probability, divide the count by the total number of symbols matched at that state in the threading process.
A similar process could be used for the insertions. However, the insertion states are characterized by a low occurrence rate, and this may lead directly to an overfitting problem, because the number of observed emissions can be very small. To avoid this we should use the background distribution as the emission probabilities of each insertion state. The background distribution is the probability of occurrence of a given amino acid over the entire protein set. To calculate it, count each amino acid type over all the training sequences and then divide by the total count.
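A hedged sketch of the counting just described. The state names and the paths variable are hypothetical: paths stands for the list of state paths (e.g. ["M1", "M2", "I2", "M3", "END"]) obtained by threading each training sequence through the model.

from collections import Counter, defaultdict

def transition_probs(paths, allowed, pseudocount=1.0):
    """allowed maps each state to the set of states it may transition to."""
    counts = defaultdict(Counter)
    for path in paths:
        for s, t in zip(path, path[1:]):       # count observed transitions
            counts[s][t] += 1
    probs = {}
    for s, nxts in allowed.items():            # pseudocount on every allowed edge
        total = sum(counts[s][t] for t in nxts) + pseudocount * len(nxts)
        probs[s] = {t: (counts[s][t] + pseudocount) / total for t in nxts}
    return probs

def background(sequences):
    """Frequency of each amino acid over the whole training set."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}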
For the deletion states, it is important to notice that they are silent states: they do not emit any symbol at all. To signal this in ghmm, simply set all their emission probabilities to 0. Note that the end state is also a silent state, since no emission is associated with it.
ALERT: the loglikelihood is the only function in the ghmm library which
handles silent states correctly.
At this point, the model is ready for use. However, we still have the problem of how to classify a new protein sequence. A threshold could be used to divide the two classes, but the most common alternative is to compare against a null model. The null model is a model which aims to represent any protein with similar probability to any other. With these two models we can take a sequence and ask whether it is more similar to a general definition of a protein or to the specific family. The null model should model sequences whose average size equals that of the aligned sequences being handled, and should be able to emit any kind of symbol at each position. A common way of creating the null model is to use a single insertion state, which goes to a stopping state with probability 1 divided by the average length of the sequences in the training set. For the emission probabilities, we should use the background distribution, because it reflects the general amino acid distribution. In the end, the model should look like this:
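A sketch of scoring under such a null model, assuming the single-insert-state form described above; the exact transition bookkeeping (how many self-loops are charged) depends on the chosen topology.

import math

def null_log_likelihood(seq, bg, avg_len):
    """bg: background amino acid distribution; avg_len: average training length."""
    p_stop = 1.0 / avg_len
    score = len(seq) * math.log(1.0 - p_stop) + math.log(p_stop)  # self-loops + stop
    score += sum(math.log(bg[aa]) for aa in seq)                  # emissions
    return score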
For testing the proposed model, I used a set of 100 globin proteins from the NCBI protein repository as a training set to build a profile model, and used ghmm to build and test the model.
To test whether the model meets our expectations, 100 globins different from the ones in the training set were used, together with 1800 other random proteins of similar length. The loglikelihood function from the ghmm library was used to calculate the similarity index. The classification of globins versus non-globins was a comparison between the likelihood of each protein under the globin profile HMM and under the null model. This test gave us 100% accuracy! To display this result, each sequence was plotted in a graph where the horizontal axis displays the length of the sequence and the vertical axis the log of the ratio of the globin and null model likelihoods (or the globin minus null model loglikelihood). The globins are plotted in red and the others in blue. Proteins above the zero line are classified as globins, and those below are not. The graphs show a clear distinction between the classes, and that our model is very precise for this problem.
Example: These structural similarities make it possible to create a statistical
model of a protein family. The model shown below is a simplified statistical
profile, a model which shows the amino acid probability distribution for
each position in the family. According to this profile, the probability of C in
position 1 is 0.8, the probability of G in position 2 is 0.4, and so forth. The
probabilities are calculated from the observed frequencies of amino acids in
the family.
Given a profile, the probability of a sequence is the product of the amino acid probabilities given by the profile. For example, the probability of CGGSV, given the profile above, is 0.8 × 0.4 × 0.8 × 0.6 × 0.2 ≈ .031.
Given a statistical model, the probability of a sequence is used to calculate a
score for the sequence. Because multiplication of fractions is
computationally expensive and prone to floating point errors such as
underflow, a convenient transformation into the logarithmic world is used.
The score of a sequence is calculated by taking the logs of all amino acid
probabilities and adding them up. Using this method with base e logarithms,
the score of CGGSV is
log_e(0.8) + log_e(0.4) + log_e(0.8) + log_e(0.6) + log_e(0.2) ≈ -3.48
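Both numbers are quick to check in Python:

import math

p = [0.8, 0.4, 0.8, 0.6, 0.2]          # profile probabilities of C, G, G, S, V
print(math.prod(p))                     # 0.03072, i.e. about .031
print(sum(math.log(x) for x in p))      # about -3.48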
In practice, profile models take other factors into account. For example,
members of a protein family have varying lengths, so a score penalty is
charged for insertions and deletions. The scores of individual amino acids in
a profile are also position specific. In other words, more weight must be
given to an unlikely amino acid which appears in a structurally important
position in the protein than to one which appears in a structurally
unimportant position.
Although these refinements are necessary to create good profile models, they
introduce many additional free parameters which must be calculated when
building a profile, and unfortunately, the calculations must be done by trial
and error. These shortcomings set the stage for a new kind of profile, based
on the Hidden Markov model.
Like an ordinary profile, an HMM is built by analyzing the distribution of
amino acids in a training set of related proteins. Finite state machines
typically move through a series of states and produce some kind of output
either when the machine has reached a particular state or when it is moving
from state to state. The HMM generates a protein sequence by emitting
amino acids as it progresses through a series of states. Each state has a table
of amino acid emission probabilities similar to those described in a profile
model. There are also transition probabilities for moving from state to state.
A possible hidden Markov model for the protein ACCY. The protein is
represented as a sequence of probabilities. The numbers in the boxes show
the probability that an amino acid occurs in a particular state, and the
numbers next to the directed arcs show probabilities which connect the
states. The probability of ACCY is shown as a highlighted path through the
model.
The figure above shows one topology for a hidden Markov model. Although
other topologies are used, the one shown is very popular in protein sequence
analysis. Note that there are three kinds of states represented by three
different shapes. The squares are called match states, and the amino acids
emitted from them form the conserved primary structure of a protein. These
amino acids are the same as those in the common ancestor or, if not, are the
result of substitutions. The diamond shapes are insert states and emit amino
acids which result from insertions. The circles are special, silent states
known as delete states and model deletions.
Transitions from state to state progress from left to right through the model,
with the exception of the self-loops on the diamond insertion states. The
self-loops allow insertions of any length to fit the model, regardless of the
length of other sequences in the family.
Scoring a Sequence with an HMM
Any sequence can be represented by a path through the model. The
probability of any sequence, given the model, is computed by multiplying
the emission and transition probabilities along the path.
In the figure above, a path through the model represented by ACCY is
highlighted. In the interest of saving space, the full tables of emission
probabilities are not shown. Only the probability of the emitted amino acid is
given. For example, the probability of A being emitted in position 1 is 0.3,
and the probability of C being emitted in position 2 is 0.6. The probability of
ACCY along this path is .4 × .3 × .46 × .6 × .97 × .5 × .015 × .73 × .01 × 1 ≈ 1.76 × 10^(-6).
As in the profile case described above, the calculation is simplified by
transforming probabilities to logs so that addition can replace multiplication.
The resulting number is the raw score of a sequence, given the HMM.
For example, the score of ACCY along the path shown in figure above is
log_e(.4) + log_e(.3) + log_e(.46) + log_e(.6) + log_e(.97) + log_e(.5) + log_e(.015) + log_e(.73) + log_e(.01) + log_e(1) ≈ -13.25
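Again, both the raw probability and the log score can be checked directly:

import math

path = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]
print(math.prod(path))                      # about 1.76e-06
print(sum(math.log(p) for p in path))       # about -13.25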
The calculation is easy if the exact state path is known, as in the toy example
of figure above . In a real model, many different state paths through a model
can generate the same sequence. Therefore, the correct probability of a
sequence is the sum of probabilities over all of the possible state paths.
Unfortunately, a brute-force calculation of this problem is computationally
infeasible, except in the case of very short sequences. Two good alternatives
are to calculate the sum over all paths inductively using the forward
algorithm, or to calculate the most probable path through the model using
the Viterbi algorithm. Both algorithms are described below.
Figure 8. HMM with multiple paths through the model for ACCY. The
highlighted path is only one of several possibilities.
Consider the HMM shown in Figure 8. The Insert, Match, and Delete states
are labeled with their position number in the model, M1, D1 etc. (States I1
and I2 are unlabelled to reduce clutter.) Because the number of insertion
states is greater than the number of match or delete states, there is an extra
insertion state at the beginning of the model, labeled I0. Unlike the HMM in
Figure 7, where the state path for ACCY was known, several state paths
through the model are possible for this sequence.
The most likely path through the model is computed with the Viterbi
algorithm. The algorithm employs a matrix, shown in Figure 9. The columns
of the matrix are indexed by the states in the model, and the rows are
indexed by the sequence. Deletion states are not shown, since, by definition,
they have a zero probability of emitting an amino acid. The elements of the
matrix are initialized to zero and then computed with these steps:
1. The probability that the amino acid A was generated by state
I0 is computed and entered as the first element of the matrix.
2. The probabilities that C is emitted in state M1 (multiplied by
the probability of the most likely transition to state M1 from
state I0) and in state I1 (multiplied by the most likely transition
to state I1 from state I0) are entered into the matrix element
indexed by C and I1/M1.
3. The maximum probability, max(I1, M1), is calculated.
4. A pointer is set from the winner back to state I0.
5. Steps 2-4 are repeated until the matrix is filled.
Prob(A in state I0) = 0.4 × 0.3 = 0.12
Prob(C in state I1) = 0.05 × 0.6 × 0.5 = 0.015
Prob(C in state M1) = 0.46 × 0.01 ≈ 0.005
Prob(C in state M2) = 0.46 × 0.5 = 0.23
Prob(Y in state I3) = 0.015 × 0.73 × 0.01 ≈ 0.0001
Prob(Y in state M3) = 0.97 × 0.23 ≈ 0.22
The most likely path through the model can now be found by following the
back-pointers.
Figure 9. Matrix for the Viterbi algorithm
Once the most probable path through the model is known, the probability of
a sequence given the model can be computed by multiplying all probabilities
along the path.
The forward algorithm is similar to Viterbi. However in step 3, a sum rather
than a maximum is computed, and no back pointers are necessary. The
probability of the sequence is found by summing the probabilities in the last
column. The resulting matrix is shown in Figure 10.
Prob(A in state I0) = 0.4 × 0.3 = 0.12
Prob(C in state I1) = 0.05 × 0.6 × 0.5 = 0.015
Prob(C in state M1) = 0.46 × 0.01 ≈ 0.005
Prob(C in state M2) = (0.005 × 0.97) + (0.015 × 0.46) ≈ 0.012
Prob(Y in state I3) = 0.012 × 0.015 × 0.73 × 0.01 ≈ 1.31 × 10^(-6)
Prob(Y in state M3) = 0.012 × 0.97 × 0.2 ≈ 0.002
Figure 10. Matrix for the Forward algorithm
What the Score Means
Once the probability of a sequence has been determined, its score can be
computed. Because the model is a generalization of how amino acids are
distributed in a related group (or class) of sequences, a score measures the
probability that a sequence belongs to the class. A high score implies that the
sequence of interest is probably a member of the class, and a low score
implies it is probably not a member.
Local and Global Scoring
In the examples above, global scoring was used. This means simply that
computation of the score begins at the first amino acid in the sequence and
ends at the last one. Even though this may seem like the most natural way to
compute a score, the results are often misleading. Because of the
evolutionary variety found in related protein sequences, a family member
may be composed of both highly conserved areas which score well against
the model and divergent areas which score poorly. If both kinds of areas are
given equal importance, the overall score of a family member may be poor.
The solution to this problem is to use local scoring, where the score of a
sequence is set to the score of its highest scoring subsequence. The principle
can be illustrated with a very simple example. Consider again the sequence
ACCY shown in Figure 5. Converting probabilities to the log world, the
global score for the sequence is the sum of all four scores: -13.25. The
computation is shown in Figure 11.
Figure 11. The log score of ACCY
Clearly, the score has been significantly lowered by A and Y. The score is
low enough that ACCY may not appear to be a member of the family being
modeled.
Figure 12. The family of sequence ACCY
Let's assume ACCY is a member of the family shown in Figure 12. In this
case, the global score proves to be a poor measure of family membership.
However, if local scoring is used to evaluate the sequence, the final score is
much higher. The highest scoring subsequence is found to be CC, with a
score of -2.01. Unlike the global score, the local score is high enough to
classify ACCY in this family. In situations like this, classifications based on
local scoring are more accurate than those based on global scoring.
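A small sketch reproducing both numbers, assuming the per-position scores decompose into the emission and transition terms of the path product shown earlier:

import math

pos_scores = {
    "A":  math.log(0.4) + math.log(0.3),                      # about -2.12
    "C1": math.log(0.46) + math.log(0.6),                     # about -1.29
    "C2": math.log(0.97) + math.log(0.5),                     # about -0.72
    "Y":  math.log(0.015) + math.log(0.73) + math.log(0.01),  # about -9.12
}
print(sum(pos_scores.values()))              # global score, about -13.25
print(pos_scores["C1"] + pos_scores["C2"])   # local score of CC, about -2.01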
1.1.
An example of an HMM for Protein Sequences
This is the possible hidden Markov model for the protein ACCY discussed above, with emission probabilities in the boxes, transition probabilities on the directed arcs, and the probability of ACCY shown as a highlighted path through the model. These types of HMMs are called Protein Profile-HMMs and will be covered in more depth in the later sections.
1.2.
Three Problems Of Hidden Markov Models
1) Scoring Problem
We want to find the probability of an observed sequence given an HMM. One method of calculating this probability would be to find the probability of each possible hidden state sequence and sum these probabilities, but the number of such sequences grows exponentially with the length of the observation. We therefore use the Forward Algorithm.
Consider the HMM shown above. In this figure several paths exist for the protein
sequence ACCY.
The Forward algorithm employs a matrix, shown below. The columns of the matrix are
indexed by the states in the model, and the rows are indexed by the sequence. The
elements of the matrix are initialized to zero and then computed with these steps:
1. The probability that the amino acid A was generated by state I0 is computed and
entered as the first element of the matrix. This is .4*.3 = .12
2. The probabilities that C is emitted in state M1 (multiplied by the probability of the
most likely transition to state M1 from state I0) and in state I1 (multiplied by the most
likely transition to state I1 from state I0) are entered into the matrix element indexed
by C and I1/M1.
3. The sum of the two probabilities, sum(I1, M1), is calculated; unlike the Viterbi
algorithm, no back-pointer is needed.
4. Steps 2-3 are repeated until the matrix is filled.
The probability of the sequence is found by summing the probabilities in the last column.
Matrix for the Forward algorithm
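A generic sketch of the forward recursion (here indexed [position, state] rather than drawn as the matrix figure), reusing the tree-ring model from the beginning of this section as a concrete test case:

import numpy as np

def forward(pi, A, B, O):
    alpha = np.zeros((len(O), len(pi)))
    alpha[0] = pi * B[:, O[0]]                      # initialization
    for t in range(1, len(O)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]  # sum over predecessors
    return alpha[-1].sum()                          # P(O | model)

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]])
print(forward(pi, A, B, [0, 1, 0, 2]))              # equals the sum over all 16 paths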
2) Alignment Problem
We often wish to take a particular HMM, and determine from an observation sequence
the most likely sequence of underlying hidden states that might have generated it. This is
the alignment problem and the Viterbi Algorithm is used to solve this problem.
The Viterbi algorithm is similar to the forward algorithm. However, in step 3 a maximum
rather than a sum is calculated, and a back-pointer is set from the winner to the previous state. The most likely path through the model can now be found
by following the back-pointers.
Matrix for the Viterbi algorithm
Once the most probable path through the model is known, the probability of a sequence
given the model can be computed by multiplying all probabilities along the path.
3) Training Problem
Another tricky problem is how to create an HMM in the first place, given a particular set
of related training sequences. It is necessary to estimate the amino acid emission
distributions in each state and all state-to-state transition probabilities from a set of
related training sequences. This is done by using the Baum-Welch Algorithm, also
known as the Forward-Backward Algorithm.
The algorithm proceeds by making an initial guess of the parameters (which may well be
entirely wrong) and then refining it by assessing its worth, and attempting to reduce the
errors it provokes when fitted to the given data. In this sense, it is performing a form of
hill-climbing on the likelihood of the data, an expectation-maximization procedure that improves the model at each iteration.
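A minimal NumPy sketch of one Baum-Welch re-estimation pass for a discrete-emission HMM; in practice this step is repeated from an initial guess until the likelihood of the data stops improving.

import numpy as np

def baum_welch_step(pi, A, B, O):
    """One re-estimation pass; O is a list of observation symbol indices."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                      # forward pass
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    beta[-1] = 1.0                                  # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()                         # P(O | current model)
    gamma = alpha * beta / p_obs                    # P(state i at time t | O)
    xi = np.zeros((T - 1, N, N))                    # P(i at t and j at t+1 | O)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, O[t + 1]] * beta[t + 1])[None, :] / p_obs
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    obs = np.array(O)
    for k in range(B.shape[1]):                     # re-estimate emissions per symbol
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B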