MARKOV CHAIN
INTRODUCTION
Most of our study of probability has dealt with independent trials processes. These
processes are the basis of classical probability theory and much of statistics. During our
high school studies, we learned two of the principal theorems for these
processes: the Law of Large Numbers and the Central Limit Theorem.
We have seen that when a sequence of chance experiments forms an independent trials
process, the possible outcomes for each experiment are the same and occur with the same
probability. Further, knowledge of the outcomes of the previous experiments does not
influence our predictions for the outcomes of the next experiment. The distribution for
the outcomes of a single experiment is sufficient to construct a tree and a tree measure for
a sequence of n experiments, and we can answer any probability question about these
experiments by using this tree measure. Modern probability theory studies chance
processes for which the knowledge of previous outcomes influences predictions for
future experiments. In principle, when we observe a sequence of chance experiments, all
of the past outcomes could influence our predictions for the next experiment. For
example, this should be the case in predicting a student's grades on a sequence of exams
in a course. But to allow this much generality would make it very difficult to prove
general results. In 1907, A. A. Markov began the study of an important new type of
chance process. In this process, the outcome of a given experiment can affect the
outcome of the next experiment. This type of process is called a Markov chain.
We describe a Markov chain as follows: We have a set of states, $S = \{s_1, s_2, \ldots, s_r\}$.
The process starts in one of these states and moves successively from one state to
another. Each move is called a step. If the chain is currently in state $s_i$, then it moves to
state $s_j$ at the next step with a probability denoted by $p_{ij}$, and this probability does not
depend upon which states the chain was in before the current state. The probabilities $p_{ij}$
are called transition probabilities. The process can remain in the state it is in, and this
occurs with probability $p_{ii}$. An initial probability distribution, defined on S, specifies the
starting state. Usually this is done by specifying a particular state as the starting state.
Stochastic processes are used to model an extremely wide variety of real-life situations.
A discrete-time stochastic process is an infinite sequence of random variables
$(X_n)_{n \ge 0}$, usually with some features and structures in common, viewed as being
arranged in time order.
Generally, a stochastic process is a series of experiments whose outcome depends on
chance.
Let us assume that we know, for each pair of states i and j and each time t, the
probability $p_{ij}$ that the process is in state j at time $t+1$ given that it is in state i at time t. In
addition, the probability $p_{ij}(t)$ will be assumed not to depend on t. Such a process is
called a Markov chain (discrete time, with a finite set of states), named after its inventor
Andrei Andreyevich Markov (1856-1922).
With these assumptions, we can describe the system by giving the set $\{u_1, u_2, \ldots, u_m\}$ of
possible states $u_i$ and an $m \times m$ matrix P whose entry $p_{ij}$ is the
probability that the process is in state j at time $t+1$ given that it is in state i at time t, for
all t. P is called the transition matrix of the system.
We generally represent P by a directed graph G whose vertices correspond to the m
states and whose arcs correspond to the ordered pairs (i, j) such that $p_{ij} > 0$.
Definition: A Markov chain is a sequence of random variables $(X_n)_{n \ge 0} = X_0, X_1, X_2, \ldots$
with the Markov property, namely that, given the present state, the future and past states
are independent. A homogeneous Markov chain with values in a set V is a sequence of
random variables $(X_n)_{n \ge 0}$ with values in V such that there exists a function
$f : V \times V \to [0, 1]$ such that for all $n, x_0, x_1, \ldots, x_n, y$,
$$P(X_0 = x_0, X_1 = x_1, \ldots, X_n = x_n, X_{n+1} = y) = P(X_0 = x_0, \ldots, X_n = x_n)\, f(x_n, y)$$
with $f(x_n, y) = P(X_{n+1} = y \mid X_n = x_n)$.
Formally, $P(X_{n+1} = y \mid X_0 = x_0, X_1 = x_1, \ldots, X_n = x_n) = P(X_{n+1} = y \mid X_n = x_n)$. The possible
values of the $X_i$ form a countable set S called the state space of the chain. Markov chains are
often described by a directed graph, where the edges are labeled by the probabilities of
going from one state to the other states.
The transition matrix is given by $M_{xy} = f(x, y) = P(X_{t+1} = y \mid X_t = x)$.
Remark: A Markov chain is a sequence of random variables $X_1, X_2, \ldots, X_n$ whose joint
probability factors in a simple pairwise fashion:
$$P(X_1, X_2, \ldots, X_n) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_2) \cdots P(X_n \mid X_{n-1}) = P(X_1) \prod_{t=2}^{n} P(X_t \mid X_{t-1}),$$
and we can prove that the reverse order is also valid:
$$P(X_1, X_2, \ldots, X_n) = P(X_1 \mid X_2)\, P(X_2 \mid X_3) \cdots P(X_{n-1} \mid X_n)\, P(X_n).$$
A Markov chain is applicable to any sequential process with no "memory", i.e. what happens next
depends only on now and not on anything in the past. If the states are $S = \{s_1, s_2, \ldots, s_m\}$,
then the transition matrix, whose (i, j) entry gives the probability of going from state $s_i$ to $s_j$, is
$$T = \begin{pmatrix} P(s_1 \mid s_1) & P(s_2 \mid s_1) & \cdots & P(s_m \mid s_1) \\ P(s_1 \mid s_2) & P(s_2 \mid s_2) & \cdots & P(s_m \mid s_2) \\ \vdots & \vdots & & \vdots \\ P(s_1 \mid s_m) & P(s_2 \mid s_m) & \cdots & P(s_m \mid s_m) \end{pmatrix}$$
for a homogeneous Markov chain.
Theorem: Let P be the transition matrix of a Markov chain. The ij-th entry $p_{ij}^{(n)}$ of the
matrix $P^n$ gives the probability that the Markov chain, starting in state $s_i$, will be in state
$s_j$ after n steps.
Theorem: Let P be the transition matrix of a Markov chain, and let u be the
probability vector which represents the starting distribution. Then the probability that the
chain is in state $s_i$ after n steps is the i-th entry in the vector $u^{(n)} = u P^n$.
Note: if we want to examine the behavior of the chain under the assumption that it starts
in a certain state $s_i$, we simply choose u to be the probability vector with i-th entry equal to 1
and all other entries equal to 0. In fact, a Markov chain is nothing but a random walk
(biased, in general) on a directed, weighted graph (with positive weights, possibly with
loops), where the weights of the arcs are the transition probabilities (hence the condition:
the sum of the weights of the arcs leaving a vertex must be 1). The power $M^n$
has $P_{xy}^{(n)} = P(X_{t+n} = y \mid X_t = x)$ as coefficients (the sum of the weights of the paths from x to y of length
n). If the distribution of $X_0$ is $u = (u_x)_{x \in V}$, then the distribution of $X_n$ is $u^{(n)} = u P^n$
(which is a linear combination of the $P_{xy}^{(n)}$).
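These two theorems translate directly into a few lines of code. A minimal sketch in Python with numpy; the 2-state matrix P below is a made-up illustration, not one of the chains studied here:

import numpy as np

# Hypothetical 2-state transition matrix: row i is the distribution
# of the next state given that the chain is currently in state i.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
u = np.array([1.0, 0.0])           # start deterministically in state 0

n = 10
Pn = np.linalg.matrix_power(P, n)  # (P^n)_ij = n-step transition probabilities
print(Pn)                          # the n-step transition matrix
print(u @ Pn)                      # u^(n) = u P^n, the distribution after n steps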
Remark: To define a Markov chain X, we must normally define the initial distribution at
$t = 0$ (i.e. the law of $X_0$) and the transition matrix. In practice many properties depend
little (or not at all) on the initial distribution, but essentially on the transition matrix;
in that case we speak of the chain X starting from u or the chain X starting from v, if they have the
same transition matrix. (We say that the process they describe is time homogeneous,
which means that the transitional behavior stays the same throughout time.) Basically, a
Markov chain is a model in which the current value (time t) of a variable X taking values
in $\{1, \ldots, N\}$ is fully explained by the value taken by the same variable at
time $t-1$, and it is summarized in a matrix P giving the probability distribution of $X_t$ given
any possible value of $X_{t-1}$:
$$M = P = \big(P(X_t = j \mid X_{t-1} = i)\big)_{1 \le i, j \le N} = \begin{pmatrix} p_{11} & \cdots & p_{1N} \\ \vdots & \ddots & \vdots \\ p_{N1} & \cdots & p_{NN} \end{pmatrix}$$
Each row of P is a probability distribution summing to one. Since the current value is
fully determined by the knowledge of only one past period, this model is said to be of
order 1. This model is used in many different fields, including economics, chemistry,
biology and meteorology.
More generally, a Markov chain of order f, $f > 0$, is a model in which the current value
is explained by all lags up to $t - f$. The transition matrix is then of a larger size.
Definition: We denote by $(X_t)_t$ a probabilistic process over time (so it is a stochastic
process) whose value at any time depends on the outcome of a random
experiment. Thus, at each time t, $X_t$ is a random variable.
If we consider discrete time, we have a discrete-time stochastic process. If we assume
in addition that the random variables can take only a discrete set of values, we call this a
"discrete-time, discrete-space process".
Note: It is quite possible, as in the study of teletraffic, to have a process in continuous
time with a discrete state space.
Definition: X n nIN is a Markov Chain iff
PX n  j X n1  in1 , X n2  in2 ,....., X 1  i1 , X 0  i0   PX n  j X n1  in1 
in other words (it's very simple!) the probability that the chain is in a certain state in
the n-th stage of the process depends only on the state of the process at step n - 1 and not
on previous steps!
Note: A stochastic process satisfies the Markov property if and only if
the conditional probability distribution of future states, given the present state, depends
only on that present state and not on the past states. A process that has this property is called
a "Markov process".
Definition: A "homogeneous Markov chain" is a chain such that the probability of moving
to a given state at the n-th step is independent of the time: the probability distribution
characterizing the next step does not depend on the step number, and at all times the same
transition probabilities characterize the passage from the current step.
We can then define the transition probability from a state i to a state j by
$p_{ij} = P(X_{n+1} = j \mid X_n = i)$.
ERGODIC / IRREDUCIBLE CHAIN:
Definition
A Markov chain is said to be ergodic or irreducible if any state is reachable from any
other state. It is called regular if there is a power $P^k$ of its transition matrix P whose
elements are all strictly positive. A regular chain is ergodic.
Example:
Four balls are spread over two urns. At each step of the process, one ball among the
four is chosen at random, with equal probability, and changes urn. Let $X_k$ be the
number of balls in the first urn after the first k draws. Then the sequence of the $X_k$ forms
a Markov chain. Its transition matrix is given by
$$P = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\[2pt] \frac{1}{4} & 0 & \frac{3}{4} & 0 & 0 \\[2pt] 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 \\[2pt] 0 & 0 & \frac{3}{4} & 0 & \frac{1}{4} \\[2pt] 0 & 0 & 0 & 1 & 0 \end{pmatrix}$$
(states 0, 1, 2, 3, 4: the number of balls in the first urn).
From any state, we can find a path in the graph to any other state. The chain is
ergodic. However, starting, for example, from state 0, the chain will be in state
0, 2, or 4 after an even number of draws, and in state 1 or 3 after an odd number of
draws. The chain is thus not regular, since the entry $(P^n)_{ij}$ is zero each time
$i + j + n$ is odd.
Definition: A Markov chain is called an ergodic chain if it is possible to go from every
state to every state (not necessarily in one move). Ergodic Markov chains are also called
irreducible: the chain is irreducible if each state is accessible from every other state.
Accessibility: a state v is reachable from a state u if the chain, starting from u, has a
strictly positive probability of passing through v, i.e. there exists n such that
$P(X_n = v \mid X_0 = u) > 0$, i.e. there exists a path from u to v in the graph.
Definition: A Markov chain is called a regular chain if some power of the transition
matrix has only positive elements.
In other words, for some n, it is possible to go from any state to any state in exactly n
steps. It is clear from this definition that every regular chain is ergodic. On the other hand,
an ergodic chain is not necessarily regular.
Remark: Any transition matrix that has no zeros determines a regular Markov
chain. However, it is possible for a regular Markov chain to have a transition matrix that
has zeros.
Theorem. Let P be the transition matrix for a regular chain. Then, as $n \to \infty$, the
powers $P^n$ approach a limiting matrix W with all rows the same vector w. The vector w
is a strictly positive probability vector (i.e., the components are all positive and they sum
to one).
Theorem. Let P be a regular transition matrix, let $W = \lim_{n \to \infty} P^n$, let w be the
common row of W, and let c be the column vector all of whose components are 1. Then
(a) $wP = w$, and any row vector v such that $vP = v$ is a constant multiple of w;
(b) $Pc = c$, and any column vector x such that $Px = x$ is a multiple of c.
Lemma: A regular transition matrix is ergodic.
Lemma: The powers of a regular transition matrix converge to a matrix in which each
column is constant (all the entries of a column share a single value).
Definition: Fixed vectors
A row vector w with the property $wP = w$ is called a fixed row vector for P.
Similarly, a column vector x such that $Px = x$ is called a fixed column vector for P.
Remark: Assume that the value at a particular state, say state one, is 1, and then use all
but one of the linear equations from $wP = w$.
This set of equations will have a unique solution, and we can obtain w from this solution
by dividing each of its entries by their sum to give the probability vector w.
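This recipe is easy to check numerically. A minimal sketch (numpy; the regular matrix P below is a made-up example) that, instead of solving the linear system by hand, takes the left eigenvector of P for the eigenvalue 1 and normalizes it:

import numpy as np

P = np.array([[0.6, 0.4],
              [0.2, 0.8]])   # a hypothetical regular transition matrix

# wP = w is equivalent to P^T w^T = w^T, so w is the left eigenvector
# of P associated with the eigenvalue 1 (the largest eigenvalue).
vals, vecs = np.linalg.eig(P.T)
w = np.real(vecs[:, np.argmax(np.real(vals))])  # eigenvector for eigenvalue 1
w = w / w.sum()                                 # divide entries by their sum
print(w)                                        # the fixed probability vector
print(w @ P - w)                                # numerically zero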
Equilibrium
Suppose that our starting vector picks state $s_i$ as a starting state with probability $w_i$,
for all i. Then the probability of being in the various states after n steps is given by
$w P^n = w$, and is the same on all steps. This method of starting provides us with a process
that is called stationary.
Ergodic Markov Chains
Theorem. For an ergodic Markov chain, there is a unique probability vector w such that
$wP = w$ and w is strictly positive. Any row vector v such that $vP = v$ is a multiple of w.
Any column vector x such that $Px = x$ is a constant vector.
The Ergodic Theorem
Theorem. Let P be the transition matrix for an ergodic chain. Let $A_n$ be the matrix
defined by
$$A_n = \frac{I + P + P^2 + \cdots + P^n}{n + 1}.$$
Then $A_n \to W$, where W is a matrix all of whose
rows are equal to the unique fixed probability vector w for P.
Property: An ergodic transition matrix, not necessarily regular, has a single unit
eigenvector, i.e. a unique fixed probability vector.
EXAMPLE:
Julie has a cell phone plan with a fixed monthly allowance of 2 hours. In order to manage
her plan well, she notes that:
- if, during a month, she exceeded her allowance, the probability that she exceeds it the
next month is $\frac{1}{5}$;
- if she did not exceed her allowance, the probability that she exceeds it the next month
is $\frac{2}{5}$.
The question which interests us now is the evolution of the probability that Julie exceeds
her allowance. This type of question lends itself to Markov chains, because the
probability, in a given month, of exceeding the allowance or not depends only on the
probability of the previous month.
The first thing to do with a Markov chain is to determine the states. Here there are
two of them, which we also call events:
- Event A: Julie exceeded her allowance
- Event B: Julie did not exceed her allowance
As these events evolve in time, we write $A_n$ for "Julie exceeded her allowance in
month n" and $B_n$ for "Julie did not exceed her allowance in month n".
The probabilities of these events for a given month n are
- $a_n = p(A_n)$
- $b_n = p(B_n)$
By the definition of probability, $a_n + b_n = 1$, i.e. the sum of the probabilities of the events is
1. In this case, the probability $b_n$ is the complement of $a_n$, which is expressed by
$b_n = 1 - a_n$. So we just need to know the probability of event A for the first month; let us
suppose $a_1 = p(A_1) = \frac{1}{2}$.
We now build the graph associated with this Markov chain, which makes it
possible to express and visualize the probability relations between month n and
month n+1. We then define the transition matrix, which makes it possible to calculate the
probabilities for months n+1, n+2, ..., n+m.
The graph expresses the various events and their relations. We have two events here, so
we draw two points (or circles), one for each event:
A = "exceeded her allowance", B = "did not exceed her allowance".
We must now draw the relations between these events, taking each point
independently. Starting from point A, we have two possibilities: either the
allowance is exceeded again, in which case we draw an arrow that returns to A (since A
represents the event "the allowance was exceeded"); or it is not, in which case we draw an
arrow toward B. Starting from point B, we likewise have two possibilities: if the allowance
is exceeded, we draw an arrow toward A; if it is not, we draw an arrow that returns to B
(since B represents the event "the allowance was not exceeded").
Each arrow represents the evolution of the probability from one month to the next,
and we write beside each arrow the value of its probability.
How do we check that a probabilistic graph is right? The sum of the probabilities of the
arrows departing from a point must be equal to 1.
For point A we have $\frac{1}{5} + \frac{4}{5} = 1$; for point B, $\frac{2}{5} + \frac{3}{5} = 1$. The graph appears to be right.
To get the transition matrix, we seek the probability of going from A to A: we look at the
graph, read off the value of the arrow that goes from A to A, and record it at the
intersection of column A and row A. We do the same for A→B, B→B and B→A, and we
finally obtain the following matrix (column j giving the distribution of next month's state
given state j this month):
$$M = \begin{pmatrix} \frac{1}{5} & \frac{2}{5} \\[2pt] \frac{4}{5} & \frac{3}{5} \end{pmatrix}$$
Let us now calculate the probabilities of exceeding the allowance.
In a Markov chain, the probability vector $P_n$ in month n is written $P_n = (a_n, b_n)$, where
$a_n$, $b_n$, ... represent the various states of the Markov chain. In our case, there are two
states, $a_n$ and $b_n$; for $n = 1$ we have $a_n = a_1 = \frac{1}{2}$, i.e. $P_1 = \left(\frac{1}{2}, \frac{1}{2}\right)$.
The probability at stage n is then given by the relation $P_n = M \cdot P_{n-1}$
(with $P_n$ written as a column vector).
Let us compute $P_2 = M P_1$, which gives us $a_2$ and $b_2$, and consequently the probability
$a_2$ of Julie exceeding her allowance at the end of one month:
$$P_2 = \begin{pmatrix} \frac{1}{5} & \frac{2}{5} \\[2pt] \frac{4}{5} & \frac{3}{5} \end{pmatrix} \begin{pmatrix} \frac{1}{2} \\[2pt] \frac{1}{2} \end{pmatrix} = \begin{pmatrix} 0.3 \\ 0.7 \end{pmatrix}$$
$P_{n+1} = M \cdot P_n$; $P_{n+2} = M^2 \cdot P_n$; ...; $P_{n+m} = M^m \cdot P_n$.
To know the evolution of the probability at the end of m periods, we just need to multiply
by the transition matrix m times.
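As a sketch, the whole computation for Julie's chain takes a few lines of Python (numbers and the column-vector convention from the example above):

import numpy as np

# Column j of M is the distribution of next month's state given state j
# (A = exceeded the allowance, B = did not exceed it).
M = np.array([[1/5, 2/5],
              [4/5, 3/5]])
P = np.array([1/2, 1/2])      # P_1 = (a_1, b_1)

for n in range(2, 8):
    P = M @ P                 # P_n = M P_{n-1}
    print(n, P)               # P_2 = (0.3, 0.7), then convergence toward (1/3, 2/3)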
Example: According to Kemeny, Snell, and Thompson, the Land of Oz is blessed by
many things, but not by good weather. They never have two nice days in a row. If they
have a nice day, they are just as likely to have snow as rain the next day. If they have
snow or rain, they have an even chance of having the same the next day. If there is
change from snow or rain, only half of the time is this a change to a nice day. With this
information we form a Markov chain as follows.
We take as states the kinds of weather R, N, and S. From the above information we
determine the transition probabilities. These are most conveniently represented in a
square array as
$$P = \begin{pmatrix} \frac{1}{2} & \frac{1}{4} & \frac{1}{4} \\[2pt] \frac{1}{2} & 0 & \frac{1}{2} \\[2pt] \frac{1}{4} & \frac{1}{4} & \frac{1}{2} \end{pmatrix}$$
with rows and columns ordered R, N, S.
The entries in the first row of the matrix P represent the probabilities for the various
kinds of weather following a rainy day. Similarly, the entries in the second and third rows
represent the probabilities for the various kinds of weather following nice and snowy
days, respectively. Such a square array is called the matrix of transition probabilities, or
the transition matrix.
We consider the question of determining the probability that, given the chain is
in state i today, it will be in state j two days from now. We denote this probability
by pij2  .we see that if it is rainy today then the event that it is snowy two days from now is
the disjoint union of the following three events: 1) it is rainy tomorrow and snowy two
days from now, 2) it is nice tomorrow and snowy two days from now, and 3) it is snowy
tomorrow and snowy two days from now. The probability of the first of these events is
the product of the conditional probability that it is rainy tomorrow, given that it is rainy
today, and the conditional probability that it is snowy two days from now, given that it is
rainy tomorrow.
Using the transition matrix P, we can write this product as $p_{11} p_{13}$. The other two events
also have probabilities that can be written as products of entries of P. Thus, we have
$$p_{13}^{(2)} = p_{11} p_{13} + p_{12} p_{23} + p_{13} p_{33}.$$
This equation should remind the reader of a dot product of
two vectors; we are dotting the first row of P with the third column of P. This is just what
is done in obtaining the (1, 3)-entry of the product of P with itself. In general, if a Markov
chain has r states, then
$$p_{ij}^{(2)} = \sum_{k=1}^{r} p_{ik} p_{kj}.$$
The following general theorem is easy to prove by
using the above observation and induction. So, the powers of the transition matrix give
us interesting information about the process as it evolves. We shall be particularly interested in
the state of the chain after a large number of steps. The program MatrixPowers
computes the powers of P.
We note that after six days our weather predictions are, to three-decimal-place accuracy, independent of today's weather. The probabilities for the three types of
weather, R, N, and S, are .4, .2, and .4 no matter where the chain started. This
is an example of a type of Markov chain called a regular Markov chain. For this
type of chain, it is true that long-range predictions are independent of the starting
state. Not all chains are regular, but this is an important class of chains.
We now consider the long-term behavior of a Markov chain when it starts in a
state chosen by a probability distribution on the set of states, which we will call a
probability vector. A probability vector with r components is a row vector whose
entries are non-negative and sum to 1. If u is a probability vector which represents
the initial state of a Markov chain, then we think of the ith component of u as
representing the probability that the chain starts in state $s_i$. If we want to examine the
behavior of the chain under the assumption that it starts in a certain state $s_i$, we simply
choose u to be the probability vector with i-th entry equal to 1 and all other entries equal
to 0.
For example, let the initial probability vector u equal $\left(\frac{1}{3}, \frac{1}{3}, \frac{1}{3}\right)$. Then we can calculate
the distribution of the states after three days as $u P^3$.
Find the limiting vector w for the Land of Oz: we require $w_1 + w_2 + w_3 = 1$ and
$$wP = (w_1\ \ w_2\ \ w_3) \begin{pmatrix} \frac{1}{2} & \frac{1}{4} & \frac{1}{4} \\[2pt] \frac{1}{2} & 0 & \frac{1}{2} \\[2pt] \frac{1}{4} & \frac{1}{4} & \frac{1}{2} \end{pmatrix} = (w_1\ \ w_2\ \ w_3),$$
i.e.
$$\begin{cases} \frac{1}{2} w_1 + \frac{1}{2} w_2 + \frac{1}{4} w_3 = w_1 \\[2pt] \frac{1}{4} w_1 + \frac{1}{4} w_3 = w_2 \\[2pt] \frac{1}{4} w_1 + \frac{1}{2} w_2 + \frac{1}{2} w_3 = w_3 \end{cases} \qquad w_1 + w_2 + w_3 = 1.$$
The solution is $w = (0.4\ \ 0.2\ \ 0.4)$.
Another method: set $w_1 = 1$, and then solve the first and second linear equations
from $wP = w$:
$$\begin{cases} \frac{1}{2} + \frac{1}{2} w_2 + \frac{1}{4} w_3 = 1 \\[2pt] \frac{1}{4} + \frac{1}{4} w_3 = w_2 \end{cases}$$
We obtain $(w_1\ \ w_2\ \ w_3) = \left(1\ \ \frac{1}{2}\ \ 1\right)$; dividing by the sum of the entries, $\frac{5}{2}$,
gives back $w = (0.4\ \ 0.2\ \ 0.4)$.
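Both methods can be checked numerically; a quick sketch with numpy, using the Land of Oz matrix above:

import numpy as np

P = np.array([[0.50, 0.25, 0.25],    # R
              [0.50, 0.00, 0.50],    # N
              [0.25, 0.25, 0.50]])   # S

# The powers of a regular matrix converge to W, whose rows all equal w.
print(np.linalg.matrix_power(P, 20))  # every row is close to (0.4, 0.2, 0.4)

w = np.array([0.4, 0.2, 0.4])
print(w @ P)                          # returns w: the fixed probability vector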
HIDDEN MARKOV CHAIN
Example:
Suppose we have a red urn containing one red ball and four black balls, and a
black urn containing three red balls and one black ball. We perform a series
of experiments consisting of drawing a ball with replacement from the urns, as follows:
1) the first draw takes place at random in one of the two urns;
2) after each draw from an urn, the ball is put back into the same urn;
3) after each draw, the next ball is drawn from the urn that has the color of the ball just drawn.
Here the states correspond to the urns and the observation symbols are the colors of the
balls selected: $S = \{s_1, s_2\}$, $O = \{o_1, o_2\}$.
So, writing $p_n$ for the probability that the n-th ball drawn is red and $q_n = 1 - p_n$ for the
probability that it is black,
$$p_{n+1} = \frac{1}{5} p_n + \frac{3}{4} q_n = \frac{1}{5} p_n + \frac{3}{4}(1 - p_n) = -\frac{11}{20} p_n + \frac{3}{4}, \qquad q_{n+1} = \frac{4}{5} p_n + \frac{1}{4}(1 - p_n),$$
i.e.
$$\begin{cases} p_{n+1} = \dfrac{1}{5} p_n + \dfrac{3}{4} q_n \\[4pt] q_{n+1} = \dfrac{4}{5} p_n + \dfrac{1}{4} q_n \end{cases} \qquad (q_n = 1 - p_n).$$
If $\lim_{n \to \infty} p_n = p$, then we must have $p = -\frac{11}{20} p + \frac{3}{4}$, which gives $p = \frac{15}{31}$ and $q = \frac{16}{31}$.
Substituting $\frac{15}{31}$ in the formula, $\frac{15}{31} = -\frac{11}{20} \cdot \frac{15}{31} + \frac{3}{4}$, and subtracting this from
$p_{n+1} = -\frac{11}{20} p_n + \frac{3}{4}$ shows that
$$p_{n+1} - \frac{15}{31} = -\frac{11}{20}\left(p_n - \frac{15}{31}\right),$$
i.e. $\left(p_n - \frac{15}{31}\right)_{n \in \mathbb{N}}$ is a geometric sequence with common ratio $-\frac{11}{20}$.
This sequence converges to zero regardless of the choice of $p_0$.
Second Representation: Probabilistic Graph
Two possible states: the first state, red ball; the second state, black ball.
Four conditional probabilities: the probability of having a red ball knowing that the
previous draw yielded a black ball, etc.
Let us now use $p_n$ and a weighted graph (the probability graph) which displays what is occurring.
The probability graph transition matrix is
$$M = \begin{pmatrix} \frac{1}{5} & \frac{4}{5} \\[2pt] \frac{3}{4} & \frac{1}{4} \end{pmatrix},$$
the state transition matrix with $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$. The recurrences
$$\begin{cases} p_{n+1} = \dfrac{1}{5} p_n + \dfrac{3}{4} q_n \\[4pt] q_{n+1} = \dfrac{4}{5} p_n + \dfrac{1}{4} q_n \end{cases}$$
can then be written in matrix form as $(p_{n+1}\ \ q_{n+1}) = (p_n\ \ q_n)\, M$.
Hence $(p_{n+1}\ \ q_{n+1}) = (p_n\ \ q_n)\, M = (p_0\ \ q_0)\, M^{n+1}$. We can compute $M^n$ using the
binomial formula, an eigenvalue/eigenvector decomposition, or a TI-83: press
2nd x⁻¹ (MATRX), choose EDIT, ENTER, input the values, 2nd MODE (QUIT),
then 2nd x⁻¹, ENTER, and type ^100, ENTER.
$$\lim_{n \to \infty} (p_n\ \ q_n) = (p_0\ \ q_0) \lim_{n \to \infty} M^n = (p_0\ \ q_0) \begin{pmatrix} \frac{15}{31} & \frac{16}{31} \\[2pt] \frac{15}{31} & \frac{16}{31} \end{pmatrix} = \left(\frac{15}{31},\ \frac{16}{31}\right).$$
Note: A Markov chain is a random process on a finite number of states with memoryless
transition probabilities.
Example: consider a simple 3-state Markov model of the weather. We assume that once
a day (e.g., at noon), the weather is observed as being one of the following:
State 1: rain (or snow)
State 2: cloudy
State 3: sunny.
We postulate that the weather on day t is characterized by a single one of these three
states above, and that the matrix A of state transition probabilities is
$$A = (a_{ij}) = \begin{pmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{pmatrix}.$$
Given that the weather on day 1 ($t = 1$) is sunny (state 3), we can ask the question: what
is the probability (according to the model) that the weather for the next 7 days will be
"sun-sun-rain-rain-sun-cloudy-sun"? Stated more formally, we define the observation
sequence O as $O = (s_3, s_3, s_3, s_1, s_1, s_3, s_2, s_3)$ corresponding to $t = 1, 2, 3, \ldots, 8$, and we wish
to determine the probability of O, given the model. This probability can be expressed
(and evaluated) as
$$P(O \mid \text{Model}) = P(s_3)\, P(s_3 \mid s_3)\, P(s_3 \mid s_3)\, P(s_1 \mid s_3)\, P(s_1 \mid s_1)\, P(s_3 \mid s_1)\, P(s_2 \mid s_3)\, P(s_3 \mid s_2)$$
$$= \pi_3\, a_{33} a_{33} a_{31} a_{11} a_{13} a_{32} a_{23} = 1 \cdot 0.8 \cdot 0.8 \cdot 0.1 \cdot 0.4 \cdot 0.3 \cdot 0.1 \cdot 0.2 = 1.536 \times 10^{-4},$$
where $\pi_i = P(q_1 = s_i)$, $1 \le i \le N$, is the initial state probability.
Now, given that the model is in a known state, what is the probability that it stays in that state
for exactly d days? This probability can be evaluated as the probability of the observation
sequence $O = (\underbrace{s_i, s_i, s_i, \ldots, s_i}_{1,\,2,\,\ldots,\,d},\ \underbrace{s_j \ne s_i}_{d+1})$, given the model, which is
$$p_i(d) = P(O \mid \text{Model}, q_1 = s_i) = (a_{ii})^{d-1}(1 - a_{ii}).$$
$p_i(d)$ is the discrete probability density function of the duration d in state i. The expected
number of observations (duration) in a state, conditioned on starting in that state, is
$$E[d_i] = \sum_{d=1}^{\infty} d\, p_i(d) = \sum_{d=1}^{\infty} d\, (a_{ii})^{d-1}(1 - a_{ii}) = \frac{1}{1 - a_{ii}}.$$
So the expected number of consecutive days of sunny weather, according to the model, is
$\frac{1}{0.2} = 5$; for cloudy it is $2.5$; for rain it is $1.67$.
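Both computations above reduce to a few lines of code; a sketch with numpy, using the weather matrix A:

import numpy as np

A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

# O = (s3, s3, s3, s1, s1, s3, s2, s3), written with 0-based state indices.
path = [2, 2, 2, 0, 0, 2, 1, 2]
p = 1.0                                   # pi_3 = 1: day 1 is known to be sunny
for s, t in zip(path, path[1:]):
    p *= A[s, t]                          # multiply the 7 transition probabilities
print(p)                                  # 1.536e-04

# Expected duration in each state: 1 / (1 - a_ii).
print(1 / (1 - np.diag(A)))               # [1.67, 2.5, 5.0] for rain, cloudy, sunny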
Example:
Consider the game of tossing a coin. We have three unfair (biased) coins. Suppose a person
has, say, three coins and is sitting inside a room tossing them in some sequence; the room
is closed, and what you are shown (on a display outside the room) is only the outcomes of
his tosses, TTHTHHTT...; this will be called the observation sequence. You do not
know the sequence in which he is tossing the different coins, nor do you know the bias of
the various coins. For a given observation, the symbols (heads, tails) can be generated
by the first, the second, or the third coin.
A hidden Markov chain models the experiment as follows:
we consider that a state is modeled by a coin and that each state can produce two symbols:
heads or tails. We assume that we can move from one state to another (change of coin)
according to a certain probability distribution and that each coin has its own
probability distribution for generating those symbols (heads and tails). The sequence of
heads and tails is observable (matrix B), but the sequence of states that generates this
sequence is not observable (matrix A). We say that it is hidden.
For example, with states 1, 2, 3 (the coins) and symbols Tail, Head:
$$A = \begin{pmatrix} 0.2 & 0.3 & 0.5 \\ 0.2 & 0.1 & 0.7 \\ 0.7 & 0.2 & 0.1 \end{pmatrix}, \qquad B = \begin{pmatrix} 0.1 & 0.9 \\ 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix} \ (\text{columns: Tail, Head}), \qquad \pi = (0.1,\ 0.5,\ 0.4).$$
Let us suppose we observe $O = \text{TTTHHTHHTTTHTT}$, where T = tail and H = head.
Bayes' theorem states that
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{normalisation factor}}, \quad\text{i.e.}\quad P(\lambda \mid O) = \frac{P(O \mid \lambda)\, P(\lambda)}{P(O)},$$
where O denotes a series of measured or observed data, and λ comprises a set of model
parameters:
- $p(\lambda \mid O)$ is the posterior probability
- $p(O \mid \lambda)$ is the likelihood
- $p(\lambda)$ is the prior probability
- $p(O)$ is the evidence
Find the most likely (optimal) state sequence:
given an observation O and an HMM $\lambda = (A, B, \pi)$, find the state path q most likely to have
been followed in the generation of the observation O by the HMM, i.e. the path for which
$P(q \mid O, \lambda)$ is maximal. This path is determined using the Viterbi algorithm.
HIDDEN MARKOV CHAIN AND PROTEIN CLASSIFICATION
HIDDEN MARKOV MODEL
Introduction
Proteins are large, organic molecules and are among the most important
components in the cells of living organisms. They are more diverse in
structure and function than any other kind of molecule. Enzymes, antibodies,
hormones, transport molecules, hair, skin, muscle, tendons, cartilage, claws,
nails, horns, hooves, and feathers are all made of proteins. Faster and more
sensitive and accurate methods are required to classify these proteins into
families and predict their functions. Many existing protein classification
methods build hidden Markov models (HMMs) and other forms of
profiles/motifs based on multiple alignments. These methods in general
require a large amount of time for building models and also for predicting
functions based on them. Furthermore, they can predict protein functions
only if sequences are sufficiently conserved. When there is very little
sequence similarity, these methods often fail, even if sequences share some
structural similarities. Machine learning methods that have been studied
specifically for a problem of protein classification include HMM and
support vector machine (SVM) methods.
Hidden Markov models (HMMs) are a formal foundation for making
probabilistic models of linear sequence “labeling” problems. They provide a conceptual toolkit that
allows building a model of almost any complexity, just by drawing an
intuitive picture. They are at the heart of a diverse range of programs,
including gene finding, consensus profile searches, multiple sequence
alignment, and regulatory site identification. HMMs are the Legos of
computational sequence analysis.
The hidden Markov model (HMM) is a Markov model in which the states are
hidden. There are several types of HMM. The first distinction is made
according to the nature of the probability density function used to
generate the observations. When the distribution is obtained directly by
counting, the HMM is called discrete. The use of a continuous distribution,
generally approximated by a mixture of Gaussians, leads
to continuous HMMs. There is also a compromise between these two
families, called semi-continuous HMMs. Indeed, the use of
quantification induces a loss of information that can be detrimental to
the models. On the other hand, the use of continuous modeling leads to an
increase in the number of parameters to be estimated. The semi-continuous
HMM is an alternative which optimizes the overall number of model parameters.
There is another typology of HMMs, based on the emission of
observations. Observations are usually produced by the states of the model;
this is called a state-emission model. However, it is also possible to
emit observations when crossing the transitions; this is called an
arc-emission model. The choice is guided by the
application. However, for an equal number of states, arc models
allow a greater number of possibilities for the emission of
observations. When modeling a phenomenon by an arc model, it may be
worthwhile to allow transitions to be crossed without emission of
observations, especially to model the absence of an event.
EXTENSION OF MARKOV MODEL TO HIDDEN MARKOV
MODELS
We now extend the concept of a Markov model to include the case
where the observation is a probabilistic function of the state; i.e., the
resulting model (which is called a hidden Markov model) is a doubly
embedded stochastic process with an underlying stochastic process that is
not observable (it is hidden) and can only be observed through another set
of stochastic processes that produce the sequence of observations.
Example: Suppose a person has, say, three coins and is sitting inside a room
tossing them in some sequence; the room is closed, and what you are shown
(on a display outside the room) is only the outcomes of his tosses,
TTHTHHTT...; this will be called the observation sequence. You do not
know the sequence in which he is tossing the different coins, nor do you
know the bias of the various coins. To appreciate how much the outcome
depends on the individual biasing and the order of tossing the coins, suppose
you are given that the third coin is highly biased to produce heads and all
coins are tossed with equal probability. Then we naturally expect a
far greater number of heads than tails in the output sequence. Now if it is
given that, besides the bias, the probability of going to the third coin (state)
from either the first or the second coin (state) is zero, then, assuming that we
were in the first or second state to begin with, heads and tails will appear
with almost equal probability in spite of the bias. So we see that the output
sequence depends very much on the individual biases, the transition
probabilities between the various states, as well as on which state is chosen to
begin the observations. The three sets, namely the set of individual biases of
the three coins, the set of transition probabilities from one coin to the next,
and the set of initial probabilities of choosing the states, characterize what is
called the HIDDEN MARKOV MODEL (HMM) for this coin-tossing
experiment. HMMs allow you to estimate probabilities of unobserved
events.
Definition: The Hidden Markov Model (HMM) is a variant of a finite state
machine having a set of hidden states, Q, an output alphabet (observations),
O, transition probabilities, A, output (emission) probabilities, B, and initial
state probabilities, Π. The current state is not observable. Instead, each state
produces an output with a certain probability (B). Usually the states, Q, and
outputs, O, are understood, so an HMM is said to be a triple, ( A, B, Π ).
A Hidden Markov Model (HMM) is defined or composed of the following
elements:
- N is the number of hidden states of the model. We denote by
$S = \{s_1, s_2, \ldots, s_N\}$ the set of hidden states. At time t, a state is
represented by $Q_t$ ($Q_t \in S$). The $(Q_t)_{1 \le t \le T}$ are hidden and discrete.
Generally the states are interconnected in such a way that any
state can be reached from any other state (e.g., an ergodic model).
- M is the number of distinct observation symbols per state, i.e. the
number of distinct symbols that can be observed in every state, i.e.
the discrete alphabet size. They are represented by
$V = \{v_1, v_2, \ldots, v_M\}$. At time t, an observable symbol is denoted by
$O_t$ ($O_t \in V$). The $(O_t)_{1 \le t \le T}$ are discrete observation variables. The
observation symbols correspond to the physical output of the
system being modeled.
- A transition matrix of probabilities, denoted $A = (a_{ij})$, where $a_{ij}$ is
the probability of a transition from state i to state j. In a
stationary HMM of first order, this probability does not depend
on t. We define
$$a_{ij} = P(Q_{t+1} = s_j \mid Q_t = s_i) = P(\text{state } q_j \text{ at } t+1 \mid \text{state } q_i \text{ at } t), \quad 1 \le i, j \le N;$$
- A matrix of probability distributions, denoted $B = (b_j(k))$,
associated with the states, where $b_j(k)$ is the probability of observing
the symbol $v_k$ in state $s_j$ at time t. We define the emission
probabilities
$$b_j(k) = P(O_t = v_k \mid Q_t = s_j), \quad 1 \le j \le N,\ 1 \le k \le M, \quad\text{or}\quad b_j(o_t) = P(O_t = o_t \mid Q_t = s_j);$$
- An initial vector of probability distributions $\pi = (\pi_i)$,
where $\pi_i$ is the probability of starting in state i. We define
$\pi_i = P(Q_1 = s_i)$, where $1 \le i \le N$.
An HMM denoted by λ is completely defined by $\lambda = (A, B, \pi)$ (N and M
are implicit in the matrices A, B and the vector π).
The generation of observations by an HMM is done as follows:
1) $t = 1$: choose an initial state $Q_1 = s_i$ with probability $\pi_i$;
2) choose an observation $O_t = v_k$ with probability $b_i(k)$;
3) transition toward a new state $Q_{t+1} = s_j$ with probability $a_{ij}$;
4) set $t = t + 1$; if $t \le T$, return to step 2, otherwise terminate the
procedure (where T is the length of the sequence of observations).
It is obvious from the above that a complete specification of an HMM
requires specification of two model parameters (N and M),
specification of the observation symbols, and specification of the
three probability measures A, B, π. For convenience, we use the
compact notation $\lambda = (A, B, \pi)$ to indicate the complete parameter set
of the model.
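The four steps above can be written directly as a small sampler. A sketch in Python; the 2-state, 2-symbol matrices below are placeholders, not a model from this text:

import numpy as np

rng = np.random.default_rng(0)

A  = np.array([[0.9, 0.1], [0.2, 0.8]])   # a_ij = P(Q_{t+1} = s_j | Q_t = s_i)
B  = np.array([[0.7, 0.3], [0.1, 0.9]])   # b_i(k) = P(O_t = v_k | Q_t = s_i)
pi = np.array([0.5, 0.5])                 # pi_i = P(Q_1 = s_i)

def generate(T):
    """Sample a (states, observations) pair of length T from lambda = (A, B, pi)."""
    q = rng.choice(2, p=pi)               # step 1: initial state, Q_1 = s_i w.p. pi_i
    states, obs = [], []
    for _ in range(T):
        states.append(q)
        obs.append(rng.choice(2, p=B[q])) # step 2: emit O_t = v_k w.p. b_i(k)
        q = rng.choice(2, p=A[q])         # step 3: move to Q_{t+1} = s_j w.p. a_ij
    return states, obs                    # step 4 is the loop itself

print(generate(10))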
Example: (figure omitted) probabilistic parameters of a hidden Markov model:
x — states
y — possible observations
a — state transition probabilities
b — output probabilities
Two assumptions are made by the model.
- The first, called the Markov assumption, is that the sequence of hidden states
is governed by a discrete-time Markov process of order 1; that is to say, the
probability of a hidden state depends only on the previous hidden state of the
sequence, and these dependency probabilities do not change over
time. The current state depends only on the previous state; this
represents the memory of the model: the t-th hidden variable, given
the (t-1)-th hidden variable, is independent of all earlier variables. If we
denote by $Q = (Q_1, Q_2, \ldots, Q_T)$ the sequence of hidden states, then
$$P(Q_t \mid Q_{t-1}, O_{t-1}, Q_{t-2}, O_{t-2}, \ldots, Q_1, O_1) = P(Q_t \mid Q_{t-1}), \quad\text{i.e.}\quad P(Q \mid \lambda) = P(Q_1) \prod_{t=1}^{T-1} P(Q_{t+1} \mid Q_t, \lambda).$$
- The second hypothesis says that the emission probability of a symbol
depends only on the hidden state in which the process is; i.e., the
output observation at time t depends only on the current state and is
independent of previous observations and states: the t-th observation
depends only on the t-th state. If we denote by $O = (O_1, O_2, \ldots, O_T)$ the
sequence of observations, then
$$P(O_t \mid Q_t, Q_{t-1}, O_{t-1}, Q_{t-2}, O_{t-2}, \ldots, Q_1, O_1) = P(O_t \mid Q_t), \quad\text{i.e.}\quad P(O \mid Q, \lambda) = \prod_{t=1}^{T} P(O_t \mid Q_t, \lambda).$$
The likelihood of the observation O relative to the model λ is
$$P(O \mid \lambda) = \sum_{Q} P(O \mid Q, \lambda)\, P(Q \mid \lambda).$$
Graphically, the dependencies of the probabilities can be summarized in a diagram (figure omitted).
Given an HMM and a sequence of observations, we would like to be able to
compute $P(O \mid \lambda)$, the probability of the observation sequence given a model.
This problem can be viewed as one of evaluating how well a model
predicts a given observation sequence, and it thus allows us to choose the most
appropriate model from a set.
The probability of the observations $O = (o_1, o_2, \ldots, o_T)$ for a specific state
sequence $Q = (q_1, q_2, \ldots, q_T)$ is
$$P(O \mid Q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2)\, b_{q_3}(o_3) \cdots b_{q_T}(o_T),$$
where we assumed statistical independence of the observations, and the probability
of the state sequence is
$$P(Q \mid \lambda) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3}\, a_{q_3 q_4} \cdots a_{q_{T-1} q_T}.$$
The joint probability of O and Q, i.e., the probability that O and Q occur
simultaneously, is simply the product of the above two terms,
$P(O, Q \mid \lambda) = P(O \mid Q, \lambda)\, P(Q \mid \lambda)$; so we can calculate the probability of the
observations given the model as
$$P(O \mid \lambda) = \sum_{Q} P(O \mid Q, \lambda)\, P(Q \mid \lambda) = \sum_{q_1, \ldots, q_T} \pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2)\, a_{q_2 q_3} b_{q_3}(o_3) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T).$$
This result allows the evaluation of the probability of O, but evaluating it
directly would be exponential in T.
Note:
1) We are mostly going to consider the special case of ergodic or fully
connected HMMs, in which every state of the model can be reached
(in a single step) from every other state of the model. Strictly
speaking, an ergodic model has the property that every state can be
reached from every other state in a finite number of steps. We will also
consider the case where the underlying "hidden" Markov chain defined by
$P(Q_t \mid Q_{t-1})$ is time-homogeneous, i.e. independent of the time t.
2) The difference between a Markov chain and a hidden Markov model
is in the information known on each state. In a Markov chain, for any
sequence, all state transitions are exactly known– i.e., there is a
unique, known path through the model. In a hidden Markov model,
the state information is hidden from the user.
3) In the case of protein classification, hidden Markov models have a
finite set of states $\{a_1, a_2, \ldots, a_n\}$, including a begin state (where the
sequence begins) and an end state (where the sequence terminates).
Each state has two probabilities associated with it:
• the transition probability $T_{ij}$, or the probability that a state $a_i$ will
transition to another state $a_j$, where $j = 1, \ldots, n$, and
• the emission probability $E(x_j)$, or the probability that a state $a_j$ will
emit a particular symbol x. Emission probabilities are properties only of
HMMs and not of Markov chains.
NOTE: Computing $P(O \mid \lambda)$ — Forward procedure
Consider $\alpha_t(i) = P(O_1, O_2, \ldots, O_t, Q_t = i \mid \lambda)$:
1) initially $\alpha_1(i) = \pi_i\, b_i(O_1)$ for $1 \le i \le N$;
2) for $t = 2, 3, \ldots, T$: $\alpha_t(j) = \left(\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\right) b_j(O_t)$ for $1 \le j \le N$;
3) finally $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$.
So we solve the computation of $P(O \mid \lambda)$ by recursion.
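This recursion transcribes directly into code. A sketch (numpy; observations given as integer symbol indices, A, B, pi as in the notation above):

import numpy as np

def forward(A, B, pi, obs):
    """Return P(O | lambda) using alpha_t(i) = P(O_1..O_t, Q_t = i | lambda)."""
    alpha = pi * B[:, obs[0]]             # step 1: alpha_1(i) = pi_i b_i(O_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # step 2: (sum_i alpha_{t-1}(i) a_ij) b_j(O_t)
    return alpha.sum()                    # step 3: sum_i alpha_T(i)

# Hypothetical toy model, for illustration only.
A  = np.array([[0.9, 0.1], [0.2, 0.8]])
B  = np.array([[0.7, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
print(forward(A, B, pi, [0, 1, 1, 0]))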
NOTE: Computing $P(O \mid \lambda)$ — Backward procedure
Define $\beta_t(i) = P(O_{t+1}, O_{t+2}, \ldots, O_T \mid Q_t = i, \lambda)$:
1) initially $\beta_T(i) = 1$ for $1 \le i \le N$;
2) for $t = T-1, T-2, \ldots, 1$: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$ for $1 \le i \le N$;
3) finally $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(O_1)\, \beta_1(i)$.
We now have another efficient way of computing $P(O \mid \lambda)$.
VITERBI ALGORITHM
We want to find the state sequence $I = (s_{i_1}, s_{i_2}, \ldots, s_{i_n})$ which has the maximum
probability of generating $O = (O_1, O_2, \ldots, O_n)$, i.e. which achieves
$\max_I P(I \mid O, \lambda) = \max_I P(O, I \mid \lambda)$ (the two maximizers coincide, since $P(O \mid \lambda)$
does not depend on I).
Viterbi algorithm:
the observed symbols and the hidden states are in 1-1 correspondence, and
the most likely sequence of states at time t depends
only on t and the most likely sequences at time t-1.
The Viterbi algorithm is used to compute the most probable path (as well as
its probability). It requires knowledge of the parameters of the HMM model
and a particular output sequence, and it finds the state sequence that is most
likely to have generated that output sequence. It works by finding a
maximum over all possible state sequences.
Given observations $O = (O_1, O_2, \ldots, O_T)$, find the state sequence
$Q = (Q_1, Q_2, \ldots, Q_T)$ with greatest
likelihood: $Q^* = \arg\max_Q P(O, Q \mid \lambda) = \arg\max_Q \pi(Q)$, where
$$\pi(Q) = \pi_{Q_1}\, b_{Q_1}(O_1) \prod_{t=2}^{T} a_{Q_{t-1} Q_t}\, b_{Q_t}(O_t).$$
The Viterbi algorithm is an inductive algorithm that allows us to find the
optimal state sequence $Q^*$ efficiently.
Initially, for $1 \le i \le N$:
$$\delta_1(i) = \pi_i\, b_i(O_1), \qquad \psi_1(i) = 0.$$
For $t = 2, 3, \ldots, T$ and $1 \le j \le N$:
$$\delta_t(j) = \max_i \left[\delta_{t-1}(i)\, a_{ij}\right] b_j(O_t), \qquad \psi_t(j) = \arg\max_i \left[\delta_{t-1}(i)\, a_{ij}\right].$$
Finally
$$P^* = \max_i \delta_T(i), \qquad Q_T^* = \arg\max_i \delta_T(i),$$
and we trace back, for $t = T-1, T-2, \ldots, 1$:
$$Q_t^* = \psi_{t+1}(Q_{t+1}^*), \qquad Q^* = (Q_1^*, Q_2^*, \ldots, Q_T^*).$$
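The same recursions, written as a sketch in Python (delta is kept as a full T×N table so that the psi back-pointers can be traced at the end):

import numpy as np

def viterbi(A, B, pi, obs):
    """Most probable state path Q* and its probability P* for lambda = (A, B, pi)."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi   = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                    # delta_1(i) = pi_i b_i(O_1)
    for t in range(1, T):
        cand = delta[t-1][:, None] * A              # cand[i, j] = delta_{t-1}(i) a_ij
        psi[t]   = cand.argmax(axis=0)              # psi_t(j)
        delta[t] = cand.max(axis=0) * B[:, obs[t]]  # delta_t(j)
    q = [int(delta[-1].argmax())]                   # Q*_T
    for t in range(T - 1, 0, -1):                   # trace back
        q.append(int(psi[t][q[-1]]))
    return q[::-1], delta[-1].max()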
Reformulating the optimisation
Recall the likelihood calculation
$$P(O, Q \mid \lambda) = P(O \mid Q, \lambda)\, P(Q \mid \lambda) = \pi_{q_1} b_{q_1}(O_1)\, a_{q_1 q_2} b_{q_2}(O_2)\, a_{q_2 q_3} b_{q_3}(O_3) \cdots$$
Now, taking the negative logarithm of $\pi(Q) = \pi_{Q_1} b_{Q_1}(O_1) \prod_{t=2}^{T} a_{Q_{t-1} Q_t} b_{Q_t}(O_t)$, we get
$$\tilde{\pi}(Q) = -\ln\!\left(\pi_{Q_1} b_{Q_1}(O_1)\right) - \sum_{t=2}^{T} \ln\!\left(a_{Q_{t-1} Q_t}\, b_{Q_t}(O_t)\right).$$
Hence $Q^* = \arg\max_Q P(O, Q \mid \lambda) = \arg\max_Q \pi(Q)$ becomes
$Q^* = \arg\min_Q \tilde{\pi}(Q)$.
In sequence analysis, this method can be used, for example, to predict coding
vs non-coding sequences.
In fact there are often many state sequences that can produce the same
particular output sequence, but with different probabilities. It is possible to
calculate the probability for the HMM model to generate that output
sequence by doing the summation over all possible state sequences. This
also can be done efficiently using the Forward algorithm, which is also a
dynamic programming algorithm.
In sequence analysis, this method can be used, for example, to predict the
probability that a particular DNA region matches the HMM motif (i.e. was
emitted by the HMM model).
Remark
To create an HMM model (i.e. find the most likely set of state transition and
output probabilities of each state), we need a set of (related/aligned) sequences.
No tractable algorithm is known for solving this problem exactly, but a local
maximum likelihood can be derived efficiently using the Baum-Welch
algorithm or the Baldi-Chauvin algorithm. The Baum-Welch algorithm is
an example of a forward-backward algorithm, and is a special case of the
Expectation-Maximization algorithm.
EXAMPLE:
Consider a three-state HMM, with R or B emitted by each state (e.g.,
three urns, each with red or blue balls; R stands for red and B for blue), with
emission probabilities $b_1(R) = \frac{1}{2}$, $b_2(R) = \frac{1}{3}$, $b_3(R) = \frac{3}{4}$, $b_1(B) = \frac{1}{2}$, $b_2(B) = \frac{2}{3}$, and
$b_3(B) = \frac{1}{4}$, i.e.
$$B = \begin{pmatrix} \frac{1}{2} & \frac{1}{2} \\[2pt] \frac{1}{3} & \frac{2}{3} \\[2pt] \frac{3}{4} & \frac{1}{4} \end{pmatrix} \ (\text{columns: R, B}), \qquad A = \begin{pmatrix} 0.3 & 0.6 & 0.1 \\ 0.5 & 0.2 & 0.3 \\ 0.4 & 0.1 & 0.5 \end{pmatrix} \ (\text{state transition matrix}),$$
and initial state probabilities $\pi_i = \frac{1}{3}$. Suppose
we observe the sequence O = RBR; then we can find the "optimal" state
sequence to explain this sequence of observations by running the Viterbi
algorithm by hand:
$$\delta_t(i) = \max_{Q_1, Q_2, \ldots, Q_{t-1}} P(Q_1, Q_2, \ldots, Q_t = i,\ O_1, O_2, \ldots, O_t \mid \lambda),$$
so that here $\delta_1(j) = \pi_j\, b_j(R)$, $\delta_2(j) = \max_{1 \le i \le 3} \left[\delta_1(i)\, a_{ij}\right] b_j(B)$, and
$\delta_3(j) = \max_{1 \le i \le 3} \left[\delta_2(i)\, a_{ij}\right] b_j(R)$.
In the first step, we initialize the probabilities at t = 1 to $\delta_1(j) = \pi_j\, b_j(R)$ for
each $j = 1, 2, 3$. These are $\frac{1}{6}$, $\frac{1}{9}$, $\frac{1}{4}$, respectively.
In the second step, $t = 2$, we first determine $\delta_2(1)$ by considering the three
quantities $\delta_1(i)\, a_{i1}$ for $i = 1, 2, 3$. They are respectively $\frac{1}{6}(0.3)$, $\frac{1}{9}(0.5)$, and
$\frac{1}{4}(0.4)$. The third one is the largest, so according to the algorithm we set
$$\delta_2(1) = \frac{1}{4}(0.4)\left(\frac{1}{2}\right) = 0.05,$$
and remember that the maximum-probability
path to state j = 1 at time t = 2 came from state j = 3 at time t = 1.
Similarly, to determine $\delta_2(2)$ we consider the three quantities $\delta_1(i)\, a_{i2}$ for
$i = 1, 2, 3$, respectively $\frac{1}{6}(0.6)$, $\frac{1}{9}(0.2)$, $\frac{1}{4}(0.1)$, and the first
is the largest, so we set
$$\delta_2(2) = \frac{1}{6}(0.6)\left(\frac{2}{3}\right) \approx 0.0667.$$
Finally, to determine
$\delta_2(3)$ we consider the three quantities $\delta_1(i)\, a_{i3}$ for $i = 1, 2, 3$, respectively
$\frac{1}{6}(0.1)$, $\frac{1}{9}(0.3)$, $\frac{1}{4}(0.5)$, and the third is the largest, so we set
$$\delta_2(3) = \frac{1}{4}(0.5)\left(\frac{1}{4}\right) = 0.03125.$$
In the third step, t = 3, we first determine $\delta_3(1)$ by considering the three
quantities $\delta_2(i)\, a_{i1}$ for $i = 1, 2, 3$. They are respectively $0.05 \cdot 0.3$, $0.0667 \cdot 0.5$, and
$0.03125 \cdot 0.4$. The second is the largest, so according to the algorithm we set
$$\delta_3(1) = 0.0667 \cdot 0.5 \cdot \left(\frac{1}{2}\right) \approx 0.0167,$$
and remember that the maximum-probability
path to state j = 1 at time t = 3 came from state j = 2 at time t = 2.
Similarly, to determine $\delta_3(2)$ we consider the three quantities $\delta_2(i)\, a_{i2}$ for i =
1, 2, 3, respectively $0.05 \cdot 0.6$, $0.0667 \cdot 0.2$, $0.03125 \cdot 0.1$, and the first is the
largest, so we set $\delta_3(2) = 0.05 \cdot 0.6 \cdot \frac{1}{3} = 0.01$. Finally, to determine $\delta_3(3)$ we
consider the three quantities $\delta_2(i)\, a_{i3}$ for i = 1, 2, 3, respectively $0.05 \cdot 0.1$,
$0.0667 \cdot 0.3$, $0.03125 \cdot 0.5$, and the second is the largest, so we set
$$\delta_3(3) = 0.0667 \cdot 0.3 \cdot \frac{3}{4} = 0.015.$$
Since there are only three observations, we can now use the termination step
to determine that the maximum probability for the observations O = RBR is
$P^* \approx 0.0167$, with state path $Q^* = (1, 2, 1)$.
EXAMPLE: A casino has two dice:
Fair die: $P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \frac{1}{6}$
Loaded die: $P(1) = P(2) = P(3) = P(4) = P(5) = \frac{1}{10}$; $P(6) = \frac{1}{2}$
The casino player switches back and forth between the fair and the loaded die about once every
20 turns.
Game:
1. You bet $1
2. You roll (always with a fair die)
3. The casino player rolls (maybe with the fair die, maybe with the loaded die)
4. Highest number wins $2
A sequence of rolls by the casino player:
124552646214614613613666166466163661636616361651561511514612356234
$P(124552646214614613613666166466163661636616361651561511514612356234) \approx 1.3 \times 10^{-35}$.
What portion of the sequence was generated with the fair die, and what
portion with the loaded die? For example:
124552646214614613613 (FAIR) | 6661664661636616366163616 (LOADED) | 51561511514612356234 (FAIR)
The dishonest casino model: two states, FAIR and LOADED. The chain stays in its
current state with probability 0.95 and switches with probability 0.05:
P(F→F) = P(L→L) = 0.95, P(F→L) = P(L→F) = 0.05.
Emission probabilities:
P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6;
P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2.
Let the sequence of rolls be x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4. Then what is
the likelihood of Q = (Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair)? (Say the
initial probabilities are $\pi_0(\text{Fair}) = \pi_0(\text{Loaded}) = \frac{1}{2}$.)
$$P(x, Q \mid \lambda) = \frac{1}{2}\, P(1 \mid \text{Fair})\, P(\text{Fair} \mid \text{Fair})\, P(2 \mid \text{Fair})\, P(\text{Fair} \mid \text{Fair}) \cdots P(4 \mid \text{Fair}) = \frac{1}{2} \left(\frac{1}{6}\right)^{10} (0.95)^9 \approx 5.2 \times 10^{-9}.$$
So the likelihood that the die is fair throughout this run is about $5.2 \times 10^{-9}$.
What is the likelihood of
Q = (Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded)?
$$P(x, Q \mid \lambda) = \frac{1}{2}\, P(1 \mid \text{Loaded})\, P(\text{Loaded} \mid \text{Loaded}) \cdots P(4 \mid \text{Loaded}) = \frac{1}{2} \left(\frac{1}{10}\right)^{8} \left(\frac{1}{2}\right)^{2} (0.95)^9 \approx 7.9 \times 10^{-10}.$$
Therefore, it is somewhat more likely that all the rolls were made with the fair
die than that they were all made with the loaded die. Now let the sequence
of rolls be x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6. What is the likelihood of Q = (F, F, ..., F)?
$$P(x, Q \mid \lambda) = \frac{1}{2}\, P(1 \mid F)\, P(6 \mid F) \cdots P(6 \mid F) = \frac{1}{2} \left(\frac{1}{6}\right)^{10} (0.95)^9 \approx 5.2 \times 10^{-9},$$
the same as before. What is the likelihood of Q = (L, L, ..., L)?
$$P(x, Q \mid \lambda) = \frac{1}{2}\, P(1 \mid L)\, P(6 \mid L) \cdots P(6 \mid L) = \frac{1}{2} \left(\frac{1}{10}\right)^{4} \left(\frac{1}{2}\right)^{6} (0.95)^9 \approx 4.9 \times 10^{-7}.$$
So here it is about 100 times more likely that the die is loaded.
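A sketch reproducing the four path likelihoods above (all-fair vs. all-loaded for each roll sequence):

import numpy as np

p_fair   = np.full(6, 1/6)                     # P(k | F) = 1/6
p_loaded = np.array([.1, .1, .1, .1, .1, .5])  # P(6 | L) = 1/2, others 1/10

def path_likelihood(rolls, emit, stay=0.95, start=0.5):
    """P(rolls, single-state path): initial prob, emissions, and T-1 self-loops."""
    p = start * stay ** (len(rolls) - 1)
    for r in rolls:
        p *= emit[r - 1]
    return p

x1 = [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]
x2 = [1, 6, 6, 5, 6, 2, 6, 6, 3, 6]
for x in (x1, x2):
    print(path_likelihood(x, p_fair), path_likelihood(x, p_loaded))
# x1: ~5.2e-09 (fair) vs ~7.9e-10 (loaded) -- fair more likely
# x2: ~5.2e-09 (fair) vs ~4.9e-07 (loaded) -- loaded ~100x more likely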
EXAMPLE:
Suppose we want to determine the average annual temperature at a particular
location on earth over a series of years. To make it interesting, suppose the
years we are concerned with lie in the distant past, before thermometers
were invented. Since we can't go back in time, we instead look for indirect
evidence of the temperature.
To simplify the problem, we only consider two annual temperatures, "hot"
and "cold".
Suppose that modern evidence indicates that the probability of a hot year
followed by another hot year is 0.7 and the probability that a cold year is
followed by another cold year is 0.6. We'll assume that these probabilities
held in the distant past as well. The information so far can be summarized as
$$A = \begin{pmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix}$$
with rows and columns ordered H, C, where H is "hot" and C is "cold".
Also suppose that current research indicates a correlation between the size of
tree growth rings and temperature. For simplicity, we only consider three
different tree ring sizes, small, medium and large, or S, M and L,
respectively. Finally, suppose that based on available evidence, the
probabilistic relationship between annual temperature and tree ring sizes is
given by
$$B = \begin{pmatrix} 0.1 & 0.4 & 0.5 \\ 0.7 & 0.2 & 0.1 \end{pmatrix}$$
with rows H, C and columns S, M, L.
For this system, the state is the average annual temperature, either H or C.
The transition from one state to the next is a Markov process (of order one),
since the next state depends only on the current state and the fixed
probabilities in A. However, the actual states are "hidden", since we can't
directly observe the temperature in the past.
Although we can't observe the state (temperature) in the past, we can
observe the size of tree rings. From B, tree rings provide us with probabilistic
information regarding the temperature. Since the states are hidden, this type
of system is known as a Hidden Markov Model (HMM). Our goal is to make
effective and efficient use of the observable information so as to gain insight
into various aspects of the Markov process.
0.7
The state transition matrix A  
0.4
0.1
B
0.7
0.3 
and the observation matrix
0.6
0.5
. In this example, suppose that the initial state
0.2 0.1
0.4
distribution, denoted by  is   0.6 0.4 .The matrices  , A and B are row
stochastic, meaning that each element is a probability and the elements of
each row sum to 1, that is, each row is a probability distribution.
Now consider a particular four-year period of interest from the distant past,
for which we observe the series of tree rings S, M, S, L. Letting 0 represent S, 1
represent M and 2 represent L, this observation sequence is $O = (0, 1, 0, 2)$.
We might want to determine the most likely state sequence of the Markov
process given the observations O. That is, we might want to know the most likely
average annual temperatures over the four-year period of interest. Let us define "most
likely" as the state sequence that maximizes the expected number of correct
states. HMMs can be used to find this sequence.
With the observation sequence given above, $O = (0, 1, 0, 2)$, we have $T = 4$,
$N = 2$, $M = 3$, $Q = \{H, C\}$, $V = \{0, 1, 2\}$ (where we let 0, 1, 2 represent "small",
"medium" and "large" tree rings, respectively).
The HMM is denoted by $\lambda = (A, B, \pi)$.
Consider a generic state sequence of length four, $X = (x_0, x_1, x_2, x_3)$,
with corresponding observations $O = (O_0, O_1, O_2, O_3)$. Then $\pi_{x_0}$ is the
probability of starting in state $x_0$, $b_{x_0}(O_0)$ is the probability of initially
observing $O_0$, and $a_{x_0, x_1}$ is the probability of transiting from state $x_0$ to
state $x_1$. Continuing, we see that the probability of the state sequence X is
given by
$$P(X) = \pi_{x_0} b_{x_0}(O_0)\, a_{x_0, x_1} b_{x_1}(O_1)\, a_{x_1, x_2} b_{x_2}(O_2)\, a_{x_2, x_3} b_{x_3}(O_3).$$
With the observation sequence $O = (0, 1, 0, 2)$, we can compute, say,
$$P(HHCC) = 0.6\,(0.1)(0.7)(0.4)(0.3)(0.7)(0.6)(0.1) = 0.000212.$$
Similarly, we can
directly compute the probability of each possible state sequence of length
four, assuming the given observation sequence.
To find the optimal sequence in the HMM sense, we choose the most
probable symbol at each position. To this end we sum the probabilities, in the
list of state sequence probabilities, of those that have an H in the first position.
Doing so, we find the (normalized) probability of H in the first position is
0.18817 and hence the probability of C in the first position is 0.81183. The
HMM therefore chooses the first element of the optimal sequence to be C.
We repeat this for each element of the sequence, obtaining the probabilities
of H and C at each of the four positions, and from these we find that the
optimal sequence, in the HMM sense, is CHCH.
Example:
$\lambda_1 = (\{1, 2, 3\}, \{a, b, c\}, A, B, \pi)$ with $B(i, j) = P(O_t = j \mid Q_t = i)$:
$B(1, a) = 0.6$, $B(1, b) = 0.2$, $B(1, c) = 0.2$, $B(2, a) = 0$, $B(2, b) = 0.5$, $B(2, c) = 0.5$,
$B(3, a) = 0.3$, $B(3, b) = 0$, $B(3, c) = 0.7$, i.e.
$$B = \begin{pmatrix} 0.6 & 0.2 & 0.2 \\ 0 & 0.5 & 0.5 \\ 0.3 & 0 & 0.7 \end{pmatrix};$$
$a_{11} = 0.3$, $a_{12} = 0.2$, $a_{13} = 0.5$, $a_{21} = 0.6$, $a_{22} = 0.1$, $a_{23} = 0.3$, $a_{31} = 0.2$, $a_{32} = 0.4$,
$a_{33} = 0.4$, i.e.
$$A = \begin{pmatrix} 0.3 & 0.2 & 0.5 \\ 0.6 & 0.1 & 0.3 \\ 0.2 & 0.4 & 0.4 \end{pmatrix};$$
$\pi_1 = 1$, $\pi_2 = 0$, $\pi_3 = 0$.
Example:
0.4
State-transition probabilities, A  aij  0.2
0.1
 
0.3
0.6
0.1
0.3 
0.2 
0.8
Given today is sunny (i.e., q1  3 ), what is the probability of “sun-sun-raincloud-cloud-sun” with model  .
PQ    PQ  3,3,1,2,2,3   Pq1  3Pq2  3 q1  3Pq3  1 q2  3
Pq4  2 q3  1Pq5  2 q4  2Pq6  3 q5  2   3a33a31a12a22a23  1 0.80.10.30.60.2  0.00288
where the initial state probability for state i is  i  Pq1  i 
Probability of state i producing an observation Ot is: bi Ot   POt qt  i 
which can be discrete or continuous in o.
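A short Python sketch of this chain-rule computation, assuming the state labelling 1 = rain, 2 = cloud, 3 = sun implied by the sequence above:

import numpy as np

A = np.array([[0.4, 0.3, 0.3],   # 1 = rain, 2 = cloud, 3 = sun (assumed labels)
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

Q = [3, 3, 1, 2, 2, 3]           # sun-sun-rain-cloud-cloud-sun
p = 1.0                          # pi_3 = 1, since q1 = 3 is given
for i, j in zip(Q, Q[1:]):
    p *= A[i - 1, j - 1]         # a_ij with 1-based state labels
print(p)                         # 0.00288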
EXAMPLE
Let's consider the following simple HMM. This model is composed of 2 states, H (high GC content) and L (low GC content). We can, for example, consider that state H characterizes coding DNA while L characterizes non-coding DNA.
The model can then be used to predict the regions of coding DNA in a given sequence. Consider the sequence S = GGCACTGAA. There are several paths through the hidden states (H and L) that lead to the given sequence.
Example: P = LLHHHHLLL
The probability that the HMM produces sequence S through the path P is
p = P_L(0) · P_L(G) · P_LL · P_L(G) · P_LH · P_H(C) · ...
  = 0.5 · 0.2 · 0.6 · 0.2 · 0.4 · 0.3 · ...
There are several such paths through the hidden states that lead to the given sequence S = GGCACTGAA, but they do not have the same probability.
The Viterbi algorithm is a dynamic programming algorithm that allows us to compute the most probable path. Its principle is similar to the DP programs used to align two sequences (e.g. Needleman-Wunsch).
The probability of the most probable path ending in state k with observation "i" at position x is
p_k(i, x) = e_k(i) · max_l ( p_l(x - 1) · p_lk ),
where e_k(i) is the probability of emitting "i" in state k and p_lk is the transition probability from state l to state k.
In our example, the probability of the most probable path ending in state H with observation "A" at the 4th position is:
p_H(A, 4) = e_H(A) · max ( p_H(C, 3) · p_HH , p_L(C, 3) · p_LH )
We can thus compute recursively (from the first to the last element of our sequence) the probability of the most probable path. For the calculations, it is convenient to use the log of the probabilities (rather than the probabilities themselves). Indeed, this allows us to compute sums instead of products, which is more efficient and accurate.
We use here log2(p).
Probability (in log2) that G at the first position was emitted by state H:
p_H(G, 1) = -1 - 1.737 = -2.737
Probability (in log2) that G at the first position was emitted by state L:
p_L(G, 1) = -1 - 2.322 = -3.322
Probability (in log2) that G at the 2nd position was emitted by state H:
p_H(G, 2) = -1.737 + max ( p_H(G, 1) + p_HH , p_L(G, 1) + p_LH )
          = -1.737 + max ( -2.737 - 1 , -3.322 - 1.322 )
          = -5.474 (obtained from p_H(G, 1))
Probability (in log2) that G at the 2nd position was emitted by state L:
p_L(G, 2) = -2.322 + max ( p_H(G, 1) + p_HL , p_L(G, 1) + p_LL )
          = -2.322 + max ( -2.737 - 1 , -3.322 - 0.737 )
          = -6.059 (obtained from p_H(G, 1))
We then compute iteratively the probabilities p_H(i, x) and p_L(i, x) that nucleotide i at position x was emitted by state H or L, respectively. The highest probability obtained for the nucleotide at the last position is the probability of the most probable path. This path can be retrieved by back-tracking, i.e. finding the path which corresponds to the highest probability, -24.49.
The most probable path is HHHLLLLLL. Its probability is 2^(-24.49) ≈ 4.25 × 10^(-8) (remember we used log2(p)).
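The following Python sketch implements this log-space Viterbi recursion. The numeric model values (initial probabilities 0.5/0.5; emissions G, C = 0.3 and A, T = 0.2 in state H, G, C = 0.2 and A, T = 0.3 in state L; transitions H→H = 0.5, H→L = 0.5, L→H = 0.4, L→L = 0.6) are inferred from the log2 numbers above, since the original model figure is not reproduced here; it should recover HHHLLLLLL with log2-probability about -24.49.

import numpy as np

log2 = np.log2
states = ["H", "L"]
init = {"H": log2(0.5), "L": log2(0.5)}
emit = {"H": {"A": log2(0.2), "C": log2(0.3), "G": log2(0.3), "T": log2(0.2)},
        "L": {"A": log2(0.3), "C": log2(0.2), "G": log2(0.2), "T": log2(0.3)}}
trans = {("H", "H"): log2(0.5), ("H", "L"): log2(0.5),
         ("L", "H"): log2(0.4), ("L", "L"): log2(0.6)}

S = "GGCACTGAA"
V = [{k: init[k] + emit[k][S[0]] for k in states}]   # first position
back = []                                            # back-pointers
for x in S[1:]:
    prev, col, ptr = V[-1], {}, {}
    for k in states:
        best = max(states, key=lambda l: prev[l] + trans[(l, k)])
        col[k] = emit[k][x] + prev[best] + trans[(best, k)]
        ptr[k] = best
    V.append(col)
    back.append(ptr)

last = max(states, key=lambda k: V[-1][k])           # best final state
path = [last]
for ptr in reversed(back):                           # back-track
    path.append(ptr[path[-1]])
path.reverse()
print("".join(path), V[-1][last])                    # HHHLLLLLL, about -24.49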
HMM AND SEQUENCE ALIGNMENTS
Sequence alignment is a way of writing one sequence on top of another
where the residues in one position are supposed to have a common
evolutionary origin. If the same letter occurs in both sequences then this
position has been conserved in evolution. If the letters differ it is assumed
that the two derive from an ancestral letter. Similar sequences may have different lengths, which is generally explained through insertions or deletions in sequences. Thus, a letter or a stretch of letters may be paired up with dashes in the other sequence to signify such an insertion or deletion. Since an insertion in one sequence can always be seen as a deletion in the other, one frequently uses the term "indel" to represent this.
One frequently used method for protein classification is a hidden Markov
model. In biological sequence analysis, hidden Markov models are built
based on a multiple alignment. Example:
The alignment of a sequence to a profile HMM. The squares indicate a match state, the diamonds an insert state, and the circles a delete state. The path through the HMM is shown in bold arrows.
Example: A two-state HMM modelling a DNA sequence, with the first state generating AT-rich sequences and the second generating CG-rich sequences. State transitions and their associated probabilities are indicated by arrows, and the emission probabilities for A, C, G, T in each state are indicated below the states. This model generates a state sequence as a Markov chain (middle), and each state generates a symbol according to its own emission probability distribution (bottom). The probability of the sequence is the product of the state-transition and symbol-emission probabilities. For a given observed DNA sequence, the hidden state sequence that generated it, i.e. whether each position is in a CG-rich or an AT-rich segment, can then be inferred.
An HMM can be visualised as a finite state machine. Finite state machines
move through a series of states and produce some kind of
output, either when the machine has reached a particular state
or when it is moving from state to state. The HMM generates
a protein sequence by emitting amino acids as it progresses
through a series of states. Each state has a table of amino
acid emission probabilities, and transition probabilities for
moving from state to state. Transition probabilities define a
distribution over the possible next states.
In general, the multiple alignments are generated from a training set
consisting of positive examples of protein sequences that belong to a certain
functional family sharing a level of sequence similarity.
Example:
A gap is represented by a '–'. Columns 1-3 and 6-10 are "match" columns, while columns 4 and 5 are "insert" columns.
Let us start with a multiple sequence alignment to see the structure of the sequences. Our HMM will then be a probabilistic representation of that multiple alignment.
In the alignment we see that some columns are complete, some are almost complete, and others contain few residues. The most common matches can be used as the match columns of our model, and the deletions and insertions can be modelled as other states. Here is an example of a few aligned sequences, with the core columns of the alignment marked with an *:
AC----ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
***   ***
Given a multiple alignment of protein sequences, "match", "insert", and "delete" states are first identified. If a column of the multiple alignment has less than or equal to fifty percent gaps (i.e., half or more of the sequences emit an amino acid), then it is classified as a "match column" (columns 1-3 and 6-10 in the figure above). A non-gap entry in a match column is a "match state" in the HMM, while a gap in a match column is a "delete state". Delete states are presumed to be modifications that stem from an amino acid sequence losing one or more amino acids in an evolutionary event. The last type of state is the "insert" state. "Insert columns" (columns 4 and 5 in the figure above) are similar to delete states, except that the evolutionary modification to the amino acid sequence is that of gaining amino acids. A non-gap in an insert column is an "insert state", while a gap in an insert column is ignored since it does not represent an event of evolutionary significance.
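A minimal Python sketch of this fifty-percent rule, applied to the small DNA alignment above. Note that column 4 is borderline: it has 2 gaps out of 5 sequences, so the rule as stated classifies it as a match column, whereas the *** marking above treats only columns 1-3 and 7-9 as core; practical tools differ in the exact threshold.

alignment = ["AC----ATG",
             "TCAACTATC",
             "ACAC--AGC",
             "AGA---ATC",
             "ACCG--ATC"]

n_seqs = len(alignment)
for col in range(len(alignment[0])):
    gaps = sum(seq[col] == "-" for seq in alignment)   # gaps in this column
    kind = "match" if gaps <= n_seqs / 2 else "insert"
    print(f"column {col + 1}: {gaps} gap(s) -> {kind}")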
Example:
A hidden Markov model (courtesy [16]) with delete (circle), insert
(diamond),
and match (square) states. Transitions are allowed along each arrow. Delete
and match
states can only be visited once for each position along a path. Delete states
do not emit
any symbols. Insert states are allowed to insert multiple symbols. The
alignment at the
bottom is used to build the model in this example. The sequences begin in
the start state.
Amino acids a1 and a2 are inserted at the beginning of the sequence. A3 and
B1 are the
first matched symbols, followed by a deletion, where B2 is matched with a
gap. A4 is
then matched with B3, b4 is inserted, A5 is matched with B5, and finally the
end state is
reached.
With this sample alignment we have examples of insertions and deletions between the core columns, each of which needs a state in the HMM to represent it. Note that insertions may occur an arbitrary number of times between the match states, while a deletion always replaces a match.
One possible HMM template for building the model is presented in the
following picture:
Each match state (M_j) corresponds to the match on the jth core column. The same applies to each deletion state (D_j). The insertion states are slightly different, because they represent the insertions between the core columns; that is why there is one extra insert state, which makes it possible to represent states before the first and after the last core column.
Our model will be like the one in the picture, with the same length as the core alignment of a multiple alignment for a given set of sequences. We use maximum likelihood to estimate the transition and emission probabilities for each state. The easiest way to obtain the total emission/transition counts is to thread each sequence to be profiled through the model. For each symbol in the aligned sequence, check whether the symbol is in a core column. If it is, increment the transition count into the next match state; if not, go to the insertion state; and if it is a deletion, go to the deletion state and increment that transition. Finally, to calculate the probability of each transition, divide its count by the total count of all transitions leaving the same state. It is important to notice that we have a stopping state, a special state that has no outgoing transitions.
Note that it is important to initialize the model with pseudocounts for each possible transition. Adding pseudocounts makes our model less rigid and avoids overfitting to the training set; otherwise we could end up with zero probabilities for some sequences.
To create the emission probabilities, at each match state you also count which symbols were emitted and increment the corresponding counts. To calculate each probability, divide the count by the total number of symbols matched at that state in the threading process.
A similar process could be used for the insertions. However, the insertion states are characterized by a low occurrence rate, and this may lead directly to an overfitting problem, because the number of observed emissions can be very small. To avoid this we should use the background distribution as the emission probabilities of each insertion state. The background distribution is the probability of occurrence of a given amino acid over the entire protein set. To calculate it, count each amino acid type over all the training sequences and then divide by the total count.
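A hedged sketch of the counting just described. The state names and the paths variable are hypothetical: paths stands for the list of state paths (e.g. ["M1", "M2", "I2", "M3", "END"]) obtained by threading each training sequence through the model.

from collections import Counter, defaultdict

def transition_probs(paths, allowed, pseudocount=1.0):
    """allowed maps each state to the set of states it may transition to."""
    counts = defaultdict(Counter)
    for path in paths:
        for s, t in zip(path, path[1:]):       # count observed transitions
            counts[s][t] += 1
    probs = {}
    for s, nxts in allowed.items():            # pseudocount on every allowed edge
        total = sum(counts[s][t] for t in nxts) + pseudocount * len(nxts)
        probs[s] = {t: (counts[s][t] + pseudocount) / total for t in nxts}
    return probs

def background(sequences):
    """Frequency of each amino acid over the whole training set."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}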
For the deletion states, it is important to notice that they are silent states: they do not emit any symbol at all. To signal this in ghmm, simply set all their emission probabilities to 0. Note that the end state is also a silent state, since no emission is associated with it.
ALERT: the loglikelihood is the only function in the ghmm library which
handles silent states correctly.
At this point, the model is ready for use. However, we still have the problem of how to classify a new protein sequence. A threshold could be used to divide the two classes, but the most common alternative is to compare against a null model. The null model is a model which aims to represent any protein with similar probability to any other. With these two models we can take a sequence and ask whether it is more similar to a general definition of a protein or to the specific family. The null model should model sequences whose average size equals that of the aligned sequences being handled, and should be able to emit any kind of symbol at each position. A common way of creating the null model is to use a single insertion state, which goes to a stopping state with probability 1 divided by the average length of the sequences in the training set. For the emission probabilities, we should use the background distribution, because it reflects the general amino acid distribution. In the end, the model should look like this:
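A sketch of scoring under such a null model, assuming the single-insert-state form described above; the exact transition bookkeeping (how many self-loops are charged) depends on the chosen topology.

import math

def null_log_likelihood(seq, bg, avg_len):
    """bg: background amino acid distribution; avg_len: average training length."""
    p_stop = 1.0 / avg_len
    score = len(seq) * math.log(1.0 - p_stop) + math.log(p_stop)  # self-loops + stop
    score += sum(math.log(bg[aa]) for aa in seq)                  # emissions
    return score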
For testing the proposed model, I used a set of 100 globin proteins from the NCBI protein repository as a training set to build a profile model, and used ghmm to build and test the model.
To test whether the model meets our expectations, 100 globins different from the ones in the training set were used, together with 1800 other random proteins of similar length. The loglikelihood function from the ghmm library was used to calculate the similarity index. The classification of globins versus non-globins was a comparison between the likelihood of each protein under the globin profile HMM and under the null model. This test gave us 100% accuracy! To display this result, each sequence was plotted in a graph where the horizontal axis displays the length of the sequence and the vertical axis the log of the ratio of the globin and null model likelihoods (or the globin minus null model loglikelihood). The globins are plotted in red and the others in blue. Proteins above the zero line are classified as globins, and those below are not. The graphs show a clear distinction between the classes, and that our model is very precise for this problem.
Example: These structural similarities make it possible to create a statistical
model of a protein family. The model shown below is a simplified statistical
profile, a model which shows the amino acid probability distribution for
each position in the family. According to this profile, the probability of C in
position 1 is 0.8, the probability of G in position 2 is 0.4, and so forth. The
probabilities are calculated from the observed frequencies of amino acids in
the family.
Given a profile, the probability of a sequence is the product of the amino acid probabilities given by the profile. For example, the probability of CGGSV, given the profile above, is 0.8 × 0.4 × 0.8 × 0.6 × 0.2 ≈ .031.
Given a statistical model, the probability of a sequence is used to calculate a
score for the sequence. Because multiplication of fractions is
computationally expensive and prone to floating point errors such as
underflow, a convenient transformation into the logarithmic world is used.
The score of a sequence is calculated by taking the logs of all amino acid
probabilities and adding them up. Using this method with base e logarithms,
the score of CGGSV is
log_e(0.8) + log_e(0.4) + log_e(0.8) + log_e(0.6) + log_e(0.2) ≈ -3.48
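Both numbers are quick to check in Python:

import math

p = [0.8, 0.4, 0.8, 0.6, 0.2]          # profile probabilities of C, G, G, S, V
print(math.prod(p))                     # 0.03072, i.e. about .031
print(sum(math.log(x) for x in p))      # about -3.48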
In practice, profile models take other factors into account. For example,
members of a protein family have varying lengths, so a score penalty is
charged for insertions and deletions. The scores of individual amino acids in
a profile are also position specific. In other words, more weight must be
given to an unlikely amino acid which appears in a structurally important
position in the protein than to one which appears in a structurally
unimportant position.
Although these refinements are necessary to create good profile models, they
introduce many additional free parameters which must be calculated when
building a profile, and unfortunately, the calculations must be done by trial
and error. These shortcomings set the stage for a new kind of profile, based
on the Hidden Markov model.
Like an ordinary profile, an HMM is built by analyzing the distribution of
amino acids in a training set of related proteins. Finite state machines
typically move through a series of states and produce some kind of output
either when the machine has reached a particular state or when it is moving
from state to state. The HMM generates a protein sequence by emitting
amino acids as it progresses through a series of states. Each state has a table
of amino acid emission probabilities similar to those described in a profile
model. There are also transition probabilities for moving from state to state.
A possible hidden Markov model for the protein ACCY. The protein is
represented as a sequence of probabilities. The numbers in the boxes show
the probability that an amino acid occurs in a particular state, and the
numbers next to the directed arcs show probabilities which connect the
states. The probability of ACCY is shown as a highlighted path through the
model.
The figure above shows one topology for a hidden Markov model. Although
other topologies are used, the one shown is very popular in protein sequence
analysis. Note that there are three kinds of states represented by three
different shapes. The squares are called match states, and the amino acids
emitted from them form the conserved primary structure of a protein. These
amino acids are the same as those in the common ancestor or, if not, are the
result of substitutions. The diamond shapes are insert states and emit amino
acids which result from insertions. The circles are special, silent states
known as delete states and model deletions.
Transitions from state to state progress from left to right through the model,
with the exception of the self-loops on the diamond insertion states. The
self-loops allow insertions of any length to fit the model, regardless of the
length of other sequences in the family.
Scoring a Sequence with an HMM
Any sequence can be represented by a path through the model. The
probability of any sequence, given the model, is computed by multiplying
the emission and transition probabilities along the path.
In the figure above, a path through the model represented by ACCY is
highlighted. In the interest of saving space, the full tables of emission
probabilities are not shown. Only the probability of the emitted amino acid is
given. For example, the probability of A being emitted in position 1 is 0.3,
and the probability of C being emitted in position 2 is 0.6. The probability of
ACCY along this path is .4 × .3 × .46 × .6 × .97 × .5 × .015 × .73 × .01 × 1 ≈ 1.76 × 10^(-6).
As in the profile case described above, the calculation is simplified by
transforming probabilities to logs so that addition can replace multiplication.
The resulting number is the raw score of a sequence, given the HMM.
For example, the score of ACCY along the path shown in figure above is
log_e(.4) + log_e(.3) + log_e(.46) + log_e(.6) + log_e(.97) + log_e(.5) + log_e(.015) + log_e(.73) + log_e(.01) + log_e(1) ≈ -13.25
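Again, both the raw probability and the log score can be checked directly:

import math

path = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]
print(math.prod(path))                      # about 1.76e-06
print(sum(math.log(p) for p in path))       # about -13.25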
The calculation is easy if the exact state path is known, as in the toy example
of figure above . In a real model, many different state paths through a model
can generate the same sequence. Therefore, the correct probability of a
sequence is the sum of probabilities over all of the possible state paths.
Unfortunately, a brute-force calculation of this problem is computationally
infeasible, except in the case of very short sequences. Two good alternatives
are to calculate the sum over all paths inductively using the forward
algorithm, or to calculate the most probable path through the model using
the Viterbi algorithm. Both algorithms are described below.
Figure 8. HMM with multiple paths through the model for ACCY. The
highlighted path is only one of several possibilities.
Consider the HMM shown in Figure 8. The Insert, Match, and Delete states
are labeled with their position number in the model, M1, D1 etc. (States I1
and I2 are unlabelled to reduce clutter.) Because the number of insertion
states is greater than the number of match or delete states, there is an extra
insertion state at the beginning of the model, labeled I0. Unlike the HMM in
Figure 7, where the state path for ACCY was known, several state paths
through the model are possible for this sequence.
The most likely path through the model is computed with the Viterbi
algorithm. The algorithm employs a matrix, shown in Figure 9. The columns
of the matrix are indexed by the states in the model, and the rows are
indexed by the sequence. Deletion states are not shown, since, by definition,
they have a zero probability of emitting an amino acid. The elements of the
matrix are initialized to zero and then computed with these steps:
1. The probability that the amino acid A was generated by state
I0 is computed and entered as the first element of the matrix.
2. The probabilities that C is emitted in state M1 (multiplied by
the probability of the most likely transition to state M1 from
state I0) and in state I1 (multiplied by the most likely transition
to state I1 from state I0) are entered into the matrix element
indexed by C and I1/M1.
3. The maximum probability, max(I1, M1), is calculated.
4. A pointer is set from the winner back to state I0.
5. Steps 2-4 are repeated until the matrix is filled.
Prob(A in state I0) = 0.4 × 0.3 = 0.12
Prob(C in state I1) = 0.05 × 0.6 × 0.5 = 0.015
Prob(C in state M1) = 0.46 × 0.01 ≈ 0.005
Prob(C in state M2) = 0.46 × 0.5 = 0.23
Prob(Y in state I3) = 0.015 × 0.73 × 0.01 ≈ 0.0001
Prob(Y in state M3) = 0.97 × 0.23 ≈ 0.22
The most likely path through the model can now be found by following the
back-pointers.
Figure 9. Matrix for the Viterbi algorithm
Once the most probable path through the model is known, the probability of
a sequence given the model can be computed by multiplying all probabilities
along the path.
The forward algorithm is similar to Viterbi. However in step 3, a sum rather
than a maximum is computed, and no back pointers are necessary. The
probability of the sequence is found by summing the probabilities in the last
column. The resulting matrix is shown in Figure 10.
Prob(A in state I0) = 0.4 × 0.3 = 0.12
Prob(C in state I1) = 0.05 × 0.6 × 0.5 = 0.015
Prob(C in state M1) = 0.46 × 0.01 ≈ 0.005
Prob(C in state M2) = (0.005 × 0.97) + (0.015 × 0.46) ≈ 0.012
Prob(Y in state I3) = 0.012 × 0.015 × 0.73 × 0.01 ≈ 1.31 × 10^(-6)
Prob(Y in state M3) = 0.012 × 0.97 × 0.2 ≈ 0.002
Figure 10. Matrix for the Forward algorithm
What the Score Means
Once the probability of a sequence has been determined, its score can be
computed. Because the model is a generalization of how amino acids are
distributed in a related group (or class) of sequences, a score measures the
probability that a sequence belongs to the class. A high score implies that the
sequence of interest is probably a member of the class, and a low score
implies it is probably not a member.
Local and Global Scoring
In the examples above, global scoring was used. This means simply that
computation of the score begins at the first amino acid in the sequence and
ends at the last one. Even though this may seem like the most natural way to
compute a score, the results are often misleading. Because of the
evolutionary variety found in related protein sequences, a family member
may be composed of both highly conserved areas which score well against
the model and divergent areas which score poorly. If both kinds of areas are
given equal importance, the overall score of a family member may be poor.
The solution to this problem is to use local scoring, where the score of a
sequence is set to the score of its highest scoring subsequence. The principle
can be illustrated with a very simple example. Consider again the sequence
ACCY shown in Figure 5. Converting probabilities to the log world, the
global score for the sequence is the sum of all four scores: -13.25. The
computation is shown in Figure 11.
Figure 11. The log score of ACCY
Clearly, the score has been significantly lowered by A and Y. The score is
low enough that ACCY may not appear to be a member of the family being
modeled.
Figure 12. The family of sequence ACCY
Let's assume ACCY is a member of the family shown in Figure 12. In this
case, the global score proves to be a poor measure of family membership.
However, if local scoring is used to evaluate the sequence, the final score is
much higher. The highest scoring subsequence is found to be CC, with a
score of -2.01. Unlike the global score, the local score is high enough to
classify ACCY in this family. In situations like this, classifications based on
local scoring are more accurate than those based on global scoring.
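A small sketch reproducing both numbers, assuming the per-position scores decompose into the emission and transition terms of the path product shown earlier:

import math

pos_scores = {
    "A":  math.log(0.4) + math.log(0.3),                      # about -2.12
    "C1": math.log(0.46) + math.log(0.6),                     # about -1.29
    "C2": math.log(0.97) + math.log(0.5),                     # about -0.72
    "Y":  math.log(0.015) + math.log(0.73) + math.log(0.01),  # about -9.12
}
print(sum(pos_scores.values()))              # global score, about -13.25
print(pos_scores["C1"] + pos_scores["C2"])   # local score of CC, about -2.01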
1.1.
An example of an HMM for Protein Sequences
This is the possible hidden Markov model for the protein ACCY discussed above, with emission probabilities in the boxes, transition probabilities on the directed arcs, and the probability of ACCY shown as a highlighted path through the model. These types of HMMs are called Protein Profile-HMMs and will be covered in more depth in the later sections.
1.2.
Three Problems Of Hidden Markov Models
1) Scoring Problem
We want to find the probability of an observed sequence given an HMM. One method of calculating this probability would be to find the probability of each possible hidden state sequence and sum these probabilities, but the number of such sequences grows exponentially with the length of the observation. We therefore use the Forward Algorithm.
Consider the HMM shown above. In this figure several paths exist for the protein
sequence ACCY.
The Forward algorithm employs a matrix, shown below. The columns of the matrix are
indexed by the states in the model, and the rows are indexed by the sequence. The
elements of the matrix are initialized to zero and then computed with these steps:
1. The probability that the amino acid A was generated by state I0 is computed and
entered as the first element of the matrix. This is .4*.3 = .12
2. The probabilities that C is emitted in state M1 (multiplied by the probability of the
most likely transition to state M1 from state I0) and in state I1 (multiplied by the most
likely transition to state I1 from state I0) are entered into the matrix element indexed
by C and I1/M1.
3. The sum of the two probabilities, sum(I1, M1), is calculated; unlike the Viterbi
algorithm, no back-pointer is needed.
4. Steps 2-3 are repeated until the matrix is filled.
The probability of the sequence is found by summing the probabilities in the last column.
Matrix for the Forward algorithm
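A generic sketch of the forward recursion (here indexed [position, state] rather than drawn as the matrix figure), reusing the tree-ring model from the beginning of this section as a concrete test case:

import numpy as np

def forward(pi, A, B, O):
    alpha = np.zeros((len(O), len(pi)))
    alpha[0] = pi * B[:, O[0]]                      # initialization
    for t in range(1, len(O)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]  # sum over predecessors
    return alpha[-1].sum()                          # P(O | model)

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]])
print(forward(pi, A, B, [0, 1, 0, 2]))              # equals the sum over all 16 paths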
2) Alignment Problem
We often wish to take a particular HMM, and determine from an observation sequence
the most likely sequence of underlying hidden states that might have generated it. This is
the alignment problem and the Viterbi Algorithm is used to solve this problem.
The Viterbi algorithm is similar to the forward algorithm. However, in step 3 a maximum
rather than a sum is calculated, and a back-pointer is set from the winner to the previous state. The most likely path through the model can now be found
by following the back-pointers.
Matrix for the Viterbi algorithm
Once the most probable path through the model is known, the probability of a sequence
given the model can be computed by multiplying all probabilities along the path.
3) Training Problem
Another tricky problem is how to create an HMM in the first place, given a particular set
of related training sequences. It is necessary to estimate the amino acid emission
distributions in each state and all state-to-state transition probabilities from a set of
related training sequences. This is done by using the Baum-Welch Algorithm, also
known as the Forward-Backward Algorithm.
The algorithm proceeds by making an initial guess of the parameters (which may well be
entirely wrong) and then refining it by assessing its worth, and attempting to reduce the
errors it provokes when fitted to the given data. In this sense, it is performing a form of
hill-climbing on the likelihood of the data, an expectation-maximization procedure that improves the model at each iteration.
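A minimal NumPy sketch of one Baum-Welch re-estimation pass for a discrete-emission HMM; in practice this step is repeated from an initial guess until the likelihood of the data stops improving.

import numpy as np

def baum_welch_step(pi, A, B, O):
    """One re-estimation pass; O is a list of observation symbol indices."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                      # forward pass
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    beta[-1] = 1.0                                  # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()                         # P(O | current model)
    gamma = alpha * beta / p_obs                    # P(state i at time t | O)
    xi = np.zeros((T - 1, N, N))                    # P(i at t and j at t+1 | O)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, O[t + 1]] * beta[t + 1])[None, :] / p_obs
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    obs = np.array(O)
    for k in range(B.shape[1]):                     # re-estimate emissions per symbol
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B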