Class 5:
Hidden Markov Models
Sequence Models
- So far we have examined several probabilistic sequence models.
- These models, however, assumed that positions are independent.
- This means that the order of elements in the sequence did not play a role.
- In this class we learn about probabilistic models of sequences.
Probability of Sequences
- Fix an alphabet Σ.
- Let X1,…,Xn be a sequence of random variables over Σ.
- We want to model P(X1,…,Xn).
Markov Chains
Assumption:
- Xi+1 is independent of the past once we know Xi.
This allows us to write:
  P(X1,…,Xn) = P(X1) ∏_i P(Xi+1 | X1,…,Xi) = P(X1) ∏_i P(Xi+1 | Xi)
Markov Chains (cont)
Assumption:
- P(Xi+1 | Xi) is the same for all i.
Notation: P(Xi+1=b | Xi=a) = Aab
- By specifying the matrix A and the initial probabilities, we define P(X1,…,Xn).
- To avoid the special case of P(X1), we can use a special start state s, and denote P(X1=a) = Asa.
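As a concrete illustration, here is a minimal sketch (not from the slides) of scoring a sequence under such a Markov chain; the DNA alphabet, the uniform toy parameters, and the name markov_log_prob are assumptions.

```python
import numpy as np

# Hypothetical DNA alphabet encoding; any finite alphabet works the same way.
ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}

def markov_log_prob(seq, start, A):
    """log P(X1,…,Xn) = log P(X1) + sum_i log A[xi, xi+1].

    start[a] = P(X1 = a)  (the start-state row Asa from the slide)
    A[a, b]  = P(Xi+1 = b | Xi = a)
    """
    x = [ALPHABET[c] for c in seq]
    logp = np.log(start[x[0]])
    for a, b in zip(x[:-1], x[1:]):
        logp += np.log(A[a, b])
    return logp

# Toy, made-up parameters: uniform start and uniform transitions.
start = np.full(4, 0.25)
A = np.full((4, 4), 0.25)
print(markov_log_prob("ACGTGC", start, A))
```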
Example: CpG islands
- In the human genome, CpG dinucleotides are relatively rare.
- CpG pairs undergo a process called methylation that modifies the C nucleotide.
- A methylated C can (with relatively high chance) mutate to a T.
- Promoter regions are CpG rich.
  - These regions are not methylated, and thus mutate less often.
  - These are called CpG islands.
CpG Islands
- We construct Markov chains for CpG-rich and CpG-poor regions.
- Using maximum likelihood estimates from 60K nucleotides, we get two models.
Ratio Test for CpG islands
- Given a sequence X1,…,Xn we compute the likelihood ratio
    S(X1,…,Xn) = log [ P(X1,…,Xn | +) / P(X1,…,Xn | −) ]
               = Σ_i log ( A⁺_{Xi,Xi+1} / A⁻_{Xi,Xi+1} )
               = Σ_i β_{Xi,Xi+1}
  where "+" is the model estimated from CpG-rich regions and "−" the model from CpG-poor regions.
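A hedged sketch of this ratio test follows; the transition matrices below are placeholders rather than the 60K-nucleotide estimates from the slides, and log_odds is an illustrative name.

```python
import numpy as np

ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}

def log_odds(seq, A_plus, A_minus):
    """Sum of per-transition log ratios beta[a, b] = log(A+[a, b] / A-[a, b])."""
    x = [ALPHABET[c] for c in seq]
    beta = np.log(A_plus) - np.log(A_minus)          # precompute the beta table
    return sum(beta[a, b] for a, b in zip(x[:-1], x[1:]))

# Toy matrices: the "+" model favours C->G transitions, the "-" model does not.
A_plus = np.full((4, 4), 0.25); A_plus[1, 2] = 0.40; A_plus[1] /= A_plus[1].sum()
A_minus = np.full((4, 4), 0.25); A_minus[1, 2] = 0.05; A_minus[1] /= A_minus[1].sum()
print(log_odds("GCGCGC", A_plus, A_minus) > 0)        # positive score suggests a CpG island
```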
Empirical Evaluation
Finding CpG islands
Simple-minded approach:
- Pick a window of size N (N = 100, for example).
- Compute the log-ratio for the sequence in the window, and classify based on that.
Problems:
- How do we select N?
- What do we do when the window intersects the boundary of a CpG island?
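A sketch of this window scan, reusing the hypothetical log_odds helper from the ratio-test sketch above; the window size N is a tuning choice, not a value fixed by the slides.

```python
def scan_windows(seq, A_plus, A_minus, N=100):
    """Classify each length-N window by the sign of its log-odds score."""
    calls = []
    for start in range(0, max(1, len(seq) - N + 1)):
        window = seq[start:start + N]
        calls.append(log_odds(window, A_plus, A_minus) > 0)
    return calls   # calls[i] is True if the window starting at i looks like a CpG island
```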
Alternative Approach
- Build a model that includes "+" states and "−" states.
- A state "remembers" the last nucleotide and the type of region.
- A transition from a "−" state to a "+" state describes the start of a CpG island.
Hidden Markov Models
Two components:
- A Markov chain of hidden states H1,…,Hn with L values
  - P(Hi+1=l | Hi=k) = Akl
- Observations X1,…,Xn
Assumption:
- Xi depends only on the hidden state Hi
  - P(Xi=a | Hi=k) = Bka
Semantics
  P(X1,…,Xn, H1,…,Hn) = P(H1,…,Hn) P(X1,…,Xn | H1,…,Hn)
                      = P(H1) ∏_i P(Hi+1 | Hi) P(Xi | Hi)
                      = A_{0,H1} ∏_i A_{Hi,Hi+1} B_{Hi,Xi}
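A minimal sketch of evaluating this joint probability in log space; the array layout (a0 for the start probabilities A_{0,l}, A for transitions, B for emissions) and the toy casino numbers below are assumptions.

```python
import numpy as np

def joint_log_prob(x, h, a0, A, B):
    """log P(x1,…,xn, h1,…,hn) under a0[l]=P(H1=l), A[k,l]=P(Hi+1=l|Hi=k), B[l,a]=P(Xi=a|Hi=l)."""
    logp = np.log(a0[h[0]]) + np.log(B[h[0], x[0]])
    for i in range(1, len(x)):
        logp += np.log(A[h[i - 1], h[i]]) + np.log(B[h[i], x[i]])
    return logp

# Toy "dishonest casino": state 0 = fair die, state 1 = loaded die; x are die faces 0..5.
a0 = np.array([0.9, 0.1])
A = np.array([[0.95, 0.05], [0.10, 0.90]])
B = np.vstack([np.full(6, 1 / 6), [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])
print(joint_log_prob([0, 5, 5, 5], [0, 1, 1, 1], a0, A, B))
```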
Example: Dishonest Casino
Computing Most Probable Sequence
Given: x1,…,xn
Output: h*1,…,h*n such that
  P(x1,…,xn, h*1,…,h*n) = max_{h1,…,hn} P(x1,…,xn, h1,…,hn)
Idea:
- If we know the value of hi, then the most probable sequence on i+1,…,n does not depend on observations before time i.
- Let Vi(l) be the probability of the best sequence h1,…,hi such that hi = l.
Dynamic Programming Rule
  P(x1,…,xi+1, h1,…,hi+1) = P(x1,…,xi, h1,…,hi) P(hi+1 | hi) P(xi+1 | hi+1)
                          = P(x1,…,xi, h1,…,hi) A_{hi,hi+1} B_{hi+1,xi+1}
so
  Vi+1(l) = B_{l,xi+1} max_k Vi(k) Akl
Viterbi Algorithm
- Set V0(0) = 1, V0(l) = 0 for l > 0
- for i = 1,…,n
  - for l = 1,…,L
    - set Vi(l) = B_{l,xi} max_k Vi-1(k) Akl
    - set Pi(l) = argmax_k Vi-1(k) Akl
- Let h*n = argmax_l Vn(l)
- for i = n-1,…,1
  - set h*i = Pi+1(h*i+1)
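A sketch of Viterbi in log space, assuming the start state is folded into an initial distribution a0; the variable names are illustrative, not from the slides.

```python
import numpy as np

def viterbi(x, a0, A, B):
    """Most probable hidden state sequence for observations x (integer-encoded)."""
    n, L = len(x), len(a0)
    logV = np.empty((n, L))            # logV[i, l] = log prob of best path ending in state l
    ptr = np.zeros((n, L), dtype=int)  # ptr[i, l] = argmax predecessor, i.e. P_i(l)
    logV[0] = np.log(a0) + np.log(B[:, x[0]])
    for i in range(1, n):
        scores = logV[i - 1][:, None] + np.log(A)   # scores[k, l] = logV_{i-1}(k) + log A_kl
        ptr[i] = scores.argmax(axis=0)
        logV[i] = scores.max(axis=0) + np.log(B[:, x[i]])
    # Trace back the most probable state sequence h*.
    h = [int(logV[-1].argmax())]
    for i in range(n - 1, 0, -1):
        h.append(int(ptr[i][h[-1]]))
    return h[::-1]
```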
Computing Probabilities
Given: x1,…,xn
Output: P(x1,…,xn)
  P(x1,…,xn) = Σ_{h1,…,hn} P(x1,…,xn, h1,…,hn)
How do we sum over an exponential number of hidden sequences?
Forward Algorithm
- Perform dynamic programming on sequences.
- Let fi(l) = P(x1,…,xi, Hi=l)
- Recursion rule:
    fi+1(l) = B_{l,xi+1} Σ_k fi(k) Akl
- Conclusion:
    P(x1,…,xn) = Σ_k fn(k)
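A sketch of the forward pass; the per-position scaling is an addition (not on the slide) to avoid numerical underflow on long sequences.

```python
import numpy as np

def forward(x, a0, A, B):
    """Scaled forward messages, scale factors, and log P(x1,…,xn)."""
    n, L = len(x), len(a0)
    f = np.empty((n, L))
    scale = np.empty(n)                        # scale[i] = P(x_i | x_1..x_{i-1})
    f[0] = a0 * B[:, x[0]]
    scale[0] = f[0].sum(); f[0] /= scale[0]
    for i in range(1, n):
        f[i] = (f[i - 1] @ A) * B[:, x[i]]     # sum_k f_{i-1}(k) A_kl, then emission
        scale[i] = f[i].sum(); f[i] /= scale[i]
    log_likelihood = np.log(scale).sum()       # log P(x1,…,xn) = sum_i log scale[i]
    return f, scale, log_likelihood
```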
Backward Algorithm
- Perform dynamic programming on sequences.
- Let bi(l) = P(xi+1,…,xn | Hi=l)
- Recursion rule:
    bi(l) = Σ_k Alk B_{k,xi+1} bi+1(k)
- Conclusion:
    P(x1,…,xn) = b0(0) = Σ_l A_{0,l} B_{l,x1} b1(l)
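A matching sketch of the backward pass, reusing the scale factors from the forward sketch above so the two sets of messages stay on a common scale; this convention is an assumption, not the slide's.

```python
import numpy as np

def backward(x, A, B, scale):
    """Scaled backward messages b[i, l], using the forward pass's scale factors."""
    n, L = len(x), A.shape[0]
    b = np.empty((n, L))
    b[-1] = 1.0                                   # b_n(l) = 1 by convention
    for i in range(n - 2, -1, -1):
        # b_i(l) = sum_k A_lk B_{k, x_{i+1}} b_{i+1}(k), rescaled
        b[i] = A @ (B[:, x[i + 1]] * b[i + 1]) / scale[i + 1]
    return b
```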
Computing Posteriors
- How do we compute P(Hi | x1,…,xn)?
    P(Hi=l | x1,…,xn) = P(Hi=l, x1,…,xn) / P(x1,…,xn)
                      = P(Hi=l, x1,…,xi) P(xi+1,…,xn | Hi=l) / P(x1,…,xn)
                      = fi(l) bi(l) / P(x1,…,xn)
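With the scaled messages from the sketches above, this posterior reduces to an elementwise product; the helper below is an assumption built on those sketches, not code from the lecture.

```python
def posteriors(f, b):
    """P(H_i = l | x1,…,xn) per position, from scaled forward/backward messages."""
    gamma = f * b
    return gamma / gamma.sum(axis=1, keepdims=True)   # one distribution per position
```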
Dishonest Casino (again)
- Computing posterior probabilities for "fair" at each point in a long sequence.
Learning
Given a sequence x1,…,xn, h1,…,hn
- How do we learn Akl and Bka?
- We want to find parameters that maximize the likelihood P(x1,…,xn, h1,…,hn).
We simply count:
- Nkl - number of times hi=k & hi+1=l
- Nka - number of times hi=k & xi=a
Then:
  Akl = Nkl / Σ_{l'} Nkl'
  Bka = Nka / Σ_{a'} Nka'
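A sketch of this counting estimator, assuming states and symbols are already encoded as integers 0,…,L-1 and 0,…,M-1; the optional pseudocount is an addition to avoid division by zero for unseen states.

```python
import numpy as np

def count_estimate(x, h, L, M, pseudocount=0.0):
    """Maximum-likelihood A and B from fully observed (x, h)."""
    N_kl = np.full((L, L), pseudocount)
    N_ka = np.full((L, M), pseudocount)
    for k, l in zip(h[:-1], h[1:]):
        N_kl[k, l] += 1                 # transition counts N_kl
    for k, a in zip(h, x):
        N_ka[k, a] += 1                 # emission counts N_ka
    A = N_kl / N_kl.sum(axis=1, keepdims=True)
    B = N_ka / N_ka.sum(axis=1, keepdims=True)
    return A, B
```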
Learning
Given only the sequence x1,…,xn
- How do we learn Akl and Bka?
- We want to find parameters that maximize the likelihood P(x1,…,xn).
Problem:
- The counts are inaccessible, since we do not observe hi.
- If we have Akl and Bka we can compute
    P(Hi=k, Hi+1=l | x1,…,xn) = P(Hi=k, Hi+1=l, x1,…,xn) / P(x1,…,xn)
                              = P(Hi=k, x1,…,xi) P(Hi+1=l | Hi=k) P(xi+1 | Hi+1=l) P(xi+2,…,xn | Hi+1=l) / P(x1,…,xn)
                              = fi(k) Akl B_{l,xi+1} bi+1(l) / P(x1,…,xn)
Expected Counts
- We can compute the expected number of times hi=k & hi+1=l:
    E[Nkl] = Σ_i P(Hi=k, Hi+1=l | x1,…,xn)
- Similarly:
    E[Nka] = Σ_{i: xi=a} P(Hi=k | x1,…,xn)
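A sketch of the expected-count computation, assuming the scaled forward/backward messages f, b and the scale factors from the earlier sketches; all names are illustrative.

```python
import numpy as np

def expected_counts(x, A, B, f, b, scale):
    """E[N_kl] and E[N_ka] from scaled forward/backward messages."""
    L, M = B.shape
    E_kl = np.zeros((L, L))
    for i in range(len(x) - 1):
        # pairwise posterior: P(H_i=k, H_{i+1}=l | x) = f_i(k) A_kl B_{l,x_{i+1}} b_{i+1}(l) / scale_{i+1}
        pair = f[i][:, None] * A * (B[:, x[i + 1]] * b[i + 1])[None, :] / scale[i + 1]
        E_kl += pair
    gamma = f * b                        # gamma[i, k] = P(H_i = k | x)
    E_ka = np.zeros((L, M))
    for i, a in enumerate(x):
        E_ka[:, a] += gamma[i]
    return E_kl, E_ka
```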
Expectation Maximization (EM)
- Choose initial Akl and Bka
E-step:
- Compute expected counts E[Nkl], E[Nka]
M-step:
- Re-estimate:
    A'kl = E[Nkl] / Σ_{l'} E[Nkl']
    B'ka = E[Nka] / Σ_{a'} E[Nka']
- Reiterate
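One full EM sweep, stitched together from the sketches above (forward, backward, expected_counts); those helpers are assumptions introduced in this transcript's code sketches, not code from the lecture.

```python
def em_step(x, a0, A, B):
    """E-step (expected counts) followed by M-step (re-estimation of A and B)."""
    f, scale, loglik = forward(x, a0, A, B)
    b = backward(x, A, B, scale)
    E_kl, E_ka = expected_counts(x, A, B, f, b, scale)
    A_new = E_kl / E_kl.sum(axis=1, keepdims=True)       # A'_kl = E[N_kl] / sum_l' E[N_kl']
    B_new = E_ka / E_ka.sum(axis=1, keepdims=True)       # B'_ka = E[N_ka] / sum_a' E[N_ka']
    return A_new, B_new, loglik

# Reiterate em_step until loglik stops improving (it is guaranteed not to decrease).
```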
EM - basic properties
- P(x1,…,xn: Akl, Bka) ≤ P(x1,…,xn: A'kl, B'ka)
  - The likelihood grows in each iteration.
- If P(x1,…,xn: Akl, Bka) = P(x1,…,xn: A'kl, B'ka), then Akl, Bka is a stationary point of the likelihood
  - either a local maximum, a local minimum, or a saddle point
Complexity of E-step
- Compute forward and backward messages
  - Time complexity: O(nL²); space complexity: O(nL)
- Accumulate expected counts
  - Time complexity: O(nL²)
  - Space complexity: O(L²)
EM - problems
Local maxima:
- Learning can get stuck in local maxima
- Sensitive to initialization
- Requires some method for escaping such maxima
Choosing L:
- We often do not know how many hidden values we should have or can learn