Computational Genomics
Lecture 8a
Hidden Markov Models (HMMs)
© Ydo Wexler & Dan Geiger (Technion) and Nir Friedman (HU);
modified by Benny Chor (TAU)
Outline
Finite, or Discrete, Markov Models
Hidden Markov Models
Three major questions:
Q1: Compute the probability of a given sequence
of observations.
A1: The forward-backward dynamic programming
algorithm (Baum-Welch).
Q2: Compute the most probable sequence of
states, given a sequence of observations.
A2: Viterbi’s dynamic programming algorithm.
Q3: Learn the best model, given a sequence of observations.
A3: The Expectation Maximization (EM) heuristic.
2
Markov Models
A discrete (finite) system:
N distinct states.
Begins (at time t=1) in some initial state(s).
At each time step (t=1,2,…) the system moves
from current to next state (possibly the same as
the current state) according to transition
probabilities associated with current state.
This kind of system is called a finite, or discrete,
Markov model (aka a probabilistic finite automaton),
after Andrei Andreyevich Markov (1856-1922).
3
Example (reminder): The Friendly Gambler
The game starts with $10 in the gambler’s pocket.
– At each round:
• The gambler wins $1 with probability p
• The gambler loses $1 with probability 1-p
– The game ends when the gambler goes broke (no sister in the bank),
or accumulates a capital of $100 (including the initial capital).
– Both $0 and $100 are absorbing states (or boundaries).
[Diagram: a chain on states 0, 1, 2, …, N-1, N (N = 100); each interior state moves up by 1 with probability p and down by 1 with probability 1-p; the start state is 10 ($10).]
4
Example (reminder): The Friendly Gambler
Irreducible means that every state is accessible from every other
state. Aperiodic means that there exists at least one state for
which the transition from that state to itself is possible. Positive
recurrent means that for every state, the expected return time is
finite. If the Markov chain is positive recurrent, there exists a
stationary distribution.
Is the gambler’s chain positive recurrent? Does it have stationary
distribution(s) (and are they independent of initial distribution)?
5
Let Us Change Gear
Enough with these simple Markov chains.
Our next mission: hidden Markov chains.
[Diagram: two hidden states, Fair and Loaded; Start enters each with probability 1/2. Fair stays Fair with probability 0.9 and switches to Loaded with probability 0.1; Loaded stays Loaded with probability 0.9 and switches to Fair with probability 0.1. Fair emits head/tail with probabilities 1/2 and 1/2; Loaded emits head with probability 3/4 and tail with probability 1/4.]
6
Hidden Markov Models
(or probabilistic finite state transducers)
Often we face cases where states cannot be directly
observed. We need an extension to Markov Models:
Hidden Markov Models
[Diagram: a four-state HMM (states 1, 2, 3, 4) with transition probabilities a11, a12, a22, a23, a33, a34, a44, and output probabilities b11, b12, b13, b14 from state 1 to the observed phenomenon.]
aij are state transition probabilities.
bik are observation (output) probabilities.
b11 + b12 + b13 + b14 = 1,
b21 + b22 + b23 + b24 = 1, etc.
7
Hidden Markov Models - HMM
hidden state variables: H1, H2, …, Hi, …, HL-1, HL
observed data (“output”): X1, X2, …, Xi, …, XL-1, XL
8
Example: Dishonest Casino
Actually, what is hidden in this model?
9
A Similar Example: Loaded Coin
[Diagram: the fair/loaded coin HMM from the previous slide, unrolled over L tosses: hidden states H1, H2, …, Hi, …, HL-1, HL take values Fair/Loaded; observations X1, X2, …, Xi, …, XL-1, XL take values Head/Tail.]
10
Loaded Coin Example (cont.)
[Diagram repeated: the fair/loaded coin HMM unrolled over L tosses; hidden states Fair/Loaded, observations Head/Tail.]
Q1: What is the probability of the sequence of observed
outcomes (e.g. HHHTHTTHHT), given the model?
11
HMMs – Question I
Given an observation sequence O = (O1 O2 O3 … OL),
and a model M = {A, B, p }, how do we efficiently
compute P( O | M ), the probability that the given
model M produces the observation O in a run of
length L ?
This probability can be viewed as a measure of the
quality of the model M. Viewed this way, it enables
discrimination/selection among alternative models
M1, M2, M3…
12
Example: CpG islands
In the human genome, CG dinucleotides are relatively
rare
CG pairs undergo a process called methylation
that modifies the C nucleotide
A methylated C mutates (with relatively high
chance) to a T
Promoter regions are CG-rich
These regions are not methylated, and thus
mutate less often
These are called CG (aka CpG) islands
13
Biological Example: Methylation and CG Islands
CG dinucleotides in nuclear DNA sequences often undergo
a process of methylation, where a methyl (CH3) “joins” the
Cytosine (C). The methylated Cytosine may be converted to
Thymine (T) by accidental deamination. Over evolutionary
time scales, the methylated CG sequence will often be
converted to the TG sequence.
Genes whose control regions are methylated are usually
under-expressed. Indeed, unmethylated CGs are often found
around active genes (this is condition and tissue dependent).
A CG island is a short stretch of DNA in which the frequency
of the CG sequence is higher than other regions. Therefore,
such islands are found with higher density around genes.
http://www.web-books.com/MoBio/Free/Ch7F2.htm
14
Biological Example: Methylation & CG Islands
Notice: the complement of a CG is a GC.
Methylation is an epigenetic phenomenon.
http://www.web-books.com/MoBio/Free/Ch7F2.htm
15
CpG Islands
We construct a Markov chain for CpG-rich regions
and another for CpG-poor regions.
Using maximum-likelihood estimates from 60K
nucleotides, we get two models.
16
Ratio Test for CpG Islands
Given a sequence X1,…,Xn we compute the log-likelihood ratio
S(X1,…,Xn) = log [ P(X1,…,Xn | +) / P(X1,…,Xn | -) ]
           = Σi log ( A+XiXi+1 / A-XiXi+1 )
           = Σi βXiXi+1
where A+ and A- are the transition matrices of the CpG-rich and
CpG-poor models, and βst denotes log(A+st / A-st).
17
Empirical Evaluation
18
Finding CpG islands
Simple-minded approach:
Pick a window of size N
(N = 100, for example)
Compute the log-ratio for the sequence in the window,
and classify based on that
Problems:
How do we select N?
What do we do when the window intersects the
boundary of a CpG island?
19
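The window-based ratio test above can be sketched in a few lines of Python. The transition tables here are toy values invented for illustration, not maximum-likelihood estimates from real genomic data; only the shape of the computation follows the slides, and the function names are my own.

```python
import math

def log_ratio(seq, a_plus, a_minus):
    """S(X1..Xn) = sum_i log( A+_{Xi Xi+1} / A-_{Xi Xi+1} )."""
    return sum(math.log(a_plus[x, y] / a_minus[x, y])
               for x, y in zip(seq, seq[1:]))

def classify_windows(seq, a_plus, a_minus, n=100):
    """Score consecutive (non-overlapping) windows of size n;
    a positive score suggests the window is CpG-rich."""
    return [(i, log_ratio(seq[i:i + n], a_plus, a_minus))
            for i in range(0, len(seq) - n + 1, n)]

# Toy transition tables, NOT estimates from real data: the "+" model makes
# C->G four times likelier than the uniform "-" model; row C is renormalized.
nts = "ACGT"
a_minus = {(x, y): 0.25 for x in nts for y in nts}
a_plus = dict(a_minus)
a_plus["C", "G"] = 0.40
for y in "ACT":
    a_plus["C", y] = 0.20

print(classify_windows("CG" * 60, a_plus, a_minus, n=20))
```

On the artificial CG-repeat input every window scores positive, while a sequence with no CG dinucleotides scores exactly zero under these toy tables.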
Alternative Approach
Build a model that includes “+” states and “-” states
A state “remembers” the last nucleotide and the type of region
A transition from a “-” state to a “+” state corresponds to the start of
a CpG island
20
C-G Islands: A Different HMM
Define C-G islands: DNA stretches which are very rich in CG (aka CpG islands).
[Diagram: two four-state chains over the nucleotides A, C, G, T, one labeled "Regular DNA" and one labeled "C-G island", linked by low-probability "change" transitions; within the island chain, transitions into C and G carry higher probability than in the regular chain.]
21
A Different C-G Islands HMM
[Diagram: the combined model unrolled over the sequence: hidden states H1, H2, …, Hi, …, HL-1, HL answer "C-G island?" (the region type, remembering the last nucleotide A/C/G/T); observations X1, X2, …, Xi, …, XL-1, XL take values A/C/G/T.]
22
HMM Recognition (question I)
For a given model M = {A, B, p} and a given state
sequence Q1 Q2 Q3 … QL, the probability of an
observation sequence O1 O2 O3 … OL is
P(O|Q,M) = bQ1O1 bQ2O2 bQ3O3 … bQLOL
For a given hidden Markov model M = {A, B, p},
the probability of the state sequence Q1 Q2 Q3 … QL
is (the initial probability of Q1 is taken to be pQ1)
P(Q|M) = pQ1 aQ1Q2 aQ2Q3 aQ3Q4 … aQL-1QL
So, for a given HMM M,
the probability of an observation sequence O1 O2 O3 … OL
is obtained by summing over all possible state sequences.
23
HMM – Recognition (cont.)
P(O|M) = ΣQ P(O|Q,M) P(Q|M)
       = ΣQ pQ1 bQ1O1 aQ1Q2 bQ2O2 aQ2Q3 bQ3O3 …
Requires summing over exponentially many paths.
Can this be made more efficient?
24
HMM – Recognition (cont.)
Why isn’t it efficient? – it takes O(2L·Q^L) operations.
For a given state sequence of length L we have
about 2L calculations:
P(Q|M) = pQ1 aQ1Q2 aQ2Q3 aQ3Q4 … aQL-1QL
P(O|Q,M) = bQ1O1 bQ2O2 bQ3O3 … bQLOL
There are Q^L possible state sequences.
So, if Q = 5 and L = 100, the algorithm requires
about 200×5^100 computations.
Instead, we will use the forward-backward (F-B)
algorithm of Baum (68) to do things more efficiently.
25
The Forward Backward Algorithm
A white board presentation.
26
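Since the slide defers the derivation to the whiteboard, here is a minimal sketch of the forward pass for Q1, using the fair/loaded coin parameters from the earlier diagram; the variable and function names are assumptions of this sketch, not notation from the slides.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """alpha[t][s] = P(O_1..O_t, H_t = s | M); returns P(O | M).
    Runs in O(L * Q^2) time rather than the naive O(2L * Q^L)."""
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({s: emit_p[s][o] * sum(prev[r] * trans_p[r][s]
                                            for r in states)
                      for s in states})
    return sum(alpha[-1].values())

states = ("F", "L")                          # Fair / Loaded
start_p = {"F": 0.5, "L": 0.5}
trans_p = {"F": {"F": 0.9, "L": 0.1},
           "L": {"F": 0.1, "L": 0.9}}
emit_p = {"F": {"H": 0.5, "T": 0.5},         # fair coin
          "L": {"H": 0.75, "T": 0.25}}       # loaded coin
p = forward("HHHTHTTHHT", states, start_p, trans_p, emit_p)
print(p)                                     # P(O | M) for the slide-11 sequence
```

For a sequence this short the result can be checked by brute-force summation over all 2^10 state paths; the dynamic program gives the same value with far fewer operations.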
The F-B Algorithm (cont.)
Option 1) The likelihood is measured using any
sequence of states of length L.
This is known as the “Any Path” method.
Option 2) We can choose an HMM by the probability
generated using the best possible sequence of
states.
We’ll refer to this method as the “Best Path” method.
27
HMM – Question II (Harder)
Given an observation sequence O = (O1 O2 … OL)
and a model M = {A, B, p}, how do we efficiently
compute the most probable sequence(s) of states, Q?
Namely, the sequence of states Q = (Q1 Q2 … QL)
which maximizes P(O|Q,M), the probability that the
given model M produces the given observation O
when it goes through the specific sequence of
states Q.
Recall that given a model M, a sequence of
observations O, and a sequence of states Q, we
can efficiently compute P(O|Q,M) (we should watch
out for numeric underflows).
28
Most Probable States Sequence (Q. II)
Idea:
If we know the identity of Qi , then the most probable
sequence on i+1,…,n does not depend on
observations before time i
A white board presentation of Viterbi’s algorithm
Followed by a simple weather demo.
An online demo of Viterbi’s algorithm
http://www.telecom.tuc.gr/~ntsourak/demo_viterbi.htm
29
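A minimal sketch of Viterbi's algorithm for the coin model, done in log space per the earlier warning about numeric underflow; the parameter values repeat the fair/loaded coin diagram, and the function names are assumptions of this sketch.

```python
from math import log

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden state sequence, computed in log space
    to avoid numeric underflow on long sequences."""
    v = [{s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            # Best predecessor of state s at this step.
            best = max(states, key=lambda r: v[-1][r] + log(trans_p[r][s]))
            col[s] = v[-1][best] + log(trans_p[best][s]) + log(emit_p[s][o])
            ptr[s] = best
        v.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: v[-1][s])   # best final state
    path = [last]
    for ptr in reversed(back):                   # follow back-pointers
        path.append(ptr[path[-1]])
    return path[::-1]

states = ("F", "L")
start_p = {"F": 0.5, "L": 0.5}
trans_p = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
emit_p = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.75, "T": 0.25}}
print(viterbi("HHHHHHHHTT", states, start_p, trans_p, emit_p))
```

As with the forward pass, a short sequence lets one verify the answer against a brute-force maximization over all Q^L state paths.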
Dishonest Casino (again)
Computing posterior probabilities for “fair” at each
point in a long sequence:
30
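A posterior curve like the one on this slide can be produced with the forward-backward algorithm. Below is a sketch, again assuming the coin parameters from the earlier diagram rather than the casino's actual dice model.

```python
def posteriors(obs, states, start_p, trans_p, emit_p):
    """P(H_t = s | O) for every position t, via forward-backward."""
    f = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for o in obs[1:]:                          # forward pass
        f.append({s: emit_p[s][o] * sum(f[-1][r] * trans_p[r][s]
                                        for r in states)
                  for s in states})
    b = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):                # backward pass
        b.append({s: sum(trans_p[s][r] * emit_p[r][o] * b[-1][r]
                         for r in states)
                  for s in states})
    b.reverse()
    total = sum(f[-1].values())                # = P(O | M)
    return [{s: f[t][s] * b[t][s] / total for s in states}
            for t in range(len(obs))]

states = ("F", "L")
start_p = {"F": 0.5, "L": 0.5}
trans_p = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
emit_p = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.75, "T": 0.25}}
post = posteriors("HHHTHTTHHT", states, start_p, trans_p, emit_p)
print([round(p["F"], 3) for p in post])        # P(fair) at each toss
```

At every position the two posteriors sum to one, which is a handy sanity check on the implementation.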
HMM – Question III (Hardest)
Given an observation sequence O = (O1 O2 … OL),
and a class of models, each of the form M = {A,B,p},
which specific model “best” explains the
observations?
A solution to question I enables the efficient
computation of P(O|M) (the probability that a specific
model M produces the observation O).
Question III can be viewed as a learning problem:
We want to use the sequence of observations in
order to “train” an HMM and learn the optimal
underlying model parameters (transition and output
probabilities).
31
Learning
Given the two sequences O1,…,OT and S1,…,ST,
how do we learn the model parameters ai,j and bi(a)?
We want to find parameters that
maximize the likelihood, Pr(O1,…,OT | q).
We simply count:
Nkl - number of times qi=Sk & qi+1=Sl
Nka - number of times qi=Sk & Oi = a
and set
akl = Nkl / Σl' Nkl'
bka = Nka / Σa' Nka'
32
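When the state path is fully observed, the counting estimates above are a few lines of Python. A sketch with toy sequences; the function name and the assumption that every state has at least one outgoing transition and one emission (so no denominator is zero) are mine.

```python
from collections import Counter

def mle_params(state_seq, obs_seq):
    """Maximum-likelihood estimates from a fully observed run:
    akl = Nkl / sum_l' Nkl'  and  bk(a) = Nka / sum_a' Nka'.
    Assumes every state has >= 1 outgoing transition and >= 1 emission."""
    n_trans = Counter(zip(state_seq, state_seq[1:]))   # Nkl
    n_emit = Counter(zip(state_seq, obs_seq))          # Nka
    states, symbols = set(state_seq), set(obs_seq)
    a = {k: {l: n_trans[k, l] / sum(n_trans[k, m] for m in states)
             for l in states} for k in states}
    b = {k: {s: n_emit[k, s] / sum(n_emit[k, t] for t in symbols)
             for s in symbols} for k in states}
    return a, b

# Toy fully observed run: hidden path F/L side by side with the tosses.
a, b = mle_params("FFFFLLLLFF", "HTHTHHHTHT")
print(a["F"]["L"], b["L"]["H"])                        # prints: 0.2 0.75
```

Each estimated row is a proper distribution (it sums to one), which follows directly from the normalization in the formulas.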
Learning the Model
Given only the observations O1,…,OT,
How do we learn Akl and Bka ?
We want to find parameters that maximize the
likelihood Pr(O1,…,OT | q).
Problem:
Counts are inaccessible,
since we do not observe the St
33
Learning the Model
Problem:
Counts are inaccessible, since the St ’s are hidden.
Solution (heuristic): The EM algorithm, next lecture.
34