Variants of HMM
Sequence Alignment via HMM
Lecture #10
Background Readings: Chapters 11.6, 3.4, 3.5, 4 in Durbin et al., 2001; Chapter 3.4 in Setubal et al., 1997.
This class has been edited from Nir Friedman's lecture. Changes made by Dan Geiger, then Shlomo Moran.
Sequence Comparison using HMM
We now use an HMM to unify the different scoring functions
for sequence alignment into a single scoring function, which
combines the log-odds probabilities and the affine gap
penalties.
Sequence Comparison using HMM
• We wish to assign a probability to each alignment of
two DNA/protein sequences, using the HMM model.
• Each “output symbol” of the HMM is an aligned pair of
two letters, or of a letter and a gap.
• The hidden states should represent some evolutionary
model.
• Transition and emission probabilities define the
probability of each aligned pair of sequences.
• Given two input sequences, we look for an alignment
of these sequences of maximum probability.
Next we show an example of such a model.
Sequence Comparison using HMM
An HMM for sequence alignment which incorporates
affine gap scores.
"Hidden" states
• Match (M).
• Insertion in x (X).
• Insertion in y (Y).
Symbols emitted
• Match: {(a,b) | a,b in Σ}.
• Insertion in x: {(a,-) | a in Σ}.
• Insertion in y: {(-,a) | a in Σ}.
Example: insertion of a first gap in this model:

[Figure: a path fragment … → M → X → …, where state M emits the pair (G,T) and state X emits (C,-).]

We still need to assign the transition and emission probabilities.
Transition and Emission Probabilities
Transition probabilities (note the forbidden ones):
δ = probability of opening a gap (first gap symbol)
ε = probability of extending a gap (tailing gap symbol)

        M       X     Y
  M   1-2δ      δ     δ
  X   1-ε       ε     0
  Y   1-ε       0     ε

Emission probabilities
• Match: (a,b) with probability p_ab – only from the M state.
• Insertion in x: (a,-) with probability q_a – only from the X state.
• Insertion in y: (-,a) with probability q_a – only from the Y state.
(Note that the hidden states can be reconstructed from the alignment.)
Scoring alignments
For each pair of sequences x (of length m) and y (of
length n), there are many alignments of x and y, each
corresponding to a different state path (the lengths of the
paths are between max{m,n} and m+n).
Given the transition and emission probabilities, each
alignment has a defined score – the product of the
corresponding probabilities.
An alignment is "most probable" if it maximizes this
score.
Finding the most probable alignment
Let v^M(i,j) be the probability of the most probable
alignment of x(1..i) and y(1..j) which ends with a
match. Similarly, v^X(i,j) and v^Y(i,j) are the probabilities of
the most probable alignments of x(1..i) and y(1..j) which
end with an insertion in x or in y, respectively.
Then, using a recursive argument, we get:
$$ v^M(i,j) = p_{x_i y_j} \cdot \max \begin{cases} (1-2\delta)\, v^M(i-1,j-1) \\ (1-\varepsilon)\, v^X(i-1,j-1) \\ (1-\varepsilon)\, v^Y(i-1,j-1) \end{cases} $$
Most probable alignment
By a similar argument, v^X(i,j) and v^Y(i,j), the probabilities
of the most probable alignments of x(1..i) and y(1..j)
which end with an insertion in x or in y, are:
$$ v^X(i,j) = q_{x_i} \cdot \max \begin{cases} \delta\, v^M(i-1,j) \\ \varepsilon\, v^X(i-1,j) \end{cases} $$

$$ v^Y(i,j) = q_{y_j} \cdot \max \begin{cases} \delta\, v^M(i,j-1) \\ \varepsilon\, v^Y(i,j-1) \end{cases} $$
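To make the recursion concrete, here is a minimal Python sketch of it (not from the lecture). It works in log space to avoid underflow, and the gap probabilities delta, eps and the emission tables p_match, q are placeholder parameters that the caller must supply:

```python
import numpy as np

def pair_hmm_viterbi(x, y, delta, eps, p_match, q):
    """Score of the most probable alignment of x and y under the pair HMM
    (the recursion of the two slides above, without the END state).
    p_match[(a, b)] = p_ab, q[a] = q_a; delta and eps are the gap-open and
    gap-extension probabilities. Log space is used to avoid underflow."""
    m, n = len(x), len(y)
    log, NEG = np.log, -np.inf
    vM = np.full((m + 1, n + 1), NEG)
    vX = np.full((m + 1, n + 1), NEG)
    vY = np.full((m + 1, n + 1), NEG)
    vM[0, 0] = 0.0                       # empty alignment, probability 1
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                vM[i, j] = log(p_match[(x[i-1], y[j-1])]) + max(
                    log(1 - 2 * delta) + vM[i-1, j-1],
                    log(1 - eps) + vX[i-1, j-1],
                    log(1 - eps) + vY[i-1, j-1])
            if i > 0:
                vX[i, j] = log(q[x[i-1]]) + max(log(delta) + vM[i-1, j],
                                                log(eps) + vX[i-1, j])
            if j > 0:
                vY[i, j] = log(q[y[j-1]]) + max(log(delta) + vM[i, j-1],
                                                log(eps) + vY[i, j-1])
    # best alignment of the full sequences, ending in any state
    return max(vM[m, n], vX[m, n], vY[m, n])
```

The function returns the log-probability of the best state path; a traceback over the three matrices would recover the alignment itself.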
Adding termination probabilities
Different alignments of x and y may have different lengths. To
get a coherent probabilistic model we need to define a
probability distribution over sequences of different lengths.
For this, an END state is added,
with transition probability τ
from any other state to END.
This implies an expected
alignment length of 1/τ.
The last transition in each
alignment is to the END
state, with probability τ.
Transition probabilities with the END state:

        M         X     Y    END
  M   1-2δ-τ      δ     δ     τ
  X   1-ε-τ       ε     0     τ
  Y   1-ε-τ       0     ε     τ
  END   0         0     0     1
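As a side note (not on the slide), the 1/τ figure follows from the geometric distribution of the number of emitted pairs: starting from an emitting state, each step continues with probability 1-τ and ends with probability τ, so

$$ P(\text{length}=k) = (1-\tau)^{k-1}\,\tau, \qquad E[\text{length}] = \sum_{k\ge 1} k\,(1-\tau)^{k-1}\,\tau = \frac{1}{\tau}. $$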
The log-odds scoring function
We wish to know whether the alignment score is above or below
the score of a random alignment.
• For gapless alignments we used the log-odds ratio:
s(a,b) = log(p_ab / (q_a q_b)). We have s(a,b) > 0 iff the probability that
a and b are related by our model is larger than the
probability that they are picked at random.
• To adapt this to the HMM model, we need to model random
sequences by an HMM with an END state. This model assigns a
probability to each pair of sequences x and y of arbitrary
lengths m and n.
Scoring function for the random model
The transition probabilities for the
random model, with termination
probability η (X is the start state):

        X      Y    END
  X    1-η     η     0
  Y     0     1-η    η
  END   0      0     1

The emission probability for a letter a is q_a.
Thus the probability that x (of length n)
and y (of length m) are random is:

$$ P(x,y \mid \text{Random}) = \eta^2 (1-\eta)^{n+m} \prod_{i=1}^{n} q_{x_i} \prod_{j=1}^{m} q_{y_j} $$

and the corresponding score is:

$$ \log P(x,y \mid \text{Random}) = 2\log\eta + (n+m)\log(1-\eta) + \sum_{i=1}^{n} \log q_{x_i} + \sum_{j=1}^{m} \log q_{y_j} $$
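For illustration, this score can be computed directly from the formula; a minimal sketch (not from the lecture), assuming the background frequencies q are given as a dictionary:

```python
import math

def random_model_log_prob(x, y, eta, q):
    """log P(x, y | Random) as in the formula above; q[a] = q_a."""
    return (2 * math.log(eta)
            + (len(x) + len(y)) * math.log(1 - eta)
            + sum(math.log(q[a]) for a in x)
            + sum(math.log(q[a]) for a in y))
```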
Markov Chains for “Random” and “Model”
"Model":

        M         X     Y    END
  M   1-2δ-τ      δ     δ     τ
  X   1-ε-τ       ε     0     τ
  Y   1-ε-τ       0     ε     τ
  END   0         0     0     1

"Random":

        X      Y    END
  X    1-η     η     0
  Y     0     1-η    η
  END   0      0     1
Combining models in the log-odds
scoring function
In order to compare the M score to the R score of sequences x and y, we
can find an optimal M score and then subtract the R score from it.
This is insufficient when we look for local alignments, where the
optimal substrings in the alignment are not known in advance. A
better way:
1. Define a log-odds scoring function which keeps track of the
difference between the Match and Random scores of the partial strings
during the alignment.
2. At the end, add (log τ – 2 log η) to the score to compensate for the END
transitions in both models.
We get the following:
The log-odds scoring function
(assuming that letters at insertions/deletions are selected by the random model)
$$ V^M(i,j) = \log\frac{p_{x_i y_j}}{q_{x_i} q_{y_j}} - 2\log(1-\eta) + \max \begin{cases} \log(1-2\delta-\tau) + V^M(i-1,j-1) \\ \log(1-\varepsilon-\tau) + V^X(i-1,j-1) \\ \log(1-\varepsilon-\tau) + V^Y(i-1,j-1) \end{cases} $$

$$ V^X(i,j) = -\log(1-\eta) + \max \begin{cases} \log\delta + V^M(i-1,j) \\ \log\varepsilon + V^X(i-1,j) \end{cases} $$

$$ V^Y(i,j) = -\log(1-\eta) + \max \begin{cases} \log\delta + V^M(i,j-1) \\ \log\varepsilon + V^Y(i,j-1) \end{cases} $$

And at the end add (log τ – 2 log η) to the score.
The log-odds scoring function
Another way (Durbin et al., Chapter 4.1):
Define a scoring function s with penalties d and e for a first gap and a
gap extension (tailing gap), respectively:

$$ s(a,b) = \log\frac{p_{ab}}{q_a q_b} + \log\frac{1-2\delta-\tau}{(1-\eta)^2} $$
(assuming the move is from an M state)

$$ -d = \log\frac{\delta\,(1-\varepsilon-\tau)}{(1-\eta)(1-2\delta-\tau)} \qquad (\text{a ``prepayment'' when moving to the X or Y state}) $$

$$ -e = \log\frac{\varepsilon}{1-\eta} $$

Then modify the algorithm to correct for the extra prepayment, as follows:
Log-odds alignment algorithm
Initialization: V^M(0,0) = log τ – 2 log η.

$$ V^M(i,j) = s(x_i, y_j) + \max \begin{cases} V^M(i-1,j-1) \\ V^X(i-1,j-1) \\ V^Y(i-1,j-1) \end{cases} $$

$$ V^X(i,j) = \max \begin{cases} V^M(i-1,j) - d \\ V^X(i-1,j) - e \end{cases} \qquad V^Y(i,j) = \max \begin{cases} V^M(i,j-1) - d \\ V^Y(i,j-1) - e \end{cases} $$

Termination:
$$ V = \max\{\, V^M(m,n),\; V^X(m,n)+c,\; V^Y(m,n)+c \,\}, \qquad \text{where } c = \log(1-2\delta-\tau) - \log(1-\varepsilon-\tau). $$
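As an illustration (not part of the lecture), here is a minimal Python sketch of this algorithm. The scoring function s, the penalties d and e, the end correction c, and the initialization value v0 = log τ – 2 log η are assumed to be supplied by the caller:

```python
import numpy as np

def log_odds_align(x, y, s, d, e, c, v0):
    """Global log-odds alignment with affine gap penalties (the recursion above).
    s(a, b) is the substitution score, d and e the gap-open and gap-extension
    penalties, c = log(1-2*delta-tau) - log(1-eps-tau) the end correction, and
    v0 = log(tau) - 2*log(eta) the initialization value."""
    m, n = len(x), len(y)
    NEG = -np.inf
    VM = np.full((m + 1, n + 1), NEG)
    VX = np.full((m + 1, n + 1), NEG)
    VY = np.full((m + 1, n + 1), NEG)
    VM[0, 0] = v0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                VM[i, j] = s(x[i-1], y[j-1]) + max(VM[i-1, j-1],
                                                   VX[i-1, j-1],
                                                   VY[i-1, j-1])
            if i > 0:
                VX[i, j] = max(VM[i-1, j] - d, VX[i-1, j] - e)
            if j > 0:
                VY[i, j] = max(VM[i, j-1] - d, VY[i, j-1] - e)
    # Termination: V = max{ V^M(m,n), V^X(m,n)+c, V^Y(m,n)+c }
    return max(VM[m, n], VX[m, n] + c, VY[m, n] + c)
```

Structurally this is the familiar affine-gap (Gotoh-style) dynamic program; the HMM view only changes how the scores s, d, e and the constants are derived.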
Total probability of x and y
Rather than computing the probability of the most probable
alignment, we look for the total probability that x and y are
related by our model.
Let fM(i,j) be the sum of the probabilities of all alignments of
x(1..i) and y(1..j) which end with a match. Similarly, fX(i,j)
and fY(i,j) are the sums over the alignments which end with an
insertion in x (resp. y). A "forward"-type algorithm computes
these functions.
Initialization: fM(0,0)=1, fX(0,0)=fY(0,0)=0 (we start from M, but
we could select another initial state).
Total probability of x and y (cont.)
$$ f^M(i,j) = p_{x_i y_j} \left[\, (1-2\delta-\tau)\, f^M(i-1,j-1) + (1-\varepsilon-\tau)\, f^X(i-1,j-1) + (1-\varepsilon-\tau)\, f^Y(i-1,j-1) \,\right] $$

$$ f^X(i,j) = q_{x_i} \left[\, \delta\, f^M(i-1,j) + \varepsilon\, f^X(i-1,j) \,\right] $$

$$ f^Y(i,j) = q_{y_j} \left[\, \delta\, f^M(i,j-1) + \varepsilon\, f^Y(i,j-1) \,\right] $$

The total probability of all alignments is:
P(x,y | model) = f^M(m,n) + f^X(m,n) + f^Y(m,n)
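A minimal Python sketch of this forward recursion (not from the lecture). It uses plain probabilities for clarity, so long sequences would need scaling or log-space summation; p_match, q, delta, eps and tau are placeholder parameters:

```python
import numpy as np

def pair_hmm_forward(x, y, delta, eps, tau, p_match, q):
    """Total probability of all alignments of x and y (recursion above).
    p_match[(a, b)] = p_ab and q[a] = q_a are the emission tables."""
    m, n = len(x), len(y)
    fM = np.zeros((m + 1, n + 1))
    fX = np.zeros((m + 1, n + 1))
    fY = np.zeros((m + 1, n + 1))
    fM[0, 0] = 1.0                        # initialization: start in M
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                fM[i, j] = p_match[(x[i-1], y[j-1])] * (
                    (1 - 2 * delta - tau) * fM[i-1, j-1]
                    + (1 - eps - tau) * (fX[i-1, j-1] + fY[i-1, j-1]))
            if i > 0:
                fX[i, j] = q[x[i-1]] * (delta * fM[i-1, j] + eps * fX[i-1, j])
            if j > 0:
                fY[i, j] = q[y[j-1]] * (delta * fM[i, j-1] + eps * fY[i, j-1])
    return fM[m, n] + fX[m, n] + fY[m, n]
```

Replacing the sums by maxima (and the probabilities by their logarithms) recovers the Viterbi recursion of the earlier slides.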
HMM model structure:
extensions of Markov models
Markov chains are rather limited in describing sequences
of symbols with non-random structures. For instance, a
Markov chain forces the distribution of segments in
which some state is repeated k+1 times to be (1-p)p^k,
for some p.
[Figure: a state A with a self-transition of probability p, generating runs of repeated A's.]
In the following we'll see some ways to extend Markov chains
and HMMs to allow more involved structures.
HMM model structure:
1. Duration Modeling
An extension of Markov chains which allows the
distribution of segments in which a state is repeated k+1
times to have any desired value:
Assign k+1 states to represent the same "real" state. This
can model k repetitions (or fewer) with any desired
probability.
[Figure: a chain of states A1 → A2 → A3 → A4, all representing the same "real" state A.]
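One way to read this construction (an illustration, not from the slides): let each copy A_i advance to A_{i+1} with probability p_i and leave the chain otherwise. The duration d spent in the "real" state then satisfies

$$ P(d=1) = 1-p_1, \quad P(d=2) = p_1(1-p_2), \quad \ldots, \quad P(d=k+1) = p_1 p_2 \cdots p_k, $$

so any desired distribution over the durations 1, …, k+1 can be matched by choosing p_1, …, p_k appropriately.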
HMM model structure:
2. Silent states
States which do not emit symbols.
• Can be used to model duration distributions.
• Also used to allow arbitrary jumps (may be used to model deletions).
• The Forward and Backward algorithms need to be generalized to
arbitrary acyclic digraphs to account for the silent states (see next):

[Figure: a directed acyclic graph containing both silent states and regular (emitting) states.]
e.g., the forward algorithm should look as follows.

For a regular vertex v whose states emit the symbol x:

$$ F_l(v) = e_l(x) \sum_{(u,v)\in E} \sum_{k} F_k(u)\, a_{kl} $$

and for a vertex z of silent states:

$$ F_l(z) = \sum_{(u,z)\in E} \sum_{k} F_k(u)\, a_{kl} $$

Directed cycles of silent (or other) states complicate things and should be avoided.

[Figure: an acyclic graph of silent states and regular states; each regular vertex v emits a symbol x.]
HMM model structure:
3. High Order Markov Chains
Markov chains in which the transition probabilities
depend on the last k states:
P(x_i | x_{i-1},...,x_1) = P(x_i | x_{i-1},...,x_{i-k})
They can be represented by a standard Markov chain with
more states, e.g. for k=2:

[Figure: the four states AA, AB, BA, BB of the equivalent first-order chain over pairs, for the alphabet {A, B}.]
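As a small illustration (not from the lecture), the expansion for k = 2 can be built mechanically; p2 is an assumed table of second-order transition probabilities:

```python
from itertools import product

def expand_second_order(alphabet, p2):
    """Build the first-order chain over pair states that is equivalent to a
    second-order Markov chain. p2[(a, b)][c] = P(x_i = c | x_{i-2} = a, x_{i-1} = b)."""
    states = list(product(alphabet, alphabet))          # e.g. AA, AB, BA, BB
    trans = {}
    for (a, b) in states:
        for (b2, c) in states:
            # A transition (a,b) -> (b2,c) is possible only when b2 == b;
            # its probability is the second-order probability of emitting c.
            trans[((a, b), (b2, c))] = p2[(a, b)][c] if b2 == b else 0.0
    return states, trans
```

For the binary alphabet {A, B} this yields exactly the four states AA, AB, BA, BB shown above.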
HMM model structure:
4. Inhomogeneous Markov Chains
• An important task in analyzing DNA sequences is recognizing the
genes which code for proteins.
• A triplet of three nucleotides – a codon – codes for an amino acid.
• It is known that in the parts of DNA which code for genes, the three
codon positions have different statistics.
• Thus a Markov chain model for DNA should represent not only
the nucleotide (A, C, G or T) but also its position – the same
nucleotide in different positions will have different transition
probabilities. This is used in the GENEMARK gene-finding program (1993).
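As a closing illustration (not from the lecture, and much simpler than GENEMARK's actual model), an inhomogeneous period-3 chain can be scored by switching the transition table with the codon position; trans_by_pos and start_log_prob are hypothetical inputs:

```python
import math

def coding_log_prob(seq, trans_by_pos, start_log_prob):
    """Log-probability of a sequence under an inhomogeneous (period-3) Markov chain:
    the transition table depends on the codon position of the next nucleotide.
    trans_by_pos[p][a][b] = P(next = b | current = a) when the next nucleotide
    sits at codon position p (0, 1 or 2)."""
    logp = start_log_prob[seq[0]]
    for i in range(1, len(seq)):
        p = i % 3          # codon position of seq[i], assuming the frame starts at 0
        logp += math.log(trans_by_pos[p][seq[i - 1]][seq[i]])
    return logp
```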