Introduction to Computational Biology
Lecture #28: Estimating Evolutionary Time
Elad Granot
April 23, 2012
1 A Short Recap
In the previous lecture we created a probabilistic model of sequence evolution. We defined a transition matrix, $P(t)$, for each time $t$, giving the probability that a position in the DNA changes from nucleotide $a$ to $b$, for all $a, b \in \{A, T, C, G\}$. We denote this probability by $p(a \xrightarrow{t} b)$. We also defined the rate matrix, $R$, which describes the rate of a single change. Finally, we defined two important properties: the Markov property and the homogeneity property.
2 Jukes-Cantor Model
In the Jukes-Cantor (JC) model, the probability for a position to change its value is defined by the following distribution:

Definition 1. Given two nucleotides $a, b$ and time $t$, the JC distribution is defined as
\[
p(a \xrightarrow{t} b) =
\begin{cases}
\frac{1}{4}\left(1 - e^{-\frac{4}{3}t}\right), & a \neq b \\
\frac{1}{4}\left(1 + 3e^{-\frac{4}{3}t}\right), & a = b
\end{cases}
\]
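
A minimal Python sketch of Definition 1 (the function name jc_prob is ours, not from the lecture):

import math

def jc_prob(a, b, t):
    # Jukes-Cantor transition probability p(a -t-> b) from Definition 1.
    decay = math.exp(-4.0 * t / 3.0)
    if a == b:
        return 0.25 * (1.0 + 3.0 * decay)
    return 0.25 * (1.0 - decay)

# Sanity check: for a fixed a, the probabilities over all b sum to 1.
assert abs(sum(jc_prob("A", b, 0.7) for b in "ACGT") - 1.0) < 1e-12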
Assume we are given two sequences $x = x_1, x_2, \ldots, x_n$ and $y = y_1, y_2, \ldots, y_n$ which are already aligned (i.e., $1, 2, \ldots, n$ represents the same position in both sequences). We can ask for the likelihood that sequence $y$ was derived from sequence $x$ after time $t$. To that end we define the following random variables: $N_{eq} = \sum_{i=1}^{n} 1\{x_i = y_i\}$ and $N_{neq} = \sum_{i=1}^{n} 1\{x_i \neq y_i\}$. Notice that $N_{eq} + N_{neq} = n$. Therefore the likelihood is:
\[
L(t) = \prod_{i=1}^{n} p\!\left(x_i \xrightarrow{t} y_i\right) = \left(\frac{1}{4}\right)^{n} \cdot \left(1 - e^{-\frac{4}{3}t}\right)^{N_{neq}} \cdot \left(1 + 3e^{-\frac{4}{3}t}\right)^{N_{eq}}
\]
Definition 2. In order to find the most likely time we maximize $\log L(t)$ by setting its first derivative to zero:
\[
\log L(t) = n \cdot \log\frac{1}{4} + N_{neq} \cdot \log\left(1 - e^{-\frac{4}{3}t}\right) + N_{eq} \cdot \log\left(1 + 3e^{-\frac{4}{3}t}\right)
\]
\[
\frac{d}{dt}\log L(t) = \frac{N_{neq}}{1 - e^{-\frac{4}{3}t}} \cdot \frac{4}{3}e^{-\frac{4}{3}t} - \frac{N_{eq}}{1 + 3e^{-\frac{4}{3}t}} \cdot 4e^{-\frac{4}{3}t} = 0
\]
\[
\implies \frac{N_{neq}}{3\left(1 - e^{-\frac{4}{3}t}\right)} = \frac{N_{eq}}{1 + 3e^{-\frac{4}{3}t}}
\]
\[
\implies 3N_{eq} - N_{neq} = 3\left(N_{eq} + N_{neq}\right)e^{-\frac{4}{3}t} = 3ne^{-\frac{4}{3}t}
\]
\[
\implies \hat{t}_{JC} = -\frac{3}{4}\log\frac{3N_{eq} - N_{neq}}{3n} = -\frac{3}{4}\log\left(1 - \frac{4}{3} \cdot \frac{N_{neq}}{n}\right) \tag{2.1}
\]
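
Equation 2.1 is straightforward to compute from two aligned sequences. A minimal Python sketch, assuming the convention of detail 3 below for the degenerate case (names are illustrative):

import math

def jc_time_estimate(x, y):
    # Maximum-likelihood time t_hat_JC of equation 2.1 for two aligned sequences.
    n = len(x)
    n_neq = sum(1 for xi, yi in zip(x, y) if xi != yi)
    arg = 1.0 - (4.0 / 3.0) * n_neq / n  # equals (3*N_eq - N_neq) / (3n)
    if arg <= 0.0:
        return math.inf  # undefined case, see details 1 and 3 below
    return -0.75 * math.log(arg)

print(jc_time_estimate("ACGTACGT", "ACGTACGA"))  # one mismatch out of n = 8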
Some interesting details regarding the JC estimator, $\hat{t}_{JC}$, as defined in equation 2.1:
1. When $3N_{eq} \to N_{neq}$, $\hat{t}_{JC} \to \infty$. Notice that this situation represents a uniform distribution over the nucleotides, and indeed in our model we assume that such cases appear only after infinite time, when the two sequences are almost independent of each other.
2. If $N_{neq} = 0$ then $\hat{t}_{JC} = 0$. This makes sense, since mismatches are assumed to accumulate over time, so observing none suggests that no time has passed.
3. When $3N_{eq} < N_{neq}$, $\hat{t}_{JC}$ is undefined (the argument of the logarithm is negative). In such situations we will assume $\hat{t}_{JC} \to \infty$.
4. Recall the mismatch estimator: a naive estimator that counts only the mismatches, i.e., $\hat{t}_{mm} = \frac{N_{neq}}{n}$. Comparing the mismatch estimator to the JC estimator, one finds that if $\frac{N_{neq}}{n} \ll \frac{3}{4}$ then, using the log approximation $\log(1 - \varepsilon) \approx -\varepsilon$ for $\varepsilon \ll 1$,
\[
\hat{t}_{JC} = -\frac{3}{4}\log\left(1 - \frac{4}{3} \cdot \frac{N_{neq}}{n}\right) \approx -\frac{3}{4} \cdot \left(-\frac{4}{3}\right) \cdot \frac{N_{neq}}{n} = \frac{N_{neq}}{n} = \hat{t}_{mm}.
\]
That means that as the number of mismatches increases, the mismatch estimator becomes less accurate.
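
To make this concrete, a small numeric comparison with illustrative mismatch fractions; it shows the two estimators agreeing for small $N_{neq}/n$ and diverging as $N_{neq}/n$ grows:

import math

for p in (0.01, 0.10, 0.30, 0.60):  # p stands for N_neq / n
    t_mm = p
    t_jc = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
    print(f"N_neq/n = {p:.2f}: t_mm = {t_mm:.3f}, t_jc = {t_jc:.3f}")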
3 The Reversible Model
The previous model dealt with the problem of estimating the evolutionary time for sequence $x$ to evolve into sequence $y$. But usually that is not the case. In practice, we more often see only the end points of the evolutionary process. That is, given two sequences $x, y$, we can assume that both were derived from a common ancestor, but it is less likely that $y$ was derived directly from $x$. We denote the most recent common ancestor of $x$ and $y$ by $MRCA(x, y)$. In order to use the JC properties we need to know the distance (as evolutionary time) from each of the sequences $x$ and $y$ to the root, $MRCA(x, y)$, or at least the ratio between the two branches. Since we know neither, we will demand the reversible property.
Definition 3. (The reversible property) Let $X_1, \ldots, X_n$ be a set of random variables that satisfies the Markov property, i.e., $p(X_i | X_1, \ldots, X_{i-1}) = p(X_i | X_{i-1})$ for all $i \in \{2, \ldots, n\}$. We say that the Markov chain is reversible if there is a probability distribution over the states, $\pi$, such that
\[
\pi_a \cdot p\left(X_i = b \mid X_{i-1} = a\right) = \pi_b \cdot p\left(X_i = a \mid X_{i-1} = b\right)
\]
for all $i \in \{2, \ldots, n\}$ and for all states $a$ and $b$. In the notation of our models:
\[
\pi_a \cdot p\!\left(a \xrightarrow{t} b\right) = \pi_b \cdot p\!\left(b \xrightarrow{t} a\right)
\]
Proposition 4. Given the reversible property, one does not need to know the root $MRCA(x, y)$.
Figure 3.1: An example of the MRCA for two sequences x and y
We are given two sequences $x$ and $y$; assume that sequence $x$ is at state $a$ and $y$ is at state $b$. We define the distance between $MRCA(x, y)$ and $x$ as $t$, and the distance between $MRCA(x, y)$ and $y$ as $s$ (see Figure 3.1). For convenience we write $MRCA(x, y)$ as $z$.
Proof. Now,
\[
p(x = a, y = b) = \sum_c p(x = a, y = b, z = c) = \sum_c p(z = c)\, p(x = a \mid z = c)\, p(y = b \mid z = c) = \sum_c \pi_c\, p\!\left(c \xrightarrow{t} a\right) p\!\left(c \xrightarrow{s} b\right)
\]
where the second equality uses the fact that, given the ancestor $z$, the two lineages evolve independently.
From the reversible property, $\pi_c\, p(c \xrightarrow{t} a) = \pi_a\, p(a \xrightarrow{t} c)$, so
\[
p(x = a, y = b) = \sum_c \pi_a\, p\!\left(a \xrightarrow{t} c\right) p\!\left(c \xrightarrow{s} b\right) = \pi_a\, p\!\left(a \xrightarrow{t+s} b\right),
\]
where the last equality holds because summing over the intermediate state $c$ composes a transition of time $t$ with a transition of time $s$ (the Chapman-Kolmogorov identity). We have thus found a closed formula that does not contain $z$.
Given the sequences $x = x_1, x_2, \ldots, x_n$ and $y = y_1, y_2, \ldots, y_n$, the likelihood in the reversible model is defined as follows:
\[
L(t) = \prod_{i=1}^{n} \pi_{x_i}\, p\!\left(x_i \xrightarrow{t} y_i\right)
\]
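
As a concrete instance, under the JC model the uniform distribution $\pi_a = \frac{1}{4}$ satisfies the reversible property (the JC transition probability is symmetric in $a$ and $b$), so the reversible-model log-likelihood can be sketched as follows (the function name is ours):

import math

def log_likelihood(x, y, t):
    # Reversible-model log-likelihood under JC, where pi_{x_i} = 1/4.
    decay = math.exp(-4.0 * t / 3.0)
    total = 0.0
    for xi, yi in zip(x, y):
        p = 0.25 * (1.0 + 3.0 * decay) if xi == yi else 0.25 * (1.0 - decay)
        total += math.log(0.25 * p)  # log(pi_{x_i} * p(x_i -t-> y_i))
    return total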
In order to find the most likely $t$ for $L(t)$ we will use line search.
Line Search
When we have an optimization problem, the line search strategy is an iterative approach for finding a local maximum. Given a function $f : X \to \mathbb{R}$ where $X \subseteq \mathbb{R}$, we say that three points $x_1 < x_2 < x_3 \in X$ are bracketing a maximum if $f(x_2) \geq f(x_1) \wedge f(x_2) \geq f(x_3)$. Choosing another point $x_4 \in X$ such that $x_1 < x_4 < x_3$, w.l.o.g. say $x_4 < x_2$, provides another bracket of the maximum ($x_1, x_4, x_2$ or $x_4, x_2, x_3$) over a smaller interval of $x$ (see the illustration in Figure 3.2).
Algorithm 1. Line search
input: x1, x2, x3 // assumes that x1, x2, x3 are bracketing a maximum
repeat until x3 − x1 < THRESHOLD
    if x2 − x1 > x3 − x2 then
        set x4 ← α(x1 + x2) // α usually set to 1/2
        if f(x4) ≥ f(x1) ∧ f(x4) ≥ f(x2) then
            set x3 ← x2, x2 ← x4
        else
            set x1 ← x4
    else
        set x4 ← α(x2 + x3) // α usually set to 1/2
        if f(x4) ≥ f(x2) ∧ f(x4) ≥ f(x3) then
            set x1 ← x2, x2 ← x4
        else
            set x3 ← x4
return some point between x1 and x3
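
A direct Python translation of Algorithm 1 with $\alpha = \frac{1}{2}$, returning $x_2$ as the point between $x_1$ and $x_3$ (parameter names and the threshold default are ours):

def line_search(f, x1, x2, x3, threshold=1e-6):
    # Assumes x1 < x2 < x3 bracket a maximum of f.
    while x3 - x1 >= threshold:
        if x2 - x1 > x3 - x2:  # probe the larger half, left of x2
            x4 = 0.5 * (x1 + x2)
            if f(x4) >= f(x1) and f(x4) >= f(x2):
                x3, x2 = x2, x4  # new bracket: x1, x4, x2
            else:
                x1 = x4          # new bracket: x4, x2, x3
        else:                    # probe to the right of x2
            x4 = 0.5 * (x2 + x3)
            if f(x4) >= f(x2) and f(x4) >= f(x3):
                x1, x2 = x2, x4  # new bracket: x2, x4, x3
            else:
                x3 = x4          # new bracket: x1, x2, x4
    return x2

# Example: the maximum of f(x) = -(x - 1.3)^2 is at x = 1.3.
print(line_search(lambda x: -(x - 1.3) ** 2, 0.0, 1.0, 4.0))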
Setting $\alpha = \frac{1}{2}$ has the property that after every two consecutive iterations, $x_3 - x_1$ decreases by at least $\frac{1}{4}$ of its size; therefore this is a logarithmic-time search algorithm. Sometimes, instead of choosing $\alpha = \frac{1}{2}$, one fits the best second-degree polynomial to $f(x_1), f(x_2), f(x_3)$ and sets $x_4$ to its maximum. Such heuristics are often better, since we assume that in the neighborhood of $x_2$, $f(x)$ behaves like a second-degree Taylor polynomial.
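
For the polynomial variant, the vertex of the parabola through the three points has a closed form; a sketch using the standard parabolic-interpolation formula (the function name is ours):

def parabolic_step(x1, f1, x2, f2, x3, f3):
    # Vertex of the parabola through (x1, f1), (x2, f2), (x3, f3);
    # used as x4 instead of the midpoint.
    num = (x2 - x1) ** 2 * (f2 - f3) - (x2 - x3) ** 2 * (f2 - f1)
    den = (x2 - x1) * (f2 - f3) - (x2 - x3) * (f2 - f1)
    return x2 - 0.5 * num / den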
How do we find $x_1, x_2, x_3$ that satisfy the bracketing-a-maximum criterion? We will assume that the function $f$ is unimodal (it has a single local maximum). If $X$ is bounded, then choosing the two endpoints and some point in the middle satisfies our needs. Otherwise, we start at point zero and sample points at exponentially growing distances until we find such a triple.
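
A sketch of the unbounded case, probing points at exponentially growing distances from zero (the step and growth factor are arbitrary choices of ours, and we assume the maximum does not sit below the first probe):

def find_bracket(f, step=1.0, grow=2.0, max_probes=60):
    # Slide a window of three probe points until it brackets a maximum.
    x1, x2, x3 = 0.0, step, step * grow
    for _ in range(max_probes):
        if f(x2) >= f(x1) and f(x2) >= f(x3):
            return x1, x2, x3  # bracketing triple found
        x1, x2, x3 = x2, x3, x3 * grow
    raise RuntimeError("no bracketing triple found")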
4 Conclusion
In order to estimate the distance between two sequences we should perform the following steps:
1. Align the two sequences.
2. Count how many positions in the DNA are conserved and how many differ (i.e., compute $N_{eq}$ and $N_{neq}$).
3. Using the JC model or the reversible model, find $\hat{t}$.
Figure 3.2: Illustration of the cases in the line search algorithm. Image (i) depicts three points bracketing a maximum. Images (ii) and (iii) show how we choose $x_4$ and the new $x_1, x_2, x_3$ values in each case.
How do we align multiple (more than two) sequences? And how do we build the best phylogenetic tree based on the distances? All that and more will be revealed soon.