Introduction to Computational Biology
Lecture #28: Estimating Evolutionary Time
Elad Granot
April 23, 2012
1 A Short Recap
In the previous lecture we created a probabilistic model of sequence evolution. We defined a transition matrix, $P(t)$, for each time $t$, giving the probability that a position in the DNA changes from nucleotide $a$ to $b$, for all $a, b \in \{A, T, C, G\}$. We denote this probability by $p(a \xrightarrow{t} b)$. We also defined the rate matrix, $R$, which describes the rate of a single change. Finally, we defined two important properties: the Markov property and the homogeneity property.
2 Jukes-Cantor Model
In the Jukes-Cantor (JC) model, the probability for a position to change its value is defined by the following distribution:

Definition 1. Given two nucleotides $a, b$ and time $t$, the JC distribution is defined as
\[
p(a \xrightarrow{t} b) =
\begin{cases}
\frac{1}{4}\left(1 - e^{-\frac{4}{3}t}\right), & a \neq b \\
\frac{1}{4}\left(1 + 3e^{-\frac{4}{3}t}\right), & a = b
\end{cases}
\]
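
A minimal Python sketch of Definition 1 (the function name jc_prob is ours, not from the lecture):

import math

def jc_prob(a, b, t):
    # Jukes-Cantor transition probability p(a -t-> b) from Definition 1.
    decay = math.exp(-4.0 * t / 3.0)
    if a == b:
        return 0.25 * (1.0 + 3.0 * decay)
    return 0.25 * (1.0 - decay)

# Sanity check: for a fixed a, the probabilities over all b sum to 1.
assert abs(sum(jc_prob("A", b, 0.7) for b in "ACGT") - 1.0) < 1e-12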
Assume we are given two sequences $x = x_1, x_2, \ldots, x_n$ and $y = y_1, y_2, \ldots, y_n$ which are already aligned (i.e., $1, 2, \ldots, n$ represents the same position in both sequences). We can ask for the likelihood that sequence $y$ was derived from sequence $x$ after time $t$. To that end we define the following random variables: $N_{eq} = \sum_{i=1}^{n} 1\{x_i = y_i\}$ and $N_{neq} = \sum_{i=1}^{n} 1\{x_i \neq y_i\}$. Notice that $N_{eq} + N_{neq} = n$. Therefore the likelihood is:
\[
L(t) = \prod_{i=1}^{n} p\!\left(x_i \xrightarrow{t} y_i\right) = \left(\frac{1}{4}\right)^{n} \cdot \left(1 - e^{-\frac{4}{3}t}\right)^{N_{neq}} \cdot \left(1 + 3e^{-\frac{4}{3}t}\right)^{N_{eq}}
\]
Definition 2. In order to find the most likely time we maximize $\log L(t)$ by setting its first derivative to zero:
\[
\log L(t) = n \cdot \log\frac{1}{4} + N_{neq} \cdot \log\left(1 - e^{-\frac{4}{3}t}\right) + N_{eq} \cdot \log\left(1 + 3e^{-\frac{4}{3}t}\right)
\]
\[
\frac{d}{dt}\log L(t) = \frac{N_{neq}}{1 - e^{-\frac{4}{3}t}} \cdot \frac{4}{3}e^{-\frac{4}{3}t} - \frac{N_{eq}}{1 + 3e^{-\frac{4}{3}t}} \cdot 4e^{-\frac{4}{3}t} = 0
\]
\[
\implies \frac{N_{neq}}{3\left(1 - e^{-\frac{4}{3}t}\right)} = \frac{N_{eq}}{1 + 3e^{-\frac{4}{3}t}}
\]
\[
\implies 3N_{eq} - N_{neq} = 3\left(N_{eq} + N_{neq}\right)e^{-\frac{4}{3}t} = 3ne^{-\frac{4}{3}t}
\]
\[
\implies \hat{t}_{JC} = -\frac{3}{4}\log\frac{3N_{eq} - N_{neq}}{3n} = -\frac{3}{4}\log\left(1 - \frac{4}{3} \cdot \frac{N_{neq}}{n}\right) \tag{2.1}
\]
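
Equation 2.1 is straightforward to compute from two aligned sequences. A minimal Python sketch, assuming the convention of detail 3 below for the degenerate case (names are illustrative):

import math

def jc_time_estimate(x, y):
    # Maximum-likelihood time t_hat_JC of equation 2.1 for two aligned sequences.
    n = len(x)
    n_neq = sum(1 for xi, yi in zip(x, y) if xi != yi)
    arg = 1.0 - (4.0 / 3.0) * n_neq / n  # equals (3*N_eq - N_neq) / (3n)
    if arg <= 0.0:
        return math.inf  # undefined case, see details 1 and 3 below
    return -0.75 * math.log(arg)

print(jc_time_estimate("ACGTACGT", "ACGTACGA"))  # one mismatch out of n = 8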
Some interesting details regarding the JC estimator, $\hat{t}_{JC}$, as defined in equation 2.1:
1. When $3N_{eq} \to N_{neq}$, $\hat{t}_{JC} \to \infty$. Notice that this situation represents a uniform distribution over the nucleotides, and indeed in our model we assume that such cases appear only after infinite time, when the two sequences are almost independent of each other.
2. If $N_{neq} = 0$ then $\hat{t}_{JC} = 0$. This makes sense, since mismatches are assumed to accumulate over time, so observing none suggests that no time has passed.
3. When $3N_{eq} < N_{neq}$, $\hat{t}_{JC}$ is undefined (the argument of the logarithm is negative). In such situations we will assume $\hat{t}_{JC} \to \infty$.
4. Recall the mismatch estimator: a naive estimator that counts only the mismatches, i.e., $\hat{t}_{mm} = \frac{N_{neq}}{n}$. Comparing the mismatch estimator to the JC estimator, one finds that if $\frac{N_{neq}}{n} \ll \frac{3}{4}$ then, using the log approximation $\log(1 - \varepsilon) \approx -\varepsilon$ for $\varepsilon \ll 1$,
\[
\hat{t}_{JC} = -\frac{3}{4}\log\left(1 - \frac{4}{3} \cdot \frac{N_{neq}}{n}\right) \approx -\frac{3}{4} \cdot \left(-\frac{4}{3}\right) \cdot \frac{N_{neq}}{n} = \frac{N_{neq}}{n} = \hat{t}_{mm}.
\]
That means that as the number of mismatches increases, the mismatch estimator becomes less accurate.
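
To make this concrete, a small numeric comparison with illustrative mismatch fractions; it shows the two estimators agreeing for small $N_{neq}/n$ and diverging as $N_{neq}/n$ grows:

import math

for p in (0.01, 0.10, 0.30, 0.60):  # p stands for N_neq / n
    t_mm = p
    t_jc = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
    print(f"N_neq/n = {p:.2f}: t_mm = {t_mm:.3f}, t_jc = {t_jc:.3f}")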
3 The Reversible Model
The previous model dealt with the problem of estimating the evolutionary time for sequence $x$ to evolve into sequence $y$. But usually that is not the case. In practice, we more often see only the end points of the evolutionary process. That is, given two sequences $x, y$, we can assume that both were derived from a common ancestor, but it is less likely that $y$ was derived directly from $x$. We denote the most recent common ancestor of $x$ and $y$ by $MRCA(x, y)$. In order to use the JC properties we need to know the distance (as evolutionary time) from each of the sequences $x$ and $y$ to the root, $MRCA(x, y)$, or at least the ratio between the two branches. Since we know neither, we will demand the reversible property.
Definition 3. (The reversible property) Let $X_1, \ldots, X_n$ be a set of random variables that satisfies the Markov property, i.e., $p(X_i | X_1, \ldots, X_{i-1}) = p(X_i | X_{i-1})$ for all $i \in \{2, \ldots, n\}$. We say that the Markov chain is reversible if there is a probability distribution over the states, $\pi$, such that
\[
\pi_a \cdot p\left(X_i = b \mid X_{i-1} = a\right) = \pi_b \cdot p\left(X_i = a \mid X_{i-1} = b\right)
\]
for all $i \in \{2, \ldots, n\}$ and for all states $a$ and $b$. In the notation of our models:
\[
\pi_a \cdot p\!\left(a \xrightarrow{t} b\right) = \pi_b \cdot p\!\left(b \xrightarrow{t} a\right)
\]
Proposition 4. Given the reversible property, one does not need to know the root $MRCA(x, y)$.
Figure 3.1: An example of the MRCA for two sequences x and y
We are given two sequences $x$ and $y$; assume that sequence $x$ is at state $a$ and $y$ is at state $b$. We define the distance between $MRCA(x, y)$ and $x$ as $t$, and the distance between $MRCA(x, y)$ and $y$ as $s$ (see Figure 3.1). For convenience we write $MRCA(x, y)$ as $z$.
Proof. Now,
\[
p(x = a, y = b) = \sum_c p(x = a, y = b, z = c) = \sum_c p(z = c)\, p(x = a \mid z = c)\, p(y = b \mid z = c) = \sum_c \pi_c\, p\!\left(c \xrightarrow{t} a\right) p\!\left(c \xrightarrow{s} b\right)
\]
where the second equality uses the fact that, given the ancestor $z$, the two lineages evolve independently.
From the reversible property, $\pi_c\, p(c \xrightarrow{t} a) = \pi_a\, p(a \xrightarrow{t} c)$, so
\[
p(x = a, y = b) = \sum_c \pi_a\, p\!\left(a \xrightarrow{t} c\right) p\!\left(c \xrightarrow{s} b\right) = \pi_a\, p\!\left(a \xrightarrow{t+s} b\right),
\]
where the last equality holds because summing over the intermediate state $c$ composes a transition of time $t$ with a transition of time $s$ (the Chapman-Kolmogorov identity). We have thus found a closed formula that does not contain $z$.
Given the sequences $x = x_1, x_2, \ldots, x_n$ and $y = y_1, y_2, \ldots, y_n$, the likelihood in the reversible model is defined as follows:
\[
L(t) = \prod_{i=1}^{n} \pi_{x_i}\, p\!\left(x_i \xrightarrow{t} y_i\right)
\]
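
As a concrete instance, under the JC model the uniform distribution $\pi_a = \frac{1}{4}$ satisfies the reversible property (the JC transition probability is symmetric in $a$ and $b$), so the reversible-model log-likelihood can be sketched as follows (the function name is ours):

import math

def log_likelihood(x, y, t):
    # Reversible-model log-likelihood under JC, where pi_{x_i} = 1/4.
    decay = math.exp(-4.0 * t / 3.0)
    total = 0.0
    for xi, yi in zip(x, y):
        p = 0.25 * (1.0 + 3.0 * decay) if xi == yi else 0.25 * (1.0 - decay)
        total += math.log(0.25 * p)  # log(pi_{x_i} * p(x_i -t-> y_i))
    return total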
In order to find the most likely $t$ for $L(t)$ we will use line search.
Line Search
When we have an optimization problem, the line search strategy is an iterative approach for finding a local maximum. Given a function $f : X \to \mathbb{R}$ where $X \subseteq \mathbb{R}$, we say that three points $x_1 < x_2 < x_3 \in X$ are bracketing a maximum if $f(x_2) \geq f(x_1) \wedge f(x_2) \geq f(x_3)$. Choosing another point $x_4 \in X$ such that $x_1 < x_4 < x_3$, w.l.o.g. say $x_4 < x_2$, provides another bracket of the maximum ($x_1, x_4, x_2$ or $x_4, x_2, x_3$) over a smaller interval of $x$ (see the illustration in Figure 3.2).
Algorithm 1. Line search
input: x1, x2, x3 // assumes that x1, x2, x3 are bracketing a maximum
repeat until x3 − x1 < THRESHOLD
    if x2 − x1 > x3 − x2 then
        set x4 ← α(x1 + x2) // α usually set to 1/2
        if f(x4) ≥ f(x1) ∧ f(x4) ≥ f(x2) then
            set x3 ← x2, x2 ← x4
        else
            set x1 ← x4
    else
        set x4 ← α(x2 + x3) // α usually set to 1/2
        if f(x4) ≥ f(x2) ∧ f(x4) ≥ f(x3) then
            set x1 ← x2, x2 ← x4
        else
            set x3 ← x4
return some point between x1 and x3
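
A direct Python translation of Algorithm 1 with $\alpha = \frac{1}{2}$, returning $x_2$ as the point between $x_1$ and $x_3$ (parameter names and the threshold default are ours):

def line_search(f, x1, x2, x3, threshold=1e-6):
    # Assumes x1 < x2 < x3 bracket a maximum of f.
    while x3 - x1 >= threshold:
        if x2 - x1 > x3 - x2:  # probe the larger half, left of x2
            x4 = 0.5 * (x1 + x2)
            if f(x4) >= f(x1) and f(x4) >= f(x2):
                x3, x2 = x2, x4  # new bracket: x1, x4, x2
            else:
                x1 = x4          # new bracket: x4, x2, x3
        else:                    # probe to the right of x2
            x4 = 0.5 * (x2 + x3)
            if f(x4) >= f(x2) and f(x4) >= f(x3):
                x1, x2 = x2, x4  # new bracket: x2, x4, x3
            else:
                x3 = x4          # new bracket: x1, x2, x4
    return x2

# Example: the maximum of f(x) = -(x - 1.3)^2 is at x = 1.3.
print(line_search(lambda x: -(x - 1.3) ** 2, 0.0, 1.0, 4.0))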
Setting $\alpha = \frac{1}{2}$ has the property that after every two consecutive iterations, $x_3 - x_1$ decreases by at least $\frac{1}{4}$ of its size; therefore this is a logarithmic-time search algorithm. Sometimes, instead of choosing $\alpha = \frac{1}{2}$, one fits the best second-degree polynomial to $f(x_1), f(x_2), f(x_3)$ and sets $x_4$ to its maximum. Such heuristics are often better, since we assume that in the neighborhood of $x_2$, $f(x)$ behaves like a second-degree Taylor polynomial.
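
For the polynomial variant, the vertex of the parabola through the three points has a closed form; a sketch using the standard parabolic-interpolation formula (the function name is ours):

def parabolic_step(x1, f1, x2, f2, x3, f3):
    # Vertex of the parabola through (x1, f1), (x2, f2), (x3, f3);
    # used as x4 instead of the midpoint.
    num = (x2 - x1) ** 2 * (f2 - f3) - (x2 - x3) ** 2 * (f2 - f1)
    den = (x2 - x1) * (f2 - f3) - (x2 - x3) * (f2 - f1)
    return x2 - 0.5 * num / den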
How do we find $x_1, x_2, x_3$ that satisfy the bracketing-a-maximum criterion? We will assume that the function $f$ is unimodal (it has a single local maximum). If $X$ is bounded, then choosing the two endpoints and some point in the middle satisfies our needs. Otherwise, we start at point zero and sample points at exponentially growing distances until we find such a triple.
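
A sketch of the unbounded case, probing points at exponentially growing distances from zero (the step and growth factor are arbitrary choices of ours, and we assume the maximum does not sit below the first probe):

def find_bracket(f, step=1.0, grow=2.0, max_probes=60):
    # Slide a window of three probe points until it brackets a maximum.
    x1, x2, x3 = 0.0, step, step * grow
    for _ in range(max_probes):
        if f(x2) >= f(x1) and f(x2) >= f(x3):
            return x1, x2, x3  # bracketing triple found
        x1, x2, x3 = x2, x3, x3 * grow
    raise RuntimeError("no bracketing triple found")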
4 Conclusion
In order to estimate the distance between two sequences we should perform the following steps:
1. Align the two sequences.
2. Count how many positions in the DNA are conserved and how many differ (i.e., compute $N_{eq}$ and $N_{neq}$).
3. Using the JC model or the reversible model, find $\hat{t}$.
Figure 3.2: Illustration of the cases in the line search algorithm. Image (i) depicts three points bracketing a maximum. Images (ii) and (iii) show how we choose $x_4$ and the new $x_1, x_2, x_3$ values in each case.
How do we align multiple (more than two) sequences? And how do we build the best phylogenetic tree based on the distances? All that and more will be revealed soon.