Bioinformatics Algorithms and Data Structures
Probabilistic Approaches to Phylogeny
Lecturer: Dr. Rose
Slides by: Dr. Rose
April 24, 2003
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology

Probabilistic Approaches to Phylogeny

Principal goal: rank trees according to:
- their likelihood P(data|tree), or
- their posterior probability P(tree|data)

Secondary goal:
- Find the probability of some taxonomic feature
- Example: the grouping of a set of taxa on a branch

Notation and definitions:
Let P(x•|T, t•) denote the probability of a set of data given a tree, where:
- x• denotes n sequences
- T denotes a tree with n leaves, with sequence j at leaf j
- t• denotes the edge lengths of the tree
The definition of P(x•|T, t•) depends on our choice of model of evolution.

Let P(x|y, t) denote the probability that sequence y evolves into x along an edge of length t.
Assume that we can define P(x|y, t). If we can do this for each edge of T, we can calculate the probability of T.
Let’s look at an example.

P(x1, .., x5|T, t•) = P(x1|x4, t1) P(x2|x4, t2) P(x3|x5, t3) P(x4|x5, t4) P(x5)

[Figure: rooted tree with root x5, whose children are x4 (edge t4) and leaf x3 (edge t3); x4 has leaf children x1 (edge t1) and x2 (edge t2).]

There is a small problem: normally, we do not have sequences for the internal nodes.
Solution: calculate P(x1, .., x3|T, t•) by summing over all possible ancestors x4 and x5.
Given this model, we can search for the T and t• that maximize P(x•|T, t•).

Maximizing P(x•|T, t•) requires:
1. Searching over possible tree topologies, T
2.
Searching over possible lengths of edges, t•

Concerning the search over tree topologies:
Q: How many rooted binary trees are there with n leaves?
A: (2n - 3)!! rooted binary trees with n leaves.
- For n = 10 there are ~34 million trees
- For n = 20 there are ~8.2 × 10^21
An efficient search procedure might prove useful.

Probabilistic Models of Evolution

A ridiculously simplistic model of evolution:
1. Every site is independent
2. Deletions and insertions do not occur
3. Substitution accounts for all evolution
Let P(b|a, t) denote the probability of the substitution of residue b for residue a over an edge of length t.
Extending to aligned gapless sequences x and y, P(x|y, t) = Π_u P(x_u|y_u, t), where u indexes the sites.

Consider the substitution matrix, which specifies P(b|a, t). A residue alphabet of size K entails a K-by-K substitution matrix:

$$
S(t) = \begin{pmatrix}
P(A_1|A_1,t) & P(A_2|A_1,t) & \cdots & P(A_K|A_1,t) \\
P(A_1|A_2,t) & P(A_2|A_2,t) & \cdots & P(A_K|A_2,t) \\
\vdots & \vdots & \ddots & \vdots \\
P(A_1|A_K,t) & P(A_2|A_K,t) & \cdots & P(A_K|A_K,t)
\end{pmatrix}
$$

A nice property for substitution matrices is multiplicativity, in the sense that:
S(t)S(s) = S(t + s) for all values of the lengths s and t.
This is equivalent to:
Σ_b P(a|b, t) P(b|c, s) = P(a|c, s + t) for all a, c, s, and t.

Viewing t as a time variable, multiplicativity is a consequence of the nature of the substitution process. The process is:
1. Markovian and
2.
Stationary, i.e., the substitution of a at time t for b at time s depends only on the time interval (t - s)

Example: the Jukes & Cantor [1969] model
In this model all nucleotides undergo substitution at the same rate α, giving the substitution rate matrix R (rows and columns ordered A, C, G, T):

$$
R = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}
$$

The substitution matrix for a short time, S(ε), is approximately I + Rε, where I is the identity matrix:

$$
I + R\varepsilon = \begin{pmatrix}
1 - 3\alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon \\
\alpha\varepsilon & 1 - 3\alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon \\
\alpha\varepsilon & \alpha\varepsilon & 1 - 3\alpha\varepsilon & \alpha\varepsilon \\
\alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon & 1 - 3\alpha\varepsilon
\end{pmatrix}
$$

We want to derive the substitution matrix for time t.
S(t+ε) = S(t)S(ε) ≈ S(t)(I + Rε)     by multiplicativity
S(t+ε) ≈ S(t) + S(t)Rε              multiplying out
S(t+ε) - S(t) ≈ S(t)Rε              subtracting S(t) from both sides
[S(t+ε) - S(t)]/ε ≈ S(t)R           dividing by ε
S′(t) = S(t)R                        as ε → 0

Noting that the rate matrix is symmetrical, we expect something of the following form for S(t):

$$
S(t) = \begin{pmatrix}
r_t & s_t & s_t & s_t \\
s_t & r_t & s_t & s_t \\
s_t & s_t & r_t & s_t \\
s_t & s_t & s_t & r_t
\end{pmatrix}
$$

Substituting our matrix S(t) into S′(t) = S(t)R gives:
r′ = -3αr + 3αs,   s′ = -αs + αr
which are satisfied by:
r_t = ¼(1 + 3e^{-4αt})
s_t = ¼(1 - e^{-4αt})
Substituting r_t and s_t into S(t) gives the Jukes-Cantor model:

$$
S(t) = \frac{1}{4}\begin{pmatrix}
1 + 3e^{-4\alpha t} & 1 - e^{-4\alpha t} & 1 - e^{-4\alpha t} & 1 - e^{-4\alpha t} \\
1 - e^{-4\alpha t} & 1 + 3e^{-4\alpha t} & 1 - e^{-4\alpha t} & 1 - e^{-4\alpha t} \\
1 - e^{-4\alpha t} & 1 - e^{-4\alpha t} & 1 + 3e^{-4\alpha t} & 1 - e^{-4\alpha t} \\
1 - e^{-4\alpha t} & 1 - e^{-4\alpha t} & 1 - e^{-4\alpha t} & 1 + 3e^{-4\alpha t}
\end{pmatrix}
$$

- When t = ∞, r_t = s_t = ¼.
- Hence the nucleotide equilibrium frequencies implied by the Jukes-Cantor model are q_A = q_C = q_G = q_T = ¼.
- This model is too simple even for nucleotide substitution.
Q: What obvious substitution issue is ignored?
A: The difference between transitions and transversions.

Transitions:
- purine ↔ purine, i.e., A ↔ G
- pyrimidine ↔ pyrimidine, i.e., C ↔ T
Transversions:
- purine ↔ pyrimidine: A ↔ C, A ↔ T, G ↔ C, G ↔ T

Kimura [1980] proposed a rate matrix that accounts for the difference between transitions (rate α) and transversions (rate β), with rows and columns ordered A, C, G, T:

$$
R = \begin{pmatrix}
-2\beta - \alpha & \beta & \alpha & \beta \\
\beta & -2\beta - \alpha & \beta & \alpha \\
\alpha & \beta & -2\beta - \alpha & \beta \\
\beta & \alpha & \beta & -2\beta - \alpha
\end{pmatrix}
$$

We can integrate the Kimura rate matrix to give the general time-dependent form:

$$
S(t) = \begin{pmatrix}
r_t & s_t & u_t & s_t \\
s_t & r_t & s_t & u_t \\
u_t & s_t & r_t & s_t \\
s_t & u_t & s_t & r_t
\end{pmatrix}
$$

where:
s_t = ¼(1 - e^{-4βt})
u_t = ¼(1 + e^{-4βt} - 2e^{-2(α+β)t})
r_t = 1 - 2s_t - u_t

This model is still unrealistic:
- The equilibrium frequencies are equal: q_A = q_C = q_G = q_T = ¼.
- This is unrealistic for some taxa, where there is a strong bias in the AT vs GC ratio.

- Moving on to protein sequences, we consider evolution models for amino acids
- Dayhoff et al.
compiled a matrix A_ab:
  - from a hypothetical phylogenetic tree of 71 families
  - a compilation of the frequencies of substitutions between paired residues
  - gives p_ab, the probability of a aligned with b

- Next, a second matrix B_ab is derived from A_ab:
  - B_ab = A_ab / Σ_c A_ac
  - gives the conditional probability p(b|a)
  - this is the short time interval estimate
- Let q_a denote the frequency of occurrence of residue a in a protein, and similarly for q_b.
- The expected number of substitutions in a protein is then Σ_{a,b} q_a q_b B_ab

- A substitution matrix where Σ_{a,b} q_a q_b B_ab = 0.01 is a 1 PAM matrix.
- Dayhoff et al. define a 1 PAM (point accepted mutation) matrix as one for which the expected number of substitutions is 1%.
- Next, B_ab is scaled to produce a 1 PAM matrix of substitution probabilities.

A third matrix C_ab is derived from B_ab:
- C_ab = σ B_ab for a ≠ b
- C_aa = σ B_aa + (1 - σ)
- i.e., scale the off-diagonals by σ and adjust the diagonals to maintain a row sum of 1
- σ is chosen so that C_ab is a 1 PAM matrix
- Let S(1) denote this 1 PAM matrix

Q: How can substitution matrices for longer times be created?
A: Raise S(1) to a power n.
Example:
- S(2) = S(1)S(1)
- Entries: P(a|b, t = 2) = Σ_c P(a|c, t = 1) P(c|b, t = 1)
- These are the substitutions from b to a via any intermediary c

S(n) can be viewed as the result of n steps in a Markov chain with 20 states.
- The 20 states correspond to the 20 amino acids
- Each step in this Markov chain has probability given by S(1).
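As a rough illustration of raising S(1) to a power, the sketch below uses a made-up one-step matrix over a toy 3-letter alphabet instead of the real 20-by-20 PAM data. The matrix `S1` and the helper names `mat_mul` and `mat_pow` are hypothetical, chosen only for this example; row a, column b is assumed to hold P(b|a, t = 1), matching the layout of S(t) above.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(s1, n):
    """Return S(n) = S(1)^n by repeated multiplication (n >= 1)."""
    result = s1
    for _ in range(n - 1):
        result = mat_mul(result, s1)
    return result


# Hypothetical one-step matrix for a toy 3-letter alphabet (NOT real PAM data).
# Row a, column b holds P(b | a, t = 1); each row sums to 1.
S1 = [
    [0.98, 0.01, 0.01],
    [0.02, 0.97, 0.01],
    [0.01, 0.03, 0.96],
]

S2 = mat_pow(S1, 2)        # two steps of the Markov chain
S_far = mat_pow(S1, 2000)  # many steps: rows become nearly identical

for name, matrix in (("S(2)", S2), ("S(2000)", S_far)):
    print(name)
    for row in matrix:
        print("  " + "  ".join("%.4f" % p for p in row))
```

Each entry of `S2` is a sum over the intermediate residue c, exactly as in the S(2) example; after many steps every row of this toy matrix settles to the same distribution.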
Recap: S(1) was defined by
1. normalizing the rows of the symmetric matrix A
2. then rescaling
These operations can be interchanged:
1. rescale
2. then normalize
Note: the rescaled matrix is still symmetrical ⇒ it can be diagonalized

We can thus write S(1) = U D(λ_i) U^{-1}, where
- U is a coordinate transformation and
- D(λ_i) is the diagonal matrix with the eigenvalues λ_1 … λ_20 on the diagonal
- the eigenvalues range between 0 and 1 and can be written λ_i = exp(-μ_i)
- powers of S(1) are represented very simply in the diagonal matrix system:
  S(2) = S(1)S(1) = U D(λ_i) U^{-1} U D(λ_i) U^{-1} = U D(λ_i^2) U^{-1}

For arbitrary t, S(t) = U D(λ_i^t) U^{-1}, i.e.,

$$
S(t) = U \begin{pmatrix}
e^{-\mu_1 t} & 0 & \cdots & 0 \\
0 & e^{-\mu_2 t} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & e^{-\mu_{20} t}
\end{pmatrix} U^{-1}
$$

If A_i is the ith amino acid, then P(A_j|A_i, t) = Σ_k u_ik exp(-μ_k t) v_kj, where u_ik and v_kj are entries in U and U^{-1}, respectively.

Letting t = ∞, we get the PAM matrix:

$$
\begin{pmatrix}
q_{A_1} & q_{A_2} & \cdots & q_{A_{20}} \\
q_{A_1} & q_{A_2} & \cdots & q_{A_{20}} \\
\vdots & \vdots & & \vdots \\
q_{A_1} & q_{A_2} & \cdots & q_{A_{20}}
\end{pmatrix}
$$

where the q_{A_i} are the equilibrium frequencies for the amino acids.

Likelihood for ungapped alignments

Q: What is the simplest tree that we might consider?
A: A tree comprising two leaves, x1 and x2.
Q: Why is this interesting? We already know the topology.
A: We don’t know the edge lengths!

[Figure: root a joined to leaf x1 by an edge of length t1 and to leaf x2 by an edge of length t2.]

Q: How does the likelihood of the tree vary with the edge lengths?
A: That is the question.
Consider a single site with residues x1 and x2.
Assign the variable a to the root as shown below:

[Figure: root a joined to leaf x1 by an edge of length t1 and to leaf x2 by an edge of length t2.]

The probability of
1. having a at the root (we assume the equilibrium distribution), and
2. substituting a by x1, and
3. substituting a by x2
is given by:
P(x1, x2, a|T, t1, t2) = q_a P(x1|a, t1) P(x2|a, t2)
Recall: since we don’t know what residue is at the root, we must sum over all possibilities.

Thus the generalized form of the equation is:
P(x1, x2|T, t1, t2) = Σ_a q_a P(x1|a, t1) P(x2|a, t2)
The generalization of this equation to N sites is:

$$
P(x^1, x^2 \mid T, t_1, t_2) = \prod_{u=1}^{N} P(x^1_u, x^2_u \mid T, t_1, t_2)
$$

where x^j_u denotes the residue at site u of sequence j.

Example: likelihood of two sequences
CCGGCCGCGCG
CGGGCCGGCCG
Q: What is the likelihood of these sequences in our simple tree of two leaves?
For this example, assume the Jukes-Cantor model. We will need:
1. The substitution equations from the previous section.
2. The single-site equation from the previous slide.

Based on:
- the substitution matrix from the previous section,

$$
S(t) = \begin{pmatrix}
r_t & s_t & s_t & s_t \\
s_t & r_t & s_t & s_t \\
s_t & s_t & r_t & s_t \\
s_t & s_t & s_t & r_t
\end{pmatrix}
$$

- the single-site equation from the previous slide,
  P(x1, x2|T, t1, t2) = Σ_a q_a P(x1|a, t1) P(x2|a, t2)

the probability of C occurring in both leaves is:

$$
P(C, C \mid T, t_1, t_2) = q_C\, r_{t_1} r_{t_2} + q_G\, s_{t_1} s_{t_2} + q_A\, s_{t_1} s_{t_2} + q_T\, s_{t_1} s_{t_2}
$$

Q: Why does the first term use the probabilities r_t1 and r_t2 while the other three terms use s_t1 and s_t2?
A: The first term models the case where there is no change.
- In the first term the root residue is C and both leaves are also C, so there is no substitution: r_t1 and r_t2
- The other terms model the substitution of the residue at the root by C at the leaves: s_t1 and s_t2

Recall that the equilibrium distribution for Jukes-Cantor entails q_A = q_C = q_G = q_T = ¼. Hence:

$$
P(C, C \mid T, t_1, t_2) = q_C\, r_{t_1} r_{t_2} + q_G\, s_{t_1} s_{t_2} + q_A\, s_{t_1} s_{t_2} + q_T\, s_{t_1} s_{t_2}
= \tfrac{1}{4}\left(r_{t_1} r_{t_2} + 3 s_{t_1} s_{t_2}\right)
$$

By symmetry, P(G, G|T, t1, t2) = P(C, C|T, t1, t2).
Likewise, P(G, C|T, t1, t2) = P(C, G|T, t1, t2) = ?

$$
P(C, G \mid T, t_1, t_2) = q_C\, r_{t_1} s_{t_2} + q_G\, s_{t_1} r_{t_2} + q_A\, s_{t_1} s_{t_2} + q_T\, s_{t_1} s_{t_2}
= \tfrac{1}{4}\left(r_{t_1} s_{t_2} + s_{t_1} r_{t_2} + 2 s_{t_1} s_{t_2}\right)
$$

Recall r_t and s_t from the previous section:
r_t = ¼(1 + 3e^{-4αt})
s_t = ¼(1 - e^{-4αt})
Substituting gives:

$$
P(C, C \mid T, t_1, t_2) = \tfrac{1}{4}\left(r_{t_1} r_{t_2} + 3 s_{t_1} s_{t_2}\right) = \tfrac{1}{16}\left(1 + 3e^{-4\alpha(t_1 + t_2)}\right)
$$

$$
P(C, G \mid T, t_1, t_2) = \tfrac{1}{4}\left(r_{t_1} s_{t_2} + s_{t_1} r_{t_2} + 2 s_{t_1} s_{t_2}\right) = \tfrac{1}{16}\left(1 - e^{-4\alpha(t_1 + t_2)}\right)
$$

OK, now we have probabilities for single sites.
We derive the probability for sequences by taking the product of the site probabilities, i.e.,

$$
P(x^1, x^2 \mid T, t_1, t_2) = \prod_{u=1}^{N} P(x^1_u, x^2_u \mid T, t_1, t_2)
$$

If there are n_match sites where the sequences match and n_mismatch sites where they mismatch, we get:

$$
P(x^1, x^2 \mid T, t_1, t_2) = \frac{1}{16^{\,n_{\text{match}} + n_{\text{mismatch}}}}
\left(1 + 3e^{-4\alpha(t_1 + t_2)}\right)^{n_{\text{match}}}
\left(1 - e^{-4\alpha(t_1 + t_2)}\right)^{n_{\text{mismatch}}}
$$

Q: How can we get the likelihood for multiple sequences?
Consider a tree T with edge lengths t.
Denote the ancestor of node i by α(i).
Let P(x1, .., xn|T, t) denote the probability of generating residues x1, .., xn at the n leaves.
P(x1, .., xn|T, t) is found by multiplying the probabilities of all edges of the tree:

$$
P(x^1, \dots, x^n \mid T, t) = \sum_{a^{n+1}, a^{n+2}, \dots, a^{2n-1}} q_{a^{2n-1}}
\prod_{i=n+1}^{2n-2} P(a^i \mid a^{\alpha(i)}, t_i)
\prod_{i=1}^{n} P(x^i \mid a^{\alpha(i)}, t_i)
$$

- The sum runs over all possible assignments of residues to the inner nodes.
- The first product gives the probabilities of the inner nodes given the inner nodes’ parents.
- The second product gives the probabilities of the leaves given the leaves’ parents.
Note: this can be computed from the leaves up.

- Let L_k denote the leaves below node k.
- Let P(L_k|a) denote the probability of these leaves given that the residue at k is a.
- Let i and j be the children of k.
- P(L_k|a) is computed from P(L_i|b) and P(L_j|c) for all residues b and c.

[Figure: node k (residue a) with children i (residue b) and j (residue c).]

Felsenstein’s Likelihood Algorithm

Initialization:
  Set k = 2n-1
Recursion: compute P(L_k|a) for all a as follows:
  If k is a leaf:
    Set P(L_k|a) = 1 if a = x_k, otherwise P(L_k|a) = 0
  Else:
    Compute P(L_i|a) and P(L_j|a) for the children i and j of k, for all a
    Set P(L_k|a) = Σ_{b,c} P(b|a, t_i) P(L_i|b) P(c|a, t_j) P(L_j|c)
Termination:
  The likelihood at site u = P(x_u|T, t) = Σ_a P(L_{2n-1}|a) q_a

Finally, the site likelihood values are combined to give the likelihood of the sequences at the leaves. This step assumes independence of sites:

$$
P(x \mid T, t) = \prod_{u=1}^{N} P(x_u \mid T, t)
$$

Likelihood for ungapped alignments

Let’s look at a simple example.
Tree with 3 sequences (only G and C bases):
CCGGCCGCGCG
CGGGCCGGCCG
GCCGCCGGGCC

[Figure: rooted tree whose root is joined to leaf 3 by edge t3 and to inner node 4 by edge t4; node 4 is joined to leaf 1 by edge t1 and to leaf 2 by edge t2.]

Assume the Jukes-Cantor model.
Consider a site where C occurs at all leaves: P(C, C, C|T, t1, t2, t3, t4) = ?

$$
P(C, C, C \mid T, t_1, t_2, t_3, t_4) =
q_C\, r_{t_3}\left(r_{t_4} r_{t_1} r_{t_2} + 3 s_{t_4} s_{t_1} s_{t_2}\right)
+ (q_A + q_G + q_T)\, s_{t_3}\left(r_{t_4} s_{t_1} s_{t_2} + 2 s_{t_4} s_{t_1} s_{t_2} + s_{t_4} r_{t_1} r_{t_2}\right)
$$

1. First term: the equilibrium probability of residue C at the root, covering the cases
   1. C is at all nodes
   2. node 4 does not have residue C
2. Second term: the equilibrium probabilities of the other residues at the root, covering the cases
   1. node 4 has the same residue as the root, i.e., not C
   2. node 4 has a different residue from the root, but not C
   3. node 4 has residue C

With q_A = q_C = q_G = q_T = ¼ this becomes:

$$
\begin{aligned}
P(C, C, C \mid T, t_1, t_2, t_3, t_4)
&= \tfrac{1}{4}\, r_{t_1} r_{t_2}\left(r_{t_3} r_{t_4} + 3 s_{t_3} s_{t_4}\right)
 + \tfrac{3}{4}\, s_{t_1} s_{t_2}\left(2 s_{t_3} s_{t_4} + s_{t_3} r_{t_4} + s_{t_4} r_{t_3}\right) \\
&= \tfrac{1}{4}\left(r_{t_1} r_{t_2} r_{t_3 + t_4} + 3 s_{t_1} s_{t_2} s_{t_3 + t_4}\right)
\end{aligned}
$$

N.B. The sorcery in the last step (the colored boxes on the slide) is a consequence of the multiplicativity of the Jukes-Cantor matrices: r_{t3+t4} = r_t3 r_t4 + 3 s_t3 s_t4 and s_{t3+t4} = 2 s_t3 s_t4 + s_t3 r_t4 + s_t4 r_t3.

Observation: the edges adjoining the root appear only as their sum:
- In the previous example of the tree comprising 2 sequences, the edges t1 and t2 appeared as:

$$
P(C, C \mid T, t_1, t_2) = \tfrac{1}{16}\left(1 + 3e^{-4\alpha(t_1 + t_2)}\right), \qquad
P(C, G \mid T, t_1, t_2) = \tfrac{1}{16}\left(1 - e^{-4\alpha(t_1 + t_2)}\right)
$$

- In this last example, the edges t3 and t4 appeared as:

$$
P(C, C, C \mid T, t_1, t_2, t_3, t_4) = \tfrac{1}{4}\left(r_{t_1} r_{t_2} r_{t_3 + t_4} + 3 s_{t_1} s_{t_2} s_{t_3 + t_4}\right)
$$

- This observation holds for all leaf values, and thus for the total likelihood.
- Idea: collapse t3 and t4 into a single edge, shifting the root to node 4.
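The recursion and the root-edge observation above can be checked numerically. The following is a minimal sketch, not a reference implementation: it assumes the Jukes-Cantor model with an arbitrary rate α = 1, arbitrary edge lengths, and a hypothetical tuple encoding of the tree; all function names are invented for this example. Leaves 1, 2, 3 of the slides are indexed 0, 1, 2 in the code.

```python
import math

ALPHABET = "ACGT"


def jc_prob(b, a, t, alpha=1.0):
    """Jukes-Cantor P(b | a, t): r_t if b == a, otherwise s_t."""
    e = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * e) if a == b else 0.25 * (1.0 - e)


def partial(node, site):
    """Return {a: P(L_k | a)} for the subtree rooted at `node`.

    A leaf is an int index into `site`; an inner node is a tuple
    (left_child, t_left, right_child, t_right).
    """
    if isinstance(node, int):
        return {a: 1.0 if a == site[node] else 0.0 for a in ALPHABET}
    left, t_i, right, t_j = node
    p_i, p_j = partial(left, site), partial(right, site)
    # The double sum over (b, c) factorises into a product of two sums.
    return {
        a: sum(jc_prob(b, a, t_i) * p_i[b] for b in ALPHABET)
        * sum(jc_prob(c, a, t_j) * p_j[c] for c in ALPHABET)
        for a in ALPHABET
    }


def site_likelihood(root, site):
    """Termination step: sum_a q_a P(L_root | a), with q_a = 1/4."""
    return sum(0.25 * p for p in partial(root, site).values())


def make_tree(t1, t2, t3, t4):
    """Root joins node 4 (edge t4) and leaf 3 (edge t3); node 4 joins leaves 1, 2."""
    node4 = (0, t1, 1, t2)
    return (node4, t4, 2, t3)


t1, t2, t3, t4 = 0.1, 0.2, 0.3, 0.4  # arbitrary edge lengths for the check
lik_ccc = site_likelihood(make_tree(t1, t2, t3, t4), "CCC")

# Closed form from the slides: 1/4 (r_t1 r_t2 r_{t3+t4} + 3 s_t1 s_t2 s_{t3+t4}).
r = lambda t: jc_prob("C", "C", t)
s = lambda t: jc_prob("C", "G", t)
closed_ccc = 0.25 * (r(t1) * r(t2) * r(t3 + t4) + 3 * s(t1) * s(t2) * s(t3 + t4))

# Moving the root along the t3 + t4 path should leave the likelihood unchanged.
lik_moved = site_likelihood(make_tree(t1, t2, 0.6, 0.1), "CCC")

# Whole-alignment likelihood: product over sites (independence of sites).
SEQS = ["CCGGCCGCGCG", "CGGGCCGGCCG", "GCCGCCGGGCC"]
total = 1.0
for column in zip(*SEQS):
    total *= site_likelihood(make_tree(t1, t2, t3, t4), "".join(column))

print(lik_ccc, closed_ccc, lik_moved, total)
```

The first three printed values should agree up to rounding, which is the numerical counterpart of the statement that t3 and t4 enter only through their sum.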
[Figure: left, the rooted tree with edges t1, t2, t3, t4 and inner node 4; right, the same tree with the root shifted to node 4, joined directly to leaves 1, 2, 3 by edges t1, t2, t3.]

We now denote the third edge by t3 rather than t3 + t4.
This results in a simpler equation:

$$
P(x^1_u, x^2_u, x^3_u \mid T, t_1, t_2, t_3) = \sum_a q_a\, P(x^1_u \mid a, t_1)\, P(x^2_u \mid a, t_2)\, P(x^3_u \mid a, t_3)
$$

Since our example sequences contain only C and G, the only possible types of terms are:
- CCC or GGG (all residues are the same)
- GGC or CCG
- CGC or GCG
- GCC or CGG

Examples of these equations are:

$$
P(C, C, C \mid T, t_1, t_2, t_3) = \tfrac{1}{4}\left(r_{t_1} r_{t_2} r_{t_3} + 3 s_{t_1} s_{t_2} s_{t_3}\right)
$$

$$
P(C, C, G \mid T, t_1, t_2, t_3) = \tfrac{1}{4}\left(r_{t_1} r_{t_2} s_{t_3} + s_{t_1} s_{t_2} r_{t_3} + 2 s_{t_1} s_{t_2} s_{t_3}\right)
$$

Reversibility & Independence of Root Position

Q: How is it that we were able to collapse t3 and t4 in the previous example?
A: The likelihood of the tree was independent of the position of the root.
If a substitution matrix has the properties of:
1. Multiplicativity and
2. Reversibility, i.e., P(b|a, t) q_a = P(a|b, t) q_b for all a, b, t
then the likelihood is independent of the position of the root.

Can we show that multiplicativity and reversibility imply that root position doesn’t affect likelihood?
Let the children of the root, node 2n-1, be i and j.
Recall: P(L_k|a) is the probability of the leaves below node k given that k is residue a.
In Felsenstein’s algorithm:
- P(L_{2n-1}|a) is computed from P(L_i|b) and P(L_j|c)
- The likelihood of the sequences x at site u is
  P(x_u|T, t) = Σ_a q_a P(L_root|a) = Σ_{a,b,c} q_a P(b|a, t_i) P(c|a, t_j) P(L_i|b) P(L_j|c)
  i.e., we sum over all possible residues appearing at the root and its two children

Taking this equation:
P(x_u|T, t) = Σ_{a,b,c} q_a P(b|a, t_i) P(c|a, t_j) P(L_i|b) P(L_j|c)
and applying reversibility, q_a P(b|a, t_i) = q_b P(a|b, t_i), we get:
P(x_u|T, t) = Σ_{b,c} [Σ_a P(c|a, t_j) P(a|b, t_i)] q_b P(L_i|b) P(L_j|c)
This effectively shifts the root to node i.
Multiplicativity allows us to simplify the inner sum:
Σ_a P(c|a, t_j) P(a|b, t_i) = P(c|b, t_i + t_j)
This reflects the shift of the root from node 2n-1 to node i.

The root can be shifted to any position in the tree.
The likelihood is independent of where the root is.
This is Felsenstein’s so-called ‘pulley principle’.
Q: What is the significance of the ‘pulley principle’?
A: The likelihood search need only consider unrooted trees, provided multiplicativity and reversibility hold.

Likelihood for Inference

We have investigated:
1. Simple evolutionary models (Jukes-Cantor, Kimura)
2. Felsenstein’s algorithm for computing likelihood
Now we look at finding the ‘best’ tree.
Maximum likelihood assumes that the tree with the highest likelihood score is the best tree.
Recall: we must search for the T and t• that maximize the likelihood P(x•|T, t•).

Maximizing P(x•|T, t•) requires:
1. Searching over possible tree topologies, T
2. Searching over possible lengths of edges, t•
For small numbers of sequences, enumerating all trees is OK.
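How small "small" has to be can be gauged by evaluating the (2n - 3)!! count of rooted binary trees directly. A minimal sketch; the function name `rooted_tree_count` is ours, not from the lecture:

```python
def rooted_tree_count(n):
    """(2n - 3)!! = 1 * 3 * 5 * ... * (2n - 3), for n >= 2 leaves."""
    count = 1
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count


for n in (2, 3, 4, 5, 10, 20):
    print(n, rooted_tree_count(n))
```

Python integers are arbitrary precision, so the n = 20 value is computed exactly rather than as a floating-point approximation.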
Recall, however: there are (2n - 3)!! rooted binary trees with n leaves.
- For n = 10 there are ~34 million trees
- For n = 20 there are ~8.2 × 10^21

Approach for small numbers of sequences (2-5):
- Enumerate all trees
- For each tree:
  - The likelihood is explicitly expressed as a function of the edge lengths
  - Use a numerical technique such as Newton’s method to optimize the edge lengths

For larger numbers of sequences:
- There is the problem of tree space search (next section)
- Felsenstein’s algorithm can be used to compute the likelihood
- Edge length optimization can be done by:
  - Felsenstein’s EM algorithm, or
  - a standard optimiser such as conjugate gradients
- Using a conjugate gradient method requires the derivatives of the likelihood with respect to edge length.
- Replace P(y_k|y_α(k), t_k) with ∂P(y_k|y_α(k), t_k)/∂t_k

Instead of maximizing the likelihood, consider a Bayesian approach:
- Given the prior probability P(T, t)
- Calculate the posterior probability P(T, t|x)
Q: How can we do this?
A: Use Bayes’ rule:

$$
P(T, t \mid x) = \frac{P(x \mid T, t)\, P(T, t)}{P(x)}
$$

Q: What does the posterior probability P(T, t|x) express?
A: The probability of the tree and edge lengths given the data.
This approach is easy for small numbers of sequences.
Q: Why is this true?
A: All trees can be enumerated, so P(T, t) is known.

In order to use Bayes’ rule, we need:
1. P(x)
2. P(x|T, t)
3. P(T, t)

1. We can derive P(x) from the evolution model, but we still need to integrate over tree space.
2.
We’ve spent most of today deriving P(x|T, t).
3. We can derive P(T, t) by enumerating all trees.

For larger numbers of sequences, it is not possible to enumerate all trees.
Q: If we can’t enumerate all trees, how can we derive the posterior probability?
A: We sample from the posterior distribution on the space of trees and edge lengths.
Idea: randomly sampling the posterior distribution should produce a sample that reflects the distribution.

In the limit of a large sample, properties with high probability will appear with frequency proportional to their probability.
The frequency of a property is an estimate of its probability.
Q: How do we sample randomly?
A: We can use the Metropolis algorithm, i.e., Markov chain Monte Carlo (MCMC) methodology.
Note: sadly, the semester is virtually over, so there isn’t time to explore MCMC.

Superficially, MCMC samples by creating a sequence of trees as follows:
- A proposal distribution is assumed.
- Each tree is randomly created from the previous tree via the proposal distribution.
- Let P1 = P(T, t|x) be the posterior probability of the current tree.
- Let P2 = P(Ŧ, ŧ|x) be the posterior probability of the proposed new tree.
- If P2 ≥ P1, then the new tree is accepted.
- If P2 < P1, then the new tree is accepted with probability P2/P1.
- If Ŧ is rejected, then T becomes the current tree again.
- This last point is key since we are interested in the frequency, i.e., the distribution.
If the proposal distribution is symmetrical in the sense that
- the probability of proposing Ŧ, ŧ from T, t is the same as
- the probability of proposing T, t from Ŧ, ŧ
then the procedure correctly samples the posterior distribution.

Proposal Distribution

The proposal distribution is the key to the Metropolis algorithm.
Q: What should we expect if we really randomly selected trees from the space of all trees?
A: Small posterior probability, i.e., P(T, t|x) is small. Many proposed trees will be a waste of time.
Q: What should we expect if we select trees that are very close to previous trees?
A: We will explore the space very slowly. Many more steps will be required.
Tuning the proposal distribution is the hardest part of using the Metropolis algorithm.

Mau et al. suggest generating candidate trees by:
1. Adjusting edges
2. Reordering the assignments of sequences to leaves
Adjustment of edges is based on a so-called “traversal profile”:
- Equivalent to the original tree
- A 2D representation of the tree
- Node position:
  - Vertical: equal to the sum of the edges to the node from the root in the tree
  - Horizontal: spaced according to inorder traversal order
- Nodes are numbered by inorder traversal

Below: a tree and the corresponding traversal profile.
Notice: for any internal node k:
- Descendants on the left have numbers smaller than k.
- Descendants on the right have numbers larger than k.

[Figure: a tree with nodes numbered 1-17 by inorder traversal, and its traversal profile.]

Given a traversal profile, construct the tree by:
- The highest node is the root.
- Recurse:
  - The left child is the highest node to the left, but not horizontally past an ancestor.
  - The right child is the highest node to the right, but not horizontally past an ancestor.

[Figure: the traversal profile for nodes 1-17 and the tree reconstructed from it.]

Edge adjustment is effected by:
- Shifting a node’s vertical position up or down by Δ.
- Δ is chosen from a uniform distribution.
- Constructing the tree from the adjusted traversal profile.

[Figure: traversal profile and tree before and after an edge adjustment.]

Observation: edge adjustment
- can produce a new topology
- does not allow non-adjacent leaves to become adjacent.
Branch direction switching accomplishes this:
- Branches are randomly switched at a node.
- This reorders the leaves and results in non-adjacent leaves becoming adjacent.
- This does not change the posterior probability.
- Hence this proposed tree is always accepted.

Branch direction switching example:

[Figure: traversal profile before and after switching the branches at a node.]

Likelihood for Inference

In summary:
Branch direction swapping
- Does not change posterior probabilities
- Does lead to a new part of tree space
Edge adjustment
- Does change posterior probabilities
- Changes vary continuously with the size of the adjustment
- Can change topology, but not always.
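The Metropolis acceptance rule and the uniform shift Δ can be tried out on a much smaller problem than tree space. The sketch below is a one-parameter toy, not the traversal-profile proposal of Mau et al.: it samples only the total edge length t = t1 + t2 of the two-leaf Jukes-Cantor tree for the two example sequences. The rate `ALPHA`, the flat prior on 0 < t < `T_MAX`, the proposal `width`, the starting point and the random seed are all assumptions made for illustration.

```python
import math
import random

X1 = "CCGGCCGCGCG"
X2 = "CGGGCCGGCCG"
ALPHA = 1.0  # assumed substitution rate
T_MAX = 2.0  # assumed flat prior on 0 < t < T_MAX


def log_posterior(t):
    """Unnormalised log posterior of t = t1 + t2 under Jukes-Cantor."""
    if not 0.0 < t < T_MAX:
        return -math.inf  # outside the support of the prior
    match = sum(a == b for a, b in zip(X1, X2))
    mismatch = len(X1) - match
    e = math.exp(-4.0 * ALPHA * t)
    one_minus_e = -math.expm1(-4.0 * ALPHA * t)  # 1 - e, accurate for small t
    return (match * math.log(1.0 + 3.0 * e)
            + mismatch * math.log(one_minus_e)
            - len(X1) * math.log(16.0))


def metropolis(n_steps, width=0.2, t_start=1.0, seed=0):
    """Sample t with a symmetric uniform proposal t' = t + Delta."""
    rng = random.Random(seed)
    t, log_p1 = t_start, log_posterior(t_start)
    samples = []
    for _ in range(n_steps):
        t_new = t + rng.uniform(-width, width)  # Delta ~ Uniform(-width, width)
        log_p2 = log_posterior(t_new)
        # Accept if P2 >= P1, otherwise with probability P2 / P1.
        if log_p2 >= log_p1 or rng.random() < math.exp(log_p2 - log_p1):
            t, log_p1 = t_new, log_p2
        samples.append(t)  # on rejection the current value is counted again
    return samples


samples = metropolis(20000)
kept = sorted(samples[2000:])  # drop an arbitrary burn-in period
print("posterior median of t:", kept[len(kept) // 2])
print("posterior mean of t:", sum(kept) / len(kept))
```

Counting the current value again after a rejection is what makes the visit frequencies reflect the posterior; shrinking or enlarging `width` shows the slow-exploration versus wasted-proposal trade-off discussed under the proposal distribution.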