Phylogenetics

"Inferring Phylogenies", Joseph Felsenstein - excellent reference

What is a phylogeny?

Different Representations
Cladogram - branching pattern only
Phylogram - branch lengths are estimated and drawn proportional to the amount of change along the branch
Rooted - implies directionality of change
Unrooted - does not

How do you root a tree?

What is a phylogeny used for?
(Equation residue; the leading "4 Ne" term is not recoverable. The remaining terms match the average pairwise difference among n sequences.)
$$\pi = \frac{2}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \pi_{ij}$$

Estimate a Phylogeny
Sp1 ACCGTCTTGTTA
Sp2 AGCGTCATCAAA
Sp3 AGCGTCATCAAA
Sp4 ACCGTCTTGATA
Sp5 AGCCTCTTCATA

Working tree, built up over successive slides (figures; only the node labels survive):
Step 1: sp2 sp1 c2 sp3 sp4 sp5
Step 2: sp2 sp1 c2 c4 sp4 sp3 sp5
Step 3: sp2 sp1 c2 c7 c4 sp4 sp3 sp5
Step 4: sp2 sp1 c2 c9 sp4 c7 c4 sp3 sp5
Step 5: sp1 sp2 c10 c2 c9 sp4 c7 c4 sp3 sp5
Final tree: sp1 sp2 c10 c2 c9 sp4 c7 c11 c4 sp3 sp5

What optimality criteria do we use then?
Parsimony
Likelihood
Bayesian
Distance methods?

Parsimony
Why should we choose a specific grouping?
Maximum parsimony: we should accept the hypothesis that explains the data most simply and efficiently.
"Parsimony is simply the most robust criterion for choosing between competing scientific hypotheses. It is not a statement about how evolution may or may not have taken place." (1)
1. Kitching, I. J., Forey, P. L., Humphries, C. J. & Williams, D. M. 1998. Cladistics: The Theory and Practice of Parsimony Analysis. The Systematics Association Publication No. 11.

Parsimony
Optimality criterion that chooses the topology with the fewest transformations of character states
Optimizes one component: tree topology (pattern based)
Most parsimonious tree: the one (or several) with the minimum number of evolutionary changes (shortest tree length)

Reconstructing trees via sequence data
Character matrix:
     1  2  3  4  5  6
O    T  G  T  A  A  T
A    A  A  T  G  A  G
B    A  G  C  C  -  G
C    A  A  T  G  A  T
D    A  G  C  C  -  T

Changes mapped onto the tree (figure):
1. T=>A
2. G=>A
3. T=>C
4. A=>G and A=>C
5. A=>gap
6. T=>G (twice, on the branches to A and to B)
Tree length = 8

Neighbor-joining Method
NJ distance matrices (figures, four slides)
Finished NJ tree (figure)

Models of Evolution
Pyrimidines: T, C
Purines: A, G
Transitions - changes within a class (A<=>G, C<=>T)
Transversions - changes between a purine and a pyrimidine

Maximum Likelihood
Base frequencies: $f_A + f_G + f_C + f_T = 1$
Base exchange: $f_s + f_v = 1$
R-matrix: $r_1 + r_2 + r_3 + r_4 + r_5 + r_6 = 1$ (six exchange rates; the slide's own symbols were lost, so $r_1 \dots r_6$ are placeholders)
Gamma shape parameter
Number of discrete gamma-distribution categories
Pinvar: $f_{var} + f_{inv} = 1$
Likelihood: $L = \prod_i l_i$, where i indexes each character (site)

Maximum Likelihood
$L = \Pr(D \mid H)$
Tree (figure): tips A, G, G, C, G; internal nodes x, y, z, w (w is the root); branch lengths $t_1$ to $t_8$
$$L^{(i)} = \sum_{\text{all } x, y, z, w} \Pr(A, G, G, C, G \text{ via } x, y, z, w)$$
$$L^{(i)} = \sum_{x} \sum_{y} \sum_{z} \sum_{w} \Pr(w)\,\Pr(z; w, t_8)\,\Pr(x; w, t_7)\,\Pr(y; z, t_6)\,\Pr(A; x, t_1)\,\Pr(G; x, t_2)\,\Pr(G; z, t_3)\,\Pr(C; y, t_4)\,\Pr(G; y, t_5)$$

ML cont.
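$$L = \prod_{i=1}^{n} L^{(i)}$$
Under the Jukes-Cantor model, the probability that a nucleotide that starts as i is still i at time t is given by
$$P_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4t/3}$$
and the probability that the nucleotide at time t is j, j ≠ i, is given by
$$P_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4t/3}$$

Bayes Theorem
The conditional probability of H given D:
$$\text{Prob}(H \mid D) = \frac{\text{Prob}(H)\,\text{Prob}(D \mid H)}{\text{Prob}(D)}$$
H = hypothesis, D = data
Prob(H | D) - posterior probability
Prob(H) - prior probability, or marginal probability of H
Prob(D | H) - likelihood function
Prob(D) - prior probability, or marginal probability of D: $\text{Prob}(D) = \sum_{H} P(H)\,P(D \mid H)$
Normalizing constant: ensures $\sum_{H} P(H \mid D) = 1$

Take Home Message
Likelihood: represents the probability of the data given the hypothesis => difficult to interpret
Bayes approach: estimates the probability of the hypothesis given the data => estimates the probability for the hypothesis of interest

Bayesian Inference of Phylogeny
Notation (Greek symbols reconstructed; the slide's own were lost): $\tau_i$ is the i-th tree topology, $\nu_i$ its branch lengths, $\theta$ the substitution-model parameters, X the data, and B(s) the number of possible trees for s taxa.
$$f(\tau_i \mid X) = \frac{f(\tau_i)\, f(X \mid \tau_i)}{\sum_{j=1}^{B(s)} f(\tau_j)\, f(X \mid \tau_j)}$$
Calculating the posterior probability of a tree involves a summation over all possible trees and, for each tree, integration over all combinations of branch lengths and substitution-model parameter values:
$$f(\tau_i, \nu_i, \theta \mid X) = \frac{f(\tau_i, \nu_i, \theta)\, f(X \mid \tau_i, \nu_i, \theta)}{\sum_{j=1}^{B(s)} \int_{\nu} \int_{\theta} f(\tau_j, \nu_j, \theta)\, f(X \mid \tau_j, \nu_j, \theta)\, d\nu\, d\theta}$$
Inferences of any single parameter are based on the marginal distribution of the parameter:
$$f(\tau_i \mid X) = \frac{\int_{\nu} \int_{\theta} f(\tau_i, \nu_i, \theta)\, f(X \mid \tau_i, \nu_i, \theta)\, d\nu\, d\theta}{\sum_{j=1}^{B(s)} \int_{\nu} \int_{\theta} f(\tau_j, \nu_j, \theta)\, f(X \mid \tau_j, \nu_j, \theta)\, d\nu\, d\theta}$$
This marginal probability distribution of the topology, for example, integrates out all the other parameters.
Advantage: the power of the analysis is focused on the parameter of interest (i.e., the topology of the tree)

Estimating phylogenies
Exhaustive searches
Branch and bound methods
Rise in computational time versus rise in solution space

How many topologies are there?
For rooted bifurcating trees with n taxa:
$$T = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$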
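As a rough illustration of how fast the number of topologies grows, the sketch below counts labelled, strictly bifurcating trees using the standard double-factorial products. The function names are hypothetical and chosen only for this example; multifurcating trees are not counted.

```python
from math import prod


def unrooted_topologies(n: int) -> int:
    """Unrooted bifurcating topologies for n >= 3 labelled taxa: 1 * 3 * ... * (2n - 5)."""
    if n < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    return prod(2 * i - 5 for i in range(3, n + 1))


def rooted_topologies(n: int) -> int:
    """Rooted bifurcating topologies for n >= 2 labelled taxa: 1 * 3 * ... * (2n - 3)."""
    if n < 2:
        raise ValueError("need at least 2 taxa for a rooted tree")
    return prod(2 * i - 3 for i in range(2, n + 1))


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, unrooted_topologies(n), rooted_topologies(n))
```

Rooting an unrooted tree on any of its branches is equivalent to adding one more taxon, so the rooted count for n taxa equals the unrooted count for n + 1 taxa. Python integers are arbitrary precision, so the same functions work for hundreds of taxa.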
The Phylogenetic Problem
Number of trees for T sequences:
$$B(T) = \prod_{i=3}^{T} (2i - 5)$$

Number of Seqs    Number of Trees
10                2 x 10^6
100               2 x 10^182
1,000             2 x 10^2,860
10,000            8 x 10^38,658
100,000           1 x 10^486,663
1,000,000         1 x 10^5,866,723

HIV-1 Whole Genomes 1993 - 15
HIV-1 Whole Genomes 2003 (JAN) - 397

Tree Space - the final frontier

Heuristic Searches
Nearest-neighbor interchanges (NNI) - swap two adjacent branches on the tree
Subtree pruning and regrafting (SPR) - remove a branch from the tree (either an interior or an exterior branch) together with the subtree attached to it. The subtree is then reinserted into the remaining tree in all possible places.
Tree bisection and reconnection (TBR) - an interior branch is broken, and the two resulting fragments of the tree are considered as separate trees. All possible connections are made between a branch of one and a branch of the other.

Other approaches
Tree-fusing - find two near-optimal trees and exchange subgroups between the two trees
Genetic Algorithms - a simulation of evolution with a genotype that describes the tree and a fitness function that reflects the optimality of the tree
Disc Covering - upcoming paper

Phylogenetic Accuracy?
Consistency - a phylogenetic method is consistent for a given evolutionary model if the method converges on the correct tree as the data available to the method become infinite.
Efficiency - statistical efficiency is a measure of how quickly a method converges on the correct solution as more data are applied to the problem.
Robustness - the degree to which violations of assumptions affect the performance of phylogenetic methods.

How reliable is MY phylogeny?
Bootstrap Analysis
Jackknife Analysis
Posterior Probabilities (Bayesian Approaches)
Decay Indices

Bootstrap Pseudoreplicates
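A minimal sketch of how bootstrap pseudoreplicates can be generated, assuming the usual nonparametric bootstrap in which alignment columns are resampled with replacement. It uses the five-species alignment from the "Estimate a Phylogeny" slides; the function name, the seed and the choice of 100 replicates are illustrative assumptions, not something the slides specify.

```python
import random

# Alignment from the "Estimate a Phylogeny" slides.
alignment = {
    "Sp1": "ACCGTCTTGTTA",
    "Sp2": "AGCGTCATCAAA",
    "Sp3": "AGCGTCATCAAA",
    "Sp4": "ACCGTCTTGATA",
    "Sp5": "AGCCTCTTCATA",
}


def bootstrap_pseudoreplicate(aln, rng):
    """Resample whole alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(aln.values())))
    # The same column indices are applied to every sequence, so each sampled
    # site keeps its character pattern across species.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in aln.items()}


rng = random.Random(1)  # fixed seed (arbitrary) so that runs are repeatable
replicates = [bootstrap_pseudoreplicate(alignment, rng) for _ in range(100)]

if __name__ == "__main__":
    for name, seq in replicates[0].items():
        print(name, seq)
```

Each pseudoreplicate would then be analysed with the same tree-building method as the original data, and the proportion of replicate trees that contain a given group is reported as that group's bootstrap support.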