Choosing among models of sequence evolution

Maximum likelihood (R. A. Fisher) compares hypotheses (H1 and H2) given the observed data (D). The likelihood ratio statistic (L or Λ) is

  L = L1/L2 = Pr(D|H1) / Pr(D|H2)

so the log-likelihood statistic is

  ln L = ln(L1/L2) = ln L1 - ln L2

and 2 ln L is approximately χ²-distributed, with degrees of freedom equal to the difference in the number of parameters between the hypotheses.

From the homework: we fit different models and found that distances and branch lengths differed among models. How do we objectively choose which model is "best"?

Likelihood curve for 11 coin tosses

L = Pr(D|H) is the (joint) probability of the data D given the hypothesis H. H may be a tree, a branch length, or a model parameter; D is the alignment of nucleotide sequences. Experimental result: 5 heads in 11 tosses, so the parameter estimate (probability of heads) is p̂ = 5/11 = 0.454, while H0: p = 0.500 (a fair coin).

We want to know how likely our parameter estimate is given H0. Is the likelihood difference between the estimate (p̂ = 0.454) and the hypothesis (p = 0.500) statistically significant?

Likelihood surface: more than one parameter

With two parameters, e.g. p(inv) and alpha, the likelihood is a two-dimensional surface. How do the two parameters covary?
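The coin-toss likelihood curve can be reproduced numerically. A minimal sketch using only the Python standard library; the grid of p values and the toss counts come from the slide's example, everything else is illustrative:

```python
import math

def binom_lik(p, heads=5, tosses=11):
    """Likelihood L = Pr(D | p) for `heads` successes in `tosses` coin flips."""
    return math.comb(tosses, heads) * p**heads * (1 - p)**(tosses - heads)

# Scan a grid of p values: the curve peaks at the MLE, p-hat = 5/11 = 0.4545...
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=binom_lik)

# Likelihood-ratio statistic for H0: p = 0.5 (fair coin) vs. the MLE.
lr = 2 * (math.log(binom_lik(5 / 11)) - math.log(binom_lik(0.5)))
# lr is small (about 0.09), far below the 3.84 chi-square cutoff,
# so 5 heads in 11 tosses does not reject a fair coin.
```

This mirrors the slide's point: the MLE sits at the peak of the likelihood curve, and the likelihood difference from H0 is what the ratio test measures.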
Furthermore, most models are multidimensional surfaces with more than two parameters. These are harder to visualize, but the math is exactly the same: what combination of parameters results in the highest likelihood (lowest -lnL)?

ML has two distinct steps

1. Model parameter estimation: maximum likelihood estimates (MLEs), conditioned on a "good" tree. (Today.)
2. Tree estimation: uses the MLEs to find the ML tree. (Later.)

Site likelihoods for 2 models

For example: are base frequencies equal? That is a central assumption of JC69 and K80. Likelihoods are calculated at each site for each model using the same tree (e.g. a pretty good NJ tree); only the models vary. (Swofford showed that parameter estimates are quite stable across many trees that are not the ML tree but are close to right.)

F81 assumes unequal base frequencies: -lnL = 35046.7, estimated freq(G) = 0.125.

F81+I additionally supposes lots of constant (invariant) sites: -lnL = 33309.6, with 44.2% invariant sites estimated. Note that the estimate of the base frequencies changes with the added parameter: estimated freq(G) = 0.11.

Log likelihood statistic

F81: -lnL1 = 35047, df = 3. F81+I: -lnL2 = 33310, df = 4.

As above, ln L = ln(L2/L1) = ln L2 - ln L1, and 2 ln L is approximately χ²-distributed with degrees of freedom equal to the difference in the number of parameters between the hypotheses. Here 2(lnL2 - lnL1) = 2 × 1737.1 = 3474.2, with a df difference of 1. If there were no significant difference, we would expect χ² (1 df) ≤ 3.84 at p ≤ 0.05.
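The likelihood-ratio arithmetic can be checked in a few lines. A sketch using only the standard library; the -lnL values are the ones reported above, and the closed form erfc(sqrt(x/2)) is the standard 1-df chi-square tail probability:

```python
import math

def lrt_stat(neg_lnl_simple, neg_lnl_complex):
    """2(lnL_complex - lnL_simple), written in terms of the reported -lnL values."""
    return 2 * (neg_lnl_simple - neg_lnl_complex)

def chi2_sf_df1(x):
    """Chi-square (1 df) tail probability: P(X >= x) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2))

# Reported fits: F81 gives -lnL = 35046.7 (df 3), F81+I gives -lnL = 33309.6 (df 4).
stat = lrt_stat(35046.7, 33309.6)  # 2 * 1737.1 = 3474.2
p = chi2_sf_df1(stat)              # astronomically small: F81+I fits far better
```

The cutoff quoted on the slide checks out: chi2_sf_df1(3.84) is about 0.05.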
Since 3474.2 is far greater than the 1-df χ² cutoff of 3.84, F81+I fits much better than F81.

Similarly, we can account for other features of the sequence alignment: GTR+I+G gives -lnL = 31120.67 with df = 10. Comparing it with F81+I: 2(ΔlnL) = 2 × 2183 = 4366, df difference = 7; again far beyond the χ² cutoff.

Similarly: site likelihoods for 2 trees. Later we will compare the fit of two or more trees under our estimated model of sequence evolution (say GTR+G) to choose the optimal tree.

Model selection software

- jModelTest (Posada, 2008)
- TOPALi v2
- DT-ModSel (Minin, Abdo, Joyce, and Sullivan, 2003): the preferred performance-based method of Sullivan and Joyce. More in later lectures.

An abbreviated example

We use SeaView and a small primate data set, running ML with PhyML (a fast maximum likelihood program). We generate the models and compare them.

GTR vs GTR+I: lnL = -5728 vs -5719, a lnL difference of 9 with a df difference of 3. The χ² cutoff with 3 df is 7.8, and 2 × 9 = 18 exceeds it, so GTR+I is a significantly better fit. (Poorer-fitting models in the same analysis gave lnL = -5946 and -5759.)

GTR+I+G vs HKY+I+G: HKY+I+G is a significantly poorer fit. But note: both give the same topology estimate, and the same branch length estimates.

But GTR+I+G is more complex: search time is much longer, it is less stable (sometimes failing to converge), and it has less power to detect the true tree. Consider accepting a slightly poorer fit if the tree is nearly as good.

Measuring tree support: the bootstrap (nonparametric)

The nonparametric bootstrap is used in statistics to set confidence levels when the data distribution is unknown (Efron, 1979), e.g. when the data are not normally distributed: a pseudosample provides the variation estimate. For trees, sequences are resampled 1000 times (at least); a tree search on each of the 1000 replicated sequences yields 1000 trees; the bootstrap consensus is the majority-rule consensus of those 1000 trees, with branches labeled by their % occurrence among the 1000 trees. (Felsenstein, J. 1985. Evolution 39: 783-791.)
The lack of fit is apparently due to systematic error that may not affect the goodness of fit for the tree, but this may be hard to justify to reviewers.

The bootstrap was adapted for phylogenies by Felsenstein in 1985 and is the most commonly used measure of tree support. It applies to MP, distance, and ML methods; it is unnecessary for Bayesian methods.

Tree confidence: the bootstrap replicate

A bootstrap replicate resamples, with replacement, all sites along the alignment; some sites are sampled twice or more and some not at all. A collection of trees is built from the replicated data sets. Count the number of times each partition occurs in all the trees: any partition that occurs in more than 50% of the trees shows up in the majority-rule consensus tree, the "bootstrap tree". Notice that it is the sequences that are bootstrapped, not the original tree. (Note the multiple-tests problem. Note also that high bootstrap support will be misleading if model assumptions are violated; the bootstrap is not a way to check the model.)

Bootstrap convention for published trees: what is published is usually the maximum likelihood tree with bootstrap values added afterwards (often in a word processor), since branch lengths cannot be represented in the consensus tree itself.

[Figure: a published bootstrap tree of mustelids, pinnipeds, other carnivores, perissodactyls, artiodactyls, and cetaceans, with bootstrap percentages (50-100) labeling the nodes.]

Pseudosampling the sequence.
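The resampling step itself is simple enough to sketch in a few lines of Python (the toy alignment and the function name are my own, for illustration):

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement.

    The replicate has the same length as the original alignment: some
    columns appear twice or more, others not at all.
    """
    n_sites = len(alignment[0])
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in cols) for seq in alignment]

# Toy alignment: rows are taxa, columns are sites.
aln = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
rep = bootstrap_replicate(aln, random.Random(1))
```

In a real analysis each replicate would be fed to the tree-search program, and the resulting trees summarized by majority-rule consensus.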
The idea: if many sites support a clade, then some of them will appear in most random replicates; if only a few sites support it, many replicates will lack support for the clade. The bootstrap samples with replacement, so each replicate has the same length as the original sequence: some sites are randomly sampled multiple times while others are randomly omitted. Rare but informative sites are therefore only rarely sampled and do not show up in all bootstrap trees, so clades supported by just a very few sites will not be resolved. This is the point: high bootstrap values show that many sites support the clade.

How do we search tree space?

Search algorithms find the best tree for the data. (For the homework we will use SeaView to try the bootstrap for distances and for parsimony, and to try different model parameters using maximum likelihood tests, for ML and MP trees. Figures are from Felsenstein's Inferring Phylogenies.)

Global vs heuristic search

Two methods are guaranteed to find the globally best tree:

- Exhaustive search: evaluate every single tree.
- Branch and bound: discarding one partial tree usually means the whole set of trees built from it can be discarded too, so the algorithm is much faster than exhaustive search.

How many possible trees? Too many for computational tractability: most data sets are large, and there is no possibility of exploring all trees. Currently few packages even offer the guaranteed methods (an exception is PAUP*4). The goal is to balance a thorough search against tractability and speed.

Heuristic search strategies: branch swapping

1. NNI (nearest neighbor interchange): dissolve an interior branch and form each alternative arrangement of the subtrees around it.
2. SPR (subtree pruning and regrafting): break a subtree off and reconnect its "root" somewhere else (any other branch).
3. TBR (tree bisection and reconnection): break a branch to split the tree into two, then reattach the halves using a different pair of branches.

Which branch swapping to use? The neighborhoods nest: NNI --> SPR --> TBR. But all can get stuck on local optima (local maximums rather than the global maximum).
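To make "too many trees" concrete: the number of distinct unrooted binary trees on n taxa is the double factorial (2n - 5)!!, a standard result (see Felsenstein's Inferring Phylogenies). A quick sketch:

```python
def n_unrooted_trees(n_taxa):
    """Number of distinct unrooted binary trees on n_taxa leaves: (2n - 5)!!"""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # 3 * 5 * 7 * ... * (2n - 5)
        count *= k
    return count

# 4 taxa: 3 trees; 10 taxa: about 2 million; 20 taxa: over 10^20.
# Hence exhaustive search is hopeless for realistic data sets.
```

Even at a million trees per second, 20 taxa would take millions of years to enumerate, which is why heuristic branch swapping is the norm.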
NNI and SPR are subsets of TBR. The three are increasingly accurate (TBR is best) and decreasingly fast (TBR is slowest). PAUP* has all three methods; PhyML has just NNI and SPR.

Traveling through tree space

We travel through tree space by accepting increasingly better trees (hill climbing), whether the target is a maximum likelihood or a maximum parsimony tree. But all of these methods can get stuck on local optima, so the starting tree influences the outcome: if you start on the slope of the wrong peak, you end up at a local maximum rather than the global maximum.

Two effective methods:

1. Sequential addition of taxa. Felsenstein's recommendation: add the best-resolved taxa first, then add increasingly unresolved relationships. This requires knowledge of the biology (oh dear!): you must know what the question is, and you must limit the question you are asking.
2. Stepwise addition from multiple starting points. Start at multiple random trees to increase the chance of covering tree space, giving a better chance of reaching the global maximum.

Fastest: start with an MP or NJ distance tree. Use a quick method and hope it lands near the global maximum. No guarantees, but usually OK.
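The local-optimum problem and the multiple-starts remedy can be illustrated on a toy one-dimensional "score surface" (everything here is invented for illustration; real tree space is discrete, but the logic is the same):

```python
import math
import random

def score(x):
    """Toy likelihood surface: a local peak near x = -2, the global peak near x = +2."""
    return math.exp(-(x + 2) ** 2) + 2 * math.exp(-(x - 2) ** 2)

def hill_climb(start, step=0.01, n_steps=2000):
    """Greedy hill climbing: move to a neighbor only if it improves the score."""
    x = start
    for _ in range(n_steps):
        best = max((x - step, x, x + step), key=score)
        if best == x:          # no neighbor improves: a (possibly local) optimum
            break
        x = best
    return x

# A single start on the wrong slope climbs the nearby hill and gets stuck near -2.
stuck = hill_climb(-3.0)

# Multiple random starts cover more of the space; the best result lands near +2.
rng = random.Random(0)
restarts = [hill_climb(rng.uniform(-4, 4)) for _ in range(10)]
best = max(restarts, key=score)
```

The single run from -3 halts at the lower peak; among the ten restarts, at least one start falls in the basin of the global peak and reaches it.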