Accurate gene phylogeny across multiple complete genomes
Species Informed Distance-based Reconstruction
Matt Rasmussen and Manolis Kellis

The goal
Determine the evolutionary history of every gene in multiple complete genomes
From phylogenies determine:
• Orthologs
• Paralogs
• Duplications
• Losses
• Family expansions
• Varying rates of evolution
• Etc…

Contrast of the phylogenetic method with alternative methods
1. Pair-wise sequence comparison
– Best bi-directional BLAST hits
– Focuses on one-to-one orthologs (no duplications)
2. Hit clustering methods
– Detect clusters in graph of pair-wise hits
– Difficulty to separate large connected components
3. Synteny methods
– Detect conserved regions, stretches of nearby hits
– Genome alignment methods focus on best hits
4. Phylogenetic methods
– Phylogeny of family clusters orthologs near each other
– Traditionally applied to specific families
– Can they be applied genome-wide?

What is the accuracy of current phylogenetic methods?
Tricky question:
• Requires knowing the correct phylogeny by an independent means
• Previously: simulation
• Or avoid accuracy and focus on robustness (bootstrap)

What is the accuracy of current phylogenetic methods?
• Use synteny to determine phylogeny by an independent means
• Trees found by Max Likelihood (PHYML)
[Figure label: Matches species topology]

What is the accuracy of current phylogenetic methods?
Phylogenies across 5154 syntenic one-to-one orthologs
[Figure labels: Matches species topology; 316 other topologies; Etc…]

Reconstruction accuracy dependent on gene sequence length

Accuracy of current phylogenetic methods
• Average gene is too short
• Too few phylogenetically informative characters
• To make progress, must use additional information
• Current algorithms ignore species
• Designed for solving the species tree problem
• Whole genomes change the game
• We can assume species tree is known
• We would like to solve the gene tree problem
• Our approach:
• Design an algorithm specifically for the gene tree problem
• Key insight: use species tree to inform the gene tree reconstruction

What is the connection between species and gene evolution?
[Figure: 5154 gene trees; scale bars: 1.0 sub/site]

Correlation between branch lengths
[Figure panels: Total tree length; Relative branch lengths. Axis labels as extracted: "asp branch lengths", "Mer branch lengths". r = 0.957]

Correlation between branch lengths
[Figure: Average gene tree]
93% of trees have a correlation greater than 0.8 with the average gene tree

Effect of normalization on branch correlation
[Figure axis labels: dvir, dana]

Effect of normalization on branch length distribution
• Absolute branch lengths: gamma distributed
• Relative branch lengths: normally distributed

A new model for gene family evolution: Two forces
1. Family rates: Fj ~ gamma(a, b)
2. Species-specific rates: Sij ~ normal(ui, sij)
bij = Fj * Sij

Effects that we have seen are consequences of this model
bij = Fj * Sij
• Total tree length Lj of one-to-one trees is proportional to family rate Fj
• If species rates have small standard deviations we expect branch correlation

The standard deviation of every species-specific rate is nearly ¼ of the mean

What is the meaning of the species-specific rate?
• The normal is partly due to error in estimating evolutionary distance
• If we fit normals only on long sequences, the standard deviation goes down
• Species-specific means are not affected by sequence length

All of these effects also hold for 17 fungi and 4 mammals
12 Flies
• Absolute branches distributed by gamma
• Relative branches distributed by normal
• 3 < Mean / sdev < 4
17 Fungi
• Absolute branches distributed by gamma
• Relative branches distributed by normal
• 3 < Mean / sdev < 4

A new strategy for gene tree reconstruction
• Traditional Maximum Likelihood methods
– Propose many topologies
– For each topology: calculate the likelihood of seeing such a tree
– Return tree that achieves max likelihood
• We show that one can calculate the likelihood of a tree being generated by our model
• Thus, we can create our own phylogenetic algorithm that uses species information to reconstruct gene trees.
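
The two-force model above (bij = Fj * Sij) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names, the species means, and the gamma parameters are hypothetical, and (a, b) is read as shape and scale, which the slides do not specify. The standard deviations are set to a quarter of each mean, following the slide on species-specific rates.

```python
import random

def simulate_family(species_means, species_sdevs, a, b, rng):
    """Draw one gene family's branch lengths under b_ij = F_j * S_ij."""
    # Family rate F_j: a single gamma draw shared by all branches of the family.
    # Assumption: (a, b) are shape and scale, as in random.gammavariate.
    family_rate = rng.gammavariate(a, b)
    branches = []
    for mean, sdev in zip(species_means, species_sdevs):
        # Species-specific rate S_ij: one normal draw per species branch.
        species_rate = rng.gauss(mean, sdev)
        branches.append(family_rate * species_rate)
    return family_rate, branches

def relative_lengths(branches):
    """Normalize branch lengths by total tree length (removes the family rate)."""
    total = sum(branches)
    return [length / total for length in branches]

# Hypothetical species means; sdev is 1/4 of the mean.
means = [0.1, 0.2, 0.3, 0.4]
sdevs = [m / 4.0 for m in means]
rng = random.Random(0)
for _ in range(3):
    rate, lengths = simulate_family(means, sdevs, 2.0, 0.5, rng)
    print(round(rate, 3), [round(x, 3) for x in relative_lengths(lengths)])
```

Absolute lengths vary widely between families because of the shared gamma factor, while the relative lengths stay close to the normalized species means, which is the branch correlation the slides describe. A normal draw can in principle be negative; a real implementation would need to handle that case.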
Likelihood calculation: simple case
INPUT: a distance matrix with all pairwise distances between genes
• Propose a topology
• Fit branch lengths to topology
• Estimate family rate
• Normalize tree
• Reconcile gene tree to species tree
• Determines actual path of evolution through species tree
• Algorithms exist to do this fast (linear time)
• Compare branch lengths to distributions
• Allows us to calculate a likelihood for every branch
• Because branches are independent, likelihood of tree is product of branch likelihoods
[Figure: gene tree with branch likelihoods Pa, Pb, Pc, Pd, Pe, Pf. Every branch is highly likely; tree is highly likely]

Likelihood calculation: complex case
• Propose another topology
• This one differs only by rooting
• Most branches have the same length (just a different name)
• w = e (human)
• x = c (rat)
• y = d (mouse)
• z = b (rodent)
• Two branches are now merged
• v = a + f (dog/hmr)
• Reconcile gene tree to species tree
[Figure: original tree with branch likelihoods Pa to Pf next to the new tree over Rat, Mouse, Human, Dog with branch likelihoods Px, Py, Pz, Pw, Pv]
• Mouse and rat branches have the same likelihood as before
• Px = Pc
• Py = Pd
• Same distribution for dog, but now dog branch is too long. Why?
• v = a + f
• Pv < Pf
• Branch w goes from Eutherian to Human (two species branches, w1 and w2)
• Which distribution should we use?
• The distribution is the sum of two independent normals
w = w1 + w2 ~ N(u1, s1^2) + N(u2, s2^2) = N(u1 + u2, s1^2 + s2^2)
• Branch w is too short, Pw < Pe
• Same case for z
• Two species branches (z1 and z2)
• Distribution is sum of two independent normals
z = z1 + z2 ~ N(u1, s1^2) + N(u2, s2^2) = N(u1 + u2, s1^2 + s2^2)
• Branch z is too short
• Pz < Pb
• Some branches are less likely; tree is less likely

Bringing it all together
• It turns out we can find the likelihood of any tree by breaking it down into one of three cases
• Main advantage: do not explicitly penalize dup/loss
– Only ensure branch lengths are close to what we expect given our model

Example of reconstructing tree with dup/loss: hemoglobin genes
[Figure: Hemoglobin alpha and Hemoglobin beta subtrees, each with leaves D, H, M, R. This is now the correct topology]

Example of reconstructing tree with dup/loss
[Figure: branch likelihoods Px, Py, Pz, Pw, Pv over Rat, Mouse, Human, Dog; gene-tree branches z, w, v above the Hemoglobin alpha and Hemoglobin beta subtrees. All branches are highly likely; tree is highly likely]
• Branch z is now longer
• Branch w is now longer
• Branch v is just the right length

Evaluation: Datasets
• Real datasets
– 5154 syntenic one-to-ones from 12 flies
– 739 syntenic one-to-ones from 17 fungi
– 200 neighboring fly orthologs
– 220 whole genome duplicates in 7 yeasts
• Simulated (using our gene family model)
– More complex events
[Figure panels: Neighboring orthologs; WGD trees (klac, kwal, agos; sbay, smik, spar, scer)]

Evaluation
Apply genome-wide for 17 fungi
• Cluster genes
• Build alignment for each cluster
• Build tree for each alignment
• Reconcile to species tree to determine all duplications and losses

General trees follow the model we learned from one-to-one trees

GO enrichment in top 50 trees with most duplications
term    pval
plasma membrane    -1.50E-11
helicase activity    4.72E-12
ammonium transporter activity    1.46E-09
telomere maintenance via recombination    1.66E-09
DNA helicase activity    4.11E-09
transport    4.40E-07
transporter activity    4.80E-07
membrane    1.27E-06
alcohol dehydrogenase activity    6.29E-06
nitrogen utilization    6.29E-06
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism    6.29E-06
alpha-1,3-mannosyltransferase activity    6.29E-06
alcohol dehydrogenase (NADP+) activity    3.84E-05
magnesium ion transport    3.84E-05
sodium ion transport    3.84E-05
basic amino acid transporter activity    3.84E-05
lysophospholipase activity    3.84E-05
nuclear nucleosome    4.17E-05
cellular component unknown    0.000129
translational elongation    0.000139
oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor    0.00015
translation elongation factor activity    0.000231
oxidoreductase activity    0.000353

GO enrichment in top 50 trees with most gene losses
term    pval
helicase activity    4.49E-12
telomere maintenance via recombination    1.60E-09
DNA helicase activity    3.95E-09
GTPase activity    7.11E-09
ubiquitin conjugating enzyme activity    6.79E-08
translational elongation    4.24E-07
alcohol dehydrogenase activity    6.17E-06
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism    6.17E-06
alpha-1,3-mannosyltransferase activity    6.17E-06
translation elongation factor activity    9.27E-06
alcohol dehydrogenase (NADP+) activity    3.78E-05
1,3-beta-glucan synthase activity    3.78E-05
sodium ion transport    3.78E-05
IMP dehydrogenase activity    3.78E-05
ribosome    4.35E-05
protein serine/threonine phosphatase activity    0.000139
oxidoreductase activity, acting on the CH-OH group of donors, NAD or NADP as acceptor    0.000148
structural constituent of cytoskeleton    0.00018
alpha-glucosidase activity    0.00036
fermentation    0.00036
proteasome complex (sensu Eukaryota)    0.00036
transmembrane receptor activity    0.00036
protein amino acid O-linked glycosylation    0.000505
ammonium transporter activity    0.000701

GO enrichment in top 50 trees with most genes
term    pval
helicase activity    1.55E-11
telomere maintenance via recombination    3.83E-09
DNA helicase activity    1.05E-08
plasma membrane    2.84E-08
1,3-beta-glucanosyltransferase activity    7.94E-08
transporter activity    1.45E-07
transport    1.71E-07
pyruvate decarboxylase activity    2.09E-06
protein amino acid O-linked glycosylation    2.29E-06
Golgi apparatus    7.73E-06
alcohol dehydrogenase activity    1.01E-05
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism    1.01E-05
alpha-1,3-mannosyltransferase activity    1.01E-05
GTPase activity    1.57E-05
cyclin-dependent protein kinase holoenzyme complex    1.70E-05
alpha-1,2-mannosyltransferase activity    2.95E-05
regulation of glycogen biosynthesis    2.95E-05
cytosine-purine permease activity    5.51E-05
sodium ion transport    5.51E-05
basic amino acid transporter activity    5.51E-05
lysophospholipase activity    5.51E-05
membrane    0.000159
cell wall (sensu Fungi)    0.000386

GO enrichment in top 10 trees with most genes
term    pval
DNA helicase activity    3.79E-13
telomere maintenance via recombination    4.61E-13
helicase activity    8.05E-13
ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism    5.92E-08
alcohol dehydrogenase activity    5.92E-08
sodium ion transport    1.15E-06
fermentation    1.13E-05
calcium-transporting ATPase activity    0.00011
NADPH dehydrogenase activity    0.00011
transport    0.000146
oxidoreductase activity    0.000178
membrane    0.000235
alcohol dehydrogenase (NADP+) activity    0.000328
alcohol metabolism    0.000652
transporter activity    0.000735
plasma membrane    0.000846
multidrug transport    0.001078

Supplemental figure
# Duplications vs rel sub/site for each species branch

Orthologs and paralogs
[Figure: gene tree over human, mouse, rat, dog, rabbit with orthologs and paralogs marked]
• Orthologs arise by speciation
– typically keep same function
• Paralogs arise by duplication
– typically take on new functions

Likelihood calculation: complex case
[Figure: original tree with branch likelihoods Pa to Pf next to the re-rooted tree over Rat, Mouse, Human, Dog with branch likelihoods Px, Py, Pv. Every branch of the original tree is highly likely; tree is highly likely]
w = e (human)
x = c (rat)
y = d (mouse)
z = b (rodent)
v = a + f (dog/hmr)
• Mouse and rat branches have the same likelihood as before
• Px = Pc
• Py = Pd
• Dog is now too long. Why?
• v = a + f
• Pv < Pf
• Human is now too short, because it must now cross an extra species

Figure 4
a. Gene-tree with correct topology scores highly
b. Gene-tree with incorrect topology scores poorly
Figure 4. Gene-tree evaluation with a richer species-tree model
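
The branch-by-branch scoring walked through in the likelihood slides can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and parameter values are hypothetical, branch lengths are assumed to be already normalized by the family rate, and the duplication and loss cases are left out. A gene-tree branch that spans several species branches is scored against the sum of the corresponding independent normals, and the tree score is the product of branch likelihoods (a sum in log space).

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def branch_loglik(rel_length, species_path, params):
    """Score one gene-tree branch that spans the given species branches.

    params maps a species-branch name to (mean, sdev) of its relative rate.
    Independent normals add: means sum and variances sum.
    """
    mean = sum(params[s][0] for s in species_path)
    var = sum(params[s][1] ** 2 for s in species_path)
    return normal_logpdf(rel_length, mean, var)

def tree_loglik(branches, params):
    """Branches are treated as independent, so log-likelihoods add."""
    return sum(branch_loglik(length, path, params) for length, path in branches)

# Hypothetical parameters for the two species branches w1 and w2.
params = {"w1": (0.1, 0.025), "w2": (0.2, 0.05)}
right_length = branch_loglik(0.3, ["w1", "w2"], params)  # matches u1 + u2
too_short = branch_loglik(0.2, ["w1", "w2"], params)     # fits w2 alone
print(right_length > too_short)
```

With these made-up numbers, a branch whose length fits only one species branch scores lower once reconciliation forces it to span two, which is the "branch w is too short" situation in the slides.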