Advanced Methods in Reconstructing Phylogenetic Relationships 2009
EMBO World Practical Course: March 16th to 22nd, 2009, Botanical Garden, Rio de Janeiro

Darwin’s letter to Thomas Huxley 1857
• The time will come I believe, though I shall not live to see it, when we shall have fairly true genealogical (phylogenetic) trees of each great kingdom of nature

[Figure: Haeckel’s pedigree of man]

Aims of the course:
• To introduce the theory and practice of phylogenetic inference from molecular data
• To introduce some of the most useful methods and computer programmes
• To encourage a critical attitude to data and its analysis

Some definitions

Owen’s definition of homology
• Homologue: the same organ under every variety of form and function (true or essential correspondence)
• Analogy: superficial or misleading similarity
Richard Owen 1843

Darwin and homology
• “The natural system is based upon descent with modification ... the characters that naturalists consider as showing true affinity (i.e. homologies) are those which have been inherited from a common parent, and, in so far as all true classification is genealogical; that community of descent is the common bond that naturalists have been seeking”
Charles Darwin, Origin of species 1859 p. 413

Homology is...
• Homology: similarity that is the result of inheritance from a common ancestor
• The identification and analysis of homologies is central to phylogenetic systematics

Phylogenetic systematics
• Sees homology as evidence of common ancestry
• Uses tree diagrams to portray relationships based upon recency of common ancestry
• Monophyletic groups (clades) - contain species which are more closely related to each other than to any outside of the group

Cladograms and phylograms
[Figure: two trees of Bacterium 1-3 and Eukaryote 1-4]
• Cladograms show branching order - branch lengths are meaningless
• Phylograms show branch order and branch lengths

Rooting using an outgroup
[Figure: an unrooted tree of archaea and eukaryotes, and the same taxa rooted by a bacterial outgroup, with the root and two monophyletic groups labelled]

What kind of data?
[Figure: Fossil skulls - family tree for humans]
• Microbial morphologies - some are complex but many are simple - for example look at a drop of lake water

Linus Pauling: Molecules as documents of evolutionary history
• “We may ask the question where in the now living systems the greatest amount of information of their past history has survived and how it can be extracted”
• “Best fit are the different types of macromolecules (sequences) which carry the genetic information”

Small subunit ribosomal RNA: 18S or 16S rRNA

An alignment involves hypotheses of positional homology between bases or amino acids

<---------------(--------------------HELIX 19---------------------)
<---------------(22222222-000000-111111-00000-111111-0000-22222222
Thermus ruber    UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGA
Th. thermophilus UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGA
E.coli           UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGA
Ancyst.nidulans  UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGA
B.subtilis       UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGA
Chl.aurantiacus  UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGA
match            **        ***        * ** ** *                 **

Alignment of 16S rRNA sequences from different bacteria

Exploring patterns in sequence data 1:
• Which sequences should we use?
• Do the sequences contain phylogenetic signal for the relationships of interest? (might be too conserved or too variable)
• Are there features of the data which might mislead us about evolutionary relationships?

Is there a molecular clock?
• The idea of a molecular clock was initially suggested by Zuckerkandl and Pauling in 1962
• They noted that rates of amino acid replacements in animal haemoglobins were roughly proportional to time - as judged against the fossil record

Rate Heterogeneity
[Figure: Rates of amino acid replacement in different proteins]

There is no universal molecular clock
• The initial proposal saw the clock as a Poisson process with a constant rate
• Now known to be more complex - differences in rates occur for:
– different sites in a molecule
– different genes
– different regions of genomes
– different genomes in the same cell
– different taxonomic groups for the same gene
• There is no universal molecular clock

Small subunit ribosomal RNA: 18S or 16S rRNA

Failure To Accommodate Rate Heterogeneity Can Lead To Problems When Making Trees

Unequal rates in different lineages may cause problems for phylogenetic analysis
• Felsenstein (1978) made a simple model phylogeny including four taxa and a mixture of short and long branches
[Figure: TRUE TREE and WRONG TREE for four taxa A, B, C and D, with branch lengths p and q, p > q]
• All methods are susceptible to “long branch” problems
• Methods which do not assume that all sites change at the same rate are generally better at recovering the true tree

Chaperonin 60 Protein Maximum Likelihood Tree
(PROTML, Roger et al. 1998, PNAS 95: 229)
• Bootstrap values are a common way of assessing support for relationships
[Figure: tree with the longest branches marked]

High bootstrap values can be misleading: adding a single new sequence
[Figure: two trees of largely the same taxa; Spironucleus barkhanus appears in only one of them. Taxa: Cucurbita sp., Arabidopsis thaliana, Plasmodium falciparum, Dictyostelium discoideum, Spironucleus barkhanus, Giardia lamblia, Entamoeba histolytica, Trichomonas vaginalis, Drosophila melanogaster, Homo sapiens, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Trypanosoma brucei, Euglena gracilis, Holospora obtusa, Ehrlichia sp., Ehrlichia chaffeensis, Rickettsia tsutsugamushi, Rhizobium meliloti, Bartonella bacilliformis, Bradyrhizobium japonicum, Caulobacter crescentus, Rhodobacter sphaeroides, Pseudomonas aeruginosa, Escherichia coli, Chromatium vinosum, Neisseria gonorrhoeae, Chlamydia trachomatis, Treponema pallidum, Thermus thermophilus]

A proposal for three domains of life (Woese, Kandler and Wheelis 1990, PNAS 87: 4576)
• Concatenated LSU+SSU rRNA analyzed using a standard (GTR plus gamma*2) model
[Figure: The 3-domains tree of life, with the two longest branches marked. Labels: eukaryotes, archaebacteria, eocyte archaebacteria, bacteria. Cox et al. 2008, PNAS]

The same RNA data analyzed using better models (Cox et al. 2008)
[Figure: trees labelled eukaryotes, eocytes, other archaebacteria and bacteria, with support values 0.75 and 0.95. Models: NDCH (GTR+g+2cv)*2, heterogeneous across tree; CAT model]

Saturation in sequence data:
• Saturation is due to multiple changes at the same site subsequent to lineage splitting
• Most data will contain some fast evolving sites which are potentially saturated (e.g. in protein-coding genes, often codon position 3)
• In severe cases the data becomes essentially random and all information about relationships can be lost

Multiple changes at a single site - hidden changes
Seq 1  AGCGAG
Seq 2  GCGGAC
[Figure: histories of a single site in Seq 1 and Seq 2 involving 1, 2 or 3 changes; only the final states are observed]

Exploring patterns in sequence data
• Do sequences manifest biased base compositions (e.g. thermophilic convergence) or biased codon usage patterns which may obscure phylogenetic signal?

A case study in phylogenetic analysis: Deinococcus and Thermus
• Deinococcus are radiation resistant bacteria
• Thermus are thermophilic bacteria
• BUT:
– Both have the same very unusual cell wall based upon ornithine
– Both have the same menaquinones (Mk 9)
– Both have the same unusual polar lipids
• Congruence between these complex characters supports a phylogenetic relationship between Deinococcus and Thermus

% Guanine + Cytosine in 16S rRNA genes from mesophiles and thermophiles

                            %GC all sites   %GC variable sites
Thermophiles:
  Thermotoga maritima            62                72
  Thermus thermophilus           64                72
  Aquifex pyrophilus             65                73
Mesophiles:
  Deinococcus radiodurans        55                52
  Bacillus subtilis              55                50

Shared nucleotide or amino acid composition biases can also cause problems for phylogenetic analysis
[Figure: True tree and Wrong tree from 16S rRNA for Aquifex (73% G+C), Thermus (72%), Deinococcus (52%) and Bacillus (50%)]

The correct tree can be obtained if a model is used which allows base/aa composition to vary between sequences
– LogDet/Paralinear Distances
– Heterogeneous Maximum Likelihood
[Figure: tree of Aquifex, Bacillus, Thermus (72%) and Deinococcus (52% G+C)]

Gene trees and species trees
[Figure: a gene tree for genes a, b, c drawn within a species tree for species A, B, C]
• We often assume that gene trees give us species trees

Orthologues and paralogues
[Figure: an ancestral gene and its descendant copies a, b, c and A, B, C, with pairs labelled orthologous or paralogous; the sampled copies A*, b* and C* are marked]
• Duplication to give 2 copies on the same genome = paralogues of each other
• A mixture of orthologues and paralogues sampled

The malic enzyme gene tree contains a mixture of orthologues and paralogues
[Figure: malic enzyme gene tree with a marked gene duplication, bootstrap values (75, 97, 100) and compartment labels (Mit, Ch, Cyt, Hyd; Plant chloroplast, Plant mitochondrion). Taxa: Ascaris suum, Zea mays, Anas platyrhynchos (Anas = a duck!), Homo sapiens 1, Homo sapiens 2, Flaveria trinervia, Populus trichocarpa, Solanum tuberosum, Amaranthus, Neocallimastix, Trichomonas vaginalis, Giardia lamblia, Schizosaccharomyces, Saccharomyces, Lactococcus lactis]

Summary:
• There may be conflicting patterns in data which can potentially mislead us about evolutionary relationships
• Our methods of analysis need to be able to deal with the complexities of sequence evolution and to recover any underlying phylogenetic signal
• Some methods may do this better than others depending on the properties of individual data sets
• All trees are simply hypotheses!

Phylogenetic analysis requires careful thought
• Phylogenetic analysis is frequently treated as a black box into which data are fed (often gathered at considerable cost) and out of which “The Tree” springs
• (Hillis, Moritz & Mable 1996, Molecular Systematics)
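
Worked example: hidden changes and saturation
The saturation slides note that multiple changes at the same site hide part of the evolutionary history, so the number of differences we can count between two sequences underestimates the number of changes that occurred. The sketch below illustrates this with the two six-base sequences from the hidden-changes slide. It assumes the Jukes-Cantor model, which the slides do not name and which is used here only because it is the simplest correction; the function names are hypothetical.

```python
import math


def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return differences / len(seq1)


def jukes_cantor_distance(p: float) -> float:
    """Estimated substitutions per site under Jukes-Cantor (an assumed model)."""
    if p >= 0.75:
        # Two random DNA sequences differ at about 75% of sites, so at or
        # beyond this point the correction is undefined: the data are saturated.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Sequences taken from the "hidden changes" slide.
seq1 = "AGCGAG"
seq2 = "GCGGAC"

p = p_distance(seq1, seq2)      # 4 of 6 sites differ
d = jukes_cantor_distance(p)

print(f"observed differences per site: {p:.3f}")   # 0.667
print(f"estimated changes per site:    {d:.3f}")   # 1.648
```

With 4 of 6 sites differing, the observed distance is 0.667 but the corrected estimate is about 1.65 changes per site: under this model more than half of the changes are hidden. Six sites are far too few for a real estimate, so the numbers only illustrate the direction and size of the effect.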