Advanced Methods in
Reconstructing Phylogenetic
Relationships
2009 EMBO World Practical Course:
March 16th to 22nd, 2009, Botanical Garden, Rio de Janeiro
Darwin’s letter to
Thomas Huxley 1857
• The time will come I
believe, though I shall not
live to see it, when we
shall have fairly true
genealogical (phylogenetic)
trees of each great
kingdom of nature
Haeckel’s pedigree of man
Aims of the course:
• To introduce the theory and practice of
phylogenetic inference from molecular
data
• To introduce some of the most useful
methods and computer programmes
• To encourage a critical attitude to data
and its analysis
Some definitions
Richard Owen
Owen’s definition of homology
• Homologue: the same organ under every
variety of form and function (true or
essential correspondence)
• Analogy: superficial or misleading similarity
Richard Owen 1843
Charles Darwin
Darwin and homology
• “The natural system is based upon descent with
modification .. the characters that naturalists
consider as showing true affinity (i.e. homologies)
are those which have been inherited from a common
parent, and, in so far as all true classification is
genealogical; that community of descent is the
common bond that naturalists have been seeking”
Charles Darwin, Origin of species 1859 p. 413
Homology is...
• Homology: similarity that is the result
of inheritance from a common ancestor - the identification and analysis of
homologies is central to phylogenetic
systematics
Phylogenetic systematics
• Sees homology as evidence of common
ancestry
• Uses tree diagrams to portray relationships
based upon recency of common ancestry
• Monophyletic groups (clades) - contain
species which are more closely related to
each other than to any outside of the group
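The definition of a monophyletic group can be made concrete with a small sketch in Python. It assumes a rooted tree written as nested tuples; the tree, the taxon names and the helper names (`leaves`, `is_monophyletic`) are hypothetical and not part of the course material.

```python
def leaves(tree):
    """Return the set of tip labels below a node (a str tip or a tuple of subtrees)."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some node of the rooted tree has exactly `group` as its tips."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical rooted tree: a bacterial outgroup, then archaea and eukaryotes.
rooted = ("Bacterium", (("Archaeon 1", "Archaeon 2"), ("Eukaryote 1", "Eukaryote 2")))

print(is_monophyletic(rooted, {"Eukaryote 1", "Eukaryote 2"}))  # True
print(is_monophyletic(rooted, {"Archaeon 2", "Eukaryote 1"}))   # False
```

The check depends on where the tree is rooted, so the same set of taxa can pass under one rooting and fail under another.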
Cladograms and phylograms
Cladograms show branching order - branch lengths are meaningless
Phylograms show branch order and branch lengths
[Figure: the taxa Bacterium 1 to 3 and Eukaryote 1 to 4 drawn as a cladogram and as a phylogram]
Rooting using an outgroup
[Figure: an unrooted tree of archaea and eukaryotes, and the same taxa rooted by a bacteria outgroup. The rooted tree marks the root and labels the archaea and the eukaryotes each as a monophyletic group]
What kind of data?
Fossil skulls
Family tree for
humans
Microbial morphologies - some are complex but
many are simple - for example look at a drop
of lake water:
Linus Pauling
Molecules as documents of
evolutionary history
• “We may ask the question where in the now
living systems the greatest amount of
information of their past history has survived
and how it can be extracted”
• “Best fit are the different types of
macromolecules (sequences) which carry the
genetic information”
Small subunit ribosomal RNA
18S or 16S rRNA
An alignment involves hypotheses of
positional homology between bases or
amino acids
<---------------(--------------------HELIX 19---------------------)
<---------------(22222222-000000-111111-00000-111111-0000-22222222
Thermus ruber    UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGA
Th. thermophilus UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGA
E.coli           UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGA
Ancyst.nidulans  UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGA
B.subtilis       UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGA
Chl.aurantiacus  UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGA
match            **        ***        * ** ** *                 **
Alignment of 16S rRNA sequences from different bacteria
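Because an alignment is a set of hypotheses of positional homology, each column can be examined on its own. The sketch below, in Python, finds the columns at which all six helix 19 fragments carry the same base. The function name `conserved_columns` is hypothetical, and treating "-" and "=" purely as segment delimiters is an assumption about the slide's notation.

```python
def conserved_columns(sequences, separators="-="):
    """Return 0-based indices of columns where every sequence has the same base."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same aligned length")
    conserved = []
    for i in range(length):
        column = {seq[i] for seq in sequences}
        # Skip '-' and '=', assumed here to only delimit the helix segments.
        if len(column) == 1 and not (column & set(separators)):
            conserved.append(i)
    return conserved


# The six aligned 16S rRNA fragments from the slide.
helix19 = [
    "UCCGAUGC-UAAAGA-CCGAAG=CUCAA=CUUCGG=GGGU=GCGUUGGA",  # Thermus ruber
    "UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGA",  # Th. thermophilus
    "UCAGAUGU-GAAAUC-CCCGGG=CUCAA=CCUGGG=AACU=GCAUCUGA",  # E.coli
    "UCUGUUGU-CAAAGC-GUGGGG=CUCAA=CCUCAU=ACAG=GCAAUGGA",  # Ancyst.nidulans
    "UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGA",  # B.subtilis
    "UCGGCGCU-GAAAGC-GCCCCG=CUUAA=CGGGGC=GAGG=CGCGCCGA",  # Chl.aurantiacus
]

print(conserved_columns(helix19))
```

Columns that never vary carry no information about branching order, and columns that vary in every sequence may be saturated, so a count like this is only a first look at the data.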
Exploring patterns in sequence data 1:
• Which sequences should we use?
• Do the sequences contain phylogenetic
signal for the relationships of interest?
(might be too conserved or too variable)
• Are there features of the data which
might mislead us about evolutionary
relationships?
Is there a molecular clock?
• The idea of a molecular clock was
initially suggested by Zuckerkandl and
Pauling in 1962
• They noted that rates of amino acid
replacements in animal haemoglobins
were roughly proportional to time - as
judged against the fossil record
Rate Heterogeneity
Rates of amino acid replacement in
different proteins
There is no universal molecular
clock
• The initial proposal saw the clock as a Poisson
process with a constant rate
• Now known to be more complex - differences in
rates occur for:
– different sites in a molecule
– different genes
– different regions of genomes
– different genomes in the same cell
– different taxonomic groups for the same gene
• There is no universal molecular clock
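The original clock idea, a Poisson process with a constant rate, can be sketched as a small simulation in Python. The rate, time span and number of replicates are illustrative values only, and `simulate_substitutions` is a hypothetical helper. Under a constant rate the mean count is close to rate × time and the variance is about the same size; the sources of rate variation listed above are what break this expectation in real data.

```python
import random
from statistics import mean, variance


def simulate_substitutions(rate, time, rng):
    """Count events of a constant-rate Poisson process in the interval [0, time]."""
    count = 0
    elapsed = rng.expovariate(rate)  # waiting time to the first substitution
    while elapsed <= time:
        count += 1
        elapsed += rng.expovariate(rate)
    return count


rng = random.Random(1)   # fixed seed so the run is repeatable
rate, time = 0.5, 20.0   # illustrative values only
counts = [simulate_substitutions(rate, time, rng) for _ in range(2000)]

print("expected substitutions:", rate * time)
print("simulated mean:", round(mean(counts), 2))
print("simulated variance:", round(variance(counts), 2))
```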
Small subunit ribosomal RNA
18S or 16S rRNA
Failure To Accommodate Rate
Heterogeneity Can Lead To Problems
When Making Trees
Unequal rates in different lineages may
cause problems for phylogenetic analysis
• Felsenstein (1978) made a simple model phylogeny including
four taxa and a mixture of short and long branches
[Figure: the TRUE TREE for the four taxa A, B, C and D, with branch lengths p and q where p>q, and the WRONG TREE that may be recovered instead]
• All methods are susceptible to “long branch” problems
• Methods which do not assume that all sites change at the
same rate are generally better at recovering the true tree
Chaperonin 60 Protein Maximum Likelihood Tree
(PROTML, Roger et al. 1998, PNAS 95: 229)
Bootstrap values are a common way of assessing support for relationships
High bootstrap values can be misleading - adding a single new sequence
[Figure: two sets of tip labels for the Chaperonin 60 tree, with the longest branches marked. Both sets span eukaryotes (for example Giardia lamblia, Trichomonas vaginalis, Entamoeba histolytica, Homo sapiens) and bacteria (for example Escherichia coli, Rickettsia tsutsugamushi, Thermus thermophilus); Spironucleus barkhanus appears in only one of them]
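Bootstrap support is obtained by resampling alignment columns with replacement and rebuilding the tree from each pseudo-replicate. The Python sketch below shows only the resampling step; tree building is left out, and the toy alignment and the name `bootstrap_alignment` are invented for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same aligned length")
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


# Toy alignment, invented for illustration.
toy = {"taxon_A": "ACGTACGT", "taxon_B": "ACGTTCGA", "taxon_C": "ACCTACGA"}
rng = random.Random(42)
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
print(replicates[0])
```

Because every replicate is drawn from the same columns, any systematic bias in the data is resampled too. That is one way a high bootstrap value can sit on a wrong branch.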
A proposal for three domains of
life
(Woese, Kandler and Wheelis 1990 PNAS 87, 4576)
Concatenated LSU+SSU rRNA analyzed
using a standard (GTR plus gamma*2) model
[Figure: the 3-domains tree of life, labelled eukaryotes, archaebacteria, eocyte, archaebacteria and bacteria, with the two longest branches marked. Cox et al. 2008. PNAS]
The same RNA data analyzed using better
models (Cox et al. 2008)
[Figure: tree labelled eukaryotes, eocytes, other archaebacteria and bacteria, with support values 0.75 and 0.95. Model labels: NDCH (GTR+g+2cv)*2, heterogeneous across tree; CAT model]
Saturation in sequence data:
• Saturation is due to multiple changes at the
same site subsequent to lineage splitting
• Most data will contain some fast evolving sites
which are potentially saturated (e.g. in
protein-coding genes, often codon position 3)
• In severe cases the data become essentially
random and all information about relationships
can be lost
Multiple changes at a single site
- hidden changes
Seq 1  AGCGAG
Seq 2  GCGGAC
[Figure: one site traced along the lineages leading to Seq 1 and Seq 2, with the states C, G, T and A and the number of changes (1, 2, 3) marked]
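Hidden changes mean that counting differences between two sequences underestimates the number of changes that actually occurred. One common correction, not named in the slides and used here only as an illustration, is the Jukes-Cantor distance d = -(3/4) ln(1 - 4p/3), where p is the proportion of sites that differ. The Python sketch below applies it to Seq 1 and Seq 2; the function names are hypothetical.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


def jukes_cantor_distance(p):
    """Jukes-Cantor estimate of substitutions per site from an observed proportion p."""
    if p >= 0.75:
        return math.inf  # fully saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("AGCGAG", "GCGGAC")
print("observed differences per site:", round(p, 3))
print("Jukes-Cantor corrected distance:", round(jukes_cantor_distance(p), 3))
```

As p approaches 0.75, the value expected for random nucleotide sequences under this model, the corrected distance grows without bound. This is the numerical face of saturation.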
Exploring patterns in sequence
data
• Do sequences manifest biased base
compositions (e.g. thermophilic
convergence) or biased codon usage
patterns which may obscure
phylogenetic signal?
A case study in phylogenetic analysis:
Deinococcus and Thermus
• Deinococcus are radiation resistant bacteria
• Thermus are thermophilic bacteria
– BUT:
– Both have the same very unusual cell wall
based upon ornithine
– Both have the same menaquinones (Mk 9)
– Both have the same unusual polar lipids
• Congruence between these complex characters
supports a phylogenetic relationship between
Deinococcus and Thermus
% Guanine + Cytosine in 16S rRNA
genes from mesophiles and thermophiles
                            %GC all sites   %GC variable sites
Thermophiles:
  Thermotoga maritima            62               72
  Thermus thermophilus           64               72
  Aquifex pyrophilus             65               73
Mesophiles:
  Deinococcus radiodurans        55               52
  Bacillus subtilis              55               50
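The %G+C values in a comparison like this are simple base counts. The Python sketch below computes them for two of the 16S rRNA helix 19 fragments from the alignment shown earlier. These are short fragments, so the numbers will not match the whole-gene figures, and `gc_percent` is a hypothetical helper.

```python
def gc_percent(sequence):
    """Percentage of G and C among the A, C, G, U/T characters of a sequence."""
    bases = [b for b in sequence.upper() if b in "ACGUT"]
    if not bases:
        raise ValueError("sequence contains no nucleotides")
    gc = sum(b in "GC" for b in bases)
    return 100.0 * gc / len(bases)


fragments = {
    "Th. thermophilus": "UCCCAUGU-GAAAGA-CCACGG=CUCAA=CCGUGG=GGGA=GCGUGGGA",
    "B.subtilis": "UCUGAUGU-GAAAGC-CCCCGG=CUCAA=CCGGGG=AGGG=UCAUUGGA",
}
for name, seq in fragments.items():
    print(name, round(gc_percent(seq), 1))
```

Restricting the count to variable sites, as in the second column, would need the full alignment to decide which sites vary.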
Shared nucleotide or amino acid composition biases
can also cause problems for phylogenetic analysis
[Figure: the true tree and the wrong tree from 16S rRNA for Aquifex (73%), Bacillus (50%), Thermus (72%) and Deinococcus (52% G+C)]
The correct tree can be obtained if a
model is used which allows base/aa
composition to vary between
sequences:
– LogDet/Paralinear Distances
– Heterogeneous Maximum Likelihood
[Figure: tree of Aquifex, Bacillus, Thermus and Deinococcus]
Gene trees and species trees
[Figure: a gene tree (genes a, b, c) shown together with a species tree (species A, B, C)]
We often assume that gene trees give us
species trees
Orthologues and paralogues
[Figure: an ancestral gene undergoes duplication to give 2 copies on the same genome = paralogues of each other. The gene copies (a, b, c and A, B, C) are labelled orthologous or paralogous; the starred copies (A*, b*, C*) show a mixture of orthologues and paralogues sampled]
The malic enzyme gene tree contains a
mixture of orthologues and paralogues
[Figure: malic enzyme gene tree with a gene duplication marked and bootstrap values from 75 to 100. Tip labels: Ascaris suum Mit, Zea mays Ch, Homo sapiens 2, Homo sapiens 1 Cyt, Anas platyrhynchos Cyt (Anas = a duck!), Flaveria trinervia Ch, Populus trichocarpa Ch, Solanum tuberosum Mit, Amaranthus Mit, Neocallimastix, Trichomonas vaginalis Hyd, Giardia lamblia Cyt, Schizosaccharomyces, Saccharomyces, Lactococcus lactis. Group labels: Plant chloroplast, Plant mitochondrion]
Summary:
• There may be conflicting patterns in data which
can potentially mislead us about evolutionary
relationships
• Our methods of analysis need to be able to deal
with the complexities of sequence evolution and
to recover any underlying phylogenetic signal
• Some methods may do this better than others
depending on the properties of individual data
sets
• All trees are simply hypotheses!
Phylogenetic analysis requires
careful thought
• Phylogenetic analysis is frequently treated
as a black box into which data are fed
(often gathered at considerable cost) and
out of which “The Tree” springs
• (Hillis, Moritz & Mable 1996, Molecular
Systematics)