Download Document

Document related concepts

Point mutation wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Koinophilia wikipedia , lookup

Metagenomics wikipedia , lookup

Gene expression programming wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Microevolution wikipedia , lookup

Quantitative comparative linguistics wikipedia , lookup

Maximum parsimony (phylogenetics) wikipedia , lookup

Computational phylogenetics wikipedia , lookup

Transcript
Molecular Phylogenetics
Introduction to evolution and phylogeny
Nomenclature of trees
Four stages of molecular phylogeny:
[1] selecting sequences
[2] multiple sequence alignment
[3] tree-building
[4] tree evaluation
Practical approaches to making trees
Introduction
At the molecular level, evolution is a process of
mutation with selection.
Molecular evolution is the study of changes in genes
and proteins throughout different branches of the
tree of life.
Phylogeny is the inference of evolutionary relationships.
Traditionally, phylogeny relied on the comparison
of morphological features between organisms. Today,
molecular sequence data are also used for phylogenetic
analyses.
corrected amino acid changes
per 100 residues (m)
Dickerson
(1971)
Millions of years since divergence
Molecular clock for proteins:
rate of substitutions per aa site per 109 years
Fibrinopeptides
Kappa casein
Lactalbumin
Serum albumin
Lysozyme
Trypsin
Insulin
Cytochrome c
Histone H2B
Ubiquitin
Histone H4
9.0
3.3
2.7
1.9
0.98
0.59
0.44
0.22
0.09
0.010
0.010
Molecular clock hypothesis: implications
If protein sequences evolve at constant rates,
they can be used to estimate the times that
sequences diverged. This is analogous to dating
geological specimens by radioactive decay.
Molecular clock hypothesis: implications
If protein sequences evolve at constant rates,
they can be used to estimate the times that
sequences diverged. This is analogous to dating
geological specimens by radioactive decay.
N = total number of substitutions
L = number of nucleotide sites compared
between two sequences
K=
N
L
= number of substitutions
per nucleotide site
Rate of nucleotide substitution r
and time of divergence T
r = rate of substitution
= 0.56 x 10-9 per site per year for hemoglobin alpha
K = 0.093 = number of substitutions
per nucleotide site (rat versus human)
r = K / 2T
T = .093 / (2)(0.56 x 10-9) = 80 million years
Neutral theory of evolution
An often-held view of evolution is that just as organisms
propagate through natural selection, so also DNA and
protein molecules are selected for.
According to Motoo Kimura’s 1968 neutral theory
of molecular evolution, the vast majority of DNA
changes are not selected for in a Darwinian sense.
The main cause of evolutionary change is random
drift of mutant alleles that are selectively neutral
(or nearly neutral). Positive Darwinian selection does
occur, but it has a limited role.
Goals of molecular phylogeny
Phylogeny can answer questions such as:
• How many genes are related to my favorite gene?
• Was the extinct quagga more like a zebra or a horse?
• Was Darwin correct that humans are closest
to chimps and gorillas?
• How related are whales, dolphins & porpoises to cows?
• Where and when did HIV originate?
• What is the history of life on earth?
Woese PNAS
Molecular phylogeny: nomenclature of trees
There are two main kinds of information inherent
to any tree: topology and branch lengths.
We will now describe the parts of a tree.
Molecular phylogeny uses trees to depict evolutionary
relationships among organisms. These trees are based
upon DNA, RNA, and protein sequence data.
2
A
1
I
2
1
1
G
B
H 2
1
6
1
2
C
2
D
B
C
2
1
E
A
2
F
D
6
one unit
E
time
chronogram
phylogram
Tree nomenclature
taxon
taxon
2
A
1
I
2
1
1
G
B
H 2
1
6
1
2
C
2
D
B
C
2
1
E
A
2
F
D
6
one unit
E
time
Tree nomenclature
operational taxonomic unit (OTU)
such as a protein sequence
taxon
2
A
1
I
2
1
1
G
B
H 2
1
6
1
2
C
2
D
B
C
2
1
E
A
2
F
D
6
one unit
E
time
Tree nomenclature
Node (intersection or terminating point
of two or more branches)
branch
(edge)
2
I
1
1
G
B
H 2
1
6
1
2
C
2
D
B
C
2
1
E
A
2
F
1
2
A
D
6
one unit
E
time
Tree nomenclature
Branches are unscaled...
2
Branches are scaled...
A
1
I
2
1
1
G
B
H 2
1
6
1
2
C
2
D
B
C
2
1
E
A
2
F
D
6
one unit
E
time
…OTUs are neatly aligned,
and nodes reflect time
…branch lengths are
proportional to number of
amino acid changes
Tree nomenclature
bifurcating
internal
node
multifurcating
internal
node
2
A
1
I
2
1
1
G
B
H 2
1
6
A
2
F
B
2
C
2
2
1
D
E
C
D
6
one unit
E
time
Examples of multifurcation: failure to resolve the branching order
of some metazoans and protostomes
Rokas A. et al., Animal Evolution and the Molecular Signature of Radiations
Compressed in Time, Science 310:1933, 23 December 2005, Fig. 1.
Tree nomenclature: clades
Clade ABF (monophyletic group)
2
F
1
I
2
A
1
B
G
H 2
1
6
C
D
E
time
A group is monophyletic
(Greek: "of one race") if it
consists of a common
ancestor and all its
descendants.
(http://en.wikipedia.org/wiki/)
Tree roots
The root of a phylogenetic tree represents the
common ancestor of the sequences. Some trees
are unrooted, and thus do not specify the common
ancestor.
A tree can be rooted using an outgroup (that is, a
taxon known to be distantly related from all other
OTUs).
Tree nomenclature: roots
past
9
1
7
5
8
6
2
present
1
7
3 4
2
5
Rooted tree
(specifies evolutionary
path)
8
6
3
Unrooted tree
4
Tree nomenclature: outgroup rooting
past
root
9
10
7
8
7
6
2
present
9
8
3 4
1
Rooted tree
2
5
1
3 4
5
6
Outgroup
(used to place the root)
Enumerating trees
Cavalii-Sforza and Edwards (1967) derived the number
of possible unrooted trees (NU) for n OTUs (n > 3):
NU =
(2n-5)!
2n-3(n-3)!
The number of bifurcating rooted trees (NR)
(2n-3)!
NR = n-2
2 (n-2)!
For 10 OTUs (e.g. 10 DNA or protein sequences),
the number of possible rooted trees is  34 million,
and the number of unrooted trees is  2 million.
Many tree-making algorithms can exhaustively
examine every possible tree for up to ten to twelve
sequences.
Species trees versus gene/protein trees
Molecular evolutionary studies can be complicated
by the fact that both species and genes evolve.
speciation usually occurs when a species becomes
reproductively isolated. In a species tree, each
internal node represents a speciation event.
Genes (and proteins) may duplicate or otherwise evolve
before or after any given speciation event. The topology
of a gene (or protein) based tree may differ from the
topology of a species tree.
Species trees versus gene/protein trees
past
speciation
event
present
species 1
species 2
Species trees versus gene/protein trees
Gene duplication
events
species 1
speciation
event
species 2
Species trees versus gene/protein trees
Gene duplication
events
speciation
event
OTUs
species 1
species 2
Orthology/paralogy
Orthologous genes are homologous
(corresponding) genes in different species
(genomes)
Paralogous genes are homologous genes
within the same species (genome)
Four stages of phylogenetic analysis
Molecular phylogenetic analysis may be described
in four stages:
[1] Selection of sequences for analysis
[2] Multiple sequence alignment
[3] Tree building
[4] Tree evaluation
Stage 2: Multiple sequence alignment
The fundamental basis of a phylogenetic tree is
a multiple sequence alignment.
(If there is a misalignment, or if a nonhomologous
sequence is included in the alignment, it will still
be possible to generate a tree.)
Two Major Approaches to Phylogeny
Inference
1) Distance Matrix Methods
Calculate matrix of pairwise distances from all
data, then infer tree using a clustering algorithm.
2) Character Based Methods (maximum parsimony)
Inspect columns of characters, infer trees from
columns that contain “informative” characters, and
use these to infer most likely tree given the data.
Distance Matrix Methods
(matrix calculation)
Reality: Not all sites are free to change, the same sites change
multiple times
The simplest model is that of Jukes & Cantor
Jukes & Cantor:
dxy = -(3/4) ln (1-4/3 D)
• dxy = distance between sequence x and sequence y expressed as the number of
changes per site
• (note dxy = r/n where r is number of replacements and n is the total number of
sites. This assumes all sites can vary and when unvaried sites are present in
two sequences it will underestimate the amount of change which has occurred
at variable sites) (i.e., previous reality check)
• D = is the observed proportion of nucleotides which differ between two
sequences (fractional dissimilarity)
• ln = natural log function to correct for superimposed substitutions
(in general logging tends to convert exponential trends to linear trends)
• The 3/4 and 4/3 terms reflect that there are four types of nucleotides and three
ways in which a second nucleotide may not match a first - with all types of
change being equally likely (i.e. unrelated sequences should be 25% identical
by chance alone)
The natural logarithm ln is used to correct
for superimposed changes at the same site
•
•
•
•
If two sequences are 95% identical they are different at 5% or 0.05 (D) of sites thus:
– dxy = -3/4 ln (1-4/3 0.05) = 0.0517
Note that the observed dissimilarity 0.05 increases only slightly to an estimated
0.0517 - this makes sense because in two very similar sequences one would expect
very few changes to have been superimposed at the same site in the short time since
the sequences diverged apart
However, if two sequences are only 50% identical they are different at 50% or 0.50
(D) of sites thus:
– dxy = -3/4 ln (1-4/3 0.5) = 0.824
For dissimilar sequences, which may diverged apart a long time ago, the use of ln
infers that a much larger number of superimposed changes have occurred at the
same site
Distance Matrix Methods
(tree construction)
UPGMA is
unweighted pair group method
using arithmetic mean
1
2
3
4
5
Tree-building methods: UPGMA
Step 1: compute the pairwise distances of all
the proteins. Get ready to put the numbers 1-5
at the bottom of your new tree.
1
2
3
4
5
Tree-building methods: UPGMA
Step 2: Find the two proteins with the
smallest pairwise distance. Cluster them.
1
2
6
3
4
5
1
2
Tree-building methods: UPGMA
Step 3: Do it again. Find the next two proteins
with the smallest pairwise distance. Cluster them.
1
2
6
1
3
4
5
7
2
4
5
Tree-building methods: UPGMA
Step 4: Keep going. Cluster.
1
8
2
7
6
3
4
5
1
2
4
5
3
Tree-building methods: UPGMA
Step 4: Last cluster! This is your tree.
9
1
2
8
7
3
6
4
5
1
2
4
5
3
Distance-based methods: UPGMA trees
UPGMA is a simple approach for making trees.
• An UPGMA tree is always rooted.
• An assumption of the algorithm is that the molecular
clock is constant for sequences in the tree. If there
are unequal substitution rates, the tree may be wrong.
• While UPGMA is simple, it is less accurate than the
neighbor-joining approach (described next).
Distance method: Advantages
• Fast - suitable for analysing data
sets which are too large for other
more computationally intensive
methods such as maximum
likelihood
• A large number of models are
available with many parameters improves estimation of distances
Distance method: Disadvantages
• Information is lost - given only the
distances, it is impossible to derive
the original sequences
• Only through character based
analyses can the history of sites be
investigated; e.g., most informative
positions be inferred
Character Based Methods:
Maximum Parsimony
The best tree: should be the one that
requires the smallest number of
substitutions to explain the
differences among the sequences
being studied.
Occam's razor: Among his statements (translated from his Latin) are:
"Plurality is not to be assumed without necessity" and "What can be done
with fewer [assumptions] is done in vain with more."
One consequence of this methodology is the idea that the
simplest or most obvious explanation of several competing ones
is the one that should be preferred until it is proven wrong.
Not all Characters are Used
in Parsimony Analysis
• informative sites - nucleotide (or amino acid)
columns that are represented by at least two
different character states found in at least two
different sequences, these sites allow the
distinction between alternative trees.
• uninformative sites - nucleotide (or amino
acid) columns that do not allow the distinction
between two trees (e.g., constant)
Maximum Parsimony (4-taxon case)
1 2 3 4 5 6 7 8 9 10
1-A G G G T A A C T G
2-A C G A T T A T T A
3-A T A A T T G T C T
4-A A T G T T G T C G
1
3
2
4
1
2
3
4
1
3
4
2
How may informative sites are there
in this data set?
Maximum Parsimony (4-taxon case)
1 2 3 4 5 6 7 8 9 10
1-A G G G T A A C T G
2-A C G A T T A T T A
3-A T A A T T G T C T
4-A A T G T T G T C G
1
3
2
4
1
2
3
4
1
3
0 3
0 3
0 3
4
2
Maximum Parsimony
G1
C2
3T
C
3
4A
2
1-G
G1
T3
2C
C
G1
3
4A
3T
C
2C
3-T
4-A
3
A4
2-C
Maximum Parsimony
1 2 3 4 5 6 7 8 9 10
1-A G G G T A A C T G
2-A C G A T T A T T A
3-A T A A T T G T C T
4-A A T G T T G T C G
1
3
2
4
1
2
3
4
1
3
0 3 2
0 3 2
0 3 2
4
2
Maximum Parsimony
G1
G2
3A
G
2
4T
3
1-G
G1
A3
2G
G
G1
2-G
2
4T
4-T
3A
2
T4
G
2G
3-A
Maximum Parsimony
1 2 3 4 5 6 7 8 9 10
1-A G G G T A A C T G
2-A C G A T T A T T A
3-A T A A T T G T C T
1
4-A A T G T T G T C G
3
0 3 2 2
2
4
1
2
0 3 2 2
3
4
1
3
4
2
0 3 2 1
Maximum Parsimony
G1
A2
3A
A
G1
A3
4G
2A
A
A1
2
1-G
2-A
2
4G
3G
1
A4
A
2G
4
3-A
4-G
Maximum Parsimony
1
3
2
4
1
2
3
4
1
3
0 3 2 2 0 1 1 1 1 3 14
0 3 2 2 0 1 2 1 2 3 16
0 3 2 1 0 1 2 1 2 3 15
4
2
Maximum Parsimony
1
3
2
4
1 2 3 4 5 6 7 8 9 10
1-A G G G T A A C T G
2-A C G A T T A T T A
3-A T A A T T G T C T
4-A A T G T T G T C G
0 3 2 2 0 1 1 1 1 3 14
Parsimony - advantages
• is a simple method - easily understood
operation
• does not seem to depend on an explicit
model of evolution
• gives both trees and associated hypotheses
of character evolution
• should give reliable results if the data is well
structured and homoplasy is either rare or
widely (randomly) distributed on the tree
Parsimony - disadvantages
• May give misleading results if homoplasy is common or concentrated in
particular parts of the tree, e.g:
- thermophilic convergence
- base composition biases
- long branch attraction
• Underestimates branch lengths (Why?)
• Model of evolution is implicit - behaviour of method not well understood
• Parsimony often justified on purely philosophical grounds - we must
prefer simplest hypotheses - particularly by morphologists
• For most molecular systematists, this is uncompelling
Parsimony can be inconsistent
•
•
Felsenstein (1978) developed a simple model phylogeny including four taxa
and a mixture of short and long branches
Under this model parsimony will give the wrong tree
A
B
Model tree
p
p
q
C
q
q
D
Parsimony tree
Rates or
Branch lengths
p >> q
C
A
Wrong
B
D
Long branches
are attracted but
the similarity is
homoplastic
• With more data the certainty that parsimony will give the wrong tree
increases - so that parsimony is statistically inconsistent
• Advocates of parsimony initially responded by claiming that
Felsenstein’s result showed only that his model was unrealistic
• It is now recognised that the long-branch attraction (in the
“Felsenstein Zone”) is one of the most serious problems in
phylogenetic inference
Summary and recommendations
• Remember that molecular phylogenetics yields gene trees
• Accurate gene trees may not be accurate organismal trees
• Gene duplications and paralogy, and lateral transfer can
produce mismatches between gene and organismal
phylogenies
• Use congruence between separate gene trees to identify robust
organismal phylogenies or mismatches that require further
information
The most famous case of LBA misleading biologists
The Universal SSU rRNA Tree
Wheelis et al. 1992 PNAS 89: 2930
The SSU Ribosomal RNA Tree for Eukaryotes
Animals
Fungi
Choanozoa
Plants / green algae
Mitochondria?
Ciliates + Apicomplexa
Stramenopiles
Red algae
Dictyostelium
Entamoebae
Prokaryotic
outgroup
Euglenozoa
Physarum
Percolozoa
Trichomonas
Giardia
Microsporidia
Archezoa