Phylogenetics Topic 3: Methods of inferring phylogenies
Because no person was present to directly observe the evolution of a group of organisms, biologists must infer
phylogenies from the characters of living and fossil taxa. These days, the vast majority of phylogenies are
reconstructed from variation among nucleotide or amino acid sequences. However, a wide variety of other
types of molecular data can be used to reconstruct phylogenies; examples include restriction fragment length
polymorphisms (RFLPs), insertion-deletion events (indels), chromosomal rearrangements, and DNA-DNA
hybridization, to name a few. Numerous methods of reconstructing trees have been implemented. This lecture
provides a very brief, non-technical introduction to the most common methods.
A generalized protocol for molecular phylogenetics, and the associated concerns with each step
1. Collect homologous sequences
   Concerns: gene tree-species tree / paralogy–orthology / trees within trees
2. Multiple sequence alignment
   Concerns: positional homology / gaps / subjectivity-objectivity / methods
3. Phylogeny estimation
   Concerns: philosophy / methods / consistency / power and accuracy
4. Test reliability or fit of phylogenetic estimates
   Concerns: branch support / tree comparison / statistical issues with trees
5. Interpretation and application
   Concerns: independent contrasts / impact of error on conclusions
Classification of tree-reconstruction methods
PARSIMONY METHODS: These methods utilize variation in CHARACTER STATES to reconstruct phylogenies.
Character states are most often variation in the nucleotide (see below) or amino acid “states” at a site in a
sequence of such characters. Such sequences often correspond to genes, but other sorts of sequences of
characters could be just as useful; examples include nucleotides of introns or inter-genic regions, restriction site
polymorphisms, or morphological characters.
Alignment of the nucleotide character states of the α-globin gene from five species of mammals
human    GTG CTG TCT CCT GCC GAC AAG ACC AAC GTC AAG GCC GCC TGG GGC AAG GTT GGC GCG CAC
cow      ... ... ... G.C ... ... ... T.. ..T ... ... ... ... ... ... ... ... ... .GC A..
rabbit   ... ... ... ..C ..T ... ... ... ... A.. ... A.T ... ... .AA ... A.C ... AGC ...
rat      ... ..C ... G.A .AT ... ..A ... ... A.. ... AA. TG. ... ..G ... A.. ..T .GC ..T
opossum  ... ..C ..G GA. ..T ... ... ..T C.. ..G ..A ... AT. ... ..T ... ..G ..A .GC ...

human    GCT GGC GAG TAT GGT GCG GAG GCC CTG GAG AGG ATG TTC CTG TCC TTC CCC ACC ACC AAG
cow      ... ..A .CT ... ..C ..A ... ..T ... ... ... ... ... ... AG. ... ... ... ... ...
rabbit   .G. ... ... ... ..C ..C ... ... G.. ... ... ... ... T.. GG. ... ... ... ... ...
rat      .G. ..T ..A ... ..C .A. ... ... ..A C.. ... ... ... GCT G.. ... ... ... ... ...
opossum  ..C ..T .CC ..C .CA ..T ..A ..T ..T .CC ..A .CC ... ..C ... ... ... ..T ... ..A

human    ACC TAC TTC CCG CAC TTC GAC CTG AGC CAC GGC TCT GCC CAG GTT AAG GGC CAC GGC AAG
cow      ... ... ... ..C ... ... ... ... ... ... ... ..G ... ... ..C ... ... ... ... G..
rabbit   ... ... ... ..C ... ... ... T.C .C. ... ... ... .AG ... A.C ..A .C. ... ... ...
rat      ... ... ... T.T ... A.T ..T G.A ... .C. ... ... ... ... ..C ... .CT ... ... ...
opossum  ..T ... ... ..C ... ... ... ... TC. .C. ... ..C ... ... A.C C.. ..T ..T ..T ...
The order of DNA sequences in the alignment is specified by the order of the taxa in the list. To fit it on the page, the
alignment is broken into three parts; such alignments are called INTERLEAVED. The complete DNA sequence is shown for
the first taxon (human). All the other sequences are shown relative to human, with the dot, “.”, signifying a match in the
character state with the human sequence. Differences are indicated by using the single-letter nucleotide code (A, C, T or
G). Note that this alignment could also be analyzed by using distance, likelihood, and Bayesian methods.
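The dot convention is easy to undo by machine. The following is a minimal sketch, assuming the rows have already been read in as strings with the spaces removed; the function name `expand_dots` is hypothetical, and the data are the first five codons of the alignment above.

```python
def expand_dots(reference: str, relative: str) -> str:
    """Replace each '.' in `relative` with the reference base at that position."""
    if len(reference) != len(relative):
        raise ValueError("sequences must have the same aligned length")
    return "".join(ref if sym == "." else sym for ref, sym in zip(reference, relative))


# First five codons of the alignment (spaces removed); human is the reference row.
human = "GTGCTGTCTCCTGCC"
relative_rows = {
    "cow":     ".........G.C...",
    "rabbit":  "...........C..T",
    "rat":     ".....C...G.A.AT",
    "opossum": ".....C..GGA...T",
}

full = {"human": human}
for name, row in relative_rows.items():
    full[name] = expand_dots(human, row)

for name, seq in full.items():
    print(f"{name:8s} {seq}")
```

Most phylogenetics programs expect the fully written-out sequences, so a conversion of this kind is usually the first step before any analysis.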
The parsimony principle is derived from the principle of philosophy called Occam’s Razor: plurality
should not be posited without necessity (Pluralitas non est ponenda sine necessitate, William of Occam,
medieval English philosopher [ca. 1285-1349]). Thus the “simplest” hypothesis is the one that is chosen under
the MAXIMUM PARSIMONY criterion.
Let’s take a nucleotide dataset as an example. In this case an individual tree is a hypothesis, and the
“best tree” for the dataset is the one that requires the fewest nucleotide substitutions to explain those
data. One first computes the minimum number of evolutionary changes required to fit a given dataset to a tree.
This number, often called the “number of STEPS”, is recorded for all candidate trees. The tree that requires the
minimum number of steps is selected as the best estimate of the phylogenetic tree, and is called the MAXIMUM
PARSIMONY TREE. When two or more trees share the same minimum number of steps, such trees are
called EQUALLY PARSIMONIOUS TREES.
The length of a tree in steps is called the “TREE LENGTH”. The appeal of maximum parsimony is that the
shortest tree is the one that requires the fewest homoplasies. Remember, homoplasies are events
such as parallelisms, convergences, and reversals; as such, they represent non-phylogenetic similarities.
“Longer trees” require more assumptions of homoplasies and thus are more complex than the maximum
parsimony tree. When the truth is not parsimonious, parsimony tree length underestimates the true
evolutionary distances.
Example of the maximum parsimony principle in phylogenetics:
SITE         1   2   3   4   5   6   7   8   9   10  11  12
SPECIES 1    A   T   G   T   T   G   T   G   A   T   A   A
SPECIES 2    A   T   G   T   T   C   T   G   G   T   A   A
SPECIES 3    A   T   G   T   T   A   T   C   A   T   A   A
SPECIES 4    A   T   G   T   T   A   T   C   G   T   A   A
[Figure: the character states at SITE 6, SITE 8 and SITE 9 mapped onto the three possible trees (TREE 1, TREE 2, TREE 3) for species 1-4.]
Lengths of three possible trees:
TREE 1: 5 steps
TREE 2: 6 steps
TREE 3: 6 steps
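Counting steps by hand gets tedious quickly, so here is a minimal sketch of how the count can be automated. It uses Fitch's algorithm, a standard way of computing the minimum number of changes at a site; the notes do not name an algorithm, so that choice is an assumption. The four-taxon alignment and the taxon names `sp1` to `sp4` are hypothetical and are not the example data above.

```python
def fitch_length(tree, states):
    """Minimum number of changes for one site on a binary tree.

    `tree` is a nested tuple of taxon names, e.g. (("sp1", "sp2"), ("sp3", "sp4"));
    `states` maps each taxon name to its character state at the site.
    """
    def visit(node):
        if isinstance(node, str):                  # tip: the observed state
            return {states[node]}, 0
        (left_set, left_steps), (right_set, right_steps) = visit(node[0]), visit(node[1])
        steps = left_steps + right_steps
        common = left_set & right_set
        if common:                                 # children agree: no extra change
            return common, steps
        return left_set | right_set, steps + 1     # children disagree: one more change

    return visit(tree)[1]


def tree_length(tree, alignment):
    """Sum the per-site step counts over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_length(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical four-taxon alignment (not the data from the example above).
alignment = {"sp1": "AGGT", "sp2": "AGCT", "sp3": "ACCT", "sp4": "ACCA"}

# The three possible groupings of four taxa.
trees = {
    "((sp1,sp2),(sp3,sp4))": (("sp1", "sp2"), ("sp3", "sp4")),
    "((sp1,sp3),(sp2,sp4))": (("sp1", "sp3"), ("sp2", "sp4")),
    "((sp1,sp4),(sp2,sp3))": (("sp1", "sp4"), ("sp2", "sp3")),
}

lengths = {name: tree_length(tree, alignment) for name, tree in trees.items()}
best = min(lengths, key=lengths.get)
print(lengths, "-> maximum parsimony tree:", best)
```

In this toy dataset only the second site distinguishes among the trees; the constant site and the two sites where a single species differs cost the same number of steps on every tree.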
A problem arises when the underlying mechanism of molecular evolution is sufficiently complex that the
number of homoplasies exceeds the true phylogenetic signal in the data. When this happens, methods which
choose simple solutions are sometimes “fooled” by the data. What happens is that the simplest way to fit a tree
to such data is to consider the homoplasies as the true signal and the true signal as the homoplasies. When
this happens we say that maximum parsimony is INCONSISTENT under such a mode of molecular evolution.
DISTANCE MATRIX METHODS: If one looks at a phylogeny with branch lengths scaled to some evolutionary
distance such as the mean number of changes per site in a gene, it is easy to see that there is a relationship
between evolutionary distance and a measure of pair-wise similarity between the lineages. For example, a pair
of sister taxa on a tree will have a shorter distance between each other than either will have with any other
lineage on such a tree. Distance methods seek to utilize this form of information to reconstruct phylogenies.
All distance methods start by converting the original data, say a set of gene sequences, into a matrix of
pairwise distance values between all pairs of lineages in the sample. Next a tree is inferred either (i) by some
type of sequential joining method, or (ii) by evaluating a set of candidate trees and applying a type of OPTIMALITY
CRITERION to select the best tree. Note that the maximum parsimony criterion described above is one example of
an optimality criterion that may be used on discrete character data. An optimality criterion for distance data with
a similar justification as parsimony is MINIMUM EVOLUTION. Under the minimum evolution criterion, the tree with
the smallest sum of branch lengths is chosen as the best estimate. As with character-based datasets, there
are a variety of optimality criteria that one can use with distance data.
Example of distance based approach to molecular phylogenetics:
1. Obtain set of homologous gene sequences and produce an alignment.
2. Transform primary data into a matrix of pairwise genetic distance values.
3. Select a method of inferring a phylogenetic tree from distance data; in this case it is the least squares method.
4. In this case, determine the S statistic for the set of candidate trees, and select a tree that minimizes S.
   Note that S is a function of both the tree topology and its branch lengths.
[Figure: the example is illustrated with four taxa: human, chimp, gorilla and orang.]
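The second step, turning an alignment into a matrix of pairwise distances, can be sketched in a few lines. This is only an illustration under stated assumptions: the Jukes-Cantor (JC69) correction is my choice of model and is not named in the example, the function names are hypothetical, and the three short sequences are the first five codons of human, cow and rat from the globin alignment earlier in these notes.

```python
import math
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("JC69 distance is undefined when p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(alignment: dict) -> dict:
    """Map each unordered pair of taxa to its JC69 distance."""
    return {
        (a, b): jc69_distance(p_distance(alignment[a], alignment[b]))
        for a, b in combinations(alignment, 2)
    }


alignment = {
    "human": "GTGCTGTCTCCTGCC",
    "cow":   "GTGCTGTCTGCCGCC",
    "rat":   "GTGCTCTCTGCAGAT",
}

matrix = distance_matrix(alignment)
for pair, d in matrix.items():
    print(pair, round(d, 3))
```

The corrected distance is always at least as large as the raw proportion of differences, because the model allows for multiple changes at the same site that the raw count cannot see.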
Distance methods have a number of attractive qualities for phylogenetics. First and foremost, the distance
calculations between all pairs of sequences are based on an explicit model of molecular evolution. If the most
important features of the process of evolution are contained in the model then inconsistency problems such as
long-branch attraction are reduced or eliminated. For those who are interested, I have placed on the course
website a short summary of the more popular models of nucleotide and amino acid evolution. We will return to
the problem of using model-based methods to obtain “corrected” estimates of evolutionary distance later in this
course. Another very useful feature of distance methods is that they provide a statistical framework for
evaluating models or hypotheses; such a framework is not available under parsimony methods.
A noteworthy drawback of distance methods is that the information content of the dataset is reduced in
the step of transforming the primary data into a matrix of pairwise distance values. The practical effect is that
the power of distance methods could be lower than character-based methods in certain circumstances.
MAXIMUM LIKELIHOOD METHODS: Maximum likelihood is a standard statistical framework that can be applied to
the problem of tree-reconstruction when a stochastic model of evolution is assumed. Note that maximum
likelihood was invented by the great British statistician Ronald A. Fisher in 1912, and is of central importance to
the field of statistics on its own. Many of the well-known statistical estimates are maximum likelihood estimates.
Remember the binomial distribution that we used earlier to study the problem of genetic drift? The fraction of
heads is a maximum likelihood estimate of the parameter of a binomial distribution.
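The coin-tossing claim can be checked numerically. The sketch below evaluates the binomial log-likelihood over a grid of candidate values of p and picks the largest; the counts (7 heads in 10 tosses) and the grid spacing are hypothetical choices made only for illustration.

```python
import math


def binomial_log_likelihood(p: float, heads: int, tosses: int) -> float:
    """Log of the binomial probability of `heads` successes in `tosses` trials."""
    return (
        math.log(math.comb(tosses, heads))
        + heads * math.log(p)
        + (tosses - heads) * math.log(1.0 - p)
    )


heads, tosses = 7, 10                           # hypothetical data
grid = [i / 1000 for i in range(1, 1000)]       # candidate values 0.001 ... 0.999

# The maximum likelihood estimate is the grid value with the highest log-likelihood.
p_hat = max(grid, key=lambda p: binomial_log_likelihood(p, heads, tosses))
print(p_hat, heads / tosses)
```

The grid search lands on the observed fraction of heads, which is what the analytical result says it should do.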
Likelihood methods measure the probability of the data given the hypothesis [i.e., Prob (D|H)]. Note
that this is NOT the probability of the hypothesis; that would be Prob (H|D). In terms of the phylogeny problem
we attempt to measure the probability of the data given a particular tree topology [i.e., Prob (D|τ)], which we call
the LIKELIHOOD SCORE. Given an explicit model of evolution it is relatively straightforward to compute the likelihood
of a tree (τ), although it is computationally slow as there are many terms in the likelihood function.
Supplementary notes are posted on the course website that describe the calculation of the likelihood of a tree
given a sample of nucleotide data. Given a set of candidate trees, the likelihood score is used as an optimality
criterion, and the tree that yields the largest probability of observing the data in hand (i.e., the likelihood score) is
taken as the best estimate of the tree. We call this tree the MAXIMUM LIKELIHOOD TREE, and its score is the
maximum LIKELIHOOD SCORE.
Like parsimony methods, likelihood-based methods are based on characters rather than pairwise
distance. Unlike parsimony, likelihood uses an explicit model of evolution to compute probabilities of character-state
changes along a tree. The task of computing the likelihood of the data given a tree is broken down into
separate calculations of the probability that the nucleotide data at a site evolved along a given tree (under a
given model of evolution). At the end these SITE LIKELIHOODS are multiplied to get the likelihood of the complete
dataset.
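To make the idea of multiplying site likelihoods concrete, here is a minimal sketch for the simplest possible case: a tree with only two sequences separated by a branch of length t. The Jukes-Cantor (JC69) model is an assumption chosen for simplicity, the function names are hypothetical, and the two sequences are the first five codons of human and rat from the globin alignment earlier in these notes. A real analysis of a larger tree also sums over the unknown states at internal nodes, which this sketch does not need to do.

```python
import math


def jc69_site_likelihood(x: str, y: str, t: float) -> float:
    """Probability of seeing bases x and y at one site under JC69.

    `t` is the expected number of substitutions per site separating the two sequences.
    """
    e = math.exp(-4.0 * t / 3.0)
    p_xy = 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e
    return 0.25 * p_xy                  # 1/4 is the equilibrium frequency of base x


def log_likelihood(seq_a: str, seq_b: str, t: float) -> float:
    """Sum of log site likelihoods (the log of their product)."""
    return sum(math.log(jc69_site_likelihood(x, y, t)) for x, y in zip(seq_a, seq_b))


human = "GTGCTGTCTCCTGCC"
rat = "GTGCTCTCTGCAGAT"

# Likelihood of the whole dataset at t = 0.4: the product of the site likelihoods.
site_likelihoods = [jc69_site_likelihood(x, y, 0.4) for x, y in zip(human, rat)]
likelihood = math.prod(site_likelihoods)

# Crude search for the branch length that maximizes the likelihood.
grid = [i / 1000 for i in range(1, 3001)]
t_hat = max(grid, key=lambda t: log_likelihood(human, rat, t))
print(likelihood, t_hat)
```

In practice the logarithms are summed rather than the raw probabilities multiplied, because the product of many small numbers quickly becomes too small for a computer to represent.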
BAYESIAN METHODS: Bayesian methods have become very popular for phylogenetics over the past few years.
Bayesian inference of phylogenies involves making an inference from the posterior distribution of trees.
Because the posterior is extremely complicated there is no analytical formula for it, and a technique called
Markov Chain Monte Carlo (MCMC) is used to approximate the posterior. We will not cover Bayesian
phylogenetics in this course; however this approach, as well as the others mentioned above, is covered in detail
in the fourth-year course, Bioinformatics (BIOL 4041 / BIOC 4010). For those who can’t wait, I have
placed a copy of an excellent review of Bayesian inference in phylogenetics on the course web site
(Huelsenbeck et al. 2001. Science 294: 2310-2314.).
ALGORITHMIC METHODS: Rather than comparing alternate topologies based on some criterion of optimality (e.g.,
parsimony or likelihood), algorithmic methods will computationally “build a tree” according to a specific set of
“steps”. All the algorithms break the task up into steps where a decision is made concerning the relationship of
a small set of taxa according to some criterion. The steps are repeated until all taxa have been placed into a
phylogenetic tree. The most commonly used algorithmic methods include UPGMA, STEPWISE ADDITION, STAR
DECOMPOSITION, and QUARTET PUZZLING.
Let’s take five taxa as an example to look at stepwise addition. A three-taxon tree is selected as the
starting point. In turn, both of the remaining two taxa are placed on each of the three branches and the result of
each is evaluated, say for tree length. The best 4-taxon tree is selected and all others are discarded. The four
taxon tree is used as the start point for evaluating the placement of the last taxon to all five of the possible
places on the 4-taxon tree. The best is selected and all others are discarded. This procedure can be applied to
any number of taxa.
Star decomposition reverses the process above; rather than building up a tree one species at a time,
this method starts with all the taxa present as a completely unresolved tree, called a star tree. A completely
resolved bifurcating tree is obtained by resolving the tree, step by step, by grouping two lineages in each step.
Note that there are multiple pathways for decomposing a star tree into a bifurcating tree. The figure below
illustrates one pathway of star decomposition.
Obtaining a tree by star decomposition
[Figure: a star tree of taxa A-F resolved, step by step, into a completely bifurcating tree.]
Algorithmic methods are very fast because they proceed directly toward a final solution, discarding
alternatives along the way. Unfortunately algorithmic methods provide no measure of suitability of the
discarded alternatives. Are they nearly as good as the tree obtained by the algorithm or are they much worse?
Consider the case where an algorithm resolves a single tree, but where parsimony identifies 500 equally
parsimonious trees. How much confidence should be placed in the single tree obtained from the algorithmic
method?
Selected list of the more popular methods of inferring a phylogenetic tree.
MAXIMUM PARSIMONY: Character-based method that selects the tree that minimizes the net amount of
   evolutionary character-state transformations.
WEIGHTED PARSIMONY: Character-based method that selects the tree that minimizes the net amount of
   evolutionary character-state transformations after applying weights to different subsets of
   the possible transformations.
TRANSVERSION PARSIMONY: Character-based method that selects a tree based on DNA sequences by minimizing the
   net amount of transversional transformations. Transitions are ignored.
DOLLO PARSIMONY: Character-based method that selects the tree that minimizes the net amount of
   evolutionary character-state transformations under the assumption that such
   transformations are irreversible.
LEAST-SQUARES: Distance method that selects the tree that minimizes the discrepancy between the
   observed distance values and the branch lengths predicted by a tree.
MINIMUM EVOLUTION: Distance method that selects the tree that minimizes the sum of the branch lengths (total
   tree length) of the reconstructed tree.
MAXIMUM LIKELIHOOD: A character-based method that assumes a model of evolution and selects the tree that
   maximizes the probability of the data set in hand given the assumed evolutionary model.
NEIGHBOUR-JOINING: A star-decomposition algorithm that proceeds by minimizing the total length of the tree.
   There is no guarantee that the neighbour-joining tree will be the tree with the
   globally shortest tree length.
UPGMA: A clustering algorithm that assumes a linear relationship between evolutionary distance
   and divergence time. UPGMA stands for un-weighted pair-group method of arithmetic
   means. Among the simplest methods for tree reconstruction.
Some assumptions that [nearly] all methods of tree-reconstruction make for gene
sequence data
• The sequence data has no errors.
• The genes are homologous.
• Each position in the alignment has positional homology, although it might differ in character state.
• Evolution at each position is independent of the other positions.
• The sequence variation contained in the sample of gene sequences is representative of the broader
  population of genes in the genome and lineages within the group of interest.
• The signal to noise ratio in the gene sequences is sufficient to resolve the problem of interest.
Role of assumptions in the form of an evolutionary model
MAXIMUM PARSIMONY: Implicit rather than explicit
DISTANCE METHODS: Used for distance corrections
MAXIMUM LIKELIHOOD: Full and explicit use
The problem of tree searching
For those methods based on an optimality criterion, the best solution to the tree problem would be to evaluate
every possible tree topology and compute the tree score for each (e.g. likelihood score). Let’s call the set of all
possible trees TREE SPACE. It would be a simple task to keep track of the best score during the search of tree
space, and replace the current “best tree” with any that are found to be better. The best tree (or list of equally
good trees) at the end of the search of tree space is the best estimate of the phylogeny under the chosen
optimality criterion. The problem is that as the number of lineages in the data set increases, the size of tree
space increases spectacularly.
Let’s take a look at the size of tree space for unrooted bifurcating trees. We will focus on unrooted
trees as this is the type of tree space that the vast majority of phylogenetic methods will search. The number of
such trees is given by:
N_T = 3 × 5 × 7 × … × (2n – 5),
where n is the number of species. A table showing the increase in the size of tree space with increasing
number of lineages is presented below.
Number of lineages    Number of unrooted trees
3                     1
4                     3
5                     15
6                     105
7                     945
8                     10,395
9                     135,135
10                    2,027,025
11                    34,459,425
12                    654,729,075
13                    13,749,310,575
14                    316,234,143,225
15                    7,905,853,580,625
20                    221,643,095,476,699,771,875
50                    ~3 × 10^74
100                   ~2 × 10^182
[Avogadro’s number is only 6 × 10^23]
At 50 lineages, one is getting close to Eddington’s number, the number of electrons in the universe!
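The formula for the number of unrooted bifurcating trees is simple enough to evaluate directly. The following is a minimal sketch; the function name `count_unrooted_trees` is hypothetical. Python integers have arbitrary precision, so even the count for 100 lineages is computed exactly.

```python
def count_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n lineages: 3 x 5 x 7 x ... x (2n - 5)."""
    if n < 3:
        raise ValueError("an unrooted bifurcating tree needs at least 3 lineages")
    total = 1
    for factor in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n - 5
        total *= factor
    return total


for n in (3, 4, 5, 10, 20, 50, 100):
    count = count_unrooted_trees(n)
    # Print small counts in full and large ones in scientific notation.
    shown = f"{count:,}" if count < 10**21 else f"~{count:.1e}"
    print(f"{n:4d}  {shown}")
```

Each additional lineage multiplies the count by the next odd number, which is why an exhaustive search stops being practical so quickly.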
An alternative to exhaustive searching is a method called the BRANCH-AND-BOUND SEARCH. Here the
algorithm is able to eliminate parts of tree space that contain suboptimal trees, as it proceeds through the
search. [Note here the algorithm is one for searching tree space, not constructing trees]. Although very
effective, the algorithm is only practical up to about 20 lineages.
For more than 20 lineages only HEURISTIC searches of tree space are possible. Heuristic methods
employ algorithms that will conclude a search in an acceptable amount of time, but at the cost of not being able
to guarantee that the globally optimal solution has been found. These methods start with an initial tree
(provided by the user, obtained at random, or from a method like stepwise addition), and conduct a process
called BRANCH-SWAPPING. Branch swapping involves making small rearrangements to the tree topology.
Following a branch swapping event, the score of the resulting topology is computed. The process of
branch-swapping is continued until no more improvements can be made to the optimality criterion. This is a type of
HILL-CLIMBING ALGORITHM, and unfortunately can result in a tree that represents a local optimum in the optimality
criterion rather than the global optimum.
[Figure: A “nice” likelihood surface as a function of the length of a two taxon tree
(t) and the model parameter ω; there is only one peak on the surface.]
Outgroups
As we have seen above, mapping characters on a rooted phylogeny allows us to distinguish between
homology and analogy. We can also distinguish between the primitive and derived character states. Such
inferences are simply not possible with unrooted trees. Because nearly all modern methods of phylogenetic
inference produce unrooted trees, correctly identifying the root is an important aspect of phylogenetic analysis.
Today, the overwhelming majority of biologists use the OUTGROUP METHOD to root phylogenies.
Let’s define some terms:
INGROUP: A group of lineages, assumed to be monophyletic, whose phylogenetic relationships
are of primary interest.
OUTGROUP: One or more terminal taxa that are assumed to be outside of the monophyletic group
that has been specified as the ingroup. Unlike the ingroup, the outgroup does not have to be
monophyletic.
ROOT: The most evolutionarily basal point of a phylogeny. The root orients the direction of change
along a phylogeny relative to time.
CHARACTER POLARITY: The evolutionary relationship between two or more states for a given
character. Say we have a character with two states, “a” and “b”. By mapping them on a phylogeny
we can determine that “b” preceded “a” in evolutionary history; hence “a” is the derived state and
“b” is the primitive state.
In the outgroup method the outgroup or outgroup taxa serve only one purpose; following a phylogenetic
analysis their location determines where to place the root on an unrooted tree. The root is placed along the
branch that connects the ingroup with the outgroup.
Rooting a phylogenetic tree by placing the root between the ingroup and outgroup
[Figure: three panels (Unrooted tree; Placing root between ingroup and outgroup; Rooted tree) for ingroup taxa IG-1 to IG-4 and
one outgroup taxon. IG: ingroup; OG: outgroup.]
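Rooting by outgroup is a purely mechanical operation once the unrooted tree is in hand. Here is a minimal sketch that places the root on the branch leading to a single outgroup taxon and writes the result in Newick (nested parentheses) form. The adjacency-list representation, the internal node names `x`, `y`, `z`, and the function name are all hypothetical choices made for illustration.

```python
def root_with_outgroup(adjacency: dict, outgroup: str) -> str:
    """Return a rooted Newick string with the root on the branch leading to `outgroup`.

    `adjacency` maps every node of an unrooted tree to the list of its neighbours;
    tips have exactly one neighbour.
    """
    def subtree(node, parent):
        children = [n for n in adjacency[node] if n != parent]
        if not children:                          # a tip
            return node
        return "(" + ",".join(subtree(child, node) for child in children) + ")"

    (attachment,) = adjacency[outgroup]           # the internal node next to the outgroup
    return "(" + outgroup + "," + subtree(attachment, outgroup) + ");"


# Hypothetical unrooted tree with four ingroup taxa and one outgroup taxon.
adjacency = {
    "OG": ["x"],
    "x": ["OG", "y", "z"],
    "y": ["x", "IG-1", "IG-2"],
    "z": ["x", "IG-3", "IG-4"],
    "IG-1": ["y"], "IG-2": ["y"], "IG-3": ["z"], "IG-4": ["z"],
}

rooted = root_with_outgroup(adjacency, "OG")
print(rooted)

# The same unrooted tree rooted on a different branch gives a different rooted tree.
print(root_with_outgroup(adjacency, "IG-1"))
```

The second call shows why the choice of outgroup matters: the unrooted topology is identical, but the implied direction of change along the branches is not.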
An important point is that the phylogenetic analysis must contain both the ingroup and outgroup taxa. Usually,
no constraints are placed on either the ingroups or outgroups for the purposes of conducting the phylogenetic
analysis. There have been many misconceptions about the significance and role of outgroups. To avoid further
confusion a concept map of the outgroup method is presented below.
Flowchart of the general method of outgroup analysis. This method is based on simultaneous phylogenetic analysis of
ingroup and outgroup lineages.
1. Define ingroup, usually by synapomorphies.
2. Define outgroup, usually by more inclusive synapomorphies.
3. Combine ingroup and outgroup into a single dataset.
4. Conduct unrooted phylogenetic analysis. Treat outgroups as terminal taxa. Any method can be used: parsimony,
   likelihood, etc.
5. Root tree between ingroup and outgroup. (Other methods do not use outgroups; e.g., mid-point methods and
   hypothetical ancestors.)
6. Read characters from phylogeny.
7. Distinguish between primitive and derived, and between homology and analogy.
Many myths have developed about the use of outgroups in phylogenetic analysis. The following list
illustrates the most common misconceptions about how to use outgroups.
Outgroup myths:
Myth 1: The character state in the outgroup should be considered primitive. In reality, character
states in the outgroup can be, and often are, highly derived features of the organism.
Myth 2: The outgroup should be the sister taxon to the ingroup. There are many reasons why this
is desirable; however it is not absolutely necessary. It is possible to root a tree by using an
outgroup that is more distantly related to the ingroup than its sister group.
Myth 3: More than one outgroup is required to root a tree. Of course larger sample sizes are
generally better than smaller ones, but as we have shown above, it is possible to place a root on a
tree by using only a single outgroup taxon.
Comparison of phylogenetic methods
Random versus systematic error
In any statistical analysis there will be two potential sources of error (systematic and random). RANDOM ERROR
is defined as the deviation between a parameter and an estimate of that parameter that is due solely to the
effects of finite sample size. Since all phylogenetic methods are applied to finite sets of data, all estimates of a
phylogeny will be subject to sampling error to some degree. SYSTEMATIC ERROR is the deviation between a
parameter and an estimate that is due to incorrect assumptions of the estimation method. An important
difference between these two types of error is that while random error decreases with increasing sample size,
systematic errors persist, and sometimes intensify, as sample sizes increase.
Many commonly used methods of phylogenetic inference are not based on explicit assumptions. At first this
might seem to be an advantage over model based methods, because if an assumption of a model is violated it
could lead to systematic errors. However, a lack of stated assumptions does not mean that a method is
assumption-free. In the “model-free” methods, the assumptions are merely implicit rather than explicit.
Fortunately, phylogenetic methods are not automatically invalidated when one or more of their assumptions are
violated; in fact, a very simple model can be useful. An advantage of model-based methods is the explicit
nature of their assumptions; the fit of a specific assumption, or even the entire model, to the data can be
evaluated.
Long branch attraction as an example of systematic error:
There are many modes of molecular evolution that can lead to inconsistency. Some examples are (i) rate
heterogeneity among branches in the true tree; (ii) nucleotide compositional heterogeneity among the
branches; and (iii) non-phylogenetic convergence in site-specific rates of evolution. Rate heterogeneity among
branches can lead to something called LONG-BRANCH ATTRACTION (LBA). LBA occurs when two distantly related
branches have very high rates such that the number of non-phylogenetic similarities between those branches
exceeds the true signal in the data. When this happens many methods will yield a tree where those two
unrelated branches are put together; hence, the “long branch attraction” effect. The figure below provides an
example of this effect.
Long branch attraction: lineages 1 and 3 are not sister taxa, but are
recovered incorrectly as sister taxa because high evolutionary rates lead
to an excess of non-phylogenetic similarities in character states in those
two lineages.
[Figure: the true tree for lineages 1-4, with branches marked as having an extremely high rate of substitution, beside the
inferred tree.]
Note that maximum parsimony appears to be particularly sensitive to LBA.
Evaluation criteria
All phylogenetic methods have advantages and disadvantages. There are a variety of criteria (see list below)
by which one can judge a method, and a method that does not perform well by one standard often will perform
well when measured by another standard.
• Consistency: An estimation method is said to be (statistically) consistent if the estimate converges to
  the true value of the parameter when the amount of data approaches infinity. A tree construction
  method is said to be consistent if the estimated tree is the true tree when the number of sites in the
  sequence goes to infinity. For model-based methods, the definition of consistency assumes that the
  model is correct. There has been a lot of discussion about this criterion since Felsenstein (1978)
  demonstrated that parsimony can be inconsistent under some model-tree combinations. A method that
  is inconsistent might be said to have a systematic error.
• Efficiency measures how often we recover the true relationship given limited data. It is usually
  measured by the probability of recovering the correct tree or subtree (represented by internal nodes of
  the tree) when there is a limited number of sites in the sequence. In finite data, every method has
  random errors or sampling errors, and can get the tree wrong just by chance if there is not enough
  information in the sample of data. However, a more-efficient method has smaller sampling errors than
  an inefficient method and will recover the correct tree more often than the inefficient method at a given
  finite sample-size.
• Robustness: A method is robust if it still gives correct answers when the assumptions of the method are
  wrong; that is, a robust method is not sensitive to violations of its assumptions.
• Computational speed. Distance-based methods are very fast. Likelihood is the slowest.
• Philosophical justifications (typically vacuous arguments)
Evaluation methods
Given a wide variety of methods, we would like to know how each performs at recovering a set of phylogenetic
relationships. However, there are a variety of approaches for this, and each approach has its own advantages
and disadvantages as well.
• Computer simulation. You can simulate many replicate data sets under a simulation model. You then
use various tree reconstruction methods to analyse the data and see whether each method recovers
the true tree, which you used during the simulation. You can change the variables in the simulation
such as the sequence length, the shape of the true tree, the sequence divergence, etc. to see their
effects. Simulation is probably the most commonly used method for evaluating phylogeny
reconstruction methods, and there are computer programs for simulating data sets such as seq-gen by
Andrew Rambaut in Oxford and evolver in Ziheng Yang’s paml package.
A criticism of computer simulation is that the models used for simulation do not reflect the true
complexity of molecular evolution.
• Lab-generated phylogenies. Hillis et al. (1992 Science 255:589-92) generated a known phylogeny in
the lab using the bacteriophage T7. Since the phage was sequenced at different stages and then
separated to produce different lineages, the phylogeny as well as the ancestral sequences is known.
They then used tree reconstruction methods to reconstruct the history. All methods performed
extremely well!
A criticism of lab-generated phylogenies is that one lacks the control of specific aspects of molecular
evolution that one has in a simulation study.
• Well-established phylogenies. In some cases, the phylogenetic relationship is almost certain, and such
well-established phylogenies can be used to evaluate the performance of tree reconstruction methods
or the utility of different genes.
A criticism of this approach is that the number of cases of well-established phylogenies is so low that
this method is impossible to apply to many questions.
Some general comments about the powers and pitfalls of different methods
Given both the complexity of molecular evolution and the wide variety of approaches to phylogeny
reconstruction, it is difficult to make any general recommendations. It appears that all methods tend to give
incorrect trees when sequence length is small, and there are conditions that can cause all methods to be
inconsistent. Perhaps the only safe recommendation is that one should not reject an entire method simply
because it did not perform well in a particular computer simulation or some other study of performance. The
utility of different methods will vary greatly among datasets, and also will depend on the analytical objectives.
For example, you might not want to use branch lengths from a parsimony tree to estimate divergence times for
very deep lineages, yet you might want to use parsimony to estimate a tree topology from a large sample of
closely related species.
Phylogenetic uncertainty
Because most of the above applications rely on the assumption that a phylogeny is known without
error, there is always the possibility that some conclusions might be overturned if a subsequent phylogenetic
study suggests a different tree topology.
One of the most common methods of minimizing dependence on a single estimate of a phylogeny is to
conduct phylogenetic analyses using several methods or several different datasets. The idea is that those
parts of the phylogeny that are robust to method or dataset are likely to be good estimates of the true
phylogeny. The different topologies can be combined to form a STRICT CONSENSUS TREE, which is a single tree
that retains only the monophyletic groups found in all the individual estimates of the phylogeny. Phylogeny-based
analyses are then conducted using only the information about evolutionary relationships contained in the
consensus tree.
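As an illustration, a strict consensus can be sketched as the intersection of the clade sets of the individual trees. The encoding used here (each rooted tree as a set of clades, each clade as a frozenset of taxon names) and the three example trees are assumptions made for this sketch; they do not come from any particular software package.

```python
def strict_consensus(trees):
    """Return the clades (monophyletic groups) present in every tree.

    Each tree is a set of clades; each clade is a frozenset of taxon names.
    This encoding is a hypothetical choice made for the sketch.
    """
    trees = list(trees)
    if not trees:
        return set()
    shared = set(trees[0])
    for clades in trees[1:]:
        shared &= set(clades)  # keep only clades also found in this tree
    return shared


def clade(*taxa):
    """Helper: build one clade from taxon names."""
    return frozenset(taxa)


# Three hypothetical estimates of the phylogeny of taxa A-E.
tree_1 = {clade("A", "B"), clade("A", "B", "C"), clade("D", "E")}       # (((A,B),C),(D,E))
tree_2 = {clade("A", "B"), clade("D", "E"), clade("C", "D", "E")}       # ((A,B),(C,(D,E)))
tree_3 = {clade("A", "B"), clade("A", "B", "C"), clade("A", "B", "C", "D")}  # ((((A,B),C),D),E)

consensus = strict_consensus([tree_1, tree_2, tree_3])
print(sorted(sorted(c) for c in consensus))  # [['A', 'B']]
```

Only the group (A,B) survives, because it is the only clade recovered in all three trees; every other relationship collapses into a polytomy in the strict consensus.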
In statistics, when we estimate a parameter, we also need to calculate the confidence interval to
indicate how reliable our point estimate is. For example, a 95% confidence interval, say 110 ± 10, means that if
the same experiment is repeated many times, we would expect the interval to cover the true value of the
parameter in 95% of the replicates. Clearly it is desirable for us to provide a measure of the reliability of the
estimated phylogeny. However, the phylogeny represents a complex structure that is quite different from a
conventional parameter, and this difference makes it difficult to construct a confidence interval for the estimated
phylogeny. No rigorously justified analytical solution to the problem is available. There are many approximate
methods currently in use, including some that are known not to work. We will discuss one of the most popular,
the nonparametric bootstrap.
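The repeated-experiment interpretation of a confidence interval for an ordinary parameter can be checked by simulation. The sketch below assumes normally distributed data and a normal-approximation interval (mean ± 1.96 standard errors); the true mean of 110, the standard deviation of 50, and the sample size of 100 are hypothetical values chosen so that the interval is roughly ± 10.

```python
import random
import statistics


def interval_covers_truth(rng, true_mean, sd, n):
    """Draw one sample, build an approximate 95% CI, and report whether it covers the truth."""
    sample = [rng.gauss(true_mean, sd) for _ in range(n)]
    mean = statistics.fmean(sample)
    half_width = 1.96 * statistics.stdev(sample) / n ** 0.5
    return mean - half_width <= true_mean <= mean + half_width


def coverage(replicates=2000, true_mean=110.0, sd=50.0, n=100, seed=1):
    """Fraction of repeated experiments whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = sum(interval_covers_truth(rng, true_mean, sd, n) for _ in range(replicates))
    return hits / replicates


print(coverage())  # expected to be close to 0.95
```

No comparably simple recipe exists for a tree topology, which is why approximate methods such as the bootstrap are used instead.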
Bootstrap proportions
The NONPARAMETRIC BOOTSTRAP is the most commonly used method, and also the most respected. This
method generates many (100 or 500) pseudo data sets (called bootstrap samples) by resampling sites from the
original data set with replacement. Each bootstrap sample has the same number of sites as the original
sequence. Each site in the bootstrap sample is chosen at random from the original data set, so some sites in
the original data set might not be sampled at all and some others might be sampled multiple times. For
example, in the figure below sites 4 and 9 were each included twice in the bootstrap pseudosample, while sites
2 and 7 are not sampled at all. Of course, different sites will be sampled in different bootstrap samples. Each
pseudo data set is analyzed using a phylogeny reconstruction method in the same way as the original data set,
and the proportion of bootstrap samples that supports a particular clade is recorded. This is known as the
bootstrap support, or bootstrap proportion, for that clade.
Original data:
Site        1   2   3   4   5   6   7   8   9   10
Species 1   T   C   A   G   T   T   C   G   A   T
Species 2   C   C   G   G   T   G   A   C   A   T
Species 3   A   C   A   T   T   T   A   G   A   A
Species 4   G   C   A   T   T   G   A   C   A   A

One possible bootstrap pseudosample:
Site        9   4   3   1   4   8   10  3   9   10
Species 1   A   G   A   T   G   G   T   A   A   T
Species 2   A   G   G   C   G   C   T   G   A   T
Species 3   A   T   A   A   T   G   A   A   A   A
Species 4   A   T   A   G   T   C   A   A   A   A
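The resampling step can be sketched in a few lines. The alignment below is the four-species, ten-site example from the figure; the function name and the use of a seeded random number generator are assumptions made for this illustration.

```python
import random

# Original data from the figure: one string per species, one character per site.
ALIGNMENT = {
    "Species 1": "TCAGTTCGAT",
    "Species 2": "CCGGTGACAT",
    "Species 3": "ACATTTAGAA",
    "Species 4": "GCATTGACAA",
}


def bootstrap_sample(alignment, rng):
    """Resample sites (columns) with replacement.

    Returns the 1-based site numbers that were drawn and the pseudosample,
    which has the same number of sites as the original alignment.
    """
    n_sites = len(next(iter(alignment.values())))
    drawn = [rng.randrange(n_sites) for _ in range(n_sites)]
    pseudosample = {
        name: "".join(seq[i] for i in drawn) for name, seq in alignment.items()
    }
    return [i + 1 for i in drawn], pseudosample


sites, pseudosample = bootstrap_sample(ALIGNMENT, random.Random(0))
print("Sites drawn:", sites)
for name, seq in pseudosample.items():
    print(name, seq)
```

Because whole columns are drawn, the characters of the four species at a given site always stay together; only the set of sites, and how often each appears, changes between pseudosamples.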
Overview of the bootstrap in phylogeny reconstruction:
[Figure: The sample data for a phylogenetic analysis yield the estimate of the true phylogeny. The same data are
resampled into Pseudosample 1, 2, 3, ..., n; each pseudosample yields Bootstrap Tree 1, 2, 3, ..., n. The variation
among the bootstrap trees reflects the sampling variance.]
The bootstrap is a method for assessing the sampling (random) error in the data. Intuitively, if the different sites
have consistent phylogenetic signals, there will be little conflict among the bootstrap pseudosamples and high
bootstrap proportions will be achieved for many clades. However, if the reconstruction method has systematic
errors (for example, if the method is inconsistent), it can give high bootstrap support for a wrong clade!
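A complete, if highly simplified, bootstrap analysis can be sketched for four taxa. With four taxa there are only three unrooted topologies, and under unweighted parsimony the best one is the topology supported by the most parsimony-informative sites (sites where two species share one state and the other two share another). The sketch below uses that shortcut in place of a real tree search. The four-species alignment is the example from the figure, while the split labels, the 500 replicates, and the rule of sharing credit equally among tied topologies are assumptions made for this illustration.

```python
import random
from collections import Counter

# Rows are Species 1-4 from the example alignment; columns are sites 1-10.
ALIGNMENT = ["TCAGTTCGAT", "CCGGTGACAT", "ACATTTAGAA", "GCATTGACAA"]

# The three unrooted four-taxon topologies, labeled by the split they contain.
SPLITS = ("12|34", "13|24", "14|23")


def supported_split(column):
    """Return the split favored by a parsimony-informative site, else None."""
    a, b, c, d = column
    if a == b and c == d and a != c:
        return "12|34"
    if a == c and b == d and a != b:
        return "13|24"
    if a == d and b == c and a != b:
        return "14|23"
    return None


def most_parsimonious(columns):
    """Splits with the most supporting informative sites (ties are all returned)."""
    counts = Counter({s: 0 for s in SPLITS})
    counts.update(s for s in map(supported_split, columns) if s is not None)
    best = max(counts.values())
    return [s for s in SPLITS if counts[s] == best]


def bootstrap_proportions(alignment, replicates=500, seed=0):
    """Proportion of pseudosamples in which each split is the preferred one."""
    rng = random.Random(seed)
    columns = list(zip(*alignment))
    support = Counter({s: 0.0 for s in SPLITS})
    for _ in range(replicates):
        pseudosample = [rng.choice(columns) for _ in columns]  # sites with replacement
        winners = most_parsimonious(pseudosample)
        for split in winners:
            support[split] += 1.0 / len(winners)  # assumption: ties share credit
    return {s: support[s] / replicates for s in SPLITS}


for split, proportion in bootstrap_proportions(ALIGNMENT).items():
    print(split, round(proportion, 3))
```

In the example data, sites 4 and 10 favor grouping species 1 with 2, while sites 6 and 8 favor grouping species 1 with 3, so neither split should receive high bootstrap support. This is the conflicting-signal situation described above.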
Example of gene trees with the non-parametric bootstrap method used to assess
the support for individual branches of the trees. Bootstrap proportions were
obtained from 2000 replications and values > 70% are shown above the involved
branches.
From: Ward et al. (2002) PNAS 99:9278-9283.
Note that in molecular phylogenetics the interpretation of bootstrap proportions is controversial.
Bayesian posterior probabilities
The recent application of Bayesian methods to phylogenetic inference provided biologists with a much more
sophisticated tool to account for phylogenetic uncertainty. Here, Markov chain Monte Carlo (MCMC) methods
are used to approximate the Bayesian posterior probabilities of different tree topologies. These probabilities
are used to identify the set of trees that have a 95% probability of including the true tree. Now it is simply a
matter of analyzing each of the tree topologies in this set and weighting each result by the probability of the tree
on which the analysis was performed. Although application of Bayesian methods to phylogenetics is still in its
infancy, this analytical framework should greatly reduce the sensitivity of future evolutionary studies to the
assumption that a single estimate of a phylogeny is known without error.
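The weighting scheme described above can be sketched as follows. The posterior probabilities would normally be approximated by the frequencies of topologies in an MCMC sample; the tree labels, sample counts, and per-tree results used here are hypothetical, and renormalizing the weights within the 95% set is an assumption of this sketch.

```python
from collections import Counter


def posterior_from_samples(sampled_topologies):
    """Approximate posterior probabilities by topology frequencies in an MCMC sample."""
    counts = Counter(sampled_topologies)
    total = sum(counts.values())
    return {tree: n / total for tree, n in counts.items()}


def credible_set(posterior, level=0.95):
    """Smallest set of trees, taken in order of decreasing probability, reaching `level`."""
    ranked = sorted(posterior.items(), key=lambda item: item[1], reverse=True)
    chosen, cumulative = [], 0.0
    for tree, prob in ranked:
        chosen.append((tree, prob))
        cumulative += prob
        if cumulative >= level:
            break
    return chosen


def weighted_result(trees_with_probs, analysis):
    """Average a per-tree result, weighting each tree by its posterior probability."""
    total = sum(prob for _, prob in trees_with_probs)  # assumption: renormalize within the set
    return sum(prob * analysis(tree) for tree, prob in trees_with_probs) / total


# Hypothetical MCMC output: 100 sampled topologies, identified only by label.
samples = ["tree_1"] * 60 + ["tree_2"] * 25 + ["tree_3"] * 12 + ["tree_4"] * 3
# Hypothetical result of some downstream analysis conducted on each topology.
result_on_tree = {"tree_1": 10.0, "tree_2": 14.0, "tree_3": 20.0, "tree_4": 35.0}

trees = credible_set(posterior_from_samples(samples))
print([tree for tree, _ in trees])  # ['tree_1', 'tree_2', 'tree_3']
print(weighted_result(trees, result_on_tree.get))
```

The final value reflects all three trees in the credible set rather than the single best tree, so a conclusion that holds only on tree_1 is appropriately down-weighted.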
Appendix I: Comparison of tree-reconstruction methods
Parsimony methods
Advantages:
1. Intuitive appeal: Occam’s razor.
2. Very powerful implementation in current software
3. Well studied, hence its properties are well understood
4. Identifies synapomorphies
5. Can be “weighted” to improve performance
6. High efficiency when weighted properly, or when implicit assumptions are not violated.
7. Only framework for some types of molecular data (e.g., patterns of large scale genomic changes, SINEs, LINEs)
Disadvantages:
1. Implicit assumption that rates are low and largely homogeneous, and that nucleotide composition is largely
homogeneous (relative to other methods)
2. Branch lengths are substantially underestimated when rates are high.
3. Unweighted parsimony is not as robust to violation of implicit assumptions; it appears to be inconsistent under
wider conditions than most other methods
4. Under realistic patterns of sequence evolution unweighted parsimony has lower efficiency than many other methods
5. Controversy about how to properly weight parsimony
6. Difficult to treat in a statistical framework because there is no way to compute means and variances of minimum
numbers of substitutions
Distance methods
Advantages:
1. High computational speed for most tree selection methods (e.g., UPGMA, NJ, ME)
2. Can change assumptions of underlying models
3. Some tree selection methods are well studied, and their properties are well understood (e.g., UPGMA, NJ, ME)
4. Some tree selection methods provide a statistical framework for hypothesis testing.
Disadvantages:
1. Software for some tree selection methods (e.g., least squares) less developed than for other distance or
parsimony based tree selection methods
2. Some tree selection methods are strictly algorithmic and have no optimality criterion (e.g., UPGMA and NJ)
3. Loss of information in conversion of nucleotide or amino acid sequences to genetic distances
4. Sometimes produce biologically impossible branch lengths
5. Appear to be less efficient than ML or weighted parsimony methods
Likelihood methods
Advantages:
1. Sound statistical framework for evaluating a wide variety of evolutionary hypotheses.
2. Model based approach with no loss of information, as occurs in the distance based methods
3. Can test and change assumptions of underlying models
4. Very efficient when model is correct
Disadvantages:
1. Computational burden for tree selection is much higher than all other methods
2. Uncertainty over how to choose substitution parameter values when computational costs prevent their estimation
by ML.
3. Difficulties of treating a tree topology as a parameter in a statistical approach to its estimation; a tree is not a
numerical quantity.