Download 20071217161016301

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

RNA-Seq wikipedia , lookup

Metagenomics wikipedia , lookup

Quantitative comparative linguistics wikipedia , lookup

Maximum parsimony (phylogenetics) wikipedia , lookup

Computational phylogenetics wikipedia , lookup

Transcript
The deep phylogeny
problem
Using simple models to estimate trees
from sparse data sets with faintly relevant
signals.
Long history of interest in the
relationships among major
groups of animals.
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
QuickTime™ and a
TIFF (Uncompressed) decompressor
QuickTime™ and a
are needed to see this TIFF
picture.
(Uncompressed) decompressor
are needed to see this picture.
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Ernst Haeckel 1834-1919
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
QuickTime™ and a
QuickTime™ and a
TIFF (Uncompressed) decompressor
QuickTime™ and a
TIFF (Uncompressed) decompressorare needed to see this picture.
TIFF
(Uncompressed)
decompressor
are needed to see this picture.
are needed to see this picture.
Quic kTime™ and a
TIFF (Unc ompres sed) dec ompres sor
are needed to see this pic ture.
Bateson, W., 1886, The ancestry of the Chordata:
Quarterly Journal of Microscopical Science, v. 26,
p. 535-571.
Cope, E. D., 1887, The Origin of the
Fittest: New York, Appleton &
Company.
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Strong resurgent interest in late 20th Century
with the advent of Molecular Phylogenetics
First influential paper: 1988
Molecular phylogeny of the animal kingdom.
Field et al. Science 239: 748-753
Most early analyses were based on 18S rRNA.
Early enthusiasm suggested 18s sequence comparisons
were going to solve all of our problems.
But within 10 years:
Limitations of Metazoan 18S rRNA Sequence Data: Implications for
Reconstructing a Phylogeny of the Animal Kingdom and Inferring the
Reality of the Cambrian Explosion.
Quic kT i me™ and a
T IFF (Unc ompres s ed) dec ompres s or
are needed t o s ee thi s pi c ture.
Abouheif, Zardoya & Meyer. 1998
Obviously, it was claimed, we just didn’t have
enough data…
At about the same time (mid 1990’s) there was an emerging interest in
estimating animal phylogenies from whole MtDNA genome sequences.
The choice was appealing:
•Large amount of sequence (16-18kb).
•Reasonably easy to collect (no introns)
•Mode of inheritance was well understood.
•Almost no problems associated with paralogous comparisons.
But even with large amounts of data some quite controversial
groupings emerged - and different mitochondrial genes would often
suggest conflicting relationships.
In 1998, I published a study with Wes Brown in which we explored the
phylogenetic signal in the mitochondrial genome of a group of
vertebrates whose phylogenetic relationships were “uncontroversial.”
fruit fly
snail
mosquito
snail
nematode 1
fruit fly
mosquito
nematode 1
nematode 2
lancelet
sea urchin1
nematode 2
sea urchin1
sea urchin2
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
sea urchin2
lancelet
lamprey
lamprey
chicken
frog
carp
carp
trout
frog
chicken
opossum
mouse
rat
cow
blue whale
fin-back whale
trout
opossum
Obtained
mouse
rat
cow
blue whale
fin-back whale
The unexpected placement of Lancelet outside
(vertebrates + echinoderms) and the grouping of (frog+
chicken+fishes) results from parsimony analyses with
strong bootstrap support at all levels of analysis
(nucleotides, transversions and amino acids).
Likelihood analysis of the nucleotide data under the 16 canonical
models (JC, K2P, HKY, GTR + I +) all failed to yield the expected tree,
placing cephalochordates outside (vertebrates+ echinoderms)
Jukes Cantor
Equal rates
1 = 190018.6565
2 = 189706.5707
p< 0.0001
I
G
I+G
Kimura 2P
1 = 189228.2482
2 = 188966.6047
p< 0.0001
HKY 85
1 = 186023.7146
2 = 185874.1456
1 = 184936.447
2 = 184809.982
p= 0.0004
p= 0.0020
1 = 179573.3547
2 = 179525.1419
1 = 184834.253
2 = 184611.3826
1 = 183988.9626
2 = 183811.3409
1 = 180474.2233
2 = 180403.5445
p< 0.0001
p< 0.0001
p = 0.31
1 = 180223.3956
2 = 180146.4203
1 = 175238.975
2 = 175252.093
p< 0.0001
p = 0.012
p = 0.6165
1 = 181340.654
2 = 181204.7859
1 = 180109.353
2 = 180030.772
1 = 175160.793
2 = 175175.196
p = 0.0089
p = 0.5739
1 = 181487.388
2 = 18134.7965
p< 0.0001
•Expected Tree = 1
•MPT
=2
GTR
p = 0.1265
1 = 174980.1973
2 = 175003.3906
p = 0.3796
1 = 174879.9709
2 = 174903.3324
p = 0.3642
Naylor and Brown 1998
Assuming the results to be misleading, we evaluated which kind of sites might
be responsible for the misleading patterns by testing the fit of different classes
of characters to the expected tree.
Expected tree
Naylor and Brown 1998
We were able to retrieve the expected tree only when we restricted our
analyses to the subset of nucleotide sites modally coding for the amino
acids P, C, N, M and Q.
Hydrophobic residues I, L and V were found to be especially misleading.
We concluded (in 1998) that simply sequencing large amounts of
sequence wasn’t enough to ensure an accurate estimate of
phylogeny. We argued that it was more important to tailor models
to accommodate structural and functional constraints.
(NB. At that time we were not able to conduct amino acid likelihood analyses due
to computational constraints)
Naylor and Brown 1998
More recently (2007)
Dave Swofford has implemented AA likelihood models into PAUP. We
applied the MtREV model in PAUP* to the Naylor and Brown (1998) data
set to see if it yielded a different tree than that seen at the nucleotide level.
MTREV + F + 
MTREV + F
100
fruit fly
mosquito
snail
nematode 1
100
nematode 2
sea urchin 1
sea urchin 2
lancelet
lamprey
frog
carp
100
100
100
100
100
89
100
94
100
82
100
100
100
100
100
trout
chicken
opossum
mouse
rat
cow
blue whale
fin-back whale
Yields tree wherein lancelet
is sister to Vertebrata - but frog
still groups w/fishes
100
100
70
100
sea urchin 1
100
100
97
100
fruit fly
mosquito
snail
nematode 1
nematode 2
100
100
77
97
100
100
100
100
100
sea urchin 2
lancelet
lamprey
carp
trout
frog
chicken
opossum
mouse
rat
cow
blue whale
fin-back whale
Yields expected tree
-with strong support
Results corroborate prior suspicions that modelling the
substitution process appropriately is critically important
That we get strong support at all of the nodes for an incorrect topology
(frog+fishes) when we do not include  underscores that bootstrap support
reflects the sampling variance of the signal induced from the interaction
between data and model. This need not be a reflection of phylogenetic accuracy.
For the inference to be accurate, the model must be unbiased wrt the substitution
process that gave rise to the data.
Increasing sequence length
Can think of this as a landscape for a given combination of taxa and
sequences. As sequence length increases the topography of the peaks and
valleys remains roughly the same but becomes exaggerated - resulting in a
more decisive landscape (little sampling variance). As model parameters are
changed, the underlying pattern of peaks and valleys of the landscape will
shift to a different configuration of optima.
Take home message:
The details of the model are important.
Especially when applied to long sequences and a sparse sample of
highly divergent taxa. In such cases there is little help from the data to
estimate the pattern of changes. Most of the estimate comes from the
model.
Feb 2006. Delsuc et al using a “phylogenomic” approach assembled
a data set of 146 EST derived genes for 38 composite taxa representing
metazoan diversity (Fungi [2], Choanoflagelata[3], Cnidaria[4],
Protostomia [15], Echinodermata[1] , Cephalochordata[1], Tunicata [4]
and Vertebrata [8].
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
“Tunicates and not cephalochordates are the closest living relatives of vertebrates”
Delsuc et al 2006 (and cephalochordates form a clade with echinoderms)
They used MP, ML (WAG+F+), a Bayesian covarion model, partitioned
likelihood (for each of the 146 genes)
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
ML methods placed Tunicates as sister to Vertebrates and Amphioxus
(Branchiostoma) in a clade with echinoderms
Delsuc et al 2006
Delsuc et al. showed that
alternative topologies for
the relationships among
cephalochordates,
echinoderms, tunicates and
vertebrates had poorer fits
to the data Under
WAG+F+
However they cautioned:
“A definitive conclusion will
only be achieved through the
phylogenetic analysis of more
genes combined with an
increased taxon sampling
including the enigmatic
Xenoturbellidans, the
hemichordates and a greater
diversity of echinoderms”
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
November 2006. Seemingly following the advice of Delsuc et al. 2006, Bourlat
et al. added EST sequences for Xenoturbella, a hemichordate and a starfish to the
data set of Delsuc et al. and augmented it for a total of 170 genes. (>35,000AA sites)
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
“Deuterostome Phylogeny reveals monophyletic chordates and the new phylum Xenoturbellida”
Bourlat et al 2006
They were able to reproduce Delsuc et al’s tree when they removed
Xenoturbella, hemichordate and starfish. This lead Bourlat et al. to
conclude that Delsuc et al’s inference was an artifact of sparse taxonsampling / model mis-specification. (They used a concatenated analysis WAG+F+)
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Without Xenoturbella,
hemichordate and starfish
cf Delsuc et al.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
With Xenoturbella,
hemichordate and starfish
Bourlat et al 2006
So….. What’s going on?
Poor Models?
The fact that the data are so sensitive to taxon sampling suggests the
models are inadequate. If a model describes the process well inferences
should not vary as taxa are added or deleted.
Clemens Lakner reanalyzed the data set: Used a partitioned Bayesian AA
model under WAG +  with independent rates for each of the 170 gene
partitions. Same result as Bourlat et al.
So…. if it’s a model problem, it’s not one that can be fixed with a simple
rate multiplier tailored to each gene.
Non-orthologous gene comparisons?
Orthology can be a problem with ESTs because putative othologs in different
taxa are ultimately identified by sequence similarity, not phylogenetic analysis.
Typically orthologs are identified by bi-directional Blast hits.
A
B
X
X
X
X
X
X
X
X
X
X
X
X
orthologous
A
B
X
X
X
X
X
X
X
X
X
X
X
X
non-orthologous
However there are situations in which pairs of strings meeting this criterion
for “orthology” will not be true orthologs (rapid evolution of an ortholog in
one species can render it more dissimilar to its true ortholog in another
species than it is to a paralog in that same species)
In order to evaluate paralogy as a possible source of error, we computed
MP bootstrap trees for each of the 170 genes in the Bourlat et al. data set.
We filtered the resulting topologies into those that were consistent with 3
positive controls: Monophyletic: (1) vertebrates, (2) insects (3) echinoderms.
Only 16 of the 170 genes met the criteria.(?!!)
We contrasted the signal in the original set of 170 trees with that of the
filtered set of 16 genes meeting the +ve control criteria using consensus
networks implemented in Splits Trees 4. (Huson and Bryant, 2006)
RESULTS
tunicates
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Interesting…
Network consensus of 170
bootstrap parsimony trees
Network consensus of 16 trees
that meeting +ve control criteria
But amino acid likelihood of 16 gene subset yields tree with
Cephalochordates + Echinoderms and other strange groupings.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Apparently no “quick fix” for these issues
Something is awry.
Back to first principals…
What are the observed patterns of change in molecules?
Both multiple alignment and protein structural energetics suggest that AAs
are restricted in what they can change to over the course of evolution.
are needed to see this picture.
TIFF (LZW) decompressor
QuickTime™ and a
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
alignment
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
are needed to see this picture.
TIFF (LZW) decompressor
QuickTime™ and a
energetics
But current models average AA frequency
over entire alignment.
0.1
0.09
0.08
0.07
0.06
0.05
0.04
0.03
0.02
0.01
0
X
=
ACDEFGHIKLMNPQRSTVWY
20 stationary equilibrium
frequencies (avg. from
alignment)
Rate Matrix
180 pairwise relative rates
(JTT, WAG, MtREV)
Q
Consider this site
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
0.1
0.09
0.08
0.07
0.06
0.05
0.04
0.03
0.02
0.01
0
ACDEFGHIKLMNPQRSTVWY
20 stationary probabilities
equilibrium frequencies
averaged over alignment
Poor description of reality
(for this site).
0.1
0.09
0.08
0.07
0.06
0.05
0.04
0.03
0.02
0.01
0
ACDEFGHIKLMNPQRSTVWY
Site specific vector of 20
probabilities
Better
Not possible to have a separate model tailored to each site (too
many parameters) - but possible to assign sites to “categories”
with comparable evolutionary freedom to vary.
Can have a model tailored to each category and implement a
“mixture” of models .
Lartillot (2007) proposed such a mixture model to allow
categories of sites associated with different biochemical roles to
have different AA equilibrium frequencies.
(He has implemented this in his Phylobayes software)
Lartillot CAT (mixture) Model
1)
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
2)
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
0.1
0.09
0.08
0.07
0.06
0.05
0.04
0.03
0.02
0.01
0
AA Equil.Freq.profiles
ACDEFGHIKLMNPQRSTVWY
Categories (models)
1
2
3 …..
K
Site specific vector of 20
probailities
Yields a mixture of distributions that better
capture the allowable state-space
3)
Multiply each distribution by rate
matrix (WAG, JTT, MtREV etc)
From Lartillot 2007
CAT model can ameliorate model mis-specification for some
data sets.
We applied it to the Bourlat et al. data set.
Resulted in inferences that still show sensitivity to taxon
sampling, suggesting model is not adequate.
What’s going on?
•Get taxon-sampling dependent inferences for WAG, and CAT.
•Suggests models are inadequate.
•What else might be going on?
We know amino acid sequences code for structures.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Alpha -helical bundle (rhodopsin)
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Beta-barrel (porins)
We know that structures show limited variation among lineages
BUT they do show a little.
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Superimposed backbones of 28 Hurudinin structes (PDB_ID 4H1R)
We also know that patterns of substitution vary across both sites
and taxa.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Rate variation across cytB
(courtesy Jun Inoue)
With consequences for phylogenetic branch lengths.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Rate variation among lineages based on whole MtDNA (Courtesy Jun Inoue)
It is possible (likely?) that minor conformational changes in some
non-critical parts of structures affect the local freedom to vary of
sites in lineage specific ways?
Primates
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Fishes
Sites showing differences in freedom to vary between primates and fishes
If true, such changes in freedom to vary over a tree would require that amino
acid frequencies of mixture models should be allowed to change over the tree.
(mixture model : covarion hybrid)
OUTLOOK
Potential (practical) strategies:
(1) Ensure that input data meet some minimum criteria that ensures orthology.
(2) Minimize among lineage heterogeneity by excluding genes and/or sites that
exhibit non-stationary dynamics. (Housekeeping genes deeply embedded in
the genetic architecture with similar constraints across taxa)
(3) Optimize parameters on a (structurally informed) gene-by-gene basis to
accommodate context dependent evolutionary change.
Collecting more and more ESTs about which we
know little does not look promising
(to me).
SUMMARY
• As data sets include more characters, sampling variance decreases and
we are no longer shielded from the effects of model mis-specification
• Accurate estimates are likely to come from a better appreciation of the
transformational tendencies associated with individual sites.
(Biochemically motivated process models)
• Until then we will have to prop up our inadequate models with
thoughtful taxon sampling.
• Phylogenomics as currently practised is close to the worst case
scenario (Long sequences, Ambiguous orthology, Divergent taxa,
Sparse taxon-sampling).
Acknowledgements:
QuickTi me™ and a
TIFF ( Uncompressed) decompressor
are needed to see thi s pi ctur e.
QuickTi me™ and a
T IFF (Uncom pressed) decom pressor
are needed to see t his pict ure.
Clemens Lakner
Mark Holder
Interesting aside..
Lartillot Brinkmann & Phillippe (2007) published a paper advocating use of the CAT
model. Results they present are at odds with their paper the previous year (Delsuc et al.
2006) but consistent with classical vertebrate phylogeny they had overturned in 2006.
Posterior consensus
CAT+F+
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
classical
vertebrate
Phylogeny!
Summarized by Adoutte et al 2000.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Traditional phylogeny based on
morphology and embryology
(after Hyman)
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
New molecule-based phylogeny (18s)
What are ESTs anyway?
ESTs are fragments of expressed genes cloned from a cDNA library. They are
produced by single-pass sequencing from one end of a cDNA clone.
They are generally of poor quality. Many are short (<200bp). But bioinformatic
pipelines have been constructed to sort and filter them.
EST fragments deemed usable are “blasted” against reference data bases. Sequence
similarity is used to ascertain “identity” and by transitivity “function” of sequences
EST projects are underway for several organisms. Milions of bases pour in to data
bases every day, providing potentially useful comparative data.
Many phylogenetic researchers have seized the opportunity to assemble data sets of
what they consider to be orthologous ESTs in different taxa.