Comparing data sets
It is possible to collect multiple different data sets for the same tips. Because one could
theoretically combine these data sets into one larger data set, the individual data sets are
called partitions. A partition is a kind or block of data that has been obtained from the
same tips (e.g., molecular and morphological data, gene sequence 1 vs. gene sequence 2,
plastid DNA versus nuclear DNA, introns versus exons, etc.). The data matrix below
includes two partitions of forty DNA sequence characters each.
Tip  Partition 1                               Partition 2
A    TTTAGATCTCACAATTTCGTGGGCAACATCACTTGCCAGA  GACGTAATCACCCAAGCCCGTTGCCTCCGGAAACGACGTG
B    TTTAGATTTCACAAGTTCGTGGGCAACGTCACTTACCAGA  GACGTAGGTACCCAAGCCCCTTGCCTTGGTAAACGACGTG
C    TTTAGATTTCACAAGTTCGAGGGGAGCACGACTTGTCAAA  GACGTAATCACCTAAGTCCCTTGCGTCCGTAAACGACGCG
D    TTCATACCCTACGAGGTCATGGGCATCACGACTTATCAGA  GACGCAACAATTAAAGCCCTTTGTACGCGGAAACCGCGTG
E    TTCAAATTTGACGACTTCGTGGGCATGACGACTTATCAGT  GACGCAACAACTAAACCCCTTTGTACGCGGAAACCACGCA
F    TTCGGGCTTTAGTAGTCCCTGGGCAGCACAATTAGTCGTA  AACGTAACAACTGAACCCCCTTGTTCGCGTAGACTGCATG
G    TTTAAGTCTCAGGAATCGCTAGGCAGCACAATTTGTCTTA  GACGTAACCACTAAACCCCCCGGTCTCCGCTGATGACGTG
H    TTCAGGTTTCAAGAATCGTTGGGCAGCACAATTTGTCCTA  GACCTAACCACTAAGCCCCCCGGTCTGCGCGGATGACGTG
There are some good reasons to wonder if the true tree for each partition is the same.
Knowing that the partitions have the same tree provides information on the extent of
reticulate evolution in the group’s history (suggesting it is low) and might indicate that
the partitions are functionally and/or physically linked. Also, from a methodological point
of view, if the partitions share the same history then we can combine the data partitions
into a single analysis, thereby obtaining a more detailed estimate of the shared history.
So how can we judge if the partitions derive from the same tree?
The first approaches to answering this question begin by conducting a phylogenetic
analysis on each partition separately. For example, the most parsimonious (MP) trees for
partition 1 and partition 2 above are as follows.
[Figure: the single MP tree from partition 1 and the consensus of three MP trees from partition 2]
You will note that these trees are different. However, that does not tell us that the
partitions have necessarily tracked different histories. It could be that they have tracked
the same history, but with only forty characters in each data set, chance has resulted in a
misleading tree from one or both data sets.
One observation to be made is that the trees obtained from partition 2 include four
resolved clades: (D, E), (D, E, F), (G, H), and (D, E, F, G, H). Of these, all except (D, E,
F) are also found with partition 1. While these two trees are not identical, they are
adjacent in tree space, something that would be rather improbable if the two partitions
had tracked completely different histories. For eight tips, only 15 of the 10,395 possible
unrooted bifurcating trees are adjacent to a randomly chosen tree [check]. This means that
the probability of two partitions yielding optimal trees this close by chance alone is on
the order of 1 in 1,000. This is an important point to stress. Even when partitions differ in the
optimal tree they support, they generally yield trees that are more similar to each other
than would be expected by chance. This is very hard to explain except by reference to
the partitions (typically different genes) sharing a similar history of descent from
common ancestry. Indeed, the fact that the trees derived from different genes are much
more similar than expected by chance provides some of the most concrete statistical
evidence for the truth of evolutionary descent from common ancestry (Penny et al. 1982).
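As a quick check on the size of tree space, the number of unrooted, fully bifurcating trees for n tips is (2n - 5)!!, the product of the odd numbers from 1 to 2n - 5. The sketch below computes that count. The figure of 15 adjacent trees is simply taken from the text above, where it is still flagged for checking, and is not derived by the code; the function name is illustrative.

```python
def count_unrooted_trees(n_tips: int) -> int:
    """Number of unrooted, fully bifurcating trees on n_tips labeled tips.

    Uses the standard double-factorial formula (2n - 5)!! for n >= 3.
    """
    if n_tips < 3:
        raise ValueError("need at least three tips")
    count = 1
    for odd in range(3, 2 * n_tips - 4, 2):  # 3, 5, ..., 2n - 5
        count *= odd
    return count


total = count_unrooted_trees(8)
adjacent = 15  # value quoted in the text, still marked [check] there
print(total)             # 10395
print(adjacent / total)  # fraction of tree space adjacent to one tree
```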
Showing that the partitions yield trees that are similar does not mean that the partitions
tracked the same history. For example, they could have tracked a history that is identical
except that introgression in one gene means that a single tip occupies a different position
in the true trees for each partition. Thus, showing that partitions yield surprisingly
similar trees is not sufficient to conclude that they have derived from the same tree. So
how can you assess that?
There are two general approaches taken to answering this question: topology tests and
partition homogeneity tests. While neither is entirely satisfactory, it will be valuable to
review them both because they illustrate some important general principles. It is worth
mentioning that while I am introducing these tests in the context of parsimony they can
equally be used with other tree optimality criteria such as maximum likelihood.
The starting point for using topology tests is the recognition that for any data set (for
example a partition) there is a set of trees that are worse than the MP tree but not so much
worse that they are rejected by a topology test such as the Templeton test. Let’s call the
set of trees that are not significantly longer than the MP tree the plausible trees.
Returning to the tree space metaphor, if the most parsimonious tree defines the highest
peak, plausible trees are those whose “altitude” is high enough that they are not
significantly lower than the MP tree. Even if the MP trees from two partitions differ, there
could be one or more trees that are plausible for both partitions. When we find such shared
plausible trees, we generally accept that the partitions are likely to have tracked the
same history.

[Figure: tree space, showing the optimal tree and the plausible trees for partition 1 and for partition 2]

Within a parsimony framework, topology tests such as the Templeton test provide a tool for
assessing if a tree is plausible for each data partition. But it is usually not practical to
assess the plausibility of all possible trees because there are too many of them. Instead,
systematists engage in an artful hunt for jointly plausible trees. To illustrate this,
consider a case with two partitions that have yielded different MP trees.
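The Templeton test works from the number of steps each character requires on the two trees being compared. The following is a minimal sketch of that input, assuming rooted binary trees written as nested tuples and a toy four-tip alignment. The tree shapes, the sequences, and the function names are illustrative and are not the data discussed above. The code counts steps with the Fitch algorithm and leaves the significance test itself (a Wilcoxon signed-rank test on the nonzero differences) to a statistics package.

```python
def fitch(tree, states):
    """Return (possible ancestral states, steps) for one character.

    `tree` is a tip name or a pair (left, right) of subtrees;
    `states` maps each tip name to its state for this character.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch(tree[0], states)
    right_set, right_steps = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def steps_per_character(tree, alignment):
    """List the Fitch step count of every column of the alignment."""
    n_chars = len(next(iter(alignment.values())))
    return [
        fitch(tree, {tip: seq[i] for tip, seq in alignment.items()})[1]
        for i in range(n_chars)
    ]


# Hypothetical toy data, for illustration only.
alignment = {"A": "AAG", "B": "ACG", "C": "CCT", "D": "CAT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "D"), ("B", "C"))

steps_1 = steps_per_character(tree_1, alignment)
steps_2 = steps_per_character(tree_2, alignment)
differences = [s1 - s2 for s1, s2 in zip(steps_1, steps_2)]
print(sum(steps_1), sum(steps_2), differences)
```

Characters with a difference of zero carry no information about which tree is preferred, which is why only the nonzero differences enter the signed-rank test.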
The first thing one can do is ask if any of the MP trees from partition one are plausible
for partition two, and vice versa. This amounts to asking if the peak for one partition lies
within the neighborhood of plausibility for the other partition.
If this is not the case, you could then explore whether the MP tree obtained when the two
partitions are treated as a single data set (the combined tree) lies in the plausibility
zone of each data set. You could do this by using a Templeton test to ask if the combined
tree explains partition 1 significantly worse than the MP tree for partition 1, and equally
if the combined tree explains partition 2 significantly worse than the MP tree for
partition 2. If in both cases the combined tree is not significantly worse than the MP tree
(i.e., it is plausible for both partitions), then you have shown that the plausible zones
overlap (see the figure).

[Figure: tree space, showing the optimal trees for partition 1, partition 2, and the combined data, and the plausible trees for each partition]

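The search order described above can be summarised as a small routine. This is only a sketch: `is_plausible` stands in for whatever topology test is used (for example a Templeton test at a chosen significance level) and is a hypothetical callback, as are the tree labels and the toy results in the usage example.

```python
from typing import Callable, Iterable, List


def jointly_plausible(
    mp_trees_1: Iterable[str],
    mp_trees_2: Iterable[str],
    combined_trees: Iterable[str],
    is_plausible: Callable[[str, int], bool],
) -> List[str]:
    """Return candidate trees that neither partition rejects.

    Candidates are tried in the order described in the text: the MP trees
    of partition 1, the MP trees of partition 2, then the MP trees of the
    combined data. `is_plausible(tree, k)` must report whether partition k
    fails to reject `tree`.
    """
    found = []
    for tree in [*mp_trees_1, *mp_trees_2, *combined_trees]:
        if tree not in found and is_plausible(tree, 1) and is_plausible(tree, 2):
            found.append(tree)
    return found


# Hypothetical test outcomes: (tree label, partition) -> not rejected?
toy_results = {
    ("mp1", 1): True, ("mp1", 2): False,
    ("mp2", 1): False, ("mp2", 2): True,
    ("combined", 1): True, ("combined", 2): True,
}
shared = jointly_plausible(
    ["mp1"], ["mp2"], ["combined"],
    lambda tree, k: toy_results[(tree, k)],
)
print(shared)  # ['combined']
```

An empty result does not show that the plausibility zones are disjoint; it only means that none of the candidates tried lies in both.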
Often you will fail to find a tree that is plausible for both partitions. In that case, it
may be fruitful to reverse the procedure and see if you can prove that the plausibility
zones do not overlap. Luckily, this can be achieved without having to fully map out the
plausibility zone of either data set. To explain how this works, remember that tree space
is divided into neighborhoods that have or lack particular clades. Further, recall that we
can use a topology test to ask whether a tree with (or without) a particular clade can be
rejected by a data set. If partition one rejects all trees that lack clade x, while
partition two rejects all trees that have clade x, then we can conclude that there are no
trees that are plausible for both partitions. Trees that have clade x are not plausible for
partition 2; trees that lack clade x are not plausible for partition 1.

[Figure: tree space, showing the optimal tree and the plausible trees for each partition, and the neighborhood of all trees with clade x]

Through the creative use of topology tests it is often, but not always, possible to establish
whether there are trees that are plausible for both partitions. Because evolutionary theory
predicts that different data from the same taxa will tend to share the same underlying tree,
we usually treat the sharing of a common history as the null hypothesis. Thus, it is
normal to assume that two partitions share the same true tree unless we have evidence
that the plausibility zones do not overlap. The problem with this is that the partitions
could show an overlap in their plausibility zones not because the true trees are the same,
but because the plausibility zones for each partition are very broad, for example because
one or both partitions lack a strong phylogenetic signal.
The alternative to using topology tests is to use a partition homogeneity test (PHT; originally
given the less transparent name: incongruence length difference test). This method
doesn’t worry so much about the topology of the optimal or plausible trees but rather
focuses on the length of the MP trees for each partition. Recall (PTP and skewness tests,
page xx) that as we decrease the phylogenetic signal within a data set we tend to increase
the length of the optimal tree because of conflict among characters. The same will
happen if we combine data with conflicting signals as shown in the following simple
(extreme) example.
Tip  Data set 1                           Data set 2
A    TTCGAGAACACGGCCCTTTGCGACCCATGTTGTTA  TTCGAGAACACGGCCCTTTGCGACCCATGTTGTTA
B    TCAGAGAACACGACACTTTGCGACCCATGTTGTTA  TCCGGAAGTAAACACCCTTGTAACCCCAGCTATCG
C    TCCGAGAGCACGGACCTTCGCGACCTATGTTATTG  TCCGGAAGTAGACGCCCTTGTAACCCCAGCTATCG
D    TCCGGAAGTAAACACCCTTGTAACCCCAGCTATCG  TCCGAGAGCACGGACCTTCGCGACCTATGTTATTG
E    TCCGGAAGTAGACGCCCTTGTAACCCCAGCTATCG  TCTGGGAGCACAAGTCCTCACGACCCCTGCTATTG
F    TCCGGGAGTAAATGCCCTTGGGACCCCTGCTATTG  TCTGGGAGCACAAATCCTCACGACCCCTGCTATTG
G    TCTGGGAGCACAAGTCCTCACGACCCCTGCTATTG  TCAGAGAACACGACACTTTGCGACCCATGTTGTTA
H    TCTGGGAGCACAAATCCTCACGACCCCTGCTATTG  TCCGGGAGTAAATGCCCTTGGGACCCCTGCTATTG

Data sets one and two are identical except that the labels among the taxa have been
muddled up. Thus, while they each support a different tree, the length of the optimal tree
for each is identical (34 steps).
If we generate a composite data set with half of data set one and half of data set two, the
shortest tree is 41 or 44 steps (depending on which half of which data set is included). The
reason that these trees are longer than those for the original data sets is that they
combine data with conflicting phylogenetic signal. If two data sets have a different
signal, then generating composite data sets will tend to increase the length of the trees.
The PHT uses this principle. First you determine the length of the MP trees for the
original data and sum this up across partitions. In this example, the two optimal trees
sum to 68 steps. Now, you randomly reassign characters to the two partitions. For
example, the top row below shows a random assignment of characters (columns) to two
partitions, with 35 characters in each (like the original data).

Part.  1221212211121221211221212112122121221112121221212212111212112122112221
A      TTCGAGAACACGGCCCTTTGCGACCCATGTTGTTATTCGAGAACACGGCCCTTTGCGACCCATGTTGTTA
B      TCAGAGAACACGACACTTTGCGACCCATGTTGTTATCCGGAAGTAAACACCCTTGTAACCCCAGCTATCG
C      TCCGAGAGCACGGACCTTCGCGACCTATGTTATTGTCCGGAAGTAGACGCCCTTGTAACCCCAGCTATCG
D      TCCGGAAGTAAACACCCTTGTAACCCCAGCTATCGTCCGAGAGCACGGACCTTCGCGACCTATGTTATTG
E      TCCGGAAGTAGACGCCCTTGTAACCCCAGCTATCGTCTGGGAGCACAAGTCCTCACGACCCCTGCTATTG
F      TCCGGGAGTAAATGCCCTTGGGACCCCTGCTATTGTCTGGGAGCACAAATCCTCACGACCCCTGCTATTG
G      TCTGGGAGCACAAGTCCTCACGACCCCTGCTATTGTCAGAGAACACGACACTTTGCGACCCATGTTGTTA
H      TCTGGGAGCACAAATCCTCACGACCCCTGCTATTGTCCGGGAGTAAATGCCCTTGGGACCCCTGCTATTG

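A random reassignment like the one in the top row can be generated as follows. This is a minimal sketch: the function names and the fixed seed are illustrative choices, only tips A and B are included to keep it short, and finding the MP tree length of each resulting partition is left to a tree-search program.

```python
import random


def random_partition_labels(size_1, size_2, rng):
    """Randomly assign characters to two partitions of fixed sizes.

    Returns a string of '1' and '2', one symbol per character (column),
    containing exactly size_1 ones and size_2 twos.
    """
    labels = ["1"] * size_1 + ["2"] * size_2
    rng.shuffle(labels)
    return "".join(labels)


def split_alignment(alignment, labels):
    """Split each sequence into the columns labelled '1' and those labelled '2'."""
    part_1 = {}
    part_2 = {}
    for tip, seq in alignment.items():
        part_1[tip] = "".join(c for c, lab in zip(seq, labels) if lab == "1")
        part_2[tip] = "".join(c for c, lab in zip(seq, labels) if lab == "2")
    return part_1, part_2


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
labels = random_partition_labels(35, 35, rng)

# Tips A and B of the combined matrix (data set 1 followed by data set 2).
combined = {
    "A": "TTCGAGAACACGGCCCTTTGCGACCCATGTTGTTATTCGAGAACACGGCCCTTTGCGACCCATGTTGTTA",
    "B": "TCAGAGAACACGACACTTTGCGACCCATGTTGTTATCCGGAAGTAAACACCCTTGTAACCCCAGCTATCG",
}
part_1, part_2 = split_alignment(combined, labels)
print(labels)
print(len(part_1["A"]), len(part_2["A"]))  # 35 35
```

Keeping the partition sizes fixed at their original values means each random pair differs from the original only in which characters are grouped together.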
The shortest tree for random partition 1 has length 39, whereas the shortest for partition 2
has length 46, giving a sum of 85 steps. The PHT repeats this procedure many times and
sees how often the sum for the two random partitions is as low as or lower than the sum for
the original partitions. In this case the distribution of lengths for 100 random partitions
is as shown.

Sum of lengths of the two MP trees   Number of times found
79                                   1
80                                   1
81                                   4
82                                   8
83                                   12
84                                   13
85                                   25
86                                   26
87                                   10

Because none of the random partitions yields pairs of trees that are as short as 68 steps,
we have good reason to conclude that the original partitions are not random: that the
partitions have distinctly different phylogenetic signals. This would argue that the two
partitions are not derived from the same tree-like process (as indeed they are not in this
simulated case) and that they should therefore not be combined into a single data matrix.
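The comparison in this example reduces to a simple tally. The sketch below uses the distribution tabulated above and the observed sum of 68 steps. The variable names are illustrative, and the add-one form of the P value is a common convention for randomisation tests rather than something stated in the text.

```python
# Sum of the two MP tree lengths -> number of random partitions giving it.
random_sums = {79: 1, 80: 1, 81: 4, 82: 8, 83: 12, 84: 13, 85: 25, 86: 26, 87: 10}
observed_sum = 68  # 34 + 34 steps for the original partitions

n_replicates = sum(random_sums.values())
n_as_short = sum(n for length, n in random_sums.items() if length <= observed_sum)

print(n_replicates)  # 100
print(n_as_short)    # 0

# Assumed convention: count the original partition as one of the replicates.
p_value = (n_as_short + 1) / (n_replicates + 1)
print(round(p_value, 4))  # 0.0099
```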