Comparing data sets

It is possible to collect multiple different data sets for the same tips. Because one could theoretically combine these data sets into one larger data set, the individual data sets are called partitions. A partition is a kind or block of data that has been obtained from the same tips (e.g., molecular versus morphological data, gene sequence 1 versus gene sequence 2, plastid DNA versus nuclear DNA, introns versus exons, etc.). The data matrix below includes two partitions of forty DNA sequence characters each.

     Partition 1                                 Partition 2
A    TTTAGATCTCACAATTTCGTGGGCAACATCACTTGCCAGA    GACGTAATCACCCAAGCCCGTTGCCTCCGGAAACGACGTG
B    TTTAGATTTCACAAGTTCGTGGGCAACGTCACTTACCAGA    GACGTAGGTACCCAAGCCCCTTGCCTTGGTAAACGACGTG
C    TTTAGATTTCACAAGTTCGAGGGGAGCACGACTTGTCAAA    GACGTAATCACCTAAGTCCCTTGCGTCCGTAAACGACGCG
D    TTCATACCCTACGAGGTCATGGGCATCACGACTTATCAGA    GACGCAACAATTAAAGCCCTTTGTACGCGGAAACCGCGTG
E    TTCAAATTTGACGACTTCGTGGGCATGACGACTTATCAGT    GACGCAACAACTAAACCCCTTTGTACGCGGAAACCACGCA
F    TTCGGGCTTTAGTAGTCCCTGGGCAGCACAATTAGTCGTA    AACGTAACAACTGAACCCCCTTGTTCGCGTAGACTGCATG
G    TTTAAGTCTCAGGAATCGCTAGGCAGCACAATTTGTCTTA    GACGTAACCACTAAACCCCCCGGTCTCCGCTGATGACGTG
H    TTCAGGTTTCAAGAATCGTTGGGCAGCACAATTTGTCCTA    GACCTAACCACTAAGCCCCCCGGTCTGCGCGGATGACGTG

There are some good reasons to wonder if the true tree for each partition is the same. Knowing that the partitions have the same tree provides information on the extent of reticulate evolution in the group’s history (suggesting it is low) and might indicate that the partitions are functionally and/or physically linked. Also, from a methodological point of view, if the partitions share the same history then we can combine the data partitions into a single analysis, thereby obtaining a more detailed estimate of the shared history.

So how can we judge if the partitions derive from the same tree? The first approaches to answering this question begin by conducting a phylogenetic analysis on each partition separately.
For example, the most parsimonious (MP) trees for partition 1 and partition 2 above are as follows.

[Figure: single MP tree from partition 1; consensus of three MP trees from partition 2]

You will note that these trees are different. However, that does not tell us that the partitions have necessarily tracked different histories. It could be that they have tracked the same history, but with only forty characters in each data set, chance has resulted in a misleading tree from one or both data sets.

One observation to be made is that the trees obtained from partition 2 include four resolved clades: (D, E), (D, E, F), (G, H), and (D, E, F, G, H). Of these, all except (D, E, F) are also found with partition 1. While these two trees are not identical, they are adjacent in tree space, something that would be rather improbable if the two partitions had tracked completely different histories. For an 8-taxon tree only 15 of the total 10,395 possible trees are adjacent to a randomly chosen tree [check]. This means that there is a less than 1 in 1,000 chance that two partitions would yield optimal trees that are this close by chance.

This is an important point to stress. Even when partitions differ in the optimal tree they support, they generally yield trees that are more similar to each other than would be expected by chance. This is very hard to explain except by reference to the partitions (typically different genes) sharing a similar history of descent from common ancestry. Indeed, the fact that the trees derived from different genes are much more similar than expected by chance provides some of the most concrete statistical evidence for the truth of evolutionary descent from common ancestry (Penny et al. 1982).

Showing that the partitions yield trees that are similar does not mean that the partitions tracked the same history.
For example, they could have tracked a history that is identical except that introgression in one gene means that a single tip occupies a different position in the true trees for each partition. Thus, showing that partitions yield surprisingly similar trees is not sufficient to conclude that they have derived from the same tree. So how can you assess that?

There are two general approaches taken to answering this question: topology tests and partition homogeneity tests. While neither is entirely satisfactory, it will be valuable to review them both because they illustrate some important general principles. It is worth mentioning that while I am introducing these tests in the context of parsimony, they can equally be used with other tree optimality criteria such as maximum likelihood.

The starting point for using topology tests is the recognition that for any data set (for example a partition) there is a set of trees that are worse than the MP tree but not so much worse that they are rejected by a topology test such as the Templeton test. Let’s call the set of trees that are not significantly longer than the MP tree the plausible trees. Returning to the tree space metaphor, if the most parsimonious tree defines the highest peak, plausible trees are those whose “altitude” is high enough that they are not significantly lower than the MP tree.

Even if the MP trees from two partitions differ, there could be one or more trees that are plausible for both partitions. When we find such shared plausible trees, we generally accept that the partitions are likely to have tracked the same history.

[Figure: tree space, showing the optimal tree for partition 1, the optimal tree for partition 2, and the plausible trees for partition 2]

Within a parsimony framework, topology tests such as the Templeton test provide a tool for assessing if a tree is plausible for each data partition. But it is usually not practical to assess the plausibility of all possible trees because there are too many of them.
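As a rough illustration of how a Templeton-style plausibility check works, the sketch below counts Fitch parsimony steps per character on two trees and applies a Wilcoxon signed-rank test to the differences. The function names, the nested-tuple tree encoding, the four-tip toy matrix, and the use of a normal approximation without a tie correction are all assumptions made for illustration; this is not the implementation found in any particular phylogenetics package, and the approximation is crude when few characters differ.

```python
import math

def fitch(tree, states):
    """Fitch pass for one character. `tree` is a tip name or a pair of
    subtrees; `states` maps tip name -> state. Returns (state set, steps)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch(tree[0], states)
    right_set, right_steps = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1

def steps_per_character(tree, matrix):
    """Parsimony steps for every column of `matrix` (tip name -> sequence)."""
    n_chars = len(next(iter(matrix.values())))
    return [
        fitch(tree, {tip: seq[i] for tip, seq in matrix.items()})[1]
        for i in range(n_chars)
    ]

def templeton(tree_a, tree_b, matrix):
    """Wilcoxon signed-rank comparison of two trees on one data set.
    Returns (W, number of characters that differ, two-sided p-value)."""
    a = steps_per_character(tree_a, matrix)
    b = steps_per_character(tree_b, matrix)
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 0, 1.0
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    # Normal approximation (assumption: no tie or continuity correction).
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    p = math.erfc(abs(w - mean) / sd / math.sqrt(2))
    return w, n, p

# Hypothetical toy data: three characters favour (A,B), one favours (A,C).
toy_matrix = {"A": "AAAA", "B": "AAAC", "C": "CCCA", "D": "CCCC"}
tree_ab = (("A", "B"), ("C", "D"))
tree_ac = (("A", "C"), ("B", "D"))
w, n_diff, p = templeton(tree_ab, tree_ac, toy_matrix)
```

Here `tree_ab` needs 5 steps and `tree_ac` needs 7, but with only four characters the difference is far from significant, so both trees would count as plausible for this toy data set. Each call compares just one pair of trees on one data set; with 10,395 unrooted bifurcating trees for only eight tips, screening every topology this way soon becomes impractical.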
Instead, systematists engage in an artful hunt for jointly plausible trees. To illustrate this, consider a case with two partitions that have yielded different MP trees. The first thing one can do is ask if any of the MP trees from partition one are plausible for partition two, and vice versa. This amounts to asking if the peak for one partition lies within the neighborhood of plausibility for the other partition. If this is not the case, you could then explore whether the MP tree obtained when the two partitions are treated as a single data set (the combined tree) lies in the plausibility zone of each data set. You could do this by using a Templeton test to ask if the combined tree explains partition 1 significantly worse than the MP tree for partition 1, and equally if the combined tree explains partition 2 significantly worse than the MP tree for partition 2. If in both cases the combined tree is not significantly worse than the MP tree (i.e., it is plausible for both partitions), then you have shown that the plausible zones overlap (see the figure).

[Figure: tree space, showing the plausible trees for partitions 1 and 2 and the optimal trees for partition 1, partition 2, and the combined data]

Often you will fail to find a tree that is plausible for both partitions. In that case, it may be fruitful to reverse the procedure and see if you can prove that the plausibility zones do not overlap. Luckily, this can be achieved without having to fully map out the plausibility zone of either data set. To explain how this works, remember that tree space is divided into neighborhoods that have or lack particular clades. Further, recall that we can use a topology test to ask whether a tree with (or without) a particular clade can be rejected by a data set. If partition one rejects all trees that lack clade x, while partition two rejects all trees that have clade x, then we can conclude that there are no trees that are plausible for both partitions. If they have clade x then they are not plausible for partition 2; if they lack clade x they are not plausible for partition 1.

[Figure: tree space, showing the neighborhood of all trees with clade x, the optimal and plausible trees for partition 1, and the optimal and plausible trees for partition 2]

Through the creative use of topology tests it is often, but not always, possible to establish whether there are trees that are plausible for both partitions. Because evolutionary theory predicts that different data from the same taxa will tend to share the same underlying tree, we usually treat the sharing of a common history as the null hypothesis. Thus, it is normal to assume that two partitions share the same true tree unless we have evidence that the plausibility zones do not overlap. The problem with this is that the partitions could show an overlap in their plausibility zones not because the true trees are the same, but because the plausibility zones for each partition are very broad, for example because one or both partitions lack a strong phylogenetic signal.

The alternative to using topology tests is to use a partition homogeneity test (PHT; originally given the less transparent name incongruence length difference test). This method doesn’t worry so much about the topology of the optimal or plausible trees but rather focuses on the length of the MP trees for each partition. Recall (PTP and skewness tests, page xx) that as we decrease the phylogenetic signal within a data set we tend to increase the length of the optimal tree because of conflict among characters. The same will happen if we combine data with conflicting signals, as shown in the following simple (extreme) example. Data sets one and two are identical except that the labels among the taxa have been muddled up. Thus, while they each support a different tree, the length of the optimal tree for each is identical (34 steps).
If we generate a composite data set with half of data set one and half of data set two, the shortest tree is 41 or 44 steps (depending on which half of which data set is included). The reason that these trees are longer than those for the original data sets is that they combine data with conflicting phylogenetic signal. If two data sets have a different signal then generating composite data sets will tend to increase the length of the trees.

     Data set 1                             Data set 2
A    TTCGAGAACACGGCCCTTTGCGACCCATGTTGTTA    TTCGAGAACACGGCCCTTTGCGACCCATGTTGTTA
B    TCAGAGAACACGACACTTTGCGACCCATGTTGTTA    TCCGGAAGTAAACACCCTTGTAACCCCAGCTATCG
C    TCCGAGAGCACGGACCTTCGCGACCTATGTTATTG    TCCGGAAGTAGACGCCCTTGTAACCCCAGCTATCG
D    TCCGGAAGTAAACACCCTTGTAACCCCAGCTATCG    TCCGAGAGCACGGACCTTCGCGACCTATGTTATTG
E    TCCGGAAGTAGACGCCCTTGTAACCCCAGCTATCG    TCTGGGAGCACAAGTCCTCACGACCCCTGCTATTG
F    TCCGGGAGTAAATGCCCTTGGGACCCCTGCTATTG    TCTGGGAGCACAAATCCTCACGACCCCTGCTATTG
G    TCTGGGAGCACAAGTCCTCACGACCCCTGCTATTG    TCAGAGAACACGACACTTTGCGACCCATGTTGTTA
H    TCTGGGAGCACAAATCCTCACGACCCCTGCTATTG    TCCGGGAGTAAATGCCCTTGGGACCCCTGCTATTG

The PHT uses this principle. First you determine the length of the MP trees for the original data and sum this up across partitions. In this example, the two optimal trees sum to 68 steps. Now, you randomly reassign characters to the two partitions. For example, the top row below shows a random assignment of characters (columns) to two partitions, with 35 characters in each (like the original data).
     1221212211121221211221212112122121221112121221212212111212112122112221
A    TTCGAGAACACGGCCCTTTGCGACCCATGTTGTTATTCGAGAACACGGCCCTTTGCGACCCATGTTGTTA
B    TCAGAGAACACGACACTTTGCGACCCATGTTGTTATCCGGAAGTAAACACCCTTGTAACCCCAGCTATCG
C    TCCGAGAGCACGGACCTTCGCGACCTATGTTATTGTCCGGAAGTAGACGCCCTTGTAACCCCAGCTATCG
D    TCCGGAAGTAAACACCCTTGTAACCCCAGCTATCGTCCGAGAGCACGGACCTTCGCGACCTATGTTATTG
E    TCCGGAAGTAGACGCCCTTGTAACCCCAGCTATCGTCTGGGAGCACAAGTCCTCACGACCCCTGCTATTG
F    TCCGGGAGTAAATGCCCTTGGGACCCCTGCTATTGTCTGGGAGCACAAATCCTCACGACCCCTGCTATTG
G    TCTGGGAGCACAAGTCCTCACGACCCCTGCTATTGTCAGAGAACACGACACTTTGCGACCCATGTTGTTA
H    TCTGGGAGCACAAATCCTCACGACCCCTGCTATTGTCCGGGAGTAAATGCCCTTGGGACCCCTGCTATTG

The shortest tree for random partition 1 has length 39, whereas the shortest for random partition 2 has length 46, giving a sum of 85 steps. The PHT repeats this procedure many times and sees how often the summed length for the random partitions is as low as or lower than that for the original partitions. In this case the distribution of lengths for 100 random partitions is as shown.

Sum of lengths of the two MP trees    Number of times found
79                                    1
80                                    1
81                                    4
82                                    8
83                                    12
84                                    13
85                                    25
86                                    26
87                                    10

Because none of the random partitions yields pairs of trees that are as short as 68 steps, we have good reason to conclude that the original partitions are not random: that the partitions have distinctly different phylogenetic signals. This would argue that the two partitions are not derived from the same tree-like process (as indeed they are not in this simulated case) and that they should therefore not be combined into a single data matrix.
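The randomization procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the routine used by any particular program: `mp_length` is a hypothetical callable standing in for a real parsimony search that returns the length of the shortest tree for a set of characters, each character is represented as a string of tip states, and the p-value is taken simply as the fraction of random partitions whose summed length is as low as or lower than the observed sum (some software also counts the original partition among the replicates).

```python
import random

def pht_p_value(null_sums, observed_sum):
    """Fraction of random partitions whose summed MP length is <= observed."""
    hits = sum(1 for s in null_sums if s <= observed_sum)
    return hits / len(null_sums)

def partition_homogeneity_test(columns, n_first, mp_length,
                               replicates=100, seed=0):
    """Sketch of the PHT for two partitions.

    columns   : list of characters (alignment columns), partition 1 first
    n_first   : number of characters in partition 1
    mp_length : hypothetical callable returning the MP tree length
                for a list of characters
    Returns (observed sum, list of random sums, p-value).
    """
    rng = random.Random(seed)
    observed = mp_length(columns[:n_first]) + mp_length(columns[n_first:])
    null_sums = []
    for _ in range(replicates):
        shuffled = list(columns)
        rng.shuffle(shuffled)  # random reassignment, partition sizes kept
        null_sums.append(mp_length(shuffled[:n_first]) +
                         mp_length(shuffled[n_first:]))
    return observed, null_sums, pht_p_value(null_sums, observed)

# The distribution reported in the text for 100 random partitions.
reported = {79: 1, 80: 1, 81: 4, 82: 8, 83: 12, 84: 13, 85: 25, 86: 26, 87: 10}
reported_sums = [s for s, count in reported.items() for _ in range(count)]
reported_p = pht_p_value(reported_sums, 68)
```

With the distribution reported above, `reported_p` is 0.0, because none of the 100 random partitions reaches the observed sum of 68 steps. In practice the costly part is `mp_length`, since every replicate requires two fresh tree searches.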