A Steiner Tree, Substitution Matrix Method for Reconstructing Phylogenetic Trees

C. J. Ras∗, D. A. Thomas†, J. F. Weng†

∗ Department of Mathematics and Statistics, University of Melbourne, Australia, [email protected]
† Department of Mechanical Engineering, University of Melbourne, Australia

Abstract

Evolutionary theory implies that existing or extinct organisms are descended from a common ancestor. Hence, given a set of organisms, a phylogenetic tree can be reconstructed showing the evolutionary relationships between the biological organisms in the set. A commonly used method for the reconstruction of phylogenetic trees is the Distance Matrix (DM) method, which tends to be faster than the well-known Maximum Parsimony (MP) and Maximum Likelihood (ML) methods. However, the disadvantages of DM are that it does not make the best use of all the information available from the input sequences, and it does not give any information on internal nodes (ancestors) in the trees, unlike MP and ML. This paper presents a mathematical framework for a new Steiner tree-based method of constructing phylogenetic trees. Our so-called Sequence Steiner (SS) method overcomes the shortcomings of classical DM methods, whilst retaining their efficiency. We introduce decision variables in the form of edge-associated substitution frequency matrices and node-associated probability vectors to the DM optimisation model.

Keywords: phylogenetic tree reconstruction, sequence-based distance method, Steiner tree, substitution frequency matrices

1 Introduction

Given a set of n organisms, a phylogenetic tree (phylogeny) T is a tree showing the evolutionary relationships among these organisms. All n organisms are leaf nodes (also called tips or terminals) of T, while the common ancestor r of all leaves is the root of T, although in many studies the tree is treated as unrooted. Any internal node is the root of a subtree of T whose leaves are the descendants of this internal node.
All internal nodes in the tree are of degree 3 since biological changes of organisms are usually regarded as bifurcating (multifurcations can be treated as degenerate bifurcations). The graph structure of a phylogenetic tree is called its topology. In a phylogenetic tree the length of an edge (also called a branch) should be, in some way, proportional to the evolutionary time linking the organisms represented by the endpoints of the edge. The reconstruction of phylogenetic trees is either distance based or site based [4]. Distance based methods, such as the Distance Matrix (DM) method, rely on some estimate of evolutionary distance between given organisms. A tree is constructed so that the sum of the branch lengths along the path connecting any two organisms most closely matches the corresponding estimate of evolutionary distance. One commonly employed measure of evolutionary distance is genetic distance, which measures differences in the nucleotide or protein sequences (bio-sequences) of the organisms. Site based methods infer information about ancestor nodes by analysing base character changes in nucleotide sequences. The Maximum Parsimony (MP) method attempts to construct a tree which minimises the number of base changes needed to explain the given data [4]. In other words, the most parsimonious tree which matches the data is sought. In the site-based Maximum Likelihood (ML) method the base change frequencies are interpreted as a measure of evolutionary time via a substitution model. For a given substitution model, the ML method attempts to construct a tree which is most likely (probable) under the model [7]. Phylogenetic trees can also be modelled as Steiner trees [1, 2, 6, 14]. The Steiner tree problem is a well known network optimisation problem that asks for a minimum cost connected network T spanning a given set of terminals N , where additional nodes (called Steiner points) may be utilised [8]. 
If all edge costs are non-negative then T can be chosen to contain no cycles, that is, T is a tree, called a Steiner minimal tree. The problem can be posed either in graphs or in metric spaces. In metric spaces the Steiner tree problem consists of two parts: the global problem of finding an optimal topology connecting the terminals and Steiner points; and the local problem of finding the optimal locations of the Steiner points in the metric space, given a topology. The latter problem is referred to as the fixed topology Steiner tree problem.

[Figure 1: Two phylogenetic Steiner tree topologies (on the same four input sequences CCCCC, GGGGG, CGGGC and GCCCG) representing distinct ancestral relationships; s1 and s2 are the Steiner points. The cost of the second tree under the Hamming metric is 9 when s1 = CCGGC and s2 = GCGCG.]

A simple phylogenetic model employing Steiner trees can be constructed by measuring the genetic distance using Hamming distance on bio-sequences. This metric counts the number of differing sites in the sequences that represent the organisms. The given sequences are represented by terminals, and ancestor sequences are represented by Steiner points (see Figure 1). Within this model, finding the optimal (minimum length) topology corresponds to the process of correctly identifying the ancestral relationships between given organisms according to the principle of maximum parsimony. The identities of the ancestors are deduced from the optimal locations of the Steiner points, and the evolutionary time between organisms corresponds to the edge lengths. The phylogenetic Steiner tree problem is NP-complete, even under the Hamming metric [5].

In this paper we develop a new mathematical framework, called the Sequence Steiner (SS) method, for constructing phylogenetic trees within a Steiner tree model. In our approach, terminals and Steiner points are probability vectors representing the frequencies of base characters in the corresponding bio-sequences.
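The Hamming-metric model of Figure 1 can be illustrated with a minimal sketch. The adjacency used below (s1 joined to CCCCC and CGGGC, s2 joined to GGGGG and GCCCG) is an assumption inferred from the stated cost of 9, since the figure itself is not reproduced here; the function names and node labels are illustrative only.

```python
def hamming(a: str, b: str) -> int:
    """Number of sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def tree_cost(edges, labels) -> int:
    """Total Hamming length of a tree given its edges and node sequences."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)


# Terminals and Steiner points from the caption of Figure 1.
labels = {
    "a": "CCCCC", "b": "CGGGC", "c": "GGGGG", "d": "GCCCG",
    "s1": "CCGGC", "s2": "GCGCG",
}
# Assumed adjacency of the second topology (not stated explicitly in the text).
edges = [("a", "s1"), ("b", "s1"), ("s1", "s2"), ("c", "s2"), ("d", "s2")]

cost = tree_cost(edges, labels)
print(cost)  # 9, matching the figure caption under the assumed adjacency
```

Solving the local (fixed topology) problem would mean searching over the sequences assigned to s1 and s2; the sketch only evaluates one given assignment.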
Our model is best viewed as a Steiner tree DM method which incorporates substitution models. In this way our approach combines the advantages of classical distance-based methods with the ability to infer information on internal nodes.

In Section 2 we give a short comparison of the classical phylogenetic reconstruction methods DM, ML and MP. We also briefly discuss a Steiner tree model for the problem in Section 2.1. In Section 2.2 we present the basics of the DM method, including the Neighbour Joining approach. Section 3 discusses our new Steiner tree-based DM method, in which substitution frequency matrices are used to calculate genetic distance. The method is illustrated with a simple example in Section 3.1.

2 Background and preliminaries

The three most popular methods for constructing phylogenetic trees, namely DM, ML and MP, all attempt to construct a tree on the given input such that some optimality criterion is satisfied. In principle, all three methods search the entire topological space of phylogenetic trees to find the desired topology. Once a topology has been selected, a local optimisation procedure is performed in order to specify the properties of the ancestors (or the branch lengths, in the case of the DM method). In practice, the DM method uses sub-optimal heuristic methods such as neighbour joining [15] to avoid searching the entire topological space. In all three approaches similar methods can be used to search the topological space, including branch and bound methods [13]. Hence in this paper we focus predominantly on the local optimisation component.

Each of the three methods DM, ML and MP has its own advantages and disadvantages. There are two main advantages of DM:
1. Because DM is distance-based it runs faster than the site-based MP and ML [7].
2. MP simply relies on observed differences (p-distance) in the given sequences and does not consider the unobserved nucleotide substitutions that possibly occurred in the evolutionary process. However, DM can be adapted to use nucleotide substitution models to correct the simple p-distance used in MP.

On the other hand DM has many drawbacks:
1. For a given set of n sequences a symmetric n × n distance matrix D is established by comparing sequences pair by pair. The DM method is based on this distance matrix D and ignores other information such as the frequencies of nucleotides {A, G, C, T} in each sequence and the statistics of the 16 different pairs {AA, AG, AC, AT, GA, GG, ..., TT} that can be obtained by sequence analysis. Simply put, the DM method does not make full use of the available information in the given sequences.
2. The DM method estimates branch lengths by certain rules such that the tree is consistent with the input distance matrix D. In practice the DM method does not search the whole topological space of phylogenetic trees, as the MP and ML methods do, and can therefore produce suboptimal solutions.
3. As opposed to ML and MP, the classical DM method cannot infer any information on the ancestors.
4. The phylogenetic tree constructed by DM may not satisfy an important property of networks, the path inequality (as explained in Section 2.1).

None of the classical methods DM, MP or ML proposes a way of dealing with uncertain characters in the given sequences that are produced in a wet laboratory, or with the gaps that are generated in the alignment of sequences. In fact, in the comparison of sequences in DM the sites containing uncertain characters or gaps are completely or pairwise deleted. As a result, some information is lost. Approaches based on Bayesian models [17], and the papers by Weng et al. [18, 19] proposing a probabilistic model to deal with the uncertain characters and gaps, do not mitigate all the above disadvantages of DM, MP and ML.
In this paper we develop a phylogenetic tree reconstruction model which addresses the above points. Our model is designed to accommodate useful information about phylogenetic trees that current models exclude, and to allow for the application of classical optimisation techniques. The result is the development of a new Steiner tree-based DM method for reconstructing phylogenetic trees, which we call the Sequence Steiner (SS) approach. Our SS approach has the following advantages:
1. Because SS is not a site-based method, we expect practical implementations of it to run faster than MP and ML, but slower than classical DM since SS needs to search the whole topological space.
2. It extracts more information from the input sequences for the reconstruction than classical DM, MP and ML.
3. It can use nucleotide substitution models to count unobservable nucleotide substitutions during evolution.
4. It can make use of existing continuous optimisation techniques and software packages.
5. As opposed to DM it can estimate the distribution of nucleotides in ancestors.
6. The phylogenetic trees constructed by SS satisfy the path inequality.

2.1 Phylogenies as Steiner trees

A Steiner tree topology T is full if all points in the given set N of terminals are of degree one. In a full Steiner topology with n terminals there are n-2 Steiner points and 2n-3 edges. Let $\mathcal{T}$ be the set of all full Steiner topologies on n terminals. For a topology $T \in \mathcal{T}$, let S(T) be the set of Steiner points. Let e(p, q) represent the edge whose endpoints are p, q, and let E(T) be the set of edges of T. Finally, let $l_{e(p,q)}$ be a certain measure on the edge e(p, q). Then a general mathematical formulation of the objective of the Steiner tree problem is:

$$\min_{T \in \mathcal{T}} \; \min_{S(T)} \sum_{e(p,q) \in E(T)} l_{e(p,q)}.$$

Note that there are two levels of optimisation: the global problem, where an optimal topology spanning all points is selected, and the local fixed topology problem, where the optimal Steiner points are found.
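To give a sense of the size of the global problem, the sketch below tabulates the quantities just mentioned for a full topology on n terminals. The counts n-2 and 2n-3 come from the text; the number of distinct full topologies on n labelled terminals, (2n-5)!! = 3 · 5 · ... · (2n-5), is a standard combinatorial fact that the text does not state and is brought in here as an assumption. The function name is illustrative.

```python
def full_topology_stats(n: int) -> dict:
    """Sizes associated with full Steiner topologies on n labelled terminals."""
    if n < 3:
        raise ValueError("a full Steiner topology needs at least 3 terminals")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return {"steiner_points": n - 2, "edges": 2 * n - 3, "topologies": count}


stats5 = full_topology_stats(5)
print(stats5)  # {'steiner_points': 3, 'edges': 7, 'topologies': 15}
```

The rapid growth of this count with n is why exhaustive enumeration of topologies is only practical for small inputs.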
The global problem is combinatorial (discrete) while the fixed topology problem is continuous if l is continuous. In the example of Figure 1 the global problem consists of choosing between two topologies (as depicted); the local problem consists of assigning values to the two Steiner points from the metric space $\{C, G\}^5$ so that total Hamming distance is minimised.

In a metric space the function $l_{e(p,q)}$, which represents the distance between the two points p and q, satisfies the triangle inequality. However, if $l_{e(p,q)}$ does not satisfy the triangle inequality then, as a minimum requirement for T to be a candidate phylogenetic tree, $l_{e(p,q)}$ should satisfy the following path inequality [3]: suppose a set of distances d(p, q) between any two given nodes p, q is prescribed, then

$$f_{\mathrm{path}}(p, q) := l_{\mathrm{path}}(p, q) - d(p, q) \ge 0, \qquad (1)$$

where $l_{\mathrm{path}}(p, q) = l_{e(p,s_1)} + l_{e(s_1,s_2)} + \cdots + l_{e(s_k,q)}$ and $s_1, s_2, \ldots, s_k$ are the nodes lying on the path linking p and q. If equality holds, then the property is called additivity and the tree is called an additive tree. Note that if the path inequality is required in a Steiner tree problem then the problem is no longer an unconstrained optimisation problem but becomes a constrained optimisation problem.

From the above description of Steiner trees we can see that a phylogenetic tree is a Steiner tree with a full topology. In fact, the connection between the phylogenetic tree problem and the Steiner tree problem was found in the early stages of computational biology, and is well studied in the context of MP and ML methods [1, 6, 14]. In this paper we take a Steiner approach to DM. We do this by defining a new type of Steiner tree problem in which each variable is not a Steiner point but rather a function M(p, q) associated with each edge e(p, q). This function determines the location of the Steiner points p, q, and the edge cost $l_{e(p,q)}$.
This is a novel approach that has not been considered in the Steiner phylogenetic tree literature.

2.2 The classical DM method

Given n input sequences $\omega_k$, k = 1, 2, ..., n, the DM method computes the genetic distances $d_{lj} := d(\omega_l, \omega_j)$ (according to a prescribed definition) between the sequences $\omega_l$ and $\omega_j$. This results in an n × n symmetric distance matrix D, whose upper triangular form is:

$$D = \begin{bmatrix}
- & d_{12} & d_{13} & \cdots & d_{1(n-1)} & d_{1n} \\
- & - & d_{23} & \cdots & d_{2(n-1)} & d_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
- & - & - & \cdots & d_{(n-2)(n-1)} & d_{(n-2)n} \\
- & - & - & \cdots & - & d_{(n-1)n} \\
- & - & - & \cdots & - & -
\end{bmatrix}$$

The goal of the DM method is to identify a tree whose branch lengths are consistent with D, i.e., to find an additive tree. However, for real biological sequences additive trees seldom exist and often only the path inequality is satisfied.

Remark 2.1 Since additivity may not hold, zero or even negative branch lengths may occur in the constructed tree (an example occurs in Section 3.1).

Most commonly, in the DM method the final tree is constructed using the Neighbour Joining (NJ) method [15, 16]. We briefly describe this approach. The tree is built in two parts: branch lengths are estimated; and then the tree is constructed so that the most closely related sequence pairs are joined as neighbours, i.e., two tips join to the same direct internal node (their direct ancestor). Because at each step only two tips are joined, forming an internal node that acts as a new tip, the distance matrix is repeatedly modified and its dimension is reduced by one at each step. After n-2 steps only two sequences are left and they are joined to the last internal node. The initial topology is a star: all terminals join to an internal node. Let the average length of the branch of tip i be

$$u_i = \sum_{j=1,\, j \ne i}^{n} \frac{d_{ij}}{n-2}.$$

We then choose the tips i, j for which $d_{ij} - u_i - u_j$ is smallest and join them to a new internal node (ij).
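A single joining step of this procedure can be sketched as follows. The dictionary-of-pairs representation and the function name are illustrative choices, not part of the method's description; the branch-length and reduced-distance formulas used in the code are the ones stated in the text that follows, and the input distances are the JC69 values given in Section 3.1.

```python
from itertools import combinations


def neighbour_joining_step(tips, d):
    """Perform one NJ join.

    tips: list of tip labels; d: dict mapping frozenset({a, b}) -> distance.
    Returns (i, j, len_i, len_j, new_tips, new_d).
    """
    n = len(tips)
    if n < 3:
        raise ValueError("need at least three tips")

    def dist(a, b):
        return d[frozenset((a, b))]

    # Average branch length u_i of every tip.
    u = {i: sum(dist(i, j) for j in tips if j != i) / (n - 2) for i in tips}
    # Pair minimising d_ij - u_i - u_j.
    i, j = min(combinations(tips, 2),
               key=lambda p: dist(*p) - u[p[0]] - u[p[1]])
    len_i = dist(i, j) / 2 + (u[i] - u[j]) / 2
    len_j = dist(i, j) / 2 + (u[j] - u[i]) / 2
    # Replace i and j by the new internal node and update distances.
    node = f"({i},{j})"
    rest = [k for k in tips if k not in (i, j)]
    new_d = {frozenset((a, b)): dist(a, b) for a, b in combinations(rest, 2)}
    for k in rest:
        new_d[frozenset((node, k))] = (dist(i, k) + dist(j, k) - dist(i, j)) / 2
    return i, j, len_i, len_j, rest + [node], new_d


# JC69-corrected distances between the five example sequences (Section 3.1).
D = {
    ("t1", "t2"): 0.052605, ("t1", "t3"): 0.721006, ("t1", "t4"): 1.762426,
    ("t1", "t5"): 1.772407, ("t2", "t3"): 0.706888, ("t2", "t4"): 1.621829,
    ("t2", "t5"): 1.656794, ("t3", "t4"): 1.805877, ("t3", "t5"): 2.049003,
    ("t4", "t5"): 0.752275,
}
d = {frozenset(pair): value for pair, value in D.items()}
i, j, len_i, len_j, tips2, d2 = neighbour_joining_step(
    ["t1", "t2", "t3", "t4", "t5"], d)
print(i, j, round(len_i, 5), round(len_j, 5))
```

On this input the first join is the pair (t4, t5), with branch lengths that agree to rounding with the (t4, s2) and (t5, s2) entries reported for the NJ tree in Section 3.1. Repeating the step on `tips2` and `d2` continues the construction; ties in the selection criterion are broken here simply by iteration order.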
Now we can compute the branch lengths from tip i and tip j to node (ij) as

$$d_{i(ij)} = \frac{d_{ij}}{2} + \frac{u_i - u_j}{2}, \qquad d_{j(ij)} = \frac{d_{ij}}{2} + \frac{u_j - u_i}{2},$$

and the distance between the new tip (ij) and each of the remaining tips k as

$$d_{(ij)k} = \frac{d_{ik} + d_{jk} - d_{ij}}{2}.$$

The process is repeated until a full unrooted tree is built.

3 A new Steiner tree-based DM method employing substitution frequency matrices

Consider two DNA sequences of length m: Q and its direct ancestor P. In the evolutionary process from P to Q, unobservable multiple, parallel, convergent and back substitutions are not counted by the genetic distance function d [20]. To overcome this limitation many statistical models based on a time-continuous Markov process [9, 10, 11] have been proposed. We will denote the models proposed in [10, 11] by JC69 and K80 respectively.

For sequence P let $[p_i]$ (1 ≤ i ≤ 4) be the number of nucleotides A, G, C and T respectively. Then the vector $p = [p_i/m]$ (1 ≤ i ≤ 4) is referred to as the frequencies of nucleotides in P. We similarly define q with respect to Q. Let $m_{ij}$ (1 ≤ i, j ≤ 4) be the number of nucleotides i in P replaced with j in Q, and let $\mu_{ij} = m_{ij}/m$. Then the matrix $M(p, q) := [\mu_{ij}]$ (1 ≤ i, j ≤ 4) is referred to as the substitution frequency matrix from P to Q. Clearly, p is a unit probability vector (its components $p'_i = p_i/m$ satisfy $0 \le p'_i \le 1$, $\sum_i p'_i = 1$), and M(p, q) is a unit probability matrix (i.e., $0 \le \mu_{ij} \le 1$, $\sum_i \sum_j \mu_{ij} = 1$). The relationships between p, q and M(p, q) are

$$p = [p'_i], \qquad p'_i = p_i/m = \sum_j \mu_{ij} \quad (1 \le i \le 4), \qquad (2)$$

$$q = [q'_j], \qquad q'_j = q_j/m = \sum_i \mu_{ij} \quad (1 \le j \le 4). \qquad (3)$$

The genetic distance d(p, q) is now defined in terms of M(p, q) as follows:

$$d(M(p, q)) := d(p, q) := \sum_{i \ne j} \mu_{ij} = 1 - \sum_i \mu_{ii}.$$

In the substitution model of [10], it is assumed that all instantaneous substitution rates are the same. The corrected genetic distance, as in [10], is then:

$$d^{JC69}(M(p, q)) = -\frac{3}{4} \log\left(1 - \frac{4\, d(M(p, q))}{3}\right),$$

where log is the natural logarithm.
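The definitions above can be made concrete with a minimal sketch that builds M(p, q) from two aligned sequences, recovers p and q via Equations (2) and (3), and applies the JC69 correction. The two short sequences are hypothetical toy data, the function names are illustrative, and the sketch assumes the sequences contain only A, G, C and T (gaps and uncertain characters are not handled).

```python
import math

NUCLEOTIDES = "AGCT"  # same order as in the text: A, G, C, T


def substitution_frequency_matrix(P: str, Q: str):
    """Return the 4x4 matrix [mu_ij] of substitution frequencies from P to Q."""
    if len(P) != len(Q) or not P:
        raise ValueError("sequences must be non-empty and of equal length")
    index = {c: k for k, c in enumerate(NUCLEOTIDES)}
    counts = [[0] * 4 for _ in range(4)]
    for a, b in zip(P, Q):
        counts[index[a]][index[b]] += 1  # KeyError for non-AGCT symbols
    m = len(P)
    return [[c / m for c in row] for row in counts]


def marginals(M):
    """Row sums give p (Equation (2)); column sums give q (Equation (3))."""
    p = [sum(row) for row in M]
    q = [sum(M[i][j] for i in range(4)) for j in range(4)]
    return p, q


def p_distance(M) -> float:
    """Uncorrected genetic distance d(M) = 1 - sum_i mu_ii."""
    return 1.0 - sum(M[i][i] for i in range(4))


def jc69(d: float) -> float:
    """JC69-corrected distance; only defined for d < 3/4."""
    if d >= 0.75:
        raise ValueError("JC69 correction undefined for d >= 3/4")
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)


# Hypothetical toy sequences of length 10 differing at one site (C -> T).
P = "AAGGCCTTAC"
Q = "AAGGCTTTAC"
M = substitution_frequency_matrix(P, Q)
p, q = marginals(M)
d = p_distance(M)
d_jc = jc69(d)
print(p, q, d, d_jc)
```

The corrected distance is always at least the uncorrected one, reflecting the unobserved substitutions the model accounts for.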
In the substitution model of [11], the transitional substitution rate (A ↔ G and C ↔ T) is different from the transversional substitution rate (A ↔ C and G ↔ T). The genetic distance is then corrected as in [11]:

$$d^{K80}(M(p, q)) = -\frac{\log(1 - 2P - Q)}{2} - \frac{\log(1 - 2Q)}{4},$$

where P and Q here denote the frequencies of transitional and transversional substitutions respectively (not the sequences P and Q above), and P + Q = d(M(p, q)). Any genetic distance d corrected by a substitution model ∗ will be denoted by $d^*$, which is a function of M(p, q).

Consider again the n input sequences $\omega_k$, k = 1, 2, ..., n. We construct a 'supermatrix' $\mathbf{M}$, a matrix of matrices, containing n(n-1)/2 substitution frequency matrices $M_{kl} := M(\omega_k, \omega_l)$, where 1 ≤ k ≤ (n-1), (k+1) ≤ l ≤ n:

$$\mathbf{M} = \begin{bmatrix}
- & M_{12} & M_{13} & \cdots & M_{1(n-1)} & M_{1n} \\
- & - & M_{23} & \cdots & M_{2(n-1)} & M_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
- & - & - & \cdots & M_{(n-2)(n-1)} & M_{(n-2)n} \\
- & - & - & \cdots & - & M_{(n-1)n} \\
- & - & - & \cdots & - & -
\end{bmatrix}$$

As opposed to the DM method, which is based solely on the distance matrix D and does not make full use of the information contained in the substitution frequency matrices $M_{kl}$, in our SS method the supermatrix $\mathbf{M}$ plays a central role. Instead of the distance matrix D, the input for our method is the nucleotide frequencies of terminals and the substitution frequency matrices M(p, q) between each pair of terminals p, q. Hence, in our SS method the phylogenetic tree problem has the following mathematical formulation:

Given:
• n terminals $t^h$, 1 ≤ h ≤ n, that are probability vectors of length 4,
• n(n-1)/2 substitution frequency matrices $M_{kl} := [\mu^{kl}_{ij}]$, 1 ≤ k ≤ (n-1), (k+1) ≤ l ≤ n, 1 ≤ i, j ≤ 4, as defined in Equations (2) and (3), i.e. $\sum_j \mu^{kl}_{ij} = t^k_i$ and $\sum_i \mu^{kl}_{ij} = t^l_j$ for 1 ≤ i, j ≤ 4, and
• a substitution model ∗ providing a genetic distance function $d^*$ on pairs of sequences.

Variables:
• 2n-3 substitution frequency matrices M(p, q) such that each M(p, q) is associated with an edge e(p, q) in topology T, and
• n-2 Steiner points $s^h$, 1 ≤ h ≤ n-2.
Constraints:
• each Steiner point $s^h$ is a probability vector, i.e. $s^h_i \ge 0$ (1 ≤ i ≤ 4), $\sum_i s^h_i = 1$,
• each substitution frequency matrix M(p, q) is a unit probability matrix satisfying Equations (2) and (3), and
• each path connecting two terminals p and q satisfies the path inequality (1).

Objective:

$$\min_{T \in \mathcal{T}} \; \min_{S(T)} \sum_{e(p,q) \in E(T)} d^*(p, q),$$

where S(T) is the set of Steiner points and E(T) is the set of edges in topology T.

3.1 An illustrative example

Finally we illustrate our SS method with a simple example that consists of n = 5 sequences selected from GenBank. The length of each of the aligned sequences (aligned using CLUSTAL, http://www.clustal.org/) is 374 and they are listed in the Appendix. Note that there are two sites (182 and 260) in Sequence 3 containing the uncertain character "?" and there are many gaps ("-") that are added in the alignment phase and happen to lie at the end of Sequences 3, 4 and 5. In DM these sites are deleted in pairwise comparison and consequently some usable information is lost.

We first demonstrate the solution generated by the DM method. The distance matrix using the substitution model from [10] (JC69) is:

D_JC69 =
          t1    t2          t3          t4          t5
    t1    -     0.052605    0.721006    1.762426    1.772407
    t2    -     -           0.706888    1.621829    1.656794
    t3    -     -           -           1.805877    2.049003
    t4    -     -           -           -           0.752275
    t5    -     -           -           -           -

Using this distance matrix and the neighbour joining method we infer a phylogenetic tree T_NJ-JC whose topology is as shown in Figure 2. The edge lengths and the differences between path lengths and terminal distances in T_NJ-JC are listed in Tables 1 and 2. Note that in T_NJ-JC the length of edge (t2, s1) is negative and the path inequality does not hold for many terminal pairs.

[Figure 2: The topology in the reconstruction of T_NJ-JC. Terminals t1 and t2 are joined to s1, terminals t4 and t5 to s2, and t3, s1 and s2 to s3; the edges (t2, s1) and (s1, s3) are labelled with their matrices M(t2, s1) and M(s1, s3).]

Table 1: Edge lengths of T_NJ-JC and T_SS-JC

    edge          T_NJ-JC     T_SS-JC
    (t1, s1)       0.06186    0.03556
    (t2, s1)      -0.00925    0.01744
    (t3, s3)       0.46901    0.42729
    (t4, s2)       0.32813    0.29723
    (t5, s2)       0.42415    0.45477
    (s1, s3)       0.21863    0.26227
    (s2, s3)       1.08229    1.16694
    Tree length    2.57482    2.66150

Next we present the solution generated by our SS method. Because we do not have information on uncertain symbols and gaps, in this example they are treated as being equally distributed. As a result the five nucleotide frequency vectors used as input are

              A         G         C         T
    t1 = [0.27807   0.13636   0.35294   0.23262]
    t2 = [0.28610   0.13102   0.35561   0.22727]
    t3 = [0.30147   0.14104   0.34693   0.21056]
    t4 = [0.32821   0.10628   0.33890   0.22660]
    t5 = [0.31684   0.11631   0.34626   0.22059]

Moreover, as input there are 10 substitution frequency matrices M12, M13, ..., M45. For example, M14 = M(t1, t4) is

             t4_A      t4_G      t4_C      t4_T
    t1_A    0.1091    0.0295    0.0973    0.0590
    t1_G    0.0442    0.0147    0.0295    0.0383
    t1_C    0.1121    0.0265    0.1386    0.0678
    t1_T    0.0708    0.0206    0.0826    0.0590

We use the same substitution model, JC69, to reconstruct the optimal phylogenetic tree, which is denoted by T_SS-JC. Since the number of sequences is very small, the whole topology space contains only 15 different topologies. It is therefore easy in this case to find the optimal topology, which is the same as the topology of T_NJ-JC.

Table 2: The path inequality f_path(p, q) in T_NJ-JC and T_SS-JC

    path        T_NJ-JC    T_SS-JC
    (t1, t2)    -0.0004    0.0000
    (t1, t3)     0.0285    0.0041
    (t1, t4)    -0.0711    0.0000
    (t1, t5)     0.0149    0.1475
    (t2, t3)    -0.0286    0.0000
    (t2, t4)    -0.0022    0.1219
    (t2, t5)     0.0588    0.2444
    (t3, t4)     0.0734    0.0855
    (t3, t5)    -0.0736    0.0000
    (t4, t5)     0.0003    0.0000

The primary output of our method is the 7 substitution frequency matrices associated with the 2n − 3 = 2(5) − 3 = 7 edges, and the derived output is the edge lengths, the path inequalities and the 3 internal nodes s1, s2, s3.
For example, we obtain the substitution frequency matrix M(t1, s1):

             s1_A      s1_G      s1_C      s1_T
    t1_A    0.2719    0.0019    0.0024    0.0019
    t1_G    0.0046    0.1259    0.0035    0.0024
    t1_C    0.0034    0.0020    0.3455    0.0020
    t1_T    0.0047    0.0024    0.0035    0.2220

and the nucleotide distributions of the 3 internal nodes (ancestors):

              A        G        C        T
    s1 = [0.2845   0.1323   0.3549   0.2283]
    s2 = [0.3043   0.1478   0.3257   0.2222]
    s3 = [0.2879   0.1615   0.3331   0.2174]

Remark 3.1 The edge lengths and path inequalities of T_SS-JC are listed in Tables 1 and 2 above for comparison. We can see that the tree length of T_SS-JC is a little larger than that of T_NJ-JC, but this small expense ensures the positivity of edges and the path inequality. It can easily be confirmed that the sum over the ith row in M(t1, s1) is $t^1_i$ and the sum over the jth column in M(t1, s1) is $s^1_j$, as given in Equations (2) and (3).

4 Conclusion

We propose a new Steiner tree-based DM method for reconstructing phylogenetic trees. The method has numerous advantages: it makes full use of the available information in sequences; it generates a more realistic tree with all edges positive and path inequality ensured; and, most importantly, it is able to estimate the distributions of nucleotides in ancestors, which will be useful in the study of extinct organisms.

References

[1] Bandelt, H.-J., Forster, P., & Röhl, A. (1999). Median-joining networks for inferring intraspecific phylogenies. Molecular Biology and Evolution, 16, 37-48.
[2] Brazil, M., Nielsen, B.K., Thomas, D.A., Winter, P., & Zachariasen, M. (2009). A novel approach to phylogenetic trees: d-dimensional geometric Steiner trees. Networks, 53, 104-111.
[3] Felsenstein, J. (1988). Phylogenies from molecular sequences: inference and reliability. Annu. Rev. Genet., 22, 521-565.
[4] Felsenstein, J. (2004). Inferring Phylogenies. Sinauer Associates, Inc., Sunderland, MA.
[5] Foulds, L.R., & Graham, R.L. (1982). The Steiner problem in phylogeny is NP-complete. Adv. Appl. Math., 3, 43-49.
[6] Foulds, L.R., Hendy, M.D., & Penny, D.
(1979). A graph theoretic approach to the development of minimal phylogenetic trees. Journal of Molecular Evolution, 13, 127-149.
[7] Guindon, S., & Gascuel, O. (2003). A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology, 52, 696-704.
[8] Hwang, F.K., Richards, D.S., & Winter, P. (1992). The Steiner Tree Problem. Elsevier Science Publishers B.V., the Netherlands.
[9] Galtier, N., Gascuel, O., & Jean-Marie, A. (2005). Markov models in molecular evolution, in Statistical Methods in Molecular Evolution, R. Nielsen (Ed.), Springer, USA.
[10] Jukes, T.H., & Cantor, C.R. (1969). Evolution of protein molecules, in Mammalian Protein Metabolism, H.N. Munro (Ed.), Academic Press, New York, pp. 21-132.
[11] Kimura, M. (1980). A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16, 111-120.
[12] Nei, M., & Kumar, S. (2000). Molecular Evolution and Phylogenetics. Oxford University Press, Inc., USA.
[13] Ratner, V.A., Zharkikh, A.A., Kolchanov, N., Rodin, S., Solovyov, S., & Antonov, A.S. (1995). Molecular Evolution, Biomathematics Series, Vol. 24. Springer-Verlag, New York.
[14] Saitou, N., & Imanishi, T. (1989). Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum-evolution, and neighbor-joining methods of phylogenetic tree construction in obtaining the correct tree. Molecular Biology and Evolution, 6, 514-525.
[15] Saitou, N., & Nei, M. (1987). The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4, 406-425.
[16] Studier, J., & Keppler, K.J. (1988). A note on the neighbor-joining algorithm of Saitou and Nei. Molecular Biology and Evolution, 5, 729-731.
[17] de Villemereuil, P., Wells, J.A., Edwards, R.D., & Blomberg, S.P. (2012). Bayesian models for comparative analysis integrating phylogenetic uncertainty. BMC Evolutionary Biology, 12, 102.
[18] Weng, J.F., Mareels, I., & Thomas, D.A. (2011). Probability Steiner trees and maximum parsimony in phylogenetic analysis. Journal of Mathematical Biology, 64, 1225-1251.
[19] Weng, J.F., Thomas, D.A., & Mareels, I. (2011). Maximum parsimony, substitution model, and probability phylogenetic trees. Journal of Computational Biology, 18, 67-80.
[20] Xia, X. (2006). Molecular phylogenetics: mathematical framework and unsolved problems, in Structural Approaches to Sequence Evolution, U. Bastolla, M. Porto, H.E. Roman, and M. Vendruscolo (Eds.), Springer, 171-191.

Appendix: 5 species of mammals

>t1 gb|AF050738| Gorilla gorilla graueri mitochondrial D-loop, partial sequence.
TTCTTTCATGGGGAGACGAATTTGGGTGCCACCTAAGTATTAGTTAACCCACCAATAATT
GTCATGTATTTCGTGCATTACTGCCAGCCACCATGAATAATGTACGGTACCATAAACACT
CCCTCACCTATAATACATTACCCCCCCTCACCCCCCATCCCTTGCCCACCCCAACAGCAT
ACCAACTAACCTACCCCTCTACAAAAGTACATAGTACATAAAATCATTTACCGTCCATAG
CACATTCCAGTTAAACCATCCTCGCCCCCACGGATGCCCCCCCTCAGATAGGGGTCCCTT
AAACACCATCCTCCGTGAAATCAATATCCCGCACAAGAGTGCTACTCTCCTCGCTCCGGG
CCCATAACGCCTGG

>t2 gb|AF089820| Gorilla gorilla beringei mitochondrial D-loop, partial sequence.
TTCTTTCATGGGGAGACGAATTTGGGTGCCACCCAAGTATTAGTTAACCCACCAATAATT
GTCATGTATGTCGTGCATTACTGCCAGCCACCATGAATAATGTACAGTACCACAAACACT
CCCCCACCTATAATACATTACCCCCCCTCACCCCCCATTCCCTGCTCACCCCAACGGCAT
ACCAACCAACCTATCCCCTCACAAAAGTACATAATACATAAAATCATTTACCGTCCATAG
TACATTCCAGTTAAACCATCCTCGCCCCCACGGATGCCCCCCTTCAGATAGGGATCCCTT
AAACACCATCCTCCGTGAAATCAATATCCCGCACAAGAGTGCTACTCTCCTCGCTCCGGG
CCCATAACACCTGG

>t3 gb|AY079510| Gorilla gorilla gorilla isolate BH6 mitochondrial D-loop, partial sequence.
TTCTTTCATGGGGAGACAAATTTGGGTACCACCCAAGTATTAGCTAACCCATCAATAATT
ATCATGTATATCGTGCATCACTGCCAGACACCATGAATAATGTACGGTACCATAAACGCC
CAATCACCTGTAGCACATACAACCCCCCCCTTCCCCCCCCCCGCATTGCCCAACGGAATA
C?AAATAACCCATCCCTCACAAAAAGTACATAACACATAAGATCATTTATCGCACATAGC
ACATCCCAGTTAAATCACC?TCGTCCCCACGGATGCCCCCCCTCAGATGGGAATCCCTTG
AACACCATCCTCCGTGAAATCAATATCCCGCACAAGAGTGCTACTCCCCTCGCTCCGGGC
CCATGACAC----

>t4 gb|AF176722| Pan troglodytes schweinfurthii isolate HARRIET mitochondrial D-loop, partial sequence.
GTACCACCTAAGTATTGGCTTATTCATTACAACCGCTATGTATTTCGTACATTACTGCCA
GCCACCATGAATATTGTACAGTACTATAATCACTCAACTACCTATAATACATCAAACCCA
CCCCACATTACAACCTCCACCCTATGCTTACAAGCACGCACAACAATTAACCCTCAACTG
TCACACATAAAACACAACTCCAAAGACATTCCTCCCCCACCCCGATACCAACAGACCTAT
ACTCTCTTAACAGTACATAGTACATACAACCGTACACCATACATAGCACATTACAGTCAA
ATCCATCCTCGCCCCCACGGATGCCCCCCCTCAGATAGG---------------------------------

>t5 gb|AF176766| Pan troglodytes troglodytes isolate DODO mitochondrial D-loop, partial sequence.
GTACCACCTAAGTATTGGCCTATTCATTACAACCGCTATGTATTTCGTACATTACTGCCA
GCCACCATGAATATTGTACAGTACTATAACCACTCAACTACCTATAATACATTAAGCCCA
CCCCCACATTACAACCTCCACCCTATGCTTACAAGCACGCACAACAATCAACCCCCAACT
GTCACACATAAAATGCAACTCCAAAGACACCCCTCTCCCACCCCGATACCAACAAACCTA
TGCCCTTTTAACAGTACATAGTACATACAGCCGTACATCGCACATAGCACATTACAGTCA
AATCCATCCTTGCCCCCACGGATGCCCCCCCTCAGATAGG---------------------------------