A Steiner Tree, Substitution Matrix Method
for Reconstructing Phylogenetic Trees
C. J. Ras∗ D.A. Thomas† J. F. Weng†
Abstract
Evolutionary theory implies that existing or extinct organisms are
descended from a common ancestor. Hence, given a set of organisms,
a phylogenetic tree can be reconstructed showing the evolutionary relationships between the biological organisms in the set. A commonly
used method for the reconstruction of phylogenetic trees is the Distance Matrix (DM) method, which tends to be faster than the well-known Maximum Parsimony (MP) and Maximum Likelihood (ML)
methods. However, the disadvantages of DM are that it does not make
the best use of all the information available from the input sequences,
and it does not give any information on internal nodes (ancestors) in
the trees, unlike MP and ML. This paper presents a mathematical
framework for a new Steiner tree-based method of constructing phylogenetic trees. Our so-called Sequence Steiner (SS) method overcomes
the shortcomings of classical DM methods, whilst retaining their efficiency. We introduce decision variables in the form of edge-associated
substitution frequency matrices and node-associated probability vectors to the DM optimisation model.
Keywords: phylogenetic tree reconstruction, sequence-based distance
method, Steiner tree, substitution frequency matrices
∗ Department of Mathematics and Statistics, University of Melbourne, Australia
† Department of Mechanical Engineering, University of Melbourne, Australia
1 Introduction
Given a set of n organisms, a phylogenetic tree (phylogeny) T is a tree showing
the evolutionary relationships among these organisms. All n organisms are
leaf nodes (also called tips, terminals) of T, while the common ancestor r
of all leaves is the root of T, although in many studies the tree is treated
as unrooted. Any internal node is the root of a subtree of T whose leaves
are the descendants of this internal node. All internal nodes in the tree
are of degree 3 since biological changes of organisms are usually regarded as
bifurcating (multifurcations can be treated as degenerate bifurcations). The
graph structure of a phylogenetic tree is called its topology. In a phylogenetic
tree the length of an edge (also called a branch) should be, in some way,
proportional to the evolutionary time linking the organisms represented by
the endpoints of the edge.
The reconstruction of phylogenetic trees is either distance-based or site-based [4]. Distance-based methods, such as the Distance Matrix (DM)
method, rely on some estimate of evolutionary distance between given organisms. A tree is constructed so that the sum of the branch lengths along
the path connecting any two organisms most closely matches the corresponding estimate of evolutionary distance. One commonly employed measure of
evolutionary distance is genetic distance, which measures differences in the
nucleotide or protein sequences (bio-sequences) of the organisms.
Site-based methods infer information about ancestor nodes by analysing
base character changes in nucleotide sequences. The Maximum Parsimony
(MP) method attempts to construct a tree which minimises the number of
base changes needed to explain the given data [4]. In other words, the most
parsimonious tree which matches the data is sought. In the site-based Maximum Likelihood (ML) method the base change frequencies are interpreted
as a measure of evolutionary time via a substitution model. For a given substitution model, the ML method attempts to construct a tree which is most
likely (probable) under the model [7].
Phylogenetic trees can also be modelled as Steiner trees [1, 2, 6, 14]. The
Steiner tree problem is a well known network optimisation problem that asks
for a minimum cost connected network T spanning a given set of terminals N ,
where additional nodes (called Steiner points) may be utilised [8]. If all edge
costs are non-negative then T does not contain cycles, that is, T is a tree,
called a Steiner minimal tree. The problem can be posed either in graphs
or in metric spaces. In metric spaces the Steiner tree problem consists of
two parts: the global problem of finding an optimal topology connecting the
terminals and Steiner points; and the local problem of finding the optimal
locations of the Steiner points in the metric space, given a topology. The
latter problem is referred to as the fixed topology Steiner tree problem.
[Figure 1 graphic: two tree diagrams on the terminal sequences CCCCC, GGGGG, CGGGC and GCCCG, with Steiner points s1 and s2 labelled.]
Figure 1: Two phylogenetic Steiner tree topologies (on the same four input sequences) representing distinct ancestral relationships. The cost of the second
tree under the Hamming metric is 9 when s1 =CCGGC and s2 =GCGCG.
A simple phylogenetic model employing Steiner trees can be constructed
by measuring the genetic distance using Hamming distance on bio-sequences.
This metric counts the number of differing sites in the sequences that represent the organisms. The given sequences are represented by terminals, and
ancestor sequences are represented by Steiner points (see Figure 1). Within
this model, finding the optimal (minimum length) topology corresponds to
the process of correctly identifying the ancestral relationships between given
organisms according to the principle of maximum parsimony. The identities of the ancestors are deduced from the optimal locations of the Steiner
points, and the evolutionary time between organisms corresponds to the edge lengths. The phylogenetic Steiner tree problem is NP-complete, even under
the Hamming metric [5].
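The cost quoted in the caption of Figure 1 can be checked with a few lines of Python. This is only a sketch: the node names `a` to `d` are hypothetical labels, and the adjacency (s1 joined to CCCCC and CGGGC, s2 joined to GGGGG and GCCCG) is an assumption inferred from the quoted cost rather than read from the figure itself.

```python
def hamming(a: str, b: str) -> int:
    """Number of sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def tree_cost(edges, labels) -> int:
    """Total Hamming length of a tree given as a list of (node, node) edges."""
    return sum(hamming(labels[p], labels[q]) for p, q in edges)


# Terminal sequences from Figure 1; Steiner points as given in its caption.
labels = {
    "a": "CCCCC", "b": "CGGGC", "c": "GGGGG", "d": "GCCCG",
    "s1": "CCGGC", "s2": "GCGCG",
}
# Assumed adjacency: s1 joins a and b, s2 joins c and d, and s1 joins s2.
edges = [("a", "s1"), ("b", "s1"), ("c", "s2"), ("d", "s2"), ("s1", "s2")]

print(tree_cost(edges, labels))  # 9, the cost stated in the caption
```

Representing the tree as an explicit edge list keeps the topology (global problem) separate from the labels of the Steiner points (local problem), mirroring the two levels of optimisation described above.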
In this paper we develop a new mathematical framework, called the Sequence Steiner (SS) method, for constructing phylogenetic trees within a
Steiner tree model. In our approach, terminals and Steiner points are probability vectors representing the frequencies of base characters in the corresponding bio-sequences. Our model is best viewed as a Steiner tree DM
method which incorporates substitution models. In this way our approach
combines the advantages of classical distance-based methods with the ability
to infer information on internal nodes.
In Section 2 we give a short comparison of the various classical phylogenetic reconstruction methods: DM, ML and MP. We also briefly discuss a Steiner
tree model for the problem in Section 2.1. In Section 2.2 we present the
basics of the DM method, including the Neighbour Joining approach. Section 3 discusses our new Steiner tree-based DM method where substitution
frequency matrices are used to calculate genetic distance. The method is
illustrated with a simple example in Section 3.1.
2 Background and preliminaries
The three most popular methods for constructing phylogenetic trees, namely
DM, ML, MP, all attempt to construct a tree on the given input such that
some optimality criterion is satisfied. In principle, all three methods search
the entire topological space of phylogenetic trees to find the desired topology. Once a topology has been selected a local optimisation procedure is
performed in order to specify the properties of the ancestors (or the branch
lengths, in the case of the DM method). In practice, the DM method uses
sub-optimal heuristic methods such as neighbour joining [15] to avoid searching the entire topological space. In all three approaches similar methods can
be used to search the topological space, including branch and bound methods [13]. Hence in this paper we will predominantly be focussing on the local
optimisation component.
Each of the three methods, DM, ML and MP, has its own advantages
and disadvantages. There are two main advantages of DM:
1. Because DM is distance-based it runs faster than the site-based MP
and ML [7].
2. MP simply relies on observed differences (p-distance) in the given sequences and does not consider the unobserved nucleotide substitutions
that possibly occurred in the evolutionary process. However, DM can
be adapted to use nucleotide substitution models to correct the simple
p-distance used in MP.
On the other hand DM has many drawbacks:
1. For a given set of n sequences a symmetric n × n distance matrix D is
established by comparing sequences pair by pair. The DM method is
based on this distance matrix D and ignores other information such as
the frequencies of nucleotides {A, G, C, T} in each sequence and the
statistics of the 16 different pairs {AA, AG, AC, AT, GA, GG, . . . , TT}
that can be obtained by sequence analysis. Simply put, the DM method
does not make full use of the available information in the given sequences.
2. The DM method estimates branch lengths by certain rules such that
the tree is consistent with the input distance matrix D. In practice the
DM method does not search the whole topological space of phylogenetic
trees, as the MP and ML methods do, and can therefore produce suboptimal solutions.
3. As opposed to ML and MP, the classical DM method cannot infer any
information on the ancestors.
4. The phylogenetic tree constructed by DM may not satisfy an important
property of networks – path inequality (as explained in Section 2.1).
None of the classical methods (DM, MP, or ML) proposes a way of dealing
with uncertain characters in the given sequences that are produced in a wet
laboratory, and the gaps that are generated in the alignment of sequences.
In fact, in the comparison of sequences in DM the sites containing uncertain
characters or gaps are completely or pairwise deleted. As a result, some
information is lost. Approaches based on Bayesian models [17], and the
papers by Weng and Thomas [18, 19] proposing a probabilistic model to
deal with the uncertain characters and gaps, do not mitigate all the above
disadvantages of DM, MP, and ML.
In this paper we develop a phylogenetic tree reconstruction model which
addresses the above points. Our model is designed to accommodate useful
information about phylogenetic trees that current models exclude, and to
allow for the application of classical optimisation techniques. The result is
the development of a new Steiner tree-based DM method for reconstructing
phylogenetic trees, which we call the Sequence Steiner (SS) approach.
Our SS approach has the following advantages:
1. Because SS is not a site-based method, we expect practical implementations of it to run faster than MP and ML but slower than classical
DM since it needs to search the whole topological space.
2. It extracts more information from the input sequences for the reconstruction than classical DM, MP and ML.
3. It can use nucleotide substitution models to count unobservable nucleotide substitutions during evolution.
4. It can make use of existing continuous optimisation techniques and
software packages.
5. As opposed to DM it can estimate the distribution of nucleotides in
ancestors.
6. The phylogenetic trees constructed by SS satisfy the path inequality.
2.1 Phylogenies as Steiner trees
A Steiner tree topology T is full if all points in the given set N of terminals
are of degree one. In a full Steiner topology with n terminals there are n-2
Steiner points and 2n-3 edges. Let $\mathcal{T}$ be the set of all full Steiner topologies
on n terminals. For a topology $T \in \mathcal{T}$, let S(T) be the set of Steiner points.
Let e(p, q) represent the edge whose endpoints are p, q, and let E(T) be the
set of edges of T. Finally, let $l_{e(p,q)}$ be a certain measure on the edge e(p, q).
Then a general mathematical formulation of the objective of the Steiner tree
problem is:
$$\min_{T \in \mathcal{T}} \; \min_{S(T)} \sum_{e(p,q) \in E(T)} l_{e(p,q)}.$$
Note that there are two levels of optimisation: the global problem, where an
optimal topology spanning all points is selected, and the local fixed topology
problem, where the optimal Steiner points are found. The global problem
is combinatorial (discrete) while the fixed topology problem is continuous if
l is continuous. In the example of Figure 1 the global problem consists of
choosing between two topologies (as depicted); the local problem consists of
assigning values to the two Steiner points from the metric space $\{C, G\}^5$ (sequences of length five over {C, G}) so
that total Hamming distance is minimised.
In a metric space the function $l_{e(p,q)}$, which represents the distance between the two points p and q, satisfies the triangle inequality. However, if
$l_{e(p,q)}$ does not satisfy the triangle inequality then, as a minimum requirement for T to be a candidate phylogenetic tree, $l_{e(p,q)}$ should satisfy the
following path inequality [3]: suppose a set of distances d(p, q) between any
two given nodes p, q is prescribed, then
$$f_{(path)}(p, q) := l_{path}(p, q) - d(p, q) \ge 0, \qquad (1)$$
where $l_{path}(p, q) = l_{e(p,s_1)} + l_{e(s_1,s_2)} + \cdots + l_{e(s_k,q)}$ and $s_1, s_2, \ldots, s_k$ are the
nodes lying on the path linking p and q. If equality holds, then the property
is called additivity and the tree is called an additive tree. Note that if the
path inequality is required in a Steiner tree problem then the problem is
no longer an unconstrained optimisation problem but becomes a constrained
optimisation problem.
From the above description of Steiner trees we can see that a phylogenetic
tree is a Steiner tree with a full topology. In fact, the connection between
the phylogenetic tree problem and the Steiner tree problem was found in the
early stages of computational biology, and is well-studied in the context of
MP and ML methods [1, 6, 14].
In this paper we take a Steiner approach to DM. We do this by defining
a new type of Steiner tree problem in which each variable is not a Steiner
point but rather a function M(p, q) associated with each edge e(p, q). This
function determines the location of the Steiner points p, q, and the edge cost
$l_{e(p,q)}$. This is a novel approach that has not been considered in the Steiner
phylogenetic tree literature.
2.2 The classical DM method
Given n input sequences $\omega_k$, k = 1, 2, . . . , n, the DM method computes the
genetic distances $d_{lj} := d(\omega_l, \omega_j)$ (according to a prescribed definition) between the sequences $\omega_l$ and $\omega_j$. This results in an n × n symmetric distance
matrix D, whose upper triangular form is:
$$D = \begin{pmatrix}
- & d_{12} & d_{13} & \cdots & d_{1(n-1)} & d_{1n} \\
- & - & d_{23} & \cdots & d_{2(n-1)} & d_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
- & - & - & \cdots & d_{(n-2)(n-1)} & d_{(n-2)n} \\
- & - & - & \cdots & - & d_{(n-1)n} \\
- & - & - & \cdots & - & -
\end{pmatrix}$$
The goal of the DM method is to identify a tree whose branch lengths
are consistent with D, i.e., to find an additive tree. However, for real biological sequences additive trees seldom exist and often only path inequality is
satisfied.
Remark 2.1 Since additivity may not hold, zero or even negative branch
lengths may occur in the constructed tree (an example occurs in Section 3.1).
Most commonly, in the DM method the final tree is constructed using the
Neighbor Joining (NJ) method [15, 16]. We briefly describe this approach.
The tree is built in two parts: branch lengths are estimated; and then the
tree is constructed so that the most closely related sequence pairs are joined
as neighbours, i.e., two tips join to the same direct internal node (their
direct ancestor). Because at each step only two tips are joined forming an
internal node as a new tip, the distance matrix is repeatedly modified and its
dimension is reduced by one at each step. After n-2 steps only two sequences
are left and they are joined to the last internal node.
The initial topology is a star: all terminals join to an internal node. Let
the average length of the branch of tip i be
$$u_i = \sum_{j,\, j \ne i}^{n} d_{ij}/(n-2).$$
We then choose the tips i, j for which $d_{ij} - u_i - u_j$ is smallest and join them
to a new internal node (ij). Now we can compute the branch lengths from
tip i and tip j to node (ij) as
$$d_{i(ij)} = \frac{d_{ij}}{2} + \frac{u_i - u_j}{2}, \qquad d_{j(ij)} = \frac{d_{ij}}{2} + \frac{u_j - u_i}{2},$$
and the distance between the new tip (ij) and each of the remaining tips k
as
$$d_{(ij)k} = (d_{ik} + d_{jk} - d_{ij})/2.$$
The process is repeated till a full unrooted tree is built.
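The neighbour joining steps described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name, the `frozenset`-keyed distance dictionary and the string labels for internal nodes are all choices made here. The last three tips are joined to a single internal node with the three-point formula, which is what the update rules above reduce to when only three tips remain. The input is the JC69 distance matrix of Section 3.1.

```python
from itertools import combinations


def neighbour_joining(names, dist):
    """Return {node: length of the branch joining that node to its parent}.

    dist maps frozenset({a, b}) to the distance between tips a and b.
    Requires at least three tips.
    """
    d = dict(dist)
    tips = list(names)
    branch = {}

    def get(a, b):
        return d[frozenset((a, b))]

    while len(tips) > 3:
        n = len(tips)
        # u_i: sum of distances from tip i to the other tips, divided by n - 2
        u = {i: sum(get(i, j) for j in tips if j != i) / (n - 2) for i in tips}
        # join the pair (i, j) for which d_ij - u_i - u_j is smallest
        i, j = min(combinations(tips, 2),
                   key=lambda p: get(p[0], p[1]) - u[p[0]] - u[p[1]])
        dij = get(i, j)
        branch[i] = dij / 2 + (u[i] - u[j]) / 2
        branch[j] = dij / 2 + (u[j] - u[i]) / 2
        new = f"({i},{j})"  # hypothetical label for the new internal node
        for k in tips:
            if k not in (i, j):
                d[frozenset((new, k))] = (get(i, k) + get(j, k) - dij) / 2
        tips = [k for k in tips if k not in (i, j)] + [new]

    # Three tips left: they meet at one internal node (three-point formula).
    a, b, c = tips
    branch[a] = (get(a, b) + get(a, c) - get(b, c)) / 2
    branch[b] = (get(a, b) + get(b, c) - get(a, c)) / 2
    branch[c] = (get(a, c) + get(b, c) - get(a, b)) / 2
    return branch


# JC69 distances between the five sequences of Section 3.1.
names = ["t1", "t2", "t3", "t4", "t5"]
upper = {
    ("t1", "t2"): 0.052605, ("t1", "t3"): 0.721006, ("t1", "t4"): 1.762426,
    ("t1", "t5"): 1.772407, ("t2", "t3"): 0.706888, ("t2", "t4"): 1.621829,
    ("t2", "t5"): 1.656794, ("t3", "t4"): 1.805877, ("t3", "t5"): 2.049003,
    ("t4", "t5"): 0.752275,
}
D = {frozenset(pair): value for pair, value in upper.items()}

branch = neighbour_joining(names, D)
for node, length in branch.items():
    print(node, round(length, 5))
```

On this input the branch attached to t2 comes out negative, which illustrates Remark 2.1. With four tips left the selection criterion is tied between two pairs, so the labels of the internal nodes may depend on floating-point rounding, but the branch lengths do not.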
3 A new Steiner tree-based DM method employing substitution frequency matrices
Consider two DNA-sequences of length m: Q and its direct ancestor P. In the
evolutionary process from P to Q, unobservable multi-, parallel-, convergent-,
and back-substitutions are not counted by the genetic distance function d
[20]. To overcome this limitation many statistical models based on a time-continuous Markov process [9, 10, 11] have been proposed. We will denote
the models proposed in [10, 11] by JC69 and K80 respectively.
For sequence P let $[p_i]$ (1 ≤ i ≤ 4) be the numbers of nucleotides A, G, C,
and T respectively. Then the vector $p = [p_i/m]$ (1 ≤ i ≤ 4) is referred to as
the frequencies of nucleotides in P. We similarly define q with respect to Q.
Let $m_{ij}$ (1 ≤ i, j ≤ 4) be the number of nucleotides i in P replaced with j
in Q, and let $\mu_{ij} = m_{ij}/m$. Then the matrix $M(p, q) := [\mu_{ij}]$ (1 ≤ i, j ≤ 4)
is referred to as the substitution frequency matrix from P to Q. Clearly, p
is a unit probability vector ($0 \le p_i/m \le 1$, $\sum_i p_i/m = 1$), and M(p, q) is a unit
probability matrix (i.e., $0 \le \mu_{ij} \le 1$, $\sum_i \sum_j \mu_{ij} = 1$). The relationships
between p, q and M(p, q) are
$$p = [p'_i], \quad p'_i = p_i/m = \sum_j \mu_{ij} \quad (\text{where } 1 \le i \le 4), \qquad (2)$$
$$q = [q'_j], \quad q'_j = q_j/m = \sum_i \mu_{ij} \quad (\text{where } 1 \le j \le 4). \qquad (3)$$
The genetic distance d(p, q) is now defined in terms of M(p, q) as follows:
$d(M(p, q)) := d(p, q) := \sum_{i \ne j} \mu_{ij} = 1 - \sum_i \mu_{ii}$. In the substitution model
of [10], it is assumed that all instantaneous substitution rates are the same.
The corrected genetic distance, as in [10], is then:
$$d^{JC69}(M(p, q)) = -\frac{3}{4} \log\left(1 - \frac{4\, d(M(p, q))}{3}\right),$$
where log is the natural logarithm. In the substitution model of [11], the
transitional substitution rate (A ↔ G and C ↔ T) is different from the
transversional substitution rate (A ↔ C, A ↔ T, G ↔ C and G ↔ T). The genetic distance
is then corrected as in [11]
$$d^{K80}(M(p, q)) = -\frac{\log(1 - 2P - Q)}{2} - \frac{\log(1 - 2Q)}{4},$$
where P and Q denote the proportions of transitional and transversional
substitutions respectively (not the sequences P and Q above), and P + Q = d(M(p, q)).
Any genetic distance d corrected by a substitution model ∗ will be denoted by $d^*$, which is a function of M(p, q).
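The definitions of this section can be illustrated with a short Python sketch that builds a substitution frequency matrix from two aligned sequences and applies the JC69 and K80 corrections. The function names and the two toy sequences are hypothetical. As a simplifying assumption, sites where either sequence carries a symbol other than A, G, C or T are skipped, and m is taken to be the number of remaining sites.

```python
import math

BASES = "AGCT"  # index order used in the text: A, G, C, T


def substitution_matrix(P: str, Q: str):
    """4x4 matrix [mu_ij]: fraction of sites with base i in P and base j in Q.

    Assumption: sites with a gap or an uncertain character in either
    sequence are skipped; m counts only the sites that remain.
    """
    if len(P) != len(Q):
        raise ValueError("sequences must be aligned to the same length")
    counts = [[0] * 4 for _ in range(4)]
    m = 0
    for x, y in zip(P, Q):
        if x in BASES and y in BASES:
            counts[BASES.index(x)][BASES.index(y)] += 1
            m += 1
    if m == 0:
        raise ValueError("no comparable sites")
    return [[c / m for c in row] for row in counts]


def p_distance(M):
    """Uncorrected distance: 1 minus the sum of the diagonal entries."""
    return 1.0 - sum(M[i][i] for i in range(4))


def jc69(M):
    """Distance corrected under the model of [10]."""
    return -0.75 * math.log(1.0 - 4.0 * p_distance(M) / 3.0)


def k80(M):
    """Distance corrected under the model of [11]."""
    a, g, c, t = range(4)
    P = M[a][g] + M[g][a] + M[c][t] + M[t][c]  # transitions
    Q = p_distance(M) - P                      # transversions
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)


# Toy example: two sequences of length 10 differing by two transitions.
M = substitution_matrix("AGCTAGCTAG", "AGCTAACTGG")
print(round(p_distance(M), 4), round(jc69(M), 4), round(k80(M), 4))
```

The row sums of the returned matrix give the nucleotide frequencies of the first sequence and the column sums those of the second, as in Equations (2) and (3). Both corrections are undefined when the argument of a logarithm is not positive, which this sketch does not guard against.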
Consider again the n input sequences $\omega_k$, k = 1, 2, . . . , n. We construct
a ‘supermatrix’ M, a matrix of matrices, containing n(n-1)/2 substitution
frequency matrices $M_{kl} := M(\omega_k, \omega_l)$, where 1 ≤ k ≤ (n-1), (k+1) ≤ l ≤ n:
$$M = \begin{pmatrix}
- & M_{12} & M_{13} & \cdots & M_{1(n-1)} & M_{1n} \\
- & - & M_{23} & \cdots & M_{2(n-1)} & M_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
- & - & - & \cdots & M_{(n-2)(n-1)} & M_{(n-2)n} \\
- & - & - & \cdots & - & M_{(n-1)n} \\
- & - & - & \cdots & - & -
\end{pmatrix}$$
As opposed to the DM method, which is based solely on the distance
matrix D and does not make full use of the information contained in the
substitution frequency matrices Mkl , in our SS method the substitution frequency matrix M plays a central role. Instead of the distance matrix D,
the input for our method is the nucleotide frequencies of terminals and the
substitution frequency matrices M(p, q) between each pair of terminals p, q.
Hence, in our SS method the phylogenetic tree problem has the following
mathematical formulation:
Given:
• n terminals $t^h$, 1 ≤ h ≤ n, that are probability vectors of length 4,
• n(n-1)/2 substitution frequency matrices $M_{kl} := [\mu^{kl}_{ij}]$, 1 ≤ k ≤ (n-1),
(k+1) ≤ l ≤ n, 1 ≤ i, j ≤ 4, as defined in Equations (2) and (3), i.e.
$\sum_j \mu^{kl}_{ij} = t^k_i$ and $\sum_i \mu^{kl}_{ij} = t^l_j$ for 1 ≤ i, j ≤ 4, and
• a substitution model ∗ providing a genetic distance function d∗ on pairs
of sequences
Variables:
• 2n-3 substitution frequency matrices M(p, q) such that each M(p, q)
is associated with an edge e(p, q) in topology T , and
• n-2 Steiner points $s^h$, 1 ≤ h ≤ n-2.
Constraints:
• each Steiner point $s^h$ is a probability vector, i.e. $s^h_i \ge 0$ and $\sum_i s^h_i = 1$ (1 ≤ i ≤ 4),
• each substitution frequency matrix M(p, q) is a unit probability matrix
satisfying Equations (2) and (3), and
• each path connecting two terminals p and q satisfies the path inequality
(1).
Objective:
$$\min_{T \in \mathcal{T}} \; \min_{S(T)} \sum_{e(p,q) \in E(T)} d^*(p, q),$$
where S(T) is the set of Steiner points and E(T) is the set of edges in topology T.
3.1 An illustrative example
Finally we illustrate our SS method with a simple example that consists of
n = 5 sequences selected from GenBank. The length of each of the aligned
sequences (aligned using CLUSTAL, http://www.clustal.org/) is 374 and they are
listed in the Appendix. Note that there are two sites (182 and 260) in Sequence 3 containing uncertain characters "?" and there are many gaps ("-")
that are added in the alignment phase and happen to lie at the end of Sequences 3, 4 and 5. In DM these sites are deleted in pairwise comparison
and consequently some useable information is lost. We first demonstrate
the solution generated by the DM method. The distance matrix using the
substitution model from [10] (JC69) is:
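$$D_{JC69} = \begin{array}{c|ccccc}
 & t^1 & t^2 & t^3 & t^4 & t^5 \\ \hline
t^1 & - & 0.052605 & 0.721006 & 1.762426 & 1.772407 \\
t^2 & - & - & 0.706888 & 1.621829 & 1.656794 \\
t^3 & - & - & - & 1.805877 & 2.049003 \\
t^4 & - & - & - & - & 0.752275 \\
t^5 & - & - & - & - & -
\end{array}$$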
Using this distance matrix and the neighbour joining method we infer a
phylogenetic tree TNJ-JC whose topology is as shown in Fig. 2. The edge
lengths and the differences between path lengths and terminal distances in
TNJ-JC are listed in Tables 1 and 2:
Note that in TNJ-JC, the length of edge (t2, s1) is negative and the path
inequality does not hold for many terminal pairs.
[Figure 2 graphic: unrooted tree on the terminals t1, ..., t5 with internal nodes s1, s2 and s3; the edges (t2, s1) and (s1, s3) are annotated with M(t2, s1) and M(s1, s3).]
Figure 2: The topology in the reconstruction of TNJ-JC
Table 1: Edge lengths of TNJ-JC and TSS-JC
edge          TNJ-JC      TSS-JC
(t1, s1)      0.06186     0.03556
(t2, s1)     -0.00925     0.01744
(t3, s3)      0.46901     0.42729
(t4, s2)      0.32813     0.29723
(t5, s2)      0.42415     0.45477
(s1, s3)      0.21863     0.26227
(s2, s3)      1.08229     1.16694
TreeLength    2.57482     2.66150
Next we present the solution generated by our SS method. Because we do
not have information on uncertain symbols and gaps, in this example they
are treated as being equally distributed. As a result the 5 frequencies of
nucleotides as input are
         A         G         C         T
t1 = [0.27807   0.13636   0.35294   0.23262]
t2 = [0.28610   0.13102   0.35561   0.22727]
t3 = [0.30147   0.14104   0.34693   0.21056]
t4 = [0.32821   0.10628   0.33890   0.22660]
t5 = [0.31684   0.11631   0.34626   0.22059]
Moreover, as input there are 10 substitution frequency matrices $M_{12}, M_{13}, \ldots, M_{45}$.
For example, $M_{14}$ is
$$M_{14} = M(t^1, t^4) = \begin{array}{c|cccc}
 & t^4_A & t^4_G & t^4_C & t^4_T \\ \hline
t^1_A & 0.1091 & 0.0295 & 0.0973 & 0.0590 \\
t^1_G & 0.0442 & 0.0147 & 0.0295 & 0.0383 \\
t^1_C & 0.1121 & 0.0265 & 0.1386 & 0.0678 \\
t^1_T & 0.0708 & 0.0206 & 0.0826 & 0.0590
\end{array}$$
We use the same substitution model, JC69, to reconstruct the optimal
phylogenetic tree, which is denoted by TSS-JC. Since the number of sequences
is very small, the whole topology space contains only 15 different topologies.
It is therefore easy in this case to find the optimal topology, which is the
same as the topology of TNJ-JC.
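The figure of 15 topologies agrees with the standard count of unrooted binary trees on n labelled leaves, (2n-5)!! = 1 · 3 · 5 · · · (2n-5). The formula is not stated in the text; the sketch below (function name hypothetical) simply evaluates it to show how quickly the topological space grows.

```python
def num_unrooted_binary_topologies(n: int) -> int:
    """(2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5), for n >= 3 labelled leaves."""
    if n < 3:
        raise ValueError("need at least three leaves")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count


for n in (4, 5, 10):
    print(n, num_unrooted_binary_topologies(n))
```

For n = 5 this gives 15, so exhaustive enumeration is feasible here, whereas for n = 10 there are already more than two million topologies.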
Table 2: The path inequality f(path)(p, q) in TNJ-JC and TSS-JC
path          TNJ-JC      TSS-JC
(t1, t2)     -0.0004      0.0000
(t1, t3)      0.0285      0.0041
(t1, t4)     -0.0711      0.0000
(t1, t5)      0.0149      0.1475
(t2, t3)     -0.0286      0.0000
(t2, t4)     -0.0022      0.1219
(t2, t5)      0.0588      0.2444
(t3, t4)      0.0734      0.0855
(t3, t5)     -0.0736      0.0000
(t4, t5)      0.0003      0.0000
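The TNJ-JC column of Table 2 can be approximately recomputed from the TNJ-JC edge lengths in Table 1 and the JC69 distances given earlier in this section, using the definition of f(path) in inequality (1). The sketch below is illustrative only; the function names are hypothetical. Since the printed edge lengths are rounded, the recomputed values agree with Table 2 only to within about 0.0005, and for values that close to zero the sign can differ.

```python
def path_length(edges, p, q):
    """Sum of edge lengths along the unique tree path from p to q."""
    adj = {}
    for (a, b), w in edges.items():
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    stack = [(p, None, 0.0)]  # (node, node we came from, length so far)
    while stack:
        node, parent, dist = stack.pop()
        if node == q:
            return dist
        for nxt, w in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, dist + w))
    raise ValueError("q is not reachable from p")


def path_excess(edges, d):
    """f(path)(p, q) = l_path(p, q) - d(p, q) for every prescribed pair."""
    return {pair: path_length(edges, *pair) - dist for pair, dist in d.items()}


# Edge lengths of TNJ-JC from Table 1.
nj_edges = {
    ("t1", "s1"): 0.06186, ("t2", "s1"): -0.00925, ("t3", "s3"): 0.46901,
    ("t4", "s2"): 0.32813, ("t5", "s2"): 0.42415, ("s1", "s3"): 0.21863,
    ("s2", "s3"): 1.08229,
}
# JC69 distances between terminals.
d_jc69 = {
    ("t1", "t2"): 0.052605, ("t1", "t3"): 0.721006, ("t1", "t4"): 1.762426,
    ("t1", "t5"): 1.772407, ("t2", "t3"): 0.706888, ("t2", "t4"): 1.621829,
    ("t2", "t5"): 1.656794, ("t3", "t4"): 1.805877, ("t3", "t5"): 2.049003,
    ("t4", "t5"): 0.752275,
}

f = path_excess(nj_edges, d_jc69)
for pair, value in f.items():
    flag = "violated" if value < 0 else "ok"
    print(pair, round(value, 4), flag)
```

A depth-first walk is sufficient here because a tree has exactly one path between any two nodes.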
The primary output of our method is the 7 substitution frequency matrices associated with each of the 2n − 3 = 2(5) − 3 = 7 edges, and the
derived output is the edge lengths, the path inequalities and 3 internal nodes
s1 , s2 , s3 . For example, we obtain the substitution frequency matrix
$$M(t^1, s^1) = \begin{array}{c|cccc}
 & s^1_A & s^1_G & s^1_C & s^1_T \\ \hline
t^1_A & 0.2719 & 0.0019 & 0.0024 & 0.0019 \\
t^1_G & 0.0046 & 0.1259 & 0.0035 & 0.0024 \\
t^1_C & 0.0034 & 0.0020 & 0.3455 & 0.0020 \\
t^1_T & 0.0047 & 0.0024 & 0.0035 & 0.2220
\end{array}$$
and the nucleotide distributions of the 3 internal nodes (ancestors)
         A        G        C        T
s1 = [0.2845   0.1323   0.3549   0.2283]
s2 = [0.3043   0.1478   0.3257   0.2222]
s3 = [0.2879   0.1615   0.3331   0.2174]
Remark 3.1 The edge lengths and path inequalities of TSS-JC are listed in
Tables 1 and 2 above for comparison. We can see that the tree length of TSS-JC
is a little larger than that of TNJ-JC, but this small expense ensures the positivity of
all edge lengths and the path inequality.
It can easily be confirmed that the sum over the ith row in $M(t^1, s^1)$ is
$t^1_i$ and the sum over the jth column in $M(t^1, s^1)$ is $s^1_j$, as given in Equations
(2) and (3).
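This check can be carried out directly on the printed values, as in the following sketch (variable names are hypothetical). Because the published entries are rounded to four decimal places, the marginals match the terminal and Steiner point vectors only up to rounding. The last lines also recompute the JC69 length of the edge (t1, s1) from the same matrix, which should be close to the TSS-JC entry of Table 1.

```python
import math

# M(t1, s1) as printed above: rows indexed by t1, columns by s1 (order A, G, C, T).
M_t1_s1 = [
    [0.2719, 0.0019, 0.0024, 0.0019],
    [0.0046, 0.1259, 0.0035, 0.0024],
    [0.0034, 0.0020, 0.3455, 0.0020],
    [0.0047, 0.0024, 0.0035, 0.2220],
]
t1 = [0.27807, 0.13636, 0.35294, 0.23262]  # input frequencies of terminal t1
s1 = [0.2845, 0.1323, 0.3549, 0.2283]      # estimated frequencies of ancestor s1

row_sums = [sum(row) for row in M_t1_s1]        # Equation (2): should match t1
col_sums = [sum(col) for col in zip(*M_t1_s1)]  # Equation (3): should match s1

# Uncorrected distance and its JC69 correction for the edge (t1, s1).
d = 1.0 - sum(M_t1_s1[i][i] for i in range(4))
edge_length = -0.75 * math.log(1.0 - 4.0 * d / 3.0)

print([round(x, 4) for x in row_sums])
print([round(x, 4) for x in col_sums])
print(round(edge_length, 4))  # compare with 0.03556 in Table 1
```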
4 Conclusion
We propose a new Steiner tree-based DM method for reconstructing phylogenetic trees. The method has numerous advantages: it makes full use of the
available information in sequences; it generates a more realistic tree with all
edges positive and path inequality ensured; and, most importantly, it is able
to estimate the distributions of nucleotides in ancestors, which will be useful
in the study of extinct organisms.
References
[1] Bandelt, H-J., Forster, P., & Röhl, A. (1999). Median-joining networks
for inferring intraspecific phylogenies. Molecular Biology and Evolution,
16, 37-48.
[2] Brazil, M., Nielsen, B.K., Thomas, D.A., Winter, P., & Zachariasen, M.
(2009). A novel approach to phylogenetic trees: d-dimensional geometric
Steiner trees. Networks, 53, 104-111.
[3] Felsenstein, J. (1988). Phylogenies from molecular sequences: inference
and reliability. Annu. Rev. Genet., 22, 521-565.
[4] Felsenstein, J. (2004). Inferring Phylogenies, Sinauer Associates, Inc.,
Sunderland, MA.
[5] Foulds, L.R., & Graham, R.L. (1982). The Steiner problem in phylogeny
is NP-complete. Adv. Appl. Math., 3, 43-49.
[6] Foulds, L.R., Hendy, M.D., & Penny, D. (1979). A graph theoretic approach to the development of minimal phylogenetic trees. Journal of
molecular evolution, 13, 127-149.
[7] Guindon, S., & Gascuel, O. (2003). A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic
biology, 52, 696-704.
[8] Hwang, F.K., Richards, D.S., & Winter, P. (1992). The Steiner Tree
Problem, Elsevier Science Publishers B.V., the Netherlands.
[9] Galtier, N., Gascuel, O., & Jean-Marie, A. (2005). Markov models
in molecular evolution, in Statistical Methods in Molecular Evolution,
R. Nielsen (Ed.), Springer, USA.
[10] Jukes, T. H., & Cantor, C.R. (1969). Evolution of protein molecules, in
Mammalian Protein Metabolism, H.N. Munro (Ed.), Academic Press,
New York, pp. 21-132.
[11] Kimura, M. (1980). A simple model for estimating evolutionary rates of
base substitutions through comparative studies of nucleotide sequences,
J. of Mol. Evol. 16, 111-120.
[12] Nei, M., & Kumar, S. (2000). Molecular Evolution and Phylogenetics,
Oxford University Press, Inc., USA.
[13] Ratner, V.A., Zharkikh, A.A., Kolchanov, N., Rodin, S., Solovyov, S., &
Antonov, A.S. (1995). Molecular Evolution Biomathematics, Series Vol
24. Springer-Verlag: New York.
[14] Saitou, N., & Imanishi, T. (1989). Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum-evolution, and neighbor-joining methods of phylogenetic tree construction in obtaining the correct tree, Molecular Biology and Evolution, 6,
514-525.
[15] Saitou, N., & Nei, M. (1987). The neighbor-joining method: A new
method for reconstructing phylogenetic trees, Molecular Biology and
Evolution, 4, 406-425.
[16] Studier, J. & Keppler, K. J. (1988). A note on the neighbor-joining
algorithm of Saitou and Nei, Molecular Biology and Evolution, 5, 729-731.
[17] de Villemereuil, P., Wells, J.A., Edwards, R.D., & Blomberg, S.P.
(2012). Bayesian models for comparative analysis integrating phylogenetic uncertainty, BMC evolutionary biology, 12, 102.
[18] Weng, J.F., Mareels, I., & Thomas, D.A. (2011). Probability Steiner
trees and maximum parsimony in phylogenetic analysis, Journal of
Mathematical Biology, 64, 1225-1251
[19] Weng, J.F., Thomas, D.A., & Mareels, I. (2011). Maximum parsimony,
substitution model, and probability phylogenetic trees, Journal of Computational Biology, 18, 67-80.
[20] Xia, X. (2006). Molecular phylogenetics: mathematical framework and
unsolved problems, in Structural approaches to sequence evolution, U.
Bastolla, M. Porto, H. E. Roman, and M. Vendruscolo, (Eds.) Springer,
171-191.
Appendix: 5 species of mammals
>t1 gb|AF050738|
Gorilla gorilla graueri mitochondrial D-loop,
partial sequence.
TTCTTTCATGGGGAGACGAATTTGGGTGCCACCTAAGTATTAGTTAACCCACCAATAATT
GTCATGTATTTCGTGCATTACTGCCAGCCACCATGAATAATGTACGGTACCATAAACACT
CCCTCACCTATAATACATTACCCCCCCTCACCCCCCATCCCTTGCCCACCCCAACAGCAT
ACCAACTAACCTACCCCTCTACAAAAGTACATAGTACATAAAATCATTTACCGTCCATAG
CACATTCCAGTTAAACCATCCTCGCCCCCACGGATGCCCCCCCTCAGATAGGGGTCCCTT
AAACACCATCCTCCGTGAAATCAATATCCCGCACAAGAGTGCTACTCTCCTCGCTCCGGG
CCCATAACGCCTGG
>t2 gb|AF089820|
Gorilla gorilla beringei mitochondrial D-loop,
partial sequence.
TTCTTTCATGGGGAGACGAATTTGGGTGCCACCCAAGTATTAGTTAACCCACCAATAATT
GTCATGTATGTCGTGCATTACTGCCAGCCACCATGAATAATGTACAGTACCACAAACACT
CCCCCACCTATAATACATTACCCCCCCTCACCCCCCATTCCCTGCTCACCCCAACGGCAT
ACCAACCAACCTATCCCCTCACAAAAGTACATAATACATAAAATCATTTACCGTCCATAG
TACATTCCAGTTAAACCATCCTCGCCCCCACGGATGCCCCCCTTCAGATAGGGATCCCTT
AAACACCATCCTCCGTGAAATCAATATCCCGCACAAGAGTGCTACTCTCCTCGCTCCGGG
CCCATAACACCTGG
>t3 gb|AY079510|
Gorilla gorilla gorilla isolate BH6 mitochondrial D-loop,
partial sequence.
TTCTTTCATGGGGAGACAAATTTGGGTACCACCCAAGTATTAGCTAACCCATCAATAATT
ATCATGTATATCGTGCATCACTGCCAGACACCATGAATAATGTACGGTACCATAAACGCC
CAATCACCTGTAGCACATACAACCCCCCCCTTCCCCCCCCCCGCATTGCCCAACGGAATA
C?AAATAACCCATCCCTCACAAAAAGTACATAACACATAAGATCATTTATCGCACATAGC
ACATCCCAGTTAAATCACC?TCGTCCCCACGGATGCCCCCCCTCAGATGGGAATCCCTTG
AACACCATCCTCCGTGAAATCAATATCCCGCACAAGAGTGCTACTCCCCTCGCTCCGGGC
CCATGACAC----
>t4 gb|AF176722|
Pan troglodytes schweinfurthii isolate HARRIET
mitochondrial D-loop, partial sequence.
GTACCACCTAAGTATTGGCTTATTCATTACAACCGCTATGTATTTCGTACATTACTGCCA
GCCACCATGAATATTGTACAGTACTATAATCACTCAACTACCTATAATACATCAAACCCA
CCCCACATTACAACCTCCACCCTATGCTTACAAGCACGCACAACAATTAACCCTCAACTG
TCACACATAAAACACAACTCCAAAGACATTCCTCCCCCACCCCGATACCAACAGACCTAT
ACTCTCTTAACAGTACATAGTACATACAACCGTACACCATACATAGCACATTACAGTCAA
ATCCATCCTCGCCCCCACGGATGCCCCCCCTCAGATAGG---------------------------------
>t5 gb|AF176766|
Pan troglodytes troglodytes isolate DODO
mitochondrial D-loop, partial sequence.
GTACCACCTAAGTATTGGCCTATTCATTACAACCGCTATGTATTTCGTACATTACTGCCA
GCCACCATGAATATTGTACAGTACTATAACCACTCAACTACCTATAATACATTAAGCCCA
CCCCCACATTACAACCTCCACCCTATGCTTACAAGCACGCACAACAATCAACCCCCAACT
GTCACACATAAAATGCAACTCCAAAGACACCCCTCTCCCACCCCGATACCAACAAACCTA
TGCCCTTTTAACAGTACATAGTACATACAGCCGTACATCGCACATAGCACATTACAGTCA
AATCCATCCTTGCCCCCACGGATGCCCCCCCTCAGATAGG---------------------------------