Download A Fast Algorithm for Building Large Trees

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Fast Algorithms for
Minimum Evolution
Richard Desper, NCBI
Olivier Gascuel, LIRMM
Overview
I.
II.
III.
IV.
V.
Statement of phylogeny reconstruction problem and
various approaches to solving it.
Tree length formula as a function of average
distances.
Greedy algorithms for tree building and tree
swapping.
Simulation results.
A few extras regarding consistency and branch
lengths.
Phylogeny Reconstruction
 General problem: reconstruct the evolutionary
history for a set L of extant species.
 Input: multiple sequence alignment for L or
matrix of estimates of pairwise evolutionary
distances.
 Output: weighted phylogeny representing history
of L and common ancestors.
Methods
 Likelihood methods: model-based likelihood
maximization.
 Parsimony methods: minimize total number of
mutations in tree.
 Distance methods: fit tree structure to inferred
evolutionary distances. Leading methods
include Felsenstein-Fitch-Margoliash weighted
least-squares and Neighbor-Joining and its
variants.
Felsenstein-Fitch-Margoliash
Least-squares Method
 FITCH searches the space of topologies by
iteratively adding leaves and by tree swapping.
 Edge weights and topology are chosen to
minimize the sum of squares (D is the input
metric, DT is the induced tree metric):
T 2
Dij Dij

i, j


2
 ij
If ij = 1 for all i and j, this is called the ordinary least-squares method.
Minimum Evolution
 Developed by Rzhetsky and Nei (1992) as a
modification of the OLS method
 For each topology T,
Define function l assigning OLS lengths to edges of T
Define size of tree
l (T ) 

eE ( T )
Choose T minimizing l(T )
l (e)
Recursive Definition of DA|B
If A = {a}, B = {b}, DA|B = Dab,
For B  B1  B2 ,
D A|B
A
B1
B2

D A|B1 
D A|B2
B1  B2
B1  B2
B1
B2
All average distances for all pairs of non-intersecting subtrees of a
given topology can be calculated in O(n2) time.
External OLS Edge Length Function
If e is the edge connecting the leaf i to the
subtrees A and B,
A
e
i
B
1
l (e)   D Ai|  D B|i  D A|B 
2
Internal OLS Edge Length Function
The length of the edge e is (Vach, 1988)
A
e
C
D
B
1
l (e)   (D A|C  D B|D )  (1   )(D A|D  D B|C )  (D A|B  D B|C ) 
2
where

| A || D |  | B || C |
.
(| A | | B |)(| C |  | D |)
Tree length formula
Lemma: with T as to the right,
A
let rX denote the root of subtree X,
and eX the edge to X for X { A, B, C , D}.
C
Then,
B
e
D A|B  D A|rA  l (eA )  l (eB )  D B|rB
D
Tree Length Formula
With T as in prior slide,
l (T )  l (e) 
  l ( X )  l (e ) 
X  A, B ,C , D
X
Using lemma and branch length formula for l(e),
  D AC|  D B|D 


1
l (T )   l ( X )  D X |rX    (1   )  D A|D  D B|C  
2

X  A, B ,C , D
   D A|B  D C|D 



General approach
 To search the space of topologies, we’ll keep in
memory two data structures:
Sizes of each subtree of given topology
Matrix of average distances DX|Y for X,Y disjoint
subtrees in given topology
 As we move from one topology to another, we’ll
update the matrix, but only as much as needed,
in an efficient manner.
Tree Swapping by NNI
A
B
A
e
C
C
e
D
B
D
NNI swapping is a basic step in topology building and searching
Tree Length Formula
With T as in prior slide,
l (T )  l (e) 
  l ( X )  l (e ) 
X  A, B ,C , D
X
Using lemma and branch length formula for l(e),
  D AC|  D B|D 


1
l (T )   l ( X )  D X |rX    (1   )  D A|D  D B|C  
2

X  A, B ,C , D
   D A|B  D C|D 



Tree Length after NNI
Given T g T ’ the tree swap in prior slide, l the
edge length function:
(1)
(  1)(D AC| D B|D ) 

1
'
'
l (T )  l (T )   (  1)(D A|B  D C|D ) 
2

'
 (   )(D A|D  D B|C ) 
where  and ’ are constants depending on
the topologies.
OLS: FASTNNI
Pre-compute average distances between nonintersecting sub-trees. (O(n2) computations)
2. Loop over all internal edges, select the best swap
using Equation (1). (O(n))
3. If no swap improves length of the tree, stop and
return the tree, else perform the best swap and
update the matrix of average distances and repeat
Step 2. (O(n) per swap; there is only one new split.)
Thus, if we require p swaps, the total complexity of
FASTNNI is O(n2 + pn).
1.
Balanced Minimum Evolution
 Gascuel (2000) observed that the OLS/ME
method was weaker than NJ in approximating
the correct topology.
 Pauplin (2000) to simplify tree length
computation proposed to use a “balanced”
version of Minimum Evolution, weighting each
sub-tree equally when calculating averages: if A
and B are sub-trees of T, with B  B1  B2 ,
1 T
1 T
T
D A|B  D A|B1  D A|B2
2
2
BNNI
1.
2.
Calculate balanced averages of all pairs of sub-trees.
(O(n2))
Calculate improvement for each swap using
(2)
l (T )  l (T ') 
1
D TA|B  D CT |D  D TA|C  D TB|D 

2
If no tree swap improves length of the tree, stop and
return tree, else update matrix of average distances
and repeat Step 2. (O(n diam(T)) per swap)
The average complexity, when performing p swaps, is
O(n2 + pn diam(T)).
3.
Updating Subtree Averages
T
x
Here, X  A...
X
A
y
...and B  C  D  Y
C
e
If we perform the B-C
tree swap, then we
T
must recalculate D X |Y
B
Y
D
Q: How many recalculations?
Typical values for diam(T):
(Hint: you can count (x,y) pairs).
Yule-Harding distribution: O (log n)
A: O(n diam(T))
Uniform distribution: O
 n
Building trees from scratch
We have NNI algorithms for OLS and
balanced branch lengths. But what if we
have no initial topology for NNIs?
OLS: Greedy Minimum Evolution
1.
2.
Start with three-taxon tree T3
For k=4 to n,
a) Calculate Dk|A for each subtree A in Tk-1
b) Express cost of inserting k along edge e as f(e).
(Use Equation (3) on the next slide.)
c) Choose e minimizing f. Insert k along e to form Tk.
d) Update matrix of average distances between every pair
of 2-distant subtrees.
GME runs in O(n2) running time
Greedy Minimum Evolution
We use a variant of Equation (1), where D = {k}. Let L = l(T).
C
C
k
T’
T
A
B
Then
(3)
k
A
B
    '  ( D k | A  D B|C ) 

1
'
'
l (T )  L    (  1)(D A|B  D k |C ) 
2

  (1   )(D A|C D k |B ) 
Balanced Minimum Evolution
Same as GME,except:
2. (modifications)
a)
b)
d)
Calculate balanced average distances instead of
ordinary average distances
Use  = ½ to find weights for insertion points
Must keep average distances for all pairs of subtrees.
BME runs in O(n2 diam(T)) running time.
Simulations
 Created 24- and 96-taxon trees, 2000 per each
size, Yule-Harding process (g molecular clock).
 Edge lengths multiplied by (1.0 + mX), where X
is exponentially distributed.
 Generated trees with three rates of evolution
 SeqGen used to generate sequences for each
tree and rate (12,000 in all)
 DNADIST used to calculate distance matrices
Results: topological distances
24-Taxa Trees - Slow Rate of Evolution
0.120
0.115
with FastNNI
without NNIs
0.110
with BNNI
0.105
P
T/
F
E
HG
G
M
E
BM
NJ
TC
NJ H
/W
W
LS
EI
G
HB
O
R
BI
O
N
J
FI
TR
U
E
0.100
BNNI
improved
all input trees
Results: topological distances
96-Taxa Trees - Fast Rate of Evolution
0.120
0.110
with FastNNI
without NNIs
0.100
with BNNI
0.090
0.080
P
T/
F
E
HG
G
M
E
BM
NJ
G
HB
O
R
BI
O
N
J
W
EI
FI
T
NJ
/
TR
U
E
CH
0.070
This improvement
is large with fast rates
and high numbers of taxa
Results: topological distances
24-Taxa Trees - Slow Rate of Evolution
0.120
0.115
with FastNNI
without NNIs
0.110
with BNNI
0.105
P
T/
F
E
HG
G
M
E
BM
NJ
TC
NJ H
/W
W
LS
EI
G
HB
O
R
BI
O
N
J
FI
TR
U
E
0.100
NNI trees are close to the
best possible for BME
Results: topological distances
24-Taxa Trees - Slow Rate of Evolution
0.120
0.115
with FastNNI
without NNIs
0.110
with BNNI
0.105
P
T/
F
E
HG
G
M
E
BM
NJ
TC
NJ H
/W
W
LS
EI
G
HB
O
R
BI
O
N
J
FI
TR
U
E
0.100
The quality of the
NNI tree is (mostly)
independent
of starting point
Results: topological distances
96-Taxa Trees - Slow Rate of Evolution
0.2
0.195
0.19
with FastNNI
0.185
without NNIs
with BNNI
0.18
0.175
0.17
U
R
T
E
/
NJ
FI
H
TC
E
W
J
R
N
O
O
BI
HB
IG
NJ
BM
E
M
G
E
HG
F
T/
P
FASTNNI trees comparable to
NJ as n grows to 96
Computational Times
in (MM:SS)
24 Taxa
96 Taxa
1000 Taxa
4000 Taxa
GME + BNNI
0.0263
0.0842
11.3390
06:02.1
HGT/FP
0.0252
0.1349
13.8080
03:33.1
NJ/BIONJ
0.0630
0.1628
21.2500
20:55.9
WEIGHBOR
0.4244
26.8818
FITCH
4.3745
Computations done on Sun Enterprise E4500/E5500 running Solaris 8
on 10 400-Mhz processors with 7 Gb memory.
Average number of NNIs
24 Taxa
96 Taxa
1000 Taxa
4000 Taxa
GME + FASTNNI
1.244
8.446
44.9
336.50
GME + BNNI
1.446
11.177
59.1
343.75
BME + BNNI
1.070
6.933
29.1
116.25
We see that the average number of NNIs is
considerably lower than the number of taxa.
BME = WLS
Why does the balanced approach work so well?
 Pauplin’s formula for the length of a tree is
l (T )   2
1 pT ( i , j )
i j
Dij ,
Where pT(i,j) is the length of the (i,j) path in T.
 BME is a weighted least squares approach with
 ij  cpT (i, j ). Distantly related taxa see their importance
decrease exponentially.
Bonus features
 BME is a consistent method.
As observed
distances converge to true distances, the true
topology becomes the minimum evolution tree.
 The BNNI tree has no negative branch lengths.
A negative value to the branch length function
implies a NNI leading to a smaller tree.
Consistency of Balanced ME
is a weighted tree, and T
is a tree topology incompatible with S. Let T be
the tree of topology T with weights determined
by the balanced scheme. Then
l(T) > l(S).
 Lemma: it suffices to prove the case when S is a
split metric.
 Theorem: Suppose S
Balanced ME consistency
 Basic idea: let l
be the tree length function on
the space of topologies. We find a sequence of
topologies, T=T0, T1, ... Tk=S such that
Each Ti+1 can be reached from Ti via one of two
simple topological transformations
l(Ti) > l(Ti+1) for all i.
Proof structure modeled after OLS/ME proof (Rzhetsky
and Nei, 1993).
Type I transformation
Color the leaves black or white according to the split metric S. A
Type I transformation uses a NNI to form a larger monochromatic
cluster
D
D
A
C
B
A
B
C
This transformation reduces the size of the tree under l
Type II transformation
A Type II transformation uses two NNIs to form two monochromatic
subtrees
A1
A1
B1
A2
C
C
A2
B2
B1
B2
This transformation also reduces the value of the size of the tree
under l
Positive Branch Lengths after BNNI
B Recall that the length of an edge is described by
A
e
C
1 1

l (e)   (D AC|  D B|D  D A|D  D B|C )  (D A|B  D C|D ) 
2 2

D
We do not perform the
l (T )  l (T ') 
i.e.
B C
switch because
1
D TA|B  D CT | D  D TA|C  D TB|D   0,

2
D A|C D B|D  D A|B D C|D .
Similarly, D A|D D B|C  D A|B D C|D . Thus l (e)  0
Conclusions
BME + BNNI runs in O((n2 + pn) diam(T)), outputs
trees comparable to (better than) FITCH, Weighbor,
BioNJ, or NJ.
 FastME is faster than NJ or its variants.
 BNNI consistently improved output trees in all settings,
even when WLS/Fitch trees were input.
 BNNI outputs tree without negative branch lengths.
 FASTME software available at

http://www.ncbi.nlm.nih.gov/CBBResearch/Desper/FastME.html
or http://www.lirmm.fr/~w3ifa/MAAS/.