Algorithms in Computational Biology
Building Phylogenetic Trees
Department of Mathematics & Computer Science
Algorithms in Computational Biology
Phylogeny
• All organisms on Earth had a common ancestor
• Evidence from morphological, biochemical, and gene sequence
data
• Phylogeny
• The history of organismal lineages as they change through time
• Phylogenetic tree
• A tree showing the evolutionary relationships among various
biological species
• All living organisms today, from the smallest microbe to the largest
plants and animals, are connected by the passage of genes
along the branches of the phylogenetic tree
Phylogenetic Tree of Life
Inferring Phylogenies
• Traditionally
• Use morphological characters (both from living and
fossilized organisms)
• 1962
• Zuckerkandl & Pauling showed that molecular
sequences can be used to infer phylogenies
• Assumes current sequences descended from some
common ancestral gene in a common ancestral
species
Major Tree Building Algorithms
• Distance based
• Parsimony
• Maximum likelihood
Orthologue vs Paralogue
• Both of them are homologous genes
(homologues)
• Orthologues are a set of genes that diverged from
a common ancestor through speciation
• Homologous genes from different species
• Paralogues are a set of genes that diverged from
a common ancestor through gene duplication
• Homologous genes from the same species
A Tree of Orthologues
A tree of orthologues
based on a set of
alpha hemoglobins
A Tree of Paralogues
Background on Trees
• Nodes and Edges
• Nodes: unobserved ancestor
• Edge length
• On average, corresponds to evolutionary time period
• Variations
• Different proteins can change at different rates
• The same sequence can evolve much faster in some organisms than in others
• Root of a phylogenetic tree
• Ultimate ancestor of all species
• Some algorithms provide the location of the root, while others
don’t
Counting and Labeling Trees
• Counting:
• For a rooted tree with n leaves
• As we move up the tree, the edges coalesce as each new node is reached
• In addition to n leaves, there are n-1 nodes (internal nodes plus root node).
• A total of 2n-1 nodes
• There will be 2n-2 edges (discounting the edge above the root node)
• For an unrooted tree with n leaves
• Total number of nodes = 2n – 2
• Total number of edges = 2n – 3
• Labeling (for rooted tree)
• Label the leaves using 1 to n
• Label the branch nodes using n+1 to 2n-2
• Label the root using 2n-1
Rooting an Unrooted Tree
[Figure: an unrooted tree on leaves 1, 2, 3 and the rooted trees obtained from it]
How Many Possible Topologies?
# of leaves    Ways to add nth leaf    # of edges in the sub-tree    # of un-rooted trees
4              3                       5                             3
5              5                       7                             3x5
6              7                       9                             3x5x7
7              9                       11                            3x5x7x9
…              …                       …                             …
n              2n-5                    2n-3                          3x5x7x9x…x(2n-5) = (2n-5)!!
# of rooted trees: (2n-3)!!
Making a Tree from Pairwise Distances
• Distance Measure
• First find f which is the fraction of differences between two sequences
presupposing an alignment of the two sequences
• Fraction of difference expected by chance (by random substitution) is
about 3/4
• Jukes-Cantor distance (odds ratio)
d_{ij} = -\frac{3}{4} \log\left(1 - \frac{4f}{3}\right)
• Clustering methods
• UPGMA
• Neighbor-joining
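The distance measure above can be sketched in a few lines of Python. This is a minimal illustration, assuming the two sequences are already aligned and of equal length and that gaps are treated like any other character; the function names and the sample sequences are hypothetical, not from the slides.

```python
import math


def fraction_differences(seq_a: str, seq_b: str) -> float:
    """Fraction f of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, non-empty, and of equal length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def jukes_cantor(f: float) -> float:
    """Jukes-Cantor distance d = -(3/4) ln(1 - 4f/3).

    The distance grows without bound as f approaches 3/4, the fraction of
    differences expected by chance, so we return infinity from there on.
    """
    if f >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)


# Hypothetical aligned sequences that differ at 2 of 10 positions.
f = fraction_differences("ACGTACGTAC", "ACGTTCGTAA")
d = jukes_cantor(f)
print(f, round(d, 4))
```

The corrected distance is slightly larger than f itself because some substitutions are hidden by later substitutions at the same site.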
Unweighted Pair Group Method Using
Arithmetic Average (UPGMA)
[Sokal & Michener, 1958]
Overview
1. Cluster the sequences
2. Amalgamate two clusters at each stage, create a new node
on a tree
3. Assemble the tree upwards, each node being added above
the others
4. The edge length determined by the difference in the
heights of the nodes at the top and bottom of an edge
Distance Measure Used in UPGMA
d_{ij} = \frac{1}{|C_i| |C_j|} \sum_{p \in C_i,\, q \in C_j} d_{pq}
Distance b/w two clusters Ci and Cj is the average distance between pairs of sequences from each cluster
d_{kl} = \frac{d_{il} |C_i| + d_{jl} |C_j|}{|C_i| + |C_j|}
Distance b/w two clusters Ck and Cl, if Ck is the union of two clusters Ci and Cj
Algorithm: UPGMA
Initialization
Assign each sequence i to its own cluster Ci
Define one leaf of T for each sequence, and place at height zero
Iteration
Determine the two clusters i, j for which dij is minimal (if there are ties,
pick one randomly)
Define a new cluster k by Ck = Ci ∪ Cj, and define dkl for all l using the
arithmetic average
Define a node k with daughter nodes i and j, and place it at height dij/2.
Add k to the current clusters and remove i and j
Termination
When only two clusters i, j remain, place the root at height dij/2
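The steps above can be sketched as follows. This is one possible reading, not a reference implementation: clusters are represented as nested tuples, distances are stored in a dictionary keyed by frozenset pairs, and ties are broken by iteration order rather than randomly. The function name and the sample distances are assumptions made for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Return (tree, height) where tree is a nested tuple of leaf names.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    height maps every node (leaf or nested tuple) to its height in the tree.
    """
    d = dict(dist)
    size = {name: 1 for name in names}        # |C_i| for each cluster
    height = {name: 0.0 for name in names}    # leaves are placed at height zero
    clusters = list(names)

    while len(clusters) > 1:
        # Closest pair of clusters (ties broken by order, an assumption).
        i, j = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        k = (i, j)
        size[k] = size[i] + size[j]
        height[k] = d[frozenset((i, j))] / 2.0
        # d_kl is the size-weighted average of d_il and d_jl.
        for l in clusters:
            if l != i and l != j:
                d[frozenset((k, l))] = (
                    d[frozenset((i, l))] * size[i] + d[frozenset((j, l))] * size[j]
                ) / size[k]
        clusters = [c for c in clusters if c != i and c != j] + [k]

    return clusters[0], height


# Hypothetical distances that satisfy a molecular clock.
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 4.0, frozenset(("B", "C")): 4.0,
    frozenset(("A", "D")): 6.0, frozenset(("B", "D")): 6.0, frozenset(("C", "D")): 6.0,
}
tree, height = upgma(names, dist)
print(tree, height[tree])
```

The length of each edge is the difference between the heights of its two end nodes, for example height[tree] minus the height of either daughter of the root.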
An Example
[Figure: worked UPGMA example, continued on a second slide]
Molecular Clock Assumption in UPGMA
• UPGMA produces a rooted tree
• Edge lengths in the resulting tree can be viewed as times measured by a
molecular clock with a constant rate
• The sum of times down a path to the leaves from any node is the same,
whatever the path
• The distances dij are said to be ultrametric, if for any triplet of
sequences, xi, xj, xk, the distances dij, djk, dik are either all equal,
or two are equal and the remaining one is smaller
• True for a tree with a molecular clock
• Implied additivity
• The edge lengths are said to be additive if the distance b/w any pair of
the leaves is the sum of the lengths of the edges on the path connecting
them
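The ultrametric condition can be checked directly on a distance matrix. The sketch below assumes a symmetric matrix given as a list of lists and uses a small tolerance for floating-point comparisons; the function name and both sample matrices are illustrative.

```python
from itertools import combinations


def is_ultrametric(d, tol=1e-9):
    """True if, for every triplet, the two largest of the three distances are equal."""
    n = len(d)
    for i, j, k in combinations(range(n), 3):
        smallest, middle, largest = sorted((d[i][j], d[j][k], d[i][k]))
        # Either all three are equal, or two are equal and the third is smaller.
        if abs(largest - middle) > tol:
            return False
    return True


# Distances consistent with a molecular clock.
clock = [
    [0, 2, 4, 6],
    [2, 0, 4, 6],
    [4, 4, 0, 6],
    [6, 6, 6, 0],
]
# Additive distances from a tree with unequal edge lengths (no clock).
no_clock = [
    [0.0, 0.3, 0.5, 0.6],
    [0.3, 0.0, 0.6, 0.5],
    [0.5, 0.6, 0.0, 0.9],
    [0.6, 0.5, 0.9, 0.0],
]
print(is_ultrametric(clock), is_ultrametric(no_clock))
```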
Molecular Clocks
• Mutations may build up in any given stretch
of DNA at a reliable rate
• If the rate of mutation of a gene is reliable,
this gene can be used as a molecular clock
• This gene can be a powerful tool for
estimating the dates of lineage-splitting
events.
Example
The entire length of DNA of a gene changes at a rate of approximately
one base per 25 million years
What If Molecular Clock Property Fails?
A tree that is reconstructed incorrectly by UPGMA (right)
[Figure: original tree (left) and UPGMA reconstruction (right), each on leaves 1, 2, 3, 4]
Additivity
• Given a tree, its edge lengths are additive
• If the distance between any pair of leaves is the sum
of lengths of the edges on the path connecting them
• Built-in assumption in UPGMA
Test for Additivity
• For every set of four leaves, 1, 2, 3 and 4, two
of the three distances d12 + d34 , d13 + d24 and
d14 + d23 must be equal and larger than the
3rd.
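The four-leaf test above can be sketched as a check over all quadruples of leaves. This version assumes a symmetric distance matrix given as a list of lists and, as a design choice, also accepts the degenerate case in which all three sums are equal (an internal edge of length zero); the function name and sample matrices are illustrative.

```python
from itertools import combinations


def is_additive(d, tol=1e-9):
    """Four-point condition: for every quadruple, the two largest pair sums are equal."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted((
            d[i][j] + d[k][l],
            d[i][k] + d[j][l],
            d[i][l] + d[j][k],
        ))
        # sums[2] and sums[1] must coincide; sums[0] may be smaller or equal.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


additive = [
    [0.0, 0.3, 0.5, 0.6],
    [0.3, 0.0, 0.6, 0.5],
    [0.5, 0.6, 0.0, 0.9],
    [0.6, 0.5, 0.9, 0.0],
]
# Same matrix with one distance inflated, which breaks additivity.
not_additive = [row[:] for row in additive]
not_additive[2][3] = not_additive[3][2] = 1.5
print(is_additive(additive), is_additive(not_additive))
```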
Joining a Pair of Neighboring Leaves
[Figure: tree with leaves i, j, m and internal node k]
Node k joins leaf nodes i and j
dim = dik + dkm
djm = djk + dkm
dij = dik + djk
dkm = 0.5(dim + djm – dij)
Closest Pairs of Leaves Are not
Necessarily Neighboring Leaves
[Figure: unrooted tree on leaves 1, 2, 3, 4; leaves 1 and 2 hang on edges of length 0.1, leaves 3 and 4 on edges of length 0.4, the internal edge has length 0.1; leaf 1 neighbors leaf 3 and leaf 2 neighbors leaf 4]
d Table
      1      2      3
2    0.3
3    0.5    0.6
4    0.6    0.5    0.9
Compensation for Long Edges
D_{ij} = d_{ij} - (r_i + r_j), where r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik}
|L| is the size of the set L of leaves
D Table
      1      2      3
2   -1.1
3   -1.2   -1.1
4   -1.1   -1.2   -1.1
r1 = 0.7
r2 = 0.7
r3 = 1
r4 = 1
Algorithm: Neighbor-Joining
Initialization:
Define T to be the set of leaf nodes, one for each given sequence, and
put L = T.
Iteration:
Pick a pair i, j in L for which Dij is minimal
Define a new node k and set dkm = 0.5(dim + djm – dij), for all m in L.
Add k to T with edges of lengths dik = 0.5(dij+ri-rj), djk = dij – dik, joining k
to i and j, respectively.
Remove i and j from L and add k.
Termination
When L consists of two leaves i and j add the remaining edge between i
and j, with length dij
Produces an unrooted tree
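A compact sketch of the procedure above, run on the four-leaf distances from the earlier d table. The representation is an assumption made for illustration: distances live in a dictionary keyed by frozenset pairs, new internal nodes are named by the tuple of the two nodes they join, and ties in Dij are broken by iteration order.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the edges (node, node, length) of the unrooted neighbor-joining tree."""
    d = dict(dist)
    L = list(names)
    edges = []
    while len(L) > 2:
        # r_i = 1 / (|L| - 2) * sum over k in L of d_ik
        r = {
            i: sum(d[frozenset((i, k))] for k in L if k != i) / (len(L) - 2)
            for i in L
        }
        # Pick the pair minimising D_ij = d_ij - (r_i + r_j).
        i, j = min(
            combinations(L, 2),
            key=lambda p: d[frozenset(p)] - (r[p[0]] + r[p[1]]),
        )
        k = (i, j)  # new internal node
        dij = d[frozenset((i, j))]
        dik = 0.5 * (dij + r[i] - r[j])
        djk = dij - dik
        edges += [(i, k, dik), (j, k, djk)]
        # d_km = 0.5 (d_im + d_jm - d_ij) for every remaining node m.
        for m in L:
            if m != i and m != j:
                d[frozenset((k, m))] = 0.5 * (
                    d[frozenset((i, m))] + d[frozenset((j, m))] - dij
                )
        L = [m for m in L if m != i and m != j] + [k]
    i, j = L
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


dist = {
    frozenset((1, 2)): 0.3, frozenset((1, 3)): 0.5, frozenset((1, 4)): 0.6,
    frozenset((2, 3)): 0.6, frozenset((2, 4)): 0.5, frozenset((3, 4)): 0.9,
}
edges = neighbor_joining([1, 2, 3, 4], dist)
for a, b, length in edges:
    print(a, b, round(length, 3))
```

On these distances the closest pair is leaves 1 and 2 (d = 0.3), yet the Dij criterion joins leaf 1 with leaf 3 or leaf 2 with leaf 4, and the recovered leaf edges are 0.1, 0.1, 0.4, 0.4 with one internal edge of 0.1.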
Rooting Trees
• Outgroup
• Species known to be more distantly related to each of the
remaining species than they are to each other
• Find the root by adding an outgroup
• The point in the tree where the edge to the outgroup joins is
expected to be the best root candidate
• In the absence of a convenient outgroup, methods
are quite ad hoc
• E.g. picking the midpoint of the longest chain of consecutive
edges if deviation from a molecular clock were not too great.
Assumptions Used by UPGMA and
Neighbor-Joining
• UPGMA (molecular clock with implied additivity)
• The edge lengths in the resulting tree can be viewed as times
measured by a molecular clock with a constant rate
• The divergence of sequences is assumed to occur at the same
constant rate at all points in the tree
• The distance from an internal node to a leaf node will always be
the same no matter what path is taken
• Neighbor-Joining
• It is possible for the molecular clock property to fail but for
additivity to hold
• Assume additivity only
Parsimony
• Most widely used tree building algorithm
• It works by finding the tree which can
explain the observed sequences with a
minimum # of substitutions
• Two components to the algorithm
1. The computation of a cost for a given tree T
2. A search through all trees, to find the overall
minimum of this cost
Notations Used in Weighted Parsimony
• Sk(a) denotes the minimal cost for the
assignment of a to node k
• S(a, b): cost for each substitution of a by b
Algorithm: Weighted Parsimony
Compute the minimum cost at site u
[Sankoff & Cedergren 1983]
Initialization:
Set k = 2n – 1, the number of the root node
Recursion: Compute Sk(a) for all a as follows:
If k is a leaf node:
Set Sk(a) = 0 for a = xuk, Sk(a) = ∞, otherwise
If k is not leaf node:
Compute Si(b), Sj(b) for all b at the daughter nodes i, j and
define Sk(a) = minb(Si(b) + S(a, b)) + minb(Sj(b) + S(a, b)).
Termination:
Minimal cost of tree = minaS2n-1(a)
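The recursion above can be written as a short post-order traversal. In this sketch a tree is a nested pair of tuples whose leaves are single residues observed at site u, which replaces the slide's node numbering 1 to 2n-1; the tree, the alphabet, and the unit cost function are hypothetical examples.

```python
import math


def sankoff(tree, cost, alphabet):
    """Return {a: S_k(a)} for the root k of `tree` at a single site.

    tree:  a leaf is a one-letter string; an internal node is a pair (left, right).
    cost:  function S(a, b) giving the cost of substituting a by b.
    """
    if isinstance(tree, str):
        # Leaf: zero cost for the observed residue, infinite cost otherwise.
        return {a: (0.0 if a == tree else math.inf) for a in alphabet}
    left, right = (sankoff(child, cost, alphabet) for child in tree)
    return {
        a: min(left[b] + cost(a, b) for b in alphabet)
        + min(right[b] + cost(a, b) for b in alphabet)
        for a in alphabet
    }


def unit_cost(a, b):
    """S(a, a) = 0 and S(a, b) = 1 for a != b."""
    return 0.0 if a == b else 1.0


# Hypothetical five-leaf tree over a two-letter alphabet.
tree = ((("A", "B"), "A"), ("B", "B"))
root_costs = sankoff(tree, unit_cost, "AB")
min_cost = min(root_costs.values())
print(root_costs, min_cost)
```

Passing a different cost function, for example one that penalises some substitutions more than others, gives the weighted variant without changing the traversal.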
Weighted parsimony reduces to
traditional parsimony if S(a, a) = 0
for all a, S(a, b) = 1 for all a ≠ b
Algorithm: Traditional Parsimony [Fitch 1971]
Initialization
Set C = 0 and k = 2n -1
Recursion: to obtain the set Rk
If k is leaf node:
Set Rk = xuk
If k is not a leaf node:
Compute Ri, Rj for the daughter nodes i, j of k, and set
Rk = Ri ∩ Rj if this intersection is not empty, or else
Rk = Ri ∪ Rj and increment C
Termination:
Minimal cost of the tree = C
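A minimal sketch of the set-based recursion above. As an assumption for illustration, the tree is a nested pair of tuples with single residues at the leaves, and the counter C is returned alongside the set Rk instead of being kept as a global variable; the example tree is hypothetical.

```python
def fitch(tree):
    """Return (R_k, C): the residue set at the root and the substitution count."""
    if isinstance(tree, str):
        # Leaf: the set holds only the observed residue.
        return {tree}, 0
    (r_i, c_i), (r_j, c_j) = (fitch(child) for child in tree)
    common = r_i & r_j
    if common:
        # Non-empty intersection: no new substitution is needed at this node.
        return common, c_i + c_j
    # Empty intersection: take the union and count one substitution.
    return r_i | r_j, c_i + c_j + 1


# Hypothetical five-leaf tree with residues A and B at one site.
tree = ((("A", "B"), "A"), ("B", "B"))
root_set, cost = fitch(tree)
print(root_set, cost)
```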
Parsimony Example
[Figure: tree labelings with residues A and B at the nodes, ancestral sets {A, B}, and substitutions marked X]
Minimum cost = 2
Obtained by traditional parsimony
Cont’
[Figure: a tree labeling with residues A and B at the nodes]
Minimum cost tree: not obtained
by traditional parsimony
Enumeration of Unrooted Trees
• Enumerate all unrooted trees by an array [i3]
[i5] [i7] [i9]… [i2n-5]
• Take the unrooted tree with 3 sequences x1, x2 and
x3 and add an edge for x4 on the edge labeled by i3,
since the new edge divides the preexisting edge in
two, the total number of edges is now 3 + 2 = 5. The
value of i5 determines which of these x5 is added to.
• Think of [i3] [i5] [i7] [i9]… [i2n-5] as an odometer
…
Counting Trees
Cont’
• Counting complete trees
• The rightmost numbers advance till they reach 2n-5
• The next-to-rightmost array index clicks forward by 1
when the rightmost array index goes back to 1
• The second-to-rightmost index clicks forward by 1
when the next-to-rightmost index reaches 2n-7
• And so on and so forth …
• Counting both complete and incomplete trees
• Add 0 to each array index, meaning that there is no
edge of the order specified by the counter
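The odometer for complete trees can be sketched with itertools.product, which advances the rightmost position fastest. The function names below are hypothetical, and the sketch only generates the counter settings; turning a setting into an actual tree (choosing which edge each new leaf attaches to) is left out.

```python
from itertools import product
from math import prod


def odometer_settings(n):
    """Yield every complete setting [i3][i5]...[i(2n-5)] for n >= 4 leaves."""
    # Counter maxima 3, 5, ..., 2n-5: the number of edges available to each new leaf.
    maxima = range(3, 2 * n - 4, 2)
    yield from product(*(range(1, m + 1) for m in maxima))


def count_unrooted(n):
    """Number of unrooted trees on n leaves: 3 x 5 x ... x (2n-5)."""
    return prod(range(3, 2 * n - 4, 2))


settings = list(odometer_settings(5))
print(settings[:3], len(settings), count_unrooted(5))
```

The number of settings equals the product 3 x 5 x ... x (2n-5), which matches the count of unrooted trees.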
Selecting Labeled Branching Patterns by
Branch and Bound
• Starts from the odometer setting [1][0][0]…[0]
• Let the smallest cost so far for a complete tree be C
• Branch and bound
• Adding more leaves can only increase cost
• No point branching out if current cost is larger than the minimum
cost so far
• Implementation trick
• Whenever the cost of our current subtree T is more than C, we know
that T is not part of the optimal tree
• If all the counters to the right of a given non-zero counter are 0,
instead of advancing them all to ‘1’ we can click the rightmost nonzero counter one forward
An Example of Branch-and-Bound
[Figure: odometer settings 3 … 7 0 0 0 0, 3 … 7 1 1 1 1, and 3 … 8 0 0 0 0]
Skip 3…70001 to 3…7(2n-11)(2n-9)(2n-7)(2n-5)
and go directly to 3…80000 if the cost of 3…70000
is higher than the minimum cost found so far
Assessing the Trees: the Bootstrap
• Bootstrapping (sample with replacement)
• Given a dataset consisting of an alignment of sequences, generate
an artificial dataset by picking columns from the alignment at
random with replacement
• Generate a large number (order of thousands) of artificial alignment
datasets
• For each artificially generated data set, build a tree
• Assessing phylogenetic features
• Find the frequency of each phylogenetic feature that appears in
the thousands of trees generated above
• The higher the frequency, the more confidence we have in a
phylogenetic feature
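The resampling step can be sketched as below. The alignment is assumed to be a list of equal-length strings; build_tree and features are hypothetical caller-supplied functions (any tree-building method, and any way of listing the features of a tree, such as its groupings of leaves), not part of a real library.

```python
import random
from collections import Counter


def bootstrap_alignment(alignment, rng):
    """Build an artificial alignment by sampling columns with replacement."""
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_support(alignment, build_tree, features, replicates=1000, seed=0):
    """Fraction of replicate trees in which each phylogenetic feature appears.

    build_tree(alignment) -> tree and features(tree) -> iterable of hashable
    items are supplied by the caller (hypothetical interfaces).
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        tree = build_tree(bootstrap_alignment(alignment, rng))
        counts.update(set(features(tree)))
    return {feature: count / replicates for feature, count in counts.items()}


# Hypothetical three-sequence alignment; every column is kept intact when sampled.
aln = ["ACGT", "ACGA", "TCGA"]
replicate = bootstrap_alignment(aln, random.Random(1))
print(replicate)
```

Sampling whole columns, rather than individual characters, keeps each site's pattern across the sequences intact.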
Describe a New Hampshire Standard Tree
Tree file representation of the above rooted tree, starting at the
beginning of the file:
(B,(A,C,E),D);
(B:6.0,(A:5.0,C:3.0,E:4.0):5.0,D:11.0);
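To make the parenthesis notation concrete, here is a small hand-written reader for strings like the two above. It is only a sketch: it assumes a well-formed string ending in ";" with no whitespace, quotes, or comments, and the function names and the (name, length, children) tuple layout are choices made for this illustration.

```python
def parse_newick(text):
    """Parse a New Hampshire string into nested (name, length, children) tuples."""
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            children.append(node())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(node())
            pos += 1  # skip ")"
        # Optional node name (empty for unnamed internal nodes).
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional branch length after ":".
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return name, length, children

    return node()


def leaf_names(tree):
    """List the leaf names from left to right."""
    name, _, children = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


topology = parse_newick("(B,(A,C,E),D);")
with_lengths = parse_newick("(B:6.0,(A:5.0,C:3.0,E:4.0):5.0,D:11.0);")
print(leaf_names(topology))
```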
Visualize Trees
• Phylip DrawTree
• Cladogram
• Phenogram
• Curve-O-Gram
• Eurogram
Programs to Build Phylogenetic Trees
• PAUP
• Includes parsimony, maximum likelihood, and distance methods
• Phylip
• Includes parsimony, distance matrix, and likelihood methods,
as well as bootstrapping and consensus trees.
• MrBayes
• Bayesian estimation of phylogeny
• Uses a simulation technique called Markov chain Monte Carlo (or
MCMC) to approximate the posterior probabilities of trees
• NoTung
• Incorporating duplication/loss parsimony into phylogenetic tasks
• ……