BINF6201/8201
Molecular phylogenetic methods 4
11-10-2011
Maximum likelihood methods
• So far we have only considered a single site (configuration). The likelihood for all sites is the product of the likelihoods for each site if all the sites evolve independently.
• Suppose there are s homologous sequences, each with N nucleotides. Let $D_n$ be the n-th column of the multiple alignment:
$$D_n = \begin{pmatrix} d_{1n} \\ d_{2n} \\ \vdots \\ d_{sn} \end{pmatrix}$$
• For a tree T, let $f(D_n \mid \theta_1, \theta_2, \ldots, \theta_m, T)$ be the likelihood of tree T for the n-th site, where $\theta_1, \theta_2, \ldots, \theta_m$ are the unknown parameters, such as the branch lengths. Using the previous case as an example, with
$$D_n = \begin{pmatrix} i \\ j \\ k \\ l \end{pmatrix}, \qquad \theta_i = v_i,$$
we have
$$f(D_n \mid \theta_1, \theta_2, \ldots, \theta_m, T) = h(i, j, k, l \mid v_1, v_2, v_3, v_4, v_5, T) = \sum_x \sum_y g_x P_{xl}(v_4) P_{xk}(v_3) P_{xy}(v_5) P_{yi}(v_1) P_{yj}(v_2).$$
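As a minimal sketch of the single-site likelihood above, the Python function below sums over the two unobserved ancestral bases x and y. It assumes the Jukes-Cantor model for the transition probabilities and equal root frequencies g_x = 1/4. The slides do not fix the model at this point, so both are illustrative choices, and the names `jc_prob` and `site_likelihood` are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(a, b, v):
    """Jukes-Cantor probability that base a becomes base b along a branch
    of length v (expected substitutions per site). Assumed model."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def site_likelihood(i, j, k, l, v):
    """h(i, j, k, l | v1..v5, T) for the four-sequence tree:
    root x leads to l (v4), k (v3) and internal node y (v5);
    y leads to i (v1) and j (v2)."""
    v1, v2, v3, v4, v5 = v
    g = 0.25  # assumed equal base frequencies at the root
    total = 0.0
    for x, y in product(BASES, repeat=2):
        total += (g * jc_prob(x, l, v4) * jc_prob(x, k, v3)
                  * jc_prob(x, y, v5) * jc_prob(y, i, v1) * jc_prob(y, j, v2))
    return total

# Example with hypothetical branch lengths
print(site_likelihood("A", "A", "G", "G", (0.1, 0.1, 0.2, 0.2, 0.05)))
```

A useful sanity check on any such function is that the likelihoods of all 256 possible site patterns sum to 1 for fixed branch lengths.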
• For simplicity, let's assume the sequences are homogeneous, i.e., all sites evolve at the same rate. Then the likelihood function for the entire sequence for the tree T is
$$L(\theta_1, \theta_2, \ldots, \theta_m \mid D, T) = \prod_{n=1}^{N} f(D_n \mid \theta_1, \theta_2, \ldots, \theta_m, T).$$
• Here, we treat L as a function of the parameters. We then search for the values of $\theta_1, \theta_2, \ldots, \theta_m$ that maximize L given the topology of the tree T. This maximal value of L is called the ML value of the tree T.
• Finding the ML value can be a slow process.
• We do this for all possible tree topologies, and identify the one that has the largest ML value as the inferred phylogenetic tree of the s sequences.
• Clearly, different substitution models may result in different trees.
• When the number of OTUs is large, a heuristic tree search algorithm should be used for evaluating the alternative trees.
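To illustrate what "searching for the parameter values that maximize L" means, here is a small sketch for the simplest possible case: two sequences joined by a single branch of length d. It assumes the Jukes-Cantor model and uses a ternary search, both of which are illustrative choices and not the method of any particular program. The site counts in the example are hypothetical.

```python
import math

def log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of distance d for two aligned sequences under the
    (assumed) Jukes-Cantor model, given the number of differing sites."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)   # probability of one identical site
    p_diff = 0.25 * (0.25 - 0.25 * e)   # probability of one specific mismatch
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)

def ml_distance(n_sites, n_diff, lo=1e-6, hi=5.0, iters=200):
    """Ternary search for the d that maximizes the log-likelihood,
    which has a single peak in d for this model."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(m1, n_sites, n_diff) < log_likelihood(m2, n_sites, n_diff):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

# Hypothetical data: 100 aligned sites, 20 of which differ
d_hat = ml_distance(n_sites=100, n_diff=20)
closed_form = -0.75 * math.log(1.0 - 4.0 * 20 / (3.0 * 100))
print(d_hat, closed_form)
```

For this one-parameter case the numerical optimum can be checked against the known closed-form Jukes-Cantor distance. A real tree has many branch lengths and model parameters, so the search is multidimensional and correspondingly slower.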
Heuristic tree search using predefined clusters
• Although the tree space can be very large, the majority of trees have extremely low likelihood values for a given set of OTUs.
• So we can safely ignore these unpromising trees and focus on the promising ones.
• To reduce the search space, we can predefine clusters as input if their relationships are known.
• The problem then becomes examining the 105 possible trees generated by connecting these predefined groups, instead of an astronomically large number of unrooted trees:
$$N_U = (2N - 5)!! = (2 \times 23 - 5)!! = 41!!$$
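The size of the search space can be checked directly. The sketch below evaluates the double factorial (2N - 5)!! for a given number of OTUs; the function names are hypothetical. Note that 105 = 7!! is the count for six units, which suggests six predefined groups in this example. That is an inference from the number, not something stated on the slide.

```python
def double_factorial(n):
    """n!! = n * (n - 2) * (n - 4) * ... down to 1 or 2."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def num_unrooted_trees(n_otus):
    """Number of unrooted bifurcating trees for N >= 3 OTUs: (2N - 5)!!"""
    if n_otus < 3:
        raise ValueError("need at least 3 OTUs")
    return double_factorial(2 * n_otus - 5)

for n in (4, 5, 6, 23):
    print(n, num_unrooted_trees(n))
```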
• The ML value is computed for each tree, and the one with the largest ML value is returned as the inferred tree.
• Because this algorithm examines all possible trees that connect the predefined groups, the global optimum is guaranteed if the predefined groups are correct.
• When the simple J-C model is used and a homogeneous substitution rate is assumed, the resulting ML tree is similar to the NJ and parsimony trees, with the same problem of misplacing the tree shrews inside the primate group.
Maximum likelihood trees for primates
• However, when the more sophisticated HKY substitution model, plus six gamma-distribution rate categories and invariant sites, is used, the tree constructed by the ML method places the tree shrews outside of the primate group.
• Nevertheless, there are three trifurcations in this tree. At each trifurcation point, any of the three clusters can be the outgroup of the other two, and the three resulting trees have the same ML value.
Comparison of parsimony and maximum likelihood methods
• Parsimony methods make only one assumption, namely that all changes on the branches are equally likely; however, this assumption may not hold.
• Because parsimony methods make so few assumptions, their proponents believe that these methods can be applied to any sequence data.
• Parsimony methods are also relatively fast, so they can be applied to larger data sets.
• ML methods make assumptions about the evolutionary model.
• ML methods need to optimize all the model parameters to find the ML value; therefore they are computationally intensive and very slow.
• When evolutionary models are properly selected, ML methods tend to achieve better results than parsimony methods.
Heuristic tree search using quartet puzzling
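• The quartet puzzling algorithm is a very fast heuristic algorithm for exploring the promising trees.
Step 1: Compute the ML values of the three possible trees for every set of four sequences (quartet), and keep the best ML tree for each quartet. For n sequences, this means evaluating $3 \binom{n}{4}$ trees.
[Figure: the three possible unrooted trees for four sequences 1, 2, 3, and 4; the best ML tree is selected for each possible set of four sequences.]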
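Step 1 can be sketched as a simple enumeration. The code below lists every quartet and its three possible unrooted topologies, each written as a pair of sister pairs. It covers only the bookkeeping; scoring each topology by ML is left out, and the function names are hypothetical.

```python
from itertools import combinations
from math import comb

def quartet_topologies(a, b, c, d):
    """The three unrooted topologies for four sequences,
    each represented by its two sister pairs."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

def all_quartet_trees(taxa):
    """Map every set of four sequences to its three candidate trees."""
    return {q: quartet_topologies(*q) for q in combinations(taxa, 4)}

trees = all_quartet_trees([1, 2, 3, 4, 5, 6])
n_trees = sum(len(t) for t in trees.values())
print(len(trees), "quartets,", n_trees, "trees to score")
```

For six sequences this gives 15 quartets and 45 trees, which matches 3 times "6 choose 4".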
Step 2: Randomly pick four sequences and place them in the tree according to their best ML quartet tree.
[Figure: the starting tree of sequences 1, 2, 3, and 4.]
Step 3: Randomly pick one of the remaining sequences and add it to the tree such that the growing tree agrees with the maximum number of best ML quartet trees. Repeat this process until all sequences have been added to the tree.
For example, suppose sequence 5 is randomly picked, and one or both of the following trees are the best ML quartet trees among sequences 1, 2, 3, 4, and 5:
[Figure: two quartet trees containing sequence 5.]
Then the resulting tree will be:
[Figure: the five-sequence tree after adding sequence 5.]
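The last sequence, 6, is then added to the tree. If the following quartet tree has the best ML value among all quartet trees containing sequence 6:
[Figure: the quartet tree of sequences 6, 3, 1, and 4.]
then the resulting tree will be:
[Figure: adding sequence 6 to the five-sequence tree gives the final six-sequence tree.]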
• The whole process is repeated many times with the sequences selected in different orders. The resulting tree depends on the order in which the sequences are selected.
• The tree that occurs most frequently is chosen as the inferred tree.
Bayesian phylogenetic methods
• Bayes' theorem: if A and B are two events, then
$$P(AB) = P(B \mid A)\,P(A) = P(A \mid B)\,P(B), \qquad P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}.$$
• If $T_1, T_2, \ldots, T_n$ are events that partition the sample space, and D is an event from the sample space, then
$$P(D) = P(D \mid T_1)P(T_1) + P(D \mid T_2)P(T_2) + \cdots + P(D \mid T_n)P(T_n) = \sum_{i=1}^{n} P(D \mid T_i)P(T_i),$$
$$P(T_j \mid D) = \frac{P(D \mid T_j)\,P(T_j)}{P(D)} = \frac{P(D \mid T_j)\,P(T_j)}{\sum_{i=1}^{n} P(D \mid T_i)\,P(T_i)}.$$
[Figure: the sample space partitioned into events T1 to T12, with the event D drawn inside it.]
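A small numerical sketch may make the two formulas above concrete. The function below computes P(D) by the law of total probability and then the posterior of each event in the partition. The likelihood and prior values are made up purely for illustration, and `posterior` is a hypothetical name.

```python
def posterior(likelihoods, priors):
    """Return P(T_j | D) for every j, given P(D | T_j) and P(T_j)."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)  # P(D), the denominator of Bayes' theorem
    return [j / evidence for j in joint]

# Hypothetical partition with three events
likelihoods = [0.2, 0.1, 0.1]    # P(D | T_i)
priors = [0.5, 0.25, 0.25]       # P(T_i), must sum to 1
post = posterior(likelihoods, priors)
print(post)
```

Because every posterior is divided by the same P(D), only the products of likelihood and prior matter for ranking the events.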
• For N OTUs, there are n = (2N-5)!! possible unrooted trees, which form a partition of the tree space. Let D be the alignment of the N OTUs; we do not know which tree is most likely to account for D.
[Figure: the tree space partitioned into tree1, tree2, ..., treen.]
• In the ML method, we compute the probability (likelihood) that D is generated by each tree:
$$L(\text{tree}_i) = P(D \mid \text{tree}_i).$$
We find the maximum likelihood $ML = \max\,[P(D \mid \text{tree}_i)]$ by changing the parameters (branch lengths or substitution rates) on each tree i, and return the tree that has the largest ML value.
• In Bayesian methods, we compute the probability of a tree given the observed alignment of the N OTUs, which is called the posterior probability, $P(\text{tree}_j \mid D)$.
• Using Bayes' theorem, we have
$$P(\text{tree}_j \mid D) = \frac{P(D \mid \text{tree}_j)\,P(\text{tree}_j)}{\sum_{i=1}^{n} P(D \mid \text{tree}_i)\,P(\text{tree}_i)},$$
where $P(\text{tree}_i)$ is called the prior probability.
• Calculating the denominator of the posterior probability can be difficult, because we have to enumerate all possible trees together with their branch lengths and substitution rates.
• However, the value of the denominator is the same constant for all possible trees, so the posterior probability of each tree is simply proportional to the likelihood of the tree multiplied by its prior probability.
• If we can generate a large number of trees such that the frequency of a tree is proportional to its likelihood multiplied by its prior probability, then the posterior probability can be easily estimated by
$$P(\text{tree}_j \mid D) \propto P(D \mid \text{tree}_j)\,P(\text{tree}_j) \approx \frac{\text{number of trees with the same topology as tree } j}{\text{total number of trees in the sample}}.$$
The Markov chain Monte Carlo method for sampling
• Markov chain Monte Carlo (MCMC) is a method for generating a sample from the entire sample space, such that the frequency of each individual in the sample is proportional to its likelihood of generating the observed data.
• If we have no preference for any tree before seeing the data, we can use a non-informative uniform prior probability. The priors then cancel, and therefore
$$P(\text{tree}_j \mid D) = \frac{P(D \mid \text{tree}_j)\,P(\text{tree}_j)}{\sum_{i=1}^{n} P(D \mid \text{tree}_i)\,P(\text{tree}_i)} = \frac{P(D \mid \text{tree}_j)}{\sum_{i=1}^{n} P(D \mid \text{tree}_i)} \propto P(D \mid \text{tree}_j).$$
• The MCMC method begins with a trial tree T1 and computes its likelihood, L1. A move is then made on this tree that changes it by a small amount in any one of the following:
1. Branch length;
2. Rate of substitution;
3. Topology by a nearest neighbor interchange tree move.
• The likelihood L2 of the new tree T2 is computed; it is usually slightly different from L1.
If L2 > L1, then T2 is accepted, and it becomes an element of the sample.
If L2 < L1, then T2 is accepted with probability L2 / L1.
This selection rule is called the Metropolis algorithm.
• Therefore the MCMC method favors hill-climbing moves, but also allows downhill moves with a certain probability.
• As a result, the equilibrium probabilities of observing the different trees in the sample are proportional to the likelihoods of the trees.
• To see this, suppose that we have only two trees, so that MCMC moves back and forth between them with transition probabilities r12 (from T1 to T2) and r21 (from T2 to T1).
[Figure: two trees T1 and T2 connected by arrows labeled r12 and r21.]
• Let $p_1$ and $p_2$ be the equilibrium probabilities of these trees in the sample. At equilibrium, the probabilities of observing these trees during the sampling process should be constant, so the flow from T1 to T2 must equal the flow from T2 to T1:
$$p_1 r_{12} = p_2 r_{21}, \quad \text{or} \quad \frac{r_{21}}{r_{12}} = \frac{p_1}{p_2}.$$
• This property is called detailed balance. For the trees in the sample to appear in proportion to their likelihoods, we need
$$\frac{p_1}{p_2} = \frac{L_1}{L_2}. \quad \text{Therefore,} \quad \frac{r_{21}}{r_{12}} = \frac{L_1}{L_2}.$$
• This means that to generate the desired sample, we should set the ratio of the transition probabilities equal to the ratio of the likelihoods.
• The MCMC algorithm does just this, because
if L2 > L1, we set r12 = 1 and r21 = L1/L2; therefore r21/r12 = L1/L2;
if L2 < L1, we set r12 = L2/L1 and r21 = 1; therefore r21/r12 = L1/L2.
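The claim that the Metropolis rule yields sample frequencies proportional to the likelihoods can be checked with a toy simulation. The sketch below replaces trees with a handful of abstract states whose likelihoods are made-up numbers, and it proposes a new state uniformly at random instead of making a small local move on a tree. Both are simplifying assumptions for illustration, and `metropolis_sample` is a hypothetical name.

```python
import random
from collections import Counter

def metropolis_sample(likelihoods, n_steps, seed=0):
    """Run the Metropolis rule over a small set of states and return
    the observed frequency of each state in the sample."""
    rng = random.Random(seed)
    n = len(likelihoods)
    current = rng.randrange(n)
    counts = Counter()
    for _ in range(n_steps):
        proposal = rng.randrange(n)  # symmetric proposal (assumption)
        ratio = likelihoods[proposal] / likelihoods[current]
        # Uphill moves are always accepted; downhill moves are accepted
        # with probability equal to the likelihood ratio.
        if ratio >= 1.0 or rng.random() < ratio:
            current = proposal
        counts[current] += 1  # a rejected move repeats the current state
    return [counts[i] / n_steps for i in range(n)]

# Hypothetical likelihoods for four "trees"
L = [1.0, 2.0, 3.0, 4.0]
freqs = metropolis_sample(L, n_steps=100000)
print(freqs)                       # observed frequencies
print([x / sum(L) for x in L])     # normalized likelihoods for comparison
```

Note that a rejected proposal still adds the current state to the sample once more; dropping those repeats would bias the frequencies away from the likelihoods.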
The top four trees for the Platyrrhini group by MCMC
• To compute the likelihoods, the HKY substitution model plus six gamma-distribution rate categories and invariant sites are used.
• Most parts of the tree are well defined, except for the following groups.
[Figure annotations: "The position of Capuchin varies"; "The same as in the tree constructed by NJ and parsimony methods".]
The top seven trees for principal groups by MCMC
• The uncertainty in these trees indicates that more sequences are needed to resolve the problem.
[Figure annotations: "The same as by NJ and parsimony"; "The position of Capuchin varies".]
Popular phylogenetic tree construction programs
PHYLIP
• Developed by Joseph Felsenstein;
• Implements most known distance methods such as UPGMA and
NJ, as well as maximum parsimony and ML methods;
• The most recent release is version 3.69, which contains more than
50 programs;
• Command line interface;
• The package can be freely downloaded at
http://evolution.genetics.washington.edu/phylip.html
PAUP (Phylogenetic Analysis Using Parsimony)
• Written by David Swofford;
• Includes parsimony, distance matrix, invariants, and maximum
likelihood methods and many indices and statistical tests;
• Described at http://paup.csit.fsu.edu/
• Unfortunately, it is now commercialized by Sinauer Associates,
selling for $85-150/package.
MEGA (Molecular Evolutionary Genetics Analysis)
• Developed by Sudhir Kumar and colleagues;
• Contains parsimony, distance and likelihood methods for molecular
data (nucleic acid sequences and protein sequences);
• Can do bootstrapping, consensus trees, and a variety of data editing
tasks;
• Has a sequence alignment function using an implementation of
ClustalW;
• A GUI-based program;
• Contains tree display functions.
TREE-PUZZLE
• Written by Korbinian Strimmer;
• A program for maximum likelihood analysis for nucleotide and
amino acid alignments;
• Infers phylogenies by quartet puzzling;
• Supports all popular models of sequence evolution of nucleotides
and proteins, and can take rate heterogeneity among sites into
account;
• Compatible with PHYLIP files;
• The current version also has features for parallel computation using
the MPI message-passing interface if this is available;
• Freely available at http://www.tree-puzzle.de/.
MrBayes
• A program for the Bayesian estimation of phylogenetic trees.
• Ability to analyze nucleotide, amino acid, restriction site, and
morphological data
• Freely available at http://mrbayes.csit.fsu.edu/
TreeView
• A program for visualizing and printing trees;
• Free at http://taxonomy.zoology.gla.ac.uk/rod/treeview.html