Download MaxL - Brandeis

Maximum Likelihood
Flips usage of probability function
A typical calculation:
$P(h \mid n, p) = \binom{n}{h}\, p^{h} (1-p)^{n-h}$
The implied question:
Given probability p of success in a single trial, what is the
probability of h successes over n trials?
The ML question
The ML calculation:
$L(p \mid n, h) = \binom{n}{h}\, p^{h} (1-p)^{n-h}$
What is the likelihood of the parameter value p, given
h successes observed over n trials?
Experiment with test values of p and choose
the one that results in the highest likelihood
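The "experiment with test values of p" step can be sketched as a simple grid search. This is a minimal illustration, not part of the original slides: the function names, the grid resolution, and the data (7 successes in 10 trials) are all hypothetical.

```python
from math import comb

def binomial_likelihood(p, n, h):
    """L(p | n, h) = C(n, h) * p^h * (1 - p)^(n - h)."""
    return comb(n, h) * p**h * (1 - p)**(n - h)

def best_p(n, h, steps=100):
    """Try p = 0, 1/steps, ..., 1 and return the value with the highest likelihood."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda p: binomial_likelihood(p, n, h))

# Hypothetical data: 7 successes observed in 10 trials.
n, h = 10, 7
p_hat = best_p(n, h)  # the grid value closest to h / n
```

For the binomial model the maximum can also be found analytically (it is h / n), so the grid search is only there to mirror the "try values and keep the best" idea.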
Consider a small alignment
AG
AC
TG
• Let #sequences = s = 3
• Each position a data point
• For each position, 4^s possible values, e.g.
{A,A,A}, {A,A,T}, …
• In this example, 4^3 = 64 possible values at each
position.
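As a quick check of the counting argument, the possible column patterns can be enumerated directly. This is a small sketch; the variable and function names are illustrative and not from the slides.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

def site_patterns(s):
    """All 4^s possible columns for an alignment of s sequences."""
    return list(product(NUCLEOTIDES, repeat=s))

alignment = ["AG", "AC", "TG"]           # the small alignment from the slide
patterns = site_patterns(len(alignment))  # 64 tuples for s = 3

# Each alignment column (position) is one observed data point.
columns = list(zip(*alignment))           # [('A', 'A', 'T'), ('G', 'C', 'G')]
```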
Probability/Likelihood function
The simplest model: assign each of the 64 possible values a
probability p equal to its observed frequency.
2 patterns ({A,A,T} and {G,C,G}) have p = 0.5; all others have p = 0.
The result “works” but is not biologically
interesting.
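A minimal sketch of this "observed frequency" model, using the three-sequence alignment from the earlier slide. The helper name pattern_prob is hypothetical. The likelihood of the data is simply the product of the per-column probabilities.

```python
from collections import Counter
from math import prod

alignment = ["AG", "AC", "TG"]
columns = list(zip(*alignment))  # one tuple per alignment position

# Probability of each pattern = its observed frequency among the columns.
counts = Counter(columns)
freq = {pattern: c / len(columns) for pattern, c in counts.items()}

def pattern_prob(pattern):
    """Observed-frequency model: unseen patterns get probability 0."""
    return freq.get(pattern, 0.0)

likelihood = prod(pattern_prob(col) for col in columns)  # 0.5 * 0.5
```

The model fits the observed columns as well as any model can, but it has one free parameter per pattern and says nothing about how the sequences are related, which is why the slide calls it uninteresting.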
Maximum likelihood testing
model
Definition
• Method for the inference of phylogeny
• Method that searches for the tree with the
highest probability or likelihood.
Example going through the Maximum likelihood model
• Assume that we have the aligned nucleotide sequences for four taxa:
(1) A G G C U C C A A .... A
(2) A G G U U C G A A .... A
(3) A G C C C A G A A .... A
(4) A U U U C G G A A .... C
Evaluate the likelihood of the unrooted tree represented by the
nucleotides of site j in the sequence
http://www.icp.ucl.ac.be/~opperd/private/max_likeli.html
• Since the likelihood of the tree is independent of the position of the
root, we can display the figure as shown in Figure B.
• Assume that the nucleotides evolve independently (the Markovian
model of evolution).
• Calculate the likelihood for each site separately and combine the
likelihoods into a total value towards the end.
• To calculate the likelihood for site j, we have to consider all the
possible scenarios by which the nucleotides present at the tips of the
tree could have evolved.
• Therefore the likelihood for a particular site is the summation of the
probabilities of every possible reconstruction of ancestral states, given
some model of base substitution.
• In this specific case, each of the nucleotides A, G, C, and T may
occupy nodes (5) and (6), giving 4 x 4 = 16 possibilities.
• For protein sequences, each site may occupy 20 states (the 20
amino acids), so 20 x 20 = 400 possibilities have to be considered.
• Since any one of these scenarios could have led to the nucleotide
configuration at the tips of the tree, we must calculate the
probability of each and sum them to obtain the total probability
for each site j.
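The sum over the 16 ancestral-state reconstructions can be sketched as follows. Several things here are assumptions made for illustration and are not given in the slides: the topology (tips 1 and 2 attached to node 5, tips 3 and 4 attached to node 6), the Jukes-Cantor substitution model, a single shared branch length v, and equal base frequencies of 1/4 at node 5.

```python
from itertools import product
from math import exp

STATES = "ACGT"

def jc69(i, j, v):
    """Jukes-Cantor probability of going from state i to state j over branch length v."""
    e = exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def site_likelihood(tips, v=0.1):
    """Likelihood of one site on the assumed tree ((1,2),(3,4)).

    tips: the four observed nucleotides, in taxon order 1..4.
    v:    assumed branch length, shared by all five branches.
    """
    t1, t2, t3, t4 = tips
    total = 0.0
    # Sum over all 4 x 4 = 16 assignments of states to internal nodes 5 and 6.
    for x5, x6 in product(STATES, repeat=2):
        total += (0.25                      # assumed prior for the state at node 5
                  * jc69(x5, t1, v) * jc69(x5, t2, v)
                  * jc69(x5, x6, v)
                  * jc69(x6, t3, v) * jc69(x6, t4, v))
    return total

first_site = site_likelihood("AAAA")  # site 1 of the example alignment
```

A useful sanity check on a sketch like this is that the likelihoods of all 256 possible tip patterns sum to 1.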
• The likelihood for the full tree is then the product of
the likelihoods at each site.
• Since the individual likelihoods are extremely
small numbers, it is convenient to sum the log
likelihoods at each site and report the
likelihood of the entire tree as the log
likelihood.
• The above procedure is then repeated for all
possible topologies (all possible trees).
• The tree with the highest likelihood is the
maximum likelihood tree.
Huelsenbeck J., Crandall K. Annu. Rev. Ecol.
Syst., 1997, 28:437-66.
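Two points from the bullets above can be made concrete with a short sketch. First, multiplying many small per-site likelihoods underflows in floating point, while summing their logs does not. Second, "all possible topologies" grows very quickly: the number of unrooted bifurcating trees for n taxa is (2n - 5)!!, a standard result that is not stated in the slides. The per-site values below are made up for illustration.

```python
from math import log, prod

def log_likelihood(site_likelihoods):
    """Sum of per-site log likelihoods (the log of their product)."""
    return sum(log(l) for l in site_likelihoods)

def num_unrooted_topologies(n_taxa):
    """Number of unrooted bifurcating trees: (2n - 5)!! for n >= 3."""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

# Hypothetical per-site likelihoods for a 100-site alignment.
site_likelihoods = [1e-5] * 100

direct_product = prod(site_likelihoods)      # underflows to 0.0
total_logl = log_likelihood(site_likelihoods)  # finite, about -1151.3

trees_for_4_taxa = num_unrooted_topologies(4)
trees_for_10_taxa = num_unrooted_topologies(10)
```

The rapid growth in the number of topologies is the reason exhaustive evaluation is only feasible for small numbers of taxa.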
DNA Substitution Models
General DNA Substitution Model
Likelihood L is the probability of observing data
D given hypothesis H
$L = \Pr(D \mid H)$
The use of maximum likelihood (ML) algorithms
in developing phylogenetic hypotheses requires
a model of evolution.
The rate matrix for a general model of DNA
substitution is given by
The rows and columns are ordered A, C, G and T. The matrix
gives the rate of change from nucleotide i (arranged along the
rows) to nucleotide j (along the columns).
For example, $r_2 \pi_C$ gives the rate of change from A to C.
Let $P(v, s)$ be the transition probability matrix where
$p_{i,j}(v, s)$ is the probability that nucleotide i changes into j
over branch length v. The vector s contains the
parameters of the substitution model (e.g. $\pi_A$, $\pi_C$, $\pi_G$, $\pi_T$,
$r_1$, $r_2$, …).
For the two-state case, to calculate the probability of
observing a change over a branch of length v, the
following matrix calculation is performed:
$P(v, s) = e^{Qv}$
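The matrix exponential can be sketched for a two-state model as follows. The symmetric rate matrix with a single rate, the value of that rate, and the branch length are assumptions chosen for illustration. The exponential is computed with a plain truncated Taylor series, which is adequate for a tiny matrix with small entries; a numerical library routine such as scipy.linalg.expm would be the usual choice in practice.

```python
from math import exp

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(m, terms=30):
    """Truncated Taylor series: I + M + M^2/2! + ... (first `terms` terms)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]        # current term, starts at M^0 / 0!
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, m)]  # M^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def transition_matrix(q, v):
    """P(v) = exp(Q v) for rate matrix q and branch length v."""
    return mat_exp([[rate * v for rate in row] for row in q])

rate = 1.0                                   # assumed rate of change between the two states
Q = [[-rate, rate], [rate, -rate]]           # assumed symmetric two-state rate matrix
v = 0.5                                      # assumed branch length
P = transition_matrix(Q, v)

# For this symmetric model the probability of a change has a closed form.
closed_form_change = 0.5 * (1.0 - exp(-2.0 * rate * v))
```

Here P[0][1] is the probability of observing a change over the branch, and each row of P sums to 1.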
DNA substitution Models
Advantages of Maximum likelihood
• Lower variance than other methods
• Least affected by sampling error
• Robust to many violations of the assumptions
of the evolutionary model; even with very
short sequences, ML methods outperform other
methods
• Less error prone
• Statistically well founded
• Evaluates different tree topologies
Disadvantages of Maximum
likelihood
• CPU intensive and may take a long time to
complete an evaluation
• The result is dependent on the model of
evolution used.