Lecture 8 - ML and Bayesian







14.4. Tue  Introduction to models (Jarno)
16.4. Thu  Distance-based methods (Jarno)
17.4. Fri  ML analyses (Jarno)
20.4. Mon  Assessing hypotheses (Jarno)
21.4. Tue  Problems with molecular data (Jarno)
23.4. Thu  Problems with molecular data (Jarno)
24.4. Fri  Phylogenomics; search algorithms, visualization, and other computational aspects (Jarno)
Maximum Likelihood

Maximum likelihood methods of phylogenetic inference evaluate a hypothesis about evolutionary history (the branching order and branch lengths of a tree) in terms of the probability that a proposed model of the evolutionary process, together with the hypothesised history (tree), would give rise to the data we observe.

The likelihood, L, is the probability, P, of the data (D) given the hypothesis (H):
◦ L = P(D | H)
where D is the observed data (the aligned sequences) and H comprises the tree topology, branch lengths and model of evolution.



In statistical usage, a distinction is made depending on the roles of the outcome and the parameter.
Probability is used when describing a function of the outcome given a fixed parameter value. For example, if a coin is flipped 10 times and it is a fair coin, what is the probability of it landing heads-up every time?
Likelihood is used when describing a function of a parameter given an outcome. For example, if a coin is flipped 10 times and it has landed heads-up 10 times, what is the likelihood that the coin is fair? [Wikipedia, article on likelihood]




An optimality criterion (as is parsimony)
Given a model and data, we can evaluate a tree
We can choose between trees based on their likelihoods
The tree(s) with the highest likelihood are the best
[Figure: hierarchy of nucleotide substitution models, from the simplest to the most general:
◦ JC – equal base frequencies, a single substitution type
◦ K2P – equal base frequencies, 2 substitution types; F81 – variable base frequencies, a single substitution type
◦ K3ST – equal base frequencies, 3 substitution types; HKY85 and F84 – variable base frequencies, 2 substitution types
◦ TrN – variable base frequencies, 3 substitution types; SYM – equal base frequencies, 6 substitution types
◦ GTR – variable base frequencies, 6 substitution types]



Maximum Likelihood estimates parameter
values of an explicit model from observed
data
Likelihood provides ways of evaluating
models in terms of their log likelihoods
Different trees can also be evaluated for
their fit to the data under a particular
model (likelihood ratio tests of two trees
after Kishino & Hasegawa)

Let's toss a coin ten times (n). It lands heads up 4 times (x) and tails up 6 times. What is the probability of a head in a single toss?
◦ Compare: What is the likelihood of the data given the process?
Naturally p̂ = x / n = 4 / 10 = 0.4
This is also the maximum likelihood estimate of p.
Let's see why...

A coin toss is a binomial process:
◦ Pr(X = x | n, p) = C(n, x) p^x (1 − p)^(n−x)
The likelihood function then becomes:
◦ L(p | x, n) = C(n, x) p^x (1 − p)^(n−x)
where C(n, x) = n! / (x! (n − x)!) is the binomial coefficient.
Note: in the binomial probability formula x is the unknown, whereas in the likelihood function p is the unknown (because we already have the data, the coin tosses).


The likelihood function can be maximized analytically or by "brute force".
For example, the result for p = 0.4 is:
◦ L = 210 * 0.4^4 * 0.6^6 = 0.2508227
◦ logL = log(L) = -1.383009
◦ -logL = 1.383009
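The same numbers can be verified with R's built-in binomial density (a quick check added here):

dbinom(4, size = 10, prob = 0.4)               # 0.2508227
dbinom(4, size = 10, prob = 0.4, log = TRUE)   # -1.383009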


Analytically, the maximum of the likelihood function is the point where its first derivative is zero and its second derivative is negative.
Graphically...
[Figure: the likelihood plotted as a function of p. The peak of the curve is the maximum likelihood, and the value of p at the peak is the maximum likelihood estimator of p. A sharply peaked curve corresponds to a precise estimate, a flat curve to an imprecise one.]
# Brute-force evaluation of the binomial likelihood over a grid of p values
l <- function(x, n) {
  p <- seq(0, 1, 0.01)
  L <- rep(NA, length(p))
  for (i in 1:length(p)) {
    L[i] <- p[i]^x * (1 - p[i])^(n - x) *
      (factorial(n) / (factorial(x) * factorial(n - x)))
  }
  d <- data.frame(p = p, L = L, logL = log(L))
  return(d)
}
# Plot the log-likelihood curve for 4 heads in 10 tosses
plot(l(4, 10)[, c(1, 3)], ylim = c(-30, 0), type = "l")

# The same curve using R's built-in dbinom()
l2 <- function(x, n) {
  p <- seq(0, 1, 0.01)
  L <- rep(NA, length(p))
  for (i in 1:length(p)) {
    L[i] <- dbinom(x, size = n, prob = p[i], log = TRUE)
  }
  d <- data.frame(p = p, L = L)
  return(d)
}
plot(l2(4, 10), type = "l")



Why log likelihood?
L(0.99 | 10, 4) = 0.0000000002017251
logL(0.99 | 10, 4) = -22.324115
◦ When you multiply very small values together, the result is even smaller, and at some point floating-point precision is lost (a limitation of computers)
◦ The same does not happen with log values:
▪ L = 210 * 0.4^4 * 0.6^6 = 0.2508227
▪ logL = log(210) + 4*log(0.4) + 6*log(0.6) = -1.383009
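A quick R illustration (an added sketch with 200 sites, each given a hypothetical per-site likelihood of 0.01):

site_L <- rep(0.01, 200)   # per-site likelihoods
prod(site_L)               # 0: the true value, 1e-400, underflows double precision
sum(log(site_L))           # -921.034 (= 200 * log(0.01)); the log-scale sum stays finite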


DNA sequences can be thought of as four
sided dice.
Thus, the previous coin example can be
straight-forwardly generealized to DNA
sequences.
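As with the coin, the maximum likelihood estimates of the base frequencies are simply the observed proportions (a small sketch with made-up counts; dmultinom() is the four-sided analogue of dbinom()):

counts <- c(A = 30, C = 20, G = 25, T = 25)    # hypothetical base counts
phat <- counts / sum(counts)                   # ML estimates of the base frequencies
phat
dmultinom(counts, prob = phat, log = TRUE)     # log likelihood at the ML estimates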
Maximum likelihood tree reconstruction
[Figure: four aligned sequences (1: CGAGAC, 2: AGCGAC, 3: AGATTA, 4: GGATAG) and an unrooted four-taxon tree, Tree A.]
What is the probability that unrooted Tree A (rather than another tree) could have generated the data shown under our chosen model?
Maximum likelihood tree reconstruction
[Figure: site j of the alignment, with tip states C, C, A, G on Tree A; the states at the two internal nodes are unknown (?), giving 4 x 4 possible combinations of ancestral states. Stationarity of the process is assumed.]
The likelihood for a particular site j is the sum of the probabilities of every possible reconstruction of ancestral states under a chosen model.
[Figure: the same calculation with the chosen model shown as a rate matrix with a single rate α for all changes among A, C, G and T, and repeated for another site with tip states A, A, T, A.]
[Figure: a five-taxon tree with tip states A, C, C, C, G, internal nodes x, y, z and w, and branches t1-t8; the ti are branch lengths (rate x time).]
For one particular assignment of states to the internal nodes, the probability of the site pattern on tree T is:
P(A,C,C,C,G,x,y,z,w|T) = Prob(x) Prob(y|x,t6) Prob(A|y,t1) Prob(C|y,t2) Prob(z|x,t8) Prob(C|z,t3) Prob(w|z,t7) Prob(C|w,t4) Prob(G|w,t5)
The site likelihood is then obtained by summing this product over all possible states of x, y, z and w.
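A brute-force R sketch of that sum (assuming a Jukes-Cantor model and arbitrary, made-up branch lengths; the transition probabilities use the closed-form JC formula):

bases <- c("A", "C", "G", "T")
# Jukes-Cantor transition-probability matrix for branch length t
jc_p <- function(t) {
  m <- matrix(1/4 - exp(-4 * t / 3) / 4, 4, 4, dimnames = list(bases, bases))
  diag(m) <- 1/4 + 3/4 * exp(-4 * t / 3)
  m
}
bl <- c(t1 = 0.1, t2 = 0.1, t3 = 0.2, t4 = 0.1, t5 = 0.2, t6 = 0.1, t7 = 0.1, t8 = 0.1)  # made-up branch lengths
site_L <- 0
for (x in bases) for (y in bases) for (z in bases) for (w in bases) {
  site_L <- site_L +
    0.25 *                                                      # Prob(x): stationary frequency of state x
    jc_p(bl["t6"])[x, y] * jc_p(bl["t1"])[y, "A"] * jc_p(bl["t2"])[y, "C"] *
    jc_p(bl["t8"])[x, z] * jc_p(bl["t3"])[z, "C"] *
    jc_p(bl["t7"])[z, w] * jc_p(bl["t4"])[w, "C"] * jc_p(bl["t5"])[w, "G"]
}
site_L    # the likelihood of this single site on the tree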



Assume a Jukes-Cantor model (all nucleotide frequencies are equal). Further assume that the branch length is 0.1.
Then we can generate a so-called P matrix from the Jukes-Cantor model's Q matrix: its entries are the probabilities of a nucleotide staying the same or changing to each other nucleotide along that branch.
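In R, the P matrix can be generated from the Q matrix by matrix exponentiation, here via eigendecomposition (a sketch using the standard scaling of one expected change per site per unit time; with a differently scaled Q matrix the numbers will differ):

# Jukes-Cantor Q matrix (rows sum to zero), scaled to one change per site per unit time
nuc <- c("A", "C", "G", "T")
Q <- matrix(1/3, 4, 4, dimnames = list(nuc, nuc))
diag(Q) <- -1
# P(t) = exp(Q t); here for branch length t = 0.1
e <- eigen(Q)
P <- e$vectors %*% diag(exp(e$values * 0.1)) %*% solve(e$vectors)
round(P, 4)   # diagonal ~0.906, off-diagonal ~0.031 under this scaling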






A: acct
B: gcct
L = (0.25 * 0.0062)^1 * (0.25 * 0.9815)^3 = 2.289932e-05
logL = log(L) = -10.68 (natural log; -4.64 if using base-10 logarithms)
For other branch lengths, the P matrix can be multiplied by itself k times; this gives the P matrix for a branch k times as long.
A branch length can be optimized by maximizing the likelihood with respect to that branch length (see the sketch below).
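A sketch of this optimization in R (assuming the standard Jukes-Cantor scaling; the two sequences above differ at 1 of 4 sites):

# Log-likelihood of two sequences differing at k of n sites, as a function of
# branch length t under the Jukes-Cantor model
two_seq_logL <- function(t, k, n) {
  p_diff <- 1/4 - 1/4 * exp(-4 * t / 3)
  p_same <- 1/4 + 3/4 * exp(-4 * t / 3)
  k * log(0.25 * p_diff) + (n - k) * log(0.25 * p_same)
}
# Maximize over t for the example A: acct, B: gcct (k = 1, n = 4)
optimize(two_seq_logL, interval = c(1e-6, 5), k = 1, n = 4, maximum = TRUE)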



Depending on the software, each iteration of the tree optimization algorithm has to, for a certain tree topology:
◦ Calculate the likelihood of the tree topology given the model and the observed data
◦ Estimate the optimal branch lengths
Maximum likelihood tree reconstruction



The likelihood of Tree A is the product of the likelihoods at each site
The likelihood is usually evaluated by summing the logs of the site likelihoods (because the multiplied probabilities become very small) and is reported as the log likelihood of the full tree
The maximum likelihood tree is the one with the highest likelihood (it might not be Tree A, i.e. it could be another tree topology)
◦ Note: highest likelihood (largest value of L) = the largest logL (closest to zero, since log likelihoods are negative) = the smallest –logL
Typical assumptions of ML substitution
models



The probability of any change is independent
of the prior history of the site (a Markov
Model)
Substitution probabilities do not change with
time or over the tree (a homogeneous
Markov process)
Change is time reversible, e.g. the rate of change of A to T is the same as that of T to A

A model is always a simplification of what
happens in nature
◦ Assumes evolution works parsimoniously


A given model will give more weight to certain changes than to others
ML – an objective criterion for choosing one
weighting scheme over another?
Based largely on slides by Paul Lewis (www.eeb.uconn.edu)


D will stand for Data
H will mean any one of a number of things:
◦ a discrete hypothesis
◦ a distinct model (e.g. JC, HKY, GTR, etc.)
◦ a tree topology
◦ one of an infinite number of continuous model parameter values (e.g. ts:tv rate ratio)



In ML, we choose the hypothesis that gives the
highest (maximized) likelihood to the data
The likelihood is the probability of the data
given the hypothesis L = P (D | H).
A Bayesian analysis expresses its results as the
probability of the hypothesis given the data.
◦ this may be a more desirable way to express the
result


The posterior probability, [P (H | D)], is the
probability of the hypothesis given the
observations, or data (D)
The main feature in Bayesian statistics is that
it takes into account prior knowledge of the
hypothesis
◦ P(H | D) = P(D | H) × P(H) / P(D)
where P(D | H) is the likelihood of the hypothesis, P(H) is the prior probability of the hypothesis, P(H | D) is the posterior probability of hypothesis H, and P(D) is the probability of the data (a normalizing constant).

Both ML and Bayesian methods use the
likelihood function
◦ In ML, free parameters are optimized, maximizing
the likelihood
◦ In a Bayesian approach, free parameters are assigned probability distributions, which are sampled.




Data D: 6 heads (out of 10 flips)
H = true underlying proportion of heads (the
probability of coming up heads on any single
flip)
if H = 0.5, coin is perfectly fair
if H = 1.0, coin always comes up heads (i.e. it
is a trick coin)

Frequentist: there exists a true probability H of getting heads; the null hypothesis is H0: H = 0.5
◦ Does the data reject the null hypothesis?
Bayesian: what is the range around 0.5 that we are willing to accept as being in the "fair coin" range?
◦ What is the probability that H is in this range?
◦ For the coin tossing example, we can calculate the probabilities exactly (see the sketch below)
◦ For more complex data, we need to explore the probability space → MCMC
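A sketch of both views for this example in R (assuming a flat Beta(1,1) prior and a "fair" range of 0.45–0.55; both the prior and the range are illustrative choices made here, not from the slides):

# Frequentist: exact binomial test of H0: H = 0.5 for 6 heads in 10 flips
binom.test(6, 10, p = 0.5)$p.value
# Bayesian: with a flat Beta(1,1) prior the posterior of H is Beta(1 + 6, 1 + 4);
# the posterior probability that H lies in the "fair" range 0.45-0.55:
pbeta(0.55, 7, 5) - pbeta(0.45, 7, 5)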

Start somewhere
◦ That “somewhere” will have a likelihood associated
with it
◦ Not the optimized, maximum likelihood

Randomly propose a new state
◦ If the new state has a better likelihood (a higher posterior density), the chain moves there
◦ If it is worse, the chain still moves there with probability equal to the ratio of the densities; otherwise it stays where it is


The target distribution is the posterior
distribution of interest
The proposal distribution is used to decide
where to go next; you have much flexibility
here, and the choice affects the efficiency of
the MCMC algorithm
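A minimal Metropolis sketch in R for the coin example (assuming a flat prior on H and a uniform ±0.1 proposal; both are illustrative choices, not from the slides):

set.seed(1)
n_gen <- 20000
H <- numeric(n_gen)
H[1] <- 0.5                                   # start somewhere
# log posterior (flat prior, so proportional to the binomial likelihood)
log_post <- function(h) if (h <= 0 || h >= 1) -Inf else dbinom(6, 10, h, log = TRUE)
for (i in 2:n_gen) {
  prop <- H[i - 1] + runif(1, -0.1, 0.1)      # proposal distribution
  # Metropolis rule: always accept a better state, accept a worse one
  # with probability equal to the ratio of the posterior densities
  if (log(runif(1)) < log_post(prop) - log_post(H[i - 1])) {
    H[i] <- prop
  } else {
    H[i] <- H[i - 1]
  }
}
mean(H)                                       # posterior mean of H
quantile(H, c(0.025, 0.975))                  # 95% credible interval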



Pro: taking big steps helps in jumping from
one “island” in the posterior density to
another
Con: taking big steps often results in poor
mixing
Solution: MCMCMC!



MC3 involves running several chains simultaneously (one "cold" and several "heated")
The cold chain is the one that counts; the heated chains are "scouts"
A chain is heated by raising its posterior density to a power less than 1.0 (powers closer to 0.0 are warmer)
Marginal = taking into account all possible values of all the other parameters




Record the position of the chain (the "robot") every 100 or 1000 steps (sampling every 1000 steps represents more "thinning" than every 100)
This sample will be autocorrelated, but not
much so if it is thinned appropriately (can
measure autocorrelation to assess this)
If using heated chains, only the cold chain is
sampled
The marginal distribution of any parameter can
be obtained from this sample




Start with random tree and arbitrary initial values
for branch lengths and model parameters
Each generation consists of one of these (chosen at
random):
◦ Propose a new tree (e.g. Larget-Simon move) and
either accept or reject the move
◦ Propose (and either accept or reject) a new model
parameter value
Every k generations, save tree topology, branch
lengths and all model parameters (i.e. sample the
chain)
After n generations, summarize sample using
histograms, means, credible intervals, etc.



For topologies: discrete Uniform distribution
For proportions: Beta(a,b) distribution
▪ flat when a = b = 1
▪ peaked at 0.5 when a = b and both are greater than 1
For base frequencies: Dirichlet(a,b,c,d) distribution
▪ flat when a = b = c = d = 1
▪ all base frequencies close to 0.25 if v = a = b = c = d and v is large (e.g. 300)

For GTR model relative rates:
Dirichlet(a,b,c,d,e,f) distribution

For other model parameters and branch lengths: Gamma(a,b) distribution
◦ Exponential(λ) equals the Gamma(1, 1/λ) distribution
◦ The mean of Gamma(a,b) is ab (so the mean of an Exponential(10) distribution is 0.1)
◦ The variance of a Gamma(a,b) distribution is ab² (so the variance of an Exponential(10) distribution is 0.01)
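A quick numerical check of these relationships in R (note that rexp() is parameterized by rate, so the corresponding Gamma scale is 1/rate):

set.seed(1)
x <- rexp(1e6, rate = 10)                      # Exponential(10) draws
mean(x); var(x)                                # ~0.1 and ~0.01, as stated above
y <- rgamma(1e6, shape = 1, scale = 1/10)      # the equivalent Gamma(1, 1/10)
mean(y); var(y)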

Flat (uninformative) priors mean that the
posterior probability is directly proportional
to the likelihood
◦ The value of H at the peak of the posterior
distribution is equal to the MLE of H

Informative priors can have a strong effect on
posterior probabilities
1. Beware arbitrarily truncated priors
2. Branch length priors are particularly important
3. Beware high posteriors for very short branch lengths
4. Partition with care (prefer fewer subsets)
5. MCMC run length should depend on the number of parameters
6. Calculate how many times parameters were updated
7. Pay attention to parameter estimates
8. Run without data to explore the prior
9. Run long and run often!
10. Future: model selection should include effects of priors
Marshall, D.C., 2010. Cryptic failure of partitioned Bayesian phylogenetic analyses: lost in the land of long trees. Syst Biol 59, 108-117.



Bayesian methods are here to stay in phylogenetics
They are able to take into account uncertainty in parameter estimates
They are able to relax most assumptions, including rate homogeneity among branches
◦ e.g. timing-of-divergence (dating) analyses
They are being heavily developed; new features and algorithms appear regularly