Choosing among models of sequence evolution.
Maximum likelihood (R. A. Fisher)




Compares hypotheses (H1 & H2) given the observed data (D).
Likelihood ratio statistic (Λ):
Λ = L1/L2 = Pr(D | H1) / Pr(D | H2)
So the log-likelihood statistic:
ln Λ = ln(L1/L2) = lnL1 - lnL2
2 ln Λ is approximately χ²-distributed, with degrees of freedom = the difference in the number of parameters between the hypotheses.


From the homework: we fit different models and found that distances and branch lengths differed among models.
How do we objectively choose which model is "best"?
L = Pr(D | H)
• the (joint) probability of the data (D) given the hypothesis (H)
• H may be a tree, a branch length, or a model parameter
• D is the alignment of nucleotide sequences
Likelihood curve for 11 coin tosses.
Experimental result: 5 heads.
[Figure: likelihood as a function of the probability of heads (p). The curve peaks at the parameter estimate p̂ = 5/11 ≈ 0.454; H0: p = 0.5 (fair coin).]
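The likelihood curve above can be reproduced numerically. A minimal Python sketch (the grid resolution of 0.001 is an arbitrary choice):

```python
from math import comb

def binom_likelihood(p, heads=5, tosses=11):
    """Pr(D | H): probability of observing this many heads given p."""
    return comb(tosses, heads) * p**heads * (1 - p)**(tosses - heads)

# Evaluate the likelihood on a grid of p values and find its peak.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=binom_likelihood)
# The peak sits at p = 5/11 ~ 0.454, the maximum likelihood estimate.
```

The grid search stands in for the analytic maximization; for a binomial the MLE is simply heads/tosses.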
Outline:
• Likelihood
• Trees and model parameters
• Bootstrap
• Searching trees
Maximum Likelihood Estimation
Maximum likelihood (R. A. Fisher)
We want to know how likely our parameter estimate, p̂ = 5/11 ≈ 0.454, is given:
H0: p = 0.500 (fair coin) vs. H1: p ≠ 0.500
[Figure: likelihood curve with the estimate at 0.454 marked against H0 at 0.500.]
Likelihood difference between the estimate and the hypothesis.
[Figure: likelihood curve showing the likelihood at the parameter estimate p̂ = 5/11 ≈ 0.454 versus the likelihood at H0: p = 0.5 (fair coin).]
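Whether that likelihood difference is significant can be checked with the same 2 ln Λ statistic. A small Python sketch of the coin example:

```python
from math import comb, log

def log_likelihood(p, heads=5, tosses=11):
    """Binomial log-likelihood of the observed heads given p."""
    return log(comb(tosses, heads)) + heads * log(p) + (tosses - heads) * log(1 - p)

p_hat = 5 / 11
# 2 * (lnL at the MLE minus lnL under H0) is ~ chi-square with 1 df.
stat = 2 * (log_likelihood(p_hat) - log_likelihood(0.5))
# stat ~ 0.09, far below the 3.84 cutoff: with only 11 tosses the coin
# is not significantly different from fair.
```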
Likelihood surface: more than one parameter, e.g. 2 dimensions.
Is the likelihood difference statistically significant? How do 2 parameters covary?
[Figure: two-dimensional -lnL surface over p(inv) and p(alpha).]
Furthermore, most models are multidimensional surfaces with more than 2 parameters. Harder to visualize, but the math is exactly the same.
What combination of parameters results in the highest likelihood?
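The same logic extends to a surface: evaluate -lnL over a grid of two parameters and take the lowest point. The surface below is a made-up toy paraboloid standing in for real p(inv)/p(alpha) likelihoods computed from sequence data:

```python
# Toy two-parameter -lnL surface (shape and minimum invented purely for
# illustration; a real surface comes from the alignment and tree).
def neg_lnL(p_inv, alpha):
    return (p_inv - 0.44) ** 2 / 0.01 + (alpha - 0.8) ** 2 / 0.1 + 33000

# Grid search: which parameter combination gives the smallest -lnL,
# i.e. the highest likelihood?
grid = [(i / 100, j / 100) for i in range(1, 100) for j in range(1, 200)]
best_p_inv, best_alpha = min(grid, key=lambda pa: neg_lnL(*pa))
```

Real programs use hill-climbing optimizers rather than full grids, but the question answered is the same.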
ML has two distinct steps.
Model parameter estimation (today):
• maximum likelihood estimates (MLEs)
• conditioned on a "good" tree
Tree estimation (later):
• uses the MLEs --> the ML tree

Site likelihoods for 2 models
For example: equal base frequencies? A central assumption of JC69 and K80.
[Table: per-site likelihoods under two models, F81 and F81+I.]
Likelihoods are calculated at each site for each model using the same tree (e.g. a pretty good NJ tree); just the models vary. (Swofford showed that parameter estimates are pretty stable across many trees that are not ML trees but are close to right.)
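The "site likelihoods" idea can be sketched directly: the alignment's likelihood is the product of per-site likelihoods, so -lnL is the negative sum of the site log-likelihoods. The site values below are invented for illustration, not real F81 output:

```python
from math import log

# Hypothetical per-site likelihoods under two models on the same tree.
site_L_m1 = [0.012, 0.034, 0.008, 0.050]
site_L_m2 = [0.015, 0.036, 0.011, 0.052]

# The alignment's -lnL is the negative sum of site log-likelihoods.
neg_lnL_m1 = -sum(log(l) for l in site_L_m1)
neg_lnL_m2 = -sum(log(l) for l in site_L_m2)
# The model with the smaller -lnL fits these sites better.
```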
So use F81: assume unequal base frequencies.
-lnL = 35046.7; estimated Freq(G) = 0.125
F81+I: suppose there are also lots of constant (invariant) sites?
-lnL = 33309.6; 44.2% estimated invariant sites
Note that the estimate of the base frequencies changes with the added parameter: under F81+I, estimated Freq(G) = 0.11.
Log likelihood statistic
F81: -lnL1 = 35047, df = 3
F81+I: -lnL2 = 33310, df = 4
ln Λ = ln(L1/L2) = lnL1 - lnL2
2 ln Λ is approximately χ²-distributed, with degrees of freedom = the difference in the number of parameters between the hypotheses.
2(lnL2 - lnL1) = 2 × 1737.1 = 3474.2; df difference = 1.
If there were no significant difference, we would expect χ²(1 df) ≤ 3.84 at p ≤ 0.05.
The statistic 3474.2 is much, much larger than the χ²(1 df) critical value of 3.84, so F81+I fits much, much better than F81.
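The arithmetic of this test is easy to script. A sketch using the rounded -lnL values from the slides:

```python
# Likelihood-ratio test for F81 (-lnL = 35047) vs. F81+I (-lnL = 33310).
neg_lnL_f81 = 35047.0
neg_lnL_f81_i = 33310.0

stat = 2 * (neg_lnL_f81 - neg_lnL_f81_i)  # 2(lnL2 - lnL1) = 3474
df = 4 - 3                                 # F81+I adds one parameter
critical = 3.84                            # chi-square, 1 df, p = 0.05
f81_i_better = stat > critical             # True: F81+I fits far better
```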
Similarly, for site likelihoods of 2 trees: 2(lnL1 - lnL2) = 2 × 2183 = 4366, df difference = 7.
And similarly, to account for other features of the sequence alignment: GTR+I+G, -lnL = 31120.67, df = 10.

Later we will compare the fit of 2 or more trees under our estimated model of sequence evolution (say GTR+G) to choose the optimal tree.

Model selection software
• jModelTest (Posada, 2008)
• TOPALi v2
• DT-ModSel (Minin, Abdo, Joyce, Sullivan, 2003): this is the preferred performance-based method of Sullivan and Joyce.
More in later lectures.
An abbreviated example.
• We use SeaView and a small primate data set.
• ML with PhyML (a fast maximum-likelihood program).
• We generate the models and compare them.

SeaView analysis: GTR and GTR+I (lnL = -5946 and -5759).
GTR+I+G and HKY+I+G: lnL = -5719 vs. -5728.
lnL difference = 9 (2ΔlnL = 18), df difference = 3; the critical value χ²(3 df) = 7.8, so the models are significantly different: HKY+I+G is a significantly poorer fit.
But note: the topology estimate is the same, and so are the branch length estimates.


The bootstrap (nonparametric)
• Sequences are resampled 1000 times (at least).
• Tree search on each of the 1000 replicated sequences yields 1000 trees.
• The bootstrap consensus is the majority-rule consensus of the 1000 trees.
• Branches are labeled by % occurrence in the 1000 trees.
» Felsenstein, J. 1985. Evolution 39:783-791.
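The resampling step can be sketched in Python (a toy alignment; real use would feed each replicate to a tree-search program):

```python
import random

def bootstrap_replicate(alignment, rng=random):
    """Resample alignment columns with replacement (same length as original)."""
    n_sites = len(alignment[0])
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return [''.join(seq[c] for c in cols) for seq in alignment]

# Toy alignment: 3 sequences, 8 sites.
aln = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
rep = bootstrap_replicate(aln)
# Each replicate keeps the original length; some columns appear twice
# or more, others not at all. A tree is then built from each replicate.
```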


But GTR+I+G is more complex:
• Search time is much longer, and...
• it is less stable, sometimes failing to converge, with
• less power to detect the true tree.
• Consider a slightly poorer-fitting model if its tree is nearly as good: the lack of fit apparently is due to systematic error that may not apply to the goodness of fit for the tree.
• But this may be hard to justify to reviewers.

Measuring Tree Support: the nonparametric bootstrap
• Used in statistics for confidence levels when the data distribution is unknown (Efron, 1979), e.g. when the data are not normally distributed; a pseudosample then provides the estimate of variation.
• Adapted for phylogenies by Felsenstein in 1985.
• The most commonly used measure of tree support, for MP, distance, and ML methods.
• Unnecessary for Bayesian methods.

Tree Confidence: The Bootstrap Replicate
• Resample, with replacement, all sites along the alignment. Some sites are sampled twice or more and some not at all.
• A collection of trees is built from the replicated data.
• Count the number of times each partition occurs in all the trees.
• Any partition that occurs in more than 50% of trees shows up in the majority-rule consensus tree, the "bootstrap tree"; branches are labeled by the percentage of replicate trees containing them.
• Notice that it is the sequences that are bootstrapped, not the original tree.
• Note that high bootstrap support will be misleading if model assumptions are violated; the bootstrap is not a way to check the model.
• Note: multiple-tests problem.
[Figure: replicate trees and a consensus tree labeled with bootstrap percentages ranging from 53 to 100.]
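The partition-counting step can be sketched with hypothetical clade sets from five replicate trees (taxon names and counts invented for illustration):

```python
from collections import Counter

# Hypothetical bipartitions (clades) found in five replicate trees.
replicate_trees = [
    [frozenset("AB"), frozenset("ABC")],
    [frozenset("AB"), frozenset("ABD")],
    [frozenset("AB"), frozenset("ABC")],
    [frozenset("CD"), frozenset("ABC")],
    [frozenset("AB"), frozenset("ABC")],
]

counts = Counter(clade for tree in replicate_trees for clade in tree)
n = len(replicate_trees)
# Keep partitions in >50% of trees; their % becomes the bootstrap value.
consensus = {clade: 100 * c // n for clade, c in counts.items() if c > n / 2}
```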
Bootstrap tree: convention for published trees.
Actually a maximum likelihood tree with bootstrap values added in a Word document.
Note that branch lengths cannot be represented in the consensus tree itself.
[Figure: published ML tree of mustelids (Gulo, Martes, Mustela, otters), skunks, pinnipeds (seals, walrus), other carnivores (bears, raccoon, cat, dog, fox), perissodactyls, artiodactyls, and whales, with bootstrap values of 50-100 on branches.]
Bootstrap tree: pseudosampling the sequence.
• Idea: if many sites support a clade, then some of them will appear in most random replicates. If only a few sites support it, many replicates will lack support for the clade.
• The bootstrap samples with replacement, so the replicate length is the same as the original sequence length.
• Some sites are randomly sampled multiple times, while others are randomly omitted.
• So some rare, but informative, sites are only rarely sampled and so do not show up in all bootstrap trees.
• Hence clades supported by just a very few sites will not be resolved.
• This is the point: high bootstrap values show that many sites support the clade.

How do we search tree space?
Search algorithms find the best tree for the data. Two methods are guaranteed to find the globally best tree:
• Exhaustive search: evaluate every single tree.
• Branch and bound: discarding one tree usually means a whole set of subtrees can be discarded as bad, so the algorithm is much faster than exhaustive search.
SeaView analysis for homework:
• We will try the bootstrap for distances and for parsimony.
• We will try different model parameters using maximum likelihood tests.

Searching methods (for ML and MP trees; figures from Felsenstein's Inferring Phylogenies).
How many possible trees?
Global vs. Heuristic Search
• There are too many trees for computational tractability. Most data sets are large: no possibility of exploring all trees.
• Currently few packages even have the global options (an exception is PAUP* 4).
• Goal: balance a thorough search against tractability and speed.
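Why exhaustive search is hopeless: the number of unrooted binary trees for n taxa is (2n-5)!!, which explodes quickly. A sketch of the count:

```python
def n_unrooted_trees(n_taxa):
    """Count unrooted binary trees: (2n-5)!! = 3 * 5 * ... * (2n-5)."""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

# 4 taxa give 3 trees; 10 taxa already give about 2 million;
# 50 taxa give roughly 3e74, far beyond any computer.
```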
Branch Swapping Strategies
Heuristic search strategies:
1. NNI: nearest-neighbor interchange. Dissolve an interior branch and form each alternative arrangement of the four attached subtrees.
2. SPR: subtree pruning and regrafting. Break a branch off and reconnect its "root" somewhere else (onto any other branch).
[Figure: example rearrangements on a four-tip tree (S, T, U, V).]
Which branch swapping to use? NNI --> SPR --> TBR.
But all can get stuck on local optima.
[Figure: a likelihood surface over tree space with local maxima and the global maximum.]
Heuristic Search Strategies
3. TBR: tree bisection and reconnection. Break a branch to split the tree into two pieces, then reattach them using a different branch in each.
• NNI and SPR rearrangements are subsets of TBR.
• Increasingly accurate: TBR is best. Decreasingly fast: TBR is slowest.
• PAUP* has all three methods; PhyML has just NNI and SPR.
But all can get stuck on local optima.
Traveling through tree space by accepting increasingly better trees (hill climbing).
The global maximum is the maximum likelihood tree (or the maximum parsimony tree), but a search can get stuck on a local optimum: the starting tree influences the outcome. If you start near the wrong peak, you end up on it.
[Figure: a surface over tree space; starting points on different slopes climb to different maxima.]
Two effective methods:
1. Sequential addition of taxa.
• Felsenstein's recommendation: list the best-resolved taxa first, then add increasingly unresolved relationships.
• "Best known first" requires knowledge of the biology (oh dear!): you must know what the question is, and you must limit the question that you are asking.
2. Stepwise addition from multiple starting points.
• Fastest: start with an MP or NJ distance tree. Use a quick method and hope it lands near the global maximum; no guarantees, but usually OK.
• Or start at multiple random trees to increase the chance of covering tree space, giving a better chance of reaching the global maximum.
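The hill-climbing-with-random-restarts idea can be sketched on a toy one-dimensional "tree space" (the score function is invented; moving to a neighbor stands in for a branch swap):

```python
import random

def score(x):
    # Toy "likelihood" surface: a local bump near x = 2 and the global
    # maximum at x = 8 (stands in for the likelihood of a tree).
    return -min((x - 2) ** 2, (x - 8) ** 2 - 5)

def hill_climb(start, step=1):
    """Accept a neighbor only if it scores better; stop at a local optimum."""
    current = start
    while True:
        neighbor = max((current - step, current + step), key=score)
        if score(neighbor) <= score(current):
            return current
        current = neighbor

# Multiple random starts: a better chance one lands near the global maximum.
rng = random.Random(0)
starts = [rng.randint(-10, 20) for _ in range(10)]
best = max((hill_climb(s) for s in starts), key=score)
```

A start left of the valley climbs only to the local bump at x = 2; a start on the right slope reaches the global maximum at x = 8, which is why multiple starts help.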