Introduction to model based methods
Niklas Wahlberg

Some useful links
- www.zoologi.su.se/research/wahlberg/Phylocourse/phylocourse.htm
- www.helsinki.fi/~jhyvonen/ms06

DNA evolves through mutation
- With 100 billion bases in GenBank, we are beginning to understand how DNA sequences evolve
- Different genes have their own mutation dynamics
- Mitochondrial and nuclear genes differ in mutation dynamics

Hidden evolution in DNA sequences
[Figure: two sequences, Seq 1 (AGCGAG) and Seq 2 (GCGGAC), annotated with the number of changes (1, 2 or 3) that occurred at individual sites.]

Modeling evolution
- Models incorporate information about the rates at which each nucleotide is replaced by each alternative nucleotide
  - For DNA this can be expressed as a 4 x 4 rate matrix (known as the Q matrix)
- Other model parameters may include:
  - Site by site rate variation, often modelled as a statistical distribution, for example a gamma distribution

Parameters we are interested in
- The mean instantaneous substitution rate (= the general mutation rate + rate of fixation in population)
- The relative rates of substitution between each base pair
- The average frequencies of each base in the dataset
- Branch lengths

A general model of sequence evolution
[Figure: the four bases with their frequencies πA, πC, πG and πT (purines A and G, pyrimidines C and T), connected by arrows labelled with the twelve relative substitution rates a to l.]

A general model of molecular evolution

$$
Q = \begin{pmatrix}
-\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
\mu g\pi_A & -\mu(g\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
\mu h\pi_A & \mu j\pi_C & -\mu(h\pi_A + j\pi_C + f\pi_T) & \mu f\pi_T \\
\mu i\pi_A & \mu k\pi_C & \mu l\pi_G & -\mu(i\pi_A + k\pi_C + l\pi_G)
\end{pmatrix}
$$

Rows and columns are in the order A, C, G, T.
- µ = mean instantaneous substitution rate
- a, b, c, ... l = relative rate of substitution
- The product of µ and a relative rate is the rate parameter
- πA = frequency of A

The Jukes and Cantor model is the simplest model

      A     C     G     T
A   −3α     α     α     α
C     α   −3α     α     α
G     α     α   −3α     α
T     α     α     α   −3α

The JC model is a one parameter model
1) it assumes that all bases are equally frequent (p=0.25)
2) unless modified it assumes all sites can change and that they do so at the same rate

Jukes-Cantor model
[Figure: the four bases A, C, G and T, with every pair connected by arrows labelled α.]
- α = the rate of substitution (for example, the rate at which A changes to G per unit of time t)
- The rate of substitution for each nucleotide is 3α
- In t steps there will be 3αt changes

Kimura model
[Figure: the four bases A, C, G and T, with transitions labelled α and transversions labelled β.]
- α = transitions
- β = transversions

The Kimura model has 2 parameters

      A   C   G   T
A     −   β   α   β
C     β   −   β   α
G     α   β   −   β
T     β   α   β   −

The K2P model is more realistic, but still
1) it assumes that all bases are equally frequent (p=0.25)
2) unless modified it assumes all sites can change and that they do so at the same rate

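The gap between the 3αt changes that actually occur and the differences that can still be observed can be illustrated with a short sketch. It assumes the standard Jukes-Cantor result that a site differs from its starting base with probability 3/4 (1 − e^(−4αt)); that formula, the function names and the value of α are not from the slides and are only illustrative.

```python
import math


def jc_expected_changes(alpha, t):
    """Expected substitutions per site: each base is replaced at total rate 3*alpha."""
    return 3.0 * alpha * t


def jc_observed_difference(alpha, t):
    """Probability that a site differs from its starting base after time t under JC.

    Assumes the standard Jukes-Cantor solution; multiple hits at the same
    site mean this saturates at 0.75 instead of growing without bound.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * alpha * t))


alpha = 0.01  # hypothetical rate, chosen only for illustration
for t in (1, 10, 50, 200):
    expected = jc_expected_changes(alpha, t)
    observed = jc_observed_difference(alpha, t)
    print(f"t={t:4d}  changes per site={expected:.3f}  observed difference={observed:.3f}")
```

For short times the two columns are close; for long times the number of changes keeps growing while the observed difference levels off, which is the hidden evolution a model has to correct for.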
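The Hasegawa-Kishino-Yano model

        A      C      G      T
A       −    πCβ    πGα    πTβ
C     πAβ      −    πGβ    πTα
G     πAα    πCβ      −    πTβ
T     πAβ    πCα    πGβ      −

The HKY model takes into account variable base frequencies, but still
1) unless modified it assumes all sites can change and that they do so at the same rate

The most general time-reversible model: the GTR model
[Figure: the four bases with their frequencies πA, πC, πG and πT, connected by six arrows labelled with the relative substitution rates a to f.]

$$
Q = \begin{pmatrix}
-\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
\mu a\pi_A & -\mu(a\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
\mu b\pi_A & \mu d\pi_C & -\mu(b\pi_A + d\pi_C + f\pi_T) & \mu f\pi_T \\
\mu c\pi_A & \mu e\pi_C & \mu f\pi_G & -\mu(c\pi_A + e\pi_C + f\pi_G)
\end{pmatrix}
$$

Rows and columns are in the order A, C, G, T.
- µ = mean instantaneous substitution rate
- a, b, c, ... f = relative rate of substitution
- The product of µ and a relative rate is the rate parameter
- πA = frequency of A
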
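As a rough illustration of the GTR rate matrix, the sketch below builds Q from six relative rates and four base frequencies and checks two properties: each row sums to zero, and the matrix is time reversible (πi·qij = πj·qji). The function name, the dictionary layout and all numerical values are hypothetical choices made for this example.

```python
BASES = "ACGT"


def gtr_rate_matrix(rates, freqs, mu=1.0):
    """Build a GTR Q matrix as nested dicts q[from_base][to_base].

    rates: relative rates a..f for the pairs AC, AG, AT, CG, CT, GT.
    freqs: base frequencies, assumed to sum to 1.
    """
    pair_rate = {
        frozenset("AC"): rates["a"], frozenset("AG"): rates["b"],
        frozenset("AT"): rates["c"], frozenset("CG"): rates["d"],
        frozenset("CT"): rates["e"], frozenset("GT"): rates["f"],
    }
    q = {}
    for i in BASES:
        # Off-diagonal entries: mu * relative rate * frequency of the target base.
        q[i] = {j: mu * pair_rate[frozenset((i, j))] * freqs[j]
                for j in BASES if j != i}
        # Diagonal entry makes the row sum to zero.
        q[i][i] = -sum(q[i].values())
    return q


# Hypothetical parameter values, for illustration only.
rates = {"a": 1.0, "b": 4.0, "c": 1.0, "d": 1.0, "e": 4.0, "f": 1.0}
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
q = gtr_rate_matrix(rates, freqs)

rows_sum_to_zero = all(abs(sum(q[i].values())) < 1e-12 for i in BASES)
reversible = all(abs(freqs[i] * q[i][j] - freqs[j] * q[j][i]) < 1e-12
                 for i in BASES for j in BASES if i != j)
print(rows_sum_to_zero, reversible)
```

Setting all six rates equal and all frequencies to 0.25 gives a matrix of the Jukes-Cantor form, which is one way to see that the simpler models are special cases.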
The most commonly used models
- Almost all models used are special cases of one model:
  - The general time reversible model

Model         Base frequencies   Substitution types
GTR           Variable           6
SYM           Equal              6
TrN           Variable           3
K3ST          Equal              3
HKY85, F84    Variable           2
K2P           Equal              2
F81           Variable           Single
JC            Equal              Single

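One way to keep track of how these models relate is to count their free parameters. The sketch below is a minimal bookkeeping example; it assumes the usual convention that one relative rate is fixed as a reference and that variable base frequencies contribute three free parameters (the four frequencies sum to 1). Branch lengths are left out, and the dictionary layout and function names are hypothetical.

```python
# Free substitution-model parameters (branch lengths not counted).
# rate_params: free relative-rate parameters
# freq_params: free base frequencies (3 if variable, 0 if all fixed at 0.25)
MODELS = {
    "JC":    {"rate_params": 0, "freq_params": 0},
    "K2P":   {"rate_params": 1, "freq_params": 0},
    "K3ST":  {"rate_params": 2, "freq_params": 0},
    "SYM":   {"rate_params": 5, "freq_params": 0},
    "F81":   {"rate_params": 0, "freq_params": 3},
    "HKY85": {"rate_params": 1, "freq_params": 3},
    "TrN":   {"rate_params": 2, "freq_params": 3},
    "GTR":   {"rate_params": 5, "freq_params": 3},
}


def free_parameters(model):
    """Total number of free substitution parameters for a named model."""
    entry = MODELS[model]
    return entry["rate_params"] + entry["freq_params"]


def extra_parameters(simpler, richer):
    """How many parameters the richer model adds to the simpler one."""
    return free_parameters(richer) - free_parameters(simpler)


for name in MODELS:
    print(name, free_parameters(name))
print("GTR adds", extra_parameters("HKY85", "GTR"), "parameters to HKY85")
```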
Models
- Model parameters can be:
  - estimated from the data (using a likelihood function)
  - pre-set based upon assumptions about the data (for example that for all sequences all sites change at the same rate and all substitutions are equally likely, e.g. the Jukes and Cantor model)
- Wherever possible avoid assumptions which are violated by the data because they can lead to incorrect trees

Models can be made more parameter rich to increase their realism
- The most common additional parameters are:
  - A correction for the proportion of sites which are invariable (parameter I)
  - A correction for variable site rates at those sites which can change (parameter gamma, G)
- All models can be supplemented with these parameters (e.g. GTR+I+G, HKY+I+G)

A gamma distribution can be used to model site rate heterogeneity
[Figure: gamma distribution of site rates; x-axis: Rate, y-axis: Frequency.]

Gamma distribution computationally costly
- Computational difficulties in using continuous distribution
- Most programs use discrete categories

Difficulties in estimating parameters
- The parameters I and G covary!
- (I + G) can be estimated, but the values of I and G are not easily teased apart
- Parameter G takes I into account, I not needed

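The idea of replacing a continuous gamma distribution with a few discrete rate categories can be sketched as follows. This is only a sampling-based approximation written for illustration: it assumes equal-probability categories, each represented by its mean rate, and a gamma distribution scaled to mean 1. Real programs compute the categories analytically; the function name, the shape value and the sample size are hypothetical.

```python
import random


def discrete_gamma_rates(shape, n_categories=4, n_samples=100_000, seed=1):
    """Approximate mean rates of equal-probability gamma categories by sampling.

    The gamma distribution has the given shape and scale 1/shape, so its
    mean is 1 and the category rates average to roughly 1 as well.
    n_samples should be divisible by n_categories.
    """
    rng = random.Random(seed)
    samples = sorted(rng.gammavariate(shape, 1.0 / shape) for _ in range(n_samples))
    size = n_samples // n_categories
    rates = []
    for k in range(n_categories):
        chunk = samples[k * size:(k + 1) * size]
        rates.append(sum(chunk) / len(chunk))
    return rates


# A small shape value means strong rate heterogeneity among sites.
for shape in (0.5, 2.0):
    rates = discrete_gamma_rates(shape)
    print(shape, [round(r, 3) for r in rates])
```

With a small shape parameter most sites fall into very slow categories and a few into a fast one; with a larger shape the category rates move closer together.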
Estimation of ML substitution model parameters:
- Yang (1995) has shown that parameter estimates are reasonably stable across tree topologies provided trees are not "too wrong".
- Thus one can obtain a tree using a quick method (useful when many sequences are being analysed) and then estimate parameters on that tree.
- These parameters can then be used in a search for the most likely tree(s) (given the model)

Models can be made more parameter rich to increase their realism
- But the more parameters you estimate from the data the more time needed for an analysis and the more sampling error accumulates
  - One might have a realistic model but large sampling errors
  - Realism comes at a cost in time and precision!
  - Fewer parameters may give an inaccurate estimate, but more parameters decrease the precision of the estimate
  - In general use the simplest model which fits the data

Choosing your model
- When models are nested
  - Likelihood ratio test (LRT)
- When models are not nested
  - Akaike Information Criterion (AIC)
  - Bayesian Information Criterion (BIC)

[Figure, repeated on several slides: the hierarchy of commonly used models from GTR down to JC, labelled by base frequencies (variable or equal) and number of substitution types.]

Need to know the likelihood of a model
- For both tests, one needs to compute the likelihood of the model
  - (covered tomorrow)
- For now, assume we know the likelihood of the models we want to compare

Likelihood ratio test (LRT)
- LR = 2*(lnL1-lnL2), where lnL1 is the log likelihood of the more complex model and lnL2 that of the simpler model
- LRT statistic approximately follows a chi-square distribution
- Degrees of freedom equal to the number of extra parameters in the more complex model

Example 1
- HKY85 -lnL = 1787.08
- GTR -lnL = 1784.82
- Then, LR = 2 (1787.08 - 1784.82) = 4.52
- degrees of freedom = 4 (GTR adds 4 additional parameters to HKY85)
- critical value (P = 0.05) = 9.49
- GTR does not fit significantly better!

Example 2: testing a molecular clock
- HKY85 + clock -lnL = 7573.81
- HKY85 -lnL = 7568.56
- Then, LR = 2 (7573.81 - 7568.56) = 10.50
- degrees of freedom = s-2 = 5-2 = 3
- critical value (P = 0.05) = 7.82
- 10.50 is larger than 7.82, so the clock model fits significantly worse and is rejected
- Degrees of freedom in molecular clock case is number of taxa (s) minus 2
- Clock model is simpler (allows only a single rate)

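The two examples can be reproduced with a few lines of code. The sketch below takes the -lnL values and the chi-square critical values directly from the examples; the function names are hypothetical, and the critical values are passed in as plain numbers, not computed.

```python
def likelihood_ratio(neg_lnl_simple, neg_lnl_complex):
    """LR = 2 * (lnL_complex - lnL_simple), written in terms of -lnL values."""
    return 2.0 * (neg_lnl_simple - neg_lnl_complex)


def lrt(neg_lnl_simple, neg_lnl_complex, critical_value):
    """Return the LR statistic and whether the simpler model is rejected."""
    lr = likelihood_ratio(neg_lnl_simple, neg_lnl_complex)
    return lr, lr > critical_value


# Example 1: HKY85 (simpler) against GTR, 4 degrees of freedom.
lr1, reject1 = lrt(1787.08, 1784.82, critical_value=9.49)

# Example 2: HKY85 + clock (simpler) against HKY85, 3 degrees of freedom.
lr2, reject2 = lrt(7573.81, 7568.56, critical_value=7.82)

print(f"Example 1: LR = {lr1:.2f}, simpler model rejected: {reject1}")
print(f"Example 2: LR = {lr2:.2f}, simpler model rejected: {reject2}")
```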
Akaike Information Criterion
- Kullback-Leibler Information (KLI):
  - "information lost when model M(0) is used to approximate model M(1)"
  - "distance from M(0) to M(1)"
- AIC(M) = -2 x Log(Likelihood(M)) + 2 x K(M)
  - K(M) is number of estimable parameters of model M
- AIC is an estimate of the expected relative distance (KLI) between a fitted model, M, and the unknown true mechanism that generated the data

Bayesian Information Criterion
- BIC takes into account also sample size n
- BIC(M) = -2 x Log(Likelihood(M)) + K(M) x Log(n)
  - K(M) is number of estimable parameters of model M and n is the number of characters

Other kinds of models
- Mixture models
- Codon usage models
- Covarion models
- Amino acid models
- Etc etc etc (more on the way...)

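The AIC and BIC formulas are easy to evaluate once the likelihoods are known. The sketch below reuses the -lnL values of HKY85 and GTR from the likelihood ratio test example. The parameter counts (4 and 8, ignoring branch lengths, which are shared by both models) and the number of characters n = 1000 are assumptions made only for this illustration, and Log is taken to be the natural logarithm.

```python
import math


def aic(neg_lnl, k):
    """AIC(M) = -2 lnL + 2 K, written in terms of the -lnL value."""
    return 2.0 * neg_lnl + 2.0 * k


def bic(neg_lnl, k, n):
    """BIC(M) = -2 lnL + K ln(n), where n is the number of characters."""
    return 2.0 * neg_lnl + k * math.log(n)


# -lnL values from the LRT example; K and n are hypothetical.
candidates = {
    "HKY85": {"neg_lnl": 1787.08, "k": 4},
    "GTR":   {"neg_lnl": 1784.82, "k": 8},
}
n_characters = 1000

for name, m in candidates.items():
    print(name,
          round(aic(m["neg_lnl"], m["k"]), 2),
          round(bic(m["neg_lnl"], m["k"], n_characters), 2))

# The model with the lowest score is preferred under each criterion.
best_aic = min(candidates, key=lambda name: aic(candidates[name]["neg_lnl"], candidates[name]["k"]))
best_bic = min(candidates, key=lambda name: bic(candidates[name]["neg_lnl"], candidates[name]["k"], n_characters))
print(best_aic, best_bic)
```

Unlike the likelihood ratio test, this comparison does not require the models to be nested.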
Mixture models
- Are in fact the same models as already described
- Data is partitioned according to properties and different models are applied to each partition
- Partitions are found using the model and some kind of likelihood function

Codon usage models
Two types of changes among codons:
- Synonymous: TTT (Phe) → TTC (Phe)
- Nonsynonymous: TTT (Phe) → TTA (Leu)

Codon models
Important feature of codon models
- dS: number of synonymous substitutions per synonymous site (KS)
- dN: number of nonsynonymous substitutions per nonsynonymous site (KA)
Important parameters:
- Transition/transversion rate ratio: κ
- Biased codon usage: πj for codon j
- Nonsynonymous/synonymous rate ratio: ω = dN/dS

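The distinction between synonymous and nonsynonymous changes can be checked mechanically against a codon table. The sketch below assumes the standard genetic code, written in the compact TCAG ordering; the function name is hypothetical, and the two test cases are the examples given for codon changes.

```python
# Standard genetic code in TCAG order: the codon at position 16*i + 4*j + k
# is BASES[i] + BASES[j] + BASES[k]; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_change(codon_from, codon_to):
    """Label a codon change as synonymous or nonsynonymous."""
    if CODON_TABLE[codon_from] == CODON_TABLE[codon_to]:
        return "synonymous"
    return "nonsynonymous"


print(classify_change("TTT", "TTC"))  # Phe to Phe
print(classify_change("TTT", "TTA"))  # Phe to Leu
```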
Covarion model
- Sites that are invariable in one part of the tree may become variable in another, and vice versa.
- To model this, need 8 states at internal nodes:
  - Aon, Con, Gon, Ton, Aoff, Coff, Goff, Toff
- but only 4 observable states at leaves:
  - A, C, G, T
- Allowing sites to switch between variable/invariable modes in divergent parts of tree is believed to increase biological realism, especially for highly divergent taxa.

Amino acids
- When the taxa you are interested in are not very closely related (diverged over 300 million years ago?)
- Amino acid data (protein sequences) are more reliable for homology statements and analysis

Amino acid models
- Amino acid models are based on step matrices (known as Dayhoff models)
  - PAMn matrices: the transition probabilities from one amino acid to another along a branch with length n
  - Other matrices used are BLOSUM and WAG
  - Empirically derived!

Mutation Data Matrix (250 PAMs): a matrix of the logarithms of the probabilities, multiplied by 10

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4

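As a small illustration of how such a matrix is used, the sketch below scores two aligned peptides by summing the matrix entries of each aligned pair. Only a handful of entries from the 250 PAM matrix are included, the peptides are invented for the example, and the function names are hypothetical.

```python
# A few entries copied from the 250 PAM matrix; the matrix is symmetric,
# so each pair is stored once.
PAM250_SUBSET = {
    ("W", "W"): 17,
    ("C", "C"): 12,
    ("F", "Y"): 7,
    ("L", "M"): 4,
    ("A", "A"): 2,
    ("A", "W"): -6,
}


def pam_score(x, y):
    """Look up the score of an amino acid pair in either order."""
    if (x, y) in PAM250_SUBSET:
        return PAM250_SUBSET[(x, y)]
    return PAM250_SUBSET[(y, x)]  # raises KeyError if the pair is not in the subset


def alignment_score(seq1, seq2):
    """Sum of pair scores over an ungapped alignment of equal-length peptides."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pam_score(x, y) for x, y in zip(seq1, seq2))


# Hypothetical peptides: W-W, F-Y, A-A and L-M are all positively scored pairs.
print(alignment_score("WFAL", "WYAM"))
```

Positive entries mark replacements seen more often than expected by chance, negative entries those seen less often.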
Models have parameters
- How to estimate values for those parameters?
  - Maximum likelihood methods
  - Bayesian methods