Download Modelling_evolution - the Department of Statistics

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Genome evolution wikipedia , lookup

Group selection wikipedia , lookup

Genomics wikipedia , lookup

Genetic drift wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Mutagen wikipedia , lookup

No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup

Oncogenomics wikipedia , lookup

Adaptive evolution in the human genome wikipedia , lookup

Genetic code wikipedia , lookup

Deoxyribozyme wikipedia , lookup

Viral phylodynamics wikipedia , lookup

Microsatellite wikipedia , lookup

Epistasis wikipedia , lookup

Helitron (biology) wikipedia , lookup

Computational phylogenetics wikipedia , lookup

Koinophilia wikipedia , lookup

Frameshift mutation wikipedia , lookup

Population genetics wikipedia , lookup

Mutation wikipedia , lookup

Point mutation wikipedia , lookup

Microevolution wikipedia , lookup

Transcript
Modelling evolution
Gil McVean
Department of Statistics
T
C
A
G
Outline
•
The dynamics of microevolution
•
Natural selection
•
Modelling changes in DNA
•
Estimation of substitution parameters
How do genes change
•
Occasionally, the DNA sequence changes through mutation (about 1.5 x 10-8
per base pair per generation in humans)
•
Most of these mutations will be lost from the population by chance or driven
out by selection. But some will also increase in frequency and ultimately
become fixed
1
Frequency
Advantageous
Deleterious
Neutral
0
Time
The rate of fixation
•
Consider a neutral locus – every generation the expected number of new
mutations appearing in the population is equal to the number of
chromosomes (2N) times the mutation rate (u)
•
It is also true that ultimately, every chromosome in the population will
ultimately be descended from one in the current population
•
Because the locus is neutral, every chromosome has equal chance of being
the ultimate ancestor, so the fixation probability is 1/2N
•
The rate of substitution is equal to the rate of appearance of new mutations
times their probability of fixation. For the neutral locus this is u – the neutral
mutation rate
•
It will take this mutation an average of 4N generations for mutations to fix,
but the variance in fixation time is large
Adding selection
•
Natural selection comes in many forms, the simplest model is to consider
two alleles at a single locus
Genotype
Fitness
aa
1
Aa
1+hs
AA
1+s
•
We can make use of diffusion theory to model the dynamics of
microevolution with selection
•
Most results are due to Wright and Kimura. Kimura derived the fixation
probability formula (when h is 0.5)
2s
u (s) 
 4 Ns
1 e
Consequences of Kimura’s formula
•
The key parameter for more weakly selected mutations is the product of the
population size and selection coefficient (Ns)
•
For strongly favourable mutations (Ns>>1), the fixation probability is
approximately twice the selective advantage
•
There is a window (-1<Ns<1) where mutations behave in a manner that is
nearly (but not quite) neutral.
10
1
-10
-5
0
0.1
5
10
4Ns
0.01
0.001
0.0001
Relative fixation
probability
In smaller populations, the
substitution rate of slightly
deleterious mutations can
increase.
Perhaps this explains why
human proteins evolve faster
than those in rodents?
Various complications
•
Simple models of molecular evolution are helpful, but ignore many important
biological features
–
–
–
–
–
Changing population sizes
Changing selection pressures
Geographical structure
Interactions between mutations (epistasis)
Interference between linked mutations
The time-scale of micro- and macro-evolution
•
The process of substitution is called micro-evolution
•
Ultimately, only fixed mutations leave traces in our genes. The
accumulation of fixed differences between species leads to macroevolution.
•
Viewed over scales of millions of years, the process of substitution is
effectively instantaneous
Modelling long-term evolution of DNA sequences
•
If we approximate the selection process as instantaneous, modelling DNA
sequence evolution gets much easier
•
In particular, we can treat the evolution of a DNA sequence through time as
a Markov process
•
Furthermore, if we assume the nucleotide (or codon) positions evolve
independently we can perform very efficient simulation and inference on the
time of divergence
Modelling evolution of a single nucleotide
•
At a single nucleotide we only have four states (T, C, A, G). We will ignore
indels.
•
Define the 12 rates of change from each nucleotide to every other one
•
T
C
A
G
We may wish to deal with changes over a fixed length of time – say one
million years
– The key is that we need a ‘rate’ for each possible change
 0


Q   CT

 AT

 GT
TC
0
 AC
GC
TA TG 

CA CG 
0  AG 

GA 0 
Using the model to simulate evolution
•
Draw the ancestor sequence from it’s stationary distribution
•
Calculate the total rate at which events occur to this sequence – the sum
over nucleotides, i, of the rate at which they are substituted by other
nucleotides, j.
L   ij
ij
•
If we can treat the substitution process as a Poisson process, the time to
the first substitution is exponentially distributed with rate L
•
The probability that a given nucleotide is chosen to be substituted by a
given other nucleotide is proportional to the rate of that substitution
– This allows you to choose the substitution event
•
Continue for as long as you want to simulate evolution for
An alternative approach
•
We have outlined a very general approach to simulating sequence evolution
•
For the very simple model we have discussed, we can actually make things
even more efficient – which helps for inference – by taking a continuous
time limit
•
Rewrite the transition probabilities as an instantaneous rate matrix
   TC

 CT  
Q

 AC
 AT

 GT GC
TA TG 

CA CG 
   AG 

GA   
The – for the diagonal is such
that the sum for each row is zero
Simulating and performing inference
•
The conditional distribution of the descendant nucleotide after a period of
time t is given by
P  exp( Qt )
•
So the probability of observing nucleotide N1 in species 1 and N2 in species
2 is given by the appropriate element in
V  P1  FP2
T
•
F is a diagonal matrix of the nucleotide frequencies in the ancestor of the
two species
•
If F and Q are known, it is also easy to estimate t if we assume
independence between sites
Making it simpler
•
Usually, various assumptions will be made to make inference easier
•
First, it is assumed that the matrix Q is reversible
– This means that watching the process forwards in time is equivalent to watching
it back in time
– Consequently, summing over ancestral states is equivalent to treating one of the
two sequences as ancestral
•
Second, it is assumed that the processes is at stationarity. This means that
F is given by any column of exp(Q x t) as t → ∞
•
Thirdly, further constraints are imposed on the Q matrix.
– Jukes-Cantor model: all lambda’s are equal
– Kimura 2-parameter model: allows different rates for transitions and
transversions
An example
•
Suppose we observe the following ‘aligned’ sequence
TGGCTGTGGACTAGTCAGCTGAGGGATATGCTAG
CGATAATGCACCGGTCAGCTGAGAAATATGCAGG
•
We will use the Kimura 2-parameter model to estimate the transitiontransversion ratio and the divergence time – we will assume that the rate of
transversion substitution is 1.5 x 10-9 per site per year
•
Under the model, all sites are independent, so we can tally up the changes
S1/
S2
T
C
A
G
T
5
2
2
0
C
1
4
0
0
A
0
0
4
2
G
0
1
4
8
More on inference
•
In calculating the likelihood we want to sum over the possible ancestral
states
•
In effect – we have a (very simple) hidden Markov model, where the state is
the ancestor and the ‘emission’ is the two daughter sequences
Ai
Si1, Si2
•
Because positions are independent, the likelihood is found by multiplying
marginal likelihoods across sites
– In effect it is a multinomial sample
•
We can use maximum likelihood to provide point estimates
Some estimates
•
For the K2P model, you actually only need to count up the number of
transition and transversion differences and the total sequence length
– These are sufficient statistics for the transition-transversion parameter and the
divergence

1
K   ln (1  2 P  Q) 1  2Q
2

P  0.273
Q  0.091
Kˆ  0.56
tˆ  23.3 MY
•
There are comparable analytical expressions for estimating divergence
times and model parameters for the simpler divergence models
•
Analytically tractability is lost as models get more complex (realistic)
Parameterisation using micro-evolutionary models
•
Most approaches to estimating divergence, etc. conflate the mutation
process with the substitution process
•
However, it is perfectly possible to separate out the two processes
– For example, estimates of the mutation process can be obtained from selectively
neutral genomic regions or synonymous substitutions
•
This allows you to estimate the selective constraint or advantage to
mutations
•
An area where this is applicable is in the analysis of codon usage bias,
where particular codons are favoured over others for translational efficiency
– McVean and Vieira (2001)
Making it more complex
•
It is easy to see how the model can be extended to more states. For
example, a widely used approach to analysing coding sequences is to deal
with the 64 x 64 matrix of states that are the codons (Goldman and Yang
1994)
•
The key point is always that parts of the sequence (nucleotides, codons) are
independent
•
This kind of approach cannot (at least in a straightforward way) deal with
context-dependent substitution rates or insertions and deletions
– For example, there is a greatly elevated rate of mutation at CpG sites in
vertebrates
Building up to trees
•
Analysing more sequences means thinking about the evolutionary
relationships between all of them
•
This can (often, but not always) be represented as a tree
•
As before, we can utilise HMM structures to make inference efficient
A1-5
Here, the HMM algorithm used to
calculate the likelihood is called
peeling
A1-3
A1,2
S1, S2
A4,5
S3
S4, S5