* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Modelling_evolution - the Department of Statistics
Genome evolution wikipedia , lookup
Group selection wikipedia , lookup
Genetic drift wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup
Oncogenomics wikipedia , lookup
Adaptive evolution in the human genome wikipedia , lookup
Genetic code wikipedia , lookup
Deoxyribozyme wikipedia , lookup
Viral phylodynamics wikipedia , lookup
Microsatellite wikipedia , lookup
Helitron (biology) wikipedia , lookup
Computational phylogenetics wikipedia , lookup
Koinophilia wikipedia , lookup
Frameshift mutation wikipedia , lookup
Population genetics wikipedia , lookup
Modelling evolution Gil McVean Department of Statistics T C A G Outline • The dynamics of microevolution • Natural selection • Modelling changes in DNA • Estimation of substitution parameters How do genes change • Occasionally, the DNA sequence changes through mutation (about 1.5 x 10-8 per base pair per generation in humans) • Most of these mutations will be lost from the population by chance or driven out by selection. But some will also increase in frequency and ultimately become fixed 1 Frequency Advantageous Deleterious Neutral 0 Time The rate of fixation • Consider a neutral locus – every generation the expected number of new mutations appearing in the population is equal to the number of chromosomes (2N) times the mutation rate (u) • It is also true that ultimately, every chromosome in the population will ultimately be descended from one in the current population • Because the locus is neutral, every chromosome has equal chance of being the ultimate ancestor, so the fixation probability is 1/2N • The rate of substitution is equal to the rate of appearance of new mutations times their probability of fixation. For the neutral locus this is u – the neutral mutation rate • It will take this mutation an average of 4N generations for mutations to fix, but the variance in fixation time is large Adding selection • Natural selection comes in many forms, the simplest model is to consider two alleles at a single locus Genotype Fitness aa 1 Aa 1+hs AA 1+s • We can make use of diffusion theory to model the dynamics of microevolution with selection • Most results are due to Wright and Kimura. Kimura derived the fixation probability formula (when h is 0.5) 2s u (s) 4 Ns 1 e Consequences of Kimura’s formula • The key parameter for more weakly selected mutations is the product of the population size and selection coefficient (Ns) • For strongly favourable mutations (Ns>>1), the fixation probability is approximately twice the selective advantage • There is a window (-1<Ns<1) where mutations behave in a manner that is nearly (but not quite) neutral. 10 1 -10 -5 0 0.1 5 10 4Ns 0.01 0.001 0.0001 Relative fixation probability In smaller populations, the substitution rate of slightly deleterious mutations can increase. Perhaps this explains why human proteins evolve faster than those in rodents? Various complications • Simple models of molecular evolution are helpful, but ignore many important biological features – – – – – Changing population sizes Changing selection pressures Geographical structure Interactions between mutations (epistasis) Interference between linked mutations The time-scale of micro- and macro-evolution • The process of substitution is called micro-evolution • Ultimately, only fixed mutations leave traces in our genes. The accumulation of fixed differences between species leads to macroevolution. • Viewed over scales of millions of years, the process of substitution is effectively instantaneous Modelling long-term evolution of DNA sequences • If we approximate the selection process as instantaneous, modelling DNA sequence evolution gets much easier • In particular, we can treat the evolution of a DNA sequence through time as a Markov process • Furthermore, if we assume the nucleotide (or codon) positions evolve independently we can perform very efficient simulation and inference on the time of divergence Modelling evolution of a single nucleotide • At a single nucleotide we only have four states (T, C, A, G). We will ignore indels. • Define the 12 rates of change from each nucleotide to every other one • T C A G We may wish to deal with changes over a fixed length of time – say one million years – The key is that we need a ‘rate’ for each possible change 0 Q CT AT GT TC 0 AC GC TA TG CA CG 0 AG GA 0 Using the model to simulate evolution • Draw the ancestor sequence from it’s stationary distribution • Calculate the total rate at which events occur to this sequence – the sum over nucleotides, i, of the rate at which they are substituted by other nucleotides, j. L ij ij • If we can treat the substitution process as a Poisson process, the time to the first substitution is exponentially distributed with rate L • The probability that a given nucleotide is chosen to be substituted by a given other nucleotide is proportional to the rate of that substitution – This allows you to choose the substitution event • Continue for as long as you want to simulate evolution for An alternative approach • We have outlined a very general approach to simulating sequence evolution • For the very simple model we have discussed, we can actually make things even more efficient – which helps for inference – by taking a continuous time limit • Rewrite the transition probabilities as an instantaneous rate matrix TC CT Q AC AT GT GC TA TG CA CG AG GA The – for the diagonal is such that the sum for each row is zero Simulating and performing inference • The conditional distribution of the descendant nucleotide after a period of time t is given by P exp( Qt ) • So the probability of observing nucleotide N1 in species 1 and N2 in species 2 is given by the appropriate element in V P1 FP2 T • F is a diagonal matrix of the nucleotide frequencies in the ancestor of the two species • If F and Q are known, it is also easy to estimate t if we assume independence between sites Making it simpler • Usually, various assumptions will be made to make inference easier • First, it is assumed that the matrix Q is reversible – This means that watching the process forwards in time is equivalent to watching it back in time – Consequently, summing over ancestral states is equivalent to treating one of the two sequences as ancestral • Second, it is assumed that the processes is at stationarity. This means that F is given by any column of exp(Q x t) as t → ∞ • Thirdly, further constraints are imposed on the Q matrix. – Jukes-Cantor model: all lambda’s are equal – Kimura 2-parameter model: allows different rates for transitions and transversions An example • Suppose we observe the following ‘aligned’ sequence TGGCTGTGGACTAGTCAGCTGAGGGATATGCTAG CGATAATGCACCGGTCAGCTGAGAAATATGCAGG • We will use the Kimura 2-parameter model to estimate the transitiontransversion ratio and the divergence time – we will assume that the rate of transversion substitution is 1.5 x 10-9 per site per year • Under the model, all sites are independent, so we can tally up the changes S1/ S2 T C A G T 5 2 2 0 C 1 4 0 0 A 0 0 4 2 G 0 1 4 8 More on inference • In calculating the likelihood we want to sum over the possible ancestral states • In effect – we have a (very simple) hidden Markov model, where the state is the ancestor and the ‘emission’ is the two daughter sequences Ai Si1, Si2 • Because positions are independent, the likelihood is found by multiplying marginal likelihoods across sites – In effect it is a multinomial sample • We can use maximum likelihood to provide point estimates Some estimates • For the K2P model, you actually only need to count up the number of transition and transversion differences and the total sequence length – These are sufficient statistics for the transition-transversion parameter and the divergence 1 K ln (1 2 P Q) 1 2Q 2 P 0.273 Q 0.091 Kˆ 0.56 tˆ 23.3 MY • There are comparable analytical expressions for estimating divergence times and model parameters for the simpler divergence models • Analytically tractability is lost as models get more complex (realistic) Parameterisation using micro-evolutionary models • Most approaches to estimating divergence, etc. conflate the mutation process with the substitution process • However, it is perfectly possible to separate out the two processes – For example, estimates of the mutation process can be obtained from selectively neutral genomic regions or synonymous substitutions • This allows you to estimate the selective constraint or advantage to mutations • An area where this is applicable is in the analysis of codon usage bias, where particular codons are favoured over others for translational efficiency – McVean and Vieira (2001) Making it more complex • It is easy to see how the model can be extended to more states. For example, a widely used approach to analysing coding sequences is to deal with the 64 x 64 matrix of states that are the codons (Goldman and Yang 1994) • The key point is always that parts of the sequence (nucleotides, codons) are independent • This kind of approach cannot (at least in a straightforward way) deal with context-dependent substitution rates or insertions and deletions – For example, there is a greatly elevated rate of mutation at CpG sites in vertebrates Building up to trees • Analysing more sequences means thinking about the evolutionary relationships between all of them • This can (often, but not always) be represented as a tree • As before, we can utilise HMM structures to make inference efficient A1-5 Here, the HMM algorithm used to calculate the likelihood is called peeling A1-3 A1,2 S1, S2 A4,5 S3 S4, S5