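Introduction to model based methods

Niklas Wahlberg

Some useful links

- www.zoologi.su.se/research/wahlberg/Phylocourse/phylocourse.htm
- www.helsinki.fi/~jhyvonen/ms06

DNA evolves through mutation

- With 100 billion bases in GenBank, we are beginning to understand how DNA sequences evolve
- Different genes have their own mutation dynamics
- Mitochondrial and nuclear genes differ in mutation dynamics

Hidden evolution in DNA sequences

[Figure: Seq 1 (AGCGAG) aligned with Seq 2 (GCGGAC), annotated with the number of changes at individual sites.]

Modeling evolution

- Models incorporate information about the rates at which each nucleotide is replaced by each alternative nucleotide
- For DNA this can be expressed as a 4 x 4 rate matrix (known as the Q matrix)
- Other model parameters may include site by site rate variation, often modelled as a statistical distribution, for example a gamma distribution

Parameters we are interested in

- The mean instantaneous substitution rate (= the general mutation rate + rate of fixation in population)
- The relative rates of substitution between each base pair
- The average frequencies of each base in the dataset
- Branch lengths

A general model of sequence evolution

[Figure: the four bases (purines and pyrimidines) with frequencies πA, πC, πG and πT, connected by arrows labelled with the relative rates a to l.]

A general model of molecular evolution

Rows and columns are in the order A, C, G, T.

$$
Q = \begin{pmatrix}
-\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
\mu g\pi_A & -\mu(g\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
\mu h\pi_A & \mu j\pi_C & -\mu(h\pi_A + j\pi_C + f\pi_T) & \mu f\pi_T \\
\mu i\pi_A & \mu k\pi_C & \mu l\pi_G & -\mu(i\pi_A + k\pi_C + l\pi_G)
\end{pmatrix}
$$

- µ = mean instantaneous substitution rate
- a, b, c, ..., l = relative rates of substitution (the product of µ and a relative rate is the rate parameter)
- πA = frequency of A

The Jukes and Cantor model is the simplest model

$$
Q = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}
$$

The JC model is a one parameter model:
1) it assumes that all bases are equally frequent (p = 0.25)
2) unless modified it assumes all sites can change and that they do so at the same rate

Jukes-Cantor model

[Figure: the bases A, C, G and T, with every pair connected at rate α.]

- α = the rate of substitution (α changes from A to G every t)
- The rate of substitution for each nucleotide is 3α
- In t steps there will be 3αt changes

Kimura model

[Figure: the bases A, C, G and T, with transitions at rate α and transversions at rate β.]

- α = transitions
- β = transversions

The Kimura model has 2 parameters

$$
Q = \begin{pmatrix}
-(\alpha + 2\beta) & \beta & \alpha & \beta \\
\beta & -(\alpha + 2\beta) & \beta & \alpha \\
\alpha & \beta & -(\alpha + 2\beta) & \beta \\
\beta & \alpha & \beta & -(\alpha + 2\beta)
\end{pmatrix}
$$

The K2P model is more realistic, but still:
1) it assumes that all bases are equally frequent (p = 0.25)
2) unless modified it assumes all sites can change and that they do so at the same rate

The Hasegawa-Kishino-Yano model

$$
Q = \begin{pmatrix}
-(\pi_C\beta + \pi_G\alpha + \pi_T\beta) & \pi_C\beta & \pi_G\alpha & \pi_T\beta \\
\pi_A\beta & -(\pi_A\beta + \pi_G\beta + \pi_T\alpha) & \pi_G\beta & \pi_T\alpha \\
\pi_A\alpha & \pi_C\beta & -(\pi_A\alpha + \pi_C\beta + \pi_T\beta) & \pi_T\beta \\
\pi_A\beta & \pi_C\alpha & \pi_G\beta & -(\pi_A\beta + \pi_C\alpha + \pi_G\beta)
\end{pmatrix}
$$

The most general time-reversible model: the GTR model

[Figure: the four bases with frequencies πA, πC, πG and πT, connected by the six relative rates a to f.]

$$
Q = \begin{pmatrix}
-\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\
\mu a\pi_A & -\mu(a\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\
\mu b\pi_A & \mu d\pi_C & -\mu(b\pi_A + d\pi_C + f\pi_T) & \mu f\pi_T \\
\mu c\pi_A & \mu e\pi_C & \mu f\pi_G & -\mu(c\pi_A + e\pi_C + f\pi_G)
\end{pmatrix}
$$

- µ = mean instantaneous substitution rate
- a, b, c, ..., f = relative rates of substitution (the product of µ and a relative rate is the rate parameter)
- πA = frequency of A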
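As a minimal sketch (not part of the original slides), the GTR rate matrix can be assembled in plain Python from the six relative rates, the base frequencies and µ. The names `gtr_q_matrix` and `is_time_reversible`, and all numeric values, are illustrative assumptions. JC, K2P and HKY are then obtained as special cases of the same template.

```python
from itertools import combinations

BASES = "ACGT"
# Unordered base pairs in the order of the relative rates a..f:
# a = A-C, b = A-G, c = A-T, d = C-G, e = C-T, f = G-T
PAIRS = list(combinations(BASES, 2))


def gtr_q_matrix(rel_rates, freqs, mu=1.0):
    """Build Q with Q[i][j] = mu * rate(i, j) * freqs[j] for i != j.

    Each diagonal entry is minus the sum of the other entries in its row.
    """
    rate = dict(zip(PAIRS, rel_rates))
    q = {i: {} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                r = rate[(i, j)] if (i, j) in rate else rate[(j, i)]
                q[i][j] = mu * r * freqs[j]
        q[i][i] = -sum(q[i].values())
    return q


def is_time_reversible(q, freqs, tol=1e-12):
    """Check detailed balance: pi_i * q_ij == pi_j * q_ji for every pair."""
    return all(abs(freqs[i] * q[i][j] - freqs[j] * q[j][i]) < tol for i, j in PAIRS)


equal = {b: 0.25 for b in BASES}
unequal = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}  # hypothetical frequencies

alpha, beta = 2.0, 1.0  # hypothetical transition and transversion rates
ts_tv = [beta, alpha, beta, beta, alpha, beta]  # A-G and C-T are transitions

jc = gtr_q_matrix([1.0] * 6, equal, mu=4 * 0.1)  # every off-diagonal rate is 0.1
k2p = gtr_q_matrix(ts_tv, equal, mu=4.0)         # off-diagonals are alpha or beta
hky = gtr_q_matrix(ts_tv, unequal)               # off-diagonals are pi_j * (alpha or beta)

for name, q in [("JC", jc), ("K2P", k2p), ("HKY", hky)]:
    print(name, [round(q["A"][j], 3) for j in BASES])
```

With equal frequencies of 0.25, setting µ = 4 makes each off-diagonal entry equal to the relative rate itself, which is why the JC and K2P calls scale µ that way.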
The HKY model takes into account variable base frequencies, but unless modified it still assumes all sites can change and that they do so at the same rate

The most commonly used models

Almost all models used are special cases of one model: the general time reversible model

[Figure: the model hierarchy, summarised here as a table.]

Model        Base frequencies   Substitution types
GTR          variable           6
SYM          equal              6
TrN          variable           3
K3ST         equal              3
HKY85, F84   variable           2
K2P          equal              2
F81          variable           1 (single)
JC           equal              1 (single)

Models

Model parameters can be:
- estimated from the data (using a likelihood function)
- pre-set based upon assumptions about the data (for example that for all sequences all sites change at the same rate and all substitutions are equally likely, e.g. the Jukes and Cantor model)

Wherever possible avoid assumptions which are violated by the data, because they can lead to incorrect trees

Models can be made more parameter rich to increase their realism

The most common additional parameters are:
- A correction for the proportion of sites which are invariable (parameter I)
- A correction for variable site rates at those sites which can change (parameter gamma, G)

All models can be supplemented with these parameters (e.g. GTR+I+G, HKY+I+G)

A gamma distribution can be used to model site rate heterogeneity

[Figure: gamma distribution of rates across sites; x-axis: Rate, y-axis: Frequency.]

Gamma distribution computationally costly

- Computational difficulties in using continuous distribution
- Most programs use discrete categories

Difficulties in estimating parameters

- The parameters I and G covary!
- (I + G) can be estimated, but the values of I and G are not easily teased apart
- Parameter G takes I into account, so I is not needed

Estimation of ML substitution model parameters

Yang (1995) has shown that parameter estimates are reasonably stable across tree topologies provided trees are not "too wrong". Thus one can obtain a tree using a quick method (useful when many sequences are being analysed) and then estimate parameters on that tree. These parameters can then be used in a search for the most likely tree(s) (given the model).

Models can be made more parameter rich to increase their realism

- But the more parameters you estimate from the data, the more time is needed for an analysis and the more sampling error accumulates
- One might have a realistic model but large sampling errors
- Realism comes at a cost in time and precision!
- Fewer parameters may give an inaccurate estimate, but more parameters decrease the precision of the estimate
- In general use the simplest model which fits the data

Choosing your model

- When models are nested: Likelihood ratio test (LRT)
- When models are not nested: Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC)

[Figure: the model hierarchy from GTR to JC, repeated.]
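To make "simplest model" concrete, the sketch below counts free substitution-model parameters from the two labels used in the model hierarchy (number of substitution types, equal or variable base frequencies). It is an illustration rather than part of the slides. It assumes the usual convention that one relative rate is fixed and that variable frequencies add three free parameters; branch lengths are left out, and `free_parameters` is a hypothetical helper name.

```python
# (number of substitution types, variable base frequencies?) per model
MODELS = {
    "JC": (1, False),
    "F81": (1, True),
    "K2P": (2, False),
    "HKY85": (2, True),
    "F84": (2, True),
    "K3ST": (3, False),
    "TrN": (3, True),
    "SYM": (6, False),
    "GTR": (6, True),
}


def free_parameters(spec):
    """Count free parameters for a model string such as 'HKY85' or 'GTR+I+G'.

    Assumptions: rates are relative, so n substitution types give n - 1 free
    rates; four frequencies summing to one give 3 free values; +I and +G add
    one parameter each. Branch lengths are not counted.
    """
    name, *extras = spec.split("+")
    n_types, variable_freqs = MODELS[name]
    count = (n_types - 1) + (3 if variable_freqs else 0)
    for extra in extras:
        if extra not in ("I", "G"):
            raise ValueError(f"unknown extra parameter: {extra}")
        count += 1
    return count


for spec in ["JC", "K2P", "HKY85", "GTR", "GTR+I+G"]:
    print(spec, free_parameters(spec))
```

The difference between two nested models, for example `free_parameters("GTR") - free_parameters("HKY85")`, is the number of extra parameters in the more complex model.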
Need to know the likelihood of a model

- For both tests, one needs to compute the likelihood of the model (tomorrow)
- For now, assume we know the likelihood of the models we want to compare

Likelihood ratio test (LRT)

$$
LR = 2(\ln L_1 - \ln L_2)
$$

where L1 is the likelihood of the more complex model and L2 that of the simpler one.

- The LRT statistic approximately follows a chi-square distribution
- Degrees of freedom equal the number of extra parameters in the more complex model

Example 1

- HKY85: -lnL = 1787.08
- GTR: -lnL = 1784.82
- Then LR = 2 (1787.08 - 1784.82) = 4.52
- degrees of freedom = 4 (GTR adds 4 additional parameters to HKY85)
- critical value (P = 0.05) = 9.49
- GTR does not fit significantly better!

Example 2: testing a molecular clock

- HKY85 + clock: -lnL = 7573.81
- HKY85: -lnL = 7568.56
- Then LR = 2 (7573.81 - 7568.56) = 10.50
- degrees of freedom = s - 2 = 5 - 2 = 3
- critical value (P = 0.05) = 7.82
- Degrees of freedom in the molecular clock case is the number of taxa (s) minus 2
- The clock model is simpler (it allows only a single rate)
- Since 10.50 exceeds 7.82, the clock model is rejected

Akaike Information Criterion

- Kullback-Leibler Information (KLI): "information lost when model M(0) is used to approximate model M(1)", or "distance from M(0) to M(1)"

$$
\mathrm{AIC}(M) = -2\ln L(M) + 2K(M)
$$

- K(M) is the number of estimable parameters of model M
- AIC is an estimate of the expected relative distance (KLI) between a fitted model, M, and the unknown true mechanism that generated the data

Other kinds of models

- Mixture models
- Codon usage models
- Covarion models
- Amino acid models
- Etc. (more on the way...)
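The likelihood ratio test and AIC calculations above can be reproduced with a short standard-library sketch. The function names are illustrative, and the chi-square tail probability is computed from the closed-form series for integer degrees of freedom rather than looked up in a table. For AIC, K counts only substitution-model parameters (4 for HKY85, 8 for GTR); this is an assumption that ignores branch lengths, which are common to both models and cancel in the comparison.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term = math.exp(-half)
        total = term
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return total
    # Odd df: start from df = 1 (erfc) and add one term per two extra df.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)
    return total


def likelihood_ratio_test(neg_lnl_simple, neg_lnl_complex, df):
    """Return (LR, p-value) given -lnL of the simpler and more complex model."""
    lr = 2.0 * (neg_lnl_simple - neg_lnl_complex)
    return lr, chi2_sf(lr, df)


def aic(neg_lnl, k):
    """AIC(M) = -2 lnL(M) + 2 K(M), written in terms of -lnL."""
    return 2.0 * neg_lnl + 2.0 * k


# Example 1: HKY85 (simpler) against GTR (4 extra parameters)
lr1, p1 = likelihood_ratio_test(1787.08, 1784.82, df=4)
# Example 2: HKY85 + clock (simpler) against HKY85, df = s - 2 = 3
lr2, p2 = likelihood_ratio_test(7573.81, 7568.56, df=3)

aic_hky = aic(1787.08, k=4)  # assumed K: kappa + 3 free base frequencies
aic_gtr = aic(1784.82, k=8)  # assumed K: 5 relative rates + 3 free base frequencies

print(f"Example 1: LR = {lr1:.2f}, p = {p1:.3f}")
print(f"Example 2: LR = {lr2:.2f}, p = {p2:.3f}")
print(f"AIC HKY85 = {aic_hky:.2f}, AIC GTR = {aic_gtr:.2f}")
```

A p-value above 0.05 corresponds to an LR below the tabulated critical value, so the two examples lead to the same conclusions as the comparison against 9.49 and 7.82.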
Bayesian Information Criterion

- BIC also takes the sample size n into account

$$
\mathrm{BIC}(M) = -2\ln L(M) + K(M)\ln n
$$

- K(M) is the number of estimable parameters of model M and n is the number of characters

Mixture models

- Are in fact the same models as already described
- Data is partitioned according to properties and different models are applied to each partition
- Partitions are found using the model and some kind of likelihood function

Codon usage models

Two types of changes among codons:
- Synonymous: TTT (Phe) → TTC (Phe)
- Nonsynonymous: TTT (Phe) → TTA (Leu)

Codon models

Important feature of codon models:
- dS: number of synonymous substitutions per synonymous site (KS)
- dN: number of nonsynonymous substitutions per nonsynonymous site (KA)

Important parameters:
- Transition/transversion rate ratio: κ
- Biased codon usage: πj for codon j
- Nonsynonymous/synonymous rate ratio: ω = dN/dS

Covarion model

- Sites that are invariable in one part of the tree may become variable in another, and vice versa
- To model this, 8 states are needed at internal nodes: Aon, Con, Gon, Ton, Aoff, Coff, Goff, Toff
- But there are only 4 observable states at leaves: A, C, G, T
- Allowing sites to switch between variable and invariable modes in divergent parts of the tree is believed to increase biological realism, especially for highly divergent taxa

Amino acids

- When the taxa you are interested in are not very closely related (diverged over 300 million years ago?), amino acid data (protein sequences) are more reliable for homology statements and analysis

Amino acid models

- Amino acid models are based on step matrices (known as Dayhoff models)
- PAMn matrices: the transition probabilities from one amino acid to another along a branch with length n
- Other matrices used are BLOSUM and WAG
- Empirically derived!

Mutation Data Matrix (250 PAMs)

A matrix of the logarithms of the probabilities, multiplied by 10 (lower triangle shown)

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4

Models have parameters

- How to estimate values for those parameters?
- Maximum likelihood methods
- Bayesian methods
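As a small illustration of maximum likelihood estimation (a sketch under stated assumptions, not the method of any particular program), the code below estimates the single parameter of a pairwise Jukes-Cantor comparison: the distance d in expected substitutions per site. It uses the two six-base sequences from the "Hidden evolution" slide, maximises the log-likelihood with a simple ternary search, and compares the result with the closed-form JC distance. All function names are hypothetical.

```python
import math


def count_sites(seq1, seq2):
    """Return (number of identical sites, number of differing sites)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned and of equal length")
    n_diff = sum(a != b for a, b in zip(seq1, seq2))
    return len(seq1) - n_diff, n_diff


def jc_log_likelihood(d, n_same, n_diff):
    """Log-likelihood of distance d under JC with equal base frequencies."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)  # a given base, unchanged at the other end
    p_diff = 0.25 * (0.25 - 0.25 * e)  # a given base, replaced by a specific other base
    return n_same * math.log(p_same) + n_diff * math.log(p_diff)


def ml_distance(n_same, n_diff, lo=1e-6, hi=10.0, iterations=200):
    """Maximise the log-likelihood over d by ternary search (it has one peak)."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if jc_log_likelihood(m1, n_same, n_diff) < jc_log_likelihood(m2, n_same, n_diff):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


def jc_distance_closed_form(n_same, n_diff):
    """Closed-form JC distance: d = -3/4 * ln(1 - 4p/3), p = proportion differing."""
    p = n_diff / (n_same + n_diff)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


n_same, n_diff = count_sites("AGCGAG", "GCGGAC")
d_ml = ml_distance(n_same, n_diff)
d_closed = jc_distance_closed_form(n_same, n_diff)
print(f"differences: {n_diff} of {n_same + n_diff}")
print(f"ML distance: {d_ml:.4f}, closed form: {d_closed:.4f}")
```

Four of the six sites differ, yet the estimated distance is well above 4/6 substitutions per site: the model accounts for changes that are hidden by multiple hits at the same site. With only six sites the estimate is of course very imprecise.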