14.4. Tue Introduction to models (Jarno)
16.4. Thu Distance-based methods (Jarno)
17.4. Fri ML analyses (Jarno)
20.4. Mon 21.4. Tue 23.4. Thu 24.4. Fri
Assessing hypotheses (Jarno)
Problems with molecular data (Jarno)
Problems with molecular data (Jarno)
Phylogenomics
Search algorithms, visualization, and other computational aspects (Jarno)

Maximum Likelihood

Maximum likelihood methods of phylogenetic inference evaluate a hypothesis about evolutionary history (the branching order and branch lengths of a tree) in terms of the probability that a proposed model of the evolutionary process and the hypothesised history (tree) would give rise to the data we observe.

The likelihood is the probability, P, of the data (D), given the hypothesis (H):
◦ L = P (D | H)
◦ D: observed data (aligned sequences)
◦ H: tree topology, branch lengths and model of evolution

In statistical usage, a distinction is made depending on the roles of the outcome or parameter. Probability is used when describing a function of the outcome given a fixed parameter value. For example, if a coin is flipped 10 times and it is a fair coin, what is the probability of it landing heads-up every time? Likelihood is used when describing a function of a parameter given an outcome. For example, if a coin is flipped 10 times and it has landed heads-up 10 times, what is the likelihood that the coin is fair?
[Wikipedia, article on likelihood]

Maximum likelihood is an optimality criterion (as is parsimony)
◦ Given a model and data we can evaluate a tree
◦ We can choose between trees based on the likelihood of a given tree
◦ The tree(s) with the highest likelihood is the best

[Figure: nested substitution models, from the most general (GTR) to the simplest (JC)]

    Model   Base frequencies   Substitution types
    GTR     variable           6
    TrN     variable           3
    SYM     equal              6
    HKY85   variable           2
    F84     variable           2
    K3ST    equal              3
    K2P     equal              2
    F81     variable           1
    JC      equal              1

◦ Maximum likelihood estimates parameter values of an explicit model from observed data
◦ Likelihood provides ways of evaluating models in terms of their log likelihoods
◦ Different trees can also be evaluated for their fit to the data under a particular model (likelihood ratio tests of two trees after Kishino & Hasegawa)

Let's toss a coin ten times (n). It lands heads up 4 times (x) and tails up 6 times. What is the probability of a head in a single toss?
◦ Compare: What is the likelihood of the data given the process?
Naturally phat = x / n = 4 / 10 = 0.4. This is also the maximum likelihood estimator of p. Let's see why...

A coin toss is a binomial process:
◦ $\Pr(X = x \mid n, p) = \binom{n}{x} p^{x} (1 - p)^{n - x}$ (summed over x = 0, ..., n these probabilities equal 1)
The likelihood function then becomes:
◦ $L(p \mid x, n) = \binom{n}{x} p^{x} (1 - p)^{n - x}$
Note: in the binomial formula X is the unknown, whereas in the likelihood function p is the unknown (because we have the data, the coin tosses).

The likelihood function can be solved analytically or using "brute force". For example, the result for p = 0.4 is:
◦ L = 210 * 0.4^4 * 0.6^6 = 0.2508227
◦ logL = log(L) = -1.383009
◦ -logL = 1.383009
Analytically, the maximum of the function is the point where the derivative of the likelihood function is zero and the second derivative is negative. Graphically, it is the peak of the likelihood curve plotted against p.
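The analytic and brute-force routes can be checked against each other numerically. Setting the derivative of logL with respect to p, x/p - (n - x)/(1 - p), to zero gives p = x/n. The sketch below is an illustration rather than part of the original slides: it evaluates the binomial log-likelihood with R's dbinom() and lets optimize() search for the maximum. The helper name loglik and the search interval are assumptions made for the example.

```r
# Coin data from the example: x heads in n tosses
x <- 4
n <- 10

# Binomial log-likelihood of p given the data (hypothetical helper name)
loglik <- function(p) dbinom(x, size = n, prob = p, log = TRUE)

# Numerical search for the maximum on an interval just inside (0, 1)
fit <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)

p_hat    <- fit$maximum    # numerical maximum likelihood estimate of p
logL_hat <- fit$objective  # log-likelihood at that estimate

# Analytic estimate for comparison
p_analytic <- x / n
```

If the sketch behaves as intended, p_hat and p_analytic agree to roughly the tolerance of optimize(), and logL_hat matches the log-likelihood computed by hand for p = 0.4.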
[Figure: likelihood plotted against p (second panel: against μ1), with the maximum likelihood estimator of p at the peak; a sharply peaked curve gives a precise estimate, a flat curve an imprecise estimate]

    # Likelihood of p for x heads in n tosses, evaluated on a grid of p values
    l <- function(x, n) {
      p <- seq(0, 1, 0.01)
      L <- rep(NA, length(p))
      for (i in 1:length(p)) {
        L[i] <- p[i]^x * (1 - p[i])^(n - x) *
          (factorial(n) / (factorial(x) * factorial(n - x)))
      }
      d <- data.frame(p = p, L = L, logL = log(L))
      return(d)
    }
    # Plot the log-likelihood curve
    plot(l(4, 10)[, c(1, 3)], ylim = c(-30, 0), type = "l")

    # The same log-likelihood curve using the built-in binomial density
    l2 <- function(x, n) {
      p <- seq(0, 1, 0.01)
      L <- rep(NA, length(p))
      for (i in 1:length(p)) {
        L[i] <- dbinom(x, size = n, prob = p[i], log = TRUE)
      }
      d <- data.frame(p = p, L = L)
      return(d)
    }
    plot(l2(4, 10), type = "l")

Why log likelihood?
L(0.99 | x = 4, n = 10) = 0.0000000002017251
logL(0.99 | x = 4, n = 10) = -22.324115
◦ When you multiply very small values together, the result is even smaller, and at some point the precision disappears (a restriction of computers)
◦ The same does not happen with log values:
L = 210 * 0.4^4 * 0.6^6 = 0.2508227
logL = log(210) + 4*log(0.4) + 6*log(0.6) = -1.383009

DNA sequences can be thought of as four-sided dice. Thus, the previous coin example can be straightforwardly generalized to DNA sequences.

Maximum likelihood tree reconstruction

    1  CGAGAC
    2  AGCGAC
    3  AGATTA
    4  GGATAG

[Figure: unrooted Tree A connecting sequences 1 to 4]

What is the probability that unrooted Tree A (rather than another tree) could have generated the data shown under our chosen model?

[Figure: Tree A with the observed states at site j (here site 6: C, C, A, G) at the tips and two unknown internal nodes, each of which may be A, C, G or T; annotated "Stationarity!"]

4 x 4 possibilities

The likelihood for a particular site j is the sum of the probabilities of every possible reconstruction of ancestral states under a chosen model.

Rate matrix used in the figure (every change has the same rate α):

        A  C  G  T
    A   -  α  α  α
    C   α  -  α  α
    G   α  α  -  α
    T   α  α  α  -

The same calculation is repeated for every site, e.g. site 5 with the states A, A, T, A.
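The sum over ancestral reconstructions can be written out directly for a four-tip tree. The sketch below is an illustration under stated assumptions, not the slides' own calculation: it assumes the topology ((1,2),(3,4)) with internal nodes u and v, a Jukes-Cantor model with equal base frequencies of 0.25, and the same branch length d = 0.1 (expected substitutions per site) on all five branches. The function names jc_matrix and site_likelihood are hypothetical.

```r
bases <- c("A", "C", "G", "T")

# Jukes-Cantor transition probabilities for a branch of length d
# (d = expected substitutions per site; an assumed parameterisation)
jc_matrix <- function(d) {
  p_same <- 0.25 + 0.75 * exp(-4 * d / 3)
  p_diff <- 0.25 - 0.25 * exp(-4 * d / 3)
  P <- matrix(p_diff, nrow = 4, ncol = 4, dimnames = list(bases, bases))
  diag(P) <- p_same
  P
}

# Likelihood of one site on the assumed tree ((1,2),(3,4)):
# u is the ancestor of tips 1 and 2, v the ancestor of tips 3 and 4
site_likelihood <- function(tips, d = 0.1) {
  P <- jc_matrix(d)
  total <- 0
  for (u in bases) {
    for (v in bases) {
      # One of the 4 x 4 = 16 ancestral reconstructions
      total <- total +
        0.25 *                            # stationary frequency of state u
        P[u, tips[1]] * P[u, tips[2]] *   # branches from u to tips 1 and 2
        P[u, v] *                         # internal branch from u to v
        P[v, tips[3]] * P[v, tips[4]]     # branches from v to tips 3 and 4
    }
  }
  total
}

L_site6 <- site_likelihood(c("C", "C", "A", "G"))
L_site5 <- site_likelihood(c("A", "A", "T", "A"))
```

Under these assumptions the site with states A, A, T, A should receive a higher likelihood than the site with C, C, A, G, because it can be explained with fewer changes on short branches.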
[Figure: rooted tree with tips A, C, C, C, G, internal nodes x, y, z, w and branches t1 to t8; ti are branch lengths (rate x time)]

P(A,C,C,C,G,x,y,z,w|T) =
    Prob(x) Prob(y|x,t6) Prob(A|y,t1) Prob(C|y,t2)
    Prob(z|x,t8) Prob(C|z,t3)
    Prob(w|z,t7) Prob(C|w,t4) Prob(G|w,t5)

Assume a Jukes-Cantor model (all nucleotide frequencies are equal). Further assume that the branch length is 0.1. Then we can generate a so-called P-matrix from the Jukes-Cantor model's Q-matrix. Its entries are the probabilities of a nucleotide changing to some other nucleotide.
[P matrix shown on the slide; the example that follows uses 0.9815 for no change and 0.0062 for a change to one specific other nucleotide]

A: acct
B: gcct
One site differs (a versus g) and three sites are identical; 0.25 is the frequency of the starting nucleotide.
L = (0.25 * 0.0062)^1 * (0.25 * 0.9815)^3 = 2.289932e-05
logL = log10(L) = -4.64 (natural log: -10.68)

For other branch lengths, the P matrix can be multiplied by itself k times. This gives the P matrix for a branch k times as long. A branch length can be optimized by maximizing the likelihood with respect to that branch length.

Depending on the software, each iteration (in the tree optimization algorithm) has to do the following for a certain tree topology:
◦ Calculate the likelihood of the tree topology given the model and the observed data
◦ Estimate the optimal branch lengths

Maximum likelihood tree reconstruction
◦ The likelihood of Tree A is the product of the likelihoods at each site
◦ The likelihood is usually evaluated by summing the log of the likelihoods at each site (because the product of the site probabilities is so small) and reported as the log likelihood of the full tree
◦ The maximum likelihood tree is the one with the highest likelihood (might not be Tree A, i.e. it could be another tree topology)
◦ Note: highest likelihood (largest value) = the largest logL (closest to zero) = the smallest -logL (closest to zero)

Typical assumptions of ML substitution models
◦ The probability of any change is independent of the prior history of the site (a Markov model)
◦ Substitution probabilities do not change with time or over the tree (a homogeneous Markov process)
◦ Change is time reversible, e.g. with equal base frequencies the rate of change of A to T is the same as T to A

A model is always a simplification of what happens in nature
◦ Assumes evolution works parsimoniously
A given model will give more weight to certain changes over others
ML: an objective criterion for choosing one weighting scheme over another?

Based largely on slides by Paul Lewis (www.eeb.uconn.edu)

D will stand for Data
H will mean any one of a number of things:
◦ a discrete hypothesis
◦ a distinct model (e.g. JC, HKY, GTR, etc.)
◦ a tree topology
◦ one of an infinite number of continuous model parameter values (e.g. ts:tv rate ratio)

In ML, we choose the hypothesis that gives the highest (maximized) likelihood to the data. The likelihood is the probability of the data given the hypothesis, L = P (D | H).
A Bayesian analysis expresses its results as the probability of the hypothesis given the data.
◦ this may be a more desirable way to express the result

The posterior probability, P (H | D), is the probability of the hypothesis given the observations, or data (D). The main feature of Bayesian statistics is that it takes into account prior knowledge of the hypothesis:

$P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}$

◦ P (H | D): posterior probability of hypothesis H
◦ P (D | H): likelihood of the hypothesis
◦ P (H): prior probability of the hypothesis
◦ P (D): probability of the data (a normalizing constant)

Both ML and Bayesian methods use the likelihood function
◦ In ML, free parameters are optimized, maximizing the likelihood
◦ In a Bayesian approach, free parameters are described by probability distributions, which are sampled.
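The relation between likelihood, prior and posterior can be made concrete on a grid of parameter values. The sketch below is an illustration, not taken from the slides: it reuses the earlier coin data (4 heads in 10 tosses), treats the probability of heads as the hypothesis, and compares a flat Beta(1, 1) prior with an informative Beta(20, 20) prior. Both priors, the grid spacing and the helper name posterior_on_grid are assumptions chosen for the example.

```r
x <- 4
n <- 10
p <- seq(0.001, 0.999, by = 0.001)   # grid of candidate values for the hypothesis

# Likelihood of each candidate value given the data
likelihood <- dbinom(x, size = n, prob = p)

# Posterior on the grid: likelihood times prior, divided by the normalizing sum
posterior_on_grid <- function(prior) {
  unnormalised <- likelihood * prior
  unnormalised / sum(unnormalised)
}

flat_prior        <- dbeta(p, 1, 1)     # assumed uninformative prior
informative_prior <- dbeta(p, 20, 20)   # assumed prior concentrated near 0.5

post_flat <- posterior_on_grid(flat_prior)
post_inf  <- posterior_on_grid(informative_prior)

mle       <- p[which.max(likelihood)]   # ML: optimize the likelihood
mode_flat <- p[which.max(post_flat)]    # posterior peak under the flat prior
mode_inf  <- p[which.max(post_inf)]     # posterior peak under the informative prior
```

With the flat prior the posterior peak coincides with the maximum likelihood estimate, whereas the informative prior should pull the peak from 0.4 toward 0.5.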
Data D: 6 heads (out of 10 flips)
H = true underlying proportion of heads (the probability of coming up heads on any single flip)
◦ if H = 0.5, coin is perfectly fair
◦ if H = 1.0, coin always comes up heads (i.e. it is a trick coin)

F (frequentist): there exists a true probability H of getting heads; null hypothesis H0: H = 0.5
◦ Do the data reject the null hypothesis?
B (Bayesian): what is the range around 0.5 that we are willing to accept as being in the "fair coin" range?
◦ What is the probability that H is in this range?

◦ For the coin tossing example, we can calculate the probabilities exactly
◦ For more complex data, we need to explore the probability space

MCMC
Start somewhere
◦ That "somewhere" will have a likelihood associated with it
◦ Not the optimized, maximum likelihood
Randomly propose a new state
◦ If the new state has a better likelihood, the chain goes there; if it is worse, the chain goes there only with a probability equal to the ratio of the new value to the current one

The target distribution is the posterior distribution of interest.
The proposal distribution is used to decide where to go next; you have much flexibility here, and the choice affects the efficiency of the MCMC algorithm.
◦ Pro: taking big steps helps in jumping from one "island" in the posterior density to another
◦ Con: taking big steps often results in poor mixing
◦ Solution: MCMCMC!
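The single-chain procedure described above can be sketched for the coin data (6 heads in 10 flips). This is a minimal Metropolis sampler written for illustration, not the algorithm of any particular phylogenetics program. The flat prior on (0, 1), the normal proposal with standard deviation 0.2, the burn-in and thinning values, and the "fair coin" range 0.45 to 0.55 are all assumptions made for the example.

```r
set.seed(1)

heads <- 6
flips <- 10

# Log of the unnormalised posterior: binomial likelihood times a flat prior on (0, 1)
log_post <- function(h) {
  if (h <= 0 || h >= 1) return(-Inf)   # outside the support of the prior
  dbinom(heads, size = flips, prob = h, log = TRUE)
}

n_gen   <- 20000
chain   <- numeric(n_gen)
current <- 0.1    # "start somewhere"
step_sd <- 0.2    # proposal width (assumed tuning value)

for (g in seq_len(n_gen)) {
  proposal  <- rnorm(1, mean = current, sd = step_sd)
  log_ratio <- log_post(proposal) - log_post(current)
  # Better states are always accepted; worse states with probability equal to the ratio
  if (log(runif(1)) < log_ratio) current <- proposal
  chain[g] <- current
}

# Discard the first 2000 generations and keep every 10th state (thinning)
kept <- chain[seq(2001, n_gen, by = 10)]

post_mean <- mean(kept)
prob_fair <- mean(kept > 0.45 & kept < 0.55)

# Exact answer for comparison: with a flat prior the posterior is Beta(heads + 1, tails + 1)
exact_fair <- pbeta(0.55, heads + 1, flips - heads + 1) -
  pbeta(0.45, heads + 1, flips - heads + 1)
```

Because the coin example can be solved exactly, prob_fair can be compared with exact_fair and post_mean with the exact posterior mean 7/12; for more complex models only the sampled estimate is available.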
MC3 involves running several chains simultaneously (one "cold" and several "heated")
◦ The cold chain is the one that counts, the heated chains are "scouts"
◦ A chain is heated by raising densities to a power less than 1.0 (values closer to 0.0 are warmer)

Marginal = taking into account all possible values for all other parameters
◦ Record the position of the robot (the current state of the chain) every 100 or 1000 steps (1000 represents more "thinning" than 100)
◦ This sample will be autocorrelated, but not much so if it is thinned appropriately (can measure autocorrelation to assess this)
◦ If using heated chains, only the cold chain is sampled
◦ The marginal distribution of any parameter can be obtained from this sample

Start with a random tree and arbitrary initial values for branch lengths and model parameters
Each generation consists of one of these (chosen at random):
◦ Propose a new tree (e.g. Larget-Simon move) and either accept or reject the move
◦ Propose (and either accept or reject) a new model parameter value
Every k generations, save tree topology, branch lengths and all model parameters (i.e. sample the chain)
After n generations, summarize the sample using histograms, means, credible intervals, etc.

Priors
For topologies: discrete Uniform distribution
For proportions: Beta(a,b) distribution
◦ flat when a = b = 1
◦ symmetric and peaked at 0.5 if a = b and both are greater than 1
For base frequencies: Dirichlet(a,b,c,d) distribution
◦ flat when a = b = c = d = 1
◦ all base frequencies close to 0.25 if v = a = b = c = d and v is large (e.g. 300)
For GTR model relative rates: Dirichlet(a,b,c,d,e,f) distribution
For other model parameters and branch lengths: Gamma(a,b) distribution
◦ Exponential(λ) equals the Gamma(1, λ^-1) distribution
◦ Mean of Gamma(a,b) is ab (so the mean of an Exponential(10) distribution is 0.1)
◦ Variance of a Gamma(a,b) distribution is ab^2 (so the variance of an Exponential(10) distribution is 0.01)

Flat (uninformative) priors mean that the posterior probability is directly proportional to the likelihood
◦ The value of H at the peak of the posterior distribution is equal to the MLE of H
Informative priors can have a strong effect on posterior probabilities

1. Beware arbitrarily truncated priors
2. Branch length priors particularly important
3. Beware high posteriors for very short branch lengths
4. Partition with care (prefer fewer subsets)
5. MCMC run length should depend on number of parameters
6. Calculate how many times parameters were updated
7. Pay attention to parameter estimates
8. Run without data to explore prior
9. Run long and run often!
10. Future: model selection should include effects of priors

Marshall, D.C., 2010. Cryptic failure of partitioned Bayesian phylogenetic analyses: lost in the land of long trees. Syst Biol 59, 108-117.

Bayesian methods are here to stay in phylogenetics
◦ Are able to take into account uncertainty in parameter estimates
◦ Are able to relax most assumptions, including rate homogeneity among branches
◦ Timing of divergence analyses
◦ Being heavily developed, new features and algorithms appear regularly