Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Getting Parameters from data 2017/5/3 Comp 790– Coalescence with Mutations 1 1. Introduction • No fun 2. Estimating θ • Unlike ρ,θ does not shape the genealogy, it modifies genetic types • Mutation-rich samples allow for more accurate estimation of ‘genealogical shape’-parameters. • More segregating sites -> More information 2. Estimating θ • 2.1 Watterson’s estimator – One of the two very popular estimators – The estimator has mean θ: 2. Estimating θ • The estimator is unbiased and variance decreases with increasing ρ: recombination breaks up linkage and reduces correlation between sites • Under exponential growth, it is downwards biased. Under migration, it is upwards biased, because the MRCA tends to be pushed further back in time. 2. Estimating θ • 2.2 Tajima’s estimator – πij: The number of sites that differ between sequence i and j. – πij has mean θ, because πij is the number of segregating sites in a sample of size two. – As a consequence, 2. Estimating θ 2. Estimating θ • Code for Watterson Estimator 2. Estimating θ 2. Estimating θ 2. Estimating θ 2.3. Fu 1994 2. Estimating θ 2.3. Fu 1994 2. Estimating θ Fu’s other estimators: i-Mutation: only good with large n UPBLUE: does not generalize to settings with recombination 3. Estimating ρ • Estimating ρ is hard, both statistically and computationally. – Using infinite side model, all mutation events can be listed whereas all recombination events cannot. A recombination can only be inferred with certainty if all four gametes are present. – The number of possible genealogical relationship between sequences subject to recombination is unlimited. 3. Estimating ρ • Both TM and HM might overlook important information in the data. They only provides lower bounds to the number of recombination events. (TM: Least number of gene trees required to explain the sample. HM: 5.11.3) • Assume two non-recombining loci with rate ρbetween them. • Likelihood function: L(θ1, θ2, ρ) = P θ1, θ2, ρ(S1,n=k1,S2,n=k2) • For n=2, the two extreme cases: • L(θ1, θ2, 0) = • L(θ1, θ2, ∞) = 3. Estimating ρ • Assume u1=u2, two genes are of same length, then θ1=θ2. If k1=1 and k2=7, L(θ1, θ2, 0)=0.0014 L(θ1, θ2, ∞) =0.0067 (θL = 4 is the maximum likelihood estimator for both ρ=0 and ∞) The likelihood supports two unlinked loci(ρ= ∞) more than two completely linked loci(ρ=0). ρ>0 even though the data passes the four gamete test and has TM = HM = 0. recombination can be inferred even in the absence of incompatibilities. • If k1=k2=4, the likelihood supports two complete linked loci. 3. Estimating ρ • Recombination is difficult to take into account in an analysis and it is tempting to ignore it or assume that its effects are minor. Unfortunately it is not true. • 3.1 Estimators based on summary statistics – Wakeley’s estimator: Wakeley(1997) • A complicated function of a complicated function of ρ. ( the form is fully know though ) • Large variance. The expectation doesn’t strongly depend on ρ. – Likelihood and summary statistics: Wall(2000) • Infer ρ based on the likelihood of (Sn, Kn, TM) where Kn is the number of haplotypes in the sample. 3. Estimating ρ • 3.2 Pseudo-likelihood estimators: Hudson (2001b), Fearnhead & Donnelly(2001) – Consider all pairs of segregating sites, ignoring all non-polymorphic sites – Let nij denote the vector of gamete counts for sites i and j (00,01,…). Let ρij be the scaled recombination rate: – Probability of obtaining nij given that both i,j are polymorphic – The proposed pseudo-likelihood function is: – Then to estimate ρ from this pseudo-likelihood. • It depends on ρ only. Because ρij is assumed proportional to sequence length. 3. Estimating ρ • It is pseudo: – 1.Only likelihood of pairs of segregating sites are considered – 2. Pairs are treated as independent of each other – 3.The likelihood of a pair is conditioned on the pair being segregating in both loci 4. Monte Carlo methods • The principle: throw dice several times and calculate the average – Var(g(X))/M • Use Monte Carlo integration to find P(Sn=k) : A naïve approach – Simulate genealogies of n genes, add mutations and count the number of times a genealogy has exactly k mutations. – Many simulated genealogies will not contribute to the sum. 4. Monte Carlo methods • A better approach: Write P(Sn=k) in the form of an integral – Rearrange the terms and we can get: • Where X is gamma distributed with parameters k+1 and θ/2 – Or: • Where Ln is the sum of all branches in the coalescent tree. – It’s better because every simulated values counts. 2017/5/3 Comp 790– Continuous-Time Coalescence 21 4. Monte Carlo methods 2017/5/3 Comp 790– Continuous-Time Coalescence 22 4. Monte Carlo methods • 4.1 Likelihood curve – Monte Carlo methods becomes more useful in evaluating P(Sn=k) for a whole range of θ values – E.g.: to calculate L(θ) = Pθ(Sn=k) for a large range of θs and single out the θ value with highest probability. – Recall: – Simulate y1,….yM from Ln and calcuate the empirical average: – Note that only one set of simulations is performed and is used to calculate the likelihood for all θ 2017/5/3 Comp 790– Continuous-Time Coalescence 23 4. Monte Carlo methods • Alternatively, recall: • one can extend it this way: consider some fixed θ0. for any θ • Using Monte Carlo technique, the integral can be approximated by: • Where x1,…xM are M values obtained from the proposal distribution gamma(k+1, θ0). θ0 is called the ‘driving value’. An appropriate choice of θ0 could be a simple estimator of θ. E.g. Watterson’s estimator. 2017/5/3 Comp 790– Continuous-Time Coalescence 24 4. Monte Carlo methods 2017/5/3 Comp 790– Continuous-Time Coalescence 25 4. Monte Carlo methods • 4.2 Monte Carlo integration and the coalescent – Full likelihood of a sample under a coalescent model: – H: history D: data – Let’s define H here: – N!(n-1)!/2^(n-1) different coalescent topologies – Impossible to sum up all these. ( we haven't even considered recomb) 2017/5/3 Comp 790– Continuous-Time Coalescence 26 4. Monte Carlo methods • A naïve Monte Carlo approach: – It’s not efficient for most of the coalescent topologies will not be compatible with D. most simulations do not contribute to the likelihood. – A four sequence example: (1/3 compatible ) 2017/5/3 Comp 790– Continuous-Time Coalescence 27 4. Monte Carlo methods • Importance Sampling: – Reduce the variance of the estimated probability – Reduce the number of simulations that contribute little to likelihood • Instead of choosing histories from distribution Pθ(H), sample histories from a proposal distribution Q(H) • Now the likelihood of data can be approximated by: 2017/5/3 Comp 790– Continuous-Time Coalescence 28 4. Monte Carlo methods • Ideally, one would like to sample from, where because in that case the approximation becomes exact: Not feasible approach. • A proposal distribution between and • Giffiths and Tavare(1994), Stephens and Donnelly(2000) 2017/5/3 Comp 790– Continuous-Time Coalescence 29 4. Monte Carlo methods • Giffiths and Tavare(1994): • Let’s go back to 2.4.2 Infinite Site Model 2017/5/3 Comp 790– Continuous-Time Coalescence 30 • Giffiths and Tavare(1994): • H is defined as a path through the diagram. • H has probability defined by the product of weights attached to the edges that belong to H. • E.g.H’ follows the rightmost path: – 1st term of Q(H’): is the driving value – Last five terms are all 1 – θ0 2017/5/3 Comp 790– Continuous-Time Coalescence 31 4. Monte Carlo methods • Pθ(D|H’) = 1 ??? • is the product of coalescent probabilities of the events defining the history: – Coalescent -> mutation -> coalescent … • The factor in front of a fraction is the probability that a mutation happens in a given lineage(s) or that a coalescent event happens amongst certain pair of genes. 2017/5/3 Comp 790– Continuous-Time Coalescence 32 4. Monte Carlo methods • 4.3 Markov Chain Monte Carlo – Kuhner et al.(1995,1998) – Finite sites model – All coalescent topologies are compatible with data – Likelihood ratio: 2017/5/3 Comp 790– Continuous-Time Coalescence 33 4. Monte Carlo methods • The importance sampling function is: • Use Metropolis-Hastings algorithm to construct a Markov Chain with distribution Q(H) • The benefit of the approach is that the Markov Chain tends to stay in areas of the tree space that suport data well before moving to another area.34 2017/5/3 Comp 790–the Continuous-Time Coalescence