Distribution of Mutations
Biostatistics 666, Lecture 5

Last Lecture: Introduction to the Coalescent
• Coalescent approach
  - Proceed backwards through time.
  - Genealogy of a sample of sequences.
• Infinite sites model
  - All mutations distinguishable.
  - No reverse mutation.

Some key ideas …
• Probability of coalescence events
• Length of genealogy and its branches
• Expected number of mutations
• Parameter θ, which combines population size and mutation rate

Building Blocks …
• Probability of sampling distinct ancestors for n sequences
$$P(n) = \prod_{i=1}^{n-1}\left(1 - \frac{i}{N}\right) \approx 1 - \frac{\binom{n}{2}}{N}$$
• Coalescence time t is approximately exponentially distributed

Some Key Results …
• Coalescence Time (population size units)
$$E(T_j) = 1 \Big/ \binom{j}{2}$$
• Total Length (population size units)
$$E(T_{tot}) = \sum_{i=1}^{n-1} \frac{2}{i}$$

Some More Key Results …
• Expected Number of Polymorphisms
For a diploid sample
$$E(S) = 4N\mu \sum_{i=1}^{n-1} \frac{1}{i} = \theta \sum_{i=1}^{n-1} \frac{1}{i}$$
For a haploid sample
$$E(S) = 2N\mu \sum_{i=1}^{n-1} \frac{1}{i} = \theta \sum_{i=1}^{n-1} \frac{1}{i}$$

Inferences about θ
• Could be estimated from S
  - Divide by expected length of genealogy
$$\hat{\theta} = \frac{S}{\sum_{i=1}^{n-1} 1/i}$$
• Could then be used to:
  - Estimate N, if mutation rate µ is known
  - Estimate µ, if population size N is known

Alternative Estimator for θ …
• Count pairwise differences between sequences
• Compute average number of differences, where $S_{ij}$ is the number of differences between sequences i and j
$$\tilde{\theta} = \binom{n}{2}^{-1} \sum_{i=1}^{n} \sum_{j=i+1}^{n} S_{ij}$$

[Figure: Var(θ̂) as a function of N. Parameters: N = 10,000 individuals, µ = 10^-4, θ = 4. x-axis: Sample Size (2 to 50); y-axis: Variance in Estimate of Theta (0.0 to 1.2).]

Today …
• More applications of the coalescent
• Predicting allele frequency distributions
• The full distribution of S
  - Using simulations
  - Using analytical calculations

A Coalescent Simulation …
• Let's consider tracing the ancestry of 4 sequences

When n = 4
Probability of Coalescent Event
$$P(4) \approx \binom{4}{2} \Big/ 2N$$
Time to Next Coalescent Event
$$T(4) \approx 2N \Big/ \binom{4}{2}$$
Sample time from exponential distribution.
Pick two sequences at random to coalesce.

Next n = 3 …
Let's assume that sequences 3 and 4 are selected …
Then, we repeat the process for a sample of 3 sequences.
[Figure: genealogy so far, with interval T(4) labeled]

Next n = 2 …
Let's assume that sequences 1 and 2 are selected to coalesce.
Then, we repeat the process for a sample of 2 sequences.
[Figure: genealogy so far, with intervals T(3) and T(4) labeled]

The Simulated Coalescent
At this point, we could place mutations in the genealogy.
Most often, these would fall in longer branches.
[Figure: complete genealogy, with intervals T(2), T(3) and T(4) labeled]

A Coalescent Simulation …
Mutations in these branches affect a pair of sequences.
[Figure: genealogy with the relevant branches highlighted; intervals T(2), T(3), T(4)]

A Coalescent Simulation …
Mutations in these branches affect a single sequence.
[Figure: genealogy with the relevant branches highlighted; intervals T(2), T(3), T(4)]

Frequency Spectrum
• Repeating the simulation multiple times would give us a predicted mutation spectrum.
[Figure: predicted spectrum. x-axis: Frequency (out of n); y-axis: Probability]

[Figure: Frequency Spectrum (n = 10). x-axis: Frequency (out of n), 1 to 10; y-axis: Proportion of All Variants]

[Figure: Frequency Spectrum (n = 100). x-axis: Frequency (out of n), 1 to 96; y-axis: Proportion of All Variants]

Frequency Spectrum
• Constant size population
• Exponentially growing population
• Most variants are rare
  - For n = 100, ~44% of variants occur at frequency ≤ 5/100.
  - For n = 10, ~35% of variants are observed once.

Mutation Spectrum
• Depends on genealogy
  - Population Size
  - Population Growth
  - Population Subdivision
• Does not depend on
  - Mutation rate!

Deviations from Neutral Spectrum
• When would you expect deviations from the spectra we described?
• What would you expect (and why) for …
  - A rapidly growing population?
  - A population whose size is decreasing?

Effect of Polymorphism Type
[Figure: data from Cargill et al, 1999]

Number of Mutations
• Can be derived from coalescent tree
• Analytical results possible
  - What are the key features?
  - Trace back in time until MRCA, tracking mutation events

Sample of Two Sequences
• Track coalescences and mutations
  - Probability of a coalescent event?
      Depends on population size …
  - Probability of a mutation?
      Depends on mutation rate …
• Proceed backwards until either occurs …
  - Conditional probability for each outcome?

Two Identical Sequences
$$P_2(S = 0) \approx \frac{P_{CA}}{P_{CA} + P_{mut}} = \frac{1/2N}{1/2N + 2\mu} = \frac{1}{1+\theta}$$

Full distribution of S …
• Probability that first j events are mutations …
$$P_2(j) = \left(\frac{\theta}{1+\theta}\right)^{j} \left(\frac{1}{1+\theta}\right)$$

Example …
• 2 sequences
• Population size N = 25,000
• Mutation rate µ = 10^-5
• Probability of 0, 1, 2, 3 … mutations

And for multiple sequences …
• Describe number of mutations until the next coalescence event
• Proceed back in time, until:
  - One of n sequences mutates …
  - A coalescent event occurs …
  - Then track mutations in (n-1) sequences

Formulae …
$$Q_n(j) = \left(\frac{n\mu}{n\mu + \binom{n}{2}/2N}\right)^{j} \frac{\binom{n}{2}/2N}{n\mu + \binom{n}{2}/2N} = \left(\frac{\theta}{\theta + n - 1}\right)^{j} \frac{n-1}{\theta + n - 1}$$
$$P_n(j) = \sum_{i=0}^{j} P_{n-1}(j-i)\, Q_n(i)$$

Example …
• 3 sequences
• Population size N = 25,000
• Mutation rate µ = 10^-5
• Probability of 0, 1, 2, 3 … mutations

Number of Mutations
[Figure: distribution of the number of mutations for N = 10,000, n = 10, µ = 2 × 10^-5. Approximately 2kb of sequence, sequenced in 10 individuals. x-axis: Number of Mutations (0 to 10+); y-axis: Probability (0.00 to 0.20)]

So far …
• One homogeneous population
  - Coalescence times
  - Number of mutations
      Expectation
      Distribution
  - Spectrum of mutations
• Several assumptions, including …
  - No recombination

Recombination …
• No recombination
  - Single genealogy
• Free recombination
  - Two independent genealogies
  - Same population history
• Intermediate case
  - Correlated genealogies
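
The recursion for P_n(j) given under "Formulae …" can be evaluated numerically. The following is a minimal sketch, not the lecture's own code: the function names are hypothetical, θ is taken as 4Nµ (the diploid convention used earlier), and the starting point P_1(0) = 1 is an assumption chosen so that the recursion reproduces the two-sequence result P_2(j).

```python
def q_n(n, j, theta):
    """Q_n(j): probability of exactly j mutations before the next
    coalescence while n lineages remain (geometric in j)."""
    p_mut = theta / (theta + n - 1)   # next event is a mutation
    p_coal = (n - 1) / (theta + n - 1)  # next event is a coalescence
    return p_mut ** j * p_coal


def mutation_count_distribution(n, theta, max_j):
    """Return [P_n(0), ..., P_n(max_j)] using
    P_n(j) = sum_{i=0..j} P_{n-1}(j - i) * Q_n(i)."""
    # Assumed base case: one lineage has reached the MRCA, so no
    # further mutations are counted (P_1(0) = 1).
    p = [1.0] + [0.0] * max_j
    for k in range(2, n + 1):
        q = [q_n(k, i, theta) for i in range(max_j + 1)]
        p = [
            sum(p[j - i] * q[i] for i in range(j + 1))
            for j in range(max_j + 1)
        ]
    return p


if __name__ == "__main__":
    # Parameters from the lecture's examples: N = 25,000, mu = 1e-5.
    theta = 4 * 25_000 * 1e-5
    for n in (2, 3):
        probs = mutation_count_distribution(n, theta, 3)
        print(n, [round(x, 4) for x in probs])
```

With the example parameters θ = 1, so for two sequences the probabilities of 0, 1, 2 mutations are 1/2, 1/4, 1/8, and for three sequences P_3(0) = 1/3 and P_3(1) = 5/18. The mean of the distribution can be compared against E(S) = θ Σ 1/i as a sanity check.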
Recommended Reading
Richard R. Hudson (1990). Gene genealogies and the coalescent process. Oxford Surveys in Evolutionary Biology, Vol. 7. D. Futuyma and J. Antonovics (Eds.). Oxford University Press, New York.
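
The simulation steps walked through in this lecture (sample an exponential waiting time, pick two sequences at random to coalesce, repeat, then place mutations on the branches) can be sketched as below. This is an illustrative sketch under stated assumptions rather than a definitive implementation: time is measured in units of 2N generations, mutations are placed on each branch as a Poisson count with mean θ/2 per unit of branch length (consistent with E(S) = θ Σ 1/i), and all function names are hypothetical.

```python
import random


def poisson(rng, mean):
    """Poisson draw: count unit-rate exponential arrivals before `mean`."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_site_counts(n, theta, rng):
    """Simulate one genealogy of n sequences.
    Returns counts where counts[k] is the number of mutations
    carried by exactly k of the n sampled sequences."""
    sizes = [1] * n            # descendants in the sample per lineage
    counts = [0] * (n + 1)
    while len(sizes) > 1:
        k = len(sizes)
        # Waiting time to the next coalescence: rate C(k, 2).
        t = rng.expovariate(k * (k - 1) / 2)
        # Place mutations on each of the k branches over this interval.
        for size in sizes:
            counts[size] += poisson(rng, theta / 2 * t)
        # Pick two lineages at random to coalesce.
        i, j = sorted(rng.sample(range(k), 2))
        merged = sizes[i] + sizes[j]
        del sizes[j]           # delete the larger index first
        del sizes[i]
        sizes.append(merged)
    return counts


def frequency_spectrum(n, theta, replicates, seed=0):
    """Proportion of all variants at frequency 1..n-1 (out of n),
    pooled over repeated simulations."""
    rng = random.Random(seed)
    totals = [0] * (n + 1)
    for _ in range(replicates):
        for k, c in enumerate(simulate_site_counts(n, theta, rng)):
            totals[k] += c
    total = sum(totals)
    if total == 0:
        return [0.0] * (n - 1)
    return [c / total for c in totals[1:n]]


if __name__ == "__main__":
    spectrum = frequency_spectrum(10, theta=1.0, replicates=5000, seed=1)
    for k, prop in enumerate(spectrum, start=1):
        print(k, round(prop, 3))
```

For a constant-size population the proportion of variants seen once should come out close to the roughly 35% quoted above for n = 10, up to simulation noise. Because the spectrum is normalized, changing `theta` changes how many mutations are generated but not the expected proportions.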