N-gene Coalescent Problems

• Probability of the 1st success after waiting t, given a time constant, a ~ p, of success:

$$\mathrm{Exp}(a,t) = a e^{-at}, \qquad E(\mathrm{Exp}(a,t)) = \frac{1}{a}, \qquad \mathrm{Var}(\mathrm{Exp}(a,t)) = \frac{1}{a^2}$$

5/12/2017 Comp 790 – Continuous-Time Coalescence

Review N-genes

• Likelihood k genes have a distinct lineage is:

$$\frac{2N-1}{2N}\cdot\frac{2N-2}{2N}\cdots\frac{2N-(k-1)}{2N} = \prod_{i=1}^{k-1}\left(1-\frac{i}{2N}\right)$$

Note: The 1st gene can choose its parent freely, but the next k-1 must choose from the remainder (genes without a child).

• Manipulating a little:

$$\prod_{i=1}^{k-1}\left(1-\frac{i}{2N}\right) = 1 - \sum_{i=1}^{k-1}\frac{i}{2N} + O\!\left(\frac{1}{N^2}\right) = 1 - \binom{k}{2}\frac{1}{2N} + O\!\left(\frac{1}{N^2}\right)$$

• Where, for large N, $1/N^2$ is negligible

Approx N-gene Coalescence

• Approximate probability k genes have different parents:

$$1 - \binom{k}{2}\frac{1}{2N}$$

Note: Recall that the 2-gene case had a similar form, but with 1 in place of the combinatorial. Here the combinatorial term accounts for all possible k-choose-2 pairs, which are treated independently.

• The probability two or more have a common parent:

$$1 - \left(1 - \binom{k}{2}\frac{1}{2N}\right) = \binom{k}{2}\frac{1}{2N}$$

• Repeated distinct lineages for j generations leads to a geometric distribution, with $p = \binom{k}{2}\frac{1}{2N}$:

$$P(N = j) = \left(1 - \binom{k}{2}\frac{1}{2N}\right)^{j-1}\binom{k}{2}\frac{1}{2N}$$

(In $P(N = j)$ the N is the number of generations until the first coalescence, not the population size.)

Impact of Approximation

• Approximation is not "proper" for all values of k < 2N:

$$1 - \binom{k}{2}\frac{1}{2N} = 1 - \frac{k(k-1)}{4N} < 0 \quad\text{for}\quad k > \frac{1+\sqrt{1+16N}}{2}$$

• Considering the following values of N (k is the smallest sample size at which the approximate probability becomes negative):

N    10   100   1000   10000   100000   1000000
k     7    21     64     201      633      2001

Fix N and Vary k

• Comparing the actual to the approximation

[Figure: plot comparing the actual probability to the approximation for fixed N and varying k]

Concrete Example

• In a population of 2N = 10 the probability that 3 genes have one ancestor in the previous generation is:

$$1\cdot\frac{1}{10}\cdot\frac{1}{10} = \frac{1}{100}$$

Note: The 1st gene can choose its parent freely, while the next 2 must choose the same one.

• The probability that all 3 have a different ancestor is:

$$\frac{10}{10}\cdot\frac{9}{10}\cdot\frac{8}{10} = \frac{72}{100}$$

Note: The 1st gene can choose its parent from the 10, while the next 2 must choose from the remainder.

• The remaining probability is that the 3 genes have two parents in the previous generation:

$$1 - \frac{1}{100} - \frac{72}{100} = \frac{27}{100}$$

Example Continued

• The probability that 2 or more genes have common parents in the previous generation is:

$$\frac{27}{100} + \frac{1}{100} = \frac{28}{100}$$

Note: The probability that 2 have a common parent plus the probability all 3 have a common parent.

• By our approximation term the probability that two or more genes share a common parent is:

$$\binom{3}{2}\frac{1}{10} = \frac{3}{10} = \frac{28}{100} + \underbrace{\frac{2}{100}}_{\text{error}}$$

Note: Error in approximation for k = 3, 2N = 10.

• Leads to a MRCA estimate (the expected number of generations until the first coalescence) of:

$$\frac{1}{p} = \frac{1}{\binom{3}{2}\frac{1}{10}} = \frac{10}{3} \approx 3.33$$

For Large N and Small k

• For 2N > 100, the agreement improves, so long as k << 2N
• The advantage of the approximation is that it fits the "form" of a geometric distribution, and thus can be generalized to a continuous-time model

Continuous-time Coalescent

• In the Wright-Fisher model time is measured in discrete units, generations.
• A continuous-time approximation is conceptually more useful and, via the given approximation, computationally simple
• Moreover, a continuous model can be constructed that is independent of the population size (2N), so long as our sample size, k, is much smaller (one of those rare cases where a small sample size simplifies matters)
• The only time we will need to consider population size (2N) is when we want to convert from time back into generations.

Continuous-time Derivation

• As before, let $t = \frac{j}{2N}$, where t is now time measured in units of 2N generations
• It follows that j = 2Nt translates continuous time, t, back into generations j. In practice floor(2Nt) is used to assign a discrete generation number.
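The discrete-generation quantities used so far (the exact and approximate probabilities in the 2N = 10, k = 3 example, the table of k values, and the floor(2Nt) conversion) can be checked numerically. The following is a minimal sketch, not part of the original slides; all function names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from math import comb, floor


def exact_distinct(k, two_n):
    """Exact probability that k genes all choose different parents
    among two_n candidates: product of (1 - i/2N) for i = 1..k-1."""
    p = Fraction(1)
    for i in range(1, k):
        p *= Fraction(two_n - i, two_n)
    return p


def approx_coalesce(k, two_n):
    """Approximate probability that two or more of k genes share a
    parent: C(k,2) / 2N (the O(1/N^2) terms are dropped)."""
    return Fraction(comb(k, 2), two_n)


def smallest_improper_k(n):
    """Smallest k for which 1 - C(k,2)/(2N) is negative,
    i.e. the first k with k(k-1) > 4N."""
    k = 2
    while k * (k - 1) <= 4 * n:
        k += 1
    return k


def to_generation(t, two_n):
    """Convert continuous time t (units of 2N generations) back to a
    discrete generation number via floor(2N t)."""
    return floor(two_n * t)


# Worked example from the slides: 2N = 10, k = 3.
exact = 1 - exact_distinct(3, 10)   # exact P(two or more share a parent)
approx = approx_coalesce(3, 10)     # approximation C(3,2)/10
error = approx - exact              # approximation error
```

For the slide example, `exact` is 28/100, `approx` is 3/10, and `error` is 2/100, while `1 / approx` gives the 10/3 waiting-time estimate. Exact fractions are used here only to avoid floating-point noise in such small cases.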
• The waiting time, $T_k^c$, for k genes to have k – 1 or fewer ancestors is exponentially distributed, $T_k^c \sim \mathrm{Exp}\!\left(\binom{k}{2}\right)$, derived from t = j/2N, M = 2N and $p = \binom{k}{2}/2N$
• Giving:

$$P(T_k^c \le t) = 1 - e^{-\binom{k}{2}t}$$

Note: The probability that k genes will have k-1 or fewer ancestors by time t (that is, the waiting time is less than or equal to t).

Visualization

• Plots of $P(T_k^c \le t)$, for k = [3, 4, 5, 6]

[Figure: four curves labeled k = 3, k = 4, k = 5, k = 6]

Continuous Coalescent Time Scale

• In the continuous-time model, the time constant is a measure of ancestral population size, with the original at time 0, ½ the original at time 0.5, and ¼ at 1.0

[Figure: genealogy with horizontal labels 1 to 6; an axis labeled "Population size" with ticks 0, N, 2N, 2.6N alongside a t axis with ticks 0.0, 0.5, 1.0, 1.3]

A Coalescent Model

• The continuous coalescent lends itself to generative models
• The following algorithm constructs a plausible genealogy for n genes
1. Start with k = n genes
2. Simulate the waiting time, $T_k^c$, to the next event, $T_k^c \sim \mathrm{Exp}\!\left(\binom{k}{2}\right)$
3. Choose a random pair (i, j) with 1 ≤ i < j ≤ k uniformly among the $\binom{k}{2}$ pairs
4. Merge i and j into one gene and decrease the sample size by one, $k \to k - 1$
5. Repeat from step 2 while k > 1
• This model is backwards: it begins from the current population and posits ancestry, in contrast to a forward algorithm like those used in the first lecture

Properties of a Coalescent Tree

• The height, $H_n$, of the tree is the sum of time epochs, $T_j$, where there are j = n, n-1, n-2, …, 2 ancestors.
• The distribution of $H_n$ amounts to a convolution of the exponential variables whose result is:

$$P(H_n \le t) = \sum_{k=1}^{n} e^{-\binom{k}{2}t}\,\frac{(-1)^{k-1}(2k-1)F(k)}{G(k)}$$

• Where

$$F(k) = n(n-1)(n-2)\cdots(n-k+1), \qquad G(k) = n(n+1)(n+2)\cdots(n+k-1)$$

• With

$$E(H_n) = \sum_{j=2}^{n} E(T_j) = 2\sum_{j=2}^{n}\frac{1}{j(j-1)} = 2\left(1-\frac{1}{n}\right)$$

$$\mathrm{Var}(H_n) = \sum_{j=2}^{n}\mathrm{Var}(T_j) = 4\sum_{j=2}^{n}\frac{1}{j^2(j-1)^2}$$

Note: As n → ∞, $E(H_n)$ → 2, and, if n = 2, $E(H_2)$ = 1. Thus, the waiting time for n genes to find their common ancestor is less than twice the time for 2!
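The five-step generative algorithm and the tree-height formulas can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the course's own code: the function names, the tuple representation of lineages, and the use of Python's `random.Random` are all choices made here for the example. Time is in continuous coalescent units (multiply by 2N to get generations).

```python
import math
import random


def simulate_genealogy(n, rng):
    """Run the backward coalescent for n genes.

    Returns (height, merges), where merges is a list of
    (time, lineage_a, lineage_b) and each lineage is the tuple of
    sampled gene labels it is ancestral to (hypothetical representation).
    """
    lineages = [(i,) for i in range(n)]      # step 1: k = n genes
    merges = []
    t = 0.0
    k = n
    while k > 1:                             # step 5: repeat while k > 1
        # step 2: waiting time ~ Exp(C(k,2)); expovariate takes the rate
        t += rng.expovariate(math.comb(k, 2))
        # step 3: uniform random pair i < j among the C(k,2) pairs
        i, j = sorted(rng.sample(range(k), 2))
        # step 4: merge the pair and reduce the sample size by one
        merges.append((t, lineages[i], lineages[j]))
        lineages[i] = lineages[i] + lineages[j]
        del lineages[j]                      # j > i, so index i is unaffected
        k -= 1
    return t, merges


def expected_height(n):
    """E(H_n) = 2 (1 - 1/n)."""
    return 2.0 * (1.0 - 1.0 / n)


def height_cdf(n, t):
    """P(H_n <= t) from the alternating sum with falling factorial F(k)
    and rising factorial G(k)."""
    total = 0.0
    for k in range(1, n + 1):
        f = math.prod(n - i for i in range(k))   # n(n-1)...(n-k+1)
        g = math.prod(n + i for i in range(k))   # n(n+1)...(n+k-1)
        sign = (-1) ** (k - 1)
        total += math.exp(-math.comb(k, 2) * t) * sign * (2 * k - 1) * f / g
    return total
```

As a sanity check, averaging the simulated height over many runs with n = 5 should come out close to 2(1 - 1/5) = 1.6, and `height_cdf(2, t)` reduces to 1 - e^{-t}, the two-gene case.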