Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Statistical population genetics Lecture 5: Properties of the coalescent Xavier Didelot Dept of Statistics, Univ of Oxford [email protected] Slide 70 of 161 Coalescent simulation A coalescent tree for n genes can be simulated using the following algorithm: Algorithm (Coalescent simulation). 1. Start with k = n lines; 2. Simulate the waiting time for the next coalescence event from Exp(k(k − 1)/2); 3. Choose without replacement a random pair of lines (i, j) amongst the k(k − 1)/2 possible pairs; 4. Join i and j into a single line so that the number of lines k is decreased by one; 5. If k > 1, go back to Step 2, otherwise stop. Slide 71 of 161 Convolution of exponentials Theorem (Convolution of exponentials). If R1 , ..., Rn are independent exponential distributed with parameters λ1 , ..., λn , then their sum is distributed as: fPni=1 Ri (x) = n X i=1 λi e−λi x n Y j=1,j6=i λj λj − λi Slide 72 of 161 Convolution of exponentials Proof (in the case n = 2) X1 and X2 are exponentially distributed with parameter λ1 and λ2 . The probability density function of X1 + X2 is therefore: fX1 +X2 (t) = = Z Z t fX1 (x)fX2 (t − x)dx 0 t λ1 exp(−λ1 x)λ2 exp(−λ2 (t − x))dx 0 = λ1 λ2 exp(−λ2 t) Z t exp((λ2 − λ1 )x)dx 0 λ1 λ2 (exp(−λ1 t) − exp(−λ2 t)) = λ2 − λ1 2 Slide 73 of 161 Height of a coalescent tree The following theorem is due to Tavaré (1984). Theorem (Time to the most recent common ancestor). The time to the most recent common ancestor of a sample of size n is the coalescent model has distribution: fTMRCA (t) = n X i(i − 1) i=2 2 −i(i−1)t/2 e n Y j=2,j6=i j(j − 1) j(j − 1) − i(i − 1) and mean and variance: n X 1 1 E(TMRCA ) = 2 1 − and var(TMRCA ) = 4 2 (i − 1)2 n i i=2 Slide 74 of 161 Height of a coalescent tree Proof. In a sample of k genes, the time Tk before the first coalescent event is Exponential with parameter k(k−1) . The time to the most recent common 2 ancestor is then equal to TMRCA = T2 + T3 + ... + Tn . This is a sum of independently but non-identically distributed exponential variables. Thus we can use a convolution to find its probability density function. 2 Slide 75 of 161 Height of a coalescent tree The expectation and variance of TMRCA can be derived from the probability density function above, or more directly as follows: E(TMRCA ) = n X var(TMRCA ) = n X n X 2 i(i − 1) i=2 i=2 n X 1 1 1 − =2 =2 1− i − 1 i n i=2 i=2 E(Ti ) = var(Ti ) = 4 n X i=2 1 i2 (i − 1)2 Slide 76 of 161 Height of a coalescent tree Slide 77 of 161 Human effective population size • Anthropological evidence shows that Homo sapiens appeared approximately 200,000 years ago in Africa. • All humans must therefore share a most recent common ancestor (MRCA) less than 200,000 years ago. • This is equal to 10,000 generations if we assume that each human generation lasted approximately 20 years. • In coalescent theory, the expected height of a genealogical tree is equal to 2 coalescent units for a large sample size n. • Since a coalescent unit of time is equal to Me generations, this gives an estimate of 5,000 for the human effective population size Me . Slide 78 of 161 Total branch lengths The following theorem is due to Tavaré (1984). Theorem (Total branch lengths). The sum of branch lengths of a coalescent tree for a sample of size n has distribution: fTtotal (t) = Y n j−1 i−1 exp − t 2 2 j−i j=2 n X i−1 i=2 j6=i and mean and variance: E(Ttotal ) = 2 n−1 X i=1 n−1 X 1 1 and var(Ttotal ) = 4 2 i i i=1 Slide 79 of 161 Total branch lengths Proof. Let Ttotal denote the sum of branch lengths. Then: Ttotal = Notice that iTi is Exponentially distributed: Pn i=2 iTi . i(i − 1) t P(iTi < t) = P(Ti < t/i) = 1 − exp − 2 i which is the cumulative distribution function of an exponential with parameter (i − 1)/2. We can therefore use a convolution to find the probability distribution of Ttotal . 2 Slide 80 of 161 Total branch lengths The expectation and variance can be derived from the above, or more directly: E(Ttotal ) = n X i=2 var(Ttotal ) = n X i=2 E(iTi ) = n X i=2 var(iTi ) = n−1 X1 2i =2 i(i − 1) i i=1 n X i=2 n−1 X 1 4i2 =4 2 2 2 i (i − 1) i i=1 2 Slide 81 of 161 Total branch lengths Slide 82 of 161 Hanging configuration Theorem (Probability of hanging configuration). Consider a coalescent tree for which at time t back in time, there were k ancestral lineages. Let Z = (z1 , ..., zk ) be the number of descendants of these P k lineages such that zi = n. We have: P(Z = z|n, k) = n−1 k−1 −1 Slide 83 of 161 Hanging configuration z1 = 3 z2 = 1 z3 = 3 t Slide 84 of 161 Hanging configuration Proof. • • • • • • We prove the result by induction on n. For n = 2 the hypothesis is true. Let us now assume that it is true for n − 1 and show that it is true for n. If k = n then there is only one possibility z = (1, ..., 1) and the hypothesis is true. If k < n, consider the class in which the last coalescent event happened. The probability that it is class i is (zi − 1)/(n − 1) since any branch is equally likely to split. Thus: k X zi − 1 P(Z = z|n, k) = P(Z = z − 1i |n − 1, k) n − 1 i=1 −1 −1 n−1 n−k n−2 = = n−1 k−1 k−1 2 Slide 85 of 161 Hanging configuration Lemma 1. In a coalescent tree for n genes, we consider a branch at a time back in time when there were k ancestral lineages. The distribution of the number Y of genes that are derived from that branch is: P(Y = y|n, k) = n−1−y k−2 n−1 k−1 Lemma 2. The number of descendants on one side of the root of a coalescent tree is uniformly distributed. Slide 86 of 161 Descendants of a branch With n = 20: Slide 87 of 161 Summary • The coalescent model is easy to simulate • We derived the distribution of the time to the most recent common ancestor for a sample • We also derived the sum of branch lengths of a coalescent tree • The probability distribution of hanging configurations is uniform Slide 88 of 161