Combinatorics & the Coalescent (26.2.02)

Tree Counting & Tree Properties.
Basic Combinatorics.
Allele distribution.
Polya Urns + Stirling Numbers.
Number of ancestral lineages after time t.
Inclusion-Exclusion Principle.

[Figure: a set of realisations (from Felsenstein)]

Binomial Numbers

$\binom{n}{r} = \frac{n!}{r!(n-r)!}$ is the number of ways of choosing r items out of n.

Binomial Expansion: $(a+b)^n = \sum_{i=0}^{n} \binom{n}{i} a^i b^{n-i}$

Special Cases:
$(1+1)^n = 2^n = \sum_{i=0}^{n} \binom{n}{i}$
$(1-1)^n = 0 = \sum_{i=0}^{n} \binom{n}{i} (-1)^i$

Initialisation: $\binom{n}{0} = \binom{n}{n} = 1$
Recursion: $\binom{n}{r} = \binom{n-1}{r-1} + \binom{n-1}{r}$ (item n is either among the r chosen items or it is not).

n    binomial coefficients       row sum
0    1                           1
1    1 1                         2
2    1 2 1                       4
3    1 3 3 1                     8
4    1 4 6 4 1                   16
5    1 5 10 10 5 1               32
6    1 6 15 20 15 6 1            64
7    1 7 21 35 35 21 7 1         128

The Exponential Distribution

The Exponential Distribution lives on R+.
Density: $f(t) = a e^{-at}$, $t \ge 0$.
$P(X > t) = e^{-at}$.

Properties, for X ~ Exp(a) and Y ~ Exp(b) independent:
i. $P(X > t_2 \mid X > t_1) = P(X > t_2 - t_1)$ ($t_2 > t_1$).
ii. E(X) = 1/a.
iii. P(X < Y) = a/(a + b).
iv. min(X,Y) ~ Exp(a + b).
v. The sum of k iid $X_i$ ~ Exp(a) is $\Gamma(k,a)$ distributed, with density $\frac{a^k x^{k-1} e^{-ax}}{\Gamma(k)}$.

The Standard Coalescent

Two independent processes:
Continuous: exponential waiting times.
Discrete: choosing pairs to coalesce.

Going back in time from a sample of 5:

Waiting time            Lineages               Coalescing
Exp($\binom{5}{2}$)     {1}{2}{3}{4}{5}        4--5
Exp($\binom{4}{2}$)     {1}{2}{3}{4,5}         1--2
Exp($\binom{3}{2}$)     {1,2}{3}{4,5}          3--(4,5)
Exp($\binom{2}{2}$)     {1,2}{3,4,5}           (1,2)--(3,(4,5))
                        {1,2,3,4,5}

Tree Counting

Tree: connected undirected graph without cycles, with k nodes (vertices) & k-1 edges.
Nodes with one edge are leaves (tips); the rest are internal.
Labels of internal nodes are permutable without change of biological interpretation.
If labels at leaves are ignored we have the shape of a tree.
Ignoring root & branch lengths gives the unrooted tree topology.
Most biological trees are bifurcating: every internal node has valency 3 (number of edges touching it) if the tree is made unrooted. Such unrooted trees with n leaves have n-2 internal nodes & 2n-3 edges.
If the age ordering of internal nodes is retained, this gives the coalescent topology.

[Figure: example tree with root r, internal nodes a1-a4 and leaves s1-s6]
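The two processes of the standard coalescent described above (exponential waiting times and a uniformly chosen pair to coalesce) can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the function name `simulate_coalescent` and the returned event format are assumptions made here, and time is measured in the units in which k lineages coalesce at rate $\binom{k}{2}$.

```python
import random
from math import comb


def simulate_coalescent(n, rng=None):
    """Simulate one realisation of the standard coalescent for n leaves.

    Returns a list of (time, block_a, block_b) events, going back in time.
    Hypothetical helper for illustration; not part of the original slides.
    """
    rng = rng or random.Random()
    # Start with every leaf in its own block: {1}{2}..{n}.
    blocks = [frozenset([leaf]) for leaf in range(1, n + 1)]
    time = 0.0
    events = []
    while len(blocks) > 1:
        k = len(blocks)
        # Continuous process: waiting time with k lineages is Exp(C(k,2)).
        time += rng.expovariate(comb(k, 2))
        # Discrete process: every pair is equally likely to coalesce.
        a, b = rng.sample(range(k), 2)
        block_a, block_b = blocks[a], blocks[b]
        blocks = [blk for idx, blk in enumerate(blocks) if idx not in (a, b)]
        blocks.append(block_a | block_b)
        events.append((time, block_a, block_b))
    return events


if __name__ == "__main__":
    for time, block_a, block_b in simulate_coalescent(5, random.Random(1)):
        print(f"t = {time:.3f}: {sorted(block_a)} -- {sorted(block_b)}")
```

A sample of n leaves always yields n-1 coalescence events, and the order in which the blocks merge is a coalescent topology in the sense of the tree-counting definitions above.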
Counting by Bijection

Bijection to a decision series: if there are $k_1$ choices at level 1, $k_2$ choices at level 2, ..., $k_L$ choices at level L, the number of outcomes is $N = k_1 k_2 \cdots k_L$.

Trees: Rooted, bifurcating & nodes time-ranked

Going back in time from k lineages, one of the $\binom{k}{2}$ pairs (i,j) coalesces, leaving k-1 lineages, then one of $\binom{k-1}{2}$ pairs, and so on.

Recursion: $T_k = \binom{k}{2} T_{k-1}$
Initialisation: $T_1 = T_2 = 1$
$T_n = \prod_{j=2}^{n} \binom{j}{2} = \prod_{j=2}^{n} \frac{j!}{(j-2)! \, 2} = \frac{n!(n-1)!}{2^{n-1}}$

k      3   4    5     6      7           8           9           10          15           20
T_k    3   18   180   2700   5.7 x 10^4  1.6 x 10^6  5.7 x 10^7  2.6 x 10^9  7.0 x 10^18  5.6 x 10^29

Trees: Unrooted & valency 3

Leaf n can be attached to any of the 2(n-1)-3 = 2n-5 edges of a tree with n-1 leaves.

Recursion: $T_n = (2n-5) T_{n-1}$
Initialisation: $T_1 = T_2 = T_3 = 1$
$T_n = \prod_{j=3}^{n-1} (2j-3) = \frac{(2n-5)!}{(n-3)! \, 2^{n-3}}$

n      4   5    6     7     8       9           10          15           20
T_n    3   15   105   945   10395   1.4 x 10^5  2.0 x 10^6  7.9 x 10^12  2.2 x 10^20

Coalescent versus unrooted tree topologies

4 leaves: 3 unrooted trees & 18 coalescent topologies.
1 unrooted tree topology contains 6 coalescent topologies.

Inner & outer branches, Fu & Li (1993)

External (e) versus internal ($\iota$) branches.
E(e) = 2
$E(\iota) = 2 \left( \sum_{i=1}^{n-1} \frac{1}{i} \right) - 2$

[Figure: external branches in red, the others are internal.]

Let $l_{n,i}$ be the length of the i'th external branch in an n-tree. Obviously $E(e) = n E(l_{n,i})$ (any i).

Except for the green branch in the figure, internal/external corresponds to non-singlet/singlet segregating sites if only one mutation can happen per position.

With $t_n$ the time during which there are n lineages:
$l_{n,i} = t_n$ with Pr = 2/n (branch i takes part in the first coalescence),
$l_{n,i} = l_{n-1,j} + t_n$ with Pr = 1 - 2/n.

Probability of hanging sub-trees, Kingman (1982b)

For a coalescent with n leaves at time 0 and k ancestors at time $t_1$, let $R_k$ be the groups of leaves of the k sub-trees hanging from time $t_1$. Let $\lambda_1, \lambda_2, .., \lambda_k$ be the numbers of leaves of these sub-trees.

$P\{R_k = \xi\} = \frac{(n-k)! \, k! \, (k-1)!}{n! \, (n-1)!} \lambda_1! \lambda_2! \cdots \lambda_k!$

Example: n=8, k=3. Classes observed: 4, 3, 1.
$\frac{5! \, 3! \, 2!}{8! \, 7!} \, 4! \, 3! \, 1! = \frac{1}{980} \approx 0.0010$

The basal division splits the leaves into (k,n-k) sets with probability 1/(n-1).

Nested subsamples (Saunders et al. (1986) Adv.Appl.Prob. 16.471-91.)
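The counting recursions and Kingman's sub-tree formula above are easy to check numerically. The sketch below is only an illustration; the function names are invented here, and exact integer and rational arithmetic is used so that the recursions can be compared with the closed forms without rounding.

```python
from fractions import Fraction
from math import comb, factorial, prod


def ranked_rooted_trees(k):
    """Coalescent topologies for k leaves: T_k = C(k,2) * T_{k-1}, T_1 = T_2 = 1."""
    return prod(comb(j, 2) for j in range(2, k + 1))


def unrooted_trees(n):
    """Unrooted valency-3 topologies: T_n = (2n-5) * T_{n-1}, T_1 = T_2 = T_3 = 1."""
    return prod(2 * m - 5 for m in range(4, n + 1))


def hanging_subtree_probability(sizes):
    """Kingman's probability of one specific grouping of the n leaves into
    k hanging sub-trees with the given numbers of leaves (sizes)."""
    n, k = sum(sizes), len(sizes)
    numerator = factorial(n - k) * factorial(k) * factorial(k - 1)
    numerator *= prod(factorial(s) for s in sizes)
    return Fraction(numerator, factorial(n) * factorial(n - 1))


if __name__ == "__main__":
    for leaves in (4, 5, 6, 10):
        print(leaves, ranked_rooted_trees(leaves), unrooted_trees(leaves))
    print(hanging_subtree_probability([4, 3, 1]))
```

For 4 leaves the two counts differ by a factor of 6, which is the number of coalescent topologies contained in one unrooted topology.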
Transitions

State (i,j): the sample has i ancestors, of which j are ancestors of the sub-sample ($j \le i$); the population has 2N genes. Going back in time from t=0 to t=$t_1$, a coalescence moves (i,j) to
(i-1,j-1) at rate $\binom{j}{2}$,
(i-1,j) at rate $\binom{i}{2} - \binom{j}{2}$.

State space: (1,1); (2,1), (2,2); (3,1), (3,2), (3,3); ...; (9,1), (9,2), ..., (9,9).

Nested subsamples (Saunders et al. (1986) Adv.Appl.Prob. 16.471-91.)

For a sample of size i and a sub-sample of size j:
Pr{MRCA(sub-sample) = MRCA(sample)} = $\frac{(i+1)(j-1)}{(i-1)(j+1)}$
Pr{MRCA(sub-sample) = MRCA(population)} = $\frac{j-1}{j+1}$

Age of a Mutation

Wiuf & Donnelly (1999), Wiuf (2000), Matthews (2000)

The probability that there are k differences between two sequences

Going back in time, 2 kinds of events can occur: a mutation (rate $\theta$) or a coalescence (rate 1). This gives a geometric distribution:

$P(k) = \left( \frac{\theta}{1+\theta} \right)^k \frac{1}{1+\theta}$

[Figure: two lineages with mutations marked; waiting times Exp(1) for the coalescence and Exp($\theta$) for mutations]

Polya Urns & Infinite Allele Model (Donnelly, 1986 + Hoppe, 1984+87)

The only observation made in the infinite allele model is identity/non-identity among all pairs of alleles, i.e. the central observation is a series of classes and their sizes.
The expected number of mutations in a unit interval (2N) is $\theta$.
This model gives rise to distributions on partitions of {1,2,..,n} like {1,4,7}{2,3}{5}{6}.
Since the labelling is arbitrary, only the information about the sizes of these groups is essential, for instance represented as $1^2 2^1 3^1$.
What is the next event: a duplication of an existing type or an introduction of a "new" allele?

Classical Polya Urns (Feller I)

Let $X_0$ be the initial configuration of the urn.
A step: take a random ball from the urn and put it back together with an extra ball of the same colour.
Let $X_k$ be the content after the k'th step and $Y_k$ the colour of the k'th picked ball.
i. $P\{Y_k = j\} = P\{Y_1 = j\}$.
ii. Sequences $Y_1 ... Y_k$ resulting in the same $X_k$ have the same probability.

Labelling, Polya Urns & Age of Alleles (Donnelly, 1986 + Hoppe, 1984+87)

[Figure: the same alleles labelled as they come, by size, and by age]

An Urn: A ball is picked proportionally to its weight. Ordinary balls have weight 1.
If the initial $\theta$-size ball is picked, it is replaced together with a ball of a completely new type.
If an ordinary ball is picked, it is replaced together with a copy of itself.
There is a simple relationship between the distribution of "the alleles labelled by age ranking" and that of "the alleles labelled by size ranking".

Ewens' formula (1972 TPB 3.87-112)

$P_5(2,0,1,0,0)$ is the probability of seeing 2 singletons and one allele in 3 copies in a sample of 5. Obviously, $a_1 + 2a_2 + .. + i a_i + .. + n a_n = n$.

$P_n(a_1,a_2,..,a_n) = \frac{n!}{\theta(\theta+1)..(\theta+n-1)} \prod_{j=1}^{n} \frac{\theta^{a_j}}{j^{a_j} a_j!}$

$E_n(k \text{ types}) = \sum_{j=1}^{n} \frac{\theta}{\theta + j - 1}$

$P_n(a_1,a_2,..,a_n; k) = \frac{n!}{|S_n^k| \, 1^{a_1} 2^{a_2} .. n^{a_n} \, a_1! a_2! .. a_n!}$

Here $|S_n^k|$ is the unsigned Stirling number of the first kind (the number of permutations of n items with k cycles), which is not the same as the second-kind number $S_{n,k}$ defined under Stirling Numbers.

k is a minimal sufficient statistic for $\theta$: the probability of the data conditioned on k is $\theta$-less and there is no simpler such statistic.

Stirling Numbers

Partitioning n items into k sets: Stirling Numbers (of second kind) $S_{n,k}$. The k bins are unlabelled and all non-empty.

n \ k    1    2    3     4     5     6    7      B_n
1        1                                       1
2        1    1                                  2
3        1    3    1                             5
4        1    7    6     1                       15
5        1    15   25    10    1                 52
6        1    31   90    65    15    1           203
7        1    63   301   350   140   21   1      877

Bell Numbers $B_n$: partitioning into any number of sets. Obviously: $B_n = \sum_{k=1}^{n} S_{n,k}$.

Stirling Numbers

Item "n" is added to a partition of the first n-1 items:
n-1 items in k classes: {..},{..},..,{..} and "n" joins one of the k classes (k ways),
n-1 items in k-1 classes: {..},{..},..,{..} and "n" forms a class of its own,
giving n items in k classes: {..},{..},..,{..}.

Basic Recursion: $S_{n,k} = k S_{n-1,k} + S_{n-1,k-1}$
Initialisation: $S_{n,1} = S_{n,n} = 1$.

Ewens' formula - example (1972 TPB 3.87-112)

Assume the configuration (2,0,1,0,0) has been observed and that 0.5 mutation is expected per unit (2N) time, i.e. $\theta = 0.5$.

$P_5(2,0,1,0,0) = \frac{5!}{0.5 \cdot 1.5 \cdot 2.5 \cdot 3.5 \cdot 4.5} \cdot \frac{0.5^3}{3 \cdot 2!} \approx 0.085$

$E_5(k \text{ types}) = 0.5 \left( \frac{1}{0.5} + \frac{1}{1.5} + \frac{1}{2.5} + \frac{1}{3.5} + \frac{1}{4.5} \right) \approx 1.79$

$P_5(2,0,1,0,0; 3) = \frac{5!}{35 \cdot 3 \cdot 2!} = \frac{4}{7} \approx 0.57$, using $|S_5^3| = 35$.

Ancestors to Ancestors, Griffiths (1980), Tavaré (1984)

$h_{i,j}(t)$ = probability that i individuals have j ancestors after time t.

$h_{i,j}(t) = \sum_{k=j}^{i} e^{-\binom{k}{2} t} \frac{(2k-1)(-1)^{k-j} \, j_{(k-1)} \, i_{[k]}}{j! \, (k-j)! \, i_{(k)}}$

$i_{[k]} = i(i-1)..(i-k+1)$
$i_{(k)} = i(i+1)..(i+k-1)$

Example: Disappearance of 7 lineages.
[Figure: Y, the number of ancestors, as a function of time t]
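The Ewens example and the Stirling recursions above can be reproduced with exact rational arithmetic. The sketch below is an illustration only: the function names are invented here, `a[j-1]` holds the number of allele types seen j times, and the conditional formula is assumed to use the unsigned Stirling number of the first kind, as in Ewens (1972).

```python
from fractions import Fraction
from functools import lru_cache
from math import factorial, prod


@lru_cache(maxsize=None)
def stirling_second(n, k):
    """Partitions of n items into k non-empty unlabelled sets."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling_second(n - 1, k) + stirling_second(n - 1, k - 1)


@lru_cache(maxsize=None)
def unsigned_stirling_first(n, k):
    """Permutations of n items with k cycles."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return (n - 1) * unsigned_stirling_first(n - 1, k) + unsigned_stirling_first(n - 1, k - 1)


def multiplicity(a):
    """The factor prod_j j^{a_j} a_j! shared by both Ewens formulas."""
    return prod(j ** aj * factorial(aj) for j, aj in enumerate(a, start=1))


def ewens_probability(a, theta):
    """P_n(a_1,..,a_n) for mutation parameter theta."""
    n = sum(j * aj for j, aj in enumerate(a, start=1))
    k = sum(a)
    rising = prod(theta + j for j in range(n))  # theta(theta+1)..(theta+n-1)
    return factorial(n) * theta ** k / (rising * multiplicity(a))


def expected_types(n, theta):
    """E_n(k types) = sum_{j=1..n} theta / (theta + j - 1)."""
    return sum(theta / (theta + j) for j in range(n))


def ewens_given_k(a):
    """P_n(a_1,..,a_n; k): probability conditioned on the number of types k."""
    n = sum(j * aj for j, aj in enumerate(a, start=1))
    k = sum(a)
    return Fraction(factorial(n), unsigned_stirling_first(n, k) * multiplicity(a))


if __name__ == "__main__":
    theta = Fraction(1, 2)
    config = [2, 0, 1, 0, 0]
    print(float(ewens_probability(config, theta)))
    print(float(expected_types(5, theta)))
    print(ewens_given_k(config))
    print([sum(stirling_second(n, k) for k in range(1, n + 1)) for n in range(1, 8)])
```

Passing `theta` as a `Fraction` keeps every result exact; a float also works if only approximate values are needed. The conditional probability does not take `theta` as an argument at all, which is the sufficiency statement in code form.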
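3 methods of solution:

i. Sum of different independent exponential distributions. With $T_k$ ~ Exp($\binom{k}{2}$) the time spent with k lineages:
$\{Y = j\} = \{T_i + .. + T_{j+1} \le t \ \& \ T_i + .. + T_{j+1} + T_j > t\}$

ii. Distribution in a Markov chain: i, i-1, .., j+1, j, j-1, .., 2, 1, where state k is left for state k-1 at rate $\binom{k}{2}$.

iii. Combination of known probabilities:
a. Probability that i alleles have i/fewer ancestors.
b. This probability is the same for all i-sets.
c. No coalescence within a set implies no coalescence within all subsets.

3 Ancestors to 2 Ancestors: $\frac{3}{2}(e^{-t} - e^{-3t})$

Let S(a,b) be the event that lineages a and b have not coalesced by time t, and S(1,2,3) the event that no coalescence has happened at all.
Each of S(1,2), S(1,3), S(2,3) has probability $e^{-t}$.
S(1,2,3) has probability $e^{-3t}$.
The event that exactly one given pair has coalesced (the regions marked "?" in the Venn diagram) has probability $(e^{-t} - e^{-3t})/2$.
Exactly one coalescence: $3 \cdot (e^{-t} - e^{-3t})/2$.

Jordan's Sieve:
$A_1 = 3e^{-t}$
$A_2 = 3 \, (e^{-t} + e^{-3t})/2$
$A_3 = e^{-3t}$
In exactly one set: $A_1 - 2A_2 + 3A_3 = 0$.
In exactly two sets: $A_2 - 3A_3 = \frac{3}{2}(e^{-t} - e^{-3t})$, since with two ancestors exactly two of the three pairs have not coalesced.

The exclusion-inclusion principle

Venn Diagrams:
{I+II} = {I} + {II} - {I&II}
{I+II+III} = {I} + {II} + {III} - ({I&II} + {I&III} + {II&III}) + {I&II&III}

Exclusion-inclusion & Jordan's Sieve

$S_j$, j=1,..,r are the given sets; $A_k$ is the sum over all intersections of k sets.

Total number (in some set): $\sum_{k=1}^{r} (-1)^{k-1} A_k$ (exclusion-inclusion)
In exactly m sets: $\sum_{k=m}^{r} (-1)^{k-m} \binom{k}{m} A_k$ (Jordan's Sieve)

Example with r = 4:
in 1 set: $A_1 - 2A_2 + 3A_3 - 4A_4$
in 2 sets: $A_2 - 3A_3 + 6A_4$
in 3 sets: $A_3 - 4A_4$
in 4 sets: $A_4$
in some set: $A_1 - A_2 + A_3 - A_4$

Surviving Lineages

Which probability statements can be made?
Let s be a subset of {1,2,..,i} and S(s) be the event that no coalescence has happened to s. Additionally, if s' is a subset of s, then S(s) implies S(s').

Size   Number of sets      Sets                                        P(S(s))
i      1                   {1,2,..,i}                                  $e^{-\binom{i}{2} t}$
i-1    i                   {2,..,i}, {1,3,..,i}, .., {1,2,..,i-1}      $e^{-\binom{i-1}{2} t}$
j      $\binom{i}{j}$      all j-subsets                               $e^{-\binom{j}{2} t}$
2      $\binom{i}{2}$      {1,2}, .., {i-1,i}                          $e^{-t}$

Surviving Lineages

There are $r = \binom{i}{j}$ sets of size j. Jordan's sieve gives the probability of the events that are members of exactly one of them:

$\sum_{k=1}^{r} (-1)^{k-1} k A_k$ where $A_k = \sum P(\bigcap S_i)$

Summation is over all k-subsets of {1,..,r} and the intersection is between the k sets chosen.

Conditional Ancestors-to-Ancestors

Given i lineages at time 0 and j ancestors at time t, the probability of k ancestors at time $t_1$ ($0 < t_1 < t$) is
$P_k(t_1) = h_{i,k}(t_1) \, h_{k,j}(t - t_1) / h_{i,j}(t)$

Example: 7 --> 4 lineages.
[Figure: number of lineages (7, 6, 5, 4) against time from 0 to t, with $t_1$ marked]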
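The ancestors-to-ancestors probability $h_{i,j}(t)$, the conditional version above and Jordan's sieve can be checked against each other numerically. The sketch below is an illustration under the conventions used in these slides (k lineages coalesce at rate $\binom{k}{2}$); the function names are invented here, and the floating-point alternating sum in `h` is only reliable for small i.

```python
from math import comb, exp, factorial, prod


def rising(x, k):
    """x(x+1)..(x+k-1), written x_(k) on the slides."""
    return prod(x + m for m in range(k))


def falling(x, k):
    """x(x-1)..(x-k+1), written x_[k] on the slides."""
    return prod(x - m for m in range(k))


def h(i, j, t):
    """Probability that i lineages have exactly j ancestors a time t ago."""
    total = 0.0
    for k in range(j, i + 1):
        coefficient = ((2 * k - 1) * (-1) ** (k - j) * rising(j, k - 1) * falling(i, k)
                       / (factorial(j) * factorial(k - j) * rising(i, k)))
        total += exp(-comb(k, 2) * t) * coefficient
    return total


def conditional_ancestors(i, j, k, t1, t):
    """P_k(t1): k ancestors at time t1, given i at time 0 and j at time t."""
    return h(i, k, t1) * h(k, j, t - t1) / h(i, j, t)


def jordan_sieve(A, m):
    """Probability of lying in exactly m sets, where A[k-1] is A_k."""
    r = len(A)
    return sum((-1) ** (k - m) * comb(k, m) * A[k - 1] for k in range(m, r + 1))


if __name__ == "__main__":
    t = 0.5
    print(h(3, 2, t), 1.5 * (exp(-t) - exp(-3 * t)))
    # Pairwise non-coalescence events for three lineages.
    A = [3 * exp(-t), 1.5 * (exp(-t) + exp(-3 * t)), exp(-3 * t)]
    print(jordan_sieve(A, 1), jordan_sieve(A, 2))
    print([conditional_ancestors(7, 4, k, 0.1, 0.3) for k in range(4, 8)])
```

The conditional probabilities for k = j,..,i sum to one, which is a convenient sanity check on any implementation of $h_{i,j}(t)$.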
Summary

Tree Counting & Tree Properties.
Basic Combinatorics.
Allele distribution.
Polya Urns + Stirling Numbers.
Number of ancestral lineages after time t.
Inclusion-Exclusion Principle.

Recommended Literature

Bender (1974) "Asymptotic Methods in Enumeration" SIAM Review 16.4.485
Donnelly (1986) Theor.Pop.Biol.
Ewens (1972) Theor.Pop.Biol.
Ewens (1989) "Population Genetics Theory - The Past and the Future"
Feller (1968+71) "Probability Theory and its Applications I + II" Wiley
Fu & Li (1993) "Statistical Tests of Neutrality of Mutations" Genetics 133.693-709.
Griffiths (1980)
Griffiths & Tavaré (1998) "The age of a mutation on a general coalescent tree"
Griffiths & Tavaré (1999) "The ages of mutations in gene trees"
Griffiths & Tavaré (2001) "The genealogy of a neutral mutation"
Hoppe (1984) "Polya-like urns and the Ewens' sampling formula" J.Math.Biol. 20.91-94
Kingman (1982) "On the Genealogy of Large Populations" 27-43.
Kingman (1982) "The Coalescent" Stochastic Processes and their Applications 13.235-248.
Kingman (1982)
Matthews, S. (1999) "Times on Trees, and the Age of an Allele" Theor.Pop.Biol. 58.61-75.
Möhle
Pitman
Schweinsberg
Simonsen & Churchill (1997)
Saunders et al. (1986) "On the genealogy of nested subsamples from a haploid population" Adv.Appl.Prob. 16.471-91.
Tajima (1983) "Evolutionary Relationships of DNA Sequences in Finite Populations" Genetics 105.437-60.
Tavaré (1984) "Line-of-Descent and Genealogical Processes, and Their Application in Population Genetics Models" Theor.Pop.Biol. 26.119-164.
Thompson, R. (1998) "Ages of mutations on a coalescent tree" Math.Bios. 153.41-61.
van Lint & Wilson (1991) "A Course in Combinatorics" Cambridge
Wiuf (2000) "On the Genealogy of a Sample of Neutral Rare Alleles" Theor.Pop.Biol. 58.61-75.
Wiuf & Donnelly (1999) "Conditional Genealogies and the Age of a Mutant" Theor.Pop.Biol. 56.183-201.