Download A 1

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Combinatorics & the Coalescent (26.2.02)
Tree Counting & Tree Properties.
Basic Combinatorics.
Allele distribution.
Polya Urns + Stirling Numbers.
Number af ancestral lineages after time t.
Inclusion-Exclusion Principle.
A set of realisations
(from Felsenstein)
Binomial Numbers
1
2
3
4
5
n
1
5
4
2
Binomial Expansion:
1 1

n 
n

   2
i 0 i 
n
3
a  b
Special Cases:
n

n!
n!
n


r r!(n  r)! r!b!
n
n
a a 
a  n
n i n i
  *  *....*    a b
   
  i 0 i 
b  b 
b 
11
n

n
i

  (1)  0
i 0 i 
n
+
n-1
1
5
4
2
r
3
+
n
4
2
r
n=0
1
1
2
1
3
1
4
1
1
1
0
1
n-r
3
n-r
1
1
k =
3

n

n



n

n
1

n

1



  
 
 Initialisation:     1
0 n 
r  r  r  1
Recursion:
1
4
6
5
7
2
r-1
1
6
n-1
1
5
n-r-1
5
n
6
7
2
2
3
4
5
1
1
4
3
6
10
15
1
4
10
20
21 35
3
2
4
8
1
5
15
35
5
16
1
6
21
1
7
6
32
7
1
64
128
The Exponential Distribution.
The Exponential Distribution: R+
Density: f(t) = ae-at,
0
1
2
P(X>t)= e-at
3
Properties: X ~ Exp(a)
i.
Y ~ Exp(b) independent
P(X>t2|X>t1) = P(X>t2-t1)
ii.
E(X) = 1/a.
iii.
P(X < Y) = a/(a + b).
iv.
min(X,Y)
v.
Expo(a)
(t2 > t1)
~ Exp (a + b).
Sums of k iid Xi is G(k,a) distributed
ak x k 1e ax
G(k)
The Standard Coalescent
Two independent Processes
Continuous: Exponential Waiting Times
Discrete: Choosing Pairs to Coalesce.
Waiting
{1,2,3,4,5}
Coalescing
(1,2)--(3,(4,5))
Exp2 
{1,2}{3,4,5}
 
2 
 
Exp3 
{1,2}{3}{4,5}
 
2 
 
Exp4 
{1}{2}{3,4,5}
 
 2 
 
Exp5 
{1}{2}{3}{4}{5}
1
2
3
4
5
 
2 
 
1--2
3--(4,5)
4--5
Tree Counting
Tree: Connected undirected
graph without cycles. k nodes
(vertices) & k-1 edges. Nodes
with one edge are leaves
(tips) - the rest are internal.
s5
s2
Labels of internal nodes are
permutable without change of
biological interpretation. If labels
at leaves are ignored we have the
shape of a tree.
r
a1 a2
s1
s3
a3
s4
Ignore root & branch lengths gives
unrooted tree topology.
a4
s6
Most biological trees are bifurcating.
Valency 3 (number of edges touching
internal nodes) if made unrooted. Such
unrooted trees have n-2 internal nodes &
2n-3 edges.
If age ordering of internal nodes
are retained this gives the
coalescent topology.
Counting by Bijection
Bijection to a decision series:
Level 0
Level 1
1
Level 2
Level L
1
2
1
2
2
3
3
3
k1
N=k1*k2*...*kL
k2
N
Trees: Rooted, bifurcating & nodes time-ranked.
i
1
j
(i,j)
m
1
k
(i,j)
1
n
k-1
(n,m) k-2

k 

2

k 1

 2 

k  2

 2 

k 

2Tk-1

3
 3
2
Initialisation: T1= T2=1

2
 1
2
Recursion: Tk=
k

j
j!
j!( j  1)!

 


2 j 2 ( j  2)!2
2 n1
j 2 
k
3
4
5
6
7
8
9
10
3
18
180
2700
5.7 104 1.5 106 5.7 107 2.5 109
15
20
6.9 1018
5.6 1029
Trees: Unrooted & valency 3
1
2
1
3
1
3
1
2
4
2
3
5
2
4
1
3
2
5
4
1
4
(2n  5)!
 (2 j  3)  (n  2)!2n 2 3
j 3
3
3
5
5
5
1
2
Recursion: Tn= (2n-5) Tn-1
n1
4
4
3
1
2
4
1
2
3
3
4
2
5
Initialisation: T1= T2= T3=1
6
7
15 105 945
8
9
10
15
20
10345
1.4 105
2.0 106
7.9 1012
2.2 1020
4
Coalescent versus unrooted tree topologies
4 leaves: 3 unrooted trees & 18 coalescent topologies.
1 unrooted tree topology contains 6 coalescent topologies.
4
1
2
1
3
2
4
3
4
1 2
3
4 1
2
3
4
Inner & outer branches
Fu & Li (1993)
External (e) versus Internal  Branches.
k 1
E(e) = 2 E() = ( 1i )  2
i1
Red - external. Others internal.
Let li,n be length of i’th external
branch in an n-tree. Obviously
E(e) = nE(ln,i) (any i)
Except for green branch, internalexternal corresponds to singlet/nonsinglet segregating sites if only one
mutation can happen per position.
ACTTGTACGA
ACTTGTACGA
ln-1,j+ tn
Pr= 1-2/n
Ln,i =
ACTTGTACGA
TCTTATACGA
ACTTATACGA
tn
Pr= 2/n
s
n
Probability of hanging Sub-trees.
Kingman (1982b)
For a coalescent with n leaves at time 0, with k ancestors at time t1, let 
be the groups of leaves of the k subtrees hanging from time t1. Let 1, 2
.., k be the number of leaves of these sub-trees.
P{Rk   ) 
1
1
2
2
3
k
4
t=t1
n
t=0
(n  k)!k!(k 1)!
1! 2 !.. k!
n!(n  1)!
Example: n=8, k=3. Classes observed : 4, 3, 1
5!3!2!
4!3!1! 0.0012
8!7!
The basal division splits the leaves into
(k,n-k) sets with probability: 1/(n-1).
Nested subsamples
(Saunders et al.(1986) Adv.Appl.Prob.16.471-91.)
Transitions
t=t1
2N

i 
 
j 
i-1,j 2 2


j 

i-1,j-1 
2 
i,j
j’
i’
i,j
i,j
1,1
t=0
2N
j
Population
Sub-sample
i
Sample
2,1
2,2
3,1
3,2
3,3
4,1
4,2
4,3
4,4
5,1
5,2
5,3
5,4
5,5
6,1
6,2
6,3
6,4
6,5
6,6
7,1
7,2
7,3
7,4
7,5
7,6
7,7
8,1
8,2
8,3
8,4
8,5
8,6
8,7
8,8
9,1
9,2
9,3
9,4
9,5
9,6
9,7
9,8
9,9
Nested subsamples
(Saunders et al.(1986) Adv.Appl.Prob.16.471-91.)
Pr{MRCA(sub-sample) = MRCA(sample)} =
(i  1)( j  1)
(i  1)( j  1)
Pr{MRCA(sub-sample) = MRCA(population)} =
( j  1)
( j  1)
Age of a Mutation
Wiuf & Donnelly (1999) Wiuf (2000), Matthews (2000)
The probability that there are k differences between two
sequences. Going back in time 2 kinds of events can occur
(mutations ( - or a coalescent (1). This gives a geometric
distribution: 1 (  )k
1  1 
--*-------*------*-----
----*----*----*----*---
Exp(1)
Exp()
Polya Urns & Infinite Allele Model
(Donnelly,1986 + Hoppe,1984+87)
The only observation made in the infinite allele models is
identity/non-identity among all pairs of alleles. I.e. The
central observation is a series of classes and their sizes.
Expected number
of mutations in unit
interval (2N) is .
This model will give rise to distributions on partitions of {1,2,..,n} like
{1,4,7}{2,3}{5}{6}. Since the labelling is arbitrary, only the
information about the size of these groups is essential for instance
represented as 122131.
What is the next event - a duplication of an exiting type or a introduction
of a ”new” allele.
Classical Polya Urns
Feller I.
1
2
3
Let X0 be the initial configuration of the initial Urn.
A step: take a random ball the urn and put it back together with an extra of the
same colour.
Xk be the content after the k’th step. Let Yk be the colour of the k’th picked
ball.
i. P{Yk =j} = P{Y1 =j}.
ii. Sequences Y1 ... Yk resulting in the same Xk - has the same probability.
Labelling, Polya Urns & Age of Alleles
(Donnelly,1986 + Hoppe,1984+87)
As they come
By size
By age
An Urn:
A ball is picked proportionally to its weight. Ordinary balls
have weight 1.

If the initial -size ball is picked, it is replaced together with a
completely new type.
1
 1
2
1
If an ordinary ball is picked, it is replaced together with a copy
of itself.
There is a simple relationship between the distribution of ”the alleles labeled
with age ranking” is the same as ”the alleles labeled with size ranking”
Ewens' formula.
(1972 TPB 3.87-112)
P5(2,0,1,0,0) is the probability of seeing 2 singles and
one allele in 3 copies in a sample of 5.
Obviously,
Pn(a1,a2,
a1+2a2+ +iai +nan=n
a
n
n!
 j
)
(
(  1)..(  n 1) j1 j a j a j!
,an) =
En(k types) =
  j  1
n
j1
Pn(a1,a2,
,an;k) =
n!
n a a
a
Sk 1 1 2 2 ..k k a1!a2 !..ak!
k is a minimal sufficient statistic for 
the
probability of the data conditioned on k is -less and
there is no simpler such statistic.
Stirling Numbers
Partitioning into k sets - Stirling Numbers (of second kind) - Sn,k
1
2
3
4
5
6
n
1
2
3
k
k unlabelled bins - all non-empty.
k
n
1
2
3
4
5
6
1
1
2
1
1
3
1
3
1
4
1
7
6
1
5
1
15
25
10
1
6
1
31
90
65
15
1
7
1
63
301
350
140
21
7
1
2
5
15
52
193
1
Bell Numbers - Bn - Partioning into any number of sets.
Obviously:
n
Bn   Sn,k
k 1
B
877
Stirling Numbers
n-1 items - k classes:
{..},{..},..,{..}
+ ”n”
(n-1,k-1): {..},{..},..,{..}
+ ”n”
(n,k) : {..},{..},..,{..}
Basic Recursion: Sn,k = kSn-1,k + Sn-1,k-1
Initialisation:
Sn,1 = Sn,n = 1.
Ewens' formula - example.
(1972 TPB 3.87-112)
Assume
has
been observed and that 0.5 mutation is
expected per unit (2N) time.
a
n
n!
 j
)
(
(  1)..(  n 1) j1 j a j a j!
P5(2,0,1,0,0) =
5!
0.5 3

*
0.5 *1.5 *2.5 * 3.5* 4.5
3
n
E5(k types) =

  j  1
j1
P5(2,0,1,0,0;3) =
1
1
1
1
1
 0.5(




)
0.5 1.5 2.5 3.5 4.5
n!
n a a
a
Sk 1 1 2 2 ..k k a1!a2 !..ak!
5!

25 * 3*2!
Ancestors to Ancestors
Griffiths(1980), Tavaré(1984)
hi,j = probability that i individuals has j ancestors after time t.
i
 k

 t
 2
 
hi, j   e
k j
(2k  1)(1) k j j(k1)i[k ]
j!(k  j)!i(k )
i[k] = i(i-1)..(i-k+1)
Example: Disappearance of 7 lineages.
i (k) = i(i+1)..(i+k-1)
Y:# of Ancestors to time t.
3 methods of solution:
i.Sum of different independent exponential distributions:
{Y  j}  {Exp i   .. Exp j   t & Expi   .. Exp j   Exp j1  t}
 
 2 
 
 
 2 
 
 
2 
 
 
 2 
 
 
 2 
 
ii. Distribution in markov chain:
i-1
i

i 

2 
j+1
j
j-1

j 

2 
1
1
iii. Combination of known probabilities:
a. Probability that i alleles has i/less ancestors.
b. This probability is the same for all i-sets
c. No coalescence within a set, implies no
coalescence within all subsets.
t
3 Ancestors to 2 Ancestors : (3/2)(e-t - e-3t)
e-t
1,2
?
(2,3)
(1,2)
(1,3)
(1,2)
e-3t
1,2,3
2,3
e-t
(2,3)
(1,3)
{1,2,3}
?
{1,2}
{1,3}
{2,3}
e-t
?
1,3
?: (e-t - e-3t)/2
Exactly one coalescence:3(e-t- (e-t - e-3t)/2)-e-3t)
Jordan’s Sieve:
A1
3e-t
:
- 2A2 :
2
+ 3A3 :
3
((e-t + e-3t)/2)
e-3t
The exclusion-inclusion principle.
Venn Diagrams:
2
1
I
1
2
1
II
I
2
3
1
II
2
1
III
{I + II} - {I} + {II} + {I&II} = 0
{I+II+III} = {I}+ {II}+ {III}
- ({I,II} + {I,III} + {II,III})
+{I,II,III}
Exclusion-inclusion& Jordan’s Sieve
Sj j=1,..,r the given sets, Ak - sum of
intersection of k sets
r
Total number:
 (1)
k 1
1
1
2
2
Ak
4
3
2
3
1
1
 (1)
k m
k m
Example: the elements above:
in 1 sets
A1 - 2A2

k 
A
m  k
1
+ 3A3
- 4 A4
in 2 sets
A2 - 3A3
+ 6 A4
in 3 sets
A3
- 4 A4
in 4 sets
in some set A1
3
2
2
r
(Jordan’s Sieve)
3
2
k 1
In exactly m sets:
2
(Jordan’s Sieve)
A4
-
A2
+
A3
-
A4
exclusion-inclusion
Surviving Lineages
Which probability statements can be made? Let s be subset of i
{1,2,..i} and S(s) be the event that no coalescence has happened
to s. Additionally, if s’ is a subset of s, then S(s) implies S(s’).
Size number
i
1
i-1
i
{1,2,..,i}
e
{2,..,i}
e
j

i 

j 
2

i 

2 
 i1 


 2 t
 
{1,2}
e-t
 i 


 2 t
 
{1,3..,i}
e
{1,2,..,i-1}
 i1 


 2 t
 
e
 i1 


 2 t
 
(i-1,i)
e-t
Surviving Lineages
There are
r

i 

r  
j 
 (1)
k 1
k j
sets. We want events member of only one of them.

k 
A
j  k
where
Ak  
Si
i
Summation is over all k-subsets of {1,..,r} and intersection is
between the k sets chosen.
Conditional Ancestors-to-Ancestors
7
0
7
6
5
4 t
1
Pk(t1) = hi,k(t1)* hk,j(t- t1)/ hi,j(t)
Example: 7 --> 4 lineages.
4t
Summary
Tree Counting & Tree Properties.
Basic Combinatorics.
Allele distribution.
Polya Urns + Stirling Numbers.
Number af ancestral lineages after time t.
Inclusion-Exclusion Principle.
Recommended Literature
Bender(1974) Asymmptotic Methods in Enumeraion Siam Review vol16.4.485Donnelly (1986) ”Theor.Pop.Biol.
Ewens (1972) Theor.Pop.Biol.
Ewens (1989) ”Population Genetics Theory - The Past and the Future”
Feller (1968+71) Probability Theory and its Applications I + II Wiley
Fu & Li (1993) ”Statistical Tests of Neutrality of Mutations” Genetics 133.693-709.
Griffiths (1980)
Griffiths & Tavaré(1998) ”The Age of a mutation on a general coalescent tree.
Griffiths & Tavaré(1999) ”The ages of mutations in gene trees”
Griffiths & Tavaré(2001) ”The genealogy of a neutral mutation”
Hoppe (1984) ”Polya-like urns and the Ewens’ sampling formula” J.Math.Biol. 20.91-94
Kingman (1982) ”On the Genealogy of Large Populations” 27-43.
Kingman (1982) ”The Coalescent” Stochastic Processes and their Applications 13..235-248.
Kingman (1982)
Matthews,S.(1999) ”Times on Trees, and the Age of an Allele” Theor.Pop.Biol. 58.61-75.
Möhle
Pitman
Schweinsberg
Simonsen & Churchill (1997)
Saunders et al.(1986) ”On the genealogy of nested subsamples from a haploid population” Adv.Apll.Prob. 16.471-91.
Tajima (1983) Evolutionary Relationships of DNA Sequences in Finite Poulations Genetics 105.437-60.
Tavaré (1984) Line-of-Descent and Genealogical Processes, and Their Application in Population Genetics Models. Theor.Pop.Biol. 26.119-164.
Thompson,R. (1998) ”Ages of mutations on a coalescent tree” Math.Bios. 153.41-61.
van Lint & Wilson (1991) A Course in Combinatorics - Cambridge
Wiuf (2000) On the Genealogy of a Sample of Neutral Rare Alleles. Theor.Pop.Biol. 58.61-75.
Wiuf & Donnelly (1999) Conditional Genealogies and the Age of a Mutant. Theor. Pop.Biol. 56.183-201.
Related documents