The Coalescent
January 27, 2015
1 Derivation
Suppose that k chromosomes are randomly sampled from a population containing N diploid
individuals that is described by the Wright-Fisher model. Our objective is to characterize the
distribution of the genealogical tree that describes how these chromosomes are related to
one another at a fixed, non-recombining locus. This tree is a random tree both because the
sample has been chosen at random and because the relationships between the individuals alive
in one generation and those alive in the next generation are determined at random under the
Wright-Fisher model.
If k = 2, then the relatedness of the two chromosomes is completely determined by the time (in
generations) to their most recent common ancestor (MRCA). Let us denote this random
time by the variable t2 . Since t2 = 1 if and only if the two chromosomes share the same ancestor
in the previous generation, it is easy to see that this event has probability P(t2 = 1) = 1/2N .
Similarly, for the event t2 = 2 to occur, the two chromosomes must have distinct ancestors in
the previous generation and these two ancestral chromosomes must have a common ancestor in
the generation before that. Since parent-offspring relationships are assigned independently under
the Wright-Fisher model, it follows that
\[ P(t_2 = 2) = P(t_2 > 1) \cdot P(t_2 = 2 \mid t_2 > 1) = \left(1 - \frac{1}{2N}\right) \frac{1}{2N}. \]
Since, in general, t2 = t ≥ 1 if and only if distinct ancestors are chosen in each of the first t − 1
generations before the present and a common ancestor is chosen in the t’th generation, we have
\[ P(t_2 = t) = \left(1 - \frac{1}{2N}\right)^{t-1} \frac{1}{2N}, \]
which shows that t2 is a geometrically-distributed random variable with parameter p = 1/2N .
We say that the two lineages ancestral to the sampled chromosomes coalesce at time t2 and
we call this event a coalescent event. Our result implies that a pair of randomly sampled
lineages coalesces at rate 1/2N per generation and that pairs of randomly sampled chromosomes
are typically less closely related in larger populations than in smaller populations.
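The geometric law for the pairwise coalescence time can be checked numerically. The following Python sketch is only an illustration under the assumptions of the text (each lineage picks its parental chromosome uniformly from the 2N chromosomes of the previous generation); the function names are hypothetical and not part of the original notes.

```python
import random

def sample_t2(N, rng):
    """Generations back until two lineages pick the same parental chromosome."""
    t = 0
    while True:
        t += 1
        # Each lineage independently picks one of the 2N parental chromosomes.
        if rng.randrange(2 * N) == rng.randrange(2 * N):
            return t

def mean_t2(N, reps, seed=0):
    """Monte Carlo estimate of E[t2]; the geometric law predicts 2N."""
    rng = random.Random(seed)
    return sum(sample_t2(N, rng) for _ in range(reps)) / reps

def geometric_pmf(t, N):
    """P(t2 = t) = (1 - 1/2N)^(t-1) * 1/2N."""
    p = 1 / (2 * N)
    return (1 - p) ** (t - 1) * p

if __name__ == "__main__":
    print(mean_t2(10, 20000))  # should be close to 2N = 20
```

Doubling N in this sketch roughly doubles the estimated mean, which is the sense in which pairs of chromosomes are less closely related in larger populations.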
When $k > 2$, we must consider both the topology and the branch lengths of the genealogy.
Suppose that the sampled chromosomes are assigned labels $1, \ldots, k$ and consider how these can
be related through events happening one generation into the past. One possibility is that $j > 2$
of these chromosomes share a common ancestor in the previous generation, which for any given set
of $j$ chromosomes happens with probability $O((2N)^{-(j-1)})$. Similarly, the probability that two
distinct pairs (say $(a, b)$ and $(c, d)$) each share a distinct common ancestor in the previous
generation is $O((2N)^{-2})$. In general, as long as $k \ll 2N$, the probability that more than two
lineages coalesce in a single generation is $o(1/2N)$. In contrast, since there are
$\binom{k}{2} = \frac{1}{2} k(k-1)$ distinct pairs of chromosomes in our sample, the probability that
exactly one of these pairs coalesces during the previous generation is equal to
\[ \binom{k}{2} \frac{1}{2N} \cdot \prod_{i=3}^{k} \frac{2N - (i-2)}{2N} = \binom{k}{2} \frac{1}{2N} + O\left(\frac{1}{(2N)^2}\right). \]
Since this probability is of order $N$ times larger than the probability that more than two lineages
participate in coalescent events in a single generation, it follows that when $2N$ is large and
$k \ll 2N$, it is very likely that the genealogy will only contain pairwise mergers, i.e., the random
tree will be a binary tree. In fact, by taking the limit $N \to \infty$, we can guarantee that this will
be the case with probability one. Henceforth we will assume that $N$ and $k$ are such that we can
neglect any event that isn't a simple pairwise merger.
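To see how small the neglected events are, the probabilities above can be evaluated exactly for particular values of k and N. The sketch below is an illustration only; the function names are hypothetical, and it assumes that each lineage chooses its parent uniformly and independently among the 2N chromosomes.

```python
from fractions import Fraction
from math import comb

def prob_single_pair(k, N):
    """Exact probability that exactly one pair of k lineages coalesces in one
    generation while every other lineage keeps its own distinct parent."""
    two_n = 2 * N
    p = Fraction(comb(k, 2), two_n)
    for i in range(3, k + 1):
        p *= Fraction(two_n - (i - 2), two_n)
    return p

def prob_no_coalescence(k, N):
    """Exact probability that all k lineages have distinct parents."""
    two_n = 2 * N
    p = Fraction(1)
    for i in range(1, k):
        p *= Fraction(two_n - i, two_n)
    return p

def prob_other_events(k, N):
    """Probability of any event other than 'no merger' or 'one pairwise merger'."""
    return 1 - prob_no_coalescence(k, N) - prob_single_pair(k, N)

if __name__ == "__main__":
    k, N = 5, 1000
    print(float(prob_single_pair(k, N)), float(prob_other_events(k, N)))
```

For k = 5 and N = 1000 the single-pair probability is close to 10/2000, while the remaining events have probability of order 1/(2N)^2.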
If no lineages coalesce in the previous generation, then once again we can appeal to the
independence of parent-offspring relationships between generations to argue that when there are $k$
lineages ancestral to our sample the pairwise coalescent rate is $p_k \equiv \binom{k}{2} \frac{1}{2N}$
per generation. Thus, if we let $t_k$ be the random time until the first coalescent event involving
our sample, then $t_k$ is approximately geometrically distributed with parameter $p_k$. Furthermore,
since the ancestral lineages are chosen at random in each generation without a coalescent event, it
follows that each of the $\binom{k}{2}$ pairs of lineages is equally likely to be the pair that
coalesces at time $t_k$. At this time, the number of lineages ancestral to the sample is reduced from
$k$ to $k-1$ and we can repeat this reasoning to argue that the next coalescent event will occur after
an additional $t_{k-1}$ generations, where $t_{k-1}$ is a geometrically-distributed random variable
with parameter $p_{k-1} = \binom{k-1}{2} \frac{1}{2N}$. This process is continued until the final pair
of lineages coalesces, at which point we have reached the most recent common ancestor of the $k$
chromosomes in our sample (also called the root of the tree).
When N is large and time is measured in units of 2N generations, we can approximate the
geometric distributions of the waiting times tk by exponentially-distributed random variables:
\[ \frac{1}{2N} t_k \approx \tau_k \sim \mathrm{Exponential}\left(\binom{k}{2}\right). \]
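The quality of the exponential approximation can be illustrated by comparing survival probabilities. This is a minimal Python sketch with hypothetical function names; it treats the waiting time as exactly geometric with parameter equal to the pairwise coalescent rate, which the text itself describes as an approximation.

```python
from math import comb, exp

def geometric_survival(k, N, tau):
    """P(t_k > 2N * tau) when t_k is geometric with parameter C(k,2) / 2N."""
    p_k = comb(k, 2) / (2 * N)
    generations = int(2 * N * tau)
    return (1 - p_k) ** generations

def exponential_survival(k, tau):
    """P(tau_k > tau) when tau_k is exponential with rate C(k,2)."""
    return exp(-comb(k, 2) * tau)

if __name__ == "__main__":
    for N in (10, 100, 10_000):
        print(N, geometric_survival(3, N, 0.5), exponential_survival(3, 0.5))
```

The two columns agree more closely as N grows, with time measured in units of 2N generations.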
With this approximation, we arrive at the following description of the genealogical tree of a
random sample of size k, which is exact in the limit as N → ∞. This result is due to Kingman
(1982) and is known as Kingman’s coalescent. This description also yields an algorithm that
can be used to sample random trees with the correct limiting distribution.
• Generate a sequence of independent exponentially-distributed random variables, $\tau_k, \ldots, \tau_2$,
where $\tau_i$ has rate parameter $\binom{i}{2}$. $\tau_i$ is the waiting time until the next pairwise
coalescent event when the tree contains exactly $i$ branches.
• At the time of each coalescent event, choose two of the branches contained in the tree at that
time uniformly at random and merge them.
• This process halts once the root of the tree has been reached.
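The three steps above can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the function name and the representation of the tree as a list of merge events are assumptions made here, and time is measured in units of 2N generations.

```python
import random
from math import comb

def simulate_kingman(k, rng):
    """Simulate Kingman's coalescent for a sample of size k.

    Returns a list of (time, merged_block) pairs, one per coalescent event,
    where merged_block is the set of sample labels below the new node.
    """
    blocks = [frozenset([i]) for i in range(1, k + 1)]
    t = 0.0
    events = []
    while len(blocks) > 1:
        i = len(blocks)
        # Waiting time tau_i: exponential with rate C(i, 2).
        t += rng.expovariate(comb(i, 2))
        # Choose the coalescing pair uniformly at random.
        a, b = rng.sample(range(i), 2)
        merged = blocks[a] | blocks[b]
        blocks = [blk for idx, blk in enumerate(blocks) if idx not in (a, b)]
        blocks.append(merged)
        events.append((t, merged))
    return events

if __name__ == "__main__":
    for time, block in simulate_kingman(5, random.Random(1)):
        print(round(time, 3), sorted(block))
```

The time of the last event is the depth of the tree, and the sequence of merged blocks encodes its topology.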
Because the coalescent only tracks the relationships of lineages ancestral to our sample, this
algorithm is substantially more efficient than direct simulations of the full Wright-Fisher model,
which in general will require that $O(N^2)$ random variables be simulated. In contrast,
simulation of the coalescent for a sample of size $k$ only requires $2k - 3$ random variables
($k - 1$ waiting times and $k - 2$ random choices of coalescing pairs, since the final merger is forced).
2 Shape
2.1 Branch Lengths
Due to the quadratic relationship between the coalescent rate and the number of branches in
the tree, coalescent trees have a fairly characteristic shape. In particular, since the mean of
an exponentially distributed random variable is inversely proportional to its rate parameter, it
follows that
\[ E[\tau_i] = \frac{2}{i(i-1)}. \]
This shows that the branches near the root of the tree will typically be much longer than the
branches near the leaves (the sampled chromosomes). Furthermore, the depth of the tree remains
finite even in the limit as k → ∞. To see this, observe that since the total time elapsed until the
most recent common ancestor is the sum of the waiting times between coalescent events,
\[ T_{\mathrm{mrca}}^{(k)} = \tau_k + \cdots + \tau_2, \]
it follows that
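\[ E\left[T_{\mathrm{mrca}}^{(k)}\right] = \sum_{j=2}^{k} E[\tau_j] = \sum_{j=2}^{k} \frac{2}{j(j-1)} = 2 \sum_{j=2}^{k} \left(\frac{1}{j-1} - \frac{1}{j}\right) = 2\left(1 - \frac{1}{k}\right). \]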
In particular, the limiting expectation
\[ \lim_{k \to \infty} E\left[T_{\mathrm{mrca}}^{(k)}\right] = 2 \]
is finite. Also, since $E[\tau_2] = 1$, the mean waiting time for the final coalescent event, during
which only two lineages remain, is more than half the mean depth of the entire tree.
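The telescoping sum can be verified with exact rational arithmetic. The short Python sketch below is only a check of the formula; the function name is hypothetical.

```python
from fractions import Fraction

def expected_tmrca(k):
    """Sum of E[tau_j] = 2 / (j (j - 1)) for j = 2, ..., k, in units of 2N generations."""
    return sum(Fraction(2, j * (j - 1)) for j in range(2, k + 1))

if __name__ == "__main__":
    for k in (2, 5, 10, 100):
        depth = expected_tmrca(k)
        # Share of the expected depth contributed by the final interval, E[tau_2] = 1.
        print(k, depth, float(1 / depth))
```

For every k the share printed in the last column exceeds one half, and the expected depth approaches 2 from below as k grows.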
2.2 Topology
The topology of a genealogical tree can be investigated with the help of the following useful
device. Recall that a partition of a set A is a collection of subsets {A1 , · · · , Ak } such that the
sets are disjoint, i.e., Ai ∩ Aj = ∅ whenever i ≠ j, and A = A1 ∪ · · · ∪ Ak . Here we will consider
sets of the form A = {1, · · · , n} and we will let Pn be the collection of all partitions of A. Given a
partition ξ = {ξ1 , · · · , ξk } ∈ Pn , the subsets ξ1 , · · · , ξk are also called the blocks of the partition
and we will write |ξ| = k to denote the number of blocks in ξ.
Given a genealogical tree, T , of a sample of n chromosomes labeled 1, · · · , n, we can construct a
sequence of partitions (ξt : 0 ≤ t ≤ Tmrca ) ⊂ Pn in the following way. For each time t ∈ [0, Tmrca ],
let ξt be the partition of {1, · · · , n} formed from the subsets of lineages that have coalesced by
time t. In other words, lineages i and j will belong to the same block in ξt if and only if they
share a common ancestor by time t. Since no lineages will have coalesced at time t = 0, it follows
that ξ0 = {{1}, · · · , {n}} is the partition containing n singletons. In contrast, since all lineages
will have coalesced by time Tmrca , it also follows that ξTmrca is the partition consisting of the
single block {1, · · · , n}.
By associating genealogical trees with sequences of partitions, we can re-interpret Kingman's
coalescent as a continuous-time Markov chain $(\xi_t : t \ge 0)$ with values in the set $\mathcal{P}_n$. When
$\xi_t$ contains $k = |\xi_t|$ blocks, transitions occur at rate $\binom{k}{2}$ and each pair of blocks is
equally likely to merge or coalesce at the time of the jump. Since the partitions do not change
between coalescent events, it will be convenient to focus on the discrete sequence of $n$ partitions
$\xi_n^{(n)}, \ldots, \xi_1^{(n)}$ visited by the continuous-time process. Here $\xi_k^{(n)} \in \mathcal{P}_n$ is the
partition visited by the continuous-time process when there are precisely $k$ blocks. The sequence
$(\xi_k^{(n)} : n \ge k \ge 1)$ is itself a discrete-time Markov chain which evolves according to the
following rule. The partition $\xi_{k-1}^{(n)}$ is obtained by choosing two blocks in $\xi_k^{(n)}$ uniformly at
random and merging them into a single block. This construction motivates the following definition.
Given two partitions $\xi, \nu \in \mathcal{P}_n$ with $|\nu| = |\xi| - 1$, let us say that $\xi$ is a refinement of
$\nu$, written $\xi < \nu$, if $\nu$ can be formed by merging two blocks in $\xi$ into a single block. The next
theorem describes the marginal distributions of the variables $\xi_k^{(n)}$.
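Theorem 1. If $\xi \in \mathcal{P}_n$ is a partition with $k$ blocks of sizes $\lambda_1, \ldots, \lambda_k$, then
\[ P\left(\xi_k^{(n)} = \xi\right) = c_{n,k} \cdot w(\xi) \]
where
\[ w(\xi) = \prod_{i=1}^{k} \lambda_i! \quad \text{and} \quad c_{n,k} = \frac{k!}{n!} \cdot \frac{1}{\binom{n-1}{k-1}}. \]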
Proof. We can use the fact that $(\xi_k^{(n)} : n \ge k \ge 1)$ is a Markov process to give a backwards
induction argument (i.e., start at $n$ and then work down to 1). If $k = n$, then by definition $\xi_n^{(n)}$
is the partition $\xi$ containing the $n$ singleton sets $\{1\}, \ldots, \{n\}$. In this case,
$\lambda_1 = \cdots = \lambda_n = 1$ and so $w(\xi) = 1$. Since $c_{n,n} = 1$, the result is confirmed in this case.
Now suppose that the result holds for some $k \in \{2, \ldots, n\}$ and let $\nu \in \mathcal{P}_n$ be a partition
containing $k - 1$ blocks with sizes $\lambda_1, \ldots, \lambda_{k-1}$. Our objective is to show that the result
holds for $\nu$ by conditioning on $\xi_k^{(n)}$. By the law of total probability, we have
\[ P\left(\xi_{k-1}^{(n)} = \nu\right) = \sum_{\xi < \nu} P\left(\xi_{k-1}^{(n)} = \nu \mid \xi_k^{(n)} = \xi\right) \cdot P\left(\xi_k^{(n)} = \xi\right), \]
where the sum is over all partitions $\xi$ which are refinements of $\nu$. Since the probabilities
$P(\xi_k^{(n)} = \xi)$ are specified by the induction hypothesis, we only need to evaluate the transition
probabilities that appear in the sum. However, since each pair of blocks in $\xi$ is equally likely to
coalesce, we see that
\[ P\left(\xi_{k-1}^{(n)} = \nu \mid \xi_k^{(n)} = \xi\right) = \frac{1}{\binom{k}{2}} \]
for every partition $\xi$ with $\xi < \nu$. Notice that each partition $\xi$ with $\xi < \nu$ can be obtained
(in two ways) by first choosing one of the $k - 1$ blocks contained in $\nu$, say the $i$'th, and then
splitting it into two blocks of sizes $a$ and $\lambda_i - a$, where $1 \le a \le \lambda_i - 1$. Furthermore,
there are $\binom{\lambda_i}{a}$ different ways of choosing the elements that will be assigned to the block
of size $a$. In fact, this procedure will generate the same partition $\xi$ in two ways, either by
choosing $a$ and then a particular collection of $a$ elements from $\nu_i$ or by choosing $\lambda_i - a$ and
then the other $\lambda_i - a$ elements from $\nu_i$. For this reason, we will need to multiply by a factor
of 1/2 in the following calculations. Combining these several observations gives
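\[
\begin{aligned}
P\left(\xi_{k-1}^{(n)} = \nu\right)
&= \sum_{\xi < \nu} \frac{2}{k(k-1)} P\left(\xi_k^{(n)} = \xi\right) \\
&= \frac{2}{k(k-1)} \cdot \frac{1}{2} \sum_{i=1}^{k-1} \sum_{a=1}^{\lambda_i - 1} \binom{\lambda_i}{a} c_{n,k} \cdot w(\xi) \\
&= \frac{2}{k(k-1)} \cdot \frac{1}{2} \cdot \frac{k!}{n!} \cdot \frac{1}{\binom{n-1}{k-1}} \cdot w(\nu) \cdot \sum_{i=1}^{k-1} \sum_{a=1}^{\lambda_i - 1} \binom{\lambda_i}{a} \frac{a! (\lambda_i - a)!}{\lambda_i!} \\
&= \frac{(k-1)!}{n!} \cdot \frac{1}{\binom{n-1}{k-2}} \cdot w(\nu) \cdot \frac{1}{n-k+1} \sum_{i=1}^{k-1} \sum_{a=1}^{\lambda_i - 1} 1 \\
&= c_{n,k-1} \cdot w(\nu) \cdot \frac{1}{n-k+1} \sum_{i=1}^{k-1} (\lambda_i - 1) \\
&= c_{n,k-1} \cdot w(\nu),
\end{aligned}
\]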
where the final equality follows from the fact that λ1 + · · · + λk−1 = n, and this completes the
induction step.
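For small n, Theorem 1 can be checked by brute force. The following Python sketch, with hypothetical function names, propagates the discrete-time merging chain exactly using rational arithmetic and compares the resulting probabilities with the formula in the theorem. It is meant as a sanity check on the statement, not as part of the proof.

```python
from fractions import Fraction
from itertools import combinations
from math import comb, factorial

def partition_distributions(n):
    """Exact distribution of the k-block partition for k = n, ..., 1.

    Partitions are frozensets of frozensets; returns {k: {partition: probability}}.
    """
    start = frozenset(frozenset([i]) for i in range(1, n + 1))
    dists = {n: {start: Fraction(1)}}
    for k in range(n, 1, -1):
        nxt = {}
        for xi, p in dists[k].items():
            # Each of the C(k, 2) pairs of blocks merges with equal probability.
            for b1, b2 in combinations(xi, 2):
                nu = (xi - {b1, b2}) | {b1 | b2}
                nxt[nu] = nxt.get(nu, Fraction(0)) + p / comb(k, 2)
        dists[k - 1] = nxt
    return dists

def theorem1_probability(xi, n):
    """c_{n,k} * w(xi) as stated in Theorem 1."""
    k = len(xi)
    w = 1
    for block in xi:
        w *= factorial(len(block))
    c = Fraction(factorial(k), factorial(n) * comb(n - 1, k - 1))
    return c * w

if __name__ == "__main__":
    n = 5
    dists = partition_distributions(n)
    print(all(p == theorem1_probability(xi, n)
              for k in dists for xi, p in dists[k].items()))
```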
Our next result characterizes the joint distribution of the sizes of the blocks in the partition $\xi_k^{(n)}$.
Theorem 2. Suppose that the blocks in $\xi_k^{(n)}$ are put into random order and let $\lambda_i$ be the size of
the $i$'th block. Then $(\lambda_1, \ldots, \lambda_k)$ is uniformly distributed on the set of $k$-dimensional vectors of
positive integers that sum to $n$.
Proof. Given a vector of positive integers $(\lambda_1, \ldots, \lambda_k)$ that sum to $n$, observe that there are
$\binom{n}{\lambda_1, \ldots, \lambda_k}$ different ways to divide $\{1, \ldots, n\}$ into an ordered sequence of $k$ blocks
with these sizes and that all of the resulting partitions have the same probability
$c_{n,k} \lambda_1! \cdots \lambda_k!$. Since there are $k!$ ways of ordering the $k$ blocks in $\xi_k^{(n)}$, each of which
is equally likely, it follows that
\[
\begin{aligned}
P((\lambda_1, \ldots, \lambda_k))
&= \binom{n}{\lambda_1, \ldots, \lambda_k} \cdot \frac{c_{n,k} \cdot \lambda_1! \cdots \lambda_k!}{k!} \\
&= \frac{n!}{k!} \cdot \frac{k!}{n!} \cdot \frac{1}{\binom{n-1}{k-1}} \\
&= \frac{1}{\binom{n-1}{k-1}},
\end{aligned}
\]
which shows that all vectors $(\lambda_1, \ldots, \lambda_k)$ with positive integer entries summing to $n$ have
the same probability, i.e., the distribution is uniform.
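The uniformity of the block sizes can also be seen in simulation. The sketch below is a Monte Carlo illustration with hypothetical function names; it tracks only the block sizes, which suffices because the merging rule picks pairs of blocks uniformly regardless of their contents.

```python
import random
from collections import Counter
from math import comb

def ordered_block_sizes(n, k, rng):
    """Merge uniformly chosen pairs of blocks until k remain, then shuffle.

    Returns the block sizes as a tuple in random order.
    """
    sizes = [1] * n
    while len(sizes) > k:
        a, b = rng.sample(range(len(sizes)), 2)
        merged = sizes[a] + sizes[b]
        sizes = [s for idx, s in enumerate(sizes) if idx not in (a, b)]
        sizes.append(merged)
    rng.shuffle(sizes)
    return tuple(sizes)

def size_frequencies(n, k, reps, seed=0):
    """Empirical frequency of each ordered size vector over reps simulations."""
    rng = random.Random(seed)
    counts = Counter(ordered_block_sizes(n, k, rng) for _ in range(reps))
    return {sizes: c / reps for sizes, c in counts.items()}

if __name__ == "__main__":
    freqs = size_frequencies(6, 3, 20000)
    print(len(freqs), 1 / comb(5, 2))   # number of vectors seen, predicted probability
    for sizes in sorted(freqs):
        print(sizes, round(freqs[sizes], 3))
```

With n = 6 and k = 3 there are 10 possible size vectors, and each should appear with frequency close to 1/10.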
Corollary 1. Let $p_{N,n}$ be the probability that the most recent common ancestor of a sample of $n$
individuals from a Wright-Fisher population of size $N$ is also the most recent common ancestor
of the entire population. Then
\[ \lim_{N \to \infty} p_{N,n} = \frac{n-1}{n+1}. \]
Proof. Consider the genealogy of the entire population. For the sample and the entire population
to have the same MRCA, the sample must contain descendants of each of the two branches that
form the first split in this tree. If we let (λ, N − λ) denote the number of descendants of these
two branches in the entire population, then by the preceding theorem, λ is uniformly distributed
on the set {1, · · · , N − 1}. Furthermore, if we let X denote the proportion of the population
descended from the first branch, then in the limit as N → ∞, X is uniformly distributed on
[0, 1]. Since the n individuals in the sample are chosen at random, it follows that
\[ \lim_{N \to \infty} p_{N,n} = 1 - \int_0^1 \left(x^n + (1-x)^n\right) dx = 1 - \frac{2}{n+1} = \frac{n-1}{n+1}. \]
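The limit can be illustrated with a finite-population calculation. The Python sketch below is a hedged illustration, not the author's argument: it assumes that the number of descendants of the first branch is exactly uniform on {1, ..., N − 1} and that the sample is drawn without replacement from N individuals, and the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def p_shared_mrca(N, n):
    """Probability that a sample of n contains descendants of both root branches,
    assuming the first branch has lam descendants with lam uniform on 1..N-1."""
    # comb(lam, n) counts samples drawn entirely from one side of the root split.
    miss = sum(comb(lam, n) + comb(N - lam, n) for lam in range(1, N))
    return 1 - Fraction(miss, (N - 1) * comb(N, n))

if __name__ == "__main__":
    n = 5
    for N in (10, 100, 2000):
        print(N, float(p_shared_mrca(N, n)), (n - 1) / (n + 1))
```

As N increases, the computed values approach (n − 1)/(n + 1), in line with the corollary.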
Our final result describes how we can generate coalescent trees from the root up.
Theorem 3. Suppose that a sequence of random partitions $\xi_1, \ldots, \xi_n \in \mathcal{P}_n$ is generated by
setting $\xi_1 = \{\{1, \ldots, n\}\}$ and then using the following recursive procedure. Given
$\xi_{k-1} = \nu = \{\nu_1, \ldots, \nu_{k-1}\}$, let $\lambda_i = |\nu_i|$ be the number of elements in the $i$'th block and
form the partition $\xi_k$ by performing these three steps:
1. Choose a random block from $\nu$, with probability $\frac{\lambda_i - 1}{n - k + 1}$ for $\nu_i$.
2. Then choose a number $a$ uniformly at random from the set $\{1, \ldots, \lambda_i - 1\}$.
3. Finally, randomly split $\nu_i$ into two blocks of sizes $a$ and $\lambda_i - a$ by choosing $a$ elements at
random from $\nu_i$.
Then $(\xi_n, \ldots, \xi_1)$ has the same distribution as the sequence of partitions generated by Kingman's
coalescent for a sample of size $n$.
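Proof. Let $\xi_n^{(n)}, \ldots, \xi_1^{(n)}$ be a sequence of partitions in $\mathcal{P}_n$ generated by Kingman's coalescent.
Suppose that $\xi, \nu$ are partitions in $\mathcal{P}_n$ with $|\xi| = k$ and $|\nu| = k - 1$ and let
$\lambda_1, \ldots, \lambda_{k-1}$ be the sizes of the blocks in $\nu$. If $\xi < \nu$, then for some $1 \le i \le k - 1$ and
some $1 \le a \le \lambda_i - 1$, $\xi$ can be obtained by splitting $\nu_i$ into blocks of sizes $a$ and $\lambda_i - a$.
Appealing to Bayes' formula and Theorem 1, we have
\[
\begin{aligned}
P\left(\xi_k^{(n)} = \xi \mid \xi_{k-1}^{(n)} = \nu\right)
&= P\left(\xi_{k-1}^{(n)} = \nu \mid \xi_k^{(n)} = \xi\right) \cdot \frac{P\left(\xi_k^{(n)} = \xi\right)}{P\left(\xi_{k-1}^{(n)} = \nu\right)} \\
&= \frac{1}{\binom{k}{2}} \cdot \frac{c_{n,k} \cdot w(\xi)}{c_{n,k-1} \cdot w(\nu)} \\
&= \frac{2}{k(k-1)} \cdot \frac{k(k-1)}{n-k+1} \cdot \frac{a! (\lambda_i - a)!}{\lambda_i!} \\
&= 2 \cdot \frac{\lambda_i - 1}{n-k+1} \cdot \frac{1}{\lambda_i - 1} \cdot \frac{1}{\binom{\lambda_i}{a}}.
\end{aligned}
\]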
The factor of two reflects the fact that there are two ways to form ξ from ν (see proof of Theorem
1), while the next three terms correspond to the sampling distributions of each of the three steps
in the procedure described in the statement of the theorem.
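The root-to-leaves procedure of Theorem 3 can be sketched in Python as follows. This is an illustrative sketch under the assumption that a block of size λ is chosen with weight λ − 1 (so that singletons are never split); the function name and the list-of-tuples representation are choices made here, not part of the original notes.

```python
import random

def split_from_root(n, rng):
    """Generate partitions with 1, 2, ..., n blocks by repeated splitting.

    Each partition is returned as a list of tuples of sample labels.
    """
    partition = [list(range(1, n + 1))]            # a single block containing everything
    sequence = [[tuple(b) for b in partition]]
    for k in range(2, n + 1):
        # Step 1: choose block i with probability (lambda_i - 1) / (n - k + 1).
        weights = [len(b) - 1 for b in partition]
        (i,) = rng.choices(range(len(partition)), weights=weights)
        block = partition[i]
        # Step 2: size of one new block, uniform on {1, ..., lambda_i - 1}.
        a = rng.randint(1, len(block) - 1)
        # Step 3: choose which a elements go into that block.
        chosen = set(rng.sample(block, a))
        left = [x for x in block if x in chosen]
        right = [x for x in block if x not in chosen]
        partition = partition[:i] + [left, right] + partition[i + 1:]
        sequence.append([tuple(b) for b in partition])
    return sequence

if __name__ == "__main__":
    for part in split_from_root(5, random.Random(3)):
        print(part)
```

Reading the printed sequence from bottom to top gives a sequence of pairwise mergers, which by the theorem has the same distribution as one generated by the coalescent.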