Math 338, Lecture 1
Elementary probability models in DNA Sequence Analysis: Shotgun sequencing
I. Shotgun sequencing.
Let S be a segment of DNA isolated from some genome. Let us assume that the length
of S, in number of bases, is known at least approximately, and let g denote this length.
By considering only one of the complementary strands of S, we think of S as a string of
letters from the alphabet {A, T, C, G}.
A fundamental issue of current DNA biology is to determine the sequence of base
pairs in a given segment S. Chemical sequencing methods can be applied effectively only
to relatively short strands of DNA, on the order of $5 \times 10^2$ to $2 \times 10^3$ bases. Shotgun
sequencing is one strategy for determining the sequence of longer pieces. It works well for
g on the order of $10^5$ or $10^6$ bases. Copies of S are broken randomly into fragments of a
size allowing direct sequencing. N fragments are chosen at random, cloned, and sequenced,
where N is a number chosen by the experimentalist. Two fragments that overlap contain
a common subsequence of DNA. Hence one can fit the sequenced fragments together by
way of their overlaps and so try to reconstruct S.
There are two issues of mathematical analysis that arise in this strategy. First, one
must design an efficient and accurate algorithm for assembling fragments. Second, one
wants to understand how well S is covered by fitting together the fragments. The first
issue is a combinatorial problem which we shall assume solved; that is, we assume we are
able to link all overlapping fragments correctly. Once we have done so, we encounter the
second issue. In general, the N fragments will not cover the whole of S. Instead there
will be stretches of contiguous overlapping fragments, called contigs, separated by gaps
not covered by any fragment. To assess the quality of the shotgun sequencing, one wants
to know the expected proportion of S covered by contigs, the expected number of contigs,
their expected length, and how these averages depend on N. We shall call this the coverage
problem. Its solution helps the DNA sequencer to choose N so as to balance accuracy of
sequencing with overall cost and complexity.
This lecture addresses the probabilistic analysis of the coverage problem. There are
two steps. First, we construct a probabilistic model of random fragments. Then, given the
model, we calculate expected coverage, expected contig length, etc. Actually we consider
two models here, the second being a more realistic modification of the first. For the
discussion, it is only necessary that you understand the properties of Bernoulli and binomial
random variables and the idea of a probability density. In addition we shall use the
following approximation from calculus: for values of x small relative to n and n large,
\[ \left(1 - \frac{x}{n}\right)^n \approx e^{-x}. \]
Stated thus, this approximation is a little vague. You can explore how good it is in one of
the homework problems. For now, we note that the approximation reflects the following
more precise fact, which you should(!) be able to prove using techniques from Calc I and
II: for any real number x,
\[ \lim_{n \to \infty} \left(1 - \frac{x}{n}\right)^n = e^{-x}. \tag{1} \]
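To get a feel for how good the approximation is, one can simply compute both sides. The following is a minimal Python sketch; the function name approx_error and the test value x = 2 are our own choices for illustration and are not part of the lecture.

```python
import math


def approx_error(x: float, n: int) -> float:
    """Absolute difference between (1 - x/n)**n and exp(-x)."""
    return abs((1.0 - x / n) ** n - math.exp(-x))


if __name__ == "__main__":
    # Illustrative value x = 2; n grows by factors of 10.
    for n in (10, 100, 1000, 10000):
        print(n, approx_error(2.0, n))
```

For fixed x the printed errors shrink roughly in proportion to 1/n.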
Remark. The shotgun strategy is also applied to physical mapping of DNA at larger scales.
Long, random subunits of a genome are classified not by their exact sequence of bases, but
by fingerprints, that is, chemical or physical properties that identify them uniquely. The
fingerprints are used to identify overlapping units so that they may be pieced together.
The adjective "shotgun" applies to the randomness inherent in this method. Shotgun
strategies played an important role in the approach of Venter et al. (1996) to sequencing
the human genome.
II. Probability Model I.
We start with a DNA segment S of length g, oriented in the 3' to 5' direction, and draw
N fragments from it at random. Since the actual sequence of bases in S is not relevant to
the coverage problem, we shall represent S simply as the interval [0, g], so that a point x
in [0, g] represents the distance, in bases, from the 3' end. Our first model is then:
(i) Each fragment has identical length L, where $L \ll g$ (this means that L is much less
than g, so that L/g is very small).
(ii) The left-hand endpoint of a randomly drawn fragment is uniformly distributed in the
interval (0, g).
Remarks: Two approximations are made in assumption (ii). First, a continuous model
has been used where a discrete model would be more accurate. The left endpoints of the
fragments ought really to be drawn from the uniform distribution on the set of integers
{1, 2, ..., g}, labelling the locations of the bases. When they are drawn uniformly from [0, g],
they will in general fall at non-integer points. However, for our purposes the approximation
of the uniform distribution on {1, ..., g} by the uniform distribution on [0, g] does not
introduce any essential inaccuracy, and it is more convenient mathematically.
The second approximation in (ii) is that it ignores end effects. If the left-hand endpoint
of a fragment falls in the interval (g − L, g), its right-hand endpoint reaches beyond g.
Assumption (ii) allows this to happen, but only with probability L/g, which, by assumption
(i), is small, and so has little effect on the model.
III. Analysis of the model
A. The coverage number. This is defined to be a = NL/g. It is assumed that a is of
moderate size, say 1 < a ≤ 10. Since L/g is small, N = ag/L is large. Notice that NL is the
total length of all the fragments, and NL = ag. Thus we say that the N fragments give
a-times coverage of S.
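As a small worked example of the definition a = NL/g, the following Python sketch converts between N and a. The function names and the numerical values (g = 100,000 bases, L = 500 bases) are assumptions chosen for illustration; they lie within the ranges quoted in Section I but do not come from the lecture.

```python
import math


def coverage_number(N: int, L: float, g: float) -> float:
    """Coverage number a = N L / g."""
    return N * L / g


def fragments_needed(a: float, L: float, g: float) -> int:
    """Smallest N giving at least a-times coverage, i.e. N >= a g / L."""
    return math.ceil(a * g / L)


if __name__ == "__main__":
    g, L = 100_000, 500  # hypothetical segment and fragment lengths, in bases
    for a in (1, 2, 5, 10):
        print(a, fragments_needed(a, L, g))
```

With these illustrative values, 5-times coverage requires N = 1000 fragments, which shows concretely why N is large when L/g is small.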
B. Some random variables associated to the model.
Let x be a site in (0, g). For each i, 1 ≤ i ≤ N, let
\[ \xi_i^x = \begin{cases} 1, & \text{if fragment } i \text{ contains } x; \\ 0, & \text{otherwise.} \end{cases} \]
Then, ignoring end effects, $\xi_1^x, \ldots, \xi_N^x$ are independent, identically distributed Bernoulli
random variables with $\mathbb{P}(\xi_i^x = 1) = L/g = 1 - \mathbb{P}(\xi_i^x = 0)$. The random variable
\[ K^x = \sum_{i=1}^{N} \xi_i^x \]
counts the number of fragments which contain x. Then $K^x$ is a binomial random variable
with parameters n = N and p = L/g. The expected number of fragments that cover the
point x is then NL/g = a; this gives us an alternative way to think of a.
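The binomial description of the number of fragments covering a site can be tabulated directly. The sketch below is only an illustration: the function name depth_pmf and the values N = 1000, L = 500, g = 100,000 (so a = 5) are our own assumptions.

```python
from math import comb


def depth_pmf(k: int, N: int, L: float, g: float) -> float:
    """P(K = k) for K ~ Binomial(N, p = L/g), ignoring end effects."""
    p = L / g
    return comb(N, k) * p**k * (1.0 - p) ** (N - k)


if __name__ == "__main__":
    N, L, g = 1000, 500, 100_000  # hypothetical values giving a = 5
    for k in range(11):
        print(k, depth_pmf(k, N, L, g))
    # The mean of the distribution should be close to a = N L / g.
    print(sum(k * depth_pmf(k, N, L, g) for k in range(60)))
```

The k = 0 entry is the probability that the site is not covered by any fragment.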
C. Proportion of coverage. Let X denote the total length of all the contigs. The proportion
of S covered by contigs is then X/g. The expected proportion of coverage is E[X]/g. We
show that
\[ \text{expected proportion of coverage} \approx 1 - e^{-a} = 1 - e^{-NL/g}. \]
This is sometimes called the Clarke-Carbon formula.
We give a simple derivation. Note that the expected proportion of coverage is just
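the probability that a point chosen randomly from the uniform distribution on (0, g) is
covered. But for any point x, we derive from the previous paragraph that
\[ \mathbb{P}(K^x > 0) = 1 - \mathbb{P}(K^x = 0) = 1 - \binom{N}{0} (1 - L/g)^N \approx 1 - e^{-a}. \]
Here we used L/g = a/N, so that $(1 - L/g)^N = (1 - a/N)^N \approx e^{-a}$ by (1).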
(This formula ignores end effects for 0 < x < L and g − L < x < g.) Since this probability
is independent of x, ignoring end effects, it also gives the expected fraction of S covered
by contigs.
Remark. To express the Clarke-Carbon formula it was not necessary to use the concept
of contigs. It just gives, approximately, the proportion of an interval covered by N
subintervals of uniform length drawn uniformly from the larger interval.
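The Clarke-Carbon formula can be checked by simulation. The following Python sketch draws N left endpoints uniformly from (0, g), measures the fraction of [0, g] covered by the union of the fragments, and averages over repeated trials. The function names, the seed, the number of trials, and the values N = 400, L = 500, g = 100,000 (so a = 2) are assumptions made for illustration. Unlike the formula, the simulation includes end effects, so only approximate agreement should be expected.

```python
import math
import random


def covered_fraction(N: int, L: float, g: float, rng: random.Random) -> float:
    """Fraction of [0, g] covered by N fragments of length L (Model I)."""
    starts = sorted(rng.uniform(0.0, g) for _ in range(N))
    covered = 0.0
    reach = 0.0  # right edge of the region examined so far
    for s in starts:
        end = min(s + L, g)  # clip fragments that run past g
        if end > reach:
            covered += end - max(s, reach)
            reach = end
    return covered / g


def mean_covered_fraction(N, L, g, trials=200, seed=0):
    """Average covered fraction over independent simulated experiments."""
    rng = random.Random(seed)
    return sum(covered_fraction(N, L, g, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    N, L, g = 400, 500.0, 100_000.0  # hypothetical values giving a = 2
    print("simulated:", mean_covered_fraction(N, L, g))
    print("formula:  ", 1.0 - math.exp(-N * L / g))
```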
D. The expected number of contigs. We show that the expected number of contigs is
approximately
\[ N e^{-a}. \]
We calculate this using a trick, which is good to know. For each fragment i define the
random variable $Z_i$ so that $Z_i = 1$ if fragment i is the rightmost fragment of a contig, and
$Z_i = 0$ otherwise. We calculate $E[Z_i] = \mathbb{P}(Z_i = 1)$ by first conditioning on the location
of fragment i. This is an interval of length L somewhere in (0, g). It will be a rightmost
interval of a contig if and only if no left-hand endpoint of any of the other N − 1 fragments
falls in it. The probability of this is, by the binomial distribution and the fact that N is
large,
\[ \binom{N-1}{0} (1 - L/g)^{N-1} \approx e^{-a}. \]
This does not depend on the location of interval i and so is the same as $\mathbb{P}(Z_i = 1)$. Now
the total number of contigs is $\sum_{i=1}^{N} Z_i$, because the sum counts the number of fragments
that are rightmost in a contig. Thus the expected number of contigs is
\[ E\left[ \sum_{i=1}^{N} Z_i \right] = \sum_{i=1}^{N} E[Z_i] \approx N e^{-a}. \]
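The indicator trick translates directly into a counting procedure: after sorting the left endpoints, a fragment is the rightmost in its contig exactly when the next left endpoint lies beyond its right end. The Python sketch below counts contigs this way and compares the simulated average with the approximation. The function names, seed, trial count, and the values N = 400, L = 500, g = 100,000 are our own assumptions for illustration.

```python
import math
import random


def count_contigs(N: int, L: float, g: float, rng: random.Random) -> int:
    """Number of contigs formed by N fragments of length L (Model I)."""
    starts = sorted(rng.uniform(0.0, g) for _ in range(N))
    contigs = 1  # the fragment with the largest left endpoint always ends a contig
    for left, nxt in zip(starts, starts[1:]):
        if nxt - left > L:  # no other left endpoint falls inside this fragment
            contigs += 1
    return contigs


def mean_contigs(N, L, g, trials=200, seed=1):
    """Average contig count over independent simulated experiments."""
    rng = random.Random(seed)
    return sum(count_contigs(N, L, g, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    N, L, g = 400, 500.0, 100_000.0  # hypothetical values giving a = 2
    print("simulated:", mean_contigs(N, L, g))
    print("formula:  ", N * math.exp(-N * L / g))
```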
E. Mean Length of Contigs. The mean length of a contig is tricky to formulate and
approximate under the present model. We give a rough heuristic. Since the average
proportion of S covered by contigs is $1 - e^{-a}$, the average total contig length is $g(1 - e^{-a})$.
Thus, since there are on average $N e^{-a}$ contigs, the average length should be approximately
\[ \frac{g(1 - e^{-a})}{N e^{-a}}. \]
A little algebra, using g/N = L/a, shows that this is equal to $L(e^{a} - 1)/a$. This result will
be rederived using the Poisson process approximation method discussed below.
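The three approximations of this section can be collected in one small function, which makes it easy to see how they depend on N. This is a sketch only; the function name model_one_summary and the example values are assumptions for illustration.

```python
import math


def model_one_summary(N: int, L: float, g: float):
    """Return (a, expected covered proportion, expected number of contigs,
    heuristic mean contig length) under Model I, ignoring end effects."""
    a = N * L / g
    covered = 1.0 - math.exp(-a)
    contigs = N * math.exp(-a)
    mean_length = L * (math.exp(a) - 1.0) / a
    return a, covered, contigs, mean_length


if __name__ == "__main__":
    L, g = 500.0, 100_000.0  # hypothetical fragment and segment lengths
    for N in (200, 400, 1000, 2000):
        print(N, model_one_summary(N, L, g))
```

The heuristic mean length returned here agrees with total covered length divided by the number of contigs, as the algebra above indicates.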
IV. Model II; variation with random fragment length.
The assumption that all fragments are of equal length is clearly unrealistic. Suppose
instead that we replace assumption (i) by
(i)' The length of a randomly drawn fragment is a random variable with a known density
function $f_L$, and is independent of the location of the left-hand endpoint of the
fragment. Furthermore, there is a level $\ell \ll g$ such that
\[ f_L(y) = 0 \quad \text{if } y > \ell. \]
This assumption implies that $\mathbb{P}(\text{fragment length} > \ell) = 0$; hence the fragment lengths,
although random, are all uniformly small compared to g.
We show that under assumptions (i)' and (ii),
\[ \text{expected proportion of coverage} \approx 1 - e^{-N E[L]/g}. \]
Notice that this is the same as the Clarke-Carbon formula of Model I, except that L has
been replaced by E[L].
Define $\xi_i^x$ as above, so that $\xi_i^x$ equals 1 if fragment i covers x and equals 0 otherwise.
Let $X_i$ denote the position of the left endpoint of fragment i; by assumption (ii) $X_i$ is
uniformly distributed on (0, g). Let $L_i$ denote the length of fragment i; by assumption (i)',
$L_i$ has density $f_L$ and is independent of $X_i$. Then, for $0 \le w \le x$,
\begin{align*}
\mathbb{P}(\text{fragment } i \text{ covers } x \mid X_i = x - w) &= \mathbb{P}(\xi_i^x = 1 \mid X_i = x - w) \\
&= \mathbb{P}(L_i > w \mid X_i = x - w) \\
&= \int_w^{\ell} f_L(y)\, dy.
\end{align*}
We used the independence of fragment length and location in deriving the last equality.
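Thus,
\begin{align*}
\mathbb{P}(\xi_i^x = 1) &= \int_0^x \frac{1}{g}\, \mathbb{P}(\xi_i^x = 1 \mid X_i = x - w)\, dw \\
&= \frac{1}{g} \int_0^x \int_w^{\ell} f_L(y)\, dy\, dw \\
&= \frac{1}{g} \int_0^{\ell} \min\{x, y\}\, f_L(y)\, dy = \frac{1}{g}\, E[\min\{x, L\}].
\end{align*}
The third equality follows by exchanging the order of integration: for fixed y, the
variable w ranges over $0 \le w \le \min\{x, y\}$.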
Notice that if $x \ge \ell$, where $\ell$ is the maximum possible length of a fragment, $\min\{x, L\} = L$
with probability one, and hence $\mathbb{P}(\xi_i^x = 1) = g^{-1} E[L]$. Let us approximate by using
$g^{-1} E[L]$ for $\mathbb{P}(\xi_i^x = 1)$ for all x; in other words, we ignore end effects. Then, reasoning
as in III C, we find that the average proportion of the sequence S covered by contigs equals
the probability that a randomly chosen point x is covered, which is approximately
\[ 1 - \left(1 - \frac{E[L]}{g}\right)^{N} \approx 1 - e^{-N E[L]/g}. \]
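The Model II formula can also be checked by simulation. In the Python sketch below the fragment lengths are drawn uniformly from [300, 700], so that E[L] = 500 and the maximum length is 700; this particular length distribution, the function names, the seed, and the values N = 400, g = 100,000 are all assumptions made for illustration and are not taken from the lecture.

```python
import math
import random


def covered_fraction_random_lengths(N, g, draw_length, rng):
    """Fraction of [0, g] covered when fragment i has a random length
    draw_length(rng), independent of its uniform left endpoint (Model II)."""
    frags = sorted((rng.uniform(0.0, g), draw_length(rng)) for _ in range(N))
    covered = 0.0
    reach = 0.0  # right edge of the region covered or skipped so far
    for start, length in frags:
        end = min(start + length, g)  # clip fragments that run past g
        if end > reach:
            covered += end - max(start, reach)
            reach = end
    return covered / g


def mean_covered_random_lengths(N, g, draw_length, trials=200, seed=2):
    """Average covered fraction over independent simulated experiments."""
    rng = random.Random(seed)
    total = sum(covered_fraction_random_lengths(N, g, draw_length, rng)
                for _ in range(trials))
    return total / trials


def uniform_length(rng):
    """Hypothetical length density: uniform on [300, 700], so E[L] = 500."""
    return rng.uniform(300.0, 700.0)


if __name__ == "__main__":
    N, g, mean_L = 400, 100_000.0, 500.0
    print("simulated:", mean_covered_random_lengths(N, g, uniform_length))
    print("formula:  ", 1.0 - math.exp(-N * mean_L / g))
```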
Some references
This lecture was prepared mostly from the treatment in Chapter 5 of
Ewens and Grant, Statistical Methods in Bioinformatics, Springer-Verlag, 2000.
More detailed treatments are given in
Chapters 5 and 6 of Waterman, Introduction to Computational Biology, Chapman-Hall
1995.
For a description of shotgun sequencing applied to the sequencing of the genome of
H. influenzae, and background on the molecular biology of sequencing, see Chapter 4 in
T.A. Brown, Genomes, Wiley-Liss, 2000.
The paper
Venter, Smith, and Hood, A new strategy for genome sequencing. Nature 381:364-366
(1996)
describes the shotgun-based strategy for sequencing the human genome.