Math 338, Lecture 1
Elementary probability models in DNA Sequence Analysis: Shotgun sequencing

I. Shotgun sequencing.

Let S be a segment of DNA isolated from some genome. Let us assume that the length of S, in number of bases, is known at least approximately, and let g denote this length. By considering only one of the complementary strands of S, we think of S as a string of letters from the alphabet {A, T, C, G}. A fundamental issue of current DNA biology is to determine the sequence of base pairs in a given segment S. Chemical sequencing methods can be applied effectively only to relatively short strands of DNA, on the order of $5 \times 10^2$ to $2 \times 10^3$ bases. Shotgun sequencing is one strategy for determining the sequence of longer pieces. It works well for g on the order of $10^5$ or $10^6$ bases. Copies of S are broken randomly into fragments of a size allowing direct sequencing. N fragments are chosen at random, cloned, and sequenced, where N is a number chosen by the experimentalist. Two fragments that overlap contain a common subsequence of DNA. Hence one can fit the sequenced fragments together by way of their overlaps and so try to reconstruct S.

There are two issues of mathematical analysis that arise in this strategy. First, one must design an efficient and accurate algorithm for assembling fragments. Second, one wants to understand how well S is covered by fitting together the fragments. The first issue is a combinatorial problem which we shall assume solved; that is, we assume we are able to link all overlapping fragments correctly. Once we have done so, we encounter the second issue. In general, the N fragments will not cover the whole of S. Instead there will be stretches of contiguous overlapping fragments, called contigs, separated by gaps not covered by any fragment.
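To make the notions of contig and gap concrete, here is a minimal sketch in Python that takes fragments already located on S as (start, end) intervals and merges them into contigs. The function names `merge_into_contigs` and `gaps_between` are hypothetical, chosen for illustration. The sketch assumes the assembly problem is solved, so the position of every fragment is known, and it treats fragments that merely touch as overlapping, an event of probability zero in the continuous model.

```python
def merge_into_contigs(fragments):
    """Merge (start, end) intervals into maximal runs of overlapping fragments."""
    contigs = []
    for start, end in sorted(fragments):
        if contigs and start <= contigs[-1][1]:
            # Fragment overlaps (or touches) the current contig: extend it.
            contigs[-1][1] = max(contigs[-1][1], end)
        else:
            # Fragment starts beyond the current contig: a new contig begins.
            contigs.append([start, end])
    return [tuple(c) for c in contigs]


def gaps_between(contigs, g):
    """Return the uncovered stretches of [0, g], given contigs sorted by start."""
    gaps = []
    prev_end = 0
    for start, end in contigs:
        if start > prev_end:
            gaps.append((prev_end, start))
        prev_end = max(prev_end, end)
    if prev_end < g:
        gaps.append((prev_end, g))
    return gaps


if __name__ == "__main__":
    fragments = [(10, 12), (0, 4), (3, 7), (11, 15)]
    contigs = merge_into_contigs(fragments)
    print(contigs)                    # [(0, 7), (10, 15)]
    print(gaps_between(contigs, 20))  # [(7, 10), (15, 20)]
```

In this toy example four fragments on a segment of length 20 form two contigs separated by one interior gap, with a second gap at the right end.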
To assess the quality of the shotgun sequencing, one wants to know the expected proportion of S covered by contigs, the expected number of contigs, their expected length, and how these averages depend on N. We shall call this the coverage problem. Its solution helps the DNA sequencer to choose N so as to balance accuracy of sequencing with overall cost and complexity.

This lecture addresses the probabilistic analysis of the coverage problem. There are two steps. First, we construct a probabilistic model of random fragments. Then, given the model, we calculate expected coverage, expected contig length, etc. Actually we consider two models here, the second being a more realistic modification of the first. For the discussion, it is only necessary that you understand the properties of Bernoulli and binomial random variables and the idea of a probability density. In addition we shall use the following approximation from calculus: for n large and values of x small relative to n,

$$\left(1 - \frac{x}{n}\right)^n \approx e^{-x}.$$

Stated thus, this approximation is a little vague. You can explore how good it is in one of the homework problems. For now, we note that the approximation reflects the following more precise fact, which you should(!) be able to prove using techniques from Calc I and II: for any real number x,

$$\lim_{n \to \infty} \left(1 - \frac{x}{n}\right)^n = e^{-x}. \tag{1}$$

Remark. The shotgun strategy is also applied to physical mapping of DNA at larger scales. Long, random subunits of a genome are classified not by their exact sequence of bases, but by fingerprints, that is, chemical or physical properties that identify them uniquely. The fingerprints are used to identify overlapping units so that they may be pieced together. The adjective "shotgun" applies to the randomness inherent in this method. Shotgun strategies played an important role in the approach of Venter et al. (1996) to sequencing the human genome.

II. Probability Model I.
We start with a DNA segment S of length g, oriented in the 3' to 5' direction, and draw N fragments from it at random. Since the actual sequence of bases in S is not relevant to the coverage problem, we shall represent S simply as the interval [0, g], so that a point x in [0, g] represents the distance, in bases, from the 3' end. Our first model is then:

(i) Each fragment has identical length L, where $L \ll g$ (that is, L is much less than g, so that L/g is very small).

(ii) The left-hand endpoint of a randomly drawn fragment is uniformly distributed in the interval (0, g).

Remarks: Two approximations are made in assumption (ii). First, a continuous model has been used where a discrete model would be more accurate. The left endpoints of the fragments ought really to be drawn from the uniform distribution on the set of integers $\{1, 2, \dots, g\}$ labelling the locations of the bases. By drawing them uniformly from [0, g], they will fall in general at non-integer points. However, for our purposes the approximation of the uniform distribution on $\{1, \dots, g\}$ by the uniform distribution on [0, g] does not introduce any essential inaccuracy, and it is more convenient mathematically. The second approximation in (ii) is that it ignores end effects. If the left-hand endpoint of a fragment falls in the interval (g − L, g), its right-hand endpoint reaches beyond g. Assumption (ii) allows this to happen, but only with probability L/g, which, by assumption (i), is small, and so has little effect on the model.

III. Analysis of the model

A. The coverage number. This is defined to be $a = NL/g$. It is assumed that a is of moderate size, say $1 < a \le 10$. Since L/g is small, N is large. Notice that NL is the total length of all the fragments, and NL = ag. Thus we say that the N fragments give a-times coverage of S.

B. Some random variables associated to the model. Let x be a site in (0, g). For each i, $1 \le i \le N$, let

$$\xi_i^x = \begin{cases} 1, & \text{if fragment } i \text{ contains } x; \\ 0, & \text{otherwise.} \end{cases}$$
Then, ignoring end effects, $\xi_1^x, \dots, \xi_N^x$ are independent, identically distributed Bernoulli random variables with $\mathbb{P}(\xi_i^x = 1) = L/g = 1 - \mathbb{P}(\xi_i^x = 0)$. The random variable

$$K^x = \sum_{i=1}^{N} \xi_i^x$$

counts the number of fragments which contain x. Thus $K^x$ is a binomial random variable with parameters n = N and p = L/g. The expected number of fragments that cover the point x is then NL/g = a; this gives us an alternative way to think of a.

C. Proportion of coverage. Let X denote the total length of all the contigs. The proportion of S covered by contigs is then X/g. The expected proportion of coverage is E[X]/g. We show that

$$\text{expected proportion of coverage} \approx 1 - e^{-a} = 1 - e^{-NL/g}.$$

This is sometimes called the Clarke-Carbon formula. We give a simple derivation. Note that the expected proportion of coverage is just the probability that a point chosen randomly from the uniform distribution on (0, g) is covered. But for any point x, we derive from the previous paragraph that

$$\mathbb{P}(K^x > 0) = 1 - \mathbb{P}(K^x = 0) = 1 - \binom{N}{0}(1 - L/g)^N \approx 1 - e^{-a},$$

where the last step uses $L/g = a/N$ and the approximation $(1 - a/N)^N \approx e^{-a}$ for large N. (This formula ignores end effects for 0 < x < L and g − L < x < g.) Since this probability is independent of x, ignoring end effects, it also gives the expected fraction of S covered by contigs.

Remark. To express the Clarke-Carbon formula it was not necessary to use the concept of contigs. It just gives, approximately, the proportion of an interval covered by N subintervals of uniform length drawn uniformly from the larger interval.

D. The expected number of contigs. We show that the expected number of contigs is approximately $N e^{-a}$. We calculate this using a trick, which is good to know. For each fragment i define the random variable $Z_i$ so that $Z_i = 1$ if fragment i is the rightmost fragment of a contig, and $Z_i = 0$ otherwise. We calculate $E[Z_i] = \mathbb{P}(Z_i = 1)$ by first conditioning on the location of fragment i. This is an interval of length L somewhere in (0, g).
It will be a rightmost interval of a contig if and only if no left-hand endpoint of any of the other N − 1 fragments falls in it. The probability of this is, by the binomial distribution and the fact that N is large,

$$\binom{N-1}{0}(1 - L/g)^{N-1} \approx e^{-a}.$$

This does not depend on the location of interval i and so is the same as $\mathbb{P}(Z_i = 1)$. Now the total number of contigs is $\sum_{i=1}^{N} Z_i$, because the sum counts the number of fragments that are rightmost in a contig. Thus the expected number of contigs is

$$E\left[\sum_{i=1}^{N} Z_i\right] = \sum_{i=1}^{N} E[Z_i] \approx N e^{-a}.$$

E. Mean Length of Contigs. The mean length of a contig is tricky to formulate and approximate under the present model. We give a rough heuristic. Since the average proportion of S covered by contigs is $1 - e^{-a}$, the average total contig length is $g(1 - e^{-a})$. Thus, since there are on average $N e^{-a}$ contigs, the average length should be approximately

$$\frac{g(1 - e^{-a})}{N e^{-a}}.$$

A little algebra, using $g/N = L/a$, shows that this is equal to $L(e^{a} - 1)/a$. This result will be rederived using the Poisson process approximation method discussed below.

IV. Model II; variation with random fragment length.

The assumption that all fragments are of equal length is clearly unrealistic. Suppose instead that we replace assumption (i) by

(i)' The length of a randomly drawn fragment is a random variable with a known density function $f_L$, and is independent of the location of the left-hand endpoint of the fragment. Furthermore, there is a level $\ell \ll g$ such that $f_L(y) = 0$ if $y > \ell$.

This assumption implies that $\mathbb{P}(\text{fragment length} > \ell) = 0$; hence the fragment lengths, although random, are all uniformly small compared to g.

We show that under assumptions (i)' and (ii),

$$\text{expected proportion of coverage} \approx 1 - e^{-N E[L]/g}.$$

Notice that this is the same as the formula for the expected proportion of coverage under Model I, except that L has been replaced by E[L]. Define $\xi_i^x$ as above, so that $\xi_i^x$ equals 1 if fragment i covers x and equals 0 otherwise.
Let $X_i$ denote the position of the left endpoint of fragment i; by assumption (ii), $X_i$ is uniformly distributed on (0, g). Let $L_i$ denote the length of fragment i; by assumption (i)', $L_i$ has density $f_L$ and is independent of $X_i$. Then, for $w > 0$,

$$\mathbb{P}(\text{fragment } i \text{ covers } x \mid X_i = x - w) = \mathbb{P}(\xi_i^x = 1 \mid X_i = x - w) = \mathbb{P}(L_i > w \mid X_i = x - w) = \int_w^{\ell} f_L(y)\,dy.$$

We used the independence of fragment length and location in deriving the last equality. Thus, integrating over the position of the left endpoint and then interchanging the order of integration,

$$\mathbb{P}(\xi_i^x = 1) = \int_0^x \mathbb{P}(\xi_i^x = 1 \mid X_i = x - w)\,\frac{1}{g}\,dw = \frac{1}{g}\int_0^x \int_w^{\ell} f_L(y)\,dy\,dw = \frac{1}{g}\int_0^{\ell} \min\{x, y\}\, f_L(y)\,dy = \frac{1}{g}\,E[\min\{x, L\}].$$

Notice that if $x \ge \ell$, where $\ell$ is the maximum possible length of a fragment, then $\min\{x, L\} = L$ with probability one, and hence $\mathbb{P}(\xi_i^x = 1) = g^{-1} E[L]$. Let us approximate by using $g^{-1} E[L]$ for $\mathbb{P}(\xi_i^x = 1)$ for all x; in other words, we ignore end effects. Then, reasoning as in III C, we find that the average proportion of the sequence S covered by contigs equals the probability that a randomly chosen point x is covered, which is approximately $1 - e^{-N E[L]/g}$.

Some references

This lecture was prepared mostly from the treatment in Chapter 5 of Ewens and Grant, Statistical Methods in Bioinformatics, Springer-Verlag, 2000. More detailed treatments are given in Chapters 5 and 6 of Waterman, Introduction to Computational Biology, Chapman-Hall, 1995. For a description of shotgun sequencing applied to the sequencing of the genome of H. influenzae, and background on the molecular biology of sequencing, see Chapter 4 in T.A. Brown, Genomes, Wiley-Liss, 2000. The paper Venter, Smith, and Hood, A new strategy for genome sequencing, Nature 381:364-366 (1996), describes the shotgun-based strategy for sequencing the human genome.
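As a numerical check on the formulas of sections III C, III D, and IV, the following is a minimal Monte Carlo sketch, not part of the lecture's derivation. The function names `simulate_once` and `estimate`, the values g = 100000, N = 600, L = 500, and the choice of fragment lengths uniform on (300, 700) for Model II are all assumptions made for illustration. Fragments may reach beyond g, as assumption (ii) allows, and only the part of each contig inside [0, g] is counted as covered.

```python
import math
import random


def simulate_once(g, n_fragments, draw_length, rng):
    """Draw one set of fragments; return (covered fraction of [0, g], contig count)."""
    fragments = []
    for _ in range(n_fragments):
        left = rng.uniform(0.0, g)  # assumption (ii): uniform left endpoint
        fragments.append((left, left + draw_length(rng)))
    fragments.sort()

    covered = 0.0
    contigs = 0
    cur_start = cur_end = None
    for left, right in fragments:
        if cur_end is None or left > cur_end:
            # A gap precedes this fragment: close the previous contig, open a new one.
            if cur_end is not None:
                covered += min(cur_end, g) - cur_start
            contigs += 1
            cur_start, cur_end = left, right
        else:
            cur_end = max(cur_end, right)
    if cur_end is not None:
        covered += min(cur_end, g) - cur_start
    return covered / g, contigs


def estimate(g, n_fragments, draw_length, trials=200, seed=0):
    """Average the coverage fraction and contig count over independent trials."""
    rng = random.Random(seed)
    total_cov = 0.0
    total_contigs = 0
    for _ in range(trials):
        cov, contigs = simulate_once(g, n_fragments, draw_length, rng)
        total_cov += cov
        total_contigs += contigs
    return total_cov / trials, total_contigs / trials


if __name__ == "__main__":
    g, N, L = 100_000, 600, 500.0  # hypothetical values, giving a = N L / g = 3
    a = N * L / g

    # Model I: every fragment has length L.
    cov1, contigs1 = estimate(g, N, lambda rng: L)
    # Model II: random lengths with E[L] = 500 (uniform on (300, 700), an assumption).
    cov2, contigs2 = estimate(g, N, lambda rng: rng.uniform(300.0, 700.0))

    print("Clarke-Carbon 1 - exp(-a):", 1 - math.exp(-a))
    print("Model I  simulated coverage:", cov1)
    print("Model II simulated coverage:", cov2)
    print("N exp(-a):", N * math.exp(-a))
    print("Model I  simulated mean number of contigs:", contigs1)
```

The simulated averages should land close to the approximate formulas. The small discrepancies that remain come from the end effects that the derivations ignore and from replacing $(1 - a/N)^N$ by $e^{-a}$.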