Towards a DNA Sequencing Theory (Learning a String)
(Preliminary Version)

Ming Li*
University of Waterloo

"DNA: Death's Natural Alternative" - Globe & Mail, April 14, 1990

Key words: Machine Learning, Shortest Common Superstring, Approximation Algorithm, DNA Sequencing.

Abstract

In laboratories, DNA sequencing is (roughly) done by sequencing a large number of short fragments and then finding a shortest common superstring of the fragments. We study frameworks, under plausible assumptions, suitable for massive automated DNA sequencing and for analyzing DNA sequencing algorithms. We model the DNA sequencing problem as learning a superstring from its randomly drawn substrings. Under certain restrictions, this may be viewed as learning a superstring in Valiant's learning model, and in this case we give an efficient algorithm for learning a superstring and a quantitative bound on how many samples suffice.

One major obstacle to our approach turns out to be a quite well-known open question on how to approximate the shortest common superstring of a set of strings, raised by a number of authors in the last ten years [6, 29, 30]. We give the first provably good algorithm, which approximates a shortest superstring of length n by a superstring of length O(n log n).

* The author is supported by NSERC operating grants OGP-0036747 and OGP-0046506. Address: Computer Science Department, University of Waterloo, Ontario N2L 3G1, Canada. Email: mli@watmath.waterloo.edu

1 Introduction

The fundamental quest of life is to understand how it functions. Since the discovery of deoxyribonucleic acid (DNA) in the 1950's, the new field of molecular genetics has expanded at a rate that can be matched by no other field except computer science. A DNA molecule, which holds the secrets of life, consists of a sugar-phosphate backbone and, attached to it, a long sequence of four kinds of nucleotide bases. At an abstract level, a single-stranded DNA molecule can be represented as a character string over the set of nucleotides {A, C, G, T}. Such a character string ranges from a few thousand symbols long for a simple virus to 2 × 10^8 symbols for a fly and to 3 × 10^9 symbols for a human being. Determining this string for different molecules, or sequencing the molecules, is a crucial step towards understanding the biological functions of the molecules. Huge (in terms of money and man-years) national projects for sequencing the human genome have been proposed or are underway although, unfortunately, major technical problems remain.

With current laboratory methods, such as the Maxam-Gilbert or Sanger procedures, it is by far impossible to sequence a long molecule (of 3 × 10^9 base pairs for a human) as a whole. Each time, a randomly chosen fragment of only up to 500 base pairs can be sequenced.
Various methods are used to sequence a long molecule. In general, biochemists "cut" millions of such (identical) molecules into pieces, each typically containing about 200-500 nucleotides (characters) [4, 16]. For example, a biochemist can decide to cut the molecules wherever the substring GGACTT appears (via restriction enzymes). She or he may apply many different cuttings using different substrings. After cutting, these fragments can be roughly "sorted" according to their "weight" ("length"), in a not very precise fashion. A biochemist then "samples" the fragments (of each length, plus or minus 50 base pairs) randomly. Then, say, the Sanger procedure is applied to sequence the sampled fragment. A good technician can process, say, 7 to 14 such fragments a day manually. Sampling is expensive and slow in the sense that it is manual or, at best, mechanical. The sampling is random in the sense that we have absolutely no idea where in the 3 × 10^9-character sequence the fragment of 500 characters might come from. It is also well known that long repetitions appear in a human genome.

From hundreds, sometimes millions, of these random fragments, a biochemist has to assemble the superstring representing the whole molecule. Programs that try to approximate a shortest common superstring (the whole molecule) of a given set of strings (the fragments) are routinely used [23, 16]. However, it is not known whether this always works or why it works. It is reported in [23, 27, 16] that programs based on shortest common superstring algorithms¹ work very satisfactorily for real biological data. In [29], it is mentioned that "although there is no a priori reason to aim at a shortest superstring, this seems to be the most natural restriction that makes the program non-trivial." The goal of this paper is to initiate a study of mathematical foundations for the above discussions and experiments. We will try to formulate the problem in different ways and provide solutions in certain situations. We face at least two major problems:

(a) Given a set of strings, how do we efficiently approximate the shortest common superstring (finding it is NP-complete)? This has been an open problem raised by Gallant, Maier and Storer [6], Turner [30], and Tarhio and Ukkonen [29] in the past ten years. The latter two papers contain two algorithms, based on the maximum overlap method also used by biochemists, and a conjecture that these algorithms guarantee good performance with respect to the optimal length.

(b) Even given good approximation algorithms for superstrings, does that guarantee that we can infer the DNA sequence correctly?

¹ It was only conjectured that these algorithms approximate shortest superstrings.

We will provide an answer to (a) by giving a provably good approximation algorithm for the shortest common superstring problem. Our algorithm outputs a superstring of length at most O(n log n), where n is the optimal length. We will also partially answer (b). Under certain restrictions, we study how well the shortest superstring algorithm does, and we provide a new "sequencing" algorithm under the Valiant learning model. We will also give quantitative bounds on the number of random fragments that need to be sequenced.

This paper studies Valiant learning in a domain where the sample space is of polynomial size. Since such concepts are trivially polynomially learnable, not much attention has been paid to them. In the past, researchers concentrated on the "learnability" of concept classes whose sample spaces are, of course (otherwise the problem would be trivial), superpolynomial [31, 2, 12, 8, 26, 22, 11], although efficient sampling was studied, for example, in [9]. On the other hand, our problem has trivial algorithms that need a high polynomial number of samples, but it also has non-trivial algorithms requiring a low polynomial number of samples.

It is worth noting that artificial intelligence methods [21] and string matching algorithms have been extensively applied to DNA sequence analysis [5].

The paper is organized as follows. In the next section, we provide a solution to question (a).
Our new approximation algorithm is also crucial in the following sections. In Section 3, we study possible ways of modeling our DNA sequencing problem. We will associate this problem with the Valiant learning model. We also state a stronger version of the Occam's Razor theorem proved in [3], which is needed in Section 4. Then, in Section 4, we give provably efficient learning algorithms for the DNA sequencing problem.

2 Approximating the Shortest Common Superstring

As we have discussed above, the shortest common superstring problem plays an important role in current DNA sequencing practice. This section presents the first provably good approximation algorithm for this problem.

2.1 The Shortest Common Superstring Problem

The shortest common superstring problem is: Given a finite set S of strings over some finite or infinite alphabet Σ, find a shortest string w such that each string x in S is a substring of w, i.e., w can be written as uxv for some u and v. The decision version of this problem is known to be NP-complete [6, 7]. Due to its wide range of applications in areas such as data compression and molecular biology [29, 30, 6], finding good approximation algorithms for the shortest common superstring problem has become an important subject of research. It is an open question to find a provably good approximation algorithm for this problem [6, 29, 30]. We provide a solution: the superstring constructed by our algorithm is at most O(n log n) long, where n is the optimal length.

There have been two independent results, by Tarhio and Ukkonen [29] and by Turner [30], both of which considered similar approximation algorithms. Both [29, 30] conjectured that their algorithms would return a superstring of length within 2 times the optimal. But this Tarhio-Turner-Ukkonen conjecture, and the question of finding an algorithm with a non-trivial performance guarantee with respect to length, remain open today. Papers [29, 30] did establish some performance guarantees with respect to the so-called "overlap" measure. This measure basically counts the number of bits saved by the algorithm compared to plainly concatenating all the strings. It was shown that if the optimal solution saves m bits, then the approximate solutions in [29, 30] save at least m/2 bits. However, in general this has no implication with respect to the optimal length, since in the best case it only says that the approximation procedure produces a superstring which is half the total length of all the strings.

The basic algorithm of [29, 30, 28] is quite simple and elegant. It is as follows: Given a set of strings S, choose a pair of strings from S with the largest overlap, merge these two, and put the result back into S. Repeat this process until only one string, the superstring, is left in S. This algorithm has always produced a superstring of length within 2 times the optimal length in all experiments, random or artificial, so far.
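For concreteness, here is a minimal sketch of this greedy maximum-overlap procedure in Python. It is our illustration, not code from the paper; the names overlap and greedy_superstring are ours, and inputs are assumed to contain no string that is a substring of another, as in [29, 30].

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(strings: list[str]) -> str:
    """Greedy algorithm of [28, 29, 30]: repeatedly merge the pair of
    strings with the largest overlap until one superstring remains."""
    s = list(strings)
    while len(s) > 1:
        # pick the ordered pair (i, j) with the largest overlap
        k, i, j = max((overlap(a, b), i, j)
                      for i, a in enumerate(s)
                      for j, b in enumerate(s) if i != j)
        merged = s[i] + s[j][k:]   # glue, sharing the k overlapped characters
        s = [x for idx, x in enumerate(s) if idx not in (i, j)]
        s.append(merged)
    return s[0] if s else ""
```

For example, greedy_superstring(["010", "00200"]) finds a one-character overlap and returns a superstring of length 7.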
We will give a different algorithm. Our algorithm guarantees that the solution is within a log n multiplicative factor of the optimal length n, although we believe that it achieves 2n.

Notation: We usually use small letters for strings and capital letters for sets. If s is a string, then |s| denotes its length, that is, the number of characters in s. If S is a set, then |S| denotes its cardinality, that is, the number of elements in the set S.

2.2 The Approximation Algorithm

Assume that we are given a set S of m strings over some finite or infinite alphabet Σ. Let t be an optimal superstring with |t| = n, so each string in S is a substring of t. Following [29, 30], we assume that no string in S is a substring of another, since such strings can be easily deleted from S. Hence we can order the strings in S by the left ends of their first appearances in t, reading from left to right. We list them according to this ordering: s_1, ..., s_m. In the following we identify s_i with its first appearance in t.

The idea behind our algorithm is to merge large groups. Each time, we try to determine two strings such that merging them properly would make many others become substrings of the newly merged string. For two strings s and s', we use the word merge to mean that we put s and s' (and nothing else) together, possibly utilizing some overlaps they have in common, to construct a superstring of the two. In general there may be more than one way to merge s and s'. There may be two optimal ways, and many other non-optimal ways. For example, if s = 010 and s' = 00200, we can merge them with s in front optimally as m_1(s, s') = 0100200, or with s' in front optimally as m_2(s, s') = 0020010; we can also merge them non-optimally as m_3(s, s') = 01000200 or m_4(s, s') = 00200010. These are all the possible ways of merging s and s'. For each merge function m, we write m(s, s') to denote the resulting superstring. There are at most 2 min{|s|, |s'|} ways of merging s and s'.

We now present our algorithm.

Group-Merge: Input: S = {s_1, ..., s_m}.
(0) Let T := ∅.
(1) Find s, s' ∈ S such that
        cost(s, s') = |m(s, s')| / weight(m(s, s'))
    is minimized over all merge functions m, where
        weight(m(s, s')) = Σ_{a ∈ A} |a|
    and A is the set of strings in S that are substrings of m(s, s').
(2) Merge s and s' into m(s, s') as defined in Step (1). T := T ∪ {m(s, s')}. Delete the set A from S.
(3) If |S| > 0 then go to (1).
(4) If |T| > 1 then S := T and go to (1); else return the only string in T as the superstring.
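The following Python sketch is our reading of steps (0)-(4); the helper names merges and best_merge are ours, and ties are broken arbitrarily. It enumerates every merge m(s, s'), scores it by cost(s, s') = |m(s, s')| / weight(m(s, s')), and deletes the covered set A at each step.

```python
def merges(s: str, t: str):
    """All merges of s and t with s in front, one per legal overlap
    (including plain concatenation at overlap 0)."""
    for k in range(min(len(s), len(t)), -1, -1):
        if k == 0 or s.endswith(t[:k]):
            yield s + t[k:]

def best_merge(s: str, t: str, S: list[str]):
    """Cheapest merge of s and t: minimize |m| / weight(m), where
    weight(m) sums |a| over all a in S that are substrings of m."""
    best = None
    for cand in list(merges(s, t)) + list(merges(t, s)):
        covered = [a for a in S if a in cand]       # the set A
        cost = len(cand) / sum(len(a) for a in covered)
        if best is None or cost < best[1]:
            best = (cand, cost, covered)
    return best

def group_merge(S: list[str]) -> str:
    """Steps (0)-(4): collect cheapest merges into T, deleting the
    strings each merge covers; then repeat on T (step (4))."""
    S = list(S)
    while True:
        T = []
        while len(S) > 1:
            cand, _, covered = min(
                (best_merge(S[i], S[j], S)
                 for i in range(len(S)) for j in range(i + 1, len(S))),
                key=lambda b: b[1])
            T.append(cand)
            S = [x for x in S if x not in covered]
        T.extend(S)                  # a possible leftover single string
        if len(T) <= 1:
            return T[0] if T else ""
        S = T                        # step (4): repeat on the merges
```

On the example above, group_merge(["010", "00200"]) returns one of the two optimal merges of length 7, since both length-7 candidates have cost 7/8 while the length-8 candidates cost 1.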
In the above algorithm we could instead put the result m(s, s') back into S, and hence get the following algorithm.

Group-Merge 1: Input: S = {s_1, ..., s_m}.
(0) Let weight(s_i) = |s_i| for each string s_i ∈ S.
(1) Find s, s' ∈ S such that
        cost(s, s') = |m(s, s')| / weight(m(s, s'))
    is minimized, where
        weight(m(s, s')) = Σ_{a ∈ A} weight(a)
    and A is the set of strings in S that are substrings of m(s, s').
(2) Merge s and s' into m(s, s') as defined in Step (1). Delete the set A from S. Add m(s, s') to S.
(3) If |S| > 1 then go to (1); else the only string left in S is the superstring.

In order to keep the analysis simple, we only analyze steps (0)-(3) of algorithm Group-Merge.

Theorem 1. Given a set of strings S, if the length of an optimal superstring is n, then algorithm Group-Merge produces a superstring of length O(n log n).

Proof: As discussed above, we can assume that in S no string is a substring of another and that all strings are ordered by their first appearance in the shortest superstring. Let this order be s_1, s_2, ..., s_m.

We separate S into groups. The first group G_1 contains s_1, ..., s_i, where i < m is the largest index such that (the first appearances of) s_1 and s_i overlap in t. The second group contains s_{i+1}, ..., s_j, where j, with i+1 ≤ j ≤ m, is the largest index such that s_j overlaps with s_{i+1} in t. And so on: in general, if s_k is the last element of G_l, then G_{l+1} contains s_{k+1}, ..., s_p, where p, with k+1 ≤ p ≤ m, is the largest index such that s_{k+1} overlaps with s_p in t. See Figure 1. Assume that there are g groups: G_1, ..., G_g.

[Figure 1: Grouping of substrings.]

For G_i, let b_i and t_i be the first (bottom) and last (top) strings in G_i, according to our ordering, respectively. That is, b_i and t_i sandwich the rest of the strings in G_i, and for some optimal merge m_i, every string in G_i is a substring of m_i(b_i, t_i).

Lemma 2. Σ_{i=1}^{g} |m_i(b_i, t_i)| ≤ 2n, where n is the length of the shortest superstring t for S.

Proof: This can easily be seen geometrically: put all the groups G_1, ..., G_g back into their original positions in the optimal arrangement (which gives the shortest superstring t). Then the strings in G_i overlap with nothing except strings in G_{i-1} and G_{i+1} (and of course in G_i itself), for i = 2, ..., g-1. Thus counting the optimal solution for each G_i separately at most doubles the length of the optimal solution. □

Lemma 3. For each G_i, after deleting an arbitrary number of strings from G_i, we can still merge a pair of the remaining strings in G_i so that all the other remaining strings are substrings of this merge. Furthermore, the resulting merge is a substring of m_i(b_i, t_i).

Proof: Consider the original optimal arrangement and ordering of the strings s_1, ..., s_m in t. All strings in G_i overlap at a single position. So after deleting some strings, we can still properly merge the first (bottom) and last (top) strings among the leftover strings in G_i to obtain a superstring such that all the leftover strings in G_i are substrings of this superstring. Obviously this superstring is also a substring of m_i(b_i, t_i). □

For a set of strings A, let ||A|| = Σ_{a ∈ A} |a|. Let G_j^r be the set of strings remaining in G_j before the r-th iteration. Let S_r be the set of strings deleted at the r-th iteration of Group-Merge, let b_r and t_r be the first and last strings in S_r according to our ordering, and let m_r be the merge function used in the r-th iteration to combine b_r and t_r. Let there be a total of K iterations executed by Group-Merge.

Now the length L_j used to merge the strings of G_j can be bounded as follows (where k' ≤ K indicates the first iteration at which G_j^{k'} becomes empty):

    L_j = Σ_{r=1}^{k'} |m_r(b_r, t_r)| · ||S_r ∩ G_j|| / ||S_r||
        ≤ Σ_{r=1}^{k'} |m_j(b_j, t_j)| · ||S_r ∩ G_j|| / ||G_j^r||
        ≤ |m_j(b_j, t_j)| · H(||G_j||)        (where H(m) = Σ_{i=1}^{m} 1/i = O(log m))
        = |m_j(b_j, t_j)| · O(log ||G_j||),

where the first inequality holds because, by Lemma 3, at each iteration the pair sandwiching G_j^r was available at cost at most |m_j(b_j, t_j)| / ||G_j^r||, and the algorithm chose a pair of minimum cost. Hence the total length we use to merge all of G_1, ..., G_g is, by Lemma 2,

    Σ_{j=1}^{g} L_j ≤ O(log l) · Σ_{i=1}^{g} |m_i(b_i, t_i)| ≤ O(log l) · 2n = O(n log l),

where l = max_j ||G_j||. But O(log l) = O(log n), since n is (polynomially) larger than the number of strings in any G_j and (polynomially) larger than the length of the longest string in any G_j. Therefore the algorithm outputs a superstring of length at most O(n log n). □

Remark. We conjecture that our algorithm always outputs a superstring of length no more than 2n (and possibly even less), where n is the optimal length. Vijay Vazirani constructed an example for which our algorithm outputs a superstring of length 1.3n. Samir Khuller suggested that we can replace the weight function by just counting the number of strings "sandwiched" by the two merging strings. Although this variant still guarantees the O(n log n) performance, a counterexample shows that its output sometimes exceeds 2n in length.
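To make the grouping used in the proof of Theorem 1 concrete, the sketch below (our code; it assumes the layout of the strings inside an optimal superstring t is known, which of course it is not during actual sequencing) computes the groups G_1, ..., G_g from first occurrences.

```python
def occurrence_groups(t: str, S: list[str]) -> list[list[str]]:
    """Split S (no string a substring of another) into the groups of
    Theorem 1: a group starts at some string s_k and runs to the last
    s_p whose first occurrence in t still overlaps s_k."""
    occ = sorted((t.index(s), s) for s in S)   # order by left end in t
    groups, i = [], 0
    while i < len(occ):
        start, s = occ[i]
        right_end = start + len(s)             # where s_k ends in t
        p = i
        while p + 1 < len(occ) and occ[p + 1][0] < right_end:
            p += 1                             # s_{p+1} overlaps s_k in t
        groups.append([x for _, x in occ[i:p + 1]])
        i = p + 1
    return groups

# e.g. with t = "abcdefgh" and S = ["abcd", "cdef", "fgh"]:
# occurrence_groups(t, S) == [["abcd", "cdef"], ["fgh"]]
```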
3 Modeling the DNA Sequencing Problem

The task of this section is to put our DNA sequencing problem into a suitable mathematical framework. In the following, we safely assume that the expected sampling length l is much larger than log n, where n is the length of the DNA sequence². It may also be reasonable to assume that in order to sequence a human genome of length 3 × 10^9, at about 500 characters per fragment, the sequencing process must be fully automated, without human intervention. We are interested in fast and simple sequencing methods that can be easily automated. Another safe assumption we will make is that sampling is uniformly random. This is more or less the reality, or achievable. We consider two models.

² With current technology l is about 500, whereas n can be up to 3 × 10^9.

3.1 Recovering It Precisely

If we can assume that a DNA sequence s is random and that substrings from S_l are uniformly sampled, where S_l contains s's substrings of length l together with the substrings of length less than l that are prefixes or suffixes of s, then s can be identified with no error, with high probability.

Just to simplify the calculation, we consider a fixed l rather than all l (less than 500); this should not change the result. We provide a rough calculation. Let K(s) ≥ |s| = n, where K(s) is the Kolmogorov complexity of s³. Divide s into 2n/l consecutive blocks of length l/2. Let E_i denote the event that the i-th block is not fully covered by any sample, and let E = ∪_i E_i. At each random sampling, Pr(E_i) ≤ 1 - l/(2n). Then after O((n log n)/l) samples, Pr(E) ≤ n^{-O(1)}, so all positions of s are covered with probability 1 - n^{-O(1)}. By the same calculation, the probability that some sampled substring fails to overlap every other sampled substring in more than l/2 characters is also less than n^{-O(1)}. Since s is random, there are no repetitions in s of length greater than l/2 when l > 4 log n [18]; we thus can precisely recover s, with probability 1 - n^{-O(1)}, after O((n log n)/l) samples.

³ We refer the reader to [18] for an introduction to Kolmogorov complexity. Here we give an intuitive definition: the Kolmogorov complexity of a binary string x conditional to y, written K(x|y), is the length of the shortest Pascal program, encoded in binary in a standard way, that starts with input y, prints x, and halts. K(x) = K(x|ε).

Of course, assuming that s is random is dangerous, since it is known that human DNA molecules are not very random and actually contain many (pretty long) repetitions (and redundancies). The above calculation may be meaningful only for lower life forms whose repetitions are mostly shorter than l/2.
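The rough calculation above can be made concrete. The following sketch is ours, not the paper's: it solves the union bound for the number of samples explicitly, whereas the paper suppresses all constants inside O(·).

```python
import math

def samples_to_cover(n: float, l: float, c: float = 1.0) -> int:
    """Number k of uniform random length-l samples so that every one of
    the 2n/l consecutive blocks of length l/2 is fully covered by some
    sample, except with probability about n**(-c).

    A fixed block is fully covered by a single sample with probability
    roughly l/(2n), so after k samples
        Pr[some block missed] <= (2n/l) * (1 - l/(2n))**k
                              <= (2n/l) * exp(-k*l/(2n)).
    Setting this to n**(-c) and solving for k gives the
    O(n log n / l) bound of Section 3.1."""
    blocks = 2 * n / l
    return math.ceil(blocks * (c * math.log(n) + math.log(blocks)))

# A 5000-character virus sampled in 500-character fragments:
# samples_to_cover(5000, 500) is about 230, far below n = 5000.
```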
3.2 Learning It Approximately

An alternative approach, which we take in the rest of this paper, is to make no assumptions about the sequence s, although we will lose some power of our theory. We appeal to the Valiant learning model. Our learned DNA sequence will only be good, with high probability, for short queries about several hundred characters long. Although this only partially captures reality, we are able to fully characterize the world we have captured. This is still meaningful, since certain biological functions are encoded by just a few hundred base pairs. One purpose of DNA sequencing is to find the functionality of the genes; there is really no need to insist on recovering the original sequence precisely, especially when this is impossible.

We first describe the distribution-independent model of learning introduced by Valiant [31]. More discussion and justification of this model can be found in [2, 12, 14, 32, 33]; [15] has been a useful source for me. We assume that the learning algorithm A has available a black box called EXAMPLES, with two buttons labeled POS and NEG. If POS (NEG) is pushed, a positive (negative) example is generated according to some fixed but unknown probability distribution D+ (D-). We assume nothing about the distributions D+ and D-, except that Σ_{s ∈ POS} D+(s) = 1 and Σ_{s ∈ NEG} D-(s) = 1 (i.e., Σ_{s ∈ POS} D-(s) = 0 and Σ_{s ∈ NEG} D+(s) = 0). For discrete domains, Valiant learnability can be defined as follows.

Definition 1. Let C and C' be concept classes. C is polynomially learnable from examples (by C') if there is a (randomized) algorithm A with access to POS and NEG which, taking inputs 0 < ε, δ < 1, for any c ∈ C and any D+, D-, halts in time polynomial in size(c), 1/ε, and 1/δ, and outputs a concept c' ∈ C' that with probability greater than 1 - δ satisfies

    Σ_{c'(s)=0} D+(s) < ε   and   Σ_{c'(s)=1} D-(s) < ε.

We say that A is a learning algorithm for C.

In the following definition, we model (a subset of) the DNA sequencing problem as the string learning problem under Valiant learning. Notice that the Valiant model allows learning under arbitrary distributions, whereas in our case uniform sampling is sufficient. Therefore we are really studying a more general problem of learning a string.

Definition 2. String Learning Problem. The concept class C is the set of strings (DNA molecules to be sequenced) over the 4-letter alphabet {A, C, G, T}. The positive examples for each concept (i.e., string) c of length n are its substrings of length no more than n; the negative examples⁴ are strings that are not substrings of c. Sampling is done at random according to some unknown distributions for positive and negative examples, respectively.

⁴ In DNA sequencing practice, there do appear to be negative examples, due to biological restrictions on what cannot be combined.

We strengthen a well-known result of Blumer, Ehrenfeucht, Haussler, and Warmuth [2, 3] in the following theorem. We have replaced their requirement that size(c') ≤ n^α m^β by that of K(c'|c) ≤ n^α m^β; the same proof of [3] still works.

Theorem 4. Let C and C' be concept classes, and let c ∈ C with size(c) = n. For α ≥ 1 and 0 ≤ β < 1, let A be an algorithm that, on input of m/2 positive examples of c drawn from D+ and m/2 negative examples of c drawn from D-, outputs a hypothesis c' ∈ C' that is consistent with the examples and satisfies K(c'|c) ≤ n^α m^β, where K(c'|c) is the Kolmogorov complexity of c' conditional to c. Then A is a learning algorithm for C by C' for

    m = O(max((1/ε) log(1/δ), (n^α/ε)^{1/(1-β)})).

When β = 0 and n^α/ε > log(1/δ), we will use m = O(n^α/ε).

By the above theorem, we can certainly trivially learn, with error probability 1/n, a string of length n by sampling m = n^3 examples: the output will be a set of substrings of total length at most n^2, since there are only about n^2 substrings and we can merge them into at most n strings. Needless to say, sampling 5000^3 fragments to sequence a DNA molecule is not satisfactory⁵. A more careful count can bring this down to O(n log n/ε), which is still not satisfactory. To show that theory can agree with practice, we show that we need only about O(n log^2 n/(lε)) fragments, where l is the D+-expected sampling length. This is much closer to laboratory practice. Notice that the divisor l is around 500 and hence is much more significant than a log n factor for n ≤ 3 × 10^9. We conjecture that this can be improved to O(n log n/(lε)). The algorithm will depend on the good approximation algorithm given in Section 2 for finding a short superstring, which guarantees an output not much longer than the shortest superstring of a given set of strings (fragments).

⁵ Let n be the length of the string to be learned. A trivial algorithm would sample substrings and keep them, merging a and b only if a is a substring of b, as the concept. Since there are then at most n^n such concepts (although the total length may be n^2), we need only O(n log n/ε) examples to learn with error probability ε, by Theorem 4. However, this is still not good enough: we do not want to sample, say, 5000 log 5000 substrings to identify a virus DNA of 5000 base pairs.
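As a quick numeric illustration of Theorem 4 (our code; constants are suppressed exactly as in the theorem's O(·), so the absolute values are indicative only):

```python
import math

def theorem4_samples(n: float, alpha: float, beta: float,
                     eps: float, delta: float) -> float:
    """m = max((1/eps) * log(1/delta), (n**alpha / eps)**(1/(1-beta))),
    the sample bound of Theorem 4 with constants suppressed."""
    return max(math.log(1 / delta) / eps,
               (n ** alpha / eps) ** (1 / (1 - beta)))

n = 5000.0                       # a small virus genome
# Trivial learner: hypothesis of total length about n^2, i.e. alpha = 2,
# beta = 0; with eps = 1/n this is the m = n^3 count from the text.
trivial = theorem4_samples(n, alpha=2, beta=0, eps=1 / n, delta=0.01)

# Group-Merge (Theorem 5): K(c'|c) = O(n log^2 n / l), so with l = 500
# the count behaves like n log^2 n / (l * eps) instead.
l, eps = 500.0, 0.1
group_merge = n * math.log(n) ** 2 / (l * eps)   # about 7e3 vs 1.25e11
```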
4 Learning a String Efficiently

Formally, our concept class C is a set of strings. Positive examples are substrings of a target string t, distributed according to D+, which may be uniform in the DNA applications. Negative examples are strings that are not substrings of t, distributed according to D-. C' is a class of sets of strings.

Theorem 5. C is learnable (by C'), with error probability ε, using only O(n log^2 n/(lε)) samples, where l is the D+-expected sample length.

Proof: Given O(n log^2 n/(lε)) positive and negative examples, if we output a concept c' such that K(c'|c) ≤ O(n log^2 n/l), then by Theorem 4 we have a learning algorithm.

We first change the Group-Merge algorithm to also deal with negative examples: at step (1), we now look for a pair s, s' such that cost(s, s') is minimized under the condition that m(s, s') does not contain a negative example as a substring. The learned concept is

    c' = { m(s, s') : m(s, s') is chosen in step (1) of Group-Merge }.

So c' may contain more than one string. To show that the old analysis is still good, we only need to observe one fact: there is always a way to properly combine the first and last strings in each group G_j at the r-th step such that the result contains all strings in G_j^r as substrings and no negative examples as substrings. Hence the analysis of Theorem 1 still carries through.

Now we count how many c' are possible outputs of Group-Merge, given the fact that we draw examples using c. The concept c' is constructed in O((n log n)/l) iterations, since the total length of c' is O(n log n) and l is the D+-expected length of a positive sample⁶. At each step, two substrings of c are combined in one of n^{O(1)} possible ways. So altogether we have n^{O((n log n)/l)} potential c'. Therefore

    K(c'|c) ≤ O(((n log n)/l) · log n) = O(n log^2 n/l).

Given a new string, we can decide whether it is a positive or a negative example by testing whether it is a substring of some string in c'. By Theorem 4, our error probability is at most ε. □

⁶ By statistics (Chernoff bounds), we know that with high probability the average sample size l' of the sample we have drawn is close to l. In our case, the phrase "with high probability" can be absorbed into the error probability δ, and the discrepancy between l' and l can be absorbed by a small constant.

Corollary 6. If we have an algorithm approximating the shortest common superstring of length n within length 2n, then C is learnable, with error probability ε, using only O(n log n/(lε)) samples, where l is the D+-expected sample length.

In the real-life situation of DNA sequencing, there are many restrictions on what cannot be combined. These conditions may be regarded as negative examples. However, if one prefers to think that no negative examples are given, we can still reasonably assume that negative instances are more or less uniformly distributed. And then:

Corollary 7. Assume a uniform distribution over the negative examples. C is learnable (by C'), with error probability ε, using only O(n log^2 n/(lε)) positive examples.

Proof: By modifying the arguments of Theorem 4 (proof omitted), it is easily seen that our algorithm Group-Merge still guarantees

    Σ_{c'(s)=0} D+(s) ≤ ε,

where c'(s) = 0 means that s is not a substring of any string in the learned concept c'. On the negative side, because the output length is at most O(n log n), at most O((n log n)^2) negative examples are made positive. Since we have assumed the uniform distribution, the error probability on the negative side is only, for not too small ε,

    O((n log n)^2)/4^n < ε.   □
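Here is a sketch of the modification used in the proof of Theorem 5 (our code, reusing the merges helper from the Group-Merge sketch in Section 2.2; the names are ours): a candidate merge is admissible only if it contains no negative example, and the learned concept c' is the set T of merges rather than a single superstring.

```python
def learn_concept(POS: list[str], NEG: list[str]) -> list[str]:
    """Modified Group-Merge: step (1) considers only merges m(s, s')
    containing no negative example as a substring; the learned concept
    c' is the set of merges produced. Requires merges() from the
    Section 2.2 sketch."""
    S, T = list(POS), []
    while len(S) > 1:
        best = None                       # (merge, cost, covered set A)
        for i in range(len(S)):
            for j in range(i + 1, len(S)):
                for cand in list(merges(S[i], S[j])) + list(merges(S[j], S[i])):
                    if any(neg in cand for neg in NEG):
                        continue          # inadmissible: hits a negative
                    covered = [a for a in S if a in cand]
                    cost = len(cand) / sum(len(a) for a in covered)
                    if best is None or cost < best[1]:
                        best = (cand, cost, covered)
        if best is None:                  # no admissible merge remains
            break
        T.append(best[0])
        S = [x for x in S if x not in best[2]]
    return T + S

def classify(c_prime: list[str], query: str) -> bool:
    """A query string is labeled positive iff it is a substring of
    some string in the learned concept c'."""
    return any(query in m for m in c_prime)
```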
Remark 1. The algorithms of Turner [30] and of Tarhio and Ukkonen [29] do not seem to work when there are negative examples. This is because their algorithms merge a pair of strings at each step; if a merge goes wrong early on, the negative examples can prohibit further mergings, and the algorithm ends up with a long superstring. On the other hand, although it was not our original purpose, algorithm Group-Merge adapts to this situation perfectly.

Remark 2. Our approach also fits the situation of inaccurate data. Such inaccurate data often arise during the Sanger procedure [27] and often prevent us from obtaining the real underlying sequence. By allowing approximate matching, a straightforward extension of our approach, combining ideas from [1, 13], would provide efficient and robust algorithms.

Remark 3. It seems that our new algorithm deals with repetitions better than the maximum overlap algorithm used by Turner, Tarhio and Ukkonen. It is believed that both algorithms have the 2-times-optimal performance. Based on this conjecture, Corollary 6 says that we need only O(n log n/(lε)) samples in order to achieve (1 - ε) × 100 percent confidence. Several major open problems remain: improve our O(n log n) bound (to 2n); prove the 2n bound, or any non-trivial bound, for the maximum overlap algorithm.

Remark 4. We have already indicated the limitation of our approach: in our model, we have assumed that future queries to the learned DNA sequence are of the size of several hundred characters. Whether one can correctly answer, with high probability, queries about longer substrings of the learned DNA sequence is not clear. This should be an interesting subject of future research.

Remark 5. It is hoped that our theory and methods can be applied at a massive scale once the sampling and sequencing processes are fully automated. The procedure we have described requires no human intervention: it only involves randomly sequencing short fragments and computing a shortest common superstring. Our theory guarantees the performance, and we have bounded the number of samples required. It is foreseeable that this method may be useful in situations where we are interested only in finding out whether a certain (not too long) character sequence (representing a function) appears in a given DNA molecule. If future technology (like the PCR method) allows us to sample longer fragments, our method will be more useful, since then we can guarantee answers to longer queries with high confidence.

5 Acknowledgements

I gratefully thank Qiang Yang, Derick Wood, and Esko Ukkonen for bringing the literature on shortest common superstrings to my attention; Murray Sherk, John Tromp, and Paul Vitanyi for commenting on the manuscript; Deming Li, Ed Shipwash, Xiaohong Sun, and Baozhang Tang for teaching me biochemistry; David Haussler (who suggested that I calculate the probability of recovering the superstring precisely), Udi Manber, and a referee (COLT) for crucial criticisms and suggestions; Samir Khuller and Vijay Vazirani for enlightening discussions on the superstring algorithms; and NSERC for its financial support.
References

[1] D. Angluin and P.D. Laird. Learning from noisy examples. Machine Learning 2(4):343-370, 1988.
[2] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM 35(4), 1989.
[3] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Occam's razor. Information Processing Letters 24, 1987.
[4] D. Freifelder. Molecular Biology. Jones & Bartlett Publishers, Inc., 1983.
[5] P. Friedland and L. Kedes. Discovering the secrets of DNA. Communications of the ACM 28(11):1164-1186, 1985.
[6] J. Gallant, D. Maier, and J. Storer. On finding minimal length superstrings. Journal of Computer and System Sciences 20:50-58, 1980.
[7] M. Garey and D. Johnson. Computers and Intractability. Freeman, New York, 1979.
[8] D. Haussler. Generalizing the PAC model: sample size bounds from metric dimension-based uniform convergence results. 30th IEEE Symp. on Foundations of Computer Science, 40-45, 1989.
[9] D. Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's model. Artificial Intelligence 36(2):177-221, 1988.
[10] D. Haussler, N. Littlestone, and M. Warmuth. Expected mistake bounds for on-line learning algorithms. 29th IEEE Symp. on Foundations of Computer Science, 100-109, 1989.
[11] D. Helmbold, R. Sloan, and M. Warmuth. Learning nested differences of intersection-closed concept classes. 2nd Workshop on Computational Learning Theory, 41-56, 1989.
[12] M. Kearns, M. Li, L. Pitt, and L.G. Valiant. Recent results on Boolean concept learning. Proc. 4th International Workshop on Machine Learning, 337-352, 1987.
[13] M. Kearns and M. Li. Learning in the presence of malicious errors. 1988 ACM Symp. on Theory of Computing. Also Harvard TR-03-87.
[14] M. Kearns, M. Li, L. Pitt, and L.G. Valiant. On the learnability of Boolean formulae. 19th ACM Symposium on Theory of Computing, 285-295, 1987.
[15] M. Kearns. The computational complexity of machine learning. Ph.D. thesis, Harvard University, Report TR-13-89, 1989.
[16] A. Lesk (Ed.). Computational Molecular Biology: Sources and Methods for Sequence Analysis. Oxford University Press, 1988.
[17] M. Li and P. Vitanyi. A theory of learning simple concepts under simple distributions. 30th IEEE Symp. on Foundations of Computer Science, 34-39, 1989.
[18] M. Li and P. Vitanyi. Kolmogorov complexity and its applications. In Handbook of Theoretical Computer Science, J. van Leeuwen, Ed., 1990.
[19] N. Littlestone. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. 28th IEEE Symposium on the Foundations of Computer Science, 1987.
[20] W. Maass and G. Turan. On the complexity of learning from counterexamples. 30th IEEE Symp. on Foundations of Computer Science, 262-267, 1989.
[21] R. Michalski, J. Carbonell, and T. Mitchell. Machine Learning. Morgan Kaufmann, 1983.
[22] B.K. Natarajan. On learning Boolean functions. ACM Symp. on Theory of Computing, 296-304, 1987.
[23] H. Peltola, H. Soderlund, J. Tarhio, and E. Ukkonen. Algorithms for some string matching problems arising in molecular genetics. Information Processing 83 (Proc. IFIP Congress, 1983), 53-64.
[24] L. Pitt and L.G. Valiant. Computational limitations on learning from examples. Journal of the ACM 35(4):965-984, 1988.
[25] L. Pitt and M. Warmuth. The minimum consistent DFA problem cannot be approximated within any polynomial. ACM Symp. on Theory of Computing, 421-432, 1989.
[26] R. Rivest. Learning decision lists. Machine Learning 2(3):229-246, 1987.
[27] R. Staden. Automation of the computer handling of gel reading data produced by the shotgun method of DNA sequencing. Nucleic Acids Research 10(15):4731-4751, 1982.
[28] J. Storer. Data Compression: Methods and Theory. Computer Science Press, 1988.
[29] J. Tarhio and E. Ukkonen. A greedy approximation algorithm for constructing shortest common superstrings. Theoretical Computer Science 57:131-145, 1988.
[30] J. Turner. Approximation algorithms for the shortest common superstring problem. Information and Computation 83:1-20, 1989.
[31] L.G. Valiant. A theory of the learnable. Communications of the ACM 27(11):1134-1142, 1984.
[32] L.G. Valiant. Learning disjunctions of conjunctions. Proc. 9th IJCAI, vol. 1, 560-566, Los Angeles, CA, August 1985.
[33] L.G. Valiant. Deductive learning. Phil. Trans. R. Soc. Lond. A 312:441-446, 1984.