Towards a DNA Sequencing Theory
(Learning a String)
(Preliminary Version)

Ming Li*
University of Waterloo

DNA: Death's Natural Alternative
- Globe & Mail, April 14, 1990

Key words: Machine Learning, Shortest Common Superstring, Approximation Algorithm, DNA Sequencing.
Abstract

In laboratories, DNA sequencing is (roughly) done by sequencing a large amount of short fragments and then finding a shortest common superstring of the fragments. In mathematical frameworks, under plausible assumptions suitable for massive automated DNA sequencing and for analyzing DNA sequencing "algorithms", we study the DNA sequencing problem as learning a superstring from its randomly drawn substrings. Under certain restrictions, this may be viewed as learning a superstring in the Valiant learning model, and in this setting we give an efficient algorithm for learning a superstring and a quantitative bound on how many samples suffice. One major obstacle to our approach turns out to be a quite well-known open question on how to approximate the shortest common superstring of a set of strings, raised by a number of authors in the last ten years [6, 29, 30]. We give the first provably good algorithm, which approximates a shortest superstring of length n by a superstring of length O(n log n).

1 Introduction

The fundamental quest of Life is to understand how it functions. Since the discovery of deoxyribonucleic acid (DNA) in the 1950's, the new field of molecular genetics has expanded at a rate that can be matched by no other field except computer science.

A DNA molecule, which holds the secrets of Life, consists of a sugar-phosphate backbone and, attached to it, a long sequence of four kinds of nucleotide bases. At an abstract level, a single-stranded DNA molecule can be represented as a character string over the set of nucleotides {A, C, G, T}. Such a character string ranges from a few thousand symbols long for a simple virus, to 2 × 10^8 symbols for a fly, and to 3 × 10^9 symbols for a human being. Determining this string for different molecules, or sequencing the molecules, is a crucial step towards understanding the biological functions of the molecules. Huge (in terms of money and man-years) national projects of sequencing the human genome are proposed or underway although, unfortunately, major technical problems remain.

[Footnote *] The author is supported by NSERC operating grants OGP-0036747 and OGP-0046506. Address: Computer Science Department, University of Waterloo, Ontario N2L 3G1, Canada. Email: mli@watmath.waterloo.edu
With current laboratory methods, such as the Maxam-Gilbert or the Sanger procedure, it is by far impossible to sequence a long molecule (of 3 × 10^9 base pairs for a human) as a whole. Each time, a randomly chosen fragment of only up to 500 base pairs can be sequenced. Various methods to sequence the long molecule are used. In general, biochemists "cut" millions of such (identical) molecules into pieces, each typically containing about 200-500 nucleotides (characters) [4, 16]. For example, a biochemist can decide to cut the molecules wherever the substring GGACTT appears (via restriction enzymes). She or he may apply many different cuttings using different substrings. After cutting, these fragments can be roughly "sorted" according to their "weight" ("length") in a not very precise fashion. A biochemist "samples" the fragments (of each length, plus or minus 50 base pairs) randomly. Then, say, the Sanger procedure is applied to sequence the sampled fragment. A good technician can process, say, 7 to 14 such fragments a day manually. Sampling is expensive and slow in the sense that it is manual or, at best, mechanical. The sampling is random in the sense that we have absolutely no idea where in the 3 × 10^9 character long sequence the fragment of 500 characters might come from. It is also well known that long repetitions appear in a human genome.

From hundreds, sometimes millions, of these random fragments, a biochemist has to assemble the superstring representing the whole molecule. Programs that try to approximate a shortest common superstring (whole molecule) of a given set of strings (fragments) are routinely used [23, 16]. However, it is not known whether this always works, or why it works. It is reported in [23, 27, 16] that programs based on shortest common superstring algorithms¹ work very satisfactorily for real biological data. In [29] it is mentioned that "although there is no a priori reason to aim at a shortest superstring, this seems to be the most natural restriction that makes the program non-trivial." The goal of this paper is to initiate a study of mathematical foundations for the above discussions and experiments. We will try to formulate the problem in different ways and provide solutions in certain situations. We face at least two major problems:

(a) Given a set of strings, how do we efficiently approximate the shortest common superstring (finding it is NP-complete)? This has been an open problem raised by Gallant, Maier and Storer [6], Turner [30], and Tarhio and Ukkonen [29] in the past ten years. The latter two papers contain two algorithms, based on the maximum overlapping method also used by biochemists, and a conjecture that these algorithms guarantee good performance with respect to the optimal length.

(b) Even given good approximation algorithms for the superstrings, does that guarantee that we can infer the DNA sequence correctly?

We will provide an answer for (a) by giving a provably good approximation algorithm for the shortest common superstring problem. Our algorithm outputs a superstring of length at most O(n log n), where n is the optimal length. We will also partially answer (b). Under certain restrictions, we study how well the shortest superstring algorithm does, and we provide a new "sequencing" algorithm under the Valiant learning model. We will also give quantitative bounds on the number of random fragments that need to be sequenced.

This paper studies Valiant learning in a domain where the sample space is of polynomial size. Since these concepts are trivially polynomially learnable, not much attention has been paid to them. In the past, researchers have concentrated on the "learnability" of concept classes whose sample spaces are, of course (otherwise the problem would be trivial), super-polynomial [31, 2, 12, 8, 26, 22, 11], although efficient sampling was studied, for example, in [9]. On the other hand, our problem has trivial algorithms that need a high polynomial number of samples, but it also has non-trivial algorithms requiring a low polynomial number of samples.

It is worth noting that artificial intelligence methods [21] and string matching algorithms have been extensively applied to DNA sequence analysis [5].

[Footnote 1] It was only conjectured that these algorithms approximate shortest superstrings.
The paper is organized as follows: In the next section, we provide a solution to question (a). Our new approximation algorithm is also crucial in the following sections. In Section 3, we study possible ways of modeling our DNA sequencing problem. We will associate this problem with the Valiant learning model. We also state a stronger version of the Occam's Razor theorem proved in [3], which is needed in Section 4. Then in Section 4 we give provably efficient learning algorithms for the DNA sequencing problem.
2 Approximating the Shortest Common Superstring

As we have discussed above, the shortest common superstring problem plays an important role in current DNA sequencing practice. This section will present the first provably good algorithm for this problem.

2.1 The Shortest Common Superstring Problem

The shortest common superstring problem is: Given a finite set S of strings over some finite or infinite alphabet Σ, find a shortest string w such that each string x in S is a substring of w, i.e., w can be written as uxv for some u and v. The decision version of this problem is known to be NP-complete [6, 7]. Due to its wide range of applications in many areas, such as data compression and molecular biology [29, 30, 6], finding good approximation algorithms for the shortest common superstring problem has become an important subject of research. It is an open question to find a provably good approximation algorithm for this problem [6, 29, 30]. We provide a solution. The superstring constructed by our algorithm is at most n log n long, where n is the optimal length.

There have been two independent results, by Tarhio and Ukkonen [29] and Turner [30], both of which considered similar approximation algorithms. Both [29, 30] conjectured that their algorithms would return a superstring within 2 times the optimal length. But this Tarhio-Turner-Ukkonen conjecture, and the question of finding an algorithm with a non-trivial performance guarantee with respect to length, remain open today. Papers [29, 30] did establish some performance guarantees with respect to the so-called "overlapping" measure. Such a measure basically counts the number of bits saved by the algorithm compared to plainly concatenating all the strings. It was shown that if the optimal solution saves m bits, then the approximate solutions of [29, 30] save at least m/2 bits. However, in general this has no implication with respect to the optimal length, since in the best case it only says that the approximation procedure produces a superstring which is of half the total length of all the strings. The basic algorithm of [29, 30, 28] is quite simple and elegant. It is as follows: Given a set of strings S, choose a pair of strings from S with the largest overlap, merge these two, and put the result back into S. Repeat this process until only one string, the superstring, is left in S. This algorithm has always produced a superstring of length within 2 times the optimal length in all, random or artificial, experiments so far.
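To make the procedure concrete, here is a short Python sketch of this maximum-overlap heuristic. It is our illustration of the algorithm described above, not code from [29, 30], and the helper names are ours.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if b.startswith(a[-k:]):
            return k
    return 0

def greedy_superstring(strings: list[str]) -> str:
    """Greedy heuristic of [28, 29, 30]: repeatedly merge a pair of
    strings with the largest overlap until one string is left."""
    # Remove strings that are substrings of others, as in Section 2.2.
    s = [x for x in strings if not any(x != y and x in y for y in strings)]
    while len(s) > 1:
        # Pick the ordered pair (i, j), i != j, maximizing overlap(s[i], s[j]).
        i, j = max(((i, j) for i in range(len(s)) for j in range(len(s)) if i != j),
                   key=lambda p: overlap(s[p[0]], s[p[1]]))
        k = overlap(s[i], s[j])
        merged = s[i] + s[j][k:]
        s = [x for t, x in enumerate(s) if t not in (i, j)] + [merged]
    return s[0] if s else ""
```

If the Tarhio-Turner-Ukkonen conjecture holds, this heuristic is within twice the optimal length; as the text notes, only the m/2 compression guarantee has actually been proved for it.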
We will give a different algorithm. Our algorithm guarantees that the solution is within a log n multiplicative factor of the optimal length n, although we believe that it achieves 2n.

Notation: We usually use small letters for strings and capital letters for sets. If s is a string, then |s| denotes its length, that is, the number of characters in s. If S is a set, then |S| denotes its cardinality, that is, the number of elements in the set S.

2.2 The Approximation Algorithm

Assume that we are given a set S of m strings over some finite or infinite alphabet Σ. Let t be an optimal superstring with |t| = n, so each string in S is a substring of t. Following [29, 30], we assume that no string in S is a substring of another, since such strings can be easily deleted from S. Hence we can order the strings in S by the left ends of their first appearances in t, reading from left to right. We list them according to the above ordering: s_1, ..., s_m. In the following we identify s_i with its first appearance in t.
The idea behind our algorithm is to merge large groups. Each time, we try to determine two strings such that merging them properly would make many others become substrings of the newly merged string.

For two strings s and s', we use the word merge to mean that we put s and s' (and nothing else) together, possibly utilizing some overlaps they have in common, to construct a superstring of the two. In general there may be more than one way to merge s and s'. There may be two optimal ways, and many other non-optimal ways. For example, if s = 010 and s' = 00200, we can merge them with s in front optimally as m_1(s, s') = 0100200, or with s' in front optimally as m_2(s, s') = 0020010; we can also merge them non-optimally as m_3(s, s') = 01000200 or m_4(s, s') = 00200010. These are all possible ways of merging s and s'. For each merge function m, we write m(s, s') to denote the resulting superstring. There are at most 2 min{|s|, |s'|} ways of merging s and s'.

We now present our algorithm.

Group-Merge: Input: S = {s_1, ..., s_m}.
(0) Let T := ∅.
(1) Find s, s' ∈ S such that
      cost(s, s') = min_m |m(s, s')| / weight(m(s, s'))
    is minimized, where
      weight(m(s, s')) = Σ_{a ∈ A} |a|,
    and A is the set of strings in S that are substrings of m(s, s').
(2) Merge s, s' to m(s, s') as defined in Step (1). T := T ∪ {m(s, s')}. Delete the set A from S.
(3) If |S| > 0 then go to (1).
(4) If |T| > 1 then S := T and go to (1), else return the only string in T as the superstring.

In the above algorithm we could put the result m(s, s') back into S, and hence get the following algorithm.

Group-Merge 1: Input: S = {s_1, ..., s_m}.
(0) Let weight(s_i) = |s_i| for each string s_i ∈ S.
(1) Find s, s' ∈ S such that
      cost(s, s') = min_m |m(s, s')| / weight(m(s, s'))
    is minimized, where
      weight(m(s, s')) = Σ_{a ∈ A} weight(a),
    and A is the set of strings in S that are substrings of m(s, s').
(2) Merge s, s' to m(s, s') as defined in Step (1). Delete the set A from S. Add m(s, s') to S.
(3) If |S| > 1 then go to (1), else the only string left in S is the superstring.

In order to keep the analysis simple, we only analyze steps (0)-(3) of algorithm Group-Merge.
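As a companion to the pseudocode, the following Python sketch is our reading of Group-Merge steps (0)-(4); it is deliberately unoptimized, assumes non-empty input strings, and the helper `merges` enumerates exactly the merge functions m(s, s') described above.

```python
def merges(s: str, t: str):
    """Yield every merge m(s, t): one string in front of the other,
    with any legal overlap (including none)."""
    for a, b in ((s, t), (t, s)):
        for k in range(min(len(a), len(b)), -1, -1):
            if k == 0 or b.startswith(a[-k:]):
                yield a + b[k:]

def group_merge(strings: list[str]) -> str:
    """Our sketch of Group-Merge.  cost(s, s') = |m(s, s')| / weight,
    where weight is the total length of the strings of S that are
    substrings of the merge m(s, s')."""
    S = set(strings)
    while True:
        T = set()                                   # step (0)
        while S:                                    # steps (1)-(3)
            best = None
            for s in S:
                for t in S:
                    if s == t and len(S) > 1:
                        continue
                    for m in merges(s, t):
                        A = {a for a in S if a in m}   # absorbed strings
                        cost = len(m) / sum(len(a) for a in A)
                        if best is None or cost < best[0]:
                            best = (cost, m, A)
            _, m, A = best
            T.add(m)                                # step (2)
            S = S - A
        if len(T) <= 1:                             # step (4)
            return T.pop() if T else ""
        S = T
```

For the example above, list(merges('010', '00200')) returns exactly the four merges m_1 through m_4 (in the order 0100200, 01000200, 0020010, 00200010).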
Theorem 1 Given a set of strings S, if the length of an optimal superstring is n, then algorithm Group-Merge produces a superstring of length O(n log n).

Proof: As discussed above, we can assume that in S no string is a substring of another and all strings are ordered by their first appearance in the shortest superstring. Let this order be s_1, s_2, ..., s_m.

We separate S into groups: The first group G_1 contains s_1, ..., s_i, where i, i < m, is the largest index such that (the first appearances of) s_1 and s_i overlap in t; the second group contains s_{i+1}, ..., s_j, where j, i + 1 ≤ j ≤ m, is the largest index such that s_j overlaps with s_{i+1} in t; and so on. In general, if s_k is the last element of G_l, then G_{l+1} contains s_{k+1}, ..., s_p, where p, k + 1 ≤ p ≤ m, is the largest index such that s_{k+1} overlaps with s_p in t. See Figure 1.

[Figure 1: Grouping of Substrings]

Assume that there are g groups: G_1, ..., G_g. For G_i, let b_i and t_i be the first (bottom) and last (top) strings in G_i, according to our ordering, respectively. That is, b_i and t_i sandwich the rest of the strings in G_i, and for some optimal merge m_i, every string in G_i is a substring of m_i(b_i, t_i).
Lemma 2  Σ_{i=1}^{g} |m_i(b_i, t_i)| ≤ 2n, where n is the length of a shortest superstring t for S.

Proof: This can easily be seen geometrically: put all the groups G_1, ..., G_g back to their original positions in the optimal arrangement (which gives the shortest superstring t). Then strings in G_i overlap with nothing except strings in G_{i-1} and G_{i+1} (and of course in G_i itself), for i = 2, ..., g - 1. Thus counting the optimal solution for each G_i separately at most doubles the length of the optimal solution. □

Lemma 3  For each G_i, after deleting an arbitrary number of strings from G_i, among the strings left we can still merge a pair of strings in G_i so that all other strings are substrings of this merge. Furthermore, the resulting merge is a substring of m_i(b_i, t_i).

Proof: Consider the original optimal arrangement and ordering of the strings s_1, ..., s_m in t. All strings in G_i overlap at a single position. So after deleting some strings, we can still properly merge the first (bottom) and last (top) strings among the leftover strings in G_i to obtain a superstring such that all the leftover strings in G_i are substrings of this superstring. Obviously this superstring is also a substring of m_i(b_i, t_i). □

For a set of strings A, let ||A|| = Σ_{a∈A} |a|. Let G_i^r be the set of strings remaining in G_i before the r-th iteration, and let S^r be the set of strings cancelled at the r-th iteration of Group-Merge. Let b_r, t_r be the first and last strings in S^r, according to our ordering, respectively, and let m_r be the merge function used in the r-th iteration to combine b_r and t_r. Let there be a total of B iterations executed by Group-Merge. Now the length L_j we used to merge the strings in G_j can be measured as follows (where k', k' ≤ B, indicates the first step at which G_j^{k'} becomes empty, and H(m) = Σ_{i=1}^{m} 1/i = O(log m)):

  L_j ≤ |m_j(b_j, t_j)| · H(||G_j||) = |m_j(b_j, t_j)| · O(log ||G_j||).

Hence the total length we use to merge all of G_1, ..., G_g is, by Lemma 2,

  Σ_{j=1}^{g} L_j ≤ O(log l) Σ_{i=1}^{g} |m_i(b_i, t_i)| ≤ O(log l) · 2n = O(n log l),

where l = max_i ||G_i||. But O(log l) = O(log n), since the number of strings in any G_i and the length of the longest string in any G_i are both polynomially bounded in n. Therefore the algorithm will output a superstring of length at most O(n log n). □
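The intermediate steps of the bound on L_j are illegible in our copy; the following LaTeX sketch of the standard set-cover-style charging argument is our reconstruction, consistent with the surviving fragments, not a quotation from the paper.

```latex
% At iteration r, Group-Merge picks the pair of globally minimum cost, while
% by Lemma 3 the surviving bottom and top strings of G_j^r can still be
% merged into a substring of m_j(b_j, t_j) that absorbs all of G_j^r; hence
%     |m_r(b_r,t_r)| / ||S^r||  <=  |m_j(b_j,t_j)| / ||G_j^r||.
% Charging G_j the fraction ||S^r \cap G_j|| / ||S^r|| of the length spent
% at iteration r, and summing up to the step k' at which G_j^{k'} empties:
L_j \;=\; \sum_{r=1}^{k'} |m_r(b_r,t_r)|\,\frac{\|S^r\cap G_j\|}{\|S^r\|}
    \;\le\; |m_j(b_j,t_j)|\sum_{r=1}^{k'}\frac{\|S^r\cap G_j\|}{\|G_j^r\|}
    \;\le\; |m_j(b_j,t_j)|\,H(\|G_j\|).
% The last step is the usual harmonic bound: removing weight w from a pool
% of current weight W contributes at most 1/W + 1/(W-1) + ... + 1/(W-w+1).
```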
Remark. We conjecture that our algorithm always outputs a superstring of length no more than 2n (and possibly even less), where n is the optimal length. Vijay Vazirani constructed an example for which our algorithm outputs a superstring of length 1.3n. Samir Khuller suggested that we could replace the weight function by just counting the number of strings "sandwiched" by the two merging strings. Although this algorithm still guarantees the n log n performance, a counterexample shows that its output sometimes exceeds 2n in length.
3 Modeling the DNA Sequencing Problem

The task of this section is to properly put our DNA sequencing problem into a suitable mathematical framework. In the following, we safely assume that the expected sampling length l is much larger than log n, where n is the length of the DNA sequence².

It may also be reasonable to assume that, in order to sequence a human genome of 3 × 10^9 base pairs at about 500 characters per fragment, the sequencing process must be fully automated, without human intervention. We are interested in fast and simple sequencing methods that can be easily automated. Another safe assumption we will make is that sampling will be uniformly random. This is more or less the reality, or achievable. We consider two models.

[Footnote 2] By current technology l is about 500, whereas n can be as large as 3 × 10^9 in the cases of interest.
3.1 Recovering It Precisely

If we can assume that a DNA sequence s is random and substrings from S_l are uniformly sampled, where S_l contains s's substrings of length l and the substrings of length less than l that are prefixes or suffixes of s, then s can be identified with no error with high probability.

Just to simplify the calculation, we consider a fixed l, rather than all l (less than 500). This should not change the result. We provide a rough calculation. Let K(s) ≥ |s| = n, where K(s) is the Kolmogorov complexity of s³. Divide s into 2n/l consecutive blocks, each of length l/2. Let E_i denote the event that the i-th block is not fully covered by any sample, and let E = ∪_i E_i. At each random sampling, Pr(E_i) ≤ 1 - l/(2n). Then after O((n log n)/l) samples, Pr(E) ≤ n^{-O(1)}. So all positions of s are covered with probability 1 - n^{-O(1)}. By the same calculation, the probability that some sampled substring does not overlap any other sampled substring in more than l/2 characters is also less than n^{-O(1)}. Since s is random, there are no repetitions in s of length greater than l/2 when l > 4 log n [18]; we thus can precisely recover s, with probability 1 - n^{-O(1)}, after O((n log n)/l) samples.

Of course, assuming s to be random is dangerous, since it is known that human DNA molecules are not very random and actually contain many (pretty long) repetitions (and redundancies). The above calculation may be meaningful only for lower life forms, where the repetitions mostly are shorter than l/2.

[Footnote 3] We refer the reader to [18] for an introduction to Kolmogorov complexity. Here we give an intuitive definition: the Kolmogorov complexity of a binary string x conditional to y, written K(x|y), is the length of the shortest Pascal program, encoded in binary in a standard way, that starts with input y, prints x, and halts. K(x) = K(x|ε).
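The covering calculation above is badly garbled in our copy; the following LaTeX sketch reconstructs one standard way to get the claimed bounds, assuming blocks of length l/2 and samples of length l with uniformly random starting positions. The constants are ours.

```latex
% One length-l sample fully covers a fixed block of length l/2 whenever the
% sample starts within roughly l/2 positions to the left of the block, so
%     \Pr(\text{one sample misses block } i) \;\le\; 1 - \tfrac{l}{2n}.
% After m independent samples, by a union bound over the 2n/l blocks,
\Pr(E) \;\le\; \frac{2n}{l}\Bigl(1 - \frac{l}{2n}\Bigr)^{m}
       \;\le\; \frac{2n}{l}\, e^{-ml/(2n)} \;\le\; n^{-O(1)}
\qquad\text{for } m = O\!\Bigl(\frac{n\log n}{l}\Bigr).
```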
3.2 Learning It Approximately

An alternative approach, which we will take in the rest of this paper, is to make no assumptions about the sequence s, although we will lose some power of our theory. We appeal to the Valiant learning model. Our learned DNA sequence will only be good, with high probability, for short queries about several hundred characters long. Although this only partially captures reality, we are able to fully characterize the world we have captured. This is still meaningful, since certain biological functions are encoded by just a few hundred base pairs. One purpose of DNA sequencing is to find the functionality of the genes; there is really no need to insist on recovering the original sequence precisely, especially when this is impossible.

We first describe the distribution-independent model of learning introduced by Valiant [31]. More discussion and justification of this model can be found in [2, 12, 14, 32, 33]; [15] has been a useful source for me. We assume that the learning algorithm A has available a black box called EXAMPLES, with two buttons labeled POS and NEG. If POS (NEG) is pushed, a positive (negative) example is generated according to some fixed but unknown probability distribution D+ (D-). We assume nothing about the distributions D+ and D-, except that Σ_{s∈POS} D+(s) = 1 and Σ_{s∈NEG} D-(s) = 1 (i.e., Σ_{s∈POS} D-(s) = 0 and Σ_{s∈NEG} D+(s) = 0). For discrete domains, Valiant learnability can be defined as follows.
Definition 1 Let C and C' be concept classes. C is polynomially learnable from examples (by C') if there is a (randomized) algorithm A with access to POS and NEG which, taking inputs 0 < ε, δ < 1, for any c ∈ C and D+, D-, halts in polynomial(size(c), 1/ε, 1/δ) time and outputs a concept c' ∈ C' that with probability greater than 1 - δ satisfies

  Σ_{s: c'(s)=0} D+(s) < ε   and   Σ_{s: c'(s)=1} D-(s) < ε.

We say that A is a learning algorithm of C.
In the following definition, we model (a subset of) the DNA sequencing problem as the string learning problem under Valiant learning. Notice that the Valiant model allows sampling under arbitrary distributions, whereas in our case uniform sampling is sufficient. Therefore we are really studying a more general problem of learning a string.

Definition 2 (String Learning Problem) The concept class C is the set of strings (DNA molecules to be sequenced) over the 4-letter alphabet {A, C, G, T}. The positive examples for each concept (i.e., string) c of length n are its substrings of length no more than n; the negative examples⁴ are strings that are not substrings of c. Sampling is done at random according to some unknown distributions for positive and negative examples, respectively.

[Footnote 4] In DNA sequencing practice, there do appear to be negative examples, due to biological restrictions on what cannot be combined.
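For intuition, here is a small Python illustration of the EXAMPLES buttons in the uniform special case of Definition 2; the definition itself allows arbitrary distributions, and the fragment length l, the sampling scheme, and all names here are our own assumptions, not the paper's.

```python
import random

ALPHABET = "ACGT"

def pos_example(c: str, l: int = 500) -> str:
    """POS button: a substring of the (non-empty) target string c.
    We draw a uniformly random window of length at most l; windows
    near the right end come out shorter, mimicking suffix fragments."""
    i = random.randrange(len(c))
    return c[i:i + l]

def neg_example(c: str, length: int = 500) -> str:
    """NEG button: a string that is not a substring of c, drawn here
    by rejection sampling from the uniform distribution."""
    while True:
        x = "".join(random.choice(ALPHABET) for _ in range(length))
        if x not in c:
            return x
```

Under this uniform D+ the expected sample length is close to l, which is the quantity the bounds below divide by.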
We strengthen a well-known result of Blumer, Ehrenfeucht, Haussler, and Warmuth [2, 3] in the following theorem. We have replaced their requirement that size(c') ≤ n^α m^β by that of K(c'|c) ≤ n^α m^β. The same proof of [3] still works.

Theorem 4 Let C and C' be concept classes. Let c ∈ C with size(c) = n. For α ≥ 1 and 0 ≤ β < 1, let A be an algorithm that, on input of m/2 positive examples of c drawn from D+ and m/2 negative examples of c drawn from D-, outputs a hypothesis c' ∈ C' that is consistent with the examples and satisfies K(c'|c) ≤ n^α m^β, where K(c'|c) is the Kolmogorov complexity of c' conditional to c. Then A is a learning algorithm for C by C' for

  m = O(max((1/ε) log(1/δ), (n^α/ε)^{1/(1-β)})).

When β = 0 and n^α > log(1/δ), we will use m = O(n^α/ε).

By the above theorem, we certainly can trivially learn, with error probability 1/n, a string of length n by sampling m = n^3 examples: the output will be a set of substrings of total length at most n^2, since there are only n^2 substrings and we can merge them into only n substrings. Needless to say, sampling 5000^3 fragments to sequence a DNA molecule is not satisfactory. A more careful analysis can bring this down to O(n log n/ε), which is still not satisfactory.⁵ A theory must agree with practice: we show that we need only about O(n log^2 n/(εl)) fragments, where l is the D+-expected sampling length. This is much closer to the laboratory practice. Notice that the divisor l is around 500 and hence is much more significant than a log n term for n ≤ 3 × 10^9. We conjecture that this can be improved to O(n log n/(εl)).

This algorithm will depend on the good approximation algorithm, given in Section 2, for finding a shortest superstring, which guarantees an output not much longer than the shortest superstring for a given set of strings (fragments).

[Footnote 5] Let n be the length of the string to be learned. Then a trivial algorithm would sample substrings and keep them, merging a and b only if a is a substring of b, as the concept. Thus, since there are at most n^n such concepts (although the total length may be n^2), we need only O(n log n/ε) examples to learn with error probability ε by Theorem 4. However, this is still not good enough. For example, we still do not want to sample, say, 5000 log 5000 substrings to identify a virus DNA of 5000 base pairs.
4 Learning a String Efficiently

Formally, our concept class C is a set of strings. Positive examples are substrings of a target string t, distributed according to D+, which may be uniform in the DNA applications. Negative examples are strings that are not substrings of t, distributed according to D-. C' is a class of sets of strings.

Theorem 5 C is learnable (by C'), with error probability ε, using only O(n log^2 n/(εl)) samples, where l is the D+-expected sample length.
Proof: Given O(n log^2 n/(εl)) positive and negative examples, if we output a concept c' such that K(c'|c) ≤ O(n log^2 n/l), then by Theorem 4 (with β = 0) we have a learning algorithm.

We first change the Group-Merge algorithm to also deal with negative examples: At step (1), we now look for a pair s, s' such that cost(s, s') is minimized under the condition that m(s, s') must not contain a negative example as a substring. The learned concept is

  c' = {m(s, s') | m(s, s') chosen in step (1) of Group-Merge}.

So c' may contain more than one string. In order to show that the old analysis is still good, we only need to observe one fact: there is always a way to properly combine the first and last strings in each group G_i^r at the r-th step such that they contain all strings in G_i^r as substrings and no negative examples as substrings. Hence the analysis of Theorem 1 still carries through.

Now we count how many c' are possible outputs of Group-Merge, given the fact that we draw examples using c. c' is constructed in fewer than O(n log n/l) iterations, since the total length of c' is at most n log n and l is the D+-expected length of a positive sample⁶. At each step, two possible substrings of c are combined in some way, and this has at most n^{O(1)} choices. So altogether we have n^{O(n log n/l)} potential c'. Therefore

  K(c'|c) ≤ O(n log^2 n/l).

Given a new string, we can decide whether it is a positive or negative example by testing whether it is a substring of some string in c'. By Theorem 4, our error probability is at most ε. □
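In terms of the Group-Merge sketch given in Section 2, the modification used in this proof is a one-line admissibility filter on candidate merges. This is our illustration, and `negatives` is a hypothetical set of known negative examples.

```python
def admissible(m: str, negatives: set[str]) -> bool:
    """Step (1) now only considers a merge m(s, s') that does not
    contain any negative example as a substring."""
    return not any(neg in m for neg in negatives)
```

In the sketch, step (1) would simply skip any candidate merge m for which admissible(m, negatives) is False, and the learned concept c' is the whole set T of merged strings rather than a single superstring.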
Corollary 6 If we have an algorithm approximating the shortest common superstring of length n within length 2n, then C is learnable (by C'), with error probability ε, using only O(n log n/(εl)) samples, where l is the D+-expected sample length.

In the real-life situation of DNA sequencing, there are many restrictions on what cannot be combined. These conditions may be regarded as negative examples. However, if one prefers to think that no negative examples are given, we can still reasonably assume that negative instances are more or less uniformly distributed. And then,

Corollary 7 Assume a uniform distribution over the negative examples. C is learnable (by C'), with error probability ε, using only O(n log^2 n/(εl)) positive examples.

Proof: By modifying the arguments of Theorem 4 (proof omitted), it is easily seen that our algorithm Group-Merge will still guarantee

  Σ_{s: c'(s)=0} D+(s) ≤ ε,

where c'(s) = 0 stands for "s is not a substring of some string in the learned concept c'". On the negative side, because the output length is at most O(n log n), this will make at most O((n log n)^2) negative examples positive. Since we have assumed the uniform distribution, the error probability on the negative side is only, for not too small ε,

  O((n log n)^2 / 4^n) < ε.  □
[Footnote 6] By statistics (Chernoff bounds), we know that with high probability the average sample size l' for the sample we have drawn is close to l. In our case, the phrase "with high probability" can be absorbed in the error probability δ, and the discrepancy between l' and l can be absorbed by some small constant.
Remark 1. The algorithms of Turner [30] and Tarhio and Ukkonen [29] do not seem to work when there are negative samples. This is because their algorithms merge a pair of strings at each step; if merged wrongly at first, the negative examples could prohibit further mergings, hence ending up with a long superstring. On the other hand, although it was not our original purpose, algorithm Group-Merge adapts to this situation perfectly.

Remark 2. Our approach also fits the situation of inaccurate data. Such inaccurate data happen often during the Sanger procedure [27] and often obstruct us from obtaining the real underlying sequence. By allowing approximate matching, a straightforward extension of our approach, combining ideas in [1, 13], would provide efficient and robust algorithms.

Remark 3. It seems that our new algorithm deals with repetitions better than the maximum overlap algorithm used by Turner, Tarhio, and Ukkonen. It is believed that both algorithms have the 2 times optimal performance. Based on this conjecture, Corollary 6 says that we need only O(n log n/(εl)) samples in order to achieve (1 - ε) × 100 percent confidence. Several major open problems remain: improve our O(n log n) bound (to 2n); prove the 2n bound, or any non-trivial bound, for the maximum overlap algorithm.

Remark 4. We have already indicated the limitation of our approach. In our model, we have assumed that future queries to the learned DNA sequence are of size of several hundred characters. Whether one can correctly answer, with high probability, queries about longer substrings of the learned DNA sequence is not clear. This should be an interesting subject of future research.

Remark 5. It is hoped that our theory and methods can be applied at a massive scale when the sampling and sequencing processes are fully automated. The procedure we have described requires no human intervention. It only involves randomly sequencing short fragments and computing the shortest common superstring. Our theory guarantees the performance, and we have bounded the number of samples required. It is foreseeable that this method may be useful in situations where we are interested only in finding out whether a certain (not too long) character sequence (representing a function) appears in a given DNA molecule. If future technology (like the PCR method) allows us to sample longer fragments, our method will be more useful, since then we can guarantee to answer longer sequences with high confidence.

5 Acknowledgements

I gratefully thank Qiang Yang, Derick Wood, and Esko Ukkonen for bringing the literature on shortest common superstrings to my attention; Murray Sherk, John Tromp, and Paul Vitanyi for commenting on the manuscript; Deming Li, Ed Shipwash, Xiaohong Sun, and Baozhang Tang for teaching me biochemistry; David Haussler (who suggested that I calculate the probability of recovering the superstring precisely), Udi Manber, and a referee (COLT) for crucial criticisms and suggestions; Samir Khuller and Vijay Vazirani for enlightening discussions on the superstring algorithms; and NSERC for its financial support.

References
[1] D. Angluin and P.D. Laird. Learning from noisy examples. Machine Learning, 2(4):343-370, 1988.

[2] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4), 1989.

[3] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Occam's razor. Information Processing Letters, 24, 1987.

[4] D. Freifelder. Molecular Biology. Jones & Bartlett Publishers, Inc., 1983.

[5] P. Friedland and L. Kedes. Discovering the secrets of DNA. Communications of the ACM, 28(11):1164-1186, 1985.

[6] J. Gallant, D. Maier, and J. Storer. On finding minimal length superstrings. Journal of Computer and System Sciences, 20:50-58, 1980.

[7] M. Garey and D. Johnson. Computers and Intractability. Freeman, New York, 1979.

[8] D. Haussler. Generalizing the PAC model: sample size bounds from metric dimension-based uniform convergence results. 30th IEEE Symp. on Foundations of Computer Science, 40-45, 1989.

[9] D. Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's model. Artificial Intelligence, 36(2):177-221, 1988.

[10] D. Haussler, N. Littlestone, and M. Warmuth. Expected mistake bounds for on-line learning algorithms. 29th IEEE Symp. on Foundations of Computer Science, 100-109, 1988.

[11] D. Helmbold, R. Sloan, and M. Warmuth. Learning nested differences of intersection-closed concept classes. 2nd Workshop on Computational Learning Theory, 41-56, 1989.

[12] M. Kearns, M. Li, L. Pitt, and L.G. Valiant. Recent results on Boolean concept learning. Proceedings of the 4th International Workshop on Machine Learning, 337-352, 1987.

[13] M. Kearns and M. Li. Learning in the presence of malicious errors. 1988 ACM Symp. on Theory of Computing. Also Harvard TR-03-87.

[14] M. Kearns, M. Li, L. Pitt, and L.G. Valiant. On the learnability of Boolean formulae. 19th ACM Symposium on Theory of Computing, 285-295, 1987.

[15] M. Kearns. The computational complexity of machine learning. Ph.D. Thesis, Harvard University, Report TR-13-89, 1989.

[16] A. Lesk (Ed.). Computational Molecular Biology: Sources and Methods for Sequence Analysis. Oxford University Press, 1988.

[17] M. Li and P. Vitanyi. A theory of learning simple concepts under simple distributions. 30th IEEE Symp. on Foundations of Computer Science, 34-39, 1989.

[18] M. Li and P. Vitanyi. Kolmogorov complexity and its applications. Handbook of Theoretical Computer Science, J. van Leeuwen, Ed., 1990.

[19] N. Littlestone. Learning quickly when irrelevant attributes abound: a new linear threshold algorithm. 28th IEEE Symposium on the Foundations of Computer Science, 1987.

[20] W. Maass and G. Turan. On the complexity of learning from counterexamples. 30th IEEE Foundations of Computer Science, 262-267, 1989.

[21] R. Michalski, J. Carbonell, and T. Mitchell. Machine Learning. Morgan Kaufmann, 1983.

[22] B.K. Natarajan. On learning Boolean functions. ACM Symp. on Theory of Computing, 296-304, 1987.

[23] H. Peltola, H. Soderlund, J. Tarhio, and E. Ukkonen. Algorithms for some string matching problems arising in molecular genetics. Information Processing 83 (Proc. IFIP Congress, 1983), 53-64.

[24] L. Pitt and L.G. Valiant. Computational limitations on learning from examples. Journal of the ACM, 35(4):965-984, 1988.

[25] L. Pitt and M. Warmuth. The minimum consistent DFA problem cannot be approximated within any polynomial. ACM Symp. on Theory of Computing, 421-432, 1989.

[26] R. Rivest. Learning decision lists. Machine Learning, 2(3):229-246, 1987.

[27] R. Staden. Automation of the computer handling of gel reading data produced by the shotgun method of DNA sequencing. Nucleic Acids Research, 10(15):4731-4751, 1982.

[28] J. Storer. Data Compression: Methods and Theory. Computer Science Press, 1988.

[29] J. Tarhio and E. Ukkonen. A greedy approximation algorithm for constructing shortest common superstrings. Theoretical Computer Science, 57:131-145, 1988.

[30] J. Turner. Approximation algorithms for the shortest common superstring problem. Information and Computation, 83:1-20, 1989.

[31] L.G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.

[32] L.G. Valiant. Learning disjunctions of conjunctions. Proceedings of the 9th IJCAI, vol. 1, 560-566, Los Angeles, CA, August 1985.

[33] L.G. Valiant. Deductive learning. Phil. Trans. R. Soc. Lond. A, 312:441-446, 1984.