An Efficient Algorithm for Finding Similar Short
Substrings from Large Scale String Data
Takeaki Uno
National Institute of Informatics, JAPAN
&
The Graduate University for Advanced Science
May/23/2008 PAKDD 2008
Motivation: Analyzing Huge Data
• Recent information technology has given us many huge databases
- Web, genome, POS, log, …
• "Construction" and "keyword search" can be done efficiently
• The next step is analysis; capture features of the data
- statistics, such as size, #rows, density, attributes, distribution…
• Can we get more?
→ Look at (simple) local structures, while keeping the analysis simple and basic
[Figure: example databases: results of experiments (Experiment 1 to Experiment 4, each entry marked ● or ▲) and a genome sequence]
Our Focus
• Find all pairs of similar objects (or structures), or, equivalently, the binary relation "is similar to"
• This is arguably a very basic and fundamental task
→ It should have many applications
- detecting global similar structures,
- constructing neighbor graphs,
- detecting locally dense structures (groups of related objects)
In this talk, we look at strings
Existing Studies
• There are many studies on similarity search (homology search)
→ Given a database, construct a data structure that lets us quickly find the objects similar to a given query object
- strings with Hamming distance, edit distance
- points in the plane (k-d trees), Euclidean space
- sets
- constructing neighbor graphs (for smaller dimensions)
- genome sequence comparison (heuristics)
• Both exact and approximate approaches
• All-pairs comparison does not work for large-scale data
Approach from Algorithm Theory
• Parallel computation is a popular way to speed up computation, but its high cost, including the difficulty of programming, is a disadvantage
• Algorithmic improvement slows the growth of computation time with database size by redesigning how the computation is done
[Figure: speedup from algorithmic improvement: 2-3 times at size = 100; 10,000 times at size = 1,000,000]
The efficiency gain grows as the database size increases
We approach the problem from the algorithmic point of view
Our Problem
• We address databases whose records are short strings
Problem: Given a database of n strings of the same fixed length l and a threshold d, find all pairs of strings whose Hamming distance is at most d.
• We propose an efficient algorithm, SACHICA (Scalable Algorithm for Characteristic/Homogenous Interval Calculation), and a method to detect long similar substrings of the input strings (especially efficient for genomic data)
Input:
ATGCCGCG
GCGTGTAC
GCCTCTAT
TGCGTTTC
TGTAATGA
...
Output:
・ ATGCCGCG , AAGCCGCC
・ GCCTCTAT , GCTTCTAA
・ TGTAATGA , GGTAATGG
...
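As a reference point for the problem statement above, here is a minimal sketch of the straightforward all-pairs baseline. The function names `hamming` and `similar_pairs_naive` are hypothetical, and the sample records are simply the six strings shown in the output pairs above; this is not the SACHICA algorithm itself.

```python
from itertools import combinations

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(s, t))

def similar_pairs_naive(strings, d):
    """All index pairs (i, j), i < j, with Hamming distance at most d.

    Compares every pair, so it needs on the order of l * n^2 letter
    comparisons for n strings of length l.
    """
    return [(i, j)
            for (i, s), (j, t) in combinations(enumerate(strings), 2)
            if hamming(s, t) <= d]

# The six strings that appear in the output pairs on the slide.
records = ["ATGCCGCG", "AAGCCGCC", "GCCTCTAT",
           "GCTTCTAA", "TGTAATGA", "GGTAATGG"]
print(similar_pairs_naive(records, 2))  # [(0, 1), (2, 3), (4, 5)]
```

Each of the three reported pairs differs in exactly two positions, which matches the pairs listed on the slide.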
Approaching Long-string Similarity
• When two strings S1 and S2 are similar, they must have several pairs of similar short substrings
• "Having several similar substrings" is a necessary condition for two strings to be similar
Ex) for two strings of length 3000 with Hamming distance 290 (about 10%)
→ they have at least 3 pairs of substrings of length 30 with Hamming distance at most 2
→ the positions of these substrings differ by at most 30, if we allow deletions and insertions
• This gives a condition: substrings of length β are similar only if "k pairs of their short substrings are similar, and their start positions differ by at most α"
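The counting argument behind the example can be checked with a few lines of arithmetic. The sketch below assumes the strings are cut into aligned, non-overlapping blocks of length 30 (an assumption made here for simplicity; the slide does not say how the substrings are chosen), and `min_good_blocks` is a hypothetical helper name.

```python
def min_good_blocks(length, block, total_mismatches, max_per_block):
    """Lower bound on the number of aligned, non-overlapping blocks that
    contain at most max_per_block mismatches."""
    blocks = length // block
    # A "bad" block needs at least max_per_block + 1 mismatches, so the
    # mismatch budget can pay for at most this many bad blocks.
    bad = min(blocks, total_mismatches // (max_per_block + 1))
    return blocks - bad

# 100 blocks of length 30; 290 mismatches can spoil at most 290 // 3 = 96.
print(min_good_blocks(3000, 30, 290, 2))  # 4
```

Under this assumption the count comes out as 4, which is consistent with the slide's "at least 3". With exactly 300 mismatches the same count drops to 0, which suggests why the example uses 290 rather than a full 10%.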
Detecting Long Similar Substrings
• Consider finding long similar substrings of given strings S1 and S2
• Comparing all substrings of length β
→ takes quadratic time
→ yields redundant overlapping pairs
→ so use our similarity condition instead
(1) find all pairs of similar short substrings
(2) scan a diagonal belt of width 2α to find an interval of length β including k pairs
(3) shift the diagonal belt by α, and repeat
• We can always find the substrings of length β satisfying the condition
→ an approach from similar short substrings is possible
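Steps (2) and (3) can be sketched as follows. This is only an illustration under stated assumptions: the similar short-substring pairs from step (1) are given as `(x, y)` start positions in S1 and S2, a belt is taken to be the set of pairs whose diagonal `y - x` falls in a half-open range of width 2α, and the window of length β is measured along S1. The function name `find_candidate_regions` and the choice to report one hit per belt are hypothetical; the slides do not give these details.

```python
def find_candidate_regions(pairs, alpha, beta, k):
    """pairs: (x, y) start positions of similar short substrings in S1, S2.

    Returns (belt_lo, x_start) for each belt of diagonals
    [belt_lo, belt_lo + 2*alpha) that has a window of length beta
    (measured along S1) containing at least k pairs.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if not pairs:
        return []
    diagonals = [y - x for x, y in pairs]
    hi = max(diagonals)
    hits = []
    lo = min(diagonals)
    while lo <= hi:
        # Step (2): pairs whose diagonal y - x lies inside the current belt.
        xs = sorted(x for x, y in pairs if lo <= y - x < lo + 2 * alpha)
        left = 0
        for right, x in enumerate(xs):
            # Shrink the window until it spans less than beta along S1.
            while x - xs[left] >= beta:
                left += 1
            if right - left + 1 >= k:
                hits.append((lo, xs[left]))
                break  # one report per belt is enough for this sketch
        lo += alpha  # Step (3): consecutive belts overlap by alpha.
    return hits

pairs = [(100, 105), (400, 410), (900, 895), (5000, 9000)]
print(find_candidate_regions(pairs, alpha=30, beta=3000, k=3))  # [(-5, 100)]
```

For clarity the sketch rescans all pairs for every belt; a practical version would bucket the pairs by diagonal first.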
Related Works
• Computing edit/Hamming distance takes quadratic/linear time
→ the whole strings have to be similar
→ cannot detect local exchanges
• Heuristic homology search such as BLAST, Pattern Hunter usually finds exact matches of short substrings (11 letters) and extends them
→ must find an enormous number of pairs when the input strings are huge
→ lengthening the 11 letters loses accuracy
→ relies on heuristics: ignoring frequent substrings, dealing only with gene areas
• Similarity search
→ involves a huge number of queries, taking much longer than exact search
Trivial Bound of the Complexity
• If all the strings are exactly the same, we have to output all the pairs, which takes Θ(n^2) time
→ simple all-pairs comparison in O(l n^2) time is optimal if l is a fixed constant
→ is there no room for improvement?
• In practice, we would analyze only when the output is small; otherwise the analysis is meaningless
→ consider the complexity in terms of the output size
M: #outputs
We propose an O(2^l (n + lM)) time algorithm
Basic Idea: Solve Subproblem
• Consider a partition of each string into k blocks, and the following subproblem
Subproblem: given k-d block positions, find all pairs of strings with distance at most d such that "the given blocks are the same"
Ex) the 2nd, 4th, and 5th blocks of S1 and S2 (length 30) are the same
→ far fewer comparisons!
• We can solve it by "radix sort" on the combined blocks, in O(l n) time.
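A minimal sketch of one subproblem is given below. It groups strings with a hash table keyed on the chosen blocks, which is a stand-in for the radix sort named on the slide (the grouping it produces is the same, but the mechanism is swapped for brevity). The names `block_bounds` and `solve_subproblem`, and the nearly-equal block split, are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def block_bounds(length, k):
    """Split range(length) into k nearly equal blocks as (start, end) pairs."""
    return [(i * length // k, (i + 1) * length // k) for i in range(k)]

def solve_subproblem(strings, bounds, chosen, d):
    """Pairs (i, j), i < j, that agree on every chosen block and have
    Hamming distance at most d."""
    groups = defaultdict(list)
    for idx, s in enumerate(strings):
        # The key is the concatenation of the chosen blocks, kept as a tuple.
        key = tuple(s[bounds[c][0]:bounds[c][1]] for c in chosen)
        groups[key].append(idx)
    result = []
    # Only strings inside the same group are ever compared.
    for members in groups.values():
        for i, j in combinations(members, 2):
            if sum(x != y for x, y in zip(strings[i], strings[j])) <= d:
                result.append((i, j))
    return result

strings = ["ABCDE", "ABDDE", "ADCDE", "CDEFG", "CDEFF", "CDEGG", "AAGAB"]
bounds = block_bounds(5, 3)  # blocks of 1, 2, and 2 letters
print(solve_subproblem(strings, bounds, (0, 2), 1))  # [(0, 1), (0, 2)]
```

Here the 1st and 3rd blocks are required to match, so only the three strings of the form A??DE are compared with each other.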
Examine All Cases
• Solve the subproblem for all combinations of the block positions
→ if the distance between two strings S1 and S2 is at most 2, at least k-2 of their blocks are the same
→ the pair "S1 and S2" is found in at least one combination of blocks (in the subproblem of that combination P)
• #combinations is C(k, d) (k choose d). When k=5 and d=2, it is 10
→ the computation is "radix sorts + α": O(C(k, d) l n) time for sorting
→ a recursive radix sort reduces this to O(C(k, d) n)
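Putting the cases together, the following sketch runs the subproblem for every choice of k-d blocks and collects the results. It is an illustration only: hash-based grouping replaces the (recursive) radix sort, duplicates are removed with an in-memory set rather than by the canonical-position rule, and the function name `similar_pairs` is hypothetical. The sample strings are those of the Example slide.

```python
from collections import defaultdict
from itertools import combinations

def similar_pairs(strings, k, d):
    """All pairs (i, j), i < j, with Hamming distance at most d, found by
    trying every combination of k-d blocks that must match exactly."""
    length = len(strings[0])
    bounds = [(i * length // k, (i + 1) * length // k) for i in range(k)]
    found = set()
    for chosen in combinations(range(k), k - d):
        groups = defaultdict(list)
        for idx, s in enumerate(strings):
            key = tuple(s[bounds[c][0]:bounds[c][1]] for c in chosen)
            groups[key].append(idx)
        for members in groups.values():
            for i, j in combinations(members, 2):
                if sum(x != y for x, y in zip(strings[i], strings[j])) <= d:
                    found.add((i, j))  # a set hides repeated discoveries
    return sorted(found)

strings = ["ABCDE", "ABDDE", "ADCDE", "CDEFG", "CDEFF", "CDEGG", "AAGAB"]
print(similar_pairs(strings, k=3, d=1))
# [(0, 1), (0, 2), (3, 4), (3, 5)]
```

At most d mismatches can touch at most d blocks, so every qualifying pair agrees on some k-d blocks and is seen in at least one iteration of the outer loop.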
Example
・ Find all pairs of strings with Hamming distance at most 1
The strings, split into 3 blocks (of 1, 2, and 2 letters):
A | BC | DE
A | BD | DE
A | DC | DE
C | DE | FG
C | DE | FF
C | DE | GG
A | AG | AB
Figure out Intuition
• Finding pairs of similar records is like finding all cells of a certain kind in a matrix
• All-pairs comparison sweeps over and looks at every cell
• Our multi-classification algorithm recursively narrows the areas to be checked in many ways, so the search route forms a tree, each leaf of which corresponds to a group of strings to be compared
Avoid Duplications by Canonical Positions
• For two strings S1 and S2, their canonical positions are the first l-d positions at which they have the same letter
• We output the pair S1 and S2 only in the subproblem of their canonical positions
• Computing the canonical positions takes O(l) time, so "+ α" needs O(M l C(k, d)) time
Avoid duplications without keeping the solutions in memory
O(C(l, d) (n + dM)) = O(2^l (n + lM)) time in total (if we set k = l)
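The duplicate-avoidance rule can be sketched as below, stated per block rather than per letter (the two coincide when k = l). A pair is reported only when the block combination being processed equals the pair's canonical combination, so nothing has to be remembered between subproblems. The names `canonical_blocks` and `similar_pairs_no_duplicates` are hypothetical, and hash-based grouping again stands in for radix sort.

```python
from collections import defaultdict
from itertools import combinations

def canonical_blocks(s, t, bounds, d):
    """The first k-d block indices on which s and t agree, or None if
    fewer than k-d blocks agree."""
    same = [c for c, (a, b) in enumerate(bounds) if s[a:b] == t[a:b]]
    need = len(bounds) - d
    return tuple(same[:need]) if len(same) >= need else None

def similar_pairs_no_duplicates(strings, k, d):
    """Yield each pair with Hamming distance at most d exactly once."""
    length = len(strings[0])
    bounds = [(i * length // k, (i + 1) * length // k) for i in range(k)]
    for chosen in combinations(range(k), k - d):
        groups = defaultdict(list)
        for idx, s in enumerate(strings):
            key = tuple(s[bounds[c][0]:bounds[c][1]] for c in chosen)
            groups[key].append(idx)
        for members in groups.values():
            for i, j in combinations(members, 2):
                s, t = strings[i], strings[j]
                # Skip the pair unless this is its canonical subproblem.
                if canonical_blocks(s, t, bounds, d) != chosen:
                    continue
                if sum(x != y for x, y in zip(s, t)) <= d:
                    yield (i, j)

# The two identical strings meet in all four subproblems, yet (0, 1)
# is reported only once.
print(list(similar_pairs_no_duplicates(["AAAA", "AAAA", "AAAB"], k=4, d=1)))
# [(0, 1), (0, 2), (1, 2)]
```

Because `combinations` produces index tuples in increasing order, the canonical tuple can be compared with `chosen` directly.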
Difference from BLAST
• The original "BLAST" algorithm finds pairs of identical intervals of 11 letters
→ roughly, it classifies into 4^11 ≈ 4 million groups
→ may take a long time for 100 million letters
• Our method, for length 30 with Hamming distance 3 (quality equal to finding identical intervals of 7 letters), divides into 6 blocks
→ roughly, it classifies into 4^15 ≈ 1,000 million groups (20 times)
→ may take a long time for 2000 million letters, but we can increase the #blocks
But it is not good at searching for a given short string (no difference in time between many strings and one string)
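The figures in this comparison can be reproduced with a short calculation. Two readings here are assumptions rather than statements from the slides: that "(20 times)" refers to the C(6, 3) = 20 block combinations, and that "7 letters" comes from the fact that 3 mismatches cut a 30-letter string into at most 4 matching runs holding 27 letters in total.

```python
from math import ceil, comb

alphabet = 4                          # A, C, G, T
seed_groups = alphabet ** 11          # exact 11-letter seeds
block_len = 30 // 6                   # 6 blocks of 5 letters each
matched_letters = (6 - 3) * block_len # 3 identical blocks = 15 letters
our_groups = alphabet ** matched_letters
subproblems = comb(6, 3)              # block combinations to examine
# Longest exact run guaranteed among 30 letters with at most 3 mismatches.
guaranteed_run = ceil((30 - 3) / (3 + 1))

print(seed_groups)     # 4194304     (about 4 million)
print(our_groups)      # 1073741824  (about 1,000 million)
print(subproblems)     # 20
print(guaranteed_run)  # 7
```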
Experiments: l = 20 and d = 0,1,2,3
Prefixes of Y chromosome of Human
Notebook PC with Pentium M 1.1GHz, 256MB RAM
[Figure: CPU time (sec.) versus length (1000 bases), with one curve for each of d=0, d=1, d=2, d=3]
Comparison of Chromosome
Human 21st and chimpanzee 22nd chromosomes
• Take strings of 30 letters from both, with overlaps
• Intensity is given by # pairs
• White → possibly similar
• Black → never similar
• Grid lines detect "repetitions of similar structures"
[Figure: dot plot of human 21st chr. against chimpanzee 22nd chr.]
20 min. by PC
Homology Search on Mouse X Chr.
Human X and mouse X chromosomes (150M strings for each)
• take strings of 30 letters beginning at every position
・ For human X, without overlaps
・ d=2, k=7
・ dots if 3 points are in area of width 300 and length 3000
[Figure: dot plot of human X chr. against mouse X chr.]
1 hour by PC
Comparison of Many Bacteria
Comparison of the genomes of 30 bacteria
• The genomes are concatenated and compared in the same way
1 hour by PC
Comparison of BAC clones
• Sequencing a genome is done by detecting overlaps of fragments
• When genome has complex repeating structures, detection is hard
• We detected the overlaps in the mouse genome, and completed some undetermined complex repeating parts (joint research with Koide, Umemori of National Institute of Genetics, Japan)
1 sec. by PC for a pair
Extensions ???
• Can we solve the problem for other objects?
(sets, sequences, graphs,…)
• For graphs, maybe yes, but the practical performance is unclear
• For sets, Hamming distance is not preferable: for large sets, many differences should be allowed.
• For continuous objects, such as points in Euclidean space, we can hardly bound the complexity in the same way. (In the discrete version, the neighbors are finite, and are actually classified into a constant number of groups)
Conclusion
• Output-sensitive algorithm for finding pairs of similar strings (in terms of Hamming distance)
• Multi-classification for speeding up
• Application to genome sequence comparison
Future works
• Models and algorithms for natural language text
• Extension to other objects (sets, sequences, graphs)
• Extension to continuous objects (points in Euclidean space)
• Efficient spin-out heuristics for practice
• Genome analysis tools and systems