Effective Indexing and Filtering
for Similarity Search in Large
Biosequence Databases
O. Ozturk and H. Ferhatosmanoglu. IEEE
International Symp. on Bioinformatics and
Bioengineering (BIBE '03), pp. 359-366.
Washington, DC. March 2003.
Overview
• Applications of queries
• Background on queries
• Current problem
• Solutions and our solution
• Comparison experiments and results
• Future work
BMI 731 - Winter'04
Queries in general
• We need a metric distance function
– To measure the (dis)similarity between objects
• Dynamic programming algorithm
– O( |string1| * |string2| ) time and space
• i.e. O(n²), where n is the length of the strings
– Especially bad for genetic sequence queries, where the sequences are long
2 kinds of queries
• ε-range queries
– Retrieve all objects that are similar to the query to more than a certain degree ε
2 kinds of queries
• k-nearest neighbor (k-NN) queries
– Retrieve the k most similar objects
• No domain knowledge necessary
Ex: 4-NN
2 kinds of queries
• ε-range queries
• Requires domain knowledge
– Data distribution & distance definition
• ε too small: none returned
2 kinds of queries
• ε-range queries
• ε too large: all returned
Measuring similarity (slide 8)
• We need a metric distance function
– To measure the (dis)similarity between objects
• Edit Distance (ED)
– Three kinds of operations
• Insert, delete, replace
– ACTTAGC to AATGATAG
ACT--TAGC
 R II   D    → ED = 4
AATGATAG-
– Dynamic programming algorithm
– O(mn) time and space
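The dynamic programming computation of edit distance can be sketched as follows. This is a minimal illustration assuming unit costs for insert, delete and replace; the function name `edit_distance` is ours and does not come from the slides.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance (insert, delete, replace) via dynamic programming."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]; O(mn) time and space.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # delete s[i-1]
                dp[i][j - 1] + 1,         # insert t[j-1]
                dp[i - 1][j - 1] + cost,  # match or replace
            )
    return dp[m][n]


print(edit_distance("ACTTAGC", "AATGATAG"))  # prints 4
```

The full table is kept here for clarity; keeping only two rows would reduce the space to O(n) without changing the result.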
DPA
String/Genome Data
• Queries ask for the substrings in the database that are most similar to the given string.
• BLAST has ε-range queries
– Naïve search (linear scan)
– Scalability problems
• How to Handle Size
– Partial information rather than whole database
• Approximate the string data (compress)
→ may fit in memory
→ may be used for indexing, clustering
How to Handle Size
• 3 approaches to make use of compressed data
1. Prune irrelevant data, do I/O only for the non-pruned entries → calculate exact values for the non-pruned entries (especially ε-range queries)
2. Get approximate answers with virtually no I/O (I/O only for the answers) (especially k-NN queries)
3. Approximate pruning for ε-range queries
Overview
• Background on queries
• Current problem
• Transformation and Indexing
• Comparison experiments and results
• Future work
Big Picture
General approach, step by step
Preprocessing
• Transform (large) string data into (hopefully smaller) multi-dimensional vectors
• Develop a distance function df in vector spaces to approximate the string similarity
• Build a multi-dimensional indexing technique on top of the multi-dimensional vectors
Query
• Implement one of the three approaches mentioned
Preprocessing
1. Windowing: string database → overlapping windows
2. Transformation: overlapping windows → vector space
3. Indexing: multidimensional vectors, indexed with respect to some distance function
Using the index
1. Transformation of the query sequence
2a. Approximate query (k-NN or ε-range) on the index of vectors → done
2b. Exact query (k-NN or ε-range) on the index of vectors → candidate set
• The vectors returned represent most of the k-NN (or the vectors in range ε) + some false positives
(Continued)
Using the index
3. Refine the candidate set
• I/O for the strings represented by those vectors.
• Calculate ED for each of them. (Remove false positives.)
1st Step: Partitioning into Overlapping Windows
• AACCGGTTACGTACGT… (e.g. window size W = 6)
• AACCGGTTACGTACGT… (e.g. shift amount = 2)
• AACCGGTTACGTACGT…
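The windowing step can be sketched as below. This is a minimal illustration; the function name `overlapping_windows` and the parameter name `shift` are hypothetical, and trailing characters that do not fill a whole window are simply dropped, which is an assumption rather than something the slides specify.

```python
def overlapping_windows(sequence: str, w: int, shift: int) -> list[str]:
    """Return every length-w window of `sequence`, starting every `shift` positions."""
    if w <= 0 or shift <= 0:
        raise ValueError("w and shift must be positive")
    # The last valid start position is len(sequence) - w.
    return [sequence[i:i + w] for i in range(0, len(sequence) - w + 1, shift)]


print(overlapping_windows("AACCGGTTACGTACGT", w=6, shift=2))
# ['AACCGG', 'CCGGTT', 'GGTTAC', 'TTACGT', 'ACGTAC', 'GTACGT']
```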
2nd Step: Mapping Windows into Vector Space
• Choose a tuple size k
• Associate an integer with each of the 4^k k-tuples
• The frequencies of those k-tuples form the vector
• If k = 2 → 4^k = 16 k-tuples
– AA, AC, AG, AT,
– CA, CC, CG, CT
– TA, TC, TG, TT
– GA, GC, GG, GT
Example Mapping
• The integers assigned
– AA=0, AC=1, AG=2, AT=3,
– CA=4, CC=5, CG=6, CT=7
– TA=8, TC=9, TG=10, TT=11
– GA=12, GC=13, GG=14, GT=15
• Assume window AACCGG
• AA, AC, CC, CG, GG all occur once
• 1100011000000010 is the matching vector.
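The mapping of a window to its 2-tuple frequency vector can be sketched as follows. The index table is built to reproduce the integer assignment listed on the slide (first letter ordered A, C, T, G; second letter ordered A, C, G, T); the names `TUPLE_INDEX` and `fv2` are ours.

```python
# Ordering taken from the slide's integer table: the first letter follows
# A, C, T, G while the second follows A, C, G, T (so TA=8 and GA=12).
FIRST_ORDER = "ACTG"
SECOND_ORDER = "ACGT"
TUPLE_INDEX = {
    a + b: 4 * i + j
    for i, a in enumerate(FIRST_ORDER)
    for j, b in enumerate(SECOND_ORDER)
}


def fv2(window: str) -> list[int]:
    """Count every overlapping 2-tuple of `window` into a 16-dimensional vector."""
    vector = [0] * 16
    for pos in range(len(window) - 1):
        vector[TUPLE_INDEX[window[pos:pos + 2]]] += 1
    return vector


vector = fv2("AACCGG")
# Show which dimensions are non-zero and which 2-tuples they stand for.
names = {index: pair for pair, index in TUPLE_INDEX.items()}
print([(i, names[i], c) for i, c in enumerate(vector) if c])
```

A window of length W contributes W - 1 overlapping 2-tuples, so the components of the vector always sum to W - 1.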
Different transformations & Distance Functions
• Tuple size → transformation size
– 1 → 4 (frequencies of A, C, G, T): FV1
– 2 → 16 (frequencies of 2-tuples): FV2
Different transformations & Distance Functions 2
• WVn transformation
– Split the string into halves x, y
– FVn's for x, y → FVx, FVy
– Concatenate their sum and difference: [ FVx + FVy, FVx - FVy ]
• Wavelet 1 (WV1) on example
– TCACTTAG
– 1st: divide into halves & find FV1 transformation
• x: TCAC → 1 2 0 1
• y: TTAG → 1 0 1 2
– 2nd: add and subtract
• 2 2 1 3 0 2 –1 –1
• Same operations on 2-tuples: WV2
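The WV1 example above can be reproduced with a few lines of code. This is a sketch under two assumptions not stated on the slide: the components are ordered A, C, G, T (which matches the example's numbers), and a string of odd length is split with the shorter half first. The names `fv1` and `wv1` are ours.

```python
from collections import Counter

ALPHABET = "ACGT"  # component order that matches the slide's FV1 example


def fv1(s: str) -> list[int]:
    """1-tuple frequency vector: counts of A, C, G, T."""
    counts = Counter(s)
    return [counts[ch] for ch in ALPHABET]


def wv1(s: str) -> list[int]:
    """Concatenate the sum and the difference of the two halves' FV1 vectors."""
    half = len(s) // 2
    fx, fy = fv1(s[:half]), fv1(s[half:])
    sums = [a + b for a, b in zip(fx, fy)]
    diffs = [a - b for a, b in zip(fx, fy)]
    return sums + diffs


print(fv1("TCAC"), fv1("TTAG"))  # [1, 2, 0, 1] [1, 0, 1, 2]
print(wv1("TCACTTAG"))           # [2, 2, 1, 3, 0, 2, -1, -1]
```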
Distance Functions on the Vector Spaces
• All of them are proved to be lower bounds on the edit distance
• FD1 → distance on FV1
• FD2 → distance on FV2
• WD1 → distance on WV1
• WD2 → distance on WV2
Frequency Distance FDn
Algorithm
FDn (n-gram frequencies u, v)
• posDist := negDist := 0
• for all dimensions ui, vi
– if ui > vi then posDist += ui - vi
– else negDist += vi - ui
• Return max(posDist, negDist)/n
Example (n=1)
• u: ACTTAGC → 2,2,1,2
• v: AATGATAG → 4,0,2,2
– 2-4<0 negDist += |2-4|
– 2-0>0 posDist += |2-0|
– 1-2<0 negDist += |1-2|
– 2-2=0
• posDist: 2 negDist: 3
• FD1 is 3
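The FDn computation can be written out as runnable code. This is a minimal sketch in which the positive and negative differences are accumulated over all dimensions, as the worked example does; the helper `frequency_vector` and the A, C, G, T component order are assumptions made for illustration.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed component order; it matches the example's vectors


def frequency_vector(s: str) -> list[int]:
    """1-gram frequency vector of a DNA string."""
    counts = Counter(s)
    return [counts[ch] for ch in ALPHABET]


def frequency_distance(u: list[int], v: list[int], n: int = 1) -> float:
    """FDn between two n-gram frequency vectors u and v."""
    pos_dist = neg_dist = 0
    for ui, vi in zip(u, v):
        if ui > vi:
            pos_dist += ui - vi  # u has more of this n-gram than v
        else:
            neg_dist += vi - ui  # u has fewer (or equally many)
    return max(pos_dist, neg_dist) / n


u = frequency_vector("ACTTAGC")   # [2, 2, 1, 2]
v = frequency_vector("AATGATAG")  # [4, 0, 2, 2]
print(frequency_distance(u, v))   # 3.0
```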
FDn: Why lower bound?
• On example
– need to increase A by 2, G by 1 → 3
– need to decrease C by 2
• We may "increase + decrease" in one operation if we can replace (back to slide #8)
• So in the best case the edit distance is only FD1
• But that may not be the case; you may need more operations because of mismatched locations…
• We divide by n because a change in one character updates the frequencies of n n-grams.
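Because FD1 never exceeds the edit distance, it can serve as the filter in the prune-then-refine scheme for ε-range queries: any string whose FD1 to the query is larger than ε can be discarded without computing its edit distance. The sketch below illustrates this with a plain linear scan instead of a multi-dimensional index; the function names and the sample strings are hypothetical.

```python
from collections import Counter


def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance, keeping only two rows of the DP table."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def fd1(s: str, t: str) -> int:
    """Frequency distance on 1-gram counts (FD1)."""
    cs, ct = Counter(s), Counter(t)
    pos = sum(max(cs[ch] - ct[ch], 0) for ch in "ACGT")
    neg = sum(max(ct[ch] - cs[ch], 0) for ch in "ACGT")
    return max(pos, neg)


def range_query(database: list[str], query: str, eps: int):
    """Return (candidates, answers) for an eps-range query."""
    # Filter: FD1 <= ED, so FD1 > eps implies ED > eps and the string is pruned.
    candidates = [s for s in database if fd1(s, query) <= eps]
    # Refine: exact edit distance only for the candidates (removes false positives).
    answers = [s for s in candidates if edit_distance(s, query) <= eps]
    return candidates, answers


database = ["ACTTAGC", "AATGATAC", "CCCCCCCC", "GATAGAAT", "AATGATAG"]
candidates, answers = range_query(database, "AATGATAG", eps=2)
print(candidates)
print(answers)
```

Strings such as `GATAGAAT`, which have the same letter counts as the query but a different order, pass the filter and are only rejected, if at all, in the refinement step.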
Wavelet Distance WDn
Algorithm
WDn (n-gram frequency wavelets u, v)
• Find posDist and negDist on u, v
• m := min(posDist, negDist)
• d := (posDist - negDist)/2
• if m < d
– Return d / n
• else
– Return (d + (m - d)/2)/n
Example (n=1)
• u: ACTC TAGC
– halves: 1201 1111
– wavelet: 2 3 1 2 0 1 –1 0
• v: AATG ATAG
– halves: 2011 2011
– wavelet: 4 0 2 2 0 0 0 0
• posDist: 3 + 1 = 4
• negDist: 2 + 1 + 1 = 4
• m: 4, d: 0
• (0 + 4/2)/1
• Return 2
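The WDn pseudocode can be transcribed directly, as in the sketch below. It follows the steps as written on the slide and makes no claim beyond that; taking the absolute value when computing `d` is our assumption (the slide writes (posDist-negDist)/2), as are the helper name `wv1` and the A, C, G, T component order.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed component order; it matches the example's vectors


def wv1(s: str) -> list[int]:
    """Sum and difference of the 1-gram counts of the two halves of s."""
    half = len(s) // 2
    cx, cy = Counter(s[:half]), Counter(s[half:])
    return ([cx[ch] + cy[ch] for ch in ALPHABET]
            + [cx[ch] - cy[ch] for ch in ALPHABET])


def wavelet_distance(u: list[int], v: list[int], n: int = 1) -> float:
    """WDn as given in the slide's pseudocode."""
    pos_dist = sum(ui - vi for ui, vi in zip(u, v) if ui > vi)
    neg_dist = sum(vi - ui for ui, vi in zip(u, v) if vi > ui)
    m = min(pos_dist, neg_dist)
    # abs() is an assumption; the slide writes (posDist - negDist)/2.
    d = abs(pos_dist - neg_dist) / 2
    if m < d:
        return d / n
    return (d + (m - d) / 2) / n


u = wv1("ACTCTAGC")  # [2, 3, 1, 2, 0, 1, -1, 0]
v = wv1("AATGATAG")  # [4, 0, 2, 2, 0, 0, 0, 0]
print(wavelet_distance(u, v))  # 2.0
```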
WDn: Why lower bound?
• Assume a string transformed into wavelet [a1,…a, b1,…b]
• Largest change: posDist+=3, negDist-=1, or vice versa
– So use this change whenever posDist<>negDist
Overview
• Background on queries
• Current problem
• Transformation and Indexing
• Comparison experiments and results
• Future work
Experiment Design
• Implemented transformations & distance functions
• Evaluated their pruning efficiency on ε-range queries and their approximation efficiency on k-NN queries experimentally, on real genetic data
• Ran queries with different parameters
– Varying string size W and shift amount
– Some containing an exact match, some not
– For ε-range queries, different ε values
– For k-NN queries, different k values
K-nearest efficiency
[Figure: average of the edit distances of the k nearest neighbors (y-axis, 0 to 90) versus k for the k-nearest neighbor query (x-axis, 5 to 25); series: EditDist, Freq, Freq2, MaxFreq, Wav, Wav2]
Error Rates Compared
[Figure: percentage error (y-axis, 0% to 160%) versus k (x-axis, 5 to 25); series: (Freq-Edit)/Edit, (Freq2-Edit)/Edit, (MaxFreq-Edit)/Edit, (Wav-Edit)/Edit, (Wav2-Edit)/Edit]
Sorted Graphs
• To show why our distance functions perform so well in k-NN
• Imitate what our k-NN approximation does, and graph the result
– It sorts the data values in increasing order and takes the k nearest ones
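The approximate k-NN procedure described above, sorting by the cheap vector-space distance and keeping the first k entries, can be sketched as follows. FD1 stands in for the approximate distance here; the function names and the sample strings are hypothetical, and a linear scan replaces the index for brevity.

```python
from collections import Counter


def fd1(s: str, t: str) -> int:
    """Frequency distance on 1-gram counts (FD1)."""
    cs, ct = Counter(s), Counter(t)
    pos = sum(max(cs[ch] - ct[ch], 0) for ch in "ACGT")
    neg = sum(max(ct[ch] - cs[ch], 0) for ch in "ACGT")
    return max(pos, neg)


def approximate_knn(database: list[str], query: str, k: int, approx_dist=fd1) -> list[str]:
    """Sort by the approximate distance to the query and keep the first k strings."""
    return sorted(database, key=lambda s: approx_dist(s, query))[:k]


database = ["CCCCCCCC", "AATGATAC", "ACTTAGC", "AATGATAG"]
print(approximate_knn(database, "AATGATAG", k=2))
# ['AATGATAG', 'AATGATAC']
```

No edit distance is computed at all, which is why this approach needs I/O only for the returned answers; the price is that the result may differ from the exact k-NN set.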
Edit Distances and Matching FD1 Distances sorted by FD1
[Figure: distance value (y-axis, 0 to 140) for the first 400 strings when sorted by FD1 (x-axis); series: ED, FD1; the 20 nearest and 50 nearest are marked]
Edit Distances and Matching WD2 sorted by WD2
[Figure: distance value (y-axis, 0 to 140) for the first 400 strings when sorted by WD2 (x-axis); series: ED, WD2; the 20 nearest and 50 nearest are marked]
Nature of the distance functions
• WD2 performs very well in k-NN even though it does not prune as well
– The variance of its ratio to the edit distance is much lower than that of the others, as you would like for a distance function
wav2
[Figure: EditDist and WaveletDist2 (y-axis, 0 to 140) for strings 1 to 666 (x-axis)]
Freq
[Figure: distance (edit and freq) (y-axis, 0 to 140) for strings sorted by edit distance to the query (x-axis, 1 to 666); series: EditDist, FreqDist]
Results
• Tested the parameters obtained from these random experiments on real data.
• Then also did the parameter extraction using real data.
Comparison of index structures
Future Work
• Check the applicability of these methods to other kinds of sequence data.
– Text
– Image search
• Implement the index structure in the standalone program and evaluate its performance