Indexing and Filtering for Similarity Search

Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases
O. Ozturk and H. Ferhatosmanoglu. IEEE International Symp. on Bioinformatics and Bioengineering (BIBE '03), pp. 359-366. Washington, DC. March 2003.
BMI 731 - Winter'04

Overview
• Applications of queries
• Background on queries
• Current problem
• Solutions and our solution
• Comparison experiments and results
• Future work

Queries in general
• We need a metric distance function
– To measure the (dis)similarity between objects
• Dynamic programming algorithm
– O(|string1| * |string2|) time and space
• i.e. O(n^2), where n is the length of the strings
– Especially costly for genetic sequence queries, where the sequences are long

2 kinds of queries
• ε-range queries
– Retrieve all objects whose similarity to the query exceeds a certain degree (within distance ε of the query)

2 kinds of queries
• k-nearest neighbor (k-NN) queries
– Retrieve the k most similar objects
• No domain knowledge necessary
• Ex: 4-NN

2 kinds of queries
• ε-range queries
• Require domain knowledge
– Data distribution & distance definition
• ε too small → none returned

2 kinds of queries
• ε-range queries
• ε too large → all returned

Measuring similarity (slide 8)
• We need a metric distance function
– To measure the (dis)similarity between objects
• Edit Distance (ED)
– Three kinds of operations
• Insert (I), delete (D), replace (R)
– ACTTAGC to AATGATAG
ACT--TAGC
AATGATAG-
 R II   D  → ED = 4
– Dynamic programming algorithm
– O(mn) time and space

DPA
[Figure]

String/Genome Data
• Asks for the substrings in the database that are most similar to the given string.
• BLAST has ε-range queries
– Naïve search (linear scan): scalability problems
• How to handle size
– Use partial information rather than the whole database
• Approximate the string data (compress)
→ may fit in memory
→ may be used for indexing, clustering

How to Handle Size
• 3 approaches to make use of compressed data
1. Prune irrelevant data, with I/O only for non-pruned entries → calculate exact values for the non-pruned entries (especially ε-range queries)
2. Get approximate answers with virtually no I/O (I/O only for the answers) (especially k-NN queries)
3. Approximate pruning for ε-range queries

Overview
• Background on queries
• Current problem
• Transformation and Indexing
• Comparison experiments and results
• Future work

Big Picture
General approach step by step
• Transform (large) string data into (hopefully smaller) multi-dimensional vectors (preprocessing)
• Develop a distance function df in the vector space to approximate the string similarity (preprocessing)
• Build a multi-dimensional index on top of the multi-dimensional vectors (preprocessing)
• Implement one of the three approaches mentioned (query)

Preprocessing
1. Windowing: string database → overlapping windows
2. Transformation: windows → vector space
3. Indexing: multidimensional vectors, indexed with respect to some distance function

Using the index
1. Transformation: query sequence → query vector
2a. Approximate query (k-NN or ε-range) on the index of vectors → done
2b. Exact query (k-NN or ε-range) on the index of vectors → candidate set (continued)
– The vectors returned represent most of the k-NN (or the vectors within range ε) plus some false positives

Using the index
3. Refine the candidate set
– I/O for the strings represented by those vectors.
– Calculate ED for each of them. (Remove false positives.)
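
The refinement step above computes the exact edit distance for every candidate string. The following is a minimal Python sketch of that step, assuming unit cost for insert, delete, and replace. The function names (`edit_distance`, `refine_range`, `refine_knn`) are illustrative and do not come from the slides.

```python
def edit_distance(s, t):
    """Unit-cost edit distance (insert, delete, replace) by dynamic programming.

    Uses O(len(s) * len(t)) time and space, as stated on the slides.
    """
    m, n = len(s), len(t)
    # dist[i][j] holds the edit distance between s[:i] and t[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete s[i-1]
                dist[i][j - 1] + 1,         # insert t[j-1]
                dist[i - 1][j - 1] + cost,  # match or replace
            )
    return dist[m][n]


def refine_range(query, candidates, eps):
    """Keep the candidates whose exact edit distance to the query is at most eps."""
    return [c for c in candidates if edit_distance(query, c) <= eps]


def refine_knn(query, candidates, k):
    """Return the k candidates closest to the query by exact edit distance."""
    return sorted(candidates, key=lambda c: edit_distance(query, c))[:k]
```

For the pair used earlier in the deck, `edit_distance("ACTTAGC", "AATGATAG")` returns 4. In a real system the candidates would be read from disk, which is the I/O cost that the filtering step tries to minimize.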

1st Step: Partitioning into overlapping windows
• AACCGGTTACGTACGT… e.g. W = 6 (first window: AACCGG)
• AACCGGTTACGTACGT… e.g. shift amount = 2 (second window: CCGGTT)
• AACCGGTTACGTACGT… (third window: GGTTAC)

2nd Step: Mapping windows into vector space
• Choose a tuple size k
• Associate an integer with each of the 4^k k-tuples
• The frequencies of those k-tuples form the vector
• If k = 2 → 4^2 = 16 k-tuples
– AA, AC, AG, AT
– CA, CC, CG, CT
– GA, GC, GG, GT
– TA, TC, TG, TT

Example Mapping
• The integers assigned
– AA=0, AC=1, AG=2, AT=3
– CA=4, CC=5, CG=6, CT=7
– GA=8, GC=9, GG=10, GT=11
– TA=12, TC=13, TG=14, TT=15
• Assume window AACCGG
• AA, AC, CC, CG, GG all occur once
• 1100011000100000 is the matching vector.

Different transformations & distance functions
• Tuple size → transformation size
– 1 → 4 (frequencies of A, C, G, T): FV1
– 2 → 16 (frequencies of 2-tuples): FV2

Different transformations & distance functions 2
• WVn transformation
– Split the string into halves x, y
– Compute the FVn of x and y → FVx, FVy
– Concatenate their sum and difference: [FVx + FVy, FVx - FVy]
• Wavelet 1 on an example
– TCACTTAG
– 1st: divide into halves and find the FV1 transformation
• x: TCAC → 1 2 0 1
• y: TTAG → 1 0 1 2
– 2nd: add and subtract
• 2 2 1 3 0 2 -1 -1 → WV1
• Same operations on 2-tuples → WV2

Distance functions on the vector spaces
• All of them are proved to be lower bounds on edit distance
• FD1 → distance on FV1
• FD2 → distance on FV2
• WD1 → distance on WV1
• WD2 → distance on WV2

Frequency Distance FDn
Algorithm FDn (n-gram frequencies u, v)
• posDist := negDist := 0
• for all dimensions ui, vi
– if ui > vi then posDist += ui - vi
– else negDist += vi - ui
• Return max(posDist, negDist)/n
Example (n = 1)
• u: ACTTAGC → 2,2,1,2
• v: AATGATAG → 4,0,2,2
– 2-4 < 0: negDist += |2-4|
– 2-0 > 0: posDist += |2-0|
– 1-2 < 0: negDist += |1-2|
– 2-2 = 0
• posDist: 2, negDist: 3
• FD1 is 3

FDn: Why a lower bound?
• On the example
– need to increase A by 2 and G by 1 (3 in total)
– need to decrease C by 2
• We may “increase + decrease” in a single operation if we can replace (back to slide #8, “Measuring similarity”)
• So in the best case the edit distance is only FD1
• But that may not be the case: more operations may be needed because of mismatched locations…
• The division by n is there because a change in one character updates the frequencies of n n-grams.

Wavelet Distance WDn
Algorithm WDn (n-gram frequency wavelets u, v)
• Find posDist and negDist on u, v
• m := min(posDist, negDist)
• d := |posDist - negDist|/2
• if m < d
– Return d/n
• else
– Return (d + (m - d)/2)/n
Example (n = 1)
• u: ACTC TAGC → 1 2 0 1, 1 1 1 1 → 2 3 1 2 0 1 -1 0
• v: AATG ATAG → 2 0 1 1, 2 0 1 1 → 4 0 2 2 0 0 0 0
• posDist: 3 + 1 = 4
• negDist: 2 + 1 + 1 = 4
• m: 4
• d: 0
• (0 + 4/2)/1 → Return 2

WDn: Why a lower bound?
• Assume a string transformed into the wavelet [a1,…a, b1,…b]
• Largest change from one operation: posDist += 3, negDist -= 1, or vice versa
– So use this change whenever posDist ≠ negDist

Overview
• Background on queries
• Current problem
• Transformation and Indexing
• Comparison experiments and results
• Future work

Experiment Design
• Implemented the transformations & distance functions
• Evaluated, experimentally and on real genetic data, their pruning efficiency on ε-range queries and their approximation efficiency on k-NN queries
• Ran queries with different parameters
– Varying string size W and shift amount
– Some containing an exact match, some not
– For ε-range queries, different ε values
– For k-NN queries, different k values

K-nearest efficiency
[Figure: average of the edit distances of the k nearest (y-axis, 0 to 90) against k for the k-nearest neighbor query (x-axis, 5 to 25). Series: EditDist, Freq, Freq2, MaxFreq, Wav, Wav2.]

Error Rates Compared
[Figure: percentage error (y-axis, 0% to 160%) against k (x-axis, 5 to 25). Series: (Freq-Edit)/Edit, (Freq2-Edit)/Edit, (MaxFreq-Edit)/Edit, (Wav-Edit)/Edit, (Wav2-Edit)/Edit.]
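
The transformations and distance functions compared in these experiments can be sketched in Python as follows. This is an illustrative sketch and not the authors' implementation. It assumes a DNA alphabet with n-grams indexed in A, C, G, T lexicographic order, and it takes d in the wavelet distance from the absolute difference of posDist and negDist so that the result does not depend on argument order. All function names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed alphabet and ordering


def windows(sequence, w, shift):
    """Overlapping windows of length w, one starting every `shift` characters."""
    return [sequence[i:i + w] for i in range(0, len(sequence) - w + 1, shift)]


def frequency_vector(s, n=1):
    """FVn: count of every n-gram over ALPHABET, in lexicographic order.

    Assumes s contains only characters from ALPHABET.
    """
    index = {"".join(t): i for i, t in enumerate(product(ALPHABET, repeat=n))}
    vec = [0] * len(index)
    for i in range(len(s) - n + 1):
        vec[index[s[i:i + n]]] += 1
    return vec


def wavelet_vector(s, n=1):
    """WVn: [FVx + FVy, FVx - FVy] for the two halves x, y of s."""
    half = len(s) // 2
    fx, fy = frequency_vector(s[:half], n), frequency_vector(s[half:], n)
    return [a + b for a, b in zip(fx, fy)] + [a - b for a, b in zip(fx, fy)]


def pos_neg(u, v):
    """Total surplus of u over v (posDist) and of v over u (negDist)."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if a < b)
    return pos, neg


def frequency_distance(u, v, n=1):
    """FDn on two n-gram frequency vectors."""
    pos, neg = pos_neg(u, v)
    return max(pos, neg) / n


def wavelet_distance(u, v, n=1):
    """WDn on two n-gram frequency wavelets."""
    pos, neg = pos_neg(u, v)
    m = min(pos, neg)
    d = abs(pos - neg) / 2  # assumption: absolute difference keeps d non-negative
    if m < d:
        return d / n
    return (d + (m - d) / 2) / n
```

On the example strings from the slides, `frequency_distance(frequency_vector("ACTTAGC"), frequency_vector("AATGATAG"))` gives 3 and `wavelet_distance(wavelet_vector("ACTCTAGC"), wavelet_vector("AATGATAG"))` gives 2, both below the corresponding edit distances, as a lower bound should be.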

Sorted Graphs
• To depict why our distance functions perform so well in k-NN
• Imitate what our k-NN approximation does, and graph the result
– It sorts the data values in increasing order and takes the k nearest ones

Edit Distances and Matching FD1 Distances, sorted by FD1
[Figure: distance value (y-axis, 0 to 140) for the first 400 strings when sorted by FD1 (x-axis). Series: ED, FD1. The 20 nearest and the 50 nearest are marked.]

Edit Distances and Matching WD2, sorted by WD2
[Figure: distance value (y-axis, 0 to 140) for the first 400 strings when sorted by WD2 (x-axis). Series: ED, WD2. The 20 nearest and the 50 nearest are marked.]

Nature of the distance functions
• WD2 performs very well in k-NN even though it does not prune as well
– The variance of its ratio to edit distance is much lower than that of the others, which is what you would like from a distance function

wav2
[Figure: distance (y-axis, 0 to 140) for strings 1 to 666 (x-axis). Series: EditDist, WaveletDist2.]

Freq
[Figure: distance (edit and freq) (y-axis, 0 to 140) for strings sorted by edit distance to the query (x-axis, 1 to 666). Series: EditDist, FreqDist.]

Results
• Tested the parameters obtained from these random experiments on real data.
• Then also did the parameter extraction using real data.

Comparison of index structures
[Figure]

Future Work
• Check the applicability of those methods to other kinds of sequence data.
– Text
– Image search
• Implement the index structure in the standalone program, and evaluate its performance
