q-gram Based Database Searching Using A Suffix Array (QUASAR)

S. Burkhardt, A. Crauser, H-P. Lenhof
Max-Planck Institut f. Informatik, Saarbrücken
E. Rivals, P. Ferragina, M. Vingron
Deutsches Krebsforschungszentrum, Heidelberg

Outline
• Existing Work
• Motivation
• Problem
• Algorithm
• Results
• Open Problems

Existing Work
• Examples:
  • BLAST
  • FASTA
• Linear Scan (No Index)
• Good Sensitivity

BLAST:
• Scan database for exact matching substrings of the query
• Extend hits iteratively
• Small substring length allows high sensitivity
• Scan is expensive, every character in D is accessed

Motivation
• Today: New Applications
• Examples:
  • EST-Clustering
  • Large Scale Shotgun Assembly
• Low Sensitivity
• Multiple Searches
⇒ Specialized Algorithms Needed

Problem
Pattern P:  TCGATTACAGTGAAT (w=8)
Database D: GCATTCGATGGACTGGACTAGTGAATCAGT
• Local Alignment, minimum Length w
• Low Error Rate (<10% Edit Distance)

Algorithm
• Filter Step:
  • Identify Hotspots
• Scan Step:
  • Scan Hotspots with BLAST

• q-gram Filtration
• Block Addressing
• Suffix Array
• Window Shifting

q-gram Filtration
Query: TCGATTACAGTGAAT (q=3, w=8)
q-grams of the window TCGATTAC: TCG, CGA, GAT, ATT, TTA, TAC
Database: GCATTCGATGGACTGGACTAGTGAATCAGT
• # of q-grams: |P| - q + 1
• Edit Distance e: at least t = |P| - q + 1 - (q · e) common q-grams

Block Addressing
Window: TCGATTAC
• Divide D into Blocks
• Count matching q-grams per Block
• Scan Blocks with counter ≥ t
How to find the matching q-grams?
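The q-gram threshold and the per-block counting described in the q-gram Filtration and Block Addressing slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the block size of 8, and the plain dictionary used to find q-gram positions are all assumptions made for the example (the slides go on to use a suffix array for that lookup). Each q-gram occurrence is credited to the block containing its start position, which is a simplification.

```python
from collections import defaultdict


def qgrams(s, q):
    """Return all overlapping q-grams of s, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def threshold(w, q, e):
    """Minimum number of common q-grams for a window of length w
    matched with at most e edit operations: t = w - q + 1 - q*e."""
    return w - q + 1 - q * e


def block_counts(window, database, q, block_size):
    """Count, per database block, the q-gram occurrences shared with window."""
    # Hypothetical stand-in for the index: dictionary from q-gram to positions.
    positions = defaultdict(list)
    for i, gram in enumerate(qgrams(database, q)):
        positions[gram].append(i)

    n_blocks = (len(database) + block_size - 1) // block_size
    counts = [0] * n_blocks
    for gram in qgrams(window, q):
        for pos in positions.get(gram, []):
            counts[pos // block_size] += 1  # block of the occurrence's start
    return counts


P = "TCGATTAC"                            # window from the slides (w=8)
D = "GCATTCGATGGACTGGACTAGTGAATCAGT"      # database from the slides
t = threshold(len(P), q=3, e=1)           # 8 - 3 + 1 - 3 = 3
counts = block_counts(P, D, q=3, block_size=8)
hot = [b for b, c in enumerate(counts) if c >= t]
print(t, counts, hot)
```

With these assumed parameters only the first block reaches the threshold, so only that block would be passed on to the scan step.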
[Figure: database D = GCATTCGATGGACTGGACTAGTGAATCAGT divided into blocks, each with a q-gram counter]

Suffix Array
Window: TCGATTAC

Lookup table (q-gram : position in the suffix array; rows between ATT and TGA not shown):
AAA : 0    AAC : 0    AAG : 0    AAT : 0
ACA : 1    ACC : 1    ACG : 1    ACT : 1
AGA : 3    AGC : 3    AGG : 3    AGT : 3
ATA : 4    ATC : 4    ATG : 4    ATT : 5
...
TGA : 26   TGC : 27   TGG : 27   TGT : 29
TTA : 29   TTC : 29   TTG : 30   TTT : 30

[Figure: pointers (3, 23, 16, 11 shown) into D = GCATTCGATGGACTGGACTAGTGAATCAGT, whose positions are numbered 0 to 29]
• Sorted List of Pointers to Suffixes, O(log |D|) Access Time
• Precompute Searches for q-grams, O(1) Time Access

Window Shifting
q=3, w=8, e=1, t=3
Query: TCGATTACAGTGAAT
• Move Window over Query
• Mark full Blocks for each Window
• Scan Marked Blocks
[Figure: D = GCATTCGATGGACTGGACTAGTGAATCAGT with per-block counters updated as the window moves]

Results
• Influence of the Block Size
• Sensitivity
• Running Times
• Overhead for loading the Index
Benchmark System: Ultra Sparc Processor, 333 MHz, 4 GB RAM, Solaris 2.5.1

Influence of Block Size
[Figure: Scan Time, Filter Time and Total Time (Time in Seconds, 0 to 0.8) plotted against Block Size (512, 1024, 2048, 4096, 8192)]

Sensitivity
• 1000 Queries
• 6 % Error
• BLAST Cutoff E = 0.00001
• Number of identical hitlists:
  • Mouse EST DB: 91.4 %
  • Human EST DB: 97.1 %
⇒ QUASAR finds many Hits beyond selected Error Level

Running Times
• Test Parameters:
  • 6% Error
  • w = 50
  • q = 11
  • block size 2048
  • scan with BLAST
  • time averaged for 1000 queries
• ~30 times faster than BLAST

Running times in seconds:
             QUASAR   BLAST
Mouse EST    0.123    3.371
Human EST    0.380    13.275

Overhead for Loading the Index
• 1000 queries
• Human EST DB, 280 Mbps
• BLAST Test Run:
  • 5 seconds Load Time
  • 13 270 seconds Search Time
• QUASAR Test Run:
  • 90 seconds Load Time
  • 380 seconds Search Time
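Returning to the Suffix Array and Window Shifting slides, the following sketch shows one way a precomputed q-gram table over a suffix array could drive the block counters while a window moves over the query. It is an illustration under stated assumptions, not the QUASAR code: the function names and the block size of 8 are hypothetical, the suffix array is built by naive sorting, and every window is recounted from scratch for clarity, whereas an incremental update of the counters would avoid repeated work.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_qgram_table(text, sa, q):
    """Map each q-gram occurring in text to its half-open range [lo, hi) in sa.

    Suffixes sharing a q-gram prefix are adjacent in the suffix array,
    so one precomputed range per q-gram gives constant-time lookup.
    """
    table = {}
    for rank, pos in enumerate(sa):
        gram = text[pos:pos + q]
        if len(gram) < q:
            continue  # suffix too short to contain a full q-gram
        lo, _ = table.get(gram, (rank, rank))
        table[gram] = (lo, rank + 1)
    return table


def marked_blocks(query, text, sa, table, q, w, e, block_size):
    """Slide a window of length w over query; return blocks reaching t."""
    t = w - q + 1 - q * e
    n_blocks = (len(text) + block_size - 1) // block_size
    marked = set()
    for start in range(len(query) - w + 1):
        counts = [0] * n_blocks  # recounted per window (simplification)
        for i in range(start, start + w - q + 1):
            lo, hi = table.get(query[i:i + q], (0, 0))
            for rank in range(lo, hi):
                counts[sa[rank] // block_size] += 1
        marked.update(b for b, c in enumerate(counts) if c >= t)
    return sorted(marked)


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"   # database from the slides
Q = "TCGATTACAGTGAAT"                  # query from the slides
sa = build_suffix_array(D)
table = build_qgram_table(D, sa, q=3)
marked = marked_blocks(Q, D, sa, table, q=3, w=8, e=1, block_size=8)
print(marked)
```

Only the marked blocks would then be handed to the scan step (BLAST in the slides); all other blocks of D are never read.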