q-gram Based Database Searching Using A Suffix Array (QUASAR)
S. Burkhardt
A. Crauser
H-P. Lenhof
Max-Planck Institut f. Informatik, Saarbrücken
E. Rivals
P. Ferragina
M. Vingron
Deutsches Krebsforschungszentrum, Heidelberg
Outline
• Existing Work
• Motivation
• Problem
• Algorithm
• Results
• Open Problems
• Examples:
• BLAST
• FASTA
• Linear Scan (No Index)
• Good Sensitivity
BLAST:
Scan database for exact matching
substrings of the query
Extend hits iteratively
Small substring length allows high sensitivity
Scan is expensive, every character in D is accessed
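The scan-and-extend idea on this slide can be illustrated with a minimal sketch. It is not BLAST's actual algorithm: real BLAST scores extensions and tolerates mismatches, while this simplified version only extends a seed while characters match exactly. The function name `seed_and_extend` and the seed length `k` are assumptions made for illustration.

```python
def seed_and_extend(query, d, k):
    """Find exact k-mer seeds by a linear scan of d, then extend each seed
    left and right for as long as the characters keep matching.
    Returns sorted (query_start, db_start, length) triples."""
    # Index the query's k-mers so each database position is a single lookup.
    seeds = {}
    for i in range(len(query) - k + 1):
        seeds.setdefault(query[i:i + k], []).append(i)

    hits = set()
    # Linear scan: every position of the database is visited once.
    for j in range(len(d) - k + 1):
        for i in seeds.get(d[j:j + k], ()):
            q_start, d_start = i, j
            while q_start > 0 and d_start > 0 and query[q_start - 1] == d[d_start - 1]:
                q_start -= 1
                d_start -= 1
            q_end, d_end = i + k, j + k
            while q_end < len(query) and d_end < len(d) and query[q_end] == d[d_end]:
                q_end += 1
                d_end += 1
            hits.add((q_start, d_start, q_end - q_start))
    return sorted(hits)


P = "TCGATTACAGTGAAT"
D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
print(seed_and_extend(P, D, k=4))
# [(0, 4, 5), (7, 26, 4), (8, 19, 7)]
```

A smaller `k` yields more seeds and therefore higher sensitivity, but the scan still touches every character of the database, which is the cost the slide points out.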
• Today: New Applications
• Examples:
• EST-Clustering
• Large Scale Shotgun Assembly
• Low Sensitivity
• Multiple Searches
⇒ Specialized Algorithms Needed
Pattern P
TCGATTACAGTGAAT
w=8
Database D
GCATTCGATGGACTGGACTAGTGAATCAGT
• Local Alignment, minimum Length w
• Low Error Rate (<10% Edit Distance)
• Filter Step:
• Identify Hotspots
• Scan Step:
• Scan Hotspots with BLAST
• q-gram Filtration
• Block Addressing
• Suffix Array
• Window Shifting
q-gram Filtration
TCGATTACAGTGAAT
q=3, w=8
q-grams of the first window: TCG, CGA, GAT, ATT, TTA, TAC
# of q-grams: |P| - q + 1
GCATTCGATGGACTGGACTAGTGAATCAGT
Edit Distance e: at least t = |P| - q + 1 - (qe) common q-grams
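As a small worked example of the counting above, the following sketch extracts the q-grams of a window and evaluates the threshold t. The helper names `qgrams` and `threshold` are chosen here for illustration and do not come from the slides.

```python
def qgrams(s, q):
    """Return all overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def threshold(length, q, e):
    """Lower bound on shared q-grams: t = length - q + 1 - q*e."""
    return length - q + 1 - q * e


window = "TCGATTAC"  # first w = 8 characters of the pattern
print(qgrams(window, 3))              # ['TCG', 'CGA', 'GAT', 'ATT', 'TTA', 'TAC']
print(threshold(len(window), 3, 1))   # 3
```

Each edit operation can destroy at most q overlapping q-grams, which is why q*e is subtracted from the total count.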
Block Addressing
TCGATTAC
• Divide D into Blocks
• Count matching q-grams per Block
• Scan Blocks with counter ≥ t
How to find the matching q-grams?
GCATTCGATGGACTGGACTAGTGAATCAGT
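The block counting step can be sketched as follows. This is a simplified illustration, not the QUASAR implementation: q-gram occurrences are found with a plain dictionary, blocks do not overlap, and the block size of 10 is an assumption chosen so that the example database splits into three blocks.

```python
from collections import defaultdict


def block_counters(window, d, q, block_size):
    """Count, per database block, the occurrences of the window's q-grams."""
    # Map each q-gram of the database to its start positions (naive index).
    index = defaultdict(list)
    for i in range(len(d) - q + 1):
        index[d[i:i + q]].append(i)

    n_blocks = (len(d) + block_size - 1) // block_size
    counters = [0] * n_blocks
    for i in range(len(window) - q + 1):
        for pos in index.get(window[i:i + q], ()):
            counters[pos // block_size] += 1
    return counters


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
counters = block_counters("TCGATTAC", D, q=3, block_size=10)
print(counters)                                      # [4, 0, 0]
print([b for b, c in enumerate(counters) if c >= 3]) # [0]
```

Only block 0 reaches the threshold t = 3, so only that block would be handed to the scan step.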
Suffix Array
TCGATTAC
AAA : 0
AAC : 0
AAG : 0
AAT : 0
ACA : 1
ACC : 1
ACG : 1
ACT : 1
AGA : 3
AGC : 3
AGG : 3
AGT : 3
ATA : 4
ATC : 4
ATG : 4
ATT : 5
TGA : 26
TGC : 27
TGG : 27
TGT : 29
TTA : 29
TTC : 29
TTG : 30
TTT : 30
[Figure: suffix array entries 23, 16, 11, ..., 3 pointing to positions in D]
GCATTCGATGGACTGGACTAGTGAATCAGT
[Positions in D are numbered 0 to 29]
• Sorted List of Pointers to Suffixes, O(log |D|) Access Time
• Precompute Searches for q-grams, O(1) Time Access
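The two bullets above can be illustrated with a minimal sketch. It builds the suffix array by naive sorting, which is far too slow for a real database, and precomputes a (start, end) range per q-gram; the slide's table appears to store only the start index, so storing the end as well is an assumption made for convenience. All function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from itertools import product


def build_suffix_array(d):
    """Sorted list of suffix start positions (naive construction)."""
    return sorted(range(len(d)), key=lambda i: d[i:])


def build_lookup(d, sa, q, alphabet="ACGT"):
    """Precompute, for every possible q-gram, its range in the suffix array."""
    # Length-q prefixes of sorted suffixes are themselves in sorted order.
    prefixes = [d[i:i + q] for i in sa]
    table = {}
    for letters in product(alphabet, repeat=q):
        gram = "".join(letters)
        table[gram] = (bisect_left(prefixes, gram), bisect_right(prefixes, gram))
    return table


def occurrences(gram, sa, table):
    """Constant-time table lookup, then read the positions from the array."""
    start, end = table[gram]
    return sorted(sa[start:end])


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
sa = build_suffix_array(D)
table = build_lookup(D, sa, q=3)
print(sa[:3])                         # [23, 16, 11]
print(occurrences("ACT", sa, table))  # [11, 16]
```

The binary searches are paid once while building the table; at query time each q-gram costs a single dictionary access.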
Window Shifting
q=3, w=8, e=1, t=3
TCGATTACAGTGAAT
• Move Window over Query
• Mark full Blocks for each Window
• Scan Marked Blocks
GCATTCGATGGACTGGACTAGTGAATCAGT
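A sketch of the window shifting step, under several assumptions: counters are updated incrementally as one q-gram enters and one leaves the window, blocks do not overlap, the block size is 10, and q-gram positions come from a plain dictionary. The slides do not spell out these details, so this is an illustration rather than the authors' implementation.

```python
from collections import defaultdict


def mark_blocks(query, d, q, w, e, block_size):
    """Slide a window of length w over the query and return the blocks whose
    counter reaches t = w - q + 1 - q*e for at least one full window."""
    t = w - q + 1 - q * e
    # For each q-gram of the database, the blocks in which it starts.
    index = defaultdict(list)
    for i in range(len(d) - q + 1):
        index[d[i:i + q]].append(i // block_size)

    n_blocks = (len(d) + block_size - 1) // block_size
    counters = [0] * n_blocks
    marked = set()
    grams_per_window = w - q + 1
    for j in range(len(query) - q + 1):
        # q-gram j enters the window.
        for b in index.get(query[j:j + q], ()):
            counters[b] += 1
        # The q-gram that fell out of the window on the left leaves.
        k = j - grams_per_window
        if k >= 0:
            for b in index.get(query[k:k + q], ()):
                counters[b] -= 1
        # Only full windows are checked against the threshold.
        if j >= grams_per_window - 1:
            marked.update(b for b, c in enumerate(counters) if c >= t)
    return sorted(marked)


P = "TCGATTACAGTGAAT"
D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
print(mark_blocks(P, D, q=3, w=8, e=1, block_size=10))  # [0, 2]
```

Blocks 0 and 2 are marked: block 0 by the windows at the start of the query, block 2 by the windows covering its end.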
• Influence of the Block Size
• Sensitivity
• Running Times
• Overhead for loading the Index
Benchmark System:
UltraSPARC Processor, 333 MHz, 4 GB RAM
Solaris 2.5.1
Influence of Block Size
[Chart: Time in Seconds (0 to 0.8) against Block Size (512, 1024, 2048, 4096, 8192); curves: Filter Time, Scan Time, Total Time]
Sensitivity
• 1000 Queries
• 6 % Error
• BLAST Cutoff E = 0.00001
• Number of identical hitlists
• Mouse EST DB: 91.4 %
• Human EST DB: 97.1 %
⇒ QUASAR finds many Hits
beyond selected Error Level
Running Times
• Test Parameters:
• 6% Error
• w = 50
• q = 11
• block size 2048
• scan with BLAST
• time averaged for 1000 queries
• ~30 times faster than BLAST
Running times in seconds:
             QUASAR    BLAST
Mouse EST    0.123     3.371
Human EST    0.380     13.275
Overhead for Loading the Index
• 1000 queries
• Human EST DB, 280 Mbp
• BLAST Test Run:
• 5 seconds Load Time
• 13 270 seconds Search Time
• QUASAR Test Run:
• 90 seconds Load Time
• 380 seconds Search Time
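The figures above allow a rough break-even estimate: how many queries are needed before the longer index load time pays off. The sketch below assumes that search time grows linearly with the number of queries at the per-query average measured over the 1000-query run, which the slides do not state explicitly.

```python
import math

# Measured on the 1000-query Human EST run (seconds).
BLAST_LOAD, BLAST_SEARCH = 5.0, 13270.0
QUASAR_LOAD, QUASAR_SEARCH = 90.0, 380.0
N_QUERIES = 1000


def total_seconds(load, search, n, measured_n=N_QUERIES):
    """Load time plus search time scaled linearly to n queries (assumption)."""
    return load + search / measured_n * n


def break_even_queries():
    """Smallest query count at which QUASAR's total time beats BLAST's."""
    extra_load = QUASAR_LOAD - BLAST_LOAD
    saved_per_query = (BLAST_SEARCH - QUASAR_SEARCH) / N_QUERIES
    return math.ceil(extra_load / saved_per_query)


blast_total = total_seconds(BLAST_LOAD, BLAST_SEARCH, N_QUERIES)
quasar_total = total_seconds(QUASAR_LOAD, QUASAR_SEARCH, N_QUERIES)
print(round(blast_total), round(quasar_total))   # 13275 470
print(round(blast_total / quasar_total, 1))      # 28.2
print(break_even_queries())                      # 7
```

Under this assumption the index load is amortized after only a handful of queries, and for the full run the total speedup including load time is about 28 times.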