Database Searches: BLAST

BLAST
• Basic Local Alignment Search Tool
  – Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990)
  – Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman, Nucleic Acids Res. 25 (1997)
• Main ideas:
  – Increase search speed by finding fewer, but better, hot spots during the initial screening phase
  – Use longer word sizes
  – Integrate the scoring matrix into the first phase
• Compare with FASTA, which requires exact matches

BLAST Terminology
• Segment pair: equal-length substrings of sequences S1 and S2
• Locally maximal segment pair: a segment pair whose alignment score cannot be improved by extending or shortening it
• Maximum segment pair (MSP): the segment pair with maximum score over all segment pairs in the sequences S1 and S2
• High-scoring segment pair (HSP): a segment pair with score higher than some cutoff score, s
• w is the word-length parameter; t is the threshold parameter

BLAST: Hits
• A hit is a w-length word in the database that aligns with a word from the query sequence with score > t
• BLAST looks for hits instead of exact matches
  – Allows the word size to be kept high for speed, without sacrificing sensitivity
• Typically, w = 3-5 for amino acids, ~11-12 for DNA
• t is the most critical parameter:
  – Higher t: fewer “background” hits (faster)
  – Lower t: better ability to detect more distant relationships (at the cost of increased noise)

Hits
• For each word, evaluate the score of a match (exact or not) according to BLOSUM62
  – E.g., for PQG, the score of an exact match is 7 + 5 + 6 = 18
• There are 20^w possible w-length words, but considering only those with score > t greatly reduces the number of matches
  – E.g., there are 20^3 = 8000 possible matches to PQG, but only 50 achieve score > t = 13

BLAST: Extending a hit
• After locating a hit, BLAST attempts to extend the hit in both directions, until the score has dropped more than X below the maximum score yet attained.
• The extension step typically accounts for > 90% of execution time.
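The extension rule above (stop once the running score has dropped more than X below the best score seen so far) can be illustrated with a minimal sketch. This is not BLAST's actual code: the function names `extend_hit` and `_extend`, and the flat match/mismatch scoring in `simple_score` (+5/-4, standing in for a real matrix such as BLOSUM62), are assumptions made purely for illustration.

```python
def simple_score(a, b, match=5, mismatch=-4):
    # Hypothetical stand-in for a substitution matrix lookup.
    return match if a == b else mismatch


def _extend(query, subject, q, s, step, score_fn, x_drop):
    """Walk along the diagonal from (q, s) in direction step (+1 or -1).

    Returns (best gain over the seed, number of positions kept).
    """
    running = best = 0
    kept = walked = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += score_fn(query[q], subject[s])
        walked += 1
        if running > best:
            best, kept = running, walked
        elif best - running > x_drop:
            break  # score fell more than X below the maximum attained
        q += step
        s += step
    return best, kept


def extend_hit(query, subject, q_pos, s_pos, w, x_drop, score_fn=simple_score):
    """Ungapped extension of a w-length hit at query[q_pos], subject[s_pos].

    Returns (score, q_begin, q_end) with q_end exclusive.
    """
    seed = sum(score_fn(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    right_gain, right_len = _extend(
        query, subject, q_pos + w, s_pos + w, +1, score_fn, x_drop)
    left_gain, left_len = _extend(
        query, subject, q_pos - 1, s_pos - 1, -1, score_fn, x_drop)
    return seed + left_gain + right_gain, q_pos - left_len, q_pos + w + right_len


# The hit "ACG" at position 3 grows rightwards to "ACGTACG" and stops
# once three mismatches push the score more than X = 8 below the maximum.
print(extend_hit("GGGACGTACGAAA", "CCCACGTACGTTT", 3, 3, 3, 8))  # (35, 3, 10)
```

Each direction is extended independently and the gains are added to the seed score. Because the drop is always measured relative to the best score seen in that direction, this gives the same stopping points as tracking a single running total.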
Extending a hit: 2-hit method improvement
• Do extensions only when there are two hits on the same diagonal within some distance A of each other (e.g., A = 40)
• Reduces sensitivity (ability to detect distantly related sequences)
  – To compensate, use a lower t value (e.g., 11 rather than 13)
• Since we only extend when there are two nearby hits, many fewer regions are extended

Gapped BLAST
• Allows local alignments with indels (similar to FASTA)
• Local alignments from different diagonals are merged: a first local alignment, followed by some indels, followed by a second local alignment, etc.
  – Equivalent to a path through the dynamic programming matrix composed of alternating diagonal sections and paths connecting them

Gapped BLAST
• Original BLAST implicitly handled gaps by finding several distinct HSPs and calculating a statistical assessment of the combined result
  – Two or more HSPs, each below the cutoff value, might in combination rise to statistical significance
• Gapped BLAST extends hits by allowing gaps when hits are promising (score exceeds s_g):
  – Advantage: we can afford to miss some HSPs as long as at least one is found
• Use dynamic programming, starting from the center of each high-scoring region, if s > s_g
  – s_g is chosen such that gapped alignment is triggered in about 1/50 of the sequences compared

PSI-BLAST
• Position-Specific Iterated BLAST
• Generates a multiple alignment from statistically significant alignments produced by BLAST
• Produces a position-specific score matrix (PSSM)
  – Can search the database using the PSSM
  – Match sequences to profile
  – Generate new profiles
  – Repeat (iteration)
  – Search gradually extends to increasingly divergent sequences

Flavors of BLAST
• BLASTP - protein query against protein DB
• BLASTN - DNA/RNA query against GenBank (DNA)
• BLASTX - 6-frame translated DNA query against protein DB
• TBLASTN - protein query against 6-frame translated GenBank
• TBLASTX - 6-frame translated DNA query against 6-frame translated GenBank
• PSI-BLAST - protein ‘profile’ query against protein DB
• PHI-BLAST - protein pattern against protein DB
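The "6-frame translation" used by BLASTX, TBLASTN and TBLASTX can be sketched as follows. This is a minimal illustration that assumes the standard genetic code and an unambiguous DNA alphabet. The function names and the frame labels "+1" through "-3" are choices made here, not part of any BLAST interface.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate three forward and three reverse-complement reading frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in sorted(six_frames("ATGGCCAAATAA").items()):
    print(name, protein)
```

A translated search compares the protein query (or each translated query frame) against every one of these six protein strings. That is why these flavors cost noticeably more than a plain BLASTP or BLASTN search.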