Introduction to BLAST
Minkoo Seo
DKE Lab., Yonsei Univ.
8 October 2004

The Five BLAST Programs

Program   Database                             Query
BLASTN    Nucleotide                           Nucleotide
BLASTP    Protein                              Protein
BLASTX    Protein                              Nucleotide translated into protein
TBLASTN   Nucleotide translated into protein   Protein
TBLASTX   Nucleotide translated into protein   Nucleotide translated into protein

Traditional BLAST programs

Alignment

Search space and alignment
The Smith-Waterman algorithm finds the maximum-scoring alignment between two sequences.
Unlike Smith-Waterman, BLAST does not explore the entire search space.
Restricting the search space is the key to BLAST's speed, but it comes at the cost of some sensitivity.

FASTA
As widely known as BLAST.
FASTA performs alignment by finding runs of k consecutive exact matches, locating the 10 best-matching regions, and joining them.

The BLAST Algorithm
STEP 1
The query sequence is optionally filtered to remove low-complexity regions. This process is called masking.
Low-complexity sequence occurs much more frequently than expected by chance in both proteins and nucleic acids.
The low-complexity region is replaced with Xs (or Ns for nucleotide sequences). Soft masking, by contrast, ignores the region only while words are being seeded.
Note that filtering is applied only to the query sequence and not to the database sequences.

The BLAST Algorithm (cont)
STEP 2
A list of words of length 3 in the query protein sequence is made, starting with positions 1, 2, and 3; then 2, 3, and 4; and so on.
For example, the sequence MGQLV has the words MGQ, GQL, and QLV.
The word length is 11 for DNA sequences and 3 for programs that translate DNA sequences.

The BLAST Algorithm (cont)
STEP 3
Using a scoring matrix (e.g., BLOSUM62), each word in the query sequence is scored against every possible combination of three amino acids.
There are a total of 20 x 20 x 20 = 8,000 possible match scores for a word.
For example, suppose that the three-letter word PQG occurs in the query sequence.
The score of PQG matched to itself is read from the BLOSUM62 matrix as:
(P-P match) + (Q-Q match) + (G-G match) = 7 + 5 + 6 = 18

The BLAST Algorithm (cont)
STEP 4
A cutoff score called the neighborhood word score threshold (T) is selected to reduce the possible matches to PQG to the most significant ones.
Note that FASTA considers only exact matches.
For example, if T is 13, only the words that score above 13 are kept.
T limits sensitivity: a higher T keeps fewer words, which speeds up the search but misses weaker matches.

The BLAST Algorithm (cont)
STEP 5
The previous procedure is repeated for each three-letter word in the query sequence.

The BLAST Algorithm (cont)
STEP 6
The remaining high-scoring words are organized into an efficient search tree so that they can be compared rapidly to the database sequences.
Approach
1. Build a DFA (deterministic finite automaton) that recognizes all the high-scoring words.
2. Run the database sequences through the DFA.
3. Remember the hits.
This builds an index on the fly.

The BLAST Algorithm (cont)
STEP 7
If a match is found, it is used to seed a possible ungapped alignment between the query and database sequences.

The BLAST Algorithm (cont)
STEP 8(a)
In the original BLAST, matching words are extended in both directions.
At this point, a larger stretch of sequence (an HSP, or high-scoring segment pair) may have been found.
For example, with MATCH = 1, MISMATCH = -1, and X = 5, a segment pair is extended until its running score falls 5 below the maximum reached so far, and is then trimmed back to that maximum.

The BLAST Algorithm (cont)
STEP 8(b)
In gapped BLAST, called BLAST2, T is lowered in Step 4.
The program then looks for short matched regions that lie on the same diagonal within distance A of each other and joins them into a longer region.
Once found, these joined regions are extended using the method of Step 8(a).

The BLAST Algorithm (cont)
STEP 9
Determine whether each HSP score found is greater than a cutoff score S.
S is determined empirically by examining the range of scores obtained when comparing random sequences and choosing a value that is significantly greater.

The BLAST Algorithm (cont)
STEP 10
Determine the statistical significance of each HSP score.
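As a rough illustration of what "statistical significance" means here, the sketch below applies the Karlin-Altschul equation, E = K m n e^{-λS}, which a later slide in this deck introduces, and converts E to a probability with P = 1 - e^{-E}. The function names, the query and database lengths, and the values λ = 0.318 and K = 0.134 (figures commonly quoted for ungapped BLOSUM62) are assumptions chosen for illustration, not values taken from these slides.

```python
import math


def e_value(raw_score, m, n, lam, k):
    """Expected number of chance HSPs scoring at least raw_score.

    Implements E = K * m * n * exp(-lambda * S), where m is the query
    length and n is the total database length.
    """
    return k * m * n * math.exp(-lam * raw_score)


def p_value(e):
    """Probability of seeing at least one such chance HSP: P = 1 - exp(-E)."""
    # expm1 keeps precision when E is very small.
    return -math.expm1(-e)


if __name__ == "__main__":
    # Assumed, illustrative parameters (not from the slides).
    LAMBDA, K = 0.318, 0.134
    QUERY_LEN, DB_LEN = 250, 50_000_000

    for score in (40, 60, 80, 100):
        e = e_value(score, QUERY_LEN, DB_LEN, LAMBDA, K)
        print(f"S={score:3d}  E={e:.3g}  P={p_value(e):.3g}")
```

For small E the two numbers are nearly equal, which is why BLAST reports can quote E alone.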
Sometimes, two or more HSP regions that can be combined into a longer alignment are found.
For example,
HSP set #1: scores 65 and 40
HSP set #2: scores 52 and 45
Poisson method: a set of scores is more significant when its lowest score is higher, so set #2 ranks first (45 is higher than 40).
Sum-of-scores method: the set with the higher total ranks first, so set #1 ranks first (65 + 40 = 105 is higher than 52 + 45 = 97).

The BLAST Algorithm (cont)
STEP 11
Smith-Waterman local alignments are shown for the query sequence with each of the matched sequences in the database.

The Gumbel Extreme Value Dist.
Extreme Value Distribution
When two sequences have been aligned optimally, the significance of the local alignment score can be tested against the scores obtained by aligning random sequences of the same length and composition.
These random alignment scores follow the extreme value distribution.
Goal
Evaluate the probability that a score between random or unrelated sequences will reach the score found between two real sequences of interest.

The Gumbel Extreme Value Dist. (cont)
Extreme Value Distribution
Probability distribution (Eq. 17): $Y_{ev} = \exp[-x - e^{-x}]$
Mean: the Euler-Mascheroni constant, $\gamma \approx 0.57722$
Variance: $\sigma^2 = \pi^2 / 6 \approx 1.6449$
Probability that score S will be less than value x (Eq. 19): $P(S < x) = \exp[-e^{-x}]$
Probability that S is greater than or equal to value x (Eq. 20): $P(S \ge x) = 1 - \exp[-e^{-x}]$
Eq. 17 and Eq. 20 can be modified to accommodate extreme values (Eq. 22): $P(S \ge x) = 1 - \exp[-e^{-\lambda (x - u)}]$
where $u$ is the mode (the highest point, or characteristic value) of the distribution and $\lambda$ is the decay or scale parameter.

Karlin-Altschul Statistics
Karlin-Altschul statistics (Samuel Karlin and Stephen Altschul, 1990) make five central assumptions:
1. A positive score must be possible.
2. The expected score must be negative.
3. The letters of the sequences are independent and identically distributed (IID).
4. The sequences are infinitely long.
5. Alignments do not contain gaps.

The first two assumptions are true for any scoring matrix estimated from real data.
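The first two assumptions can be checked numerically for any scoring scheme. The sketch below does so for an assumed toy scheme (match = +1, mismatch = -1, uniform nucleotide frequencies of 0.25 each; none of these names or values come from the slides). It also solves for the scale parameter λ as the positive root of $\sum_{a,b} p_a p_b e^{\lambda s_{ab}} = 1$, the standard Karlin-Altschul definition; that root exists precisely because the two assumptions hold.

```python
import math
from itertools import product

# Assumed toy inputs: uniform nucleotide background, +1/-1 scoring.
FREQS = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def match_mismatch(a, b):
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def expected_score(freqs, score):
    """Expected score of one aligned pair of independently drawn letters."""
    return sum(freqs[a] * freqs[b] * score(a, b)
               for a, b in product(freqs, repeat=2))


def max_score(freqs, score):
    """Largest score any letter pair can receive."""
    return max(score(a, b) for a, b in product(freqs, repeat=2))


def solve_lambda(freqs, score, tol=1e-12):
    """Positive root of sum p_a p_b exp(lam * s_ab) = 1, found by bisection."""
    if max_score(freqs, score) <= 0 or expected_score(freqs, score) >= 0:
        raise ValueError("scoring scheme violates assumptions 1 or 2")

    def f(lam):
        return sum(freqs[a] * freqs[b] * math.exp(lam * score(a, b))
                   for a, b in product(freqs, repeat=2)) - 1.0

    lo, hi = 1e-9, 1.0
    while f(hi) <= 0:          # grow the bracket until f changes sign
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    print("max score:     ", max_score(FREQS, match_mismatch))
    print("expected score:", expected_score(FREQS, match_mismatch))
    print("lambda:        ", solve_lambda(FREQS, match_mismatch))
```

For this toy scheme the equation reduces to $0.25 e^{\lambda} + 0.75 e^{-\lambda} = 1$, whose positive root is $\lambda = \ln 3$, so the bisection result can be checked by hand.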
The last three assumptions are problematic because biological sequences have context dependencies, are not infinitely long, and are frequently aligned with gaps.

Karlin-Altschul Statistics (cont)
For now, though, let's turn to the Karlin-Altschul equation:
$E = K m n e^{-\lambda S}$
This equation states that the number of alignments expected by chance (E) during a sequence database search is a function of the size of the search space ($m \times n$), the normalized score ($\lambda S$), and a minor constant K.
Hence, the relationship between the expected number of alignments and the search space is linear.
The relationship between the expected number of alignments and the score is exponential. This means that small changes in score can lead to large differences in E.

Sample BLAST Output
[Figure: annotated BLAST output. Callouts: number of matches with a positive score; scoring-matrix-independent score; scoring-matrix-dependent score; $E = K m n e^{-\lambda S}$; number of exactly matching characters.]

PSI-BLAST
Position-Specific Iterated BLAST
Scoring matrix searching
1. It can identify additional related sequences that might otherwise be missed.
2. The difficulty with such an expanded search is that an alignment of related sequences must already be available.
Method of PSI-BLAST
1. Search the database with the given query using BLAST.
2. A set of related sequences is found.
3. Perform a multiple sequence alignment (msa) on the result set.
4. Build a scoring matrix from the msa result.
5. Search the database using the matrix.
6. Go to step 1 to find sequences similar to the result of step 5.

PSI-BLAST (cont)
Innate limitation of the profile searching approach
[Figure: query and family.]
There is no guarantee that the alignments finally discovered represent the same set of related sequences.

PSI-BLAST (cont)
Problems in the PSI-BLAST matrix
The matrix covers the entire length of the aligned sequences, whereas other matrices cover only a short stretch of the alignment.
The same gap penalties are used throughout the procedure; there are no position-specific penalties as in other programs.
Each subsequent alignment uses the query sequence as a master template, producing a multiple sequence alignment of the same length as the query sequence. Thus, the msa is a compilation of pairwise alignments rather than a true msa.

PSI-BLAST (cont)
PHI-BLAST (Pattern Hit Initiated BLAST)
Much like PSI-BLAST, except that the query sequence is first searched for a complex pattern provided by the investigator. The database search is then focused on regions containing the pattern.
PROBE
Similar to PSI-BLAST, but performs a more complex and rigorous type of data analysis using a Bayesian statistical approach.
MAXHOM
Matching sequences found in a database search are aligned with the query sequence by dynamic programming, and a profile is made from the alignment. A new round of sequences that match the updated profile is then picked from Swiss-Prot.

References
Vanathi Gopalakrishnan, Ph.D., Advanced Medical Informatics Seminar. http://omega.cbmi.upmc.edu/~vanathi/
Doug Brutlag and Lee Kozar, Computational Molecular Biology Course. http://cmgm.stanford.edu/biochem218/
David W. Mount, Bioinformatics, Cold Spring Harbor Laboratory Press, 2001.
Stephen F. Altschul et al., "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Research, 25(17):3389-3402, 1997.
Stephen F. Altschul et al., "Basic Local Alignment Search Tool," J. Mol. Biol., 215:403-410, 1990.
Ian Korf, Mark Yandell and Joseph Bedell, BLAST, O'Reilly, 2003.