Sequence Similarity Searching: Understanding and Using Web Based BLAST
Dr. Joanne Fox
Lecture/Lab 3.1 BLAST

Concepts of Sequence Similarity Searching
• The premise:
  – One sequence by itself is not informative; it must be analyzed by comparative methods against existing sequence databases to develop hypotheses concerning relatives and function.

Sequence Similarity Comparisons
• Alignments can be global or local (this is algorithm specific)
  – A global alignment is an optimal alignment that includes all characters from each sequence (Clustal generates global alignments)
  – A local alignment is an optimal alignment that includes only the most similar local region or regions (BLAST generates local alignments).

Sequence Similarity Searches
• Goals
  – Identify all homologs (true positives)
    • infer function, transfer annotations, structure/domain information
  – Limit the misidentification of non-homologs (false positives)
  – Search large sets of sequences efficiently

[Diagram: QUERY sequence(s), BLAST program, BLAST database, BLAST results]

Sequence Similarity Searching – The statistics are important
• Discriminating between real and artifactual matches is done using an estimate of the probability that the match might occur by chance.
• We’ll talk more about the meaning of the scores (S) and E-values (E) that are associated with BLAST hits

The BLAST algorithm
• The BLAST programs (Basic Local Alignment Search Tool) are a set of sequence comparison algorithms introduced in 1990 that are used to search sequence databases for optimal local alignments to a query.
  – Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410.
  – Altschul SF, Madden TL, Schaeffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.” NAR 25:3389-3402.

Several different BLAST programs:
Program    Description
blastp     Compares an amino acid query sequence against a protein sequence database.
blastn     Compares a nucleotide query sequence against a nucleotide sequence database.
blastx     Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
tblastn    Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx    Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page because it is too computationally intensive.

[Screenshot: http://www.ncbi.nlm.nih.gov/BLAST/ with labels: blastp, blastn, blastx, tblastn, tblastx]

Other BLAST programs
• BLAST 2 Sequences (bl2seq)
  – Aligns two sequences of your choice
  – Can do different types of comparison, e.g. blastx
  – Gives dot-plot like output
• VecScreen
  – Compares query with sequences of known cloning vectors
• Both very handy for sequencing!

More BLAST programs
• BLAST against genomes
  – Many available
  – BLAST parameters pre-optimized
  – Handy for mapping query to genome
• Search for short exact matches
  – BLAST parameters pre-optimized
  – Great for checking probes and primers

MegaBLAST
• megaBLAST
  – For aligning sequences which differ slightly due to sequencing errors etc.
  – Very efficient for long query sequences
  – Uses big word (k-tuple) sizes to start search
    • Very fast
  – Accepts batch submissions of ESTs
  – Can upload files of sequences as queries
• More detailed info: see megaBLAST pages

[Diagram: QUERY sequence(s), BLAST program, BLAST database, BLAST results]

Considerations for choosing a BLAST database
• First consider your research question:
  – Are you looking for a particular gene in a particular species?
    • BLAST against the genome of that species.
  – Are you looking for additional members of a protein family across all species?
    • BLAST against the non-redundant database (nr); if you can’t find hits, check wgs, htgs, and the trace archives.
  – Are you looking to annotate genes in your species of interest?
    • BLAST against known genes (RefSeq) and/or ESTs from a closely related species.

When choosing a database for BLAST…
• It is important to know your reagents.
  – Changing your choice of database is changing your search space
  – Database size affects the BLAST statistics
    • Record BLAST parameters, database choice, and database size in your bioinformatics lab book, just as you would for your wetbench experiments.
  – Databases change rapidly and are updated frequently
    • It may be necessary to repeat your analyses

BLAST protein databases available through the blastp web interface @ NCBI
[Screenshot label: blastp db]

BLAST nucleotide databases available through the blastn web interface @ NCBI
[Screenshot label: blastn db]

Creating Custom Databases for BLAST
UBiC FAQ

Important Terms for Sequence Similarity Searching with very different meanings
• Similarity
  – The extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation. In BLAST, similarity refers to a positive matrix score.
• Identity
  – The extent to which two (nucleotide or amino acid) sequences are invariant.
• Homology
  – Similarity attributed to descent from a common ancestor.
• It is your responsibility as an informed bioinformatician to use these terms correctly: a sequence is either homologous or not. Don’t use % with this term!

How Does BLAST Really Work?
• The BLAST programs improved the overall speed of searches while retaining good sensitivity (important as databases continue to grow) by breaking the query and database sequences into fragments ("words"), and initially seeking matches between fragments.
• Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S".

[Figure: BLAST Algorithm]

[Figure: Extending the High Scoring Segment Pair (HSP); labels: Significance Decay, Minimum Score, Neighborhood Score Threshold]

BLAST Algorithm
• Sequences are split into words (default n=3 for protein queries)
  – Speed, computational efficiency
• Scoring of matches done using scoring matrices
• HSP = high scoring segment pair
  – BLAST algorithm extends the initial “seed” hit into an HSP
    • Local optimal alignment
    • More than one HSP can be found

Where does the score (S) come from?
• The quality of each pair-wise alignment is represented as a score and the scores are ranked.
• Scoring matrices are used to calculate the score of the alignment base by base (DNA) or amino acid by amino acid (protein).
• The alignment score will be the sum of the scores for each position.

What’s a scoring matrix?
• Substitution matrices are used for amino acid alignments.
  – each possible residue substitution is given a score
• A simpler unitary matrix is used for DNA pairs
  – each position can be given a score of +1 if it matches and a score of -2 if it does not.

BLOSUM vs. PAM
           More Divergent                 Less Divergent
BLOSUM     BLOSUM 45      BLOSUM 62      BLOSUM 90
PAM        PAM 250        PAM 160        PAM 100
• BLOSUM 62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

What do the Score and the E-value really mean?
• The quality of the alignment is represented by the Score.
  – Score (S)
    • The score of an alignment is calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table (PAM, BLOSUM) whereas gap scores are assigned empirically.
• The significance of each alignment is computed as an E value.
  – E value (E)
    • Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.

Is the E-value the same as a P-value?
• The E-value is not a probability; it’s an expect value
  – The BLAST programs report E-values rather than P-values because it is easier to understand the difference between, for example, E-values of 5 and 10 than P-values of 0.993 and 0.99995.
  – However, when E < 0.01, P-values and E-values are nearly identical.

I’m confused! What does the E-value mean again?
• E value (E)
  – Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
• When E < 0.01, P-values and E-values are nearly identical.
  – So, the E-value is the number of times you expect to see a hit with a score as good as or better than yours occur in the database due to random chance alone.

Notes on E-values
• Low E-values suggest that sequences are homologous
  – Can’t show non-homology
• Statistical significance depends on both the size of the alignments and the size of the sequence database
  – Important consideration for comparing results across different searches
  – E-value increases as database gets bigger
  – E-value decreases as alignments get longer

Homology: Some Rules to Consider
• Similarity can be indicative of homology
• Generally, if two sequences are significantly similar over their entire length they are likely homologous
• 50% similarity over a short sequence often occurs by chance
• Low complexity regions can be highly similar without being homologous
• Homologous sequences are not always highly similar

[Screenshot: http://www.ncbi.nlm.nih.gov/BLAST/ with labels: BLAST FAQ, Program selection guide]

“What BLAST program should I use?” – check the NCBI’s BLAST Program selection guide

[Screenshot: http://www.ncbi.nlm.nih.gov/BLAST/ with label: blastp]

Input your query (gi|231571) as FASTA, raw sequence, or Accession/ID and choose your database
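
Returning to the E-value and P-value slides above, the arithmetic can be checked with a short sketch. It assumes the standard Karlin-Altschul relationships, which the slides do not spell out: P = 1 - exp(-E) and E = K * m * n * exp(-lambda * S). The function names are hypothetical, and the K and lambda defaults are illustrative placeholders rather than values read from a real BLAST run.

```python
import math


def p_from_e(e_value: float) -> float:
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


def e_value(score: float, query_len: int, db_len: int,
            k: float = 0.041, lam: float = 0.267) -> float:
    """Expectation E = K * m * n * exp(-lambda * S).

    k and lam are illustrative assumptions; real values depend on the
    scoring matrix and gap penalties and are reported by BLAST itself.
    """
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    for e in (10, 5, 0.01, 1e-6):
        print(f"E = {e:<8g} P = {p_from_e(e):.6g}")

    # Same raw score and query, database twice as large: E doubles.
    small = e_value(score=100, query_len=300, db_len=1_000_000)
    large = e_value(score=100, query_len=300, db_len=2_000_000)
    print(f"E (1,000,000 residues) = {small:.3g}")
    print(f"E (2,000,000 residues) = {large:.3g}")
```

Under these assumptions, E = 5 and E = 10 give P of about 0.993 and 0.99995, the figures quoted on the slide, and P is almost equal to E once E < 0.01. The second part illustrates why database size belongs in the lab book: the same score yields a larger E-value against a larger database.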

Links to more information can be found on the BLAST page

BLAST parameters and options to consider:
• conserved domains
• Entrez query
• E-value cutoff
• Word size

More BLAST parameters and options to consider:
• filtering
• matrix
• gap penalties

Run your BLAST search:
[Screenshot label: BLAST]

The BLAST Queue:
• click for more info
• Note your RID

Formatting and Retrieving your BLAST results:
[Screenshot label: Results options]

A graphical view of your BLAST results:

The BLAST “hit” list:
[Screenshot labels: Score, E-Value, GenBank alignment, EntrezGene]

The BLAST pairwise alignments
[Screenshot labels: Identity, Similarity]

Sorting BLAST results by Taxonomy
[Screenshot label: Taxonomy Report]

Tax BLAST Report
[Screenshot labels: Summary hits by lineage, BLAST hits by organism]

BLAST statistics to record in your bioinformatics labbook
• Record the statistics that are found at the bottom of your BLAST results page

Homology: Some Guidelines
• Similarity can be indicative of homology
• Generally, if two sequences are significantly similar over their entire length they are likely homologous
• Low complexity regions can be highly similar without being homologous
• Homologous sequences are not always highly similar
• Suggested BLAST Cutoffs
  – (source: Chapter 11 – Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins)
  – For nucleotide based searches, one should look for hits with E-values of 10^-6 or less and sequence identity of 70% or more
  – For protein based searches, one should look for hits with E-values of 10^-3 or less and sequence identity of 25% or more

Advanced BLAST programs
• The NCBI BLAST pages have several advanced BLAST methods available
  – PSI-BLAST
  – PHI-BLAST
  – RPS-BLAST
• All are powerful methods based on protein similarities

PSI-BLAST
• Position Specific Iterated – BLAST
• A cycling/iterative method
  – Gives increased sensitivity for detecting distantly related proteins
  – Can give insight into functional relationships
  – Very refined statistical methods
• Fast – still based on BLAST methods
• Simple to use

How does PSI-BLAST work?
1. First, a standard blastp is performed
2. The highest scoring hits are used to generate a multiple alignment
3. A Position Specific Scoring Matrix (PSSM) is generated from the multiple alignment.
  – Highly conserved residues get high scores
  – Less conserved residues get lower scores
  – The PSSM describes the sequence similarity between your query and all significant blastp hits
4. Another similarity search is performed, this time using the new PSSM instead of the standard BLOSUM or PAM matrices
  – This PSSM (scoring matrix) is now customized to find sequences that are related to your original query
5. Steps 2-4 can be repeated until convergence
  – Convergence occurs when no new sequences appear after iteration

[Screenshot: http://www.ncbi.nlm.nih.gov/BLAST/ with label: PSI-BLAST]

Format results for PSI-BLAST with inclusion E-value set at 0.005
[Screenshot labels: PSI-BLAST, BLAST]

Contributors
• Special thanks to David Wishart, Andy Baxevanis, Stephanie Minnema, Sohrab Shah, and Francis Ouellette for contributions to these materials
• You are now ready to complete the BLAST assignment, which is due Friday February 16th, 2006 at 9AM.
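
Step 3 of the PSI-BLAST procedure above, building a PSSM from a multiple alignment, can be sketched in a few lines. This is a simplified illustration and not PSI-BLAST's actual method: it assumes a toy ungapped alignment made up for the example, a uniform background frequency of 1/20 per amino acid, and a single fixed pseudocount, whereas the real program weights sequences and uses more refined statistics. All function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    n_seqs = len(alignment)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            # Pseudocounts keep unseen residues from scoring minus infinity.
            freq = (counts[aa] + pseudocount) / (n_seqs + pseudocount * len(AMINO_ACIDS))
            scores[aa] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, sequence):
    """Sum the position-specific scores of an ungapped sequence."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))


if __name__ == "__main__":
    # Hypothetical alignment of significant hits: column 1 is fully
    # conserved (G), column 3 is variable.
    hits = ["GKSTL", "GKTTL", "GRSSL", "GKATI"]
    pssm = build_pssm(hits)
    print(f"G at conserved column 1: {pssm[0]['G']:+.2f}")
    print(f"S at variable column 3:  {pssm[2]['S']:+.2f}")
    print(f"W at column 1 (unseen):  {pssm[0]['W']:+.2f}")
    print(f"Score of GKSTL: {score_sequence(pssm, 'GKSTL'):.2f}")
    print(f"Score of WWWWW: {score_sequence(pssm, 'WWWWW'):.2f}")
```

In this sketch the fully conserved G scores higher than the partly conserved S, and a residue never seen in a column scores below zero, which mirrors the "highly conserved residues get high scores" bullet. In step 4 a matrix of this kind replaces BLOSUM or PAM, so each query position has its own substitution scores.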