Download EMBL-EBI Powerpoint Presentation

Sequence Searching Strategies A guide to efficient database searching Jennifer McDowall EMBL-EBI EBI is an Outstation of the European Molecular Biology Laboratory. Overview • Know the data • The Toolbox • Search Guidelines Sequence Searching Tools Know the data Sequence Searching Tools Know the Data… • Many databases, each getting bigger • Efficient searching requires knowledge of what data is stored in a database  Don’t assume annotation can be transferred because of a good match • Databases can contain errors • Data can change  Deletions, sequence modifications  Daily updates, identifier changes… Sequence Searching Tools Know the Data…Nucleotides EMBL-Bank • Divided into classes and divisions... • Release and updates • Supplementary sets: EMBL-CDS, EMBL-MGA Specialist databases • • • • Immunoglobulins: IMGT/HLA, IMGT/LIGM… Alternative splicing: ASTD… Completed genomes: Ensembl, Integr8… Variation: HGVBase, dbSNP… Sequence Searching Tools Know the Data…Proteins UniProt • Divided into 3 sections • Release and updates Specialist databases • • • • • • Sequence from structure: PDB, SGT… Immunoglobulins: IMGT/HLA… Alternative splicing: ASTD… Completed proteomes: Ensembl, Integr8… Protein interactions: IntAct Patent proteins: EPO, USTPO, JPO, KIPO Sequence Searching Tools Homology • Homologous sequences share a common origin vs. Similarity • Similarity is a measure of the “likeness” of 2 sequences • Presence of similar features • Uses statistics to determine because of common decent ‘significance’ of similarity • Statistically significant similar sequences are considered ‘homologous’  If significant, considered to be homologous  If not significant  uncertain • Similarity does not • Homology is like pregnancy: necessarily reflect homology either one is or one isn’t! (Gribskov – 1999) Sequence Searching Tools The Toolbox Sequence Searching Tools Sequence Similarity Search Tools Sequence Searching Tools Sequence Similarity Search Tools BLAST FASTA Iterative searches Sequence Searching Tools Sequence Similarity Search Tools BLAST  NCBI-BLAST  Wu-BLAST FASTA  FASTA  SSEARCH  GGSEARCH  GLSEARCH Iterative search Sequence Searching Tools  PSI-BLAST  PSI-SEARCH Tools: NCBI BLAST • BLASTP: protein • BLASTX: DNA • BLASTN: DNA Sequence Searching Tools Protein DB translate Protein DB DNA DB Tools: NCBI BLAST Nucleotide search Sequence Searching Tools Protein search Tools: Wu-BLAST • BLASTP: protein • BLASTN: DNA • BLASTX: DNA translate • TBLASTN: protein • TBLASTX: DNA translate Sequence Searching Tools Protein DB DNA DB Protein DB Translated DNA DB Translated DNA DB Tools: Wu-BLAST Nucleotide search Sequence Searching Tools Protein search Tools: FASTA • FASTA: • FASTX/Y: protein Protein DB or DNA DNA translate • SSEARCH: protein Protein DB Protein DB or DNA • GLSEARCH: protein Protein DB • GGSEARCH: protein Protein DB Sequence Searching Tools DNA DB DNA DB Tools: FASTA Nucleotide search Sequence Searching Tools Protein search Query length When to use which search? NCBI BLAST WU-BLAST FASTA PSI-SEARCH Database size Sequence Searching Tools Speed of search When to use which search? NCBI BLAST WU-BLAST FASTA PSI-SEARCH PDB Swiss-Prot UniRef50 UniRef 90 UniRef100 UniProtKB UniParc Sequence Searching Tools BLAST v FASTA • Fast • Slower • Excels with proteins • Excels with proteins and DNA (better than BLASTN for DNA) • Good local alignments + short global alignments • Produces S-W alignments • Proteins: BLOSUM62(-11/-1) alignments good at >85% homology • Proteins: BLOSUM50(-10/-2) longer alignments good at >70% homology • Good at finding siblings • Good at finding cousins Sequence Searching Tools GLSEARCH and GGSEARCH GLSEARCH  Global (query) - Local (target DB) alignment  For global query alignments to domains/patterns in target proteins GGSEARCH  Global (query) – Global (target DB) alignment  Specific for searching short sequences against short targets or for gene-to-gene comparisons Sequence Searching Tools What are global and local alignments? Query BLAST, FASTA Local - Local |||||||| |||||||||||||| Subject Query GLSEARCH Global - Local ||||||||| ||||||||||||| Subject Query GGSEARCH Global - Global ||||||||| ||||||||||||| Subject Sequence Searching Tools Tools: PSI (Position Specific Iterated) Search Single Protein Sequence Search Database Estimate significance iterate Construct profile Sequence Searching Tools Generate Alignment Tools: PSI Search • PSI-BLAST • Part of NCBI-BLAST package • Automatic iteration service • (PSSM = position specific scoring) • Manually guided service • PSI-SEARCH • Combines: SSEARCH (S&W algorithm) • Manually guided service Sequence Searching Tools + PSI-BLAST (iterative strategy) Let’s look at a FASTA search Sequence Searching Tools FASTA search Step 1: Select a database Sequence Searching Tools Which database to choose? Database size is important • ENA-Annotation >124 million • UniParc (non-redundant) >24 million • Databases grow every day Sequence Searching Tools How database size affects results BLAST >122M sequence: gatctccatggg >15M >1.5M 0 hits 60 hits 489 hits (>1000) 789.0 621.0 e-values of 100% matches Sequence Searching Tools >700,000 3 hits 0.96 How database size affects results • Search smallest database likely to contain your sequence • Run multiple small searches (can run all ENA/UniParc as well) Sequence Searching Tools Protein or nucleotide database search? Two issues are worth considering… Sequence Searching Tools Protein or nucleotide database search? Codon degeneracy Amino acids Ser Ser Nucleotides UCU AGC Sequence Searching Tools match mismatch Protein or nucleotide database search? Over-simple match/mismatch scoring Amino acids Nucleotides Sequence Searching Tools highly conserved weakly conserved not conserved Ser Ser Ser Ser identical Asn similar Leu mismatch UCU UCU UCU AGC mismatch AAC mismatch CUC mismatch no distinction Protein or nucleotide database search? Human CKS1B kinase Protein Nucleotide Sequence Searching Tools v Zebra finch CDC28 kinase 1B today extinction of dinosaurs Billions of years ago Cambrian explosion 1 multicellular life 2 complex cells 3 photosynthesis self-replicating cells Protein comparisons chemical evolution 4 identify homologues formation of Earth 5-10x further back Sequencein Searching Tools evolution prokaryotes archaea cyanobacteria eukarytoes plants insects land plants fish arthropods reptiles amphibians flowers birds mammals Protein or nucleotide search? genus Homo Identify homologs searching: Protein or nucleotide database search? …therefore, searching a protein database could pull out many more homologues than searching a nucleotide database …if you start with a nucleotide sequence, try BLASTX or FASTX to translate your query sequence and search a protein database Sequence Searching Tools FASTA search Step 1: Select a database Step 2: Paste sequence Sequence Searching Tools FASTA search Step 1: Select a database Step 2: Paste sequence Step 3: Choose parameters Sequence Searching Tools Choosing parameters Sequence Searching Tools Choosing parameters User manual provides help Sequence Searching Tools Which parameters to choose? Matrix Nucleotide search ‘simpler’ - only match/mismatch Protein search uses substitution matrix tables (based on amino acid similarities and rate of change) Sequence Searching Tools Which parameters to choose? Choice of matrix depends on: 1. strictness of search 2. length of query sequence Sequence Searching Tools QUERY LENGTH >300 85-300 50-85 >300 85-300 35-85 <=35 <=10 MATRIX BLOSUM50 BLOSUM62 BLOSUM80 PAM250 PAM120 MDM40 MDM20 MDM10 open -10 -7 -16 -10 -16 -12 -22 -23 ext -2 -1 -4 -2 -4 -2 -4 -4 Matrices - controlling search sensitivity PAM (point accepted mutation) • Based on global alignments of related proteins • 1 substitution in 100 residues = PAM 1 • Other matrices extrapolated from PAM 1 • Model of evolutionary divergence • Bias against rare substitutions (e.g. Cys → Tyr) due to seed proteins Sequence Searching Tools Matrices - controlling search sensitivity BLOSUM (BLOCKS amino-acid substitution) • Based on protein domain alignments from the BLOCKS database • Observed substitutions in conserved domains • Based on percentage identity, so BLOSUM50 is deeper than BLOSUM80 Sequence Searching Tools Effect of applying PAM10 -> 500 matrices to the human LDL receptor sequence Sequence Searching Tools 10 100 200 300 400 500 Which parameters to choose? Matrix - protein Match/mismatch - nucleotide FASTA ...instead have... Sequence Searching Tools BLAST Match/mismatch scores • “Reward” for match, “penalty” for mismatch • Reward/penalty ratio:  Increase ratio to find more divergent sequences:  Ratio of 0.33 (1/-3) for 99% conserved  Ratio of 0.5 (1/-2) for 95% conserved  Ratio of 1 (1/-1) for 75% conserved Sequence Searching Tools Which parameters to choose? gap penalties Nucleotide search gap open = -2 to -16 Gap extension = 0 to -4 Protein search gap open = 0 to -23 Sequence Searching Tools Gap extension = 0 to -8 Which parameters to choose? Choice of gap penalties depends on: 1. strictness of search • larger penalty  fewer gaps 2. to match scoring matrix Sequence Searching Tools QUERY LENGTH >300 85-300 50-85 >300 85-300 35-85 <=35 <=10 MATRIX BLOSUM50 BLOSUM62 BLOSUM80 PAM250 PAM120 MDM40 MDM20 MDM10 open -10 -7 -16 -10 -16 -12 -22 -23 ext -2 -1 -4 -2 -4 -2 -4 -4 Which parameters to choose? KTUP (word length)  KTUP = ‘word-length’ of search  Large word-length  less sensitive  faster Nucleotide search - fewer bases than amino acids  higher KTUP Sequence Searching Tools Which parameters to choose? Do I mask my sequence? Low complexity regions should be masked to avoid spurious results • CA repeats • poly-A tails • proline-rich regions **Be careful you don’t mask what you are looking for Sequence Searching Tools Which parameters to choose? What do I use for short sequences?  use strict matrices  use high gap penalties  avoid masking  allow high e-values Sequence Searching Tools Fasta results Matches section Sequence Searching Tools Fasta results Do you use e-values or % identity? Sequence Searching Tools E-values or % identity? E-value Estimates statistical significance of matches  Default = 10  expect 10 matches found by chance  E() = <0.01  usually homologous  E() = 1-10  frequently related % identity % of positions identical between query and match sequence Sequence Searching Tools E-values or % identity? Sequence Searching Tools Similar Different % identity scores e-values E-values or % identity? Pattern of conservation indicates homology No evidence of homology Sequence Searching Tools E-values or % identity? Use e-values to estimate likelihood two sequences are homologous Sequence Searching Tools Fasta results Check length and alignments in relation to % identity Sequence Searching Tools Length of match 100% identity, but only over 124 / 663 (20%) of sequence Sequence Searching Tools Fasta results Protein and nucleotide search results have additional annotation Sequence Searching Tools Fasta results Related EMBL nucleotide entries Sequence Searching Tools Fasta results Related genomic information Sequence Searching Tools Fasta results Gene ontology (GO) mapping for protein Sequence Searching Tools Fasta results InterPro family/domain classification Sequence Searching Tools Fasta results Literature Sequence Searching Tools Fasta results Functional prediction on ALL proteins Sequence Searching Tools Function Predictions using InterPro Extract information Functional predictions: InterPro family/domain classifications Visual comparison  find mis- or partial matches Prioritize results Sequence Searching Tools Function Predictions using InterPro 100% ID • Matches: • family signature • 4 domain signatures 34% ID • Matches: • family signature • 3 domain signatures 28% ID • Matches: • 1 domain signature 24% ID Sequence Searching Tools • Matches: • No signatures Sequence Search Summary Navigate to search tools Select search tool Functional predictions Result summary + annotation Sequence Searching Tools (1) Select database (2) Copy/paste sequence (4) Submit (3) Set parameters Search guidelines Sequence Searching Tools Search Guidelines: #1 • BEST: • 2nd • 3rd AVTEGPIPEV LFNYDAQYT FGHKNSDKSS BEST: BEST: • WORST: Sequence Searching Tools protein ATGGCTAGC TTCGACTAG GCGATGCGA ATGGCTAGC TTCGACTAG GCGATGCGA AVTEGPIPEV LFNYDAQYT FGHKNSDKSS Protein DB DNA translate DNA protein Protein DB DNA DB Translated DNA DB  FASTA  BLASTP  FASTX  BLASTX  FASTA  BLASTN  TFASTX  TBLASTN Search Guidelines: #2 • Search smallest database likely to contain your sequence • Use sequence statistics (E-values) rather than % identity or % similarity, as your primary criterion for sequence homology Sequence Searching Tools Search Guidelines: #3 • Check statistics are likely to be accurate by looking for highest scoring unrelated sequence Examine the histograms Use programs such as prss3 to confirm the E-values Searching with shuffled sequences (use MLE/Shuffle in FASTA) which should have an E-value ~1.0 Sequence Searching Tools Search Guidelines: #4 • Consider searches with different gap penalties and other scoring matrices Use shallower matrices and/or more stringent gaps to uncover or force out relationships in partial sequences Adjust scoring matrix to suit length of query sequence Adjust gap penalties to match scoring matrix Sequence Searching Tools QUERY LENGTH >300 85-300 50-85 >300 85-300 35-85 <=35 <=10 MATRIX BLOSUM50 BLOSUM62 BLOSUM80 PAM250 PAM120 MDM40 MDM20 MDM10 open -10 -7 -16 -10 -16 -12 -22 -23 ext -2 -1 -4 -2 -4 -2 -4 -4 Search Guidelines: #5 • Homology can be reliably inferred from statistically significant similarity Homology = common 3D structure Homology - NOT common function • Orthologous sequences have similar functions • Paralogous sequences acquire very different functional roles Sequence Searching Tools Search Guidelines: #6 • Consult motif or fingerprint databases to find evidence for conservation critical for functional residues Motif identity in the absence of overall sequence similarity is not a reliable indicator of homology! • Try to produce multiple sequence alignments in order to validate the relatedness of your sequence data ClustalW, MUSCLE, T-Coffee, Kalign, MAFFT Mview, DBClustal (available form EBI FASTA & BLAST services) Sequence Searching Tools Search Guidelines: #7 • Low complexity regions (e.g. CA repeats, poly-A tails and Proline-rich regions) give spuriously high scores that reflect compositional bias rather than significant position-by-position alignment Use seg, xnu, dust, CENSOR, etc. BUT be careful about what you filter!!! Sequence Searching Tools Search Guidelines: #8 • What about short sequences? • Depends on their nature: • Protein:  Reduce word length and/or increase e-value  Use shallow matrices • DNA:  Reduce word length (but NOT to 1!)  Set Threshold for band optimisation (FASTA) to 0  Ignore gap penalties (force local alignments only) Sequence Searching Tools Accessing Tools at EBI Access Sequence Similarity Search services over various interfaces: 1) Using your browser 2) Over email 3) Using Web Services (SOAP/REST)  Perl, Python, C and Java clients available  Taverna & Triana workflows are fully supported (See: http://www.ebi.ac.uk/Tools/webservices/) (See: http://www.myexperiment.org/) Sequence Searching Tools Typical workflow search function review Check stats evolution compare Sequence Searching Tools Final remarks • Don’t assume a single tool will cater for all your search needs • DO change the parameters of the tools • Remember where the tool excels and what its limitations are • A tool intended for specific task A can also be used for task B (and may be better than the tool intended for task B specifically!) • Crazy input will always give crazy results! Sequence Searching Tools Contacts: http://www.ebi.ac.uk/support/ EBI is an Outstation of the European Molecular Biology Laboratory.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download EMBL-EBI Powerpoint Presentation