Sequence Similarity Searching:
Understanding and Using
Web Based BLAST
Dr. Joanne Fox
Lecture/Lab 3.1 BLAST
Concepts of Sequence Similarity
Searching
• The premise:
– One sequence by itself is not informative; it must
be analyzed by comparative methods against
existing sequence databases to develop
hypotheses concerning relatives and function.
Sequence Similarity Comparisons
• Alignments can be global or local (this is algorithm
specific)
– A global alignment is an optimal alignment that includes all
characters from each sequence (Clustal generates global
alignments)
– A local alignment is an optimal alignment that includes only
the most similar local region or regions (BLAST generates
local alignments).
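The difference between the two alignment modes can be sketched with a small dynamic-programming example. This is only an illustration: the scoring scheme (match +1, mismatch -1, gap -2) and the function name are assumptions, not values from the lecture, and only the optimal score is computed. BLAST itself uses a faster heuristic rather than full dynamic programming.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Optimal global (Needleman-Wunsch) or local (Smith-Waterman) score."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment must pay for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment may restart anywhere, so scores never go below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Two sequences that share only a central region (toy data).
query = "TTTTACGTACGTGGGG"
subject = "CCCCACGTACGTAAAA"
print("global:", alignment_score(query, subject, local=False))
print("local: ", alignment_score(query, subject, local=True))
```

The global score is dragged down by the unrelated flanks, while the local score reflects only the shared central region.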
Sequence Similarity Searches
• Goals
– Identify all homologs (true positives)
• infer function, transfer annotations, structure/domain
information
– Limit the misidentification of non-homologs (false
positives)
– Search large sets of sequences efficiently
[Diagram: QUERY sequence(s) and a BLAST database are the inputs to a BLAST program, which returns BLAST results]
Sequence Similarity Searching – The
statistics are important
• Discriminating between real and artifactual
matches is done using an estimate of
probability that the match might occur by
chance.
• We’ll talk more about the meaning of the
scores (S) and e-values (E) that are
associated with BLAST hits
The BLAST algorithm
• The BLAST programs (Basic Local Alignment Search
Tools) are a set of sequence comparison algorithms
introduced in 1990 that are used to search sequence
databases for optimal local alignments to a query.
– Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) “Basic local
alignment search tool.” J. Mol. Biol. 215:403-410.
– Altschul SF, Madden TL, Schaeffer AA, Zhang J, Zhang Z, Miller W, Lipman
DJ (1997) “Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs.” NAR 25:3389-3402.
Several different BLAST programs:
Program | Description
blastp | Compares an amino acid query sequence against a protein sequence database.
blastn | Compares a nucleotide query sequence against a nucleotide sequence database.
blastx | Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
tblastn | Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx | Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page because it is too computationally intensive.
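What "translated in all reading frames" means for blastx, tblastn and tblastx can be sketched as follows. This is an illustrative helper, not BLAST code: the function names are hypothetical, the standard genetic code is assumed, and unknown codons are shown as X.

```python
# Standard genetic code, written in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translations("ATGGCCTAA").items():
    print(name, protein)
```

A translated search compares each of these six conceptual proteins, which is why tblastx (six frames on both query and database) is so much more expensive than blastn.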
http://www.ncbi.nlm.nih.gov/BLAST/
blastp
blastn
blastx
tblastn
tblastx
Other BLAST programs
• BLAST 2 Sequences (bl2seq)
– Aligns two sequences of your choice
– Can do different types of comparison, e.g. blastx
– Gives dot-plot like output
• VecScreen
– Compares query with sequences of known cloning
vectors
• Both very handy for sequencing!
More BLAST programs
• BLAST against genomes
– Many available
– BLAST parameters pre-optimized
– Handy for mapping query to genome
• Search for short exact matches
– BLAST parameters pre-optimized
– Great for checking probes and primers
MegaBLAST
• megaBLAST
– For aligning sequences which differ slightly due to
sequencing errors etc.
– Very efficient for long query sequences
– Uses big word (k-tuple) sizes to start search
• Very fast
– Accepts batch submissions of ESTs
– Can upload files of sequences as queries
• More detailed info: see megaBLAST pages
Considerations for choosing a
BLAST database
• First consider your research question:
– Are you looking for a particular gene in a
particular species?
• BLAST against the genome of that species.
– Are you looking for additional members of a
protein family across all species?
• BLAST against the non-redundant database (nr); if you can’t find
hits, check wgs, htgs, and the trace archives.
– Are you looking to annotate genes in your species
of interest?
• BLAST against known genes (RefSeq) and/or ESTs from a
closely related species.
When choosing a database for
BLAST…
• It is important to know your reagents.
– Changing your choice of database is changing
your search space
– Database size affects the BLAST statistics
• Record BLAST parameters, database choice, and database size in
your bioinformatics lab book, just as you would for your wet-bench experiments.
– Databases change rapidly and are updated
frequently
• It may be necessary to repeat your analyses
BLAST protein databases available
through the blastp web interface at NCBI
blastp db
BLAST nucleotide databases available
through the blastn web interface at NCBI
blastn db
Creating Custom Databases for BLAST
UBiC FAQ
Important Terms for Sequence Similarity
Searching with very different meanings
• Similarity
– The extent to which nucleotide or protein sequences are
related. The extent of similarity between two sequences can
be based on percent sequence identity and/or conservation.
In BLAST similarity refers to a positive matrix score.
• Identity
– The extent to which two (nucleotide or amino acid)
sequences are invariant.
• Homology
– Similarity attributed to descent from a common ancestor.
• It is your responsibility as an informed bioinformatician to use
these terms correctly: A sequence is either homologous or not.
Don’t use % with this term!
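Identity and similarity (BLAST "positives") are both countable quantities, which is why they can carry a % and homology cannot. The sketch below counts them for one gapped pairwise alignment. It is a simplification: the set of positive-scoring substitutions is a small hand-picked excerpt of pairs that score above zero in BLOSUM62, not the full matrix, and the function name is hypothetical.

```python
# Excerpt only: a few substitutions with a positive BLOSUM62 score.
# A real implementation would look every pair up in the full matrix.
POSITIVE_SUBSTITUTIONS = {
    frozenset(pair) for pair in ("IV", "IL", "LM", "KR", "DE", "FY", "ST")
}


def identity_and_positives(aligned_a, aligned_b):
    """Return (fraction identical, fraction positive) over the alignment length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identities = positives = 0
    for x, y in zip(aligned_a, aligned_b):
        if "-" in (x, y):
            continue  # gap columns count toward the length only
        if x == y:
            identities += 1
            positives += 1  # identical residues always score positive
        elif frozenset((x, y)) in POSITIVE_SUBSTITUTIONS:
            positives += 1
    length = len(aligned_a)
    return identities / length, positives / length


identity, similarity = identity_and_positives("MKVLD-S", "MRILEAT")
print(f"identity {identity:.0%}, positives {similarity:.0%}")
```

Neither number says anything directly about common ancestry; homology is a conclusion drawn from such evidence.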
How Does BLAST Really
Work?
• The BLAST programs improved the overall speed of
searches while retaining good sensitivity (important
as databases continue to grow) by breaking the
query and database sequences into fragments
("words"), and initially seeking matches between
fragments.
• Word hits are then extended in either direction in an
attempt to generate an alignment with a score
exceeding the threshold of "S".
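The word-matching step can be sketched as below. This is a minimal illustration that looks for exact word matches only; protein BLAST also accepts similar "neighborhood" words that score above a threshold, which is not modelled here. Function names and the example sequences are hypothetical.

```python
from collections import defaultdict


def index_words(sequence, word_size):
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - word_size + 1):
        index[sequence[i:i + word_size]].append(i)
    return index


def find_word_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where the same word occurs in both."""
    query_index = index_words(query, word_size)
    hits = []
    for s_pos in range(len(subject) - word_size + 1):
        word = subject[s_pos:s_pos + word_size]
        for q_pos in query_index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits


print(find_word_hits("MKVLAAG", "QQKVLAT"))
```

Only the positions returned here are examined further, which is where the speed-up over exhaustive alignment comes from.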
BLAST Algorithm
Extending the High Scoring Segment Pair (HSP)
[Figure: alignment score during extension of an HSP; labelled features: significance decay, minimum score, neighborhood score threshold]
BLAST Algorithm
• Sequences are split into words (default word size 3 for protein searches)
– Speed, computational efficiency
• Scoring of matches done using scoring matrices
• HSP = high scoring segment pair
– BLAST algorithm extends the initial “seed” hit into an HSP
• Local optimal alignment
• More than one HSP can be found
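Extending a seed hit into an HSP can be sketched as an ungapped extension that stops once the running score has fallen too far below the best score seen. The drop-off value, the +1/-2 scoring and all names below are illustrative assumptions, not BLAST's actual defaults or code.

```python
def _extend(query, subject, q, s, step, x_drop, score):
    """Walk from (q, s) in direction step; return (best gain, residues added)."""
    running = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += score(query[q], subject[s])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score has decayed too far; give up on this direction
        q += step
        s += step
    return best, best_len


def extend_hit(query, subject, q_pos, s_pos, word_size,
               x_drop=5, match=1, mismatch=-2):
    """Extend a word hit in both directions without gaps."""
    def score(a, b):
        return match if a == b else mismatch

    seed = sum(score(query[q_pos + k], subject[s_pos + k]) for k in range(word_size))
    right_gain, right_len = _extend(query, subject, q_pos + word_size,
                                    s_pos + word_size, 1, x_drop, score)
    left_gain, left_len = _extend(query, subject, q_pos - 1, s_pos - 1,
                                  -1, x_drop, score)
    return {
        "score": seed + left_gain + right_gain,
        "query_range": (q_pos - left_len, q_pos + word_size + right_len),
        "subject_range": (s_pos - left_len, s_pos + word_size + right_len),
    }


query, subject = "GGACGTACGTCC", "TTACGTACGTAA"
hsp = extend_hit(query, subject, q_pos=4, s_pos=4, word_size=3)
start, stop = hsp["query_range"]
print(hsp["score"], query[start:stop])
```

Each seed can produce its own HSP, which is how one query/subject pair ends up with more than one HSP.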
Where does the score (S) come from?
• The quality of each pair-wise alignment is
represented as a score and the scores are
ranked.
• Scoring matrices are used to calculate the
score of the alignment base by base (DNA) or
amino acid by amino acid (protein).
• The alignment score will be the sum of the
scores for each position.
What’s a scoring matrix?
• Substitution matrices are
used for amino acid
alignments.
– each possible residue
substitution is given a score
• A simpler unitary matrix is
used for DNA pairs
– each position can be given a
score of +1 if it matches and a
score of -2 if it does not.
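As a worked example of the +1/-2 scheme, the sketch below sums per-position scores for an ungapped DNA alignment. The function name and the example sequences are made up for illustration.

```python
def ungapped_dna_score(seq_a, seq_b, match=1, mismatch=-2):
    """Sum the per-position scores of two equal-length DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))


# Seven matches and one mismatch: 7 * (+1) + 1 * (-2)
print(ungapped_dna_score("ACGTACGT", "ACGTTCGT"))
```

A protein alignment is scored the same way, except that each pair of residues is looked up in a substitution matrix instead of using two fixed values.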
BLOSUM vs. PAM
More Divergent <----------------------> Less Divergent
BLOSUM 45        BLOSUM 62        BLOSUM 90
PAM 250          PAM 160          PAM 100
• BLOSUM 62 is the default matrix in BLAST 2.0.
Though it is tailored for comparisons of moderately
distant proteins, it performs well in detecting closer
relationships. A search for distant relatives may be
more sensitive with a different matrix.
What do the Score and the e-value
really mean?
• The quality of the alignment is represented by the
Score.
– Score (S)
• The score of an alignment is calculated as the sum of substitution and
gap scores. Substitution scores are given by a look-up table (PAM,
BLOSUM) whereas gap scores are assigned empirically.
• The significance of each alignment is computed as
an E value.
– E value (E)
• Expectation value. The number of different alignments with scores
equivalent to or better than S that are expected to occur in a database
search by chance. The lower the E value, the more significant the
score.
Is the E-value the same as a P-value?
• The E-value is not a probability; it’s an expect
value
– The BLAST programs report E-values rather than
P-values because it is easier to understand the
difference between, for example, E-values of 5 and
10 than P-values of 0.993 and 0.99995.
– However, when E < 0.01, P-values and E-values
are nearly identical.
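The numbers on this slide follow from the relationship P = 1 - e^(-E), which holds if the count of chance hits is treated as Poisson-distributed (the standard assumption in BLAST statistics). A short check, with a hypothetical function name:

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, given E expected chance hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (10, 5, 0.01, 0.001):
    print(f"E = {e:<6} P = {p_value_from_e_value(e):.6f}")
```

For large E the P-value saturates near 1 and stops being informative, while for small E the two values converge.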
I’m confused! What does the E-value
mean again?
• E value (E)
– Expectation value. The number of different alignments with
scores equivalent to or better than S that are expected to
occur in a database search by chance. The lower the E
value, the more significant the score.
• When E < 0.01, P-values and E-values are nearly
identical.
– So, the E-value is the number of times you expect to see
your hit occur in the database (with as good as or better
score) due to random chance alone.
Notes on E-values
• Low E-values suggest that sequences are
homologous
– Can’t show non-homology
• Statistical significance depends on both the size of
the alignments and the size of the sequence
database
– Important consideration for comparing results across
different searches
– E-value increases as database gets bigger
– E-value decreases as alignments get longer
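The dependence on database size can be made concrete with the usual bit-score form of the E-value, E = m × n × 2^(-S'), where m is the query length, n the total database length and S' the bit score. The sketch below uses raw lengths for simplicity, whereas BLAST applies corrected "effective" lengths; the function name and the numbers are illustrative assumptions.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """E-value for a given bit score and search space (m * n)."""
    return query_length * database_length * 2.0 ** (-bit_score)


small_db = expected_chance_hits(bit_score=50, query_length=300, database_length=1e9)
large_db = expected_chance_hits(bit_score=50, query_length=300, database_length=2e9)
better_hit = expected_chance_hits(bit_score=60, query_length=300, database_length=1e9)

print(f"1 Gb database: E = {small_db:.2e}")
print(f"2 Gb database: E = {large_db:.2e}")   # same hit, doubled E
print(f"10 more bits:  E = {better_hit:.2e}")  # about 1000-fold lower E
```

The same alignment therefore gets a different E-value in a different database, which is why the database and its size belong in the lab book next to the result.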
Homology: Some Rules to Consider
• Similarity can be indicative of homology
• Generally, if two sequences are significantly similar
over entire length they are likely homologous
• 50% similarity over a short sequence often occurs by
chance
• Low complexity regions can be highly similar without
being homologous
• Homologous sequences are not always highly similar
http://www.ncbi.nlm.nih.gov/BLAST/
BLAST FAQ
Program selection guide
“What BLAST program should I use?” –
check the NCBI’s BLAST Program
selection guide
http://www.ncbi.nlm.nih.gov/BLAST/
blastp
Input your query (gi|231571) as
FASTA, raw sequence, or Accession/ID
and choose your database
query
database
Links to more information can be
found on the BLAST page
BLAST parameters and options to
consider:
conserved domains
Entrez query
E-value cutoff
Word size
More BLAST parameters and
options to consider:
filtering
matrix
gap penalties
Run your BLAST search:
BLAST
The BLAST Queue:
click for more info
Note your RID
Formatting and Retrieving your
BLAST results:
Results
options
A graphical view of your BLAST results:
The BLAST “hit” list:
Score
E-Value
GenBank
alignment
EntrezGene
The BLAST pairwise alignments
Identity
Similarity
Sorting BLAST results by Taxonomy
Taxonomy
Report
Tax BLAST Report
Summary hits
by lineage
BLAST hits
by organism
BLAST statistics to record in your
bioinformatics lab book
Record the statistics
that are found at
bottom of your
BLAST results page
Homology: Some Guidelines
• Similarity can be indicative of homology
• Generally, if two sequences are significantly similar
over entire length they are likely homologous
• Low complexity regions can be highly similar without
being homologous
• Homologous sequences are not always highly similar
• Suggested BLAST Cutoffs (source: Chapter 11 – Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins)
– For nucleotide based searches, one should look for hits with E-values of 10^-6 or less and sequence identity of 70% or more
– For protein based searches, one should look for hits with E-values of 10^-3 or less and sequence identity of 25% or more
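The suggested cutoffs (E ≤ 10^-6 with ≥ 70% identity for nucleotide searches, E ≤ 10^-3 with ≥ 25% identity for protein searches) can be applied as a simple filter. The Hit record, its field names and the example hits below are hypothetical; they do not correspond to any particular BLAST output format.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject_id: str
    e_value: float
    percent_identity: float


# (maximum E-value, minimum percent identity) per search type.
CUTOFFS = {"nucleotide": (1e-6, 70.0), "protein": (1e-3, 25.0)}


def passes_cutoffs(hit, search_type):
    """True if the hit meets both the E-value and the identity guideline."""
    max_e_value, min_identity = CUTOFFS[search_type]
    return hit.e_value <= max_e_value and hit.percent_identity >= min_identity


# Made-up protein hits for illustration.
hits = [
    Hit("hit_A", 2e-40, 61.0),
    Hit("hit_B", 5e-4, 22.0),   # E-value is fine, identity is too low
    Hit("hit_C", 0.5, 35.0),    # identity is fine, E-value is too high
]
print([h.subject_id for h in hits if passes_cutoffs(h, "protein")])
```

These are guidelines for a first pass, not proof of homology or of its absence; borderline hits still deserve a look at the alignment itself.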
Advanced BLAST programs
• The NCBI BLAST pages have several
advanced BLAST methods available
– PSI-BLAST
– PHI-BLAST
– RPS-BLAST
• All are powerful methods based on protein
similarities
PSI-BLAST
• Position Specific Iterated – BLAST
• A cycling/iterative method
– Gives increased sensitivity for detecting distantly
related proteins
– Can give insight into functional relationships
– Very refined statistical methods
• Fast – still based on BLAST methods
• Simple to use
How does PSI-BLAST work?
1. First, a standard blastp is performed
2. The highest scoring hits are used to generate a multiple
alignment
3. A Position Specific Scoring Matrix (PSSM) is generated from
the multiple alignment.
– Highly conserved residues get high scores
– Less conserved residues get lower scores
– The PSSM describes the sequence similarity between your query
and all significant blastp hits
4. Another similarity search is performed, this time using the new
PSSM instead of the standard BLOSUM or PAM matrices
– This PSSM (scoring matrix) is now customized to find sequences
that are related to your original query
5. Steps 2-4 can be repeated until convergence
– Convergence occurs when no new sequences appear after
an iteration
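Step 3 can be sketched as a log-odds calculation per alignment column. This is a deliberately simplified illustration: it uses a toy four-letter alphabet with a uniform background and a fixed pseudocount, whereas PSI-BLAST works on the 20 amino acids with sequence weighting and more refined pseudocounts. All names and data are hypothetical.

```python
import math
from collections import Counter


def build_pssm(aligned_sequences, background, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    alphabet = sorted(background)
    pssm = []
    for col in range(len(aligned_sequences[0])):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(
            seq[col] for seq in aligned_sequences if seq[col] in background
        )
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            residue: math.log2(
                ((counts[residue] + pseudocount) / total) / background[residue]
            )
            for residue in alphabet
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Score an ungapped segment, one PSSM column per position."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
alignment = ["ACGT", "ACGA", "ACCT"]
pssm = build_pssm(alignment, background)

print(round(pssm[0]["A"], 2), round(pssm[3]["T"], 2))
print(score_with_pssm(pssm, "ACGT") > score_with_pssm(pssm, "TGCA"))
```

The fully conserved first column scores its residue higher than the partly conserved last column does, so the same residue is worth a different amount at different positions, which a single BLOSUM or PAM matrix cannot express.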
http://www.ncbi.nlm.nih.gov/BLAST/
PSI-BLAST
Format results for PSI-BLAST with
inclusion E-value set at 0.005
PSI-BLAST
BLAST
Contributors
• Special thanks to David Wishart, Andy
Baxevanis, Stephanie Minnema, Sohrab
Shah, and Francis Ouellette for contributions
to these materials
• You are now ready to complete the BLAST
assignment, which is due Friday February
16th, 2006 at 9AM.