BLAST
Basic Local Alignment Search Tool
BLAST: the fishing rod
BLAST (Basic Local Alignment Search Tool)
allows rapid comparison of a query sequence
(nucleotides or amino acids), the bait on the rod,
against a database, the great sea.
For successful fishing, choose the rod, the bait and
the body of water to suit the biological question.
Comparing the query sequence to known sequences in
databases is fundamental to understanding the
relatedness of any query sequence to other known
proteins or DNA sequences.
Applications include:
• Identifying shared similarities with sequences already
deposited in the databanks (orthologs and paralogs?)
• Discovering new genes or proteins (ascertaining
existence of a putative ORF)
• Discovering variants of genes or proteins
• Identifying functional motifs shared with other proteins
• Investigating expressed sequence tags (ESTs)
• Exploring protein structure and function
Why use local alignment for
database searches?
Local alignment is a useful approach to
DB searching because many query
sequences have domains, active sites or
other motifs that have local but not
global regions of similarity to other
sequences.
BLAST
(1) For the query sequence (of length L), find the list of high-scoring words of length w:
for each word from the query sequence, find the list of words that score at
least T when scored using a pair-score matrix (e.g. PAM 250, BLOSUM).
BLAST (cont.)
(2) Compare the word list to the database sequences and identify exact matches
of words from the word list.
(3) For each word match, extend the alignment in both directions
to find alignments that score greater than a threshold value S:
maximal segment pairs (MSPs).
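The three steps can be sketched in a few lines of Python. This is a toy illustration, not the real BLAST implementation. The +2/-1 nucleotide scoring, the `drop` cutoff used to stop an extension, and all function names are assumptions made for the example. A protein search would score words with a PAM or BLOSUM matrix instead.

```python
from itertools import product

def pair_score(a, b):
    # Toy pair-score "matrix" (assumption): +2 for a match, -1 for a mismatch.
    return 2 if a == b else -1

def word_score(u, v):
    return sum(pair_score(a, b) for a, b in zip(u, v))

def neighborhood(query, w, T, alphabet="ACGT"):
    """Step 1: map every word scoring at least T against a query word
    to the query positions it was derived from."""
    words = {}
    for i in range(len(query) - w + 1):
        qword = query[i:i + w]
        for letters in product(alphabet, repeat=w):
            cand = "".join(letters)
            if word_score(qword, cand) >= T:
                words.setdefault(cand, []).append(i)
    return words

def find_seeds(words, subject, w):
    """Step 2: exact matches between the word list and a database sequence."""
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in words.get(subject[j:j + w], [])]

def extend(query, subject, i, j, w, drop=3):
    """Step 3: ungapped extension of a seed in both directions.
    Stops once the running score falls `drop` below the best score seen."""
    best = word_score(query[i:i + w], subject[j:j + w])
    # Extend to the right of the seed.
    score, right, k = best, 0, 0
    while i + w + k < len(query) and j + w + k < len(subject):
        score += pair_score(query[i + w + k], subject[j + w + k])
        k += 1
        if score > best:
            best, right = score, k
        elif best - score >= drop:
            break
    # Extend to the left, starting again from the best score so far.
    score, left, k = best, 0, 0
    while i - k - 1 >= 0 and j - k - 1 >= 0:
        score += pair_score(query[i - k - 1], subject[j - k - 1])
        k += 1
        if score > best:
            best, left = score, k
        elif best - score >= drop:
            break
    return best, i - left, i + w + right, j - left

def blast_like(query, subject, w=4, T=8, S=10):
    """Return (score, query_start, query_end, subject_start) for every
    extended segment pair scoring greater than S, best first."""
    words = neighborhood(query, w, T)
    msps = set()
    for i, j in find_seeds(words, subject, w):
        hit = extend(query, subject, i, j, w)
        if hit[0] > S:
            msps.add(hit)
    return sorted(msps, reverse=True)

print(blast_like("ACGTACGT", "TTACGTACGTTT"))
```

Lowering T enlarges the word list (more seeds, more sensitivity, more work), while raising it makes the search faster but more likely to miss weak matches.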
BLAST is a heuristic algorithm
The full query sequence is not compared against
the full length of every sequence in the database
(the search space). Instead, a partial search based
on approximation is performed.
Speed vs. sensitivity
Does not find ALL best matches!!!
False negatives.
How do we evaluate the results obtained?
Raw score "S" of the alignment is usually
calculated by summing the scores for
matches, mismatches and gaps in the
alignment.
Normalized score (bits) - bit scores from
different alignments, even those employing
different scoring matrices, can be compared.
The higher the score the better the alignment,
but the significance of an alignment cannot
be deduced from the score alone.
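As a rough illustration of the two scores, the sketch below sums match, mismatch and gap scores over a gapped alignment and then normalizes the result with the standard Karlin-Altschul relation S' = (λS − ln K) / ln 2. The scoring values and the function names are assumptions chosen for the example. In a real search, λ and K depend on the scoring matrix and gap penalties in use.

```python
import math

def raw_score(aligned_query, aligned_subject,
              match=1, mismatch=-2, gap_open=-5, gap_extend=-2):
    """Sum match, mismatch and gap scores over two aligned strings
    of equal length, where '-' marks a gap position.
    Simplification (assumption): the first position of a gap costs
    gap_open and every further consecutive gap position costs gap_extend."""
    score = 0
    in_gap = False
    for a, b in zip(aligned_query, aligned_subject):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score

def bit_score(S, lam, K):
    """Normalized score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * S - math.log(K)) / math.log(2)

print(raw_score("ACGT-A", "ACCTTA"))
```

Because λ and K absorb the scale of the scoring system, bit scores from searches that used different matrices end up on a common scale.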
E-value (Expectation value)
• An Expect value of 10 for a match means that, in a
database of the current size, one might expect to see
10 matches with a similar or better score simply
by chance.
• The E-value is the most commonly used threshold in
database searches. Only hits with E-values
smaller than the set threshold are reported in
the output.
• Increasing the E-value threshold lets you see
sequences that are statistically insignificant
but may still be biologically related.
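The dependence on database size can be made concrete with the standard relation E = m · n · 2^(−S'), where S' is the bit score, m the query length and n the total database length. The sketch below is a simplified illustration: it ignores the effective-length corrections that real BLAST applies, and the hit names and function names are made up for the example.

```python
def e_value(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit score:
    E = m * n * 2**(-bits)."""
    return query_len * db_len * 2.0 ** (-bits)

def report(hits, threshold, query_len, db_len):
    """Keep only hits whose E-value is smaller than the threshold,
    sorted from most to least significant.
    `hits` is a list of (name, bit_score) pairs."""
    scored = [(name, e_value(bits, query_len, db_len)) for name, bits in hits]
    kept = [(name, e) for name, e in scored if e < threshold]
    return sorted(kept, key=lambda pair: pair[1])

# Hypothetical hits: names and bit scores are illustrative only.
hits = [("hit_a", 40), ("hit_b", 30), ("hit_c", 20)]
for name, e in report(hits, threshold=10, query_len=1024, db_len=2 ** 20):
    print(name, e)
```

The same bit score gives twice the E-value when the database doubles in size, which is why an E-value is only meaningful for the database it was computed against.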
To evaluate the alignment
• Examine statistical parameters:
  - Normalized score
  - E-value
  - % identity
  - % similarity
  - % gaps
• Examine the alignment itself.
• Use biological common sense.
Don’t rely only on statistical significance!!!
What can we do if there are too many matches?
You can't see the forest for the trees:
too many repeats of the same highly significant
sequences hide sequences with lower similarity
that may also be interesting.
The different options:
• Limit DB
• Limit organism
• Filter reported entries by keyword
• (Limit to a specific domain)
• Change matrix and/or gap penalties
• Change (lower) the E-value threshold
• Add filter for low complexity
What can we do if there are hardly any matches?
• Check choice of DB
• Check choice of organism
• Remove filter for low complexity
• Change matrix or gap penalties
• Increase the E-value threshold
DNA vs. Protein searches
If we have a nucleotide sequence, should we search the
DNA databases only? Or should we translate it to protein
and search protein databases?
Query: DNA or protein
Database: DNA or protein
Translating causes loss of information, but protein sequences
are more conserved than DNA sequences.
It is therefore advisable to translate a nucleotide sequence
to protein and search protein databases for homology.
Why use a nucleotide sequence after all?
• No ORF found
• No similar protein sequences were found
• Specific DNA databases are available (EST)
• To find duplicated genes in a genome
• To find pseudogenes
• To find the location of non-protein coding genes
in the genome (siRNA etc.)
Blast flavors
Query: DNA or protein
DB: DNA or protein
BlastN - nt versus nt database
BlastP - protein versus protein database
BlastX - translated nt (6 frames) versus protein database
tBlastN - protein versus translated nt database (6 frames)
tBlastX - translated nt versus translated nt database (both 6 frames)
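The "6 frames" in the translated searches are the three reading frames of the sequence plus the three reading frames of its reverse complement. A minimal sketch of six-frame translation, assuming the standard genetic code and an uppercase or lowercase A/C/G/T input, is shown below. The function names are illustrative, and codons containing other characters are translated as "X".

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Return the six translations keyed by frame: +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames

for frame, protein in six_frames("ATGGCC").items():
    print(frame, protein)
```

Each translated search compares all six conceptual translations, which is why the translated programs are slower than BlastN or BlastP on the same data.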
Uses of BLAST programs
BLASTx – compares a nucleotide query seq
translated in all reading frames against a
prot seq db.
Query: DNA (translated); DB: protein
If you have a DNA seq and you want to know
what protein (if any) it encodes, you can
perform a BLASTx search.
tBLASTn
tBLASTn – compares a protein query seq against
a nucleotide seq db which is translated in all
reading frames.
Query: protein; DB: DNA (translated)
You can use this program to ask whether a DNA or
EST db contains a nuc seq encoding a protein
that matches your protein of interest.
tBLASTx
tBLASTx – translates the DNA query and
compares it to a db of DNA seqs, both translated in
all reading frames.
Query: DNA (translated); DB: DNA (translated)
(nr db cannot be used, because it’s too large)
Used to determine whether an entire DNA db
contains genes that encode proteins similar to
your query. (If blastx or tblastn fail)
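The choice among the five programs follows directly from the query type, the database type and whether translation is wanted. The helper below is only a lookup sketch of that decision. The function name and the `"dna"`/`"protein"` labels are hypothetical and are not part of any BLAST software.

```python
def choose_blast_program(query_type, db_type, translated=False):
    """Pick a BLAST flavor from the query and database types.
    `translated=True` requests a translated DNA-vs-DNA comparison."""
    kinds = ("dna", "protein")
    if query_type not in kinds or db_type not in kinds:
        raise ValueError("types must be 'dna' or 'protein'")
    if translated:
        if (query_type, db_type) != ("dna", "dna"):
            raise ValueError("translated=True only applies to DNA vs DNA")
        return "tblastx"
    table = {
        ("dna", "dna"): "blastn",
        ("protein", "protein"): "blastp",
        ("dna", "protein"): "blastx",
        ("protein", "dna"): "tblastn",
    }
    return table[(query_type, db_type)]

print(choose_blast_program("dna", "protein"))
```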
E-value