BLAST
• 100 times faster than dynamic programming.
• Good for database searches.
• Derive a list of words of length w from the query (e.g., 3 for protein, 11 for DNA).
• High-scoring words are compared with database sequences.
• Sequences with many matches to high-scoring words are used for final alignments.
Protein-based searches are always more powerful than nucleotide-based searches of coding DNA in determining similarity and inferring homology.
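The word-list step above can be sketched in a few lines. This is only an illustration of "derive a list of words of length w from the query"; the function name `query_words` and the example peptide are made up for this sketch and are not part of BLAST itself.

```python
def query_words(query: str, w: int) -> list[str]:
    """Return every overlapping word of length w in the query, in order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


# Hypothetical 6-residue protein query, word length 3 as on the slide.
print(query_words("PQGEFG", 3))  # ['PQG', 'QGE', 'GEF', 'EFG']
```

A query of length L yields L - w + 1 words, so a query shorter than w yields none.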
BLAST
(Basic Local Alignment Search Tool)
P=7 + Q=5 + G=6 = 18 (BLOSUM62 score of the word PQG aligned to itself)
• In addition to the exact word, BLAST considers related words based on BLOSUM62: the neighborhood.
• Once a word is aligned, gapped and un-gapped extensions are initiated, tallying the cumulative score.
• When the score drops more than X below the best score so far, the extension is terminated.
• The extension is trimmed back to the maximum-scoring HSP (High-scoring Segment Pair).
• Produces local alignments.
X = significance decay
S = min. score to return a BLAST hit
T = neighborhood score threshold
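The extension rule described above (stop once the score has dropped more than X, then trim back to the best-scoring point) can be sketched as follows. This is a simplified, ungapped, right-hand-only extension under stated assumptions: the match/mismatch scores of +1/-3 and the function name `extend_right` are illustrative choices, and the actual BLAST implementation extends in both directions and also performs gapped extensions.

```python
def extend_right(query, subject, q_start, s_start, w, x_drop,
                 match=1, mismatch=-3):
    """Extend an exact word hit to the right without gaps.

    Stops when the running score falls more than x_drop below the best
    score seen, and reports the best score together with the length of
    the segment that achieved it (the trimmed extension).
    """
    score = w * match                  # the seed word is assumed to match exactly
    best_score, best_len = score, w
    i, j = q_start + w, s_start + w
    while i < len(query) and j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        if score > best_score:
            best_score, best_len = score, i - q_start + 1
        elif best_score - score > x_drop:
            break                      # X-drop termination
        i += 1
        j += 1
    return best_score, best_len


# Hypothetical DNA example: a 4-letter seed at position 0 of both sequences.
print(extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 0, 0, w=4, x_drop=5))  # (7, 7)
```

In the example the extension keeps three further matches, then abandons the run of mismatches and reports the 7-letter segment that scored best.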
BLAST home page
http://blast.ncbi.nlm.nih.gov/Blast.cgi
BLASTP
BLAST databases
Peptide Sequence Databases
• nr: non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
• RefSeq_protein: reference proteins
• Swissprot: SWISS-PROT protein sequence database
• pdb: Sequences derived from the 3-dimensional structure from
Nucleotide Sequence Databases
• nr: GenBank+EMBL+DDBJ+PDB (no EST, STS, GSS, WGS, or PAT).
• est: Expressed Sequence Tags. 34 billion seq.!
• htgs: Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2
• gss: Genome Survey Sequence.
• wgs: Whole Genome Shotgun Sequences. 148 billion sequences
BLAST Advanced options
• -G Cost to open a gap [Integer]; default = 11 (10 10 8 9)
• -E Cost to extend a gap [Integer]; default = 1 (1 2)
• -e Expectation value (E) [Real]; default = 10.0
• -W Word size; default is 11 for blastn, 3 for other programs.
• -b Number of alignments to show (B) [Integer]; default = 100
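As a small worked example of how the -G and -E options combine, the sketch below assumes the usual affine convention in which a gap of length k costs G + k·E, so that a one-residue gap costs G + E. The function name `gap_cost` is hypothetical.

```python
def gap_cost(length: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Affine gap cost: one opening charge plus one extension charge per residue."""
    if length < 1:
        raise ValueError("a gap must span at least one residue")
    return gap_open + gap_extend * length


# With the defaults listed above (G = 11, E = 1):
print(gap_cost(1))  # 12
print(gap_cost(3))  # 14
```

Under this convention one long gap is much cheaper than several short ones, which is why the opening cost is set well above the extension cost.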
Special Cases
                 Default    Short Query      Large Sequence Family   Ungapped BLAST
Filter           on         off              on                      on
Scoring Matrix   BLOSUM62   ≤ PAM30-35       BLOSUM62                BLOSUM62
Word Size        3          3-2, 7 for DNA   3, 11 for DNA           3, 11 for DNA
E value          10         1000 or more     10                      10
Gap costs        11, 1      9, 1             11, 1                   4
Alignments       50         50               2000                    50
Report by species
Database: All nr GenBank CDS translations+PDB+SwissProt+PIR+PRF
2,794,673 sequences; 957,836,323 total letters
Taxonomy reports
Query= Apetala1 P35631 (255 letters)
“+” indicates conservative amino acid substitution
“–” indicates gap/insertion
XXXX… shows areas of low complexity
CONSIDER TAXONOMIC RELATIONSHIP WHEN INTERPRETING SIMILARITY VALUES!
Format BLAST output
All sequences above the E value threshold are aligned beneath the query. In "with identities", identical residues are shown as dots.
Flat Query-Anchored
Query-Anchored with identities
Statistical significance
• Chance alignments have no biological significance
• Statistical significance implies low probability of generating a chance alignment
• Probability of long alignments increases with longer sequences
• The extreme-value distribution
– Used to calculate the probability of chance alignment
– Generated by calculating the scores resulting from repeatedly scrambling one of the sequences being compared
BLAST statistics
S' (Bit score): calculated from the raw score S (sum of BLOSUM62 scores) by normalizing with the statistical variables that define a scoring system (K and λ). Bit scores from different alignments, even those employing different scoring matrices, can be compared.
$S' = (\lambda S - \ln K) / \ln 2$
K = minor constant
λ = constant to adjust for scoring matrix
S = score of High-scoring segment pair (HSP)
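The bit-score formula can be checked numerically with a short sketch. The values λ = 0.267 and K = 0.041 used below are assumed here as typical parameters for gapped BLOSUM62 searches with gap costs 11, 1; they do not come from these slides, and the raw score of 100 is hypothetical.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Assumed parameters (not from the slides): lambda = 0.267, K = 0.041.
print(round(bit_score(100, lam=0.267, k=0.041), 1))
```

Because λ and K absorb the scale of the scoring matrix, two raw scores obtained with different matrices become comparable once both are converted this way.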
E (expect) value: number of alignments with scores equivalent to or better than S' that are expected to occur in a database search by chance.
$E = mN \cdot 2^{-S'}$
m = query size
S' = bit score
N = database size
m*N = search space
– The E-value decreases exponentially as the score (S) that is assigned to a match between two sequences increases.
– The E-value depends on the size of the database and the scoring system in use.
– When the E-value threshold is increased from the default value of 10, more hits can be reported. When it is reduced, only the more significant hits are reported.
– The lower the E-value (or the higher the bit score), the more significant the hit.
– The product mN defines the search space. The same HSP may come out statistically significant in a small database and not significant in a large database.
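The dependence of E on the search space can be made concrete with a short sketch of E = mN · 2^(-S'). The query length (255 letters) and the large database size (957,836,323 letters) are the figures quoted in the report slide; the bit score of 40 and the small database of one million letters are hypothetical values chosen for illustration.

```python
def expect_value(m: int, n: int, bits: float) -> float:
    """E = m * N * 2^(-S'), with m the query size and N the database size."""
    return m * n * 2.0 ** (-bits)


query_len = 255              # Apetala1 query length from the report slide
large_db = 957_836_323       # total letters in nr from the report slide
small_db = 1_000_000         # hypothetical small database

for name, size in (("large", large_db), ("small", small_db)):
    print(f"{name} database: E = {expect_value(query_len, size, 40):.2g}")
```

The same 40-bit HSP gets an E-value roughly a thousand times larger in the large database, simply because the search space mN is about a thousand times larger.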
P values
P: Probability of finding at least one HSP with bit score S' or higher by chance.
Since it can be shown that the number of random HSPs with score S' is described by a Poisson distribution, the probability of finding at least one HSP with bit score S' is
$P = 1 - e^{-E}$
E = expect value
E = 10 -> P = 0.99995
E = 1 -> P = 0.63
E = 0.1 -> P = 0.095
E = 0.01 -> P = 0.01
E = 0.001 -> P = 0.001
E = 0.0001 -> P = 0.0001
P-values vary from 0 to 1, whereas E-values can be much greater than 1. The BLAST programs report E-values, rather than P-values, because E-values of, for example, 5 and 10 are much easier to comprehend than P-values of 0.993 and 0.99995. However, for E < 0.01, P-value and E-value are nearly identical.
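The E-to-P conversion tabulated above can be reproduced with a few lines. This sketch uses `math.expm1` so that small E-values are not lost to rounding; the function name `p_value` is chosen here for illustration.

```python
import math


def p_value(e_value: float) -> float:
    """P = 1 - exp(-E): probability of at least one chance HSP (Poisson)."""
    return -math.expm1(-e_value)


for e in (10, 5, 1, 0.1, 0.01, 0.001, 0.0001):
    print(f"E = {e:g} -> P = {p_value(e):.5g}")
```

For small E the exponential is nearly linear, which is why P and E become practically interchangeable below about 0.01.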
BLAST Tips
• Suggested BLAST cutoffs:
– DNA: book suggests E values < 1e-6 (I use E < 1e-10)
– Protein: book suggests E values < 1e-3
• Consider evolutionary divergence in your results! DNA mutation rate without selection = 5.5 × 10^-9 per site per year. So 10 million years (10^7) of divergence = 5.5 × 10^-2 ≈ 0.05, i.e. ~95% identity.
• BLAST search artifacts: repeated amino acid stretches (e.g. polyglutamine) or nucleotide repeats (e.g. ATATATATATATAT) result in meaningless positives with significant E values.
• Use BLAST filters to mask low complexity regions: programs SEG for proteins and DUST for DNA
• Or customize masking using the lower case letter option
• RepeatMasker can be used to mask repeats in lower case letters
http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker
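The divergence arithmetic in the tips above can be written out as a short sketch. It mirrors the slide's simple rate × time estimate and deliberately ignores repeated substitutions at the same site, so it is only reasonable for small divergences; the function name `expected_identity` is hypothetical.

```python
def expected_identity(rate_per_site_per_year: float, years: float) -> float:
    """Fraction of sites still identical after neutral divergence (rate * time)."""
    diverged = rate_per_site_per_year * years
    return 1.0 - diverged


identity = expected_identity(5.5e-9, 1e7)   # values from the slide
print(f"{identity:.1%}")
```

A hit between two species that diverged about 10 million years ago is therefore expected to be around 95% identical even in unconstrained DNA, so high identity alone does not imply selection.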
MEGABLAST
• Variation of BLASTN, 10 times faster
• Optimized for long or highly similar (>95%) sequences
• Ideal for finding whether a large sequence is part of a large contig or chromosome, finding sequencing errors, and comparing large similar sequences
• Uses a longer default word length (word length = 28 instead of 11)
• Faster non-affine gap penalty: gap opening penalty = 0, gap extension penalty E = r/2 - q (r = match reward, q = mismatch penalty)
• Non-affine gapping tends to yield more gaps of shorter length.
• Accepts multiple consecutive FASTA files as input
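The non-affine gap penalty formula above is easy to evaluate. The sketch below assumes that q is supplied as a negative score (for example -3), so that r/2 - q is positive; the reward/penalty pairs shown are illustrative values, not settings taken from these slides.

```python
def megablast_gap_extension(reward: float, penalty: float) -> float:
    """E = r/2 - q, with q given as a negative mismatch score."""
    return reward / 2 - penalty


def linear_gap_cost(length: int, reward: float, penalty: float) -> float:
    """With no opening charge, a gap of length k simply costs k * E."""
    return length * megablast_gap_extension(reward, penalty)


print(megablast_gap_extension(1, -3))   # 3.5
print(megablast_gap_extension(1, -2))   # 2.5
print(linear_gap_cost(4, 1, -3))        # 14.0
```

Because nothing is charged for opening a gap, splitting one long gap into several short ones costs the same, which is consistent with the tendency toward more, shorter gaps.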
Discontiguous MEGABLAST
• Ideal for comparing divergent sequences from different organisms (<80% identity)
• Uses a discontiguous word approach, different from other BLAST programs
• Nonconsecutive positions are examined over longer segments
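The idea of examining nonconsecutive positions can be illustrated with a spaced template, where only the positions marked 1 contribute to the word. The template "110110" below is a made-up toy pattern, not one of the templates the real program uses, and the function name is hypothetical.

```python
def discontiguous_words(seq: str, template: str) -> list[str]:
    """Extract words using only the positions marked '1' in the template."""
    span = len(template)
    keep = [i for i, flag in enumerate(template) if flag == "1"]
    return ["".join(seq[start + i] for i in keep)
            for start in range(len(seq) - span + 1)]


# Toy template: 4 examined positions spread over a 6-letter window.
print(discontiguous_words("ACGTACGT", "110110"))  # ['ACTA', 'CGAC', 'GTCG']
```

Two sequences that differ only at the ignored positions still produce identical words, which is what makes this kind of seed tolerant of divergence.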
PSI-BLAST (Position Specific Iterative BLAST)
• Designed to detect weak relationships
PSI-BLAST steps
• BLASTP
• Multiple Alignment
• Construct PSSM
• Use PSSM to search
• The added sensitivity comes from the use of a profile that is constructed (automatically) from a multiple alignment.
• The profile is generated by calculating a Position-Specific Scoring Matrix (PSSM) for every position in the alignment. PSSMs are also called profiles and are closely related to profile Hidden Markov Models.
• PSSMs are numerical representations of a multiple alignment
Construction of a PSSM
• A highly conserved position receives a high score.
• The profile is used to perform additional searches (iteration) and the results of each iteration are used to refine the profile.
• Each iteration uses a PSSM built from the previous iteration.
• Continue searching iteratively until no new matches are identified: "convergence".
Each column in the alignment is a row in the PSSM
Frequency of occurrence of a residue at each position
Calculate the probability of each amino acid at each position
T at position 8 is conserved = highest score, 150
P at position 9 is less conserved = score 89
Note the low scores of the aromatic residues F, Y, W relative to A in the P row
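The PSSM construction described above (count residues per column, convert to probabilities, score each residue per position) can be sketched on a toy alignment. This is a simplified log-odds version with a uniform background and a single pseudocount, both assumptions made for this sketch; PSI-BLAST itself uses sequence weighting and a more elaborate pseudocount scheme, and its scores are on a different scale from the 150 and 89 shown in the figure.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet, pseudocount=1.0):
    """Return one row (dict residue -> log-odds score) per alignment column."""
    n_seqs = len(alignment)
    background = 1.0 / len(alphabet)          # assumed uniform background
    pssm = []
    for column in zip(*alignment):            # iterate over alignment columns
        counts = Counter(column)
        row = {}
        for residue in alphabet:
            freq = (counts[residue] + pseudocount * background) / (n_seqs + pseudocount)
            row[residue] = math.log2(freq / background)
        pssm.append(row)
    return pssm


# Toy ungapped alignment over a 4-letter alphabet (hypothetical data).
toy = ["TPA", "TPG", "TAG", "TPA"]
pssm = build_pssm(toy, alphabet="APGT")
print(round(pssm[0]["T"], 2), round(pssm[1]["P"], 2))
```

The fully conserved T in the first column scores higher than the mostly conserved P in the second, and residues never observed in a column receive negative scores.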
PHI-BLAST (Pattern Hit Initiated BLAST)
• PHI-BLAST searches for particular patterns in protein queries. It combines matching of regular expressions with local alignments surrounding the match.
• PHI-BLAST is preferable to just searching for pattern occurrences because it filters out cases where the pattern occurrence is probably random and not indicative of homology.
• PHI-BLAST expects as input a protein query sequence and a pattern contained in that sequence.
• PHI-BLAST limits alignments to those that match the provided pattern.
• Statistical significance is reported using E-values as for other forms of BLAST, but the statistical method for computing the E-values is different.
• PHI-BLAST is integrated with Position-Specific Iterated BLAST (PSI-BLAST), so that the results of a PHI-BLAST query can be used for PSI-BLAST.
Pattern:
[C]-x(2)-[C]-x(10,16)-[H]-x(2,3)-[H]
Syntax for pattern at
http://www.ncbi.nlm.nih.gov/blast/html/PHIsyntax.html
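To see what the pattern above matches, it can be translated into an ordinary regular expression. The sketch handles only the subset of the syntax used in this example (bracketed residue sets, single residues, and x with a fixed or ranged repeat); the function name `phi_to_regex` and the test sequence are made up, and the full syntax is described at the URL above.

```python
import re


def phi_to_regex(pattern: str) -> str:
    """Translate a simple PHI-BLAST style pattern into a Python regex."""
    element_re = re.compile(r"(\[[A-Z]+\]|[A-Z]|x)(?:\((\d+)(?:,(\d+))?\))?")
    parts = []
    for element in pattern.split("-"):
        m = element_re.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = m.groups()
        token = "." if residue == "x" else residue    # x = any residue
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)


regex = phi_to_regex("[C]-x(2)-[C]-x(10,16)-[H]-x(2,3)-[H]")
print(regex)  # [C].{2}[C].{10,16}[H].{2,3}[H]

# Hypothetical sequence containing one occurrence of the pattern.
seq = "GG" + "C" + "AA" + "C" + "A" * 12 + "H" + "AAA" + "H" + "GG"
print(re.search(regex, seq) is not None)  # True
```

A plain regex search like this only finds pattern occurrences; PHI-BLAST additionally requires a significant local alignment around each occurrence.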
Specialized BLAST
Great tool!
Multiple Sequence Alignment: COBALT
http://www.ncbi.nlm.nih.gov/BLAST/