Blast heuristics
Morten Nielsen
Department of Systems Biology,
DTU
Outline
• Basic Local Alignment Search Tool
– What are the Blast heuristics?
– How does Blast calculate E-values?
– What are the limits of Blast?
• Psi-Blast
– Why does it work so much better?
– We saw this last week: it uses sequence profiles
Why alignment is slow
• 99% of the CPU time is spent aligning non-similar sequences
• The execution time for gapped alignments is 500 times that for un-gapped
Blast heuristics
• Hits (High-scoring segment pairs, HSP)
– Triplets of amino acids that score at least T
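As a rough illustration of the seeding step, the sketch below lists every position pair where a query triplet and a database-sequence triplet score at least T. The scoring scheme (+5 match, -1 mismatch), the threshold value, the example sequences and the names `word_score` and `find_hits` are all assumptions made for this example. BLAST itself scores words with a substitution matrix such as BLOSUM62 and uses an indexed word lookup rather than this quadratic scan.

```python
def word_score(a, b, match=5, mismatch=-1):
    """Score two equal-length words position by position (toy scoring, not BLOSUM62)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def find_hits(query, subject, w=3, T=9):
    """Return (query_pos, subject_pos, score) for every word pair scoring at least T."""
    hits = []
    for i in range(len(query) - w + 1):
        q_word = query[i:i + w]
        for j in range(len(subject) - w + 1):
            score = word_score(q_word, subject[j:j + w])
            if score >= T:
                hits.append((i, j, score))
    return hits


# Hypothetical sequences chosen so that the hits fall on a single diagonal.
hits = find_hits("MKVLAT", "GGKVLAG")
for q_pos, s_pos, score in hits:
    print(q_pos, s_pos, score)
```

Hits that share a diagonal (the same value of subject_pos - query_pos) are the natural candidates for the extension step.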
Hit extension
What are P and E values?
• E-value
– Number of expected hits in database with score higher than match
– Depends on database size
• P-value
– Probability that a random hit will have score higher than match
– Database size independent
[Figure: distribution of hit scores. For a match with score 150 there are 10 hits with higher score (E = 10); with 10000 hits in the database, P = 10/10000 = 0.001]
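The counting argument behind the two definitions can be sketched in a few lines. The score list below is made up so that it reproduces the counts in the example (10 of 10000 hits scoring above 150), and the function name `e_and_p` is hypothetical. This is empirical counting only; BLAST estimates E analytically rather than by counting hits.

```python
def e_and_p(database_scores, match_score):
    """Count hits scoring higher than the match (E) and divide by database size (P)."""
    higher = sum(1 for s in database_scores if s > match_score)
    e_value = higher                         # depends on how many hits the database holds
    p_value = higher / len(database_scores)  # fraction of hits, independent of database size
    return e_value, p_value


# Made-up scores: 10 hits above the match score of 150, 9990 below it.
scores = [160] * 10 + [100] * 9990
e, p = e_and_p(scores, 150)
print(e, p)
```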
Blast E-values
• Normalized score S' (bit score) is
$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
• λ and K are calculable given s_ij and p_i; λ is the positive solution of
$$\sum_{i,j} p_i \, p_j \exp(\lambda s_{ij}) = 1$$
• The expected number E of sequences with a score of at least S' is
$$E = \frac{N}{2^{S'}}$$
Altschul et al., 1997. Nucleic Acids Research
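To make "calculable given s_ij and p_i" concrete, here is a minimal sketch that solves for λ by bisection, assuming the scoring system has a negative expected score and at least one positive score. The nucleotide scoring used to exercise it (+1 match, -3 mismatch, uniform frequencies) and the name `solve_lambda` are assumptions for the example, not taken from the slides. K is not computed here; it requires a more involved calculation.

```python
import math


def solve_lambda(scores, freqs, tol=1e-10):
    """Find the positive lambda with sum_ij p_i p_j exp(lambda * s_ij) = 1 by bisection."""
    letters = list(freqs)

    def f(lam):
        total = sum(freqs[a] * freqs[b] * math.exp(lam * scores[a][b])
                    for a in letters for b in letters)
        return total - 1.0

    # f(0) = 0 trivially. With a negative expected score, f is negative just
    # above zero and eventually positive, so bracket the positive root.
    lo, hi = 1e-6, 1.0
    while f(hi) < 0:
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Toy scoring system (an assumption for this example): +1 match, -3 mismatch.
bases = "ACGT"
toy_scores = {a: {b: (1 if a == b else -3) for b in bases} for a in bases}
toy_freqs = {a: 0.25 for a in bases}
lam = solve_lambda(toy_scores, toy_freqs)
print(round(lam, 3))
```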
Blast output
Edge Effects
• A high-scoring alignment must have some length, and therefore cannot begin near the end of either of the two sequences being compared. This "edge effect" may be corrected for by calculating an "effective length" for sequences; the BLAST programs implement such a correction.
Blast output
Blast E-values. Example
Score = 506 bits (1302), Expect = 7e-142,
Identities = 245/245 (100%), Positives = 245/245 (100%)
..
Lambda     K
0.267      0.0410
Matrix: BLOSUM62
$$S' = \frac{\lambda S - \ln K}{\ln 2} = \frac{1302 \cdot 0.267 - \ln(0.041)}{\ln 2} \approx 506$$
• The expected number E of sequences with a score of at least S' is
$$E = \frac{N}{2^{S'}}$$
$$\log(E) = \log(N) - S' \log(2) = \log(1468197536114) - 506 \log(2)$$
$$E \approx 7 \cdot 10^{-142}$$

Blast
• Only align subset of sequences
• Only do gap extension at few seed sites
• Only extend gaps close to diagonal
• Approximate (and conservative) E-value estimates
• Details on the Blast algorithm
– www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html