String-matching algorithms in BLAST and FASTA

Mathematics and computation behind
BLAST and FASTA
Xuhua Xia
http://dambe.bio.uottawa.ca
Why string matching?
• Early applications: Sequence similarity between a viral oncogene (a gene in
a virus that causes a cancer-like transformation of the infected cells), v-sis,
and the platelet-derived growth factor (PDGF)
– M. D. Waterfield et al. 1983. Nature 304:35-39
– R. F. Doolittle et al. 1983. Science 221:275-277
– Implications:
• Cancer can be caused by a constitutively expressed growth factor
• Alteration of gene expression can contribute to cancer
• Growth factors and the like can be drug targets against cancer
• Fast computational methods in string matching
– FASTA
– BLAST
– Local pair-wise alignment by dynamic programming
FASTA
• A commonly used family of alignment and search
tools
• Generally considered to be more sensitive than
BLAST.
• Illustration with two fictitious sequences used in the
Contig Assembly lecture, shown unshifted and with Seq2 shifted
to the right relative to Seq1:
Seq1: ACCGCGATGACGAATA
Seq2: GAATACGACTGACGATGGA
String Match in FASTA
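Word length of 1
(a) Seq1 and Seq2 by site (first four sites shown)
Site  1 2 3 4
Seq1  A C C G
Seq2  G A A T
(b) Sites of each nucleotide in Seq1
A: 1, 7, 10, 13, 14, 16
C: 2, 3, 5, 11
G: 4, 6, 9, 12
T: 8, 15
(c) Moves for each site of Seq2 (Seq2 site minus each matching Seq1 site from (b))
Site 1 (G): -3, -5, -8, -11
Site 2 (A): 1, -5, -8, -11, -12, -14
Site 3 (A): 2, -4, -7, -10, -11, -13
Site 4 (T): -4, -11
Site 5 (A): 4, -2, -5, -8, -9, -11
Site 6 (C): 4, 3, 1, -5
Site 7 (G): 3, 1, -2, -5
Site 8 (A): 7, 1, -2, -5, -6, -8
Site 9 (C): 7, 6, 4, -2
Site 10 (T): 2, -5
Site 11 (G): 7, 5, 2, -1
Site 12 (A): 11, 5, 2, -1, -2, -4
Site 13 (C): 11, 10, 8, 2
Site 14 (G): 10, 8, 5, 2
Site 15 (A): 14, 8, 5, 2, 1, -1
Site 16 (T): 8, 1
Site 17 (G): 13, 11, 8, 5
Site 18 (G): 14, 12, 9, 6
Site 19 (A): 18, 12, 9, 6, 5, 3
(e) Number of matches (N) for each move
Left move:   -1  -2  -3  -4  -5  -6  -7  -8  -9  -10  -11  -12  -13  -14  -15
N:            3   5   1   3   7   1   1   4   1    1    5    1    1    1    0
Right move:   1   2   3   4   5   6   7   8   9   10   11   12   13   14   15   16   17   18
N:            6   7   3   3   6   3   3   5   2    2    3    2    1    2    0    0    0    1
(d) The two best alignments (N = 7), at moves 2 and -5
Seq1:   ACCGCGATGACGAATA
Seq2: GAATACGACTGACGATGGA

Seq1: ACCGCGATGACGAATA
Seq2:      GAATACGACTGACGATGGA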
Word length of 2
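(a) Seq1 and Seq2 by site (first eight sites shown)
Site  1 2 3 4 5 6 7 8
Seq1  A C C G C G A T
Seq2  G A A T A C G A
(b) Sites of each 2-letter word in Seq1
AA: 13
AC: 1, 10
AT: 7, 14
CC: 2
CG: 3, 5, 11
GA: 6, 9, 12
GC: 4
TA: 15
TG: 8
AG, CA, CT, GG, GT, TC, TT: none
(c) Moves for each 2-letter word of Seq2 (Seq2 site minus each matching Seq1 site from (b))
Site 1 (GA): -5, -8, -11
Site 2 (AA): -11
Site 3 (AT): -4, -11
Site 4 (TA): -11
Site 5 (AC): 4, -5
Site 6 (CG): 3, 1, -5
Site 7 (GA): 1, -2, -5
Site 8 (AC): 7, -2
Site 9 (CT): none
Site 10 (TG): 2
Site 11 (GA): 5, 2, -1
Site 12 (AC): 11, 2
Site 13 (CG): 10, 8, 2
Site 14 (GA): 8, 5, 2
Site 15 (AT): 8, 1
Site 16 (TG): 8
Site 17 (GG): none
Site 18 (GA): 12, 9, 6
(e) Number of word matches (N) for each move
Left move:   -1  -2  -3  -4  -5  -6  -7  -8  -9  -10  -11  -12  -13  -14
N:            1   2   0   1   4   0   0   1   0    0    4    0    0    0
Right move:   1   2   3   4   5   6   7   8   9   10   11   12   13   14   15   16   17
N:            3   5   1   1   2   1   1   4   1    1    1    1    0    0    0    0    0
(d) Best (move 2, N = 5)
Seq1:   ACCGCGATGACGAATA
Seq2: GAATACGACTGACGATGGA
One of the three 2nd best (moves -5, -11 and 8, each with N = 4): an alignment with Seq2 shifted to the right relative to Seq1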
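The lookup-table and move-counting steps in panels (a) to (e) can be sketched in a few lines of Python. This is only an illustration of the idea on the two fictitious sequences, not the FASTA implementation. The function names `word_positions` and `move_counts` are hypothetical, and a "move" is taken to be the Seq2 site minus the Seq1 site, as the panel (c) values suggest.

```python
from collections import Counter, defaultdict

def word_positions(seq, k):
    """Panel (b): map each k-letter word in seq to its 1-based start sites."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)
    return table

def move_counts(seq1, seq2, k):
    """Panels (c) and (e): count word matches N for every move.

    A move is (Seq2 site - Seq1 site) for a word shared by both sequences.
    """
    table = word_positions(seq1, k)
    counts = Counter()
    for j in range(len(seq2) - k + 1):
        # Look up where this Seq2 word occurs in Seq1 (empty list if nowhere).
        for i in table.get(seq2[j:j + k], []):
            counts[(j + 1) - i] += 1
    return counts

seq1 = "ACCGCGATGACGAATA"
seq2 = "GAATACGACTGACGATGGA"
for k in (1, 2):
    counts = move_counts(seq1, seq2, k)
    best = max(counts.values())
    best_moves = sorted(m for m, n in counts.items() if n == best)
    print(f"word length {k}: best moves {best_moves}, N = {best}")
```

With word length 1 the best moves are -5 and 2 (N = 7); with word length 2 the single best move is 2 (N = 5), so the longer word removes the tie.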
Human COX1
RWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTAHAFVMI
FFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSLLLLLASAMVEAGAGTGWTV
YPPLAGNYSHPGASVDLTIFSLHLAGVSSILGAINFITTIINMKPPAMTQYQTPLFVWSVLI
TAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFG
MISHIVTYYSGKKEPFGYMGMVWAMMSIGFLGFIVWAHHMFTVGMDVDTRAYFTSATMIIAI
PTGVKVFSWLATLHGSNMKWSAAVLWALGFIFLFTVGGLTGIVLANSSLDIVLHDTYYVVAH
FHYVLSMGAVFAIMGGFIHWFPLFSGYTLDQTYAKIHFTIMFIGVNLTFFPQHFLGLSGMPR
RYSDYPDAYTTWNILSSVGSFISLTAVMLMIFMIWEAFASKRKVLMVEEPSMNLEWLYGCPP
PYHTFEEP
Exact match:
All sequences in the database are pre-indexed.
Cys (C) is the rarest amino acid in this protein in the database.
If a query sequence contains a C, then go directly to the C at site
494 to check; if the query has no C, then report 'No match'
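A minimal sketch of this pre-indexing idea follows, assuming the index is simply a table from each residue to the sites where it occurs. The helper names `build_index` and `exact_match_sites` are hypothetical, and only the last 70 residues of the COX1 sequence above are used to keep the example short, so the sites it reports are relative to that fragment.

```python
from collections import defaultdict

def build_index(protein):
    """Pre-index a database sequence: residue -> list of 0-based sites."""
    index = defaultdict(list)
    for site, residue in enumerate(protein):
        index[residue].append(site)
    return index

def exact_match_sites(query, protein, index):
    """Return 0-based start sites where query occurs exactly in protein."""
    # Anchor on the query residue that is rarest in the indexed sequence,
    # so that as few candidate sites as possible need to be checked.
    offset = min(range(len(query)), key=lambda i: len(index.get(query[i], [])))
    hits = []
    for site in index.get(query[offset], []):
        start = site - offset
        if start >= 0 and protein[start:start + len(query)] == query:
            hits.append(start)
    return hits

# Last 70 residues of the Human COX1 sequence shown above.
protein = "RYSDYPDAYTTWNILSSVGSFISLTAVMLMIFMIWEAFASKRKVLMVEEPSMNLEWLYGCPPPYHTFEEP"
index = build_index(protein)
print(exact_match_sites("LYGCPP", protein, index))  # checks only the single C site
print(exact_match_sites("LYGCPW", protein, index))  # empty list: no match
```

Because the fragment contains a single C, a query containing C triggers only one comparison, which is the point of anchoring on the rarest residue.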
BLAST
• Adapted from Crane & Raymer 2003
• Motivation: matching short sequences is faster than matching
longer ones
• Input sequence: AILVPTVIGCTVPT
• Algorithm:
– Break the query sequence into words:
AILV, ILVP, LVPT, VPTV, PTVI, TVIG, VIGC, IGCT,
GCTV, CTVP, TVPT
– Discard common words (i.e., words made entirely of common amino
acids)
– Search for matches against database sequences, assess significance and
decide whether to discard or to continue with extension using dynamic
programming:
              AILVPTVIGCTVPT
MVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC
• Critical decision: Discard or continue?
• The E-value as an answer.
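The word-breaking and exact word-matching steps above can be sketched as follows. This is an illustration only: the function names `query_words` and `seed_hits` are hypothetical, the "discard common words" step is omitted, and the significance test and dynamic-programming extension are not attempted.

```python
def query_words(query, w=4):
    """Break the query into overlapping words of length w."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def seed_hits(query, target, w=4):
    """Return (query_start, target_start) pairs (0-based) of exact word matches."""
    hits = []
    for q_start, word in enumerate(query_words(query, w)):
        t_start = target.find(word)
        while t_start != -1:
            hits.append((q_start, t_start))
            t_start = target.find(word, t_start + 1)
    return hits

query = "AILVPTVIGCTVPT"
target = "MVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC"
print(query_words(query))
print(seed_hits(query, target))
```

All hits for this query share the same offset (target start minus query start = 14), which marks the diagonal along which an extension step would proceed.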
Basic stats in string matching
• Given $P_A$, $P_C$, $P_G$, $P_T$ in a target (database) sequence, the
probability of a query sequence, say, ATTGCC, having a
perfect match to the target sequence at a given site is:
$prob = P_A P_T P_T P_G P_C P_C = P_A (P_C)^2 P_G (P_T)^2$
• Let M be the target sequence length and N be the query
sequence length; the “matching operation” can be performed
(M – N + 1) times, e.g., 10 – 3 + 1 = 8 times for
Query: ATG
Target: CGATTGCCCG
• The probability distribution of the number of matches follows
(approximately) a binomial distribution with p = prob and n =
(M – N + 1)
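As a small numerical illustration of the two quantities above, the sketch below computes the per-site probability of a perfect match and the number of matching operations. The equal base frequencies are an assumption made for the example, and the helper names are hypothetical.

```python
from math import prod

def perfect_match_prob(query, freqs):
    """Probability that query matches the target perfectly at one given site."""
    return prod(freqs[base] for base in query)

def matching_operations(target_len, query_len):
    """Number of sites at which the query can be laid along the target."""
    return target_len - query_len + 1

# Assumed equal nucleotide frequencies in the target (illustration only).
freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

p = perfect_match_prob("ATTGCC", freqs)
n = matching_operations(len("CGATTGCCCG"), len("ATG"))
print(p)  # per-site probability for ATTGCC
print(n)  # matching operations for ATG against CGATTGCCCG
```

With unequal frequencies the same function applies unchanged; only the `freqs` table differs.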
Basic stats in string matching
• Probability of having a sequence match: p
• Probability of having no match: q = 1-p
• Binomial distribution:
$(p+q)^n = p^n + \frac{n!}{(n-1)!\,1!} p^{n-1} q + \dots + \frac{n!}{(n-x)!\,x!} p^{n-x} q^x + \dots + q^n$
• When np > 50, the binomial distribution can be approximated
by the normal distribution with the mean = np and variance =
npq
$P(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
• When np < 1 and n is very large, the binomial distribution can be
approximated by the Poisson distribution with mean and
variance equal to np (i.e., $\mu = \sigma^2 = np$).
$P(x) = \frac{e^{-\mu} \mu^x}{x!}$
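The two approximations can be checked numerically. The sketch below compares the exact binomial probabilities with the Poisson and normal formulas; the values of n and p are hypothetical and chosen only so that np < 1 in the first case and np > 50 in the second.

```python
from math import comb, exp, factorial, pi, sqrt

def binomial_pmf(x, n, p):
    """Exact probability of x matches in n trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, mu):
    """Poisson approximation with mean mu = np."""
    return exp(-mu) * mu**x / factorial(x)

def normal_pdf(x, mu, var):
    """Normal approximation with mean np and variance npq."""
    return exp(-(x - mu)**2 / (2 * var)) / sqrt(2 * pi * var)

# Rare matches: n very large, np = 0.5 < 1 (hypothetical values).
n, p = 100_000, 5e-6
for x in range(4):
    print(x, binomial_pmf(x, n, p), poisson_pmf(x, n * p))

# Frequent matches: np = 500 > 50 (hypothetical values).
n2, p2 = 1000, 0.5
print(binomial_pmf(500, n2, p2), normal_pdf(500, n2 * p2, n2 * p2 * (1 - p2)))
```

In both regimes the approximate and exact values agree to several decimal places, which is why the cheaper Poisson formula is used for rare matches such as database hits.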
From Binomial to Poisson
$(p+q)^n = p^n + \frac{n!}{(n-1)!\,1!} p^{n-1} q + \dots + \frac{n!}{(n-x)!\,x!} p^{n-x} q^x + \dots + \frac{n!}{(n-x)!\,x!} p^x q^{n-x} + \dots + q^n$
$P(n) = p^n$
$P(n-1) = n p^{n-1} q$
$P(n-x) = \frac{n!}{(n-x)!\,x!} p^{n-x} q^x$
$P(x) = \frac{n!}{(n-x)!\,x!} p^x q^{n-x}$
$P(0) = q^n$
$P(x) = \frac{n!}{(n-x)!\,x!} p^x q^{n-x} = \frac{n!}{(n-x)!\,x!} p^x q^{-x} q^n = \frac{n(n-1)(n-2)\cdots(n-x+1)}{x!} \left(\frac{p}{q}\right)^x (1-p)^n$
$\approx \frac{n^x}{x!} p^x e^{-np} = \frac{(np)^x}{x!} e^{-np} = \frac{\mu^x}{x!} e^{-\mu}$, with $\mu = np$
Matching two sequences without gap
• Assuming equal nucleotide frequencies, the probability of a
nucleotide site in the query sequence matching a site in the
target sequence is p = 0.25.
• The probability of finding an exact match of L letters is $a = p^L = 0.25^L = 2^{-2L} = 2^{-S}$,
where S is called the bit score in BLAST.
• M: query length; N: target length, e.g., M = 8, N = 5, L = 3
AACGGTTC
CGGTT
• A sequence of length L can move at (M – L +1) distinct sites
along the query and (N – L +1) distinct sites along the target.
• m = (M-L+1) and n = (N-L+1) are called effective lengths of
the two sequences.
• The expected number of matches with length L is $mn2^{-S}$,
which is called the E-value in ungapped BLAST.
• S is calculated differently in the gapped BLAST
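The ungapped E-value for the small example above (M = 8, N = 5, L = 3) can be worked through as follows. This is a sketch of the formulas on this slide only; the function name `ungapped_e_value` is hypothetical and the default p = 0.25 reflects the equal-frequency assumption stated above.

```python
from math import log2

def ungapped_e_value(M, N, L, p=0.25):
    """Return (m, n, S, E) for an exact match of L letters without gaps."""
    m = M - L + 1          # effective length of the query
    n = N - L + 1          # effective length of the target
    S = -L * log2(p)       # bit score, equal to 2L when p = 0.25
    E = m * n * 2 ** (-S)  # expected number of matches of length L
    return m, n, S, E

m, n, S, E = ungapped_e_value(M=8, N=5, L=3)
print(m, n, S, E)
```

For this example m = 6, n = 3 and S = 6, so E = 18/64, i.e. such a 3-letter match is expected by chance about once in every three or four comparisons of sequences of these lengths.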
Blast Output (Nuc. Seq.)
BLASTN 2.2.4 [Aug-26-2002]
...
Query= Seq1 38
Database: MgCDS
480 sequences; 526,317 total letters
                                              Score   E
Sequences producing significant alignments:   (bits)  Value
MG001 1095 bases                              34      7e-004
Score = 34.2 bits (17), Expect = 7e-004
Identities = 35/40 (87%), Gaps = 2/40 (5%)
Query: 1 atgaataacg--attatttccaacgacaaaacaaaaccac 38
         ||||||||||  |||||||||||  |||||| ||||||||
Sbjct: 1 atgaataacgttattatttccaataacaaaataaaaccac 40
Lambda  K      H
1.37    0.711  1.31
Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
…
effective length of query: 26
effective length of database: 520,557

Constant gap penalty vs affine function penalty
Matches: 35*1 = 35
Mismatches: 3*(-3) = -9
Gap Open: 1*5 = 5
Gap extension: 2*2 = 4 (typically one would count only 1 GE here)
R = 35 - 9 - 5 - 4 = 17
S = [λR – ln(K)]/ln(2) = [1.37*17 - ln(0.711)]/ln(2) = 34
E = $mn2^{-S}$ = 26 * 520557 * $2^{-34}$ = 7.878E-04
Alternatively, E = K m n exp(-λR)
$p(x) = \frac{e^{-E} E^x}{x!}$
x   p(x)
0   0.999265217
1   0.000734513
…
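The arithmetic on this slide can be reproduced with a short script. It is a sketch that plugs in the numbers printed in the BLAST output above (Lambda, K, effective lengths, match and gap counts) and is not a reimplementation of BLAST; the variable names are chosen for the example. With S left unrounded the E-value comes out slightly smaller than the 7.878E-04 obtained with S = 34, and the two routes to E agree exactly.

```python
from math import exp, log

lam, K = 1.37, 0.711   # Lambda and K from the BLAST output
m, n = 26, 520557      # effective lengths of query and database

# Raw score R from the alignment: match +1, mismatch -3,
# gap existence 5, gap extension 2 (counted twice, as on the slide).
matches, mismatches, gap_opens, gap_extensions = 35, 3, 1, 2
R = matches * 1 + mismatches * (-3) - gap_opens * 5 - gap_extensions * 2

S = (lam * R - log(K)) / log(2)        # bit score
E_from_bits = m * n * 2 ** (-S)        # E = m n 2^(-S)
E_direct = K * m * n * exp(-lam * R)   # E = K m n exp(-lambda R)

print(R, round(S, 2))
print(E_from_bits, E_direct)
print(m * n * 2 ** (-34))              # the slide's value, using S rounded to 34
```

The identity of `E_from_bits` and `E_direct` follows directly from the definition of S, so either formula can be used once Lambda and K are known.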
E-Value in BLAST
• The E-value is the expected number of random
matches that are as good as or better than the
reported match. It can be a number near zero or much
larger than 1.
• It is NOT the probability of finding the reported
match.
• Only when the E-value is extremely small can it be
interpreted as the probability of finding 1 match that
is as good as the reported one (see next slide).
E-value and P(1)
$p(x) = \frac{e^{-E} E^x}{x!}$
$p(1) = e^{-E} E \approx E$ (when $E \to 0$)
[Figure: P(1) (vertical axis, 0 to 1) plotted against E-value (horizontal axis, 0.00 to 1.00)]
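The relation between the E-value and P(1) can be tabulated directly from the Poisson formula. The sketch below uses a hypothetical helper `poisson_p` and a few arbitrary E-values to show where the approximation P(1) ≈ E holds and where it breaks down.

```python
from math import exp, factorial

def poisson_p(x, E):
    """Probability of exactly x random matches when E are expected."""
    return exp(-E) * E**x / factorial(x)

# Arbitrary E-values, from very small to 1.
for E in (0.001, 0.01, 0.1, 0.5, 1.0):
    print(E, poisson_p(1, E), E - poisson_p(1, E))
```

The gap between E and P(1) grows with E: it is negligible at E = 0.001 but large at E = 1, where P(1) is only about 0.37.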
BLAST Programs
Program           Database    Query       Typical Uses
BLASTN/MEGABLAST  Nucleotide  Nucleotide  MEGABLAST has a longer word size than BLASTN
BLASTP            Protein     Protein     Query a protein/peptide against a protein database
BLASTX            Protein     Nucleotide  Translate a nuc sequence into a “protein” in six frames and search against a protein database
TBLASTN           Nucleotide  Protein     Unannotated nuc sequences (e.g., ESTs) are translated in six frames, against which the query protein is searched
TBLASTX           Nucleotide  Nucleotide  6-frame translation of both query and database
PHI-BLAST         Protein     Protein     Pattern-hit initiated BLAST
PSI-BLAST         Protein     Protein     Position-specific iterated BLAST
RPS-BLAST         Protein     Protein     Reverse PSI-BLAST
Comparison: BLAST and FASTA
• BLAST starts with exact string matching, while
FASTA starts with inexact string matching (or exact
string matching with shorter words). BLAST is
faster than FASTA.
• For the examples given, both BLAST and FASTA
will find the same best match, i.e., shifting the query
sequence by 2 sites to the right.
• Both perform dynamic programming for extending
the match after the initial match.
Optional: BLAST Parameters
• Lambda (λ) and Karlin-Altschul (K) parameters are important
because they directly affect the computation of the E-value.
• Both λ and K depend on
– nucleotide (or amino acid) frequencies
– match-mismatch matrix
• All BLAST implementations generally assume that nucleotide
(or amino acid) sequences have roughly equal frequencies.
• For nucleotide (or amino acid) sequences with strongly biased
frequencies, the BLAST E-value obtained under this assumption
can be quite misleading, i.e., one should use appropriate λ and
K.
Lambda (λ) and K
BLAST output includes lambda (λ) and K. Mathematically, λ is defined as follows:
$\sum_{i=1}^{4} \sum_{j=1}^{4} p_i p_j e^{\lambda s_{ij}} = 1$
where $p_i$, $p_j$ are nucleotide frequencies (i, j = A, C, G, or T), and $s_{ij}$ is the match (when i = j) or
mismatch (when i ≠ j) score. In nucleotide BLAST by default, we have $s_{ii} = 1$ and $s_{ij} = -3$ (i ≠ j). In the
simplest case with equal nucleotide frequencies, i.e., when $p_i = 0.25$, the equation above is reduced to
$\sum_{i=1}^{4} \sum_{j=1}^{4} p_i p_j e^{\lambda s_{ij}} = 4 \times 0.25^2 e^{\lambda} + 12 \times 0.25^2 e^{-3\lambda} = 0.25 e^{\lambda} + 0.75 e^{-3\lambda} = 1$
Now insert different λ values into the equation above to find which λ balances the equation
(not the trivial solution of λ = 0).
For amino acid sequences:
$\sum_{i=1}^{20} \sum_{j=1}^{20} p_i p_j e^{\lambda s_{ij}} = 1$
See the updated Chapter 1 and BLASTParameter.xlsm on how to compute K.
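Instead of inserting trial values by hand, the non-trivial root can be located numerically. The sketch below uses plain bisection, which is one of several possible root-finding methods and is an assumption of this example rather than how BLAST itself does it. The function names are hypothetical, and the search assumes the expected score is negative, so that the left-hand side dips below 1 just above λ = 0.

```python
from math import exp

def lambda_equation(lam, freqs, scores):
    """Sum over i, j of p_i p_j exp(lam * s_ij), minus 1."""
    size = len(freqs)
    total = sum(freqs[i] * freqs[j] * exp(lam * scores[i][j])
                for i in range(size) for j in range(size))
    return total - 1.0

def solve_lambda(freqs, scores, tol=1e-10):
    """Find the positive root by bisection (assumes a negative expected score)."""
    lo, hi = 1e-6, 1.0
    # Expand the upper bound until the equation becomes positive.
    while lambda_equation(hi, freqs, scores) <= 0:
        hi *= 2
    # The function is negative at lo and positive at hi: halve the bracket.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if lambda_equation(mid, freqs, scores) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

# Default nucleotide scores: match 1, mismatch -3; equal frequencies.
match_mismatch = [[1 if i == j else -3 for j in range(4)] for i in range(4)]
print(solve_lambda([0.25] * 4, match_mismatch))
```

Passing a different frequency vector or score matrix to `solve_lambda` gives the λ for that combination, which is how biased base composition changes the parameter.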
λ implies nucleotide frequencies
(a) Nucleotide frequencies (margins) and their products $p_i p_j$
         A       G       C       T
         0.25    0.25    0.25    0.25
A  0.25  0.0625  0.0625  0.0625  0.0625
G  0.25  0.0625  0.0625  0.0625  0.0625
C  0.25  0.0625  0.0625  0.0625  0.0625
T  0.25  0.0625  0.0625  0.0625  0.0625
(b) Match-mismatch matrix
    A   G   C   T
A   1  -3  -3  -3
G  -3   1  -3  -3
C  -3  -3   1  -3
T  -3  -3  -3   1
(c) Lambda = 1.374063; cells are $p_i p_j e^{\lambda s_{ij}}$
    A         G         C         T
A   0.246961  0.001013  0.001013  0.001013
G   0.001013  0.246961  0.001013  0.001013
C   0.001013  0.001013  0.246961  0.001013
T   0.001013  0.001013  0.001013  0.246961
Sum: 1
(a’) Nucleotide frequencies (margins) and their products $p_i p_j$
         A       G       C       T
         0.49    0.01    0.01    0.49
A  0.49  0.2401  0.0049  0.0049  0.2401
G  0.01  0.0049  0.0001  0.0001  0.0049
C  0.01  0.0049  0.0001  0.0001  0.0049
T  0.49  0.2401  0.0049  0.0049  0.2401
(b’) Match-Mismatch matrix
    A   G   C   T
A   1  -3  -3  -3
G  -3   1  -3  -3
C  -3  -3   1  -3
T  -3  -3  -3   1
(c’) Lambda = 0.658295; cells are $p_i p_j e^{\lambda s_{ij}}$
    A         G         C         T
A   0.463752  0.00068   0.00068   0.03332
G   0.00068   0.000193  1.39E-05  0.00068
C   0.00068   1.39E-05  0.000193  0.00068
T   0.03332   0.00068   0.00068   0.463752
Sum: 1
BLAST parameters λ, K and H are computed for each BLAST
database created.
Finding λ III: different nucleotide frequencies, s/v
Nucleotide frequencies (margins) and their products $p_i p_j$
        A     G     C     T
        0.1   0.4   0.4   0.1
A  0.1  0.01  0.04  0.04  0.01
G  0.4  0.04  0.16  0.16  0.04
C  0.4  0.04  0.16  0.16  0.04
T  0.1  0.01  0.04  0.04  0.01
Match-Mismatch
    A   G   C   T
A   1  -1  -3  -3
G  -1   1  -3  -3
C  -3  -3   1  -1
T  -3  -3  -1   1
Lambda = 0.9899; cells are $p_i p_j e^{\lambda s_{ij}}$
    A         G         C         T
A   0.02691   0.014865  0.002053  0.000513
G   0.014865  0.430554  0.008211  0.002053
C   0.002053  0.008211  0.430554  0.014865
T   0.000513  0.002053  0.014865  0.02691
Sum: 1.000046 (should equal 1)
Finding K: equal nucleotide frequencies, (1, -3)
Nucleotide frequencies (margins) and their products $p_i p_j$
         A       G       C       T
         0.25    0.25    0.25    0.25
A  0.25  0.0625  0.0625  0.0625  0.0625
G  0.25  0.0625  0.0625  0.0625  0.0625
C  0.25  0.0625  0.0625  0.0625  0.0625
T  0.25  0.0625  0.0625  0.0625  0.0625
Match-Mismatch
    A   G   C   T
A   1  -3  -3  -3
G  -3   1  -3  -3
C  -3  -3   1  -3
T  -3  -3  -3   1
Lambda = 1; cells are $p_i p_j e^{\lambda s_{ij}}$
    A         G         C         T
A   0.169893  0.003112  0.003112  0.003112
G   0.003112  0.169893  0.003112  0.003112
C   0.003112  0.003112  0.169893  0.003112
T   0.003112  0.003112  0.003112  0.169893
Sum: 0.716911
Double-click it, copy to EXCEL and find λ by using solver.