Mathematics and computation behind BLAST and FASTA
Xuhua Xia
http://dambe.bio.uottawa.ca

Why string matching?
• Early applications: sequence similarity between an oncogene (a gene in a virus that causes a cancer-like transformation of the infected cells), v-sis, and the platelet-derived growth factor (PDGF)
  – M. D. Waterfield et al. 1983. Nature 304:35-39
  – R. F. Doolittle et al. 1983. Science 221:275-277
  – Implications:
    • Cancer can be caused by a constitutively expressed growth factor
    • Alteration of gene expression can contribute to cancer
    • Growth factors and the like can be drug targets against cancer
• Fast computational methods in string matching
  – FASTA
  – BLAST
  – Local pair-wise alignment by dynamic programming
Slide 2

FASTA
• A commonly used family of alignment and search tools
• Generally considered to be more sensitive than BLAST.
• Illustration with two fictitious sequences used in the Contig Assembly lecture:
  Seq1: ACCGCGATGACGAATA
  Seq2: GAATACGACTGACGATGGA
Slide 3

String Match in FASTA
[Figure: FASTA string matching with word length 1. Panel (a) lists Seq1 and Seq2 site by site. Panel (b) gives the positions of each nucleotide in Seq1:
  A: 1, 7, 10, 13, 14, 16
  C: 2, 3, 5, 11
  G: 4, 6, 9, 12
  T: 8, 15
Panel (c) lists, for each site of Seq2, the offsets to the matching positions in Seq1. Panel (d) shows alignments of Seq1 against Seq2. Panel (e) tabulates the number of matches N for each left or right move of Seq1.]

Word length of 2
[Figure: FASTA string matching with word length 2. Panel (a) lists Seq1 and Seq2 site by site. Panel (b) gives the positions of each dinucleotide word in Seq1; for the words AA to CT these are:
  AA: 13
  AC: 1, 10
  AT: 7, 14
  CC: 2
  CG: 3, 5, 11]
[Figure, word length 2, continued. Positions of the remaining dinucleotide words in Seq1 in panel (b):
  GA: 6, 9, 12
  GC: 4
  TA: 15
  TG: 8
Panels (c) and (e) list the offsets between matching words and the number of word matches N for each left or right move. Panel (d) shows the best alignment of Seq1 against Seq2 and one of the three second-best alignments.]

Human COX1
  RWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTAHAFVMI
  FFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSLLLLLASAMVEAGAGTGWTV
  YPPLAGNYSHPGASVDLTIFSLHLAGVSSILGAINFITTIINMKPPAMTQYQTPLFVWSVLI
  TAVLLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFG
  MISHIVTYYSGKKEPFGYMGMVWAMMSIGFLGFIVWAHHMFTVGMDVDTRAYFTSATMIIAI
  PTGVKVFSWLATLHGSNMKWSAAVLWALGFIFLFTVGGLTGIVLANSSLDIVLHDTYYVVAH
  FHYVLSMGAVFAIMGGFIHWFPLFSGYTLDQTYAKIHFTIMFIGVNLTFFPQHFLGLSGMPR
  RYSDYPDAYTTWNILSSVGSFISLTAVMLMIFMIWEAFASKRKVLMVEEPSMNLEWLYGCPP
  PYHTFEEP
Exact match: All sequences in the database are pre-indexed. Cys is the rarest amino acid in this protein in the database. If a query sequence contains a C, then go directly to the C at site 494 to check; if the query has no C, then report 'No match'.

BLAST
• Adapted from Crane & Raymer 2003
• Motivation: matching short sequences is faster than matching long ones
• Input sequence: AILVPTVIGCTVPT
• Algorithm:
  – Break the query sequence into words: AILV, ILVP, LVPT, VPTV, PTVI, TVIG, VIGC, IGCT, GCTV, CTVP, TVPT
  – Discard common words (i.e., words made entirely of common amino acids)
  – Search for matches against database sequences, assess significance, and decide whether to discard the match or continue with extension using dynamic programming:
                  AILVPTVIGCTVPT
    MVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC
• Critical decision: discard or continue?
• The E-value as an answer.
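The word-breaking and word-matching steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not BLAST's actual implementation: the function names `split_into_words` and `find_word_hits` are hypothetical, the word length of 4 is taken from the slide's example, and the "discard common words" and significance steps are left out.

```python
from collections import defaultdict

def split_into_words(seq, w):
    """Return all overlapping words of length w, in sequence order."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def find_word_hits(query, target, w):
    """Return exact word hits as (query_pos, target_pos) pairs, 0-based."""
    # Pre-index the target: word -> list of start positions.
    index = defaultdict(list)
    for pos, word in enumerate(split_into_words(target, w)):
        index[word].append(pos)
    hits = []
    for qpos, word in enumerate(split_into_words(query, w)):
        for tpos in index.get(word, []):
            hits.append((qpos, tpos))
    return hits

query = "AILVPTVIGCTVPT"
target = "MVQGWALYDFLKCRAILVPTVIACTCVAMLALYDFLKC"

words = split_into_words(query, 4)
hits = find_word_hits(query, target, 4)
# Hits sharing the same offset (target_pos - query_pos) lie on one diagonal.
diagonals = sorted({tpos - qpos for qpos, tpos in hits})

print(words)
print(hits)
print(diagonals)
```

All hits in this example fall on a single diagonal, which is the region an extension step would then examine.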
Slide 7

Basic stats in string matching
• Given $P_A$, $P_C$, $P_G$, $P_T$ in a target (database) sequence, the probability that a query sequence, say ATTGCC, perfectly matches the target sequence at a given position is:
  $\text{prob} = P_A P_T P_T P_G P_C P_C = P_A (P_C)^2 P_G (P_T)^2$
• Let M be the target sequence length and N be the query sequence length; the “matching operation” can then be performed (M – N + 1) times, e.g.,
  Query:  ATG
  Target: CGATTGCCCG
• The probability distribution of the number of matches follows (approximately) a binomial distribution with p = prob and n = (M – N + 1)
Slide 8

Basic stats in string matching
• Probability of having a sequence match: p
• Probability of having no match: q = 1 – p
• Binomial distribution:
  $(p+q)^n = p^n + \frac{n!}{(n-1)!\,1!} p^{n-1} q + \dots + \frac{n!}{(n-x)!\,x!} p^{n-x} q^x + \dots + q^n$
• When np > 50, the binomial distribution can be approximated by the normal distribution with mean $\mu = np$ and variance $\sigma^2 = npq$:
  $P(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
• When np < 1 and n is very large, the binomial distribution can be approximated by the Poisson distribution with mean and variance equal to np (i.e., $\mu = \sigma^2 = np$):
  $P(x) = \frac{e^{-\mu} \mu^x}{x!}$
Slide 9

From Binomial to Poisson
  $(p+q)^n = p^n + \frac{n!}{(n-1)!\,1!} p^{n-1} q + \dots + \frac{n!}{(n-x)!\,x!} p^{n-x} q^x + \dots + \frac{n!}{(n-x)!\,x!} p^x q^{n-x} + \dots + q^n$
  $P(n) = p^n$
  $P(n-1) = n p^{n-1} q$
  $P(n-x) = \frac{n!}{(n-x)!\,x!} p^{n-x} q^x$
  $P(x) = \frac{n!}{(n-x)!\,x!} p^x q^{n-x}$
  $P(0) = q^n$
  $P(x) = \frac{n!}{(n-x)!\,x!} p^x q^{-x} q^n = \frac{n(n-1)(n-2)\cdots(n-x+1)}{x!} \frac{p^x}{q^x} (1-p)^n \approx \frac{n^x p^x}{x!} e^{-np} = \frac{(np)^x}{x!} e^{-np} = \frac{\mu^x e^{-\mu}}{x!}$
Slide 10

Matching two sequences without gap
• Assuming equal nucleotide frequencies, the probability of a nucleotide site in the query sequence matching a site in the target sequence is p = 0.25.
• The probability of finding an exact match of L letters is $a = p^L = 0.25^L = 2^{-2L} = 2^{-S}$, where S is called the bit score in BLAST.
• M: query length; N: target length, e.g., M = 8, N = 5, L = 3
  AACGGTTC
  CGGTT
• A word of length L can be placed at (M – L + 1) distinct sites along the query and (N – L + 1) distinct sites along the target.
• m = (M – L + 1) and n = (N – L + 1) are called the effective lengths of the two sequences.
• The expected number of matches of length L is $mn2^{-S}$, which is called the E-value in ungapped BLAST.
• S is calculated differently in gapped BLAST
Slide 11

Blast Output (Nuc. Seq.)
  BLASTN 2.2.4 [Aug-26-2002]
  ...
  Query= Seq1 38
  Database: MgCDS
    480 sequences; 526,317 total letters

                                               Score   E
  Sequences producing significant alignments:  (bits)  Value
  MG001 1095 bases                             34      7e-004

  Score = 34.2 bits (17), Expect = 7e-004
  Identities = 35/40 (87%), Gaps = 2/40 (5%)

  Query: 1 atgaataacg--attatttccaacgacaaaacaaaaccac 38
           ||||||||||  |||||||||||  |||||| ||||||||
  Sbjct: 1 atgaataacgttattatttccaataacaaaataaaaccac 40

  Lambda   K      H
  1.37     0.711  1.31

  Matrix: blastn matrix:1 -3
  Gap Penalties: Existence: 5, Extension: 2
  …
  effective length of query: 26
  effective length of database: 520,557

Annotations on the output:
• Gap penalties: constant gap penalty vs. affine function penalty
• Raw score R:
  Matches: 35 × 1 = 35
  Mismatches: 3 × (–3) = –9
  Gap open: 1 × 5 = 5
  Gap extension: 2 × 2 = 4 (typically one would count only 1 GE here)
  R = 35 – 9 – 5 – 4 = 17
• Bit score: $S = [\lambda R - \ln(K)]/\ln(2) = [1.37 \times 17 - \ln(0.711)]/\ln(2) = 34$
• E-value: $E = mn2^{-S} = 26 \times 520557 \times 2^{-34} = 7.878 \times 10^{-4}$
• Alternatively, $E = Kmn\,e^{-\lambda R}$
• Poisson probability of x matches: $p(x) = \frac{e^{-E} E^x}{x!}$
  x    p(x)
  0    0.999265217
  1    0.000734513
  …

E-Value in BLAST
• The E-value is the expected number of random matches that are as good as or better than the reported match. It can be a number near zero or much larger than 1.
• It is NOT the probability of finding the reported match.
• Only when the E-value is extremely small can it be interpreted as the probability of finding 1 match that is as good as the reported one (see next slide).
Slide 13

E-value and P(1)
  $p(x) = \frac{e^{-E} E^x}{x!}$
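The arithmetic behind the BLAST output above (raw score, bit score, E-value, and the Poisson probabilities) can be checked with a short Python sketch. The function names are hypothetical and chosen only for this illustration; the numbers are the ones printed in the output (λ = 1.37, K = 0.711, effective lengths 26 and 520,557). Note that the slide rounds S to 34 before computing E, so the sketch reports both the rounded and the unrounded variant.

```python
import math

def raw_score(matches, mismatches, gap_opens, gap_extensions,
              match=1, mismatch=-3, gap_open=5, gap_extend=2):
    """Raw alignment score R under an affine gap penalty."""
    return (matches * match + mismatches * mismatch
            - gap_opens * gap_open - gap_extensions * gap_extend)

def bit_score(R, lam, K):
    """Bit score S = (lambda * R - ln K) / ln 2."""
    return (lam * R - math.log(K)) / math.log(2)

def e_value(m, n, S):
    """E-value E = m * n * 2^(-S) for effective lengths m and n."""
    return m * n * 2.0 ** (-S)

def poisson(x, E):
    """Poisson probability of exactly x random matches when the mean is E."""
    return math.exp(-E) * E ** x / math.factorial(x)

lam, K = 1.37, 0.711
m, n = 26, 520557

R = raw_score(matches=35, mismatches=3, gap_opens=1, gap_extensions=2)
S = bit_score(R, lam, K)
E_rounded = e_value(m, n, round(S))        # as on the slide, with S rounded to 34
E_direct = K * m * n * math.exp(-lam * R)  # equivalent form, S not rounded

print(R, round(S, 2))
print(E_rounded, E_direct)
print(poisson(0, E_direct), poisson(1, E_direct))
```

Both variants give an E-value of the same order as the 7e-004 that BLAST reports; the small difference comes only from rounding S.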
  $p(1) = e^{-E} E \approx E$ (when $E \to 0$)
[Figure: P(1) plotted against the E-value, for E-values from 0 to 1.]
Slide 14

BLAST Programs
Each entry gives the program, the database type, the query type, and typical uses.
• BLASTN/MEGABLAST (database: nucleotide; query: nucleotide): MEGABLAST has a longer word size than BLASTN
• BLASTP (database: protein; query: protein): query a protein/peptide against a protein database
• BLASTX (database: protein; query: nucleotide): translate a nucleotide sequence into a “protein” in six frames and search against a protein database
• TBLASTN (database: nucleotide; query: protein): unannotated nucleotide sequences (e.g., ESTs) are translated in six frames, against which the query protein is searched
• TBLASTX (database: nucleotide; query: nucleotide): 6-frame translation of both query and database
• PHI-BLAST (database: protein; query: protein): pattern-hit initiated BLAST
• PSI-BLAST (database: protein; query: protein): position-specific iterated BLAST
• RPS-BLAST (database: protein; query: protein): reverse PSI-BLAST
Slide 15

Comparison: BLAST and FASTA
• BLAST starts with exact string matching, while FASTA starts with inexact string matching (or exact string matching with shorter words). BLAST is faster than FASTA.
• For the examples given, both BLAST and FASTA will find the same best match, i.e., shifting the query sequence by 2 sites to the right.
• Both perform dynamic programming to extend the match after the initial match.
Slide 16

Optional: BLAST Parameters
• The lambda (λ) and Karlin-Altschul (K) parameters are important because they directly affect the computation of the E-value.
• Both λ and K depend on
  – nucleotide (or amino acid) frequencies
  – the match-mismatch matrix
• BLAST implementations generally assume that nucleotides (or amino acids) have roughly equal frequencies.
• For nucleotide (or amino acid) sequences with strongly biased frequencies, the BLAST E-value obtained under this assumption can be quite misleading, i.e., one should use appropriate λ and K.

Lambda (λ) and K
BLAST output includes lambda (λ) and K.
Mathematically, λ is defined as follows:
  $\sum_{i=1}^{4} \sum_{j=1}^{4} p_i p_j e^{\lambda s_{ij}} = 1$
where $p_i$, $p_j$ are nucleotide frequencies (i, j = A, C, G, or T), and $s_{ij}$ is the match (when i = j) or mismatch (when i ≠ j) score. In nucleotide BLAST by default, we have $s_{ii} = 1$ and $s_{ij} = -3$. In the simplest case with equal nucleotide frequencies, i.e., when $p_i = 0.25$, the equation above is reduced to
  $\sum_{i=1}^{4} \sum_{j=1}^{4} p_i p_j e^{\lambda s_{ij}} = 4 \times 0.25^2 e^{\lambda} + 12 \times 0.25^2 e^{-3\lambda} = 0.25 e^{\lambda} + 0.75 e^{-3\lambda} = 1$
Now insert different λ values into the equation above to find the λ that balances the equation (not the trivial solution of λ = 0).
  $\sum_{i=1}^{20} \sum_{j=1}^{20} p_i p_j e^{\lambda s_{ij}} = 1$ (for amino acid sequences)
See the updated Chapter 1 and BLASTParameter.xlsm on how to compute K.

λ implies nucleotide frequencies

(a) Products $p_i p_j$ with equal frequencies (A = G = C = T = 0.25)
            A (0.25)   G (0.25)   C (0.25)   T (0.25)
  A (0.25)  0.0625     0.0625     0.0625     0.0625
  G (0.25)  0.0625     0.0625     0.0625     0.0625
  C (0.25)  0.0625     0.0625     0.0625     0.0625
  T (0.25)  0.0625     0.0625     0.0625     0.0625

(b) Match-mismatch matrix
       A    G    C    T
  A    1   -3   -3   -3
  G   -3    1   -3   -3
  C   -3   -3    1   -3
  T   -3   -3   -3    1

(c) $p_i p_j e^{\lambda s_{ij}}$ with λ = 1.374063 (sum = 1)
       A         G         C         T
  A    0.246961  0.001013  0.001013  0.001013
  G    0.001013  0.246961  0.001013  0.001013
  C    0.001013  0.001013  0.246961  0.001013
  T    0.001013  0.001013  0.001013  0.246961

(a’) Products $p_i p_j$ with biased frequencies (A = 0.49, G = 0.01, C = 0.01, T = 0.49)
            A (0.49)   G (0.01)   C (0.01)   T (0.49)
  A (0.49)  0.2401     0.0049     0.0049     0.2401
  G (0.01)  0.0049     0.0001     0.0001     0.0049
  C (0.01)  0.0049     0.0001     0.0001     0.0049
  T (0.49)  0.2401     0.0049     0.0049     0.2401

(b’) Match-mismatch matrix: same as (b)

(c’) $p_i p_j e^{\lambda s_{ij}}$ with λ = 0.658295 (sum = 1)
       A         G         C         T
  A    0.463752  0.00068   0.00068   0.03332
  G    0.00068   0.000193  1.39E-05  0.00068
  C    0.00068   1.39E-05  0.000193  0.00068
  T    0.03332   0.00068   0.00068   0.463752

BLAST parameters λ, K and H are computed for each BLAST database created.
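The defining equation for λ has no closed form for general frequencies, so it is solved numerically. The sketch below uses plain bisection in Python; this is a technique swapped in for illustration (the slides use trial values in a spreadsheet), and the function names and the bracket [1e-6, 10] are assumptions. Bisection applies because, when the expected score is negative, the left-hand side minus 1 is negative just above λ = 0 and positive for large λ.

```python
import math

def lambda_residual(lam, freqs, scores):
    """Return sum_i sum_j p_i * p_j * exp(lam * s_ij) - 1."""
    total = 0.0
    for i, p_i in enumerate(freqs):
        for j, p_j in enumerate(freqs):
            total += p_i * p_j * math.exp(lam * scores[i][j])
    return total - 1.0

def solve_lambda(freqs, scores, lo=1e-6, hi=10.0, tol=1e-10):
    """Find the positive root of lambda_residual by bisection."""
    # The bracket must straddle the nontrivial root (lambda = 0 is excluded).
    if lambda_residual(lo, freqs, scores) >= 0 or lambda_residual(hi, freqs, scores) <= 0:
        raise ValueError("bracket does not contain the positive root")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if lambda_residual(mid, freqs, scores) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Default nucleotide scores: match = 1, mismatch = -3 (order A, G, C, T).
match_mismatch = [[1 if i == j else -3 for j in range(4)] for i in range(4)]

lam_equal = solve_lambda([0.25, 0.25, 0.25, 0.25], match_mismatch)
lam_biased = solve_lambda([0.49, 0.01, 0.01, 0.49], match_mismatch)

print(round(lam_equal, 6), round(lam_biased, 6))
```

The two results correspond to the λ values listed with panels (c) and (c’); passing a different 4 × 4 score matrix or frequency vector covers other cases in the same way.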
Finding λ III: different frequencies, s/v

Products $p_i p_j$ with frequencies A = 0.1, G = 0.4, C = 0.4, T = 0.1
           A (0.1)  G (0.4)  C (0.4)  T (0.1)
  A (0.1)  0.01     0.04     0.04     0.01
  G (0.4)  0.04     0.16     0.16     0.04
  C (0.4)  0.04     0.16     0.16     0.04
  T (0.1)  0.01     0.04     0.04     0.01

Match-mismatch matrix
       A    G    C    T
  A    1   -1   -3   -3
  G   -1    1   -3   -3
  C   -3   -3    1   -1
  T   -3   -3   -1    1

$p_i p_j e^{\lambda s_{ij}}$ with λ = 0.9899 (sum = 1.000046)
       A         G         C         T
  A    0.02691   0.014865  0.002053  0.000513
  G    0.014865  0.430554  0.008211  0.002053
  C    0.002053  0.008211  0.430554  0.014865
  T    0.000513  0.002053  0.014865  0.02691

Finding K: equal frequencies, (1, -3)

Products $p_i p_j$ with equal frequencies (A = G = C = T = 0.25): every entry is 0.0625.

Match-mismatch matrix
       A    G    C    T
  A    1   -3   -3   -3
  G   -3    1   -3   -3
  C   -3   -3    1   -3
  T   -3   -3   -3    1

$p_i p_j e^{\lambda s_{ij}}$ with λ set to 1 (sum = 0.716911)
       A         G         C         T
  A    0.169893  0.003112  0.003112  0.003112
  G    0.003112  0.169893  0.003112  0.003112
  C    0.003112  0.003112  0.169893  0.003112
  T    0.003112  0.003112  0.003112  0.169893

Double-click the table, copy it to EXCEL, and find λ by using Solver.
Slide 21