Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Sequence Alignment II K-tuple methods Statistics of alignments Database searches What is the problem? Large number of sequences to search your query sequence against. Various indexing schemes and heuristics are used, one of which is BLAST. heuristic is a technique to solve a problem that ignores whether the solution can be proven to be correct, but usually produces a good solution, are intended to gain computational performance or conceptual simplicity potentially at the cost of accuracy or precision. http://en.wikipedia.org/wiki/Heuristics#Computer_science K-tuple methods http://creativecommons.org/licenses/by-sa/2.0/ Concepts of Sequence Similarity Searching The premise: The sequence itself is not informative; it must be analyzed by comparative methods against existing databases to develop hypothesis concerning relatives and function. Important Terms for Sequence Similarity Searching with very different meanings Similarity Identity The extent to which nucleotide or protein sequences are related. In BLAST similarity refers to a positive matrix score. The extent to which two (nucleotide or amino acid) sequences are invariant. Homology Similarity attributed to descent from a common ancestor. Sequence Similarity Searching: The Approach Sequence similarity searching involves the use of a set of algorithms (such as the BLAST programs) to compare a query sequence to all the sequences in a specified database. Comparisons are made in a pairwise fashion. Each comparison is given a score reflecting the degree of similarity between the query and the sequence being compared. Blast QUERY sequence(s) BLAST results BLAST program BLAST database Topics: There are different blast programs Understanding the BLAST algorithm BLAST program Word size HSPs (High Scoring Pairs) Understanding BLAST statistics The alignment score (S) Scoring Matrices Dealing with gaps in an alignment The expectation value (E) The BLAST algorithm The BLAST programs (Basic Local Alignment Search Tools) are a set of sequence comparison algorithms introduced in 1990 for optimal local alignments to a query. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410. Altschul SF, Madden TL, Schaeffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.” NAR 25:3389-3402. http://www.ncbi.nlm.nih.gov/BLAST blastn blastx tblastn tblastx blastp Other BLAST programs BLAST 2 Sequences (bl2seq) Aligns two sequences of your choice Gives dot-plot like output More BLAST programs BLAST against genomes Many available BLAST parameters pre-optimized Handy for mapping query to genome Search for short exact matches BLAST parameters pre-optimized Great for checking probes and primers How Does BLAST Work? The BLAST programs improved the overall speed of searches while retaining good sensitivity (important as databases continue to grow) by breaking the query and database sequences into fragments ("words"), and initially seeking matches between fragments. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of “T". Picture used with permission from Chapter 11 of “Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins” Each BLAST “hit” generates an alignment that can contain one or more high scoring pairs (HSPs) Each BLAST “hit” generates an alignment that can contain one or more high scoring pairs (HSPs) Where does the score (S) come from? The quality of each pair-wise alignment is represented as a score and the scores are ranked. Scoring matrices are used to calculate the score of the alignment base by base (DNA) or amino acid by amino acid (protein). The alignment score will be the sum of the scores for each position. What’s a scoring matrix? Substitution matrices are used for amino acid alignments. These are matrices in which each possible residue substitution is given a score reflecting the probability that it is related to the corresponding residue in the query. PAM vs. BLOSUM scoring matrices BLOSUM 62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix. PAM vs BLOSUM scoring matrices The PAM Family The BLOSUM family PAM matrices are BLOSUM matrices are based on global based on local alignments. alignments of closely BLOSUM 62 is a matrix related proteins. calculated from The PAM1 is the matrix comparisons of sequences calculated from with no less than 62% comparisons of divergence. sequences with no All BLOSUM matrices are more than 1% based on observed divergence. alignments; they are not Other PAM matrices extrapolated from are extrapolated from comparisons of closely PAM1. related proteins. What happens if you have a gap in the alignment? A gap is a position in the alignment at which a letter is paired with a null Gap scores are negative. Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is frequently ascribed more significance than the length of the gap. Hence the gap is penalized heavily, whereas a lesser penalty is assigned to each subsequent residue in the gap. Percent Sequence Identity The extent to which two nucleotide or amino acid sequences are invariant AC C TG A G – AG AC G TG – G C AG mismatch indel 70% identical BLAST algorithm Keyword search of all words of length w in the query of default length n in database of length m with score above threshold w = 11 for nucleotide queries, 3 for proteins Do local alignment extension for each hit of keyword search Extend result until longest match above threshold is achieved and output BLAST algorithm (cont’d) keyword Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD GVK 18 GAK 16 Neighborhood GIK 16 words GGK 14 neighborhood GLK 13 score threshold GNK 12 (T = 13) GRK 11 GEK 11 GDK 11 extension Query: 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60 +++DN +G + IR L G+K I+ L+ E+ RG++K Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263 High-scoring Pair (HSP) Local alignment Find the best local alignment between two strings, over the recurrence: si , j 0 s i 1, j (vi ,) max si , j 1 (, w j ) s i 1, j 1 (vi , w j ) Local alignment (cont’d) Input: strings v and w and scoring matrix Output: substrings of v and w whose global alignment as defined by , is maximal among all global alignments of all substrings of v and w Original BLAST Dictionary All words of length w Alignment Ungapped extensions until score falls below statistical threshold T Output All local alignments with score > statistical threshold Original BLAST: Example From lectures by Serafim Batzoglou (Stanford) A C G A A G T A A G G T C C A G T C T G A T C C T G G A T T G C G A • w = 4, T = 4 • Exact keyword match of GGTC • Extend diagonals with mismatches until score falls below a threshold • Output result GTAAGGTCC GTTAGGTCC Gapped BLAST: Example Original BLAST exact keyword search, THEN: Extend with gaps in a zone around ends of exact match Output result GTAAGGTCCAGT GTTAGGTC-AGT From lectures by Serafim Batzoglou (Stanford) C T G A T C C T G G A T T G C G A A C G A A G T A A G G T C C A G T Gapped BLAST : Example (cont’d) Original BLAST exact keyword search, THEN: Extend with gaps around ends of exact match until score <T, then merge nearby alignments Output result GTAAGGTCCAGT GTTAGGTC-AGT From lectures by Serafim Batzoglou (Stanford) C T G A T C C T G G A T T G C G A A C G A A G T A A G G T C C A G T Topics: The different blast databases provided by the NCBI BLAST databases Protein databases Nucleotide databases Genomic databases Considerations for choosing a BLAST database Custom databases for BLAST BLAST protein databases available at through blastp web interface @ NCBI blastp db Considerations for choosing a BLAST database First consider your research question: Are you looking for an ortholog in a particular species? Are you looking for additional members of a protein family across all species? BLAST against the genome of that species. BLAST against nr, if you can’t find hits check wgs, htgs, and the trace archives. Are you looking to annotate genes in your species of interest? BLAST against known genes (RefSeq) and/or ESTs from a closely related species. When choosing a database for BLAST… It is important to know your reagents. Changing your choice of database is changing your search space completely Database size affects the BLAST statistics record BLAST parameters, database choice, database size in your bioinformatics lab book, just as you would for your wet-bench experiments. Databases change rapidly and are updated frequently It may be necessary to repeat your analyses Topics: BLAST results Choosing the right BLAST program Running a blastp search BLAST parameters and options to consider Viewing BLAST results Look at your alignments Using the BLAST taxonomy report BLAST parameters and options to consider: conserved domains Entrez query E-value cutoff Word size More BLAST parameters and options to consider: filtering matrix gap penalities Run your BLAST search: BLAST The BLAST Queue: click for more info Note your RID Formatting and Retrieving your BLAST results: Results options A graphical view of your BLAST results: The BLAST “hit” list: Score E-Value GenBank alignment EntrezGene The BLAST pairwise alignments Identity Similarity Sample BLAST output • Blast of human beta globin protein against zebra fish Sequences producing significant alignments: Score E (bits) Value gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio] >gi|147757... gi|18858331|ref|NP_571096.1| ba2 globin; SI:dZ118J2.3 [Danio rer... gi|37606100|emb|CAE48992.1| SI:bY187G17.6 (novel beta globin) [D... gi|31419195|gb|AAH53176.1| Ba1 protein [Danio rerio] 171 170 170 168 ALIGNMENTS >gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio] Length = 148 Score = 171 bits (434), Expect = 3e-44 Identities = 76/148 (51%), Positives = 106/148 (71%), Gaps = 1/148 (0%) Query: 1 Sbjct: 1 Query: 61 Sbjct: 61 MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK 60 MV T E++A+ LWGK+N+DE+G +AL R L+VYPWTQR+F +FG+LS+P A+MGNPK MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWTQRYFATFGNLSSPAAIMGNPK 60 VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG 120 V AHG+ V+G + ++DN+K T+A LS +H +KLHVDP+NFRLL + + A FG VAAHGRTVMGGLERAIKNMDNVKNTYAALSVMHSEKLHVDPDNFRLLADCITVCAAMKFG 120 Query: 121 KE-FTPPVQAAYQKVVAGVANALAHKYH 147 + F VQ A+QK +A V +AL +YH Sbjct: 121 QAGFNADVQEAWQKFLAVVVSALCRQYH 148 3e-44 7e-44 7e-44 3e-43 Sample BLAST output (cont’d) • Blast of human beta globin DNA against human DNA Sequences producing significant alignments: Score E (bits) Value gi|19849266|gb|AF487523.1| Homo sapiens gamma A hemoglobin (HBG1... gi|183868|gb|M11427.1|HUMHBG3E Human gamma-globin mRNA, 3' end gi|44887617|gb|AY534688.1| Homo sapiens A-gamma globin (HBG1) ge... gi|31726|emb|V00512.1|HSGGL1 Human messenger RNA for gamma-globin gi|38683401|ref|NR_001589.1| Homo sapiens hemoglobin, beta pseud... gi|18462073|gb|AF339400.1| Homo sapiens haplotype PB26 beta-glob... 289 289 280 260 151 149 1e-75 1e-75 1e-72 1e-66 7e-34 3e-33 ALIGNMENTS >gi|28380636|ref|NG_000007.3| Homo sapiens beta globin region (HBB@) on chromosome 11 Length = 81706 Score = 149 bits (75), Expect = 3e-33 Identities = 183/219 (83%) Strand = Plus / Plus Query: 267 ttgggagatgccacaaagcacctggatgatctcaagggcacctttgcccagctgagtgaa 326 || ||| | || | || | |||||| ||||| ||||||||||| |||||||| Sbjct: 54409 ttcggaaaagctgttatgctcacggatgacctcaaaggcacctttgctacactgagtgac 54468 Query: 327 ctgcactgtgacaagctgcatgtggatcctgagaacttc 365 ||||||||| |||||||||| ||||| |||||||||||| Sbjct: 54469 ctgcactgtaacaagctgcacgtggaccctgagaacttc 54507 What do the Score and the evalue really mean? The quality of the alignment is represented by the Score. Score (S) The score of an alignment is calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table (PAM, BLOSUM) whereas gap scores are assigned empirically . The significance of each alignment is computed as an E value. E value (E) Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. E value E value (E) Expectation value. The number of different alignments with scores equivalent to or better than S expected to occur in a database search by chance. The lower the E value, the more significant the score. Assessing sequence homology Need to know how strong an alignment can be expected from chance alone “Chance” is the comparison of Real but non-homologous sequences Real sequences that are shuffled to preserve compositional properties Sequences that are generated randomly based upon a DNA or protein sequence model (favored) High Scoring Pairs (HSPs) All segment pairs whose scores can not be improved by extension or trimming Need to model a random sequence to analyze how high the score is in relation to chance Expected number of HSPs Expected number of HSPs with score > S E-value E for the score S: E = Kmne-lS Given: Two sequences, length n and m The statistics of HSP scores are characterized by two parameters K and λ K: scale for the search space size λ: scale for the scoring system BLAST statistics to record in your bioinformatics labbook Record the statistics that are found at bottom of your BLAST results page Scoring matrices Amino acid substitution matrices PAM BLOSUM Bit Scores Normalized score to be able to compare sequences Bit score S’ = lS – ln(K) ln(2) E-value of bit score E = mn2-S’ Assessing the significance of an alignment How to assess the significance of an alignment between the comparison of a protein of length m to a database containing many different proteins, of varying lengths? Calculate a "database search" E-value. Multiply the pairwise-comparison E-value by the number of sequences in the database N divided by the length of the sequence in the database n Homology: Some Guidelines Similarity can be indicative of homology Generally, if two sequences are significantly similar over entire length they are likely homologous Low complexity regions can be highly similar without being homologous Homologous sequences not always highly similar Homology: Some Guidelines Suggested BLAST Cutoffs (source: Chapter 11 – Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins) For nucleotide based searches, one should look for hits with E-values of 10-6 or less and sequence identity of 70% or more For protein based searches, one should look for hits with E-values of 10-3 or less and sequence identity of 25% or more Contributors Special thanks to David Wishart, Andy Baxevanis, Stephanie Minnema, Sohrab Shah, and Francis Ouellette for their contributions to these materials http://creativecommons.org/licenses/by-sa/2.0/ FASTA A FASTA search begins by breaking the search sequence into words. For genomic sequences, a word size of 4 or 6 nucleotides is used; 1 or 2 for polypeptide sequences. FASTA Next a table is constructed for the query sequence (word size is 1): E.g. FAMLGFIKYLPGCM A C D E F G H I 2 K L M N P Q R S T V W Y FASTA Next a table is constructed for the query sequence: E.g. FAMLGFIKYLPGCM A C D E F G H I 2 13 K L M N P Q R S T V W Y FASTA Next a table is constructed for the query sequence: E.g. FAMLGFIKYLPGCM A C D E F G H I 2 13 1 6 K L M N P Q R S T V W Y FASTA Next a table is constructed for the query sequence: E.g. FAMLGFIKYLPGCM A C D E F G H I 2 13 1 5 6 12 K L M N P Q R S T V W Y FASTA Next a table is constructed for the query sequence: E.g. FAMLGFIKYLPGCM A C D E F G H I 2 13 1 5 6 12 7 K L M N P Q R S T V W Y FASTA The table for the query sequence is complete: E.g. FAMLGFIKYLPGCM A C D E F G H I K L M N P Q R S T V W Y 2 8 13 1 5 6 12 7 4 3 10 14 11 9 FASTA Compare the query sequence table with the target sequence: Query: FAMLGFIKYLPGCM Index of Gs are 5 and 12 Target: TGFIKYLPGACT Index of Gs are 2 and 9 Subtract 2 from 5 and 12; producing 3 and 10 Subtract 9 from 5 and 12; producing -4 and 3 A C D E F G H I K L M N P Q R S T V W Y 2 8 13 1 5 7 6 12 4 3 11 9 10 14 1 2 3 4 5 6 7 8 9 10 11 12 T G F I K Y L P G A C T 3 -4 10 3 FASTA Compare the query sequence table with the target sequence: Query: FAMLGFIKYLPGCM Index of Fs are 1 and 6 Target: TGFIKYLPGACT Index of F is 3 Subtract 3 from 1 and 6; producing -2 and 3 A C D E F G H I K L M N P Q R S T V W Y 2 8 13 1 5 6 12 7 4 3 11 9 10 14 1 2 3 4 5 6 7 8 9 10 11 12 T G F I K Y L P G A C T 3 -2 -4 10 3 3 FASTA Compare the query sequence table with the target sequence: Query: FAMLGFIKYLPGCM Index of Fs are 1 and 6 Target: TGFIKYLPGACT Index of F is 3 Subtract 3 from 1 and 6; producing -2 and 3 A C D E F G H I K L M N P Q R S T V W Y 2 8 13 1 5 6 12 7 4 3 11 9 10 14 1 2 3 4 5 6 7 8 9 10 11 12 T G F I K Y L P G A C T 3 -2 3 3 3 -3 3 -4 -8 2 10 3 3 3 FASTA FAMLGFIKYLPGCM |||||||| TGFIKYLPGACT Offset by 3 A C D E F G H I K L M N P Q R S T V W Y 2 8 13 1 5 6 12 7 4 3 11 9 10 14 1 2 3 4 5 6 7 8 9 10 11 12 T G F I K Y L P G A C T 3 -2 3 3 3 -3 3 -4 -8 2 10 3 3 3 Fasta (word size = 2) Database searches Odds score in sequence alignment The chance of an aligned amino acid pair being found in alignments of related sequences compared to the chance of that pair being found in random alignments of unrelated sequences. Statistical significance of an alignment The probability that random or unrelated sequences could be aligned to produce the same score. Smaller the probability is the better. Probability What is the probability that a coin toss will yield a head? What is the probability that the next pair of nucleotides will be a ‘match’ or ‘mismatch’? Bernoulli trials A series of n number of independent trials with the same outcome probabilities and number of choices (e.g., head or tail; or match (m) or mismatch (mi)). P(hhhhh) P(mmmmm) Head or Tail.. Longest run of heads or tails Longest run of heads one would get in a random series of coin tosses? Fair coin, p = 0.5; 1/p = 1/0.5 = 2 Erdös and Rènyi longest run = log1/p(n) If n = 100; longest run 6.65 Alignment analogy You have two sequences a and b of equal length a 1a 2a 3a 4 b 1b 2b 3b 4 if an = bn; then it is head (match) If an does not equal to bn then it is tail (mismatch) Alignment Statistics: For two sequences of length n and m, n times m comparisons are being made; thus the longest length of the predicted match would be log1/p(mn). Alignment Statistics: Expectation value or the mean longest match would be E(M) = log1/p(Kmn), where K is a constant that depends on amino acid or base composition and p is the probability of a match. This is only true for ungapped local alignments. Distribution of alignment scores resembles Gumbel extreme value distribution. Extreme Value Distribution Extreme Value Distribution In this distribution, the probability of a score being higher than x is given by: •m and n are the lengths of the sequences compared •K and l can be calculated from the data in the matrix used and from the relative frequencies of the amino acids (or nucleotides) Alignment Statistics: For two sequences of length n and m, n times m comparisons are being made; thus the longest length of the predicted match would be log1/p(mn). For a pair of random DNA sequences of length 100 and p = 0.25 (equal A,T,C,G), the longest expected run of matches would be: 2 x log1/p (n) = 2 x log4 100 = 6.65 Alignment Statistics E(M)=log1/p(Kmn) means that match length gets bigger as the log of the product of sequence lengths. Amino acid substitution matrices will turn match lengths into alignment scores (S). More commonly l = ln(1/p) is used. Number of longest run HSP will be estimated E = Kmne-lS How good a sequence score is evaluated based on how many HSPs (i.e. E value) one would expect for that score. Alignment Statistics Two ways to get K and l : For 10000 random amino acid sequences with various gap penalties, K and lambda parameters have been tabulated. Calculation of the distribution for two sequences being aligned by keeping one of them fixed and scrambling the other, thus preserving both the sequence length and amino acid composition. Generate random sequences You may use the function randperm >> help randperm RANDPERM(n) is a random permutation of the integers from 1 to n. For example, RANDPERM(6) might be [2 4 5 6 1 3]. Align a sequence with its randomly permuted state >> x = 'atagacagacca' >> l = length (x) l= 12 >> ind = randperm(12) ans = Columns 1 through 9 9 4 5 7 3 11 2 Columns 10 through 12 10 1 12 >> y = x(ind) y= agaaactgccaa >> align1 atagacagacca agaaactgccaa 8 6 Alignment Statistics Alignment Statistics Alignment Statistics Alignment Statistics Probability Distributions: Binomial Distribution The number of an event (x) in n trials is given by binomial distribution: Binomial coefficient probability Probability of event 1 n, p, and q are constant x varies n and x are discrete p+q = 1 Probability of event 1 Binomial Distribution Only two outcomes are possible on each of n trials. The probability of success for each trial is constant (p, and q does not change). All trials are independent of each other. Matlab: binopdf function Y = binopdf(x,n,p) Where x equals the number of successes (outcome), n is the total possible number of trials, P is the probability of one type of outcome. Matlab: binopdf function >> x = 0:10 % from 0, 1,2, ...,10 number of trials >> y = binopdf(x,10,0.5) % calculate pdf >> plot(x,y,'+') %plot n over y using + sign Binomial probability density function Applications Calculate the probability of a couple’s (mother AA and father AB genotype) 2 of 10 children having AB blood type? n = 10 x=2 p = 0.5 q = 0.5 % total number of children % number of children with AB blood % probability of having AB genotype % probability of having AA genotype Matlab >> p = 0.5; >> q = 1-q; >> n = 10; >> x = 2; >> fn = factorial(n); >> fx = factorial(x); >> fnminusx = factorial(n-x); >> binocoef = fn./(fx.*fnminusx) >> Pr = binocoef*p^n*q^(N-n) Use parentheses in order to determine order in calculations >> p = 0.5; >> q = 1-q; >> n = 10; >> x = 2; >> fn = factorial(n); >> fx = factorial(x); >> fnminusx = factorial(n-x); >> binocoef = fn./fx.*fnminusx >> Pr = binocoef*p^n*q^(N-n) Try this! >> n = 1:100; >> y = binopdf(n,100,0.5); >> plot(n,y,'+') Binomial distribution Binomial Cumulative Distribution Function Adds the probability value of the previous case to the next. >> x = 0:10 >> n = 10 >> p = 0.5 >> y = binocdf(x,n,p) >> plot(x,y,'r+') Cumulative distribution Expected value = mean value The mean or expected value of an outcome (e.g., getting an H from a coin toss) for n trials would be E(H) = np p = E(H)/n 2 = np(1-p) Null hypothesis in statistics States equality (or in cases greater than or less than) between observed and an expected value To test a null hypothesis: perform a statistical test calculate a p value reject or do not reject the null hypothesis using a threshold. Example If a baseball team plays 162 games in a season and has a 50-50 chance of winning any game (p = winning = 0.5; q = losing = 0.5), then the probability of that team winning more than 100 games in a season is: >> 1 - binocdf(100,162,0.5) The result is 0.001 (i.e., 1-0.999). If a team wins 100 or more games in a season, this result suggests that it is likely that the team's true probability of winning any game is greater than 0.5. Example In a population of Drosophila, the frequency of AA genotype is p (0.5) and the frequency of AB genotype is q (0.5). If you sample from this population the number of AA or AB individuals in the sampled population will be a function of their relative frequencies and the sample size (n). If n individuals are selected and x number of AB individuals are found, is this number greater or less than what could be obtained by chance alone? >> binopdf(7,10,0.5) ans = 0.1172 >> binopdf(70,100,0.5) ans = 2.3171e-005 Normal Distribution A standard normal distribution will have a mean of 0 and variance of 1. Normal Probability Distribution >> x = -5:0.05:5; >> y = normpdf(x); >>plot(x,y) Plot(x,y) Normal cumulative distribution What is the probability that an observation from a standard normal distribution will fall on the interval [-1 1]? >>p = normcdf([-1 1]); >>p(2) - p(1) ans = 0.6827 PAM-2 PAM-250 PAM-250 PAM-250 PAM-250 PAM-250 PAM-250 Multiple Sequence Alignment Multiple Sequence Alignment MegaBLAST megaBLAST For aligning sequences which differ slightly due to sequencing errors etc. Very efficient for long query sequences Uses big word (k-tuple) sizes to start search Very fast Accepts batch submissions of ESTs Can upload files of sequences as queries More detailed info: see megaBLAST pages P-values The probability of finding b HSPs with a score >=S is given by: For b = 0, that chance is: (e-EEb)/b! e-E Thus the probability of finding at least one such HSP is: P = 1 – e-E Alignment Statistics Alignment Statistics