* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Lecture4 - Department of Computer Science
Survey
Document related concepts
No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Human genome wikipedia , lookup
DNA barcoding wikipedia , lookup
Genetic code wikipedia , lookup
Point mutation wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Computational phylogenetics wikipedia , lookup
Metagenomics wikipedia , lookup
United Kingdom National DNA Database wikipedia , lookup
Multiple sequence alignment wikipedia , lookup
Transcript
BLAST and searching sequence databases Dr Alexei Drummond Department of Computer Science [email protected] BIOSCI 359, Semester 2, 2006 Sequence database searching Sequence Homology • Homologous protein or DNA sequences share common ancestry – A statement of homology is therefore an evolutionary hypothesis • Homology need not imply similar function • Homology is a binary property, a pair of sequences are either homologous or not homologous. x a, b homologous t a b x y a, b not homologous – No such thing as degree of homology • Homology is often inferred by sequence similarity a b 2 Sequence database searching Orthology and paralogy "Where the homology is a result of gene duplication so that both copies have descended side by side during the history of an organism, (for example, alpha and beta hemoglobin) the genes should be called paralogous (para=in parallel). Where the homology is the result of speciation so that the history of the gene reflects the history of the species (for example alpha hemoglobin in man and mouse) the genes should be called orthologous (ortho=exact). " Fitch WM. Distinguishing homologous from analogous proteins. Systematic Zoology 1970 Jun;19(2):99-113. 3 Sequence database searching Orthology and paralogy 4 Sequence database searching Orthology, paralogy and multigene families Reproduced from NCBI education website 5 Sequence database searching What are good scores for searching databases? • We want these scores to distinguish the related sequences from the unrelated sequences • So we select alignment parameters for database searching that give us the best distinguishing scores • These may not be the parameters that will give us the most accurate alignment 6 Sequence database searching Database searching is an experiment • “Database searching is the application of knowledge gained from previous experiments to the problem of discovering the biochemistry and physiology of a newly discovered gene or its protein.” • It demands the same careful thought and execution as your bench or laboratory investigations! • Garbage in, garbage out 7 Sequence database searching Sources of previous knowledge • Similarity scores - PAM, Blosum similarity matrices – Rescue us from having to assume that all amino acid changes are equally likely and equally harmful – Different similarity matrices are appropriate for different degrees of evolutionary divergence – More later 8 Sequence database searching Second source of previous knowledge • Computer Algorithm – Dynamic programming is most sensitive and the least selective • Local, global, repeats, overlap – BLAST and FASTA are much faster and more selective which can be an advantage – No program is always best at finding distantly related sequences for all gene or protein families, but dynamic programming is guaranteed to give optimal alignment for given scores 9 Sequence database searching Third source of previous knowledge • Database itself – Large store of previously acquired knowledge – Making the best use of this knowledge can save you many months of expensive laboratory experimentation – The size of this potential gain is the determining factor in deciding how much effort to devote to any particular database search 10 Sequence database searching Database search assumptions • The sequences sought have an evolutionary ancestral sequence in common with the “query sequence” • All substitutions are not equally likely and should be weighted to account for this • Insertions and deletions are less likely than substitutions and should be weighted to account for this. 11 Sequence database FASTA searching FASTA • 2 step algorithm – (1) word search, using a specific word size, finds regions with a high number of identical word matches – (2) Smith-Waterman alignment centered on these regions and bounded by a window size which limits the number of insertions or deletions one sequence can accumulate with respect to the other sequence 12 FASTA Heuristics Sequence database FASTA searching • Heuristic approximation to Smith-Waterman – Runs faster – Loses some sensitivity • Restrictions on the model of sequence evolution – First Heuristic - Word size parameter -usually 2 for proteins and 6 for nucleic acids - FASTA constrains the evolution between a pair of sequences to preserve a number of unchanged dipeptides or hexanucleotides 13 Sequence database FASTA searching FASTA Algorithm • divide query sequence into its constituent overlapping words of length two for proteins or six for nucleic acids • each sequence in the database is also broken up in the same way • Two word lists are compared to find all identical words in both sequences • An initial score is computed based on how many identities are based within a small region of the dot plot. 14 Sequence database FASTA searching FASTA Algorithm cont. • If the initial score is high enough, a second score is computed by evaluating which initial identities can be joined into a consistent alignment using only gaps of less then the window size. • If the secondary score is high enough, then a SmithWaterman alignment is performed within the same region of the dot plot defined by the concentrated identities and using the same window-size. • This third score is reported as the optimal score. 15 Sequence database FASTA searching Creating a Word List g g c c c t t t t g g g g g g g g g g g a a a a a a a a g g c a t a a a a a a g g g g g g g g g c c c a a t 16 Sequence database FASTA searching First Pathological Example • Two proteins that share 50% identity - but the proper alignment consists of alternating match and mismatches. • With a word size of two, there would be no matches along the main diagonal of the dot plot and the proper alignment would not be found. 17 Sequence database FASTA searching Second Pathological Example • Two proteins that are almost identical, except the second protein has a 20 residue insertion into the middle of the sequence. • If the window size is 15, then the SmithWaterman alignment phase of FASTA will align the protein to either the sequence prior to or following the insertion, thus missing the fact that the proteins were basically identical (with only one long insertion). 18 Sequence database FASTA searching Second Heuristic • Window size is the second heuristic used by FASTA. • Its effect is more variable then word size. • If the best alignment, as defined by a full SmithWaterman analysis, goes outside the window then a lower scoring alignment will be found by FASTA. This will lead users to conclude the sequences are not homologous when in fact they are and the homology could have been inferred from a full Smith-Waterman alignment. • In practice these pathological cases are very unlikely. However similar cases do occur and loss of sensitivity caused by the use of these heuristics will be seen. 19 Sequence database BLAST searching BLAST • Approximates a simplification of SmithWaterman known as the maximal segment pairs (MSP) algorithm. • MSP alignments do not allow gaps and are specified by an equation that includes only the first and fourth terms of the SmithWaterman equation. • MSP alignment’s statistics are well understood and so we can compute a significance probability. 20 Sequence database BLAST searching Significance Probabilities • Thus the evolutionary model requires that there be a fairly long stretch of sequence that has evolved without insertions or deletions, or at least with a complimentary pattern of insertions and deletions that do not significantly disrupt the alignment • Recent advances in MSP statistics allow the use of several independent segment alignments to be used in evaluating significance probability. 21 Sequence database BLAST searching Brief Comparison of BLAST and FASTA • BLAST is less sensitive than Smith-Waterman but therefore more selective. • For proteins BLAST is more sensitive than FASTA even though BLAST uses a word size of 3 for proteins while FASTA uses a word size of 2. • BLAST uses a word size of 11 for nucleic acids. The recent modifications which make it more sensitive for proteins do not seem to work for nucleic acids. • So therefore FASTA should be used instead of BLAST when searching for nucleic acids. 22 Sequence database BLAST searching BLAST Algorithm • It creates a word list (same as FASTA). • Then it expands this list in order to recover sensitivity lost by only using exact matches. • Any word that scores at least a minimum threshold (T) when aligned with any of the initial list of words is added to the list. • BLAST than examines the database for words that exactly match any word in the expanded word list. • Equivalent to looking for gapless alignments of score at least T in the database. 23 Sequence database BLAST searching BLAST example • The example shows an expanded list of 47 words from the original 7. The expanded list contains any word that scores at least eight when aligned with the initial word and scored with the PAM 120 similarity table. q q l l l n n n f f f s s s a a a g g g w w Initial Word ql ln nf fs sa ag gw Expanded List ql,qm, hl,zl ln,lb nf,af,ny ,df,qf,ef,gf,hf,kf,sf,tf,bf,zf fs,fa,fn,fd,fg,fp,ft,fb,ys no words score more than 8 including sa ag gw,aw,rw,nw,dw ,qw,ew,hw ,iw,kw,mw, pw,sw,tw,vw ,bw,zw,xw 24 Sequence database BLAST searching BLAST Paradox • Notice that there is no word that scores 8 or more when aligned with the initial word “sa”, even the word “sa” itself. • This situation does occur in actual BLAST searches. • The user has the option to force the initial word into the final list. 25 Sequence database BLAST searching BLAST default • The default is to not include such low scoring words because they contain so little information that they are unlikely to be critical in finding a maximal segment pair alignment. • BLAST has a word length of 3 for protein searches with a threshold score of T=13 using the Blosum62 similarity scoring matrix. 26 Sequence database BLAST searching Blast final step • The occurrence of a word hit is followed by an attempt to find a locally optimal ungapped alignment. • This is accomplished by accumulating the score as the alignment is extended in both directions. • When a run of mostly negative scores is encountered, the cumulative score will drop substantially. When this happens it is unlikely that the score will rebound. • This observation provides the basis for an additional heuristic whereby the extension of a hit is terminated when the reduction in score exceeds a dropoff threshold. • The local alignment with the highest score is returned. 27 Sequence database BLAST searching Improvements to BLAST • The growth of the sequence database has raised the minimum score and hence the length of alignment that must be found by BLAST for a match to be significant. • Speed and sensitivity can be improved by requiring the algorithm find two matches above some (lower) threshold rather that one match. Both matches must be on the same diagonal. 28 Sequence database BLAST searching New BLAST settings • The increase in speed results from fewer sequences which are completely evaluated. • BLAST now looks for 2 words of length 3, that each score at least 11 using Blosum62. The matches must be within 40 amino acids on the same diagonal. • As the database grows new techniques will need to be constantly devised. 29 Sequence database BLAST searching Gapped BLAST • Builds the alignment out from a central high scoring pair of aligned amino acids analogous to the way BLAST extends the initial maximal segment pair alignment. • The initial pair of amino acids is chosen as the middle pair of the highest scoring window of 11 amino acids. • Smith-Waterman alignment is extended in all directions in the path graph until it falls below a fixed percentage of the highest score yet computed in the SmithWaterman phase. 30 Sequence database BLAST searching Guarantees on Gapped Blast • Will find the best scoring Smith-Waterman alignment if: – The calculation is extended until a score of 0 is reached. Stopping earlier accepts a small risk of not finding the complete alignment in return for a very large savings in computer resources. – The initial pair of amino acids selected as the midpoint must actually be part of the alignment the would be reported from a full Smith-Waterman alignment. 31 Sequence database BLAST searching BLAST Warning • Before publishing an alignment: prudent to do a complete Smith-Waterman analysis. • Further it is suggested to make use of the Waterman-Eggert extensions to SmithWaterman (MaxSegs algorithm) in order to look at the best several independent local alignments and to examine each sequence for repeated motifs. 32 Sequence database BLAST searching Finding Distant Homologies • Many functionally and evolutionary important protein similarities are recognizable only through comparison of three-dimensional structures. • When not available, patterns of conservation identified from the alignment of related sequences can aid the recognition of distant similarities • These patterns are called motifs, profiles, positionspecific score matrices, and Hidden Markov Models. 33 Sequence database BLAST searching PSI-BLAST • Position-Specific Iterative BLAST • Designed to detect weak relationships by using a profile that is constructed automatically from the multiple alignment of the highest scoring hits in the initial BLAST search. • The profile is created by calculating position-specific scores for ever amino acid at every position in the alignment. 34 Sequence database BLAST searching How it works • If a residue is highly conserved at a particular position, it will receive a high score, and others will be assigned high negative scores. • At weakly conserved positions all residues receive scores near 0. • Position specific scores can also be assigned to potential insertions and deletions 35 Sequence database BLAST searching Iteration • The power of profile methods can be further enhanced through iteration of the search procedure. • After a profile is run against a database, new similar sequences can be detected. In each iteration: – A new multiple alignment, which includes these new sequences can be constructed. – A new profile abstracted. – A new database search performed. • The procedure can be iterated as often as desired or until convergence (when no new statistically significant sequences are detected). 36 Sequence database BLAST searching Design Goals of PSI-BLAST • Speed, simplicity, automatic operation • Unlike most profile-based search methods, PSIBLAST runs one program starting with a single protein sequence as input, and the intermediate steps of multiple alignment and profile construction are invisible to the user. 37 Sequence database BLAST searching PSI-BLAST Details • It uses the gapped BLAST program for the database searches. A PSI-BLAST query is identical to a Gapped BLAST query with the addition of an expectation value cut-off for inclusion of a match in an iteration. • The E-value cut-off can be over-ridden by the user on a case-by-case basis if a sequence hit of interest is worse then the threshold. (default is 0.001) • The multiple alignment and profile will have lengths identical to that of the query 38 Sequence database BLAST searching Notes on using PSI-BLAST • The WWW version requires the user to decide after each iteration whether to continue. It has the advantage that the user can handpick sequences used for each profile construction, regardless of Evalue, by checking boxes next to the sequences descriptions. • A stand-alone version of PSI-BLAST, obtainable from NCBI, allows the user to run the program for a chosen number of iterations or until convergence. • This version also allows the user to save the profile produced and use it subsequently to search another database. 39 Sequence database BLAST searching Warnings on using PSI-BLAST • PSI-BLAST is a powerful tool and it requires caution. • The sources of error are the same as for standard BLAST, but are easily amplified by iteration! 40 Sequence database BLAST searching Sources of Errors • The major source of deceptive alignments is the presence within proteins of regions with highly biased amino acid composition - low complexity. • If such a region is included during production of a profile, otherwise unrelated sequences containing similarly biased regions will creep into subsequent iterations, rendering the search nearly worthless. 41 Sequence database BLAST searching How to stop bias • PSI-BLAST filters out biased regions of query sequences by default, using the SEG program. • SEG parameters are set to avoid masking potentially important regions, so some bias may still persist. So PSI-BLAST can still generate compositionally rooted artifacts. • These cases can usually be identified by inspection - especially when sequences that have a known bias, such as myosins or collagens, are retrieved. • SEG can also be set to eliminate nearly all biased regions, or filtering procedures, such as COILS, can be used before submitting the appropriately masked sequence to PSI-BLAST. 42 Sequence database BLAST searching PHI-BLAST • Pattern Hit Initiated BLAST searches for particular patterns in protein queries • It takes a protein query and a pattern contained in that sequence as input. • It searches the database for protein sequences that – contain the input pattern and also – have significant similarity to the query sequence in the region of the pattern occurrences. 43 Sequence database BLAST searching PHI-BLAST and PSI-BLAST • The statistical significance of PHI-BLAST is reported using E-values like other forms of BLAST, but the method for computing the E-values is different. • PHI-BLAST is integrated with PSI-BLAST so the results of a PHI-BLAST can be used to initiate one or more iterations of PSI-BLAST searching. • PHI-BLAST is under development and may change substantially over time. 44 searching database Sequence Matrices Similarity Which Similarity Matrix to Use? • Database searches or sequence alignments perform much better if the similarity matrix is based on replacement patterns that correspond to the degree of divergence of the sequences being aligned or discovered. • In database searching, a PAM or Blosum matrix corresponding to an inappropriate degree of divergence can cause you to fail to discover homologous sequences that are present in the database. • Therefore a thorough database search will involve using at least 2 and most likely 3 different matrices. • Using different matrices usually has a higher payoff than using different programs and search algorithms. 45 searching database Sequence Matrices Similarity Comparable Blosum and PAM Matrices Blosum Matrix Blosum 90 Blosum 80 Blosum 60 Blosum 52 Blosum 45 Entropy PAM Matrix 1.18 PAM 100 0.99 PAM 120 0.66 PAM 160 0.52 PAM 200 0.38 PAM 250 Entropy Sequenc e Identit y PAM Matrix 1.18 43 0.98 38 0.70 30 0.51 25 0.36 20 46 searching database Sequence Matrices Similarity What the Comparability Table Means • The comparability is based on matrix entropy. Entropy is defined by information theory as the average amount of information per position in a sequence alignment that is available to determine whether or not a sequence is homologous. • This amount of information is available only if the matrix used in the database search is matched for the appropriate degree of sequence divergence. • As will be shown later this can be used to get a rough indication of whether or not a specific database search result is significant. 47 searching database Sequence Matrices Similarity Scores for nucleic acids • If possible use an amino acid sequence for a database search because: – There is redundancy in the genetic code with up to 6 codons translated as the same amino acid so there is more information in amino acid sequences once the sequences have diverged beyond about 50 PAMs (~60% identical) – Compositional bias found in many organisms and organelles – Some nucleic acid sequences are derived from messengers while others are genomic DNA with exons and the introns may be too short to give a significant alignment with a messenger derived sequence. 48 searching database Sequence Matrices Similarity Search with Nucleic Acids • There are circumstances when there is no choice but to search with nucleic acid sequences. (for 98.9% of the human genome!) • BLAST uses a very long word size, 11, for nucleic acids and the modifications to the heuristic to improve sensitivity for protein sequences do not work as well for nucleic acids because they have only a four letter alphabet and the similarity scores are usually calculated with equal rates of replacement for all of the nucleotides. • Thus FASTA is more sensitive than BLAST for nucleic acid sequences and should be used instead of BLAST if you want to use one of the faster searching programs. 49 searching database Sequence Matrices Similarity Nucleic Acid Matrices • There are matrix of replacements for nucleic acids just as has been recommended for proteins. • Commonly used are PAM 47 scores assuming equal rates of transitions and transversions. This assumption leaves us with only two scores, 5 for identities or matches and -4 for nonidentities or mismatches. • It is possible to create nucleic acid scores that do not assume equal rates of transitions and transversions. – For example assumes a three to one transition to transversion ratio might be more appropriate than the defaults 50 searching database Sequence and deletions Insertions Scoring Insertions and Deletions • In most alignment and search programs, the gap penalty consists of two terms, the cost to open the gap and the cost to extend the gap. • This form of gap penalty has been shown to give better results in searching and alignments then a form of penalty that includes only one of the terms. • The selection of appropriate scores for insertions and deletions, the gap penalties, is as important as selecting the similarity scores for the success of a database search. 51 searching database Sequence and deletions Insertions Purpose of Gap Penalties • The intuitive choice of gap penalties is to use one designed to create the most accurate possible alignment between your query sequence and any homologues in the database. • So we may need to accommodate several small gaps in highly variable regions of the sequence which implies small penalties for opening a gap. • But more knowledge, experience, and thought about the experiment being performed in a database search shows that the most accurate alignment is not the correct goal. • A little thought shows that the appropriate goal is to distinguish any homologous sequences from the vast majority of nonhomolgous sequences in the database. 52 searching database Sequence and deletions Insertions Expected Maximum Scores • The score for homologous sequences will be dominated by relatively long sections of the alignment uninterrupted by gaps. Thus changing the gap penalty will only have a minor effect on the score for an alignment. • Gap penalties, however, are critical in determining the expected maximum score for random sequences being compared with your query sequence. 53 searching database Sequence and deletions Insertions High Non-homologous Scores • The number of nonhomologous sequences that achieve a particular score decreases exponentially as the score increases as long as the gap penalties are sufficiently large relative to the similarity scores used to compare the sequences. • If the gap penalties are too small relative to the similarity scores the number of sequences achieving a particular score decrease as the square of the number of sequences. • This leads to an increase in the number of relatively high scoring non-homologous sequences. 54 searching database Sequence and deletions Insertions Decay: Exponential versus Square 700 600 500 400 Exponential Square 300 200 100 0 10 30 50 70 90 55 searching database Sequence and deletions Insertions Obscure Homologous Sequences • Unfortunately there is no analogous effect on the scores of the homologous sequences. • Thus choosing gap penalties that are too small leads to matches with homologous sequences being obscured by spurious matches with nonhomologous sequences. 56 searching database Sequence and deletions Insertions Length of Gaps • While we want to force alignments to have relatively few gaps we need to allow them to be fairly long. • This allows a few gaps to accommodate a larger number of insertion and deletion events in the evolutionary history of the sequences. • So we should match a large penalty for opening a new gap with a small penalty for lengthening or extending a gap. • In this context “large” will typically be two to three times the largest negative value in the similarity matrix and at least as large as the largest positive value. • For an integer similarity matrix the gap extension penalty will typically be 1-4 and should not be 0. 57 searching database Sequence and deletions Insertions Statistical Significance • A Significance probability is an answer to the question: – How often will an event at least as extreme as this one just observed happen if these events are the result of a well defined, specific, random process? • For database searches the random process is evolution from independent, unrelated ancestral sequences rather than a common ancestral sequence. • With respect to any newly determined query sequence most of the sequences in the database are such randomly evolved sequences. 58 searching database Sequence and deletions Insertions Random Sequences • We expect the scores for random sequences aligned with our query sequence to decay exponentially. • For a group of sequences that achieve at least some minimum score we expect only some constant fraction of them to achieve a given higher score. • If the gap penalty is high enough this exponential decay is observed in practice. 59 searching database Sequence and deletions Insertions Predicting Good Scores • This exponential decay process gives a straight line when we plot the log of the number of sequences that achieve a particular score against the score. Extrapolating this line to its intercept with the score axis gives us a score higher than we might reasonably expect to see from a group of random sequences the size of the database. • Thus any appreciably higher score is a statistically rare event deserving careful evaluation. 60 searching database Sequence and deletions Insertions Problems if Gap to low • Empirically this depends on the gap penalty being high enough. • If the gap penalty is too low, the scores will decay as the square of the number of sequences achieving at least a given score. • This slower decay obscures most homologous matches (as discussed before) and also invalidates the fitting of the scores to a logarithmic function and yields a much too high extrapolated score. 61 searching database Sequence and deletions Insertions Difference between Logs 10 9 8 7 6 Log(Exponential) 5 Log(Square) 4 3 2 1 0 10 20 30 40 50 60 70 80 90 100 62 searching database Sequence testing Statistical Information Theory • Every alignment has a certain amount of information per position about whether or not the aligned sequences are homologous. This is called entropy. • Every similarity matrix has an associated amount of entropy. • This amount of entropy is a maximum and is achieved only in alignments between sequences that have diverged to the degree appropriate to the matrix. 63 searching database Sequence testing Statistical Entropy • It is currently impossible to assign an entropy value to alignment gaps corresponding to insertions or deletions. • Thus entropy can be computed only for regions of the alignment uninterrupted by gaps. • The amount of information (or entropy) associated with an uninterrupted region of the alignment is given by the length of the alignment in sequence residues multiplied by the entropy value for the similarity matrix. • It is measured in units called bits, which is the amount of information need to answer a yes no question. 64 searching database Sequence testing Statistical Entropy of Random Sequences • In order to make use of the information of an alignment we need to know how much information we expect to find in alignments between random sequences. • The answer obviously depends on: – Length of the query sequence – Size of the database • Without going into the details of the derivation, this is the amount of information required to repeatedly divide the database into the remaining half containing the best scoring segment. 65 searching database Sequence testing Statistical Test for Statistical Significance • S ≈ log2(mn) – The minimum amount of information (in bits) for a significant match is given by the logarithm (base 2) of the product of the length of the query sequence and the number of residues in the database. • If there is more than this amount of information in the alignment it is statistically significant and should be investigated further. • This is the basis of the statistical significance tests in BLAST - now used by the other algorithms as well. • E = mn 2-S – is the E-value formula where m is the length of the alignment, n is the size of the data base and S is the bit score of entropy. 66 searching database Sequence testing Statistical E Value • The E-value is the chance number of alignments with a certain score, s. – E=Kmn e-s – where s is the similarity score, m and n are as above, and K and can be thought of as natural scales for the search space size and scoring system respectively. • This makes intuitive sense since doubling either the size of the database or the size of the alignment should double the number of random strings attaining that score. Also for us to attain a score twice as good, 2x, it must attain the score x twice in a row, so one expects E to decrease exponentially with score. 67 searching database Sequence testing Statistical Converting raw scores to bit scores S s ln K ln 2 • S is bit score, s is alignment score and translation parameters) • , K (scale – Can be estimated empirically for a given database and score matrix (even including gaps) or calculated exactly assuming a random ungapped model and background amino acid frequencies 68 searching database Sequence testing Statistical Usefulness of Statistical Measures • The usefulness of statistical measures is dependent on: – Whether the assumptions of the underlying statistical model are correct – The kinds of errors that one is willing to accept when using the measure to draw a conclusion • The threshold for statistical significance will vary depending on which we are more concerned with: – Labeling a sequence as related when it is not - misidentifying a non-homologue (False Positive or Type I error) – Labeling a sequence as non-homologous when a high scoring homologue has been found - missing a likely homologue (False Negative or Type II error) 69 searching database Sequence testing Statistical Problems with Statistical Models • However, with biological sequences (as opposed to “fair” coins), the assumptions underlying the statistical model may not be met. • In general inaccurate statistical estimates are cause either by : – Incorrect gap penalties – Low complexity regions - runs of simple amino acid composition 70 searching database Sequence testing Statistical Additional Statistical Checks • Prss3 in the FASTA collection evaluates the significance of a sequence similarity score by comparing two sequences and calculating optimal scores and then repeatedly shuffling the second sequence and getting optimal similarity scores using the SmithWaterman algorithm. You must use 500-2000 shuffles and remember to normalize the statistical significance to the size of the database originally searched. - permutation test, randomization testing • Randseq also in FASTA produces a random sequence with the same length and amino acid composition as a query sequence. In general in a database search, the highest scoring match to a random query sequence should have an E value of 1. 71 Sequence database searching References • http://www.psc.edu/biomed/TUTORIALS/SEQUENCE/D BSEARCH/tutorial.html • http://www.ncbi.nlm.nih.gov/BLAST/ • Flexible Sequence Similarity Searching with the FASTA3 program package, William Pearson • Bioinformatics A Practical Guide the Analysis of Genes and Proteins Baxevanis & Ouellette 72