BLAST
• About 100 times faster than dynamic programming.
• Good for database searches.
• Derive a list of words of length w from the query (e.g., 3 for protein, 11 for DNA).
• High-scoring words are compared with database sequences.
• Sequences with many matches to high-scoring words are used for final alignments.

Protein-based searches are always more powerful than nucleotide-based searches of coding DNA in determining similarity and inferring homology.

BLAST (Basic Local Alignment Search Tool)
Example word score (BLOSUM62): P = 7, Q = 5, G = 6, total 18
• In addition to the exact word, BLAST considers related words based on BLOSUM62: the neighborhood.
• Once a word is aligned, gapped and ungapped extensions are initiated, tallying the cumulative score.
• When the score drops by more than X, the extension is terminated.
• The extension is trimmed back to the maximum score.
• Produces local alignments.
HSP = high-scoring segment pair
X = significance decay
S = minimum score to return a BLAST hit
T = neighborhood score threshold

BLAST home page
http://blast.ncbi.nlm.nih.gov/Blast.cgi

BLASTP

BLAST databases
Peptide Sequence Databases
• nr: non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF
• RefSeq_protein: reference proteins
• Swissprot: SWISS-PROT protein sequence database
• pdb: sequences derived from 3-dimensional structures
Nucleotide Sequence Databases
• nr: GenBank + EMBL + DDBJ + PDB (no EST, STS, GSS, WGS, or PAT)
• est: Expressed sequence tags. 34 billion seq.!
• htgs: Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2
• gss: Genome Survey Sequence
• wgs: Whole Genome Shotgun sequences. 148 billion sequences

BLAST Advanced options
• -G Cost to open a gap [Integer]; default = 11 (10, 8, 9)
• -E Cost to extend a gap [Integer]; default = 1 (1, 2)
• -e Expectation value (E) [Real]; default = 10.0
• -W Word size; default is 11 for blastn, 3 for other programs.
• -b Number of alignments to show (B) [Integer]; default = 100

Special Cases
                  Default     Short Query      Large Sequence Family   Ungapped BLAST
Filter            on          off              on                      on
Scoring Matrix    BLOSUM62    ≤ PAM30-35       BLOSUM62                BLOSUM62
Word Size         3           3-2, 7 for DNA   3, 11 for DNA           3, 11 for DNA
E value           10          1000 or more     10                      10
Gap costs         11, 1       9, 1             11, 1                   4
Alignments        50          50               2000                    50

Report by species
Database: All nr GenBank CDS translations + PDB + SwissProt + PIR + PRF
2,794,673 sequences; 957,836,323 total letters

Taxonomy reports
Query = Apetala1 P35631 (255 letters)
• "+" indicates a conservative amino acid substitution
• "–" indicates a gap/insertion
• XXXX… shows areas of low complexity
CONSIDER TAXONOMIC RELATIONSHIP WHEN INTERPRETING SIMILARITY VALUES!

Format BLAST output
All sequences passing the E-value threshold are aligned beneath the query. In the "with identities" views, identical residues are shown as dots.
Views: Flat Query-Anchored; Query-Anchored with identities

Statistical significance
• Chance alignments have no biological significance.
• Statistical significance implies a low probability of generating a chance alignment.
• The probability of long alignments increases with longer sequences.
• The extreme-value distribution
  – Used to calculate the probability of a chance alignment
  – Generated by calculating the scores resulting from repeatedly scrambling one of the sequences being compared

BLAST statistics
S' (bit score): calculated from the raw score S (sum of BLOSUM62 scores) by normalizing with the statistical variables that define a scoring system (K and λ). Bit scores from different alignments, even those employing different scoring matrices, can be compared.
$S' = (\lambda S - \ln K) / \ln 2$
K = minor constant
λ = constant to adjust for the scoring matrix
S = score of the high-scoring segment pair (HSP)

E (expect) value: the number of alignments with scores equivalent to or better than S' that are expected to occur by chance in a database search.
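As a rough illustration of the bit-score normalization above, the following Python sketch converts raw scores into bit scores. The function name `bit_score` is hypothetical, and the λ and K values are assumptions chosen for illustration (values of this size are commonly quoted for gapped BLOSUM62 searches with gap costs 11, 1); a real BLAST report lists the parameters it actually used.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S into a bit score S'.

    Implements S' = (lambda * S - ln K) / ln 2 from the slide.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


# Assumed, illustrative parameters (not taken from the slides).
LAMBDA = 0.267
K = 0.041

for raw in (50, 100, 200):
    print(f"S = {raw:3d}  ->  S' = {bit_score(raw, LAMBDA, K):.1f} bits")
```

Because λ and K absorb the scale of the scoring matrix, two hits scored with different matrices can be compared once both are expressed in bits.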
$E = mN \cdot 2^{-S'}$
m = query size
S' = bit score
N = database size
m·N = search space
– The E-value decreases exponentially as the score (S) assigned to a match between two sequences increases.
– The E-value depends on the size of the database and the scoring system in use.
– When the E-value threshold is increased from the default value of 10, more hits are reported. When it is reduced, only the more significant hits are reported.
– The lower the E-value (or the higher the bit score), the more significant the hit.
– The product mN defines the search space. The same HSP may come out statistically significant in a small database and not significant in a large database.

P values
P: probability of finding at least one HSP with bit score S' or higher by chance.
Since it can be shown that the number of random HSPs with score S' is described by a Poisson distribution, the probability of finding at least one HSP with bit score S' is
$P = 1 - e^{-E}$ (E = expect value)
E = 10     -> P = 0.99995
E = 1      -> P = 0.63
E = 0.1    -> P = 0.095
E = 0.01   -> P = 0.01
E = 0.001  -> P = 0.001
E = 0.0001 -> P = 0.0001
P-values vary from 0 to 1, whereas E-values can be much greater than 1. The BLAST programs report E-values rather than P-values because E-values of, for example, 5 and 10 are much easier to comprehend than P-values of 0.993 and 0.99995. However, for E < 0.01, the P-value and E-value are nearly identical.

BLAST Tips
• Suggested BLAST cutoffs:
  – DNA: book suggests E-values < 1e-6 (I use E < 1e-10)
  – Protein: book suggests E-values < 1e-3
• Consider evolutionary divergence in your results! The DNA mutation rate without selection is 5.5 × 10^-9 per site per year, so 10 million years (10^7) of divergence gives 5.5 × 10^-2 ≈ 0.05, i.e. ~95% identity.
• BLAST search artifacts: repeated amino acid stretches (e.g. polyglutamine) or nucleotide repeats (e.g. ATATATATATATAT) result in meaningless positives with significant E-values.
• Use BLAST filters to mask low-complexity regions: programs SEG for proteins and DUST for DNA.
• Or customize masking using the lower-case letter option.
• RepeatMasker can be used to mask repeats in lower-case letters: http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker

MEGABLAST
• Variation of BLASTN, 10 times faster
• Optimized for long or highly similar (>95%) sequences
• Ideal for finding whether a large sequence is part of a large contig or chromosome, finding sequencing errors, and comparing large similar sequences
• Uses a longer default word length (28 instead of 11)
• Faster non-affine gap penalty: gap opening penalty = 0, gap extension penalty E = r/2 − q (r = match reward, q = mismatch penalty)
• Non-affine gapping tends to yield more gaps of shorter length.
• Accepts multiple consecutive FASTA files as input

Discontiguous MEGABLAST
• Ideal for comparing divergent sequences from different organisms (<80% identity)
• Uses a discontiguous word approach, different from other BLAST programs
• Nonconsecutive positions are examined over longer segments

PSI-BLAST (Position-Specific Iterated BLAST)
• Designed to detect weak relationships

PSI-BLAST steps
• BLASTP
• Multiple alignment
• Construct PSSM
• Use PSSM to search
• The added sensitivity comes from the use of a profile that is constructed (automatically) from a multiple alignment.
• The profile is generated by calculating a Position-Specific Scoring Matrix (PSSM) for every position in the alignment. Also called profiles of Hidden Markov Models.
• PSSMs are numerical representations of a multiple alignment.

Construction of a PSSM
• A highly conserved position receives a high score.
• The profile is used to perform additional searches (iterations), and the results of each iteration are used to refine the profile.
• Each iteration uses a PSSM built from the previous iteration.
• Continue the search iteratively until no new matches are identified: "convergence".

Each column in the alignment is a row in the PSSM.
Frequency of occurrence of a residue at each position
Calculate the probability of each amino acid at each position
T at position 8 is conserved = highest score, 150
P at position 9 is less conserved = score 89
Note the low scores of the aromatic residues F, Y, W relative to A in the P row.

PHI-BLAST (Pattern Hit Initiated BLAST)
• PHI-BLAST searches for particular patterns in protein queries. It combines matching of regular expressions with local alignments surrounding the match.
• PHI-BLAST is preferable to just searching for pattern occurrences because it filters out cases where the pattern occurrence is probably random and not indicative of homology.
• PHI-BLAST expects as input a protein query sequence and a pattern contained in that sequence.
• PHI-BLAST limits alignments to those that match the provided pattern.
• Statistical significance is reported using E-values as for other forms of BLAST, but the statistical method for computing the E-values is different.
• PHI-BLAST is integrated with Position-Specific Iterated BLAST (PSI-BLAST), so that the results of a PHI-BLAST query can be used for PSI-BLAST.
Pattern: [C]-x(2)-[C]-x(10,16)-[H]-x(2,3)-[H]
Syntax for pattern at http://www.ncbi.nlm.nih.gov/blast/html/PHIsyntax.html

Specialized BLAST
Great tool!

Multiple Sequence Alignment
COBALT
http://www.ncbi.nlm.nih.gov/BLAST/
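To make the PHI-BLAST pattern shown earlier more concrete, the sketch below translates it into an ordinary Python regular expression and checks it against a made-up peptide. This only illustrates the pattern-matching half of PHI-BLAST, not the alignment or E-value computation. The function name `phi_pattern_to_regex` and the test sequences are hypothetical, and the converter handles only the constructs that appear in the slide's pattern (bracketed residue sets, single residues, `x`, and `(n)` or `(m,n)` repeat counts); the full syntax is described at the NCBI link above.

```python
import re

# One pattern element: a residue set like [C], a single residue, or x,
# optionally followed by a repeat count (n) or range (m,n).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def phi_pattern_to_regex(pattern):
    """Convert a simple PHI-BLAST-style pattern into a compiled regex."""
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        # x (or X) stands for any residue.
        token = "." if residue.lower() == "x" else residue
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return re.compile("".join(parts))


regex = phi_pattern_to_regex("[C]-x(2)-[C]-x(10,16)-[H]-x(2,3)-[H]")

# Hypothetical peptide built so that the pattern occurs once.
peptide = "AAC" + "KK" + "C" + "A" * 12 + "H" + "KKK" + "H" + "GG"
hit = regex.search(peptide)
print(regex.pattern)
print(hit.start(), hit.end())
```

A plain regex search like this reports every occurrence of the pattern, including ones that are likely random; the point made in the slides is that PHI-BLAST additionally requires a significant local alignment around the match.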