Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Non-coding DNA wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Point mutation wikipedia , lookup
Public health genomics wikipedia , lookup
United Kingdom National DNA Database wikipedia , lookup
Genome editing wikipedia , lookup
Metagenomics wikipedia , lookup
Microsatellite wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Helitron (biology) wikipedia , lookup
Multiple sequence alignment wikipedia , lookup
Genomics and Personalized Care in Health Systems Lecture 4. Blast Search Leming Zhou, PhD, DSc School of Health and Rehabilitation Sciences Department of Health Information Management Department of Health Information Management Outline • BLAST algorithm • BLAST search • Other programs Department of Health Information Management Pairwise Local Alignment • Pairwise local sequence alignment: identify similar segments in two sequences • Smith-Waterman algorithm (a dynamic programming algorithm) is guaranteed to find optimal alignments, but it is computationally expensive [O(nm)]. • BLAST and FASTA are heuristic approximations to local alignment and they run much faster than SmithWaterman algorithm but retain sensitivity of the search Department of Health Information Management BLAST • BLAST [Basic Local Alignment Search Tool] is a sequence comparison algorithm optimized for speed used to search sequence databases for optimal local alignments to a query • It is the most widely used and referenced computational biology resource • The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length W with a score of at least T when compared to the query using a substitution matrix • Word hits are then extended in both directions to generate an alignment with score exceeding a given threshold S Department of Health Information Management Applications of BLAST • BLAST searching is fundamental to understanding the relatedness of any query sequence to other known proteins or DNA sequences. • Applications include – Identifying homologs – Discovering new genes – Discovering variants of genes – Exploring protein structure and function – … page 87 BLAST Algorithm Department of Health Information Management BLAST Algorithm • Filter out low complexity regions • Locate words with a fix size in the query sequence • Scan the sequence database for entries that match the words in the query sequence • If there is a hit (i.e. a match between a word in the query and a word in the database entry), extend the hit in both directions. Keep track of the score and stop the extension when the score drops below a threshold Department of Health Information Management Word Size • The initial search is done for a word of length W • Default values: • Protein sequence search: W = 3 • Nucleotide sequence search: W = 11 • Highly similar nucleotide sequence: W=28 • Each word in the query sequence index is compared to the database index and residue pairs are scored Department of Health Information Management Scoring Matrices • Nucleotide: A G C T A +1 –2 –2 –2 G –2 +1 –2 –2 C –2 –2 +1 –2 T -2 -2 -2 +1 • Protein: – PAM matrices – BLOSUM matrices • BLOSUM 62 is default for major database searches Department of Health Information Management Initial Search • The initial search is done for a word of length W that scores at least T when compared to the query using a scoring matrix Department of Health Information Management Hit Extension • Word hits are then extended in both directions in an attempt to generate an alignment with a score exceeding a given threshold to derive the High-scoring Segment Pairs • This procedure stops when the score becomes lower than the threshold …..SLAALLNKCKTPQGQRLVNQWIKQPLMDKNR IEERLNLVEA… +LA++L+ TP G R++ +W+ P+ D + ER +A …..TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA…. Department of Health Information Management Gap Penalty • Gap penalties can be calculated linearly (constant penalty for each gap) or using affine gap penalty: – G = a + bx – a: gap open penalty – b: gap extension penalty – x: number of gaps • The choice of gap open and gap extension penalty is empirical. Usually we choose a high value for gap open penalty and a low value for gap extension penalty. BLAST Search Department of Health Information Management Four Steps of a BLAST search • Enter query sequence • Select one BLAST program • Choose the database to search • Set optional parameters Department of Health Information Management Enter Query Sequence • A sequence can be pasted into a text field in FASTA format or as accession number • A sequence or a sequence list can also be uploaded as a file • Users may indicate a range of the query sequence instead of using the whole query sequence • You may enter a descriptive title for your BLAST search Department of Health Information Management Align Two or More Sequences • You may provide two or more sequence and perform pairwise BLAST search Department of Health Information Management Select a BLAST Program • BLAST Programs: – BLASTN: DNA query sequence against a DNA database – BLASTP: protein query sequence against a protein database – BLASTX: DNA query sequence, translated into all six reading frames, against a protein database – TBLASTN: protein query sequence against a DNA database, translated into all six reading frames – TBLASTX: DNA query sequence, translated into all six reading frames, against a DNA database, translated into all six reading frames • Choose the right one according to the sequence you have and your purpose of the search Department of Health Information Management DNA vs. Protein Searches • Consider the two sequences: AUGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA AUGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA • Ungapped DNA alignment: AUGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA ||||| | || || || | || || || | | AUGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA • 21 identical resides (out of 36) 58% identity • Translate each to protein first: ELVISISALIVE ELVISISALIVE • 100% identical at amino acid level Department of Health Information Management DNA vs. Protein Searches • If nucleotide region contains a gene, it is beneficial to translate the sequence to protein sequence first • Target and query translated into all six reading frames – 3 in forward, 3 in reverse • Number of comparisons needed grows • More sensitive, but slower Department of Health Information Management Choose the Database to Search • BLASTN Nucleotide Databases for BLAST Database nr refseq_mrna refseq_genomic est est_others gss htgs pat pdb alu_repeats dbsts wgs env_nt Content Description All GenBank + EMBL + DDBJ + PDB sequences mRNA sequences from NCBI Reference Sequence Project. Genomic sequences from NCBI Reference Sequence Project. Database of GenBank + EMBL + DDBJ sequences from EST division. Subset of est other than human or mouse. Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences. Unfinished High Throughput Genomic Sequences Nucleotides from the Patent division of GenBank. Sequences derived from the 3D structure records from PDB. They are not the coding sequences for the corresponding proteins found in the same PDB record. Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. Database of Sequence Tag Site entries from the STS division Assemblies of Whole Genome Shotgun sequences. Sequences from environmental samples, such as uncultured bacterial samples isolated from soil or marine samples Department of Health Information Management Choose the Database to Search • BLASTP Department of Health Information Management Protein Sequence Databases for BLAST Database nr refseq swissprot pat pdb env_nr Content Description Non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF, excluding those in env_nr. Protein sequences from RefSeq Last major release of the SWISS-PROT protein sequence database Proteins from the Patent division of GenBank. Sequences derived from the 3-dimensional structure records from the Protein Data Bank. Non-redundant CDS translations from env_nt entries. Department of Health Information Management Optional Parameters • Specify the organism to search or exclude – Common name, taxonomy id, … • Exclude certain sequences – Exclude predicted sequences or sequences from metagenomics • Use Entrez query to select a subset of the blast database page 93 Algorithm Parameters Optional Parameters Department of Health Information Management Algorithm Parameters • Expect value • Word size • Filtering/masking • Substitution matrix Department of Health Information Management BLASTN Algorithm Parameters Department of Health Information Management BLASTP Algorithm Parameters Department of Health Information Management Scoring Parameters Match/Mismatch scores Gap Penalty Expect Value Department of Health Information Management Expect Value Department of Health Information Management Expect Value • It is important to assess the statistical significance of search results. • For local alignments, the scores follow an extreme value distribution • Expected value (E value) is the number of matches expected to occur randomly with a given score • The lower the E value, more significant the match. • E = Kmn e-lS – K: A variable with a value dependent upon the substitution matrix used and adjusted for search base size. – m, n: length of the query and database sequences – λ: A statistical parameter used as a natural scale for the scoring system – S: alignment score Department of Health Information Management More about E Value • The value of E decreases exponentially with increasing alignment score S (higher S values correspond to better alignments). Very high scores correspond to very low E values. • For E=1, one match with a similar score is expected to occur by chance. – For a much larger or smaller database, you would expect E to vary accordingly Department of Health Information Management Why Set Expect Threshold to 1000 • When you perform a search with a short query (e.g. 9 amino acids). There are not enough residues to accumulate a big score (or a small E value). • A match of 9 out of 9 residues could yield a small score with an E value of 100 or 200. And yet, this result could be real and of interest to you. • By setting the E value cutoff to 1000 or a bigger value you do not change the way the search was done, but you do change which results are reported to you. – All hits with E value less than 1000 are reported Department of Health Information Management E Values • Orthologs from closely related species will have the highest scores and lowest E values – Often E = 10-30 to 10-100 • Closely related homologs with highly conserved function and structure will have high scores – Often E = 10-15 to 10-50 • Distantly related homologs may be hard to identify – Less than E = 10-4 • These values may be served as general guideline but not a strict range for those situations Department of Health Information Management Set the Expect Threshold • The Expect Threshold can be any positive real number. • The lower the number the more stringent the matches displayed. • The default value of 10 signifies that 10 matches can be expected by chance in a search of the database using a random query with similar length. • No match with an E-value higher than the Expect Threshold selected will be displayed • Increase the Expect Threshold to 1000 or more when searching with a short query Default Special Cases Expect 10 Threshold Short query Large sequence family Ungapped Blast 1000 or more 10 10 Department of Health Information Management Raw Scores and Bit Scores • There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalized scores) • Bit scores are comparable between different searches because they are normalized to account for the use of different scoring matrices and different database sizes S’ = bit score = (lS - lnK) / ln2 • The E value corresponding to a given bit score is: E = mn 2 -S’ • Bit scores allow you to compare results between different database searches, even using different scoring matrices. Repeats Department of Health Information Management Low Complexity Regions • Low complexity regions: amino acid or DNA sequence regions that offer very low information due to their highly biased content – poly-A tails in DNA sequences – runs of purines or pyrimidines – Tandem repeats, such as ACACACACACACAC… – runs of a single amino acid, etc. Department of Health Information Management Short Repeats • DNA or amino acids less than 10 bases that repeat themselves • Short, tandem repeats • Such regions can be the cause of disease, but are common in genomes Department of Health Information Management Interspersed Repeats • Larger repeats are found interspersed throughout genomes • Humans: > 40% interspersed repeats • Plants have large numbers of these as well • Short Interspersed Repeats (SINES, 300 bp) • Long Interspersed Repeats (LINES, 1k bp) Department of Health Information Management RepeatMasker • RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences using Smith-Waterman algorithm called cross-match. • The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns). • http://www.repeatmasker.org/ Department of Health Information Management Soft vs. Hard Masking • In BLAST, two options to mask repetitive elements and low complexity regions: – Hard masking: replace regions with X’s or N’s – Soft masking: repetitive regions and low complexity regions are shown in lower case Department of Health Information Management Masking in Lowercase or X/N Filters and Masking Department of Health Information Management Filters and Masking • Filter low-complexity region – This function mask off segments of the query sequence that have low compositional complexity. – Filtering can eliminate statistically significant but biologically uninteresting reports from the blast output. – Filtering is only applied to the query sequence (or its translation products), not to database sequences. • Mask for lookup table only – This option masks only for purposes of constructing the lookup table used by BLAST so that no hits are found based upon low-complexity sequence or repeats. – The BLAST extensions are performed without masking and so they can be extended through low-complexity sequence. • Mask lower case letters – With this option selected you can cut and paste a FASTA sequence in upper case characters and denote areas you would like filtered with lower case. – This allows you to customize what is filtered from the sequence. Department of Health Information Management Filters and Masking • If the number of hits returned is small when searching with a short query, it may help to re-search with filtering turned off. Default Filter On Special cases Short query Large sequence family Ungapped Blast off on on BLAST Search Output Department of Health Information Management BLASTN Output (header) Department of Health Information Management BLASTN Output (Graphic Summary) matches to itself probable homologs distantly related homologs distant homolog with shared domain or motif Department of Health Information Management BLASTN Output (Descriptions) Department of Health Information Management BLASTN Output (Sequence Alignments) Formatted BLAST Output Department of Health Information Management Alignment Formatting Department of Health Information Management Formatting Department of Health Information Management Formatting - Pairwise • Pairwise - Default BLAST alignment in pairs of query sequence and database match. BLASTP Department of Health Information Management Algorithm Parameters for BLASTP Department of Health Information Management Scoring Matrix Department of Health Information Management Output (header) - BLASTP Department of Health Information Management Output (Graphic Summary) Department of Health Information Management Graphic Summary Department of Health Information Management Description Other Programs Department of Health Information Management FASTA • First rapid database search utility • 50 times faster than Dynamic Programming • Based on a heuristic – not guaranteed to locate optimal solution Department of Health Information Management FASTA Algorithm • Hashing: – Construct a table showing each word of length k for query and target – Relative positions calculated by subtracting positions • Identify regions with highest density of hits – Trim regions to include only residues contributing to high scores – Associate initial score to each region. Each region is partial alignment without gaps • Join initial regions to form alignments with gaps • Assign score – Sum of initial scores for initial regions – Subtract gap penalty • Construct optimal alignment of the query and database – Consider only a band 32 residues wide – Centered on best initial region Department of Health Information Management FASTA Programs • FASTA: protein to protein or DNA to DNA • Compare a set of ordered peptide fragments against a protein database – FASTF or FASTS • Compares a set of ordered peptide fragments against a DNA database – TFASTF or TFASTS • Compares a query DNA sequence to a protein sequence database – FASTX or FASTY • Compare protein sequence to a DNA sequence or DNA database – TFASTA or TFASTX or TFASTY Department of Health Information Management BLAST-related Tools • The analysis of genomic DNA presents special challenges – There are exons and introns. – There may be sequencing errors or polymorphisms – The comparison may between related species (e.g. human and mouse) • Recently developed tools include – MegaBLAST at NCBI. – BLAT. BLAT parses an entire genomic DNA database into words (11mers), then searches them against a query – SSAHA at Ensembl uses a similar strategy as BLAT. Department of Health Information Management MegaBLAST • For megablast, the word size is 28 and can be adjusted to 64. • Megablast is very fast for finding closely related DNA sequences Department of Health Information Management BLAT • BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 25 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 25 bases, and sometimes find them down to 20 bases. • BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. • In practice DNA BLAT works well on primates, and protein BLAT on land vertebrates • BLAT works by keeping an index of the entire genome in memory. The index consists of all non-overlapping 11mers except for those heavily involved in repeats. • http://genome.ucsc.edu Department of Health Information Management Human BLAT Results Department of Health Information Management Sequence Alignment Result Department of Health Information Management Results in Genome Browser Department of Health Information Management Homework 3 • Having the gene sequences you obtained in the homework2, use BLAST to perform pairwise sequence alignment between human and cow BRCA1 genes, summarize the result (level of identity, gaps, strands, coverage, etc.) • For the human gene sequence, perform a BLAST search against the Refseq_RNA database, summarize the result and create a similarity score matrix (including at least five other species, such as Pan troglodytes, Canis lupus familiaris, Bos taurus, Equus caballus, Felis catus, …) based on the bit scores • Retrieve one of the dbSNP records in your homework 1, obtain the sequence record (in fasta format), perform a BLAST search to determine the precise location of the mutation in the human genome