Download Sequence Alignment - Mainlab Bioinformatics

Bioinformatics for Research
Module 1: Sequence Alignment
September 1, 2015
Mainlab Bioinformatics, Washington State University

Learning Outcomes
• Understanding what sequence alignment is and how it is useful
• Understanding the different types of sequence alignment and when you might use them
• Understanding the importance of homology
• Understanding scoring matrices and when to use them
• Understanding that there are different alignment algorithms
• Basic understanding of BLAST and its different flavors

What is Sequence Analysis?
• A sequence is ___________________________________
• A biological sequence is __________________________
  ______________________________________________
• Sequence analysis in bioinformatics refers to __________
  ______________________________________________
  ______________________________________________

Concept of Sequence Alignment
• An alignment is a mutual arrangement of two sequences
• It shows where two sequences are similar and where they differ
• An ‘optimal’ alignment has the most correspondences and the fewest differences
• Sequences that are similar probably have the same function (descent from a common ancestor)
• Sequence alignment involves identifying the correct location of the deletions and insertions that have occurred in either of the two lineages since divergence from a common ancestor

Sequence Alignment
• Sequence alignment is the procedure of comparing, at the residue level, two sequences (pairwise alignment) or more (multiple sequence alignment) by searching for a series of common features (characters or patterns) that occur in the same order in the sequences
• Sequence alignment is useful for discovering functional, structural and evolutionary information in biological sequences

Importance of Homology!
• Homology strongly suggests that molecules have similar structure and function
• Significantly similar molecular sequences are very unlikely to occur by chance
• Significant similarity between sequences implies that the sequences/structures are homologous, i.e. at some point in the past they shared a common ancestor and therefore share structure and function
• Differences between families of species resulted from mutations during the course of evolution. Most of these changes are due to local mutations between nucleotide sequences

Orthologs: Sharing a Common Ancestor
• Orthologs occur in separate species and descend from a common ancestor
[Figure: over time, ancestor gene A gives rise to A1 in Descendent 1 and A2 in Descendent 2]

Paralogs: Sharing a Common Ancestor
• Paralogs arise by gene duplication independent of speciation (genome duplication)
[Figure: over time, ancestor gene A gives rise to A1 and A2 within the same descendent]

Homology
• Homology designates a qualitative relationship of common descent between entities
• Two genes are either homologs or they are not!
  It doesn’t make sense to say “two genes are 43% homologous”
  It doesn’t make sense to say “Jane is 43% diabetic”

Sequence Alignment
• Changes in sequence may be found in alignments due to divergence from the common ancestor
• These changes are categorized as substitutions, insertions and deletions
• A primary use of sequence alignment is to determine whether two sequences are sufficiently similar to declare them homologous and therefore likely to share similar structure and function
• What else can we use sequence alignment for?
  ______________________________________
  ______________________________________
  ______________________________________

Sequence Alignment
• What does sequence identity mean? ___________________
  _________________________________________________
• What does sequence similarity mean? __________________
  _________________________________________________
• What is sequence homology?
  _________________________________________________
  _________________________________________________
  _________________________________________________

Sequence Alignment Methods
Method: Dot Matrix Plot
Use: General exploration of your sequence
  - Discovering repeats
  - Finding rearrangements
  - Predicting regions of self-complementary RNA
  - Extracting portions of sequence to make a multiple alignment
Method: Global Alignments
Use: Comparing two sequences over their entire length
  - Identifying long insertions/deletions
  - Checking the quality of your data
  - Identifying every mutation in your sequence
Method: Local Alignments
Use: Comparing sequences with partial homology
  - Making high-quality alignments
  - Making residue-per-residue analysis

Dot Matrix Plots
• Also known as dot plots, they represent the simplest method of evaluating similarity between two sequences
• A dot plot identifies all possible matches of residues between the two sequences
• One sequence (A) is listed horizontally (top of page) and the other sequence (B) vertically (left side of page)
• Starting with the first character in B, the comparison moves across the row and places a dot in the plot space wherever the two sequence elements are the same
• Adjacent regions of identity between the two sequences produce diagonal lines of dots in the plot

Dot Matrix Plots
• A diagonal line always appears when a sequence is compared to itself
• Random matches can be filtered out by increasing the window size
• In a dot matrix, detection of matching regions may be improved by filtering out random matches
• This is done using a sliding window to compare the two sequences
• Sliding windows
  • Window size: number of characters to compare
  • Stringency: number of characters that have to match exactly

Dot Matrix Plots
• Window size
  • A larger window size is used for DNA sequences than for proteins, because with only 4 DNA characters (compared to 20 amino acid characters) the number of random matches is much larger
  • For DNA sequence comparisons, use long windows and high stringencies, e.g. 15 and 10
  • For protein comparisons, use short windows and low stringencies, except when looking for a short domain in a partially similar sequence
[Figure: a window of 15 characters on the sequence GAACTCATACGAATTCACATTAGAC]

Dot Matrix Plots
• Try this with the example THEFATCAT as sequence 1 and THEFASTCAT as sequence 2. What does the result tell you?
[Figure: empty dot plot grid with T H E F A S T C A T on one axis and T H E F A T C A T on the other]

Sequence Analysis Dot Plot Programs
• Dottup - Displays a wordmatch dotplot of two sequences
• Dotmatcher - Draws a threshold dotplot of two sequences
• Dotpath - Draws a non-overlapping wordmatch dotplot of two sequences
• Polydot - Draws dotplots for all-against-all comparison of a sequence set
http://emboss.bioinformatics.nl/

Types of Sequence Alignment
• Global: alignments that stretch over the entire sequence length to include as many matching residues as possible
  [Figure: example global alignment of two short protein sequences; Needleman-Wunsch Algorithm]
• Used to align two closely related sequences of similar length
• Useful for checking minor differences between two sequences, analyzing polymorphism between closely related species, and comparing two sequences that partially overlap

Types of Sequence Alignment
• Local: higher priority is given to aligning local regions of high similarity than to extending the alignment to neighboring residues with lower scores
  [Figure: example local alignment in which only the region DTGA is aligned; Smith-Waterman Algorithm]
• Dynamic programming - Smith-Waterman Algorithm (provides the best possible alignment, but slow!)
• Heuristic methods - BLAST and FASTA use fast approximate methods to align two sequences

Types of Sequence Alignment
• Local (cont.)
  • Heuristic algorithms are empirical (they use rules of thumb to align)
  • They are much faster than dynamic programming algorithms, so they are better suited for database searches
  • Unlike dynamic programming algorithms, they do not guarantee an optimal alignment
Question: You have found a homolog to your unknown gene of interest using BLAST. What might you do to optimize the alignment?

Alignment Algorithms
• Require a scoring system for evaluating the match or mismatch of 2 characters (aa or nt)

Substitution Matrices
• Likelihood of one amino acid mutating into another over evolutionary time
• Negative score: unlikely to happen (e.g., Gly/Trp, -7)
• Positive score: conservative substitution (e.g., Lys/Arg, +3)
• High score for identical matches: rare amino acids (e.g., Trp, Cys)

Scoring Matrices
• There are several of them and the choice can affect the outcome
• Values reflect the likelihood that one aa mutates into another
• Can be based on chemical similarity, functional similarity, structure, evolutionary similarity, etc.
• Common matrices for protein comparisons
  • PAM (Point Accepted Mutation) - based on global alignments of closely related proteins that are at least 85% identical, and on an implicit model of evolution
  • The PAM250 matrix is widely used
  • PAM is not necessarily good for identifying relationships in highly divergent species.
    It does not account for conserved blocks or motifs

Scoring Matrices
• BLOSUM Matrices
  • Look only for differences in conserved, ungapped regions of a protein family
  • Directly calculated, using no extrapolations
  • More sensitive to structural or functional substitutions
  • Generally perform better than PAM matrices for local similarity searches (Henikoff and Henikoff, 1993)
• BLOSUM62
  • Every possible identity and substitution is assigned a score based on the observed frequencies of such occurrences in alignments of related proteins
  • BLOSUM62 is a matrix calculated from comparisons of sequences that are no more than 62% identical. Also BLOSUM 32 or 80
  • Default for BLAST

Alignment Algorithms Require
• A penalty function for gaps in sequences
• A method for finding an optimal pairing of sequences (may introduce gaps to optimize the score)
• A gap is a space introduced into an alignment to compensate for insertions and deletions in the sequences being compared
• For each gap introduced there is a penalty, and extending the gap further increases the penalty

AGGVLIIQVG    AGGVLIIQVG    AGGVLIIQVG
||||||xxxx    |||||x||||    ||||||x|||
AGGVLIQVG-    AGGVL-IQVG    AGGVLI-QVG

Score the residue matches and score the residue gaps

Heuristic Algorithms
• FASTA: based on k-tuples (2 amino acids)
• BLAST: triples of conserved amino acids
• Gapped-BLAST: allows gaps in segment pairs
• PHI-BLAST: Pattern-Hit Initiated search
• PSI-BLAST: Position-Specific Iterated search
What is a heuristic algorithm?
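
Returning to the gap example (AGGVLIIQVG aligned against AGGVLIQVG): the instruction to "score the residue matches and score the residue gaps" can be sketched in Python as below. This is only an illustration. The function name score_alignment and the scoring values (match +1, mismatch -1, first gap position -2, each further gap position -1) are assumptions chosen for readability, not values from these slides or from any BLAST default, and a real protein alignment would score residue pairs with a substitution matrix such as BLOSUM62.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Score an existing pairwise alignment, column by column.

    Assumed scheme (illustrative only): the first position of a gap costs
    gap_open; each further consecutive gap position in the same row costs
    gap_extend.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")

    score = 0
    previous_gap_row = None  # row ("top"/"bottom") holding the previous column's gap
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gap characters")
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # Extending a gap in the same row is cheaper than opening a new one.
            score += gap_extend if previous_gap_row == row else gap_open
            previous_gap_row = row
        else:
            score += match if a == b else mismatch
            previous_gap_row = None
    return score


if __name__ == "__main__":
    # The three candidate alignments from the gap example.
    candidates = [
        ("AGGVLIIQVG", "AGGVLIQVG-"),
        ("AGGVLIIQVG", "AGGVL-IQVG"),
        ("AGGVLIIQVG", "AGGVLI-QVG"),
    ]
    for top, bottom in candidates:
        print(top, bottom, score_alignment(top, bottom))
```

Under these assumed values, the two alignments that place the gap inside the sequence score higher than the one that pushes it to the end, because they avoid three mismatches.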
FASTA Algorithm at EBI
http://www.ebi.ac.uk/Tools/fasta/index.html

BLAST Algorithms
BLAST (Basic Local Alignment Search Tool)
- To search a sequence against the database
- Extremely fast
- Robust
- Most widely used
• It finds very short segment pairs between the query and a sequence in the database
• These segments are then extended in both directions until the maximum possible score of this particular segment is reached
• Available at NCBI, EBI and many other community database sites

BLAST
• A BLAST search has five components: query, database, program, search purpose/goal and results interpretation
  o Query: a sequence that you want to find out more information about
  o Database: you need to know what databases are available (NCBI, EBI etc.)
  o Program: which program to select to meet your specific purpose
  o Interpreting your results: what do they mean?

[Figures: NCBI Protein Databases; NCBI Nucleotide Databases]

BLAST Program Selection
[Figures: program selection guides for nucleotide queries, protein queries and specialized queries]
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=ProgSelectionGuide

[Figures: Optional parameters in (blastn); Basic Optional Parameters in BLASTP]

Advanced BLAST Parameters
The default parameters are not always the right parameters for your search; it depends on your question.
• G: Cost to open gap (default = 5 for nucleotides / 11 for proteins)
• E: Cost to extend gap (default = 2 for nucleotides / 1 for proteins)
• Q: Penalty for nucleotide mismatch (default = -3)
• R: Reward for nucleotide match (default = 1)
• E: Expectation value (default = 10)

BLAST Choice of Programs
• MEGABLAST is specifically designed to efficiently find long alignments between very similar sequences, and is thus the best tool for finding the identical match to your query sequence.
• Discontiguous MEGABLAST is better at finding nucleotide sequences that are similar, but not identical, to your nucleotide query. This program uses a non-contiguous word within a longer window of template. In coding mode, third-base wobbling is taken into consideration by focusing on finding matches at the first and second codon positions while ignoring mismatches in the third position. Searching in discontiguous MEGABLAST is more sensitive and efficient than standard blastn using the same word size.

BLAST Choice of Programs
• "Search for short nearly exact matches" is useful for primer or short nucleotide searches. Short sequences (less than 20 bases) will often not find any significant matches to the database entries under the standard nucleotide-nucleotide BLAST settings. The usual reasons for this are that the significance threshold governed by the Expect value parameter is set too stringently and the default word size parameter is set too high. You can adjust both the word size and the Expect value on the standard BLAST pages to work with short sequences.
  A common use of this page is to check the specificity of PCR or hybridization primers. A useful way to check a pair of PCR primers is to first concatenate them by inserting a string of 20 or more N's between the two primers, and then search the concatenated pair as one sequence. Since BLAST looks for local alignments and automatically searches both strands, there is no need to reverse complement the reverse primer before doing the concatenation or the search.

BLAST Choice of Programs
• Use the Trace Archive BLAST page to search raw primary sequence trace files. The sequence data come from a variety of projects and sequencing strategies, including Whole Genome Shotgun (WGS), BAC end sequencing, and EST sequencing. The trace data are single-pass sequencing reads not trimmed for quality or vector contamination. Their average lengths are between 500 and 700 bp.
• Standard protein BLAST is designed for protein searches.
  Standard protein-protein BLAST (blastp) is used both for identifying a query amino acid sequence and for finding similar sequences in protein databases. Like other BLAST programs, blastp is designed to find local regions of similarity. When sequence similarity spans the whole sequence, blastp will also report a global alignment, which is the preferred result for protein identification purposes.
• PSI-BLAST is designed for more sensitive protein-protein similarity searches. Position-Specific Iterated (PSI)-BLAST is the most sensitive BLAST program, making it useful for finding very distantly related proteins or new members of a protein family. Use PSI-BLAST when your standard protein-protein BLAST search either failed to find significant hits, or returned hits with descriptions such as "hypothetical protein" or "similar to...".

Database Search Questions
• What database should I search?
• What kind of sequences should I search with?
• What E-value is significant?
• What can I reliably infer about the function of my sequence based on homology?

Databases
• Bigger databases have more sequences
• Bigger databases are also more redundant, which can skew the statistics
• Bigger databases are often more poorly annotated (homology with an "unidentified sequence" doesn't really tell you much)
• Bigger databases take a long time to search
• Smaller databases (like Swiss-Prot) are often better curated and annotated

Databases
• Smaller databases are much less redundant
• Smaller databases can contain only phylogenetically relevant sequences (e.g. all plant)
• Smaller databases are much faster to search

What is a significant E-value?
• For a single search, an E-value of 10^-3 is significant, though the match is typically quite distant
• For multiple searches, the E-value cutoff varies according to the number of searches

Multiple Sequence Searches
e.g. 15,000 EST query sequences
• A 10^-3 E-value cutoff means that you should expect one false positive in 1000 searches
• Thus with 15,000 searches, we should expect 15 false positives with a cutoff of 10^-3
• To reduce the chances of identifying a false positive, set the E-value cutoff lower
• For 15,000 searches, an E-value cutoff of 10^-5 means that you should expect 0.15 false positives

Multiple Sequence Searches
In general:
• DNA to DNA alignment: for nucleotide sequences at least 100 bp long, if at least 70% of the nucleotides are identical to the match sequence, the two sequences can be considered homologous
• AA to AA alignment: for amino acid sequences at least 100 aa long, if at least 25% of the amino acids are identical to the match sequence, the two sequences can be considered homologous
• Below these values, the alignments are considered to be in the twilight zone!
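
The false-positive arithmetic for multiple searches (15,000 queries at a given E-value cutoff) can be reproduced with a few lines of Python. This is a minimal sketch that assumes, as the slides do, that the expected number of false positives scales linearly with the number of searches. The function names expected_false_positives and cutoff_for_target are hypothetical helpers written for this illustration and are not part of BLAST.

```python
def expected_false_positives(n_searches, evalue_cutoff):
    """Expected number of chance hits across all searches (linear scaling assumed)."""
    return n_searches * evalue_cutoff


def cutoff_for_target(n_searches, target_false_positives):
    """E-value cutoff per search that gives the desired total of expected false positives."""
    return target_false_positives / n_searches


if __name__ == "__main__":
    n_queries = 15_000  # e.g. 15,000 EST query sequences
    for cutoff in (1e-3, 1e-5):
        total = expected_false_positives(n_queries, cutoff)
        print(f"E-value cutoff {cutoff:g}: about {total:.2f} expected false positives")

    # Cutoff needed to expect roughly one false positive over the whole batch.
    print(f"Cutoff for about 1 false positive: {cutoff_for_target(n_queries, 1.0):.2e}")
```

The second helper runs the same relationship in reverse: given the number of queries, it returns the per-search cutoff that keeps the total number of expected false positives at a chosen level.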
