Introduction to sequence alignment
WEEK 2
Mike Hallett (David Walsh)
BIOL510: Bioinformatics

Outline: pairwise alignment
• The importance of pairwise alignment
• The important steps in comparing two sequences (Sections 4.1-4.5)
• Performing pairwise alignments using BLAST (Sections 4.6-4.7): WILL COVER IN LAB CLASS ON TUESDAY

4.1 Principles of sequence alignment
Sequences (DNA and protein) vary as a result of evolutionary processes acting at the molecular level:
1. Point mutations: nucleotides or amino acids
2. Insertions and deletions (length variation)
3. Fusion of two genes into a single gene
Evolution in gene sequences can effectively mask any underlying sequence similarity. (P. 73)

Similarity and Homology
• Similarity is a quantitative measure of how related two sequences are:
  – Usually based on a pairwise alignment of the two sequences
  – By aligning sequences we can count the number of residues that line up and express the result as percent identity
  – High degrees of sequence similarity may imply a common evolutionary history or a possible commonality in biological function
• Homology refers specifically to similarity in sequence or structure due to descent from a common ancestor
  – The concept of homology implies an evolutionary relationship

Definition: homology
Homology: Similarity attributed to descent from a common ancestor.
[Figure: morphological homology versus molecular homology. The molecular example is a six-sequence alignment; labels and rows are listed in slide order.]
  fly        GAKKVIISAP  SAD.APM..F
  human      GAKRVIISAP  SAD.APM..F
  plant      GAKKVIISAP  SAD.APM..F
  bacterium  GAKKVVMTGP  SKDNTPM..F
  yeast      GAKKVVITAP  SS.TAPM..F
  archaeon   GADKVLISAP  PKGDEPVKQL

Definitions: identity, similarity, conservation
Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.
Similarity: The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.
Conservation: Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
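The idea of counting residues that line up and reporting percent identity can be sketched in a few lines of Python. This is a minimal illustration, not a course-provided tool: the function name percent_identity is hypothetical, the two sequences are the first two rows of the homology example above, and the input is assumed to be already aligned (equal length, with "-" marking gaps).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two already-aligned sequences of equal length.

    Assumption: '-' marks a gap, and a gap never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # Count columns where both rows carry the same residue.
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


# First two rows of the molecular homology example: one K/R difference in 10.
print(percent_identity("GAKKVIISAP", "GAKRVIISAP"))  # 90.0
```

Dividing by the full alignment length (gap columns included) is one common convention; other tools divide by the shorter sequence or by ungapped columns only, so reported percentages can differ.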
Pairwise sequence alignment is the most fundamental operation of bioinformatics (P. 72)
• It is fundamental to characterizing genome sequences
• To identify genes within a genome
• To identify related proteins, predict protein structure and function
• To construct phylogenetic trees and compare evolutionary relationships

Definition: pairwise alignment
Pairwise alignment: The process of lining up two sequences to achieve maximal levels of similarity.
  Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
             |||||||| |||| |||||| ||||| | |||||||||||||||||||||||||||||||
  Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247

4.5 Types of alignment
Global alignments: an alignment that covers the full length of a gene or protein sequence
  → for aligning closely related sequences that are similar over their whole length
Local alignments: an alignment that only covers a certain region (e.g. a domain) of a gene or protein sequence
  → for aligning proteins that are only partly related (e.g. multidomain proteins)
  → for identifying conserved regions in very divergent sequences

General approach to pairwise alignment
• Choose two sequences
• Select an alignment algorithm that generates a score
• Score reflects degree of similarity
• Allow gaps (insertions, deletions)
• Estimate probability that the alignment occurred by chance

4.1 Principles of sequence alignment
When sequences are derived from a common ancestor, we want to align bases/amino acids derived from the same ancestral position.
  Peptide hormones:
    β-corticotropin (sheep)   ala gly glu asp asp glu
    Corticotropin A (pig)     asp gly ala glu asp glu
  Neuromodulators:
    Oxytocin      CYIQNCPLG
    Vasopressin   CYFQNCPRG

Example with identical matches and mismatches (P. 73):
  T H I S S E Q E N C E
  T H A T S E Q E N C E
  → two mismatches (I/A and S/T), i.e. two amino acid point mutations

Often the sequences we wish to align will differ in length, obscuring the similarity that exists (P. 73):
  T H I S I S A S E Q E N C E
  T H A T S E Q E N C E
  How many amino acid point mutations?
  → 8 point mutations?

Insertion/deletion mutations result in gaps in an alignment:
  T H I S I S A - S E Q E N C E
  T H - - - - A T S E Q E N C E
  How many amino acid point mutations?
  → 0 point mutations?
  → but two indel mutations!
The best pairwise alignment is not obvious, hence we have algorithms for testing different alignments quantitatively. (P. 74)

Matches do not have to be identical
Certain amino acids resemble each other in their physical and chemical characteristics, and can substitute functionally for each other.
  T H I S I S A S E Q E N C E
  T H A T - - - S E Q E N C E
  serine - threonine
  isoleucine - alanine
[Figure: amino acids grouped as charged, polar uncharged, and hydrophobic]

Pairwise alignment: protein versus DNA sequences
• Synonymous mutations alter DNA but not amino acid sequences
• Nonsynonymous mutations alter the amino acid sequence
• Protein sequences offer a longer "look-back" time
• DNA sequences can be translated into protein, and then used in pairwise alignments
Codons are degenerate: changes in the third position often do not alter the amino acid that is specified.

DNA alignments
• Many times, DNA alignments are appropriate (or necessary):
  -- to identify promoters and regulatory elements
  -- to identify gene sequences
  -- to study noncoding regions of DNA
  -- to study DNA polymorphisms (SNPs)
(Example: the Query/Sbjct DNA alignment shown under "Definition: pairwise alignment".)

4.2 Scoring alignments
How do we objectively determine which is the best possible alignment for a pair of sequences?
→ Generate all possible alignments (not possible: 10^75 possibilities for an alignment of 100 positions!)
→ Calculate a score for each alignment
• Optimal alignment: the alignment with the best score
• Suboptimal alignments: alignments with slightly poorer scores

4.2 Scoring alignments: Percent identity
The simplest way to quantify similarity is to count the matching bases/amino acids and divide by the length of the alignment.
  T H I S I S A - S E Q E N C E
  T H - - - - A T S E Q E N C E
  (10 matches / 15 positions) * 100 ≈ 66.7% identity

4.2 Scoring alignments: dot plots
Dot-plots are a simple way to visualize pairwise sequence similarity. (Fig 4.1)

Matches do not have to be identical
Do all amino acid substitutions occur with the same probability? NO!
  T H I S I S A S E Q E N C E
  T H A T - - - S E Q E N C E
  serine – threonine: highly conservative
  isoleucine – alanine: poorly conservative

Substitution Matrix
A substitution matrix contains the likelihood that a particular pair of amino acids will occupy the same position due to descent from a common ancestor (i.e. homology).
→ 20 x 20 substitution matrix

The BLOSUM62 substitution matrix (Fig 4.4)
  +1 for Ser to Thr
  +5 for Arg to Arg
  -2 for Arg to Asp

Scoring a pairwise alignment using the BLOSUM62 matrix
  T  H  I  S  S  E  Q  E  N  C  E
  T  H  A  T  S  E  Q  E  N  C  E
  5  8 -1  1  4  5  5  5  6  9  5
The overall alignment score (S) = 52

Generation of substitution scoring matrices
• Based on the observed amino acid substitution frequencies in alignments of homologous protein sequences
• Use real data to model the evolutionary processes
PAM substitution matrices are calculated from global protein alignments.
BLOSUM substitution matrices are calculated from local protein alignments.

PAM matrices: Point-accepted mutations
Dayhoff (1960s) calculated substitution probabilities from alignments of highly similar protein families.
All the PAM data come from closely related proteins (>85% amino acid identity).
PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.
Other PAM matrices are extrapolated from PAM1.
For PAM250, 250 changes have occurred for two proteins over a length of 100 amino acids.
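The BLOSUM62 worked example in this section (S = 52) can be checked with a short Python sketch. The dictionary below holds only the handful of BLOSUM62 entries that this one alignment needs, not the full 20 x 20 matrix, and the names BLOSUM62_SUBSET, pair_score and alignment_score are hypothetical helpers introduced here for illustration.

```python
# Only the BLOSUM62 entries needed for this example (the real matrix is 20 x 20).
# Keys are stored in alphabetical order because the matrix is symmetric.
BLOSUM62_SUBSET = {
    ("T", "T"): 5, ("H", "H"): 8, ("A", "I"): -1, ("S", "T"): 1,
    ("S", "S"): 4, ("E", "E"): 5, ("Q", "Q"): 5, ("N", "N"): 6,
    ("C", "C"): 9,
}


def pair_score(a: str, b: str, matrix: dict) -> int:
    """Look up the substitution score for one aligned pair of residues."""
    return matrix[tuple(sorted((a, b)))]


def alignment_score(seq_a: str, seq_b: str, matrix: dict) -> int:
    """Sum the substitution scores over an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped alignment requires equal lengths")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq_a, seq_b))


print(alignment_score("THISSEQENCE", "THATSEQENCE", BLOSUM62_SUBSET))  # 52
```

A pair that is missing from the subset raises KeyError, which is deliberate here: silently scoring an unknown pair as zero would hide a gap in the table.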
A PAM250 scoring matrix assigns scores and is forgiving of mismatches (such as +17 for W to W or -5 for W to T):

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2  -2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4

Compare this to a scoring matrix such as PAM10, which is strict and does not tolerate mismatches (such as +13 for W to W or -19 for W to T):

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   7
R -10   9
N  -7  -9   9
D  -6 -17  -1   8
C -10 -11 -17 -21  10
Q  -7  -4  -7  -6 -20   9
E  -5 -15  -5   0 -20  -1   8
G  -4 -13  -6  -6 -13 -10  -7   7
H -11  -4  -2  -7 -10  -2  -9 -13  10
I  -8  -8  -8 -11  -9 -11  -8 -17 -13   9
L  -9 -12 -10 -19 -21  -8 -13 -14  -9  -4   7
K -10  -2  -4  -8 -20  -6  -7 -10 -10  -9 -11   7
M  -8  -7 -15 -17 -20  -7 -10 -12 -17  -3  -2  -4  12
F -12 -12 -12 -21 -19 -19 -20 -12  -9  -5  -5 -20  -7   9
P  -4  -7  -9 -12 -11  -6  -9 -10  -7 -12 -10 -10 -11 -13   8
S  -3  -6  -2  -7  -6  -8  -7  -4  -9 -10 -12  -7  -8  -9  -4   7
T  -3 -10  -5  -8 -11  -9  -9 -10 -11  -5 -10  -6  -7 -12  -7  -2   8
W -20  -5 -11 -21 -22 -19 -23 -21 -10 -20  -9 -18 -19  -7 -20  -8 -19  13
Y -11 -14  -7 -17  -7 -18 -11 -20  -6  -9 -10 -12 -17  -1 -20 -10  -9  -8  10
V  -5 -11 -12 -11  -9 -10 -10  -9  -9  -1  -5 -13  -4 -12  -9 -10  -6 -22 -10   8

BLOSUM Matrices
BLOSUM matrices are based on local alignments.
The BLOCKS database contains thousands of groups of multiple sequence alignments.
BLOSUM stands for blocks substitution matrix.
All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
BLOSUM Matrices
BLOSUM62 is a matrix calculated from comparisons of sequences with no more than 62% identity.
BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

Selecting an appropriate scoring matrix
[Figure: choice of matrix along a scale from more conserved (rat versus mouse globin) to less conserved (rat versus bacterial globin)]

4.4 Inserting Gaps
Homologous sequences are often different in length as a result of insertions and deletions (indels).
The alignment of indels involves inserting gaps into the alignment.
Gap penalty: each time a gap is introduced, a gap penalty is subtracted from the score.
• A gap opening penalty is usually high
• A gap extension penalty is usually low

Scoring a pairwise alignment using the BLOSUM62 matrix and gap penalty
  T  H  I  S  I  S  A  S  E  Q  E  N  C  E
  T  H  A  T  -  -  -  S  E  Q  E  N  C  E
  5  8 -1  1           4  5  5  5  6  9  5
Gap opening penalty = -11
Gap extension penalty = -1
The three-position gap costs one opening (-11) plus two extensions (2 x -1 = -2).
The overall score (S) = 52 + (-11 - 2) = 39

4.4 Inserting Gaps
[Figure: alignment with a high gap penalty versus alignment with a low gap penalty] (Page 86)

Next in the course... Ch 5.
We have learned how to score an alignment, but how do you generate the alignment in the first place? Here are two approaches:
→ Dynamic Programming Algorithms
→ Heuristic Search Algorithms

Sequence alignments continued
Rasko et al. Nucleic Acids Res. 2004; 32(3): 977–988
David Walsh
BIOL510: Bioinformatics

Outline: sequence alignments (Ch 5)
• Dynamic Programming Algorithms (Ch 5.2)
  Global alignment: Needleman-Wunsch
  Local alignments: Smith-Waterman
• Heuristic Search Algorithms (Ch 5.3)
  → BLAST
• Alignment Score Significance (Ch 5.4)
WE WILL COVER THIS ON TUESDAY DURING THE LAB

Recap: scoring an alignment using the BLOSUM62 substitution matrix and gap penalty
For THISISASEQENCE aligned to THAT---SEQENCE, with gap opening penalty = -11 and gap extension penalty = -1, the overall score (S) = 52 + (-11 - 2) = 39.

Dynamic Programming Algorithms
For any given pair of sequences, if gaps are allowed there is a large number of possible alignments.
Dynamic programming algorithms can explore the full range of alignments using a variety of different constraints, by dividing the problem of alignment into many smaller parts.
Needleman and Wunsch published the original program in the 1970s, and there have been many modifications and improvements since.

Global alignment versus local alignment
Global alignment (Needleman-Wunsch) extends from one end of each sequence to the other.
Local alignment finds optimally matching regions within two sequences ("subsequences").
Local alignment is almost always used for database searches such as BLAST. It is useful for finding domains (or limited regions of homology) within sequences.
Smith and Waterman (1981) solved the problem of performing optimal local sequence alignment. Other methods (BLAST, FASTA) are faster but less thorough.

Needleman-Wunsch: dynamic programming
N-W is guaranteed to find optimal alignments, although the algorithm does not search all possible alignments.
It is an example of a dynamic programming algorithm: an optimal path (alignment) is identified by incrementally extending optimal subpaths.
Thus, a series of decisions is made at each step of the alignment to find the pair of residues with the best score.

4.2 Scoring alignments: dot plots
Dot-plots are a simple way to visualize pairwise sequence similarity. But they are the beginning of generating optimal alignments as well. (Fig 4.1)

Three steps to global alignment with the Needleman-Wunsch algorithm
[1] set up a matrix of two sequences
[2] score the matrix
[3] identify the optimal alignment(s)

Global alignment with the algorithm of Needleman and Wunsch (1970)
• Two sequences can be compared in a matrix along x- and y-axes.
• If they are identical, a path along a diagonal can be drawn.
• Find the optimal subpaths, and add them up to achieve the best score. This involves
  -- adding gaps when needed
  -- allowing for conservative substitutions
  -- choosing a scoring system (simple or complicated)
• N-W is guaranteed to find optimal alignment(s)

Four possible outcomes in aligning two sequences
[1] identity (stay along a diagonal)
[2] mismatch (stay along a diagonal)
[3] gap in one sequence (move vertically!)
[4] gap in the other sequence (move horizontally!)
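The three steps and four outcomes listed above can be sketched as a small Needleman-Wunsch implementation in Python. This is a teaching sketch under stated assumptions, not the textbook's program: it uses a simple match/mismatch score instead of BLOSUM62 and a single linear gap penalty, the default values (+1, -1, -2) are arbitrary illustrative choices, and when several optimal alignments exist it returns only one of them.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(seq_a), len(seq_b)

    # Step 1: set up the matrix. score[i][j] is the best score for
    # aligning the first i residues of seq_a with the first j of seq_b.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # seq_a prefix aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # seq_b prefix aligned against gaps only

    def sub(i, j):
        # Outcomes [1] identity and [2] mismatch both use the diagonal.
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    # Step 2: score the matrix by extending optimal subpaths.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # diagonal
                score[i - 1][j] + gap,            # vertical: gap in seq_b
                score[i][j - 1] + gap,            # horizontal: gap in seq_a
            )

    # Step 3: trace back from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


s, a, b = needleman_wunsch("THISISASEQENCE", "THATSEQENCE")
print(s)
print(a)
print(b)
```

Swapping the match/mismatch rule for a substitution-matrix lookup changes only the sub helper; the matrix fill and the traceback stay the same.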
The initial stage of dynamic programming (Figure 5.8, Figure 5.10)
• Gap extension penalty (E) = -8
• BLOSUM62 substitution matrix
[Figure: the alignment matrix being initialized; the value -16 is entered]

The initial stage of dynamic programming: filling in the matrix (Figure 5.8, Figure 5.10)
• Gap extension penalty (E) = -8
• BLOSUM62 substitution matrix
• Thr → Ile: score = -1
[Figure: the value -1 is entered in the matrix]

The initial stage of dynamic programming: filling in the matrix
• Score = -4
• Gap extension penalty (E) = -8
• BLOSUM62 substitution matrix
The final stage of dynamic programming: traceback (Figure 5.9)

The initial stage of dynamic programming: filling in the matrix
• Score = 7
• Gap extension penalty (E) = -4
• BLOSUM62 substitution matrix
The final stage of dynamic programming: traceback (Figure 5.11)

Local alignment: the Smith-Waterman (SW) algorithm (Page 88, Page 136)
Remember: two protein sequences may not exhibit homology along their full length.
SW is a modification of the Needleman-Wunsch algorithm.
Instead of looking at each sequence in its entirety, the method compares segments of all possible lengths and chooses the segments that optimize the similarity measure.

Local alignment algorithm: subsequence alignment scores less than zero (<0) are rejected (Figure 5.15)
• Score = 12
• Gap extension penalty (E) = -8 (!)
• BLOSUM62 substitution matrix

Outline: sequence alignments (Ch 5)
• Dynamic Programming Algorithms (Ch 5.2)
  Global alignment: Needleman-Wunsch
  Local alignments: Smith-Waterman
• Heuristic Search Algorithms (Ch 5.3)
  → BLAST
• Alignment Score Significance (Ch 5.4)
Will cover this topic during the lab
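The Smith-Waterman rule that subsequence scores below zero are rejected can be sketched as a small change to the global algorithm: every cell is floored at zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The Python below is an illustrative sketch only. It assumes a simple match/mismatch score and a linear gap penalty (+2, -1, -2 are arbitrary values chosen here), not the BLOSUM62 setup of the textbook figures, and it reports a single best local alignment.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(seq_a), len(seq_b)
    # First row and column stay 0: a local alignment may start anywhere.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                0,                                # reject negative subpaths
                score[i - 1][j - 1] + sub(i, j),  # diagonal
                score[i - 1][j] + gap,            # gap in seq_b
                score[i][j - 1] + gap,            # gap in seq_a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1])
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("THISISASEQENCE", "THATSEQENCE"))
```

With these assumed scores the shared SEQENCE segment is reported on its own, because extending the alignment leftwards through the mismatches and gaps would only lower the score.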
