Sequence Alignment
CSCE 769 Guest Lecture
November 1, 2012
Stephanie Irausquin, PhD

Sequence Alignment: Definition and Importance
● Sequence alignment is a process in which at least two homologous sequences are compared. It involves identifying the insertions or deletions that might have occurred in either lineage since their divergence from a common ancestor
● A powerful tool for discovering biological function and establishing evolutionary relationships

Sequence Alignment
● The same principles can be used to align both nucleotide and amino acid sequences
● More reliable alignments are usually obtained by using amino acid sequences
  1. Amino acids change less frequently during evolution than nucleotides
  2. There are 20 amino acids and only 4 nucleotides, so the probability for 2 sites to be identical by chance is lower at the amino acid level than at the nucleotide level

Sequence Alignment
● A DNA sequence alignment consists of a series of paired bases (one base from each sequence)
● There are 3 types of aligned pairs
  1. Match – it is assumed that the nucleotide at this site has not changed since the divergence between the two sequences
  2. Mismatch – at least one substitution has occurred in one of the sequences since their divergence from each other
  3. Gap – a deletion has occurred in one sequence, or an insertion has occurred in the other

Types of Alignment
● Manual
● Dot matrix
● Distance and similarity methods
● Alignment algorithms

Manual Alignment
● A reasonable alignment can be obtained by visual inspection, using either specialized alignment editors or plain text editors, when there are few gaps and the two sequences are not too different from each other
● Advantages: uses the brain and allows direct integration of additional data (e.g., domain structure)
● Disadvantages: is subjective, and results cannot be compared to those derived using other methods

Dot Matrix
● The two sequences to be aligned are written out as column and row headings of a two-dimensional matrix
● A dot is put in the dot matrix plot at a position where the nucleotides in the two sequences are identical
[Figure: example dot matrix plot, with one sequence along the row headings and the other along the column headings]

Dot Matrix
● Advantages:
  – a simple method
  – is useful in unraveling important information about the evolution of sequences
● Disadvantages:
  – may become very cluttered
  – may require human intervention to recognize patterns
  – may not be reliable
  – limited to two sequences

Dot Matrix Examples
[Figure: two example dot matrix plots, panels a.) and b.)]

Distance and similarity methods
● The best possible alignment between two sequences is the one which minimizes the numbers of mismatches and gaps
● However, reducing the number of mismatches usually results in an increase in the number of gaps (and vice versa)

Distance and similarity methods
Consider the following example:
  Seq1: TCAGACGATTG   LengthSeq1 = 11
  Seq2: TCGGAGCTG     LengthSeq2 = 9
● We can reduce the number of mismatches to 0, but the number of gaps in this case is 6:
  Seq1: TCAG-ACG-ATTG
  Seq2: TC-GGA-GC-T-G

Distance and similarity methods
Our example, yet again (Seq1: TCAGACGATTG, Seq2: TCGGAGCTG):
● Conversely, we can reduce the number of gaps to a single gap having the minimum possible size, |LengthSeq1 – LengthSeq2| = 2 nucleotides, which increases the number of mismatches to 5:
  Seq1: TCAGACGATTG

Distance and similarity methods
Our example, yet again (Seq1: TCAGACGATTG, Seq2: TCGGAGCTG):
● We can also choose an alignment that minimizes neither the number of gaps nor the number of mismatches.
In the case below, the number of gaps is 4 and the number of mismatches is 2:
  Seq1: TCAG-ACGATTG

Distance and similarity methods
● Which of the three alignments is preferable?
● To determine that, we need a common denominator (the gap penalty) that allows us to compare gaps and mismatches
● Gap penalty – a factor (or set of factors) by which gap values (the numbers and lengths of gaps) are multiplied in order to make the gaps equivalent in value to mismatches
  – Based on how frequently different types of insertions and deletions occur in evolution in comparison with the frequency of substitutions

Distance and similarity methods
● For any given alignment, we can calculate a distance or dissimilarity index (D) as:

  D = \sum_i m_i y_i + \sum_k w_k z_k

  where y_i is the number of mismatches of type i, m_i is the mismatch penalty for an i-type mismatch, z_k is the number of gaps of length k, and w_k is a positive number representing the penalty for gaps of length k

Distance and similarity methods
● In the most frequently used gap penalty systems, it is assumed that the gap penalty includes two components:
  1. Gap-opening penalty
  2. Gap-extension penalty
● Further complications in the gap penalty system may be introduced by distinguishing among different mismatches (e.g., for amino acids, a Leu/Ile mismatch vs an Arg/Glu mismatch)

BLOSUM
● BLOSUM (BLOcks SUbstitution Matrix) is a substitution matrix used for sequence alignment of proteins
● First introduced in a paper by Henikoff and Henikoff
  – They scanned very conserved regions of protein families and counted the relative frequencies of amino acids and their substitution probabilities
  – They calculated a log-odds score for each of the possible substitutions of the 20 standard amino acids
● Several sets of matrices exist

BLOSUM50 Substitution Matrix
Key: A Ala, C Cys, D Asp, E Glu, F Phe, G Gly, H His, I Ile, K Lys, L Leu, M Met, N Asn, P Pro, Q Gln, R Arg, S Ser, T Thr, V Val, W Trp, Y Tyr

       A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  A    5  -2  -1  -2  -1  -1  -1   0  -2  -1  -2  -1  -1  -3  -1   1   0  -3  -2   0
  R   -2   7  -1  -2  -4   1   0  -3   0  -4  -3   3  -2  -3  -3  -1  -1  -3  -1  -3
  N   -1  -1   7   2  -2   0   0   0   1  -3  -4   0  -2  -4  -2   1   0  -4  -2  -3
  D   -2  -2   2   8  -4   0   2  -1  -1  -4  -4  -1  -4  -5  -1   0  -1  -5  -3  -4
  C   -1  -4  -2  -4  13  -3  -3  -3  -3  -2  -2  -3  -2  -2  -4  -1  -1  -5  -3  -1
  Q   -1   1   0   0  -3   7   2  -2   1  -3  -2   2   0  -4  -1   0  -1  -1  -1  -3
  E   -1   0   0   2  -3   2   6  -3   0  -4  -3   1  -2  -3  -1  -1  -1  -3  -2  -3
  G    0  -3   0  -1  -3  -2  -3   8  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
  H   -2   0   1  -1  -3   1   0  -2  10  -4  -3   0  -1  -1  -2  -1  -2  -3   2  -4
  I   -1  -4  -3  -4  -2  -3  -4  -4  -4   5   2  -3   2   0  -3  -3  -1  -3  -1   4
  L   -2  -3  -4  -4  -2  -2  -3  -4  -3   2   5  -3   3   1  -4  -3  -1  -2  -1   1
  K   -1   3   0  -1  -3   2   1  -2   0  -3  -3   6  -2  -4  -1   0  -1  -3  -2  -3
  M   -1  -2  -2  -4  -2   0  -2  -3  -1   2   3  -2   7   0  -3  -2  -1  -1   0   1
  F   -3  -3  -4  -5  -2  -4  -3  -4  -1   0   1  -4   0   8  -4  -3  -2   1   4  -1
  P   -1  -3  -2  -1  -4  -1  -1  -2  -2  -3  -4  -1  -3  -4  10  -1  -1  -4  -3  -3
  S    1  -1   1   0  -1   0  -1   0  -1  -3  -3   0  -2  -3  -1   5   2  -4  -2  -2
  T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   2   5  -3  -2   0
  W   -3  -3  -4  -5  -5  -1  -3  -3  -3  -3  -2  -3  -1   1  -4  -4  -3  15   2  -3
  Y   -2  -1  -2  -3  -3  -1  -2  -3   2  -1  -1  -2   0   4  -3  -2  -2   2   8  -1
  V    0  -3  -3  -4  -1  -3  -3  -4  -4   4   1  -3   1  -1  -3  -2   0  -3  -1   5

Each entry is a log-odds score:

  s_{x,y} = \log \frac{p_{xy}}{p_x p_y}

p_{xy} is the probability that x and y are aligned because they are evolutionarily related. p_x is the probability of occurrence of x. p_y is the probability of occurrence of y.
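
The distance index D defined above can be sketched in a few lines of Python. This is only an illustration and not part of the original lecture. It assumes a single penalty for every type of mismatch and a gap penalty of the form w_k = gap_open + gap_extend * (k - 1); the function name `alignment_distance` and the default penalty values are hypothetical choices made for this example.

```python
def alignment_distance(row1, row2, mismatch=1.0, gap_open=2.0, gap_extend=0.5):
    """Return (mismatches, gap_lengths, D) for two aligned rows of equal length.

    Assumptions (not from the lecture): one mismatch penalty for all mismatch
    types, and a gap of length k costs gap_open + gap_extend * (k - 1).
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")

    mismatches = 0
    gap_lengths = []   # one entry per gap, holding its length k
    run = 0            # length of the gap currently being scanned
    run_row = None     # which row (1 or 2) that gap sits in

    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            row = 1 if x == "-" else 2
            if run and row == run_row:
                run += 1                      # same gap continues
            else:
                if run:
                    gap_lengths.append(run)   # a gap in the other row just ended
                run, run_row = 1, row
        else:
            if run:
                gap_lengths.append(run)
                run = 0
            if x != y:
                mismatches += 1
    if run:
        gap_lengths.append(run)

    d = mismatch * mismatches + sum(
        gap_open + gap_extend * (k - 1) for k in gap_lengths
    )
    return mismatches, gap_lengths, d


# The zero-mismatch alignment of Seq1 and Seq2 from the example slides.
print(alignment_distance("TCAG-ACG-ATTG", "TC-GGA-GC-T-G"))
```

For that alignment the function reports 0 mismatches and six gaps of length 1, so D is determined entirely by the gap-opening penalty. Changing the penalties changes which of several candidate alignments has the smallest D, which is the point of the gap-penalty discussion.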
Sequence Alignment Algorithms
● The purpose of any alignment algorithm is to choose the alignment associated with the smallest D from all possible alignments
● The number of possible alignments can be very large
● Fortunately, there are computer algorithms for finding the optimal alignment between two sequences
● Fundamentally, there are two different types of alignment algorithms:
  1. Global (Needleman-Wunsch)
  2. Local (Smith-Waterman)

Global Alignment: Needleman-Wunsch
● Every letter of each sequence is aligned to a letter or a gap
● Alignment takes place in a 2D matrix
● Each cell corresponds to a pairing of one letter from each sequence and contains a score derived from a scoring scheme, along with a corresponding pointer
● The algorithm contains three major phases (initialization, fill, and trace-back)
● To examine each phase, let's align the words HEAGAWGHE and PAWHEAE using the BLOSUM50 matrix and a gap penalty of -8

Global Alignment: Needleman-Wunsch
● Initialization
  – Values for the first row and column are assigned
  – The score of each cell is set to the gap penalty (-8) multiplied by the distance from the origin

              H    E    A    G    A    W    G    H    E
         0   -8  -16  -24  -32  -40  -48  -56  -64  -72
    P   -8
    A  -16
    W  -24
    H  -32
    E  -40
    A  -48
    E  -56

Global Alignment: Needleman-Wunsch
● Fill
  – Three scores are computed for each cell
    ● Diagonal Score – sum of the diagonal cell score and the score for a match/mismatch (BLOSUM50 matrix)
    ● Horizontal Score – sum of the cell to the left and the gap penalty
    ● Vertical Score – sum of the above cell and the gap penalty
  – The entire matrix is then filled by assigning each cell the max score (obtained from the 3 computed scores) and the corresponding pointer
  – Example, cell (P->H):
      Diagonal Score   {0 + (-2) = -2}
      Horizontal Score {-8 + (-8) = -16}
      Vertical Score   {-8 + (-8) = -16}
      Max Score = -2
  – Example, cell (P->E):
      Diagonal Score   {-8 + (-1) = -9}
      Horizontal Score {-2 + (-8) = -10}
      Vertical Score   {-16 + (-8) = -24}
      Max Score = -9

Global Alignment: Needleman-Wunsch
● Fill
  – Continue calculating the max score for all cells, along with the corresponding pointer

              H    E    A    G    A    W    G    H    E
         0   -8  -16  -24  -32  -40  -48  -56  -64  -72
    P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65
    A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52
    W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29
    H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11
    E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3
    A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5
    E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9

Global Alignment: Needleman-Wunsch
● Trace-back
  – Allows one to recover the alignment from the matrix
  – Trace back from the bottom right corner to the top left corner by following the pointers in the completed matrix
  – Horizontal transition represents a gap in the vertical sequence
  – Vertical transition represents a gap in the horizontal sequence
  – Diagonal transition pairs the corresponding characters of the two sequences (a match or a mismatch)

Final Alignment (score -9, the bottom right cell of the matrix):
  H E A G A W G H - E
  - - P - A W H E A E

Local Alignment: Smith-Waterman
● A slight modification of the Needleman-Wunsch algorithm:
  – Edges of the matrix are initialized to zero
  – Max score is never less than zero; no pointer is recorded unless the score is greater than zero
  – Trace-back starts from the highest score in the matrix and ends at a score of zero

Local Alignment: Smith-Waterman
● Again, let's align the words HEAGAWGHE and PAWHEAE using the same scoring scheme:
  – gap penalty of -8
  – match score and mismatch penalty to be determined using the BLOSUM50 matrix

              H    E    A    G    A    W    G    H    E
         0    0    0    0    0    0    0    0    0    0
    P    0    0    0    0    0    0    0    0    0    0
    A    0    0    0    5    0    5    0    0    0    0
    W    0    0    0    0    2    0   20   12    4    0
    H    0   10    2    0    0    0   12   18   22   14
    E    0    2   16    8    0    0    4   10   18   28
    A    0    0    8   21   13    5    0    4   10   20
    E    0    0    6   13   18   12    4    0    4   16

  – Start from the largest score (28) and trace back to determine the best local alignment
  – Horizontal transition represents a gap in the vertical sequence
  – Vertical transition represents a gap in the horizontal sequence
  – Diagonal transition pairs the corresponding characters of the two sequences (a match or a mismatch)

Local Alignment: Smith-Waterman
● Does it matter which "word"/sequence is horizontal and which is vertical?
● To answer this question, let's align PAWHEAE (horizontal) to HEAGAWGHE (vertical) using the same scoring scheme as before:
  – gap penalty of -8
  – match score and mismatch penalty to be determined using the BLOSUM50 matrix

              P    A    W    H    E    A    E
         0    0    0    0    0    0    0    0
    H    0    0    0    0   10    2    0    0
    E    0    0    0    0    2   16    8    6
    A    0    0    5    0    0    8   21   13
    G    0    0    0    2    0    0   13   18
    A    0    0    5    0    0    0    5   12
    W    0    0    0   20   12    4    0    4
    G    0    0    0   12   18   10    4    0
    H    0    0    0    4   22   18   10    4
    E    0    0    0    0   14   28   20   16

  – Start from the largest score and trace back to determine the best local alignment
So, does it matter which "word"/sequence is horizontal and which is vertical? No, it does not. The matrix for PAWHEAE (horizontal) against HEAGAWGHE (vertical) is the transpose of the matrix for HEAGAWGHE (horizontal) against PAWHEAE (vertical), and the largest score (28) is the same in both. Either way, the final alignment is the same and is considered to be the "optimal" alignment.

Final Alignment (local region, score 28):
  HEAGAWGHE:  A W G H E
  PAWHEAE:    A W - H E

Global or Local?
● When is a global alignment more useful?
  – When sequences in a query set are similar and close in size
● When is a local alignment more useful?
  – When sequences in a query set are dissimilar but suspected to contain regions of similarity
● When sequences (amino acid or nucleotide) are sufficiently similar, there is no difference between local and global alignments

Helpful Charts
AA chart: http://sofbiology.blogspot.com/2010/12/proteinsynthesis-amino-acid-table.html
IUPAC chart: http://www.bioinformatics.org/sms/iupac.html

Except where otherwise noted (i.e., items on the slide labeled "Helpful Charts"), most information contained in this presentation was obtained from:
Graur, Dan and Wen-Hsiung Li. Fundamentals of Molecular Evolution. Second Edition. Sunderland, Massachusetts: Sinauer Associates, Inc., Publishers, 2000.

Some of the information related to global & local alignment algorithms was obtained from and can be accessed at:
http://etutorials.org/Misc/blast/Part+II+Theory/Chapter+3.+Sequence+Alignment/
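
The three phases described for Needleman-Wunsch (initialization, fill, trace-back) and the Smith-Waterman modifications can be sketched as one short Python routine. This is an illustrative sketch, not the lecture's own code. The names `score` and `align` are hypothetical, the score table holds only the BLOSUM50 entries for the six residues that occur in HEAGAWGHE and PAWHEAE, the gap penalty is the linear -8 used in the worked example, and ties in the trace-back are broken in the order diagonal, vertical, horizontal (an arbitrary choice).

```python
# BLOSUM50 entries for the residues A, E, G, H, P, W only (keys are sorted pairs).
_PAIRS = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GG": 8, "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PP": 10, "PW": -4,
    "WW": 15,
}


def score(x, y):
    """Match/mismatch score for residues x and y (symmetric lookup)."""
    return _PAIRS["".join(sorted(x + y))]


def align(a, b, gap=-8, local=False):
    """Return (aligned_a, aligned_b, score).

    local=False: global alignment (Needleman-Wunsch).
    local=True:  local alignment (Smith-Waterman).
    """
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]

    # Initialization: gap penalty times distance from the origin (global only;
    # the local variant leaves the edges at zero).
    if not local:
        for i in range(1, rows):
            F[i][0] = i * gap
        for j in range(1, cols):
            F[0][j] = j * gap

    # Fill: each cell takes the max of the diagonal and the two gap moves.
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(
                F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                F[i - 1][j] + gap,
                F[i][j - 1] + gap,
            )
            if local:
                F[i][j] = max(F[i][j], 0)      # never below zero
                if F[i][j] > best:
                    best, best_pos = F[i][j], (i, j)

    # Trace-back: from the last cell (global) or the highest score (local).
    i, j = best_pos if local else (len(a), len(b))
    final = F[i][j]
    out_a, out_b = [], []
    while (i > 0 or j > 0) and not (local and F[i][j] == 0):
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1   # gap in b
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1   # gap in a
    return "".join(reversed(out_a)), "".join(reversed(out_b)), final


print(align("HEAGAWGHE", "PAWHEAE"))              # global
print(align("HEAGAWGHE", "PAWHEAE", local=True))  # local
```

With these inputs the local call returns the region AWGHE / AW-HE with score 28, and swapping the two arguments only swaps the two output rows, which is the horizontal/vertical point made in the slides. To align other sequences, `_PAIRS` would need to be replaced by the full BLOSUM50 table.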