* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Document
Biochemical switches in the cell cycle wikipedia , lookup
Endomembrane system wikipedia , lookup
Cellular differentiation wikipedia , lookup
Homology modeling wikipedia , lookup
Cell culture wikipedia , lookup
Organ-on-a-chip wikipedia , lookup
Cell growth wikipedia , lookup
Cytokinesis wikipedia , lookup
Gap junction wikipedia , lookup
Extracellular matrix wikipedia , lookup
C E N T R E F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Introduction to bioinformatics 2007 Lecture 7 Pair-wise Sequence Alignment (III) What can sequence alignment tell us about structure HSSP Sander & Schneider, 1991 ≥30% sequence identity Global dynamic programming PAM250, Gap =6 (linear) S P E A R E 0 -6 -12 -18 -24 -30 -36 S -6 2 -4 -10 -16 -22 -28 0 H -12 -4 2 -3 -9 -14 -20 3 0 A -18 -10 -3 2 -1 -7 -13 -1 4 K -24 -16 -9 -3 1 2 -4 E -30 -22 -15 -5 -3 -2 6 S P E A R E S 2 1 0 1 0 0 H -1 0 1 -1 2 1 A 1 1 0 2 -2 K 0 -1 0 -1 E 0 -1 4 0 These values are copied from the PAM250 matrix (see earlier slide) The easy algorithm is only for linear gap penalties Start in left upper cell before either sequence (circled in red). Path will end in lower right cell (circled in blue) SPEARE S-HAKE Higgs & Attwood, p. 124 – Note: There are errors in the matrices!! DP is a two-step process • Forward step: calculate scores • Trace back: start at highest score and reconstruct the path leading to the highest score – These two steps lead to the highest scoring alignment (the optimal alignment) – This is guaranteed when you use DP! Time and memory complexity of DP • The time complexity is O(n2): if you would align two sequences of n residues, you would need to perform n2 algorithmic steps (square search matrix has n2 cells that need to be filled) • The memory (space) complexity is also O(n2): if you would align two sequences of n residues, you would need a square search matrix of n by n containing n2 cells Global dynamic programming i-1 i j-1 j H(i,j) = Max H(i-1,j-1) + S(i,j) H(i-1,j) - g H(i,j-1) - g diagonal vertical horizontal Problem with simple DP approach: •Can only do linear gap penalties •Not suitable for affine and concave penalties Global dynamic programming using affine penalties j-2 j-1 j i-2 i-1 i Looking back from cell (i, j) we can adapt the algorithm for affine gap penalties by looking back to four more cells (magenta) If you came from here, gap was already open, so apply gap-extension penalty If you came from here, gap was opened so apply gap-open penalty Global dynamic programming all three types of gap penalties j-1 i-1 Si,j = si,j + Max Gap opening penalty Max{S0<x<i-1, j-1 - Pi - (i-x-1)Px} Si-1,j-1 Max{Si-1, 0<y<j-1 - Pi - (j-y-1)Px} Gap extension penalty Global dynamic programming Gapo=10, Gape=2 D W V T A L K 0 -12 -14 -16 -18 -20 -22 -24 T -12 8 -9 -6 -5 -9 -11 -14 5 D -14 0 9 2 2 3 -5 -3 -34 10 6 W -16 -13 25 11 5 4 9 0 -21 6 14 5 V -18 -10 -4 37 21 19 19 15 -16 7 5 13 L -20 -14 -2 23 46 31 37 26 1 K -22 -12 -9 17 33 53 39 50 14 -34 -29 -1 17 39 27 50 D W V T A L K T 8 3 8 11 9 9 8 D 12 1 6 8 8 4 8 W 1 25 2 3 2 6 V 6 2 12 8 8 L 4 6 10 9 K 8 5 6 8 These values are copied from the PAM250 matrix (see earlier slide), after being made nonnegative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix) The extra bottom row and rightmost column give the final global alignment scores Easy DP recipe for using affine gap penalties j-1 i-1 • M[i,j] is optimal alignment (highest scoring alignment until [i,j]) • Check – preceding row until j-2: apply appropriate gap penalties – preceding row until i-2: apply appropriate gap penalties – and cell[i-1, j-1]: apply score for cell[i-1, j-1] DP is a two-step process • Forward step: calculate scores • Trace back: start at highest score and reconstruct the path leading to the highest score – These two steps lead to the highest scoring alignment (the optimal alignment) – This is guaranteed when you use DP! Global dynamic programming There are three kinds of alignments • Global alignment • Semi-global alignment • Local alignment Semi-global pairwise alignment • Global alignment: all gaps are penalised • Semi-global alignment: N- and C-terminal gaps (end-gaps) are not penalised End-gaps MSTGAVLIY--TS-------GGILLFHRTSGTSNS End-gaps Semi-global dynamic programming PAM250, Gap =6 (linear) S P E A R E 0 0 0 0 0 0 0 S 0 2 1 0 1 0 0 0 H 0 -1 2 2 -1 3 1 3 0 A 0 1 0 2 4 -2 3 -1 4 K 0 0 0 0 1 7 1 E 0 0 -1 4 0 1 11 S P E A R E S 2 1 0 1 0 0 H -1 0 1 -1 2 1 A 1 1 0 2 -2 K 0 -1 0 -1 E 0 -1 4 0 These values are copied from the PAM250 matrix (see earlier slide) The easy algorithm is only for linear gap penalties Start in left upper cell before either sequence (circled in red). Path will end in cell anywhere in the bottom row or rightmost columns with the highest score SPEARE -SHAKE Higgs & Attwood, p. 124 – Note: There are errors in the matrices!! Semi-global dynamic programming - two examples with different gap penalties These values are copied from the PAM250 matrix (see earlier slide), after being made nonnegative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix) Global score would have been 65 –10 – 1*2 –10 – 2*2 (because of the two end gaps) Semi-global pairwise alignment Applications of semi-global: – Finding a gene in genome – Placing marker onto a chromosome – Generally: if one sequence is much longer than the other Danger: if gap penalties high -- really bad alignments for divergent sequences Local dynamic programming (Smith & Waterman, 1981) LCFVMLAGSTVIVGTR E D A S T I L C G S Negative numbers Amino Acid Exchange Matrix Search matrix AGSTVIVG A-STILCG Gap penalties (open, extension) Local dynamic programming (Smith and Waterman, 1981) basic algorithm i-1 i j-1 j H(i,j) = Max H(i-1,j-1) + S(i,j) H(i-1,j) - g H(i,j-1) - g 0 diagonal vertical horizontal Local dynamic programming Match/mismatch = 1/-1, Gap =2 A T G A C G T 0 0 0 0 0 0 0 0 T 0 0 1 0 0 0 0 1 A 0 1 0 0 1 0 0 0 G 0 0 0 1 0 0 1 0 A 0 1 0 0 2 0 0 0 C 0 0 0 0 0 3 1 0 T 0 0 1 0 0 0 2 2 Fill the matrix (forward pass), then do trace back from highest cell anywhere in the matrix till you reach 0 or the beginning of a sequence Local dynamic programming Match/mismatch = 1/-1, Gap =2 A T G A C G T 0 0 0 0 0 0 0 0 T 0 0 1 0 0 0 0 1 A 0 1 0 0 1 0 0 0 G 0 0 0 1 0 0 1 0 A 0 1 0 0 2 0 0 0 C 0 0 0 0 0 3 1 0 T 0 0 1 0 0 0 2 2 Fill the matrix (forward pass), then do trace back from highest cell anywhere in the matrix till you reach 0 or the beginning of a sequence GAC GAC Local dynamic programming (Smith & Waterman, 1981) j-1 This is the general DP algorithm, which is suitable for linear, affine and concave penalties, although for the examples here affine penalties are used i-1 Si,j = Max Gap opening penalty Si,j + Max{S0<x<i-1,j-1 - Pi - (i-x-1)Px} Si,j + Si-1,j-1 Si,j + Max {Si-1,0<y<j-1 - Pi - (j-y-1)Px} 0 Gap extension penalty Local dynamic programming Global and local alignment A B B A A A A A B A Local B B A Global A B C B C C B C B A A C C Local Global Global or Local Pairwise alignment A B B A A A A A B A Local B B A Global A B C B C C B C B A A C C Local Global Globin fold protein myoglobin PDB: 1MBN Alpha-helices are labelled ‘A’ (blue) to ‘H’ (red). The D helix can be missing in some globins: What happens with the alignment if Dhelix containing globin sequences are aligned with ‘D-less’ ones? sandwich protein immunoglobulin PDB: 7FAB Immunoglobulin structures have variable regions where numbers of amino acids can vary substantially TIM barrel / protein Triose phosphate IsoMerase PDB: 1TIM The evolutionary history of this protein family has been the subject of rigorous debate. Arguments have been made in favor of both convergent and divergent evolution. Because of the general lack of sequence homology, the ancestry of this molecule is still a mystery. Pyruvate kinase Phosphotransferase barrel regulatory domain / barrel catalytic substrate binding domain / nucleotide binding domain What does all this mean for alignments? • Alignments need to be able to skip secondary structural elements to complete domains (i.e. putting gaps opposite these motifs in the shorter sequence). • Depending on gap penalties chosen, the algorithm might have difficulty with making such long gaps (for example when using high affine gap penalties), resulting in incorrect alignment. • Alignments are only meaningful for homologous sequences (with a common ancestor) There are three kinds of pairwise alignments Global alignment – align all residues in both sequences; all gaps are penalised Semi-global alignment – align all residues in both sequences; end gaps are not penalised (zero end gap penalties) Local alignment – align one part of each sequence; end gaps are not applicable Easy global DP recipe for using affine gap penalties (after Gotoh) j-1 Penalty = Pi + gap_length*Pe i-1 M[i,j] is optimal alignment (highest scoring alignment until [i, j]) At each cell [i, j] in search matrix, check Max coming from: Si,j = si,j + Max Max{S0<x<i-1, j-1 - Pi - (i-x-1)Px} Si-1,j-1 Max{Si-1, 0<y<j-1 - Pi - (j-y-1)Px} any cell in preceding row until j-2: add score for cell[i, j] minus appropriate gap penalties; any cell in preceding column until i-2: add score for cell[i, j] minus appropriate gap penalties; or cell[i-1, j-1]: add score for cell[i, j] Select highest scoring cell in bottom row and rightmost column and do trace-back Let’s do an example: global alignment Gotoh’s DP algorithm with affine gap penalties (PAM250, Pi=10, Pe=2) D W V T A L K 0 -12 -14 -16 -18 -20 -22 -24 T -12 0 -17 -14 -13 -17 -19 -22 -3 D -14 -8 -7 -14 -14 -13 2 -2 W -16 -21 9 -13 -19 -18 -2 6 -3 V -18 -18 -20 13 -3 -16 -1 -3 5 L -20 -22 -18 -1 14 K -22 -20 -21 -24 -42 D W V T A L K T 0 -5 0 3 1 1 0 D 4 -7 -2 0 0 -4 0 W -7 17 -6 -5 -6 -2 V -2 -6 4 0 0 L -4 -2 2 1 K 0 -3 -2 0 PAM250 -22 -42 -1 -14 -12 -41 -18 -16 -14 -12 0 Cell (D2, T4) can alternatively come from two cells (same score): ‘high-road’ or ‘low-road’ Row and column ‘0’ are filled with 0, -12, -14, -16, … if global alignment is used (for N-terminal endgaps); also extra row and column at the end to calculate the score including C-terminal end-gap penalties. Note that only ‘non-diagonal’ arrows are indicated for clarity (no arrow means that you go back to earlier diagonal cell). Let’s do another example: semi-global alignment Gotoh’s DP algorithm with affine gap penalties (PAM250, Pi=10, Pe=2) D W V T A L K T 0 -5 0 3 1 1 0 D 4 -7 -2 0 0 -4 W -7 17 -6 -5 -6 V -2 -6 4 0 L -4 -2 2 K 0 -3 -2 D W V T T 0 -5 0 3 0 D 4 -7 -7 -2 -3 W -7 21 -13 0 2 -2 V -2 -13 25 1 -2 6 -3 L 0 -1 -3 5 K A L K 9 PAM250 Starting row and column ‘0’, and extra column at right or extra row at bottom is not necessary when using semi global alignment (zero endgaps). Rest works as under global alignment. Easy local DP recipe for using affine gap penalties (after Gotoh) j-1 Penalty = Pi + gap_length*Pe Si,j = Max i-1 Si,j + Max{S0<x<i-1,j-1 - Pi - (i-x-1)Px} Si,j + Si-1,j-1 Si,j + Max {Si-1,0<y<j-1 - Pi - (j-y-1)Px} 0 M[i,j] is optimal alignment (highest scoring alignment until [i, j]) At each cell [i, j] in search matrix, check Max coming from: any cell in preceding row until j-2: add score for cell[i, j] minus appropriate gap penalties; any cell in preceding column until i-2: add score for cell[i, j] minus appropriate gap penalties; or cell[i-1, j-1]: add score for cell[i, j] Select highest scoring cell anywhere in matrix and do trace-back until zero-valued cell or start of sequence(s) Let’s do yet another example: local alignment Gotoh’s DP algorithm with affine gap penalties (PAM250, Pi=10, Pe=2) D W V T A L K T 0 -5 0 3 1 1 0 D 4 -7 -2 0 0 -4 W -7 17 -6 -5 -6 V -2 -6 4 0 L -4 -2 2 K 0 -3 -2 D W V T T 0 0 0 3 0 D 4 0 0 0 -2 -3 W 0 21 0 0 0 2 -2 V 0 0 25 9 1 -2 6 -3 L 0 0 11 0 -1 -3 5 K 0 0 A L K PAM250 Extra start/end columns/rows not necessary (no end-gaps). Each negative scoring cell is set to zero. Highest scoring cell may be found anywhere in search matrix after calculating it. Trace highest scoring cell back to first cell with zero value (or the beginning of one or both sequences) Dot plots • Way of representing (visualising) sequence similarity without doing dynamic programming (DP) • Make same matrix, but locally represent sequence similarity by averaging using a window Comparing two sequences We want to be able to choose the best alignment between two sequences. A simple method of visualising similarities between two sequences is to use dot plots. The first sequence to be compared is assigned to the horizontal axis and the second is assigned to the vertical axis. Dot plots can be filtered by window approaches (to calculate running averages) and applying a threshold They can identify insertions, deletions, inversions For your first exam D1: Make sure you understand and can carry out 1. the ‘simple’ DP algorithm for global, semiglobal and local alignment (using linear gap penalties but make sure you know the extension of the basic algorithm for affine gap penalties) and 2. The general DP algorithm for global, semiglobal and local alignment (using linear, affine and concave gap penalties)!