Comparison of Biological Sequences
The Biocomputing Service Group

Types of Sequence Comparison
• Pairwise Alignments
• Multiple Alignments
• Database Searches

Pairwise Sequence Alignment
• Principles of pairwise sequence comparison
  • global / local alignments
  • scoring systems
  • gap penalties
• Methods of pairwise sequence alignment
  • window-based methods
  • dynamic programming approaches
    • Needleman and Wunsch
    • Smith and Waterman
• Pairwise alignment programs in HUSAR

Why Sequence Comparison?
The biological basis:
• Many genes and proteins are members of families that have a similar biochemical function or share a common evolutionary origin.
Sequence comparison is used:
• to define evolutionary relationships.
• to identify conserved patterns.
• when dealing with a sequence of unknown function: to find similar domains which could imply similar function.
A comparison can be the starting point for further experimental investigations.

Aligning Sequences
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact
[Slide animation: the two sequences are shown in several different relative positions.]
• There are lots of possible alignments.
• Two sequences can always be aligned.
• Sequence alignments have to be scored.
• Often there is more than one solution with the same score.
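To make the points about "lots of possible alignments" and equal scores concrete, here is a minimal sketch that slides Sequence 2 along Sequence 1 without gaps and counts identical positions for each placement. The function name `ungapped_match_counts` and the offset convention are assumptions made for this illustration; they are not taken from the slides or from HUSAR.

```python
def ungapped_match_counts(seq1, seq2):
    """Count identical positions for every ungapped placement of seq2 on seq1.

    The offset is the index in seq1 that lines up with the first residue of
    seq2; negative offsets mean seq2 starts to the left of seq1.
    """
    counts = {}
    for offset in range(-(len(seq2) - 1), len(seq1)):
        matches = 0
        for j, residue in enumerate(seq2):
            i = j + offset
            if 0 <= i < len(seq1) and seq1[i] == residue:
                matches += 1
        counts[offset] = matches
    return counts


# The two sequences shown on the slide.
seq1 = "actaccagttcatttgatacttctcaaa"
seq2 = "taccattaccgtgttaactgaaaggacttaaagact"

counts = ungapped_match_counts(seq1, seq2)
best = max(counts.values())
best_offsets = [offset for offset, c in counts.items() if c == best]
print(len(counts), "ungapped placements")
print("highest match count:", best, "at offsets", best_offsets)
```

Even without gaps there are len(seq1) + len(seq2) - 1 placements to compare; allowing gaps makes the number of candidate alignments far larger, which is why a scoring scheme and a systematic search are needed.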
Pairwise Sequence Comparison
• Global Alignments
• Local Alignments

Global Alignment
Two closely related sequences: GAP (Needleman & Wunsch) creates an end-to-end alignment.

Global Alignment
Two sequences sharing several regions of local similarity:
 1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG.... 67
 1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAGCACTAAAGCGTCAGCGAGACCG 70
[The match line of this alignment is not recoverable; the first 14 positions are identical.]

Local Alignment
Bestfit (Smith-Waterman) finds the region of best local similarity.
14 TCAGAAGCAGCTAAAGCGT 32
   ||||||||| |||||||||
42 TCAGAAGCA.CTAAAGCGT 59

Local Alignment
Similarity (X. Huang) displays all regions of similarity.
Segment                 Sequence 1   Sequence 2
TCAGAAGCAGCTAAAGCGT     14-32        42-59   (TCAGAAGCA.CTAAAGCGT in Sequence 2)
AGGATTGGAATGCT          1-14         1-14
AGGATTGGAAT             39-49        1-11
AGACCG                  62-67        66-71

Human Hemoglobin α- and γ-Chains
• Symbol Comparison Table: PAM250
• Gap opening penalty: 3
• Gap extension penalty: 0.1
• Score: 116

Parameters of Sequence Alignment
Scoring Systems:
• Each symbol pairing is assigned a numerical value, based on a symbol comparison table.
Gap Penalties:
• Opening: the cost to introduce a gap
• Extension: the cost to elongate a gap

DNA Scoring Systems
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact
    A  G  C  T
A   1  0  0  0
G   0  1  0  0
C   0  0  1  0
T   0  0  0  1
Match: 1, Mismatch: 0
Score = 5

DNA Scoring Systems
Negative scoring values to penalize mismatches:
    A   T   C   G
A   5  -4  -4  -4
T  -4   5  -4  -4
C  -4  -4   5  -4
G  -4  -4  -4   5
Matches: 5, Mismatches: 19
Score: 5 x 5 + 19 x (-4) = -51

Protein Scoring Systems
Sequence 1: PTHPLASKTQILPEDLASEDLTI
Sequence 2: PTHPLAGERAIGLARLAEEDFGM
Scoring matrix (lower triangle, excerpt):
    C   S   T   P   A   G   N   D
C   9
S  -1   4
T  -1   1   5
P  -3  -1  -1   7
A   0   1   0  -1   4
G  -3   0  -2  -2   0   6
N  -3   1   0  -2  -2   0   5
D  -3   0  -1  -1  -2  -1   1   6
. . .
Examples: T:G = -2, T:T = 5
Score = 48

Protein Scoring Systems
• Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution.
[Figure: Venn diagram grouping the amino acids by property: tiny, small, aliphatic, aromatic, hydrophobic, polar, charged, positive.]
• Scoring matrices reflect
  • the probabilities of mutual substitutions
  • the probability of occurrence of each amino acid.
• Widely used scoring matrices:
  • PAM
  • BLOSUM

PAM matrices (Point Accepted Mutation, also expanded as Percent Accepted Mutations)
• Derived from global alignments of protein families. Family members share at least 85% identity (Dayhoff et al., 1978).
• A phylogenetic tree and ancestral sequences are constructed for each protein family.
• The number of replacements is computed for each pair of amino acids.
• The numbers of replacements were used to compute the so-called PAM-1 matrix.
• The PAM-1 matrix reflects an average change of 1% of all amino acid positions. PAM matrices for larger evolutionary distances can be extrapolated from the PAM-1 matrix.
• PAM250 = 250 mutations per 100 residues.
• Greater numbers mean a larger evolutionary distance.
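The extrapolation from PAM-1 to larger distances can be illustrated with a small sketch. It assumes, following Dayhoff's construction, that PAM-n mutation probabilities are obtained by multiplying the PAM-1 probability matrix by itself n times. The three-letter matrix `TOY_PAM1` is a made-up stand-in, not real amino acid data, and the helper names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical PAM-1-like matrix for a three-letter alphabet: each letter
# stays unchanged with probability 0.99, so 1% of positions change on average.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print(["%.3f" % p for p in row])
```

In the toy result the diagonal is still the largest entry of each row, but far below 0.99: after 250 accepted mutations per 100 positions many sites have changed more than once, and some have changed back. Published PAM tables such as PAM250 are log-odds scores derived from probabilities of this kind, not the probabilities themselves.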
PAM 250
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   2   1
R  -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2   1   2
N   0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   4   3
D   0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   5   4
C  -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -3  -4
Q   0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   3   5
E   0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   4   5
G   1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   2   1
H  -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   3   3
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -1  -1
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -2  -1
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   2   2
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -1   0
F  -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -3  -4
P   1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1   1   1
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   2   1
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   2   1
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -4  -4
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -2  -3
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4   0   0
B   2   1   4   5  -3   3   4   2   3  -1  -2   2  -1  -3   1   2   2  -4  -2   0   6   5
Z   1   2   3   4  -4   5   5   1   3  -1  -1   2   0  -4   1   1   1  -4  -3   0   5   6
Highlighted on the slide: C:W = -8 and W:W = 17.

BLOSUM (Blocks Substitution Matrix)
• Derived from alignments of domains of distantly related proteins (Henikoff & Henikoff, 1992).
• Occurrences of each amino acid pair in each column of each block alignment are counted.
• The numbers derived from all blocks were used to compute the BLOSUM matrices.
Example: the column A, A, C, E, C yields the pair counts A-C = 4, A-E = 2, C-E = 2, A-A = 1, C-C = 1.

BLOSUM (Blocks Substitution Matrix)
• Sequences within blocks are clustered according to their level of identity.
• Clusters are counted as a single sequence.
• Different BLOSUM matrices differ in the percentage of sequence identity used in clustering.
• The number in the matrix name (e.g. 62 in BLOSUM62) refers to the percentage of sequence identity used to build the matrix.
• Greater numbers mean a smaller evolutionary distance.

Tips on Choosing a Scoring Matrix
• Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches (Henikoff & Henikoff, 1993).
• When comparing closely related proteins, use lower PAM or higher BLOSUM matrices; for distantly related proteins, use higher PAM or lower BLOSUM matrices.
• For database searching the commonly used matrix is BLOSUM62.

Scoring Insertions and Deletions
Without gaps:
 ATGTAATGCA
TATGTGGAATGA
With an insertion / deletion:
 ATGT--AATGCA
TATGTGGAATGA
The creation of a gap is penalized with a negative score value.

Why Gap Penalties?
Match = 5, Mismatch = -4

Gaps not permitted. Score: 10
 1 GTGATAGACACAGACCGGTGGCATTGTGG 29
   |||   |  | |||    |   || || |
 1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29

Gaps allowed but not penalized. Score: 88
 1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29
   ||| || | | | |||  ||  |  |  || || |
 1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29

Why Gap Penalties?
• The optimal alignment of two similar sequences is usually the one that
  • maximizes the number of matches and
  • minimizes the number of gaps.
• Permitting the insertion of arbitrarily many gaps can lead to high-scoring alignments of non-homologous sequences.
• Penalizing gaps forces alignments to have relatively few gaps.
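The effect of gap penalties on the example above can be checked with a short sketch. It scores an already gapped alignment column by column, assuming the affine gap cost used in this deck, γ(g) = -d - (g - 1)e, with d charged for the first position of a gap and e for each further position. The function name `score_alignment` and its parameters are assumptions for illustration, not part of any HUSAR program.

```python
def score_alignment(top, bottom, match, mismatch, gap_open, gap_extend,
                    gap_chars="-."):
    """Score two equal-length aligned strings with an affine gap penalty."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0.0
    previous_gap_row = None  # which row carried a gap in the previous column
    for a, b in zip(top, bottom):
        if a in gap_chars or b in gap_chars:
            row = "top" if a in gap_chars else "bottom"
            # Opening cost for a new gap, extension cost for a continued one.
            score -= gap_extend if row == previous_gap_row else gap_open
            previous_gap_row = row
        else:
            score += match if a == b else mismatch
            previous_gap_row = None
    return score


ungapped = ("GTGATAGACACAGACCGGTGGCATTGTGG",
            "GTGTCGGGAAGAGATAACTCCGATGGTTG")
gapped = ("GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG",
          "GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG")

print(score_alignment(*ungapped, 5, -4, 0, 0))   # gaps not present
print(score_alignment(*gapped, 5, -4, 0, 0))     # gaps free of charge
print(score_alignment(*gapped, 5, -4, 8, 8))     # linear penalty of 8 per gap position
```

With free gaps the gapped alignment of these unrelated-looking sequences scores far higher than the ungapped one; with a penalty of 8 per gap position (a value chosen here only for illustration) it drops below the ungapped score, which is the behaviour the slide argues for.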
Gap Penalties
Linear gap penalty score: γ(g) = -gd
Affine gap penalty score: γ(g) = -d - (g - 1)e
γ(g) = gap penalty score of a gap of length g
d = gap opening penalty
e = gap extension penalty
g = gap length

Scoring Insertions and Deletions
match = 1, mismatch = 0

Without gaps. Total Score: 4
 ATGTTATAC
TATGTGCGTATA

With an insertion / deletion. Total Score: 8 - 3.2 = 4.8
 ATGT---TATAC
TATGTGCGTATA
Gap parameters: d = 3 (gap opening), e = 0.1 (gap extension), g = 3 (gap length)
γ(g) = -3 - (3 - 1) x 0.1 = -3.2

Modification of Gap Penalties
Score Matrix: BLOSUM62

gap opening penalty = 3, gap extension penalty = 0.1, score = 6.3
 1 ...VLSPADKFLTNV 12
       ||||
 1 VFTELSPAKTV.... 11

gap opening penalty = 0, gap extension penalty = 0.1, score = 11.3
 1 V...LSPADKFLTNV 12
   |   |||| |  | |
 1 VFTELSPA.K..T.V 11

Methods of Pairwise Sequence Alignment

Dotplot
A dotplot gives an overview of all possible alignments.
[Figure: dotplot of Sequence 1 (TACATTACGTAC) against Sequence 2 (ATTCACATA); a dot marks every pair of identical residues.]

Dotplot
In a dotplot each diagonal corresponds to a possible (ungapped) alignment.
One possible alignment:
TACATTACGTAC
ATACACTTA

Window-based Approaches
• Word Size
• Window / Stringency

Word Size Algorithm
Word Size = 3
Sequence 1: TACGGTATG
Sequence 2: ACAGTATC
[Slide animation: a word of 3 residues is moved along the sequences; a dot is set where the two words are identical.]

Window / Stringency
Window = 5 / Stringency = 4
Sequence 1: TACGGTATG
Sequence 2: TCAGTATC
[Slide animation: a window of 5 residues is moved along the sequences; a dot is set where at least 4 of the 5 positions match.]

Scoring Matrix Filtering
Matrix: PAM250, Window = 12, Stringency = 9
Sequence 1: PTHPLASKTQILPEDLASEDLTI
Sequence 2: PTHPLAGERAIGLARLAEEDFGM
Window scores shown on the slide: 11 and 7 (a score of 11 reaches the stringency of 9, a score of 7 does not).

Considerations
• The window/stringency method is more sensitive than the word size method (ambiguities are permitted).
• The smaller the window, the larger the weight of statistical (unspecific) matches.
• With large windows the sensitivity for short sequences is reduced.
• Insertions/deletions are not treated explicitly.

Insertions / Deletions in a Dotplot
Sequence 1: TACTGTCAT
Sequence 2: TACTGTTCAT
TACTG-TCAT
||||| ||||
TACTGTTCAT
[Figure: in the dotplot the insertion/deletion appears as a shift of the diagonal.]

Dotplot (Window = 30 / Stringency = 9)
[Figure: hemoglobin α-chain against hemoglobin β-chain; output of the programs Compare and DotPlot.]

Dotplot (Window = 18 / Stringency = 10)
[Figure: hemoglobin α-chain against hemoglobin β-chain; output of the programs Compare and DotPlot.]

Dynamic Programming
An automatic procedure that finds the best alignment, that is, the alignment with the optimal score for the chosen parameters.
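Returning to the window-based dotplot methods, the window/stringency rule can be sketched in a few lines. This is a minimal illustration using identity matches only; the function name `dotplot` and the convention of recording each hit at the 0-based start of the window are assumptions, and the programs Compare and DotPlot may differ in detail. Setting the stringency equal to the window reproduces the word size method.

```python
def dotplot(seq1, seq2, window, stringency):
    """Return the set of (i, j) window start positions that pass the filter.

    A dot is set when at least `stringency` of the `window` positions
    starting at seq1[i] and seq2[j] are identical.
    """
    hits = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(1 for k in range(window) if seq1[i + k] == seq2[j + k])
            if matches >= stringency:
                hits.add((i, j))
    return hits


def render(seq1, seq2, hits):
    """Draw the hits as a text grid: seq2 across the top, seq1 down the side."""
    lines = ["  " + " ".join(seq2)]
    for i, residue in enumerate(seq1):
        cells = ["*" if (i, j) in hits else "." for j in range(len(seq2))]
        lines.append(residue + " " + " ".join(cells))
    return "\n".join(lines)


# Window = 5 / Stringency = 4 example from the slides.
window_hits = dotplot("TACGGTATG", "TCAGTATC", window=5, stringency=4)
print(render("TACGGTATG", "TCAGTATC", window_hits))

# Word size = 3 is the special case window = stringency = 3.
word_hits = dotplot("TACGGTATG", "ACAGTATC", window=3, stringency=3)
print(sorted(word_hits))
```

The three window hits lie on one diagonal, which is how a dotplot reveals an ungapped alignment; the stricter word size filter finds only the exactly matching words GTA and TAT.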
Dynamic programming algorithms covered here:
• Needleman and Wunsch algorithm: global alignment
• Smith and Waterman algorithm: local alignment

Needleman and Wunsch (global alignment)
Sequence 1: HEAGAWGHEE
Sequence 2: PAWHEAE
Scoring parameters: BLOSUM50 matrix
Gap penalty: linear gap penalty of d = 8

Basic principles of dynamic programming
- Initialisation of the alignment matrix
- Stepwise calculation of score values (creation of an alignment path matrix)
- Backtracking (evaluation of the optimal path)

Initialisation of Matrix (BLOSUM50)
     H   E   A   G   A   W   G   H   E   E
P   -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A   -2  -1   5   0   5  -3   0  -2  -1  -1
W   -3  -3  -3  -3  -3  15  -3  -3  -3  -3
H   10   0  -2  -2  -2  -3  -2  10   0   0
E    0   6  -1  -3  -1  -3  -3   0   6   6
A   -2  -1   5   0   5  -3   0  -2  -1  -1
E    0   6  -1  -3  -1  -3  -3   0   6   6

Creation of an alignment path matrix
Idea: build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences.
• Construct a matrix F indexed by i and j (one index for each sequence).
• F(i,j) is the score of the best alignment between the initial segment x1...xi of x and the initial segment y1...yj of y.
• Build F(i,j) recursively, beginning with F(0,0) = 0.

Creation of an alignment path matrix
F(i,j) = max of:
  F(i-1,j-1) + s(xi,yj)
  F(i-1,j) - d
  F(i,j-1) - d
[Diagram: F(i,j) is reached from F(i-1,j-1) by adding s(xi,yj), or from F(i-1,j) or F(i,j-1) by subtracting d.]

Creation of an alignment path matrix
• If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known, we can calculate F(i,j).
• Three possibilities:
  • xi and yj are aligned: F(i,j) = F(i-1,j-1) + s(xi,yj)
  • xi is aligned to a gap: F(i,j) = F(i-1,j) - d
  • yj is aligned to a gap: F(i,j) = F(i,j-1) - d
• The best score up to (i,j) is the largest of the three options.

Boundary conditions
F(i,0) = -i d
F(0,j) = -j d
With d = 8 the first row (sequence 1) is 0, -8, -16, ..., -80 and the first column (sequence 2) is 0, -8, -16, ..., -56.

Creation of an alignment path matrix: first cells
Substitution scores used: s(H,P) = -2, s(E,P) = -1, s(H,A) = -2, s(E,A) = -1
F(1,1) = max of:
  F(0,0) + s(x1,y1) = 0 - 2 = -2
  F(0,1) - d = -8 - 8 = -16
  F(1,0) - d = -8 - 8 = -16
  result: -2
F(2,1) = max of:
  F(1,0) + s(x2,y1) = -8 - 1 = -9
  F(1,1) - d = -2 - 8 = -10
  F(2,0) - d = -16 - 8 = -24
  result: -9
F(1,2) = max of:
  F(0,1) + s(x1,y2) = -8 - 2 = -10
  F(0,2) - d = -16 - 8 = -24
  F(1,1) - d = -2 - 8 = -10
  result: -10
F(2,2) = max of:
  F(1,1) + s(x2,y2) = -2 - 1 = -3
  F(1,2) - d = -10 - 8 = -18
  F(2,1) - d = -9 - 8 = -17
  result: -3

Backtracking
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1
Optimal global alignment (score 1):
HEAGAWGHE-E
--P-AW-HEAE

Smith and Waterman (local alignment)
Two differences:
1. F(i,j) = max of:
     0
     F(i-1,j-1) + s(xi,yj)
     F(i-1,j) - d
     F(i,j-1) - d
2. An alignment can now end anywhere in the matrix.
Example:
Sequence 1: HEAGAWGHEE
Sequence 2: PAWHEAE
Scoring parameters: BLOSUM50 matrix
Gap penalty: linear gap penalty of d = 8

Smith Waterman alignment
          H    E    A    G    A    W    G    H    E    E
     0    0    0    0    0    0    0    0    0    0    0
P    0    0    0    0    0    0    0    0    0    0    0
A    0    0    0    5    0    5    0    0    0    0    0
W    0    0    0    0    2    0   20   12    4    0    0
H    0   10    2    0    0    0   12   18   22   14    6
E    0    2   16    8    0    0    4   10   18   28   20
A    0    0    8   21   13    5    0    4   10   20   27
E    0    0    6   13   18   12    4    0    4   16   26
Optimal local alignment (score 28):
AWGHE
AW-HE

Extended Smith & Waterman
To get multiple local alignments:
• delete the regions around the best path
• repeat the backtracking

Extended Smith & Waterman
Matrix after deleting the region around the best path (deleted cells are shown as "."):
          H    E    A    G    A    W    G    H    E    E
     0    0    0    0    0    0    0    0    0    0    0
P    0    0    0    0    0    0    0    0    0    0    0
A    0    0    0    5    0    .    0    0    0    0    0
W    0    0    0    0    2    0    .    .    .    0    0
H    0   10    2    0    0    0    .    .    .    .    .
E    0    2   16    8    0    0    .    .    .    .    .
A    0    0    8   21   13    5    0    .    .    .    .
E    0    0    6   13   18   12    4    .    .    .    .
Second best local alignment:
HEA
HEA

Further Extensions of Dynamic Programming
• Overlap matches
• Alignment with affine gap scores

Algorithmic Complexity
How does an algorithm's performance in CPU time and required memory scale with the size of the problem?
Needleman & Wunsch:
• (n+1) x (m+1) numbers are stored.
• Each number costs a constant number of calculations to compute (three sums and a max).
• The algorithm takes O(nm) memory and O(nm) time.
• Since n and m are usually comparable: O(n^2)

Multiple Alignments
The Biocomputing Service Group

Multiple Alignments
The process of aligning 3 or more sequences.
The basis for:
• The study of protein families or evolutionary relationships.
• Finding conserved consensus patterns or domains.
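Before moving on to multiple alignments, the pairwise Needleman and Wunsch and Smith and Waterman procedures from the preceding slides can be sketched together, since they differ only in the boundary values, the extra 0 in the maximum, and the starting point of the backtracking. This is a minimal sketch for the deck's example only: the substitution scores are the BLOSUM50 excerpt shown on the slides, and the names `fill`, `traceback`, `s`, `X`, `Y` and `D` are assumptions for this illustration. When several paths tie, this sketch prefers the diagonal step, then a gap in sequence 2.

```python
X = "HEAGAWGHEE"  # sequence 1, indexed by i
Y = "PAWHEAE"     # sequence 2, indexed by j
D = 8             # linear gap penalty

# BLOSUM50 scores for this example only: one row per residue of Y,
# one column per residue of X (copied from the slide).
SCORES = [
    [-2, -1, -1, -2, -1, -4, -2, -2, -1, -1],  # P
    [-2, -1,  5,  0,  5, -3,  0, -2, -1, -1],  # A
    [-3, -3, -3, -3, -3, 15, -3, -3, -3, -3],  # W
    [10,  0, -2, -2, -2, -3, -2, 10,  0,  0],  # H
    [ 0,  6, -1, -3, -1, -3, -3,  0,  6,  6],  # E
    [-2, -1,  5,  0,  5, -3,  0, -2, -1, -1],  # A
    [ 0,  6, -1, -3, -1, -3, -3,  0,  6,  6],  # E
]


def s(i, j):
    """Substitution score s(x_i, y_j) with 1-based indices."""
    return SCORES[j - 1][i - 1]


def fill(local):
    """Fill F(i, j); local=True gives the Smith and Waterman variant."""
    n, m = len(X), len(Y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        for i in range(1, n + 1):
            F[i][0] = -i * D
        for j in range(1, m + 1):
            F[0][j] = -j * D
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(F[i - 1][j - 1] + s(i, j), F[i - 1][j] - D, F[i][j - 1] - D)
            F[i][j] = max(0, best) if local else best
    return F


def traceback(F, local):
    """Return (score, aligned X, aligned Y) by backtracking through F."""
    n, m = len(X), len(Y)
    if local:
        # A local alignment ends at the highest-scoring cell.
        cells = ((a, b) for a in range(n + 1) for b in range(m + 1))
        i, j = max(cells, key=lambda cell: F[cell[0]][cell[1]])
    else:
        i, j = n, m
    score = F[i][j]
    top, bottom = [], []
    while (i > 0 or j > 0) and not (local and F[i][j] == 0):
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(i, j):
            top.append(X[i - 1])
            bottom.append(Y[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - D:
            top.append(X[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(Y[j - 1])
            j -= 1
    return score, "".join(reversed(top)), "".join(reversed(bottom))


print(traceback(fill(local=False), local=False))
print(traceback(fill(local=True), local=True))
```

The two nested loops visit each of the (n+1) x (m+1) cells once and do constant work per cell, which is the O(nm) time and memory stated on the complexity slide.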
Multiple Alignments
Approaches:
• Multidimensional dynamic programming
• Progressive alignments
• And others

Multidimensional Dynamic Programming
Three-dimensional alignment path matrix: alignment of 3 sequences
[Figure: cube with Sequence 1, Sequence 2 and Sequence 3 as axes.]
Computing time! Memory!

Multidimensional Dynamic Programming
Programs:
• MSA (Lipman, Altschul and Kececioglu, 1989)
• DCA (Jens Stoye)

Divide-and-Conquer Alignment (DCA)
• Simultaneous alignment of multiple sequences using the Needleman and Wunsch algorithm
• Reduction of search space (reduces computing time)
• Sequences are cut at suitable positions near their midpoints to obtain two new families of shorter sequences.
• The cutting procedure is repeated until the new families of sequences can be aligned optimally.
• Then the resulting alignments are concatenated.
• Crucial point: finding suitable cut positions

Divide-and-Conquer Alignment
[Figure: divide, divide, divide, align optimally, concatenate.]

Reduction of Search Space
[Figures: three-dimensional alignment path matrix for Sequence 1, Sequence 2 and Sequence 3 with a reduced search space.]

Progressive Alignment
Principle:
pairwise alignment (1+2, 1+3, 1+4, 2+3, 2+4, 3+4) → guide tree → iterative multiple alignment

Progressive Alignment, Step 1: Pairwise Comparison of All Sequences
1:2  1:3  1:4  1:5
2:3  2:4  2:5
3:4  3:5
4:5
Result: a similarity score for every comparison.

Progressive Alignment, Step 1: Methods of Pairwise Comparison
Programs perform global alignments:
• Needleman & Wunsch (Pileup, Tree, Clustal)
• Word size method (Clustal)
• X. Huang, modified Needleman & Wunsch (MAlign)

Progressive Alignment, Step 2: Construction of a Guide Tree
[Table: 5 x 5 similarity matrix of sequences 1 to 5.]
• The similarity matrix displays the scores of all sequence pairs.
• The similarity matrix is transformed into a distance matrix.
• The guide tree is built from the distance matrix with the Neighbour-Joining method or UPGMA (unweighted pair group method using arithmetic averages).
• The guide tree is not a phylogenetic tree!
[Figure: guide tree joining sequences 1 to 5.]

Progressive Alignment, Step 3: Multiple Alignment
[Figure: the sequences are aligned in the order given by the guide tree.]

Columns, once aligned, are never changed ...
Before (aligned pair and the sequence to be added):
GTCCG-CAGG
TT-CGCC-GG

TTACTTCCAGG
After:
GTCCG--CAGG
TT-CGC-C-GG
TTACTTCCAGG
... and new gaps are inserted.

Columns, once aligned, are never changed
Before (two groups of aligned sequences):
GTCCG--CAGG
TT-CGC-C-GG
TTACTTCCAGG

ATCT--CAAT
CTGTCCCTAG
After:
GTCCG--CAGG
TT-CGC-C-GG
TTACTTCCAGG
ATC-T--CAAT
CTG-TCCCTAG

Other Approaches
Other methods:
• DiAlign (Morgenstern et al. (1996) Proc. Natl. Acad. Sci., 93, 12098-12103)
• PRRP (Gotoh O. (1996) J. Mol. Biol., 264, 823-838)
• T-Coffee (Notredame et al. (2000) J. Mol. Biol., 302, 205-217)

DiAlign
• Local alignment
• Gap-free segment-to-segment comparison instead of residue-to-residue comparison
• Gaps are not treated explicitly (no gap penalties).
• Gaps represent those parts of the sequences that are not aligned.

DiAlign: Alignment of two sequences
[Figure: dotplot of Sequence 1 against Sequence 2 with three diagonals S1, S2 and S3.]

DiAlign: Consistent versus non-consistent diagonals (local alignments)
• Consistent: S1 + S3, S2 + S3
• Non-consistent: S1 + S2

DiAlign: Maximum Alignment
[Figure: dotplot of Sequence 1 against Sequence 2 with diagonals S1, S2 and S3.]
Marked diagonals (similar segments) are taken for the maximum alignment.

DiAlign: Extension to Multiple Alignment
1. Pairwise comparison of each sequence pair
2. Maximum alignment (diagonals) for each sequence pair
[Figures: dotplots of Sequence X against Sequence Y.]
3. Diagonals of all pairwise maximum alignments are ranked according to their score and incorporated one by one, as long as they are consistent, into the growing multiple alignment.
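The greedy use of consistent diagonals can be sketched for the simplest case of two sequences. This is a deliberately simplified illustration, not DiAlign's actual procedure: it assumes that two diagonals are consistent when one of them ends before the other begins in both sequences, whereas the real consistency check across many sequences is more involved. The class `Diagonal`, the helper names, and all coordinates and scores below are hypothetical.

```python
from typing import NamedTuple


class Diagonal(NamedTuple):
    """A gap-free segment pair: equal-length stretches in both sequences."""
    name: str
    start1: int   # start position in sequence 1
    start2: int   # start position in sequence 2
    length: int
    score: float


def consistent(a, b):
    """Pairwise simplification: one diagonal precedes the other in both sequences."""
    a_before_b = a.start1 + a.length <= b.start1 and a.start2 + a.length <= b.start2
    b_before_a = b.start1 + b.length <= a.start1 and b.start2 + b.length <= a.start2
    return a_before_b or b_before_a


def greedy_select(diagonals):
    """Take diagonals by decreasing score, keeping only consistent ones."""
    chosen = []
    for diagonal in sorted(diagonals, key=lambda d: d.score, reverse=True):
        if all(consistent(diagonal, kept) for kept in chosen):
            chosen.append(diagonal)
    return sorted(chosen, key=lambda d: d.start1)


# Hypothetical coordinates chosen so that S1 and S2 cross each other,
# while S3 lies after both of them in both sequences.
s1 = Diagonal("S1", start1=0, start2=10, length=5, score=8.0)
s2 = Diagonal("S2", start1=10, start2=0, length=5, score=5.0)
s3 = Diagonal("S3", start1=20, start2=20, length=5, score=6.0)

selected = greedy_select([s1, s2, s3])
print([d.name for d in selected])
```

With these made-up values the selection keeps S1 and S3 and rejects S2, mirroring the slide's statement that S1 + S3 and S2 + S3 are consistent while S1 + S2 is not.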
DiAlign: Extension to Multiple Alignment
[Figure: diagonals from the pairwise comparisons of sequences SA, SB, SC and SD are combined into one multiple alignment of SA, SB, SC and SD.]

Other Approaches: PRRP
Iterative refinement:
• Start from a multiple alignment.
• Split it into two sequence groups (at random).
• Realign the two groups to obtain a new multiple alignment.
• Iterate if the SP score of the new multiple alignment is higher than the previous SP score.

Other Approaches: T-Coffee
global pairwise alignments + local pairwise alignments → primary library (weighting) → extension → extended library → progressive alignment → final multiple alignment

Multiple Alignment
Last but not least, a hint: