Methods for alignment of sequences

Lars Liljas
Department of Cell and Molecular Biology
University of Uppsala
Uppsala, Sweden
V 1.8.1 060111

SOME BASICS

Accurate alignments of sequences are needed for many types of analyses. Aligned sequences might be the basis of phylogenetic analysis of evolutionary relationships. They are also important for identification of protein functions and for modeling of protein conformation. Alignment methods are also used to search for similarities between new sequences and sequences in databases. Depending on the purpose, different properties of the alignment algorithm are important: searches in extensive databases require speed, while algorithms for alignments of homologous sequences can be optimized to use all available information to produce the most reliable alignment. Alignment algorithms are therefore either global, trying to optimally align entire sequences, or local, finding short segments of high similarity.

Scoring matrices

All alignment methods need some sort of scoring for matches and mismatches. For a given alignment, a number is assigned to each position depending on the match at that position. The scores for all positions in the alignment are then added to give a total score, which is used to select the optimal alignment among the alternatives. The choice of scoring matrix can affect the result of the alignment, so it is worth understanding how the matrices are constructed and selecting the one that best fits the alignment at hand.

The simplest way of scoring is to assign one number for a match and another number for a mismatch. Such a matrix is often referred to as a unitary matrix. For nucleic acid alignments, such simple scoring matrices might be adequate.
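As a small illustration of unitary scoring, the sketch below adds up position-wise scores for two sequences that are already aligned. The function name and the example sequences are made up for this illustration, and the values +3 for a match and -1 for a mismatch are only one possible choice.

```python
def unitary_score(seq_a, seq_b, match=3, mismatch=-1):
    """Total score of two aligned, equal-length sequences under a unitary matrix."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to the same length")
    # One number for every identical pair, another for every non-identical pair.
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))


# Hypothetical example: 5 identical positions and 2 mismatches.
print(unitary_score("GATTACA", "GACTATA"))  # 5 * 3 + 2 * (-1) = 13
```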
Changes in amino acid sequences are generally more informative than changes in the base sequence, since protein function, and thereby the chance of survival of the organism, is directly related to the nature of the residue. A change from a valine to an isoleucine (for example) is more likely to be found than a change from a valine to an aspartate. For amino acid sequences, the alignment programs therefore use scoring matrices, which in a direct or indirect way contain information about the likelihood of a certain change. For closely related proteins, however, a matrix scoring only for identities will mostly give the same alignment as a more sophisticated scoring matrix.

Some scoring matrices are based on the genetic code: the minimum number of base changes necessary to change from one amino acid to another (minimum mutation distance matrix). Other matrices use classifications of amino acids according to their physical properties.

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C   0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H  -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

Figure 1. Example of an alignment scoring matrix (the BLOSUM62 matrix).

The most commonly used scoring matrices are based on statistical analysis of amino acid changes in homologous proteins. The PAM series is based on estimated mutation rates (Percent Accepted Mutations) from closely related proteins and will therefore be dominated by amino acid mutations caused by single base changes. Such a matrix is also called a log-odds matrix, since the numbers are proportional to the logarithm of the "odds" for the replacement not being a random change. The odds are the ratio of the number of times residue A is observed to be replaced by B to the number of times A would be replaced by B if the replacements occurred at random. Positive scores in the matrix thus represent a pair of residues that replace each other more often than is expected by chance.

PAM1 stands for 1 % accepted mutations, meaning that on average only one amino acid in 100 is changed. PAM matrices for less similar sequences are obtained by extrapolation. The PAM100 matrix corresponds to 100 accepted mutations per 100 residues, but since the same residue might change more than once, two sequences with this level of mutations will still have about 50 % identities. In the same way, the PAM250 matrix corresponds to a level of about 20 % identical residues.

The PAM matrices are based on a model of the events in evolution. In contrast, the BLOSUM series [1] is calculated from blocks of aligned sequences from homologous proteins with a certain level of identity. In this way, the pattern of changes observed at a certain level of identity is used, instead of extrapolations from the evolutionary events observed for closely related proteins. Figure 1 shows the BLOSUM62 matrix.
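The log-odds idea can be made concrete with a few lines of Python. This is only a sketch under stated assumptions: the function name is invented here, the frequencies in the example are hypothetical rather than taken from any real data set, and the scaling (twice the base-2 logarithm, rounded to an integer) is one common convention that is assumed for illustration.

```python
import math


def log_odds_score(observed_freq, expected_freq, scale=2.0):
    """Rounded, scaled logarithm of the odds observed / expected by chance."""
    odds = observed_freq / expected_freq
    return round(scale * math.log2(odds))


# Hypothetical frequencies, chosen only to show the sign of the score:
print(log_odds_score(0.02, 0.01))    # pair seen twice as often as by chance: positive
print(log_odds_score(0.0025, 0.01))  # pair seen four times less often: negative
print(log_odds_score(0.01, 0.01))    # pair seen exactly as often as by chance: zero
```

A pair that replaces each other more often than expected by chance gets a positive score, and a pair that is rarer than expected gets a negative one, which is the pattern visible in Figure 1.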
In the matrix of Figure 1, the score for a certain pair of aligned residues is found at the intersection of the row and column corresponding to the two amino acids. The numbers on the diagonal show the score for residues that have not changed, and one can note that there is a higher score for retaining a tryptophan (W) than a valine (V). Most changes correspond to negative numbers, but some conservative changes also have positive scores, for example lysine (K) to arginine (R) or valine (V) to isoleucine (I). The BLOSUM matrices seem to perform slightly better than the PAM matrices, and BLOSUM62 is often the default matrix used in searches and alignments.

Gaps

The sequences to be aligned are not necessarily of the same length, and this is the main problem with alignments. In the evolution from a (possible) common ancestor, residues are inserted and deleted. To allow for that, the score from all pairs of aligned residues is combined with suitable penalties for introducing gaps (gap penalties). The total score is used to select the optimal alignment. In practice, no useful statistical treatment of gaps in proteins has been available for determining gap penalties. The gap penalties are normally defined by two parameters, one for opening a gap and one that gives a penalty proportional to the length of the gap. Most programs allow the user to choose these parameters, which might have different optima for different systems. They also depend on the scoring matrix used. Alignment programs normally have useful default values connected to the substitution matrix that is selected.

Figure 2. The distribution of scores obtained from alignment of unrelated sequences. The solid curve shows a Gaussian curve (normal distribution). The observed values follow an extreme-value distribution, where the probability of a score larger than x is 1 - exp(-K exp(-λx)). If the values of K and λ can be estimated, the probability of a score being non-random can be calculated.
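The extreme-value expression in the legend to Figure 2 is easy to evaluate numerically. The sketch below is an illustration only: the function name is an assumption, and the values of K and λ in the example are hypothetical placeholders, since real values have to be estimated for a given scoring matrix and sequence composition.

```python
import math


def p_score_exceeds(x, K, lam):
    """P(S > x) = 1 - exp(-K * exp(-lam * x)) for an extreme-value distribution."""
    # -expm1(-y) equals 1 - exp(-y) but stays accurate when y is very small.
    return -math.expm1(-K * math.exp(-lam * x))


# Hypothetical parameters, for illustration only.
K, lam = 0.1, 0.3
for score in (20, 40, 60):
    print(score, p_score_exceeds(score, K, lam))
```

For high scores the probability becomes very small and is then close to K exp(-λx), which is why it falls off exponentially with the score.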
Estimating the significance of an alignment

It is not trivial to find a good estimate of the significance of an alignment, the main reason being that the amino acid composition and the distribution of the residues are not random. To compensate for the non-random composition, the significance of an alignment score can be estimated by repeating the alignment with randomized sequences of the same composition. A number of such alignments will give a mean score and an estimate of the expected variation of the score for random sequences similar to the aligned sequences. From this, the score obtained for the "real" alignment can be expressed as the number of standard deviations above the mean obtained from the random sequences. The distribution of scores is, at least for local alignments (see below), a skewed distribution (Figure 2), different from the familiar normal distribution.

A special problem in alignment is the presence in some proteins of regions with a very different amino acid composition, for example an exceptionally high content of arginines or glutamines. Such "low complexity" regions might give high scores also in cases where there is no homology.

Dotplot analysis

A direct way of comparing sequences is the dotplot analysis. The two sequences to be compared form the rows and columns of a matrix, and in the simplest case a dot is plotted in a graphical representation of the matrix wherever the sequences are identical. In practice, the comparison is most often done in a window of specified length, and a dot is put in the graph whenever the score within this window is above a certain specified threshold. The score might be calculated as the number of identities or as a similarity based on a scoring matrix. Sequence similarities will be obvious as diagonals in the plot. The method does not align sequences but it is a quick way of spotting similarities.
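A windowed dotplot of the kind described here can be sketched as follows. The function names, the test sequences and the choice of counting identities (rather than using a scoring matrix) are assumptions made for this illustration; a dot is recorded at the start position of every window pair whose number of identities reaches the threshold.

```python
def dotplot(seq_a, seq_b, window=3, threshold=3):
    """Return (i, j) start positions where the window score reaches the threshold."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Window score: number of identical residues at the same window offset.
            identities = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if identities >= threshold:
                dots.append((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a down the rows, seq_b along the columns."""
    marked = set(dots)
    lines = ["  " + seq_b]
    for i, residue in enumerate(seq_a):
        row = "".join("*" if (i, j) in marked else "." for j in range(len(seq_b)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


# Hypothetical short sequences sharing the segment KVLA.
dots = dotplot("MKVLAT", "GKVLAS", window=3, threshold=3)
print(render("MKVLAT", "GKVLAS", dots))
```

Lowering the threshold or shortening the window gives more dots and more background noise; the stars along a diagonal mark the shared segment.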
The dotplot has the additional advantage over alignment methods that it can detect repetitions (parallel diagonals) and permutations in the sequences.

Figure 3. A dotplot analysis of the sequences of human hemoglobin α chain and erythrocruorin from Chironomus.

GLOBAL ALIGNMENTS

Needleman-Wunsch-Sellers method

When two sequences are compared, we can make a global alignment or a local alignment. Global alignments aim at optimally aligning all of two or more sequences, while local alignments just try to find similar parts. The most used methods for global alignments are based on algorithms originally developed by Needleman and Wunsch [2] and modified by Sellers [3]. This procedure (the NWS method) uses a dynamic programming algorithm that simplifies the enormous task of calculating a score for all possible alignments of two sequences with gaps of any length.

The sequences to be aligned are arranged as rows and columns of a rectangular matrix. A score is calculated for each position of the matrix according to three possible events: replacement (or conservation) of a residue, insertion in sequence A or insertion in sequence B. The matrix elements D_{i,j} are filled with numbers according to the rule

$$
D_{i,j} = \max
\begin{cases}
D_{i-1,j-1} + \beta(A_i, B_j) \\
D_{i,j-k} + w(k), & k = 1, \ldots, j-1 \\
D_{i-k,j} + w(k), & k = 1, \ldots, i-1
\end{cases}
$$

The three alternatives are shown graphically in Figure 4.

Figure 4. The three alternative ways of finding the score in the matrix elements. (Diagram: the element D_{i,j} is reached from the diagonal element D_{i-1,j-1}, from an element D_{i,j-y} in the same row, or from an element D_{i-x,j} in the same column.)

Here the first alternative corresponds to the case of no insertions or deletions: to the previous diagonal element D_{i-1,j-1} is added the score β for the current pair of residues (A_i, B_j). The other two alternatives correspond to insertions in A or B: to an element further up in the same column or further to the left in the same row is added the (negative) score w(k) for a gap of length k. The scoring begins at the first element of the matrix, corresponding to one end of the sequences.
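The recurrence can be sketched in Python as follows. This is a minimal illustration and not the code of any particular program: the function and parameter names are assumptions made here, and the default scores (+3 for a match, -1 for a mismatch and a gap score w(k) = -3 - k) are those of the nucleotide example in Figure 5. The loops over k are written out literally, which is simple but slow for long sequences.

```python
def needleman_wunsch(a, b, match=3, mismatch=-1, gap_open=3, gap_extend=1):
    """Global alignment of a (rows) and b (columns); returns (score, aligned_a, aligned_b)."""
    def w(k):  # (negative) score of a gap of length k
        return -(gap_open + gap_extend * k)

    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):  # first column: a leading gap, penalized
        D[i][0], back[i][0] = w(i), ("up", i)
    for j in range(1, m + 1):  # first row: a leading gap, penalized
        D[0][j], back[0][j] = w(j), ("left", j)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Alternative 1: substitution (or conservation) along the diagonal.
            best = D[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            move = ("diag", 1)
            for k in range(1, j + 1):  # alternative 2: gap of length k in a
                if D[i][j - k] + w(k) > best:
                    best, move = D[i][j - k] + w(k), ("left", k)
            for k in range(1, i + 1):  # alternative 3: gap of length k in b
                if D[i - k][j] + w(k) > best:
                    best, move = D[i - k][j] + w(k), ("up", k)
            D[i][j], back[i][j] = best, move

    # Trace back from the lower right corner; pieces are collected in reverse.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        kind, k = back[i][j]
        if kind == "diag":
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif kind == "left":
            out_a.append("-" * k); out_b.append(b[j - k:j][::-1]); j -= k
        else:
            out_a.append(a[i - k:i][::-1]); out_b.append("-" * k); i -= k
    return D[n][m], "".join(out_a)[::-1], "".join(out_b)[::-1]


score, top, bottom = needleman_wunsch("ACGAGCGTATGA", "GCCAAGTAGG")
print(score)  # 6, the value in the lower right corner of the Figure 5 matrix
print(bottom)
print(top)
```

When several alignments have the same score, the sketch reports only the one favoured by its tie-breaking order (diagonal first), so the printed alignment may differ from an equally good alternative.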
An example using a short nucleotide sequence is shown in Figure 5 and Figure 6. The total score is found as the last element in the matrix. The actual sequence alignment is obtained from the path through which this number is obtained. The optimal alignment is thus obtained by tracing back the calculations from the element in the lower right corner.

      Δ    G    C    C    A    A    G    T    A    G    G
Δ     0   -4   -5   -6   -7   -8   -9  -10  -11  -12  -13
A    -4   -1   -5   -6   -3   -4   -8   -9   -7  -11  -12
C    -5   -5    2   -2   -3   -4   -5   -6   -7   -8   -9
G    -6   -2   -2    1   -3   -4   -1   -5   -6   -4   -5
A    -7   -6   -3   -3    4    0   -1   -2   -2   -4   -5
G    -8   -4   -4   -4    0    3    3   -1   -2    1   -1
C    -9   -8   -1   -1   -1   -1    2    2   -2   -3    0
G   -10   -6   -5   -2   -2   -2    2    1    1    1    0
T   -11  -10   -6   -6   -3   -3   -2    5    1    0    0
A   -12  -11   -7   -7   -3    0   -3    1    8    4    3
T   -13  -12   -8   -8   -5   -4   -1    0    4    7    3
G   -14  -10   -9   -9   -6   -5   -1   -1    3    7   10
A   -15  -14  -10  -10   -6   -3   -5   -2    2    3    6

Alignment
GCCAA-GTAGG-
ACGAGCGTATGA

Figure 5. Alignment of two nucleotide sequences. Here a match has a score of +3, a mismatch a score of -1. Gaps are scored by -3 for insertion of a gap and -1 for each inserted residue (w = -3 - k). The scores are calculated from the 5' end. The arrows show the path to the number in the lower right corner. In this global alignment with end gap penalty, there will be two places with alternative alignments of the same score. They correspond to shifting the position of the two gaps in the top sequence.

      Δ   G                  C                  C                  A                  A                  G
Δ     0   -4                 -5                 -6                 -7                 -8                 -9
A    -4   -1                 -5/-9/-5 → -5      -6/-10/-6 → -6     -3/-11/-7 → -3     -4/-12/-7 → -4     -9/-13/-8 → -8
C    -5   -5/-5/-9 → -5      2/-9/-9 → 2        -2/-10/-2 → -2     -7/-7/-3 → -3      -4/-8/-4 → -4      -5/-12/-5 → -5
G    -6   -2/-6/-10 → -2     -6/-2/-6 → -2      1/-6/-6 → 1        -3/-7/-3 → -3      -4/-8/-4 → -4      -1/-9/-5 → -1
A    -7   -7/-6/-11 → -6     -3/-3/-10 → -3     -3/-3/-7 → -3      4/-7/-7 → 4        0/-8/0 → 0         -5/-5/-1 → -1
G    -8   -4/-7/-12 → -4     -7/-4/-8 → -4      -4/-4/-8 → -4      -4/0/-8 → 0        3/-4/-4 → 3        3/-5/-1 → 3
C    -9   -9/-8/-13 → -8     -1/-5/-12 → -1     -1/-5/-5 → -1      -5/-1/-5 → -1      -1/-1/-5 → -1      2/-1/-5 → 2

Figure 6. The upper corner of the matrix above showing the numbers obtained by substitutions (top) or insertions/deletions (middle and bottom). Here each cell lists the three candidate values in that order, separated by slashes, followed by the value that is kept. Note that the lower two numbers result either from the introduction of a new gap (-4) or from the extension of a gap (-1). An example of a gap extension is encircled in the original figure.

At two positions of this example, the alignment is ambiguous. In this particular case, the first ambiguity is whether to put a single gap after nucleotide four or five in the top sequence. The second corresponds to putting a gap at the end of the top sequence or before the last two nucleotides. It is important to note that any sequences can be aligned using these procedures, irrespective of their possible homology. Programs using this type of algorithm normally present one solution and do not show alternative alignments with the same score. An example of the alignment of protein sequences is shown in Figure 7.

      Δ    A    N    D    W    P    I    A    G    S    H
Δ     0  -16  -20  -24  -28  -32  -36  -40  -44  -48  -52
S   -16    1  -15  -19  -23  -27  -31  -35  -39  -40  -47
T   -20  -15    1  -15  -19  -23  -27  -31  -35  -38  -42
S   -24  -19  -14    1  -15  -19  -23  -26  -31  -31  -39
T   -28  -23  -19  -15   -1  -16  -20  -23  -28  -30  -33
D   -32  -27  -22  -13  -17   -2  -18  -22  -24  -28  -31
W   -36  -31  -27  -23   -2  -18   -5  -21  -24  -27  -30
P   -40  -35  -31  -27  -18    5  -11   -6  -19  -23  -27
L   -44  -39  -35  -31  -22  -11    7   -9  -10  -17  -21
V   -48  -43  -39  -35  -26  -15   -8    7   -9  -12  -17
G   -52  -47  -43  -39  -30  -19  -13   -8   13   -3   -7
A   -56  -48  -47  -43  -34  -23  -17   -9   -3   14   -2
T   -60  -55  -48  -47  -38  -27  -21  -17   -7   -2   12

Alignment
--ANDWPIAGSH
STSTDWPLVGAT
or
AN--DWPIAGSH
STSTDWPLVGAT

Figure 7. Alignment of two short protein segments. BLOSUM62 is used, the gap opening weight is -12 and the extension weight is -4. Bold numbers show the path of the alignment.

In these alignments, gaps at the beginning and end of the sequences have the same gap penalties as gaps in other parts of the sequences. This has the effect that the algorithm will try to align all of the two sequences. If the sequences are of different lengths or if only parts of the sequences are similar, this might have negative effects.
One way to handle such cases is not to penalize gaps at the ends of the sequences. Figure 8 shows the nucleotide sequences of Figure 5 aligned under this condition. The optimal alignment is chosen as the one giving the highest number in the last row or column. In this case, the alignment of the same nucleotide sequences is different, but the same number of identities has been obtained.

      Δ    G    C    C    A    A    G    T    A    G    G
Δ     0    0    0    0    0    0    0    0    0    0    0
A     0   -1   -1   -1    3    3   -1   -1    3   -1   -1
C     0   -1    2    2   -1    2    2   -2   -1    2   -2
G     0    3   -1    1    1   -2    5    1    0    2    5
A     0   -1    2   -2    4    4    1    4    4    0    1
G     0    3   -1    1    0    3    7    3    3    7    3
C     0   -1    6    2    1    0    3    6    2    3    6
G     0    3    2    5    1    0    3    2    5    5    6
T     0   -1    2    1    4    0    1    6    2    4    4
A     0   -1    0    1    4    7    3    2    9    5    4
T     0   -1   -1   -1    0    3    6    6    5    8    4
G     0    3   -1   -2   -1    2    6    5    5    8   11
A     0   -1    2   -2    0    2    2    5    8    4    7

Alignment
----GCCAAGTAGG-
ACGAGC---GTATGA

Figure 8. Alignment with no gap penalty at ends (first row and column initiated with 0), using the same scoring parameters. The alignment is different from that of Figure 5.

LOCAL ALIGNMENTS

The Smith-Waterman algorithm

The dynamic programming algorithm can also be used for finding local sequence similarities. The Smith and Waterman algorithm [4] is very similar to the NWS method except that a calculated negative number for a matrix position is replaced by zero, indicating that no sequence similarity has been detected up to that point. When all the matrix elements have been calculated, the maximum number in the matrix is located, and the alignment is traced back from this point until a zero is reached, so that it starts at the first positive number on the path. The same sequences as above aligned with these assumptions are shown in Figure 9. Once more, a different alignment is obtained with the simple test sequence. Only the aligned parts of the sequences are shown as output of this procedure.
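The local variant differs from the global procedure only in three places: the first row and column are zero, negative values are replaced by zero, and the traceback starts at the maximum of the matrix. The sketch below shows this under the same assumptions as before (hypothetical function and parameter names, +3 for a match, -1 for a mismatch, gap score w(k) = -3 - k); it is an illustration, not the implementation of any specific program.

```python
def smith_waterman(a, b, match=3, mismatch=-1, gap_open=3, gap_extend=1):
    """Local alignment of a (rows) and b (columns); returns (score, aligned_a, aligned_b)."""
    def w(k):  # (negative) score of a gap of length k
        return -(gap_open + gap_extend * k)

    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]        # first row and column stay 0
    back = [[None] * (m + 1) for _ in range(n + 1)]  # None marks "start of alignment"
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            move = ("diag", 1)
            for k in range(1, j + 1):  # gap of length k in a
                if H[i][j - k] + w(k) > score:
                    score, move = H[i][j - k] + w(k), ("left", k)
            for k in range(1, i + 1):  # gap of length k in b
                if H[i - k][j] + w(k) > score:
                    score, move = H[i - k][j] + w(k), ("up", k)
            if score <= 0:             # no similarity detected up to this point
                score, move = 0, None
            H[i][j], back[i][j] = score, move
            if score > best:
                best, best_pos = score, (i, j)

    # Trace back from the maximum until a zero element is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while back[i][j] is not None:
        kind, k = back[i][j]
        if kind == "diag":
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif kind == "left":
            out_a.append("-" * k); out_b.append(b[j - k:j][::-1]); j -= k
        else:
            out_a.append(a[i - k:i][::-1]); out_b.append("-" * k); i -= k
    return best, "".join(out_a)[::-1], "".join(out_b)[::-1]


score, side, top = smith_waterman("ACGAGCGTATGA", "GCCAAGTAGG")
print(score)  # 11, the maximum of the local alignment matrix for these sequences
print(top)
print(side)
```

Because several local alignments of these two short sequences reach the same maximum score, the alignment printed by the sketch depends on its tie-breaking order and need not be identical to the one shown in the figure.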
      Δ    G    C    C    A    A    G    T    A    G    G
Δ     0    0    0    0    0    0    0    0    0    0    0
A     0    0    0    0    3    3    0    0    3    0    0
C     0    0    3    3    0    2    2    0    0    2    0
G     0    3    0    2    2    0    5    1    0    3    5
A     0    0    2    0    5    5    1    4    4    0    2
G     0    3    0    1    1    4    8    4    3    7    3
C     0    0    6    3    1    0    4    7    3    3    6
G     0    3    2    5    2    0    3    3    6    6    6
T     0    0    2    1    4    1    2    6    2    5    5
A     0    0    0    1    4    7    3    2    9    5    4
T     0    0    0    0    0    3    6    6    5    8    4
G     0    3    0    0    0    2    6    5    5    8   11
A     0    0    2    0    3    3    2    5    8    4    7

Alignment
CAAG--TAGG
CGAGCGTATG

Figure 9. Local alignment with no negatives. The alignment is different from the above examples and corresponds only to the positive numbers in the matrix ending up with the maximum number.

Figure 10 shows the alignment of two short segments of sequence, in this case of a protein. A segment of six amino acids gives a total maximum score of 15 using the BLOSUM62 matrix and is presented as the output of the local alignment.

     Δ   C   A   V   D   N   T   A   H   L   M   Q
Δ    0   0   0   0   0   0   0   0   0   0   0   0
A    0   0   4   0   0   0   0   4   0   0   0   0
R    0   0   0   1   0   0   0   0   4   0   0   0
G    0   0   0   0   0   0   0   0   0   0   0   0
I    0   0   0   3   0   0   0   0   0   2   1   0
D    0   0   0   0   9   1   0   0   0   0   0   3
Q    0   0   0   0   2   9   0   0   0   0   0   5
T    0   0   0   0   0   2  14   0   0   0   0   0
V    0   0   0   4   0   0   2  14   0   1   1   0
N    0   0   0   0   5   6   0   0  15   0   0   1
T    0   0   0   0   0   5  11   0   0  14   0   0
G    0   0   0   0   0   0   3  11   0   0  11   0
T    0   0   0   0   0   0   5   3   9   0   0  10
I    0   0   0   3   0   0   0   4   0  11   1   0
C    0   9   0   0   0   0   0   0   1   0  10   0
K    0   0   8   0   0   0   0   0   0   0   0  11

Alignment
VDNTAH
IDQTVN

Figure 10. Local alignment of two amino acid sequences using the BLOSUM62 scoring matrix. With a gap penalty of 12 and a gap length penalty of 4 (normal for this scoring matrix) there are no gaps found. The total score is 15 for the alignment of these 6 residues.

PROCEDURES FOR DATABASE SEARCHES

FASTA

For efficient database searches, the above methods are not sufficiently fast. Another type of alignment method initially searches for local similarities, followed by extensions. One such algorithm, used in the programs FASTP and FASTA, was developed by Pearson and Lipman [5, 6].
The basis of this method is to store sequence information from the database in lookup tables and compare the search sequence with these tables. The lookup tables and the sequence are divided into words of a specified length. For protein sequences, the position of amino acid pairs (word length = 2) or single residues (word length = 1) in all the sequences in the database is tabulated. The search sequence is compared to these tables, identities are located and an offset is calculated between the positions in the database sequence and the search sequence. For example, the sequence LAQSHATKHKIP would give a table showing alanines at positions 2 and 6 and threonine at position 7. Using the search sequence VEKFAVNHITRKIS, one finds that the alanine at position 5 gives offsets of -3 and 1 (hits at positions 2 and 6), and the threonine at position 10 gives an offset of -3 (hit at position 7). Going through the search sequence, all position offsets are counted. In this example, a region AVNHIT has three identities with a common offset of -3. This first part of the procedure is similar to locating diagonals by the dotplot procedure.

The best local matches are used as starting points for the further steps in the procedure. The regions containing these similarities are extended and scores are calculated using a suitable scoring matrix. Finally, the regions with the highest scores are used for alignment and scoring using a dynamic programming algorithm to allow for gaps. The score is given as the number of standard deviations from the average score for random sequences of the same length and composition.

Z = (similarity score - mean of random scores) / (standard deviation of random scores)

The purpose of the score is to select the best alignment and to indicate whether it represents a non-random similarity. The distribution of scores does not follow what statisticians call a normal distribution, but rather a more skewed distribution (cf. Figure 2).
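The lookup-table step can be reproduced for the example in the text with a short sketch. The function name and its parameters are assumptions made for this illustration, and only the first step (tabulating words and counting offsets) is shown, not the later extension and dynamic programming steps. Positions are counted from 1 and the offset is the database position minus the position in the search sequence, as in the example above.

```python
from collections import Counter, defaultdict


def offset_counts(db_seq, query, word_len=1):
    """Count, for every offset, the identical words shared by db_seq and query."""
    # Lookup table: word -> list of 1-based start positions in the database sequence.
    table = defaultdict(list)
    for pos in range(len(db_seq) - word_len + 1):
        table[db_seq[pos:pos + word_len]].append(pos + 1)

    counts = Counter()
    for qpos in range(len(query) - word_len + 1):
        word = query[qpos:qpos + word_len]
        for dpos in table.get(word, []):
            counts[dpos - (qpos + 1)] += 1  # offset = database position - query position
    return counts


counts = offset_counts("LAQSHATKHKIP", "VEKFAVNHITRKIS", word_len=1)
print(counts.most_common(1))  # [(-3, 3)]: three identities share the offset -3

# With amino acid pairs (word length 2) the only shared word is KI.
print(dict(offset_counts("LAQSHATKHKIP", "VEKFAVNHITRKIS", word_len=2)))  # {-2: 1}
```

With single residues, the diagonal at offset -3 (the region AVNHIT) stands out. With word length 2 it disappears, because none of its three identities are adjacent.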
Because of this skewed distribution, the score is also translated into an expectation value, E, which gives the number of random similarities with at least this score that can be expected in the search.

The problem with this method is the sensitivity. If amino acid pairs are used for the initial search, related sequences lacking two consecutive identities in a region of homology will be missed. Using single residues will increase the sensitivity but also the computing time.

BLAST

The most used algorithm for database searches is BLAST, developed by Altschul et al. [7]. This program uses an approach similar to that of the FASTA algorithm, but the initial search does not use identities; instead it tries to identify short segments of a specified length (words) that have similarities with scores above a suitable threshold. For a protein, a typical length of these words might be three amino acids, and the score is calculated using a substitution matrix from the PAM or BLOSUM series. Since some amino acids mutate more readily than others, there are some words that will give comparatively low scores even for a complete match, for example combinations of L, A, V, I and S using the BLOSUM62 matrix. To save time, only words in the database sequences that score above a certain threshold with a word in the search sequence will be used for further analysis. This threshold is chosen in a way that will pick up as many as possible of the sequences that are significantly similar without increasing the computing time too much. The hits are extended to locate segments of similarity without gaps, and segments with scores above a certain threshold (high scoring pairs) are reported. To reduce the computing time for the rapidly increasing databases, recent modifications of the BLAST procedure require a combination of two hits, within a window of suitable length but without intervening gaps, before extending a segment [8].
Since two hits are required, the selectivity is increased, but the threshold for hits can be lowered to increase the sensitivity without increasing the computing time. In all these searches, the aim is to pick up all related sequences without finding unrelated sequences. In this respect, the BLAST procedure seems to be similar in performance to the more general dynamic programming algorithm.

The output of the program gives a list of hits sorted according to the calculated score. It also gives the probability of the alignment being a random hit. The probability of a score being non-random will depend on the composition of the sequence, the scoring matrix and the gap penalty, which can be used to calculate the parameters described in the legend to Figure 2. The value used in BLAST searches is the E value. This number gives the number of random similarities with this score that is expected in the search. Compared to the probability P described above, it is thus multiplied by the size of the database. Values above 0.02 are by default regarded as representing random similarities, but might of course still represent true homologues with very low sequence similarity.

PSI-BLAST

The most sensitive methods for finding similarities between sequences use not only a single sequence but also try to use the information from several similar sequences. This is also done in PSI-BLAST (Position-Specific Iterated BLAST). The first step of this procedure is a normal BLAST search using the search sequence. If this search gives some significant hits, these sequences are aligned to the search sequence and used to construct a sequence profile. This profile contains the sequence information from several related proteins and is used for another BLAST search of the database. This search might pick up further sequences. New hits are again aligned to construct an improved profile and the procedure can be continued until no more hits are found.
The procedure can be completely automatic, or it can allow manual intervention to change the threshold or modify the list of sequences that are used to construct the profile. PSI-BLAST is often able to pick up relationships that are not detected with the other programs. In contrast to other sensitive procedures like the HMM methods described below, PSI-BLAST finds homologous sequences and aligns them automatically to increase the sensitivity of the search. The procedure can therefore be used without any prior knowledge about possible homologues in the database.

MULTIPLE ALIGNMENTS

The alignment of many homologous sequences is in many cases valuable. Phylogenetic analysis of sequence data depends on multiple alignments, and new sequences can be aligned more accurately when the alignment is based on the pattern of conservation from already aligned sequences. The pattern of conserved residues is also important for the functional characterization of a protein.

The dynamic programming algorithm used for pairwise alignments can in principle be extended to align multiple sequences. The computing time needed for this approach increases enormously with the number of sequences, and this procedure can therefore only be used for a very limited number of sequences.

Hierarchical methods: Clustal and T-Coffee

The most common procedure for multiple sequence alignment uses hierarchical methods. This is the procedure used in the long-established program ClustalW. In these methods, alignments of all pairs of sequences are made first using the dynamic programming algorithm, and a similarity score is calculated (Figure 11).

       2     3     4     5     6     7     8
1   4.05  0.68  0.85  0.86  0.70  0.55  0.30
2         1.05  1.03  1.10  0.76  0.62  0.35
3               3.99  3.76  1.82  1.52  0.25
4                     3.88  1.74  1.78  0.40
5                           1.56  1.48  0.34
6                                 3.67  0.15
7                                       0.07

Figure 11. The list of sequence similarities used to decide on the order of the stepwise alignment (pairwise scores, sequence number by sequence number).

The sequences are then grouped according to their similarities into a tree (hierarchical cluster analysis), see Figure 12. Finally, starting with the most similar pair, in this case myg_human and myg_chick (1 and 2), the sequences are aligned using dynamic programming. Other similar pairs are also aligned, and these groups of aligned sequences are aligned to the others until all sequences have been added. The aligned sequences are output as well as the cluster analysis, but these procedures normally do not include any statistical analysis of the significance of the alignment.

Figure 12. Dendrogram showing the similarity between eight globin sequences (hemoglobin chains α, β and γ from chicken and man, myoglobin from man and chicken, and the Scapharca inequivalvis hemoglobin). Scoring matrix BLOSUM62, using a gap penalty of 12 and a gap length penalty of 4.

myg_human   ~~~~~~~~GL SDGEWQLVLN VWGKVEADIP GHGQEVLIRL FKGHPETLEK FDKFKHLKSE
myg_chick   ~~~~~~~~GL SDQEWQQVLT IWGKVEADIA GHGHEVLMRL FHDHPETLDR FDKFKGLKTP
hbb_human   ~~~~~~~VHL TPEEKSAVTA LWGKV..NVD EVGGEALGRL LVVYPWTQRF FESFGDLSTP
hbg_human   ~~~~~~~GHF TEEDKATITS LWGKV..NVE DAGGETLGRL LVVYPWTQRF FDSFGNLSSA
hbb_chick   ~~~~~~~VHW TAEEKQLITG LWGKV..NVA ECGAEALARL LIVYPWTQRF FASFGNLSSP
hba_human   ~~~~~~~~VL SPADKTNVKA AWGKVGAHAG EYGAEALERM FLSFPTTKTY FPHF.DLSH.
hba_chick   ~~~~~~~~VL SAADKNNVKG IFTKIAGHAE EYGAETLERM FTTYPPTKTY FPHF.DLSH.
glb1_scain  PSVYDAAAQL TADVKKDLRD SWKVIGSDKK GNGVALMTTL FADNQETIGY FKRLGDVSQG

myg_human   DEMKASEDLK KHGATVLTAL GGILKKKGHH EAEIKPLAQS HATKHKIPVK YLEFISECII
myg_chick   DQMKGSEDLK KHGATVLTQL GKILKQKGNH ESELKPLAQT HATKHKIPVK YLEFISEVII
hbb_human   DAVMGNPKVK AHGKKVLGAF SDGLAHLDNL KGTFATLSEL HCDKLHVDPE NFRLLGNVLV
hbg_human   SAIMGNPKVK AHGKKVLTSL GDAIKHLDDL KGTFAQLSEL HCDKLHVDPE NFKLLGNVLV
hbb_chick   TAILGNPMVR AHGKKVLTSF GDAVKNLDNI KNTFSQLSEL HCDKLHVDPE NFRLLGDILI
hba_human   ....GSAQVK GHGKKVADAL TNAVAHVDDM PNALSALSDL HAHKLRVDPV NFKLLSHCLL
hba_chick   ....GSAQIK GHGKKVVAAL IEAANHIDDI AGTLSKLSDL HAHKLRVDPV NFKLLGQCFL
glb1_scain  ...MANDKLR GHSITLMYAL QNFIDQLDN. PDDLVCVVEK FAVNHITRKI SAAEFGKING

myg_human   QVLQSKHPGD FGADAQGAMN KALELFRKDM ASNYKELGFQ G
myg_chick   KVIAEKHAAD FGADSQAAMK KALELFRNDM ASKYKEFGFQ G
hbb_human   CVLAHHFGKE FTPPVQAAYQ KVVAGVANAL AHKYH~~~~~ ~
hbg_human   TVLAIHFGKE FTPEVQASWQ KMVTAVASAL SSRYH~~~~~ ~
hbb_chick   IVLAAHFSKD FTPECQAAWQ KLVRVVAHAL ARKYH~~~~~ ~
hba_human   VTLAAHLPAE FTPAVHASLD KFLASVSTVL TSKYR~~~~~ ~
hba_chick   VVVAIHHPAA LTPEVHASLD KFLCAVGTVL TAKYR~~~~~ ~
glb1_scain  PIKKVLASKN FGDKYANAWA KLVAVVQAAL ~~~~~~~~~~ ~

Figure 13. The aligned sequences of the eight globin chains.

ClustalW might perform poorly with distantly related proteins. The main reason is that it is unable to modify the pairwise alignments based on the information from the rest of the sequences. A more recent method, T-Coffee [9], is basically a hierarchical method, but by comparing primary pairwise alignments it can generate information about all alignments that is then used for the final hierarchical alignment. There are servers that allow users to align their sequences with T-Coffee, for example at http://services.bioasp.nl/blast/cgi-bin/tcoffee.cgi.

Profile methods and HMM

The hierarchical method used in Clustal uses the normal scoring matrices described for the pairwise alignments. When several sequences are aligned, the pattern of conservation and variation in the sequences is not used at all to improve the alignment. For example, three residues (Phe, His and Lys, shown in bold letters in the original figure) are conserved in all the globins in the alignment above. This suggests that in this group of proteins these residues are conserved for structural or functional reasons. When aligning a distantly related globin, it seems reasonable to find an alignment that would put these residues at the corresponding positions.
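The pattern of conservation can be tabulated directly from a multiple alignment. The sketch below is an illustration with invented function names; it uses one ten-residue block from the globin alignment of Figure 13 (the block beginning KHGATVLTAL in myg_human) and reports the residue frequencies in each column together with the columns where all eight sequences carry the same residue.

```python
from collections import Counter

GAP_CHARS = ".-~"  # gap symbols, as they appear in the globin alignment


def column_profile(alignment):
    """Return, for every column, a dict mapping residue -> observed frequency."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


def conserved_columns(alignment):
    """Indices (0-based) of columns holding one and the same residue in all sequences."""
    return [i for i, freqs in enumerate(column_profile(alignment))
            if len(freqs) == 1 and next(iter(freqs)) not in GAP_CHARS]


# One block of the globin alignment, in the order myg_human ... glb1_scain.
block = [
    "KHGATVLTAL", "KHGATVLTQL", "AHGKKVLGAF", "AHGKKVLTSL",
    "AHGKKVLTSF", "GHGKKVADAL", "GHGKKVVAAL", "GHSITLMYAL",
]
print(conserved_columns(block))  # [1]: the histidine is present in all eight chains
print(column_profile(block)[2])  # G in seven sequences, S in one
```

Such per-column frequencies are the raw material for position-specific scoring, where a mismatch at a conserved column can be made to count more than one at a variable column.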
When a family of proteins is aligned, the probability of the occurrence of a certain residue (or a gap or an insertion) at each position in the sequence can be estimated. This can be used for the alignment of additional sequences. Alignment by hand might be based on implicit use of this information. There are profile methods, where the scoring in some way is based on the probability of finding a certain residue at that position [10]. In the simplest approach, the score for placing a residue at a position is based on the pairwise score between this residue and all the residues observed at this position. This position-specific score takes into account the distribution of amino acid residues found at this position. If a residue is highly conserved, however, a mismatch at such a position does not necessarily give a lower score than at a variable position.

One method that explicitly uses the information about the substitution pattern in a specific protein is based on so-called hidden Markov models (HMM) for assessing the significance of the alignment [11]. This method uses a statistical treatment of sequence information from a family of proteins. A hidden Markov model is a probabilistic model that has been used extensively in speech recognition. In the case of sequences, the HMM is ideally able to generate new probable sequences based on observed sequences. This is normally not very interesting, but it can also estimate the probability that a search sequence belongs to the protein family.

In the hidden Markov model, there are three states (Figure 14). In one state, the match state, a probability of finding a certain residue at that position is stored. This probability is called the emission probability. For each position, a probability of going into two other states is stored. These states are the insertion and deletion states. In a first step, the topology (the number of match, insert and delete states) and the parameters of the HMM have to be defined. This can be done in two ways.
One is to use a number of sequences aligned by other means to define the probabilities. The other is to use the sequences of many homologous proteins to "train" the HMM. Training means that the parameters of the HMM are automatically adjusted so that the calculated probability of the set of training sequences, including gaps and insertions, is maximized. The HMM can then be used to align new sequences and to give statistical measures of how well they fit the model (the probability that the HMM will "emit" that sequence). The most important feature of this procedure is thus that it not only suggests an alignment, but also gives the probability that the aligned sequence is related to the group of sequences. In this way, one can find homology between sequences where the similarities are smaller than the level that normally is required to establish an evolutionary relation.

Figure 14. A simple illustration of a hidden Markov model. There is a Begin and an End state. M stands for match state, I for insert state and D for delete state. A probability, based on the training sequences, is associated with each arrow. The probabilities on the arrows leaving each state sum to 1 (see cover picture). The I and M states emit residues, I with a basic probability for any residue, and M with the position-specific probabilities defined from the training sequences. Only two nodes are shown. Each node corresponds to one residue in the sequence of the HMM. In a more realistic case, the number of match states would be the same as the length of the proteins in the family. Any sequence can be generated in many ways using the HMM, but one path will give the highest probability.

Structural alignment and reliability of alignments

An alignment between two sequences is correct when the residues that are derived from the same specific residue in the common ancestor are aligned.
It is impossible to reconstruct the exact evolutionary history of a sequence, and we must therefore realize that our alignments might be at least partially incorrect. In the simple cases where the similarities are high, we have good reasons to assume that the alignments are correct, but for weaker similarities this is not possible.

The structure of a protein is more conserved than the sequence, and the most reliable alignment can be obtained when the conformations of the proteins are known. In this case, superposition of the structural models of the proteins will give a relatively reliable alignment in all regions where the protein conformation is similar. These regions are normally the core and the central parts of the secondary structure elements of the proteins. In surface loops, insertions and gaps often occur and the conformation is different. In such regions, it is often impossible to decide where the gap or insertion has occurred and thus which alignment is correct.

The reliability of any alignment will therefore vary along the sequence. In the most conserved regions, the alignment is reliable, but in less conserved regions where gaps and insertions are found, one has to realize that the alignment is more or less meaningless. In fact, even a local sequence similarity in a region of different conformation might be the result of random mutations rather than conservation of residues.

Comparisons of related proteins have shown that gaps and insertions rarely occur in the central parts of secondary structure elements. When aligning a sequence to the sequence of a protein with known structure, one can therefore assume that gaps or insertions in the middle of secondary structure elements are unlikely.

A little Swedish vocabulary

Alignment                  Inpassning
Dotplot graph              Punktdiagram?
Gap                        Lucka
Gap weight or penalty      Luckkostnad
Gap opening penalty        Lucköppningskostnad
Similarity score           Likhetspoäng?
Substitution matrix        Utbytesmatris

References

1. Henikoff, S. & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89, 10915-9.
2. Needleman, S. B. & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48, 443-53.
3. Sellers, P. H. (1974). SIAM J Appl Math 26, 787-.
4. Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences. J Mol Biol 147, 195-197.
5. Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85, 2444-8.
6. Lipman, D. J. & Pearson, W. R. (1985). Rapid and sensitive protein similarity searches. Science 227, 1435-41.
7. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-10.
8. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389-402.
9. Notredame, C., Higgins, D. G. & Heringa, J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302, 205-17.
10. Gribskov, M., McLachlan, A. D. & Eisenberg, D. (1987). Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci U S A 84, 4355-8.
11. Eddy, S. R. (1996). Hidden Markov models. Curr Opin Struct Biol 6, 361-5.