Methods for alignment of sequences
Lars Liljas
Department of Cell and Molecular Biology
University of Uppsala
Uppsala, Sweden
V 1.8.1 060111
SOME BASICS
Accurate alignments of sequences are needed for many types of analyses. Aligned
sequences might be the basis of phylogenetic analysis of evolutionary relationships. They are
also important for identification of protein functions and for modeling of protein
conformation. Alignment methods are also used to search for similarities between new
sequences and sequences in databases. Depending on the purposes, different properties of the
alignment algorithm are important: searches in extensive databases require speed, while
algorithms for alignments of homologous sequences can be optimized to use all available
information to produce the most reliable alignment. Alignment algorithms are therefore either
global, trying to optimally align entire sequences, or local, finding short segments of high
similarity.
Scoring matrices
All alignment methods need some sort of scoring for matches and mismatches. For a
certain alignment, a number is assigned to each position in the sequence depending on the
match at that position. The scores for all positions in the alignment are then added to calculate
a total score, which is used to select the optimal alignment among alternative alignments. The
choice of scoring matrix might be important for the result of the alignment, and it is therefore
of interest to understand how the matrices are constructed and to select the one that best fits
the needs of the actual alignment.
The simplest way of scoring is to assign one number for a match and another number
for a mismatch. Such a matrix is often referred to as a unitary matrix. For nucleic acid
alignments, such simple scoring matrices might be adequate.
Changes in amino acid sequences are generally more informative than changes in the
base sequence, since protein function, and thereby the chances of survival for the organism,
are directly related to the nature of the residue. A change from a valine to an isoleucine (for
example) is more likely to be found than a valine to an aspartate change. For amino acid
sequences, the alignment programs therefore utilize scoring matrices, which in a direct or
indirect way contain information about the likelihood of a certain change. For closely related
proteins, however, a matrix scoring only for identities will mostly give the same alignment as
a more sophisticated scoring matrix. Some scoring matrices are based on the genetic code: the
minimum number of base changes necessary to change from one amino acid to another
(minimum mutation distance matrix). Other matrices use classifications of amino acids
according to their physical properties.
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
R  -1   5   0  -2  -3   1   0  -2   0  -3  -2   2  -1  -3  -2  -1  -1  -3  -2  -3
N  -2   0   6   1  -3   0   0   0   1  -3  -3   0  -2  -3  -2   1   0  -4  -2  -3
D  -2  -2   1   6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3
C   0  -3  -3  -3   9  -3  -4  -3  -3  -1  -1  -3  -1  -2  -3  -1  -1  -2  -2  -1
Q  -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2
E  -1   0   0   2  -4   2   5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2
G   0  -2   0  -1  -3  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3
H  -2   0   1  -1  -3   0   0  -2   8  -3  -3  -1  -2  -1  -2  -1  -2  -2   2  -3
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4   2  -3   1   0  -3  -2  -1  -3  -1   3
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5   0  -2  -1  -1  -1  -1   1
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7  -1  -1  -4  -3  -2
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -2  -2   0
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11   2  -3
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7  -1
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
Figure 1. Example of an alignment scoring matrix (The BLOSUM 62 matrix).
The most commonly used scoring matrices are based on statistical analysis of amino
acid changes in homologous proteins. The PAM series are based on estimated mutation rates
(Percent Accepted Mutations) from closely related proteins and will therefore be dominated
by amino acid mutations caused by single base changes. A PAM matrix is also called a log-odds
matrix, since the numbers are proportional to the logarithm of the “odds” for the replacement
not being a random change. The odds are the number of times residue A is observed to be
replaced by B, divided by the number of times A would be replaced by B if the replacements
occurred at random. Positive scores in the matrix thus represent a pair of residues that replace
each other more often than is expected by chance. PAM 1 stands for 1 % accepted mutations,
meaning that on average only one amino acid of 100 is changed. PAM matrices for less
similar sequences are obtained by extrapolation. The PAM100 matrix corresponds to 100
accepted mutations per 100 residues, but since the same residue might change more than
once, two sequences with this level of mutations will still have about 50 % identities. In the
same way, the PAM250 matrix corresponds to a level of about 20 % identical residues.
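The log-odds idea can be illustrated with a few lines of Python. This is only a sketch: the function name is hypothetical, and the scale factor of 2 and the rounding to whole numbers are assumptions chosen for illustration, not values taken from the PAM tables.

```python
import math


def log_odds_score(observed, expected, scale=2.0):
    """Return a rounded log-odds score for one residue pair.

    observed: how often the replacement is seen in related proteins.
    expected: how often it would be seen if replacements were random.
    scale:    assumed scaling factor applied before rounding.
    """
    odds = observed / expected
    return round(scale * math.log2(odds))


# A pair seen four times more often than chance gets a positive score,
# a pair seen four times less often than chance a negative one.
print(log_odds_score(4.0, 1.0))
print(log_odds_score(1.0, 4.0))
```

A pair observed exactly as often as expected has odds of 1 and therefore a score of 0.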
The PAM matrices are based on a model of the events in the evolution. In contrast, the
BLOSUM series [1] are calculated from blocks of aligned sequences from homologous proteins
with a certain level of identity. In this way, the pattern of changes observed at a certain level
of identity is used, instead of extrapolations from the evolutionary events observed for closely
related proteins. Figure 1 shows the BLOSUM 62 matrix. The score for a certain pair of
aligned residues is found at the intersection of the row and column corresponding to the two
amino acids. The numbers on the diagonal show the score for residues that have not changed,
and one can note that there is a higher score for retaining a tryptophan (W) than a valine (V).
Most changes correspond to negative numbers, but some conservative changes also have
positive scores, for example lysine (K) to arginine (R) or valine (V) to isoleucine (I). The
BLOSUM matrices seem to perform slightly better than the PAM matrices, and BLOSUM62
is often the default matrix used in searches and alignments.
Gaps
The sequences to be aligned are not necessarily of the same length, and this is the main
problem with alignments. In the evolution from a (possible) common ancestor, residues are
inserted and deleted. To allow for that, the score from all pairs of aligned residues is
combined with suitable penalties for introducing gaps (gap penalties). The total score is used
to select the optimal alignment. In practice, there is no useful statistical treatment of gaps in
proteins that has been used to determine gap penalties. The gap penalties are normally defined
by two parameters, one for opening of a gap and one that gives a penalty proportional to the
length of the gap. Most programs allow the user to choose these parameters, which might
have different optima for different systems. They also depend on the scoring matrix used.
Alignment programs normally have useful default values connected to the substitution matrix
that is selected.
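As a rough illustration of how the two gap parameters enter the total score, the following Python sketch scores an already aligned pair of sequences. The parameter values (match +3, mismatch -1, gap opening -3, -1 per gapped residue) are the ones used in the nucleotide example later in the text; the function names and the gapped test alignment are illustrative assumptions.

```python
from itertools import groupby


def gap_penalty(length, opening=-3, per_residue=-1):
    """Penalty for one gap: an opening term plus a term proportional to length."""
    return opening + per_residue * length


def column_kind(column):
    """Classify one alignment column as a residue pair or a gap in either row."""
    top, bottom = column
    if top == "-":
        return "gap_top"
    if bottom == "-":
        return "gap_bottom"
    return "pair"


def score_alignment(top, bottom, match=3, mismatch=-1):
    """Sum the pair scores and the penalties of all gaps in an alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    # Consecutive columns of the same kind form one run; a run of gap
    # columns is one gap and is penalized once, according to its length.
    for kind, run in groupby(zip(top, bottom), key=column_kind):
        run = list(run)
        if kind == "pair":
            total += sum(match if x == y else mismatch for x, y in run)
        else:
            total += gap_penalty(len(run))
    return total


print(score_alignment("GCCAA-GTAGG-", "ACGAGCGTATGA"))
```

With these parameters one gap of two residues costs 5, whereas two separate gaps of one residue each cost 8, so the scoring favours fewer, longer gaps.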
Figure 2. The distribution of scores obtained from alignment of unrelated sequences. The solid curve shows a
Gaussian curve (normal distribution). The observed values follow an extreme-value distribution, where the
probability for a score larger than x is 1-exp(-Kexp(-λx)). If the values of K and λ can be estimated, the
probability of a score being non-random can be calculated.
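The expression in the legend can be evaluated directly. The Python sketch below assumes that K and λ have already been estimated; the numerical values used in the example call are arbitrary placeholders, not estimates for any real scoring system.

```python
import math


def prob_score_exceeds(x, K, lam):
    """P(score > x) = 1 - exp(-K * exp(-lam * x)) for an extreme-value distribution."""
    # -expm1(-t) equals 1 - exp(-t) but keeps precision when t is very small.
    return -math.expm1(-K * math.exp(-lam * x))


# Placeholder parameters, for illustration only.
for score in (20, 40, 60):
    print(score, prob_score_exceeds(score, K=0.1, lam=0.3))
```

The probability falls off roughly exponentially with the score, which is why the tail of the distribution is heavier than that of a Gaussian curve with the same mean and spread.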
Estimating the significance of an alignment
It is not trivial to find a good estimate of the significance of an alignment, the main
reason being that the amino acid composition and the distribution of the residues are not
random. To compensate for the non-random composition, the significance of an alignment
score can be estimated by repeating the alignment with randomized sequences of the same
composition. A number of such alignments will give a mean score and an estimate of the
expected variation of the score for random sequences similar to the aligned sequences. From
this, the obtained score for the “real” alignment can be expressed as number of standard
deviations above the mean obtained from the random sequences. The obtained distribution of
scores is, at least for local alignments (see below) a skewed distribution (Figure 2), different
from the familiar normal distribution.
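The randomization procedure can be sketched in Python as follows. To keep the example short, the alignment score is replaced by a simple count of identities between two ungapped sequences of equal length; this stand-in, the function names and the number of shuffles are assumptions made for illustration. Any real alignment score could be passed in as score_fn instead.

```python
import random
import statistics


def identity_score(a, b):
    """Stand-in score: number of identical residues at the same position."""
    return sum(x == y for x, y in zip(a, b))


def shuffled_z_score(a, b, score_fn=identity_score, n_shuffles=200, seed=0):
    """Express the score of (a, b) in standard deviations above the mean
    score obtained when b is replaced by shuffled copies of itself."""
    rng = random.Random(seed)
    real = score_fn(a, b)
    letters = list(b)
    random_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same composition, random order
        random_scores.append(score_fn(a, "".join(letters)))
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    if sd == 0:
        raise ValueError("all shuffled scores are identical")
    return (real - mean) / sd


seq = "ACDEFGHIKLMNPQRSTVWY"
print(shuffled_z_score(seq, seq))
```

Because the shuffled sequences keep the composition of the original, a biased composition raises the random scores as well and is thereby compensated for.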
A special problem in alignment is the presence in some proteins of regions with a very
different amino acid composition, for example an exceptionally high content of arginines or
glutamines. Such “low complexity” regions might give high scores also in cases where there
is no homology.
Dotplot analysis
A direct way of comparing sequences is the dotplot analysis. The two sequences to be
compared form the rows and columns of a matrix, and in the simplest case a dot is plotted in a
graphical representation of the matrix wherever the sequences are identical. In practice, the
comparison is most often done in a window of specified length, and a dot is put in the graph
whenever the score within this window is above a certain specified threshold. The score might
be calculated as the number of identities or as a similarity based on a scoring matrix.
Sequence similarities will be obvious as diagonals in the plot. The method does not align
sequences but it is a quick way of spotting similarities. It has the additional advantage over
other alignment methods that it can detect repetitions (parallel diagonals) and permutations in
the sequences.
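A windowed dotplot based on identities can be sketched in a few lines of Python. The window length, the threshold and the function name are illustrative choices; the two sequences are the short nucleotide sequences used in the alignment examples later in the text.

```python
def dotplot(seq_a, seq_b, window=3, threshold=3):
    """Return the set of (i, j) positions (0-based window starts) where the
    number of identities within the window reaches the threshold."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            identities = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if identities >= threshold:
                dots.add((i, j))
    return dots


a = "GCCAAGTAGG"
b = "ACGAGCGTATGA"
dots = dotplot(a, b)
# Print a crude text version of the plot: one row per window start in a.
for i in range(len(a) - 2):
    print("".join("*" if (i, j) in dots else "." for j in range(len(b) - 2)))
```

Lowering the threshold (for example two identities out of three) gives more dots and more background noise, which is the trade-off the window and threshold are meant to control.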
Figure 3. A dotplot analysis of the sequences of human hemoglobin α chain and erythrocruorin from
Chironomus.
GLOBAL ALIGNMENTS
Needleman-Wunsch-Sellers method
When two sequences are compared, we can make a global alignment or a local
alignment. Global alignments aim at optimally aligning all of two or more sequences, while
local alignments just try to find similar parts. The most used methods for global alignments
are based on algorithms originally developed by Needleman and Wunsch [2] and modified by
Sellers [3]. This procedure (the NWS method) uses a dynamic programming algorithm that
simplifies the enormous task of calculating a score for all possible alignments of two
sequences with gaps of any lengths. The sequences to be aligned are arranged as rows and
columns of a rectangular matrix. A score is calculated for each position of the matrix
according to three possible events: replacement (or conservation) of a residue, insertion in
sequence A or insertion in sequence B. The matrix elements D_{i,j} are filled with numbers
according to the rule

$$
D_{i,j} = \max
\begin{cases}
D_{i-1,j-1} + \beta(A_i, B_j) \\
D_{i,j-k} + w(k), & k = 1, \dots, j-1 \\
D_{i-k,j} + w(k), & k = 1, \dots, i-1
\end{cases}
$$

The three alternatives are shown graphically in Figure 4.
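[Figure 4: diagram of the matrix in which arrows lead to the element D(i,j) from the diagonal neighbour D(i-1,j-1), from an element D(i,j-y) in the same row, and from an element D(i-x,j) in the same column.]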
Figure 4. The three alternative ways of finding the score in the matrix elements.
Here the first alternative corresponds to the case of no insertions or deletions: to the
previous diagonal element Di-1,j-1 is added the score β for the current pair of residues (Ai, Bj).
The other two alternatives correspond to insertions in A or B: to the element above or to the
left is added the score for a deletion of length k. The scoring begins at the first element of the
matrix, corresponding to one end of the sequences. An example using a short nucleotide
sequence is shown in Figure 5 and Figure 6. The total score is found as the last element in the
matrix. The actual sequence alignment is obtained from the path through which this number is
obtained. The optimal alignment is thus obtained by tracing back the calculations from the
element in the lower right corner.
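The filling of the matrix and the trace-back can be sketched in Python as below. This is a plain, unoptimized illustration of the rule above, not the code of any particular program. It assumes the scoring of the nucleotide example (match +3, mismatch -1, gap score w(k) = -3 - k) and penalizes end gaps like all other gaps; the function names are hypothetical. When several paths give the same score, the trace-back simply returns one of them.

```python
def gap_score(k, opening=-3, per_residue=-1):
    """Gap score w(k) for a gap of k residues; the defaults give w(k) = -3 - k."""
    return opening + per_residue * k


def fill_matrix(a, b, match=3, mismatch=-1, w=gap_score):
    """D[i][j] is the best score for aligning a[:i] with b[:j]."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = w(i)  # a gap at the start is scored like any other gap
    for j in range(1, m + 1):
        D[0][j] = w(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            best = D[i - 1][j - 1] + s              # replacement or conservation
            for k in range(1, j + 1):               # gap of length k in a
                best = max(best, D[i][j - k] + w(k))
            for k in range(1, i + 1):               # gap of length k in b
                best = max(best, D[i - k][j] + w(k))
            D[i][j] = best
    return D


def trace_back(D, a, b, match=3, mismatch=-1, w=gap_score):
    """Follow one optimal path from the last element back to the first."""
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if D[i][j] == D[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        k = next((k for k in range(1, j + 1) if D[i][j] == D[i][j - k] + w(k)), 0)
        if k:
            out_a.extend("-" * k)
            out_b.extend(reversed(b[j - k:j]))
            j -= k
            continue
        k = next(k for k in range(1, i + 1) if D[i][j] == D[i - k][j] + w(k))
        out_a.extend(reversed(a[i - k:i]))
        out_b.extend("-" * k)
        i -= k
    return "".join(reversed(out_a)), "".join(reversed(out_b))


D = fill_matrix("GCCAAGTAGG", "ACGAGCGTATGA")
print(D[-1][-1])
print(*trace_back(D, "GCCAAGTAGG", "ACGAGCGTATGA"), sep="\n")
```

The inner loops over k make this sketch cubic in the sequence length. Programs in practical use normally restrict the gap score to the opening-plus-extension form and avoid the loops, but the result is the same.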
    Δ   G   C   C   A   A   G   T   A   G   G
Δ   0  -4  -5  -6  -7  -8  -9 -10 -11 -12 -13
A  -4  -1  -5  -6  -3  -4  -8  -9  -7 -11 -12
C  -5  -5   2  -2  -3  -4  -5  -6  -7  -8  -9
G  -6  -2  -2   1  -3  -4  -1  -5  -6  -4  -5
A  -7  -6  -3  -3   4   0  -1  -2  -2  -4  -5
G  -8  -4  -4  -4   0   3   3  -1  -2   1  -1
C  -9  -8  -1  -1  -1  -1   2   2  -2  -3   0
G -10  -6  -5  -2  -2  -2   2   1   1   1   0
T -11 -10  -6  -6  -3  -3  -2   5   1   0   0
A -12 -11  -7  -7  -3   0  -3   1   8   4   3
T -13 -12  -8  -8  -5  -4  -1   0   4   7   3
G -14 -10  -9  -9  -6  -5  -1  -1   3   7  10
A -15 -14 -10 -10  -6  -3  -5  -2   2   3   6

Alignment
GCCAA-GTAGG-
ACGAGCGTATGA
Figure 5. Alignment of two nucleotide sequences. Here a match has a score of +3, a mismatch a score of –1.
Gaps are scored by –3 for insertion of a gap and –1 for each inserted residue (w=-3-k). The scores are calculated
from the 5’ end. The arrows show the path to the number in the lower right corner. In this global alignment with
end gap penalty, there will be two places with alternative alignments of the same score. They correspond to
shifting the position of the two gaps in the top sequence.
[Figure 6: the upper left corner of the matrix of Figure 5 (columns Δ G C C A A G, rows Δ A C G A G C), with three candidate scores shown in each cell.]
Figure 6. The upper corner of the matrix above showing the numbers obtained by substitutions (top) or
insertions/deletions (middle and bottom). Note that the lower two numbers result either from the introduction of
a new gap (-4) or extension of a gap (-1). An example of a gap extension is encircled.
At two positions of this example, the alignment is ambiguous. In this particular case,
one is the difference of whether to put a single gap after nucleotide four or five in the top
sequence. The second corresponds to putting a gap at the end of the top sequence or before
the last two nucleotides.
It is important to note that any sequences can be aligned using these procedures,
irrespective of their possible homology. Programs using this type of algorithm normally
present one solution and do not show alternative alignments with the same score.
An example of the alignment of protein sequences is shown in Figure 7.
    Δ   A   N   D   W   P   I   A   G   S   H
Δ   0 -16 -20 -24 -28 -32 -36 -40 -44 -48 -52
S -16   1 -15 -19 -23 -27 -31 -35 -39 -40 -47
T -20 -15   1 -15 -19 -23 -27 -31 -35 -38 -42
S -24 -19 -14   1 -15 -19 -23 -26 -31 -31 -39
T -28 -23 -19 -15  -1 -16 -20 -23 -28 -30 -33
D -32 -27 -22 -13 -17  -2 -18 -22 -24 -28 -31
W -36 -31 -27 -23  -2 -18  -5 -21 -24 -27 -30
P -40 -35 -31 -27 -18   5 -11  -6 -19 -23 -27
L -44 -39 -35 -31 -22 -11   7  -9 -10 -17 -21
V -48 -43 -39 -35 -26 -15  -8   7  -9 -12 -17
G -52 -47 -43 -39 -30 -19 -13  -8  13  -3  -7
A -56 -48 -47 -43 -34 -23 -17  -9  -3  14  -2
T -60 -55 -48 -47 -38 -27 -21 -17  -7  -2  12

Alignment
--ANDWPIAGSH
STSTDWPLVGAT
or
AN--DWPIAGSH
STSTDWPLVGAT
Figure 7. Alignment of two short protein segments. BLOSUM62 is used, the gap opening weight is –12 and the
extension weight is –4. Bold numbers show the path of the alignment.
In these alignments, gaps at the beginning and end of the sequences have the same gap
penalties as gaps in other parts of the sequences. This has the effect that the algorithm will try
to align all of the two sequences. If the sequences are of different lengths or if only parts of
the sequences are similar this might have negative effects. One way to handle such cases is
not to penalize gaps at the ends of the sequences. Figure 8 shows the same sequences aligned
under this condition. The optimal alignment is chosen as the one giving the highest number in
the last row or column. In this case, the alignment of the same nucleotide sequence is
different, but the same number of identities has been obtained.
    Δ   G   C   C   A   A   G   T   A   G   G
Δ   0   0   0   0   0   0   0   0   0   0   0
A   0  -1  -1  -1   3   3  -1  -1   3  -1  -1
C   0  -1   2   2  -1   2   2  -2  -1   2  -2
G   0   3  -1   1   1  -2   5   1   0   2   5
A   0  -1   2  -2   4   4   1   4   4   0   1
G   0   3  -1   1   0   3   7   3   3   7   3
C   0  -1   6   2   1   0   3   6   2   3   6
G   0   3   2   5   1   0   3   2   5   5   6
T   0  -1   2   1   4   0   1   6   2   4   4
A   0  -1   0   1   4   7   3   2   9   5   4
T   0  -1  -1  -1   0   3   6   6   5   8   4
G   0   3  -1  -2  -1   2   6   5   5   8  11
A   0  -1   2  -2   0   2   2   5   8   4   7

Alignment
----GCCAAGTAGG-
ACGAGC---GTATGA
Figure 8. Alignment with no gap penalty at ends (first row and column initiated with 0), using the same scoring
parameters. The alignment is different from that of Figure 5.
LOCAL ALIGNMENTS
The Smith-Waterman algorithm
The dynamic programming algorithm can also be used for finding local sequence
similarities. The Smith and Waterman algorithm [4] is very similar to the NWS method except
that a calculated negative number for a matrix position is replaced by zero, indicating that no
sequence similarity has been detected up to that point. When all the matrix elements have
been calculated, the maximum number in the matrix is located, and the alignment is traced
back from this point until the first positive number. The same sequences as above aligned
with these assumptions are shown in Figure 9. Once more, a different alignment is obtained
with the simple test sequence. Only the aligned parts of the sequences are shown as output of
this procedure.
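A minimal Python sketch of the local variant is given below. It follows the description above: negative values are replaced by zero, the maximum of the matrix is located, and the trace-back stops at the first zero. The scoring values are those of the nucleotide example, and the function name is hypothetical. When several paths give the maximum score, the sketch reports only one of them, which need not be the one shown in the figure.

```python
def smith_waterman(a, b, match=3, mismatch=-1, gap_open=-3, gap_per_residue=-1):
    """Return (best score, aligned part of a, aligned part of b)."""

    def w(k):
        return gap_open + gap_per_residue * k

    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            value = max(0, H[i - 1][j - 1] + s)  # never below zero
            for k in range(1, j + 1):
                value = max(value, H[i][j - k] + w(k))
            for k in range(1, i + 1):
                value = max(value, H[i - k][j] + w(k))
            H[i][j] = value
            if value > best:
                best, best_pos = value, (i, j)

    # Trace back from the maximum until an element with value 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
            continue
        k = next((k for k in range(1, j + 1) if H[i][j] == H[i][j - k] + w(k)), 0)
        if k:
            out_a.extend("-" * k)
            out_b.extend(reversed(b[j - k:j]))
            j -= k
            continue
        k = next(k for k in range(1, i + 1) if H[i][j] == H[i - k][j] + w(k))
        out_a.extend(reversed(a[i - k:i]))
        out_b.extend("-" * k)
        i -= k
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, part_a, part_b = smith_waterman("GCCAAGTAGG", "ACGAGCGTATGA")
print(score)
print(part_a)
print(part_b)
```

Only the segment between the zero where the path starts and the maximum is returned; the residues outside it are not part of the local alignment.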
    Δ   G   C   C   A   A   G   T   A   G   G
Δ   0   0   0   0   0   0   0   0   0   0   0
A   0   0   0   0   3   3   0   0   3   0   0
C   0   0   3   3   0   2   2   0   0   2   0
G   0   3   0   2   2   0   5   1   0   3   5
A   0   0   2   0   5   5   1   4   4   0   2
G   0   3   0   1   1   4   8   4   3   7   3
C   0   0   6   3   1   0   4   7   3   3   6
G   0   3   2   5   2   0   3   3   6   6   6
T   0   0   2   1   4   1   2   6   2   5   5
A   0   0   0   1   4   7   3   2   9   5   4
T   0   0   0   0   0   3   6   6   5   8   4
G   0   3   0   0   0   2   6   5   5   8  11
A   0   0   2   0   3   3   2   5   8   4   7

Alignment
-CAAG--TAGG-
-CGAGCGTATG-
Figure 9. Local alignment with no negatives. The alignment is different from the above examples and
corresponds only to the positive numbers in the matrix ending up with the maximum number.
Figure 10 shows the alignment of two short segments of sequence, in this case of a
protein. A segment of six amino acids gives a total maximum score of 15 using the
BLOSUM62 matrix and is presented as the output of the local alignment.
    Δ   C   A   V   D   N   T   A   H   L   M   Q
Δ   0   0   0   0   0   0   0   0   0   0   0   0
A   0   0   4   0   0   0   0   4   0   0   0   0
R   0   0   0   1   0   0   0   0   4   0   0   0
G   0   0   0   0   0   0   0   0   0   0   0   0
I   0   0   0   3   0   0   0   0   0   2   1   0
D   0   0   0   0   9   1   0   0   0   0   0   3
Q   0   0   0   0   2   9   0   0   0   0   0   5
T   0   0   0   0   0   2  14   0   0   0   0   0
V   0   0   0   4   0   0   2  14   0   1   1   0
N   0   0   0   0   5   6   0   0  15   0   0   1
T   0   0   0   0   0   5  11   0   0  14   0   0
G   0   0   0   0   0   0   3  11   0   0  11   0
T   0   0   0   0   0   0   5   3   9   0   0  10
I   0   0   0   3   0   0   0   4   0  11   1   0
C   0   9   0   0   0   0   0   0   1   0  10   0
K   0   0   8   0   0   0   0   0   0   0   0  11
Alignment
VDNTAH
IDQTVN
Figure 10. Local alignment of two amino acid sequences using the BLOSUM62 scoring matrix. With a gap
penalty of 12 and a gap length penalty of 4 (normal for this scoring matrix) there are no gaps found. The total
score is 15 for the alignment of these 6 residues.
PROCEDURES FOR DATABASE SEARCHES
FASTA
For efficient database searches, the above methods are not sufficiently fast. Another
type of alignment method initially searches for local similarities and then extends them.
One such algorithm, used in the programs FASTP and FASTA, was developed by Pearson and
Lipman [5, 6]. The basis of this method is to store sequence information from the database in
lookup tables and compare the search sequence with these tables. The lookup tables and the
sequence are divided into words of a specified length. For protein sequences, the position of
amino acid pairs (word length=2) or single residues (word length=1) in all the sequences in
the database is tabulated. The search sequence is compared to these tables, identities are
located and an offset is calculated between the positions in the database sequence and the
search sequence. For example, the sequence LAQSHATKHKIP would give a table showing
alanines at positions 2 and 6 and threonine at position 7. Using the search sequence
VEKFAVNHITRKIS, one finds that the alanine at position 5 gives offsets of –3 and 1 (hits at
position 2 and 6), and the threonine at position 10 gives an offset of –3 (hit at position 7).
Going through the search sequence, all position offsets are counted. In this example, a region
AVNHIT has three identities with a common offset of –3. This first part of the procedure is
similar to locating diagonals by the Dotplot procedure. The best local matches are used as
starting points for the further steps in the procedure. The regions containing these similarities
are extended and scores are calculated using a suitable scoring matrix. Finally, the regions
with the highest scores are used for alignment and scoring using a dynamic programming
algorithm to allow for gaps. The score is given as number of standard deviations from the
average score for random sequences of the same length and composition.
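The first step of the procedure, the lookup table and the counting of offsets, can be sketched in Python with the two sequences of the example above. The function names are hypothetical, positions are counted from 1 as in the text, and the later extension and scoring steps are not included.

```python
from collections import Counter, defaultdict


def build_lookup(db_seq, word_len=1):
    """Tabulate the 1-based positions of every word in a database sequence."""
    table = defaultdict(list)
    for start in range(len(db_seq) - word_len + 1):
        table[db_seq[start:start + word_len]].append(start + 1)
    return table


def offset_counts(query, table, word_len=1):
    """Count, for every offset (database position minus query position),
    how many words of the query hit the database sequence."""
    counts = Counter()
    for start in range(len(query) - word_len + 1):
        word = query[start:start + word_len]
        for db_pos in table.get(word, []):
            counts[db_pos - (start + 1)] += 1
    return counts


table = build_lookup("LAQSHATKHKIP")
counts = offset_counts("VEKFAVNHITRKIS", table)
print(table["A"], table["T"])
print(counts.most_common(3))
```

The offset with the highest count corresponds to the diagonal with the most identities, which is the starting point for the extension step. With word_len=2 the same functions tabulate amino acid pairs instead of single residues.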
Z = (similarity score - mean of random scores) / (standard deviation of random scores)
The purpose of the score is to select the best alignment and to indicate whether it
represents a non-random similarity. The distribution of scores does not follow what the
statisticians call a normal distribution, but rather a more skewed distribution (cf Figure 2).
The score is therefore also translated into an expectation value, E, which gives the number of
similarities with at least this score that are expected to arise at random in the search.
The problem with this method is the sensitivity. If amino acid pairs are used for the
initial search, related sequences lacking two consecutive identities in a region of homology
will be missed. Using single residues will increase the sensitivity but increase the computing
time.
BLAST
The most used algorithm for database searches is BLAST, developed by Altschul et
al. [7]. This program uses an approach similar to that of the FASTA algorithm, but the initial
search does not use identities. Instead, it tries to identify short segments of a specified length
(words) that have similarities with scores above a suitable threshold. For a protein, a typical length of these
words might be three amino acids, and the score is calculated using a substitution matrix from
the PAM or BLOSUM series. Since some amino acids mutate more readily than others, there
are some words that will give comparatively low scores also for a complete match, for
example combinations of L, A, V, I and S using the BLOSUM62 matrix. To save time, only
words in the database sequences that are scoring above a certain threshold with a word in the
search sequence will be used for further analysis. This threshold is chosen in a way that will
pick up as many as possible of sequences that are significantly similar without increasing the
computing time too much. The hits will be extended to locate segments of similarities without
gaps, and segments with scores above a certain threshold (high scoring pairs) are reported.
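The remark that some words score low even for a complete match can be checked with the diagonal of the BLOSUM62 matrix in Figure 1. The Python sketch below only scores words against themselves; the threshold of 13 and the test sequence are arbitrary illustrations, and the sketch does not generate the similar (non-identical) words that a real BLAST search also considers.

```python
# Diagonal of the BLOSUM62 matrix (Figure 1): score of each residue with itself.
BLOSUM62_DIAGONAL = {
    "A": 4, "R": 5, "N": 6, "D": 6, "C": 9, "Q": 5, "E": 5, "G": 6, "H": 8,
    "I": 4, "L": 4, "K": 5, "M": 5, "F": 6, "P": 7, "S": 4, "T": 5, "W": 11,
    "Y": 7, "V": 4,
}


def self_score(word):
    """Score of a word aligned with an identical word."""
    return sum(BLOSUM62_DIAGONAL[residue] for residue in word)


def words_above_threshold(sequence, word_len=3, threshold=13):
    """List the words of a sequence whose complete-match score reaches the threshold."""
    words = [sequence[i:i + word_len] for i in range(len(sequence) - word_len + 1)]
    return [word for word in words if self_score(word) >= threshold]


print(self_score("LAV"), self_score("WCH"))
print(words_above_threshold("LAVWCH"))
```

With this assumed threshold, a word such as LAV cannot produce a hit even when it is perfectly conserved, whereas words containing tryptophan, cysteine or histidine pass easily.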
To reduce the time of computing for the rapidly increasing databases, recent
modifications of the BLAST procedure require a combination of two hits, within a window of
suitable length but without intervening gaps, before extending a segment [8]. Since two hits are
required, the selectivity is increased, but the threshold for hits can be lowered to increase the
sensitivity without increasing the computing time.
In all these searches, the aim is to pick up all related sequences without finding
unrelated sequences. In this respect, the BLAST procedure seems to be similar in performance
to the more general dynamic programming algorithm.
The output of the program gives a list of hits sorted according to the calculated score.
It also gives the probability for the alignment being a random hit. The probability of a score
being non-random will depend on the composition of the sequence and the scoring matrix and
gap penalty, which can be used to calculate the parameters described in the legend to Figure
2. The value used in BLAST searches is the E value. This number gives the number of
random similarities with this score that is expected in the search. Compared to the probability
P described above it is thus multiplied with the size of the database. Values above 0.02 are by
default regarded as representing random similarities, but might of course still represent true
homologues with very low sequence similarity.
PSI-BLAST
The most sensitive methods for finding similarities between sequences do not rely on a
single sequence but try to use the information from several similar sequences. This approach is
also used in PSI-BLAST (Position-Specific Iterated BLAST). The first step of this procedure
is a normal BLAST search using the search sequence. If this search gives some significant
hits, these sequences are aligned to the search sequence and used to construct a sequence
profile. This profile contains the sequence information from several related proteins and is
used for another BLAST search of the database. This search might pick up further sequences.
New hits are again aligned to construct an improved profile and the procedure can be
continued until no more hits are found. The procedure can be completely automatic, or it can
allow manual intervention to change the threshold or modify the list of sequences that are
used to construct the profile.
PSI-BLAST is often able to pick up relationships that are not detected with the other
programs. In contrast to other sensitive procedures like the HMM methods described below,
PSI-BLAST finds homologous sequences and aligns them automatically to increase the
sensitivity of the search. The procedure can therefore be used without any prior knowledge
about possible homologues in the database.
MULTIPLE ALIGNMENTS
The alignment of many homologous sequences is in many cases valuable.
Phylogenetic analysis of sequence data depends on multiple alignments and new sequences
can be aligned more accurately when the alignment is based on the pattern of conservation
from already aligned sequences. The pattern of conserved residues is also important for the
functional characterization of a protein.
The dynamic programming algorithm used for pairwise alignments in principle can be
extended to align multiple sequences. The computing time needed for this approach increases
enormously with the number of sequences, and this procedure can therefore only be used for a
very limited number of sequences.
Hierarchical methods: Clustal and T-Coffee
The most common procedure for multiple sequence alignment uses hierarchical
methods. This is the procedure used in the traditionally used program, CLUSTALW. In these
methods, alignments of all pairs of sequences are made first using the dynamic programming
algorithm and a similarity score is calculated (Figure 11).
1 x 2   4.05      2 x 6   0.76      4 x 7   1.78
1 x 3   0.68      2 x 7   0.62      4 x 8   0.40
1 x 4   0.85      2 x 8   0.35      5 x 6   1.56
1 x 5   0.86      3 x 4   3.99      5 x 7   1.48
1 x 6   0.70      3 x 5   3.76      5 x 8   0.34
1 x 7   0.55      3 x 6   1.82      6 x 7   3.67
1 x 8   0.30      3 x 7   1.52      6 x 8   0.15
2 x 3   1.05      3 x 8   0.25      7 x 8   0.07
2 x 4   1.03      4 x 5   3.88
2 x 5   1.10      4 x 6   1.74
Figure 11. The list of sequence similarities used to decide on the order of the stepwise alignment.
The sequences are then grouped according to their similarities into a tree (hierarchical
cluster analysis), see Figure 12. Finally, starting with the most similar pairs, in this case
myg_human and myg_chick (1 and 2), the sequences are aligned using dynamic
programming. Other similar pairs are also aligned, and these groups of aligned sequences are
aligned to the others until all sequences have been added. The aligned sequences are output as
well as the cluster analysis, but these procedures normally do not include any statistical
analysis of the significance of the alignment.
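How a list of pairwise similarities can be turned into an order of alignment is sketched below in Python, using the values of Figure 11. The grouping rule used here, joining the two groups with the highest average similarity, is an assumption chosen because it is simple; the text does not state which clustering rule the program applies, and the function names are hypothetical.

```python
from itertools import combinations

# Pairwise similarity scores from Figure 11, keyed by sequence numbers.
SIMILARITY = {
    (1, 2): 4.05, (1, 3): 0.68, (1, 4): 0.85, (1, 5): 0.86, (1, 6): 0.70,
    (1, 7): 0.55, (1, 8): 0.30, (2, 3): 1.05, (2, 4): 1.03, (2, 5): 1.10,
    (2, 6): 0.76, (2, 7): 0.62, (2, 8): 0.35, (3, 4): 3.99, (3, 5): 3.76,
    (3, 6): 1.82, (3, 7): 1.52, (3, 8): 0.25, (4, 5): 3.88, (4, 6): 1.74,
    (4, 7): 1.78, (4, 8): 0.40, (5, 6): 1.56, (5, 7): 1.48, (5, 8): 0.34,
    (6, 7): 3.67, (6, 8): 0.15, (7, 8): 0.07,
}


def group_similarity(group_a, group_b):
    """Average similarity over all pairs with one member in each group."""
    total = sum(SIMILARITY[(min(x, y), max(x, y))] for x in group_a for y in group_b)
    return total / (len(group_a) * len(group_b))


def merge_order(n_sequences=8):
    """Repeatedly join the two most similar groups; return the joins in order."""
    groups = [(i,) for i in range(1, n_sequences + 1)]
    merges = []
    while len(groups) > 1:
        a, b = max(combinations(groups, 2), key=lambda pair: group_similarity(*pair))
        merges.append(tuple(sorted((a, b))))
        groups.remove(a)
        groups.remove(b)
        groups.append(tuple(sorted(a + b)))
    return merges


for step in merge_order():
    print(step)
```

The first join is the pair 1 and 2, in agreement with the text. The later joins depend on the grouping rule and could differ from the dendrogram if another rule is used.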
Figure 12. Dendrogram showing the similarity between eight globin sequences (Hemoglobin chains α, β and γ
from chicken and man, myoglobin from man and chicken, and the Scapharca inequivalvis hemoglobin). Scoring
matrix Blosum62, using a gap penalty of 12 and a gap length penalty of 4.
myg_human   ~~~~~~~~GL SDGEWQLVLN VWGKVEADIP GHGQEVLIRL FKGHPETLEK FDKFKHLKSE
myg_chick   ~~~~~~~~GL SDQEWQQVLT IWGKVEADIA GHGHEVLMRL FHDHPETLDR FDKFKGLKTP
hbb_human   ~~~~~~~VHL TPEEKSAVTA LWGKV..NVD EVGGEALGRL LVVYPWTQRF FESFGDLSTP
hbg_human   ~~~~~~~GHF TEEDKATITS LWGKV..NVE DAGGETLGRL LVVYPWTQRF FDSFGNLSSA
hbb_chick   ~~~~~~~VHW TAEEKQLITG LWGKV..NVA ECGAEALARL LIVYPWTQRF FASFGNLSSP
hba_human   ~~~~~~~~VL SPADKTNVKA AWGKVGAHAG EYGAEALERM FLSFPTTKTY FPHF.DLSH.
hba_chick   ~~~~~~~~VL SAADKNNVKG IFTKIAGHAE EYGAETLERM FTTYPPTKTY FPHF.DLSH.
glb1_scain  PSVYDAAAQL TADVKKDLRD SWKVIGSDKK GNGVALMTTL FADNQETIGY FKRLGDVSQG

myg_human   DEMKASEDLK KHGATVLTAL GGILKKKGHH EAEIKPLAQS HATKHKIPVK YLEFISECII
myg_chick   DQMKGSEDLK KHGATVLTQL GKILKQKGNH ESELKPLAQT HATKHKIPVK YLEFISEVII
hbb_human   DAVMGNPKVK AHGKKVLGAF SDGLAHLDNL KGTFATLSEL HCDKLHVDPE NFRLLGNVLV
hbg_human   SAIMGNPKVK AHGKKVLTSL GDAIKHLDDL KGTFAQLSEL HCDKLHVDPE NFKLLGNVLV
hbb_chick   TAILGNPMVR AHGKKVLTSF GDAVKNLDNI KNTFSQLSEL HCDKLHVDPE NFRLLGDILI
hba_human   ....GSAQVK GHGKKVADAL TNAVAHVDDM PNALSALSDL HAHKLRVDPV NFKLLSHCLL
hba_chick   ....GSAQIK GHGKKVVAAL IEAANHIDDI AGTLSKLSDL HAHKLRVDPV NFKLLGQCFL
glb1_scain  ...MANDKLR GHSITLMYAL QNFIDQLDN. PDDLVCVVEK FAVNHITRKI SAAEFGKING

myg_human   QVLQSKHPGD FGADAQGAMN KALELFRKDM ASNYKELGFQ G
myg_chick   KVIAEKHAAD FGADSQAAMK KALELFRNDM ASKYKEFGFQ G
hbb_human   CVLAHHFGKE FTPPVQAAYQ KVVAGVANAL AHKYH~~~~~ ~
hbg_human   TVLAIHFGKE FTPEVQASWQ KMVTAVASAL SSRYH~~~~~ ~
hbb_chick   IVLAAHFSKD FTPECQAAWQ KLVRVVAHAL ARKYH~~~~~ ~
hba_human   VTLAAHLPAE FTPAVHASLD KFLASVSTVL TSKYR~~~~~ ~
hba_chick   VVVAIHHPAA LTPEVHASLD KFLCAVGTVL TAKYR~~~~~ ~
glb1_scain  PIKKVLASKN FGDKYANAWA KLVAVVQAAL ~~~~~~~~~~ ~
Figure 13. The aligned sequences of the eight globin chains.
ClustalW might perform poorly with distantly related proteins. The main reason is that
it is unable to modify the pairwise alignments based on the information from the rest of the
sequences. A more recent method, T-Coffee [9], is basically a hierarchical method, but by
comparing primary pairwise alignments it can generate information about all alignments that
is then used for the final hierarchical alignment. There are servers that allow users to align
their sequences with T-Coffee, for example at http://services.bioasp.nl/blast/cgi-bin/tcoffee.cgi.
Profile methods and HMM
The hierarchical method used in Clustal uses the normal scoring matrices described for
the pairwise alignments. When several sequences are aligned, the pattern of conservation and
variation in the sequences is not used at all to improve the alignment. For example, three
residues (Phe, His, Lys shown in bold letters) are conserved in all the globins in the alignment
above. This suggests that in this group of proteins these residues are conserved for structural
or functional reasons. When aligning a distantly related globin, it seems reasonable to find an
alignment that would put these residues at the corresponding positions. When a family of
proteins is aligned, the probability for the occurrence of a certain residue (or a gap or an
insertion) at each position in the sequence can be estimated. This can be used for the
alignment of additional sequences. Alignment by hand might be based on implicit use of this
information. There are profile methods, where the scoring in some way is based on the
probability of finding a certain residue at that position [10]. In the simplest approach, the score
for placing a residue at a position is based on the pairwise score between this and all the
residues observed at this position. This position-specific score takes into account the
distribution of amino acid residues found at this position. If a residue is highly conserved,
however, a mismatch at such a position does not necessarily give a lower score than at a
variable position.
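The simplest approach mentioned above, scoring a residue against all residues observed at a position, can be sketched in Python as follows. Taking the plain average of the pairwise scores is an assumption made for illustration. The few pair scores needed are copied from the BLOSUM62 matrix in Figure 1, and the example column is hypothetical.

```python
# A small excerpt of the BLOSUM62 matrix (Figure 1), keyed by sorted residue pairs.
PAIR_SCORES = {
    ("I", "I"): 4, ("I", "V"): 3, ("I", "L"): 2,
    ("D", "V"): -3, ("D", "I"): -3, ("D", "L"): -4,
}


def pair_score(x, y):
    """Look up the score of a residue pair irrespective of the order."""
    return PAIR_SCORES[tuple(sorted((x, y)))]


def column_score(residue, column):
    """Average pairwise score between a residue and all residues in a column."""
    return sum(pair_score(residue, observed) for observed in column) / len(column)


column = "VVIL"  # hypothetical column of an alignment of four sequences
print(column_score("I", column))
print(column_score("D", column))
```

Placing an isoleucine in this hydrophobic column gives a clearly positive score, while an aspartate gives a negative one. The pairwise matrix is still the only source of information, which is why a mismatch at a strictly conserved position is not penalized more heavily than elsewhere.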
One method that explicitly uses the information about the substitution pattern in a
specific protein family is based on so-called hidden Markov models (HMM), which are also used
to assess the significance of the alignment 11. This method uses a statistical treatment of sequence
information from a family of proteins. A hidden Markov model is a probabilistic model that
has been widely used in speech recognition. In the case of sequences, the HMM is ideally able
to generate new probable sequences based on observed sequences. This is normally not very
interesting, but the model can also estimate the probability that a search sequence belongs to the
protein family.
In the hidden Markov model, there are three types of states (Figure 14). In one of them, the match
state, a probability for finding a certain residue at that position is stored. This probability is
called the emission probability. For each position, the probabilities of moving into the two other
types of state, the insertion and deletion states, are also stored.
In a first step, the topology (the number of match, insert and delete states) and the
parameters of the HMM have to be defined. This can be done in two ways. One is to use a
number of sequences aligned by other means to define the probabilities. The other method is
to use the sequences of many homologous proteins to “train” the HMM. The “training” means
that the parameters in the HMM are automatically adjusted so that the calculated
probability of the set of training sequences, including gaps and insertions, is maximized.
The HMM can then be used to align new sequences and give statistical measures of how well
they fit to the model (the probability that the HMM will “emit” that sequence). The most
important feature of this procedure is thus that it not only suggests an alignment, but also estimates the
probability that the aligned sequence is related to the group of sequences. In this way, one can
find homology between sequences where the similarities are smaller than the level that
normally is required to establish an evolutionary relation.
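The first way of defining the parameters, from sequences already aligned by other means, can be illustrated with a deliberately reduced sketch. It estimates only the match-state emission probabilities, ignores insert and delete states and all transition probabilities, and adds a pseudocount so that unseen residues do not get probability zero. The function names, the pseudocount of 1 and the use of a log-odds score against a uniform background are assumptions made for this illustration, not part of the method as described above.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def match_emissions(alignment, pseudocount=1.0):
    """Estimate per-column emission probabilities from aligned sequences.

    Gap characters are ignored. The pseudocount is added to every residue
    count so that no emission probability is exactly zero.
    """
    emissions = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        emissions.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return emissions

def log_odds(sequence, emissions, background=1 / 20):
    """log2 of P(sequence | match states) / P(sequence | uniform background).

    Only ungapped sequences with one residue per match state are handled.
    """
    if len(sequence) != len(emissions):
        raise ValueError("sequence length must equal the number of match states")
    return sum(math.log2(e[r] / background) for r, e in zip(sequence, emissions))

# Hypothetical training alignment.
training = ["FKH", "FRH", "FKH", "FKH"]
model = match_emissions(training)
print(log_odds("FKH", model))   # positive: fits the family better than background
print(log_odds("WWW", model))   # negative: fits the family worse than background
```

A positive score means the sequence is more probable under the family model than under the background, which is the kind of statistical measure of fit referred to in the text.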
Figure 14. A simple illustration of a Hidden Markov Model. There is a Begin and an End state. The M stands for
match state, I for insert state and D for delete state. A probability is associated with each arrow based on the
training sequences. The probabilities on the arrows leaving each state sum to 1 (see cover picture). The I and M
states are emitting residues, I with a basic probability for any residue, and M with the position-specific
probabilities defined from the training sequences. Only two nodes are shown. Each node corresponds to one
residue in the sequence of the HMM. In a more normal case, the number of match states would be the same as
the length of the proteins in the family. Any sequence can be generated in many ways using the HMM, but one
path will give the highest probability.
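The last point of the caption, that one sequence can be generated along many paths of which one is the most probable, can be made concrete with a small sketch. The two-node model below is hypothetical: all transition and emission probabilities are invented for illustration, the Begin state is treated like a match state of node 0, and the paths are listed by brute force. Real programs find the best path with dynamic programming (the Viterbi algorithm) rather than by enumeration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical two-node model: node 1 favours Phe (F), node 2 favours His (H).
FAVOURED = ["F", "H"]
P_FAVOURED = 0.81                    # match emission of the favoured residue
P_OTHER = (1 - P_FAVOURED) / 19      # shared equally by the other 19 residues
P_INSERT = 1 / len(AMINO_ACIDS)      # insert states emit any residue equally

# Transition probabilities by type of the current state; each row sums to 1.
TRANS = {
    "M": {"M": 0.8, "I": 0.1, "D": 0.1},
    "I": {"M": 0.6, "I": 0.3, "D": 0.1},
    "D": {"M": 0.8, "I": 0.1, "D": 0.1},
}

def match_emission(node, residue):
    return P_FAVOURED if residue == FAVOURED[node - 1] else P_OTHER

def enumerate_paths(seq, state="M", node=0, pos=0):
    """Return (probability, path) for every path emitting seq[pos:] and reaching End."""
    t = TRANS[state]
    last = len(FAVOURED)
    found = []
    if pos < len(seq):
        # The insert state of the current node emits seq[pos].
        for p, path in enumerate_paths(seq, "I", node, pos + 1):
            found.append((t["I"] * P_INSERT * p, [f"I{node}"] + path))
    if node < last:
        if pos < len(seq):
            # The match state of the next node emits seq[pos].
            e = match_emission(node + 1, seq[pos])
            for p, path in enumerate_paths(seq, "M", node + 1, pos + 1):
                found.append((t["M"] * e * p, [f"M{node + 1}"] + path))
        # The delete state of the next node emits nothing.
        for p, path in enumerate_paths(seq, "D", node + 1, pos):
            found.append((t["D"] * p, [f"D{node + 1}"] + path))
    elif pos == len(seq):
        # After the last node only End or an insert can follow, so the M and D
        # transition probabilities are pooled into the transition to End.
        found.append((t["M"] + t["D"], ["End"]))
    return found

def best_path(seq):
    return max(enumerate_paths(seq), key=lambda item: item[0])

for seq in ["FH", "FAH"]:
    paths = enumerate_paths(seq)
    prob, path = best_path(seq)
    print(seq, len(paths), "paths; best:", " ".join(path), prob)
```

For the sequence FAH the most probable path in this toy model passes through the insert state of node 1, so the extra residue is treated as an insertion between the two match positions.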
Structural alignment and reliability of alignments
An alignment between two sequences is correct when the residues that are derived
from the same specific residue in the common ancestor are aligned. It is impossible to
reconstruct the exact evolutionary history of a sequence, and we therefore must realize that
our alignments might be at least partially incorrect. In the simple cases when the similarities
are high, we have good reasons to assume that they are correct, but for weaker similarities this
is not possible.
The structure of a protein is more conserved than the sequence, and the most reliable
alignment can be obtained when the conformations of the proteins are known. In this case,
superposition of the structural models of the proteins will give a relatively reliable alignment
in all regions where the protein conformation is similar. These regions are normally the core
and the central parts of the secondary structure elements of the proteins. In surface loops,
insertions and gaps often occur and the conformation is different. In such regions, it is often
impossible to decide where the gap/insertion has occurred and thus which alignment is
correct. The reliability of any alignment will therefore vary along the sequence. In the most
conserved regions, the alignment is reliable, but in less conserved regions where gaps and
insertions are found, one has to realize that the alignment is more or less meaningless. In fact,
even a local sequence similarity in a region of different conformation might be the result of
random mutations rather than conservation of residues.
Comparisons of related proteins have shown that gaps and insertions rarely occur in
central parts of secondary structure elements. When aligning a sequence to the sequence of a
protein with known structure, one can assume that gaps or insertions in the middle of
secondary structure elements are unlikely.
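One way to use this observation, sketched here only as an assumption, is to make the gap opening penalty depend on the position in the known structure. The function name, the one-letter secondary structure codes (H for helix, E for strand, - for loop) and the numerical values are all hypothetical choices for the illustration.

```python
from itertools import groupby

def gap_open_penalties(secondary_structure, base=10.0, core_factor=3.0, edge=1):
    """Return one gap opening penalty per residue of the known structure.

    Residues more than `edge` positions inside a helix (H) or strand (E)
    get `base * core_factor`; all other residues get `base`.
    """
    penalties = []
    for symbol, run in groupby(secondary_structure):
        length = len(list(run))
        for offset in range(length):
            in_core = symbol in "HE" and edge <= offset < length - edge
            penalties.append(base * core_factor if in_core else base)
    return penalties

# Hypothetical secondary structure assignment for a 12-residue protein.
print(gap_open_penalties("--HHHH--EEE-"))
```

An alignment program that accepts position-specific gap penalties could then be given these values, so that gaps are pushed towards loops and the ends of secondary structure elements.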
A little Swedish vocabulary
Alignment                Inpassning
Dotplot graph            Punktdiagram?
Gap                      Lucka
Gap weight or penalty    Luckkostnad
Gap opening penalty      Lucköppningskostnad
Similarity score         Likhetspoäng?
Substitution matrix      Utbytesmatris
References
1. Henikoff, S. & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89, 10915-9.
2. Needleman, S. B. & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48, 443-53.
3. Sellers, P. H. (1974). SIAM J Appl Math 26, 787-.
4. Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences. J Mol Biol 147, 195-197.
5. Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85, 2444-8.
6. Lipman, D. J. & Pearson, W. R. (1985). Rapid and sensitive protein similarity searches. Science 227, 1435-41.
7. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-10.
8. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389-402.
9. Notredame, C., Higgins, D. G. & Heringa, J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302, 205-17.
10. Gribskov, M., McLachlan, A. D. & Eisenberg, D. (1987). Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci U S A 84, 4355-8.
11. Eddy, S. R. (1996). Hidden Markov models. Curr Opin Struct Biol 6, 361-5.