Comparison of Biological Sequences
The Biocomputing Service Group
Types of Sequence Comparison
• Pairwise Alignments
• Multiple Alignments
• Database Searches
Pairwise Sequence Alignment
• Principles of pairwise sequence comparison
• global / local alignments
• scoring systems
• gap penalties
• Methods of pairwise sequence alignment
• window-based methods
• dynamic programming approaches
• Needleman and Wunsch
• Smith and Waterman
• Pairwise alignment programs in HUSAR
Why Sequence Comparison?
The biological basis:
• Many genes and proteins are members of families which have a
similar biochemical function or share a common evolutionary origin.
Sequence comparison is used:
• to define evolutionary relationships.
• to identify conserved patterns.
• when dealing with a sequence of unknown function: to find similar
domains which could imply similar function.
A comparison can be the starting point for
further experimental investigations.
Aligning Sequences….
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact
[Figure: the two sequences drawn repeatedly, placed against each other in different ways]
Aligning Sequences….
• There are lots of possible alignments.
• Two sequences can always be aligned.
• Sequence alignments have to be scored.
• Often there is more than one solution with the same score.
Pairwise Sequence Comparison
• Global Alignments
• Local Alignments
Global Alignment
Two closely related sequences:
GAP (Needleman & Wunsch) creates an end-to-end alignment.
Global Alignment
Two sequences sharing several regions of local similarity:
1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG.... 67
|||||||||||||| |
|
| |||| ||
| |
| ||
1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAGCACTAAAGCGTCAGCGAGACCG 70
Local Alignment
Bestfit (Smith-Waterman) finds the region of best local similarity.
14 TCAGAAGCAGCTAAAGCGT 32
   ||||||||| |||||||||
42 TCAGAAGCA.CTAAAGCGT 59
Local Alignment
Similarity (X. Huang) displays all regions of similarity.
14 TCAGAAGCAGCTAAAGCGT 32
   ||||||||| |||||||||
42 TCAGAAGCA.CTAAAGCGT 59
 1 AGGATTGGAATGCT 14
   ||||||||||||||
 1 AGGATTGGAATGCT 14
39 AGGATTGGAAT 49
   |||||||||||
 1 AGGATTGGAAT 11
62 AGACCG 67
   ||||||
66 AGACCG 71
Human Hemoglobin α- and γ-Chains
• Symbol Comparison Table: PAM250
• Gap opening penalty: 3
• Gap extension penalty: 0.1
• Score: 116
Parameters of Sequence Alignment
Scoring Systems:
• Each symbol pairing is assigned a numerical value,
based on a symbol comparison table.
Gap Penalties:
• Opening: The cost to introduce a gap
• Extension: The cost to elongate a gap
DNA Scoring Systems
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact
     A   G   C   T
A    1   0   0   0
G    0   1   0   0
C    0   0   1   0
T    0   0   0   1
Match: 1
Mismatch: 0
Score = 5
DNA Scoring Systems
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact
Negative scoring values to penalize mismatches:
     A   T   C   G
A    5  -4  -4  -4
T   -4   5  -4  -4
C   -4  -4   5  -4
G   -4  -4  -4   5
Matches: 5
Mismatches: 19
Score: 5 x 5 + 19 x (-4) = -51
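The two scores on the DNA scoring slides can be reproduced with a short sketch. It assumes an ungapped alignment in which Sequence 1 overhangs Sequence 2 by four bases on the left. That placement is not stated in the slides; it is simply one placement that yields 5 matches and 19 mismatches. The function name and its parameters are illustrative.

```python
def score_ungapped(seq1, seq2, offset, match, mismatch):
    """Score the overlap of seq1[offset:] laid over the start of seq2 (no gaps)."""
    pairs = list(zip(seq1[offset:], seq2))  # zip stops at the shorter sequence
    matches = sum(a == b for a, b in pairs)
    mismatches = len(pairs) - matches
    return matches, mismatches, matches * match + mismatches * mismatch


seq1 = "actaccagttcatttgatacttctcaaa"
seq2 = "taccattaccgtgttaactgaaaggacttaaagact"

# Assumed placement: seq1 starts four bases before seq2.
identity = score_ungapped(seq1, seq2, offset=4, match=1, mismatch=0)
penalized = score_ungapped(seq1, seq2, offset=4, match=5, mismatch=-4)
print(identity)   # (5, 19, 5)
print(penalized)  # (5, 19, -51)
```

The same alignment scores 5 under the identity table and -51 once mismatches are penalized, which is why the choice of scoring table matters when alignments are compared.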
Protein Scoring Systems
Sequence 1: PTHPLASKTQILPEDLASEDLTI
Sequence 2: PTHPLAGERAIGLARLAEEDFGM
Scoring matrix
     C   S   T   P   A   G   N   D
C    9
S   -1   4
T   -1   1   5
P   -3  -1  -1   7
A    0   1   0  -1   4
G   -3   0  -2  -2   0   6
N   -3   1   0  -2  -2   0   5
D   -3   0  -1  -1  -2  -1   1   6
.
.
T:G = -2
T:T = 5
Score = 48
Protein Scoring Systems
• Amino acids have different biochemical and physical properties
that influence their relative replaceability in evolution.
[Figure: Venn diagram of the amino acids grouped by property (tiny, small, aliphatic, hydrophobic, aromatic, polar, charged, positive); cysteine is shown in two forms, labelled C S-S and C SH]
Protein Scoring Systems
• Amino acids have different biochemical and physical properties
that influence their relative replaceability in evolution.
• Scoring matrices reflect
• probabilities of mutual substitutions
• the probability of occurrence of each amino acid.
• Widely used scoring matrices:
• PAM
• BLOSUM
PAM (Percent Accepted Mutations) matrices
• Derived from global alignments of protein families. Family members
share at least 85% identity (Dayhoff et al., 1978).
• Construction of phylogenetic tree and ancestral sequences of
each protein family
• Computation of number of replacements for each pair of amino acids
PAM (Percent Accepted Mutations) matrices
• The numbers of replacements were used to compute a so-called
PAM-1 matrix.
• The PAM-1 matrix reflects an average change of 1% of all amino
acid positions. PAM matrices for larger evolutionary distances can
be extrapolated from the PAM-1 matrix.
• PAM250 = 250 mutations per 100 residues.
• Greater numbers mean bigger evolutionary distance.
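The extrapolation from PAM-1 to larger distances is usually described as repeated multiplication of the PAM-1 mutation probability matrix with itself. The sketch below illustrates only that step, on a hypothetical two-letter alphabet with made-up probabilities. It does not use Dayhoff's data and it skips the conversion of probabilities into log-odds scores.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a positive integer power by repeated multiplication."""
    result = m
    for _ in range(power - 1):
        result = mat_mul(result, m)
    return result


# Toy "PAM-1" for a hypothetical two-letter alphabet: each letter changes
# with probability 0.01 per step (1% change). These are NOT real Dayhoff values.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)

for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Each row of the result still sums to 1, but the diagonal has dropped from 0.99 to close to 0.5: after 250 steps the toy alphabet retains very little information about the original letter.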
PAM 250
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
A    2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   2   1
R   -2   6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2   1   2
N    0   0   2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   4   3
D    0  -1   2   4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   5   4
C   -2  -4  -4  -5  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -3  -4
Q    0   1   1   2  -5   4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   3   5
E    0  -1   1   3  -5   2   4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   4   5
G    1  -3   0   1  -3  -1   0   5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   2   1
H   -1   2   2   1  -3   3   1  -2   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   3   3
I   -1  -2  -2  -2  -2  -2  -2  -3  -2   5   2  -2   2   1  -2  -1   0  -5  -1   4  -1  -1
L   -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6  -3   4   2  -3  -3  -2  -2  -1   2  -2  -1
K   -1   3   1   0  -5   1   0  -2   0  -2  -3   5   0  -5  -1   0   0  -3  -4  -2   2   2
M   -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6   0  -2  -2  -1  -4  -2   2  -1   0
F   -3  -4  -3  -6  -4  -5  -5  -5  -2   1   2  -5   0   9  -5  -3  -3   0   7  -1  -3  -4
P    1   0   0  -1  -3   0  -1   0   0  -2  -3  -1  -2  -5   6   1   0  -6  -5  -1   1   1
S    1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2   1  -2  -3  -1   2   1
T    1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3  -5  -3   0   2   1
W   -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17   0  -6  -4  -4
Y   -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10  -2  -2  -3
V    0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4   0   0
B    2   1   4   5  -3   3   4   2   3  -1  -2   2  -1  -3   1   2   2  -4  -2   0   6   5
Z    1   2   3   4  -4   5   5   1   3  -1  -1   2   0  -4   1   1   1  -4  -3   0   5   6
Highlighted on the slide: C:W = -8 and W:W = 17
BLOSUM (Blocks Substitution Matrix)
• Derived from alignments of domains of distantly related
proteins (Henikoff & Henikoff, 1992).
• Occurrences of each amino acid pair
in each column of each block alignment
are counted.
• The numbers derived from all blocks were
used to compute the BLOSUM matrices.
Example: one block column containing A, A, C, E, C gives the pair counts
A-C = 4
A-E = 2
C-E = 2
A-A = 1
C-C = 1
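The pair counting for a single block column can be sketched as follows. The function name is illustrative, and only the raw counting step is shown, not the clustering or the conversion of counts into matrix scores.

```python
from collections import Counter
from itertools import combinations


def count_pairs(column):
    """Count unordered residue pairs among all sequences in one block column."""
    counts = Counter()
    for a, b in combinations(column, 2):
        counts["-".join(sorted((a, b)))] += 1
    return counts


pairs = count_pairs("AACEC")  # the column A, A, C, E, C from the slide
print(dict(pairs))
```

For a column of five residues there are 10 pairs in total, which matches the sum of the counts on the slide (4 + 2 + 2 + 1 + 1).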
BLOSUM (Blocks Substitution Matrix)
• Sequences within blocks are clustered according to their level of identity.
• Clusters are counted as a single sequence.
• Different BLOSUM matrices differ in the percentage of sequence identity
used in clustering.
• The number in the matrix name (e.g. 62 in BLOSUM62) refers to the
percentage of sequence identity used to build the matrix.
• Greater numbers mean smaller evolutionary distance.
TIPS on choosing a scoring matrix
• Generally, BLOSUM matrices perform better than PAM matrices
for local similarity searches (Henikoff & Henikoff, 1993).
• When comparing closely related proteins one should use lower
PAM or higher BLOSUM matrices, for distantly related proteins
higher PAM or lower BLOSUM matrices.
• For database searching the commonly used matrix is BLOSUM62.
Scoring Insertions and Deletions
  A T G T A A T G C A
T A T G T G G A A T G A
  A T G T - - A A T G C A
T A T G T G G A A T G A
insertion / deletion
The creation of a gap is penalized with a negative score value.
Why Gap Penalties?
Match = 5
Mismatch = -4
Gaps not permitted
Score: 10
1 GTGATAGACACAGACCGGTGGCATTGTGG 29
  |||   |  | |||    |   || || |
1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29
Gaps allowed but not penalized
Score: 88
1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29
  ||| || | | | |||  ||  |  |  || || |
1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29
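Both scores can be checked with a small column-by-column scorer. This is a sketch under the slide's parameters (match = 5, mismatch = -4, gaps free); the function name and the `gap` parameter are illustrative additions.

```python
def score_alignment(row1, row2, match=5, mismatch=-4, gap=0, gap_char="."):
    """Column-by-column score of two equally long alignment rows."""
    assert len(row1) == len(row2)
    score = 0
    for a, b in zip(row1, row2):
        if a == gap_char or b == gap_char:
            score += gap          # every gap column costs `gap` (0 = not penalized)
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


ungapped = score_alignment("GTGATAGACACAGACCGGTGGCATTGTGG",
                           "GTGTCGGGAAGAGATAACTCCGATGGTTG")
gapped = score_alignment("GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG",
                         "GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG")
print(ungapped, gapped)  # 10 88
```

Calling `score_alignment` on the gapped rows with a negative `gap` value shows how quickly the inflated score shrinks once the 12 gap columns carry a cost.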
Why Gap Penalties?
• The optimal alignment of two similar sequences is usually that which
• maximizes the number of matches and
• minimizes the number of gaps.
• Permitting the insertion of arbitrarily many gaps can lead to high
scoring alignments of non-homologous sequences.
• Penalizing gaps forces alignments to have relatively few gaps.
Gap Penalties
Linear gap penalty score:
γ(g) = -g d
Affine gap penalty score:
γ(g) = -d - (g - 1) e
γ(g) = gap penalty score of a gap of length g
d = gap opening penalty
e = gap extension penalty
g = gap length
Scoring Insertions and Deletions
match = 1
mismatch = 0
Without a gap: Total Score: 4
  A T G T T A T A C
T A T G T G C G T A T A
Gap parameters:
d = 3 (gap opening)
e = 0.1 (gap extension)
g = 3 (gap length)
γ(g) = -3 - (3 - 1) x 0.1 = -3.2
With a gap (insertion / deletion): Total Score: 8 - 3.2 = 4.8
  A T G T - - - T A T A C
T A T G T G C G T A T A
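The two gap penalty formulas and the worked total can be written out directly. The function names are illustrative; the numbers are the ones given on the slide (d = 3, e = 0.1, g = 3, eight matches of score 1).

```python
def linear_gap(g, d):
    """Linear gap penalty: every gap position costs d."""
    return -g * d


def affine_gap(g, d, e):
    """Affine gap penalty: d for opening, e for each further position."""
    return -d - (g - 1) * e


penalty = affine_gap(g=3, d=3, e=0.1)
total = 8 * 1 + penalty          # eight matches, one gap of length 3
print(round(penalty, 1), round(total, 1))  # -3.2 4.8
print(linear_gap(g=3, d=3))      # -9
```

With the same opening cost, the linear scheme charges -9 for this gap where the affine scheme charges -3.2, so affine penalties favour one long gap over several short ones.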
Modification of Gap Penalties
Score Matrix: BLOSUM62
gap opening penalty = 3
gap extension penalty = 0.1
score = 6.3
1 ...VLSPADKFLTNV 12
      ||||
1 VFTELSPAKTV.... 11
gap opening penalty = 0
gap extension penalty = 0.1
score = 11.3
1 V...LSPADKFLTNV 12
  |   |||| |  | |
1 VFTELSPA.K..T.V 11
Pairwise Sequence Alignment
• Principles of pairwise sequence comparison
• global / local alignments
• scoring systems
• gap penalties
• Methods of pairwise sequence alignment
• window-based methods
• dynamic programming approaches
• Pairwise alignment programs in HUSAR
Dotplot:
A dotplot gives an overview of all possible alignments
[Figure: dot matrix with Sequence 1 (T A C A T T A C G T A C) on one axis and Sequence 2 (A T T C A C A T A, as listed on the slide) on the other; a dot marks each pair of identical residues]
Dotplot:
In a dotplot each diagonal corresponds to a possible (ungapped) alignment
[Figure: dot matrix of Sequence 1 (T A C A T T A C G T A C) against Sequence 2 (A T T C A C A T A, as listed on the slide)]
One possible alignment:
T A C A T T A C G T A C
A T A C A C T T A
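A dot matrix of this kind can be produced with a few lines of code. The sketch below uses the two sequences as they are written in the alignment above and marks identical residues with `*`; the function name and the output symbols are illustrative choices.

```python
def dotplot(seq_x, seq_y):
    """One text row per residue of seq_y; '*' where it equals the residue of seq_x."""
    rows = []
    for y in seq_y:
        rows.append("".join("*" if x == y else "." for x in seq_x))
    return rows


seq1 = "TACATTACGTAC"
seq2 = "ATACACTTA"
rows = dotplot(seq1, seq2)

print("  " + seq1)
for residue, row in zip(seq2, rows):
    print(residue + " " + row)
```

Runs of `*` along a diagonal correspond to ungapped stretches of identical residues, which is what the eye picks out in the plot.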
Pairwise Sequence Alignment
• Principles of pairwise sequence comparison
• global / local alignments
• scoring systems
• gap penalties
• Methods of pairwise sequence alignment
• window-based methods
• dynamic programming approaches
• Pairwise alignment programs in HUSAR
Window-based Approaches
• Word Size
• Window / Stringency
Word Size Algorithm
Word Size = 3
T A C G G T A T G
A C A G T A T C
[Figure: the two sequences compared word by word in several animation steps, with the resulting dot matrix]
Window / Stringency
Window = 5 / Stringency = 4
T A C G G T A T G
T C A G T A T C
[Figure: the two sequences compared window by window in several animation steps, with the resulting dot matrix]
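Both window-based filters can be sketched with one function: a dot is kept when at least `stringency` of the `window` positions are identical, and the word size method is treated here as the special case where window and stringency both equal the word size. The function name is illustrative, and recording the window start (rather than its midpoint) is an arbitrary choice for this sketch, not necessarily what Compare does.

```python
def window_hits(seq1, seq2, window, stringency):
    """Return (i, j) window start positions (0-based) that pass the stringency filter."""
    hits = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            identical = sum(a == b for a, b in
                            zip(seq1[i:i + window], seq2[j:j + window]))
            if identical >= stringency:
                hits.append((i, j))
    return hits


# Word size 3: only exact 3-letter words count.
word_hits = window_hits("TACGGTATG", "ACAGTATC", window=3, stringency=3)
# Window 5 / stringency 4: one mismatch per window is tolerated.
win_hits = window_hits("TACGGTATG", "TCAGTATC", window=5, stringency=4)
print(word_hits)  # [(4, 3), (5, 4)]
print(win_hits)   # [(2, 1), (3, 2), (4, 3)]
```

The hits of each run lie on a single diagonal, which is how a similar region shows up as a line in the dotplot.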
Window / Stringency
Score = 11
PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
Scoring Matrix Filtering
Matrix: PAM250
Window = 12
Stringency = 9
Score = 11
PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
Score = 7
PTHPLASKTQILPEDLASEDLTI
PTHPLAGERAIGLARLAEEDFGM
Considerations
• The window/stringency method is more sensitive than the wordsize
method (ambiguities are permitted).
• The smaller the window, the larger the weight of statistical
(unspecific) matches.
• With large windows the sensitivity for short sequences is reduced.
• Insertions/deletions are not treated explicitly.
Insertions / Deletions in a Dotplot
[Figure: dotplot of Sequence 1 (T A C T G T T C A T) against Sequence 2 (T A C T G T C A T)]
T A C T G - T C A T
| | | | |   | | | |
T A C T G T T C A T
Dotplot
(Window = 30 / Stringency = 9)
[Figure: output of the programs Compare and DotPlot, Hemoglobin α-chain against Hemoglobin β-chain]
Dotplot
(Window = 18 / Stringency = 10)
[Figure: output of the programs Compare and DotPlot, Hemoglobin α-chain against Hemoglobin β-chain]
Pairwise Sequence Alignment
• Principles of pairwise sequence comparison
• global / local alignments
• scoring systems
• gap penalties
• Methods of pairwise sequence alignment
• window-based approaches
• dynamic programming approaches
• Needleman and Wunsch
• Smith and Waterman
• Pairwise alignment programs in HUSAR
Dynamic Programming
Automatic procedure that finds the best alignment
with an optimal score depending on the chosen parameters.
• Needleman and Wunsch Algorithm
- Global Alignment -
• Smith and Waterman Algorithm
- Local Alignment -
Needleman and Wunsch
(global alignment)
Sequence 1: HEAGAWGHEE
Sequence 2: PAWHEAE
Scoring parameters: BLOSUM50 matrix
Gap penalty: Linear gap penalty of 8
Basic principles of dynamic programming
- Initialisation of alignment matrix
- Stepwise calculation of score values
(creation of an alignment path matrix)
- Backtracking (evaluation of the optimal path)
Initialisation of Matrix
(BLOSUM 50)
     H   E   A   G   A   W   G   H   E   E
P   -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A   -2  -1   5   0   5  -3   0  -2  -1  -1
W   -3  -3  -3  -3  -3  15  -3  -3  -3  -3
H   10   0  -2  -2  -2  -3  -2  10   0   0
E    0   6  -1  -3  -1  -3  -3   0   6   6
A   -2  -1   5   0   5  -3   0  -2  -1  -1
E    0   6  -1  -3  -1  -3  -3   0   6   6
Creation of an alignment path matrix
Idea:
Build up an optimal alignment using previous solutions for
optimal alignments of smaller subsequences
• Construct matrix F indexed by i and j (one index for each sequence)
• F(i,j) is the score of the best alignment between the initial
segment x1...i of x up to xi and the initial segment y1...j of y up to yj
• Build F(i,j) recursively beginning with F(0,0) = 0
Creation of an alignment path matrix
F(i, j) = max { F(i-1, j-1) + s(xi, yj),
                F(i-1, j) - d,
                F(i, j-1) - d }
[Figure: cell F(i, j) is reached from F(i-1, j-1) with s(xi, yj), from F(i-1, j) with -d, or from F(i, j-1) with -d]
Creation of an alignment path matrix
• If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j)
• Three possibilities:
• xi and yj are aligned, F(i,j) = F(i-1,j-1) + s(xi ,yj)
• xi is aligned to a gap, F(i,j) = F(i-1,j) - d
• yj is aligned to a gap, F(i,j) = F(i,j-1) - d
• The best score up to (i,j) will be the largest of the three options
Creation of an alignment path matrix
Boundary conditions
F(i, 0) = -i d
F(0, j) = -j d
         H    E    A    G    A    W    G    H    E    E
    0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P  -8
A -16
W -24
H -32
E -40
A -48
E -56
Creation of an alignment path matrix
Worked example with d = 8 and the BLOSUM50 scores P-H = -2, E-P = -1, H-A = -2, E-A = -1:
F(1,1) = max { F(0,0) + s(x1, y1) = 0 - 2 = -2,  F(0,1) - d = -8 - 8 = -16,  F(1,0) - d = -8 - 8 = -16 } = -2
F(2,1) = max { F(1,0) + s(x2, y1) = -8 - 1 = -9,  F(1,1) - d = -2 - 8 = -10,  F(2,0) - d = -16 - 8 = -24 } = -9
F(1,2) = max { F(0,1) + s(x1, y2) = -8 - 2 = -10,  F(0,2) - d = -16 - 8 = -24,  F(1,1) - d = -2 - 8 = -10 } = -10
F(2,2) = max { F(1,1) + s(x2, y2) = -2 - 1 = -3,  F(1,2) - d = -10 - 8 = -18,  F(2,1) - d = -9 - 8 = -17 } = -3
         H    E
    0   -8  -16
P  -8   -2   -9
A -16  -10   -3
Backtracking
         H    E    A    G    A    W    G    H    E    E
    0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P  -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1
Optimal global alignment:
HEAGAWGHE-E
--P-AW-HEAE
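The three stages (initialisation, stepwise filling, backtracking) can be sketched in a few lines. This is a minimal illustration, not the GAP program: the score table holds only the BLOSUM50 values shown on the slides, the function name is made up, and ties in the backtracking are broken in a fixed order (diagonal, then gap in y, then gap in x), which is an arbitrary choice.

```python
# BLOSUM50 excerpt copied from the slides: one row per residue of y,
# one column per residue of x = HEAGAWGHEE.
COLS = "HEAGAWGHEE"
ROWS = {
    "P": [-2, -1, -1, -2, -1, -4, -2, -2, -1, -1],
    "A": [-2, -1, 5, 0, 5, -3, 0, -2, -1, -1],
    "W": [-3, -3, -3, -3, -3, 15, -3, -3, -3, -3],
    "H": [10, 0, -2, -2, -2, -3, -2, 10, 0, 0],
    "E": [0, 6, -1, -3, -1, -3, -3, 0, 6, 6],
}
SCORES = {(a, b): v for b, row in ROWS.items() for a, v in zip(COLS, row)}


def needleman_wunsch(x, y, d=8):
    """Global alignment with a linear gap penalty d; returns (F, aligned x, aligned y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # boundary: F(i, 0) = -i d
        F[i][0] = -i * d
    for j in range(1, m + 1):          # boundary: F(0, j) = -j d
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + SCORES[(x[i - 1], y[j - 1])],
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Backtracking from the bottom-right cell to F(0, 0).
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + SCORES[(x[i - 1], y[j - 1])]:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, aligned_x, aligned_y = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(F[-1][-1])   # 1
print(aligned_x)   # HEAGAWGHE-E
print(aligned_y)   # --P-AW-HEAE
```

The bottom-right cell holds the score of the best global alignment; other equally good alignments may exist, and which one is reported depends on the tie-breaking order.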
Smith and Waterman
(local alignment)
Two differences:
1. F(i, j) = max { 0,
                   F(i-1, j-1) + s(xi, yj),
                   F(i-1, j) - d,
                   F(i, j-1) - d }
2. An alignment can now end anywhere in the matrix
Example:
Sequence 1: HEAGAWGHEE
Sequence 2: PAWHEAE
Scoring parameters: BLOSUM50 matrix
Gap penalty: Linear gap penalty of 8
Smith Waterman alignment
         H    E    A    G    A    W    G    H    E    E
    0    0    0    0    0    0    0    0    0    0    0
P   0    0    0    0    0    0    0    0    0    0    0
A   0    0    0    5    0    5    0    0    0    0    0
W   0    0    0    0    2    0   20   12    4    0    0
H   0   10    2    0    0    0   12   18   22   14    6
E   0    2   16    8    0    0    4   10   18   28   20
A   0    0    8   21   13    5    0    4   10   20   27
E   0    0    6   13   18   12    4    0    4   16   26
Optimal local alignment:
AWGHE
AW-HE
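The local variant differs from the global one only in the extra 0 in the maximum and in where backtracking starts and stops. The sketch below is a minimal illustration, not the Bestfit program: the score table again holds only the BLOSUM50 values from the slides, the function name is made up, and the first cell with the highest score is used as the end of the alignment.

```python
# BLOSUM50 excerpt copied from the slides: one row per residue of y,
# one column per residue of x = HEAGAWGHEE.
COLS = "HEAGAWGHEE"
ROWS = {
    "P": [-2, -1, -1, -2, -1, -4, -2, -2, -1, -1],
    "A": [-2, -1, 5, 0, 5, -3, 0, -2, -1, -1],
    "W": [-3, -3, -3, -3, -3, 15, -3, -3, -3, -3],
    "H": [10, 0, -2, -2, -2, -3, -2, 10, 0, 0],
    "E": [0, 6, -1, -3, -1, -3, -3, 0, 6, 6],
}
SCORES = {(a, b): v for b, row in ROWS.items() for a, v in zip(COLS, row)}


def smith_waterman(x, y, d=8):
    """Local alignment with a linear gap penalty d; returns (score, aligned x, aligned y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + SCORES[(x[i - 1], y[j - 1])],
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Backtracking starts at the best cell and stops at the first 0.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + SCORES[(x[i - 1], y[j - 1])]:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


best, local_x, local_y = smith_waterman("HEAGAWGHEE", "PAWHEAE")
print(best)      # 28
print(local_x)   # AWGHE
print(local_y)   # AW-HE
```

Because no cell can fall below 0, a poorly matching stretch resets the score instead of dragging down a good region that follows it.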
Extended Smith & Waterman
To get multiple local alignments:
• delete regions around best path
• repeat backtracking
Extended Smith & Waterman
[Figure: the Smith-Waterman matrix of the previous example, shown first in full and then with the cells around the best path removed; the highest remaining score is 21]
Second best local alignment:
HEA
HEA
Further Extensions of Dynamic
Programming
• Overlap matches
• Alignment with affine gap scores
Algorithmic Complexity
How does an algorithm's performance in CPU time and
required memory storage scale with the size of the problem?
Needleman & Wunsch
• Storing (n+1)x(m+1) numbers
• Each number costs a constant number of
calculations to compute (three sums and a max)
• Algorithm takes O(nm) memory and O(nm) time
• Since n and m are usually comparable: O(n^2)
Multiple Alignments
The Biocomputing Service Group
Multiple Alignments
Process of aligning 3 or more sequences.
The basis for:
• The study of protein families or evolutionary
relationships.
• Finding conserved consensus patterns or domains.
Multiple Alignments
Approaches:
• Multidimensional dynamic programming
• Progressive alignments
• And others
Multiple Alignment
Multidimensional Dynamic Programming
Three-dimensional Alignment Path Matrix
Alignment of 3 sequences:
[Figure: cube-shaped path matrix with axes Sequence 1, Sequence 2 and Sequence 3]
Computing time!
Memory!
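The growth in memory can be made concrete by counting matrix cells: a path matrix for k sequences has one axis per sequence, each of length (sequence length + 1). The sketch below only does this arithmetic. The function name is illustrative and the length of 300 residues is a hypothetical example, not a figure from the slides.

```python
from math import prod


def dp_cells(*lengths):
    """Number of cells in a full dynamic programming matrix for the given sequence lengths."""
    return prod(n + 1 for n in lengths)


print(dp_cells(10, 7))           # 88 (the HEAGAWGHEE / PAWHEAE example)
print(dp_cells(300, 300))        # 90601
print(dp_cells(300, 300, 300))   # 27270901
```

Adding a third sequence of the same length multiplies the number of cells by about 300 in this example, and each further sequence does so again.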
Multiple Alignments
Approaches:
• Multidimensional dynamic programming
MSA
(Lipman, Altschul and Kececioglu, 1989)
DCA
(Jens Stoye)
Multiple Alignment
Multidimensional Dynamic Programming
Divide-and-Conquer Alignment (DCA)
• Simultaneous alignment of multiple sequences using
Needleman and Wunsch algorithm
• Reduction of search space (reduces computing time)
• Sequences are cut at suitable positions near their midpoints
to obtain two new families of shorter sequences.
• The cutting procedure is repeated until the new families of
sequences can be aligned optimally.
• Then the resulting alignments are concatenated.
• Crucial Point: finding suitable cut positions
Multiple Alignment
Multidimensional Dynamic Programming
Divide-and-Conquer Alignment
[Figure: divide, divide, divide, align optimally, concatenate]
Multiple Alignment
Multidimensional Dynamic Programming
Reduction of Search Space
[Figure: three-dimensional path matrix with axes Sequence 1, Sequence 2 and Sequence 3]
Multiple Alignments
Approaches:
• Multidimensional dynamic programming
• Progressive alignments
• And others
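Multiple Alignment
Progressive Alignment
Principle:
Pairwise Alignment, then Guide Tree, then Iterative Multiple Alignment
[Figure: pairwise alignments 1+2, 1+3, 1+4, 2+3, 2+4, 3+4; a guide tree joining sequences 1 to 4; the multiple alignment built up from sequences 1 to 4]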
Multiple Alignment
Progressive Alignment: 1. step
Pairwise Comparison of all sequences
1:2  1:3  1:4  1:5
2:3  2:4  2:5
3:4  3:5
4:5
Similarity score of every comparison
Multiple Alignment
Progressive Alignment: 1. step
Methods of Pairwise Comparison
Programs perform global alignments:
• Needleman & Wunsch: (Pileup, Tree, Clustal)
• Word Size Method: (Clustal)
• X. Huang (modified N-W): (MAlign)
Multiple Alignment
Progressive Alignment: 2. step
Construction of a Guide Tree
[Figure: similarity matrix with rows and columns for sequences 1 to 5]
Similarity Matrix: displays scores of all sequence pairs.
The similarity matrix is transformed into a distance matrix . . . . .
Multiple Alignment
Progressive Alignment: 2. step
Construction of a Guide Tree
[Figure: the distance matrix is turned into a guide tree with leaves 1, 5, 2, 3, 4]
Guide Tree: No phylogenetic tree !!
Neighbour-Joining Method or
UPGMA (unweighted pair group method of arithmetic averages)
Multiple Alignment
Progressive Alignment: 3. step
Multiple Alignment
[Figure: guide tree with leaves 1, 5, 2, 3, 4, with numbers marking the order of the alignment steps]
Multiple Alignment
Progressive Alignment: 3. step
Columns - once aligned - are never changed
G T C C G - C A G G
T T - C G C C - G G
T T A C T T C C A G G
. . . . and new gaps are inserted.
G T C C G - - C A G G
T T - C G C - C - G G
T T A C T T C C A G G
Multiple Alignment
Progressive Alignment: 3. step
Columns - once aligned - are never changed
G T C C G - - C A G G
T T - C G C - C - G G
T T A C T T C C A G G
G T C C G - - C A G G
T T - C G C - C - G G
T T A C T T C C A G G
A T C T - - C A A T
C T G T C C C T A G
A T C - T - - C A A T
C T G - T C C C T A G
Multiple Alignments
Approaches:
• Multidimensional dynamic programming
• Progressive alignments
• And others
Multiple Alignment
Other Approaches
Other methods
DiAlign
(Morgenstern et al. (1996) Proc. Natl. Acad. Sci., 93, 12098-12103)
PRRP
(Gotoh O. (1996) J. Mol. Biol., 264, 823-838)
T-Coffee
(Notredame et al. (2000) J. Mol. Biol., 302, 205-217)
Multiple Alignment
Other Approaches
DiAlign
• Local alignment
• Gap-free segment-to-segment comparison instead of residue-to-residue comparison
• Gaps are not treated explicitly (no gap penalties)
• Gaps represent those parts of the sequences that are not aligned
Multiple Alignment
Other Approaches: DiAlign
Example: Alignment of two sequences
[Figure: dot matrix of Sequence 1 against Sequence 2 with three diagonal segments S1, S2 and S3]
Multiple Alignment
Other Approaches: DiAlign
Consistent versus Non-consistent diagonals (local alignments)
[Figure: segments S1, S2 and S3 drawn between Sequence 1 and Sequence 2]
Consistent: S1 + S3, S2 + S3
Non-consistent: S1 + S2
Multiple Alignment
Other Approaches: DiAlign
Maximum Alignment
[Figure: dot matrix of Sequence 1 against Sequence 2 with diagonals S1, S2 and S3]
Marked diagonals (similar segments)
are taken for the maximum alignment.
Multiple Alignment
Other Approaches: DiAlign
Extension to Multiple Alignment
1. Pairwise Comparison of each sequence pair
2. Maximum Alignment (diagonals) for each sequence pair
[Figure: two dot matrices of Sequence X against Sequence Y, one for each step]
Multiple Alignment
Other Approaches: DiAlign
Extension to Multiple Alignment
3. Diagonals of all pairwise maximum alignments are ranked
according to their score and incorporated one by one
as long as they are consistent in the growing multiple alignment.
Multiple Alignment
Other Approaches: DiAlign
Extension to Multiple Alignment
[Figure: diagonals between pairs of the sequences SA, SB, SC and SD are combined into one multiple alignment of SA, SB, SC and SD]
Multiple Alignment
Other Approaches
Other methods
DiAlign
(Morgenstern et al. (1996) Proc. Natl. Acad. Sci., 93, 12098-12103)
PRRP
(Gotoh O. (1996) J. Mol. Biol., 264, 823-838)
T-Coffee
(Notredame et al. (2000) J. Mol. Biol., 302, 205-217)
Multiple Alignment
Other Approaches
PRRP
Iterative refinement
[Figure: a multiple alignment is split (at random) into two sequence groups, which are combined into a new multiple alignment]
Iteration if SP score of the new
multiple alignment is higher than
the previous SP score.
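SP is commonly read as "sum of pairs": every column of the multiple alignment is scored by adding up the scores of all residue pairs in it. The sketch below shows that idea with a deliberately simple scoring (match = 1, mismatch = 0, gap = 0). These values and the function name are assumptions for illustration, not PRRP's actual objective function; the example rows are the three-sequence alignment from the progressive alignment slides.

```python
from itertools import combinations


def sp_score(alignment, match=1, mismatch=0, gap=0):
    """Sum-of-pairs score: add up the pair scores of every alignment column."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


alignment = ["GTCCG--CAGG",
             "TT-CGC-C-GG",
             "TTACTTCCAGG"]
score = sp_score(alignment)
print(score)  # 18
```

An iterative refinement loop would recompute such a score after each regrouping and keep the new alignment only when the score has improved.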
Multiple Alignment
Other Approaches
Other methods
DiAlign
(Morgenstern et al. (1996) Proc. Natl. Acad. Sci., 93, 12098-12103)
PRRP
(Gotoh O. (1996) J. Mol. Biol., 264, 823-838)
T-Coffee
(Notredame et al. (2000) J. Mol. Biol., 302, 205-217)
Multiple Alignment
Other Approaches
T-Coffee
[Figure: flow chart. Global Pairwise Alignment and Local Pairwise Alignment feed the Primary Library (Weighting); Extension gives the Extended Library; Progressive Alignment gives the Final Multiple Alignment]
Last but not least, a HINT:
Multiple Alignment