Download Document

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Biochemical switches in the cell cycle wikipedia , lookup

Cytosol wikipedia , lookup

Endomembrane system wikipedia , lookup

Amitosis wikipedia , lookup

Cellular differentiation wikipedia , lookup

Cell cycle wikipedia , lookup

Homology modeling wikipedia , lookup

Cell culture wikipedia , lookup

Organ-on-a-chip wikipedia , lookup

Cell wall wikipedia , lookup

JADE1 wikipedia , lookup

Cell growth wikipedia , lookup

Mitosis wikipedia , lookup

Cytokinesis wikipedia , lookup

Gap junction wikipedia , lookup

Extracellular matrix wikipedia , lookup

List of types of proteins wikipedia , lookup

Structural alignment wikipedia , lookup

Transcript
C
E
N
T
R
E
F
O
R
I
N
T
E
G
R
A
T
I
V
E
B
I
O
I
N
F
O
R
M
A
T
I
C
S
V
U
Introduction to bioinformatics
2007
Lecture 7
Pair-wise Sequence Alignment (III)
What can sequence alignment tell us
about structure
HSSP
Sander & Schneider, 1991
≥30%
sequence
identity
Global dynamic programming
PAM250, Gap =6 (linear)
S
P
E
A
R
E
0
-6
-12
-18
-24
-30
-36
S
-6
2
-4
-10
-16
-22
-28
0
H
-12
-4
2
-3
-9
-14
-20
3
0
A
-18
-10
-3
2
-1
-7
-13
-1
4
K
-24
-16
-9
-3
1
2
-4
E
-30
-22
-15
-5
-3
-2
6
S
P
E
A
R
E
S
2
1
0
1
0
0
H
-1
0
1
-1
2
1
A
1
1
0
2
-2
K
0
-1
0
-1
E
0
-1
4
0
These values are copied from
the PAM250 matrix (see earlier
slide)
The easy algorithm is only for
linear gap penalties
Start in left upper cell before either sequence
(circled in red). Path will end in lower right
cell (circled in blue)
SPEARE
S-HAKE
Higgs & Attwood, p. 124 – Note: There are errors in the matrices!!
DP is a two-step process
• Forward step: calculate scores
• Trace back: start at highest score and
reconstruct the path leading to the highest
score
– These two steps lead to the highest
scoring alignment (the optimal
alignment)
– This is guaranteed when you use DP!
Time and memory complexity of DP
• The time complexity is O(n2): if you would
align two sequences of n residues, you
would need to perform n2 algorithmic steps
(square search matrix has n2 cells that need
to be filled)
• The memory (space) complexity is also
O(n2): if you would align two sequences of
n residues, you would need a square search
matrix of n by n containing n2 cells
Global dynamic programming
i-1
i
j-1 j
H(i,j) = Max
H(i-1,j-1) + S(i,j)
H(i-1,j) - g
H(i,j-1) - g
diagonal
vertical
horizontal
Problem with simple DP approach:
•Can only do linear gap penalties
•Not suitable for affine and concave penalties
Global dynamic programming
using affine penalties
j-2 j-1 j
i-2
i-1
i
Looking back from cell
(i, j) we can adapt the
algorithm for affine gap
penalties by looking back
to four more cells
(magenta)
If you came from here, gap
was already open, so apply
gap-extension penalty
If you came from here, gap was
opened so apply gap-open
penalty
Global dynamic programming
all three types of gap penalties
j-1
i-1
Si,j = si,j + Max
Gap
opening
penalty
Max{S0<x<i-1, j-1 - Pi - (i-x-1)Px}
Si-1,j-1
Max{Si-1, 0<y<j-1 - Pi - (j-y-1)Px}
Gap extension penalty
Global dynamic programming
Gapo=10, Gape=2
D
W
V
T
A
L
K
0
-12
-14
-16
-18
-20
-22
-24
T
-12
8
-9
-6
-5
-9
-11
-14
5
D
-14
0
9
2
2
3
-5
-3
-34
10
6
W
-16
-13
25
11
5
4
9
0
-21
6
14
5
V
-18
-10
-4
37
21
19
19
15
-16
7
5
13
L
-20
-14
-2
23
46
31
37
26
1
K
-22
-12
-9
17
33
53
39
50
14
-34
-29
-1
17
39
27
50
D
W
V
T
A
L
K
T
8
3
8
11
9
9
8
D
12
1
6
8
8
4
8
W
1
25
2
3
2
6
V
6
2
12
8
8
L
4
6
10
9
K
8
5
6
8
These values are copied from
the PAM250 matrix (see earlier
slide), after being made nonnegative by adding 8 to each
PAM250 matrix cell (-8 is the
lowest number in the PAM250
matrix)
The extra bottom row and rightmost column give the
final global alignment scores
Easy DP recipe for using affine
gap penalties
j-1
i-1
• M[i,j] is optimal alignment (highest scoring
alignment until [i,j])
• Check
– preceding row until j-2: apply appropriate gap penalties
– preceding row until i-2: apply appropriate gap penalties
– and cell[i-1, j-1]: apply score for cell[i-1, j-1]
DP is a two-step process
• Forward step: calculate scores
• Trace back: start at highest score and
reconstruct the path leading to the highest
score
– These two steps lead to the highest scoring
alignment (the optimal alignment)
– This is guaranteed when you use DP!
Global dynamic programming
There are three kinds of
alignments
• Global alignment
• Semi-global alignment
• Local alignment
Semi-global pairwise alignment
• Global alignment: all gaps are penalised
• Semi-global alignment: N- and C-terminal gaps
(end-gaps) are not penalised
End-gaps
MSTGAVLIY--TS-------GGILLFHRTSGTSNS
End-gaps
Semi-global dynamic
programming
PAM250, Gap =6 (linear)
S
P
E
A
R
E
0
0
0
0
0
0
0
S
0
2
1
0
1
0
0
0
H
0
-1
2
2
-1
3
1
3
0
A
0
1
0
2
4
-2
3
-1
4
K
0
0
0
0
1
7
1
E
0
0
-1
4
0
1
11
S
P
E
A
R
E
S
2
1
0
1
0
0
H
-1
0
1
-1
2
1
A
1
1
0
2
-2
K
0
-1
0
-1
E
0
-1
4
0
These values are copied from
the PAM250 matrix (see earlier
slide)
The easy algorithm is only for
linear gap penalties
Start in left upper cell before either sequence
(circled in red). Path will end in cell anywhere in
the bottom row or rightmost columns with the
highest score
SPEARE
-SHAKE
Higgs & Attwood, p. 124 – Note: There are errors in the matrices!!
Semi-global dynamic programming
- two examples with different gap penalties These values are copied from
the PAM250 matrix (see earlier
slide), after being made nonnegative by adding 8 to each
PAM250 matrix cell (-8 is the
lowest number in the PAM250
matrix)
Global score would
have been 65 –10 –
1*2 –10 – 2*2
(because of the two
end gaps)
Semi-global pairwise alignment
Applications of semi-global:
– Finding a gene in genome
– Placing marker onto a chromosome
– Generally: if one sequence is much longer
than the other
Danger: if gap penalties high -- really bad
alignments for divergent sequences
Local dynamic programming
(Smith & Waterman, 1981)
LCFVMLAGSTVIVGTR
E
D
A
S
T
I
L
C
G
S
Negative
numbers
Amino Acid
Exchange Matrix
Search matrix
AGSTVIVG
A-STILCG
Gap penalties
(open, extension)
Local dynamic programming (Smith
and Waterman, 1981)
basic algorithm
i-1
i
j-1 j
H(i,j) = Max
H(i-1,j-1) + S(i,j)
H(i-1,j) - g
H(i,j-1) - g
0
diagonal
vertical
horizontal
Local dynamic programming
Match/mismatch = 1/-1, Gap =2
A
T
G
A
C
G
T
0
0
0
0
0
0
0
0
T
0
0
1
0
0
0
0
1
A
0
1
0
0
1
0
0
0
G
0
0
0
1
0
0
1
0
A
0
1
0
0
2
0
0
0
C
0
0
0
0
0
3
1
0
T
0
0
1
0
0
0
2
2
Fill the matrix (forward pass), then do
trace back from highest cell anywhere
in the matrix till you reach 0 or the
beginning of a sequence
Local dynamic programming
Match/mismatch = 1/-1, Gap =2
A
T
G
A
C
G
T
0
0
0
0
0
0
0
0
T
0
0
1
0
0
0
0
1
A
0
1
0
0
1
0
0
0
G
0
0
0
1
0
0
1
0
A
0
1
0
0
2
0
0
0
C
0
0
0
0
0
3
1
0
T
0
0
1
0
0
0
2
2
Fill the matrix (forward pass), then do
trace back from highest cell anywhere
in the matrix till you reach 0 or the
beginning of a sequence
GAC
GAC
Local dynamic programming
(Smith & Waterman, 1981)
j-1
This is the general
DP algorithm,
which is suitable
for linear, affine
and concave
penalties,
although for the
examples here
affine penalties
are used
i-1
Si,j = Max
Gap
opening
penalty
Si,j + Max{S0<x<i-1,j-1 - Pi - (i-x-1)Px}
Si,j + Si-1,j-1
Si,j + Max {Si-1,0<y<j-1 - Pi - (j-y-1)Px}
0
Gap extension penalty
Local dynamic programming
Global and local alignment
A
B
B
A
A
A
A
A
B
A
Local
B
B
A
Global
A
B
C
B
C
C
B
C
B
A
A
C
C
Local
Global
Global or Local Pairwise
alignment
A
B
B
A
A
A
A
A
B
A
Local
B
B
A
Global
A
B
C
B
C
C
B
C
B
A
A
C
C
Local
Global
Globin fold
 protein
myoglobin
PDB: 1MBN
Alpha-helices
are labelled ‘A’
(blue) to ‘H’
(red). The D
helix can be
missing in some
globins:
What happens
with the
alignment if Dhelix containing
globin sequences
are aligned with
‘D-less’ ones?
 sandwich
 protein
immunoglobulin
PDB: 7FAB
Immunoglobulin
structures have
variable regions
where numbers
of amino acids
can vary
substantially
TIM barrel
 /  protein
Triose
phosphate
IsoMerase
PDB: 1TIM
The evolutionary
history of this protein
family has been the
subject of rigorous
debate. Arguments
have been made in
favor of both
convergent and
divergent evolution.
Because of the
general lack of
sequence homology,
the ancestry of this
molecule is still a
mystery.
Pyruvate kinase
Phosphotransferase
 barrel regulatory domain
/ barrel catalytic substrate binding
domain
/ nucleotide binding domain
What does all this mean for
alignments?
• Alignments need to be able to skip secondary
structural elements to complete domains (i.e. putting
gaps opposite these motifs in the shorter sequence).
• Depending on gap penalties chosen, the algorithm
might have difficulty with making such long gaps
(for example when using high affine gap penalties),
resulting in incorrect alignment.
• Alignments are only meaningful for homologous
sequences (with a common ancestor)
There are three kinds of
pairwise alignments



Global alignment – align all residues in both
sequences; all gaps are penalised
Semi-global alignment – align all residues in
both sequences; end gaps are not penalised
(zero end gap penalties)
Local alignment – align one part of each
sequence; end gaps are not applicable
Easy global DP recipe for using
affine gap penalties (after
Gotoh)
j-1
Penalty = Pi + gap_length*Pe
i-1


M[i,j] is optimal alignment (highest scoring alignment until [i, j])
At each cell [i, j] in search matrix, check Max coming from:




Si,j = si,j + Max
Max{S0<x<i-1, j-1 - Pi - (i-x-1)Px}
Si-1,j-1
Max{Si-1, 0<y<j-1 - Pi - (j-y-1)Px}
any cell in preceding row until j-2: add score for cell[i, j] minus appropriate
gap penalties;
any cell in preceding column until i-2: add score for cell[i, j] minus
appropriate gap penalties;
or cell[i-1, j-1]: add score for cell[i, j]
Select highest scoring cell in bottom row and rightmost column and do
trace-back
Let’s do an example: global alignment
Gotoh’s DP algorithm with affine gap penalties
(PAM250, Pi=10, Pe=2)
D
W
V
T
A
L
K
0
-12
-14
-16
-18
-20
-22
-24
T
-12
0
-17
-14
-13
-17
-19
-22
-3
D
-14
-8
-7
-14
-14
-13
2
-2
W
-16
-21
9
-13
-19
-18
-2
6
-3
V
-18
-18
-20
13
-3
-16
-1
-3
5
L
-20
-22
-18
-1
14
K
-22
-20
-21
-24
-42
D
W
V
T
A
L
K
T
0
-5
0
3
1
1
0
D
4
-7
-2
0
0
-4
0
W
-7
17
-6
-5
-6
-2
V
-2
-6
4
0
0
L
-4
-2
2
1
K
0
-3
-2
0
PAM250
-22
-42
-1
-14
-12
-41
-18
-16
-14
-12
0
Cell (D2, T4) can alternatively come from two cells (same score):
‘high-road’ or ‘low-road’
Row and column ‘0’ are filled with 0, -12, -14, -16, … if global alignment is used (for N-terminal endgaps); also extra row and column at the end to calculate the score including C-terminal end-gap
penalties. Note that only ‘non-diagonal’ arrows are indicated for clarity (no arrow means that you go
back to earlier diagonal cell).
Let’s do another example: semi-global alignment
Gotoh’s DP algorithm with affine gap penalties
(PAM250, Pi=10, Pe=2)
D
W
V
T
A
L
K
T
0
-5
0
3
1
1
0
D
4
-7
-2
0
0
-4
W
-7
17
-6
-5
-6
V
-2
-6
4
0
L
-4
-2
2
K
0
-3
-2
D
W
V
T
T
0
-5
0
3
0
D
4
-7
-7
-2
-3
W
-7
21
-13
0
2
-2
V
-2
-13
25
1
-2
6
-3
L
0
-1
-3
5
K
A
L
K
9
PAM250
Starting row and column ‘0’, and extra column at right or extra row at
bottom is not necessary when using semi global alignment (zero endgaps). Rest works as under global alignment.
Easy local DP recipe for using
affine gap penalties (after
Gotoh)
j-1
Penalty = Pi + gap_length*Pe
Si,j = Max
i-1


Si,j + Max{S0<x<i-1,j-1 - Pi - (i-x-1)Px}
Si,j + Si-1,j-1
Si,j + Max {Si-1,0<y<j-1 - Pi - (j-y-1)Px}
0
M[i,j] is optimal alignment (highest scoring alignment until [i, j])
At each cell [i, j] in search matrix, check Max coming from:
any cell in preceding row until j-2: add score for cell[i, j] minus appropriate gap
penalties;
any cell in preceding column until i-2: add score for cell[i, j] minus appropriate
gap penalties;
or cell[i-1, j-1]: add score for cell[i, j]

Select highest scoring cell anywhere in matrix and do trace-back until
zero-valued cell or start of sequence(s)
Let’s do yet another example: local alignment
Gotoh’s DP algorithm with affine gap penalties
(PAM250, Pi=10, Pe=2)
D
W
V
T
A
L
K
T
0
-5
0
3
1
1
0
D
4
-7
-2
0
0
-4
W
-7
17
-6
-5
-6
V
-2
-6
4
0
L
-4
-2
2
K
0
-3
-2
D
W
V
T
T
0
0
0
3
0
D
4
0
0
0
-2
-3
W
0
21
0
0
0
2
-2
V
0
0
25
9
1
-2
6
-3
L
0
0
11
0
-1
-3
5
K
0
0
A
L
K
PAM250
Extra start/end columns/rows not necessary (no end-gaps). Each negative
scoring cell is set to zero. Highest scoring cell may be found anywhere in
search matrix after calculating it. Trace highest scoring cell back to first
cell with zero value (or the beginning of one or both sequences)
Dot plots
• Way of representing (visualising) sequence
similarity without doing dynamic
programming (DP)
• Make same matrix, but locally represent
sequence similarity by averaging using a
window
Comparing two sequences
We want to be able to choose the best alignment between two
sequences.
A simple method of visualising similarities between two sequences
is to use dot plots. The first sequence to be compared is assigned to
the horizontal axis and the second is assigned to the vertical axis.
Dot plots can be filtered by
window approaches (to
calculate running averages)
and applying a threshold
They can identify
insertions, deletions,
inversions
For your first exam D1:
Make sure you understand and can carry out
1. the ‘simple’ DP algorithm for global, semiglobal and local alignment (using linear gap
penalties but make sure you know the extension of
the basic algorithm for affine gap penalties) and
2. The general DP algorithm for global, semiglobal and local alignment (using linear, affine and
concave gap penalties)!