Bioinformatics
“Nothing in Biology makes sense except in
the light of evolution” (Theodosius
Dobzhansky (1900-1975))
“Nothing in bioinformatics makes sense
except in the light of Biology”
Evolution
Three requirements:
• Template structure providing stability
(DNA)
• Copying mechanism (meiosis)
• Mechanism providing variation (mutations;
insertions and deletions; crossing-over; etc.)
Evolution
Ancestral sequence: ABCD
Descendants: ACCD (mutation, B → C) and ABD (deletion, C → ø)
Pairwise Alignment
ACCD
AB─D   (true alignment)
or
ACCD
A─BD
Example: Pairwise sequence alignment needs sense of evolution
Global dynamic programming
[Figure: search matrix with MDAGSTVILCFVG along the horizontal axis and MDAASTILCGS along the vertical axis. Evolution enters through the amino acid exchange matrix and the gap penalties (open, extension). Resulting alignment:]
MDAGSTVILCFVG
MDAAST-ILC--GS
Sequence alignment
History
1970 Needleman-Wunsch global pair-wise alignment
1981 Smith-Waterman local pair-wise alignment
1984 Hogeweg-Hesper progressive multiple alignment
1989 Lipman-Altschul-Kececioglu simultaneous multiple alignment
1994 Hidden Markov Models (HMM) for multiple alignment
1996 Iterative strategies for progressive multiple alignment revived
1997 PSI-Blast (PSSM)
Pair-wise alignment
T D W V T A L K
T D W L - - I K
Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$
2 sequences of 300 a.a.: ~10^179 alignments
2 sequences of 1000 a.a.: ~10^600 alignments!
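The growth of the number of alignments can be checked numerically. The sketch below is only an illustration, assuming the count is taken to be the central binomial coefficient C(2n, n) for two sequences of length n; the function names are hypothetical and not part of the original slides.

```python
import math


def log10_alignment_count(n: int) -> float:
    """log10 of the exact central binomial coefficient C(2n, n)."""
    return math.log10(math.comb(2 * n, n))


def log10_stirling_estimate(n: int) -> float:
    """log10 of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


if __name__ == "__main__":
    # Print the order of magnitude (power of ten) for a few sequence lengths.
    for n in (10, 300, 1000):
        exact = log10_alignment_count(n)
        approx = log10_stirling_estimate(n)
        print(f"n={n}: exact 10^{exact:.1f}, approximation 10^{approx:.1f}")
```

Working with logarithms keeps the numbers printable; the exact and approximate exponents agree closely even for small n.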
A protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE-------GGILLFHRTHELIKESHAMANDEGGSNNS
A DNA sequence alignment
attcgttggcaaatcgcccctatccggccttaa
attt---ggcggatcg-cctctacgggcc----
Dynamic programming
Scoring alignments
$$S_{a,b} = \sum_{l} s(a_i, b_j) + \sum_{k} N_k \cdot gp(k)$$
affine gap penalties: $$gp(k) = p_i + k \, p_e$$
p_i and p_e are the penalties for gap initialisation and extension, respectively
Dynamic programming
Scoring alignments
T D W V T A L K
T D W L - - I K
2020
10
Amino Acid Exchange Matrix
1
Affine gap penalties (open,
extension)
Score: s(T,T)+s(D,D)+s(W,W)+s(V,L)+Po+2Px +
+s(L,I)+s(K,K)
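The score expression above can be evaluated mechanically. The following is a minimal sketch, assuming gap penalties are passed as negative numbers so that they are added to the score (as in the expression above), and using a toy exchange function (`toy_exchange`, hypothetical: +2 for identical residues, -1 otherwise) in place of a real amino acid exchange matrix.

```python
from itertools import groupby


def gap_penalty(k, p_open, p_ext):
    """Affine penalty for one gap of length k: p_open + k * p_ext."""
    return p_open + k * p_ext


def score_alignment(row_a, row_b, s, p_open, p_ext):
    """Score two aligned rows of equal length; '-' marks a gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    states = []
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("column with gaps in both rows")
        # 'a': gap in row_a, 'b': gap in row_b, 'm': residue pair
        states.append("a" if x == "-" else "b" if y == "-" else "m")
    total, pos = 0, 0
    for state, run in groupby(states):
        k = len(list(run))
        if state == "m":
            # Sum the exchange values of the residue pairs in this run.
            total += sum(s(row_a[i], row_b[i]) for i in range(pos, pos + k))
        else:
            # One contiguous gap of length k gets a single affine penalty.
            total += gap_penalty(k, p_open, p_ext)
        pos += k
    return total


def toy_exchange(x, y):
    """Hypothetical stand-in for an exchange matrix lookup."""
    return 2 if x == y else -1


if __name__ == "__main__":
    print(score_alignment("TDWVTALK", "TDWL--IK", toy_exchange, p_open=-10, p_ext=-1))
```

With these toy values the two-residue gap contributes -10 + 2·(-1) = -12, and the six residue pairs contribute 2+2+2-1-1+2 = 6.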
Amino acid exchange matrices (20×20)
How do we get one?
And how do we get associated gap penalties?
First systematic method to derive a.a. exchange matrices by Margaret Dayhoff et al. (1978) – Atlas of Protein Sequence and Structure
PAM250 matrix: amino acid exchange matrix (log odds)
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
B   0  -1   2   3  -4   1   2   0   1  -2  -3   1  -2  -5  -1   0   0  -5  -3  -2   2
Z   0   0   1   3  -5   3   3  -1   2  -2  -3   0  -2  -5   0   0  -1  -6  -4  -2   2   3
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
Positive exchange values denote mutations that are more likely than randomly expected, while negative numbers correspond to avoided mutations compared to the randomly expected situation
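To make the log-odds idea concrete, here is a small sketch of how one exchange value could be computed from frequencies. It assumes the common convention of scaling by 10·log10 and rounding to an integer; the function name and the example frequencies are hypothetical and are not taken from Dayhoff's data.

```python
import math


def log_odds_score(q_ab, p_a, p_b, scale=10.0):
    """Rounded log-odds: observed pair frequency vs. chance expectation.

    q_ab : observed frequency of residues a and b being aligned
    p_a, p_b : background frequencies of a and b
    scale : assumed scaling factor (10 * log10 is an assumption here)
    """
    return round(scale * math.log10(q_ab / (p_a * p_b)))


if __name__ == "__main__":
    # Hypothetical background frequencies of 5% for both residues.
    p_a = p_b = 0.05
    print(log_odds_score(0.0050, p_a, p_b))   # pair seen twice as often as chance
    print(log_odds_score(0.0025, p_a, p_b))   # pair seen as often as chance
    print(log_odds_score(0.00125, p_a, p_b))  # pair seen half as often as chance
```

A pair observed more often than chance gets a positive value, a pair observed as often as chance gets zero, and an avoided pair gets a negative value.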
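Global dynamic programming
$$S_{i,j} = s_{i,j} + \max \begin{cases} \max_{0<x<i-1} \{ S_{x,j-1} - P_i - (i-x-1)P_x \} \\ S_{i-1,j-1} \\ \max_{0<y<j-1} \{ S_{i-1,y} - P_i - (j-y-1)P_x \} \end{cases}$$
Here s_{i,j} is the exchange value for the residues at positions i and j, and P_i and P_x are the gap initialisation and extension penalties (the subscript of P_i is not the row index).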
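The recursion can be transcribed almost literally into code. The sketch below computes only the optimal global score (no traceback) in cubic time. Several details are assumptions of this sketch rather than part of the slides: a virtual start cell S[0][0] = 0, leading and trailing gaps penalised with the same affine penalty, positive penalty values that are subtracted, and a toy exchange function (`toy_exchange`, hypothetical) instead of a real exchange matrix.

```python
def global_score(a, b, s, p_open, p_ext):
    """Best global alignment score using the cubic recursion.

    S[i][j] is the best score of an alignment of a[:i] and b[:j]
    that ends with a[i-1] paired to b[j-1].
    """
    n, m = len(a), len(b)
    neg_inf = float("-inf")
    S = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0  # virtual start cell (assumption of this sketch)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = S[i - 1][j - 1]  # no gap before this pair
            for x in range(0, i - 1):  # residues a[x..i-2] face a gap
                best = max(best, S[x][j - 1] - p_open - (i - x - 1) * p_ext)
            for y in range(0, j - 1):  # residues b[y..j-2] face a gap
                best = max(best, S[i - 1][y] - p_open - (j - y - 1) * p_ext)
            S[i][j] = s(a[i - 1], b[j - 1]) + best
    # Global alignment: a trailing gap after the last pair is penalised too.
    result = S[n][m]
    for i in range(1, n):
        result = max(result, S[i][m] - p_open - (n - i) * p_ext)
    for j in range(1, m):
        result = max(result, S[n][j] - p_open - (m - j) * p_ext)
    return result


def toy_exchange(x, y):
    """Hypothetical stand-in for an exchange matrix lookup."""
    return 2 if x == y else -1


if __name__ == "__main__":
    print(global_score("TDWVTALK", "TDWLIK", toy_exchange, p_open=10, p_ext=1))
```

Practical implementations usually avoid the inner loops over x and y by keeping running maxima per row and column, which reduces the cost to quadratic time; the literal form is kept here so each term can be compared with the recursion.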
Pairwise alignment
• Global alignment: all gaps are penalised
• Semi-global alignment: N- and C-terminal gaps
(end-gaps) are not penalised
End-gaps
MSTGAVLIY--TS-------GGILLFHRTSGTSNS
End-gaps
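The effect of not penalising end-gaps can be seen by scoring the same pair of sequences both ways. The sketch below is a simplified illustration: it swaps in a linear gap penalty (one fixed cost per gap position) rather than the affine penalties used elsewhere in these slides, and the scoring values, function name and example sequences are hypothetical.

```python
def align_score(a, b, match=2, mismatch=-1, gap=-3, free_end_gaps=False):
    """Alignment score with a linear gap penalty.

    free_end_gaps=False: global alignment, all gaps are penalised.
    free_end_gaps=True: semi-global alignment, gaps before the first
    and after the last aligned residue pair cost nothing.
    """
    n, m = len(a), len(b)
    # F[i][j]: best score for the prefixes a[:i] and b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    if not free_end_gaps:
        for i in range(1, n + 1):
            F[i][0] = i * gap  # leading gap is penalised
        for j in range(1, m + 1):
            F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + pair,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    if free_end_gaps:
        # Trailing gaps are free: take the best cell on the last row or column.
        return max(max(F[n]), max(row[m] for row in F))
    return F[n][m]


if __name__ == "__main__":
    long_seq, short_seq = "GGILLFHRT", "ILLFH"
    print(align_score(long_seq, short_seq))                      # global
    print(align_score(long_seq, short_seq, free_end_gaps=True))  # semi-global
```

For a short sequence contained in a longer one, the global score is dragged down by the unavoidable end-gaps, while the semi-global score reflects only the matching region.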
Local dynamic programming
(Smith & Waterman, 1981)
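[Figure: search matrix with LCFVMLAGSTVIVGTR along the horizontal axis and EDASTILCGS along the vertical axis. Inputs: an amino acid exchange matrix containing negative numbers, and gap penalties (open, extension). Resulting local alignment:]
AGSTVIVG
A-STILCG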
Local dynamic programming
(Smith & Waterman, 1981)
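$$S_{i,j} = \max \begin{cases} s_{i,j} + \max_{0<x<i-1} \{ S_{x,j-1} - P_i - (i-x-1)P_x \} \\ s_{i,j} + S_{i-1,j-1} \\ s_{i,j} + \max_{0<y<j-1} \{ S_{i-1,y} - P_i - (j-y-1)P_x \} \\ 0 \end{cases}$$
Here s_{i,j} is the exchange value for the residues at positions i and j, and P_i and P_x are the gap initialisation and extension penalties.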
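The local recursion differs from the global one only by the extra 0 option and by reporting the best cell anywhere in the matrix. A minimal score-only sketch follows, assuming positive penalty values that are subtracted, zero-valued border cells, and a toy exchange function (`toy_exchange`, hypothetical) that must contain negative values for mismatches so that local alignments can end.

```python
def local_score(a, b, s, p_open, p_ext):
    """Best local alignment score using the cubic recursion with a 0 floor."""
    n, m = len(a), len(b)
    # Border cells stay 0: a local alignment may start anywhere.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best_cell = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = S[i - 1][j - 1]  # no gap before this pair
            for x in range(1, i - 1):  # gap opposite residues of a
                prev = max(prev, S[x][j - 1] - p_open - (i - x - 1) * p_ext)
            for y in range(1, j - 1):  # gap opposite residues of b
                prev = max(prev, S[i - 1][y] - p_open - (j - y - 1) * p_ext)
            # The 0 option lets a new local alignment start at this cell.
            S[i][j] = max(0, s(a[i - 1], b[j - 1]) + prev)
            best_cell = max(best_cell, S[i][j])
    return best_cell


def toy_exchange(x, y):
    """Hypothetical stand-in for an exchange matrix lookup."""
    return 2 if x == y else -1


if __name__ == "__main__":
    # Ungapped local match of ACDEF surrounded by unrelated residues.
    print(local_score("TTTACDEFTTT", "GGACDEFGG", toy_exchange, p_open=10, p_ext=1))
    # One residue (G) missing from the second sequence, with cheap gaps.
    print(local_score("ACDEFGHIKL", "ACDEFHIKL", toy_exchange, p_open=2, p_ext=1))
```

The second call shows that a gap is only bridged when the penalty is smaller than the score gained by joining the two matching segments.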
Dot plots
• Way of representing (visualising) sequence
similarity without doing dynamic
programming (DP)
• Make same matrix, but locally represent
sequence similarity by averaging using a
window
• See Lesk’s book pp. 167-171
Comparing two sequences
We want to be able to choose the best alignment between two
sequences.
A simple method of finding similarities between two sequences is to
use dot plots. The first sequence to be compared is assigned to the
horizontal axis and the second is assigned to the vertical axis.
Dot plots can be filtered by window approaches (to calculate running averages) and applying a threshold. They can identify insertions, deletions, inversions.
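A filtered dot plot can be sketched in a few lines. The example below is an illustration only: it counts identities in a sliding window along each diagonal (equivalent to a running average multiplied by the window length) and marks a dot where the count reaches a threshold. The function name and default parameter values are hypothetical.

```python
def dot_plot(seq_x, seq_y, window=3, threshold=2):
    """Return the dot plot as a list of text rows.

    seq_x runs along the horizontal axis, seq_y along the vertical axis.
    A '*' is placed where the window starting at that position contains
    at least `threshold` identical residues; '.' otherwise.
    """
    rows = []
    for i in range(len(seq_y) - window + 1):
        row = ""
        for j in range(len(seq_x) - window + 1):
            # Number of identities along the diagonal inside the window.
            matches = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            row += "*" if matches >= threshold else "."
        rows.append(row)
    return rows


if __name__ == "__main__":
    for line in dot_plot("ACDEFG", "ACDEFG", window=3, threshold=3):
        print(line)
```

Comparing a sequence with itself gives an unbroken main diagonal; an insertion or deletion in one sequence would show up as a diagonal that shifts sideways.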