Protein Architecture: Four Levels
(A) Sequence alignment
Problem: How similar are two sequences?
s = B A N A N A
t = A N A N A S
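More precisely: #1) Find the weight of the transformation that turns s into t by a specific sequence of mutations
[Diagram: two mutation paths from s to t: s → x → y → t with step weights p1, p2, p3, and s → x' → y' → t with step weights p1', p2', p3']
$p = p_1 p_2 p_3 + p_1' p_2' p_3'$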
#2) Align the amino acid sequences such that the weight of the mutations is maximised, e.g.:
B A N A N A
A N A N A S
Requirement: “Scoring matrix” for
• mutations $s_i \to t_i$: $p_{s_i, t_i}$
• insertion / deletion (gap): $p_G$
Common scoring matrices: BLOSUM, PAM matrices
[Figure: BLOSUM80 scoring matrix]
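(A) Sequence alignment
Example: s = R I - L V S D K V I
         t = R I S L V - - K A I
$p = 1 \cdot 1 \cdot p_G \cdot 1 \cdot 1 \cdot p_G^2 \cdot 1 \cdot p_{VA} \cdot 1$
[Figure: alignment grid spanned by the residues $s_i$ and $t_j$; every alignment is a path through the grid. A diagonal step from (i-1, j-1) to (i, j) carries the weight $p(s_i, t_j)$ (1 for identical residues, e.g. $p_{VA}$ for V → A); a step from (i-1, j) or (i, j-1) is a gap and carries the weight $p_G$. The node weights are $w_{i-1,j-1}$, $w_{i-1,j}$, $w_{i,j-1}$ and $w_{i,j}$.]
Task:
• Find the shortest (weighted) path (#2)
• Sum over all paths (#1)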
Number of possible paths/alignments: $\binom{2n}{n} \approx \frac{2^{2n}}{\sqrt{\pi n}}$
☞ n = 100: $10^{59}$
☞ n = 1000: $10^{600}$
→ NP-problem?
No! Needleman / Wunsch (1970)
Smith / Waterman (1976)
Idea (analogous to the path integral for the Schrödinger equation):
Compute the sum $w_{ij}$ over all paths to (i, j) recursively:
$w_{ij} = w_{i-1,j-1}\, p_{s_i,t_j} + w_{i-1,j}\, p_G + w_{i,j-1}\, p_G$ (solves #1)
$w_{ij} = \max\{ w_{i-1,j-1}\, p_{s_i,t_j},\; w_{i-1,j}\, p_G,\; w_{i,j-1}\, p_G \}$ (solves #2)
Computational cost: $O(n^2)$ (like a route planner)
“Dynamic programming”
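The two recursions can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the slides: the function name `alignment_weights` and the toy weights (1 for a match, 0.1 for any mismatch, 0.2 for a gap) are assumptions standing in for a real scoring matrix such as BLOSUM.

```python
def alignment_weights(s, t, p_match=1.0, p_mismatch=0.1, p_gap=0.2):
    """Fill the dynamic-programming grid for sequences s and t.

    Returns (total, best):
      total = sum of the weights of all paths to (len(s), len(t))  (task #1)
      best  = weight of the single heaviest path                   (task #2)
    The weights are toy values (assumption), not a real scoring matrix.
    """
    n, m = len(s), len(t)
    total = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    total[0][0] = best[0][0] = 1.0  # empty alignment has weight 1

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            steps_total, steps_best = [], []
            if i > 0 and j > 0:  # diagonal step: s[i-1] aligned with t[j-1]
                p = p_match if s[i - 1] == t[j - 1] else p_mismatch
                steps_total.append(total[i - 1][j - 1] * p)
                steps_best.append(best[i - 1][j - 1] * p)
            if i > 0:  # gap in t
                steps_total.append(total[i - 1][j] * p_gap)
                steps_best.append(best[i - 1][j] * p_gap)
            if j > 0:  # gap in s
                steps_total.append(total[i][j - 1] * p_gap)
                steps_best.append(best[i][j - 1] * p_gap)
            total[i][j] = sum(steps_total)  # recursion for #1
            best[i][j] = max(steps_best)    # recursion for #2
    return total[n][m], best[n][m]


total, best = alignment_weights("BANANA", "ANANAS")
print(total, best)
```

With these toy weights the heaviest path for BANANA / ANANAS uses two gaps and five matches. Both grids are filled in one pass over (n+1)·(m+1) cells, which is where the $O(n^2)$ cost comes from.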
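Close relation to Smoluchowski/Feynman path integrals
$\psi(x_1, t_1) = \int dx_0\, \psi(x_0, t_0) \int_{\text{all paths}} \mathcal{D}x(t)\, \exp\left( \frac{i}{\hbar} \int_A^B dt\, L(x, \dot{x}, t) \right)$
action: $e^{iS/\hbar} = \exp\left( \frac{i}{\hbar} \int_A^B dt\, L(x, \dot{x}, t) \right)$
[Figure: paths from $(x_0, t_0)$ to $(x_1, t_1)$]
Discretisation: $x_0, x_1, x_2, \ldots, x_n$
$\psi(x_n, t_n) = \int dx_0\, \psi(x_0, t_0) \int dx_1\, e^{\frac{i}{\hbar} \int_{t_0}^{t_1} dt\, L} \cdots \int dx_{n-1}\, e^{\frac{i}{\hbar} \int_{t_{n-1}}^{t_n} dt\, L}$
$\psi(x_{i+1}, t_{i+1}) = \int dx_i\, \psi(x_i, t_i)\, e^{\frac{i}{\hbar} \int_{t_i}^{t_{i+1}} dt\, L}$
$= \int dx_i\, \psi(x_i, t_i) \exp\left( \frac{i}{\hbar} \int_{t_i}^{t_{i+1}} dt \left( \frac{\dot{x}^2}{2} - V(x) \right) \right)$
$= \int dx_i\, \psi(x_i, t_i) \exp\left( \frac{i}{\hbar} \int_{t_i}^{t_{i+1}} dt \left( \frac{1}{2} \left( \frac{x_{i+1} - x_i}{\Delta t} \right)^2 - V(x) \right) \right)$
develop ψ(x,t) in powers of Δx …
$\frac{\partial \psi}{\partial t} = \frac{i}{\hbar} \left( \frac{\hbar^2}{2} \frac{\partial^2}{\partial x^2} - V(x) \right) \psi$ (Schrödinger equation, unit mass)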
Sequence alignment: hemoglobin of mammals
Sequence comparison: hemoglobin alpha chain vs beta chain
window size: 31
match: +5
mismatch: -4
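A windowed comparison with these parameters (often drawn as a dot plot) can be sketched as follows. The function name `dotplot_scores` is hypothetical, and since the hemoglobin sequences are not reproduced here, the usage example runs on the toy strings from the alignment slides with a shorter window.

```python
def dotplot_scores(a, b, window=31, match=5, mismatch=-4):
    """Score every pair of windows (a[i:i+window], b[j:j+window]).

    Each compared position contributes `match` if the residues are equal
    and `mismatch` otherwise. Returns a dict {(i, j): score}.
    """
    scores = {}
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            scores[i, j] = sum(
                match if a[i + k] == b[j + k] else mismatch
                for k in range(window)
            )
    return scores


# Toy usage with a window of 3 instead of 31 (assumption, for brevity).
scores = dotplot_scores("BANANA", "ANANAS", window=3)
print(scores[1, 0])  # window "ANA" against "ANA"
```

High-scoring cells that line up along a diagonal indicate similar stretches; plotting only the cells above a chosen threshold gives the usual dot plot picture.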
(B) Phylogenetic trees
Given: N sequences s(1), … s(N)
Task: Find most probable evolutionary tree:
• Example
s(1) = B A N A N A
s(2) = A N A N A S
s(3) = H O T D O G
[Figure: tree over the three sequences; axis: distance]
• Cost: NP-complete
☞ Trees for different proteins are (usually) similar
☞ Reconstruction of evolution
Problem: horizontal gene transfer
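As a first step towards a tree for the three example sequences, one can compute pairwise distances and join the closest pair. The sketch below uses the plain edit distance (Levenshtein) as the distance measure, which is an assumption; the slides do not say which distance is meant, and the full search over all tree topologies is the expensive part that this sketch does not attempt.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimal number of substitutions,
    insertions and deletions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


seqs = {"s1": "BANANA", "s2": "ANANAS", "s3": "HOTDOG"}
names = sorted(seqs)
dist = {
    (x, y): edit_distance(seqs[x], seqs[y])
    for x in names for y in names if x < y
}
closest = min(dist, key=dist.get)  # the pair that would be joined first
print(dist, closest)
```

For this example the two fruit names end up as neighbours, and HOTDOG joins them at a larger distance.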
[Figure: Phylogenetic tree of dogs. Nature 438, 803-819]
[Figure: Phylogenetic tree of vertebrates. Nature 496, 311-316]
[Figure: Phylogenetic tree of ribosomal RNA. Wikimedia]
[Figure: Phylogenetic tree of Indo-European languages. Science 337, 957-960 (2012)]
(C) Structure prediction: from sequence to structure
• “Folding problem”
• Ab initio → only possible for smallest proteins (since recently)
(a) Secondary structure prediction
Chou-Fasman method (empirical)
• Calculate properties from known structures
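$P(S|A) = \frac{P(A|S)}{P(A)} = \frac{n_{A,S}/n_S}{n_A/n}$
(A: amino acid, S: secondary structure; the ratio is a propensity, values above 1 mean that A is enriched in S)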
• Search for regions with high (average) propensities for certain secondary structures
• Search secondary structure boundaries (e.g., “helix breakers” such as proline)
☞ 75% prediction rate (compare to random guess: 33%)
log frequencies of amino acids in secondary structure elements
(P<a>, P<b>, P<t>: α-helix, β-sheet, turn; Hp: hydropathy)
A.A.   P<a>   P<b>   P<t>     Hp
A      1.42   0.83   0.66    1.80
R      0.98   0.93   0.95   -4.50
N      0.67   0.89   1.56   -3.50
D      1.01   0.54   1.46   -3.50
C      0.70   1.19   1.19    2.50
Q      1.11   1.10   0.98   -3.50
E      1.51   0.37   0.74   -3.50
G      0.57   0.75   1.56   -0.40
H      1.00   0.87   0.95   -3.20
I      1.08   1.60   0.47    4.50
L      1.21   1.30   0.59    3.80
K      1.16   0.74   1.01   -3.90
M      1.45   1.05   0.60    1.90
F      1.13   1.38   0.60    2.80
P      0.57   0.55   1.52   -1.60
S      0.77   0.75   1.43   -0.80
T      0.83   1.19   0.96   -0.70
W      1.08   1.37   0.96   -0.90
Y      0.69   1.47   1.14   -1.30
V      1.06   1.70   0.50    4.20
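The two Chou-Fasman ingredients, the propensity from counts and the search for regions with a high average propensity, can be sketched as follows. The helix values are the P<a> column of the table; the function names, the window length of 6 and the threshold of 1.05 are assumptions chosen for illustration, and the real method adds nucleation and boundary rules that are left out here.

```python
# Helix propensities P<a>, copied from the table in this document.
P_HELIX = {
    "A": 1.42, "R": 0.98, "N": 0.67, "D": 1.01, "C": 0.70,
    "Q": 1.11, "E": 1.51, "G": 0.57, "H": 1.00, "I": 1.08,
    "L": 1.21, "K": 1.16, "M": 1.45, "F": 1.13, "P": 0.57,
    "S": 0.77, "T": 0.83, "W": 1.08, "Y": 0.69, "V": 1.06,
}


def propensity(n_as, n_s, n_a, n):
    """(n_AS / n_S) / (n_A / n): how enriched amino acid A is in structure S."""
    return (n_as / n_s) / (n_a / n)


def window_means(seq, table, window=6):
    """Average propensity over every window of the given length."""
    return [
        sum(table[c] for c in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def helix_candidates(seq, window=6, threshold=1.05):
    """Start indices of windows whose mean helix propensity exceeds
    the threshold (window and threshold are illustrative assumptions)."""
    means = window_means(seq, P_HELIX, window)
    return [i for i, m in enumerate(means) if m > threshold]


print(helix_candidates("AEAEAEGPGPGP"))
```

The same window scan with the P<b> or P<t> column would flag sheet or turn candidates; competing assignments then have to be resolved, which is where the empirical rules of the method come in.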
(C) Structure prediction
(b) Homology modelling
Observation in PDB: Similar sequence (about 30% identity or more) → similar structure
☞ Strategy:
• Search for a homologous sequence with known structure
• Align the sequences
• Change the differing amino acids
• Meet steric criteria (avoid atomic overlaps), and other criteria
• Optimize rotamers
• Critical: correct alignment
[Figure: Aquaporin-1 and GlpF structures; GlpF compared with a GlpF model based on Aqp1]
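Whether a template qualifies is usually judged by the sequence identity of the alignment. A minimal sketch, assuming identity is counted as identical columns divided by all alignment columns (other conventions divide by the shorter sequence length); the function name `percent_identity` is hypothetical and the input is the example alignment from the sequence alignment part of this document.

```python
def percent_identity(row_s, row_t, gap="-"):
    """Percentage of alignment columns with identical residues.

    row_s and row_t are the two rows of an alignment (same length,
    gaps written as '-'). Denominator: all columns (assumption).
    """
    if len(row_s) != len(row_t):
        raise ValueError("alignment rows must have equal length")
    identical = sum(
        1 for a, b in zip(row_s, row_t) if a == b and a != gap
    )
    return 100.0 * identical / len(row_s)


print(percent_identity("RI-LVSDKVI", "RISLV--KAI"))
```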
(C) Structure prediction: from sequence to structure
(c) Protein threading
No homologous structure available?
☞ Which known fold does the sequence fit best?
Sequence / structure statistics: $p(a_i, s_j)$, tabulated for each amino acid $a$ (A, R, N, D, …) and each secondary structure $s$ (α-helix, β-sheet, turn)
☞ Find the known fold with the maximal $\sum_{i=1}^{N} \ln p(a_i, s_j)$
Improvements: better statistics, e.g. consider triplets, spatial neighbours, cys-cys bonds, …
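The threading score can be sketched by summing $\ln p(a_i, s_j)$ along the sequence for each candidate fold and keeping the best one. Everything concrete below is an assumption made for illustration: the three "folds" are toy strings of states (H = helix, E = sheet, T = turn), and a few propensities from the secondary structure table in this document stand in for the statistics $p(a, s)$.

```python
import math

# Stand-in for p(a, s): propensities for six amino acids, taken from the
# secondary structure table in this document (assumption: used as weights).
P = {
    "A": {"H": 1.42, "E": 0.83, "T": 0.66},
    "E": {"H": 1.51, "E": 0.37, "T": 0.74},
    "V": {"H": 1.06, "E": 1.70, "T": 0.50},
    "I": {"H": 1.08, "E": 1.60, "T": 0.47},
    "G": {"H": 0.57, "E": 0.75, "T": 1.56},
    "P": {"H": 0.57, "E": 0.55, "T": 1.52},
}


def threading_score(seq, fold):
    """Sum of ln p(a_i, s_i) with the sequence placed onto the fold."""
    if len(seq) != len(fold):
        raise ValueError("sequence and fold must have equal length")
    return sum(math.log(P[a][s]) for a, s in zip(seq, fold))


def best_fold(seq, folds):
    """Name of the fold with the maximal threading score."""
    return max(folds, key=lambda name: threading_score(seq, folds[name]))


# Hypothetical fold library: one state per residue position.
folds = {"helix": "HHHHHH", "sheet": "EEEEEE", "hairpin": "EETTEE"}
print(best_fold("VIGPVI", folds), best_fold("AEAEAE", folds))
```

A real threading run also has to handle folds and sequences of different lengths, which brings back an alignment step of the dynamic-programming kind.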
Amino acid   non-polar surface area [Å^2]   estimated hydrophobic effect [kcal/mol]
Trp          236                            4.11
Leu          164                            4.10
Ile          155                            3.88
Phe          194                            3.46
Met          137                            3.43
Val          135                            3.38
Pro          124                            3.10
Lys          122                            3.05
Tyr          154                            2.81
His          129                            2.45
Thr           90                            2.25
Arg           89                            2.23
Ala           86                            2.15
Glu           69                            1.73
Gln           66                            1.65
Ser           56                            1.40
Cys           48                            1.20
Gly           47                            1.18
Asp           45                            1.13
Asn           42                            1.05
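(C) Structure prediction
(d) Empirical potentials
E.g., ψ-angles between amino acids, e.g., Ala-Asn:
$V(\psi) = -k_B T \ln h(\psi)$
[Figure: histogram $h(\psi)$ and the derived potential $V(\psi)$, with corresponding features labelled 1, 2, 3]
☞ 20x20 pair interactions $V_{ij}$
☞ minimize $\sum_{i=1}^{N} V_{S_i, S_{i+1}}$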
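The step from an observed histogram to an empirical potential (Boltzmann inversion) and the sum over neighbouring pairs can be sketched as follows. The function names, the temperature of 300 K and the toy inputs are assumptions; real potentials are derived from angle histograms collected over many known structures.

```python
import math

K_B = 0.0019872  # Boltzmann constant in kcal/(mol K)


def potential_from_histogram(counts, temperature=300.0):
    """V = -k_B T ln h for each bin, with h the normalised histogram.

    Empty bins get an infinite potential (never observed).
    """
    total = sum(counts)
    kt = K_B * temperature
    return [
        -kt * math.log(c / total) if c > 0 else math.inf
        for c in counts
    ]


def chain_energy(states, pair_potential):
    """Sum of V[S_i, S_{i+1}] over all neighbouring pairs of the chain."""
    return sum(
        pair_potential[a, b] for a, b in zip(states, states[1:])
    )


# Toy histogram over four angle bins (hypothetical counts).
print(potential_from_histogram([10, 60, 25, 5]))
```

Frequently observed angles get a low potential and rare ones a high potential; minimising the chain energy then favours conformations that resemble what is seen in known structures.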
Bottom line: structure prediction is still not very accurate or reliable!
Ramachandran plots