Protein Architecture: Four Levels
Cost per genome
☞ Cost of first genome (Human Genome Project, started 1990): 3 billion US$
Exponential decay of computing cost
Soon below $1,000?
Hayden, Nature 2014
http://sulab.org/
Structures in the Protein Data Bank
[Figure: number of structures in the Protein Data Bank, total and per year, solved by X-ray crystallography and by NMR spectroscopy; axis marks at 10,000 and 80,000]
[Figure: membrane proteins of known structure; source: Stephen White lab, UC Irvine]
Today’s lecture
Much more sequence information is available than structural information!
(A) Sequence alignments
How similar are two (amino acid) sequences?
(B) Phylogenetic trees
Find evolutionary tree from set of sequences
(C) Structure prediction
Predict protein structure from amino acid sequence
(A) Sequence alignment
Why sequence alignments?
• Multiple sequence alignments
→ Identify amino acids that were conserved during evolution
→ Relevant for protein function or protein stability
• Quantify how similar two (or more) sequences are
→ Quantify distance in evolution, build evolutionary (phylogenetic) trees
• Find the best (or most likely) alignment of two sequences
→ homology modelling
(A) Sequence alignment
Problem: How similar are two sequences?
s = B A N A N A
t = A N A N A S
Shifted by one position:
s = B A N A N A
t =   A N A N A S
More precisely:
#1) Find weights of transformation that turns s into t by a specific sequence of mutations
[Figure: two pathways from s to t, s → x → y → t with weights p_1, p_2, p_3 and s → x' → y' → t with weights p_1', p_2', p_3']
$p = p_1 p_2 p_3 + p_1' p_2' p_3'$
Possible transformations:
• mutation
• insertion / deletion
#2) Align the sequences such that the weight of the transformation is maximised, e.g.:
s = B A N A N A -
t = - A N A N A S
Requirement: “Scoring matrix” for transformations:
• mutations s_i → t_i: weight p_{s_i,t_i}
• insertion / deletion (gap): weight p_G
Common scoring matrices: BLOSUM, PAM matrices
Scoring matrix
BLOSUM62 scoring matrix (BLOck SUbstitution Matrix, 1992)
Diagonal: weight for keeping the amino acid
• Cysteine: large weight, since it is important for 3D structure (disulfide bonds)
• Tryptophan: largest amino acid, mutation unlikely
Scoring matrix
Off-diagonal elements: weight for mutations
• Leucine to aspartate: hydrophobic to anionic, mutation unlikely
• Leucine to isoleucine: similar amino acids, mutation likely
Scoring matrix
Example with the BLOSUM62 scoring matrix (1992):
E K N G F P A
|     |     |
E M Q G R W A
BLOSUM62 score = 7
One- to three-letter code: E = Glu, K = Lys, M = Met, N = Asn, Q = Gln, G = Gly, F = Phe, R = Arg, P = Pro, W = Trp, A = Ala
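The score of 7 can be checked by summing the substitution scores column by column. The sketch below hard-codes only the seven BLOSUM62 entries this example needs; the names `BLOSUM62_SUBSET`, `pair_score` and `alignment_score` are illustrative and not taken from any library.

```python
# Subset of BLOSUM62 entries needed for this example only.
# The matrix is symmetric, so each unordered pair is stored once.
BLOSUM62_SUBSET = {
    ("E", "E"): 5,
    ("K", "M"): -1,
    ("N", "Q"): 0,
    ("G", "G"): 6,
    ("F", "R"): -3,
    ("P", "W"): -4,
    ("A", "A"): 4,
}


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up the substitution score for residues a and b (order-independent)."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def alignment_score(s, t, matrix=BLOSUM62_SUBSET):
    """Sum the substitution scores over the columns of a gap-free alignment."""
    if len(s) != len(t):
        raise ValueError("a gap-free alignment needs sequences of equal length")
    return sum(pair_score(a, b, matrix) for a, b in zip(s, t))


score = alignment_score("EKNGFPA", "EMQGRWA")
print(score)  # 7
```

The three identical columns (E, G, A) contribute 5 + 6 + 4 = 15; the four substitutions contribute -1 + 0 - 3 - 4 = -8.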
(A) Sequence alignment
Example:
s = R I - L V S D K V I
t = R I S L V - - K A I
$p = 1 \cdot 1 \cdot p_G \cdot 1 \cdot 1 \cdot p_G^2 \cdot 1 \cdot p_{VA} \cdot 1$
Here simplified: p = 1 for keeping an amino acid
[Figure: alignment grid with the residues s_i of s along one axis and the residues t_j of t along the other; the node weight w_{i,j} is reached from w_{i-1,j-1} by a diagonal step with weight p(s_i,t_j), and from w_{i-1,j} or w_{i,j-1} by a gap step with weight p_G]
Every possible alignment corresponds to one specific path!
Task:
• Find the shortest (weighted) path (#2)
• Sum over all paths (#1)
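As a small check of the product for the example alignment of s = R I - L V S D K V I and t = R I S L V - - K A I, the weight of one given path can be multiplied up column by column. The numerical values of `P_GAP` and `P_SUB` below are placeholders chosen for illustration, not values from the lecture.

```python
def alignment_weight(s, t, p_gap, p_sub):
    """Multiply the column weights of one alignment.

    A match contributes 1 (the simplification used on the slide),
    a gap contributes p_gap, a mutation a -> b contributes p_sub[(a, b)].
    """
    if len(s) != len(t):
        raise ValueError("aligned sequences must have equal length")
    weight = 1.0
    for a, b in zip(s, t):
        if a == "-" or b == "-":
            weight *= p_gap
        elif a != b:
            weight *= p_sub[(a, b)]
    return weight


P_GAP = 0.1                  # hypothetical gap weight p_G
P_SUB = {("V", "A"): 0.5}    # hypothetical mutation weight p_VA

w = alignment_weight("RI-LVSDKVI", "RISLV--KAI", P_GAP, P_SUB)
print(w)  # equals p_G**3 * p_VA up to floating-point rounding
```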
(A) Sequence alignment
Number of possible paths/alignments: $\binom{2n}{n} \approx \frac{2^{2n}}{\sqrt{\pi n}}$
☞ n = 100: 10^59
☞ n = 1000: 10^600
→ NP-problem?
No! Needleman / Wunsch (1970), Smith / Waterman (1976)
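The orders of magnitude quoted for n = 100 and n = 1000 can be reproduced with exact integer arithmetic and compared with the approximation 2^(2n)/sqrt(πn). This is a minimal sketch; the function names are illustrative.

```python
import math


def n_paths(n):
    """Exact central binomial coefficient C(2n, n)."""
    return math.comb(2 * n, n)


def log10_approx(n):
    """log10 of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


for n in (100, 1000):
    exact_log10 = math.log10(n_paths(n))  # math.log10 accepts large integers
    print(n, round(exact_log10, 2), round(log10_approx(n), 2))
```

Enumerating all alignments is therefore hopeless even for short proteins, which is what motivates the recursive scheme.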
Idea (analogous to path integral for Schrödinger eq.):
Compute the sum w_{ij} over all paths to (i,j) recursively:
$w_{ij} = w_{i-1,j-1}\, p_{s_i,t_j} + w_{i-1,j}\, p_G + w_{i,j-1}\, p_G$ (solves #1)
$w_{ij} = \mathrm{Max}\{ w_{i-1,j-1}\, p_{s_i,t_j},\; w_{i-1,j}\, p_G,\; w_{i,j-1}\, p_G \}$ (solves #2)
Computational cost: O(n^2) (like a route planner)
“Dynamic programming”
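The two recursions differ only in how the three incoming contributions are combined, so one table-filling routine can serve both. The sketch below assumes boundary conditions that the slides do not state (w_00 = 1, and pure gap steps along the edges) and uses a toy substitution weight; `align_dp`, `toy_p_sub` and `P_GAP` are hypothetical names and values.

```python
def align_dp(s, t, p_sub, p_gap, combine):
    """Fill w[i][j] recursively in O(len(s) * len(t)) steps.

    combine=sum gives the sum over all paths (#1),
    combine=max gives the weight of the best single path (#2).
    """
    n, m = len(s), len(t)
    w = [[0.0] * (m + 1) for _ in range(n + 1)]
    w[0][0] = 1.0  # assumed boundary condition: empty alignment has weight 1
    for i in range(1, n + 1):
        w[i][0] = w[i - 1][0] * p_gap  # leading gaps in t
    for j in range(1, m + 1):
        w[0][j] = w[0][j - 1] * p_gap  # leading gaps in s
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w[i][j] = combine([
                w[i - 1][j - 1] * p_sub(s[i - 1], t[j - 1]),  # match or mutation
                w[i - 1][j] * p_gap,                          # gap in t
                w[i][j - 1] * p_gap,                          # gap in s
            ])
    return w[n][m]


def toy_p_sub(a, b):
    """Toy weights: 1 for keeping a residue, 0.1 for any mutation."""
    return 1.0 if a == b else 0.1


P_GAP = 0.2  # hypothetical gap weight

best = align_dp("BANANA", "ANANAS", toy_p_sub, P_GAP, max)
total = align_dp("BANANA", "ANANAS", toy_p_sub, P_GAP, sum)
print(best, total)
```

With these toy weights the best path is the shifted alignment with two gaps and five matches, weight P_GAP squared. Practical implementations usually work with log-weights (additive scores) to avoid numerical underflow for long sequences.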
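Close relation to Smoluchowski/Feynman path integrals
[Figure: paths from (x_0, t_0) to (x_1, t_1)]
$\psi(x_1, t_1) = \int dx_0\, \psi(x_0, t_0) \int_{\text{all paths}} \mathcal{D}x(t)\, \exp\left( \frac{i}{\hbar} \int_A^B dt\, L(x, \dot{x}, t) \right)$
with the action S:
$e^{iS/\hbar} = \exp\left( \frac{i}{\hbar} \int_A^B dt\, L(x, \dot{x}, t) \right)$
Discretisation: x_0, x_1, x_2, …, x_n
$\psi(x_n, t_n) = \int dx_0\, \psi(x_0, t_0) \int dx_1\, e^{\frac{i}{\hbar} \int_{t_0}^{t_1} dt\, L} \cdots \int dx_{n-1}\, e^{\frac{i}{\hbar} \int_{t_{n-1}}^{t_n} dt\, L}$
$\psi(x_{i+1}, t_{i+1}) = \int dx_i\, \psi(x_i, t_i)\, e^{\frac{i}{\hbar} \int_{t_i}^{t_{i+1}} dt\, L}$
$= \int dx_i\, \psi(x_i, t_i) \exp\left( \frac{i}{\hbar} \int_{t_i}^{t_{i+1}} dt \left( \frac{\dot{x}^2}{2} - V(x) \right) \right)$
$= \int dx_i\, \psi(x_i, t_i) \exp\left( \frac{i}{\hbar} \int_{t_i}^{t_{i+1}} dt \left( \frac{1}{2} \left( \frac{x_{i+1} - x_i}{\Delta t} \right)^2 - V(x) \right) \right)$
Develop ψ(x,t) in powers of Δx, … → Schrödinger equation:
$\frac{\partial \psi}{\partial t} = \frac{i}{\hbar} \left( \frac{1}{2} \Delta - V(x) \right) \psi$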
Sequence alignment of a ribosomal protein P0
Source: Wikipedia
Sequence comparison: hemoglobin alpha chain vs beta chain
Dot plots: highlight similar regions in two sequences
[Figure: dot plot; axes: residue number of alpha chain, residue number of beta chain]
Details on filtering: window size: 31, match: +5, mismatch: -4
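A filtered dot plot with these parameters can be sketched as follows: for every pair of positions, the match and mismatch scores are summed along the diagonal inside a window, and a dot is set when the sum exceeds a cut-off. The cut-off `threshold` and the handling of window positions beyond the sequence ends (they are skipped) are assumptions, since only window size and scores are given.

```python
def dot_plot(s, t, window=31, match=5, mismatch=-4, threshold=0):
    """Return the set of (i, j) whose windowed diagonal score exceeds threshold."""
    half = window // 2
    dots = set()
    for i in range(len(s)):
        for j in range(len(t)):
            score = 0
            for k in range(-half, half + 1):
                a, b = i + k, j + k
                # Skip window positions that fall outside either sequence.
                if 0 <= a < len(s) and 0 <= b < len(t):
                    score += match if s[a] == t[b] else mismatch
            if score > threshold:
                dots.add((i, j))
    return dots


# Short toy sequences need a short window; 31 is meant for whole protein chains.
dots = dot_plot("BANANA", "ANANAS", window=3)
print(sorted(dots))
```

Similar regions show up as runs of dots along a diagonal; a larger window suppresses isolated chance matches.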
(B) Phylogenetic trees
Given: N sequences s(1), … s(N)
Task: Find most probable evolutionary tree:
French:   un    deux  trois  quatre   cinq
German:   eins  zwei  drei   vier     fünf
Italian:  uno   due   tre    quattro  cinque
Spanish:  un    dos   tres   cuatro   cinco
English:  one   two   three  four     five
[Figure: resulting tree with the leaves German, English, French, Spanish, Italian]
(B) Phylogenetic trees
Given: N sequences s(1), … s(N)
Task: Find most probable evolutionary tree:
• Example
s(1) = B A N A N A
s(2) = A N A N A S
s(3) = H O T D O G
[Figure: tree of the three sequences, branch lengths given by distance]
• Cost: NP-complete
☞ Trees for different proteins are (usually) similar
☞ Reconstruction of evolution
Problem: horizontal gene transfer
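Distance-based tree building starts from a matrix of pairwise distances. As a stand-in for a properly weighted alignment score, the sketch below uses the unit-cost edit distance (each mutation, insertion or deletion costs 1), which is an assumption made here for simplicity; the pair with the smallest distance would be joined first.

```python
def edit_distance(s, t):
    """Unit-cost edit distance, computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (a != b),  # match (0) or mutation (1)
            ))
        prev = curr
    return prev[-1]


seqs = {"s1": "BANANA", "s2": "ANANAS", "s3": "HOTDOG"}
names = sorted(seqs)
dist = {
    (a, b): edit_distance(seqs[a], seqs[b])
    for a in names for b in names if a < b
}
closest = min(dist, key=dist.get)
print(dist)     # {('s1', 's2'): 2, ('s1', 's3'): 6, ('s2', 's3'): 6}
print(closest)  # ('s1', 's2')
```

BANANA and ANANAS are joined first, and HOTDOG attaches at a larger distance, as in the example tree.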
Phylogenetic trees
[Figure: phylogenetic tree of dogs; Nature 438, 803-819]
Phylogenetic trees
[Figure: phylogenetic tree of vertebrates; Nature 496, 311-316]
Phylogenetic trees
[Figure: phylogenetic tree of ribosomal RNA; Wikimedia]
[Figure: phylogenetic tree of Indo-European languages; Science 337, 957-960 (2012)]
(C) Structure prediction: from sequence to structure
• Recall: many more sequences than structures available
• “Folding problem”
• Ab initio → only possible for smallest proteins (since recently)
(a) Secondary structure prediction
Chou-Fasman method (empirical)
• Calculate propensities from known structures (A: amino acid, S: secondary structure):
$\frac{P(S|A)}{P(S)} = \frac{P(A|S)}{P(A)} = \frac{n_{A,S}/n_S}{n_A/n}$
• Search for regions with high (average) propensities for certain secondary structures
• Search secondary structure boundaries (e.g., “helix breakers” such as proline)
☞ 75% prediction rate (compare to random guess: 33%)
Propensities of amino acids for secondary structure elements
(P<a>: alpha helix, P<b>: beta sheet, P<t>: turn, Hp: hydropathy)
A.A.  P<a>  P<b>  P<t>  Hp
A     1.42  0.83  0.66   1.80
R     0.98  0.93  0.95  -4.50
N     0.67  0.89  1.56  -3.50
D     1.01  0.54  1.46  -3.50
C     0.70  1.19  1.19   2.50
Q     1.11  1.10  0.98  -3.50
E     1.51  0.37  0.74  -3.50
G     0.57  0.75  1.56  -0.40
H     1.00  0.87  0.95  -3.20
I     1.08  1.60  0.47   4.50
L     1.21  1.30  0.59   3.80
K     1.16  0.74  1.01  -3.90
M     1.45  1.05  0.60   1.90
F     1.13  1.38  0.60   2.80
P     0.57  0.55  1.52  -1.60
S     0.77  0.75  1.43  -0.80
T     0.83  1.19  0.96  -0.70
W     1.08  1.37  0.96  -0.90
Y     0.69  1.47  1.14  -1.30
V     1.06  1.70  0.50   4.20
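The two ingredients of the Chou-Fasman idea, a propensity computed from counts and a search for stretches with a high average propensity, can be sketched as follows. The helix values in `P_HELIX` are the P<a> column of the table; the window length of 6 and the example counts are assumptions for illustration, and the full method has further rules (nucleation, extension, breakers) that are not modelled here.

```python
# Helix propensities P<a>, copied from the table.
P_HELIX = {
    "A": 1.42, "R": 0.98, "N": 0.67, "D": 1.01, "C": 0.70,
    "Q": 1.11, "E": 1.51, "G": 0.57, "H": 1.00, "I": 1.08,
    "L": 1.21, "K": 1.16, "M": 1.45, "F": 1.13, "P": 0.57,
    "S": 0.77, "T": 0.83, "W": 1.08, "Y": 0.69, "V": 1.06,
}


def propensity(n_as, n_s, n_a, n):
    """(n_AS / n_S) / (n_A / n): how enriched amino acid A is in structure S."""
    return (n_as / n_s) / (n_a / n)


def window_average(seq, table, window=6):
    """Average propensity over every window of the given length."""
    return [
        sum(table[a] for a in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


# Hypothetical counts: A occurs 200 times among 1000 residues,
# and 30 times among the 100 residues found in structure S.
print(propensity(30, 100, 200, 1000))

# Averages above 1 suggest helix formers, averages below 1 helix breakers.
print(window_average("EAMLKAGPNGSP", P_HELIX))
```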
(C) Structure prediction
(b) Homology modelling
Observation in PDB: Similar sequence (30% identity) → similar structure
☞ Strategy:
• Search homologous sequence with known structure
• Align sequences
• Change differing amino acids in template structure
• Meet steric criteria (avoid atomic overlaps), and other criteria
• Optimize rotamers
Critical: correct alignment
[Figure: crystal structures of Aquaporin-1 and GlpF; GlpF crystal structure compared with a GlpF model based on Aqp1 (bad due to wrong alignment)]
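Whether a template passes the identity criterion is read off the alignment. The sketch below counts identical columns and divides by the alignment length; other conventions for the denominator exist (for example the length of the shorter sequence), so this choice is an assumption. The function name is illustrative.

```python
def fraction_identical(s, t):
    """Fraction of alignment columns in which both sequences carry the same residue."""
    if len(s) != len(t):
        raise ValueError("aligned sequences must have equal length")
    same = sum(1 for a, b in zip(s, t) if a == b and a != "-")
    return same / len(s)


# The example alignment from part (A): 6 identical columns out of 10.
identity = fraction_identical("RI-LVSDKVI", "RISLV--KAI")
print(f"{100 * identity:.0f}% identity")  # 60% identity
```

Because the identity depends on the alignment, a wrong alignment both misjudges the template and places residues at the wrong positions in the model.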
(C) Structure prediction: from sequence to structure
(c) Protein threading
No homologous structure available?
☞ Into which known fold does the sequence fit best?
☞ Find the known fold with the maximal $\sum_{i=1}^{N} \ln p(a_i, s_i)$, where a_i is the amino acid at position i and s_i the structural state of the fold at that position
[Table: sequence / structure statistics p(a_i, s_j) for amino acids aa (A, R, N, D, …) and structures S (α-helix, β-sheet, turn)]
Improvements: better statistics, e.g. consider triplets, spatial neighbours, cys-cys bonds, …
E.g., make use of hydrophobicity of amino acids: hydrophobic residues are mainly buried inside.
amino acid  non-polar surface area [Å^2]  estimated hydrophobic effect [kcal/mol]
Trp         236                           4.11
Leu         164                           4.10
Ile         155                           3.88
Phe         194                           3.46
Met         137                           3.43
Val         135                           3.38
Pro         124                           3.10
Lys         122                           3.05
Tyr         154                           2.81
His         129                           2.45
Thr          90                           2.25
Arg          89                           2.23
Ala          86                           2.15
Glu          69                           1.73
Gln          66                           1.65
Ser          56                           1.40
Cys          48                           1.20
Gly          47                           1.18
Asp          45                           1.13
Asn          42                           1.05
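The threading score, a sum of log-probabilities of finding each amino acid in the structural state that a fold assigns to its position, can be sketched as below. The probabilities in `P`, the two folds in `FOLDS` and the state letters H (helix), E (sheet) and T (turn) are invented for illustration; real statistics would be counted from known structures.

```python
import math

# Hypothetical p(a, s) values for three amino acids and three states.
P = {
    ("A", "H"): 0.5, ("A", "E"): 0.2, ("A", "T"): 0.3,
    ("V", "H"): 0.3, ("V", "E"): 0.6, ("V", "T"): 0.1,
    ("G", "H"): 0.2, ("G", "E"): 0.2, ("G", "T"): 0.6,
}

# Hypothetical folds, written as one structural state per position.
FOLDS = {"all_helix": "HHHHHH", "hairpin": "EETTEE"}


def threading_score(seq, fold):
    """Sum of ln p(a_i, s_i) over all positions of the sequence."""
    if len(seq) != len(fold):
        raise ValueError("sequence and fold must have equal length")
    return sum(math.log(P[(a, s)]) for a, s in zip(seq, fold))


def best_fold(seq, folds):
    """Name of the fold with the maximal threading score."""
    return max(folds, key=lambda name: threading_score(seq, folds[name]))


print(best_fold("VVGGVV", FOLDS))
print(best_fold("AAAAAA", FOLDS))
```

A real threading program also has to align the sequence onto each fold, allowing gaps, which brings back the dynamic programming of part (A).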
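(C) Structure prediction
(d) Empirical potentials
E.g., ψ-angles between amino acids, e.g., Ala-Asn:
[Figure: histogram h(ψ) and the resulting potential V(ψ), with three states labelled 1, 2, 3]
$V(\psi) = -k_B T \ln h(\psi)$
☞ 20x20 pair interactions V_ij
☞ minimize $\sum_{i=1}^{N} V_{S_i,S_{i+1}}$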
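The step from an observed angle histogram to a potential is a Boltzmann inversion, V = -k_B T ln h. The sketch below applies it to a made-up three-bin histogram; the value of `KT` corresponds to roughly 300 K in kcal/mol, and normalising the histogram only shifts V by a constant.

```python
import math

KT = 0.596  # k_B * T in kcal/mol at about 300 K


def boltzmann_inversion(histogram, kT=KT):
    """Turn counts per bin into a potential V = -kT * ln(h), with h normalised.

    Empty bins get an infinite potential.
    """
    total = sum(histogram)
    return [
        -kT * math.log(count / total) if count > 0 else math.inf
        for count in histogram
    ]


# Hypothetical counts for three angular states 1, 2, 3.
V = boltzmann_inversion([10, 60, 30])
print([round(v, 2) for v in V])
```

The most populated state gets the lowest potential, and a population ratio of 6 between two states corresponds to a difference of kT ln 6.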
Ramachandran plots
• Another source of empirical potentials: Ramachandran plots
• Distribution of φ / ψ backbone angles
Ramachandran plot for glycine
Ramachandran plot for proline
Bottom line: structure prediction is still not very accurate or reliable!
Today’s summary
Learning from sequence alignments
• Multiple sequence alignments
→ Identify amino acids that were conserved during evolution
→ Relevant for protein function or protein stability
• Quantify how similar two (or more) sequences are
→ Quantify distance in evolution, build evolutionary (phylogenetic) trees
• Find the best (or most likely) alignment of two sequences
→ homology modelling
Structure prediction, from sequence to structure
• Homology modelling
• Threading
• Ab initio modelling, empirical potentials
Master / Bachelor projects
Interpretation of X-ray scattering experiments
Membrane biophysics
Contact: Jochen Hub, [email protected], phone 39-14189