Predicting protein 3D structure
from evolutionary sequence
variation
Thomas A. Hopf
Protein Prediction I, 06/18/2015
Outline
Prologue: Correlated mutations
Local vs. global models for 2D prediction
Application to 3D structure prediction
2 The structure prediction problem
genotype
phenotype
ACTGTGCACG
TAATGGCATC
3 Structure from sequence alone?
Christian Anfinsen, Nobel Prize for Chemistry 1972
4 Sequence-structure gap is not closing!
Marks, Hopf & Sander, Nature Biotechnology (2012)
5 A protein
6 Evolutionary selection leaves residue
covariation signature
easy
inverse problem
7 Folding proteins from evolutionary couplings
???
8 Outline
Prologue: Correlated mutations
Local vs. global models for 2D prediction
Application to 3D structure prediction
9 Try something simple: correlation between two columns
[Figure: multiple sequence alignment of eight sequences, with two columns i and j marked]
single column frequencies: $f_i(A_i)$, $f_j(A_j)$
column pair frequencies: $f_{ij}(A_i, A_j)$
$$ \frac{f_{ij}(A_i, A_j)}{f_i(A_i)\, f_j(A_j)} $$
To what extent do we see a pair of amino acids more/less often than expected by chance?
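The ratio on this slide can be computed directly from an alignment. The following is a minimal sketch, not the speaker's code: the function names and the toy alignment are hypothetical, and it uses plain counts with no pseudocounts or sequence reweighting.

```python
from collections import Counter


def column(alignment, i):
    """Return column i of an alignment given as equal-length strings."""
    return [seq[i] for seq in alignment]


def single_freqs(alignment, i):
    """f_i(A): fraction of sequences with amino acid A in column i."""
    counts = Counter(column(alignment, i))
    n = len(alignment)
    return {a: c / n for a, c in counts.items()}


def pair_freqs(alignment, i, j):
    """f_ij(A, B): fraction of sequences with A in column i and B in column j."""
    counts = Counter(zip(column(alignment, i), column(alignment, j)))
    n = len(alignment)
    return {ab: c / n for ab, c in counts.items()}


def enrichment(alignment, i, j):
    """Observed pair frequency divided by the product of single frequencies."""
    fi = single_freqs(alignment, i)
    fj = single_freqs(alignment, j)
    fij = pair_freqs(alignment, i, j)
    return {(a, b): f / (fi[a] * fj[b]) for (a, b), f in fij.items()}


# Hypothetical toy alignment (zero-based columns): column 0 is conserved,
# columns 1 and 2 always change together.
toy_alignment = ["ADL", "ADL", "AEA", "AEA"]
ratios = enrichment(toy_alignment, 1, 2)
conserved_ratios = enrichment(toy_alignment, 0, 1)
```

A ratio above 1 means the pair is seen more often than independence would predict, and a ratio of 1 means no signal. A fully conserved column can never show enrichment, however important it is.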
10 Mutual information measures correlation between two columns
A measure for calculating the correlation between two columns $i$ and $j$ is mutual information (MI), which is defined as
$$ MI_{ij} = \sum_{A_i, A_j = 1}^{q} f_{ij}(A_i, A_j) \ln \frac{f_{ij}(A_i, A_j)}{f_i(A_i)\, f_j(A_j)} $$
Annotations: the sum runs over all possible amino acid combinations; $f_{ij}(A_i, A_j)$ is the weight; the logarithm measures the deviation from statistical independence, comparing the observed pair frequency $f_{ij}(A_i, A_j)$ with the frequency $f_i(A_i)\, f_j(A_j)$ expected if the columns were independent.
MI is an inherently local measure [...]
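As an illustration of the MI definition, here is a short sketch that evaluates it on a toy alignment. The function name and the alignment are made up for this example. Pairs that never occur are skipped, since their term contributes zero. Real pipelines typically add pseudocounts and sequence weights, which are left out here.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    """MI between columns i and j (natural log), from raw frequencies."""
    n = len(alignment)
    col_i = [seq[i] for seq in alignment]
    col_j = [seq[j] for seq in alignment]
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))

    mi = 0.0
    for (a, b), c in count_ij.items():
        f_ab = c / n                      # observed pair frequency
        f_a = count_i[a] / n              # single-column frequencies
        f_b = count_j[b] / n
        mi += f_ab * math.log(f_ab / (f_a * f_b))
    return mi


# Hypothetical toy alignment (zero-based columns).
toy_alignment = ["ADL", "ADL", "AEA", "AEA"]
mi_covarying = mutual_information(toy_alignment, 1, 2)
mi_conserved = mutual_information(toy_alignment, 0, 1)
```

Columns 1 and 2 vary in lockstep over two equally frequent states, so their MI is ln 2, while the conserved column 0 has zero MI with everything.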
11 Bad news: doesn't work.
local model
main problem: transitivity
[Figure: contact map for TRY2_RAT comparing PDB structure residue contacts (residues less than 5.00 Å apart) with the residue pairs with the 100 highest MI values]
12 Solution: use a global model!
global probability model
explains observed correlations by
causative pair interactions
Inverse Potts inference
(undirected graphical model)
observed correlations
causal, direct coevolution
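The transitivity problem can be reproduced on a toy system. The sketch below is an illustration under stated assumptions, not the method from the talk: three two-state sites, with probability proportional to the exponential of summed pair couplings, and direct couplings only between sites 0 and 1 and between sites 1 and 2. The coupling strength 2.0 is arbitrary.

```python
import itertools
import math


def chain_distribution(coupling):
    """Exact distribution for three binary sites coupled as 0-1 and 1-2 only."""
    weights = {}
    for s in itertools.product((0, 1), repeat=3):
        # Reward agreement between directly coupled neighbours; no 0-2 term.
        energy = coupling * (s[0] == s[1]) + coupling * (s[1] == s[2])
        weights[s] = math.exp(energy)
    z = sum(weights.values())
    return {s: w / z for s, w in weights.items()}


def pair_mi(dist, i, j):
    """Mutual information between sites i and j under an exact distribution."""
    p_ij, p_i, p_j = {}, {}, {}
    for s, p in dist.items():
        p_ij[(s[i], s[j])] = p_ij.get((s[i], s[j]), 0.0) + p
        p_i[s[i]] = p_i.get(s[i], 0.0) + p
        p_j[s[j]] = p_j.get(s[j], 0.0) + p
    return sum(
        p * math.log(p / (p_i[a] * p_j[b]))
        for (a, b), p in p_ij.items()
        if p > 0
    )


dist = chain_distribution(2.0)
mi_direct = pair_mi(dist, 0, 1)      # directly coupled pair
mi_indirect = pair_mi(dist, 0, 2)    # no direct coupling, still correlated
```

Sites 0 and 2 have no direct interaction, yet their MI is clearly above zero because both are tied to site 1. A local score cannot tell this chained correlation from a direct one, which is the motivation for fitting a global model.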
13 Probability model connects correlations to direct interactions
[...] and, in keeping with the maximum-entropy principle, the desired most general model is had by maximizing the entropy $S = -\sum_{\sigma} P(\sigma) \ln P(\sigma)$ while adhering to these. Maximization of $S$ under (3) can be carried out through the introduction of Lagrange multipliers, giving, after some straightforward calculation, the Potts distribution
$$ P(\sigma) = \frac{1}{Z} \exp\left( \sum_{i=1}^{N} h_i(\sigma_i) + \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} J_{ij}(\sigma_i, \sigma_j) \right), $$
in which multipliers remain as parameters to be fitted to data. This $P(\sigma)$, in an information-theory sense, makes minimal assumption about the world while still capable of staying true to our observed averages. $Z$ is a normalizing constant making sure the total probability is one by summing over all possible states:
$$ Z = \sum_{\sigma} \exp\left( \sum_{i=1}^{N} h_i(\sigma_i) + \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} J_{ij}(\sigma_i, \sigma_j) \right). $$
Footnote 4: Unless stated otherwise, node indexes are assumed to take on values $1 \le i \le N$, pair indexes run through $1 \le i < j \le N$, and states take on $1 \le k, l \le q$.
Slide annotations: observed data (sequences); approximate maximum likelihood inference; direct pair interactions
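To make the Potts distribution concrete, the sketch below evaluates it by brute force on a tiny system. Everything specific here is an assumption for illustration: the sizes N = 3 and q = 2, the field and coupling values, the function names, and the zero-based indexing (the slide counts from 1).

```python
import itertools
import math


def potts_log_weight(sigma, h, J):
    """Exponent of the Potts distribution: fields plus pair couplings (i < j)."""
    n = len(sigma)
    total = sum(h[i][sigma[i]] for i in range(n))
    for i in range(n - 1):
        for j in range(i + 1, n):
            total += J[(i, j)][sigma[i]][sigma[j]]
    return total


def potts_distribution(h, J, q):
    """Return (Z, {state: probability}) by enumerating all q**N states."""
    n = len(h)
    states = list(itertools.product(range(q), repeat=n))
    weights = [math.exp(potts_log_weight(s, h, J)) for s in states]
    z = sum(weights)
    return z, {s: w / z for s, w in zip(states, weights)}


# Hypothetical toy parameters.
N, q = 3, 2
h = [[0.0, 0.5], [0.0, 0.0], [0.2, 0.0]]
J = {
    (i, j): [[0.0] * q for _ in range(q)]
    for i in range(N - 1)
    for j in range(i + 1, N)
}
J[(0, 1)] = [[1.0, 0.0], [0.0, 1.0]]   # sites 0 and 1 prefer to agree

Z, P = potts_distribution(h, J, q)
```

Enumeration needs q**N terms, so it only works for toy sizes. For a real protein with 21 states per position the sum defining Z is intractable, which is why the inference is labelled approximate on the slides.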
14 From sequences to pair scores
Infer parameters (approximate!)
$$ P(\text{data} \mid \text{parameters}) = \prod_{\sigma \in \text{alignment}} P(\sigma \mid h, J) $$
Calculate evolutionary couplings
$$ FN_{ij} = \lVert J_{ij} \rVert_2 = \sqrt{ \sum_{k=1}^{21} \sum_{l=1}^{21} J_{ij}(k, l)^2 } $$
+ some other technical details
References: Marks et al. (2011), Ekeberg et al. (2013)
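The coupling score on this slide collapses each pair's coupling matrix into a single number. Below is a minimal sketch of that step only, assuming the matrices have already been inferred. The 2 x 2 toy matrices and the function names are hypothetical; on the slide each matrix is 21 x 21. The "other technical details" are not modelled.

```python
import math


def frobenius_norm(J_ij):
    """Square root of the sum of squared entries of one coupling matrix."""
    return math.sqrt(sum(v * v for row in J_ij for v in row))


def rank_pairs(couplings):
    """Score every residue pair and sort from strongest to weakest coupling."""
    scores = {pair: frobenius_norm(m) for pair, m in couplings.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical inferred couplings for three residue pairs.
toy_couplings = {
    (0, 1): [[3.0, 0.0], [0.0, 4.0]],
    (0, 2): [[0.1, 0.0], [0.0, 0.1]],
    (1, 2): [[1.0, 1.0], [1.0, 1.0]],
}
ranking = rank_pairs(toy_couplings)
```

The top of the ranking is what would be compared against residue contacts in the 3D structure.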
15 Most global model pairs are close in 3D
[Figure: two contact maps for TRY2_RAT, one for the local model (mutual information) and one for the global model (MaxEnt, Marks et al., 2011). Panel titles: "TRY2_RAT, Residues less than 5.00 Angstroms apart" and "TRY2_RAT, MIs and Residues less than 5.00 Angstroms apart". Panel footers: "number of constraints = 150" and "nz = 300".]
16 He solved it before everyone else did...
Alan Lapedes
17 Outline
Prologue: Correlated mutations
Local vs. global models for 2D prediction
Application to 3D structure prediction
18 What could we predict using
evolutionary couplings?
3D structure
function
protein complexes
conformation changes
19 What could we predict?
3D structure
protein complexes
20 A brief reminder
21 The breakthrough: 15 proteins folded
from sequences alone
3D Structure
Marks et al., PLoS ONE (2011)
Figure 2. Predicted 3D structures for three representative proteins. Visual comparison of 3 of the 15 te[...] reveals the remarkable agreement of the predicted top ranked 3D structure (left) and the experimentally observed [...] error and, in parentheses, number of residues used for Cα-RMSD error calculation, e.g., 2.9 Å Cα-RMSD (67). The ribb[...]
22 Also works for membrane proteins!
predicted
experimental
Hopf et al., Cell (2012)
23 Medically important membrane proteins
predicted from sequences
respiratory complex I
predicted (Hopf et al., Cell, 2012)
experimental (Baradaran et al., Nature, 2013)
24 Medically important membrane proteins
predicted from sequences II
human adiponectin receptor 1 (3.7Å, 186 residues)
25 What could we predict?
3D structure
protein complexes
26 Interacting proteins co-evolve to
maintain interaction
27 Complex interactions from the evolutionary sequence record
28 Accurate prediction of the ABC transporter MetNI
29 Benchmark set gallery
30 De novo prediction of unsolved complex interactions
31 Can we predict which proteins interact?
E. coli ATP synthase
32 ATP synthase interaction matrix
33 Try folding yourself!
www.evfold.org
34 References for further reading
Protein 3D Structure Computed from Evolutionary
Sequence Variation
Debora S. Marks1*., Lucy J. Colwell2., Robert Sheridan3, Thomas A. Hopf1, Andrea Pagnani4, Riccardo
Zecchina4,5, Chris Sander3
1 Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America, 2 MRC Laboratory of Molecular Biology, Hills Road,
Cambridge, United Kingdom, 3 Computational Biology Center, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America, 4 Human Genetics
Foundation, Torino, Italy, 5 Politecnico di Torino, Torino, Italy
Abstract
The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence
homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these
constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering
purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of
inexpensive high-throughput genomic sequencing. In this paper we ask whether we can infer evolutionary constraints from
a set of sequence homologs of a protein. The challenge is to distinguish true co-evolution couplings from the noisy set of
observed correlations. We address this challenge using a maximum entropy model of the protein sequence, constrained by
the statistics of the multiple sequence alignment, to infer residue pair couplings. Surprisingly, we find that the strength of
these inferred couplings is an excellent predictor of residue-residue proximity in folded structures. Indeed, the top-scoring
residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy. We
quantify this observation by computing, from sequence alone, all-atom 3D structures of fifteen test proteins from different
fold classes, ranging in size from 50 to 260 residues, including a G-protein coupled receptor. These blinded inferences are de
novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The co-evolution
signals provide sufficient information to determine accurate 3D protein structure to 2.7–4.8 Å Cα-RMSD error relative to the
observed structure, over at least two-thirds of the protein (method called EVfold, details at http://EVfold.org). This discovery
provides insight into essential interactions constraining protein evolution and will facilitate a comprehensive survey of the
universe of protein structures, new strategies in protein and drug design, and the identification of functional genetic
variants in normal and disease genomes.
Citation: Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, et al. (2011) Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS
ONE 6(12): e28766. doi:10.1371/journal.pone.0028766
Editor: Andrej Sali, University of California San Francisco, United States of America
Received November 10, 2011; Accepted November 14, 2011; Published December 7, 2011
Copyright: © 2011 Marks et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: CS and RS have support from the Dana Farber Cancer Institute-Memorial Sloan-Kettering Cancer Center Physical Sciences Oncology Center (NIH U54CA143798). LC is supported by an Engineering and Physical Sciences Research Council fellowship (EP/H028064/1). TH has support from the German National
Academic Foundation. RZ has support from European Community grant 267915. No other financial support was received for the research. The funders had no role
in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected]
. These authors contributed equally to this work.
Introduction
Exploiting the evolutionary record in protein families
The evolutionary process constantly samples the space of
possible sequences and, by implication, structures consistent with a
functional protein in the context of a replicating organism.
Homologous proteins from diverse organisms can be recognized
by sequence comparison because strong selective constraints
prevent amino acid substitutions in particular positions from
being accepted. The beauty of this evolutionary record, reported
in protein family databases such as PFAM [1], is the balance
[2,3]. This suggests that residue correlations could provide
information about amino acid residues that are close in structure
[4,5,6,7,8,9,10,11]. However, correlated residue pairs within a
protein are not necessarily close in 3D space. Confounding residue
correlations may reflect constraints that are not due to residue
proximity but are nevertheless true biological evolutionary
constraints or, they could simply reflect correlations arising from
the limitations of our insight and technical noise. Evolutionary
constraints on residues involved in oligomerization, protein-protein, or protein-substrate interactions or other spatially indirect
or spatially distributed interactions can result in co-variation
35 Acknowledgments
Debora Marks (Harvard Medical School)
Chris Sander (MSKCC)
Burkhard Rost (TU München)
Charlotta Schärfe (HMS & Uni Tübingen)