Scoring Docked Protein Complexes with Hydrogen Bonds
Patrick Day
April 17, 2013
1 Introduction
1.1 Area
Finding the structure of proteins and protein-protein complexes is an important problem in the field of computational biology. Determining protein structures can be done successfully with various experimental methods
such as X-ray crystallography, but determining the structures of protein-protein complexes experimentally is more difficult. Instead, these structures are frequently predicted
computationally. One approach to predicting these complexes is to create
a function to score complexes on their viability, ideally giving the correct
structure the best score.
1.2 Importance
Since the function of a protein (or protein-protein complex) is determined by
its structure, understanding protein structure is essential to understanding protein function. Discovering the structure of a protein can provide a
wealth of information about the function of that protein, but many proteins
must bind another protein or small molecule to perform their function. So
in order to have a complete understanding of these proteins, the structures
of these bound complexes must be determined. Beyond providing the knowledge of a protein’s function within biological pathways, understanding the
structures of protein-protein complexes also assists with drug design. Learning how proteins form complexes allows drug designers to better shape their drugs
to bind to desired receptors and not bind to undesired receptors.
1.3 Types of Protein Structure
Protein structure is defined at four distinct levels, from primary to quaternary [3]. Primary structure is the sequence of amino acids in the polypeptide chain. There are 20 different amino acids, and these amino acids are chained together by peptide bonds between the carboxyl group of one residue and the amino group of the next. Secondary structure is largely defined by hydrogen bonding interactions between the amino acid residues. These interactions create localized structures such as alpha helices and beta sheets. Tertiary structure is the 3D shape of a protein as determined by the folding of the localized structures of secondary structure. These structures fold in a way that hides the hydrophobic regions on the inside of the protein, but exposes the hydrophilic regions to the solvent. Quaternary structure is the 3D structure of a complex made of multiple subunits (discrete polypeptide chains). This quaternary structure is the result of proteins' need to bind other proteins in order to function.
Figure 1: Levels of Protein Structure [2]
1.4 Ease of Determination
The structure of unbound proteins is relatively easy to determine. The primary structure can be determined experimentally in a number of ways. This primary structure is then used to computationally predict the secondary and tertiary structure of a protein. When a protein binds (or docks with) another protein, this new protein-protein complex takes on a new quaternary structure that is not as easily determined. These complexes are often so large, and the interactions between the member proteins so weak, that the new structure is difficult to determine experimentally, even if the structures of both proteins are already known. In addition, these complexes are often transient and difficult to isolate. Fortunately, computational determination of protein complexes is promising, since the search space is more limited than in protein folding. In order to find the most likely docking conformation of a complex, a large number of candidate structures must first be generated. These candidates are generated under the assumption of rigid-body docking, which means that the proteins undergo little to no conformational change upon docking. With this assumption, possible dockings are generated by fixing the position of one of the proteins while giving the other protein six degrees of freedom: three for rotation and three for translation. If a protein complex consists of proteins $X_a$ and $X_b$, we seek a transformation $T$ such that the energy (or potential) of the complex, $E(X_a, T(X_b))$, is at a minimum. This energy function can take into account many different factors, including electrostatic forces, hydrophobic/hydrophilic interactions, and specific amino acid interactions.
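The rigid-body search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `rotation_matrix`, `transform`, and `best_pose` are hypothetical, atoms are plain (x, y, z) tuples, the rotation is parameterized by three angles about the x, y, and z axes, and the energy function is supplied by the caller rather than being the potential developed in this paper.

```python
import math

def rotation_matrix(alpha, beta, gamma):
    """Return R = Rz(gamma) * Ry(beta) * Rx(alpha); angles are in radians."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [cg * cb, cg * sb * sa - sg * ca, cg * sb * ca + sg * sa],
        [sg * cb, sg * sb * sa + cg * ca, sg * sb * ca - cg * sa],
        [-sb, cb * sa, cb * ca],
    ]

def transform(coords, angles, shift):
    """Apply T(X) = R X + t: three rotational and three translational degrees of freedom."""
    r = rotation_matrix(*angles)
    return [
        tuple(r[i][0] * x + r[i][1] * y + r[i][2] * z + shift[i] for i in range(3))
        for x, y, z in coords
    ]

def best_pose(receptor, ligand, poses, energy):
    """Return the (angles, shift) pose that minimizes E(X_a, T(X_b)).

    The receptor stays fixed; `energy` is any caller-supplied scoring function
    taking (receptor_coords, ligand_coords).
    """
    return min(poses, key=lambda pose: energy(receptor, transform(ligand, *pose)))
```

In practice the list of poses would come from a systematic or randomized sampling of the six degrees of freedom; here it is simply passed in.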
1.5 Hydrogen Bonding
My research focuses on the role of hydrogen bonds in complex formation.
When hydrogen atoms are attached to highly electronegative atoms like nitrogen, oxygen, or sulfur, they become polarized with a slight positive charge.
These positively charged hydrogens are attracted to and bind with electronegative atoms. It is important to note that they do not bind covalently, but
are attracted by a strong dipole-dipole force. When these bonds are between
atoms on different proteins, the protein-protein complex is stabilized. Given
the role of hydrogen bonds in stabilizing these complexes, I used a statistical learning approach to find a scoring function based on the quantity of
hydrogen bonds found in the structures of known complexes provided by the
Protein Data Bank (www.rcsb.org).
2 Background
2.1 Complex Structure
These protein complexes always consist of two proteins, the longer of which is usually called the receptor, while the other is called the ligand. Between these two proteins exists an interface, or region of contact between the proteins. The atoms in this region are close enough to interact and thus contribute to the binding of the two proteins. This region is also protected from the solvent upon complex formation, which is important because hydrogen bonds are less likely to form between proteins when solvent is present.
Figure 2: Protein-protein interface [7]
2.2 Types of Docking
At this point, it is important to distinguish between the two theoretical models of docking. The first model, induced fit, posits that the two proteins undergo a significant conformational change in the process of binding. This would mean that the bound versions of proteins would bear little resemblance to their unbound counterparts. As our transformations only allow the six degrees of freedom found in rotation and translation, this model is less useful to our research. The second model is known as the lock-and-key model and posits that there is little conformational change in the process of binding [11]. This would mean that the correct binding of two proteins could be found from the six degrees of freedom in our transformations. This model is especially useful in the case of bound docking, in which the ligand and receptor proteins are pulled apart from a complex whose structure is already known. This guarantees that the proteins are already in the correct conformation to be bound. Bound docking is used to benchmark the performance of docking algorithms, since it allows us to compare an algorithm's predicted docking to the actual one. A more scientifically useful case is unbound docking. In this case, not only is the structure of the bound complex not previously known, but the structures of the receptor and the ligand may not be known either. When this is so, the structures of those proteins must first be approximated through homology modeling, which is the prediction of a structure from the structure of a homologous protein. Once the structures of the receptor and ligand are known, the process of transforming the proteins occurs in the same way as in the bound case.
Figure 3: Bound and unbound docking [4]
2.3 Scoring Functions
Once the transformations are complete, the complexes need to be scored and ranked so that the most feasible conformation can be found. Scoring functions (or potentials) can be either physics-based or knowledge-based. Physics-based potentials are rooted in the ideal interactions between atoms. Knowledge-based potentials are rooted in statistical analysis of already known structures, which is then applied to unknown structures.
Physics-based potentials (or molecular mechanics potentials) are generally based on knowledge of bonding and electrostatics. They first find energies for the structure in terms of bonded terms and non-bonded terms, and then sum these energies. Bonded terms typically measure deviations from ideal values, such as ideal bond lengths and angles, and penalize the square of these deviations. Non-bonded terms take into account knowledge from electrostatics and van der Waals interactions. These potentials are used to minimize the given structures to create a more feasible docked complex [5].
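The bonded and non-bonded terms described above can be illustrated with a minimal Python sketch. The functional forms used here (a harmonic penalty, Coulomb's law, and a 12-6 Lennard-Jones term) are the textbook forms of such terms and are an assumption on my part, not the specific potential of reference [5]. All function names and parameters are hypothetical, and the constant 332.06 converts to kcal/mol when charges are in elementary charge units and distances in angstroms.

```python
def harmonic(value, ideal, k):
    """Bonded term: squared deviation from an ideal bond length or angle."""
    return k * (value - ideal) ** 2

def coulomb(q1, q2, r, dielectric=1.0):
    """Non-bonded electrostatic term in kcal/mol (charges in e, r in angstroms)."""
    return 332.06 * q1 * q2 / (dielectric * r)

def lennard_jones(r, epsilon, sigma):
    """Non-bonded van der Waals term, 12-6 Lennard-Jones form."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)

def total_energy(bonded, nonbonded):
    """Sum bonded terms (value, ideal, k) and non-bonded terms (q1, q2, r, epsilon, sigma)."""
    e_bonded = sum(harmonic(v, v0, k) for v, v0, k in bonded)
    e_nonbonded = sum(
        coulomb(q1, q2, r) + lennard_jones(r, eps, sig)
        for q1, q2, r, eps, sig in nonbonded
    )
    return e_bonded + e_nonbonded
```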
Two common knowledge-based potentials are statistical potentials and mathematical-programming-based potentials. Instead of analyzing protein structures based on the physics of protein interactions, these potentials are wholly based on statistical analysis of known structures.
Statistical potentials are based on equation 1, which is known as the inverse Boltzmann distribution. This equation states that the difference in energy between the current structure (or state) and some reference state is a function of the log-odds ratio of the probabilities of the current state over the reference state. The probability of the current structure, $P(r)$, is determined by the presence of certain interactions within the structure. The reference state, $P_{ref}(r)$, is created to represent a system in which all of the measured interactions are absent [6].
$$E = -k_B T \ln \frac{P(r)}{P_{ref}(r)} \tag{1}$$
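As an illustration of equation 1, the following Python sketch converts binned observation counts into energies. It assumes, purely for the example, that both the observed and reference distributions are available as histograms over the same bins. The function names are hypothetical, and the Boltzmann constant is given in kcal/(mol K).

```python
import math

K_B = 0.0019872  # Boltzmann constant in kcal/(mol*K)

def inverse_boltzmann(p_obs, p_ref, temperature=298.0):
    """Equation 1: E = -k_B * T * ln(P(r) / P_ref(r))."""
    return -K_B * temperature * math.log(p_obs / p_ref)

def binned_energies(observed_counts, reference_counts, temperature=298.0):
    """Turn two histograms over the same bins into one energy per bin.

    Bins with a zero count in either histogram have no defined log ratio,
    so they are reported as None rather than guessed.
    """
    n_obs = sum(observed_counts)
    n_ref = sum(reference_counts)
    energies = []
    for obs, ref in zip(observed_counts, reference_counts):
        if obs == 0 or ref == 0:
            energies.append(None)
        else:
            energies.append(inverse_boltzmann(obs / n_obs, ref / n_ref, temperature))
    return energies
```

A bin that is observed more often than the reference predicts receives a negative (favorable) energy, and a bin observed less often receives a positive one.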
Potentials can also be based on mathematical programming. These potentials are based on the idea that correctly folded protein structures will have a low energy, and that the energy of a structure will get higher the further that structure is from being correctly folded. This logic is exemplified in the funnel-shaped graph of the protein folding landscape in Figure 4. This gives rise to equation 2, which describes the relationship between correct and misfolded structures. A large number of these inequalities form the constraints of a linear program. An objective function is then minimized subject to these constraints to produce the parameters of the potential [9].
$$E(X_{misfolded}) - E(X_{correct}) > 0 \quad \forall X_{misfolded} \tag{2}$$
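To make the constraints of equation 2 concrete, the sketch below assumes that the energy is linear in a vector of structural features (for example, counts of particular contacts), so that each correct/misfolded pair contributes one linear inequality in the weights. This linear form and the names `energy` and `violated_constraints` are assumptions made for illustration. The sketch only checks the inequalities for a given weight vector and does not solve the linear program.

```python
def energy(weights, features):
    """Linear energy model (an assumption here): E(X) = sum_i w_i * f_i(X)."""
    return sum(w * f for w, f in zip(weights, features))

def violated_constraints(weights, correct_features, misfolded_features, margin=0.0):
    """Return indices of misfolded structures that break equation 2.

    Equation 2 requires E(X_misfolded) - E(X_correct) > 0 for every
    misfolded structure; a larger margin makes the requirement stricter.
    """
    e_correct = energy(weights, correct_features)
    return [
        i
        for i, feats in enumerate(misfolded_features)
        if energy(weights, feats) - e_correct <= margin
    ]
```

A linear programming solver would search for weights that leave this list empty (or as short as possible) across all training complexes.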
Figure 4: Protein Landscape [8]
3 Method
3.1 Data Set
My work involves the scoring of these protein-protein complexes based on the number of hydrogen bonds present in the interface of the complex. This work was done on the basis that hydrogen bonds provide at least specificity to protein-protein complexes, if not stability. The code I wrote analyzes protein structures for the proximity of polar hydrogens to electronegative atoms, given variable constraints for bond length and angle. This code was run on a data set of 640 complexes stored in PDB (Protein Data Bank) files. Of these complexes, 462 were bound and 178 were unbound. These PDB files unfortunately did not explicitly include hydrogens, because X-ray crystallography generally cannot resolve the small hydrogen atoms. Fortunately, the positions of hydrogen atoms can be modeled accurately from chemical geometry. To add these hydrogen atom coordinates, my entire data set was run through a program called Reduce, developed at Duke, which inserts the hydrogens into the files. Reduce adds hydrogens to PDB files so that the hydrogen atoms are staggered about the atom to which they are bound, rotationally optimized, and given appropriate bond lengths [10].
3.2 Interface
With the hydrogens in place, I then needed to find the subset of atoms in the
proteins found at the interface of the two proteins. I defined this interface
as the subset of atoms in each protein that are within ten angstroms of an
atom in the opposite protein.
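The interface definition above translates directly into code. The following Python sketch assumes that each protein is given as a list of (x, y, z) atom coordinates in angstroms. The name `interface_atoms` is hypothetical, and the brute-force pairwise search is chosen for clarity, not speed.

```python
import math

def interface_atoms(protein_a, protein_b, cutoff=10.0):
    """Return the atoms of each protein lying within `cutoff` angstroms
    of at least one atom of the other protein."""
    near_a = [
        a for a in protein_a
        if any(math.dist(a, b) <= cutoff for b in protein_b)
    ]
    near_b = [
        b for b in protein_b
        if any(math.dist(a, b) <= cutoff for a in protein_a)
    ]
    return near_a, near_b
```

For large complexes, a spatial grid or k-d tree would avoid comparing every pair of atoms, but the result would be the same.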
3.3 Hydrogen Bond Geometry
The next step was determining the appropriate geometry for a hydrogen bond. Research suggests that the typical distance (d) between a polarized hydrogen and the electronegative atom it is hydrogen bonded with is around 1.9 angstroms. The angle (θ) between the covalent bond and the hydrogen bond (with the hydrogen at the vertex) is less well defined, but trends toward 180°.
Figure 5: Geometry of a hydrogen bond
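The geometric test implied by d and θ can be sketched in Python as follows. Atoms are assumed to be (x, y, z) tuples in angstroms, and the function names are hypothetical. The default cutoffs of 2.5 angstroms and 120° are placeholder values for illustration only, since the appropriate values are exactly what the parameter scan in this work is meant to find.

```python
import math

def angle_at_hydrogen(donor, hydrogen, acceptor):
    """Angle theta in degrees between the covalent bond (hydrogen to donor)
    and the hydrogen bond (hydrogen to acceptor), with the hydrogen at the vertex."""
    v1 = [d - h for d, h in zip(donor, hydrogen)]
    v2 = [a - h for a, h in zip(acceptor, hydrogen)]
    dot = sum(x * y for x, y in zip(v1, v2))
    cos_theta = dot / (math.hypot(*v1) * math.hypot(*v2))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    return math.degrees(math.acos(cos_theta))

def is_hydrogen_bond(donor, hydrogen, acceptor, max_distance=2.5, min_angle=120.0):
    """True if d is short enough and theta is close enough to 180 degrees.

    The default cutoffs are illustrative assumptions, not the paper's values.
    """
    d = math.dist(hydrogen, acceptor)
    theta = angle_at_hydrogen(donor, hydrogen, acceptor)
    return d <= max_distance and theta >= min_angle
```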
3.4 Code
Given these constraints, I was then able to run my code on these interfaces with various values for d and θ, in order to find the most appropriate values for d and θ. My score function, W, is defined by a log-odds ratio, $W = -k_B T \ln \frac{P_{obs}}{P_{ref}}$, where $P_{obs}$ is the observed frequency of a hydrogen bond length (or angle) in the data set and $P_{ref}$ is the reference (or null) distribution.
This reference distribution represents the expected distribution if there is no
hydrogen bonding for this particular complex. These formulas are defined as
$$P_{obs} = p(P, H, d, \theta)$$
$$P_{ref} = p(H) \, p(P) \, p(R, d, \theta \mid H)$$
where H represents polar hydrogens, P represents any polarizing atom, and
R represents any atom. After determining the score ratio, we test it again
against an independent data set (one consisting of purposefully mis-docked
complexes) to assess the recognition capacity.
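One possible reading of these definitions, expressed in terms of raw counts, is sketched below in Python. This is an interpretation, not the code used in this work: the function name, the argument names, and the choice of normalizing constants are all assumptions, and kT is set to roughly 0.593 kcal/mol (about 298 K).

```python
import math

def hbond_score(n_hp_in_bin, n_pairs, n_h, n_p, n_atoms,
                n_hr_in_bin, n_hr_pairs, kt=0.593):
    """W = -kT * ln(P_obs / P_ref) for one (d, theta) bin.

    P_obs = p(P, H, d, theta): fraction of all atom pairs that are a polar
            hydrogen and a polarizing atom with geometry in this bin.
    P_ref = p(H) * p(P) * p(R, d, theta | H): the frequency expected if the
            geometry around a polar hydrogen did not depend on the partner atom.
    """
    p_obs = n_hp_in_bin / n_pairs
    p_h = n_h / n_atoms                    # fraction of atoms that are polar hydrogens
    p_p = n_p / n_atoms                    # fraction of atoms that are polarizing atoms
    p_geom = n_hr_in_bin / n_hr_pairs      # geometry frequency for hydrogen/any-atom pairs
    return -kt * math.log(p_obs / (p_h * p_p * p_geom))
```

Under this reading, a bin in which polar hydrogens meet polarizing atoms more often than the reference predicts receives a negative score.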
4 Results
Existing research defines hydrogen bonds to have a length (d) around 2 angstroms and an angle (θ) tending toward 180° [1]. Figure 6 shows the breakdown of scores across all of the proteins in my data set for several different bond lengths. A few different θ values were used as well, but this information was combined for each bond length. The trough at 2.0 angstroms lends credence to the validity of my scoring function, as it shows a preference for the correct bond length.
Figure 6: Breakdown of protein scores by bond length
In addition, Figure 7 shows that scoring improves as d approaches 2.0 and θ approaches 180°. This is consistent with Coulomb's law, which states that the electrostatic force between charges decreases as the distance between the charges increases. The closer θ is to 180°, the larger the distance between the two negatively charged atoms on either side of the hydrogen. This increased distance reduces the repulsion between them and results in a more stable bond.
Figure 7: Contour plot of protein scores
After parameterizing the energy function with a d value of 2.0 and a preference for higher values of θ, correct structures were scored against 'decoy' structures. These decoy structures are purposefully misdocked versions of the same receptor and ligand, and provide a comparison against correct structures. Figure 8 shows the scores for six different complexes against several decoy complexes. Many of the decoy structures had no possible hydrogen bonds and so returned scores of zero. Although these are just the results from six complexes, the general trend across all complexes was in favor of the correct complex.
Figure 8: Correct vs Decoy Complexes
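The comparison of correct structures against decoys can be summarized with a simple ranking, sketched below in Python. It assumes that lower scores are better, as with an energy, and that each test case is a pair of the correct structure's score and a list of decoy scores. The names `native_rank` and `success_rate` are hypothetical, and the sketch does not reproduce any of the numbers behind Figure 8.

```python
def native_rank(native_score, decoy_scores):
    """Rank of the correct structure among its decoys (1 = best).

    Lower scores are better; decoys that tie with the correct structure
    do not push its rank down.
    """
    return 1 + sum(1 for s in decoy_scores if s < native_score)

def success_rate(cases):
    """Fraction of (native_score, decoy_scores) cases where the correct
    structure is ranked first."""
    hits = sum(1 for native, decoys in cases if native_rank(native, decoys) == 1)
    return hits / len(cases)
```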
5 Summary
Using hydrogen bonding to score dockings of protein-protein complexes is feasible. My energy function is relatively simple, yet can still tell a good deal about the viability of a protein-protein complex. In cases where the correct structure was not chosen, the number of hydrogen bonds present was generally higher in the correct structure, but was overpowered by the reference state. This suggests that perhaps I should weight hydrogen bonds more heavily or change the calculation of the reference state.
References
[1] Kortemme, Tanja, Alexandre V. Morozov, and David Baker. "An Orientation-dependent Hydrogen Bonding Potential Improves Prediction of Specificity and Structure for Proteins and Protein-Protein Complexes." Journal of Molecular Biology. 326.4 (2003): 1239-1259. Web. 16 Apr. 2013. <http://dx.doi.org/10.1016/S0022-2836(03)00021-4>.
[2] Fischer T. protein structure levels, http://faculty.irsc.edu/FACULTY/T
Fischer/bio%201%20files/bio%201%20resources.htm. 2013
[3] Janin J, Bahadur RP, Chakrabarti P. Protein-protein interaction and
quaternary structure. Q Rev Biophys 2008;41(2):133-180.
[4] Funkhouser T. COS 597A Lectures Notes in Structural Bioinformatics.
In: Archives PUC, editor; 2005.
[5] Ravikant D. Learning to Dock Proteins. Ithaca NY: Cornell University;
2011.
[6] Schwede T, Peitsch MC. Computational structural biology: methods and applications. N.J.: World Scientific; 2008. x, 779 p.
[7] Keskin O. Protein-Protein Interface, http://home.ku.edu.tr/ okeskin/interface. 2013
[8] Chaplin M. Protein Folding and Denaturation, http://www.lsbu.ac.uk/water/protein2.html. 2013
[9] Maiorov VN, Crippen GM. Contact Potential That Recognizes the Correct Folding of Globular-Proteins. J Mol Biol 1992;227(3):876-888.
[10] Word, J. Michael, Simon C. Lovell, Jane S. Richardson, and David C. Richardson. "Asparagine and Glutamine: Using Hydrogen Atom Contacts in the Choice of Side-chain Amide Orientation." Journal of Molecular Biology. 285. (1999): 1735-1747. Print.
[11] Fischer E. Einfluss der Configuration auf die Wirkung der Enzyme. Ber
Dt Chem Ges 1894;27(3):9.