
Proc. Idea-Finding Symposium, Frankfurt Institute for Advanced Studies (2003) 23–33
Idea-Finding Symposium, Frankfurt Institute for Advanced Studies, Frankfurt, Germany, April 15–17, 2003

Physical Models for Protein Folding and Drug Design

R.A. Broglia (1,2) and G. Tiana (1)
(1) Department of Physics, University of Milano and INFN, Sez. di Milano, Via Celoria 16, 20133 Milano, Italy
(2) The Niels Bohr Institute, Blegdamsvej 16, 2100 Copenhagen, Denmark

Abstract. The problem of protein folding consists in understanding how the amino acid sequence of a protein (primary structure) determines its unique, biologically active equilibrium conformation (tertiary structure). By means of simplified models, we explore the dynamical processes which are at the basis of the folding of model proteins and find a simple hierarchical mechanism which governs the folding phenomenon. Exploiting this result, it is possible not only to develop an algorithm to determine the equilibrium conformation of a protein from its sequence, that is, to solve the protein folding problem provided one knows the interaction among the amino acids, but also to design a novel class of drugs which interfere with the folding mechanism and whose inhibitory effect cannot be neutralized through mutations, as is the case with standard drugs acting, as a rule, on the active site of enzymes.

1. Introduction

The problem of protein folding is to understand how a protein molecule of specified amino acid sequence ends up in a unique configuration which, among other things, determines its biological function [1]. In physical terms, the problem is how the one-dimensional information provided by the sequence of twenty types of amino acids encodes a unique and stable three-dimensional equilibrium conformation. This problem has a self-evident biological and medical importance.
The sequencing of the human genome [2, 3], that is, the identification of the order in which the thousands of millions of bases follow each other in human DNA, provides information on the sequence of amino acids forming each of the tens of thousands of proteins which build our cells and catalyze the chemical reactions which make them function. The acquisition of sequence data by DNA sequencing is relatively quick, and vast quantities of data have become available through international efforts. But knowledge of the sequence alone is of little help in understanding the function of the corresponding protein, in manipulating its function and in designing drugs to act on it. For that, one needs the three-dimensional equilibrium conformation. On the other hand, the acquisition of three-dimensional data is still slow and is limited to proteins that either crystallize in a suitable form or are sufficiently small and soluble to be solved by NMR in solution [4]. In fact, while at present data banks contain information concerning the linear sequence of about $10^5$ proteins, atomic coordinates of only $10^4$ native structures are available [5]. Algorithms are thus required to translate the linear information into spatial information.

ISBN 963 000 000 0 © 2003 EP Systema, Debrecen

Once the conformation of a protein is known, one can attempt to design drugs that interact with the protein. Most of the targets of pharmaceutical drugs are enzymes, that is, proteins whose task is to catalyze some reaction in the human body. Such drugs usually inhibit the associated enzyme by capping its active site, thus preventing the enzyme from binding its substrate. For example, matrilysin is an enzyme involved in the degradation of tissues which takes place as a consequence of arthritis. Some drugs against arthritis are designed to inhibit matrilysin activity by binding to its enzymatic site [6].
But the protein folding problem is extremely intriguing also from the physical point of view. A protein is a system which is in a nearly-zero-entropy equilibrium state (usually referred to as the 'native' state) over a wide interval of temperatures (ranging from ∼0 to ∼60 °C). Such an equilibrium state has essentially no symmetries. The interactions within the protein are notably complicated and heterogeneous. Nonetheless, the protein displays neither slow dynamics, nor the large number of competing low-energy states and kinetic traps associated with metastable states, typical of 'frustrated' systems [7]. The only feature of frustrated systems which survives in the case of proteins is the difficulty of predicting the ground state conformation of the system. This prediction is the essence of the protein folding problem. Understanding the process behind the folding of proteins is both interesting as a physical problem per se and instrumental to the prediction of the native state.

It is important to emphasize that the main goal of the physical approach to the protein folding problem is not to analyze the behavior of a specific protein, but to understand the general principles of the folding mechanism of any protein. The first and basic assumption needed to proceed further is that such a general paradigm does exist. There is indeed some evidence which supports this view. Although proteins are complicated systems and each of them can differ from the others in size, shape, and function, all of them display a number of common features, for example secondary-structure motifs known as β sheets and α helices, hydrophobic cores, etc. The starting point of the physical approach is, consequently, the search through the (vast) experimental literature concerning the folding of proteins for these common features. It is furthermore sensible to assume that the tens of thousands of known proteins have evolved from a few common ancestors.
Hints of this evolution can be found in the conservation patterns of protein sequences displaying similar native structure. These conservation patterns can be helpful in understanding the folding of related proteins and further testify to the fact that it is reasonable to assume that there is a single general mechanism controlling folding.

If one subscribes to the idea of a single folding pattern for all proteins, or at least for small monoglobular proteins, then the use of simplified models to describe this mechanism is not only allowed, but also useful. During the last twenty years a remarkable development of protein models has taken place, ranging from simple two-state models of the kind used by chemists to describe chemical reactions, to all-atom models. The latter take advantage of the power of modern computers, which makes it possible to simulate the folding of proteins over periods of time which, in spite of being a small fraction of the full folding time, are still not negligible, at least for small proteins.

A particularly interesting model, describing the protein as a chain of beads on a cubic lattice, seems to represent an appropriate balance between solvability and realism (cf. e.g. Ref. [8] and references therein). Studying this model in detail, one finds some remarkable simplicities in the folding of protein-like chains. The folding process is controlled, within the framework of this model, by a small subset of the amino acids of the protein. As we shall see in more detail in the next section, these 'hot' amino acids [8] build, very early in the folding process, a few local elementary structures (LES), which diffuse as essentially rigid entities.
When the local elementary structures, which display a high affinity for each other, find their correct partners, they build the folding nucleus (FN), the minimum set of native contacts needed to overcome the main barrier of the free energy associated with the entire folding process [9, 10].

The view of folding as the assembly of local elementary structures into the folding nucleus not only accounts for known experimental facts, but also opens the way to predictions and manipulations. In fact, while the direct prediction of the native conformation of a protein from the amino acid sequence is difficult, the localization of the local elementary structures is much easier. Once the local elementary structures are known, it is not impossible to determine the folding nucleus, and from it the native conformation. Furthermore, the knowledge of local elementary structures can be used to design drugs able to inhibit the folding, and consequently the biological activity, of selected proteins [11].

2. The Model

An important ingredient at the basis of the folding of proteins is the heterogeneity of the interaction arising from the presence of twenty different kinds of amino acids. It is known that physical systems with such heterogeneity display, as a rule, a rough energy landscape with many competing low-energy states [7]. This picture is incompatible with that of proteins, which must display a unique ground state, well separated from the others, and as few metastable states as possible. Consequently, the purpose of these models is to understand what makes a protein, characterized by a well-defined amino acid sequence, different from a generic heterogeneous system, whose paradigm is a random sequence of amino acids.
The simplest choice for a heterogeneous potential is that of a contact potential, in the form

\[ U(\{r_i\}, \{\sigma(i)\}) = \sum_{ij} B_{\sigma(i)\sigma(j)} \, \Delta(r_i - r_j) , \qquad (1) \]

where $r_i$ and $\sigma(i)$ are the position and kind of the $i$-th amino acid, $\Delta(r_i - r_j)$ is a contact function which assumes the value 1 if $|r_i - r_j| \le 1$ and zero otherwise, while $B_{\sigma\tau}$ is the element of the 20 × 20 interaction matrix which defines the interaction energy between amino acids of kind $\sigma$ and $\tau$. A widely used interaction matrix has been calculated by Miyazawa and Jernigan (MJ) [12] from the statistical analysis of the contacts in a large database of known proteins, assuming that the more often a given contact appears in the database, the more attractive it is. This is done by calculating the probability $p_{\sigma\tau}$ of appearance of the contact between amino acids of kind $\sigma$ and $\tau$, and assuming a Boltzmann-like relationship of the kind $B_{\sigma\tau} \sim -\log p_{\sigma\tau}$.

The second approximation used, which consists in locating the beads representing the amino acids on the vertices of a cubic lattice of unit side length, implies that the conformational degrees of freedom are discrete. This is very convenient from a computational point of view and makes conformational entropy easy to handle. With this approximation, the small-scale motion of the protein (i.e. the peptide bond vibrations) is neglected and the chain is constrained to have unrealistic angles between monomers ($\pi/2$, $\pi$ and $3\pi/2$). A more realistic choice would have been an fcc lattice (the average mean-square difference between real proteins and their projection onto an fcc lattice is ∼1 Å [13]), although calculations are then slightly more complicated. Since the choice of the lattice does not change the underlying physics, in the following we restrict ourselves to the cubic lattice.
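As an illustration of Eq. (1), the following minimal Python sketch evaluates the contact energy of a chain placed on a unit cubic lattice. It is not the authors' code: the function name `contact_energy`, the two-letter toy alphabet and the toy matrix `B` are hypothetical, and two conventions are assumptions that the text does not spell out, namely that each pair is counted once and that monomers adjacent along the chain (which are always at unit distance) do not count as contacts.

```python
from itertools import combinations

def contact_energy(positions, kinds, B):
    """Contact energy in the spirit of Eq. (1) for a chain on a unit cubic lattice.

    positions: list of integer (x, y, z) lattice sites, one per monomer
    kinds:     sequence of monomer kinds, indexable like positions
    B:         nested mapping, B[kind_a][kind_b] = contact energy
    """
    energy = 0.0
    for i, j in combinations(range(len(positions)), 2):
        if j - i == 1:
            # Assumption: bonded chain neighbours are not counted as contacts.
            continue
        # Squared Euclidean distance between the two lattice sites.
        d2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
        if d2 == 1:
            # Distinct sites at distance <= 1 on a unit lattice are nearest neighbours.
            energy += B[kinds[i]][kinds[j]]
    return energy

# Toy two-letter interaction matrix (hypothetical values, not the MJ matrix).
B = {"H": {"H": -1.0, "P": 0.0}, "P": {"H": 0.0, "P": 0.0}}

# A 4-mer folded into a square: monomers 0 and 3 are in contact.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(contact_energy(square, "HPPH", B))  # -1.0
```

For a realistic calculation, `B` would be replaced by a 20 × 20 matrix such as the MJ matrix of Ref. [12].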
Our starting point is the inverse folding approach, which turns the folding problem upside down, asking which sequences fold to a given native conformation. The answer to this problem is well known, at least within the framework of simple (lattice) protein models. Good folders are obtained by minimizing the energy of the chain in the native conformation with respect to the amino acid sequence for fixed composition. Starting from a random sequence, composition-conserving mutations are introduced (swapping of amino acids). Within the framework of a Monte Carlo treatment, a sequence with sufficiently low energy is searched for.

Fig. 1. The model description of the native structure of a protein. In dark grey and light grey are displayed the 'hot' and 'warm' sites, respectively. The dashed lines indicate the contacts building the LES.

Good-folder sequences are characterized by a large gap $\delta = E_c - E_n$ (compared to the standard deviation $\sigma$ of the contact energies) between the energy of the sequence in the native conformation, $E_n$, and the lowest energy (threshold energy) of the conformations structurally dissimilar to the native conformation [14, 15]. The quantity $E_c$ is the lowest energy a random sequence can achieve in the process of compacting, and is solely determined by the composition of the protein. In other words, good folders are associated with a normalized gap $\xi = \delta/\sigma \gg 1$, a quantity closely related to the z-score [16]. Furthermore, starting from a designed sequence which displays a large gap, all mutated sequences which preserve (to some extent) the gap fold into the native conformation [17].

For the sake of definiteness, we will consider in the following a particular sequence made out of 36 amino acids, called $S_{36}$, which folds to the native structure shown in Fig. 1 and can be seen as a prototype of a folding model sequence [8].

3. Folding of Small Proteins

A striking result which emerges from studying the inverse folding approach is that the stabilization energy of a protein is not distributed evenly across its amino acids, but is concentrated in a few 'hot' residues [8]. Locating 'hot' amino acids is quite simple. For this purpose one introduces point mutations at each site of the native structure, that is, one replaces each of the amino acids of the designed (low-energy) sequence by each of the other 19 amino acids and studies whether the resulting sequence still folds or not. It is found that mutations at only a few sites denature the protein (i.e. impede its folding) as well as destabilize it (strongly reduce the native-state occupation probability). To be quantitative, we find that only 8% ± 2% (Fig. 1) of the amino acids of a designed sequence are highly conserved, strongly interacting and occupy a hot site in the native conformation, in general well protected inside the protein, as befits a hydrophobic residue. Mutations of the amino acids occupying the hot sites denature the protein, that is, block the unfolded (denatured) → native (D → N) phase transition. Mutations of amino acids occupying the other sites have little effect on the ability of the resulting sequence to fold onto the native conformation, and lead to sequences which, in the native conformation, still display an energy lower than $E_c$, thus qualifying as good folders. The resulting families of (homologous) proteins (folding to the same native structure) have in common essentially only the few amino acids which occupy the hot sites.

The hot amino acids determine not only the stability of the protein but also the hierarchy of native contact formation through which the protein, starting from an elongated phase, reaches the native conformation (cf. Fig. 2): a) formation, almost instantaneously, of a few local elementary structures (LES, i.e.
hidden intermediates corresponding to incipient α helices and β sheets, the secondary structures of proteins) stabilized by the interaction between the hot amino acids, b) formation of the minimum set of native contacts which brings the system over the major free energy barrier of the whole folding process, resulting from the docking of the LES (i.e. formation of the post-critical folding nucleus (FN)), c) relaxation of the remaining amino acids onto the native structure shortly after the formation of the FN, giving rise to a unique system with an energy below $E_c$ [9, 10].

Summing up, the folding of proteins is controlled by the corresponding hot amino acids through the LES, the ultimate building blocks of this molecular LEGO [18]. In other words, the simple, most important feature common to all designed sequences folding to the same native structure is the presence of a few highly conserved, strongly interacting, hot amino acids which stabilize the LES and which are buried inside the folding nucleus of the protein in its native conformation.

Fig. 2. Dynamics of contact formation for an MC simulation of the folding of the model sequence $S_{36}$. With a dashed line we label the contacts 3–6, 27–30 and 11–14 stabilizing the LES $S^4_1$, $S^4_2$ and $S^4_3$ (cf. Fig. 1). With solid dotted lines along the vertical axis we label (from top to bottom) the contacts 5–28, 3–30, 14–27, 6–11, 13–28, 6–27, 12–5, 4–29 forming the folding nucleus.

4. Predicting the Native State of a Model Protein

With the help of the results discussed above, we have developed a strategy which allows one to predict the three-dimensional native conformation of a model protein from its amino acid sequence (three-step strategy (3SS) [19]), that is, to solve the folding problem provided the contact energies acting among the amino acids are known.
The algorithm consists of three steps, namely 1) finding good candidates for the role of local elementary structures, 2) finding the folding nucleus, and 3) finding the native conformation by relaxing the residues not participating in the folding nucleus. This algorithm is based on the hierarchical sequence of events that allows the chain to fold fast, and works because at each step only a limited portion of the configuration space of the protein has to be searched. In what follows we discuss the 3SS algorithm in detail and apply it to a representative example of notional proteins.

Step 1: Find the local elementary structures (LES) which lead the process of protein folding. Elementary structures can be closed or open, depending on whether or not they contain interactions within themselves (aside from the peptide bond). Examples of closed elementary structures are provided by $S^4_1$, $S^4_2$ and $S^4_3$ (cf. Fig. 2). In keeping with this classification of LES, the present step is composed of two substeps.

Substep 1a: Find the open elementary structures. For each substring of the sequence, starting at monomer $i$ and ending at monomer $j$ ($0 < i < j < N$), we define the density of energy

\[ \epsilon_s = \frac{1}{j-i} \sum_{i \le l \le j} \min_{k \notin (i,j)} U_{m(l)\,m(k)} , \qquad (2) \]

where $U$ is the matrix of contact energies used to design the notional protein, e.g. the MJ matrix $B$ (cf. Eq. (1)), and $m(l)$ denotes the kind of the $l$-th monomer. In other words, $\epsilon_s$ is the average energy with which each element of the substring $(i, j)$ interacts with the rest of the chain. The substrings which are good candidates to be open elementary structures in the folding process have low values of $\epsilon_s$. Among such substrings we select those with values of $\epsilon_s$ lower than a threshold $\epsilon_s^*$.

Substep 1b: Find the closed elementary substructures.
For this purpose we evaluate, for each pair of monomers $i$ and $j$, the function

\[ p(i,j) = \frac{\exp\left(-U_{m(i)\,m(j)}/T_{\mathrm{eff}}\right)}{(j-i)^{\rho}} , \qquad (3) \]

where $T_{\mathrm{eff}}$ is an effective temperature which we set equal to the standard deviation of the interaction matrix $U$ (e.g. $\sigma = 0.3$ for the contact matrix of Ref. [12]). The exponent $\rho = 1.7$ reflects the ratio between the number of conformations associated with the formation of a contact and the total number of conformations. If a substructure contains more than one interaction, the values of $p$ associated with the different interactions are to be multiplied together. As possible (closed) local elementary structures, we select those composed of monomers $i, i+1, \ldots, j-1, j$ and with $p(i,j) > p^*$, where $p^*$ is a threshold value (see below).

Step 2: Find the folding nucleus. All the elementary structures found in substeps 1a and 1b (let $S$ be the total number of such structures) are moved in space and the conformational spectrum is found. This is done by selecting all possible choices of $1, 2, \ldots, S$ local elementary structures, giving them all possible relative conformations and making a complete enumeration of their reciprocal positions in space. The conformations with lowest energy are selected as possible candidates for the (post-critical) folding nucleus of the protein.

Step 3: Relax the remaining monomers around the folding core. This can be done through a complete enumeration of all the conformations displaying a given nucleus, since they are rather few (∼$10^4$ for a 36mer). Another way, which we found computationally attractive, is to use low-temperature Monte Carlo relaxation simulations, keeping fixed the monomers belonging to the folding core (see Note a).

Below we discuss the results of the 3SS strategy when applied to the designed sequence $S_{36}$. In Fig. 3a we display the corresponding distribution of values of $p(i,j)$ for this sequence.
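The two selection functions of Step 1, Eqs. (2) and (3), can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of Ref. [19]: the function names, the 0-based indexing, the toy two-letter matrix `U` and the `min_sep` parameter (a minimum chain separation of 3, the shortest at which two monomers can touch on a cubic lattice) are all hypothetical choices. The normalization of Eq. (2) by $j - i$ is taken as written in the text, although the sum runs over $j - i + 1$ monomers.

```python
import math

def energy_density(seq, U, i, j):
    """Eq. (2): for each monomer l of the substring [i, j], take the most
    favourable contact energy with any monomer k outside the substring,
    sum over l and divide by (j - i). The substring must not be the whole chain."""
    outside = [k for k in range(len(seq)) if k < i or k > j]
    total = sum(min(U[seq[l]][seq[k]] for k in outside) for l in range(i, j + 1))
    return total / (j - i)

def contact_weight(seq, U, i, j, t_eff, rho=1.7):
    """Eq. (3): p(i, j) = exp(-U[m(i)][m(j)] / T_eff) / (j - i)**rho."""
    return math.exp(-U[seq[i]][seq[j]] / t_eff) / (j - i) ** rho

def closed_candidates(seq, U, t_eff, p_star, min_sep=3):
    """Pairs (i, j) whose weight p(i, j) exceeds the threshold p_star.
    min_sep is an assumption of this sketch, not a parameter from the text."""
    hits = []
    for i in range(len(seq)):
        for j in range(i + min_sep, len(seq)):
            p = contact_weight(seq, U, i, j, t_eff)
            if p > p_star:
                hits.append((i, j, p))
    return hits

# Toy two-letter interaction matrix (hypothetical values).
U = {"a": {"a": -1.0, "b": 0.0}, "b": {"a": 0.0, "b": 0.5}}

print(energy_density("abab", U, 1, 2))                     # -1.0
print(closed_candidates("abba", U, t_eff=1.0, p_star=0.1))  # one candidate pair: (0, 3)
```

With a real 20 × 20 matrix one would set `t_eff` to its standard deviation, as the text prescribes, and scan all substrings $(i, j)$ with `energy_density` to find the open candidates below the threshold $\epsilon_s^*$.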
Three bonds have a $p$ value which is remarkably larger than that associated with the rest of the possible bonds of the protein, and consequently are good candidates for stabilizing closed local elementary structures. The distribution of values of $\epsilon_s$, displayed in Fig. 3b, shows a single peak, whose lowest points are associated with the same sites already involved in the closed elementary structures. It is thus likely that open elementary structures do not play any noticeable role in the folding process of $S_{36}$. We thus search for a folding nucleus composed of monomers $S^4_1 \equiv (3, 4, 5, 6)$, $S^4_2 \equiv (27, 28, 29, 30)$ and $S^4_3 \equiv (11, 12, 13, 14)$, and stabilized by the contacts 3–6, 11–14 and 27–30. A complete enumeration of all the conformations built out of these three elementary substructures gives the energy distribution displayed in Fig. 3c. The most stable of these conformations has energy −7.81 and is, in fact, the actual folding core. The relaxation of the other amino acids around it gives the right native conformation, with energy $E_n = -16.50$. The next low-energy conformations built out of the three elementary substructures have energies −7.75, −7.68 and −7.68. The relaxation of the other residues around these tentative folding nuclei leads to 'native' energies −12.40, −12.58 and −14.05, respectively. The first two of them are larger than $E_c = -14.0$, so they correspond to states which belong to the set of conformations structurally dissimilar to the native conformation we are searching for. The last of them has an energy just below $E_c$. Although it can hardly be confused with the native conformation, it corresponds to a metastable state which can slow down the folding process.

[Figure 3, image not reproduced: panel (a), n(p(i,j)) versus p(i,j); panel (b), n(ε_s) versus ε_s; panel (c), n(E) versus E.]

Fig. 3.
(a) The distribution of the parameter $p(i,j)$, whose maximization allows one to find the closed elementary structures. (b) The distribution of the energy density $\epsilon_s$, employed to find open elementary structures. (c) The distribution of the energies associated with the possible folding nuclei of sequence $S_{36}$, built from the elementary structures 3–4–5–6, 11–12–13–14 and 27–28–29–30.

5. Drug Design

Local elementary structures are also at the basis of a protocol for non-conventional drug design recently proposed by us [11]. Conventional drugs perform their activity either by activating or by inhibiting some target component of the cell. In particular, many inhibitory drugs bind to an enzyme and suppress its function by preventing the binding of the substrate. This is done either by capping the active site of the enzyme (competitive inhibition) or by binding to some other part of the enzyme and provoking structural changes which make the enzyme unfit to bind the substrate (allosteric inhibition).

The two main features that inhibitory drugs must display are efficiency and specificity. It is not sufficient that the drug binds to the target enzyme and efficiently reduces its activity. It is also important that it does not interfere with other cellular processes, binding only to the protein it was designed for. These features are usually achieved by designing drugs which mimic the molecular properties of the natural substrate. In fact, the enzyme/substrate pair has undergone millions of years of evolution in order to display the required features. Consequently, the more similar the drug is to the substrate, the lower the probability that it interferes with other cellular processes.

What this kind of inhibitory drug is not able to do is avoid the development of resistance, a phenomenon which is typically related to viral protein targets.
Under the selective pressure of the drug, the target is often able to mutate the amino acids either at the active site or at sites controlling its conformation, in such a way that the activity of the enzyme is essentially retained while the drug is no longer able to bind to it. An important example of drug resistance is connected with AIDS. In this case, one of the main target proteins, HIV-protease, a dimer formed out of two identical chains each containing 99 residues and folding according to the LES paradigm discussed above (cf. e.g. [20]), is able to mutate its active site so as to avoid the effects of drug action within a period of 6–8 months.

In keeping with this result and with the central role played by LES in the folding process of proteins, we suggest the use of short peptides with the same sequence as LES (p-LES) as non-conventional drugs which interfere with the folding mechanism of the target protein, destabilizing it and making it prone to proteolysis. These drugs are efficient, specific and do not suffer from the emergence of resistance. In fact, the very reason why LES make single-domain proteins fold fast confers on p-LES the features required to act as effective drugs, that is, efficiency and specificity. They are efficient because they bind as strongly as LES do. Since LES are responsible for the stability of the protein, their stabilization energy must be of the order of several times $kT$. These peptides are also as specific as LES are. In fact, LES have evolved over millions of years so as to prevent the emergence of metastable states and to avoid aggregation, aside from ensuring that the protein folds fast.

The possibility of developing non-conventional drugs for actual situations is tantamount to being able to determine the LES for a given protein. This can be done either experimentally (e.g. through molecular engineering [21]), or by extending the algorithm discussed in Ref. [19], making use of a realistic force field.
The resulting peptides can be used either directly as drugs, or as templates to build mimetic molecules which may avoid side effects connected with digestion or allergies. A feature which makes these drugs, in principle, quite promising as compared to conventional ones is that the target protein cannot evolve through mutations to escape the drug, as happens in particular in the case of viral proteins in response to conventional drugs, because the mutation of residues in the LES would, in any case, lead to protein denaturation.

Note

a. In some cases the system is non-ergodic, in the sense that from a given starting configuration it is not possible to reach all other configurations (with the folding core formed and fixed). In such cases several relaxation simulations are performed starting from different conformations (with the folding core formed and fixed). In keeping with this fact, the folding nucleus of a notional protein could be required not to be exceedingly stable, so as to avoid long-lived metastable states en route to folding. The (single) totally relaxed conformation with energy lower than $E_c$ is the native conformation of the protein.

References

1. J. Maddox, Does folding determine protein configuration? Nature 370 (1994) 13.
2. D.D. Shoemaker et al., Experimental annotation of the human genome using microarray technology, Nature 409 (2001) 922.
3. J.C. Venter et al., The sequence of the human genome, Science 291 (2001) 1304.
4. R.F. Service, Tapping DNA structures produces a trickle, News Focus, Science 298 (2002) 948.
5. Protein Data Bank, http://www.rcsb.org.
6. M.F. Browner, W.W. Smith and A.L. Castelhano, Matrilysin-inhibitor complexes: common themes among metalloproteases, Biochemistry 23 (1995) 6602.
7. M. Mezard, G. Parisi and M. Virasoro, Spin Glasses and Beyond, World Scientific, New York, 1988.
8. G. Tiana, R.A. Broglia, H.E. Roman, E. Vigezzi and E.I. Shakhnovich, Folding and misfolding of designed protein-like chains with mutations, J. Chem. Phys. 108 (1998) 757.
9. R.A. Broglia and G. Tiana, Hierarchy of Events in the folding of model proteins, J. Chem. Phys. 114 (2001) 7267.
10. G. Tiana and R.A. Broglia, Statistical Analysis of Native Contact Formation in the Folding of Designed Model Proteins, J. Chem. Phys. 114 (2001) 2503.
11. R.A. Broglia, G. Tiana and R. Berera, Resistance proof, folding-inhibitor drugs, J. Chem. Phys. 118 (2003) 4754.
12. S. Miyazawa and R. Jernigan, Estimation of effective interresidue contact energies from protein crystal structures, Macromolecules 18 (1985) 534.
13. R.H. Park and M. Levitt, The complexity and accuracy of discrete state models of protein structure, J. Mol. Biol. 249 (1995) 493.
14. E.I. Shakhnovich, Proteins with selected sequences fold into unique native conformation, Phys. Rev. Lett. 72 (1994) 3907.
15. E.I. Shakhnovich and A. Gutin, Enumeration of all compact conformations of copolymers with random sequence of links, J. Chem. Phys. 93 (1989) 5967.
16. V.I. Abkevich, A.M. Gutin and E.I. Shakhnovich, Specific nucleus as the transition state for protein folding, Biochemistry 33 (1994) 10026.
17. R.A. Broglia, G. Tiana, H.E. Roman, E. Vigezzi and E. Shakhnovich, Stability of Designed Proteins against Mutations, Phys. Rev. Lett. 82 (1999) 4727.
18. R.A. Broglia, G. Tiana, S. Pasquali, H.E. Roman, E. Vigezzi, Folding and Aggregation of Designed Protein Chains, Proc. Natl. Acad. Sci. USA 95 (1998) 12930.
19. R.A. Broglia and G. Tiana, Reading the three-dimensional structure of a protein from its amino acid sequence, Proteins 45 (2001) 421.
20. G. Tiana and R.A. Broglia, Folding and design of dimeric proteins, Proteins 49 (2002) 82.
21. A. Fersht, Structure and Mechanism in Protein Science, W.H. Freeman and Co., New York, 1999.
