
© 2000 Nature America Inc. • http://structbio.nature.com

progress

Protein production: feeding the crystallographers and NMR spectroscopists

Aled M. Edwards¹⁻³, Cheryl H. Arrowsmith¹,³, Dinesh Christendat¹, Akil Dharamsi³, James D. Friesen²,³, Jack F. Greenblatt²,³ and Masoud Vedadi³

Protein purification efforts for structural genomics will focus on automation for the readily expressed proteins, and on process development for the more difficult ones, such as membrane proteins. Thousands of proteins are expected to be produced in the next few years. The purified proteins will be valuable reagents for the entire research community.

Structural genomics [1] or structural proteomics [2,3] can be defined as the quest to obtain the three-dimensional structures of all proteins. However, converting sequence information to biological reagents, which includes choosing proper expression constructs, purifying proteins rapidly and obtaining excellent structural samples, remains a significant problem for this new field. The development of better and faster ways to clone, express and purify proteins is expected to generate new methods and reagents (clones, proteins, and purification procedures) that will benefit the general biological community as well as structural genomics researchers. This review focuses on the technical hurdles to be encountered and provides an overview of the strategies that are currently being used or developed to solve the problems that lie ahead.

Cloning: Expression constructs

Prokaryotes and eukaryotes: two issues. Producing an expression construct for genes without introns is a matter of conventional recombinant DNA technology. Most of these manipulations involve liquid handling and are attractive targets for automation.
Many of the structural genomics efforts are directed to this end, and it is expected that within a few years a small group of people should be able to create tens of thousands of expression clones per year. The most common expression strategy, and the one likely to dominate in the area of bacterial expression, will be to drive recombinant protein expression using an inducible T7 RNA polymerase promoter.

Creating expression constructs for intron-containing genes will be limited by the availability of full-length cDNA clones. Libraries of full-length cDNA clones have been under construction for some time for purposes peripheral to structural genomics. It is therefore important for structural genomics researchers to establish alliances with these other projects.

Affinity tags: which ones? Automation, an essential facet of structural genomics, will almost certainly require that the recombinant proteins be affinity-tagged. In order to establish best practices, it will be important to conduct a well-controlled study in which a large set of proteins is tagged with a variety of affinity tags, such as polyhistidine and glutathione S-transferase (GST) tags. It will also be important to assess the merits of appending the tags to the N- or the C-termini of the recombinant proteins. The differently tagged proteins will then be tested for soluble expression and suitability for structure determination. An interim strategy might be to select one type of affinity tag and identify the proteins that are well behaved under those conditions. The remaining proteins might then be constructed with one or more different affinity tags. To date, it appears that the polyhistidine tag has been the most popular and effective. A similar large-scale, well-controlled study will be required to determine whether it is advisable to leave or remove the tag before crystallization or NMR studies. At the present state of knowledge, it is probably best to try both forms of the protein.
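
The controlled tagging study described above amounts to a full factorial design over tag type, tag position and tag removal. The following Python sketch shows one way such a construct list could be enumerated. The tag names come from the text; the target identifiers, the naming scheme and the `Construct` class are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from itertools import product

# Design factors. The two tags are named in the text; the labels themselves
# and the naming scheme below are illustrative assumptions.
TAGS = ("His6", "GST")
TERMINI = ("N", "C")
TAG_STATES = ("tag_retained", "tag_removed")


@dataclass(frozen=True)
class Construct:
    target: str      # hypothetical identifier of the open reading frame
    tag: str         # affinity tag
    terminus: str    # terminus to which the tag is appended
    tag_state: str   # whether the tag is cleaved before structural work

    @property
    def name(self) -> str:
        return f"{self.target}_{self.terminus}-{self.tag}_{self.tag_state}"


def design_constructs(targets):
    """Return every target x tag x terminus x tag-state combination."""
    return [
        Construct(target, tag, terminus, state)
        for target, tag, terminus, state in product(targets, TAGS, TERMINI, TAG_STATES)
    ]


constructs = design_constructs(["orf0001", "orf0002"])
print(len(constructs))          # 2 targets x 2 tags x 2 termini x 2 states
for construct in constructs[:4]:
    print(construct.name)
```

The interim strategy mentioned in the text (one tag first, alternatives only for the failures) would correspond to running `design_constructs` with a single-element tag set and repeating it for the proteins that fail.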
Protein expression: The ‘low-hanging fruit’

Our analysis of over one thousand proteins, derived from all three biological kingdoms, has revealed that ~15–20% of small (<50 kDa) non-membrane proteins will be suitable immediately for structural biology [2,3]. The analysis of these proteins will require no conceptual or technical advances. By definition, they express well in soluble form in Escherichia coli, can be purified easily, and will crystallize readily or give good NMR spectra. These proteins will provide ample subjects for scientists who seek to improve downstream technology, such as data collection, data processing, and structure determination and analysis.

The ‘low-hanging fruit’ phase of the structural genomics projects will almost exclusively exploit E. coli as an expression system because of its ease of use, its low cost, the ready availability of suitable expression vectors, and the ease with which proteins can be labeled metabolically with seleno-methionine for crystallography, or with ¹⁵N and ¹³C for NMR studies. In addition, for proteins whose codon composition varies from that used in E. coli, the cells can be supplemented with genes for the tRNAs that recognize the low-abundance codons, such as with the commercially available Codon Plus system from Stratagene.

Working with these relatively tractable proteins will provide an excellent opportunity to develop automated processes. Automation strategies for many of the steps in existing procedures (such as cloning and cell growth) are relatively straightforward, mostly a matter of liquid handling.

¹Ontario Cancer Institute, 610 University Avenue, Toronto, Ontario, Canada. ²Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada M5W 1L6. ³Integrative Proteomics, 161 Bay St., Toronto, Ontario, Canada.
Correspondence should be addressed to A.M.E. email: [email protected]

nature structural biology • structural genomics supplement • november 2000
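
The codon-composition point made in the ‘low-hanging fruit’ section can be illustrated with a short Python sketch that flags coding sequences rich in codons that are rare in E. coli. The set of rare codons used here (AGA, AGG, AUA and CUA, written as DNA) and the 5% threshold are assumptions chosen for illustration; neither is taken from the article, and the function names are hypothetical.

```python
# Codons treated as rare in E. coli. This set and the default threshold
# are illustrative assumptions, not values from the article.
RARE_CODONS = frozenset({"AGA", "AGG", "ATA", "CTA"})


def split_codons(cds: str):
    """Split a coding sequence (DNA or RNA letters) into codons."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def rare_codon_report(cds: str, rare=RARE_CODONS):
    """Count rare codons and report their 1-based codon positions."""
    codons = split_codons(cds)
    positions = [i + 1 for i, codon in enumerate(codons) if codon in rare]
    fraction = len(positions) / len(codons) if codons else 0.0
    return {
        "n_codons": len(codons),
        "rare_positions": positions,
        "rare_fraction": fraction,
    }


def needs_trna_supplement(cds: str, threshold: float = 0.05) -> bool:
    """Suggest a tRNA-supplemented host when rare codons exceed the threshold."""
    return rare_codon_report(cds)["rare_fraction"] > threshold


report = rare_codon_report("ATGAGAAGGCTAAAATAA")
print(report)
```

A screen of this kind would only prioritize candidates for a tRNA-supplemented strain; whether supplementation actually rescues expression still has to be tested experimentally.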
Protein expression: Climbing the tree

Many individual proteins cannot be expressed in soluble form in bacteria. These include one-third to one half of prokaryote proteins (unpublished data). This proportion is likely to be higher for eukaryotic proteins, particularly those that comprise multiple domains, those that require cofactors or protein partners for proper folding, or those that require extensive post-translational modification. Unfortunately, many of the advances that have prepared the way for proteome-wide three-dimensional structure determination, namely seleno-methionine incorporation for MAD phasing or ¹³C, ¹⁵N-metabolic labeling, were developed with E. coli as an expression system [4,5]. At present, developing new systems and strategies for the expression of soluble, labeled proteins or protein fragments is probably the greatest challenge facing structural genomics researchers.

Insolubility arises either from an intrinsic property of a protein (for example, aggregation due to a very hydrophobic patch on the surface) or because the protein is not susceptible to the folding mechanisms in the expression host, in which case folding intermediates aggregate. Below we discuss approaches for the generation of soluble versions of these more difficult proteins.

Screening for the most soluble ortholog. It is currently impossible to predict the degree of solubility of the encoded protein from its gene sequence. It is known, however, that even subtle changes in amino acid sequence can dramatically affect protein solubility [6]. Thus, for proteins that have many orthologs, a common strategy is to clone and express many and select the ortholog with the best solubility properties. In isolated instances, this approach has proven successful. We do not yet know how effective it will be on a proteome-wide scale.

Alternative expression systems. The use of different expression systems often allows the soluble expression of proteins that are insoluble when expressed in E. coli. At present, the other choices for expressing these proteins, such as yeast, insect or human cells, have disadvantages. Insect and human cell culture, in its current form, is expensive and time-consuming, and current yeast systems lack the capacity to add many post-translational modifications that are important for proteins from higher organisms. In addition, the development of metabolic labeling in these systems is in its infancy [7].

The use of cell-free expression systems in bacterial extracts, developed more than a decade ago but never able to achieve the requisite efficiency, is now making a resurgence in protein expression efforts in Japan [8]. The problems that have plagued the use of the methods (mostly low protein yields) have evidently been overcome by incorporating continuous flow methods, and cell-free metabolic labeling has been reported (see the article by Yokoyama). Researchers are awaiting with great interest further results from the Japanese structural genomics projects.

Protein domains. Proteins with multiple domains are difficult targets for structural genomics efforts for three reasons. First, multi-domain proteins are difficult to express in E. coli. Second, these proteins exhibit conformational heterogeneity, which decreases the probability of crystallization. Third, multi-domain proteins are relatively large, which increases the difficulty of using NMR to determine their structure.

A general and powerful strategy that simplifies the analysis of complex proteins is to produce and study them as individual domains. Experience has shown that a single domain of a protein can often be expressed in bacteria, whereas the intact, multi-domain protein cannot. Over the past decade, this approach has yielded structural information for dozens of eukaryotic proteins, and we expect that a similar strategy on a proteome-wide scale will be necessary in order to complete the project. In our experience, experimental results (limited proteolysis coupled with mass spectrometry) [9–11] have been better indicators of domain boundaries than sequence comparisons. It may be that sequence-based approaches are of limited use in defining structural domains because sequence conservation can be found in regions of proteins that do not adopt a stable tertiary structure (such as a protein interaction motif that folds only when bound to its partner). In some cases, the expression of proteins requires the co-expression of other proteins for proper folding and to achieve adequate solubility [12,13].

Data mining. The rules governing protein expression and solubility and even protein crystallization are unknown. By assembling a database of the successes and failures of the large-scale expression and purification trials, researchers will be able to deduce correlations between protein sequence and behavior. Even with a limited database, our group has found links between protein sequence and solubility and the propensity to crystallize [3]. Ultimately, this knowledge will guide the mutagenesis of proteins to generate samples more amenable to expression, purification and structure determination.

Chemical proteomics. Proteins can also be expressed poorly or be insoluble because they lack an obligate cofactor. For most such proteins, however, it is impossible to predict the nature of the cofactor. In order to identify possible cofactors, we believe that protein expression and purification methods must include screens for small molecules that interact with newly synthesized proteins. These screens should incorporate known bioactive small molecules (such as ATP) as well as a library of new chemical entities. Indeed, systematic identification of interacting compounds may provide a means to purify more proteins, as well as a method to ascribe function to new proteins. This approach, which can be defined as ‘chemical proteomics’, helps determine protein function by matching each member of a library of proteins with a small molecule chemical, perhaps a component of a combinatorial chemistry library.

Membrane proteins. Membrane proteins are especially challenging subjects for structural genomics. Crystallization of integral membrane proteins has been difficult, and the evolution to a high-throughput approach will require considerable process development. The development of these processes will depend on a plentiful supply of properly folded membrane proteins. The current expression systems are inadequate in this regard, and it is clear that inexpensive alternate systems, which are capable of generating milligram quantities of membrane proteins, need to be devised. One strategy might be to exploit biological systems that are primed to synthesize vast quantities of membrane proteins, including mammalian cells, such as oligodendrocytes or retinal cells, or plant cells, with their thylakoid membranes.

The expression and purification of intra-cellular or extra-cellular domains of membrane proteins is somewhat easier. While many of these domains can be expressed and purified in the absence of a detergent or membrane-like environment, their proper folding and purification will depend on factors such as the oxidative environment (for extra-cellular domains) or a highly ionic environment (for membrane-associated proteins or domains). Thus, specific (and separate) procedures will likely have to be developed for different classes of membrane-anchored or membrane-associated proteins and domains.

Genetic selection. The use of genetics to select for proteins that express well in E. coli is potentially an important complement to biochemical approaches. In one example of this approach, Waldo and colleagues [14] fused a variety of coding sequences N-terminal to the coding region of the green fluorescent protein (GFP), which is known to fold poorly in E. coli. Insoluble proteins were found to inhibit the folding of GFP; proteins that were soluble did not affect the folding of GFP. The solubility of the fusion protein could therefore be monitored by measuring the fluorescence of the bacterial colonies after transformation. A similar strategy uses chloramphenicol acetyl transferase as the fusion protein and antibiotic resistance as the read-out [15]. Such assays should allow scientists to screen for more soluble mutants of specific proteins.

Protein purification, generation of structural samples

Structural genomics projects require the purification of hundreds of thousands of proteins and/or protein fragments. This goal cannot be met with current technology because protein purification demands considerable user intervention and expert decision-making abilities. Thus, the success of structural genomics depends on the development of automated or semi-automated, robust and inexpensive methods for protein purification.

Different requirements for NMR and crystallization. Purification strategies are routine for proteins that are highly expressed for use in NMR or X-ray crystallography. An affinity-chromatography step using a removable tag, such as hexahistidine, usually provides a major step of the purification. For NMR studies, one or two simple chromatographic steps yield protein samples that are sufficiently pure for acquisition of an initial HSQC NMR spectrum. The small hexahistidine tag is advantageous for NMR samples because its presence does not usually interfere with the spectrum. For crystallization, removal of the tag and an additional purification step may be required. Greater purity increases the probability that crystals will grow and enhances the reproducibility among crystallization trials. The main concern for both crystallization and NMR is to eliminate contaminating proteases from protein samples.

Robotics. Complete automation of protein purification from bacterial extracts will be difficult, particularly the steps of harvesting cells, lysing cells and preparing clarified extracts. The centrifugation and/or filtration processes are most challenging because of the large volumes required, the viscosity of the cell paste and the propensity of filters to clog. These problems will only be amplified if throughputs of hundreds of proteins per day are required. This opens up opportunities for novel automation technology. Until such developments arrive, researchers will adopt semi-automated processes in which the particularly unfriendly steps of cell harvesting and lysis are performed manually.

Poorly-expressed proteins. High-throughput processes for generating structural samples of large and complex proteins are not yet established. One approach might be a scaled-up version of common structural biology methods. Proteins will be produced in a suitable expression host and, probably at considerable cost of time and effort, purified and processed for structure determination. Historically, these methods have had low success rates, and it is unlikely, in our opinion, that implementation of these classical approaches on a proteome-wide scale will be sufficiently fast or cost-effective. As stated above, some strategies for the purification and structural genomics of ‘intractable’ proteins are based on the fact that complex proteins contain many simple domains and that the probability of expressing and crystallizing single-domain proteins is significantly higher than that for multi-domain proteins.

Quality control of purified proteins

The aim of structural genomics is to determine protein structures in the most efficient manner. Along the path to structure determination, careful attention to quality control can maximize efficiency. Perhaps the most cost-effective way to make the pipeline more efficient is to use a suite of biophysical techniques, such as dynamic light scattering, mass spectrometry, one-dimensional NMR or native gel electrophoresis, to assess the suitability of the protein for structure determination. In this way, proteins that are unstable, that show a propensity to aggregate or that have been proteolyzed can be eliminated from the pipeline at a relatively early stage. These techniques can also be used to guide the process of engineering derivatives of each protein that are more suitable for structure determination.

Crystallography or NMR?

Once each protein has passed the quality control step, one is left with a choice: NMR or X-ray crystallography? At present, NMR remains plagued by heavy instrumentation requirements, the need to interpret the data manually, the size limitation of ∼150 residues and the difficulty of assessing the statistical correctness of the structure. However, NMR is on the verge of a dramatic increase in efficiency, with improvements in instrumentation and in automation of data analysis. Protein crystallography, which does not have size limitations, is also advancing and may soon be an almost fully automated process, from crystal to structure. Therefore, crystallography may dominate the first stages of the structural genomics efforts. However, the inability to predict and control protein crystallization may turn out to be a serious bottleneck. In this regard, the advantages of NMR are that it does not require excessive protein purity and it avoids the requirement to grow crystals. Ultimately, we predict that the structural genomics projects could evolve to the point where many small proteins will be tackled by NMR and crystallography will be the method of choice for larger proteins and protein complexes.

Acknowledgments

The authors would like to acknowledge the research support of the Ontario government, the University of Toronto and the Ontario Cancer Institute. A.M.E. and C.H.A. are Scientists of the Canadian Institutes of Health Research (CIHR). J.F.G. is a Distinguished Scientist of the CIHR and a Foreign Investigator of the Howard Hughes Medical Institute. D.C. was supported by a Best Fellowship.

Associations with structural genomics

C.H.A. and A.M.E. direct the Ontario Structural Proteomics Initiative and are affiliated with the Midwest Center for Structural Genomics and the Northeast Structural Genomics Consortium. C.H.A., A.M.E., J.F.G., J.D.F., A.D. and M.V. are affiliated with Integrative Proteomics, a company that discovers protein structure and function on a genome-wide scale.

1. Kim, S.H. Nature Struct. Biol. 5, 643–645 (1998).
2. Christendat, D. et al. Nature Struct. Biol. 7, 903–909 (2000).
3. Christendat, D. et al. Progr. Biophys. and Mol. Biol. in the press (2000).
4. Hendrickson, W.A., Horton, J.R. & LeMaster, D.M. EMBO J. 9, 1665–1672 (1990).
5. Bax, A., Ikura, M., Kay, L.E., Barbato, G. & Spera, S. Ciba Found. Symp. 161, 108–119 (1991).
6. Eberstadt, M. et al. Nature 392, 941–945 (1998).
7. Hansen, A.P. et al. Biochemistry 31, 12713–12718 (1992).
8. Kigawa, T. et al. FEBS Lett. 442, 15–19 (1999).
9. Cohen, S.L., Ferre-D’Amare, A.R., Burley, S.K. & Chait, B.T. Protein Sci. 4, 1088–1099 (1995).
10. Pfuetzner, R.A., Bochkarev, A., Frappier, L. & Edwards, A.M. J. Biol. Chem. 272, 430–434 (1997).
11. Barwell, J.A. et al. J. Biol. Chem. 270, 20556–20559 (1995).
12. Bochkareva, E., Frappier, L., Edwards, A.M. & Bochkarev, A. J. Biol. Chem. 273, 3932–3936 (1998).
13. Koth, C.M. et al. J. Biol. Chem. 275, 11174–11180 (2000).
14. Waldo, G.S., Standish, B.M., Berendzen, J. & Terwilliger, T.C. Nature Biotech. 17, 691–695 (1999).
15. Maxwell, K.L., Mittermaier, A.K., Forman-Kay, J.D. & Davidson, A.R. Protein Sci. 8, 1908–1911 (1999).
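
As a rough illustration of the quality-control step and the subsequent choice between NMR and crystallography discussed in the main text, the following Python sketch triages purified samples and suggests a route. It is a minimal sketch under stated assumptions: the polydispersity and mass-tolerance thresholds are hypothetical placeholders, the `Sample` fields are invented for the example, and only the 150-residue figure echoes the approximate NMR size limit quoted in the article.

```python
from dataclasses import dataclass

# Hypothetical placeholder thresholds; only NMR_MAX_RESIDUES reflects a
# number quoted in the text (the ~150-residue NMR size limitation).
MAX_POLYDISPERSITY = 0.25   # DLS polydispersity above this suggests aggregation
MASS_TOLERANCE_DA = 5.0     # allowed |observed - expected| mass difference
NMR_MAX_RESIDUES = 150


@dataclass
class Sample:
    name: str
    n_residues: int
    expected_mass_da: float     # calculated from the construct sequence
    observed_mass_da: float     # measured by mass spectrometry
    dls_polydispersity: float   # measured by dynamic light scattering


def qc_failures(sample: Sample):
    """Return the list of quality-control checks that the sample fails."""
    failures = []
    if sample.dls_polydispersity > MAX_POLYDISPERSITY:
        failures.append("aggregation")
    if abs(sample.observed_mass_da - sample.expected_mass_da) > MASS_TOLERANCE_DA:
        failures.append("mass mismatch (possible proteolysis)")
    return failures


def route(sample: Sample) -> str:
    """Reject samples that fail QC; otherwise suggest a structural method."""
    failures = qc_failures(sample)
    if failures:
        return "reject: " + ", ".join(failures)
    if sample.n_residues <= NMR_MAX_RESIDUES:
        return "NMR or crystallography"
    return "crystallography"


samples = [
    Sample("small_domain", 120, 13500.0, 13501.2, 0.10),
    Sample("large_enzyme", 400, 45000.0, 45000.5, 0.15),
    Sample("clipped_aggregate", 200, 22000.0, 20100.0, 0.40),
]
for sample in samples:
    print(sample.name, "->", route(sample))
```

Rejected samples are not necessarily discarded; as the text notes, the same measurements can guide the engineering of derivatives that are better suited to structure determination.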
