© 2000 Nature America Inc. • http://structbio.nature.com
progress
Protein production: feeding the
crystallographers and NMR spectroscopists
Aled M. Edwards1–3, Cheryl H. Arrowsmith1,3, Dinesh Christendat1, Akil Dharamsi3, James D. Friesen2,3,
Jack F. Greenblatt2,3 and Masoud Vedadi3
Protein purification efforts for structural genomics will focus on automation for the readily-expressed proteins,
and process development for the more difficult ones, such as membrane proteins. Thousands of proteins are
expected to be produced in the next few years. The purified proteins will be valuable reagents for the entire
research community.
Structural genomics1 or structural proteomics2,3 can be defined as the quest to
obtain the three-dimensional structures of all proteins. However, converting sequence
information to biological reagents, which includes choosing proper expression constructs,
purifying proteins rapidly and obtaining excellent structural samples,
remains a significant problem for this new field. The development
of better and faster methods to clone, express and purify proteins
is expected to generate new methods and reagents (clones, proteins, and purification procedures) that will benefit the general
biological community as well as structural genomics researchers.
This review focuses on the technical hurdles to be encountered
and provides an overview of the strategies that are currently being
used or developed to solve the problems that lie ahead.
Cloning: Expression constructs
Prokaryotes and eukaryotes: two issues. Producing an expression construct for genes without introns is a matter of conventional recombinant DNA technology. Most of these manipulations involve liquid
handling and are attractive targets for automation. Many of the
structural genomics efforts are directed to this end and it is expected
that within a few years, a small group of people should be able to
create tens of thousands of expression clones per year. The most
common expression strategy, and the one likely to dominate in the
area of bacterial expression, will be to drive recombinant protein
expression using an inducible T7 RNA polymerase promoter.
Creating expression constructs for intron-containing genes
will be limited by the availability of full-length cDNA clones.
Libraries of full-length cDNA clones have been under construction for some time for purposes peripheral to structural
genomics. Therefore it is important for structural genomics
researchers to establish alliances with these other projects.
Affinity tags: which ones? Automation, an essential facet of structural genomics, will almost certainly require that the recombinant
proteins are affinity-tagged. In order to establish best practices, it
will be important to conduct a well-controlled study in which a
large set of proteins is tagged with a variety of affinity tags, such as
polyhistidine and glutathione S-transferase (GST) tags. It will also
be important to assess the merits of appending the tags to the N- or the C-termini of the recombinant proteins. The different tagged
proteins will then be tested for soluble expression and suitability
for structure determination. An interim strategy might be to select
one type of affinity tag and identify the well-behaved proteins
under those conditions. The remaining proteins might then be
constructed with a different affinity tag(s). To date, it appears that
the polyhistidine tag has been the most popular and effective.
A similar large-scale, well-controlled study will be required to
determine if it is advisable to leave or remove the tag before crystallization or NMR studies. At the present state of knowledge, it
is probably best to try both forms of the protein.
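The controlled tagging study proposed above amounts to a full factorial design over tag type, tag position and tag removal. The following is a minimal sketch of how such a construct list might be enumerated; the target names, the tag labels and the `design_constructs` helper are illustrative assumptions, not part of any published protocol.

```python
from itertools import product

# Illustrative assumptions: the tags named in the text, both termini,
# and the "leave or remove the tag" choice before structural work.
TAGS = ("His6", "GST")
TERMINI = ("N", "C")
TAG_STATES = ("tag_retained", "tag_removed")


def design_constructs(targets):
    """Return one record per (target, tag, terminus, tag state) combination."""
    return [
        {"target": target, "tag": tag, "terminus": terminus, "state": state}
        for target, tag, terminus, state in product(
            targets, TAGS, TERMINI, TAG_STATES
        )
    ]


# Hypothetical target names, used only to show the size of the design.
constructs = design_constructs(["protA", "protB", "protC"])
print(len(constructs))  # 3 targets x 2 tags x 2 termini x 2 states
```

Even this small design yields eight samples per target, which illustrates why an interim strategy based on a single tag may be attractive.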
Protein expression: The ‘low-hanging fruit’
Our analysis of over one thousand proteins, derived from all
three biological Kingdoms, has revealed that ~15–20% of small
(<50 kDa) non-membrane proteins will be suitable immediately
for structural biology2,3. The analysis of these proteins will
require no conceptual or technical advances. By definition, they
express well in soluble form in Escherichia coli, can be purified
easily, and will crystallize readily or give good NMR spectra.
These proteins will provide ample subjects for scientists who
seek to improve downstream technology, such as data collection,
data processing, and structure determination and analysis.
The ‘low-hanging fruit’ phase of the structural genomics projects will almost exclusively exploit E. coli as an expression system
because of its ease of use, low cost, ready availability of suitable
expression vectors, and ease with which proteins can be labeled
metabolically with seleno-methionine for crystallography, or
with 15N and 13C for NMR studies. In addition, for proteins
whose codon composition varies from that used in E. coli, the
cells can be supplemented with tRNA genes encoding the low
abundance codons, such as with the commercially available
Codon Plus system from Stratagene.
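As a rough illustration of how one might flag genes whose codon composition differs from that of E. coli, the sketch below counts codons from a small set that is assumed here to be rare in E. coli. The codon set, the function name and the example sequence are assumptions chosen for illustration; a real analysis would use a measured codon-usage table.

```python
# Assumed, non-exhaustive set of codons treated as rare in E. coli
# (two Arg codons, one Ile, one Leu, one Pro); for illustration only.
RARE_CODONS = {"AGA", "AGG", "ATA", "CTA", "CCC"}


def rare_codon_fraction(cds):
    """Return the fraction of codons in a coding sequence that are in RARE_CODONS."""
    cds = cds.upper()
    if not cds or len(cds) % 3 != 0:
        raise ValueError("coding sequence must be non-empty and a multiple of 3 long")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    rare = sum(1 for codon in codons if codon in RARE_CODONS)
    return rare / len(codons)


# Made-up seven-codon sequence: ATG AGA AGG CTG ATA CCC TAA
example = "ATGAGAAGGCTGATACCCTAA"
print(rare_codon_fraction(example))
```

A gene with a high fraction would be a candidate for expression in a host supplemented with the corresponding tRNA genes.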
Working with these relatively tractable proteins will provide
an excellent opportunity to develop automated processes.
Automation strategies for many of the steps in existing procedures (such as cloning and cell growth) are relatively straightforward, mostly a matter of liquid handling.
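Because these steps are mostly liquid handling, much of the bookkeeping reduces to tracking which clone sits in which well. The sketch below assumes a standard 96-well plate (rows A to H, columns 1 to 12) filled row by row; the plate format and the `well_for` helper are assumptions for illustration and are not specified in the text.

```python
import string

ROWS = string.ascii_uppercase[:8]  # rows A-H of an assumed 96-well plate
COLS = 12                          # columns 1-12
WELLS_PER_PLATE = len(ROWS) * COLS


def well_for(index):
    """Map a zero-based clone index to (plate number, well label), filling row by row."""
    if index < 0:
        raise ValueError("index must be non-negative")
    plate, offset = divmod(index, WELLS_PER_PLATE)
    row, col = divmod(offset, COLS)
    return plate + 1, f"{ROWS[row]}{col + 1}"


print(well_for(0), well_for(95), well_for(96))
```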
1Ontario Cancer Institute, 610 University Avenue, Toronto, Ontario, Canada. 2Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario,
Canada M5W 1L6. 3Integrative Proteomics, 161 Bay St., Toronto, Ontario, Canada. Correspondence should be addressed to A.M.E. email: [email protected]
nature structural biology • structural genomics supplement • november 2000
Protein expression: Climbing the tree
Many individual proteins cannot be expressed in soluble form in
bacteria. These include one-third to one-half of prokaryote proteins (unpublished data). This proportion is likely to be higher
for eukaryotic proteins, particularly those that comprise multiple domains, those that require cofactors or protein partners for
proper folding, or those that require extensive post-translational
modification. Unfortunately, many of the advances that have
prepared the way for proteome-wide three-dimensional structure determination, namely seleno-methionine incorporation
for MAD phasing or 13C, 15N-metabolic labeling, were developed
with E. coli as an expression system4,5. At present, developing new
systems and strategies for the expression of soluble, labeled proteins or protein fragments is probably the greatest impediment
for structural genomics researchers.
Insolubility arises either from an intrinsic property of a protein
(for example, aggregation due to a very hydrophobic patch on the
surface) or because the protein is not susceptible to the folding
mechanisms in the expression host, in which case folding intermediates aggregate. Below we discuss approaches for
the generation of soluble versions of these more difficult proteins.
Screening for the most soluble ortholog. It is currently impossible to predict the degree of solubility of the encoded protein from its gene sequence. It is known, however, that even subtle changes in amino acid sequence can dramatically affect protein solubility6. Thus, for proteins that have many orthologs, a common strategy is to clone and express many and select the ortholog with the best solubility properties. In isolated instances, this approach has proven successful. We do not yet know how effective it will be on a proteome-wide scale.
Alternative expression systems. The use of different expression systems often allows the soluble expression of proteins that are insoluble when expressed in E. coli. At present, the other choices for expressing these proteins, such as yeast, insect or human cells, have disadvantages. Insect and human cell culture, in its current form, is expensive and time-consuming, and current yeast systems lack the capacity to add many post-translational modifications that are important for proteins from higher organisms. In addition, the development of metabolic-labeling in these systems is in its infancy7.
The use of cell-free expression systems in bacterial extracts, developed more than a decade ago but never able to achieve the requisite efficiency, is now making a resurgence in protein expression efforts in Japan8. The problems that have plagued the use of the methods (mostly low protein yields) have evidently been overcome by incorporating continuous flow methods, and cell-free metabolic labeling has been reported (see the article by Yokoyama). Researchers are awaiting with great interest further results from the Japanese structural genomics projects.
Protein domains. Proteins with multiple domains are difficult targets for structural genomics efforts for three reasons. First, multi-domain proteins are difficult to express in E. coli. Second, these proteins exhibit conformational heterogeneity, which decreases the probability of crystallization. Third, multi-domain proteins are relatively large, which increases the difficulty of using NMR to determine their structure.
A general and powerful strategy that simplifies the analysis of complex proteins is to produce and study them as individual domains. Experience has shown that a single domain of a protein can often be expressed in bacteria, whereas the intact, multi-domain protein cannot. Over the past decade, this approach has yielded structural information for dozens of eukaryotic proteins, and we expect that a similar strategy on a proteome-wide scale will be necessary in order to complete the project.
In our experience, experimental results (limited proteolysis coupled with mass spectrometry)9–11 have been better indicators of domain boundaries than sequence comparisons. It may be that sequence-based approaches are of limited use in defining structural domains because sequence conservation can be found in regions of proteins that do not adopt a stable tertiary structure (such as a protein interaction motif that folds only when bound to its partner). In some cases, the expression of proteins requires the co-expression of other proteins for proper folding and to achieve adequate solubility12,13.
Data mining. The rules governing protein expression and solubility and even protein crystallization are unknown. By assembling a database of the successes and failures of the large-scale expression and purification trials, researchers will be able to deduce correlations between protein sequence and behavior. Even with a limited database, our group has found links between protein sequence and solubility and the propensity to crystallize3. Ultimately, this knowledge will guide the mutagenesis of proteins to generate samples more amenable to expression, purification and structure determination.
Chemical proteomics. Proteins can also be expressed poorly or be insoluble because they lack an obligate cofactor. For most such proteins, however, it is impossible to predict the nature of the co-factor. In order to identify possible co-factors, we believe that protein expression and purification methods must include screens for small molecules that interact with newly synthesized proteins. These screens should incorporate known bioactive small molecules (such as ATP) as well as a library of new chemical entities. Indeed, systematic identification of interacting compounds may provide a means to purify more proteins, as well as a method to ascribe function to new proteins. This approach, which can be defined as ‘chemical proteomics’, helps determine protein function by matching each member of a library of proteins with a small molecule chemical, perhaps a component of a combinatorial chemistry library.
Membrane proteins. Membrane proteins are especially challenging subjects for structural genomics. Crystallization of integral membrane proteins has been difficult, and the evolution to a high-throughput approach will require considerable process development. The development of these processes will depend on a plentiful supply of properly folded membrane proteins. The current expression systems are inadequate in this regard, and it is clear that inexpensive alternate systems, which are capable of generating milligram quantities of membrane proteins, need to be devised. One strategy might be to exploit biological systems that are primed to synthesize vast quantities of membrane proteins, including mammalian cells, such as oligodendrocytes or retinal cells, or plant cells, with their thylakoid membranes.
The expression and purification of intra-cellular or extra-cellular domains of membrane proteins is somewhat easier. While many of these domains can be expressed and purified in the absence of a detergent or membrane-like environment, their proper folding and purification will depend on factors such as the oxidative environment (for extra-cellular domains) or a highly ionic environment (for membrane-associated proteins or domains). Thus, specific (and separate) procedures will likely have to be developed for different classes of membrane-anchored or membrane-associated proteins and domains.
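The data-mining idea described earlier in this section, relating recorded expression outcomes to sequence properties, can be illustrated with a very small sketch. Everything here is an assumption made for illustration: the record layout, the choice of hydrophobic-residue fraction as the feature, and the four made-up sequences. It does not reproduce the authors' analysis or any real result.

```python
from statistics import mean

# Assumption: one simple definition of "hydrophobic" residues.
HYDROPHOBIC = set("AVILMFWC")


def hydrophobic_fraction(sequence):
    """Fraction of residues in the sequence that belong to HYDROPHOBIC."""
    sequence = sequence.upper()
    return sum(1 for aa in sequence if aa in HYDROPHOBIC) / len(sequence)


def compare_feature(records, feature):
    """Return (mean feature of soluble proteins, mean feature of insoluble proteins)."""
    soluble = [feature(r["sequence"]) for r in records if r["soluble"]]
    insoluble = [feature(r["sequence"]) for r in records if not r["soluble"]]
    return mean(soluble), mean(insoluble)


# Made-up trial records standing in for a database of successes and failures.
trials = [
    {"sequence": "MKTEEDKRSQ", "soluble": True},
    {"sequence": "MSDNEKKTQE", "soluble": True},
    {"sequence": "MLLVVIAFWL", "soluble": False},
    {"sequence": "MAVILLFCGV", "soluble": False},
]

soluble_mean, insoluble_mean = compare_feature(trials, hydrophobic_fraction)
print(soluble_mean, insoluble_mean)
```

With a real database, the same comparison could be repeated over many candidate features, and any apparent correlation would need statistical testing before it is used to guide mutagenesis.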
Genetic selection. The use of genetics to select for proteins that
express well in E. coli is potentially an important complement to
biochemical approaches. In one example of this approach, Waldo
and colleagues14 fused a variety of coding sequences N-terminal
to the coding region of the green fluorescent protein (GFP),
which is known to fold poorly in E. coli. Insoluble proteins were
found to inhibit the folding of GFP; proteins that were soluble did
not affect the folding of GFP. The solubility of the fusion protein
could therefore be monitored by measuring the fluorescence of
the bacterial colonies after transformation. A similar strategy uses
chloramphenicol acetyl transferase as the fusion protein and
antibiotic resistance as the read-out15. Such assays should allow
scientists to screen for more soluble mutants of specific proteins.
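The read-out of such a fluorescence screen is essentially a ranking of colonies by signal. The sketch below shows one way to pick candidate soluble variants from per-colony readings; the colony names, the fluorescence values and the threshold are made up, and the `rank_by_fluorescence` helper is hypothetical rather than part of the published assay.

```python
def rank_by_fluorescence(colonies, threshold):
    """Return ids of colonies at or above the threshold, brightest first."""
    hits = [(cid, signal) for cid, signal in colonies.items() if signal >= threshold]
    hits.sort(key=lambda item: item[1], reverse=True)
    return [cid for cid, _ in hits]


# Made-up fluorescence readings (arbitrary units) for four mutant colonies.
readings = {"mut1": 120.0, "mut2": 870.0, "mut3": 45.0, "mut4": 510.0}
picked = rank_by_fluorescence(readings, threshold=500.0)
print(picked)
```

In practice the threshold would be set relative to control colonies expressing known soluble and insoluble fusions.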
Protein purification, generation of structural samples
Structural genomics projects require the purification of hundreds of thousands of proteins and/or protein fragments. This goal cannot be met with current technology because protein purification demands considerable user-intervention and expert decision-making abilities. Thus, the success of structural genomics depends on the development of automated or semi-automated, robust and inexpensive methods for protein purification.
Different requirements for NMR and crystallization. Purification strategies are routine for proteins that are highly expressed for use in NMR or X-ray crystallography. An affinity-chromatography step using a removable tag, such as hexahistidine, usually provides a major step of the purification. For NMR studies, one or two simple chromatographic steps yield protein samples that are sufficiently pure for acquisition of an initial HSQC NMR spectrum. The small hexahistidine tag is advantageous for NMR samples because its presence does not usually interfere with the spectrum. For crystallization, removal of the tag and an additional purification step may be required. Greater purity increases the probability that crystals will grow and enhances the reproducibility among crystallization trials. The main concern for both crystallization and NMR is to eliminate contaminating proteases from protein samples.
Robotics. Complete automation of protein purification from bacterial extracts will be difficult, particularly the steps of harvesting cells, lysing cells and preparing clarified extracts. The centrifugation and/or filtration processes are most challenging because of the large volumes required, the viscosity of the cell paste and the propensity of filters to clog. These problems will only be amplified if throughputs of hundreds of proteins per day are required. This opens up opportunities for novel automation technology. Until such developments arrive, researchers will adopt semi-automated processes in which the particularly unfriendly steps of cell harvesting and lysis are performed manually.
Poorly-expressed proteins. High-throughput processes for generating structural samples of large and complex proteins are not yet established. One approach might be a scaled-up version of common structural biology methods. Proteins will be produced in a suitable expression host and, probably at considerable cost of time and effort, purified and processed for structure determination. Historically, these methods have low success rates and it is unlikely, in our opinion, that implementation of these classical approaches on a proteome-wide scale will be sufficiently fast or cost-effective. As stated above, some strategies for the purification and structural genomics of ‘intractable’ proteins are based on the fact that complex proteins contain many simple domains and that the probability of expressing and crystallizing single domain proteins is significantly higher than that for multi-domain proteins.
Quality control of purified proteins
The aim of structural genomics is to determine protein structures in the most efficient manner. Along the path to structure determination, careful attention to quality control can maximize efficiency. Perhaps the most cost-effective way to make the pipeline more efficient is to use a suite of biophysical techniques, such as dynamic light scattering, mass spectrometry, one-dimensional NMR or native gel electrophoresis, to assess the suitability of the protein for structure determination. In this way, proteins that are unstable, that show a propensity to aggregate or that have been proteolyzed can be eliminated from the pipeline at a relatively early stage. These techniques can also be used to guide the process of engineering derivatives of each protein that are more suitable for structure determination.
Crystallography or NMR?
Once each protein has passed the quality control step, one is left with a choice: NMR or X-ray crystallography? At present, NMR remains plagued by heavy instrumentation requirements, the need to interpret the data manually, the size limitation of ~150 residues and the difficulty of assessing the statistical correctness of the structure. However, NMR is on the verge of a dramatic increase in efficiency, with improvements in instrumentation and in automation of data analysis. Protein crystallography, which does not have size limitations, is also advancing and may soon be an almost fully automated process, from crystal to structure. Therefore, crystallography may dominate the first stages of the structural genomics efforts. However, the inability to predict and control protein crystallization may turn out to be a serious bottleneck. In this regard, the advantages of NMR are that it does not require excessive protein purity and it avoids the requirement to grow crystals. Ultimately, we predict that the structural genomics projects could evolve to the point where many small proteins will be tackled by NMR and crystallography will be the method of choice for larger proteins and protein complexes.
Acknowledgments
The authors would like to acknowledge the research support of the Ontario government, the University of Toronto and the Ontario Cancer Institute. A.M.E. and C.H.A. are Scientists of the Canadian Institutes of Health Research (CIHR). J.F.G. is a Distinguished Scientist of the CIHR and a Foreign Investigator of the Howard Hughes Medical Institute. D.C. was supported by a Best Fellowship.
Associations with structural genomics
C.H.A. and A.M.E. direct the Ontario Structural Proteomics Initiative and are affiliated with the Midwest Center for Structural Genomics and the Northeast Structural Genomics Consortium. C.H.A., A.M.E., J.F.G., J.D.F., A.D. and M.V. are affiliated with Integrative Proteomics, a company that discovers protein structure and function on a genome-wide scale.
1. Kim, S.H. Nature Struct. Biol. 5, 643–645 (1998).
2. Christendat, D. et al. Nature Struct. Biol. 7, 903–909 (2000).
3. Christendat, D. et al. Progr. Biophys. and Mol. Biol. in the press (2000).
4. Hendrickson, W.A., Horton, J.R. & LeMaster, D.M. EMBO J. 9, 1665–1672 (1990).
5. Bax, A., Ikura, M., Kay, L.E., Barbato, G. & Spera, S. Ciba Found. Symp. 161, 108–119 (1991).
6. Eberstadt, M. et al. Nature 392, 941–945 (1998).
7. Hansen, A.P. et al. Biochemistry 31, 12713–12718 (1992).
8. Kigawa, T. et al. FEBS Lett. 442, 15–19 (1999).
9. Cohen, S.L., Ferre-D’Amare, A.R., Burley, S.K. & Chait, B.T. Protein Sci. 4, 1088–1099 (1995).
10. Pfuetzner, R.A., Bochkarev, A., Frappier, L. & Edwards, A.M. J. Biol. Chem. 272, 430–434 (1997).
11. Barwell, J.A. et al. J. Biol. Chem. 270, 20556–20559 (1995).
12. Bochkareva, E., Frappier, L., Edwards, A.M. & Bochkarev, A. J. Biol. Chem. 273, 3932–3936 (1998).
13. Koth, C.M. et al. J. Biol. Chem. 275, 11174–11180 (2000).
14. Waldo, G.S., Standish, B.M., Berendzen, J. & Terwilliger, T.C. Nature Biotech. 17, 691–695 (1999).
15. Maxwell, K.L., Mittermaier, A.K., Forman-Kay, J.D. & Davidson, A.R. Protein Sci. 8, 1908–1911 (1999).