Topology Prediction of Membrane Proteins:
Why, How and When?
Karin Melén
Stockholm University
© Karin Melén, Stockholm 2007
ISBN 91-7155-397-5
Printed in Sweden by Universitetsservice AB, Stockholm 2007
Distributor: Stockholm University Library
To Ture, Otto and Henrik
List of publications
Publications included in this thesis
Paper I
Melén K, Krogh A, von Heijne G. (2003)
Reliability measures for membrane protein topology
prediction algorithms. J Mol Biol. 327(3):735-44.
Paper II
Kim H, Melén K, von Heijne G. (2003)
Topology models for 37 Saccharomyces cerevisiae membrane
proteins based on C-terminal reporter fusions and predictions.
J Biol Chem. 278(12):10208-13.
Paper III
Rapp M, Drew D, Daley DO, Nilsson J, Carvalho T,
Melén K, de Gier JW, von Heijne G (2004)
Experimentally based topology models for E. coli inner
membrane proteins. Protein Sci. 13(4):937-45.
Paper IV
Daley DO, Rapp M, Granseth E, Melén K, Drew D,
von Heijne G. (2005)
Global topology analysis of the Escherichia coli inner
membrane proteome. Science. 308(5726):1321-3.
Paper V
Kim H*, Melén K*, Österberg M*, von Heijne G. (2006)
A global topology map of the Saccharomyces cerevisiae
membrane proteome. Proc Natl Acad Sci. 103(30):11142-7.
* These authors contributed equally
Reprints were made with permission from the publishers
Other publications
Österberg M, Kim H, Warringer J, Melén K, Blomberg A, von Heijne G.
(2006) Phenotypic effects of membrane protein overexpression in
Saccharomyces cerevisiae. Proc Natl Acad Sci. 103(30):11148-53.
Granseth E, Daley DO, Rapp M, Melén K, von Heijne G. (2005)
Experimentally constrained topology models for 51,208 bacterial inner
membrane proteins. J Mol Biol. 352(3):489-94.
Laudon H, Hansson EM, Melén K, Bergman A, Farmery MR, Winblad B,
Lendahl U, von Heijne G, Näslund J. (2005) A nine-transmembrane domain
topology for presenilin 1. J Biol Chem. 280(42):35352-60.
Contents
1 Introduction
1.1 Biological membranes
1.2 Membrane proteins
1.2.1 Peripheral membrane proteins
1.2.2 Integral membrane proteins
2 Membrane protein structure
2.1 Overexpression
2.2 Techniques for structure determination
2.2.1 Electron crystallography
2.2.2 X-ray crystallography
2.2.3 NMR spectroscopy
2.3 Structural models by homology
2.4 Structure prediction
3 Membrane protein topology
3.1 Experimental determination of topology
3.1.1 Reporter genes
3.2 Prediction of topology
3.2.1 Topological determinants
3.2.2 Topology prediction algorithms
4 Summary of papers
4.1 Reliability measures for topology predictions and the use of experimental knowledge (Paper I)
4.2 Topology models for a small number of S. cerevisiae membrane proteins based on C-terminal reporter fusions and predictions (Paper II)
4.3 Topology models for a small number of E. coli membrane proteins and optimization analysis of fusion points (Paper III)
4.4 Large-scale topology analysis of the E. coli and S. cerevisiae membrane proteomes (Papers IV and V)
5 Discussion and future perspectives
Acknowledgements
References
Abbreviations
2D       Two-dimensional
3D       Three-dimensional
Endo H   Endoglycosidase H
ER       Endoplasmic reticulum
GFP      Green fluorescent protein
GO       Gene ontology
GPCR     G protein-coupled receptor
HA       Hemagglutinin
His4     Histidinol dehydrogenase
HMM      Hidden Markov model
MP       Membrane protein
NMR      Nuclear magnetic resonance
NN       Neural network
ORF      Open reading frame
PDB      Protein Data Bank
PhoA     Alkaline phosphatase
SG       Structural genomics
SP       Signal peptide
SRP      Signal recognition particle
SUC2     Invertase (carrying N-glycosylation sites)
TM       Transmembrane
Amino acids
Ala      Alanine
Arg      Arginine
Asn      Asparagine
Asp      Aspartic acid
Cys      Cysteine
Gln      Glutamine
Glu      Glutamic acid
Gly      Glycine
His      Histidine
Ile      Isoleucine
Leu      Leucine
Lys      Lysine
Met      Methionine
Phe      Phenylalanine
Pro      Proline
Ser      Serine
Thr      Threonine
Trp      Tryptophan
Tyr      Tyrosine
Val      Valine
1 Introduction
The cell is the essential unit of life and is the fundamental building block in
all organisms. Cells are surrounded by membranes, usually a double layer of
lipids, which separates them from the outside world. The membrane is a
physical barrier that protects the cell from foreign molecules at the same
time as it prevents leakage of internal components and substances. However,
a cell must be able to communicate with its surroundings, exchange molecules and adapt to sudden environmental changes. Membrane proteins (MPs)
are the key players in these communication processes and are responsible for
regulating the permeability of the membranes. They are involved in a wide
range of functions, such as transport of ions and water, reception of hormones and other signaling molecules, recognition of “self” versus “non-self”,
energy transduction and cell-cell interactions, to mention just a few.
The diversity of functions is also mirrored in the great variability in the
three-dimensional (3D) structure of membrane proteins. Some are integral
with certain parts embedded in the membrane whereas others are peripheral
and bound only to the membrane surface. Determination of the structures
would facilitate the assignment of the functions, but unfortunately it is very
difficult to solve the three-dimensional structure of membrane proteins experimentally. Therefore, alternative approaches to obtain structural information must be taken in parallel with traditional structure determination efforts.
One way is to use bioinformatics methods to predict, from the amino acid sequence alone, which parts of the protein interact with the lipid bilayer and which parts reside outside the
membrane. Another is to experimentally locate selected parts of the protein to the cell interior, the cell exterior or the membrane. In the work presented in this thesis, a strategy of
combining theoretical predictions and experimental analyses has been pursued in order to increase our knowledge of the membrane protein universe.
1.1 Biological membranes
The membranes surrounding the cells in all three domains of life (bacteria,
archaea and eukaryotes) are called plasma membranes. In eukaryotes there are
additional internal membranes that define specific compartments, so-called
organelles. Some examples are the endoplasmic reticulum (ER), Golgi network, mitochondria, chloroplasts, nucleus, lysosomes and peroxisomes.
Common to all kinds of membrane is that they consist of a mixture of lipids
that are assembled into a bilayer structure. There are different classes of
lipids that can be neutral, zwitterionic or negatively charged, where glyco- and phospholipids are the most common ones. The main characteristics for
all lipids are similar in that they are amphipathic molecules with hydrophilic
head groups and hydrophobic hydrocarbon tails. In an aqueous milieu they
spontaneously arrange themselves into a double layer with the polar head
groups facing the aqueous environment and the hydrophobic tails in each
layer pointing inward, thereby avoiding contact with water. The driving
force for this formation is the inability of the fatty acid hydrocarbon chains
to hydrogen bond to water. The shape of the tails allows van der Waals interactions between neighboring tails. The lengths of the fatty acid chains
determine the thickness of the hydrophobic core that typically is about 30 Å.
The size of the polar head groups on each side of the core is around 15 Å
which in total gives a bilayer thickness of roughly 60 Å (Fig. 1). The result is
a hydrophobic barrier that is impermeable to most molecules (except small
uncharged solutes). Therefore the membrane proteins associated with the
lipid bilayer are necessary for the transmission of matter or information
across the various membranes. The proportion of protein depends on the
membrane type but in general makes up about 50% of the mass. The membrane is a dynamic structure with a fairly equal mixture of lipids and proteins
which all can move laterally as depicted by the fluid mosaic model (Singer
and Nicolson, 1972).
[Figure 1 labels, top to bottom: polar head groups (~15 Å interface region); hydrophobic tails (~30 Å membrane core); polar head groups (~15 Å interface region)]
Figure 1. Schematic picture of a lipid bilayer with different kinds of lipids and associated membrane proteins; grey spheres represent the lipid head groups, black sticks
represent lipid tails, the light grey cylinder symbolizes an integral membrane protein
and the dark grey cylinder symbolizes a peripheral membrane protein.
1.2 Membrane proteins
The biological significance of membrane proteins is reflected in their abundance in a cell. It is estimated that 20-30% of all genes in most organisms
code for membrane proteins (Krogh et al., 2001; Wallin and von Heijne,
1998) based on prediction of the main category of membrane proteins (the
α-helical class, see below). Taking all different classes into account, the
figure is even higher. From a pharmaceutical point of view, the membrane
proteins are also of great importance since about half of all drug targets are
membrane proteins (Hopkins and Groom, 2002; Russell and Eggleston,
2000). The classification of membrane proteins depends on the relationship
to the membrane. They can be divided into two broad categories, the peripheral and the integral membrane proteins.
1.2.1 Peripheral membrane proteins
Peripheral membrane proteins are only loosely associated with one side of the
membrane. They either interact directly with the polar head groups of the
lipids or attach to integral membrane proteins but they rarely penetrate
deeply into the hydrophobic core of the membrane. The peripheral membrane proteins dissociate from the membrane upon treatment with solutions
of high ionic strength or elevated pH.
1.2.2 Integral membrane proteins
Integral membrane proteins are more tightly attached to the membranes and
require a detergent or an organic solvent to be solubilized. They have one or
more polypeptide segments buried in the lipid bilayer. The segments traverse
the entire membrane and the protein must thus have both regions that can
exist in a lipid environment and regions that are stable in a polar milieu. The
solution has been to have a combination of transmembrane (TM) segments
containing residues with hydrophobic side chains that can interact with the
hydrophobic membrane core and loops with a more hydrophilic character
that are in contact with the polar lipid head groups and the surrounding
aqueous medium.
Integral membrane proteins can further be separated into two distinct
classes based on how the transmembrane segments are folded, namely the α-helical bundle class and the β-barrel class. The two architectures have in
different ways solved the problem of having energetically unfavorable amide
and carbonyl groups of the peptide bonds in a hydrophobic environment.
1.2.2.1 The α-helical bundle class
The α-helical membrane proteins are the most frequent type of membrane
proteins and are found in nearly all cellular membranes. The α-helices traverse the membrane and are tightly packed into bundles (Fig. 2a). They are
composed of mainly hydrophobic residues where the side chains can form
van der Waals interactions with the fatty acid chains in the membrane core.
All polar amide and carbonyl groups in the backbone are hydrogen bonded
internally within the helix. This lowers the cost of transferring polar entities
into the hydrocarbon interior and makes the conformation energetically stable. Another stabilizing factor is the enrichment of aromatic residues (Tyr
and Trp) near the ends of the transmembrane segments (Killian and von Heijne, 2000), something also recognized in the β-barrel membrane proteins
(see below). It is thought that their interaction with the membrane-water
interface regions helps the positioning of the helices relative to the membrane (de Planque et al., 1999; Yau et al., 1998).
The helices are oriented more or less perpendicularly to the membrane
plane but are usually slightly tilted. In general, the lengths of the helices will
approximately match the thickness of the membrane. Among the known
three-dimensional structures the average number of residues in an α-helix is
commonly estimated to be between 23 and 26 (Bowie, 1997; Cuthbertson et
al., 2005; Eyre et al., 2004; Ulmschneider et al., 2005). The deviation in the
estimated numbers reflects both the difficulty of precisely identifying the
membrane core and also the differences in defining where the TM helices
actually start and end. There is however a large variation in the helix lengths
and observations from 15 up to 43 residues have been made (Granseth et al.,
2005a). Factors that affect the lengths are the tilting angle of a helix and the
distortion of the lipid bilayer (Engelman et al., 1986). The exact positioning
of the helices and the direction of the side chains are also influenced by the
snorkeling effect (hydrophobic side chains tend to point towards the hydrophobic membrane center while polar and charged side chains are oriented
towards the polar head groups and the membrane-water interface region)
(Granseth et al., 2005a; Monné et al., 1998). There are both single-spanning
proteins where the membrane domain acts as an anchor for a water-soluble
domain, and multi-spanning (polytopic) proteins where two or more α-helices usually are more directly involved with the function of the protein.
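The relation between helix length, membrane thickness and tilt described above can be illustrated with a small calculation. This is only a geometric sketch: it assumes a straight, ideal α-helix with a rise of 1.5 Å per residue (a textbook value that is not stated in this thesis), and the function name `residues_to_span` is hypothetical.

```python
import math

# Assumption: rise per residue along the axis of an ideal alpha-helix, in Å.
# This textbook value is not taken from the thesis text.
RISE_PER_RESIDUE = 1.5


def residues_to_span(core_thickness, tilt_deg=0.0):
    """Approximate number of residues a straight helix needs to cross the core.

    core_thickness: thickness of the hydrophobic core in Å (about 30 Å in the text).
    tilt_deg: tilt of the helix axis relative to the membrane normal, in degrees.
    """
    # A tilted helix travels a longer path through a slab of the same thickness.
    path_length = core_thickness / math.cos(math.radians(tilt_deg))
    return path_length / RISE_PER_RESIDUE


if __name__ == "__main__":
    for tilt in (0, 20, 30, 40):
        print(f"tilt {tilt:2d} deg: {residues_to_span(30.0, tilt):.1f} residues")
```

Under these assumptions an untilted helix needs about 20 residues to cross a 30 Å core, and a tilt of 30° raises this to about 23, which is in the range of the average lengths quoted above. Real helices also extend into the interface regions and may kink, so the sketch gives a lower bound rather than a prediction.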
Figure 2. Ribbon diagrams showing two examples of integral membrane proteins.
The red lines denote the approximate membrane borders. (a) Bacteriorhodopsin
(PDB code 2BRD) with 7 transmembrane α-helices, (b) A porin (PDB code 1PRN)
with 12 transmembrane β-strands.
1.2.2.2 The β-barrel class
The β-barrel proteins are present in the outer membrane of Gram-negative
bacteria, mitochondria and chloroplasts. The membrane-spanning parts are
composed of an even number of antiparallel β-strands (between 8 and 22
strands in proteins of known three-dimensional structures) that are tilted
~45° relative to the membrane normal (Lomize et al., 2006; Schulz, 2000).
The strands form a cylindrical pore where the first and last strands meet to
close the barrel (Fig. 2b). This arrangement enables the backbone amide and
carbonyl groups to hydrogen bond laterally with the neighboring strands.
The residues in the β-strands alternately face the lipid bilayer and the inside
of the barrel. Side chains oriented towards the fatty acid chains are typically
hydrophobic whereas side chains oriented towards the barrel interior are
more polar on average. The outcome is a polar channel through which water-soluble molecules can cross. The residues flanking the β-strands are often
aromatic (Tyr and Trp) and are in contact with the lipid head groups which is
believed to stabilize the structure (Yau et al., 1998). The loops joining the
strands are predominantly composed of polar residues and usually form short
turns on the periplasmic side and longer loops at the extracellular side
(Schulz, 2000).
It is more difficult to predict the fraction of proteins in a genome belonging to the β-barrel class than to the α-helical bundle class, because the former group has less hydrophobic motifs and larger length variations. A few attempts have been made, however, and the proportion of β-barrel encoding
genes in the genome of the Gram-negative bacterium Escherichia coli is estimated
to be 2-3% (Garrow et al., 2005; Wimley, 2003).
In this thesis I have only focused on the α-helical class of integral membrane
proteins and will from now on refer to them as simply “membrane proteins”.
2 Membrane protein structure
The 3D structure of a protein is implicitly related to the function of the protein (Laskowski et al., 2003), but it is not always straightforward to infer
function from structure. There are cases where proteins with similar structures have different functions (Bartlett et al., 2003) and if a protein represents a new fold (i.e. resembles no previously solved structure) it might be
hard to assign the function (Skolnick et al., 2000). Nevertheless, a good way
to start studying the function of a protein is to determine its 3D structure.
This is a demanding procedure for any protein but has turned out
to be considerably more difficult for membrane proteins than for
globular proteins. The issue is illustrated by the fact that only just
above 100 structures of membrane proteins have been solved
(http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html), which is in
contrast to the number of solved structures for globular proteins deposited in
the Protein Data Bank (PDB) (Berman et al., 2000) which is nearly 40 000
(http://www.rcsb.org/pdb). The fraction is hence less than 1% which should
be compared to the observation that between 20 and 30% of all proteins are
membrane proteins (Krogh et al., 2001; Wallin and von Heijne, 1998). The
huge gap is not expected to be filled in the foreseeable future even though
there has been an exponential growth in the number of membrane protein
structures solved during the last years (White, 2004).
Why are membrane proteins so challenging? There are several explanations,
but the major reason is the interaction with the membrane lipids that is necessary for correct folding. Without the amphipathic lipid molecules a membrane protein does not fold into its native structure. This has many implications, some of which I will go through briefly in the following sections.
2.1 Overexpression
A prerequisite for structural determination is a reasonable amount of purified
protein, typically several milligrams. Most membrane proteins do not reach
this level. Either the proteins are degraded by proteases (Wagner et al., 2006)
or the overexpression can lead to aggregation of proteins and formation of
inclusion bodies from which it is difficult to isolate the protein (Rogl et al.,
1998). It seems as if the membrane assembly machinery (including the SRP,
translocon and chaperones) does not manage to handle too large quantities
and therefore the proteins are not inserted correctly or get misfolded (Drew
et al., 2003). Moreover, the balance between lipid and protein composition in
the membrane gets disturbed which also might be a limiting factor for the
protein insertion capacity. As an illustration, it has been shown in E. coli that
proliferation of extra intracellular membranes increased the yield of overexpressed membrane proteins (Arechaga et al., 2000). Finally, overexpression
would ideally be performed in endogenous hosts, but this is often not possible, especially not for mammalian proteins. A common host for both prokaryotic and eukaryotic membrane proteins is E. coli but since the components in the assembly machinery vary between the species it is likely that the
pathways are far from optimal. Another disadvantage of using bacterial expression systems for eukaryotic membrane proteins is the inability of bacteria to perform post-translational modifications that may be crucial for the
function. The greater the evolutionary distance between host and protein
origin, the higher the risk of expression problems. Most mammalian membrane proteins require a eukaryotic host cell for overexpression (Tate, 2001)
and in many cases even insect or mammalian host cells are needed
(Massotte, 2003).
2.2 Techniques for structure determination
Even if the overexpression of membrane proteins is successful, solubilization and purification of the proteins have to be carried out before further
analyses. Since membrane proteins contain both hydrophilic and hydrophobic parts they are not soluble in water and can therefore not easily be released from the lipids in the membrane. Detergents, i.e. amphipathic compounds possessing both polar and nonpolar regions, are necessary for the
solubilization. It is however not trivial to assess the amount of detergent
needed or which detergent to use or if a combination of different detergents
and lipids is required. The conditions are individual for each membrane protein and have to be optimized by trial and error.
A general strategy for purifying overexpressed proteins is to utilize constructs where the target genes are fused to an affinity tag, either at the amino- or the carboxy-terminus of the protein. The tag enables the proteins to be
recovered efficiently (Walian et al., 2004).
There are a number of experimental techniques for three-dimensional
structure determination. The classical methods commonly used for globular
proteins are X-ray crystallography and nuclear magnetic resonance (NMR)
spectroscopy. The difficulties in working with membrane proteins have led
to the development of alternative methods for obtaining structural data, such as
electron crystallography, single-particle cryo-electron microscopy (cryo-EM) and atomic force microscopy (AFM). In some cases the resulting structures are of lower resolution than achieved by the standard techniques but
they still yield valuable and complementary information (Torres et al.,
2003). Below follows a description of the three main techniques used for
structure determination of membrane proteins.
2.2.1 Electron crystallography
The pioneering work of solving the very first membrane protein structure
was done using electron crystallography. It was performed in 1975 by Henderson and Unwin. They studied bacteriorhodopsin from Halobacterium
halobium and managed to get a low-resolution structure (Henderson and
Unwin, 1975), from which it was possible to produce a model consisting of 7
transmembrane segments arranged almost perpendicularly to the membrane.
The methodology has been continually improved, and there are now examples of structures solved at near-atomic resolution (Henderson et al., 1990;
Torres et al., 2003). The main benefits of the method are that membrane
proteins can be reconstituted in lipid bilayers resembling their natural environment and that the proteins often organize themselves into well-ordered
two-dimensional (2D) crystals.
2.2.2 X-ray crystallography
The rate-limiting obstacle in X-ray crystallography is to obtain diffracting
3D crystals of high quality. Well-ordered 3D crystals are more difficult to
grow than 2D crystals, and again, it is the amphipathic nature of the protein
that causes the trouble. The protein can get stuck in aggregates on the way
from soluble protein-detergent complex to crystal (Caffrey, 2003). But once
a high quality crystal is achieved, standard X-ray diffraction analyses can be
applied. To date, the majority of the known 3D structures for both globular
and membrane proteins have been solved by X-ray crystallography (> 80%)
(http://www.rcsb.org/pdb). During recent years, new methods have been
developed for improving 3D crystal production and it seems as if the X-ray
technique will continue to be the most used method for structural determination, at least in the near future (Caffrey, 2003).
2.2.3 NMR spectroscopy
The main advantage of NMR spectroscopy is that there is no need for crystals. On the other hand, the technique is limited to analyses of proteins of
small size, typically < 35 kDa, although advances in pushing this size limit
have been reported (Kainosho et al., 2006; Yu, 1999). There are different
types of NMR, of which solution NMR has been widely used for globular proteins. The technique requires that the molecule studied tumbles quickly
on the NMR time scale in order to produce sharp resonance bands. The larger the molecule, the more slowly it will tumble and the harder it will be to
obtain sharp NMR resonances (Torres et al., 2003). Membrane proteins that
need to be surrounded by an environment that mimics the membrane, such as
a detergent micelle, often become too large and tumble too slowly to be
studied successfully and thus are even more size-restricted than globular
proteins. The problem of slow tumbling can however be bypassed by solid-state NMR where larger molecules can be analyzed. The membrane protein
is reconstituted in lipid bilayers instead of a micelle and the protein is
immobilized by the environment (Opella et al., 2002). Although the NMR
technique contributes fewer structures than X-ray crystallography it has been
shown that for small proteins the methods are complementary with low redundancy (Yee et al., 2005). Therefore, it is suggested that the most effective
way to obtain new structures is to continue to use both methods in parallel.
2.3 Structural models by homology
Although technical progress is made continuously, it is not feasible to experimentally determine structures for every known protein. It would take too
much effort both in terms of cost and labor. Fortunately one can take advantage of the evolutionary relationship between sequence and structure. It has
been concluded by several studies that protein sequences sharing at least
30% sequence identity are likely to have similar structures (Bradley et al.,
2005; Rost and Sander, 1993; Vitkup et al., 2001). Therefore, a way to obtain a structural model for a protein without solving the structure explicitly is
to find a homolog for which the structure is known. The homolog can be
used as a template in homology modeling methods and accordingly one can
attain a reasonably correct structure for the target protein. The evolutionary
relationship between the modeled protein and the template influences the
accuracy. The higher the sequence identity, the better the alignment will be
and hence the likelihood of a correct structural model increases.
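The 30% rule of thumb above can be made concrete with a short sketch that computes the percent identity of two already aligned sequences and checks it against the threshold. The function names, the gap character `-` and the choice of denominator (aligned columns without gaps) are assumptions made for illustration; published studies define sequence identity in several slightly different ways, and this is not the procedure of any specific paper cited here.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over the gap-free columns of a pairwise alignment.

    Both arguments are rows of the same alignment, with '-' marking gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both sequences have a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def is_plausible_template(aligned_target, aligned_template, threshold=30.0):
    """True if the identity reaches the threshold discussed in the text."""
    return percent_identity(aligned_target, aligned_template) >= threshold


if __name__ == "__main__":
    # Toy alignment, invented for illustration only.
    target = "MKTLLV-AGG"
    template = "MRTLIVQAGA"
    print(f"{percent_identity(target, template):.1f}% identity")
    print("usable as template:", is_plausible_template(target, template))
```

In the toy alignment, six of the nine gap-free columns match, so the pair is well above the threshold. In practice the identity should be computed over a reasonably long alignment, since short stretches can reach 30% identity by chance.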
In light of these facts, a number of so-called
structural genomics (SG) initiatives were started in 2000 (Thornton, 2001).
The goal of SG is to provide a structural representative for each protein family and to use computational homology modeling and fold recognition to
obtain structural models for all related sequences in the protein families
(Brenner, 2001; Todd et al., 2005). In order to cover the whole protein family space, the target selection for experimental structure determination has to
be clever. It should in general focus on targets from protein families without
members of known structure and targets from large protein families where it
might not be enough with one structure to capture the diversity of functions
(Marsden et al., 2006). Following this strategy it has been predicted that
25,000 new structures are needed to achieve 80% structural coverage of all
proteins in the first 1,000 sequenced genomes (Yan and Moult, 2005), something that is expected to be achievable within the next decade.
When it comes to membrane proteins the figures have to be recalculated
due to a number of reasons; i) the fraction of solved MP structures is much
lower than the fraction of MPs present in the proteomes (see chapter 2), ii)
the structure determination process is slower for membrane proteins than for
globular proteins (Torres et al., 2003) and iii) the lipid bilayer imposes
physical and chemical constraints that restrict the structural diversity of the
transmembrane domains in the membrane proteins (Bowie, 1997; Ubarretxena-Belandia and Engelman, 2001). Yet it seems as if modeled structures of
membrane proteins can reach the same level of accuracy as globular proteins
if the sequence identity of the target protein and the template is 30% or
higher (Forrest et al., 2006). The difference in success rate might instead be
due to the scarcity of membrane protein structures which reduces the probability of finding a homolog of known structure and thereby obtaining a correct alignment between the target and template. According to a recent study
(Oberai et al., 2006) the most dominant membrane protein families are already present in sequence databases today, something which seems not to be
the case for globular proteins. This suggests that the universe of membrane
protein families is much more structurally limited than that of
globular protein families. Oberai et al. estimated that if 80% of the MP sequence space should be covered by structural representatives, about 700
families are required, which relates to about 300 folds since different protein
families can have the same fold (Grant et al., 2004). They further estimated
that this could be achieved by year 2020 if the same pace of structure determination is continued as modeled in (White, 2004).
2.4 Structure prediction
Considering the moderate speed of structural determination of proteins it is
tempting to try to predict the 3D structure ab initio. Anfinsen stated in 1973
that it is the chemical properties of the amino acid sequence that determine
the tertiary structure of a protein (Anfinsen, 1973). This idea is the central
postulate on which protein structure prediction is based. Another fundamental principle is that the native state of a protein is assumed to be at the global minimum of the free energy. Ab initio structure prediction has turned out
to be a very complex task, mainly due to the search for low-energy states in
the conformational space. There is however progress in the field and successful structure prediction for small globular proteins has been achieved
(Bradley et al., 2005).
Membrane protein structure prediction could benefit from the structural
constraints imposed by the lipid bilayer. The membrane environment reduces the degrees of freedom for an embedded protein and hence its structure
should be easier to predict than the structure of a globular protein
(Simon et al., 2001). In a first approximation, the prediction boils down to
the identification of the transmembrane helices and the optimization of their
packing. And indeed, there are examples where structures have been successfully predicted for parts of membrane proteins (Yarov-Yarovoy et al.,
2006). On the other hand, membrane proteins are often larger than the small
globular proteins for which ab initio prediction has done well, which makes
the computational calculations for membrane proteins more complicated “in
the end” (Fleishman and Ben-Tal, 2006). Alternative routes for gaining
structural insights have to be taken until 3D prediction is more satisfying.
One way is to reduce the dimensionality and focus on the topology, as described in the next chapter.
3 Membrane protein topology
If the three-dimensional structure of a membrane protein is not obtainable,
neither by experimental techniques nor by theoretical predictions, what
would be the second-best approach to gain structural insight and functional
information? There are many answers to this question but one commonly
agreed on is to find out the topology for the membrane protein (Jones, 2007;
von Heijne, 2006). The topology can be seen as a 2D representation of the
protein but it has also been referred to as “low-resolution structure”
(Kernytsky and Rost, 2003). A general definition of the topology is the
specification of the number of transmembrane helices and the orientation of
the protein relative to the lipid bilayer (Fig. 3). Although the spatial arrangement and interactions of the helices are not taken into account, it
provides information on which side of the membrane the protein starts and
ends as well as where the loops and helices are located, all of which can be
used for both functional and structural classification of the protein (von Heijne, 2006).
[Figure 3 image. Labels: N, non-cytoplasm, membrane, cytoplasm, C]
Figure 3. A topology map of an integral membrane protein with 7 transmembrane
helices. The N-terminus is located in the extracytoplasmic space (commonly referred
to as ‘outside’), the C-terminus is located in the cytoplasm (commonly referred to as
‘inside’) and there are short loops connecting two succeeding helices on the in- and
outside respectively.
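The definition above can be made concrete with a small sketch: a topology is fully specified by the location of the N-terminus and the positions of the transmembrane helices, since each helix flips the side of the membrane. The class name `Topology`, its fields and the residue coordinates below are hypothetical and chosen only for illustration; they are not taken from any published tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Topology:
    """Minimal 2D description of a helical membrane protein (illustrative only)."""
    n_term: str                     # "in" (cytoplasm) or "out" (non-cytoplasm)
    helices: List[Tuple[int, int]]  # (start, end) residue positions, hypothetical

    def loop_sides(self) -> List[str]:
        """Side of each extramembraneous region, from N- to C-terminus."""
        sides = [self.n_term]
        for _ in self.helices:
            # every membrane crossing switches the side
            sides.append("out" if sides[-1] == "in" else "in")
        return sides

    @property
    def c_term(self) -> str:
        return self.loop_sides()[-1]


# A protein like the one in Figure 3: 7 helices, N-terminus outside.
fig3_like = Topology(
    n_term="out",
    helices=[(5, 30), (40, 65), (75, 100), (110, 135),
             (145, 170), (180, 205), (215, 240)],
)
print(fig3_like.c_term)  # side of the C-terminus
```

With an odd number of helices the two termini end up on opposite sides, which is why the C-terminus in this example is cytoplasmic.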
3.1 Experimental determination of topology
There are a number of different techniques available for topology determination (van Geest and Lolkema, 2000). Since the membrane-spanning parts can
be identified with relative ease due to their hydrophobic character, the topology determining techniques have in general focused on correctly localizing
the loops to the cytoplasmic or the extracytoplasmic side of the membrane
(Traxler et al., 1993). Some methods use biochemical agents or antibodies
that have access only to one side of the membrane e.g. in cysteine labeling
(Kimura et al., 1997), glycosylation mapping (Chang et al., 1994), epitope
mapping or antibody binding (Canfield and Levenson, 1993). Other methods
use a reporter gene fused to a hydrophilic domain of the membrane protein
of interest and whose product or enzymatic activity can disclose on which
side of the membrane it resides (Manoil, 1991).
In this thesis, different gene fusion techniques have been employed for
topology mapping. They will be described more thoroughly below.
3.1.1 Reporter genes
Reporter genes usually code for proteins with enzymatic activity that becomes manifest only on one but not the other side of a membrane. Activity
assays on reporter gene fusions can therefore be used to determine the inside/outside location of different parts of a membrane protein. A reporter
gene can be fused to various sites and in order to do a full topology mapping,
fusion constructs for each extramembraneous region, including the N and C
termini, should be made. The constructs can either be designed to contain
truncated versions of the membrane protein fused to the reporter gene in the
C-terminal end or designed in a “sandwich” manner where the reporter gene
is inserted into a loop region and accordingly the membrane protein is always full-length (Ehrmann et al., 1990; van Geest and Lolkema, 2000). It
has been discussed to what extent a large reporter moiety affects the folding
and insertion into the membrane, in particular if it is supposed to be translocated to the extracytoplasmic side. It could also be suspected to affect the
long-range interactions of different parts of the proteins. But so far, several
studies have shown that the influence of the reporter gene is marginal and
that membrane proteins mostly retain their native topology (Kim et al., 2005;
Traxler et al., 1993; van Geest and Lolkema, 2000).
The choice of reporter gene depends on which organism the proteins are
to be expressed in. In E. coli (and bacteria in general) the membrane proteins
are inserted into the plasma membrane and in Saccharomyces cerevisiae
(and eukaryotes in general) most membrane proteins are inserted into the ER
membrane (at least initially). Protein parts facing the “inside” are in both
systems located in the cytoplasm while parts facing the “outside” are located
in the periplasm and ER lumen, respectively. Thus the reporter proteins are
placed in different environments. For an in-depth review of different reporter
genes, see (van Geest and Lolkema, 2000).
Observing enzyme activity is usually correctly interpreted as having the
reporter protein at the specific side of the membrane where the reaction can
take place. But if there is no measurable activity, is that an indication of opposite extramembraneous location or is it due to misfolding of the protein or
failure of correct insertion into the membrane? To be able to unambiguously
assign a part of the protein to be located in the cytoplasm or non-cytoplasm,
the experimental analysis benefits from including two complementary reporter proteins showing activities at opposite sides of the membrane. Below
follows a description of reporters used in this thesis, two expressed in E.
coli, alkaline phosphatase (PhoA) and green fluorescent protein (GFP), and
two expressed in S. cerevisiae, histidinol dehydrogenase (His4) and an invertase carrying a number of glycosylation sites (SUC2).
3.1.1.1 PhoA
One of the first reporter genes used for topology mapping was alkaline phosphatase (PhoA) (Manoil and Beckwith, 1986). It is normally expressed in
bacteria and transported to the periplasmic space where it can catalyse the
hydrolysis of phosphate groups from different molecules. The enzymatic
activity depends on the formation of an essential cysteine disulfide bridge
within the protein, which can only take place in the periplasm. If PhoA stays
in the cytoplasm it cannot fold properly and is therefore not active. To detect the location of PhoA, a substrate that changes color upon hydrolysis is
added to the bacterial culture medium and if a color change is observed
PhoA must be located in the periplasm.
3.1.1.2 GFP
Green fluorescent protein (GFP) from the jellyfish Aequorea victoria can be
expressed in bacteria and used as a topology reporter as it is only active in
the cytoplasm but incorrectly folded and thus inactive in the periplasm
(Feilmeier et al., 2000). The active form is fluorescent after UV illumination,
so detection of fluorescence indicates a cytoplasmic location. GFP is a suitable complement to PhoA and the combination of the two reporters has been
used in several studies (Drew et al., 2002, paper III and IV).
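The logic of using two complementary reporters can be summarized in a short sketch. It assumes that each assay has already been reduced to an active/inactive call, which is a simplification: real assays give quantitative activities that need thresholds or ratios. The function name `assign_location` is hypothetical.

```python
def assign_location(phoa_active: bool, gfp_active: bool) -> str:
    """Combine a PhoA fusion and a GFP fusion made at the same site.

    PhoA is only active in the periplasm and GFP only in the cytoplasm,
    so one positive and one negative signal give a consistent assignment.
    """
    if phoa_active and not gfp_active:
        return "periplasm"
    if gfp_active and not phoa_active:
        return "cytoplasm"
    # both active or both inactive: misfolding, failed insertion or
    # mixed topologies cannot be told apart from the data
    return "ambiguous"


print(assign_location(phoa_active=True, gfp_active=False))
```

A single reporter would have to treat "no activity" as evidence for the opposite side; with the pair, a missing signal from both reporters is flagged instead of being interpreted.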
3.1.1.3 His4C
His4C is a truncated version of the yeast enzyme histidinol dehydrogenase
(His4). His4C retains the enzymatic activity of the full-length protein and
can convert histidinol to the essential amino acid histidine (Deak and Wolf,
2001). Cells containing a mutated nonfunctional his4 gene cannot grow on
media lacking histidine but containing histidinol. However, if the reporter
His4C is fused to a membrane protein and stays in the cytoplasm, it metabolizes
histidinol to histidine and the his4 cells survive. Growth on the selection media is therefore an indicator of cytoplasmic localization. When the
His4C part is translocated to the ER lumen, it cannot act on its substrate and
the his4 cells cannot grow.
3.1.1.4 SUC2
SUC2 codes for invertase, an enzyme that catalyses the hydrolysis of different sugar molecules. It is however not its enzymatic capacity that carries the
reporter potential. Instead it is the presence of several N-linked glycosylation
acceptor sites (Taussig and Carlson, 1983) that is utilized (Deak and Wolf,
2001). N-linked glycosylation of proteins in eukaryotes takes place in the ER
lumen. Glycan molecules are attached to the asparagine (N) residue in the
tripeptide Asn-X-Ser/Thr, where X can be any amino acid except proline. A
fusion construct with the Suc2 part facing the ER lumen will be heavily glycosylated. This causes an increase in molecular weight (2-3 kDa/glycan),
which can be detected by a shift in size of the protein upon endoglycosidase
H (Endo H) treatment. In cases where there is no change in molecular weight
of the fusion protein with and without Endo H treatment, the Suc2 part has
remained unglycosylated and likely resides in the cytoplasm. Since His4C
and Suc2 have opposite reporter profiles they constitute a suitable reporter
pair for topology mapping (Deak and Wolf, 2001, paper II and V).
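The acceptor-site rule described above (Asn-X-Ser/Thr, X not Pro) is easy to express as a pattern search, and the expected size shift follows from the 2-3 kDa per glycan quoted in the text. The sketch below is illustrative only; the function names and the test sequence are made up, and whether a given site is actually used in vivo is not something a sequence scan can tell.

```python
import re

# Lookahead so that overlapping sites (e.g. "NNSS") are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(seq: str) -> list:
    """Return 0-based positions of Asn in Asn-X-Ser/Thr, X != Pro."""
    return [m.start() for m in SEQUON.finditer(seq.upper())]


def glycan_mass_shift_kda(n_glycans: int) -> tuple:
    """Expected (min, max) shift in kDa, assuming 2-3 kDa per glycan."""
    return (2.0 * n_glycans, 3.0 * n_glycans)


toy = "MNGTANPSNNSS"           # hypothetical sequence
sites = find_sequons(toy)      # "NPS" is skipped because X is Pro
print(sites, glycan_mass_shift_kda(len(sites)))
```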
3.2 Prediction of topology
Despite the wide variety and high confidence of experimental techniques for
topology determination, it is practically demanding to accomplish full topology mapping of membrane proteins. Most proteins are polytopic and contain
several transmembrane helices, meaning that a large number of constructs
are required to map a detailed topology. Full experimental mapping is therefore not practical when proteome-wide, high-throughput analyses are intended. Consequently the contributions to topology information from theoretical and
computational calculations are of great importance.
Membrane protein folding is thought to follow a two-stage process (Popot
and Engelman, 1990). This model proposes that the individual helices initially are formed and inserted independently as stable domains (the first
stage), followed by the interaction and assembly of the helices to produce the
final structure (the second stage). The model of folding has later been expanded to a four-stage model (White and Wimley, 1999) to also account for
helix partitioning and folding in the water phase and the membrane interface
region, but the main stages (insertion and association) are the same as in the
two-stage model. Although it is an obvious simplification of the folding
process, it is a widely accepted model and seems to hold for most cases
(Chamberlain et al., 2003). Since the helix formation and insertion process is
assumed to be separate from the helix packing step, topology prediction is an
important first step in any full structure prediction method.
Membrane proteins possess certain attributes that make them additionally
well-suited for sequence-based topology prediction. As mentioned in chapter 2.4, the lipid bilayer imposes constraints on the protein architecture. The
protein zigzags back and forth across the membrane and the amino acid distribution in different regions reflects the very heterogeneous environments
that the membrane and extramembraneous compartments present. Topology
prediction algorithms try to capture those differences. There are two predominating properties that are fundamental and on which the prediction algorithms rely: the long stretches of hydrophobic amino acids forming the α-helices and the uneven distribution of positively charged amino acids in
loops flanking the α-helices. The following sections will describe the major
factors contributing to the topology in more detail.
3.2.1 Topological determinants
3.2.1.1 Hydrophobicity
The presence of mainly hydrophobic amino acids in the α-helices is the
most characteristic feature of the transmembrane domains. The dominating
residues are the hydrophobic amino acids Ala, Ile, Leu and Val that together
account for 45% of the TM residues (Ulmschneider et al., 2005). The hydrophobic effect and intrahelical hydrogen bonding make the α-helical conformation stable and compensate for the cost of transferring polar peptide bonds
into the membrane. Intuitively, polar and charged amino acids would be
expected to be absent from the hydrocarbon core due to energetic concerns.
Nevertheless, those residues are found occasionally and it has been concluded that they are required for the structure or function (Ubarretxena-Belandia and Engelman, 2001), for example by binding of prosthetic groups
or to mediate proton or electron transport. It has furthermore been shown
that polar residues are less frequently mutated in TM domains as compared
to in extramembraneous regions or globular proteins (Jones et al., 1994a;
Ubarretxena-Belandia and Engelman, 2001) which also indicates structural
and functional importance. In multi-spanning MPs not all residues face the
lipids. Helix-helix association creates environments where the side chains
are buried within the protein interior and can interact without lipid contact.
Buried residues are moreover found to be more conserved than residues facing the lipids (Eyre et al., 2004), which again suggests that they are important for the preservation of structure and function. It is even believed that the
interhelical interactions between polar amino acids are one of the driving
forces for helix packing and protein folding (Popot and Engelman, 2000).
Moreover, amino acids with small side chains like Ala and Gly have a preference for helix-helix interfaces since they allow the helices to pack tightly.
Glycine is also the core of the GxxxG motif known to stabilize helix-helix
interaction and oligomerization (Senes et al., 2000). In contrast, hydrophobic
amino acids with large bulky side chains like Phe, Leu, Ile and Val are more
exposed on the helix surfaces (Eilers et al., 2000; Ulmschneider et al., 2005).
As a consequence of the relaxed hydrophobicity requirement for residues
involved in helix packing, helices in multi-spanning MPs are on average less
hydrophobic than helices in single-spanning MPs (Jones et al., 1994a).
Moreover, charged residues in transmembrane segments do not necessarily
need to be completely buried; they can still be neutralized by forming intrahelical hydrogen bonds, i.e. interacting with residues on the same helix, and
thus be exposed to lipid tails without being energetically unfavorable (Eyre
et al., 2004), or the charged side chains can snorkel out to the interface region and thereby escape the lipid environment, see below.
3.2.1.2 Helix length
A transmembrane helix has to be sufficiently long to traverse the lipid bilayer. It should at least span through the 30 Å thick hydrophobic core, something that is attained if it is ~20 residues long since each residue adds ~1.5 Å
to the length of the helix. The average helix length is estimated to be around
26 residues (Bowie, 1997; Granseth et al., 2005a; Ulmschneider et al., 2005),
meaning that the helices often protrude into the membrane interface region.
The helices are usually tilted around 21-24 degrees relative to the membrane
normal (Bowie, 1997; Ulmschneider et al., 2005) so that longer helices
nicely can fit into the membrane core. There is also flexibility among the
lipids which allows a certain degree of bilayer distortion (Engelman et al.,
1986) or accumulation of lipids with suitable lengths of the hydrocarbon tails
around the TM helices to avoid “hydrophobic mismatch” (Chamberlain et
al., 2003). An additional factor that does not affect the helix length, but
rather has influence on the precise helix positioning and directions of the
side chains is the snorkeling effect. Residues flanking the helices tend to
point toward the membrane core if they are hydrophobic whereas polar and
charged residues instead are oriented towards the interface region (Granseth
et al., 2005a; Monné et al., 1998).
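The arithmetic behind these numbers can be checked with a few lines of code. The sketch uses only the values quoted above (30 Å core, ~1.5 Å rise per residue, tilt of 21-24 degrees) and treats the helix as a rigid rod, which is an idealization; the function names are hypothetical.

```python
import math

RISE_PER_RESIDUE = 1.5   # Å per residue along the helix axis (from the text)
CORE_THICKNESS = 30.0    # Å, hydrophobic core (from the text)


def min_residues_to_span(thickness: float = CORE_THICKNESS,
                         tilt_deg: float = 0.0) -> int:
    """Residues needed for a helix tilted by tilt_deg to cross the core."""
    helix_length = thickness / math.cos(math.radians(tilt_deg))
    return math.ceil(helix_length / RISE_PER_RESIDUE)


def vertical_span(n_residues: int, tilt_deg: float = 0.0) -> float:
    """Distance covered along the membrane normal, in Å."""
    return n_residues * RISE_PER_RESIDUE * math.cos(math.radians(tilt_deg))


print(min_residues_to_span())                 # untilted helix
print(round(vertical_span(26, 22.0), 1))      # average helix, typical tilt
```

An untilted helix needs 20 residues to cross the core, whereas an average 26-residue helix tilted by about 22 degrees still covers roughly 36 Å along the membrane normal, consistent with helix ends protruding into the interface region.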
3.2.1.3 Membrane-water interface region
There is no exact definition of the membrane-water interface region but in
general it is located ±15-25 Å away from the membrane center (see Fig. 1).
It is here that most TM helices end and the polypeptide chains continue as
loops of either irregular secondary structure (70% of the residues) or interfacial α-helices parallel to the membrane (30% of the residues) (Granseth et
al., 2005a). The chemically complex interface environment influences the
amino acid distribution which differs significantly from both the membrane
core and the surrounding aqueous phase. A striking observation is that polar
aromatic residues (Tyr and Trp) are enriched in the membrane interface on
both sides (Granseth et al., 2005a; Killian and von Heijne, 2000). It is believed that aromatic residues, with their flat rigid ring structures, are excluded from the membrane core (except Phe) because they would perturb the
ordering of the hydrocarbon tails too much. At the same time they are well
suited to the interface region, probably due to electrostatic interactions and
thereby help to anchor the transmembrane segments in the membrane
(Killian and von Heijne, 2000; Yau et al., 1998).
Glycine is the most frequent residue in this region and this is probably
due to its small size that enables the formation of short loops of the polypeptide facilitating close packing of the TM helices (Ulmschneider et al., 2005).
Proline is also preferred in the interface regions for the same reason, as it
allows the backbone to take specific conformations which can be advantageous as the loops connecting two TM helices on average are short, only 9
residues (Granseth et al., 2005a).
As mentioned earlier, charged residues are not common in the hydrocarbon core, but are more frequent in the interface region, as discussed in
the next section.
3.2.1.4 Positive-inside rule
The exact positioning of the transmembrane segments is determined by the
long stretches of hydrophobic residues together with the anchoring residues
in the interface region as discussed above. But what guarantees that the helices are inserted in the correct orientation? The helices themselves contain no
directional information. Statistical studies of bacterial inner membrane proteins have shown that positively charged residues (Arg and Lys) are more
prevalent in cytoplasmic loops than in non-cytoplasmic loops (von Heijne,
1986). The biased distribution, the so-called positive-inside rule, has later
been confirmed to be universal and holds for virtually all organisms, although it is somewhat less pronounced in eukaryotes (Nilsson et al., 2005).
No comparable enrichment of negatively charged residues (Asp or Glu) is
detected on any side of the membrane (Granseth et al., 2005a; Nilsson et al.,
2005). It is the increased presence of Arg and Lys in the vicinity of the helix
ends in the cytoplasmic loops that directs the helix orientation.
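In its simplest form the positive-inside rule can be applied by counting Arg and Lys in alternating loops and placing the more positively charged set in the cytoplasm. The sketch below does exactly that and nothing more: it counts over whole loops rather than only the residues near the helix ends, and the function names and loop sequences are hypothetical.

```python
def count_positive(segment: str) -> int:
    """Number of Arg (R) and Lys (K) residues in a loop."""
    return sum(segment.count(aa) for aa in "KR")


def predict_n_term(loops: list) -> str:
    """loops: loop sequences in order from N- to C-terminus.

    Loops 0, 2, 4, ... lie on the same side as the N-terminus,
    loops 1, 3, 5, ... on the opposite side.
    """
    same_side = sum(count_positive(l) for l in loops[0::2])
    other_side = sum(count_positive(l) for l in loops[1::2])
    if same_side > other_side:
        return "in"    # N-terminal side carries more K/R: cytoplasmic
    if other_side > same_side:
        return "out"
    return "undetermined"


toy_loops = ["MS", "KRLKR", "DE", "RRK"]   # hypothetical 3-helix protein
print(predict_n_term(toy_loops))
```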
How this guidance occurs in detail is not fully understood, but there are
presumably a number of factors that contribute to establishing the final topology. Interaction with negatively charged phospholipid head groups can
prevent translocation of a polypeptide chain with positive charge (van Klompenburg et al., 1997), which thus becomes retained in the cytoplasm.
Another aspect that affects which parts are translocated is the direct
contact with the Sec-translocon. Charged residues in the translocon complex
are believed to attract or repel the positive residues in the MP that is to be
inserted. It has been verified that mutations of residues to opposite charges in
the yeast Sec61p (the largest subunit in the Sec-translocation complex) affect
the MP orientation (Goder et al., 2004). The influence of the positive-inside
rule was reduced for a set of test membrane proteins as compared to when
wildtype Sec61p was present, with inverted orientations as a result.
Finally, the electrochemical potential across the bacterial inner membrane is anticipated to play a role for the orientation. The basic cytoplasm
and the acidic periplasm seem to prevent translocation of positively charged
polypeptide segments containing Arg and Lys residues and facilitate translocation of less positively charged segments (Andersson and von Heijne,
1994). This phenomenon contributing to the positive-inside rule cannot be
applied to eukaryotic MPs because there is no general potential across the
ER membrane. This might be one reason why the distribution of Arg and
Lys in cytoplasmic loops compared to non-cytoplasmic loops is less biased
in eukaryotes than in prokaryotes.
3.2.1.5 Signal peptides
In order to be inserted into the membrane, TM proteins have to be targeted to
the translocon present in the ER in eukaryotes (Sec61) and in the plasma
membrane in prokaryotes (SecY), reviewed in (Luirink et al., 2005). Targeting is mainly mediated by the signal recognition particle (SRP) that binds to
the first hydrophobic segment as the nascent polypeptide chain is translated
on the ribosome. The SRP directs the ribosome to the translocon and the N-terminal part of the peptide enters the protein-conducting channel. The hydrophobic segment acts as a signal anchor sequence that most often is transferred laterally into the membrane and stays there as the first TM helix. In
some cases it can be cleaved off by the enzyme signal peptidase yielding a
TM protein with non-cytoplasmic N-terminal topology. The presence of a
signal peptide (SP) in membrane proteins is thus an important topological
determinant.
Secretory proteins follow the same pathway through the translocon (although not always co-translationally), but since they are not intended to be
kept in the membrane they all have a cleavable signal peptide. The SP is
hydrophobic in the middle of the sequence and forms an α-helix, but this is
in general shorter than a TM helix and somewhat less hydrophobic. Still, an
SP is similar enough to a TM helix to cause confusion among prediction
programs trying to discriminate between secreted and membrane proteins
(Käll et al., 2004).
3.2.1.6 Re-entrant regions
During the last years it has become clear that it is not only transmembrane helices that enter the membrane in helical MPs. So-called re-entrant
regions also penetrate into the membrane, but instead of traversing the membrane, the peptide chain enters and leaves at the same side of the membrane.
Usually re-entrant regions are short, on average ~13 residues. Some contain
an α-helix that digs into the membrane, makes a turn and goes back, either
as a second α-helix or as coil, while others contain only coil structures
(Viklund et al., 2006). Aquaporin-1 is a nice example where two re-entrant
regions, one from each side meet in the center of the membrane and thereby
can form a stable structure (Murata et al., 2000) (Fig. 4). Viklund et al. estimated that at least 10% of all TM proteins have re-entrant regions and that these
are most common in channels and transporters. In these regions there is an abundance of residues with small side chains (Ala and Gly) and the overall amino acid distribution is of intermediate hydrophobicity.
[Figure 4 image. Labels: non-cytoplasm, cytoplasm; transmembrane helices 1-6; re-entrant regions B and E]
Figure 4. Ribbon diagram of aquaporin-1 (PDB code 2D57) showing six transmembrane helices (numbered 1-6) and two re-entrant regions (B and E) each folded as a
half-helix connected to a coil structure. The N-terminal ends of the half-helices face
the hydrophilic water-conducting pore and contain the conserved motif Asn-Pro-Ala
which holds together the half-helices by hydrogen bonding and van der Waals interactions.
3.2.2 Topology prediction algorithms
Taking all the abovementioned topological elements into account, one would
imagine that we have all the information needed for correct topology prediction. But still it has turned out to be challenging to incorporate all the various
TM properties into one general model. Three basic criteria have to be fulfilled for a topology prediction algorithm, namely ability to i) discriminate
between globular and membrane proteins, ii) determine the number and positions of the transmembrane segments and iii) determine the inside/outside
orientation of the protein. Additional criteria for more sophisticated algorithms are ability to iv) discriminate between a signal peptide and a transmembrane helix and v) predict re-entrant regions and interfacial helices.
It is always a bit tricky to compare the performance of different topology
prediction methods since many are machine learning algorithms developed
using different training sets of proteins with known topology. The lack of
structural data restricts the size of training and test sets for membrane proteins. Ideally the sets should be completely disparate but this has not always
been the case. Some algorithms might have been trained on sequences present in the test set, which biases the result. Overtraining is an obvious problem
as test set proteins have turned out to be easier to predict than whole TM
proteomes (Käll and Sonnhammer, 2002, paper I). Several benchmarking
studies have been performed (Chen et al., 2002; Ikeda et al., 2002; Möller et
al., 2001) and the conclusion is that no single method always performs best
and that the evaluated performance accuracy usually lies in the range 50-70%. Incorporation of evolutionary information (Jones, 2007; Viklund and
Elofsson, 2004), experimental knowledge (paper I) or using consensus predictions (Arai et al., 2004; Nilsson et al., 2000) increases the prediction accuracy.
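As an illustration of the consensus idea, the sketch below takes the topologies predicted by several methods for one protein and reports the majority prediction together with the fraction of methods that agree. This is a plain majority vote written for this text, not the procedure of the cited studies; the function name and the example predictions are hypothetical.

```python
from collections import Counter


def consensus_topology(predictions: list):
    """predictions: one (n_helices, n_term_side) tuple per method.

    Returns (topology, agreement); topology is None when no prediction
    is supported by more than half of the methods.
    """
    counts = Counter(predictions)
    top, votes = counts.most_common(1)[0]
    agreement = votes / len(predictions)
    if votes > len(predictions) / 2:
        return top, agreement
    return None, agreement


five_methods = [(7, "out"), (7, "out"), (6, "out"), (7, "out"), (7, "in")]
print(consensus_topology(five_methods))
```

The agreement fraction can also serve as a rough confidence indicator: proteins on which all methods agree are more likely to be predicted correctly than those with split votes.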
Topology prediction algorithms can be divided into two broad categories,
methods based on hydrophobicity scales and methods based on machine
learning approaches. Below I will explain the differences between the categories, describe the particular methods used in my studies, and discuss some
of the most recent methods that have had impact on the prediction performance.
3.2.2.1 Hydrophobicity scales
The earliest methods for predicting transmembrane segment locations were
based on hydrophobicity indices reflecting each amino acid’s propensity for
being embedded in the membrane. The indices were derived from experimental or theoretical studies estimating the free energy of transferring an
amino acid from aqueous solution to nonpolar environment resembling the
membrane interior and summarized in different hydrophobicity scales
(Engelman et al., 1986; Kyte and Doolittle, 1982; White and Wimley, 1999;
Wimley and White, 1996). In a prediction, it is only information from amino
acids anticipated to be embedded in the membrane that is considered. A
sliding window approach is applied where a window of fixed length is
scanned along the sequence and the hydrophobicity indices for each residue
within the window are summed. The average hydrophobicity for the center
position in the window is plotted to generate a hydrophobicity profile of the
whole protein. Segments sufficiently long and above a heuristically determined cut-off are predicted as transmembrane.
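The sliding-window procedure can be sketched as follows, here with the Kyte and Doolittle (1982) hydropathy values. The window length of 19 residues and the cut-off of 1.6 are a commonly quoted heuristic for that scale and are assumptions of this example, as are the function names and the test sequence.

```python
# Kyte-Doolittle hydropathy indices
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}


def hydropathy_profile(seq: str, window: int = 19) -> list:
    """Average hydropathy per window; profile[i] belongs to residue i + window // 2."""
    half = window // 2
    return [sum(KD[aa] for aa in seq[c - half:c + half + 1]) / window
            for c in range(half, len(seq) - half)]


def predict_tm_segments(seq: str, window: int = 19,
                        cutoff: float = 1.6, min_run: int = 1) -> list:
    """Runs of window centers with average hydropathy >= cutoff."""
    half = window // 2
    segments, start = [], None
    for i, value in enumerate(hydropathy_profile(seq, window)):
        pos = i + half
        if value >= cutoff and start is None:
            start = pos
        elif value < cutoff and start is not None:
            if pos - start >= min_run:
                segments.append((start, pos - 1))
            start = None
    if start is not None:                      # run reaches the last window
        end = len(seq) - half - 1
        if end - start + 1 >= min_run:
            segments.append((start, end))
    return segments


toy = "DEKRDEKRDE" + "L" * 25 + "DEKRDEKRDE"   # hypothetical sequence
print(predict_tm_segments(toy))
```

The reported coordinates are window centers, so the ends of a predicted segment are only approximate; this is one reason why later methods refine the window shape or model the helix ends explicitly.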
One drawback with the early hydrophobicity scales is that they are poor
in discriminating between globular and membrane proteins and usually overpredict the number of TM helices (Chen et al., 2002; Möller et al., 2001).
Another disadvantage is that they do not predict the sidedness of the protein
meaning that the inside/outside orientation remains unknown.
3.2.2.1.1 TopPred
The prediction progress was taken one step further by the development of
TopPred (von Heijne, 1992). The method integrated the GES hydrophobicity
scale (Engelman et al., 1986) with the positive-inside rule and was actually
the first method to predict the complete topology and not only the number of
TM segments and their positions. The sliding-window analysis was also
somewhat refined by using a trapezoid window to reflect the environmental
transitions between membrane interface (triangular shape) and membrane
core (rectangular shape). A hydrophobicity plot is constructed and every
peak above an upper cut-off is considered as a certain TM helix whereas a
peak between a lower and the upper cut-off is considered as putative. All
possible topologies are created, always including the certain helices but alternatively including or excluding the putative ones. By applying the positive-inside rule (maximizing the difference in the number of Arg and Lys in
potentially inside and outside loops) the most probable topology can be predicted.
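The enumeration step can be illustrated with a short sketch: certain helices are always kept, every combination of putative helices is tried, and the topology with the largest Arg+Lys difference between the two sides wins. This is a simplified reading of the description above and not the TopPred implementation; the function names, the half-open `(start, end)` coordinates and the toy sequence are assumptions.

```python
from itertools import product


def count_kr(segment: str) -> int:
    return sum(segment.count(aa) for aa in "KR")


def best_topology(seq: str, certain: list, putative: list):
    """Return (charge_bias, n_term_side, helices) for the best topology.

    certain, putative: lists of (start, end) half-open helix intervals.
    """
    best = None
    for mask in product([False, True], repeat=len(putative)):
        helices = sorted(certain + [h for h, keep in zip(putative, mask) if keep])
        # loop boundaries: sequence start, helix starts/ends, sequence end
        bounds = [0] + [x for h in helices for x in h] + [len(seq)]
        loops = [seq[bounds[i]:bounds[i + 1]] for i in range(0, len(bounds), 2)]
        n_side = sum(count_kr(l) for l in loops[0::2])   # same side as N-terminus
        other = sum(count_kr(l) for l in loops[1::2])
        bias = abs(n_side - other)
        n_term = "in" if n_side >= other else "out"      # more K/R = inside
        if best is None or bias > best[0]:
            best = (bias, n_term, helices)
    return best


# Hypothetical protein: two certain helices and one putative helix between them.
toy = "KKR" + "L" * 20 + "SD" + "V" * 20 + "RRK" + "I" * 20 + "ES"
print(best_topology(toy, certain=[(3, 23), (48, 68)], putative=[(25, 45)]))
```

In this example the two-helix topology gives no charge bias at all, while including the putative helix places all six positive charges on one side, so the three-helix topology with a cytoplasmic N-terminus is selected.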
3.2.2.2 Machine learning approaches
More advanced prediction methods are usually based on some machine
learning technique such as neural networks (NNs) or hidden Markov models
(HMMs). These methods use collections of known MP structures or experimentally confirmed topologies as training data to statistically estimate the
amino acid distributions in topologically distinct regions of a model membrane protein (e.g. TM helices, interfaces, inside and outside loops). A prediction is the outcome of optimizing the matching of the residue distribution
in an examined protein with the pre-calculated distributions in the different
regions of the model. The whole protein is analyzed at a time where all topological signals are taken into account concurrently. Thus, the amino acid
compositions in other regions than the membrane-spanning parts have impact on the predicted topology, which is not the case in the simpler hydrophobicity scale-based methods. This has turned out to be a successful
strategy and advanced methods almost always perform better than
simpler methods in benchmarking studies (Chen et al., 2002; Möller et
al., 2001).
3.2.2.2.1 MEMSAT
One of the first machine learning methods to take advantage of a global view
of all topological signals was MEMSAT (Jones et al., 1994b). It defines five
structural states (inside loop, inside helix end, helix middle, outside helix
end, outside loop) and each state is associated with a statistical table of the
frequency of the 20 amino acids, represented as log likelihood ratios. The
tables were compiled from a set of proteins of known topologies where single-spanning and multi-spanning proteins are treated separately since the TM
helices tend to have somewhat different properties (Jones et al., 1994a). All
possible topologies are explored, starting from one helix (in both orientations) and successively increasing the number of helices by one up to an
upper limit depending on the protein length. Each topology is then scored
according to a statistical method (expectation maximization). For a given
number of TM helices a dynamic programming algorithm is applied to
search for their optimal locations and lengths. A list of the optimized
topologies together with their scores is produced. The topology with the
highest score is the final prediction.
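The dynamic programming step, finding the best locations and lengths for a given number of helices, can be sketched in a strongly reduced form. The example below is not the MEMSAT algorithm: it collapses the five structural states into a single per-residue helix score (which in practice would be a log likelihood ratio), ignores orientation, and uses hypothetical length limits and function names.

```python
from typing import List, Tuple

NEG_INF = float("-inf")


def place_helices(scores: List[float], n_helices: int,
                  min_len: int = 17, max_len: int = 25,
                  min_loop: int = 1) -> Tuple[float, List[Tuple[int, int]]]:
    """Best total score and half-open (start, end) helix positions."""
    n = len(scores)
    prefix = [0.0]
    for s in scores:
        prefix.append(prefix[-1] + s)
    # best[h][i]: best score with h helices placed inside scores[:i]
    best = [[NEG_INF] * (n + 1) for _ in range(n_helices + 1)]
    choice = [[None] * (n + 1) for _ in range(n_helices + 1)]
    best[0] = [0.0] * (n + 1)
    for h in range(1, n_helices + 1):
        gap = min_loop if h > 1 else 0          # loop before helix h
        for i in range(1, n + 1):
            best[h][i] = best[h][i - 1]         # residue i-1 is loop
            for length in range(min_len, max_len + 1):
                start = i - length              # helix h covers [start, i)
                if start - gap < 0:
                    break
                prev = best[h - 1][start - gap]
                if prev == NEG_INF:
                    continue
                cand = prev + prefix[i] - prefix[start]
                if cand > best[h][i]:
                    best[h][i], choice[h][i] = cand, start
    if best[n_helices][n] == NEG_INF:
        return NEG_INF, []                      # sequence too short
    helices, h, i = [], n_helices, n            # trace back the choices
    while h > 0:
        if choice[h][i] is None:
            i -= 1
        else:
            start = choice[h][i]
            helices.append((start, i))
            i = start - (min_loop if h > 1 else 0)
            h -= 1
    return best[n_helices][n], helices[::-1]


# Toy scores: +1 inside two hydrophobic stretches, -1 elsewhere.
toy = [-1.0] * 5 + [1.0] * 20 + [-1.0] * 6 + [1.0] * 18 + [-1.0] * 4
for k in (1, 2):
    print(k, place_helices(toy, k))
```

Running the search for each helix count in turn and comparing the resulting scores mirrors the ranked list of optimized topologies described above.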
The approach used in MEMSAT can be seen as a forerunner of prediction
methods based on hidden Markov models (HMMs), see TMHMM and
HMMTOP below.
3.2.2.2.2 PHDhtm
PHDhtm (Rost et al., 1996) is a program belonging to a general tool, PHD
(Rost, 1996), for predicting secondary structures of proteins. It is designed to
predict the topology of membrane proteins by using evolutionary information in a stepwise procedure. The first step is a BLAST search (Altschul et
al., 1990) of the query sequence against the SWISSPROT database
(Boeckmann et al., 2003) to identify relevant homologs. The hits are aligned
in a multiple sequence alignment which is fed into a neural network. The
network estimates the preference for each residue to be or not to be in a
transmembrane helix state. In the second step, the region with highest transmembrane preference is compared to a threshold value to decide whether the
protein is a membrane or globular protein. If it is classified as an MP, the network preferences are used again in the third step as input to a dynamic programming algorithm that finds the optimal number and locations of the TM
helices (the highest-scoring model). Lastly, the positive-inside rule is applied
to determine the orientation and to generate the final topology of the protein.
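As an illustration of the last step, the positive-inside rule can be sketched as a simple charge count over alternating loops. This is a simplified sketch and not the PHDhtm implementation: it assumes that only Lys and Arg count as positive, that whole loops are counted regardless of length, and that ties are resolved as "in". The function name and the example sequences are hypothetical.

```python
from typing import List, Tuple

POSITIVE = set("KR")

def n_terminal_side(seq: str, helices: List[Tuple[int, int]]) -> str:
    """Return 'in' or 'out' for the N terminus (helix end indices are exclusive)."""
    # Loop boundaries: sequence start, every helix start and end, sequence end.
    bounds = [0] + [pos for helix in helices for pos in helix] + [len(seq)]
    loops = [seq[bounds[i]:bounds[i + 1]] for i in range(0, len(bounds), 2)]
    charge = [sum(aa in POSITIVE for aa in loop) for loop in loops]
    # Loops alternate sides, so every second loop lies on the N-terminal side.
    same_side = sum(charge[0::2])
    other_side = sum(charge[1::2])
    return "in" if same_side >= other_side else "out"
```

For a single helix flanked by `MKRK` at the N terminus and `DE` at the C terminus, the sketch places the N terminus inside.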
Together with the predicted topology, PHDhtm provides two indices,
ranging from 0 to 9, for estimating the prediction reliability. One reliability
index is defined for the model (i.e. the number and locations of the TM helices), based on the difference in score for the best and second best model.
The other reliability index is related to the predicted orientation and is proportional to the positive charge difference between the cytoplasmic and noncytoplasmic parts of the protein.
3.2.2.2.3 TMHMM
TMHMM (Krogh et al., 2001; Sonnhammer et al., 1998) is based on a hidden Markov model with seven distinct states (helix core, helix caps on either
side of the membrane, short loop on cytoplasmic side, short and long loops
on non-cytoplasmic side and globular domains in the middle of each loop)
corresponding to the well-defined regions in membrane proteins. Each type
of state has a probability distribution of the 20 amino acids (emission probabilities) that is estimated from a set of proteins with experimentally known
topologies. Between the states there are transition probabilities that reflect
the likelihood of either staying in the same state or moving to the next state,
also estimated from training data. The architecture of TMHMM is cyclic and
biologically relevant as the transitions between the states force the succession of predicted protein regions to be in the correct order, i.e. an inside
loop is always followed by a helix, followed by an outside loop and so on
(Fig. 5a).
Given the defined model, the algorithm finds the most probable path
through the states for a query protein, i.e. it maximizes the correlation between the sequence of state emission probabilities and the observed amino
acids. The predicted topology is represented as a labeled sequence of the
three classes i (inside or cytoplasmic), h (helix) and o (outside or extracytoplasmic). This is calculated by the N-best algorithm (Krogh, 1997)
which maximizes the likelihood that the query sequence is generated by a set
of state paths (that all generate the same topology), here denoted as p(best
topology), where p stands for probability. There are many possible paths that
a protein sequence can take through the model and their summed probabilities can be calculated with a procedure called the forward algorithm, here
denoted as p(all possible topologies). Each residue is labeled i, h or o as
mentioned above, but a posterior probability is given for all three classes,
p(i), p(h) and p(o) to each residue, which is interpreted as the probability to
be in each class given the residue (Fig 5b). Note that there is not necessarily
a one-to-one correlation between the most probable labeling according to the
posterior probability and the final predicted topology. The reason for this is
that posterior probabilities are “local” and are not limited to obey the restrictions of allowed state paths.
The strength of HMMs is that the optimal path, i.e. the prediction, is
found in one step and thus all protein parts are modeled simultaneously. The
influence of all topological signals is therefore also dependent on their relative clearness rather than only the actual matching to each state. This is
beneficial when signals are weaker than usual, as for example in multi-spanning proteins where some helices are less hydrophobic than the average
helix.
(a) [State diagram of TMHMM, not reproduced here.]
(b)
Sequence:  M     S     W     N     L     L     F     V
p(i):      0.76  0.71  0.68  0.68  0.14  0.14  0.00  0.00
p(h):      0.00  0.05  0.08  0.11  0.83  0.85  0.99  0.99
p(o):      0.24  0.24  0.24  0.21  0.03  0.01  0.01  0.01
Label:     i     i     i     i     h     h     h     h
Figure 5. (a) The architecture of TMHMM showing the different states (as boxes)
and the permitted transitions connecting the states (as arrows). States with the same
names have the same amino acid distributions. Figure from (Krogh et al., 2001),
reprinted with permission from Elsevier. (b) An example of a TMHMM output for
the first eight residues in the membrane protein SWF1 from S. cerevisiae. The posterior probabilities for the three classes i, h and o are listed for each residue and the
underlined numbers correspond to the labeling, i.e. the prediction, which yields the
highest probability for the whole protein, p(best topology).
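The values in Figure 5b can be used to illustrate what the per-residue posterior labeling means. The short sketch below picks, for each residue, the class with the highest posterior probability; the variable and function names are illustrative. For these eight residues the result coincides with the predicted labels, but as noted in the text this is not guaranteed in general.

```python
# Posterior probabilities for the first eight residues of SWF1 (Fig. 5b).
residues = "MSWNLLFV"
posterior = {
    "i": [0.76, 0.71, 0.68, 0.68, 0.14, 0.14, 0.00, 0.00],
    "h": [0.00, 0.05, 0.08, 0.11, 0.83, 0.85, 0.99, 0.99],
    "o": [0.24, 0.24, 0.24, 0.21, 0.03, 0.01, 0.01, 0.01],
}

def posterior_labels(post):
    """Label each residue with its locally most probable class."""
    n = len(next(iter(post.values())))
    return "".join(max(post, key=lambda c: post[c][k]) for k in range(n))

labels = posterior_labels(posterior)
print(residues)
print(labels)
```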
3.2.2.2.4 HMMTOP
HMMTOP (Tusnady and Simon, 1998; Tusnady and Simon, 2001) is another HMM-based method with an architecture similar to that of TMHMM.
Here, however, there are only five structural states (inside loop, inside helix
tail, helix, outside helix tail and outside loop). The differences from
TMHMM are that there is no globular state and that short
loops are modeled by omitting the loop state and instead connecting the helix tail
state to another helix tail state on the same side. The helix tail states thus
model segments located in the loops (but close to the membrane) and are
therefore not equivalent to the helix cap states of TMHMM which model
segments that are parts of the TM helices.
The major difference between the two methods lies in the description of
the driving forces for membrane protein folding. Whereas TMHMM is based
on the assumption that the different structural parts are composed of more or
less predetermined amino acid distributions that should hold for all membrane proteins, the hypothesis of HMMTOP is that the topology is determined by the difference in amino acid distributions between the various
structural parts and thus not solely on the absolute amino acid compositions
in the separate parts. Therefore, the method first uses the query protein sequence to optimize its state parameters and then searches for the combination of states that gives the maximum divergence in the amino acid distributions among the predicted segments. The idea is that the large differences in
physicochemical properties that different parts of the protein encounter
should be reflected in large changes in the amino acid distributions.
There is furthermore an option to include homologous sequences to improve the prediction performance. Those sequences are then used one by one
in the state parameter optimization process.
3.2.2.3 Recently developed methods
In recent years, new methods have been developed that are even more
sophisticated than the previously described ones. Some contain additional
structural elements, some also benefit from being trained on a larger data set
than earlier methods and others apply new approaches.
It has long been known that the use of evolutionary information, i.e. homologous sequences, increases the prediction accuracy for globular proteins (Rost and Sander, 1993) as well as for membrane proteins
(Persson and Argos, 1994; Rost et al., 1996; Tusnady and Simon, 1998). As
mentioned above, a multiple sequence alignment is used in PHDhtm whereas
in HMMTOP the homologs are used as single sequences to estimate new
model parameters. In the recent methods prodiv-TMHMM (Viklund and
Elofsson, 2004) and MEMSAT3 (Jones, 2007), information from multiple
sequences is used in the form of sequence profiles, which significantly improves the
prediction performance.
Phobius (Käll et al., 2004) and TOP-MOD (Viklund et al., 2006) are two
other novel methods, both of which are based on HMMs, that incorporate
prediction of additional substructures besides the ordinary transmembrane
helices and loops. Phobius is able to model signal peptides (SPs) and TM helices
simultaneously and thus reduces the risk of mixing up the first TM helix and
an SP. TOP-MOD identifies re-entrant regions and also attempts to predict
interfacial helices but that has turned out to be more challenging, possibly
due to weaker sequence characteristics.
A new approach has been taken in (Hessa et al., 2005) where the contribution from individual residues to the membrane insertion efficiency of a
TM helix has been analyzed by designing polypeptide segments and quantifying the degree of insertion. A ‘biological’ hydrophobicity scale has been
developed as well as a position-dependent free energy matrix. A novelty in
this approach compared to the derivation of traditional hydrophobicity scales
is the experimental design where the measurements are made on ER membranes in vitro (dog pancreas microsomes) and not on residue or peptide
partitioning into aqueous and non-polar solvent respectively. Moreover, the
positional dependence of the residues is accounted for and, in agreement
with statistical studies, it was shown that charged and polar residues are unfavorable in the middle of the helices, polar aromatic residues Trp and Tyr
are preferred towards the helix ends, and the contribution from hydrophobic
residues does not vary much with the position within the helices.
A way to bridge the topology knowledge between experimental studies
and bioinformatics is to use experimental information as constraints in theoretical predictions. If one or more residues in a membrane protein are constrained to lie on one or the other side of the membrane, the number of possible topologies is reduced and the likelihood of predicting the correct topology increases, as described more thoroughly in chapter 4.1 (paper I). Instead
of experimental information, domain assignments can also be used as a priori topological data fed into prediction algorithms (Bernsel and von Heijne,
2005). Extramembraneous soluble domains that are compartment-specific,
i.e. always localized in the cytoplasm or the extracytoplasmic space but
never found on both sides of the membrane, are estimated to be present in at
least 11% of eukaryotic membrane proteomes. These findings were shown to
increase the prediction accuracy in general when used as constraints, particularly for single-spanning MPs.
Finally, reliability scores can be a guide to dismiss or approve predictions
and a help to identify the most dubious topologies that are worth confirming
experimentally, see chapter 4.1 (paper I).
4 Summary of papers
4.1 Reliability measures for topology predictions and
the use of experimental knowledge (Paper I)
The objective behind this study was to find different strategies to overcome
the relatively moderate performance of the most widely used topology prediction methods at that time. We managed to do this in some respects by
deriving reliability scores that make it possible to estimate the trustworthiness of a prediction, and by showing that limited experimental information
given a priori to a prediction algorithm considerably increases the accuracy.
We examined the five topology prediction methods TopPred, PHDhtm,
MEMSAT, TMHMM and HMMTOP and defined for each a reliability score
based on their respective raw output. A test set of 92 prokaryotic proteins
with experimentally determined topologies was used to assess prediction
accuracy and its correlation to the constructed reliability scores. As seen in
Figure 6, the best correlations were obtained for TMHMM and MEMSAT,
whereas the scores did not seem very useful for the other methods. For both
TMHMM and MEMSAT ~50% of the predictions have reliability scores
corresponding to a prediction accuracy of ~90%, and ~70% of the proteins
have scores corresponding to a prediction accuracy of ~80%. This should be
compared to accuracies of 66 and 70% respectively if the whole test set is
benchmarked. Thus, by considering the score it is possible to estimate the
likelihood that a given prediction is correct.
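The relation between reliability score, coverage and accuracy can be made concrete with a small sketch. It ranks predictions by score and reports the accuracy within the top-scoring fraction of the test set. The function name and the six example predictions are invented for illustration and are not data from paper I.

```python
import math
from typing import List, Tuple

def accuracy_at_coverage(preds: List[Tuple[float, bool]], coverage: float) -> float:
    """Fraction of correct predictions among the top-scoring `coverage` of the set."""
    ranked = sorted(preds, key=lambda p: p[0], reverse=True)
    n_keep = max(1, math.ceil(coverage * len(ranked)))
    kept = ranked[:n_keep]
    return sum(correct for _, correct in kept) / n_keep

# Hypothetical (score, prediction correct?) pairs.
example = [(0.95, True), (0.90, True), (0.80, True),
           (0.60, False), (0.50, True), (0.30, False)]
print(accuracy_at_coverage(example, 0.5))   # accuracy among the best-scoring half
print(accuracy_at_coverage(example, 1.0))   # accuracy on the whole set
```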
The TMHMM reliability score was defined as p(best topology)/p(all possible topologies) and takes values between 0 and 1. It gets close to 1 if the
suggested topology has high probability at the same time as there are few
other topologies that can compete with it. The opposite applies to scores
close to 0, i.e. then several other topologies might be as likely as the suggested one and such predictions should therefore be considered with caution.
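The ratio p(best topology)/p(all possible topologies) can be illustrated on a toy HMM that is small enough to enumerate. The sketch below is not TMHMM: the four states, the two-letter alphabet (L for hydrophobic, K for charged) and all probabilities are invented for illustration, and the best labeling is found by brute-force enumeration rather than by the N-best algorithm. Several state paths share one labeling because both helix states are labeled h.

```python
from collections import defaultdict
from itertools import product

STATES = ["I", "C", "H", "O"]                  # toy states, not TMHMM's
LABEL = {"I": "i", "C": "h", "H": "h", "O": "o"}
START = {"I": 0.5, "C": 0.0, "H": 0.0, "O": 0.5}
TRANS = {
    "I": {"I": 0.8, "C": 0.2, "H": 0.0, "O": 0.0},
    "C": {"I": 0.1, "C": 0.0, "H": 0.8, "O": 0.1},
    "H": {"I": 0.0, "C": 0.3, "H": 0.7, "O": 0.0},
    "O": {"I": 0.0, "C": 0.2, "H": 0.0, "O": 0.8},
}
EMIT = {
    "I": {"L": 0.3, "K": 0.7},
    "C": {"L": 0.6, "K": 0.4},
    "H": {"L": 0.9, "K": 0.1},
    "O": {"L": 0.5, "K": 0.5},
}

def path_probability(path, seq):
    """Joint probability of one state path and the sequence."""
    p = START[path[0]] * EMIT[path[0]][seq[0]]
    for prev, cur, aa in zip(path, path[1:], seq[1:]):
        p *= TRANS[prev][cur] * EMIT[cur][aa]
    return p

def forward_probability(seq):
    """p(all possible topologies): sum over every state path (forward algorithm)."""
    f = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for aa in seq[1:]:
        f = {s: EMIT[s][aa] * sum(f[r] * TRANS[r][s] for r in STATES)
             for s in STATES}
    return sum(f.values())

def reliability(seq):
    """Return the most probable labeling and its share of the total probability."""
    by_labeling = defaultdict(float)
    for path in product(STATES, repeat=len(seq)):
        by_labeling["".join(LABEL[s] for s in path)] += path_probability(path, seq)
    best_labeling, p_best = max(by_labeling.items(), key=lambda kv: kv[1])
    return best_labeling, p_best / forward_probability(seq)
```

Enumeration is only feasible for very short sequences; it is used here solely to make the definition of the score explicit.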
Furthermore we used the TMHMM score to assess the degree of bias in
the test set compared to the predicted membrane proteomes of E. coli, S.
cerevisiae and Caenorhabditis elegans. We found a much larger fraction of
high-scoring proteins in the test set compared to the whole proteomes and
consequently estimated the prediction accuracy to be far lower for the full
proteomes. This bias is a result of limited experimental data and overtraining
which also was confirmed in another study (Käll and Sonnhammer, 2002).
Figure 6. A plot showing the relation between the reliability scores and test set cumulative coverage. The overall accuracies measured on the entire test set (found at
100% coverage) lie between 50 and 70%. See paper I for details.
The other way to address the low expected prediction accuracies was to investigate the effect of including experimental information in the predictions.
The TMHMM algorithm allows the class assignment for a residue (or region) to be set a priori. In other words, the probability for a certain residue
to be located in, for example, an inside loop can be set to 1.0 (p(i)=1.0 and
p(o)=p(h)=0.0) which means that the prediction is fixed in that location for
that residue and only topologies that are compatible with this constraint will
be considered valid. When the number of possible solutions decreases (all
topologies opposing the fixation are ignored), the likelihood of predicting the
correct topology is expected to increase.
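How such a constraint restricts the search can be sketched with a toy Viterbi decoder in which states that contradict a fixed class are given zero probability. This is a simplified illustration and not the TMHMMfix code: the four-state cyclic model, the two-letter alphabet (L, K) and all probabilities are assumptions made for the example.

```python
import math

STATES = ["i", "a", "o", "b"]   # a: helix running in->out, b: helix running out->in
LABEL = {"i": "i", "a": "h", "o": "o", "b": "h"}
START = {"i": 0.5, "o": 0.5}
TRANS = {"i": {"i": 0.8, "a": 0.2}, "a": {"a": 0.7, "o": 0.3},
         "o": {"o": 0.8, "b": 0.2}, "b": {"b": 0.7, "i": 0.3}}
EMIT = {"i": {"L": 0.3, "K": 0.7}, "a": {"L": 0.9, "K": 0.1},
        "o": {"L": 0.5, "K": 0.5}, "b": {"L": 0.9, "K": 0.1}}

def log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(seq, fixed=None):
    """Most probable labeling; `fixed` maps residue index -> required class (i/h/o)."""
    fixed = fixed or {}

    def emit(k, s):
        # A state that contradicts a constraint gets log-probability -inf.
        if k in fixed and LABEL[s] != fixed[k]:
            return float("-inf")
        return log(EMIT[s][seq[k]])

    score = {s: log(START.get(s, 0.0)) + emit(0, s) for s in STATES}
    pointers = []
    for k in range(1, len(seq)):
        # Best predecessor of each state, then the updated path scores.
        ptr = {s: max(STATES, key=lambda r: score[r] + log(TRANS[r].get(s, 0.0)))
               for s in STATES}
        score = {s: score[ptr[s]] + log(TRANS[ptr[s]].get(s, 0.0)) + emit(k, s)
                 for s in STATES}
        pointers.append(ptr)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(pointers):
        state = ptr[state]
        path.append(state)
    return "".join(LABEL[s] for s in reversed(path))
```

For example, `viterbi(seq, {len(seq) - 1: "i"})` only considers paths whose last residue is cytoplasmic, which corresponds to fixing the C terminus.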
We assigned the C-terminal residue for all test set proteins to its experimentally known class and registered the prediction performance. There was
an increase in accuracy from 66% (unfixed predictions) to 77% (fixed predictions). Thus, with very limited topological pre-knowledge it was possible
to get much improved topologies. We further estimated that for the three
membrane proteomes studied, the prediction accuracy will increase by at
least the same amount, given that the C-terminal location is known.
The improvements to TMHMM, i.e. the derivation of a reliability score and the
option to use experimental information, were implemented in a refined
version of TMHMM2.0, namely TMHMMfix, which is publicly available
(http://www.sbc.su.se/~melen/TMHMMfix/).
4.2 Topology models for a small number of S.
cerevisiae membrane proteins based on C-terminal
reporter fusions and predictions (Paper II)
This is a pilot study where we for the first time combined experimental results with TMHMMfix. The motives for working with S. cerevisiae were
that it is a common model organism for other eukaryotes, at the same time as
prediction methods in general perform less well on yeast membrane proteins
as compared to both mammalian and prokaryotic membrane proteins
(Nilsson et al., 2002, paper I).
Encouraged by the performance improvement estimated in paper I, the idea
was to determine the C-terminal location for a set of yeast membrane proteins and to use that information as constraints to predict reliable topology
models.
Target proteins were selected by scanning the yeast genome and choosing
the predicted open reading frames (ORFs) for which the five prediction
methods (TopPred, MEMSAT, PHDhtm, TMHMM and HMMTOP) all produced the same topology. It had been shown earlier (Nilsson et al., 2000)
that consensus predictions have high proportions of correctly predicted topologies and thus gave us a good basis for verifying the orientation of the
proteins experimentally and also correcting possibly incorrectly predicted
topologies.
Only proteins predicted to have at least two transmembrane segments
were included in order to avoid confusion between cleavable signal peptides
of secretory proteins and single-spanning membrane proteins. Furthermore,
genes containing introns or genes with dubious ORFs were removed as well
as proteins not expected to be targeted to the secretory pathway, as the experimental design requires insertion into the ER membrane.
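The consensus part of the selection can be sketched as a simple filter. The sketch reduces a topology to the N-terminal side and the number of TM helices, which is a simplification (helix positions are ignored), and it covers only the consensus and helix-number criteria, not the intron, ORF quality or targeting filters. The ORF names and predictions in the example are hypothetical.

```python
from typing import Dict, Tuple

Topology = Tuple[str, int]   # (N-terminal side, number of TM helices), e.g. ("in", 4)

def select_targets(predictions: Dict[str, Dict[str, Topology]],
                   min_helices: int = 2) -> Dict[str, Topology]:
    """Keep ORFs for which all methods agree and at least min_helices are predicted."""
    selected = {}
    for orf, by_method in predictions.items():
        distinct = set(by_method.values())
        if len(distinct) != 1:
            continue                      # the methods disagree
        topology = next(iter(distinct))
        if topology[1] >= min_helices:
            selected[orf] = topology
    return selected

METHODS = ["TopPred", "MEMSAT", "PHDhtm", "TMHMM", "HMMTOP"]
example = {
    "ORF_A": {m: ("in", 4) for m in METHODS},                              # consensus
    "ORF_B": {**{m: ("out", 7) for m in METHODS}, "TopPred": ("in", 6)},   # disagreement
    "ORF_C": {m: ("in", 1) for m in METHODS},                              # single-spanning
}
print(select_targets(example))
```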
For each protein a construct was made in which the full-length gene (except the stop codon) was fused to the dual reporter Suc2/His4C (Fig. 7a).
The constructs were expressed from plasmids transformed into a yeast strain
with a mutated nonfunctional his4 gene. Successful fusions that expressed
well were finally made for 39 proteins.
The experimental determination of the C-terminal location was carried out in
two parallel ways. Cells were streaked on plates lacking histidine but
supplemented with histidinol and checked for growth. Cells were also
lysed, and the membrane protein fusions were isolated
and treated with Endo H to assess the glycosylation status. If the C terminus
is located in the cytoplasm the His4C can convert histidinol to histidine and
the cells grow (Fig. 7b). At the same time there will be no change in molecular weight between Endo H-treated and untreated proteins since glycosylation of the SUC2 moiety can only take place in the ER. The contrary applies
to cases where the C terminus resides in the ER lumen. The cells cannot
grow due to lack of histidine but there will be a molecular weight difference
as a consequence of glycosylation (Fig. 7c). Since the reporter genes obviously are complementary to each other, C-terminal locations can be determined unequivocally for most proteins.
[Figure 7 graphics. (a) Fusion construct with the labels MP, HA, SUC2 and His4C. (b) C terminus in the cytoplasm: growth on -His/+Histidinol, no glycosylation. (c) C terminus in the ER lumen: no growth on -His/+Histidinol, glycosylation.]
Figure 7. C-terminal topology mapping in S. cerevisiae. (a) Schematic picture of a
construct where the membrane protein (MP) is fused to the reporters SUC2 (containing glycosylation acceptor sites, indicated by ‘V’) and His4C (with enzymatic capacity). The hemagglutinin (HA) tag allows the fused protein to be identified by
Western blotting. (b) The C terminus is located in the cytoplasm which is detected
by growth and absence of glycosylation. (c) The C terminus is located in the ER
lumen which is detected by glycosylation and no growth.
For 37 out of the 39 proteins we could assign the C termini to either the inside (cytoplasm) or the outside (ER lumen). For the remaining two proteins the fusions were not glycosylated and the cells did not grow on histidinol, so these could not be assigned.
We speculated that they are inserted into the mitochondrial inner membrane
with their C termini located in the matrix, and for one of the proteins this
was later confirmed.
The inside/outside assignments for the C-terminal ends of the proteins were
further used as constraints in TMHMMfix to produce reliable topology models. For 31 of the 37 proteins the fixed predicted topology was the same as
the initial consensus prediction. The other six shifted orientation or changed
the number of helices by 1.
The reliability scores were relatively high compared to the score distribution for yeast calculated in paper I. Notably, proteins with a large number of
TM helices had in general lower scores than those with few TM helices,
likely due to the various possible topologies that TMHMM can produce.
Our conclusion was that our strategy of combining experimental methods
and bioinformatics predictions worked out well. With a relatively limited
amount of experimental effort (compared to a full topology mapping) we
could obtain reliable topology models for 37 membrane proteins in an efficient way. Accordingly, this work shows potential to be expanded to a proteome-wide scale in yeast.
4.3 Topology models for a small number of E. coli
membrane proteins and optimization analysis of fusion
points (Paper III)
Here we applied the same strategy as in paper II but for E. coli, i.e. we determined the C-terminal location for a selection of integral membrane proteins and used the information as constraints in TMHMMfix predictions to
generate reliable topology models. Additionally, we investigated which part
of a protein is the optimal part to fix, and raised the question where a reporter protein should be fused in order to capture the most informative topological data.
Initially we examined how different placements of a topology reporter would
influence the topology prediction. Is it correct to assume that fixation of the
C terminus increases the TMHMM accuracy the most? Or are there other
regions in a protein that provide better topological information, for example
the N terminus or a loop region with low posterior probability?
We used a test set comprising 233 membrane proteins with experimentally determined topologies. First we ran unconstrained predictions for all
proteins and noted that 69% were correctly predicted. Then each protein was
scanned along the sequence and one residue at a time was fixed according to
its annotated location prior to prediction. Residues within transmembrane
segments were excluded since common topology reporters can only be used
for detection of extramembraneous locations. We focused on the initially
incorrectly predicted topologies and analyzed which fixed residues in those
proteins could convert the topologies to the correct ones. We concluded that
the C terminus is an optimal placement for the reporter protein if only one
region is to be experimentally mapped. 81% of all proteins were correctly
predicted when their respective C termini were fixed. The corresponding
number for always fixing the N terminus was 79%. If a combination of the
two was used, only a slight increase in prediction accuracy was obtained
(82%), but at a very high experimental cost.
It did not turn out to be a good idea to fix residues in loops of low posterior probabilities. In fact, many such regions were transmembrane segments,
missed by TMHMM. This situation reflects the uncertainty of a prediction.
Fixing a residue there to inside or outside would inevitably produce a
wrong topology.
Based on the positive results for fixing the C terminus, and the fact that C-terminal fusions do not affect the insertion into the membrane significantly
and are minimally disruptive of the native topologies of membrane proteins
(no truncation is needed nor any fusions into internal loops), we decided to
use the same experimental approach as in a pioneering analysis performed in
our lab (Drew et al., 2002). They successfully used a setup with the two topology reporters GFP and PhoA for localization studies in E. coli.
34 new target proteins were selected by applying the five prediction methods
(TopPred, MEMSAT, PHDhtm, TMHMM and HMMTOP) to all ORFs in
the E. coli genome and choosing the ones that had a consensus predicted N-terminal location, and that were predicted to have the same topology by the
five methods. Deviation of one predicted TM segment was however allowed.
These selection criteria provided a data set where the C-terminal determination will be conclusive for producing reliable topologies.
For each protein, two constructs on separate plasmids were made, one
where the corresponding gene was fused to GFP and another where it was
fused to PhoA. The vectors were transformed into E. coli and the expressed
fusions were analyzed.
The GFP assay was carried out by illuminating the cells with UV light
and analyzing the GFP fluorescence emission. GFP folds properly and fluoresces only in the cytoplasm, so detection of fluorescent cells suggests a
cytoplasmic location of the protein’s C terminus. In contrast, PhoA requires
a periplasmic location in order to be enzymatically active, which can be detected by adding an appropriate substrate.
For 31 of the 34 proteins, the GFP/PhoA results were completely consistent
and for those we could assign the C termini to the inside or outside. This
information was used as constraints in TMHMMfix, and experimentally
based topology models were produced. The reliability scores supported increased prediction quality as compared to unconstrained predictions. 3D
structures for two of the examined proteins were known and in one case the
predicted topology agreed with the structural data but for the other protein the
prediction missed two helices, yet the orientation was correct. However, the
reliability score for the failed one was very low.
In conclusion, we found that the strategy of expressing full-length proteins
fused to the C-terminal reporters GFP and PhoA had a high success rate and
combining the results with constrained predictions yielded trustworthy topologies. The approach thus has potential to be used on a proteome-wide
scale in E. coli.
4.4 Large-scale topology analysis of the E. coli and S.
cerevisiae membrane proteomes (Papers IV and V)
Having validated the experimental and bioinformatics approaches in E. coli
(paper III) and in S. cerevisiae (paper II), the next step was to extend the
analyses to global topology mappings of all multi-spanning membrane proteins in the two organisms.
We started with all predicted ORFs in E. coli and S. cerevisiae respectively
and defined the whole membrane proteomes by applying TMHMM. Only
proteins with 2 or more predicted transmembrane segments were selected
(for the same reason as in papers II and III). The data sets were reduced by
eliminating ORFs that were too short, previously analyzed, had dubious gene
sequences or were not targeted to the secretory pathway (in yeast). This resulted in 714 putative membrane proteins in E. coli and 629 in S. cerevisiae.
Out of those, 665 and 617 were cloned and expressed successfully. The C-terminal assignments could finally be made for 502 and 468 proteins respectively.
Bioinformatics was heavily used in the selection process, the design of optimal restriction enzyme combinations for the E. coli genes and the control of
the expressed gene sequences (comparison of expected nucleotide sequence
with output from sequencing analyses). The last of these steps was necessary
to discover mispredicted ORFs or cloning failures in order to not draw incorrect conclusions from the C-terminal assignments. The experimental setup
was somewhat simpler in the yeast study in that cloning was accomplished
by homologous recombination and inserted genes were verified by PCR
analysis instead of by sequencing. I will mainly describe the results in paper
V in the following sections, but comparisons to the results obtained in paper
IV will be made.
To be confident that our C-terminal assignment procedure also holds on a
large scale and to possibly rule out that the C-terminal reporter proteins affect the membrane insertion (or at least conclude that this is not likely the
case here), we did an internal validation for all the yeast proteins with assigned C-terminal locations. We performed an all-against-all BLAST search
(Altschul et al., 1990) among the proteins and retained all pairwise hits with
an E-value < 10⁻⁵ and for which the BLAST alignment reached within
15 residues of the C termini (illustrated in Fig. 8). These restrictions prevent
the appearance of an additional transmembrane segment between the end of
the alignment and the C terminus of either sequence, and homologs found in
this way can be assumed to have the same C-terminal orientation.
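The two search criteria can be written as a small filter over pairwise hits. The `Hit` record and its field names are assumptions made for this sketch (they mirror the kind of information found in tabular BLAST output) and are not the scripts used in the study. The example values correspond to the 9 and 4 unaligned residues shown in Figure 8; the sequence lengths and E-values are invented.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    evalue: float
    q_end: int    # last aligned residue in the query (1-based)
    q_len: int    # query length
    s_end: int    # last aligned residue in the hit (1-based)
    s_len: int    # hit length

def same_c_terminus_expected(hit: Hit, max_evalue: float = 1e-5,
                             max_tail: int = 15) -> bool:
    """True if the alignment is significant and reaches close to both C termini."""
    query_tail = hit.q_len - hit.q_end    # unaligned C-terminal residues, query
    hit_tail = hit.s_len - hit.s_end      # unaligned C-terminal residues, hit
    return hit.evalue < max_evalue and query_tail <= max_tail and hit_tail <= max_tail

# Tails of 9 and 4 residues as in Fig. 8 (lengths and E-value are hypothetical).
fig8_like = Hit(evalue=1e-20, q_end=291, q_len=300, s_end=316, s_len=320)
print(same_c_terminus_expected(fig8_like))
```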
query topology  ...iiihhhhhhhhhhhh---hhhhhhhhhooooooooooooooooooo ooooooooo  (9 unaligned residues)
query           ...FVLYAGFALVIGCFW---YFSPISFGMEGPSSNFRYLNWFSTWDIA .........
BLAST alignment ... V Y  ++L  GC +    F+PI  GM G  + +  L W STWDIA
hit             ...MVKYPIYSLFGGCIYIYNLFAPICQGMHGDKAEYLPLQWLSTWDIA ....
hit topology    ...iiiiihhhhhhhhhhhhhhhhhhhhhhooooooooooooooooooo oooo       (4 unaligned residues)
Figure 8. An example of the “BLAST approach”. The grey area marks the aligned
region between the query and hit protein sequences. The upper topology prediction
belongs to the query protein, the lower to the hit protein and both are constrained by
their corresponding experimentally determined C-terminal location (i; inside, h;
helix, o; outside). Here the unaligned residues in the C termini are 9 and 4 respectively.
All proteins in our data set that fulfilled our search criteria matched homologs with the same C-terminal assignment except two, Ygl263wp and
Ynr002cp. Ygl263wp is a member of the large COS (conserved sequence)
family and was assigned a lumenal C terminus (Cout) while the other eight
COS family members in our data set were assigned to have a cytoplasmic C
terminus (Cin). The Cout orientation of Ygl263wp was further confirmed by
three internal N-linked glycosylation sites in loops that all face the ER lumen
(and are thus glycosylated) when the C terminus is also located there. The
opposite orientation was also supported by the positive-inside rule, as accumulation of positively charged residues was found in different loops for
Ygl263wp compared to the other family members. The second protein of
contradicting orientation, Ynr002cp, belongs to a family of ATO (ammonia/ammonium transport outward) proteins. It also has a Cout assignment
whereas two homologous proteins in our data set have Cin assignments. This
family was however not studied further. The presence of families with opposite orientations opens up interesting interpretation possibilities of the evolution of membrane proteins and their topologies. Gene duplication followed
by divergent topology evolution or adoption of a single protein to dual topologies can explain the experimental results. These phenomena have earlier
been observed in E. coli (Sääf et al., 1999), a few were also identified in
paper IV and further analyzed in (Rapp et al., 2006).
Strengthened by the validation tests (only two proteins showed deviating
orientations between homologs whereas the rest were completely consistent)
we extended our initial data set by repeating the “BLAST-approach”, but
this time applied to the unassigned yeast proteins, and inferred the query
assignment to the hit if the search criteria were matched. We could thereby
assign a C-terminal location to another 41 proteins, increasing the total number of assigned proteins to 546 (also including the 37 from paper II).
By applying TMHMM and using the C-terminal locations as constraints we
produced topology models for the 546 proteins (Fig. 9). Perhaps the most
striking observation is that proteins with a Cin orientation are far more frequent than those with a Cout orientation (82% vs. 18% and similar proportions for the E. coli membrane proteome). Topologies with an even number of transmembrane regions dominate, and thus topologies with both N and C termini located in the cytoplasm are the most common. This might suggest that so-called “helical hairpins” (two closely spaced TM helices) are a basic building block in
co-translational insertion of MPs into the membrane. The functional categories seen in Figure 9 are taken from the Gene Ontology (GO) terms
(Ashburner et al., 2000) and correlate well with the number of transmembrane helices. For example proteins of 10 or more TM helices are mainly
involved in solute transport and proteins with few predicted TM helices are
usually of unknown function. One category that deviates from the general
Cin trend is the protein modification class (yellow in figure) that has about
50% as Cout, which makes sense since protein modification largely takes
place in the ER lumen.
[Figure 9 graphics. Histogram: x-axis, number of transmembrane helices (0 to 20); y-axis, number of proteins, with Cin bars upward and Cout bars downward. Pie chart of GO categories: Unknown 36%, Transport 32%, Other function 13%, Protein modification 7%, Lipid metabolism 5%, Vesicle-mediated transport 4%, Organelle organisation 3%.]
Figure 9. A histogram over the topology distribution for the 546 membrane proteins
with an assigned C terminus. Bars upward correspond to Cin topologies and bars
downward correspond to Cout topologies. The pie chart shows the GO annotations
for the yeast membrane proteins. The white bars inside the colored bars represent the
corresponding distribution for the E. coli proteome.
Comparison to the topology distribution in E. coli shows many similarities
but also a few obvious differences, for instance the higher fraction of transport proteins in E. coli with Nin-6TM-Cin topology and the higher fraction of
possible G protein-coupled receptors (GPCRs) with a Nout-7TM-Cin topology
in S. cerevisiae.
Since it seems rare that homologous proteins have opposite orientations, we
believe that homology-based C-terminal mapping is reliable, with a low error
rate, and we therefore wanted to expand our analysis to other eukaryotic
membrane proteomes. First we used the C-terminally assigned yeast proteins
(excluding the COS and ATO families) as queries in BLAST searches
against a database of predicted membrane proteins from 38 different fully
sequenced eukaryotic genomes. We applied the same criteria as in the
“BLAST approach” described above, except that we used a stricter E-value
cutoff of 10⁻⁶ here. All homologs for which we could infer a C-terminal assignment were used in a second BLAST run against the same database, and
altogether 13,281 eukaryotic proteins homologous to S. cerevisiae proteins were
identified to which a C-terminal orientation could be assigned (Fig. 10).
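The two-step search can be sketched as a simple propagation of assignments over a list of BLAST hits. This is a simplified illustration and not the pipeline used in the study: the hits are assumed to be available as (query, subject, E-value) tuples, only the E-value cutoff from the text is modeled (the other criteria of the “BLAST approach” are left out), and skipping subjects that receive conflicting assignments is a choice made for this sketch. All names are hypothetical.

```python
E_VALUE_CUTOFF = 1e-6  # cutoff quoted in the text


def transfer_c_terminal(seeds, hits, rounds=2, cutoff=E_VALUE_CUTOFF):
    """Propagate C-terminal locations from seed proteins to their homologs.

    seeds: dict mapping protein ID -> 'in' or 'out' (experimentally assigned).
    hits:  iterable of (query_id, subject_id, e_value) tuples.
    Returns a dict with the newly assigned homologs only.
    """
    assigned = dict(seeds)
    frontier = set(seeds)  # proteins used as queries in the current round
    for _ in range(rounds):
        candidates = {}
        for query, subject, e_value in hits:
            if query in frontier and subject not in assigned and e_value < cutoff:
                candidates.setdefault(subject, set()).add(assigned[query])
        frontier = set()
        for subject, locations in candidates.items():
            # Assumption of this sketch: ignore subjects with conflicting evidence.
            if len(locations) == 1:
                assigned[subject] = locations.pop()
                frontier.add(subject)
    return {pid: loc for pid, loc in assigned.items() if pid not in seeds}


# Hypothetical toy data: Y1 and Y2 stand in for assigned yeast proteins.
seeds = {"Y1": "in", "Y2": "out"}
hits = [
    ("Y1", "A", 1e-20),  # first-round homolog
    ("A", "B", 1e-10),   # second-round homolog, reached via A
    ("B", "D", 1e-30),   # would need a third round
    ("Y1", "C", 1e-3),   # fails the E-value cutoff
    ("Y1", "X", 1e-9),   # X gets conflicting evidence from Y1 and Y2
    ("Y2", "X", 1e-9),
]
print(transfer_c_terminal(seeds, hits))
```

Limiting the propagation to two rounds mirrors the two BLAST runs described above and keeps each transferred assignment at most two homology steps away from an experimentally determined location.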
Subsequently we used 612 E. coli proteins from paper IV that were also
C-terminally assigned in an additional two-step BLAST search against the
eukaryotic database. This generated 4,051 further homologs for which the C-terminal locations could be assigned (Fig. 10). Out of these, 2,522 overlapped with the S. cerevisiae homologs, and in all cases the C-terminal assignments agreed, supporting our assumption that homology-based C-terminal mapping is appropriate. Interestingly, eukaryotic membrane proteins homologous only to E. coli proteins often turned out to be located in
the mitochondria or the chloroplasts, which can naturally be related to the
prokaryotic origin of these organelles.
Combining the results for S. cerevisiae and E. coli, in total 14,810 eukaryotic membrane proteins were C-terminally assigned. For these proteins
we also ran constrained TMHMM predictions to produce topology models.
A similar study for bacterial membrane proteins has also recently been
performed (Granseth et al., 2005b).
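The combined total follows from the three counts given above, since the homologs found from both organisms must be counted only once. A short check of this arithmetic, with variable names chosen for illustration:

```python
yeast_derived = 13_281   # homologs assigned via S. cerevisiae proteins
ecoli_derived = 4_051    # homologs assigned via E. coli proteins
overlap = 2_522          # homologs found from both organisms

# Inclusion-exclusion: count the overlapping homologs once.
total_assigned = yeast_derived + ecoli_derived - overlap
ecoli_only = ecoli_derived - overlap

print(total_assigned)
print(ecoli_only)
```

The result reproduces the 14,810 proteins stated in the text; the difference between the E. coli-derived homologs and the overlap gives the number of eukaryotic proteins assigned from E. coli proteins alone.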
[Figure 10 graphic. Bar chart: number of homologs (y-axis, 0 to 900) per organism (x-axis, organism abbreviations grouped as animals, plants, parasites and fungi). Legend: S. cerevisiae homologs, overlapping homologs, E. coli homologs, unique S. cerevisiae homologs.]
Figure 10. Homologs in 38 eukaryotic genomes with C-terminal orientations
assigned from either the 534 S. cerevisiae proteins or the 612 previously analyzed E.
coli proteins. Dark red bars represent homologs assigned only by S. cerevisiae proteins in each organism; yellow bars represent homologs assigned only by E. coli
proteins; orange bars represent homologs assigned by both S. cerevisiae and E. coli
proteins. Dark blue bars show the number of unique S. cerevisiae proteins that have
at least one homolog in the other genome. Organism abbreviations as in paper V.
In conclusion, we have been able to generate global topology maps for the
two important model organisms E. coli and S. cerevisiae. By homology we
have further transferred the C-terminal assignments to a large number of
eukaryotic membrane proteins, and from our observations so far this transfer appears to
be reliable. The extensive C-terminal location data can be used for
benchmarking topology prediction methods, which will hopefully prove valuable in light of the limited structural and other experimental data on
membrane proteins.
5 Discussion and future perspectives
What have we learnt from our studies and what future directions should be
taken to advance the exploration of the membrane proteome?
First, the results summarized in papers I-V show that topology prediction
indeed benefits from including experimental information and that the location of the C termini of the majority of polytopic membrane
proteins in both E. coli and S. cerevisiae could be determined in an efficient
way. By looking at the distribution of membrane protein topologies in the
two organisms we can get some clues about the function of as yet unannotated
proteins.
We have also shown that it is possible to transfer C-terminal assignments
to homologs of a large number of eukaryotic membrane proteins, for which
it would be difficult to perform the same sort of experiments.
How can the results be used in the future? One obvious way is to apply
more recently developed prediction methods with higher accuracy than
TMHMM, for example PRODIV-TMHMM (Viklund and Elofsson, 2004), and
to constrain the predictions with our experimental data to produce even better
topologies.
The ability to constrain predictions is not limited to the C-terminal
region of a membrane protein. Any part for which the location is known
(inside, outside or membrane-spanning) can be constrained. The
TMHMMfix website (http://www.sbc.su.se/~melen/TMHMMfix/) allows
the user to include all known topological information for a membrane protein and thereby increase the chance of retrieving the correct prediction.
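To illustrate what constraining a prediction means in an HMM setting, the sketch below runs Viterbi decoding on a toy four-state model and forbids any state whose label contradicts a known location. It is not TMHMM or the TMHMMfix implementation: the state layout, the transition and emission probabilities, and the hydrophobic residue set are all invented for illustration.

```python
import math

# Toy model: inside loop, helix going in->out, outside loop, helix going out->in.
STATES = ["i", "Mio", "o", "Moi"]
LABEL = {"i": "i", "Mio": "M", "o": "o", "Moi": "M"}
TRANS = {  # made-up transition probabilities
    "i": {"i": 0.9, "Mio": 0.1},
    "Mio": {"Mio": 0.8, "o": 0.2},
    "o": {"o": 0.9, "Moi": 0.1},
    "Moi": {"Moi": 0.8, "i": 0.2},
}
HYDROPHOBIC = set("AILMFVWC")  # crude illustrative residue classes
NEG_INF = float("-inf")


def emission(state, residue):
    """Made-up emission probabilities: helices prefer hydrophobic residues."""
    hydrophobic = residue in HYDROPHOBIC
    if LABEL[state] == "M":
        return 0.9 if hydrophobic else 0.1
    return 0.3 if hydrophobic else 0.7


def constrained_viterbi(seq, constraints=None):
    """Most probable label string ('i', 'M', 'o' per residue).

    constraints maps a 0-based position to a required label; states with a
    different label are excluded at that position.
    """
    constraints = constraints or {}

    def allowed(pos, state):
        return pos not in constraints or LABEL[state] == constraints[pos]

    score = [{s: NEG_INF for s in STATES} for _ in seq]
    back = [{} for _ in seq]
    for s in ("i", "o"):  # a chain starts in one of the two loop states
        if allowed(0, s):
            score[0][s] = math.log(0.5) + math.log(emission(s, seq[0]))
    for t in range(1, len(seq)):
        for s in STATES:
            if not allowed(t, s):
                continue
            best_prev, best = None, NEG_INF
            for p in STATES:
                if s in TRANS[p] and score[t - 1][p] > NEG_INF:
                    cand = score[t - 1][p] + math.log(TRANS[p][s])
                    if cand > best:
                        best_prev, best = p, cand
            if best_prev is not None:
                score[t][s] = best + math.log(emission(s, seq[t]))
                back[t][s] = best_prev
    last = max(STATES, key=lambda s: score[-1][s])
    if score[-1][last] == NEG_INF:
        raise ValueError("no path satisfies the constraints")
    path = [last]
    for t in range(len(seq) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return "".join(LABEL[s] for s in reversed(path))


seq = "KKKK" + "LLLLLLLL" + "DDDD"
print(constrained_viterbi(seq, {len(seq) - 1: "o"}))
print(constrained_viterbi(seq, {len(seq) - 1: "i"}))
```

Because this toy model treats the two loop states identically, the sequence alone cannot decide the orientation; fixing the label of the last residue resolves it, which is the role played by the experimentally determined C-terminal location. The same mechanism accepts constraints at any position and for any of the three labels.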
The reliability score is useful for estimating the likelihood that a given
prediction is correct. High scores correlate with more accurate predictions,
whereas low scores identify proteins for which the predicted topologies are
more uncertain and for which a detailed experimental topology mapping
is therefore probably worth the effort.
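As a sketch of how such scores could be used in practice, the snippet below lists the proteins whose predictions fall below a chosen threshold, least reliable first. The 0 to 1 scale, the threshold of 0.5 and the protein identifiers are assumptions made for this example; they are not values taken from paper I.

```python
def rank_for_experiments(scores, threshold=0.5):
    """Return protein IDs with a reliability score below the threshold.

    scores: dict mapping protein ID -> reliability score (assumed 0 to 1).
    The least reliable predictions come first, as candidates for
    experimental topology mapping.
    """
    uncertain = [(score, pid) for pid, score in scores.items() if score < threshold]
    return [pid for score, pid in sorted(uncertain)]


# Hypothetical scores for three proteins.
example_scores = {"protA": 0.90, "protB": 0.20, "protC": 0.45}
print(rank_for_experiments(example_scores))
```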
Topology prediction can furthermore be valuable for target selection in
structural genomics, where the goal is to determine a structural representative for
all protein families, making it important to choose proteins with potentially
new folds. The selection can further take advantage of information on which
proteins express well (since large amounts of protein are a prerequisite for structure determination), which is estimated for E. coli in paper
IV and for S. cerevisiae in an accompanying study to paper V (Österberg et al.,
2006).
Another way to improve topology prediction is to make the methods specific for different organisms, or at least tune them for prokaryotic and eukaryotic proteins separately. There are obvious differences between various
organisms, for example the positive-inside rule is more pronounced in prokaryotes than in eukaryotes, the lipid composition varies among species and
extramembraneous environmental differences might affect the amino acid
distributions. However, topology prediction methods will most likely have
an upper accuracy limit considerably less than 100%. Polar and charged
residues within transmembrane helices will continue to confuse prediction
algorithms unless helix packing interactions are explicitly taken into account. The next generation of prediction algorithms will therefore need to
model the contact between the helices. New propensity scales for predicting
buried and exposed residues in transmembrane helices have already been
used to predict the spatial relationships between the helices (Adamian and
Liang, 2006; Park and Helms, 2007). Although full three-dimensional prediction trials have also been made (Yarov-Yarovoy et al., 2006) (notably
using topology prediction methods initially to define the positions of the
ends of each helix and then successively adding helices one by one and
modeling the interactions and orientations), there is still a long way to go before
reliable 3D prediction of membrane protein structure is achieved. Moreover,
a complete 3D structure contains not only the membrane-spanning helices but also the structures of the extramembraneous loop regions,
with substructures like β-strands, interface helices, and re-entrant helices, as well as larger globular protein domains.
Finally, until it becomes possible either to quickly determine the 3D
structure of a membrane protein experimentally or to predict it accurately,
there is motivation for doing topology predictions. Why? Because structural data are limited and a predicted topology can be used for classification,
can suggest possible functions and can guide the design of experiments. How?
By applying one or several prediction methods (preferably using the constraining ability if any topological information is available and preferably
also including methods that can discover signal peptides and/or other substructures) and by carefully analyzing and comparing the outputs. When? When
the aim is to obtain quick and cheap information about membrane proteins,
in particular in the context of large-scale studies.
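Comparing the outputs of several methods can be as simple as reducing each prediction to its number of TM helices and its C-terminal location and checking where the methods agree. The sketch below assumes that each prediction is available as a per-residue label string ('i' inside, 'M' membrane, 'o' outside); the method names and strings are hypothetical.

```python
import re
from collections import Counter


def summarize(topology):
    """Reduce a label string to (number of TM helices, C-terminal side)."""
    if not topology or topology[-1] not in "io":
        raise ValueError("topology must end in a loop label ('i' or 'o')")
    n_helices = len(re.findall(r"M+", topology))
    c_terminus = "in" if topology[-1] == "i" else "out"
    return n_helices, c_terminus


def compare_methods(predictions):
    """Return the most common summary, its vote count and the dissenting methods."""
    summaries = {method: summarize(t) for method, t in predictions.items()}
    consensus, votes = Counter(summaries.values()).most_common(1)[0]
    dissenters = sorted(m for m, s in summaries.items() if s != consensus)
    return consensus, votes, dissenters


# Hypothetical outputs from three prediction methods for one protein.
predictions = {
    "method_a": "iiMMMooMMMii",
    "method_b": "iiiMMMoMMMii",
    "method_c": "ooMMMiiiiiii",
}
print(compare_methods(predictions))
```

Agreement between methods does not prove that a topology is correct, but disagreement is a useful flag for proteins that deserve a closer look or an experimental constraint.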
Acknowledgements
I would like to express my sincere gratitude to all people who have supported and encouraged me and made this work possible. I am especially
grateful to the following persons:
Gunnar von Heijne, my supervisor, for sharing your immense scientific
knowledge and stimulating research. Thanks for all great assistance over the
years and for always having time when it is needed. It has been a pleasure to
be a student in your lab.
Hugh Salter, my co-supervisor at AstraZeneca, for giving me the opportunity to be part of your group. Thanks for both scientific inspiration and all
social events and your hospitality. Thanks also to Henrik, Ingela and
Kerstin for always making me feel so welcome at AstraZeneca.
Bengt Persson, Lena Lewin and Per-Erik Jansson at The PhD Programme
in Medical Bioinformatics at the Karolinska Institute for educational guidance and financial support through the Swedish Knowledge Foundation and
AstraZeneca.
Stefan Nordlund, for being a cool and relaxed senior researcher at DBB
with a contagious interest in science. Thanks also to all past and present
people at the secretariat for always being so kind and helpful.
Anders Krogh for our fruitful collaboration and for kindly letting me visit
your lab in Denmark.
Arne and Erik L (“Piff & Puff”) for creating an inspiring research environment and open atmosphere at CBR.
Joy, Marie Ö, Mikaela, and Dan, my co-authors and gurus at the lab, for
good collaboration and for teaching me all about your favorite organisms.
Thanks also for putting up with my recurrent questions on experimental details.
Erik G, Håkan, Andreas, Johan N and Lukas for interesting and valuable
discussions on topology prediction of membrane proteins and for being enthusiastic colleagues.
Sara L, my former roommate, for discussing all that is relevant in life and
for being such a good listener. Thanks for all help with linguistic issues. I do
miss you a lot!
Anna J, Per L and Diana, my present roommates, for brightening up our
office and for taking care of my flowers. Thanks also to Åsa B and Johannes for being nice friends with super positive attitudes, and to all other people at CBR for being such pleasant company.
Erik Sj, my computer-hero, for your patience and willingness to rescue me when
I am distressed.
Olivia, Olof, Bob and all other previous colleagues at SBC for making
every-day life at work so fun and for our delicious cake club. I have always
enjoyed going to work (especially on Fridays…).
Marika, my favorite lunch mate, for all your warm support and your kindhearted personality. I am also happy that we did a good job on timing our
pregnancies perfectly!
Marie, my best friend, for your endless encouragement and inspiration and
for always being so caring, and also for being so incredibly fun! Your
charming humor is catching!
Helena, Eva, Sara N and Åsa G for the unforgettable years in Uppsala.
Thanks for all great fun while studying and living together, and for our annual Christmas baking day. I already long for December!
Anna N, Tove, Anja and Christin in “bokcirkeln” for the best evening per
month. Every session is an energy booster and a laugh releaser which I can
live on for a long time!
Linda, my twin spirit, for always being close in mind despite the distance to
Paris. Without you as a dear friend I would never have been the one I am
today and I can’t in words tell how much I appreciate you.
Birgitta and Ingemar, my mother and father, for your never-ending love
and belief in me, your stability and unconditional support. You’re my best
role models in life and a perfect mix of vitality and contemplation.
Jonas and Erik, my brothers with families, for your generosity and for making life rich in many ways. I love you!
Ture and Otto, my sons and sweethearts, for making the future so exciting
and unpredictable. Thanks for bringing so much joy into my life and for
giving every day meaning.
Finally I would like to thank the most wonderful person I know, my husband
Henrik, for always standing by my side and for always making me happy.
You are my everything. I love you with all my heart.
References
Adamian, L. and Liang, J. (2006) Prediction of transmembrane helix orientation in polytopic membrane proteins. BMC Struct Biol, 6, 13.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990)
Basic local alignment search tool. J Mol Biol, 215, 403-410.
Andersson, H. and von Heijne, G. (1994) Membrane protein topology: effects of delta mu H+ on the translocation of charged residues explain
the 'positive inside' rule. EMBO J, 13, 2267-2272.
Anfinsen, C.B. (1973) Principles that govern the folding of protein chains.
Science, 181, 223-230.
Arai, M., Mitsuke, H., Ikeda, M., Xia, J.X., Kikuchi, T., Satake, M. and
Shimizu, T. (2004) ConPred II: a consensus prediction method for
obtaining transmembrane topology models with high reliability. Nucleic Acids Res, 32, W390-393.
Arechaga, I., Miroux, B., Karrasch, S., Huijbregts, R., de Kruijff, B., Runswick, M.J. and Walker, J.E. (2000) Characterisation of new intracellular membranes in Escherichia coli accompanying large scale overproduction of the b subunit of F(1)F(o) ATP synthase. FEBS Lett,
482, 215-219.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry,
J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris,
M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese,
J.C., Richardson, J.E., Ringwald, M., Rubin, G.M. and Sherlock, G.
(2000) Gene ontology: tool for the unification of biology. The Gene
Ontology Consortium. Nat Genet, 25, 25-29.
Bartlett, G.J., Todd, A.E. and Thornton, J.M. (2003) Inferring protein function from structure. Methods Biochem Anal, 44, 387-407.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig,
H., Shindyalov, I.N. and Bourne, P.E. (2000) The Protein Data
Bank. Nucleic Acids Res, 28, 235-242.
Bernsel, A. and von Heijne, G. (2005) Improved membrane protein topology
prediction by domain assignments. Protein Sci, 14, 1723-1728.
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A.,
Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I.,
Pilbout, S. and Schneider, M. (2003) The SWISS-PROT protein
knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids
Res, 31, 365-370.
Bowie, J.U. (1997) Helix packing in membrane proteins. J Mol Biol, 272,
780-789.
Bradley, P., Misura, K.M. and Baker, D. (2005) Toward high-resolution de
novo structure prediction for small proteins. Science, 309, 1868-1871.
Brenner, S.E. (2001) A tour of structural genomics. Nat Rev Genet, 2, 801-809.
Caffrey, M. (2003) Membrane protein crystallization. J Struct Biol, 142,
108-132.
Canfield, V.A. and Levenson, R. (1993) Transmembrane organization of the
Na,K-ATPase determined by epitope addition. Biochemistry, 32,
13782-13786.
Chamberlain, A.K., Faham, S., Yohannan, S. and Bowie, J.U. (2003) Construction of helix-bundle membrane proteins. Adv Protein Chem, 63,
19-46.
Chang, X.B., Hou, Y.X., Jensen, T.J. and Riordan, J.R. (1994) Mapping of
cystic fibrosis transmembrane conductance regulator membrane topology by glycosylation site insertion. J Biol Chem, 269, 18572-18575.
Chen, C.P., Kernytsky, A. and Rost, B. (2002) Transmembrane helix predictions revisited. Protein Sci, 11, 2774-2791.
Cuthbertson, J.M., Doyle, D.A. and Sansom, M.S. (2005) Transmembrane
helix prediction: a comparative evaluation and analysis. Protein Eng
Des Sel, 18, 295-308.
de Planque, M.R., Kruijtzer, J.A., Liskamp, R.M., Marsh, D., Greathouse,
D.V., Koeppe, R.E., 2nd, de Kruijff, B. and Killian, J.A. (1999) Different membrane anchoring positions of tryptophan and lysine in
synthetic transmembrane alpha-helical peptides. J Biol Chem, 274,
20839-20846.
Deak, P.M. and Wolf, D.H. (2001) Membrane topology and function of
Der3/Hrd1p as a ubiquitin-protein ligase (E3) involved in endoplasmic reticulum degradation. J Biol Chem, 276, 10663-10669.
Drew, D., Froderberg, L., Baars, L. and de Gier, J.W. (2003) Assembly and
overexpression of membrane proteins in Escherichia coli. Biochim
Biophys Acta, 1610, 3-10.
Drew, D., Sjostrand, D., Nilsson, J., Urbig, T., Chin, C.N., de Gier, J.W. and
von Heijne, G. (2002) Rapid topology mapping of Escherichia coli
inner-membrane proteins by prediction and PhoA/GFP fusion analysis. Proc Natl Acad Sci U S A, 99, 2690-2695.
Ehrmann, M., Boyd, D. and Beckwith, J. (1990) Genetic analysis of membrane protein topology by a sandwich gene fusion approach. Proc
Natl Acad Sci U S A, 87, 7574-7578.
Eilers, M., Shekar, S.C., Shieh, T., Smith, S.O. and Fleming, P.J. (2000)
Internal packing of helical membrane proteins. Proc Natl Acad Sci U
S A, 97, 5796-5801.
Engelman, D.M., Steitz, T.A. and Goldman, A. (1986) Identifying nonpolar
transbilayer helices in amino acid sequences of membrane proteins.
Annu Rev Biophys Biophys Chem, 15, 321-353.
Eyre, T.A., Partridge, L. and Thornton, J.M. (2004) Computational analysis
of alpha-helical membrane protein structure: implications for the
prediction of 3D structural models. Protein Eng Des Sel, 17, 613-624.
Feilmeier, B.J., Iseminger, G., Schroeder, D., Webber, H. and Phillips, G.J.
(2000) Green fluorescent protein functions as a reporter for protein
localization in Escherichia coli. J Bacteriol, 182, 4068-4076.
Fleishman, S.J. and Ben-Tal, N. (2006) Progress in structure prediction of
alpha-helical membrane proteins. Curr Opin Struct Biol, 16, 496-504.
Forrest, L.R., Tang, C.L. and Honig, B. (2006) On the accuracy of homology
modeling and sequence alignment methods applied to membrane
proteins. Biophys J, 91, 508-517.
Garrow, A.G., Agnew, A. and Westhead, D.R. (2005) TMB-Hunt: an amino
acid composition based method to screen proteomes for beta-barrel
transmembrane proteins. BMC Bioinformatics, 6, 56.
Goder, V., Junne, T. and Spiess, M. (2004) Sec61p contributes to signal
sequence orientation according to the positive-inside rule. Mol Biol
Cell, 15, 1470-1478.
Granseth, E., Daley, D.O., Rapp, M., Melén, K. and von Heijne, G. (2005b)
Experimentally constrained topology models for 51,208 bacterial inner membrane proteins. J Mol Biol, 352, 489-494.
Granseth, E., von Heijne, G. and Elofsson, A. (2005a) A study of the membrane-water interface region of membrane proteins. J Mol Biol, 346,
377-385.
Grant, A., Lee, D. and Orengo, C. (2004) Progress towards mapping the
universe of protein folds. Genome Biol, 5, 107.
Henderson, R., Baldwin, J.M., Ceska, T.A., Zemlin, F., Beckmann, E. and
Downing, K.H. (1990) Model for the structure of bacteriorhodopsin
based on high-resolution electron cryo-microscopy. J Mol Biol, 213,
899-929.
Henderson, R. and Unwin, P.N. (1975) Three-dimensional model of purple
membrane obtained by electron microscopy. Nature, 257, 28-32.
Hessa, T., Kim, H., Bihlmaier, K., Lundin, C., Boekel, J., Andersson, H.,
Nilsson, I., White, S.H. and von Heijne, G. (2005) Recognition of
transmembrane helices by the endoplasmic reticulum translocon.
Nature, 433, 377-381.
Hopkins, A.L. and Groom, C.R. (2002) The druggable genome. Nat Rev
Drug Discov, 1, 727-730.
Ikeda, M., Arai, M., Lao, D.M. and Shimizu, T. (2002) Transmembrane topology prediction methods: a re-assessment and improvement by a
consensus method using a dataset of experimentally-characterized
transmembrane topologies. In Silico Biol, 2, 19-33.
Jones, D.T. (2007) Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics, 23,
538-544.
Jones, D.T., Taylor, W.R. and Thornton, J.M. (1994a) A mutation data matrix for transmembrane proteins. FEBS Lett, 339, 269-275.
Jones, D.T., Taylor, W.R. and Thornton, J.M. (1994b) A model recognition
approach to the prediction of all-helical membrane protein structure
and topology. Biochemistry, 33, 3038-3049.
Kainosho, M., Torizawa, T., Iwashita, Y., Terauchi, T., Mei Ono, A. and
Guntert, P. (2006) Optimal isotope labelling for NMR protein structure determinations. Nature, 440, 52-57.
Kernytsky, A. and Rost, B. (2003) Static benchmarking of membrane helix
predictions. Nucleic Acids Res, 31, 3642-3644.
Killian, J.A. and von Heijne, G. (2000) How proteins adapt to a membrane-water interface. Trends Biochem Sci, 25, 429-434.
Kim, H., von Heijne, G. and Nilsson, I. (2005) Membrane topology of the
STT3 subunit of the oligosaccharyl transferase complex. J Biol
Chem, 280, 20261-20267.
Kimura, T., Ohnuma, M., Sawai, T. and Yamaguchi, A. (1997) Membrane
topology of the transposon 10-encoded metal-tetracycline/H+ antiporter as studied by site-directed chemical labeling. J Biol Chem,
272, 580-585.
Krogh, A. (1997) Two methods for improving performance of an HMM and
their application for gene finding. Proc Int Conf Intell Syst Mol Biol,
5, 179-186.
Krogh, A., Larsson, B., von Heijne, G. and Sonnhammer, E.L. (2001) Predicting transmembrane protein topology with a hidden Markov
model: application to complete genomes. J Mol Biol, 305, 567-580.
Kyte, J. and Doolittle, R.F. (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol, 157, 105-132.
Käll, L., Krogh, A. and Sonnhammer, E.L. (2004) A combined transmembrane topology and signal peptide prediction method. J Mol Biol,
338, 1027-1036.
Käll, L. and Sonnhammer, E.L. (2002) Reliability of transmembrane predictions in whole-genome data. FEBS Lett, 532, 415-418.
Laskowski, R.A., Watson, J.D. and Thornton, J.M. (2003) From protein
structure to biochemical function? J Struct Funct Genomics, 4, 167-177.
Lomize, M.A., Lomize, A.L., Pogozheva, I.D. and Mosberg, H.I. (2006)
OPM: orientations of proteins in membranes database. Bioinformatics, 22, 623-625.
Luirink, J., von Heijne, G., Houben, E. and de Gier, J.W. (2005) Biogenesis
of inner membrane proteins in Escherichia coli. Annu Rev Microbiol, 59, 329-355.
Manoil, C. (1991) Analysis of membrane protein topology using alkaline
phosphatase and beta-galactosidase gene fusions. Methods Cell Biol,
34, 61-75.
Manoil, C. and Beckwith, J. (1986) A genetic approach to analyzing membrane protein topology. Science, 233, 1403-1408.
Marsden, R.L., Lee, D., Maibaum, M., Yeats, C. and Orengo, C.A. (2006)
Comprehensive genome analysis of 203 genomes provides structural
genomics with new insights into protein family space. Nucleic Acids
Res, 34, 1066-1080.
Massotte, D. (2003) G protein-coupled receptor overexpression with the
baculovirus-insect cell system: a tool for structural and functional
studies. Biochim Biophys Acta, 1610, 77-89.
Monné, M., Nilsson, I., Johansson, M., Elmhed, N. and von Heijne, G.
(1998) Positively and negatively charged residues have different effects on the position in the membrane of a model transmembrane helix. J Mol Biol, 284, 1177-1183.
Murata, K., Mitsuoka, K., Hirai, T., Walz, T., Agre, P., Heymann, J.B.,
Engel, A. and Fujiyoshi, Y. (2000) Structural determinants of water
permeation through aquaporin-1. Nature, 407, 599-605.
Möller, S., Croning, M.D. and Apweiler, R. (2001) Evaluation of methods
for the prediction of membrane spanning regions. Bioinformatics,
17, 646-653.
Nilsson, J., Persson, B. and von Heijne, G. (2000) Consensus predictions of
membrane protein topology. FEBS Lett, 486, 267-269.
Nilsson, J., Persson, B. and Von Heijne, G. (2002) Prediction of partial
membrane protein topologies using a consensus approach. Protein
Sci, 11, 2974-2980.
Nilsson, J., Persson, B. and von Heijne, G. (2005) Comparative analysis of
amino acid distributions in integral membrane proteins from 107 genomes. Proteins, 60, 606-616.
Oberai, A., Ihm, Y., Kim, S. and Bowie, J.U. (2006) A limited universe of
membrane protein families and folds. Protein Sci, 15, 1723-1734.
Opella, S.J., Nevzorov, A., Mesleb, M.F. and Marassi, F.M. (2002) Structure
determination of membrane proteins by NMR spectroscopy. Biochem Cell Biol, 80, 597-604.
Park, Y. and Helms, V. (2007) On the derivation of propensity scales for
predicting exposed transmembrane residues of helical membrane
proteins. Bioinformatics, 23, 701-708.
Persson, B. and Argos, P. (1994) Prediction of transmembrane segments in
proteins utilising multiple sequence alignments. J Mol Biol, 237,
182-192.
Popot, J.L. and Engelman, D.M. (1990) Membrane protein folding and oligomerization: the two-stage model. Biochemistry, 29, 4031-4037.
Popot, J.L. and Engelman, D.M. (2000) Helical membrane protein folding,
stability, and evolution. Annu Rev Biochem, 69, 881-922.
Rapp, M., Granseth, E., Seppala, S. and von Heijne, G. (2006) Identification
and evolution of dual-topology membrane proteins. Nat Struct Mol
Biol, 13, 112-116.
Rogl, H., Kosemund, K., Kuhlbrandt, W. and Collinson, I. (1998) Refolding
of Escherichia coli produced membrane protein inclusion bodies
immobilised by nickel chelating chromatography. FEBS Lett, 432,
21-26.
Rost, B. (1996) PHD: predicting one-dimensional protein structure by profile-based neural networks. Methods Enzymol, 266, 525-539.
Rost, B., Fariselli, P. and Casadio, R. (1996) Topology prediction for helical
transmembrane proteins at 86% accuracy. Protein Sci, 5, 1704-1718.
Rost, B. and Sander, C. (1993) Improved prediction of protein secondary
structure by use of sequence profiles and neural networks. Proc Natl
Acad Sci U S A, 90, 7558-7562.
Russell, R.B. and Eggleston, D.S. (2000) New roles for structure in biology
and drug discovery. Nat Struct Biol, 7 Suppl, 928-930.
Schulz, G.E. (2000) beta-Barrel membrane proteins. Curr Opin Struct Biol,
10, 443-447.
Senes, A., Gerstein, M. and Engelman, D.M. (2000) Statistical analysis of
amino acid patterns in transmembrane helices: the GxxxG motif occurs frequently and in association with beta-branched residues at
neighboring positions. J Mol Biol, 296, 921-936.
Simon, I., Fiser, A. and Tusnady, G.E. (2001) Predicting protein conformation by statistical methods. Biochim Biophys Acta, 1549, 123-136.
Singer, S.J. and Nicolson, G.L. (1972) The fluid mosaic model of the structure of cell membranes. Science, 175, 720-731.
Skolnick, J., Fetrow, J.S. and Kolinski, A. (2000) Structural genomics and its
importance for gene function analysis. Nat Biotechnol, 18, 283-287.
Sonnhammer, E.L., von Heijne, G. and Krogh, A. (1998) A hidden Markov
model for predicting transmembrane helices in protein sequences.
Proc Int Conf Intell Syst Mol Biol, 6, 175-182.
Sääf, A., Johansson, M., Wallin, E. and von Heijne, G. (1999) Divergent
evolution of membrane protein topology: the Escherichia coli RnfA
and RnfE homologues. Proc Natl Acad Sci U S A, 96, 8540-8544.
Tate, C.G. (2001) Overexpression of mammalian integral membrane proteins
for structural studies. FEBS Lett, 504, 94-98.
Taussig, R. and Carlson, M. (1983) Nucleotide sequence of the yeast SUC2
gene for invertase. Nucleic Acids Res, 11, 1943-1954.
Thornton, J. (2001) Structural genomics takes off. Trends Biochem Sci, 26,
88-89.
Todd, A.E., Marsden, R.L., Thornton, J.M. and Orengo, C.A. (2005) Progress of structural genomics initiatives: an analysis of solved target
structures. J Mol Biol, 348, 1235-1260.
Torres, J., Stevens, T.J. and Samso, M. (2003) Membrane proteins: the 'Wild
West' of structural biology. Trends Biochem Sci, 28, 137-144.
Traxler, B., Boyd, D. and Beckwith, J. (1993) The topological analysis of
integral cytoplasmic membrane proteins. J Membr Biol, 132, 1-11.
Tusnady, G.E. and Simon, I. (1998) Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J Mol Biol, 283, 489-506.
Tusnady, G.E. and Simon, I. (2001) The HMMTOP transmembrane topology prediction server. Bioinformatics, 17, 849-850.
Ubarretxena-Belandia, I. and Engelman, D.M. (2001) Helical membrane
proteins: diversity of functions in the context of simple architecture.
Curr Opin Struct Biol, 11, 370-376.
Ulmschneider, M.B., Sansom, M.S. and Di Nola, A. (2005) Properties of
integral membrane protein structures: derivation of an implicit
membrane potential. Proteins, 59, 252-265.
Wagner, S., Bader, M.L., Drew, D. and de Gier, J.W. (2006) Rationalizing
membrane protein overexpression. Trends Biotechnol, 24, 364-371.
Walian, P., Cross, T.A. and Jap, B.K. (2004) Structural genomics of membrane proteins. Genome Biol, 5, 215.
Wallin, E. and von Heijne, G. (1998) Genome-wide analysis of integral
membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci, 7, 1029-1038.
van Geest, M. and Lolkema, J.S. (2000) Membrane topology and insertion of
membrane proteins: search for topogenic signals. Microbiol Mol
Biol Rev, 64, 13-33.
van Klompenburg, W., Nilsson, I., von Heijne, G. and de Kruijff, B. (1997)
Anionic phospholipids are determinants of membrane protein topology. EMBO J, 16, 4261-4266.
White, S.H. (2004) The progress of membrane protein structure determination. Protein Sci, 13, 1948-1949.
White, S.H. and Wimley, W.C. (1999) Membrane protein folding and stability: physical principles. Annu Rev Biophys Biomol Struct, 28, 319-365.
Viklund, H. and Elofsson, A. (2004) Best alpha-helical transmembrane protein topology predictions are achieved using hidden Markov models
and evolutionary information. Protein Sci, 13, 1908-1917.
Viklund, H., Granseth, E. and Elofsson, A. (2006) Structural classification
and prediction of reentrant regions in alpha-helical transmembrane
proteins: application to complete genomes. J Mol Biol, 361, 591-603.
Wimley, W.C. (2003) The versatile beta-barrel membrane protein. Curr
Opin Struct Biol, 13, 404-411.
Wimley, W.C. and White, S.H. (1996) Experimentally determined hydrophobicity scale for proteins at membrane interfaces. Nat Struct Biol,
3, 842-848.
Vitkup, D., Melamud, E., Moult, J. and Sander, C. (2001) Completeness in
structural genomics. Nat Struct Biol, 8, 559-566.
von Heijne, G. (1986) The distribution of positively charged residues in bacterial inner membrane proteins correlates with the trans-membrane
topology. EMBO J, 5, 3021-3027.
von Heijne, G. (1992) Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J Mol Biol, 225, 487-494.
von Heijne, G. (2006) Membrane-protein topology. Nat Rev Mol Cell Biol,
7, 909-918.
Yan, Y. and Moult, J. (2005) Protein family clustering for structural genomics. J Mol Biol, 353, 744-759.
Yarov-Yarovoy, V., Schonbrun, J. and Baker, D. (2006) Multipass membrane protein structure prediction using Rosetta. Proteins, 62, 1010-1025.
Yau, W.M., Wimley, W.C., Gawrisch, K. and White, S.H. (1998) The preference of tryptophan for membrane interfaces. Biochemistry, 37,
14713-14718.
Yee, A.A., Savchenko, A., Ignachenko, A., Lukin, J., Xu, X., Skarina, T.,
Evdokimova, E., Liu, C.S., Semesi, A., Guido, V., Edwards, A.M.
and Arrowsmith, C.H. (2005) NMR and X-ray crystallography,
complementary tools in structural proteomics of small proteins. J
Am Chem Soc, 127, 16512-16517.
Yu, H. (1999) Extending the size limit of protein nuclear magnetic resonance. Proc Natl Acad Sci U S A, 96, 332-334.
Österberg, M., Kim, H., Warringer, J., Melén, K., Blomberg, A. and von
Heijne, G. (2006) Phenotypic effects of membrane protein overexpression in Saccharomyces cerevisiae. Proc Natl Acad Sci U S A,
103, 11148-11153.