Download Structure, prediction, evolution and genome wide studies of membrane proteins

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Point mutation wikipedia , lookup

Metalloprotein wikipedia , lookup

Paracrine signalling wikipedia , lookup

Ancestral sequence reconstruction wikipedia , lookup

Gene expression wikipedia , lookup

Metabolism wikipedia , lookup

Expression vector wikipedia , lookup

SR protein wikipedia , lookup

Oxidative phosphorylation wikipedia , lookup

Biochemistry wikipedia , lookup

G protein–coupled receptor wikipedia , lookup

Interactome wikipedia , lookup

QPNC-PAGE wikipedia , lookup

Signal transduction wikipedia , lookup

Homology modeling wikipedia , lookup

Magnesium transporter wikipedia , lookup

SNARE (protein) wikipedia , lookup

Protein wikipedia , lookup

Protein purification wikipedia , lookup

Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup

Thylakoid wikipedia , lookup

Two-hybrid screening wikipedia , lookup

Protein–protein interaction wikipedia , lookup

Proteolysis wikipedia , lookup

Western blot wikipedia , lookup

Transcript
Structure, prediction, evolution and
genome wide studies of membrane
proteins
Erik Granseth
Stockholm University
©Erik Granseth, Stockholm 2007
ISBN 978-91-7155-489-5
Printed in Sweden by US-AB, Stockholm 2007
Distributor: Stockholm University Library
Hello World
List of publications
Publications included in this thesis
I
Daniel O. Daley*, Mikaela Rapp*, Erik Granseth, Karin Melén,
David Drew and Gunnar von Heijne
Global topology analysis of the Escherichia coli inner membrane
proteome
Science (2005) 308, 1321-1323
II
Erik Granseth, Daniel O. Daley, Mikaela Rapp, Karin Melén and
Gunnar von Heijne
Experimentally constrained topology models for 51,208 bacterial
inner membrane proteins
Journal of Molecular Biology (2005) 352,489-494
III
Erik Granseth, Gunnar von Heijne and Arne Elofsson
A study of the membrane-water interface region of membrane
proteins
Journal of Molecular Biology (2005) 346, 377-385
IV
Håkan Viklund, Erik Granseth and Arne Elofsson
Structural classification and prediction of reentrant regions in αhelical transmembrane proteins: Application to complete genomes
Journal of Molecular Biology (2006) 361, 591-603
V
Erik Granseth, Håkan Viklund and Arne Elofsson
ZPRED: Predicting the distance to the membrane center for residues in α-helical membrane proteins
Bioinformatics (2006) 22, e191-e196
VI
Mikaela Rapp*, Erik Granseth*, Susanna Seppälä and Gunnar
von Heijne
Identification and evolution of dual-topology membrane proteins
Nature Structural & Molecular Biology (2006) 13, 112-116
Reprints were made with permission from the publishers
Publications not included in this thesis:
• Mikaela Rapp*, Susanna Seppälä*, Erik Granseth and Gunnar von Heijne
Emulating membrane protein evolution by rational design
Science (2007) 315, 1282-1284
• Erik Granseth, Susanna Seppälä, Mikaela Rapp, Daniel O. Daley and
Gunnar von Heijne
Membrane protein structural biology — How far can the bugs take us?
(review)
Molecular Membrane Biology (2007) 24, 1-5
• Costas Papaloukas, Erik Granseth, Håkan Viklund and Arne Elofsson
Estimating the length of transmembrane helices using Z-coordinate predictions
Submitted
*These authors contributed equally
Abstract
α-helical membrane proteins constitute 20-30% of all proteins in a cell and
are involved in many essential cellular functions. The structure is only
known for a few hundred of them, which makes structural models important.
The most common structural model of a membrane protein is the topology
which is a two-dimensional representation of the structure.
This thesis is focused on three different aspects of membrane protein
structure: improving structural predictions of membrane proteins, improving
the level of detail of structural models and the concept of dual topology.
It is possible to improve topology models of membrane proteins by including experimental information in computer predictions. This was first
performed in Escherichia coli and, by using homology, it was possible to
extend the results to 225 prokaryotic organisms. The improved models covered ~80% of the membrane proteins in E. coli and ~30% of other prokaryotic organisms.
However, the traditional topology concept is sometimes too simple for
complex membrane protein structures, which create a need for more detailed
structural models. We created two new machine learning methods, one that
predicts more structural features of membrane proteins and one that predicts
the distance to the membrane centre for the amino acids. These methods
improve the level of detail of the structural models.
The final topic of this thesis is dual topology and membrane protein evolution. We have studied a class of membrane proteins that are suggested to
insert either way into the membrane, i.e. have a dual topology. These protein
families might explain the frequent occurrence of internal symmetry in
membrane protein structures.
Contents
Introduction ...................................................................................................10
Background ...................................................................................................12
From sequence to structure to function......................................................................... 12
Computational biology and bioinformatics .................................................................... 13
Membranes....................................................................................................................13
Membrane proteins ....................................................................................................... 14
α-helical membrane proteins ................................................................................... 15
β-barrel membrane proteins .................................................................................... 15
Monotopic and peripheral membrane proteins ........................................................ 16
Improving the accuracy of topology prediction of membrane proteins .........17
Topology – a structural model of a membrane protein ................................................. 17
Topology mapping ................................................................................................... 19
Combining experimental data with topology prediction ........................................... 20
Increasing coverage with homology.............................................................................. 22
Improving the level of detail of the structural models of membrane proteins23
Structural features of membrane proteins..................................................................... 23
The transmembrane helices .................................................................................... 24
The positive inside rule ............................................................................................ 24
Aromatic residues .................................................................................................... 25
Snorkelling ............................................................................................................... 25
Loop lengths............................................................................................................. 25
Interface helices ....................................................................................................... 25
Re-entrant regions ................................................................................................... 26
Problems with topology and structural features....................................................... 27
The Z-coordinate ........................................................................................................... 27
Dual topology ................................................................................................29
Evolution of membrane proteins.................................................................................... 29
Families and folds.......................................................................................................... 31
Summary of papers.......................................................................................33
Improving the accuracy of topology models.................................................................. 33
Improving the detail of structural predictions................................................................. 35
Dual topology................................................................................................................. 39
Final thoughts................................................................................................41
Acknowledgements .......................................................................................42
References....................................................................................................44
Abbreviations
2D
3D
ATP
BLAST
DNA
GFP
HMM
PhoA
PSIBLAST
PSSM
SMR
TM
TMH
TMHMM
Å
Two-dimensional
Three-dimensional
Adenosine triphosphate
Basic local alignment search tool
Deoxyribonucleic acid
Green fluorescent protein
Hidden Markov model
Alkaline phosphatase
Position-specific iterative BLAST
Position specific scoring matrix
Small multidrug resistance
Transmembrane
Transmembrane helix
Transmembrane hidden Markov model
Ångström
Introduction
4 billion years ago, the first living cell appeared. Mother earth had been
waiting patiently for at least 10 billion years for this to happen and ever
since that day, nothing has been quite the same. The cells replicated and
replicated and finally a group of cells wrote this thesis.
In order to have a functional, living cell some of the basic things that are
needed are:
• Some way of converting and utilizing energy, for instance chemical energy or the rays from the sun.
• Self-replication and self-assembly. The first cells might have used RNA
as the information carrier and as molecular machines. Nowadays, modern
cells have DNA as the information carrier and RNA as an intermediate
step towards proteins but sometimes it also functions as molecular machines.
• One less obvious thing is that molecules need to be located at the right
place at the right time. One thing that aids this is compartmentalisation,
when molecules are kept separated by different kinds of barriers. For instance, the DNA molecule of a cell needs to be located inside the cell, if it
is outside, it will rapidly get degraded.
My research has been focused on the proteins that are located in the membrane that surrounds cells and compartments. These proteins are responsible
for transport through the membrane, they are involved in cell-to-cell signalling, they are crucial for generating and converting chemical and sun rays
into useful energy.
In order to be able to understand these processes in detail, a 3D structure
of the protein is needed. Unfortunately, structure determination of membrane
proteins is very time-consuming and difficult so there are considerably fewer
structures available than for water-soluble proteins. There are millions of
proteins with known amino acid sequence and tens of thousands of proteins
with known structure but only a few hundred of these structures are membrane proteins. Therefore, theoretical approaches that give structural insight
from the amino acid sequence hold great potential since computing time is
very cheap. However, since computational predictions are just predictions, it
is also important to validate the theoretical results with practical experiments.
10
This thesis consists of three parts. The first part describes how it is possible
to create better structural models of membrane proteins by including experimental information to computer predictions. In the second part, a foundation is laid by examining the currently available membrane protein structures. Based on the results from this study it was possible to create new machine learning methods that predicted novel structural features of membrane
proteins. The third part consists of studies of a particular class of membrane
proteins that behaved oddly, but might hold important clues about the evolution of membrane proteins.
11
Background
From sequence to structure to function
The revolutionary discovery of the molecular structure of DNA by Watson
and Crick 1953 made a tremendous impact on science (Watson and Crick,
1953). Never before had it been possible to study and understand the molecular details of how genetic information was stored. It could be seen that
the DNA double helix was stabilized by hydrogen bonds between Adenine
(A) – Thymine (T) and Guanine (G) – Cytosine (C). The problem was that
no one knew how the DNA sequences were translated into proteins.
The genetic code was finally broken by the Nirenberg laboratory in 196166 (Nirenberg, 2004). It was discovered that each DNA nucleotide triplet
stands for one of the 20 different amino acids that are the building blocks of
peptides and proteins. These events lead to the birth of molecular biology.
There are two main events involved in going from DNA to protein. The
first step is the transcription, copying, of DNA into complementary RNA by
the enzyme RNA polymerase. This enzyme is essential for life and present in
all organisms and also in some viruses. The second step is the translation of
the mRNA into an amino acid chain that forms the protein. This step is performed by the ribosome, which translates the genetic code to an amino acid
chain.
One of the remaining unsolved problems of molecular biology is how the
linear amino acid chain folds into a three-dimensional protein. All information about how a protein structure will look like is present in its amino acid
sequence (Anfinsen, 1973).
A water-soluble protein folds within fractions of a second and that makes
it impossible to explore all different conformations that exist for the amino
acids. The Levinthal paradox states that due to the large amount of degrees
of freedom in the protein chain the molecule has an astronomical number of
possible conformations (an estimate is 10143) (Levinthal, 1968). If the protein
would have to explore all conformations it would take longer time than the
universe has existed. Thus, there exist folding pathways, yet the details of
them remain unclear.
12
The key to a detailed understanding of protein function lies in its structure. A
high resolution structure is needed to be able to study the protein function at
a molecular level. The most common method to determine the structure is to
use X-ray crystallography.
Computational biology and bioinformatics
The amount of available genetic information continues to increase at a
steady pace. Ever since GenBank, a sequence database, started 1982 it has
doubled its size every 18 months. The latest release (v160) contains
77,248,690,945 bases. To put this enormous number into perspective, the
number of bases that was sequenced from April to June 2007 was 42,880
each second. The DNA sequencing technology also gets faster and faster,
e.g. a 454 gene sequencer can sequence 25 million bases with 99% accuracy
in 4 hours, which makes it possible to sequence a bacterial genome during
the course of a day (Margulies et al., 2005).
There is clearly a need for methods that can make sense out of all this vast
amount of genetic information and this is where bioinformatics and computational biology steps in. The difference in definition between the two terms
is that computational biology refers to hypothesis-driven investigations of
biological problems using computers while bioinformatics is more technique-driven and concerns development of algorithms and computational
and statistical techniques. Many times the two terms are used interchangeably. Bioinformatical methods that are used for the research in this thesis are
described in gray.
The unifying topic for all the papers included in this thesis is that they all are
about membrane proteins. I will continue with a brief overview about the
environment that membrane proteins reside in, the membrane.
Membranes
One of the things that is needed for life to exist is the membrane. Membranes
are fluid, semipermeable barriers separating the interior of the cell from the
exterior. The basic building block of a typical biological membrane is a
phospholipid molecule (Figure 1A). It consists of a hydrophilic headgroup
which is often slightly negatively charged, and several long, carbon rich fatty
acid chains which are nonpolar. This dual nature of lipids, amphilicity,
makes them aggregate and spontaneously form a membrane or a micelle
when placed in water. The membranes are self-sealing if they get broken.
13
Figure 1. A. An example of a single lipid molecule. The two long, fatty acid hydrocarbon chains are green lines and the atoms belonging to the polar headgroup are
depicted as balls. Carbon is colored green, oxygen red, phosphorus orange and oxygen blue. B. A snapshot of a computer simulation of a part of a fluid membrane. The
length in Ångström is shown to the right. The colors are the same as in A.
The membrane is a bilayer, where the two outer surfaces consist of the polar
headgroups and the interior of nonpolar, hydrophobic fatty acids (Figure
1B). This is the simplest solution to avoid having water molecules close to
the fatty acids since they are unable to hydrogen bond to water. The length
of the fatty acids determines the width of the membrane, most are around 5080 Å thick. In biological membranes there are many different kinds of lipids
and the lipid composition gives different membranes different characteristics
such as curvature, rigidity, fluidity etc.
The membrane is capable of upholding an ion gradient and this is very
important, since the ion gradient is a form of potential energy that can be
used by the cell. For instance, in mitochondria and chloroplasts, a proton
gradient is responsible for the chemiosmotic potential, the proton motive
force that is used for the synthesis of ATP.
Membrane proteins
Inside the membrane there are membrane proteins with many diverse functions. They are responsible for selecting which substances allowed to pass
through the membrane, they are important for transmitting signals, cell-tocell communication, and numerous other functions. More than half of all
14
drugs are targeting membrane proteins (Russell and Eggleston, 2000; Hopkins and Groom, 2002).
An example of a membrane protein structure, archaerhodopsin from a
halobacterium, is shown in Figure 2A. Halobacteria can be found in extreme
salinity environments such as the Dead Sea or the Great Salt Lake. The protein gives the archaeabacteria a reddish colour and is responsible for creating
chemical energy from light.
The amount of membrane proteins inside the lipid bilayer vary from organism to organism, from cell type to cell type. For instance, membrane
proteins make up 75% of the mass of the membrane in E. coli while only
30% for human myelin sheath cells.
There are three main classes of membrane proteins:
α-helical membrane proteins
α-helical membrane proteins have their amino acids arranged in tightly
packed helices in the transmembrane regions (Figure 2A). The reason behind
the α-helical arrangement is to have as large hydrophobic outer surface area
as possible that can interact with the hydrophobic fatty acids of the lipids. At
the same time, it also satisfies all its internal backbone hydrogen bonds.
α-helical membrane proteins typically make up 20-30% of the encoded
proteins of an organism (Krogh et al., 2001). Most often they are located in
the inner membrane but recently one example, a polysaccharide transporter,
has been found that resides in the outer membrane of E. coli (Dong et al.,
2006).
β-barrel membrane proteins
The other possibility of spanning the membrane while satisfying internal
backbone hydrogen bonds and maintaining a large hydrophobic surface area
is the β-barrel fold (Figure 2B). Since a β-strand always hydrogen bonds to
another β-strand, these proteins consist of β-hairpins through the membrane.
Thus, this class always has an even number of β-strands. Every second residue in the transmembrane region is facing the inside of the barrel whereas
the other is facing the outside, towards the lipids. This alternating pattern is
reflected in the amino acid sequence where one amino acid is non-polar and
hydrophobic (and directed towards the lipids) and the next one more polar
(and directed towards the interior of the barrel) etc. Since a β-strand is extended compared to the coiled α-helix, the transmembrane regions are
shorter and typically consist of 8 to 15 amino acids (Wimley, 2002). The βbarrel membrane proteins can be found in the outer membrane of bacteria,
mitochondria and chloroplasts.
Around 2-3% of a genome encodes β-barrel membrane proteins (Garrow
et al., 2005) but that is an uncertain number given the difficulties of identify15
ing them. The main reason for this is that β-strands are more difficult to predict than α-helices since they consist of fewer residues and contain more
long range interactions. There are also fewer structures of β-barrel membrane proteins than α-helical.
Figure 2. Example of membrane protein structures with the membrane depicted as a
pink rectangle. A. An α-helical membrane protein with 7 transmembrane helices. B.
A β-barrel membrane protein with 8 transmembrane β-strands. C. A monotopic
membrane protein.
Monotopic and peripheral membrane proteins
Monotopic membrane proteins are only associated to one of the two leaflets
of the membrane. Prostaglandin H2 synthase-1 is one example and is shown
in Figure 2C (Picot et al., 1994). The helices that are associated to the membrane are amphiphilic (Gupta et al., 2004) and the protein is only removable
from the membrane by detergents, organic solvents or denaturants that interfere with the hydrophobic interactions.
A peripheral membrane protein is more loosely attached to the membrane,
mainly through electrostatic interactions and hydrogen bonding with the
polar headgroup of lipids or with hydrophilic domains of other integral
membrane proteins. Peripheral membrane proteins can be dissociated from
the membrane by treatment with solutions of high ionic strength or elevated
pH.
Hereafter, the term “membrane protein” will refer to α-helical membrane
proteins since they have been the focus of this thesis.
16
Improving the accuracy of topology
prediction of membrane proteins
This first part of the thesis will describe the background behind paper I and
II. Structural models of membrane proteins are usually produced by either
sequence-based prediction or time-consuming experimental approaches. The
main idea presented here is that by combining large-scale, limited biochemical experimental data with bioinformatical methods we gain better structural
predictions of membrane proteins. A correct structural model is crucial for
functional analysis of the protein.
Topology – a structural model of a membrane protein
It is extremely difficult to obtain a three dimensional structure of a membrane protein. As of 1 January 2007, membrane proteins were 0.5 % of the
Protein Data Bank (Berman et al., 2002), even though they constitute 2030% of a typical proteome. The reasons for the low number include various
experimental difficulties such as: inclusion bodies (cytoplasmic aggregates
of membrane proteins that are not inserted into the membrane), toxicity, lack
of raw material for crystallisation trials, problems with solubilisation etc
(Granseth et al., 2007).
Given the small amount of available membrane proteins, what can one do
when there is no available structure of a membrane protein? The next best
alternative is having the topology of it. The topology is like a crude twodimensional projection of a membrane protein structure. It depicts where the
transmembrane helices are situated and on which side of the membrane the
connecting loops and the N- and C-termini are located (Figure 3). This information is useful both for structural and functional classification of the
protein (von Heijne, 2006). A topology model also provides information in
order to plan site-directed mutagenesis experiments and is invaluable as a
first step towards full 3D structure prediction (Yarov-Yarovoy et al., 2006).
17
Figure 3. The structure and topology of bacteriorhodopsin with 7 transmembrane
helices (TM). The C-terminal is located in the cytoplasm and the N terminal is located in the periplasm. The protein thus has a Nout-7TM-Cin topology.
The topology of a membrane protein is determined by the amino acid sequence; all information about how many transmembrane helices and orientation is present there from the beginning. What is known is that the transmembrane helices are consecutive regions of around 25 hydrophobic amino
acids, but the lengths can vary between ~15 and ~40 residues (Granseth et
al., 2005). The loops are hydrophilic and often consist of glycines and prolines, residues that induce turns (Monne et al., 1999).
The sequence factors that determines if a helix is inserted or not into the
membrane are still not completely understood but much work has been focused on cracking the insertion code (Hessa et al., 2005). By using an in
vitro system, Hessa et al. created a biological hydrophobicity scale by measuring the insertion efficiency for each of the 20 amino acids. Isoleucine,
leucine, phenylalanine and valine promoted membrane insertion, cysteine,
methionine and alanine were intermediate and the remaining polar and
charged amino acids opposed membrane insertion. These results are in good
agreement with statistical analyses of known membrane protein structures
(Ulmschneider and Sansom, 2001; Beuming and Weinstein, 2004).
Topology prediction
The first topology prediction methods were based on the observation that
membrane protein sequences contain regions of 18-25 amino acids that are
hydrophobic. Predictions were improved in 1992 when the "positive inside"
rule was used to decide which loops that were located on the inside or the
outside of the membrane (von Heijne, 1992). In the late 1990s more advanced machine learning methods came into use, examples are TMHMM
(Sonnhammer et al., 1998), HMMTOP (Tusnady and Simon, 2001) and
PhDHTM (Rost et al., 1996). These dramatically improved the accuracy of
the predictions and lead to possibilities of rapidly screening the complete
18
genomes that were released. Typically, the earliest methods predicted the
correct topology, i.e. all transmembrane helices and the correct position of
the loops, for ~30-40% of the membrane proteins while the more advanced
methods were correct ~50-60% (Moller et al., 2001; Viklund and Elofsson,
2004). More recent topology predictors make use of evolutionary information in their predictions which increases the performance to ~70-75%
(Viklund and Elofsson, 2004; Jones, 2007).
Topology mapping
In order to have a correct topology model of a membrane protein it has to be
verified by experiments. It is relatively easy to determine the regions that are
located in the membrane by computational methods since they are more
hydrophobic than other regions. However, many times the predicted topology is wrong, for instance by missing a helix that is not very hydrophobic.
The solution to this is to experimentally determine the location of specific
amino acids, whether they are located in the membrane, the cytoplasm or the
extra-cytoplasmic space. Sometimes antibodies, biochemical agents are used
to do this (Kimura et al., 1997). However, the most common method is fusing a protein domain to a loop, a domain that only is active when it is located on one side of the membrane but not on the opposing side (Manoil,
1991; Moller et al., 2000; Ikeda et al., 2003). In paper I and VI the fusion
proteins Green Flourescent Protein (GFP) and alkaline phosphatase (PhoA)
were attached to the C terminus of membrane proteins.
PhoA
PhoA was one of the first topology mapping reporters (Manoil and
Beckwith, 1985). The protein normally catalyses hydrolysis of phosphate
groups located in the periplasmic space of bacteria. It contains an essential
cysteine disulfide bridge that is required for proper folding of the protein.
These cysteine disulfide bridges can only be formed in the periplasm, if the
protein is located in the cytoplasm the bridge will not get formed and the
protein remains inactive. In order to use PhoA as a topology reporter, it is
fused to a membrane protein and a substrate that changes colour upon hydrolysis is added. If there is a change of colour, the region in question must
be located in the periplasm, if there is no colour change, the region can be
located either in the cytoplasm or the membrane.
GFP
Green Flourescent Protein (GFP) is a protein that originally comes from the
jellyfish Aequorea victoria where it fluoresces green when exposed to blue
light. It was first cloned 1994 and immediately became very popular in molecular biology, especially for making animals green (Chalfie et al., 1994).
19
GFP is only fluorescent if it is properly folded and it can only do that in the
cytoplasm, it behaves opposite of PhoA. One advantage with GFP is that it is
possible to measure protein overexpression since only properly folded membrane proteins can fluoresce (Drew et al., 2001).
Using GFP/PhoA to aid topology determination
Ideally, for a perfect topology model, many positions in the amino acid sequence should be mapped so that all the locations of the loops are experimentally determined. This is however unfeasible since it is too timeconsuming. To speed things up, Rapp et al. attached GFP and PhoA to the C
termini of different membrane proteins. The rationale behind this was that a
combination between limited experimental information together with topology predictions leads to more accurate topology models (Rapp et al., 2004).
Since the fusion proteins are rather large, the risk is less that they interfere
with the insertion mechanism if they are placed at the C terminus compared
to other sites.
Combining experimental data with topology prediction
The topology prediction method that was used in paper I, II and VI was
TMHMM. The reason for this was that it was easily available, fast and has
good overall performance.
TMHMM
TMHMM is one of the most popular topology prediction methods. Its accuracy is reported to be 77.5% correct when predicting the full topology
(Krogh et al., 2001). However, independent benchmarks report the accuracy
to be 49-63% correct (Viklund and Elofsson, 2004; Jones, 2007). This discrepancy shows that it is important to perform independent tests, since the
performance reported in the original articles most often is too high.
TMHMM contains 7 compartments that are designed to recognize different
regions of a transmembrane protein (Figure 4). For instance, the helix core
compartment is trained to recognize the hydrophobic regions, while the inside loop compartment recognizes cytoplasmic loops. The HMM has been
trained on the currently available 3D structures and experimentally verified
topology models. Each compartment consists of states with probability distributions (emission probabilities) of the 20 different amino acids. Between
the states there are transition probabilities that control how likely it is to go
from one state to another or stay in the same state.
When TMHMM is presented with an unknown amino acid sequence it
finds the optimal way through the model for this sequence. The path through
the different states in the compartments is the predicted topology of the sequence.
20
Cytoplasmic
side
globular
Membrane
Non-cytoplasmic
side
cap
cyt.
helix core
cap
non-cyt.
short loop
non-cyt.
globular
cap
cyt.
helix core
cap
non-cyt.
long loop
non-cyt.
globular
loop
cyt.
Figure 4. The architecture of TMHMM. The boxes shows the compartments and
inside the compartments there can be one or more states. Parts of the model with the
same text are tied, i.e. their parameters are the same. The permitted transitions between the compartments are shown as arrows.
Increasing accuracy of topology models: constraining predictions
To improve the prediction accuracy of TMHMM, it is possible to include
experimental information about where certain parts of the protein are located. For instance, by attaching a flourescent probe to the C terminus of the
protein, it is possible to see if it is located in the cytoplasm. This information
can be included in the predictions by constraining which paths the HMM
chooses. TMHMM allows for choosing the probability of an amino acid to
belong to a certain class a priori. If the membrane protein has its C terminus
in the cytoplasm, TMHMM is constrained so that the prediction must end in
the cytoplasmic compartment (Figure 4). This both raises the accuracy of the
predictions and forces the predictions to fit the experimental evidence.
In paper I, the C-terminal location was determined by GFP/PhoA experiments for the majority of the membrane proteins in Escherichia coli. This
information was then used in constrained TMHMM predictions. The expected increase in accuracy of the topology models when using experimental
information about the C-terminal location was expected to increase from
~70% to ~80% (Rapp et al., 2004).
In order to have as many improved topology models as possible, we
searched for membrane proteins in 225 organisms that were similar to those
we studied in E. coli (paper II). This is called a homology search and is explained below.
21
Increasing coverage with homology
All life on earth is related to each other. Genetic information is inherited and
passed on from generation to generation but is under constant evolutionary
pressure, either by changes in the environment or by random mutations.
When comparing genetic sequences from a generation to another generation,
the changes are subtle, but given enough time, we get as diverse organisms
as we can observe today. Since it is only possible to sequence DNA from
living organisms, we can only get a snapshot of how life and organisms look
like today and we have to make educated guesses about the past.
The term homology is used to describe similarity caused by shared ancestry. For a protein sequence, this means that if you have a sequence with unknown function you can search for homologous sequences in a database such
as GenBank. If you find a sequence with high sequence similarity to your
unknown sequence, they probably have similar function. However, a search
in a sequence database is not a trivial problem due to the enormous amounts
of data stored in them. Therefore many different search methods have been
developed, either slow and accurate, or fast and less accurate.
BLAST and PSIBLAST
BLAST (Basic Local Alignment Tool) is a rapid method to find local similarities between an unknown query sequence and a database of sequences
(Altschul et al., 1990). It is possible to use both DNA and protein sequences.
It calculates the statistical significance for each of the matches in the database.
PSIBLAST (Position-Specific Iterated BLAST) uses BLAST to create a
Position Specific Scoring Matrix (PSSM) (Altschul et al., 1997). All statistically significant matches from a BLAST run are aligned in a multiple sequence alignment. For each position in the alignment, the amino acid frequencies are calculated, this is the PSSM. What PSIBLAST then does is to
use this PSSM and search with it again in the sequence database. For each
iteration, a new PSSM is calculated and used to search the database.
Since the PSSM contains more evolutionary information than the initial
query sequence does, it is possible to find more distantly related sequences
than a traditional BLAST run.
To make the improved topology models for the membrane proteins from the
225 prokaryotic organisms, we first made a sequence database with all the
proteins present in the 225 organisms. Then we searched the database with
the proteins that had experimentally verified C-terminal location (from paper
I and other sources). Proteins that were homologous at their C-terminal end
were assigned to have the same C-terminal location as the experimentally
verified proteins. This made it possible to run constrained TMHMM predictions for ~30% of the membrane proteins from the 225 organisms.
22
Improving the level of detail of the structural
models of membrane proteins
As more and more membrane protein structures become known, it is clear
that traditional topology models only provides a simplified picture of a
membrane protein. It is clearly sufficient for proteins like bacteriorhodopsin
(Figure 3) but for proteins like the glutamate transporter (Figure 5) a traditional topology model would miss much of the structural detail present in the
structure. In the glutamate transporter, some transmembrane helices are
tilted, other contains kinks, and there are also regions that go inside the
membrane and then back again, from the same side as it came from. It is
clear that new prediction methods are needed that can incorporate these
kinds of structural features.
Figure 5. The glutamate transporter with its complex topology. The blue mesh depicts a 30 Å wide membrane.
Structural features of membrane proteins
The lipid environment that membrane proteins reside in put many constraints
on their structure. As James U. Bowie so nicely cited the book “The bound
man” by Ilse Aichinger in a review (Bowie, 2005):
“’A man awakes from a coma to find himself bound by ropes. He learns to
move gracefully within these constraints and eventually becomes a circus
performer.’ Similarly, membrane proteins must perform complex signaling
and transport functions within the strict confines of a lipid bilayer. This requires elegant structural adaptations to their environment.”
23
There are many different structural features that characterize membrane proteins. Here I will first describe some general structural characteristics of
membrane proteins and then continue with the regions that challenge the
traditional view of membrane protein topology.
The transmembrane helices
The initial view was that membrane proteins simply consisted of membrane
penetrating ideal helices. But as more and more structures are solved this
view gets more and more revised. Almost 60% of the helices contain significant bends or are distorted (Yohannan et al., 2004). One particular case is
proline-induced kinks. Proline is a special amino acid since its side-chain
loops back to the amide nitrogen of the backbone. This makes it rigid and, if
placed in an α-helix, results in a loss of a backbone hydrogen bond in the
helix (Reiersen and Rees, 2001). If placed in a transmembrane helix, the
helix gets kinked. Other elements that disturb the simple picture of ideal
helices include 310- and π-helices. In an α-helix, the hydrogen bonds are 4
amino acids apart, in a 310-helix the hydrogen bond is between each 3rd residue and in a π-helix each 5th. This also causes distortions.
The positive inside rule
Apart from the hydrophobic transmembrane helices, the best known sequence characteristic of a membrane protein is the ‘positive inside’ rule. It
states that the majority of the arginines (Arg) and lysines (Lys) in the loops
are located on the cytoplasmic side of the membrane (Figure 6).
Figure 6. The side chain atoms of arginine and lysine are depicted as lines. As can
be seen from the figure, there are more arginine and lysine located in the cytoplasmic side of the membrane.
When studying known membrane protein structures almost twice as high
fraction of Arg and Lys are on the cytoplasmic side compared to the extra24
cytoplasmic can be found (Granseth et al., 2005). Genome-wide studies confirms that the bias is present in almost all organisms (Wallin and von Heijne,
1998; Nilsson et al., 2005).
Aromatic residues
Tryptophan (Trp) and tyrosine (Tyr) are most often found in the membranewater interface region located around ±15 Å from the centre of the membrane. This is often referred to as the “aromatic belt” (Wallin et al., 1997).
Experiments with artificial transmembrane helices show that Trp flanking
the transmembrane helix positions it so that the Trp are at their preferred
locations near the lipid-carbonyl region (Braun and von Heijne, 1999; de
Planque et al., 1999). However, this behaviour can not be seen for the aromatic phenylalanine.
Snorkelling
Lys and Arg have long aliphatic side chains with a positively charged amine
or guanidium group at the end. The aliphatic hydrocarbon part of the side
chains prefer to be located in hydrophobic region of the bilayer. The positively charged amine or guanidium groups, on the other hand, prefer the
more polar interface region. This behaviour is described as snorkelling
(Chamberlain et al., 2004). The same kind of behaviour can sometimes be
observed in Trp and Tyr (de Planque et al., 1999). Trp is capable of forming
hydrogen bonds with its NH group, but it also has the largest non-polar surface of all amino acid residues.
Loop lengths
The loops that connect the transmembrane helices can vary a lot in length,
but are on average 9 amino acids long. If the loops are longer than this, they
often change from having irregular structure to contain short helices that are
approximately parallel to the membrane surface. These are called interface
helices. If an interface helix is present between two transmembrane helices,
the average number of amino acids between them rises to 27 (Granseth et al.,
2005).
Interface helices
The first example of a structural region that is not modelled in traditional
topology models is the interface helix. An interface helix is a helix running
parallel with the membrane surface, almost perpendicularly to the transmembrane helices (Figure 7). The reasons for these helices are not well understood, but they might be needed for structural reasons such as positioning
25
the transmembrane helices. In photosystem I, they are thought to shield cofactors from the aqueous phase (Jordan et al., 2001) and in the MscS mechanosensitive channel they might be involved in channel gating by transferring mechanical force (Bass et al., 2003).
Interface helices are on average 9 amino acids long, similar to the average
length of helices in globular proteins. They contain a higher amount of Trp
and Tyr than other parts of the protein (Granseth et al., 2005).
Figure 7. Example of a membrane protein structure that contains both reentrant
regions (dark blue) and interface helices (green).
Re-entrant regions
The second example of a structural element that disturbs the traditional topology model is the re-entrant region. A re-entrant region is a membrane
penetrating part of the protein that goes into the membrane and then out
again, from the same side as it came from. Two typical examples from aquaporin-0 is shown in Figure 7, where the two "half-TM" helices meet each
other in the middle of the membrane (Harries et al., 2004). These are responsible for the channel gating, function as selectivity filters for water molecules and prevent protons from passing through the channel. Other examples
include the protein conducting channel SecY (Van den Berg et al., 2004), the
chloride ion channel (Dutzler et al., 2003) and the glycerol conducting channel (Fu et al., 2000).
Many times the re-entrant regions contain amino acids that are responsible for the function of the protein (Lasso et al., 2006) which has lead to increasing efforts of predicting these regions, both by us in paper V and by
Lasso et al.
26
Problems with topology and structural features
Paper V addresses the need for more structural complexity in the topology
prediction models. However, we also made a novel prediction method which
looks at the problem from a different angle. We created a method that predicts the distance to the membrane centre for all amino acids in the protein
sequence. This is sort of a 2.5 dimensional structural prediction since we get
a structurally meaningful measure. It might be useful as an intermediate step
towards full 3D structure prediction. In order to understand how this kind of
prediction works, the Z-coordinate need to be explained.
The Z-coordinate
The 3D structures of membrane proteins do not usually contain any information of how the proteins are situated in the membrane. The reason for this is
that for structure determination, the protein is taken out from the membrane
and crystallized by adding amphiphilic detergent molecules. These are most
of the time disordered and impossible to assign coordinates for. However,
sometimes a few of the lipids and/or detergent molecules can be seen in the
X-ray picture that can facilitate the determination of the positioning in the
membrane.
Figure 8. Before (left image) and after (right image) the membrane protein has been
subject to rotation and translation to a common coordinate system. The blue mesh
represents the end of the hydrophobic region at Z=±15 Å.
In order to be able to compare different membrane protein structures to each
other they need to have a common coordinate system. This leads to possibilities of statistical studies and comparisons between different structures. To be
able to do this, the atom coordinates from the original structure need to be
changed so that the structure is positioned in a theoretical membrane (Figure
8).
One simple solution is to calculate a theoretical transmembrane helix that
is the average of all transmembrane helices (Wallin et al., 1997). The struc27
ture is rotated to this theoretical vector. Then the average hydrophobicity of
the amino acids in 1 Å wide slabs perpendicular to the vector is calculated.
The coordinates for the amino acids is then moved so that the hydrophobic
maximum is in the middle of the membrane (Z=0 Å) and a positive value of
the Z-coordinate is directed to the non-cytoplasm and a negative to the cytoplasm. The Z-coordinate is thus the distance of an amino acid relative to the
membrane. More advanced methods have recently been constructed, for
instance TMDET (Tusnady et al., 2004) and OPM (Lomize et al., 2006).
TMDET is an automated algorithm that first calculates the membrane exposed surface area and then the most probable positioning of the membrane
planes is calculated by an objective function. The objective function considers hydrophobicity in 1 Å slabs together with structural factors.
The other method, Orientations of Proteins in Membranes (OPM), uses a
similar methodology (Lomize et al., 2006).
28
Dual topology
In the global topology analysis (paper I), it could be seen that most membrane proteins either had high GFP fluorescence or high PhoA activity.
However, since some proteins had both GFP and PhoA levels above background, we suggested that these aberrant proteins might have dual topology,
i.e. they can insert either way into the membrane (Figure 9). The suggested
dual topology proteins have a weak “positive-inside” bias; there is approximately the same number of Arg and Lys on the cytoplasmic side as the periplasmic.
Figure 9. A dual-topology protein inserts into the membrane in two opposite orientations. As nearly all helix-bundle membrane proteins have a higher number of lysine and arginine residues in cytoplasmic (in) than in periplasmic (out) loops (the
‘positive-inside’ rule), dual-topology proteins are expected to have very small
(Lys+Arg) biases.
Dual topology proteins are interesting because they are attractive model systems for studying membrane protein evolution (Rapp et al., 2007).
Evolution of membrane proteins
Given the vast difference in the number of transmembrane helices between
membrane proteins, how can they increase in size and complexity?
One clue to this question lies in the available membrane protein structures.
Many of the proteins have internal symmetry; if the protein is divided in two
halves then the two halves can be superimposed on each other with an almost perfect fit. Examples include Lactose permease (Abramson et al.,
2003), EmrD (Yin et al., 2006), BtuCD (Bass et al., 2003) and the chloride
ion channel (Dutzler et al., 2002).
29
Figure 10. The structure of lactose permease. If the protein is divided into two
halves (orange and green) and these are superimposed on each other, the internal
structural symmetry can be seen (RMSD 3 Å).
Figure 10 shows lactose permease before and after superimposing the two
halves. If we look at the sequence level and align the sequences of the two
halves, the sequence identity between them is around 20%. It seems that
even though the two halves share the same structure they do not have very
similar sequence. The hypothesis behind internal symmetry is that an internal gene duplication event creates two copies of the gene (Figure 11). These
two genes are then finally fused and encode a twice as large protein.
30
1
2
3
4
5
6
N
C
Gene duplication
1
2
3
4
5
+
6
N
1’
2’
3’
4’
5’
6’
N
C
C
Gene fusion
1
2
3
4
N
5
6
1’
(N’)
2’
3’
4’
5’
6’
C
Figure 11. Suggested evolutionary scenario of internal structural symmetry for a 12
TMH protein. A gene duplication event of a gene encoding a 6 TMH protein creates
two adjacent genes encoding the same protein. The two genes diverge from each
other and finally fuse into a single gene encoding a 12 TMH protein.
The evolution of transmembrane topology has also been studied by detecting
partial sequence homology in membrane proteins (Shimizu et al., 2004).
Even though the study only used predicted topologies, it could be seen that
~1% of the 38,124 membrane proteins contained internal homology. It seems
plausible that at least some membrane proteins have evolved by either partial
or full internal gene duplications.
In paper VI, we studied the identification and evolution of dual topology
membrane protein families.
Families and folds
When comparing two structures of proteins, they belong to the same fold if
they are similarly ordered in three dimensions, i.e. their secondary structures
(α-helices and β-strands) match and they have a similar topological arrangement (Murzin et al., 1995). Even if two proteins share the same fold, that
31
does not imply that they are evolutionary related. A deeper level of structural
similarity is achieved on the family level: two proteins belong to the same
family if they are evolutionary related, in which case they typically have a
pair-wise sequence identity of ≥ 30% (Murzin et al., 1995). In paper VI,
Pfam was used to scan 225 genomes for dual topology family members.
PFAM
Pfam is one of the most useful tools that exist in bioinformatics. It uses multiple sequence alignments for proteins belonging to the same family to train
HMM profiles. The HMM profiles can later be used to identify new members to the family. Given a sequence for a protein with unknown function, it
is thus possible to use Pfam to see if the sequence belongs to any of the
known families. Pfam contains 7973 families (v. 18.0) and these cover 6084% of the proteins in a proteome (Finn et al., 2006).
There are two different versions of Pfam, Pfam-A and Pfam-B. Pfam-A is
manually curated and contains well-characterised protein domain families
with high quality alignments. Pfam-B is automatically generated by clustering and aligning all sequences that do not belong to any Pfam-A family
(Sonnhammer et al., 1997). New protein families are not allowed to overlap
with existing Pfam entries. Thus, one specific region of a sequence may only
exist in one Pfam family.
32
Summary of papers
The papers in this thesis have been divided into two major topics and one
minor. The first two papers create more accurate traditional topology models
by incorporating information about where the C termini are located, first in
E. coli then in 225 completely sequenced prokaryotic organisms.
The next major topic consists of redefining the traditional view of topology. The idea came after a statistical study of known membrane protein
structures (paper III), where we could see that there was much structural
complexity besides the transmembrane helices. This was not modelled in the
traditional topology prediction methods that existed, so it was incorporated
into the topology predictor described in paper IV. In the fifth paper, we
wanted to get away from topology and create novel ways of predicting structural features. We therefore created a method that predicts the distance to the
membrane centre for the amino acids. A successful prediction of this distance is a good starting point for more detailed 3D structure predictions.
Finally, the last topic is about dual topology membrane proteins. In paper
I we could see that the GFP/PhoA activities were behaving strange for a few
of the proteins. These were further studied in paper VI, both by biochemical
experiments and by genome-wide predictions.
Improving the accuracy of topology models
Global topology analysis of the Escherichia coli inner membrane
proteome (Paper I)
This paper describes the results from the first proteome-wide membrane
protein topology mapping of an organism. Earlier results from a pilot study
(Rapp et al., 2004) revealed that the C-terminal location is the most beneficial position for making constrained TMHMM predictions. Together with
earlier results that described a methodology to use GFP/PhoA as topology
reporter (Drew et al., 2005), the first large-scale topology analysis was performed for all multispanning membrane proteins in an organism.
Out of Escherichia coli’s 4288 open reading frames, 737 were predicted
by TMHMM to have more than 2 transmembrane helices and be longer than
100 amino acids. It was possible to assign a C-terminal location for 502 of
the proteins and by using homology we could increase the coverage to 601
33
proteins. The criteria for assigning the homologous proteins was that the
BLAST alignment should be present close to the C termini of both sequences and the e-value should be lower than 10-4.
We used a recently developed TMHMM version to force the topology
prediction to be consistent with the experimental results. By doing this, 80%
of the predicted topology models were expected be correct instead of 70%
without the experimental information (Rapp et al., 2004).
The majority (~80%) of the membrane proteins had their C-terminal located in the cytoplasm and ~60% had both the N- and C-terminal inside the
cytoplasm.
The results also included a large-scale data set of overexpression potential
for membrane proteins that had their C-terminal located in the cytoplasm.
Unfortunately, no sequence characteristics that correlated with overexpression potential could be found. The reasons for the difficulties of overexpressing membrane proteins are probably caused by many interacting factors.
Experimentally constrained topology models for 51,208 bacterial inner
membrane proteins (Paper II)
We wanted to create better topology models for as many as possible of the
completely sequenced prokaryotic organisms. This was done by searching
for homologs to the 502 E. coli membrane proteins that we had assigned the
C-terminal location for. We also included 106 prokaryotic membrane proteins with known topology from other sources. The 608 proteins were used
as queries in BLAST searches against 225 bacterial genomes.
It was possible to create 51,208 much improved topology models, and
these covered ~30% of all the predicted membrane proteins in the organisms.
The topology models were assigned in a 2 step model, the first round found
30,744 homologs to the 608 proteins. The second round used these 30,744
proteins as queries and then an additional 17,111 proteins were found.
Lastly, 3353 homologous proteins were found when we relaxed the criteria
that the alignment should be all the way to the C-terminal end for both query
and subject.
Intuitively, the gammaproteobacterias, which E. coli belong to, had a larger fraction of assigned membrane proteins while archaeabacteria had the
lowest fraction.
The initial, unconstrained TMHMM topologies had the C-terminal location correctly predicted 69% of the times and it was mainly small (1-2 TMH
proteins) that were mispredicted.
34
Improving the detail of structural predictions
A study of the membrane-water interface region of membrane proteins
(Paper III)
In order to create new prediction methods, a statistical study is first often
performed to see whether there are any specific things to consider. There
have been many statistical studies of membrane proteins, but almost all of
them have focused on the transmembrane α-helices inside the hydrophobic
membrane region. This paper instead focused on the membrane-water interface region located just outside the hydrocarbon core. This region has often
been neglected, but it could immediately be seen that it contained more
structural complexity than previously believed.
The region is enriched in irregular structure (~70%) and interface helices
(~30%) running roughly parallel with the membrane surface, while β-strands
are extremely rare.
The average amino acid composition is different between the interfacial
helices, the end-parts of the transmembrane helices located in the interface
region, and the irregular loop structures.
In the membrane-water interface region, hydrophobic and aromatic residues tend to point toward the membrane and charged/polar residues tend to
point away from the membrane. The same general features could be seen in
monotopic membrane proteins suggesting that common evolutionary forces
shape all different kinds of protein surfaces exposed to the membrane-water
interface.
The interface region thus imposes different constraints on protein structure than do the central hydrocarbon core of the membrane and the surrounding aqueous phase.
Structural classification and prediction of re-entrant regions in αhelical transmembrane proteins: Application to complete genomes
(Paper IV)
The motivation behind this paper was that in paper III we could see quite a
lot of structural elements in the membrane-water interface region that current
topology prediction methods missed. These elements were located between
two transmembrane helices and were interface helices if running parallel
with the membrane or were re-entrant regions if dipping inside the membrane.
The method that was created was called TOPMOD and it uses a hidden
Markov model for its predictions. Its underlying architecture resembles the
standard TMHMM architecture but with new, additional compartments for
interface helices and re-entrant regions (Figure 12). The new compartments
had as few free parameters as possible because of the low amount of training
data. This avoids the problem of overfitting.
35
It was very difficult for TOPMOD to get any reasonable performance of
identifying interface helices, it often confused them with re-entrant regions.
It seems that the sequence signals characteristic for re-entrant regions are
stronger and more consistent.
Under the theoretical situation that the transmembrane helices are known,
TOPMOD had a sensitivity of 69% and a specificity of 72%. This situation
is however quite artificial since it is only with an completely experimentally
verified topology map that this would be possible. A more realistic situation
is when the transmembrane helices are predicted, under these circumstances
the sensitivity is 47% and the specificity 72%.
Figure 12. The architecture of the TOPMOD hidden Markov model. The arrow in
the bottom of the figure shows the Z-coordinate axis in Ångström.
The final part of the project consisted of applying TOPMOD on 3 completely sequenced genomes, E. coli, S. cerevisiae and H. sapiens. 10 % of the
membrane proteins of S. cerevisiae were predicted to have reentrant regions
whereas E. coli 16.7% and H. sapiens 15%.
36
ZPRED: Predicting the distance to the membrane center for residues in
α-helical membrane proteins (Paper V)
As more and more membrane protein structures become known, it is clear
that traditional topology information/prediction only provides a simplified
picture of a membrane protein. The structural complexity in a membrane
protein can not be modelled by conventional topology prediction algorithms.
For instance, there are instances of re-entrant regions where the amino acid
sequence dips into the membrane and out again on the same side as it originated from (red and blue in Figure 13). The transmembrane helices can also
be tilted and/or contain structurally significant irregularities (green and cyan
helices in Figure 13). If a re-entrant region is mistaken as a transmembrane
helix by a topology prediction algorithm, the topology model of the protein
will be erroneous. In order to overcome these limitations, ZPRED, a predictor of the distance from a residue to the membrane centre was created.
The right part of Figure 13 shows the distance to the membrane centre
(Z=0 Å) for the amino acids of the structure to the left. Initially, the full Zcoordinate was predicted but with little success. Therefore it was decided
that the absolute value of the Z-coordinate should be predicted instead, since
this both simplifies the problem and doubles the amount of training data for
the machine learning methods. The range was also narrowed to 5-25 Å since
this is the region where the environment of the amino acids changes the
most.
We benchmarked artificial neural networks, hidden Markov models and
combinations of both. In the end, we used the output from an HMM topology predictor combined with evolutionary profiles as input to a neural network, and this predictor was named ZPRED. It could predict this distance to
the membrane centre with an average error of 2.55 Å, and was also quite
successful at predicting the protrusion of loop regions (see Figure 14 for an
example of a prediction).
37
Figure 13. Left: The glutamate transporter homolog and the membrane border
(Z=±15 Å) depicted as a blue mesh. The reentrant regions are colored blue and red,
while transmembrane helices with structural irregularities are colored green and
cyan. Right: The distance to the membrane center (which corresponds to the absolute value of the Z-coordinate) for the amino acids of the glutamate transporter. The
coloring is the same as in the structure.
Figure 14. Example of the output of a prediction from ZPRED. Left: All the amino
acids that were predicted to be inside the membrane (Z<15 Å) are colored red, and
those that predicted to be outside the membrane are colored green. The prediction is
in well agreement with the structure except for the transmembrane helix that contains irregularities located to the left. Top right: The true topology with reentrant
regions colored green and the PRODIV-TMHMM prediction below. Top bottom:
The correct Z-coordinate from the structure corresponds to the green line whereas
the prediction is the red.
38
Dual topology
Identification and evolution of dual-topology membrane proteins
(Paper VI)
In paper I we suggested that some membrane proteins might adopt a dual
topology, i.e. they could insert either way into the membrane (Figure 9).
In this paper, we showed that the topology of 5 of the candidates from paper
I was sensitive to changes in the Lys+Arg bias (the difference in positively
charged amino acids between the cytoplasmic and periplasmic loops), that
these families either existed as pairs or singletons in the genome and that the
pairs encode two oppositely oriented proteins whereas singletons encode
dual-topology candidates.
As usual, GFP and PhoA were used to track the topological changes that
occurred when the Lys+Arg bias was altered. SugE, EmrE, CrcB, YnfA and
YdgC were highly sensitive to Lys+Arg alterations with large differences in
C-terminal locations. YdgE and YdgF were used as controls to the experiments since these are proteins with stable topologies. As expected, these
proteins were insensitive to Lys+Arg alterations.
The main reason for choosing YdgE and YdgF as controls is that they are
members of the same family as EmrE and SugE, the Small Multidrug Resistance (SMR) family. They are all ~100 amino acids long and have 4 transmembrane helices. Interestingly, YdgE inserts into the membrane with its C
terminus in the periplasm while in YdgF it is in the cytoplasm. The ydgE and
ydgF genes are also located next to each other in the chromosome. This led
us to investigate the genome arrangement for dual topology candidates in
other organisms than E. coli. The family members were located by Pfam in
174 fully sequenced bacterial genomes. Remarkably, for both the SMR and
CrcB family a consistent pattern emerged where genes either occured as
closely spaced pairs or as singletons. Closely spaced pairs encode homologous proteins with opposite topology (red bars in Figure 15) whereas singletons encode proteins with a Lys+Arg bias close to zero (green bars in Figure
15). The simplest explanation of this pattern is that that the SMR and CrcB
family proteins form antiparallel dimers (or higher oligomers) composed
either of two separately expressed and oppositely oriented homologs or of a
single dual-topology protein. The YnfA and YdgC families were only encoded by singleton genes with a Lys+Arg bias close to zero.
The last thing studied was whether other dual-topology families could be
discovered. One good candidate was found, the DUF606 family. The proteins in this family mostly consisted of 5TM proteins following the same
pattern as we could see for the SMR and CrcB families. However, we found
examples of DUF606 family members which were internally duplicated
(blue bars in Figure 15), where the first and second half of the protein had
39
opposite orientations in the membrane. The sequence identity between the
two halves was as high as 36%.
Taken together, the results might explain the existence of symmetry in the
first and second halves of membrane protein structures and a possible way of
how membrane proteins increase in size: The dual topology encoding gene is
duplicated and eventually, through re-distribution of positively charged
amino acid, the two resulting homologous proteins become fixed in opposite
orientations in the membrane. Finally, the two genes fuse into a one single
gene encoding a protein with internal symmetry.
Figure 15. Distribution of genes coding for SMR, CrcB, YnfA, YdgC and DUF606.
Green, proteins encoded by singleton genes; red, those encoded by closely paired
genes; blue, DUF606-family genes encoding internally duplicated proteins.
40
Final thoughts
Computer prediction methods hold great promise since it is both timeconsuming and expensive to determine the structure of a protein. However,
protein folding is still an unsolved problem and ab initio structure prediction
methods are only successful for small proteins. The usual bioinformatical
excuse is to say that the methods will get better in the future, because then
there will be more data available to make assumptions and build theories
from. But what happens when the new data does not fit into the old model?
This is what has happened during my 5 years as a PhD student. The more
new structures that became available, the more I realised that we did not
know much. First, re-entrant and interface helices occurred, then the dual
topology concept. Suddenly, the old fact that α-helical membrane proteins
are only located in the inner membrane gets crushed by a new protein that is
located in the outer membrane. I guess that we will be surprised over and
over again but that is what makes science so fun.
Membrane protein structure determination is still a difficult problem, but I
suspect that new technological advancements will increase the number of
structures dramatically. During the meantime, topology prediction will suffice. The more detailed prediction attempts presented in this thesis will continue to improve and maybe not replace the traditional topology prediction
methods but work as a complement to them.
41
Acknowledgements
Arne Elofsson, my supervisor, for always having time for discussions and
questions and giving me help. A more relaxed supervisor would be difficult
to find anywhere. It has been great being your student!
Gunnar von Heijne, my co-supervisor for all support and knowledge of
membrane proteins. Thank you for all assistance during the years.
My long-time room-mates Björn (I miss sharing beds at conferences), Åsa
(for our discussions, thank you for listening to my endless blabbering) and
Johannes (for always being positive and supportive) for the wonderful time
we have experienced together. Work - never been so much fun.
Olof, initially my exjobb-supervisor, later colleague and neighbour, now
friend. Thank you for all your help and fun discussions (especially political… I love election time) throughout the years.
My colleagues from work: Uppsalamaffian: kaffe-Andreas, Karin, Pär, Per
(and me, Åsa and Olof) vs Linköpingsmaffian: varpa-Diana, wii-Jenny,
Linnea, Aron, Kristoffer, rugby-Anna (and Johannes) vs the rest: dansHåkan, Olivia, Johan, Sara, Katarina, Charlotta, Anni, Yana and Fang.
It has always been wonderful to get to work and meet you!
Mikaela, Susanna and Dan – for doing such great work and our fruitful
collaboration.
SBC-people: my sparris-dealer Örjan, Ali, Samuel, Isak, Lars, Erik,
Lotta, Jens, Bengt, Bob and Karin.
Erik Lindahl for buying me lunch at Naturhistoriska.
Erik Sjölund for making everything work at work.
Move-on-up-Johan and move-on-up-Per at KTH-biotech for taking care of
me in Brazil.
CBR, DBB and SBC corridors.
42
Paul Horton, for helping me in Japan, Fujita-san for all fun we had together, Larisa for our interesting lunch discussions, Michael Gromiha,
Martin and Tomii-san.
Oskar & Anja, Mattias & Eleonor, Robban & Anna & Hanna, Erik &
Erik & Erik & Kattis, Johan, grann-Tobbe och alla andra vänner. Vad kul
vi har haft tillsammans!
Anders, Aino, Oskar, Barbro, Hugo, Jonas och Kajsa. Det är roligt att ha
fått en familj till!
Min bror och förebild Björn, min syster Elisabeth, energiknippet Markus,
mina föräldrar Asbjörn och Lena. Tack för all hjälp, allt stöd och all
uppmuntran som jag fått av er under 29 år!
Ida, min älskling för att du är du och att vi är tillsammans. Tack för all hjälp
och att du alltid finns vid min sida. Jag älskar dig!
43
References
Abramson, J., Smirnova, I., Kasho, V., Verner, G., Kaback, H.R., and Iwata, S.
2003. Structure and mechanism of the lactose permease of Escherichia coli. Science 301: 610-615.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic
local alignment search tool. J Mol Biol 215: 403-410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and
Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402.
Anfinsen, C.B. 1973. Principles that govern the folding of protein chains. Science
181: 223-230.
Bass, R.B., Locher, K.P., Borths, E., Poon, Y., Strop, P., Lee, A., and Rees, D.C.
2003. The structures of BtuCD and MscS and their implications for transporter
and channel function. FEBS Lett 555: 111-115.
Berman, H.M., Battistuz, T., Bhat, T.N., Bluhm, W.F., Bourne, P.E., Burkhardt, K.,
Feng, Z., Gilliland, G.L., Iype, L., Jain, S., et al. 2002. The Protein Data Bank.
Acta Crystallogr D Biol Crystallogr 58: 899-907.
Beuming, T., and Weinstein, H. 2004. A knowledge-based scale for the analysis and
prediction of buried and exposed faces of transmembrane domain proteins. Bioinformatics 20: 1822-1835.
Bowie, J.U. 2005. Solving the membrane protein folding problem. Nature 438: 581589.
Braun, P., and von Heijne, G. 1999. The aromatic residues Trp and Phe have different effects on the positioning of a transmembrane helix in the microsomal membrane. Biochemistry 38: 9778-9782.
Chalfie, M., Tu, Y., Euskirchen, G., Ward, W.W., and Prasher, D.C. 1994. Green
fluorescent protein as a marker for gene expression. Science 263: 802-805.
Chamberlain, A.K., Lee, Y., Kim, S., and Bowie, J.U. 2004. Snorkeling preferences
foster an amino acid composition bias in transmembrane helices. J Mol Biol
339: 471-479.
de Planque, M.R., Kruijtzer, J.A., Liskamp, R.M., Marsh, D., Greathouse, D.V.,
Koeppe, R.E., 2nd, de Kruijff, B., and Killian, J.A. 1999. Different membrane
anchoring positions of tryptophan and lysine in synthetic transmembrane alphahelical peptides. J Biol Chem 274: 20839-20846.
Dong, C., Beis, K., Nesper, J., Brunkan-Lamontagne, A.L., Clarke, B.R., Whitfield,
C., and Naismith, J.H. 2006. Wza the translocon for E. coli capsular polysaccharides defines a new class of membrane protein. Nature 444: 226-229.
Drew, D., Slotboom, D.J., Friso, G., Reda, T., Genevaux, P., Rapp, M., MeindlBeinker, N.M., Lambert, W., Lerch, M., Daley, D.O., et al. 2005. A scalable,
GFP-based pipeline for membrane protein overexpression screening and purification. Protein Sci 14: 2011-2017.
44
Drew, D.E., von Heijne, G., Nordlund, P., and de Gier, J.W. 2001. Green fluorescent
protein as an indicator to monitor membrane protein overexpression in Escherichia coli. FEBS Lett 507: 220-224.
Dutzler, R., Campbell, E.B., Cadene, M., Chait, B.T., and MacKinnon, R. 2002. Xray structure of a ClC chloride channel at 3.0 A reveals the molecular basis of
anion selectivity. Nature 415: 287-294.
Dutzler, R., Campbell, E.B., and MacKinnon, R. 2003. Gating the selectivity filter in
ClC chloride channels. Science 300: 108-112.
Finn, R.D., Mistry, J., Schuster-Bockler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., et al. 2006. Pfam:
clans, web tools and services. Nucleic Acids Res 34: D247-251.
Fu, D., Libson, A., Miercke, L.J., Weitzman, C., Nollert, P., Krucinski, J., and
Stroud, R.M. 2000. Structure of a glycerol-conducting channel and the basis for
its selectivity. Science 290: 481-486.
Garrow, A.G., Agnew, A., and Westhead, D.R. 2005. TMB-Hunt: an amino acid
composition based method to screen proteomes for beta-barrel transmembrane
proteins. BMC Bioinformatics 6: 56.
Granseth, E., Seppälä, S., Rapp, M., Daley, D.O., and von Heijne, G. 2007. Membrane protein structural biology — How far can the bugs take us? Mol Membr
Biol 24: 1-5.
Granseth, E., von Heijne, G., and Elofsson, A. 2005. A study of the membrane-water
interface region of membrane proteins. J Mol Biol 346: 377-385.
Gupta, K., Selinsky, B.S., Kaub, C.J., Katz, A.K., and Loll, P.J. 2004. The 2.0 A
resolution crystal structure of prostaglandin H2 synthase-1: structural insights
into an unusual peroxidase. J Mol Biol 335: 503-518.
Harries, W.E., Akhavan, D., Miercke, L.J., Khademi, S., and Stroud, R.M. 2004.
The channel architecture of aquaporin 0 at a 2.2-A resolution. Proc Natl Acad
Sci U S A 101: 14045-14050.
Hessa, T., Kim, H., Bihlmaier, K., Lundin, C., Boekel, J., Andersson, H., Nilsson, I.,
White, S.H., and von Heijne, G. 2005. Recognition of transmembrane helices by
the endoplasmic reticulum translocon. Nature 433: 377-381.
Hopkins, A.L., and Groom, C.R. 2002. The druggable genome. Nat Rev Drug Discov 1: 727-730.
Ikeda, M., Arai, M., Okuno, T., and Shimizu, T. 2003. TMPDB: a database of experimentally-characterized transmembrane topologies. Nucleic Acids Res 31:
406-409.
Jones, D.T. 2007. Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics 23: 538-544.
Jordan, P., Fromme, P., Witt, H.T., Klukas, O., Saenger, W., and Krauss, N. 2001.
Three-dimensional structure of cyanobacterial photosystem I at 2.5 A resolution.
Nature 411: 909-917.
Kimura, T., Ohnuma, M., Sawai, T., and Yamaguchi, A. 1997. Membrane topology
of the transposon 10-encoded metal-tetracycline/H+ antiporter as studied by
site-directed chemical labeling. J Biol Chem 272: 580-585.
Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E.L. 2001. Predicting
transmembrane topology with a hidden Markov model: application to complete
genomes. J Mol Biol 305: 567-580.
Lasso, G., Antoniw, J.F., and Mullins, J.G. 2006. A combinatorial pattern discovery
approach for the prediction of membrane dipping (re-entrant) loops. Bioinformatics 22: e290-297.
Levinthal, C. 1968. Are there pathways for protein folding? Journal de Chimie Physique et de Physico-Chimie Biologique 65: 44-45.
45
Lomize, A.L., Pogozheva, I.D., Lomize, M.A., and Mosberg, H.I. 2006. Positioning
of proteins in membranes: a computational approach. Protein Sci 15: 13181333.
Manoil, C. 1991. Analysis of membrane protein topology using alkaline phosphatase
and beta-galactosidase gene fusions. Methods Cell Biol 34: 61-75.
Manoil, C., and Beckwith, J. 1985. TnphoA: a transposon probe for protein export
signals. Proc Natl Acad Sci U S A 82: 8129-8133.
Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A.,
Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z., et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376-380.
Moller, S., Croning, M.D., and Apweiler, R. 2001. Evaluation of methods for the
prediction of membrane spanning regions. Bioinformatics 17: 646-653.
Moller, S., Kriventseva, E.V., and Apweiler, R. 2000. A collection of well characterised integral membrane proteins. Bioinformatics 16: 1159-1160.
Monne, M., Nilsson, I., Elofsson, A., and von Heijne, G. 1999. Turns in transmembrane helices: determination of the minimal length of a "helical hairpin" and
derivation of a fine-grained turn propensity scale. J Mol Biol 293: 807-814.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: a structural
classification of proteins database for the investigation of sequences and structures. J Mol Biol 247: 536-540.
Nilsson, J., Persson, B., and von Heijne, G. 2005. Comparative analysis of amino
acid distributions in integral membrane proteins from 107 genomes. Proteins
60: 606-616.
Nirenberg, M. 2004. Historical review: Deciphering the genetic code--a personal
account. Trends Biochem Sci 29: 46-54.
Picot, D., Loll, P.J., and Garavito, R.M. 1994. The X-ray crystal structure of the
membrane protein prostaglandin H2 synthase-1. Nature 367: 243-249.
Rapp, M., Drew, D., Daley, D.O., Nilsson, J., Carvalho, T., Melen, K., De Gier,
J.W., and Von Heijne, G. 2004. Experimentally based topology models for E.
coli inner membrane proteins. Protein Sci 13: 937-945.
Rapp, M., Seppala, S., Granseth, E., and von Heijne, G. 2007. Emulating membrane
protein evolution by rational design. Science 315: 1282-1284.
Reiersen, H., and Rees, A.R. 2001. The hunchback and its neighbours: proline as an
environmental modulator. Trends Biochem Sci 26: 679-684.
Rost, B., Fariselli, P., and Casadio, R. 1996. Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci 5: 1704-1718.
Russell, R.B., and Eggleston, D.S. 2000. New roles for structure in biology and drug
discovery. Nat Struct Biol 7 Suppl: 928-930.
Shimizu, T., Mitsuke, H., Noto, K., and Arai, M. 2004. Internal gene duplication in
the evolution of prokaryotic transmembrane proteins. J Mol Biol 339: 1-15.
Sonnhammer, E.L., Eddy, S.R., and Durbin, R. 1997. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 28: 405420.
Sonnhammer, E.L., von Heijne, G., and Krogh, A. 1998. A hidden Markov model
for predicting transmembrane helices in protein sequences. Proc Int Conf Intell
Syst Mol Biol 6: 175-182.
Tusnady, G.E., Dosztanyi, Z., and Simon, I. 2004. Transmembrane proteins in the
Protein Data Bank: identification and classification. Bioinformatics 20: 29642972.
Tusnady, G.E., and Simon, I. 2001. The HMMTOP transmembrane topology prediction server. Bioinformatics 17: 849-850.
46
Ulmschneider, M.B., and Sansom, M.S. 2001. Amino acid distributions in integral
membrane protein structures. Biochim Biophys Acta 1512: 1-14.
Wallin, E., Tsukihara, T., Yoshikawa, S., von Heijne, G., and Elofsson, A. 1997.
Architecture of helix bundle membrane proteins: an analysis of cytochrome c
oxidase from bovine mitochondria. Protein Sci 6: 808-815.
Wallin, E., and von Heijne, G. 1998. Genome-wide analysis of integral membrane
proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci 7:
1029-1038.
Van den Berg, B., Clemons, W.M., Jr., Collinson, I., Modis, Y., Hartmann, E., Harrison, S.C., and Rapoport, T.A. 2004. X-ray structure of a protein-conducting
channel. Nature 427: 36-44.
Watson, J.D., and Crick, F.H. 1953. Molecular structure of nucleic acids; a structure
for deoxyribose nucleic acid. Nature 171: 737-738.
Viklund, H., and Elofsson, A. 2004. Best alpha-helical transmembrane protein topology predictions are achieved using hidden Markov models and evolutionary
information. Protein Sci 13: 1908-1917.
Wimley, W.C. 2002. Toward genomic identification of beta-barrel membrane proteins: composition and architecture of known structures. Protein Sci 11: 301312.
von Heijne, G. 1992. Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J Mol Biol 225: 487-494.
von Heijne, G. 2006. Membrane-protein topology. Nat Rev Mol Cell Biol 7: 909918.
Yarov-Yarovoy, V., Schonbrun, J., and Baker, D. 2006. Multipass membrane protein structure prediction using Rosetta. Proteins 62: 1010-1025.
Yin, Y., He, X., Szewczyk, P., Nguyen, T., and Chang, G. 2006. Structure of the
multidrug transporter EmrD from Escherichia coli. Science 312: 741-744.
Yohannan, S., Faham, S., Yang, D., Whitelegge, J.P., and Bowie, J.U. 2004. The
evolution of transmembrane helix kinks and the structural diversity of G protein-coupled receptors. Proc Natl Acad Sci U S A 101: 959-963.
47