* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Structure, prediction, evolution and genome wide studies of membrane proteins
Point mutation wikipedia , lookup
Metalloprotein wikipedia , lookup
Paracrine signalling wikipedia , lookup
Ancestral sequence reconstruction wikipedia , lookup
Gene expression wikipedia , lookup
Expression vector wikipedia , lookup
Oxidative phosphorylation wikipedia , lookup
Biochemistry wikipedia , lookup
G protein–coupled receptor wikipedia , lookup
Interactome wikipedia , lookup
Signal transduction wikipedia , lookup
Homology modeling wikipedia , lookup
Magnesium transporter wikipedia , lookup
SNARE (protein) wikipedia , lookup
Protein purification wikipedia , lookup
Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup
Two-hybrid screening wikipedia , lookup
Protein–protein interaction wikipedia , lookup
Structure, prediction, evolution and genome wide studies of membrane proteins Erik Granseth Stockholm University ©Erik Granseth, Stockholm 2007 ISBN 978-91-7155-489-5 Printed in Sweden by US-AB, Stockholm 2007 Distributor: Stockholm University Library Hello World List of publications Publications included in this thesis I Daniel O. Daley*, Mikaela Rapp*, Erik Granseth, Karin Melén, David Drew and Gunnar von Heijne Global topology analysis of the Escherichia coli inner membrane proteome Science (2005) 308, 1321-1323 II Erik Granseth, Daniel O. Daley, Mikaela Rapp, Karin Melén and Gunnar von Heijne Experimentally constrained topology models for 51,208 bacterial inner membrane proteins Journal of Molecular Biology (2005) 352,489-494 III Erik Granseth, Gunnar von Heijne and Arne Elofsson A study of the membrane-water interface region of membrane proteins Journal of Molecular Biology (2005) 346, 377-385 IV Håkan Viklund, Erik Granseth and Arne Elofsson Structural classification and prediction of reentrant regions in αhelical transmembrane proteins: Application to complete genomes Journal of Molecular Biology (2006) 361, 591-603 V Erik Granseth, Håkan Viklund and Arne Elofsson ZPRED: Predicting the distance to the membrane center for residues in α-helical membrane proteins Bioinformatics (2006) 22, e191-e196 VI Mikaela Rapp*, Erik Granseth*, Susanna Seppälä and Gunnar von Heijne Identification and evolution of dual-topology membrane proteins Nature Structural & Molecular Biology (2006) 13, 112-116 Reprints were made with permission from the publishers Publications not included in this thesis: • Mikaela Rapp*, Susanna Seppälä*, Erik Granseth and Gunnar von Heijne Emulating membrane protein evolution by rational design Science (2007) 315, 1282-1284 • Erik Granseth, Susanna Seppälä, Mikaela Rapp, Daniel O. Daley and Gunnar von Heijne Membrane protein structural biology — How far can the bugs take us? (review) Molecular Membrane Biology (2007) 24, 1-5 • Costas Papaloukas, Erik Granseth, Håkan Viklund and Arne Elofsson Estimating the length of transmembrane helices using Z-coordinate predictions Submitted *These authors contributed equally Abstract α-helical membrane proteins constitute 20-30% of all proteins in a cell and are involved in many essential cellular functions. The structure is only known for a few hundred of them, which makes structural models important. The most common structural model of a membrane protein is the topology which is a two-dimensional representation of the structure. This thesis is focused on three different aspects of membrane protein structure: improving structural predictions of membrane proteins, improving the level of detail of structural models and the concept of dual topology. It is possible to improve topology models of membrane proteins by including experimental information in computer predictions. This was first performed in Escherichia coli and, by using homology, it was possible to extend the results to 225 prokaryotic organisms. The improved models covered ~80% of the membrane proteins in E. coli and ~30% of other prokaryotic organisms. However, the traditional topology concept is sometimes too simple for complex membrane protein structures, which create a need for more detailed structural models. We created two new machine learning methods, one that predicts more structural features of membrane proteins and one that predicts the distance to the membrane centre for the amino acids. These methods improve the level of detail of the structural models. The final topic of this thesis is dual topology and membrane protein evolution. We have studied a class of membrane proteins that are suggested to insert either way into the membrane, i.e. have a dual topology. These protein families might explain the frequent occurrence of internal symmetry in membrane protein structures. Contents Introduction ...................................................................................................10 Background ...................................................................................................12 From sequence to structure to function......................................................................... 12 Computational biology and bioinformatics .................................................................... 13 Membranes....................................................................................................................13 Membrane proteins ....................................................................................................... 14 α-helical membrane proteins ................................................................................... 15 β-barrel membrane proteins .................................................................................... 15 Monotopic and peripheral membrane proteins ........................................................ 16 Improving the accuracy of topology prediction of membrane proteins .........17 Topology – a structural model of a membrane protein ................................................. 17 Topology mapping ................................................................................................... 19 Combining experimental data with topology prediction ........................................... 20 Increasing coverage with homology.............................................................................. 22 Improving the level of detail of the structural models of membrane proteins23 Structural features of membrane proteins..................................................................... 23 The transmembrane helices .................................................................................... 24 The positive inside rule ............................................................................................ 24 Aromatic residues .................................................................................................... 25 Snorkelling ............................................................................................................... 25 Loop lengths............................................................................................................. 25 Interface helices ....................................................................................................... 25 Re-entrant regions ................................................................................................... 26 Problems with topology and structural features....................................................... 27 The Z-coordinate ........................................................................................................... 27 Dual topology ................................................................................................29 Evolution of membrane proteins.................................................................................... 29 Families and folds.......................................................................................................... 31 Summary of papers.......................................................................................33 Improving the accuracy of topology models.................................................................. 33 Improving the detail of structural predictions................................................................. 35 Dual topology................................................................................................................. 39 Final thoughts................................................................................................41 Acknowledgements .......................................................................................42 References....................................................................................................44 Abbreviations 2D 3D ATP BLAST DNA GFP HMM PhoA PSIBLAST PSSM SMR TM TMH TMHMM Å Two-dimensional Three-dimensional Adenosine triphosphate Basic local alignment search tool Deoxyribonucleic acid Green fluorescent protein Hidden Markov model Alkaline phosphatase Position-specific iterative BLAST Position specific scoring matrix Small multidrug resistance Transmembrane Transmembrane helix Transmembrane hidden Markov model Ångström Introduction 4 billion years ago, the first living cell appeared. Mother earth had been waiting patiently for at least 10 billion years for this to happen and ever since that day, nothing has been quite the same. The cells replicated and replicated and finally a group of cells wrote this thesis. In order to have a functional, living cell some of the basic things that are needed are: • Some way of converting and utilizing energy, for instance chemical energy or the rays from the sun. • Self-replication and self-assembly. The first cells might have used RNA as the information carrier and as molecular machines. Nowadays, modern cells have DNA as the information carrier and RNA as an intermediate step towards proteins but sometimes it also functions as molecular machines. • One less obvious thing is that molecules need to be located at the right place at the right time. One thing that aids this is compartmentalisation, when molecules are kept separated by different kinds of barriers. For instance, the DNA molecule of a cell needs to be located inside the cell, if it is outside, it will rapidly get degraded. My research has been focused on the proteins that are located in the membrane that surrounds cells and compartments. These proteins are responsible for transport through the membrane, they are involved in cell-to-cell signalling, they are crucial for generating and converting chemical and sun rays into useful energy. In order to be able to understand these processes in detail, a 3D structure of the protein is needed. Unfortunately, structure determination of membrane proteins is very time-consuming and difficult so there are considerably fewer structures available than for water-soluble proteins. There are millions of proteins with known amino acid sequence and tens of thousands of proteins with known structure but only a few hundred of these structures are membrane proteins. Therefore, theoretical approaches that give structural insight from the amino acid sequence hold great potential since computing time is very cheap. However, since computational predictions are just predictions, it is also important to validate the theoretical results with practical experiments. 10 This thesis consists of three parts. The first part describes how it is possible to create better structural models of membrane proteins by including experimental information to computer predictions. In the second part, a foundation is laid by examining the currently available membrane protein structures. Based on the results from this study it was possible to create new machine learning methods that predicted novel structural features of membrane proteins. The third part consists of studies of a particular class of membrane proteins that behaved oddly, but might hold important clues about the evolution of membrane proteins. 11 Background From sequence to structure to function The revolutionary discovery of the molecular structure of DNA by Watson and Crick 1953 made a tremendous impact on science (Watson and Crick, 1953). Never before had it been possible to study and understand the molecular details of how genetic information was stored. It could be seen that the DNA double helix was stabilized by hydrogen bonds between Adenine (A) – Thymine (T) and Guanine (G) – Cytosine (C). The problem was that no one knew how the DNA sequences were translated into proteins. The genetic code was finally broken by the Nirenberg laboratory in 196166 (Nirenberg, 2004). It was discovered that each DNA nucleotide triplet stands for one of the 20 different amino acids that are the building blocks of peptides and proteins. These events lead to the birth of molecular biology. There are two main events involved in going from DNA to protein. The first step is the transcription, copying, of DNA into complementary RNA by the enzyme RNA polymerase. This enzyme is essential for life and present in all organisms and also in some viruses. The second step is the translation of the mRNA into an amino acid chain that forms the protein. This step is performed by the ribosome, which translates the genetic code to an amino acid chain. One of the remaining unsolved problems of molecular biology is how the linear amino acid chain folds into a three-dimensional protein. All information about how a protein structure will look like is present in its amino acid sequence (Anfinsen, 1973). A water-soluble protein folds within fractions of a second and that makes it impossible to explore all different conformations that exist for the amino acids. The Levinthal paradox states that due to the large amount of degrees of freedom in the protein chain the molecule has an astronomical number of possible conformations (an estimate is 10143) (Levinthal, 1968). If the protein would have to explore all conformations it would take longer time than the universe has existed. Thus, there exist folding pathways, yet the details of them remain unclear. 12 The key to a detailed understanding of protein function lies in its structure. A high resolution structure is needed to be able to study the protein function at a molecular level. The most common method to determine the structure is to use X-ray crystallography. Computational biology and bioinformatics The amount of available genetic information continues to increase at a steady pace. Ever since GenBank, a sequence database, started 1982 it has doubled its size every 18 months. The latest release (v160) contains 77,248,690,945 bases. To put this enormous number into perspective, the number of bases that was sequenced from April to June 2007 was 42,880 each second. The DNA sequencing technology also gets faster and faster, e.g. a 454 gene sequencer can sequence 25 million bases with 99% accuracy in 4 hours, which makes it possible to sequence a bacterial genome during the course of a day (Margulies et al., 2005). There is clearly a need for methods that can make sense out of all this vast amount of genetic information and this is where bioinformatics and computational biology steps in. The difference in definition between the two terms is that computational biology refers to hypothesis-driven investigations of biological problems using computers while bioinformatics is more technique-driven and concerns development of algorithms and computational and statistical techniques. Many times the two terms are used interchangeably. Bioinformatical methods that are used for the research in this thesis are described in gray. The unifying topic for all the papers included in this thesis is that they all are about membrane proteins. I will continue with a brief overview about the environment that membrane proteins reside in, the membrane. Membranes One of the things that is needed for life to exist is the membrane. Membranes are fluid, semipermeable barriers separating the interior of the cell from the exterior. The basic building block of a typical biological membrane is a phospholipid molecule (Figure 1A). It consists of a hydrophilic headgroup which is often slightly negatively charged, and several long, carbon rich fatty acid chains which are nonpolar. This dual nature of lipids, amphilicity, makes them aggregate and spontaneously form a membrane or a micelle when placed in water. The membranes are self-sealing if they get broken. 13 Figure 1. A. An example of a single lipid molecule. The two long, fatty acid hydrocarbon chains are green lines and the atoms belonging to the polar headgroup are depicted as balls. Carbon is colored green, oxygen red, phosphorus orange and oxygen blue. B. A snapshot of a computer simulation of a part of a fluid membrane. The length in Ångström is shown to the right. The colors are the same as in A. The membrane is a bilayer, where the two outer surfaces consist of the polar headgroups and the interior of nonpolar, hydrophobic fatty acids (Figure 1B). This is the simplest solution to avoid having water molecules close to the fatty acids since they are unable to hydrogen bond to water. The length of the fatty acids determines the width of the membrane, most are around 5080 Å thick. In biological membranes there are many different kinds of lipids and the lipid composition gives different membranes different characteristics such as curvature, rigidity, fluidity etc. The membrane is capable of upholding an ion gradient and this is very important, since the ion gradient is a form of potential energy that can be used by the cell. For instance, in mitochondria and chloroplasts, a proton gradient is responsible for the chemiosmotic potential, the proton motive force that is used for the synthesis of ATP. Membrane proteins Inside the membrane there are membrane proteins with many diverse functions. They are responsible for selecting which substances allowed to pass through the membrane, they are important for transmitting signals, cell-tocell communication, and numerous other functions. More than half of all 14 drugs are targeting membrane proteins (Russell and Eggleston, 2000; Hopkins and Groom, 2002). An example of a membrane protein structure, archaerhodopsin from a halobacterium, is shown in Figure 2A. Halobacteria can be found in extreme salinity environments such as the Dead Sea or the Great Salt Lake. The protein gives the archaeabacteria a reddish colour and is responsible for creating chemical energy from light. The amount of membrane proteins inside the lipid bilayer vary from organism to organism, from cell type to cell type. For instance, membrane proteins make up 75% of the mass of the membrane in E. coli while only 30% for human myelin sheath cells. There are three main classes of membrane proteins: α-helical membrane proteins α-helical membrane proteins have their amino acids arranged in tightly packed helices in the transmembrane regions (Figure 2A). The reason behind the α-helical arrangement is to have as large hydrophobic outer surface area as possible that can interact with the hydrophobic fatty acids of the lipids. At the same time, it also satisfies all its internal backbone hydrogen bonds. α-helical membrane proteins typically make up 20-30% of the encoded proteins of an organism (Krogh et al., 2001). Most often they are located in the inner membrane but recently one example, a polysaccharide transporter, has been found that resides in the outer membrane of E. coli (Dong et al., 2006). β-barrel membrane proteins The other possibility of spanning the membrane while satisfying internal backbone hydrogen bonds and maintaining a large hydrophobic surface area is the β-barrel fold (Figure 2B). Since a β-strand always hydrogen bonds to another β-strand, these proteins consist of β-hairpins through the membrane. Thus, this class always has an even number of β-strands. Every second residue in the transmembrane region is facing the inside of the barrel whereas the other is facing the outside, towards the lipids. This alternating pattern is reflected in the amino acid sequence where one amino acid is non-polar and hydrophobic (and directed towards the lipids) and the next one more polar (and directed towards the interior of the barrel) etc. Since a β-strand is extended compared to the coiled α-helix, the transmembrane regions are shorter and typically consist of 8 to 15 amino acids (Wimley, 2002). The βbarrel membrane proteins can be found in the outer membrane of bacteria, mitochondria and chloroplasts. Around 2-3% of a genome encodes β-barrel membrane proteins (Garrow et al., 2005) but that is an uncertain number given the difficulties of identify15 ing them. The main reason for this is that β-strands are more difficult to predict than α-helices since they consist of fewer residues and contain more long range interactions. There are also fewer structures of β-barrel membrane proteins than α-helical. Figure 2. Example of membrane protein structures with the membrane depicted as a pink rectangle. A. An α-helical membrane protein with 7 transmembrane helices. B. A β-barrel membrane protein with 8 transmembrane β-strands. C. A monotopic membrane protein. Monotopic and peripheral membrane proteins Monotopic membrane proteins are only associated to one of the two leaflets of the membrane. Prostaglandin H2 synthase-1 is one example and is shown in Figure 2C (Picot et al., 1994). The helices that are associated to the membrane are amphiphilic (Gupta et al., 2004) and the protein is only removable from the membrane by detergents, organic solvents or denaturants that interfere with the hydrophobic interactions. A peripheral membrane protein is more loosely attached to the membrane, mainly through electrostatic interactions and hydrogen bonding with the polar headgroup of lipids or with hydrophilic domains of other integral membrane proteins. Peripheral membrane proteins can be dissociated from the membrane by treatment with solutions of high ionic strength or elevated pH. Hereafter, the term “membrane protein” will refer to α-helical membrane proteins since they have been the focus of this thesis. 16 Improving the accuracy of topology prediction of membrane proteins This first part of the thesis will describe the background behind paper I and II. Structural models of membrane proteins are usually produced by either sequence-based prediction or time-consuming experimental approaches. The main idea presented here is that by combining large-scale, limited biochemical experimental data with bioinformatical methods we gain better structural predictions of membrane proteins. A correct structural model is crucial for functional analysis of the protein. Topology – a structural model of a membrane protein It is extremely difficult to obtain a three dimensional structure of a membrane protein. As of 1 January 2007, membrane proteins were 0.5 % of the Protein Data Bank (Berman et al., 2002), even though they constitute 2030% of a typical proteome. The reasons for the low number include various experimental difficulties such as: inclusion bodies (cytoplasmic aggregates of membrane proteins that are not inserted into the membrane), toxicity, lack of raw material for crystallisation trials, problems with solubilisation etc (Granseth et al., 2007). Given the small amount of available membrane proteins, what can one do when there is no available structure of a membrane protein? The next best alternative is having the topology of it. The topology is like a crude twodimensional projection of a membrane protein structure. It depicts where the transmembrane helices are situated and on which side of the membrane the connecting loops and the N- and C-termini are located (Figure 3). This information is useful both for structural and functional classification of the protein (von Heijne, 2006). A topology model also provides information in order to plan site-directed mutagenesis experiments and is invaluable as a first step towards full 3D structure prediction (Yarov-Yarovoy et al., 2006). 17 Figure 3. The structure and topology of bacteriorhodopsin with 7 transmembrane helices (TM). The C-terminal is located in the cytoplasm and the N terminal is located in the periplasm. The protein thus has a Nout-7TM-Cin topology. The topology of a membrane protein is determined by the amino acid sequence; all information about how many transmembrane helices and orientation is present there from the beginning. What is known is that the transmembrane helices are consecutive regions of around 25 hydrophobic amino acids, but the lengths can vary between ~15 and ~40 residues (Granseth et al., 2005). The loops are hydrophilic and often consist of glycines and prolines, residues that induce turns (Monne et al., 1999). The sequence factors that determines if a helix is inserted or not into the membrane are still not completely understood but much work has been focused on cracking the insertion code (Hessa et al., 2005). By using an in vitro system, Hessa et al. created a biological hydrophobicity scale by measuring the insertion efficiency for each of the 20 amino acids. Isoleucine, leucine, phenylalanine and valine promoted membrane insertion, cysteine, methionine and alanine were intermediate and the remaining polar and charged amino acids opposed membrane insertion. These results are in good agreement with statistical analyses of known membrane protein structures (Ulmschneider and Sansom, 2001; Beuming and Weinstein, 2004). Topology prediction The first topology prediction methods were based on the observation that membrane protein sequences contain regions of 18-25 amino acids that are hydrophobic. Predictions were improved in 1992 when the "positive inside" rule was used to decide which loops that were located on the inside or the outside of the membrane (von Heijne, 1992). In the late 1990s more advanced machine learning methods came into use, examples are TMHMM (Sonnhammer et al., 1998), HMMTOP (Tusnady and Simon, 2001) and PhDHTM (Rost et al., 1996). These dramatically improved the accuracy of the predictions and lead to possibilities of rapidly screening the complete 18 genomes that were released. Typically, the earliest methods predicted the correct topology, i.e. all transmembrane helices and the correct position of the loops, for ~30-40% of the membrane proteins while the more advanced methods were correct ~50-60% (Moller et al., 2001; Viklund and Elofsson, 2004). More recent topology predictors make use of evolutionary information in their predictions which increases the performance to ~70-75% (Viklund and Elofsson, 2004; Jones, 2007). Topology mapping In order to have a correct topology model of a membrane protein it has to be verified by experiments. It is relatively easy to determine the regions that are located in the membrane by computational methods since they are more hydrophobic than other regions. However, many times the predicted topology is wrong, for instance by missing a helix that is not very hydrophobic. The solution to this is to experimentally determine the location of specific amino acids, whether they are located in the membrane, the cytoplasm or the extra-cytoplasmic space. Sometimes antibodies, biochemical agents are used to do this (Kimura et al., 1997). However, the most common method is fusing a protein domain to a loop, a domain that only is active when it is located on one side of the membrane but not on the opposing side (Manoil, 1991; Moller et al., 2000; Ikeda et al., 2003). In paper I and VI the fusion proteins Green Flourescent Protein (GFP) and alkaline phosphatase (PhoA) were attached to the C terminus of membrane proteins. PhoA PhoA was one of the first topology mapping reporters (Manoil and Beckwith, 1985). The protein normally catalyses hydrolysis of phosphate groups located in the periplasmic space of bacteria. It contains an essential cysteine disulfide bridge that is required for proper folding of the protein. These cysteine disulfide bridges can only be formed in the periplasm, if the protein is located in the cytoplasm the bridge will not get formed and the protein remains inactive. In order to use PhoA as a topology reporter, it is fused to a membrane protein and a substrate that changes colour upon hydrolysis is added. If there is a change of colour, the region in question must be located in the periplasm, if there is no colour change, the region can be located either in the cytoplasm or the membrane. GFP Green Flourescent Protein (GFP) is a protein that originally comes from the jellyfish Aequorea victoria where it fluoresces green when exposed to blue light. It was first cloned 1994 and immediately became very popular in molecular biology, especially for making animals green (Chalfie et al., 1994). 19 GFP is only fluorescent if it is properly folded and it can only do that in the cytoplasm, it behaves opposite of PhoA. One advantage with GFP is that it is possible to measure protein overexpression since only properly folded membrane proteins can fluoresce (Drew et al., 2001). Using GFP/PhoA to aid topology determination Ideally, for a perfect topology model, many positions in the amino acid sequence should be mapped so that all the locations of the loops are experimentally determined. This is however unfeasible since it is too timeconsuming. To speed things up, Rapp et al. attached GFP and PhoA to the C termini of different membrane proteins. The rationale behind this was that a combination between limited experimental information together with topology predictions leads to more accurate topology models (Rapp et al., 2004). Since the fusion proteins are rather large, the risk is less that they interfere with the insertion mechanism if they are placed at the C terminus compared to other sites. Combining experimental data with topology prediction The topology prediction method that was used in paper I, II and VI was TMHMM. The reason for this was that it was easily available, fast and has good overall performance. TMHMM TMHMM is one of the most popular topology prediction methods. Its accuracy is reported to be 77.5% correct when predicting the full topology (Krogh et al., 2001). However, independent benchmarks report the accuracy to be 49-63% correct (Viklund and Elofsson, 2004; Jones, 2007). This discrepancy shows that it is important to perform independent tests, since the performance reported in the original articles most often is too high. TMHMM contains 7 compartments that are designed to recognize different regions of a transmembrane protein (Figure 4). For instance, the helix core compartment is trained to recognize the hydrophobic regions, while the inside loop compartment recognizes cytoplasmic loops. The HMM has been trained on the currently available 3D structures and experimentally verified topology models. Each compartment consists of states with probability distributions (emission probabilities) of the 20 different amino acids. Between the states there are transition probabilities that control how likely it is to go from one state to another or stay in the same state. When TMHMM is presented with an unknown amino acid sequence it finds the optimal way through the model for this sequence. The path through the different states in the compartments is the predicted topology of the sequence. 20 Cytoplasmic side globular Membrane Non-cytoplasmic side cap cyt. helix core cap non-cyt. short loop non-cyt. globular cap cyt. helix core cap non-cyt. long loop non-cyt. globular loop cyt. Figure 4. The architecture of TMHMM. The boxes shows the compartments and inside the compartments there can be one or more states. Parts of the model with the same text are tied, i.e. their parameters are the same. The permitted transitions between the compartments are shown as arrows. Increasing accuracy of topology models: constraining predictions To improve the prediction accuracy of TMHMM, it is possible to include experimental information about where certain parts of the protein are located. For instance, by attaching a flourescent probe to the C terminus of the protein, it is possible to see if it is located in the cytoplasm. This information can be included in the predictions by constraining which paths the HMM chooses. TMHMM allows for choosing the probability of an amino acid to belong to a certain class a priori. If the membrane protein has its C terminus in the cytoplasm, TMHMM is constrained so that the prediction must end in the cytoplasmic compartment (Figure 4). This both raises the accuracy of the predictions and forces the predictions to fit the experimental evidence. In paper I, the C-terminal location was determined by GFP/PhoA experiments for the majority of the membrane proteins in Escherichia coli. This information was then used in constrained TMHMM predictions. The expected increase in accuracy of the topology models when using experimental information about the C-terminal location was expected to increase from ~70% to ~80% (Rapp et al., 2004). In order to have as many improved topology models as possible, we searched for membrane proteins in 225 organisms that were similar to those we studied in E. coli (paper II). This is called a homology search and is explained below. 21 Increasing coverage with homology All life on earth is related to each other. Genetic information is inherited and passed on from generation to generation but is under constant evolutionary pressure, either by changes in the environment or by random mutations. When comparing genetic sequences from a generation to another generation, the changes are subtle, but given enough time, we get as diverse organisms as we can observe today. Since it is only possible to sequence DNA from living organisms, we can only get a snapshot of how life and organisms look like today and we have to make educated guesses about the past. The term homology is used to describe similarity caused by shared ancestry. For a protein sequence, this means that if you have a sequence with unknown function you can search for homologous sequences in a database such as GenBank. If you find a sequence with high sequence similarity to your unknown sequence, they probably have similar function. However, a search in a sequence database is not a trivial problem due to the enormous amounts of data stored in them. Therefore many different search methods have been developed, either slow and accurate, or fast and less accurate. BLAST and PSIBLAST BLAST (Basic Local Alignment Tool) is a rapid method to find local similarities between an unknown query sequence and a database of sequences (Altschul et al., 1990). It is possible to use both DNA and protein sequences. It calculates the statistical significance for each of the matches in the database. PSIBLAST (Position-Specific Iterated BLAST) uses BLAST to create a Position Specific Scoring Matrix (PSSM) (Altschul et al., 1997). All statistically significant matches from a BLAST run are aligned in a multiple sequence alignment. For each position in the alignment, the amino acid frequencies are calculated, this is the PSSM. What PSIBLAST then does is to use this PSSM and search with it again in the sequence database. For each iteration, a new PSSM is calculated and used to search the database. Since the PSSM contains more evolutionary information than the initial query sequence does, it is possible to find more distantly related sequences than a traditional BLAST run. To make the improved topology models for the membrane proteins from the 225 prokaryotic organisms, we first made a sequence database with all the proteins present in the 225 organisms. Then we searched the database with the proteins that had experimentally verified C-terminal location (from paper I and other sources). Proteins that were homologous at their C-terminal end were assigned to have the same C-terminal location as the experimentally verified proteins. This made it possible to run constrained TMHMM predictions for ~30% of the membrane proteins from the 225 organisms. 22 Improving the level of detail of the structural models of membrane proteins As more and more membrane protein structures become known, it is clear that traditional topology models only provides a simplified picture of a membrane protein. It is clearly sufficient for proteins like bacteriorhodopsin (Figure 3) but for proteins like the glutamate transporter (Figure 5) a traditional topology model would miss much of the structural detail present in the structure. In the glutamate transporter, some transmembrane helices are tilted, other contains kinks, and there are also regions that go inside the membrane and then back again, from the same side as it came from. It is clear that new prediction methods are needed that can incorporate these kinds of structural features. Figure 5. The glutamate transporter with its complex topology. The blue mesh depicts a 30 Å wide membrane. Structural features of membrane proteins The lipid environment that membrane proteins reside in put many constraints on their structure. As James U. Bowie so nicely cited the book “The bound man” by Ilse Aichinger in a review (Bowie, 2005): “’A man awakes from a coma to find himself bound by ropes. He learns to move gracefully within these constraints and eventually becomes a circus performer.’ Similarly, membrane proteins must perform complex signaling and transport functions within the strict confines of a lipid bilayer. This requires elegant structural adaptations to their environment.” 23 There are many different structural features that characterize membrane proteins. Here I will first describe some general structural characteristics of membrane proteins and then continue with the regions that challenge the traditional view of membrane protein topology. The transmembrane helices The initial view was that membrane proteins simply consisted of membrane penetrating ideal helices. But as more and more structures are solved this view gets more and more revised. Almost 60% of the helices contain significant bends or are distorted (Yohannan et al., 2004). One particular case is proline-induced kinks. Proline is a special amino acid since its side-chain loops back to the amide nitrogen of the backbone. This makes it rigid and, if placed in an α-helix, results in a loss of a backbone hydrogen bond in the helix (Reiersen and Rees, 2001). If placed in a transmembrane helix, the helix gets kinked. Other elements that disturb the simple picture of ideal helices include 310- and π-helices. In an α-helix, the hydrogen bonds are 4 amino acids apart, in a 310-helix the hydrogen bond is between each 3rd residue and in a π-helix each 5th. This also causes distortions. The positive inside rule Apart from the hydrophobic transmembrane helices, the best known sequence characteristic of a membrane protein is the ‘positive inside’ rule. It states that the majority of the arginines (Arg) and lysines (Lys) in the loops are located on the cytoplasmic side of the membrane (Figure 6). Figure 6. The side chain atoms of arginine and lysine are depicted as lines. As can be seen from the figure, there are more arginine and lysine located in the cytoplasmic side of the membrane. When studying known membrane protein structures almost twice as high fraction of Arg and Lys are on the cytoplasmic side compared to the extra24 cytoplasmic can be found (Granseth et al., 2005). Genome-wide studies confirms that the bias is present in almost all organisms (Wallin and von Heijne, 1998; Nilsson et al., 2005). Aromatic residues Tryptophan (Trp) and tyrosine (Tyr) are most often found in the membranewater interface region located around ±15 Å from the centre of the membrane. This is often referred to as the “aromatic belt” (Wallin et al., 1997). Experiments with artificial transmembrane helices show that Trp flanking the transmembrane helix positions it so that the Trp are at their preferred locations near the lipid-carbonyl region (Braun and von Heijne, 1999; de Planque et al., 1999). However, this behaviour can not be seen for the aromatic phenylalanine. Snorkelling Lys and Arg have long aliphatic side chains with a positively charged amine or guanidium group at the end. The aliphatic hydrocarbon part of the side chains prefer to be located in hydrophobic region of the bilayer. The positively charged amine or guanidium groups, on the other hand, prefer the more polar interface region. This behaviour is described as snorkelling (Chamberlain et al., 2004). The same kind of behaviour can sometimes be observed in Trp and Tyr (de Planque et al., 1999). Trp is capable of forming hydrogen bonds with its NH group, but it also has the largest non-polar surface of all amino acid residues. Loop lengths The loops that connect the transmembrane helices can vary a lot in length, but are on average 9 amino acids long. If the loops are longer than this, they often change from having irregular structure to contain short helices that are approximately parallel to the membrane surface. These are called interface helices. If an interface helix is present between two transmembrane helices, the average number of amino acids between them rises to 27 (Granseth et al., 2005). Interface helices The first example of a structural region that is not modelled in traditional topology models is the interface helix. An interface helix is a helix running parallel with the membrane surface, almost perpendicularly to the transmembrane helices (Figure 7). The reasons for these helices are not well understood, but they might be needed for structural reasons such as positioning 25 the transmembrane helices. In photosystem I, they are thought to shield cofactors from the aqueous phase (Jordan et al., 2001) and in the MscS mechanosensitive channel they might be involved in channel gating by transferring mechanical force (Bass et al., 2003). Interface helices are on average 9 amino acids long, similar to the average length of helices in globular proteins. They contain a higher amount of Trp and Tyr than other parts of the protein (Granseth et al., 2005). Figure 7. Example of a membrane protein structure that contains both reentrant regions (dark blue) and interface helices (green). Re-entrant regions The second example of a structural element that disturbs the traditional topology model is the re-entrant region. A re-entrant region is a membrane penetrating part of the protein that goes into the membrane and then out again, from the same side as it came from. Two typical examples from aquaporin-0 is shown in Figure 7, where the two "half-TM" helices meet each other in the middle of the membrane (Harries et al., 2004). These are responsible for the channel gating, function as selectivity filters for water molecules and prevent protons from passing through the channel. Other examples include the protein conducting channel SecY (Van den Berg et al., 2004), the chloride ion channel (Dutzler et al., 2003) and the glycerol conducting channel (Fu et al., 2000). Many times the re-entrant regions contain amino acids that are responsible for the function of the protein (Lasso et al., 2006) which has lead to increasing efforts of predicting these regions, both by us in paper V and by Lasso et al. 26 Problems with topology and structural features Paper V addresses the need for more structural complexity in the topology prediction models. However, we also made a novel prediction method which looks at the problem from a different angle. We created a method that predicts the distance to the membrane centre for all amino acids in the protein sequence. This is sort of a 2.5 dimensional structural prediction since we get a structurally meaningful measure. It might be useful as an intermediate step towards full 3D structure prediction. In order to understand how this kind of prediction works, the Z-coordinate need to be explained. The Z-coordinate The 3D structures of membrane proteins do not usually contain any information of how the proteins are situated in the membrane. The reason for this is that for structure determination, the protein is taken out from the membrane and crystallized by adding amphiphilic detergent molecules. These are most of the time disordered and impossible to assign coordinates for. However, sometimes a few of the lipids and/or detergent molecules can be seen in the X-ray picture that can facilitate the determination of the positioning in the membrane. Figure 8. Before (left image) and after (right image) the membrane protein has been subject to rotation and translation to a common coordinate system. The blue mesh represents the end of the hydrophobic region at Z=±15 Å. In order to be able to compare different membrane protein structures to each other they need to have a common coordinate system. This leads to possibilities of statistical studies and comparisons between different structures. To be able to do this, the atom coordinates from the original structure need to be changed so that the structure is positioned in a theoretical membrane (Figure 8). One simple solution is to calculate a theoretical transmembrane helix that is the average of all transmembrane helices (Wallin et al., 1997). The struc27 ture is rotated to this theoretical vector. Then the average hydrophobicity of the amino acids in 1 Å wide slabs perpendicular to the vector is calculated. The coordinates for the amino acids is then moved so that the hydrophobic maximum is in the middle of the membrane (Z=0 Å) and a positive value of the Z-coordinate is directed to the non-cytoplasm and a negative to the cytoplasm. The Z-coordinate is thus the distance of an amino acid relative to the membrane. More advanced methods have recently been constructed, for instance TMDET (Tusnady et al., 2004) and OPM (Lomize et al., 2006). TMDET is an automated algorithm that first calculates the membrane exposed surface area and then the most probable positioning of the membrane planes is calculated by an objective function. The objective function considers hydrophobicity in 1 Å slabs together with structural factors. The other method, Orientations of Proteins in Membranes (OPM), uses a similar methodology (Lomize et al., 2006). 28 Dual topology In the global topology analysis (paper I), it could be seen that most membrane proteins either had high GFP fluorescence or high PhoA activity. However, since some proteins had both GFP and PhoA levels above background, we suggested that these aberrant proteins might have dual topology, i.e. they can insert either way into the membrane (Figure 9). The suggested dual topology proteins have a weak “positive-inside” bias; there is approximately the same number of Arg and Lys on the cytoplasmic side as the periplasmic. Figure 9. A dual-topology protein inserts into the membrane in two opposite orientations. As nearly all helix-bundle membrane proteins have a higher number of lysine and arginine residues in cytoplasmic (in) than in periplasmic (out) loops (the ‘positive-inside’ rule), dual-topology proteins are expected to have very small (Lys+Arg) biases. Dual topology proteins are interesting because they are attractive model systems for studying membrane protein evolution (Rapp et al., 2007). Evolution of membrane proteins Given the vast difference in the number of transmembrane helices between membrane proteins, how can they increase in size and complexity? One clue to this question lies in the available membrane protein structures. Many of the proteins have internal symmetry; if the protein is divided in two halves then the two halves can be superimposed on each other with an almost perfect fit. Examples include Lactose permease (Abramson et al., 2003), EmrD (Yin et al., 2006), BtuCD (Bass et al., 2003) and the chloride ion channel (Dutzler et al., 2002). 29 Figure 10. The structure of lactose permease. If the protein is divided into two halves (orange and green) and these are superimposed on each other, the internal structural symmetry can be seen (RMSD 3 Å). Figure 10 shows lactose permease before and after superimposing the two halves. If we look at the sequence level and align the sequences of the two halves, the sequence identity between them is around 20%. It seems that even though the two halves share the same structure they do not have very similar sequence. The hypothesis behind internal symmetry is that an internal gene duplication event creates two copies of the gene (Figure 11). These two genes are then finally fused and encode a twice as large protein. 30 1 2 3 4 5 6 N C Gene duplication 1 2 3 4 5 + 6 N 1’ 2’ 3’ 4’ 5’ 6’ N C C Gene fusion 1 2 3 4 N 5 6 1’ (N’) 2’ 3’ 4’ 5’ 6’ C Figure 11. Suggested evolutionary scenario of internal structural symmetry for a 12 TMH protein. A gene duplication event of a gene encoding a 6 TMH protein creates two adjacent genes encoding the same protein. The two genes diverge from each other and finally fuse into a single gene encoding a 12 TMH protein. The evolution of transmembrane topology has also been studied by detecting partial sequence homology in membrane proteins (Shimizu et al., 2004). Even though the study only used predicted topologies, it could be seen that ~1% of the 38,124 membrane proteins contained internal homology. It seems plausible that at least some membrane proteins have evolved by either partial or full internal gene duplications. In paper VI, we studied the identification and evolution of dual topology membrane protein families. Families and folds When comparing two structures of proteins, they belong to the same fold if they are similarly ordered in three dimensions, i.e. their secondary structures (α-helices and β-strands) match and they have a similar topological arrangement (Murzin et al., 1995). Even if two proteins share the same fold, that 31 does not imply that they are evolutionary related. A deeper level of structural similarity is achieved on the family level: two proteins belong to the same family if they are evolutionary related, in which case they typically have a pair-wise sequence identity of ≥ 30% (Murzin et al., 1995). In paper VI, Pfam was used to scan 225 genomes for dual topology family members. PFAM Pfam is one of the most useful tools that exist in bioinformatics. It uses multiple sequence alignments for proteins belonging to the same family to train HMM profiles. The HMM profiles can later be used to identify new members to the family. Given a sequence for a protein with unknown function, it is thus possible to use Pfam to see if the sequence belongs to any of the known families. Pfam contains 7973 families (v. 18.0) and these cover 6084% of the proteins in a proteome (Finn et al., 2006). There are two different versions of Pfam, Pfam-A and Pfam-B. Pfam-A is manually curated and contains well-characterised protein domain families with high quality alignments. Pfam-B is automatically generated by clustering and aligning all sequences that do not belong to any Pfam-A family (Sonnhammer et al., 1997). New protein families are not allowed to overlap with existing Pfam entries. Thus, one specific region of a sequence may only exist in one Pfam family. 32 Summary of papers The papers in this thesis have been divided into two major topics and one minor. The first two papers create more accurate traditional topology models by incorporating information about where the C termini are located, first in E. coli then in 225 completely sequenced prokaryotic organisms. The next major topic consists of redefining the traditional view of topology. The idea came after a statistical study of known membrane protein structures (paper III), where we could see that there was much structural complexity besides the transmembrane helices. This was not modelled in the traditional topology prediction methods that existed, so it was incorporated into the topology predictor described in paper IV. In the fifth paper, we wanted to get away from topology and create novel ways of predicting structural features. We therefore created a method that predicts the distance to the membrane centre for the amino acids. A successful prediction of this distance is a good starting point for more detailed 3D structure predictions. Finally, the last topic is about dual topology membrane proteins. In paper I we could see that the GFP/PhoA activities were behaving strange for a few of the proteins. These were further studied in paper VI, both by biochemical experiments and by genome-wide predictions. Improving the accuracy of topology models Global topology analysis of the Escherichia coli inner membrane proteome (Paper I) This paper describes the results from the first proteome-wide membrane protein topology mapping of an organism. Earlier results from a pilot study (Rapp et al., 2004) revealed that the C-terminal location is the most beneficial position for making constrained TMHMM predictions. Together with earlier results that described a methodology to use GFP/PhoA as topology reporter (Drew et al., 2005), the first large-scale topology analysis was performed for all multispanning membrane proteins in an organism. Out of Escherichia coli’s 4288 open reading frames, 737 were predicted by TMHMM to have more than 2 transmembrane helices and be longer than 100 amino acids. It was possible to assign a C-terminal location for 502 of the proteins and by using homology we could increase the coverage to 601 33 proteins. The criteria for assigning the homologous proteins was that the BLAST alignment should be present close to the C termini of both sequences and the e-value should be lower than 10-4. We used a recently developed TMHMM version to force the topology prediction to be consistent with the experimental results. By doing this, 80% of the predicted topology models were expected be correct instead of 70% without the experimental information (Rapp et al., 2004). The majority (~80%) of the membrane proteins had their C-terminal located in the cytoplasm and ~60% had both the N- and C-terminal inside the cytoplasm. The results also included a large-scale data set of overexpression potential for membrane proteins that had their C-terminal located in the cytoplasm. Unfortunately, no sequence characteristics that correlated with overexpression potential could be found. The reasons for the difficulties of overexpressing membrane proteins are probably caused by many interacting factors. Experimentally constrained topology models for 51,208 bacterial inner membrane proteins (Paper II) We wanted to create better topology models for as many as possible of the completely sequenced prokaryotic organisms. This was done by searching for homologs to the 502 E. coli membrane proteins that we had assigned the C-terminal location for. We also included 106 prokaryotic membrane proteins with known topology from other sources. The 608 proteins were used as queries in BLAST searches against 225 bacterial genomes. It was possible to create 51,208 much improved topology models, and these covered ~30% of all the predicted membrane proteins in the organisms. The topology models were assigned in a 2 step model, the first round found 30,744 homologs to the 608 proteins. The second round used these 30,744 proteins as queries and then an additional 17,111 proteins were found. Lastly, 3353 homologous proteins were found when we relaxed the criteria that the alignment should be all the way to the C-terminal end for both query and subject. Intuitively, the gammaproteobacterias, which E. coli belong to, had a larger fraction of assigned membrane proteins while archaeabacteria had the lowest fraction. The initial, unconstrained TMHMM topologies had the C-terminal location correctly predicted 69% of the times and it was mainly small (1-2 TMH proteins) that were mispredicted. 34 Improving the detail of structural predictions A study of the membrane-water interface region of membrane proteins (Paper III) In order to create new prediction methods, a statistical study is first often performed to see whether there are any specific things to consider. There have been many statistical studies of membrane proteins, but almost all of them have focused on the transmembrane α-helices inside the hydrophobic membrane region. This paper instead focused on the membrane-water interface region located just outside the hydrocarbon core. This region has often been neglected, but it could immediately be seen that it contained more structural complexity than previously believed. The region is enriched in irregular structure (~70%) and interface helices (~30%) running roughly parallel with the membrane surface, while β-strands are extremely rare. The average amino acid composition is different between the interfacial helices, the end-parts of the transmembrane helices located in the interface region, and the irregular loop structures. In the membrane-water interface region, hydrophobic and aromatic residues tend to point toward the membrane and charged/polar residues tend to point away from the membrane. The same general features could be seen in monotopic membrane proteins suggesting that common evolutionary forces shape all different kinds of protein surfaces exposed to the membrane-water interface. The interface region thus imposes different constraints on protein structure than do the central hydrocarbon core of the membrane and the surrounding aqueous phase. Structural classification and prediction of re-entrant regions in αhelical transmembrane proteins: Application to complete genomes (Paper IV) The motivation behind this paper was that in paper III we could see quite a lot of structural elements in the membrane-water interface region that current topology prediction methods missed. These elements were located between two transmembrane helices and were interface helices if running parallel with the membrane or were re-entrant regions if dipping inside the membrane. The method that was created was called TOPMOD and it uses a hidden Markov model for its predictions. Its underlying architecture resembles the standard TMHMM architecture but with new, additional compartments for interface helices and re-entrant regions (Figure 12). The new compartments had as few free parameters as possible because of the low amount of training data. This avoids the problem of overfitting. 35 It was very difficult for TOPMOD to get any reasonable performance of identifying interface helices, it often confused them with re-entrant regions. It seems that the sequence signals characteristic for re-entrant regions are stronger and more consistent. Under the theoretical situation that the transmembrane helices are known, TOPMOD had a sensitivity of 69% and a specificity of 72%. This situation is however quite artificial since it is only with an completely experimentally verified topology map that this would be possible. A more realistic situation is when the transmembrane helices are predicted, under these circumstances the sensitivity is 47% and the specificity 72%. Figure 12. The architecture of the TOPMOD hidden Markov model. The arrow in the bottom of the figure shows the Z-coordinate axis in Ångström. The final part of the project consisted of applying TOPMOD on 3 completely sequenced genomes, E. coli, S. cerevisiae and H. sapiens. 10 % of the membrane proteins of S. cerevisiae were predicted to have reentrant regions whereas E. coli 16.7% and H. sapiens 15%. 36 ZPRED: Predicting the distance to the membrane center for residues in α-helical membrane proteins (Paper V) As more and more membrane protein structures become known, it is clear that traditional topology information/prediction only provides a simplified picture of a membrane protein. The structural complexity in a membrane protein can not be modelled by conventional topology prediction algorithms. For instance, there are instances of re-entrant regions where the amino acid sequence dips into the membrane and out again on the same side as it originated from (red and blue in Figure 13). The transmembrane helices can also be tilted and/or contain structurally significant irregularities (green and cyan helices in Figure 13). If a re-entrant region is mistaken as a transmembrane helix by a topology prediction algorithm, the topology model of the protein will be erroneous. In order to overcome these limitations, ZPRED, a predictor of the distance from a residue to the membrane centre was created. The right part of Figure 13 shows the distance to the membrane centre (Z=0 Å) for the amino acids of the structure to the left. Initially, the full Zcoordinate was predicted but with little success. Therefore it was decided that the absolute value of the Z-coordinate should be predicted instead, since this both simplifies the problem and doubles the amount of training data for the machine learning methods. The range was also narrowed to 5-25 Å since this is the region where the environment of the amino acids changes the most. We benchmarked artificial neural networks, hidden Markov models and combinations of both. In the end, we used the output from an HMM topology predictor combined with evolutionary profiles as input to a neural network, and this predictor was named ZPRED. It could predict this distance to the membrane centre with an average error of 2.55 Å, and was also quite successful at predicting the protrusion of loop regions (see Figure 14 for an example of a prediction). 37 Figure 13. Left: The glutamate transporter homolog and the membrane border (Z=±15 Å) depicted as a blue mesh. The reentrant regions are colored blue and red, while transmembrane helices with structural irregularities are colored green and cyan. Right: The distance to the membrane center (which corresponds to the absolute value of the Z-coordinate) for the amino acids of the glutamate transporter. The coloring is the same as in the structure. Figure 14. Example of the output of a prediction from ZPRED. Left: All the amino acids that were predicted to be inside the membrane (Z<15 Å) are colored red, and those that predicted to be outside the membrane are colored green. The prediction is in well agreement with the structure except for the transmembrane helix that contains irregularities located to the left. Top right: The true topology with reentrant regions colored green and the PRODIV-TMHMM prediction below. Top bottom: The correct Z-coordinate from the structure corresponds to the green line whereas the prediction is the red. 38 Dual topology Identification and evolution of dual-topology membrane proteins (Paper VI) In paper I we suggested that some membrane proteins might adopt a dual topology, i.e. they could insert either way into the membrane (Figure 9). In this paper, we showed that the topology of 5 of the candidates from paper I was sensitive to changes in the Lys+Arg bias (the difference in positively charged amino acids between the cytoplasmic and periplasmic loops), that these families either existed as pairs or singletons in the genome and that the pairs encode two oppositely oriented proteins whereas singletons encode dual-topology candidates. As usual, GFP and PhoA were used to track the topological changes that occurred when the Lys+Arg bias was altered. SugE, EmrE, CrcB, YnfA and YdgC were highly sensitive to Lys+Arg alterations with large differences in C-terminal locations. YdgE and YdgF were used as controls to the experiments since these are proteins with stable topologies. As expected, these proteins were insensitive to Lys+Arg alterations. The main reason for choosing YdgE and YdgF as controls is that they are members of the same family as EmrE and SugE, the Small Multidrug Resistance (SMR) family. They are all ~100 amino acids long and have 4 transmembrane helices. Interestingly, YdgE inserts into the membrane with its C terminus in the periplasm while in YdgF it is in the cytoplasm. The ydgE and ydgF genes are also located next to each other in the chromosome. This led us to investigate the genome arrangement for dual topology candidates in other organisms than E. coli. The family members were located by Pfam in 174 fully sequenced bacterial genomes. Remarkably, for both the SMR and CrcB family a consistent pattern emerged where genes either occured as closely spaced pairs or as singletons. Closely spaced pairs encode homologous proteins with opposite topology (red bars in Figure 15) whereas singletons encode proteins with a Lys+Arg bias close to zero (green bars in Figure 15). The simplest explanation of this pattern is that that the SMR and CrcB family proteins form antiparallel dimers (or higher oligomers) composed either of two separately expressed and oppositely oriented homologs or of a single dual-topology protein. The YnfA and YdgC families were only encoded by singleton genes with a Lys+Arg bias close to zero. The last thing studied was whether other dual-topology families could be discovered. One good candidate was found, the DUF606 family. The proteins in this family mostly consisted of 5TM proteins following the same pattern as we could see for the SMR and CrcB families. However, we found examples of DUF606 family members which were internally duplicated (blue bars in Figure 15), where the first and second half of the protein had 39 opposite orientations in the membrane. The sequence identity between the two halves was as high as 36%. Taken together, the results might explain the existence of symmetry in the first and second halves of membrane protein structures and a possible way of how membrane proteins increase in size: The dual topology encoding gene is duplicated and eventually, through re-distribution of positively charged amino acid, the two resulting homologous proteins become fixed in opposite orientations in the membrane. Finally, the two genes fuse into a one single gene encoding a protein with internal symmetry. Figure 15. Distribution of genes coding for SMR, CrcB, YnfA, YdgC and DUF606. Green, proteins encoded by singleton genes; red, those encoded by closely paired genes; blue, DUF606-family genes encoding internally duplicated proteins. 40 Final thoughts Computer prediction methods hold great promise since it is both timeconsuming and expensive to determine the structure of a protein. However, protein folding is still an unsolved problem and ab initio structure prediction methods are only successful for small proteins. The usual bioinformatical excuse is to say that the methods will get better in the future, because then there will be more data available to make assumptions and build theories from. But what happens when the new data does not fit into the old model? This is what has happened during my 5 years as a PhD student. The more new structures that became available, the more I realised that we did not know much. First, re-entrant and interface helices occurred, then the dual topology concept. Suddenly, the old fact that α-helical membrane proteins are only located in the inner membrane gets crushed by a new protein that is located in the outer membrane. I guess that we will be surprised over and over again but that is what makes science so fun. Membrane protein structure determination is still a difficult problem, but I suspect that new technological advancements will increase the number of structures dramatically. During the meantime, topology prediction will suffice. The more detailed prediction attempts presented in this thesis will continue to improve and maybe not replace the traditional topology prediction methods but work as a complement to them. 41 Acknowledgements Arne Elofsson, my supervisor, for always having time for discussions and questions and giving me help. A more relaxed supervisor would be difficult to find anywhere. It has been great being your student! Gunnar von Heijne, my co-supervisor for all support and knowledge of membrane proteins. Thank you for all assistance during the years. My long-time room-mates Björn (I miss sharing beds at conferences), Åsa (for our discussions, thank you for listening to my endless blabbering) and Johannes (for always being positive and supportive) for the wonderful time we have experienced together. Work - never been so much fun. Olof, initially my exjobb-supervisor, later colleague and neighbour, now friend. Thank you for all your help and fun discussions (especially political… I love election time) throughout the years. My colleagues from work: Uppsalamaffian: kaffe-Andreas, Karin, Pär, Per (and me, Åsa and Olof) vs Linköpingsmaffian: varpa-Diana, wii-Jenny, Linnea, Aron, Kristoffer, rugby-Anna (and Johannes) vs the rest: dansHåkan, Olivia, Johan, Sara, Katarina, Charlotta, Anni, Yana and Fang. It has always been wonderful to get to work and meet you! Mikaela, Susanna and Dan – for doing such great work and our fruitful collaboration. SBC-people: my sparris-dealer Örjan, Ali, Samuel, Isak, Lars, Erik, Lotta, Jens, Bengt, Bob and Karin. Erik Lindahl for buying me lunch at Naturhistoriska. Erik Sjölund for making everything work at work. Move-on-up-Johan and move-on-up-Per at KTH-biotech for taking care of me in Brazil. CBR, DBB and SBC corridors. 42 Paul Horton, for helping me in Japan, Fujita-san for all fun we had together, Larisa for our interesting lunch discussions, Michael Gromiha, Martin and Tomii-san. Oskar & Anja, Mattias & Eleonor, Robban & Anna & Hanna, Erik & Erik & Erik & Kattis, Johan, grann-Tobbe och alla andra vänner. Vad kul vi har haft tillsammans! Anders, Aino, Oskar, Barbro, Hugo, Jonas och Kajsa. Det är roligt att ha fått en familj till! Min bror och förebild Björn, min syster Elisabeth, energiknippet Markus, mina föräldrar Asbjörn och Lena. Tack för all hjälp, allt stöd och all uppmuntran som jag fått av er under 29 år! Ida, min älskling för att du är du och att vi är tillsammans. Tack för all hjälp och att du alltid finns vid min sida. Jag älskar dig! 43 References Abramson, J., Smirnova, I., Kasho, V., Verner, G., Kaback, H.R., and Iwata, S. 2003. Structure and mechanism of the lactose permease of Escherichia coli. Science 301: 610-615. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J Mol Biol 215: 403-410. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402. Anfinsen, C.B. 1973. Principles that govern the folding of protein chains. Science 181: 223-230. Bass, R.B., Locher, K.P., Borths, E., Poon, Y., Strop, P., Lee, A., and Rees, D.C. 2003. The structures of BtuCD and MscS and their implications for transporter and channel function. FEBS Lett 555: 111-115. Berman, H.M., Battistuz, T., Bhat, T.N., Bluhm, W.F., Bourne, P.E., Burkhardt, K., Feng, Z., Gilliland, G.L., Iype, L., Jain, S., et al. 2002. The Protein Data Bank. Acta Crystallogr D Biol Crystallogr 58: 899-907. Beuming, T., and Weinstein, H. 2004. A knowledge-based scale for the analysis and prediction of buried and exposed faces of transmembrane domain proteins. Bioinformatics 20: 1822-1835. Bowie, J.U. 2005. Solving the membrane protein folding problem. Nature 438: 581589. Braun, P., and von Heijne, G. 1999. The aromatic residues Trp and Phe have different effects on the positioning of a transmembrane helix in the microsomal membrane. Biochemistry 38: 9778-9782. Chalfie, M., Tu, Y., Euskirchen, G., Ward, W.W., and Prasher, D.C. 1994. Green fluorescent protein as a marker for gene expression. Science 263: 802-805. Chamberlain, A.K., Lee, Y., Kim, S., and Bowie, J.U. 2004. Snorkeling preferences foster an amino acid composition bias in transmembrane helices. J Mol Biol 339: 471-479. de Planque, M.R., Kruijtzer, J.A., Liskamp, R.M., Marsh, D., Greathouse, D.V., Koeppe, R.E., 2nd, de Kruijff, B., and Killian, J.A. 1999. Different membrane anchoring positions of tryptophan and lysine in synthetic transmembrane alphahelical peptides. J Biol Chem 274: 20839-20846. Dong, C., Beis, K., Nesper, J., Brunkan-Lamontagne, A.L., Clarke, B.R., Whitfield, C., and Naismith, J.H. 2006. Wza the translocon for E. coli capsular polysaccharides defines a new class of membrane protein. Nature 444: 226-229. Drew, D., Slotboom, D.J., Friso, G., Reda, T., Genevaux, P., Rapp, M., MeindlBeinker, N.M., Lambert, W., Lerch, M., Daley, D.O., et al. 2005. A scalable, GFP-based pipeline for membrane protein overexpression screening and purification. Protein Sci 14: 2011-2017. 44 Drew, D.E., von Heijne, G., Nordlund, P., and de Gier, J.W. 2001. Green fluorescent protein as an indicator to monitor membrane protein overexpression in Escherichia coli. FEBS Lett 507: 220-224. Dutzler, R., Campbell, E.B., Cadene, M., Chait, B.T., and MacKinnon, R. 2002. Xray structure of a ClC chloride channel at 3.0 A reveals the molecular basis of anion selectivity. Nature 415: 287-294. Dutzler, R., Campbell, E.B., and MacKinnon, R. 2003. Gating the selectivity filter in ClC chloride channels. Science 300: 108-112. Finn, R.D., Mistry, J., Schuster-Bockler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., et al. 2006. Pfam: clans, web tools and services. Nucleic Acids Res 34: D247-251. Fu, D., Libson, A., Miercke, L.J., Weitzman, C., Nollert, P., Krucinski, J., and Stroud, R.M. 2000. Structure of a glycerol-conducting channel and the basis for its selectivity. Science 290: 481-486. Garrow, A.G., Agnew, A., and Westhead, D.R. 2005. TMB-Hunt: an amino acid composition based method to screen proteomes for beta-barrel transmembrane proteins. BMC Bioinformatics 6: 56. Granseth, E., Seppälä, S., Rapp, M., Daley, D.O., and von Heijne, G. 2007. Membrane protein structural biology — How far can the bugs take us? Mol Membr Biol 24: 1-5. Granseth, E., von Heijne, G., and Elofsson, A. 2005. A study of the membrane-water interface region of membrane proteins. J Mol Biol 346: 377-385. Gupta, K., Selinsky, B.S., Kaub, C.J., Katz, A.K., and Loll, P.J. 2004. The 2.0 A resolution crystal structure of prostaglandin H2 synthase-1: structural insights into an unusual peroxidase. J Mol Biol 335: 503-518. Harries, W.E., Akhavan, D., Miercke, L.J., Khademi, S., and Stroud, R.M. 2004. The channel architecture of aquaporin 0 at a 2.2-A resolution. Proc Natl Acad Sci U S A 101: 14045-14050. Hessa, T., Kim, H., Bihlmaier, K., Lundin, C., Boekel, J., Andersson, H., Nilsson, I., White, S.H., and von Heijne, G. 2005. Recognition of transmembrane helices by the endoplasmic reticulum translocon. Nature 433: 377-381. Hopkins, A.L., and Groom, C.R. 2002. The druggable genome. Nat Rev Drug Discov 1: 727-730. Ikeda, M., Arai, M., Okuno, T., and Shimizu, T. 2003. TMPDB: a database of experimentally-characterized transmembrane topologies. Nucleic Acids Res 31: 406-409. Jones, D.T. 2007. Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics 23: 538-544. Jordan, P., Fromme, P., Witt, H.T., Klukas, O., Saenger, W., and Krauss, N. 2001. Three-dimensional structure of cyanobacterial photosystem I at 2.5 A resolution. Nature 411: 909-917. Kimura, T., Ohnuma, M., Sawai, T., and Yamaguchi, A. 1997. Membrane topology of the transposon 10-encoded metal-tetracycline/H+ antiporter as studied by site-directed chemical labeling. J Biol Chem 272: 580-585. Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E.L. 2001. Predicting transmembrane topology with a hidden Markov model: application to complete genomes. J Mol Biol 305: 567-580. Lasso, G., Antoniw, J.F., and Mullins, J.G. 2006. A combinatorial pattern discovery approach for the prediction of membrane dipping (re-entrant) loops. Bioinformatics 22: e290-297. Levinthal, C. 1968. Are there pathways for protein folding? Journal de Chimie Physique et de Physico-Chimie Biologique 65: 44-45. 45 Lomize, A.L., Pogozheva, I.D., Lomize, M.A., and Mosberg, H.I. 2006. Positioning of proteins in membranes: a computational approach. Protein Sci 15: 13181333. Manoil, C. 1991. Analysis of membrane protein topology using alkaline phosphatase and beta-galactosidase gene fusions. Methods Cell Biol 34: 61-75. Manoil, C., and Beckwith, J. 1985. TnphoA: a transposon probe for protein export signals. Proc Natl Acad Sci U S A 82: 8129-8133. Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z., et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376-380. Moller, S., Croning, M.D., and Apweiler, R. 2001. Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics 17: 646-653. Moller, S., Kriventseva, E.V., and Apweiler, R. 2000. A collection of well characterised integral membrane proteins. Bioinformatics 16: 1159-1160. Monne, M., Nilsson, I., Elofsson, A., and von Heijne, G. 1999. Turns in transmembrane helices: determination of the minimal length of a "helical hairpin" and derivation of a fine-grained turn propensity scale. J Mol Biol 293: 807-814. Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247: 536-540. Nilsson, J., Persson, B., and von Heijne, G. 2005. Comparative analysis of amino acid distributions in integral membrane proteins from 107 genomes. Proteins 60: 606-616. Nirenberg, M. 2004. Historical review: Deciphering the genetic code--a personal account. Trends Biochem Sci 29: 46-54. Picot, D., Loll, P.J., and Garavito, R.M. 1994. The X-ray crystal structure of the membrane protein prostaglandin H2 synthase-1. Nature 367: 243-249. Rapp, M., Drew, D., Daley, D.O., Nilsson, J., Carvalho, T., Melen, K., De Gier, J.W., and Von Heijne, G. 2004. Experimentally based topology models for E. coli inner membrane proteins. Protein Sci 13: 937-945. Rapp, M., Seppala, S., Granseth, E., and von Heijne, G. 2007. Emulating membrane protein evolution by rational design. Science 315: 1282-1284. Reiersen, H., and Rees, A.R. 2001. The hunchback and its neighbours: proline as an environmental modulator. Trends Biochem Sci 26: 679-684. Rost, B., Fariselli, P., and Casadio, R. 1996. Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci 5: 1704-1718. Russell, R.B., and Eggleston, D.S. 2000. New roles for structure in biology and drug discovery. Nat Struct Biol 7 Suppl: 928-930. Shimizu, T., Mitsuke, H., Noto, K., and Arai, M. 2004. Internal gene duplication in the evolution of prokaryotic transmembrane proteins. J Mol Biol 339: 1-15. Sonnhammer, E.L., Eddy, S.R., and Durbin, R. 1997. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 28: 405420. Sonnhammer, E.L., von Heijne, G., and Krogh, A. 1998. A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Int Conf Intell Syst Mol Biol 6: 175-182. Tusnady, G.E., Dosztanyi, Z., and Simon, I. 2004. Transmembrane proteins in the Protein Data Bank: identification and classification. Bioinformatics 20: 29642972. Tusnady, G.E., and Simon, I. 2001. The HMMTOP transmembrane topology prediction server. Bioinformatics 17: 849-850. 46 Ulmschneider, M.B., and Sansom, M.S. 2001. Amino acid distributions in integral membrane protein structures. Biochim Biophys Acta 1512: 1-14. Wallin, E., Tsukihara, T., Yoshikawa, S., von Heijne, G., and Elofsson, A. 1997. Architecture of helix bundle membrane proteins: an analysis of cytochrome c oxidase from bovine mitochondria. Protein Sci 6: 808-815. Wallin, E., and von Heijne, G. 1998. Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci 7: 1029-1038. Van den Berg, B., Clemons, W.M., Jr., Collinson, I., Modis, Y., Hartmann, E., Harrison, S.C., and Rapoport, T.A. 2004. X-ray structure of a protein-conducting channel. Nature 427: 36-44. Watson, J.D., and Crick, F.H. 1953. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171: 737-738. Viklund, H., and Elofsson, A. 2004. Best alpha-helical transmembrane protein topology predictions are achieved using hidden Markov models and evolutionary information. Protein Sci 13: 1908-1917. Wimley, W.C. 2002. Toward genomic identification of beta-barrel membrane proteins: composition and architecture of known structures. Protein Sci 11: 301312. von Heijne, G. 1992. Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J Mol Biol 225: 487-494. von Heijne, G. 2006. Membrane-protein topology. Nat Rev Mol Cell Biol 7: 909918. Yarov-Yarovoy, V., Schonbrun, J., and Baker, D. 2006. Multipass membrane protein structure prediction using Rosetta. Proteins 62: 1010-1025. Yin, Y., He, X., Szewczyk, P., Nguyen, T., and Chang, G. 2006. Structure of the multidrug transporter EmrD from Escherichia coli. Science 312: 741-744. Yohannan, S., Faham, S., Yang, D., Whitelegge, J.P., and Bowie, J.U. 2004. The evolution of transmembrane helix kinks and the structural diversity of G protein-coupled receptors. Proc Natl Acad Sci U S A 101: 959-963. 47