* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Sequence-based predictions of membrane-protein topology, homology and insertion
Paracrine signalling wikipedia , lookup
Genetic code wikipedia , lookup
Lipid signaling wikipedia , lookup
Gene expression wikipedia , lookup
Expression vector wikipedia , lookup
Oxidative phosphorylation wikipedia , lookup
Metalloprotein wikipedia , lookup
Ancestral sequence reconstruction wikipedia , lookup
Biochemistry wikipedia , lookup
G protein–coupled receptor wikipedia , lookup
Interactome wikipedia , lookup
Signal transduction wikipedia , lookup
Magnesium transporter wikipedia , lookup
Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup
Protein purification wikipedia , lookup
SNARE (protein) wikipedia , lookup
Homology modeling wikipedia , lookup
Protein–protein interaction wikipedia , lookup
Two-hybrid screening wikipedia , lookup
Proteolysis wikipedia , lookup
Sequence-based predictions of membrane-protein topology, homology and insertion Andreas Bernsel Stockholm University © Andreas Bernsel, Stockholm 2008, pages 1-57 ISBN 978-91-628-7565-7 Printed in Sweden by US-AB, Stockholm 2008 Distributor: Stockholm University Library "Trying is the first step towards failure" -- Homer List of publications Publications included in this thesis I Bernsel A, von Heijne G. (2005) Improved membrane protein topology prediction by domain assignments. Protein Sci. 14(7):1723-28. II Bernsel A, Viklund H, Elofsson A. (2008) Remote homology detection of integral membrane proteins using conserved sequence features. Proteins. 71(3):1387-99. III Hessa T*, Meindl-Beinker NM*, Bernsel A*, Kim H, Sato Y, Lerch-Bader M, Nilsson I, White SH, von Heijne G. (2007) Molecular code for transmembrane-helix recognition by the Sec61 translocon. Nature. 450(7172):1026-30. IV Bernsel A*, Viklund H*, Falk J, Lindahl E, von Heijne G, Elofsson A. (2008) Prediction of membrane-protein topology from first principles. Proc Natl Acad Sci U S A. 105(20):7177-81. * These authors contributed equally Reprints were made with permission from the publishers. Additional publications Médigue C, Krin E, Pascal G, Barbe V, Bernsel A, Bertin PN, Cheung F, Cruveiller S, D'Amico S, Duilio A, Fang G, Feller G, Ho C, Mangenot S, Marino G, Nilsson J, Parrilli E, Rocha EP, Rouy Z, Sekowska A, Tutino ML, Vallenet D, von Heijne G, Danchin A. (2005) Coping with cold: the genome of the versatile marine Antarctica bacterium Pseudoalteromonas haloplanktis TAC125. Genome Res. 15(10):1325-35. Prediction servers Two web servers have been developed that implement methods described in the papers: ∆G-pred Prediction of ∆G for TM-helix insertion http://www.cbr.su.se/DGpred/ TOPCONS Consensus prediction of membrane protein topology http://topcons.cbr.su.se/ Abstract Membrane proteins comprise around 20-30% of a typical proteome and play crucial roles in a wide variety of biochemical pathways. Apart from their general biological significance, membrane proteins are of particular interest to the pharmaceutical industry, being targets for more than half of all available drugs. This thesis focuses on prediction methods for membrane proteins that ultimately rely on their amino acid sequence only. By identifying soluble protein domains in membrane protein sequences, we were able to constrain and improve prediction of membrane protein topology, i.e. what parts of the sequence span the membrane and what parts are located on the cytoplasmic and extra-cytoplasmic sides. Using predicted topology as input to a profile-profile based alignment protocol, we managed to increase sensitivity to detect distant membrane protein homologs. Finally, experimental measurements of the level of membrane integration of systematically designed transmembrane helices in vitro were used to derive a scale of position-specific contributions to helix insertion efficiency for all 20 naturally occurring amino acids. Notably, position within the helix was found to be an important factor for the contribution to helix insertion efficiency for polar and charged amino acids, reflecting the highly anisotropic environment of the membrane. Using the scale to predict natural transmembrane helices in protein sequences revealed that, whereas helices in single-spanning proteins are typically hydrophobic enough to insert by themselves, a large part of the helices in multi-spanning proteins seem to require stabilizing helix-helix interactions for proper membrane integration. Implementing the scale to predict full transmembrane topologies yielded results comparable to the best statistics-based topology prediction methods. Contents 1. 2. Introduction ..........................................................................................12 Biological membranes..........................................................................15 2.1 Membrane lipids ...................................................................................................... 15 2.2 Lipid bilayers............................................................................................................ 16 3. Membrane Proteins..............................................................................18 3.1 Types of membrane proteins .................................................................................. 18 3.1.1 α-helical transmembrane proteins .................................................................. 19 3.1.2 β-barrel transmembrane proteins ................................................................... 19 3.2 Membrane protein targeting and translocation in the ER ....................................... 20 3.3 Amino acid statistics of TM-helices ......................................................................... 21 3.3.1 Helix-breaking residues .................................................................................. 21 3.3.2 The aromatic belt ............................................................................................ 22 3.3.3 The positive-inside rule ................................................................................... 23 3.4 Membrane protein topology..................................................................................... 23 3.4.1 Topology mapping techniques........................................................................ 24 3.4.2 Dual topology proteins .................................................................................... 24 3.5 Membrane protein structure .................................................................................... 25 3.5.1 Structure determination techniques ................................................................ 25 3.5.2 Structural features of membrane proteins ...................................................... 26 3.5.3 Helix-helix interactions .................................................................................... 27 4. Computational approaches..................................................................28 4.1 Substitution matrices, profiles and HMMs............................................................... 28 4.1.1 Sequence alignment and substitution matrices .............................................. 28 4.1.2 Sequence profiles ........................................................................................... 29 4.1.3 Hidden Markov Models ................................................................................... 30 4.2 Aritifical neural networks ......................................................................................... 31 4.3 Hydrophobicity scales ............................................................................................. 32 4.4 Transmembrane topology predictors ...................................................................... 33 4.4.1 TopPred .......................................................................................................... 33 4.4.2 TMHMM / PRO-TMHMM / PRODIV-TMHMM ................................................ 33 4.4.3 HMMTOP ........................................................................................................ 34 4.4.4 MEMSAT......................................................................................................... 34 4.4.5 PHDhtm........................................................................................................... 34 4.4.6 Phobius / PolyPhobius .................................................................................... 34 5. Present investigation............................................................................36 5.1 Improving transmembrane topology prediction by domain assignments (Paper I). 36 5.2 Using topology to predict homology (Paper II)........................................................ 38 5.3 Deciphering the molecular code for transmembrane helix recognition (Paper III).. 39 5.4 Prediction of membrane-protein topology from first principles (Paper IV) .............. 44 Acknowledgements .......................................................................................46 References....................................................................................................48 Abbreviations ANN BLAST DOPC ER GFP GPCR GPI HMM MD NMR ORF PDB PhoA PSI-BLAST PSSM SP SRP TM Artificial neural network Basic local alignment search tool Dioleoyl-phosphatidylcholine Endoplasmic reticulum Green fluorescent protein G-protein coupled receptor Glycosyl-phosphatidylinositol Hidden Markov model Molecular dynamics Nuclear magnetic resonance Open reading frame Protein data bank Alkaline phosphatase Position specific iterative BLAST Position specific scoring matrix Signal peptide Signal recognition particle Transmembrane Amino acids Ala Cys Asp Glu Phe Gly His Ile Lys Leu (A) (C) (D) (E) (F) (G) (H) (I) (K) (L) Alanine Cysteine Aspartic acid Glutamic acid Phenylalanine Glycine Histidine Isoleucine Lysine Leucine Met Asn Pro Gln Arg Ser Thr Val Trp Tyr (M) (N) (P) (Q) (R) (S) (T) (V) (W) (Y) Methionine Asparagine Proline Glutamine Arginine Serine Threonine Valine Tryptophan Tyrosine 1. Introduction Life is a complex thing (Keyes 1966). During the short time a biological cell spends on earth before dividing itself into two new cells, or worse, committing suicide (Hengartner 2000), a myriad of biochemical reactions will have taken place, each of which is required for the development, survival and propagation of the cell and the genetic material it carries. Key players in this chemical factory of life are the proteins, catalyzing the specific conversion of substrates to products, regulating each others' concentrations, transporting molecules across barriers, and converting light into sugar, among many other things. The specific function of a protein molecule is determined from its three-dimensional shape, which in turn is a function only of its amino acid sequence (Anfinsen 1973), and ultimately of its corresponding DNA sequence. In a more general sense, not focusing on a specific molecule, however, the function of a protein also depends on its concentration and localization in the cell. In order to separate itself from the rest of the world, all living cells are equipped with a plasma membrane, the advantage being that the cell can modify its own internal conditions independently of, or sometimes in response to, what is going on in the external environment. Within eukaryotic cells, organelles are also delimited by at least one membrane, resulting in different compartments with disparate biochemical conditions. However, the isolation between a cell and its surroundings is not complete. The interior of the cell is only partially shielded from the exterior, and this balance is actively regulated by membrane spanning channels, transporters and receptors. Thus, in response to external changes, channels open and close, energydriven pumps are activated and de-activated, and ligand-binding receptors give rise to signaling cascades that propagate through the cell. Indeed, the concept of regulated partial isolation is general, and applies not only to cells with respect to their environments, but also to organelles, organs, individuals and even species (Weiss 2005). The amount of biological sequence data is currently increasing at an exponential rate (Overbeek 2000). To date, the sequences of more than 700 complete genomes are publicly available (http://www.ebi.ac.uk/GenomeReviews/stats/; August 2008), together comprising the result of one of the largest scientific endeavors in history. This rapid increase in the amount of biological information available has played a 12 major role in the origin of the field of bioinformatics, which deals with annotation, storage and analysis of such biological data. Accordingly, one of the first disciplines within bioinformatics, and still one of the largest, has been that of sequence analysis. Methods for automatic annotation of new DNA and protein sequences are often based on the assumption that only the sequence is known, and try to predict various features such as DNA exons and introns, protein domains or subcellular location using only sequence as input. Thus, there is a certain rationale for developing prediction methods that are fully sequence based, not just because they do not require data from expensive biochemical experiments, but also for the prospect that the wealth of training data can only be expected to increase. Development of such sequence based prediction methods is only becoming increasingly important as the sequence collection continues to grow. In particular, sequence based predictions on membrane spanning proteins are of great importance. Membrane proteins comprise about 25% of all proteins encoded by most genomes (Wallin and von Heijne 1998; Krogh et al. 2001) and are responsible for numerous vital processes in the cell, but apart from their general biological significance, there are also specific reasons for why additional effort is being put into membrane protein predictions. First, the medical importance of membrane bound receptors, channels and pumps as targets for drugs is well established. It has been estimated that more than 50% of all available drugs target membrane proteins (Drews 1996; Hopkins and Groom 2002). Second, structural information is hard to obtain experimentally, both because membrane proteins are particularly hard to overexpress and purify, but also since the large hydrophobic surfaces complicate the growth of crystals. This thesis focuses on prediction methods for membrane proteins that ultimately rely on their amino acid sequence only. As a consequence of the peculiar surroundings in which integral membrane proteins find themselves, spanning the apolar hydrocarbon core and projecting into the polar waterphase on either side of the membrane, they are somewhat constrained in terms of what amino acid sequence is compatible with the different environments. This makes it possible to find sequence patterns and predict the topology of the protein, i.e. what parts of the sequence that actually span the membrane, and what parts are located on the cytoplasmic and extracytoplasmic side of the membrane. In paper I, we attempt to improve sequence-based topology prediction of membrane proteins by introducing additional information about the presence of extra-membranous protein domains in the sequence, which are believed to always reside on just one side of the membrane. In paper II, we instead use the sequence constraints placed by the hydrophobic interior of the membrane to improve detection of distant membrane protein homologs. Paper III and IV relate to the actual insertion of membrane proteins into the membrane, and how transmembrane helices are recognized in the cell. In paper III, a scale describing the energetics of 13 this process is developed based on experimental measurements, and in paper IV, this scale is implemented to predict transmembrane topologies. 14 2. Biological membranes All living cells are surrounded by membranes, separating the interior of the cell from the surrounding environment and protecting it from foreign substances. In the case of eukaryotic cells, membranes also enclose subcellular compartments. The fundamental structural component of biological membranes is lipids, although a large fraction of the membrane mass comes from membrane proteins embedded within the lipids (Guidotti 1972). 2.1 Membrane lipids Membrane lipids are amphiphilic, consisting of one or more hydrophobic hydrocarbon tails, and a hydrophilic sometimes charged headgroup. Depending on the relative sizes of the headgroup and the hydrophobic tails, lipids are more or less prone to form flat bilayer structures rather than mi- Figure 2.1. Different shapes of membrane lipids. Cylindrically shaped lipids (left) stabilize a flat lipid bilayer structure. Conical lipids (middle) have large headgroups or skinny hydrocarbon tails. Inverted conical lipids (right) have small headgroups or bulky unsaturated hydrocarbon tails. celles or other structures when placed in water, Fig. 2.1. When the headgroup is of approximately the same size as the hydrocarbon tails, the lipid becomes roughly cylindrical, which promotes bilayer formation, whereas if the headgroup is either smaller or larger than the hydrocarbon tails, the conical shape induces curvature in the lipid monolayer (Epand 1998). 15 In addition to the variations in lipid shape, there are also differences in charge of the lipid headgroup, being neutral, zwitterionic or negatively charged. The lipids predominantly found in biological membranes are phospholipids and glycolipids, with the overall shape as in Fig. 2.1, and cholesterol, a neutral lipid containing a ring structure that breaks the tight packing of fatty acid chains, making the membrane more fluidic. In general, the headgroup region of a natural mix of lipids in a biological membrane tends to be slightly negative on average. 2.2 Lipid bilayers As a consequence of their amphiphilic nature, lipids aggregate and spontaneously self-organize in water, generally into either micelles or bilayers. The biological membrane is a bilayer structure, where the hydrophobic fatty acid chains avoid contact with water molecules by arranging themselves into two oppositely oriented fluidic leaflets, Fig. 2.2. An early description of the system was the fluid mosaic model (Singer and Nicolson 1972), in which lipids and proteins diffuse laterally in the plane of the bilayer. The width of the membrane is determined by the lengths of the hydrocarbon tails, and is gen- Figure 2.2 Structure of a DOPC bilayer, after a 1.5 ns molecular dynamics simulation (Feller et al. 1997). Due to thermal motions, the interfaces are of roughly the same width as the hydrocarbon core. Colors are as in Fig. 2.3. erally around 60Å, of which about half constitutes the hydrophobic core. However, the actual thickness, as well as fluidity and curvature, depends on 16 the lipid composition which varies between different membranes, and local fluctuations may occur due to the presence of lipid rafts (Simons and Ikonen 1997). The actual distributions of the functional groups in Fig. 2.2 across a DOPC lipid bilayer have been measured experimentally (Wiener and White 1992), and are shown in Fig. 2.3. From these distributions, which are quite Figure 2.3 Structure of a DOPC bilayer, as determined by X-ray and neutron diffraction measurements. The distributions reflect the probability of finding a particular structural group at a specific location in the bilayer. Colors are as in Fig 2.2. Figure adapted from (Wiener and White 1992). consistent with the molecular dynamics simulation in Fig 2.2, the emerging picture is that thermal fluctuations lead to a highly heterogeneous environment, where the interface regions together occupy roughly the same width as the hydrophobic core of the membrane. Thus, the chemical environment across the membrane varies markedly over short distances, which is reflected in the amino acid distributions of TM helices at different depths into the bilayer (see section 3.3). Lipids with conical shapes (Fig. 2.1; middle and right) might induce curvature when present in different concentrations in the two leaflets of a lipid bilayer. A membrane consisting primarily of phosphatidylcholine (cylindrical) and phosphatidylethanolamine (inverted conical) in both leaflets will experience curvature stress, meaning there will be higher lateral pressure in the lipid tail region than in the headgroup region. Such forces could be important for correct folding and functioning of integral membrane proteins (Dan and Safran 1998; Epand 1998). 17 3. Membrane Proteins 3.1 Types of membrane proteins For biological membranes, lipids are only half the story. The other half or so of the membrane mass comes from proteins (Guidotti 1972), interspersed among the lipids and contributing to the overall stability of the membrane, Figure 3.1 Structural classes of transmembrane proteins. (Left) The αhelical transmembrane protein bacteriorhodopsin, found in the plasma membrane of archaea. (Right) The β-barrel transmembrane protein OmpA, found in the outer membrane of gram-negative bacteria. Meshes indicate the approximate borders of the hydrocarbon core of the respective membranes. although the exact protein-lipid ratio varies between different membranes. Broadly, membrane proteins may be classified as peripheral or integral, and the integral membrane proteins can be further subdivided into monotopic, bitopic and polytopic ones (Blobel 1980). Peripheral membrane proteins are only loosely attached to the membrane through electrostatic or hydrophobic interactions with the lipid headgroups or other membrane proteins, or through a covalently attached GPI anchor (Chatterjee and Mayor 2001). Monotopic integral membrane proteins are largely water soluble, but also contain a hydrophobic part that partly penetrates into the membrane from 18 one side only. Bitopic integral membrane proteins span the membrane once and polytopic integral membrane proteins, finally, span the lipid bilayer multiple times. Polytopic proteins come in two basic architectures, either as bundles of tightly packed α-helices, or as closed barrels of amphipathic βstrands, Fig. 3.1. 3.1.1 α-helical transmembrane proteins Because of the low-polarity environment in the hydrophobic core of the membrane, there is a particular need for the protein backbone to form internal hydrogen bonds, and one way of satisfying all backbone hydrogen bonds is to form α-helices in the membrane-spanning parts of the protein. The αhelical bundle type membrane proteins are the most abundant, and are predicted to constitute around 25% of a typical genome (Krogh et al. 2001). Mainly, the TM helices are composed of hydrophobic amino acids that are able to form van der Waals-interactions with fatty acids of the surrounding lipids. However, polar and even charged amino acids are occasionally also found deep into the bilayer, although generally not in direct contact with lipids, but rather buried against a protein surface. The TM-helices are typically roughly perpendicular to the membrane plane, and around 26 residues long (Ulmschneider et al. 2005), although local deformations of the bilayer and tilting of TM-helices, respectively, can accommodate lengths as short as 15 and as long as 43 residues (Granseth et al. 2005). α-helical transmembrane proteins are the most well-studied class of membrane proteins, and the only class investigated in this thesis, and will be referred to simply as 'membrane proteins' throughout the text. 3.1.2 β-barrel transmembrane proteins Another way of solving the problem to satisfy internally all backbone hydrogen bonds is seen in the β-barrel type of transmembrane proteins, in which the membrane spanning parts are composed of an even number of antiparallel β-strands, which are tilted around 45° relative to the membrane plane (Galdiero et al. 2007). Each strand is hydrogen bonding to the neighbouring strands, and the first and last strands hydrogen bond with each other to close the barrel (Fig. 3.1). Residues in the β-strands alternately point outwards, facing the lipids, and inwards, facing the inside of the barrel, resulting in a sequence pattern in which the residues are typically alternately polar and hydrophobic. β-barrel type membrane proteins have so far only been found in the outer membrane of gram-negative bacteria (Fischer et al. 1994), although predictions suggest they are also present in the outer membranes of chloroplasts and mitochondria (Schleiff et al. 2003). Although difficult to predict, they have been estimated to account for less than 3% of the ORFs in the genomes in which they are present (Garrow et al. 2005). 19 3.2 Membrane protein targeting and translocation in the ER Most α-helical transmembrane proteins, and water-soluble proteins destined for the secretory pathway, are targeted by an N-terminal signal sequence to the ER membrane, where protein translation and translocation or membrane insertion occur simultaneously (Simon and Blobel 1991; White and von Heijne 2005; White 2007). Briefly, as the nascent polypeptide chain emerges from the ribosomal exit tunnel, an N-terminal hydrophobic segment of ~20 amino acids is recognized by a ribonucleoprotein complex known as the Signal Recognition Particle (SRP). The binding of SRP to the signal sequence temporarily halts elongation, and the SRP-ribosome-polypeptide complex, with mRNA also bound to it, is targeted to the SRP receptor protein at the ER membrane. SRP is released as the ribosome binds to a heterotrimeric channel complex known as the Sec translocon, whereupon translation is resumed and the nascent polypeptide chain is threaded directly through the channel as it is being synthesized, Fig. 3.2. The Sec translocon is Figure 3.2 Co-translational translocation and membrane insertion in the ER membrane. As the newly synthesized polypeptide chain emerges from the ribosome, the translocon channel must decide whether to insert it into the membrane or translocate it. Figure adapted from (White and von Heijne 2004). responsible both for the membrane integration of membrane proteins and for the export of secreted proteins, first into the ER lumen and then, by means of vesicular transport, through the Golgi and out of the cell. Membrane spanning α-helices, as well as the hydrophobic signal peptide, are recognized by the translocon and transferred laterally into the lipid phase, whereas the hydrophilic lumenal loops are translocated through the channel and the cytoplasmic loops are retained on the cytoplasmic side. For secreted proteins, the 20 signal peptide is cleaved off after translocation, releasing the folded protein on the lumenal side of the membrane. Membrane proteins either contain a cleavable signal peptide or an uncleaved signal sequence which then effectively becomes the first TM-helix of the mature protein. How is the recognition of TM segments accomplished by the translocon channel? It has been suggested that the channel opens and closes rapidly as translocation occurs, thereby allowing the TM segments to partition into the lipid environment (Rapoport et al. 2004). Favourable interactions between lipids and amino acid side chains should then promote membrane integration rather than translocation. With the view of the membrane as a highly heterogeneous environment (Fig. 2.2) where chemical conditions change markedly over short distances, it follows that amino acid side chains will have different preferences for different slabs of the lipid bilayer. On the sequence level, the efficiency with which a TM segment is integrated by the translocon should then be sensitive to the relative positions of amino acids within the segment. This trend is evident when looking at amino acid distributions in known membrane protein structures. 3.3 Amino acid statistics of TM-helices Although the most prominent feature of TM helices is their overall hydrophobicity, there is also a considerable variation in the preferences of amino acids along the membrane normal, as revealed in the known membrane protein structures. A recent statistical analysis of the frequencies of amino acids in TM helices shows a notable positional dependence, where hydrophobic residues (Ala, Ile, Val, Leu) mostly populate the hydrophobic core of the membrane, while polar (Asn, Gln) and polar aromatic (Trp, Tyr, His) residues avoid the core but are more abundant in the interface regions (Ulmschneider et al. 2005). The small polar residues Ser and Thr were found to be equally well represented in core as interface, which could possibly be explained by their ability to sometimes form hydrogen bonds with the backbone (Gray and Matthews 1984). Charged residues (Arg, Asp, Glu, Lys) show a sharp peak indicating that they are almost absent from the very center of the hydrophobic core, but seem to be well tolerated in the interface regions. 3.3.1 Helix-breaking residues Pro plays a special role in membrane helices, in that it breaks the helix and induces a backbone kink that can be important for function or stabilization of the structure (von Heijne 1991; Sansom and Weinstein 2000; Cordes et al. 2002). It even seems that the kink-inducing Pro residue can subsequently be substituted, as long as the kink is stabilized by specific interactions with 21 surrounding residues (Yohannan et al. 2004). Pro is also overrepresented at TM-helix ends in the interface regions (Granseth et al. 2005). In addition to Pro, Gly is also known to induce breaks in helices and promote turns, and also occurs with high frequency near TM-helix ends in the interface regions (Granseth et al. 2005). Despite its helix breaking nature, it is also quite abundant within the membrane. One reason for this is that its small side chain allows for two TM-helices to come close together, and it thus plays an important role in helix-helix packing (see section 3.5.3). 3.3.2 The aromatic belt The polar aromatic residues Tyr and Trp are often found near the ends of TM-helices, sometimes referred to as the "aromatic belt", where they have been suggested to anchor and position the helix in the membrane (de Planque et al. 2003). These residues have a hydrophilic part as well as an apolar Figure 3.3 Three-dimensional structure of the AQPM aquaporin water channel from Methanothermobacter marburgensis. The structure is colored according to hydrophobicity, where red is hydrophobic and white hydrophilic. Tyr and Trp residues are colored green and Lys and Arg residues are colored blue. Cytoplasm is downwards in the picture. The aromatic belts are evident, as is the overrepresentation of Lys and Arg residues on the cytoplasmic side of the membrane. Two re-entrant loops (section 3.5.2) are visible in the middle of the structure. aromatic ring, and being positioned near the borders of the hydrocarbon core region, they can both interact with the lipid headgroups and at the same time bury the aromatic ring among the apolar lipid tails (Yau et al. 1998). 22 3.3.3 The positive-inside rule A strong determinant for the topology and overall orientation of membrane proteins is the asymmetric distribution of positively charged amino acids (Lys and Arg) between loops on opposite sides of the membrane, being overrepresented in short cytoplasmic loops (von Heijne 1992), known as the positive-inside rule. This charge imbalance was first discovered for signal peptides (von Heijne 1984), and was subsequently found to apply also to membrane helices (von Heijne and Gavel 1988). Site-directed mutagenesis studies have shown that the positive charges are sufficient to determine overall topology (von Heijne 1989; Nilsson and von Heijne 1990), and the rule seems to be universal across all types of membranes in a large number of species (Nilsson et al. 2005). 3.4 Membrane protein topology The topology of a membrane protein describes what segments of the amino acid sequence span the membrane, and what segments protrude into the respective compartments on opposite sides of the membrane. Although this simple specification is clearly not as informative and accurate as the full three-dimensional structure of the protein, there are certain reasons why topology information can still be useful. First, even though full structure prediction is currently almost impossible to do with any reasonable reliability, if such methods are to arise in the future they will most probably have to start by making correct topology assumptions (von Heijne 2006). Thus, development of topology prediction algorithms has the prospect of paying off in the future, since the predicted information is so fundamental. Second, certain characteristics of membrane proteins can be deduced from topology information alone. Frequently, membrane protein superfamilies, e.g. the important 7TM receptors, have highly conserved topologies which can provide some information for detection of new members (Wistrand et al. 2006). Moreover, function can to a certain extent be predicted simply from the number of transmembrane helices in the protein (Sugiyama et al. 2003). Transmembrane topology can either be mapped experimentally (section 3.4.1), or predicted from amino acid sequence (section 4.4). Attempts have also been made at combining limited experimental information with prediction techniques to provide higher quality topology models for a relatively large number of proteins (Daley et al. 2005; Kim et al. 2006). According to these studies, in both the Escherichia coli and Saccharomyces cerevisiae entire membrane proteomes, proteins with the C-terminus on the cytoplasmic side of the membrane were roughly four times as frequent as those with the C-terminus on the exoplasmic side. In addition, proteins with an even number of TM-helices were clearly over-represented, and consequently, the 23 N-terminus is also more commonly located on the cytoplasmic side of the membrane. This might indicate that helical hairpins are the basic blocks of membrane insertion by the Sec translocon. This is a debated subject, however, since the X-ray structure of the translocon reveals that the pore seems too small to accommodate more than one helix at a time (Van den Berg et al. 2004). 3.4.1 Topology mapping techniques A number of techniques exist to experimentally map the locations of a few strategically selected parts of the sequence, in order to produce a topology model of the protein. One approach exploits recognition of sequence segments by agents that have access only to one side of the membrane, like glycosylation mapping (Chang et al. 1994), cysteine labeling (Kimura et al. 1997) or epitope mapping (Canfield and Levenson 1993). Another technique involves fusion of a reporter protein that is active only on one side of the membrane (Manoil and Beckwith 1986). By fusing the reporter protein to a number of differently truncated versions of the membrane protein, the in/out location of several points in the sequence can be derived and used to produce a full topology model. To avoid having to interpret the absence of activity of the reporter protein as a positive result, two reporter proteins with activity on opposite sides of the membrane are often used in combination. Reporter proteins frequently employed include PhoA (active in periplasm only) and GFP (active in cytoplasm only) (Drew et al. 2002; Daley et al. 2005). 3.4.2 Dual topology proteins Although the large majority of membrane proteins have a well defined topology, recent studies have demonstrated the existence of a few membrane proteins that seem to insert with some probability in oppositely oriented directions, termed dual-topology membrane proteins (Rapp et al. 2006; Rapp et al. 2007). These proteins have in common a small or non-existent bias of Arg and Lys (KR-bias) to either side of the membrane (section 3.3.3). Interestingly, they sometimes have homologues with large KR-biases, and in those cases, two copies of the gene with opposite KR-biases are always present within the same genome, presumably inserting in opposite directions into the membrane. In one family, the two oppositely oriented copies were sometimes fused, such that both protein length and KR-bias were doubled. If both directional variants are required for function, this points towards a possible route of evolution for membrane proteins in which a dual topology protein is duplicated, whereupon the two copies evolve a KR-bias to stably insert in opposite directions, and may subsequently be fused into a single protein. Indeed, internal symmetry is a common phenomenon in known membrane protein structures (Choi et al. 2008). 24 3.5 Membrane protein structure Membrane proteins are notoriously difficult to crystallize, which has led to the situation that, although about 25% of a typical genome are predicted to encode membrane proteins (Krogh et al. 2001), they only account for less Figure 3.4 The number of known membrane protein structures grows exponentially, although at a lower rate than soluble proteins during the equivalent time period. Figure reproduced from (White 2004) with permission. than 1% of the known protein structures, with about 160 unique structures known (http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html; August 2008). Encouragingly though, the number of known membrane protein structures is growing exponentially, although at a rate that lags behind the corresponding rate for soluble proteins during the equivalent time period, Fig. 3.4. 3.5.1 Structure determination techniques Structure determination starts with protein overexpression, which is often difficult with membrane proteins, both because large amounts of the protein may be toxic to the cell, and because the proteins sometimes aggregate to form inclusion bodies (Drew et al. 2003). After solubilization with a detergent and purification, there are a number of alternative methods to determine the three-dimensional structure. The majority of currently known membrane protein structures have been solved by X-ray crystallography. The main obstacle using this method is that growth of well-ordered diffracting crystals is difficult due to the amphiphilicity of the protein. To overcome this, membrane proteins are sometimes fused to soluble domains, increasing the polar 25 surface area available for crystal lattice contacts (Cherezov et al. 2007; Rosenbaum et al. 2007). Apart from X-ray crystallography, the other main method in use for obtaining structural information of membrane proteins is NMR spectroscopy. Here, no crystals are required, but on the other hand, the method is limited to quite small proteins (Torres et al. 2003). This can be particularly problematic for membrane proteins, since they are typically quite large by themselves (Mitaku and Hirokawa 1999), and need to be solubilized in detergent micelles, adding to the size of the complex (Torres et al. 2003). Although membrane proteins are reluctant to arrange themselves into three-dimensional crystals, they more easily pack into two-dimensional planar structures, which can be analysed by electron microscopy (EM). A threedimensional model of the protein can be produced by combining twodimensional projections at different angles (Henderson et al. 1990). Although the method gives relatively low resolution compared to the 3D methods, higher resolution can be achieved with cryo-EM, where the sample is studied at temperatures approaching -260°C (Torres et al. 2003). Cryo-EM has been used to solve a number of near-atomic membrane protein structures, including aquaporin (Murata et al. 2000) and the acetylcholine receptor (Unwin 2005). 3.5.2 Structural features of membrane proteins Incidentally, the first membrane protein structures to be solved were relatively simple in their structures (Deisenhofer et al. 1985; Henderson et al. 1990), with straight hydrophobic helices penetrating the membrane at angles roughly perpendicular to the membrane plane and interconnected by short loops. With increasing insight into membrane protein structures, the picture is now somewhat more complex, and certain recurring structural features not evident from just a membrane topology representation have emerged. One obvious observation is that not all helices are straight rods, but a few contain Pro-induced kinks (section 3.3.1), which might be important for protein function. Whereas Pro is quite rare in α-helices of soluble proteins, they occur with a relatively high frequency in transmembrane α-helices, where they seem to play an important role in the biological activity of the peptide (Cordes et al. 2002). Another recurring structural element in membrane proteins is the reentrant loop, which penetrates only halfway through the membrane, and enters and exits on the same side (Lasso et al. 2006; Viklund et al. 2006). Commonly, such loops contain residues important for function of the protein (Lasso et al. 2006), and they are found in, e.g., the translocon channel (Van den Berg et al. 2004), the aquaporin water conducting channel (Fig. 3.3 and (Harries et al. 2004)), and the chloride ion channel (Dutzler et al. 2003). 26 In addition to the transmembrane α-helices perpendicular to the membrane plane, some structures also contain shorter interfacial helices, lying roughly parallel to the membrane. The function of these helices is not entirely clear, but they might be involved in positioning of the transmembrane helices (Granseth et al. 2005). They are on average 9 residues long and rich in Trp and Tyr residues (Granseth et al. 2005). Finally, preferences of amino acid side chains can be studied from the structures. Lys and Arg have long hydrophobic side chains with a positive charge at the very end, and when these residues are situated within the hydrocarbon core of the membrane, they often point their side chain away from the membrane core and out towards the hydrophilic interface region, a phenomenon described as 'snorkelling' (Chamberlain et al. 2004). Similarly, Phe residues tend to direct its hydrophobic aromatic ring towards the membrane core, termed 'anti-snorkelling'. 3.5.3 Helix-helix interactions Whereas for globular proteins, the hydrophobic effect is the main driving force of folding (Honig and Yang 1995; Lins and Brasseur 1995), the circumstances are somewhat different for membrane proteins, being embedded in the hydrophobic environment of the membrane. For a long time, hydrogen bonding has been considered the major stabilizing force of membrane protein structure through inter-helical interactions (Engelman 1982; Choma et al. 2000; Zhou et al. 2000), but recently, this view has been questioned (Joh et al. 2008). A dominant force of helix packing in the hydrocarbon core might be van der Waals interactions between helices (White and Wimley 1999). Two small residues (Gly, Ala, Ser) separated by three residues will end up on the same side of the α-helix, and thus allow two helices to come close together and pack tightly. The GX3G-motif has been implicated as the main driving force of homodimerization of the single-spanning Glycophorin A (Lemmon et al. 1992; Treutlein et al. 1992), and both this motif and the GX3GX3G- and GX7G-motifs are overrepresented in TM-helices in general (Liu et al. 2002; Kim et al. 2005). The assembly of membrane protein chains into complexes can be better understood by considering the thermodynamics of the surrounding membrane lipids. Even though the dimerization of two TM chains into a rigid arrangement seems to reduce the entropy of the proteins, this is counterbalanced and outweighed by the entropy increase resulting from the lipids being released back into the lipid pool of the membrane as the total protein-lipid surface area decreases (Helms 2002). 27 4. Computational approaches 4.1 Substitution matrices, profiles and HMMs In principle, all characteristics of a newly synthesized protein, including targeting to different subcellular components, folding of secondary and tertiary structure, complex assembly, function and ultimately degradation, are determined by its amino acid sequence (Anfinsen 1973). One of the most important and challenging areas of bioinformatics is thus the decoding of the biological signals intrinsic in the sequence, and prediction of biochemical properties from this source of information. Among the many different types of signals that can be recognized, evolutionary relationships and classification of proteins into superfamilies and families are among the most important 4.1.1 Sequence alignment and substitution matrices Given two protein sequences, the alignment problem can be described as finding the most probable set of point mutations, insertions and deletions defining the evolutionary divergence of the two proteins. In the simplest case, each amino acid substitution is associated with a certain cost, and a dynamic programming algorithm, such as the Smith-Waterman (Smith and Waterman 1981) or Needleman-Wunsch algorithm (Needleman and Wunsch 1970), is employed to search for the set of substitutions that together sum to the lowest total cost for the alignment. Insertions and deletions can be included in the model by also associating appropriate costs with gap opening and gap insertion respectively. Some different versions of the 20x20 amino acid substitution matrix, describing the costs for aligning all amino acids to each other, are regularly used in bioinformatics applications, of which the most popular are the PAM (Dayhoff 1978) and BLOSUM (Henikoff and Henikoff 1992) set of matrices respectively. Incidentally, the BLOSUM matrices were recently shown to contain errors due to a software source code bug, although the 'erroneous' matrices actually perform better than the 'intended' ones (Styczynski et al. 2008). Dynamic programming algorithms are generally too slow for searching large sequence databases, but heuristic algorithms exist that are significantly 28 faster at the cost of being slightly less sensitive. BLAST (Altschul et al. 1990) first searches sequences for high-scoring 'words', and the hits can subsequently be extended using dynamic programming, and FASTA (Pearson 1990) uses a similar approach. 4.1.2 Sequence profiles By aligning two sequences using a single substitution matrix, it is impossible to model the very common situation where substitution rates are not constant over the whole sequence, but rather vary according to the relative importance between different parts of the protein. Structurally and functionally important regions in a protein structure should be expected to exhibit higher level of conservation than e.g. flexible loops far from the active site. In order to capture this effect, sequence profiles or Position Specific Scoring Matrices (PSSMs) (Altschul et al. 1997) are commonly used to represent common motifs in protein families. In a sequence profile, each position of the sequence motif or sequence family contains a 20x1-vector of probabilities for observing each of the 20 amino acids in that particular position, as given by a multiple sequence alignment (Fig. 4.1). Figure 4.1 (Above) Multiple sequence alignment of a Zinc finger domain. (Below) A sequence profile for the first ten residues of the same alignment. The profile is a compact representation of the multiple sequence alignment. Zeros in the profile are omitted for clarity. A number of conserved residues can be easily detected. 29 When evaluating the alignment of a sequence to a sequence profile, in principle the probability for the aligned amino acid in the single sequence is given by the corresponding frequency vector in the profile, and these probabilities are multiplied to give the total probability of the full alignment. To avoid the very small accumulated probabilities that can result from long protein sequences, in practice these calculations are always performed in log-space. It is well established that including evolutionary information in the form of profiles improves homology detection between protein sequences (Lindahl and Elofsson 2000). By aligning two sequence profiles against each other, evolutionary information is included for both query and database sequences, which has been shown to improve remote homology detection even further (Park et al. 1998; Rychlewski et al. 2000; Mittelman et al. 2003; Ohlson et al. 2004). 4.1.3 Hidden Markov Models Hidden Markov Models (HMMs) are probabilistic models that were early applied to the problem of speech recognition (Jelinek et al. 1975; Rabiner 1989), and have later on been used in a number of biological sequence analysis problems (Churchill 1989; Krogh et al. 1994). The model consists of a set of interconnected states, each of which contains a probability distribution, called emission probabilities, over a set of symbols, referred to as the alphabet of the model. In addition, each state also contains a probability distribution over possible transitions to all states in the model (including itself), known as transition probabilities. According to the parameters of the model, the emission and transition probabilities, the model generates a sequence of symbols from the alphabet by first emitting a symbol according to the emission probability distribution of the current state, then transits to a new state according to the transition probability distribution and repeats this until a certain stop state is reached. By tuning the parameters of the model, it can be designed to generate a set of sequences with high probability. There are basically three different problems associated with HMMs. First, the probability for the model to generate a certain sequence of symbols can be calculated, known as the evaluation problem. The forward and backward algorithms (Rabiner 1989) are used to solve this problem. Second, the most likely state path through the model can be calculated for a certain symbol sequence, known as the decoding problem. This problem is solved by the Viterbi algorithm (Viterbi 1967). Finally, the estimation problem relates to finding the most likely model parameters given a set of symbol sequences, and the Baum-Welch algorithm (Baum 1972) is used for this task. Profile HMMs (Eddy et al. 1995; Eddy 1998) are a certain kind of HMM, that has achieved widespread use in biological sequence analysis applications. These models resemble sequence profiles (section 4.1.2) in that they 30 are used to represent common characteristics of a whole protein family. States in profile HMMs are sequentially arranged and connected so that positions in a multiple sequence alignment are represented by one Match state (M), one Insert state (I) and one Delete state (D) (Fig. 4.2). Each Match and Insert state in the model contains both amino acid emission probabilities, in Figure 4.2 Graphical representation of a profile HMM. Match state (M), Insert state (I), Delete state (D). The model starts in the Begin state (B) and ends in the End state (E). All nonzero state transitions are indicated with arrows. analogy with the probability vector in ordinary sequence profiles, and transition probabilities, whereas the Delete states are 'silent states' representing gaps, and do not emit symbols. A profile HMM can easily be constructed from a multiple sequence alignment by setting emission and transition probabilities according to the observed amino acids and gaps at different positions in the alignment. 4.2 Aritifical neural networks Artificial neural networks (ANNs) (Brian and Hjort 1995) is a standard machine learning technique for pattern recognition and classification, that has been applied to many problems in bioinformatics, including TM topology prediction. The model consists of an interconnected set of nodes (neurons), each taking one or several signals as input. The output from a node is frequently a nonlinear function of a weighted sum of its input values, where hyperbolic tangent functions are particularly common. In the simplest case, the nodes are interconnected in a directed fashion, being arranged into layers, where each layer of nodes is only connected to the neighboring layers and signals through the network are propagated from left to right, called a feed-forward network. The output from the network effectively becomes a 'function of functions' corresponding to a non-linear mapping of a set of input variables to a set of output variables. 31 ANNs have been successfully applied to e.g. prediction of secondary structure (Qian and Sejnowski 1988), prediction of subcellular location (Bendtsen et al. 2004) and protein homology detection (Frishman and Argos 1992). In TM topology prediction, ANNs are sometimes used to predict a residue preference score, that is subsequently used as input to a dynamic programming algorithm (Rost et al. 1996; Viklund and Elofsson 2008) 4.3 Hydrophobicity scales In order to search for hydrophobicity signals in membrane protein sequences, a prerequisite is a scale assigning hydrophobicity values for all amino acids. Numerous such hydrophobicity scales are available, based on e.g. the partitioning of amino acids between two immiscible liquid phases, chromatographic techniques or accessible surface area calculations. A few of the available scales are outlined below, but many more exist. One of the most frequently cited hydrophobicity scales, the KyteDoolittle scale (Kyte and Doolittle 1982), combines accessible surface area measurements in globular proteins with water-vapor partitioning preferences. By implementing this scale in a sliding-window approach, it was possible both to distinguish exterior from interior in globular proteins, and to identify TM regions in membrane proteins. Another hydrophobicity scale specifically tailored to TM helices is the Goldman-Engelman-Steitz (GES) scale (Engelman et al. 1986). Here, a semi theoretical approach is taken, accounting for the attachment of side chains to an α-helical backbone structure. This scale is used to predict TM helices in the TopPred topology prediction algorithm (section 4.4.1). The Wimley-White scale (White and Wimley 1999) takes into account the (unfavorable) contribution of partitioning the backbone peptide bonds into the bilayer. Partitioning of pentapeptides into POPC bilayer interfaces and n-octanol, respectively, were used to construct this scale. Finally, the Zhao-London scale (Zhao and London 2006) is based on propensities of amino acids in known membrane protein structures. The scale was refined to separate well between transmembrane and soluble sequences from protein sequence databases, and should in that sense be quite suited to the problem of predicting TM helices. 32 4.4 Transmembrane topology predictors Several methods exist that, given an amino acid sequence, predict the full transmembrane topology. These methods are either based on physical properties of the amino acids, such as the hydrophobicity scales described above, or rely on statistical over- or underrepresentation of amino acids in transmembrane helices, as observed from known topologies of membrane proteins. TopPred (section 4.4.1) is the only method of the ones outlined below that is based directly on physical principles (i.e. a hydrophobicity scale) for predicting TM-helices. 4.4.1 TopPred One of the first topology predictors, TopPred (von Heijne 1992), takes advantage of the GES hydrophobicity scale (section 4.3), and creates a hydropathy plot of the sequence, using a sliding window approach. First, all hydrophobicity peaks above a certain cutoff value are identified and marked as 'certain' TM segments, whereas all peaks below this cutoff but above a second lower cutoff value are marked 'putative' TM helices. Second, all possible topologies, including all certain TM segments and either including or excluding each of the putative TM segments are generated, and the topology that best complies with the positive-inside rule (section 3.3.3) is chosen as the final prediction. 4.4.2 TMHMM / PRO-TMHMM / PRODIV-TMHMM Among the first methods to implement HMMs to predict TM topology, and one of the most frequently used, is TMHMM (Krogh et al. 2001). Common to all HMM based topology predictors is that all states carry a label, and paths through the model correspond to a labeling defining the topology of the protein. The architecture of the model is cyclic, such that two loops with a membrane region in between will always reside on opposite sides of the membrane. The N-best algorithm (Krogh 1997) is used to find the most probable labeling (the most probable topology) of the input sequence, which may correspond to more than one state path through the model. In PRO-TMHMM (Viklund and Elofsson 2004), a development of TMHMM, sequence profiles rather than single sequences can be scored against the model. This was found to improve prediction performance by a few percentage points. In the same study, another method, PRODIVTMHMM, was found to improve performance even further. This method uses a re-optimization procedure which was first introduced with HMMTOP (see below). 33 4.4.3 HMMTOP Another early method to use HMMs for TM topology prediction is HMMTOP (Tusnady and Simon 2001). The architecture of the model is cyclic, just like TMHMM, but differs in some respects such as the upper and lower boundaries for helix lengths. HMMTOP can take either a single sequence or a sequence profile as input. One notable feature of the method is that the model is re-optimized on every query sequence or sequence profile after the first scoring, and then scored a second time with the new parameters. This has the effect that hydrophobicity relative to the rest of the sequence determines the locations of TM regions, rather than absolute hydrophobicity. 4.4.4 MEMSAT The very first method to combine all sequence signals and calculate a globally best topology according to the model was MEMSAT (Jones et al. 1994). Five different states (inside loop, outside loop, helix inside, helix outside and helix middle) each contain an amino acid distribution, and dynamic programming is used to align the query sequence to the model. One important difference between this approach and the later HMM-based methods is that the model is not cyclic. Instead, all possible topologies are explored, assuming a linear model with an increasing number of TM-helices, starting from just one. Later versions of MEMSAT (Jones 2007) integrates ANNs with HMMs, and can take sequence profiles as input. 4.4.5 PHDhtm Another ANN-based approach for TM topology prediction is the PHDhtm method, which is part of a larger package, PHD (Rost et al. 1996), for predicting secondary structure of proteins. The method takes a sequence profile as input and predicts a preference score for each profile column based on a sliding window of neighboring columns. A topology is then generated using a dynamic programming algorithm, and the overall orientation is determined by the 'positive-inside rule'. 4.4.6 Phobius / PolyPhobius A common problem in TM topology prediction is the confusion of cleaved signal peptides (SPs) and TM regions, since both have an overall hydrophobic character. The first TM topology prediction method to address this problem was the HMM-based method Phobius (Käll et al. 2004), in which TMregions and SPs are modeled separately. The method is particularly efficient in datasets containing a mixture of proteins with and without TM-regions, 34 and with and without SPs. A later version of the method, PolyPhobius (Käll et al. 2005), can take a sequence profile as input. 35 5. Present investigation 5.1 Improving transmembrane topology prediction by domain assignments (Paper I) Topology prediction of membrane proteins is typically based on sequence statistics, reflecting amino acid preferences for different compartments. In particular, three sequence signals are especially important (Fig. 3.3): the hydrophobicity of membrane spanning segments, the disposition of Trp and Tyr residues to the membrane-interface border (section 3.3.2), and the overrepresentation of Arg and Lys residues in short cytoplasmic loops (section 3.3.3). In this paper, we explored the possibility that the presence of globular protein domains in membrane protein sequences can be used to constrain topology prediction methods and thereby improve prediction performance. Protein domains are often compartment specific, and information about domain occurrence has previously been used to predict subcellular location of soluble proteins (Mott et al. 2002). Moreover, although covalent combinations between transmembrane domains (i.e. protein domains with membrane spanning regions) are rare, covalent combinations between one transmembrane domain and one soluble domain are observed frequently (Liu et al. 2004). Taken together, these two observations suggest that soluble domains with compartment specific localization, when found in membrane protein sequences, can be used to constrain that part of the sequence to one side of the membrane, before a sequence-based method is used to predict the topology of the rest of the sequence. This is the basic idea explored in the paper. Related to this idea are earlier studies, where experimentally determined 'anchor points' were used to constrain topology predictions and provide improved topology models (section 3.4) (Daley et al. 2005; Kim et al. 2006). An important difference to these studies is that, here, no experiments are necessary, but rather one type of prediction (topology prediction) is improved by another type of prediction (domain assignments). From the SMART domain database (Letunic et al. 2004), 367 domains carrying an annotation about subcellular location (146 cytoplasmic, 221 extracellular) were extracted. Searching for these domains in two test sets, one small set of membrane proteins with known topologies, and one large set of 36 membrane proteins with predicted topologies, gave the results shown in Fig. 5.1. Briefly, the fraction of sequences containing at least one of the 40% 297 membrane proteins with known topologies 30% ~78.000 membrane proteins with predicted topologies 20% 10% 0% Fraction sequences with domain hits Fraction domain hits in conflict with topology Figure 5.1 Results of domain assignments to the two test sets. The compartment specific domains are found at similar rates in both sets, but agreement with topology differs. compartment-specific domains was around 10% in both sets, but the fraction domain hits in conflict with the topology was much higher in the large set with predicted topologies, than in the small set with known topologies. These conflicts are the cases where our approach can be used to constrain predictions and thereby improve topology models. We found that the domains were overrepresented in single-spanning proteins compared to the whole data set. Single-spanning proteins are often mispredicted by topology predictors, mostly due to an inversion such that the TM-segment is correctly located but the overall orientation is wrong. Large extra-membranous domains carry little or no information in other predictors, and our approach thus solves a major weakness in these methods. After this paper was published, two similar studies have also employed the strategy of using the occurrence of compartment-specific soluble protein domains in membrane protein sequences to improve topology prediction. In LocaloDom (Lee et al. 2006), experimentally verified Swiss-Prot annotations (Boutet et al. 2007) were used to classify Pfam domains (Bateman et al. 2004) as being cytoplasmic or extracellular. In addition, conserved topologies of transmembrane Pfam domains were identified using the Phobius topology prediction algorithm (Käll et al. 2004), and subsequently used to infer TM topologies of sequences in which the Pfam domain was found. TOPDOM (Tusnady et al. 2008) is also based on Swiss-Prot annotations, but also includes predicted information from Swiss-Prot, which makes the domain collection much larger. For the most part, these three studies agree on 37 the subcellular location of domains, although there are also a few exceptions (Tusnady et al. 2008). 5.2 Using topology to predict homology (Paper II) In this study, we introduced a new profile-profile based alignment method for remote homology detection of membrane proteins. Compared with globular proteins, membrane proteins are surrounded by a more intricate environment and, consequently, amino acid substitution rates in the membrane spanning regions differ from those of globular proteins and hydrophilic loops in TM proteins. Since existing algorithms for homology detection are often developed with globular proteins in mind, they may not be optimal to detect remote membrane protein homologs. In this study, we take advantage of the sequence constraints placed by the hydrophobic core of the membrane to detect distant homology relationships between membrane proteins. The approach is to include information about predicted topology in the alignment, and thereby promote alignments where membrane spanning regions are aligned against each other. Although for globular proteins, it is well established that homology detection can be improved by including information from secondary structure predictions (Fischer and Eisenberg 1996; Rice and Eisenberg 1997; Hargbo and Elofsson 1999), prior to this study, only a few attempts had been made to utilize the sequence constraints specific to membrane proteins. In one study (Ng et al. 2000), TM regions are clustered to calculate an amino acid substitution matrix specifically designed to TM proteins. In another study (Hedman et al. 2002), topology predictions were used to promote alignments where TM regions are aligned against each other in a profile-sequence based approach. By including evolutionary information for both query and target sequences, homology detection of globular proteins can be substantially improved (Park et al. 1998; Rychlewski et al. 2000; Mittelman et al. 2003; Ohlson et al. 2004), and this has recently been shown to apply also to membrane proteins (Forrest et al. 2006). In this paper, we presented the (to our knowledge) first method to combine the membrane protein specific sequence constraints with the strength of profile-profile based methods aiming to detect remote members of membrane protein families. Starting from a query sequence, a two-track profile HMM (section 4.1.3) is constructed, combining sequence information with predicted topology. This HMM is then searched against a database of sequence profiles, also containing predicted topologies, and the HMM emission probabilities are tuned such that alignment between TM regions in the two proteins is favored. 38 We first applied the method to remote homology detection within the GPCRDB (Horn et al. 2003) database of G-protein coupled receptors. Here, it was apparent both that the profile-profile based methods performed clearly better than the profile-sequence based ones, and that the additional topology information improved prediction results. The advantage of adding topology information was less clear in two other test sets, the OPM database (Lomize et al. 2006) and the HOMEP database (Forrest et al. 2006), although the profile-profile based methods performed consistently better also here. An example of an improved alignment between two GPCRs is shown in Fig. 5.2. Figure 5.2 Alignment between distant homologs BAR1_SCHCO and Q752Q1_ASHGO (both GPCRs), using HMMs with one (only sequence) and two (sequence and topology) alphabets, respectively. Adding topology information shifts the alignment such that the correct TMregions are aligned against each other. At the same time, the score for the alignment becomes positive and the overlap of TM residues between the two proteins (TM-overlap) is increased. 5.3 Deciphering the molecular code for transmembrane helix recognition (Paper III) Because of the heterogeneous environment in lipid bilayers (section 2.2), amino acid propensities in TM proteins vary with the distance to the center of the hydrocarbon core of the membrane (section 3.3). As secreted or TM 39 proteins are being synthesized by the ribosome, they are directly threaded through the translocon channel (section 3.2), where membrane spanning segments are recognized and inserted into the bilayer, possibly by the rapid opening and closing of a lateral gate (Rapoport et al. 2004), allowing the segment to partition into the lipid environment if this is energetically favorable. In this paper, we have investigated the 'molecular code' for the recognition of TM helices by the Sec translocon, describing the requirements in terms of amino acid composition for a sequence segment to be recognized as a TM helix and inserted into the membrane, as opposed to being translocated across. An earlier study from our lab (Hessa et al. 2005) laid the basis for the experiments (described below), and derived a 'biological hydrophobicity scale' for the contributions to membrane insertion of each amino acid when placed in the middle of a 19 residue segment. The experimental data underlying the derivation of the 'code' comes from insertion efficiency measurements of designed putative TM segments in a model membrane protein, Fig. 5.3. Briefly, systematically designed test Figure 5.3 Insertion efficiency measurements of systematically designed test segments (red), engineered into a model membrane protein with two TM helices (black). The G2 site is only glycosylated in the translocated form (right) and thus the two forms (left/right) can be size separated on a gel. The relative amounts of the two forms quantify the level of insertion. segments were introduced into the model protein, which was expressed and inserted into rough microsomal membranes in an in vitro translation system. The test segment is flanked by two glycosylation sites and, since glycosylation only occurs on the lumenal side, the inserted and translocated forms of the protein can be separated by size on a polyacrylamide gel. From the relative amounts of inserted and translocated forms of the protein, an apparent free energy of membrane insertion of the segment, ∆G app , can be calculated. Almost 500 designed and natural sequences were expressed and tested for aa (i ) membrane integration in this way. To quantify the contributions, ∆Gapp , from individual amino acids to ∆G app , an additive model was assumed in 40 which the position, i, of the amino acid within the segment was taken into account: l pred aa ( i ) ∆Gapp = ∑ ∆Gapp + c 0 µ + c1 + c 2 l + c3 l 2 (1) i =1 where l is the length of the segment, µ is the hydrophobic moment of the helix, c0 is a weight parameter for the hydrophobic moment, and c1, c2 and c3 are parameters describing the contribution from the helix length. The posiaa (i ) tion-specific amino acid contributions, ∆Gapp , were optimized by iterapred values tively minimizing the squared differences between predicted ∆Gapp and measured ∆G app values, using a standard non-linear optimization strategy. To reduce the number of parameters of the model, the contribution from aa (i ) a particular amino acid ∆Gapp as a function of position, i, was first expressed as a simple gaussian function with two parameters, except for Trp and Tyr, which were both modeled with double gaussian functions containing five parameters each. Using this representation, the model contained in ∆Gaa(i) (kcal/mol) app A C D E F 2 2 2 2 2 1 1 1 1 1 0 0 0 0 0 −1 −1 −1 −1 −1 −9 −6 −3 0 3 6 9 −9 −6 −3 0 3 6 9 −9 −6 −3 0 3 6 9 −9 −6 −3 0 3 6 9 G H I K −9 −6 −3 0 3 6 9 L 2 2 2 2 2 1 1 1 1 1 0 0 0 0 0 −1 −1 −1 −1 −1 −9 −6 −3 0 3 6 9 −9 −6 −3 0 3 6 9 −9 −6 −3 0 3 6 9 −9 −6 −3 0 3 6 9 M N P Q −9 −6 −3 0 3 6 9 R 2 2 2 2 2 1 1 1 1 1 0 0 0 0 0 −1 −1 −1 −1 −1 −9 −6 −3 0 3 6 9 −9 −6 −3 0 3 6 9 −9 −6 −3 0 3 6 9 −9 −6 −3 0 3 6 9 S T V W −9 −6 −3 0 3 6 9 Y 2 2 2 2 2 1 1 1 1 1 0 0 0 0 −1 −1 −9 −6 −3 0 3 6 9 −1 −9 −6 −3 0 3 6 9 0 −1 −9 −6 −3 0 3 6 9 Position −1 −9 −6 −3 0 3 6 9 −9 −6 −3 0 3 6 9 Experimental Statistical Figure 5.4 Position specific amino acid contributions to membrane insertion efficiency of TM helices. Position 0 corresponds to the center of the membrane. (Blue) Curves derived from experimental data as described in Fig. 5.3. (Red) Statistical curves from high-resolution membrane protein structures. 41 total 50 parameters, optimized using more than 400 constructs. The results from the optimization are shown in Fig. 5.4. Overall, the curves correspond well to statistical curves, obtained from the relative over- or underrepresentation of amino acids at different distances to the membrane center in known membrane protein structures. Charged residues (D, E, K, R) give highly position-specific contributions to membrane insertion, and are much better tolerated towards the interface regions than inside the hydrophobic core. Contributions from polar residues (H, N, Q, S, T) have the same general trend as the charged ones, but are less pronounced. Hydrophobic residues (A, F, I, L, M, V) give negative (favorable) contributions and are less position-specific. Finally, the polar aromatic residues (W, Y) are favorable for membrane insertion when located at a certain distance from the membrane center, but unfavorable in the middle. 0.3 Single−span TM Multi−span TM Cytoplasmic Secreted Frequency 0.2 0.1 0 −10 −8 −6 −4 −2 0 2 4 6 8 10 pred ∆Gapp (kcal/mol) Figure 5.5 ∆Gapp distributions in different sets of proteins. A large fraction of the multi-spanning helices seem to be only inefficiently recognized by themselves, and might rely on stabilizing interactions for membrane integration. pred To see how well the model would identify TM helices in natural proteins, we collected four test sets of secreted proteins, cytoplasmic proteins, singlespanning membrane proteins, and TM helices from multi-spanning mempred brane proteins and identified, in each set, the segment with lowest ∆Gapp , Fig. 5.5. The overlap between the secreted and single-spanning distribution is small, and they cross close to the zero-point defined by the experimental setup. A relatively large fraction of helices in multi-spanning proteins have 42 pred >0, indicating that such segments would not be efficiently integrated ∆Gapp by themselves, and might rely on stabilizing interactions from neighboring helices. Molecular dynamics simulations of lipid-embedded TM-helices containing charged residues (Dorairaj and Allen 2007) have resulted in free energies considerably more unfavorable than suggested by the 'biological hydrophobicity scale' first presented in (Hessa et al. 2005) and further developed in this paper. Concerns raised about the experimental setup involve the possibility of interactions between the H-segment and TM1 or TM2 (Fig. 5.3) (Shental-Bechor et al. 2006) and the potential for the H-segment to slide away from its intended position such that the charged residue ends up being closer to the interfacial region (Dorairaj and Allen 2007). However, control experiments have shown that the identity of TM1 and TM2 have little influence on the efficiency with which the H-segment is incorporated (MeindlBeinker et al. 2006). In addition, series of constructs where two polar or charged residues are scanned symmetrically from the middle towards the ends of the segment agree well with the additive model, arguing against sliding of the H-segment as being a crucial factor (Hessa et al. 2007). When interpreting the experimental results, a few things should be kept in mind. First, the position of an amino acid within the H-segment (Fig. 5.4) refers to the Cα-carbon, meaning that polar and charged residues, especially Arg and Lys, might be expected to point their side chains away from the immediate center of the hydrocarbon core. Second, a discrepancy to the MDsimulations is that biological membranes consist to a large part of proteins (Guidotti 1972), possibly influencing the polarity of the membrane environment compared to pure lipid bilayers. Third, charged residues in the membrane may induce local deformations of the bilayer and pull water into the membrane (Johansson and Lindahl 2006). Potentially, a combination of several explanations will eventually make it possible to reconcile the MD data with the experimental results. A recent study (Park and Helms 2008) derived a statistical positionspecific free-energy scale for the contributions of all amino acids, analogous to the statistical curves in Fig. 5.4. Using this scale to predict TM segments yielded results comparable to the best available TM topology prediction methods. However, scanning globular proteins, this scale did not show any difference between cytoplasmic and secreted proteins, as the one suggested in Fig. 5.5. 43 5.4 Prediction of membrane-protein topology from first principles (Paper IV) In an attempt to predict full transmembrane topologies using the model deaa (i ) -scale in veloped in paper III, we implemented the position-specific ∆Gapp two prediction algorithms. Topology prediction algorithms (section 4.4) are typically based on machine learning methods that extract statistical signals from known membrane protein structures, thereby optimizing a large number of parameters. An important difference to those machine learning approaches is that our methods are only based on data from experiments as in Fig. 5.3, and should be more directly linked to the physico-chemical principles underlying TM helix recognition in vivo. One method, TopPred∆G, basically employs the original TopPred algorithm (section 4.4.1), but the position independent GES hydrophobicity scale aa (i ) is replaced by the position dependent ∆Gapp -scale. Two parameters, the pred upper and lower ∆Gapp -cutoffs, corresponding to putative and certain TMhelices respectively, needed to be optimized on known structures. A variant of the algorithm that takes a sequence profile rather than a single sequence as input was also developed, which improved performance by a few percentage points. aa (i ) The ∆Gapp -scale was also implemented in an HMM-based method, aa (i ) SCAMPI, designed to be as simple as possible, such that only the ∆Gapp scale and no other parameters would influence the predicted topology. As such, all transition probabilities throughout the model are set equal to 1, and emission probabilities in the loop regions of the model are flat distributions and emit all amino acids with equal probability. In the membrane region of aa (i ) the model, the ∆Gapp -scale is used to calculate the free energy of insertion, and this energy is converted to the corresponding expected fraction of insertion in an experiment as in Fig. 5.3, which is used as emission probability in the model. Only two parameters in SCAMPI needed to be optimized on known structures. First, since many TM-helices are predicted not to be hydrophobic enough to insert by themselves (cf. Fig. 5.5), a constant correction pred factor is subtracted from the predicted ∆Gapp before this value is converted aa (i ) -scale does not to a corresponding probability. Second, since the ∆Gapp contain any orientational information, the emission probabilities for Lys and Arg are increased by a constant value in the part of inside loops closest to the membrane, in order to comply with the positive-inside rule (section 3.3.3). For both TopPred∆G and SCAMPI, the two parameters were optimized on one dataset, and used for prediction on another independent dataset. An example prediction is shown in Fig. 5.6. Somewhat unexpectedly, our simple methods based on experimental measurements of artificial TM helices were found to match the performance of the best available statistics based methods, predicting 80-85% correct topologies in our benchmark on all known membrane protein structures from PDB. We further analyzed the TM helices 44 pred in the dataset with respect to their ∆Gapp values, and found that helices that pred values than were highly buried in the structure typically had higher ∆Gapp helices that were exposed towards the lipid phase, and that TM-helices in pred proteins with many TM-helices tend to have higher ∆Gapp values on average. These observations support the hypothesis that helix-helix interactions might be an important factor for incorporation of marginally hydrophobic helices into the membrane. Figure 5.6 Topology predictions of Cytochrome c oxidase (PDB code 1ehkA) by TopPred∆G and SCAMPI. (Top) Crystal structure of the protein pred chain. (Middle) the ∆Gapp profile, with the TopPred∆G parameters ∆Ghigh and ∆Glow shown as dashed lines and the topology predicted by TopPred∆G indicated beneath the curve. (Bottom) The path through the SCAMPI model, with the vertical position in the model (see y axis) plotted as a function of sequence position. The structure and predicted topred pologies are colored according to ∆Gapp values. Locations of the TM helices as annotated in the OPM database (Lomize et al. 2006) are shown as gray boxes. Lys and Arg residues are marked with plus signs. 45 Acknowledgements Lots of people have contributed, helped and encouraged me during the work of this thesis. Thanks to all of you. I would like to especially thank the following people: My supervisor, Gunnar, for giving me the opportunity to work in your lab. Thanks for always knowing everything, and for explaining it in a way that the rest of us can understand. Thanks for giving me the freedom to try my own ideas and, when they fail, for having better suggestions. And thanks for always answering emails within 5 minutes, irrespective of time or the timezone you're in. My co-supervisor, Arne, for creating a great, relaxed atmosphere at work. Thanks for having more ideas than the rest of us taken together, and for enthusiastic explanations of them. And thanks for all the espresso. Past and present room-mates: Kristoffer, for julmys and for watering the flowers, Håkan, for all the science we learnt in Aspen, Tomas, for the amazing chip-tricks, Diana, for everything I know about the importance of conquering Kas. Erik Sj., the system guru, for your unique dart-throwing technique and for gathering people for the Thursday pub. SBC/CBR-people: Erik L., for your positive and encouraging attitude and for the PS3, Åsa, for baking amazing cake-club cakes, Per, for having the only working version of pymol, Erik G., for voting for me in the karaokecontest, Jenny, for teaching me how to spin the ball in wii-bowling, Björn, for being so reasonable with Zlatans morsa, Joppe, for choosing the sunexposed part of the table in the "hand-on-hot-table" contest in Brazil, Anna, for never occupying more than necessary of the cluster, Pär, for nervewrecking dart rounds, Aron, for taking care of the left-overs, Ali, for never bluffing, Örjan, for all the sparris, Linnea, Rauan, Marcin, Anni, Yana, Christine, Katarina for being great collegues. Mia for all the asian lunches. 46 My Biosapiens work package collegues: Rita, for sightseeing and shopping in Barcelona, Gert, for teaching me about GPCRs and for buying me lunch. My dear family: Mamma and Pappa, for life, for being so kind, generous and supportive, and for everything else, Morfar, for all your encouragement and interest, Johanna and Thomas, for long phone calls and for the sun deck at Lusten, Alma and Adam, for never saying no to play with me, Fredrik and Laura, for fine wines and private cinema shows, Bianca, for being cute and for forcing your parents to buy a bigger car, Flicka, for always being happy. My Sweetheart, Kirsi, for all your love and support, and for all the good times we've had together. You are my sunshine. 47 References Altschul, S. F., W. Gish, W. Miller, E. W. Myers and D. J. Lipman (1990). Basic local alignment search tool. J Mol Biol 215(3): 403-10. Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D. J. Lipman (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17): 3389-402. Anfinsen, C. B. (1973). Principles that govern the folding of protein chains. Science 181(96): 223-30. Bateman, A., L. Coin, R. Durbin, R. D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E. L. Sonnhammer, D. J. Studholme, C. Yeats and S. R. Eddy (2004). The Pfam protein families database. Nucleic Acids Res 32(Database issue): D138-41. Baum, L. E. (1972). An equality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3(1): 1-8. Bendtsen, J. D., H. Nielsen, G. von Heijne and S. Brunak (2004). Improved prediction of signal peptides: SignalP 3.0. J Mol Biol 340(4): 78395. Blobel, G. (1980). Intracellular protein topogenesis. Proc Natl Acad Sci U S A 77(3): 1496-500. Boutet, E., D. Lieberherr, M. Tognolli, M. Schneider and A. Bairoch (2007). UniProtKB/Swiss-Prot: The Manually Annotated Section of the UniProt KnowledgeBase. Methods Mol Biol 406: 89-112. Brian, D. R. and N. L. Hjort (1995). Pattern Recognition and Neural Networks, Cambridge University Press. Canfield, V. A. and R. Levenson (1993). Transmembrane organization of the Na,K-ATPase determined by epitope addition. Biochemistry 32(50): 13782-6. Chamberlain, A. K., Y. Lee, S. Kim and J. U. Bowie (2004). Snorkeling preferences foster an amino acid composition bias in transmembrane helices. J Mol Biol 339(2): 471-9. Chang, X. B., Y. X. Hou, T. J. Jensen and J. R. Riordan (1994). Mapping of cystic fibrosis transmembrane conductance regulator membrane topology by glycosylation site insertion. J Biol Chem 269(28): 185725. 48 Chatterjee, S. and S. Mayor (2001). The GPI-anchor and protein sorting. Cell Mol Life Sci 58(14): 1969-87. Cherezov, V., D. M. Rosenbaum, M. A. Hanson, S. G. Rasmussen, F. S. Thian, T. S. Kobilka, H. J. Choi, P. Kuhn, W. I. Weis, B. K. Kobilka and R. C. Stevens (2007). High-resolution crystal structure of an engineered human beta2-adrenergic G protein-coupled receptor. Science 318(5854): 1258-65. Choi, S., J. Jeon, J. S. Yang and S. Kim (2008). Common occurrence of internal repeat symmetry in membrane proteins. Proteins 71(1): 68-80. Choma, C., H. Gratkowski, J. D. Lear and W. F. DeGrado (2000). Asparagine-mediated self-association of a model transmembrane helix. Nat Struct Biol 7(2): 161-6. Churchill, G. A. (1989). Stochastic models for heterogeneous DNA sequences. Bull Math Biol 51(1): 79-94. Cordes, F. S., J. N. Bright and M. S. Sansom (2002). Proline-induced distortions of transmembrane helices. J Mol Biol 323(5): 951-60. Daley, D. O., M. Rapp, E. Granseth, K. Melen, D. Drew and G. von Heijne (2005). Global topology analysis of the Escherichia coli inner membrane proteome. Science 308(5726): 1321-3. Dan, N. and S. A. Safran (1998). Effect of lipid characteristics on the structure of transmembrane proteins. Biophys J 75(3): 1410-4. Dayhoff, M. O. (1978). A model of evolutionary change in proteins. Natl. Biomed. Res. Found. 5(Suppl. 3): 345-52. de Planque, M. R., B. B. Bonev, J. A. Demmers, D. V. Greathouse, R. E. Koeppe, 2nd, F. Separovic, A. Watts and J. A. Killian (2003). Interfacial anchor properties of tryptophan residues in transmembrane peptides can dominate over hydrophobic matching effects in peptide-lipid interactions. Biochemistry 42(18): 5341-8. Deisenhofer, J., O. Epp, K. Miki, R. Huber and H. Michel (1985). Structure of the protein subunits in the photosynthetic reaction centre of Rhodospeudomonas viridis at 3Å resolution. Nature 318(6047): 61824. Dorairaj, S. and T. W. Allen (2007). On the thermodynamic stability of a charged arginine side chain in a transmembrane helix. Proc Natl Acad Sci U S A 104(12): 4943-8. Drew, D., L. Froderberg, L. Baars and J. W. de Gier (2003). Assembly and overexpression of membrane proteins in Escherichia coli. Biochim Biophys Acta 1610(1): 3-10. Drew, D., D. Sjostrand, J. Nilsson, T. Urbig, C. N. Chin, J. W. de Gier and G. von Heijne (2002). Rapid topology mapping of Escherichia coli inner-membrane proteins by prediction and PhoA/GFP fusion analysis. Proc Natl Acad Sci U S A 99(5): 2690-5. Drews, J. (1996). Genomic sciences and the medicine of tomorrow. Nat Biotechnol 14(11): 1516-8. Dutzler, R., E. B. Campbell and R. MacKinnon (2003). Gating the selectivity filter in ClC chloride channels. Science 300(5616): 108-12. 49 Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics 14(9): 755-63. Eddy, S. R., G. Mitchison and R. Durbin (1995). Maximum discrimination hidden Markov models of sequence consensus. J Comput Biol 2(1): 9-23. Engelman, D. M. (1982). An implication of the structure of bacteriorhodopsin - Globular membrane proteins are stabilized by polar interactions. Biophys J 37(1): 187-88. Engelman, D. M., T. A. Steitz and A. Goldman (1986). Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu Rev Biophys Biophys Chem 15: 321-53. Epand, R. M. (1998). Lipid polymorphism and protein-lipid interactions. Biochim Biophys Acta 1376(3): 353-68. Feller, S. E., D. Yin, R. W. Pastor and A. D. MacKerell, Jr. (1997). Molecular dynamics simulation of unsaturated lipid bilayers at low hydration: parameterization and comparison with diffraction studies. Biophys J 73(5): 2269-79. Fischer, D. and D. Eisenberg (1996). Protein fold recognition using sequence-derived predictions. Protein Sci 5(5): 947-55. Fischer, K., A. Weber, S. Brink, B. Arbinger, D. Schunemann, S. Borchert, H. W. Heldt, B. Popp, R. Benz, T. A. Link and et al. (1994). Porins from plants. Molecular cloning and functional characterization of two new members of the porin family. J Biol Chem 269(41): 2575460. Forrest, L. R., C. L. Tang and B. Honig (2006). On the accuracy of homology modeling and sequence alignment methods applied to membrane proteins. Biophys J 91(2): 508-17. Frishman, D. and P. Argos (1992). Recognition of distantly related protein sequences using conserved motifs and neural networks. J Mol Biol 228(3): 951-62. Galdiero, S., M. Galdiero and C. Pedone (2007). beta-Barrel membrane bacterial proteins: structure, function, assembly and interaction with lipids. Curr Protein Pept Sci 8(1): 63-82. Garrow, A. G., A. Agnew and D. R. Westhead (2005). TMB-Hunt: an amino acid composition based method to screen proteomes for beta-barrel transmembrane proteins. BMC Bioinformatics 6: 56. Granseth, E., G. von Heijne and A. Elofsson (2005). A study of the membrane-water interface region of membrane proteins. J Mol Biol 346(1): 377-85. Gray, T. M. and B. W. Matthews (1984). Intrahelical hydrogen bonding of serine, threonine and cysteine residues within alpha-helices and its relevance to membrane-bound proteins. J Mol Biol 175(1): 75-81. Guidotti, G. (1972). Membrane proteins. Annu Rev Biochem 41: 731-52. Hargbo, J. and A. Elofsson (1999). Hidden Markov models that use predicted secondary structures for fold recognition. Proteins 36(1): 6876. 50 Harries, W. E., D. Akhavan, L. J. Miercke, S. Khademi and R. M. Stroud (2004). The channel architecture of aquaporin 0 at a 2.2-A resolution. Proc Natl Acad Sci U S A 101(39): 14045-50. Hedman, M., H. Deloof, G. Von Heijne and A. Elofsson (2002). Improved detection of homologous membrane proteins by inclusion of information from topology predictions. Protein Sci 11(3): 652-8. Helms, V. (2002). Attraction within the membrane. Forces behind transmembrane protein folding and supramolecular complex assembly. EMBO Rep 3(12): 1133-8. Henderson, R., J. M. Baldwin, T. A. Ceska, F. Zemlin, E. Beckmann and K. H. Downing (1990). Model for the structure of bacteriorhodopsin based on high-resolution electron cryo-microscopy. J Mol Biol 213(4): 899-929. Hengartner, M. O. (2000). The biochemistry of apoptosis. Nature 407(6805): 770-6. Henikoff, S. and J. G. Henikoff (1992). Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89(22): 10915-9. Hessa, T., H. Kim, K. Bihlmaier, C. Lundin, J. Boekel, H. Andersson, I. Nilsson, S. H. White and G. von Heijne (2005). Recognition of transmembrane helices by the endoplasmic reticulum translocon. Nature 433(7024): 377-81. Hessa, T., N. M. Meindl-Beinker, A. Bernsel, H. Kim, Y. Sato, M. LerchBader, I. Nilsson, S. H. White and G. von Heijne (2007). Molecular code for transmembrane-helix recognition by the Sec61 translocon. Nature 450(7172): 1026-30. Honig, B. and A. S. Yang (1995). Free energy balance in protein folding. Adv Protein Chem 46: 27-58. Hopkins, A. L. and C. R. Groom (2002). The druggable genome. Nat Rev Drug Discov 1(9): 727-30. Horn, F., E. Bettler, L. Oliveira, F. Campagne, F. E. Cohen and G. Vriend (2003). GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res 31(1): 294-7. Jelinek, F., L. Bahl and R. Mercer (1975). Design of a linuistic statistical decoder for the recognition of continuous speech. IEEE TIT 21(3): 250-256. Joh, N. H., A. Min, S. Faham, J. P. Whitelegge, D. Yang, V. L. Woods and J. U. Bowie (2008). Modest stabilization by most hydrogen-bonded side-chain interactions in membrane proteins. Nature 453(7199): 1266-70. Johansson, A. C. and E. Lindahl (2006). Amino-acid solvation structure in transmembrane helices from molecular dynamics simulations. Biophys J 91(12): 4450-63. Jones, D. T. (2007). Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics 23(5): 538-44. 51 Jones, D. T., W. R. Taylor and J. M. Thornton (1994). A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry 33(10): 3038-49. Käll, L., A. Krogh and E. L. Sonnhammer (2004). A combined transmembrane topology and signal peptide prediction method. J Mol Biol 338(5): 1027-36. Käll, L., A. Krogh and E. L. Sonnhammer (2005). An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics 21 Suppl 1: i251-7. Keyes, D. (1966). Flowers for Algernon, Harcourt. Kim, H., K. Melen, M. Osterberg and G. von Heijne (2006). A global topology map of the Saccharomyces cerevisiae membrane proteome. Proc Natl Acad Sci U S A 103(30): 11142-7. Kim, S., T. J. Jeon, A. Oberai, D. Yang, J. J. Schmidt and J. U. Bowie (2005). Transmembrane glycine zippers: physiological and pathological roles in membrane proteins. Proc Natl Acad Sci U S A 102(40): 14278-83. Kimura, T., M. Ohnuma, T. Sawai and A. Yamaguchi (1997). Membrane topology of the transposon 10-encoded metal-tetracycline/H+ antiporter as studied by site-directed chemical labeling. J Biol Chem 272(1): 580-5. Krogh, A. (1997). Two methods for improving performance of an HMM and their application for gene finding. Proc Int Conf Intell Syst Mol Biol 5: 179-86. Krogh, A., M. Brown, I. S. Mian, K. Sjolander and D. Haussler (1994). Hidden Markov models in computational biology. Applications to protein modeling. J Mol Biol 235(5): 1501-31. Krogh, A., B. Larsson, G. von Heijne and E. L. Sonnhammer (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305(3): 567-80. Kyte, J. and R. F. Doolittle (1982). A simple method for displaying the hydropathic character of a protein. J Mol Biol 157(1): 105-32. Lasso, G., J. F. Antoniw and J. G. Mullins (2006). A combinatorial pattern discovery approach for the prediction of membrane dipping (reentrant) loops. Bioinformatics 22(14): e290-7. Lee, S., B. Lee, I. Jang, S. Kim and J. Bhak (2006). Localizome: a server for identifying transmembrane topologies and TM helices of eukaryotic proteins utilizing domain information. Nucleic Acids Res 34(Web Server issue): W99-103. Lemmon, M. A., J. M. Flanagan, H. R. Treutlein, J. Zhang and D. M. Engelman (1992). Sequence specificity in the dimerization of transmembrane alpha-helices. Biochemistry 31(51): 12719-25. Letunic, I., R. R. Copley, S. Schmidt, F. D. Ciccarelli, T. Doerks, J. Schultz, C. P. Ponting and P. Bork (2004). SMART 4.0: towards genomic data integration. Nucleic Acids Res 32(Database issue): D142-4. Lindahl, E. and A. Elofsson (2000). Identification of related proteins on family, superfamily and fold level. J Mol Biol 295(3): 613-25. 52 Lins, L. and R. Brasseur (1995). The hydrophobic effect in protein folding. FASEB J 9(7): 535-40. Liu, Y., D. M. Engelman and M. Gerstein (2002). Genomic analysis of membrane protein families: abundance and conserved motifs. Genome Biol 3(10): research0054. Liu, Y., M. Gerstein and D. M. Engelman (2004). Transmembrane protein domains rarely use covalent domain recombination as an evolutionary mechanism. Proc Natl Acad Sci U S A 101(10): 3495-7. Lomize, M. A., A. L. Lomize, I. D. Pogozheva and H. I. Mosberg (2006). OPM: orientations of proteins in membranes database. Bioinformatics 22(5): 623-5. Manoil, C. and J. Beckwith (1986). A genetic approach to analyzing membrane protein topology. Science 233(4771): 1403-8. Meindl-Beinker, N. M., C. Lundin, I. Nilsson, S. H. White and G. von Heijne (2006). Asn- and Asp-mediated interactions between transmembrane helices during translocon-mediated membrane protein assembly. EMBO Rep 7(11): 1111-6. Mitaku, S. and T. Hirokawa (1999). Physicochemical factors for discriminating between soluble and membrane proteins: hydrophobicity of helical segments and protein length. Protein Eng 12(11): 953-7. Mittelman, D., R. Sadreyev and N. Grishin (2003). Probabilistic scoring measures for profile-profile comparison yield more accurate short seed alignments. Bioinformatics 19(12): 1531-9. Mott, R., J. Schultz, P. Bork and C. P. Ponting (2002). Predicting protein cellular localization using a domain projection method. Genome Res 12(8): 1168-74. Murata, K., K. Mitsuoka, T. Hirai, T. Walz, P. Agre, J. B. Heymann, A. Engel and Y. Fujiyoshi (2000). Structural determinants of water permeation through aquaporin-1. Nature 407(6804): 599-605. Needleman, S. B. and C. D. Wunsch (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3): 443-53. Ng, P. C., J. G. Henikoff and S. Henikoff (2000). PHAT: a transmembranespecific substitution matrix. Predicted hydrophobic and transmembrane. Bioinformatics 16(9): 760-6. Nilsson, I. and G. von Heijne (1990). Fine-tuning the topology of a polytopic membrane protein: role of positively and negatively charged amino acids. Cell 62(6): 1135-41. Nilsson, J., B. Persson and G. von Heijne (2005). Comparative analysis of amino acid distributions in integral membrane proteins from 107 genomes. Proteins 60(4): 606-16. Ohlson, T., B. Wallner and A. Elofsson (2004). Profile-profile methods provide improved fold-recognition: a study of different profile-profile alignment methods. Proteins 57(1): 188-97. Overbeek, R. (2000). Genomics: what is realistically achievable? Genome Biol 1(2): c2002.1-3. 53 Park, J., K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard and C. Chothia (1998). Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol 284(4): 1201-10. Park, Y. and V. Helms (2008). Prediction of the translocon-mediated membrane insertion free energies of protein sequences. Bioinformatics 24(10): 1271-7. Pearson, W. R. (1990). Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol 183: 63-98. Qian, N. and T. J. Sejnowski (1988). Predicting the secondary structure of globular proteins using neural network models. J Mol Biol 202(4): 865-84. Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77(2): 257-86. Rapoport, T. A., V. Goder, S. U. Heinrich and K. E. Matlack (2004). Membrane-protein integration and the role of the translocation channel. Trends Cell Biol 14(10): 568-75. Rapp, M., E. Granseth, S. Seppala and G. von Heijne (2006). Identification and evolution of dual-topology membrane proteins. Nat Struct Mol Biol 13(2): 112-6. Rapp, M., S. Seppala, E. Granseth and G. von Heijne (2007). Emulating membrane protein evolution by rational design. Science 315(5816): 1282-4. Rice, D. W. and D. Eisenberg (1997). A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence. J Mol Biol 267(4): 1026-38. Rosenbaum, D. M., V. Cherezov, M. A. Hanson, S. G. Rasmussen, F. S. Thian, T. S. Kobilka, H. J. Choi, X. J. Yao, W. I. Weis, R. C. Stevens and B. K. Kobilka (2007). GPCR engineering yields highresolution structural insights into beta2-adrenergic receptor function. Science 318(5854): 1266-73. Rost, B., P. Fariselli and R. Casadio (1996). Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci 5(8): 1704-18. Rychlewski, L., L. Jaroszewski, W. Li and A. Godzik (2000). Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci 9(2): 232-41. Sansom, M. S. and H. Weinstein (2000). Hinges, swivels and switches: the role of prolines in signalling via transmembrane alpha-helices. Trends Pharmacol Sci 21(11): 445-51. Schleiff, E., L. A. Eichacker, K. Eckart, T. Becker, O. Mirus, T. Stahl and J. Soll (2003). Prediction of the plant beta-barrel proteome: a case study of the chloroplast outer envelope. Protein Sci 12(4): 748-59. Shental-Bechor, D., S. J. Fleishman and N. Ben-Tal (2006). Has the code for protein translocation been broken? Trends Biochem Sci 31(4): 192-6. Simon, S. M. and G. Blobel (1991). A protein-conducting channel in the endoplasmic reticulum. Cell 65(3): 371-80. 54 Simons, K. and E. Ikonen (1997). Functional rafts in cell membranes. Nature 387(6633): 569-72. Singer, S. J. and G. L. Nicolson (1972). The fluid mosaic model of the structure of cell membranes. Science 175(23): 720-31. Smith, T. F. and M. S. Waterman (1981). Identification of common molecular subsequences. J Mol Biol 147(1): 195-7. Styczynski, M. P., K. L. Jensen, I. Rigoutsos and G. Stephanopoulos (2008). BLOSUM62 miscalculations improve search performance. Nat Biotechnol 26(3): 274-5. Sugiyama, Y., N. Polulyakh and T. Shimizu (2003). Identification of transmembrane protein functions by binary topology patterns. Protein Eng 16(7): 479-88. Torres, J., T. J. Stevens and M. Samso (2003). Membrane proteins: the 'Wild West' of structural biology. Trends Biochem Sci 28(3): 137-44. Treutlein, H. R., M. A. Lemmon, D. M. Engelman and A. T. Brunger (1992). The glycophorin A transmembrane domain dimer: sequence-specific propensity for a right-handed supercoil of helices. Biochemistry 31(51): 12726-32. Tusnady, G. E., L. Kalmar, H. Hegyi, P. Tompa and I. Simon (2008). TOPDOM: database of domains and motifs with conservative location in transmembrane proteins. Bioinformatics 24(12): 1469-70. Tusnady, G. E. and I. Simon (2001). The HMMTOP transmembrane topology prediction server. Bioinformatics 17(9): 849-50. Ulmschneider, M. B., M. S. Sansom and A. Di Nola (2005). Properties of integral membrane protein structures: derivation of an implicit membrane potential. Proteins 59(2): 252-65. Unwin, N. (2005). Refined structure of the nicotinic acetylcholine receptor at 4A resolution. J Mol Biol 346(4): 967-89. Van den Berg, B., W. M. Clemons, Jr., I. Collinson, Y. Modis, E. Hartmann, S. C. Harrison and T. A. Rapoport (2004). X-ray structure of a protein-conducting channel. Nature 427(6969): 36-44. Viklund, H. and A. Elofsson (2004). Best alpha-helical transmembrane protein topology predictions are achieved using hidden Markov models and evolutionary information. Protein Sci 13(7): 1908-17. Viklund, H. and A. Elofsson (2008). OCTOPUS: Improving topology prediction by two-track ANN-based preference scores and an extended topological grammar. Bioinformatics. Viklund, H., E. Granseth and A. Elofsson (2006). Structural classification and prediction of reentrant regions in alpha-helical transmembrane proteins: application to complete genomes. J Mol Biol 361(3): 591603. Viterbi, A. J. (1967). Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. IEEE TIT 13(2): 260-269. von Heijne, G. (1984). Analysis of the distribution of charged residues in the N-terminal region of signal sequences: implications for protein export in prokaryotic and eukaryotic cells. EMBO J 3(10): 2315-8. 55 von Heijne, G. (1989). Control of topology and mode of assembly of a polytopic membrane protein by positively charged residues. Nature 341(6241): 456-8. von Heijne, G. (1991). Proline kinks in transmembrane alpha-helices. J Mol Biol 218(3): 499-503. von Heijne, G. (1992). Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J Mol Biol 225(2): 487-94. von Heijne, G. (2006). Membrane-protein topology. Nat Rev Mol Cell Biol 7(12): 909-18. von Heijne, G. and Y. Gavel (1988). Topogenic signals in integral membrane proteins. Eur J Biochem 174(4): 671-8. Wallin, E. and G. von Heijne (1998). Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci 7(4): 1029-38. Weiss, K. M. (2005). The phenogenetic logic of life. Nat Rev Genet 6(1): 3645. White, S. H. (2004). The progress of membrane protein structure determination. Protein Sci 13(7): 1948-9. White, S. H. (2007). Membrane protein insertion: the biology-physics nexus. J Gen Physiol 129(5): 363-9. White, S. H. and G. von Heijne (2004). The machinery of membrane protein assembly. Curr Opin Struct Biol 14(4): 397-404. White, S. H. and G. von Heijne (2005). Transmembrane helices before, during, and after insertion. Curr Opin Struct Biol 15(4): 378-86. White, S. H. and W. C. Wimley (1999). Membrane protein folding and stability: physical principles. Annu Rev Biophys Biomol Struct 28: 31965. Wiener, M. C. and S. H. White (1992). Structure of a fluid dioleoylphosphatidylcholine bilayer determined by joint refinement of x-ray and neutron diffraction data. III. Complete structure. Biophys J 61(2): 434-47. Wistrand, M., L. Kall and E. L. Sonnhammer (2006). A general model of G protein-coupled receptor sequences and its application to detect remote homologs. Protein Sci 15(3): 509-21. Yau, W. M., W. C. Wimley, K. Gawrisch and S. H. White (1998). The preference of tryptophan for membrane interfaces. Biochemistry 37(42): 14713-8. Yohannan, S., S. Faham, D. Yang, J. P. Whitelegge and J. U. Bowie (2004). The evolution of transmembrane helix kinks and the structural diversity of G protein-coupled receptors. Proc Natl Acad Sci U S A 101(4): 959-63. Zhao, G. and E. London (2006). An amino acid "transmembrane tendency" scale that approaches the theoretical limit to accuracy for prediction of transmembrane helices: relationship to biological hydrophobicity. Protein Sci 15(8): 1987-2001. 56 Zhou, F. X., M. J. Cocco, W. P. Russ, A. T. Brunger and D. M. Engelman (2000). Interhelical hydrogen bonding drives strong interactions in membrane proteins. Nat Struct Biol 7(2): 154-60. 57