CHAPTER 3
STRUCTURAL ELEMENTS OF THE PROTEINS

The secondary structure elements of proteins can be divided into three main types (Fig. 1):

• α-helices
• β-strands
• loops

Figure 1. Secondary structures.

A schematic representation of the secondary structure elements gives a clear and quick overview of the 3D structure of a protein: helices are drawn as cylinders or spirals, strands as arrows pointing from the N- to the C-terminus, and loops as ribbons. This representation has also made it possible to identify the super-secondary structures and the structural motifs present in proteins.

In a protein, peptide segments of various lengths fold to form regular structures. These structures are widespread because they are stable: they minimize steric repulsion and maximize the number of hydrogen bonds that can form. The most common secondary structure is the α-helix, followed by the β-strand; β-strands usually interact with one another to form a β-sheet. An α-helix typically lies on the surface of the protein, with one face toward the solvent and the other toward the protein interior. Such a helix is called amphipathic, and its primary sequence shows a regular alternation of hydrophobic and hydrophilic amino acids. The β-sheet is a stable structure made up of several strands, which can be parallel or antiparallel depending on the relative direction of the strands.

α-helix

The α-helix is the most frequent fold that a peptide chain adopts, as shown by the fact that it is the most common secondary structure element in proteins. It is a regular structure characterized by well-defined parameters. It forms when a series of consecutive residues have dihedral angles ɸ and ψ of about -60° and -50°, respectively. The peptide planes are then arranged helically around a longitudinal axis.
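The dihedral criterion above can be turned into a small check. The following is a minimal sketch, not a standard assignment algorithm: the tolerance of 30° and the minimum run length of four residues are illustrative assumptions, and the function names are hypothetical.

```python
from typing import List, Tuple

# Ideal alpha-helix dihedral angles quoted in the text (degrees).
PHI_HELIX = -60.0
PSI_HELIX = -50.0


def is_helical(phi: float, psi: float, tol: float = 30.0) -> bool:
    """True if (phi, psi) lies within `tol` degrees of the ideal helix values.

    The tolerance is an assumption made for illustration only.
    """
    return abs(phi - PHI_HELIX) <= tol and abs(psi - PSI_HELIX) <= tol


def helical_segments(
    angles: List[Tuple[float, float]], min_len: int = 4
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for every run of at
    least `min_len` consecutive residues with helical dihedral angles."""
    segments = []
    start = None
    for i, (phi, psi) in enumerate(angles):
        if is_helical(phi, psi):
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_len:
                segments.append((start, i))
            start = None
    # Close a run that reaches the end of the chain.
    if start is not None and len(angles) - start >= min_len:
        segments.append((start, len(angles)))
    return segments


# Hypothetical backbone: two extended residues, five helical ones, one extended.
example = [(-120.0, 130.0)] * 2 + [(-62.0, -45.0)] * 5 + [(-120.0, 130.0)]
print(helical_segments(example))
```

The example angles are invented for the demonstration; with real data the (ɸ, ψ) pairs would come from a solved structure.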
The α-helix has a pitch of 5.4 Å, and each turn of the helix contains 3.6 residues (Fig. 2).

Figure 2. Schematic representation of the α-helix structure highlighting the hydrogen bonds.

The high stability of this conformation is due to the fact that all the NH and CO groups of the peptide bonds, except those at the ends of the helix, are involved in hydrogen bonds. Each hydrogen bond forms between the oxygen of the CO group of one residue and the hydrogen of the NH group of the residue four positions further along the chain. The hydrogen bonds run almost parallel to the helix axis. In proteins the α-helix is usually right-handed, because the amino acids have the L configuration and, in a left-handed helix, the side chains R would lie too close to the CO groups and destabilize the structure.

The side chains R of the amino acids all point toward the outside of the helix. The physicochemical characteristics of these groups influence how the α-helices pack against one another to build the tertiary structure of the protein. Some amino acids are considered "good starters" of the α-helix; others, such as proline, can destabilize the helix and distort it.

β-sheet

Like the α-helix, the β-strand is characterized by a regular conformation. Sections of the peptide chain in β conformation are arranged so that the peptide planes form a "zig-zag". The side chains of the residues are oriented perpendicular to the median plane of the structure, pointing alternately to one side and the other of the plane (Fig. 3).

Figure 3. Schematic representation of the β-strand.

In proteins, two or more β-strands usually lie side by side and form hydrogen bonds with one another, generating an extended, pleated structure called a β-sheet. In β-sheets the hydrogen bonds form between adjacent strands (Fig. 4) rather than within the same segment as in the α-helix.
Generally, β-sheets are not planar but tend to adopt a curved, slightly twisted shape.

Figure 4. Schematic representation of the parallel β-sheet.

Antiparallel β-sheets are the most stable arrangement. Their high stability is due to the linear hydrogen bonds formed between the main chains. Figure 5 shows that the CO and NH groups are aligned: the acceptor-hydrogen-donor angle is 180°, and the hydrogen bond is very stable.

Figure 5. Top: antiparallel β-sheet; bottom: parallel β-sheet.

Parallel β-sheets are less stable because their hydrogen bonds are not linear. Structures containing only parallel β-sheets are therefore relatively rare. To reach a high degree of stability, parallel β-sheets must adopt particular conformations, whereas β-barrel structures made of antiparallel β-strands are the most frequent architectures.

Loops

In addition to the two regular elements described above, proteins contain apparently disordered peptide segments of variable length. These segments, called loops, connect α-helices and β-strands and play an important role in organizing the 3D structure of the peptide chain. They are relatively flexible and allow the chain to change direction between segments in α and β conformations. Short loops of 3-5 residues connecting two consecutive antiparallel β-strands (β-turns) are very common in proteins (Figure 6).

Figure 6. Left: graph of hairpin loops in many different proteins. Right: the two types of hairpin loop most frequent in protein structures.

Moreover, loops are often involved in forming the active site of enzymes. Amino acids such as glycine and proline are very frequent in loop regions; their effects on the conformation of the chain have been described in previous chapters.
The presence of secondary structures connected by loops of different lengths leads to the concept of topology, defined as the way in which the different secondary structure elements are connected to one another.

Topological diagrams

Topological diagrams are very useful for representing the connections between the secondary structure elements within a protein. For example, β-strands can be connected in several topologies. Identifying the topology of a protein is very important, because two proteins can have the same fold, and therefore a comparable three-dimensional structure, only if they have the same topology. Figure 7 shows three types of β-sheet with different connections.

Figure 7. Topological diagrams of some β-sheets: a) β-sheet with 4 antiparallel strands; b) β-sheet with 5 parallel strands; c) β-barrel structure with 8 antiparallel strands.

Panel a) shows a simple connection called "up and down", in which the C-terminal end of a β-strand is linked to the N-terminal end of the next strand through a short loop. Panel b) shows the connections among parallel β-strands linked by long loops. Because these loops are long, an α-helix can be found between one strand and the next, and this type of connection is called βαβ. Panel c) illustrates a mixed connection, with hairpin connections (left and right) and a Greek key motif (center).

Structural motifs

The combination of α and β structures gives rise to simple structural motifs which, once assembled, can generate complex three-dimensional structures. In general, a complex protein structure can be described as the sum of basic structural motifs (Fig. 8). The most frequent structural motifs are:

1. helix-loop-helix: present in many Ca2+ binding proteins (calmodulin, parvalbumin, troponin C) and in DNA binding proteins.
2. β hairpin: two antiparallel β-strands held together by a short loop of 2-5 residues.
3. Greek key: at least four β-strands, two short loops and one long loop are needed to generate this motif.
4. β-α-β: two parallel β-strands with one α-helix between them.

Figure 8. Representation of some frequent structural motifs.

Helix-loop-helix motif

Figure 9 shows the helix-loop-helix motif specific for DNA binding (left) and for Ca2+ binding proteins (right).

Figure 9. Two types of helix-loop-helix structural motif. Left: a typical DNA binding motif; right: a typical Ca2+ binding motif.

The Ca2+ binding motif was first identified in parvalbumin. It is also called the "EF hand", because the E and F helices are the regions of the protein used to describe the Ca2+ binding site (Fig. 10).

Figure 10. Schematic representation of the Ca2+ binding motif.

Comparing the motif with a hand, helix E runs along the index finger from its base, the 12-residue loop is represented by the middle finger, and helix F runs from the base of the thumb toward its tip. In general, the Ca2+ ligands are located on the loop connecting the two helices, which are almost perpendicular to each other. The motif thus comprises two helices, E and F, flanking a loop of 12 contiguous residues, five of which bind the Ca2+. Their side chains carry oxygen atoms, the most favourable ligands for Ca2+, which has a high coordination number (between 6 and 8). The side chains of aspartate and glutamate residues are the preferred ones.
The sequences of the EF motif, reported in figure 11, show some conserved positions that define the consensus sequence: residue 6 must be a glycine, the Ca2+ binding amino acids (shown in orange) must have side chains that can act as ligands, and the residues forming the hydrophobic core are shown in green.

Figure 11. Consensus sequences of the EF hand motifs in three different proteins.

Figure 12 shows the helix-loop-helix motif found in transcription factors that interact with DNA. The recognition helix, shown in red, contains positively charged residues that interact with DNA; the other helix has a structural role.

Figure 12. Typical DNA binding helix-loop-helix motif.

β hairpin

The β hairpin is the simplest structural motif formed by β-strands. It consists of two adjacent antiparallel β-strands joined by a loop of 2 to 5 residues. This motif can be found either in isolation or as part of complex β-sheet structures. Figure 13 illustrates this: in bovine trypsin inhibitor the β hairpin occurs as an isolated motif, whereas in erabutoxin, a toxin present in snake venom, it is part of a complex β-sheet formed by two β hairpins and one β-strand.

Figure 13. Left: bovine trypsin inhibitor; right: the two hairpin motifs present in erabutoxin.

Greek key motif

The Greek key motif is believed to have arisen from a modification of the β hairpin, specifically from a long hairpin that subsequently refolded in the central part of the structure. This is why a β-strand is connected to another β-strand three strands away (Fig. 14). Many proteins contain this motif, especially those with an antiparallel β-barrel.

Figure 14. Greek key motif representation.
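The difference between a plain hairpin arrangement and a Greek key can be read from the order in which the strands sit side by side in the sheet. The sketch below is only illustrative: it assumes strands are numbered along the chain, that the sheet is given as a left-to-right list of those numbers, and that the order 4-1-2-3 is a fair stand-in for a four-stranded Greek key. The function names are hypothetical.

```python
from typing import List, Tuple


def neighbour_separations(spatial_order: List[int]) -> List[Tuple[int, int, int]]:
    """For each pair of strands that are adjacent in the sheet, return
    (strand_a, strand_b, separation along the chain)."""
    return [
        (a, b, abs(a - b))
        for a, b in zip(spatial_order, spatial_order[1:])
    ]


def has_greek_key_pairing(spatial_order: List[int]) -> bool:
    """True if some pair of neighbouring strands is three strands apart
    in the sequence, the signature described in the text."""
    return any(sep == 3 for _, _, sep in neighbour_separations(spatial_order))


# Assumed strand orders, for illustration only.
meander = [1, 2, 3, 4]     # consecutive hairpins ("up and down")
greek_key = [4, 1, 2, 3]   # strand 4 packs next to strand 1

print(neighbour_separations(greek_key))
print(has_greek_key_pairing(meander), has_greek_key_pairing(greek_key))
```

In the hairpin-only sheet every neighbouring pair is consecutive in sequence, whereas in the assumed Greek key order one pair (strands 1 and 4) is three strands apart.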
β-α-β motif

The β-α-β motif connects two parallel strands: it consists of two parallel β-strands joined by one α-helix and two short loops (Fig. 15).

Figure 15. β-α-β motif.

In this motif, the α-helix connects the carboxyl end of one strand to the amino end of the next strand. The helix is closely associated with the two strands through hydrophobic interactions. Several such motifs connected together give rise to complex protein structures. The loop linking the carboxyl terminus of the β-strand to the amino terminus of the α-helix is often involved in forming the active site. The β-α-β motif is always right-handed, which places the α-helix correctly above the plane formed by the two strands (Fig. 16). All known proteins have the β-α-β motif in the right-handed form, with the exception of subtilisin.

Figure 16. Right-handed connection (a) and left-handed connection (b) of the β-α-β motif.

The protein domains

Structural motifs and secondary structures assemble with one another according to preferred arrangements. This section presents the preferred assemblies formed by α structures, α-β structures and β structures.

Structure and alpha domain

Supercoiled helices (coiled coil)

α-helices can adopt different arrangements, but one of the most frequent is that of parallel supercoiled helices. When two α-helices adopt the supercoiled configuration, the number of residues per turn changes from 3.6 to 3.5. The helices then show a "heptad repeat" sequence, in which a leucine residue occurs every seven residues. In figure 17 the seven amino acids are labelled a, b, c, d, e, f and g, where d is a leucine. Moreover, the helices contact each other every 3.5 residues, and the amino acid at this position is always non-polar, usually a valine.

Figure 17. Scheme and amino acid sequence of two supercoiled helices.
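The heptad labelling can be sketched in a few lines. This is a minimal illustration under stated assumptions: the sequence is hypothetical and built so that position d carries leucine, and placing the valine at position a follows the usual convention rather than anything stated in the text. The function names are likewise invented for the example.

```python
from typing import List, Tuple

HEPTAD = "abcdefg"  # the seven positions of the repeat

# Seven residues span two turns of the supercoiled helix.
RESIDUES_PER_TURN = len(HEPTAD) / 2


def heptad_register(sequence: str, first_position: str = "a") -> List[Tuple[str, str]]:
    """Pair each residue with its heptad position, starting the register
    at `first_position` for the first residue."""
    offset = HEPTAD.index(first_position)
    return [
        (residue, HEPTAD[(offset + i) % len(HEPTAD)])
        for i, residue in enumerate(sequence)
    ]


def residues_at(sequence: str, position: str, first_position: str = "a") -> List[str]:
    """Collect the residues that fall on one heptad position."""
    return [
        residue
        for residue, pos in heptad_register(sequence, first_position)
        if pos == position
    ]


# Hypothetical three-heptad sequence (not taken from a real protein).
sequence = "VAALEEK" * 3
print(RESIDUES_PER_TURN)
print(residues_at(sequence, "d"))
print(residues_at(sequence, "a"))
```

With a real coiled coil the register is not known in advance; one would try the seven possible values of `first_position` and keep the one that puts the most hydrophobic residues on a and d.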
The interactions between the supercoiled helices are strengthened by hydrophobic contacts and by electrostatic interactions between the amino acids located near the hydrophobic residues. These amino acids, such as the residues in positions g and e (Fig. 18), carry opposite charges, which improves the interaction between the helices.

Figure 18. Packing of the residues involved in supercoiled helices and role of the electrostatic interactions.

α helical bundle

The α helical bundle is often found in domains made up of α-helices. It consists of two pairs of helices arranged antiparallel to each other. In these domains the helices are strongly amphipathic, as shown in Figure 19: the internal region is strongly hydrophobic, while the external region is hydrophilic.

Figure 19. Scheme of the α helical bundle.

The fold is extremely stable because, in addition to the hydrophobic interactions at the interface of the four helices, each helix has its own intra-helix hydrogen bonds. The α helical bundle occurs as a single domain in monomeric proteins, but it can also be observed as a dimeric or tetrameric motif. Interestingly, the α helical bundle is found in proteins with completely unrelated functions, for example cytochrome b562 and human growth hormone, as illustrated in figure 20.

Figure 20. Cytochrome b562 and human growth hormone.

Figure 21 shows the same domain in Rop, a dimeric protein. In this case each monomer consists of two antiparallel helices, and the two monomers join to form the α helical bundle. The structural architecture is exactly the same, but whereas in the previous examples the α helical bundle was formed by a single polypeptide chain, here the two subunits together form the domain.

Figure 21. α helical bundle in the protein Rop.

Globin folding

The globin fold is one of the most important α-helical structures.
The globin structure consists of a bundle of eight helices, named A to H, connected by short loops and arranged to form a hydrophobic pocket that contains the active site, the heme group. The length of the helices varies: the longest is helix H, with approximately 28 residues, and the shortest is helix C, with about 7 residues (Fig. 22).

Figure 22. Globin structural domain.

The interactions among the helices occur between helices that are not sequential, except for the last two (G and H). The domain cannot be decomposed into a sum of simple structural motifs, but can be described as a wrapping of the helices around the central core in different directions. This type of fold is present in many proteins with related functions, such as myoglobin, phycocyanins and hemoglobins.

α/β domain structures

The β-α-β motif is a simple motif that can generate three different classes of structure: the TIM barrel, the open β-sheet and the horseshoe (Fig. 23).

Figure 23. TIM barrel, open β-sheet and horseshoe structure.

In the TIM barrel the α-helices are located on the outside of a barrel made of parallel β-strands. In the open β-sheet the strands are twisted relative to one another and the α-helices lie on both sides of the sheet plane. The third class consists of leucine-rich sequences in which the β-strands produce a curved β-sheet that is partially shielded from the solvent by the α-helices. The helices are located on only one side of the sheet, and the structure remains open, which gives it the name horseshoe.

The β-α-β motif is the common basic element of these three classes; the structural diversity arises from the different connections. Two β-α-β motifs can be joined in two ways to form one β-sheet of four parallel strands (Fig. 24). Strand β3 can lie next to strand β2, giving the strand order 1234, or next to strand β1, giving the strand order 4312.
The β-α-β motif is always right-handed, so in the first case the helices all lie on the same side of the sheet, producing the TIM barrel or horseshoe structures. In the second case, aligning the strands requires rotating the second motif, so that the helices lie on both sides of the sheet and form an open β-sheet (Fig. 24).

Figure 24. Connection types of two β-α-β motifs.

TIM barrel

The TIM barrel has more structural constraints than the open β-sheet, which in theory can be extended indefinitely. The TIM barrel structure is very frequent in proteins because of its high stability. It is characterized by a defined number of β-strands, generally eight, which act as the staves of a closed barrel surrounded by α-helices. It is one of the largest and most regular conformations, requiring about 200 amino acids. The central part of the β-sheet is composed entirely of hydrophobic amino acids, closely associated with the hydrophobic side chains of the helices that face the β-strands, while the external faces of the helices are hydrophilic (Fig. 25).

Figure 25. The TIM barrel structure and strand sequence.

In the interactions between the α-helices and the β-strands, and in the hydrophobic core of the barrel, the residues Val, Ile and Leu play a predominant role, accounting for approximately 40% of the amino acids. This type of topology clearly illustrates the division between the structural region and the active region of a protein. The barrel is the structural part, while the active site is always located in the loops connecting the C-terminal end of a β-strand to the N-terminal end of the following α-helix (Fig. 26). In general, all proteins have a structural core that is separate from the active site. Accordingly, in the TIM barrel structure the active site is located on the loops connecting the β-strands and the α-helices.
All TIM barrel structures have an enzymatic function. In some cases the barrel makes up the entire protein, while in others the protein is a multi-domain protein. An example is pyruvate kinase (Fig. 26), which folds into multiple domains, one of which is a TIM barrel. In multi-domain proteins the enzyme activity is always associated with the TIM barrel.

Figure 26. Pyruvate kinase structure and position of the active site in the TIM barrel.

Open α-β sheet

Open sheet structures have α-helices on both sides of the sheet. The resulting structure is never closed and can never form a barrel, which would be possible only if the β-strands enclosed the α-helices on one face of the sheet. Moreover, there are always two adjacent β-strands whose connections to the next strand lie on opposite sides of the sheet. At this point the direction of the connections changes, and the active site is always located here, i.e. near the C-terminal ends of the β-strands. Figure 27 shows how strand 1 is connected to strand 2 through an α-helix, and how strand 4 is likewise connected to strand 5 through an α-helix. At the point where the connection reverses (i.e. where the C-terminal ends are located) there is a small cavity, which is the region where the active site can be found. Another feature is that the α-helices are always tightly packed against the sheet through hydrophobic interactions.

Figure 27. Position of the active site in the open sheet structure.

Some examples follow in which the position of the active site can be identified from the protein topology (Fig. 28 and Fig. 29).

Figure 28. Flavodoxin and adenylate kinase structures and position of the clefts where the active site is located.

Figure 29. Hexokinase and phosphoglycerate mutase structures and position of the clefts where the active site is located.

Horseshoe structure

The last α-β structure is called the horseshoe.
Figure 30 shows the scheme of the horseshoe. The architecture is similar to that of the TIM barrel, because all the α-helices are located on the same side of the β-sheet, but the structure is open and takes the typical shape of a horseshoe.

Figure 30. Horseshoe structure domain.

In this structure the number of β-strands is greater than 8, and the main feature is the presence of many leucine residues. These motifs are also called leucine-rich motifs, because the β-strand, the α-helix and the loop contain a high number of leucines that interact in the interior of the structure, forming a strong hydrophobic core that stabilizes it (Fig. 31). The leucine residues at positions 2, 5, 7, 12, 17, 20 and 24 are generally invariant and therefore represent a consensus sequence that allows the horseshoe domain to be identified.

Figure 31. Interaction of leucine residues.

Antiparallel beta domain structures

In antiparallel β structures, the antiparallel β-strands are usually arranged in two β-sheets packed against each other, creating a distorted barrel that constitutes the core of the molecule. However, the barrel is not the only element formed by antiparallel β-strands. Depending on the way the β-strands are connected to each other, these structures can be divided into:

• Up and down structures. This type of connection is very frequent in barrel-shaped structures of 8 β-strands, in which each strand is connected to the next by a short loop (for example, retinol binding protein). Proteins with this topology generally bind bulky, hydrophobic ligands inside their structure.
• Greek key structures. Here too the strands form a barrel. This topology is found in immunoglobulins and in many enzymes.
• Jelly roll structures. These characterize various macromolecules, including viral coat proteins and the hemagglutinin of influenza virus.
Beta barrel structure

In "all β" proteins the β-barrel appears to be the most stable structure. It usually consists of 8 antiparallel β-strands. Eight is the ideal number for forming a barrel, since it gives the most compact arrangement, but barrels with a different number of β-strands also exist. β-barrels can have different topologies and, consequently, different connections. Figure 32 shows a typical barrel, in which the eight strands form the skeleton while the loops accommodate the active site.

Figure 32. Barrel structure of (Cu,Zn) superoxide dismutase with eight antiparallel strands.

The Greek key motif is a frequent topology in β-barrel structures, in which strand n is connected to strand n-3 or n+3 (Fig. 33).

Figure 33. Greek key motif in an antiparallel barrel domain.

The up and down structure is another topology often found in these proteins: the C-terminal end of a strand is connected to the N-terminal end of the next strand, and so on. Figure 34, for example, shows the structure of retinol binding protein; in this case the active site is located within the barrel itself.

Figure 34. Barrel structure of retinol binding protein.

The active site is made up of hydrophobic amino acids from the β-strands. Two sheets overlap to form an antiparallel barrel. In Figure 34, strands 1, 2, 3, 4, 5 and 6 form one sheet, while strands 1, 8, 7, 6 and 5 form the second sheet. Strands 1, 5 and 6 contribute to both sheets.

Another example of a protein with an up and down topology is neuraminidase. The overall structure is complex because the protein is a tetramer (Fig. 35), but decomposing each monomer into its motifs reveals simple structural principles.

Figure 35. The tetrameric structure of neuraminidase.

The resulting structure is not exactly barrel-shaped, because the β-sheets are arranged in a circle around an axis passing through the center of the molecule. The protein contains a total of 1600 amino acids and is involved in the hydrolysis of sialic acid. Each monomer consists of six repeated sheets, each composed of 4 strands connected in an up and down topology (Fig. 36). The six sheets are arranged like the blades of a six-bladed propeller.

Figure 36. Neuraminidase: structure of a monomer and its topology.

The topology of the six sheets in a monomer, and the connections between the different motifs, are identical. Strand 4 of the first sheet is connected to strand 1 of the next sheet, and so on. This produces a molecule with a sixfold pseudo-symmetry, in which the 12 loops are all located on the same side of the molecule. These connecting loops form the active site, and neuraminidase is a clear example of the separation between structural and functional regions. The β-strands form the structural skeleton on which the active site, made up of the loops connecting the strands, is built (Fig. 37).

Figure 37. Neuraminidase and its active site.

Jelly roll domains

The jelly roll is another β-barrel structure. To understand it, it is useful to imagine a strip of paper with four strands along each of its two long edges, where the strands on opposite edges interact with each other (Fig. 38).

Figure 38. Schematic representation of the jelly roll barrel structure.

Imagine also wrapping this strip of paper around a cylinder, so that the β-strands lie along the sides of the cylinder and the loops at its top and bottom.
The antiparallel strands are joined by hydrogen bonds in the pairs 1-8, 2-7, 3-6 and 4-5, and are arranged so that strand 1 is adjacent to strand 2, strand 7 to strand 4, strand 5 to strand 6 and strand 3 to strand 8. All adjacent strands are antiparallel. Strand 8 continues to interact with strand 1, strand 2 with strand 7, and so on; in other words, the pairs of antiparallel β-strands interact with one another, forming the structure of the protein. The corresponding topology is shown in figure 39.

Figure 39. Topology of the jelly roll.

An example of a jelly roll is the head of hemagglutinin (Fig. 40), the globular region of the influenza virus protein that must recognize sialic acid to begin the process of infection.

Figure 40. Hemagglutinin: the monomer and the jelly roll barrel in the terminal region of the monomer.

Hemagglutinin is a trimer anchored in the membrane of the virus. Each subunit consists of two chains, named HA1 and HA2. HA1 has 328 amino acids and HA2 has 221. The two chains are joined by disulfide bridges. Together they produce a structure made up of a stem that extends from the membrane and a globular domain at its end. HA1 begins near the membrane but does not enter it, and forms an extended structure that follows the stem for about 100 Å up to the globular region. The apex is a globular jelly roll structure formed by eight strands and comprising approximately 150 residues. After the globular region, the chain returns to strengthen the stem, running parallel to it for 70 residues. HA2 contributes only to the stem and to the insertion into the membrane. The recognition site for sialic acid is located on the globular head, in an inner region of the jelly roll (Fig. 40), more than 100 Å from the membrane. The sialic acid binding site lies in an internal pocket.
Antibodies of the immune system bind this molecule close to the binding site in order to prevent viral infection. To escape this defense mechanism, the virus accumulates mutations located at the edge of the pocket, because the inner part of the pocket must be conserved to keep the ability of the molecule to recognize sialic acid intact.

Domains with parallel β-helices

β-helix with 2 strands

Domains consisting of parallel β-strands are relatively rare, because their hydrogen bonds are less stable than those of antiparallel β-strands. For this reason, the strategy used by parallel β-strands to achieve a stable structure is to form β-helices. In these structures the polypeptide chain forms a large helix made of β-strands connected by loops. Two types of such structures are currently known. In the simplest case the β-helix is made up of two sheets, and each turn of the helix contains two strands and two loops (Fig. 41).

Figure 41. Scheme of the β-helix with two strands.

The basic structural unit of this motif contains 18 amino acids: three in each strand and six in each loop. The sequence shows specific repeats; in particular, a nine-residue consensus sequence Gly-Gly-X-Gly-X-Asp-X-U-X can be identified, where X is any amino acid and U is an amino acid with a bulky, hydrophobic side chain. The first six residues form the loop and the last three the β-strand. Another feature of these motifs is that they bind calcium ions through the Asp residue.

β-helix with 3 strands

The other structure made of parallel β-strands is also a helix, in which the basic unit is formed by three very short β-strands, of 3 to 5 residues, connected by loops (Fig. 42).

Figure 42. Schematic representation of a helix formed by three β-strands.

Three strands make up the structure: two almost parallel, and the third perpendicular to the first two.
Only two residues form the loop connecting the first and second strands, while the other two loops are much longer and vary in length and conformation. The helix thus forms three large parallel sheets arranged like three faces of a prism. An example of this type is pectate lyase (Fig. 43).

Figure 43. Pectate lyase structure.

The database CATH

The CATH database classifies proteins on a structural basis. The classification is hierarchical. Its two co-authors are C.A. Orengo and J.M. Thornton. Unlike a primary database, where experimental data are deposited without any manipulation, CATH is a secondary database, in which the information is analyzed, selected and then stored.

During evolution, many families of proteins with different sequences but related structures have arisen. Since proteins with very different sequences may have a similar three-dimensional structure, a classification based on three-dimensional structure is very useful for identifying significant relationships. Classification in CATH is semi-automatic, i.e. partly manual and partly automatic. The abbreviation CATH stands for Class, Architecture, Topology, Homologous superfamily, the terms that identify the four main levels of protein classification:

1. Class
2. Architecture
3. Topology
4. Homologous superfamily

The Class (C-level) is very simple and is assigned automatically. It is determined by the secondary structure content of the protein. Four classes are defined: α, β, αβ, and a fourth in which the secondary structure content is minimal.

Architecture (A-level) considers the shape of the domain as determined by the orientation of the secondary structures, but ignores the connections between them.
Currently, this classification is done manually using a simple description of the arrangement of secondary structures, such as β-barrel, three-layer sandwich, etc.

The Topology (T-level) considers the connections between the elements of secondary structure: structures are clustered into fold groups according to the shape and connections of the secondary structures. The proteins are thus classified into fold families.

The Homologous superfamily (H-level) groups proteins that probably share a common ancestor and are therefore defined as homologous. In this way, the homologous superfamilies are defined.

There is also a fifth level, S (Sequence family), in which proteins with sequence identity ≥ 35% are clustered. There are also many sublevels that will not be discussed in this chapter.

The classes are numbered from 1 to 4: class 1 for α, class 2 for β, and so on. The next level is the architecture, which, as previously described, is the description of the arrangement of the secondary structures independently of their connections. Figure 44 shows a series of proteins belonging to different architectures (bundle of helices, β-barrel, propeller, horseshoe, etc.).

Figure 44. Examples of proteins classified in several architecture groups.

At this level, only the orientation of the secondary structures is taken into consideration; the next level is reached only when the topology, i.e. the connections between the different elements, is considered (Fig. 45). The example in figure 45 starts with the α-β class, which branches into three different architectures: TIM barrel, sandwich and roll. The sandwich architecture branches again into two different topologies, flavodoxin and β-lactamase. Although these proteins have a similar arrangement and orientation of their secondary structures, they have a different topology (i.e. different connections between secondary structure elements).
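
The branching example above can be sketched as a small nested mapping. This is only an illustration: the names are taken from the example in the text, while the dictionary layout and the helper functions `locate` and `share_architecture` are assumptions made here, not the way CATH itself stores its data.

```python
# Illustrative sketch of the branching example: class -> architecture -> topologies.
# The nested-dict layout is an assumption for illustration, not CATH's data model.
CATH_EXAMPLE = {
    "alpha-beta": {
        "TIM barrel": [],  # topologies of this architecture are not listed in the text
        "sandwich": ["flavodoxin", "beta-lactamase"],
        "roll": [],        # topologies of this architecture are not listed in the text
    }
}


def locate(topology, tree=CATH_EXAMPLE):
    """Return (class, architecture) for a topology name, or None if it is absent."""
    for cls, architectures in tree.items():
        for arch, topologies in architectures.items():
            if topology in topologies:
                return cls, arch
    return None


def share_architecture(first, second, tree=CATH_EXAMPLE):
    """True if both topologies are found under the same class and architecture."""
    a = locate(first, tree)
    b = locate(second, tree)
    return a is not None and a == b
```

With this sketch, `share_architecture("flavodoxin", "beta-lactamase")` returns `True`: the two folds sit under the same class and architecture and differ only at the topology level.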
Figure 45. Example of branching and classification levels.

Flavodoxin and β-lactamase belong to the same sandwich architecture and the same α-β class, but to two groups with different topology. Proteins belonging to the same topological group have a relatively similar fold, because both their secondary structure elements and the connections between them are conserved. Generally, the length of the loops connecting the secondary structure elements is the most variable component among proteins of the same topological group. The length of the secondary structure elements, such as β-strands and α-helices, can also change. Usually, proteins with the same topology possess a conserved core and therefore have similar structures, although their functions may differ.

Proteins are classified hierarchically, and each protein can be identified by a specific number (Fig. 46). In the example, the identification number is 1.10.490.20: classes are numbered from 1 to 4, while the architecture, topology and homologous superfamily levels are numbered starting from ten and increasing by ten at each step. The number 1.10.490.20 therefore indicates that the protein belongs to class 1, architecture 10, topology 490 and homology 20.

Figure 46. Example of classification and corresponding numbering.

In this way, all proteins can be numbered and classified, and each number identifies a single group in the classification. Figure 47 shows another example of the hierarchical classification of proteins.

Figure 47. Example of classification and corresponding numbering.

The population of the various levels is not uniform. For example, at the H-level (Homologous superfamily) some folds are more represented than others. The high frequency of a fold is correlated with its structural relevance.
Some folds are found in enzymes with different functional characteristics. The fold therefore has an intrinsic quality (stability, flexibility) that is independent of the function with which it is associated. Figure 48 shows some highly populated folds at the H-level.

Figure 48. Some of the most represented folds (H-level).

Criteria for classification

The classification methodology is semi-automatic. The structures are selected from the PDB database. Structures of native or mutant proteins, solved by diffraction or NMR, are chosen when their resolution is 3 Å or better. The next step is the sequence comparison, which is straightforward: proteins with a sequence identity greater than 35% are placed directly in the same S-level. The proteins are then divided into domains, which are analyzed individually. The assignment of the class is automatic: the procedure examines the secondary structure composition by analyzing the values of the ɸ and ψ angles and counting how many of them correspond to α or β structure.

The structures are then compared to define the H and T levels. SSAP is the software that carries out this comparison automatically. The program compares, in a sequential manner, the distances between residues in the three-dimensional structures. The parameter used for classification is the score S, which is proportional to the inverse of the sum of these differences. A small difference indicates similar structures, which will therefore have a large S value. When S is equal to 100, the structures are identical. The thresholds for the T and H levels are 70 and 80, respectively: for values between 70 and 80 the protein is classified at the T level, while for values larger than 80 it is classified at the H level.

The architecture level is assigned manually, because it is difficult to define this level automatically.
Architectures that cannot easily be described at first analysis are grouped into a specific architecture simply called 'complex architecture'. Finally, a CATH number is assigned.

Proteins can be retrieved from the database using:
- PDB code
- CATH code
- Keywords that define the properties of the protein.

Protein function is not taken into consideration in this database.
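
The numbering and the SSAP thresholds described in this chapter can be sketched as follows. This is a minimal sketch, not CATH's own software: the function names `parse_cath_number` and `level_from_ssap` are hypothetical, the labels for classes 3 and 4 are inferred from the order in which the classes are listed, and the treatment of a score exactly equal to 80 (assigned to the T level here) is an assumption, since the text only says "larger than 80".

```python
from typing import NamedTuple, Optional

# Class labels; 1 and 2 are stated in the text, 3 and 4 follow the listing order.
CLASS_NAMES = {
    1: "alpha",
    2: "beta",
    3: "alpha-beta",
    4: "minimal secondary structure",
}


class CathNumber(NamedTuple):
    cath_class: int
    architecture: int
    topology: int
    homology: int


def parse_cath_number(code: str) -> CathNumber:
    """Split a four-level code such as '1.10.490.20' into its C, A, T, H parts."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {code!r}")
    c, a, t, h = (int(p) for p in parts)  # int() raises ValueError on non-numbers
    if c not in CLASS_NAMES:
        raise ValueError(f"class must be between 1 and 4, got {c}")
    return CathNumber(c, a, t, h)


def level_from_ssap(score: float) -> Optional[str]:
    """Map an SSAP score S (0 to 100) to the level it supports.

    S > 80        -> 'H' (same homologous superfamily)
    70 <= S <= 80 -> 'T' (same topology); the boundaries are an assumption
    S < 70        -> None (no shared T or H level)
    """
    if not 0 <= score <= 100:
        raise ValueError("the SSAP score is expected to lie between 0 and 100")
    if score > 80:
        return "H"
    if score >= 70:
        return "T"
    return None
```

For example, `parse_cath_number("1.10.490.20")` yields class 1, architecture 10, topology 490 and homology 20, matching the numbering example of Fig. 46.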