JOURNAL OF MOLECULAR RECOGNITION
J. Mol. Recognit. 2001; 14: 199–214
DOI: 10.1002/jmr.534

Statistical analysis of atomic contacts at RNA–protein interfaces

Michèle Treger1 and Eric Westhof2*

1 Laboratoire de Biostatistique et d’Informatique Médicale, Faculté de Médecine, Université Louis Pasteur, 4 rue Kirshleger, F-67000 Strasbourg, France
2 UPR 9002 du CNRS, Institut de Biologie Moléculaire et Cellulaire, 15 rue René Descartes, F-67084 Strasbourg Cedex, France

Forty-five crystals of complexes between proteins and RNA molecules from the Protein Data Bank have been statistically surveyed for the number of contacts between RNA components (phosphate, ribose and the four bases) and amino acid side chains. Three groups of complexes were defined: the tRNA synthetases; the ribosomal complexes; and a third group containing a variety of complexes. The types of atomic contacts were a priori classified into ionic, neutral H-bond, C-H…O H-bond, or van der Waals interaction. All the contacts were organized into a relational database which allows for statistical analysis. The main conclusions are the following: (i) in all three groups of complexes, the most preferred amino acids (Arg, Asn, Ser, Lys) and the less preferred ones (Ala, Ile, Leu, Val) are the same; Trp and Cys are rarely observed (respectively 15 and 5 amino acids in the ensemble of interfaces); (ii) of the total number of amino acids located at the interfaces 22% are hydrophobic, 40% charged (positive 32%, negative 8%), 30% polar and 8% are Gly; (iii) in ribosomal complexes, phosphate is preferred over ribose, which is preferred over the bases, but there is no significant preference in the other two groups; (iv) there is no significant prevalence of a base type at protein–RNA interfaces, but specifically Arg and Lys display a preference for phosphate over ribose and bases; Pro and Asn prefer bases over ribose and phosphate; Met, Phe and Tyr prefer ribose over phosphate and bases.
Further, Ile, Pro, Ser prefer A over the others; Leu prefers C; Asp and Gly prefer G; and Asn prefers U. Considering the contact types, the following conclusions could be drawn: (i) 23% of the contacts are via potential H-bonds (including CH…O H-bonds and ionic interactions), 72% belong to van der Waals interactions and 5% are considered as short contacts; (ii) of all potential H-bonds, 54% are standard, 33% are of the C-H…O type and 13% are ionic; (iii) the Watson–Crick sites of G, O6(G) and principally N2(G), and the hydroxyl group O2' are more often involved in H-bonds than expected; the protein main chain is involved in 32% and the side chains in 68% of the H-bonds; considering the neutral and ionic H-bonds, the following couples are more frequent than expected: base A–Ser, base G–Asp/Glu, base U–Asn. The RNA CH groups interact preferentially with oxygen atoms (62% on the main chain and 19% on the side chains); (iv) the bases are involved in 38% of all H-bonds and more than 26% of the H-bonds have the H donor group on the RNA; (v) the atom O2' is involved in 21% of all H-bonds, a number greater than expected; (vi) amino acids less frequently in direct contact with RNA components interact frequently via their main chain atoms through water molecules with RNA atoms; in contrast, those frequently observed in direct contact, except Ser, use instead their side chain atoms for water bridging interactions. Copyright © 2001 John Wiley & Sons, Ltd.

Keywords: RNA; protein; contact; interface; statistics

Received 21 March 2001; revised 4 April 2001; accepted 4 April 2001

*Correspondence to: E. Westhof, UPR 9002 du CNRS, IBMC, 15 rue René Descartes, F-67084 Strasbourg Cedex, France. Email: [email protected]

INTRODUCTION

RNA molecules can fold and perform chemical reactions without the help of proteins. Further, the recent crystallographic work on the 50S particles concluded that the ribosome is a ribozyme and, thus, any protein on earth is chemically assembled by catalysis performed solely by RNA components (Ban et al., 2000). However, the ubiquitous and essential functions executed by RNA molecules in living cells require the involvement of several proteins at all steps of the activity of an RNA molecule. Also, theories on the origins of the genetic code often imply stereospecific recognition between RNA bases and the codon table (Knight et al., 1999; Ribas de Pouplana and Schimmel, 2001). It has, therefore, been suggested that RNA aptamers raised against a given amino acid would preferentially contain bases implied in the triplet coding of that particular amino acid (Knight and Landweber, 2000).

Table 1. The protein–RNA complexes used for the statistical analysis in this study. The 30S ribosomal subunit is composed of 20 polypeptidic chains in contact with RNA. The PDB code is given in parentheses

Complexes                                     Amino acids  Nucleotides  Resolution (Å)
Synthetases
Aspartyl tRNA synthetase (1C0A)                      1170          154            2.40
Glutaminyl tRNA synthetase (1GTR)                     553           74            2.50
Seryl tRNA synthetase (1SER)                          842           94            2.90
Threonyl tRNA synthetase (1QF6)                       642           76            2.90
Isoleucyl tRNA synthetase (1QU2)                      917           75            2.20
Phenylalanyl tRNA synthetase (1EIY)                  1135           76            3.30
Ribosome
Ribosomal protein L25 (1DFU)                           94           38            1.80
Ribosomal protein L11 (1MMS)                          140           58            2.57
30S ribosomal subunit (1FJF)                         2540         1522            3.05
Others
MS2 protein capsid (1E6T)                             194           19            2.20
Satellite tobacco mosaic virus (1A34)                 159           21            1.81
Black beetle virus capsid protein (2BBV)              407            3            2.80
Methionyl-tRNA formyl transferase (2FMT)              314           78            2.80
U1A spliceosomal protein (1URN)                        97           21            1.92
U2 Snrnp (1A9N)                                       272           24            2.38
Elongation factor Tu (1B23)                           405           74            2.60
Trp RNA-binding attenuation protein (1C9S)           1628           55            1.90
Sxl-lethal protein (1B7F)                             168           12            2.60
Double stranded RNA binding protein (1DI2)             69           20            1.90
Transcription termination factor (2A8V)               354            9            2.40
Signal recognition particle protein (1DUL)             69           49            1.80
RNA binding protein Nova-2 (1EC6)                      87           20            2.40
Poly(A) polymerase regulatory subunit (1AV6)          289            7            2.70
Bean pod mottle virus (1BMV)                          198           11            3.00
Desmodium yellow mottle tymovirus (1DDL)              564            9            2.70
Cowpea chlorotic mottle virus (1CWP)                  570           10            3.20

The recent increase in the number of crystal structures of complexes between protein and a cognate RNA molecule offers the possibility to analyze systematically whether there are molecular biases between RNA components (either non-specific like phosphate and ribose or specific like the bases) and amino acid side chains present in proteins. Thus, 45 crystalline complexes were retrieved from the Protein Data Bank and distributed into three classes: the tRNA synthetase group; the ribosomal complexes; and a group comprising various types of complexes. The atom–atom contacts between the RNA and the protein components were then calculated and sorted in a relational database. The contacts were classified into three main categories: salt bridge (or ionic); H-bonding type; and van der Waals interaction.
The H-bond type covers the potential neutral H-bond as well as the potential C-H…O/N H-bond types. The present analysis is, therefore, purely statistically based, considering only interatomic distances without any explicit reference to energetical ranking. In addition, since the analysis considers only stable and crystallized complexes, it will miss the roles of the interactions important for the dynamics of complex formation. Previous works have emphasized the protein structural elements in recognition (Draper, 1999) and the central roles of non-Watson–Crick base pairs in RNA deformation and recognition (Hermann and Westhof, 1999; Westhof and Fritsch, 2000).

MATERIALS AND METHODS

Data set of RNA–protein complexes

The Protein Data Bank (Bernstein et al., 1977) contains several categories of protein–RNA complexes. Since the complexes in these categories have different functions, the interface between protein and RNA may have different properties. We have distinguished three categories of complexes: tRNA synthetases–tRNA (six complexes), ribosomes (22 complexes), and the others (17 complexes) comprising complexes from viruses, an elongation factor, or ribozymes (Table 1). This set of 45 complexes contains non-homologous complexes and at least one representative complex chosen among the protein–RNA complexes available at the PDB until January 2001 on the basis of resolution. When the asymmetric unit contained several copies of the complex, only one was retained. In the case of oligomeric structures, the data corresponding to the biologically significant oligomeric state were selected. In the case of icosahedral viruses, the three-fold association of the capsid proteins was considered.

Table 2. Examples of (a) atom groups involved in interactions and (b) possible interaction types between atoms

(a)
Names of the atom groups    Atom groups
Nitrogen
  N                         N
  NH                        NH, NH2
  NHP                       NH2, NH3
  NH_NHP                    NH2, NH2
Oxygen
  O                         O
  O_ON                      O, O
  OH                        OH
Carbon
  C                         C
  CH                        CH, CH2, CH3

(b)
           C     CH      NH      NHP      NH_NHP   N and O  OH      O_ON
C          S/W
CH         S/W   S/W
NH         S/W   S/W     S/H/W
NHP        S/W   S/W     S/H/W   S/W
NH_NHP     S/W   S/W     S/H/W   S/HS/W   S/HS/W
N and O    S/W   S/C/W   S/H/W   S/H/W    S/H/W    S/W
OH         S/W   S/C/W   S/H/W   S/H/W    S/H/W    S/H/W    S/H/W
O_ON       S/W   S/C/W   S/H/W   S/IH/W   S/IH/W   S/W      S/H/W   S/W

(a) The atoms involved in interactions (N, O, C, …) belong to atom groups. Several atom groups may be involved in the same interaction type and are depicted by the same name. (b) Interaction types between atom groups appearing in (a). Depending on the ionization of the atoms, several interaction types are possible: ionic bond (I), hydrogen bond (H), CH…O H-bond (C), van der Waals interaction (W) and short contact (S). The interaction type is defined as a sequence of at most five symbols among the previous ones. It appears after a condition on the distance d between heavy atoms (given in Å): H, C, IH, W, etc. The criteria are slightly more stringent than in other works, which should alleviate slightly the approximation of not calculating H atoms and therefore neglecting angle values. The deduced numbers of contacts involving H-bonds are thus overestimated. The symbols appearing in (b) have the following meaning:
S/W: d < 3.3 S, 3.3 < d ≤ 3.8 W;
S/C/W: d < 2.5 S, 2.5 ≤ d ≤ 3.3 C, 3.3 < d ≤ 3.8 W;
S/CS/W: d < 2.5 S, 2.5 ≤ d ≤ 3.3 CS, 3.3 < d ≤ 3.8 W;
S/H/W: d < 2.5 S, 2.5 ≤ d ≤ 3.3 H, 3.3 < d ≤ 3.8 W;
S/HS/W: d < 2.5 S, 2.5 ≤ d ≤ 3.3 HS, 3.3 < d ≤ 3.8 W;
S/IH/W: d < 2.5 S, 2.5 ≤ d ≤ 3.3 IH, 3.3 < d ≤ 3.8 W.

Computational methods

Interaction types between atoms.
Given two atoms, one belonging to the protein and one belonging to the RNA, several interactions, depending on the ionization of the two atoms and their mutual distance, are possible: ionic bond (I); potential hydrogen bond (H); potential CH…O H-bond (C); van der Waals interaction (W); and short contact (S). A short contact occurs if the two atoms are too close according to the chosen criteria. In order to compute the interaction type from the atoms in contact, we have defined a finite and deterministic automaton which takes as input the names of the two atoms, the names of the amino acids to which they belong and their mutual distance. The output of the automaton is a particular sequence of at most five symbols among the following: I, H, C, W, S. Table 2(a) presents atoms (nitrogen, oxygen, carbon) appearing in different groups (NH, NH2, NH2, …, O, O, …, CH, CH2, CH3, etc.) put together under different names (NH, NH_NHP, O, O_ON, CH, etc.), while Table 2(b) gives the interaction types between these atoms depending on their group names and their mutual distance.

Atoms and residues located in the interface between protein and RNA. By convention, we consider that the atoms located in an interface are those which are involved in some of the interactions previously defined.

Classification of the amino acids. The amino acids have been classified according to their hydrophobicity into different categories. We have followed a standard classification in four physico-chemical categories: hydrophobic (Ala, Ile, Leu, Met, Phe, Pro, Val), charged (Arg, Asp, Glu, Lys), polar (Cys, Asn, Gln, His, Ser, Thr, Trp, Tyr), and Gly.

Components of RNA nucleotides. By convention, we consider that atoms C1', C2', C3', C4', C5', O2', O4' belong to the ribose and atoms P, O1P, O2P, O3P, O3', O5' to the phosphate. The other atoms belong to the bases. The modified bases have not been considered.
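
The distance criteria of the Table 2 legend and the ribose/phosphate/base convention stated above can be illustrated with a minimal Python sketch. It is not the authors' automaton: the function names `rna_component` and `classify_contact` are hypothetical, the pattern strings are those of the Table 2 legend, and the inclusive bounds (≤) are an assumption wherever the legend leaves them implicit.

```python
"""Illustrative sketch only: distance-based contact typing (Table 2 legend)."""

# Atom-to-component convention stated in the text.
RIBOSE_ATOMS = {"C1'", "C2'", "C3'", "C4'", "C5'", "O2'", "O4'"}
PHOSPHATE_ATOMS = {"P", "O1P", "O2P", "O3P", "O3'", "O5'"}


def rna_component(atom_name):
    """Assign an RNA atom name to 'ribose', 'phosphate' or 'base'."""
    if atom_name in RIBOSE_ATOMS:
        return "ribose"
    if atom_name in PHOSPHATE_ATOMS:
        return "phosphate"
    return "base"  # every other atom is counted with the base


def classify_contact(pattern, d):
    """Return the symbol selected by distance d (in Å) for a Table 2(b) pattern.

    pattern is a string such as 'S/W', 'S/C/W', 'S/H/W' or 'S/IH/W'.
    None is returned when the legend assigns no symbol, e.g. d > 3.8.
    """
    symbols = pattern.split("/")
    if len(symbols) == 2:  # e.g. 'S/W'
        short, vdw = symbols
        if d < 3.3:
            return short
        if 3.3 < d <= 3.8:  # d == 3.3 is left unassigned, as in the legend
            return vdw
        return None
    if len(symbols) == 3:  # e.g. 'S/H/W'
        short, middle, vdw = symbols
        if d < 2.5:
            return short
        if d <= 3.3:  # assumed inclusive: 2.5 <= d <= 3.3
            return middle
        if d <= 3.8:  # 3.3 < d <= 3.8
            return vdw
        return None
    raise ValueError("unsupported pattern: " + pattern)


if __name__ == "__main__":
    # Hypothetical contact: an O2' atom 2.9 Å from a side-chain OH group.
    print(rna_component("O2'"), classify_contact("S/H/W", 2.9))
```

Returning the raw symbol sequence (for example `IH`) rather than a single label keeps the ambiguity that the legend itself preserves for ionizable groups.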
The database of protein–RNA interfaces

A relational database, computed once, has been derived. The data types with their attributes and relations are represented by n-ary relations, with integrity constraints and operations defined on these relations. The database contains the following information:
• the set of all interactions between atoms, with the atom numbers, the mutual distance and the interaction types;
• the atoms located in the interfaces, with their number, name and coordinates;
• the amino acids located in the interfaces, with their number, name and chain;
• the nucleotides located in the interfaces, with their number, name and chain;
• the secondary structures located in the contact regions.

In addition, the database includes other data that were not needed to achieve the results: polypeptidic chains; the contact regions with their number of atoms and average temperature factor; average temperature factor of every amino acid for the main chain and for the side chain; and the quaternary structure.

Implementation

The database was implemented using RDB (VAX/RDB 1987), a Relational Database Management System (RDBMS), and instructions for data manipulation were incorporated in the C host language. Although this system is old, it is still up to date for the functions it provides. The database and the update programs may be easily implemented using any other RDBMS which can be coupled with a host language.

Statistical analysis

Chi-square tests. Two problems must be distinguished: (i) the preferred amino acid types (or nucleotide components) in the RNA–protein interfaces; and (ii) the favoured pairs amino acid type–nucleotide component in the interfaces. These two problems are independent. That is, the preference of a given amino acid type for some nucleotide component over the others does not depend on its count in each complex or on the molecular surface.
Amino acids and nucleotide components involved in contacts. In every complex, the amino acids may be classified by two factors: the amino acid type having 20 levels and the dichotomic criterion ‘contact RNA/no contact’. Thus there are 20 × 2 groups. The observed counts of amino acids falling in these groups may be presented in a 20 × 2 contingency table. The same holds true for the nucleotide components: ribose, phosphate, bases (3 × 2 contingency table), or the four bases A, C, G, U (4 × 2 contingency table). The expected counts are calculated from the null hypothesis of no relationship between the two factors, that is, only sampling fluctuations are responsible for the differences between the groups. The problem of whether the count of amino acids (or nucleotide components) is related to the amino acid type (or the nucleotide components) was explored by a chi-square statistic (Snedecor and Cochran, 1989). For a set of n complexes, the set of n contingency tables containing observed counts may be combined into a single one by summing the observed counts, and the set of n tables containing expected counts may be combined into a single one by summing the expected counts. These two tables are designed for a Boyd and Doll chi-square test. This test can be carried out in two ways: in one way, the cells of the test are the complexes, in the other way, the cells are the amino acid types or the nucleotide components.

Contacts between amino acids and nucleotide components. In every complex, the observed counts of contacts between amino acids and nucleotide components (ribose/phosphate/bases or bases A/C/G/U) may be presented in a 20 × 3 or a 20 × 4 contingency table. As previously, a Boyd and Doll chi-square statistic can be computed in order to test the relationship between the two factors, amino acid type and nucleotide component, for a set of n complexes.
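
The pooling step described above (expected counts computed per complex under the null hypothesis, then observed and expected tables summed over complexes) can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the helper names and the toy counts are hypothetical, and the final statistic is the generic sum of (O − E)²/E over the pooled cells, which is only assumed to approximate the Boyd and Doll procedure and omits Yates' correction.

```python
"""Illustrative sketch only: pooling contingency tables over complexes."""


def expected_counts(table):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def add_tables(a, b):
    """Cell-by-cell sum of two tables with the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def pool(tables):
    """Sum observed tables, and separately sum their per-complex expected tables."""
    observed, expected = None, None
    for table in tables:
        exp = expected_counts(table)
        if observed is None:
            observed = [list(row) for row in table]
            expected = exp
        else:
            observed = add_tables(observed, table)
            expected = add_tables(expected, exp)
    return observed, expected


def chi_square(observed, expected):
    """Generic chi-square statistic over all cells with a non-zero expectation."""
    return sum(
        (o - e) ** 2 / e
        for row_o, row_e in zip(observed, expected)
        for o, e in zip(row_o, row_e)
        if e > 0
    )


if __name__ == "__main__":
    # Hypothetical counts: rows are three amino acid types,
    # columns are (in contact with RNA, not in contact).
    complex_1 = [[8, 12], [3, 27], [5, 45]]
    complex_2 = [[10, 10], [2, 18], [4, 56]]
    obs, exp = pool([complex_1, complex_2])
    print(obs)
    print(chi_square(obs, exp))
```

Computing the expectations inside each complex before summing keeps the composition of every protein in the null model, which is the point of combining tables rather than pooling raw counts first.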
For all chi-square tests, Yates' correction has been used in case of small expected counts (Colquhoun, 1971).

Subsequent analysis. The chi-square statistic does not point out the way in which the observed and the expected counts differ. In the different cells of the tables, the deviations between the observed and the expected counts are more or less large and contribute more or less to the chi-square statistic, but the deviations in the different cells are correlated and thus hard to interpret. If the percentage distributions appear similar in several columns and if a chi-square test confirms it, then these columns may be combined for comparisons with other columns or combined columns by means of further chi-square tests (Snedecor and Cochran, 1989). Nevertheless, within a column with several small counts, the percentage distribution on the one hand and the deviations between observed and expected counts on the other hand may be discordant. In this case, it is better to compare the deviations between observed and expected counts from column to column than the percentage distributions.

Arcsine transformation of proportions, analysis of variance (ANOVA) and Scheffé test

The location of differences among several groups is easier with a continuous variable than with a discrete variable. As the amino acids of a complex may be classified into two categories, contact/no contact, the proportions p of amino acids which are in contact with RNA in a complex are binomial proportions. Thus, they may be transformed by means of an arcsine transformation z = arcsin √p (Snedecor and Cochran, 1989). This transformation is effective in stabilizing the variances. The variable z assumes one value for every amino acid type in every complex. For a set of n complexes, two factors are involved: the amino acid type having 20 levels and the complex having n levels. These factors are qualitative variables. Thus, this experiment is a 20 × n factorial experiment without repeated measures.
It allows a two-way ANOVA. In order to locate the differences among amino acid types, contrasts among amino acid types or groups of amino acid types have been computed from the z means and compared to the critical values by means of a Scheffé test (Winer, 1971). The same holds true in respect of the proportions of nucleotide components in contact with the protein.

Table 3. Values of the z means for the amino acid types. The variable z is defined by zij = arcsin √pij where pij is the proportion of amino acids of type j in contact with RNA in the complex i. The z values have been computed for every complex and every aa type and then averaged within every aa type. In each group of complexes (synthetases, ribosomes, ‘others’) and in the set of all complexes, the means are sorted in descending order. The counts of Cys and Trp are too low for consideration.

Synthetases:      Arg 0.39, Asn 0.27, Gln 0.27, Met 0.25, Thr 0.24, Ser 0.22, Asp 0.22, Lys 0.21, Gly 0.20, Pro 0.19, Phe 0.18, Glu 0.18, Tyr 0.16, Ile 0.15, Ala 0.14, His 0.14, Leu 0.13, Val 0.11
Ribosomes:        Arg 0.81, Ser 0.75, His 0.71, Tyr 0.70, Lys 0.69, Asn 0.68, Thr 0.54, Phe 0.51, Gly 0.51, Gln 0.49, Pro 0.46, Asp 0.25, Ile 0.33, Leu 0.33, Met 0.28, Glu 0.26, Val 0.25, Ala 0.24
Other complexes:  Lys 0.37, Arg 0.34, Gly 0.33, Asn 0.31, Ser 0.31, Thr 0.30, Tyr 0.30, Gln 0.25, Phe 0.24, Glu 0.22, Ala 0.20, Met 0.20, Asp 0.16, His 0.15, Val 0.14, Leu 0.12, Ile 0.09, Pro 0.07
All complexes:    Arg 0.56, Lys 0.50, Ser 0.48, Asn 0.47, Tyr 0.41, Gly 0.40, Thr 0.39, Gln 0.35, His 0.34, Phe 0.32, Pro 0.27, Asp 0.25, Met 0.24, Glu 0.24, Leu 0.22, Ala 0.21, Ile 0.21, Val 0.18

Friedman test

The values of z allow a classification of the amino acid types or the nucleotide components. The ranks computed from the z values (or from the proportions in contact) may be used for calculating a Friedman statistic (Conover, 1980).
The treatments of the test are the amino acid types or the nucleotide components, and the blocks are the complexes or groups of complexes (in the case of groups of complexes, the ranks are calculated from the z means). According to the null hypothesis, each ranking within a block is equally likely, i.e. the ranks in the different blocks are discordant, the treatments have identical effects, and no amino acid type or no nucleotide component is preferred over the others in the contacts. If the null hypothesis is rejected, multiple comparisons by means of a t statistic adjusted to this case (Conover, 1980) permit the localization of the differences between the treatments (amino acid types or nucleotide components).

RESULTS

The amino acids involved in contacts with RNA

By convention, we consider that a given amino acid is in contact with RNA if it possesses at least one atom involved in one or more interaction types with one or more atoms of RNA. This amino acid is counted once even if its atoms are involved in several interactions with several atoms of one or more nucleotides. In this part of the analysis, the contacts through water molecules are not counted (but see below). In a given protein, the number of amino acids is different for each of the 20 amino acid types. Thus, the number of a given amino acid type in contact with RNA may depend only on its percentage of presence in the protein. A two-way ANOVA, after arcsine transformation of the proportions of amino acids in contact with RNA, and a Boyd and Doll chi-square test have been computed. In every group of complexes, synthetases, ribosomes and the others, the two F statistics for the two factors are above the critical value for an α < 0.001 level of significance. The Boyd and Doll chi-square test confirms these results (α < 0.001). Table 3 shows a classification of the amino acid types according to their z mean.
These results denote a relationship between the two factors, amino acid type and complex. That is, the number of amino acids in contact with RNA depends on both the amino acid type and the complex in which the amino acids are located, and not only on their numbers in the complex.

Comparison between the three groups of complexes

The classifications of the amino acid types are not exactly the same in the three groups of complexes. Nevertheless, in all three groups, the most preferred amino acid types are the same, Arg, Asn, Ser, Lys, and the less preferred amino acid types are the same, Ala, Ile, Leu, Val. If the classifications of the amino acid types according to their z means were different in the three groups of complexes, then the ranks of the amino acid types according to their z mean would be discordant. A Friedman test, computed from these ranks, results in rejection of the null hypothesis (α < 0.001). Thus, in all three groups of complexes, there is a tendency for some amino acid types to be preferred over the others in contacts with RNA. The differences between the amino acid types two by two, pointed out by means of multiple comparisons, are summarized in Table 4. The percentage contributions to the total number of amino acids located in the interfaces are the following: hydrophobic 22%, charged 40% (positive 32%, negative 8%), polar 30%, Gly 8%.

Table 4. Some amino acid types are preferred over the others for forming contacts with RNA. Differences between the amino acid types two by two (hatched cells)

The ribose, phosphate and bases involved in contacts with amino acids

As for amino acids, a given nucleotide component (ribose, phosphate, base) is considered in contact with the protein if it contains at least one atom involved in one or more interaction types with one or more atoms of the protein.
By convention, we consider that if a given nucleotide component (for example ribose) is in contact with an amino acid, it is counted even if another component of the same nucleotide (phosphate or base) is also in contact with an amino acid, the same amino acid or another. Thus, the total count of nucleotide components in contact with the protein may be greater than the total number of nucleotides in the RNA. The contacts through water molecules are not counted. In each complex, riboses and phosphates are more numerous than bases of each type. These differences may explain the differences in their counts in the contacts with amino acids. Furthermore, as the phosphates carry negative charges, their proportions in the contacts may be different from those of ribose and bases. In order to test these hypotheses, the same procedure as for the amino acid types has been followed: a two-way ANOVA after arcsine transformation of the proportions of nucleotide components in contact with the protein, a Scheffé test, and a Boyd and Doll chi-square test.

Ribose vs phosphate vs the ensemble of the four bases. Only in the group of ribosomes do the proportions of ribose, phosphate and bases in contact with protein differ from each other: phosphate is preferred over ribose, which is preferred over the bases (α < 0.01). This may be due to the large fraction of base paired regions (Watson–Crick or non-Watson–Crick) in the ribosomal RNAs. In the other two groups of complexes, no significant preference for ribose, phosphate or bases over the others exists. In most complexes, the proportions of phosphate and ribose in contact with protein are greater than the proportion of bases, but these differences are not significant. The values of these proportions, after arcsine transformation, are given in Table 5. In all groups of complexes, the proportions of ribose, phosphate and bases depend on the complex in which they are located (α < 0.01).
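
The arcsine transformation applied to these proportions, followed by averaging within each nucleotide component, can be sketched in Python as follows. This is an illustration under stated assumptions: the function names and the input proportions are hypothetical, and angles are returned in radians, which is Python's convention and is not confirmed here as the unit of the published z values.

```python
"""Illustrative sketch only: z = arcsin(sqrt(p)) and z means per component."""
import math


def arcsine_transform(p):
    """Variance-stabilizing transform of a binomial proportion p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return math.asin(math.sqrt(p))


def z_means(proportions_by_complex):
    """Average z over complexes for each component, sorted in descending order.

    proportions_by_complex maps a complex name to a dict
    {component: proportion in contact with the protein}.
    """
    sums, counts = {}, {}
    for per_component in proportions_by_complex.values():
        for component, p in per_component.items():
            sums[component] = sums.get(component, 0.0) + arcsine_transform(p)
            counts[component] = counts.get(component, 0) + 1
    means = {c: sums[c] / counts[c] for c in sums}
    return dict(sorted(means.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    # Hypothetical proportions, not taken from the paper.
    toy = {
        "complex_1": {"ribose": 0.20, "phosphate": 0.25, "bases": 0.10},
        "complex_2": {"ribose": 0.15, "phosphate": 0.30, "bases": 0.05},
    }
    for component, z in z_means(toy).items():
        print(component, round(z, 3))
```

The resulting z values, one per component and per complex, are the continuous variable on which a two-way ANOVA can then be run.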
The bases between themselves. In each of the three groups of complexes and in the ensemble of complexes, no relationship between the proportion of a base in contact with amino acids and its type (A/C/G/U) could be pointed out.

Table 5. Values of the z means for the nucleotide components. The variable z is defined by zij = arcsin √pij where pij is the proportion of nucleotide components of type j in contact with amino acids in the complex i. The z values have been computed for every complex and every nucleotide component. They are then averaged within every nucleotide component. In each group of complexes the means are sorted in descending order

Synthetases:      Ribose 0.44, Phosphate 0.41, C 0.39, G 0.31, A 0.29, U 0.26
Ribosomes:        Phosphate 0.15, Ribose 0.14, U 0.13, A 0.12, G 0.11, C 0.09
Other complexes:  Ribose 0.58, Phosphate 0.47, G 0.45, A 0.39, U 0.38, C 0.25
All complexes:    Ribose 0.33, Phosphate 0.30, G 0.26, U 0.24, A 0.23, C 0.19

Besides the fact that some amino acid types occur more often than others in contacts with RNA and the fact that nucleotide components may all have the same preference for the amino acids, a certain amino acid type may give preference for some nucleotide components over the others, and vice versa. The answer to this problem can be obtained by testing the hypothesis about a dependency between the two factors: the amino acid type and the nucleotide component.

Table 6. Observed and expected numbers of contacts between amino acids and nucleotide components. The most significant differences are in bold

(a) Ribose, phosphate, and the four bases together

         Ribose        Phosphate       Bases
       Obs.  Exp.     Obs.  Exp.     Obs.  Exp.
Ala     34    37       40    39       31    29
Ile     31    25       12    23       27    21
Leu     51    40       28    39       32    32
Met     35    23        8    19       16    16
Phe     68    39       11    26       47    60
Pro     41    40       25    37       37    25
Val     37    29       26    33       18    18
Arg    189   237      325   265      176   187
Asp     37    30        6    23       43    32
Glu     46    40       18    36       54    41
Lys    103   153      258   164      102   145
Asn     59    64       54    59       67    56
Gln     58    50       43    57       39    33
His     45    39       27    36       34    31
Ser     69    70       65    63       67    68
Thr     55    60       57    57       54    49
Trp     16    15       11    16       15    11
Tyr     63    51       55    57       37    46
Gly     88    79       60    77       70    63
Total     1122           1128            963

(b) The four bases separately

         Base A        Base C        Base G        Base U
       Obs.  Exp.    Obs.  Exp.    Obs.  Exp.    Obs.  Exp.
Ala      7     6       6     9      10     9       9     6
Ile     11     7       4     6       6    10       6     4
Leu      9     7      14     9       3     7       7     9
Met      4     4       8     4       2     4       2     4
Phe     19    15       5     9      18    17       5     7
Pro     15     9       6    10      10    12       6     7
Val      4     3       5     4       3     4       7     6
Arg     42    47      51    43      39    50      44    35
Asp      5    12      10    11      22    12       6     8
Glu     17    17       9    11      23    19       5     6
Lys     27    32      16    16      43    37      17    16
Asn     21    20      11    12      11    19      25    17
Gln      4     7      13    11      16    13       7     9
His      7    10       4     5      20    14       3     4
Ser     34    21      10    11      15    25       8    10
Thr     15    18      11    10      18    15      10    11
Tyr      8     7      11    12       9     7       9    11
Gly     16    21      14    15      29    21      11    14
Total     263           207           294           184

Dependency between the amino acid type and the nucleotide component

The preferences that amino acid types present for ribose vs phosphate vs bases have been analysed in the three groups of complexes and in the ensemble of complexes, by means of chi-square tests.
The results may be summarized as follows.

The ribose, phosphate and the four bases. In the ensemble of complexes and for all amino acid types, except Cys whose expected count is too small [Table 6(a)]:
• Arg and principally Lys give preference for phosphate over ribose and bases;
• the amino acids Ile, Leu, Met, Phe, Pro, Asp, Glu, Gln, Gly, principally Asp and Glu, have fewer contacts with phosphate than expected;
• Pro and Asn prefer bases over ribose and phosphate;
• Met, Phe and Tyr prefer ribose over phosphate and bases.

In the groups of synthetases, ribosomes and other complexes the results are concordant with the previous ones, although they are not all significant owing to smaller counts. The three groups of complexes differ from each other (α < 0.001): the complexes denoted ‘others’ differ from synthetases and from ribosomes by an excess of contacts between amino acids and bases.

The four bases considered separately. Some amino acids, except Cys and Trp whose expected count is too small, have a preference for some bases over the others [Table 6(b)]:
• Ile, Pro, Ser prefer A over the others;
• Leu prefers C over the others;
• Asp and Gly prefer G over the others;
• Asn prefers U over the others.

H-bonds. The potential H-bonds constitute 12% of all interaction types. Considering the number of H-bonds involving phosphate groups vs ribose vs bases, the phosphate groups are more often involved in H-bonds than expected (α < 0.001). All oxygen atoms of ribose and of phosphate are involved in H-bonds. All atoms O and N of the bases are involved (see Table 8). Considering the total number of interactions involving these atoms, O2', O6(G) and principally N2(G) are more often involved in H-bonds than expected, whereas O4', O1P and N3(G) are more rarely involved in H-bonds than expected.
In the protein, the main chain is involved in 32% of H-bonds (O and OXT 14%, NH 19%) and the side chain in 68% [O 12%, NH 42% (principally Arg and Lys 29%), OH 14%, SD Met 0.3%]. Among all H-bonds, at least 26% have the donor group on the RNA and the acceptor group on the protein. All H donor sites of the RNA are involved in these latter H-bonds. These atoms are also involved in H-bonds with atoms of the protein which may be either H donor or H acceptor. Considering the number of H-bonds between the H donor sites of RNA and the H acceptor vs H acceptor or H donor sites of the protein, atoms N2(G) prefer H-bonds with H acceptor sites of protein, whereas O2' prefer H-bonds with H donor or acceptor sites of the protein. Atoms O2' are also involved in van der Waals interactions. Considering the number of H-bonds vs van der Waals interactions involving O2'; with atoms O of the side chain and with OH, O2' prefer H-bonds over van der Waals interactions; with atoms O of the main chain and with NH, O2' give no significant preference for H-bonds or van der Waals interactions. The preferences that some amino acids have for a base over the others, listed previously, can be found again in the Interaction types between amino acids and nucleotides In the ensemble of complexes, the percentage distribution of the interaction types is the following: 23% H and CH…O H-bonds and ionic interactions, 72% Van der Waals interactions, and 5% short contacts. Table 7 shows the percentage distributions of interaction types for the different nucleotide components in the three groups of complexes and in the ensemble of complexes. One can notice that the three groups of complexes differ from each other (a < 0.001) in: . an excess of H bonds, CH…O H-bonds and Van der Waals interactions involving bases in the group denoted ‘others’ vs synthetases and ribosomes; . a lack of interactions of all types involving bases in the ribosomes vs synthetases and ‘others’; . 
a lack of interactions of all types involving phosphate in the group denoted ‘others’ vs synthetases and ribosomes. In the ensemble of complexes, the phosphates are involved in more H-bonds than expected. Ionic interactions. The ionic interactions constitute 3% of all interaction types and 13% among all H-bonds. They occur between the charged atoms of the phosphate groups and the NH3 and NH2 groups of Lys and Arg. There is no significant preference for some oxygen atoms over the others. They represent 32% of all phosphate Hbonds. Copyright # 2001 John Wiley & Sons, Ltd. Table 7. Percentage distributions of interaction types in the three groups of complexes and in the ensemble of complexes Ribose Phosphate Bases Ionic interactions and hydrogen bonds Synthetases 24 41 Ribosomes 21 58 Others 16 18 35 21 67 CH…O H-bonds Synthetases Ribosomes Others 39 38 32 24 42 13 37 19 56 Van der Waals interactions Synthetases 33 Ribosomes 35 Others 22 27 37 11 40 28 67 The ensemble of complexes IH 20 C 36 W 30 43 31 26 38 32 44 IH, ionic interactions and H bonds; C, CH…O H-bonds; W, van der Waals interactions. J. Mol. Recognit. 2001; 14: 199–214 RNA–PROTEIN INTERFACES Table 8. Atoms of RNA involved in H-bonds in the ensemble of 45 complexes with the average distance between donor and acceptor atoms and number of Hbonds in the ensemble Nucleotide Atom Distance Number G G C U A U A G U C A C G A G N2 O6 O2 O4 N6 O2 N1 N1 N3 N4 N3 N3 N7 N7 N3 O2' O3' O4' O5' O1P O2P O3P 2.95 3.04 2.94 2.91 2.98 2.96 2.89 2.98 2.90 3.00 3.02 3.03 2.94 2.94 3.01 2.94 3.03 3.05 3.12 2.91 2.89 2.93 88 54 50 45 44 39 36 36 36 33 23 22 21 7 6 254 81 34 27 136 69 1 207 CH…O H-bonds. The CH…O H-bonds constitute 33% of all potential H-bonds (neutral, CH…O, ionic), and 8% of all interaction types. All CH groups of the amino acids, except CB Ile and CZ3 Trp, are involved in CH…O H-bonds with RNA. All CH groups of ribose and bases, except C8(A), are involved in CH…O H-bonds, principally C5', C4', C2(A) and C1'. 
They are listed in Table 9. The absence of C-H…O bonds involving C8(A) and the small number involving C6(U), C6(C), C8(G) are due to the presence of an intramolecular CH…O bonds involving the O5' of the ribose (Sundaralingam, 1973; Auffinger et al., 1996; Brandl et al., 1999). The importance of C-H…O contacts between the serine hydroxyl group and the C2(A) is striking. Several of the C-H…O contacts involving C4' and C5' (79%) are made with main chain oxygen atoms. The CH groups of RNA interact preferentially with oxygen atoms (62% on main chain and 19% on side chain), and with OH (19%). The CH groups of the proteins interact with O (74%), OH (18 %) and N (8%). Interactions involving aromatic rings In the ensemble of complexes, 71 interactions occur between atoms OH, NH, NH2 of nucleotides A, C, G, U and aromatic rings of Phe, Trp, Tyr, from which nine belong to the synthetases, 25 to the ribosomes, and 37 to the other complexes. The distance between the ring center, assuming a six-fold symmetry of the ring, and the N or O atom of the RNA varies between 2.36 and 3.53 Å (average 3.48 Å). The optimum distance varies from 2.9 to 3.6 Å (Levitt and Perutz, 1988). A recent analysis (Brandl et al., 2001) concluded to a longer average value of 3.7 Å (standard deviation 0.2 Å). Ten contacts are below these optimum values. tendencies for the H-bonds and CH…O H-bonds. Considering all H-bonds (neutral and ionic) between amino acids and bases, the following are more frequent than expected: base A–Ser; base G–Asp and Glu; and base U–Asn. Considering all CH…O H-bonds between amino acids and bases, the following are more frequent than expected: base A–Ser, base G–Phe and Gly, base U–Val. H-bonds involving water molecules A water molecule bridging protein to RNA is in contact Table 9. CH groups of RNA involved in CH¼O H-bonds with amino acids. 
Observed number of CH¼O H-bonds Ala Leu Met Pro Val Arg Asp Glu Lys Asn Gln Ser Thr Trp Tyr Gly Total C1' C2' 1 1 1 1 2 2 3 2 3 1 1 2 1 1 2 23 1 1 1 1 1 1 1 C3' C4' C5' 1 1 4 1 1 6 1 1 2 1 3 1 2 4 1 1 9 31 3 2 6 5 1 3 9 41 C C5 C C6 U C6 A C2 G C8 1 1 1 1 10 U C5 1 Copyright # 2001 John Wiley & Sons, Ltd. 1 3 1 1 1 1 1 4 1 1 1 2 17 1 1 1 1 7 1 1 5 1 1 24 2 1 2 6 Total 3 3 3 10 2 5 4 10 4 17 7 29 15 3 11 24 150 J. Mol. Recognit. 2001; 14: 199–214 208 M. TREGER AND E. WESTHOF Table 10. Number of observed contacts between amino acids and ribose, phosphate and bases through bridging water molecules. Contacts with the main chain and the side chain of amino acids are distinguished. The interaction type of contact on the two sides (protein and RNA) of the water molecules may be H-bond or CH¼O H-bond Ala Ile Leu Met Phe Pro Val Arg Asp Glu Lys Asn Gln His Ser Thr Trp Tyr Gly Total (a) All interaction types on either side of the water molecule Ribose Main 1 1 2 2 3 2 6 Side 1 1 4 1 3 10 Total 2 2 2 6 1 3 5 16 Phosphate Main 2 1 1 2 2 4 4 Side 1 2 16 Total 2 1 2 2 2 6 20 Bases Main 4 4 9 1 2 2 2 Side 2 1 5 1 1 18 Total 6 5 9 5 2 2 3 20 Total Main 7 6 11 3 3 7 8 12 Side 3 2 10 2 6 44 Total 10 8 11 13 5 7 14 56 Ala (c) CH…O H-bonds Ribose Main Side Total Phosphate Main Side Total Bases Main Side Total Total Main Side Total 3 7 10 2 1 3 1 12 13 6 20 26 7 5 12 2 4 6 3 4 7 12 13 25 5 7 12 2 3 5 4 7 11 11 17 28 5 2 7 3 1 4 4 4 2 5 6 7 7 2 15 17 11 1 12 4 3 7 3 5 8 18 9 27 8 3 11 12 4 16 1 6 7 9 4 13 22 14 36 1 1 2 3 2 5 1 1 1 1 1 1 1 1 5 3 8 3 3 9 9 10 10 14 14 33 33 70 62 132 47 48 94 64 76 140 181 185 366 Ile Leu Met Phe Pro Val Arg Asp Glu Lys Asn Gln His Ser Thr Trp Tyr Gly Total (b) H-bonds on either side of Ribose Main 1 1 Side Total 1 1 Phosphate Main 2 1 Side Total 2 1 Bases Main 3 4 Side Total 3 4 Total Main 6 6 Side Total 6 6 Ala 10 10 2 3 5 5 8 13 7 21 28 the water molecule 2 2 3 2 2 4 3 1 2 2 Ile 1 9 2 2 3 4 7 9 11 11 2 2 3 2 1 2 1 3 2 3 1 6 2 7 3 6 7 Pro Val Met Phe 4 
10 14 2 14 16 2 16 18 8 40 48 Arg Glu 3 6 9 2 1 3 1 10 11 6 17 23 Arg Asp Glu on the protein side and all interaction types 1 1 1 2 1 3 2 1 1 3 1 3 2 1 2 1 2 2 1 3 4 1 2 1 5 1 1 3 2 1 5 1 1 1 3 1 1 1 2 3 2 8 2 6 7 3 2 9 2 1 7 9 Ala Met Val 10 10 2 3 5 2 8 10 4 21 25 7 4 11 2 4 6 2 4 6 11 12 23 5 7 12 2 3 5 4 7 11 11 17 28 Lys Gln 10 1 11 4 3 7 3 4 7 17 8 25 5 2 2 7 7 2 15 17 7 7 His Ser 12 1 13 1 5 6 9 4 13 22 10 32 1 1 2 3 2 5 1 1 5 3 8 Thr Trp 1 1 1 1 1 1 1 1 Tyr 7 7 9 9 11 11 27 27 65 46 111 42 39 81 55 62 117 162 147 309 Gly Total on the RNA side 6 1 1 1 1 1 1 1 1 1 1 2 1 1 6 2 1 1 4 4 4 4 Lys Asn Gln 1 1 1 1 2 2 2 2 2 2 1 1 1 2 3 His Ser Thr Tyr (d) All interaction types on the protein side and CH…O H-bonds on the RNA side Ribose Main 3 2 1 1 Side 1 2 1 1 1 2 1 Total 1 2 1 3 1 2 2 2 1 1 Bases Main 1 1 Side 1 Total 1 1 1 Total Main 1 3 3 1 1 Side 1 2 1 2 1 2 1 Total 2 2 1 3 2 3 2 2 1 1 Copyright # 2001 John Wiley & Sons, Ltd. 5 4 4 2 4 6 1 2 3 1 2 3 3 3 3 1 1 4 1 1 2 2 3 3 3 11 1 1 11 Gly Total 3 1 1 1 1 2 3 7 15 22 6 11 17 8 18 26 21 44 65 4 11 12 23 3 1 4 14 13 27 J. Mol. Recognit. 2001; 14: 199–214 RNA–PROTEIN INTERFACES 209 Table 11. 
Number of observed contacts between amino acids and bases A, C, G, U through bridging water molecules Ala (a) All interaction types A Main 1 Side Total 1 C Main 1 Side 1 Total 2 G Main 2 Side 1 Total 3 U Main Side Total Total Main 4 Side 2 Total 6 Ala Ile Leu Met Phe Pro Val Arg Asp Glu Lys Asn Gln on either side of the water molecule 2 2 1 10 2 3 10 3 1 1 1 3 4 1 1 3 1 9 1 2 1 3 1 9 2 1 4 1 1 1 2 1 1 3 4 9 1 2 2 2 1 5 1 1 18 5 9 5 2 2 3 20 Ile (b) H-bonds on either side of A Main Side Total C Main 1 3 Side Total 1 3 G Main 2 1 Side Total 2 1 U Main Side Total Total Main 3 4 Side Total 3 4 Leu Met Phe Pro 1 1 1 1 4 5 9 2 2 5 8 13 2 2 1 4 5 2 2 2 1 3 2 3 5 4 4 2 2 1 12 13 1 1 2 3 4 7 2 4 6 4 7 11 Ser 1 1 1 1 2 2 3 5 3 3 1 1 2 2 Thr Trp Tyr Gly Total 1 2 3 1 1 8 1 9 3 3 1 1 1 1 1 7 7 Val Arg Asp Glu Lys Asn Gln 3 5 8 Ser 9 4 13 1 1 1 8 1 8 2 1 1 2 14 1 14 10 26 36 12 17 29 33 21 54 9 12 21 64 76 140 Thr Tyr Gly Total the water molecule 2 2 8 8 1 1 9 9 2 2 1 9 9 2 2 1 1 1 2 1 1 2 3 3 1 3 4 1 2 3 2 16 18 simultaneously with a protein atom and an RNA atom. Among the ensemble of 45 RNA-protein complexes, 17 complexes contain bridging water molecules. In the ensemble of these 17 complexes, 208 bridging water molecules are in contact between 290 protein atoms and 292 RNA atoms via H-bonds (neutral, CH…O H-bonds). The average number of bridging water molecules per complex is 12.2 (95% confidence interval: 7.7–16.8). The interaction type between amino acids and water molecules may be H-bond or CH…O H-bond; the same holds true for the contact between nucleotide components and water molecules. There are several possible combinations of these two interaction types. The observed counts of contacts between amino acids and nucleotide components through water molecules for some of these combinations are listed in Table 10 for ribose, phosphate and bases and in Table 11 for the bases separately. 
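The average and confidence interval quoted above for bridging water molecules can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check, not the authors' procedure: it assumes the interval is a symmetric Student t interval on 16 degrees of freedom (the text does not say how it was computed), and the constant `T_CRIT_16` and all variable names are illustrative.

```python
import math

# Figures quoted in the text.
n_complexes = 17            # complexes that contain bridging water molecules
n_waters = 208              # bridging water molecules in those complexes
ci_low, ci_high = 7.7, 16.8 # reported 95% confidence interval of the mean

# Assumption: the interval is mean +/- t * SE with the two-sided 95%
# Student t quantile for 16 degrees of freedom (about 2.120).
T_CRIT_16 = 2.120

mean = n_waters / n_complexes                  # average waters per complex
half_width = (ci_high - ci_low) / 2            # half-width of the interval
std_error = half_width / T_CRIT_16             # implied standard error
std_dev = std_error * math.sqrt(n_complexes)   # implied standard deviation

print(f"mean per complex           = {mean:.1f}")
print(f"implied standard error     = {std_error:.2f}")
print(f"implied standard deviation = {std_dev:.1f}")
```

Under that assumption the implied spread between complexes is of the same order as the mean itself, which fits the statement that the number of bridging water molecules varies widely from one complex to another.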
For most amino acid types, the expected counts are too small for a statistical test to deal with the question of favoured amino acid type– Copyright # 2001 John Wiley & Sons, Ltd. 1 1 1 1 1 5 6 2 2 2 8 10 2 2 1 3 4 2 2 2 1 3 2 3 5 4 4 1 1 1 10 11 1 1 2 4 6 2 4 6 4 7 11 1 2 2 1 1 1 2 3 5 3 3 1 1 1 2 3 7 7 2 1 1 1 8 1 9 1 1 3 4 7 2 9 4 13 1 6 1 6 2 1 1 2 11 1 11 8 20 28 12 12 24 28 19 47 7 11 18 55 62 117 nucleotide component pairs. Several amino acids, and especially those disfavoured in direct contacts (Ala, Ile, Leu, Val), H-bond RNA atoms via water molecules using their main chain atoms [see Table 10(b)], except for Arg, Asp, Glu, Asn, Gln which overall prefer to use their side chain atoms. It is interesting to note that, in this respect, Lys and Phe use almost equally main chain and side chain atoms. One-third of the contacts made by Gly involve the Ca-H [Table 10(c)]. The amino acids Met involve their side chain atoms in C-H…O contacts with RNA, especially base atoms. The numbers of contacts between nucleotide components and amino acids through water molecules with main chain vs side chain do not differ significantly from each other, except for adenine and guanine bases: A contacts preferentially side chain over main chain while G contacts preferentially main chain over side chain atoms. Leu, Thr and Gly bridge their main chain atoms via water molecules to the base G in a large proportion. The atoms of RNA involved in H-bonds with bridging water molecules may J. Mol. Recognit. 2001; 14: 199–214 210 M. TREGER AND E. WESTHOF Table 12. 
Observed counts of interface amino acids in contact with RNA through their main chain vs side chain, along with their localization in secondary structures

(a) Contacts with ribose, phosphate, bases

                  Ribose   Phosphate   Bases   Total
Helix    Main        69        56        52      177
         Side       206       231       120      557
         Total      275       287       172      734
Sheet    Main        22        16        40       78
         Side       115        96       119      330
         Total      137       112       159      408
Other    Main       229       183       213      625
         Side       345       313       352     1010
         Total      574       496       565     1635
Total    Main       320       255       305      880
         Side       666       640       591     1897
         Total      986       895       896     2777

(b) Contacts with bases A, C, G, U

                    A      C      G      U    Total
Helix    Main      11     14     23     13      61
         Side      42     36     44     28     150
         Total     53     50     67     41     211
Sheet    Main      17      8      3     13      41
         Side      34     33     28     52     147
         Total     51     41     31     65     188
Other    Main      67     47     75     51     240
         Side     142     76    158     95     471
         Total    209    123    233    146     711
Total    Main      95     69    101     77     342
         Side     218    145    230    175     768
         Total    313    214    331    252    1110

also be involved in H-bonds directly with amino acids. Those RNA atoms, whose observed counts are sufficient (O2', O4', O3', O1P, O2P, N1, N2, N3), give no significant preference for contact through water molecules or directly with amino acids.

Secondary structures in interfaces

Among all interface amino acids, 44% belong to a secondary structure: 28% to helices and 16% to sheets. The number of interface amino acids belonging to a given secondary structure varies from 1 to 20. The distribution of the secondary structures with their corresponding number of interface amino acids is asymmetrical, with a median of 5, i.e. half of the secondary structure elements contribute five or fewer interface amino acids. The interface amino acids are in contact with RNA through their main chain (26%) or their side chain (74%). The amino acids belonging to helices and in contact through their main chain are less numerous than expected, whereas the amino acids belonging to regions without repetitive secondary structure (loops, junctions, etc.) and in contact through their main chain are more numerous than expected.
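The comparison of observed and expected main chain counts can be illustrated with an ordinary chi-square test of independence. The sketch below uses the main chain and side chain counts from the "Total" column of Table 12(a) as transcribed in this text; it is a plain pooled Pearson test written for illustration, not necessarily the exact procedure used by the authors, and the function names are hypothetical.

```python
# Counts from the "Total" column of Table 12(a):
# rows = secondary structure class, columns = (main chain, side chain).
observed = {
    "helix": (177, 557),
    "sheet": (78, 330),
    "other": (625, 1010),
}


def expected_counts(table):
    """Expected counts under independence: row total * column total / grand total."""
    col_totals = [sum(row[j] for row in table.values()) for j in range(2)]
    grand = sum(col_totals)
    return {
        name: tuple(sum(row) * col_totals[j] / grand for j in range(2))
        for name, row in table.items()
    }


def chi_square(table):
    """Pearson chi-square statistic, summed over every cell of the table."""
    exp = expected_counts(table)
    return sum(
        (obs - e) ** 2 / e
        for name, row in table.items()
        for obs, e in zip(row, exp[name])
    )


expected = expected_counts(observed)
statistic = chi_square(observed)

for name, row in observed.items():
    print(f"{name}: main chain observed {row[0]}, expected {expected[name][0]:.1f}")
print(f"chi-square = {statistic:.1f} on 2 degrees of freedom")
```

With these counts the expected number of main chain contacts is larger than the observed one for helices and smaller for the "other" class, in line with the direction of the effects stated above. For reference, the critical value of chi-square with 2 degrees of freedom at the 0.001 level is about 13.8.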
The counts of interface amino acids in contact with ribose, phosphate and bases through their main chain and their side chain along with their localization in the secondary structures are listed in Tables 12(a) and (b). Amino acids belonging to helices and in contact with RNA through their side chain prefer contacts with phosphate over bases; for amino acids in contact through their main chain, the preferences are in the reverse order [computed from Table 12(a)]. Amino acids belonging to sheets and in contact through their side chain prefer the uracil over the guanine bases.

DISCUSSION

Here, we will discuss and compare our results with previous data and especially with the work on a smaller set of complexes (Jones et al., 2001) which appeared during the course of our own analysis (see Table 13).

Secondary structures in interfaces

In subunit or domain interfaces, the amino acids which belong to distinct secondary structures represent 70% of the interface surface (Argos, 1988). In RNA–protein interfaces, the secondary structures, helix or sheet, with a unique

Table 13.
Comparison between some properties of RNA±protein interfaces derived from the present work and from another analysis of 32 RNA±protein complexes (Jones et al., 2001) Present work The ensemble of all interaction types Most preferred amino acid types Less preferred amino acid types Contribution to total interface amino acids Ribose, phosphate, bases Bases A/C/G/U Favoured pairs Distribution of interaction types H-bonds (neutral, ionic, CH…O) ionic H-bonds CH…O H-bonds van der Waals Short H-bonds Ratio: bases/(ribose phosphate) Atoms of RNA involved Arg, Asn, Ser, Lys Ala, Ile, Leu, Val Hydrophobic, 22% Charged, 40% positive, 32% negative, 8% Polar, 30% Gly, 8% 42% of interactions with bases No significant preference Arg/Lys–phosphate Met/Phe/Tyr–ribose Ile/Pro/Ser–A Leu–C Asp/Gly–G Asn–U 23% 3% 12% 8% 72% 5% 32 RNA–protein complexes sample (Jones et al., 2001) Tyr, Lys, Phe, Ile, Arg, Asn, Ser His, Asp, Pro, Ala, Glu 58% of interactions with bases Preference for G and U Arg/Lys–phosphate Ser/Phe–A Asp/Tyr–C Trp/Gly–G Asn/Glu–U 8% 92% 1.1 (0.4–1.8) O2', 21% of all H-bonds O2', O6(G) and N2(G) preferred O4', O1P and N3(G) infrequently Main chain, 32% O, OXT, 14% NH, 19% Side chain, 68% O, 12% OH, 14% NH, 42% SD Met, 0.3% 1.0 Ribose, O2' only van der Waals Ratio: bases/(ribose phosphate) Atoms of ribose involved 1.0 (0.6–1.5) O2', 32% 1.4 O2', 34% CH…O H-bonds CH groups of RNA Favoured pairs Principally C5', C4', C2(A), C1' Phe–A Ser–A Atoms of protein involved Bridging water molecules Average number of water molecules that form H-bonds to both RNA and protein 12.2 (7.7–16.8) interface amino acid contain only 5% of the total interface amino acids and most of them contain three amino acids. Draper (1999) distinguishes two main classes of proteins binding to RNA. In the first class, a secondary structure element (helix, ribbon or even unstructured region) binds in Copyright # 2001 John Wiley & Sons, Ltd. 0.3 (0.1–0.5) a frequently distorted groove of the RNA helix. 
In the second class, a β-sheet surface binds to a single-stranded RNA region. He noticed that, in the second class, the proteins tend to ‘ignore the RNA backbone’ (Draper, 1999). Interestingly, from Table 12(a), one can observe proportionally twice as many contacts with the sugar-phosphate backbone in contacts between helices and RNA as in contacts between sheets and RNA. Two trends can be further noticed, although they are not statistically significant in our present set of structures. The protein main chain atoms are less often implicated in contacts when they belong to β-sheets than when they form an α-helix. In the absence of repetitive structures, there is no difference in the number of contacts formed by main chain and side chain atoms.

Size of the interfaces

The size of the interfaces, measured by the number of their atoms, varies from one complex to another: the percentage of atoms located at the interfaces ranges from 1% up to 25%. A point plot of atom counts in interfaces against atom counts in the complex, after square root transformation, shows scattered points with no relationship between the two counts. The same holds true for protein–protein interfaces (Conte et al., 1999), subunit–subunit interfaces and domain interfaces (Argos, 1988), as well as DNA–protein interfaces (Nadassy et al., 1999).

The composition of amino acids at the interfaces

Considering the counts of amino acids of each type in the protein, Ala, Ile, Leu, Val are infrequently located in RNA–protein interfaces, whereas Arg, Asn, Ser, Lys are most frequently located in the interfaces (see Table 3). These findings may be compared with those obtained from a sample of 32 RNA–protein complexes, in which the amino acid interface propensities have been calculated for every amino acid type (Jones et al., 2001).
From that study, the amino acids Arg, Lys, Asn, Tyr, Phe, Ile, Ser, which have values of propensity greater than one, occur more frequently in the interface than in the remaining protein surface; on the contrary, the amino acids Asp, Glu, His, Ala, Pro occur less frequently in the interfaces than in the remaining surface. However, owing to the variability of values from one complex to another, a value of propensity greater than 1 may be due to random sampling fluctuations and may not necessarily indicate that this amino acid occurs significantly more frequently in the interface than on the protein surface. Thus, the values of the amino acid propensities are difficult to compare with the preferences for some amino acids over the others found in the present sample of RNA–protein complexes and it is not possible to deduce whether the differences between amino acid preferences observed in the two samples of complexes are significant.

The protein–protein interfaces (Conte et al., 1999), the subunit–subunit interfaces and the domain interfaces (Argos, 1988) are depleted in the charged amino acids Asp, Glu, Lys compared to the remaining surface. These interfaces contain more aromatic amino acids, His, Tyr, Phe, Trp, and aliphatic amino acids, Leu, Ile, Val, Met, than the remaining surface. Arg is the most abundant amino acid in protein–protein interfaces (Conte et al., 1999). In the RNA–protein complexes, the percentages of amino acids at the interfaces, defined as percentage atom count contributions to the total atom count of interfaces, have the following means and 95% confidence intervals. For example: Arg, 19.6% (16.0–23.1); Lys, 14.2% (11.4–17.0); Gly, 6.6% (5.0–8.3); Asp, 2.6% (1.6–3.5); Ile, 2.5% (1.4–3.5); Leu, 2.6% (1.6–3.6); Tyr, 5.9% (3.7–8.0).
The percentages in question are not defined in the same way as for the protein–protein interfaces: in protein–protein interfaces (Conte et al., 1999; Argos, 1988) they are defined as percentage area contribution to the solvent-accessible surface area. Nevertheless, they may be compared. Arg, Lys and Gly are more represented in RNA–protein interfaces than in protein–protein interfaces, whereas Asp, Ile, Leu and Tyr are less represented in RNA–protein interfaces than in protein–protein interfaces. From these results one may conclude that protein–protein interfaces and RNA–protein interfaces differ from each other in amino acid composition: Arg, Lys, Gly are more abundant and Ile, Leu, Tyr less abundant in RNA–protein interfaces than in protein–protein interfaces. The amino acid compositions of RNA–protein interfaces and DNA–protein interfaces (Nadassy et al., 1999) seem similar: the interfaces contain principally positively charged amino acids. The percentages of amino acids in the interfaces are approximately the same in the two kinds of complexes, except for Gly which seems more abundant in RNA–protein interfaces than in DNA–protein interfaces. Ribose, phosphate and bases at the interfaces In DNA–protein complexes, the interfaces and the remaining surface differ from each other in an excess of phosphate and a lack of ribose in the interfaces vs the remaining surface, measured by the fraction of solvent-accessible surface area for ribose, phosphate and bases (Nadassy et al., 1999). On the other hand, in the RNA–protein complexes, considering the total counts of ribose, phosphate and bases in the complexes, the proportions of ribose, phosphate and bases located in the interfaces vary from one complex to another, and no significant preference for ribose or phosphate or bases over the others appears in the interfaces. 
The ribose, phosphate and bases at the interfaces may also be measured by the percentage of their atom count contributions to the total atom count of the interfaces. The values are 34% for ribose, 31% for phosphate and 33% for bases, with a large variation between complexes. Considering the total number of interactions, all types combined, between atoms of the protein and atoms of the RNA, 29% occur with ribose, 29% with phosphate and 42% with bases. In the 32 RNA–protein complexes sample (Jones et al., 2001), the bases are involved in 58% of the total count of interactions, which is greater, but without any assurance of significance.

Dependency between the amino acid type and the nucleotide component

From the present sample of RNA–protein complexes one may summarize the preferences that some amino acid types, except for Cys and Trp whose expected count is too small, have for some nucleotide components over the others as follows: Arg and principally Lys display a preference for phosphate over ribose and bases. Some amino acids show a preference for some bases over the others: Ile, Pro, Ser prefer A over the others; Leu prefers C over the others; Asp and Gly prefer G over the others; and Asn prefers U over the others.

The findings from the 32 RNA–protein complexes sample (Jones et al., 2001) allow computation of a chi-square statistic from the observed counts of interactions (H-bonds and van der Waals interactions summed) between the amino acid types (all amino acid types except Cys and His which have too small expected counts) and the nucleotide components ribose, phosphate, base A, base C, base G and base U. The expected counts are calculated as usual in a chi-square test.
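How the expected counts are formed matters when the data come from many complexes of different composition. The sketch below contrasts expected counts computed once from a pooled table with expected counts computed within each complex and then summed. The counts for the two complexes are invented purely for illustration (they are not taken from either study), and the function names are hypothetical.

```python
def expected_pooled(table):
    """Expected counts from the margins of a single table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[rt * ct / grand for ct in col_totals] for rt in row_totals]


def expected_stratified(tables):
    """Expected counts computed within each stratum (complex), then summed."""
    n_rows, n_cols = len(tables[0]), len(tables[0][0])
    total = [[0.0] * n_cols for _ in range(n_rows)]
    for table in tables:
        for i, row in enumerate(expected_pooled(table)):
            for j, value in enumerate(row):
                total[i][j] += value
    return total


# Hypothetical contact counts for two complexes.
# Rows = (Arg, Ser); columns = (phosphate, base).
complex_1 = [[30, 10], [5, 5]]
complex_2 = [[2, 8], [10, 30]]

# Pooled table: element-wise sum of the per-complex tables.
pooled = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(complex_1, complex_2)]

pooled_expected = expected_pooled(pooled)[0][0]
stratified_expected = expected_stratified([complex_1, complex_2])[0][0]

print("observed Arg-phosphate contacts:", pooled[0][0])
print("expected from the pooled table :", pooled_expected)
print("expected summed over complexes :", stratified_expected)
```

In this invented example the pooled table suggests a marked excess of Arg–phosphate contacts, whereas the per-complex expectation is close to the observed count, because most of the apparent excess comes from the differing composition of the two complexes. The sketch assumes every complex has at least one contact; an empty table would need to be skipped before computing its margins.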
This test points out a relationship between the two factors (amino acid type and nucleotide component) and shows favoured interactions between Arg–phosphate, Lys–phosphate, Asn–U, Gly–G, Tyr–C, Asp–C, Ser–A, Phe–A, Glu–U, and principally Trp–G. Some favoured contacts between amino acid types and nucleotide components are common between the two samples of complexes; Trp–G could not be found in the present study owing to a small expected count. Nevertheless, the statistic computed should be a Boyd and Doll chi-square statistic, that is to say that the expected counts should be computed in every complex and then summed. This may explain some differences in the favoured contacts observed in the two samples of complexes. Furthermore, in the 32 RNA–protein complexes sample, the van der Waals interactions involving a given amino acid are not counted if this amino acid is involved in H-bonds, whereas in the present sample of complexes, all van der Waals interactions have been counted. As these interactions are the most numerous, this difference in counting the contacts between amino acids and nucleotide components may also explain some divergences in the favoured contacts observed in the two samples of complexes.

The H-bonds

In the 32 RNA–protein complexes sample, the H-bonds represent only 8% of the total count of interactions (Jones et al., 2001), compared with 12% in the present sample. This difference may be due in part to the additional criterion involving the angle at the calculated position of the H atom in a potential H-bond in the former study (Jones et al., 2001). In DNA–protein interfaces, the H-bonds involve principally phosphates (60% of H-bonds; Nadassy et al., 1999). In RNA–protein interfaces, the observed number of H-bonds and ionic interactions involving phosphates is greater than expected, considering the counts of ribose, phosphate and bases involved in all interaction types, but they constitute only 43% of all potential H-bonds.
The bases are involved in 38% of all H-bonds, as in DNA–protein interfaces. In RNA–protein interfaces the ribose is more often involved in H-bonds (20%) than in DNA–protein complexes (6%). Actually, the latter difference is greater since atoms O3' and O5' are counted as belonging to ribose in DNA–protein complexes and to phosphate in RNA–protein complexes, but, in the absence of information on the counts of H-bonds involving ribose, phosphate and bases in DNA–protein interfaces, it is not possible to assert that these differences are significant. In DNA–protein complexes, only 10% of the H-bonds have the H donor group on the DNA (Nadassy et al., 1999), whereas in RNA–protein complexes, more than 26% of the H-bonds have the H donor group on the RNA.

Atoms involved in H-bonds. In RNA–protein interfaces, the atom O2' is involved in 21% of all H-bonds. This percentage agrees with the one derived from a small sample of RNA–protein interfaces (Nadassy et al., 1999). Considering the number of H-bonds and van der Waals interactions involving O2' and the other atoms involved in hydrogen bonding, the number of H-bonds involving O2' is greater than expected; the same holds true for N2(G) and O6(G); conversely, O4', O1P and N3(G) have fewer H-bonds than expected. In DNA–protein complexes, H-bonds involving bases occur frequently between G (O6, N7) and Arg or Lys and between A (N6, N7) and Asn or Gln. On the contrary, in RNA–protein complexes, H-bonds involving bases occur principally with N2(G) and O6(G) (as was suggested by one crystal structure, Masquida et al., 1999). The guanine bases display a preference for Asp and Glu over the other amino acids in the case of H-bonds and for Phe in the case of CH…O H-bonds; A has a preference for Ser over the other amino acids in the case of H-bonds and principally in the case of CH…O H-bonds.
Water molecules bridging protein to RNA

The O2' hydroxyl of RNA is the group most frequently involved in H-bonds with bridging water molecules despite the fact that it is surrounded by rapidly exchanging water molecules as shown by molecular dynamics simulations of RNAs (Auffinger and Westhof, 1997). Atoms C2, C6, C8 of RNA are rarely involved in such H-bonds, since they form intramolecular CH…O H-bonds with oxygen atoms in the RNA (Auffinger and Westhof, 1997). Water bridges between protein main chain and RNA atoms are surprisingly frequent, except for those amino acids which are found most frequently in direct contact with RNA (like Arg and Asn).

CONCLUSIONS

While the sizes of the samplings for each variable were adequate for the conclusions reached, it is not yet known whether the crystallized complexes represent a significant sampling of the overall population of RNA–protein complexes. Within the approximations made and with this caveat, the main conclusions of the present analysis are contained in Table 13. The complexity and diversity of contacts of RNA–protein interfaces are apparent. One can notice a frequent involvement of hydrophobic and non-polar amino acids (30%) with a nearly equal partition of contacts between bases and the sugar-phosphate backbone, but there is no clear preference for a single nucleic acid base. The hydrophilic atoms of the two pyrimidine bases form about the same number of H-bonds. On the other hand, between the two purine bases, while the N3(G) is avoided, it is the N7(A) which is avoided in adenine. There is a small but definite preference between some amino acid types and some nucleic acid bases (e.g. Ser–A, Gly–G or Asn–U). As in all interfaces, there is a high percentage of van der Waals contacts between the RNA and the proteins (about three-quarters of them).
The importance of the protein main chain is noticeable since about one-third of the protein atoms making contacts belong to the main chain. On the other hand, although positively charged amino acids are among the most preferred amino acids at the interface between RNA and proteins, they account for only one-third of the total number of contacts. In slightly less than half of the complexes, an average of 12 water molecules occurs in the interface bridging atoms of the RNA and the protein. Thus, overall, it is clear that no single interaction type or contact stands out. All the electrostatic and weak interactions constituting RNA–protein complexes form an ensemble of embedded hierarchies maintaining a complex network of precisely fitted and adapted inter- and intramolecular contacts.

Acknowledgements

We wish to thank Professor Michel Roos for his help with statistical methods and Pascal Auffinger for numerous and fruitful discussions on molecular interactions. M.T. thanks Professor M. Roos for support.

REFERENCES

Argos P. 1988. An investigation of protein subunit and domain interfaces. Protein Engng 2: 101–113.
Auffinger P, Louise-May S, Westhof E. 1996. Hydration of C-H groups in tRNA. Faraday Discuss. 103: 151–173.
Auffinger P, Westhof E. 1997. Rules governing the orientation of the 2'-hydroxyl group in RNA. J. Mol. Biol. 274: 54–63.
Ban N, Nissen P, Hansen J, Moore PB, Steitz TA. 2000. The complete atomic structure of the large ribosomal subunit at 2.4 Å resolution. Science 289: 905–920.
Bernstein FC, Koetzle TF, Williams GJB, Meyer EF, Brice MD, Rodgers JR, Kennard O, Shimanouchi T, Tasumi M. 1977. The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112: 535–542.
Brandl M, Lindauer K, Meyer M, Sühnel J. 1999. C-H…O and CH…N interactions in RNA structures. Theoret. Chem. Acc. 101: 103–113.
Brandl M, Weiss MS, Jabs A, Sühnel J, Hilgenfeld R. 2001. CH…π interactions in proteins. J. Mol. Biol. 307: 357–377.
Colquhoun D. 1971. Lectures on Biostatistics. Clarendon Press: Oxford.
Conover WJ. 1980. Practical Nonparametric Statistics. John Wiley: Chichester.
Conte LL, Chothia C, Janin J. 1999. The atomic structure of protein–protein recognition sites. J. Mol. Biol. 285: 2177–2198.
Draper DE. 1999. Themes in RNA–protein recognition. J. Mol. Biol. 293: 255–270.
Hermann T, Westhof E. 1999. Non-Watson–Crick base pairs in RNA–protein recognition. Chem. Biol. 6: R335–R343.
Jones S, Daley DTA, Luscombe NM, Berman HM, Thornton JM. 2001. Protein–RNA interactions: a structural analysis. Nucleic Acids Res. 29: 943–954.
Knight RD, Freeland SJ, Landweber LF. 1999. Selection, history and chemistry: the three faces of the genetic code. Trends Biochem. Sci. 24: 241–247.
Knight RD, Landweber LF. 2000. Guilt by association: the arginine case revisited. RNA 6: 499–510.
Levitt M, Perutz MF. 1988. Aromatic rings act as hydrogen bond acceptors. J. Mol. Biol. 201: 751–754.
Masquida B, Sauter C, Westhof E. 1999. A sulfate pocket formed by three GoU pairs in the 0.97 Å resolution X-ray structure of a nonameric RNA. RNA 5: 1384–1395.
Nadassy K, Wodak SJ, Janin J. 1999. Structural features of protein–nucleic acid recognition sites. Biochemistry 38: 1999–2017.
Ribas de Pouplana L, Schimmel P. 2001. Operational RNA code for amino acids in relation to genetic code in evolution. J. Biol. Chem. 276: 6881–6884.
Snedecor GW, Cochran WG. 1989. Statistical Methods. Iowa State University Press: Ames.
Sundaralingam M. 1973. The concept of a conformationally 'rigid' nucleotide and its significance in polynucleotide conformational analysis. In Conformation of Biological Molecules and Polymers (Bergmann VED & Pullmann B eds), pp. 417–456, The Israel Academy of Sciences and Humanities, Jerusalem.
VAX Rdb/VMS Reference Manual, VAX Rdb/VMS Guide to Database Design and Definition, VAX Rdb/VMS Guide to Data Manipulation, VAX Rdb/VMS Guide to Programming, RDML Reference Manual, VAX Rdb/VMS Guide to Database Administration and Maintenance, Digital Equipment Corporation, February, 1987.
Westhof E, Fritsch V. 2000. RNA folding: beyond Watson–Crick pairs. Structure 8: R55–R65.
Winer BJ. 1971. Statistical Principles in Experimental Design. McGraw-Hill: Kogakusha.