JOURNAL OF MOLECULAR RECOGNITION
J. Mol. Recognit. 2001; 14: 199–214
DOI:10.1002/jmr.534
Statistical analysis of atomic contacts at RNA–protein interfaces
Michèle Treger1 and Eric Westhof2*
1 Laboratoire de Biostatistique et d’Informatique Médicale, Faculté de Médecine, Université Louis Pasteur, 4 rue Kirshleger, F-67000 Strasbourg, France
2 UPR 9002 du CNRS, Institut de Biologie Moléculaire et Cellulaire, 15 rue René Descartes, F-67084 Strasbourg Cedex, France
Forty-five crystals of complexes between proteins and RNA molecules from the Protein Data Bank have
been statistically surveyed for the number of contacts between RNA components (phosphate, ribose and the
four bases) and amino acid side chains. Three groups of complexes were defined: the tRNA synthetases; the
ribosomal complexes; and a third group containing a variety of complexes. The types of atomic contacts
were a priori classified into ionic, neutral H-bond, C-H…O H-bond, or van der Waals interaction. All the
contacts were organized into a relational database which allows for statistical analysis. The main
conclusions are the following: (i) in all three groups of complexes, the most preferred amino acids (Arg, Asn,
Ser, Lys) and the less preferred ones (Ala, Ile, Leu, Val) are the same; Trp and Cys are rarely observed
(respectively 15 and 5 amino acids in the ensemble of interfaces); (ii) of the total number of amino acids
located at the interfaces 22% are hydrophobic, 40% charged (positive 32%, negative 8%), 30% polar and
8% are Gly; (iii) in ribosomal complexes, phosphate is preferred over ribose, which is preferred over the
bases, but there is no significant preference in the other two groups; (iv) there is no significant prevalence of
a base type at protein–RNA interfaces, but specifically Arg and Lys display a preference for phosphate over
ribose and bases; Pro and Asn prefer bases over ribose and phosphate; Met, Phe and Tyr prefer ribose over
phosphate and bases. Further, Ile, Pro, Ser prefer A over the others; Leu prefers C; Asp and Gly prefer G;
and Asn prefers U. Considering the contact types, the following conclusions could be drawn: (i) 23% of the
contacts are via potential H-bonds (including CH…O H-bonds and ionic interactions), 72% belong to van
der Waals interactions and 5% are considered as short contacts; (ii) of all potential H-bonds, 54% are
standard, 33% are of the C-H…O type and 13% are ionic; (iii) the Watson–Crick sites of G, O6(G) and
principally N2(G), and the hydroxyl group O2' are more often involved in H-bonds than expected; the protein
main chain is involved in 32% and the side chains in 68% of the H-bonds; considering the neutral and ionic
H-bonds, the following couples are more frequent than expected—base A–Ser, base G–Asp/Glu, base U–
Asn. The RNA CH groups interact preferentially with oxygen atoms (62% on the main chain and 19% on
the side chains); (iv) the bases are involved in 38% of all H-bonds and more than 26% of the H-bonds have
the H donor group on the RNA; (v) the atom O2' is involved in 21% of all H-bonds, a number greater than
expected; (vi) amino acids less frequently in direct contact with RNA components interact frequently via
their main chain atoms through water molecules with RNA atoms; in contrast, those frequently observed in
direct contact, except Ser, use instead their side chain atoms for water bridging interactions. Copyright
© 2001 John Wiley & Sons, Ltd.
Keywords: RNA; protein; contact; interface; statistics
Received 21 March 2001; revised 4 April 2001; accepted 4 April 2001
*Correspondence to: E. Westhof, UPR 9002 du CNRS, IBMC, 15 rue René Descartes, F-67084 Strasbourg Cedex, France.
Email: [email protected]
INTRODUCTION
RNA molecules can fold and perform chemical reactions
without the help of proteins. Further, the recent crystallographic work on the 50S particles concluded that the
ribosome is a ribozyme and, thus, any protein on earth is
chemically assembled by catalysis performed solely by
RNA components (Ban et al., 2000). However, the
ubiquitous and essential functions executed by RNA
molecules in living cells require the involvement of several
proteins at all steps of the activity of an RNA molecule. Also,
theories on the origins of the genetic code often imply
stereospecific recognition between RNA bases and the
codon table (Knight et al., 1999; Ribas de Pouplana and
Schimmel, 2001). It has, therefore, been suggested that
RNA aptamers raised against a given amino acid would
preferentially contain bases involved in the triplet coding of
that particular amino acid (Knight and Landweber, 2000).
The recent increase in the number of crystal structures of
complexes between protein and a cognate RNA molecule
offers the possibility to analyze systematically whether
there are molecular biases between RNA components
(either non-specific like phosphate and ribose or specific
like the bases) and amino acid side chains present in
proteins.
Table 1. The protein–RNA complexes used for the statistical analysis in this study. The 30S ribosomal subunit is
composed of 20 polypeptidic chains in contact with RNA. The PDB code is given in parentheses
Complexes                                        Amino acids  Nucleotides  Resolution (Å)
Synthetases
  Aspartyl tRNA synthetase (1C0A)                       1170          154            2.40
  Glutaminyl tRNA synthetase (1GTR)                      553           74            2.50
  Seryl tRNA synthetase (1SER)                           842           94            2.90
  Threonyl tRNA synthetase (1QF6)                        642           76            2.90
  Isoleucyl tRNA synthetase (1QU2)                       917           75            2.20
  Phenylalanyl tRNA synthetase (1EIY)                   1135           76            3.30
Ribosome
  Ribosomal protein L25 (1DFU)                            94           38            1.80
  Ribosomal protein L11 (1MMS)                           140           58            2.57
  30S ribosomal subunit (1FJF)                          2540         1522            3.05
Others
  MS2 protein capsid (1E6T)                              194           19            2.20
  Satellite tobacco mosaic virus (1A34)                  159           21            1.81
  Black beetle virus capsid protein (2BBV)               407            3            2.80
  Methionyl-tRNA formyl transferase (2FMT)               314           78            2.80
  U1A spliceosomal protein (1URN)                         97           21            1.92
  U2 Snrnp (1A9N)                                        272           24            2.38
  Elongation factor Tu (1B23)                            405           74            2.60
  Trp RNA-binding attenuation protein (1C9S)            1628           55            1.90
  Sxl-lethal protein (1B7F)                              168           12            2.60
  Double stranded RNA binding protein (1DI2)              69           20            1.90
  Transcription termination factor (2A8V)                354            9            2.40
  Signal recognition particle protein (1DUL)              69           49            1.80
  RNA binding protein Nova-2 (1EC6)                       87           20            2.40
  Poly(A) polymerase regulatory subunit (1AV6)           289            7            2.70
  Bean pod mottle virus (1BMV)                           198           11            3.00
  Desmodium yellow mottle tymovirus (1DDL)               564            9            2.70
  Cowpea chlorotic mottle virus (1CWP)                   570           10            3.20
Thus, 45 crystalline complexes were retrieved from the
Protein Data Bank and distributed into three classes: the
tRNA synthetase group; the ribosomal complexes; and a
group comprising various types of complexes. The atom–
atom contacts between the RNA and the protein components
were then calculated and sorted in a relational database. The
contacts were classified into three main categories: salt
bridge (or ionic); H-bonding type; and van der Waals
interaction. The H-bond type covers the potential neutral H-bond as well as the potential C-H…O/N H-bond types. The
present analysis is, therefore, purely statistical,
considering only interatomic distances without any explicit
reference to energetic ranking. In addition, since the
analysis considers only stable and crystallized complexes, it
will miss the roles of the interactions important for the
dynamics of complex formation. Previous work has
emphasized the protein structural elements in recognition
(Draper, 1999) and the central roles of non-Watson–Crick
base pairs in RNA deformation and recognition (Hermann
and Westhof, 1999; Westhof and Fritsch, 2000).
MATERIALS AND METHODS
Data set of RNA–protein complexes
The Protein Data Bank (Bernstein et al., 1977) contains
several categories of protein–RNA complexes. Since the
complexes in these categories have different functions, the
interface between protein and RNA may have different
properties. We have distinguished three categories of
complexes: tRNA synthetases–tRNA (six complexes),
ribosomes (22 complexes), and the others (17 complexes) comprising complexes from viruses, an elongation
factor, or ribozymes (Table 1). This set of 45 complexes
contains non-homologous complexes and at least one
representative complex chosen among the protein–
RNA complexes available at the PDB until January 2001
on the basis of resolution. When the asymmetric unit
contained several copies of the complex, only one was
retained. In the case of oligomeric structures, the data
corresponding to the biologically significant oligomeric
state were selected. In the case of icosahedral viruses,
the three-fold association of the capsid proteins was
considered.
Table 2. Examples of atom groups (a) involved in interactions and (b) of possible interaction types between atoms
(a)
Atom       Name of the atom group   Atom groups
Nitrogen   N                        N
           NH                       NH, NH2
           NHP                      NH2+, NH3+
           NH_NHP                   NH2, NH2+
Oxygen     O                        O
           O_ON                     O, O−
           OH                       OH
Carbon     C                        C
           CH                       CH, CH2, CH3
(b)
          C     CH      NH      NHP      NH_NHP   N and O  OH      O_ON
C         S/W
CH        S/W   S/W
NH        S/W   S/W     S/H/W
NHP       S/W   S/W     S/H/W   S/W
NH_NHP    S/W   S/W     S/H/W   S/HS/W   S/HS/W
N and O   S/W   S/C/W   S/H/W   S/H/W    S/H/W    S/W
OH        S/W   S/C/W   S/H/W   S/H/W    S/H/W    S/H/W    S/H/W
O_ON      S/W   S/C/W   S/H/W   S/IH/W   S/IH/W   S/W      S/H/W   S/W
(a) The atoms involved in interactions (N, O, C, …) belong to atom groups. Several atom groups may be involved in the same interaction type
and are depicted by the same name.
(b) Interaction types between atom groups appearing in (a). Depending on the ionization of the atoms, several interaction types are possible:
ionic bond (I), hydrogen bond (H), CH…O H-bond (C), van der Waals interaction (W) and short contact (S). The interaction type is defined as a
sequence of at most five symbols among the previous ones. It appears after a condition on the distance d between heavy atoms (given in Å): H,
C, IH, W, etc. The criteria are slightly more stringent than in other works, which should alleviate slightly the approximation of not calculating H
atoms and therefore neglecting angle values. The deduced numbers of contacts involving H-bonds are thus overestimated. The symbols
appearing in (b) have the following meaning:
S/W: d < 3.3 S, 3.3 < d ≤ 3.8 W;
S/C/W: d < 2.5 S, 2.5 ≤ d ≤ 3.3 C, 3.3 < d ≤ 3.8 W;
S/CS/W: d < 2.5 S, 2.5 ≤ d ≤ 3.3 CS, 3.3 < d ≤ 3.8 W;
S/H/W: d < 2.5 S, 2.5 ≤ d ≤ 3.3 H, 3.3 < d ≤ 3.8 W;
S/HS/W: d < 2.5 S, 2.5 ≤ d ≤ 3.3 HS, 3.3 < d ≤ 3.8 W;
S/IH/W: d < 2.5 S, 2.5 ≤ d ≤ 3.3 IH, 3.3 < d ≤ 3.8 W.
Computational methods
Interaction types between atoms. Given two atoms, one
belonging to the protein and one belonging to the RNA,
several interactions, depending on the ionization of the two
atoms and their mutual distance, are possible: ionic bond (I);
potential hydrogen bond (H); potential CH…O H-bond (C);
van der Waals interaction (W); and short contact (S). A
short contact occurs if the two atoms are too close according
to the chosen criteria. In order to compute the interaction
type from the atoms in contact, we have defined a finite and
deterministic automaton which takes as input the names of
the two atoms, the names of the amino acids to which they
belong and their mutual distance. The output of the
automaton is a particular sequence of at most five symbols
among the following: I, H, C, W, S. Table 2(a) presents
atoms (nitrogen, oxygen, carbon) appearing in different
groups (NH, NH2, NH2+, …, O, O−, …, CH, CH2, CH3,
etc.) put together under different names (NH, NH_NHP, O,
O_ON, CH, etc.), while Table 2(b) gives the interaction
types between these atoms depending on their group names
and their mutual distance.
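The distance-based part of this classification can be sketched as a small lookup followed by threshold tests. The Python sketch below is only an illustration, not the authors' automaton: the function name `classify_contact` and the constant names are hypothetical, only a few group pairs from Table 2(b) are included, and counting a distance of exactly 3.3 Å as W for the S/W pattern is an assumption.

```python
# Patterns follow the footnote of Table 2(b); only a few group pairs are listed here.
PATTERNS = {
    frozenset(["C", "CH"]): "S/W",
    frozenset(["CH", "O_ON"]): "S/C/W",
    frozenset(["NH", "OH"]): "S/H/W",
    frozenset(["OH"]): "S/H/W",          # OH paired with OH
    frozenset(["NHP", "O_ON"]): "S/IH/W",
}

VDW_MAX = 3.8    # upper distance limit (in Å) for a van der Waals contact
BOND_MAX = 3.3   # upper distance limit (in Å) for H, C, IH, ... contacts
SHORT_MAX = 2.5  # below this distance a bonded pattern becomes a short contact


def classify_contact(group_a, group_b, d):
    """Return the interaction symbol for two atom groups at distance d (Å)."""
    symbols = PATTERNS[frozenset([group_a, group_b])].split("/")
    if d > VDW_MAX:
        return None                      # too far apart: no contact is recorded
    if len(symbols) == 2:                # pattern S/W
        # Assumption: d exactly equal to 3.3 Å is counted as W.
        return "S" if d < BOND_MAX else "W"
    if d < SHORT_MAX:
        return "S"
    return symbols[1] if d <= BOND_MAX else "W"


# Example: a positively charged NH group 2.8 Å from a phosphate oxygen.
print(classify_contact("NHP", "O_ON", 2.8))
```

Keying the table on an unordered pair makes the lookup independent of which atom belongs to the protein and which to the RNA.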
Atoms and residues located in the interface between
protein and RNA. By convention, we consider that the
atoms located in an interface are those which are involved in
some of the interactions previously defined.
Classification of the amino acids. The amino acids have
been classified according to their hydrophobicity into
different categories. We have followed a standard classification in four physico-chemical categories: hydrophobic
(Ala, Ile, Leu, Met, Phe, Pro, Val), charged (Arg, Asp, Glu,
Lys), polar (Cys, Asn, Gln, His, Ser, Thr, Trp, Tyr), and Gly.
Components of RNA nucleotides. By convention, we
consider that atoms C1', C2', C3', C4', C5', O2', O4' belong
to the ribose and atoms P, O1P, O2P, O3P, O3', O5' to the
phosphate. The other atoms belong to the bases. The
modified bases have not been considered.
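These two conventions amount to simple lookups. A minimal Python sketch follows; the names `nucleotide_component` and `AMINO_ACID_CLASS` are hypothetical, and the replacement of `*` by `'` is an assumption made for PDB files that write sugar atoms with an asterisk.

```python
RIBOSE_ATOMS = {"C1'", "C2'", "C3'", "C4'", "C5'", "O2'", "O4'"}
PHOSPHATE_ATOMS = {"P", "O1P", "O2P", "O3P", "O3'", "O5'"}


def nucleotide_component(atom_name):
    """Assign an RNA heavy atom to ribose, phosphate or base."""
    name = atom_name.strip().replace("*", "'")   # assumption: C1* means C1'
    if name in RIBOSE_ATOMS:
        return "ribose"
    if name in PHOSPHATE_ATOMS:
        return "phosphate"
    return "base"                                # every other atom


# The four physico-chemical categories used in the text.
AMINO_ACID_CLASS = {}
for category, members in {
    "hydrophobic": ["Ala", "Ile", "Leu", "Met", "Phe", "Pro", "Val"],
    "charged": ["Arg", "Asp", "Glu", "Lys"],
    "polar": ["Cys", "Asn", "Gln", "His", "Ser", "Thr", "Trp", "Tyr"],
    "Gly": ["Gly"],
}.items():
    for amino_acid in members:
        AMINO_ACID_CLASS[amino_acid] = category

print(nucleotide_component("O2'"), AMINO_ACID_CLASS["Arg"])
```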
The database of protein–RNA interfaces
A relational database, computed once, has been derived.
The data types with their attributes and relations are
represented by n-ary relations, with integrity constraints
and operations defined on these relations. The database
contains the following information:
. the set of all interactions between atoms, with the atom
numbers, the mutual distance and the interaction types;
. the atoms located in the interfaces, with their number,
name and coordinates;
. the amino acids located in the interfaces, with their
number, name and chain;
. the nucleotides located in the interfaces, with their
number, name and chain;
. the secondary structures located in the contact regions.
In addition, the database includes other data that were not
needed to achieve the results: polypeptidic chains; the
contact regions with their number of atoms and average
temperature factor; average temperature factor of every
amino acid for the main chain and for the side chain; and the
quaternary structure.
Implementation
The database was implemented using RDB (VAX/RDB
1987), a Relational Database Management System (RDBMS),
and instructions for data manipulation were incorporated in
the C host language. Although this system is old, it is still up
to date for the functions it provides. The database and the
update programs may be easily implemented using any other
RDBMS which can be coupled with a host language.
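As an illustration of such a re-implementation, the sketch below uses SQLite through the Python standard library. It is not the original VAX/RDB schema: the table names, column names and the three sample contacts are hypothetical and cover only atoms and interactions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE atom (
    atom_id        INTEGER PRIMARY KEY,
    complex        TEXT NOT NULL,      -- PDB code
    chain          TEXT NOT NULL,
    residue_number INTEGER NOT NULL,
    residue_name   TEXT NOT NULL,      -- amino acid or nucleotide
    atom_name      TEXT NOT NULL,
    x REAL, y REAL, z REAL
);
CREATE TABLE interaction (
    protein_atom     INTEGER NOT NULL REFERENCES atom(atom_id),
    rna_atom         INTEGER NOT NULL REFERENCES atom(atom_id),
    distance         REAL NOT NULL,    -- in Å
    interaction_type TEXT NOT NULL     -- I, H, C, W, S or a sequence such as IH
);
"""


def build_demo():
    """Create an in-memory database holding a few invented contacts."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany(
        "INSERT INTO atom VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        [
            (1, "XXXX", "A", 10, "Arg", "NH1", 0.0, 0.0, 0.0),
            (2, "XXXX", "A", 20, "Ser", "OG", 5.0, 0.0, 0.0),
            (3, "XXXX", "R", 5, "G", "O1P", 2.8, 0.0, 0.0),
            (4, "XXXX", "R", 6, "A", "N1", 5.0, 2.9, 0.0),
        ],
    )
    con.executemany(
        "INSERT INTO interaction VALUES (?, ?, ?, ?)",
        [(1, 3, 2.8, "IH"), (2, 4, 2.9, "H"), (1, 4, 3.6, "W")],
    )
    return con


def contacts_per_amino_acid(con):
    """Count interactions per amino acid type and interaction type."""
    query = """
        SELECT a.residue_name, i.interaction_type, COUNT(*)
        FROM interaction AS i
        JOIN atom AS a ON a.atom_id = i.protein_atom
        GROUP BY a.residue_name, i.interaction_type
        ORDER BY a.residue_name, i.interaction_type
    """
    return con.execute(query).fetchall()


print(contacts_per_amino_acid(build_demo()))
```

Keeping one row per atom pair means that counts per amino acid, per nucleotide component or per interaction type are obtained with a single grouped query rather than by re-parsing coordinate files.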
Statistical analysis
Chi-square tests.
Two problems must be distinguished: (i) the preferred
amino acid types (or nucleotide components) in the RNA–
protein interfaces; and (ii) the favoured pairs amino acid
type–nucleotide component in the interfaces. These two
problems are independent. That is, the preference of a given
amino acid type for some nucleotide component over the
others does not depend on its count in each complex or on
the molecular surface.
Amino acids and nucleotide components involved in
contacts. In every complex, the amino acids may be
classified by two factors: the amino acid type having 20
levels and the dichotomic criterion ‘contact RNA/no
contact’. Thus there are 20 × 2 groups. The observed
counts of amino acids falling in these groups may be
presented in a 20 × 2 contingency table. The same holds
true for the nucleotide components: ribose, phosphate, bases
(3 × 2 contingency table), or the four bases A, C, G, U
(4 × 2 contingency table). The expected counts are
calculated from the null hypothesis of no relationship
between the two factors, that is, only sampling fluctuations
are responsible for the differences between the groups. The
problem of whether the count of amino acids (or nucleotide
components) is related to the amino acid type (or the
nucleotide components) was explored by a chi-square
statistic (Snedecor and Cochran, 1989). For a set of n
complexes, the set of n contingency tables containing
observed counts may be combined into a single one by
summing the observed counts, and the set of n tables
containing expected counts may be combined into a single
one by summing the expected counts. These two tables are
designed for a Boyd and Doll chi-square test. This test can
be carried out in two ways: in one way, the cells of the test
are the complexes, in the other way, the cells are the amino
acid types or the nucleotide components.
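The combination of the per-complex tables can be sketched as follows. This is a hedged illustration of the summing step only: it assumes that the expected counts of each complex are computed from the row and column totals of that complex's own table, it uses the plain statistic Σ(O − E)²/E on the summed tables, and the counts are invented. The exact form of the Boyd and Doll test used in the study is not reproduced here.

```python
def expected_counts(table):
    """Expected counts under independence, from the row and column totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def combine_tables(tables):
    """Sum observed and expected counts cell by cell over all complexes."""
    n_rows, n_cols = len(tables[0]), len(tables[0][0])
    observed = [[0.0] * n_cols for _ in range(n_rows)]
    expected = [[0.0] * n_cols for _ in range(n_rows)]
    for table in tables:
        exp = expected_counts(table)
        for j in range(n_rows):
            for k in range(n_cols):
                observed[j][k] += table[j][k]
                expected[j][k] += exp[j][k]
    return observed, expected


def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all cells with a non-zero expected count."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row) if e > 0)


# Hypothetical counts for two complexes and three amino acid types;
# each row is (in contact with RNA, not in contact).
complex_1 = [[8, 12], [2, 18], [5, 15]]
complex_2 = [[6, 4], [1, 9], [3, 7]]
observed, expected = combine_tables([complex_1, complex_2])
print(chi_square(observed, expected))
```

Summing the expected counts complex by complex, rather than recomputing them from the pooled table, keeps the differences in composition between complexes from being mistaken for a preference.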
Contacts between amino acids and nucleotide components. In every complex, the observed counts of contacts
between amino acids and nucleotide components (ribose/
phosphate/bases or bases A/C/G/U) may be presented in a
20 × 3 or a 20 × 4 contingency table. As previously, a Boyd
and Doll chi-square statistic can be computed in order to test
the relationship between the two factors, amino acid type
and nucleotide component, for a set of n complexes. For all
chi-square tests, Yates' correction has been used in case
of small expected counts (Colquhoun, 1971).
Subsequent analysis. The chi-square statistic does not point
out the way in which the observed and the expected counts
differ. In the different cells of the tables, the deviations
between the observed and the expected counts are more or less
large and contribute more or less to the chi-square statistic,
but the deviations in the different cells are correlated and thus
hard to interpret. If the percentage distributions appear similar
in several columns and if a chi-square test confirms it, then
these columns may be combined for comparisons with other
columns or combined columns by means of further chi-square
tests (Snedecor and Cochran, 1989). Nevertheless, within a
column with several small counts, the percentage distribution
on the one hand and the deviations between observed and
expected counts on the other hand may be discordant. In this
case, it is better to compare the deviations between observed
and expected counts from column to column than the
percentage distributions.
Arcsine transformation of proportions, analysis of
variance (ANOVA) and Scheffé test
The location of differences among several groups is easier
with a continuous variable than with a discrete variable. As
the amino acids of a complex may be classified into two
categories, contact/no contact, the proportions p of amino
acids which are in contact with RNA in a complex are
binomial proportions. Thus, they may be transformed by
means of an arcsine transformation $z = \arcsin\sqrt{p}$ (Snedecor
and Cochran, 1989). This transformation is effective in
stabilizing the variances. The variable z assumes one value
for every amino acid type in every complex. For a set of n
complexes, two factors are involved: the amino acid type
having 20 levels and the complex having n levels. These
factors are qualitative variables. Thus, this experiment is a
20 × n factorial experiment without repeated measures. It
allows a two-way ANOVA. In order to locate the
differences among amino acid types, contrasts among
amino acid types or groups of amino acid types have been
computed from the z means and compared to the critical
values by means of a Scheffé test (Winer, 1971). The same
holds true in respect of the proportions of nucleotide
components in contact with the protein.
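The transformation and the averaging of z within each amino acid type can be written in a few lines. The sketch below is illustrative only: the function names and the proportions in the example are hypothetical, and angles are in radians.

```python
import math


def arcsine_transform(p):
    """Variance-stabilizing transform z = arcsin(sqrt(p)) of a proportion p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return math.asin(math.sqrt(p))


def z_means(proportions):
    """Average z over complexes for every amino acid type.

    proportions maps a complex name to {amino acid type: proportion in contact}.
    """
    sums, counts = {}, {}
    for per_type in proportions.values():
        for amino_acid, p in per_type.items():
            sums[amino_acid] = sums.get(amino_acid, 0.0) + arcsine_transform(p)
            counts[amino_acid] = counts.get(amino_acid, 0) + 1
    return {aa: sums[aa] / counts[aa] for aa in sums}


# Hypothetical proportions of Arg and Val in contact with RNA in two complexes.
example = {
    "complex_1": {"Arg": 0.25, "Val": 0.01},
    "complex_2": {"Arg": 0.36, "Val": 0.04},
}
print(z_means(example))
```

A z mean can be converted back to a proportion with p = sin²(z), which helps when reading the tabulated z values.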
Table 3. Values of the z means for the amino acid types. The variable z is defined by $z_{ij} = \arcsin\sqrt{p_{ij}}$, where $p_{ij}$ is the
proportion of amino acids of type j in contact with RNA in the complex i. The z values have been computed for
every complex and every aa type and then averaged within every aa type. In each group of complexes
(synthetases, ribosomes, 'others') and in the set of all complexes, the means are sorted in descending order. The
counts of Cys and Trp are too low for consideration.
Synthetases   Ribosomes    Other complexes   All complexes
Arg  0.39     Arg  0.81    Lys  0.37         Arg  0.56
Asn  0.27     Ser  0.75    Arg  0.34         Lys  0.50
Gln  0.27     His  0.71    Gly  0.33         Ser  0.48
Met  0.25     Tyr  0.70    Asn  0.31         Asn  0.47
Thr  0.24     Lys  0.69    Ser  0.31         Tyr  0.41
Ser  0.22     Asn  0.68    Thr  0.30         Gly  0.40
Asp  0.22     Thr  0.54    Tyr  0.30         Thr  0.39
Lys  0.21     Phe  0.51    Gln  0.25         Gln  0.35
Gly  0.20     Gly  0.51    Phe  0.24         His  0.34
Pro  0.19     Gln  0.49    Glu  0.22         Phe  0.32
Phe  0.18     Pro  0.46    Ala  0.20         Pro  0.27
Glu  0.18     Asp  0.25    Met  0.20         Asp  0.25
Tyr  0.16     Ile  0.33    Asp  0.16         Met  0.24
Ile  0.15     Leu  0.33    His  0.15         Glu  0.24
Ala  0.14     Met  0.28    Val  0.14         Leu  0.22
His  0.14     Glu  0.26    Leu  0.12         Ala  0.21
Leu  0.13     Val  0.25    Ile  0.09         Ile  0.21
Val  0.11     Ala  0.24    Pro  0.07         Val  0.18
Friedman test
The values of z allow a classification of the amino acid types
or the nucleotide components. The ranks computed from the
z values (or from the proportions in contact) may be used for
calculating a Friedman statistic (Conover, 1980). The
treatments of the test are the amino acid types or the
nucleotide components, and the blocks are the complexes or
groups of complexes (in the case of groups of complexes,
the ranks are calculated from the z means). According to the
null hypothesis, each ranking within a block is equally
likely, i.e. the ranks in the different blocks are discordant,
the treatments have identical effects, and no amino acid type
or no nucleotide component is preferred over the others in
the contacts. If the null hypothesis is rejected, multiple
comparisons by means of a t statistic adjusted to this case
(Conover, 1980) permit the localization of the differences
between the treatments (amino acid types or nucleotide
components).
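A minimal sketch of the ranking and of the Friedman statistic is given below. It uses the textbook form χ² = 12 / (b k (k + 1)) · Σ R_j² − 3 b (k + 1), with b blocks, k treatments and rank sums R_j, and assigns average ranks to ties without any further tie correction; whether this matches the exact variant of Conover (1980) used in the study is an assumption. The function names are hypothetical. The example takes the z means of four amino acid types (Arg, Ser, Leu, Val) from Table 3 as treatments and the three groups of complexes as blocks; it is a reduced illustration, not the test reported in the text.

```python
def average_ranks(values):
    """Ranks from 1 (smallest) to k, tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for position in range(i, j + 1):
            ranks[order[position]] = tied_rank
        i = j + 1
    return ranks


def friedman_statistic(blocks):
    """Friedman chi-square for b blocks (rows) of k treatments (columns)."""
    b, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:
        for treatment, rank in enumerate(average_ranks(block)):
            rank_sums[treatment] += rank
    return 12.0 / (b * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * b * (k + 1)


# z means of Arg, Ser, Leu, Val in synthetases, ribosomes and other complexes.
blocks = [
    [0.39, 0.22, 0.13, 0.11],
    [0.81, 0.75, 0.33, 0.25],
    [0.34, 0.31, 0.12, 0.14],
]
print(friedman_statistic(blocks))
```

The statistic is zero when the rank sums of all treatments are equal (fully discordant rankings) and reaches b(k − 1) when every block ranks the treatments identically.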
RESULTS
The amino acids involved in contacts with RNA
By convention, we consider that a given amino acid is in
contact with RNA if it possesses at least one atom involved
in one or more interaction types with one or more atoms of
RNA. This amino acid is counted once even if its atoms are
involved in several interactions with several atoms of one or
more nucleotides. In this part of the analysis, the contacts
through water molecules are not counted (but see below). In
a given protein, the number of amino acids is different for
each of the 20 amino acid types. Thus, the number of amino acids of a given
type in contact with RNA might depend only on
the abundance of that type in the protein. A two-way
ANOVA, after arcsine transformation of the proportions of
amino acids in contact with RNA, and a Boyd and Doll
chi-square test have been computed. In every group of
complexes, synthetases, ribosomes and the others, the two F
statistics for the two factors are above the critical value for an
α < 0.001 level of significance. The Boyd and Doll chi-square test confirms these results (α < 0.001). Table 3 shows
a classification of the amino acid types according to their z
mean. These results denote a relationship between the two
factors, amino acid type and complex. That is, the number of
amino acids in contact with RNA depends on both the amino
acid type and the complex in which the amino acids are
located, and not only on their numbers in the complex.
Comparison between the three groups of complexes
Comparison between the three groups of complexes
The classifications of the amino acid types are not exactly
the same in the three groups of complexes. Nevertheless, in
all three groups, the most preferred amino acid types are the
same, Arg, Asn, Ser, Lys, and the least preferred amino acid
types are the same, Ala, Ile, Leu, Val. If the classifications
of the amino acid types according to their z means were
different in the three groups of complexes, then the ranks of
the amino acid types according to their z mean would be
discordant. A Friedman test, computed from these ranks,
results in rejection of the null hypothesis (α < 0.001). Thus,
in all three groups of complexes, there is a tendency for
some amino acid types to be preferred over the others in
contacts with RNA. The differences between the amino acid
types two by two, pointed out by means of multiple
comparisons, are summarized in Table 4. The percentage
contributions to the total number of amino acids located in
the interfaces are the following: hydrophobic 22%, charged
40% (positive 32%, negative 8%), polar 30%, Gly 8%.
Table 4. Some amino acid types are preferred over the others for
forming contacts with RNA. Differences between the amino acid
types two by two (hatched cells)
The ribose, phosphate and bases involved in contacts
with amino acids
As for amino acids, a given nucleotide component (ribose,
phosphate, base) is considered in contact with the protein
if it contains at least one atom involved in one or more
interaction types with one or more atoms of the protein.
By convention, we consider that if a given nucleotide
component (for example ribose) is in contact with an amino
acid, it is counted even if another component of the same
nucleotide (phosphate or base) is also in contact with an
amino acid, the same amino acid or another. Thus, the total
count of nucleotide components in contact with the protein
may be greater than the total number of nucleotides in the
RNA. The contacts through water molecules are not
counted. In each complex, riboses and phosphates are more
numerous than bases of each type. These differences may
explain the differences in their counts in the contacts with
amino acids. Furthermore, as the phosphates carry negative
charges, their proportions in the contacts may be different
from those of ribose and bases. In order to test these
hypotheses, the same procedure as for the amino acid types
has been followed: a two-way ANOVA after arcsine
transformation of the proportions of nucleotide components
in contact with the protein, a Scheffé test, and a Boyd and
Doll chi-square test.
Ribose vs phosphate vs the ensemble of the four bases.
Only in the group of ribosomes do the proportions of ribose,
phosphate and bases in contact with protein differ from each
other: phosphate is preferred over ribose, which is preferred
over the bases (α < 0.01). This may be due to the large
fraction of base paired regions (Watson–Crick or non-Watson–Crick) in the ribosomal RNAs. In the other two
groups of complexes, no significant preference for ribose,
phosphate or bases over the others exists. In most
complexes, the proportions of phosphate and ribose in
contact with protein are greater than the proportion of bases,
but these differences are not significant. The values of these
proportions, after arcsine transformation, are given in Table
5. In all groups of complexes, the proportions of ribose,
phosphate and bases depend on the complex in which they
are located (α < 0.01).
The bases between themselves. In each of the three groups of
complexes and in the ensemble of complexes, no relationship between the proportion of a base in contact with amino
acids and its type (A/C/G/U) could be pointed out. Besides
the fact that some amino acid types occur more often than
others in contacts with RNA and the fact that nucleotide
components may all have the same preference for the amino
acids, a certain amino acid type may show a preference for
some nucleotide components over the others, and vice versa.
This question can be addressed by testing the
hypothesis of a dependency between the two factors: the
amino acid type and the nucleotide component.
Table 5. Values of the z means for the nucleotide
components. The variable z is defined by $z_{ij} = \arcsin\sqrt{p_{ij}}$, where $p_{ij}$ is
the proportion of nucleotide components of type j in contact with amino acids in the
complex i. The z values have been computed for every
complex and every nucleotide component. They are
then averaged within every nucleotide component. In
each group of complexes the means are sorted in
descending order
Synthetases      Ribosomes        Other complexes  All complexes
Ribose     0.44  Phosphate  0.15  Ribose     0.58  Ribose     0.33
Phosphate  0.41  Ribose     0.14  Phosphate  0.47  Phosphate  0.30
C          0.39  U          0.13  G          0.45  G          0.26
G          0.31  A          0.12  A          0.39  U          0.24
A          0.29  G          0.11  U          0.38  A          0.23
U          0.26  C          0.09  C          0.25  C          0.19
Table 6. Observed and expected numbers of contacts between amino acids and nucleotide components. The most
significant differences are in bold
(a) Ribose, phosphate, and the four bases together
        Ribose              Phosphate           Bases
        Observed  Expected  Observed  Expected  Observed  Expected
Ala           34        37        40        39        31        29
Ile           31        25        12        23        27        21
Leu           51        40        28        39        32        32
Met           35        23         8        19        16        16
Phe           68        39        11        26        47        60
Pro           41        40        25        37        37        25
Val           37        29        26        33        18        18
Arg          189       237       325       265       176       187
Asp           37        30         6        23        43        32
Glu           46        40        18        36        54        41
Lys          103       153       258       164       102       145
Asn           59        64        54        59        67        56
Gln           58        50        43        57        39        33
His           45        39        27        36        34        31
Ser           69        70        65        63        67        68
Thr           55        60        57        57        54        49
Trp           16        15        11        16        15        11
Tyr           63        51        55        57        37        46
Gly           88        79        60        77        70        63
Total       1122                1128                 963
(b) The four bases separately
        Base A              Base C              Base G              Base U
        Observed  Expected  Observed  Expected  Observed  Expected  Observed  Expected
Ala            7         6         6         9        10         9         9         6
Ile           11         7         4         6         6        10         6         4
Leu            9         7        14         9         3         7         7         9
Met            4         4         8         4         2         4         2         4
Phe           19        15         5         9        18        17         5         7
Pro           15         9         6        10        10        12         6         7
Val            4         3         5         4         3         4         7         6
Arg           42        47        51        43        39        50        44        35
Asp            5        12        10        11        22        12         6         8
Glu           17        17         9        11        23        19         5         6
Lys           27        32        16        16        43        37        17        16
Asn           21        20        11        12        11        19        25        17
Gln            4         7        13        11        16        13         7         9
His            7        10         4         5        20        14         3         4
Ser           34        21        10        11        15        25         8        10
Thr           15        18        11        10        18        15        10        11
Tyr            8         7        11        12         9         7         9        11
Gly           16        21        14        15        29        21        11        14
Total        263                 207                 294                 184
Dependency between the amino acid type and the
nucleotide component
The preferences that amino acid types present for ribose vs
phosphate vs bases have been analysed in the three groups
of complexes and in the ensemble of complexes, by means
of chi-square tests. The results may be summarized as
follows.
The ribose, phosphate and the four bases. In the ensemble
of complexes and for all amino acid types, except Cys
whose expected count is too small [Table 6(a)]:
. Arg and principally Lys give preference for phosphate
over ribose and bases;
. the amino acids Ile, Leu, Met, Phe, Pro, Asp, Glu, Gln,
Gly, principally Asp and Glu, have fewer contacts with
phosphate than expected;
. Pro and Asn prefer bases over ribose and phosphate;
. Met, Phe and Tyr prefer ribose over phosphate and
bases.
In the groups of synthetases, ribosomes and other
complexes the results are concordant with the previous
ones, although they are not all significant owing to smaller
counts. The three groups of complexes differ from each
other (α < 0.001): the complexes denoted ‘others’ differ
from synthetases and from ribosomes by an excess of
contacts between amino acids and bases.
The four bases considered separately. Some amino acids,
except Cys and Trp whose expected count is too small, have
a preference for some bases over the others [Table 6(b)]:
. Ile, Pro, Ser prefer A over the others;
. Leu prefers C over the others;
. Asp and Gly prefer G over the others;
. Asn prefers U over the others.
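The size of these deviations can be compared cell by cell with a standardized deviation (O − E)/√E. The sketch below is only an illustration and not the procedure followed in the study: the observed and expected counts are those listed in Table 6(b) for the pairs named above, while the function and variable names are hypothetical.

```python
import math

# (amino acid, base): (observed, expected) contacts, as listed in Table 6(b).
CELLS = {
    ("Ile", "A"): (11, 7),
    ("Pro", "A"): (15, 9),
    ("Ser", "A"): (34, 21),
    ("Leu", "C"): (14, 9),
    ("Asp", "G"): (22, 12),
    ("Gly", "G"): (29, 21),
    ("Asn", "U"): (25, 17),
}


def standardized_deviation(observed, expected):
    """Signed deviation (O - E) / sqrt(E); positive values mean an excess."""
    return (observed - expected) / math.sqrt(expected)


deviations = {pair: standardized_deviation(o, e) for pair, (o, e) in CELLS.items()}
for pair, value in sorted(deviations.items(), key=lambda item: -item[1]):
    print(pair, round(value, 2))
```

All seven pairs show an excess of observed over expected contacts, which agrees with the preferences listed above.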
H-bonds. The potential H-bonds constitute 12% of all
interaction types. Considering the number of H-bonds
involving phosphate groups vs ribose vs bases, the
phosphate groups are more often involved in H-bonds than
expected (α < 0.001). All oxygen atoms of ribose and of
phosphate are involved in H-bonds. All atoms O and N of
the bases are involved (see Table 8). Considering the total
number of interactions involving these atoms, O2', O6(G)
and principally N2(G) are more often involved in H-bonds
than expected, whereas O4', O1P and N3(G) are more rarely
involved in H-bonds than expected. In the protein, the main
chain is involved in 32% of H-bonds (O and OXT 14%, NH
19%) and the side chain in 68% [O 12%, NH 42%
(principally Arg and Lys 29%), OH 14%, SD Met 0.3%].
Among all H-bonds, at least 26% have the donor group on
the RNA and the acceptor group on the protein. All H donor
sites of the RNA are involved in these latter H-bonds. These
atoms are also involved in H-bonds with atoms of the
protein which may be either H donor or H acceptor.
Considering the number of H-bonds between the H donor
sites of RNA and the H acceptor vs H acceptor or H donor
sites of the protein, atoms N2(G) prefer H-bonds with H
acceptor sites of protein, whereas atoms O2' prefer H-bonds with
H donor or acceptor sites of the protein. Atoms O2' are also
involved in van der Waals interactions. Considering the
number of H-bonds vs van der Waals interactions involving
O2': with atoms O of the side chain and with OH, O2' prefers
H-bonds over van der Waals interactions; with atoms O of
the main chain and with NH, O2' shows no significant
preference for H-bonds or van der Waals interactions. The
preferences that some amino acids have for a base over the
others, listed previously, can be found again in the
Interaction types between amino acids and nucleotides
In the ensemble of complexes, the percentage distribution of
the interaction types is the following: 23% H and CH…O
H-bonds and ionic interactions, 72% Van der Waals
interactions, and 5% short contacts. Table 7 shows the
percentage distributions of interaction types for the different
nucleotide components in the three groups of complexes and
in the ensemble of complexes. One can notice that the
three groups of complexes differ from each other
(α < 0.001) in:
. an excess of H-bonds, CH…O H-bonds and Van der
Waals interactions involving bases in the group
denoted ‘others’ vs synthetases and ribosomes;
. a lack of interactions of all types involving bases in the
ribosomes vs synthetases and ‘others’;
. a lack of interactions of all types involving phosphate
in the group denoted ‘others’ vs synthetases and
ribosomes.
In the ensemble of complexes, the phosphates are
involved in more H-bonds than expected.
Ionic interactions. The ionic interactions constitute 3%
of all interaction types and 13% among all H-bonds. They
occur between the charged atoms of the phosphate
groups and the NH3+ and NH2+ groups of Lys and Arg.
There is no significant preference for some oxygen atoms
over the others. They represent 32% of all phosphate H-bonds.
Table 7. Percentage distributions of interaction types in the three groups of complexes and in the ensemble of complexes

                                         Ribose   Phosphate   Bases
Ionic interactions and hydrogen bonds
  Synthetases                              24        41         35
  Ribosomes                                21        58         21
  Others                                   16        18         67
CH…O H-bonds
  Synthetases                              39        24         37
  Ribosomes                                38        42         19
  Others                                   32        13         56
Van der Waals interactions
  Synthetases                              33        27         40
  Ribosomes                                35        37         28
  Others                                   22        11         67
The ensemble of complexes
  IH                                       20        43         38
  C                                        36        31         32
  W                                        30        26         44

IH, ionic interactions and H-bonds; C, CH…O H-bonds; W, van der Waals interactions.
Table 8. Atoms of RNA involved in H-bonds in the ensemble of 45 complexes, with the average distance between donor and acceptor atoms and the number of H-bonds in the ensemble

Nucleotide   Atom   Distance (Å)   Number
G            N2        2.95           88
G            O6        3.04           54
C            O2        2.94           50
U            O4        2.91           45
A            N6        2.98           44
U            O2        2.96           39
A            N1        2.89           36
G            N1        2.98           36
U            N3        2.90           36
C            N4        3.00           33
A            N3        3.02           23
C            N3        3.03           22
G            N7        2.94           21
A            N7        2.94            7
G            N3        3.01            6
             O2'       2.94          254
             O3'       3.03           81
             O4'       3.05           34
             O5'       3.12           27
             O1P       2.91          136
             O2P       2.89           69
             O3P       2.93            1
CH…O H-bonds. The CH…O H-bonds constitute 33% of
all potential H-bonds (neutral, CH…O, ionic), and 8% of all
interaction types. All CH groups of the amino acids, except
CB Ile and CZ3 Trp, are involved in CH…O H-bonds with
RNA. All CH groups of ribose and bases, except C8(A), are
involved in CH…O H-bonds, principally C5', C4', C2(A) and
C1'. They are listed in Table 9. The absence of C-H…O bonds
involving C8(A) and the small number involving C6(U),
C6(C) and C8(G) are due to the presence of intramolecular
C-H…O bonds involving the O5' of the ribose (Sundaralingam,
1973; Auffinger et al., 1996; Brandl et al., 1999). The
importance of C-H…O contacts between the serine hydroxyl
group and C2(A) is striking. Most of the C-H…O
contacts involving C4' and C5' (79%) are made with main
chain oxygen atoms. The CH groups of RNA interact
preferentially with oxygen atoms (62% on main chain and
19% on side chain), and with OH (19%). The CH groups of
the proteins interact with O (74%), OH (18%) and N (8%).
Interactions involving aromatic rings
In the ensemble of complexes, 71 interactions occur between
atoms OH, NH, NH2 of nucleotides A, C, G, U and aromatic
rings of Phe, Trp, Tyr, from which nine belong to the
synthetases, 25 to the ribosomes, and 37 to the other
complexes. The distance between the ring center, assuming
a six-fold symmetry of the ring, and the N or O atom of the
RNA varies between 2.36 and 3.53 Å (average 3.48 Å). The
optimum distance varies from 2.9 to 3.6 Å (Levitt and Perutz,
1988). A recent analysis (Brandl et al., 2001) reported a
longer average value of 3.7 Å (standard deviation 0.2 Å). Ten
contacts are below these optimum values.
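The ring-centre distance used above can be sketched as follows, taking the centre as the centroid of the six ring atoms. The coordinates are hypothetical (an idealized planar ring with 1.39 Å bonds and an N atom placed above it), and the function names are illustrative only.

```python
from math import cos, sin, pi, dist

def ring_center(ring_atoms):
    """Centroid of the ring atoms, given as (x, y, z) tuples in Å."""
    n = len(ring_atoms)
    return tuple(sum(atom[i] for atom in ring_atoms) / n for i in range(3))

def ring_distance(ring_atoms, polar_atom):
    """Distance in Å between the ring centre and an N or O atom."""
    return dist(ring_center(ring_atoms), polar_atom)

# Hypothetical idealized six-membered ring in the xy plane (radius = bond length 1.39 Å)
ring = [(1.39 * cos(k * pi / 3), 1.39 * sin(k * pi / 3), 0.0) for k in range(6)]
n_atom = (0.0, 0.0, 3.4)  # hypothetical RNA N atom directly above the ring centre

d = ring_distance(ring, n_atom)
print(f"ring centre to N distance: {d:.2f} Å")
```

For real structures the six coordinates would come from the aromatic side chain atoms of Phe, Tyr or the six-membered ring of Trp.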
tendencies for the H-bonds and CH…O H-bonds. Considering all H-bonds (neutral and ionic) between amino acids and
bases, the following are more frequent than expected: base
A–Ser; base G–Asp and Glu; and base U–Asn. Considering
all CH…O H-bonds between amino acids and bases, the
following are more frequent than expected: base A–Ser,
base G–Phe and Gly, base U–Val.
H-bonds involving water molecules
A water molecule bridging protein to RNA is in contact
Table 9. CH groups of RNA involved in CH…O H-bonds with amino acids. Observed number of CH…O H-bonds
Ala
Leu
Met
Pro
Val
Arg
Asp
Glu
Lys
Asn
Gln
Ser
Thr
Trp
Tyr
Gly
Total
C1'
C2'
1
1
1
1
2
2
3
2
3
1
1
2
1
1
2
23
1
1
1
1
1
1
1
C3'
C4'
C5'
1
1
4
1
1
6
1
1
2
1
3
1
2
4
1
1
9
31
3
2
6
5
1
3
9
41
C
C5
C
C6
U
C6
A
C2
G
C8
1
1
1
1
10
U
C5
1
1
3
1
1
1
1
1
4
1
1
1
2
17
1
1
1
1
7
1
1
5
1
1
24
2
1
2
6
Total
3
3
3
10
2
5
4
10
4
17
7
29
15
3
11
24
150
Table 10. Number of observed contacts between amino acids and ribose, phosphate and bases through bridging
water molecules. Contacts with the main chain and the side chain of amino acids are distinguished. The interaction
type of contact on the two sides (protein and RNA) of the water molecules may be H-bond or CH…O H-bond
Ala
Ile Leu Met Phe Pro Val Arg Asp Glu Lys Asn Gln His Ser Thr Trp Tyr Gly Total
(a) All interaction types on either side of the water molecule
Ribose
Main 1
1
2
2
3
2
6
Side
1
1
4
1
3
10
Total 2
2
2
6
1
3
5
16
Phosphate Main 2
1
1
2
2
4
4
Side
1
2
16
Total 2
1
2
2
2
6
20
Bases
Main 4
4
9
1
2
2
2
Side
2
1
5
1
1
18
Total 6
5
9
5
2
2
3
20
Total
Main 7
6 11
3
3
7
8
12
Side
3
2
10
2
6
44
Total 10
8 11 13
5
7 14
56
Ala
(c) CH…O H-bonds
Ribose
Main
Side
Total
Phosphate
Main
Side
Total
Bases
Main
Side
Total
Total
Main
Side
Total
3
7
10
2
1
3
1
12
13
6
20
26
7
5
12
2
4
6
3
4
7
12
13
25
5
7
12
2
3
5
4
7
11
11
17
28
5
2
7
3
1
4
4
4
2
5
6
7
7
2
15
17
11
1
12
4
3
7
3
5
8
18
9
27
8
3
11
12
4
16
1
6
7
9
4
13
22
14
36
1
1
2
3
2
5
1
1
1
1
1
1
1
1
5
3
8
3
3
9
9
10
10
14
14
33
33
70
62
132
47
48
94
64
76
140
181
185
366
Ile Leu Met Phe Pro Val Arg Asp Glu Lys Asn Gln His Ser Thr Trp Tyr Gly Total
(b) H-bonds on either side of
Ribose
Main 1
1
Side
Total 1
1
Phosphate Main 2
1
Side
Total 2
1
Bases
Main 3
4
Side
Total 3
4
Total
Main 6
6
Side
Total 6
6
Ala
10
10
2
3
5
5
8
13
7
21
28
the water molecule
2
2
3
2
2
4
3
1
2
2
Ile
1
9
2
2
3
4
7
9
11
11
2
2
3
2
1
2
1
3
2
3
1
6
2
7
3
6
7
Pro
Val
Met Phe
4
10
14
2
14
16
2
16
18
8
40
48
Arg Glu
3
6
9
2
1
3
1
10
11
6
17
23
Arg Asp Glu
on the protein side and all interaction types
1
1
1
2
1
3
2
1
1
3
1
3
2
1
2
1
2
2
1
3
4
1
2
1
5
1
1
3
2
1
5
1
1
1
3
1
1
1
2
3
2
8
2
6
7
3
2
9
2
1
7
9
Ala Met Val
10
10
2
3
5
2
8
10
4
21
25
7
4
11
2
4
6
2
4
6
11
12
23
5
7
12
2
3
5
4
7
11
11
17
28
Lys Gln
10
1
11
4
3
7
3
4
7
17
8
25
5
2
2
7
7
2
15
17
7
7
His
Ser
12
1
13
1
5
6
9
4
13
22
10
32
1
1
2
3
2
5
1
1
5
3
8
Thr
Trp
1
1
1
1
1
1
1
1
Tyr
7
7
9
9
11
11
27
27
65
46
111
42
39
81
55
62
117
162
147
309
Gly Total
on the RNA side
6
1
1
1
1
1
1
1
1
1
1
2
1
1
6
2
1
1
4
4
4
4
Lys Asn Gln
1
1
1
1
2
2
2
2
2
2
1
1
1
2
3
His
Ser
Thr
Tyr
(d) All interaction types on the protein side and CH…O H-bonds on the RNA side
Ribose
Main
3
2
1
1
Side
1
2
1
1
1
2
1
Total 1
2
1
3
1
2
2
2
1
1
Bases
Main 1
1
Side
1
Total 1
1
1
Total
Main 1
3
3
1
1
Side
1
2
1
2
1
2
1
Total 2
2
1
3
2
3
2
2
1
1
5
4
4
2
4
6
1
2
3
1
2
3
3
3
3
1
1
4
1
1
2
2
3
3
3
11
1
1
11
Gly Total
3
1
1
1
1
2
3
7
15
22
6
11
17
8
18
26
21
44
65
4
11
12
23
3
1
4
14
13
27
Table 11. Number of observed contacts between amino acids and bases A, C, G, U through bridging water
molecules
Ala
(a) All interaction types
A
Main
1
Side
Total
1
C
Main
1
Side
1
Total
2
G
Main
2
Side
1
Total
3
U
Main
Side
Total
Total
Main
4
Side
2
Total
6
Ala
Ile
Leu Met Phe Pro
Val Arg Asp Glu Lys Asn Gln
on either side of the water molecule
2
2
1
10
2
3
10
3
1
1
1
3
4
1
1
3
1
9
1
2
1
3
1
9
2
1
4
1
1
1
2
1
1
3
4
9
1
2
2
2
1
5
1
1
18
5
9
5
2
2
3
20
Ile
(b) H-bonds on either side of
A
Main
Side
Total
C
Main
1
3
Side
Total
1
3
G
Main
2
1
Side
Total
2
1
U
Main
Side
Total
Total
Main
3
4
Side
Total
3
4
Leu Met Phe Pro
1
1
1
1
4
5
9
2
2
5
8
13
2
2
1
4
5
2
2
2
1
3
2
3
5
4
4
2
2
1
12
13
1
1
2
3
4
7
2
4
6
4
7
11
Ser
1
1
1
1
2
2
3
5
3
3
1
1
2
2
Thr Trp Tyr Gly Total
1
2
3
1
1
8
1
9
3
3
1
1
1
1
1
7
7
Val Arg Asp Glu Lys Asn Gln
3
5
8
Ser
9
4
13
1
1
1
8
1
8
2
1
1
2
14
1
14
10
26
36
12
17
29
33
21
54
9
12
21
64
76
140
Thr Tyr Gly Total
the water molecule
2
2
8
8
1
1
9
9
2
2
1
9
9
2
2
1
1
1
2
1
1
2
3
3
1
3
4
1
2
3
2
16
18
simultaneously with a protein atom and an RNA atom.
Among the ensemble of 45 RNA-protein complexes, 17
complexes contain bridging water molecules. In the
ensemble of these 17 complexes, 208 bridging water
molecules are in contact between 290 protein atoms and
292 RNA atoms via H-bonds (neutral, CH…O H-bonds).
The average number of bridging water molecules per
complex is 12.2 (95% confidence interval: 7.7–16.8).
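The mean of 12.2 follows directly from 208 bridging water molecules in 17 complexes. The text does not say how the confidence interval was obtained; the sketch below assumes a Student t interval on the per-complex counts. The per-complex counts are hypothetical (chosen only so that they sum to 208), so the printed interval is illustrative and is not expected to reproduce the reported one.

```python
from math import sqrt
from statistics import mean, stdev

def mean_ci(values, t_crit):
    """Mean and t-based confidence interval (assumed method, not stated in the paper)."""
    m = mean(values)
    half_width = t_crit * stdev(values) / sqrt(len(values))
    return m, m - half_width, m + half_width

# Two-sided 95% Student t quantile for 17 - 1 = 16 degrees of freedom
T_975_DF16 = 2.120

# Hypothetical counts of bridging water molecules in 17 complexes (sum = 208)
waters = [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 18, 20, 24, 28, 30]

m, low, high = mean_ci(waters, T_975_DF16)
print(f"mean = {m:.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```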
The interaction type between amino acids and water
molecules may be H-bond or CH…O H-bond; the same
holds true for the contact between nucleotide components
and water molecules. There are several possible combinations of these two interaction types. The observed counts of
contacts between amino acids and nucleotide components
through water molecules for some of these combinations are
listed in Table 10 for ribose, phosphate and bases and in
Table 11 for the bases separately. For most amino acid
types, the expected counts are too small for a statistical test
to deal with the question of favoured amino acid type–
1
1
1
1
1
5
6
2
2
2
8
10
2
2
1
3
4
2
2
2
1
3
2
3
5
4
4
1
1
1
10
11
1
1
2
4
6
2
4
6
4
7
11
1
2
2
1
1
1
2
3
5
3
3
1
1
1
2
3
7
7
2
1
1
1
8
1
9
1
1
3
4
7
2
9
4
13
1
6
1
6
2
1
1
2
11
1
11
8
20
28
12
12
24
28
19
47
7
11
18
55
62
117
nucleotide component pairs. Several amino acids, and
especially those disfavoured in direct contacts (Ala, Ile,
Leu, Val), H-bond RNA atoms via water molecules using
their main chain atoms [see Table 10(b)], except for Arg,
Asp, Glu, Asn, Gln which overall prefer to use their side
chain atoms. It is interesting to note that, in this respect, Lys
and Phe use main chain and side chain atoms almost
equally. One-third of the contacts made by Gly involve the
Cα-H [Table 10(c)]. Met residues involve their side
chain atoms in C-H…O contacts with RNA, especially base
atoms. The numbers of contacts between nucleotide
components and amino acids through water molecules with
main chain vs side chain do not differ significantly from
each other, except for adenine and guanine bases: A contacts
preferentially side chain over main chain while G contacts
preferentially main chain over side chain atoms. Leu, Thr
and Gly bridge their main chain atoms via water molecules
to the base G in a large proportion. The atoms of RNA
involved in H-bonds with bridging water molecules may
Table 12. Observed counts of interface amino acids in contact with RNA through their main chain vs side chain,
along with their localization in secondary structures

(a) Contacts with ribose, phosphate, bases
                  Ribose   Phosphate   Bases   Total
Helix   Main        69        56         52     177
        Side       206       231        120     557
        Total      275       287        172     734
Sheet   Main        22        16         40      78
        Side       115        96        119     330
        Total      137       112        159     408
Other   Main       229       183        213     625
        Side       345       313        352    1010
        Total      574       496        565    1635
Total   Main       320       255        305     880
        Side       666       640        591    1897
        Total      986       895        896    2777

(b) Contacts with bases A, C, G, U
                    A      C      G      U    Total
Helix   Main       11     14     23     13      61
        Side       42     36     44     28     150
        Total      53     50     67     41     211
Sheet   Main       17      8      3     13      41
        Side       34     33     28     52     147
        Total      51     41     31     65     188
Other   Main       67     47     75     51     240
        Side      142     76    158     95     471
        Total     209    123    233    146     711
Total   Main       95     69    101     77     342
        Side      218    145    230    175     768
        Total     313    214    331    252    1110
also be involved in H-bonds directly with amino acids.
Those RNA atoms, whose observed counts are sufficient
(O2', O4', O3', O1P, O2P, N1, N2, N3), give no significant
preference for contact through water molecules or directly
with amino acids.
Secondary structures in interfaces
Among all interface amino acids, 44% belong to a
secondary structure: 28% to helices and 16% to sheets.
The number of interface amino acids belonging to a given
secondary structure varies from 1 to 20. The distribution of
the number of interface amino acids per secondary structure
element is asymmetrical, with a median of 5, i.e. a typical
secondary structure element contributes five interface amino acids.
The interface amino acids are in contact with RNA through
their main chain (26%) or their side chain (74%). The
amino acids belonging to helices and in contact through
their main chain are less numerous than expected, whereas
the amino acids belonging to regions without repetitive
secondary structure (loops, junctions, etc.) and in contact
through their main chain are more numerous than expected.
The counts of interface amino acids in contact with ribose,
phosphate and bases through their main chain and their side
chain along with their localization in the secondary
structures are listed in Tables 12(a) and (b). Amino acids
belonging to helices and in contact with RNA through their
side chain prefer contacts with phosphate over bases; for
amino acids in contact through their main chain, the
preferences are in the reverse order [computed from Table
12(a)]. Amino acids belonging to sheets and in contact
through their side chain prefer the uracil over the guanine
bases.
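The helix preferences quoted above can be checked against expected counts under independence. The sketch below uses the helix rows of Table 12(a) (main chain and side chain vs ribose, phosphate, bases); comparing observed with expected counts in this way is an assumption about what "computed from Table 12(a)" means, not a statement of the authors' exact procedure.

```python
# Helix rows of Table 12(a); columns = ribose, phosphate, bases
COLUMNS = ["ribose", "phosphate", "bases"]
observed = {"main": [69, 56, 52], "side": [206, 231, 120]}

def expected_counts(table):
    """Expected counts under independence: row total * column total / grand total."""
    col_tot = [sum(col) for col in zip(*table.values())]
    n = sum(col_tot)
    return {row: [sum(counts) * ct / n for ct in col_tot]
            for row, counts in table.items()}

expected = expected_counts(observed)
diff = {row: [o - e for o, e in zip(observed[row], expected[row])]
        for row in observed}

for row in observed:
    for name, d in zip(COLUMNS, diff[row]):
        print(f"helix {row} chain, {name}: observed - expected = {d:+.1f}")
```

Side chains in helices come out above expectation for phosphate and below it for bases, while main chain contacts show the opposite signs, in line with the text.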
DISCUSSION
Here, we will discuss and compare our results with previous
data and especially with the work on a smaller set of
complexes (Jones et al., 2001) which appeared during the
course of our own analysis (see Table 13).
Secondary structures in interfaces
In subunit or domain interfaces, the amino acids which
belong to distinct secondary structures represent 70% of the
interface surface (Argos, 1988). In RNA–protein interfaces,
the secondary structures, helix or sheet, with a unique
Table 13. Comparison between some properties of RNA–protein interfaces derived from the present work and
from another analysis of 32 RNA–protein complexes (Jones et al., 2001)
Present work
The ensemble of all interaction types
Most preferred amino acid types
Less preferred amino acid types
Contribution to total interface amino acids
Ribose, phosphate, bases
Bases A/C/G/U
Favoured pairs
Distribution of interaction types
H-bonds (neutral, ionic, CH…O)
ionic
H-bonds
CH…O H-bonds
van der Waals
Short
H-bonds
Ratio: bases/(ribose + phosphate)
Atoms of RNA involved
Arg, Asn, Ser, Lys
Ala, Ile, Leu, Val
Hydrophobic, 22%
Charged, 40%
positive, 32%
negative, 8%
Polar, 30%
Gly, 8%
42% of interactions with bases
No significant preference
Arg/Lys–phosphate
Met/Phe/Tyr–ribose
Ile/Pro/Ser–A
Leu–C
Asp/Gly–G
Asn–U
23%
3%
12%
8%
72%
5%
32 RNA–protein complexes
sample (Jones et al., 2001)
Tyr, Lys, Phe, Ile, Arg, Asn, Ser
His, Asp, Pro, Ala, Glu
58% of interactions with bases
Preference for G and U
Arg/Lys–phosphate
Ser/Phe–A
Asp/Tyr–C
Trp/Gly–G
Asn/Glu–U
8%
92%
1.1 (0.4–1.8)
O2', 21% of all H-bonds
O2', O6(G) and N2(G) preferred
O4', O1P and N3(G) infrequently
Main chain, 32%
O, OXT, 14%
NH, 19%
Side chain, 68%
O, 12%
OH, 14%
NH, 42%
SD Met, 0.3%
1.0
Ribose, O2' only
van der Waals
Ratio: bases/(ribose + phosphate)
Atoms of ribose involved
1.0 (0.6–1.5)
O2', 32%
1.4
O2', 34%
CH…O H-bonds
CH groups of RNA
Favoured pairs
Principally C5', C4', C2(A), C1'
Phe–A Ser–A
Atoms of protein involved
Bridging water molecules
Average number of water molecules that form
H-bonds to both RNA and protein
12.2 (7.7–16.8)
interface amino acid contain only 5% of the total interface
amino acids and most of them contain three amino acids.
Draper (1999) distinguishes two main classes of proteins
binding to RNA. In the first class, a secondary structure
element (helix, ribbon or even unstructured region) binds in
0.3 (0.1–0.5)
a frequently distorted groove of the RNA helix. In the
second class, a β-sheet surface binds to a single-stranded
RNA region. He noticed that, in the second class, the
proteins tend to ‘ignore the RNA backbone’ (Draper, 1999).
Interestingly, from Table 12(a), one can observe proportionally
twice as many contacts with the sugar-phosphate
backbone in contacts between helices and RNA than in
contacts between sheets and RNA. Two trends can be
further noticed, although not statistically significant in our
present set of structures. The protein main chain atoms are
less often implicated in contacts when they belong to β-sheets
than when they form an α-helix. In the absence of
repetitive structures, there is no difference in the number of
contacts formed by main chain and side chain atoms.
Size of the interfaces
The size of the interfaces, measured by the number of their
atoms, varies from one complex to another: the percentage
of atoms located at the interfaces ranges from 1% up to
25%. A point plot of atom counts in interfaces against atom
counts in the complex, after square root transformation,
shows scattered points with no relationship between the two
counts. The same holds true in respect of protein–protein
interfaces (Conte et al., 1999), subunit-subunit interfaces
and domain interfaces (Argos, 1988), as well as DNA–
protein interfaces (Nadassy et al., 1999).
The composition of amino acids at the interfaces
Considering the counts of amino acids of each type in the
protein, Ala, Ile, Leu, Val are infrequently located in RNA–
protein interfaces, whereas Arg, Asn, Ser, Lys are most
frequently located in the interfaces (see Table 3). These
findings may be compared with those obtained from
a sample of 32 RNA–protein complexes, in which the
amino acid interface propensities have been calculated for
every amino acid type (Jones et al., 2001). From that study,
the amino acids Arg, Lys, Asn, Tyr, Phe, Ile, Ser, which
have values of propensity greater than one, occur more
frequently in the interface than in the remaining protein
surface; on the contrary, the amino acids Asp, Glu, His, Ala,
Pro occur less frequently in the interfaces than in the
remaining surface. However, owing to the variability of
values from one complex to another, a value of propensity
greater than 1 may be due to random sampling fluctuations
and may not necessarily indicate that this amino acid occurs
significantly more frequently in the interface than on the
protein surface. Thus, the values of the amino acid
propensities are difficult to compare with the preferences
for some amino acids over the others found in the present
sample of RNA–protein complexes and it is not possible to
deduce whether the differences between amino acid
preferences observed in the two samples of complexes are
significant.
The protein–protein interfaces (Conte et al., 1999), the
subunit–subunit interfaces and the domain interfaces
(Argos, 1988) are depleted in the charged amino acids
Asp, Glu, Lys compared to the remaining surface. These
interfaces contain more aromatic amino acids, His, Tyr, Phe,
Trp, and aliphatic amino acids, Leu, Ile, Val, Met, than the
remaining surface. Arg is the most abundant amino acid in
protein–protein interfaces (Conte et al., 1999).
In the RNA–protein complexes, the percentages of amino
acids at the interfaces, defined as percentage atom count
contributions to the total atom count of interfaces, have the
following means and 95% confidence intervals. For
example: Arg, 19.6% (16.0–23.1); Lys, 14.2% (11.4–
17.0); Gly, 6.6% (5.0–8.3); Asp, 2.6% (1.6–3.5); Ile, 2.5%
(1.4–3.5); Leu, 2.6% (1.6–3.6); Tyr, 5.9% (3.7–8.0). The
percentages in question are not defined in the same way as
for the protein–protein interfaces: in protein–protein interfaces (Conte et al., 1999; Argos, 1988) they are defined as
percentage area contribution to the solvent-accessible
surface area. Nevertheless, they may be compared. Arg,
Lys and Gly are more represented in RNA–protein
interfaces than in protein–protein interfaces, whereas Asp,
Ile, Leu and Tyr are less represented in RNA–protein
interfaces than in protein–protein interfaces. From these
results one may conclude that protein–protein interfaces and
RNA–protein interfaces differ from each other in amino
acid composition: Arg, Lys, Gly are more abundant and Ile,
Leu, Tyr less abundant in RNA–protein interfaces than in
protein–protein interfaces.
The amino acid compositions of RNA–protein interfaces
and DNA–protein interfaces (Nadassy et al., 1999) seem
similar: the interfaces contain principally positively charged
amino acids. The percentages of amino acids in the
interfaces are approximately the same in the two kinds of
complexes, except for Gly which seems more abundant in
RNA–protein interfaces than in DNA–protein interfaces.
Ribose, phosphate and bases at the interfaces
In DNA–protein complexes, the interfaces and the remaining surface differ from each other in an excess of phosphate
and a lack of ribose in the interfaces vs the remaining
surface, measured by the fraction of solvent-accessible
surface area for ribose, phosphate and bases (Nadassy et al.,
1999). On the other hand, in the RNA–protein complexes,
considering the total counts of ribose, phosphate and bases
in the complexes, the proportions of ribose, phosphate and
bases located in the interfaces vary from one complex to
another, and no significant preference for ribose or
phosphate or bases over the others appears in the interfaces.
The ribose, phosphate and bases at the interfaces may also
be measured by the percentage of their atom count
contributions to the total atom count of the interfaces. The
values are 34% for ribose, 31% for phosphate and 33% for
bases, with a large variation between complexes. Considering the total number of interactions, all types joined,
between atoms of the protein and atoms of the RNA, 29%
occur with ribose, 29% with phosphate and 42% with bases.
In the 32 RNA–protein complexes sample (Jones et al.,
2001), the bases are involved in 58% of the total count of
interactions, which is greater, but without any assurance of
significance.
Dependency between the amino acid type and the
nucleotide component
From the present sample of RNA–protein complexes one
may summarize the preferences that some amino acid types,
except for Cys and Trp whose expected count is too small,
have for some nucleotide components over the others as
follows: Arg and principally Lys display a preference for
phosphate over ribose and bases. Some amino acids show a
preference for some bases over the others: Ile, Pro, Ser
prefer A over the others; Leu prefers C over the others; Asp
and Gly prefer G over the others; and Asn prefers U over the
others.
The findings from the 32 RNA–protein complexes sample
(Jones et al., 2001) allow computation of a chi-square
statistic from the observed counts of interactions (H-bonds
and van der Waals interactions summed) between the amino
acid types (all amino acid types except Cys and His which
have too small expected counts) and the nucleotide
component ribose, phosphate, base A, base C, base G and
base U. The expected counts are calculated as usual in a chi-square
test. This test points out a relationship between the
two factors (amino acid type and nucleotide component) and
shows favoured interactions between Arg–phosphate, Lys–
phosphate, Asn–U, Gly–G, Tyr–C, Asp–C, Ser–A, Phe–A,
Glu–U, and principally Trp–G. Some favoured contacts
between amino acid types and nucleotide components are
common between the two samples of complexes; Trp–G
could not be found in the present study owing to a small
expected count. Nevertheless, the statistic computed should
be a Boyd and Doll chi-square statistic, that is to say that the
expected counts should be computed in every complex and
then summed. And this may explain some differences in the
favoured contacts observed in the two samples of complexes. Furthermore, in the 32 RNA–protein complexes
sample, the van der Waals interactions involving a given
amino acid are not counted if this amino acid is involved in
H-bonds, whereas in the present sample of complexes, all
van der Waals interactions have been counted. As these
interactions are the most numerous, this difference in
counting the contacts between amino acids and nucleotide
component may also explain some divergences in the
favoured contacts observed in the two samples of complexes.
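The difference between pooled expected counts and expected counts computed in every complex and then summed can be sketched as below. Only the summation step described in the text is shown; the counts are hypothetical and are built so that, within each complex, amino acid type and nucleotide component are exactly independent.

```python
def expected(table):
    """Expected counts under independence for one table of counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    return [[rt * ct / n for ct in col_tot] for rt in row_tot]

def add(a, b):
    """Cell-wise sum of two tables with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Hypothetical counts; rows = Arg, Ser; columns = phosphate, base A
complex_1 = [[40, 10], [20, 5]]
complex_2 = [[5, 20], [10, 40]]

pooled_observed = add(complex_1, complex_2)
pooled_expected = expected(pooled_observed)                        # one table for the whole sample
stratified_expected = add(expected(complex_1), expected(complex_2))  # per complex, then summed

print("observed Arg-phosphate:", pooled_observed[0][0])
print("pooled expected:", pooled_expected[0][0])
print("stratified expected:", stratified_expected[0][0])
```

In this constructed case the pooled table suggests an excess of Arg–phosphate contacts, whereas the stratified expected count equals the observed one, which shows how the two ways of computing expected counts can point to different favoured pairs.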
The H-bonds
In the 32 RNA–protein complexes sample, the H-bonds
represent only 8% of the total count of interactions (Jones
et al., 2001) and 12% in the present sample. This difference
may be due in part to the additional criterion involving the
angle at the calculated position of the H atom in a potential
H-bond in the former study (Jones et al., 2001). In DNA–
protein interfaces, the H-bonds involve principally phosphates (60% of H-bonds; Nadassy et al., 1999). In RNA–
protein interfaces, the observed number of H-bonds and
ionic interactions involving phosphates is greater than
expected, considering the counts of ribose, phosphate and
bases involved in all interaction types, but they constitute
only 43% of all potential H-bonds. The bases are involved in
38% of all H-bonds, as in DNA–protein interfaces. In RNA–
protein interfaces the ribose is more often involved in H-bonds
(20%) than in DNA–protein complexes (6%).
Actually, the latter difference is greater since atoms O3'
and O5' are counted as belonging to ribose in DNA–protein
complexes and to phosphate in RNA–protein complexes,
but, in the absence of information on the counts of H-bonds
involving ribose, phosphate and bases in DNA–protein
interfaces, it is not possible to assert that these differences
are significant. In DNA–protein complexes, only 10% of
the H-bonds have the H donor group on the DNA (Nadassy
et al., 1999), whereas in RNA–protein complexes, more
than 26% of the H-bonds have the H donor group on the
RNA.
Atoms involved in H-bonds. In RNA–protein interfaces,
the atom O2' is involved in 21% of all H-bonds. This
percentage agrees with the one derived from a small sample
of RNA–protein interfaces (Nadassy et al., 1999). Considering the number of H-bonds and van der Waals
interactions involving O2' and the other atoms involved in
hydrogen bonding, the number of H-bonds involving O2' is
greater than expected; the same holds true for N2(G) and
O6(G); similarly, O4', O1P and N3(G) have fewer H-bonds
than expected. In DNA–protein complexes, H-bonds
involving bases occur frequently between G (O6, N7) and
Arg or Lys and between A (N6, N7) and Asn or Gln. On the
contrary, in RNA–protein complexes, H-bonds involving
bases occur principally with N2(G) and O6(G) (as was
suggested by one crystal structure, Masquida et al., 1999).
The guanine bases display a preference for Asp and Glu
over the other amino acids in the case of H-bonds and for
Phe in the case of CH…O H-bonds; A has a preference for
Ser over the other amino acids in the case of H-bonds and
principally in the case of CH…O H-bonds.
Water molecules bridging protein to RNA
The O2' hydroxyl of RNA is the group most frequently
involved in H-bonds with bridging water molecules despite
the fact that it is surrounded by rapidly exchanging water
molecules as shown by molecular dynamics simulations of
RNAs (Auffinger and Westhof, 1997). Atoms C2, C6, C8 of
RNA are rarely involved in such H-bonds, since they form
intramolecular CH…O H-bonds with oxygen atoms in the
RNA (Auffinger and Westhof, 1997). Water bridges
between protein main chain and RNA atoms are surprisingly
frequent, except for those amino acids, which are found
most frequently in direct contact with RNA (like Arg and
Asn).
CONCLUSIONS
While the sizes of the samplings for each variable were
adequate for the conclusions reached, it is yet not known
whether the crystallized complexes represent a significant
sampling of the overall population of RNA–protein
complexes. Within the approximations made and with this
caveat, the main conclusions of the present analysis are
contained in Table 13. The complexity and diversity of
contacts of RNA–protein interfaces are apparent. One can
notice a frequent involvement of hydrophobic and non-polar
amino acids (30%) with a nearly equal partition of contacts
between bases and the sugar-phosphate backbone, but there
is no clear preference for a single nucleic acid base. The
hydrophilic atoms of the two pyrimidine bases form about
the same number of H-bonds. On the other hand, between
the two purine bases, N3 is the avoided atom in guanine, whereas
N7 is the one avoided in adenine. There is a small but
definite preference between some amino acid type and some
nucleic acid base (e.g. Ser-A, Gly-G or Asn-U). As in all
interfaces, there is a high percentage of van der Waals
contacts between the RNA and the proteins (about three-quarters
of them). The importance of the protein main chain
is noticeable since about one-third of the protein atoms
making contacts belong to the main chain. On the other
hand, although positively charged amino acids are among
the most preferred amino acids at the interface between
RNA and proteins, they account for only one-third of the total
number of contacts. In slightly less than half of the
complexes, an average of 12 water molecules occurs in
the interface bridging atoms of the RNA and the protein.
Thus, overall, it is clear that no single interaction type or
contact stands out. All the electrostatic and weak interactions constituting RNA–protein complexes form an ensemble of embedded hierarchies maintaining a complex
network of precisely fitted and adapted inter- and intramolecular contacts.
Acknowledgements
We wish to thank Professor Michel Roos for his help with statistical
methods and Pascal Auffinger for numerous and fruitful discussions on
molecular interactions. M.T. thanks Professor M. Roos for support.
REFERENCES
Argos P. 1988. An investigation of protein subunit and domain
interfaces. Protein Engng 2: 101–113.
Auffinger P, Louise-May S, Westhof E. 1996. Hydration of C-H
groups in tRNA. Faraday Discuss. 103: 151–173.
Auffinger P, Westhof E. 1997. Rules governing the orientation of
the 2'-hydroxyl group in RNA. J. Mol. Biol. 274: 54–63.
Ban N, Nissen P, Hansen J, Moore PB, Steitz TA. 2000. The
complete atomic structure of the large ribosomal subunit at
2.4 Å resolution. Science 289: 905–920.
Bernstein FC, Koetzle TF, Williams GJB, Meyer EF, Brice MD,
Rodgers JR, Kennard O, Shimanouchi T, Tasumi M. 1977.
The Protein Data Bank: a computer-based archival file for
macromolecular structures. J. Mol. Biol. 112: 535–542.
Brandl M, Lindauer K, Meyer M, Sühnel J. 1999. C-H…O and C-H…N
interactions in RNA structures. Theoret. Chem. Acc.
101: 103–113.
Brandl M, Weiss MS, Jabs A, Sühnel J, Hilgenfeld R. 2001. C-H…π
interactions in proteins. J. Mol. Biol. 307: 357–377.
Colquhoun D. 1971. Lectures on Biostatistics. Clarendon Press:
Oxford.
Conover WJ. 1980. Practical Nonparametric Statistics. John
Wiley: Chichester.
Conte LL, Chothia C, Janin J. 1999. The atomic structure of
protein–protein recognition sites. J. Mol. Biol. 285: 2177–
2198.
Draper DE. 1999. Themes in RNA–protein recognition. J. Mol.
Biol. 293: 255–270.
Hermann T, Westhof E. 1999. Non-Watson–Crick base pairs in
RNA-protein recognition. Chem. Biol. 6: R335–R343.
Jones S, Daley DTA, Luscombe NM, Berman HM, Thornton JM.
2001. Protein–RNA interactions: a structural analysis. Nucleic
Acids Res. 29: 943–954.
Knight RD, Freeland SJ, Landweber LF. 1999. Selection, history
and chemistry: the three faces of the genetic code. Trends
Biochem. Sci. 24: 241–247.
Knight RD, Landweber LF. 2000. Guilt by association: the
arginine case revisited. RNA 6: 499–510.
Levitt M, Perutz MF. 1988. Aromatic rings act as hydrogen bond
acceptors. J. Mol. Biol. 201: 751–754.
Masquida B, Sauter C, Westhof E. 1999. A sulfate pocket formed
by three GoU pairs in the 0.97 Å resolution X-ray structure of
a nonameric RNA. RNA 5: 1384–1395.
Nadassy K, Wodak SJ, Janin J. 1999. Structural features of
protein–nucleic acid recognition sites. Biochemistry 38:
1999–2017.
Ribas de Pouplana L, Schimmel P. 2001. Operational RNA code
for amino acids in relation to genetic code in evolution. J.
Biol. Chem. 276: 6881–6884.
Snedecor GW, Cochran WG. 1989. Statistical Methods. Iowa
State University Press: Ames.
Sundaralingam M. 1973. The concept of a conformationally
'rigid' nucleotide and its significance in polynucleotide
conformational analysis. In Conformation of Biological
Molecules and Polymers. (Bergmann VED & Pullmann B
eds), pp. 417–456, The Israel Academy of Sciences and
Humanities, Jerusalem.
VAX Rdb/VMS Reference Manual, VAX Rdb/VMS Guide to
Database Design and Definition, VAX Rdb/VMS Guide to
Data Manipulation, VAX Rdb/VMS Guide to Programming,
RDML Reference Manual, VAX Rdb/VMS Guide to Database
Administration and Maintenance, Digital Equipment Corporation, February, 1987.
Westhof E, Fritsch V. 2000. RNA folding: beyond Watson–Crick
pairs. Structure 8: R55–R65.
Winer BJ. 1971. Statistical Principles in Experimental Design.
McGraw-Hill: Kogakusha.