Download Consequences of Stop Codon Reassignment on

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Matrix-assisted laser desorption/ionization wikipedia , lookup

Nucleic acid analogue wikipedia , lookup

Endogenous retrovirus wikipedia , lookup

Ancestral sequence reconstruction wikipedia , lookup

Protein–protein interaction wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Western blot wikipedia , lookup

Metalloprotein wikipedia , lookup

Two-hybrid screening wikipedia , lookup

Peptide synthesis wikipedia , lookup

Protein wikipedia , lookup

Metabolism wikipedia , lookup

Point mutation wikipedia , lookup

Molecular evolution wikipedia , lookup

Proteolysis wikipedia , lookup

Amino acid synthesis wikipedia , lookup

Biosynthesis wikipedia , lookup

Biochemistry wikipedia , lookup

Genetic code wikipedia , lookup

Transcript
Consequences of Stop Codon Reassignment on Protein Evolution in Ciliates
with Alternative Genetic Codes
Karen Lee Ring and Andre R. O. Cavalcanti
Biology Department, Pomona College, Claremont, California
Tetrahymena thermophila and Paramecium tetraurelia are ciliates that reassign TAA and TAG from stop codons to
glutamine codons. Because of the lack of full genome sequences, few studies have concentrated on analyzing the effects
of codon reassignment in protein evolution. We used the recently sequenced genome of these species to analyze the
patterns of amino acid substitution in ciliates that reassign the code. We show that, as expected, the codon reassignment
has a large impact on amino acid substitutions in closely related proteins; however, contrary to expectations, these effects
also hold for very diverged proteins. Previous studies have used amino acid substitution data to calculate the
minimization of the genetic code; our results show that because of the lasting influence of the code in the patterns of
substitution, such studies are tautological. These different substitution patterns might affect alignment of ciliate proteins,
as alignment programs use scoring matrices based on substitution patterns of organisms that use the standard code. We
also show that glutamine is used more frequently in ciliates than in other species, as often as expected based on the
presence of the 2 new reassigned codons, indicating that the frequencies of amino acids in proteomes is mostly
determined by neutral processes based on their number of codons.
Introduction
Soon after the genetic code was deciphered in the
1960s, Francis Crick proposed his Frozen Accident hypothesis suggesting that the code was universal for all life on
earth (Crick 1968). The hypothesis proposes that after the
identity of the code was established, any change would be
detrimental. Such changes would cause the incorporation
of the wrong amino acid in all positions occupied by the reassigned codon. Thus, all extant life descends from a common ancestor that used the universal code, and this code was
passed unaltered to all organisms; any change to the code
would be highly deleterious and selected against.
This hypothesis was challenged in 1979 with the first
discovery of an alternative genetic code in human mitochondrial genes (Barrell et al. 1979). Alternative codes have
arisen independently at least 10 times in nuclear genomes
(Knight et al. 2001; Soll and RajBhandary 2006) and at
least 24 times in mitochondrial genomes (Swire et al.
2005). Several studies have focused on how such alternative codes might have overcome the fitness barrier of codon
reassignment, and today, 2 main theories are generally accepted for code evolution.
The codon capture model (Jukes et al. 1987; Osawa
et al. 1987; Osawa 1995) suggests that codons can be reassigned by a neutral mechanism if we assume a stage in
which the codon entirely disappears from the genome—
drift or GC pressure can account for this elimination; the
codon could then reemerge by genetic drift and be decoded
(captured) by a tRNA for another amino acid. This model
works well for small genomes, like mitochondrial genomes.
In these cases, it is easy to imagine a codon stochastically
disappearing from the genome and being captured by another tRNA. In fact, Swire et al. (2005) showed that reassignments involving stop codons in mitochondria seem to
have followed this mechanism, although they could not find
evidence for it in the reassignments of sense codons.
Key words: scoring matrix, substitution matrix, genetic code,
Tetrahymena, Paramecium.
E-mail: [email protected].
Mol. Biol. Evol. 25(1):179–186. 2008
doi:10.1093/molbev/msm237
Advance Access publication November 1, 2007
Ó The Author 2007. Published by Oxford University Press on behalf of
the Society for Molecular Biology and Evolution. All rights reserved.
For permissions, please e-mail: [email protected]
The ambiguous intermediate hypothesis (Schultz and
Yarus 1994, 1996; Yarus and Schultz 1997) proposes that
there is a transition state during which the reassigned codon
is ambiguously decoded by both its cognate tRNA and
a mutant tRNA; the novel tRNA gradually displaces the
normal cognate tRNA and takes over the decoding of
the ambiguous codon. There are several instances of ambiguous translation in nature that support this model (Lovett
et al. 1991; Chittum et al. 1998; Matsugi et al. 1998; Santos
et al. 1999; Beier and Grimm 2001).
Although several studies have focused on how alternative codes arise, few have concentrated on the evolutionary consequences of such reassignment. One exception is
a study performed by Massey et al. (2003), in which the
authors compared the genome of Saccharomyces cerevisiae
(which uses the universal code) with that of Candida albicans (a closely related yeast that reassigns the CTG codon
from Leu to Ser). They showed that coding positions occupied by CTG in C. albicans align with positions that contain
a Ser codon in S. cerevisiae; likewise positions occupied by
CTG in S. cerevisiae align with positions occupied by other
Leu codons in C. albicans. In other words, the reassignment
seems to have caused very little alteration in the protein sequences. Based on the analyses of tRNA sequences, Massey
et al. (2003) proposed that the reassignment in C. albicans
happened through an ambiguous intermediate phase.
Candida albicans reassignment, however, is atypical
as it changes a sense codon to another sense codon, whereas
the majority of codon reassignments in nuclear genomes
involve the reassignment of a stop codon to a sense codon
(Knight et al. 2001). Studies of such reassignments were
limited due to the lack of complete genome sequences
available. Recently, the genomes of 2 oligohymenophoran
ciliates that reassign the TAA and TAG stop codons to Gln,
Tetrahymena thermophila (Eisen et al. 2006) and Paramecium tetraurelia (Aury et al. 2006), have been sequenced,
allowing us to address the question of how code reassignment alters the evolutionary landscape of such organisms.
One of the main characteristics of the universal code,
also described shortly after the code was deciphered, is that
similar amino acids have similar codons (Crick 1968). This is
generally referred to as the minimization of the code because
it minimizes the effects of point mutations; codons separated
180 Ring and Cavalcanti
by 1 point mutation tend to code for similar amino acids.
Freeland et al. (2000) showed that the genetic code is highly
minimized. In fact, a random code has only one chance in
a million of being more minimized than the standard code.
In their study, Freeland et al. (2000) used the PAM74100 matrix (Benner et al. 1994) to determine the distance
between a pair of amino acids. PAM stands for point accepted mutations, for example, a PAM1 matrix is determined from the alignment of closely related sequences
by counting the number of times amino acid i substitutes
amino acid j and normalizing by a constant so that each site
in the alignment has a chance of 1% of being substituted. It
has been known that the PAM1 matrix reflects to a large
extent the organization of the genetic code; most amino
acids that are frequently substituted in the PAM1 matrix
have codons that are interchangeable by a single point mutation. In other words, the PAM1 matrix is dominated by
neutral or nearly neutral mutations.
Typically, to calculate substitution matrices with higher PAM numbers, that is, representing the substitution frequencies of amino acid for more distantly related proteins,
one multiplies the PAM1 matrix by itself many times. The
PAM250 matrix, for example, is obtained by multiplying
the PAM1 matrix by itself 250 times. Higher PAM matrices
obtained using this method will still reflect the organization
of the code as they are derived directly from the PAM1.
To avoid this bias, Freeland et al. (2000) used the
PAM74-100 matrix (Benner et al. 1994). This PAM matrix
is unique in that it was derived using only highly diverged
protein sequences. They have argued that because this matrix is derived using divergent proteins, it should predominantly reflect amino acid biochemical similarity and not
how frequently amino acids are converted by mutation
(Freeland et al. 2000). As such, they propose that this matrix is not correlated with the genetic code and is ideal to
determine the distance between amino acids when calculating the minimization of the code.
This claim has been contested by DiGiulio (2001),
which showed that even the PAM74-100 matrix still show
correlations with the organization of the code. However,
just because a substitution matrix shows some correlation
with the organization of the code, it does not follow that it is
directly influenced by the code. It could be that both the
code organization and the matrix are being influenced by
some biochemical property of the amino acids, in which
case the use of the matrix to study the minimization of
the code would be justified. In fact, subsequent studies assumed that the PAM74-100 is only slightly influenced by
the code (Gilis et al. 2001). In a recent paper (Bulka et al.
2006), the authors used statistical correlations between
amino acid properties and the PAM74-100 matrix to suggest
that the physicochemical properties of the amino acids are
more important to long-term protein evolution than codon assignment; that is, neutral evolution by mutation based on the
genetic code becomes negligible as evolution proceeds.
Due to the lack of sequenced genomes with reassigned
codes, it was difficult to determine if the code structure influences the PAM74-100 matrix or if both reflect the underlying properties of the amino acids. In this study, we
compare the newly completed genomes of 2 organisms that
use alternative genetic codes (Aury et al. 2006; Eisen et al.
2006) to calculate organism-specific substitution matrices.
Our results show that amino acid substitution frequencies
are very different for organisms that reassign codons and
that these differences are still pronounced for highly diverged proteins. The results suggest that amino acid substitution matrices should not be used to calculate the
minimization of the code as these matrices are likely to
be highly influenced by the structure of the code.
Materials and Methods
Data
We downloaded the protein and coding DNA sequences of the following organisms: T. thermophila (ftp://ftp.
tigr.org/pub/data/Eukaryotic_Projects/t_thermophila/);
P. tetraurelia (ParameciumDB: http://paramecium.cgm.
cnrs-gif.fr/); Drosophila melanogaster (FlyBase: http://
flybase.bio.indiana.edu/); Caenorhabditis elegans (WormBase: http://www.wormbase.org/); and Plasmodium falciparum (PlasmoDB: http://www.plasmodb.org/).
The reason for the use of the P. falciparum genome is
2-fold: first, Plasmodium is an alveolate, a sister taxon of
ciliates and, as such, the genome most closely related to
ciliates available; second, it has been shown that substitution matrices can be influenced by the GC content or the
amino acid content of the sequences being aligned (Yu
et al. 2003; Yu and Altschul 2005). Both Tetrahymena
and Paramecium have AT-rich genomes, so any differences
between their substitution patterns and those of the fly and
worm could be dismissed as reflecting their much lower GC
content. Plasmodium is more AT rich than either of the ciliates. If we show that the pattern of amino acid substitution
in Plasmodium is less skewed than that of the ciliates, we
can infer that the differences observed in the ciliate matrix
are due to the genetic code and not GC content.
Determining Pairs of Paralogous Genes
In their paper, Massey et al. (2003) first determined
orthologous genes between C. albicans and S. cerevisiae
and aligned pairs of orthologous genes between the 2 species to determine the frequency with which each codon in
one species aligns with codons from the other species. This
was possible because C. albicans and S. cerevisiae are
closely related yeast species and orthologs can be easily defined and aligned.
Unfortunately, although Tetrahymena and Paramecium are both oligohymenopheran ciliates, they are not as
closely related as the 2 yeast species in Massey et al.
(2003) study; they also lack a closely related outgroup. This
problem led us to use paralogous genes instead of orthologs.
Paralogs are genes separated by a duplication event, whereas
orthologs are genes separated by speciation.
To determine pairs of paralogous genes, we used reciprocal best Blast hits. Basically, for each species, we performed a Blast search (Altschul et al. 1990, 1997) of each
protein against all other proteins in the organism. If the best
hit of protein A was protein B and the best hit of protein B
was protein A, proteins A and B were considered paralogs.
Note that each of these proteins can have many more
Protein Evolution in Ciliates with Alternative Codes 181
paralogs in the genome; by using reciprocal best Blast hits,
we tried to use pairs of proteins that are derived from a single duplication event. This simplification avoids the complications of large gene families biasing our results.
Alignment of Paralogs
After determining the paralogous pairs, we used the
FASTA program (Pearson and Lipman 1988) to align
the protein sequences and determine the substitution patterns of amino acids for each organism. We chose FASTA
to perform these analyses as, for proteins, it uses a Smith–
Waterman algorithm. FASTA also allows the alignment to
be performed using different substitution matrices. We performed the procedure twice, once for closely related proteins and once for divergent proteins.
In the first case, we used FASTA with the MDM10
matrix, a PAM10 matrix obtained by Jones et al.
(1992)—this is the matrix with the lowest PAM value available in FASTA. We removed identical proteins and only
used pairs of paralogs that generated alignments longer than
100 amino acids with at least 85% identity in the calculation
of the substitution matrix.
Unfortunately, the PAM74-100 is not available in
FASTA, so to calculate the matrix for divergent proteins,
we used FASTA with the BLOSUM45 matrix (Henikoff
and Henikoff 1992). This matrix is derived from highly divergent sequences using a procedure slightly different from
the one used to calculate PAM matrices. We only used
alignments longer than 100 amino acids and with identities
between 25% and 60%.
Calculation of Substitution Matrices
For each organism, we calculated a substitution matrix
for closely related proteins and one for divergent proteins.
The procedure to calculate the matrices was similar and is
described below.
First we created a matrix A, where each element Aij
gives the frequency with which we observe amino acid i
aligned with amino acid j. Each Aij was determined by analyzing the alignments one position at a time. Because we
are using pairwise alignments, we only observe the aligned
amino acids and cannot determine what the ancestral amino
acid is. Thus, each time that amino acids i and j are in the
same column of an alignment, we added one to Aij and to Aji.
For each organism, we calculated the total number of
substitutions observed in the alignments and used this number to calculate a normalization constant lambda. This constant was used to normalize the proportion of substitutions
among organisms and is calculated as 0.01 times the number of substitutions divided by the total number of amino
acids in the alignment (eq. 1). All columns containing gaps
were excluded.
ntotal
ð1Þ
k50:01 P :
Aij
i6¼j
Finally, we calculated a substitution matrix M, where
each element Mij was given by the sum of Aij and Aji mul-
tiplied by lambda and divided by the frequency of amino
acid i plus that of amino acid j, to account for the different
frequencies of the amino acids in each organism and generate a symmetric matrix (eq. 2).
k Aij þ Aji
:
ð2Þ
Mij 5
ni þ nj
The matrices M represent the amino acid substitution
frequencies and were used to compare the frequencies with
which each amino acid substitutes another in each organism. We will refer to the matrices obtained from the alignment of closely related proteins as the ‘‘C’’ matrices and
those obtained for divergent proteins as the ‘‘D’’ matrices.
Results
Effects of Reassignment in Amino Acid Frequencies
King and Jukes (1969) showed that the frequency of
amino acids in proteins correlates with the number of synonymous codons that code for each amino acid. They proposed that this is caused because most amino acids in
proteins are the result of neutral mutations that do not affect
protein structure or function; as such, the number of codons
determines the frequency of the amino acids. Gilis et al.
(2001) suggested an alternative explanation for the correspondence between amino acid frequency and number of
codons. They hypothesize that the genetic code evolved
to take into account preexisting amino acid frequencies.
For example, Arg, Leu, and Ser are each coded by 6
codons and are common amino acids in proteins. According
to the first hypothesis, they are common in proteins because
most amino acids arise by random mutations in the genome
and, as such, the number of codons for a given amino acid
should determine its frequency (King and Jukes 1969). If
the second hypothesis were correct, these amino acids have
6 codons because they were common when the code was
being established, and their frequency acted as an evolutionary pressure determining their number of synonymous
codons (Gilis et al. 2001).
Although we cannot determine the frequency of the
amino acids when the code was established, we can determine the effects of changing the number of codons that code
for an amino acid on the usage of that amino acid. Glutamine is coded by 4 codons in Tetrahymena and Paramecium and only by 2 codons in organisms that use the
universal code. If the number of codons impacts the frequency with which an organism uses a given amino acid
through neutral processes, Gln should be more common
in Tetrahymena and Paramecium than in organisms that
use the universal code.
Indeed, Glutamine residues are more common in these
ciliates’ genomes than in the genomes of other eukaryotes
(table 1). To determine what would be the expected frequency of Gln in the different genomes under a neutral
model, we need to take into consideration the GC content
of the genomes—under a neutral model, we would expect
the frequency of a codon in the genes to be proportional to
the frequency of its constituent nucleotides. If we consider
182 Ring and Cavalcanti
Table 1
Correspondence between Observed and Expected Frequencies for an Amino Acid with Reassigned Codons
Organism
Drosophila
Caenorhabditis elegans
Plasmodium
Paramecium
Tetrahymena
GC Content
(%)
Observed Gln
Frequency
Expected Frequency
Based on the No. of Codons
Expected Frequency Correcting
for Different GC Contents
54
43
24
30
28
0.053
0.042
0.028
0.091
0.098
0.033
0.033
0.033
0.063
0.063
0.038
0.036
0.029
0.098
0.101
the GC content of each species’ coding regions, it can be
seen that the usage of Gln in the proteins follows closely the
expected frequency based on its number of codons (table 1).
This result cannot rule out that the number of codons
for each amino acid in the code reflects the availability of
the amino acids when the code was established. However, it
suggests that, in present species, the number of codons coding for an amino acid determines its frequency in the proteins, supporting King and Jukes’ (1969) hypothesis.
Substitution Patterns of Gln in Tetrahymena and
Paramecium
In their analyses, Massey et al. (2003) used orthologous genes between C. albicans and S. cerevisisae to determine which codons align with CTG in both species.
Unfortunately, in the case of Tetrahymena and Paramecium, we do not have the genome sequence of a closely
related species that uses the standard code. The most closely
related species with a full genome is the fairly distantly related apicomplexan P. falciparum (both apicomplexans and
ciliates are part of the alveolates taxon).
Because interspecies comparisons were ruled out, we
decided to perform intraspecies comparisons and contrast
the pattern of substitutions in each species to that of the
other species. We used the genomes of Tetrahymena, Paramecium, Plasmodium, Drosophila, and C. elegans. For
each of these genomes, we determined pairs of paralogous
genes—genes that are related through a duplication event—
and aligned these pairs to determine the patterns of amino
acid substitutions in each species. We performed the calculations twice. For each organism, first we generated a substitution matrix C for closely related proteins (more than
85% identity) and then a substitution matrix D for divergent
proteins (percent identity between 25% and 60%). Table 2
shows some statistics of the alignments used to create the
matrices.
To verify how closely each of the calculated C matrices matches the MDM_10 matrix (Jones et al. 1992) used to
align the closely related proteins, we calculated correlations
coefficients between the C matrices and MDM_10. Likewise, we calculated the correlation coefficients between
the D matrices and the BLOSUM45 matrix (Henikoff
and Henikoff 1992). Even though we could not use the
PAM74-100 matrix (Benner et al. 1994) to align the proteins, as this matrix is not available in the FASTA program,
we also calculated the correlations between the D matrices
and the PAM74-100 matrix. Table 3 shows the results of
these calculations.
As can be seen, both the C and the D matrices show
good correlations with the MDM_10 and BLOSUM45
(PAM74-100) matrices, respectively. In all comparisons,
the Tetrahymena and the Paramecium matrices are less correlated to the matrices used to generate the alignments than
the matrices of the other organisms. It could be argued that
the ciliates matrices are less correlated with the original matrices because of the extremely low GC content of the ciliates genome. However, the ciliates’ matrices are less
correlated than those of Plasmodium, an organism with
an even lower GC content, discarding this explanation.
Because both Tetrahymena and Paramecium reassign
the stop codons TAG and TAA to Gln, we also calculated
the correlation coefficients using only substitutions that
involve Gln instead of the whole matrix (table 3). The difference in the correlations of the Tetrahymena and Paramecium matrices and those of the other species are
even higher when we consider only the subset of substitutions involving the reassigned amino acid Gln, as would be
expected if the differences are mostly due to the use of an
alternative genetic code.
To further understand how the genetic code change
affects the amino acid substitution matrix, we concentrated
only on the substitutions involving Gln (fig. 1). For closely
related proteins (fig. 1a), 4 amino acids consistently tend
to substitute Gln more frequently in Tetrahymena and
Table 2
Statistics of the Alignments Used to Generate the Substitution Matrices for Each Organism
Closely Related Proteins
Organism
Drosophila
Caenorhabditis elegans
Plasmodium
Paramecium
Tetrahymena
Divergent Proteins
No. of
Substitutions
No. of Aligned
Residues
% Mutated
No. of
Substitutions
No. of Aligned
Residues
% Mutated
5,558
19,324
1,322
382,700
41,922
1,033,570
1,377,198
42,958
5,330,070
516,282
0.54
1.40
3.08
7.18
8.12
403,972
563,786
151,234
807,994
1,631,594
722,106
1,032,324
250,326
1,576,270
2,833,642
55.94
54.61
60.41
51.26
57.58
Protein Evolution in Ciliates with Alternative Codes 183
Table 3
Correlation Coefficients between the Substitution Matrices Derived for Each Organism and the MDM_10 (PAM10),
BLOSUM45, and PAM74-100 Scoring Matricesa
MDM_10
All matrix
Only Gln
BLOSUM45
All matrix
Only Gln
PAM74-100
All matrix
Only Gln
Drosophila
Caenorhabditis elegans
Plasmodium
Paramecium
Tetrahymena
0.731
0.868
0.778
0.89
0.771
0.732
0.683
0.601
0.718
0.549
0.816
0.869
0.81
0.875
0.764
0.825
0.737
0.681
0.753
0.696
0.787
0.929
0.778
0.927
0.723
0.847
0.696
0.697
0.708
0.733
NOTE.—The Tetrahymena and Paramecium matrices have lower correlations than the matrices for other organisms.
a
All correlations are significant in the 99% confidence level.
Paramecium than in any of the other species: Leu, Ser, Trp,
and Tyr. The codons of 3 of these amino acids, Ser, Trp, and
Tyr, are not interchangeable to Gln codons by a single point
mutation in the standard genetic code but are interchangeable in the ciliates’ code. The other amino acid, Leu, has 2
codons that can be converted to Gln following a point mutation in the universal genetic code (CU(AG) /CA(AG)),
whereas in the ciliate code, 4 of the Leu codons are interchangeable to Gln codons by 1 point substitution
(CU(AG)/CA(AG) and UU(AG)/UA(AG)).
For more distantly related proteins (fig. 1b), we can see
that 3 of these 4 amino acids—Tyr, Ser, and Leu—still tend
to substitute Gln more frequently in the ciliates, but we also
observe an overabundance of Gln substitutions to Lys, Asn,
Ile, Phe, and Thr. The excess in Lys is also observed for
closely related proteins in Tetrahymena (but not Paramecium; fig. 1a) and can be explained by the presence of 4
Gln codons that can be converted to Lys in the ciliate code
as opposed to the 2 codons in the standard code. The excesses of Asn, Thr, Phe, and Ile can be explained assuming
a 2-step substitution process. For example, it is possible that
the excess Thr is caused because Gln tends to be substituted
by Ser more frequently in ciliates and Ser can be easily converted into Thr. The same argument can be made for Ile and
Phe: Gln is substituted by Leu, which in turn is easily
substituted by Ile or Phe. An analogous argument can also
be made for Asn: Gln to Lys to Asn. The high frequency of
Phe can also be explained by Gln to Tyr to Phe mutations
FIG. 1.—The substitution frequencies of Gln to each of the other amino acids are different in organisms that reassign the code. (a) Closely related
proteins. The x axis shows the amino acids ordered by the substitution frequency in the MDM10 matrix (Jones et al. 1992). (b) Divergent proteins. The
x axis shows the amino acids ordered by the substitution frequency in the BLOSUM45 matrix (Henikoff and Henikoff 1992).
184 Ring and Cavalcanti
FIG. 2.—Substitution frequencies of Asn (N), Ile (I), Thr (T), and Phe (F) to each of the other amino acids in divergent proteins in Paramecium.
The frequency with which each amino acid is conserved is not plotted as it is outside the bounds of the plot (each value is ca. 0.99, as the matrix is
normalized to 1% changes). Note that the substitutions Ile to Leu, Phe to Leu, Phe to Tyr, Thr to Ser, and Asn to Lys are all very frequent (see text).
as Tyr and Phe are substituted very frequently. Figure 2
shows the frequency of substitutions of Asn, Ile, Phe,
and Thr to the other amino acids for divergent proteins
in Paramecium (Tetrahymena gives similar results; data
not shown). As can be seen, all scenarios described above
agree with the data; the frequency of substitution of Lys to
Asn, Leu to Ile, Leu to Phe, Tyr to Phe, and Ser to Thr are all
high.
Interestingly, the alignment of Gln to Trp in divergent
proteins is not more common in ciliates than in organisms
that use the standard code. This indicates that as the proteins
become more diverged, the physicochemical properties of
the amino acids have a larger effect in the substitution patterns—Trp is a very large amino acid and only rarely aligns
with Gln even in closely related proteins. Besides the Gln to
Trp substitution, not all 2-step substitutions made possible
by the alternative code are overrepresented in ciliates. This
makes sense, we do not propose that the genetic code is the
only factor affecting substitution matrices, the physicochemical characteristics of the amino acids are also an important factor in their substitution frequencies, and the Gln
to Trp case is an example of that. However, our results show
that the structure of the code influences even substitution
matrices derived from highly diverged proteins.
Conclusions
King and Jukes (1969) proposed that the frequency of
use of an amino acid is correlated with its number of codons
for neutral reasons—most amino acids in a genome arise by
random mutations that do not affect protein function. Gilis
et al. (2001) proposed that another reason for the correspondence between amino acid frequency and number of synonymous codons could be that the genetic code adapted to
preexisting amino acid frequencies.
When we take into account the GC content of the genomes, the frequencies of Gln usage in Tetrahymena and
Paramecium are extremely close to those predicted by a neutral model of mutation; both ciliate genomes have a much
higher frequency of Gln than the genomes of organisms that
use the standard code. These frequencies closely follow the
expected frequencies based on the number of Gln codons in
the ciliates’ genetic code and their GC content.
This indicates that the hypothesis of King and Jukes
(1969) is sufficient to explain the correlation between
amino acid usage and number of codons, without the need
to invoke an adaptation of the code to preexisting amino
acid frequencies as proposed by Gilis et al. (2001). However, one could still argue that the ciliates reassigned TAA
and TAG to Gln because, for some reason, their genomes
experienced selection for the use of more Gln in their proteins. Although we cannot rule out this explanation, it
seems unlikely when one considers that other ciliates, like
Euplotes octocarinatus, use a different alternative code,
which reassigns the TGA stop codon to Cys. If this hypothesis were correct, it would indicate that some ciliate species
experienced selection to increase the number of Gln in their
proteins, whereas other ciliate species experienced selection
to increase Cys. Although such different pressures are not
mutually exclusive, this makes an explanation based on
neutral mutations more parsimonious.
When looking at the amino acid substitution patterns
in Tetrahymena and Paramecium, our results are surprising
because they indicate that the importance of the genetic
code organization in amino acid substitution frequencies
does not become irrelevant with evolutionary time as previously thought (Freeland et al. 2000). For closely related
proteins, we have shown that several amino acids whose
codons can be converted to TAA or TAG by a single point
mutation—Leu, Ser, Trp, and Tyr—substitute Gln more
frequently in ciliates than in species that use the standard
code. For distantly related proteins, even some amino acids
whose codons are separated from TAA or TAG by 2 point
mutations might substitute Gln more frequently in ciliates
than in organisms that use the standard code. The latter result shows that the genetic code organization also influence
matrices derived from highly divergent proteins. It is important to stress, however, that the structure of the code is not
the only factor influencing substitution frequencies. Some
of the 2-step substitutions made possible by the alternative
code are not overrepresented in the divergent proteins of
ciliates, indicating that physicochemical characteristics
of the amino acids are also an important factor affecting
substitution frequencies. The final observed substitution
frequencies are a result of these 2 factors—organization
of the code and physicochemical properties of the amino
acids—acting together.
Protein Evolution in Ciliates with Alternative Codes 185
Our results indicate that matrices obtained from amino
acid substitutions in protein alignments should not be used
to study genetic code minimization as they are highly influenced by the structure of the code, which makes the analyses tautological. It is important to note that we are not
questioning the results of Freeland et al. (2000) suggesting
that the code is highly minimized. The code is highly minimized regardless of the metric chosen for the distance
between amino acids. On the other hand, the use of the
PAM74-100 in their study probably influenced the level
of minimization calculated for the code. The genetic code
is more minimized when using the PAM74-100 than when
using polarity as a distance metric (Freeland et al. 2000),
probably because code structure influences the PAM74100 matrix.
Furthermore, in the same paper Freeland et al. (2000)
conclude that the standard code is more minimized than any
of the known alternative codes. Because of the intrinsic bias
present in the PAM74-100 matrix, this analysis might not
be valid. For example, the reassignment of TAA and TAG
to Gln adds several point mutations to the ciliate genetic
code not available in the standard code, like Gln to Ser
and Gln to Tyr. In the PAM74-100 matrix, these are unlikely substitutions, which would penalize the ciliate code
in the calculation. If the frequency of amino acid substitutions used was calculated from the ciliate genomes, these
mutations would be frequent, which could make the ciliate
code more minimized than the standard code.
A recent study showed that ciliate proteins have an elevated rate of evolution compared with those of nonciliate
species (Zufall et al. 2006). The authors correlate the higher
evolutionary rates to genome organization in ciliates; they
propose that the ciliate genome architecture—nuclear dimorphism, high levels of chromosome processing, amitosis,
and epigenetic inheritance—allows these organisms to explore protein space in a novel manner (Zufall et al. 2006). It
is unlikely that the different patterns of amino acid substitution that we describe could be caused by the same mechanisms proposed by Zufall et al. (2006); their mechanisms
suggest a generally higher evolutionary rate not related to
the genetic code, whereas we observe that some substitution
rates, those favored by the alternative code, are more elevated than others (note that we are not comparing absolute
substitution rates, our calculations, by definition, normalize
the number of substitutions observed). Instead, we believe
that our results present yet another different way in which
ciliates are able to explore protein space, the reassignment
of TAG and TAA to Gln allows some rare substitutions to
occur more often in ciliate lineages. This could also contribute to the increased evolutionary rates of ciliate proteins, in
addition to the mechanisms proposed by Zufall et al. (2006).
It is important to keep in mind the different pattern of
amino acid substitutions in organisms that reassign the code
when aligning proteins from these organisms. All alignment programs come with precompiled substitution matrices, and none of those are derived from alternative codes.
The alignment of closely related proteins, with few amino
acid differences, is usually very robust to the matrix chosen
and should not suffer from this problem. However, the use
of different matrices has a large effect on the alignment of
divergent proteins. This would not be much of a problem if,
as proposed by Freeland et al. (2000) and generally accepted,
matrices derived from divergent alignments were not influenced by the code, but in this study, we show that the effect
of code organization persists even for very diverged proteins.
Acknowledgments
The authors wish to thank Nicholas A. Stover and
Hannah M.W. Salim for useful comments and suggestions.
Literature Cited
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990.
Basic local alignment search tool. J Mol Biol. 215:403–410.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z,
Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST:
a new generation of protein database search programs. Nucleic
Acids Res. 25:3389–3402.
Aury JM, Jaillon O, Duret L, et al. (43 co-authors). 2006. Global
trends of whole-genome duplications revealed by the ciliate
Paramecium tetraurelia. Nature. 444:171–178.
Barrell BG, Bankier AT, Drouin J. 1979. A different genetic code
in human mitochondria. Nature. 282:189–194.
Beier H, Grimm M. 2001. Misreading of termination codons in
eukaryotes by natural nonsense suppressor tRNAs. Nucleic
Acids Res. 29:4767–4782.
Benner SA, Cohen MA, Gonnet GH. 1994. Amino acid
substitution during functionally divergent evolution of protein
sequences. Protein Eng. 7:1323–1332.
Bulka B, DesJardins M, Freeland SJ. 2006. An interactive
visualization tool to explore the biophysical properties of
amino acids and their contribution to substitution matrices.
BMC Bioinformatics. 7:329.
Chittum HS, Lane WS, Carlson BA, Roller PP, Lung FD, Lee BJ,
Hatfield DL. 1998. Rabbit beta-globin is extended beyond its
UGA stop codon by multiple suppressions and translational
reading gaps. Biochemistry. 37:10866–10870.
Crick FHC. 1968. On the origin of the genetic code. J Mol Biol.
38:367–379.
DiGiulio M. 2001. The origin of the genetic code cannot be
studied using measurements based on the PAM matrix
because this matrix reflects the code itself, making any such
analyses tautologous. J Theor Biol. 208:141–144.
Eisen JA, Coyne RS, Wu M, et al. (53 co-authors). 2006.
Macronuclear genome sequence of the ciliate Tetrahymena
thermophila, a model eukaryote. PLoS Biol. 4(9):e286.
Freeland SJ, Knight RD, Landweber LF, Hurst LD. 2000. Early
fixation of an optimal genetic code. Mol Biol Evol. 17(4):
511–518.
Gilis D, Massar S, Cerf NJ, Rooman M. 2001. Optimality of the
genetic code with respect to protein stability and amino-acid
frequencies. Genome Res. 2(11):R0049.
Henikoff S, Henikoff JG. 1992. Amino acid substitution matrices
from protein blocks. Proc Natl Acad Sci USA. 89:10915–10919.
Jones DT, Taylor WR, Thornton JM. 1992. The rapid generation
of mutation data matrices from protein sequences. Comput
Appl Biosci. 8:275–282.
Jukes TH, Osawa S, Muto A, Lehman N. 1987. Evolution of
anticodons: variations in the genetic code. Cold Spring Harb
Symp Quant Biol. 52:769–776.
King JL, Jukes TH. 1969. Non-Darwinian evolution. Science.
164:788–798.
Knight RD, Freeland SJ, Landweber LF. 2001. Rewiring the
keyboard: evolvability of the genetic code. Nat Rev Genet.
2:49–58.
186 Ring and Cavalcanti
Lovett PS, Ambulos NP, Mulbry W, Noguchi N, Rogers EJ.
1991. UGA can be decoded as tryptophan at low efficiency in
Bacillus subtilis. J Bacteriol. 173:1810–1812.
Massey SE, Moura G, Beltrao P, Almeida R, Garey JR, Tuite MF,
Santos MA. 2003. Comparative evolutionary genomics unveils
the molecular mechanism of reassignment of the CTG codon in
Candida spp. Genome Res. 25(1):179–186.
Matsugi J, Murao K, Ishikura H. 1998. Effect of B. subtilis tRNA
(Trp) on readthrough rate at an opal UGA codon. J Biochem.
123:853–858.
Osawa S. 1995. Evolution of the genetic code. Oxford: Oxford
University Press.
Osawa S, Jukes TH, Muto A, Yamao F, Ohama T, Andachi Y.
1987. Role of directional mutation pressure in the evolution of
the eubacterial genetic code. Cold Spring Harb Symp Quant
Biol. 52:777–789.
Pearson WR, Lipman DJ. 1988. Improved tools for biological
sequence comparison. Proc Natl Acad Sci USA. 85:2444–2448.
Santos MA, Cheesman C, Costa V, Moradas-Ferreira P,
Tuite MF. 1999. Selective advantages created by codon
ambiguity allowed for the evolution of an alternative genetic
code in Candida spp. Mol Microbiol. 31:937–947.
Schultz DW, Yarus M. 1994. Transfer RNA mutation and the
malleability of the genetic code. J Mol Biol. 235:1377–1380.
Schultz DW, Yarus M. 1996. On malleability in the genetic code.
J Mol Evol. 42:597–601.
Soll D, RajBhandary UL. 2006. The genetic code—thawing the
‘frozen accident’. J Biosci. 31(4):459–463.
Swire J, Judson OP, Burt A. 2005. Mitochondrial genetic codes
evolve to match amino acid requirements of proteins. J Mol
Evol. 60:128–139.
Yarus M, Schultz DW. 1997. Response: further comments on
codon reassignment. J Mol Evol. 45:1–8.
Yu YK, Altschul SF. 2005. The construction of amino acid
substitution matrices for the comparison of proteins with nonstandard compositions. Bioinformatics. 21:902–911.
Yu YK, Wootton JC, Altschul SF. 2003. The compositional
adjustment of amino acid substitution matrices. Proc Natl
Acad Sci USA. 100:15688–15693.
Zufall RA, McGrath C, Muse S, Katz LA. 2006. Genome
architecture drives protein evolution in ciliates. Mol Biol
Evol. 23:1681–1687.
Laura Katz, Associate Editor
Accepted October 27, 2007