Download Consequences of Stop Codon Reassignment on

Consequences of Stop Codon Reassignment on Protein Evolution in Ciliates with Alternative Genetic Codes Karen Lee Ring and Andre R. O. Cavalcanti Biology Department, Pomona College, Claremont, California Tetrahymena thermophila and Paramecium tetraurelia are ciliates that reassign TAA and TAG from stop codons to glutamine codons. Because of the lack of full genome sequences, few studies have concentrated on analyzing the effects of codon reassignment in protein evolution. We used the recently sequenced genome of these species to analyze the patterns of amino acid substitution in ciliates that reassign the code. We show that, as expected, the codon reassignment has a large impact on amino acid substitutions in closely related proteins; however, contrary to expectations, these effects also hold for very diverged proteins. Previous studies have used amino acid substitution data to calculate the minimization of the genetic code; our results show that because of the lasting influence of the code in the patterns of substitution, such studies are tautological. These different substitution patterns might affect alignment of ciliate proteins, as alignment programs use scoring matrices based on substitution patterns of organisms that use the standard code. We also show that glutamine is used more frequently in ciliates than in other species, as often as expected based on the presence of the 2 new reassigned codons, indicating that the frequencies of amino acids in proteomes is mostly determined by neutral processes based on their number of codons. Introduction Soon after the genetic code was deciphered in the 1960s, Francis Crick proposed his Frozen Accident hypothesis suggesting that the code was universal for all life on earth (Crick 1968). The hypothesis proposes that after the identity of the code was established, any change would be detrimental. Such changes would cause the incorporation of the wrong amino acid in all positions occupied by the reassigned codon. Thus, all extant life descends from a common ancestor that used the universal code, and this code was passed unaltered to all organisms; any change to the code would be highly deleterious and selected against. This hypothesis was challenged in 1979 with the first discovery of an alternative genetic code in human mitochondrial genes (Barrell et al. 1979). Alternative codes have arisen independently at least 10 times in nuclear genomes (Knight et al. 2001; Soll and RajBhandary 2006) and at least 24 times in mitochondrial genomes (Swire et al. 2005). Several studies have focused on how such alternative codes might have overcome the fitness barrier of codon reassignment, and today, 2 main theories are generally accepted for code evolution. The codon capture model (Jukes et al. 1987; Osawa et al. 1987; Osawa 1995) suggests that codons can be reassigned by a neutral mechanism if we assume a stage in which the codon entirely disappears from the genome— drift or GC pressure can account for this elimination; the codon could then reemerge by genetic drift and be decoded (captured) by a tRNA for another amino acid. This model works well for small genomes, like mitochondrial genomes. In these cases, it is easy to imagine a codon stochastically disappearing from the genome and being captured by another tRNA. In fact, Swire et al. (2005) showed that reassignments involving stop codons in mitochondria seem to have followed this mechanism, although they could not find evidence for it in the reassignments of sense codons. Key words: scoring matrix, substitution matrix, genetic code, Tetrahymena, Paramecium. E-mail: [email protected]. Mol. Biol. Evol. 25(1):179–186. 2008 doi:10.1093/molbev/msm237 Advance Access publication November 1, 2007 Ó The Author 2007. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected] The ambiguous intermediate hypothesis (Schultz and Yarus 1994, 1996; Yarus and Schultz 1997) proposes that there is a transition state during which the reassigned codon is ambiguously decoded by both its cognate tRNA and a mutant tRNA; the novel tRNA gradually displaces the normal cognate tRNA and takes over the decoding of the ambiguous codon. There are several instances of ambiguous translation in nature that support this model (Lovett et al. 1991; Chittum et al. 1998; Matsugi et al. 1998; Santos et al. 1999; Beier and Grimm 2001). Although several studies have focused on how alternative codes arise, few have concentrated on the evolutionary consequences of such reassignment. One exception is a study performed by Massey et al. (2003), in which the authors compared the genome of Saccharomyces cerevisiae (which uses the universal code) with that of Candida albicans (a closely related yeast that reassigns the CTG codon from Leu to Ser). They showed that coding positions occupied by CTG in C. albicans align with positions that contain a Ser codon in S. cerevisiae; likewise positions occupied by CTG in S. cerevisiae align with positions occupied by other Leu codons in C. albicans. In other words, the reassignment seems to have caused very little alteration in the protein sequences. Based on the analyses of tRNA sequences, Massey et al. (2003) proposed that the reassignment in C. albicans happened through an ambiguous intermediate phase. Candida albicans reassignment, however, is atypical as it changes a sense codon to another sense codon, whereas the majority of codon reassignments in nuclear genomes involve the reassignment of a stop codon to a sense codon (Knight et al. 2001). Studies of such reassignments were limited due to the lack of complete genome sequences available. Recently, the genomes of 2 oligohymenophoran ciliates that reassign the TAA and TAG stop codons to Gln, Tetrahymena thermophila (Eisen et al. 2006) and Paramecium tetraurelia (Aury et al. 2006), have been sequenced, allowing us to address the question of how code reassignment alters the evolutionary landscape of such organisms. One of the main characteristics of the universal code, also described shortly after the code was deciphered, is that similar amino acids have similar codons (Crick 1968). This is generally referred to as the minimization of the code because it minimizes the effects of point mutations; codons separated 180 Ring and Cavalcanti by 1 point mutation tend to code for similar amino acids. Freeland et al. (2000) showed that the genetic code is highly minimized. In fact, a random code has only one chance in a million of being more minimized than the standard code. In their study, Freeland et al. (2000) used the PAM74100 matrix (Benner et al. 1994) to determine the distance between a pair of amino acids. PAM stands for point accepted mutations, for example, a PAM1 matrix is determined from the alignment of closely related sequences by counting the number of times amino acid i substitutes amino acid j and normalizing by a constant so that each site in the alignment has a chance of 1% of being substituted. It has been known that the PAM1 matrix reflects to a large extent the organization of the genetic code; most amino acids that are frequently substituted in the PAM1 matrix have codons that are interchangeable by a single point mutation. In other words, the PAM1 matrix is dominated by neutral or nearly neutral mutations. Typically, to calculate substitution matrices with higher PAM numbers, that is, representing the substitution frequencies of amino acid for more distantly related proteins, one multiplies the PAM1 matrix by itself many times. The PAM250 matrix, for example, is obtained by multiplying the PAM1 matrix by itself 250 times. Higher PAM matrices obtained using this method will still reflect the organization of the code as they are derived directly from the PAM1. To avoid this bias, Freeland et al. (2000) used the PAM74-100 matrix (Benner et al. 1994). This PAM matrix is unique in that it was derived using only highly diverged protein sequences. They have argued that because this matrix is derived using divergent proteins, it should predominantly reflect amino acid biochemical similarity and not how frequently amino acids are converted by mutation (Freeland et al. 2000). As such, they propose that this matrix is not correlated with the genetic code and is ideal to determine the distance between amino acids when calculating the minimization of the code. This claim has been contested by DiGiulio (2001), which showed that even the PAM74-100 matrix still show correlations with the organization of the code. However, just because a substitution matrix shows some correlation with the organization of the code, it does not follow that it is directly influenced by the code. It could be that both the code organization and the matrix are being influenced by some biochemical property of the amino acids, in which case the use of the matrix to study the minimization of the code would be justified. In fact, subsequent studies assumed that the PAM74-100 is only slightly influenced by the code (Gilis et al. 2001). In a recent paper (Bulka et al. 2006), the authors used statistical correlations between amino acid properties and the PAM74-100 matrix to suggest that the physicochemical properties of the amino acids are more important to long-term protein evolution than codon assignment; that is, neutral evolution by mutation based on the genetic code becomes negligible as evolution proceeds. Due to the lack of sequenced genomes with reassigned codes, it was difficult to determine if the code structure influences the PAM74-100 matrix or if both reflect the underlying properties of the amino acids. In this study, we compare the newly completed genomes of 2 organisms that use alternative genetic codes (Aury et al. 2006; Eisen et al. 2006) to calculate organism-specific substitution matrices. Our results show that amino acid substitution frequencies are very different for organisms that reassign codons and that these differences are still pronounced for highly diverged proteins. The results suggest that amino acid substitution matrices should not be used to calculate the minimization of the code as these matrices are likely to be highly influenced by the structure of the code. Materials and Methods Data We downloaded the protein and coding DNA sequences of the following organisms: T. thermophila (ftp://ftp. tigr.org/pub/data/Eukaryotic_Projects/t_thermophila/); P. tetraurelia (ParameciumDB: http://paramecium.cgm. cnrs-gif.fr/); Drosophila melanogaster (FlyBase: http:// flybase.bio.indiana.edu/); Caenorhabditis elegans (WormBase: http://www.wormbase.org/); and Plasmodium falciparum (PlasmoDB: http://www.plasmodb.org/). The reason for the use of the P. falciparum genome is 2-fold: first, Plasmodium is an alveolate, a sister taxon of ciliates and, as such, the genome most closely related to ciliates available; second, it has been shown that substitution matrices can be influenced by the GC content or the amino acid content of the sequences being aligned (Yu et al. 2003; Yu and Altschul 2005). Both Tetrahymena and Paramecium have AT-rich genomes, so any differences between their substitution patterns and those of the fly and worm could be dismissed as reflecting their much lower GC content. Plasmodium is more AT rich than either of the ciliates. If we show that the pattern of amino acid substitution in Plasmodium is less skewed than that of the ciliates, we can infer that the differences observed in the ciliate matrix are due to the genetic code and not GC content. Determining Pairs of Paralogous Genes In their paper, Massey et al. (2003) first determined orthologous genes between C. albicans and S. cerevisiae and aligned pairs of orthologous genes between the 2 species to determine the frequency with which each codon in one species aligns with codons from the other species. This was possible because C. albicans and S. cerevisiae are closely related yeast species and orthologs can be easily defined and aligned. Unfortunately, although Tetrahymena and Paramecium are both oligohymenopheran ciliates, they are not as closely related as the 2 yeast species in Massey et al. (2003) study; they also lack a closely related outgroup. This problem led us to use paralogous genes instead of orthologs. Paralogs are genes separated by a duplication event, whereas orthologs are genes separated by speciation. To determine pairs of paralogous genes, we used reciprocal best Blast hits. Basically, for each species, we performed a Blast search (Altschul et al. 1990, 1997) of each protein against all other proteins in the organism. If the best hit of protein A was protein B and the best hit of protein B was protein A, proteins A and B were considered paralogs. Note that each of these proteins can have many more Protein Evolution in Ciliates with Alternative Codes 181 paralogs in the genome; by using reciprocal best Blast hits, we tried to use pairs of proteins that are derived from a single duplication event. This simplification avoids the complications of large gene families biasing our results. Alignment of Paralogs After determining the paralogous pairs, we used the FASTA program (Pearson and Lipman 1988) to align the protein sequences and determine the substitution patterns of amino acids for each organism. We chose FASTA to perform these analyses as, for proteins, it uses a Smith– Waterman algorithm. FASTA also allows the alignment to be performed using different substitution matrices. We performed the procedure twice, once for closely related proteins and once for divergent proteins. In the first case, we used FASTA with the MDM10 matrix, a PAM10 matrix obtained by Jones et al. (1992)—this is the matrix with the lowest PAM value available in FASTA. We removed identical proteins and only used pairs of paralogs that generated alignments longer than 100 amino acids with at least 85% identity in the calculation of the substitution matrix. Unfortunately, the PAM74-100 is not available in FASTA, so to calculate the matrix for divergent proteins, we used FASTA with the BLOSUM45 matrix (Henikoff and Henikoff 1992). This matrix is derived from highly divergent sequences using a procedure slightly different from the one used to calculate PAM matrices. We only used alignments longer than 100 amino acids and with identities between 25% and 60%. Calculation of Substitution Matrices For each organism, we calculated a substitution matrix for closely related proteins and one for divergent proteins. The procedure to calculate the matrices was similar and is described below. First we created a matrix A, where each element Aij gives the frequency with which we observe amino acid i aligned with amino acid j. Each Aij was determined by analyzing the alignments one position at a time. Because we are using pairwise alignments, we only observe the aligned amino acids and cannot determine what the ancestral amino acid is. Thus, each time that amino acids i and j are in the same column of an alignment, we added one to Aij and to Aji. For each organism, we calculated the total number of substitutions observed in the alignments and used this number to calculate a normalization constant lambda. This constant was used to normalize the proportion of substitutions among organisms and is calculated as 0.01 times the number of substitutions divided by the total number of amino acids in the alignment (eq. 1). All columns containing gaps were excluded. ntotal ð1Þ k50:01 P : Aij i6¼j Finally, we calculated a substitution matrix M, where each element Mij was given by the sum of Aij and Aji mul- tiplied by lambda and divided by the frequency of amino acid i plus that of amino acid j, to account for the different frequencies of the amino acids in each organism and generate a symmetric matrix (eq. 2). k Aij þ Aji : ð2Þ Mij 5 ni þ nj The matrices M represent the amino acid substitution frequencies and were used to compare the frequencies with which each amino acid substitutes another in each organism. We will refer to the matrices obtained from the alignment of closely related proteins as the ‘‘C’’ matrices and those obtained for divergent proteins as the ‘‘D’’ matrices. Results Effects of Reassignment in Amino Acid Frequencies King and Jukes (1969) showed that the frequency of amino acids in proteins correlates with the number of synonymous codons that code for each amino acid. They proposed that this is caused because most amino acids in proteins are the result of neutral mutations that do not affect protein structure or function; as such, the number of codons determines the frequency of the amino acids. Gilis et al. (2001) suggested an alternative explanation for the correspondence between amino acid frequency and number of codons. They hypothesize that the genetic code evolved to take into account preexisting amino acid frequencies. For example, Arg, Leu, and Ser are each coded by 6 codons and are common amino acids in proteins. According to the first hypothesis, they are common in proteins because most amino acids arise by random mutations in the genome and, as such, the number of codons for a given amino acid should determine its frequency (King and Jukes 1969). If the second hypothesis were correct, these amino acids have 6 codons because they were common when the code was being established, and their frequency acted as an evolutionary pressure determining their number of synonymous codons (Gilis et al. 2001). Although we cannot determine the frequency of the amino acids when the code was established, we can determine the effects of changing the number of codons that code for an amino acid on the usage of that amino acid. Glutamine is coded by 4 codons in Tetrahymena and Paramecium and only by 2 codons in organisms that use the universal code. If the number of codons impacts the frequency with which an organism uses a given amino acid through neutral processes, Gln should be more common in Tetrahymena and Paramecium than in organisms that use the universal code. Indeed, Glutamine residues are more common in these ciliates’ genomes than in the genomes of other eukaryotes (table 1). To determine what would be the expected frequency of Gln in the different genomes under a neutral model, we need to take into consideration the GC content of the genomes—under a neutral model, we would expect the frequency of a codon in the genes to be proportional to the frequency of its constituent nucleotides. If we consider 182 Ring and Cavalcanti Table 1 Correspondence between Observed and Expected Frequencies for an Amino Acid with Reassigned Codons Organism Drosophila Caenorhabditis elegans Plasmodium Paramecium Tetrahymena GC Content (%) Observed Gln Frequency Expected Frequency Based on the No. of Codons Expected Frequency Correcting for Different GC Contents 54 43 24 30 28 0.053 0.042 0.028 0.091 0.098 0.033 0.033 0.033 0.063 0.063 0.038 0.036 0.029 0.098 0.101 the GC content of each species’ coding regions, it can be seen that the usage of Gln in the proteins follows closely the expected frequency based on its number of codons (table 1). This result cannot rule out that the number of codons for each amino acid in the code reflects the availability of the amino acids when the code was established. However, it suggests that, in present species, the number of codons coding for an amino acid determines its frequency in the proteins, supporting King and Jukes’ (1969) hypothesis. Substitution Patterns of Gln in Tetrahymena and Paramecium In their analyses, Massey et al. (2003) used orthologous genes between C. albicans and S. cerevisisae to determine which codons align with CTG in both species. Unfortunately, in the case of Tetrahymena and Paramecium, we do not have the genome sequence of a closely related species that uses the standard code. The most closely related species with a full genome is the fairly distantly related apicomplexan P. falciparum (both apicomplexans and ciliates are part of the alveolates taxon). Because interspecies comparisons were ruled out, we decided to perform intraspecies comparisons and contrast the pattern of substitutions in each species to that of the other species. We used the genomes of Tetrahymena, Paramecium, Plasmodium, Drosophila, and C. elegans. For each of these genomes, we determined pairs of paralogous genes—genes that are related through a duplication event— and aligned these pairs to determine the patterns of amino acid substitutions in each species. We performed the calculations twice. For each organism, first we generated a substitution matrix C for closely related proteins (more than 85% identity) and then a substitution matrix D for divergent proteins (percent identity between 25% and 60%). Table 2 shows some statistics of the alignments used to create the matrices. To verify how closely each of the calculated C matrices matches the MDM_10 matrix (Jones et al. 1992) used to align the closely related proteins, we calculated correlations coefficients between the C matrices and MDM_10. Likewise, we calculated the correlation coefficients between the D matrices and the BLOSUM45 matrix (Henikoff and Henikoff 1992). Even though we could not use the PAM74-100 matrix (Benner et al. 1994) to align the proteins, as this matrix is not available in the FASTA program, we also calculated the correlations between the D matrices and the PAM74-100 matrix. Table 3 shows the results of these calculations. As can be seen, both the C and the D matrices show good correlations with the MDM_10 and BLOSUM45 (PAM74-100) matrices, respectively. In all comparisons, the Tetrahymena and the Paramecium matrices are less correlated to the matrices used to generate the alignments than the matrices of the other organisms. It could be argued that the ciliates matrices are less correlated with the original matrices because of the extremely low GC content of the ciliates genome. However, the ciliates’ matrices are less correlated than those of Plasmodium, an organism with an even lower GC content, discarding this explanation. Because both Tetrahymena and Paramecium reassign the stop codons TAG and TAA to Gln, we also calculated the correlation coefficients using only substitutions that involve Gln instead of the whole matrix (table 3). The difference in the correlations of the Tetrahymena and Paramecium matrices and those of the other species are even higher when we consider only the subset of substitutions involving the reassigned amino acid Gln, as would be expected if the differences are mostly due to the use of an alternative genetic code. To further understand how the genetic code change affects the amino acid substitution matrix, we concentrated only on the substitutions involving Gln (fig. 1). For closely related proteins (fig. 1a), 4 amino acids consistently tend to substitute Gln more frequently in Tetrahymena and Table 2 Statistics of the Alignments Used to Generate the Substitution Matrices for Each Organism Closely Related Proteins Organism Drosophila Caenorhabditis elegans Plasmodium Paramecium Tetrahymena Divergent Proteins No. of Substitutions No. of Aligned Residues % Mutated No. of Substitutions No. of Aligned Residues % Mutated 5,558 19,324 1,322 382,700 41,922 1,033,570 1,377,198 42,958 5,330,070 516,282 0.54 1.40 3.08 7.18 8.12 403,972 563,786 151,234 807,994 1,631,594 722,106 1,032,324 250,326 1,576,270 2,833,642 55.94 54.61 60.41 51.26 57.58 Protein Evolution in Ciliates with Alternative Codes 183 Table 3 Correlation Coefficients between the Substitution Matrices Derived for Each Organism and the MDM_10 (PAM10), BLOSUM45, and PAM74-100 Scoring Matricesa MDM_10 All matrix Only Gln BLOSUM45 All matrix Only Gln PAM74-100 All matrix Only Gln Drosophila Caenorhabditis elegans Plasmodium Paramecium Tetrahymena 0.731 0.868 0.778 0.89 0.771 0.732 0.683 0.601 0.718 0.549 0.816 0.869 0.81 0.875 0.764 0.825 0.737 0.681 0.753 0.696 0.787 0.929 0.778 0.927 0.723 0.847 0.696 0.697 0.708 0.733 NOTE.—The Tetrahymena and Paramecium matrices have lower correlations than the matrices for other organisms. a All correlations are significant in the 99% confidence level. Paramecium than in any of the other species: Leu, Ser, Trp, and Tyr. The codons of 3 of these amino acids, Ser, Trp, and Tyr, are not interchangeable to Gln codons by a single point mutation in the standard genetic code but are interchangeable in the ciliates’ code. The other amino acid, Leu, has 2 codons that can be converted to Gln following a point mutation in the universal genetic code (CU(AG) /CA(AG)), whereas in the ciliate code, 4 of the Leu codons are interchangeable to Gln codons by 1 point substitution (CU(AG)/CA(AG) and UU(AG)/UA(AG)). For more distantly related proteins (fig. 1b), we can see that 3 of these 4 amino acids—Tyr, Ser, and Leu—still tend to substitute Gln more frequently in the ciliates, but we also observe an overabundance of Gln substitutions to Lys, Asn, Ile, Phe, and Thr. The excess in Lys is also observed for closely related proteins in Tetrahymena (but not Paramecium; fig. 1a) and can be explained by the presence of 4 Gln codons that can be converted to Lys in the ciliate code as opposed to the 2 codons in the standard code. The excesses of Asn, Thr, Phe, and Ile can be explained assuming a 2-step substitution process. For example, it is possible that the excess Thr is caused because Gln tends to be substituted by Ser more frequently in ciliates and Ser can be easily converted into Thr. The same argument can be made for Ile and Phe: Gln is substituted by Leu, which in turn is easily substituted by Ile or Phe. An analogous argument can also be made for Asn: Gln to Lys to Asn. The high frequency of Phe can also be explained by Gln to Tyr to Phe mutations FIG. 1.—The substitution frequencies of Gln to each of the other amino acids are different in organisms that reassign the code. (a) Closely related proteins. The x axis shows the amino acids ordered by the substitution frequency in the MDM10 matrix (Jones et al. 1992). (b) Divergent proteins. The x axis shows the amino acids ordered by the substitution frequency in the BLOSUM45 matrix (Henikoff and Henikoff 1992). 184 Ring and Cavalcanti FIG. 2.—Substitution frequencies of Asn (N), Ile (I), Thr (T), and Phe (F) to each of the other amino acids in divergent proteins in Paramecium. The frequency with which each amino acid is conserved is not plotted as it is outside the bounds of the plot (each value is ca. 0.99, as the matrix is normalized to 1% changes). Note that the substitutions Ile to Leu, Phe to Leu, Phe to Tyr, Thr to Ser, and Asn to Lys are all very frequent (see text). as Tyr and Phe are substituted very frequently. Figure 2 shows the frequency of substitutions of Asn, Ile, Phe, and Thr to the other amino acids for divergent proteins in Paramecium (Tetrahymena gives similar results; data not shown). As can be seen, all scenarios described above agree with the data; the frequency of substitution of Lys to Asn, Leu to Ile, Leu to Phe, Tyr to Phe, and Ser to Thr are all high. Interestingly, the alignment of Gln to Trp in divergent proteins is not more common in ciliates than in organisms that use the standard code. This indicates that as the proteins become more diverged, the physicochemical properties of the amino acids have a larger effect in the substitution patterns—Trp is a very large amino acid and only rarely aligns with Gln even in closely related proteins. Besides the Gln to Trp substitution, not all 2-step substitutions made possible by the alternative code are overrepresented in ciliates. This makes sense, we do not propose that the genetic code is the only factor affecting substitution matrices, the physicochemical characteristics of the amino acids are also an important factor in their substitution frequencies, and the Gln to Trp case is an example of that. However, our results show that the structure of the code influences even substitution matrices derived from highly diverged proteins. Conclusions King and Jukes (1969) proposed that the frequency of use of an amino acid is correlated with its number of codons for neutral reasons—most amino acids in a genome arise by random mutations that do not affect protein function. Gilis et al. (2001) proposed that another reason for the correspondence between amino acid frequency and number of synonymous codons could be that the genetic code adapted to preexisting amino acid frequencies. When we take into account the GC content of the genomes, the frequencies of Gln usage in Tetrahymena and Paramecium are extremely close to those predicted by a neutral model of mutation; both ciliate genomes have a much higher frequency of Gln than the genomes of organisms that use the standard code. These frequencies closely follow the expected frequencies based on the number of Gln codons in the ciliates’ genetic code and their GC content. This indicates that the hypothesis of King and Jukes (1969) is sufficient to explain the correlation between amino acid usage and number of codons, without the need to invoke an adaptation of the code to preexisting amino acid frequencies as proposed by Gilis et al. (2001). However, one could still argue that the ciliates reassigned TAA and TAG to Gln because, for some reason, their genomes experienced selection for the use of more Gln in their proteins. Although we cannot rule out this explanation, it seems unlikely when one considers that other ciliates, like Euplotes octocarinatus, use a different alternative code, which reassigns the TGA stop codon to Cys. If this hypothesis were correct, it would indicate that some ciliate species experienced selection to increase the number of Gln in their proteins, whereas other ciliate species experienced selection to increase Cys. Although such different pressures are not mutually exclusive, this makes an explanation based on neutral mutations more parsimonious. When looking at the amino acid substitution patterns in Tetrahymena and Paramecium, our results are surprising because they indicate that the importance of the genetic code organization in amino acid substitution frequencies does not become irrelevant with evolutionary time as previously thought (Freeland et al. 2000). For closely related proteins, we have shown that several amino acids whose codons can be converted to TAA or TAG by a single point mutation—Leu, Ser, Trp, and Tyr—substitute Gln more frequently in ciliates than in species that use the standard code. For distantly related proteins, even some amino acids whose codons are separated from TAA or TAG by 2 point mutations might substitute Gln more frequently in ciliates than in organisms that use the standard code. The latter result shows that the genetic code organization also influence matrices derived from highly divergent proteins. It is important to stress, however, that the structure of the code is not the only factor influencing substitution frequencies. Some of the 2-step substitutions made possible by the alternative code are not overrepresented in the divergent proteins of ciliates, indicating that physicochemical characteristics of the amino acids are also an important factor affecting substitution frequencies. The final observed substitution frequencies are a result of these 2 factors—organization of the code and physicochemical properties of the amino acids—acting together. Protein Evolution in Ciliates with Alternative Codes 185 Our results indicate that matrices obtained from amino acid substitutions in protein alignments should not be used to study genetic code minimization as they are highly influenced by the structure of the code, which makes the analyses tautological. It is important to note that we are not questioning the results of Freeland et al. (2000) suggesting that the code is highly minimized. The code is highly minimized regardless of the metric chosen for the distance between amino acids. On the other hand, the use of the PAM74-100 in their study probably influenced the level of minimization calculated for the code. The genetic code is more minimized when using the PAM74-100 than when using polarity as a distance metric (Freeland et al. 2000), probably because code structure influences the PAM74100 matrix. Furthermore, in the same paper Freeland et al. (2000) conclude that the standard code is more minimized than any of the known alternative codes. Because of the intrinsic bias present in the PAM74-100 matrix, this analysis might not be valid. For example, the reassignment of TAA and TAG to Gln adds several point mutations to the ciliate genetic code not available in the standard code, like Gln to Ser and Gln to Tyr. In the PAM74-100 matrix, these are unlikely substitutions, which would penalize the ciliate code in the calculation. If the frequency of amino acid substitutions used was calculated from the ciliate genomes, these mutations would be frequent, which could make the ciliate code more minimized than the standard code. A recent study showed that ciliate proteins have an elevated rate of evolution compared with those of nonciliate species (Zufall et al. 2006). The authors correlate the higher evolutionary rates to genome organization in ciliates; they propose that the ciliate genome architecture—nuclear dimorphism, high levels of chromosome processing, amitosis, and epigenetic inheritance—allows these organisms to explore protein space in a novel manner (Zufall et al. 2006). It is unlikely that the different patterns of amino acid substitution that we describe could be caused by the same mechanisms proposed by Zufall et al. (2006); their mechanisms suggest a generally higher evolutionary rate not related to the genetic code, whereas we observe that some substitution rates, those favored by the alternative code, are more elevated than others (note that we are not comparing absolute substitution rates, our calculations, by definition, normalize the number of substitutions observed). Instead, we believe that our results present yet another different way in which ciliates are able to explore protein space, the reassignment of TAG and TAA to Gln allows some rare substitutions to occur more often in ciliate lineages. This could also contribute to the increased evolutionary rates of ciliate proteins, in addition to the mechanisms proposed by Zufall et al. (2006). It is important to keep in mind the different pattern of amino acid substitutions in organisms that reassign the code when aligning proteins from these organisms. All alignment programs come with precompiled substitution matrices, and none of those are derived from alternative codes. The alignment of closely related proteins, with few amino acid differences, is usually very robust to the matrix chosen and should not suffer from this problem. However, the use of different matrices has a large effect on the alignment of divergent proteins. This would not be much of a problem if, as proposed by Freeland et al. (2000) and generally accepted, matrices derived from divergent alignments were not influenced by the code, but in this study, we show that the effect of code organization persists even for very diverged proteins. Acknowledgments The authors wish to thank Nicholas A. Stover and Hannah M.W. Salim for useful comments and suggestions. Literature Cited Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol. 215:403–410. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402. Aury JM, Jaillon O, Duret L, et al. (43 co-authors). 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature. 444:171–178. Barrell BG, Bankier AT, Drouin J. 1979. A different genetic code in human mitochondria. Nature. 282:189–194. Beier H, Grimm M. 2001. Misreading of termination codons in eukaryotes by natural nonsense suppressor tRNAs. Nucleic Acids Res. 29:4767–4782. Benner SA, Cohen MA, Gonnet GH. 1994. Amino acid substitution during functionally divergent evolution of protein sequences. Protein Eng. 7:1323–1332. Bulka B, DesJardins M, Freeland SJ. 2006. An interactive visualization tool to explore the biophysical properties of amino acids and their contribution to substitution matrices. BMC Bioinformatics. 7:329. Chittum HS, Lane WS, Carlson BA, Roller PP, Lung FD, Lee BJ, Hatfield DL. 1998. Rabbit beta-globin is extended beyond its UGA stop codon by multiple suppressions and translational reading gaps. Biochemistry. 37:10866–10870. Crick FHC. 1968. On the origin of the genetic code. J Mol Biol. 38:367–379. DiGiulio M. 2001. The origin of the genetic code cannot be studied using measurements based on the PAM matrix because this matrix reflects the code itself, making any such analyses tautologous. J Theor Biol. 208:141–144. Eisen JA, Coyne RS, Wu M, et al. (53 co-authors). 2006. Macronuclear genome sequence of the ciliate Tetrahymena thermophila, a model eukaryote. PLoS Biol. 4(9):e286. Freeland SJ, Knight RD, Landweber LF, Hurst LD. 2000. Early fixation of an optimal genetic code. Mol Biol Evol. 17(4): 511–518. Gilis D, Massar S, Cerf NJ, Rooman M. 2001. Optimality of the genetic code with respect to protein stability and amino-acid frequencies. Genome Res. 2(11):R0049. Henikoff S, Henikoff JG. 1992. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA. 89:10915–10919. Jones DT, Taylor WR, Thornton JM. 1992. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 8:275–282. Jukes TH, Osawa S, Muto A, Lehman N. 1987. Evolution of anticodons: variations in the genetic code. Cold Spring Harb Symp Quant Biol. 52:769–776. King JL, Jukes TH. 1969. Non-Darwinian evolution. Science. 164:788–798. Knight RD, Freeland SJ, Landweber LF. 2001. Rewiring the keyboard: evolvability of the genetic code. Nat Rev Genet. 2:49–58. 186 Ring and Cavalcanti Lovett PS, Ambulos NP, Mulbry W, Noguchi N, Rogers EJ. 1991. UGA can be decoded as tryptophan at low efficiency in Bacillus subtilis. J Bacteriol. 173:1810–1812. Massey SE, Moura G, Beltrao P, Almeida R, Garey JR, Tuite MF, Santos MA. 2003. Comparative evolutionary genomics unveils the molecular mechanism of reassignment of the CTG codon in Candida spp. Genome Res. 25(1):179–186. Matsugi J, Murao K, Ishikura H. 1998. Effect of B. subtilis tRNA (Trp) on readthrough rate at an opal UGA codon. J Biochem. 123:853–858. Osawa S. 1995. Evolution of the genetic code. Oxford: Oxford University Press. Osawa S, Jukes TH, Muto A, Yamao F, Ohama T, Andachi Y. 1987. Role of directional mutation pressure in the evolution of the eubacterial genetic code. Cold Spring Harb Symp Quant Biol. 52:777–789. Pearson WR, Lipman DJ. 1988. Improved tools for biological sequence comparison. Proc Natl Acad Sci USA. 85:2444–2448. Santos MA, Cheesman C, Costa V, Moradas-Ferreira P, Tuite MF. 1999. Selective advantages created by codon ambiguity allowed for the evolution of an alternative genetic code in Candida spp. Mol Microbiol. 31:937–947. Schultz DW, Yarus M. 1994. Transfer RNA mutation and the malleability of the genetic code. J Mol Biol. 235:1377–1380. Schultz DW, Yarus M. 1996. On malleability in the genetic code. J Mol Evol. 42:597–601. Soll D, RajBhandary UL. 2006. The genetic code—thawing the ‘frozen accident’. J Biosci. 31(4):459–463. Swire J, Judson OP, Burt A. 2005. Mitochondrial genetic codes evolve to match amino acid requirements of proteins. J Mol Evol. 60:128–139. Yarus M, Schultz DW. 1997. Response: further comments on codon reassignment. J Mol Evol. 45:1–8. Yu YK, Altschul SF. 2005. The construction of amino acid substitution matrices for the comparison of proteins with nonstandard compositions. Bioinformatics. 21:902–911. Yu YK, Wootton JC, Altschul SF. 2003. The compositional adjustment of amino acid substitution matrices. Proc Natl Acad Sci USA. 100:15688–15693. Zufall RA, McGrath C, Muse S, Katz LA. 2006. Genome architecture drives protein evolution in ciliates. Mol Biol Evol. 23:1681–1687. Laura Katz, Associate Editor Accepted October 27, 2007

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Consequences of Stop Codon Reassignment on