Microbiology (2003), 149, 855–863
DOI 10.1099/mic.0.26063-0

Translational selection is operative for synonymous codon usage in Clostridium perfringens and Clostridium acetobutylicum

Héctor Musto,1 Héctor Romero1,2 and Alejandro Zavala1

1 Laboratorio de Organización y Evolución del Genoma, Facultad de Ciencias, Iguá 4225, Montevideo 11400, Uruguay
2 Escuela Universitaria de Tecnología Médica, Facultad de Medicina, Avda. Italia (s/n) Hospital de Clínicas, Montevideo 11600, Uruguay

Correspondence: Héctor Musto, [email protected]

Received 17 November 2002; revised 3 December 2002; accepted 17 January 2003

Here, the codon usage patterns of two Clostridium species (Clostridium perfringens and Clostridium acetobutylicum) are reported. These prokaryotes are characterized by a strong mutational bias towards A+T, a striking excess of coding sequences and purine-rich leading strands of replication, strong GC-skews and a high frequency of genomic rearrangements. As expected, it was found that the mutational bias dominates codon usage but there is some variation of synonymous codon choices among genes in the two species. This variation was investigated using a multivariate statistical approach. In the two species, two major trends were detected. One was related to the location of the sequences in the leading or lagging strand of replication, and the other was associated with the preferential use of putatively translationally optimal codons in heavily expressed genes. Analyses of the estimated number of synonymous and non-synonymous substitutions among orthologous genes permit us to postulate that optimal codons might be selected not only for speed but also for accuracy during translation.

INTRODUCTION

Synonymous codons encode the same amino acid. Hence, in principle, it could be assumed that if a large sample of genes is studied, all triplets encoding the same amino acid should be equally frequent.
Abbreviations: CAI, codon adaptation index; COA, correspondence analysis; RSCU, relative synonymous codon usage.
0002-6063 © 2003 SGM. Printed in Great Britain.

However, it is very clear that this assumption is far from true, both among organisms and among genes from a single species. Among prokaryotes, it is generally agreed that the codon usage of any gene (and, consequently, of any genome) is the result of the balance between mutational biases and natural selection acting at the level of translation, the latter effect being ‘visible’ only if it is strong enough to overcome the effect of random genetic drift (Sharp & Li, 1986; Bulmer, 1991; Akashi & Eyre-Walker, 1998). The available evidence suggests that the strength and direction of these forces can vary both among different species and among sequences from the same genome. For example, the genomic G+C contents of prokaryotes vary from 25 to 75 mol% (Sueoka, 1962), and given the correlation that holds between GC3s (G+C content at ‘silent’ third codon positions) and genomic G+C content (Bernardi & Bernardi, 1986; Muto & Osawa, 1987), the mutational bias characteristic of each genome greatly influences codon choices. However, the availability of very long contigs, and especially complete genomes, has shown that the mutational bias is not simply shifting the whole genome towards G+C or A+T. For instance, it has been shown that there are regional variations in the G+C content around the genome of Mycoplasma genitalium (McInerney, 1997; Kerr et al., 1997) which exert a great influence on GC3s and, consequently, on codon usage. Perhaps more unexpected was the finding of Lobry (1996), who showed that in several bacteria the leading and lagging strands of replication can be easily recognized by the so-called ‘GC-skew’, the quantity (G−C)/(G+C).
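As a rough illustration of the GC-skew quantity defined above, the following sketch computes (G−C)/(G+C) for a sequence and along sliding windows. The function names are hypothetical and chosen for this example only, and returning 0.0 for a window without G or C is an assumption of the sketch; none of this is taken from the software used in the paper.

```python
def gc_skew(seq: str) -> float:
    """Return (G - C) / (G + C) for one sequence or window."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    # Assumption of this sketch: a window with no G or C gets a skew of 0.0.
    return (g - c) / (g + c) if (g + c) else 0.0


def sliding_gc_skew(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """Return (window start, GC-skew) pairs for full-length windows only."""
    return [
        (start, gc_skew(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence: a G-rich half followed by a C-rich half.
    for start, skew in sliding_gc_skew("GGGGCCCC", window=4, step=4):
        print(start, skew)
```

On a real chromosome the window and step would be much larger (Fig. 1 of this paper uses 20 kb windows with 2.5 kb steps), and a change of sign along the profile is what points to the origin or terminus of replication.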
Indeed, the leading strand usually displays positive values while the reverse is true for the lagging strand (the switch of sign occurs exactly at or very near to the origin and terminus of replication). As a consequence, the leading strand is G- (and T-) rich, while the lagging strand displays a bias towards C (and A). This effect can be so strong that in species like Borrelia burgdorferi, Treponema pallidum and Chlamydia trachomatis the position of the sequences in relation to the replication fork can be recognized as the most important force driving codon usage (McInerney, 1998; Lafay et al., 1999; Romero et al., 2000a). Finally, a ‘common theme’ in completely sequenced genomes is the finding of regions displaying base compositions far away from those of the genome as a whole. These regions have been interpreted as being the result of events of horizontal transfer of DNA between species differing in their genomic G+C contents (Garcia-Vallve et al., 2000; Karlin, 2001), and the sequences located in these regions display different codon usage than the rest of the genes.

Therefore, it can be concluded that the overall base content of a genome and the mutational bias of each replicative strand are the main forces driving codon usage. However, superimposed onto these general effects, in several species it has been found that natural selection leads to the fixation of some triplets among highly expressed genes. This was observed in Escherichia coli (Post & Nomura, 1979; Gouy & Gautier, 1982), where it was noted that the codon usage of highly expressed sequences was biased in relation to the pattern of lowly expressed genes. Indeed, in the former group there is an increase of certain triplets (‘major codons’) while in the latter group the usage of codons is more random. From another perspective, Ikemura (1981) showed that there is a match between these codons and the most abundant tRNAs. Therefore, for E. coli it was proposed that the triplets that are recognized more efficiently by the most abundant isoacceptor are preferred, and the degree of bias in each gene should be proportional to the level of expression.

Although the codon usage pattern of several prokaryotes fell within this interpretation (i.e. codon usage is the result of mutational biases and translational selection), the more species that are studied the more peculiarities appear. For example, it was shown that in Helicobacter pylori, although the composition of the genome is not skewed and there is a low (but detectable) level of heterogeneity among genes, codon usage does not appear to be influenced simply by mutational biases or translational selection (Lafay et al., 2000). Furthermore, in Mycobacterium tuberculosis, although the ‘classical’ factors are apparent, it was reported that the hydropathy level of each protein is correlated with the base content at silent sites (de Miranda et al., 2000). A more complex pattern was found in Chlamydia trachomatis, since codon usage appears to be shaped by the global genomic composition, the strand-specific mutational bias (as noted above), natural selection acting at the level of translation, the hydropathy level of each protein and each amino acid’s conservation (Romero et al., 2000a).

Therefore, as more prokaryotic genomes are analysed it is becoming clear that more factors shape codon usage than previously thought. Hence, more studies are needed (I) to understand the generality of the factors and phenomena described above, and (II) to detect new forces shaping codon usage. With these goals in mind, we decided to study the codon usage patterns in two species of Clostridium that have been sequenced recently, namely Clostridium perfringens (Shimizu et al., 2002) and Clostridium acetobutylicum (Nolling et al., 2001).
These Gram-positive, anaerobic, spore-forming bacteria have several features that make them useful for these studies: (i) they belong to the same genus, which is important for comparative purposes; (ii) their genomes are compositionally biased (G+C contents of 31 and 29 mol%, respectively), which could hide the effect of natural selection; (iii) their generation time is short (Shimizu et al., 2002), which, contrary to (i) and (ii), would make selection for translational efficiency more likely to be detected; and (iv) on the leading strand of replication the two species display a very strong purine bias and an excess of coding sequences (Shimizu et al., 2002), which might add additional levels of complexity to their patterns of codon usage.

METHODS

Sequences. The complete genomes and coding sequences of C. acetobutylicum (Nolling et al., 2001) and C. perfringens (Shimizu et al., 2002) were obtained from two NCBI ftp sites (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Clostridium_acetobutylicum/ and ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Clostridium_perfringens/).

Methods of analysis. Codon usage, correspondence analysis (COA) (Greenacre, 1984), GC3s (the frequency of codons ending in C or G, excluding Met, Trp and stop codons), the relative synonymous codon usage (RSCU) (Sharp et al., 1986) and the codon adaptation index (CAI) (Sharp & Li, 1987) were calculated using the program CODONW 1.3 (written by John Peden and available from ftp://molbiol.ox.ac.uk/Win95.codonW.zip). In the two species under study, the CAI was calculated taking the codon usage of the ribosomal proteins as a reference. COA of RSCU values was carried out to determine the major sources of variation among synonymous codons. The putative orthologous sequences were identified by running a BLAST query of the whole set of proteins of one genome against the set of the other one using the stand-alone BLAST package (Altschul et al., 1997).
The sequence with the best match, according to the score value, was identified. Then, the coding sequences of these pairs were translated and aligned using CLUSTAL W (Thompson et al., 1994); subsequently, the alignments were back-translated to the known DNA sequences. dS (synonymous distance) and dN (non-synonymous distance) values were calculated by the Nei–Gojobori method (Nei & Gojobori, 1986) with the JADIS package (Goncalves et al., 1999), only on those pairs of sequences displaying a minimal value of 50 % identity and with a length difference of 20 % at the amino acid level. The analyses were performed only with the pairs of sequences displaying dS values ≤2.0. The final dataset comprised 676 pairs of genes. Whole-genome alignment and comparison were carried out with the MUMmer system (release 2.1) (Delcher et al., 2002) using the default settings.

RESULTS AND DISCUSSION

Compositional properties of C. perfringens and C. acetobutylicum

As can be seen in Fig. 1, the genomes of C. perfringens and C. acetobutylicum are strongly biased towards low G+C contents. Indeed, with the exceptions of small regions encoding rDNA and operons for ribosomal proteins, the two genomes show little variation around each mean value (31 and 29 %, respectively). As a consequence of this strong mutational bias, the coding sequences (2660 and 3672 ORFs, respectively) are characterized by extremely low GC3s and symmetrical distributions (mean values 14 and 19 %, standard deviations 3 and 4 %, for C. perfringens and C. acetobutylicum, respectively). This predominance of A and T at the synonymous sites is better displayed in Table 1, in which the global codon usage (RSCU values) is shown for both species. Indeed, for each amino acid the predominant triplet (or triplets for three-, four- and sixfold degenerate codons) is A- and/or T-ended. Therefore, it can be concluded that the main force driving codon usage in C. perfringens and C. acetobutylicum is the strong mutational bias towards A and T.

However, some results suggest that this bias alone cannot explain the whole trend. Indeed, in the two species studied here, the range of GC3s is rather high (46 and 35 %, respectively), and in Table 1 it can be seen that T- and A-ending codons are not equally frequent among fourfold degenerate codons. Therefore, it seems reasonable to assume that some other minor factors are shaping codon choices. To investigate this possibility, we conducted a COA of RSCU values on all genes of C. perfringens and C. acetobutylicum. This statistical approach has been widely used to investigate major trends in codon usage in several species of bacteria (Grantham et al., 1981; McInerney, 1998; Lafay et al., 1999; Zavala et al., 2002). The position of genes on the main axes generated by the analysis can subsequently be compared with biological properties of the sequences, such as expressivity, base composition, etc., which can help to understand the significance of each main trend.

Fig. 1. Some compositional properties of the genomes of (a) C. perfringens and (b) C. acetobutylicum. Light-grey line, GC-skew; black line, G+C content (mol%); dark-grey line, purine content (R%). The window size was 20 kb with steps of 2.5 kb.

Patterns of codon usage in C. perfringens

When COA is applied to C. perfringens it detects a principal trend (8.4 % of the total variability) that is clearly associated with expression levels. Indeed, at one extreme of this axis (Fig. 2a) lie genes that are known to be heavily expressed, such as those encoding several ribosomal proteins, translation elongation factors, glyceraldehyde-3-phosphate dehydrogenase, phosphoglycerate kinase, fructose-bisphosphate aldolase, triose-phosphate isomerase, pyruvate kinase and heat-shock proteins, while genes expressed at the lowest levels and those encoding hypothetical proteins are distributed almost normally around the mean value of this axis. The clustering of highly expressed genes at one end of the distribution indicates that these sequences are characterized by a different pattern of codon usage than the rest of the genes; therefore, translational selection might be operative in this bacterium. To see which triplets are increased in the highly expressed group of genes, we compared the codon usage pattern of the sequences displaying the most extreme values at both ends of the first axis (50 genes at either extreme). The differences in codon usage between the two groups were tested with a χ2 test. We found that there are 17 codons whose usage is significantly increased (P<0.01) among the highly expressed group of genes, and they encode 17 amino acids (Cys is the only residue without an increased triplet). These codons are listed in Table 2.

Two different features related to the aforementioned codons support the hypothesis that they are translationally optimal. First, seven of the codons are C-ending, which is against the above-mentioned strong mutational bias towards A+T, suggesting that the increase may be caused by natural selection. A similar increase of triplets against the mutational bias among highly expressed genes has been reported in several unicellular species (prokaryotes and eukaryotes), and has always been explained in terms of natural selection (Grocock & Sharp, 2002; Musto et al., 1999; Romero et al., 2000b). Second, 15 of the codons match perfectly with the putative most abundant (or with the first and second most abundant) isoacceptor tRNA. Here we assume a correlation between the cellular levels of tRNAs and the copy numbers of tRNA genes, as was found in E. coli (Ikemura, 1981; Dong et al., 1996), Bacillus subtilis (Kanaya et al., 1999) and Saccharomyces cerevisiae (Percudani et al., 1997).
For example, there are six Ser tRNA genes, one in three copies, one in two copies and one in single copy; the first recognizes AGC, the second recognizes UCA and the third recognizes UCC, and the increased codons among highly expressed sequences are, precisely, the first two triplets. Similarly, for Arg, the tRNA that matches the only increased codon (AGA) is present in three copies, while other Arg tRNA sequences are present in single copy. In other cases, where the match is not perfect (as is the case for the fourfold degenerate codons encoding Val, Thr and Ala), it seems reasonable to postulate that this behaviour is due to modifications in the first position of the anticodon.

To further confirm the translational selection hypothesis, we calculated the CAI value for each sequence in C. perfringens taking as a reference the codon usage of ribosomal proteins, which are certainly heavily expressed. When all the sequences were sorted according to their CAI, the highest values were displayed not only by the genes encoding ribosomal proteins (which is a trivial result) but also by almost exactly the same genes that lie at the extreme of the first axis generated by the COA, which is confirmed by the strong correlation between the position of the sequences along this axis and the respective CAI values (R=0.82, P<0.0001). These results support our interpretation that the first axis discriminates expression levels.

Table 1. Codon usage (RSCU values) in C. perfringens and C. acetobutylicum
Cp, C. perfringens; Ca, C. acetobutylicum.

Amino acid  Codon  Cp    Ca      Amino acid  Codon  Cp    Ca
Phe         TTT    1.61  1.70    Ser         TCT    1.54  1.45
            TTC    0.39  0.30                TCC    0.22  0.35
Leu         TTA    3.96  2.49                TCA    1.94  1.62
            TTG    0.24  0.68                TCG    0.04  0.21
            CTT    1.23  1.86    Pro         CCT    1.73  1.80
            CTC    0.03  0.16                CCC    0.08  0.19
            CTA    0.51  0.65                CCA    2.14  1.76
            CTG    0.03  0.16                CCG    0.05  0.26
Ile         ATT    0.99  1.10    Thr         ACT    2.02  1.58
            ATC    0.14  0.14                ACC    0.24  0.40
            ATA    1.87  1.75                ACA    1.67  1.78
Met         ATG    1.00  1.00                ACG    0.06  0.24
Val         GTT    2.09  1.80    Ala         GCT    2.15  1.64
            GTC    0.08  0.13                GCC    0.30  0.29
            GTA    1.60  1.65                GCA    1.46  1.81
            GTG    0.23  0.42                GCG    0.08  0.26
Tyr         TAT    1.67  1.57    Cys         TGT    1.59  1.40
            TAC    0.33  0.43                TGC    0.41  0.60
TER         TAA    2.24  1.90    TER         TGA    0.11  0.34
            TAG    0.65  0.76    Trp         TGG    1.00  1.00
His         CAT    1.62  1.57    Arg         CGT    0.20  0.39
            CAC    0.38  0.43                CGC    0.02  0.10
Gln         CAA    1.72  1.40                CGA    0.03  0.18
            CAG    0.28  0.60                CGG    0.00  0.03
Asn         AAT    1.65  1.61    Ser         AGT    1.74  1.68
            AAC    0.35  0.39                AGC    0.52  0.69
Lys         AAA    1.39  1.35    Arg         AGA    5.18  4.24
            AAG    0.61  0.65                AGG    0.57  1.08
Asp         GAT    1.74  1.69    Gly         GGT    1.13  1.25
            GAC    0.26  0.31                GGC    0.21  0.39
Glu         GAA    1.54  1.47                GGA    2.34  2.04
            GAG    0.46  0.53                GGG    0.33  0.32

The second axis of the COA (6.7 % of the variability) discriminates between genes located in the leading or lagging strand of replication. The importance of this effect can be so high that in species like Borrelia burgdorferi, Treponema pallidum and Chlamydia trachomatis it is the most important force driving codon usage (McInerney, 1998; Lafay et al., 1999; Romero et al., 2000a). Among these species, the sequences located in the leading strand are G- and T-rich at the synonymous sites, while the complementary bases are more frequent in genes located in the lagging strand. However, this kind of bias is not found in Clostridium perfringens. Indeed, when the position of the codons in relation to the second axis is analysed it can be seen that purine- and pyrimidine-ending triplets lie at the opposite extremes. When the genes are sorted according to their position on the second axis, most sequences located in the lagging strand of replication cluster together towards one end of the distribution (Fig. 3a). This result is certainly related to the very strong purine bias associated with an excess of coding sequences that characterizes the leading strand of C. perfringens, as well as the genomes of several other Gram-positive prokaryotes (Shimizu et al., 2002). This is shown in Table 3, where the nucleotide compositions of C. perfringens and C. acetobutylicum are displayed. It can be seen that there is a clear asymmetry in the distribution of ORFs between the two strands and that, although the GC3 content remains constant, the purine content is higher in the leading strand; it should be stressed that the differences are higher with G than with A. However, the differences are constant in the two clostridial species across their entire genomes (Table 3). We note that this bias towards A+G in the leading strand is so strong that it detects the origin and terminus of replication as clearly as does the GC-skew (Fig. 1).

The analysis of the third axis of the COA (6.0 % of the variability) showed that the genes at the ends of the distribution are not related to any particular functional group and do not have any preferential location in the genome. We found, as reported by Lafay et al. (1999), that this axis is dominated by the usage of a single Arg codon, CGC. Even if this codon is excluded from the analysis, axis 3 appears to be associated with another Arg codon (CGA) and subsequently, with CGG. Thus, the third source of variation in C. perfringens seems to be the fourfold degenerate family of Arg codons, which are only marginally used in this species.

Fig. 2. Plot of the two first axes generated by the COA of RSCU values for (a) C. perfringens and (b) C. acetobutylicum. Blue dots correspond to all genes except for the ribosomal proteins, which are represented by red dots.

Therefore, from the above results it can be concluded that the three main forces driving codon usage in C. perfringens are (i) a strong ‘whole genome’ mutational bias towards A+T, (ii) natural selection acting at the translational level, and (iii) the location of each sequence in relation to the replication fork, which leads to an excess of purine-ending triplets in the leading strand.

Patterns of codon usage in C. acetobutylicum

Our next step was to study the factors that shape codon usage in a bacterium related to C. perfringens, C. acetobutylicum. Although the two species belong to the same genus, there are strong differences between them. First, the genome of C. acetobutylicum is 30 % longer and displays 40 % more ORFs than the genome of C. perfringens. Second, while in the former species the origin and terminus of replication are roughly opposite in the genome, in the latter bacterium this is not the case (Fig. 1). Third, since the split of these two species from their last common ancestor there have been a number of genomic rearrangements (Shimizu et al., 2002), although both organisms still share several compositional features (low G+C content, strong purine bias in the leading strand of replication, mean GC-skew of 20 %).

COA in C. acetobutylicum detected a principal trend (6.7 % of the total variability) that was equivalent to the second main trend in C. perfringens; in other words, it discriminated between genes located on the leading or lagging strand of replication (Fig. 3b), and again it was associated with a strong purine bias in the sequences placed in the leading strand (see Fig. 1 and Table 3). Not surprisingly, when the genes were sorted according to their position on the second axis generated by the analysis (5.4 % of the variability), the most heavily expressed sequences were clustered at one end of the distribution, indicating that translational selection for codon usage is operative in C. acetobutylicum too (Fig. 2b). We made the same analyses as in C. perfringens to detect the increased codons among the putatively highly expressed genes of C. acetobutylicum (see above). We found that 17 triplets encoding 15 amino acids are increased among the highly expressed set of sequences (no optimal codons were detected for Cys, Asp and Thr). It is interesting to note that 13 of these codons were shared between the two species (Table 2), showing that the general pattern described in C. perfringens is also valid for C. acetobutylicum. However, we should remark that the differences observed in the RSCU values between highly and lowly expressed sequences in C. acetobutylicum were not as high as those in C. perfringens (Table 2). When the CAI values were calculated in C. acetobutylicum (taking as a reference the sequences encoding its ribosomal proteins) we found that the highest values were again displayed by the same genes that lie at the extreme of the second axis generated by the COA, and the correlation between the position of the sequences along this axis and the respective CAI values was highly significant (R=0.56, P<0.0001), although lower than in C. perfringens (this is consistent with the observation of smaller differences in the RSCU values in the two species, see above).
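The RSCU and CAI measures used in the preceding analyses can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reimplementation of CODONW: the function names are hypothetical, the standard genetic code is used, and a pseudo-count of 0.5 replaces codons absent from the reference set (an assumption of this sketch). RSCU is the observed count of a codon divided by the count expected if all synonymous codons were used equally; the CAI of a gene is the geometric mean of per-codon weights, each weight being the reference usage of the codon relative to the most used synonym.

```python
from collections import Counter, defaultdict
from itertools import product
from math import exp, log

# Standard genetic code with bases ordered T, C, A, G ("*" marks stop codons).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous families: amino acid (or "*") -> list of its codons.
FAMILIES = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    FAMILIES[_aa].append(_codon)


def codons(seq):
    """Split an in-frame coding sequence into triplets (RNA is mapped to DNA)."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def count_codons(seqs):
    """Count recognised codons over an iterable of coding sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(c for c in codons(seq) if c in CODON_TABLE)
    return counts


def rscu(counts):
    """RSCU = observed count * family size / total count of the family."""
    values = {}
    for fam in FAMILIES.values():
        total = sum(counts.get(c, 0) for c in fam)
        for c in fam:
            values[c] = counts.get(c, 0) * len(fam) / total if total else 0.0
    return values


def cai_weights(reference_seqs, pseudo=0.5):
    """Relative adaptiveness of each codon in a reference set of genes.

    Stop codons and single-codon families (Met, Trp) are left out.
    Assumption of this sketch: unseen codons get the pseudo-count `pseudo`.
    """
    counts = count_codons(reference_seqs)
    weights = {}
    for aa, fam in FAMILIES.items():
        if aa == "*" or len(fam) == 1:
            continue
        adjusted = {c: counts.get(c, 0) or pseudo for c in fam}
        top = max(adjusted.values())
        for c in fam:
            weights[c] = adjusted[c] / top
    return weights


def cai(seq, weights):
    """Geometric mean of the weights of the informative codons of one gene."""
    logs = [log(weights[c]) for c in codons(seq) if c in weights]
    return exp(sum(logs) / len(logs)) if logs else 0.0
```

In the setting described above, `reference_seqs` would be the ribosomal protein genes of one species and `cai` would then be evaluated on every other gene of that genome. Because the weights are relative to the most used synonym, a gene built only from the preferred codons of the reference set scores 1.0.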
Therefore, we conclude that, in spite of minor differences, the same main forces are operative for shaping codon usage in the two bacteria studied here, although it should be noted that translational selection appears to be less strong in C. acetobutylicum than in C. perfringens. Whether this difference in the strength of selection is due to differences in generation times and/or effective population size deserves more investigation.

Table 2. Codon usage (RSCU values) in putatively highly and lowly expressed genes in C. perfringens and C. acetobutylicum
H, putatively highly expressed genes; L, putatively lowly expressed genes. †Codons with significantly (P<0.01) higher frequency in H for that species. The total number of codons analysed was 12 017 for H and 8073 for L in C. perfringens, and 10 894 for H and 6724 for L in C. acetobutylicum.

                    C. perfringens    C. acetobutylicum
Amino acid  Codon   H       L         H       L
Phe         UUU     0.38    1.78      1.24    1.68
            UUC     1.62†   0.22      0.76†   0.32
Leu         UUA     5.03†   3.79      2.99†   2.11
            UUG     0.00    0.52      0.25    1.11
            CUU     0.74    1.01      2.22†   1.44
            CUC     0.00    0.08      0.10    0.24
            CUA     0.22    0.53      0.37    0.73
            CUG     0.01    0.07      0.07    0.38
Ile         AUU     0.24    1.21      1.00    1.22
            AUC     1.16†   0.12      0.38†   0.16
            AUA     1.60    1.67      1.62    1.62
Met         AUG     1.00    1.00      1.00    1.00
Val         GUU     2.56†   1.72      2.26†   1.58
            GUC     0.01    0.18      0.05    0.16
            GUA     1.41    1.73      1.56    1.57
            GUG     0.02    0.36      0.12    0.69
Tyr         UAU     0.65    1.73      1.36    1.58
            UAC     1.35†   0.27      0.64†   0.42
TER         UAA     2.64    2.16      2.22    1.56
            UAG     0.36    0.66      0.78    0.66
His         CAU     0.61    1.66      1.12    1.67
            CAC     1.39†   0.34      0.88†   0.33
Gln         CAA     1.97†   1.56      1.54†   1.25
            CAG     0.03    0.44      0.46    0.75
Asn         AAU     0.52    1.74      1.31    1.58
            AAC     1.48†   0.26      0.69†   0.42
Lys         AAA     1.63†   1.44      1.55†   1.29
            AAG     0.37    0.56      0.45    0.71
Asp         GAU     1.25    1.77      1.66    1.60
            GAC     0.75†   0.23      0.34    0.40
Glu         GAA     1.70†   1.46      1.69†   1.36
            GAG     0.30    0.54      0.31    0.64
Ser         UCU     0.47    1.70      1.25    1.40
            UCC     0.00    0.49      0.28    0.29
            UCA     4.25†   1.35      3.07†   1.07
            UCG     0.00    0.07      0.04    0.36
Pro         CCU     0.78    2.17      1.60†   0.79
            CCC     0.01    0.64      0.00    0.97
            CCA     3.20†   0.99      2.37†   1.38
            CCG     0.01    0.20      0.04    0.86
Thr         ACU     2.51†   1.80      1.91    1.63
            ACC     0.01    0.52      0.23    0.39
            ACA     1.48    1.47      1.83    1.63
            ACG     0.01    0.21      0.03    0.35
Ala         GCU     2.78†   1.90      2.02†   1.45
            GCC     0.02    0.53      0.08    0.43
            GCA     1.17    1.52      1.88    1.66
            GCG     0.03    0.04      0.02    0.46
Cys         UGU     1.49    1.53      1.27    1.28
            UGC     0.51    0.47      0.73    0.72
TER         UGA     0.00    0.18      0.00    0.78
Trp         UGG     1.00    1.00      1.00    1.00
Arg         CGU     0.08    0.07      0.78    0.43
            CGC     0.00    0.36      0.02    0.35
            CGA     0.00    0.10      0.02    0.46
            CGG     0.00    0.13      0.00    0.51
Ser         AGU     0.48    1.92      0.79    2.18
            AGC     0.80†   0.47      0.57    0.71
Arg         AGA     5.91†   2.95      5.17†   2.02
            AGG     0.01    2.39      0.02    2.23
Gly         GGU     1.51†   0.96      1.38    1.18
            GGC     0.10    0.27      0.33    0.43
            GGA     2.38    2.17      2.25†   1.81
            GGG     0.02    0.60      0.04    0.58

Comparative studies of C. perfringens and C. acetobutylicum

To gain support for the above-mentioned conclusions, we analysed the orthologous sequences from C. perfringens and C. acetobutylicum. Since no qualitative differences were observed using either the Nei–Gojobori or the Li (Li, 1993) method, the results shown correspond to the former method. As can be seen in Fig. 4, as a consequence of the huge genomic rearrangements, most orthologous sequences fell outside the diagonal, indicating a nearly complete lack of gene order conservation.
From this result, and taking into account the strong and diverse mutational biases that characterize the two replicative strands of these genomes, it is interesting to split the sequences into three groups: those that are placed on the same strand (which can be leading or lagging) and those which changed strand. The total figures are 561 leading, 50 lagging and 65 that have switched strand. The base compositions at the synonymous sites for these pairs are representative, in each species, of the whole dataset (data not shown). As mentioned above, for both clostridial species the CAI values were calculated according to the genes encoding ribosomal proteins in the two species. As shown in Fig. 5, the respective values of the pairs of orthologous genes correlate very significantly (R=0.62, P<0.0001). It is interesting to note that the R value of the correlation changes if the three groups of sequences are considered separately: indeed, the values are 0.43 (P<0.001) for genes placed in different strands, 0.59 (P<0.0001) for those placed in the lagging strand, and 0.65 (P<0.0001) for those located in the leading strand. Without a doubt, the different mutational biases that characterize each strand are the cause of the relatively low value found for the genes that switched strands. Furthermore, the correlation found among all sequences suggests that the codon usage in the reference set is very similar for C. perfringens and C. acetobutylicum. In fact, the cumulative codon usage for ribosomal proteins in these prokaryotes is almost the same, and the only exceptions are within the pyrimidine-ending twofold degenerate codons, where C. perfringens shows a bias towards C at the synonymous sites and C. acetobutylicum prefers T-ending triplets (data not shown). Therefore, from the correlation of the CAI values it seems reasonable to suggest that orthologous genes are submitted to equivalent selective pressures and are probably expressed at comparable levels in the two species studied here.

Fig. 3. Histogram of the distribution of genes located in the leading (black bars) or lagging (grey bars) strand of replication along (a) Axis 2 for C. perfringens and (b) Axis 1 for C. acetobutylicum. The respective axes were divided into 10 parts, each of them containing an equal number of genes. For graphical purposes, the total number of sequences in each strand was normalized to 100 %.

Fig. 4. Plot of the chromosome locations of pairs of orthologous sequences between C. perfringens and C. acetobutylicum.

Fig. 5. Plot of the CAI values of pairs of orthologous sequences between C. perfringens and C. acetobutylicum.

Fig. 6. Plot of CAI against dS NG (Nei–Gojobori’s synonymous distance) for (a) C. perfringens and (b) C. acetobutylicum; plot of CAI against dN NG (Nei–Gojobori’s non-synonymous distance) for (c) C. perfringens and (d) C. acetobutylicum.

Table 3. Base compositions at the third codon position in the leading and lagging strands
N is the total number of genes in each strand for each species. R is the purine content. SD values are shown in parentheses.

                 C. perfringens strand         C. acetobutylicum strand
Parameter/base   Leading       Lagging         Leading       Lagging
N                2206          454             2902          770
T3               0.38 (0.05)   0.42 (0.05)     0.39 (0.04)   0.40 (0.05)
C3               0.06 (0.03)   0.09 (0.03)     0.07 (0.02)   0.12 (0.03)
A3               0.45 (0.05)   0.42 (0.05)     0.40 (0.05)   0.39 (0.05)
G3               0.11 (0.03)   0.07 (0.03)     0.14 (0.03)   0.09 (0.03)
GC3              0.17 (0.03)   0.16 (0.04)     0.21 (0.04)   0.22 (0.04)
R3               0.56 (0.04)   0.49 (0.05)     0.54 (0.05)   0.48 (0.05)
Several results concerning the analyses of the estimated number of synonymous and non-synonymous substitutions (dS and dN, respectively) support our proposal that selection acting at the level of translation contributes to codon usage in Clostridium spp. First, when the genes are sorted according to their dS values, the sequences displaying the lowest values are very highly expressed, i.e. ribosomal proteins, translation elongation factors, glyceraldehyde-3-phosphate dehydrogenase, groEL, etc. This indicates that the lowest divergence at the synonymous sites has occurred among highly expressed genes, which strongly suggests that selection is acting at the synonymous sites and is more effective on the sequences expressed at the highest levels. Second, there are negative and highly significant correlations between the dS and CAI values for both species (-0.35 and -0.48 for C. perfringens and C. acetobutylicum, respectively; Fig. 6a, b), which show that the genes which diverged less at the synonymous sites are the sequences displaying the highest frequencies of the presumed optimal codons. Furthermore, there are negative and highly significant correlations between the dN and CAI values for each genome (-0.45 and -0.41 for C. perfringens and C. acetobutylicum, respectively; Fig. 6c, d), indicating that the genes which diverged less at the non-synonymous sites display higher frequencies of the presumed optimal codons. In other words, in C. perfringens and C. acetobutylicum the optimal codons might be selected not only for speed but also for accuracy during translation. Another interpretation is possible: highly expressed proteins are also highly conserved proteins; thus, the correlation between dN and CAI values could be a passive result of this phenomenon (for a more thorough discussion of this point, see Romero et al., 2000).
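The relationship between divergence and CAI described above can be illustrated with a short sketch. In the Nei & Gojobori (1986) method, the proportion of synonymous (or non-synonymous) differences is converted into a distance with the Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3), and the resulting distances can then be correlated with CAI. The function names and every number below are made up for illustration; they are not the data behind Fig. 6.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor correction of a proportion of differences p.

    Used in the Nei-Gojobori method to turn pS (or pN) into dS (or dN).
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for four orthologous pairs (illustration only).
toy_cai = [0.30, 0.45, 0.60, 0.75]   # CAI of each gene in one species
toy_ps = [0.60, 0.50, 0.35, 0.20]    # proportion of synonymous differences
toy_ds = [jukes_cantor_distance(p) for p in toy_ps]
toy_r = pearson_r(toy_cai, toy_ds)   # negative: higher CAI, lower dS
```

A negative coefficient of this kind, obtained with dS on one axis and CAI on the other, is the pattern reported for both species; the same two functions apply unchanged to dN.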
In summary, in this study we have shown that in spite of the strong mutational biases towards A+T and the purine bias in the leading strand of replication, the genomes of C. perfringens and C. acetobutylicum show unambiguous features which strongly suggest that translational selection influences synonymous codon usage, both at the levels of speed and accuracy. However, it should be stressed that the fraction of the total variability associated with expression is rather low. Two non-mutually exclusive interpretations of this weak effect might be the strong mutational bias and/or the population size during vegetative growth.

ACKNOWLEDGEMENTS

We thank the two anonymous reviewers of this manuscript for their very helpful suggestions. This work was supported by award 7094 from 'Fondo Clemente Estable', Uruguay.

REFERENCES

Akashi, H. & Eyre-Walker, A. (1998). Translational selection and molecular evolution. Curr Opin Genet Dev 8, 688–693.
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402.
Bernardi, G. & Bernardi, G. (1986). Compositional constraints and genome evolution. J Mol Evol 24, 1–11.
Bulmer, M. (1991). The selection-mutation-drift theory of synonymous codon usage. Genetics 129, 897–907.
de Miranda, A. B., Alvarez-Valin, F., Jabbari, K., Degrave, W. M. & Bernardi, G. (2000). Gene expression, amino acid conservation, and hydrophobicity are the main factors shaping codon preferences in Mycobacterium tuberculosis and Mycobacterium leprae. J Mol Evol 50, 45–55.
Delcher, A. L., Phillippy, A., Carlton, J. & Salzberg, S. L. (2002). Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res 30, 2478–2483.
Dong, H., Nilsson, L. & Kurland, C. G. (1996). Co-variation of tRNA abundance and codon usage in Escherichia coli at different growth rates. J Mol Biol 260, 649–663.
Garcia-Vallve, S., Romeu, A. & Palau, J. (2000). Horizontal gene transfer of glycosyl hydrolases of the rumen fungi. Mol Biol Evol 17, 352–361.
Goncalves, I., Robinson, M., Perriere, G. & Mouchiroud, D. (1999). JADIS: computing distances between nucleic acid sequences. Bioinformatics 15, 424–425.
Gouy, M. & Gautier, C. (1982). Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res 10, 7055–7074.
Grantham, R., Gautier, C., Gouy, M., Jacobzone, M. & Mercier, R. (1981). Codon catalog usage is a genome strategy modulated for gene expressivity. Nucleic Acids Res 9, r43–74.
Greenacre, M. (1984). Theory and Applications of Correspondence Analysis. London: Academic.
Grocock, R. J. & Sharp, P. M. (2002). Synonymous codon usage in Pseudomonas aeruginosa PAO1. Gene 289, 131–139.
Ikemura, T. (1981). Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J Mol Biol 151, 389–409.
Kanaya, S., Yamada, Y., Kudo, Y. & Ikemura, T. (1999). Studies of codon usage and tRNA genes of 18 unicellular organisms and quantification of Bacillus subtilis tRNAs: gene expression level and species-specific diversity of codon usage based on multivariate analysis. Gene 238, 143–155.
Karlin, S. (2001). Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends Microbiol 9, 335–343.
Kerr, A. R., Peden, J. F. & Sharp, P. M. (1997). Systematic base composition variation around the genome of Mycoplasma genitalium, but not Mycoplasma pneumoniae. Mol Microbiol 25, 1177–1179.
Lafay, B., Lloyd, A. T., McLean, M. J., Devine, K. M., Sharp, P. M. & Wolfe, K. H. (1999). Proteome composition and codon usage in spirochaetes: species-specific and DNA strand-specific mutational biases. Nucleic Acids Res 27, 1642–1649.
Lafay, B., Atherton, J. C. & Sharp, P. M. (2000). Absence of translationally selected synonymous codon usage bias in Helicobacter pylori. Microbiology 146, 851–860.
Li, W. H. (1993). Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol 36, 96–99.
Lobry, J. R. (1996). Origin of replication of Mycoplasma genitalium. Science 272, 745–746.
McInerney, J. O. (1997). Prokaryotic genome evolution as assessed by multivariate analysis of codon usage patterns. Microb Comp Genomics 2, 1–10.
McInerney, J. O. (1998). Replicational and transcriptional selection on codon usage in Borrelia burgdorferi. Proc Natl Acad Sci U S A 95, 10698–10703.
Musto, H., Romero, H., Zavala, A., Jabbari, K. & Bernardi, G. (1999). Synonymous codon choices in the extremely GC-poor genome of Plasmodium falciparum: compositional constraints and translational selection. J Mol Evol 49, 27–35.
Muto, A. & Osawa, S. (1987). The guanine and cytosine content of genomic DNA and bacterial evolution. Proc Natl Acad Sci U S A 84, 166–169.
Nei, M. & Gojobori, T. (1986). Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 3, 418–426.
Nolling, J., Breton, G., Omelchenko, M. V. & 16 other authors (2001). Genome sequence and comparative analysis of the solvent-producing bacterium Clostridium acetobutylicum. J Bacteriol 183, 4823–4838.
Percudani, R., Pavesi, A. & Ottonello, S. (1997). Transfer RNA gene redundancy and translational selection in Saccharomyces cerevisiae. J Mol Biol 268, 322–330.
Post, L. E. & Nomura, M. (1979). Nucleotide sequence of the intercistronic region preceding the gene for RNA polymerase subunit alpha in Escherichia coli. J Biol Chem 254, 10604–10606.
Romero, H., Zavala, A. & Musto, H. (2000a). Codon usage in Chlamydia trachomatis is the result of strand-specific mutational biases and a complex pattern of selective forces. Nucleic Acids Res 28, 2084–2090.
Romero, H., Zavala, A. & Musto, H. (2000b). Compositional pressure and translational selection determine codon usage in the extremely GC-poor unicellular eukaryote Entamoeba histolytica. Gene 242, 307–311.
Sharp, P. M. & Li, W. H. (1986). An evolutionary perspective on synonymous codon usage in unicellular organisms. J Mol Evol 24, 28–38.
Sharp, P. M. & Li, W. H. (1987). The codon Adaptation Index: a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 15, 1281–1295.
Sharp, P. M., Tuohy, T. M. & Mosurski, K. R. (1986). Codon usage in yeast: cluster analysis clearly differentiates highly and lowly expressed genes. Nucleic Acids Res 14, 5125–5143.
Shimizu, T., Ohtani, K., Hirakawa, H. & 7 other authors (2002). Complete genome sequence of Clostridium perfringens, an anaerobic flesh-eater. Proc Natl Acad Sci U S A 99, 996–1001.
Sueoka, N. (1962). On the genetic basis of variation and heterogeneity of DNA base composition. Proc Natl Acad Sci U S A 48, 582–592.
Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673–4680.
Zavala, A., Naya, H., Romero, H. & Musto, H. (2002). Trends in codon and amino acid usage in Thermotoga maritima. J Mol Evol 54, 563–568.