Microbiology (2003), 149, 855–863
DOI 10.1099/mic.0.26063-0
Translational selection is operative for synonymous
codon usage in Clostridium perfringens and
Clostridium acetobutylicum
Héctor Musto,1 Héctor Romero1,2 and Alejandro Zavala1
1 Laboratorio de Organización y Evolución del Genoma, Facultad de Ciencias, Iguá 4225,
Montevideo 11400, Uruguay
2 Escuela Universitaria de Tecnología Médica, Facultad de Medicina, Avda. Italia (s/n) Hospital
de Clínicas, Montevideo 11600, Uruguay
Correspondence: Héctor Musto, [email protected]
Received 17 November 2002; Revised 3 December 2002; Accepted 17 January 2003
Here, the codon usage patterns of two Clostridium species (Clostridium perfringens and
Clostridium acetobutylicum) are reported. These prokaryotes are characterized by a strong
mutational bias towards A+T, a striking excess of coding sequences and purine-rich leading
strands of replication, strong GC-skews and a high frequency of genomic rearrangements. As
expected, it was found that the mutational bias dominates codon usage but there is some variation
of synonymous codon choices among genes in the two species. This variation was investigated
using a multivariate statistical approach. In the two species, two major trends were detected. One
was related to the location of the sequences in the leading or lagging strand of replication, and the
other was associated with the preferential use of putatively translational optimal codons in heavily
expressed genes. Analyses of the estimated number of synonymous and non-synonymous
substitutions among orthologous genes permit us to postulate that optimal codons might be
selected not only for speed but also for accuracy during translation.
Abbreviations: CAI, codon adaptation index; COA, correspondence
analysis; RSCU, relative synonymous codon usage.
0002-6063 © 2003 SGM. Printed in Great Britain
INTRODUCTION
Synonymous codons encode the same amino acid. Hence, in
principle, it could be assumed that if a large sample of genes
is studied, all triplets encoding the same amino acid should
be equally frequent. However, it is very clear that this
assumption is far from true, both among organisms and
among genes from a single species. Among prokaryotes, it is
generally agreed that the codon usage of any gene (and,
consequently, of any genome) is the result of the balance
between mutational biases and natural selection acting at the
level of translation, the latter effect being ‘visible’ only if it is
strong enough to overcome the effect of random genetic
drift (Sharp & Li, 1986; Bulmer, 1991; Akashi & Eyre-Walker, 1998).
The available evidence suggests that the strength and
direction of these forces can vary both among different
species and among sequences from the same genome. For
example, the genomic G+C contents of prokaryotes vary
from 25 to 75 mol% (Sueoka, 1962), and given the
correlation that holds between GC3s (G+C content at
‘silent’ third codon positions) and genomic G+C content
(Bernardi & Bernardi, 1986; Muto & Osawa, 1987), the
mutational bias characteristic of each genome greatly
influences codon choices. However, the availability of very
long contigs, and especially complete genomes, has shown
that the mutational bias is not simply shifting the whole
genome towards G+C or A+T. For instance, it has been
shown that there are regional variations in the G+C content
around the genome of Mycoplasma genitalium (McInerney,
1997; Kerr et al., 1997) which exert a great influence on GC3s
and, consequently, on codon usage. Perhaps more unexpected was the finding of Lobry (1996), who showed that in
several bacteria the leading and lagging strand of replication
can be easily recognized by the so-called ‘GC-skew’, the
quantity (G−C)/(G+C). Indeed, the leading strand usually
displays positive values while the reverse is true for the
lagging strand (the switch of sign occurs exactly at or very
near to the origin and terminus of replication). As a consequence, the leading strand is G- (and T-) rich, while the
lagging strand displays a bias towards C (and A). This effect
can be so strong that in species like Borrelia burgdorferi,
Treponema pallidum and Chlamydia trachomatis the position of the sequences in relation to the replication fork can
be recognized as the most important force driving codon
usage (McInerney, 1998; Lafay et al., 1999; Romero et al.,
2000a). Finally, a ‘common theme’ in completely sequenced
genomes is the finding of regions displaying base compositions far away from those of the genome as a whole. These
regions have been interpreted as being the result of events of
horizontal transfer of DNA between species differing in their
genomic G+C contents (Garcia-Vallve et al., 2000; Karlin,
2001), and the sequences located in these regions display
different codon usage than the rest of the genes. Therefore, it
can be concluded that the overall base content of a genome
and the mutational bias of each replicative strand are the
main forces driving codon usage.
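The GC-skew defined above, (G−C)/(G+C), can be illustrated with a short sliding-window sketch. This is only an illustration of the quantity, not the procedure used by the cited authors; the function name is hypothetical, and the default window and step (20 kb and 2.5 kb) simply mirror the settings given in the legend of Fig. 1.

```python
def gc_skew_windows(seq, window=20000, step=2500):
    """Return (start, skew) pairs, with skew = (G - C) / (G + C) in each window."""
    seq = seq.upper()
    results = []
    # Slide a fixed-size window along the sequence (linear, not circular).
    for start in range(0, max(len(seq) - window + 1, 1), step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Windows without any G or C get a skew of 0 to avoid dividing by zero.
        skew = (g - c) / (g + c) if (g + c) else 0.0
        results.append((start, skew))
    return results


# Toy sequence (hypothetical): a G-rich half followed by a C-rich half.
toy = "GGGCA" * 4 + "CCCGT" * 4
skews = gc_skew_windows(toy, window=20, step=20)
for start, skew in skews:
    print(start, skew)
```

In a real chromosome, the position where the sign of the skew switches would be expected to lie at or near the origin or terminus of replication, as described above.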
However, superimposed onto these general effects, in several
species it has been found that natural selection leads to the
fixation of some triplets among highly expressed genes. This
was observed in Escherichia coli (Post & Nomura, 1979;
Gouy & Gautier, 1982), where it was noted that the codon
usage of highly expressed sequences was biased in relation to
the pattern of lowly expressed genes. Indeed, in the former
group there is an increase of certain triplets (‘major codons’)
while in the latter group the usage of codons is more
random. From another perspective, Ikemura (1981) showed
that there is a match between these codons and the most
abundant tRNAs. Therefore, for E. coli it was proposed that
the triplets that are recognized more efficiently by the most
abundant isoacceptor are preferred, and the degree of bias in
each gene should be proportional to the level of expression.
Although the codon usage pattern of several prokaryotes
fell within this interpretation (i.e. codon usage is the result
of mutational biases and translational selection), the more
species are studied, the more peculiarities
appear. For example, it was shown that in
Helicobacter pylori, although the composition of the genome
is not skewed and there is a low (but detectable) level
of heterogeneity among genes, codon usage does not appear
to be influenced simply by mutational biases or translational selection (Lafay et al., 2000). Furthermore, in
Mycobacterium tuberculosis, although the ‘classical’ factors
are apparent, it was reported that the hydropathy level of
each protein is correlated with the base content at silent sites
(de Miranda et al., 2000). A more complex pattern was
found in Chlamydia trachomatis, since codon usage appears
to be shaped by the global genomic composition, the strand-specific mutational bias (as noted above), natural selection
acting at the level of translation, the hydropathy level of each
protein and each amino acid’s conservation (Romero et al.,
2000a). Therefore, as more prokaryotic genomes are analysed
it is becoming clear that more factors shape codon usage
than previously thought. Hence, more studies are needed (I)
to understand the generality of the factors and phenomena
described above, and (II) to detect new forces shaping codon
usage. With these goals in mind, we decided to study the
codon usage patterns in two species of Clostridium that have
been sequenced recently, namely Clostridium perfringens
(Shimizu et al., 2002) and Clostridium acetobutylicum
(Nolling et al., 2001). These Gram-positive, anaerobic,
spore-forming bacteria have several features that make them
useful for these studies: (i) they belong to the same genus,
which is important for comparative purposes; (ii) their
genomes are compositionally biased (G+C contents of 31
and 29 mol%, respectively), which could hide the effect
of natural selection; (iii) their generation time is short
(Shimizu et al., 2002), which, contrary to (i) and (ii), would
make selection for translational efficiency more likely to be
detected; and (iv) on the leading strand of replication the
two species display a very strong purine bias and an excess of
coding sequences (Shimizu et al., 2002), which might add
additional levels of complexity to their patterns of codon
usage.
METHODS
Sequences. The complete genomes and coding sequences of
C. acetobutylicum (Nolling et al., 2001) and C. perfringens (Shimizu
et al., 2002) were obtained from two NCBI ftp sites (ftp://ftp.ncbi.
nih.gov/genomes/Bacteria/Clostridium_acetobutylicum/ and ftp://ftp.
ncbi.nih.gov/genomes/Bacteria/Clostridium_perfringens/).
Methods of analysis. Codon usage, correspondence analysis (COA)
(Greenacre, 1984), GC3s (the frequency of codons ending in C or G,
excluding Met, Trp and stop codons), the relative synonymous
codon usage (RSCU) (Sharp et al., 1986) and the codon adaptation
index (CAI) (Sharp & Li, 1987) were calculated using the program
CODONW 1.3 (written by John Peden and available from ftp://molbiol.
ox.ac.uk/Win95.codonW.zip). In the two species under study, the
CAI was calculated taking the codon usage of the ribosomal proteins
as a reference. COA of RSCU values was carried out to determine the
major sources of variation among synonymous codons. The putative
orthologous sequences were identified running a BLAST query of the
whole set of proteins of one genome against the set of the other one
using the stand-alone BLAST package (Altschul et al., 1997). The
sequence with the best match, according to the score value, was identified. Then, the coding sequences of these pairs were translated and
aligned using CLUSTAL W (Thompson et al., 1994); subsequently, the
alignments were back-translated to the known DNA sequences. dS
(synonymous distance) and dN (non-synonymous distance) values
were calculated using the Nei–Gojobori method (Nei & Gojobori,
1986) using the JADIS package (Goncalves et al., 1999), only on those
pairs of sequences displaying a minimal value of 50 % identity and
a length difference of no more than 20 % at the amino acid level. The analyses
were performed only with the pairs of sequences displaying dS values
≤2.0. The final dataset comprised 676 pairs of genes. Whole-genome alignment and comparison were carried out with the
MUMmer system (release 2.1) (Delcher et al., 2002) using the default
settings.
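As an illustration of two of the indices named above, the following sketch computes RSCU values (observed codon count divided by the mean count of the codons in the same synonymous family) and GC3s (the frequency of codons ending in C or G, excluding Met, Trp and stop codons) for a single coding sequence. It is not the CODONW implementation; the function names and the toy sequence are hypothetical, and the standard genetic code is assumed.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codons(cds):
    """Count in-frame codons of a coding sequence (RNA or DNA alphabet)."""
    cds = cds.upper().replace("U", "T")
    codons = (cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))
    return Counter(c for c in codons if c in CODON_TABLE)


def rscu(counts):
    """RSCU = codon count / mean count of its synonymous family."""
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    values = {}
    for family in families.values():
        total = sum(counts.get(c, 0) for c in family)
        for codon in family:
            values[codon] = counts.get(codon, 0) * len(family) / total if total else 0.0
    return values


def gc3s(counts):
    """Fraction of codons ending in G or C, ignoring Met, Trp and stop codons."""
    usable = {c: n for c, n in counts.items() if CODON_TABLE[c] not in "MW*"}
    total = sum(usable.values())
    return sum(n for c, n in usable.items() if c[2] in "GC") / total if total else 0.0


# Toy sequence (hypothetical): TTT TTT TTC GAA GAA GAG
toy_counts = count_codons("TTTTTTTTCGAAGAAGAG")
toy_rscu = rscu(toy_counts)
print(round(toy_rscu["TTT"], 2), round(toy_rscu["TTC"], 2), round(gc3s(toy_counts), 2))
```

With this definition the RSCU values of a family sum to the number of codons in that family, which is why the values for a twofold degenerate amino acid in Table 1 sum to 2.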
RESULTS AND DISCUSSION
Compositional properties of C. perfringens and
C. acetobutylicum
As can be seen in Fig. 1, the genomes of C. perfringens and
C. acetobutylicum are strongly biased towards low G+C
contents. Indeed, with the exceptions of small regions
encoding rDNA and operons for ribosomal proteins, the
two genomes show little variation around each mean value
(31 and 29 %, respectively). As a consequence of this strong
mutational bias, the coding sequences (2660 and 3672 ORFs,
respectively) are characterized by extremely low GC3s and
symmetrical distributions (mean values 14 and 19 %,
standard deviations 3 and 4 %, for C. perfringens and
C. acetobutylicum, respectively). This predominance of A
and T at the synonymous sites is better displayed in Table 1,
in which the global codon usage (RSCU values) is shown for
both species. Indeed, for each amino acid the predominant
triplet (or triplets for three-, four- and sixfold degenerate
codons) is A- and/or T-ended. Therefore, it can be
concluded that the main force driving codon usage in
C. perfringens and C. acetobutylicum is the strong mutational
bias towards A and T. However, some results suggest that
this bias alone cannot explain the whole trend. Indeed, in the
two species studied here, the range of GC3s is rather high (46
and 35 %, respectively), and in Table 1 it can be seen that
T- and A-ending codons are not equally frequent among
fourfold degenerate codons. Therefore, it seems reasonable
to assume that some other minor factors are shaping codon
choices.

Fig. 1. Some compositional properties of the genomes of (a)
C. perfringens and (b) C. acetobutylicum. Light-grey line,
GC-skew; black line, G+C content (mol%); dark-grey line, purine
content (R%). The window size was 20 kb with steps of 2.5 kb.

To investigate this possibility, we conducted a COA
of RSCU values on all genes of C. perfringens and
C. acetobutylicum. This statistical approach has been
widely used to investigate major trends in codon usage in
several species of bacteria (Grantham et al., 1981;
McInerney, 1998; Lafay et al., 1999; Zavala et al., 2002).
The position of genes on the main axes generated by the
analysis can subsequently be compared with biological
properties of the sequences, such as expressivity, base
composition, etc., which can help to understand the
significance of each main trend.
Patterns of codon usage in C. perfringens
When COA is applied to C. perfringens it detects a principal
trend (8.4 % of the total variability) that is clearly associated
with expression levels. Indeed, at one extreme of this axis
(Fig. 2a) lie genes that are known to be heavily expressed,
such as those encoding several ribosomal proteins, translation
elongation factors, glyceraldehyde-3-phosphate dehydrogenase,
phosphoglycerate kinase, fructose-bisphosphate
aldolase, triose-phosphate isomerase, pyruvate kinase and
heat-shock proteins, while genes expressed at the lowest
levels and those encoding hypothetical proteins are distributed
almost normally around the mean value of this axis.
The clustering of highly expressed genes at one end of the
distribution indicates that these sequences are characterized
by a different pattern of codon usage than the rest of the
genes; therefore, translational selection might be operative
in this bacterium. To see which triplets are increased in the
highly expressed group of genes, we compared the codon
usage pattern of the sequences displaying the most extreme
values at both ends of the first axis (50 genes at either
extreme). The differences in codon usage between the two
groups were tested with a χ² test. We found that there are 17
codons whose usage is significantly increased (P<0.01)
among the highly expressed group of genes, and they encode
17 amino acids (Cys is the only residue without an increased
triplet). These codons are listed in Table 2.
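The comparison of codon usage between the two extreme groups of genes can be illustrated with a per-codon 2×2 χ² test. The text does not specify how the contingency tables were built, so the design below is an assumption: each codon is compared against the pooled remaining synonymous codons of the same amino acid, with one degree of freedom and a critical value of 6.635 for P=0.01. All names and counts are hypothetical.

```python
CHI2_CRIT_1DF_P01 = 6.635  # chi-square critical value for 1 degree of freedom at P = 0.01


def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def increased_codons(high, low, families):
    """Return codons significantly more frequent in `high` than in `low`.

    high, low: dicts mapping codon -> count in each group of genes.
    families: dict mapping amino acid -> list of its synonymous codons.
    """
    hits = []
    for codons in families.values():
        high_total = sum(high.get(cod, 0) for cod in codons)
        low_total = sum(low.get(cod, 0) for cod in codons)
        for codon in codons:
            in_high = high.get(codon, 0)
            in_low = low.get(codon, 0)
            stat = chi2_2x2(in_high, high_total - in_high, in_low, low_total - in_low)
            # Keep the codon only if its relative frequency is higher in the `high` group.
            if stat > CHI2_CRIT_1DF_P01 and in_high * low_total > in_low * high_total:
                hits.append(codon)
    return hits


# Hypothetical counts for one twofold degenerate amino acid.
families = {"Phe": ["UUU", "UUC"]}
high = {"UUU": 19, "UUC": 81}
low = {"UUU": 89, "UUC": 11}
print(increased_codons(high, low, families))
```

Testing each codon against its own synonymous family keeps the comparison independent of differences in amino acid composition between the two groups of genes.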
Two different features related to the aforementioned codons
support the hypothesis that they are translationally optimal.
First, seven of the codons are C-ending, which is against the
above-mentioned strong mutational bias towards A+T,
suggesting that the increase may be caused by natural
selection. A similar increase of triplets against the mutational bias among highly expressed genes has been reported
in several unicellular species (prokaryotes and eukaryotes),
and has always been explained in terms of natural selection
(Grocock & Sharp, 2002; Musto et al., 1999; Romero et al.,
2000b). Second, 15 of the codons match perfectly with the
putative most abundant (or with the first and second most
abundant) isoacceptor tRNA – we assume a correlation
between the cellular levels of tRNAs and the copy numbers
of tRNA genes, as was found in E. coli (Ikemura, 1981; Dong
et al., 1996), Bacillus subtilis (Kanaya et al., 1999) and
Saccharomyces cerevisiae (Percudani et al., 1997). For
example, there are six Ser tRNA genes, one in three
copies, one in two copies and one in single copy; the first
recognizes AGC, the second recognizes UCA and the third
recognizes UCC, and the codons increased among highly
expressed sequences are precisely the first two triplets.
Similarly, for Arg, the tRNA that matches with the only
increased codon (AGA) is present in three copies, while
other Arg tRNA sequences are present in single copy. In
other cases, where the match is not perfect (as is the case for
the fourfold degenerate codons encoding Val, Thr and Ala)
it seems reasonable to postulate that this behaviour is due to
modifications in the first position of the anticodon.
To further confirm the translational selection hypothesis, we
calculated the CAI value for each sequence in C. perfringens
taking as a reference the codon usage of ribosomal proteins,
which are certainly heavily expressed. When all the
sequences were sorted according to their CAI, the highest
values were displayed not only by the genes encoding
ribosomal proteins (which is a trivial result) but also by
almost exactly the same genes that lie at the extreme of the
first axis generated by the COA, which is confirmed by the
strong correlation between the position of the sequences
along this axis and the respective CAI values (R=0.82,
P<0.0001). These results support our interpretation that
the first axis discriminates expression levels.

Table 1. Codon usage (RSCU values) in C. perfringens and C. acetobutylicum

Amino acid  Codon  C. perfringens  C. acetobutylicum
Phe         TTT    1.61            1.70
            TTC    0.39            0.30
Leu         TTA    3.96            2.49
            TTG    0.24            0.68
            CTT    1.23            1.86
            CTC    0.03            0.16
            CTA    0.51            0.65
            CTG    0.03            0.16
Ile         ATT    0.99            1.10
            ATC    0.14            0.14
            ATA    1.87            1.75
Met         ATG    1.00            1.00
Val         GTT    2.09            1.80
            GTC    0.08            0.13
            GTA    1.60            1.65
            GTG    0.23            0.42
Tyr         TAT    1.67            1.57
            TAC    0.33            0.43
TER         TAA    2.24            1.90
            TAG    0.65            0.76
His         CAT    1.62            1.57
            CAC    0.38            0.43
Gln         CAA    1.72            1.40
            CAG    0.28            0.60
Asn         AAT    1.65            1.61
            AAC    0.35            0.39
Lys         AAA    1.39            1.35
            AAG    0.61            0.65
Asp         GAT    1.74            1.69
            GAC    0.26            0.31
Glu         GAA    1.54            1.47
            GAG    0.46            0.53

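The CAI used in this analysis can be sketched as follows. This is a minimal illustration of the index as defined by Sharp & Li (1987), in which each codon receives a relative adaptiveness (its count in the reference set divided by the count of the most frequent synonymous codon) and the CAI of a gene is the geometric mean of these values. It is not the CODONW implementation; the function names, the toy sequences and the pseudocount of 0.5 for codons absent from the reference set are assumptions of this sketch.

```python
import math
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(cds):
    """Split a coding sequence into in-frame codons (RNA or DNA alphabet)."""
    cds = cds.upper().replace("U", "T")
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def relative_adaptiveness(reference_cds, pseudocount=0.5):
    """Weight of each codon = its count / count of the most used synonymous codon."""
    counts = {codon: 0 for codon in CODON_TABLE}
    for cds in reference_cds:
        for codon in split_codons(cds):
            if codon in counts:
                counts[codon] += 1
    weights = {}
    for codon, aa in CODON_TABLE.items():
        if aa in "MW*":  # single-codon amino acids and stops carry no information
            continue
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        best = max(max(counts[c] for c in family), pseudocount)
        weights[codon] = max(counts[codon], pseudocount) / best
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of the informative codons of a gene."""
    logs = [math.log(weights[c]) for c in split_codons(cds) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0


# Hypothetical reference set in which AAA is three times as frequent as AAG.
weights = relative_adaptiveness(["AAAAAAAAAAAG"])
print(cai("AAAAAA", weights), cai("AAAAAG", weights))
```

In the analysis described above the reference set would be the genes encoding ribosomal proteins of the species under study.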
The second axis of the COA (6.7 % of the variability)
discriminates between genes located in the leading or
lagging strand of replication. The importance of this effect
can be so high that in species like Borrelia burgdorferi,
Treponema pallidum and Chlamydia trachomatis it is the
most important force driving codon usage (McInerney,
1998; Lafay et al., 1999; Romero et al., 2000a). Among these
species, the sequences located in the leading strand are
G- and T-rich at the synonymous sites, while the
complementary bases are more frequent in genes located
in the lagging strand. However, this kind of bias is not found
in Clostridium perfringens. Indeed, when the position of the
codons in relation to the second axis is analysed it can be
seen that purine- and pyrimidine-ending triplets lie at the
opposite extremes. When the genes are sorted according to
their position on the second axis, most sequences located in
the lagging strand of replication cluster together towards
one end of the distribution (Fig. 3a). This result is certainly
related to the very strong purine bias associated with an
excess of coding sequences that characterizes the leading
strand of C. perfringens, as well as the genomes of several
other Gram-positive prokaryotes (Shimizu et al., 2002). This
is shown in Table 3, where the nucleotide compositions of
C. perfringens and C. acetobutylicum are displayed. It can be
seen that there is a clear asymmetry in the distribution of
ORFs between the two strands and that, although the GC3
content remains constant, the purine content is higher in
the leading strand; it should be stressed that the
differences are larger for G than for A. However, the
differences are constant in the two clostridial species across
their entire genomes (Table 3). We note that this bias
towards A+G in the leading strand is so strong that it
detects the origin and terminus of replication as clearly as does
the GC-skew (Fig. 1).

Table 1 (continued). Codon usage (RSCU values) in C. perfringens and C. acetobutylicum

Amino acid  Codon  C. perfringens  C. acetobutylicum
Ser         TCT    1.54            1.45
            TCC    0.22            0.35
            TCA    1.94            1.62
            TCG    0.04            0.21
Pro         CCT    1.73            1.80
            CCC    0.08            0.19
            CCA    2.14            1.76
            CCG    0.05            0.26
Thr         ACT    2.02            1.58
            ACC    0.24            0.40
            ACA    1.67            1.78
            ACG    0.06            0.24
Ala         GCT    2.15            1.64
            GCC    0.30            0.29
            GCA    1.46            1.81
            GCG    0.08            0.26
Cys         TGT    1.59            1.40
            TGC    0.41            0.60
TER         TGA    0.11            0.34
Trp         TGG    1.00            1.00
Arg         CGT    0.20            0.39
            CGC    0.02            0.10
            CGA    0.03            0.18
            CGG    0.00            0.03
Ser         AGT    1.74            1.68
            AGC    0.52            0.69
Arg         AGA    5.18            4.24
            AGG    0.57            1.08
Gly         GGT    1.13            1.25
            GGC    0.21            0.39
            GGA    2.34            2.04
            GGG    0.33            0.32

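The third-position quantities reported in Table 3 (T3, C3, A3, G3, GC3 and the purine content R3) can be illustrated for a single gene as follows. This is a sketch under a simplifying assumption: every in-frame codon is counted, with no exclusion of Met, Trp or stop codons, which may differ from how the published values were obtained. The function name and toy sequence are hypothetical.

```python
def third_position_composition(cds):
    """Base frequencies at the third codon position of a coding sequence."""
    cds = cds.upper().replace("U", "T")
    # Third base of every complete in-frame codon.
    thirds = [cds[i + 2] for i in range(0, len(cds) - len(cds) % 3, 3)]
    n = len(thirds)
    freq = {base + "3": (thirds.count(base) / n if n else 0.0) for base in "TCAG"}
    freq["GC3"] = freq["G3"] + freq["C3"]
    freq["R3"] = freq["A3"] + freq["G3"]  # purines: A and G
    return freq


# Toy sequence (hypothetical): GAA GAG TTT ACC -> third bases A, G, T, C
comp = third_position_composition("GAAGAGTTTACC")
print(comp)
```

Averaging R3 separately over the genes of each strand would give the kind of leading versus lagging comparison discussed above.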
The analysis of the third axis of the COA (6.0 % of the
variability) showed that the genes at the ends of the
distribution are not related to any particular functional
group and do not have any preferential location in the
genome. We found, as reported by Lafay et al. (1999), that
this axis is dominated by the usage of a single Arg codon,
CGC. Even if this codon is excluded from the analysis, axis 3
appears to be associated with another Arg codon (CGA) and
subsequently, with CGG. Thus, the third source of variation
in C. perfringens seems to be the fourfold degenerate family
of Arg codons, which are only marginally used in this
species. Therefore, from the above results it can be concluded
that the three main forces driving codon usage in
C. perfringens are (i) a strong ‘whole genome’ mutational
bias towards A+T, (ii) natural selection acting at the
translational level, and (iii) the location of each sequence in
relation to the replication fork, which leads to an excess of
purine-ending triplets in the leading strand.

Fig. 2. Plot of the two first axes generated by the COA of
RSCU values for (a) C. perfringens and (b) C. acetobutylicum.
Blue dots correspond to all genes except for the ribosomal
proteins, which are represented by red dots.

Patterns of codon usage in C. acetobutylicum
Our next step was to study the factors that shape
codon usage in a bacterium related to C. perfringens,
C. acetobutylicum. Although the two species belong to the
same genus, there are strong differences between them. First,
the genome of C. acetobutylicum is 30 % longer and displays
40 % more ORFs than the genome of C. perfringens. Second,
while in the former species the origin and terminus of
replication are roughly opposite in the genome, in the latter
bacterium this is not the case (Fig. 1). Third, since the split
of these two species from their last common ancestor there
have been a number of genomic rearrangements (Shimizu
et al., 2002), although both organisms still share several
compositional features (low G+C content, strong purine
bias in the leading strand of replication, mean GC-skew of
20 %).
COA in C. acetobutylicum detected a principal trend (6.7 %
of the total variability) that was equivalent to the second
main trend in C. perfringens; in other words, it discriminated
between genes located on the leading or lagging strand of
replication (Fig. 3b), and again it was associated with a
strong purine bias in the sequences placed in the leading
strand (see Fig. 1 and Table 3). Not surprisingly, when the
genes were sorted according to their position on the second
axis generated by the analysis (5.4 % of the variability), the
most heavily expressed sequences were clustered at one end
of the distribution, indicating that translational selection for
codon usage is operative in C. acetobutylicum too (Fig. 2b).
We made the same analyses as were made in C. perfringens,
to detect the increased codons among the putatively highly
expressed genes of C. acetobutylicum (see above). We found
that 17 triplets encoding 15 amino acids are increased
among the highly expressed set of sequences (no optimal
codons were detected for Cys, Asp and Thr). It is interesting
to note that 13 of these codons were shared between the two
species (Table 2), showing that the general pattern described
in C. perfringens is also valid for C. acetobutylicum. However,
we should remark that the differences observed in the RSCU
values between highly and lowly expressed sequences in
C. acetobutylicum were not as high as those in C. perfringens
(Table 2).
When the CAI values were calculated in C. acetobutylicum
(taking as a reference the sequences encoding its ribosomal
proteins) we found that the highest values were again
displayed by the same genes that lie at the extreme of the
second axis generated by the COA, and the correlation
between the position of the sequences along this axis and
the respective CAI values was highly significant (R=0.56,
P<0.0001), although lower than in C. perfringens (this is
consistent with the smaller differences in RSCU values between
highly and lowly expressed sequences in C. acetobutylicum, see above).
Therefore, we
conclude that, in spite of minor differences, the same main
forces are operative for shaping codon usage in the two
bacteria studied here, although it should be noted that
translational selection appears to be less strong in
C. acetobutylicum than in C. perfringens. Whether this
weaker selection reflects differences in generation times and/or
effective population size deserves further
investigation.

Table 2. Codon usage (RSCU values) in putatively highly and lowly expressed genes in C. perfringens and C. acetobutylicum

            C. perfringens        C. acetobutylicum
Amino acid  Codon   H*     L*     Codon   H      L
Phe         UUU     0.38   1.78   UUU     1.24   1.68
            UUC†    1.62   0.22   UUC†    0.76   0.32
Leu         UUA†    5.03   3.79   UUA†    2.99   2.11
            UUG     0.00   0.52   UUG     0.25   1.11
            CUU     0.74   1.01   CUU†    2.22   1.44
            CUC     0.00   0.08   CUC     0.10   0.24
            CUA     0.22   0.53   CUA     0.37   0.73
            CUG     0.01   0.07   CUG     0.07   0.38
Ile         AUU     0.24   1.21   AUU     1.00   1.22
            AUC†    1.16   0.12   AUC†    0.38   0.16
            AUA     1.60   1.67   AUA     1.62   1.62
Met         AUG     1.00   1.00   AUG     1.00   1.00
Val         GUU†    2.56   1.72   GUU†    2.26   1.58
            GUC     0.01   0.18   GUC     0.05   0.16
            GUA     1.41   1.73   GUA     1.56   1.57
            GUG     0.02   0.36   GUG     0.12   0.69
Tyr         UAU     0.65   1.73   UAU     1.36   1.58
            UAC†    1.35   0.27   UAC†    0.64   0.42
TER         UAA     2.64   2.16   UAA     2.22   1.56
            UAG     0.36   0.66   UAG     0.78   0.66
His         CAU     0.61   1.66   CAU     1.12   1.67
            CAC†    1.39   0.34   CAC†    0.88   0.33
Gln         CAA†    1.97   1.56   CAA†    1.54   1.25
            CAG     0.03   0.44   CAG     0.46   0.75
Asn         AAU     0.52   1.74   AAU     1.31   1.58
            AAC†    1.48   0.26   AAC†    0.69   0.42
Lys         AAA†    1.63   1.44   AAA†    1.55   1.29
            AAG     0.37   0.56   AAG     0.45   0.71
Asp         GAU     1.25   1.77   GAU     1.66   1.60
            GAC†    0.75   0.23   GAC     0.34   0.40
Glu         GAA†    1.70   1.46   GAA†    1.69   1.36
            GAG     0.30   0.54   GAG     0.31   0.64
Ser         UCU     0.47   1.70   UCU     1.25   1.40
            UCC     0.00   0.49   UCC     0.28   0.29
            UCA†    4.25   1.35   UCA†    3.07   1.07
            UCG     0.00   0.07   UCG     0.04   0.36
Pro         CCU     0.78   2.17   CCU†    1.60   0.79
            CCC     0.01   0.64   CCC     0.00   0.97
            CCA†    3.20   0.99   CCA†    2.37   1.38
            CCG     0.01   0.20   CCG     0.04   0.86
Thr         ACU†    2.51   1.80   ACU     1.91   1.63
            ACC     0.01   0.52   ACC     0.23   0.39
            ACA     1.48   1.47   ACA     1.83   1.63
            ACG     0.01   0.21   ACG     0.03   0.35
Ala         GCU†    2.78   1.90   GCU†    2.02   1.45
            GCC     0.02   0.53   GCC     0.08   0.43
            GCA     1.17   1.52   GCA     1.88   1.66
            GCG     0.03   0.04   GCG     0.02   0.46
Cys         UGU     1.49   1.53   UGU     1.27   1.28
            UGC     0.51   0.47   UGC     0.73   0.72
TER         UGA     0.00   0.18   UGA     0.00   0.78
Trp         UGG     1.00   1.00   UGG     1.00   1.00
Arg         CGU     0.08   0.07   CGU     0.78   0.43
            CGC     0.00   0.36   CGC     0.02   0.35
            CGA     0.00   0.10   CGA     0.02   0.46
            CGG     0.00   0.13   CGG     0.00   0.51
Ser         AGU     0.48   1.92   AGU     0.79   2.18
            AGC†    0.80   0.47   AGC     0.57   0.71
Arg         AGA†    5.91   2.95   AGA†    5.17   2.02
            AGG     0.01   2.39   AGG     0.02   2.23
Gly         GGU†    1.51   0.96   GGU     1.38   1.18
            GGC     0.10   0.27   GGC     0.33   0.43
            GGA     2.38   2.17   GGA†    2.25   1.81
            GGG     0.02   0.60   GGG     0.04   0.58

*H, putatively highly expressed genes; L, putatively lowly expressed genes.
†Codons with significantly (P<0.01) higher frequency in H. The total number of codons analysed was 12 017 for H and 8073 for L in
C. perfringens, and 10 894 for H and 6724 for L in C. acetobutylicum.

Comparative studies of C. perfringens and
C. acetobutylicum
To gain support for the above-mentioned conclusions, we
analysed the orthologous sequences from C. perfringens and
C. acetobutylicum. Since no qualitative differences were
observed using either the Nei–Gojobori or the Li (Li, 1993)
method the results shown correspond to the former method.
As can be seen in Fig. 4, as a consequence of the huge
genomic rearrangements, most orthologous sequences fell
outside the diagonal, indicating a nearly complete lack of
gene order conservation. From this result, and taking into
account the strong and diverse mutational biases that
characterize the two replicative strands of these genomes, it
is interesting to split the sequences into three groups: those
that are placed on the same strand (which can be leading or
lagging) and those which changed strand. The total figures
are 561 leading, 50 lagging and 65 that have switched strand.
The base compositions at the synonymous sites for these
pairs are representative, in each species, of the whole dataset
(data not shown).
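The grouping of orthologous pairs by replicative strand can be sketched as follows. A gene is taken to lie on the leading strand when it is transcribed in the same direction as the replication fork moves through it; the chromosome is treated as circular, with replication proceeding in both directions from the origin to the terminus. The function names, coordinates and orientation labels are hypothetical and do not come from the genome annotations used in the study.

```python
from collections import Counter


def replication_strand(position, orientation, origin, terminus, genome_length):
    """Return 'leading' or 'lagging' for a gene on a circular chromosome.

    orientation is '+' for genes on the forward strand and '-' otherwise.
    """
    # Distances measured clockwise (in increasing coordinates) from the origin.
    offset = (position - origin) % genome_length
    ter_offset = (terminus - origin) % genome_length
    clockwise_replichore = offset < ter_offset
    # Forward-strand genes are co-directional with the fork in the clockwise replichore.
    co_directional = (orientation == "+") == clockwise_replichore
    return "leading" if co_directional else "lagging"


def classify_pair(strand_a, strand_b):
    """Group an orthologous pair as 'leading', 'lagging' or 'switched'."""
    return strand_a if strand_a == strand_b else "switched"


# Hypothetical 1000 bp genomes with origin at 0 and terminus at 500.
pairs = [
    (replication_strand(100, "+", 0, 500, 1000), replication_strand(700, "-", 0, 500, 1000)),
    (replication_strand(100, "-", 0, 500, 1000), replication_strand(700, "+", 0, 500, 1000)),
    (replication_strand(100, "+", 0, 500, 1000), replication_strand(700, "+", 0, 500, 1000)),
]
groups = Counter(classify_pair(a, b) for a, b in pairs)
print(groups)
```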
As mentioned above, for both clostridial species the CAI
values were calculated according to the genes encoding
ribosomal proteins in the two species. As shown in Fig. 5,
the respective values of the pairs of orthologous genes
correlate very significantly (R=0?62, P<0?0001). It is
interesting to note that the R value of the correlation changes
if the three groups of sequences are considered separately:
indeed, the values are 0?43 (P<0?001) for genes placed in
Microbiology 149
Codon usage in C. perfringens and C. acetobutylicum
Fig. 4. Plot of the chromosome locations of pairs of orthologous
sequences between C. perfringens and C. acetobutylicum.
Fig. 3. Histogram of the distribution of genes located in the
leading (black bars) or lagging (grey bars) strand of replication
along (a) Axis 2 for C. perfringens and (b) Axis 1 for C. acetobutylicum. The respective axes were divided into 10 parts,
each of them containing an equal number of genes. For graphical purposes, the total number of sequences in each strand
was normalized to 100 %.
Table 3. Base compositions at the third codon position in
the leading and lagging strands
N is the total number of genes in each strand for each species. R
is the purine content. SD Values are shown in parentheses.
Parameter/ C. perfringens strand
base
N
T3
C3
A3
G3
GC3
R3
C. acetobutylicum strand
Leading
Lagging
Leading
Lagging
2206
0?38 (0?05)
0?06 (0?03)
0?45 (0?05)
0?11 (0?03)
0?17 (0?03)
0?56 (0?04)
454
0?42 (0?05)
0?09 (0?03)
0?42 (0?05)
0?07 (0?03)
0?16 (0?04)
0?49 (0?05)
2902
0?39 (0?04)
0?07 (0?02)
0?40 (0?05)
0?14 (0?03)
0?21 (0?04)
0?54 (0?05)
770
0?40 (0?05)
0?12 (0?03)
0?39 (0?05)
0?09 (0?03)
0?22 (0?04)
0?48 (0?05)
different strands, 0.59 (P<0.0001) for those placed in the
lagging strand, and 0.65 (P<0.0001) for those located in the
leading strand. Without a doubt, the different mutational
biases that characterize each strand are the cause of the
relatively low value found for the genes that switched
strands. Furthermore, the correlation found among all
sequences suggests that the codon usage in the reference set
is very similar for C. perfringens and C. acetobutylicum. In
fact, the cumulative codon usage for ribosomal proteins
in these prokaryotes is almost the same, and the only
exceptions are within the pyrimidine-ending twofold
degenerate codons, where C. perfringens shows a bias
towards C at the synonymous sites and C. acetobutylicum
prefers T-ending triplets (data not shown). Therefore, from
the correlation of the CAI values it seems reasonable to
suggest that orthologous genes are subject to equivalent
selective pressures and are probably expressed at comparable
levels in the two species studied here.
Fig. 5. Plot of the CAI values of pairs of orthologous
sequences between C. perfringens and C. acetobutylicum.
Fig. 6. Plot of CAI against dS NG (Nei–Gojobori’s synonymous distance) for (a) C. perfringens and (b) C. acetobutylicum;
plot of CAI against dN NG (Nei–Gojobori’s non-synonymous distance) for (c) C. perfringens and (d) C. acetobutylicum.
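For readers unfamiliar with the CAI values compared above, the index of Sharp & Li (1987) is the geometric mean of the relative adaptiveness w of the codons in a gene, where w is the count of a codon in a reference set divided by the count of the most frequent synonymous codon for the same amino acid. The sketch below is illustrative only and is not the authors' implementation. The function names are hypothetical, the reference set is assumed to be a list of highly expressed genes such as those for ribosomal proteins, and replacing zero counts by 0.5 is a convention assumed here.

```python
import math
from collections import Counter, defaultdict

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def codons(seq):
    """Split an in-frame sequence into complete codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def relative_adaptiveness(reference_seqs):
    """w for each codon: its count divided by that of the top synonymous codon."""
    counts = Counter(c for s in reference_seqs for c in codons(s) if c in CODE)
    by_aa = defaultdict(list)
    for codon, aa in CODE.items():
        if aa not in "*MW":  # stops, Met and Trp carry no synonymous choice
            by_aa[aa].append(codon)
    w = {}
    for syn in by_aa.values():
        # Assumption: unobserved codons get a count of 0.5 to avoid w = 0.
        adjusted = {c: max(counts[c], 0.5) for c in syn}
        best = max(adjusted.values())
        for c in syn:
            w[c] = adjusted[c] / best
    return w

def cai(seq, w):
    """Geometric mean of w over the informative codons of a gene."""
    logs = [math.log(w[c]) for c in codons(seq) if c in w]
    if not logs:
        raise ValueError("sequence has no informative codons")
    return math.exp(sum(logs) / len(logs))
```

Because w is defined relative to each species' own reference set, comparing CAI values of orthologous genes across two species is meaningful mainly when the two reference sets have similar codon usage, which is the point made in the paragraph above.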
Several results concerning the analyses of the estimated
number of synonymous and non-synonymous substitutions
(dS and dN, respectively) support our proposal that
selection acting at the level of translation contributes to
codon usage in Clostridium spp. First, when the genes are
sorted according to their dS values, the sequences displaying
the lowest values are very highly expressed, i.e. ribosomal
proteins, translation elongation factors, glyceraldehyde-3-phosphate dehydrogenase, groEL, etc. This indicates that the
lowest divergence at the synonymous sites has occurred
among highly expressed genes. This
strongly suggests that selection is acting at the synonymous
sites, and it is more effective on the sequences expressed
at highest levels. Second, there are negative and highly
significant correlations between the dS and CAI values
for both species (−0.35 and −0.48 for C. perfringens and
C. acetobutylicum, respectively; Fig. 6a, b), which show that
the genes which diverged less at the synonymous sites are
the sequences displaying the highest frequencies of the
presumed optimal codons. Furthermore, there are negative
and highly significant correlations between the dN and CAI
values for each genome (−0.45 and −0.41 for C. perfringens
and C. acetobutylicum, respectively; Fig. 6c, d), indicating
that the genes which diverged less at the non-synonymous
sites display higher frequencies of the presumed optimal codons. In other words, in C. perfringens and
C. acetobutylicum the optimal codons might be selected not
only for speed but also for accuracy during translation.
Another interpretation is possible: highly expressed proteins
are also highly conserved proteins; thus, the correlation
between dN and CAI values could be a passive result of this
phenomenon (for a more thorough discussion of this point,
see Romero et al., 2000).
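The negative correlations reported above relate, gene by gene, a substitution distance (dS or dN) to the CAI value. A minimal sketch of such a calculation is shown below. It uses the Pearson coefficient as an assumption, because the text does not state which coefficient was computed, and the function name is hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)
```

With per-gene lists of CAI and dS values as input, a negative result corresponds to the pattern described in the text: genes with higher CAI have diverged less at synonymous sites.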
In summary, in this study we have shown that in spite of
the strong mutational biases towards A+T and the purine
bias in the leading strand of replication, the genomes of
C. perfringens and C. acetobutylicum show unambiguous
features which strongly suggest that translational selection
influences synonymous codon usage, both at the levels of
speed and accuracy. However, it should be stressed that the
fraction of the total variability associated with expression is
rather low. Two non-mutually exclusive explanations for
this weak effect are the strong mutational bias and
the population size during vegetative growth.
ACKNOWLEDGEMENTS
We thank the two anonymous reviewers of this manuscript for their
very helpful suggestions. This work was supported by award 7094 from
‘Fondo Clemente Estable’, Uruguay.
REFERENCES
Akashi, H. & Eyre-Walker, A. (1998). Translational selection and molecular evolution. Curr Opin Genet Dev 8, 688–693.
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402.
Bernardi, G. & Bernardi, G. (1986). Compositional constraints and genome evolution. J Mol Evol 24, 1–11.
Bulmer, M. (1991). The selection-mutation-drift theory of synonymous codon usage. Genetics 129, 897–907.
de Miranda, A. B., Alvarez-Valin, F., Jabbari, K., Degrave, W. M. & Bernardi, G. (2000). Gene expression, amino acid conservation, and hydrophobicity are the main factors shaping codon preferences in Mycobacterium tuberculosis and Mycobacterium leprae. J Mol Evol 50, 45–55.
Delcher, A. L., Phillippy, A., Carlton, J. & Salzberg, S. L. (2002). Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res 30, 2478–2483.
Dong, H., Nilsson, L. & Kurland, C. G. (1996). Co-variation of tRNA abundance and codon usage in Escherichia coli at different growth rates. J Mol Biol 260, 649–663.
Garcia-Vallve, S., Romeu, A. & Palau, J. (2000). Horizontal gene transfer of glycosyl hydrolases of the rumen fungi. Mol Biol Evol 17, 352–361.
Goncalves, I., Robinson, M., Perriere, G. & Mouchiroud, D. (1999). JADIS: computing distances between nucleic acid sequences. Bioinformatics 15, 424–425.
Gouy, M. & Gautier, C. (1982). Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res 10, 7055–7074.
Grantham, R., Gautier, C., Gouy, M., Jacobzone, M. & Mercier, R. (1981). Codon catalog usage is a genome strategy modulated for gene expressivity. Nucleic Acids Res 9, r43–74.
Greenacre, M. (1984). Theory and Applications of Correspondence Analysis. London: Academic.
Grocock, R. J. & Sharp, P. M. (2002). Synonymous codon usage in Pseudomonas aeruginosa PAO1. Gene 289, 131–139.
Ikemura, T. (1981). Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J Mol Biol 151, 389–409.
Kanaya, S., Yamada, Y., Kudo, Y. & Ikemura, T. (1999). Studies of codon usage and tRNA genes of 18 unicellular organisms and quantification of Bacillus subtilis tRNAs: gene expression level and species-specific diversity of codon usage based on multivariate analysis. Gene 238, 143–155.
Karlin, S. (2001). Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends Microbiol 9, 335–343.
Kerr, A. R., Peden, J. F. & Sharp, P. M. (1997). Systematic base composition variation around the genome of Mycoplasma genitalium, but not Mycoplasma pneumoniae. Mol Microbiol 25, 1177–1179.
Lafay, B., Lloyd, A. T., McLean, M. J., Devine, K. M., Sharp, P. M. & Wolfe, K. H. (1999). Proteome composition and codon usage in spirochaetes: species-specific and DNA strand-specific mutational biases. Nucleic Acids Res 27, 1642–1649.
Lafay, B., Atherton, J. C. & Sharp, P. M. (2000). Absence of translationally selected synonymous codon usage bias in Helicobacter pylori. Microbiology 146, 851–860.
Li, W. H. (1993). Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol 36, 96–99.
Lobry, J. R. (1996). Origin of replication of Mycoplasma genitalium. Science 272, 745–746.
McInerney, J. O. (1997). Prokaryotic genome evolution as assessed by multivariate analysis of codon usage patterns. Microb Comp Genomics 2, 1–10.
McInerney, J. O. (1998). Replicational and transcriptional selection on codon usage in Borrelia burgdorferi. Proc Natl Acad Sci U S A 95, 10698–10703.
Musto, H., Romero, H., Zavala, A., Jabbari, K. & Bernardi, G. (1999). Synonymous codon choices in the extremely GC-poor genome of Plasmodium falciparum: compositional constraints and translational selection. J Mol Evol 49, 27–35.
Muto, A. & Osawa, S. (1987). The guanine and cytosine content of genomic DNA and bacterial evolution. Proc Natl Acad Sci U S A 84, 166–169.
Nei, M. & Gojobori, T. (1986). Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 3, 418–426.
Nolling, J., Breton, G., Omelchenko, M. V. & 16 other authors (2001). Genome sequence and comparative analysis of the solvent-producing bacterium Clostridium acetobutylicum. J Bacteriol 183, 4823–4838.
Percudani, R., Pavesi, A. & Ottonello, S. (1997). Transfer RNA gene redundancy and translational selection in Saccharomyces cerevisiae. J Mol Biol 268, 322–330.
Post, L. E. & Nomura, M. (1979). Nucleotide sequence of the intercistronic region preceding the gene for RNA polymerase subunit alpha in Escherichia coli. J Biol Chem 254, 10604–10606.
Romero, H., Zavala, A. & Musto, H. (2000a). Codon usage in Chlamydia trachomatis is the result of strand-specific mutational biases and a complex pattern of selective forces. Nucleic Acids Res 28, 2084–2090.
Romero, H., Zavala, A. & Musto, H. (2000b). Compositional pressure and translational selection determine codon usage in the extremely GC-poor unicellular eukaryote Entamoeba histolytica. Gene 242, 307–311.
Sharp, P. M. & Li, W. H. (1986). An evolutionary perspective on synonymous codon usage in unicellular organisms. J Mol Evol 24, 28–38.
Sharp, P. M. & Li, W. H. (1987). The codon Adaptation Index: a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 15, 1281–1295.
Sharp, P. M., Tuohy, T. M. & Mosurski, K. R. (1986). Codon usage in yeast: cluster analysis clearly differentiates highly and lowly expressed genes. Nucleic Acids Res 14, 5125–5143.
Shimizu, T., Ohtani, K., Hirakawa, H. & 7 other authors (2002). Complete genome sequence of Clostridium perfringens, an anaerobic flesh-eater. Proc Natl Acad Sci U S A 99, 996–1001.
Sueoka, N. (1962). On the genetic basis of variation and heterogeneity of DNA base composition. Proc Natl Acad Sci U S A 48, 582–592.
Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673–4680.
Zavala, A., Naya, H., Romero, H. & Musto, H. (2002). Trends in codon and amino acid usage in Thermotoga maritima. J Mol Evol 54, 563–568.