Gene 238 (1999) 143–155
www.elsevier.com/locate/gene

Studies of codon usage and tRNA genes of 18 unicellular organisms and quantification of Bacillus subtilis tRNAs: gene expression level and species-specific diversity of codon usage based on multivariate analysis

Shigehiko Kanaya a,b, Yuko Yamada c, Yoshihiro Kudo a, Toshimichi Ikemura d,*

a Department of Electrical and Information Engineering, Faculty of Engineering, Yamagata University, Yonezawa, Yamagata-ken 992-8510, Japan
b Division of Physiological Genetics, Department of Ontogenetics, National Institute of Genetics, Mishima, Shizuoka-ken 411-8540, Japan
c Department of Biochemistry, Jichi Medical School, Kawachi-gun, Tochigi-ken 329-0498, Japan
d Division of Evolutionary Genetics, Department of Population Genetics, National Institute of Genetics, Mishima, Shizuoka-ken 411-8540, Japan

* Corresponding author. Tel.: +81-559-81-6788; fax: +81-559-81-6794. E-mail address: [email protected] (T. Ikemura)

Received 1 March 1999; received in revised form 7 May 1999; accepted 1 June 1999

Abstract

We examined codon usage in Bacillus subtilis genes by multivariate analysis, quantified its cellular levels of individual tRNAs, and found a clear constraint of tRNA contents on synonymous codon choice. Individual tRNA levels were proportional to the copy number of the respective tRNA genes. This indicates that the tRNA gene copy number is an important factor in determining cellular tRNA levels, as is also the case in Escherichia coli and the yeast Saccharomyces cerevisiae. Codon usage in 18 unicellular organisms whose genomes have been sequenced completely was analyzed and compared with the composition of tRNA genes. The 18 organisms are as follows: yeast S. cerevisiae, Aquifex aeolicus, Archaeoglobus fulgidus, B. subtilis, Borrelia burgdorferi, Chlamydia trachomatis, E. coli, Haemophilus influenzae, Helicobacter pylori, Methanococcus jannaschii, Methanobacterium thermoautotrophicum, Mycobacterium tuberculosis, Mycoplasma genitalium, Mycoplasma pneumoniae, Pyrococcus horikoshii, Rickettsia prowazekii, Synechocystis sp., and Treponema pallidum. Codons preferred in highly expressed genes were related to the codons optimal for the translation process, which were predicted by the composition of isoaccepting tRNA genes. Genes with specific codon usage are discussed in connection with their evolutionary origins and functions. The origin and terminus of replication could be predicted on the basis of codon usage when the usage was analyzed relative to the transcription direction of individual genes. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Bacillus subtilis; Codon usage; Principal component analysis; tRNA content

1. Introduction

With the progress of genome projects, a vast amount of nucleotide sequence data is now available, which makes it possible to study the general characteristics of codon usage for a wide range of organisms. Multivariate analyses, such as factor correspondence analysis and principal component analysis, have been used to study systematically the heterogeneous codon usage in various species (Grantham et al., 1980; Medigue et al., 1991; Pouwels and Leunissen, 1994; Andersson and Sharp, 1996; Kanaya et al., 1996a; Guerdoux-Jamet et al., 1997; Kunst et al., 1997). To characterize the species-specific heterogeneity of genes in codon usage, we have developed a measure, denoted by $Z_1$, based on the axis of widest range obtained by principal component analysis of the codon frequency patterns of genes (Kanaya et al., 1996a) and have analyzed several bacterial species (Kanaya et al., 1996b; Nakayama et al., 1999).
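To make the $Z_1$ measure more concrete, the following Python sketch walks through the procedure that Section 2.2 outlines (Eqs. 1 to 4). It is an illustration under stated assumptions, not the authors' program. The 59-codon set (stop codons, Met and Trp excluded), the use of the covariance matrix, population standard deviations, the zero-filling of absent amino acids, and the power-iteration solver (chosen only to avoid library dependencies) are all assumptions. The function names `codon_frequencies`, `first_component` and `z1_scores` are hypothetical.

```python
import math

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Group synonymous codons by amino acid.
FAMILIES = {}
for i, a in enumerate(BASES):
    for j, b in enumerate(BASES):
        for k, c in enumerate(BASES):
            FAMILIES.setdefault(AMINO[16 * i + 4 * j + k], []).append(a + b + c)
# Assumption: drop stop codons and single-codon amino acids (Met, Trp).
FAMILIES = {aa: cs for aa, cs in FAMILIES.items() if aa != "*" and len(cs) > 1}
CODONS = [c for cs in FAMILIES.values() for c in cs]  # 59 codons


def codon_frequencies(cds):
    """Eq. (2): codon count divided by the mean count within its synonymous set."""
    cds = cds.upper().replace("U", "T")
    counts = dict.fromkeys(CODONS, 0)
    for p in range(0, len(cds) - 2, 3):
        codon = cds[p:p + 3]
        if codon in counts:
            counts[codon] += 1
    x = []
    for cs in FAMILIES.values():
        total = sum(counts[c] for c in cs)
        # Assumption: an amino acid absent from the gene contributes zeros.
        x.extend(counts[c] * len(cs) / total if total else 0.0 for c in cs)
    return x


def first_component(X, iters=500):
    """First principal axis of the gene-by-codon matrix X via power iteration."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - means[j] for j in range(m)] for row in X]
    b = [1.0 / (j + 1) for j in range(m)]  # arbitrary deterministic start
    for _ in range(iters):
        scores = [sum(r[j] * b[j] for j in range(m)) for r in Xc]
        b_new = [sum(Xc[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in b_new))
        if norm == 0.0:  # no variance left to follow
            break
        b = [v / norm for v in b_new]
    return b


def z1_scores(X):
    """Return (Z1 per gene, coefficients b1, factor loadings per codon)."""
    b1 = first_component(X)
    z_raw = [sum(x * b for x, b in zip(row, b1)) for row in X]  # Eq. (1)
    n = len(z_raw)
    av = sum(z_raw) / n
    sd = math.sqrt(sum((z - av) ** 2 for z in z_raw) / n)
    z1 = [(z - av) / sd for z in z_raw]  # Eq. (3)
    loadings = []
    for j in range(len(b1)):  # Eq. (4): correlation of Z'1 with each codon
        col = [row[j] for row in X]
        cav = sum(col) / n
        cov = sum((z - av) * (c - cav) for z, c in zip(z_raw, col)) / n
        var = sum((c - cav) ** 2 for c in col) / n
        loadings.append(cov / (sd * math.sqrt(var)) if var > 0 else 0.0)
    return z1, b1, loadings


# Made-up toy sequences, only to show the call pattern.
toy = ["TTTTTTAAAAAAGATGAT", "TTCTTCAAGAAGGACGAC",
       "TTTTTCAAAAAGGATGAC", "TTCTTCAAGAAAGACGAT"]
z1, b1, loadings = z1_scores([codon_frequencies(g) for g in toy])
print([round(z, 2) for z in z1])
```

The sign of a principal axis is arbitrary, so in practice the axis would still need to be oriented, for example so that ribosomal protein genes score positive as in Fig. 1.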
0378-1119/99/$ – see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S0378-1119(99)00225-5

The relative proportions of isoaccepting tRNAs in cells are important factors that determine synonymous codon choice in individual genes. Codon choice patterns in unicellular organisms such as Escherichia coli and Saccharomyces cerevisiae have been studied extensively, revealing that codon usage in highly expressed genes is typically dependent on tRNA content (Ikemura, 1981a, b, 1982). Codons that were optimal for the translation process (optimal codons) were assigned for E. coli and S. cerevisiae based on the cellular contents of isoacceptors and the strength of interaction between anticodon and codon. The extent of codon bias for each gene has been associated with the level of protein production in E. coli and S. cerevisiae (Ikemura, 1981a, b, 1982, 1985a, b; Gouy and Gautier, 1982; Medigue et al., 1991; reviewed in Andersson and Kurland, 1990; Sharp and Matassi, 1994) and in T4 bacteriophage (Kunisawa, 1992). Levels of optimal codon use in E. coli genes and their protein production levels were correlated with the aforementioned $Z_1$ (Kanaya et al., 1996a). In the present study, we examined codon usage in Bacillus subtilis genes with multivariate analysis, and determined the optimal codons on the basis of its cellular tRNA contents.

As of 1998, the complete genomic sequences of the following 18 unicellular species have been determined: Aquifex aeolicus (Deckert et al., 1998), Archaeoglobus fulgidus (Klenk et al., 1997), B. subtilis (Kunst et al., 1997), Borrelia burgdorferi (Fraser et al., 1997), Chlamydia trachomatis (Stephens et al., 1998), E.
coli (Blattner et al., 1997), Haemophilus influenzae (Fleischmann et al., 1995), Helicobacter pylori (Tomb et al., 1997), Methanococcus jannaschii (Bult et al., 1996), Methanobacterium thermoautotrophicum (Smith et al., 1997), Mycobacterium tuberculosis (Cole et al., 1998), Mycoplasma genitalium (Fraser et al., 1995), Mycoplasma pneumoniae (Himmelreich et al., 1996), Pyrococcus horikoshii (Kawarabayasi et al., 1998), Rickettsia prowazekii (Andersson et al., 1998), Synechocystis sp. (Kaneko et al., 1996), Treponema pallidum (Fraser et al., 1998), and S. cerevisiae (Mewes et al., 1997). Codon usage heterogeneities for these species were related to the level of optimal codon use predicted from the copy numbers of isoaccepting tRNA genes.

2. Materials and methods

2.1. Quantification of B. subtilis tRNAs

To purify and quantify B. subtilis tRNAs, two-dimensional polyacrylamide gel electrophoresis was performed in two different ways (A and B systems) as described in Ikemura (1989). The first dimension of electrophoresis was done in 0.3×TBE on 10% acrylamide gels for system A and on 14% acrylamide gels for system B. The second dimension was separated on 22% acrylamide gels with 7 M urea for both systems. The separated molecules were assigned to known tRNAs by RNase fingerprinting and their relative contents were measured, as described by Ikemura and Ozeki (1977).

2.2. Computer analysis of codon usage

We characterized the latent structure of species-specific codon usage heterogeneity in genes with known functions based on principal component analysis. Since the methodology has been described previously (Kanaya et al., 1996a), we provide only an outline here. A measure based on the widest scale of gene distribution in codon frequency space was constructed by principal component analysis, and this scale is denoted by $Z_1$ in Eq. (3). The procedure for estimating $Z_1$ is as follows. First, $Z'_1$ was estimated by principal component analysis for a data set consisting of the codon usage patterns of genes for a species:

$$Z'_1 = \mathbf{b}_1 \cdot \mathbf{X}, \qquad (1)$$

where $\mathbf{b}_1$ is a vector consisting of the first principal component coefficients $b_{1j}$ ($j = 1, 2, \ldots, M$). Here, $M$ represents the total codon number, and $\mathbf{X}$ is the codon usage vector for a gene. To exclude effects of gene size, amino acid composition, and codon box number, the codon frequency $x_{i(m)}$ for the $i(m)$th codon was calculated with Eq. (2):

$$x_{i(m)} = f_{i(m)} \Big/ \left[ \sum_{i=1}^{M(m)} f_{i(m)} \Big/ M(m) \right]. \qquad (2)$$

Here, $f_{i(m)}$ denotes the number of occurrences of the $i(m)$th synonymous codon for the $m$th amino acid, and $M(m)$ denotes its codon box number. The variable $Z'_1$ was normalized by Eq. (3):

$$Z_1 = (Z'_1 - \mathrm{Av}[Z'_1]) / \mathrm{SD}[Z'_1]. \qquad (3)$$

Here, $\mathrm{Av}[Z'_1]$ and $\mathrm{SD}[Z'_1]$ are the average and standard deviation of $Z'_1$ for the data set. With this scaling, we obtained statistical information for genes on the principal component axis. For example, if a gene has a $Z_1$ larger than 2, its codon usage is very biased relative to the other genes of the species because, theoretically (under a normal distribution), only 2.3% of genes are included in this interval. The contribution of the $i$th codon frequency to the first principal component is represented by the factor loadings in Eq. (4):

$$r(Z'_1, X_i) = \mathrm{Cov}[Z'_1, X_i] / (\mathrm{Var}[Z'_1]\,\mathrm{Var}[X_i])^{1/2}. \qquad (4)$$

Here, $\mathrm{Cov}[A, B]$ and $\mathrm{Var}[A]$ denote the covariance between two variables, $A$ and $B$, and the variance of $A$, respectively.

3. Results and discussion

3.1. Synonymous codon choice in B. subtilis genes is constrained by translation efficiency

Fig. 1. Gene distribution in $Z_1$. All, all genes analyzed; Rp, genes encoding ribosomal proteins; Rp(mit) and Rp(cyt), nuclear ribosomal protein genes used in mitochondria and cytoplasm in S. cerevisiae; Spo, sporulation genes; Pro, prophage genes; Ret, retrotransposable elements; TN, transposable elements; cag, genes in cag cluster; Chl, 210 Synechocystis genes with homologues in any chloroplast; PKS, genes encoding polyketide synthetases; Mit I and II, R. prowazekii genes with homologues in R. americana mitochondrial genome and human mitochondrial genome.

Fig. 2. Factor loadings. Red and blue marks represent optimal codons predicted by the four rules and by Bennetzen and Hall (1982), respectively. Abbreviations of species are as follows: Aa, A. aeolicus; Af, A. fulgidus; Bs, B. subtilis; Bb, B. burgdorferi; Ct, C. trachomatis; Ec, E. coli; Hi, H. influenzae; Hp, H. pylori; Mj, M. jannaschii; Mt, M. thermoautotrophicum; Mb, M. tuberculosis; Mg, M. genitalium; Mp, M. pneumoniae; Ph, P. horikoshii; Rp, R. prowazekii; Ss, Synechocystis sp.; Tp, T. pallidum; Sc, S. cerevisiae. These abbreviations are also used in other figures. Bimodal distributions of $Z_1$ for all genes of B. burgdorferi and C. trachomatis are observed (Fig. 1N and O). Ribosomal protein genes are concentrated in the positive $Z_1$ peak (red histogram), and are encoded on the leading strands (see also Fig. 6A, B). Codons contributing positively to $Z_1$ do not correspond to the optimal codons predicted by Rule 4 or the rule proposed by Bennetzen and Hall (1982). Asymmetric base composition between the leading and lagging strands (Lobry, 1996; McInerney, 1998) may affect $Z_1$ more significantly for these two organisms than for others.

Fig. 1 shows that ribosomal protein genes, which were assumed to be the constitutively highly expressed genes and used to obtain the codon adaptation index (Sharp and Li, 1987; Nakamura and Tabata, 1997), are distributed in the positive region of $Z_1$ in B. subtilis (red histogram in Fig. 1A). The cause of the biased use among synonymous codons can be explained by the factor loadings in Fig.
2; i.e., genes composed of codons with positive factor loadings have positive $Z_1$ values, and genes composed of codons with negative factor loadings have negative $Z_1$ values. We investigated this bias with respect to optimization for the translation process. Ikemura (1985a) proposed four rules for assigning the optimal codons of E. coli and S. cerevisiae. Codon choices are constrained by the cellular amounts of isoaccepting tRNAs (Rule 1); modified uridines such as thiolated uridine and 5-carboxymethyluridine at the anticodon wobble position produce a preference for A over G at the codon third position (Rule 2); an inosine at the anticodon wobble position produces a preference for U or C over A at the third position (Rule 3); and in two-codon sets of the (A/U)-(A/U)-pyrimidine type, C rather than U at the third position promotes an optimal interaction strength between codon and anticodon (Rule 4; Grosjean and Fiers, 1982).

B. subtilis tRNAs were separated by two-dimensional polyacrylamide gel electrophoresis (Fig. 3) and assigned to known tRNAs, and their relative contents were quantified (Table 1). Based on the tRNA quantification data and the four rules listed above, the optimal codons in B. subtilis were estimated (Table 1). Fig. 2 shows that most of the optimal codons (marked in red) contribute positively to $Z_1$, and Fig. 1 shows that most ribosomal protein genes have positive $Z_1$ values and thus use the optimal codons. This is fully consistent with previous findings that synonymous codon choice in E. coli and S. cerevisiae genes is constrained by translation efficiency (Ikemura, 1985a). Highly expressed genes of these organisms are almost always more dependent on the tRNA content and tend to have a strong bias of codon usage. This common characteristic among the three organisms should reflect the fact that, of all the cellular processes, the greatest amounts of energy and mass are required for translation (Maaløe, 1976).

3.2.
Copy number of tRNA genes and optimal codons

Fig. 4A illustrates the correlation between B. subtilis tRNA contents and tRNA gene copy numbers. Here, the tRNAs with the same anticodon were grouped into a single isoacceptor species regardless of minor base differences in other positions. The correlation coefficient was 0.86, demonstrating a clear gene-dosage effect on tRNA contents. This is consistent with the previous findings for E. coli (Ikemura, 1981a; Dong et al., 1996) and S. cerevisiae (Percudani et al., 1997), and indicates a possibility that the tRNA contents of other species can be estimated from the tRNA gene copy number. If so, this could be a valuable tool to study genome properties and codon usage for a wide range of species whose genomes have been sequenced completely.

Fig. 5 shows the composition of tRNA genes for the 18 unicellular species. In H. influenzae, genes for the isoacceptors for 14 amino acids are multiplied. The respective tRNA genes are also multiplied in B. subtilis, but three of them are not in E. coli (Fig. 5), indicating that the multiplication pattern of H. influenzae is more similar to that of B. subtilis, which is phylogenetically more distant from H. influenzae than E. coli is. The GC contents of H. influenzae and B. subtilis (38.1 and 43.5%, respectively) are lower than that of E. coli (50.8%), and this AT-richness may require multiplication of tRNA species with anticodons matching the codons with A or U at the third position. Interestingly, the factor loading of H. influenzae is most similar to that of B. subtilis (r=0.77) among the 18 unicellular species (Fig. 2). Amplification of isoacceptor genes and genome GC content have presumably coevolved. Codons denoted by red marks in the H. influenzae column (Fig. 2) are the optimal codons predicted on the basis of the copy number of tRNA genes.

Fig. 3. Two-dimensional gel separation of 32P-labeled B. subtilis tRNAs. (A, B) Systems A and B described in Materials and methods. Spot numbers assigned to known tRNAs in each system are presented in Table 1.

Table 1
tRNA genes in B. subtilis genome

Amino acid  Anticodon  Gene copy No.  tRNA content  Codon recognized
Arg         ACG        4              0.48          CGC, CGU, CGA
Arg         CCG        1              nd            CGG
Arg         CCU        1              nd            AGG
Arg         UCU        1              0.13          AGA
Leu         UAA        3              0.41          UUA, UUG
Leu         CAA        1              0.22          UUG
Leu         GAG        1              0.20          CUC, CUU
Leu         UAG        2              0.20          CUA, CUG
Leu         CAG        1              0.15          CUG
Ser         UGA        2              0.20          UCA, UCG, UCU
Ser         GGA        1              0.13          UCC, UCU
Ser         GCU        2              0.20          AGC, AGU
Ala         UGC        5              1.24          GCA, GCG, GCU
Ala         GGC        1              0.23          GCC, GCU
Gly         UCC        3              0.66          GGA, GGG, GGU
Gly         GCC        4              0.85          GGC, GGU
Pro         UGG        3              1.12          CCA, CCG, CCU
Thr         UGU        4              1.19          ACA, ACG, ACU
Thr         GGU        1              0.16          ACC, ACU
Val         UAC        4              0.90          GUA, GUG, GUU
Val         GAC        1              0.42          GUC, GUU
Ile         GAU        3              1.42          AUC, AUU
Ile         CAU        1              0.19          AUA
Asn         GUU        4              1.20          AAC, AAU
Asp         GUC        4              1.31          GAC, GAU
Cys         GCA        1              np            UGC, UGU
Gln         UUG        4              np            CAA, CAG
Glu         UUC        6              1.52          GAA, GAG
His         GUG        2              np            CAC, CAU
Lys         UUU        4              0.73          AAA, AAG
Phe         GAA        3              0.75          UUC, UUU
Tyr         GUA        2              0.38          UAC, UAU
Trp         CCA        1              0.15          UGG
Met         CAU        2              0.54          AUG
fMet        CAU        3              1.00          AUG

Modified at the wobble position (a), in table order (row assignment not recoverable from the source): I, mnm5U, *, *, mo5U, mo5U, cmnm5U, mo5U, mo5U, mo5U, L, Q, *, *, Q, cmnm5s2U, Gm, Q

Gel spot ID, systems A and B, in table order (row assignment not recoverable from the source): 46 55 25 34 52 43 53 56 54, 55 15 16 51 32 42 31, 41 24 13 10 11 23 22, 26 45 52 51 63 56 61, 62 22 53 34 31 36 14 12 11, 13 16, 21, 33 42 33 43 45 12 24, 25 26 46 23 35 32 21 36, 35 14

(a) Modified at the wobble position (Sprinzl et al., 1998; Y. Yamada, unpublished); mo5U, 5-methoxyuridine; I, inosine; cmnm5U, 5-carboxymethylaminomethyluridine; cmnm5s2U, 5-carboxymethylaminomethyl-2-thiouridine; Q, queuosine; Gm, 2′-O-methylguanosine; mnm5U, 5-methylaminomethyluridine; L, lysidine; *, unidentified modified base. The content for tRNA fMet was normalized to 1.0; nd, not determined, presumably because of low contents; np, not obtained in a pure form. Optimal codons are underlined in the "Codon recognized" column of the original table. Detailed procedures for assigning optimal codons were described in Ikemura (1981b).
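As a rough check of the gene-dosage relation described for Fig. 4A, the copy numbers and relative contents listed in Table 1 can be correlated directly. The sketch below is not the authors' calculation. It assumes that the flattened table columns align row by row as transcribed here, it omits the five isoacceptors reported as nd or np, and the helper name `pearson` is hypothetical. If the transcription is faithful, the coefficient should come out close to the reported 0.86, although which entries the authors included is not stated.

```python
import math

# (isoacceptor, tRNA gene copy number, relative tRNA content), transcribed
# from Table 1 on the assumption that its columns align row by row.
# Rows with content "nd" or "np" are left out.
TABLE1 = [
    ("Arg ACG", 4, 0.48), ("Arg UCU", 1, 0.13),
    ("Leu UAA", 3, 0.41), ("Leu CAA", 1, 0.22), ("Leu GAG", 1, 0.20),
    ("Leu UAG", 2, 0.20), ("Leu CAG", 1, 0.15),
    ("Ser UGA", 2, 0.20), ("Ser GGA", 1, 0.13), ("Ser GCU", 2, 0.20),
    ("Ala UGC", 5, 1.24), ("Ala GGC", 1, 0.23),
    ("Gly UCC", 3, 0.66), ("Gly GCC", 4, 0.85),
    ("Pro UGG", 3, 1.12),
    ("Thr UGU", 4, 1.19), ("Thr GGU", 1, 0.16),
    ("Val UAC", 4, 0.90), ("Val GAC", 1, 0.42),
    ("Ile GAU", 3, 1.42), ("Ile CAU", 1, 0.19),
    ("Asn GUU", 4, 1.20), ("Asp GUC", 4, 1.31),
    ("Glu UUC", 6, 1.52), ("Lys UUU", 4, 0.73),
    ("Phe GAA", 3, 0.75), ("Tyr GUA", 2, 0.38), ("Trp CCA", 1, 0.15),
    ("Met CAU", 2, 0.54), ("fMet CAU", 3, 1.00),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


copies = [row[1] for row in TABLE1]
contents = [row[2] for row in TABLE1]
r = pearson(copies, contents)
print(f"n = {len(TABLE1)} isoacceptors, r = {r:.2f}")
```

The same function could be applied to tRNA gene counts from other genomes, which is the use the text proposes when measured tRNA contents are unavailable.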
The predicted optimal codons clearly correspond to preferred codons in ribosomal protein genes in H. influenzae, demonstrating the validity of the prediction.

The copy number of rRNA genes varies between species. Multiplication of tRNA or rRNA genes can increase the level of the respective gene product, circumventing the limited capacity of a single promoter (Nomura and Morgan, 1977). To support efficient protein synthesis, the genes for tRNAs and rRNAs should have concordantly increased in their copy numbers. Indeed, the copy number of rRNA genes is correlated closely with the total number of tRNA genes (Fig. 4B).

In most species we analyzed, ribosomal protein genes tend to have positive $Z_1$ values (Fig. 1), but this tendency decreases when the gene copy number for individual isoacceptors decreases. In other words, the tendency is weaker in organisms where only one gene encodes each isoacceptor species (Figs. 4C and 5). Even in such organisms, tRNA sets differ between organisms (Fig. 5). Bacteria with high GC contents such as M. tuberculosis and T. pallidum tend to have additional tRNA species that respond solely to a G or C at the codon third position; i.e., tRNA with C or G at the anticodon first position. Organisms with high GC contents usually require a larger number of isoacceptors than organisms with low GC contents. Therefore, the number of tRNA species is correlated with GC content (Fig. 4D). However, in Micrococcus luteus, which has an extremely high GC content (74 G+C%), tRNAs with the UNN anticodon are lost, and some NNA-type codons are absent in the protein genes (Kano et al., 1991). These observations demonstrate that genome GC content affects not only codon usage (Sueoka, 1962; Bernardi and Bernardi, 1985; Muto and Osawa, 1987; Sueoka, 1988), but also tRNA composition (Osawa, 1995). In M. luteus and M. capricolum, while each isoacceptor species is encoded primarily by a single gene, the relative levels of the isoacceptors are known to be different and clearly correlated with codon usage patterns (Kano et al., 1991; Yamao et al., 1991). This suggests that codon usage, even in bacteria where each isoacceptor is encoded by a single gene, is constrained by translation efficiency.

Fig. 4. (A) Relation between tRNA contents and tRNA gene number of B. subtilis. The level for tRNA fMet was normalized to 1.0. (B) Relation between rRNA gene number and total number of tRNA genes. Correlation coefficient between rRNA gene number and the square of the total tRNA gene number was calculated with or without the S. cerevisiae data. (C) Relation between total number of tRNA genes and bias of ribosomal proteins in $Z_1$. The $Z_1$ bias for ribosomal proteins represents the average of differences between $Z'_1$ for ribosomal proteins and the average of $Z'_1$ for all genes; and log (number of tRNA genes) represents the common logarithm of the number of tRNA genes. (D) Relation between genome GC content and number of isoaccepting tRNA species.

The cellular tRNA levels have been quantified only for a few of the aforementioned unicellular organisms, and therefore the optimal codons based on Rules 1–3 cannot be deduced in most cases. Rule 4, however, is independent of tRNA content and thus is generally applicable; AUC, AAC, UUC, and UAC are the optimal codons for Ile, Asn, Phe, and Tyr, respectively. Interestingly, all four codons contributed positively to $Z_1$ (denoted by red marks in Fig. 2) in most organisms examined. This should not be due to genome GC content because, out of the 18 organisms examined, 12 have an AT-rich genome and only one has a GC-rich genome (Fig. 5).

Fig. 5. Compositions of tRNA genes. The rRNA-based phylogenetic tree reported by Pace (1997) and Olsen et al. (1994) is depicted in the column of Phylogeny. Of the tRNAs annotated in the GenBank/EMBL/DDBJ database, a few tRNAs whose clover-leaf structures could not be built were removed from this analysis. Total number of tRNA genes in parentheses represents that of isoacceptor types. Results for Mycoplasma capricolum tRNAs (denoted by Mc) are obtained from Yamao et al. (1991) and Andachi et al. (1989).

Fig. 6. Profiles of $Z_1$ along individual DNA strands for (A) B. burgdorferi, (B) C. trachomatis, (C) B. subtilis, and (D) E. coli. Blue and red represent $Z_1$ for genes on Watson and Crick strands. Origin, replication origin; arrows, the direction of replication. In B. subtilis and E. coli, smoothing of $Z_1$ for ±10 contiguous genes was conducted.

By combining Rules 1–4, a total of 117 optimal codons could be predicted for 16 organisms; B. burgdorferi and C. trachomatis are discussed separately in the legend to Fig. 2. Of the 117 optimal codons, 110 contribute positively to $Z_1$ and only 7 contribute negatively to $Z_1$ (red marks in Fig. 2). Some codons that were not assigned as optimal codons contributed positively to $Z_1$. One trivial case is that the respective isoacceptor was not quantified, either because the isoacceptor was not purified or because the sequence was not known. Another case is that there are two major isoacceptors with an equivalent abundance (e.g., B. subtilis tRNASer), and therefore the optimal codon for the amino acid could not be assigned. To explain codon usage patterns of a wide range of species, Ikemura and Ozeki (1982) summarized a total of eight rules, some of which are not related directly to translation efficiency and are applicable only to a limited class of species. Preferred codons that were not assigned as optimal codons may relate to such rules, which are not connected with translation efficiency. The point to be stressed in the present study is that most of the optimal codons for the examined species contribute positively to $Z_1$. This demonstrates that codon usage in most bacteria, if not all, is constrained by translation efficiency. This finding is again consistent with the fact that, of all the cellular processes, the greatest amounts of energy and mass are required for translation (Maaløe, 1976). Recently, the four rules related to translation efficiency have been successfully applied for determining optimal codons in Drosophila melanogaster (Moriyama and Powell, 1997). Bennetzen and Hall (1982) proposed another rule concerning codon preference: a preference for codons that can form a standard Watson–Crick base pair at the codon third position over those that would require wobble pairing. This preference is seen in the codon choice of Asp and His (blue marks in Fig. 2) in most organisms.

It is worth noting the correlation between $Z_1$ values and the gene attrition process from ancestral to present-day chloroplast and mitochondrial genomes. In the cyanobacterium Synechocystis sp., the genes involved in photosynthesis and those for which homologues are present in present-day chloroplasts are of particular interest, because chloroplasts evolved from cyanobacteria-like endosymbionts (Gray, 1992). Martin et al. (1998) illustrated the process of gene attrition from ancestral to present-day chloroplasts, focusing on 210 Synechocystis genes and their homologues present in a wide range of chloroplast genomes. Interestingly, the Synechocystis genes with homologues in chloroplast genomes tend to have positive $Z_1$ values (yellow histogram in Fig. 1J). These findings support the view that codon usage in Synechocystis genes with homologues in the present-day chloroplast genomes has been optimized for translational efficiency, presumably during the entire course of evolution. This view also appears applicable to Rickettsia genes.
Mitochondria evolved from eubacteria-like endosymbionts (Gray, 1992) whose closest known relative is Rickettsia according to rRNA-based phylogenetic analyses (Yang et al., 1985; Olsen et al., 1994). The mitochondrial genome in the freshwater protozoon Reclinomonas americana resembles most closely the ancestral proto-mitochondrial genome and is assumed to be a primitive mitochondrial genome (Lang et al., 1997). The genome size (69 kbp) is four times larger than that of vertebrate mitochondrial genomes (16–18 kbp). Rickettsia prowazekii seems to be the closest relative of R. americana mitochondria (Yang et al., 1985; Olsen et al., 1994). R. prowazekii genes with homologues in the mitochondrial genes tend to have positive $Z_1$ values (green histograms in Fig. 1P). This supports the view that the genes with optimal codons in the R. prowazekii genome tend to have been retained in present-day mitochondrial genomes during evolution.

3.3. Genes with negative $Z_1$

In the above sections, we primarily considered positive $Z_1$ values and highly expressed genes. We extended our focus to genes with negative $Z_1$ values in connection with the life-form of individual organisms and gene functions. Horizontally transferred genes, such as transposable elements in bacteria, retrotransposable elements of yeast, and prophage genes in B. subtilis, tend to have negative $Z_1$ values (blue histograms denoted by 'TN', 'Ret', 'Pro' in Fig. 1). A negative $Z_1$ was also obtained for the cag cluster in H. pylori (a pathogenicity island containing genes that stimulate production of interleukin-8 by gastric epithelial cells) (Censini et al., 1996; 'cag' in Fig. 1I), for most genes in flagellar operons in E. coli, T. pallidum, and A. aeolicus, and for genes associated with type III protein secretion in C. trachomatis. Collectively, codon usage in horizontally transferred genes and in those involved in pathogenicity is clearly distinct from that in highly expressed intrinsic genes.

In B. subtilis, over 100 genes are involved in the sporulation process (Stragier and Losick, 1996). The expression of these genes is regulated by six sigma factors, σA, σH, σE, σF, σG, and σK. Sporulation genes regulated by sporulation-specific σ factors (σE, σF, σG, and σK) tend to have negative $Z_1$ values ('Spo' in Fig. 1A) and do not use the optimal codons identified on the basis of the present tRNA contents, except for four genes: sspC, sspB, and sspG regulated by σG, and cotG regulated by σK, show very biased codon usage ($Z_1$ > 2.0). The four gene products are required in large amounts for spore construction, and therefore the optimal codons may be used preferentially. This is also the case for flagellar synthesis genes. In B. subtilis, the flagellin gene hag uses optimal codons preferentially, although most other genes involved in flagellar synthesis tend to have negative $Z_1$ values. In E. coli, the flagellin gene, which is required in large amounts to assemble flagellar filaments (Macnab, 1987), has a high $Z_1$, although other genes have negative $Z_1$ values. For M. tuberculosis, foreign-type genes (Fig. 1J), such as those for polyketide synthesis (Hopwood, 1997), prophages, and transposable elements, tend to have negative $Z_1$ values.

In S. cerevisiae, sporulation and germination genes also tend to have negative $Z_1$ values ('Spo' in Fig. 1B). Nuclear genes for ribosomal proteins used in mitochondria tend to have negative $Z_1$ values ('Rp(mit)' in Fig. 1B) in common with retrotransposable elements ('Ret'), although genes for ribosomal proteins used in the cytoplasm tend to have high $Z_1$ values (red histogram). This may be explained by the fact that the former ribosomal genes are foreign-type for the yeast nuclear genome and/or by their lower expression levels.

3.4. Prediction of replication origin and terminus by $Z_1$

In E. coli, B. subtilis, and B. burgdorferi, highly expressed genes are usually oriented along the genome so that transcription is in the same direction as replication (Brewer, 1988; Ziegler and Dean, 1990; McInerney, 1998). Therefore, replication origins may be detected by $Z_1$-distribution profiles along the reported genome sequence when the direction of transcription of individual genes is taken into account. Genes transcribed in the same direction as the leading strand progression may have positive $Z_1$ values, and genes transcribed in the opposite direction to the leading strand progression may have negative $Z_1$ values. If so, a switch of $Z_1$ from positive to negative in one direction and a switch from negative to positive in the other direction should be observed at the replication origin. In the linear chromosome of B. burgdorferi, a switch of the $Z_1$-profile was observed at the position of 0.46 Mbp (Fig. 6A); the $Z_1$-profile on one strand switches from negative to positive (red marks) and on the other strand switches from positive to negative (blue marks). This switch site corresponds to the replication origin reported by Fraser et al. (1997). In the $Z_1$-profiles of the circular chromosome of C. trachomatis, the replication origin and the terminus appear to be located at 0.72 Mbp and 0.2 Mbp, respectively (Fig. 6B). The fact that the replication origin is annotated as being between 71 998 and 72 058 bp in the database shows the validity of the present method. In B. subtilis, a similarly clear switch was also observed near the origin when profiles were analyzed by smoothing $Z_1$ for contiguous genes (Fig. 6C). However, in E. coli the strand asymmetry was observed only in the close vicinity of the replication origin (Fig. 6D). The fact that the asymmetry in B. subtilis is more evident than that in E. coli may relate to the previous finding that genetic organization near the replication origin may have kept a more ancient pattern in B. subtilis than in E. coli; i.e., the genetic organization of E. coli has undergone gross changes by duplications and translocations during evolution (Ogasawara et al., 1985; Bachmann, 1983). This property of $Z_1$ should be useful to identify the replication origin in a wide range of genomes, and presumably the terminus in small genomes.

Acknowledgements

This work was supported by a Grant-in-Aid for Scientific Research from the Ministry of Education, Science and Culture of Japan. The computers at the DDBJ and the Human Genome Center of Japan were used. The authors are very grateful to Ms Kazuko Suzuki for technical assistance.

References

Andachi, Y., Yamao, F., Muto, A., Osawa, S., 1989. Codon recognition patterns as deduced from sequences of the complete set of transfer RNA species in Mycoplasma capricolum. J. Mol. Biol. 209, 37–54.
Andersson, S.G., Kurland, C.G., 1990. Codon preferences in free-living microorganisms. Microbiol. Rev. 54, 198–210.
Andersson, S.G.E., Sharp, P.M., 1996. Codon usage in the Mycobacterium tuberculosis complex. Microbiology 142, 915–925.
Andersson, S.G.E., Zomorodipour, A., Andersson, J.O., Sicheritz-Ponten, T., Alsmark, U.C.M., Podowski, R.M., Naslund, A.K., Eriksson, A., Winkler, H.H., Kurland, C.G., 1998. The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396, 133–140.
Bachmann, B.J., 1983. Linkage map of Escherichia coli K-12, Edition 7. Microbiol. Rev. 47, 180–230.
Bennetzen, J.L., Hall, B.D., 1982. Codon selection in yeast. J. Biol. Chem. 257, 3026–3031.
Bernardi, G., Bernardi, G., 1985. Codon usage and genome composition. J. Mol. Evol. 22, 363–365.
Blattner, F.R., Plunkett III, G., Bloch, C.A., et al., 1997. The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1462.
Brewer, B.J., 1988. When polymerases collide: replication and the transcriptional organization of the E. coli chromosome. Cell 53, 679–686.
Bult, C.J., White, O., Olsen, G.J., et al., 1996. Complete genome sequence of the methanogenic Archaeon, Methanococcus jannaschii. Science 273, 1058–1073.
Censini, S., Lange, C., Xiang, Z., Crabtree, J.E., Ghiara, P., Borodovsky, M., Rappuoli, R., Covacci, A., 1996. cag, a pathogenicity island of Helicobacter pylori, encodes type I-specific and disease-associated virulence factors. Proc. Natl. Acad. Sci. USA 93, 14648–14653.
Cole, S.T., Brosch, R., Parkhill, J., et al., 1998. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393, 537–544.
Deckert, G., Warren, P.V., Gaasterland, T., et al., 1998. The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature 392, 353–358.
Dong, H., Nilsson, L., Kurland, C.G., 1996. Co-variation of tRNA abundance and codon usage in Escherichia coli at different growth rates. J. Mol. Biol. 260, 649–663.
Fleischmann, R.D., Adams, M.D., White, O., et al., 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496–512.
Fraser, C.M., Casjens, S., Huang, W.M., et al., 1997. Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature 390, 580–586.
Fraser, C.M., Gocayne, J.D., White, O., et al., 1995. The minimal gene complement of Mycoplasma genitalium. Science 270, 397–403.
Fraser, C.M., Norris, S.J., Weinstock, G.M., et al., 1998. Complete genome sequence of Treponema pallidum, the syphilis spirochete. Science 281, 375–388.
Grantham, R., Gautier, C., Gouy, M., Mercier, R., Pave, A., 1980. Codon catalog usage and the genome hypothesis. Nucleic Acids Res. 8, r49–r62.
Gray, M.W., 1992. The endosymbiont hypothesis revisited. Int. Rev. Cyt. 141, 233–357.
Gouy, M., Gautier, C., 1982. Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 10, 7055–7074.
Grosjean, H., Fiers, W., 1982. Preferential codon usage in prokaryotic genes: the optimal codon-anticodon interaction energy and the selective codon usage in efficiently expressed genes. Gene 18, 199–209.
Guerdoux-Jamet, P., Henaut, A., Nitschke, P., Risler, J., Danchin, A., 1997. Using codon usage to predict genes origin: is the Escherichia coli outer membrane a patchwork of products from different genomes? DNA Res. 4, 257–265.
Himmelreich, R., Hilbert, H., Plagens, H., Pirkl, E., Li, B., Herrmann, R., 1996. Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae. Nucleic Acids Res. 24, 4420–4449.
Hopwood, D.A., 1997. Genetic contributions to understanding polyketide synthases. Chem. Rev. 97, 2465–2497.
Ikemura, T., 1981a. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes. J. Mol. Biol. 146, 1–21.
Ikemura, T., 1981b. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J. Mol. Biol. 151, 389–409.
Ikemura, T., 1982. Correlation between the abundance of yeast transfer RNAs and the occurrence of the respective codons in protein genes: differences in synonymous codon choice patterns of yeast and Escherichia coli with reference to the abundance of isoaccepting transfer RNAs. J. Mol. Biol. 158, 573–597.
Ikemura, T., 1985a. Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2, 13–34.
Ikemura, T., 1985b. Codon usage, tRNA content and rate of synonymous substitution. In: Ohta, T., Aoki, K. (Eds.), Population Genetics and Molecular Evolution. Japan Science Society Press, pp. 385–406.
Ikemura, T., 1989. Purification of RNA molecules by gel techniques. Methods Enzymol. 180, 14–25.
Ikemura, T., Ozeki, H., 1977. Gross map location of Escherichia coli transfer RNA genes. J. Mol. Biol. 117, 419–446.
Ikemura, T., Ozeki, H., 1982. Codon usage and transfer RNA contents: organism-specific codon-choice patterns in reference to the isoacceptor contents. Cold Spring Harbor Symp. Quant. Biol. 47, 1087–1097.
Kanaya, S., Kudo, Y., Nakamura, Y., Ikemura, T., 1996a. Detection of genes in Escherichia coli sequences determined by genome projects and prediction of protein production levels, based on multivariate diversity in codon usage. CABIOS 12, 213–225.
Kanaya, S., Kudo, Y., Suzuki, S., Ikemura, T., 1996b. Systematization of species-specific diversity of genes in codon usage: comparison of the diversity among bacteria and prediction of the protein production levels in cells. In: Akutsu, T., et al. (Eds.), Genome Informatics Series No. 7. Universal Academy Press, Tokyo, pp. 61–71.
Kaneko, T., Sato, S., Kotani, H., et al., 1996. Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res. 3, 109–136.
Kano, A., Andachi, Y., Ohama, T., Osawa, S., 1991. Novel anticodon composition of transfer RNAs in Micrococcus luteus, a bacterium with a high genomic G+C content. Correlation with codon usage. J. Mol. Biol. 221, 387–401.
Kawarabayasi, Y., Sawada, M., Horikawa, H., et al., 1998. Complete sequence and gene organization of the genome of a hyper-thermophilic Archaebacterium, Pyrococcus horikoshii OT3. DNA Res. 5, 55–76.
Klenk, H., Clayton, R.A., Tomb, J., et al., 1997. The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature 390, 364–370.
Kunisawa, T., 1992. Synonymous codon preferences in bacteriophage T4: a distinctive use of transfer RNAs from T4 and from its host Escherichia coli. J. Theor. Biol. 159, 287–298.
Kunst, F., Ogasawara, N., Moszer, I., et al., 1997. The complete genome sequence of the Gram-positive bacterium Bacillus subtilis. Nature 390, 249–256.
Lang, B.F., Burger, G., O’Kelly, C.J., Cedergren, R., Golding, G.B., Lemieux, C., Sankoff, D., Turmel, M., Gray, M.W., 1997. An ancestral mitochondrial DNA resembling a eubacterial genome in miniature. Nature 387, 493–497.
Lobry, J.R., 1996. Asymmetric substitution patterns in the two DNA strands of bacteria. Mol. Biol. Evol. 13, 660–665.
Maaløe, O., 1976. Past, present and future trends. In: Kjeldgaard, N.C., Maaløe, O. (Eds.), Control of Ribosome Synthesis, Alfred Benzon Symposium IX. Munksgaard, Copenhagen, pp. 15–21.
Macnab, R.M., 1987. Flagella. In: Neidhardt, F.C., Ingraham, J.L., Low, K.B., Magasanik, B., Schaechter, M., Umbarger, H.E. (Eds.), Escherichia coli and Salmonella typhimurium, vol. 1. American Society for Microbiology, pp. 70–83.
Martin, W., Stoebe, B., Goremykin, V., Hansmann, S., Hasegawa, M., Kowallik, K.V., 1998. Gene transfer to the nucleus and the evolution of chloroplasts. Nature 393, 162–165.
McInerney, J.O., 1998. Replication and transcriptional selection on codon usage in Borrelia burgdorferi. Proc. Natl. Acad. Sci. USA 95, 10698–10703.
Medigue, C., Rouxel, T., Vigier, P., Henaut, A., Danchin, A., 1991. Evidence for horizontal gene transfer in Escherichia coli speciation. J. Mol. Biol. 222, 851–856.
Mewes, H.W., Albermann, K., Bahr, M., Frishman, D., Gleissner, A., Hani, J., Heumann, K., Kleine, K., Maierl, A., Oliver, S.G., Pfeiffer, F., Zollner, A., 1997. Overview of the yeast genome. Nature 387, 7–9.
Moriyama, E.N., Powell, J.R., 1997. Codon usage bias and tRNA abundance in Drosophila. J. Mol. Evol. 45, 514–523.
Muto, A., Osawa, S., 1987. The guanine and cytosine content of genomic DNA and bacterial evolution. Proc. Natl. Acad. Sci. USA 84, 166–169.
Nakamura, Y., Tabata, S., 1997. Codon-anticodon assignment and detection of codon usage trends in seven microbial genomes. Microbiol. Comp. Genom. 2, 299–312.
Nakayama, K., Kanaya, S., Ohnishi, M., Terawaki, Y., Hayashi, T., 1999. The complete nucleotide sequence of φCTX, a cytotoxin-converting phage of Pseudomonas aeruginosa: implications for phage evolution and horizontal gene transfer via bacteriophages. Mol. Microbiol. 31, 399–419.
Nomura, M., Morgan, E.A., 1977. Genetics of bacterial ribosomes. Annu. Rev. Genet. 11, 297–347.
Ogasawara, N., Moriya, S., von Meyenburg, K., Hansen, F.G., Yoshikawa, H., 1985. Conservation of genes and their organization in the chromosomal replication origin region of Bacillus subtilis and Escherichia coli. EMBO J. 4, 3345–3350.
Olsen, G.J., Woese, C.R., Overbeek, R., 1994. The winds of (evolutionary) change: breathing new life into microbiology. J. Bacteriol. 176, 1–6.
Osawa, S., 1995. Evolution of the Genetic Code. Oxford University Press, Oxford.
Pace, N.R., 1997. A molecular view of microbial diversity and the biosphere. Science 276, 734–740.
Percudani, R., Pavesi, A., Ottonello, S., 1997. Transfer RNA gene redundancy and translational selection in Saccharomyces cerevisiae. J. Mol. Biol. 268, 322–330.
Pouwels, P.H., Leunissen, J.A.M., 1994. Divergence in codon usage of Lactobacillus species. Nucleic Acids Res. 22, 929–936.
Sharp, P.M., Li, W., 1987. The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281–1295.
Sharp, P.M., Matassi, G., 1994. Codon usage and genome evolution. Curr. Opin. Genet. Dev. 4, 851–860.
Smith, D.R., Doucette-Stamm, L.A., Deloughery, C., et al., 1997. Complete genome sequence of Methanobacterium thermoautotrophicum ΔH: functional analysis and comparative genomics. J. Bacteriol. 179, 7135–7155.
Sprinzl, M., Horn, C., Brown, M., Ioudovitch, A., Steinberg, S., 1998. Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. 26, 148–153.
Stephens, R.S., Kalman, S., Lammel, C., et al., 1998. Genome sequence of an obligate intracellular pathogen of humans: Chlamydia trachomatis. Science 282, 754–759.
Stragier, P., Losick, R., 1996. Molecular genetics of sporulation in Bacillus subtilis. Annu. Rev. Genet. 30, 297–341.
Sueoka, N., 1962. On the genetic basis of variation and heterogeneity in base composition. Proc. Natl. Acad. Sci. USA 48, 582–592.
Sueoka, N., 1988. Directional mutation pressure and neutral molecular evolution. Proc. Natl. Acad. Sci. USA 85, 2653–2657.
Tomb, J., White, O., Kerlavage, A.R., et al., 1997. The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388, 539–547.
Yamao, F., Andachi, Y., Muto, A., Ikemura, T., Osawa, S., 1991. Levels of tRNAs in bacterial cells as affected by amino acid usage in proteins. Nucleic Acids Res. 19, 6119–6122.
Yang, D., Oyaizu, Y., Oyaizu, H., Olsen, G.J., Woese, C.R., 1985. Mitochondrial origins. Proc. Natl. Acad. Sci. USA 82, 4443–4447.
Ziegler, D., Dean, D., 1990. Orientation of genes in Bacillus subtilis chromosome. Genetics 125, 703–708.
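
Note on Section 3.4. The origin search described in Section 3.4 can be illustrated with a short sketch. It is not the procedure used in this paper, which reports switches observed in Z1 profiles (Fig. 6). It substitutes a simpler rule borrowed from cumulative skew plots: each gene's Z1 is multiplied by its strand (+1 or -1), the products are accumulated along the sequence, and the minimum of the running sum is taken as the negative-to-positive switch. The function name predict_origin, the (midpoint, strand, Z1) tuple layout, and the toy values are hypothetical. The sketch further assumes that strand +1 means transcription toward increasing coordinates, and it handles only a linear chromosome.

```python
from itertools import accumulate


def predict_origin(genes):
    """Estimate a replication origin from strand-signed Z1 values.

    genes: iterable of (midpoint_bp, strand, z1) tuples, with strand +1 or -1.
    Hypothetical layout, chosen for illustration only.

    Returns the midpoint between the two genes that flank the minimum of the
    cumulative sum of strand * z1, or None if no negative-to-positive switch
    lies inside the sequence.
    """
    ordered = sorted(genes)  # sort by coordinate
    if len(ordered) < 2:
        return None

    # Genes co-oriented with replication are expected to have positive Z1,
    # so strand * z1 should be negative left of the origin and positive
    # right of it.
    signed = [strand * z1 for _, strand, z1 in ordered]

    # cumulative[k] is the sum over the first k genes (cumulative[0] == 0).
    cumulative = list(accumulate(signed, initial=0.0))
    k = min(range(len(cumulative)), key=cumulative.__getitem__)

    # A minimum at either end means the whole sequence trends one way.
    if k == 0 or k == len(ordered):
        return None

    left_pos = ordered[k - 1][0]
    right_pos = ordered[k][0]
    return (left_pos + right_pos) / 2


# Toy data: three genes left of a putative origin, three to the right.
toy = [
    (100, -1, 1.2), (200, +1, -0.5), (300, -1, 0.8),
    (400, +1, 0.9), (500, -1, -0.4), (600, +1, 1.1),
]
print(predict_origin(toy))  # 350.0
```

Using the running sum rather than a sliding-window average avoids choosing a window size, at the cost of being sensitive to a few genes with extreme Z1 values. A circular chromosome would additionally need the maximum of the running sum as a candidate terminus, which is not attempted here.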