Biotechnology and Bioinformatics Symposium (BIOT-2007) Paper ID: 48

Open Reading Frames and Codon Bias in Streptomyces coelicolor and the Evolution of the Genetic Code

Robert Huether, William L. Duax, Charles M. Weeks, Sanjay Connare, Vladimir Pletnev, and Timothy C. Umland

Abstract: Examination of the complete genome of Streptomyces coelicolor reveals that the antisense strands of 70% of the 7555 genes contain no stop codons and could in principle be open reading frames (ORFs). Furthermore, 2174 genes have a third full-length ORF, 228 have a fourth ORF and 56 have a fifth ORF. Examination of the genes in S. coelicolor having multiple ORFs revealed a pronounced bias in codon use and a DNA triple distribution that is most severe in the genes having four and five ORFs. When the 170 hypothetical gene products that have four ORFs and at least 100 amino acids are examined, 87% of the coding is from the GC-rich half of the genetic code and 80% of the protein sequences are composed of only 10 amino acids (GPASTDLVER). This population of amino acids is consistent with the probable order of entry of amino acids into proteins in the course of evolution. Only nineteen of these 170 hypothetical gene products are specifically characterized. They are identified as 5 dehydrogenases, 3 kinases, 2 esterases, a permease, a deformylase, 2 ABC transport proteins, a two-component regulator, and three ribosomal proteins (S12, L18 and L33). Genes in S. coelicolor having four ORFs appear to identify a subset of the codon system that evolved first, coding for a subset of amino acids that make up the composition of the earliest folded proteins.

Index Terms: Bioinformatics, Codon Bias, Evolution, Multiple Open Reading Frames.

Manuscript received xxxxx. This work was supported in part by NIH Grant Number DK026546.
R. H. Author is with Hauptman-Woodward Medical Research Institute & Dept. of Structural Biology, Buffalo, 73 High St., Buffalo, NY 14203, USA (phone: 716-898-8600; fax: 716-898-8660; email: [email protected]).
W. L. D. Author is with Hauptman-Woodward Medical Research Institute & Dept. of Structural Biology, Buffalo (email: [email protected]).
C. M. W. Author is with Hauptman-Woodward Medical Research Institute & Dept. of Structural Biology, Buffalo (email: [email protected]).
S. C. Author is with Hauptman-Woodward Medical Research Institute & Dept. of Structural Biology, Buffalo (email: [email protected]).
T. C. U. Author is with Hauptman-Woodward Medical Research Institute & Dept. of Structural Biology, Buffalo (email: [email protected]).
V. P. Author is associated with Shemyakin-Ovchinnikov Inst., Moscow, Russian Federation (email: [email protected]).

I. INTRODUCTION

A subset of the short chain oxidoreductase (SCOR) family of enzymes was found to have full-length multiple open reading frames (MORFs) and an unusually specific codon bias [1]. The SCOR genes having MORFs were composed of nucleic acid triples that were primarily GC-rich or GC-only in composition. The possible implications of these MORFs and their codon bias have been described [1]. It was demonstrated that the frequency of various types of MORFs exceeds random by a factor of at most 10^6 and that 18% of the genes in the entire gene bank contain MORFs. In the SCOR family the codon bias was detected in 407 genes in species extending from bacteria and archaea through humans, and its frequency of occurrence was greatest in species having high G+C content. An unusual codon bias has also been detected in the genes of over thirty members of the heat shock protein 70 (HSP-70) family that have sense/antisense open reading frames (SAS ORFs) (manuscript in preparation). In an attempt to identify other families of proteins having a similar bias in gene composition, and to further explore the possible implications of the GC bias in genes having MORFs, we analyzed the genome of the GC-rich bacterium Streptomyces coelicolor. S.
coelicolor is a soil-dwelling, antibiotic-producing bacterium of the taxonomic order Actinomycetales [2]. S. coelicolor has one of the largest bacterial genomes (8.7 million base pairs) and a very GC-rich composition (72.1%) [2]. It contains 7555 computer-annotated genes, 70% of which we found to contain a completely overlapping antisense open reading frame. This paper describes the similarity between the codon bias observed in S. coelicolor and the SCOR proteins having MORFs, an associated amino acid bias in the S. coelicolor genome, and the implications of these patterns with respect to the evolution of the genetic code and the amino acid composition of proteins.

TABLE I: MULTIPLE ORFS IN THE 7555 GENES OF STREPTOMYCES COELICOLOR

Class              #      %
Single (SORF)      2278   30
Double (DORF)      2819   37
Triple (TORF)      2174   29
Quadruple (QORF)   228    3
Penta (PORF)       56     0.7

II. METHODS

The entire genome of Streptomyces coelicolor was downloaded from the NCBI ftp site [ftp://ftp.ncbi.nih.gov]. The genome was parsed and prepared for analysis using in-house Perl scripts. The computational methods and procedures have been previously described [1]. In short, in-house scripts were written to examine the nucleic acid sequence of the 7555 genes in the S. coelicolor genome. Genes were tested for the presence of "stop" codons in the five alternate ways in which each gene could be read. The reading frames are numbered 1, 2, 3 for the sense strand, sense strand +1, and sense strand +2, respectively, and 4, 5, 6 for the antisense strand, antisense +1 and antisense +2, respectively. The frames other than the annotated coding frame are the five additional reading frames. These scripts were also used to tabulate the use of the 64 codons in each of the six reading frames and the total frequency of occurrence of the 64 possible ordered nucleotide triples in a gene.
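The frame scan described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the in-house Perl scripts used for the analysis: the function names (reverse_complement, six_frames, stop_free_frames, codon_usage) and the (strand, offset) frame labels are hypothetical, and ("sense", 0) is assumed to be the annotated coding frame.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the antisense strand, read 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def codons(seq: str, offset: int) -> list[str]:
    """Split seq into complete triples starting at offset 0, 1 or 2."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(gene: str) -> dict[tuple[str, int], list[str]]:
    """Map (strand, offset) to the codon list of each of the six frames.

    ("sense", 0) is assumed to be the annotated coding frame.
    """
    sense = gene.upper()
    antisense = reverse_complement(sense)
    return {
        (name, offset): codons(strand, offset)
        for name, strand in (("sense", sense), ("antisense", antisense))
        for offset in range(3)
    }


def stop_free_frames(gene: str) -> list[tuple[str, int]]:
    """List the frames, other than the coding frame, with no stop codon."""
    return [
        frame
        for frame, triples in six_frames(gene).items()
        if frame != ("sense", 0) and not STOP_CODONS & set(triples)
    ]


def codon_usage(gene: str) -> dict[tuple[str, int], Counter]:
    """Tabulate how often each triple occurs in each of the six frames."""
    return {frame: Counter(triples) for frame, triples in six_frames(gene).items()}


if __name__ == "__main__":
    # Toy sequence for illustration only, not a gene from the genome.
    print(stop_free_frames("ATGTTAGCC"))
```

For the toy sequence ATGTTAGCC the antisense strand is GGCTAACAT, so the in-register antisense frame contains TAA and is not open, while three of the other frames are stop-free.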
If a reading frame other than the annotated coding frame did not contain a "stop" codon (TAA, TAG, or TGA), it was identified as an open reading frame (ORF). Following the nomenclature of Duax et al. 2005 [1], genes having ORFs in two different reading frames are called double open reading frames (DORFs). Those with three ORFs are termed triple open reading frames (TORFs), those with four ORFs are termed quartet ORFs (QORFs) and those with five ORFs are called penta ORFs (PORFs). We examined the frequency of use of each of the 64 codons in each reading frame and displayed the results graphically, distinguishing among the nucleic acid triples that are GC-only, GC-rich, AT-rich, and AT-only in composition. This also allowed the tabulation of the amino acids for each protein. Throughout the analysis, we retained the identification of the gene product found in the SWISS-PROT TrEMBL [3] gene bank (i.e. known, hypothetical, putative, or unknown gene product) in order to explore correlations between MORFs, codon bias, amino acid bias and protein classification. The Homology-derived Secondary Structure of Proteins (HSSP) database as of August 2006 was used as the analysis set to identify ribosomal homologs. The homologs share 30% sequence identity with the crystal structure sequence.

III. MULTIPLE OPEN READING FRAMES:

Examination of the complete genome of S. coelicolor reveals that the antisense strands of 70% of the 7555 genes (5277) contain no stop codons and could, in principle, be open reading frames (ORFs). Furthermore, 2174 genes have a third full-length ORF, 228 have a fourth ORF and 56 have a fifth ORF. Table I presents the range of amino acid lengths and their average for each MORF class.

TABLE I (continued): AMINO ACID LENGTH BY ORF CLASS

Class              Mean   Low   High
Single (SORF)      382    30    7464
Double (DORF)      336    30    2241
Triple (TORF)      272    20    1132
Quadruple (QORF)   148    28    464
Penta (PORF)       88     22    231

A protein of 231 amino acids was found to contain five full-length ORFs. For a DORF, a gene with two open reading frames, five possible combinations can occur with the annotated sense strand (1-2, 1-3, 1-4, 1-5, 1-6).
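The class names, the five possible DORF pairings and the Table I percentages can be checked with simple arithmetic. The sketch below is illustrative only: the names TABLE_I_COUNTS, morf_class and DORF_PAIRINGS are hypothetical, the gene counts are copied from Table I, and treating a gene with more than five stop-free frames as an error is an assumption, since the text names no class for that case.

```python
# Gene counts per ORF class, copied from Table I.
TABLE_I_COUNTS = {"SORF": 2278, "DORF": 2819, "TORF": 2174, "QORF": 228, "PORF": 56}
TOTAL_GENES = 7555

# Class name by total number of stop-free frames, coding frame included.
_CLASS_BY_OPEN_FRAMES = {1: "SORF", 2: "DORF", 3: "TORF", 4: "QORF", 5: "PORF"}


def morf_class(open_frames: int) -> str:
    """Name the ORF class of a gene with `open_frames` stop-free frames.

    Only one to five frames have a named class in the text, so any other
    count raises ValueError (an assumption of this sketch).
    """
    try:
        return _CLASS_BY_OPEN_FRAMES[open_frames]
    except KeyError:
        raise ValueError(f"no class named for {open_frames} open frames") from None


# The five ways a second open frame can pair with coding frame 1.
DORF_PAIRINGS = [(1, other) for other in range(2, 7)]

# Percentage of the 7555 genes in each class, to one decimal place.
PERCENT = {name: round(100 * n / TOTAL_GENES, 1) for name, n in TABLE_I_COUNTS.items()}

# Genes with at least one ORF besides the coding frame.
MULTI_ORF_GENES = TOTAL_GENES - TABLE_I_COUNTS["SORF"]

if __name__ == "__main__":
    for name, pct in PERCENT.items():
        print(f"{name}: {TABLE_I_COUNTS[name]} genes, {pct}%")
    print(f"multi-ORF genes: {MULTI_ORF_GENES}")
```

With the Table I counts, the five classes sum to 7555, the Penta class works out to 0.7% of genes, and the four multi-ORF classes together give the 5277 genes (about 70%) quoted in the text.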
We observed only one of the five combinations in S. coelicolor genes, the sense/antisense overlapping DORFs (1-2). Further examination of the MORFs revealed an additional bias in codon use and DNA triple composition that is identical to that observed in the SCOR enzyme family [1]. The codon and amino acid bias is most severe in genes having four and five open reading frames (QORFs and PORFs, respectively).

IV. CODON BIAS

A graph of the frequencies of use of the 64 codons in the 7555 genes in S. coelicolor reveals that 80% of all of the amino acids used in the bacterium are encoded by 32 of the codons (Fig. 1a). Although a preferential use of codons that are GC-rich or GC-only might be anticipated in a GC-rich species, the extent of the bias seen in this genome is extreme. The thirty-two codons that are used include all eight of the GC-only codons (green in Fig. 1), seventeen of the GC-rich codons (blue in Fig. 1), seven AT-rich codons and no AT-only codons. It should be noted that the high GC content of S. coelicolor does not preclude the potential existence of AT-only codons. Furthermore, 83% of the amino acids are encoded by the GC-rich half of the genetic code (blue and green symbols in Fig. 1a).

A pattern of codon bias in the coding frame does not mean that an identical or even similar pattern will be found in the other five frames. The distribution of the 64 possible codons is different in each of the six possible reading frames. Analysis of the triple frequency in the S. coelicolor genome, which involves all six frames, reveals a bias in the DNA similar to that seen in the coding frame: 32 GC-rich nucleotide triples account for 83% of the triples in the DNA of all the genes in the genome (Fig. 1b). The pattern of separation of occurrence of specific combinations of G, C, T and A, however, becomes more pronounced.
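The four composition classes used in this section follow directly from how many of the three positions in a triple are G or C. A minimal Python sketch, with hypothetical names (gc_class, CLASS_SIZES, class_shares) and a made-up toy tally rather than genome data:

```python
from collections import Counter
from itertools import product

_CLASS_BY_GC_COUNT = ("AT-only", "AT-rich", "GC-rich", "GC-only")


def gc_class(triple: str) -> str:
    """Classify a nucleotide triple by its number of G or C bases (0 to 3)."""
    gc_count = sum(base in "GC" for base in triple.upper())
    return _CLASS_BY_GC_COUNT[gc_count]


# All 64 ordered triples and the size of each composition class.
ALL_TRIPLES = ["".join(bases) for bases in product("ACGT", repeat=3)]
CLASS_SIZES = Counter(gc_class(triple) for triple in ALL_TRIPLES)


def class_shares(triple_counts: dict[str, int]) -> dict[str, float]:
    """Percentage of all counted triples that falls in each class."""
    total = sum(triple_counts.values())
    per_class: Counter = Counter()
    for triple, n in triple_counts.items():
        per_class[gc_class(triple)] += n
    return {cls: 100.0 * n / total for cls, n in per_class.items()}


if __name__ == "__main__":
    print(dict(CLASS_SIZES))
    # Toy tally for illustration only.
    print(class_shares({"GCG": 3, "ATG": 1}))
```

The 64 triples split into 8 GC-only, 24 GC-rich, 24 AT-rich and 8 AT-only, so the "GC-rich half" of the code is the 32 triples with at least two G or C bases. Feeding class_shares a per-frame or whole-gene triple tally gives class percentages of the kind reported in this section.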
The partitioning of the four subsets of triples is much more pronounced, with a distribution of 34.2% GC-only, 49.6% GC-rich, 15.8% AT-rich and 0.4% AT-only, and there is a break in the distribution between the GC-rich and AT-rich halves (Fig. 1b, separation of blue from yellow). Nucleotide triples composed of G and C only are more common than those composed of A and T only, and the GC-rich and AT-rich triples cluster in the upper and lower halves of the distribution, respectively. The two most used codons (GCC-Ala and GGC-Gly) are complements of one another and together account for 14.0% of the amino acids in the putative protein products of S. coelicolor.

Figure 1: Streptomyces coelicolor codon plots (panels 1a, 1b and 1c). GC-only codons in green, GC-rich codons in blue, AT-rich codons in yellow, and AT-only codons in red. Codons with multiple definitions are represented by an X.

Examination of the triple content of the DNA, rather than just the triple frequency in the coding frame, demonstrates that the nucleotide triple bias is not restricted to the coding frame and is in fact a more fundamental property of the DNA of genes containing MORFs. A graph of the distribution of occurrence of complementary pairs of nucleotide triples in the DNA of S. coelicolor reveals an even more pronounced separation of the four classes of codons (GC-only, GC-rich, AT-rich, and AT-only) (Fig. 1c). Just as the average value of each of the four classes of triples is markedly different, combining the percentage use of individual complementary pairs of triples results in enhanced separation of the four classes. Of particular significance is the appearance of a pronounced gap between the GC-rich and AT-rich halves of the coding system.

Fourteen codons of the standard or universal code are known to code for different amino acids in different species, including bacteria and the mitochondria of eukaryotes [6]. The majority of the codons having variable definitions are AT-only or AT-rich. These codons are not found in the genome of S. coelicolor (Fig. 1a) or in the MORFs of the protein family of short chain oxidoreductases [1]. The fact that the majority of the codons that are used least in coding in the MORFs are AT-rich, and include those that have multiple definitions in different species, is consistent with the possibility that the earliest genes evolved before the AT-rich half of the coding system was fully defined and before some species separation. Although we have not attempted to predict an exact order of introduction of codons into the genetic code, our data support, in part, the order of codon evolution proposed by Trifonov [5,7]. However, our pattern of separation of populations of the four classes of codons (GC-only, GC-rich, AT-rich, and AT-only) in the S. coelicolor genome and SCOR MORFs suggests that the evolution of triples used for coding might have been in the order GC-only before GC-rich before AT-rich before AT-only. The data also suggest that triples corresponding to complementary antisense pairs in each of these subgroups entered the coding system at approximately the same time in the course of the evolution of the code.

V. BIASED AMINO ACID COMPOSITION:

The codons of some amino acids are GC-rich (e.g. Gly, Ala, Pro, Arg) while others are AT-rich (Phe, Tyr). A bias in amino acid composition of proteins has been noted in species with a very high or very low GC content [9]. We noted a severe bias in the amino acid composition of the putative protein products of the genes in S. coelicolor. Ten amino acids (GPASTDLVER) make up 82% of the composition of the postulated protein products of all 7555 annotated genes in S. coelicolor (Fig. 2). These same ten amino acids have been proposed as the first to appear in primordial proteins, based on a variety of theoretical and biochemical analyses [5]. We discovered that this bias in amino acid composition in S. coelicolor was not randomly distributed among the 7555 proteins. Forty percent of the proteins in S. coelicolor are missing one or more of the less commonly occurring ten amino acids (CWNKYMHFIQ) (Fig. 3). Genes having QORFs and PORFs have the most restricted amino acid use. If these are the most ancient and least altered genes, they should correspond to the most essential proteins and protein folds.

Figure 2: Amino acid occurrence in S. coelicolor (% occurrence, 0 to 16, for each amino acid, plotted in the order A L G R V P T D E S I Q F H K M W N Y C).

Figure 3: The relative frequency of absence of each of the 20 amino acids in S. coelicolor.

The 228 QORF (quartet ORF) genes vary in length from 28 to 464 amino acids; 170 QORFs have >100 amino acids. When these 170 genes are examined, 87% of the coding is from the GC-rich half of the genetic code. Only 19 of the expected gene products are characterized by homology. These include 5 dehydrogenases, 3 kinases, 2 esterases, a permease, a deformylase, 2 ABC transport proteins, a two-component regulator, and three ribosomal proteins (S12, L18 and L33).

VI. RIBOSOMAL PROTEINS:

Further analyses revealed that most of the ribosomal protein genes in S. coelicolor have double and triple ORFs and are missing one or more of the ten least commonly occurring amino acids. The ribosomal subunit L6 (RSL6) was selected for further analysis. We were able to identify 125 homologs in the SWISS-PROT TrEMBL database based on sequence similarity to an L6 for which a crystal structure has been reported [1RL6] (Fig. 5). Of these 125 RSL6 sequences, 50% are missing Trp, 64% are missing Cys and 35% are missing both. The RSL6 proteins in archaea rarely have Trp and/or Cys residues whereas those from eukaryotes usually have both.

Comparison of the amino acid sequences of the 125 RSL6 proteins reveals that Cys is not conserved in any sequence position at greater than 8%. In 37 sequences, a Trp residue resides in a common position on the surface of RSL6 in bacteria but not archaea. Additionally, we expanded the search to include 13 of the largest proteins from the 50S ribosomal subunit. Homologs of these proteins are also missing Trp or Cys or both (Fig. 4). Further investigation reveals a high incidence of MORFs, codon bias and amino acid bias in most prokaryotic ribosomal proteins. Regardless of overall GC content, many ribosomal proteins have genes containing double ORFs and have no cysteines or tryptophans in their amino acid composition.

Figure 4: Missing amino acids in several ribosomal proteins.

The identities of twenty-two of the 170 residues in RSL6 are conserved at 90% (Table II). Of the conserved identities, 18 (G11P2V2ALE) are among the ten residues that are most populated within protein folds according to SwissProt/TrEMBL statistics, and 17 of them (G11P2K3A) are from the tRNA synthetase II family [3]. We have previously noted an enhanced frequency of amino acids related to tRNA Syn II in the conserved identities in the SCOR enzyme family [1]. We postulate that if one of the tRNA Syn enzymes evolved before the other, the observed frequencies of conservation of amino acids in ancient proteins suggest that tRNA Syn II arose first [8].

The highly conserved residues are generally seen in the turns of RSL6 and rarely in the α-helices and β-sheets (Fig. 5). The charged residues are extended away from the structure where they can interact with rRNA or other ribosomal surfaces. The presence of conserved tRNA Syn II residues in the turns of RSL6 matches the pattern previously observed in SCOR enzymes [1].

TABLE II: RESIDUES CONSERVED >90%. AMINO ACIDS FROM TRNA SYNTHETASE I ARE IN RED; TRNA SYNTHETASE ARE IN GREEN.

AA   Position   %
P    11         96
V    14         92
G    27         95
G    30         95
G    65         92
G    77         10
V    78         96
G    81         95
L    86         92
G    90         91
G    92         91
G    107        97
G    134        95
K    137        91
A    144        96
P    153        90
Y    156        93
K    157        90
G    158        92
K    159        91
G    160        93
E    166        91

Figure 5: Structure of PDB: 1RL6. Colors indicate secondary structure, with alpha helices in red, beta sheets in yellow and turns in green. Conserved residues (>90% ID) are indicated in purple.

VII. CONCLUSIONS

Regardless of overall GC content, many ribosomal proteins have genes containing double ORFs and have no cysteines or tryptophans in their amino acid composition. The MORFs found in S. coelicolor genes appear to identify a subset of the codon system that evolved first as well as a subset of amino acids that may have constituted the earliest folded proteins. These findings suggest that MORFs, severe codon bias and the absence of Trp and Cys residues are hallmarks of ancient enzymes that have been little altered by millions of years of evolution.

REFERENCES

[1] W. L. Duax, V. Pletnev, A. Addlagatta, J. Bruenn and C. M. Weeks. Rational proteomics I: Predicting fold and cofactor preference in the short-chain oxidoreductase (SCOR) enzyme family. Proteins, 2003, Vol. 53, pp. 931-943.
[2] S. D. Bentley, K. F. Chater, A-M. Cerdeno-Tarraga, G. L. Challis, N. R. Thomson, K. D. James, D. E. Harris, M. A. Quail, H. Kieser, D. Harper, A. Bateman, S. Brown, G. Chandra, C. W. Chen, M. Collins, A. Cronin, A. Fraser, A. Goble, J. Hidalgo, T. Hornsby, S. Howarth, C-H. Huang, T. Kieser, L. Larke, L. Murphy, K. Oliver, S. O'Neil, E. Rabbinowitsch, M-A. Rajandream, K. Rutherford, S. Rutter, K. Seeger, D. Saunders, S. Sharp, R. Squares, S. Squares, K. Taylor, T. Warren, A. Wietzorrek, J. Woodward, B. G. Barrell, J. Parkhill, and D. A. Hopwood. Complete genome sequence of the model actinomycete Streptomyces coelicolor A3(2). Nature, 2002, Vol. 417, pp. 141-147.
[3] B. Boeckmann, A. Bairoch, R. Apweiler, M.-C. Blatter, A. Estreicher, E. Gasteiger, M. J. Martin, K. Michoud, C. O'Donovan, I. Phan, S. Pilbout, and M. Schneider. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 2003, Vol. 31, pp. 365-370.
[4] C. Sander and R. Schneider. Database of homology-derived protein structure and the structural meaning of sequence alignment. Proteins, 1991, Vol. 9, pp. 56-68.
[5] E. N. Trifonov. Consensus temporal order of amino acids and evolution of the triplet code. Gene, 2000, Vol. 261, pp. 139-151.
[6] T. Maeshiro and M. Kimura. The role of robustness and changeability on the origin and evolution of genetic codes. Proc. Natl. Acad. Sci. USA, 1998, Vol. 95, pp. 5088-5093.
[7] E. N. Trifonov, A. Kirzhner, M. Kirzhner, and I. N. Berezovsky. Distinct stages of protein evolution as suggested by protein sequence analysis. J. Mol. Evol., 2001, Vol. 53, pp. 394-401.
[8] C. Carter and W. Duax. Did tRNA synthetase classes arise on opposite strands of the same gene? Mol. Cell, 2002, Vol. 10(4), pp. 705-708.
[9] G. A. Singer and D. A. Hickey. Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol., 2000, Vol. 17(11), pp. 1581-1588.