Biotechnology and Bioinformatics Symposium (BIOT-2007) Paper ID: 48
Open Reading Frames and Codon Bias in
Streptomyces coelicolor and the Evolution of
the Genetic Code.
Robert Huether, William L. Duax, Charles M. Weeks, Sanjay Connare, Vladimir Pletnev, and
Timothy C. Umland
Abstract— Examination of the complete genome of
Streptomyces coelicolor reveals that the antisense strands of 70%
of the 7555 genes contain no stop codons and could in principle
be open reading frames (ORFs). Furthermore, 2174 genes have a
third full length ORF, 228 have a fourth ORF and 56 have a fifth
ORF. Examination of the genes in S. coelicolor having multiple
ORFs revealed a pronounced bias in codon use and a DNA triple
distribution that is most severe in the genes having four and five
ORFs. When the 170 hypothetical gene products that have four
ORFs and at least 100 amino acids are examined, 87% of the
coding is from the GC-rich half of the genetic code and 80% of
the protein sequences are composed of only 10 amino acids
(GPASTDLVER). This population of amino acids is consistent
with the probable order of entry of amino acids into proteins in
the course of evolution. Only nineteen of these 170 hypothetical
gene products are specifically characterized. They are identified
as 5 dehydrogenases, 3 kinases, 2 esterases, a permease, a
deformylase, 2 ABC transport proteins, a two-component
regulator, and three ribosomal proteins, [S12, L18 and L33].
Genes in S. coelicolor having four ORFs appear to identify a
subset of the codon system that evolved first, coding for a subset
of amino acids that make up the composition of the earliest
folded proteins.
Index Terms—Bioinformatics, Codon Bias, Evolution, Multiple
Open Reading Frames.
Manuscript received xxxxx. This work was supported in part by: NIH Grant Number: DK026546.
R. H. Author is with Hauptman-Woodward Medical Research Institute & Dept. of Structural Biology, Buffalo, 73 High St., Buffalo, NY 14203, USA (phone: 716-898-8600; fax: 716-898-8660; email: [email protected])
W. L. D. Author is with Hauptman-Woodward Medical Research Institute & Dept. of Structural Biology, Buffalo, (email: [email protected])
C. M. W. Author is with Hauptman-Woodward Medical Research Institute & Dept. of Structural Biology, Buffalo, (email: [email protected])
S. C. Author is with Hauptman-Woodward Medical Research Institute & Dept. of Structural Biology, Buffalo, (email: [email protected])
T. C. U. Author is with Hauptman-Woodward Medical Research Institute & Dept. of Structural Biology, Buffalo, (email: [email protected])
V. P. Author is associated with Shemyakin-Ovchinnikov Inst., Moscow, Russian Federation (email: [email protected])
I. INTRODUCTION
A subset of the short chain oxidoreductase (SCOR) family
of enzymes was found to have full length multiple open
reading frames (MORFs) and an unusually specific codon bias
[1]. The SCOR genes having MORFs were composed of
nucleic acid triples that were primarily GC-rich or GC-only in
composition. The possible implications of these MORFs and
their codon bias have been described [1]. It was demonstrated
that the frequency of various types of MORFs exceeds random
expectation by a factor of as much as 10^6 and that 18% of the
genes in the entire gene bank contain MORFs. In the SCOR family
the codon bias was detected in 407 genes in species extending
from bacteria and archaea through humans; the frequency of
occurrence was greatest in species having high G+C content.
An unusual codon bias has also been detected in the genes of
over thirty members of the heat shock protein 70 (HSP-70)
family that have sense/antisense open reading frames (SAS
ORFs) (manuscript in preparation). In an attempt to identify
other families of proteins having a similar bias in gene
composition and to further explore the possible implications of
the GC-bias in genes having MORFs, we analyzed the genome
of the GC-rich bacterium Streptomyces coelicolor.
S. coelicolor is a soil-dwelling, antibiotic-producing
bacterium of the taxonomic order Actinomycetales [2]. S.
coelicolor has one of the largest bacterial genomes (8.7
million base pairs) and a very GC-rich composition
(72.1%) [2]. It contains 7555 computer-annotated genes, of
which we found that 70% contain a completely
overlapping antisense open reading frame. This paper
describes the similarity between the codon bias observed in S.
coelicolor and the SCOR proteins having MORFs, an
associated amino acid bias in the S. coelicolor genome and the
implications of these patterns with respect to the evolution of
the genetic code and the amino acid composition of proteins.
II. METHODS
The entire genome of Streptomyces coelicolor was
downloaded from the NCBI ftp [ftp://ftp.ncbi.nih.gov]. The
genome was parsed and prepared for analysis using in-house Perl
scripts. The computational methods and procedures have been
previously described [1].

TABLE I: MULTIPLE ORFS IN THE 7555 GENES OF STREPTOMYCES COELICOLOR
ORF class          #      %
Single (SORF)      2278   30
Double (DORF)      2819   37
Triple (TORF)      2174   29
Quadruple (QORF)   228    3
Penta (PORF)       56     0.7

In short, in-house scripts were written
to examine the nucleic acid sequence of the 7555 genes in the
S. coelicolor genome. Genes were tested for the presence of
“stop” codons in the five alternate ways in which each gene
could be read. The reading frames are numbered throughout
this manuscript as 1, 2 and 3 for the sense strand,
sense strand +1 and sense strand +2, respectively, and
4, 5 and 6 for the antisense strand, antisense +1
and antisense +2, respectively. Frames 2 through 6 are the five
additional reading frames. These scripts were also used to
tabulate the use of the 64 codons in each of the six reading
frames and the total frequency of occurrence of the 64
possible ordered nucleotide triples in a gene. A reading
frame other than the annotated coding frame that contained no
“stop” codon (TAA, TAG, or TGA) was identified as an
open reading frame (ORF). Following the nomenclature given by
Duax et al. 2005 [1], genes having ORFs in two different
reading frames are called double open reading frames
(DORFs). Those with three ORFs are termed triple open
reading frames (TORFs), four ORFs are termed quartet ORFs
(QORFs) and five ORFs are called penta ORFs (PORFs).
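
The stop-codon test and the ORF classes described above can be sketched in Python as follows. This is a minimal illustration, not the in-house Perl scripts used for the analysis. The function names, the frame numbering (1 to 3 on the sense strand, 4 to 6 on the antisense strand) and the choice to offset the antisense frames from the 5' end of the reverse complement are assumptions made for the example.

```python
from typing import List

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Class names used in the text, keyed by the total number of open frames
# (the annotated coding frame plus any stop-free alternate frames).
ORF_CLASS = {1: "SORF", 2: "DORF", 3: "TORF", 4: "QORF", 5: "PORF"}


def reverse_complement(seq: str) -> str:
    """Return the antisense strand read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_stop(seq: str, offset: int) -> bool:
    """True if reading seq in triples from offset meets a stop codon."""
    return any(seq[i:i + 3] in STOP_CODONS
               for i in range(offset, len(seq) - 2, 3))


def open_alternate_frames(gene: str) -> List[int]:
    """List the alternate frames (2 to 6) that contain no stop codon.

    Frame 1 is the annotated coding frame and is assumed to be open,
    so it is not tested here. Frame numbering is an assumption.
    """
    sense = gene.upper()
    antisense = reverse_complement(sense)
    frames = {2: (sense, 1), 3: (sense, 2),
              4: (antisense, 0), 5: (antisense, 1), 6: (antisense, 2)}
    return [n for n, (strand, offset) in frames.items()
            if not has_stop(strand, offset)]


def orf_class(gene: str) -> str:
    """Name the ORF class of a gene from its count of open frames."""
    total = 1 + len(open_alternate_frames(gene))
    return ORF_CLASS.get(total, f"{total} ORFs")
```

For the toy sequence ATGGCTAGCTGA (invented for illustration, not taken from the genome), frames 3 and 6 each contain TAG, so `open_alternate_frames` returns `[2, 4, 5]` and `orf_class` returns `"QORF"`.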
We examined the frequency of use of each of the 64 codons
in each reading frame and displayed the results graphically
distinguishing among the nucleic acid triples that are GC-only,
GC-rich, AT-rich, and AT-only in composition. This also
allowed the tabulation of the amino acids for each protein.
Throughout the analysis, we retained the identification of the
gene product found in the SWISS PROT TrEMBL [3] gene
bank (i.e. known, hypothetical, putative, or unknown gene
product) in order to explore correlations between MORFs,
codon bias, amino acid bias and protein classification.
The Homology-derived Secondary Structure of Proteins (HSSP)
database [4], as of August 2006, was used as the analysis set to
identify ribosomal homologs. The homologs share 30% sequence
identity with the crystal structure sequence.
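
The tabulation of triples by reading frame and by composition class can be illustrated with a short Python sketch. The helper names are hypothetical, and the sketch assumes the input contains only the letters A, C, G and T. It is not the authors' implementation.

```python
from collections import Counter
from typing import Dict

# Index = number of G or C bases in the triple (0 to 3).
GC_CLASSES = ("AT-only", "AT-rich", "GC-rich", "GC-only")


def gc_class(triple: str) -> str:
    """Classify a nucleotide triple by how many of its bases are G or C."""
    n_gc = sum(base in "GC" for base in triple.upper())
    return GC_CLASSES[n_gc]


def count_triples(seq: str, offset: int = 0) -> Counter:
    """Count the triples read from seq in the frame starting at offset."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))


def class_shares(counts: Counter) -> Dict[str, float]:
    """Fraction of all counted triples that falls in each GC class."""
    total = sum(counts.values())
    shares = {name: 0.0 for name in GC_CLASSES}
    for triple, n in counts.items():
        shares[gc_class(triple)] += n / total
    return shares
```

Calling `count_triples` once per frame (offsets 0, 1 and 2 on each strand) gives the six per-frame tables; summing the six counters gives the total frequency of each ordered triple in a gene.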
III. MULTIPLE OPEN READING FRAMES:
Examination of the complete genome of S. coelicolor
reveals that the antisense strands of 70% of the 7555 genes
(5277) contain no stop codons and could, in principle, be open
reading frames (ORFs). Furthermore, 2174 genes have a third
full length ORF, 228 have a fourth ORF and 56 have a fifth
ORF. Table I presents the range of amino acid lengths and
their average for each MORF class. A protein of 231 amino
acids was found to contain five full length ORFs. For a DORF,
a gene with two open reading frames, five combinations with the
annotated sense strand are possible (1-2, 1-3, 1-4, 1-5, 1-6).
We observed only one of the five in S. coelicolor genes, the
sense/antisense overlapping DORFs (1-2). Further examination
of the MORFs revealed an additional bias in codon use and DNA
triple composition that is identical to that observed in the
SCOR enzyme family [1]. The codon and amino acid bias is most
severe in genes having four and five open reading frames
(QORFs and PORFs, respectively).

TABLE I (CONTINUED): AMINO ACID LENGTHS OF THE GENE PRODUCTS IN EACH ORF CLASS
ORF class          Mean AA length   Range low   Range high
Single (SORF)      382              30          7464
Double (DORF)      336              30          2241
Triple (TORF)      272              20          1132
Quadruple (QORF)   148              28          464
Penta (PORF)       88               22          231
IV. CODON BIAS
A graph of frequencies of use of the 64 codons in the 7555
genes in S. coelicolor reveals that 80% of all of the amino
acids used in the bacterium are encoded by 32 of the codons
(Fig. 1a). Although a preferential use of codons that are
GC-rich or GC-only might be anticipated in a GC-rich species, the
extent of the bias seen in this genome is extreme. The
thirty-two codons that are used include all eight of the GC-only
codons (green in Fig. 1), seventeen of the GC-rich codons
(blue in Fig. 1), seven AT-rich codons and no AT-only
codons. It should be noted that the high GC content of S.
coelicolor does not preclude the potential existence of
AT-only codons. Furthermore, 83% of the amino acids are
encoded by the GC-rich half of the genetic code (blue and
green symbols in Fig. 1a). A pattern of codon bias in the
coding frame does not mean that an identical or even similar
pattern will be found in the other five frames. The distribution
of the 64 possible codons is different in each of the six
possible reading frames.
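
The two kinds of summary quoted above (how many codons cover a given share of usage, and what share comes from the GC-rich half of the code) can be derived from a table of codon counts. The following Python sketch is illustrative only: the function names are hypothetical, the counts table is assumed to be non-empty, and the "GC-rich half" is taken to mean triples with at least two G or C bases.

```python
from typing import Mapping


def codons_needed(counts: Mapping[str, int], target: float) -> int:
    """Smallest number of most-used codons whose combined share >= target."""
    total = sum(counts.values())
    running = 0
    for rank, n in enumerate(sorted(counts.values(), reverse=True), start=1):
        running += n
        if running >= target * total:
            return rank
    return len(counts)


def gc_rich_half_share(counts: Mapping[str, int]) -> float:
    """Share of usage from triples with two or three G/C bases."""
    total = sum(counts.values())
    gc_half = sum(n for codon, n in counts.items()
                  if sum(base in "GC" for base in codon) >= 2)
    return gc_half / total
```

With the invented counts `{"GCC": 50, "GGC": 30, "AAG": 15, "TTT": 5}`, two codons are enough to cover 75% of usage and the GC-rich half contributes a share of 0.8. These numbers are toy values, not data from the genome.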
Analysis of the nucleotide triple frequency in the S. coelicolor
genome, which involves all six frames, reveals a bias in the DNA
similar to that seen in the coding frame: the 32 triples of the
GC-rich half account for 83% of the triples in the DNA of all
the genes in the genome (Fig. 1b). The separation among specific
combinations of G, C, T and A, however, becomes more
pronounced. The four subsets of triples are partitioned with a
distribution of 34.2% GC-only, 49.6% GC-rich, 15.8% AT-rich
and 0.4% AT-only, and there is a break in the distribution
between the GC-rich and AT-rich halves (Fig. 1b, separation of
blue from yellow). Nucleotide triples composed of G and C only
are more common than those composed of A and T only, and the
GC-rich and AT-rich triples cluster in the upper and lower
halves of the distribution,
respectively. The two most used codons (GCC-Ala and GGC-Gly)
are complements of one another and together account for
14.0% of the amino acids in the putative protein products of
S. coelicolor. Examination of the triple content of the
DNA rather than just the triple frequency in the coding frame
demonstrates that the nucleotide triple bias is not restricted to
the coding frame and is in fact a more fundamental property of
the DNA of genes containing MORFs.
A graph of the distribution of occurrence of
complementary pairs of nucleotide triples in the DNA of S.
coelicolor reveals an even more pronounced separation of the
four classes of codons (GC-only, GC-rich, AT-rich, and
AT-only) (Fig. 1c). Just as the average value of each of the four
classes of triples is markedly different, combining the
percentage use of individual complementary pairs of triples
results in enhanced separation of the four classes. Of particular
significance is the appearance of a pronounced gap between the
GC-rich and AT-rich halves of the coding system.
Fourteen codons of the standard or universal code are
known to code for different amino acids in different species
including bacteria and the mitochondria of eukaryotes [6]. The
majority of the codons having variable definitions are AT-only
or AT-rich. These codons are not found in the genome of S.
coelicolor (Fig. 1a) or in the MORFs of the protein family of
short chain oxidoreductases [1]. The fact that the majority of
the codons that are used least in coding in the MORFs are
AT-rich, and include those that have multiple definitions in
different species, is consistent with the possibility that the
earliest genes evolved before the AT-rich half of the coding
system was fully defined and before some species separation.
Although we have not attempted to predict an exact order of
introduction of codons into the genetic code, our data
support, in part, the order of codon evolution proposed by
Trifonov [5,7]. However, the pattern of separation of
populations of the four classes of codons (GC-only, GC-rich,
AT-rich, and AT-only) in the S. coelicolor genome and SCOR
MORFs suggests that the evolution of triples used for coding
might have been in the order GC-only before GC-rich before
AT-rich before AT-only. The data also suggest that triples
corresponding to complementary antisense pairs in each of
these subgroups entered the coding system at approximately
the same time in the course of the evolution of the code.
Figure 1 (panels 1a, 1b, 1c): Streptomyces coelicolor codon plots. GC-only codons in green, GC-rich codons in blue, AT-rich codons in yellow, and AT-only codons in red. Codons
with multiple definitions are represented by an X.
V. BIASED AMINO ACID COMPOSITION:
The codons of some amino acids are GC-rich (e.g. Gly, Ala,
Pro, Arg) while others are AT-rich (Phe, Tyr). A bias in amino
acid composition of proteins has been noted in species with a
very high or very low GC content [9]. We noted a severe bias
in the amino acid composition of the putative protein products
of the genes in S. coelicolor. Ten amino acids
(GPASTDLVER) make up 82% of the composition of the
postulated protein products of all 7555 annotated genes in S.
coelicolor (Fig. 2). These same ten amino acids have been
proposed as the first to appear in primordial proteins, based
on a variety of different theoretical and biochemical
analyses [5].
Figure 2: Amino acid occurrence in S. coelicolor (y-axis: % occurrence, 0 to 16; x-axis: amino acid, in the order A L G R V P T D E S I Q F H K M W N Y C).
We discovered that this bias in amino acid
composition in S. coelicolor was not randomly distributed
among the 7555 proteins. Forty percent of the proteins in S.
coelicolor are missing one or more of the less commonly
occurring ten amino acids (CWNKYMHFIQ) (Fig. 3).
Figure 3: The relative frequency of absence of each of the 20 amino acids in
S. coelicolor.
Genes having QORFs and PORFs have the most restricted
amino acid use. If these are the most ancient and least altered
genes, they should correspond to the most essential proteins
and protein folds. The 228 QORF (quartet ORF) genes vary in
length from 28 to 464 amino acids; there are 170 QORFs with
>100 amino acids. When these 170 genes are examined, 87%
of the coding is from the GC-rich half of the genetic code.
Only 19 of the expected gene products are characterized by
homology. These include 5 dehydrogenases, 3 kinases, 2
esterases, a permease, a deformylase, 2 ABC transport
proteins, a two-component regulator, and three ribosomal
proteins (S12, L18 and L33).
VI. RIBOSOMAL PROTEINS:
Further analyses revealed that most of the ribosomal
protein genes in S. coelicolor had double and triple ORFs and
are missing one or more of the ten least commonly occurring
amino acids. The ribosomal subunit L6 (RSL6) was selected
for further analysis. We were able to identify 125 homologs in
the SWISS-PROT TrEMBL database based on sequence
similarity to an L6 for which a crystal structure has been
reported [1RL6] (Fig. 5). Of these 125 RSL6 sequences, 50%
are missing Trp, 64% are missing Cys and 35% are missing
both. The RSL6 proteins in archaea rarely have Trp and/or
Cys residues whereas those from eukaryotes usually have both.
Comparison of the amino acid sequences of the 125 RSL6
proteins reveals that Cys is not conserved in any sequence
position at greater than 8%. In 37 sequences, a Trp residue
resides in a common position on the surface of RSL6 in
bacteria but not archaea. Additionally, we expanded the search
to include 13 of the largest proteins from the 50S ribosomal
subunit. Homologs of these proteins are also missing Trp or
Cys or both (Fig. 4). Further investigation reveals a high
incidence of MORFs, codon bias and amino acid bias in most
prokaryotic ribosomal proteins. Regardless of overall GC
content, many ribosomal proteins have genes containing
double ORFs and have no cysteines or tryptophans in their
amino acid composition.
Figure 4: Missing amino acids in several ribosomal proteins
The identities of twenty-two of the 170 residues in RSL6 are
conserved at 90% or greater (Table II). Of the conserved
identities, 18 (G11P2V2ALE) are among the ten residues that are
most populated within protein folds according to
SwissProt/TrEMBL statistics and 17 of them (G11P2K3A) are
from the tRNA synthetase II family [3]. We have previously
noted an enhanced frequency of amino acids related to tRNA
Syn II in the conserved identities in the SCOR enzyme
family [1]. We postulate that if one of the tRNA Syn enzymes
evolved before the other, the observed frequencies of
conservation of amino acids in ancient proteins suggest that
tRNA Syn II arose first [8].
The highly conserved residues are generally seen in the turns
of RSL6 and rarely in the α-helices and β-sheets (Fig. 5).
The charged residues are extended away from the structure
where they can interact with rRNA or other ribosomal surfaces.
The presence of conserved tRNA Syn II residues in the turns of
RSL6 matches the pattern previously observed in SCOR
enzymes [1].

TABLE II: RESIDUES CONSERVED AT >90% IN RSL6. AMINO ACIDS FROM
TRNA SYNTHETASE I ARE IN RED; THOSE FROM TRNA SYNTHETASE II ARE IN GREEN.
AA   Position   %
P    11         96
V    14         92
G    27         95
G    30         95
G    65         92
G    77         10
V    78         96
G    81         95
L    86         92
G    90         91
G    92         91
G    107        97
G    134        95
K    137        91
A    144        96
P    153        90
Y    156        93
K    157        90
G    158        92
K    159        91
G    160        93
E    166        91

VII. CONCLUSIONS
Regardless of overall GC content, many ribosomal proteins
have genes containing double ORFs and have no cysteines or
tryptophans in their amino acid composition. The MORFs
found in S. coelicolor genes appear to identify a subset of the
codon system that evolved first as well as a subset of amino
acids that may have constituted the earliest folded proteins.
These findings suggest that MORFs, severe codon bias and the
absence of Trp and Cys residues are hallmarks of ancient
enzymes that have been little altered by millions of years of
evolution.
Figure 5: structure of PDB: 1RL6. Colors indicate secondary structure with
alpha helices: red, beta sheets: yellow and turns: green. Conserved residues
(>90% ID) indicated in purple.
REFERENCES
[1] W. L. Duax, V. Pletnev, A. Addlagatta, J. Bruenn and C. M. Weeks. Rational proteomics I: Predicting fold and cofactor preference in the short-chain oxidoreductase (SCOR) enzyme family. Proteins, 2003, Vol. 53, pp 931-943.
[2] S. D. Bentley, K. F. Chater, A-M. Cerdeno-Tarraga, G. L. Challis, N. R. Thomson, K. D. James, D. E. Harris, M. A. Quail, H. Kieser, D. Harper, A. Bateman, S. Brown, G. Chandra, C. W. Chen, M. Collins, A. Cronin, A. Fraser, A. Goble, J. Hidalgo, T. Hornsby, S. Howarth, C-H. Huang, T. Kieser, L. Larke, L. Murphy, K. Oliver, S. O’Neil, E. Rabbinowitsch, M-A. Rajandream, K. Rutherford, S. Rutter, K. Seeger, D. Saunders, S. Sharp, R. Squares, S. Squares, K. Taylor, T. Warren, A. Wietzorrek, J. Woodward, B. G. Barrell, J. Parkhill, and D. A. Hopwood. Complete genome sequence of the model actinomycete Streptomyces coelicolor. Nature, 2002, Vol. 417, pp 141-147.
[3] B. Boeckman, A. Bairoch, R. Apweiler, M.-C. Blatter, A. Estreicher, E. Gasteiger, M. J. Martin, K. Michoud, C. O'Donovan, I. Phan, S. Pilbout, and M. Schneider. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 2003, Vol. 31, pp 365-370.
[4] C. Sander and R. Schneider. Database of homology-derived protein structure and the structural meaning of sequence alignment. Proteins, 1991, Vol. 9, pp 56-68.
[5] E. N. Trifonov. Consensus temporal order of amino acids and evolution of the triplet code. Gene, 2000, Vol. 261, pp 139-151.
[6] T. Maeshiro and M. Kimura. The role of robustness and changeability on the origin and evolution of genetic codes. Proc. Natl. Acad. Sci. USA, 1998, Vol. 95, pp 5088-5093.
[7] E. N. Trifonov, A. Kirzhner, M. Kirzhner and I. N. Berezovsky. Distinct stages of protein evolution as suggested by protein sequence analysis. J. Mol. Evol., 2001, Vol. 53, pp 394-401.
[8] C. Carter and W. Duax. Did tRNA synthetase classes arise on opposite strands of the same gene? Mol. Cell, 2002, Vol. 10(4), pp 705-708.
[9] G. A. Singer and D. A. Hickey. Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol., 2000, Vol. 17(11), pp 1581-1588.