The percentage of bacterial genes on leading versus lagging strands
is influenced by multiple balancing forces
Xizeng Mao1, Han Zhang1,4, Yanbin Yin1,2, Ying Xu1,2,3
1Computational Systems Biology Lab, Department of Biochemistry and Molecular Biology and
Institute of Bioinformatics, University of Georgia, Athens, GA 30605; 2Department of BioEnergy
Science Center (BESC), Oak Ridge, TN 37831; 3College of Computer Science and
Technology, Jilin University, Changchun, Jilin, China; and 4Department of Automation, Nankai
University, Tianjin, China
Abstract
It has been observed that the majority of protein-encoding genes in a bacterial genome are located
on the leading genomic strand, and the percentage of such genes has a large variation across
different bacteria. While some explanations have been proposed for this observed strand bias,
these explanations are at most partial explanations since they cover only small percentages of
leading-strand genes (~10%), leaving the majority of such genes unexplained. We have carried
out a computational study on 802 sequenced bacterial genomes, aiming to elucidate other factors
that may have influenced the strand-location of genes in a bacterium. Our analyses suggest that (a)
genes of some functional categories such as ribosome and translation have higher tendencies to be
on the leading strands; (b) the level of such tendency for individual genes is influenced by their
relationships to the survivability of the bacterium; (c) there is a balancing force that tends to keep
genes from all moving to the leading strand during evolution: a more balanced genome facilitates
higher gene-densities in a genome; and (d) the percentage of leading-strand genes in a bacterium
can be accurately predicted based on the numbers of genes in the functional categories outlined in
(a) and (b).
Introduction
It has been observed that the majority of bacterial genes tend to be located on the leading strand
in their genome, and the percentage of such genes has a large variation across different bacteria,
ranging from ~45% to ~90% (1, 2). A number of studies have been carried out aiming to provide
explanations for these observations. A key factor considered in these studies is the different
mechanisms employed by bacterial cells in replication of the leading and the lagging strands
when cell replication and transcription occur simultaneously (3, 4). Specifically, during
chromosomal replication, DNA and RNA polymerases move in the same direction on the leading
strand but in opposite directions on the lagging strand, creating the possibility of head-on
collisions between the two polymerases during transcription of some genes on the lagging strand,
hence making the lagging strand the less efficient strand (1, 4). In an earlier study, Brewer
suggested that bacterial cells may be under a selection pressure to have highly expressed genes
reside on the leading strand (3).
Rocha and Danchin recently argued that it is really the
essentiality instead of the needed expression levels of genes that may have driven certain genes to
the leading strand (5, 6). While this interpretation seems to be correct, it provides only a partial
answer since essential genes account for only a small portion of the whole gene set encoded in a
bacterial genome, e.g., ~10% in E. coli (7, 8) and ~10% in B. subtilis (9). Price et al. observed
that longer operons tend to be on the leading strand, and suggested that there may be a selection
pressure to have such an arrangement to avoid interruptions during transcription of such operons
(10). Furthermore, Rocha observed that the presence or absence of the DNA polymerase PolC in a
genome is highly correlated with whether at least 70% of the genes in that genome are on the
leading strand (11). Hu et al. proposed that replication-associated purine asymmetry may
also contribute to the strand-bias in a genome (12). While these studies have provided some
partial explanations of the aforementioned observations, the general issues of why the majority of
bacterial genes tend to be located on the leading strands and why this percentage has such a large
variation remain largely unanswered.
We present here a computational analysis of all the sequenced bacterial genomes aiming to
provide a more general explanation to the two observations. Our key findings are (a) genes of
different functional categories have different levels of tendency to be on the (more efficient)
leading strand; (b) the level of such tendency by individual genes is influenced by their
relationships to the survivability of the bacterium in its environment; (c) there is at least one
balancing force that keeps genes from all moving to the leading strand during evolution, i.e., a
more balanced genome facilitates a higher gene density in a genome; and (d) the percentage of
leading-strand genes for a bacterium can be accurately predicted based on the number of genes in
some functional categories outlined in (a) and (b). Based on these findings, we believe that the
percentage of genes on the leading versus lagging strand in a genome is the result of two sets of
balancing forces, one that tends to drive genes of certain functional categories to the leading
strands to make the bacteria more efficient in their responses to environmental changes and one
that tends to keep the genome as compact as possible to stay energetically efficient when
replicating and maintaining the genome.
Results and Discussion
Genes on leading strands
We have analyzed all the 802 sequenced bacterial genomes in terms of the strand biases of their
protein-encoding genes. Fig. 1(a) shows the percentage distribution of leading-strand genes
across all the 802 genomes, ranging from 45% to ~90%. This observation extends the previous
observations made based on a few bacterial genomes. We have also examined gene-expression
levels versus genes on leading and lagging strands of E. coli, which has a substantial amount of
microarray gene-expression data (see footnote 1). We found that the percentage of genes with similar expression
levels on the leading strand increases as the expression level (averaged over all the available
experimental conditions) goes up, as shown in Fig. 1(b), which is consistent with a previous
finding (4, 12) that highly expressed genes tend to be on the leading strands.
We have examined the relationship between strand biases across different bacteria and their
habitat styles. Of the 802 bacteria under consideration, 768 have lifestyle information in the
NCBI database (13) so we used only 768 genomes in this analysis. The 768 bacteria are classified
into five types according to their living environments: specialized type for bacteria (72 out of 768)
living in specialized environment such as marine thermal vents; host-associated type (14, 15) for
bacteria (280 out of 768) associated with a host; aquatic type (16) for bacteria (133 out of 768)
living in fresh or ocean water; multiple type for bacteria (238 out of 768) living in multiple types
of environments; and terrestrial type for bacteria (45 out of 768) living in the soil. These five
types are ordered according to the stability of their living environments, going from the most
stable to the least stable (17-20) (see Dataset S1). Fig. 1(c) shows that there is a positive
correlation between the increased variability of environments and the increased percentage of
genes on the leading strands across all bacteria excluding those of the host-associated type (this
group of bacteria is intrinsically different from the others and needs to be considered separately).
We have also examined bacterial genomes of different taxonomic groups, and found that different
phyla have substantially different averaged percentages of genes on the leading strands (see Fig.
S1), which is consistent with a previous finding made on a smaller group of bacterial genomes
(21).
Footnote 1: We did not use the codon adaptation index, as done in previous studies, for estimating gene-expression
levels since it is too crude compared to the microarray gene expression data.
Genes of certain functional categories have higher tendencies to be on the leading strands
We have examined whether genes of different functional categories have different levels of
tendency to be on the leading strands across all bacteria. To do this, we checked all the genes
with GOslim functional assignments (22) in 773 out of the 802 genomes (the other 29 genomes
do not have GO-based annotations). For each of the 127 GOslim functional categories, we
consider that the category prefers the leading strand if genes in this category (across all
genomes) have a higher percentage on the leading strands than the overall percentage of genes on the leading strands
across all bacterial genomes. The Wilcoxon rank-sum test is used to assess the statistical
significance of an observed preference measured using a p-value. We found that 63 out of the 127
categories prefer the leading strand with p-value < 0.01, including genes related to translation,
protein binding, structural molecule activity, motor activity as well as RNA-binding genes as
shown in Fig. 2. On average, 58% of the genes encoded in a bacterial genome are covered by these
63 functional categories, and the detailed distribution of this percentage across different bacterial
genomes is given in Fig. S2.
To check if our analysis covers the observation by Rocha and Danchin (5) that essential genes
tend to be on the leading strands, we created an artificial functional category “essential genes”,
and applied our analysis to all the essential genes in 13 bacterial genomes in the DEG database
(23), which has the annotated essential gene information on these 13 genomes. Not surprisingly,
this category has a significant p-value for tending to be on the leading strands (see Fig. S3 for
details), indicating that our explanation covers the observation made by Rocha and Danchin (5).
Functional categories determine the level of preference for genes to be on leading strands
The above analysis indicates that genes of different functional categories have different levels of
preference for the leading strands. For example, some categories, e.g., ribosome (GO:0005840),
always have most genes on the leading strand regardless of the overall percentage of genes on the
leading strand in a genome, while other categories, e.g., transcription factor activity (GO:0003700)
increase their percentage of leading-strand genes along with the increase of the overall percentage
of leading-strand genes in a genome. We have carried out an analysis aiming to derive a
comprehensive picture of how the percentage of leading-strand genes in each such functional category changes as a function of the percentage of all leading-strand genes in a genome. We noted
that 62 out of the 127 functional categories show consistent changes in the percentage of
their leading-strand genes as the percentage of leading-strand genes in a genome increases, checked
using a Pearson correlation score ≥ 0.5 and p-value ≤ 0.05 as the cutoffs (essentially we checked
if these two quantities are highly linearly correlated), while other functional categories do not
show any consistent relationships between the two. Fig. S4(a) and (b) show two examples, i.e.,
transcription factor activity and translation factor activity, of such cases with consistent changes.
We grouped the 62 functional categories into three groups: those having similar rates of
increase between the percentages of leading-strand genes in a functional category and in a whole
genome (i.e., the slope is within [0.9, 1.1]), those having substantially higher rates of increase
(i.e., the slope is > 1.1), and those having substantially lower rates (i.e., the slope is < 0.9).
Out of the 62 functional categories (covering 65% of genes on average across 802 genomes), 28
(covering 60% of genes on average) are in the first group; 29 (covering 42% of genes on average)
are in the second group; and 5 (covering 14% of genes on average) are in the third group. The
coverages of the three groups sum to more than 65% because of the overlaps among these GOslim
categories (see Fig. 3 and S2). In the second group, the following functional categories have the
highest preference for the leading strands: motor activity (GO:0003774), protein transport
(GO:0015031) and protein complex (GO:0043234), while in the third group, only one functional
category, i.e., transcription factor activity (GO:0003700), shows an obvious preference for the
lagging strands. Regarding genes preferring the leading strands, some explanation is provided in
the next section. Regarding why transcription factors show preference to the lagging strands, one
possible explanation could be that transcription factors, particularly non-global transcription
factors, are known to have low expression levels (24) and hence are the last group of genes to
move to the leading strands during evolution.
Having certain genes on leading strands may enhance the survivability of a bacterium: a
case study
We hypothesize that some genes have moved to the leading strands to enable bacteria to respond
more quickly to environmental changes, which would improve their survivability in complex
environments. We have tested this hypothesis on the following data. It has been reported that the
chemotactic response of P. haloplanktis (P. halo) in exploiting ephemeral microscale nutrient
patches is at least 10 times faster than that of E. coli (25), suggesting that P. halo may be
genetically optimized for this particular capability. To check whether some genes are specifically
located on the leading strand of the organism, we have examined the strand distribution of genes
across the 127 GOslim functional categories on P. halo and E. coli. We found that genes of some
functional categories are significantly more enriched (with p-value < 0.05) on the leading strand of P.
halo than on that of E. coli, including DNA binding, transcription regulator activity, protein kinase activity, motor
activity and transporter activity (see Fig. S5). This makes sense, as
collectively having more genes related to motor activity, transporter activity and transcription
regulation, among others, on the leading strand may enable the bacterium to react much faster when
nutrients become available (26, 27).
A balancing force: strand bias versus gene density
Our analysis suggests that there might be a selection pressure for a bacterium to have a more
compact genome (i.e., a shorter genome without losing genes), particularly in a complex
environment. To test this, we have examined the percentages of coding regions in the two groups
of bacteria, one containing all bacteria with at least 70% of the genes on the leading strands and
one containing all the other sequenced bacteria, and checked their relationship with the living
styles of the bacteria. Our analysis revealed that (i) the bacteria in the second group (with less
strand-bias) tend to have higher percentages of coding regions than those in the first group, with a
p-value of 1.1 × 10^-8 based on the Wilcoxon rank-sum test, as shown in Fig. 4(a); and (ii) this tendency
is more significant for bacteria living in complex environments, as shown in Fig. 4(b) – (f). One
possible explanation is that there might be a selection pressure for bacteria living in nutrient-depleted environments to keep their genomes as compact as possible (without losing genes), and
having a more balanced genome is one way to achieve this goal (a more balanced genome seems
to allow a higher degree of overlap between regulatory regions of operons).
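The "percentage of coding region" used in this comparison can be illustrated with a minimal sketch. The paper does not state how this quantity was computed, so the approach below (merging overlapping gene intervals and dividing the covered length by the chromosome length) is an assumption, and the function name `coding_fraction` is hypothetical.

```python
def coding_fraction(gene_intervals, genome_len):
    """Fraction of the chromosome covered by at least one gene.

    gene_intervals: (start, end) pairs, 1-based inclusive, start <= end.
    Assumption: genes that wrap around the origin have been split beforehand.
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(gene_intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block before opening a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping genes are counted once.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / genome_len


# Toy example: two overlapping genes and one separate gene on a 400-bp chromosome.
example = coding_fraction([(1, 100), (50, 150), (201, 300)], 400)
```

Counting overlapping bases once keeps the fraction at or below 1 even for very compact genomes.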
A model for predicting the percentage of leading-strand genes
Our main hypothesis suggests that the percentage of leading-strand genes in a genome reflects the
relationship between the key functionalities and the living environment of an organism. To check
this hypothesis, we have examined the number of genes in each functional category
encoded in each genome to see if some of them can be used to predict the percentage of leading-strand genes. We found that the numbers of genes in 23 out of the 127 GOslim categories can
predict the percentage of leading-strand genes in a genome well, with a linear correlation
coefficient of 0.9, as shown in Fig. 5. Specifically, the 23 functional categories are RNA binding,
cell cycle, structural molecule activity, plasma membrane, cell envelope, cellular homeostasis,
generation of precursor metabolites and energy, secondary metabolic process, cellular component
organization, transport, antioxidant activity, translation, signal transducer activity, electron carrier
activity, protein modification process, vacuole, hydrolase activity, cell recognition, cell
communication, nucleus, motor activity, transcription factor activity, and response to stress. This
result is highly consistent with our above analysis results.
Using the 23 numbers from each genome, we have trained a neural network with one hidden layer
to predict the overall percentage of genes on the leading strand as follows:
$$P = \sum_{j=1}^{k_2} w_j^{(2)} \, f\!\left( \sum_{i=1}^{k_1} w_{ij}^{(1)} p_i \right), \qquad p_i = \frac{n_i}{n_{i,\max}}, \qquad k_1 = 23, \; k_2 = 10$$
where P is the percentage of leading-strand genes in a genome, f is a hyperbolic tangent
sigmoid transfer function, $w_{ij}^{(1)}$ is the weight of the ith functional category to the jth node of the
hidden layer, $w_j^{(2)}$ is the weight of the jth node in the hidden layer to the output node in the
neural network model, $k_1$ is the number of functional categories, $k_2$ is the number of nodes of the
hidden layer, and $p_i$ is a scaling factor for each functional category, calculated as the ratio between the
number ($n_i$) of genes under this category and the maximum number ($n_{i,\max}$) of genes under this
category across all 773 bacterial genomes.
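The forward computation of this model can be sketched in a few lines of Python. This is only an illustration of the formula as written (no bias terms, a single tanh hidden layer); the function names and the toy weights below are hypothetical and are not the trained parameters.

```python
import math


def scale_counts(counts, max_counts):
    """p_i = n_i / n_i,max for each functional category."""
    return [n / m if m else 0.0 for n, m in zip(counts, max_counts)]


def predict_percentage(p, w1, w2):
    """P = sum_j w2[j] * tanh(sum_i w1[i][j] * p[i]).

    w1[i][j]: weight from functional category i to hidden node j.
    w2[j]:    weight from hidden node j to the output node.
    """
    k1, k2 = len(p), len(w2)
    hidden = [
        math.tanh(sum(w1[i][j] * p[i] for i in range(k1)))
        for j in range(k2)
    ]
    return sum(w2[j] * hidden[j] for j in range(k2))


# Toy example with k1 = 2 inputs and k2 = 2 hidden nodes (the paper uses 23 and 10).
p = scale_counts([4, 5], [4, 10])          # -> [1.0, 0.5]
w1 = [[1.0, 0.0], [0.0, 2.0]]              # hypothetical weights
w2 = [0.5, 0.5]                            # hypothetical weights
toy_prediction = predict_percentage(p, w1, w2)
```

With the paper's settings, `p` would hold 23 scaled gene counts, `w1` would be a 23 × 10 table and `w2` a list of 10 weights.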
Concluding remark
It has been observed that bacterial genomes have a large variation in terms of the percentage of
their leading-strand genes, ranging from ~45% to ~90%. We have provided our explanation for
the observed strand-biases and the large variation of the biases, which extends the previously
proposed explanations by a substantial margin. Our key contributions through this study are
that (a) we demonstrated that it is the genes of certain functional categories that need to be on the
leading strands of genomes, to enhance the survivability of the bacteria; (b) the level of leading-strand bias is probably dominated by the living environments of the bacteria; (c) there is a
balancing force that keeps genes from all moving to the more efficient leading strands during
evolution, particularly in nutrient-depleted environments; and (d) the percentage of leading-strand
genes for a bacterial genome can be accurately predicted using the numbers of genes in 23
functional categories outlined in (a) and (b). We anticipate that more sophisticated analyses could
possibly lead to quantitative models relating the percentage of leading-strand genes in a
bacterium to a few parameters reflecting the environment where the organism lives, giving rise to
improved understanding about the rules that may determine which genes will be on the leading
versus the lagging strand of a genome.
Material and Methods
Data
802 bacterial genome sequences along with their predicted genes and functional annotations were
retrieved from the NCBI FTP site as of 01/14/2009. The GO annotations for these genomes were
from the GOA Proteome Sets (v52) (28), and the GOslim definitions were downloaded from the
Gene Ontology site (http://www.geneontology.org/GO_slims/goslim_generic.obo) (22).
The microarray data for E. coli were downloaded from the M3D web site (http://m3d.bu.edu) (29).
Annotation with high-level functional categories
Gene Ontology (GO) (22) was used to define functional categories of gene products. Based on
the detailed GO annotation and GO hierarchy information, the Perl script map2slim
(http://search.cpan.org/~cmungall/go-perl/scripts/map2slim) was applied to the bacterial genomes
to assign GOslim-based functional categories.
Determination of genes on leading and lagging strands
To determine if genes are on the leading versus lagging strand, the origin and the terminus of
replication are needed. The origin of replication for each of the 802 bacterial genomes was
predicted using the Ori-Finder web server (30). The terminus of replication is thus calculated as
the location of origin of replication plus half of the chromosome length. With these two positions,
the leading and lagging strands are predicted for each half of the chromosome according to a
well-known fact that the leading strand always has more genes than the lagging strand does (11).
For each bacterium, only the major chromosome is considered, and plasmids are excluded in this
study.
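The procedure above can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes each gene is represented by a midpoint coordinate and a '+' or '-' strand label, takes the origin as given (the paper predicts it with Ori-Finder), and the function name is hypothetical.

```python
from collections import Counter


def leading_strand_fraction(genes, origin, chrom_len):
    """Fraction of genes on the leading strand of a circular chromosome.

    genes: iterable of (midpoint, strand) with strand '+' or '-'.
    The terminus is implied at origin + chrom_len / 2, which splits the
    chromosome into two halves. Following the text, the strand carrying
    more genes in each half is taken as the leading strand of that half.
    """
    half = chrom_len / 2.0
    counts = {0: Counter(), 1: Counter()}
    for pos, strand in genes:
        # Distance from the origin, wrapping around the circle.
        offset = (pos - origin) % chrom_len
        counts[0 if offset < half else 1][strand] += 1

    leading = total = 0
    for side in (0, 1):
        plus, minus = counts[side]['+'], counts[side]['-']
        leading += max(plus, minus)
        total += plus + minus
    return leading / total if total else float('nan')


# Toy chromosome of length 1000 with the origin at position 0.
toy_genes = [(100, '+'), (200, '+'), (300, '-'),
             (600, '-'), (700, '-'), (800, '+')]
toy_fraction = leading_strand_fraction(toy_genes, origin=0, chrom_len=1000)
```

Because the majority strand of each half is labeled as leading, this rule can never report a value below 0.5 for a chromosome.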
Tendency of functional categories on the leading versus lagging strands
Given a GO functional category, an index value x is calculated using the following formula:
$$x = \frac{n_0}{n_0 + n_1},$$
where n0 is the number of leading-strand genes of this category and n1 is the number of lagging-strand genes of this category. x is calculated for all the GOslim functional categories for all 802
genomes so that for each category, there is a data set (A) of 802 values. In addition the overall
percentage of leading-strand genes (data set B) is obtained for each of the 802 genomes as well.
For each GOslim functional category, a Wilcoxon rank sum test was performed to test if the data
sets A and B are from two distinct distributions; and a linear regression analysis was performed to
test if there is any good linear correlation between A and B. All the statistical analyses were
conducted using the R statistical language (http://www.r-project.org).
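As an illustration of the index x and of the linear analysis, the following Python sketch computes x for a category, fits a least-squares line of the per-category values (data set A) against the genome-wide values (data set B), and assigns the slope to one of the three groups used in the Results. The original analysis was done in R; the function names here are hypothetical, and regressing A on B (rather than B on A) is an assumption based on the description of the slopes.

```python
def leading_index(n_leading, n_lagging):
    """x = n0 / (n0 + n1); None when the category has no genes."""
    total = n_leading + n_lagging
    return n_leading / total if total else None


def slope_and_pearson(genome_pct, category_pct):
    """Least-squares slope of category_pct on genome_pct, and Pearson r."""
    n = len(genome_pct)
    mx = sum(genome_pct) / n
    my = sum(category_pct) / n
    sxx = sum((x - mx) ** 2 for x in genome_pct)
    syy = sum((y - my) ** 2 for y in category_pct)
    sxy = sum((x - mx) * (y - my) for x, y in zip(genome_pct, category_pct))
    if sxx == 0 or syy == 0:
        raise ValueError("both data sets must vary")
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


def slope_group(slope):
    """Group a slope using the cutoffs 0.9 and 1.1 given in the Results."""
    if slope > 1.1:
        return "higher"
    if slope < 0.9:
        return "lower"
    return "similar"


# Toy data for four genomes (values are made up for illustration).
B = [0.5, 0.6, 0.7, 0.8]        # overall leading-strand fraction per genome
A = [0.4, 0.55, 0.7, 0.85]      # leading-strand fraction within one category
slope, r = slope_and_pearson(B, A)
group = slope_group(slope)
```

The p-value cutoff on the correlation and the Wilcoxon rank-sum test are not reproduced here.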
Prediction of the percentage of leading-strand genes in a genome
A neural network, with 23 input nodes, one hidden layer of 10 nodes and one output node, is
employed to predict the percentage of leading-strand genes in a genome based on the number of
genes in each of 23 functional categories, using the Neural Network Toolbox for MATLAB. The
training data consists of the leading-strand gene percentage across 464 (60%) genomes, arbitrarily
selected from the 773, along with the 23 numbers extracted from each of the 464 genomes, while
the other genomes serve as the validation set. We trained the model for 7 × 10^5 iterations and obtained an
MSE (mean squared error) of 0.0012. We then applied the trained neural network to the whole set
of 773 bacterial genomes, the ones with GOslim functional assignments, and obtained a Pearson
correlation score of 0.9 between the predicted and the actual percentages. The trained neural
network can be downloaded from http://csbl.bmb.uga.edu/~xizeng/research/leading_strand_bias/.
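The data split and the error measure described above can be sketched as follows. This is not the MATLAB workflow used in the study; it only illustrates an arbitrary 60%/40% split of 773 genomes and the mean squared error, with hypothetical function names and an arbitrary random seed.

```python
import random


def split_train_validation(items, train_fraction=0.6, seed=0):
    """Arbitrarily assign items to a training set and a validation set."""
    rng = random.Random(seed)   # fixed seed only to make the sketch repeatable
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def mean_squared_error(predicted, actual):
    """Average of the squared differences between predictions and targets."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# 773 genome indices split 60/40, matching the 464 training genomes in the text.
train_ids, validation_ids = split_train_validation(range(773))

# Toy predictions versus actual leading-strand fractions.
toy_mse = mean_squared_error([0.7, 0.8], [0.6, 0.8])
```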
Acknowledgement
We thank all the members of the CSBL Lab at UGA, especially Dr. Victor Olman for discussion
of statistical analyses. This work was supported in part by the National Science Foundation
(DEB-0830024 and DBI-0542119) and the DOE BioEnergy Science Center grant (DE-PS02-06ER64304), which is supported by the Office of Biological and Environmental Research in the
Department of Energy Office of Science.
References
1. Koonin EV (2009) Evolution of genome architecture. Int J Biochem Cell Biol 41(2):298-306.
2. Zivanovic Y, Lopez P, Philippe H, & Forterre P (2002) Pyrococcus genome comparison evidences chromosome shuffling-driven evolution. Nucleic Acids Res 30(9):1902-1910.
3. Brewer BJ (1988) When polymerases collide: replication and the transcriptional organization of the E. coli chromosome. Cell 53(5):679-686.
4. French S (1992) Consequences of replication fork movement through transcription units in vivo. Science 258(5086):1362-1365.
5. Rocha EP & Danchin A (2003) Essentiality, not expressiveness, drives gene-strand bias in bacteria. Nat Genet 34(4):377-378.
6. Rocha EP & Danchin A (2003) Gene essentiality determines chromosome organisation in bacteria. Nucleic Acids Res 31(22):6570-6577.
7. Hashimoto M, et al. (2005) Cell size and nucleoid organization of engineered Escherichia coli cells with a reduced genome. Mol Microbiol 55(1):137-149.
8. Kato J & Hashimoto M (2007) Construction of consecutive deletions of the Escherichia coli chromosome. Mol Syst Biol 3:132.
9. Kobayashi K, et al. (2003) Essential Bacillus subtilis genes. Proc Natl Acad Sci U S A 100(8):4678-4683.
10. Price MN, Alm EJ, & Arkin AP (2005) Interruptions in gene expression drive highly expressed operons to the leading strand of DNA replication. Nucleic Acids Res 33(10):3224-3234.
11. Rocha E (2002) Is there a role for replication fork asymmetry in the distribution of genes in bacterial genomes? Trends Microbiol 10(9):393-395.
12. Hu J, Zhao X, & Yu J (2007) Replication-associated purine asymmetry may contribute to strand-biased gene distribution. Genomics 90(2):186-194.
13. NCBI Entrez Genome Project.
14. Nilsson AI, et al. (2005) Bacterial genome size reduction by experimental evolution. Proc Natl Acad Sci U S A 102(34):12112-12116.
15. Hosokawa T, Kikuchi Y, Nikoh N, Shimada M, & Fukatsu T (2006) Strict host-symbiont cospeciation and reductive genome evolution in insect gut bacteria. PLoS Biol 4(10):e337.
16. Robertson BR & Button DK (1989) Characterizing aquatic bacteria according to population, cell size, and apparent DNA content by flow cytometry. Cytometry 10(1):70-76.
17. Parter M, Kashtan N, & Alon U (2007) Environmental variability and modularity of bacterial metabolic networks. BMC Evol Biol 7:169.
18. Ochman H & Moran NA (2001) Genes lost and genes found: evolution of bacterial pathogenesis and symbiosis. Science 292(5519):1096-1099.
19. Xu J (2006) Microbial ecology in the age of genomics and metagenomics: concepts, tools, and recent advances. Mol Ecol 15(7):1713-1731.
20. Bardgett RD (2002) Causes and consequences of biological diversity in soil. Zoology (Jena) 105(4):367-374.
21. Worning P, Jensen LJ, Hallin PF, Staerfeldt HH, & Ussery DW (2006) Origin of replication in circular prokaryotic chromosomes. Environ Microbiol 8(2):353-361.
22. Ashburner M, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1):25-29.
23. Zhang R & Lin Y (2009) DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes. Nucleic Acids Res 37(Database issue):D455-458.
24. Janga SC, Salgado H, & Martinez-Antonio A (2009) Transcriptional regulation shapes the organization of genes on bacterial chromosomes. Nucleic Acids Res 37(11):3680-3688.
25. Stocker R, Seymour JR, Samadani A, Hunt DE, & Polz MF (2008) Rapid chemotactic response enables marine bacteria to exploit ephemeral microscale nutrient patches. Proc Natl Acad Sci U S A 105(11):4209-4214.
26. Amos L & Klug A (1974) Arrangement of subunits in flagellar microtubules. J Cell Sci 14(3):523-549.
27. Wemmer KA & Marshall WF (2004) Flagellar motility: all pull together. Curr Biol 14(23):R992-993.
28. Barrell D, et al. (2009) The GOA database in 2009--an integrated Gene Ontology Annotation resource. Nucleic Acids Res 37(Database issue):D396-403.
29. Faith JJ, et al. (2008) Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata. Nucleic Acids Res 36(Database issue):D866-870.
30. Gao F & Zhang CT (2008) Ori-Finder: a web-based system for finding oriCs in unannotated bacterial genomes. BMC Bioinformatics 9:79.
Figure legends
Fig. 1: General characteristics for leading-strand genes: (a) distribution of the number of bacteria with a
specific percentage of genes on the leading strands; (b) distribution of the percentages of leading-strand
genes as a function of the average gene expression level in E. coli; and (c) distribution of the
percentages of leading-strand genes across all sequenced bacteria in different environments.
Fig. 2: Tendency of genes in different functional categories to the leading strands for 773 bacterial genomes.
Fig. 3: Preference of genes of different functional categories to the leading strands across 773 bacterial
genomes.
Fig. 4: Boxplots of the percentage of coding region versus the percentage of leading-strand genes in a
genome: (a) all bacteria (p-value of the Wilcoxon test: 1.1 × 10^-8); (b) bacteria of specialized type
with p-value 0.22; (c) bacteria of host-associated type with p-value 0.54; (d) bacteria of aquatic type with
p-value 0.065; (e) bacteria of terrestrial type with p-value 0.0031; and (f) bacteria of multiple type with p-value 1.9 × 10^-9.
Fig. 5: Performance in predicting the percentage of leading-strand genes in a genome by a neural network
on 773 genomes.
[Figure 1, panels (a)-(c). Axes: (a) density versus overall percentage of genes on leading strand; (b) percentage of genes on the leading strand versus bins of gene expression levels; (c) percentage of genes on leading strand versus bacterial habitat lifestyle (Specialized, Host-associated, Aquatic, Multiple, Terrestrial).]
[Figure 4, panels (a)-(f). Axes: percentage of coding region versus percentage of genes on leading strand (<= 0.7 and > 0.7).]
[Figure 5. Axes: actual percentages of leading-strand genes in a genome and predicted percentages of leading-strand genes in a genome.]