The percentage of bacterial genes on leading versus lagging strands is influenced by multiple balancing forces
Xizeng Mao1, Han Zhang1,4, Yanbin Yin1,2, Ying Xu1,2,3
1 Computational Systems Biology Lab, Department of Biochemistry and Molecular Biology and Institute of Bioinformatics, University of Georgia, Athens, GA 30605; 2 Department of BioEnergy Science Center (BESC), Oak Ridge, TN 37831; 3 College of Computer Science and Technology, Jilin University, Changchun, Jilin, China; and 4 Department of Automation, Nankai University, Tianjin, China
Abstract
It has been observed that the majority of protein-encoding genes in a bacterial genome are located on the leading genomic strand, and that the percentage of such genes varies widely across different bacteria. While some explanations have been proposed for this observed strand bias, they are at best partial, since they cover only a small percentage of leading-strand genes (~10%), leaving the majority of such genes unexplained. We have carried out a computational study on 802 sequenced bacterial genomes, aiming to elucidate other factors that may have influenced the strand location of genes in a bacterium. Our analyses suggest that (a) genes of some functional categories, such as ribosome and translation, have higher tendencies to be on the leading strands; (b) the level of such tendency for individual genes is influenced by their relationships to the survivability of the bacterium; (c) there is a balancing force that tends to keep genes from all moving to the leading strand during evolution: a more balanced genome facilitates higher gene densities in a genome; and (d) the percentage of leading-strand genes in a bacterium can be accurately predicted based on the numbers of genes in the functional categories outlined in (a) and (b).
Introduction
It has been observed that the majority of bacterial genes tend to be located on the leading strand in their genome, and that the percentage of such genes varies widely across different bacteria, ranging from ~45% to ~90% (1, 2). A number of studies have aimed to explain these observations. A key factor considered in these studies is the different mechanisms employed by bacterial cells in replication of the leading and the lagging strands when replication and transcription occur simultaneously (3, 4). Specifically, during chromosomal replication, DNA and RNA polymerases move in the same direction on the leading strand but in opposite directions on the lagging strand, creating the possibility of head-on collisions between the two polymerases during transcription of some genes on the lagging strand, hence making the lagging strand the less efficient strand (1, 4). In an earlier study, Brewer suggested that bacterial cells may be under selection pressure to have highly expressed genes reside on the leading strand (3). Rocha and Danchin more recently argued that it is the essentiality of genes, rather than their required expression levels, that may have driven certain genes to the leading strand (5, 6). While this interpretation seems to be correct, it provides only a partial answer since essential genes account for only a small portion of the whole gene set encoded in a bacterial genome, e.g., ~10% in E. coli (7, 8) and ~10% in B. subtilis (9). Price et al. observed that longer operons tend to be on the leading strand, and suggested that there may be selection pressure for this arrangement to avoid interruptions during transcription of such operons (10). Furthermore, Rocha observed that the presence or absence of the DNA polymerase PolC in a genome is highly correlated with whether a bacterial genome has at least 70% of its genes on the leading strand (11). Hu et al.
proposed that replication-associated purine asymmetry may also contribute to the strand bias in a genome (12). While these studies have provided partial explanations of the aforementioned observations, the general questions of why the majority of bacterial genes tend to be located on the leading strands and why this percentage varies so widely remain largely unanswered.
We present here a computational analysis of all the sequenced bacterial genomes, aiming to provide a more general explanation of the two observations. Our key findings are (a) genes of different functional categories have different levels of tendency to be on the (more efficient) leading strand; (b) the level of such tendency for individual genes is influenced by their relationships to the survivability of the bacterium in its environment; (c) there is at least one balancing force that keeps genes from all moving to the leading strand during evolution, i.e., a more balanced genome facilitates a higher gene density in a genome; and (d) the percentage of leading-strand genes for a bacterium can be accurately predicted based on the number of genes in some functional categories outlined in (a) and (b). Based on these findings, we believe that the percentage of genes on the leading versus lagging strand in a genome is the result of two sets of balancing forces: one that tends to drive genes of certain functional categories to the leading strands, making the bacteria more efficient in their responses to environmental changes, and one that tends to keep the genome as compact as possible, so that replicating and maintaining the genome stays energetically efficient.
Results and Discussion
Genes on leading strands
We have analyzed all the 802 sequenced bacterial genomes in terms of the strand biases of their protein-encoding genes. Fig. 1(a) shows the percentage distribution of leading-strand genes across all the 802 genomes, ranging from ~45% to ~90%.
This observation extends the previous observations made based on a few bacterial genomes. We have also examined gene-expression levels versus genes on leading and lagging strands of E. coli, which has a substantial amount of microarray gene-expression data [footnote 1]. We found that the percentage of genes with similar expression levels on the leading strand increases as the expression level (averaged over all the available experimental conditions) goes up, as shown in Fig. 1(b), which is consistent with a previous finding (4, 12) that highly expressed genes tend to be on the leading strands.
We have examined the relationship between strand biases across different bacteria and their habitat lifestyles. Of the 802 bacteria under consideration, 768 have lifestyle information in the NCBI database (13), so we used only these 768 genomes in this analysis. The 768 bacteria are classified into five types according to their living environments: specialized type for bacteria (72 out of 768) living in specialized environments such as marine thermal vents; host-associated type (14, 15) for bacteria (280 out of 768) associated with a host; aquatic type (16) for bacteria (133 out of 768) living in fresh or ocean water; multiple type for bacteria (238 out of 768) living in multiple types of environments; and terrestrial type for bacteria (45 out of 768) living in the soil. These five types are ordered according to the stability of their living environments, going from the most stable to the least stable (17-20) (see Dataset S1). Fig. 1(c) shows that there is a positive correlation between the increased variability of environments and the increased percentage of genes on the leading strands across all bacteria excluding those of the host-associated type (this group of bacteria is intrinsically different from the others and needs to be considered separately).
We have also examined bacterial genomes of different taxonomic groups, and found that different phyla have substantially different average percentages of genes on the leading strands (see Fig. S1), which is consistent with a previous finding made on a smaller group of bacterial genomes (21).
Footnote 1: We did not use the codon adaptation index, as done in previous studies, for estimating gene-expression levels since it is too crude compared to the microarray gene-expression data.
Genes of certain functional categories have higher tendencies to be on the leading strands
We have examined whether genes of different functional categories have different levels of tendency to be on the leading strands across all bacteria. To do this, we checked all the genes with GOslim functional assignments (22) in 773 out of the 802 genomes (the other 29 genomes do not have GO-based annotations). We consider one of the 127 GOslim functional categories to prefer the leading strand if genes in this category (across all genomes) have a higher leading-strand percentage than the overall percentage of genes on the leading strands across all bacterial genomes. The Wilcoxon rank-sum test is used to assess the statistical significance of an observed preference, measured using a p-value. We found that 63 out of the 127 categories prefer the leading strand with p-value < 0.01, including genes related to translation, protein binding, structural molecule activity and motor activity, as well as RNA-binding genes, as shown in Fig. 2. On average, 58% of the genes encoded in a bacterial genome are covered by these 63 functional categories, and the detailed distribution of this percentage across different bacterial genomes is given in Fig. S2.
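The category-level comparison described above can be illustrated with a minimal Python sketch. It assumes the per-genome leading-strand fractions of one category and the per-genome overall fractions are already available as two lists. The function names, the normal approximation (without tie or continuity correction) and the example numbers are assumptions made for illustration, not the authors' R code.

```python
import math

def leading_fraction(n_leading, n_lagging):
    """Fraction of a category's genes that lie on the leading strand."""
    return n_leading / (n_leading + n_lagging)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test using the normal approximation
    (no tie or continuity correction). Returns (z, p_value)."""
    n_a, n_b = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    w = sum(ranks[:n_a])                      # rank sum of the first sample
    mean_w = n_a * (n_a + n_b + 1) / 2
    sd_w = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (w - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical per-genome fractions, for illustration only.
category = [0.80, 0.85, 0.90, 0.95, 0.82]   # one functional category
overall = [0.50, 0.55, 0.60, 0.65, 0.70]    # all genes in the same genomes
z, p = rank_sum_test(category, overall)
prefers_leading = z > 0 and p < 0.01
```

A positive z with a small p-value corresponds to a category whose leading-strand fraction tends to exceed the genome-wide fraction. For real analyses an established implementation, such as the rank-sum test in R that the paper reports using, is preferable to this hand-rolled approximation.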
To check whether our analysis covers the observation by Rocha and Danchin (5) that essential genes tend to be on the leading strands, we created an artificial functional category “essential genes”, and applied our analysis to all the essential genes in 13 bacterial genomes in the DEG database (23), which has annotated essential-gene information for these 13 genomes. Not surprisingly, this category has a significant p-value for tending to be on the leading strands (see Fig. S3 for details), indicating that our explanation covers the observation made by Rocha and Danchin (5).
Functional categories determine the level of preference for genes to be on leading strands
The above analysis indicates that genes of different functional categories have different levels of preference for the leading strands. For example, some categories, e.g., ribosome (GO:0005840), always have most genes on the leading strand regardless of the overall percentage of genes on the leading strand in a genome, while other categories, e.g., transcription factor activity (GO:0003700), increase their percentage of leading-strand genes along with the increase of the overall percentage of leading-strand genes in a genome. We have carried out an analysis aiming to derive a comprehensive picture of how the percentage of leading-strand genes in each such functional category changes as a function of the percentage of all leading-strand genes in a genome. We noted that 62 out of the 127 functional categories show consistent changes in the percentage of their leading-strand genes as the percentage of leading-strand genes in a genome increases, checked using a Pearson correlation score ≥ 0.5 and p-value ≤ 0.05 as the cutoffs (essentially we checked if these two quantities are highly linearly correlated), while other functional categories do not show any consistent relationship between the two. Fig.
S4(a) and (b) show two examples of such cases with consistent changes, i.e., transcription factor activity and translation factor activity. We grouped the 62 functional categories into three groups: those having similar rates of increase between the percentages of leading-strand genes in a functional category and in a whole genome (i.e., the slope is within [0.9, 1.1]), those having substantially higher rates of increase (i.e., the slope is > 1.1), and those having substantially lower rates (i.e., the slope is < 0.9). Out of the 62 functional categories (covering 65% of genes on average across 802 genomes), 28 (covering 60% of genes on average) are in the first group; 29 (covering 42% of genes on average) are in the second group; and 5 (covering 14% of genes on average) are in the third group. The three groups sum to more than 65% because of overlaps among these GOslim categories (see Fig. 3 and S2). In the second group, the following functional categories have the highest preference for the leading strands: motor activity (GO:0003774), protein transport (GO:0015031) and protein complex (GO:0043234), while in the third group, only one functional category, i.e., transcription factor activity (GO:0003700), shows an obvious preference for the lagging strands. Regarding genes preferring the leading strands, some explanation is provided in the next section. Regarding why transcription factors show a preference for the lagging strands, one possible explanation could be that transcription factors, particularly non-global transcription factors, are known to have low expression levels (24) and hence are the last group of genes to move to the leading strands during evolution.
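The grouping by slope can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and group labels are hypothetical, the slope is taken from an ordinary least-squares fit of the category fraction on the genome-wide fraction, and the p-value ≤ 0.05 check used in the paper is omitted for brevity.

```python
import math

def pearson_and_slope(xs, ys):
    """Pearson correlation and least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0          # no variation, so no linear trend to report
    return sxy / math.sqrt(sxx * syy), sxy / sxx

def classify_category(genome_fracs, category_fracs, r_cutoff=0.5):
    """Assign a category to one of the slope-based groups described above."""
    r, slope = pearson_and_slope(genome_fracs, category_fracs)
    if r < r_cutoff:
        return "no consistent trend"
    if slope > 1.1:
        return "higher rate"     # second group
    if slope < 0.9:
        return "lower rate"      # third group
    return "similar rate"        # first group

# Hypothetical fractions for four genomes, for illustration only.
genomes = [0.5, 0.6, 0.7, 0.8]
print(classify_category(genomes, [0.50, 0.62, 0.74, 0.86]))
print(classify_category(genomes, [0.60, 0.65, 0.70, 0.75]))
```

With these made-up inputs the first category rises faster than the genome-wide fraction and the second rises more slowly, so they fall into the second and third groups respectively.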
Having certain genes on leading strands may enhance the survivability of a bacterium: a case study
We hypothesize that some genes have moved to the leading strands to enable bacteria to respond more quickly to environmental changes, which would improve their survivability in complex environments. We have tested this hypothesis on the following data. It has been reported that the chemotactic response of P. haloplanktis (P. halo) in exploiting ephemeral microscale nutrient patches is at least 10 times faster than that of E. coli (25), suggesting that P. halo may be genetically optimized for this particular capability. To check whether some genes are specifically located on the leading strand of the organism, we have examined the strand distribution of genes across the 127 GOslim functional categories in P. halo and E. coli. We found that genes of some functional categories are significantly more enriched (with p-value < 0.05) on the leading strand of P. halo than on that of E. coli, including DNA binding, transcription regulator activity, protein kinase activity, motor activity and transporter activity (see Fig. S5). This is plausible, as having more genes related to motor activity, transporter activity and transcription regulation, among others, on the leading strand may collectively enable the bacteria to react much faster when nutrients become available (26, 27).
A balancing force: strand bias versus gene density
Our analysis suggests that there might be selection pressure for a bacterium to have a more compact genome (i.e., a shorter genome without losing genes), particularly in a complex environment. To test this, we have examined the percentages of coding regions in two groups of bacteria, one containing all bacteria with at least 70% of their genes on the leading strands and one containing all the other sequenced bacteria, and checked their relationship with the lifestyles of the bacteria.
Our analysis revealed that (i) the bacteria in the second group (with less strand bias) tend to have higher percentages of coding regions than those in the first group, with a p-value of 1.1 × 10^-8 based on the Wilcoxon rank-sum test, shown in Fig. 4(a); and (ii) this tendency is more significant for bacteria living in complex environments, as shown in Fig. 4(b)-(f). One possible explanation is that there might be selection pressure for bacteria living in nutrient-depleted environments to keep their genomes as compact as possible (without losing genes), and having a more balanced genome is one way to achieve this goal (a more balanced genome seems to allow a higher degree of overlap between regulatory regions of operons).
A model for predicting the percentage of leading-strand genes
Our main hypothesis suggests that the percentage of leading-strand genes in a genome reflects the relationship between the key functionalities and the living environment of an organism. To test this hypothesis, we have examined the number of genes in each functional category encoded in each genome to see if some of these numbers can be used to predict the percentage of leading-strand genes. We found that the numbers of genes in 23 of the 127 GOslim categories predict the percentage of leading-strand genes in a genome well, with a linear correlation coefficient of 0.9, as shown in Fig. 5. Specifically, the 23 functional categories are RNA binding, cell cycle, structural molecule activity, plasma membrane, cell envelope, cellular homeostasis, generation of precursor metabolites and energy, secondary metabolic process, cellular component organization, transport, antioxidant activity, translation, signal transducer activity, electron carrier activity, protein modification process, vacuole, hydrolase activity, cell recognition, cell communication, nucleus, motor activity, transcription factor activity, and response to stress.
This result is highly consistent with our analysis results above.
Using the 23 numbers from each genome, we have trained a neural network with one hidden layer to predict the overall percentage of genes on the leading strand as follows:
$$P = \sum_{j=1}^{k_2} w^{(2)}_{j} \, f\!\left( \sum_{i=1}^{k_1} w^{(1)}_{ij} \, p_i \right), \qquad p_i = \frac{n_i}{n_{i,\max}}, \qquad k_1 = 23, \quad k_2 = 10,$$
where $P$ is the percentage of leading-strand genes in a genome, $f$ is a hyperbolic tangent sigmoid transfer function, $w^{(1)}_{ij}$ is the weight of the $i$th functional category to the $j$th node of the hidden layer, $w^{(2)}_{j}$ is the weight of the $j$th node in the hidden layer to the output node of the neural network model, $k_1$ is the number of functional categories, $k_2$ is the number of nodes in the hidden layer, and $p_i$ is a scaling factor for each functional category, calculated as the ratio between the number ($n_i$) of genes under this category and the maximum number ($n_{i,\max}$) of genes under this category across all 773 bacterial genomes.
Concluding remark
It has been observed that bacterial genomes vary widely in the percentage of their leading-strand genes, ranging from ~45% to ~90%. We have provided our explanation for the observed strand biases and the large variation of the biases, which extends the previously proposed explanations by a substantial margin. Our key contributions through this study are that (a) we demonstrated that it is the genes of certain functional categories that need to be on the leading strands of genomes, to enhance the survivability of the bacteria; (b) the level of leading-strand bias is probably dominated by the living environments of the bacteria; (c) there is a balancing force that keeps genes from all moving to the more efficient leading strands during evolution, particularly in nutrient-depleted environments; and (d) the percentage of leading-strand genes for a bacterial genome can be accurately predicted using the numbers of genes in 23 functional categories outlined in (a) and (b).
We anticipate that more sophisticated analyses could lead to quantitative models relating the percentage of leading-strand genes in a bacterium to a few parameters reflecting the environment where the organism lives, giving rise to an improved understanding of the rules that may determine which genes will be on the leading versus the lagging strand of a genome.
Material and Methods
Data
802 bacterial genome sequences, along with their predicted genes and functional annotations, were retrieved from the NCBI FTP site as of 01/14/2009. The GO annotations for these genomes were from the GOA Proteome Sets (v52) (28), and the GOslim definitions were downloaded from the Gene Ontology site (http://www.geneontology.org/GO_slims/goslim_generic.obo) (22). The microarray data for E. coli were downloaded from the M3D web site (http://m3d.bu.edu) (29).
Annotation with high-level functional categories
Gene Ontology (GO) (22) was used to define functional categories of gene products. Based on the detailed GO annotation and GO hierarchy information, the Perl script map2slim (http://search.cpan.org/~cmungall/go-perl/scripts/map2slim) was applied to the bacterial genomes to assign GOslim-based functional categories.
Determination of genes on leading and lagging strands
To determine whether genes are on the leading or the lagging strand, the origin and the terminus of replication are needed. The origin of replication for each of the 802 bacterial genomes was predicted using the Ori-Finder web server (30). The terminus of replication is then calculated as the location of the origin of replication plus half of the chromosome length. With these two positions, the leading and lagging strands are predicted for each half of the chromosome according to the well-known fact that the leading strand always has more genes than the lagging strand does (11). For each bacterium, only the major chromosome is considered, and plasmids are excluded in this study.
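The strand-assignment procedure can be sketched as follows. This is a minimal Python illustration, assuming a circular chromosome, genes given as (start, orientation) pairs with orientation '+' or '-', and a tie broken arbitrarily in favour of '+'. All names and the example coordinates are hypothetical, and the origin is taken as given rather than predicted.

```python
def terminus(origin, length):
    """Terminus position: origin plus half the chromosome length, wrapped."""
    return (origin + length // 2) % length

def assign_strands(genes, origin, length):
    """Label each gene 'leading' or 'lagging'.

    genes: list of (start, orientation), orientation is '+' or '-'.
    Within each half of the chromosome (origin to terminus, terminus to
    origin), the orientation carrying more genes is taken as leading.
    """
    half = length / 2
    # Half 0 runs from the origin to the terminus, half 1 is the remainder.
    halves = [0 if (start - origin) % length < half else 1
              for start, _ in genes]
    labels = [None] * len(genes)
    for h in (0, 1):
        idx = [i for i, hh in enumerate(halves) if hh == h]
        plus = sum(1 for i in idx if genes[i][1] == '+')
        leading_orientation = '+' if plus >= len(idx) - plus else '-'
        for i in idx:
            same = genes[i][1] == leading_orientation
            labels[i] = 'leading' if same else 'lagging'
    return labels

# Hypothetical 1000-bp chromosome with the origin at position 900.
genes = [(950, '+'), (100, '+'), (300, '-'),
         (500, '-'), (700, '-'), (800, '+')]
labels = assign_strands(genes, origin=900, length=1000)
fraction_leading = labels.count('leading') / len(labels)
```

In this toy case the terminus falls at position 400, each half has two genes on its majority orientation and one on the other, and four of the six genes are labelled leading.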
Tendency of functional categories on the leading versus lagging strands
Given a GO functional category, an index value x is calculated using the following formula:
$$x = \frac{n_0}{n_0 + n_1},$$
where $n_0$ is the number of leading-strand genes of this category and $n_1$ is the number of lagging-strand genes of this category. x is calculated for all the GOslim functional categories for all 802 genomes so that for each category, there is a data set (A) of 802 values. In addition, the overall percentage of leading-strand genes (data set B) is obtained for each of the 802 genomes as well. For each GOslim functional category, a Wilcoxon rank-sum test was performed to test whether the data sets A and B are from two distinct distributions, and a linear regression analysis was performed to test whether there is a good linear correlation between A and B. All the statistical analyses were conducted in the R statistical language (http://www.r-project.org).
Prediction of the percentage of leading-strand genes in a genome
A neural network, with 23 input nodes, one hidden layer of 10 nodes and one output node, is employed to predict the percentage of leading-strand genes in a genome based on the number of genes in each of 23 functional categories, using the Neural Network Toolbox for MATLAB. The training data consist of the leading-strand gene percentages of 464 (60%) genomes, arbitrarily selected from the 773, along with the 23 numbers extracted from each of the 464 genomes, while the other genomes serve as the validation set. We ran 7 × 10^5 training iterations and obtained an MSE (Mean Squared Error) of 0.0012. We then applied the trained neural network to the whole set of 773 bacterial genomes, i.e., the ones with GOslim functional assignments, and obtained a Pearson correlation score of 0.9 between the predicted and the actual percentages. The trained neural network can be downloaded from http://csbl.bmb.uga.edu/~xizeng/research/leading_strand_bias/.
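The forward pass of such a network can be written out directly from the formula given in the text. The sketch below is a plain-Python illustration. It follows the formula as printed (one tanh hidden layer, a linear output, no bias terms), uses hypothetical function names, and takes tiny made-up weights in place of the trained 23-by-10 model, which is not reproduced here.

```python
import math

def scale_counts(counts, max_counts):
    """p_i = n_i / n_i,max: scale each category count by its maximum."""
    return [n / n_max for n, n_max in zip(counts, max_counts)]

def predict(p, w1, w2):
    """P = sum_j w2[j] * tanh(sum_i w1[i][j] * p[i]).

    p  : scaled inputs, one per functional category (length k1)
    w1 : k1 x k2 input-to-hidden weights, indexed w1[i][j]
    w2 : k2 hidden-to-output weights
    """
    hidden = [math.tanh(sum(w1[i][j] * p[i] for i in range(len(p))))
              for j in range(len(w2))]
    return sum(w2[j] * hidden[j] for j in range(len(w2)))

# Toy example with k1 = 2 inputs and k2 = 2 hidden nodes (made-up weights).
p = scale_counts([10, 5], [20, 5])
w1 = [[1.0, 0.0],
      [0.0, 0.0]]
w2 = [2.0, 3.0]
prediction = predict(p, w1, w2)
```

Only the first hidden node receives a non-zero input here, so the prediction reduces to 2 × tanh(0.5). A trained toolbox model would typically also carry bias terms and its own input and output scaling, which this sketch leaves out.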
Acknowledgement
We thank all the members of the CSBL Lab at UGA, especially Dr. Victor Olman for discussion of statistical analyses. This work was supported in part by the National Science Foundation (DEB-0830024 and DBI-0542119) and the DOE BioEnergy Science Center grant (DE-PS0206ER64304), which is supported by the Office of Biological and Environmental Research in the Department of Energy Office of Science.
References
1. Koonin EV (2009) Evolution of genome architecture. Int J Biochem Cell Biol 41(2):298-306.
2. Zivanovic Y, Lopez P, Philippe H, & Forterre P (2002) Pyrococcus genome comparison evidences chromosome shuffling-driven evolution. Nucleic Acids Res 30(9):1902-1910.
3. Brewer BJ (1988) When polymerases collide: replication and the transcriptional organization of the E. coli chromosome. Cell 53(5):679-686.
4. French S (1992) Consequences of replication fork movement through transcription units in vivo. Science 258(5086):1362-1365.
5. Rocha EP & Danchin A (2003) Essentiality, not expressiveness, drives gene-strand bias in bacteria. Nat Genet 34(4):377-378.
6. Rocha EP & Danchin A (2003) Gene essentiality determines chromosome organisation in bacteria. Nucleic Acids Res 31(22):6570-6577.
7. Hashimoto M, et al. (2005) Cell size and nucleoid organization of engineered Escherichia coli cells with a reduced genome. Mol Microbiol 55(1):137-149.
8. Kato J & Hashimoto M (2007) Construction of consecutive deletions of the Escherichia coli chromosome. Mol Syst Biol 3:132.
9. Kobayashi K, et al. (2003) Essential Bacillus subtilis genes. Proc Natl Acad Sci U S A 100(8):4678-4683.
10. Price MN, Alm EJ, & Arkin AP (2005) Interruptions in gene expression drive highly expressed operons to the leading strand of DNA replication. Nucleic Acids Res 33(10):3224-3234.
11. Rocha E (2002) Is there a role for replication fork asymmetry in the distribution of genes in bacterial genomes? Trends Microbiol 10(9):393-395.
12. Hu J, Zhao X, & Yu J (2007) Replication-associated purine asymmetry may contribute to strand-biased gene distribution. Genomics 90(2):186-194.
13. NCBI. Entrez Genome Project.
14. Nilsson AI, et al. (2005) Bacterial genome size reduction by experimental evolution. Proc Natl Acad Sci U S A 102(34):12112-12116.
15. Hosokawa T, Kikuchi Y, Nikoh N, Shimada M, & Fukatsu T (2006) Strict host-symbiont cospeciation and reductive genome evolution in insect gut bacteria. PLoS Biol 4(10):e337.
16. Robertson BR & Button DK (1989) Characterizing aquatic bacteria according to population, cell size, and apparent DNA content by flow cytometry. Cytometry 10(1):70-76.
17. Parter M, Kashtan N, & Alon U (2007) Environmental variability and modularity of bacterial metabolic networks. BMC Evol Biol 7:169.
18. Ochman H & Moran NA (2001) Genes lost and genes found: evolution of bacterial pathogenesis and symbiosis. Science 292(5519):1096-1099.
19. Xu J (2006) Microbial ecology in the age of genomics and metagenomics: concepts, tools, and recent advances. Mol Ecol 15(7):1713-1731.
20. Bardgett RD (2002) Causes and consequences of biological diversity in soil. Zoology (Jena) 105(4):367-374.
21. Worning P, Jensen LJ, Hallin PF, Staerfeldt HH, & Ussery DW (2006) Origin of replication in circular prokaryotic chromosomes. Environ Microbiol 8(2):353-361.
22. Ashburner M, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1):25-29.
23. Zhang R & Lin Y (2009) DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes. Nucleic Acids Res 37(Database issue):D455-458.
24. Janga SC, Salgado H, & Martinez-Antonio A (2009) Transcriptional regulation shapes the organization of genes on bacterial chromosomes. Nucleic Acids Res 37(11):3680-3688.
25. Stocker R, Seymour JR, Samadani A, Hunt DE, & Polz MF (2008) Rapid chemotactic response enables marine bacteria to exploit ephemeral microscale nutrient patches. Proc Natl Acad Sci U S A 105(11):4209-4214.
26. Amos L & Klug A (1974) Arrangement of subunits in flagellar microtubules. J Cell Sci 14(3):523-549.
27. Wemmer KA & Marshall WF (2004) Flagellar motility: all pull together. Curr Biol 14(23):R992-993.
28. Barrell D, et al. (2009) The GOA database in 2009--an integrated Gene Ontology Annotation resource. Nucleic Acids Res 37(Database issue):D396-403.
29. Faith JJ, et al. (2008) Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata. Nucleic Acids Res 36(Database issue):D866-870.
30. Gao F & Zhang CT (2008) Ori-Finder: a web-based system for finding oriCs in unannotated bacterial genomes. BMC Bioinformatics 9:79.
Figure legends
Fig. 1: General characteristics of leading-strand genes: (a) distribution of the number of bacteria with a specific percentage of genes on the leading strands; (b) distribution of the percentages of leading-strand genes as a function of the average gene expression level in E. coli; and (c) distribution of the percentages of leading-strand genes across all sequenced bacteria in different environments.
Fig. 2: Tendency of genes in different functional categories to the leading strands for 773 bacterial genomes.
Fig. 3: Preference of genes of different functional categories to the leading strands across 773 bacterial genomes.
Fig. 4: Boxplots of the percentage of coding region versus the percentage of leading-strand genes in a genome: (a) all bacteria (p-value of the Wilcoxon test: 1.1 × 10^-8); (b) bacteria of specialized type with p-value 0.22; (c) bacteria of host-associated type with p-value 0.54; (d) bacteria of aquatic type with p-value 0.065; (e) bacteria of terrestrial type with p-value 0.0031; and (f) bacteria of multiple type with p-value 1.9 × 10^-9.
Fig. 5: Performance in predicting the percentage of leading-strand genes in a genome by a neural network on 773 genomes.
[Figure panels not reproduced. Fig. 1: (a) density versus overall percentage of genes on the leading strand; (b) percentage of genes on the leading strand versus bins of gene expression levels; (c) percentage of genes on the leading strand versus bacterial habitat lifestyle (Specialized, Host-associated, Aquatic, Multiple, Terrestrial). Fig. 4(a)-(f): percentage of coding region for genomes with <= 0.7 versus > 0.7 of genes on the leading strand. Fig. 5: actual versus predicted percentages of leading-strand genes in a genome.]