Protein Localization Analysis of Essential Genes in Prokaryotes

Chong Peng
Center of BioInformatics, Tianjin University
2014.3.26

Abstract

• Essential genes are indispensable for the survival of an organism under certain conditions. As antimicrobial targets and cornerstones of synthetic biology, essential genes have important practical implications. Protein localization is a key factor in protein function. However, essential genes have not previously been examined systematically in terms of the localization of the proteins they encode. Here, we performed a comprehensive protein localization analysis of essential genes in 27 prokaryotes, comprising 26 bacteria and 1 archaeon. We found that proteins encoded by essential genes are enriched in the cytoplasm, while proteins encoded by non-essential genes tend to have diverse localizations. Furthermore, GO (Gene Ontology) terms enriched in the essential genes of these genomes were identified using Fisher's exact test. These results provide further insight into the fundamental functions needed to support cellular life and could improve gene essentiality prediction by taking protein localization and enriched GO terms into consideration.

Introduction

• Essential genes are those indispensable for the survival of an organism under certain conditions, and the functions they encode are therefore considered a foundation of life.
• Significant advances, both in vivo and in silico, have been made in the past few years.
▫ High-throughput sequencing has been combined with high-density transposon-mediated mutagenesis, which has dramatically increased the number of prokaryotic species covered by gene essentiality research.
▫ Analyses of the functional distribution of essential and non-essential genes have been performed to examine the characteristics of essential genes.
• Our study focuses on the localization of the proteins encoded by essential genes. In general, proteins must be transported to the appropriate location to perform their designated function.
• All bacteria:
▫ Cytoplasm, where all proteins are synthesized and where most of them remain;
▫ Cytoplasmic membrane, a lipid bilayer surrounding the cytoplasm.
• Gram-positive bacteria: cell wall, extracellular space.
• Gram-negative bacteria: outer membrane, extracellular space, periplasm.

Fig. 1 Cell structure of Gram-positive bacteria (left panel) and Gram-negative bacteria (right panel)

Results

We selected 27 prokaryotic organisms, comprising 26 bacteria and Methanococcus maripaludis S2, the only representative of the Archaea domain, to analyze the localization of the proteins encoded by essential and non-essential genes. The data used in the current study were obtained from DEG (a database of essential genes, available at http://www.essentialgene.org/) and are displayed in Table 1.

Table 1 The data of essential genes used in the current study

Organism | RefSeq | No. of essential genes | No. of total genes
Acinetobacter baylyi ADP1 | NC_005966 | 499 | 3307
Bacillus subtilis str. 168 | NC_000964 | 271 | 4175
Bacteroides thetaiotaomicron VPI-5482 | NC_004663 | 325 | 4778
Burkholderia thailandensis E264 | NC_007650 | 42 | 2356
Burkholderia thailandensis E264 | NC_007651 | 364 | 3276
Campylobacter jejuni subsp. jejuni NCTC 11168 = ATCC 700819 | NC_002163 | 228 | 1576
Caulobacter crescentus | NC_011916 | 480 | 3818
Escherichia coli MG1655 | NC_000913 | 609 | 4141
Francisella novicida U112 | NC_008601 | 392 | 1719
Haemophilus influenzae Rd KW20 | NC_000907 | 642 | 1610
Helicobacter pylori 26695 | NC_000915 | 323 | 1469
Methanococcus maripaludis S2 | NC_005791 | 519 | 1722
Mycobacterium tuberculosis H37Rv | NC_000962 | 687 | 4018
Mycoplasma genitalium G37 | NC_000908 | 381 | 475
Mycoplasma pulmonis UAB CTIP | NC_002771 | 310 | 782
Porphyromonas gingivalis ATCC 33277 | NC_010729 | 463 | 2089
Pseudomonas aeruginosa PAO1 | NC_002516 | 117 | 5572
Pseudomonas aeruginosa UCBPP-PA14 | NC_008463 | 335 | 5892
Salmonella enterica serovar Typhi Ty2 | NC_004631 | 358 | 4352
Salmonella enterica serovar Typhimurium SL1344 | NC_016810 | 353 | 4446
Salmonella enterica subsp. enterica serovar Typhimurium str. 14028S | NC_016856 | 105 | 5315
Salmonella typhimurium LT2 | NC_003197 | 230 | 4454
Shewanella oneidensis MR-1 | NC_004347 | 403 | 4065
Sphingomonas wittichii RW1 | NC_009511 | 535 | 4850
Staphylococcus aureus N315 | NC_002745 | 302 | 2582
Staphylococcus aureus NCTC 8325 | NC_007795 | 351 | 2767
Streptococcus sanguinis | NC_009009 | 218 | 2270
Vibrio cholerae N16961 | NC_002505 | 565 | 2534
Vibrio cholerae N16961 | NC_002506 | 214 | 970

[Tree figure: branch labels list the 27 organisms; scale bar 0.05]

Fig. 2 The phylogenetic tree of the organisms used in the current study. The phylogenetic tree was constructed from the 16S ribosomal RNA sequences of the 27 organisms, downloaded from NCBI. Based on the branches of the tree, the organisms can be divided into 4 groups: Gram-negative bacteria, Gram-positive bacteria, mycoplasma and archaea.

Protein localizations are different between essential and non-essential genes

We first submitted the amino acid sequences encoded by both essential and non-essential genes in the 27 organisms to PSORTb and obtained the protein localization information. With precision values >97% for both archaea and bacteria, PSORTb 3.0 is the most precise bacterial localization prediction tool available. The average percentages of proteins located in the cytoplasm for essential and non-essential genes are 64.40% and 43.88%, respectively. Student's t test showed that the difference is statistically significant (p = 1.57×10^-10). For all the organisms except Vibrio cholerae N16961, the percentage of proteins located in the cytoplasm is higher for essential genes than for non-essential genes (Figure 3). The anomalous result for Vibrio cholerae N16961 may be due to its high proportion of "unknown" predictions (43.13% in Vibrio cholerae N16961, compared with 17.97% on average). These results suggest that proteins encoded by essential genes are enriched in the cytoplasm. The average percentages of proteins located in the cytoplasmic membrane for essential and non-essential genes are 16.73% and 23.35%, respectively.
Student's t test showed that the difference is statistically significant (p = 1.33×10^-5). The pink bars in Figure 3 show that in 23 (85.19%) of the 27 groups of data, the percentage of proteins located in the cytoplasmic membrane is higher for non-essential genes than for essential genes. These results suggest that proteins encoded by non-essential genes are enriched in the cytoplasmic membrane.

Table 2 Percentages of proteins located in the cytoplasm, cytoplasmic membrane and extracellular space for essential (E) and non-essential (NE) genes in the 27 genomes.

Organism | C (%) E | C (%) NE | CM (%) E | CM (%) NE | P (%) E | P (%) NE | E (%) E | E (%) NE | OM (%) E | OM (%) NE
Salmonella typhimurium LT2 | 58.26 | 43.19 | 21.74 | 24.67 | 3.04 | 1.47 | 1.74 | 3.45 | 3.04 | 2.06
Salmonella enterica serovar Typhimurium SL1344 | 68.84 | 42.28 | 16.15 | 25.06 | 1.42 | 1.49 | 0.85 | 3.57 | 1.42 | 2.26
Salmonella enterica serovar Typhimurium str. 14028S | 56.19 | 36.7 | 28.57 | 21.38 | 0 | 1.25 | 1.90 | 2.82 | 3.81 | 1.73
Salmonella enterica serovar Typhi Ty2 | 74.30 | 41.6 | 14.25 | 24.71 | 0.56 | 1.56 | 0.56 | 3.43 | 1.12 | 2.05
Escherichia coli MG1655 | 58.78 | 45.88 | 18.72 | 28.19 | 0.16 | 1.2 | 1.81 | 4.58 | 0.66 | 2.33
Shewanella oneidensis MR-1 | 72.39 | 55.39 | 13.93 | 23.66 | 0.25 | 0.63 | 1.00 | 4.53 | 1.24 | 2.27
Vibrio cholerae N16961 | 41.21 | 44.27 | 13.35 | 27.18 | 0.51 | 1.36 | 1.03 | 2.99 | 0.77 | 2.24
Haemophilus influenzae Rd KW20 | 61.84 | 52.93 | 21.03 | 25.78 | 0.16 | 0.98 | 2.34 | 3.91 | 1.71 | 3.52
Acinetobacter baylyi ADP1 | 75.75 | 45.64 | 12.22 | 24.06 | 0.2 | 0.93 | 0.80 | 1.73 | 1.40 | 3.86
Pseudomonas aeruginosa PAO1 | 48.72 | 46.61 | 31.62 | 22.83 | 0 | 1.3 | 0.85 | 3.12 | 4.27 | 3.03
Pseudomonas aeruginosa UCBPP-PA14 | 61.79 | 42.92 | 13.73 | 19.38 | 1.19 | 0.63 | 1.79 | 2.19 | 0.60 | 0.73
Burkholderia thailandensis E264 | 66.01 | 42.59 | 16.50 | 20.47 | 0.49 | 1.84 | 0.74 | 3.37 | 0.99 | 2.37
Francisella novicida U112 | 67.86 | 47.25 | 17.86 | 24.91 | 0.51 | 1.73 | 0.77 | 1.28 | 0.77 | 1.81
Caulobacter crescentus | 65.42 | 35.79 | 15.63 | 20.47 | 0.63 | 0.99 | 0.83 | 2.98 | 1.46 | 3.07
Sphingomonas wittichii RW1 | 60.19 | 47.07 | 14.21 | 17.82 | 0.19 | 0.6 | 0.75 | 2.20 | 0.93 | 4.59
Campylobacter jejuni NCTC 11168=ATCC 700819 | 56.14 | 53.05 | 17.98 | 21.79 | 0.44 | 1.29 | 1.32 | 2.22 | 0.88 | 2.08
Helicobacter pylori 26695 | 54.80 | 51.01 | 18.27 | 19.91 | 0.31 | 1.85 | 0.62 | 1.23 | 2.48 | 2.82
Bacteroides thetaiotaomicron VPI-5482 | 64.92 | 40.11 | 15.08 | 17.61 | 0.92 | 1.17 | 1.85 | 1.53 | 1.85 | 4.69
Porphyromonas gingivalis ATCC 33277 | 67.82 | 42.41 | 17.28 | 16.96 | 0.22 | 0.31 | 0.43 | 0.98 | 1.51 | 2.15
Mycobacterium tuberculosis H37Rv | 59.83 | 37.92 | 22.56 | 19.93 | 0.87 | 2.02 | 1.60 | 1.30 | 0.15 | 0.16

Streptococcus sanguinis 81.19 48.34 13.30 30.56 0 Bacillus subtilis 168 80.07 48.50 14.39 30.32 Staphylococcus aureus N315 81.13 45.77 13.91 Staphylococcus aureus NCTC 8325 75.21 42.15 Methanococcus maripaludis S2 82.47 Mycoplasma genitalium G37 C W (%) E NE 1.07 0.00 2.39 0 2.4 0.37 0.63 28.80 0 4.03 0.00 1.84 13.68 28.77 0.28 4.25 0.85 1.46 63.97 9.83 21.45 0 1.11 0.19 0.19 51.71 36.17 26.25 30.85 0.52 0 Mycoplasma pulmonis UAB CTIP 60.97 30.75 17.74 27.95 1.29 1.55 Average 64.40 43.88 16.73 23.35 0.50 1.54 0.30 1.30 P value (Student’s t test) 1.57×10-10 1.33×10-5 1.95×10-4 1.19 2.77 4.06×10-6 1.25 2.60 3.06×10-3 0.047

Fig. 3 Percentages of proteins located in the cytoplasm (green bars), cytoplasmic membrane (pink bars) and extracellular space (red bars) for essential (the left column of each pair) and non-essential genes (the right column of each pair) in the 27 genomes.

• For both essential and non-essential proteins, the proportion of secreted proteins is quite low: only 0.50% of essential proteins and 1.54% of non-essential proteins are located in the extracellular space.
• Prediction coverage is higher for essential genes than for non-essential genes.
• Protein localization differences between essential and non-essential genes are more significant in Gram-positive bacteria than in Gram-negative bacteria. The reason may be that the cell structure of Gram-positive bacteria is simpler.

An alternative form of Fig. 3

Fig. 3 a. Percentages of proteins located in the cytoplasm for essential and non-essential genes in the 27 genomes. b. Percentages of proteins located in the cytoplasmic membrane for essential and non-essential genes in the 27 genomes. c. Percentages of proteins located in the extracellular space for essential and non-essential genes in the 27 genomes. (Vertical axis: percentage (%); legend: essential gene, non-essential gene.)

The proportions of non-essential proteins located in the periplasm, outer membrane and cell wall are higher than those of essential proteins. The corresponding p values are 4.06×10^-6, 3.06×10^-3 and 0.047, respectively. All the values are less than 0.05, which means that the differences are statistically significant.

Fig. 4 Percentages of proteins located in the periplasm (a) and outer membrane (b) for essential and non-essential genes in the 20 Gram-negative genomes. c. Percentages of proteins located in the cell wall for essential and non-essential genes in the 4 Gram-positive and archaeal genomes. (Vertical axis: percentage (%); legend: essential gene, non-essential gene.)

Protein localization analysis of essential genes based on GO terms

The Gene Ontology (GO) is one of the most useful controlled vocabularies for describing the roles of genes and the characteristics of gene products. The ontology covers three domains: cellular component, molecular function and biological process. Fisher's exact test was employed to identify the GO terms enriched in the essential genes of the 27 prokaryotes. P values less than 0.05 were considered statistically significant. GO:0005737 (cytoplasm), GO:0005840 (ribosome) and GO:0015935 (small ribosomal subunit) are over-represented among essential genes in all 4 groups of organisms in the cellular component category. This result provides further evidence that proteins encoded by essential genes tend to be located in the cytoplasm. GO:0016021 (integral component of membrane), GO:0016020 (membrane), GO:0005886 (plasma membrane), GO:0005622 (intracellular) and GO:0009279 (cell outer membrane) are under-represented in more than 6 organisms, which suggests that proteins in cellular components such as membranes are less closely associated with gene essentiality and are more likely to be encoded by non-essential genes.

Fig. 5 Statistically significant Gene Ontology terms in the cellular component category. Every GO ID with a P value less than 0.05 according to Fisher's exact test is listed on the vertical axis. If the GO term is over-represented in the organism listed on the horizontal axis, the cell at the intersection of the row and column is red. Blue cells indicate that the GO term is under-represented in the organism of that column. If the GO term is not statistically significant in the organism, the cell is white.
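The GO enrichment test and the red/blue/white coding of Fig. 5 can be sketched in a few lines. The following is a minimal illustration, not the authors' actual pipeline. It assumes a 2×2 contingency table (essential vs non-essential genes, annotated vs not annotated with a GO term) and computes one-sided Fisher's exact p values from the hypergeometric distribution. The function names and the GO term counts in the example are hypothetical; only the gene totals for Bacillus subtilis str. 168 (271 essential genes out of 4175) are taken from Table 1. The source does not state whether one-sided or two-sided tests were used, so the one-sided form is an assumption.

```python
from math import comb


def _tail(ks, n_total, n_annotated, n_essential):
    """Sum hypergeometric probabilities over the counts in ks.

    n_total:     all genes in the genome
    n_annotated: genes carrying the GO term
    n_essential: essential genes (the "sample")
    """
    # Work with exact integers and divide once at the end.
    numerator = sum(
        comb(n_annotated, k) * comb(n_total - n_annotated, n_essential - k)
        for k in ks
    )
    return numerator / comb(n_total, n_essential)


def p_over(k, n_total, n_annotated, n_essential):
    """One-sided p value: at least k essential genes carry the term."""
    upper = min(n_annotated, n_essential)
    return _tail(range(k, upper + 1), n_total, n_annotated, n_essential)


def p_under(k, n_total, n_annotated, n_essential):
    """One-sided p value: at most k essential genes carry the term."""
    return _tail(range(0, k + 1), n_total, n_annotated, n_essential)


def classify(k, n_total, n_annotated, n_essential, alpha=0.05):
    """Return 'over', 'under' or 'ns' (not significant) for one GO term."""
    if p_over(k, n_total, n_annotated, n_essential) < alpha:
        return "over"
    if p_under(k, n_total, n_annotated, n_essential) < alpha:
        return "under"
    return "ns"


if __name__ == "__main__":
    # Gene totals from Table 1 (B. subtilis str. 168); term counts are hypothetical.
    n_total, n_essential = 4175, 271
    n_annotated = 60  # hypothetical number of genes carrying the term
    for k in (0, 4, 20):  # hypothetical numbers of essential genes with the term
        print(k, classify(k, n_total, n_annotated, n_essential))
```

Working with exact integer binomial coefficients avoids floating-point underflow for large genomes; a statistics library would normally be preferred in practice.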
The discrepancy in protein functions between different subcellular sites in Bacillus subtilis str. 168

Within a specific subcellular site, protein functions differ considerably between essential and non-essential genes. To observe this discrepancy quantitatively in Bacillus subtilis str. 168, we first selected the proteins located in the cytoplasm and in the cytoplasmic membrane for each of the two groups of genes. We then calculated the percentages of the associated molecular function GO terms. For essential proteins located in the cytoplasm, GO:0005524 (ATP binding) had the greatest proportion (47.62%), whereas for non-essential proteins located at this site, GO:0003677 (DNA binding) had the greatest proportion (17.17%). Other GO terms with relatively high proportions are shown in Fig. 6a. For essential proteins located in the cytoplasmic membrane, the proportion of GO:0005524 (ATP binding) was the greatest (16.00%), whereas for non-essential proteins located at this site, GO:0005215 (transporter activity) had the greatest proportion (9.55%). Other GO terms with relatively high proportions are displayed in Fig. 6b.

Fig. 6 Percentages of GO terms in the genes located in the cytoplasm (a) and cytoplasmic membrane (b). (Vertical axis: percentage (%); legend: essential, non-essential.)
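The counting step described for Bacillus subtilis str. 168 can be illustrated as follows. This is a minimal sketch under stated assumptions: each protein is represented by a hypothetical identifier mapped to its set of molecular function GO IDs, and the percentage of a GO term is taken as the share of proteins in the group that carry it (the source does not state the exact denominator). The identifiers and annotations in the example are invented for illustration and are not data from the study.

```python
from collections import Counter


def go_term_percentages(annotations):
    """Percentage of proteins in a group annotated with each GO term.

    annotations: dict mapping a protein ID to an iterable of GO IDs.
    """
    n_proteins = len(annotations)
    if n_proteins == 0:
        return {}
    # Count each GO term at most once per protein.
    counts = Counter(
        go_id for terms in annotations.values() for go_id in set(terms)
    )
    return {go_id: 100.0 * c / n_proteins for go_id, c in counts.items()}


def top_term(annotations):
    """Return (GO ID, percentage) of the most frequent term, or None."""
    percentages = go_term_percentages(annotations)
    if not percentages:
        return None
    # Highest percentage first; ties are broken by GO ID for determinism.
    return min(percentages.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical group: essential proteins predicted to be cytoplasmic.
    essential_cytoplasm = {
        "protA": {"GO:0005524", "GO:0000287"},
        "protB": {"GO:0005524"},
        "protC": {"GO:0003677"},
        "protD": {"GO:0005525"},
    }
    print(top_term(essential_cytoplasm))
```

Running the same function on the four groups (essential or non-essential, cytoplasm or cytoplasmic membrane) would give the kind of per-term percentages that Fig. 6 compares.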
GO terms shown in Fig. 6a: GO:0005524 ATP binding; GO:0000287 magnesium ion binding; GO:0046872 metal ion binding; GO:0003677 DNA binding; GO:0003924 GTPase activity; GO:0005525 GTP binding; GO:0000049 tRNA binding; GO:0003723 RNA binding; GO:0003676 nucleic acid binding; GO:0003700 sequence-specific DNA binding transcription factor activity; GO:0000156 phosphorelay response regulator activity.
GO terms shown in Fig. 6b: GO:0005524 ATP binding; GO:0016757 transferase activity, transferring glycosyl groups; GO:0047355 CDP-glycerol glycerophosphotransferase activity; GO:0046872 metal ion binding; GO:0005525 GTP binding; GO:0003924 GTPase activity; GO:0005215 transporter activity.

Discussion

Our results show, for the first time, the difference in protein localization between essential and non-essential genes in prokaryotes. Essential proteins are enriched in the cytoplasm. The proportions of proteins located in the cytoplasmic membrane, periplasm, outer membrane, cell wall and extracellular space are significantly higher for non-essential genes than for essential genes. Fisher's exact test of GO terms led to a consistent conclusion. Fisher's exact test was also employed to identify enriched GO terms in the biological process and molecular function categories. GO:0007049 (cell cycle), GO:0006260 (DNA replication), GO:0009252 (peptidoglycan biosynthetic process), GO:0051301 (cell division), GO:0065002 (intracellular protein transmembrane transport), GO:0006265 (DNA topological change) and GO:0006184 (GTP catabolic process) are the most significantly over-represented biological process GO terms. These processes are all indispensable for a cell and take place in the cytoplasm or at the ribosome. The GO terms under-represented in more than 6 organisms in this category are GO:0006355 (regulation of transcription, DNA-templated), GO:0035556 (intracellular signal transduction), GO:0006200 (ATP catabolic process), GO:0005975 (carbohydrate metabolic process) and GO:0055114 (oxidation-reduction process).
For the GO terms relating to molecular function, the most significantly over-represented terms are GO:0003735 (structural constituent of ribosome), GO:0019843 (rRNA binding), GO:0005524 (ATP binding), GO:0000049 (tRNA binding), GO:0000287 (magnesium ion binding) and GO:0005525 (GTP binding). In contrast, GO:0003700 (sequence-specific DNA binding transcription factor activity), GO:0003677 (DNA binding), GO:0003824 (catalytic activity), GO:0051539 (4 iron, 4 sulfur cluster binding), GO:0000155 (phosphorelay sensor kinase activity), GO:0043565 (sequence-specific DNA binding), GO:0000156 (phosphorelay response regulator activity) and GO:0004872 (receptor activity) are significantly under-represented in more than 6 organisms. Considering protein localization and protein function together gives a fuller picture of essential genes. These results provide further insight into the fundamental functions needed to support cellular life and could improve gene essentiality prediction by taking protein localization and enriched GO terms into consideration.

Materials and Methods

Bioinformatics Databases

DEG is a database of essential genes available at http://www.essentialgene.org/. It stores records of currently available essential genes, non-essential genes and genomic elements from a wide range of organisms, including bacteria, archaea and eukaryotes. The non-essential genes in Methanococcus maripaludis S2 and 13 bacteria such as Escherichia coli MG1655 were obtained from the original literature, while the non-essential genes in the other 12 organisms, such as Bacillus subtilis 168, were taken as the complement of the essential gene set. The UniProt Knowledgebase (UniProtKB; http://www.uniprot.org), maintained by UniProt Consortium members, is the central hub for the collection of functional information on proteins and provides a comprehensive, high-quality and freely accessible resource of protein sequences and functional annotation.
Manual and electronic GO terms are assigned to the corresponding UniProt entries by the Gene Ontology Annotation program, which is supplied by external collaborating GO Consortium groups. In this study, part of the subcellular location information and the GO terms used in the analysis were downloaded from UniProtKB.

Software Tools

PSORTb is the most precise bacterial localization prediction tool available. The likelihood of a protein being at a specific localization site is indicated by a score. Because PSORTb 3.0 added the capability to predict the subcellular localization of archaeal proteins, we could obtain the localization information for Methanococcus maripaludis S2 with this tool.

GO Terms Analysis

Fisher's exact test, a statistical significance test used in the analysis of contingency tables, was employed to identify the GO terms enriched in the essential genes of 27 prokaryotes, comprising 24 bacteria, 2 mycoplasmas and one archaeon, Methanococcus maripaludis S2. P values less than 0.05 were considered statistically significant.

Acknowledgments

The authors thank Dr. McGarvey for providing assistance in obtaining the GO IDs of the genes. They would also like to thank Dr. Ren Zhang for invaluable assistance. The present work was supported in part by the National Natural Science Foundation of China (Grant Nos. 31171238 and 30800642) and the Program for New Century Excellent Talents in University (No. NCET-120396).

Author contributions

FG designed the study. CP, FG and XZ performed the data analysis. YL and HL contributed detailed discussions and revisions. CP and FG wrote the main manuscript text. All authors reviewed the manuscript.

Competing financial interests

The authors declare no competing financial interests.