Protein Localization Analysis of
Essential Genes in Prokaryotes
Chong Peng
Center of BioInformatics
Tianjin University
2014.3.26
Abstract
• Essential genes are indispensable for the survival of a living organism under
certain conditions. As antimicrobial targets and cornerstones of synthetic
biology, they have important practical implications. Protein localization is a
key factor in protein function. However, essential genes have not previously
been examined systematically from the standpoint of the localization of the
proteins they encode. Here, we perform a comprehensive protein localization
analysis of essential genes in 27 prokaryotes, comprising 26 bacteria and 1
archaeon. We found that proteins encoded by essential genes are enriched in the
cytoplasm, while proteins encoded by non-essential genes tend to have more
diverse localizations. Furthermore, GO (Gene Ontology) terms enriched in the
essential genes of these genomes were identified using Fisher's exact test.
These results provide further insight into the fundamental functions needed to
support cellular life and could improve gene essentiality prediction by taking
protein localization and enriched GO terms into consideration.
Introduction
• Essential genes are those indispensable for the survival of an
organism under certain conditions, and the functions they encode
are therefore considered a foundation of life.
• Significant advances have been made in the past few years, both
in vivo and in silico.
▫ High-throughput sequencing has been combined with high-density
transposon-mediated mutagenesis, which has dramatically increased
the number of prokaryotic species covered by gene essentiality research.
▫ Analyses of the functional distribution of essential and non-essential
genes have been performed to examine the characteristics of essential genes.
• Our study focuses on the localization of the proteins encoded by essential genes.
In general, proteins must be transported to the appropriate location to perform
their designated function.
• All bacteria:
▫ Cytoplasm, where all proteins are synthesized and most of them remain;
▫ Cytoplasmic membrane, a lipid bilayer surrounding the cytoplasm.
• Gram-positive bacteria: cell wall, extracellular space.
• Gram-negative bacteria: outer membrane, extracellular space, periplasm.
Fig. 1 Cell structure of Gram-positive bacteria (left panel) and Gram-negative bacteria (right panel)
Results
We selected 27 prokaryotic organisms, comprising 26 bacteria and Methanococcus maripaludis S2, the only representative
of the domain Archaea, to analyze the localization of the proteins encoded by essential and non-essential genes. The data
used in the current study were obtained from DEG (a database of essential genes, available at
http://www.essentialgene.org/) and are displayed in Table 1.
Table 1 The data of essential genes used in the current study
Organism | RefSeq | No. of essential genes | No. of total genes
Acinetobacter baylyi ADP1 | NC_005966 | 499 | 3307
Bacillus subtilis str. 168 | NC_000964 | 271 | 4175
Bacteroides thetaiotaomicron VPI-5482 | NC_004663 | 325 | 4778
Burkholderia thailandensis E264 | NC_007650 | 42 | 2356
Burkholderia thailandensis E264 | NC_007651 | 364 | 3276
Campylobacter jejuni subsp. jejuni NCTC 11168 = ATCC 700819 | NC_002163 | 228 | 1576
Caulobacter crescentus | NC_011916 | 480 | 3818
Escherichia coli MG1655 | NC_000913 | 609 | 4141
Francisella novicida U112 | NC_008601 | 392 | 1719
Haemophilus influenzae Rd KW20 | NC_000907 | 642 | 1610
Helicobacter pylori 26695 | NC_000915 | 323 | 1469
Methanococcus maripaludis S2 | NC_005791 | 519 | 1722
Mycobacterium tuberculosis H37Rv | NC_000962 | 687 | 4018
Mycoplasma genitalium G37 | NC_000908 | 381 | 475
Mycoplasma pulmonis UAB CTIP | NC_002771 | 310 | 782
Porphyromonas gingivalis ATCC 33277 | NC_010729 | 463 | 2089
Pseudomonas aeruginosa PAO1 | NC_002516 | 117 | 5572
Pseudomonas aeruginosa UCBPP-PA14 | NC_008463 | 335 | 5892
Salmonella enterica serovar Typhi Ty2 | NC_004631 | 358 | 4352
Salmonella enterica serovar Typhimurium SL1344 | NC_016810 | 353 | 4446
Salmonella enterica subsp. enterica serovar Typhimurium str. 14028S | NC_016856 | 105 | 5315
Salmonella typhimurium LT2 | NC_003197 | 230 | 4454
Shewanella oneidensis MR-1 | NC_004347 | 403 | 4065
Sphingomonas wittichii RW1 | NC_009511 | 535 | 4850
Staphylococcus aureus N315 | NC_002745 | 302 | 2582
Staphylococcus aureus NCTC 8325 | NC_007795 | 351 | 2767
Streptococcus sanguinis | NC_009009 | 218 | 2270
Vibrio cholerae N16961 | NC_002505 | 565 | 2534
Vibrio cholerae N16961 | NC_002506 | 214 | 970
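The counts in Table 1 imply very different shares of essential genes from one organism to the next. As a rough illustration, the sketch below computes that share for four rows copied from the table. The function names are ours and are not part of DEG or of the original analysis.

```python
# Four rows copied from Table 1: (essential genes, total genes).
TABLE1_SUBSET = {
    "Mycoplasma genitalium G37": (381, 475),
    "Haemophilus influenzae Rd KW20": (642, 1610),
    "Escherichia coli MG1655": (609, 4141),
    "Pseudomonas aeruginosa PAO1": (117, 5572),
}


def essential_fraction(essential, total):
    """Return the percentage of genes annotated as essential."""
    if total <= 0:
        raise ValueError("total gene count must be positive")
    return 100.0 * essential / total


def rank_by_essential_fraction(table):
    """Sort organisms from the highest to the lowest essential-gene share."""
    return sorted(
        ((name, essential_fraction(e, t)) for name, (e, t) in table.items()),
        key=lambda item: item[1],
        reverse=True,
    )


if __name__ == "__main__":
    for name, pct in rank_by_essential_fraction(TABLE1_SUBSET):
        print(f"{name}: {pct:.1f}% essential")
```

The spread between a minimal genome such as Mycoplasma genitalium G37 and a large one such as Pseudomonas aeruginosa PAO1 is one reason the later comparisons use percentages rather than raw counts.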
[Figure: phylogenetic tree with bootstrap values on the branches and a scale bar of 0.05. Tip labels in extraction order: Salmonella enterica subsp. enterica serovar Typhimurium str. LT2; Salmonella enterica subsp. enterica serovar Typhimurium str. SL1344; Salmonella enterica subsp. enterica serovar Typhimurium str. 14028S; Salmonella enterica subsp. enterica serovar Typhi str. Ty2; Escherichia coli str. K-12 substr. MG1655; Shewanella oneidensis MR-1; Vibrio cholerae O1 biovar El Tor str. N16961; Haemophilus influenzae Rd KW20; Acinetobacter sp. ADP1; Pseudomonas aeruginosa PAO1; Pseudomonas aeruginosa UCBPP-PA14; Burkholderia thailandensis E264; Francisella novicida U112; Caulobacter crescentus; Sphingomonas wittichii RW1; Campylobacter jejuni subsp. jejuni NCTC 11168 ATCC 700819; Helicobacter pylori 26695; Bacteroides thetaiotaomicron VPI-5482; Porphyromonas gingivalis ATCC 33277; Mycoplasma genitalium G37; Mycoplasma pulmonis UAB CTIP; Streptococcus sanguinis SK36; Bacillus subtilis subsp. subtilis str. 168; Staphylococcus aureus subsp. aureus N315; Staphylococcus aureus subsp. aureus NCTC 8325; Mycobacterium tuberculosis H37Rv; Methanococcus maripaludis S2.]
Fig. 2 The phylogenetic tree of the organisms used in the current study.
The phylogenetic tree was constructed from the 16S ribosomal RNA sequences of the 27 organisms,
downloaded from NCBI. Based on the branches of the tree, the organisms can be divided into
4 groups: Gram-negative bacteria, Gram-positive bacteria, mycoplasma and archaea.
Protein localizations differ between essential and non-essential genes
We first submitted the amino acid sequences encoded by both essential and non-essential genes in the 27
organisms to PSORTb and obtained the predicted protein localizations. With precision values >97%
for both archaea and bacteria, PSORTb 3.0 is the most precise bacterial localization prediction tool
available.
On average, 64.40% of the proteins encoded by essential genes and 43.88% of those encoded by
non-essential genes are located in the cytoplasm. Student's t test showed that the difference is
statistically significant (p = 1.57×10^-10). For all organisms except Vibrio cholerae N16961, the
percentage of cytoplasmic proteins is higher for essential genes than for non-essential genes (Figure 3).
The anomalous result for Vibrio cholerae N16961 may be due to its high proportion of "unknown"
predictions (43.13%, compared with 17.97% on average). These results suggest that proteins encoded
by essential genes are enriched in the cytoplasm.
On average, 16.73% of the proteins encoded by essential genes and 23.35% of those encoded by
non-essential genes are located in the cytoplasmic membrane. Student's t test showed that the
difference is statistically significant (p = 1.33×10^-5). The pink bars in Figure 3 show that in
23 (85.19%) of the 27 data sets, the percentage of proteins located in the cytoplasmic membrane is
higher for non-essential genes than for essential genes. These results suggest that proteins encoded
by non-essential genes are enriched in the cytoplasmic membrane.
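The comparison above can be sketched in a few lines. This is a minimal illustration of a two-sample Student's t statistic with pooled variance, using only the Python standard library. The slides do not say whether the test was paired or unpaired, so the unpaired, equal-variance form here is an assumption. The two samples are the cytoplasm percentages from the first five rows of Table 2, not the full 27-genome data set, so the result will not reproduce the reported p values.

```python
import math
from statistics import mean, variance


def student_t_statistic(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Assumes independent samples with
    equal variances; a paired design would need a different formula.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / dof
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    if std_err == 0:
        raise ValueError("both samples are constant; t is undefined")
    return (mean(sample_a) - mean(sample_b)) / std_err, dof


# Cytoplasm percentages from the first five rows of Table 2 (subset only).
ESSENTIAL = [58.26, 68.84, 56.19, 74.30, 58.78]
NON_ESSENTIAL = [43.19, 42.28, 36.70, 41.60, 45.88]

if __name__ == "__main__":
    t, dof = student_t_statistic(ESSENTIAL, NON_ESSENTIAL)
    print(f"t = {t:.2f} with {dof} degrees of freedom")
```

Converting t to a p value requires the t distribution, which the standard library does not provide; a statistics package would normally be used for that step.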
Table 2 Percentages of proteins located in cytoplasm, cytoplasmic membrane and extracellular space for essential and non-essential genes in the 27
genomes. Column groups: C, cytoplasm; CM, cytoplasmic membrane; P, periplasm; E, extracellular; OM, outer membrane; CW, cell wall. Within each group, E = essential genes and NE = non-essential genes. All values are percentages.
Organism | C E | C NE | CM E | CM NE | P E | P NE | E E | E NE | OM E | OM NE
Salmonella typhimurium LT2 | 58.26 | 43.19 | 21.74 | 24.67 | 3.04 | 1.47 | 1.74 | 3.45 | 3.04 | 2.06
Salmonella enterica serovar Typhimurium SL1344 | 68.84 | 42.28 | 16.15 | 25.06 | 1.42 | 1.49 | 0.85 | 3.57 | 1.42 | 2.26
Salmonella enterica serovar Typhimurium str. 14028S | 56.19 | 36.7 | 28.57 | 21.38 | 0 | 1.25 | 1.90 | 2.82 | 3.81 | 1.73
Salmonella enterica serovar Typhi Ty2 | 74.30 | 41.6 | 14.25 | 24.71 | 0.56 | 1.56 | 0.56 | 3.43 | 1.12 | 2.05
Escherichia coli MG1655 I | 58.78 | 45.88 | 18.72 | 28.19 | 0.16 | 1.2 | 1.81 | 4.58 | 0.66 | 2.33
Shewanella oneidensis MR-1 | 72.39 | 55.39 | 13.93 | 23.66 | 0.25 | 0.63 | 1.00 | 4.53 | 1.24 | 2.27
Vibrio cholerae N16961 | 41.21 | 44.27 | 13.35 | 27.18 | 0.51 | 1.36 | 1.03 | 2.99 | 0.77 | 2.24
Haemophilus influenzae Rd KW20 | 61.84 | 52.93 | 21.03 | 25.78 | 0.16 | 0.98 | 2.34 | 3.91 | 1.71 | 3.52
Acinetobacter baylyi ADP1 | 75.75 | 45.64 | 12.22 | 24.06 | 0.2 | 0.93 | 0.80 | 1.73 | 1.40 | 3.86
Pseudomonas aeruginosa PAO1 | 48.72 | 46.61 | 31.62 | 22.83 | 0 | 1.3 | 0.85 | 3.12 | 4.27 | 3.03
Pseudomonas aeruginosa UCBPP-PA14 | 61.79 | 42.92 | 13.73 | 19.38 | 1.19 | 0.63 | 1.79 | 2.19 | 0.60 | 0.73
Burkholderia thailandensis E264 | 66.01 | 42.59 | 16.50 | 20.47 | 0.49 | 1.84 | 0.74 | 3.37 | 0.99 | 2.37
Francisella novicida U112 | 67.86 | 47.25 | 17.86 | 24.91 | 0.51 | 1.73 | 0.77 | 1.28 | 0.77 | 1.81
Caulobacter crescentus | 65.42 | 35.79 | 15.63 | 20.47 | 0.63 | 0.99 | 0.83 | 2.98 | 1.46 | 3.07
Sphingomonas wittichii RW1 | 60.19 | 47.07 | 14.21 | 17.82 | 0.19 | 0.6 | 0.75 | 2.20 | 0.93 | 4.59
Campylobacter jejuni NCTC 11168=ATCC 700819 | 56.14 | 53.05 | 17.98 | 21.79 | 0.44 | 1.29 | 1.32 | 2.22 | 0.88 | 2.08
Helicobacter pylori 26695 | 54.80 | 51.01 | 18.27 | 19.91 | 0.31 | 1.85 | 0.62 | 1.23 | 2.48 | 2.82
Bacteroides thetaiotaomicron VPI-5482 | 64.92 | 40.11 | 15.08 | 17.61 | 0.92 | 1.17 | 1.85 | 1.53 | 1.85 | 4.69
Porphyromonas gingivalis ATCC 33277 | 67.82 | 42.41 | 17.28 | 16.96 | 0.22 | 0.31 | 0.43 | 0.98 | 1.51 | 2.15
Mycobacterium tuberculosis H37Rv | 59.83 | 37.92 | 22.56 | 19.93 | 0.87 | 2.02 | 1.60 | 1.30 | 0.15 | 0.16
The rows below, which include the CW (%) column, are only partly recoverable. Values are given in the order extracted, starting with C E, C NE, CM E, CM NE:
Streptococcus sanguinis | 81.19 | 48.34 | 13.30 | 30.56 | 0
Bacillus subtilis 168 | 80.07 | 48.50 | 14.39 | 30.32
Staphylococcus aureus N315 | 81.13 | 45.77 | 13.91
Staphylococcus aureus NCTC 8325 | 75.21 | 42.15
Methanococcus maripaludis S2 | 82.47
Mycoplasma genitalium G37 |
Mycoplasma pulmonis UAB CTIP | 60.97 | 30.75 | 17.74 | 27.95 | 1.29 | 1.55
Values from these rows that cannot be assigned to a cell, in extraction order: 1.07, 0.00, 2.39, 0, 2.4, 0.37, 0.63, 28.80, 0, 4.03, 0.00, 1.84, 13.68, 28.77, 0.28, 4.25, 0.85, 1.46, 63.97, 9.83, 21.45, 0, 1.11, 0.19, 0.19, 51.71, 36.17, 26.25, 30.85, 0.52, 0
Average | C E 64.40 | C NE 43.88 | CM E 16.73 | CM NE 23.35 | further values in extraction order: 0.50, 1.54, 0.30, 1.30, 1.19, 2.77, 1.25, 2.60
P value (Student's t test) | C 1.57×10^-10 | CM 1.33×10^-5 | further values in extraction order: 1.95×10^-4, 4.06×10^-6, 3.06×10^-3, 0.047
Fig. 3 Percentages of proteins located in the cytoplasm (green bars), cytoplasmic membrane (pink bars)
and extracellular space (red bars) for essential genes (the left column of each pair) and
non-essential genes (the right column of each pair) in the 27 genomes.
• For both essential and non-essential proteins, the proportions of secreted proteins are quite low:
just 0.50% of essential proteins and 1.54% of non-essential proteins are located in the extracellular space.
• Prediction coverage is higher for essential genes than for non-essential genes.
• Protein localization differences between essential and non-essential genes are more pronounced in
Gram-positive bacteria than in Gram-negative bacteria. The reason may be that the cell structure is
simpler in Gram-positive bacteria.
An alternative form of Fig. 3
[Bar charts, panels a to c; y axis: percentage (%); series: essential gene, non-essential gene.]
Fig. 3 a. Percentages of proteins located in the cytoplasm for essential and non-essential genes in the 27
genomes. b. Percentages of proteins located in the cytoplasmic membrane for essential and non-essential genes
in the 27 genomes. c. Percentages of proteins located in the extracellular space for essential and non-essential genes
in the 27 genomes.
The proportions of non-essential proteins located in the periplasm, outer membrane and cell wall are higher
than those of essential proteins. The corresponding p values are 4.06×10^-6, 3.06×10^-3 and 0.047. All the
values are less than 0.05, which means that the differences are statistically significant.
[Bar charts, panels a to c; y axis: percentage (%); x axis: organisms; series: essential gene, non-essential gene. Panel c shows M. maripaludis S2, S. aureus NCTC 8325, S. aureus N315, B. subtilis 168 and S. sanguinis.]
Fig. 4 Percentages of proteins located in periplasm (a) and outer membrane (b) of essential and non-essential genes in the 20 Gram-negative
genomes. c. Percentages of proteins located in cell wall of essential and non-essential genes in the 4 Gram-positive and archaeal genomes.
Protein localization analysis of essential genes based on GO terms
The Gene Ontology (GO) is one of the most widely used controlled vocabularies for describing the
roles of genes and the characteristics of gene products. The ontology covers three domains:
cellular component, molecular function and biological process.
Fisher's exact test was employed to identify the GO terms enriched in the essential genes of the
27 prokaryotes. P values less than 0.05 were considered statistically significant.
GO:0005737 (cytoplasm), GO:0005840 (ribosome) and GO:0015935 (small ribosomal subunit)
are over-represented among essential genes in all 4 groups of organisms in the cellular component
category. This result provides further evidence that proteins encoded by essential genes tend to
be located in the cytoplasm.
GO:0016021 (integral component of membrane), GO:0016020 (membrane), GO:0005886
(plasma membrane), GO:0005622 (intracellular) and GO:0009279 (cell outer membrane) are
under-represented in more than 6 organisms, which suggests that proteins in cellular components
such as membranes are less closely associated with essential genes and are more likely to be
encoded by non-essential genes.
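The enrichment test described above operates on a 2×2 contingency table: essential or non-essential genes against genes with or without a given GO term. The sketch below computes one-sided Fisher's exact p values from the hypergeometric distribution using only the standard library. The slides do not state whether one-sided or two-sided tests were used, so the one-sided form is an assumption. The counts in the usage example are hypothetical.

```python
from math import comb


def fisher_over_represented(ess_with, ess_without, non_with, non_without):
    """One-sided Fisher's exact test for over-representation.

    Returns P(X >= ess_with), where X is the number of essential genes
    carrying the GO term under the hypergeometric null hypothesis.
    """
    n_ess = ess_with + ess_without
    n_term = ess_with + non_with
    n_total = n_ess + non_with + non_without
    upper = min(n_ess, n_term)
    tail = sum(
        comb(n_term, k) * comb(n_total - n_term, n_ess - k)
        for k in range(ess_with, upper + 1)
    )
    return tail / comb(n_total, n_ess)


def fisher_under_represented(ess_with, ess_without, non_with, non_without):
    """One-sided Fisher's exact test for under-representation: P(X <= ess_with)."""
    n_ess = ess_with + ess_without
    n_term = ess_with + non_with
    n_total = n_ess + non_with + non_without
    lower = max(0, n_ess - (n_total - n_term))
    tail = sum(
        comb(n_term, k) * comb(n_total - n_term, n_ess - k)
        for k in range(lower, ess_with + 1)
    )
    return tail / comb(n_total, n_ess)


if __name__ == "__main__":
    # Hypothetical counts: 40 of 300 essential genes and 20 of 3000
    # non-essential genes carry the term.
    p = fisher_over_represented(40, 260, 20, 2980)
    print(f"over-representation p = {p:.3g}; significant at 0.05: {p < 0.05}")
```

A term would be called over-represented if the first p value falls below 0.05, and under-represented if the second does.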
Fig. 5 Statistically significant gene ontology terms in the category of cellular component. Every GO ID with
P value less than 0.05 according to the results of Fisher's exact tests is listed in the vertical axis. If the GO
term is over-represented in the organism listed in the horizontal axis, the cell at the crossing of the row and
column is red. Blue boxes represent that the GO term is under-represented in the organism of the column. If
the GO term is not statistically significant in the organism, the box is white.
The discrepancy in protein functions between different subcellular sites in Bacillus subtilis str. 168
Within a specific subcellular site, protein functions differ considerably between essential and
non-essential genes. To observe this discrepancy quantitatively in Bacillus subtilis str. 168, we first
extracted the proteins located in the cytoplasm and in the cytoplasmic membrane for each of the two
groups of genes separately. Then we calculated the percentages of the associated molecular function
GO terms. For essential proteins located in the cytoplasm, GO:0005524 (ATP binding) had the greatest
proportion (47.62%), whereas for non-essential proteins located at this site, GO:0003677 (DNA binding)
had the greatest proportion (17.17%). Other GO terms with relatively high proportions are shown in
Fig. 6a. For essential proteins located in the cytoplasmic membrane, the proportion of GO:0005524
(ATP binding) was the greatest (16.00%), whereas for non-essential proteins located at this site,
GO:0005215 (transporter activity) had the greatest proportion (9.55%). Other GO terms with relatively
high proportions are displayed in Fig. 6b.
[Bar charts, panels a and b; y axis: percentage (%); series: essential, non-essential.]
Fig. 6 Percentages of GO terms in the genes located in cytoplasm (a) and cytoplasmic membrane (b).
GO terms in panel a:
GO:0005524 ATP binding
GO:0000287 magnesium ion binding
GO:0046872 metal ion binding
GO:0003677 DNA binding
GO:0003924 GTPase activity
GO:0005525 GTP binding
GO:0000049 tRNA binding
GO:0003723 RNA binding
GO:0003676 nucleic acid binding
GO:0003700 sequence-specific DNA binding transcription factor activity
GO:0000156 phosphorelay response regulator activity
GO terms in panel b:
GO:0005524 ATP binding
GO:0016757 transferase activity, transferring glycosyl groups
GO:0047355 CDP-glycerol glycerophosphotransferase activity
GO:0046872 metal ion binding
GO:0005525 GTP binding
GO:0003924 GTPase activity
GO:0005215 transporter activity
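The per-site tally described for Bacillus subtilis str. 168 can be sketched as follows. The protein identifiers and annotations are hypothetical. The slides do not state the denominator used for the percentages, so this sketch assumes it is the number of proteins at the site.

```python
from collections import Counter


def go_term_percentages(protein_annotations):
    """Percentage of proteins that carry each GO term.

    protein_annotations maps a protein ID to an iterable of GO IDs.
    A term is counted once per protein even if it is listed twice.
    """
    n_proteins = len(protein_annotations)
    if n_proteins == 0:
        return {}
    counts = Counter(
        go_id
        for terms in protein_annotations.values()
        for go_id in set(terms)
    )
    return {go_id: 100.0 * c / n_proteins for go_id, c in counts.items()}


def top_term(percentages):
    """Return the (GO ID, percentage) pair with the greatest proportion."""
    return max(percentages.items(), key=lambda item: item[1])


# Hypothetical essential cytoplasmic proteins and their molecular function terms.
EXAMPLE = {
    "protA": ["GO:0005524", "GO:0046872"],
    "protB": ["GO:0005524", "GO:0000287"],
    "protC": ["GO:0003677"],
    "protD": ["GO:0005524", "GO:0005525", "GO:0003924"],
}

if __name__ == "__main__":
    go_id, pct = top_term(go_term_percentages(EXAMPLE))
    print(f"most frequent term: {go_id} ({pct:.2f}%)")
```

Running the same tally separately for essential and non-essential proteins at one site gives the pairs of bars compared in each panel of Fig. 6.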
Discussion
Our results show, for the first time, the difference in protein localization between essential and non-essential genes in
prokaryotes. Essential proteins are enriched in the cytoplasm. The proportions of non-essential proteins located in the
cytoplasmic membrane, periplasm, outer membrane, cell wall and extracellular space are significantly higher than those
of essential proteins. Fisher's exact test of GO terms reached a consistent conclusion.
Fisher's exact test was also employed to identify enriched GO terms in the biological process and molecular function
categories. GO:0007049 (cell cycle), GO:0006260 (DNA replication), GO:0009252 (peptidoglycan biosynthetic
process), GO:0051301 (cell division), GO:0065002 (intracellular protein transmembrane transport),
GO:0006265 (DNA topological change) and GO:0006184 (GTP catabolic process) are the most significantly
over-represented biological process GO terms. These processes are all indispensable for a cell and take place in the
cytoplasm or at the ribosome. The GO terms under-represented in more than 6 organisms in this category are
GO:0006355 (regulation of transcription, DNA-templated), GO:0035556 (intracellular signal transduction),
GO:0006200 (ATP catabolic process), GO:0005975 (carbohydrate metabolic process) and GO:0055114
(oxidation-reduction process). Among the GO terms relating to molecular function, the most significantly
over-represented are GO:0003735 (structural constituent of ribosome), GO:0019843 (rRNA binding), GO:0005524
(ATP binding), GO:0000049 (tRNA binding), GO:0000287 (magnesium ion binding) and GO:0005525 (GTP binding).
In contrast, GO:0003700 (sequence-specific DNA binding transcription factor activity), GO:0003677 (DNA binding),
GO:0003824 (catalytic activity), GO:0051539 (4 iron, 4 sulfur cluster binding), GO:0000155 (phosphorelay sensor
kinase activity), GO:0043565 (sequence-specific DNA binding), GO:0000156 (phosphorelay response regulator
activity) and GO:0004872 (receptor activity) are significantly under-represented in more than 6 organisms.
Considering protein localization and protein function together tells us more about essential genes. These results
provide further insight into the fundamental functions needed to support cellular life and could improve gene
essentiality prediction by taking protein localization and enriched GO terms into consideration.
Materials and Methods
Bioinformatics Databases
DEG is a database of essential genes, available at http://www.essentialgene.org/, that stores records of
currently available essential genes, non-essential genes and genomic elements from a wide range of
organisms, including bacteria, archaea and eukaryotes. The non-essential genes of Methanococcus
maripaludis S2 and 13 bacteria, such as Escherichia coli MG1655, were obtained from the original
literature, while the non-essential genes of the other 12 organisms, such as Bacillus subtilis 168, are the
complement of the essential gene set.
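For the organisms where non-essential genes are defined as the complement of the essential set, the derivation is a plain set difference. The sketch below illustrates it with hypothetical gene identifiers. The function name and the consistency check are ours, not part of DEG.

```python
def nonessential_by_complement(all_genes, essential_genes):
    """Return the genes in the genome that are not annotated as essential.

    Raises ValueError if an essential gene is missing from the genome
    list, since that usually signals mismatched identifiers.
    """
    genome = set(all_genes)
    essential = set(essential_genes)
    unknown = essential - genome
    if unknown:
        raise ValueError(f"essential genes not found in genome: {sorted(unknown)}")
    return genome - essential


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    genome = ["gene001", "gene002", "gene003", "gene004", "gene005"]
    essential = ["gene002", "gene005"]
    print(sorted(nonessential_by_complement(genome, essential)))
```

Note that a complement defined this way labels every untested or ambiguous gene as non-essential, which differs from the organisms where non-essential genes were taken from the original literature.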
The UniProt Knowledgebase (UniProtKB; http://www.uniprot.org), maintained by the UniProt Consortium,
is the central hub for functional information on proteins and provides a comprehensive, high-quality and
freely accessible resource of protein sequences and functional annotation. Manual and electronic GO terms
are assigned to the corresponding UniProt entries by the Gene Ontology Annotation program, with annotations
also supplied by external collaborating GO Consortium groups. In this study, part of the subcellular
location information and the GO terms used in the analysis were downloaded from UniProtKB.
Software Tools
PSORTb is the most precise bacterial localization prediction tool available. The likelihood that a protein
is at a specific localization site is expressed as a score. Because PSORTb 3.0 added the capability to
predict the subcellular localization of archaeal proteins, we could also obtain localization information for
Methanococcus maripaludis S2 with this tool.
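Turning per-site scores into a single localization call can be sketched as below. This assumes the scores have already been parsed into a dictionary; it does not read PSORTb output files. The cutoff of 7.5 is our assumption about the tool's reporting threshold and should be checked against the PSORTb documentation. The example scores are hypothetical.

```python
def call_localization(scores, cutoff=7.5):
    """Pick the highest-scoring site, or "Unknown" if no site reaches the cutoff.

    scores maps a localization site name to its score. The default cutoff
    is an assumed value, not one stated in the slides.
    """
    if not scores:
        return "Unknown"
    site, best = max(scores.items(), key=lambda item: item[1])
    return site if best >= cutoff else "Unknown"


if __name__ == "__main__":
    # Hypothetical score profiles for two proteins.
    confident = {"Cytoplasmic": 9.97, "CytoplasmicMembrane": 0.01, "Extracellular": 0.02}
    ambiguous = {"Cytoplasmic": 2.50, "CytoplasmicMembrane": 2.50, "Extracellular": 5.00}
    print(call_localization(confident))
    print(call_localization(ambiguous))
```

Proteins that end up as "Unknown" under such a rule correspond to the "unknown" predictions discussed for Vibrio cholerae N16961.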
GO terms Analysis
Fisher's exact test, a statistical significance test used in the analysis of contingency tables, was
employed to identify the GO terms enriched in the essential genes of the 27 prokaryotes, comprising 24
bacteria, 2 mycoplasmas and one archaeon, Methanococcus maripaludis S2. P values less than 0.05 were
considered statistically significant.
Acknowledgments
The authors thank Dr. McGarvey for providing assistance in obtaining the GO IDs of the genes.
They also would like to thank Dr. Ren Zhang for invaluable assistance. The present work was
supported in part by National Natural Science Foundation of China (Grant Nos. 31171238 and
30800642), and Program for New Century Excellent Talents in University (No. NCET-120396).
Author contributions
FG designed the study. CP, FG and XZ performed the data analysis. YL and HL contributed
detailed discussions and revisions. CP and FG wrote the main manuscript text. All authors
reviewed the manuscript.
Competing financial interests
The authors declare no competing financial interests.