Comparative metagenomics of microbial traits
within oceanic viral communities
Methods
Clustering procedure for proxy proteins
Procedure cluster_proxy_proteins(proxies, BLAST(proxies)):
Input: Proxy proteins (proxies), BLAST results for each pr ∈ proxies
(BLAST(pr, proxies)).
Output: proxy_clusters, clusters representing homologous proxy proteins.
Construction of the graph G(V, E):
1. Set V = proxies
2. Foreach pr1 ∈ V do
   a. Foreach pr2 ∈ BLAST(pr1, proxies) do
      i. If identity(pr1, pr2)/|pr1| ≥ 0.6 and conserved(pr1, pr2)/|pr2| ≥ 0.75
         then
      ii. E = E ∪ {(pr1, pr2)}
      iii. End if
   b. End foreach
3. End foreach
Clustering:
4. Foreach pr ∈ V do:
   a. If ∃C ∈ proxy_clusters s.t. pr ∈ C then continue
   b. Generate Cpr, the set of all pr’ ∈ V s.t.
      i. (pr, pr’) ∈ E and (pr’, pr) ∈ E
      ii. ¬∃C ∈ proxy_clusters s.t. pr’ ∈ C
   c. ∀pr’ ∈ Cpr, set rank(pr’) = (# edges (pr’, pr’’) s.t. pr’’ ∈ Cpr)/(# vertices in Cpr)
   d. Find prmin, the vertex with the minimal rank in Cpr
   e. If rank(prmin) < 0.7 then
      i. Cpr = Cpr \ {prmin}
      ii. Goto 4c
   f. End if
   g. Add Cpr to proxy_clusters
5. End foreach
identity(pr1, pr2) and conserved(pr1, pr2) (step 2.a.i) are the number of identical
and conserved amino acids, respectively, in the alignment of pr1 and pr2. The
algorithm assumes that the graph resulting from the graph construction stage is a
collection of strongly connected components. If this is not the case, the algorithm
will generate many small clusters. Exclusion of vertices from Cpr (step 4e) is
based on a heuristic; other alternatives also exist, but we have found this one to
be both fast and accurate and therefore chose to use it. The criterion is stringent
and prefers accuracy over sensitivity.
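The clustering procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study. The function names, the tuple layout of `hits`, and the toy data are hypothetical. Both thresholds in step 2.a.i are read as "greater than or equal to". Every protein is treated as having an edge to itself (as a BLAST self-hit would produce), so that a singleton cluster has rank 1.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical hit layout: (pr1, pr2, identical, conserved, len(pr1), len(pr2))
Hit = Tuple[str, str, int, int, int, int]

def build_edges(hits: Iterable[Hit], min_identity: float = 0.6,
                min_conserved: float = 0.75) -> Set[Tuple[str, str]]:
    """Steps 1-3: keep a directed edge (pr1, pr2) when both ratios pass (>= assumed)."""
    edges = set()
    for pr1, pr2, identical, conserved, len1, len2 in hits:
        if identical / len1 >= min_identity and conserved / len2 >= min_conserved:
            edges.add((pr1, pr2))
    return edges

def cluster_proxy_proteins(proxies: List[str], edges: Set[Tuple[str, str]],
                           min_rank: float = 0.7) -> List[Set[str]]:
    """Steps 4-5: greedy clustering around each still-unassigned protein."""
    def has_edge(a: str, b: str) -> bool:
        return a == b or (a, b) in edges      # assumption: implicit self-edges

    assigned: Set[str] = set()
    clusters: List[Set[str]] = []
    for pr in proxies:
        if pr in assigned:                    # step 4a
            continue
        # Step 4b: unassigned proteins with reciprocal edges to pr.
        cpr = {q for q in proxies
               if q not in assigned and has_edge(pr, q) and has_edge(q, pr)}
        while True:
            # Step 4c: fraction of cluster members each vertex points to.
            rank = {p: sum(has_edge(p, q) for q in cpr) / len(cpr) for p in cpr}
            pr_min = min(sorted(cpr), key=rank.get)   # step 4d, ties broken by name
            if rank[pr_min] >= min_rank:              # step 4e does not fire: done
                break
            cpr.remove(pr_min)                        # step 4e.i, then back to 4c
        clusters.append(cpr)                          # step 4g
        assigned |= cpr
    return clusters

# Toy example: a, b and c are mutually similar; d is similar to a only.
pairs = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d")]
toy_edges = {(x, y) for x, y in pairs} | {(y, x) for x, y in pairs}
toy_clusters = cluster_proxy_proteins(["a", "b", "c", "d"], toy_edges)
```

In the toy example, d initially joins the cluster seeded by a but has rank 2/4 = 0.5, below the 0.7 cutoff, so it is removed and later forms its own singleton cluster. This mirrors the behavior described above: a graph that is not a collection of strongly connected components tends to produce small clusters.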
KEGG enrichment analysis (Fig. 2B)
All genes belonging to one of the 34 manually verified microbial gene clusters on
viral scaffolds were blasted against the KEGG protein database with parameters (-p
blastx -e 1e-5 -F F). In order to detect enrichment with respect to known viral
genes in the Global Ocean Survey (GOS) dataset, all viral genes on viral
scaffolds were blasted against the KEGG protein database using the same
parameters. Also, approximately 300,000 microbial proteins were collected from
non-viral scaffolds in GOS in order to detect enrichment with respect to microbial
genes as well. Proteins were chosen from the GOS predicted proteins database
(Rusch et al., 2007). Best hits for all genes in each of the three sets (when
available) were grouped based on categories available in the KEGG database.
Whenever a gene matched more than one category, it was counted twice;
categories clearly not related to microbes and viruses were excluded, as well as
categories containing less than 1% of the genes for all three sets.
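The grouping and filtering step can be illustrated with a short Python sketch. It is a hedged illustration only: the function names, the input layout (a mapping from gene to the categories of its best hit), the category names in the example, and the choice of the number of genes with a best hit as the denominator are all assumptions rather than details given in the text.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set

def category_shares(best_hit_categories: Dict[str, Iterable[str]]) -> Dict[str, float]:
    """Fraction of genes in one set assigned to each category.

    A gene whose best hit maps to several categories contributes to each of them.
    """
    counts: Counter = Counter()
    for categories in best_hit_categories.values():
        counts.update(set(categories))
    n_genes = len(best_hit_categories)   # assumption: genes with a best hit
    return {cat: k / n_genes for cat, k in counts.items()}

def retained_categories(shares_per_set: List[Dict[str, float]],
                        min_share: float = 0.01) -> Set[str]:
    """Drop categories that stay below min_share in every gene set."""
    all_categories = set().union(*shares_per_set)
    return {cat for cat in all_categories
            if any(shares.get(cat, 0.0) >= min_share for shares in shares_per_set)}

# Hypothetical toy data: one rare category held by 1 of 200 genes (0.5%).
target = {f"g{i}": ["Energy metabolism"] for i in range(199)}
target["g199"] = ["Rare category"]
reference = {"h0": ["Energy metabolism", "Translation"]}
kept = retained_categories([category_shares(target), category_shares(reference)])
```

Here "Rare category" falls below 1% in every set and is dropped, while the other two categories are retained because each reaches the threshold in at least one set.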
The p-value was calculated based on the hypergeometric distribution using the GOS
microbial genes as a reference set and the microbial genes on viral genomes as
the target. Assume that the set of GOS microbial genes consists of N genes, of
which M belong to category C, and that m out of the n genes in the set
of microbial genes on viral scaffolds also belong to category C. Then
the probability of observing at least m genes of category C among n genes
sampled at random from the GOS microbial gene set can serve as a p-value
describing the significance of the difference between C’s representation in the
microbial genes on viral scaffolds and in the set of GOS microbial proteins.
The hypergeometric distribution is given by
\[
\Pr(X = m) = \frac{\binom{M}{m}\binom{N-M}{n-m}}{\binom{N}{n}}
\]
The probability of observing m members of category C or more is given by the
hypergeometric tail:
\[
\Pr(X \ge m) = \sum_{k=m}^{\min(M,n)} \Pr(X = k)
\]
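As a worked illustration, the tail probability can be computed exactly with Python's standard library. This is a minimal sketch, not the authors' implementation; the function names are hypothetical, and N, M, n and m follow the definitions above (reference set size, reference genes in category C, target set size, target genes in category C).

```python
from math import comb

def hypergeom_pmf(k: int, N: int, M: int, n: int) -> float:
    """Pr(X = k): exactly k of n genes drawn without replacement fall in category C."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

def hypergeom_tail(m: int, N: int, M: int, n: int) -> float:
    """Pr(X >= m): sum of Pr(X = k) for k = m .. min(M, n)."""
    upper = min(M, n)
    # Sum the integer numerators first so that only one division is performed.
    numerator = sum(comb(M, k) * comb(N - M, n - k) for k in range(m, upper + 1))
    return numerator / comb(N, n)

# Small example: N = 10 reference genes, M = 4 in category C,
# n = 3 target genes, m = 2 of them in category C.
p_value = hypergeom_tail(2, N=10, M=4, n=3)   # (36 + 4) / 120 = 1/3
```

Summing exact integer binomial coefficients before dividing avoids accumulating floating-point error across the terms of the tail, which matters when the reference set contains hundreds of thousands of genes.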
3D structure modeling
The 3D structures in Fig. 3B were built using the SWISS-MODEL homology
modeling suite (Arnold et al., 2006). The most homologous 3D structure
templates were identified by gapped BLAST and HHSearch template library
searches using several complete viral PDF contigs, and were found to correspond
to Thermotoga maritima PDF, A. thaliana PDF1B, Plasmodium falciparum PDF
(e-15<E-value<e-12) and Xanthomonas oryzae PDF (Score=349.38). Next, the 3D
structure was computed with the automatic mode using the longest full-length
viral PDF contig as the search template (1096627055049.1149.1577; see panel
A) and the crystal structure of X. oryzae pv. oryzae KACC10331 PDF (PDB code
3dldA). The calculated E-value was 3.2e-29 and the final total energy of the
computed model was -4734.566 kJ/mol. The refined model (top Fig. 3B) was
compared to the 3D crystal structure of the most relevant PDF, the chloroplastic
PDF1B (PDB code 3cpmA; bottom Fig. 3B). The main secondary structures of
both structures were rendered with PyMOL Release 0.99 (DeLano, W.L. (2002) The
PyMOL Molecular Graphics System. DeLano Scientific, San Carlos, CA, USA;
http://www.pymol.org).
Further description of results
Gene clustering
Overall, 13,741 clusters resulted from the above process, 12,257 of which
consisted of a single proxy protein. A total of 6,851 (49.9%) of the clusters
received viral-microbial (VM) tagging, 5,182 (37.7%) were microbial-exclusive (M)
and 1,708 (12.4%) were viral-exclusive (V). Conflicting tagging was found in only
38 (0.3%) of the clusters, supporting the credibility of the process. Table S1
summarizes the 11 largest sequence similarity-based clusters (before semantic
clustering) in terms of number of proxy proteins, including their tagging and
annotation. Annotation of these clusters was done manually based on the
annotation of most cluster members. In most cases, annotation of the different
cluster members was consistent, again supporting the credibility of the process.
Overall, 5,267 scaffolds were identified as viral and as containing at least one microbial
gene that was not previously observed on viral genomes. After removing
redundant scaffolds, we were left with 3,083 scaffolds, which were further
processed. Next, we were interested in validating the origin of our scaffolds; this
was done by recruiting the scaffolds against the Northern Line Islands samples,
excluding the two samples from Christmas Island. As a control, we added all
scaffolds from GOS containing some part of the 16S rRNA coding gene.
Fig. S3 presents the results of the recruitment. Overall, 47% of the VirMic
scaffolds recruited at least two reads. Of these, the vast majority received significant
recruitment from the viral sample and little or no recruitment from the
microbial fraction. Control scaffolds, on the other hand, recruited many more
reads from the microbial samples than from the viral samples. These results are
in good agreement with our process, which relied on gene content-based
separation of scaffolds into viral and microbial bins. All scaffolds that recruited
only microbial reads were excluded from further analysis; these included for the
most part scaffolds with mostly microbial and very few viral genes. Note that Fig.
S3 contains all scaffolds that were identified as potential viral scaffolds with
microbial genes before the manual inspection stage (see below). Even in this set,
the separation of most scaffolds is clearly observed, suggesting that the
credibility of the process is high.
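The recruitment-based filter described above can be sketched as follows. This is an illustrative sketch only: the function name, the labels, the input layout (viral and microbial read counts per scaffold), and the reading of "at least two reads" as a total across both fractions are assumptions, not details taken from the analysis itself.

```python
from typing import Dict, Tuple

def classify_recruitment(reads: Dict[str, Tuple[int, int]],
                         min_reads: int = 2) -> Dict[str, str]:
    """Label each scaffold from its (viral_reads, microbial_reads) counts."""
    labels = {}
    for scaffold, (viral, microbial) in reads.items():
        if viral + microbial < min_reads:
            labels[scaffold] = "not_recruited"     # too few reads to judge
        elif viral == 0:
            labels[scaffold] = "microbial_only"    # excluded from further analysis
        else:
            labels[scaffold] = "viral_supported"
    return labels

# Hypothetical read counts for three scaffolds.
example = classify_recruitment({"s1": (40, 1), "s2": (0, 12), "s3": (1, 0)})
```

In this example s1 is supported by viral reads, s2 recruits only microbial reads and would be excluded, and s3 recruits too few reads to be classified either way.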
Table S1. Clusters with the largest number of proxy proteins.
Cluster  Size  Tagging  Description
C133     37    VM       PSII D1
C83      31    VM       Ribonucleoside-diphosphate reductase beta subunit (small chain)
C322     24    M        Scaffold protein/FeS cluster assembly scaffold IscU/NifU family protein
C310     21    M        Ribosomal protein S6 modification protein/alpha-L-glutamate ligase
C542     21    VM       Ribonucleotide-diphosphate reductase subunit alpha
C34      19    VM       Ferredoxin
C22      17    VM       Guanosine 5'-monophosphate oxidoreductase
C47      17    VM       GDP-mannose 4,6-dehydratase
C45      16    VM       Ribonucleoside-diphosphate reductase, alpha subunit (large chain)
C410     16    M        Redoxin/antioxidant, AhpC/Tsa family protein/peroxiredoxin/glutaredoxin family protein/putative hydroperoxide reductase protein
C276     15    VM       PSII D2
Manual inspection of all microbial gene clusters found on viral scaffolds left us
with 34 clusters, representing the set of newly discovered microbial genes on
viral genomes (Table 1). This set consists mostly of newly discovered genes, as
well as a few photosystem I (PSI) gene clusters such as T109 (psaA), T204
(psaJ) found also in other contexts than the PSI gene cassette, and T1557
(psaC). These genes were previously reported to be of viral origin (Sharon et al.,
2009) based on the analysis of metagenomic data, and their finding using our
generalized approach increases our confidence in the method. Note that other
PSI genes (psaB, psaD, psaE and psaK) were not found due to the lack of viral
genes on the scaffolds on which they appeared.
Origin of genes in viral scaffolds
As expected, the vast majority of genes in the set of viral scaffolds (86%) most
resemble viral genes. This share decreases to 64% when the set of
scaffolds containing novel microbial genes is considered, due to
the directed enrichment of this set in microbial genes. Of all viral genes, about
76% in the full set of scaffolds and 92% in the set containing novel microbial
genes are cyanophage genes. It is hard to tell whether the observed
enrichment in cyanophage genes in the set of scaffolds containing microbial
genes is due to an increased exchange of genes in the cyanobacteria-cyanophage
system or, possibly, to technical reasons such as the availability of a relatively
large number of cyanobacteria and cyanophage genomes in RefSeq with respect to
other genomes. The majority of cyanophage genes found most resemble genes of
cyanophages infecting Prochlorococcus strains, which is expected because
the vast majority of sequences in the GOS database originate from open-water
environments, in which Prochlorococcus outnumbers Synechococcus. As
expected, myoviruses contribute the vast majority of cyanophage genes (95%
and 98% in the all-viral set and the viral with new microbial genes set, respectively). Most
likely this is due to their larger genomes and greater abundance compared to
podoviruses. Myoviruses are also natural candidates for obtaining new microbial
genes, which may explain the enrichment in these genes in the viral scaffolds
with microbial genes set; however, the difference is not significant enough to draw
firm conclusions.
Figures
Fig. S1. Viral peroxiredoxin proteins.
(A) Peroxiredoxin FastTree approximated maximum-likelihood phylogenetic tree.
Synechococcus sequences colored in cyan, Prochlorococcus in green and viral
proteins in red. Only bootstrap values above 80% are shown as black circles on
the branches. (B) Peroxiredoxin protein alignment. The conserved cysteine
residues are marked with black arrows. For clarity, only part of the protein length
is shown.
Fig. S2. Viral peptide deformylases (PDFs) classification in the phylogenetic tree
of PDFs.
One hundred and ninety-two sequences were selected to represent PDF
sequence diversity among 500 sequences. The sequences were extracted from
completely sequenced genomes or from genomes for which sequencing was
almost complete. The sequences were aligned with Clustal X (Jeanmougin et al.,
1998) and the bootstrap tree was constructed with PHYLIP. The random number
generator seed was 111 and the number of bootstrap trials was 1,000. The
rooted phylogenetic tree was constructed with N-J Tree and drawn with
TreeView 1.66 (Page, 2002). Internal values labeled on each node record the
stability of the branch over the bootstrap replicates. The three main PDF types
and classes are clustered and shown in color. Only bootstrap values above 80%
are shown as black circles on the branches.
Fig. S3. Recruitment of GOS and VirMic scaffolds against the Northern Line
Islands biomes.
Recruitment of VirMic scaffolds (blue) and scaffolds containing 16S rRNA coding
genes (red) against the Northern Line Islands viral (X axis) and microbial (Y axis)
fractions, excluding the Christmas Island samples.
Fig. S4. Distribution of GOS genes across all viral scaffolds. About 86% of the
202,176 genes found on the 84,922 scaffolds identified as viral have a viral protein
as their proxy protein. Left: 76% of viral genes most resemble cyanophage
genes, of which about 95% most resemble myovirus genes. Right: interestingly,
Synechococcus and Prochlorococcus genes account for only 4.9% and 8.7%,
respectively, of all microbial-like genes.
Fig. S5. Distribution of GOS genes across viral scaffolds containing microbial
genes previously undetected on fully sequenced genomes. Cyanophage genes
(92% of all viral genes) are enriched with respect to the set of all viral scaffolds
(76%, Fig. S4, left). Prochlorococcus and Synechococcus genes also increase
their share (25% vs. 14%). As expected, the vast majority of viral genes come
from myoviruses, most notably P-SSM2 (56% of all viral genes).
References
Arnold, K., Bordoli, L., Kopp, J., and Schwede, T. (2006) The SWISS-MODEL
workspace: a web-based environment for protein structure homology modelling.
Bioinformatics 22: 195-201.
Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high
throughput. Nucleic Acids Res 32: 1792-1797.
Guindon, S., Delsuc, F., Dufayard, J.F., and Gascuel, O. (2009) Estimating maximum
likelihood phylogenies with PhyML. Methods Mol Biol 537: 113–137.
Jeanmougin, F., Thompson, J.D., Gouy, M., Higgins, D.G., and Gibson, T.J. (1998)
Multiple sequence alignment with Clustal X. Trends Biochem Sci 23: 403-405.
Page, R.D. (2002) Visualizing phylogenetic trees using TreeView. Curr Protoc
Bioinformatics Chapter 6: Unit 6 2.
Price, M.N., Dehal, P.S., and Arkin, A.P. (2010) FastTree 2 - approximately maximum-likelihood
trees for large alignments. PLoS One 5: e9490.
Rusch, D.B., Halpern, A.L., Heidelberg, K.B., Sutton, G., Williamson, S.J., Yooseph, S.
et al. (2007) The Sorcerer II Global Ocean Sampling expedition: I, The northwest
Atlantic through the eastern tropical Pacific. PLoS Biol 5: e77.
Sharon, I., Alperovitch, A., Rohwer, F., Haynes, M., Glaser, F., Atamna-Ismaeel, N. et al.
(2009) Photosystem-I gene cassettes are present in marine virus genomes. Nature
461: 258-262.