Harnessing public data repositories for metaproteomics using Enosi
Natalie E Castellana*, Andrey D. Prjibelski+, Dmitry Antipov+
*Digital Proteomics LLC, San Diego, CA
+Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia

Motivation and Background

Metaproteomics is the next frontier in proteomics research, yet bioinformatic tools and resources
remain scarce. The lack of fully sequenced genomes or proteomes creates an enormous challenge.
Metaproteomics experiments are often accompanied by expensive and time-consuming
metagenomic and metatranscriptomic experiments that may only incompletely represent the
repertoire of proteins in the sample. Alternatively, de novo sequencing of peptides can be used;
however, this approach does not report the context of the peptides in a gene. In this study we
examine the opportunity to re-use public data, replacing the need to couple large-scale
metagenomic studies with metaproteomics.
All datasets used in the study are publicly available. The driving questions behind our
proof-of-concept study were the following:

Q: Can current proteogenomic technology [1,2] be adapted for metaproteogenomics?
Q: Are matched metagenomic datasets required for metaproteomics?
Q: Can similar (but unmatched) metatranscriptomic data fill the gap between the sequenced,
cultured organism and the sampled organism?

Data

Mass spectra from soil samples collected at the Stordalen Mire in Sweden were downloaded from
PRIDE (PXD000410). We then chose RNA-seq experiments from arctic peat soil samples (SRA
identifier SRP014474) as our simulated ‘paired’ transcriptomic sample [3].

Methods

1: Transcript Assembly

Originally, we attempted to map raw sequencing reads to the reference genome; however, high
levels of variation prevented us from confidently mapping more than 1% of the reads. As a second
pass, we pursued de novo transcript assembly to derive longer transcripts that would map to the
reference genome more confidently.
408 million RNA reads from both Illumina and 454 were considered. Contig assembly was done using
rnaSPAdes [4]. Contigs with more than 500 nucleotides were kept for consideration. 107,131 contigs
were accepted in total.

2: Reference Genome Selection, Variant Calling, Contig Mapping

To generate a reasonable reference genome for our experiment, we searched the literature
associated with the metatranscriptomic dataset. We determined the set of phyla that could be
represented in the metatranscriptomic and metaproteomic datasets and downloaded all available
genomes for those phyla from RefSeq. We downloaded genomes for bacteria, protists, fungi,
archaea, and metazoa.
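The contig length filter from the Transcript Assembly step (keep contigs with more than 500 nucleotides) can be sketched as below. This is a minimal illustration, not the pipeline used in the study: the FASTA parser, the function names, and the toy contig names are all assumptions introduced here.

```python
from typing import Iterable, Iterator, List, Tuple

# Threshold stated in the text: contigs with more than 500 nucleotides are kept.
MIN_CONTIG_LENGTH = 500


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records, min_length: int = MIN_CONTIG_LENGTH) -> List[Tuple[str, str]]:
    """Keep only contigs strictly longer than min_length nucleotides."""
    return [(name, seq) for name, seq in records if len(seq) > min_length]


# Toy input standing in for an assembler's FASTA output (hypothetical contig names).
example = [
    ">contig_1", "A" * 300, "C" * 300,   # 600 nt, kept
    ">contig_2", "G" * 500,              # exactly 500 nt, dropped
    ">contig_3", "T" * 120,              # 120 nt, dropped
]
kept = filter_contigs(read_fasta(example))
print([name for name, _ in kept])  # ['contig_1']
```

The comparison is strict because the text says "more than 500 nucleotides"; a pipeline that keeps contigs of exactly 500 nt would use `>=` instead.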
Contigs were aligned to the genomes using the STAR aligner [5] for long read alignment. The top
27 genomes were selected based on the number of contigs mapped, and the mapped contig
coordinates for these genomes were retained. The concatenated 27 genomes became our
‘reference’ genome. Variant calling was done using mpileup from samtools.

3: Enosi Database Construction

Enosi creates a compact sequence database containing all possible peptides informed
by the genome, variant calls, and splice junctions.
Since the reference genome was unlikely to exactly match the species sampled in the
metatranscriptomic experiments, we found many variants. In total, our database contained
14,451 mutations and 4,996 splice junctions. While we don’t expect splicing to exist in many
of these species, the splice junctions may represent chromosomal changes between the
observed organism and the reference.
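To illustrate how variant calls can add peptides to a sequence database, the sketch below applies single-nucleotide variants to a coding sequence and collects the tryptic peptides that appear only in the variant translations. It is a simplified illustration under stated assumptions and not Enosi's actual construction: it handles only substitutions on the forward strand in a single reading frame, ignores splice junctions, and all function names and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate an in-frame, forward-strand DNA string, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def apply_snv(dna, position, alt):
    """Return a copy of dna with the base at 0-based position replaced by alt."""
    return dna[:position] + alt + dna[position + 1:]


def tryptic_peptides(protein):
    """Cleave after K or R (the proline exception is ignored for brevity)."""
    peptides, current = [], []
    for aa in protein:
        current.append(aa)
        if aa in "KR":
            peptides.append("".join(current))
            current = []
    if current:
        peptides.append("".join(current))
    return peptides


def variant_peptides(dna, snvs):
    """Return (reference peptides, peptides seen only with one SNV applied)."""
    reference = set(tryptic_peptides(translate(dna)))
    database = set(reference)
    for position, alt in snvs:
        database.update(tryptic_peptides(translate(apply_snv(dna, position, alt))))
    return reference, database - reference


# Toy coding sequence: ATG GCT AAA GGT GAA CGT TTC -> MAKGERF
toy_dna = "ATGGCTAAAGGTGAACGTTTC"
reference, novel = variant_peptides(toy_dna, [(10, "A")])  # GGT -> GAT (G to D)
print(sorted(reference))  # ['F', 'GER', 'MAK']
print(sorted(novel))      # ['DER']
```

Only the peptide that overlaps the substitution is added, which is why a database built this way stays compact relative to storing a full protein sequence per variant.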
Class                Accession
Gammaproteobacteria  gi|478476202|ref|NC_020912.1|Pseudomonas aeruginosa B136-33
Alphaproteobacteria  gi|805557611|ref|NZ_LATE01000150.1|Sinorhizobium sp. PC2 B077DRAFT_scf7180000000478_quiver.150_C
Alphaproteobacteria  gi|83591340|ref|NC_007643.1|Rhodospirillum rubrum ATCC 11170 chromosome
Gammaproteobacteria  gi|820953769|ref|NZ_CP010979.1|Pseudomonas putida S13.1.2
Gammaproteobacteria  gi|808032638|ref|NZ_CP011018.1|Escherichia coli strain CI5
Deltaproteobacteria  gi|262193326|ref|NC_013440.1|Haliangium ochraceum DSM 14365
Deltaproteobacteria  gi|389578211|ref|NZ_CM001488.1|Desulfobacter postgatei 2ac9 chromosome
Gammaproteobacteria  gi|507121141|ref|NZ_CM001773.1|Providencia sneebia DSM 19967 chromosome
Alphaproteobacteria  gi|114568554|ref|NC_008347.1|Maricaulis maris MCS10
Gammaproteobacteria  gi|511097871|ref|NZ_CM001871.1|Xanthomonas campestris pv. campestris str. CN14 chromosome
4: Peptide Identification and Aggregation
After database construction, Enosi performs peptide identification using MS-GF+ [6].
From the dataset available on ProteomeXchange, 52,641 MS/MS spectra were
selected for our proof-of-concept study. Spectra were filtered to 5% FDR using the
target-decoy approach.
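A target-decoy filter of the kind mentioned above can be sketched as follows. This is a generic illustration rather than the procedure Enosi or MS-GF+ implements: it assumes each peptide-spectrum match carries a single score where higher is better, estimates FDR as the ratio of decoy to target matches above a score cutoff, and uses hypothetical names throughout.

```python
def filter_by_fdr(psms, max_fdr=0.05):
    """Return target PSMs passing the FDR threshold.

    psms: list of (score, is_decoy) tuples; a higher score is assumed to be better.
    FDR at a given cutoff is estimated as (#decoys) / (#targets) at or above it.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = 0  # number of top-ranked PSMs to accept
    for i, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest rank at which the estimated FDR is still acceptable.
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = i
    return [p for p in ranked[:cutoff] if not p[1]]


# Toy data: 40 high-scoring targets, then a mix of targets and decoys.
toy_psms = [(100 - i, False) for i in range(40)]
toy_psms += [(60, True), (59, False), (58, True), (57, True), (56, False)]

accepted = filter_by_fdr(toy_psms, max_fdr=0.05)
print(len(accepted), min(score for score, _ in accepted))  # 41 59
```

In the toy data the cutoff stops before the third decoy, because accepting it would push the estimate to 3/41, above 5%.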
Clusters of co-located peptides were grouped together, with a maximum distance
between any pair of peptides in a cluster set to 1000 base pairs. If a reference
proteome is provided, we can assign ‘event types’ to the clusters of peptides.
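The grouping of co-located peptides can be sketched with a simple greedy pass. This is one possible reading of the rule above, not Enosi's implementation: it assumes distance is measured between peptide start coordinates on the same sequence, so a cluster is closed once the next peptide starts more than 1000 bp after the cluster's first peptide. The function name, sequence identifiers, and peptides are hypothetical.

```python
from collections import defaultdict

MAX_DISTANCE = 1000  # base pairs, as stated in the text


def cluster_peptides(hits, max_distance=MAX_DISTANCE):
    """Group peptide hits into clusters of co-located peptides.

    hits: iterable of (sequence_id, start, peptide).
    Returns a list of (sequence_id, [(start, peptide), ...]) clusters in which
    no two start coordinates differ by more than max_distance.
    """
    by_sequence = defaultdict(list)
    for seq_id, start, peptide in hits:
        by_sequence[seq_id].append((start, peptide))

    clusters = []
    for seq_id in sorted(by_sequence):
        current, first_start = [], None
        for start, peptide in sorted(by_sequence[seq_id]):
            # Close the cluster if this peptide is too far from its first member.
            if current and start - first_start > max_distance:
                clusters.append((seq_id, current))
                current = []
            if not current:
                first_start = start
            current.append((start, peptide))
        clusters.append((seq_id, current))
    return clusters


# Toy hits on two hypothetical sequences.
toy_hits = [
    ("genome_A", 100, "AAK"),
    ("genome_A", 900, "GGR"),
    ("genome_A", 1500, "LLK"),
    ("genome_B", 50, "VVR"),
]
clusters = cluster_peptides(toy_hits)
for seq_id, members in clusters:
    print(seq_id, members)
```

Anchoring on the first member keeps every pair in a cluster within the limit, which a chain-style rule (comparing only neighbouring peptides) would not guarantee.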
Class                Accession
Deltaproteobacteria  gi|655514693|ref|NZ_KI867150.1|Syntrophorhabdus aromaticivorans UI SynarDRAFT_SAI.2
Alphaproteobacteria  gi|803453125|ref|NZ_JZXD01000120.1|Sinorhizobium meliloti strain L5-30 contig120
Chloroflexi          gi|383760955|ref|NC_017079.1|Caldilinea aerophila, complete genome
Gammaproteobacteria  gi|320539756|ref|NZ_GL636115.1|Serratia symbiotica str. Tucson scaffold00283
Deltaproteobacteria  gi|514340177|ref|NZ_KE150238.1|Bilophila wadsworthia 3_1_6 acCls-supercont2.1
Gammaproteobacteria  gi|690979824|ref|NZ_KK737786.1|Acinetobacter baumannii BIDMC 57 aeebj-supercont1.1
Gammaproteobacteria  gi|757660460|ref|NZ_KK214763.1|Enterobacter sp. BWH 27 adINx-supercont1.1
Alphaproteobacteria  gi|148552929|ref|NC_009511.1|Sphingomonas wittichii RW1
Gammaproteobacteria  gi|757650965|ref|NZ_KI973322.1|Enterobacter sp. MGH 2 adjFw-supercont2.1
Alphaproteobacteria  gi|740145672|ref|NZ_AUNC01000049.1|Thalassospira permensis NBRC 106175 contig53
Gammaproteobacteria  gi|820902542|ref|NZ_CP011254.1|Serratia fonticola strain DSM 4576
Alphaproteobacteria  gi|739702482|ref|NZ_JNFC01000035.1|Sphingopyxis sp. LC363 contig40
Alphaproteobacteria  gi|150395228|ref|NC_009636.1|Sinorhizobium medicae WSM419 chromosome
Gammaproteobacteria  gi|820948705|ref|NZ_JYFZ02000001.1|Citrobacter freundii strain MRSN 12115 scaffold00001
Alphaproteobacteria  gi|482803487|ref|NZ_AJPW01000118.1|Bradyrhizobium elkanii CCBAU 43297 Scaffold1.contig123
Gammaproteobacteria  gi|371486894|ref|NC_016147.2|Pseudoxanthomonas spadix BD-a59
Amoebozoa            gi|183236918|ref|NW_001915840.1|Entamoeba histolytica
Conclusion

•  Similar, but unmatched, metaproteogenomic samples exist across the many available databases (NCBI, SRA, ProteomeXchange).
•  The Enosi database construction method effectively condenses sequence information from many RNA-Seq runs. This paves the way to creating
organism- or disease-specific databases from large repositories.
•  Expanding the peptide identification to allow point mutations could potentially boost the identification rate.

Enosi is a proteogenomic toolkit provided by Digital Proteomics LLC.
•  Enosi accepts mass spectra from all major vendor instruments, both high accuracy and low accuracy, and in common file formats including mzML,
mzXML, and mgf.

Figure caption (reference genome selection): We selected 27 bacterial and eukaryotic genomes that share similarity to the transcriptome data.
The combined genome was used as the ‘reference’.
References

1.  Castellana, et al. (2013). An automated proteogenomic method uses mass spectrometry to reveal novel genes in Zea mays. Mol. Cell. Proteomics, 13, 1:157-67.
2.  Woo, et al. (2013). Proteogenomic database construction driven from large scale RNA-seq data. J. Proteome Res., 13, 1:21-8.
3.  Tveit, et al. (2013).
4.  Bankevich, et al. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol., 19, 5:455-77.
5.  Dobin, et al. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29, 1:15-21.
6.  Kim, et al. (2010). The generating function of CID, ETD, and CID/ETD pairs of tandem mass spectra: applications to database search. Mol. Cell. Proteomics, 9:2840-52.