Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Harnessing public data repositories for metaproteomics using Enosi Natalie E Castellana*, Andrey D. Prjibelski+, Dmitry An:pov+ *Digital Proteomic, LLC, San Diego, CA, +Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia Mo#va#on and Background Metaproteomics is the next frontier in proteomics research, yet bioinformatic tools and resources remain scarce. Lack of fully sequenced genomes or proteomes create an enormous challenge. Often metaproteomics experiments are accompanied by expensive and time-consuming metagenomic and metatranscriptomic experiments that may only incompletely represent the repertoire of proteins in the sample. Alternatively, de novo sequencing of peptides can be used, however, this approach does not report the context of the peptides in a gene. In this study we examine the opportunity for re-use of public data to replace the need for coupled, large-scale metagenomic studies with metaproteomics. Methods 1: Transcript Assembly 3: Enosi Database Construction Originally, we attempted to map raw sequencing reads to the reference genome, however, high levels of variation prevented us from confidently mapping more than 1% of the reads. As a second pass, we pursued de novo transcript assembly to derive longer transcripts that would result in more confident mapping to the reference genome. 408 million RNA reads from both Illumina and 454 were considered. Contig assembly was done using RNAspades4. Contigs with more than 500 nucleotides were kept for consideration. 107,131 contigs were accepted in total. All datasets used in the study are publicly available. The driving questions behind our proof-ofconcept study were the following: 2: Reference Genome Selection, Variant Calling, Contig Mapping Q: Can current proteogenomic technology1,2 be adapted for metaproteogenomics? To generate a reasonable reference genome for our experiment, we searched the lituerature associated with the metatranscriptomic dataset. We determined the set of phyla that could be represented in the metatranscriptomic and metaproteomic datasets and downloaded all possible genomes from RefSeq. We downloaded genomes for bacteria, protists, fungi, archaea, and metazoa. Q: Are matched metagenomic datasets required for metaproteomics? Q: Can similar (but unmatched) metatranscriptomic data fill the gap between the sequenced, cultured organism and the sampled organism? Data Mass spectra from soil samples collected at the Stordalen Mire in Sweden were downloaded from PRIDE (PXD000410). We then chose RNA-seq experiments from arctic peat soil samples (SRA identifier SRP014474) as our simulated ‘paired’ transcriptomic sample 3 Proteome Transcriptome Enosi creates a compact sequence database containing all possible peptides informed by the genome, variant calls, and splice junctions. Contigs were aligned to the genomes using STAR aligner5 for long read alignment. The top 27 genomes were selected based on number of contigs mapped. The mapped contig coordinates for these genomes were retained. The concatenated 27 genomes became our ‘reference’ genome. Variant calling was done using mpileup from samtools. Since the reference genome was not likely to exactly match the species sampled in the metatranscriptomic experiments, we found many variants. In total, our database contained 14,451 mutations and 4,996 splice junctions. While we don’t expect splicing to exist in many of these species, the splice junctions may represent chromosomal changes between the observed organism and the reference. Accession Class gi|478476202|ref|NC_020912.1|Pseudomonas aeruginosa B136-33 gi|805557611|ref|NZ_LATE01000150.1|Sinorhizobium sp. PC2 B077DRAFT_scf7180000000478_quiver.150_C Gammaproteobacteria gi|83591340|ref|NC_007643.1|Rhodospirillum rubrum ATCC 11170 chromosome Alphaproteobacteria gi|820953769|ref|NZ_CP010979.1|Pseudomonas putida S13.1.2 Gammaproteobacteria gi|808032638|ref|NZ_CP011018.1|Escherichia coli strain CI5 Gammaproteobacteria gi|262193326|ref|NC_013440.1|Haliangium ochraceum DSM 14365 Deltaproteobacteria gi|389578211|ref|NZ_CM001488.1|Desulfobacter postgatei 2ac9 chromosome Deltaproteobacteria gi|507121141|ref|NZ_CM001773.1|Providencia sneebia DSM 19967 chromosome Gammaproteobacteria gi|114568554|ref|NC_008347.1|Maricaulis maris MCS10 gi|511097871|ref|NZ_CM001871.1|Xanthomonas campestris pv. campestris str. CN14 chromosome Alphaproteobacteria Alphaproteobacteria Gammaproteobacteria 4: Peptide Identification and Aggregation After database construction, Enosi performs peptide identification using MSGF+6. From the dataset available on ProteomeXchange, 52,641 MS/MS spectra were selected for our proof-of-concept study. Spectra were filtered to 5% FDR using the target-decoy approach. Clusters of co-located peptides were grouped together, with a maximum distance between any pair of peptides in a cluster set to 1000 base pairs. If a reference proteome is provided, we can assign ‘event types’ to the clusters of peptides. gi|655514693|ref|NZ_KI867150.1|Syntrophorhabdus aromaticivorans UI SynarDRAFT_SAI.2 Deltaproteobacteria gi|803453125|ref|NZ_JZXD01000120.1|Sinorhizobium meliloti strain L5-30 contig120 Alphaproteobacteria gi|383760955|ref|NC_017079.1|Caldilinea aerophila, complete genome Chloroflexi gi|320539756|ref|NZ_GL636115.1|Serratia symbiotica str. Tucson scaffold00283 Gammaproteobacteria gi|514340177|ref|NZ_KE150238.1|Bilophila wadsworthia 3_1_6 acCls-supercont2.1 Deltaproteobacteria gi|690979824|ref|NZ_KK737786.1|Acinetobacter baumanii BIDMC 57 aeebj-supercont1.1 Gammaproteobacteria gi|757660460|ref|NZ_KK214763.1|Enterobacter sp. BWH 27 adINx-supercont1.1 Gammaproteobacteria gi|148552929|ref|NC_009511.1|Sphingomonas wittichii RW1 Alphaproteobacteria gi|757650965|ref|NZ_KI973322.1|Enterobacter sp. MGH 2 adjFw-supercont2.1 Gammaproteobacteria gi|740145672|ref|NZ_AUNC01000049.1|Thalassospira permensis NBRC 106175 contig53 Alphaproteobacteria gi|820902542|ref|NZ_CP011254.1|Serratia fonticola strain DSM 4576 Gammaproteobacteria gi|739702482|ref|NZ_JNFC01000035.1|Sphingopyxis sp. LC363 contig40 Alphaproteobacteria gi|150395228|ref|NC_009636.1|Sinorhizobium medicae WSM419 chromosome gi|820948705|ref|NZ_JYFZ02000001.1|Citrobacter freundii strain MRSN 12115 scaffold00001 gi|482803487|ref|NZ_AJPW01000118.1|Bradyrhizobium elkanii CCBAU 43297 Scaffold1.contig123 Alphaproteobacteria gi|371486894|ref|NC_016147.2|Pseudoxanthomonas spadix BD-a59 Gammaproteobacteria gi|183236918|ref|NW_001915840.1|Entamoeba histolytica Amoebozoa Gammaproteobacteria Alphaproteobacteria Conclusion Enosi is a proteogenomic toolkit provided by Digital Proteomics LLC. • Similar, but un-matched, meta-proteogenomic samples exist across the many available databases (NCBI, SRA, ProteomeXchange) • The Enosi database construction method effectively condenses sequence information from many RNA-Seq runs. This paves the way to creating organism or disease specific databases from large repositories. We selected 27 bacterial and eukaryotic genomes that share similarity to the transcriptome data. The combined genome was used as the ‘reference’. • Expanding the peptide identification to allow point mutations could potentially boost the identification rate. Genome • Enosi accepts mass spectra from all major vendor instruments, both high accuracy and low accuracy, and in common file formats including mzML, mzXML, and mgf. 1. Castellana, et al. (2013). An automated proteogenomic method uses mass spectrometry to reveal novel genes in Zea mays. Mol. Cell Proteomics, 13, 1:157-67.. 2. Woo, et al. (2013). Proteogenomic database construction driven from large scale RNA-seq data. J. Proteome Res., 13, 1:21-8. 3. Tveit, et al. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29, 1:15-21. 4. Bankevich, et al. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol., 19, 5:455-77. 5. Dobin, et al. (2013). Bioinformatics 6. Kim, et al. (2010). The generating function of CID, ETD, and CID/ ETD pairs of tandem mass spectra: applications to database search. Mol. Cell. Proteomics, 9:2840-52.