Enosi: A proteogenomic toolkit

Motivation

Proteomic identification relies on a comprehensive sequence database against which mass spectra are searched. The available proteomes, even for model organisms, are often incomplete [3]. Furthermore, these reference proteomes can never contain sequences that arise somatically and drive disease. RNA-seq experiments can be used to capture peptides arising from unannotated protein isoforms or somatic events. Enosi allows for the efficient searching of a transcript-derived database together with a six-frame translation database, identifying peptides that would otherwise go unnoticed. Enosi's proteogenomic engine has been used in many diverse scenarios to identify novel genes, refine genome annotations, and characterize cancer proteomes [3, 9, 4, 8]. Proteogenomic workflows, however, present challenges in scale and sensitivity that do not arise in traditional proteomic workflows. Here we present best practices employed by Enosi to perform proteogenomic analysis.

The proteogenomic database

The key to any proteogenomic database is to maximize completeness while minimizing size. A large database results in increased search times and reduced sensitivity, because a larger search space produces more high-scoring random matches and therefore requires stricter score thresholds to hold the false discovery rate fixed. While the search time issue can be addressed by distributing the search across CPUs, the reduced sensitivity is problematic: in proteogenomic projects, we are often seeking rare events.

The six-frame translation is a popular proteogenomic database that encodes all possible exon sequences. Its drawbacks are significant: it is enormous, consists predominantly of non-coding sequence, and contains neither splicing nor genomic variants. These drawbacks may be overlooked in certain scenarios, such as proteogenomic projects in bacteria or projects where no reference proteome exists. For humans and other well-characterized species, the six-frame translation is likely overkill. Alternative databases that are informed by RNA-seq experiments have increased in popularity [9, 8].
RNA-seq narrows the focus of the proteogenomic effort to transcribed regions (as opposed to the entire genome). In addition, one can capture non-reference splicing and genomic variation. In order to construct a proteogenomic database that is as comprehensive as possible, the RNA-seq experiments should cover a wide range of experimental conditions. Enosi constructs a database from RNA-seq that contains all insertions, deletions, mutations, and splice junctions detected in the RNA-seq data provided. To control database size, Enosi retains only sequence that diverges from the reference genome.

We present here a sample proteogenomic application of Enosi using public data collected from patients suffering from acute myeloid leukemia (AML).

Data snapshot

• Genome: hg38, repeat-masked, including all chromosomes and scaffolds
• RNA-seq: 44M mapped reads taken from SRR1918758 (bone marrow of AML patients) [6]
• Proteomics: 124,388 spectra taken from PRIDE PXD003822 [1]

Proteogenomic database comparison

We begin by comparing the relative sizes of the potential proteogenomic databases. As a baseline, we use the RefSeqDB containing isoform information. In Table 1, we see a dramatic increase in size from the RefSeqDB to the EnosiDB. While almost 6x bigger than the RefSeqDB, the EnosiDB is composed almost entirely of non-reference sequence derived from RNA-seq data. In contrast, the six-frame translation is 42x larger than the RefSeqDB. A naive translation of only the mapped RNA-seq reads (RNA-seq translation) shows an even more dramatic expansion of the candidate sequence, even though the EnosiDB and the RNA-seq translation are derived from the exact same set of RNA-seq reads.

How do we interpret the results?

In a standard proteomic workflow, a protein list can be inferred by mapping peptides to proteins in the reference proteome. When identifying non-reference peptides, a new nomenclature and method of inference are needed.
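The standard inference step described above can be sketched as follows. This is a naive illustration using exact substring matching, not Enosi's method; the function name map_peptides and the toy sequences are hypothetical. Peptides that match no reference protein are the ones for which a different kind of inference is needed.

```python
def map_peptides(peptides, reference):
    """Split peptides into those found in the reference proteome and the rest.

    peptides:  iterable of peptide strings
    reference: dict mapping protein name -> protein sequence
    Returns (mapped, novel), where mapped is a dict from peptide to the list
    of reference proteins containing it, and novel is a list of peptides
    with no exact match in any reference protein.
    """
    mapped = {}
    novel = []
    for peptide in peptides:
        hits = [name for name, seq in reference.items() if peptide in seq]
        if hits:
            mapped[peptide] = hits
        else:
            novel.append(peptide)
    return mapped, novel


# Toy data, invented for illustration only.
reference = {"P1": "MKTAYIAKQR", "P2": "MKTAYLLK"}
peptides = ["TAYIAK", "MKTAY", "TAYVAK"]

mapped, novel = map_peptides(peptides, reference)

# Inferred protein list: every reference protein with at least one peptide.
inferred = sorted({name for hits in mapped.values() for name in hits})

print(mapped)
print(novel)
print(inferred)
```

In this toy case the peptide MKTAY is shared between two proteins, which is the familiar ambiguity of protein inference, while TAYVAK has no reference protein to be assigned to at all.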
DB           RefSeqDB      EnosiDB        6-frame translation   RNA-seq translation
Size (MB)    23            937            4,403                 17,717
Size (AA)    23,011,601    134,894,247    973,424,342           9,923,456,274
Relative     1.00          5.86           42.30                 431.24

Table 1: Database sizes, both absolute and relative to RefSeqDB.

Figure 1: Examples of event types

Enosi uses the notion of events to group together co-located peptides that modify the reference genome or genome annotation. Figure 1 shows a subset of the event types that Enosi identifies. Enosi also includes rules for inferring events that are analogous to rules for inferring proteins, such as a minimum number of peptides per event or probabilistic scoring based on peptide identification quality.

A real-world example

Among 77 distinct non-reference (i.e., novel) peptide sequences, we identify 107 events. An event is defined by a collection of peptides and a reference protein. For this reason, the same peptide can contribute to multiple events (e.g., when a gene has multiple isoforms, we cannot determine which isoform is being altered). Each event contains at least one uniquely located peptide, but is otherwise unfiltered.

We identify a novel frameshift in NPM1, a gene known to be related to AML. The event is supported by one non-reference peptide sequence identified by two spectra. Interestingly, the reference splice site is also identified in the sample. While an AML-associated frameshift in exon 12 of NPM1 has been previously described in the literature [2, 5, 7], our frameshift occurs in exon 3. Figure 2 shows the peptide evidence for NPM1.

Figure 2: Peptide evidence for novel frameshift and known splice junction. (a) Peptides spanning exon boundary. (b) Zoomed to frameshift.

References

[1] E. Aasebø, O. Mjaavatten, M. Vaudel, Y. Farag, F. Selheim, F. Berven, Ø. Bruserud, and M. Hernandez-Valladares. Freezing effects on the acute myeloid leukemia cell proteome and phosphoproteome revealed using optimal quantitative workflows. Journal of Proteomics, 2016.
[2] M. T. Andersen, M. K. Andersen, D. Christiansen, and J. Pedersen-Bjergaard. NPM1 mutations in therapy-related acute myeloid leukemia with uncharacteristic features. Leukemia, 22(5):951–955, 2008.
[3] N. E. Castellana, S. H. Payne, Z. Shen, M. Stanke, V. Bafna, and S. P. Briggs. Discovery and revision of Arabidopsis genes by proteogenomics. Proceedings of the National Academy of Sciences, 105(52):21034–21038, 2008.
[4] N. E. Castellana, Z. Shen, Y. He, J. W. Walley, S. P. Briggs, V. Bafna, et al. An automated proteogenomic method uses mass spectrometry to reveal novel genes in Zea mays. Molecular & Cellular Proteomics, 13(1):157–167, 2014.
[5] W. Chen, G. Z. Rassidakis, and L. J. Medeiros. Nucleophosmin gene mutations in acute myeloid leukemia. Archives of Pathology & Laboratory Medicine, 130(11):1687–1692, 2006.
[6] V.-P. Lavallée, I. Baccelli, J. Krosl, B. Wilhelm, F. Barabé, P. Gendron, G. Boucher, S. Lemieux, A. Marinier, S. Meloche, et al. The transcriptomic landscape and directed chemical interrogation of MLL-rearranged acute myeloid leukemias. Nature Genetics, 2015.
[7] F. Pastore, P. A. Greif, S. Schneider, B. Ksienzyk, G. Mellert, E. Zellmeier, J. Braess, C. M. Sauerland, A. Heinecke, U. Krug, et al. The NPM1 mutation type has no impact on survival in cytogenetically normal AML. PLoS ONE, 9(10):e109759, 2014.
[8] S. Woo, S. W. Cha, S. Bonissone, S. Na, D. L. Tabb, P. A. Pevzner, and V. Bafna. Advanced proteogenomic analysis reveals multiple peptide mutations and complex immunoglobulin peptides in colon cancer. Journal of Proteome Research, 14(9):3555–3567, 2015.
[9] S. Woo, S. W. Cha, G. Merrihew, Y. He, N. Castellana, C. Guest, M. MacCoss, and V. Bafna. Proteogenomic database construction driven from large scale RNA-seq data. Journal of Proteome Research, 13(1):21–28, 2013.
