Enosi
A proteogenomic toolkit
Motivation
Proteomic identifications rely on a comprehensive database against which mass spectra are searched. The
available proteomes, even for model organisms, are often incomplete [3]. Furthermore, reference proteomes
cannot contain sequences that arise somatically and drive disease. RNA-seq experiments can be used to capture
the transcripts that give rise to peptides from unannotated protein isoforms or somatic events. Enosi allows
efficient searching of a transcript-derived database together with a six-frame translation database, identifying
peptides that would otherwise go unnoticed. Enosi's proteogenomic engine has been used in diverse scenarios to
identify novel genes, refine genome annotations, and characterize cancer proteomes [3, 4, 8, 9]. Proteogenomic
workflows, however, present challenges in scale and sensitivity that traditional proteomic workflows do not face.
Here we present best practices employed by Enosi to perform proteogenomic analysis.
The proteogenomic database
The key to any proteogenomic database is to maximize completeness while minimizing size. A large database
results in increased search times and reduced sensitivity. While the search time can be addressed by
distributing the search across CPUs, the reduced sensitivity is harder to remedy, because proteogenomic
projects often seek rare events.
The six-frame translation is a popular proteogenomic database that encodes all possible exon sequences. Its
drawbacks are significant: it is enormous, consists predominantly of non-coding sequence, and contains neither
splice junctions nor genomic variants. These drawbacks may be acceptable in certain scenarios, such as
proteogenomic projects in bacteria or projects where no reference proteome exists. For humans and other
well-characterized species, the six-frame translation is likely overkill.
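To make the size problem concrete, the following is a minimal sketch of a six-frame translation in Python. It is an illustration only and not Enosi's implementation: the function names are hypothetical, the standard genetic code is assumed, and codons containing unknown bases (for example N from hard masking) are translated to X.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate both strands at all three offsets (hypothetical helper)."""
    dna = dna.upper()  # soft-masked (lower-case) bases are treated as normal
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translation("ATGGCCTAA").items():
        print(name, protein)
```

Every nucleotide contributes to six candidate protein sequences regardless of whether it is coding, which is why this database grows so quickly with genome size.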
Alternative databases that are informed by RNA-seq experiments have increased in popularity [8, 9].
RNA-seq narrows the focus of the proteogenomic effort to transcribed regions (as opposed to the entire genome).
In addition, one can capture non-reference splicing and genomic variation. In order to construct a proteogenomic
database that is as comprehensive as possible, the RNA-seq experiments should cover a wide range of experimental
conditions.
Enosi constructs a database from RNA-seq that contains all insertions, deletions, mutations, and splice
junctions detected in the RNA-seq data provided. To control database size, Enosi retains only sequence that
diverges from the reference genome. We present here a sample proteogenomic application of Enosi using public
data collected from patients suffering from acute myeloid leukemia (AML).
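The idea of keeping only diverged sequence can be sketched for the simplest case, a single-base substitution. This is a hedged illustration and not Enosi's actual algorithm: the function `variant_window`, its `flank` parameter, and the assumption that the input is an in-frame coding sequence whose length is a multiple of three are all hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )


def variant_window(ref_cds, pos, alt_base, flank=10):
    """Return the protein window around a substitution, or None.

    ref_cds  : in-frame reference coding sequence (assumed length % 3 == 0)
    pos      : 0-based position of the substituted base
    alt_base : base observed in the RNA-seq data
    flank    : residues kept on each side of the changed amino acid

    Synonymous substitutions return None, so nothing that matches the
    reference protein is added to the database.
    """
    alt_cds = ref_cds[:pos] + alt_base + ref_cds[pos + 1:]
    ref_protein = translate(ref_cds)
    alt_protein = translate(alt_cds)
    residue = pos // 3
    if ref_protein[residue] == alt_protein[residue]:
        return None
    start = max(0, residue - flank)
    return alt_protein[start:residue + flank + 1]


if __name__ == "__main__":
    # GCC (Ala) -> GTC (Val) at the second codon
    print(variant_window("ATGGCCAAAGGG", pos=4, alt_base="T", flank=1))
    # GCC (Ala) -> GCT (Ala): synonymous, nothing retained
    print(variant_window("ATGGCCAAAGGG", pos=5, alt_base="T", flank=1))
```

Because only a short window around each change is stored, the database grows with the number of detected variants rather than with the number of mapped reads.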
Data snapshot
• Genome: hg38, repeat-masked, including all chromosomes and scaffolds
• RNA-seq: 44M mapped reads taken from SRR1918758 (bone marrow of AML patients) [6]
• Proteomics: 124,388 spectra taken from PRIDE PXD003822 [1]
Proteogenomic database comparison
We begin by comparing the relative sizes of the potential proteogenomic databases. As a baseline, we use the
RefSeqDB containing isoform information. In Table 1, we see a marked increase in size from the RefSeqDB to the
EnosiDB. While almost 6x larger than the RefSeqDB, the EnosiDB is composed almost entirely of non-reference
sequence derived from RNA-seq data. In contrast, the six-frame translation is 42x larger than the RefSeqDB. A
naive translation of only the mapped RNA-seq reads (RNA-seq translation) shows an even more dramatic expansion
of the candidate sequence, at roughly 431x the RefSeqDB. The EnosiDB and the RNA-seq translation are derived
from the exact same set of RNA-seq reads.
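The relative sizes follow directly from the amino-acid counts reported in Table 1. The short Python sketch below reproduces that arithmetic; the dictionary and function names are illustrative only.

```python
# Amino-acid counts as reported in Table 1.
SIZES_AA = {
    "RefSeqDB": 23_011_601,
    "EnosiDB": 134_894_247,
    "6-frame translation": 973_424_342,
    "RNA-seq translation": 9_923_456_274,
}


def relative_sizes(sizes, baseline="RefSeqDB"):
    """Divide every database size by the baseline size, to two decimals."""
    base = sizes[baseline]
    return {name: round(count / base, 2) for name, count in sizes.items()}


if __name__ == "__main__":
    for name, ratio in relative_sizes(SIZES_AA).items():
        print(f"{name:22s} {ratio:8.2f}")
```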
How do we interpret the results?
In a standard proteomic workflow, a protein list can be inferred by mapping peptides to proteins in the reference
proteome. When identifying non-reference peptides, a new nomenclature and method of inference is needed.
DB                     Size (MB)    Size (AA)        Relative
RefSeqDB                      23       23 011 601        1.00
EnosiDB                      937      134 894 247        5.86
6-frame translation        4 403      973 424 342       42.30
RNA-seq translation       17 717    9 923 456 274      431.24
Table 1: Database sizes, both absolute and relative to RefSeqDB.
Figure 1: Examples of event types
Enosi uses the notion of events to group together co-located peptides that modify the reference genome or
genome annotation. Figure 1 shows a subset of the event types that Enosi identifies. Enosi also includes rules
for inferring events, analogous to rules for inferring proteins, such as a minimum number of peptides per event
or probabilistic scoring based on peptide identification quality.
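A minimal sketch of event inference with a minimum-peptide rule is given below. It is an assumption-laden illustration, not Enosi's implementation: the `PeptideHit` record, the `locus` coordinate, the `max_gap` threshold used to decide co-location, and the function `infer_events` are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideHit:
    sequence: str    # non-reference peptide sequence
    protein: str     # reference protein the peptide is located against
    event_type: str  # e.g. "frameshift", "mutation", "novel splice"
    locus: int       # hypothetical coordinate used to decide co-location


def infer_events(hits, min_peptides=1, max_gap=100):
    """Group co-located peptides into events and apply a minimum-peptide rule.

    Peptides on the same protein with the same event type are clustered when
    consecutive loci lie within max_gap of each other. An event is reported
    only if it contains at least min_peptides distinct peptide sequences.
    """
    groups = defaultdict(list)
    for hit in hits:
        groups[(hit.protein, hit.event_type)].append(hit)

    events = []
    for (protein, event_type), members in groups.items():
        members.sort(key=lambda h: h.locus)
        cluster = [members[0]]
        for hit in members[1:]:
            if hit.locus - cluster[-1].locus <= max_gap:
                cluster.append(hit)
            else:
                events.append((protein, event_type, cluster))
                cluster = [hit]
        events.append((protein, event_type, cluster))

    return [
        event for event in events
        if len({h.sequence for h in event[2]}) >= min_peptides
    ]


if __name__ == "__main__":
    # Made-up hits, for illustration only.
    hits = [
        PeptideHit("AAAK", "P1", "frameshift", 10),
        PeptideHit("CCCK", "P1", "frameshift", 50),
        PeptideHit("DDDK", "P1", "frameshift", 1000),
        PeptideHit("EEEK", "P2", "mutation", 5),
    ]
    for protein, event_type, cluster in infer_events(hits, min_peptides=2):
        print(protein, event_type, [h.sequence for h in cluster])
```

Raising `min_peptides` trades sensitivity for confidence, in the same way that requiring two peptides per protein does in conventional protein inference.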
A real world example
Among 77 distinct non-reference (i.e., novel) peptide sequences, we identify 107 events. An event is defined by
a collection of peptides and a reference protein. For this reason, the same peptide can contribute to multiple
events (e.g., when a gene has multiple isoforms, we cannot determine which isoform is being altered). Each
event contains at least one uniquely located peptide, but is otherwise unfiltered. We identify a novel
frameshift in NPM1, a gene known to be related to AML. The event is supported by one non-reference
peptide sequence identified by two spectra. Interestingly, the reference splice site is also identified in the sample.
While an AML-associated frameshift in exon 12 of NPM1 has been previously described in the literature [2,
5, 7], our frameshift occurs in exon 3. Figure 2 shows the peptide evidence for NPM1.
Figure 2: Peptide evidence for novel frameshift and known splice junction. (a) Peptides spanning exon boundary. (b) Zoomed to frameshift.
References
[1] E. Aasebø, O. Mjaavatten, M. Vaudel, Y. Farag, F. Selheim, F. Berven, Ø. Bruserud, and M. Hernandez-Valladares.
Freezing effects on the acute myeloid leukemia cell proteome and phosphoproteome revealed using
optimal quantitative workflows. Journal of Proteomics, 2016.
[2] M. T. Andersen, M. K. Andersen, D. Christiansen, and J. Pedersen-Bjergaard. NPM1 mutations in
therapy-related acute myeloid leukemia with uncharacteristic features. Leukemia, 22(5):951–955, 2008.
[3] N. E. Castellana, S. H. Payne, Z. Shen, M. Stanke, V. Bafna, and S. P. Briggs. Discovery and revision of
Arabidopsis genes by proteogenomics. Proceedings of the National Academy of Sciences, 105(52):21034–21038,
2008.
[4] N. E. Castellana, Z. Shen, Y. He, J. W. Walley, S. P. Briggs, V. Bafna, et al. An automated proteogenomic
method uses mass spectrometry to reveal novel genes in Zea mays. Molecular & Cellular Proteomics,
13(1):157–167, 2014.
[5] W. Chen, G. Z. Rassidakis, and L. J. Medeiros. Nucleophosmin gene mutations in acute myeloid leukemia.
Archives of Pathology & Laboratory Medicine, 130(11):1687–1692, 2006.
[6] V.-P. Lavallée, I. Baccelli, J. Krosl, B. Wilhelm, F. Barabé, P. Gendron, G. Boucher, S. Lemieux, A. Marinier,
S. Meloche, et al. The transcriptomic landscape and directed chemical interrogation of MLL-rearranged acute
myeloid leukemias. Nature Genetics, 2015.
[7] F. Pastore, P. A. Greif, S. Schneider, B. Ksienzyk, G. Mellert, E. Zellmeier, J. Braess, C. M. Sauerland,
A. Heinecke, U. Krug, et al. The NPM1 mutation type has no impact on survival in cytogenetically normal
AML. PLoS ONE, 9(10):e109759, 2014.
[8] S. Woo, S. W. Cha, S. Bonissone, S. Na, D. L. Tabb, P. A. Pevzner, and V. Bafna. Advanced proteogenomic
analysis reveals multiple peptide mutations and complex immunoglobulin peptides in colon cancer. Journal
of Proteome Research, 14(9):3555–3567, 2015.
[9] S. Woo, S. W. Cha, G. Merrihew, Y. He, N. Castellana, C. Guest, M. MacCoss, and V. Bafna. Proteogenomic
database construction driven from large scale RNA-seq data. Journal of Proteome Research, 13(1):21–28, 2013.