Novel Peptide Identification using ESTs and Genomic Sequence
Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park
Novel Peptides
• Absent from traditional protein sequence databases
• IPI, SwissProt, TrEMBL, NCBI’s nr, MSDB
• Due to
• Deliberate “redundancy” elimination
• “Dark-side” genes
• Bias towards high-quality, high-confidence full-length protein sequence
What is missing?
• Known coding SNPs
• Novel coding mutations
• Alternative splicing isoforms
• Alternative translation start-sites
• Microexons
• Alternative translation frames
Why should we care?
• Alternative splicing is the norm!
• Only 20-25K human genes
• Each gene makes many proteins
• Proteins have clinical implications
• Biomarker discovery
• Evidence for SNPs and alternative splicing stops with transcription
• Genomic assays, ESTs, mRNA sequence.
• No hard evidence for translation start site
Novel Protein
HEQASNVLSDISEFR
Evidence:
• log10(E-value) = -9.6
• 100’s of ESTs
• Full length mRNA sequence
Details:
• Peptide Atlas A8_IP (Resing et al.);
Novel Splice Isoform
LQGSATAAEAQVGHQTAR
Evidence:
• log10(E-value) = -6.8
• 10’s of ESTs
• Full length mRNA sequence
Details:
• Peptide Atlas raftflow (von Haller, et al.);
• LIME1 gene
Novel Frame
TAGSPLCLPTPGAAPGSAGSCSHR
Evidence:
• log10(E-value) = -3.9
• 10’s of ESTs
• Full length mRNA sequence
Details:
• Peptide Atlas raftflow (von Haller, et al.);
• LIME1 gene, downstream from LQGSA...
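The novel-frame example rests on reading one transcript in more than one reading frame. A minimal sketch of three-frame forward translation, assuming the standard genetic code; the DNA string is a made-up toy sequence (not the LIME1 transcript) and `translate` is a hypothetical helper name:

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna, frame=0):
    """Translate dna from offset frame (0, 1 or 2); '*' marks a stop codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Codons with ambiguous bases (e.g. N) are reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

toy_dna = "ATGGCCTAGCTTGA"  # toy sequence, for illustration only
for frame in range(3):
    print(frame, translate(toy_dna, frame))
```

Each frame gives an unrelated amino-acid string from the same nucleotides, which is why a peptide from an alternative frame cannot be found by searching only the annotated protein.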
“Novel” Microexon
LQTASDESYKDPTNIQLSK
Evidence:
• log10(E-value) = -6.4
• 10’s of ESTs / mRNA sequences
• SwissProt variant, absent from IPI
Details:
• Peptide Atlas raftflow (von Haller, et al.);
• SPTAN1 gene
Novel Mutation
KADDTWEPFASGK
Evidence:
• log10(E-value) = -7.6
• 2 ESTs from same clone library
• Ala2 Deletion
Details:
• HUPO PPP 29_b1-EDTA_1 (Qian/He; Omenn et al.);
• TTR gene
• Known Mutation: Ala2-to-Pro associated with familial amyloidotic polyneuropathy.
Known Coding SNP
DTEEEDFHVDQ[V|A]TTVK
Evidence:
• log10(E-value) = -9.5 / -9.4
• Known dbSNP (coding): Val12-to-Ala
• Wildtype also observed
Details:
• HUPO PPP 40 (Wang; Omenn et al.);
• SERPINA1 gene
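The `[V|A]` notation packs two peptide forms into one string. A small sketch of how it could be expanded, assuming a single bracketed site per peptide; `expand_variant` is a hypothetical helper, and treating the first letter as the wildtype residue is an assumption based on the "Val12-to-Ala" wording:

```python
import re

VARIANT = re.compile(r"\[([A-Z])\|([A-Z])\]")

def expand_variant(peptide):
    """Return (first_form, second_form, position) for one [X|Y] site.

    position is the 1-based index of the variable residue within the peptide.
    """
    match = VARIANT.search(peptide)
    if match is None:
        raise ValueError("no [X|Y] site found")
    prefix, suffix = peptide[:match.start()], peptide[match.end():]
    first, second = match.groups()
    return prefix + first + suffix, prefix + second + suffix, len(prefix) + 1

print(expand_variant("DTEEEDFHVDQ[V|A]TTVK"))
print(expand_variant("LQHL[E|V]NELTHDIITK"))
```

The positions returned for the two SERPINA1 peptides in this deck are 12 and 5, which agree with the "Val12" and "Glu5" labels if those are read as positions within the peptide.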
Known Coding SNP
LQHL[E|V]NELTHDIITK
Evidence:
• log10(E-value) = -6.7/-10.9
• 4 ESTs, same clone library
• Known dbSNP (coding): Glu5-to-Val
• Wildtype also observed
Details:
• HUPO PPP 28_b2-CIT (Pounds/Adkins/Rodland/Anderson; Omenn et al.);
• SERPINA1 gene
IPI Common Variant Elimination
YYGGGYGSTQATFMVFQALAQYQK
Evidence:
• log10(E-value) = -5.9
• 100’s ESTs, mRNA sequence
• IPI has (rare) variant (Insertion of AS@10)
• Differ in 5’ splice site.
Details:
• HUPO PPP 29 (Qian/He; Omenn et al.);
• C3 gene
Why don’t we see more novel peptides?
• Tandem mass spectrometry doesn’t discriminate against novel peptides...
...but protein sequence databases do!
• Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!
Why don’t we see more novel peptides?
• Traditional protein sequence databases
• High-quality, full-length proteins only
• Many interesting peptides are omitted
• Exclusive – peptide identifications are lost.
• ESTs, genomic & mRNA sequence
• Used as evidence for full-length protein sequences
• Inclusive – may need to filter results
Significant False Positives
• E-values are not enough!
• Random guessers are easy to beat.
• Post-translational modifications vs. amino-acid substitution
• methylation (on I/L, Q, R, C, H, K, S, T, N): +14
• D → E, G → A, V → I/L, N → Q, S → T: +14
• Peptide extension z=+2 → z=+3
• Nonsense AA masses sum to precursor
• Need to ensure:
• fragment ions define novel sequence
• sequence evidence is strong
• other plausible explanations can be eliminated
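The "+14" ambiguity can be checked with simple arithmetic. The sketch below uses standard monoisotopic residue masses (rounded to five decimals) and compares each substitution listed on this slide with the mass added by methylation (one CH2); the names `RESIDUE_MASS` and `mass_shift` are illustrative only:

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "T": 101.04768,
    "V": 99.06841, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "E": 129.04259,
}
METHYLATION = 14.01565  # mass of an added CH2 group, monoisotopic

# Substitutions named on the slide, as (original, replacement) pairs.
SUBSTITUTIONS = [("D", "E"), ("G", "A"), ("V", "I"), ("V", "L"), ("N", "Q"), ("S", "T")]

def mass_shift(old, new):
    """Mass change in Da when residue old is replaced by residue new."""
    return RESIDUE_MASS[new] - RESIDUE_MASS[old]

for old, new in SUBSTITUTIONS:
    print(f"{old} -> {new}: {mass_shift(old, new):+.5f} Da "
          f"(methylation: {METHYLATION:+.5f} Da)")
```

Every listed substitution adds exactly one CH2, so precursor mass alone cannot separate a substituted peptide from a methylated one; only fragment ions that localize the shift can.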
Significant False Positives
• DFLAGGLAAAISK, 2.2×10⁻⁸: 2 ESTs
• DFLAGGIAAAISK, 2.2×10⁻⁸: IPI (2), RefSeq, mRNA, ~1400 ESTs
• DFLAGGVAAAISK, 3.7×10⁻⁸: IPI, RefSeq, mRNA, ~700 ESTs
• DFLAGGVAAAISKMAVVPI, 3.5×10⁻⁵: Genscan exon
• AISFAKDFLAGGIAAAISK, 3.3×10⁻⁴: Genscan exon
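The three 13-residue peptides on this slide differ only at position 7 (L, I or V). A short sketch, assuming standard monoisotopic residue masses and a hypothetical helper `peptide_mass`, shows why the L and I forms cannot be told apart by mass, which is consistent with their identical E-values above:

```python
# Monoisotopic residue masses in Da for the residues used in these peptides.
RESIDUE_MASS = {
    "A": 71.03711, "D": 115.02694, "F": 147.06841, "G": 57.02146,
    "I": 113.08406, "K": 128.09496, "L": 113.08406, "S": 87.03203,
    "V": 99.06841,
}
WATER = 18.01056  # H2O added for the free N- and C-termini

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

for peptide in ("DFLAGGLAAAISK", "DFLAGGIAAAISK", "DFLAGGVAAAISK"):
    print(peptide, round(peptide_mass(peptide), 4))
```

The L and I forms have the same mass, and the V form is lighter by one CH2, so sequence evidence (2 ESTs versus hundreds) has to carry the decision between them.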
How do we know they are novel?
• How do we know they are real?
• Good spectra
• Good E-value
• Good ion ladders
• Good sequence evidence
• Lack of other explanations...
Peptide Sequence Evidence
Gb of Sequence
                               Naïve Enumeration       C3
Self Corrected ESTs (1)          7.60      2.90       0.18
Genome Corrected ESTs (2)        5.40      1.30       0.12
Corrected ESTs (1+2)            13.00      4.20       0.20
Genscan Exons (3)                0.10      0.06       0.05
Genscan Exon Pairs (4)           2.30      1.60       0.55
Combo (1+2+3+4)                 28.40     10.06       0.78
Genomic ORFs (5)                 6.20      1.90       1.50
• C3 Compression:
• Amino-acid 30-mers
• Complete, Correct(, Compact)
• Present at least twice (ESTs only)
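The "amino-acid 30-mers, present at least twice" filter can be sketched as follows. This is an illustration under stated assumptions, not the C3 implementation: "present at least twice" is read here as "seen in at least two distinct input sequences", the function names are hypothetical, and the toy strings are borrowed from the SBH-graph slide with k=4 so the output stays readable:

```python
from collections import Counter

def kmers(sequence, k):
    """Yield every length-k substring of sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]

def repeated_kmers(sequences, k=30, min_count=2):
    """Return the k-mers seen in at least min_count distinct sequences."""
    counts = Counter()
    for sequence in sequences:
        counts.update(set(kmers(sequence, k)))  # count each sequence once
    return {kmer for kmer, n in counts.items() if n >= min_count}

toy = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
print(sorted(repeated_kmers(toy, k=4)))
```

Requiring a second observation is a cheap guard against sequencing errors in single-pass ESTs, at the cost of dropping k-mers that are real but observed only once.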
SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
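A minimal sketch of how a graph of this kind could be built from the three example strings, assuming the usual sequencing-by-hybridization construction in which nodes are (k-1)-mers and each k-mer is an edge. The value k=4 and the name `sbh_graph` are assumptions; the slide does not state them:

```python
from collections import defaultdict

def sbh_graph(sequences, k):
    """Build a de Bruijn-style graph: nodes are (k-1)-mers, edges are k-mers.

    Returns {node: {successor: count}}, where count is the number of times
    the corresponding k-mer occurs across all sequences.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for sequence in sequences:
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return {node: dict(successors) for node, successors in graph.items()}

graph = sbh_graph(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
for node in sorted(graph):
    print(node, graph[node])
```

Shared k-mers such as ACDE are stored once with a count, so the graph holds every k-mer of the input without repeating the overlapping text.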
Compressed-SBH-graph
(Figure: compressed SBH-graph; labels 1, 2, 2, 1, 2 and ACDEFGI.)
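One common way to compress such a graph is to merge every chain of nodes that has exactly one way in and one way out into a single edge labelled with the string it spells. The sketch below does this for the three example strings; k=4 and the name `compress` are assumptions, and this is not claimed to be the author's algorithm:

```python
from collections import defaultdict

def compress(sequences, k):
    """Spell the maximal non-branching paths of the k-mer graph as strings.

    Cycles made up only of one-in, one-out nodes are not reported.
    """
    succ, pred = defaultdict(set), defaultdict(set)
    for sequence in sequences:
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
            pred[kmer[1:]].add(kmer[:-1])
    nodes = set(succ) | set(pred)

    def is_inner(node):
        # Exactly one predecessor and one successor: safe to merge through.
        return len(pred[node]) == 1 and len(succ[node]) == 1

    paths = []
    for node in sorted(nodes):
        if is_inner(node):
            continue  # paths only start at branching or terminal nodes
        for nxt in sorted(succ[node]):
            path = node + nxt[-1]
            while is_inner(nxt):
                (nxt,) = succ[nxt]
                path += nxt[-1]
            paths.append(path)
    return paths

print(compress(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4))
```

With k=4 this yields five compressed edges for the example strings, which is at least consistent with the five numeric labels on this slide.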
Peptide Sequence Databases
• MS/MS search engine input only
• Protein context is lost
• Inclusive, rather than exclusive
• Download from http://www.umiacs.umd.edu/~nedwards
• Exact string search for gene/protein context
• Recover peptide sequence evidence
• Relational database to reassemble...
...with respect to genes & genome
• Grid Computing + Web Services + Viewer
• Work in progress
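Because a peptide sequence database drops protein context, each identified peptide has to be mapped back by exact string search. A minimal sketch of that step, assuming the source sequences fit in an in-memory dictionary; `find_peptide` and the toy records are hypothetical and do not come from the author's tools:

```python
def find_peptide(peptide, proteins):
    """Return (protein_id, 1-based start) for every exact occurrence."""
    hits = []
    for protein_id, sequence in proteins.items():
        start = sequence.find(peptide)
        while start != -1:
            hits.append((protein_id, start + 1))
            start = sequence.find(peptide, start + 1)  # allow overlapping hits
    return hits

# Hypothetical toy records; identifiers and sequences are for illustration only.
proteins = {
    "toy_protein_1": "MKTAYDFLAGGVAAAISKTAVAPIER",
    "toy_protein_2": "MSEQAISFAKDFLAGGIAAAISKTAV",
}
print(find_peptide("DFLAGGVAAAISK", proteins))
```

At the scale of ESTs and genomic sequence a linear scan like this would be slow, so an indexed search or a relational database, as the slide suggests, would be the more realistic choice.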
Peptide Identification Navigator
Conclusions
• Peptides identify more than proteins
• Search EST sequences (at least)
• Compressed peptide sequence databases make this feasible