Novel Peptide Identification using ESTs and Genomic Sequence
Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park

Novel Peptides
• Absent from traditional protein sequence databases
  • IPI, SwissProt, TrEMBL, NCBI’s nr, MSDB
• Due to
  • Deliberate “redundancy” elimination
  • “Dark-side” genes
  • Bias towards high-quality, high-confidence full-length protein sequence

What is missing?
• Known coding SNPs
• Novel coding mutations
• Alternative splicing isoforms
• Alternative translation start-sites
• Microexons
• Alternative translation frames

Why should we care?
• Alternative splicing is the norm!
  • Only 20-25K human genes
  • Each gene makes many proteins
• Proteins have clinical implications
  • Biomarker discovery
• Evidence for SNPs and alternative splicing stops with transcription
  • Genomic assays, ESTs, mRNA sequence.
  • No hard evidence for translation start site

Novel Protein
HEQASNVLSDISEFR
Evidence:
• log10(E-value) = -9.6
• 100’s of ESTs
• Full length mRNA sequence
Details:
• Peptide Atlas A8_IP (Resing et al.)

[Figure slides: Novel Protein (3 slides)]

Novel Splice Isoform
LQGSATAAEAQVGHQTAR
Evidence:
• log10(E-value) = -6.8
• 10’s of ESTs
• Full length mRNA sequence
Details:
• Peptide Atlas raftflow (von Haller, et al.)
• LIME1 gene

[Figure slides: Novel Splice Isoform (3 slides)]

Novel Frame
TAGSPLCLPTPGAAPGSAGSCSHR
Evidence:
• log10(E-value) = -3.9
• 10’s of ESTs
• Full length mRNA sequence
Details:
• Peptide Atlas raftflow (von Haller, et al.)
• LIME1 gene, downstream from LQGSA...
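
A "novel frame" peptide is one read from the same transcript in a different reading frame. The following is a minimal Python sketch of that idea, assuming the standard genetic code; the nine-base input is made up for illustration (it is not LIME1 sequence), and `translate` and `forward_frames` are hypothetical helper names.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame):
    """Translate dna starting at offset `frame` (0, 1 or 2); unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def forward_frames(dna):
    """Return the three forward-strand translations of dna."""
    dna = dna.upper()
    return [translate(dna, frame) for frame in range(3)]


# Made-up example sequence: each frame yields a different peptide string.
frames = forward_frames("ATGGCCTGA")
for offset, peptide in enumerate(frames):
    print(offset, peptide)
```

A full six-frame translation would also translate the reverse complement; that step is omitted here to keep the sketch short.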
[Figure slides: Novel Frame (3 slides)]

“Novel” Microexon
LQTASDESYKDPTNIQLSK
Evidence:
• log10(E-value) = -6.4
• 10’s of ESTs / mRNA sequences
• SwissProt variant, absent from IPI
Details:
• Peptide Atlas raftflow (von Haller, et al.)
• SPTAN1 gene

[Figure slides: “Novel” Microexon (4 slides)]

Novel Mutation
KADDTWEPFASGK
Evidence:
• log10(E-value) = -7.6
• 2 ESTs from same clone library
• Ala2 Deletion
Details:
• HUPO PPP 29_b1-EDTA_1 (Qian/He; Omenn et al.)
• TTR gene
• Known Mutation: Ala2-to-Pro associated with familial amyloidotic polyneuropathy.

[Figure slides: Novel Mutation (4 slides)]

Known Coding SNP
DTEEEDFHVDQ[V|A]TTVK
Evidence:
• log10(E-value) = -9.5 / -9.4
• Known dbSNP (coding): Val12-to-Ala
• Wildtype also observed
Details:
• HUPO PPP 40 (Wang; Omenn et al.)
• SERPINA1 gene

[Figure slides: Wildtype (1 slide), Known Coding SNP (2 slides)]

Known Coding SNP
LQHL[E|V]NELTHDIITK
Evidence:
• log10(E-value) = -6.7 / -10.9
• 4 ESTs, same clone library
• Known dbSNP (coding): Glu5-to-Val
• Wildtype also observed
Details:
• HUPO PPP 28_b2-CIT (Pounds/Adkins/Rodland/Anderson; Omenn et al.)
• SERPINA1 gene

IPI Common Variant Elimination
YYGGGYGSTQATFMVFQALAQYQK
Evidence:
• log10(E-value) = -5.9
• 100’s ESTs, mRNA sequence
• IPI has (rare) variant (Insertion of AS@10)
• Differ in 5’ splice site.
Details:
• HUPO PPP 29 (Qian/He; Omenn et al.)
• C3 gene

Why don’t we see more novel peptides?
• Tandem mass spectrometry doesn’t discriminate against novel peptides...
  ...but protein sequence databases do!
• Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

Why don’t we see more novel peptides?
• Traditional protein sequence databases
  • High-quality, full-length proteins only
  • Many interesting peptides are omitted
  • Exclusive – peptide identifications are lost.
• ESTs, genomic & mRNA sequence
  • Used as evidence for full-length protein sequences
  • Inclusive – may need to filter results

Significant False Positives
• E-values are not enough!
• Random guessers are easy to beat.
• Post-translational modifications vs. amino-acid substitution
  • methylation (on I/L, Q, R, C, H, K, S, T, N): +14
  • D → E, G → A, V → I/L, N → Q, S → T: +14
• Peptide extension z=+2 → z=+3
  • Nonsense AA masses sum to precursor
• Need to ensure:
  • fragment ions define novel sequence
  • sequence evidence is strong
  • other plausible explanations can be eliminated

Significant False Positives
• DFLAGGLAAAISK 2.2×10⁻⁸
  • 2 ESTs
• DFLAGGIAAAISK 2.2×10⁻⁸
  • IPI (2), RefSeq, mRNA, ~1400 ESTs
• DFLAGGVAAAISK 3.7×10⁻⁸
  • IPI, RefSeq, mRNA, ~700 ESTs
• DFLAGGVAAAISKMAVVPI 3.5×10⁻⁵
  • Genscan exon
• AISFAKDFLAGGIAAAISK 3.3×10⁻⁴
  • Genscan exon

[Figure slide: Significant False Positives]

How do we know they are novel?
• How do we know they are real?
  • Good spectra
  • Good E-value
  • Good ion ladders
  • Good sequence evidence
• Lack of other explanations...

Peptide Sequence Evidence

Gb of Sequence                           Naïve Enumeration      C3
Self Corrected ESTs (1)          7.60          2.90            0.18
Genome Corrected ESTs (2)        5.40          1.30            0.12
Corrected ESTs (1+2)            13.00          4.20            0.20
Genscan Exons (3)                0.10          0.06            0.05
Genscan Exon Pairs (4)           2.30          1.60            0.55
Combo (1+2+3+4)                 28.40         10.06            0.78
Genomic ORFs (5)                 6.20          1.90            1.50

• C3 Compression:
  • Amino-acid 30-mers
  • Complete, Correct(, Compact)
  • Present at least twice (ESTs only)

SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI

Compressed-SBH-graph
[Figure: compressed SBH-graph; recovered labels: 1, 2, 2, 1, 2 and ACDEFGI]

Peptide Sequence Databases
• MS/MS search engine input only
• Protein context is lost
• Inclusive, rather than exclusive
• Download from http://www.umiacs.umd.edu/~nedwards
• Exact string search for gene/protein context
• Recover peptide sequence evidence
• Relational database to reassemble...
  ...with respect to genes & genome
• Grid Computing + Web Services + Viewer
• Work in progress

[Figure slides: Peptide Identification Navigator (2 slides)]

Conclusions
• Peptides identify more than proteins
• Search EST sequences (at least)
• Compressed peptide sequence databases make this feasible
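
The "+14" ambiguity listed under Significant False Positives can be checked with simple arithmetic. The sketch below assumes standard monoisotopic residue masses (these values are not from the slides) and uses hypothetical helper names. It shows that each listed substitution shifts the mass by the same amount as a methylation, and that the L, I and V variants of the DFLAGG...AAAISK peptide are related in the same way.

```python
# Monoisotopic residue masses in Da (standard values, not taken from the slides).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "T": 101.04768,
    "V": 99.06841, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259,
    "F": 147.06841,
}
METHYLATION = 14.01565  # mass of an added CH2 group, in Da

# Substitutions listed on the slide as "+14".
SUBSTITUTIONS = [("D", "E"), ("G", "A"), ("V", "I"), ("V", "L"), ("N", "Q"), ("S", "T")]


def substitution_shift(old, new):
    """Mass change in Da when residue `old` is replaced by residue `new`."""
    return RESIDUE_MASS[new] - RESIDUE_MASS[old]


def mimics_methylation(old, new, tol=0.001):
    """True if the substitution is indistinguishable from methylation within tol Da."""
    return abs(substitution_shift(old, new) - METHYLATION) <= tol


def residue_sum(peptide):
    """Sum of residue masses (terminal water and charge carriers are ignored)."""
    return sum(RESIDUE_MASS[aa] for aa in peptide)


for old, new in SUBSTITUTIONS:
    print(f"{old} -> {new}: {substitution_shift(old, new):+.5f} Da")

# The L and V variants differ by one methylation-sized step; L and I are isobaric.
delta = residue_sum("DFLAGGLAAAISK") - residue_sum("DFLAGGVAAAISK")
print(f"L variant minus V variant: {delta:+.5f} Da")
```

Precursor mass alone therefore cannot separate a methylated known peptide from a substituted novel one, which is why the slide asks for fragment ions that define the novel sequence.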
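
The compression idea behind the SBH-graph slides can be illustrated on the three toy sequences shown there. This is only a sketch, not the C3 algorithm itself: it assumes k = 4 for the toy input (the slides use amino-acid 30-mers), applies the "present at least twice" rule by counting how many input sequences contain each k-mer, and then merges k-mers that overlap by k-1 letters along non-branching paths. All function names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def supported_kmers(seqs, k, min_support=2):
    """k-mers that occur in at least min_support of the input sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(set(kmers(seq, k)))  # count each sequence at most once
    return {kmer for kmer, n in counts.items() if n >= min_support}


def spell_paths(kmer_set):
    """Merge overlapping k-mers along non-branching paths.

    Isolated cycles (no clear start k-mer) are skipped in this sketch.
    """
    succ = {a: [b for b in kmer_set if a[1:] == b[:-1]] for a in kmer_set}
    pred = {a: [b for b in kmer_set if b[1:] == a[:-1]] for a in kmer_set}
    paths = []
    for kmer in sorted(kmer_set):
        # Skip k-mers that simply continue an unbranched path started elsewhere.
        if len(pred[kmer]) == 1 and len(succ[pred[kmer][0]]) == 1:
            continue
        path, current = kmer, kmer
        while len(succ[current]) == 1 and len(pred[succ[current][0]]) == 1:
            current = succ[current][0]
            path += current[-1]
        paths.append(path)
    return paths


toy_sequences = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
kept = supported_kmers(toy_sequences, k=4)
compressed = spell_paths(kept)
print(sorted(kept))
print(compressed)
```

With these assumptions the k-mers seen in at least two sequences chain back into the single string ACDEFGI, the same string that appears on the Compressed-SBH-graph slide.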