Improving Genome Annotation using Proteomics
Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park

Mass Spectrometry for Proteomics
• Measure mass of many (bio)molecules simultaneously
  • High bandwidth
• Mass is an intrinsic property of all (bio)molecules
  • No prior knowledge required

Mass Spectrometer
[Figure: sample passes through ionizer, mass analyzer, and detector]
• Ionizer
  • MALDI
  • Electrospray ionization (ESI)
• Mass analyzer
  • Time-of-flight (TOF)
  • Quadrupole
  • Ion trap
• Detector
  • Electron multiplier (EM)

High Bandwidth
[Figure: mass spectrum, % intensity (0 to 100) vs. m/z (250 to 1000)]

Mass is fundamental!

Mass Spectrometry for Proteomics
• Measure mass of many molecules simultaneously
  • ...but not too many, abundance bias
• Mass is an intrinsic property of all (bio)molecules
  • ...but need a reference to compare to

Mass Spectrometry for Proteomics
• Mass spectrometry has been around since the turn of the 20th century...
  • ...why is MS-based proteomics so new?
• Ionization methods
  • MALDI, electrospray
• Protein chemistry & automation
  • Chromatography, gels, computers
• Protein / genome sequences
  • A reference for comparison

Sample Preparation for Peptide Identification
[Figure: enzymatic digest and fractionation]

Single Stage MS
[Figure: MS spectrum vs. m/z]

Tandem Mass Spectrometry (MS/MS)
[Figure: MS spectrum vs. m/z, with precursor selection]

Tandem Mass Spectrometry (MS/MS)
[Figure: precursor selection + collision-induced dissociation (CID), giving an MS/MS spectrum vs. m/z]

Peptide Identification
• For each (likely) peptide sequence
  1. Compute fragment masses
  2. Compare with spectrum
  3. Retain those that match well
• Peptide sequences from (any) sequence database
  • Swiss-Prot, IPI, NCBI’s nr, ESTs, genomes, ...
• Automated, high-throughput peptide identification in complex mixtures

Peptide Identification
...can provide direct experimental evidence for the amino-acid sequence of functional proteins.
Evidence for:
• Functional protein isoforms
• Translation start and frame
• Proteins with short open-reading-frames

How could this help?
• Evidence for SNPs and alternative splicing stops with transcription
• No genomic or transcript evidence for translation start-site
• Conservation doesn’t stop at coding bases!
• Statistical gene-finders struggle with microexons, translation start-sites, and short ORFs

What can be observed?
• Known coding SNPs
• Novel coding mutations
• Alternative splicing isoforms
• Microexons (non-canonical splice-sites)
• Alternative translation start-sites (codons)
• Alternative translation frames
• “Dark” open-reading-frames

Splice Isoform
• Human Jurkat leukemia cell-line
  • Lipid-raft extraction protocol, targeting T cells
  • von Haller, et al. MCP 2003.
• LIME1 gene:
  • LCK interacting transmembrane adaptor 1
• LCK gene:
  • Leukocyte-specific protein tyrosine kinase
  • Proto-oncogene
  • Chromosomal aberration involving LCK in leukemias
• Multiple significant peptide identifications

Splice Isoform
[Figure]

Novel Splice Isoform
[Figure]

Translation Start-Site
• Human erythroleukemia K562 cell-line
  • Depth of coverage study
  • Resing et al. Anal. Chem. 2004.
• THOC2 gene:
  • Part of the heteromultimeric THO/TREX complex
• Initially believed to be a “novel” ORF
  • RefSeq mRNA in Jun 2007, no RefSeq protein
  • TrEMBL entry Feb 2005, no SwissProt entry
  • Genbank mRNA in May 2002 (complete CDS)
  • Plenty of EST support
  • ~100,000 bases upstream of other isoforms

Translation Start-Site
[Figures: four slides]

Easily distinguish minor sequence variations
Two B. anthracis Sterne α/β SASP annotations
• RefSeq/Gb: MVMARN... (7441 Da)
• CMR: MARN... (7211 Da)
• Intact proteins differ by 230 Da
  • 7441 Da vs 7211 Da
• N-terminal tryptic peptides:
  • MVMAR (606.3 Da), MVMARNR (876.4 Da), vs
  • MARNR (646.3 Da)
• Very different MS/MS spectra

Bacterial Gene-Finding
• Find all the open-reading-frames...
…TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…
[Figure labels: stop codon, stop codon. Courtesy of Art Delcher]

Bacterial Gene-Finding
• Find all the open-reading-frames...
Reverse strand:
…ATCTTTTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT…
…TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…
[Figure labels: stop codon (three places), shifted stop. Courtesy of Art Delcher]
• ...but they overlap – which ones are correct?

Coding-Sequence “Score”
[Figure. Courtesy of Art Delcher]

Glimmer3 Performance
Length, GC%, and # Genes describe the genome; Matches, Correct starts, and Extra are Glimmer3 predictions.

Organism                             Length   GC%  # Genes  Matches       %  Correct starts       %  Extra
Archaeoglobus fulgidus               2.18Mb  48.6     1165     1162   99.7%             875   75.1%   1305
Bacillus anthracis                   5.23Mb  35.4     3132     3129   99.9%            2768   88.4%   2340
Bacillus subtilis                    4.21Mb  43.5     1576     1567   99.4%            1429   90.7%   2879
Campylobacter jejuni                 1.78Mb  30.3     1233     1233  100.0%            1149   93.2%    668
Carboxydothermus hydrogenoformans    2.40Mb  42.0     1753     1752   99.9%            1590   90.7%    865
Caulobacter crescentus               4.02Mb  67.2     2192     2187   99.8%            1552   70.8%   1559
Chlorobium tepidum                   2.15Mb  56.5     1292     1289   99.8%             949   73.5%    765
Clostridium perfringens              3.03Mb  28.6     1504     1503   99.9%            1385   92.1%   1178
Colwellia psychrerythraea            5.37Mb  38.0     3063     3060   99.9%            2663   86.9%   1714
Dehalococcoides ethenogenes          1.47Mb  48.9     1069     1059   99.1%             929   86.9%    483
Escherichia coli                     4.64Mb  50.8     3603     3553   98.6%            3150   87.4%    913
Geobacter sulfurreducens             3.81Mb  60.9     2351     2340   99.5%            1974   84.0%   1091
Haemophilus influenzae               1.83Mb  38.1     1170     1170  100.0%            1054   90.1%    639
Helicobacter pylori                  1.67Mb  38.9      915      914   99.9%             805   88.0%    765
Listeria monocytogenes               2.91Mb  38.0     1966     1965   99.9%            1797   91.4%    845
Methylococcus capsulatus             3.30Mb  63.6     2015     2005   99.5%            1542   76.5%   1231
Mycobacterium tuberculosis           4.40Mb  65.6     2217     2205   99.5%            1493   67.3%   2104
Neisseria meningitidis               2.27Mb  51.5     1232     1217   98.8%            1042   84.6%   1329
Porphyromonas gingivalis             2.34Mb  48.3     1200     1198   99.8%             933   77.8%    887
Pseudomonas fluorescens              7.07Mb  63.3     4535     4503   99.3%            3577   78.9%   1871
Pseudomonas putida                   6.18Mb  61.5     3633     3596   99.0%            2825   77.8%   1916
Ralstonia solanacearum               3.72Mb  67.0     2512     2487   99.0%            2061   82.0%   1077
Staphylococcus epidermidis           2.62Mb  32.1     1650     1649   99.9%            1511   91.6%    771
Streptococcus agalactiae             2.16Mb  35.6     1441     1438   99.8%            1336   92.7%    683
Streptococcus pneumoniae             2.16Mb  39.7     1359     1355   99.7%            1214   89.3%    780
Thermotoga maritima                  1.86Mb  46.2     1092     1090   99.8%             892   81.7%    804
Treponema denticola                  2.84Mb  37.9     1463     1463  100.0%            1332   91.0%   1210
Treponema pallidum                   1.14Mb  52.8      575      572   99.5%             425   73.9%    557
Ureaplasma parvum                    0.75Mb  25.5      327      327  100.0%             300   91.7%    293
Wolbachia endosymbiont               1.08Mb  34.2      628      627   99.8%             528   84.1%    537
Averages:                                                             99.6%                   84.3%

• Glimmer3 trained & compared to RefSeq genes with annotated function
• Correct STOP:
  • 99.6%
• Correct START:
  • 84.3%
• “Not all the genomes necessarily have carefully/accurately annotated start sites, so the results for number of correct starts may be suspect.”

N-terminal peptides
• (Protein) N-terminal peptides establish start-site of known & unexpected ORFs
• Use:
  • Directly to annotate genomes
  • Evaluate and improve algorithms
  • Map cross-species

N-terminal peptide workflows
• Typical proteomics workflows sample peptides from the proteome “randomly”
• Caulobacter crescentus (70%)
  • 3733 Proteins (RefSeq Genome annot.)
  • 66K tryptic peptides (600 Da to 3000 Da)
  • 2085 N-terminal tryptic peptides (3%)

N-terminal peptide workflow
• Protect protein N-terminus
• Digest to peptides
• Chemically modify free peptide N-term
• Use chem. mod. to capture unwanted peptides
Nat Biotech, Vol. 21, pp. 566-569, 2003.

Increasing N-terminal peptide coverage
• Multiple (digest) enzymes:
  • trypsin-R: 60% (80%)
  • acid + lys-C + trypsin: 85% (94%)
• Repeated LC-MS/MS
  • Precursor Exclusion / Inclusion lists
• MALDI / ESI
• Protein separation and/or orthogonal fractionation
Anal Chem, Vol. 76, pp. 4193-4201, 2004.

Proteomics Informatics
• Search spectra against:
  • Entire bacterial genome;
  • All Met initiated peptides; or
  • Statistically likely Met initiated peptides.
• Easily consider initial Met loss PTM, too
• Off-the-shelf MS/MS search engines (Mascot / X!Tandem / OMSSA)

Other Practical Issues
• Suitable for commonly available instrumentation
  • Only the sample prep. is (somewhat) novel.
• Need living organism
  • Stage of life-cycle?
• Bang for buck?
  • N-terminal peptides / $$$$

Other Research Projects
• Alternative splicing and coding SNPs in clinical cancer samples
• MS/MS spectral matching using HMMs
• Combining MS/MS search engine results using machine learning
• Microorganism identification using MS (www.RMIDb.org)
• Gapped/spaced seeds for inexact sequence alignment
• Applications of SBH-graphs and Eulerian paths

Hidden Markov Models for Spectral Matching
• Capture statistical variation and consensus in peak intensity
• Capture semantics of peaks
• Extrapolate model to other peptides
• Good specificity with superior sensitivity for peptide detection
• Assign 1000’s of additional spectra (w/ p-value < 10^-5)

Peptide DLATVYVDVLK
[Figures: two slides]

Acknowledgements
• Catherine Fenselau, Steve Swatkoski
  • UMCP Biochemistry
• Chau-Wen Tseng, Xue Wu
  • UMCP Computer Science
• Cheng Lee
  • Calibrant Biosystems
• PeptideAtlas, HUPO PPP, X!Tandem
• Funding: NIH/NCI, USDA/ARS
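
The N-terminal tryptic peptide masses quoted for the two B. anthracis SASP annotations, and step 1 of peptide identification ("compute fragment masses"), can be illustrated with a short Python sketch. This is not how Mascot, X!Tandem, or OMSSA are implemented; it is a minimal illustration that assumes monoisotopic residue masses, unmodified peptides, and singly charged b- and y-ions. The names `peptide_mass` and `fragment_ions` are hypothetical helpers introduced here.

```python
# Monoisotopic residue masses (Da) for the amino acids in the SASP peptides.
# Assumption: standard values rounded to 5 decimals; unmodified residues only.
RESIDUE_MASS = {
    "A": 71.03711,
    "M": 131.04049,
    "N": 114.04293,
    "R": 156.10111,
    "V": 99.06841,
}
WATER = 18.01056   # H2O added to the residue sum of an intact peptide
PROTON = 1.00728   # charge carrier for a singly protonated ion


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def fragment_ions(seq):
    """Singly charged b- and y-ion m/z values, one pair per backbone cleavage.

    b_ions runs b1..b(n-1); y_ions runs y(n-1)..y1, so b_ions[i] and
    y_ions[i] are the two pieces produced by the same cleavage.
    """
    b_ions, y_ions = [], []
    for i in range(1, len(seq)):
        b_ions.append(sum(RESIDUE_MASS[aa] for aa in seq[:i]) + PROTON)
        y_ions.append(sum(RESIDUE_MASS[aa] for aa in seq[i:]) + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    for pep in ("MVMAR", "MVMARNR", "MARNR"):
        print(pep, round(peptide_mass(pep), 1))
    b, y = fragment_ions("MARNR")
    print("b ions:", [round(m, 2) for m in b])
    print("y ions:", [round(m, 2) for m in y])
```

Under these assumptions the three peptide masses round to 606.3, 876.4, and 646.3 Da, matching the values on the slide, and the extra Met-Val at the RefSeq/Gb N-terminus adds about 230 Da, in line with the quoted difference between the intact proteins. Because the two annotations give different residues at every fragment position, their b- and y-ion ladders do not line up, which is why the MS/MS spectra differ so clearly.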