Improving Genome Annotation using Proteomics
Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park
Mass Spectrometry for Proteomics
• Measure mass of many (bio)molecules simultaneously
• High bandwidth
• Mass is an intrinsic property of all (bio)molecules
• No prior knowledge required
Mass Spectrometer
[Diagram: Sample, Ionizer, Mass Analyzer, Detector]
Ionizer
• MALDI
• Electro-Spray Ionization (ESI)
Mass Analyzer
• Time-Of-Flight (TOF)
• Quadrupole
• Ion-Trap
Detector
• Electron Multiplier (EM)
High Bandwidth
[Figure: mass spectrum, % Intensity versus m/z]
Mass is fundamental!
Mass Spectrometry for Proteomics
• Measure mass of many molecules simultaneously
• ...but not too many, abundance bias
• Mass is an intrinsic property of all (bio)molecules
• ...but need a reference to compare to
Mass Spectrometry for Proteomics
• Mass spectrometry has been around since the turn of the 20th century...
• ...why is MS-based Proteomics so new?
• Ionization methods
• MALDI, Electrospray
• Protein chemistry & automation
• Chromatography, Gels, Computers
• Protein / genome sequences
• A reference for comparison
Sample Preparation for Peptide Identification
Enzymatic Digest and Fractionation
Single Stage MS
[Figure: MS spectrum; x-axis m/z]
Tandem Mass Spectrometry (MS/MS)
[Figure: MS spectrum with precursor selection; x-axis m/z]
Tandem Mass Spectrometry (MS/MS)
[Figure: precursor selection + collision induced dissociation (CID) of the MS spectrum yields the MS/MS spectrum; x-axis m/z]
Peptide Identification
• For each (likely) peptide sequence
1. Compute fragment masses
2. Compare with spectrum
3. Retain those that match well
• Peptide sequences from (any) sequence database
• Swiss-Prot, IPI, NCBI’s nr, ESTs, genomes, ...
• Automated, high-throughput peptide identification in complex mixtures
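The three numbered steps can be sketched as follows. This is a minimal illustration, not how any particular search engine works: it assumes monoisotopic residue masses, singly charged b- and y-ions only, and a plain shared-peak count as the score. The function names and the 0.5 m/z tolerance are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_masses(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(masses)):
        ions.append(sum(masses[:i]) + PROTON)          # b-ion: N-terminal piece
        ions.append(sum(masses[i:]) + WATER + PROTON)  # y-ion: C-terminal piece
    return sorted(ions)


def shared_peak_count(peptide, peaks, tol=0.5):
    """Step 2: count theoretical fragments within tol of an observed peak."""
    return sum(
        any(abs(ion - peak) <= tol for peak in peaks)
        for ion in fragment_masses(peptide)
    )


def best_matches(candidates, peaks, keep=1):
    """Step 3: retain the candidate sequences that match the spectrum best."""
    ranked = sorted(
        candidates,
        key=lambda pep: shared_peak_count(pep, peaks),
        reverse=True,
    )
    return ranked[:keep]


# Toy usage: score two candidates against the theoretical peaks of MVMAR.
spectrum = fragment_masses("MVMAR")
print(best_matches(["MARNR", "MVMAR"], spectrum))
```

The candidate list would normally come from a digest of the chosen sequence database, restricted to peptides whose mass agrees with the selected precursor.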
Peptide Identification
...can provide direct experimental evidence for the amino-acid sequence of functional proteins.
Evidence for:
• Functional protein isoforms
• Translation start and frame
• Proteins with short open-reading-frames
How could this help?
• Evidence for SNPs and alternative splicing stops with transcription
• No genomic or transcript evidence for translation start-site.
• Conservation doesn’t stop at coding bases!
• Statistical gene-finders struggle with microexons, translation start-site, and short ORFs.
What can be observed?
• Known coding SNPs
• Novel coding mutations
• Alternative splicing isoforms
• Microexons (non-canonical splice-sites)
• Alternative translation start-sites (codons)
• Alternative translation frames
• “Dark” open-reading-frames
Splice Isoform
• Human Jurkat leukemia cell-line
• Lipid-raft extraction protocol, targeting T cells
• von Haller, et al. MCP 2003.
• LIME1 gene:
• LCK interacting transmembrane adaptor 1
• LCK gene:
• Leukocyte-specific protein tyrosine kinase
• Proto-oncogene
• Chromosomal aberration involving LCK in leukemias.
• Multiple significant peptide identifications
Novel Splice Isoform
Translation Start-Site
• Human erythroleukemia K562 cell-line
• Depth of coverage study
• Resing et al. Anal. Chem. 2004.
• THOC2 gene:
• Part of the heteromultimeric THO/TREX complex.
• Initially believed to be a “novel” ORF
• RefSeq mRNA in Jun 2007, no RefSeq protein
• TrEMBL entry Feb 2005, no SwissProt entry
• Genbank mRNA in May 2002 (complete CDS)
• Plenty of EST support
• ~ 100,000 bases upstream of other isoforms
Easily distinguish minor sequence variations
Two B. anthracis Sterne α/β SASP annotations
• RefSeq/Gb: MVMARN... (7441 Da)
• CMR: MARN... (7211 Da)
• Intact proteins differ by 230 Da
• 7441 Da vs 7211 Da
• N-terminal tryptic peptides:
• MVMAR (606.3 Da), MVMARNR (876.4 Da), vs
• MARNR (646.3 Da)
• Very different MS/MS spectra
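The peptide masses quoted above can be reproduced with a few lines of arithmetic. This sketch assumes neutral monoisotopic masses and unmodified residues; the helper name is hypothetical.

```python
# Monoisotopic residue masses (Da), only for the residues needed here.
RESIDUE_MASS = {
    "M": 131.04049, "V": 99.06841, "A": 71.03711,
    "R": 156.10111, "N": 114.04293,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


for pep in ("MVMAR", "MVMARNR", "MARNR"):
    print(pep, round(peptide_mass(pep), 1))
# MVMAR 606.3
# MVMARNR 876.4
# MARNR 646.3
```

The two extra residues in the RefSeq/Gb annotation, M and V, sum to about 230.1 Da, which accounts for the 230 Da difference between the intact proteins.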
Bacterial Gene-Finding
• Find all the open-reading-frames...
…TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…
[Diagram: two stop codons marked on the sequence]
...courtesy of Art Delcher
Bacterial Gene-Finding
• Find all the open-reading-frames...
Reverse strand:
…ATCTTTTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT…
…TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…
[Diagram: stop codons marked on both strands, including a stop in a shifted frame]
...but they overlap – which ones are correct?
...courtesy of Art Delcher
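A minimal sketch of the "find all the open-reading-frames" step: scan all six reading frames and report every stretch of codons that contains no stop. It assumes the standard stop codons TAA, TAG, and TGA and an uppercase ACGT sequence. The function names and the `min_codons` parameter are hypothetical, and this is not Glimmer's implementation.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def stop_free_stretches(dna, min_codons=1):
    """Yield (strand, frame, start, end) for each run of stop-free codons.

    Coordinates are 0-based, end-exclusive, and relative to the strand
    being read.
    """
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            run_start = frame
            for pos in range(frame, len(seq) - 2, 3):
                if seq[pos:pos + 3] in STOPS:
                    if (pos - run_start) // 3 >= min_codons:
                        yield strand, frame, run_start, pos
                    run_start = pos + 3
            # Trailing run that reaches the end of the sequence without a stop.
            end = frame + ((len(seq) - frame) // 3) * 3
            if (end - run_start) // 3 >= min_codons:
                yield strand, frame, run_start, end


slide_sequence = "TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA"
for stretch in stop_free_stretches(slide_sequence):
    print(stretch)
```

On the slide's sequence, frame 0 of the forward strand gives one stretch between the TAG at the start and the TGA at the end, while the other frames and the reverse strand give stretches that overlap it. That overlap is why a coding-sequence score is needed to choose among them.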
Coding-Sequence “Score”
...courtesy of Art Delcher
Glimmer3 Performance
Genome: Organism, Length, GC%, # Genes. Glimmer3 Predictions: Matches, Correct Starts, Extra.
Organism                           Length   GC%   # Genes   Matches         Correct Starts   Extra
Archaeoglobus fulgidus             2.18Mb   48.6  1165      1162 (99.7%)    875 (75.1%)      1305
Bacillus anthracis                 5.23Mb   35.4  3132      3129 (99.9%)    2768 (88.4%)     2340
Bacillus subtilis                  4.21Mb   43.5  1576      1567 (99.4%)    1429 (90.7%)     2879
Campylobacter jejuni               1.78Mb   30.3  1233      1233 (100.0%)   1149 (93.2%)     668
Carboxydothermus hydrogenoformans  2.40Mb   42.0  1753      1752 (99.9%)    1590 (90.7%)     865
Caulobacter crescentus             4.02Mb   67.2  2192      2187 (99.8%)    1552 (70.8%)     1559
Chlorobium tepidum                 2.15Mb   56.5  1292      1289 (99.8%)    949 (73.5%)      765
Clostridium perfringens            3.03Mb   28.6  1504      1503 (99.9%)    1385 (92.1%)     1178
Colwellia psychrerythraea          5.37Mb   38.0  3063      3060 (99.9%)    2663 (86.9%)     1714
Dehalococcoides ethenogenes        1.47Mb   48.9  1069      1059 (99.1%)    929 (86.9%)      483
Escherichia coli                   4.64Mb   50.8  3603      3553 (98.6%)    3150 (87.4%)     913
Geobacter sulfurreducens           3.81Mb   60.9  2351      2340 (99.5%)    1974 (84.0%)     1091
Haemophilus influenzae             1.83Mb   38.1  1170      1170 (100.0%)   1054 (90.1%)     639
Helicobacter pylori                1.67Mb   38.9  915       914 (99.9%)     805 (88.0%)      765
Listeria monocytogenes             2.91Mb   38.0  1966      1965 (99.9%)    1797 (91.4%)     845
Methylococcus capsulatus           3.30Mb   63.6  2015      2005 (99.5%)    1542 (76.5%)     1231
Mycobacterium tuberculosis         4.40Mb   65.6  2217      2205 (99.5%)    1493 (67.3%)     2104
Neisseria meningitidis             2.27Mb   51.5  1232      1217 (98.8%)    1042 (84.6%)     1329
Porphyromonas gingivalis           2.34Mb   48.3  1200      1198 (99.8%)    933 (77.8%)      887
Pseudomonas fluorescens            7.07Mb   63.3  4535      4503 (99.3%)    3577 (78.9%)     1871
Pseudomonas putida                 6.18Mb   61.5  3633      3596 (99.0%)    2825 (77.8%)     1916
Ralstonia solanacearum             3.72Mb   67.0  2512      2487 (99.0%)    2061 (82.0%)     1077
Staphylococcus epidermidis         2.62Mb   32.1  1650      1649 (99.9%)    1511 (91.6%)     771
Streptococcus agalactiae           2.16Mb   35.6  1441      1438 (99.8%)    1336 (92.7%)     683
Streptococcus pneumoniae           2.16Mb   39.7  1359      1355 (99.7%)    1214 (89.3%)     780
Thermotoga maritima                1.86Mb   46.2  1092      1090 (99.8%)    892 (81.7%)      804
Treponema denticola                2.84Mb   37.9  1463      1463 (100.0%)   1332 (91.0%)     1210
Treponema pallidum                 1.14Mb   52.8  575       572 (99.5%)     425 (73.9%)      557
Ureaplasma parvum                  0.75Mb   25.5  327       327 (100.0%)    300 (91.7%)      293
Wolbachia endosymbiont             1.08Mb   34.2  628       627 (99.8%)     528 (84.1%)      537
Averages:                                                   99.6%           84.3%
• Glimmer3 trained & compared to RefSeq genes with annotated function
• Correct STOP: 99.6%
• Correct START: 84.3%
• “Not all the genomes necessarily have carefully/accurately annotated start sites, so the results for number of correct starts may be suspect.”
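One plausible way to tally the "Matches", "Correct Starts", and "Extra" columns against a reference annotation is sketched below. The definitions are assumptions for illustration and are not taken from the Glimmer3 evaluation: a prediction "matches" when it shares strand and stop coordinate with a reference gene, has a "correct start" when the start coordinate also agrees, and is "extra" otherwise. The tuple layout and function name are hypothetical.

```python
def compare_to_reference(predicted, reference):
    """Return (matches, correct_starts, extra) for predicted genes.

    Genes are (strand, start, stop) tuples. A match is keyed on
    (strand, stop) only, so a prediction with the right stop but a
    different start still counts as a match.
    """
    ref_start_by_stop = {
        (strand, stop): start for strand, start, stop in reference
    }
    matches = correct_starts = extra = 0
    for strand, start, stop in predicted:
        ref_start = ref_start_by_stop.get((strand, stop))
        if ref_start is None:
            extra += 1
        else:
            matches += 1
            correct_starts += int(start == ref_start)
    return matches, correct_starts, extra


# Toy data: one exact hit, one right stop with a wrong start, one extra call.
reference = [("+", 100, 400), ("-", 900, 500)]
predicted = [("+", 100, 400), ("-", 870, 500), ("+", 1000, 1300)]
print(compare_to_reference(predicted, reference))
```

Under these definitions the percentages in the table are the counts divided by the number of reference genes, for example 1162 of 1165 (99.7%) matches and 875 of 1165 (75.1%) correct starts for Archaeoglobus fulgidus.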
N-terminal peptides
• (Protein) N-terminal peptides establish start-site of known & unexpected ORFs
Use:
• Directly to annotate genomes
• Evaluate and improve algorithms
• Map cross-species
N-terminal peptide workflows
• Typical proteomics workflows sample peptides from the proteome “randomly”
• Caulobacter crescentus (70%)
• 3733 Proteins (RefSeq Genome annot.)
• 66K tryptic peptides (600 Da to 3000 Da)
• 2085 N-terminal tryptic peptides (3%)
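Counts like these come from an in silico digest. The sketch below shows the idea under simplifying assumptions: trypsin cleaves after every K or R, with no missed cleavages and no proline rule, and peptides are kept if their neutral monoisotopic mass falls in the 600 Da to 3000 Da window. The function names and the toy protein are hypothetical.

```python
import re

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(protein):
    """Cleave after every K or R (simplified trypsin rule)."""
    return re.findall(r"[^KR]*[KR]|[^KR]+$", protein)


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_observable(proteins, low=600.0, high=3000.0):
    """Return (peptides in the mass window, how many are N-terminal)."""
    total = n_terminal = 0
    for protein in proteins:
        for index, peptide in enumerate(tryptic_peptides(protein)):
            if low <= peptide_mass(peptide) <= high:
                total += 1
                n_terminal += int(index == 0)
    return total, n_terminal


print(tryptic_peptides("MVMARNRAAK"))
print(count_observable(["MVMARNRAAK"]))
```

With the slide's figures, 2085 N-terminal peptides out of roughly 66K is about 3%, so a workflow that samples peptides at random rarely lands on a protein N-terminus.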
N-terminal peptide workflow
• Protect protein N-terminus
• Digest to peptides
• Chemically modify free peptide N-term
• Use chem. mod. to capture unwanted peptides
Nat Biotech, Vol. 21, pp. 566-569, 2003.
Increasing N-terminal peptide coverage
• Multiple (digest) enzymes:
• trypsin-R: 60% (80%)
• acid + lys-C + trypsin: 85% (94%)
• Repeated LC-MS/MS
• Precursor Exclusion / Inclusion lists
• MALDI / ESI
• Protein separation and/or orthogonal fractionation
Anal Chem, Vol. 76, pp. 4193-4201, 2004.
Proteomics Informatics
• Search spectra against:
• Entire bacterial genome;
• All Met initiated peptides; or
• Statistically likely Met initiated peptides.
• Easily consider initial Met loss PTM, too
• Off-the-shelf MS/MS search engines (Mascot / X!Tandem / OMSSA)
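The "all Met initiated peptides" option can be sketched as a small generator that walks a DNA sequence, translates from every ATG up to the first tryptic cleavage site, and also emits the form with the initial Met removed. Assumptions: the standard genetic code, ATG as the only start codon, forward strand only, and uppercase ACGT input. The function name is hypothetical, and the output would still have to be written to a sequence file for one of the search engines above.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def met_initiated_peptides(dna):
    """Yield (position, peptide) for the first tryptic peptide after each ATG.

    Translation stops after the first K or R (kept in the peptide) or
    just before a stop codon. Each peptide is followed by its
    Met-removed form.
    """
    for pos in range(len(dna) - 2):
        if dna[pos:pos + 3] != "ATG":
            continue
        peptide = ""
        for i in range(pos, len(dna) - 2, 3):
            aa = CODON_TABLE[dna[i:i + 3]]
            if aa == "*":
                break
            peptide += aa
            if aa in "KR":
                break
        yield pos, peptide
        if len(peptide) > 1:
            yield pos, peptide[1:]  # initial Met loss


slide_sequence = "TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA"
for position, peptide in met_initiated_peptides(slide_sequence):
    print(position, peptide)
```

Restricting the search to such peptides keeps the candidate list far smaller than a six-frame translation of the whole genome, at the cost of missing non-ATG starts.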
Other Practical Issues
• Suitable for commonly available instrumentation
• Only the sample prep. is (somewhat) novel.
• Need living organism
• Stage of life-cycle?
• Bang for buck?
• N-terminal peptides / $$$$
Other Research Projects
• Alternative splicing and coding SNPs in clinical cancer samples
• MS/MS spectral matching using HMMs
• Combining MS/MS search engine results using machine learning
• Microorganism identification using MS (www.RMIDb.org)
• Gapped/spaced seeds for inexact sequence alignment.
• Applications of SBH-graphs and Eulerian paths
Hidden Markov Models for Spectral Matching
• Capture statistical variation and consensus in peak intensity
• Capture semantics of peaks
• Extrapolate model to other peptides
• Good specificity with superior sensitivity for peptide detection
• Assign 1000’s of additional spectra (w/ p-value < 10^-5)
Peptide DLATVYVDVLK
Acknowledgements
• Catherine Fenselau, Steve Swatkoski
• UMCP Biochemistry
• Chau-Wen Tseng, Xue Wu
• UMCP Computer Science
• Cheng Lee
• Calibrant Biosystems
• PeptideAtlas, HUPO PPP, X!Tandem
• Funding: NIH/NCI, USDA/ARS