Mass Spectrometry-based Proteomics
Xuehua Shen
(Adapted from slides with textbook)

Outline
• Motivation of proteomics
• Mass spectrometry-based proteomics
• Instrumentation of mass spectrometry
• De novo sequencing algorithm
• Database search
• Algorithms of real software (e.g., sequence tags)

Motivation
• Proteins are the working units of the cell
– The number of genes found is much smaller than the number of expressed proteins
– Proteins are directly related to cell processes and diseases
• DNA (SNPs): ~30,000 human genes
• mRNA (alternative splicing): >100,000 RNA messages
• Protein (post-translational modification): >1,000,000 distinct protein forms

Tools for Proteomics
• Edman degradation reaction
• NMR (Nuclear Magnetic Resonance)
• X-ray crystallography
• Protein array
• Mass spectrometry

Mass Spectrometry-based Proteomics
• Primary sequence (sequencing, identification)
• Post-translational modification (PTM) (characterization)
• Quantitative proteomics (quantification)
• Protein-protein interaction

Components of Mass Spectrometer
• Ion source (ESI and MALDI)
• Mass analyzer (ion traps, TOF, quadrupole, FT, etc.)
– Measures the mass-to-charge ratio (m/z)
• Ion detector

Peptide and Intact Protein
• Peptide: a fragment of a protein
• Some enzymes, e.g. trypsin, break proteins into peptides.
• Some technologies put the intact protein into the mass spectrometer.

Peptide Fragmentation
• Collision-induced dissociation
[Figure: protonated (H+) peptide backbone H...-HN-CH(Ri-1)-CO-NH-CH(Ri)-CO-NH-CH(Ri+1)-CO-...OH, N-terminus on the left, C-terminus on the right]
• Peptides tend to fragment along the backbone.
• Fragments can also lose neutral chemical groups such as NH3 and H2O.
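
The backbone fragmentation described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not a model of a real instrument: it uses rounded integer residue masses, ignores the terminal groups and the charge-carrying proton, and the peptide GPFNA and all function names are hypothetical choices made for the example.

```python
# Rounded (nominal) residue masses in Da; only the residues used below.
RESIDUE_MASS = {"G": 57, "A": 71, "P": 97, "N": 114, "F": 147}

def prefix_masses(peptide):
    """Masses of the proper N-terminal fragments (prefixes) of the peptide."""
    masses, total = [], 0
    for residue in peptide[:-1]:
        total += RESIDUE_MASS[residue]
        masses.append(total)
    return masses

def suffix_masses(peptide):
    """Masses of the proper C-terminal fragments (suffixes) of the peptide."""
    return prefix_masses(peptide[::-1])

def with_neutral_losses(mass):
    """A fragment mass plus its H2O-loss (-18) and NH3-loss (-17) variants."""
    return [mass, mass - 18, mass - 17]

peptide = "GPFNA"  # hypothetical example peptide
print(prefix_masses(peptide))      # N-terminal ladder
print(suffix_masses(peptide))      # C-terminal ladder
print(with_neutral_losses(415))    # one fragment with neutral losses
```

Each cleavage site along the backbone contributes one prefix and one suffix, so the two ladders together are what a de novo algorithm later has to untangle.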

Ideal Mass Spectrum
[Figure: ideal mass spectrum]

Real Mass Spectrum
[Figure: real mass spectrum]

N- and C-terminal Peptides
[Figure: N-terminal and C-terminal fragments of a peptide]

Terminal Peptides and Ion Types
• Peptide: mass (Da) 57 + 97 + 147 + 114 = 415
• Peptide without H2O: mass (Da) 57 + 97 + 147 + 114 – 18 = 397

N- and C-terminal Peptides
• Masses shown in the figure: 486; 57, 154, 301, 415; 71, 185, 332, 429 (57 + 429 = 154 + 332 = 301 + 185 = 415 + 71 = 486)
• Problem: reconstruct the peptide from the set of masses of its fragments

Mass Spectra
[Figure: peaks along a mass axis starting at 0, labeled with the residues G, V, D, L, K; a gap of 57 Da corresponds to 'G', a gap of 99 Da to 'V'; an H2O loss is marked]
• The peaks in the mass spectrum:
– Prefix and suffix fragments
– Fragments with neutral losses (-H2O, -NH3)
– Noise and missing peaks

Protein Identification with MS/MS
[Figure: peptide identification from an MS/MS spectrum with peaks labeled G, V, D, L, K; axes: intensity vs. mass]

Protein Identification by Tandem Mass Spectrometry
[Figure: MS/MS instrument producing a spectrum (S#: 1708, Full ms2 638.00 [165.00 - 1925.00]) that is interpreted as a sequence; axes: relative abundance vs. m/z]
• De novo interpretation: Sherenga
• Database search: Sequest

De Novo vs. Database Search
[Figure: the same MS/MS spectrum (relative abundance vs. m/z) processed by de novo sequencing and by database search, each reporting mass and score]
• Database of known peptides: MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..
[Figure, continued: amino acid letters of the de novo path; resulting peptide AVGELTK]

Current Status
• Protein sequencing is still an open problem, whether de novo sequencing or database search is used
• The following algorithms deal only with simplified (ideal) spectra
• Some algorithms combine de novo sequencing and database search

Pros and Cons of de novo Sequencing
• Advantages:
– Finds sequences that are not necessarily in the database.
– An additional similarity search step using these sequences may identify related proteins in the database.
• Disadvantages:
– Requires higher-quality data.
– Often contains errors.

Outline
• Motivation of proteomics
• Mass spectrometry-based proteomics
• Instrumentation of mass spectrometry
• De novo sequencing
• Database search
• Algorithms of real software (e.g., sequence tags)

De novo Peptide Sequencing
[Figure: MS/MS spectrum (S#: 1708, Full ms2 638.00 [165.00 - 1925.00]) mapped to a sequence; axes: relative abundance vs. m/z]

Peptide Sequencing Problem
Goal: Find a peptide with maximal match between an experimental and a theoretical spectrum.
Input:
– S: experimental spectrum
– Δ: set of possible ion types
– m: parent mass
Output:
– P: peptide with mass m whose theoretical spectrum best matches the experimental spectrum S

Procedure of De Novo Sequencing
• Build the spectrum graph
– How to create vertices (from masses)
– How to create edges (from mass differences)
• Find the best path, or rank paths, in the spectrum graph
– How to find candidate paths
– How to score paths

From Sequence to Spectrum
[Figure: b-ion ladder for the example peptide SEQUENCE (S, E, Q, U, E, N, C, E); axis: mass/charge (m/z)]

From Sequence to Spectrum (cont.)
[Figure: a-ion ladder for the same peptide; axis: mass/charge (m/z)]
• a is an ion type obtained by a mass shift from b

From Sequence to Spectrum (cont.)
[Figure: y-ion ladder for the example peptide SEQUENCE, read from the C-terminus (E, C, N, E, U, Q, E, S); axis: mass/charge (m/z)]

From Sequence to Spectrum (cont.)
[Figures: intensity vs. mass/charge (m/z); the last panel adds noise peaks]

MS/MS Spectrum
[Figure: intensity vs. mass/charge (m/z)]

Some Mass Differences between Peaks Correspond to Amino Acids
[Figure: spectrum peaks whose mass differences are labeled with the letters of "sequence"]

Now decoding from spectrum to sequence…?
• Build spectrum graph

Peptide Fragmentation
• Different ion types (b, y, b-NH3, b-H2O)
• Fragment at one site (internal ions)
[Figure: peptide backbone H-N(H)-C(H)(R1)-C(=O)-N(H)-C(H)(R2)-C(=O)-N(H)-C(H)(R3)-C(=O)-N(H)-C(H)(R4)-COOH; N-terminal fragments a2, b2, b2-H2O, a3, b3, b3-NH3 and C-terminal fragments y3, y3-H2O, y2, y2-NH3, y1 are marked]

Example of Ion Type
• Δ = {δ1, δ2, …, δk}
• Ion types {b, b-NH3, b-H2O} correspond to Δ = {0, 17, 18}
*Note: In reality the δ value of ion type b is -1, but we will "hide" it for the sake of simplicity.

Why Peptide Sequencing Is Hard
• Two overlapping ladders of masses: one cannot tell whether a peak is a b ion or a y ion
• Incomplete fragmentation
• Chemical noise
• The mass accuracy of the instrument is not good enough (Q ≈ K; G+V = 156.090 vs. R = 156.101)
• Q: Is sequencing a shorter or a longer peptide harder?

Vertices of Spectrum Graph
• Vertices are generated by reverse shifts corresponding to ion types
• Δ = {δ1, δ2, …, δk}
• Every mass s in an MS/MS spectrum generates k vertices V(s) = {s+δ1, s+δ2, …, s+δk} corresponding to potential N-terminal peptides
• Vertices of the spectrum graph: {initial vertex} ∪ V(s1) ∪ V(s2) ∪ ... ∪ V(sm) ∪ {terminal vertex}

Reverse Shifts
[Figure: peaks shifted by H2O and by H2O+NH3]

Edges of Spectrum Graph
• Two vertices whose mass difference corresponds to an amino acid A:
– Connect them with a directed edge labeled A
• Gap edges for di- and tri-peptides
– Potential sequence tag method (covered later)

Best Path of Spectrum Graph
• How to find candidate paths
• There are many paths; how do we find the correct one?
• We need scoring to evaluate paths

Find Candidate Paths
• Heuristic: find a path with the maximum number of edges
• Longest path problem in a DAG
• DFS (depth-first search)

Path Score
• p(P,S) = probability that peptide P produces spectrum S = {s1, s2, …, sq}
• Scoring = computing probabilities

Finding Optimal Paths in the Spectrum Graph
• For a given MS/MS spectrum S, find a peptide P' maximizing p(P,S) over all possible peptides P: p(P',S) = max_P p(P,S)
• Peptides = paths in the spectrum graph
• P' = the optimal path in the spectrum graph
• Some software ranks paths

Ratio Test Scoring for Partial Peptides
• Incorporates premiums for observed ions and penalties for missing ions.
• Here qi is the probability of observing ion type δi and qR is the probability of a random (noise) peak.
• Example: for k = 4, assume that for a partial peptide P' we only see ions δ1, δ2, δ4. The score is calculated as: (q1/qR) · (q2/qR) · ((1 - q3)/(1 - qR)) · (q4/qR)

Why Not Sequence De Novo?
• De novo sequencing is still not very accurate!

Algorithm                              Amino Acid Accuracy   Whole Peptide Accuracy
Lutefisk (Taylor and Johnson, 1997)    0.566                 0.189
SHERENGA (Dancik et al., 1999)         0.690                 0.289
Peaks (Ma et al., 2003)                0.673                 0.246
PepNovo (Frank and Pevzner, 2005)      0.727                 0.296

• Less than 30% of the peptides sequenced were completely correct!

De Novo vs. Database Search
[Figure: MS/MS spectrum (S#: 1708, Full ms2 638.00 [165.00 - 1925.00]; relative abundance vs. m/z) processed by de novo sequencing and by database search]
• Database of known peptides: MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..
[Figure, continued: amino acid letters of the de novo path; resulting peptide AVGELTK]

Outline
• Motivation of proteomics
• Mass spectrometry-based proteomics
• Instrumentation: mass spectrometry
• De novo sequencing algorithm
• Database search
• Algorithms of real software (e.g., sequence tags)

Peptide Identification Problem
Goal: Find a peptide from the database with maximal match between an experimental and a theoretical spectrum.
Input:
– S: experimental spectrum
– database of peptides
– Δ: set of possible ion types
– m: parent mass
Output:
– A peptide of mass m from the database whose theoretical spectrum best matches the experimental spectrum S

Match between Spectra and the Shared Peak Count
• The match between two spectra is the number of masses (peaks) they share (Shared Peak Count, or SPC)
• In practice, mass spectrometrists use a weighted SPC that reflects the intensities of the peaks
• The match between experimental and theoretical spectra is defined similarly

MS/MS Database Search
• Database search in mass spectrometry has been successful in identifying already known proteins.
• An experimental spectrum can be compared with the theoretical spectra of database peptides to find the best fit: SEQUEST (Yates et al., 1995).
• But designing reliable algorithms for identifying peptides is a much more difficult problem.
• Q: Why might a peptide not be identical to any sequence in the database?

Deficiency of the Shared Peaks Count
• Shared peaks count (SPC): an intuitive measure of spectral similarity.
• Problem: SPC diminishes very quickly as the number of mutations increases.
• Only a small portion of the correlations between the spectra of mutated peptides is captured by SPC.
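
The shared peak count defined above can be sketched as follows. This is a minimal, unweighted version that assumes each spectrum is just a list of peak masses; the function name `shared_peak_count` and the `tol` parameter (a mass tolerance for treating two peaks as the same) are assumptions for illustration, not part of any real search engine. The three spectra are the PRTEIN, PRTEYN and PGTEYN examples used in these slides.

```python
def shared_peak_count(spectrum_a, spectrum_b, tol=0.0):
    """Count peaks of spectrum_a that pair one-to-one with peaks of spectrum_b.

    Two peaks are treated as shared when their masses differ by at most tol.
    """
    a, b = sorted(spectrum_a), sorted(spectrum_b)
    i = j = count = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tol:
            count += 1      # matched pair: consume both peaks
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1          # peak in a has no partner
        else:
            j += 1          # peak in b has no partner
    return count

# Example spectra from the slides (integer masses).
s_prtein = [98, 133, 246, 254, 355, 375, 476, 484, 597, 632]
s_prteyn = [98, 133, 254, 296, 355, 425, 484, 526, 647, 682]
s_pgteyn = [98, 133, 155, 256, 296, 385, 425, 526, 548, 583]

print(shared_peak_count(s_prtein, s_prtein))  # no mutations
print(shared_peak_count(s_prtein, s_prteyn))  # 1 mutation
print(shared_peak_count(s_prtein, s_pgteyn))  # 2 mutations
```

A single substitution shifts every fragment that contains the changed residue, which is why half of the peaks stop matching after one mutation even though the sequences differ in only one position.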

SPC Diminishes Quickly
• No mutations: SPC = 10
• 1 mutation: SPC = 5
• 2 mutations: SPC = 2
S(PRTEIN) = {98, 133, 246, 254, 355, 375, 476, 484, 597, 632}
S(PRTEYN) = {98, 133, 254, 296, 355, 425, 484, 526, 647, 682}
S(PGTEYN) = {98, 133, 155, 256, 296, 385, 425, 526, 548, 583}

Post-Translational Modifications
• Proteins are involved in cellular signaling and metabolic regulation.
• They are subject to a large number of biological modifications.
• Almost all protein sequences are post-translationally modified, and 200 types of modifications of amino acid residues are known.

Examples of Post-Translational Modification
• Post-translational modifications increase the number of "letters" in the amino acid alphabet and lead to a combinatorial explosion in both database search and de novo approaches.

Search for Modified Peptides: Virtual Database Approach
• Yates et al., 1995: an exhaustive search in a virtual database of all modified peptides.
• Exhaustive search leads to a large combinatorial problem, even for a small set of modification types.
• Problem (Yates et al., 1995): extend the virtual database approach to a large set of modifications.

Modified Peptide Identification Problem
Goal: Find a modified peptide from the database with maximal match between an experimental and a theoretical spectrum.
Input:
– S: experimental spectrum
– database of peptides
– Δ: set of possible ion types
– m: parent mass
– Parameter k (number of mutations/modifications)
Output:
– A peptide of mass m that is at most k mutations/modifications apart from a database peptide and whose theoretical spectrum best matches the experimental spectrum S

Spectrum Alignment
• See 8.14 and 8.15 in the textbook for one algorithm
• Complicated for real spectra

Outline
• Motivation of proteomics
• Mass spectrometry-based proteomics
• Instrumentation: mass spectrometry
• De novo sequencing algorithm
• Database search
• Algorithms of real software (e.g., sequence tags)

Combining de novo and Database Search in Mass Spectrometry
• So far, de novo sequencing and database search were presented as two separate techniques.
• Database search is rather slow: many labs generate more than 100,000 spectra per day, and SEQUEST takes approximately 1 minute to compare a single spectrum against SWISS-PROT (54Mb) on a desktop.
• It would take SEQUEST more than 2 months to analyze the MS/MS data produced in a single day.
• Q: Can slow database search be combined with fast de novo analysis?

What Can Be Done with De Novo?
• Given an MS/MS spectrum:
– Can de novo predict the entire peptide sequence? No! (Accuracy is less than 30%.)
– Can de novo predict a set of partial sequences that, with high probability, contains at least one correct tag (a covering set of tags)? Yes!

Peptide Sequence Tags
• A peptide sequence tag is a short substring of a peptide.
• Example: peptide GVDLK; tags: GVD, VDL, DLK

Filtration with Peptide Sequence Tags
• Peptide sequence tags can be used as filters in database searches.
• The filtration: consider only database peptides that contain the tag (in its correct relative mass location).
• First suggested by Mann and Wilm (1994).
• Similar concepts are also used by:
– GutenTag (Tabb et al., 2003)
– MultiTag (Sunyaev et al., 2003)
– OpenSea (Searle et al., 2004)

Why Filter Database Candidates?
• Effective filtration can greatly speed up the process, enabling expensive searches involving post-translational modifications.
• Goal: generate a small set of covering tags and use them to filter the database peptides.

Summary
• Protein sequencing
• Mass spectrum
• De novo search and database search
• Difficulty of protein sequencing

The End

Quality Measure of Mass Spectrometer
• Sensitivity
• Mass accuracy
• Resolution
• Dynamic range

Exhaustive Search for Modified Peptides
• Example peptide: YFDSTDYNMAK (oxidation? phosphorylation?)
• For each peptide, generate all modifications.
• Score each modification.
• 2^5 = 32 possibilities, with 2 types of modifications!

Peptide Identification Problem: Challenge
• Very similar peptides may have very different spectra!
• Goal: define a notion of spectral similarity that correlates well with sequence similarity.
• If peptides are a few mutations/modifications apart, the spectral similarity between their spectra should be high.

Why Filtration?
[Figure: two pipelines over a database of DNA sequences. Sequence alignment with the Smith-Waterman algorithm: protein query, scoring, sequence matches. BLAST: protein query, filtration, scoring, sequence matches.]
• BLAST filters out very few correct matches and is almost as accurate as the Smith-Waterman algorithm.

Filtration and MS/MS
[Figure: two pipelines over a database of protein sequences. Peptide sequencing with SEQUEST / Mascot: MS/MS spectrum, scoring, sequence matches. With filtration: MS/MS spectrum, filtration, scoring, sequence matches.]

Filtration in MS/MS Sequencing
• Filtration in MS/MS is more difficult than in BLAST.
• Early approaches using peptide sequence tags were not able to replace the complete database search.
• Current filtration approaches are mostly used to generate additional identifications rather than to replace the database search.
• Can we design a filtration-based search that can replace the database search and is orders of magnitude faster?

Asking the Old Question Again: Why Not Sequence De Novo?
• De novo sequencing is still not very accurate!

Algorithm                              Amino Acid Accuracy   Whole Peptide Accuracy
Lutefisk (Taylor and Johnson, 1997)    0.566                 0.189
SHERENGA (Dancik et al., 1999)         0.690                 0.289
Peaks (Ma et al., 2003)                0.673                 0.246
PepNovo (Frank and Pevzner, 2005)      0.727                 0.296
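
The tag-based filtration discussed in the filtration slides (keep only database peptides that contain the tag at its correct relative mass location) can be sketched as below. This is a rough illustration under stated assumptions: residue masses are rounded nominal values, the tag "GEL" with a prefix mass of 170 Da is a hypothetical de novo result, the peptide "GELAVTK" is an invented decoy, and `filter_by_tag` is not the method of any particular tool.

```python
# Rounded (nominal) residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def peptide_mass(peptide):
    """Sum of residue masses (terminal groups ignored for simplicity)."""
    return sum(RESIDUE_MASS[residue] for residue in peptide)

def filter_by_tag(database, tag, prefix_mass, tol=1):
    """Keep peptides containing `tag` preceded by residues of mass ~ prefix_mass."""
    hits = []
    for peptide in database:
        start = peptide.find(tag)
        while start != -1:
            # The tag must sit at the right mass offset, not just anywhere.
            if abs(peptide_mass(peptide[:start]) - prefix_mass) <= tol:
                hits.append(peptide)
                break
            start = peptide.find(tag, start + 1)
    return hits

# A few peptides from the slides' example database plus one invented decoy.
database = ["MDERHILNM", "KLQWVCSDL", "PTYWASDL", "AVGELTK", "HEWAILF", "GELAVTK"]

# Hypothetical tag: "GEL" observed after a 170 Da prefix (A + V = 71 + 99).
print(filter_by_tag(database, "GEL", prefix_mass=170))
```

Only the surviving candidates would then be passed to the expensive scoring step, which is where the speed-up over a full database scan would come from.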