Data Processing Algorithms for Analysis of High Resolution MSMS Spectra of Peptides with Complex Patterns of Posttranslational Modifications
Shenheng Guan and Alma L. Burlingame
Problem

Input: An MS/MS spectrum of a mixture of peptides:

Heavily modified protein
Same amino acid sequence
Same PTM
Same total number of PTMs
Different PTM configurations
Example

Two peptides with two methylations each (each methyl group adds about 14 Da).
LATK[+28]AARKSAE
LATK[+14]AARK[+14]SAE

Problem:

Identify the PTM configurations
 Estimate their relative abundance
Workflow
Peptide identification

Input

A deisotoped MS/MS spectrum of a mixture of peptides
An identified peptide, the type of PTMs, and the number of PTMs

Example

Peptide: LATKAARKSAPATGGVKKPHRYRPGTVALRE
 PTM: Methylation
 #PTM: 4

Problem

Identify the PTM configurations
 Estimate their relative abundance
All possible configurations

Assumption:

All methylations are on lysine residues
 Each lysine residue has at most 3 methyl groups.
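
The enumeration implied by these two assumptions can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `enumerate_configurations` and its parameters are hypothetical.

```python
from itertools import product

def enumerate_configurations(peptide, n_methyl, max_per_lysine=3):
    """List every way to place n_methyl methyl groups on the lysines of peptide.

    Each configuration maps a 0-based lysine position to its methyl count.
    Assumes, as on the slide, that only lysines are methylated and that a
    lysine carries at most max_per_lysine methyl groups.
    """
    lysines = [i for i, aa in enumerate(peptide) if aa == "K"]
    configurations = []
    # Try every combination of per-lysine counts and keep those with the right total.
    for counts in product(range(max_per_lysine + 1), repeat=len(lysines)):
        if sum(counts) == n_methyl:
            configurations.append(dict(zip(lysines, counts)))
    return configurations

configs = enumerate_configurations("LATKAARKSAPATGGVKKPHRYRPGTVALRE", 4)
print(len(configs))
```

For the example peptide, which has four lysines and four methyl groups, this yields 31 candidate configurations.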
Configuration identification

Score of Spectrum-Configuration-Pair

Spectrum S: ETD peak list
Configuration C: theoretical peak list (c-ion)
Sc(S,C) is the number of peaks matched between the real peak list and the theoretical peak list.
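
A minimal sketch of such a score is shown below. It assumes that peaks are given as plain m/z values and that two peaks match when they differ by at most a fixed tolerance; the tolerance value and the name `match_score` are assumptions, since the slide does not state how matching is decided.

```python
def match_score(spectrum, theoretical, tol=0.01):
    """Count observed peaks that lie within tol of some theoretical peak.

    spectrum: observed m/z values (the real peak list S)
    theoretical: c-ion m/z values predicted for one configuration C
    tol: absolute m/z tolerance (hypothetical default)
    """
    return sum(
        1
        for observed in spectrum
        if any(abs(observed - mz) <= tol for mz in theoretical)
    )

print(match_score([100.0, 200.0, 300.0], [100.005, 250.0, 300.0]))
```

Here two of the three observed peaks fall within the tolerance of a theoretical peak, so the score is 2.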

Greedy algorithm

Compute the matching score for each configuration
Remove the configuration with the highest score from the configuration set and remove the peaks in S that are matched to that configuration
Repeat the above steps until all remaining configurations have score 0
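
The greedy loop above can be sketched as follows, under the same assumptions as before: peaks are m/z values, matching uses a fixed tolerance, and ties are broken by input order. The names `matched_indices` and `greedy_identify` and the toy peak lists are hypothetical.

```python
def matched_indices(spectrum, theoretical, tol):
    """Indices of observed peaks within tol of any theoretical peak."""
    return [
        i
        for i, observed in enumerate(spectrum)
        if any(abs(observed - mz) <= tol for mz in theoretical)
    ]

def greedy_identify(spectrum, configurations, tol=0.01):
    """Repeatedly pick the best-scoring configuration and remove its peaks.

    configurations: dict mapping a configuration name to its theoretical peak list.
    Returns (name, score) pairs in the order the configurations were selected.
    """
    peaks = list(spectrum)
    candidates = dict(configurations)
    selected = []
    while candidates:
        scores = {
            name: len(matched_indices(peaks, theoretical, tol))
            for name, theoretical in candidates.items()
        }
        best = max(scores, key=scores.get)
        if scores[best] == 0:
            break  # every remaining configuration has score 0
        hits = set(matched_indices(peaks, candidates.pop(best), tol))
        # Matched peaks are consumed so later configurations cannot reuse them.
        peaks = [p for i, p in enumerate(peaks) if i not in hits]
        selected.append((best, scores[best]))
    return selected

toy_spectrum = [100.0, 114.0, 200.0, 228.0, 300.0]
toy_configs = {
    "C1": [100.0, 200.0, 300.0],
    "C2": [114.0, 228.0, 350.0],
    "C3": [100.0, 200.0, 350.0],
}
print(greedy_identify(toy_spectrum, toy_configs))
```

In this toy run C1 is selected first with score 3, then C2 with score 2; C3 drops to score 0 once the peaks explained by C1 are removed, so it is never reported.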
Configuration identification results
Estimation of relative abundance


We have four identified configurations C1, C2, C3, C4.
x1, x2, x3, x4: the relative abundances
Sum equals 1: $x_1 + x_2 + x_3 + x_4 = 1$

Consider the ith c-ion with charge z
Five possible peaks p0, …, p4
Suppose p2 is matched to C1, C2
Observed peak intensity: $I(p_2)$
Theoretical peak intensity: $(x_1 + x_2) \sum_{0 \le j \le 4} I(p_j)$

Compute the observed and theoretical peak intensity pair for each matched c-ion
Estimation of relative abundance


Find x1, x2, x3, x4 such that the sum of the squared errors of these intensity pairs is minimized.
Standard non-negative least-squares procedure
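
The fit can be sketched as below. Each matched c-ion peak contributes one equation: the sum of the abundances of the configurations matched to that peak, times the total intensity of the c-ion's peaks, should equal the observed intensity. To stay dependency-free, this sketch uses a simple projected-gradient solver rather than the active-set procedure that standard NNLS routines typically implement. The function names, the final rescaling to sum 1, and the toy intensities are assumptions.

```python
def nnls_projected_gradient(A, b, iterations=5000):
    """Minimise 0.5 * ||A x - b||^2 subject to x >= 0 by projected gradient."""
    n_rows, n_cols = len(A), len(A[0])
    # 1 / trace(A^T A) is a safe step: the trace bounds the largest eigenvalue.
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * n_cols
    for _ in range(iterations):
        residual = [
            sum(A[r][k] * x[k] for k in range(n_cols)) - b[r]
            for r in range(n_rows)
        ]
        gradient = [
            sum(A[r][k] * residual[r] for r in range(n_rows))
            for k in range(n_cols)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, x[k] - step * gradient[k]) for k in range(n_cols)]
    return x

def estimate_abundances(pairs, n_configs):
    """pairs: (matched configuration indices, observed intensity, total intensity)."""
    A = [
        [total if k in matched else 0.0 for k in range(n_configs)]
        for matched, _, total in pairs
    ]
    b = [observed for _, observed, _ in pairs]
    x = nnls_projected_gradient(A, b)
    total_abundance = sum(x)
    # Rescale so the abundances sum to 1 (an assumption of this sketch).
    return [v / total_abundance for v in x] if total_abundance > 0 else x

# Toy intensity pairs generated from abundances (0.4, 0.3, 0.2, 0.1).
toy_pairs = [
    ({0}, 40.0, 100.0),
    ({0, 1}, 140.0, 200.0),
    ({2}, 30.0, 150.0),
    ({1, 3}, 40.0, 100.0),
    ({0, 1, 2, 3}, 100.0, 100.0),
]
print(estimate_abundances(toy_pairs, 4))
```

With noise-free toy data the estimate should land close to the abundances used to generate it; with real spectra the residual error is generally nonzero.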
A Novel Approach for Untargeted Posttranslational Modification Identification Using Integer Linear Optimization and Tandem Mass Spectrometry
Richard C. Baliban, Peter A. DiMaggio, Mariana D. Plazas-Mayorca, Nicolas L. Young, Benjamin A. Garcia, and Christodoulos A. Floudas
Bottom-up PTM identification

Two approaches

Tags
 Non-tags
Restricted
 Unrestricted


PILOT_PTM
Preprocessing
Remove all peaks related to the precursor ion
Only keep locally significant peaks
Deisotope
Remove a neutral offset if the peak does not have a complementary peak.
Each candidate peak has a list of supporting peaks.

ILP Model

Input

A preprocessed deisotoped spectrum S = {a1, a2, …, am}
A peptide (theoretical b-ion peak list) P = {b1, b2, …, bn}
A list of all known PTMs

Theoretical peak bk

CSk is the set of all possible peaks (indices) in S that bk can be matched to with PTMs

Real peak aj

Posj is the set of all possible peaks (indices) in P that aj can be matched to with PTMs
Supportj is the set of all peaks (indices) supporting peak j in S
Multj is the set of all peaks (indices) peak j supports
ILP Model

Binary variables
pj,k = 1 if peak aj in S is matched to bk in P, otherwise pj,k = 0
yj = 1 if peak aj is a supporting peak or matched peak, otherwise yj = 0
ILP Model
Objective
 Subject to


One peak in P can only match one peak in S

One peak in S can only match one peak in P
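
Using the binary variables and the candidate sets CSk and Posj defined earlier, the two matching constraints can plausibly be written as follows. This is one natural formalization, offered as an assumption rather than as the authors' exact formulation.

```latex
\sum_{j \in CS_k} p_{j,k} \le 1 \qquad \text{for all } k = 1, \dots, n
\\
\sum_{k \in Pos_j} p_{j,k} \le 1 \qquad \text{for all } j = 1, \dots, m
```

The first line says each theoretical peak bk is matched to at most one real peak; the second says each real peak aj is matched to at most one theoretical peak.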
ILP Model
Subject to:

No three consecutive missing peaks

The intensity of peak i is counted iff there exists one peak j such that peak i supports j and peak j is a matched peak.
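
One possible way to express these two conditions with the variables pj,k and yi and the sets CSk, Posj, and Multi is sketched below. Both inequalities are assumptions made for illustration and are not necessarily the form used by the authors.

```latex
\sum_{k'=k}^{k+2} \; \sum_{j \in CS_{k'}} p_{j,k'} \ge 1 \qquad \text{for all } k = 1, \dots, n-2
\\
y_i \le \sum_{j \in Mult_i} \; \sum_{k \in Pos_j} p_{j,k} \qquad \text{for all } i = 1, \dots, m
```

The first inequality forces at least one match in every window of three consecutive theoretical peaks. The second allows yi = 1 only when some peak that i supports is matched; if the objective rewards yi by the intensity of peak i, the maximization itself supplies the converse direction.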
ILP Model

Solve using CPLEX
Report top-10 variable assignments
Existing problem
No constraints require that the distance between two neighboring matched peaks match the mass of a residue (with PTM)
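
The missing check can be illustrated with a small sketch that tests whether the gap between two peaks equals the mass of one residue, optionally carrying one PTM. The residue values are standard monoisotopic masses; the short PTM table, the tolerance, and the name `valid_jumps` are assumptions for illustration, not the PTM list used by PILOT_PTM.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# Hypothetical, deliberately short PTM table: name -> (mass shift, target residues).
PTM_MASS = {
    "methyl": (14.01565, "KR"),
    "acetyl": (42.01057, "K"),
    "phospho": (79.96633, "STY"),
}

def valid_jumps(mass_a, mass_b, tol=0.01):
    """Return (residue, ptm) pairs whose mass explains the gap between two peaks.

    ptm is None for an unmodified residue. An empty list means no single
    residue, with or without one PTM, fits the gap within tol.
    """
    delta = abs(mass_b - mass_a)
    explanations = []
    for residue, residue_mass in RESIDUE_MASS.items():
        if abs(delta - residue_mass) <= tol:
            explanations.append((residue, None))
        for ptm, (shift, targets) in PTM_MASS.items():
            if residue in targets and abs(delta - residue_mass - shift) <= tol:
                explanations.append((residue, ptm))
    return explanations

print(valid_jumps(100.0, 100.0 + 128.09496 + 14.01565))
```

A gap of about 142.11 Da is explained by a monomethylated lysine, while a gap of 50 Da has no explanation and would make the two peaks incompatible as neighbors.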
New constraints

For each pj,k:
Set of candidate ion peaks j’ with respect to k’ such that no valid jump exists between j and j’
The maximum and minimum masses that can be reached from j, respectively

New constraints

Neighboring matched peaks do not conflict

Conflicting matched peaks must have a matched peak between them

The distance between two matched peaks should be bounded
Postprocessing

Re-scoring of the 10 candidate modified peptides
Cross-correlation score
Recheck modifications if there are unmatched peaks indicating non-modification
Test data sets

Test set A: 44 CID spectra (Ion trap), 174 ETD spectra (Orbitrap) of
chemically synthesized phosphopeptides, manually validated

Test set B: 58 ECD spectra (FTICR) of Histone H3-(1–50) N-terminal Tail,
manually validated

Test set C: 553 CID spectra (Orbitrap) of Propionylated Histone Fragments,
manually validated

Test set D: 525 modified and 6025 unmodified CID spectra (Orbitrap) from the chromatin fraction, identified by SEQUEST and validated by MASCOT, with low-quality spectra removed manually

Test set E: 36 (ion trap), 37 (Q-TOF), and 4061 (Orbitrap) unmodified CID spectra, validated in the same way as test set D
Residue prediction accuracy
Peptide prediction accuracy
Comparison on test sets C and D1
Peptide and residue prediction accuracy
Comparison on test sets C and D1
Subsequence prediction accuracy
Running time

Q&A