Sixth Annual Joint
Bioinformatics Symposium
2006
Artificial Intelligence Research Laboratory
Bioinformatics and Computational Biology Program
Computational Intelligence, Learning, and Discovery Program
Department of Computer Science
Machine Learning Versus Profile-Based Methods for Protein Phosphorylation Site Prediction
Yasser EL-Manzalawy, Cornelia Caragea, Drena Dobbs, and Vasant Honavar
Prediction of Phosphorylation Sites-Motivation
Protein phosphorylation, performed by protein kinases, is a key process in signal transduction pathways. Predicting phosphorylation sites is an essential step towards understanding phosphorylation, which in turn is essential for understanding diseases and, ultimately, for designing drugs that can prevent or cure them.
Fig.1: Addition of a phosphate to an amino acid
Fig.2: Conformation changes caused by phosphorylation
In this study, we empirically compare a number of Machine Learning (ML) and profile-based methods for predicting kinase-specific protein phosphorylation sites.
We propose a method for combining PSSM profiles and ML approaches. Our proposed method yields fast and simple classifiers that consistently outperform profile-based methods for predicting kinase-specific phosphorylation sites.
Results
Table 2 compares the performance of ML methods against profile-based methods for predicting kinase-specific phosphorylation sites. We also report the ROC curves for basic PSSM and basic HMM in Fig. 3.
Conclusions
We proposed PSSMPhos, a method for combining PSSM profiles and ML methods.
Our study demonstrates the superiority of ML over profile-based methods when enough training data is available.
Our experiments suggest that ML methods and profile-based methods should complement each other to produce more efficient phosphorylation site prediction tools.
Profile-Based Approaches
• Scansite: a web service that uses 63 experimentally developed motifs, represented as PSSMs, to identify potential Ser/Thr phosphorylation sites.
• KinasePhos: another web service that uses kinase-specific HMMs for predictions.
• Basic PSSM: our implementation of PSSM motifs using the PROFILEWEIGHT program.
• Basic HMM: our implementation of HMM motifs using the HMMER software package.
Sequence-Based Machine Learning Methods
Phospho.ELM Data Set – a resource containing 1805 proteins
from different species covering 1372 Tyr, 3175 Ser and 767 Thr
experimentally verified phosphorylation sites manually curated from the
literature.
We constructed separate data sets for kinase families that are well
represented in the database, i.e., families known to recognize more
than 50 phosphorylation sites (see Table 1).
Kinase   Ser   Thr   Total
CDK      124    60     184
CK2      188    38     226
MAPK      82    26     108
PKA      222    20     242
PKB       43    12      55
PKC      215    47     262
Table 1: Kinase families considered in our study and the number of
Ser and Thr sites known to be phosphorylated
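The family selection described above can be sketched as a simple count-and-filter step. This is only an illustration: the `(kinase_family, residue)` record format and the function name `select_kinase_families` are hypothetical, and counting only Ser and Thr sites towards the threshold is an assumption suggested by the Total column of Table 1.

```python
from collections import Counter

def select_kinase_families(sites, min_sites=50):
    """Keep kinase families with more than `min_sites` Ser/Thr sites.

    `sites` is an iterable of (kinase_family, residue) pairs, where residue
    is a one-letter code. This record format is hypothetical.
    """
    per_residue = Counter()
    totals = Counter()
    for kinase, residue in sites:
        if residue in ("S", "T"):  # assumption: only Ser/Thr sites are counted
            per_residue[(kinase, residue)] += 1
            totals[kinase] += 1
    return {
        kinase: {
            "Ser": per_residue[(kinase, "S")],
            "Thr": per_residue[(kinase, "T")],
            "Total": total,
        }
        for kinase, total in totals.items()
        if total > min_sites  # "more than 50" is a strict inequality
    }
```

Applied to the curated site annotations, a filter of this kind would yield one row per retained family in the layout of Table 1.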
• The set of features for each Ser or Thr residue is based on a window of
n amino acids (n=15) centered on that residue.
• Encode each window as a 20*n binary vector, in which each entry denotes
whether or not a particular amino acid appears at a particular position.
• Using this binary encoding, evaluate the performance of Support
Vector Machine with Gaussian kernel (Bin(SVM)), Naïve Bayes
(Bin(NB)), and Decision Tree (Bin(C4.5)) machine learning algorithms.
PSSM-Based Representation – Our Approach (PSSMPhos)
• Combines profile-based and machine learning approaches.
• PSSM motifs are obtained as before for each kinase family.
• Encode each window as an (n+1)-dimensional vector, using the computed PSSM:
<e1(x1),…, en(xn), Score(x)>, where ei(xi) is the PSSM score for
observing amino acid xi at position i and Score(x) is the sum of the n
PSSM scores.
• Train kinase-specific classifiers (PSSMPhos(SVM), PSSMPhos(NB),
PSSMPhos(C4.5)) on the PSSM-based representation.
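As a rough illustration of the representation above, the sketch below builds a log-odds PSSM from positive windows and then encodes a window as <e1(x1),…, en(xn), Score(x)>. The PSSM construction shown here (add-one pseudocounts, uniform background, log base 2) is an assumption made for the example and is not necessarily what PROFILEWEIGHT computes; the function names are also hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(windows, pseudocount=1.0, background=None):
    """Build a log-odds PSSM from equal-length positive windows.

    Returns one dict per position: pssm[i][aa] = log2(p_i(aa) / q(aa)).
    Pseudocounts and the uniform background q are assumptions.
    """
    n = len(windows[0])
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    pssm = []
    for i in range(n):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for w in windows:
            if w[i] in counts:  # skip padding or unknown residues
                counts[w[i]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2(counts[aa] / total / background[aa])
                     for aa in AMINO_ACIDS})
    return pssm

def encode_pssm(window, pssm, unknown_score=0.0):
    """Encode a window as <e1(x1), ..., en(xn), Score(x)> (length n+1)."""
    scores = [pssm[i].get(aa, unknown_score) for i, aa in enumerate(window)]
    return scores + [sum(scores)]
```

Compared with the 20*n binary encoding, this gives only n+1 real-valued features per window (16 for n=15), which is consistent with the poster's description of the resulting classifiers as fast and simple.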
Fig.3: Comparison of ROC curves for BasicPSSM and BasicHMM
for the six kinase families considered
Method / Kinase      CDK     CK2     MAPK    PKA     PKB     PKC
BasicHMM             82.61   79.65   68.22   85.54   82.73   76.34
BasicPSSM            86.96   78.76   75.23   84.30   82.73   67.75
PSSMPhos(SVM)        91.03   82.74   78.04   90.70   90.00   77.48
PSSMPhos(NB)         90.49   80.31   78.04   89.05   90.00   79.39
PSSMPhos(C4.5)       90.76   79.87   70.09   86.78   87.27   73.09
Bin(SVM)             91.03   82.96   79.44   90.70   89.09   81.87
Bin(NB)              91.03   79.65   78.97   89.05   92.73   80.92
Bin(C4.5)            91.03   77.43   74.77   86.78   85.45   79.20
Method / Kinase      CDK     CK2     MAPK    PKA     PKB     PKC
Scansite(low)        -       77.21   -       84.30   -       -
Scansite(med)        -       68.14   -       71.90   -       -
Scansite(high)       -       57.08   -       59.50   -       -
KinasePhos(default)  82.88   77.43   74.07   87.60   81.82   80.92
KinasePhos(90)       83.70   76.77   72.22   89.26   81.82   81.49
Table 2: Prediction accuracy of different methods using 5-fold cross validation test
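The accuracy figures in Table 2 come from 5-fold cross validation. A minimal sketch of such an evaluation loop is shown below. The random, non-stratified fold assignment, the pooling of correct predictions across folds, and the toy `train_score_threshold` classifier (which thresholds the last feature, e.g. Score(x)) are all assumptions made for illustration; the poster does not describe these details.

```python
import random

def k_fold_accuracy(X, y, train_fn, k=5, seed=0):
    """Percentage of held-out examples classified correctly, pooled over k folds.

    train_fn(X_train, y_train) must return a callable predict(x) -> label.
    """
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        predict = train_fn([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(X[i]) == y[i] for i in fold)
    return 100.0 * correct / len(X)

def train_score_threshold(X_train, y_train):
    """Toy classifier: threshold the last feature midway between class means.

    Assumes both classes are present in the training data.
    """
    pos = [x[-1] for x, label in zip(X_train, y_train) if label == 1]
    neg = [x[-1] for x, label in zip(X_train, y_train) if label == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x[-1] >= threshold else 0
```

Any classifier can be plugged in through `train_fn`, so the same loop could in principle be reused for the binary and PSSM-based representations.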
Acknowledgements: This work is supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM 066387) to Vasant Honavar.