Download poster - Computer Science and Engineering

Sixth Annual Joint Bioinformatics Symposium 2006
Artificial Intelligence Research Laboratory
Bioinformatics and Computational Biology Program
Computational Intelligence, Learning, and Discovery Program
Department of Computer Science

Machine Learning Versus Profile-Based Methods for Protein Phosphorylation Site Prediction
Yasser EL-Manzalawy, Cornelia Caragea, Drena Dobbs, and Vasant Honavar

Prediction of Phosphorylation Sites: Motivation

Protein phosphorylation, performed by protein kinases, is a key process in signal transduction pathways. Predicting phosphorylation sites is an essential step towards understanding phosphorylation, which in turn is essential for understanding diseases and, ultimately, for designing drugs that can prevent or cure them.

Fig. 1: Addition of a phosphate to an amino acid
Fig. 2: Conformation changes caused by phosphorylation

In this study, we empirically compare a number of Machine Learning (ML) and profile-based methods for predicting kinase-specific protein phosphorylation sites. We propose a method for combining PSSM profiles and ML approaches. Our proposed method yields fast and simple classifiers that consistently outperform profile-based methods for predicting kinase-specific phosphorylation sites.

Phospho.ELM Data Set

Phospho.ELM is a resource containing 1805 proteins from different species, covering 1372 Tyr, 3175 Ser, and 767 Thr experimentally verified phosphorylation sites manually curated from the literature. We constructed separate data sets for the kinase families that are well represented in the database (i.e., families known to recognize more than 50 phosphorylation sites); see Table 1.

Kinase   Ser   Thr   Total
CDK      124    60     184
CK2      188    38     226
MAPK      82    26     108
PKA      222    20     242
PKB       43    12      55
PKC      215    47     262

Table 1: Kinase families considered in our study and the number of Ser and Thr sites known to be phosphorylated

Profile-Based Approaches

• Scansite: a web service that uses 63 experimentally developed motifs, represented as PSSMs, to identify potential Ser/Thr phosphorylation sites.
• KinasePhos: another web service, which uses kinase-specific HMMs for prediction.
• Basic PSSM: our implementation of PSSM motifs using the PROFILEWEIGHT program.
• Basic HMM: our implementation of HMM motifs using the HMMER software package.

Sequence-Based Machine Learning Methods

• The set of features for each Ser or Thr is based on a window of n amino acids (n = 15) centered on that Ser or Thr residue.
• Encode each window as a 20*n binary vector, in which each entry denotes whether or not a particular amino acid appears at a particular position.
• Using this binary encoding, evaluate the performance of Support Vector Machine with Gaussian kernel (Bin(SVM)), Naïve Bayes (Bin(NB)), and Decision Tree (Bin(C4.5)) machine learning algorithms.

PSSM-Based Representation: Our Approach (PSSMPhos)

• Combines profile-based and machine learning approaches.
• PSSM motifs are obtained as before for each kinase family.
• Encode each window as a vector of length n+1 using the computed PSSM, <e1(x1), …, en(xn), Score(x)>, where ei(xi) is the PSSM score for observing amino acid xi at position i and Score(x) is the sum of the n PSSM scores.
• Train kinase-specific classifiers (PSSMPhos(SVM), PSSMPhos(NB), PSSMPhos(C4.5)) on the PSSM-based representation.

Results

Table 2 compares the performance of ML methods against profile-based methods for predicting kinase-specific phosphorylation sites. We also report the ROC curves for Basic PSSM and Basic HMM in Fig. 3.

Method                 CDK     CK2     MAPK    PKA     PKB     PKC
BasicHMM               82.61   79.65   68.22   85.54   82.73   76.34
BasicPSSM              86.96   78.76   75.23   84.30   82.73   67.75
PSSMPhos(SVM)          91.03   82.74   78.04   90.70   90.00   77.48
PSSMPhos(NB)           90.49   80.31   78.04   89.05   90.00   79.39
PSSMPhos(C4.5)         90.76   79.87   70.09   86.78   87.27   73.09
Bin(SVM)               91.03   82.96   79.44   90.70   89.09   81.87
Bin(NB)                91.03   79.65   78.97   89.05   92.73   80.92
Bin(C4.5)              91.03   77.43   74.77   86.78   85.45   79.20
KinasePhos (default)   82.88   77.43   74.07   87.60   81.82   80.92
KinasePhos (90)        83.70   76.77   72.22   89.26   81.82   81.49

Scansite (low): 77.21, 84.30, -
Scansite (med): 68.14, 71.90, -
Scansite (high): 57.08, 59.50, -

Table 2: Prediction accuracy (%) of different methods using 5-fold cross-validation

Fig. 3: Comparison of ROC curves for BasicPSSM and BasicHMM for the six kinase families considered

Conclusions

We proposed PSSMPhos, a method for combining PSSM profiles and ML methods. Our study demonstrates the superiority of ML over profile-based methods when enough training data is available. Our experiments suggest that ML methods and profile-based methods should complement each other to produce more efficient phosphorylation site prediction tools.

Acknowledgements

This work is supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM 066387) to Vasant Honavar.
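The two window encodings described in the poster, the 20*n binary vector and the PSSM-based vector of length n+1, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the amino acid ordering, the choice to skip Ser/Thr residues that lie too close to a sequence end for a full window, and the toy PSSM scores are all assumptions made for illustration. The PSSM is taken as given (a list with one score dictionary per window position) rather than computed.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; this ordering is an assumption


def extract_windows(sequence, n=15, targets="ST"):
    """Return (index, window) pairs for each Ser/Thr with a full n-residue window.

    Assumes n is odd. Residues too close to either end are skipped,
    which is an assumption; the poster does not say how they were handled.
    """
    half = n // 2
    windows = []
    for i, residue in enumerate(sequence):
        if residue in targets and i - half >= 0 and i + half < len(sequence):
            windows.append((i, sequence[i - half:i + half + 1]))
    return windows


def binary_encode(window):
    """Encode a window as a 20*n binary vector (one 20-entry block per position)."""
    vector = []
    for residue in window:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


def pssm_encode(window, pssm):
    """Encode a window as <e1(x1), ..., en(xn), Score(x)>.

    pssm[i][aa] is the score of amino acid aa at window position i;
    Score(x) is the sum of the n per-position scores.
    """
    scores = [pssm[i][residue] for i, residue in enumerate(window)]
    return scores + [sum(scores)]


# Toy example with n = 3 and made-up PSSM scores (hypothetical values).
toy_pssm = [{aa: 0.0 for aa in AMINO_ACIDS} for _ in range(3)]
toy_pssm[0]["R"] = 1.5
toy_pssm[1]["S"] = 2.0
toy_pssm[2]["P"] = 0.5

windows = extract_windows("ARSPK", n=3)          # [(2, "RSP")]
binary_vector = binary_encode(windows[0][1])     # length 60, three entries set to 1
pssm_vector = pssm_encode(windows[0][1], toy_pssm)  # [1.5, 2.0, 0.5, 4.0]
```

With n = 15 the binary encoding has 300 entries per site while the PSSM-based encoding has 16, which is consistent with the poster's description of the PSSMPhos classifiers as fast and simple.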
