Frequent-Subsequence-Based Prediction of Outer Membrane Proteins
R. She, F. Chen, K. Wang, M. Ester, School of Computing Science
J. L. Gardy, F. S. L. Brinkman, Dept. of Molecular Biology & Biochemistry
Simon Fraser University, BC, Canada

1. Problem Introduction

Gram-negative bacteria
* Medically important disease-causing bacteria
* 5 sub-cellular localizations (2 layers of membranes):
  1. Cytoplasmic
  2. Inner membrane
  3. Periplasmic
  4. Outer membrane
  5. Extra-cellular

Outer Membrane Proteins
* Predicting outer membrane proteins (OMPs) of Gram-negative bacteria
* OMPs are attached to the outer membrane of the Gram-negative bacterial cell
* Particularly useful as drug targets

Outer Membrane Protein (Cont.)
* Structure: β-strands form a central barrel shape
  * Inner turns: shorter stretches, on the periplasmic side
  * Outer loops: longer stretches, on the extracellular side
[Figure of the β-barrel spanning the outer membrane, with extracellular side, periplasmic side, β-strands and inner turns labeled, omitted]

Challenges
* Identifying OMPs from sequence information alone
* Discriminative sequence patterns of OMPs would be helpful

Challenges (Cont.)
* Favor precision over recall: studying a targeted drug in the lab takes lengthy time and laborious effort, so false positives are costly

Confusion matrix:

                          Actual OMP   Actual non-OMP   Subtotal
  Classified as OMP       TP           FP               A
  Classified as non-OMP   FN           TN               B
  Subtotal                C            D                E

  Overall accuracy = (TP + TN) / E
  Precision = TP / A
  Recall = TP / C

2. Dataset

OMP sequence dataset
* Excellent quality (http://www.psort.org/dataset)
* Protein sequences (strings over an alphabet of 20 letters), e.g. MNQIHK...
* Two classes with imbalanced distributions:

  Data      Sequences   Class %   Min length   Max length   Avg length
  OMP       427         27.4%     91           3705         571.1
  Non-OMP   1132        72.6%     50           1034         256.8
  Total     1559                                            342.9

Evaluation
* Since the majority of the data is non-OMP, overall accuracy is determined mainly by non-OMP prediction
* Precision is our main concern (target: 90%)
* Recall should be maintained at a reasonable level (target: 50%)
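The evaluation measures defined above can be sketched directly from the confusion-matrix counts. This is a minimal illustration, not code from the study; the example counts are hypothetical, chosen only to show values near the stated precision and recall targets.

```python
# Minimal sketch: evaluation metrics from confusion-matrix counts.
# Variable names follow the table (A, C, E are the marginal totals).

def evaluate(tp: int, fp: int, fn: int, tn: int):
    """Return (overall accuracy, precision, recall) for the OMP class."""
    total = tp + fp + fn + tn      # E: all cases
    classified_omp = tp + fp       # A: everything predicted OMP
    actual_omp = tp + fn           # C: everything that really is OMP
    accuracy = (tp + tn) / total
    precision = tp / classified_omp
    recall = tp / actual_omp
    return accuracy, precision, recall

# Hypothetical counts: precision ~ 0.91, recall = 0.5
acc, prec, rec = evaluate(tp=40, fp=4, fn=40, tn=220)
```

With these counts, precision stays above the 90% target even though half the true OMPs are missed, which is exactly the trade-off the slides argue for.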
3. Related Work

Existing sub-cellular localization predictors
* Inner membrane proteins have α-helix structures, so their prediction is highly accurate
* Prediction of cytoplasmic, periplasmic and extracellular proteins: neural networks, covariant discriminant algorithm, Markov chain models, support vector machines (highest accuracy: 91%)
* These predictors do not apply to OMPs

Existing work on OMP prediction
* Neural networks, hydrophobicity analysis, combinations of methods (homology analysis, amino acid abundance)
* Current state of the art: Hidden Markov Models by Martelli et al. [1]
  * Uses an HMM to model OMPs according to their 3D structures
  * Training set is small (12 proteins with known 3D structures)
  * Overall accuracy: 89%; recall: 84%; precision: 46%

4. Algorithms

Motivations
* Frequent subsequence mining is helpful
  * A frequent subsequence is a run of consecutive amino acids that occurs frequently in OMPs
* OMP characteristics
  * Common structure across OMPs
  * Different regions have different characteristic sequence residues
* Model local similarities by frequent subsequences and highly variable regions by wild cards (*X*X*...) => association-rule-based classification

Algorithm 1: Rule-Based Classification
* Mine frequent subsequences X (consecutive amino acids) only from the OMP class (support(X) >= MinSup)
* Remove trivial similarities by restricting the minimum length (MinLgh) of frequent subsequences
* Find frequent patterns (*X*X*...)
* Build a classifier using frequent-pattern rules (*X*X*... => OMP)

Algorithm 1: Refined
* The previous classifier performs well in precision but poorly in recall
* A second-level classifier is built on top of the existing one
  * New training data: the cases covered by the default rule in the first classifier
  * Apply the same pattern-mining and classifier-building process
* A future case is first matched against the 1st classifier; if it is classified as OMP, we accept that; otherwise the 2nd classifier is used
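The mining and matching steps of Algorithm 1 can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the parameter values (MIN_SUP, MIN_LGH, and a MAX_LGH cap added to keep enumeration cheap) and the toy sequences are assumptions, and a pattern is represented simply as the ordered list of its subsequence parts, with the wild cards implicit between them.

```python
# Sketch of Algorithm 1's mining step: enumerate consecutive subsequences of
# the OMP sequences and keep those with support >= MIN_SUP and length >= MIN_LGH.
from collections import defaultdict

MIN_SUP = 0.5   # min fraction of OMP sequences containing the subsequence
MIN_LGH = 3     # min subsequence length, to remove trivial similarities
MAX_LGH = 5     # illustrative cap on enumerated length (not from the paper)

def frequent_subsequences(seqs):
    counts = defaultdict(int)
    for s in seqs:
        seen = set()
        for k in range(MIN_LGH, MAX_LGH + 1):
            for i in range(len(s) - k + 1):
                seen.add(s[i:i + k])
        for sub in seen:               # count each subsequence once per sequence
            counts[sub] += 1
    min_count = MIN_SUP * len(seqs)
    return {sub for sub, c in counts.items() if c >= min_count}

def matches_pattern(seq, parts):
    """True if a *X1*X2*... pattern matches: parts occur in order in seq."""
    pos = 0
    for part in parts:
        i = seq.find(part, pos)
        if i < 0:
            return False
        pos = i + len(part)
    return True

# Toy OMP-class sequences (hypothetical); "MNQ" and "AAA" occur in all three.
omps = ["MNQIHKAAA", "XXMNQGGAAA", "MNQZZAAAQ"]
subs = frequent_subsequences(omps)
print(matches_pattern("YYMNQPPAAA", ["MNQ", "AAA"]))   # True: *MNQ*AAA* matches
```

A rule *MNQ*AAA* => OMP would then classify any sequence matched by `matches_pattern` as OMP; unmatched cases fall through to the default rule (and, in the refined version, to the second-level classifier).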
Algorithm 2: SVM-Based Classification
* Support Vector Machines (SVMs) [5]
  * Excellent performers in previous biological sequence classification
  * Data must be transformed for SVMs to be used (sequences => vectors)
* Frequent subsequences of OMPs are used as features; protein sequences are mapped into binary vectors

5. Empirical Studies

Classification methods compared (5 in total):
* Single-level Rule-Based Classification (SRB)
* Refined Rule-Based Classification (RRB)
* SVM-based Classification (SVM-light [6])
* Martelli's HMM
* See5 (latest version of C4.5)
5-fold cross validation (same folds for all algorithms)

Summary of Classifier Comparison
[Bar chart of precision and recall for SRB, RRB, HMM, SVM and See5 omitted]
* SVM outperforms all other methods
* RRB is the 2nd-best performer
* Both SVM and RRB outperform the HMM
* The improvement from SRB to RRB shows that the refinement works

Other Biological Benefits (Rule-Based Classifiers)
* Sequential rules (obtained by SRB/RRB) lead to biological insights
  * They map to both β-strand and periplasmic-turn regions
  * They assist in developing 3D models for proteins
* Identification of primary drug-target regions: conserved sequences in the surface-exposed regions are ideal targets for new diagnostics and drugs

6. Conclusions and Future Work

Contributions
* Provided excellent predictors for OMP prediction
* Obtained interpretable sequential patterns for further biological benefits
* Proposed the use of frequent subsequences for SVM feature extraction
* Demonstrated the usefulness of data mining techniques in biological sequence analysis

Future Work
* Include more sequence features, e.g., secondary structure, additional protein properties, barrel and turn sizes, polarity of amino acids
* Explore ways to extract symbolic information from SVMs

References
1. Martelli P., Fariselli P., Krogh A. and Casadio R., A sequence-profile-based HMM for predicting and discriminating barrel membrane proteins, Bioinformatics, 18(1), S46-S53, 2002.
2. Wang J., Chirn G., Marr T., Shapiro B., Shasha D. and Zhang K., Combinatorial Pattern Discovery for Scientific Data: Some Preliminary Results, SIGMOD '94, Minnesota, USA, 1994.
3. Wang K., Zhou S. and He Y., Growing Decision Tree on Support-less Association Rules, KDD '00, Boston, MA, USA, 2000.
4. Quinlan J., C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
5. Vapnik V., The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
6. Joachims T., Learning to Classify Text Using Support Vector Machines. Dissertation, Kluwer, 2002. Software available at http://svmlight.joachims.org/
7. Rulequest Research, Information on See5/C5.0, http://www.rulequest.com/see5info.html
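As a closing illustration of the feature extraction proposed for Algorithm 2 above: each mined frequent subsequence becomes one binary feature, set to 1 when the protein sequence contains it. This is a minimal sketch, not the authors' code; the feature list is hypothetical, not mined from the real dataset.

```python
# Sketch of the SVM feature mapping: sequences => binary vectors over the
# mined frequent subsequences. The features below are illustrative only.

features = ["MNQ", "AAA", "GGG"]   # hypothetical frequent subsequences

def to_binary_vector(seq, features):
    """One binary feature per frequent subsequence: present (1) or absent (0)."""
    return [1 if f in seq else 0 for f in features]

vec = to_binary_vector("MNQIHKAAA", features)   # -> [1, 1, 0]
```

Vectors of this form, with class labels attached, are what a trainer such as SVM-light [6] consumes in place of the raw amino-acid strings.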