Assessing the Performance of Macromolecular Sequence Classifiers

Cornelia Caragea
Iowa State University, Department of Computer Science, Artificial Intelligence Research Laboratory
Joint work with Jivko Sinapov, Drena Dobbs, and Vasant Honavar
October 15, 2007

Research supported in part by a grant from the National Institutes of Health (GM066387).

Background and Motivation
  Machine learning methods offer some of the most cost-effective approaches to building predictive models
  One problem, multiple approaches
  Needed: comparing the effectiveness of different predictive classifiers
  Difficulty: different data selection and evaluation procedures

Outline
  Macromolecular Sequence Classification
  Performance Evaluation
  Window-Based Cross-Validation
  Sequence-Based Cross-Validation
  Experiments
  Conclusions

Macromolecular Sequence Classification
  Predict a label for each element in a given sequence
  Example: Identify post-translational modification residues
    H3N+ M K L L S P I L T L I L F R S C L T Q S Q E E S I D COO-
    [Figure annotations on individual residues: "Glycosylated?", "Phosphorylated?"]

Macromolecular Sequence Classification
  Example: Identify RNA-binding residues
    1T0K_B
    SINQKLALVIKSGKYTLGYKSTVKSLRQGKSKLIIIAANTPVLRKSELEYYAMLSKTKVYYFQGGNNELGTAVGKLFRVGVVSILEAGDSDILTTLA
    0000000000000000111110010000000000000001100100000000000000000000010000000001111100000000000000000

  [Figure: learning pipeline with the labels All Data, Training Data, Test Data, Learning System, Resulting Classifier, Validation, and Performance on test set]

  Sliding Window Approach:
    Sequence: DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK
    Class:    1111110011111110011111001011111100000001111101000000

    Windows, each paired with the class label of its target residue:
      ...
      VKKFGGEVVKAGNIL,0
      KKFGGEVVKAGNILV,0
      KFGGEVVKAGNILVR,1
      FGGEVVKAGNILVRQ,1
      ...

Performance Evaluation
  K-Fold Cross-Validation:
  [Figure: data partitioned into S1, S2, ..., Sk-1, Sk; learn classifier C, evaluate classifier C, repeat k times]
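
The sliding-window extraction and the k-fold partition described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `extract_windows` and `k_fold_indices` are hypothetical, the window width of 15 is inferred from the example windows on the slide, and residues too close to either end of the sequence are simply skipped, a choice the slides do not specify.

```python
from typing import List, Tuple


def extract_windows(sequence: str, labels: str, width: int = 15) -> List[Tuple[str, int]]:
    """Return (window, label) pairs, one per residue that has a full window.

    The label of a window is the class of its central (target) residue.
    Assumption: residues without a full window on both sides are skipped.
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    if width % 2 == 0:
        raise ValueError("width must be odd so that a central residue exists")
    half = width // 2
    windows = []
    for center in range(half, len(sequence) - half):
        window = sequence[center - half:center + half + 1]
        windows.append((window, int(labels[center])))
    return windows


def k_fold_indices(n_items: int, k: int) -> List[List[int]]:
    """Assign item indices 0..n_items-1 to k disjoint folds, round-robin.

    Assumption: no shuffling, to keep the sketch deterministic.
    """
    return [list(range(fold, n_items, k)) for fold in range(k)]


# The sequence and class string from the sliding-window slide.
sequence = "DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK"
classes = "1111110011111110011111001011111100000001111101000000"

windows = extract_windows(sequence, classes)
folds = k_fold_indices(len(windows), k=5)

for window, label in windows[8:12]:
    print(f"{window},{label}")
```

The four printed windows are the ones listed on the slide. In each cross-validation round, one entry of `folds` would serve as the test set and the remaining entries as the training set.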

Window-Based Cross-Validation
  Procedure:
    Extract windows from all sequences in the dataset
    Partition the set of windows into k disjoint subsets
    Perform standard cross-validation
  [Figure: windows partitioned into S1, S2, ..., Sk-1, Sk; learn classifier C, evaluate classifier C, repeat k times]

Sequence-Based Cross-Validation
  Procedure:
    Partition the set of sequences into k disjoint subsets
    Extract windows from sequences in each subset
    Perform standard cross-validation
  [Figure: sequences partitioned into S1, S2, ..., Sk-1, Sk; learn classifier C, evaluate classifier C, repeat k times]

Window-Based vs. Sequence-Based Cross-Validation
  Window-Based Cross-Validation:
    Train and test sets are likely to contain some windows that originate from the same sequence.
    This violates the independence assumption between train and test sets.
  Sequence-Based Cross-Validation:
    Windows belonging to the same sequence end up in the same set.

Machine Learning Classifiers
  Support Vector Machine: 0/1 String Kernel
    K(x, y) = \left( \sum_{i=1}^{|w|} I[x_i = y_i] \right)^{e}
    Example:
      x = VKKFGGEVVKAGNIL
      y = KKFGGEVVKAGNILV
      I[x_i = y_i] = 010010010000000
  Naïve Bayes: Identity
    Window: VKKFGGEVVKAGNIL
    x = V,K,K,F,G,G,E,V,V,K,A,G,N,I,L
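
The 0/1 string kernel example above can be checked with a short Python sketch. It assumes one reading of the slide's formula: the kernel counts the positions at which two equal-length windows carry the same residue and raises that count to an exponent e. The function names `match_indicator` and `string_kernel`, and the default `exponent=1`, are hypothetical choices made for illustration.

```python
from typing import List


def match_indicator(x: str, y: str) -> str:
    """Return the 0/1 string I[x_i = y_i] for two windows of equal length."""
    if len(x) != len(y):
        raise ValueError("windows must have the same length")
    return "".join("1" if a == b else "0" for a, b in zip(x, y))


def string_kernel(x: str, y: str, exponent: int = 1) -> int:
    """Count position-wise matches and raise the count to `exponent`.

    Assumption: `exponent` plays the role of e in the slide's formula.
    """
    matches = match_indicator(x, y).count("1")
    return matches ** exponent


def identity_features(window: str) -> List[str]:
    """Naive Bayes input: one categorical feature per window position."""
    return list(window)


x = "VKKFGGEVVKAGNIL"
y = "KKFGGEVVKAGNILV"

print(match_indicator(x, y))
print(string_kernel(x, y))
print(",".join(identity_features(x)))
```

For the two example windows the indicator string matches the one on the slide, so the match count is 3.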

Datasets
  O-GlycBase dataset: contains experimentally verified glycosylation sites
    http://www.cbs.dtu.dk/databases/OGLYCBASE/
  RNA-Protein Interface dataset, RB147: consists of RNA-binding protein sequences extracted from structures of known RNA-protein complexes solved by X-ray crystallography in the Protein Data Bank
    http://bindr.gdcb.iastate.edu/RNABindR/
  Protein-Protein Interface dataset: consists of protein-binding protein sequences

Datasets
  Number of positive and negative instances used in our experiments

  Dataset          Number of Sequences  Number of + Instances  Number of - Instances
  O-GlycBase       216                  2168                   12147
  RNA-Protein      147                  4336                   27988
  Protein-Protein  42                   2350                   9204

Experimental Design
  Questions:
    How does Sequence-Based Cross-Validation compare with Window-Based Cross-Validation?
    How do the results vary when we vary the size of the dataset?
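
The first question can be made concrete by looking at how each scheme assigns windows to folds. The sketch below is an illustration under stated assumptions, not the experimental code: the data are a toy set of five hypothetical sequences with seven windows each, and the helper names `window_based_folds`, `sequence_based_folds`, and `shared_sequences` are made up for this example.

```python
import random
from typing import Dict, List, Sequence, Set


def window_based_folds(window_owner: Sequence[str], k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle all windows and deal them into k folds, ignoring their source."""
    order = list(range(len(window_owner)))
    random.Random(seed).shuffle(order)
    return [order[fold::k] for fold in range(k)]


def sequence_based_folds(window_owner: Sequence[str], k: int, seed: int = 0) -> List[List[int]]:
    """Deal whole sequences into k folds; each window follows its sequence."""
    seq_ids = sorted(set(window_owner))
    random.Random(seed).shuffle(seq_ids)
    fold_of: Dict[str, int] = {seq_id: i % k for i, seq_id in enumerate(seq_ids)}
    folds: List[List[int]] = [[] for _ in range(k)]
    for index, seq_id in enumerate(window_owner):
        folds[fold_of[seq_id]].append(index)
    return folds


def shared_sequences(window_owner: Sequence[str], folds: List[List[int]], test_fold: int) -> Set[str]:
    """Sequences that contribute windows to both the train and the test side."""
    test = {window_owner[i] for i in folds[test_fold]}
    train = set()
    for fold_number, fold in enumerate(folds):
        if fold_number != test_fold:
            train.update(window_owner[i] for i in fold)
    return test & train


# Toy data (assumption): window_owner[i] is the sequence that window i came from.
window_owner = [f"seq{s}" for s in range(5) for _ in range(7)]
k = 3

wb = window_based_folds(window_owner, k)
sb = sequence_based_folds(window_owner, k)

leak_wb = [shared_sequences(window_owner, wb, f) for f in range(k)]
leak_sb = [shared_sequences(window_owner, sb, f) for f in range(k)]

print("window-based, shared sequences per test fold:", [len(s) for s in leak_wb])
print("sequence-based, shared sequences per test fold:", [len(s) for s in leak_sb])
```

With these toy sizes the window-based folds hold 12, 12, and 11 windows, none a multiple of 7, so every test fold necessarily splits at least one sequence between train and test. The sequence-based folds never do. Running the same learner under both assignments and comparing the resulting scores is one way to set up the comparison.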

Results
  [Figure: Receiver Operating Characteristic (ROC) curves for Window-Based and Sequence-Based 10-Fold Cross-Validation using SVM on O-GlycBase]

Results
  [Figure: CC and AUC results for a) O-GlycBase, b) RNA-Protein Interface, c) Protein-Protein Interface]

Conclusions
  We compared two variants of k-fold cross-validation: window-based and sequence-based.
  The comparison shows that Window-Based CV overestimates the performance of the classifiers relative to Sequence-Based CV.
  Sequence-Based CV provides more realistic estimates of performance, because predictors trained on labeled sequence data have to predict the labels for residues in a novel sequence.

Vasant Honavar, Jivko Sinapov, Drena Dobbs