Iowa State University
Department of Computer Science
Artificial Intelligence Research Laboratory
Assessing the Performance of Macromolecular Sequence Classifiers
Cornelia Caragea
Iowa State University
Joint work with Jivko Sinapov, Drena Dobbs, and Vasant Honavar
October 15, 2007
Research supported in part by a grant from the National Institutes of Health (GM066387).
Background and Motivation
• Machine learning methods offer some of the most cost-effective approaches to building predictive models
• One problem, multiple approaches
• Needed: a way to compare the effectiveness of different predictive classifiers
• Difficulty: studies differ in their data selection and evaluation procedures
Outline
• Macromolecular Sequence Classification
• Performance Evaluation
  ◦ Window-Based Cross-Validation
  ◦ Sequence-Based Cross-Validation
• Experiments
• Conclusions
Macromolecular Sequence Classification
• Predict a label for each element in a given sequence
• Example:
  ◦ Identify post-translational modification residues
[Figure: a peptide chain drawn from the N-terminus (H3N+) to the C-terminus (COO-), with individual residues marked "Glycosylated?" and "Phosphorylated?"]
Macromolecular Sequence Classification
• Example:
  ◦ Identify RNA-binding residues
1T0K_B (1 = RNA-binding residue, 0 = non-binding residue):
SINQKLALVIKSGKYTLGYKSTVKSLRQGKSKLIIIAANTPVLRKSELEYYAMLSKTKVYYFQGGNNELGTAVGKLFRVGVVSILEAGDSDILTTLA
0000000000000000111110010000000000000001100100000000000000000000010000000001111100000000000000000
Macromolecular Sequence Classification
[Figure: All Data is divided into Training Data and Test Data. The Learning System uses the Training Data to produce the Resulting Classifier, whose performance is validated on the test set.]
Macromolecular Sequence Classification
• Sliding Window Approach:
Sequence: DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK
Class:    1111110011111110011111001011111100000001111101000000
Each window is centered on its target residue and carries that residue's class label:
...
VKKFGGEVVKAGNIL,0
KKFGGEVVKAGNILV,0
KFGGEVVKAGNILVR,1
FGGEVVKAGNILVRQ,1
...
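The sliding window approach can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `extract_windows` is hypothetical, the window size of 15 is read off the example windows on the slide, and skipping positions near the ends of the sequence (rather than padding them) is an assumption.

```python
def extract_windows(sequence, labels, window_size=15):
    """Return (window, label) pairs, where label is the class of the central residue."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    half = window_size // 2
    windows = []
    # Assumption: only positions with a full window are kept (no padding at the termini).
    for center in range(half, len(sequence) - half):
        window = sequence[center - half:center + half + 1]
        windows.append((window, int(labels[center])))
    return windows


# Sequence and class string taken from the slide.
sequence = "DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK"
labels = "1111110011111110011111001011111100000001111101000000"

windows = extract_windows(sequence, labels)
# Print the four windows shown on the slide.
for window, label in windows[8:12]:
    print(f"{window},{label}")
```

Under these assumptions the four printed lines reproduce the window/label pairs listed on the slide.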
Performance Evaluation
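K-Fold Cross-Validation:
• Partition the data into k disjoint subsets S1, S2, ..., Sk-1, Sk
• Learn classifier C on k-1 of the subsets
• Evaluate classifier C on the remaining subset
• Repeat k times, so that each subset serves once as the test set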
Window-Based Cross-Validation
Procedure:
• Extract windows from all sequences in the dataset
• Partition the set of windows into k disjoint subsets S1, S2, ..., Sk-1, Sk
• Perform standard cross-validation over the window subsets: learn classifier C, evaluate classifier C, repeat k times
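The window-based procedure might look as follows in Python. This is only a sketch: the function names, the random shuffling with a fixed seed, the round-robin assignment to folds, and the toy sequences are all assumptions made for illustration and are not taken from the slides.

```python
import random


def extract_windows(seq_id, sequence, labels, window_size):
    """Label each full-length window with the class of its central residue."""
    half = window_size // 2
    return [(seq_id, sequence[c - half:c + half + 1], int(labels[c]))
            for c in range(half, len(sequence) - half)]


def window_based_folds(dataset, k, window_size, seed=0):
    # Step 1: extract windows from all sequences in the dataset.
    windows = [w for seq_id, seq, lab in dataset
               for w in extract_windows(seq_id, seq, lab, window_size)]
    # Step 2: partition the pooled windows into k disjoint subsets.
    random.Random(seed).shuffle(windows)
    return [windows[i::k] for i in range(k)]


def cross_validation_splits(folds):
    # Step 3: each subset is the test set once; the others form the training set.
    for i, test in enumerate(folds):
        train = [w for j, fold in enumerate(folds) if j != i for w in fold]
        yield train, test


# Made-up toy sequences and labels, for illustration only.
toy_dataset = [
    ("seqA", "ACDEFGHIKLMNPQRSTVWY", "00110011001100110011"),
    ("seqB", "YWVTSRQPNMLKIHGFEDCA", "11001100110011001100"),
    ("seqC", "MKLSPILTLILFRSCLTQSQ", "00000111110000011111"),
]

folds = window_based_folds(toy_dataset, k=4, window_size=5)
for train, test in cross_validation_splits(folds):
    print(len(train), "training windows,", len(test), "test windows")
```

Because the windows are pooled before partitioning, nothing in this split keeps windows from the same sequence together.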
Sequence-Based Cross-Validation
Procedure:
• Partition the set of sequences into k disjoint subsets S1, S2, ..., Sk-1, Sk
• Extract windows from the sequences in each subset
• Perform standard cross-validation over the sequence subsets: learn classifier C, evaluate classifier C, repeat k times
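A sequence-based split reverses the order of the first two steps: sequences are partitioned first, and windows are extracted afterwards. The sketch below is an illustration under assumptions (hypothetical function names, seeded shuffling, round-robin fold assignment, made-up toy data), not the authors' code.

```python
import random


def extract_windows(seq_id, sequence, labels, window_size):
    """Label each full-length window with the class of its central residue."""
    half = window_size // 2
    return [(seq_id, sequence[c - half:c + half + 1], int(labels[c]))
            for c in range(half, len(sequence) - half)]


def sequence_based_folds(dataset, k, window_size, seed=0):
    # Step 1: partition the set of sequences into k disjoint subsets.
    sequences = list(dataset)
    random.Random(seed).shuffle(sequences)
    sequence_folds = [sequences[i::k] for i in range(k)]
    # Step 2: extract windows from the sequences in each subset.
    return [[w for seq_id, seq, lab in fold
             for w in extract_windows(seq_id, seq, lab, window_size)]
            for fold in sequence_folds]


# Made-up toy sequences and labels, for illustration only.
toy_dataset = [
    ("seqA", "ACDEFGHIKLMNPQRSTVWY", "00110011001100110011"),
    ("seqB", "YWVTSRQPNMLKIHGFEDCA", "11001100110011001100"),
    ("seqC", "MKLSPILTLILFRSCLTQSQ", "00000111110000011111"),
    ("seqD", "QSQTLCSRFLILTLIPSLKM", "11111000001111100000"),
]

folds = sequence_based_folds(toy_dataset, k=2, window_size=5)
# Step 3: standard cross-validation over the folds.
for i, test in enumerate(folds):
    train = [w for j, fold in enumerate(folds) if j != i for w in fold]
    train_ids = {w[0] for w in train}
    test_ids = {w[0] for w in test}
    print(sorted(test_ids), "held out; shared sequences:", train_ids & test_ids)
```

With this ordering, the set of sequence identifiers shared by the training and test windows is empty in every round.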
Window-Based vs. Sequence-Based Cross-Validation
• Window-Based Cross-Validation:
  ◦ Train and test sets are likely to contain some windows that originate from the same sequence.
  ◦ This violates the assumption that the train and test sets are independent.
• Sequence-Based Cross-Validation:
  ◦ Windows belonging to the same sequence end up in the same set.
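One way to see why windows from the same sequence are not independent is to count how many residues two windows share. The sketch below uses the 52-residue example sequence from the sliding window slide with a window size of 15; the helper `shared_residues`, the choice of k = 10, and the seeded shuffle are assumptions made for illustration.

```python
import random

# Example sequence from the sliding window slide.
sequence = "DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK"
window_size = 15

# Identify each window by its start position in the sequence.
starts = list(range(len(sequence) - window_size + 1))


def shared_residues(start_a, start_b, size):
    """Number of sequence positions covered by both windows."""
    return max(0, size - abs(start_a - start_b))


# Neighbouring windows such as VKKFGGEVVKAGNIL and KKFGGEVVKAGNILV differ by one residue.
print(shared_residues(8, 9, window_size))

# Window-based split of this one sequence's windows: fold 0 of k = 10 is the test set.
k = 10
shuffled = starts[:]
random.Random(0).shuffle(shuffled)
test = set(shuffled[0::k])
train = set(shuffled) - test

# Test windows that overlap at least one training window.
overlapping = [s for s in test
               if any(shared_residues(s, t, window_size) > 0 for t in train)]
print(len(overlapping), "of", len(test), "test windows overlap a training window")
```

For a single sequence of this length, every test window necessarily overlaps some training window, whatever the shuffle, because each window has far more neighbours than the test fold has members.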
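Machine Learning Classifiers
• Support Vector Machine:
  ◦ 0/1 String Kernel:
    $K(x, y) = \left( \sum_{i=1}^{|w|} I[x_i = y_i] \right)^{e}$
    where |w| is the window length and I[x_i = y_i] is 1 if x and y have the same residue at position i, and 0 otherwise
• Example:
    x = VKKFGGEVVKAGNIL
    y = KKFGGEVVKAGNILV
    I[x_i = y_i] = 010010010000000
• Naïve Bayes:
  ◦ Identity Window: each residue in the window is one feature
    VKKFGGEVVKAGNIL
    x = V,K,K,F,G,G,E,V,V,K,A,G,N,I,L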
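The kernel example and the identity-window encoding can be checked with a short Python sketch. It assumes that the trailing e in the kernel formula is an exponent applied to the number of matching positions; the function names and the default `exponent=1` are hypothetical and chosen only for illustration.

```python
def match_indicator(x, y):
    """Position-wise indicator I[x_i = y_i] for two windows of equal length."""
    if len(x) != len(y):
        raise ValueError("windows must have the same length")
    return [1 if a == b else 0 for a, b in zip(x, y)]


def zero_one_string_kernel(x, y, exponent=1):
    """Count of matching positions, raised to an (assumed) exponent e."""
    return sum(match_indicator(x, y)) ** exponent


x = "VKKFGGEVVKAGNIL"
y = "KKFGGEVVKAGNILV"

print("".join(str(bit) for bit in match_indicator(x, y)))  # 010010010000000
print(zero_one_string_kernel(x, y))                        # 3
print(zero_one_string_kernel(x, y, exponent=2))            # 9

# Identity window for Naive Bayes: one categorical feature per window position.
identity_features = list(x)
print(",".join(identity_features))  # V,K,K,F,G,G,E,V,V,K,A,G,N,I,L
```

The indicator string matches the one on the slide: the two windows agree at three positions.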
Datasets
• O-GlycBase dataset:
  ◦ contains experimentally verified glycosylation sites
  ◦ http://www.cbs.dtu.dk/databases/OGLYCBASE/
• RNA-Protein Interface dataset, RB147:
  ◦ consists of RNA-binding protein sequences extracted from structures of known RNA-protein complexes solved by X-ray crystallography in the Protein Data Bank
  ◦ http://bindr.gdcb.iastate.edu/RNABindR/
• Protein-Protein Interface dataset:
  ◦ consists of protein-binding protein sequences
Datasets
Number of positive and negative instances used in our experiments

Dataset           Number of Sequences   Number of + Instances   Number of - Instances
O-GlycBase                        216                    2168                   12147
RNA-Protein                       147                    4336                   27988
Protein-Protein                    42                    2350                    9204
Experimental Design
Questions:
• How does Sequence-Based Cross-Validation compare with Window-Based Cross-Validation?
• How do the results change as the size of the dataset varies?
Results
[Figure: Receiver Operating Characteristic (ROC) curves for Window-Based and Sequence-Based 10-Fold Cross-Validation using SVM on the O-GlycBase dataset]
Results
[Figure: CC and AUC (area under the ROC curve) plots in three panels: a) O-GlycBase, b) RNA-Protein Interface, c) Protein-Protein Interface]
Conclusions
• We compared two variants of k-fold cross-validation: window-based and sequence-based.
• The comparison shows that Window-Based CV overestimates the performance of the classifiers relative to Sequence-Based CV.
• Sequence-Based CV provides more realistic estimates of performance, because predictors trained on labeled sequence data have to predict the labels for residues in a novel sequence.
Vasant Honavar
Jivko Sinapov
Drena Dobbs