Active Learning Strategies for Drug Screening

Megon Walker 1,2 and Simon Kasif 1
1 Bioinformatics Program, Boston University, Boston, MA
2 Department of Biomedical Engineering, Boston University, Boston, MA

1. Introduction
At the intersection of drug discovery and experimental design, active learning algorithms guide the selection of successive compound batches for biological assays when screening a chemical library, in order to identify many target-binding compounds with minimal screening iterations.1-3

2. Objectives
• exploitation: optimize the number of target-binding (active) drugs retrieved with each batch
• exploration: optimize the prediction accuracy of the committee during each iteration of querying

3. Methods
Datasets
• a binary feature vector for each compound indicates the presence or absence of structural fragments
• the 200 features with the highest feature-activity mutual information (MI) were selected for each dataset
• retrospective data: labels are provided with the features
• labels: target binding or active (A); non-binding or inactive (I)
• 632 DuPont thrombin-targeting compounds4 (149 A, 483 I, mean MI = 0.126)
• 1346 Abbott monoamine oxidase inhibitors5 (221 A, 1125 I, mean MI = 0.006)
Pipeline
• 5X cross validation
• 5% batch size
• 5 classifiers in the committee (Figure 2)
• perceptron classifier data shown
Classifier committees
• bagging: each committee member samples from the labeled training data with a uniform distribution
• boosting: each committee member samples from the labeled training data with a varied sampling distribution, such that compounds misclassified by the previously obtained hypothesis are more likely to be sampled again
Sample selection strategies
• random
• uncertainty: compounds on which the committee disagrees most strongly are selected
• density with respect to actives: compounds most similar to previously labeled or predicted actives are selected (Tanimoto similarity
metric)
• P(active): compounds predicted active with the highest probability by the committee are selected

The drug screening pipeline proposed here combines committee-based active learning with bagging and boosting techniques and several options for sample selection. Our best strategy retrieves up to 87% of the active compounds after screening only 30% of the chemical datasets analyzed.

Figure 1: The Drug Discovery Cycle. [Compound descriptors (binary feature vectors) feed sample selection (random, uncertainty, density, P(active), or a chemist's domain knowledge); labels for a batch from the unlabeled training set are queried; a committee of classifiers (naïve Bayes or perceptron, with bagging or boosting) is trained on sub-samples of the labeled training set; unlabeled testing and training set drugs are classified by the committee (weighted majority vote).]

Figure 3: Querying for labels and training classifiers on sub-samples. [After the 1st query, only a few compounds carry A/I labels and each classifier trains on a sub-sample of them, with the remaining compounds held out as unlabeled test examples; after the 2nd query, the labeled set grows and the classifiers are retrained.]

4. Results
• exploitation: number of active drugs retrieved with each batch queried
Figure 4: Hit Performance and Sensitivity. [Panel a: thrombin hit performance, total number of hits (actives queried) vs. fraction of examples labeled.]
• P(active) sample selection shows the best hit performance when feature information content is higher (Figure 4a)
- after 30% of drugs are labeled (cross validation averages):
1. P(active) retrieves 84% of actives
2. density retrieves 77% of actives
3. uncertainty retrieves 65% of actives
4.
random retrieves 42% of actives
• density sample selection shows the best initial hit performance when feature information content is lower (Figure 4b)
- classifier sensitivity is compromised
- hit performance is linear for all strategies after 20% of drugs are labeled
[Panel b: MAOI hit performance, total number of hits (actives queried) vs. fraction of examples labeled. Panel c: bagged committee sensitivity.]

Figure 2: Pipeline Flowchart. The active learning paradigm refers to the ability of the learner to modify the sampling strategy of the data chosen for training based on previously seen data. During each round of screening, the active learning algorithm selects a batch of unlabeled compounds to be tested for target binding activity and added to the training set (the first batch may be selected by a chemist's domain knowledge). Once the labels for this batch are known, the model of activity is recomputed on all examples labeled so far, and a new chemical set for screening is selected (Figure 1). Querying continues until all training set labels have been queried, and the procedure repeats for each round of cross validation before accuracy and performance statistics are computed.

Future work will involve ROC and precision-recall analysis, along with comparison of various classifiers and feature descriptors.
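The screening loop in Figure 2, together with the sample selection strategies listed in Methods, can be made concrete with a short sketch. This is an illustrative reconstruction, not the poster's actual code: the tiny NumPy perceptron, all function names, and the scoring rules are assumptions based on the descriptions above (bagging via uniform bootstrap sampling; uncertainty as committee disagreement; density via Tanimoto similarity to known actives; P(active) as the committee vote fraction).

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto similarity between two binary fragment fingerprints."""
    both = int(np.sum(a & b))
    either = int(np.sum(a | b))
    return both / either if either else 0.0

def train_perceptron(X, y, epochs=20):
    """Minimal perceptron (labels 0/1), standing in for the component classifier."""
    w, b = np.zeros(X.shape[1]), 0.0
    t = np.where(y == 1, 1.0, -1.0)
    for _ in range(epochs):
        for xi, ti in zip(X, t):
            if ti * (xi @ w + b) <= 0:       # misclassified -> update
                w, b = w + ti * xi, b + ti
    return w, b

def predict(model, X):
    w, b = model
    return (X @ w + b > 0).astype(int)

def bagged_committee(X, y, n=5, rng=None):
    """Bagging: each member trains on a uniform bootstrap sample of the labeled data."""
    if rng is None:
        rng = np.random.default_rng(0)
    idxs = [rng.integers(0, len(X), len(X)) for _ in range(n)]
    return [train_perceptron(X[i], y[i]) for i in idxs]

def p_active(committee, X):
    """Fraction of committee votes for 'active' (majority vote uses > 0.5)."""
    return np.mean([predict(m, X) for m in committee], axis=0)

def select_batch(strategy, committee, X_pool, X_known_actives, k, rng):
    """Score the unlabeled pool and return indices of the next k compounds to assay."""
    p = p_active(committee, X_pool)
    if strategy == "random":
        return rng.choice(len(X_pool), size=k, replace=False)
    if strategy == "uncertainty":        # strongest committee disagreement
        score = -np.abs(p - 0.5)
    elif strategy == "P(active)":        # most confidently predicted active
        score = p
    elif strategy == "density":          # nearest to known actives (Tanimoto)
        score = np.array([max((tanimoto(x, a) for a in X_known_actives), default=0.0)
                          for x in X_pool])
    return np.argsort(score)[-k:]
```

One screening round then trains a bagged committee on the labeled compounds, scores the unlabeled pool with a chosen strategy, and queries labels for the top-scoring batch; the loop repeats until the label budget is spent.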
• exploration: the prediction accuracy of the committee on the testing data set during each iteration of querying
• uncertainty sample selection shows the best testing set sensitivity
• increases in the labeled training set size during progressive rounds of querying result in no significant increase in testing set sensitivity (Figure 4c)
- does the actives:inactives ratio of the labeled training set bias the classifier?
- are multiple modes of drug activity present in the datasets?
[Panels c-d: thrombin testing set sensitivity vs. fraction of examples labeled; panel d compares a single classifier with a bagged committee.]

5. Discussion
• tradeoff: the sample selection methods with the best hit performance display the lowest testing set sensitivity (Figure 4c)
• bagging and boosting do not result in significantly different hit performance for any sample selection strategy on these datasets
• bagging and boosting significantly enhance the testing set sensitivity of the component learning algorithm (Figure 4d)

6. References
1. N. Abe and H. Mamitsuka. Query Learning Strategies Using Boosting and Bagging. ICML 1998, 1-9.
2. G. Forman. Incremental Machine Learning to Reduce Biochemistry Lab Costs in the Search for Drug Discovery. BIOKDD 2002, 33-36.
3. M. Warmuth, G. Ratsch, M. Mathieson, J. Liao, and C. Lemmen. Active Learning in the Drug Discovery Process. NIPS 2001, 1449-1456.
4. KDD Cup 2001. http://www.cs.wisc.edu/~dpage/kddcup2001/
5. R. Brown and Y. Martin. Use of Structure-Activity Data To Compare Structure-Based Clustering Methods and Descriptors for Use in Compound Selection. Journal of Chemical Information and Computer Science, 1996, 36, 572-584.