Multiple-Instance Machine Learning in Bioinformatics
Stephen Scott
Associate Professor
Dept. of Computer Science
University of Nebraska
April 21, 2004
Supported by:
  NSF CCR-0092761
  NIH RR-P20 RR17675
  NSF EPS-0091900

What is Machine Learning?
Building machines that automatically learn from experience
  – Important research goal of artificial intelligence
(Very) small sampling of applications:
  – Classification of protein sequences by family
  – Gene finding
  – Structure prediction
  – Inferring phylogenies

What is Machine Learning? (cont’d)
Given several labeled examples of a concept
  – E.g. thioredoxin-fold (Trx) vs. non-Trx-fold proteins
Examples are described by features
  – E.g. CxxC-motif-present (boolean), AA-frequency (20 real numbers from [0,1]), average-polarity (real number)
An ML algorithm uses these examples to create a hypothesis that will predict the label of new (previously unseen) examples
Hypotheses can take on many forms: artificial neural networks, hidden Markov models, k nearest neighbor, etc.

An Example
[figure]

Multiple-Instance Learning
Generalizes conventional machine learning (provably harder to learn)
Now each example consists of a set (bag) of instances
Single label for entire bag is a function of individual instances’ labels
  – Don’t know what the individual labels are

Multiple-Instance Learning Example
[figures, two slides]

Application: Molecular Binding
Dietterich et al. (1997)
Given representations of molecules labeled according to whether they bind or don’t bind to a specific site of another molecule
Problem: learn to predict whether new examples will bind at same site
Represent shape by 166 measurements from origin to boundary

Application: Molecular Binding
Issue: molecule binds if any of its conformations does
Which one is responsible for binding?
Well-studied application of multi-instance learning

Application: Modeling Trx-Fold Proteins
Thioredoxin-fold superfamily has very low primary sequence conservation
  – E.g. hidden Markov models perform very poorly
Our representation: Kim et al.’s (2000) quantitative property values
  – Originally developed for G protein-coupled receptors
  – Properties: e.g. hydropathy index, polarity, molecular weight, solubility
  – Our multi-instance representation yields an eight-dimensional signature for each protein
  – We search for a set of boxes that are “hit” by points

Application: Modeling Trx-Fold Proteins
[figures, two slides]

Results
Molecular binding: “Musk” data sets

  Algorithm                       Musk 1 Error   Musk 2 Error
  “Box Kernel” (our algorithm)    8.8%           9.7%
  EMDD                            15.2%          15.1%
  DD                              12.0%          16.0%

Results (cont’d)
Identification of Trx-fold proteins

  Algorithm     FP Error   FN Error
  Box Kernel    21.8%      16.8%
  EMDD          36.5%      35.6%
  DD            66.7%      12.5%
  HMM           95%        2%

Current Work: 3D Struc. Analysis
Goal: To identify new redox proteins via analysis of 3D structure
Align active sites of true 3D structures of known positive and negative examples
Represent each molecule as a bag, each point = 1 atom or residue
  – Can also augment with Kim et al.’s properties

Current Work: 3D Struc. Analysis
[figure]

Conclusions
Machine learning provides very powerful tools in bioinformatics
Multiple-instance learning yields an even richer representation, though complexity increases
Other applications: content-based image retrieval, robot vision

Acknowledgments
Qingping Tao
Chang Wang
Dmitri Fomenko
Vadim Gladyshev
Etsuko Moriyama
Vasant Honavar & Carson Andorf
Josh Brown & Jun Zhang
UNL’s Research Computing Facility
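
Illustration: bag labels and box "hits"
The slides describe a bag's single label as a function of its instances' unknown labels, note that a molecule binds if any of its conformations does, and mention searching for boxes that are "hit" by points. The sketch below is a toy illustration of those ideas only and is not the Box Kernel algorithm from the Results slides. It assumes that the label function is a logical OR and that a "hit" means at least one instance lies inside an axis-parallel box. The names point_in_box, bag_hits_box, and bag_label are hypothetical, and the data are made-up 2D points (the talk's representations use 166 or eight dimensions).

```python
from typing import Sequence, Tuple

Point = Sequence[float]
# A box is given by its lower and upper corners (assumed axis-parallel).
Box = Tuple[Sequence[float], Sequence[float]]


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box on every axis."""
    lower, upper = box
    return all(lo <= x <= hi for lo, x, hi in zip(lower, point, upper))


def bag_hits_box(bag: Sequence[Point], box: Box) -> bool:
    """A bag 'hits' a box if at least one of its instances is inside it."""
    return any(point_in_box(p, box) for p in bag)


def bag_label(bag: Sequence[Point], box: Box) -> bool:
    """Assumed labeling rule: the bag is positive if any instance is positive.

    Here an instance counts as positive when it falls inside the box, so
    the bag label is the OR of the (hidden) instance labels.
    """
    return bag_hits_box(bag, box)


# Toy data: each bag stands for one molecule, each point for one conformation.
target_box: Box = ((0.4, 0.4), (0.6, 0.6))
binding_molecule = [(0.1, 0.9), (0.5, 0.5)]      # second conformation is inside
non_binding_molecule = [(0.1, 0.9), (0.9, 0.1)]  # no conformation is inside

print(bag_label(binding_molecule, target_box))      # True
print(bag_label(non_binding_molecule, target_box))  # False
```

Only the bag-level answer is observed during learning. Which conformation falls inside the box is exactly the information the learner does not have, which is what makes the setting harder than conventional supervised learning.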