
Multiple-Instance Machine Learning in Bioinformatics

Stephen Scott
Associate Professor, Dept. of Computer Science, University of Nebraska
April 21, 2004
Supported by: NSF CCR-0092761, NIH RR-P20 RR17675, NSF EPS-0091900

What is Machine Learning?

• Building machines that automatically learn from experience
  – Important research goal of artificial intelligence
• (Very) small sampling of applications:
  – Classification of protein sequences by family
  – Gene finding
  – Structure prediction
  – Inferring phylogenies

What is Machine Learning? (cont’d)

• Given several labeled examples of a concept
  – E.g. thioredoxin-fold (Trx) vs. non-Trx-fold proteins
• Examples are described by features
  – E.g. CxxC-motif-present (boolean), AA-frequency (20 real numbers from [0,1]), average-polarity (real number)
• A ML algorithm uses these examples to create a hypothesis that will predict the label of new (previously unseen) examples
• Hypotheses can take on many forms: artificial neural networks, hidden Markov models, k nearest neighbor, etc.

An Example

[Figure slide]

Multiple-Instance Learning

• Generalizes conventional machine learning (provably harder to learn)
• Now each example consists of a set (bag) of instances
• Single label for entire bag is a function of individual instances’ labels
  – Don’t know what the individual labels are

Multiple-Instance Learning Example

[Two figure slides]

Application: Molecular Binding

• Dietterich et al. (1997)
• Given representations of molecules labeled according to whether they bind or don’t bind to a specific site of another molecule
• Problem: learn to predict whether new examples will bind at same site
• Represent shape by 166 measurements from origin to boundary

Application: Molecular Binding (cont’d)

• Issue: molecule binds if any of its conformations does
• Which one is responsible for binding?
• Well-studied application of multi-instance learning

Application: Modeling Trx-Fold Proteins

• Thioredoxin-fold superfamily has very low primary sequence conservation
  – E.g. hidden Markov models perform very poorly
• Our representation: Kim et al.’s (2000) quantitative property values
  – Originally developed for G protein-coupled receptors
  – Properties: e.g. hydropathy index, polarity, molecular weight, solubility
  – Our multi-instance representation yields an eight-dimensional signature for each protein
  – We search for a set of boxes that are “hit” by points

[Two further figure slides under the same title]

Results

• Molecular binding: “Musk” data sets

  Algorithm                 Musk 1 Error   Musk 2 Error
  “Box Kernel” (our alg)    8.8%           9.7%
  EMDD                      15.2%          15.1%
  DD                        12.0%          16.0%

Results (cont’d)

• Identification of Trx-fold proteins

  Algorithm     FP Error   FN Error
  Box Kernel    21.8%      16.8%
  EMDD          36.5%      35.6%
  DD            66.7%      12.5%
  HMM           95%        2%

Current Work: 3D Structure Analysis

• Goal: To identify new redox proteins via analysis of 3D structure
• Align active sites of true 3D structures of known positive and negative examples
• Represent each molecule as a bag, each point = 1 atom or residue
  – Can also augment with Kim et al.’s properties

[Figure slide under the same title]

Conclusions

• Machine learning provides very powerful tools in bioinformatics
• Multiple-instance learning yields an even richer representation, though complexity increases
• Other applications: content-based image retrieval, robot vision

Acknowledgments

• Qingping Tao
• Chang Wang
• Dmitri Fomenko
• Vadim Gladyshev
• Etsuko Moriyama
• Vasant Honavar & Carson Andorf
• Josh Brown & Jun Zhang
• UNL’s Research Computing Facility
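
Illustrative sketch: multiple-instance labeling with a box

The slides describe each example as a bag of instances, note that a molecule binds if any of its conformations does, and mention searching for boxes that are "hit" by points. The short Python sketch below shows only that labeling rule. It is not the Box Kernel algorithm from the talk: the single axis-parallel box, the 2-D toy points, and the names hits, bag_label, binder, and non_binder are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
# Hypothetical hypothesis: one axis-parallel box given by its lower and upper corners.
Box = Tuple[Sequence[float], Sequence[float]]


def hits(box: Box, point: Point) -> bool:
    """Return True if the point lies inside the box on every coordinate."""
    lower, upper = box
    return all(lo <= x <= hi for lo, x, hi in zip(lower, point, upper))


def bag_label(box: Box, bag: List[Point]) -> bool:
    """A bag is positive if at least one of its instances hits the box.

    Which instance is responsible is not part of the label, mirroring the
    slides' point that individual instance labels are unknown.
    """
    return any(hits(box, p) for p in bag)


# Toy 2-D data (invented for this sketch; the Musk data uses 166 features).
box = ((0.0, 0.0), (1.0, 1.0))
binder = [(2.0, 2.0), (0.5, 0.25), (-1.0, 3.0)]   # one "conformation" falls in the box
non_binder = [(2.0, 2.0), (1.5, 0.5)]             # no instance falls in the box

print(bag_label(box, binder))      # True
print(bag_label(box, non_binder))  # False
```

A learner in this setting would have to choose the box (or set of boxes) from labeled bags alone, which is where the added difficulty over conventional learning comes from.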

