Multiple-Instance Machine Learning in Bioinformatics
Stephen Scott
Associate Professor
Dept. of Computer Science
University of Nebraska
April 21, 2004
Supported by:
NSF CCR-0092761
NIH RR-P20 RR17675
NSF EPS-0091900
What is Machine Learning?

• Building machines that automatically learn from experience
  – Important research goal of artificial intelligence
• (Very) small sampling of applications:
  – Classification of protein sequences by family
  – Gene finding
  – Structure prediction
  – Inferring phylogenies
4/21/2004
Stephen Scott, Dept of Computer Science
What is Machine Learning? (cont’d)

• Given several labeled examples of a concept
  – E.g. thioredoxin-fold (Trx) vs. non-Trx-fold proteins
• Examples are described by features
  – E.g. CxxC-motif-present (boolean), AA-frequency (20 real numbers from [0,1]), average-polarity (real number)
• An ML algorithm uses these examples to create a hypothesis that will predict the label of new (previously unseen) examples
• Hypotheses can take on many forms: artificial neural networks, hidden Markov models, k-nearest neighbor, etc.
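
The feature-and-hypothesis idea above can be sketched in a few lines. This is only an illustration, not the method used in the talk: the function names and the toy sequences are made up, the average-polarity feature is left out because it needs a polarity scale the slides do not give, and the hypothesis is a 1-nearest-neighbor rule (k-nearest neighbor with k = 1).

```python
import math
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def featurize(seq):
    """Map a protein sequence to a fixed-length feature vector (21 numbers)."""
    seq = seq.upper()
    # Boolean feature: does a CxxC motif (C, any two residues, C) occur?
    has_cxxc = 1.0 if re.search(r"C..C", seq) else 0.0
    # 20 amino-acid frequencies, each a real number in [0, 1]
    freqs = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    return [has_cxxc] + freqs


def nearest_neighbor_label(train, query):
    """Predict the label of `query` as the label of its closest training example."""
    qv = featurize(query)
    best_label, best_dist = None, math.inf
    for seq, label in train:
        d = math.dist(featurize(seq), qv)  # Euclidean distance
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


# Made-up toy sequences for illustration only, NOT real proteins
train = [
    ("MKWCGPCKALAP", "Trx"),
    ("AVWCAPCRMIEK", "Trx"),
    ("MSSLLKKDEEGG", "non-Trx"),
    ("GGSTTLLNNQQE", "non-Trx"),
]

print(nearest_neighbor_label(train, "LKWCGPCKMIAP"))
```

Here the labeled pairs in `train` play the role of the labeled examples, `featurize` supplies the features, and the nearest-neighbor rule is the hypothesis applied to a previously unseen sequence.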
An Example
Multiple-Instance Learning

• Generalizes conventional machine learning (provably harder to learn)
• Now each example consists of a set (bag) of instances
• Single label for entire bag is a function of individual instances’ labels
  – Don’t know what the individual labels are
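
A minimal sketch of the bag setting described above. The slide only says the bag label is "a function" of the instance labels; the default used here, logical OR (`any`), is a common choice in the multiple-instance literature and is an assumption, as are the names `Bag` and `bag_label`.

```python
from dataclasses import dataclass
from typing import List, Sequence

Instance = Sequence[float]  # one feature vector


@dataclass
class Bag:
    """One multiple-instance example: many instances, one observed label."""
    instances: List[Instance]
    label: bool  # only the bag-level label is known to the learner


def bag_label(hidden_instance_labels, combine=any):
    """Bag label as a function of the (unobserved) instance labels.

    `combine=any` (logical OR) is an assumed default, not taken from the slides.
    """
    return bool(combine(hidden_instance_labels))


# A bag the learner would see: instances plus a single label
bag = Bag(instances=[[0.1, 0.2], [0.9, 0.8]], label=True)

print(bag_label([False, True, False]))  # one positive instance makes the bag positive
print(bag_label([False, False]))        # no positive instance
print(bag_label([True]))                # a one-instance bag behaves like a conventional example
```

The one-instance case shows why this setting generalizes conventional learning: when every bag holds exactly one instance, the bag label is just that instance's label.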
Multiple-Instance Learning Example
Application: Molecular Binding

• Dietterich et al. (1997)
• Given representations of molecules labeled according to whether they bind or don’t bind to a specific site of another molecule
• Problem: learn to predict whether new examples will bind at same site
• Represent shape by 166 measurements from origin to boundary
Application: Molecular Binding

• Issue: molecule binds if any of its conformations does
• Which one is responsible for binding?
• Well-studied application of multi-instance learning
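
One way to read "binds if any of its conformations does": score every conformation and let the best one decide for the whole molecule. The sketch below assumes a per-conformation scoring function is already available; `closeness`, `TARGET`, the threshold, and the three-number shape vectors are all made up for illustration (the representation in the slides uses 166 measurements per conformation).

```python
def score_bag(conformations, score_instance):
    """Return (bag score, index of the highest-scoring conformation)."""
    scores = [score_instance(c) for c in conformations]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


# Hypothetical instance scorer: negative squared distance to a made-up target shape
TARGET = [1.0, 2.0, 3.0]


def closeness(conf):
    return -sum((a - b) ** 2 for a, b in zip(conf, TARGET))


# One molecule = one bag of conformations (toy values)
molecule = [[0.0, 0.0, 0.0], [1.0, 2.0, 2.5], [4.0, 4.0, 4.0]]

score, which = score_bag(molecule, closeness)
binds = score > -1.0  # arbitrary threshold, for illustration only
print(binds, which)
```

The returned index is a candidate answer to "which one is responsible for binding?", though only with respect to the assumed scorer.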
Application: Modeling Trx-Fold Proteins

• Thioredoxin-fold superfamily has very low primary sequence conservation
  – E.g. hidden Markov models perform very poorly
• Our representation: Kim et al.’s (2000) quantitative property values
  – Originally developed for G protein-coupled receptors
  – Properties: e.g. hydropathy index, polarity, molecular weight, solubility
  – Our multi-instance representation yields an eight-dimensional signature for each protein
  – We search for a set of boxes that are “hit” by points
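
The notion of a box being "hit" by points can be sketched as follows. This is not the Box Kernel algorithm itself: it assumes axis-parallel boxes and a bag of points per protein, it uses made-up two-dimensional data for readability (the same code works unchanged in eight dimensions), and it stops at the hit pattern because the slides do not say how that pattern is turned into a label.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
Box = Tuple[Sequence[float], Sequence[float]]  # (lower corner, upper corner)


def in_box(p: Point, box: Box) -> bool:
    """True if point p lies inside the axis-parallel box (boundaries included)."""
    lo, hi = box
    return all(l <= x <= h for l, x, h in zip(lo, p, hi))


def is_hit(bag: List[Point], box: Box) -> bool:
    """A box is hit by a bag if at least one of the bag's points falls inside it."""
    return any(in_box(p, box) for p in bag)


def hit_vector(bag: List[Point], boxes: List[Box]) -> List[bool]:
    """Which boxes does this bag hit?"""
    return [is_hit(bag, b) for b in boxes]


# Toy data: two boxes and one bag with two points
boxes = [([0.0, 0.0], [1.0, 1.0]), ([2.0, 2.0], [3.0, 3.0])]
bag = [[0.5, 0.5], [5.0, 5.0]]

print(hit_vector(bag, boxes))
```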
Results

• Molecular binding: “Musk” data sets

Algorithm                  Musk 1 Error   Musk 2 Error
“Box Kernel” (our alg)     8.8%           9.7%
EMDD                       15.2%          15.1%
DD                         12.0%          16.0%
Results (cont’d)

• Identification of Trx-fold proteins

Algorithm     FP Error   FN Error
Box Kernel    21.8%      16.8%
EMDD          36.5%      35.6%
DD            66.7%      12.5%
HMM           95%        2%
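
The slides do not define the FP and FN columns. The sketch below assumes the usual reading: FP error is the fraction of true negatives predicted positive, and FN error is the fraction of true positives predicted negative. The labels and predictions are made up and are not the data behind the table.

```python
def error_rates(y_true, y_pred):
    """Return (FP error, FN error) under the assumed definitions.

    Assumes y_true contains at least one positive and one negative example.
    """
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    negatives = sum(1 for t in y_true if not t)
    positives = sum(1 for t in y_true if t)
    return fp / negatives, fn / positives


# Made-up example: 4 positives followed by 5 negatives
y_true = [True, True, True, True, False, False, False, False, False]
y_pred = [True, True, True, False, True, False, False, False, False]

fp_err, fn_err = error_rates(y_true, y_pred)
print(fp_err, fn_err)

# A classifier that calls everything positive never misses a positive,
# but it mislabels every negative.
print(error_rates(y_true, [True] * len(y_true)))
```

Under these definitions, a row with a very high FP error and a very low FN error, like the HMM row, describes a classifier that labels almost everything as positive, which is why the two columns are reported separately.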
Current Work: 3D Structure Analysis

• Goal: To identify new redox proteins via analysis of 3D structure
• Align active sites of true 3D structures of known positive and negative examples
• Represent each molecule as a bag, each point = 1 atom or residue
  – Can also augment with Kim et al.’s properties
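
A rough sketch of the bag construction described above, under strong simplifications: "alignment" is reduced to translating every structure so that its active-site position becomes the origin (a real alignment would also need to handle rotation), and the coordinates and property values are made up. The function name `make_bag` is hypothetical.

```python
def make_bag(coords, center, properties=None):
    """Build a bag with one point per atom or residue.

    coords:     list of (x, y, z) positions
    center:     (x, y, z) of the active site, used as the new origin
    properties: optional list of per-point property lists appended to each point
    """
    bag = []
    for i, (x, y, z) in enumerate(coords):
        # Translate so the active site sits at the origin
        point = [x - center[0], y - center[1], z - center[2]]
        if properties is not None:
            point.extend(properties[i])  # augment with extra property values
        bag.append(point)
    return bag


# Toy molecule with two points (made-up numbers)
coords = [(1.0, 2.0, 3.0), (2.0, 2.0, 2.0)]
center = (1.0, 1.0, 1.0)

print(make_bag(coords, center))
print(make_bag(coords, center, properties=[[0.5], [-0.5]]))
```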
Conclusions

• Machine learning provides very powerful tools in bioinformatics
• Multiple-instance learning yields an even richer representation, though complexity increases
• Other applications: content-based image retrieval, robot vision
Acknowledgments

• Qingping Tao
• Chang Wang
• Dmitri Fomenko
• Vadim Gladyshev
• Etsuko Moriyama
• Vasant Honavar & Carson Andorf
• Josh Brown & Jun Zhang
• UNL’s Research Computing Facility