Analyzing Biomolecules with Graph
Mining and Learning Techniques
Luke Huan
Assistant Professor
Department of Electrical Engineering and Computer Science
University of Kansas
Outline
Introduction
Graph-based Supervised Learning for Biomolecules
Applications
Conclusion and Future Directions
Supervised Learning in Analyzing Biomolecules
Introduction: Protein Structure
To elucidate protein function at the molecular level, we
need to understand protein structure.
[Figure: a protein structure shown in cartoon, space-filling, surface, and ribbon renderings. Residue labels: Lys, Gly, Leu, Val, Ala, His. Atom legend: oxygen, nitrogen, carbon, sulfur.]
Exploring Protein Structure Space
http://www.nigms.nih.gov/psi/
Other Types of Biomolecules
Chemicals
RNAs
tRNA
rRNA
Carbohydrate
Fatty acids
A common problem: predicting the biological function of
biomolecules.
Protein Structure Initiative
Protein Structure Initiative (PSI)
Developing methodology and technology to increase
success rates and lower costs of structure
determination,
Constructing and automating the protein production
and structure determination pipeline, and
Determining unique protein structures (less than 30%
identical in sequence to proteins for which structures
had already been determined).
Molecular Libraries Program
The NIH Molecular Libraries Program emphasizes the
generation of high-quality probes and biological-chemical
data for high-value targets. The goal of the program is to
develop new small-molecule probes for accelerating the
development of new therapeutics.
Long-term impacts include aiding the identification
and analysis of protein function, signaling and
metabolic pathways, and cellular function important
to the maintenance of human health.
Outline
Introduction
Graph-based Supervised Learning for BioMolecules
Representing structures as geometric graphs
Computational tasks
Applications
Conclusions & Future Directions
What is a Labeled Graph?
A labeled graph is a graph where each node and each
edge has a label.
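The definition above can be sketched as a small data structure. This is a minimal illustration, not the representation used in the talk: the class name `LabeledGraph`, its methods, and the example labels are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class LabeledGraph:
    """Undirected graph whose nodes and edges both carry labels."""
    node_labels: dict = field(default_factory=dict)   # node id -> label
    edge_labels: dict = field(default_factory=dict)   # frozenset({u, v}) -> label

    def add_node(self, node, label):
        self.node_labels[node] = label

    def add_edge(self, u, v, label):
        # Both endpoints must already exist so every edge joins labeled nodes.
        if u not in self.node_labels or v not in self.node_labels:
            raise KeyError("add both endpoints before adding an edge")
        self.edge_labels[frozenset((u, v))] = label

    def edge_label(self, u, v):
        """Return the label of edge {u, v}, or None if the edge is absent."""
        return self.edge_labels.get(frozenset((u, v)))

# A hypothetical three-node graph (labels are illustrative only).
g = LabeledGraph()
g.add_node("q1", "b")
g.add_node("q2", "a")
g.add_node("q3", "b")
g.add_edge("q1", "q2", "x")
g.add_edge("q2", "q3", "y")
```

Storing each edge under a `frozenset` of its endpoints makes lookups independent of the order in which the two nodes are given.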
[Figure: three labeled graphs. G1 has nodes p1 to p5, G2 has nodes q1 to q3, and G3 has nodes s1 to s4. Node labels are drawn from {a, b, c, d} and edge labels from {x, y}.]
How to Represent Molecular Structures
It may be straightforward: atoms are the nodes and chemical bonds are the edges.
[Figure: a small molecule containing C and O atoms drawn as a graph.]
Representing Protein Structures
We may use geometric graphs.
Euclidean graph:
Nodes: represent points and may be labeled
Edges: connect two points and are labeled with the Euclidean
distance between them
[Figure: a geometric graph with a labeled contact.]
Huan et al. RECOMB’04
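One possible way to build such a geometric graph from 3D coordinates is sketched below. The function name `contact_graph`, the `cutoff` parameter, and the coordinates are assumptions for illustration; the talk does not state which distance threshold or point set (for example, one point per residue) is used.

```python
import math
from itertools import combinations

def contact_graph(points, labels, cutoff):
    """Build a geometric graph from labeled 3D points.

    points: list of (x, y, z) coordinates
    labels: list of node labels, one per point
    cutoff: hypothetical contact threshold; pairs farther apart get no edge
    Returns (node_labels, edges) where edges maps (i, j), i < j, to the
    Euclidean distance between points i and j.
    """
    node_labels = dict(enumerate(labels))
    edges = {}
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        if d <= cutoff:
            edges[(i, j)] = d
    return node_labels, edges

# Illustrative coordinates only; not taken from any real structure.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 20.0)]
nodes, edges = contact_graph(coords, ["Lys", "Gly", "His"], cutoff=8.0)
```

With these inputs only the first two points are within the cutoff, so the graph has a single edge labeled with their distance.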
Computational Tasks
Feature Extraction
Feature Selection
Pattern Recognition & Regression
Feature Extraction: Pattern Matching
A graph G is subgraph isomorphic to a graph G’, denoted by G ⊆ G’, if
there exists a 1-1 mapping from the nodes of G to the nodes of G’ such that
node labels, edges, and edge labels are preserved by the mapping.
A pattern is a graph. Pattern G matches G’, or G occurs in G’, if G ⊆ G’.
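The matching definition above can be checked directly on small inputs. The sketch below is a brute-force test, written only to illustrate the definition; the function name `occurs_in` and the dictionary-based graph encoding are assumptions, and practical miners use far more efficient search.

```python
from itertools import permutations

def occurs_in(p_nodes, p_edges, g_nodes, g_edges):
    """Return True if the pattern is subgraph isomorphic to the graph.

    *_nodes: dict mapping node id -> node label
    *_edges: dict mapping frozenset({u, v}) -> edge label
    Brute force: try every injective assignment of pattern nodes to
    graph nodes, so this is only practical for very small patterns.
    """
    p_ids = list(p_nodes)
    for image in permutations(g_nodes, len(p_ids)):
        m = dict(zip(p_ids, image))
        # Node labels must be preserved by the mapping.
        if any(p_nodes[u] != g_nodes[m[u]] for u in p_ids):
            continue
        # Every pattern edge must map to a graph edge with the same label.
        if all(g_edges.get(frozenset(m[u] for u in e)) == lab
               for e, lab in p_edges.items()):
            return True
    return False

# Hypothetical pattern (a -y- b) and graph (b -x- a -y- b).
pattern_nodes = {"g1": "a", "g2": "b"}
pattern_edges = {frozenset(("g1", "g2")): "y"}
graph_nodes = {"q1": "b", "q2": "a", "q3": "b"}
graph_edges = {frozenset(("q1", "q2")): "x", frozenset(("q2", "q3")): "y"}
found = occurs_in(pattern_nodes, pattern_edges, graph_nodes, graph_edges)
```

Extra edges in the larger graph are ignored here; an induced-subgraph variant would additionally reject mappings where the graph has an edge that the pattern lacks.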
[Figure: a pattern G with nodes g1, g2, and g3, shown with the labeled graphs G1, G2, and G3.]
Feature Extraction by Frequent Subgraph
Mining?
The support value of a pattern P in a collection of graphs G is the
fraction of graphs in G where P occurs.
Given a collection of graphs G and a threshold 0 < σ ≤ 1, the frequent
subgraph mining problem is the identification of all patterns that have
support at least σ.
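The support definition can be illustrated on the simplest possible patterns. The sketch below restricts patterns to single labeled edges, which is roughly the first level of a pattern-growth search rather than a full frequent subgraph miner; the function name `frequent_edge_patterns` and the three toy graphs are assumptions.

```python
from collections import Counter

def frequent_edge_patterns(graphs, sigma):
    """Return single-edge patterns whose support is at least sigma.

    graphs: list of (node_labels, edges); node_labels maps id -> label and
            edges maps (u, v) -> edge label
    A pattern is a triple (node label, edge label, node label) with the two
    node labels sorted, so the edge direction does not matter.
    """
    counts = Counter()
    for node_labels, edges in graphs:
        seen = set()
        for (u, v), lab in edges.items():
            a, b = sorted((node_labels[u], node_labels[v]))
            seen.add((a, lab, b))
        counts.update(seen)  # each graph counts a pattern at most once
    n = len(graphs)
    return {p: c / n for p, c in counts.items() if c / n >= sigma}

# Three hypothetical graphs.
toy_graphs = [
    ({1: "a", 2: "b", 3: "c"}, {(1, 2): "y", (2, 3): "y"}),
    ({1: "a", 2: "b"}, {(1, 2): "y"}),
    ({1: "b", 2: "b"}, {(1, 2): "x"}),
]
frequent = frequent_edge_patterns(toy_graphs, sigma=2 / 3)
```

Support counts graphs, not occurrences: a pattern that appears several times inside one graph still contributes only once for that graph.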
Example
[Figure: graphs G1, G2, and G3 with candidate patterns P1 to P6, each annotated with its support f, for a threshold σ = 2/3. Patterns marked + are induced frequent subgraphs.]
The induced subgraph isomorphism penalizes any unmatched edges.
Where is Frequent Subgraph Mining Useful?
Yeast protein interaction network (from H. Jeong et al., Nature 411, 41 (2001))
Chemical structures (e.g., aspirin)
Gene networks (figure nodes: YKL009W, YNL248C, YDR496C, YNL182C, YOL077C, UNKNOWN)
Co-author networks
FFSM Search
Task: identify all frequently occurring subgraphs from a family of graphs
Depth-first search: better memory utilization
Apriori property: eliminates unnecessary isomorphism checks
Graph normalization: avoids redundant examination
Subgraph isomorphism testing is NP-complete: use an incremental isomorphism check
Applies to frequent induced subgraph mining with minor modifications
Feature Selection
Filtering: select features individually
Pearson correlation
Spectral feature selection (Zhao et al., ICML’07)
Log odds ratio (Huan et al., RECOMB’04)
Wrapper: use a classifier (e.g., a support vector
machine) to select a subset of features
Best subset
Forward selection, backward selection
SVM-RFE (Guyon et al., Machine Learning, 2002)
LASSO (Least Absolute Shrinkage and Selection Operator)
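As an illustration of the filtering approach, the sketch below ranks features one at a time by the absolute Pearson correlation with the class label. The function names, the toy feature matrix (rows are molecules, columns are pattern-occurrence indicators), and the choice to keep the top k are assumptions for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    # A constant column carries no information; score it as 0.
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def top_k_features(X, y, k):
    """Rank feature columns of X by |Pearson correlation| with y; keep k."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])[:k]

# Hypothetical data: 4 molecules, 3 binary pattern features, binary labels.
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 1, 0, 0]
selected = top_k_features(X, y, k=1)
```

Because each feature is scored on its own, this kind of filter is cheap but cannot see that two structurally overlapping patterns may be redundant.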
The Challenges
The structural relationship of features
[Figure: molecular graphs G1, G2, and G3 (atoms C, N, O, S) and features F1, F2, and F3, annotated with d11, d12, d21, d22, d31, and d32.]
Supervised Learning with Kernel Function and
Kernel Machines
Graph kernel functions compute the inner product of
a pair of graphs by mapping the graphs to a Hilbert
space.
Graph kernel functions need to be symmetric and
positive semi-definite.
A support vector machine (SVM) is then trained on the
computed kernel matrix to build a predictive model.
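Written out, the three statements above take the following standard form. The feature map symbol φ and the Hilbert space symbol H are notation assumed here for illustration; they do not appear on the slide.

```latex
k(G, G') = \langle \phi(G), \phi(G') \rangle_{\mathcal{H}}, \qquad
k(G, G') = k(G', G), \qquad
\sum_{i=1}^{m} \sum_{j=1}^{m} c_i \, c_j \, k(G_i, G_j) \ge 0
\quad \text{for all graphs } G_1, \dots, G_m \text{ and all } c_1, \dots, c_m \in \mathbb{R}.
```

The last condition says that every kernel matrix computed from a finite set of graphs is positive semi-definite, which is what an SVM solver requires.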
R-Convolution Kernel
A general framework for constructing kernels:
Define a decomposition of an object into a set of
pieces.
Define the convolution kernel between two objects as the
sum of the kernel functions between all pieces of the
decompositions.
Using only a few assumptions, we can define a
kernel function for any type of data.
The true power of this framework lies in its
potentially recursive definition.
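In the usual notation for this framework (following Haussler's convolution kernels, listed in the references), the construction above is commonly written as below. The symbols R, D, and k_d are assumed notation: R^{-1}(x) is the set of all decompositions of x into D parts, and k_d is a kernel on the d-th part.

```latex
k(x, x') \;=\;
\sum_{(x_1, \dots, x_D) \in R^{-1}(x)} \;
\sum_{(x'_1, \dots, x'_D) \in R^{-1}(x')} \;
\prod_{d=1}^{D} k_d(x_d, x'_d)
```

The recursion mentioned above arises because each part kernel k_d may itself be a convolution kernel over a further decomposition of the parts.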
Existing Graph Kernels
One way to derive a graph kernel function is to
count shared substructures.
Paths are used in this case; other possibilities are trees,
cycles, and subgraphs.
Instead of exhaustive path enumeration, random
walks are generated for fast computation.
Kernel functions based on frequent patterns are also gaining
popularity.
Product Graph Kernel
Decomposition is node sequences (walks)
Calculated as
The limit can be computed efficiently if the weights λ are
chosen to form a geometric or exponential series.
Running time: O(n^6).
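A minimal sketch of a walk-based product graph kernel follows, assuming the common formulation in which walks of length n in the direct product graph are weighted by λ^n (a geometric series). Instead of computing the limit in closed form, the sum is truncated at a hypothetical `max_len`. Edge labels are ignored for brevity, and the function name and parameters are assumptions.

```python
def product_graph_kernel(labels1, adj1, labels2, adj2, lam=0.1, max_len=5):
    """Truncated walk kernel on the direct product of two labeled graphs.

    labels*: list of node labels; adj*: adjacency matrix as nested 0/1 lists.
    A product node is a pair (i, j) with matching labels; two product nodes
    are adjacent when both underlying pairs are adjacent. The kernel sums
    lam**n times the number of length-n walks in the product graph, for
    n = 0 .. max_len (a truncated geometric series).
    """
    pairs = [(i, j) for i in range(len(labels1)) for j in range(len(labels2))
             if labels1[i] == labels2[j]]
    A = [[adj1[i][k] * adj2[j][l] for (k, l) in pairs] for (i, j) in pairs]
    walks = [1.0] * len(pairs)   # walks[p]: number of length-n walks from p
    total, weight = 0.0, 1.0
    for _ in range(max_len + 1):
        total += weight * sum(walks)
        # One more step: multiply the walk counts by the product adjacency.
        walks = [sum(a * w for a, w in zip(row, walks)) for row in A]
        weight *= lam
    return total

# Two hypothetical single-edge graphs with node labels a and b.
edge = [[0, 1], [1, 0]]
k_value = product_graph_kernel(["a", "b"], edge, ["a", "b"], edge,
                               lam=0.5, max_len=2)
```

The product graph has up to n^2 nodes for two graphs of n nodes each, which is why a direct matrix-based evaluation of the limit is expensive.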
Marginalized Kernel
Rather than worrying about the infinite feature space of
walks, we do random sampling.
The decomposition is still a set of walks, but we generate
them randomly.
Definition:
Shortest Path Kernel
Again, the set of walks of a graph is the relevant
decomposition.
Instead of random generation, we limit it to the set
of shortest paths from one vertex to another.
Use the Floyd transformation to turn graphs into shortest-path (SP)
graphs.
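The idea above can be sketched as follows: compute all-pairs shortest paths with the Floyd-Warshall algorithm, then compare two graphs by counting node pairs that agree on both endpoint labels and shortest-path length. Using exact equality on labels and lengths is an assumption made for simplicity, as are the function names and the toy graphs.

```python
from collections import Counter

def floyd_warshall(n, edges):
    """All-pairs shortest path lengths for an undirected weighted graph."""
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for (u, v), w in edges.items():
        dist[u][v] = dist[v][u] = min(dist[u][v], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def sp_features(labels, edges):
    """Count (label, shortest-path length, label) triples over node pairs."""
    dist = floyd_warshall(len(labels), edges)
    feats = Counter()
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if dist[i][j] != float("inf"):
                a, b = sorted((labels[i], labels[j]))
                feats[(a, dist[i][j], b)] += 1
    return feats

def shortest_path_kernel(g1, g2):
    """Dot product of the two graphs' shortest-path feature counts."""
    f1, f2 = sp_features(*g1), sp_features(*g2)
    return sum(c * f2[t] for t, c in f1.items())

# Hypothetical graphs: a path a-b-c and a single edge a-b, unit weights.
path_graph = (["a", "b", "c"], {(0, 1): 1, (1, 2): 1})
edge_graph = (["a", "b"], {(0, 1): 1})
k_sp = shortest_path_kernel(path_graph, edge_graph)
```

The Floyd-Warshall step costs O(n^3) per graph, and it only needs to be run once per graph before any pairwise comparisons.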
Spectrum Kernel
Another kernel that uses walk decompositions.
Instead of randomly generating walks, or using only
shortest paths, we generate walks of a specific length.
Definition:
In practice
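A minimal sketch of the fixed-length walk idea: enumerate every walk with exactly k edges, record its sequence of node labels, and take the dot product of the resulting count vectors. The function names, the adjacency-list encoding, and the toy graphs are assumptions; edge labels are omitted for brevity.

```python
from collections import Counter

def walk_spectrum(labels, adj, k):
    """Count node-label sequences over all walks with k edges.

    labels: list of node labels; adj: dict node index -> list of neighbors.
    """
    counts = Counter()

    def extend(node, seq):
        if len(seq) == k + 1:
            counts[tuple(seq)] += 1
            return
        for nxt in adj.get(node, []):
            extend(nxt, seq + [labels[nxt]])

    for start in range(len(labels)):
        extend(start, [labels[start]])
    return counts

def spectrum_kernel(g1, g2, k):
    """Dot product of the two graphs' length-k walk spectra."""
    s1, s2 = walk_spectrum(*g1, k), walk_spectrum(*g2, k)
    return sum(c * s2[seq] for seq, c in s1.items())

# Hypothetical graphs: a path a-b-a and a single edge a-b.
g_path = (["a", "b", "a"], {0: [1], 1: [0, 2], 2: [1]})
g_edge = (["a", "b"], {0: [1], 1: [0]})
k_spec = spectrum_kernel(g_path, g_edge, k=1)
```

The number of walks grows quickly with k, so short lengths are the practical choice for this kind of enumeration.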
Optimal Assignment Kernel
A graph kernel function that computes molecular similarity by
finding the maximum-weight bipartite matching between the two sets of
graph vertices.
NOT a true kernel function (it is not guaranteed to be positive semi-definite).
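The matching step can be illustrated with a brute-force search over assignments. This is only a sketch under stated assumptions: the function name `optimal_assignment` is hypothetical, the node similarity is assumed to be symmetric and non-negative, and a practical implementation would use the Hungarian algorithm rather than enumerating permutations.

```python
from itertools import permutations

def optimal_assignment(nodes1, nodes2, node_sim):
    """Score of the best one-to-one assignment between two node sets.

    Brute force over all assignments of the smaller set into the larger
    one, so this is only practical for a handful of nodes.
    """
    small, large = sorted((nodes1, nodes2), key=len)
    best = 0.0
    for image in permutations(large, len(small)):
        score = sum(node_sim(a, b) for a, b in zip(small, image))
        best = max(best, score)
    return best

def same_label(a, b):
    """Toy node similarity: 1 for identical atom labels, 0 otherwise."""
    return 1.0 if a == b else 0.0

# Hypothetical atom lists for two small molecules.
score = optimal_assignment(["C", "C", "O"], ["C", "O", "N", "O"], same_label)
```

Here one carbon and one oxygen can be paired with identical atoms, while the second carbon has no matching partner, so the best assignment scores 2.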
Future Work: What is a good kernel function?
A central issue is that complex structures naturally
lend themselves to complex and time-consuming
analysis.
A good kernel function creates a space where objects
in the same class are close to each other, but far from
objects of other classes.
Must capture similarities in structural information.
It must also be efficient to compute.
Outline
Introduction
Graph-based Pattern Discovery in Protein Structures
Applications
MotifSpace Architecture
Identify functional sites in proteins
Predict protein function
Future Directions
MotifSpace Architecture
[Figure: MotifSpace architecture. Protein structures from the Protein Data Bank are grouped into protein families using SCOP, CATH, and GO. A subgraph miner (pattern mining) produces structure patterns, and a pattern filter (feature selection) yields family-specific patterns. These feed a classifier (protein classification), pattern visualization and validation, structure pattern indexing and database search, and a knowledgebase of functional motifs (knowledge management). Testable hypotheses go to biological experiments for experimental validation.]
Huan et al. ISMB’05 demo
Effectiveness
Serine proteases have three subclasses
Subtilisins
Eukaryotic serine proteases
Prokaryotic serine proteases
[Figure: example structures 1R64, 1HJ9, and 1SSX.]
Frequent Patterns
20 highly specific patterns mined from serine proteases
Statistics about frequent patterns (support 48/56)
[Figure: number of patterns, coverage (%), and protein length (in units of 200 residues) for each of the 56 proteins.]
The number of patterns is the total number of fingerprints a protein has. The coverage of a
protein is the fraction of residues that are covered by at least one fingerprint (%).
The length of the protein is displayed in units of 200 residues.
Patterns’ Biological Relevance
Number of patterns contained in proteins outside the serine protease family
[Figure: number of patterns (axis 0 to 25) per protein (axis 0 to 250). Labeled proteins: 1HJ9, 1MD8, 1OP0, 1OS8, 1PQ7, 1P57, 1SSX, 1S83.]
More Case Studies
Papain-like cysteine proteases
Nuclear receptor ligand binding domains
NADP/FAD binding proteins
[Figure: three panels, one per family listed above.]
Predict Protein Function
How does a protein function in a biological system?
[Figure: the 3D structure of a protein contains functional motifs, which carry out protein function.]
Functional Inference for 1TWU
1ecs:
SCOP 54598
Antibiotic resistance protein
Glyoxalase / bleomycin resistance / dioxygenase superfamily
4 members (SCOP 1.65), 62 family-specific spatial motifs
1twu:
YycE
Unknown function, not in SCOP 1.67, DALI z < 10 in Nov 2004
46 motifs found, structurally similar to the three new non-redundant
AR proteins added in SCOP 1.67
Applications in Immunology
The Major Histocompatibility Complex (MHC) is a large family of
proteins involved in the human immune response.
Some microbes produce MHC-like proteins in order to block
the immune response.
Recognizing MHC-like proteins is important for
vaccine development and drug development.
[Figure: Major Histocompatibility Complex (MHC) protein. Red: MHC-I platform helices; green: MHC-I platform strands; tan: MHC-I alpha 3 domain; blue: beta-2 microglobulin; magenta: foreign protein fragment.]
ChemSpace: Elucidating Roles of Small
Molecules in Biological Systems
Frequent subgraph mining (FSM): a data mining technique that enumerates all of the
common patterns in a set of graphs (called frequent subgraphs)
Smalter et al. APBC 2008
Protein-Chemical Interaction Data sets
Three biological data sets:
Protein-chemical interaction
Protein inhibitors
Toxicity
Experiment Protocols
Perform experiments with ten-fold cross-validation.
Binary classification, accuracy = (TP + TN) / S
TP: true positive
TN: true negative
S: total testing samples
Comparisons:
Marginalized - termination probability = 0.1
Spectrum - path length = 4.
Tanimoto - path length = 4.
Subtree - maximum depth = 3
Optimal assignment - neighbor-matching depth = 3.
Pattern matching - diffusion rate = 0.2, diffusion time = 3 steps;
subgraph support 25%, size <= 5.
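The accuracy definition and the ten-fold protocol above can be sketched as follows. The function names are hypothetical, and the interleaved, unshuffled fold assignment is an assumption; the talk does not say how the folds were formed.

```python
def accuracy(y_true, y_pred):
    """(TP + TN) / S for binary labels: the fraction of correct predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return (tp + tn) / len(y_true)

def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    for f in range(k):
        test = list(range(f, n, k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test

# Hypothetical predictions for four test samples.
acc = accuracy([1, 0, 1, 0], [1, 0, 0, 0])
folds = list(k_fold_indices(25, k=10))
```

In a full experiment a classifier would be trained on each `train` split and scored on the matching `test` split, with the ten accuracies averaged.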
Comparison to other Graph Kernels
Comparison to Non-kernel Classifiers: CBA
Classification based on association
Outline
Introduction
Graph-based Pattern Discovery in Protein Structures
Applications
Conclusion & Future Challenges
Conclusions
High-throughput technologies have produced a large
volume of data in biological and biomedical research.
We need to develop and apply advanced informatics
approaches to understand the functions of biomolecules in a
biological system.
Data mining and machine learning techniques help us
retrieve hidden but useful patterns from biological data and
(most importantly)
build accurate predictive and classification models to
predict the roles of molecules in a biological system.
Bioinformatics poses many new challenges for learning
theory development and learning algorithm design.
Acknowledgements
Funding from:
National Institutes of Health
National Science Foundation
Kansas IDeA Network of Biomedical Research Excellence
The University of Kansas
References
Frequent subgraph mining
Akihiro Inokuchi, Takashi Washio, Kunio Nishimura, Hiroshi Motoda. A Fast
Algorithm for Mining Frequent Connected Subgraphs. IBM Research, Tokyo
Research Laboratory, 10 pages, 2002.
Akihiro Inokuchi, Takashi Washio, Hiroshi Motoda. An Apriori-Based
Algorithm for Mining Frequent Substructures from Graph Data. In: Principles
of Knowledge Discovery and Data Mining (PKDD2000), pages 13-23, 2000.
Chen Wang, Wei Wang, Jian Pei, Yongtai Zhu, Baile Shi. Scalable Mining
Large Disk-Based Graph Databases. In: Proceedings of the 2004 Conference on
Knowledge Discovery and Data Mining (SIGKDD2004), 2004.
Akihiro Inokuchi, Takashi Washio, Hiroshi Motoda. Complete Mining of
Frequent Patterns from Graphs: Mining Graph Data. In: Machine Learning,
pages 321-354, 2003.
M. Cohen, E. Gudes. Diagonally Subgraphs Pattern Mining. In: Proceedings of
the 9th ACM SIGMOD Workshop on Research issues in data mining and
knowledge discovery, 2004.
References (cont.)
Jun Huan, Wei Wang, Jan Prins. Efficient Mining of Frequent Subgraphs in
the Presence of Isomorphism. In: Proceedings of the 2003 International
Conference on Data Mining (ICDM2003), 2003.
Jun Huan, Wei Wang, Jan Prins, and Jiong Yang. "SPIN: Mining Maximal
Frequent Subgraphs from Graph Databases", in Proceedings of the 10th ACM
SIGKDD International Conference on Knowledge Discovery and Data
Mining, pp. 581-586, 2004.
Michihiro Kuramochi, George Karypis, Frequent Subgraph Discovery. In:
Proceedings of the 2001 International Conference on Data Mining
(ICDM2001), 2001.
N. Vanetik, E. Gudes, S.E. Shimony. Computing Frequent Graph Patterns
from Semistructured Data. In: Proceedings of the International Conference on
Data Mining 2002 (ICDM2002), 2002
Xifeng Yan, Jiawei Han. CloseGraph: Mining Closed Frequent Graph Patterns.
In: Proceedings of the 2003 Conference on Knowledge Discovery and Data
Mining (SIGKDD2003), 2003.
Xifeng Yan, Jiawei Han. gSpan: Graph-Based Substructure Pattern Mining. In:
Proceedings of the 2002 International Conference on Data Mining
(ICDM2002), 2002.
References (cont.)
Graph kernel function
Tamas Horvath, Thomas Gartner, and Stefan Wrobel. Cyclic pattern kernels
for predictive graph mining. SIGKDD, 2004.
H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between
labeled graphs. In Proc. of the Twentieth Int. Conf. on Machine Learning
(ICML), 2003.
David Haussler. Convolution kernels on discrete structures. Technical Report
UCSC-CRL-99-10, Computer Science Department, UC Santa Cruz, 1999.
Holger Frohlich, Jorg K. Wegner, Florian Sieker, and Andreas Zell. Optimal
assignment kernels for attributed molecular graphs. In Proceedings of the
22nd international conference on Machine learning, 2005.
S. V. N. Vishwanathan, Karsten M. Borgwardt, and Nicol N. Schraudolph.
Fast computation of graph kernels. In Advances in Neural Information
Processing Systems, 2006.
Aaron Smalter, Jun Huan, and Gerald Lushington. A Graph Pattern Diffusion
Kernel for Chemical Compound Classification. BIBE 2008.
Aaron Smalter, Jun Huan, Gerald Lushington, Graph Wavelet Alignment
Kernels for Drug Virtual Screening, CSB 2008