Analyzing Biomolecules with Graph Mining and Learning Techniques

Luke Huan
Assistant Professor
Department of Electrical Engineering and Computer Science
University of Kansas

Outline
  Introduction
  Graph-based Supervised Learning for Biomolecules
  Applications
  Conclusion and Future Directions

Supervised Learning in Analyzing Biomolecules

Introduction

Protein Structure
  To elucidate protein function at the molecular level, we need to understand protein structure.
  [Figure: residues Lys, Lys, Gly, Gly, Leu, Val, Ala, His; atom legend: oxygen, nitrogen, carbon, sulfur; views: cartoon, space filling, surface, ribbon]

Exploring Protein Structure Space
  http://www.nigms.nih.gov/psi/

Other Types of Biomolecules
  Chemicals
  RNAs (tRNA, rRNA)
  Carbohydrates
  Fatty acids
  A common problem: predicting the biological function of these biomolecules.

Protein Structure Initiative
  Protein Structure Initiative (PSI):
  Developing methodology and technology to increase success rates and lower the costs of structure determination,
  Constructing and automating the protein production and structure determination pipeline, and
  Determining unique protein structures (less than 30% identical in sequence to proteins whose structures had already been determined).

Molecular Libraries Program
  The NIH Molecular Libraries Program emphasizes the generation of high-quality probes and biological/chemical data for high-value targets.
  The goal of the program is to develop new small-molecule probes that accelerate the development of new therapeutics.
  Long-term impacts include aiding the identification and analysis of protein function, signaling and metabolic pathways, and cellular functions important to the maintenance of human health.

Outline
  Introduction
  Graph-based Supervised Learning for Biomolecules
    Representing structures as geometric graphs
    Computational tasks
  Applications
  Conclusions & Future Directions

What is a Labeled Graph?
  A labeled graph is a graph in which each node and each edge has a label.
  [Figure: three labeled graphs G1 (nodes p1 to p5), G2 (nodes q1 to q3), and G3 (nodes s1 to s4), with node labels a, b, c, d and edge labels x, y]

How to Represent Molecular Structures
  It may be straightforward.
  [Figure: a small molecular fragment of C and O atoms drawn as a graph]

Representing Protein Structures
  We may use geometric graphs.
  Euclidean graph:
    Nodes: represent points, may be labeled
    Edges: connect two points and are labeled with the Euclidean distance between them
  [Figure panels: Contact; A geometric graph]
  Huan et al. RECOMB'04

Computational Tasks
  Feature Extraction
  Feature Selection
  Pattern Recognition & Regression

Feature Extraction: Pattern Matching
  A graph G is subgraph isomorphic to a graph G', denoted by G ⊆ G', if there exists a 1-1 mapping from the nodes of G to the nodes of G' such that node labels, edges, and edge labels are preserved by the mapping.
  A pattern is a graph.
  Pattern G matches G' if G ⊆ G'.
  G occurs in G' if G ⊆ G'.
  [Figure: a pattern G (nodes g1 to g3) matched against the graphs G1, G2, and G3]

Feature Extraction by Frequent Subgraph Mining?
  The support value of a pattern P in a collection of graphs G is the fraction of graphs in G in which P occurs.
  Given a collection of graphs G and a threshold 0 < σ ≤ 1, the frequent subgraph mining problem is to identify all patterns that have support at least σ.
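
The matching and support definitions above can be sketched in a few lines of Python. This is a minimal brute-force illustration under stated assumptions, not an efficient mining algorithm: the helper names (`make_graph`, `occurs_in`, `support`) and the toy graphs are hypothetical and are not the G1, G2, G3 of the slides. Graphs are assumed to be small, undirected, and free of self-loops, and the matcher simply tries every 1-1 mapping of pattern nodes to graph nodes. The `induced` flag additionally rejects mappings where the target graph has an edge between two matched nodes that the pattern lacks.

```python
from itertools import permutations


def make_graph(node_labels, edges):
    """Build an undirected labeled graph from {node: label} and (u, v, label) triples."""
    edge_labels = {frozenset((u, v)): label for u, v, label in edges}
    return {"nodes": dict(node_labels), "edges": edge_labels}


def _edges_preserved(pattern, graph, mapping, p_nodes, induced):
    """Check every pair of pattern nodes under the candidate mapping."""
    for i, u in enumerate(p_nodes):
        for v in p_nodes[i + 1:]:
            p_label = pattern["edges"].get(frozenset((u, v)))
            g_label = graph["edges"].get(frozenset((mapping[u], mapping[v])))
            # A pattern edge must exist in the graph with the same label
            # (edge labels are assumed never to be None).
            if p_label is not None and p_label != g_label:
                return False
            # Induced matching also forbids extra edges between matched nodes.
            if induced and p_label is None and g_label is not None:
                return False
    return True


def occurs_in(pattern, graph, induced=False):
    """Return True if some 1-1 node mapping preserves node labels, edges, and edge labels."""
    p_nodes = list(pattern["nodes"])
    for image in permutations(graph["nodes"], len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(pattern["nodes"][u] != graph["nodes"][mapping[u]] for u in p_nodes):
            continue
        if _edges_preserved(pattern, graph, mapping, p_nodes, induced):
            return True
    return False


def support(pattern, graphs, induced=False):
    """Fraction of graphs in the collection in which the pattern occurs."""
    return sum(occurs_in(pattern, g, induced) for g in graphs) / len(graphs)


# Hypothetical toy collection: a labeled triangle, a labeled path, and a single edge.
g1 = make_graph({1: "a", 2: "b", 3: "b"}, [(1, 2, "y"), (2, 3, "x"), (1, 3, "y")])
g2 = make_graph({1: "a", 2: "b", 3: "b"}, [(1, 2, "y"), (2, 3, "x")])
g3 = make_graph({1: "a", 2: "c"}, [(1, 2, "y")])
graphs = [g1, g2, g3]

# Pattern: the path a -y- b -x- b.
pattern = make_graph({"u": "a", "v": "b", "w": "b"}, [("u", "v", "y"), ("v", "w", "x")])

sigma = 2 / 3
print(support(pattern, graphs) >= sigma)                # frequent as a plain subgraph
print(support(pattern, graphs, induced=True) >= sigma)  # not frequent as an induced subgraph
```

In this toy collection the path pattern occurs in two of the three graphs, but only one occurrence is induced, because the triangle has an extra edge between matched nodes. Enumerating all mappings is exponential in the pattern size, so the sketch is only suitable for very small graphs.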

Example
  [Figure: candidate patterns P1 to P6 evaluated against G1, G2, and G3 with threshold σ = 2/3; each pattern is annotated with its support f, and patterns marked + are induced frequent subgraphs]
  Induced subgraph isomorphism penalizes any unmatched edges.

Where Is Frequent Subgraph Mining Useful?
  Yeast protein interaction network (from H. Jeong et al., Nature 411, 41 (2001))
  Aspirin
  Gene network [figure node labels: YKL009W, YNL248C, YDR496C, YNL182C, YOL077C, UNKNOWN]
  Co-author network

FFSM Search
  Task: identify all frequently occurring subgraphs in a family of graphs
  Depth-first search
    Better memory utilization
  Apriori property
    Eliminate unnecessary isomorphism checks
  Graph normalization
    Avoid redundant examination
  Subgraph isomorphism test is NP-complete
    Incremental isomorphism check
  Applies to frequent induced subgraph mining with minor modifications

Feature Selection
  Filtering: select features individually
    Pearson correlation
    Spectral feature selection (Zhao et al., ICML'07)
    Log odds ratio (Huan et al., RECOMB'04)
  Wrapper: use a classifier (e.g., a support vector machine) to select a subset of features
    Best subset
    Forward selection, backward selection
    SVM-RFE (Guyon, J. ML02)
    LASSO (Least Absolute Shrinkage and Selection Operator)

The Challenges
  The structural relationship of features
  [Figure: graphs G1, G2, and G3 containing features F1, F2, and F3, with distances d11, d12, d21, d22, d31, d32 between features]

Supervised Learning with Kernel Functions and Kernel Machines
  Graph kernel functions map graphs into a Hilbert space and measure the inner product of a pair of graphs there.
  Graph kernel functions must be symmetric and positive semi-definite.
  A support vector machine (SVM) is then trained on the computed kernel matrix to build an accurate model.

R-Convolution Kernel
  A general framework for constructing kernels
  Define a decomposition of an object into a set of pieces
  Define the convolution kernel between two objects as the sum of the kernel functions between all pieces of their decompositions
  Using only a few assumptions, we can define a kernel function for any type of data.
  The true power of this framework lies in its potentially recursive definition.

Existing Graph Kernels
  One way to derive a graph kernel function is to count shared substructures.
  Paths are used in this case; other possibilities are trees, cycles, and subgraphs.
  Instead of exhaustive path enumeration, random walks are generated for fast computation.
  Frequent-pattern-based kernel functions are also gaining popularity.

Product Graph Kernel
  The decomposition is into node sequences (walks).
  Calculated as: [formula not recovered]
  The limit can be computed efficiently if lambda is chosen to be a geometric or exponential series.
  O(n^6) running time.

Marginalized Kernel
  Rather than worry about the infinite feature space of walks, we do random sampling.
  The decomposition is still a set of walks, but we generate them randomly.
  Definition: [formula not recovered]

Shortest Path Kernel
  Again, the set of walks of a graph is the relevant decomposition.
  Instead of random generation, we limit the set to the shortest paths from one vertex to another.
  Use the Floyd transformation to turn graphs into shortest-path (SP) graphs.

Spectrum Kernel
  Another kernel that uses walk decompositions.
  Instead of randomly generating walks, or using only shortest paths, we generate walks of a specific length.
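
As a rough illustration of the fixed-length walk idea, the sketch below counts the node-label sequences of all walks with a given number of edges and takes the inner product of the two count vectors. It is one common way to write such a kernel and may differ from the definition on the original slide. The function names (`walk_spectrum`, `spectrum_kernel`), the graph encoding as a `(node_labels, adjacency)` pair, and the choice to ignore edge labels are all assumptions made for the example.

```python
from collections import Counter


def walk_spectrum(node_labels, adjacency, length):
    """Count the label sequences of all walks that use exactly `length` edges.

    node_labels: {node: label}; adjacency: {node: iterable of neighbor nodes}.
    Walks may revisit nodes; edge labels are ignored in this sketch.
    """
    # Each entry is (label sequence so far, last node of the walk).
    walks = [((node_labels[v],), v) for v in node_labels]
    for _ in range(length):
        walks = [
            (seq + (node_labels[w],), w)
            for seq, v in walks
            for w in adjacency.get(v, ())
        ]
    return Counter(seq for seq, _ in walks)


def spectrum_kernel(g1, g2, length):
    """Inner product of the two walk-count vectors."""
    s1 = walk_spectrum(*g1, length)
    s2 = walk_spectrum(*g2, length)
    return sum(count * s2[seq] for seq, count in s1.items())


# Hypothetical toy graphs: a path a - b - a and a single edge a - b.
path_graph = ({0: "a", 1: "b", 2: "a"}, {0: [1], 1: [0, 2], 2: [1]})
edge_graph = ({0: "a", 1: "b"}, {0: [1], 1: [0]})

graphs = [path_graph, edge_graph]
gram = [[spectrum_kernel(g, h, 2) for h in graphs] for g in graphs]
print(gram)
```

Because each entry is an inner product of explicit count vectors, the resulting Gram matrix is symmetric and positive semi-definite by construction. Such a matrix could then be handed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`.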
  Definition: [formula not recovered]
  In practice: [details not recovered]

Optimal Assignment Kernel
  A graph kernel function that computes molecular similarity by finding the maximum-weight bipartite matching between the vertex sets of two graphs.
  Not a true kernel function: it is not guaranteed to be positive semi-definite.

Future Work: What is a good kernel function?
  A central issue is that complex structures naturally lend themselves to complex and time-consuming analysis.
  A good kernel function creates a space where objects in the same class are close to each other but far from objects of other classes.
  It must capture similarities in structural information.
  It must also be efficient to compute.

Outline
  Introduction
  Graph-based Pattern Discovery in Protein Structures
  Applications
    MotifSpace Architecture
    Identify functional sites in proteins
    Predict protein function
  Future Directions

MotifSpace Architecture
  [Figure: architecture diagram with components Protein Data Bank (protein structures); SCOP, CATH, GO (protein family); Subgraph Miner (structure pattern mining); Pattern Filter (feature selection); Classifier (protein classification); Pattern Visualization; Validation (family-specific patterns); Structure Pattern Indexing & Database Search; Knowledgebase (functional motifs, knowledge management); Biological Experiments (testable hypotheses, experimental validation)]
  Huan et al. ISMB'05 demo

Effectiveness
  Serine proteases have three subclasses:
    Subtilisins
    Eukaryotic serine proteases
    Prokaryotic serine proteases
  [Figure: example structures 1R64, 1HJ9, 1SSX]

Frequent Patterns
  20 highly specific patterns mined from serine proteases
  [Figure: statistics about frequent patterns (support 48/56): number of patterns, coverage, and protein length for each of the 56 proteins; legend: #patterns, Coverage (%), Length (200 residues)]
  The number of patterns is the total number of fingerprints a protein has.
  The coverage of a protein is the percentage of its residues covered by at least one fingerprint.
  The length of the protein is displayed in units of 200 residues.

Patterns' Biological Relevance
  [Figure: number of patterns contained in proteins outside the serine proteases; x-axis: proteins, y-axis: number of patterns; labeled proteins: 1HJ9, 1MD8, 1OP0, 1OS8, 1PQ7, 1P57, 1SSX, 1S83]

More Case Studies
  Papain-like cysteine proteases
  Nuclear receptor ligand binding domains
  NADP/FAD binding proteins
  [Figure panels: papain-like cysteine protease; nuclear binding domains; NADP binding proteins]

Predict Protein Function
  How does a protein function in a biological system?
  [Figure labels: Function; 3D structure of a protein]
  Functional motifs carry out protein function.

Functional Inference for 1TWU
  1ecs
    SCOP 54598, antibiotic resistance protein
    Glyoxalase / bleomycin resistance / dioxygenase superfamily
    4 members (SCOP 1.65), 62 family-specific spatial motifs
  1twu
    YycE, unknown function, not in SCOP 1.67, DALI z < 10 in Nov 2004
    46 motifs found; structurally similar to the three new nonredundant AR proteins added in SCOP 1.67

Applications in Immunology
  The Major Histocompatibility Complex (MHC) is a large family of proteins involved in the human immune response.
  Some microbes produce MHC-like proteins in order to block the immune response.
  Recognizing MHC-like proteins is important for vaccine development and drug development.
  [Figure: Major Histocompatibility Complex (MHC) protein; red: MHC-I platform helices; green: MHC-I platform strands; tan: MHC-I alpha 3 domain; blue: beta-2 microglobulin; magenta: foreign protein fragment]

ChemSpace: Elucidating Roles of Small Molecules in Biological Systems
  FSM: a data mining technique that enumerates all of the common patterns (called frequent subgraphs) in a set of graphs
  Smalter et al. APBC 2008

Protein-Chemical Interaction Data Sets
  3 biological data sets:
    Protein-chemical interaction
    Protein inhibitors
    Toxicity

Experiment Protocols
  Perform experiments with ten-fold cross-validation.
  Binary classification, accuracy = (TP + TN) / S
    TP: true positives
    TN: true negatives
    S: total number of testing samples
  Comparisons:
    Marginalized: termination probability = 0.1
    Spectrum: path length = 4
    Tanimoto: path length = 4
    Subtree: maximum depth = 3
    Optimal assignment: neighbor-matching depth = 3
    Pattern matching: diffusion rate = 0.2, diffusion time = 3 steps; subgraph support 25%, size <= 5

Comparison to Other Graph Kernels

Comparison to Non-kernel Classifiers: CBA
  CBA: classification based on association

Outline
  Introduction
  Graph-based Pattern Discovery in Protein Structures
  Applications
  Conclusion & Future Challenges

Conclusions
  High-throughput technologies have produced a large volume of data in biological and biomedical research.
  We need to develop and apply advanced informatics approaches to understand the functions of biomolecules in a biological system.
  Data mining and machine learning techniques help us retrieve hidden but useful patterns from biological data and, most importantly, build accurate predictive and classification models of the roles of molecules in a biological system.
  Bioinformatics poses many new challenges for learning theory development and learning algorithm design.

Acknowledgements
  Funding from:
    National Institutes of Health
    National Science Foundation
    Kansas IDeA Network of Biomedical Research Excellence
    The University of Kansas

References

Frequent subgraph mining
  Akihiro Inokuchi, Takashi Washio, Kunio Nishimura, Hiroshi Motoda. A Fast Algorithm for Mining Frequent Connected Subgraphs. IBM Research, Tokyo Research Laboratory, 10 pages, 2002.
  Akihiro Inokuchi, Takashi Washio, Hiroshi Motoda. An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data. In: Principles of Knowledge Discovery and Data Mining (PKDD2000), pages 13-23, 2000.
  Chen Wang, Wei Wang, Jian Pei, Yongtai Zhu, Baile Shi. Scalable Mining of Large Disk-Based Graph Databases. In: Proceedings of the 2004 Conference on Knowledge Discovery and Data Mining (SIGKDD2004), 2004.
  Akihiro Inokuchi, Takashi Washio, Hiroshi Motoda. Complete Mining of Frequent Patterns from Graphs: Mining Graph Data. In: Machine Learning, pages 321-354, 2003.
  M. Cohen, E. Gudes. Diagonally Subgraphs Pattern Mining. In: Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2004.
  Jun Huan, Wei Wang, Jan Prins. Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism. In: Proceedings of the 2003 International Conference on Data Mining (ICDM2003), 2003.
  Jun Huan, Wei Wang, Jan Prins, Jiong Yang. SPIN: Mining Maximal Frequent Subgraphs from Graph Databases. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 581-586, 2004.
  Michihiro Kuramochi, George Karypis. Frequent Subgraph Discovery. In: Proceedings of the 2001 International Conference on Data Mining (ICDM2001), 2001.
  N. Vanetik, E. Gudes, S.E. Shimony. Computing Frequent Graph Patterns from Semistructured Data. In: Proceedings of the International Conference on Data Mining 2002 (ICDM2002), 2002.
  Xifeng Yan, Jiawei Han. CloseGraph: Mining Closed Frequent Graph Patterns. In: Proceedings of the 2003 Conference on Knowledge Discovery and Data Mining (SIGKDD2003), 2003.
  Xifeng Yan, Jiawei Han. gSpan: Graph-Based Substructure Pattern Mining. In: Proceedings of the 2002 International Conference on Data Mining (ICDM2002), 2002.

Graph kernel function
  Tamas Horvath, Thomas Gartner, Stefan Wrobel. Cyclic Pattern Kernels for Predictive Graph Mining. In: SIGKDD, 2004.
  H. Kashima, K. Tsuda, A. Inokuchi. Marginalized Kernels between Labeled Graphs. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML), 2003.
  David Haussler. Convolution Kernels on Discrete Structures. Technical Report UCSC-CRL-99-10, Computer Science Department, UC Santa Cruz, 1999.
  Holger Frohlich, Jorg K. Wegner, Florian Sieker, Andreas Zell. Optimal Assignment Kernels for Attributed Molecular Graphs. In: Proceedings of the 22nd International Conference on Machine Learning, 2005.
  S. V. N. Vishwanathan, Karsten M. Borgwardt, Nicol N. Schraudolph. Fast Computation of Graph Kernels. In: Advances in Neural Information Processing Systems, 2006.
  Aaron Smalter, Jun Huan, Gerald Lushington. A Graph Pattern Diffusion Kernel for Chemical Compound Classification. BIBE 2008.
  Aaron Smalter, Jun Huan, Gerald Lushington. Graph Wavelet Alignment Kernels for Drug Virtual Screening. CSB 2008.