Download Data Mining in Bioinformatics Day 9: Graph Mining in

Data Mining in Bioinformatics Day 9: Graph Mining in Chemoinformatics Chloé-Agathe Azencott & Karsten Borgwardt February 10 to February 21, 2014 Machine Learning & Computational Biology Research Group Max Planck Institutes Tübingen and Eberhard Karls Universität Tübingen Karsten Borgwardt: Data Mining in Bioinformatics, Page 1 Drug discovery Modern therapeutic research From serendipity to rationalized drug design Ancient Greeks treat infections with mould NH2 NH S HO CH3 O N CH3 O O HO Biapenem in PBP-1A Karsten Borgwardt: Data Mining in Bioinformatics, Page 2 Drug discovery process 1. Find a target Protein that we want to inhibit so as to interfer with a biological process 2. Identify hits Compounds likely to bind to the target 3.Hit-to-lead: characterize hits 4. Lead optimization and synthesis Can they be drugs? (ADME-Tox) - bioactivity - pharmacokinetics - synthetic pathway 5. Assay - in vitro - in vivo - clinical Karsten Borgwardt: Data Mining in Bioinformatics, Page 3 Drug discovery process 90 months 52 months 1. Find a target 2. Identify hits 3.Hit-to-lead: characterize hits 4. Lead optimization and synthesis 5. Assay Karsten Borgwardt: Data Mining in Bioinformatics, Page 4 Drug discovery process 90 months 52 months 1. Find a target 2. Identify hits 3.Hit-to-lead: characterize hits 4. Lead optimization and synthesis 5. Assay $500,000,000 to $2,000,000,000 Karsten Borgwardt: Data Mining in Bioinformatics, Page 5 Chemoinformatics How can computer science help? → Chemoinformatics! “...the mixing of information resources to transform data into information, and information into knowledge, for the intended purpose of making better decisions faster in the arena of drug lead identification and optimisation.” – F. K. Brown “... the application of informatics methods to solve chemical problems.” – J. Gasteiger and T. Engel Karsten Borgwardt: Data Mining in Bioinformatics, Page 6 Chemoinformatics Chemoinformatics 1. Find a target 2. Identify hits 3.Hit-to-lead: characterize hits 4. Lead optimization and synthesis 5. Assay Karsten Borgwardt: Data Mining in Bioinformatics, Page 7 Chemoinformatics The chemical space 1060 possible small organic molecules 1022 stars in the observable universe (Slide courtesy of Matthew A. Kayala) Karsten Borgwardt: Data Mining in Bioinformatics, Page 8 Drug discovery process 1. Find a target 2. Identify hits 3.Hit-to-lead: characterize hits 4. Lead optimization and synthesis 5. Assay QSAR QSPR QSAR: Qualitative Structure-Activity Relationship i.e. classification QSPR: Quantititive Structure-Property Relationship i.e. regression Karsten Borgwardt: Data Mining in Bioinformatics, Page 9 Representing chemicals in silico Expert knowledge molecular descriptors → hard, potentially incomplete Molecules are... NH2 NH S HO CH3 O N CH3 O O HO Karsten Borgwardt: Data Mining in Bioinformatics, Page 10 Representing chemicals in silico Similar Property Principle Molecules having similar structures should exhibit similar activities. → Structure-based representations Compare molecules by comparing substructures Karsten Borgwardt: Data Mining in Bioinformatics, Page 11 Molecular graph O O d C O d C C O O C C C C N C C C C C d C C N C S N C Undirected labeled graph Karsten Borgwardt: Data Mining in Bioinformatics, Page 12 Fingerprints Define feature vectors that record the presence/absence (or number of occurrences) of particular patterns in a given molecular graph where φ(A) = (φs(A))s substructure 1 if s occurs in A φs(A) = 0 otherwise Extension of traditional chemical fingerprints Karsten Borgwardt: Data Mining in Bioinformatics, Page 13 Fingerprints Learning from fingerprints Classical machine learning and data mining techniques can be applied to these vectorial feature representations. Any distance / kernel can be used Classification Feature selection Clustering Karsten Borgwardt: Data Mining in Bioinformatics, Page 14 Fingerprints Fingerprints compression Systematic enumeration → long, sparse vectors e.g. 50, 000 random compounds from ChemDB → 300, 000 paths of length up to 8 → 300 non-zeros on average “Naive” Compression List the positions of the 1s 219 = 524, 288 average encoding: 300 × 19 = 5, 700 bits Karsten Borgwardt: Data Mining in Bioinformatics, Page 15 Fingerprints Fingerprints compression Modulo Compression (lossy) Karsten Borgwardt: Data Mining in Bioinformatics, Page 16 Frequent patterns fingerprints MOLFEA [Helma et al., 2004] P = positive (mutagenic) compounds N = negative compounds features: fragments (= patterns) f such that both freq(f, P) ≥ t and freq(f, N) ≥ t Limited to frequent linear patterns ML algorithm: SVM with linear or quadratic kernel Karsten Borgwardt: Data Mining in Bioinformatics, Page 17 Frequent patterns fingerprints MOLFEA [Helma et al., 2004] CPDB – Carcinogenic Potency DataBase 684 compounds classified in 341 mutagens and 343 nonmutagens according to Ames test on Salmonella Mutagenicity prediction [Hema04] Linear kernel Quadratic kernel 100 Cross-validated sensitivity 90 80 70 60 50 1% 3% 5% Frequency threshold 10% Karsten Borgwardt: Data Mining in Bioinformatics, Page 18 Spectrum kernels φ(A) = (φs(A))s∈S Kspectrum(A, A0) = k(φ(A), φ(A0)) |(S)| |(S)| k ∈ RR ×R can be Dot product (linear kernel) RBF kernel Tanimoto kernel: k(A, B) = MinMax kernel: A∩B A∪B PN min(Ai ,Bi ) PNi=1 i=1 max(Ai ,Bi ) Karsten Borgwardt: Data Mining in Bioinformatics, Page 19 Spectrum kernels Tanimoto and MinMax Both Tanimoto and Minmax are kernels. Proof for Tanimoto: J.C. Gower A general coefficient of similarity and some of its properties. Biometrics 1971. Proof for MinMax: hφ(x), φ(y)i MinMax(x, y) = hφ(x), φ(x)i + hφ(y), φ(y)i − hφ(x), φ(y)i with φ(x) of length: # patterns × max count φ(x)i = 1 iff. the pattern indexed by bi/qc appears more than i mod q times in x Karsten Borgwardt: Data Mining in Bioinformatics, Page 20 All patterns fingerprints Paths fingerprints Labeled sub-paths (walks) O O d C O d CsCsCdO C O O C C C C N C C NsCsCsS C C C d C C C N C S N C Some sub-paths of length 3 Karsten Borgwardt: Data Mining in Bioinformatics, Page 21 All patterns fingerprints Circular fingerprints Labeled sub-trees - Extended-Connectivity (or Circular) features O O d C O d C O O C C C C N C C C S C C C d C C C N N C{sC{sN|sC}|sN{sC}|sS{sC}} C Example of a circular substructure of depth 2 Karsten Borgwardt: Data Mining in Bioinformatics, Page 22 All patterns fingerprints 2D spectrum kernels [Azencott et al., 2007] Systematically extract paths / circular fingerprints, for various maximal depths SVM with Tanimoto / Minmax Karsten Borgwardt: Data Mining in Bioinformatics, Page 23 All patterns fingerprints 2D spectrum kernels [Azencott et al., 2007] Mutagenicity (Mutag): 188 compounds Benzodiazepine receptor affinity (BZR): 181+125 compounds Cyclooxygenase-2 ihibitors (COX2): 178 + 125 compounds Estrogen receptor affinity (ER): 166 + 180 compounds Data Mutag BZR COX2 ER SVM Previous best 90.4% 85.2% (gBoost) 79.8% 76.4% 70.1% 73.6% 82.1% 79.8% Karsten Borgwardt: Data Mining in Bioinformatics, Page 24 Weisfeiler-Lehman kernel [Shervashidze et al., 2011] Goal: scalability Compute a sequence that captures topological and label information of graphs in a runtime linear in the number of edges → sub-tree kernel Karsten Borgwardt: Data Mining in Bioinformatics, Page 25 Weisfeiler-Lehman kernel [Shervashidze et al., 2011] Karsten Borgwardt: Data Mining in Bioinformatics, Page 26 Convolution kernels a.k.a. decomposition kernels (x1, . . . , xD ) is a tuple of parts of x, with xd ∈ X for each part d = 1, . . . , D kd ∈ RXd×Xd : a Mercer kernel 0 Kdecomposition(x, x ) = X X k1(x1, x01)k2(x2, x02) . . . kD (xD , x0D ) x1 x2 ...xD =x x0 x0 x0 =x0 1 2 D Spectrum kernels are a particular case of convolution kernels Karsten Borgwardt: Data Mining in Bioinformatics, Page 27 Convolution kernels Weighted Decomposition Kernel [Menchetti et al., 2005] Match atoms and weigh them according to a kernel between subgraphs that include these atoms P P KW DK (x, x0) = (a,σ∈Dr (x)) (a0,σ0∈Dr (x0)) δ(a, a0)Kc(σ, σ 0) r>0∈N Dr (x): decompositions of the molecular graph of x in an atom a and a subpath σ of x including a and of depth at most r Karsten Borgwardt: Data Mining in Bioinformatics, Page 28 Convolution kernels Weighted Decomposition Kernel [Menchetti et al., 2005] Kc: contextual kernel, here: histogram intersection kernel P 0 Kc(σ, σ 0) = l∈L min(f σ (l), f σ (l)) L: possible labels for edges and vertices f σ (l): frequency of label l subgraph σ. Karsten Borgwardt: Data Mining in Bioinformatics, Page 29 Introducing spatial information 3D Histograms [Azencott et al., 2007] Groups of k atoms Associated size: Pairwise distances (k = 2) diameter of the smallest sphere that contains all k atoms Karsten Borgwardt: Data Mining in Bioinformatics, Page 30 Introducing spatial information 3D Histograms [Azencott et al., 2007] One histogram per class of k -tuple (e.g. C-C-C, C-C-O) O O 2.6 5.6 O 7.9 2.7 9.2 C O C 6.3 C 6.6 C 6.7 3.2 3.7 2.4 C 4.6 O C C N C C 2.2 C N C 5.7 C C C 9.5 N S Frequency of N-O 4 C C 3 2 1 0 0 1 2 7 3 5 6 4 N-O distance (A) 8 9 10 Karsten Borgwardt: Data Mining in Bioinformatics, Page 31 Introducing spatial information 3D Histograms: performance [Azencott et al., 2007] 2D kernel Hist3D kernel Data Mutag 90.4% 88.8% BZR (loo) 82.0% 79.4% ER (loo) 87.0% 86.1% COX2 76.9% 78.6% Karsten Borgwardt: Data Mining in Bioinformatics, Page 32 Introducing spatial information 3D Decomposition Kernels [Ceroni et al., 2007] P P Remember: KW DK (x, x0 ) = (a,σ∈Dr (x)) (a0 ,σ0 ∈Dr (x0 )) δ(a, a0 )Kc (σ, σ 0 ) P P K3DDK (x, x0 ) = σ∈Sr (x) σ0 ∈Sr (x0 ) Ks (σ, σ 0 ) Sr (x): subgraphs of x composed of r distinct vertices Qr(r−1)/2 0 Ks (σ, σ 0 ) = i=1 δ(ei , e0i )e−γ(li −li ) li = length of edge ei in x (e1 , e2 , . . . , er(r−1)/2 lexicographically ordered; γ ∈ R Karsten Borgwardt: Data Mining in Bioinformatics, Page 33 Introducing spatial information 3DDK: Performance [Ceroni et al., 2007] 2D kernel Hist3D kernel Data Mutag 90.4% 88.8% BZR (loo) 82.0% 79.4% ER (loo) 87.0% 86.1% COX2 76.9% 78.6% 3DDK Circ3DDK 86.7% 83.5% 78.4% 81.4% 82.3% 82.1% 75.6% 75.2% Karsten Borgwardt: Data Mining in Bioinformatics, Page 34 Introducing spatial information The pharmacophore kernel [Mahé et al., 2006] pharmacophore p ∈ P(x): p = [(x1 , l1 ), (x2 , l2 ), (x3 , l3 )] xi 3D coordinates of atom i of x; li = label of atom i P P K(x, x0 ) = p∈P(x) p0 ∈P(x0 ) KP (p, p0 ) KP (p, p0 ) = Kdist (d1 , d01 )Kdist (d2 , d02 )Kdist (d3 , d03 )Kf eat (l1 , l10 )Kf eat (l2 , l20 )Kf eat (l3 , l30 ) kd−d0 k2 0 Kdist : RBF Gaussian Kdist (d, d ) = exp 2σ 2 Kf eat : Dirac Karsten Borgwardt: Data Mining in Bioinformatics, Page 35 Introducing spatial information 3D LAP kernel [Hinselmann et al., 2010] M : pairwise intramolecular matrix of inter-atomic geometric distances Karsten Borgwardt: Data Mining in Bioinformatics, Page 36 Introducing spatial information Conclusion How relevant is 3D information? How good is 3D information? Karsten Borgwardt: Data Mining in Bioinformatics, Page 37 Drug discovery process 1. Find a target 3.Hit-to-lead: characterize hits 2. Identify hits 4. Lead optimization and synthesis 5. Assay Docking Virtual High-Throughput Screening Karsten Borgwardt: Data Mining in Bioinformatics, Page 38 High-throughput screening Assay a large library of potential drugs against their target Very costly → docking → virtual high-throughput screening (vHTS) Karsten Borgwardt: Data Mining in Bioinformatics, Page 39 Measuring performance Imbalanced data Typically, most compounds are inactive ⇒ many more negative than positive examples E.g. DHFR data set: 99, 995 chemicals screened for activity against dihydrofolate reductase; < 0.2% active compounds Accuracy is not appropriate: predicting all compounds negative ⇒ accuracy = 99.8% sensitivity= # True Positives # Positives # True Negatives specificity= # Negatives For many methods, the output is continuous ⇒ accuracy, sensitivity and specificity depend on a threshold θ Karsten Borgwardt: Data Mining in Bioinformatics, Page 40 Measuring performance Receiver-Operator Characteristic Curves For all possible values of θ, report sensitivity and 1 − specif icity AUROC (Area under the ROC Curve) is a numerical measure of performance AUROC(random) = 0.5 and AUROC(optimal) = 1 3/4 2/4 x 0.9 1/4 x 0.81 x 0.73 x 0.52 x 0.2 0 True Positive Rate 1 x 0.17 x 0.12 x 0.09 label + + + + - x 0.95 x 0.94 x Inf 0 random perfect real 1/6 1/3 1/2 2/3 5/6 prediction 0.95 0.94 0.90 0.81 0.73 0.52 0.20 0.17 0.12 0.09 1 False Positive Rate Karsten Borgwardt: Data Mining in Bioinformatics, Page 41 Measuring performance [Azencott et al., 2007] 0.6 0.4 0.2 RANDOM IRV SVM MAXSIM 0.0 AUC 0.71 0.59 0.59 0.54 TPR method IRV SVM kNN MAX-SIM 0.8 1.0 Inhibition of DHFR: ROC Curves 0.0 0.2 0.4 0.6 0.8 1.0 FPR Karsten Borgwardt: Data Mining in Bioinformatics, Page 42 Measuring performance Precision-recall curves Precision = # True Positives # Predicted Positives Recall = sensitivity x 0.81 3/5 x 0.9 x 0.52 x 0.2 1/5 2/5 x 0.94 x 0.73 x 0.17 x 0.12 x 0.09 perfect real 0 Precision 4/5 1 x 0.95 0 1/4 2/4 3/4 1 Recall Karsten Borgwardt: Data Mining in Bioinformatics, Page 43 Other applications Other applications of graph mining in chemoinformatics Database indexing and search Prediction of 3D structures of small compounds and proteins Reaction Prediction Karsten Borgwardt: Data Mining in Bioinformatics, Page 44 References and further reading [Azencott et al., 2007] Azencott, C.-A., Ksikes, A., Swamidass, S. J., Chen, J. H., Ralaivola, L. and Baldi, P. (2007). One-to fourdimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties. Journal of chemical information and modeling 47, 965–974. 23, 24, 35, 36, 37, 47 [Baldi et al., 2007] Baldi, P., Benz, R. W., Hirschberg, D. S. and Swamidass, S. J. (2007). Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval. Journal of chemical information and modeling 47, 2098–2109. [Ceroni et al., 2007] Ceroni, A., Costa, F. and Frasconi, P. (2007). Classification of small molecules by two-and three-dimensional decomposition kernels. Bioinformatics 23, 2038–2045. 38, 39 [Helma et al., 2004] Helma, C., Cramer, T., Kramer, S. and De Raedt, L. (2004). Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. Journal of chemical information and computer sciences 44, 1402–1411. 17, 18 [Hinselmann et al., 2010] Hinselmann, G., Fechner, N., Jahn, A., Eckert, M. and Zell, A. (2010). Graph kernels for chemical compounds using topological and three-dimensional local atom pair environments. Neurocomputing 74, 219–229. 41 [Mahé et al., 2006] Mahé, P., Ralaivola, L., Stoven, V. and Vert, J.-P. (2006). The pharmacophore kernel for virtual screening with support vector machines. Journal of chemical information and modeling 46, 2003–2014. 40 [Menchetti et al., 2005] Menchetti, S., Costa, F. and Frasconi, P. (2005). Weighted Decomposition Kernels. In Proceedings of the 22nd International Conference on Machine Learning pp. 585–592, ACM, Bonn, Germany. 33, 34 [Saigo et al., 2009] Saigo, H., Nowozin, S., Kadowaki, T., Kudo, T. and Tsuda, K. (2009). gBoost: a mathematical programming approach to graph classification and regression. Machine Learning 75, 69–89. 26, 27, 28, 29 [Shervashidze et al., 2011] Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K. and Borgwardt, K. M. (2011). WeisfeilerLehman graph kernels. Journal of Machine Learning Research 12, 2539–2561. 30, 31 Karsten Borgwardt: Data Mining in Bioinformatics, Page 45

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Data Mining in Bioinformatics Day 9: Graph Mining in