Data Mining
Prof. Dr. Karsten Borgwardt, Department Biosystems (D-BSSE), ETH Zürich
Basel, Fall Semester 2015

Our course - The team
Dr. Damian Roqueiro, Dr. Dean Bodenham, Dr. Dominik Grimm, Dr. Xiao He

D-BSSE Karsten Borgwardt Data Mining Course - Part 1, Basel Fall Semester 2015

Our course - Background information
Schedule
- Lecture: Wednesdays 9:15-11:00 (excluding September 23)
- Tutorial: Wednesdays 11:10-12:00 (excluding September 23)
- Room: Misrock (but Euler on September 30)
- Written exam to get the certificate in early 2016
Structure
- Key topics: distance functions, classification, clustering, feature selection
- Exercises to apply the algorithms in practice
- Moodle link: https://moodle-app2.let.ethz.ch/course/view.php?id=1420

Why Data Mining in Biology and Medicine?

What is Data Mining?
The search for patterns and statistical dependencies in large datasets

Data Mining: The basic principle
[Figure shown over three slides; image not recoverable]

Data Mining is all around you
Online shopping - product recommendations
“Customers who bought this item also bought”

Data Mining is all around you
U.S. Presidential Election 2012 - mining for swing voters
Copyright: M. E. J. Newman, http://www-personal.umich.edu/~mejn/election/2012/, Creative Commons Attribution 2.0 Generic license, unchanged.

What is personalized medicine?

Personalized Medicine
What is personalized medicine?
- Florian Holsboer: “...treatments that are tailored to individual patients’ genetic and pathophysiological backgrounds.” Nature Reviews Neuroscience 9, 638-646 (August 2008)
- ETH News (03.07.2014): “Based on genetic analyses, therapies shall be routinely tailored to patients’ needs.”
- Barack Obama (30.1.2015): “delivering the right treatments, at the right time, every time to the right person.”

The vision of personalized medicine
What is the goal of personalized medicine?
- Many medical drugs only work in a fraction of all patients.
- Genetic and other molecular properties are a potential explanation for this phenomenon.
The vision of personalized medicine: Tailoring medical treatment to the molecular properties of a patient

Through data mining to personalized medicine
Current state
- Enormous technological progress makes sequencing thousands of genomes an almost “industrial” endeavor.
- Every human genome comprises billions of bases.
- Individuals differ in millions of these bases.
Source: Dr. C. Beisel, QGF, D-BSSE

Through data mining to personalized medicine
Central data mining problems
- Can one detect correlations between diseases and base differences?
- Can one detect correlations between drug response and base variation?
Source: DREAM8 Toxicogenetics Challenge

Through data mining to personalized medicine
Barack Obama, 30.1.2015
“So if we have a big data set, a big pool of people that’s varied, then that allows us to really map out not only the genome of one person, but now we can start seeing connections and patterns and correlations that helps us refine exactly what it is that we are trying to do with respect to treatment.”
Source: Science, DOI: 10.1126/science.aaa6436

Through data mining to personalized medicine
Ambitious goals
- In 2013, Google founded the biotech company Calico. “[Our] mission is to harness advanced technologies to increase our understanding of the biology that controls lifespan” (calicolabs.com)
- In 2013, Craig Venter founded Human Longevity, Inc. “For the first time, the power of human genomics, informatics, next generation DNA sequencing technologies, and stem cell advances are being harnessed in one company...” (humanlongevity.com)
- In 2012, the European Union decided to fund a Marie Curie Initial Training Network for “Machine Learning for Personalized Medicine” with 3.75 million Euro (mlpm.eu).

Which new data mining problems have to be solved in personalized medicine?

Data Mining in Genetics
Search for disease-associated loci in the genome
[Figure: DNA sequences of several individuals, shown on two slides; image not recoverable]

Data Mining in Genetics
Success and failure
- Hundreds of new disease-associated genetic loci have been identified.
- The correlations are rather weak and cannot explain the high heritability of these diseases (missing heritability).
Potential reasons for missing heritability
- Sample sizes are too small (too few patients)
- Non-genetic influences (environment, epigenetics)
- Models are too simple (many genes rather than one)

Data Mining in Genetics
Search for interactions between genetic loci
[Figure: DNA sequences of several individuals; image not recoverable]

Data Mining in Genetics
Why is interaction search so difficult?
- Human genomes can differ in millions of bases.
- Without a clever search strategy, we have to consider billions of pairs!
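
To put a number on the pair count, here is a minimal sketch of the arithmetic. It assumes an exhaustive scan over all unordered pairs of loci, n(n-1)/2 in total; the locus counts and the helper name `num_pairs` are illustrative choices, not figures from the lecture.

```python
from math import comb


def num_pairs(n_loci: int) -> int:
    """Number of unordered pairs among n_loci loci, i.e. n(n-1)/2."""
    return comb(n_loci, 2)


# Illustrative locus counts (assumptions, not values from the slides).
for n in (1_000, 1_000_000, 3_000_000):
    print(f"{n:>9,} loci -> {num_pairs(n):,} pairs")
```

Under these assumptions, one million variable positions already give roughly 5 x 10^11 pairs, so exhaustive enumeration becomes expensive well before triples or larger combinations are considered.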

Data Mining in Genetics
Efficient interaction search without exhaustive enumeration (Achlioptas et al., KDD 2011)
[Figure: binary matrices with rows and columns labeled I to VI; layout not recoverable]

Data Mining in Genetics: Present and future
Achieved so far
- Algorithms for interaction search that are now being used by international genetics consortia (Kam-Thong 2010, 2011, 2012; Azencott, Bioinformatics 2013)
- Algorithms that support the large-scale collection of datasets (Cao et al., Nature Genetics 2011; Karaletsos et al., Bioinformatics 2012)
- Statistical test to quantify the impact of additional (non-genetic) factors (Becker et al., Nature 2011; Hagmann et al., PLoS Genetics 2015)
Next steps
- More complex models of association (Llinares-Lopez, ISMB 2015)

Chemoinformatics: Molecule classification
[Figure: molecules grouped as "Mutagenic effect", "Non-mutagenic effect" and "Unknown effect"]
Source: Seal et al., J Cheminform. 2012; 4:10; Creative Commons Attribution 2.0 Generic license, unchanged.

Chemoinformatics: Graph classification
Why is graph comparison so difficult?
Even simple questions in graph comparison lead to enormous computational problems:
- Are two graphs identical?
- Is one graph contained in another one?
The computational effort grows exponentially with the number of nodes.
Needed: Efficient methods for comparing large graphs

Chemoinformatics: Graph classification
Efficient algorithms for graph comparison (Shervashidze and Borgwardt, NIPS 2009)
[Figure: 1st iteration on the given labeled graphs G and G'. Panels: result of steps 1 and 2 (multiset-label determination and sorting); result of step 3 (label compression); result of step 4 (relabeling)]
Label compression in the 1st iteration: a,d → f; b,c → g; b,ce → h; b,de → i; c,bde → j; d,aace → k; d,abce → l; e,bcd → m
End of the 1st iteration: feature vector representations of G and G'
$\phi^{(1)}_{WLsubtree}(G) = (2, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1)$, with entries counting the labels (a, b, c, d, e, f, g, h, i, j, k, l, m)
Source: Shervashidze et al., JMLR 2011

Through data mining to personalized medicine
New challenges for data mining
- Development of new methods for measuring statistical significance in high-dimensional spaces (Sugiyama et al., SDM 2015)
- Search for patients with unusual drug response or unusual disease progression (Outlier Detection) (Sugiyama and Borgwardt, NIPS 2013)

Which role is data mining going to play in the future of medicine?

The future of data mining in medicine
What will the future bring?
- Enormous increase in the amount of data that describes the health state of a person
- Electronic health record with more and more molecular and imaging data
- Direct continuous health state monitoring via wearable devices
- Indirect health monitoring with smartphone and social media

The future of data mining in medicine
Which contributions can data mining make?
- Exploration of molecular mechanisms underlying diseases
- Support when choosing the optimal therapy
- Early detection of disease-relevant symptoms
- Detection of acute disease symptoms

References
C. Azencott, D. Grimm, M. Sugiyama, Y. Kawahara, K. M. Borgwardt, Bioinformatics 29, 171 (2013).
P. Achlioptas, B. Schölkopf, K. Borgwardt, ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (2011), pp. 726–734.
C. Becker, et al., Nature 480, 245 (2011).
J. Cao, et al., Nature Genetics 43, 956 (2011).
J. Hagmann, et al., PLoS Genetics 11, e1004920 (2015).
T. Kam-Thong, et al., Eur J Hum Genet (2010).
T. Kam-Thong, B. Pütz, N. Karbalai, B. Müller-Myhsok, K. Borgwardt, Bioinformatics (ISMB) 27, i214 (2011).
T. Kam-Thong, et al., Human Heredity 73, 220 (2012).
T. Karaletsos, O. Stegle, C. Dreyer, J. Winn, K. M. Borgwardt, Bioinformatics 28, 1001 (2012).
N. Shervashidze, K. M. Borgwardt, Advances in Neural Information Processing Systems 22, Proceedings of the Twenty-Third Annual Conference on Neural Information Processing Systems, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, A. Culotta, eds. (2009), pp. 1660–1668.
N. Shervashidze, P. Schweitzer, E. van Leeuwen, K. Mehlhorn, K. M. Borgwardt, Journal of Machine Learning Research 12, 2539 (2011).
M. Sugiyama, K. M. Borgwardt, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013 (2013), pp. 467–475.
M. Sugiyama, F. L. Lopez, N. Kasenburg, K. M. Borgwardt, Proceedings of the 2015 SIAM International Conference on Data Mining.

The Basics of Data Mining

What is data mining?
Data Mining
- The search for recurring patterns and statistical dependencies in large datasets (K.B., 2013)
- Extracting knowledge from large amounts of data (Han and Kamber, 2006)
- Often used as a synonym for Machine Learning: different origins, but nowadays almost identical topics
- Often used as a synonym for Knowledge Discovery, but some definitions deem Data Mining a step within the Knowledge Discovery Process

What is data mining?
Knowledge Discovery Process (Han and Kamber, 2006)
Step                      Action
Data cleaning             Removing noise and inconsistent data
Data integration          Combining multiple data sources
Data selection            Retrieving relevant data from the database
Data transformation       Bringing data into a form that is appropriate for mining
Data mining               Finding recurring patterns in data
Pattern evaluation        Identifying truly interesting patterns
Knowledge presentation    Representing new knowledge for users

What is data mining?
Key concept: Similarity
At the heart of mining data is the ability to detect similarities between objects.
Defining distance functions (or similarity measures, kernel or covariance functions) is therefore a key topic in data mining.
In particular, scaling these functions to large, high-dimensional datasets is a central current challenge.

Metric
Definition of a metric
We assume that the vectors $x_1, x_2, x_3$ are from a Euclidean space of dimension $d$, that is, $x_1, x_2, x_3 \in \mathbb{R}^d$. A function $d$ is a metric iff
1. $d(x_1, x_2) \geq 0$
2. $d(x_1, x_2) = 0$ if and only if $x_1 = x_2$
3. $d(x_1, x_2) = d(x_2, x_1)$
4. $d(x_1, x_3) \leq d(x_1, x_2) + d(x_2, x_3)$

Similarity measures on vectors
Popular distance functions on vectors
We assume that $x, x' \in \mathbb{R}^d$.
The Manhattan Distance is
$d(x, x') = \sum_{i=1}^{d} |x_i - x'_i|$.
The Hamming Distance on binary vectors is
$d(x, x') = \sum_{i=1}^{d} |x_i - x'_i|$.
The Euclidean Distance is defined as
$d(x, x') = \sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2}$.
The Chebyshev Distance is defined as
$d(x, x') = \max_i |x_i - x'_i|$.
The Minkowski Distance is defined as
$d(x, x') = \left( \sum_{i=1}^{d} |x_i - x'_i|^p \right)^{1/p}$, where $p \in \mathbb{R}^+$.
- We recover the Manhattan Distance for p = 1 and the Euclidean Distance for p = 2.
- The larger p, the more large deviations in one dimension matter.
- For p → ∞, the Minkowski distance converges to the Chebyshev distance.
- For p ≥ 1, the Minkowski distance is a metric.
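
The relationship between these distances can be checked numerically. The following is a minimal sketch in plain Python; the function name `minkowski` and the example vectors are illustrative assumptions, and passing `math.inf` as `p` is this sketch's own convention for the Chebyshev limit.

```python
import math
from typing import Sequence


def minkowski(x: Sequence[float], y: Sequence[float], p: float) -> float:
    """Minkowski distance of order p between two vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    if p <= 0:
        raise ValueError("p must be positive")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        # Limit p -> infinity: the largest coordinate difference (Chebyshev).
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)


# Illustrative vectors (not from the lecture).
x = [0.0, 3.0, 1.0]
y = [4.0, 0.0, 1.0]

manhattan = minkowski(x, y, 1)         # p = 1
euclidean = minkowski(x, y, 2)         # p = 2
chebyshev = minkowski(x, y, math.inf)  # limit p -> infinity
large_p = minkowski(x, y, 50)          # already close to the Chebyshev value

print(manhattan, euclidean, chebyshev, large_p)
```

On binary vectors, calling `minkowski(x, y, 1)` counts the positions in which the vectors differ, which is the Hamming distance from the slide.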

Similarity measures on sets
Finite sets of objects
Jaccard coefficient: $j(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Jaccard distance: $d(A, B) = 1 - j(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$
Overlap coefficient: $o(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}$
Sorensen-Dice coefficient: $s(A, B) = \frac{2 |A \cap B|}{|A| + |B|}$

Similarity measures on sets
Sets of vectors
Single link distance function: $d(A, B) = \min_{a \in A, b \in B} d_{vector}(a, b)$
Complete link distance function: $d(A, B) = \max_{a \in A, b \in B} d_{vector}(a, b)$
Average link distance function: $d(A, B) = \frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} d_{vector}(a, b)$

Similarity measures on strings
k-mer based similarity measures
Goal: Try to quantify the similarity between words $w$ and $w'$.
- k-mers are substrings of length k.
- Represent each string w as a histogram of k-mer frequencies, $h_k(w)$.
- Spectrum kernel (Leslie et al., 2002): Count the number of matching pairs of k-mers in $w$ and $w'$.
Example
Goal: Try to quantify the similarity between the words downtown and known.
$h_3$(downtown) = (dow: 1, own: 2, wnt: 1, nto: 1, tow: 1)
$h_3$(known) = (kno: 1, now: 1, own: 1)
The only shared 3-mer is own, so there are 2 · 1 = 2 matching pairs.

Similarity measures on nodes
Shortest path distance
- Objects are nodes in a graph G.
- Edge weights w(i, j) represent distances between nodes i and j.
- Our goal is to quantify the similarity of an arbitrary pair of nodes.
- The most popular distance function is the shortest path length.
- Floyd-Warshall's algorithm computes all-pairs shortest paths in $O(n^3)$, where n is the number of nodes in G.
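
A minimal runnable sketch of the all-pairs computation follows. It assumes nodes numbered 0 to n-1, non-negative edge weights given as a dictionary, and a zero distance from each node to itself; the function name `floyd_warshall` and the toy graph are illustrative choices, not part of the lecture material.

```python
import math
from typing import Dict, List, Tuple


def floyd_warshall(n: int, edges: Dict[Tuple[int, int], float]) -> List[List[float]]:
    """All-pairs shortest path distances for nodes 0..n-1 in O(n^3) time."""
    # Start from the edge weights; missing edges are at infinite distance.
    d = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0  # assumption of this sketch: a node is at distance 0 from itself
    for (i, j), w in edges.items():
        d[i][j] = min(d[i][j], w)
    # Allow paths through intermediate nodes 0..k, one k at a time.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


# Hypothetical undirected toy graph: each edge is entered in both directions.
undirected = {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 5.0, (2, 3): 1.0}
edges = {**undirected, **{(j, i): w for (i, j), w in undirected.items()}}

D = floyd_warshall(4, edges)
print(D[0][2])  # the detour 0 -> 1 -> 2 beats the direct edge of weight 5
print(D[0][3])
```

The triple loop makes the cubic cost visible: every pair (i, j) is revisited once for each candidate intermediate node k.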

Similarity measures on nodes
Floyd-Warshall's Algorithm (1962)
procedure Floyd-Warshall(G = (V, E, w))
    d(i, j) := w(i, j), if (i, j) ∈ E
    d(i, j) := ∞, if (i, j) ∉ E
    for k = 1 : n do
        for i = 1 : n do
            for j = 1 : n do
                if d(i, j) > d(i, k) + d(k, j) then
                    d(i, j) := d(i, k) + d(k, j)
    return matrix of shortest path distances D, D_ij = d(i, j)

Similarity measures on time series
Time series comparison: Theory and practice
If two time series $x, x'$ are vectors of length d and corresponding dimensions represent the same point in time, any vectorial distance function can be used to compare them.
Unfortunately, these assumptions are often violated in practice:
- We often compare time series of different length, $d \neq d'$.
- The time points at which the time series were observed are not synchronous.
- The time intervals between observations may vary within and between time series.

Similarity measures on time series
[Figure: two time series in stacked panels, y-axis ticks 0 to 5, x-axis ticks 0 to 8; shown on two slides]

Similarity measures on time series
Dynamic Time Warping (DTW)
A similarity measure for time series of different length, with different intervals between measurements.
It is the cost of an optimal alignment between the measurements of two time series, $x$ and $x'$.
Individual time points are compared by a base distance function d (e.g. a Minkowski distance).
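
The optimal alignment cost is typically computed by dynamic programming. Below is a minimal sketch, assuming the standard DTW recurrence (each cell adds the base distance to the cheapest of its three predecessor cells) and the absolute difference as base distance. The function name `dtw` is an illustrative choice. The example series reuse the row and column labels of the distance matrix in this deck, in the order the extracted text lists them; treat that ordering as an assumption.

```python
import math
from typing import Callable, Sequence


def dtw(
    x: Sequence[float],
    y: Sequence[float],
    dist: Callable[[float, float], float] = lambda a, b: abs(a - b),
) -> float:
    """Cost of an optimal DTW alignment between the series x and y."""
    n, m = len(x), len(y)
    # table[i][j] holds the DTW cost of aligning x[:i] with y[:j].
    table = [[math.inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(x[i - 1], y[j - 1])
            table[i][j] = cost + min(
                table[i][j - 1],      # x[i-1] is matched again
                table[i - 1][j],      # y[j-1] is matched again
                table[i - 1][j - 1],  # both series advance
            )
    return table[n][m]


x = [2, 2, 3, 4, 3, 2, 1, 2, 2]
x_prime = [2, 2, 3, 2, 2, 1, 2, 2]
print(dtw(x, x_prime))  # 1.0: only the value 4 in x has no exact partner
```

The two series have different lengths (9 and 8), so a plain vectorial distance is not defined for them, while the alignment cost is.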
The function DTW can be computed recursively as
$$DTW(i, j) = d(x_i, x'_j) + \min \begin{cases} DTW(i, j-1) & \text{repeat } x_i \\ DTW(i-1, j) & \text{repeat } x'_j \\ DTW(i-1, j-1) & \text{repeat neither} \end{cases}$$
where $DTW(0, 0) = 0$, $DTW(i, 0) = \infty$, $DTW(0, j) = \infty$ for all $1 \leq i \leq d$, $1 \leq j \leq d'$.

Similarity measures on time series
Distance Matrix (columns: values of x; rows: values of x')
x' \ x   2  2  3  4  3  2  1  2  2
  2      0  0  1  2  1  0  1  0  0
  2      0  0  1  2  1  0  1  0  0
  3      1  1  0  1  0  1  2  1  1
  2      0  0  1  2  1  0  1  0  0
  2      0  0  1  2  1  0  1  0  0
  1      1  1  2  3  2  1  0  1  1
  2      0  0  1  2  1  0  1  0  0
  2      0  0  1  2  1  0  1  0  0

Similarity measures on time series
[Figure: two time series in stacked panels, y-axis ticks 0 to 5, x-axis ticks 0 to 8; shown on two slides]

Similarity measures on graphs
Approaches to graph comparison
- Family 1: Graph isomorphism or subgraph isomorphism test
- Family 2: Graph edit distance
  Cost of transforming graph 1 into graph 2
- Family 3: Topological vectors
  Map graph to vector
  Then apply vectorial distance functions

Wiener Index
Graph representation
Let G be a graph with vertices V and edges E. Let P be the set of shortest paths in G. Then the Wiener Index (Wiener, 1947) of G is defined as
$\nu(G) = \frac{1}{2} \sum_{p \in P} p$,
where each path p enters the sum with its length.
Graph comparison
The shortest path kernel (Borgwardt and Kriegel, ICDM 2005) is a class of similarity measures between two graphs $G$ and $G'$.
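
As an illustration of the Wiener Index, the sketch below sums shortest path lengths found by breadth-first search. It assumes a connected, undirected, unweighted graph stored as adjacency lists, and it reads the factor 1/2 as correcting for each node pair being visited from both endpoints; the names `bfs_distances` and `wiener_index` and the two toy graphs are hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, List

Graph = Dict[Hashable, List[Hashable]]


def bfs_distances(graph: Graph, source: Hashable) -> Dict[Hashable, int]:
    """Shortest path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def wiener_index(graph: Graph) -> float:
    """Half the sum of shortest path lengths over all ordered node pairs."""
    total = sum(sum(bfs_distances(graph, s).values()) for s in graph)
    return total / 2


# Hypothetical toy graphs: a path on three nodes and a cycle on four nodes.
path3 = {0: [1], 1: [0, 2], 2: [1]}
cycle4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

print(wiener_index(path3))   # 4.0 = 1 + 1 + 2
print(wiener_index(cycle4))  # 8.0 = four pairs at distance 1, two at distance 2
```

Because each graph is reduced to a single number, two graphs can be compared by an ordinary operation on scalars, which is what makes this one of the cheapest graph similarity measures in the section.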
The simplest instance of this class is a product between the Wiener Indices of $G$ and $G'$:
$k(G, G') = \nu(G)\,\nu(G')$

Weisfeiler-Lehman Kernel (Shervashidze and Borgwardt, 2009)
[Figure: 1st iteration on the given labeled graphs G1 and G2. Panels: result of steps 1 and 2 (multiset-label determination and sorting); result of step 3 (label compression); result of step 4 (relabeling)]
Label compression in the 1st iteration: A,D → F; B,C → G; B,CE → H; B,DE → I; C,BDE → J; D,AACE → K; D,ABCE → L; E,BCD → M

Weisfeiler-Lehman Kernel (Shervashidze and Borgwardt, 2009)
End of 1st iteration: feature vector representation of G1 and G2
Label order: (A, B, C, D, E, F, G, H, I, J, K, L, M)
$\phi^{(1)}_{WL}(G_1) = (2, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1)$
$\phi^{(1)}_{WL}(G_2) = (1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1)$
The first five entries are counts of original node labels; the remaining eight are counts of compressed node labels.
$k^{(1)}_{WL}(G_1, G_2) = \langle \phi^{(1)}_{WL}(G_1), \phi^{(1)}_{WL}(G_2) \rangle = 11$.
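
The kernel value can be verified directly from the two feature vectors. This is a small sketch of the arithmetic only; the variable names and the split into original and compressed labels are illustrative.

```python
LABELS = list("ABCDEFGHIJKLM")
phi_g1 = [2, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1]
phi_g2 = [1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1]


def inner(u, v):
    """Plain inner product of two equally long count vectors."""
    return sum(a * b for a, b in zip(u, v))


k_total = inner(phi_g1, phi_g2)
k_original = inner(phi_g1[:5], phi_g2[:5])    # labels A to E
k_compressed = inner(phi_g1[5:], phi_g2[5:])  # labels F to M

# Labels that occur in both graphs and therefore contribute to the kernel.
shared = [lab for lab, a, b in zip(LABELS, phi_g1, phi_g2) if a and b]

print(k_total, k_original, k_compressed)  # 11 7 4
print(shared)
```

Of the total of 11, a share of 7 comes from the original labels and 4 from the compressed labels F, J and M, which are the only neighborhood patterns the two graphs have in common.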

Subtree-like Patterns
[Figure: graph with numbered nodes and tree-shaped patterns derived from it; image not recoverable]

Weisfeiler-Lehman Kernel: Theoretical Runtime Properties
Fast Weisfeiler-Lehman kernel (NIPS 2009 and JMLR 2011)
Algorithm: Repeat the following steps h times
1. Sort: Represent each node v as sorted list $L_v$ of its neighbors ($O(m)$)
2. Compress: Compress this list into a hash value $h(L_v)$ ($O(m)$)
3. Relabel: Relabel v by the hash value $h(L_v)$ ($O(n)$)
Runtime analysis
- per graph pair: Runtime $O(m h)$
- for N graphs: Runtime $O(N m h + N^2 n h)$ (naively $O(N^2 m h)$)

Weisfeiler-Lehman Kernel: Empirical Runtime Properties
[Figure: four panels of runtime in seconds versus number of graphs N, graph size n, subtree height h, and graph density c; legend: pairwise, global]

Weisfeiler-Lehman Kernel: Runtime and Accuracy
[Figure: runtime (axis from 10 sec to 1000 days) and accuracy (axis from 50 % to 85 %) on MUTAG, NCI1, NCI109 and D&D; legend entries as extracted: WL, RG, 3 Graphlet, RW, SP; further axis label: graph size]
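
The sort, compress and relabel steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a dictionary shared by both graphs stands in for the hash function, compressed labels are strings prefixed with the iteration number, and the function names are hypothetical. The two example graphs are reconstructed from the multiset labels listed in this deck, so their exact edges are an assumption.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple

Graph = Dict[Hashable, List[Hashable]]
Labels = Dict[Hashable, str]


def wl_features(graphs: List[Graph], labels: List[Labels], h: int) -> List[Counter]:
    """Label counts per graph, accumulated over the original labels and h iterations."""
    features = [Counter(lab.values()) for lab in labels]
    current = [dict(lab) for lab in labels]
    for it in range(1, h + 1):
        compress: Dict[Tuple[str, Tuple[str, ...]], str] = {}  # shared by all graphs
        relabeled_all = []
        for graph, lab in zip(graphs, current):
            relabeled = {}
            for v, nbrs in graph.items():
                # Step 1 (sort): own label plus sorted neighbor labels.
                signature = (lab[v], tuple(sorted(lab[u] for u in nbrs)))
                # Step 2 (compress): map each distinct signature to a short new label.
                if signature not in compress:
                    compress[signature] = f"{it}:{len(compress)}"
                # Step 3 (relabel): the node takes the compressed label.
                relabeled[v] = compress[signature]
            relabeled_all.append(relabeled)
        for feat, lab in zip(features, relabeled_all):
            feat.update(lab.values())
        current = relabeled_all
    return features


def wl_kernel(g1: Graph, l1: Labels, g2: Graph, l2: Labels, h: int) -> int:
    """Inner product of the accumulated label counts of two graphs."""
    f1, f2 = wl_features([g1, g2], [l1, l2], h)
    return sum(count * f2[label] for label, count in f1.items())


# Example graphs reconstructed from the multiset labels (edges are an assumption).
g1 = {"a1": ["d"], "a2": ["d"], "b": ["c", "e"], "c": ["b", "d", "e"],
      "d": ["a1", "a2", "c", "e"], "e": ["b", "c", "d"]}
l1 = {"a1": "A", "a2": "A", "b": "B", "c": "C", "d": "D", "e": "E"}
g2 = {"a": ["d"], "b1": ["c"], "b2": ["d", "e"], "c": ["b1", "d", "e"],
      "d": ["a", "b2", "c", "e"], "e": ["b2", "c", "d"]}
l2 = {"a": "A", "b1": "B", "b2": "B", "c": "C", "d": "D", "e": "E"}

print(wl_kernel(g1, l1, g2, l2, h=1))  # 11, matching the value on the slide
```

Sharing one `compress` dictionary across all graphs within an iteration is what lets N graphs be processed together instead of pair by pair, which is the idea behind the $O(N m h + N^2 n h)$ bound quoted in the runtime analysis.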