Data Mining
Prof. Dr. Karsten Borgwardt, Department Biosystems, ETH Zürich
Basel, Fall Semester 2015
D-BSSE
Our course - The team
Dr. Damian Roqueiro, Dr. Dean Bodenham, Dr. Dominik Grimm, Dr. Xiao He
D-BSSE
Karsten Borgwardt
Data Mining Course - Part 1, Basel
Fall Semester 2015
2 / 65
Our course - Background information
Schedule
Lecture: Wednesdays 9:15-11:00 (excluding September 23)
Tutorial: Wednesdays 11:10-12:00 (excluding September 23)
Room: Misrock (but Euler on September 30)
Written exam to get the certificate in early 2016
Structure
Key topics: distance functions, classification, clustering, feature selection
Exercises to apply the algorithms in practice
Moodle link
https://moodle-app2.let.ethz.ch/course/view.php?id=1420
Why Data Mining in Biology and Medicine?
What is Data Mining?
The search for patterns and statistical dependencies in large datasets
Data Mining: The basic principle
?
Data Mining is all around you
Online shopping - product recommendations
“Customers who bought this item also bought”
Data Mining is all around you
U.S. Presidential Election 2012 - mining for swing voters
Copyright: M. E. J. Newman, http://www-personal.umich.edu/~mejn/election/2012/, Creative Commons Attribution 2.0 Generic license, unchanged.
What is personalized medicine?
Personalized Medicine
What is personalized medicine?
Florian Holsboer: “...treatments that are tailored to individual patients’ genetic and
pathophysiological backgrounds.” Nature Reviews Neuroscience 9, 638-646 (August
2008)
ETH News (03.07.2014): “Based on genetic analyses, therapies shall be routinely
tailored to patients’ needs.”
Barack Obama (30.1.2015): “delivering the right treatments, at the right time, every
time to the right person.”
The vision of personalized medicine
What is the goal of personalized medicine?
Many medical drugs only work in a fraction of all patients.
Genetic and other molecular properties are a potential explanation for this phenomenon.
The vision of personalized medicine: Tailoring medical treatment to the molecular
properties of a patient
Through data mining to personalized medicine
Current state
Enormous technological progress makes sequencing
thousands of genomes an almost “industrial”
endeavor.
Every human genome comprises billions of bases.
Individuals differ in millions of these bases.
Source: Dr. C. Beisel, QGF, D-BSSE
Through data mining to personalized medicine
Central data mining problems
Can one detect correlations between diseases and base differences?
Can one detect correlations between drug response and base variation?
Source: DREAM8 Toxicogenetics Challenge
Through data mining to personalized medicine
Barack Obama, 30.1.2015
“So if we have a big data set, a big pool of people that’s varied, then that allows us to
really map out not only the genome of one person, but now we can start seeing
connections and patterns and correlations that helps us refine exactly what it is that we
are trying to do with respect to treatment.”
Source: Science, DOI: 10.1126/science.aaa6436
Through data mining to personalized medicine
Ambitious goals
In 2013, Google founded the biotech company Calico.
“[Our] mission is to harness advanced technologies to increase our understanding of the
biology that controls lifespan” (calicolabs.com)
In 2013, Craig Venter founded Human Longevity, Inc.
“For the first time, the power of human genomics, informatics, next generation DNA
sequencing technologies, and stem cell advances are being harnessed in one company...”
(humanlongevity.com).
In 2012, the European Union decided to fund a Marie Curie Initial Training Network for
“Machine Learning for Personalized Medicine” with 3.75 million Euro (mlpm.eu).
Which new data mining problems have to be solved in personalized medicine?
Data Mining in Genetics
Search for disease-associated loci in the genome
[Figure: genome sequences shown as columns of the bases A, C, G and T]
Data Mining in Genetics
Success and failure
Hundreds of new disease-associated genetic loci have been identified.
The correlations are rather weak and cannot explain the high heritability of these
diseases (missing heritability).
Potential reasons for missing heritability
Sample sizes are too small (too few patients)
Non-genetic influences (Environment, epigenetics)
Too simple models (many genes rather than one)
Data Mining in Genetics
Search for interactions between genetic loci
[Figure: genome sequences shown as columns of the bases A, C, G and T]
Data Mining in Genetics
Why is interaction search so difficult?
Human genomes can differ in millions of bases.
Without a clever search strategy, we have to consider billions of pairs!
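To make the size of the search space concrete, the following minimal sketch counts the unordered pairs of loci. The figure of one million variable loci is an assumption chosen to match "millions of bases"; it is not a number taken from the lecture.

```python
from math import comb


def number_of_pairs(n_loci: int) -> int:
    """Number of unordered pairs that an exhaustive pairwise search must test."""
    return comb(n_loci, 2)


n_loci = 1_000_000  # assumed number of variable loci (illustrative only)
pairs = number_of_pairs(n_loci)
print(f"{pairs:,}")  # prints 499,999,500,000
```

The count grows quadratically in the number of loci, which is why exhaustive enumeration quickly becomes impractical.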
Data Mining in Genetics
Efficient interaction search without exhaustive enumeration
(Achlioptas et al., KDD 2011)
[Figure: binary matrices with rows and columns labeled I to VI, illustrating the interaction search]
Data Mining in Genetics: Present and future
Achieved so far
Algorithms for interaction search that are now being used by international genetics consortia (Kam-Thong 2010, 2011, 2012, Azencott, Bioinformatics 2013)
Algorithms that support the large-scale collection of datasets (Cao et al., Nature Genetics 2011, Karaletsos et al., Bioinformatics 2012)
Statistical test to quantify the impact of additional (non-genetic) factors (Becker et al., Nature 2011, Hagmann et al., PLoS Genetics 2015)
Next steps
More complex models of association (Llinares-Lopez, ISMB 2015)
Chemoinformatics: Molecule classification
Mutagenic effect
Non-mutagenic effect
Unknown effect
Source: Seal et al., J Cheminform. 2012; 4:10; Creative Commons Attribution 2.0 Generic license, unchanged.
Chemoinformatics: Graph classification
Why is graph comparison so difficult?
Even simple questions in graph comparison lead to enormous computational problems:
Are two graphs identical?
Is one graph contained in another one?
The computational effort grows exponentially with the number of nodes.
Needed: Efficient methods for comparing large graphs
Chemoinformatics: Graph classification
Efficient algorithms for graph comparison
(Shervashidze and Borgwardt, NIPS 2009)
1st iteration, given labeled graphs G and G’
Result of steps 1 and 2: multiset-label determination and sorting
Result of step 3: label compression
a,d → f; b,c → g; b,ce → h; b,de → i; c,bde → j; d,aace → k; d,abce → l; e,bcd → m
Result of step 4: relabeling
End of the 1st iteration: feature vector representations of G and G’
Label order: (a, b, c, d, e, f, g, h, i, j, k, l, m)
$\phi^{(1)}_{\mathrm{WLsubtree}}(G) = (2, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1)$
Source: Shervashidze et al., JMLR 2011
Through data mining to personalized medicine
New challenges for data mining
Development of new methods for measuring statistical significance in high-dimensional
spaces (Sugiyama et al., SDM 2015)
Search for patients with unusual drug response or unusual disease progression (Outlier
Detection) (Sugiyama and Borgwardt, NIPS 2013)
Which role is data mining going to play in the future of medicine?
The future of data mining in medicine
What will the future bring?
Enormous increase in the amount of data that describes the health state of a person
Electronic health record with more and more molecular and imaging data
Direct continuous health state monitoring via wearable devices
Indirect health monitoring with smartphone and social media
The future of data mining in medicine
Which contributions can data mining make?
Exploration of molecular mechanisms underlying diseases
Support when choosing the optimal therapy
Early detection of disease-relevant symptoms
Detection of acute disease symptoms
References I
C. Azencott, D. Grimm, M. Sugiyama, Y. Kawahara, K. M. Borgwardt,
Bioinformatics 29, 171 (2013).
P. Achlioptas, B. Schölkopf, K. Borgwardt, ACM SIGKDD Conference on Knowledge
Discovery and Data Mining (KDD) (2011), pp. 726–734.
C. Becker, et al., Nature 480, 245 (2011).
J. Cao, et al., Nature Genetics 43, 956 (2011).
J. Hagmann, et al., PLoS Genetics 11, e1004920 (2015).
T. Kam-Thong, et al., Eur J Hum Genet (2010).
T. Kam-Thong, B. Pütz, N. Karbalai, B. Müller-Myhsok, K. Borgwardt,
Bioinformatics (ISMB) 27, i214 (2011).
References II
T. Kam-Thong, et al., Human Heredity 73, 220 (2012).
T. Karaletsos, O. Stegle, C. Dreyer, J. Winn, K. M. Borgwardt, Bioinformatics 28,
1001 (2012).
N. Shervashidze, K. M. Borgwardt, Advances in Neural Information Processing
Systems 22, Proceedings of the Twenty-Third Annual Conference on Neural
Information Processing Systems, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I.
Williams, A. Culotta, eds. (2009), pp. 1660–1668.
N. Shervashidze, P. Schweitzer, E. van Leeuwen, K. Mehlhorn, K. M. Borgwardt,
Journal of Machine Learning Research 12, 2539 (2011).
M. Sugiyama, K. M. Borgwardt, Advances in Neural Information Processing Systems
26: 27th Annual Conference on Neural Information Processing Systems 2013. (2013),
pp. 467–475.
References III
M. Sugiyama, F. L. Lopez, N. Kasenburg, K. M. Borgwardt, Proceedings of the 2015
SIAM International Conference on Data Mining.
The Basics of Data Mining
What is data mining?
Data Mining
The search for reoccurring patterns and statistical dependencies in large datasets (K.B.,
2013)
Extracting knowledge from large amounts of data (Han and Kamber, 2006)
Often used as synonym for Machine Learning: different origins, but nowadays almost
identical topics
Often used as synonym for Knowledge Discovery, but some definitions deem Data
Mining a step within the Knowledge Discovery Process
What is data mining?
Knowledge Discovery Process (Han and Kamber, 2006)
Step: Action
Data cleaning: Removing noise and inconsistent data
Data integration: Combining multiple data sources
Data selection: Retrieving relevant data from database
Data transformation: Bringing data in a form that is appropriate for mining
Data mining: Finding reoccurring patterns in data
Pattern evaluation: Identifying truly interesting patterns
Knowledge presentation: Representing new knowledge for users
What is data mining?
Key concept: Similarity
At the heart of mining data is the ability to detect similarities between objects.
Defining distance functions (or similarity measures, kernel or covariance functions) is
therefore a key topic in data mining.
In particular, scaling these functions to large, high-dimensional datasets is a central
current challenge.
Metric
Definition of a metric
We assume that the vectors $x_1, x_2, x_3$ are from a Euclidean space of dimension $d$, that is, $x_1, x_2, x_3 \in \mathbb{R}^d$.
A function $d$ is a metric iff
1. $d(x_1, x_2) \geq 0$
2. $d(x_1, x_2) = 0$ if and only if $x_1 = x_2$
3. $d(x_1, x_2) = d(x_2, x_1)$
4. $d(x_1, x_3) \leq d(x_1, x_2) + d(x_2, x_3)$
Similarity measures on vectors
Popular distance functions on vectors
We assume that $x, x' \in \mathbb{R}^d$.
The Manhattan Distance is
\[ d(x, x') = \sum_{i=1}^{d} |x_i - x'_i|. \]
The Hamming Distance on binary vectors is
\[ d(x, x') = \sum_{i=1}^{d} |x_i - x'_i|. \]
Similarity measures on vectors
Popular distance functions on vectors
The Euclidean Distance is defined as
\[ d(x, x') = \sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2}. \]
The Chebyshev Distance is defined as
\[ d(x, x') = \max_{i} |x_i - x'_i|. \]
Similarity measures on vectors
Popular distance functions on vectors
The Minkowski Distance is defined as
\[ d(x, x') = \left( \sum_{i=1}^{d} |x_i - x'_i|^p \right)^{1/p}, \quad \text{where } p \in \mathbb{R}^+ \]
We recover the Manhattan distance for $p = 1$ and the Euclidean distance for $p = 2$.
The larger $p$, the more large deviations in one dimension matter.
For $p \to \infty$, the Minkowski distance converges to the Chebyshev distance.
For $p \geq 1$, the Minkowski distance is a metric.
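The relationship between these distances can be illustrated with a short sketch. The function name `minkowski` and the example vectors are assumptions made for illustration; passing `math.inf` as `p` is treated here as the Chebyshev limit.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance between two equal-length vectors.

    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance,
    and p = math.inf is handled as the Chebyshev limit.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


x, y = (0.0, 0.0), (3.0, 4.0)  # illustrative example vectors
manhattan = minkowski(x, y, 1)         # 3 + 4
euclidean = minkowski(x, y, 2)         # sqrt(9 + 16)
chebyshev = minkowski(x, y, math.inf)  # max(3, 4)
```

For growing `p` the value moves from the Manhattan result towards the largest single coordinate difference, which matches the statement that large deviations in one dimension matter more.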
Similarity measures on sets
Finite sets of objects
Jaccard coefficient
\[ j(A, B) = \frac{|A \cap B|}{|A \cup B|} \]
Jaccard distance
\[ d(A, B) = 1 - j(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|} \]
Similarity measures on sets
Finite sets of objects
Overlap coefficient
\[ o(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)} \]
Sorensen-Dice coefficient
\[ s(A, B) = \frac{2|A \cap B|}{|A| + |B|} \]
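As a rough illustration, the Jaccard, overlap and Sorensen-Dice coefficients on finite sets can be computed with Python's built-in set type. The function names and the example sets are assumptions, and the sketch assumes both sets are non-empty.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| (non-empty sets assumed)."""
    return len(a & b) / len(a | b)


def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance 1 - j(A, B)."""
    return 1.0 - jaccard(a, b)


def overlap(a: set, b: set) -> float:
    """Overlap coefficient |A ∩ B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b))


def sorensen_dice(a: set, b: set) -> float:
    """Sorensen-Dice coefficient 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))


A = {1, 2, 3}      # illustrative example sets
B = {2, 3, 4, 5}
# The intersection has 2 elements, the union has 5.
```

With these example sets the Jaccard coefficient is 2/5, the overlap coefficient 2/3 and the Sorensen-Dice coefficient 4/7.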
Similarity measures on sets
Sets of vectors
Single link distance function
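\[ d(A, B) = \min_{a \in A,\, b \in B} d_{\mathrm{vector}}(a, b) \]
Complete link distance function
\[ d(A, B) = \max_{a \in A,\, b \in B} d_{\mathrm{vector}}(a, b) \]
Average link distance function
\[ d(A, B) = \frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} d_{\mathrm{vector}}(a, b) \]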
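The three link functions differ only in how they aggregate the pairwise vector distances. The sketch below assumes the Euclidean distance (`math.dist`, available from Python 3.8) as the base distance; the function names and the example points are illustrative.

```python
import math
from itertools import product


def single_link(A, B, dist=math.dist):
    """Smallest pairwise distance between a point of A and a point of B."""
    return min(dist(a, b) for a, b in product(A, B))


def complete_link(A, B, dist=math.dist):
    """Largest pairwise distance between a point of A and a point of B."""
    return max(dist(a, b) for a, b in product(A, B))


def average_link(A, B, dist=math.dist):
    """Mean of all |A| * |B| pairwise distances."""
    return sum(dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))


A = [(0.0, 0.0), (1.0, 0.0)]  # illustrative sets of 2-D vectors
B = [(3.0, 0.0), (5.0, 0.0)]
# Pairwise distances: 3, 5, 2, 4
```

Any other vectorial distance can be passed through the `dist` parameter.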
Similarity measures on strings
k-mer based similarity measures
Goal: Try to quantify the similarity between words $w$ and $w'$.
k-mers are substrings of length k.
Represent each string $w$ as a histogram of k-mer frequencies, $h_k(w)$.
Spectrum kernel (Leslie et al., 2002): Count the number of matching pairs of k-mers in $w$ and $w'$.
Example
Goal: Try to quantify the similarity between the words downtown and known.
$h_3$(downtown) = (dow: 1, own: 2, wnt: 1, nto: 1, tow: 1)
$h_3$(known) = (kno: 1, now: 1, own: 1)
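The example can be reproduced with a few lines of Python. This is a minimal sketch, not the implementation of Leslie et al.; the function names are assumptions, and the kernel value is computed as the inner product of the two k-mer histograms, which counts matching pairs of k-mers.

```python
from collections import Counter


def kmer_histogram(word: str, k: int) -> Counter:
    """Histogram of all substrings of length k in `word`."""
    return Counter(word[i:i + k] for i in range(len(word) - k + 1))


def spectrum_kernel(w1: str, w2: str, k: int) -> int:
    """Number of matching k-mer pairs: inner product of the two histograms."""
    h1, h2 = kmer_histogram(w1, k), kmer_histogram(w2, k)
    return sum(h1[kmer] * h2[kmer] for kmer in h1.keys() & h2.keys())


h_downtown = kmer_histogram("downtown", 3)
h_known = kmer_histogram("known", 3)
similarity = spectrum_kernel("downtown", "known", 3)
# Only "own" is shared: it occurs twice in downtown and once in known.
```

For the two words above the only shared 3-mer is own, so the kernel value is 2 · 1 = 2.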
Similarity measures on nodes
Shortest path distance
Objects are nodes in a graph $G$. Edge weights $w(i, j)$ represent distances between nodes $i$ and $j$.
Our goal is to quantify the similarity of an arbitrary pair of nodes.
The most popular distance function is the shortest path length.
The Floyd-Warshall algorithm computes all-pairs shortest paths in $O(n^3)$ time, where $n$ is the number of nodes in $G$.
Similarity measures on nodes
Floyd-Warshall’s Algorithm (1962)
procedure Floyd-Warshall(G = (V, E, w))
    d(i, i) := 0 for all i ∈ V
    d(i, j) := w(i, j), if (i, j) ∈ E
    d(i, j) := ∞, if i ≠ j and (i, j) ∉ E
    for k = 1 : n do
        for i = 1 : n do
            for j = 1 : n do
                if d(i, j) > d(i, k) + d(k, j) then
                    d(i, j) := d(i, k) + d(k, j)
    return matrix of shortest path distances D, D_ij = d(i, j)
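A runnable version of the pseudocode might look as follows. It is a sketch under stated assumptions: nodes are numbered 0 to n - 1, edges are given as a dictionary mapping directed pairs (i, j) to weights, the distance from a node to itself is set to 0, and there are no negative cycles. The example graph is invented for illustration.

```python
import math


def floyd_warshall(n, weights):
    """All-pairs shortest path distances for nodes 0..n-1.

    `weights` maps directed edges (i, j) to their weight; an undirected
    edge is represented by listing both directions.
    """
    # Initialisation: 0 on the diagonal, edge weight if an edge exists, else infinity.
    d = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (i, j), w in weights.items():
        d[i][j] = min(d[i][j], w)
    # Allow node k as an intermediate stop, one k at a time.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][j] > d[i][k] + d[k][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


# Illustrative undirected triangle: the direct edge 0-2 is longer than 0-1-2.
edges = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 2.0, (2, 1): 2.0, (0, 2): 5.0, (2, 0): 5.0}
D = floyd_warshall(3, edges)
```

The three nested loops over n nodes give the O(n³) runtime.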
Similarity measures on time series
Time series comparison: Theory and practice
If two time series $x, x'$ are vectors of length $d$ and corresponding dimensions represent the same point in time, any vectorial distance function can be used to compare them.
Unfortunately, these assumptions are often violated in practice:
We often compare time series of different length, $d \neq d'$.
The time points at which the time series were observed are not synchronous.
The time intervals between observations may vary within and between time series.
Similarity measures on time series
[Figure: two example time series, plotted over time points 0 to 8 with values 0 to 5]
Similarity measures on time series
Dynamic Time Warping (DTW)
A similarity measure for time series of different length, with different intervals between measurements.
It is the cost of an optimal alignment between the measurements of two time series, $x$ and $x'$. Individual time points are compared by a base distance function $d$ (e.g. a Minkowski distance).
The function DTW can be computed recursively as
\[
\mathrm{DTW}(i, j) = d(x_i, x'_j) + \min
\begin{cases}
\mathrm{DTW}(i, j-1) & \text{repeat } x_i \\
\mathrm{DTW}(i-1, j) & \text{repeat } x'_j \\
\mathrm{DTW}(i-1, j-1) & \text{repeat neither}
\end{cases}
\]
where $\mathrm{DTW}(0, 0) = 0$, $\mathrm{DTW}(i, 0) = \infty$, $\mathrm{DTW}(0, j) = \infty$ for all $1 \leq i \leq d$, $1 \leq j \leq d'$.
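The recursion can be evaluated bottom-up with a table of size (d + 1) × (d' + 1). The sketch below assumes scalar observations and the absolute difference as base distance; the function name `dtw` and the example series are illustrative.

```python
import math


def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """Dynamic time warping cost between two sequences of scalars."""
    n, m = len(x), len(y)
    # table[i][j] holds DTW(i, j); row 0 and column 0 are the boundary cases.
    table = [[math.inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = dist(x[i - 1], y[j - 1]) + min(
                table[i][j - 1],      # repeat x_i
                table[i - 1][j],      # repeat x'_j
                table[i - 1][j - 1],  # repeat neither
            )
    return table[n][m]


# The second series repeats the value 2 once; warping absorbs the repetition.
cost_same_shape = dtw([1, 2, 3], [1, 2, 2, 3])
cost_shifted = dtw([0, 1], [1])
```

Filling the table takes O(d · d') evaluations of the base distance.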
Similarity measures on time series
Distance Matrix (rows: values of x’, columns: values of x)
x’ \ x | 2 2 3 4 3 2 1 2 2
2      | 0 0 1 2 1 0 1 0 0
2      | 0 0 1 2 1 0 1 0 0
3      | 1 1 0 1 0 1 2 1 1
2      | 0 0 1 2 1 0 1 0 0
2      | 0 0 1 2 1 0 1 0 0
1      | 1 1 2 3 2 1 0 1 1
2      | 0 0 1 2 1 0 1 0 0
2      | 0 0 1 2 1 0 1 0 0
Similarity measures on time series
[Figure: two example time series, plotted over time points 0 to 8 with values 0 to 5]
Similarity measures on graphs
Approaches to Graph comparison
Family 1: Graph isomorphism or subgraph isomorphism test
Family 2: Graph edit distance
Cost of transforming graph 1 into graph 2
Family 3: Topological vectors
Map graph to vector
Then apply vectorial distance functions
Wiener Index
Graph representation
Let G be a graph with vertices V and edges E.
Let $P$ be the set of shortest paths in $G$.
Then the Wiener Index (Wiener, 1947) of $G$ is defined as
\[ \nu(G) = \frac{1}{2} \sum_{p \in P} p. \]
Graph comparison
The shortest path kernel (Borgwardt and Kriegel, ICDM 2005) is a class of similarity measures between two graphs $G$ and $G'$.
The simplest instance of this class is a product between the Wiener Indices of $G$ and $G'$:
\[ k(G, G') = \nu(G)\,\nu(G') \]
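A small sketch of the Wiener index and the resulting kernel follows. It rests on several assumptions that are not stated on the slide: graphs are connected, undirected and unweighted, each shortest path enters the sum through its length, and the set of shortest paths contains one path per ordered node pair, so the factor 1/2 corrects for double counting. Function names and example graphs are illustrative.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths (in edges) from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def wiener_index(adj):
    """Half the sum of shortest path lengths over all ordered node pairs."""
    total = sum(sum(bfs_distances(adj, s).values()) for s in adj)
    return total // 2  # every unordered pair was counted twice


def wiener_kernel(adj1, adj2):
    """Product of the two Wiener indices."""
    return wiener_index(adj1) * wiener_index(adj2)


path3 = {0: [1], 1: [0, 2], 2: [1]}           # path 0 - 1 - 2
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}  # complete graph on 3 nodes
```

For the path the pairwise distances are 1, 1 and 2, and for the triangle they are 1, 1 and 1.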
Weisfeiler-Lehman Kernel
(Shervashidze and Borgwardt, 2009)
1st iteration, given labeled graphs G1 and G2
Result of steps 1 and 2: multiset-label determination and sorting
Result of step 3: label compression
A,D → F; B,C → G; B,CE → H; B,DE → I; C,BDE → J; D,AACE → K; D,ABCE → L; E,BCD → M
Result of step 4: relabeling
End of 1st iteration: feature vector representation of G1 and G2
Label order: (A, B, C, D, E, F, G, H, I, J, K, L, M), that is, counts of original node labels (A to E) followed by counts of compressed node labels (F to M)
$\phi^{(1)}_{WL}(G_1) = (2, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1)$
$\phi^{(1)}_{WL}(G_2) = (1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1)$
$k^{(1)}_{WL}(G_1, G_2) = \langle \phi^{(1)}_{WL}(G_1), \phi^{(1)}_{WL}(G_2) \rangle = 11$
Subtree-like Patterns
[Figure: a graph with nodes numbered 1 to 6 and a subtree-like pattern built from it]
Weisfeiler-Lehman Kernel: Theoretical Runtime Properties
Fast Weisfeiler-Lehman kernel (NIPS 2009 and JMLR 2011)
Algorithm: Repeat the following steps h times
1. Sort: Represent each node $v$ as a sorted list $L_v$ of its neighbors ($O(m)$)
2. Compress: Compress this list into a hash value $h(L_v)$ ($O(m)$)
3. Relabel: Relabel $v$ by the hash value $h(L_v)$ ($O(n)$)
Runtime analysis
per graph pair: runtime $O(m h)$
for $N$ graphs: runtime $O(N m h + N^2 n h)$ (naively $O(N^2 m h)$)
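The sort, compress and relabel steps can be sketched in a few lines. This is an illustrative simplification, not the authors' implementation: instead of a hash value, the pair (own label, sorted neighbor labels) is used directly as the new label. The two edge lists are an assumption, reconstructed from the multiset labels of the worked example in these slides.

```python
from collections import Counter


def build_graph(edges, labels):
    """Return (adjacency dict, node labels) for an undirected graph."""
    adj = {v: [] for v in labels}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj, labels


def wl_features(adj, labels, iterations=1):
    """Counts of original labels plus the refined labels of each iteration."""
    counts = Counter(labels.values())
    current = dict(labels)
    for _ in range(iterations):
        # Sort: pair each node label with the sorted labels of its neighbors.
        # Compress and relabel: the pair itself serves as the compressed label.
        current = {
            v: (current[v], tuple(sorted(current[u] for u in adj[v])))
            for v in adj
        }
        counts.update(current.values())
    return counts


def wl_kernel(g1, g2, iterations=1):
    """Inner product of the two label-count vectors."""
    f1, f2 = wl_features(*g1, iterations), wl_features(*g2, iterations)
    return sum(f1[key] * f2[key] for key in f1.keys() & f2.keys())


# Edge lists reconstructed from the multiset labels of the example (assumption).
G1 = build_graph(
    [(6, 3), (6, 4), (6, 5), (5, 1), (5, 2), (5, 4), (4, 3)],
    {1: "a", 2: "a", 3: "b", 4: "c", 5: "d", 6: "e"},
)
G2 = build_graph(
    [(6, 2), (6, 4), (6, 5), (5, 1), (5, 2), (5, 4), (4, 3)],
    {1: "a", 2: "b", 3: "b", 4: "c", 5: "d", 6: "e"},
)
kernel_value = wl_kernel(G1, G2, iterations=1)
```

Under these assumed edge lists, one iteration yields 7 from the original labels and 4 from the compressed labels, giving 11, the kernel value of the worked example.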
Weisfeiler-Lehman Kernel: Empirical Runtime Properties
[Figure: four panels showing runtime in seconds, for pairwise and global computation, as a function of the number of graphs N, the graph size n, the subtree height h, and the graph density c]
Weisfeiler-Lehman Kernel: Runtime and Accuracy
[Figure: runtime (from 10 sec to 1000 days) and accuracy (from 50 % to 85 %) of the WL, RG, 3 Graphlet, RW and SP kernels on the datasets MUTAG, NCI1, NCI109 and D&D, ordered by graph size]