Output Space Sampling
Mohammad Hasan, Mohammed Zaki
RPI, Troy, NY
Consider the following problem from Medical Informatics
[Figure: pipeline from Tissue Images to Cell Graphs to Discriminatory Subgraphs to a Classifier that labels tissue as Healthy, Diseased, or Damaged]
5/12/2017
Mining Task
Dataset: 30 graphs; average vertex count: 2154; average edge count: 36945
Support threshold: 40%
Result: no result (using gSpan, Gaston) after a week of running on a 2 GHz dual-core PC with 4 GB RAM running Linux
Limitations of Existing Subgraph Mining Algorithms
They work only for small graphs
The most popular datasets in graph mining are chemical graphs
Chemical graphs are mostly trees
In the DTP dataset (the most popular), the average vertex count is 43 and the average edge count is 45
They perform a complete enumeration
For a large input graph, the output set is neither enumerable nor usable
They follow a fixed enumeration order
A partial run does not efficiently generate the interesting subgraphs
Goal: avoid complete enumeration by sampling a set of interesting subgraphs from the output set
Why is sampling a solution?
Observation 1:
Mining is only an exploratory step; mined patterns are generally used in a subsequent KD task
Not all frequent patterns are equally important for the desired task at hand
A large output set leads to an information-overload problem
So complete enumeration is generally unnecessary
Observation 2:
Traditional mining algorithms explore the output space in a fixed enumeration order
This is good for generating non-duplicate candidate patterns
But consecutive patterns in that order are very similar
Sampling can change the enumeration order so that interesting, non-redundant subgraphs are drawn with higher probability
Output Space
Traditional frequent subgraphs for a given support threshold
Can also be augmented with other constraints, to find good patterns for the desired KD task
[Figure: an input space of database graphs mapped to the output space of FPM with support = 2]
Sampling from the Output Space
Return a random pattern from the output set
The random pattern is obtained by sampling from a desired distribution
Define an interestingness function f : F → R+; f(p) returns the score of pattern p
The desired sampling distribution is proportional to the interestingness score
Example: if the output space has only 3 patterns with scores 2, 3, and 4, sampling should follow the distribution {2/9, 3/9, 4/9}
Efficiency consideration: enumerate as few auxiliary patterns as possible
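As a toy illustration of this target distribution (not the mining algorithm itself, which cannot materialize the output set), here is a short Python sketch that draws a pattern with probability proportional to its score; the pattern names g1–g3 are hypothetical:

```python
import random

def sample_proportional(scores, rng=random):
    """Draw one pattern with probability proportional to its score."""
    total = sum(scores.values())
    r = rng.random() * total
    acc = 0.0
    for pattern, s in scores.items():
        acc += s
        if r <= acc:
            return pattern
    return pattern  # guard against floating-point round-off

# Three patterns with scores 2, 3, 4: target distribution {2/9, 3/9, 4/9}
scores = {"g1": 2.0, "g2": 3.0, "g3": 4.0}
```

The whole point of output space sampling is that this direct approach is infeasible for subgraph mining, since the `scores` table (the output set) cannot be instantiated; the MCMC approach below achieves the same distribution without materializing it.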
How to choose f?
It depends on application needs:
For exploratory data analysis (EDA), every frequent pattern can have a uniform score
For top-k pattern mining, support values can be used as scores (support-biased sampling)
For the subgraph summarization task, only maximal graph patterns have a uniform non-zero score
For graph classification, discriminatory subgraphs should have high scores
Challenges
The output space cannot be instantiated
Complete statistics about the output space are not known
The target distribution is not known entirely
[Figure: the output space of graph mining, with graphs g1, …, gn and scores s1, …, sn]
We want: π(i) = s_i / Σ_j s_j
Solution Approach (MCMC Sampling)
Perform a random walk in the output space
Represent the output space as a transition graph to allow local transitions
Edges of the transition graph are chosen based on structural similarity
Make sure that the random walk is ergodic
Use the partial order graph (POG) as the transition graph: in the POG, every pattern is connected to its sub-patterns (with one less edge) and all of its super-patterns (with one more edge)
Algorithm
Define the transition graph (for instance, the POG)
Define an interestingness function that selects the desired sampling distribution
Perform a random walk on the transition graph:
Compute the neighborhood locally
Compute the transition probability, utilizing the interestingness score (this is what makes the method generic)
Return the currently visited pattern after k iterations
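A minimal Python sketch of these steps, assuming hypothetical callables `neighbors(p)` (the locally computed neighborhood of pattern p in the transition graph) and `score(p)` (the interestingness function f):

```python
import random

def mh_walk(start, neighbors, score, k, rng=random):
    """Metropolis-Hastings random walk; returns the pattern visited after k steps."""
    current = start
    for _ in range(k):
        nbrs = neighbors(current)
        proposal = rng.choice(nbrs)            # uniform proposal: q(u, v) = 1/|N(u)|
        q_uv = 1.0 / len(nbrs)
        q_vu = 1.0 / len(neighbors(proposal))
        # accept with probability min(1, (s_v * q_vu) / (s_u * q_uv))
        accept = min(1.0, (score(proposal) * q_vu) / (score(current) * q_uv))
        if rng.random() < accept:
            current = proposal
    return current
```

With a uniform score the acceptance ratio reduces to min(1, d_u/d_v), the rule used below for uniform sampling of frequent patterns.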
Local Computation of the Output Space
From the current pattern, generate its super-patterns and sub-patterns locally
Patterns that are not part of the output space are discarded during local neighborhood computation
[Figure: current pattern u with neighbors g1, …, g5 and transition probabilities p01, …, p05; together with the self-loop probability p00, the row sums to 1]
Compute P to Achieve the Target Distribution
[Figure: graphs g1, …, gn with scores s1, …, sn]
We want: π(i) = s_i / Σ_j s_j
If π is the stationary distribution and P is the transition matrix, then in equilibrium πP = π
The main task is to choose P so that the desired stationary distribution is achieved
In fact, we compute only one row of P (local computation)
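The equilibrium condition πP = π is easy to verify numerically; a small sketch with a hypothetical 3-state row-stochastic matrix:

```python
def is_stationary(pi, P, tol=1e-9):
    """Check the equilibrium condition pi * P = pi for a row-stochastic P."""
    n = len(pi)
    return all(
        abs(sum(pi[i] * P[i][j] for i in range(n)) - pi[j]) < tol
        for j in range(n)
    )

# A reversible 3-state chain whose stationary distribution is (1/4, 1/2, 1/4)
P = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]
pi = [0.25, 0.5, 0.25]
```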
Use the Metropolis-Hastings (MH) Algorithm
1. Fix an arbitrary proposal distribution q beforehand
2. Find a neighbor j (to move to) by sampling from q
3. Compute the acceptance probability and accept the move with this probability; for a move from pattern 0 to neighbor 3:
   α(0 → 3) = min( (s3 · q30) / (s0 · q03), 1 )
4. If accepted, move to j; otherwise, go to step 2
[Figure: current pattern 0 with neighbors 1-5, proposal probabilities q01, …, q05 and self-loop q00; neighbor 3 is selected]
Uniform Sampling of Frequent Patterns
Target distribution: (1/n, 1/n, …, 1/n)
How to achieve it? Use a uniform proposal distribution
The acceptance probability is then min(1, du / dv), where dx is the degree of vertex x in the POG
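The acceptance rule above as a one-liner, where d_u and d_v are the POG degrees of the current and proposed patterns:

```python
def uniform_accept(d_u, d_v):
    """MH acceptance for a uniform target with a uniform-over-neighbors proposal."""
    return min(1.0, d_u / d_v)
```

A move from a high-degree pattern to a low-degree one is always accepted; the reverse move is damped by the degree ratio, which is exactly what equalizes the stationary probabilities.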
Uniform Sampling: Transition Probability Matrix
[Figure: an example POG over small patterns (vertices A, B, D and edges among them) with the corresponding transition probability matrix P]
Discriminatory Subgraph Sampling
Database graphs are labeled
Subgraphs may be used as:
Features for supervised classification
Graph kernels
[Figure: labeled database graphs (G1: +1, G2: +1, G3: -1) passed through subgraph mining to produce subgraph features g1, g2, g3, … for each graph Gi]
Sampling in Proportion to a Discriminatory Score (f)
Interestingness score (feature quality):
Entropy
Delta score = |positive support - negative support|
Direct mining is difficult: score values (entropy, delta score) are neither monotone nor anti-monotone, so for a parent pattern P and child pattern C, Score(P) can be smaller than, equal to, or greater than Score(C)
Discriminatory Subgraph Sampling
Use the Metropolis-Hastings algorithm
Choose a neighbor uniformly as the proposal distribution
Compute the acceptance probability from the delta score: the ratio of the delta scores of j and i, times the ratio of the degrees of i and j, i.e., min(1, (Δj / Δi) · (di / dj))
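A sketch of this acceptance computation; the delta and degree values passed in are hypothetical inputs, and `delta_score` follows the definition on the previous slide:

```python
def delta_score(pos_support, neg_support):
    """Delta score = |positive support - negative support|."""
    return abs(pos_support - neg_support)

def discriminatory_accept(delta_i, delta_j, deg_i, deg_j):
    """MH acceptance: (delta score ratio of j over i) * (degree ratio of i over j)."""
    return min(1.0, (delta_j / delta_i) * (deg_i / deg_j))
```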
Datasets

Name          # of Graphs           Average Vertex Count   Average Edge Count
DTP           1084                  43                     45
Chess         3196                  10.25                  -
Mutagenicity  2401 (+), 1936 (-)    17                     18
PPI           3                     2154                   81607
Cell-Graphs   30                    2184                   36945
Result Evaluation Metrics
Sampling quality:
Our sampling distribution vs. the target sampling distribution
How the sampling converges (convergence rate)
Variation distance: Δ(t) = (1/2) Σ_y |P^t(x, y) - π(y)|
Median and standard deviation of visit counts
Scalability test: experiments on large datasets
Quality of sampled patterns
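The variation distance above can be computed directly from the two distributions; a short sketch, with both distributions given as dicts over the same support:

```python
def variation_distance(p_t, pi):
    """Total variation distance: 0.5 * sum over y of |P^t(x, y) - pi(y)|."""
    return 0.5 * sum(abs(p_t[y] - pi[y]) for y in pi)
```

It is 0 when the walk's t-step distribution matches the target exactly and 1 when the two distributions have disjoint support.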
Uniform Sampling Results
Experiment setup: run the sampling algorithm for a sufficient number of iterations and observe the visit-count distribution
For a dataset with n frequent patterns, we perform 200n iterations

Result on the DTP chemical dataset:
                   Max count   Min count   Median   Std
Uniform Sampling   338         32          209      59.02
Ideal Sampling     -           -           200      14.11
Sampling Quality
Depends on the choice of proposal distribution
If the vertices of the POG have similar degree values, sampling quality is good
The earlier dataset has patterns with widely varying degree values
For a clique dataset, sampling quality is almost perfect

Result on the Chess (itemset) dataset (100n iterations):
                   Max count   Min count   Median   Std
Uniform Sampling   156         6           100      13.64
Ideal Sampling     -           -           100      10
Discriminatory Sampling Results (Mutagenicity dataset)
[Figures: distribution of the delta score among all frequent patterns; relation between sampling rate and delta score]
Discriminatory Sampling Results (cont.)

Sample No   Delta Score   Rank   % of POG Explored
1           404           132    5.7
2           644           21     11.0
3           707           10     10.8
4           725           4      8.9
5           280           595    2.8
6           725           4      8.9
7           627           27     3.3
8           709           9      7.7
9           721           5      9.1
10          725           4      8.9
Discriminatory Sampling Results (Cell Graphs)
Total graphs: 30; min-sup = 6
No graph mining algorithm could run on this dataset in a week of running (on a 2 GHz machine with 4 GB of RAM)
[Chart: number of subgraphs with delta score > 9; the traditional algorithm finds none, while OSS finds a substantial number]
Summary
Existing algorithms: depth-first or breadth-first walk on the subgraph space; rightmost extension; complete algorithms
Output Space Sampling: random walk on the subgraph space; arbitrary extension; a sampling algorithm
Quality: sampling quality guarantee
Scalability: visits only a small part of the search space
Non-redundancy: finds very dissimilar patterns by virtue of randomness
Genericity: in terms of pattern type and sampling objective
Future Work and Discussion
It is important to choose the proposal distribution wisely to get better sampling
For large graphs, support counting is still a bottleneck:
How to avoid the isomorphism checking entirely?
How to effectively parallelize the support counting?
How to make the random walk converge faster?
The POG generally has a small spectral gap, so convergence is slow; this makes the algorithm costly (more steps are needed to find good samples)
Acceptance Probability Computation
[Figure: the MH acceptance formula annotated with its three components: the desired distribution, the proposal distribution, and the interestingness values]
Support-Biased Sampling
We want: π(i) = s_i / Σ_j s_j, where s_i is the support of pattern i
What proposal distribution should we choose? For a pattern u with up-neighbors N_up(u) (super-patterns) and down-neighbors N_down(u) (sub-patterns):
Q(u, v) = (1 - α) / |N_up(u)|   if v ∈ N_up(u)
Q(u, v) = α / |N_down(u)|       if v ∈ N_down(u)
α = 1 if N_up(u) = ø; α = 0 if N_down(u) = ø
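A sketch of this proposal distribution in Python, reconstructed to be consistent with the worked example on the next slide; `n_up` and `n_down` are the counts of up- and down-neighbors of the current pattern:

```python
def proposal_prob(v_is_up, n_up, n_down, alpha):
    """Q(u, v): probability of proposing neighbor v from pattern u."""
    if n_up == 0:
        alpha = 1.0      # no super-patterns: all proposal mass goes down
    if n_down == 0:
        alpha = 0.0      # no sub-patterns: all proposal mass goes up
    if v_is_up:
        return (1.0 - alpha) / n_up
    return alpha / n_down
```

With α = 1/3 and three down-neighbors, a down move has probability 1/9; from a pattern with no down-neighbors and two up-neighbors, each up move has probability 1/2, reproducing the q values in the example.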
Example of Support-Biased Sampling
At v: α = 1/3, so q(v, u) = 1/(3 × 3) = 1/9; from u: q(u, v) = 1/2
Supports: s(u) = 2, s(v) = 3
Acceptance probability for the move u → v:
min(1, (s(v) · q(v, u)) / (s(u) · q(u, v))) = min(1, (3 × 1/9) / (2 × 1/2)) = 1/3
[Figure: an example POG fragment showing patterns u and v and their neighbors]
Sampling Convergence
[Figure: convergence of the sampling distribution over iterations]
Support-Biased Sampling Results
A scatter plot of visit count vs. support shows positive correlation (correlation: 0.76)
[Figure: scatter plot of visit count against support]
Specific Sampling Examples and Their Uses
Uniform sampling of frequent patterns:
To explore the frequent patterns
To set a proper value of minimum support
To make an approximate count
Support-biased sampling:
To find the top-k patterns in terms of support value
Discriminatory subgraph sampling:
To find subgraphs that are good features for classification