MUSK: Uniform Sampling of K Maximal Graphs

Mohammad Hasan, Mohammed Zaki
RPI, Troy, NY
Consider the following problem from Medical Informatics

[Figure: pipeline from tissue images to cell graphs to discriminatory subgraphs, used by a classifier to label tissue as healthy, diseased, or damaged]

5/12/2017
Mining Task

- Dataset: 30 graphs, average vertex count 2154, average edge count 36945
- Support: 40%
- Result: no result (using gSpan and Gaston) after a week of running on a 2 GHz dual-core PC with 4 GB of RAM running Linux
Limitations of Existing Subgraph Mining Algorithms

- They work only for small graphs
  - The most popular datasets in graph mining are chemical graphs
  - Chemical graphs are mostly trees
  - In the DTP dataset (the most popular dataset), the average vertex count is 43 and the average edge count is 45
- They perform a complete enumeration
  - For a large input graph, the output set is neither enumerable nor usable
- They follow a fixed enumeration order
  - A partial run does not efficiently generate the interesting subgraphs

Idea: avoid complete enumeration by sampling a set of interesting subgraphs from the output set
Why is sampling a solution?

- Observation 1:
  - Mining is only an exploratory step; mined patterns are generally used in a subsequent KD task
  - Not all frequent patterns are equally important for the desired task at hand
  - A large output set leads to an information-overload problem
  - So complete enumeration is generally unnecessary
- Observation 2:
  - Traditional mining algorithms explore the output space in a fixed enumeration order
  - This order is good for generating non-duplicate candidate patterns
  - But subsequent patterns in that order are very similar
  - Sampling can change the enumeration order so that interesting and non-redundant subgraphs are found with a higher chance
Output Space

- Traditional frequent subgraphs for a given support threshold
- Can also be augmented with other constraints to find good patterns for the desired KD task

[Figure: an input space of database graphs and the corresponding output space for FPM with support = 2]
Sampling from Output Space

- Return a random pattern from the output set
- The random pattern is obtained by sampling from a desired distribution
  - Define an interestingness function f : F → R+, where f(p) returns the score of pattern p
  - The desired sampling distribution is proportional to the interestingness score
  - If the output space has only 3 patterns with scores 2, 3, and 4, sampling should be performed from the distribution {2/9, 3/9, 4/9}
- Efficiency consideration: enumerate as few auxiliary patterns as possible
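As a quick illustration of sampling in proportion to interestingness scores, the toy example with scores 2, 3, 4 can be normalized in a few lines of plain Python (no project code assumed):

```python
# Normalize interestingness scores into a score-proportional sampling
# distribution (the slide's toy example with scores 2, 3, 4).
scores = [2, 3, 4]
total = sum(scores)
distribution = [s / total for s in scores]
print(distribution)  # [2/9, 3/9, 4/9], i.e. ≈ [0.222, 0.333, 0.444]
```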
How to choose f?

- It depends on application needs
  - For exploratory data analysis (EDA), every frequent pattern can have a uniform score
  - For top-k pattern mining, support values can be used as scores, which gives support-biased sampling
  - For the subgraph summarization task, only maximal graph patterns have a uniform non-zero score
  - For graph classification, discriminatory subgraphs should have high scores
Challenges

- The output space cannot be instantiated
- Complete statistics about the output space are not known
- The target distribution is not known entirely

[Figure: output space of graph mining, with patterns g1..g5 and scores s1..sn]

We want: π(i) = s_i / Σ_j s_j
Solution Approach (MCMC Sampling)

- Perform a random walk in the output space
- Represent the output space as a transition graph to allow local transitions
  - Edges of the transition graph are chosen based on structural similarity
  - Use the POG as the transition graph: in the POG, every pattern is connected to its sub-patterns (with one less edge) and all its super-patterns (with one more edge)
- Make sure that the random walk is ergodic
Algorithm

- Define the transition graph (for instance, the POG)
- Define an interestingness function that selects the desired sampling distribution
- Perform a random walk on the transition graph
  - Compute the neighborhood locally
  - Compute the transition probability
    - Utilizing the interestingness score here makes the method generic
- Return the currently visited pattern after k iterations
Local Computation of Output Space

- The local neighborhood of a pattern consists of its super-patterns and sub-patterns
- Patterns that are not part of the output space are discarded during local neighborhood computation

[Figure: pattern g0 with neighbors g1..g5 and transition probabilities p00..p05, which sum to 1]
Compute P to achieve Target Distribution

We want: π(i) = s_i / Σ_j s_j

- If π is the stationary distribution and P is the transition matrix, then in equilibrium we have π = πP
- The main task is to choose P so that the desired stationary distribution is achieved
- In fact, we compute only one row of P (local computation)
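To make the π = πP condition concrete, the sketch below builds a Metropolis-Hastings transition matrix row by row for a hypothetical 3-pattern chain (the states, scores, and neighbor lists are illustrative, not from the paper) and checks that the score-proportional target is stationary:

```python
# Build an MH transition matrix P for target pi(i) proportional to s_i
# on a toy 3-state chain 0-1-2; states and scores are illustrative only.
s = [2.0, 3.0, 4.0]                       # interestingness scores
nbrs = {0: [1], 1: [0, 2], 2: [1]}        # transition-graph neighbors

n = len(s)
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in nbrs[i]:
        q_ij = 1.0 / len(nbrs[i])         # uniform proposal over neighbors
        q_ji = 1.0 / len(nbrs[j])
        P[i][j] = q_ij * min(1.0, (s[j] * q_ji) / (s[i] * q_ij))
    P[i][i] = 1.0 - sum(P[i])             # rejected proposals stay put

pi = [x / sum(s) for x in s]              # desired stationary distribution
pi_P = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
print(pi_P)  # equals pi up to floating-point error
```

In practice only the row of the current pattern is ever materialized, which is exactly the "one row of P" local computation above.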
Use Metropolis-Hastings (MH) Algorithm

1. Fix an arbitrary proposal distribution q beforehand
2. Find a neighbor j (to move to) by sampling from this distribution
3. Compute the acceptance probability and accept the move with this probability; for a move from pattern 0 to its neighbor 3:
   α_03 = min( (s_3 q_30) / (s_0 q_03), 1 )
4. If accepted, move to j; otherwise, go to step 2

[Figure: current state 0 with neighbors 1..5 and proposal probabilities q00..q05; neighbor 3 is selected]
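A minimal sketch of the four MH steps above, on a hypothetical 3-pattern transition graph (the pattern names, scores, and neighbor lists are made up for illustration, not the authors' code):

```python
import random

# Toy transition graph: a chain g0 - g1 - g2 with illustrative scores.
scores = {"g0": 1.0, "g1": 2.0, "g2": 3.0}
neighbors = {"g0": ["g1"], "g1": ["g0", "g2"], "g2": ["g1"]}

def mh_walk(start, steps, rng):
    state = start
    visits = {g: 0 for g in scores}
    for _ in range(steps):
        j = rng.choice(neighbors[state])              # step 2: propose j via q
        q_ij = 1.0 / len(neighbors[state])
        q_ji = 1.0 / len(neighbors[j])
        alpha = min(1.0, (scores[j] * q_ji) / (scores[state] * q_ij))
        if rng.random() < alpha:                      # step 3: accept w.p. alpha
            state = j                                 # step 4: move, else stay
        visits[state] += 1
    return visits

rng = random.Random(0)                                # fixed seed, repeatable
visits = mh_walk("g0", 200_000, rng)
total = sum(visits.values())
print({g: round(v / total, 3) for g, v in visits.items()})
```

Because the chain is ergodic, the empirical visit frequencies converge to the score-proportional target 1/6, 2/6, 3/6.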
Uniform Sampling of Frequent Patterns

- Target distribution: 1/n, 1/n, ..., 1/n
- How to achieve it?
  - Use a uniform proposal distribution
  - The acceptance probability is then min(1, d_u / d_v), where d_x is the degree of vertex x in the POG
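The min(1, d_u/d_v) rule can be checked deterministically on a toy POG (here a star with hub "B" and leaves "A", "C", "D", a made-up example rather than real patterns): the resulting transition matrix leaves the uniform distribution unchanged.

```python
# Toy POG as an adjacency list (hypothetical patterns).
pog = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}
names = sorted(pog)
deg = {u: len(pog[u]) for u in pog}

# Build the transition matrix: propose uniformly (prob 1/d_u), accept
# with probability min(1, d_u/d_v); rejected proposals stay at u.
P = {u: {v: 0.0 for v in names} for u in names}
for u in names:
    for v in pog[u]:
        P[u][v] = (1.0 / deg[u]) * min(1.0, deg[u] / deg[v])
    P[u][u] = 1.0 - sum(P[u].values())

# The uniform distribution (1/4 per pattern) is stationary: (pi P)(v) = 1/4.
pi_P = {v: sum(0.25 * P[u][v] for u in names) for v in names}
print(pi_P)
```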
Uniform Sampling, Transition Probability Matrix

[Figure: example POG over patterns A, B, D and its transition probability matrix P]
Discriminatory Subgraph Sampling

- Database graphs are labeled
- Subgraphs may be used as
  - features for supervised classification
  - a graph kernel

[Figure: labeled graphs G1, G2, G3 (labels +1, +1, -1) passed through subgraph mining to build a feature matrix of subgraphs g1, g2, g3, ... over G1, G2, G3]
Sampling in Proportion to Discriminatory Score (f)

- Interestingness score (feature quality)
  - entropy
  - delta score = abs(positive support - negative support)
- Direct mining is difficult
  - Score values (entropy, delta score) are neither monotone nor anti-monotone: for a pattern P and its child pattern C, Score(P) may be larger or smaller than Score(C)
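The delta score itself is a one-line computation; the supports below are made-up counts for three hypothetical subgraphs:

```python
# Delta score from the slide: abs(positive support - negative support).
def delta_score(pos_support, neg_support):
    return abs(pos_support - neg_support)

# Hypothetical (positive support, negative support) pairs per subgraph.
patterns = {"g1": (120, 15), "g2": (60, 58), "g3": (10, 90)}
scores = {p: delta_score(pos, neg) for p, (pos, neg) in patterns.items()}
print(scores)  # {'g1': 105, 'g2': 2, 'g3': 80}
```

Note that g3 scores high even though its positive support is tiny: the score rewards imbalance in either direction, which is also why it is neither monotone nor anti-monotone under pattern extension.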
Discriminatory Subgraph Sampling

- Use the Metropolis-Hastings algorithm
  - Choose a neighbor uniformly as the proposal distribution
  - Compute the acceptance probability from the delta score: the acceptance ratio multiplies the delta-score ratio of j and i by the degree ratio of i and j
Datasets

Name          # of Graphs          Avg. Vertex Count   Avg. Edge Count
DTP           1084                 43                  45
Chess         3196                 10.25               -
Mutagenicity  2401 (+), 1936 (-)   17                  18
PPI           3                    2154                81607
Cell-Graphs   30                   2184                36945
Result Evaluation Metrics

- Sampling quality
  - Our sampling distribution vs. the target sampling distribution: median and standard deviation of visit counts
  - How the sampling converges (convergence rate): variation distance, (1/2) Σ_y | P^t(x, y) − π(y) |
- Scalability test
  - Experiments on large datasets
- Quality of sampled patterns
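The variation-distance metric above can be sketched for a toy reversible chain (the matrix P and its stationary distribution π below are illustrative, not taken from the experiments):

```python
# Variation distance (1/2) * sum_y |P^t(x, y) - pi(y)| for a toy chain.
def variation_distance(row, pi):
    return 0.5 * sum(abs(p - q) for p, q in zip(row, pi))

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]        # stationary distribution is (1/4, 1/2, 1/4)
pi = [0.25, 0.50, 0.25]

dists = []
Pt = P
for t in range(1, 6):
    dists.append(variation_distance(Pt[0], pi))   # start state x = 0
    Pt = matmul(Pt, P)
print([round(d, 5) for d in dists])  # shrinks toward 0 as t grows
```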
Uniform Sampling Results

- Experiment setup
  - Run the sampling algorithm for a sufficient number of iterations and observe the visit-count distribution
  - For a dataset with n frequent patterns, we perform 200·n iterations

Result on the DTP chemical dataset:

                   Max Count   Min Count   Median   Std
Uniform Sampling   338         32          209      59.02
Ideal Sampling     -           -           200      14.11
Sampling Quality

- Depends on the choice of proposal distribution
  - If the vertices of the POG have similar degree values, sampling is good
  - The earlier dataset has patterns with widely varying degree values
  - For a clique dataset, sampling quality is almost perfect

Result on the Chess (itemset) dataset (100·n iterations):

                   Max Count   Min Count   Median   Std
Uniform Sampling   156         6           100      13.64
Ideal Sampling     -           -           100      10
Discriminatory Sampling Results (Mutagenicity dataset)

[Figures: distribution of delta score among all frequent patterns; relation between sampling rate and delta score]
Discriminatory Sampling Results (cont.)

Sample No   Delta Score   Rank   % of POG Explored
1           404           132    5.7
2           644           21     11.0
3           707           10     10.8
4           725           4      8.9
5           280           595    2.8
6           725           4      8.9
7           627           27     3.3
8           709           9      7.7
9           721           5      9.1
10          725           4      8.9
Discriminatory Sampling Results (Cell Graphs)

- Total graphs: 30, min-sup = 6
- No graph mining algorithm could run the dataset to completion in a week of running (on a 2 GHz machine with 4 GB of RAM)

[Bar chart: number of subgraphs with delta score > 9 found by a traditional algorithm vs. OSS]
Summary

Existing algorithms:
- Depth-first or breadth-first walk on the subgraph space
- Rightmost extension
- Complete algorithms

Output Space Sampling:
- Random walk on the subgraph space
- Arbitrary extension
- Sampling algorithm
- Quality: sampling quality guarantee
- Scalability: visits only a small part of the search space
- Non-redundant: finds very dissimilar patterns by virtue of randomness
- Genericity: in terms of pattern type and sampling objective
Future Work and Discussion

- It is important to choose the proposal distribution wisely to get better sampling
- For large graphs, support counting is still a bottleneck
  - How to avoid the isomorphism checking entirely?
  - How to effectively parallelize the support counting?
- How to make the random walk converge faster?
  - The POG generally has a small spectral gap; as a result, convergence is slow
  - This makes the algorithm costly (more steps are needed to find good samples)
Acceptance Probability Computation

[Figure: the acceptance probability expressed in terms of the desired distribution, the proposal distribution, and the interestingness value]
Support Biased Sampling

We want: π(i) = s_i / Σ_j s_j, where s_i is the support of pattern i

What proposal distribution should we choose?

  Q(u, v) = α · 1/|N_up(u)|          if v ∈ N_up(u)
          = (1 − α) · 1/|N_down(u)|  if v ∈ N_down(u)

with α = 0 if N_up(u) = ø, and α = 1 if N_down(u) = ø
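A sketch of this proposal distribution for a hypothetical pattern u; the function name and the super-/sub-pattern neighbor lists are assumptions for illustration only:

```python
# Q(u, v) = alpha / |N_up(u)|         if v is a super-pattern of u
#         = (1 - alpha) / |N_down(u)| if v is a sub-pattern of u
def proposal(n_up, n_down, alpha=0.5):
    if not n_down:            # no sub-patterns: all mass goes upward
        alpha = 1.0
    if not n_up:              # no super-patterns: all mass goes downward
        alpha = 0.0
    q = {}
    for v in n_up:
        q[v] = alpha / len(n_up)
    for v in n_down:
        q[v] = (1.0 - alpha) / len(n_down)
    return q

# u with two super-patterns and three sub-patterns, alpha = 1/3.
q = proposal(["p1", "p2"], ["c1", "c2", "c3"], alpha=1/3)
print(q)  # p1, p2 get 1/6 each; c1, c2, c3 get 2/9 each (sums to 1)
```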
Example of Support Biased Sampling

With α = 1/3: q(u, v) = 1/2, q(v, u) = 1/(3·3) = 1/9, s(u) = 2, s(v) = 3

[Figure: example POG over patterns A, B, D; the acceptance ratio is (s(v) · q(v, u)) / (s(u) · q(u, v)) = (3 × 1/9) / (2 × 1/2) = 1/3]
Sampling Convergence

[Figure: convergence of the sampling distribution over iterations]
Support Biased Sampling

- A scatter plot of visit count vs. support shows positive correlation (correlation: 0.76)
Specific Sampling Examples and Utilization

- Uniform sampling of frequent patterns
  - To explore the frequent patterns
  - To set a proper value of minimum support
  - To make an approximate counting
- Support-biased sampling
  - To find the top-k patterns in terms of support value
- Discriminatory subgraph sampling
  - To find subgraphs that are good features for classification