Download Mining Graph Patterns Efficiently via Randomized Summaries

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Chen Chen1, Cindy X. Lin1, Matt Fredrikson2,
Mihai Christodorescu3, Xifeng Yan4, Jiawei Han1
1University of Illinois at Urbana-Champaign
2University of Wisconsin at Madison
3IBM T. J. Watson Research Center
4University of California at Santa Barbara
1
Outline
 Motivation
 The efficiency bottleneck encountered in big networks
 Patterns must be preserved
 Summarize-Mine
 Experiments
 Summary
2
3
Frequent Subgraph Mining
 Find all graphs p such that |Dp| >= min_sup
 Get into the topological structures of graph data
 Useful for many downstream applications
4
Challenges
 Subgraph isomorphism checking is inevitable for any
frequent subgraph mining algorithm
 This will have problems on big networks
 Suppose there is only one triangle in the network
 But there are 1,000,000 length-2 paths
 We must enumerate all these 1,000,000, because any one
of them has the potential to grow into a full triangle
5
Too Many Embeddings
 Subgraph isomorphism is NP-hard
 So, when the problem size increases, …
 During the checking, large graphs are grown from
small subparts
 For small subparts, there might be too many
(overlapped) embeddings in a big network
 Such embedding enumerations will finally kill us
6
Motivating Application
 System call graphs from security research
 Model dependencies among system calls
 Unique subgraph signatures for malicious programs
 Compare malicious/benign programs
 These graphs are very big
 Thousands of nodes on average
 We tried state-of-art mining technologies, but failed
7
Our Approach
 Subgraph isomorphism checking cannot be done on
large networks
 So we do it on small graphs
 Summarize-Mine
 Summarize: Merge nodes by label and collapse
corresponding edges
 Mine: Now, state-of-art algorithms should work
8
Mining after Summarization
G1
G2
a
b
c
a
c
b
a
…
Original
…
…
Summarize
g1
g2
a
c
b
a
c
b
a
…
Summary
Mining
&
Output
…
…
9
Remedy for Pattern Changes
 Frequent subgraphs are presented on a different
abstraction level
 False negatives & false positives, compared to true
patterns mined from the un-summarized database D
 False negatives (recover)
 Randomized technique + multiple rounds
 False positives (delete)
 Verify against D
 Substantial work can be transferred to the summaries
10
Outline
 Motivation
 Summarize-Mine
 The algorithm flow-chart
 Recovering false negatives
 Verifying false positives
 Experiments
 Summary
11
12
False Negatives
 For a pattern p, if each of its vertices bears a different label,
then the embeddings of p must be preserved after
summarization
 Since we are merging groups of vertices by label, the nodes
of p should stay in different groups
 Otherwise,
Gi
a
p
a
c
b
a
b
gi
a
...
c
b
c
13
Missing Prob. of Embeddings
 Suppose
 Assign xj nodes for label lj (j=1,…,L) in the summary Si =>
xj groups of nodes with label lj in the original graph Gi
 Pattern p has mj nodes with label lj
 Then
14
No “Collision” for Same Labels
 Consider a specific embedding f: p->Gi, f is preserved if
vertices in f(p) stay in different groups
 Randomly assign mj nodes with label lj to xj groups,
the probability that they will not “collide” is:
 Multiply probabilities for independent events
15
Example
 A pattern with 5 labels, each label => 2 vertices
 m1 = m2 = m3 = m4 = m5 = 2
 Assign 20 nodes in the summary (i.e., 20 node groups
in the original graph) for each label
 The summary has 100 vertices
 x1 = x2 = x3 = x4 = x5 = 20
 The probability that an embedding will persist
19 19 19 19 19

   
 0.774
20 20 20 20 20
16
Extend to Multiple Graphs
 Setting x1,…,xL to the same values across all Gi’s in the
database
only depends on m1,…,mL, i.e., pattern p’s vertex
label distribution
 We denote this probability as q(p)

 For each of p’s support graphs in D, it has a probability
of at least q(p) to continue support p
 Thus, the overall support can be bounded below by a
binomial random variable
17
Support Moves Downward
18
False Negative Bound
19
Example, Cont.
 As above, q(p)=0.774
 min_sup=50
min_sup'
40
39
38
37
36
35
1 round
0.5966
0.4622
0.3346
0.2255
0.1412
0.0820
2 rounds
0.3559
0.2136
0.1119
0.0508
0.0199
0.0067
3 rounds
0.2123
0.0988
0.0374
0.0115
0.0028
0.0006
20
False Positives
Gi
a
p
a
a
b
gi
a
a
c
a
b
c
b
c
 Much easier to handle
 Just check against the original database D
 Discard if this “actual” support is less than min_sup
21
The Same Skeleton as gSpan
 DFS code tree
 Depth-first search
 Minimum DFS code?
 Check support by
isomorphism tests
 Record all one-edge
extensions along the way
 Pass down the projected
database and recurse
22
Integrate Verification Schemes
 Top-Down and Bottom-Up
 Possible factors
 Amount of false positives
 Top-down
verification
Transaction ID
list for p1 => Dcan
p1
be performed early
 Top-down preferred
Just
search
within
D-D
Just
search
within
Dpp12;
if frequent, can stop
by experiments
Transaction ID list for p2 => Dp2
23
Summary-Guided Verification
 Substantial verification work can be performed on the
summaries, as well
Got it!
24
Iterative Summarize-Mine
 Use a single pattern tree to hold all results spanning
across multiple iterations
 No need to combine pattern sets in a final step
 Avoid verifying patterns that have already been checked
by previous iterations
 Verified support graphs are accurate, they can help prepruning in later iterations
 Details omitted
25
Outline
 Motivation
 Summarize-Mine
 Experiments
 Summary
26
Dataset
 Real data
 W32.Stration, a family of mass-mailing worms
 W32.Virut, W32.Delf, W32.Ldpinch, W32.Poisonivy, etc.
 Vertex # up to 20,000 and edge # even higher
 Avg. # of vertices: 1,300
 Synthetic data
 Size, # of distinct node/edge labels, etc.
 Generator details omitted
27
A Sample Malware Signature
 Mined from W32.Stration
 A malware reading and leaking certain registry
settings related to the network devices
28
Comparison with gSpan
 gSpan is an efficient graph pattern mining algorithm
 Graphs with different size are randomly drawn
 Eventually, gSpan cannot work
29
The Influence of min_sup'
 Total vs. False Positives
 The gap corresponds to true patterns
 It gradually widens as we decrease min_sup'
30
Summarization Ratio
 10/1 node(s) before/after summarization => ratio=10
 Trading-off min_sup' and t as the inner loop
 A range of reasonable parameters in the middle
31
Scalability
 On the synthetic data
 Parameters are tuned as done above
32
Outline
 Motivation
 Summarize-Mine
 Experiments
 Summary
33
Summary
 We solve the frequent subgraph mining problem for
graphs with big size
 We found interesting malware signatures
 Our algorithm is much more efficient, while the state-
of-art mining technologies do not work
 We show that patterns can be well preserved on
higher-level by a good generalization scheme
 Very useful, given the emerging trend of huge networks
 The data has to be preprocessed and summarized
34
Summary
 Our method is orthogonal to many previous works on
this topic => Combine for further improvement
 Efficient pattern space traversal
 Other data space reduction techniques different from
our compression within individual transactions


Transaction sampling, merging, etc.
They perform compression between transactions
35
36