Download Summarize-Mine

Chen Chen1, Cindy X. Lin1, Matt Fredrikson2, Mihai Christodorescu3, Xifeng Yan4, Jiawei Han1 1University of Illinois at Urbana-Champaign 2University of Wisconsin at Madison 3IBM T. J. Watson Research Center 4University of California at Santa Barbara 1 Outline  Motivation  The efficiency bottleneck encountered in big networks  Patterns must be preserved  Summarize-Mine  Experiments  Summary 2 3 Frequent Subgraph Mining  Find all graphs p such that |Dp| >= min_sup  Get into the topological structures of graph data  Useful for many downstream applications 4 Challenges  Subgraph isomorphism checking is inevitable for any frequent subgraph mining algorithm  This will have problems on big networks  Suppose there is only one triangle in the network  But there are 1,000,000 length-2 paths  We must enumerate all these 1,000,000, because any one of them has the potential to grow into a full triangle 5 Too Many Embeddings  Subgraph isomorphism is NP-hard  So, when the problem size increases, …  During the checking, large graphs are grown from small subparts  For small subparts, there might be too many (overlapped) embeddings in a big network  Such embedding enumerations will finally kill us 6 Motivating Application  System call graphs from security research  Model dependencies among system calls  Unique subgraph signatures for malicious programs  Compare malicious/benign programs  These graphs are very big  Thousands of nodes on average  We tried state-of-art mining technologies, but failed 7 Our Approach  Subgraph isomorphism checking cannot be done on large networks  So we do it on small graphs  Summarize-Mine  Summarize: Merge nodes by label and collapse corresponding edges  Mine: Now, state-of-art algorithms should work 8 Mining after Summarization G1 G2 a b c a c b a … Original … … Summarize g1 g2 a c b a c b a … Summary Mining & Output … … 9 Remedy for Pattern Changes  Frequent subgraphs are presented on a different abstraction level  False negatives & false positives, compared to true patterns mined from the un-summarized database D  False negatives (recover)  Randomized technique + multiple rounds  False positives (delete)  Verify against D  Substantial work can be transferred to the summaries 10 Outline  Motivation  Summarize-Mine  The algorithm flow-chart  Recovering false negatives  Verifying false positives  Experiments  Summary 11 12 False Negatives  For a pattern p, if each of its vertices bears a different label, then the embeddings of p must be preserved after summarization  Since we are merging groups of vertices by label, the nodes of p should stay in different groups  Otherwise, Gi a p a c b a b gi a ... c b c 13 Missing Prob. of Embeddings  Suppose  Assign xj nodes for label lj (j=1,…,L) in the summary Si => xj groups of nodes with label lj in the original graph Gi  Pattern p has mj nodes with label lj  Then 14 No “Collision” for Same Labels  Consider a specific embedding f: p->Gi, f is preserved if vertices in f(p) stay in different groups  Randomly assign mj nodes with label lj to xj groups, the probability that they will not “collide” is:  Multiply probabilities for independent events 15 Example  A pattern with 5 labels, each label => 2 vertices  m1 = m2 = m3 = m4 = m5 = 2  Assign 20 nodes in the summary (i.e., 20 node groups in the original graph) for each label  The summary has 100 vertices  x1 = x2 = x3 = x4 = x5 = 20  The probability that an embedding will persist 19 19 19 19 19       0.774 20 20 20 20 20 16 Extend to Multiple Graphs  Setting x1,…,xL to the same values across all Gi’s in the database only depends on m1,…,mL, i.e., pattern p’s vertex label distribution  We denote this probability as q(p)   For each of p’s support graphs in D, it has a probability of at least q(p) to continue support p  Thus, the overall support can be bounded below by a binomial random variable 17 Support Moves Downward 18 False Negative Bound 19 Example, Cont.  As above, q(p)=0.774  min_sup=50 min_sup' 40 39 38 37 36 35 1 round 0.5966 0.4622 0.3346 0.2255 0.1412 0.0820 2 rounds 0.3559 0.2136 0.1119 0.0508 0.0199 0.0067 3 rounds 0.2123 0.0988 0.0374 0.0115 0.0028 0.0006 20 False Positives Gi a p a a b gi a a c a b c b c  Much easier to handle  Just check against the original database D  Discard if this “actual” support is less than min_sup 21 The Same Skeleton as gSpan  DFS code tree  Depth-first search  Minimum DFS code?  Check support by isomorphism tests  Record all one-edge extensions along the way  Pass down the projected database and recurse 22 Integrate Verification Schemes  Top-Down and Bottom-Up  Possible factors  Amount of false positives  Top-down verification Transaction ID list for p1 => Dcan p1 be performed early  Top-down preferred Just search within D-D Just search within Dpp12; if frequent, can stop by experiments Transaction ID list for p2 => Dp2 23 Summary-Guided Verification  Substantial verification work can be performed on the summaries, as well Got it! 24 Iterative Summarize-Mine  Use a single pattern tree to hold all results spanning across multiple iterations  No need to combine pattern sets in a final step  Avoid verifying patterns that have already been checked by previous iterations  Verified support graphs are accurate, they can help prepruning in later iterations  Details omitted 25 Outline  Motivation  Summarize-Mine  Experiments  Summary 26 Dataset  Real data  W32.Stration, a family of mass-mailing worms  W32.Virut, W32.Delf, W32.Ldpinch, W32.Poisonivy, etc.  Vertex # up to 20,000 and edge # even higher  Avg. # of vertices: 1,300  Synthetic data  Size, # of distinct node/edge labels, etc.  Generator details omitted 27 A Sample Malware Signature  Mined from W32.Stration  A malware reading and leaking certain registry settings related to the network devices 28 Comparison with gSpan  gSpan is an efficient graph pattern mining algorithm  Graphs with different size are randomly drawn  Eventually, gSpan cannot work 29 The Influence of min_sup'  Total vs. False Positives  The gap corresponds to true patterns  It gradually widens as we decrease min_sup' 30 Summarization Ratio  10/1 node(s) before/after summarization => ratio=10  Trading-off min_sup' and t as the inner loop  A range of reasonable parameters in the middle 31 Scalability  On the synthetic data  Parameters are tuned as done above 32 Outline  Motivation  Summarize-Mine  Experiments  Summary 33 Summary  We solve the frequent subgraph mining problem for graphs with big size  We found interesting malware signatures  Our algorithm is much more efficient, while the state- of-art mining technologies do not work  We show that patterns can be well preserved on higher-level by a good generalization scheme  Very useful, given the emerging trend of huge networks  The data has to be preprocessed and summarized 34 Summary  Our method is orthogonal to many previous works on this topic => Combine for further improvement  Efficient pattern space traversal  Other data space reduction techniques different from our compression within individual transactions   Transaction sampling, merging, etc. They perform compression between transactions 35 36

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Summarize-Mine