Output Space Sampling
Mohammad Hasan and Mohammed Zaki, RPI, Troy, NY

A Motivating Problem from Medical Informatics
* Pipeline: tissue images -> cell graphs -> discriminatory subgraphs -> classifier
* Classes: healthy, diseased, damaged

The Mining Task
* Dataset: 30 graphs, average vertex count 2154, average edge count 36945
* Support threshold: 40%
* Result: no result after a week of running gSpan and Gaston on a 2 GHz dual-core PC with 4 GB of RAM running Linux

Limitations of Existing Subgraph Mining Algorithms
* They work only for small graphs
  * The most popular datasets in graph mining are chemical graphs, which are mostly tree-like
  * In the DTP dataset (the most popular dataset), the average vertex count is 43 and the average edge count is 45
* They perform a complete enumeration
  * For a large input graph, the output set is neither enumerable nor usable
* They follow a fixed enumeration order
  * A partial run does not efficiently generate the interesting subgraphs
* Goal: avoid complete enumeration and instead sample a set of interesting subgraphs from the output set

Why Is Sampling a Solution?
* Observation 1: mining is only an exploratory step; mined patterns are generally used in a subsequent knowledge-discovery (KD) task
  * Not all frequent patterns are equally important for the task at hand
  * A large output set leads to an information-overload problem, so complete enumeration is generally unnecessary
* Observation 2: traditional mining algorithms explore the output space in a fixed enumeration order
  * This is good for generating non-duplicate candidate patterns
  * But consecutive patterns in that order are very similar
* Sampling can change the enumeration order so that interesting and non-redundant subgraphs are found with a higher chance

Output Space
* The traditional frequent subgraphs for a given support threshold
* Can also be augmented with other constraints to find good patterns for the desired KD task
* [Figure: an input space and the corresponding output space for frequent pattern mining (FPM) with support = 2]

Sampling from the Output Space
* Return a random pattern from the output set
* The random pattern is obtained by sampling from a desired distribution
* Define an interestingness function f : F -> R+, where f(p) returns the score of pattern p
* The desired sampling distribution is proportional to the interestingness score
  * If the output space has only 3 patterns with scores 2, 3, and 4, sampling should be performed from the distribution {2/9, 1/3, 4/9} (illustrated by the code sketch at the end of this section)
* Efficiency consideration: enumerate as few auxiliary patterns as possible

How to Choose f?
* It depends on application needs
* For exploratory data analysis (EDA), every frequent pattern can have a uniform score
* For top-k pattern mining, support values can be used as scores (support-biased sampling)
* For the subgraph summarization task, only maximal graph patterns get a uniform non-zero score
* For graph classification, discriminatory subgraphs should have high scores

Challenges
* The output space cannot be instantiated
* Complete statistics about the output space are not known
* [Figure: a small example output space with patterns g1-g5]
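To make the score-proportional target concrete, here is a minimal Python sketch of drawing a pattern index with probability proportional to its interestingness score, using the three-pattern example with scores 2, 3, and 4 from the slide above. The function name and the use of Python's random module are illustrative assumptions; in practice the output space cannot be materialized like this, which is exactly why the MCMC approach described next is needed.

```python
import random

def sample_proportional(scores, rng=random):
    """Draw an index i with probability scores[i] / sum(scores)."""
    total = sum(scores)
    r = rng.uniform(0, total)
    running = 0.0
    for i, s in enumerate(scores):
        running += s
        if r <= running:
            return i
    return len(scores) - 1  # guard against floating-point round-off

# The three-pattern example from the slide: scores 2, 3, 4 should be
# sampled with probabilities {2/9, 1/3, 4/9}.
counts = [0, 0, 0]
for _ in range(90_000):
    counts[sample_proportional([2, 3, 4])] += 1
print(counts)  # roughly [20000, 30000, 40000]
```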
Output Space of Graph Mining
* The target distribution is not known entirely
* Given graphs with scores s1, s2, s3, ..., sn, we want π(i) = s_i / Σ_j s_j

Solution Approach (MCMC Sampling)
* Perform a random walk in the output space
* Represent the output space as a transition graph to allow local transitions
* Edges of the transition graph are chosen based on structural similarity
* Make sure that the random walk is ergodic
* Use the partial order graph (POG) as the transition graph: in the POG, every pattern is connected to its sub-patterns (with one less edge) and all of its super-patterns (with one more edge)

Algorithm
* Define the transition graph (for instance, the POG)
* Define the interestingness function, which selects the desired sampling distribution
* Perform a random walk on the transition graph
  * Compute the neighborhood locally
  * Compute the transition probability; utilizing the interestingness score here is what makes the method generic
* Return the currently visited pattern after k iterations

Local Computation of the Output Space
* Patterns that are not part of the output space are discarded during local neighborhood computation
* [Figure: current pattern g0 with sub- and super-pattern neighbors g1-g5 and transition probabilities p01-p05, where p00 + p01 + ... + p05 = 1]

Compute P to Achieve the Target Distribution
* Given graphs with scores s1, s2, ..., sn, we want π(i) = s_i / Σ_j s_j
* If π is the stationary distribution and P is the transition matrix, then in equilibrium πP = π
* The main task is to choose P so that the desired stationary distribution is achieved
* In fact, we compute only one row of P at a time (local computation)

Use the Metropolis-Hastings (MH) Algorithm
1. Fix an arbitrary proposal distribution q beforehand
2. Find a neighbor j (to move to) by using that distribution
3. Compute the acceptance probability and accept the move with this probability; for example, if neighbor 3 is selected from pattern 0, the acceptance probability is α(0,3) = min( s_3 q(3,0) / (s_0 q(0,3)), 1 )
4. If accepted, move to j; otherwise, go back to step 2

Uniform Sampling of Frequent Patterns
* Target distribution: 1/n, 1/n, ..., 1/n
* How to achieve it? Use a uniform proposal distribution
* The acceptance probability is min(1, d_u / d_v), where d_x is the degree of vertex x in the POG (see the code sketch below)

Uniform Sampling: Transition Probability Matrix
* [Figure: the transition probability matrix P for uniform sampling on a small example]
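The following Python sketch shows the Metropolis-Hastings random walk just described, specialized to uniform sampling with acceptance probability min(1, d_u/d_v). The neighbors() callback, which stands in for the local POG neighborhood computation (frequent sub- and super-patterns only), and all names are assumptions for illustration; the real algorithm also performs subgraph extension and support counting inside that callback.

```python
import random

def mh_uniform_walk(start, neighbors, iterations, rng=random):
    """Metropolis-Hastings random walk targeting the uniform distribution
    over the output space.

    `neighbors(p)` must return the POG neighbors of pattern `p` that are
    still in the output space (frequent sub- and super-patterns).  With a
    uniform proposal over neighbors, the MH acceptance ratio reduces to
    min(1, d_u / d_v), where d_x = |neighbors(x)|.
    """
    current = start
    for _ in range(iterations):
        nbrs = neighbors(current)
        if not nbrs:
            continue                      # isolated pattern: stay put
        candidate = rng.choice(nbrs)      # uniform proposal q(u, v) = 1/d_u
        d_u, d_v = len(nbrs), len(neighbors(candidate))
        if rng.random() < min(1.0, d_u / d_v):
            current = candidate           # accept; otherwise stay at `current`
    return current
```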
Discriminatory Subgraph Sampling
* Database graphs are labeled (e.g., G1: +1, G2: +1, G3: -1)
* Mined subgraphs g1, g2, g3, ... may be used as
  * features for supervised classification
  * a graph kernel

Sampling in Proportion to a Discriminatory Score (f)
* Interestingness score (feature quality):
  * Entropy
  * Delta score = |positive support - negative support|
* Direct mining is difficult: score values (entropy, delta score) are neither monotone nor anti-monotone, so P ⊂ C implies nothing about how Score(P) and Score(C) compare

Discriminatory Subgraph Sampling (MH Formulation)
* Use the Metropolis-Hastings algorithm
* Choose a neighbor uniformly as the proposal distribution
* Compute the acceptance probability from the delta scores of j and i and the ratio of the degrees of i and j (see the code sketch following the Future Work slide)

Datasets
Name         | # of Graphs        | Average Vertex Count | Average Edge Count
DTP          | 1084               | 43                   | 45
Chess        | 3196               | 10.25                | -
Mutagenicity | 2401 (+), 1936 (-) | 17                   | 18
PPI          | 3                  | 2154                 | 81607
Cell-Graphs  | 30                 | 2184                 | 36945

Result Evaluation Metrics
* Sampling quality
  * Our sampling distribution vs. the target sampling distribution
  * How the sampling converges (convergence rate)
  * Variation distance: (1/2) Σ_y | P^t(x, y) - π(y) |
* Scalability test: experiments on large datasets
* Median and standard deviation of the visit counts
* Quality of the sampled patterns

Uniform Sampling Results (DTP Chemical Dataset)
* Experiment setup: run the sampling algorithm for a sufficient number of iterations and observe the visit-count distribution; for a dataset with n frequent patterns, we perform 200n iterations
* Uniform sampling: max count 338, min count 32, median 209, std 59.02
* Ideal sampling: median 200, std 14.11

Sampling Quality
* Depends on the choice of proposal distribution
* If the vertices of the POG have similar degree values, sampling is good
* The earlier dataset has patterns with widely varying degree values
* For a clique dataset, sampling quality is almost perfect
* Result on the Chess (itemset) dataset, 100n iterations:
  * Uniform sampling: max count 156, min count 6, median 100, std 13.64
  * Ideal sampling: median 100, std 10

Discriminatory Sampling Results (Mutagenicity Dataset)
* [Figures: distribution of the delta score among all frequent patterns; relation between sampling rate and delta score]

Discriminatory Sampling Results (cont.)
Sample No | Delta Score | Rank | % of POG Explored
1         | 404         | 132  | 5.7
2         | 644         | 21   | 11.0
3         | 707         | 10   | 10.8
4         | 725         | 4    | 8.9
5         | 280         | 595  | 2.8
6         | 725         | 4    | 8.9
7         | 627         | 27   | 3.3
8         | 709         | 9    | 7.7
9         | 721         | 5    | 9.1
10        | 725         | 4    | 8.9

Discriminatory Sampling Results (Cell Graphs)
* Total graphs: 30, min-sup = 6
* No traditional graph mining algorithm could finish on this dataset in a week of running (on a 2 GHz machine with 4 GB of RAM)
* [Figure: number of subgraphs with delta score > 9, traditional algorithm vs. OSS]

Summary: Existing Algorithms vs. Output Space Sampling
Existing Algorithms                                      | Output Space Sampling
Depth-first or breadth-first walk on the subgraph space | Random walk on the subgraph space
Rightmost extension                                      | Arbitrary extension
Complete algorithm                                       | Sampling algorithm
* Quality: sampling quality guarantee
* Scalability: visits only a small part of the search space
* Non-redundancy: finds very dissimilar patterns by virtue of randomness
* Genericity: in terms of pattern type and sampling objective

Future Work and Discussion
* It is important to choose the proposal distribution wisely to get better sampling
* For large graphs, support counting is still a bottleneck
  * How to avoid the isomorphism checking entirely
  * How to effectively parallelize the support counting
* How to make the random walk converge faster
  * The POG generally has a small spectral gap; as a result, convergence is slow, which makes the algorithm costly (more steps are needed to find good samples)
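Returning to discriminatory subgraph sampling: with a uniform proposal over POG neighbors and a target distribution proportional to the delta score, the Metropolis-Hastings acceptance probability reduces to the product of the delta-score ratio of j and i and the degree ratio of i and j, as stated on the corresponding slide. A minimal sketch, with hypothetical helper names and example numbers:

```python
def delta_score(pos_support, neg_support):
    """Delta score = |positive support - negative support| (slide definition)."""
    return abs(pos_support - neg_support)

def discriminatory_acceptance(delta_i, delta_j, degree_i, degree_j):
    """MH acceptance for a proposed move i -> j when the proposal is uniform
    over POG neighbors and the target is proportional to the delta score:
    min(1, (delta_j / delta_i) * (degree_i / degree_j)).
    """
    if delta_i == 0:
        return 1.0  # always leave a zero-score pattern
    return min(1.0, (delta_j / delta_i) * (degree_i / degree_j))

# Hypothetical example: pattern i has delta score 404 and 6 POG neighbors,
# the proposed pattern j has delta score 644 and 8 neighbors.
print(discriminatory_acceptance(404, 644, 6, 8))  # 1.0 -> the move is accepted
```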
Acceptance Probability Computation
* The acceptance probability is computed from the desired (target) distribution, the proposal distribution, and the interestingness values

Support Biased Sampling
* Given graphs with supports s1, s2, s3, ..., sn, we want π(i) = s_i / Σ_j s_j
* What proposal distribution should we choose?
  * Q(u, v) = α · 1/|N_up(u)| if v ∈ N_up(u), and (1 - α) · 1/|N_down(u)| if v ∈ N_down(u)
  * where N_up(u) and N_down(u) are the super- and sub-pattern neighbors of u; α = 0 if N_up(u) = ∅ and α = 1 if N_down(u) = ∅
* (A code sketch of this proposal and its acceptance ratio appears at the end of this section)

Example of Support Biased Sampling
* α = 1/3, q(u, v) = 1/2, q(v, u) = 1/(3 × 3) = 1/9, s(u) = 2, s(v) = 3
* Acceptance probability for the move u -> v: min(1, s(v) q(v, u) / (s(u) q(u, v))) = min(1, (3 × 1/9) / (2 × 1/2)) = 1/3

Sampling Convergence
* [Figure: sampling convergence]

Support Biased Sampling: Results
* A scatter plot of visit count vs. support shows a positive correlation
* Correlation: 0.76

Specific Sampling Examples and Utilization
* Uniform sampling of frequent patterns
  * To explore the frequent patterns
  * To set a proper value of the minimum support
  * To make an approximate count
* Support-biased sampling
  * To find the top-k patterns in terms of support value
* Discriminatory subgraph sampling
  * To find subgraphs that are good features for classification
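As a closing illustration, here is a hedged sketch of the support-biased proposal and its acceptance ratio under the reconstruction above: up_neighbors and down_neighbors are hypothetical callbacks returning the frequent super- and sub-pattern neighbors of a pattern, alpha is assumed to be the probability of proposing an upward move, and the acceptance function reproduces the worked example's value of 1/3.

```python
import random

def propose(u, up_neighbors, down_neighbors, alpha, rng=random):
    """Support-biased proposal: with probability alpha pick a super-pattern
    uniformly from N_up(u), otherwise a sub-pattern uniformly from N_down(u).
    alpha is forced to 0 (resp. 1) when N_up (resp. N_down) is empty.
    Returns (candidate, q_forward) where q_forward = Q(u, candidate).
    """
    ups, downs = up_neighbors(u), down_neighbors(u)
    if not ups and not downs:
        return u, 1.0                      # isolated pattern: stay put
    if not ups:
        alpha = 0.0
    elif not downs:
        alpha = 1.0
    if rng.random() < alpha:
        v = rng.choice(ups)
        return v, alpha / len(ups)
    v = rng.choice(downs)
    return v, (1.0 - alpha) / len(downs)

def acceptance(s_u, s_v, q_uv, q_vu):
    """MH acceptance min(1, (s(v) * q(v, u)) / (s(u) * q(u, v)))."""
    return min(1.0, (s_v * q_vu) / (s_u * q_uv))

# The worked example from the slides: s(u)=2, s(v)=3, q(u,v)=1/2, q(v,u)=1/9.
print(acceptance(2, 3, 1 / 2, 1 / 9))  # 0.333... = 1/3
```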