Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Towards Graph Containment Search and Indexing Chen Chen 1, Xifeng Yan 2, Philip S. Yu 2, Jiawei Han 1, Dong-Qing Zhang 3, Xiaohui Gu 2 1 University of Illinois at Urbana-Champaign 2IBM T.J. Watson Research Center 3Thomson Research VLDB ’07, September 23-28, 2007, Vienna, Austria 2007. 12. 28 Summarized by Dongjoo Lee, IDS Lab., Seoul National University Presented by Dongjoo Lee, IDS Lab., Seoul National University Contents Graph Search cIndex Basic Framework Indexing Features Indexing Model cIndex-Basic Hierarchical Indexing Models – cIndex-BottomUp, cIndex-TopDown Index Maintenances Experiments Discussion Copyright 2007 by CEBT 2 Graph Search gc Subgraph Traditional Graph Search q Supergraph Graph Containment Search gc q Find all supergraphs of a query graph Find all subgraphs of a query graph Subgraph isomorphism SCAN Answers of Graph Search Answers of Graph Containment Search Sample Database q Sample Query Copyright 2007 by CEBT 3 Subgraph Indexing & Pruning f1 q1 ga gb f2 q2 f3 q3 gc f4 Sample Database Feature-Graph Matrix Indexed Subgraphs SCAN Queries Traditional Graph Search - Inclusion Logic Given a query graph q and a database graph g D, if a feature f q and f g, then q g . That is, if feature f is in q then the graphs not having f are pruned. Graph Containment Search - Exclusion Logic If a feature f q and f g, then g q. That is, if feature f is not in q then the graphs having f are pruned. Copyright 2007 by CEBT 4 Basic Search Framework 1. 2. Off-line index construction Generate and select a feature set F from the graph database D f ∈ F, Df = {g | f ⊆ g, g ∈D} Search 3. Test indexed features in F against the query q which returns all f q, and compute the candidate query answer set, Cq = D − fDf (f q, f ∈ F). Verification Check each graph g in the database set Cq to see whether g is really a subgraph of q Search Time Let’s reduce these Copyright 2007 by CEBT 5 Redundancy-Aware Feature Selection Select minimal features that cover many graphs that other features didn’t cover from frequent features Expected number of reduced subgraph isomorphism tests X X X X X q1 f2 X q2 Jf = np(1-p’) -1 – n : |D| – p : frequency of f in D – p’ : probability that a query graph having f f3 q3 p p’=> Maximum at p = ½ Frequent subgraph mining algorithm f1 X f4 FSG, GASTON, gSpan Copyright 2007 by CEBT 6 Maximum Coverage With Cost Definition 6. (Maximum Coverage With Cost). Given a set of subsets S = {S1, S2, …, Sm} of the universal set U = {1, 2, … , n} and a cost parameter associated with any Si ∈ S, find a subset T of S such that |∪Si∈TSi|−|T| is maximized. U = {1, 2, 3, 4, 5, 6, 7, 8, 9} S = {{4,5,6}, {1,2,4,5},{4,5,7,8}, {1,4,7}} f1 f2 f3 f4 = 3 Set Cover Problem http://en.wikipedia.org/wiki/Set_cover_problem 1 2 3 4 5 6 Contrast Graph Matrix 7 8 9 This is NP-complete problem The greedy feature selection process can approximate the optimal index with K features within a ratio of 1 − 1/e. [D. S. Hochbaum, editor. “Approximation Algorithms for NP-Hard Problems”, 1997] Copyright 2007 by CEBT 7 cIndex-Basic 1 2 3 4 5 6 7 8 9 Time complexity : O(|F0||D||L|) ----> O(|F0||Ds||L|) Reduced by sampling Space usage : O(|F0||D||L|) ----> O(|F0||D| + |F0||L|) Reduced by virtualization Copyright 2007 by CEBT 8 Hierarchical Indexing Models cIndex-BottomUp cIndex-TopDown Copyright 2007 by CEBT 9 Index Maintenance “ostrich” strategy Stick with the same set of selected features and the same hierarchical structure built Sampling strategy Periodically take small samples (Fs, Ds, Ls) and construct sample index Is – if performance of legacy index I is much worse than Is then reconstruct index – else take ostrich strategy Copyright 2007 by CEBT 10 Experiments (1) Chemical Descriptor Search a model graph database usually includes a set of fundamental substructures, called descriptors. These descriptors, shared by specific groups of known molecules, often indicate particular chemical and physical properties. Given a molecule, fast searching for its “descriptor” substructures can help researchers to quickly predict its attributes. Comparison SCAN FB – Extract features using gIndex and apply features as Basic Framework cIndex-Basic, cIndex-BottomUp, cIndex-TopDown Copyright 2007 by CEBT 11 Experiments (2) Subgraph Isomorphism Test Numbers Effectiveness of Data Space Reduction Copyright 2007 by CEBT Query Processing Time Index Maintenance 12 Experiments (3) Performance of Hierarchical Indices Scalability of Hierarchical Indices Index Size Copyright 2007 by CEBT 13