Enhancing Graph Database Indexing By Suffix Tree Structure
V. Bonnici, A. Ferro, R. Giugno, A. Pulvirenti
Dipartimento di Matematica e Informatica, Università di Catania
D. Shasha
Courant Institute of Mathematical Sciences, New York University

Outline
• Biological Motivations
• GraphGrep
• GraphGrepSX
• Experimental results (GIndex, Gcoding, GraphGrep, Ctree, SING)

Motivations
• Many applications in industry, science, and engineering share the same problem: given a subgraph, find its occurrences in a database of graphs.
– Predict the functionality of new natural or synthesized compounds
– Make a compound Q more active
– Find fragments with the same function among different species
– Predict protein function, predict protein interaction
[Figure labels: Pathways, Biological Networks, Gene Ontologies]

Graph indexing system
• Graph-to-graph matching algorithms can be used, but efficiency considerations suggest the use of specific techniques to reduce the search space and the time complexity.
• In a preprocessing phase, each graph of the database is analyzed in order to extract and store its discriminatory properties (features).
• In the filtering phase, the graph database index is compared with the query index in order to discard the graphs of the database that do not contain some feature present in the query graph.

GraphGrep
• Graph searching (subgraph isomorphism) is an NP-complete problem.
• Indexing is crucial to reduce the search space and make the problem affordable!
[Figure: GraphGrep pipeline on a database of example labeled graphs. Stages shown: Database, Index construction, Index, Query processing, Filtering: find candidates, Load from DB, Candidate verification]
Shasha D, Wang JL, Giugno R: Algorithmics and Applications of Tree and Graph Searching. Proceedings of the ACM Symposium on Principles of Database Systems (PODS) 2002, 39-52.
Ferro A, Giugno R, Mongiovì M, Pulvirenti A, Skripin D, Shasha D: GraphFind: Enhancing Graph Searching by Low Support Data Mining Techniques. BMC Bioinformatics 2008, 9(Suppl 4):S10. doi:10.1186/1471-2105-9-S4-S10

GraphGrep: Index building
For each graph in the DB:
• Find all paths of length from 1 to L (L = 4, 10)
• Save the paths in a Berkeley DB
• Count the occurrences of each path in each graph
• Save the occurrence counts in a hash table indexed by the strings of the paths

GraphGrep: Filtering and Matching
• Run VF

GraphGrep: Matching
VF_lib (Cordella et al., IEEE PAMI 2004, http://amalfi.dis.unina.it/graph/db/vflib-2.0/doc/vflib.html)
• Extension of the Ullmann matching algorithm (Journal of the ACM, 1976)
• The process of finding the mapping function can be described by means of a tree called the State Space Representation (SSR)
• Each node is a state s of the partial matching process
• A transition from a generic state s to a successor s' represents the addition of a pair of matched nodes
• k-look-ahead rules check in advance whether a consistent state s has no consistent successors after k steps
• Semantic rules are applied in addition

GraphGrepSX: Idea
Realize a compact representation of the index by making use of suffix trees ("Algorithms on Strings, Trees, and Sequences" by Dan Gusfield).
[Figure: suffix tree index]

GraphGrepSX
• Preprocessing phase
– Replaces the hash table index with a suffix tree index
• Filtering phase
– Build a query index tree
– The candidate set is constructed by matching the query index tree against the database index
• This results in a more flexible graph indexing system
– Different ways to build the query index
– An efficient technique to reduce redundant checks

GraphGrepSX: Index structure
GraphGrepSX builds the tree index as follows:
• The index is computed by a DFS visit; backtracking makes it possible to find paths with the same suffix.
[Figure: step-by-step construction of the tree index]

Experimental analysis on molecular dataset
AIDS antiviral screening database, http://dtp.nci.nih.gov/docs/aids/
[Figure: Index construction time, GraphGrep vs. GraphGrepSX. Building time (sec) against number of graphs (1000, 8000, 24000, 40000)]

Experimental analysis on molecular dataset
[Figure, two panels, GraphGrep vs. GraphGrepSX: Index fingerprint size (hashtable only) and Total index size (label-paths table + hashtable). Size (Kb) against number of graphs (1000, 8000, 24000, 40000)]

GraphGrepSX: Index structure
A path in the index structure is defined as a maximal path if:
• its length is L, or
• its length is < L but it cannot be extended

GraphGrepSX: Filtering phase - Query tree index structure construction
• Discard the graphs of the database that do not match the query by analyzing only the maximal paths
• The query tree nodes representing these maximal paths are marked
[Figure: red nodes represent end-points of maximal paths]

GraphGrepSX: Filtering phase - Candidates generation
[Figure sequence: Dataset index and Query index, Trees matching, Occurrences verification, Candidates set: {g1, …}]

Experimental analysis
Molecular dataset of 42000 graphs
[Figure: Query time (filtering + matching), GraphGrep vs. GraphGrepSX. Query time (sec) against query dimension (4, 8, 16, 32)]

Experimental analysis
Molecular dataset of 42000 graphs
[Figure: Candidates, GraphGrep vs. GraphGrepSX. Number of candidates against query dimension (4, 8, 16, 32)]

Experimental analysis: CTree, GCoding, GraphGrep, GraphGrepSX
Molecular dataset of 42000 graphs
[Figure, two panels, GraphGrepSX, GraphGrep, Ctree, Gcoding, SING: Index size (Kb) and Index construction time (sec) against number of graphs (8000, 24000, 42000)]

Experimental analysis: CTree, GCoding, GraphGrep, GraphGrepSX
Molecular dataset of 42000 graphs
[Figure, two panels, GraphGrepSX, GraphGrep, Ctree, Gcoding, SING: Query time (sec) and Candidates against query dimension (4, 8, 16, 32)]
Huahai He, Ambuj K. Singh: Closure-Tree: An Index Structure for Graph Queries. ICDE '06.
Lei Zou, Lei Chen, Jeffrey Xu Yu, Yansheng Lu: A novel spectral coding in a large graph database. Proceedings of the 11th International Conference on Extending Database Technology, 2008.

Experimental analysis: CTree, GCoding, GraphGrep, GraphGrepSX
Molecular dataset of 42000 graphs
[Figure, two panels, GraphGrepSX, GraphGrep, Ctree, Gcoding, SING: Filtering time (sec) and Matching time (sec) against query dimension (4, 8, 16, 32)]
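
Sketch: GraphGrepSX index construction and filtering
The preprocessing and filtering phases described in the GraphGrepSX slides can be illustrated with a small Python sketch. It is not the authors' implementation: the names TrieNode, add_graph and candidates are hypothetical, paths are assumed to be simple paths of at most max_len vertices, and a plain prefix tree of label paths stands in for the suffix tree index. Each tree node stores, per graph, how many times its label path occurs; filtering compares only the end-points of the query's maximal paths (the leaves of the query tree).

```python
from collections import defaultdict


class TrieNode:
    """One node of the label-path tree (hypothetical name)."""

    def __init__(self):
        self.children = {}              # vertex label -> TrieNode
        self.counts = defaultdict(int)  # graph id -> occurrences of this path


def add_graph(root, gid, labels, adj, max_len):
    """Insert all simple label paths of up to max_len vertices of one graph.

    labels: dict vertex -> label; adj: dict vertex -> list of neighbours.
    """
    def dfs(v, node, visited, depth):
        child = node.children.setdefault(labels[v], TrieNode())
        child.counts[gid] += 1
        if depth == max_len:
            return
        visited.add(v)
        for w in adj[v]:
            if w not in visited:
                dfs(w, child, visited, depth + 1)
        visited.discard(v)  # backtrack so other paths may reuse v

    for v in labels:
        dfs(v, root, set(), 1)


def candidates(db_root, query_root, graph_ids, qid="q"):
    """Keep the graphs whose counts cover every maximal path of the query."""
    cands = set(graph_ids)
    stack = [(query_root, db_root)]
    while stack and cands:
        qnode, dnode = stack.pop()
        for label, qchild in qnode.children.items():
            dchild = dnode.children.get(label)
            if dchild is None:
                return set()  # no database graph contains this label path
            if not qchild.children:
                # Leaf of the query tree: end-point of a maximal path.
                need = qchild.counts[qid]
                cands = {g for g in cands if dchild.counts.get(g, 0) >= need}
            else:
                stack.append((qchild, dchild))
    return cands


# Toy database (assumed data): g1 is the path A-B-C, g2 is the edge A-B.
db_root = TrieNode()
add_graph(db_root, "g1", {0: "A", 1: "B", 2: "C"}, {0: [1], 1: [0, 2], 2: [1]}, 3)
add_graph(db_root, "g2", {0: "A", 1: "B"}, {0: [1], 1: [0]}, 3)

# Query: the edge B-C, indexed with the same maximum path length.
query_root = TrieNode()
add_graph(query_root, "q", {0: "B", 1: "C"}, {0: [1], 1: [0]}, 3)

print(candidates(db_root, query_root, ["g1", "g2"]))  # prints {'g1'}
```

The surviving candidates would still have to go through a verification step (VF in the slides), since matching path counts is a necessary condition for a subgraph occurrence but not a sufficient one.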