Download Graph Blast PX - Videolectures

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Enhancing Graph Database Indexing
By Suffix Tree Structure
V. Bonnici, A. Ferro, R. Giugno, A. Pulvirenti
Dipartimento di Matematica e Informatica Università di Catania
D. Shasha
Courant Institute of Mathematical Science, New York University
Outline
•
•
•
•
Biological Motivations
GraphGrep
GraphGrepSX
Experimental results (GIndex, Gcoding,
GraphGrep, Ctree, SING)
Motivations
•
Many applications in industry, science, and engineering share
the same problem: given a subgraph, find its occurrences in a database of graphs.
–
–
–
–
Prediction of the functionality of new natural or synthesized compounds
Make a compound Q more active
Find fragment with the same function among different species
Predict protein function, Predict protein interaction
Pathways
Biological Networks
Gene Ontologies
Graph indexing system
• Graph-to-graph matching algorithms can be used, efficiency
considerations suggest the use of specific techniques to
reduce the search space and the time complexity.
• In a preprocessing phase, each graph of the database is
analyzed in order to extract and store its discriminatory
properties, features.
• In the filtering phase, the graph database index is compared
with the query index in order to discard graphs of the
database not containing some features present in the query
graph.
GraphGrep
Graphs searching is an NP-problem
Database
B 0
A 1
Index
Construction
3C
2B
g1
D 1
3C
B 2
g
5 B 6 E2
A 4
B 0
g3
C 4
B
A 1 2 3C
Index
Query processing
Candidate
Verification
Load from DB
Filtering:
find candidate
Indexing is crucial to reduce the search space and make the problem affordable!
Shasha D, Wang JL, Giugno R: Algorithmics and Applications of Tree and Graph Searching. Proceeding of the ACM Symposium on Principles of Database
Systems (PODS) 2002, :39{52.
A. Ferro, R. Giugno, M. Mongiovì, A. Pulvirenti, D. Skripin, D. Shasha D. GraphFind: Enhancing Graph Searching by Low Support Data Mining Techniques.
BMC Bioinformatics 2008, Vol. 9 (Suppl 4) :S10 doi:10.1186/1471-2105-9-S4-S10
GraphGrep: Index building
For each graph in DB:
• Find all paths of length from 1 to L (4,10)
• Save the paths in a Berkeley DB
• Count how many occurrences of each path in each
graph
• Save the occurrences in an hash table indexed by
the strings of the paths
GraphGrep: Index building
GraphGrep: Filtering and Matching
Run VF
GraphGrep: Matching VF_lib
(Cordella et al. IEEE PAMI 2004,
http://amalfi.dis.unina.it/graph/db/vflib-2.0/doc/vflib.html)
• Extension of Ullmann matching algorithm ( Journal of the
ACM, 1976)
• The process of finding the mapping function can be
suitably described by means of a TREE called State Space
Representation (SSR)
• Each node is a state s of the partial matching process
• Transition from a generic state s to a successor s’
represents the addition of a pair matched nodes.
• k-look-ahead rules for checking in advance if a consistent
state s has no consistent successors after k steps + Semantic
rules
GraphGrepSX: Idea
Realize a compact representation of the index by making use
of Suffix trees
“Algorithms on Strings, Trees, and Sequences” by Dan Gusfield
GraphGrepSX: Idea
Realize a compact representation of the index by making use
of Suffix trees
GraphGrepSX: Idea
Realize a compact representation of the index by making use
of Suffix trees
Suffix tree
index
GraphGrepSX: Idea
Realize a compact representation of the index by making use
of Suffix trees
Suffix tree
index
GraphGrepSX: Idea
Realize a compact representation of the index by making use
of Suffix trees
Suffix tree
index
GraphGrepSX: Idea
Realize a compact representation of the index by making use
of Suffix trees
Suffix tree
index
GraphGrepSX
• Preprocessing phase
– replaces the hash table index by a suffix tree index
• Filtering phase
– Build a query index tree
– The candidate set is constructed by matching the query
index tree and the database index
• This results in a more flexible graph indexing system
– different ways to build the query index
– an efficient technique to reduce redundant checks
GraphGrepSX: Index structure
GraphGrepSX builds the tree index as follows:
GraphGrepSX : Index structure
GraphGrepSX builds the tree index as follows:
GraphGrepSX : Index structure
GraphGrepSX builds the tree index as follows:
GraphGrepSX : Index structure
GraphGrepSX builds the tree index as follows:
GraphGrepSX : Index structure
GraphGrepSX builds the tree index as follows:
Computed by DFS visit, the backtracking allows to find paths
with the same suffix
GraphGrepSX : Index structure
GraphGrepSX builds the tree index as follows:
Computed by DFS visit, the backtracking allows to find paths
with the same suffix
GraphGrepSX : Index structure
GraphGrepSX builds the tree index as follows:
Computed by DFS visit, the backtracking allows to find paths
with the same suffix
GraphGrepSX : Index structure
GraphGrepSX builds the tree index as follows:
GraphGrepSX : Index structure
GraphGrepSX builds the tree index as follows:
GraphGrepSX : Index structure
GraphGrepSX builds the tree index as follows:
GraphGrepSX : Index structure
GraphGrepSX builds the tree index as follows:
GraphGrepSX : Index structure
GraphGrepSX builds the tree index as follows:
GraphGrepSX : Index structure
GraphGrepSX builds the tree index as follows:
GraphGrepSX : Index structure
GraphGrepSX builds the tree index as follows:
GraphGrepSX : Index structure
GraphGrepSX builds the tree index as follows:
GraphGrepSX : Index structure
GraphGrepSX builds the tree index as follows:
Experimental analysis on molecular
dataset
Index construction time
GraphGrep
GraphGrepSX
8000
24000
900
Building time (sec)
800
700
600
500
400
300
200
100
0
1000
40000
Number of graphs
AIDS antiviral screening database
http://dtp.nci.nih.gov/docs/aids/
Experimental analysis on molecular
dataset
Index fingerprint size
Total index size
hashtable only
label-paths table + hashtable
GraphGrep
GraphGrepSX
GraphGrep
1000000
GraphGrepSX
35000
900000
30000
Size (Kb)
Size (Kb)
800000
700000
600000
500000
400000
300000
25000
20000
15000
10000
200000
5000
100000
0
0
1000
8000
24000
Number of graphs
40000
1000
8000
24000
Number of graphs
40000
GraphGrepSX: Index structure
A path in the index structure is defined as maximal path if:
GraphGrepSX: Index structure
A path in the index structure is defined as maximal path if:
•its length is L
GraphGrepSX: Index structure
A path in the index structure is defined as maximal path if:
•its length is L
GraphGrepSX: Index structure
A path in the index structure is defined as maximal path if:
•its length is L
•the path has length < L but it cannot be extended
GraphGrepSX: Index structure
A path in the index structure is defined as maximal path if:
•its length is L
•the path has length < L but it cannot be extended
GraphGrepSX: Index structure
A path in the index structure is defined as maximal path if:
•its length is L
•the path has length < L but it cannot be extended
GraphGrepSX: Filtering phase-Query
tree index structure construction
- Discard graphs from the database which do not
match the query by analyzing only the maximal paths
- the query tree nodes representing these maximal
paths are marked
Red nodes represent
End-points of Maximal Paths
GraphGrepSX: Filtering phasecandidates generation
Dataset index
Query index
GraphGrepSX: Filtering phasecandidates generation
Dataset index
Query index
Trees matching
GraphGrepSX: Filtering phasecandidates generation
Dataset index
Query index
Trees matching
GraphGrepSX: Filtering phasecandidates generation
Trees matching
GraphGrepSX: Filtering phasecandidates generation
Trees matching
Occurrences verification
GraphGrepSX: Filtering phasecandidates generation
Trees matching
Occurrences verification
Candidates set:{g1, …}
Experimental analysis
Molecular dataset of 42000 graphs
Query Time
filtering + matching
GraphGrep
GraphGrepSX
4,5
4
Query time (sec)
3,5
3
2,5
2
1,5
1
0,5
0
4
8
16
Query dimension
32
Experimental analysis
Molecular dataset of 42000 graphs
Candidates
GraphGrep
GraphGrepSX
Number of candidates
35000
30000
25000
20000
15000
10000
5000
0
4
8
16
Query dimension
32
Experimental analysis: CTree, GCoding,
GraphGrep, GraphGrepSX
Molecular dataset of 42000 graphs
Index size
Index construction time
GraphGrep
Ctree
Gcoding
5
4,5
4
3,5
3
2,5
2
1,5
1
0,5
0
SING
GraphGrepSX
GraphGrep
Ctree
Gcoding
SING
7
6
Size (Kb)
Time (sec)
GraphGrepSX
5
4
3
2
1
0
8000
24000
Number of graphs
42000
8000
24000
Number of graphs
42000
Experimental analysis: CTree, GCoding,
GraphGrep, GraphGrepSX
Molecular dataset of 42000 graphs
Query time
GraphGrepSX
GraphGrep
Ctree
Candidates
Gcoding
SING
2
Candidates
1,5
Time (sec)
1
0,5
0
4
8
16
-0,5
-1
Query dimension
32
GraphGrepSX
GraphGrep
4
8
Ctree
Gcoding
SING
5
4,5
4
3,5
3
2,5
2
1,5
1
0,5
0
16
Query dimension
32
Huahai He Ambuj K. Singh, Closure-Tree: An Index Structure for Graph Queries, ICDE '06
Lei Zou, Lei Chen, Jeffrey Xu Yu, Yansheng Lu, A novel spectral coding in a large graph
database, Proceedings of the 11th international conference on Extending database technology,
2008
Experimental analysis: CTree, GCoding,
GraphGrep, GraphGrepSX
Molecular dataset of 42000 graphs
Filtering time
GraphGrepSX
GraphGrep
Ctree
Matching time
Gcoding
SING
0
4
8
CTree
8
16
-1
-1,5
-2
32
SING
0,5
0
16
-0,5
-2,5
-3
GCoding
1
4
Time (sec)
Time (sec)
GraphGrep
1,5
0,5
-0,5
GraphGrepSX
-1
Query dimension
Query dimension
32