Tree-Pattern Queries on a
Lightweight XML Processor
Mirella M. Moro
Zografoula Vagena
Vassilis J. Tsotras
Research partially supported by CAPES, NSF grant IIS 0339032, UC Micro, and Lotus Interworks
Outline





Motivation and Contributions
Background
Method Categorization
Experimental Evaluation
Conclusions
UC Riverside
Tree-Pattern Queries on a Lightweight XML Processor
Motivation

XML query languages: selection on both value and structure
 “Tree-pattern” queries (TPQ) very common in XML

Many promising holistic solutions

None in lightweight XML engines

Without optimization module (e.g. eXist, Galax)

 Effective, robust processing method

Reasons:
 No systematic comparison of query methods under a common
storage model
 No integration of all methods under such storage model

Context: XPath semantics, stored data (indexed at will)
Contributions



TPQ methods over unified environment
Method Categorization: data access patterns and
matching algorithm
Common storage model + integration of all methods
 Capture the access features
 Permit clustering data with off-the-shelf access methods
(e.g. B+tree)


Novel variations of methods using index structures +
Handle TPQ
Extensive comparative study
 Synthetic, benchmark and real datasets
 Assessment of the applicability, robustness and efficiency of each method
Background
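[Figure: a tree-pattern query (TPQ) over article, with branches author → last and procs → conf, next to a sample document tree labeled with region numbers: Bib (1,20); article (2,19); title (3,5), author (6,13), procs (14,18); t1 (4); last (7,9), first (10,12); conf (15,17); DeWitt (8), David J. (11), VLDB (16).]
Example: article (2,19) contains last (7,9), since 2 < 7 < 9 < 19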
XML database = forest of unranked, ordered,
node-labeled trees, one tree per document
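The containment test behind "2 < 7 < 9 < 19" can be sketched in a few lines. This is a minimal illustration, assuming each node is stored as a (left, right) pair; the function name `is_ancestor` is hypothetical and not taken from the slides.

```python
def is_ancestor(anc, desc):
    """True if region anc = (left, right) strictly contains region desc."""
    return anc[0] < desc[0] and desc[1] < anc[1]


# Region labels read from the sample document on this slide.
article = (2, 19)
author = (6, 13)
last = (7, 9)
conf = (15, 17)

print(is_ancestor(article, last))  # article contains last
print(is_ancestor(author, conf))   # author does not contain conf
```

Because regions are either nested or disjoint, two integer comparisons are enough to decide ancestor/descendant relationships without touching the document.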

Common Storage Model
Input = sequence (list) of elements
One list per document tag = element list
 Node clustering by index structures
B+ Tree on (tag, initial)
Numbering scheme
[Figure: document tree with region numbers: bib (1,26); book (2,9), book (10,17), paper (18,25); each of these has one author child, (3,8), (11,16), (19,24), with children name and address.]
Element lists:
bib      (1,26)
book     (2,9) (10,17)
paper    (18,25)
author   (3,8) (11,16) (19,24)
name     (4,5) (12,13) (20,21)
address  (6,7) (14,15) (22,23)
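A small sketch of how such element lists could be produced. It assumes a counter that is incremented at every start tag and every end tag and ignores text nodes (the sample document on this slide shows none); `element_lists` and `DOC` are hypothetical names, and the sorted Python lists only stand in for a B+ tree on (tag, left).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def element_lists(xml_text):
    """Assign (left, right) regions and group them per tag, sorted by left."""
    lists = defaultdict(list)
    counter = 0

    def visit(elem):
        nonlocal counter
        counter += 1              # start tag
        left = counter
        for child in elem:
            visit(child)
        counter += 1              # end tag
        lists[elem.tag].append((left, counter))

    visit(ET.fromstring(xml_text))
    for regions in lists.values():
        regions.sort()            # clustering order of a (tag, left) index
    return dict(lists)


# Document shaped like the one in the figure.
DOC = (
    "<bib>"
    "<book><author><name/><address/></author></book>"
    "<book><author><name/><address/></author></book>"
    "<paper><author><name/><address/></author></paper>"
    "</bib>"
)

for tag, regions in sorted(element_lists(DOC).items()):
    print(tag, regions)
```

Under these assumptions the output reproduces the element lists shown on the slide.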
Method Categorization

Parameters: access pattern and matching algorithm
(1) set based techniques
(2) query driven
(3) input driven
(4) structural summaries
Cat 1: Set-based Techniques
Access Pattern: sorted/indexed
Matching Process: join sets, merge individual paths
Input: sequences of elements, one list per query node
element, possibly indexed (set-based)
Major representative: TwigStack
 Optimal XML pattern matching algorithm (ancestor/descendant)

Stack-based processing
 Set of stacks = compact encoding of partial and total results in
linear space (possibly exponential number of answers)
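To make the stack idea concrete, here is a sketch of a binary stack-based structural join for a single ancestor//descendant edge. It is deliberately simpler than TwigStack (no twig branches, no look-ahead check) and is not the authors' code; the function name and the sample lists are illustrative assumptions.

```python
def stack_structural_join(ancestors, descendants):
    """Return (a, d) pairs where region a contains region d.

    Both inputs are lists of (left, right) regions sorted by left.
    """
    result = []
    stack = []          # chain of nested ancestors that are still "open"
    i = 0
    for d in descendants:
        # Push every ancestor candidate that starts before d.
        while i < len(ancestors) and ancestors[i][0] < d[0]:
            a = ancestors[i]
            while stack and stack[-1][1] < a[0]:
                stack.pop()          # closed before a starts
            stack.append(a)
            i += 1
        # Drop ancestors that closed before d starts.
        while stack and stack[-1][1] < d[0]:
            stack.pop()
        # Regions are nested or disjoint, so everything left contains d.
        result.extend((a, d) for a in stack)
    return result


authors = [(3, 8), (11, 16), (19, 24)]
names = [(4, 5), (12, 13), (20, 21)]
print(stack_structural_join(authors, names))
```

Each input list is scanned once, and the stack holds at most one root-to-leaf chain at a time, which is the property that TwigStack generalizes to one stack per query node.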
TwigStack + Indexes




B+tree, built on the left attribute
 From ancestor: probe descendants: skip initial nodes
 Ancestor skipping not effective (up to 1st element that follows)
XB-tree: on (left,right) bounding segment
XR-tree: on (left,right), B+tree with complex index key + stab lists
A comparative study* shows that
 Skipping ancestors: XBTree better (XBTree size is smaller)
 Recursive level of ancestors: XBTree better again

Searching on stab lists of XR-tree is less efficient
 Plain B+tree: skips descendants, BUT not ancestors
 XBTwigStack is our choice
* H.Li et al. “An Evaluation of XML Indexes for
Structural Joins”. Sigmod Record, 33(3), Sept 04
Cat 2: Query Driven Techniques
Access Pattern: indexed/random
Matching Process: incremental construction of each result instance
Processing: the query defines the way input is
probed
Major representatives: ViST and PRIX
Specific details: significantly different
Same strategy
 Convert both document and query to sequences
 Processing query = subsequence matching
ViST and PRIX


Recursively identify matches = quadratic time
Optimize the naïve solution:
 Identify candidate nodes for each matching step
 Index structures to cluster those candidates



Subsequence matching process = a plan consisting of index nested-loop joins (INLJ) among relations, each of which groups document nodes with the same label
For a given query, the join sequence is statically defined by the sequencing of the query
INLJ plans are a superset of the static plans that PRIX and ViST use
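The following sketch shows one such plan: a top-down chain of index nested-loop joins for a linear path. It assumes per-tag element lists sorted by left position, with `bisect` standing in for a B+ tree probe; `probe`, `inlj_path` and `LISTS` are hypothetical names, and the data is the sample from the Common Storage Model slide.

```python
from bisect import bisect_right

# Element lists from the storage-model slide, sorted by left position.
LISTS = {
    "book": [(2, 9), (10, 17)],
    "author": [(3, 8), (11, 16), (19, 24)],
    "name": [(4, 5), (12, 13), (20, 21)],
}


def probe(element_list, region):
    """Index probe: nodes whose left position lies inside region."""
    left, right = region
    start = bisect_right(element_list, (left, float("inf")))
    inside = []
    for node in element_list[start:]:
        if node[0] > right:
            break                    # past the region: stop scanning
        inside.append(node)
    return inside


def inlj_path(lists, tags, region=(0, float("inf")), prefix=()):
    """Evaluate tags[0]//tags[1]//... with one index probe per outer node."""
    if not tags:
        return [prefix]
    results = []
    for node in probe(lists[tags[0]], region):
        results.extend(inlj_path(lists, tags[1:], node, prefix + (node,)))
    return results


print(inlj_path(LISTS, ["book", "author", "name"]))
```

Here the join order follows the query from top to bottom; other orders (for example starting from the leaf) correspond to different INLJ plans over the same lists.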
ViST x PRIX x INLJ
Dataset          ViST     PRIX     INLJ
100%             100      100      100
LEAVES: 80%      100      84.23    84.20
LEAVES: 1%       100      1.33     1.32
ROOT: 80%        84.22    100      84.18
ROOT: 1%         1.33     100      1.33
INTERNAL: 80%    89.48    89.49    84.20
INTERNAL: 1%     34.24    34.22    1.64
Percentage of nodes processed by each algorithm
INLJ: best plan
INLJ : improved
Consider b//c, starting from c
[Figure: B+ tree over the b element list of a sample document; region labels shown include a (1,52), b (2,31), b (32,41), b (42,51), a (33,40), c (35,36), c (38,39).]
TPQ  evaluation of relational plan
Independence of the ordered XML model
Total avoidance of false positives
Cat 3: Input Driven Techniques
Access Pattern: sequential
Matching Process: input drives computation, merge individual paths
Processing: at each point, the flow of
computation is guided entirely by the input
through a Finite State Machine (DFA/NFA)
Advantages
 Each node processed only once
 Simplicity, sequential access pattern

Problem: skipping elements
SingleDFA and IdxDFA

SingleDFA
 <element> triggers the DFA, choosing next state
 </element>: execution backtracks to the state held when the start tag was processed
 TPQ matching: intermediate results compacted on stacks


Experiments show: reading the whole input sequentially is not good enough
Speeding up navigation: IdxDFA
 Instead of reading sequentially: use indexes and
skip descendants
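A minimal sketch of the start-tag/end-tag mechanics, restricted to a linear descendant-only path such as //a//b//c. It assumes the state can be encoded as "number of query steps already matched by the ancestors"; the real SingleDFA also handles twig branches with result stacks, which this sketch omits. The function name is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def dfa_path_matches(xml_text, steps):
    """Count elements matching //steps[0]//steps[1]//... in one sequential pass."""
    k = len(steps)
    states = [0]        # one saved state per open element
    matches = 0
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            s = states[-1]
            # The element matches if its ancestors cover the first k-1 steps.
            if elem.tag == steps[-1] and s >= k - 1:
                matches += 1
            # Start tag: move to the next state if this tag is the expected step.
            nxt = s + 1 if s < k and elem.tag == steps[s] else s
            states.append(nxt)
        else:
            # End tag: backtrack to the state saved at the matching start tag.
            states.pop()
    return matches


doc = "<a><b><c/><b><c/></b></b><c/></a>"
print(dfa_path_matches(doc, ["a", "b", "c"]))
```

Every element is visited exactly once, which is the strength of the approach and also its weakness: nothing is skipped, even when most of the document cannot contribute to a match.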
IdxDFA: example
[Figure: IdxDFA run over a sample document whose elements are labeled a, b, c, d and numbered, e.g. c1, b2, a3, a12, b13, c22.]
Cat 4: Graph Summary Evaluation
Access Pattern: indexed/random
Matching Process: merge-join partitioned input, merge individual paths
Structural summary: index node identifies a group of
nodes in the document
Processing: identify index nodes that satisfy the
query + post processing filtering
Beneficial: when there is a reasonable structural
index, much smaller than document
Problem: graph size comparable/larger than original
document
Categories Summary
Category            Access Pattern   Matching Process                                      Methods
Set Based           Sorted/indexed   Join sets, merge individual paths                     TwigStack / XB-tree, B+tree, XR-tree
Query Driven        Indexed/random   Incremental construction of each result instance      (ViST, PRIX) INLJ
Input Driven        Sequential       Input drives computation, merge individual paths      SingleDFA, IdxDFA
Structural Summary  Indexed/random   Merge-join partitioned input, merge individual paths  Structural indexes
Experimental Evaluation
1. Experiments with real datasets
2. Experiments with synthetic datasets
 Further analyze each method
 Characterize the methods according to specific features available in each custom dataset
3. More sets of experiments
 Closely verify XBTwigStack versus INLJ
Setup



Algorithms using the same API
Analysis varying structure and selectivity
Performance measure = total time required to
compute a query
 Number of nodes as secondary information



Intel Pentium 4 2.6 GHz, 1 GB RAM
Berkeley DB: 100 buffers, page size 8 KB, B+ tree
Real/benchmark datasets
 XMark (Internet auction, 1.4 GB raw data, about 17 million nodes), Protein Sequence Database
XMark
[Chart: XMark query times. Y-axis: Time (sec), 0 to 40. X-axis: Queries X1, X2, X4, X6. Series: XBTwigStack, SingleDFA, IdxDFA, INLJ, StrIdx.]
Custom Data


Goal: isolate important features
Query //a//b[.//c]//d
 Simple enough for detailed investigation
 Complex enough to provide large number of
different data access possibilities


[Figure: query tree with nodes a, b, c, d]
Vary selectivity of each element separately
Add recursion to key elements (root, leaf)
Custom Data
[Chart: Time (sec), 0 to 30, for XBTwigStack, IdxDFA and INLJ. X-axis: Dataset:Selectivity with D1:100, D2:80, D3:50, D4:10, D5:01. Inset: query tree a, b, c, d.]
Custom Data
[Chart: Time (sec), 0 to 30, for XBTwigStack, IdxDFA and INLJ. X-axis: Dataset:Selectivity with D1:100, D6:80, D8:10, D10:80, D12:10. Inset: query tree a, b, c, d.]
XBTwigStack x INLJ
[Chart: Time (sec), 0 to 250, on dataset R40 for the INLJ plans ABCD, ABDC, BACD, BADC, BCAD, BCDA, BDAC, BDCA, CBAD, CBDA, DBAC, DBCA and for XBTwig.]
On large dataset: 40 million nodes, 1 GB, 1% selectivity
Difference of 40 s between XBTwig and the best INLJ plan
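The twelve plan labels on the chart are consistent with one reading: each label is an order in which the nodes of //a//b[.//c]//d are joined, where every newly added node must be connected to an already joined node in the query tree. The sketch below enumerates the orders under that assumption (it is an interpretation of the labels, not something the slides state); `EDGES` and `join_orders` are hypothetical names.

```python
from itertools import permutations

# Query tree of //a//b[.//c]//d as undirected edges; b is the branching node.
EDGES = {("a", "b"), ("b", "c"), ("b", "d")}


def connected(x, y):
    return (x, y) in EDGES or (y, x) in EDGES


def join_orders(nodes="abcd"):
    """Orders in which each new node joins with some node already in the plan."""
    orders = []
    for perm in permutations(nodes):
        if all(any(connected(perm[i], p) for p in perm[:i])
               for i in range(1, len(perm))):
            orders.append("".join(perm).upper())
    return sorted(orders)


print(join_orders())
```

Without an optimizer module the engine has no principled way to pick among these orders, which is why the spread between the best and worst INLJ plan matters in the comparison with XBTwigStack.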
XBTwigStack x INLJ
[Chart: y-axis 0 to 350; datasets R1, R3, R4, R7, R9; series: INLJ plans ABCD, ABDC, BACD, BADC, BCAD, BCDA, BDAC, BDCA, CBAD, CBDA, DBAC, DBCA and XBTWIG.]
Conclusions


Categorization of TPQ processing algorithms
Adaptations for processing TPQ
 DFA + accessing nodes from B+tree
 INLJ + ancestor skipping



Improved DFA-based method (IdxDFA): still not enough
Structural summary available and smaller than
document: StrIdx
XBTwigStack: most robust and predictable
 INLJ when high selectivity: no guarantee about chosen
plan without optimizer module
Questions?
EXTRA SLIDES
Background
[Figure: query tree: article with branches author → last and procs → conf]
Document tree with region numbers:
Bib (1,36)
  article (2,19)
    title (3,5): t1 (4)
    author (6,13)
      last (7,9): DeWitt (8)
      first (10,12): David J. (11)
    procs (14,18)
      conf (15,17): VLDB (16)
  article (20,35)
    title (21,23): t2 (22)
    author (24,31)
      last (25,27): Lu (26)
      first (28,30): Hongjun (29)
    procs (32,34): conf (33)
Region numbering scheme : (left, right)
TwigStack
[Figure: TwigStack run for a query over a, b, c on a sample document with elements a1, a2, b1, b2, c1, c2, using stacks Sa, Sb, Sc; tuples shown: (a1, b1, c1), (a1, b1, c2), (a1, b2, c1), (a2, b2, c1).]
1) compute solutions to individual root-to-leaf paths
2) merge-join those partial solutions
→ before adding element to stack:
(i) the node has a descendant on each of the
query children streams
(ii) each of those descendant nodes recursively
satisfies this property
→ optimized by indexes
TwigStack + Indexes
B+-tree: built on the left attribute
 Access ancestor then probe descendant stream to skip unmatchable initial nodes
 Ancestor skipping not effective: skip only up to the first element following a given one
XB-tree: index on (left,right) bounding segment
 Pointer to children (region completely included in parent)
 Leaves sorted on left
 Region: ancestor access effective
XR-tree: index on (left,right) = B+tree with complex index key + stab lists
 Ancestor skipping: elements stabbed by left
[Figure: sample document with elements a1, b1, b2, b3, c1, c2]
ViST, Virtual Suffix Tree

Input: sequence of (symbol, path) pairs
(a1,)(b1,a1)(a2,a1b1)(b2,a1b1a2)(c1,a1b1a2b2)(c2,a1b1a2)
 Document and query translated
 Virtual suffix tree (B+-tree) indexed left
Processing
 Structural query = find (non-contiguous) subsequence matches → suffix tree
Benefit: query as a whole instead of merging parts
 One query path at a time
 Efficient when query top defines the results
[Figure: sample document tree with elements a1, b1, a2, b2, c1, c2]
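A sketch of the translation step, assuming the sequence is a preorder listing in which each element is paired with the concatenated tags of its ancestors. The function name is hypothetical, and simple string concatenation is only unambiguous for single-character tags like the ones on this slide.

```python
import xml.etree.ElementTree as ET


def vist_sequence(xml_text):
    """Preorder list of (symbol, prefix) pairs; prefix = tags of the ancestors."""
    pairs = []

    def visit(elem, prefix):
        pairs.append((elem.tag, prefix))
        for child in elem:
            visit(child, prefix + elem.tag)

    visit(ET.fromstring(xml_text), "")
    return pairs


# Shape of the slide's sample document: a1(b1(a2(b2(c1), c2))).
doc = "<a><b><a><b><c/></b><c/></a></b></a>"
print(vist_sequence(doc))
```

The result mirrors the sequence on the slide once the instance subscripts are dropped: (a,), (b,a), (a,ab), (b,aba), (c,abab), (c,aba).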
ViST, index
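[Figure: sample document tree with region labels (1,13), (2,4), (5,7), (8,12), (9,11), 3, 6, 10, and its index: a D-Ancestor B+ tree with keys (b,), (a,b), (b,ba), (c,ba), (c,bac), each pointing to an S-Ancestor B+ tree of region labels.]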
Virtual suffix tree
 B+tree, nodes indexed on the left position
 D- ancestor and S-ancestor
[Figure: document tree A18 with children B4 (C1, D3), B11 (C6, D8, F10), B14 (C13) and B17 (D16); leaf values v0, v2, v5, v7, v9, v12, v15. Query tree over A, B, C, D.]
ViST query sequence:
(A,ε)(B,A)(C,B)(D,B)
ViST document sequence (subscripts are positions):
(A,ε)1 (B,A)2 (C,AB)3 (D,AB)4 (B,A)5 (C,AB)6 (D,AB)7 (F,AB)8 (B,A)9 (C,AB)10 (B,A)11 (D,AB)12
ViST subsequence matching:
(A,ε)1 (B,A)2 (C,AB)3 (D,AB)4
(A,ε)1 (B,A)2 (C,AB)3 (D,AB)12
(A,ε)1 (B,A)2 (C,AB)6 (D,AB)12
(A,ε)1 (B,A)5 (C,AB)6 (D,AB)7
(A,ε)1 (B,A)5 (C,AB)10 (D,AB)12
(A,ε)1 (B,A)2 (C,AB)3 (D,AB)7
(A,ε)1 (B,A)2 (C,AB)6 (D,AB)7
(A,ε)1 (B,A)2 (C,AB)10 (D,AB)12
(A,ε)1 (B,A)5 (C,AB)6 (D,AB)12
(A,ε)1 (B,A)9 (C,AB)10 (D,AB)12
Final filtering
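The list of candidate matches can be reproduced with a naive enumerator of non-contiguous subsequence matches. This is an illustrative sketch, not the ViST index-based algorithm; it uses the query pairs as they appear in the listed matches, that is (C,AB) and (D,AB), and the names `subsequence_matches`, `DOC` and `QUERY` are hypothetical.

```python
def subsequence_matches(doc, query, start=0, chosen=()):
    """All tuples of 1-based positions where query occurs as a subsequence of doc."""
    if len(chosen) == len(query):
        return [chosen]
    want = query[len(chosen)]
    found = []
    for pos in range(start, len(doc)):
        if doc[pos] == want:
            found.extend(
                subsequence_matches(doc, query, pos + 1, chosen + (pos + 1,))
            )
    return found


DOC = [("A", ""), ("B", "A"), ("C", "AB"), ("D", "AB"),
       ("B", "A"), ("C", "AB"), ("D", "AB"), ("F", "AB"),
       ("B", "A"), ("C", "AB"), ("B", "A"), ("D", "AB")]
QUERY = [("A", ""), ("B", "A"), ("C", "AB"), ("D", "AB")]

matches = subsequence_matches(DOC, QUERY)
print(len(matches))
for m in matches:
    print(m)
```

The enumerator yields the same ten position tuples as the slide. Most of them combine a C and a D that sit under different B elements, which is why a final filtering step against the document structure is needed.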
ViST, algorithm
[Figure: sample document tree with region labels (1,13), (2,4), (5,7), (8,12), (9,11), 3, 6, 10, and the D-tree with keys (b,), (a,b), (b,ba), (c,ba), (c,bac).]
Q = (b,) (a,b) (c,ba)
Q = q1, …, qk: query sequence
D-tree: B+ index of (symbol, prefix)
S-tree: B+ index of region labels
function Search(region, i)
  if i ≤ |Q|
    T = retrieve the S-tree of qi from the D-tree
    N = retrieve from T all nodes in the range region
    for each node c(left,right) ∈ N
      Search((left,right), i+1)
  else return result
Calls made for Q:
Search((1,13), (b,))
Search((1,13), (a,b))
Search((2,4), (c,ba))
Search((5,7), (c,ba))
Search((8,12), (c,ba))
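A runnable sketch of the Search recursion. It assumes a dictionary in place of the D-tree and sorted lists probed with `bisect` in place of the S-trees; the index contents are an inferred reading of the slide's figure (leaf labels such as 6 are written as (6, 6)), so treat the data as illustrative.

```python
from bisect import bisect_right

# D-tree stand-in: (symbol, prefix) -> S-tree stand-in (regions sorted by left).
D_TREE = {
    ("b", ""): [(1, 13)],
    ("a", "b"): [(2, 4), (5, 7), (8, 12)],
    ("b", "ba"): [(3, 3)],
    ("c", "ba"): [(6, 6), (9, 11)],
    ("c", "bac"): [(10, 10)],
}


def search(query, region=(0, 14), i=0, partial=()):
    """Match query[i:] inside region; return the list of complete matches."""
    if i == len(query):
        return [partial]
    s_tree = D_TREE.get(query[i], [])
    left, right = region
    start = bisect_right(s_tree, (left, float("inf")))   # range search
    results = []
    for node in s_tree[start:]:
        if node[0] >= right:
            break
        results.extend(search(query, node, i + 1, partial + (node,)))
    return results


Q = [("b", ""), ("a", "b"), ("c", "ba")]
print(search(Q))
```

The probe for (c,ba) inside region (2,4) comes back empty, so that branch is abandoned without producing a result.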
ViST, access order
Q = (b,) (a,b) (c,ba)
(b,)
1,13
(a,b)
2,4
5,7
8,12
B+ (b,ba)
1
a
a
4
2
b
X
b
a
c
7
c
UC Riverside
(c,bac)
10
(c,ba)
6
9,11
6
c
5
3
Search ( (1,13), (b,) )
Search ( (1,13), (a,b) )
Search ( (2,4), (c,ba) )
Search ( (5,7), (c,ba) )
Search ( (8,12), (c,ba) )
Tree-Pattern Queries on a Lightweight XML Processor
38
ViST, discussion

Worst-case storage requirement for D-Ancestor is superlinear in the number of elements
 E.g. unary tree with n nodes: sequence of size O(n²)
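One way to see the quadratic bound, assuming each sequence entry stores its full prefix: in a unary tree (a single chain) with n nodes, the node at depth i carries a prefix of i symbols, so the total prefix length is

```latex
\sum_{i=0}^{n-1} i \;=\; \frac{n(n-1)}{2} \;=\; O(n^2)
```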

False alarms
D1 = (a,) (b,a) (d,ab) (e,ab) (c,a) (f,ac) (d,ac)
D2 = (a,) (b,a) (d,ab) (b,a) (e,ab)
Q = (a,) (b,a) (d,ab) (e,ab)
[Figure: trees for D1, D2 and Q]
 Our implementation: no false alarms
//a[//b]//c unordered
 ViST: (a,)(b,a)(c,a) & (a,)(c,a)(b,a)
 Our implementation: run the twig query only once
[Figure: the two query trees, a with children b, c and a with children c, b]
PRIX, PRüfer seqs. for Indexing XML

Input: sequence of labels
 Document & query mapped by Prüfer’s method
 Tree → sequence: remove one node at a time
LPS = A C B C C B A C A E E E D A
NPS = 15 3 7 6 6 7 15 9 15 13 13 13 14 15
(Any numbering scheme, here is post-order)
Bottom-up approach
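The construction can be sketched as follows: repeatedly delete the leaf with the smallest number and record the label (LPS) and number (NPS) of its parent. The parent map below is read off the NPS on this slide (with post-order numbering, node i is deleted at step i, so entry i is its parent); leaf labels are not recoverable from the slide and are not needed. Function and variable names are hypothetical.

```python
import heapq


def prufer_sequences(parent, label):
    """parent: node -> parent number (root absent). Returns (LPS, NPS)."""
    remaining_children = {}
    for par in parent.values():
        remaining_children[par] = remaining_children.get(par, 0) + 1
    leaves = [n for n in parent if n not in remaining_children]
    heapq.heapify(leaves)
    lps, nps = [], []
    while leaves:
        node = heapq.heappop(leaves)          # smallest-numbered leaf
        par = parent[node]
        lps.append(label[par])
        nps.append(par)
        remaining_children[par] -= 1
        if remaining_children[par] == 0 and par in parent:
            heapq.heappush(leaves, par)       # parent became a leaf
    return lps, nps


PARENT = {1: 15, 2: 3, 3: 7, 4: 6, 5: 6, 6: 7, 7: 15, 8: 9,
          9: 15, 10: 13, 11: 13, 12: 13, 13: 14, 14: 15}
LABEL = {15: "A", 3: "C", 7: "B", 6: "C", 9: "C", 13: "E", 14: "D"}

lps, nps = prufer_sequences(PARENT, LABEL)
print(" ".join(lps))
print(" ".join(map(str, nps)))
```

Under these assumptions the two printed lines agree with the LPS and NPS given above.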

Processing
 Sequence matching against indexed db: filter non-matches
 Refinement phases: filter twig matches so that the results form a tree, satisfy the twig query, and include the leaf nodes
PRIX, Processing

Problems
 Complex solution
 //a[//b]//c unordered: same problem as ViST

What we do
 Region based numbering scheme and XB-tree
 Bottom-up traversal of the query + subtwig
merging


Access nodes in the same order
Efficient when query bottom defines results
A(k) index
[Figure: original document (nodes 1 A, 2 B, 3 C, 4 D, 5 D, 6 E, 7 E) and its A(0), A(1) and A(2) index graphs, in which similar nodes are merged, e.g. 4,5 D and 6,7 E.]
A(k) → k is the degree of similarity, “size of common path”
≈k: k-bisimilarity
1) for any two nodes u and v, u ≈0 v iff u and v have the same label
2) u ≈k v iff u ≈k-1 v and for every parent u’ of u there is a parent v’ of v s.t. u’ ≈k-1 v’, and vice versa
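The two rules translate into a simple refinement loop: start from label classes, then split a class k times according to the classes of the parents. The sketch below assumes a small document that is one plausible reading of the figure (A1 with children B2 and C3, D4 under B2, D5 under C3, E6 under D4, E7 under D5); the tree shape and all names are assumptions.

```python
def ak_partition(label, parents, k):
    """Group nodes by k-bisimilarity; returns a sorted list of sorted groups."""
    block = dict(label)                       # k = 0: same label
    for _ in range(k):
        # Refine: same previous block and same set of parents' previous blocks.
        block = {
            n: (block[n], frozenset(block[p] for p in parents.get(n, ())))
            for n in label
        }
    groups = {}
    for node, b in block.items():
        groups.setdefault(b, []).append(node)
    return sorted(sorted(g) for g in groups.values())


LABEL = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}
PARENTS = {2: [1], 3: [1], 4: [2], 5: [3], 6: [4], 7: [5]}

for k in range(3):
    print(k, ak_partition(LABEL, PARENTS, k))
```

With this document, A(0) merges the two D nodes and the two E nodes, A(1) separates the D nodes, and A(2) separates the E nodes as well, so a larger k gives a more precise but larger summary.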
Protein
[Chart: Protein query times. Y-axis: Time (sec), 0 to 60. X-axis: Queries P1, P2, P4, P5, P6. Series: XBTwigStack, SingleDFA, IdxDFA, INLJ, StrIdx.]