Tree-Pattern Queries on a Lightweight XML Processor

Mirella M. Moro, Zografoula Vagena, Vassilis J. Tsotras
UC Riverside

Research partially supported by CAPES, NSF grant IIS 0339032, UC Micro, and Lotus Interworks


Outline

  Motivation and Contributions
  Background
  Method Categorization
  Experimental Evaluation
  Conclusions


Motivation

  XML query languages: selection on both value and structure
  "Tree-pattern" queries (TPQ) very common in XML
  Many promising holistic solutions
  None in lightweight XML engines
    Without optimization module (e.g. eXist, Galax)
    Effective, robust processing method
  Reasons:
    No systematic comparison of query methods under a common storage model
    No integration of all methods under such storage model
  Context: XPath semantics, stored data (indexed at will)


Contributions

  TPQ methods over unified environment
  Method categorization: data access patterns and matching algorithm
  Common storage model + integration of all methods
    Capture the access features
    Permit clustering data with off-the-shelf access methods (e.g. B+tree)
  Novel variations of methods using index structures + handle TPQ
  Extensive comparative study
    Synthetic, benchmark and real datasets
    Decision on the applicability, robustness and efficiency


Background

  [Figure: document tree Bib (1,20) with article (2,19), title (3,5), author (6,13), last (7,9), first (10,12), procs (14,18), conf (15,17), shown next to a TPQ]
  Region containment in the figure: last (7,9) lies inside article (2,19), since 2 < 7 < 9 < 19
  XML database = forest of unranked, ordered, node-labeled trees, one tree per document


Common Storage Model

  Input = sequence (list) of elements
  One list per document tag = element list
  Node clustering by index structures: B+ tree on (tag, initial)

  [Figure: document tree under the numbering scheme, and its element lists]
    bib      (1,26)
    book     (2,9) (10,17)
    paper    (18,25)
    author   (3,8) (11,16) (19,24)
    name     (4,5) (12,13) (20,21)
    address  (6,7) (14,15) (22,23)


Method Categorization

  Parameters: access pattern and matching algorithm
    (1) set based techniques
    (2) query driven
    (3) input driven
    (4) structural summaries


Cat 1: Set-based Techniques

  Access pattern: sorted/indexed
  Matching process: join sets, merge individual paths
  Input: sequences of elements, one list per query node element, possibly indexed (set-based)
  Major representative: TwigStack
    Optimal XML pattern matching algorithm (ancestor/descendant)
    Stack-based processing
    Set of stacks = compact encoding of partial and total results in linear space (possibly exponential number of answers)


TwigStack + Indexes

  B+tree: built on the left attribute
    From ancestor, probe descendants: skip initial nodes
    Ancestor skipping not effective (only up to the 1st element that follows)
  XB-tree: on (left, right) bounding segment
  XR-tree: on (left, right), B+tree with complex index key + stab lists
  A comparative study* shows that
    Skipping ancestors: XB-tree better (XB-tree size is smaller)
    Recursive level of ancestors: XB-tree better again
    Searching on stab lists of XR-tree is less efficient
    Plain B+tree: skips descendants, but not ancestors
  XBTwigStack is our choice

  * H. Li et al. "An Evaluation of XML Indexes for Structural Joins". SIGMOD Record, 33(3), Sept. 2004


Cat 2: Query Driven Techniques

  Access pattern: indexed/random
  Matching process: incremental construction of each result instance
  Processing: the query defines the way input is probed
  Major representatives: ViST and PRIX
    Specific details: significantly different
    Same strategy
      Convert both document and query to sequences
      Processing query = subsequence matching


ViST and PRIX

  Recursively identify matches = quadratic time
  Optimize the naïve solution:
    Identify candidate nodes for each matching step
    Index structures to cluster those candidates
  Subsequence matching process = a plan consisting of INLJ (index nested-loop joins) among relations, each of which groups document nodes with the same label
  For a given query, the join sequence is statically defined by the sequencing of the query
  INLJ plans are a superset of the static plans that PRIX and ViST use


ViST x PRIX x INLJ

  Dataset          ViST     PRIX     INLJ
  100%             100      100      100
  LEAVES: 80%      100      84.23    84.20
  LEAVES: 1%       100      1.33     1.32
  ROOT: 80%        84.22    100      84.18
  ROOT: 1%         1.33     100      1.33
  INTERNAL: 80%    89.48    89.49    84.20
  INTERNAL: 1%     34.24    34.22    1.64

  #nodes: percentage of nodes processed by each algorithm
  INLJ: best plan


INLJ: improved

  Consider b//c, starting from c
  [Figure: document tree with a (1,52) and b nodes (2,31), (32,41), (42,51); B+ tree on the b element list]
  [Figure, continued: nested nodes a (33,40), c (35,36), c (38,39)]
  TPQ evaluation of relational plan
  Independence of the ordered XML model
  Total avoidance of false positives


Cat 3: Input Driven Techniques

  Access pattern: sequential
  Matching process: input drives computation, merge individual paths
  Processing: at each point, the flow of computation is guided entirely by the input through a finite state machine (DFA/NFA)
  Advantages
    Each node processed only once
    Simplicity, sequential access pattern
  Problem: skipping elements


SingleDFA and IdxDFA

  SingleDFA
    <element>: triggers the DFA, choosing the next state
    </element>: execution backtracks to the state in which the start tag was processed
    TPQ matching: intermediate results compacted on stacks
  Experiments show reading whole input = not enough
  Speeding up navigation: IdxDFA
    Instead of reading sequentially: use indexes and skip descendants


IdxDFA: example

  [Figure: IdxDFA run over a sample document with a, b, c and d nodes, shown in two steps]


Cat 4: Graph Summary Evaluation

  Access pattern: indexed/random
  Matching process: merge-join partitioned input, merge individual paths
  Structural summary: index node identifies a group of nodes in the document
  Processing: identify index nodes that satisfy the query + post-processing filtering
  Beneficial: when there is a reasonable structural index, much smaller than document
  Problem: graph size comparable to/larger than original document


Categories Summary

  Category             Access pattern    Matching process                                        Methods
  Set Based            Sorted/indexed    Join sets, merge individual paths                       TwigStack / XB, B+tree, XR-tree
  Query Driven         Indexed/random    Incremental construction of each result instance        (ViST, PRIX) INLJ
  Input Driven         Sequential        Input drives computation, merge individual paths        SingleDFA, IdxDFA
  Structural Summary   Indexed/random    Merge-join partitioned input, merge individual paths    Structural indexes


Experimental Evaluation

  1. Experiments with real datasets
  2. Experiments with synthetic datasets
       Further analyze each method
       Characterize the methods according to specific features available in each custom dataset
  3. More sets of experiments
       Closely verify XBTwigStack versus INLJ


Setup

  Algorithms using the same API
  Analysis varying structure and selectivity
  Performance measure = total time required to compute a query
    Number of nodes as secondary information
  Intel Pentium 4 2.6 GHz, 1 GB RAM
  Berkeley DB: 100 buffers, page size 8 KB, B+ tree
  Real/benchmark datasets
    XMark (Internet auction, 1.4 GB raw data, ± 17 million nodes), Protein Sequence Database


XMark

  [Figure: bar chart of Time (sec) per query X1, X2, X4, X6 for XBTwigStack, SingleDFA, IdxDFA, INLJ, StrIdx]


Custom Data

  Goal: isolate important features
  Query //a//b[.//c]//d
    Simple enough for detailed investigation
    Complex enough to provide large number of different data access possibilities
  Vary selectivity of each element separately
  Add recursion to key elements (root, leaf)


Custom Data

  [Figure: Time (sec) for XBTwigStack, IdxDFA, INLJ; x-axis Dataset:Selectivity with D1:100, D2:80, D3:50, D4:10, D5:01; query tree a, b, c, d shown alongside]


Custom Data

  [Figure: Time (sec) for XBTwigStack, IdxDFA, INLJ; x-axis Dataset:Selectivity with D1:100, D6:80, D8:10, D10:80, D12:10; query tree a, b, c, d shown alongside]


XBTwigStack x INLJ

  [Figure: Time (sec) on dataset R40 for the INLJ plans ABCD, ABDC, BACD, BADC, BCAD, BCDA, BDAC, BDCA, CBAD, CBDA, DBAC, DBCA and for XBTwig]
  On large dataset: 40 million nodes, 1 GB, 1% selectivity
  Difference of 40 s between XBTwig and INLJ best plan


XBTwigStack x INLJ

  [Figure: same comparison on datasets R1, R3, R4, R7, R9 (y-axis 0 to 350) for the INLJ plans ABCD, ABDC, BACD, BADC, BCAD, BCDA, BDAC, BDCA, CBAD, CBDA, DBAC, DBCA and for XBTwig]


Conclusions

  Categorization of TPQ processing algorithms
  Adaptations for processing TPQ
    DFA + accessing nodes from B+tree
    INLJ + ancestor skipping
  DFA-based improved (IdxDFA): not enough
  Structural summary available and smaller than document: StrIdx
  XBTwigStack: most robust and predictable
  INLJ when high selectivity: no guarantee about chosen plan without optimizer module


Questions?


EXTRA SLIDES


Background

  [Figure: document tree with (left, right) region labels]
    Bib (1,36)
      article (2,19)
        title (3,5): t1 (4)
        author (6,13)
          last (7,9)
          first (10,12)
        procs (14,18)
          conf (15,17)
      article (20,35)
        title (21,23): t2 (22)
        author (24,31)
          last (25,27)
          first (28,30)
        procs (32,34)
          conf (33)
  Text nodes in the figure: DeWitt (8), David J. (11), VLDB (16), Lu (26), Hongjun (29)
  Region numbering scheme: (left, right)


TwigStack

  [Figure: sample doc with nodes a1, a2, b1, b2, c1, c2; query over a, b, c; stacks Sa, Sb, Sc; resulting matches]
  1) solutions to individual root-to-leaf paths
  2) merge-join those partial solutions
  → before adding an element to a stack:
    (i) the node has a descendant on each of the query children streams
    (ii) each of those descendant nodes recursively satisfies this property
  → optimized by indexes


TwigStack + Indexes

  B+-tree: built on the left attribute
    Access ancestor, then probe descendant stream to skip unmatchable initial nodes
    Ancestor skipping not effective: skip only up to the first element following a given one
  XB-tree: index on (left, right) bounding segment
    Pointer to children (region completely included in parent)
    Leaves sorted on left
    Region: ancestor access effective
  XR-tree: index on (left, right) = B+tree with complex index key + stab lists
    Ancestor skipping: elements stabbed by left
  [Figure: sample tree with nodes a1, b1, b2, b3, c1, c2]


ViST, Virtual Suffix Tree

  Input: sequence of (symbol, path) pairs
    (a1,)(b1,a1)(a2,a1b1)(b2,a1b1a2)(c1,a1b1a2b2)(c2,a1b1a2)
    Document and query translated
    Virtual suffix tree (B+-tree) indexed on left
  Processing
    Structural query = find (non-contiguous) subsequence matches → suffix tree
    Benefit: query as a whole instead of merging parts
    One query path at a time
    Efficient when query top defines the results
  [Figure: sample tree with nodes a1, b1, a2, b2, c1, c2]


ViST, index

  Virtual suffix tree
  B+tree, nodes indexed on the left position
  D-Ancestor and S-Ancestor

  [Figure: document tree and its ViST index]
    D-Ancestor key (symbol, prefix) -> S-Ancestor entries (regions)
    (b,)     -> (1,13)
    (a,b)    -> (2,4) (5,7) (8,12)
    (b,ba)   -> 3
    (c,ba)   -> 6, (9,11)
    (c,bac)  -> 10
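
The (symbol, path) input sequence listed on the ViST slide can be reproduced with a short preorder walk. This is only a sketch under stated assumptions: the nested-tuple tree representation and the function name `vist_sequence` are hypothetical, and the tree shape is inferred from the prefixes printed on the slide rather than taken from the original figure.

```python
def vist_sequence(tree):
    """Return the preorder list of (symbol, prefix) pairs for a tree.

    `tree` is a (label, children) tuple; this layout is an assumption
    made for illustration, not the authors' storage format.
    """
    pairs = []

    def visit(node, prefix):
        label, children = node
        # Each node is emitted with the concatenated labels of its ancestors.
        pairs.append((label, prefix))
        for child in children:
            visit(child, prefix + label)

    visit(tree, "")
    return pairs


# Tree shape inferred from the slide: a1 > b1 > a2 > {b2 > c1, c2}
doc = ("a1", [("b1", [("a2", [("b2", [("c1", [])]),
                              ("c2", [])])])])

for symbol, prefix in vist_sequence(doc):
    print(f"({symbol},{prefix})", end="")
print()
```

Matching a query then amounts to looking for the query's own pairs as a non-contiguous subsequence of this list, which is the step the D-Ancestor and S-Ancestor indexes are meant to speed up.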


ViST, example

  [Figure: document tree with A, B, C, D and F nodes, and a query tree with A, B, C, D]
  Document sequence (ViST):
    (A,ε)1 (B,A)2 (C,AB)3 (D,AB)4 (B,A)5 (C,AB)6 (D,AB)7 (F,AB)8 (B,A)9 (C,AB)10 (B,A)11 (D,AB)12
  Query sequence (ViST):
    (A,ε) (B,A) (C,B) (D,B)
  ViST subsequence matching:
    (A,ε)1 (B,A)2 (C,AB)3 (D,AB)4
    (A,ε)1 (B,A)2 (C,AB)3 (D,AB)12
    (A,ε)1 (B,A)2 (C,AB)6 (D,AB)12
    (A,ε)1 (B,A)5 (C,AB)6 (D,AB)7
    (A,ε)1 (B,A)5 (C,AB)10 (D,AB)12
    (A,ε)1 (B,A)2 (C,AB)3 (D,AB)7
    (A,ε)1 (B,A)2 (C,AB)6 (D,AB)7
    (A,ε)1 (B,A)2 (C,AB)10 (D,AB)12
    (A,ε)1 (B,A)5 (C,AB)6 (D,AB)12
    (A,ε)1 (B,A)9 (C,AB)10 (D,AB)12
  Final filtering


ViST, algorithm

  Q = q1, ..., qk: query sequence
  D-tree: B+ index of (symbol, prefix)
  S-tree: B+ index of region labels

  function Search(region, i)
    if i ≤ |Q|
      T = retrieve the S-tree of qi from the D-tree
      N = retrieve from T all nodes in the range region
      for each node c(left, right) ∈ N
        Search((left, right), i+1)
    else
      return result

  Example: Q = (b,) (a,b) (c,ba)
    Index contents: (b,) -> (1,13); (a,b) -> (2,4) (5,7) (8,12); (b,ba) -> 3; (c,ba) -> 6, (9,11); (c,bac) -> 10
    Search((1,13), (b,))
    Search((1,13), (a,b))
    Search((2,4), (c,ba))
    Search((5,7), (c,ba))
    Search((8,12), (c,ba))


ViST, access order

  Q = (b,) (a,b) (c,ba)
  [Figure: the document tree annotated with the order (1 to 7) in which the Search calls of the algorithm slide visit the nodes]


ViST, discussion

  Worst-case storage requirement for D-Ancestor is more than linear in #elements
    E.g. unary tree with n nodes: sequence of size O(n^2)
  False alarms
    D1 = (a,) (b,a) (d,ab) (e,ab) (c,a) (f,ac) (d,ac)
    D2 = (a,) (b,a) (d,ab) (b,a) (e,ab)
    Q  = (a,) (b,a) (d,ab) (e,ab)
    [Figure: trees for D1, D2 and Q]
    Our implementation: no false alarms
  //a[//b]//c is unordered
    ViST: (a,)(b,a)(c,a) & (a,)(c,a)(b,a)
    Our implementation: run the twig query only once


PRIX, PRüfer seqs. for Indexing XML

  Input: sequence of labels
    Document & query mapped by Prüfer's method
    Tree → sequence: remove one node at a time
      LPS = A C B C C B A C A E E E D A
      NPS = 15 3 7 6 6 7 15 9 15 13 13 13 14 15
      (Any numbering scheme; here it is post-order)
    Bottom-up approach
  Processing
    Sequence matching against indexed db: filter non-matches
    Refinement phases: filter twig-matches; the results:
      form a tree, satisfy the twig query, include the leaf nodes


PRIX, Processing

  Problems
    Complex solution
    //a[//b]//c unordered: same problem as ViST
  What we do
    Region-based numbering scheme and XB-tree
    Bottom-up traversal of the query + subtwig merging
    Access nodes in the same order
  Efficient when query bottom defines results


A(k) index

  [Figure: original document and its A(0), A(1), A(2) indexes]
    Original document nodes: 1 A, 2 B, 3 C, 4 D, 5 D, 6 E, 7 E
    A(0): {1} A, {2} B, {3} C, {4,5} D, {6,7} E
    A(1): {1} A, {2} B, {3} C, {4} D, {5} D, {6,7} E
    A(2): {1} A, {2} B, {3} C, {4} D, {5} D, {6} E, {7} E
  A(k) → k is the degree of similarity, "size of common path"
  k-bisimilarity (written ≈k)
    1) for any two nodes u and v, u ≈0 v iff u and v have the same label
    2) u ≈k v iff u ≈k-1 v and for every parent u' of u there is a parent v' of v s.t. u' ≈k-1 v', and vice versa


Protein

  [Figure: bar chart of Time (sec) per query P1, P2, P4, P5, P6 for XBTwigStack, SingleDFA, IdxDFA, INLJ, StrIdx]
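
As a rough illustration of the k-bisimilarity rule on the A(k) index slide, the sketch below partitions tree nodes into A(k) groups. Node ids and labels follow that slide; the parent edges, the dictionary layout and the function name `ak_partition` are assumptions chosen for illustration. It also relies on every node having at most one parent, which holds for trees but not for general graphs.

```python
from collections import defaultdict


def ak_partition(labels, parent, k):
    """Group node ids into k-bisimilarity classes of a tree.

    labels: node id -> label; parent: node id -> parent id (root absent).
    Both dictionaries are hypothetical inputs for this sketch.
    """
    # k = 0: two nodes are equivalent iff they carry the same label.
    block = dict(labels)
    for _ in range(k):
        # One refinement step: keep nodes together only if they were
        # equivalent before and their parents were equivalent before.
        block = {n: (block[n], block.get(parent.get(n))) for n in labels}
    groups = defaultdict(list)
    for n in sorted(labels):
        groups[block[n]].append(n)
    return sorted(groups.values())


# Labels from the slide; parent edges are an assumed tree shape.
labels = {1: "A", 2: "B", 3: "C", 4: "D", 5: "D", 6: "E", 7: "E"}
parent = {2: 1, 3: 1, 4: 2, 5: 3, 6: 4, 7: 5}

for k in range(3):
    print(f"A({k}):", ak_partition(labels, parent, k))
```

With these assumed edges the D nodes separate at k = 1 (their parents B and C differ) and the E nodes only at k = 2, which is why a small k keeps the summary compact at the cost of more post-processing filtering.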