VIST: The Virtual Suffix Tree

Presentation for Cmpe-521
VIST - Virtual Suffix Tree
Prepared by: Evren CEYLAN - 2003700163, Aslı UYAR - 2003701321

VIST: A Dynamic Index Method for Querying XML Data by Tree Structures
Written by: Haixun Wang, Sanghyun Park, Wei Fan, Philip S. Yu - SIGMOD 2003

What is XML?
• XML: Extensible Markup Language.
• XML has great importance in data exchange.
• So, a lot of research has been done on providing flexible query mechanisms to extract data from XML documents.

VIST: Virtual Suffix Tree
• In this paper, VIST is proposed for searching XML documents.
• XML documents and XML queries are represented as structure-encoded sequences (explained on the following slides).
• With this type of sequence, it is shown that querying XML data is equivalent to finding subsequence matches.

Index Methods in XML
• Previous index methods disassemble a query into multiple sub-queries, and then join the results of these sub-queries to provide the final answers.

What does VIST do?
• Converts both XML data and XML queries to structure-encoded sequences.
• Uses tree structures as the basic unit of query in order to avoid highly expensive join operations.
• In other words, uses structure-encoded sequences instead of nodes or paths.
• Matches structured queries against structured data as a whole, without breaking down the queries into sub-queries of paths or nodes and relying on join operations.
• Supports dynamic index update.
• In this paper, it is shown that VIST is effective and efficient in supporting structural queries.

Introduction
• XML has a growing importance in data exchange (extracting data from XML documents).
• XML provides a flexible way to define semi-structured data.
• In this paper a novel index structure called VIST (Virtual Suffix Tree) is introduced.
• VIST offers better performance and usability than previous approaches to XML indexing.
• In XML query language design, expressing complex structural or graph queries is one of the major concerns.
• (Figure 2 displays four sample queries in graph form.)

In previous approaches:
  i. Indexes are created on paths (e.g. "/P/S/I/M" in Q1). Path indexes can answer simple queries efficiently (there are no branches in Q1).
  ii. However, queries that involve branching structures (such as Q2) have to be disassembled into sub-queries, whose results are then combined by expensive join operations to produce the final results.
  iii. So, these methods are inefficient in handling such queries.

In the VIST approach:
Objective: to provide a general method so that structural XML queries need not be decomposed into sub-queries.
Result: no need to perform expensive join operations.
Method:
• XML data and XML queries are transformed into "structure-encoded sequences".
• The Virtual Suffix Tree is used to organize the structure-encoded sequences.
• VIST also speeds up the matching process.
Structure:
• VIST's index structure includes two parts: the D-Ancestor index and the S-Ancestor index (explained on the following slides).
• VIST unifies structural indexes and value indexes into a single index.
• To achieve this, a method called "dynamic virtual suffix tree labeling" is proposed (index updates can be performed directly on B+Trees).

Structure-Encoded Sequences
• A sequential representation of both XML data and XML queries.
• Objective: modeling XML queries through sequence matching lets us avoid unnecessary join operations in query processing.
• Result: structure-encoded sequences are used instead of paths or nodes.

Mapping Data and Queries to Structure-Encoded Sequences:
Stage 1:
• Let's consider the purchase record example in Figure 3.
• Notation: capital letters represent names of attributes; lowercase letters represent attribute values.
• To encode attribute values into integers we use a hash function h( ),
  e.g. v1 = h("dell") and v2 = h("ibm")
• v1 and v2 are used to represent "dell" and "ibm" respectively.
Stage 2:
• An XML document is represented by the preorder sequence of its tree structure,
  e.g. the preorder sequence of the tree in Figure 3 is: PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8
Stage 3:
• Definition: a structure-encoded sequence is a sequence of (symbol, prefix) pairs:
  D = (a1,p1), (a2,p2), . . . , (an,pn)
  ai: a node in the XML document tree.
  pi: the path from the root node to node ai.
• The tree in Figure 3 can be converted into the structure-encoded sequence D = ... ... (Figure 4)

Benefits:
• The benefit of modeling XML queries through sequence matching is that structural queries can be processed as a whole instead of being broken into smaller query units (paths or nodes of the XML document tree).
• Combining the results of the sub-queries by join operations is expensive.

The VIST Approach:
Presented in 3 stages:
• A naïve algorithm based on suffix trees.
• RIST: improves the naïve algorithm by using B+Trees to index suffix tree nodes.
• VIST: an index structure that relies only on B+Trees.

Requirements
An XML indexing method needs to meet the following requirements:
• It should support structural queries directly. This is done by "structure-encoded sequences".
• Instead of relying on suffix trees, the index method should use better-supported indexing techniques such as B+Trees.
• The index structure should allow dynamic data insertion and deletion, etc.

A Naïve Algorithm Based on Suffix Trees
• The most widely used index structure for subsequence matching is the suffix tree.

Example:
• Two XML documents, Doc1 and Doc2, and two XML queries, Q1 and Q2, as structure-encoded sequences:
  Doc1 : (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL)
  Doc2 : (P,e) (B,P) (L,PB) (v2,PBL)
  Q1 : (P,e) (B,P) (L,PB) (v2,PBL)
  Q2 : (P,e) (L,P*) (v2,P*L)

Example: (Cont'd)
• The suffix tree for Doc1 and Doc2 is shown in Figure 5.

Example: (Cont'd)
• As shown in Figure 5, the elements in the sequences represent nodes in the suffix tree.
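To make the mapping from documents to sequences, and from sequence elements to tree nodes, more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `Node` class, the `encode`, `insert` and `count_nodes` helpers, and the nested-dict tree are all hypothetical names chosen for this example, and the empty prefix is printed as `e` as on the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of an XML document tree (hypothetical helper type)."""
    symbol: str
    children: list = field(default_factory=list)


def encode(root, prefix=""):
    """Preorder walk that emits one (symbol, prefix) pair per node.

    The prefix is the path of symbols from the root down to the node's
    parent; the empty prefix is written 'e' to match the slides.
    """
    pairs = [(root.symbol, prefix or "e")]
    for child in root.children:
        pairs.extend(encode(child, prefix + root.symbol))
    return pairs


def insert(tree, sequence):
    """Insert a sequence into a nested-dict tree, one node per element."""
    node = tree
    for pair in sequence:
        node = node.setdefault(pair, {})


def count_nodes(tree):
    """Count the nodes of the nested-dict tree (excluding the dummy top)."""
    return sum(1 + count_nodes(child) for child in tree.values())


# Doc1 and Doc2 from the example, written as small document trees.
doc1 = Node("P", [Node("S", [Node("N", [Node("v1")]),
                             Node("L", [Node("v2")])])])
doc2 = Node("P", [Node("B", [Node("L", [Node("v2")])])])

seq1 = encode(doc1)
seq2 = encode(doc2)

tree = {}
insert(tree, seq1)
insert(tree, seq2)
```

Here `seq1` and `seq2` reproduce the Doc1 and Doc2 sequences listed above, and the two sequences share only the root element (P,e) once they are inserted into the same tree.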
• Since the nodes are involved in two different trees (the XML document tree and the suffix tree), there are two kinds of ancestor-descendant relationships among the nodes:
  i. D-Ancestorship, e.g. (S,P) is a D-ancestor of (L,PS)
  ii. S-Ancestorship, e.g. (v1,PSN) is an S-ancestor of (L,PS)

Naïve Algorithm based on the suffix trees:
• The NaiveSearch algorithm is based on suffix trees.
• It represents a naïve method for non-contiguous subsequence matching.
For example, to match Q2:
• Start with the root node, which matches the first element of Q2, that is (P,e).
• Then search under the root for all nodes that match (L,P*), which yields (L,PS) and (L,PB).
• Finally, search for
  - (v2,PSL) under the node labeled (L,PS)
  - (v2,PBL) under the node labeled (L,PB)
• Algorithm 1 searches nodes first by S-Ancestorship, and then by D-Ancestorship.

Difficulties of the Naïve Algorithm:
• There are difficulties in using a suffix tree to index structure-encoded sequences. The major difficulty is explained below:
• Searching for nodes satisfying both S-Ancestorship and D-Ancestorship is extremely costly (because we need to go over a large portion of the subtree for each match).

RIST: Indexing by Ancestor-Descendant Relationships
• Improves the naïve algorithm by eliminating the expensive go-over operations in the suffix tree.
• When we reach node X after matching, we can jump directly to those nodes Y of which X is both a D-Ancestor and an S-Ancestor.
• So we no longer need to search among the descendants of X to find the Ys one by one.

The RIST Algorithm:
1. Index the nodes in the suffix tree by their (Symbol, Prefix) pairs. This index is a B+Tree.
  i. This enables us to search nodes by their (Symbol, Prefix) pairs, that is, by D-Ancestorship.
  ii. This B+Tree is called the D-Ancestorship B+Tree.
2. Among all the nodes satisfying D-Ancestorship, we are interested in the ones satisfying S-Ancestorship as well.
  i. Labels are created for suffix tree nodes in order to tell the relationship between two nodes.
  ii. We use B+Trees to index nodes by labels.
  iii. This B+Tree is called the S-Ancestorship B+Tree.

Labeling Notation
• Each node x is labeled <nx, sizex>.
• nx: the preorder traversal number of x in the suffix tree.
• sizex: the total number of descendants of x in the suffix tree.
• This kind of labeling is shown in Figure 5.
• Note: with this labeling, the S-Ancestorship between any two nodes can be decided easily.
• If x and y are labeled <nx, sizex> and <ny, sizey>, node x is an S-Ancestor of y if ny ∈ (nx, nx + sizex].

Constructing the B+Trees:
• Insert all suffix tree nodes into the D-Ancestorship B+Tree using their (Symbol, Prefix) pairs as keys.
• For all nodes x inserted with the same (Symbol, Prefix), we index them by an S-Ancestorship B+Tree, using the nx values of their labels as keys.
• Shown in Figure 6.

Building the DocID B+Tree:
• The DocID B+Tree stores, for each node x (using nx as key), the document IDs of those XML sequences that end at node x when they are inserted into the suffix tree.
• Shown in the DocID B+Tree figure.

In summary:
• Unlike the naïve algorithm, RIST does not use suffix trees for subsequence matching (it uses the D-Ancestorship B+Tree and the S-Ancestorship B+Tree).
• From any node, instead of searching the entire subtree under the node, we can jump to the sub-nodes that match the next element in the query.
• So, RIST supports non-contiguous subsequence matching efficiently.

VIST: The Virtual Suffix Tree
• RIST uses a static scheme to label suffix tree nodes, and that prevents it from supporting dynamic insertions.
• For any node x labeled <n, size>, later insertions can change the number of nodes that appear before x (in preorder), as well as the size of the subtree rooted at x, which means neither n nor size can be fixed.
• The purpose of the suffix tree is to provide a labeling mechanism to encode S-Ancestorship.
• Suppose a node x is created for element di during the insertion of the sequence d1, … , di, … , dk.
• If we can estimate
  i. how many different elements will possibly follow di in future insertions, and
  ii. the occurrence probability of each of these elements,
  then we can label x's child nodes right away instead of waiting until all sequences are inserted.

VIST: The Virtual Suffix Tree (Cont'd)
It also means:
• The suffix tree itself is no longer needed, because its labeling mechanism is inefficient.
• VIST supports dynamic data insertion and deletion.

Top-down scope allocation:
• A tree structure defines nested scopes: the scope of a child node is a subscope of its parent node, and the root node has the maximum scope, which covers the scope of every node.
• In dynamic scope allocation there is a parameter called λ, which is the expected number of child nodes of any node; λ is usually assumed to be 2.
• Without knowledge of the occurrence rate of each child node, 1/λ of the remaining scope is allocated to x's first inserted child, 1/λ of what then remains to the second, and so on. For a node x with scope <n, size> and λ = 2:
  Child1 : <n+1, size/2>
  Child2 : <n+1+size/2, size/4>

Dynamic scope of a Suffix Tree Node:
• The dynamic scope of a node is a triple <n, size, k>, where k is the number of subscopes allocated inside the current scope.

Algorithm of VIST:
• VIST uses the same sequence matching algorithm as RIST.
• A dynamic method for labeling suffix tree nodes is presented, which works without building the suffix tree.
• The method relies only on rough estimates of the number of attribute values.
• Because of that, the labeling mechanism is based on a virtual suffix tree.
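The top-down scope allocation described above can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's algorithm: the function names `child_scope` and `in_scope` are hypothetical, integer division is used for the subscope widths, and each scope <n, size> is treated as the half-open label range [n, n + size) so that the sibling scopes given on the slide do not overlap. That last convention is an assumption of this sketch and differs from the static RIST labels, where size counts descendants.

```python
def child_scope(n, size, k, lam=2):
    """Scope <start, width> of the k-th inserted child (k = 1, 2, ...).

    The parent has scope <n, size>. Each new child receives 1/lam of the
    scope still unallocated, so the k-th child gets width size // lam**k
    and starts right after the scopes of the k - 1 earlier children.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    start = n + 1  # label n itself belongs to the parent
    for j in range(1, k):
        start += size // lam ** j  # skip the earlier siblings' scopes
    width = size // lam ** k
    if width < 1:
        raise OverflowError("parent scope exhausted; no room for child")
    return start, width


def in_scope(label, scope):
    """True if a node label falls strictly inside the given scope."""
    n, size = scope
    return n < label < n + size


# Illustrative total scope; lam = 2 as usually assumed on the slides.
root = (0, 20480)
first = child_scope(*root, k=1)        # <n+1, size/2>
second = child_scope(*root, k=2)       # <n+1+size/2, size/4>
grandchild = child_scope(*first, k=1)  # first child of the first child
```

Because a child's label is assigned at insertion time from its parent's scope alone, earlier labels never need to change when a new sequence arrives, which is the property the static <n, size> labels lack. The price is that a scope can run out, which this sketch reports with an exception rather than handling.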
• Example: let's look at the index structure before and after an insertion.

Algorithm of VIST:
• Suppose that, before the insertion, the index structure already contains the following sequence:
  Doc1 = (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL)
• The sequence to be inserted is:
  Doc2 = (P,e) (S,P) (L,PS) (v2,PSL)

Assumptions of the Example:
• There are two assumptions for the algorithm:
  - Max = 20480
  - The dynamic scope allocation method uses the parameter λ = 2
• The insertion process is much like that of inserting a sequence into a suffix tree.
• We follow the branches, and when there is no branch to follow we create one.

CONCLUSION:
• VIST, a dynamic index method, is developed for XML documents.
• XML data and XML queries are converted into sequences that encode their structural information.

VIST's Pros:
• Uses tree structures as the basic unit of query to avoid expensive join operations.
• Supports dynamic data insertion and deletion.
• Unlike some data structures used in other approaches, the index structure of VIST, which is based on B+Trees, is well supported by DBMSs.

End of Presentation
Questions?
