VIST: The Virtual Suffix Tree

Presentation for Cmpe-521
VIST – Virtual Suffix Tree
Prepared by:
Evren CEYLAN – 2003700163
Aslı UYAR - 2003701321
VIST:
A Dynamic Index Method for Querying
XML Data by Tree Structures
Written by:
Haixun Wang, Sanghyun Park, Wei Fan, Philip S. Yu – SIGMOD 2003
What is XML?

XML: Extensible Markup Language

Has great importance in data exchange.

So, a lot of research has gone into providing flexible query mechanisms for extracting data from XML documents.
VIST : Virtual Suffix Tree

In this paper, VIST is proposed for searching XML documents.

XML documents and XML queries are represented as structure-encoded sequences (explained on the following slides).

With this representation, querying XML data is shown to be equivalent to finding subsequence matches.
Index Methods in XML

Previous index methods:
Disassemble a query into multiple sub-queries, and then join the results of these sub-queries to produce the final answers.
What does VIST do?

Converts both XML Data and XML Queries to
structure-encoded sequences

Uses tree structures as the basic unit of
query in order to avoid highly expensive join
operations

In other words, it works on structure-encoded sequences instead of nodes or paths
What does VIST do?

Matches structured queries against structured data as a whole, without breaking the queries down into sub-queries of paths or nodes and relying on join operations.

Supports dynamic index update.
What does VIST do?
 In this paper, it is shown that VIST
is effective and efficient in supporting
structural queries.
Introduction




XML has a growing importance in data exchange (extracting data from XML documents)
XML provides a flexible way to define semi-structured data
In this paper a novel index structure called VIST (Virtual Suffix Tree) is introduced
VIST offers better performance and usability than previous approaches to XML indexing.

In XML query language design, expressing complex structural or graph queries is one of the major concerns.

(Figure 2 displays four sample queries in graph form)
In previous approaches:

i. Indexes are created on paths (e.g. “/P/S/I/M” in Q1). Path indexes can answer simple queries efficiently (Q1 has no branches).

ii. However, queries that involve branching structures (such as Q2) have to be disassembled into sub-queries, whose results are then combined by expensive join operations to produce the final results.

iii. So, these methods are inefficient in handling branching queries.
In the VIST approach:
Objective: to provide a general method so that structural XML queries need not be decomposed into sub-queries.
Result: no need to perform expensive join operations.
Method:

XML data and XML queries are transformed into “structure-encoded sequences”.

A virtual suffix tree is used to organize the structure-encoded sequences.

VIST also speeds up the matching process.
Structure:

VIST’s index structure includes two parts: the D-Ancestor index and the S-Ancestor index (explained on the following slides).

VIST unifies structural indexes and value indexes into a single index.

To achieve this, a method called “dynamic virtual suffix tree labeling” is proposed, so that index updates can be performed directly on the B+Trees.
Structure-Encoded Sequences

Sequential representation of both
XML Data and XML Queries.

Objective: Modeling XML queries as sequence matching lets us avoid unnecessary join operations in query processing.

Result: Structure-encoded sequences are used instead of paths or nodes.
Mapping Data and Queries to
Structure-Encoded Sequences:
Stage 1:
Consider the purchase record example in Figure 3.
Notation: Capital letters represent names of attributes.
Lowercase letters represent attribute values.
To encode attribute values as integers we use a hash() function.
e.g. v1 = h(“dell”) and v2 = h(“ibm”)
v1 and v2 are used to represent “dell” and “ibm” respectively.
Stage 2:


Represent an XML document by the preorder sequence of its tree structure.
e.g. the preorder sequence of the tree in Figure 3 is:
PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8
Stage 3:

Definition: A structure-encoded sequence is a sequence of (symbol, prefix) pairs:
D = (a1,p1), (a2,p2), . . . , (an,pn)
ai: a node in the XML document tree.
pi: the path from the root node to node ai.


Figure 3 can be converted into the
structure-encoded sequence.
D = ... ...
(Figure 4)
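
The mapping from a document tree to (symbol, prefix) pairs can be sketched in a few lines of Python. This is only an illustration, not the paper's code: the nested-tuple tree format and the function name structure_encode are assumptions made here, the empty string stands in for the empty prefix e, and the sample tree is the small Doc1 used later in the slides rather than Figure 3.

```python
def structure_encode(node, prefix=""):
    """Preorder walk that emits one (symbol, prefix) pair per node.

    `node` is a (symbol, children) tuple; this tree format is an
    assumption made for the sketch.
    """
    symbol, children = node
    pairs = [(symbol, prefix)]
    for child in children:
        # The child's prefix is the path from the root down to its parent.
        pairs.extend(structure_encode(child, prefix + symbol))
    return pairs


# Doc1 from the later example: P has child S; S has children N and L.
doc1 = ("P", [("S", [("N", [("v1", [])]),
                     ("L", [("v2", [])])])])

for pair in structure_encode(doc1):
    print(pair)
```

Because the prefix travels with every symbol, the flat sequence still carries the full tree structure, which is what allows a query to be matched as a sequence.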
Benefits:

By modeling XML queries as sequence matching, structural queries can be processed as a whole instead of being broken into smaller query units (paths or nodes of the XML document tree).

This matters because combining the results of the sub-queries by join operations is expensive.
The VIST Approach:
Presented in 3 stages:

Naïve algorithm based on suffix trees

RIST: improves the naïve algorithm by using B+Trees to index the suffix tree nodes

VIST: an index structure that relies only on B+Trees
Requirements

An XML indexing method should:

Support structural queries directly. This is done with “structure-encoded sequences”.
Rely on well-established indexing techniques such as B+Trees instead of “suffix trees”.
Allow dynamic data insertion and deletion, etc.
A Naïve Algorithm Based on
Suffix Trees

The most widely used index structure for subsequence matching is the suffix tree.
Example:


2 XML Documents called Doc1 and Doc2,
2 XML Queries called Q1 and Q2
in structure-encoded sequences.
Doc1 : (P,e)(S,P)(N,PS)(V1,PSN)(L,PS) (V2,PSL)
Doc2 : (P,e) (B,P) (L,PB) (V2,PBL)
Q1 : (P,e) (B,P) (L,PB) (V2,PBL)
Q2 : (P,e) (L,P*) (V2,P*L)
Example:

(Cont’d)
A tree structure for Doc1 and Doc2
is shown in Figure 5
Example:


(Cont’d)
As shown above, the elements in the sequences are represented by nodes in the suffix tree.
Since the nodes are involved in 2 different trees (the XML document tree and the suffix tree), there are 2 kinds of ancestor-descendant relationships among the nodes.
i) D-Ancestorship: ancestry in the document tree
e.g. (S,P) is a D-ancestor of (L,PS)
ii) S-Ancestorship: ancestry in the suffix tree
e.g. (v1,PSN) is an S-ancestor of (L,PS)
Naïve Algorithm based on the
suffix trees:

NaiveSearch is an algorithm based on suffix trees.

It is a naïve method for non-contiguous subsequence matching.
For example, to match Q2:

Start with the root node, which matches the 1st element of Q2, that is (P,e).

Then search under the root for all nodes that match (L,P*), which yields (L,PS) and (L,PB).

Finally, search for
- (v2,PSL) under the node labeled (L,PS)
- (v2,PBL) under the node labeled (L,PB)

Algorithm 1 searches nodes first by S-Ancestorship, and then by D-Ancestorship.
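
The walk-through for Q2 can be reproduced with a small Python sketch. It is not the paper's Algorithm 1: the nested-dict trie, the helper names, and the rule that "*" stands for exactly one symbol and is bound on first match are all assumptions made for illustration. The empty string again stands in for e.

```python
def build_trie(docs):
    """Insert each sequence into a trie (a stand-in for the suffix tree)."""
    root = {"kids": {}, "docs": set()}
    for doc_id, seq in docs.items():
        node = root
        for elem in seq:
            node = node["kids"].setdefault(elem, {"kids": {}, "docs": set()})
        node["docs"].add(doc_id)
    return root


def descendants(node):
    """Yield every (element, node) below `node`: the S-descendants."""
    for elem, child in node["kids"].items():
        yield elem, child
        yield from descendants(child)


def match_prefix(pattern, prefix):
    """Return the symbol bound to '*' ('' if none), or None on mismatch."""
    if len(pattern) != len(prefix):
        return None
    binding = ""
    for p, c in zip(pattern, prefix):
        if p == "*":
            binding = c
        elif p != c:
            return None
    return binding


def naive_search(node, query):
    """Non-contiguous subsequence match by scanning whole subtrees."""
    if not query:
        found = set(node["docs"])
        for _, child in descendants(node):
            found |= child["docs"]
        return found
    (symbol, pattern), rest = query[0], query[1:]
    result = set()
    for (sym, prefix), child in descendants(node):  # the costly scan
        if sym != symbol:
            continue
        binding = match_prefix(pattern, prefix)
        if binding is None:
            continue
        if binding:  # instantiate '*' in the remaining query elements
            rest_bound = [(s, p.replace("*", binding)) for s, p in rest]
        else:
            rest_bound = rest
        result |= naive_search(child, rest_bound)
    return result


docs = {
    "Doc1": [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"),
             ("L", "PS"), ("v2", "PSL")],
    "Doc2": [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")],
}
trie = build_trie(docs)
q1 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]
q2 = [("P", ""), ("L", "P*"), ("v2", "P*L")]
print(naive_search(trie, q1), naive_search(trie, q2))
```

The loop over descendants(node) is the part that scans a whole subtree for every matched element.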
Difficulties of Naive Algorithm:


There are difficulties in using a suffix tree to index structure-encoded sequences.
The major difficulty is explained below:
Searching for nodes that satisfy both S-Ancestorship and D-Ancestorship is extremely costly (because we need to go over a large portion of the subtree for each match).
RIST: Indexing by Ancestor-Descendant Relationships

Improves the naïve algorithm by eliminating the expensive go-over operations in the suffix tree.

When we reach node X after matching, we can jump directly to those nodes Y of which X is both a D-Ancestor and an S-Ancestor.

So, we no longer need to search among the descendants of X to find the Ys one by one.
RIST Algorithm:


1. Index the nodes in the suffix tree by their (Symbol,Prefix) pairs. This index is a B+Tree.

i. This enables us to search nodes by their (Symbol,Prefix) pairs, that is, by D-Ancestorship.

ii. This B+Tree is called the D-Ancestorship B+Tree.
RIST Algorithm:




2. Among all the nodes satisfying D-Ancestorship, we are interested in the ones satisfying S-Ancestorship as well.
i. Labels are created for suffix tree nodes in order to tell the relationship between 2 nodes.
ii. We use B+Trees to index nodes by labels.
iii. This B+Tree is called the S-Ancestorship B+Tree.
Labeling Notation

<nx, sizex>

nx: the prefix traversal order of x in the suffix tree.

sizex: the total number of descendants of x in the suffix tree.

This kind of labeling is shown in Figure 5.
Labeling Notation

Note: with this labeling, the S-Ancestorship between any two nodes can be decided easily:

If x and y are labeled <nx, sizex> and <ny, sizey>, node x is an S-Ancestor of y if ny ∈ (nx, nx + sizex]
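
A short Python sketch of this labeling, under stated assumptions: the tree is given as a hypothetical dict from node name to list of children, numbering starts at 0, and the ancestor test uses the half-open interval (nx, nx + sizex], which follows from size being the number of descendants.

```python
def label_tree(children, root):
    """Assign <n, size> to every node: n is the preorder number,
    size is the number of descendants."""
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        n = counter
        counter += 1
        size = 0
        for child in children.get(node, []):
            size += visit(child) + 1  # the child itself plus its descendants
        labels[node] = (n, size)
        return size

    visit(root)
    return labels


def is_s_ancestor(label_x, label_y):
    """x is an ancestor of y iff ny lies in (nx, nx + sizex]."""
    nx, size_x = label_x
    ny, _ = label_y
    return nx < ny <= nx + size_x


# Hypothetical five-node tree: a -> (b, e), b -> (c, d)
tree = {"a": ["b", "e"], "b": ["c", "d"]}
labels = label_tree(tree, "a")
print(labels)
print(is_s_ancestor(labels["b"], labels["d"]),
      is_s_ancestor(labels["b"], labels["e"]))
```

The ancestor check needs only the two labels, so it can be answered as a range lookup on n without touching the tree.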
Constructing the B+Trees:

Insert all suffix tree nodes into the D-Ancestorship B+Tree using their (Symbol,Prefix) pairs as keys.

All nodes x inserted with the same (Symbol,Prefix) are indexed by an S-Ancestorship B+Tree, using the nx values of their labels as keys.

Shown in FIGURE 6
Building the DocID B+Tree:

For each node x (using nx as key), the DocID B+Tree stores the document IDs of those XML sequences that end at node x when they are inserted into the suffix tree.

Shown in DocID B+Tree
In summary;

Unlike the naïve algorithm, RIST does not use suffix trees for subsequence matching (it uses the D-Ancestorship B+Tree and the S-Ancestorship B+Tree).

From any node, instead of searching the entire subtree under the node, we can jump to the sub-nodes that match the next element in the query.

So, RIST supports non-contiguous subsequence matching efficiently.
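
The following Python sketch shows the idea of jumping by range lookups instead of scanning subtrees. It is an illustration under several assumptions: a dict of sorted lists stands in for the D-Ancestorship and S-Ancestorship B+Trees, a sorted list stands in for the DocID B+Tree, wildcards are left out, and the names build_index and rist_search are made up here.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(docs):
    """Build a trie, label it with <n, size>, and return the three indexes."""
    root = {"kids": {}, "docs": []}
    for doc_id, seq in docs.items():
        node = root
        for elem in seq:
            node = node["kids"].setdefault(elem, {"kids": {}, "docs": []})
        node["docs"].append(doc_id)

    d_index = defaultdict(list)  # (symbol, prefix) -> [(n, size)], sorted by n
    doc_index = []               # [(n, doc_id)], sorted by n
    counter = 0

    def visit(node, elem):
        nonlocal counter
        n = counter
        counter += 1
        size = 0
        for child_elem, child in node["kids"].items():
            size += visit(child, child_elem) + 1
        if elem is not None:
            d_index[elem].append((n, size))
        doc_index.extend((n, d) for d in node["docs"])
        return size

    total = visit(root, None)
    for entries in d_index.values():
        entries.sort()
    doc_index.sort()
    return d_index, doc_index, (0, total)


def rist_search(d_index, doc_index, scope, query):
    """Match `query` inside `scope` = (n, size) using only range lookups."""
    n, size = scope
    if not query:
        # Documents ending at this node or below it: n' in [n, n + size].
        lo = bisect_left(doc_index, (n,))
        hi = bisect_left(doc_index, (n + size + 1,))
        return {doc for _, doc in doc_index[lo:hi]}
    entries = d_index.get(query[0], [])           # D-Ancestorship lookup
    lo = bisect_left(entries, (n + 1,))           # S-Ancestorship range:
    hi = bisect_left(entries, (n + size + 1,))    # n' in (n, n + size]
    result = set()
    for child_scope in entries[lo:hi]:
        result |= rist_search(d_index, doc_index, child_scope, query[1:])
    return result


docs = {
    "Doc1": [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"),
             ("L", "PS"), ("v2", "PSL")],
    "Doc2": [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")],
}
d_index, doc_index, root_scope = build_index(docs)
query = [("P", ""), ("L", "PS"), ("v2", "PSL")]
print(rist_search(d_index, doc_index, root_scope, query))
```

Each query element costs one key lookup plus one range scan, regardless of how large the subtree under the current node is.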
VIST: The Virtual Suffix Tree

RIST uses a static scheme to label suffix tree nodes, and that prevents it from supporting dynamic insertions.

This is because, for any node x labeled <n,size>, later insertions can change the number of nodes that appear before x (in the prefix traversal order),

as well as the size of the subtree rooted at x, which means neither n nor size can be fixed.
VIST: The Virtual Suffix Tree

The purpose of the suffix tree is to
provide a labeling mechanism to encode
S-Ancestorship.

Suppose a node x is created for element
di ,during the insertion of sequence
d1, … , di,… ,dk.
VIST: The Virtual Suffix Tree

Suppose we can estimate
i. how many different elements will possibly follow di in future insertions, and
ii. the occurrence probability of each of these elements.

Then we can label x’s child nodes right away instead of waiting until all sequences are inserted.
VIST: The Virtual Suffix Tree

(Cont’d)
It also means:

The suffix tree itself is no longer needed, because it served only as a labeling mechanism, and that static mechanism is inefficient.
Dynamic data insertion and deletion are supported.
Top down scope allocation:

A tree structure defines nested scopes: the
scope of a child node is a subscope of its parent
node, and the root node has the max scope
which covers the scope of each node.
Top down scope allocation:



Dynamic scope allocation has a parameter called λ, the expected number of child nodes of any node.
λ is usually assumed to be 2.
Without knowledge of the occurrence rate of each child node, 1/λ of the remaining scope is allocated to x’s 1st inserted child, then 1/λ of what remains to the next child, and so on.

For a node x labeled <n,size> and λ = 2:
Child1 : <n+1, size/2>
Child2 : <n+1+size/2, size/4>
Dynamic scope of a
Suffix Tree Node:

The dynamic scope of a node is a triple <n,size,k>, where k is the number of subscopes allocated inside the current scope.
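
A minimal Python sketch of this allocation rule, assuming integer division and assuming that each new child's subscope starts right after the previous child's, so the second child of <n,size> starts at n + 1 + size/2. The function name and the error raised when a scope runs out are choices made for this sketch only.

```python
def next_child_scope(scope, lam=2):
    """Given a parent's dynamic scope <n, size, k>, return the scope of its
    next child and the parent's updated scope."""
    n, size, k = scope
    remaining = size
    for _ in range(k):                  # replay the k earlier allocations
        remaining -= remaining // lam
    child_size = remaining // lam       # 1/lam of the remaining scope
    if child_size == 0:
        raise OverflowError("scope exhausted; no room for another child")
    child_n = n + 1 + (size - remaining)
    return (child_n, child_size, 0), (n, size, k + 1)


root = (0, 20480, 0)
child1, root = next_child_scope(root)
child2, root = next_child_scope(root)
print(child1, child2, root)
```

With λ = 2 each child gets half of what is left, so the first few children receive large subscopes and later ones shrink geometrically.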
Algorithm of VIST:

VIST uses the same sequence matching algorithm as RIST.

A dynamic method for labeling suffix tree nodes is presented that works without building the suffix tree.
Algorithm of VIST:

The method relies on rough estimations of the number of attribute values.

Because no suffix tree is actually built, the labeling mechanism is based on a virtual suffix tree.

Example:
- let's look at the index structure before and after an insertion
Algorithm of VIST:

Suppose, before the insertion the index
structure already contains the following sequence:
Doc1 = (P,e) (S,P) (N,PS) (V1,PSN) (L,PS) (V2,PSL)

The sequence to be inserted
=> Doc2 = (P,e) (S,P) (L,PS) (V2,PSL)
Assumptions of the Example:

There are 2 assumptions for the algorithm:
Max = 20480
The dynamic scope allocation method uses the parameter λ = 2


The insertion process is much like
that of inserting a sequence into a
suffix tree.

We follow the branches, and when
there is no branch to follow we
create one.
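
The insertion of Doc1 followed by Doc2 can be sketched in Python under the two stated assumptions (Max = 20480, λ = 2). Everything else is an assumption of this sketch rather than the paper's method: a dict keyed by (parent label, element) replaces the B+Tree lookups, subscopes are laid out one after another, and the resulting label values are whatever this layout produces, not figures taken from the paper.

```python
def insert(index, scopes, seq, max_scope=20480, lam=2):
    """Insert one structure-encoded sequence and return the label n of each
    element. `index` maps (parent_n, element) -> child_n; `scopes` maps
    n -> [size, k]. No suffix tree is materialized."""
    scopes.setdefault(0, [max_scope, 0])      # virtual root with scope Max
    parent = 0
    path = []
    for elem in seq:
        child = index.get((parent, elem))
        if child is None:                     # no branch to follow: create one
            size, k = scopes[parent]
            remaining = size
            for _ in range(k):
                remaining -= remaining // lam
            child_size = remaining // lam
            if child_size == 0:
                raise OverflowError("scope exhausted")
            child = parent + 1 + (size - remaining)
            scopes[parent][1] = k + 1
            scopes[child] = [child_size, 0]
            index[(parent, elem)] = child
        path.append(child)
        parent = child
    return path


index, scopes = {}, {}
doc1 = [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"),
        ("L", "PS"), ("v2", "PSL")]
doc2 = [("P", ""), ("S", "P"), ("L", "PS"), ("v2", "PSL")]
path1 = insert(index, scopes, doc1)
path2 = insert(index, scopes, doc2)
print(path1)
print(path2)
```

Doc2 reuses the labels of (P,e) and (S,P) and only allocates new subscopes from (L,PS) onward, and none of the labels assigned for Doc1 change.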
CONCLUSION:

VIST (a dynamic index method) is developed for XML documents.

XML data and XML queries are converted into sequences that encode their structural information.
VIST’s Pros:

Uses tree structure as the basic unit of query to
avoid expensive join operations.

Supports dynamic data insertion and deletion.

Unlike some data structures used in other approaches, the index structure of VIST, which is based on B+Trees, is well supported by DBMSs.
End of Presentation
Questions ?