Path Queries on Compressed XML
Peter Buneman, Martin Grohe, Christoph Koch
Laboratory for Foundations of Computer Science, University of Edinburgh, UK
Presented by Murat Şensoy and Can Gürel

WHAT IS SHOWN IN THIS STUDY
- XML tree structure can be effectively compressed and manipulated using techniques derived from symbolic model checking.
- Specifically, concise representations of document tree structures based on sharing subtrees are highly effective.
- Compressed structures can be queried directly and efficiently through a process of manipulating selections of nodes and partial decompression.

MOTIVATION
- The ability to store and manipulate large portions of the structure of very large XML documents in main memory is crucial to the development of efficient, scalable native XML databases and query engines.
- Extracting the text, storing the compressed text separately, and using the remaining tree skeleton for in-memory query processing is not sufficient for very large documents whose skeletons do not fit in memory.
- Solution: compress the skeleton so that path queries can be evaluated on the compressed skeleton itself.

METHOD
- Our technique of compressing XML skeletons by sub-tree sharing can be seen as a direct generalization of the compression of Boolean functions into Ordered Binary Decision Diagrams (OBDDs). This enables us to transfer the efficient algorithms for OBDDs to our setting and thus provides the basis for new algorithms for evaluating path queries directly on compressed skeletons.
- The original skeleton can be recovered by the appropriate depth-first traversal of the compressed skeleton.

[Figure: (a) tree skeleton, (b) compressed tree with shared subtrees, (c) compressed tree with merged multi-edges]

This compressed representation is both natural and exploits an interesting property of most practical XML documents: they possess a regular structure, so sub-tree structures in a large document are likely to be repeated many times.
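The sub-tree sharing idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the names (`compress`, `compress_tree`) and the tuple tree encoding are mine. A node is identified by its label and the identities of its children, so identical subtrees always map to the same DAG vertex:

```python
def compress(label, children_ids, table, nodes):
    """Return the id of the shared DAG node for (label, children)."""
    key = (label, tuple(children_ids))
    if key not in table:          # first time we see this subtree shape
        table[key] = len(nodes)
        nodes.append(key)
    return table[key]

def compress_tree(tree, table, nodes):
    """Bottom-up: compress children first, then hash-cons the parent."""
    label, children = tree
    ids = [compress_tree(c, table, nodes) for c in children]
    return compress(label, ids, table, nodes)

# An XML-encoded 3x2 "relational table": three identical rows of two columns.
row = ("row", [("col", []), ("col", [])])
doc = ("table", [row, row, row])
table, nodes = {}, []
root = compress_tree(doc, table, nodes)
print(len(nodes))  # 3 DAG vertices, versus 10 nodes in the uncompressed tree
```

The uncompressed skeleton has 1 + 3 + 6 = 10 nodes; sharing collapses it to one `col` vertex, one `row` vertex, and the root.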
If one considers an example of extreme regularity, that of an XML-encoded relational table with R rows and C columns, the skeleton has size O(C × R), while the compressed skeleton as in Figure (b) has size O(C + R). This is further reduced to O(C + log R) by merging multiple edges as in Figure (c).

Compressed skeletons are easy to compute and allow us to query the data in a natural way. Each node in the compressed skeleton represents a set of nodes in the uncompressed tree. The purpose of a path query is to select a set of nodes in the uncompressed tree; this set can be represented by a subset of the nodes in a partially decompressed skeleton.

COMING UP
- XML as a universal data source
- Handling XML: XML storage, XML databases
- Basic definitions: compression and decompression
- Queries on XML: query specification and path languages, XPath, query evaluation
- Implementation
- Experiments
- Conclusion

XML Storage
Infeasible approaches:
- Subtree-based indexing/caching mechanisms
- Using a relational DBMS to keep node information in tuples
Better: separate the string data from the document structure.
- String data is stored and indexed using conventional approaches.
- Document structure is stored as a tree (the "skeleton") whose nodes keep element and attribute names.
- This separation is used in XMill as a compression scheme.

XML Storage (cont'd)
- String data is needed for localized processing and is easily compressed by conventional methods.
- The skeleton is needed for the navigational aspect of queries (a heavy performance cost). It is usually small and can fit wholly in main memory, but in complex databases it can be large and will not fit. Can we compress it? And how do we avoid compression/decompression overhead (time and space) during query evaluation?

Proposed Compression Scheme
- Based on sharing of common subtrees
- Fits most practical XML documents well, due to their structural regularity
- Independent of the actual DTD
- Generic structure, capable of expressing information other than just elements and attributes (e.g.
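The further reduction in Figure (c) comes from run-length merging of consecutive identical edges. A small illustrative sketch (helper name `merge_edges` is mine): a run of R identical children becomes a single edge annotated with its multiplicity, which needs only O(log R) bits to store.

```python
from itertools import groupby

def merge_edges(children):
    """Merge consecutive identical edges into (child, multiplicity) pairs."""
    return [(c, len(list(g))) for c, g in groupby(children)]

# The root of an XML-encoded 1000-row table: 1000 edges collapse to one.
print(merge_edges(["row"] * 1000))        # [('row', 1000)]
# Only *consecutive* duplicates are merged, so edge order is preserved.
print(merge_edges(["a", "a", "b", "a"]))  # [('a', 2), ('b', 1), ('a', 1)]
```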
string matching, query results)
- The original document structure is preserved.
- Bisimulation: each node in the compressed tree corresponds to a number of nodes in the uncompressed tree.
- Partial decompression is possible.
- Naturally extends to query processing on compressed skeletons.

Example
- Common subtrees are shared.
- Edges are ordered.
- In (c), consecutive multiple edges are joined and marked with their multiplicity.

Definitions and Notation
- Vertex set V = {v, w, ...}: the set of vertices in the tree (nodes of the, possibly compressed, XML document tree).
- Function γ: V → V*: assigns to every vertex its sequence of children.
- This can be visualized as a directed graph; normally the graph is rooted and acyclic (a DAG).
- (v, w) is an edge iff w occurs in γ(v). Edges are ordered, so if w occurs at the i-th position of γ(v), we write v →_i w.

Definitions and Notation (cont'd)
- Some sets of vertices may be arbitrarily distinguished and given names. In this work, such sets are used to represent XML tags or derived properties (string matching, subquery results, etc.) for compression operations and queries.
- In the example: V = {v1, v2, v3, v4, v5}, with
  γ(v1) = v2 v4 v4
  γ(v2) = v3 v5 v5 v5
  γ(v3) = ε
  γ(v4) = v3 v5
  γ(v5) = ε
  and sets Sbib, Sbook, Stitle, Spaper, Sauthor.
- The ordered edges are v1 →_1 v2, v1 →_2 v4, v1 →_3 v4, v4 →_1 v3, v4 →_2 v5, and v2 →_1 v3, v2 →_2 v5, v2 →_3 v5, v2 →_4 v5.

Definitions and Notation (cont'd)
Instance I = (V, γ, root, S1, ..., Sn)
- Represents an XML document (in compressed or uncompressed form). Equivalent instances are representations of the same XML document.
- Modeled as a tuple comprising the DAG components (V: set of vertices; γ: the function defining the tree; root: root element) and the "schema", a finite set of unary relation names σ = {S1, ..., Sn}, where each Si is a subset of V. The schema names the instance type (a σ-instance).

Definitions and Notation (cont'd)
Edge path: for vertices v0 and vn and intermediate vertices v1, ..., vn-1 such that
  v0 →_{i1} v1 →_{i2} ... →_{in} vn,
the integer sequence i1 i2 ...
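The example instance can be written down directly, and the edge paths from the root computed by a depth-first walk. This is an illustrative sketch (the dictionary encoding of γ and the function name `edge_paths` are mine); edges are 1-indexed to match the notation above:

```python
# The example instance: gamma maps each vertex to its ordered child sequence.
gamma = {
    "v1": ["v2", "v4", "v4"],
    "v2": ["v3", "v5", "v5", "v5"],
    "v3": [],
    "v4": ["v3", "v5"],
    "v5": [],
}

def edge_paths(gamma, root):
    """For each vertex v, collect every edge path (tuple of 1-based
    child positions) leading from the root to v."""
    paths = {v: set() for v in gamma}
    def walk(v, path):
        paths[v].add(path)
        for i, w in enumerate(gamma[v], start=1):
            walk(w, path + (i,))
    walk(root, ())
    return paths

p = edge_paths(gamma, "v1")
# v4 is shared: it is reached by two distinct edge paths from the root.
print(sorted(p["v4"]))  # [(2,), (3,)]
```

Each edge path to a vertex corresponds to one copy of that vertex in the uncompressed tree; here `v5` has five edge paths, i.e. five copies.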
in is called an "edge path" from v0 to vn.
- For each vertex v ∈ V, paths(v) = {P | P is an edge path from the root to v}, and for a set S ⊆ V, paths(S) = ∪_{v∈S} paths(v).

Definitions and Notation (cont'd)
- Two σ-instances I and J are said to be "equivalent" if paths(V_I) = paths(V_J) and paths(S_I) = paths(S_J) for all S ∈ σ.
- The uncompressed instance is a tree, and in a tree there is exactly one edge path to each vertex!

Decompression
- For each instance I there is a unique tree instance T(I) equivalent to I. This tree instance is the uncompressed representation of the same XML document.
- How do we obtain T(I) from I? Each node in T(I) corresponds to an edge path in I.

Definitions and Notation (cont'd)
Bisimilarity relation ~: an equivalence relation on a σ-instance I with vertex set V such that for all v, w ∈ V with v ~ w:
- for all S ∈ σ, v ∈ S ⟺ w ∈ S;
- for all i, if v →_i v' then w →_i w' and v' ~ w' for some w' ∈ V.

Compression
- For an instance I and a bisimilarity relation ~ on I, construct I|~ as follows: replace the vertices in V_I by their equivalence classes with respect to ~ (all vertices v, w ∈ V with v ~ w are merged into a single vertex).
- For all instances I and bisimilarity relations ~ on I, I is equivalent to I|~.
- For every instance I there is a bisimilarity relation ~ on T(I) such that I is isomorphic to T(I)|~. That is, it is always possible to obtain an instance isomorphic to I from T(I) via a bisimilarity relation; with an appropriate ~, T(I)|~ = I is a compressed representation of T(I).

Compression (cont'd)
- The bisimilarity relations ~ on an instance I form a lattice. The topmost relation is the maximally compressing one and yields the equivalent minimal instance. The bottommost relation is equality (which always fails for distinct vertices, so it provides no compression) and yields the same instance (the identity element).
- An instance I is "minimal" if the only bisimilarity relation on I is equality on V. Then the bisimilarity relation ~ with I = T(I)|~ is the maximum element of the lattice of bisimilarity relations on T(I).
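Obtaining T(I) from I amounts to unfolding the DAG depth-first, copying a shared vertex once per edge path. A minimal sketch (the tuple tree encoding and helper names `decompress`, `size` are mine), reusing the example instance from the definitions:

```python
gamma = {
    "v1": ["v2", "v4", "v4"],
    "v2": ["v3", "v5", "v5", "v5"],
    "v3": [],
    "v4": ["v3", "v5"],
    "v5": [],
}

def decompress(gamma, v):
    """Unfold the DAG rooted at v into a (vertex, children) tree.
    A vertex shared by k edge paths is copied k times."""
    return (v, [decompress(gamma, w) for w in gamma[v]])

def size(tree):
    v, children = tree
    return 1 + sum(size(c) for c in children)

t = decompress(gamma, "v1")
print(size(t))  # 12 tree nodes recovered from a 5-vertex DAG
```

The 12 tree nodes are exactly the 12 edge paths from the root (1 for v1, 1 for v2, 2 for v4, 3 for v3, 5 for v5), matching the statement that each node of T(I) corresponds to an edge path in I.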
Compression (cont'd)
- For each instance I, there is exactly one minimal instance M(I) equivalent to I. If T is an XML document, M(T) is its compression.
- Given I, M(I) can be computed in linear time.

Definitions and Notation (cont'd)
- Schemas contain tag labels. Instead of keeping all tags, we can keep only those relevant to a subquery; we then need a way to join "compatible" instances.
- Given schemas σ' ⊆ σ, I' = I|σ' is the "σ'-reduct" of a σ-instance I: S_{I'} = S_I for all S ∈ σ', and I and I' have the same DAG.

Definitions and Notation (cont'd)
- Given schemas σ and τ, a σ-instance I and a τ-instance J, I and J are "compatible" if I|σ∩τ and J|σ∩τ are equivalent.
- A (σ∪τ)-instance K is a "common extension" of I and J if K|σ is equivalent to I and K|τ is equivalent to J. Common extensions are defined only for compatible instances.
- Given I and J, a common extension of I and J can be computed in quadratic time. This is used to merge the results of subqueries.

QUERY EVALUATION
How do we evaluate a Core XPath query on a compressed instance? First: what is a Core XPath query?

A location path is composed of location steps. A location step has three parts:
- an axis, which specifies the tree relationship between the nodes selected by the location step and the context node;
- a node test, which specifies the node type;
- zero or more predicates, which use arbitrary expressions to further refine the set of nodes selected by the location step.

  child::chapter[child::title='Introduction']
  (axis: child; node test: chapter; predicate: [child::title='Introduction'])

This selects the chapter children of the context node that have one or more title children with string-value equal to "Introduction".

Example Core XPath query:
  /descendant::a/child::b[child::c/child::d or not(following::*)]

The intuition in Core XPath, which reduces the query evaluation problem to manipulating sets rather than binary relations, is to reverse the paths in conditions, so that the computation of node sets is directed towards the root of the query tree.
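For acyclic instances, the minimal instance M(I) can be sketched by a single bottom-up pass that hashes each vertex on its set memberships and the classes of its children, in the spirit of the hash-table approach described later under Implementation. The encoding and names (`minimize`, the toy `gamma`/`sets`) here are my own illustration, not the paper's code:

```python
# Toy instance: two structurally identical branches under the root.
gamma = {"r": ["a", "b"], "a": ["x"], "b": ["y"], "x": [], "y": []}
sets = {"Sleaf": {"x", "y"}, "Smid": {"a", "b"}}  # unary relations S_i

def minimize(gamma, sets, root):
    """Assign each vertex a class id; bisimilar vertices get the same id."""
    classes, table = {}, {}
    def visit(v):
        if v in classes:
            return classes[v]
        # A vertex's class is determined by its memberships and the
        # ordered sequence of its children's classes.
        key = (frozenset(s for s, members in sets.items() if v in members),
               tuple(visit(w) for w in gamma[v]))
        classes[v] = table.setdefault(key, len(table))
        return classes[v]
    visit(root)
    return classes

cls = minimize(gamma, sets, "r")
# x and y are bisimilar (same memberships, no children), hence a and b too.
print(cls["x"] == cls["y"], cls["a"] == cls["b"])  # True True
```

Each vertex is visited once and hashed once, which is where the linear-time bound comes from on DAG-shaped instances.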
OPERATIONS ON COMPRESSED INSTANCES
(Recall Figures (a)-(c): the original tree skeleton can be recovered by the appropriate depth-first traversal of the compressed skeleton.)
In certain cases, decompression may be necessary to be able to represent the resulting selection. Our goal is to avoid full decompression when it is not necessary. The idea of the algorithm(s) is to traverse the DAG of the input instance starting from the root, visiting each node v only once. We choose a new selection for v on the basis of the selections of its ancestors, and split v if different predecessors of v require different selections. We remember which nodes we have copied, to avoid copying them repeatedly.

EXAMPLE: COMPLEXITY AND DECOMPRESSION
In theory, this method of compression by sharing sub-trees can lead to an exponential reduction in instance size, and even to a doubly exponential reduction using edge multiplicities. Unfortunately, compressed trees may decompress exponentially in the worst case, even on very simple queries: at each selection query, the compressed instance can at worst double in size. However, surprisingly, the decompression is only exponential in the size of the queries (not the data), and queries tend to be small.
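The traversal with node splitting can be sketched as follows. This is my own simplified formulation of the idea for a single `child::b` step, not the paper's algorithm(s): each vertex is given a new selection based on its predecessor, and a shared vertex is copied only when different predecessors require different selections (here a primed name stands for the copy).

```python
# Toy DAG: b is shared by a and c; only a is in the context set.
gamma = {"r": ["a", "c"], "a": ["b"], "c": ["b"], "b": []}
tag = {"r": "r", "a": "a", "c": "c", "b": "b"}
context = {"a"}   # context nodes of the step (assumed given by a prior step)

def select_children(gamma, tag, context, root, name):
    out_gamma, selected_set, memo = {}, set(), {}
    def visit(v, selected):
        if (v, selected) in memo:        # copy each (vertex, flag) only once
            return memo[(v, selected)]
        # Split: if v was already copied with the other flag, make a copy.
        new = v if (v, not selected) not in memo else v + "'"
        memo[(v, selected)] = new
        if selected:
            selected_set.add(new)
        out_gamma[new] = [visit(w, v in context and tag[w] == name)
                          for w in gamma[v]]
        return new
    new_root = visit(root, False)
    return out_gamma, selected_set, new_root

g2, sel, root2 = select_children(gamma, tag, context, "r", "b")
print(sorted(sel))   # selected copy of b
print(sorted(g2))    # b has been split into b and b'
```

In the output instance, `b` (child of the selected context `a`) and `b'` (child of `c`) are distinct vertices: this is exactly the partial decompression needed to represent the selection, and the reason a selection step can at worst double the instance size.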
Implementation
Key goals:
- Space efficiency (compression of XML): primary goal
- Time efficiency (fast queries on compressed XML): secondary goal
Implementation details:
- DAGs are implemented as relaxed trees with multiple incoming pointers.
- A fast SAX parser written in C++ is used; compression takes linear time (a single scan of the document).
- A stack keeps partially formed vertices and performs string matches; a hash table indexes finished vertices (the hash function is based on set memberships and children).
- Instead of common extensions, recompression is done, biasing space efficiency over time efficiency.
- There is no XML database; the document is simply reparsed for each experiment. No indices are implemented for string constraints.

Experimental Results
Reference system: Dell Inspiron 8100 laptop, 256 MB RAM, Pentium III 1 GHz, Linux.
Data sets:
- SwissProt protein database
- DBLP
- Penn Treebank (manually structured linguistic database from the Wall Street Journal)
- OMIM human gene and genetic disorder database
- XMark generated auction data
- Shakespeare's collected works
- 1998 Major League Baseball statistics

Experimental Results (cont'd)
Queries, for each dataset:
- Q1: tree pattern query returning the root if the pattern is matched. Only the parent axis is used; no decompression is required.
- Q2: Q1 reversed, selecting the matched nodes (not the root)
- Q3: descendant axis, with conditions as well as string constraints
- Q4: branching query trees
- Q5: Q4 extended with the remaining axes

Experimental Results (cont'd)
In the results table, "-" rows do not keep tag information and "+" rows do; the last column is the compression ratio, given as the ratio of the numbers of edges.

Conclusions
- Successful XML compression and efficient querying on compressed XML.
- The bisimulation basis for compression requires little overhead; execution time is competitive even with uncompressed processing.
- Sharing of common subtrees prevents recalculation and compresses large, regular XML data more efficiently.
- Separating the skeleton from the string data and compressing it means the global data fits in main memory and less data needs to be processed for query evaluation.
- For true scalability as data grows, a general cached-chunk scheme is needed.