Download author

Topic 8: Semi-structured Data <A> xxx <C> <D>yyy</D> <D>zzz</D> </C> </A>7 In various application domains, the data are semi-structured; the database schema is loose-defined. Semi-structured data need specialized management methods. The most characteristic example is XML data where the elements (tags) define the semantics of the information. Dr. N. Mamoulis Advanced Database Technologies 1 XML and HTML XML (like HTML) is a subset of SGML In HTML the tags serve the primary purpose of describing how to display a data item. On the other hand, XML tags describe the data itself. Tags are called elements in XML. This means that a program receiving an XML document can interpret it in multiple ways, can filter the document upon its content, can restructure it for a different application, etc. Dr. N. Mamoulis Advanced Database Technologies 2 Example of HTML document <html> <tag></tag> <head> attribute=“value” <title>Course Details</title> text </head> <body> CSIS7101 Detailed (tentative) schedule <hr width="100%"> 1 Introduction <ul> <li> Introduction by the instructor [<a href="http://www.csis7101.com/intro_slides.pdf"> slides in pdf</a>]* </li> </ul> </body> </html> Dr. N. Mamoulis Advanced Database Technologies 3 Example of XML document <book> <booktitle> The Selfish Gene </booktitle> <author id=“dawkings"> <name> <firstname> Richard </firstname> <lastname> Dawkins </lastname> </name> <address> <city> Timbuktu </city> <zip> 99999 </zip> <tag></tag> </address> attribute=“value” </author> text </book> Dr. N. Mamoulis Advanced Database Technologies 4 XML and Databases XML is becoming a standard for information exchange over the internet. Since actual data are stored in XML documents, it should be possible to query the data. Here comes the role of databases: How should we organize and query XML data? Dr. N. Mamoulis Advanced Database Technologies 5 XML and Databases (cont’d) Solution 1:  Use specialized storage methods, query languages and query evaluation techniques for semi-structured data. Solution 2:  Represent XML data in relational tables, transform queries to SQL, and use the mature relational DB technology. Dr. N. Mamoulis Advanced Database Technologies 6 More on XML – Document Type Descriptors DTDs (Document Type Descriptors) define (and control) the schema of XML documents for a specific application. Thus, now the structure of these documents is not free, but should conform to the DTD. The DTD can help define a relational schema for the class of documents that conform to it. Dr. N. Mamoulis Advanced Database Technologies 7 Example of a DTD 0 or more times <!ELEMENT book (booktitle, author)> <!ELEMENT article (title, author*, contactauthor)> <!ELEMENT contactauthor EMPTY> <!ATTLIST contactauthor authorID IDREF IMPLIED> <!ELEMENT monograph (title, author, editor)> <!ELEMENT editor (monograph*)> <!ATTLIST editor name CDATA #REQUIRED> reference to <!ELEMENT author (name, address)> existing ID <!ATTLIST author id ID #REQUIRED> <!ELEMENT name (firstname?, lastname)> <!ELEMENT firstname (#PCDATA)> 0 or 1 time <!ELEMENT lastname (#PCDATA)> <!ELEMENT address ANY> Dr. N. Mamoulis anything can be nested under address Advanced Database Technologies 8 More on XML – Queries Queries on XML data are described by the structural relationships of elements, attributes and values. Several XML Query Languages have been proposed. XML-QL, Lorel, UnQL, XQL, XPath, XQuery, ... Dr. N. Mamoulis Advanced Database Technologies 9 More on XML – Query Example Find the last name of the author of book “the selfish gene”  WHERE <book> <booktitle> The Selfish Gene </booktitle> <author> <lastname> $l </lastname> </> </> IN db.xml  CONSTRUCT <lastname> $l </lastname> Dr. N. Mamoulis Advanced Database Technologies 10 XML data represented as graphs An XML document can be represented as a node-labeled graph. The labels of the graph are element tags, attribute names and values. Most documents can be represented by trees. The edges that transform a tree to a graph come from ID references. Dr. N. Mamoulis Advanced Database Technologies 11 Example of a tree representation book booktitle author The Selfish Gene id dawkins name firstname Richard Dr. N. Mamoulis lastname address city zip Dawkins Timbuktu 99999 Advanced Database Technologies 12 Example of a graph representation <atricle> <title> The Importance of Evergreen </title> <author id=“smith”> <name> <firstname> John </firstname> <lastname> Smith </lastname> </name> <address> Smithsville </address> </author> <author id=“jones”> <name> <lastname> Jones </lastname> </name> <address> Jonesville </address> </author> <contactauthor idref=“smith”> </article> Dr. N. Mamoulis Advanced Database Technologies 13 Example of a graph representation (cont’d) article author title The Importance of Evergreen smith id name address firstname lastname Smithsville John Dr. N. Mamoulis author contactauthor id name address authorid jones Smith Advanced Database Technologies lastname Jonessville Jones 14 XML Query types Queries with absolute path expressions. These queries retrieve paths where the first element is the ROOT of the document. Example: find all books written by author with lastname=“Smith”  book/author/name/lastname/Smith Dr. N. Mamoulis Advanced Database Technologies 15 XML Query types Queries with simple path expressions. These queries retrieve paths where the first element can be any element of the document. Example: find all items written by author with lastname=“Smith”  //author/name/lastname/Smith anything can be before author tag Dr. N. Mamoulis Advanced Database Technologies 16 XML Query types Queries with regular path expressions. These queries retrieve paths where the not all elements on the path are specified. Example: find all documents with an “author” element with a descendant “Smith” in the graph  //author//Smith Dr. N. Mamoulis any path can be between author and Smith Advanced Database Technologies 17 XML Query types In general, other symbols may be used to denote the distance between the path elements. Example: find all documents with an “author” followed by one element, then one or none elements, and then by one or no element “Smith”. could be here  //author/_ /?/Smith exactly one element should be here Dr. N. Mamoulis Advanced Database Technologies 18 XML Query types Queries that match multiple regular path expressions. The paths are joined in a root element and the whole query is represented by a twig (small tree). Example: find the book with title “XML” written by an author with a descendant book “Smith” in the graph  book[/title/XML][//author//Smith] title author XML Dr. N. Mamoulis Advanced Database Technologies Smith 19 Indexing and XML Query Processing Several storage schemes and indexes have been proposed for the queries discussed above. Some of them index the paths or subgraphs of the XML structures. Some decompose the information and flatten it into relational DB tables. Dr. N. Mamoulis Advanced Database Technologies 20 Path indexes for XML data If many documents exist, they are connected into a large graph by adding a common root. Then a structural summary of the XML graph is created. All the paths in the data graph are preserved into the summary graph. If we keep pointers to the original graph into the summary graph, then this becomes an index. Dr. N. Mamoulis Advanced Database Technologies 21 Path index example A. a graph of documents 1 alldocuments (root) book 2 8 book title author author title author 3 4 name 5 Dr. N. Mamoulis 6 9 10 name name address 7 11 12 article 13 title author author 14 15 18 name address name 16 Advanced Database Technologies 17 19 22 Path index example B. the 1-index 1 alldocuments (root) book article 2,8 title author 3,9 4,6,10 name address 5,7,11 12 13 title author 14 15,18 name address 16,19 17 The 1-index maintains information about all paths in the original graph Dr. N. Mamoulis Advanced Database Technologies 23 Path indexes for XML data The 1-index maintains information about all paths in the original graph. This makes the index very large (with size comparable to the data size). Therefore it is quite expensive to evaluate queries using this index. To address this problem an A(k)-index is proposed which indexes exactly only paths up to length k. Dr. N. Mamoulis Advanced Database Technologies 24 Bisimilarity Two nodes u, v are called bisimilar if:   They have the same label. If u’ is the parent of u, then there is a parent v’ of v, such that u’,v’ are also bisimilar, and vice versa. Dr. N. Mamoulis Advanced Database Technologies 25 Bisimilarity Example Nodes 4 and 10 are bisimilar Nodes 10 and 15 are not bisimilar 1 alldocuments (root) book 2 8 book title author author title author 3 4 name 5 Dr. N. Mamoulis 6 9 10 name name address 7 11 12 article 13 title author author 14 15 18 name address name 16 Advanced Database Technologies 17 19 26 Bisimilarity defines the 1-index Bisimilar nodes are stored in the same node in the summary index. 1 alldocuments (root) book article 2,8 title author 3,9 4,6,10 name address 5,7,11 Dr. N. Mamoulis 12 13 title author 14 15,18 name address 16,19 Advanced Database Technologies 17 27 Bisimilarity and the A(k) index In the A(k) index, the notion of kbisimilarity is used:   Two nodes u,v are 0-bisimilar, if they have the same label. Two nodes u,v are k-bisimilar, if they have the same label and for every parent u’ of u, there is a parent v’ of v, such that u’ and v’ are (k-1)-bisimilar, and vice versa. Dr. N. Mamoulis Advanced Database Technologies 28 k-bisimilarity Example Nodes 5 and 16 are 1-bisimilar Nodes 6 and 15 are not 1-bisimilar 1 alldocuments (root) book 2 8 book title author author title author 3 4 name 5 Dr. N. Mamoulis 6 9 10 name name address 7 11 12 article 13 title author author 14 15 18 name address name 16 Advanced Database Technologies 17 19 29 Bisimilarity and the A(k) index The A(k)-index stores exactly all paths of length k, or else: all k-bisimilar nodes in the data graph are stored in the same node in the index graph. This means that all incoming paths up to length k are encoded in the index. Dr. N. Mamoulis Advanced Database Technologies 30 A(k)-index example A(3) and A(2)-index 1 alldocuments (root) book article 2,8 title author 3,9 4,6,10 name address 5,7,11 Dr. N. Mamoulis 12 13 title author 14 15,18 name address 16,19 Advanced Database Technologies 17 31 A(k)-index example A(1)-index A(0)-index 1 alldocuments (root) book article 2,8 title author 3,9 4,6,10 name 5,7,11,16,19 Dr. N. Mamoulis 13 title author 14 15,18 address 12,17 Advanced Database Technologies 1 alldocuments (root) book article title author 2,8 13 3,9,14 4,6,10,15,18 name 5,7,11,16,19 address 12,17 32 Using the A(k) index to search A(1)-index A Label Map is constructed together with the index, where each label points to its positions in the index. book book article title author name 1 alldocuments (root) article 2,8 title author 3,9 4,6,10 name 5,7,11,16,19 13 title author 14 15,18 address 12,17 address Dr. N. Mamoulis Advanced Database Technologies 33 Evaluation of path queries Assume that a path query q of length ≤k is applied. The A(k) index can answer the query as follows. First the last label of q is found and the label map is used to find its positions in the index. Then the index is traversed backwards to complete the answer. Dr. N. Mamoulis Advanced Database Technologies 34 Evaluation of path queries (example) A(1)-index Query book/title 1 alldocuments (root) OK book book article title author name article 2,8 title author 3,9 4,6,10 name 5,7,11,16,19 13 title author 14 15,18 address 12,17 address Dr. N. Mamoulis Advanced Database Technologies 35 Evaluation of path queries If the path of the query is longer than k, we may need to access the actual data. Thus, A(k)-index alone cannot be used to answer the query in this case. This is because if we traverse the index backwards we may find false positive paths that actually do not exist in the graph. Paths that share information are grouped to decrease the potential cost. Dr. N. Mamoulis Advanced Database Technologies 36 Evaluation of path queries (2nd example) A(1)-index Query book/author/name path has length 2>1 1 alldocuments (root) book book article title author name article 2,8 title author 3,9 4,6,10 name 5,7,11,16,19 13 title author 14 15,18 address 12,17 address Dr. N. Mamoulis Advanced Database Technologies 37 Problems of path indexes They are appropriate only for simple path queries up to a certain length. Therefore if a query has branches or regular path expressions the index cannot provide exact answers, but the actual data have to be accessed. Also these indexes have high storage and update cost. Dr. N. Mamoulis Advanced Database Technologies 38 Storing and indexing XML data in relational databases We can decompose the structural information into tables and use them to answer queries. This reduces the volume of data that need to be accessed for a single query and we can use off-the-shelf query processing and optimization techniques. On the other hand, we may need expensive joins during query processing Dr. N. Mamoulis Advanced Database Technologies 39 A decomposition model for XML data The storage model indexes the elements and text of the documents by their position in the graph. If the structures are trees, this representation can help to answer queries fast. On the other hand, for graphs the positions of the elements many times cannot help fast query evaluation because of recursion and other problems the incur. Dr. N. Mamoulis Advanced Database Technologies 40 Encoding elements, attributes and values based on their positions. The position of each element/attribute occurrence is represented as a 3-tuple (Document-id, StartPos:EndPos, LevelNum) Values (text) is encoded using (Document-id, StartPos, LevelNum):  Document-id is the id of the document that contains the element  StartPos is the number of words from the beginning of the document until the start of the element  EndPos is the number of words from the beginning of the document until the end of the element  LevelNum is the nesting depth of the element Dr. N. Mamoulis Advanced Database Technologies 41 Encoding example <book> <booktitle> The Selfish Gene </booktitle> <author id=“dawkings"> <name> <firstname> Richard </firstname> <lastname> Dawkins </lastname> </name> <address> <city> Timbuktu </city> <zip> 99999 </zip> </address> </author> </book> Dr. N. Mamoulis Advanced Database Technologies 42 Encoding Example book (1,2:6,2) (1,1:27,1) (1,7:26,2) booktitle author (1,3,3) The Selfish Gene id (1,8:9,3) (1,9,4) dawkins name (1,10:17,3) (1,11:13,4) (1,14:16,4) firstname lastname (1,18:25,3) address (1,19:21,4) (1,22:24,4) city zip (1,15,5) (1,20,5) (1,23,5) (1,12,5) Richard Dawkins Timbuktu 99999 Dr. N. Mamoulis Advanced Database Technologies 43 Using the encoding to determine a structural relationship We can use the encoding to find fast the relationship between two elements (or between an element and a value). Element e1 is an ancestor of element e2 in the same document iff:   e1.DocumentId = e2.DocumentId e1.StartPos> e2.StartPos && e1.EndPos< e2.EndPos (interval coverage) If the above hold and, in addition, e1.LevelNum+1 = e2.LevelNum, then e1 is the parent of e2. Dr. N. Mamoulis Advanced Database Technologies 44 Answering queries using the encoding Assume that all documents have been flattened to tables and the encoding is used to index the positions of each element and value in the documents. We store all information in a table: (ElementId, Document-id, StartPos:EndPos, LevelNum) The table is clustered by ElementId and sorted by (Document-id, StartPos). Dr. N. Mamoulis Advanced Database Technologies 45 Answering queries using the encoding (cont’d) Example ElementId Document-id StartPos EndPos LevelNum book 1 1 27 1 booktitle 1 2 6 2 author 1 7 26 2 ... ... ... ... ... Dr. N. Mamoulis Advanced Database Technologies 46 Answering queries using the encoding (cont’d) The query is broken into binary parent-child or ancestor descendant relationships. book Example:  book[/title/XML][//author//Smith] title author Broken to:     book/title title/XML book//author author//Smith Dr. N. Mamoulis Advanced Database Technologies XML Smith 47 Answering queries using the encoding (cont’d) Each binary query is executed as a join, and their results are “stitched” together to formulate the results of the whole query. Example: book/author/address book/author: (2,4),(2,6),(8,10)  author/address: (10,12),(15,17)  book/author/title: (8,10,12) book book article 2  8 13 title author author title author 3 4 name 5 Dr. N. Mamoulis 6 9 10 name name address 7 11 12 title author author 14 15 18 name address name 16 Advanced Database Technologies 17 19 48 How to process the binary joins Thus the “heart” of XML query processing is the algorithm that joins the elements table to retrieve the results for each individual query component. One method to process the binary join is to apply a merge-join algorithm, since the table is already sorted by Element,DocId,StartPos. Assume that the query is an A//D, where A is the ancestor element and D is the descendant element Dr. N. Mamoulis Advanced Database Technologies 49 How to process the binary joins. The tree-merge join algorithm Example: AList DList 1 23 34 2 1 30 32 4 1 25 33 2 1 68 72 5 1 45 56 2 2 17 18 4 2 15 28 3 2 20 22 4 2 45 57 3 2 43 60 2 Dr. N. Mamoulis Advanced Database Technologies 50 Worst case for the tree-merge join algorithm Example: d1 a1 d1 d2 a2 ... d2n d2n-1 a1 d2 a2 ... ... dn an dn+1 ... an d2n-1 dn Dr. N. Mamoulis dn+1 Advanced Database Technologies d2n 51 How to process the binary joins. The tree-merge join algorithm The tree merge join may perform many passes to the “inner” DList table, one for each element in AList that mathes the elements there. In order to avoid this a stack-tree join algorithm is proposed. OBSERVATION: We can get all the join results by a depth-first traversal of the XML tree. Dr. N. Mamoulis Advanced Database Technologies 52 The stack-tree join algorithm The lists are merged together as before, but a stack is maintained to keep nested AList elements which are in the same path as the current element from DList. When a qualifying element in DList is found, all elements of AList in the stack are output. Dr. N. Mamoulis Advanced Database Technologies 53 Stack-tree join example a1 a2 d1 a3 d4 d2 d3 a4 AList DList a1 d1 a2 d2 a3 d3 a4 d4 Stack Output d5 d6 a1 d5 d6 Dr. N. Mamoulis Advanced Database Technologies 54 Stack-tree join example a1 a2 d1 a3 d4 d2 d3 a4 AList DList a1 d1 a2 d2 a3 d3 a4 d4 d5 d6 Stack Output a1,d1 a2,d1 a2 a1 d5 d6 Dr. N. Mamoulis Advanced Database Technologies 55 Stack-tree join example a1 a2 d1 a3 d4 d2 d3 a4 AList DList a1 d1 a2 d2 a3 d3 a4 d4 d5 d6 Stack Output a1,d1 a2,d1 a3 a1 d5 d6 Dr. N. Mamoulis Advanced Database Technologies 56 Stack-tree join example a1 a2 d1 a3 d4 d2 d3 a4 AList DList a1 d1 a2 d2 a3 d3 a4 d4 d5 d6 Stack Output a1,d1 a2,d1 a1,d2 a3,d2 a3 a1 d5 d6 Dr. N. Mamoulis Advanced Database Technologies 57 Stack-tree join example a1 a2 d1 a3 d4 d2 d3 a4 AList DList a1 d1 a2 d2 a3 d3 a4 d4 d5 d6 Stack Output a1,d1 a2,d1 a1,d2 a3,d2 a3 a1 a1,d3 a3,d3 d5 d6 Dr. N. Mamoulis Advanced Database Technologies 58 Stack-tree join example a1 a2 d1 a3 d4 d2 d3 a4 AList DList a1 d1 a2 d2 a3 d3 a4 d4 Stack a1,d1 a2,d1 a1,d2 a3,d2 a1,d3 a3,d3 d5 d6 Output a1 a1,d4 d5 d6 Dr. N. Mamoulis Advanced Database Technologies 59 Stack-tree join example a1 a2 d1 a3 d4 d2 d3 a4 AList DList a1 d1 a2 d2 a3 d3 a4 d4 d5 d6 d5 d6 Dr. N. Mamoulis Advanced Database Technologies Stack Output a1,d1 a2,d1 a1,d2 a3,d2 a4 a1 a1,d3 a3,d3 a1,d4 a1,d5 a4,d5 a1,d6 a4,d6 60 Comments on the stack-tree join algorithm The algorithm has better worst-case complexity than the tree-merge join algorithm. Both of them have two versions; one that outputs results sorted on AList elements and one that outputs results sorted on DList elements. Dr. N. Mamoulis Advanced Database Technologies 61 Limitation of the binary join algorithms They are used only for binary joins. If a query is complex and contains many binary relationships, many intermediate results have to be merged. Example:  book[/title/XML][//author//Smith] Broken to:     book/title title/XML book//author author//Smith Dr. N. Mamoulis Advanced Database Technologies book title author XML Smith 62 An extension of the stack-tree join algorithm book title author The path-join and twig-join algorithms XML Smith extend the basic stack-join algorithm for complex queries. The idea is the same, but multiple stacks are used to avoid merging the intermediate results. Path-join is appropriate for path queries only (e.g., book/author/name) Twig-join is appropriate for branching (tree) expressions. Dr. N. Mamoulis Advanced Database Technologies 63 Example of Path-Join a1 Query a//b//c ..at point c3 StackA a3 a1 Dr. N. Mamoulis StackB b4 b3 Output c1,b2,a1 c2,b2,a1 c3,b4,a3 c3,b4,a1 c3,b3,a1 a2 b2 b1 c1 c2 c3,b3,a3 is not an answer because b3 points to a1 in the next stack! Advanced Database Technologies b3 a3 b4 c5 c3 c4 64 Path-join has optimal asymptotic cost for single-path queries, but ... ... if a query is a twig of multiple paths may produce many partial results which have then to be joined. The twig-join, joins these results at b production time. Example: a a     query a[//b/c][//d/e] two paths a/b/c two paths a/d/e only one twig a[/b/c][/d/e] Dr. N. Mamoulis a b c d d b d c e c e c e Advanced Database Technologies 65 Twig Join b The twig-join applies path-join at multiple paths at the same time. When at some node there are potential solutions for each path of the query, the algorithm waits for these results and waits for them to be computed. Then the results from each path are joined. Dr. N. Mamoulis a a a b c d d b d c e c e c e Advanced Database Technologies partial result: a/b/c partial result: a/d/e merge result: a[/b/c][/d/e] 66 Limitations of the twig-join and stack-based methods It is useful for simple twigs only, but it is not trivial to extend it for arbitrary trees book title author XML Smith Jones The encoding can be used for tree-structured XML data only. However, in many cases XML data are graphs. In this case the encoding (and also the stack-based algorithms) are not applicable. Dr. N. Mamoulis Advanced Database Technologies 67 Summary XML data are everywhere today, and efficient management and querying systems are needed. This is why today XML data management is one of the hottest research topics in DB. There are two streams for XML data management:   Store XML data into native systems and use special indexing and querying methods. Trasform XML data into relational tables and use/adapt relational query algorithms. Dr. N. Mamoulis Advanced Database Technologies 68 References R. Kaushik et al., “Exploiting Local Similarity for Indexing Paths in Graph-Structured Data”, ICDE 2002. S. Al-Khalifa et al., “Structural Joins: A Primitive for Efficient XML Query Pattern Matching”, ICDE 2002. N. Bruno et al., “Holistic Twig Joins: Optimal XML Pattern Matching”, ACM SIGMOD 2002. J. Shanmugasundaram et al. “Relational Databases for Querying XML Documents: Limitations and Opportunities”, VLDB 1999. Dr. N. Mamoulis Advanced Database Technologies 69

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download author