Download XML Storage - Technion – Israel Institute of Technology

Well, that is going to cost us XXX on YYY and earn us WWW on ZZZ. We must upgrade to XML. Everyone is talking about it.

XML Storage

XML Topics
• Previous topics:
  – Motivation for XML
  – XML Syntax
  – DTDs
  – XPath
• This Week: XML Storage
• Upcoming Weeks:
  – Querying XML
  – XML Search
  – Advanced Topics (e.g., Web Services)

XML Storage
• Suppose that we are given some XML documents
• How should they be stored?
• Why does it matter?
  – The type of storage determines which uses of the XML can be made efficiently
  – The type of usage determines which type of storage is needed
• We can’t really discuss using XML without knowing how it is stored, and whether such usage is possible

3 Basic Strategies
• Files
• Relational Database
• Native XML Database
• What advantages do you think that each approach has?
• What disadvantages do you think that each approach has?

XML Files

Idea
• Store XML “as is”, in a file system
  – When querying, parse the document and traverse it to find the query answer
• Obvious Advantage: Simple storage system
• Obvious Disadvantage:
  – Must parse the XML document every time it is queried
  – Does not take advantage of indexes to quickly get to “interesting” elements (in order to reach a given element, must traverse everything appearing beforehand in the document)

Sample Document
<transaction>
  <account>89-344</account>
  <buy shares="100">
    <ticker exch="NASDAQ">WEBM</ticker>
  </buy>
  <sell shares="30">
    <ticker exch="NYSE">GE</ticker>
  </sell>
</transaction>
What must we read to be able to get information about the ticker element?

How is an XML document Parsed?
• Two basic types of parsers:
  – DOM parser: Creates a tree out of the document
  – SAX parser: Does not create any data structures. Notifies program for every element seen
• Both types of parsers have been standardized and have implementations in virtually every programming language

DOM Parser
• DOM = Document Object Model
• Parser creates a tree object out of the document
• User accesses data by traversing the tree
• The API allows for constructing, accessing and manipulating the structure and content of XML documents

Document as Tree
transaction
  account
    89-344
  buy
    shares
      100
    ticker
      exch
        NASDAQ
      WEBM
  sell
    shares
      30
    ticker
      exch
        NYSE
      GE
Methods like: getRoot, getChildren, getAttributes, etc.

Advantages and Disadvantages
• How would you answer a query like:
  – /transaction/buy
  – //ticker
• Advantages:
  – Natural and relatively easy to use
  – Can repeatedly query tree without reparsing
• Disadvantages:
  – High memory requirements – the whole document is kept in memory
  – Must parse the whole document and construct many objects before use

SAX Parser
• SAX = Simple API for XML
• Parser creates “events” (i.e., notifications) while traversing the document
• Goes through the document one time only

Document as Events
<transaction>
  <account>89-344</account>
  <buy shares="100">
    <ticker exch="NASDAQ">WEBM</ticker>
  </buy>
  <sell shares="30">
    <ticker exch="NYSE">GE</ticker>
  </sell>
</transaction>
Events:
  Start tag: transaction
  Start tag: account
  Text: 89-344
  End tag: account
  Start tag: buy
  Attribute: shares, Value: 100
  …

Advantages and Disadvantages
• How would you answer a query like:
  – /transaction/buy
  – find accounts in which something is bought or sold from the NASDAQ
• Advantages:
  – Requires less memory
  – Fast
• Disadvantages:
  – Cannot read backwards

Storing XML in a Relational Database

Why?
• Relational databases have been developed for about 30 years
• There is extensive knowledge on how to use them efficiently
• Why not take advantage of this knowledge?
• Main Challenges:
  – get XML into database (inserting)
  – get XML out of database (querying)

Reminder
• A relational database simply contains some tables
• Each table can have any number of columns (also called attributes)
• Data items in each column are atomic, i.e., single values
• A schema is a description of a set of tables, i.e., each table’s name and column names

Difficulties
• DTDs can be complex
• Modeling Mismatch
  – Conceptually, relational databases, i.e., tables, have 2 levels: tables and attributes
  – XML documents have arbitrary nesting
• XML documents can have set-valued attributes and recursion

Difficulties
[Figure: an XML Translation Layer sits between the XML side (DTD, XML documents, XML query, XML result) and the Relational Database System (relational schema, tuples, SQL query, relational result), and uses stored translation information.]

Relational Databases: Option 1
The Schema-less Case

Option 1: Store Tree Structure
<person>
  <name> Bart Simpson </name>
  <tel> 02 – 444 7777 </tel>
  <tel> 051 – 011 022 </tel>
  <email> [email protected] </email>
</person>

person
  name
    Bart Simpson
  tel
    02 – 444 7777
  tel
    051 – 011 022
  email
    [email protected]

Option 1: Store Tree Structure (cont.)
1 person
  2 name
    6 Bart Simpson
  3 tel
    7 02 – 444 7777
  4 tel
    8 051 – 011 022
  5 email
    9 [email protected]

1. Assign each node a unique id
2. For each node, store type and value
3. For each node, store parent information

Option 1: Store Tree Structure (cont.)
Node  Type     Value         ParentID
1     element  person        null
6     text     Bart Simpson  2
…     …        …             …

How Good Is This?
• Simple schema, can work with any document
• Translation from XML to tables is easy
• What about the translation back?
  – is this transformation lossless?
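
The three steps of Option 1 (unique id, type and value, parent) can be sketched in a few lines. This is only an illustration under stated assumptions, not the lecture's implementation: the table name `node`, the column name `parent_id`, the function name `shred`, and the e-mail address are made up here; nodes are numbered breadth-first so that the ids match the figure; attributes and mixed content are ignored. The query at the end answers /person/tel and shows that each location step becomes one more self-join.

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import deque

# Sample document from the slide; the e-mail address is a placeholder.
DOC = """<person>
  <name>Bart Simpson</name>
  <tel>02 - 444 7777</tel>
  <tel>051 - 011 022</tel>
  <email>bart@example.org</email>
</person>"""


def shred(xml_text):
    """Return (node, type, value, parent_id) rows, numbering nodes breadth-first.

    Simplification: attributes and text that follows a child element are ignored.
    """
    root = ET.fromstring(xml_text)
    rows = [(1, "element", root.tag, None)]
    next_id = 2
    queue = deque([(root, 1)])
    while queue:
        elem, elem_id = queue.popleft()
        text = (elem.text or "").strip()
        if text:
            # A non-empty text child becomes its own node.
            rows.append((next_id, "text", text, elem_id))
            next_id += 1
        for child in elem:
            rows.append((next_id, "element", child.tag, elem_id))
            queue.append((child, next_id))
            next_id += 1
    return rows


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (node INTEGER PRIMARY KEY, type TEXT, value TEXT, parent_id INTEGER)"
)
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", shred(DOC))

# /person/tel: one self-join per location step, plus one join to fetch the text.
tel_values = [
    row[0]
    for row in conn.execute(
        """
        SELECT txt.value
        FROM node AS p
        JOIN node AS t   ON t.parent_id = p.node
        JOIN node AS txt ON txt.parent_id = t.node
        WHERE p.parent_id IS NULL AND p.type = 'element' AND p.value = 'person'
          AND t.type = 'element' AND t.value = 'tel'
          AND txt.type = 'text'
        ORDER BY txt.node
        """
    )
]
print(tel_values)
```

A longer path such as /a/b/c/d/e would need one more join for every additional step, which is why long path expressions get expensive under this schema.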
Answering XPath Queries
• Can you answer an XPath query that:
  – Just uses the Child axis, e.g., /a/b/c/d/e
  – Uses the Descendant axis at the beginning of the query, e.g., //a/b
  – Uses the Descendant axis in the middle of the query, e.g., /a/b//e
  – Uses the Following, Preceding, FollowingSibling axis?

Solving the Problem
• With the current modeling, it is not possible to evaluate many different types of steps of XPath queries
• To solve this problem, we:
  – number the nodes by DFS ordering
  – store, for each node, the id of its last descendant

Can you answer these queries, now?
1 person
  2 name
    3 Bart Simpson
  4 phones
    5 tel
      6 02 – 444 7777
    7 tel
      8 051 – 011 022
  9 email
    10 [email protected]

Node  Type     Value   ParentID  LastDesc
1     element  person  null      10
4     element  phones  1         8
…     …        …       …         …

Summary: Main Problems
• No convenient method for creating XML as output
• Each element in the path expression requires an additional join
  – Can become very expensive

Relational Databases: Option 2, Taking Advantage of DTDs
Based On: Relational Databases for Querying XML Documents: Limitations and Opportunities
By: Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton

Example XML
<book>
  <booktitle> The Selfish Gene </booktitle>
  <author id="dawkins">
    <name>
      <firstname> Richard </firstname>
      <lastname> Dawkins </lastname>
    </name>
    <address>
      <city> Timbuktu </city>
      <zip> 99999 </zip>
    </address>
  </author>
</book>
Wouldn’t it be nice to store this as a table with the columns:
• booktitle
• author_id
• firstname
• lastname
• city
• zip

Example XML (same document as above)
We can do this only if all XML documents that we will be considering follow this format. Otherwise, for example, what happens if there are 2 authors?

Considering the DTD
• If a DTD is given, then it defines what types of XML documents will be of interest
• Challenge: Given a DTD, find a relational schema such that ANY document conforming to the DTD can be stored in the relations
  – <!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)>

Reducing the Complexity
• DTDs can be very complex
• Before translating a DTD to a relational schema, simplify the DTD
• Property of the Simplification: If D2 is a simplification of D1, then every document that conforms to D1 also almost conforms to D2
  – almost means that it conforms, if the ordering of subelements is ignored

Simplification Rules
(e1, e2)* → e1*, e2*
(e1, e2)? → e1?, e2?
(e1|e2) → e1?, e2?
e1** → e1*
e1*? → e1*
e1?* → e1*
e1?? → e1?
e1+ → e1*
..., a*, ..., a*, ... → a*, ...
..., a*, ..., a?, ... → a*, ...
..., a?, ..., a*, ... → a*, ...
..., a?, ..., a?, ... → a*, ...
..., a, ..., a, ... → a*, ...

Simplification Rules: worked example
(b|c|e)?,(e?|f+)
→ (b?,c?,e?)?,e??,f+?      by (e1|e2) → e1?, e2?
→ b??,c??,e??,e??,f+?      by (e1, e2)? → e1?, e2?
→ b??,c??,e??,e??,f*?      by e1+ → e1*
→ b?,c?,e?,e?,f*           by e1?? → e1? and e1*? → e1*
→ b?,c?,e*,f*              by ..., a?, ..., a?, ... → a*, ...

You try it
• Can you simplify the expression, using the rules above?
  – (b|c|e)?,(e?|(f?,(b,b)*))*

DTD Graphs
• In order to describe a technique for converting a DTD to a schema, it is convenient to first describe DTDs (or rather simplified DTDs) as graphs
• Its nodes are elements, attributes and operators in the DTD
• Each element appears exactly once in the graph
• Attributes and operators appear as many times as they appear in the DTD
• Cycles indicate recursion

DTD Example / Corresponding DTD Graph
[Figure: DTD graph. Node labels: book, monograph, article, booktitle, contactauthor, editor, authorID, author, name (twice), address, firstname, title, lastname, authorid; operator nodes: ? and *.]

Creating the Schema: Shared Inline Technique
• When creating the schema for a DTD, we create a relation for:
  – each element with in-degree greater than 1
  – each element with in-degree 0
  – each element below a *
  – one element from each set of mutually recursive elements that all have in-degree 1
• All other elements are “inlined” into their parent’s relation (i.e., added into their parents’ relations)
  – Note that the parent may also be inlined

Relations for which elements?
[Figure: the DTD graph shown above.]

book (bookID: integer, book.booktitle: string)
article (articleID: integer, article.contactauthor.authorid: string)
monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.editor.name: string)
title (titleID: integer, title: string, title.parentID: integer, title.parentCODE: integer)
author (authorID: integer, author.parentID: integer, author.parentCODE: integer, author.authorid: string, author.address: string, author.name.firstname: string, author.name.lastname: string)
What are these for? (the parentID and parentCODE columns)

Advantages/Disadvantages
• Advantages:
  – Reduces number of joins for queries like “get the first and last names of an author”
  – Efficient for queries such as “list all authors with name Jack”
• Disadvantages:
  – Extra join needed for “Article with a given title name”

Notes
• Can/Should we use foreign keys to connect child tuples with their parents, e.g., titles with what they belong to?
• How can we answer queries, such as:
  – //title
  – //article/title
  – //article//name

Another Option: Hybrid Inlining Technique
• Same as Shared, except also inline elements with in-degree greater than one for the places in which they are not recursive or reached through a * node

What, in addition, will be inline?
[Figure: the DTD graph shown above.]

book (bookID: integer, book.booktitle: string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string)
article (articleID: integer, article.contactauthor.authorid: string, article.title: string)
monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.title: string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string, monograph.editor.name: string)
author (authorID: integer, author.parentID: integer, author.parentCODE: integer, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string)
Why do we still have an author relation?

Advantages/Disadvantages
• Advantages:
  – Reduces joins through shared elements (that are not set or recursive elements)
  – Reduces joins for queries like “get first and last names of a book author” (like Shared)
• Disadvantages:
  – Requires more SQL sub-queries to retrieve all authors with first name Jack (i.e., unions)
• Tradeoff between reducing number of unions and reducing number of joins
  – Shared and Hybrid target union- and join-reduction, respectively

XML in Major Databases
• All major databases now have some level of support for XML
• Example: Oracle
  – XML data type (can have a column which contains XML documents)
  – XPath processing of XML values
  – Some indexing capabilities
  – XML is a second class citizen in the database (support consists of a bunch of tools – no coherent framework)
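
Returning to the Shared versus Hybrid tradeoff, the difference between "fewer unions" and "fewer joins" can be seen on a toy database. The sketch below is an illustration under assumptions, not the schema generated by the paper: column names use underscores instead of dots, most columns are omitted, the rows are invented, and the parentCODE values (1 = book, 2 = article, 3 = monograph) are hypothetical codes for the parent relation.

```python
import sqlite3


def shared_db():
    """Shared inlining: every author lives in the single author relation."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE book (bookID INTEGER, booktitle TEXT);
        CREATE TABLE author (authorID INTEGER, parentID INTEGER, parentCODE INTEGER,
                             firstname TEXT, lastname TEXT);
        INSERT INTO book VALUES (1, 'Book A');
        INSERT INTO author VALUES (10, 1, 1, 'Jack', 'Smith');  -- author of book 1
        INSERT INTO author VALUES (11, 7, 2, 'Jack', 'Jones');  -- author of article 7
        INSERT INTO author VALUES (12, 3, 3, 'Mary', 'Brown');  -- author of monograph 3
        """
    )
    return db


def hybrid_db():
    """Hybrid inlining: book and monograph authors are inlined into their parents;
    only authors reached through a * (article authors) stay in author."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE book (bookID INTEGER, booktitle TEXT,
                           author_firstname TEXT, author_lastname TEXT);
        CREATE TABLE monograph (monographID INTEGER, title TEXT,
                                author_firstname TEXT, author_lastname TEXT);
        CREATE TABLE author (authorID INTEGER, parentID INTEGER, parentCODE INTEGER,
                             firstname TEXT, lastname TEXT);
        INSERT INTO book VALUES (1, 'Book A', 'Jack', 'Smith');
        INSERT INTO monograph VALUES (3, 'Monograph C', 'Mary', 'Brown');
        INSERT INTO author VALUES (11, 7, 2, 'Jack', 'Jones');
        """
    )
    return db


shared, hybrid = shared_db(), hybrid_db()

# "All authors with first name Jack": one scan under Shared, a union under Hybrid.
shared_jacks = sorted(
    r[0] for r in shared.execute("SELECT lastname FROM author WHERE firstname = 'Jack'")
)
hybrid_jacks = sorted(
    r[0]
    for r in hybrid.execute(
        """
        SELECT author_lastname FROM book WHERE author_firstname = 'Jack'
        UNION ALL
        SELECT author_lastname FROM monograph WHERE author_firstname = 'Jack'
        UNION ALL
        SELECT lastname FROM author WHERE firstname = 'Jack'
        """
    )
)

# "First and last names of a book author": a join under Shared, none under Hybrid.
shared_book_authors = shared.execute(
    """
    SELECT a.firstname, a.lastname
    FROM book AS b
    JOIN author AS a ON a.parentID = b.bookID AND a.parentCODE = 1
    """
).fetchall()
hybrid_book_authors = hybrid.execute(
    "SELECT author_firstname, author_lastname FROM book"
).fetchall()

print(shared_jacks, hybrid_jacks)
print(shared_book_authors, hybrid_book_authors)
```

Both layouts return the same answers; what changes is whether the work shows up as a join (Shared, for the book-author query) or as a union of sub-queries (Hybrid, for the all-authors query).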
