XML Storage - Technion – Israel Institute of Technology

Well, that is going to cost us XXX on YYY and earn us WWW on ZZZ.
We must upgrade to XML. Everyone is talking about it.
XML Storage
XML Topics
• Previous topics:
– Motivation for XML
– XML Syntax
– DTDs
– XPath
• This Week: XML Storage
• Upcoming Weeks:
– Querying XML
– XML Search
– Advanced Topics (e.g., Web Services)
XML Storage
• Suppose that we are given some XML documents
• How should they be stored?
• Why does it matter?
– Type of storage implies which type of use can be
efficiently made of the XML
– Type of usage determines which type of storage is
needed
• We can’t really discuss using XML without knowing
how it is stored and whether such usage is
possible
3 Basic Strategies
• Files
• Relational Database
• Native XML Database
• What advantages do you think that each approach
has?
• What disadvantages do you think that each
approach has?
XML Files
Idea
• Store XML “as is”, in a file system
– When querying, parse the document and traverse it to
find the query answer
• Obvious Advantage: Simple storage system
• Obvious Disadvantage:
– Must parse the XML document every time it is queried
– Does not take advantage of indexes to quickly get to
“interesting” elements (in order to reach a given element,
must traverse everything appearing beforehand in the
document)
Sample Document
<transaction>
  <account>89-344</account>
  <buy shares="100">
    <ticker exch="NASDAQ">WEBM</ticker>
  </buy>
  <sell shares="30">
    <ticker exch="NYSE">GE</ticker>
  </sell>
</transaction>
• What must we read to be able to get information about the ticker element?
How is an XML document Parsed?
• Two basic types of parsers:
– DOM parser: Creates a tree out of the document
– SAX parser: Does not create any data
structures. Notifies program for every element
seen
• Both types of parsers have been
standardized and have implementations in
virtually every programming language
DOM Parser
• DOM = Document Object Model
• Parser creates a tree object out of the
document
• User accesses data by traversing the tree
• The API allows for constructing, accessing
and manipulating the structure and content
of XML documents
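Document as Tree
[Figure: the sample document as a tree. The root transaction has children account (text 89-344), buy (attribute shares = 100) and sell (attribute shares = 30); buy and sell each have a ticker child with an exch attribute (NASDAQ, NYSE) and text (WEBM, GE). The tree is navigated with methods like getRoot, getChildren, getAttributes, etc.]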
Advantages and Disadvantages
• How would you answer a query like:
– /transaction/buy
– //ticker
• Advantages:
– Natural and relatively easy to use
– Can repeatedly query tree without reparsing
• Disadvantages:
– High memory requirements – the whole document is
kept in memory
– Must parse the whole document and construct many
objects before use
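A minimal sketch of how the two queries above could be answered with a DOM parser, here Python's standard xml.dom.minidom. The helper child_elements and the variable names are illustrative assumptions, not part of the DOM standard.

```python
from xml.dom.minidom import parseString

XML = """<transaction>
  <account>89-344</account>
  <buy shares="100">
    <ticker exch="NASDAQ">WEBM</ticker>
  </buy>
  <sell shares="30">
    <ticker exch="NYSE">GE</ticker>
  </sell>
</transaction>"""

# Parsing builds the whole tree in memory before any query runs.
doc = parseString(XML)
root = doc.documentElement


def child_elements(node, name):
    """Direct element children of `node` with the given tag name (child axis)."""
    return [c for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE and c.tagName == name]


# /transaction/buy: check the root's name, then follow the child axis once.
buys = child_elements(root, "buy") if root.tagName == "transaction" else []
buy_shares = [b.getAttribute("shares") for b in buys]

# //ticker: getElementsByTagName searches all descendants of the document.
tickers = [(t.getAttribute("exch"), t.firstChild.data)
           for t in doc.getElementsByTagName("ticker")]

print(buy_shares)
print(tickers)
```

Both queries run against the same in-memory tree, so the document is parsed only once; the price is that the full tree must fit in memory.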
SAX Parser
• SAX = Simple API for XML
• Parser creates “events” (i.e., notifications)
while scanning the document
• Goes through the document one time only
Document as Events
<transaction>
  <account>89-344</account>
  <buy shares="100">
    <ticker exch="NASDAQ">WEBM</ticker>
  </buy>
  <sell shares="30">
    <ticker exch="NYSE">GE</ticker>
  </sell>
</transaction>
Events, in order:
– Start tag: transaction
– Start tag: account
– Text: 89-344
– End tag: account
– Start tag: buy
– Attribute: shares, Value: 100
– …
Advantages and Disadvantages
• How would you answer a query like:
– /transaction/buy
– find accounts in which something is bought or sold from
the NASDAQ
• Advantages:
– Requires less memory
– Fast
• Disadvantages:
– Cannot read backwards
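A minimal sketch of the second query (accounts in which something is bought or sold on the NASDAQ) with Python's standard xml.sax module. The class name NasdaqAccounts and its fields are illustrative assumptions. Because a SAX parser cannot read backwards, the handler has to remember the account text itself until the end of the transaction.

```python
import xml.sax

XML = b"""<transaction>
  <account>89-344</account>
  <buy shares="100">
    <ticker exch="NASDAQ">WEBM</ticker>
  </buy>
  <sell shares="30">
    <ticker exch="NYSE">GE</ticker>
  </sell>
</transaction>"""


class NasdaqAccounts(xml.sax.ContentHandler):
    """Collects the accounts of transactions that trade on a given exchange."""

    def __init__(self, exchange):
        super().__init__()
        self.exchange = exchange
        self.accounts = []
        self._account = ""        # account text of the current transaction
        self._in_account = False
        self._in_trade = False    # True while inside <buy> or <sell>
        self._matched = False

    def startElement(self, name, attrs):
        if name == "transaction":
            self._account, self._matched = "", False
        elif name == "account":
            self._in_account = True
        elif name in ("buy", "sell"):
            self._in_trade = True
        elif name == "ticker" and self._in_trade:
            if attrs.get("exch") == self.exchange:
                self._matched = True

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_account:
            self._account += content

    def endElement(self, name):
        if name == "account":
            self._in_account = False
        elif name in ("buy", "sell"):
            self._in_trade = False
        elif name == "transaction" and self._matched:
            self.accounts.append(self._account.strip())


handler = NasdaqAccounts("NASDAQ")
xml.sax.parseString(XML, handler)
print(handler.accounts)
```

The answer is emitted only at the end tag of transaction, so the sketch also works when the account element appears after the buy and sell elements.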
Storing XML in a Relational Database
Why?
• Relational databases have been developed for
about 30 years
• There is extensive knowledge on how to use them
efficiently
• Why not take advantage of this knowledge?
• Main Challenges:
– get XML into database (inserting)
– get XML out of database (querying)
Reminder
• Relational Database simply contains some tables
• Each table can have any number of columns (also
called attributes)
• Data items in each column are atomic, i.e., single
values
• A schema is a description of a set of tables, i.e.,
the table name and each table’s column names
Difficulties
• DTDs can be complex
• Modeling Mismatch
– Conceptually, relational databases, i.e., tables,
have 2 levels: tables and attributes
– XML documents have arbitrary nesting
• XML documents can have set-valued
attributes and recursion
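Difficulties (cont.)
[Figure: architecture. An XML translation layer sits on top of a relational database system. It receives a DTD, XML documents and XML queries, and returns XML results. Toward the database it produces a relational schema, tuples and SQL queries, keeps translation information, and receives relational results.]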
Relational Databases: Option 1
The Schema-less Case
Option 1: Store Tree Structure
<person>
<name> Bart Simpson </name>
<tel> 02 – 444 7777 </tel>
<tel> 051 – 011 022 </tel>
<email> [email protected] </email>
</person>
[Figure: the document as a tree. The root person has the children name, tel, tel and email, each with a text child holding its value.]
Option 1: Store Tree Structure (cont.)
[Figure: the same tree with numbered nodes. Elements: 1 person, 2 name, 3 tel, 4 tel, 5 email. Text nodes: 6 Bart Simpson, 7 02 – 444 7777, 8 051 – 011 022, 9 [email protected].]
1. Assign each node a unique id
2. For each node, store type and value
3. For each node, store parent information
Option 1: Store Tree Structure (cont.)
[Figure: the numbered tree from the previous slide.]
Node  Type     Value         ParentID
1     element  person        null
6     text     Bart Simpson  2
…     …        …             …
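The three steps (unique id, type and value, parent) can be sketched as follows. This is an illustration under assumptions: the function name shred is made up, nodes are numbered level by level to match the figure, attributes and mixed content are ignored, and the e-mail address is a placeholder because the original one is not legible in this transcript.

```python
import xml.etree.ElementTree as ET
from collections import deque

XML = """<person>
  <name>Bart Simpson</name>
  <tel>02 - 444 7777</tel>
  <tel>051 - 011 022</tel>
  <email>bart@example.org</email>
</person>"""


def shred(xml_text):
    """Return rows (node_id, type, value, parent_id), numbered level by level."""
    root = ET.fromstring(xml_text)
    rows, next_id = [], 1
    queue = deque([(root, None)])   # items: (element or text string, parent id)
    while queue:
        node, parent = queue.popleft()
        node_id, next_id = next_id, next_id + 1
        if isinstance(node, str):
            rows.append((node_id, "text", node, parent))
            continue
        rows.append((node_id, "element", node.tag, parent))
        # Text content becomes a child node of its own.
        if node.text and node.text.strip():
            queue.append((node.text.strip(), node_id))
        for child in node:
            queue.append((child, node_id))
    return rows


rows = shred(XML)
for row in rows:
    print(row)
```

The same four-column table works for any document, which is why this option needs no schema.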
How Good Is This?
• Simple schema, can work with any
document
• Translation from XML to tables is easy
• What about the translation back?
– is this transformation lossless?
Answering XPath Queries
• Can you answer an XPath query that:
– Just uses the Child axis, e.g., /a/b/c/d/e
– Uses the Descendant axis at the beginning of
the query, e.g., //a/b
– Uses the Descendant axis in the middle of the
query, e.g., /a/b//e
– Uses the Following, Preceding, or FollowingSibling axes?
Solving the Problem
• With the current modeling, it is not possible
to evaluate many different types of steps of
XPath queries
• To solve this problem, we:
– number the nodes by DFS ordering
– store, for each node, the id of its last descendant
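• Can you answer these queries now?
[Figure: the document tree, with the two tel elements grouped under a phones element and the nodes numbered in DFS order: 1 person; 2 name; 3 Bart Simpson; 4 phones; 5 tel; 6 02 – 444 7777; 7 tel; 8 051 – 011 022; 9 email; 10 [email protected].]
Node  Type     Value   ParentID  LastDesc
1     element  person  null      10
4     element  phones  1         8
…     …        …       …         …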
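A minimal sketch of DFS numbering with a LastDesc column, and of how it answers descendant steps with a range test instead of repeated parent lookups. The function names are illustrative assumptions, attributes are ignored, and the e-mail address is a placeholder because the original one is not legible in this transcript.

```python
import xml.etree.ElementTree as ET

XML = """<person>
  <name>Bart Simpson</name>
  <phones>
    <tel>02 - 444 7777</tel>
    <tel>051 - 011 022</tel>
  </phones>
  <email>bart@example.org</email>
</person>"""


def number_dfs(xml_text):
    """Return {id: (type, value, parent_id, last_desc)} with preorder (DFS) ids."""
    table = {}
    counter = 0

    def visit(elem, parent):
        nonlocal counter
        counter += 1
        my_id = counter
        text = (elem.text or "").strip()
        if text:
            counter += 1
            table[counter] = ("text", text, my_id, counter)
        for child in elem:
            visit(child, my_id)
        # Once the subtree is numbered, counter is the id of the last descendant.
        table[my_id] = ("element", elem.tag, parent, counter)

    visit(ET.fromstring(xml_text), None)
    return table


def is_descendant(table, d, a):
    """d is a descendant of a exactly when a < d <= LastDesc(a)."""
    return a < d <= table[a][3]


def descendants_named(table, a, name):
    """Evaluate a//name for context node a with a single range test per node."""
    return [n for n in sorted(table)
            if is_descendant(table, n, a) and table[n][:2] == ("element", name)]


table = number_dfs(XML)
print(table[1], table[4])
print(descendants_named(table, 1, "tel"))
```

The same two columns cover the other axes: a node n follows a when n > LastDesc(a), and n precedes a when LastDesc(n) < a.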
Summary: Main Problems
• No convenient method for creating XML as
output
• Each element in the path expression
requires an additional join
– Can become very expensive
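To make the join cost concrete, here is a sketch that translates a child-axis path into SQL over the node table of Option 1, using Python's standard sqlite3 module. The table name node, the helper child_path_sql and the placeholder e-mail value are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, type TEXT, value TEXT, parent INTEGER)")
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", [
    (1, "element", "person", None),
    (2, "element", "name", 1),
    (3, "element", "tel", 1),
    (4, "element", "tel", 1),
    (5, "element", "email", 1),
    (6, "text", "Bart Simpson", 2),
    (7, "text", "02 - 444 7777", 3),
    (8, "text", "051 - 011 022", 4),
    (9, "text", "bart@example.org", 5),   # placeholder value
])


def child_path_sql(steps):
    """Build SQL for an absolute child-axis path such as ['person', 'tel'].

    Every step adds one more alias of the node table, i.e. one more join.
    """
    froms = ["node AS n0"]
    wheres = ["n0.parent IS NULL", "n0.type = 'element'", "n0.value = ?"]
    for i in range(1, len(steps)):
        froms.append(f"node AS n{i}")
        wheres += [f"n{i}.parent = n{i - 1}.id",
                   f"n{i}.type = 'element'",
                   f"n{i}.value = ?"]
    last = len(steps) - 1
    return (f"SELECT n{last}.id FROM " + ", ".join(froms)
            + " WHERE " + " AND ".join(wheres)
            + f" ORDER BY n{last}.id")


steps = ["person", "tel"]
sql = child_path_sql(steps)
tel_ids = [row[0] for row in conn.execute(sql, steps)]
print(sql)
print(tel_ids)
```

A path with five steps, such as /a/b/c/d/e, would join five copies of the node table in the same way.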
Relational Databases: Option 2, Taking Advantage of DTDs
Based on: “Relational Databases for Querying XML Documents: Limitations and Opportunities”
By: Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton
Example XML
<book>
  <booktitle> The Selfish Gene </booktitle>
  <author id="dawkins">
    <name>
      <firstname> Richard </firstname>
      <lastname> Dawkins </lastname>
    </name>
    <address>
      <city> Timbuktu </city>
      <zip> 99999 </zip>
    </address>
  </author>
</book>
Wouldn’t it be nice to store this as a table with the columns:
• booktitle
• author_id
• firstname
• lastname
• city
• zip
Example XML (cont.)
• We can do this only if all XML documents that we will be considering follow this format
• Otherwise, for example, what happens if there are 2 authors?
Considering the DTD
• If a DTD is given, then it defines what types
of XML documents will be of interest
• Challenge: Given a DTD, find a relational
schema such that ANY document
conforming to the DTD can be stored in the
relations
– <!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)>
Reducing the Complexity
• DTDs can be very complex
• Before translating a DTD to a relational schema,
simplify the DTD
• Property of the Simplification: If D2 is a
simplification of D1, then every document that
conforms to D1 also almost conforms to D2
– almost means that it conforms, if the ordering of subelements is ignored
Simplification Rules
(e1, e2)* → e1*, e2*
(e1, e2)? → e1?, e2?
(e1|e2) → e1?, e2?
e1** → e1*
e1*? → e1*
e1?* → e1*
e1?? → e1?
e1+ → e1*
..., a*, ..., a*, ... → a*, ...
..., a*, ..., a?, ... → a*, ...
..., a?, ..., a*, ... → a*, ...
..., a?, ..., a?, ... → a*, ...
..., a, ..., a, ... → a*, ...
Example, applying the rules step by step:
(b|c|e)?,(e?|f+)
→ (b?,c?,e?)?,e??,f+?
→ b??,c??,e??,e??,f+?
→ b??,c??,e??,e??,f*?
→ b?,c?,e?,e?,f*
→ b?,c?,e*,f*
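The rules can be read as one recursive pass over the content model. The sketch below is one possible reading, not the authors' algorithm: the nested-tuple representation and the function name simplify are assumptions, and the result is a mapping from element name to quantifier, which drops the ordering of subelements exactly as the simplification allows.

```python
def simplify(expr):
    """Flatten a DTD content model into {element name: quantifier}.

    expr is an element name (str) or a tuple:
      ("seq", e1, e2, ...)  for e1, e2, ...
      ("alt", e1, e2, ...)  for e1 | e2 | ...
      ("opt", e)            for e?
      ("star", e)           for e*
      ("plus", e)           for e+
    Quantifiers in the result are "" (exactly once), "?" or "*".
    """
    if isinstance(expr, str):
        return {expr: ""}
    op, *args = expr
    if op in ("star", "plus"):
        # (e1, e2)* -> e1*, e2*    e1** -> e1*    e1?* -> e1*    e1+ -> e1*
        return {name: "*" for name in simplify(args[0])}
    if op == "opt":
        # (e1, e2)? -> e1?, e2?    e1?? -> e1?    e1*? -> e1*
        return {name: q or "?" for name, q in simplify(args[0]).items()}
    result = {}
    for arg in args:
        for name, q in simplify(arg).items():
            if op == "alt":
                q = q or "?"      # (e1|e2) -> e1?, e2?
            # A name that occurs more than once collapses to name*.
            result[name] = "*" if name in result else q
    return result


# (b|c|e)?,(e?|f+)
example = ("seq",
           ("opt", ("alt", "b", "c", "e")),
           ("alt", ("opt", "e"), ("plus", "f")))
simplified = simplify(example)
print(", ".join(name + q for name, q in simplified.items()))  # b?, c?, e*, f*
```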
You try it
• Can you simplify the expression
– (b|c|e)?,(e?|(f?,(b,b)*))*
DTD Graphs
• In order to describe a technique for converting a
DTD to a schema it is convenient to first describe
DTDs (or rather simplified DTDs) as graphs
• The nodes of the graph are the elements, attributes and operators in
the DTD
• Each element appears exactly once in the graph
• Attributes and operators appear as many times as
they appear in the DTD
• Cycles indicate recursion
DTD Example and Corresponding DTD Graph
[Figure: DTD graph. Element nodes: book, article, monograph, booktitle, title, contactauthor, editor, author, name, address, firstname, lastname. Attribute nodes: authorID, name, authorid. Operator nodes: ? and *.]
Creating the Schema:
Shared Inline Technique
• When creating the schema for a DTD, we create a
relation for:
– each element with in-degree greater than 1
– each element with in-degree 0
– each element below a *
– one element from each set of mutually recursive
elements, having in-degree 1
• All other elements are “inlined” into their parent’s
relation (i.e., their columns are added to the parent’s relation)
– Note that parent may also be inlined
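A sketch of how the four rules could select the elements that get their own relation. The edge list is an assumption: the DTD itself is not reproduced in this transcript, so the edges were chosen to be consistent with the node labels of the graph and with the relations listed on the slides. The function name shared_relations is also made up. Attributes are left out because they are always inlined.

```python
from collections import defaultdict

# (parent, child, below_star) edges between elements. ASSUMED, see lead-in.
EDGES = [
    ("book", "booktitle", False), ("book", "author", False),
    ("article", "title", False), ("article", "author", True),
    ("article", "contactauthor", False),
    ("monograph", "title", False), ("monograph", "author", False),
    ("monograph", "editor", False),
    ("editor", "monograph", True),
    ("author", "name", False), ("author", "address", False),
    ("name", "firstname", False), ("name", "lastname", False),
]


def shared_relations(edges):
    """Return the set of elements that get their own relation."""
    children = defaultdict(set)
    indegree = defaultdict(int)
    starred, nodes = set(), set()
    for parent, child, below_star in edges:
        nodes |= {parent, child}
        children[parent].add(child)
        indegree[child] += 1
        if below_star:
            starred.add(child)

    def reachable(start):
        """All elements reachable from start by one or more edges."""
        seen, stack = set(), [start]
        while stack:
            for nxt in children[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reach = {n: reachable(n) for n in nodes}
    # Rules 1-3: in-degree 0, in-degree greater than 1, or below a *.
    relations = {n for n in nodes
                 if indegree[n] == 0 or indegree[n] > 1 or n in starred}
    # Rule 4: a set of mutually recursive elements without a relation gets one.
    for n in sorted(nodes):
        if n in reach[n]:                     # n lies on a cycle
            component = {m for m in reach[n] if n in reach[m]}
            if not component & relations:
                relations.add(min(component))  # deterministic, arbitrary pick
    return relations


relations = shared_relations(EDGES)
print(sorted(relations))
```

With the assumed edges, monograph and editor are mutually recursive, but monograph already has a relation because it sits below a *, so editor is inlined into monograph.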
Relations for which elements?
[Figure: the DTD graph from the previous slide.]
book (bookID: integer, book.booktitle: string)
article (articleID: integer, article.contactauthor.authorid: string)
monograph (monographID: integer, monograph.parentID: integer,
    monograph.parentCODE: integer, monograph.editor.name: string)
title (titleID: integer, title: string,
    title.parentID: integer, title.parentCODE: integer)
author (author.parentID: integer, author.parentCODE: integer,
    authorID: integer, author.authorid: string,
    author.address: string, author.name.firstname: string,
    author.name.lastname: string)
• What are the parentID and parentCODE columns for?
Advantages/Disadvantages
• Advantages:
– Reduces number of joins for queries like “get the
first and last names of an author”
– Efficient for queries such as “list all authors with
name Jack”
• Disadvantages:
– Extra join needed for “Article with a given title
name”
Notes
• Can/Should we use foreign keys to connect
child tuples with their parents, e.g., titles with
what they belong to?
• How can we answer queries, such as:
– //title
– //article/title
– //article//name
Another Option: Hybrid Inlining Technique
• Same as Shared, except also inline
elements with in-degree greater than one for
the places in which they are not recursive or
reached through a * node
What, in addition, will be inlined?
[Figure: the same DTD graph as for the Shared Inline Technique.]
book (bookID: integer, book.booktitle: string,
    author.name.firstname: string, author.name.lastname: string,
    author.address: string, author.authorid: string)
article (articleID: integer, article.contactauthor.authorid: string,
    article.title: string)
monograph (monographID: integer, monograph.parentID: integer,
    monograph.parentCODE: integer, monograph.title: string,
    author.name.firstname: string, author.name.lastname: string,
    author.address: string, author.authorid: string,
    monograph.editor.name: string)
author (authorID: integer, author.parentID: integer,
    author.parentCODE: integer, author.name.firstname: string,
    author.name.lastname: string, author.address: string,
    author.authorid: string)
• Why do we still have an author relation?
Advantages/Disadvantages
• Advantages:
– Reduces joins through shared elements (that are not set
or recursive elements)
– Reduces joins for queries like “get first and last names of
a book author” (like Shared)
• Disadvantages:
– Requires more SQL sub-queries to retrieve all authors
with first name Jack (i.e., unions)
• Tradeoff between reducing number of unions and
reducing number of joins
– Shared and Hybrid target union- and join-reduction,
respectively
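The union side of the tradeoff can be seen on a toy instance. The sketch below uses Python's standard sqlite3 module; the table names (s_ for Shared, h_ for Hybrid), the shortened column names and all rows are made-up assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared inlining: every author is stored in one relation.
CREATE TABLE s_author (authorID INTEGER, firstname TEXT, lastname TEXT);
INSERT INTO s_author VALUES (1, 'Jack', 'Alpha'), (2, 'Jill', 'Beta'),
                            (3, 'Jack', 'Gamma');

-- Hybrid inlining: the same authors are spread over three relations.
CREATE TABLE h_book (bookID INTEGER, author_firstname TEXT, author_lastname TEXT);
CREATE TABLE h_monograph (monographID INTEGER, author_firstname TEXT,
                          author_lastname TEXT);
CREATE TABLE h_author (authorID INTEGER, firstname TEXT, lastname TEXT);
INSERT INTO h_book VALUES (1, 'Jack', 'Alpha');
INSERT INTO h_monograph VALUES (1, 'Jill', 'Beta');
INSERT INTO h_author VALUES (1, 'Jack', 'Gamma');
""")

# Shared: "all authors with first name Jack" is one scan of one table.
SHARED_SQL = """
SELECT lastname FROM s_author WHERE firstname = 'Jack' ORDER BY lastname
"""

# Hybrid: the same query needs one sub-query per place an author can be stored.
HYBRID_SQL = """
SELECT author_lastname AS lastname FROM h_book WHERE author_firstname = 'Jack'
UNION ALL
SELECT author_lastname FROM h_monograph WHERE author_firstname = 'Jack'
UNION ALL
SELECT lastname FROM h_author WHERE firstname = 'Jack'
ORDER BY lastname
"""

shared = [row[0] for row in conn.execute(SHARED_SQL)]
hybrid = [row[0] for row in conn.execute(HYBRID_SQL)]
print(shared, hybrid)
```

The join side goes the other way: under Hybrid the author of a book is already a column of the book row, while under Shared it takes a join between book and author.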
XML in Major Databases
• All major databases now have some level of
support for XML
• Example: Oracle
– XML data type (can have a column which contains XML
documents)
– XPath processing of XML values
– Some indexing capabilities
– XML is a second class citizen in the database (support
consists of a bunch of tools – no coherent framework)