Database Systems and XML
David Wu
CS 632
April 23, 2001
Researched Papers
• J. Shanmugasundaram, et al. "Efficiently
Publishing Relational Data as XML
Documents", VLDB Conference, September
2000.
• J. Shanmugasundaram, et al. "Relational
Databases for Querying XML Documents:
Limitations and Opportunities," VLDB
Conference, September 1999.
Efficiently Publishing Relational
Data as XML Documents
Motivation
• Relational database systems and XML are
heavily used on the Web.
• Would like some way to publish relational
data as XML.
What is Needed
• Language to specify the conversion from
relational data to XML.
• Implementation to efficiently carry out the
conversion.
SQL Based Language
Implementation Alternatives
Main differences between relations and XML:
• XML docs have tags
• XML has nested structure
Early Tagging, Early Structuring
• Stored Procedure Approach (outside engine)
– Performs a nested-loop join by issuing queries for each
nested structure in the desired XML.
– High overhead due to the number of queries.
– Fixed join order.
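A minimal Python sketch of the stored-procedure idea, assuming a hypothetical customer/account schema in SQLite (the table names, column names, and sample rows are illustrative, not from the paper). It issues one inner query per outer tuple, which is where the query overhead and the fixed join order come from.

```python
import sqlite3
from xml.sax.saxutils import escape

def publish_nested_loop(conn):
    """Build XML by issuing one inner query per outer tuple (nested-loop style)."""
    parts = []
    customers = conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
    queries = 1
    for cust_id, name in customers:
        parts.append(f"<customer><name>{escape(name)}</name>")
        # One extra query per customer: the outer loop fixes the join order.
        rows = conn.execute(
            "SELECT acct_num FROM account WHERE cust_id = ? ORDER BY acct_num",
            (cust_id,),
        ).fetchall()
        queries += 1
        for (acct_num,) in rows:
            parts.append(f"<account>{escape(acct_num)}</account>")
        parts.append("</customer>")
    return "".join(parts), queries

# Hypothetical sample data (schema is an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE account (acct_num TEXT, cust_id INTEGER);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO account VALUES ('A1', 1), ('A2', 1), ('B1', 2);
""")
xml, num_queries = publish_nested_loop(conn)
```

With N customers the sketch runs N + 1 queries, so the cost grows with the number of parent tuples.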
Early Tagging, Early Structuring
• Correlated CLOB Approach (inside engine)
– One large query with sub-queries is run within the
engine.
– Must add XML constructor support to the engine.
– XML fragments from the constructors are stored as
CLOBs (Character Large Objects). Costly to handle.
• De-Correlated CLOB Approach (inside)
– Perform query de-correlation to give optimizer more
flexibility.
Late Tagging, Late Structuring
Two phases:
1) Content creation
2) Tagging and structuring
Late Tagging, Late Structuring
Content Creation: Redundant Relation Approach
– Join all source tables
– Both content and process redundancy
Late Tagging, Late Structuring
Content creation: Outer Union Approach
– Separate the children of the same parent (e.g.
one tuple should represent either account or
purchaseOrder).
– At the end outer union the results.
– Still some data redundancy (e.g. parent info)
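The difference between the two content-creation approaches can be sketched in Python with SQLite. The customer, account, and purchase_order tables below are hypothetical, and the outer union is simplified (the full plan in the paper carries more bookkeeping columns). Each outer-union tuple holds either an account or a purchase order, with the other child's columns padded with NULL.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE account (acct_num TEXT, cust_id INTEGER);
    CREATE TABLE purchase_order (po_id INTEGER, cust_id INTEGER);
    INSERT INTO customer VALUES (1, 'Ann');
    INSERT INTO account VALUES ('A1', 1), ('A2', 1);
    INSERT INTO purchase_order VALUES (10, 1), (11, 1), (12, 1);
""")

# Redundant relation: joining all tables multiplies accounts by orders.
redundant = conn.execute("""
    SELECT c.id, c.name, a.acct_num, p.po_id
    FROM customer c
    JOIN account a ON a.cust_id = c.id
    JOIN purchase_order p ON p.cust_id = c.id
""").fetchall()

# Outer union: one branch per child type, padded with NULLs.
# The parent columns (id, name) are still repeated in every tuple.
outer_union = conn.execute("""
    SELECT c.id, c.name, a.acct_num, NULL AS po_id
    FROM customer c JOIN account a ON a.cust_id = c.id
    UNION ALL
    SELECT c.id, c.name, NULL AS acct_num, p.po_id
    FROM customer c JOIN purchase_order p ON p.cust_id = c.id
""").fetchall()
```

For 2 accounts and 3 orders the join yields 2 x 3 = 6 tuples, while the outer union yields 2 + 3 = 5.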
Late Tagging, Late Structuring
Outer Union Plan:
Late Tagging, Late Structuring
Structuring/Tagging: Hash-based Tagger
• Group by hashing
• Extract tuples and tag them.
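A minimal sketch of a hash-based tagger in Python, assuming outer-union tuples of the hypothetical form (cust_id, name, acct_num, po_id) and made-up element names. All tuples are grouped in an in-memory hash table before any tag is emitted.

```python
from collections import defaultdict
from xml.sax.saxutils import escape

def hash_tag(rows):
    """Group unsorted outer-union rows by parent id in a hash table, then tag."""
    groups = defaultdict(lambda: {"name": None, "accounts": [], "orders": []})
    for cust_id, name, acct_num, po_id in rows:
        entry = groups[cust_id]
        entry["name"] = name
        if acct_num is not None:
            entry["accounts"].append(acct_num)
        if po_id is not None:
            entry["orders"].append(po_id)
    out = []
    # Sorting the keys only makes the output order deterministic here.
    for cust_id in sorted(groups):
        entry = groups[cust_id]
        out.append(f"<customer><name>{escape(entry['name'])}</name>")
        out.extend(f"<account>{escape(a)}</account>" for a in entry["accounts"])
        out.extend(f"<order>{po}</order>" for po in entry["orders"])
        out.append("</customer>")
    return "".join(out)

# Hypothetical unsorted input tuples.
rows = [
    (2, "Bob", None, 20),
    (1, "Ann", "A1", None),
    (1, "Ann", None, 10),
    (2, "Bob", "B1", None),
]
xml = hash_tag(rows)
```

The whole result set sits in the `groups` table until tagging starts, so memory use grows with the size of the result.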
Late Tagging, Early Structuring
• Late Tagging, Late Structuring requires
much memory for the hash table.
• Fix by creating “structured content” first and
then tagging it.
Late Tagging, Early Structuring
Structured content: Sorted Outer Union Approach
– Desired format
1. Parent information comes before or with its child
2. All info of a node and its descendants occur together
3. Relative order of the tuples matches user-specified order
– Achieve by performing a sort on ids on the
result of the outer union.
Late Tagging, Early Structuring
• Tagging Sorted Data: Constant Space Tagger
– Can append tags as soon as data is seen.
– Only need to remember the parent ids of the
last tuple seen to know when to append closing
tags.
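A sketch of constant-space tagging in Python, assuming tuples of the hypothetical form (cust_id, name, acct_num, po_id) already sorted on the parent id. Here the sort is done with Python's `sorted` as a stand-in; in the approach described above it would be done by the engine on the outer-union result. The tagger keeps only the last parent id seen.

```python
from xml.sax.saxutils import escape

def constant_space_tag(sorted_rows):
    """Stream tags over rows sorted by parent id, remembering only the last id."""
    last_id = None
    for cust_id, name, acct_num, po_id in sorted_rows:
        if cust_id != last_id:
            if last_id is not None:
                yield "</customer>"  # the previous parent is finished
            yield f"<customer><name>{escape(name)}</name>"
            last_id = cust_id
        if acct_num is not None:
            yield f"<account>{escape(acct_num)}</account>"
        if po_id is not None:
            yield f"<order>{po_id}</order>"
    if last_id is not None:
        yield "</customer>"

# Hypothetical outer-union tuples in arbitrary order.
unsorted_rows = [
    (2, "Bob", None, 20),
    (1, "Ann", "A1", None),
    (1, "Ann", None, 10),
    (2, "Bob", "B1", None),
]
# Stand-in for the engine's sort on ids (Python's sort is stable).
sorted_rows = sorted(unsorted_rows, key=lambda r: r[0])
xml = "".join(constant_space_tag(sorted_rows))
```

Because the function is a generator, tags can be written out as each tuple arrives instead of being buffered.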
Experiment
• Inside Engine
• Outside Engine
Breakdown of Construction
Summary of Results
• Constructing inside the relational engine is
more efficient.
• When processing can be done in main memory,
the Unsorted Outer Union approach wins.
• When main memory is not enough, the Sorted
Outer Union approach is best.
Relational Databases for
Querying XML Documents
Why Bother?
• XML is becoming the standard for data
representation in WWW.
• A query engine designed to tap information
from XML documents is valuable.
• Relational database system is a mature
technology and could be used to support
XML querying.
Basic Idea
Step 1: Generate a relational schema from the DTD
Step 2: Parse the XML document and load the data
into tuples of the relational table.
Step 3: Translate the semi-structured XML queries
into SQL corresponding to the relational data.
Step 4: Convert the result back to XML.
Translating XML to Relational
Schema
Main Issues:
1. DTD complexity
2. Arbitrary nesting of XML DTDs vs. two-level nature of relational schemas.
3. Set-valued attributes and recursion
DTD simplification transformations:
1) Flattening transformation
2) Simplification transformation of unary operators
3) Grouping transformation
Techniques to translate XML
DTD to relations.
• Basic Inlining Technique
• Shared Inlining Technique
• Hybrid Inlining technique
Basic Inlining Technique
• Inline as many descendants of an element
as possible into a single relation.
(author: firstname, lastname, address)
• Every element will have a relation
corresponding to it. (firstname, lastname,
and address will all have relations)
Basic Inlining Technique (cont.)
Complications:
1) Set-valued attributes (e.g. Article)
• Solve by using foreign keys and other tables.
2) Recursion
• Solve with relational keys and relational
recursive processing to retrieve the
relationship.
Tools used in creating relations
DTD Graph
– Nodes are elements, attributes, and operators
– Each element appears once
– Attributes and operators appear as many times
as they do in the DTD
– Cycles in the graph indicate recursion
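A small Python sketch of how cycles in a DTD graph expose recursion. The graph below is a hypothetical fragment (element names and the operator label "*#1" are assumptions); operator nodes get unique labels because they can appear more than once.

```python
def recursive_elements(graph):
    """Return the set of nodes that lie on some cycle of the DTD graph."""
    def reaches(start, target):
        # Iterative DFS from the children of `start`, looking for `target`.
        stack, seen = list(graph.get(start, [])), set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, []))
        return False
    return {n for n in graph if reaches(n, n)}

# Hypothetical DTD graph: node -> list of children.
dtd_graph = {
    "book": ["booktitle", "author"],
    "monograph": ["title", "author", "editor"],
    "editor": ["*#1"],        # editor contains monograph*
    "*#1": ["monograph"],
    "author": ["name"],
    "booktitle": [],
    "title": [],
    "name": [],
}
recursive = recursive_elements(dtd_graph)
```

Here monograph, editor, and the "*" node between them form the only cycle, so they are the nodes flagged as recursive.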
Tools used in creating relations
Element Graphs
– Generated from the DTD
graph
– Created by doing a DFS
from an element node
Creating a Relation
Given an element graph, the root is made into
a relation with all descendants inlined into it,
except:
1) Children directly below a “*” are made into separate
relations;
2) Each node with a back-pointer edge is made into a
separate relation.
These additional relations are named by their path
from the root and have parentID fields that serve as
foreign keys (e.g. Article.author has the attribute
article.author.parentID)
Problems with Basic
• Creates a large number of relations
• Not efficient for certain queries
– Good: “list all authors of books”
– Bad: “list all authors having first name Jack”
Shared Inlining Technique
Idea: Identify commonly used element nodes and
share them by creating separate relations for them.
Shared Inlining Technique
Rules for creating relations:
– Nodes with in-degree>1 have relations made
– Nodes with in-degree=1 are inlined
– Nodes with in-degree=0 have relations made
– Nodes following “*” have relations made
– Of nodes with in-degree=1 that are mutually
recursive, one is made into a relation
Shared Inlining Technique
Rules for designing the schema:
– Relation X inlines all nodes Y that it can reach
such that the path from X to Y does not contain
a node that is to be made a separate relation.
– Inlined elements are flagged as being a root
with the isRoot field.
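The Shared rules can be sketched in Python over a hypothetical DTD graph given as (parent, child, via_star) edges; the element names are assumptions. This sketch covers the in-degree and "*" rules only and omits the mutual-recursion rule and the isRoot field.

```python
from collections import Counter

# Hypothetical DTD graph edges: (parent, child, reached_through_star).
edges = [
    ("book", "booktitle", False),
    ("book", "author", False),
    ("article", "title", False),
    ("article", "author", True),   # article contains author*
    ("author", "name", False),
    ("name", "firstname", False),
    ("name", "lastname", False),
]

def shared_relations(edges):
    """Nodes with in-degree 0, in-degree > 1, or below a '*' get relations."""
    nodes = {n for p, c, _ in edges for n in (p, c)}
    in_degree = Counter(c for _, c, _ in edges)
    starred = {c for _, c, star in edges if star}
    return {n for n in nodes
            if in_degree[n] == 0 or in_degree[n] > 1 or n in starred}

def inlined_into(root, edges, relations):
    """Nodes reachable from root without passing through another relation."""
    children = {}
    for p, c, _ in edges:
        children.setdefault(p, []).append(c)
    seen, stack = set(), list(children.get(root, []))
    while stack:
        node = stack.pop()
        if node in relations or node in seen:
            continue
        seen.add(node)
        stack.extend(children.get(node, []))
    return sorted(seen)

relations = shared_relations(edges)
schema = {r: inlined_into(r, edges, relations) for r in sorted(relations)}
```

In this example author is shared by book and article, so it becomes its own relation with name, firstname, and lastname inlined into it.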
Problems with Shared
• Too many joins required!
Hybrid Inlining Technique
• Same as Shared except Hybrid also inlines
elements that…
– have in-degree>1 AND
– are not recursive AND
– are not reached through a “*” node.
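A sketch of how the Hybrid exception changes which nodes become relations, again in Python over hypothetical (parent, child, via_star) edges. The set of recursive nodes is passed in as an assumption rather than computed, and the mutual-recursion rule is left out.

```python
from collections import Counter

def choose_relations(edges, recursive=frozenset(), hybrid=False):
    """Pick the nodes that become relations under Shared or Hybrid inlining."""
    nodes = {n for p, c, _ in edges for n in (p, c)}
    in_degree = Counter(c for _, c, _ in edges)
    starred = {c for _, c, star in edges if star}
    relations = set()
    for n in nodes:
        is_shared = in_degree[n] > 1
        if hybrid and is_shared and n not in recursive and n not in starred:
            continue  # Hybrid inlines this node into each of its parents
        if in_degree[n] == 0 or is_shared or n in starred:
            relations.add(n)
    return relations

# Hypothetical DTD graph: title and author are both shared by book and article,
# but author is also reached through a "*" (article contains author*).
edges = [
    ("book", "title", False),
    ("article", "title", False),
    ("book", "author", False),
    ("article", "author", True),
    ("author", "name", False),
]
shared = choose_relations(edges)
hybrid = choose_relations(edges, hybrid=True)
```

Under Shared, title gets its own relation; under Hybrid it is inlined into both book and article, while author stays separate because it sits below a "*".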
Evaluation Metric
For path expressions of length N, data was
gathered on:
• The avg number of SQL queries generated
• The avg number of joins in each SQL query
• The total average number of joins in order
to process the path expression
Results for N=3
• For Basic, 1/3 of the DTD tests didn’t run
to completion due to lack of virtual
memory. Basic is thus ignored.
Results for N=3
Results for N=3
• Group 1: Hybrid reduces joins/query and
increases the number of queries by a smaller
amount => Hybrid requires fewer joins than Shared.
• Group 2: Hybrid reduces joins/query and
increases the number of queries by a comparable
amount => Hybrid and Shared are the same.
Results for N=3
• Group 3: Hybrid reduces some joins/query,
but increases the number of queries by a lot =>
Hybrid generates more joins than Shared.
• Hybrid and Shared performed similarly in
both joins/query and # of queries =>
Hybrid and Shared are about the same.
Semi-Structured Queries to SQL
Semi-structured query languages
– Allow path expressions with various operators
and wildcards.
XML-QL Query
Lorel
Simple Path to SQL
1. The relations corresponding to the start of
the root path are added to the FROM
clause.
2. If needed, the path expressions are
translated to joins.
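The two steps above can be sketched in Python for dot-separated paths. The `SCHEMA` mapping, the `ID`/`parentID` column names, and the sample tables are assumptions for illustration; the sketch handles only paths whose intermediate steps each have their own relation.

```python
import sqlite3

# Hypothetical schema: relation name -> columns inlined into it.
# Child relations are assumed to carry a parentID foreign key.
SCHEMA = {"book": {"booktitle"}, "author": {"firstname", "lastname"}}

def path_to_sql(path):
    """Translate a simple path such as 'book.author.lastname' to SQL."""
    steps = path.split(".")
    current = steps[0]
    from_clause = [current]          # step 1: root relation goes in FROM
    where = []
    for step in steps[1:-1]:
        if step not in SCHEMA:
            raise ValueError(f"{step} has no relation of its own")
        from_clause.append(step)     # step 2: each hop becomes a join
        where.append(f"{step}.parentID = {current}.ID")
        current = step
    leaf = steps[-1]
    if leaf not in SCHEMA[current]:
        raise ValueError(f"{leaf} is not inlined in {current}")
    sql = f"SELECT {current}.{leaf} FROM {', '.join(from_clause)}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql

# Hypothetical data to run the generated query against.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE book (ID INTEGER, booktitle TEXT);
    CREATE TABLE author (ID INTEGER, parentID INTEGER,
                         firstname TEXT, lastname TEXT);
    INSERT INTO book VALUES (1, 'XML');
    INSERT INTO author VALUES (1, 1, 'Jack', 'Smith');
""")
sql = path_to_sql("book.author.lastname")
result = conn.execute(sql).fetchall()
```

A path that stays inside one relation, such as book.booktitle, needs no join at all.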
Simple Recursive Path to SQL
1. Find initialization of the recursion (e.g.
*.monograph.editor with condition
monograph.title = “Subclass Cirripedia”)
2. Find the actual recursive path expression (e.g.
monograph.editor)
3. Union the two
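The three steps map onto a recursive SQL query. Below is a sketch using SQLite's `WITH RECURSIVE` as a stand-in for the engine's recursion support; the monograph/editor tables, their columns, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical schema: an editor belongs to a monograph, and a monograph may be
# nested under an editor (parent_editor_id), which makes the structure recursive.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monograph (id INTEGER, title TEXT, parent_editor_id INTEGER);
    CREATE TABLE editor (id INTEGER, name TEXT, monograph_id INTEGER);
    INSERT INTO monograph VALUES
        (1, 'Subclass Cirripedia', NULL), (2, 'Nested', 1), (3, 'Other', NULL);
    INSERT INTO editor VALUES (1, 'E1', 1), (2, 'E2', 2), (3, 'E3', 3);
""")

RECURSIVE_SQL = """
WITH RECURSIVE reached(editor_id, name) AS (
    -- 1. initialization: editors of the monograph with the given title
    SELECT e.id, e.name
    FROM monograph m JOIN editor e ON e.monograph_id = m.id
    WHERE m.title = ?
    -- 3. union of the two parts
    UNION
    -- 2. recursive step: follow monograph.editor below editors already reached
    SELECT e.id, e.name
    FROM reached r
    JOIN monograph m ON m.parent_editor_id = r.editor_id
    JOIN editor e ON e.monograph_id = m.id
)
SELECT name FROM reached ORDER BY editor_id
"""
editors = [name for (name,) in conn.execute(RECURSIVE_SQL, ("Subclass Cirripedia",))]
```

The editor of the unrelated monograph is never reached, because the recursion only expands from the initialization set.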
Arbitrary Path to Simple
Recursive Path
• Use a general technique to translate path
expressions to many simple (recursive) path
expressions.
Relational Results to XML:
Simple Structuring
Requires only attaching appropriate tags to
each tuple.
Relational Results to XML: Tag
Variables
Have the relational query contain the tag
value in the result tuple. Then just convert it
to a tag during XML generation.
Grouping
a) Could sort the result tuples by the group-by field and scan through them in order
when generating the XML.
b) Could do a grouping operation.
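Option (a) can be sketched in Python with `itertools.groupby`, assuming result tuples of the hypothetical form (author, title) and made-up element names. The tuples are sorted on the group-by field and then scanned once.

```python
from itertools import groupby
from operator import itemgetter
from xml.sax.saxutils import escape

def group_to_xml(rows):
    """Sort (author, title) tuples by author, then emit one element per group."""
    out = []
    by_author = itemgetter(0)
    for author, group in groupby(sorted(rows, key=by_author), key=by_author):
        out.append(f"<author><name>{escape(author)}</name>")
        out.extend(f"<title>{escape(title)}</title>" for _, title in group)
        out.append("</author>")
    return "".join(out)

# Hypothetical result tuples from the relational query.
rows = [("Smith", "B"), ("Jones", "A"), ("Smith", "C")]
xml = group_to_xml(rows)
```

`groupby` only merges adjacent equal keys, which is why the sort has to come first.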
Other Cases
• Complex Element Construction
– e.g. asking for all article elements, where each
may contain multiple sub-elements (e.g. author &
title)
– Difficult to do in traditional relational model.
• Heterogeneous Results
– e.g. asking for either title or author of article.
– Could be done in two queries and then merged.
Other Cases
• Nested Queries
– Could be rewritten in terms of SQL queries
using outer joins.
Conclusion
Suggested modifications to relational systems:
• Untyped/variable-typed references
• Information retrieval style indices
• Flexible comparison operators
• Multiple-query optimization/execution
• More powerful recursion support