Lore: A Database Management
System for Semistructured Data
Why?
• Although data may exhibit some structure it may
be too varied or irregular to map to a fixed
schema.
– Relational DBMS might use null values in this case.
• May be difficult to decide in advance on a specific
schema.
– Data elements may change types.
– Structure changes a lot (lots of schema modifications).
Semistructured Data
• Examples:
– Data from the web
• Overall site structure may change often.
• It would be nice to be able to query a web site.
– Data integrated from multiple, heterogeneous
data sources.
• Information sources change, or new sources added.
Object Exchange Model (OEM)
• Data in this model can be thought of as a
labeled directed graph.
– Schema-less and self-describing.
• Vertices in graph are objects.
– Each object has a unique object identifier (oid),
such as &5.
– Atomic objects have no outgoing edges and hold a
value of a type such as int, real, string, gif, java, etc.
– All other objects that have outgoing edges are
called complex objects.
OEM (Cont.)
• Examples:
– Object &3 is complex, and its subobjects are
&8, &9, &10, and &11.
– Object &7 is atomic and has value “Clark”.
• DBGroup is a name that denotes object &1.
(Names are entry points into the database).
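A minimal sketch of how such a graph could be held in memory, assuming a plain dictionary keyed by oid. The oids, labels, and values below are illustrative (only &7 = "Clark" is taken from the example above), and the helper names `is_atomic` and `subobjects` are hypothetical, not part of Lore.

```python
# Hypothetical OEM fragment: a complex object is a list of
# (label, child_oid) edges, an atomic object is a plain Python value.
objects = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Name", 7), ("Age", 9)],
    3: [("Name", 8), ("Office", 10)],
    7: "Clark",
    8: "Smith",
    9: 46,
    10: "Gates 252",
}

# Names are entry points into the database.
names = {"DBGroup": 1}


def is_atomic(oid):
    """Atomic objects carry a value and have no outgoing edges."""
    return not isinstance(objects[oid], list)


def subobjects(oid):
    """Return the oids one edge away from the given object."""
    if is_atomic(oid):
        return []
    return [child for _label, child in objects[oid]]


print(is_atomic(7), subobjects(names["DBGroup"]))
```

Because every edge carries its own label, the structure is self-describing: no schema has to be declared before objects are added.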
OEM to XML
• Example:
– <Member project="&5 &6">
<name>Jones</name>
<age>46</age>
<office>
<building>gates</building>
<room>252</room>
</office>
</Member>
• This corresponds to rightmost member in the
example OEM, where project is an attribute.
Lorel Query Language
• Need query language that supports path
expressions for traversing graph data and
handling of ‘typeless’ data.
• A simple path expression is a name
followed by a sequence of labels.
– DBGroup.Member.Office.
– Denotes the set of objects reachable by starting at
the DBGroup object and following edges labeled
Member and then Office.
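Evaluating a simple path expression can be sketched as a label-by-label graph traversal. The function name `eval_path` and the sample graph are assumptions made for illustration; this is not Lore's implementation.

```python
# Hypothetical OEM graph: complex objects are lists of (label, child) edges.
objects = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Name", 7), ("Office", 10)],
    3: [("Name", 8), ("Office", 11), ("Office", 12)],
    7: "Clark",
    8: "Smith",
    10: "Gates 252",
    11: "CIS 411",
    12: "Gates 100",
}
names = {"DBGroup": 1}


def eval_path(objects, names, path):
    """Return the set of oids reached by a simple path expression."""
    name, *labels = path.split(".")
    frontier = {names[name]}
    for label in labels:
        next_frontier = set()
        for oid in frontier:
            value = objects[oid]
            if isinstance(value, list):  # only complex objects have edges
                next_frontier.update(
                    child for edge_label, child in value if edge_label == label
                )
        frontier = next_frontier
    return frontier


print(eval_path(objects, names, "DBGroup.Member.Office"))
```

A path that does not exist in the data yields an empty set rather than an error, which suits data with no fixed schema.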
Lorel (cont.)
• Example:
– select DBGroup.Member.Office
where DBGroup.Member.Age < 30
• Result:
– Office “Gates 252”
– Office
Building “CIS”
Room “411”
Lorel Query Rewrite
• Previous query rewritten to:
– select O
from DBGroup.Member M, M.Office O
where exists y in M.Age : y < 30
• Comparison on age transformed to existential
condition.
– Since all properties are set-valued in OEM.
– A user can ask DBGroup.Member.Age < 30 regardless
of whether Age is single valued, set valued, or
unknown.
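The existential reading of the rewritten query can be sketched with nested loops and `any`. The data is hypothetical and chosen so that one member has a single Age, one has several, and one has none. The names `children` and `offices_of_young_members` are illustrative.

```python
# Hypothetical OEM graph for the rewritten query
#   select O from DBGroup.Member M, M.Office O
#   where exists y in M.Age : y < 30
objects = {
    1: [("Member", 2), ("Member", 3), ("Member", 4)],
    2: [("Age", 5), ("Office", 6)],              # single-valued Age
    3: [("Age", 7), ("Age", 8), ("Office", 9)],  # set-valued Age
    4: [("Office", 10)],                         # no Age at all
    5: 46,
    6: "Gates 252",
    7: 28,
    8: 31,
    9: "CIS 411",
    10: "Gates 100",
}


def children(oid, label):
    """Return the subobjects of oid reached via edges with this label."""
    value = objects[oid]
    if not isinstance(value, list):
        return []
    return [child for edge_label, child in value if edge_label == label]


def offices_of_young_members(root, limit=30):
    """Return Office oids of members with at least one Age below limit."""
    result = []
    for m in children(root, "Member"):
        # "exists y in M.Age : y < limit"
        if any(objects[y] < limit for y in children(m, "Age")):
            result.extend(children(m, "Office"))
    return result


print(offices_of_young_members(1))
```

The same code handles all three members, so the user never needs to know whether Age is single-valued, set-valued, or missing.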
Lorel Query Rewrite
• Why?
– Breaking query into simple path expressions
necessary for query optimization.
– Need to explicitly handle coercion.
• Atomic objects and values.
0.5 < “0.9” should return true
• Comparing objects and sets of objects.
DBGroup.Member.Age is a set of objects.
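A rough sketch of the atomic-value coercion, under assumptions that go beyond the slide: two strings are compared as strings, any other pair is coerced to real, and a failed coercion makes the comparison false. The function names are hypothetical.

```python
def to_real(value):
    """Coerce an atomic value to float, or return None if that fails."""
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value)
        except ValueError:
            return None
    return None


def less_than(left, right):
    """Compare two atomic values with forgiving type coercion."""
    if isinstance(left, str) and isinstance(right, str):
        return left < right          # assumption: string vs string stays lexical
    l, r = to_real(left), to_real(right)
    if l is None or r is None:
        return False                 # assumption: incomparable values do not match
    return l < r


print(less_than(0.5, "0.9"))
```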
Lorel (cont.)
• General path expressions are loosely specified
patterns for labels in the database.
(‘|’ disjunction, ‘?’ label pattern optional)
• Example:
– select DBGroup.Member.Name
where DBGroup.Member.Office(.Room%|.Cubicle)?
like “%252”
• Result:
– Name “Jones”
Name “Smith”
Query and Update Processing
• Query is parsed
• Parse tree is preprocessed and translated to
new OQL-like query.
• Query plan constructed.
• Query optimization.
• Optimized query plan is executed.
System Architecture
Iterators and Object Assignments
• Use recursive iterator approach:
– execution begins at top of query plan
– each node in the plan requests a tuple at a time
from its children and performs some operation
on the tuple(s).
– pass result tuples up to parent.
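The pull-based iterator approach maps naturally onto Python generators. This is a sketch with hypothetical operator functions and made-up rows, not Lore's operator set: each node asks its child for one tuple at a time and passes results upward.

```python
# Hypothetical input rows standing in for tuples flowing through the plan.
members = [
    {"name": "Jones", "age": 46, "office": "Gates 252"},
    {"name": "Smith", "age": 28, "office": "CIS 411"},
]


def scan(rows):
    """Leaf node: produce one tuple at a time."""
    for row in rows:
        yield row


def select(child, predicate):
    """Request tuples from the child and keep those that pass the predicate."""
    for row in child:
        if predicate(row):
            yield row


def project(child, field):
    """Request tuples from the child and pass one field up to the parent."""
    for row in child:
        yield row[field]


def build_plan():
    """Execution starts at the top node, which pulls from the nodes below it."""
    return project(select(scan(members), lambda row: row["age"] < 30), "office")


for office in build_plan():
    print(office)
```

Nothing runs until the top of the plan asks for a tuple, so no intermediate result is materialized in full.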
Object Assignments (OAs)
• OA is a data structure containing slots for range
variables with additional slots depending on the
query.
• Each slot within an OA holds the oid of a
vertex on a path being considered by the query
engine.
• Example: if OA1 holds the oid for “Smith”, then OA2
and OA3 can hold the oids for one of Smith’s
Office objects and Age objects.
Query Operators
• For example, the Scan operator returns all oids
that are subobjects of a given object following a
specified path expression.
– Scan (StartingOASlot, Path_expression, TargetOASlot)
• For each oid in StartingOASlot, check to see if
object satisfies path_expression and place oid into
TargetOASlot.
• Other operators include Join, Project, Select,
Aggregation, etc.
• Join node like nested-loop join in relational
DBMS.
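A simplified sketch of Scan filling object-assignment slots. It assumes an OA is a Python list of slots and that the path expression is a single label; the real operator accepts general path expressions. The graph and slot layout are illustrative.

```python
# Hypothetical OEM graph: complex objects are lists of (label, child) edges.
objects = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Office", 10)],
    3: [("Office", 11), ("Office", 12)],
    10: "Gates 252",
    11: "CIS 411",
    12: "Gates 100",
}


def scan(input_oas, start_slot, label, target_slot):
    """For each incoming OA, bind target_slot to every matching subobject."""
    for oa in input_oas:
        value = objects[oa[start_slot]]
        if not isinstance(value, list):
            continue  # atomic objects have no subobjects
        for edge_label, child in value:
            if edge_label == label:
                out = list(oa)          # copy so sibling OAs stay independent
                out[target_slot] = child
                yield out


def plan():
    """Chain two Scans: slot 0 = DBGroup, slot 1 = Member, slot 2 = Office."""
    initial = [[1, None, None]]
    members = scan(initial, 0, "Member", 1)
    return scan(members, 1, "Office", 2)


for oa in plan():
    print(oa)
```

Each emitted OA describes one complete path through the graph, one oid per slot.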
Query Optimization
• Does only a few optimizations:
– Push selection ops down query tree.
– Eliminate/combine redundant query operators.
• Explores query plans that use indexes where
possible.
– Two kinds of indexes:
– Lindex (link index) provides parent pointers; implemented
with hashing.
– Vindex (value index) is implemented as B+-trees.
Indexes
• Because of non-strict typing system, have String
Vindex, Real Vindex, and String-coerced-to-real
Vindex.
• Separate B-Trees for each type are constructed.
• Using Vindex for comparison (e.g. Age < 30)
consider the following:
– If type is string, do lookup in String Vindex
– If it can be converted to real, then do a lookup in the String-coerced-to-real Vindex.
– If type is real?
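A sketch of the three typed value indexes, using sorted lists and `bisect` as a stand-in for B+-trees and ignoring the edge label for brevity. It assumes that a real-valued comparison such as Age < 30 consults both the Real Vindex and the String-coerced-to-real Vindex; that is one plausible answer to the question above, not something the slide states. All names and data are hypothetical.

```python
import bisect


def build_vindexes(atoms):
    """Build sorted (key, oid) lists for the three typed indexes."""
    string_ix, real_ix, coerced_ix = [], [], []
    for oid, value in atoms.items():
        if isinstance(value, str):
            string_ix.append((value, oid))
            try:
                coerced_ix.append((float(value), oid))  # string that looks numeric
            except ValueError:
                pass
        else:
            real_ix.append((float(value), oid))
    return sorted(string_ix), sorted(real_ix), sorted(coerced_ix)


def keys_below(index, bound):
    """Return oids whose key is strictly below bound."""
    keys = [key for key, _oid in index]
    cut = bisect.bisect_left(keys, bound)
    return [oid for _key, oid in index[:cut]]


def lookup_less_than(real_ix, coerced_ix, bound):
    """Assumed handling of a real comparison value: check both numeric indexes."""
    return sorted(keys_below(real_ix, bound) + keys_below(coerced_ix, bound))


atoms = {5: 46, 7: 28, 8: "25", 9: "n/a", 12: 31.5}
string_ix, real_ix, coerced_ix = build_vindexes(atoms)
print(lookup_less_than(real_ix, coerced_ix, 30.0))
```

Keeping a separate index of strings that coerce to real lets a numeric predicate find values stored as text without scanning the whole String Vindex.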
Other issues
• Update query operator example:
– Update(Create_Edge, OA1, OA5, “Member”)
– Create edge from results in OA1 to OA5 labeled
“Member”.
• Lore arranges objects in physical disk pages, each
page with a number of slots with a single object in
each slot.
– Objects placed according to first-fit algorithm.
– Supports large objects spanning multiple pages.
– Objects clustered in depth-first manner (since Scan
traverses depth-first).
– Garbage collector removes unreachable objects.
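First-fit placement can be sketched as follows. The sketch tracks only the bytes used per page and assumes every object fits in a single page, so it leaves out slots, large objects, and clustering. The function name and sizes are illustrative.

```python
def first_fit(pages, page_capacity, size):
    """Place an object in the first page with enough room; return its index.

    pages is a list of bytes already used per page and is updated in place.
    Assumes size <= page_capacity (no multi-page objects in this sketch).
    """
    for index, used in enumerate(pages):
        if used + size <= page_capacity:
            pages[index] = used + size
            return index
    pages.append(size)  # no existing page has room, so open a new one
    return len(pages) - 1


pages = []
placements = [first_fit(pages, 100, size) for size in (60, 50, 30, 40)]
print(placements, pages)
```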
External Data Manager
• Enables retrieval of information from other data
sources, transparent to the user.
• An external object in Lore is a “placeholder” for
the external data and specifies how Lore interacts
with an external data source.
• The spec for an external object includes:
– Location of a wrapper program to fetch and convert
data to OEM, time interval until fetched information
becomes stale, and a set of arguments used to limit info
fetched from external source.
DataGuides
• A DataGuide is a concise and accurate summary
of the structure of an OEM database (stored as
OEM database itself, kind of like the system
catalog).
• Why?
– No explicit schema, how do we formulate meaningful
queries?
– Large databases (can’t just view graph structure).
– What if a path expression matches nothing in the database? (Evaluating it is wasted work.)
• Each possible path expression is encoded once.
DataGuides As Histograms
• Each object in the DataGuide can have a link to its
corresponding target set.
– A target set is a set of oids reachable by that path.
• TS of DBGroup.Member.Age is {9, 13}.
– This is a path index. Can find set of objects reachable
by a particular path.
– Can store statistics in DataGuide (more in next paper).
• For example, the # of atomic objects of each type reachable by
p.
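A sketch of building a path summary with target sets, assuming acyclic data so that the set of label paths is finite. It records each distinct label path once and maps it to the oids reachable by that path. The graph is hypothetical, arranged so that the target set of DBGroup.Member.Age is {9, 13} as above. A real DataGuide is itself stored as an OEM graph, which this dictionary does not model.

```python
# Hypothetical OEM graph: complex objects are lists of (label, child) edges.
objects = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Name", 7), ("Age", 9)],
    3: [("Name", 8), ("Age", 13), ("Office", 10)],
    7: "Clark",
    8: "Smith",
    9: 46,
    13: 28,
    10: "Gates 252",
}
names = {"DBGroup": 1}


def build_dataguide(objects, names):
    """Map every label path to its target set (assumes an acyclic graph)."""
    guide = {}
    stack = [(name, frozenset([oid])) for name, oid in names.items()]
    while stack:
        path, targets = stack.pop()
        guide[path] = targets
        by_label = {}
        for oid in targets:
            value = objects[oid]
            if isinstance(value, list):
                for label, child in value:
                    by_label.setdefault(label, set()).add(child)
        # Each label is followed once from the whole target set, so every
        # path is encoded exactly once no matter how many objects share it.
        for label, kids in by_label.items():
            stack.append((path + "." + label, frozenset(kids)))
    return guide


guide = build_dataguide(objects, names)
print(sorted(guide))
print(sorted(guide["DBGroup.Member.Age"]))
```

Checking whether a path is a key of `guide` answers "does this path exist?" without touching the data, and the target set doubles as a path index.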
Conclusions
• Takes advantage of the structure where it exists.
• Handles lack of structure well (data type coercion,
general path expressions).
• Query language allows users to get and update
data from semistructured sources.
– DataGuide allows users to determine what paths exist,
and gives useful statistical information
Query Optimization for
Semistructured Data
OEM vs. XML
• OEM’s objects correspond to elements in XML
• Sub-elements in XML are inherently ordered.
• XML elements may optionally include a list of
attribute value pairs.
• Graph structure for multiple incoming edges
specified in XML with references (ID, IDREF
attributes). i.e. the Project attribute.
Indexes
• Vindex(l, op, value, x) places into x all
atomic objects that satisfy the “op value”
condition with an incoming edge labeled l.
– Vindex(“Age”, <, 30, y) places into y objects
with age < 30.
• Lindex(x, l, y) places into x all objects that
are parents of y via edge labeled l.
– Lindex(x, “Age”, y) places into x all parents of
y via label “Age”.
Indexes (cont.)
• Bindex(l, x, y) finds all parent-child object pairs
connected by a label l.
– Bindex(“Age”, x, y) locates all parent-child pairs with
label Age.
• Pindex(PathExpression, x) places into x all
objects reachable via the path expression.
– Pindex(“A.B x, x.C y”, y) places into y all objects
reachable by going from A to B to C.
– Uses DataGuide.
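The link and edge indexes can be sketched as hash maps built in one pass over the edges. The key layouts below are assumptions chosen for illustration: the Lindex maps (child, label) to parents, and the Bindex maps a label to its parent-child pairs.

```python
from collections import defaultdict

# Hypothetical OEM graph: complex objects are lists of (label, child) edges.
objects = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Name", 7), ("Age", 9)],
    3: [("Name", 8), ("Age", 13)],
    7: "Clark",
    8: "Smith",
    9: 46,
    13: 28,
}


def build_indexes(objects):
    """Return (lindex, bindex) built from every edge in the graph."""
    lindex = defaultdict(set)  # (child oid, label) -> set of parent oids
    bindex = defaultdict(set)  # label -> set of (parent oid, child oid)
    for parent, value in objects.items():
        if isinstance(value, list):
            for label, child in value:
                lindex[(child, label)].add(parent)
                bindex[label].add((parent, child))
    return lindex, bindex


lindex, bindex = build_indexes(objects)
print(lindex[(9, "Age")])      # parents of &9 via "Age"
print(sorted(bindex["Age"]))   # all parent-child pairs labeled "Age"
```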
Simple Query
• select O
from DBGroup.Member M, M.Office O
where exists y in M.Age : y < 30
• Possible plans:
– Top-down (similar to pointer-chasing, nested-loops
join)
– Use Vindex to check y < 30, traverse backwards from
child to parent using Lindex
(bottom-up).
– Hybrid, both top down and bottom up. Meet in middle.
Select x
From A.B x
Where exists y in x.C: y = 5
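A sketch contrasting the top-down and bottom-up strategies on the query "Select x From A.B x Where exists y in x.C: y = 5". The graph is hypothetical, and a dictionary of parent pointers plus a linear filter stand in for the Lindex and Vindex. Both plans should return the same objects and differ only in how much of the graph they touch.

```python
# Hypothetical graph. Object &4 has a matching C child but hangs off A via
# "D", so the bottom-up plan must reject it while walking back up.
objects = {
    1: [("B", 2), ("B", 3), ("D", 4)],
    2: [("C", 5), ("C", 6)],
    3: [("C", 7)],
    4: [("C", 8)],
    5: 5,
    6: 9,
    7: 2,
    8: 5,
}
names = {"A": 1}


def edges(oid):
    """Outgoing (label, child) edges, empty for atomic objects."""
    value = objects[oid]
    return value if isinstance(value, list) else []


def top_down():
    """Pointer chasing: follow A.B, then test each x by following x.C."""
    result = set()
    for label, x in edges(names["A"]):
        if label == "B" and any(l == "C" and objects[y] == 5 for l, y in edges(x)):
            result.add(x)
    return result


def bottom_up():
    """Start from values equal to 5, then climb via parent pointers."""
    parents = {}  # (child, label) -> set of parents, a Lindex stand-in
    for parent in objects:
        for label, child in edges(parent):
            parents.setdefault((child, label), set()).add(parent)
    # Vindex stand-in: atomic objects with incoming edge "C" and value 5.
    ys = [child for (child, label) in parents if label == "C" and objects[child] == 5]
    result = set()
    for y in ys:
        for x in parents[(y, "C")]:
            if names["A"] in parents.get((x, "B"), set()):
                result.add(x)
    return result


print(top_down(), bottom_up())
```

When few objects satisfy y = 5, the bottom-up plan visits far fewer nodes. When A.B is small and the value is common, top-down is cheaper, which is why the choice is left to the optimizer.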
Query Plan Generation
(Overview)
• Logical query plan generator creates
high-level execution strategy.
• Physical query plan enumerator uses
statistics and a cost model to transform
the logical query plan into an estimated best
physical plan that lies within its search
space.
Logical Query Plans (cont.)
• Glue node represents a ‘rotation point’ that has as
its children two independent subplans.
– Rotating the order between independent components
yields different plans.
– Marks place where execution order is not fixed.
• Discover node chooses best way to bind variables
x and y.
• Chain node chooses best evaluation of a path
expression.
Logical query plan for:
Select x
From DBGroup.Member x
Where exists y in x.Age: y<30
[Figure: logical query plan, with one subplan for the from clause and one for the where clause]
Physical Query Plans
Physical Query Plans (cont.)
• Scan(x, l, y) places into y all objects that are
subobjects of x via edge labeled l.
– Top-down (pointer chasing).
• Lindex plan is bottom-up approach.
• Bindex: Locate edges whose label appears
infrequently in database.
• NLJ: left subplan passes variables to right
subplan.
Statistics
• I/O metric uses estimated # of objects fetched.
• For every label subpath p of length <= k:
– # Of atomic objects of each type reachable by p
– Min, and max values of all atomic objects of each type
reachable by p
– # Of instances of path p, denoted |p|
– # Of distinct objects reachable by p, denoted |p|_d
– # Of l-labeled subobjects of all objects reachable by p
– # Of incoming l-labeled edges to any instance of p,
denoted |pl|
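Two of these statistics, |p| and the number of distinct objects reachable by p, can be computed in one pass per label. The sketch assumes that an "instance" of p is a distinct sequence of edges matching the labels and that p may start at any object. The graph and function name are hypothetical.

```python
from collections import Counter

# Hypothetical graph in which project &5 is shared by two members.
objects = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Project", 5)],
    3: [("Project", 5), ("Project", 6)],
    5: "Lore",
    6: "Tsimmis",
}


def path_stats(objects, labels):
    """Return (number of instances of p, number of distinct objects reached)."""
    # counts[oid] = number of path instances ending at oid so far; the empty
    # path has one instance at every object.
    counts = Counter(dict.fromkeys(objects, 1))
    for label in labels:
        next_counts = Counter()
        for parent, n in counts.items():
            value = objects[parent]
            if isinstance(value, list):
                for edge_label, child in value:
                    if edge_label == label:
                        next_counts[child] += n
        counts = next_counts
    return sum(counts.values()), len(counts)


print(path_stats(objects, ["Member", "Project"]))
```

The two numbers differ exactly when objects are shared, which is the case a cost model needs to know about when estimating how many objects a plan fetches.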
Plan Enumeration
• Doesn’t consider joining two simple path
expressions together unless they share a common
variable.
• Pindex is used only when path expression begins
with a name and no variable except the last is used
in the query.
• Select clause always executes last.
• Doesn’t try to reorder multiple independent path
expressions.
Results
• Used XML database about movies.
Database graph contained 62,256 nodes and
130,402 edges.
• Experiment 1: Select DB.Movie.Title
– Best plan is Pindex, followed by top-down
– Worst plan is Bindex, with hash joins.
Results (cont.)
• Experiment 2:
All Movies with a Genre of “Comedy”
– Where clause is very selective, bottom-up does
a Vindex for “Comedy” with incoming edge
Genre
Results (cont.)
• Experiment 3: Query with two existentially
quantified variables in the where clause.
• Errors due to bad estimates of atomic value
distributions and set operation costs.
Results (cont.)
• Experiment 4: Select movies with certain
quality rating.
• Quality ratings uncommon in database so
optimizer chooses to find all ratings via
Bindex, and then work bottom-up.
Conclusions
• Cost estimates are accurate and select the
best plan most of the time
• Execution times of best and worst plans for
a given query can differ by many orders of
magnitude.
• Best strategy is highly dependent upon the
query and database, so cost-based query
optimization pays off for XML data.