
Lore: A Database Management System for Semistructured Data

Why?
• Although data may exhibit some structure, it may be too varied or irregular to map to a fixed schema.
  – A relational DBMS might use null values in this case.
• It may be difficult to decide in advance on a specific schema.
  – Data elements may change types.
  – Structure changes a lot (lots of schema modifications).

Semistructured Data
• Examples:
  – Data from the web
    • Overall site structure may change often.
    • It would be nice to be able to query a web site.
  – Data integrated from multiple, heterogeneous data sources.
    • Information sources change, or new sources are added.

Object Exchange Model (OEM)
• Data in this model can be thought of as a labeled directed graph.
  – Schema-less and self-describing.
• Vertices in the graph are objects.
  – Each object has a unique object identifier (oid), such as &5.
  – Atomic objects have no outgoing edges and hold a value of a type such as int, real, string, gif, java, etc.
  – All other objects, which have outgoing edges, are called complex objects.

OEM (Cont.)
• Examples:
  – Object &3 is complex, and its subobjects are &8, &9, &10, and &11.
  – Object &7 is atomic and has value “Clark”.
• DBGroup is a name that denotes object &1. (Names are entry points into the database.)

OEM to XML
• Example:
  <Member project="&5 &6">
    <name>Jones</name>
    <age>46</age>
    <office>
      <building>gates</building>
      <room>252</room>
    </office>
  </Member>
• This corresponds to the rightmost member in the example OEM, where project is an attribute.

Lorel Query Language
• Need a query language that supports path expressions for traversing graph data and handling of ‘typeless’ data.
• A simple path expression is a name followed by a sequence of labels.
  – DBGroup.Member.Office
  – Set of objects that can be reached starting with the DBGroup object, following edges labeled Member and then Office.

Lorel (cont.)
• Example:
  – select DBGroup.Member.Office where DBGroup.Member.Age < 30
• Result:
  – Office “Gates 252”
  – Office Building “CIS” Room “411”

Lorel Query Rewrite
• Previous query rewritten to:
  – select O from DBGroup.Member M, M.Office O where exists y in M.Age : y < 30
• Comparison on age is transformed into an existential condition.
  – Since all properties are set-valued in OEM.
  – A user can ask DBGroup.Member.Age < 30 regardless of whether Age is single-valued, set-valued, or unknown.

Lorel Query Rewrite
• Why?
  – Breaking the query into simple path expressions is necessary for query optimization.
  – Need to explicitly handle coercion.
    • Atomic objects and values: 0.5 < “0.9” should return true.
    • Comparing objects and sets of objects: DBGroup.Member.Age is a set of objects.

Lorel (cont.)
• General path expressions are loosely specified patterns for labels in the database (‘|’ disjunction, ‘?’ label pattern optional).
• Example:
  – select DBGroup.Member.Name where DBGroup.Member.Office(.Room%|.Cubicle)? like “%252”
• Result:
  – Name “Jones”
  – Name “Smith”

Query and Update Processing
• Query is parsed.
• Parse tree is preprocessed and translated to a new OQL-like query.
• Query plan is constructed.
• Query optimization.
• Optimized query plan is executed.

System Architecture

Iterators and Object Assignments
• Use recursive iterator approach:
  – Execution begins at the top of the query plan.
  – Each node in the plan requests a tuple at a time from its children and performs some operation on the tuple(s).
  – Result tuples are passed up to the parent.

Object Assignments (OAs)
• An OA is a data structure containing slots for range variables, with additional slots depending on the query.
• Each slot within an OA holds the oid of a vertex on a path being considered by the query engine.
• Example: if OA1 holds the oid for “Smith”, then OA2 and OA3 can hold the oids for one of Smith's Office objects and Age objects.
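The recursive iterator approach and the object assignments described above can be sketched with Python generators. This is only an illustration under stated assumptions, not Lore's implementation: the tiny graph, the dict-based OA, and the helper names `root`, `scan`, `select`, `project`, and `run_query` are all hypothetical, and type coercion is ignored. Each plan node pulls one OA at a time from its child, fills in one more slot, and passes the result up.

```python
# Hypothetical OEM graph (not the one pictured in the slides).
# Complex objects: oid -> list of (label, subobject oid) edges.
EDGES = {
    1: [("Member", 2), ("Member", 3)],            # oid 1 plays the role of DBGroup
    2: [("Name", 4), ("Age", 5), ("Office", 6)],
    3: [("Name", 7), ("Age", 8), ("Office", 9)],
}
# Atomic objects: oid -> value.
VALUES = {4: "Smith", 5: 28, 6: "Gates 252", 7: "Jones", 8: 46, 9: "CIS 411"}


def root(slot, oid):
    """Leaf of the plan: yield one OA with the named entry point bound."""
    yield {slot: oid}


def scan(child, start_slot, label, target_slot):
    """For each OA from the child, bind target_slot to every subobject
    reached from the object in start_slot via an edge with this label."""
    for oa in child:
        for edge_label, sub_oid in EDGES.get(oa[start_slot], []):
            if edge_label == label:
                yield {**oa, target_slot: sub_oid}


def select(child, slot, predicate):
    """Pass up only the OAs whose atomic object in `slot` satisfies the predicate."""
    for oa in child:
        if predicate(VALUES[oa[slot]]):
            yield oa


def project(child, slot):
    """Return the oid held in `slot`, dropping duplicates so that the
    existential condition yields each result object once."""
    seen = set()
    for oa in child:
        if oa[slot] not in seen:
            seen.add(oa[slot])
            yield oa[slot]


def run_query():
    # select O from DBGroup.Member M, M.Office O
    # where exists y in M.Age : y < 30
    plan = root("DBGroup", 1)
    plan = scan(plan, "DBGroup", "Member", "M")
    plan = scan(plan, "M", "Age", "y")
    plan = select(plan, "y", lambda age: age < 30)
    plan = scan(plan, "M", "Office", "O")
    plan = project(plan, "O")
    return list(plan)   # execution starts at the top node and pulls from below
```

Calling `run_query()` on this sample graph returns `[6]`, the oid of the office of the only member younger than 30. Nothing is computed until the top node asks for a tuple, which mirrors the tuple-at-a-time behaviour described on the slide.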

Query Operators
• For example, the Scan operator returns all oids that are subobjects of a given object following a specified path expression.
  – Scan (StartingOASlot, Path_expression, TargetOASlot)
• For the object whose oid is in StartingOASlot, check whether each subobject satisfies Path_expression and, if so, place its oid into TargetOASlot.
• Other operators include Join, Project, Select, Aggregation, etc.
• The Join node works like a nested-loop join in a relational DBMS.

Query Optimization
• Does only a few optimizations:
  – Push selection operators down the query tree.
  – Eliminate or combine redundant query operators.
• Explores query plans that use indexes where possible.
  – Two kinds of indexes:
  – Lindex (link index) provides parent pointers; implemented with hashing.
  – Vindex (value index) is implemented with B+-trees.

Indexes
• Because of the non-strict typing system, there are a String Vindex, a Real Vindex, and a String-coerced-to-real Vindex.
• Separate B+-trees are constructed for each type.
• When using a Vindex for a comparison (e.g. Age < 30), consider the following:
  – If the type is string, do a lookup in the String Vindex.
  – If it can be converted to real, then do a lookup in the String-coerced-to-real Vindex.
  – If the type is real?

Other issues
• Update query operator example:
  – Update(Create_Edge, OA1, OA5, “Member”)
  – Create an edge labeled “Member” from the results in OA1 to OA5.
• Lore arranges objects in physical disk pages, each page with a number of slots and a single object in each slot.
  – Objects are placed according to a first-fit algorithm.
  – Supports large objects spanning multiple pages.
  – Objects are clustered in depth-first manner (since Scan traverses depth-first).
  – Garbage collector removes unreachable objects.

External Data Manager
• Enables retrieval of information from other data sources, transparent to the user.
• An external object in Lore is a “placeholder” for the external data and specifies how Lore interacts with an external data source.
• The spec for an external object includes:
  – Location of a wrapper program to fetch and convert data to OEM, the time interval until fetched information becomes stale, and a set of arguments used to limit the information fetched from the external source.

DataGuides
• A DataGuide is a concise and accurate summary of the structure of an OEM database (stored as an OEM database itself, somewhat like the system catalog).
• Why?
  – No explicit schema: how do we formulate meaningful queries?
  – Large databases (can’t just view the graph structure).
  – What if a path expression doesn’t exist (wasted work)?
• Each possible path expression is encoded once.

DataGuides As Histograms
• Each object in the DataGuide can have a link to its corresponding target set.
  – A target set (TS) is the set of oids reachable by that path.
    • TS of DBGroup.Member.Age is {9, 13}.
  – This is a path index: it can find the set of objects reachable by a particular path.
  – Can store statistics in the DataGuide (more in next paper).
    • For example, the # of atomic objects of each type reachable by p.

Conclusions
• Takes advantage of the structure where it exists.
• Handles lack of structure well (data type coercion, general path expressions).
• Query language allows users to get and update data from semistructured sources.
  – DataGuide allows users to determine what paths exist, and gives useful statistical information.

Query Optimization for Semistructured Data

OEM vs. XML
• OEM’s objects correspond to elements in XML.
• Sub-elements in XML are inherently ordered.
• XML elements may optionally include a list of attribute-value pairs.
• Graph structure with multiple incoming edges is specified in XML with references (ID, IDREF attributes), e.g. the project attribute.

Indexes
• Vindex(l, op, value, x) places into x all atomic objects that satisfy the “op value” condition and have an incoming edge labeled l.
  – Vindex(“Age”, <, 30, y) places into y objects with age < 30.
• Lindex(x, l, y) places into x all objects that are parents of y via an edge labeled l.
  – Lindex(x, “Age”, y) places into x all parents of y via label “Age”.

Indexes (cont.)
• Bindex(l, x, y) finds all parent-child object pairs connected by an edge labeled l.
  – Bindex(“Age”, x, y) locates all parent-child pairs with label Age.
• Pindex(PathExpression, x) places into x all objects reachable via the path expression.
  – Pindex(“A.B x, x.C y”, y) places into y all objects reachable by going from A to B to C.
  – Uses the DataGuide.

Simple Query
• select O from DBGroup.Member M, M.Office O where exists y in M.Age : y < 30
• Possible plans:
  – Top-down (similar to pointer-chasing, nested-loops join).
  – Bottom-up: use Vindex to check y < 30, then traverse backwards from child to parent using Lindex.
  – Hybrid: both top-down and bottom-up, meeting in the middle.

Select x From A.B x Where exists y in x.C: y = 5

Query Plan Generation (Overview)
• Logical query plan generator creates a high-level execution strategy.
• Physical query plan enumerator uses statistics and a cost model to transform the logical query plan into the estimated best physical plan within its search space.

Logical Query Plans (cont.)
• A Glue node represents a ‘rotation point’ that has as its children two independent subplans.
  – Rotating the order between independent components yields different plans.
  – Marks a place where execution order is not fixed.
• A Discover node chooses the best way to bind variables x and y.
• A Chain node chooses the best evaluation of a path expression.

Logical query plan for: Select x From DBGroup.Member x Where exists y in x.Age: y<30
(Figure: subplans for the from clause and the where clause.)

Physical Query Plans

Physical Query Plans (cont.)
• Scan(x, l, y) places into y all objects that are subobjects of x via an edge labeled l.
  – Top-down (pointer chasing).
• Lindex plan is the bottom-up approach.
• Bindex: locate edges whose label appears infrequently in the database.
• NLJ: left subplan passes variables to right subplan.

Statistics
• I/O metric uses estimated # of objects fetched.
• For every label subpath p of length <= k:
  – # of atomic objects of each type reachable by p
  – Min and max values of all atomic objects of each type reachable by p
  – # of instances of path p, denoted |p|
  – # of distinct objects reachable by p, denoted |p|_d
  – # of l-labeled subobjects of all objects reachable by p
  – # of incoming l-labeled edges to any instance of p, denoted |pl|

Plan Enumeration
• Doesn’t consider joining two simple path expressions together unless they share a common variable.
• Pindex is used only when the path expression begins with a name and no variable except the last is used in the query.
• Select clause always executes last.
• Doesn’t try to reorder multiple independent path expressions.

Results
• Used an XML database about movies. The database graph contained 62,256 nodes and 130,402 edges.
• Experiment 1: Select DB.Movie.Title
  – Best plan is Pindex, followed by top-down.
  – Worst plan is Bindex with hash joins.

Results (cont.)
• Experiment 2: All movies with a Genre of “Comedy”
  – Where clause is very selective; bottom-up does a Vindex lookup for “Comedy” with incoming edge Genre.

Results (cont.)
• Experiment 3: Query with two existentially quantified variables in the where clause.
• Errors due to bad estimates of atomic value distributions and set operation costs.

Results (cont.)
• Experiment 4: Select movies with a certain quality rating.
• Quality ratings are uncommon in the database, so the optimizer chooses to find all ratings via Bindex and then work bottom-up.

Conclusions
• Cost estimates are accurate and select the best plan most of the time.
• Execution times of the best and worst plans for a given query can differ by many orders of magnitude.
• Best strategy is highly dependent upon the query and database (query optimization is good for XML data).
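To make the top-down and bottom-up plans from the Simple Query and Physical Query Plans slides concrete, the sketch below evaluates "Select x From DBGroup.Member x Where exists y in x.Age: y < limit" both ways. Everything here is an assumption for illustration: the graph is made up, a Python dict stands in for the hash-based Lindex, a sorted list searched with `bisect` stands in for the B+-tree Vindex, only numeric values are indexed, and the names `top_down` and `bottom_up` are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical OEM graph as (parent oid, label, child oid) edges.
EDGES = [(1, "Member", 2), (1, "Member", 3),
         (2, "Name", 4), (2, "Age", 5),
         (3, "Name", 7), (3, "Age", 8)]
VALUES = {4: "Smith", 5: 28, 7: "Jones", 8: 46}   # atomic objects
ROOT = 1                                          # object named DBGroup

children = defaultdict(list)  # (parent, label) -> children, used by top-down scans
lindex = defaultdict(list)    # (child, label) -> parents; stand-in for the Lindex
vindex = defaultdict(list)    # label -> sorted (value, oid); stand-in for the Vindex
for parent, label, child in EDGES:
    children[(parent, label)].append(child)
    lindex[(child, label)].append(parent)
    value = VALUES.get(child)
    if isinstance(value, (int, float)):           # this sketch indexes numbers only
        vindex[label].append((value, child))
for entries in vindex.values():
    entries.sort()


def top_down(limit):
    """Pointer chasing: DBGroup -> Member x -> Age y, then test y < limit."""
    result = set()
    for x in children[(ROOT, "Member")]:
        if any(VALUES[y] < limit for y in children[(x, "Age")]):
            result.add(x)
    return sorted(result)


def bottom_up(limit):
    """Value lookup first, then walk parent pointers back up to DBGroup."""
    entries = vindex["Age"]
    cut = bisect_left(entries, (limit,))          # entries[:cut] have value < limit
    result = set()
    for _, y in entries[:cut]:
        for x in lindex[(y, "Age")]:
            if ROOT in lindex[(x, "Member")]:     # x must be a Member of DBGroup
                result.add(x)
    return sorted(result)
```

Both functions return `[2]` for a limit of 30 on this sample graph. The difference lies in how many objects each one touches: the top-down plan visits every member, while the bottom-up plan visits only the Age objects that pass the comparison, which is why a selective where clause favours it.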
