Lore: A Database Management System for Semistructured Data

Why?
• Although data may exhibit some structure, it may be too varied or irregular to map to a fixed schema.
  – A relational DBMS might use null values in this case.
• It may be difficult to decide in advance on a specific schema.
  – Data elements may change types.
  – Structure changes a lot (lots of schema modifications).

Semistructured Data
• Examples:
  – Data from the web.
    • Overall site structure may change often.
    • It would be nice to be able to query a web site.
  – Data integrated from multiple, heterogeneous data sources.
    • Information sources change, or new sources are added.

Object Exchange Model (OEM)
• Data in this model can be thought of as a labeled directed graph.
  – Schema-less and self-describing.
• Vertices in the graph are objects.
  – Each object has a unique object identifier (oid), such as &5.
  – Atomic objects have no outgoing edges and have types such as int, real, string, gif, java, etc.
  – All other objects, which have outgoing edges, are called complex objects.

OEM (Cont.)
• Examples:
  – Object &3 is complex, and its subobjects are &8, &9, &10, and &11.
  – Object &7 is atomic and has value “Clark”.
• DBGroup is a name that denotes object &1. (Names are entry points into the database.)

OEM to XML
• Example:
  – <Member project="&5 &6">
      <name>Jones</name>
      <age>46</age>
      <office>
        <building>gates</building>
        <room>252</room>
      </office>
    </Member>
• This corresponds to the rightmost member in the example OEM, where project is an attribute.

Lorel Query Language
• Need a query language that supports path expressions for traversing graph data and that handles ‘typeless’ data.
• A simple path expression is a name followed by a sequence of labels.
  – DBGroup.Member.Office
  – The set of objects that can be reached starting from the DBGroup object, following edges labeled Member and then Office.

Lorel (cont.)
• Example:
  – select DBGroup.Member.Office where DBGroup.Member.Age < 30
• Result:
  – Office “Gates 252”
  – Office Building “CIS” Room “411”

Lorel Query Rewrite
• Previous query rewritten to:
  – select O from DBGroup.Member M, M.Office O where exists y in M.Age : y < 30
• The comparison on age is transformed into an existential condition.
  – All properties are set-valued in OEM.
  – A user can ask DBGroup.Member.Age < 30 regardless of whether Age is single-valued, set-valued, or unknown.

Lorel Query Rewrite
• Why?
  – Breaking the query into simple path expressions is necessary for query optimization.
  – Need to explicitly handle coercion.
    • Atomic objects and values: 0.5 < “0.9” should return true.
    • Comparing objects and sets of objects: DBGroup.Member.Age is a set of objects.

Lorel (cont.)
• General path expressions are loosely specified patterns for labels in the database (‘|’ is disjunction, ‘?’ marks a label pattern as optional).
• Example:
  – select DBGroup.Member.Name where DBGroup.Member.Office(.Room%|.Cubicle)? like “%252”
• Result:
  – Name “Jones”
  – Name “Smith”

Query and Update Processing
• Query is parsed.
• Parse tree is preprocessed and translated into a new OQL-like query.
• Query plan is constructed.
• Query optimization.
• Optimized query plan is executed.

System Architecture
[figure]

Iterators and Object Assignments
• Use a recursive iterator approach:
  – Execution begins at the top of the query plan.
  – Each node in the plan requests one tuple at a time from its children and performs some operation on the tuple(s).
  – Result tuples are passed up to the parent.

Object Assignments (OAs)
• An OA is a data structure containing slots for range variables, with additional slots depending on the query.
• Each slot within an OA holds the oid of a vertex on a path being considered by the query engine.
• Example: if OA1 holds the oid for “Smith”, then OA2 and OA3 can hold the oids for one of Smith's Office objects and Age objects.
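The iterator and object-assignment ideas above can be illustrated with a small sketch. This is not Lore's implementation: the toy graph, the oids, and the helper names (root, follow, exists_below, less_than) are all hypothetical, Python generators stand in for plan nodes, and a dict stands in for an OA. It evaluates the rewritten query "select O from DBGroup.Member M, M.Office O where exists y in M.Age : y < 30", including the string-to-number coercion the slides call for.

```python
from typing import Dict, Iterator, List, Tuple, Union

Atomic = Union[int, float, str]
OA = Dict[str, int]  # object assignment: slot name -> oid

# Hypothetical toy OEM graph: oid -> outgoing (label, child oid) edges.
EDGES: Dict[int, List[Tuple[str, int]]] = {
    1: [("Member", 2), ("Member", 3)],
    2: [("Name", 4), ("Age", 5), ("Office", 6)],
    3: [("Name", 7), ("Age", 8), ("Office", 9)],
}
# Atomic objects: oid -> value. One age is a string on purpose.
VALUES: Dict[int, Atomic] = {
    4: "Smith", 5: 28, 6: "Gates 252",
    7: "Clark", 8: "46", 9: "CIS 411",
}

def root(slot: str, oid: int) -> Iterator[OA]:
    """Leaf of the plan: bind a named entry point to a slot."""
    yield {slot: oid}

def follow(child: Iterator[OA], src: str, label: str, dst: str) -> Iterator[OA]:
    """Pull OAs from the child node one at a time; for each, bind dst to
    every subobject of the object in src reached via an edge named label."""
    for oa in child:
        for lbl, target in EDGES.get(oa[src], []):
            if lbl == label:
                yield {**oa, dst: target}

def less_than(value: Atomic, bound: float) -> bool:
    """Coercing comparison: values that cannot be read as numbers fail."""
    try:
        return float(value) < bound
    except (TypeError, ValueError):
        return False

def exists_below(child: Iterator[OA], src: str, label: str, bound: float) -> Iterator[OA]:
    """Existential filter: keep an OA if SOME label-subobject is below bound."""
    for oa in child:
        subvalues = (VALUES[t] for lbl, t in EDGES.get(oa[src], [])
                     if lbl == label and t in VALUES)
        if any(less_than(v, bound) for v in subvalues):
            yield oa

# Plan is built bottom-up but executed by pulling from the top node.
plan = follow(
    exists_below(follow(root("G", 1), "G", "Member", "M"), "M", "Age", 30),
    "M", "Office", "O",
)
result = [VALUES[oa["O"]] for oa in plan]
print(result)
```

Because every node is a generator, nothing is computed until the top of the plan asks for the next OA, which mirrors the tuple-at-a-time behavior described above. The existential filter also works unchanged whether a member has zero, one, or several Age subobjects.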
Query Operators
• For example, the Scan operator returns all oids that are subobjects of a given object following a specified path expression.
  – Scan(StartingOASlot, Path_expression, TargetOASlot)
• For each oid in StartingOASlot, check whether the object satisfies Path_expression and place the oid into TargetOASlot.
• Other operators include Join, Project, Select, Aggregation, etc.
• A Join node works like a nested-loop join in a relational DBMS.

Query Optimization
• Does only a few optimizations:
  – Push selection operators down the query tree.
  – Eliminate or combine redundant query operators.
• Explores query plans that use indexes where possible.
  – Two kinds of indexes:
    • Lindex (link index) provides parent pointers, implemented with hashing.
    • Vindex (value index) is implemented with B+-trees.

Indexes
• Because of the non-strict typing system, there are a String Vindex, a Real Vindex, and a String-coerced-to-real Vindex.
• Separate B-trees are constructed for each type.
• When using a Vindex for a comparison (e.g. Age < 30), consider the following:
  – If the type is string, do a lookup in the String Vindex.
  – If it can be converted to real, then do a lookup in the String-coerced-to-real Vindex.
  – If the type is real?

Other issues
• Update query operator example:
  – Update(Create_Edge, OA1, OA5, “Member”)
  – Creates an edge labeled “Member” from the results in OA1 to those in OA5.
• Lore arranges objects in physical disk pages; each page has a number of slots, with a single object in each slot.
  – Objects are placed according to a first-fit algorithm.
  – Supports large objects spanning multiple pages.
  – Objects are clustered in depth-first manner (since Scan traverses depth-first).
  – A garbage collector removes unreachable objects.

External Data Manager
• Enables retrieval of information from other data sources, transparently to the user.
• An external object in Lore is a “placeholder” for the external data and specifies how Lore interacts with an external data source.
• The spec for an external object includes:
  – The location of a wrapper program that fetches and converts data to OEM, the time interval until fetched information becomes stale, and a set of arguments used to limit the information fetched from the external source.

Data Guides
• A DataGuide is a concise and accurate summary of the structure of an OEM database (stored as an OEM database itself, somewhat like the system catalog).
• Why?
  – With no explicit schema, how do we formulate meaningful queries?
  – Large databases (can’t just view the graph structure).
  – What if a path expression doesn’t exist? (Evaluating it is wasted work.)
• Each possible path expression is encoded once.

DataGuides As Histograms
• Each object in the DataGuide can have a link to its corresponding target set.
  – A target set (TS) is the set of oids reachable by that path.
    • The TS of DBGroup.Member.Age is {9, 13}.
  – This is a path index: it can find the set of objects reachable by a particular path.
  – Statistics can be stored in the DataGuide (more in the next paper).
    • For example, the # of atomic objects of each type reachable by a path p.

Conclusions
• Takes advantage of structure where it exists.
• Handles lack of structure well (data type coercion, general path expressions).
• Query language allows users to get and update data from semistructured sources.
  – DataGuide allows users to determine what paths exist, and gives useful statistical information.

Query Optimization for Semistructured Data

OEM vs. XML
• OEM’s objects correspond to elements in XML.
• Sub-elements in XML are inherently ordered.
• XML elements may optionally include a list of attribute-value pairs.
• Graph structure with multiple incoming edges is specified in XML with references (ID, IDREF attributes), e.g. the project attribute.

Indexes
• Vindex(l, op, value, x) places into x all atomic objects that satisfy the “op value” condition and have an incoming edge labeled l.
  – Vindex(“Age”, <, 30, y) places into y objects with age < 30.
• Lindex(x, l, y) places into x all objects that are parents of y via an edge labeled l.
  – Lindex(x, “Age”, y) places into x all parents of y via label “Age”.

Indexes (cont.)
• Bindex(l, x, y) finds all parent-child object pairs connected by a label l.
  – Bindex(“Age”, x, y) locates all parent-child pairs with label Age.
• Pindex(PathExpression, x) places into x all objects reachable via the path expression.
  – Pindex(“A.B x, x.C y”, y) places into y all objects reachable by going from A to B to C.
  – Uses the DataGuide.

Simple Query
• select O from DBGroup.Member M, M.Office O where exists y in M.Age : y < 30
• Possible plans:
  – Top-down (similar to pointer-chasing, nested-loops join).
  – Bottom-up: use the Vindex to check y < 30, then traverse backwards from child to parent using the Lindex.
  – Hybrid: both top-down and bottom-up, meeting in the middle.

Select x From A.B x Where exists y in x.C: y = 5
[figure]

Query Plan Generation (Overview)
• The logical query plan generator creates a high-level execution strategy.
• The physical query plan enumerator uses statistics and a cost model to transform the logical query plan into the estimated best physical plan within its search space.

Logical Query Plans (cont.)
• A Glue node represents a ‘rotation point’ that has two independent subplans as its children.
  – Rotating the order between independent components yields different plans.
  – Marks a place where execution order is not fixed.
• A Discover node chooses the best way to bind variables x and y.
• A Chain node chooses the best evaluation of a path expression.

Logical query plan for: Select x From DBGroup.Member x Where exists y in x.Age: y < 30
[figure: plan with from-clause and where-clause subplans]

Physical Query Plans
[figure]

Physical Query Plans (cont.)
• Scan(x, l, y) places into y all objects that are subobjects of x via an edge labeled l.
  – Top-down (pointer chasing).
• The Lindex plan is the bottom-up approach.
• Bindex: locates edges whose label appears infrequently in the database.
• NLJ (nested-loop join): the left subplan passes variables to the right subplan.

Statistics
• I/O metric uses the estimated # of objects fetched.
• For every label subpath p of length <= k:
  – # of atomic objects of each type reachable by p
  – Min and max values of all atomic objects of each type reachable by p
  – # of instances of path p, denoted |p|
  – # of distinct objects reachable by p, denoted |p|_d
  – # of l-labeled subobjects of all objects reachable by p
  – # of incoming l-labeled edges to any instance of p, denoted |pl|

Plan Enumeration
• Doesn’t consider joining two simple path expressions together unless they share a common variable.
• Pindex is used only when the path expression begins with a name and no variable except the last is used in the query.
• The select clause always executes last.
• Doesn’t try to reorder multiple independent path expressions.

Results
• Used an XML database about movies. The database graph contained 62,256 nodes and 130,402 edges.
• Experiment 1: Select DB.Movie.Title
  – Best plan is Pindex, followed by top-down.
  – Worst plan is Bindex with hash joins.

Results (cont.)
• Experiment 2: All movies with a Genre of “Comedy”.
  – The where clause is very selective; bottom-up does a Vindex lookup for “Comedy” with incoming edge Genre.

Results (cont.)
• Experiment 3: Query with two existentially quantified variables in the where clause.
• Errors are due to bad estimates of atomic value distributions and set operation costs.

Results (cont.)
• Experiment 4: Select movies with a certain quality rating.
• Quality ratings are uncommon in the database, so the optimizer chooses to find all ratings via Bindex and then work bottom-up.

Conclusions
• Cost estimates are accurate and select the best plan most of the time.
• Execution times of the best and worst plans for a given query can differ by many orders of magnitude.
• The best strategy is highly dependent upon the query and the database (query optimization is good for XML data).
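
To make the gap between plans concrete, here is a minimal sketch contrasting the top-down and bottom-up strategies from the slides on a tiny hypothetical graph. It is an illustration only: the edge list, oids, and function names are assumptions, plain dictionaries stand in for the Lindex and Vindex, and the "objects fetched" counter is a crude stand-in for the I/O metric, not the cost model used in the paper.

```python
from collections import defaultdict

# Hypothetical toy graph: (parent oid, label, child oid) edges.
EDGES = [
    (1, "Member", 2), (1, "Member", 3), (1, "Member", 4),
    (2, "Age", 5), (3, "Age", 6), (4, "Age", 7),
]
VALUES = {5: 46, 6: 28, 7: 35}  # atomic objects: oid -> value

children = defaultdict(list)  # (parent, label) -> child oids, used top-down
parents = defaultdict(list)   # (child, label) -> parent oids, Lindex stand-in
by_label = defaultdict(list)  # label -> (oid, value) pairs, Vindex stand-in
for p, label, c in EDGES:
    children[(p, label)].append(c)
    parents[(c, label)].append(p)
    if c in VALUES:
        by_label[label].append((c, VALUES[c]))

def top_down(root, bound):
    """Pointer chasing: visit every Member, then every Age below it."""
    fetched = 1  # the root object
    result = []
    for m in children[(root, "Member")]:
        fetched += 1
        ages = children[(m, "Age")]
        fetched += len(ages)
        if any(VALUES[a] < bound for a in ages):
            result.append(m)
    return sorted(result), fetched

def bottom_up(root, bound):
    """Value lookup first, then walk parent pointers back to the root."""
    fetched = 0
    result = set()
    for oid, value in by_label["Age"]:
        if value < bound:  # a real B+-tree would visit only matching entries
            fetched += 1
            for m in parents[(oid, "Age")]:
                fetched += 1
                if root in parents[(m, "Member")]:
                    fetched += 1
                    result.add(m)
    return sorted(result), fetched

print(top_down(1, 30))
print(bottom_up(1, 30))
```

Both functions answer "Select x From DBGroup.Member x Where exists y in x.Age: y < 30" on the toy data and return the same members, but the bottom-up plan touches fewer objects here because the predicate is selective. With a non-selective predicate the comparison can flip, which is why the choice depends on statistics about the query and the database.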