Extending Database Management Systems by Developing New Database Operators
Paul J. Wagner
University of Wisconsin – Eau Claire

Messages
  Current relational query languages do not scale up well to support us in the development of complex queries on newer data domains
  New relational database operators are needed to help us generate such queries
  We can develop a framework for adding new operators by analyzing the shortcomings of our current operators

Definitions
  Question – an English (or other natural language) statement of the desired data
  Query – the statement of the problem in a relational query language
  Operator – a module representing a single conceptual task to be carried out on relational data; can be primitive (e.g. filtering rows from a relation) or nonprimitive (e.g. joining two tables, SQL select)

Background
  Database world in 1970s and 1980s
    Set-Oriented Data (e.g. employees, bank accounts, airline schedules)
    Relatively Well-Understood
    Relational Set Operators Were Sufficient
  Database world in 1990s and 2000s
    More Complex Data (e.g. Spatial and Temporal Data, Protein Sequences)
    Not Well Understood
    Relational Set Operators Are Insufficient

Relational Query Languages
  SQL isn't the only relational query language
    SQL is a transform-oriented language that implements a variety of atomic relational operations
  Other query language options
    Relational calculus (descriptive)
      "state the defining characteristics of the result" – C.J. Date
    Relational algebra (prescriptive)
      State the process that gets you to the desired result
      Operations: select, project, times (Cartesian product), join, union, intersect, minus

Relational Algebra Operations (B = binary, U = unary)
  select (U) – filters rows
    Note: RA select != SQL select
  project (U) – filters columns
  times (B) – all combinations of the rows in two relations (even if they don't make sense)
  join (B) – a macro, involving a sequence of:
    times
    select
      those that make sense
      those that meet the criteria of the particular question
    (optionally) project
      remove duplicate key values
      remove any other columns that aren't part of the question
  union (B), intersect (B), minus (B) – basic set operations

Times Example

  Creatures        Achievements
  CID  Name        CID  SCode
  1    Alice       1    F
  2    Bob         2    S
  3    Carl        2    F
                   3    S

  Creatures TIMES Achievements = Creature-Achievement Pairs
  CID  Name   CID  SCode
  1    Alice  1    F
  1    Alice  2    S
  1    Alice  2    F
  1    Alice  3    S
  2    Bob    1    F
  2    Bob    2    S
  2    Bob    2    F
  2    Bob    3    S
  3    Carl   1    F
  3    Carl   2    S
  3    Carl   2    F
  3    Carl   3    S

Quiz, Question 1
  Given an Achievements table with columns CreatureID and SkillCode, what SQL statement retrieves each creature that floats? (assume F means "floats")

  Achievements
  CreatureID  SkillCode
  1           F
  2           S
  2           F
  3           S
  5           C

    SELECT CreatureID FROM Achievements WHERE SkillCode = 'F';

Quiz, Question 2
  Given the same Achievements table, what SQL statement retrieves each creature that floats or swims (S means "swims")?

    SELECT CreatureID FROM Achievements WHERE SkillCode = 'F' OR SkillCode = 'S';

Quiz, Question 3
  Given the same Achievements table, what SQL statement retrieves each creature that floats and swims?

    SELECT CreatureID FROM Achievements WHERE SkillCode = 'F' AND SkillCode = 'S';   ???

Problems Emerge With SQL Select
  What are the issues here?
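Running the three quiz queries against the sample Achievements rows makes the issue concrete. This is a minimal sketch that uses Python's built-in sqlite3 module purely as a convenient SQL engine (an assumption; the slides do not name a DBMS for the quiz), and the helper name `run` is illustrative only.

```python
import sqlite3

# Sample rows from the quiz slides: (CreatureID, SkillCode).
ROWS = [(1, "F"), (2, "S"), (2, "F"), (3, "S"), (5, "C")]


def run(conn, sql):
    """Return the sorted CreatureID values produced by a query (illustrative helper)."""
    return sorted(row[0] for row in conn.execute(sql))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Achievements (CreatureID INTEGER, SkillCode TEXT)")
conn.executemany("INSERT INTO Achievements VALUES (?, ?)", ROWS)

# Question 1: creatures that float.
q1 = run(conn, "SELECT CreatureID FROM Achievements WHERE SkillCode = 'F'")

# Question 2: creatures that float or swim (creature 2 appears once per matching row).
q2 = run(conn, "SELECT CreatureID FROM Achievements "
               "WHERE SkillCode = 'F' OR SkillCode = 'S'")

# Question 3 as written on the slide: each row is tested on its own,
# and no single row can carry both 'F' and 'S'.
q3 = run(conn, "SELECT CreatureID FROM Achievements "
               "WHERE SkillCode = 'F' AND SkillCode = 'S'")

print(q1)  # [1, 2]
print(q2)  # [1, 2, 2, 3]
print(q3)  # []
```

Creature 2 both floats and swims, yet the third query returns no rows at all.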
  SQL operates on one row at a time
    We ask questions about the entire data set
  SQL is monolithic
    One SQL statement can contain many atomic relational operations
    E.g. the select statement for a join in SQL actually contains projection, times and selection in relational algebra
    E.g. the SQL where clause contains meaningful join criteria as well as "business logic"
  SQL starts to break down as the queries become more complex
    Harder to generate the syntax as well as know that the results are correct

How Do We Answer Question 3?
  A Few Possibilities:

    SELECT CreatureID FROM Achievements WHERE SkillCode = 'F'
    INTERSECT
    SELECT CreatureID FROM Achievements WHERE SkillCode = 'S';

    SELECT A1.CreatureID
    FROM Achievements A1, Achievements A2
    WHERE A1.CreatureID = A2.CreatureID
      AND A1.SkillCode = 'F' AND A2.SkillCode = 'S';

Quiz, Question 4
  Given an Achievements table with columns CreatureID and SkillCode and a table LifeguardSkills with a list of desired skills, what SQL statement retrieves each creature that has achieved all of the Lifeguard skills?

  Achievements              LifeguardSkills
  CreatureID  SkillCode     SkillCode
  1           F             F
  2           S             S
  2           F             R
  3           S             ….
  5           C
  ….          ….

  Umm….. Can I leave now?

Problems
  Prior techniques don't scale up for an arbitrarily large number of desired criteria
    Don't want to have to specify N intersect operations
    Don't want to join N tables
  Resulting queries have problems
    Time consuming and repetitious to generate
    Inefficient to execute

How Do We Answer Question 4?
  Relational Algebra also has the (binary) Divide operator
    Does what we want (divide Achievements by LifeguardSkills)
  In SQL, we need to create a macro:
    1) Find all possible creature/lifeguard-skill pairs (Creatures times LifeguardSkills)
       Gives us the "'if everyone was a lifeguard' achievements"
       Note: we need a separate Creatures relation
       Why can't we just generate Creatures by projecting CreatureID from Achievements?
    2) Find the difference between step 1 and the Achievements relation
       Gives us the "non-achieved Lifeguard achievements"
    3) Project the CreatureID from step 2
       Gives us the "creatures who haven't achieved all LifeguardSkills"
    4) Find the difference between Creatures and step 3
       Gives us the "creatures who have achieved all LifeguardSkills"

Question 4, Reflection
  How many relations are needed for our SQL macro?

  Creatures      Achievements              LifeguardSkills
  CreatureID     CreatureID  SkillCode     SkillCode
  1              1           F             F
  2              2           S             S
  3              2           F             R
  4              3           S             ….
  ….             5           C
                 ….          ….

Question 4, Reflection (cont.)
  Why are more relations needed for the macro than for relational divide?
    We're providing for the possibility that there are some creatures who have no achievements
    Not really needed now, but later….
  Is question 4 looking for creatures that have exactly the LifeguardSkills or those with exactly or more than those skills?
  Are there any other possible associations we might be interested in?

Quiz, Question 5
  Find each creature/job pair where the creature has achieved exactly or more than the skills for that job
  Note: we're generalizing the last question to match multiple jobs

  Creatures      CreatureSkills         JobSkills            Jobs
  CreatureID     CreatureID  SCode      JobName    SCode     JobName
  1              1           F          Lifeguard  F         Lifeguard
  2              2           S          Lifeguard  S         Developer
  3              2           F          Developer  D         Manager
  4              3           S          Developer  C         Slacker
  ….             5           C          Manager    O         ….
                 ….          ….         ….         ….

How To Answer Question 5?
  No operator in relational algebra
  Possible as a complex macro, but many operations
    Hundreds of lines of SQL code
  Let's think about this question some more…

Matching Relations
  We need four relations to answer this question
    Target (e.g. Creatures) – the 'candidate' relation
    Target-Detail (e.g. Creature-Skills, or Achievements) – combination of candidate plus achieved detail
    Pattern-Detail (e.g. Job-Skills) – combination of what target is matched against plus detail
    Pattern (e.g. Jobs) – what the target is matched against

Possible Set Associations (DEMONS-ZA)
  For a given target/pattern pair, comparing target detail (TD) with pattern detail (PD):
    Different than: TD and PD share no detail, but each has detail
    Exactly: TD same as PD
    More than: TD >= PD (superset)
    Overlapping: TD and PD share some detail, but each has different detail
    None: TD empty, PD has detail
    Some: TD <= PD (subset)
    Zero: TD empty, PD empty
    Any: TD has detail, PD empty

Combinations of Set Associations
  There are 2^8 − 1 = 255 possible non-empty combinations of the DEMONS-ZA set associations
    All are potentially interesting
  Some that commonly arise:
    EM = at least that many (universal quantifier)
    SOME = at least one (existential quantifier)
    NZ = none (no TD values)
    ESZ = in (the TD set has no values that are not in the PD set)

Set HAS Operator
  Developed by John Carlis, University of Minnesota; late 1980s
  HAS <Qual.1> <Qual.2> T-Rel TD-Rel PD-Rel P-Rel
    E.g. HAS 'EMZA' Creatures Creature-Skills Jobs-Skills Jobs
  Qualifier1: a DEMONS-ZA string or a counts expression which includes a relational expression involving one or more of 3 counts:
    TD values in PD (qualifying count)
    TD values not in PD (non-qualifying count)
    PD values not in TD (missing count)
    NOTE: can derive each DEMONS-ZA letter from 3 counts
  Qualifier2: 'Exact' matching of TD values to PD values (default), or 'Range' matching where PD is specified as a range of values
  Originally implemented in LISP, Scheme

MATCH Operator
  Developed by Jim Held, University of Minnesota, ~1990
  Set HAS plus:
    Supports multiple detail properties
      E.g. could match on skills, traits, and age
    Supports hierarchical structure for patterns
      E.g. a job could have multiple sub-jobs
    Supports hierarchical structure for criteria
      E.g. a skill could have sub-skills
  Demonstrated usefulness with medical expert systems
  Implemented in LISP

Bag HAS Operator
  Developed by Paul Wagner, ~2000
  Set HAS extended to support bags (multisets) of skills
    Extended Set HAS in a different direction than MATCH
  Need 5 relations (Target, Target-Detail, Detail, Pattern-Detail, Pattern)
    Target-Detail relation extended to contain count of detail present
    Pattern-Detail relation extended to contain count of detail required
  Developed another qualifier to represent possible combinations of counts
  Demonstrated usefulness in several domains
    Academic records
    Sports event qualification
    Limited protein sequence matching
  Implemented in relational algebra (built on top of Oracle) and PL/SQL

What's Next?
  Sequence HAS
    Bag HAS extended with positional matching
  Sequence MATCH
    MATCH extended with positional matching
  Generalized Sequence HAS/MATCH
  Usefulness
    Support DBMS implementations of many currently external sequence matching tools
      E.g. BLAST, FAST-A for protein sequences
    Other types of sequences (temporal, positional)

Issues/Alternatives For This Approach
  Issues
    Operators themselves become more complex to use
      More relations
      More qualifiers
    How far to extend languages like SQL?
      Current extensions support objects, procedural functionality
  Alternatives
    Packages based on data types
      Contain support for types, operations on those types
      Issue – only support that type, not generalized matching

Conclusions (Messages Revisited)
  Current relational query languages do not support the development of complex queries on newer data domains
  New relational database operators are needed to help us generate such queries
  We have developed a framework for adding new operators by analyzing the shortcomings of our current operators, and are using it to develop new database operators that can help meet today's data-driven software development needs
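
The Set HAS slide notes that each DEMONS-ZA letter can be derived from three counts. The following is a minimal Python sketch of that idea, not the original LISP/Scheme implementation. It assumes the eight letters are mutually exclusive (so "More than" is read as a proper superset and "Some" as a non-empty proper subset), which makes each letter correspond to one combination of "count is zero / count is non-zero". The function names `group`, `association`, and `set_has`, and the sample rows, are hypothetical.

```python
from collections import defaultdict


def group(pairs):
    """Collect (key, detail) pairs into a dict of key -> set of details."""
    grouped = defaultdict(set)
    for key, detail in pairs:
        grouped[key].add(detail)
    return grouped


# (qualifying > 0, non-qualifying > 0, missing > 0) -> DEMONS-ZA letter.
# Assumed mapping: the slides only state that such a derivation exists.
LETTERS = {
    (True, False, False): "E",   # Exactly
    (True, True, False): "M",    # More than (proper superset)
    (False, True, True): "D",    # Different than
    (True, True, True): "O",     # Overlapping
    (False, False, True): "N",   # None
    (True, False, True): "S",    # Some (proper subset)
    (False, False, False): "Z",  # Zero
    (False, True, False): "A",   # Any
}


def association(td, pd):
    """Classify one target/pattern pair from its three counts."""
    qualifying = len(td & pd)      # TD values in PD
    non_qualifying = len(td - pd)  # TD values not in PD
    missing = len(pd - td)         # PD values not in TD
    return LETTERS[(qualifying > 0, non_qualifying > 0, missing > 0)]


def set_has(qualifier, targets, target_detail, pattern_detail, patterns):
    """Return every (target, pattern) pair whose letter appears in qualifier."""
    td = group(target_detail)
    pd = group(pattern_detail)
    return [(t, p) for t in targets for p in patterns
            if association(td.get(t, set()), pd.get(p, set())) in qualifier]


# Hypothetical sample data in the spirit of Question 5.
creatures = [1, 2, 3, 4]
creature_skills = [(1, "F"), (2, "F"), (2, "S"), (3, "S"), (3, "C")]
jobs = ["Lifeguard", "Manager", "Slacker"]
job_skills = [("Lifeguard", "F"), ("Lifeguard", "S"), ("Manager", "C")]

em = set_has("EM", creatures, creature_skills, job_skills, jobs)
emza = set_has("EMZA", creatures, creature_skills, job_skills, jobs)
print(em)    # [(2, 'Lifeguard'), (3, 'Manager')]
print(emza)
```

With the 'EMZA' qualifier from the slide's example, every creature also pairs with Slacker, because a job with no required skills is matched under Zero or Any.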