Download Presentation

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Microsoft Jet Database Engine wikipedia , lookup

Microsoft SQL Server wikipedia , lookup

SQL wikipedia , lookup

Clusterpoint wikipedia , lookup

Database model wikipedia , lookup

Versant Object Database wikipedia , lookup

Relational algebra wikipedia , lookup

Relational model wikipedia , lookup

Join (SQL) wikipedia , lookup

Transcript
AutoJoin: Providing Freedom
from Specifying Joins
Terrence Mason ([email protected])
Lixin Wang ([email protected])
Dr. Ramon Lawrence ([email protected])
Iowa Database and Emerging Application Laboratory
University of Iowa
7th International Conference on Enterprise Information Systems
ICEIS 2005 Miami, Florida
Presentation Outline



Define Query Inference
Query Languages that require Inference
AutoJoin Architecture
Join Graph represent a schema
 Queries and Query Interpretations on a Join Graph


Pre-compute maximal join trees



Algorithm EMO
Query time processing – Example
Performance Evaluation
Query Inference Problem
New Languages
The query inference problem
requires enumerating and ranking
query interpretations of a query such
that the query interpretation desired
by the user is among the highest
ranked interpretations.
Motivation for Query Inference

State of the art query languages require it
Keyword Search – automatically relate
keywords across relations of a schema
 Conceptual Queries – Concepts mapped to
database must be related
 Natural Language Queries

Natural language query mapped to concepts
 Relate concepts as in Conceptual Queries


Current approaches not scalable
Tied to specific language
 Or conceptual model

Motivation for Query Inference

Reduces to graph problem



Connect relations (nodes) with joins (edges)
Exponential solutions for highly connected graphs
(database graphs less connected)
Approaches to join determination

Grow all ways



Shortest Paths


CQL Conceptual Query Language (Owei and Navathe, 2001)
Limited Interpretations



Universal Relation (Maier and Ullman, 1983)
Discover (Keyword) (Hristidis and Papakonstantinou, 2002, 2003,
2004)
Steiner Tree (2-Trees) (Wald and Sorenson, 1984)
Limit number of joins and interpretations (Zhang et al., 1999)
Query time find spanning trees of keywords

DBXplorer Keyword Search (Agrawal et al. 2002)
Goal of AutoJoin

Consistent, Scalable Inference Engine
Abstract database schema from users
 Automatically determine joins to relate relations and
attributes
 Consistent approach to handle ambiguity in queries
 Efficient algorithm to pre-compute potential joins
 Minimal overhead at query time
 Demonstrate efficiency and scalability
 Structured on relational model without any required
conceptual models

Example Query on TPC-H Schema
English Query:
List all parts ordered by Customers
in the United States.



Attribute-only SQL
Determine Joins with AutoJoin
New formulation for Query Inference problem.
TPC-H Schema
TPC-H BENCHMARK™ (http://www.tpc.org/)
List all parts ordered by Customers in the United States.
Table
Attributes
Part
partkey, name, mfgr, brand, type, size, container, retailprice,
comment
Supplier
supkey, name, address, nationkey, phone, acctbal, comment
PartSupp
partkey, suppkey, availqty, supplycost, comment
Customer
custkey, name, address, nationkey, phone, acctbal, mktsegment,
comment
Order
orderkey, custkey, orderstatus, totalprice, orderdate, orderpriority,
clerk, shippriority, comment
LineItem
orderkey, partkey, suppkey, linenumber, quantity, extendedprice,
discount, returnflag, tax, linestatus, shipdate, commitdate,
receiptdate, shipinstruct, shipmode, comment
Nation
nationkey, name, regionkey, comment
Region
regionkey, name, comment
Attribute-only Query:
Select Part.Name where Nation.Name=‘United States’;




Part.Name - name attribute in Part Table
Nation.Name – name attribute in Nation Table
Select and where similar to SQL
No From clause or joins specified
Keyword Query:
Part ‘United States’

Maps Part to Part relation
Maps ‘United States’ to tuple in Nation relation

No joins specified

SQL Query
Select Part.Name where Nation.Name = ‘United States’;
SELECT P.name
FROM part P, nation N, partsupp PS, lineitem LI,
orders O, customer C
WHERE N.name = ‘United States’
And P.partkey = PS.partkey
And PS.partkey = LI.partkey
And PS.suppkey = LI.suppkey
And O.custkey = C.custkey
And C.nationkey = N.nationkey
And LI.orderkey = O.orderkey;
Specified
Joins and
Tables
AutoJoin Architecture
User
Execute Queries
Relational
Database
Query Interface
Inference Request
Query Builder
Generator
XML
Document
Interpretations
Iterator
Ranker
Loader
AutoJoin Inference Engine
Representing Joins of a Schema
Join Graph


Graph representation of relational schema
Nodes


Relations in schema
Directed Edges

Foreign key constraint between relations
Edges directed from N to 1 cardinality of relationships
 Maintain Lossless property (No spurious tuples on joins)

Create Join Graph TPC-H
Nodes Joined
Foreign key/Join
Line Item to Part
partkey  partkey
Line Item to
PartSupp
partkey, suppkey 
partkey, suppkey
Line Item to
Supplier
suppkey  suppkey
Line Item to Order
l_orderkey  o_orderkey
PartSupp to Part
ps_partkey  p_partkey
PartSupp to
Supplier
ps_suppkey  s_suppkey
Supplier to Nation
s_nationkey  n_nationkey
Order to Customer
o_custkey  c_custkey
Customer to
Nation
c_nationkey  n_nationkey
Nation to Region
n_regionkey  r_regionkey
Tables as
Nodes
Line
Item
Part
Supp
Part
Supplier
Nation
Region
Order
Customer
Pre-compute Maximal Join Trees

EMO Algorithm on Join Graph
Efficiently computes all Trees
 Executes where previous strategy failed
 Direction of edges results in lossless join trees


Pre-computed
Executed once prior to query time
 Structures built for query time performance

Compute Lossless Joins



Maximal sets of lossless joins
Ambiguity inherent in the schema
Two types of ambiguity:

Single relation that plays multiple roles


Node with more than one incoming edge in join graph
Multiple semantic relationships between entities

Strongly connected components greater than one node
Creation of Maximal Join Trees
Lossless Joins

Efficient Algorithm EMO



Determine all reachable graphs from nodes that may be a
root for Maximal Set of Lossless Joins
Identify all Strong Connected Components
(SCC)
For each SCC




If SCC is single node and no incoming edges,
create reachable graph from this node
If SCC has multiple nodes, for each node in
SCC with no incoming edges that are not part
of SCC create reachable graph.
For each reachable graph find all
spanning trees
Spanning trees represent Maximal Join
Trees
Maximal Join Trees of TPC-H

LineItem is the only root for a reachable graph.




No strongly connected components
Join graph is reachable graph
Enumerate spanning trees on original graph
Remove shortcut joins and re-compute
TPC-H Join Graph
Line
Item
Part
Supp
Part
Supplier
Order
Nation
Customer
Region
TPC-H Maximal Join Trees
Line
Item
Part
Supp
Supplier
Part
Supplier
Part
Nation
1
Customer
Nation
2
Region
Supplier
5
Order
Customer
3
Customer
Order
Nation
Region
Supplier
Part
Nation
Customer
4
Region
Order
Customer
Customer
Region
Part
Supp
Supplier
Part
Order
Line
Item
Line
Item
Part
Supp
Supplier
6
Region
Nation
Part
Supp
Part
Nation
Supplier
Part
Line
Item
Part
Supp
Part
Order
Region
Line
Item
Part
Supp
Part
Supp
Part
Supp
Order
Line
Item
Line
Item
Line
Item
Nation
7
Region
Order
Customer
Supplier
Part
Nation
8
Region
Order
Customer
Shortcut Joins

Semantically equivalent join paths
A shortcut join is a join that is semantically
equivalent to a longer join path
 Core join path (longer) preserved in join graph
 Shortcut join removed for join determination

Appears to be a semantically different interpretation of
the query
 Substituted back into query



No nodes on core path in query (faster) execution)
TPC-H has two shortcut joins
TPC-H Join Graph
Remove Shortcut Joins
Line
Item
Red – Shortcut Joins
Part
Supp
Part
Supplier
Order
Nation
Customer
Region
Original TPC-H Maximal Join Trees
Line
Item
Part
Supp
Supplier
Part
Supplier
Part
Nation
1
Customer
Nation
2
Region
Supplier
5
Order
Customer
3
Customer
Order
Nation
Region
Supplier
Part
Nation
Customer
4
Region
Order
Customer
Customer
Region
Part
Supp
Supplier
Part
Order
Line
Item
Line
Item
Part
Supp
Supplier
6
Region
Nation
Part
Supp
Part
Nation
Supplier
Part
Line
Item
Part
Supp
Part
Order
Region
Line
Item
Part
Supp
Part
Supp
Part
Supp
Order
Line
Item
Line
Item
Line
Item
Nation
7
Region
Order
Customer
Supplier
Part
Nation
8
Region
Order
Customer
TPC-H Semantically Unique
Maximal Join Trees
Line
Item
Line
Item
Part
Supp
Part
Part
Supp
Supplier
Nation
Order
Part
Customer
Region
1
Supplier
Order
Nation
Customer
Region
2
Query and Query Interpretation
AutoJoin


Join Graphs
Query:
Sub-graph of the join graph
 Nodes and (optionally) edges



Not connected requires inference
Query Interpretation:
Connected sub-graph of the join graph
 Includes all specified nodes and edges

Example Query
SELECT Part.Name
WHERE Nation.Name = ‘United States’;

Relate Part.Name to Nation.Name



Query of Part and Nation nodes to AutoJoin.
The query is ambiguous



Part and Nation Nodes.
More than one query interpretation
Nation relates to Supplier and Customer
Return the query with fewest joins first
Efficient Query Time Execution

Find maximal join trees with query nodes
Reverse index - relation to its set of join trees
 Intersect lists


Build Interpretations
maximal sets of lossless joins
Least common ancestor (vs. recursive prune)
 Pre-compute ancestor lists


No lossless interpretations (no trees)


Find lossy interpretation
Rank interpretations by cost function
Both Trees Contain Query Nodes
Select Part.Name where Nation.Name = ‘United States’;
Line
Item
Line
Item
Part
Supp
Part
Part
Supp
Supplier
Nation
Region
1
Order
Part
Customer
Supplier
Order
Nation
Customer
Red – Target Nodes
Region
2
Query Processing
Line
Item
Line
Item
Part
Supp
Part
Part
Supp
Supplier
Nation
Order
Part
Customer
Supplier
Order
Nation
Customer
Red – Target Nodes
Region
1
Blue – Tree Nodes
Gray – Nodes to Prune
Region
2
Query Interpretations
Select Part.Name where
Customer.Nation.Name =
‘United States’;
Select Part.Name where
Supplier.Nation.Name =
‘United States’;
Line
Item
Part
Supp
Part
Supplier
Part
Supp
Part
Order
Nation
1
Nation
Customer
2
Unambiguous Query
Select Supplier.Name where Order.Id = 73;
Line
Item
Line
Item
Part
Supp
Part
Part
Supp
Supplier
Nation
Region
1
Order
Part
Customer
Supplier
Order
Nation
Customer
Red – Target Nodes
Region
2
Query Processing
Select Supplier.Name where Order.Id = 73;
Line
Item
Line
Item
Part
Supp
Part
Part
Supp
Supplier
Nation
Order
Part
Customer
Supplier
Order
Nation
Customer
Red – Target Nodes
Region
1
Blue – Tree Nodes
Gray – Nodes to Prune
Region
2
Query Interpretations
Select Supplier.Name where Order.Id = 73;
Line
Item
Line
Item
Part
Supp
Part
Supp
Supplier
1
Order
Supplier
2
Order
The Unambiguous Query Interpretation
Select Supplier.Name where Order.Id = 73;
Line
Item
Part
Supp
Supplier
Order
Additional Interpretations
Lossy Joins

Related through a node involved in two distinct roles




Two maximal join trees contain all query nodes and have at
least one node in common
Union maximal join trees
Common nodes provide relation for trees.
Interpretation where node will have two incoming edges


No longer lossless
Example Customer and Supplier related through Nation in
TPC-H.

Cross products of Customers and Suppliers with the same nation
Beyond Natural Joins

Theta joins
Merge the two nodes related by theta join into single
node and re-compute maximal objects.
 Expand this node for final query interpretation with
theta join


Tuple Variables
A query interface may specify tuple variables
 Additional nodes and edges will be added to join
graph to complete the query interpretations

Performance Experiments

Broad Range of Schemas
caBIO (NCI) 149 relations, 213 joins, and 1253
maximal join trees
 TPC-H Standard Database

Inferred standard queries (21 specified queries)
 Ambiguity reduced by removing shortcut joins


Tenant – 9 nodes, 50 joins, and 1286 maximal join
trees
Peformance Results

Time to generate all Maximal Join Trees




Handles schemas where previous method failed
Worst test 2.7 seconds
Average < 1 second
Reduce Ambiguity


Removing shortcut joins reduces ambiguity
Increased number of unambiguous query


From 45% to 68% for TPC-H Benchmark Queries
Minimal overhead of inference at query time


Average < 1 millisecond
Worst test 7.4 milliseconds
Compute Maximal Join Trees
EMO vs. All Ways
∞
50
Time (Seconds)
40
All Ways
30
EMO
18.187
20
10
4.906
2.652
0.031
0.016
0.079
0.797
0.187
0.047
0.093
0.11
0
TPC-H (8)
Claims (31)
Tenant (1286)
caBIO (1253)
Schema (Maximal Objects)
ACID (20)
MONDIAL (117)
Reducing Ambiguity
Remove Shortcut Joins
100%
100%
90%
Percent Unambiguous Joins
80%
Original
70%
68%
Shortcuts Removed
60%
50%
45%
40%
33%
30%
26%
20%
10%
8%
0%
TPC-H (All)
TPC-H (Benchmark)
EDS
Query Inference Time
(Milliseconds)
10.0
Average Query Time, milliseconds per query
9.0
8.0
7.420
7.0
6.0
5.0
4.0
2.764
3.0
2.0
1.0
0.355
0.282
0.392
TPC-H
Queries
TPC-H
Shortcut
Queries
TPC-H All
0.057
0.055
0.147
TPC-H
Shortcut All
ACID
Claims
0.173
0.0
Tenant
MONDIAL
caBIO
AutoJoin Conclusions
Scalable inference engine
 Efficiently pre-compute maximal join trees
 Reduced ambiguity by removing shortcut
joins
 Overhead is minimal
 Complex queries can be inferred
 Built directly on relational model

Future Work
Develop a query language
 Remove requirement of understanding
the underlying schema
 Automatically determines joins
 End user interface based on AutoJoin
 Query inference for integration systems.

Query Inference
(Previous)
The translation of a query in a
query language into an
unambiguous representation of the
query
[Wald and Sorenson, 1984]
Universal Relation


First model to require query inference
Maximal Objects (Maier and Ullman, 1983)




Minimum Directed Cost Steiner Tree (Wald and Sorenson, 1984)



Lossless Join property to identify potential joins
Grows all ways on hyper-graph
Returns a union of all query interpretations
Limited to Partial 2-Trees
Returns only lowest cost query interpretation
Generate a single interpretation


Do not meet need of new query languages
Limited query interpretations possible
State of the Art Query Languages

Keyword Searches
Keywords map to either specific data, attribute
names, or relation names in a database.
 Must identify joins to relate keywords spread across
multiple relations.
 Multiple approaches to identifying the top-k
relationships between keywords.

Keyword Search
Top-K Relationships

Discover (Hristidis and Papakonstantinou, 2002,
2003, 2004)
Grow all ways from a keyword
 Limit on number of joins
 Creates extra graphs


DBXplorer (Agrawal et al. 2002)


Generates spanning trees at query time
BANKS ( )
Graph of all tuples related by joins
 Must fit in memory (limited to smaller databases)

State of the Art Query Languages

Conceptual Query Languages or Models
Queries built with concepts that map to a database.
 Remove the burden of knowledge of the schema.
 Must determine joins to relate concepts in query.
 Use conceptual model to determine joins

Conceptual Query Languages

CQL (Owei and Navathe, 2001)


Queries may include roles or joins required for a query
Pathfinder algorithm for completing the query



Based on shortest path between source and target concepts in query
Semantically Constrained ER Diagram as a graph used to determine
joins.
Conceptual Model (Zhang et al., 1999)


Semantic graph of database
Search algorithm constrained by number of joins or number
of interpretations
State of the Art Query Languages

Natural Language Queries
Natural language queries map the language to
concepts in a database
 Joins must be determined to relate concepts in
database similar to Conceptual Query Languages

Functional Dependencies due to Primary Keys
TPC-H
Table
Part
Functional Dependencies
p_partkey  p_name, p_mfgr, p_brand, p_type, p_size, p_container,
p_retailprice, p_comment
Supplier
s_suppkey  s_name, s_address, s_nationkey, s_phone, s_acctbal,
s_comment
PartSupp
ps_partkey, ps_suppkey  ps_availqty, ps_supplycost, ps_comment
Customer
c_custkey  c_name, c_address, c_nationkey, c_phone, c_acctbal,
c_mktsegment, c_comment
Order
o_orderkey  o_custkey, o_orderstatus, o_totalprice, o_orderdate,
o_orderpriority, o_clerk, o_shippriority, o_comment
LineItem
l_orderkey, l_linenumber  l_partkey, l_suppkey, l_orderkey , l_quantity,
l_extendedprice, l_discount, l_returnflag, l_tax, l_linestatus, l_shipdate,
l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment
Nation
n_nationkey  n_name, n_regionkey, n_comment
Region
r_regionkey  r_name, r_comment
Primary Keys
Foreign Keys
Function Dependencies TPC-H
implied by Foreign Keys
Table with
Foreign Key
Table
Referenced
LineItem
Part
l_partkey  p_partkey
LineItem
Supplier
l_suppkey  s_suppkey
LineItem
PartSupp
l_partkey, l_suppkey  ps_partkey, ps_suppkey
LineItem
Order
l_orderkey  o_orderkey
PartSupp
Part
ps_partkey  p_partkey
PartSupp
Supplier
ps_suppkey  s_suppkey
Supplier
Nation
s_nationkey  n_nationkey
Customer
Nation
c_nationkey  n_nationkey
Order
Customer
Nation
Region
Primary Keys
Functional Dependencies
o_custkey  c_custkey
n_regionkey  r_regionkey
Foreign Keys
TPC-H Join Graph
Line
Item
Part
Supp
Part
Supplier
Order
Nation
Customer
Region