Download Conjunctive Queries

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Microsoft Access wikipedia , lookup

Concurrency control wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

PL/SQL wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Database wikipedia , lookup

Open Database Connectivity wikipedia , lookup

Microsoft SQL Server wikipedia , lookup

SQL wikipedia , lookup

Microsoft Jet Database Engine wikipedia , lookup

Functional Database Model wikipedia , lookup

SAP IQ wikipedia , lookup

Relational algebra wikipedia , lookup

Clusterpoint wikipedia , lookup

Versant Object Database wikipedia , lookup

Database model wikipedia , lookup

Relational model wikipedia , lookup

Transcript
Virtual Data Integration
Helena Galhardas
DEI IST
(based on the slides of the course: CIS 550 – Database &
Information Systems, Univ. Pennsylvania, Zachary Ives)
Agenda



Terminology
Conjunctive queries and Datalog
Virtual Data Integration Architecture
(Known) Terminology (1)


Relational database composed of a set of
relations (or tables)
Database schema includes a relational schema
for each table + set of integrity constraints



The rows of a relational table are called tuples
(or records)
The NULL value means the values is not known
or doesn’t exist


Key constraints, functional dependencies, foreign key
constraints
Equality test envolving NULL always returns FALSE
A database instance is a particular snapshot of
the contents of the DB
(Known) Terminology (2)

Queries used to formulate users’ needs



Views are created when we want to reuse the same
query expression in other queries



Structured (SQL or XQuery) or unstructured (for the
Web, e.g., list of keywords)
Q(D): result of applying query Q to the database D
Materialized view: the answer is computed and
maintained as the DB changes
Queries can also be used to specify relationships
between schemas of data sources in a data
integration context
Two different notations for queries:


SQL
Conjunctive queries, for formal purposes
Conjunctive queries


Based on Mathematical Logic
Conjunctive query has the form:





Q(X):- R1(X1), ....., Rn(Xn), c1, ..., cm
R1(X1)....Rn(Xn) are the sub-goals (or conjuncts) and form
the body of the query
Ris are database relations, Xi’s are tuples of variables and
constants
The variables in X are called head variables or
distinguished variables; the others are called existencial
variables
The predicate Q denotes the answer relation of the query
The ci’s are interpreted atoms and have the form X θ Y,
with X, Y either variables or constants (but at least one is a
variable), and θ is an interpreted predicate such as =, <=,
>, !=, >, >=
Semantics of conjunctive queries

Semantics of a conjunctive query Q over DB
instance D:





Ψ: any mapping that maps each of the variables in Q
to constants in D
Ψ(Ri): result of applying Ψ to Ri(Xi), which consists
of ground atoms (constants)
Ψ(Q): the result of applying Ψ to Q(X), which
consists of ground atoms
If
 Each of Ψ(R1),...., Ψ(Rn) is in D, and
 For each 1<=j<=m, Ψ(cj) is satisfied,
Then, Ψ(Q) is in the answer to Q over D
Correspondence between SQL queries
and conjunctive queries
Base relations:
Interview(candidate, name, recruiter, hireDecision, grade)
EmployeePerformance(empID, name, reviewQuarter, grade,
reviewer)
 SQL:
Select recruiter, candidate
From Interview, EmployeePerformance
Where recruiter = name
And grade < 2.5
 Conjunctive query:
Q1(Y,X):- Interview(X,D,Y,H,F),
EmployeePerformance(E,Y,T,W,Z),
W <= 2.5

Safety, disjunction

Conjunctive queries must be safe:



Every variable appearing in the head also appears in the
body
Otherwise, the set of possible answers to the query may be
infinite
Disjunction can be expressed writing two or more
conjunctive queries with the same head predicate:
Q1(Y,X) :- Interview(X,D,Y,H,F),
EmployeePerformance(E,Y,T,W,Z), W<=2.5
Q1(Y,X) :- Interview(X,D,Y,H,F),
EmployeePerformance(E,Y,T,W,Z), W >=3.9
Negation
Conjunctive queries with negated goals can also be
considered:
Q(X):- R1(X1), ..., Rn(Xn), ¬S1(Y1),…., ¬Sm(Ym)
 The notion of safety is extended to:



Any variable appearing in the head of the query must also
appear in a positive subgoal
To produce an answer for the query, the mappings
from the variables of Q to the constants in the
database must satisfy:
Ψ(S1(Y1)),.... Ψ(Sm(Ym))  D
Datalog program




Set of rules, each of which is a conjunctive query
Instead of computing a single answer query,
computes a set of intensional relations (IDB
relations), one of them is designated the query
predicate
Extensional Database (EDB) relations: are the
database relations, given by a set of tuples; can
occur only in the body of rules
Intensional Database (IDB) relations: are defined by
a set of rules; can occur both in the head and in the
body of rules
Datalog Terminology

An example datalog rule:
body
idb(x,y)  r1(x,z), r2(z,y), z < 10
head


Irrelevant variables can be replaced by _ (anonymous var)
Extensional relations or database schemas (edbs) are relations
only occurring in rules’ bodies, or as base relations:


subgoals
Ground facts only have constants, e.g.,
r1(‘abc’, 123)
Intensional relations (idbs) appear in the heads – these are
basically views

Distinguished variables are the ones output in the head
Example

A database that includes a relation representing the
edges in a graph: edge(X,Y)
The Datalog query:
r1: path(X,Y) :- edge(X,Y)
r2: path(X,Y) :- edge(X,Z), path(Z,Y)

computes the paths in the graph
 If r2 was replaced by:
r3: path(X,Y) :- path(X,Z), path(Z,Y)
It would produce the same result
Semantics of a Datalog program





Start with empty extensions for the IDB predicates
Choose a rule in the program and apply it to the
current extension of the EDB and IDB relations
Add the tuples computed for the head to its
extension
Continue applying the rules of the program until no
new tuples are computed for the IDB relations
The answer to the query is the extension of the
query predicate
Example (cont)
r1: path(X,Y) :- edge(X,Y)
r2: path(X,Y) :- edge(X,Z), path(Z,Y)
Start with:
edge(1,2), edge(2,3), edge(3,4)
Apply r1 returns:
path(1,2), path(2,3), path(3,4)
Apply r2 once:
path(1,3), path(2,4)
Apply r2 twice:
path(1,4)
Datalog is Relationally Complete
We can map RA  Datalog:







Selection p: p becomes a datalog subgoal
Projection A: we drop projected-out variables from
head
Cross-product r  s: q(A,B,C,D) :- r(A,B),s(C,D)
Join r ⋈ s: q(A,B,C,D) :- r(A,B),s(C,D), condition
Union r U s: q(A,B) :- r(A,B) ; q(C, D) :- s(C,D)
Difference r – s: q(A,B) :- r(A,B), ¬ s(A,B)
Great… But then why do we care about
Datalog?
A Query We Can’t Answer in RA
Recall our example of a binary relation for graphs or trees
(similar to an XML Edge relation):
edge(from, to)
If we want to know what nodes are reachable:
reachable(F, T, 1) :- edge(F, T)
dist. 1
reachable(F, T, 2) :- edge(F, X), edge(X, T)
reachable(F, T, 3) :- reachable(F, X, 2), edge(X, T)
dist. 2
dist. 3
But how about all reachable paths? (Note this was easy
in XPath over an XML representation -- //edge)
Recursive Datalog Queries
Define a recursive query in datalog:
reachable(F, T, 1) :- edge(F, T)
reachable(F, T, D + 1) :- reachable(F, X, D),
edge(X, T)
distance 1
distance >1
What does this mean, exactly, in terms of logic?

There are actually three different (equivalent) definitions of
semantics; we will use that of fixpoint:



We start with an instance of data, then derive all immediate
consequences
We repeat as long as we derive new facts
In the RA, this requires a (restricted) while loop!

“Inflationary semantics”
(which terminates in time polynomial in the size of the
database!)
Our Query in RA + while
(inflationary semantics, no negation)
Datalog:
reachable(F, T, 1) :- edge(F, T)
reachable(F, T, D+1) :- reachable(F, X, D), edge(X, T)
RA procedure with while:
reachable += edge ⋈ literal1
while change {
reachable += F, T, D(F ! X(edge) ⋈ T ! X,D ! D0(reachable)
⋈ add1)
}
Note literal1(F,1) and add1(D0,D) are actually arithmetic and
literal functions modeled here as relations.
A Special Type of Query:
Conjunctive Queries
A single Datalog rule with no “Ç,” “:,” “8” can express
select, project, and join – a conjunctive query
Conjunctive queries are possible to reason about statically
(Note that we can write CQ’s in other languages, e.g., SQL!)

We know how to “minimize” conjunctive queries
An important simplification that can’t be done for general
SQL
We can test whether one conjunctive query’s answers
always contain another conjunctive query’s answers
(for ANY instance)

Why might this be useful?
Example of Containment
Suppose we have two queries:
q1(S,C) :- Student(S,N),Takes(S,C),Course(C, X),
inCIS(C),Course(C,’DB & Info Systems’)
q2(S,C) :- Student(S,N),Takes(S,C),Course(C,X)
Intuitively, q1 must contain the same or fewer answers vs. q2:
 It has all of the same conditions, except one extra conjunction
(i.e., it’s more restricted)
 There’s no union or any other way it can add more data
We can say that q2 contains q1 because this holds for any instance of
our DB {Student, Takes, Course}
Agenda



Terminology
Conjunctive queries and Datalog
Virtual Data Integration Architecture
Building a Data Integration System
Create a middleware mediator or data integration system
over the sources



Can be warehoused (a data warehouse) or virtual
Presents a uniform query interface and schema
Abstracts away multitude of sources; consults them for relevant
data




Unifies different source data formats (and possibly schemas)
Sources are generally autonomous, not designed to be integrated
Sources may be local DBs or remote web sources/services
Sources may require certain input to return output (e.g., web
forms)

binding patterns describe these
Logical components of a virtual
Is built for the data
integration system
Specify the properties of
the sources the system
needs to know to use their
data. Main component is
semantic mappings that
specify how attributes in
the sources correspond to
attributes in the mediated
schema, how to resolve
differences in how data
values are specified in
different sources
Other info is whether
sources are complete
integration application and
contains only the aspects
of the domain relevant to
the application. Most
probably will contain a
subset of the attributes
seen in sources
Programs whose
role is to send
queries to a data
source, receive
answers and
possibly apply
some basic
transformation to
the answer
Example: a data integration scenario
Semantic mappings
Semantic mappings in the source descriptions describe:
 the relationship between the sources and the mediated
schema, for example:


Whether the sources are complete, for example:


mapping of source S1 states it contains movies; the attribute
name in Movies maps to attribute title in the mediator Movie
relation; the Actors relation in the mediated schema is a
projection of the Movies source on the attributes name and
actors
Source S2 may not contain all the movie showing times in
the entire country
Can specify limited access patterns to the sources, e.g.,:

All the playing times sources require a movie title as input
Components of a data integration
system
Query reformulation (1)


Rewrite the user query that was posed in terms
of the relations in the mediated schema, into
queries referring to the schemas of data
sources
Result is called a logical query plan:

Set of queries that refer to the schemata of the data
sources and whose combination will yield the answer
to the original query
Query reformulation (2)




Ex: SELECT title, startTime
FROM Movie, Plays
WHERE Movie.title = Plays.movie
AND location= ‘New York’
AND director= ‘Woody Allen’
Tuples for Movie can be obtained from source S1 but attribute
title needs to be reformulated to name
Tuples for Plays can be obtained from S2 or S3. Since S3 is
complete for showings in NY, we choose it
Since source S3 requires the title of a movie as input and the
title is not specified in the query, the query plan must first access
S1 and then feed the movie titles returned from S1 as inputs to
S3.
Query optimization

Acepts a logical query plan as input and
produces a physical query plan


Specifies the exact order in which sources are
accessed, when results are combined, which
algorithms are used for performing operations on
the data
Ex:

The optimizer will decide the join algorithm to
combine results from S1 and S3. It may stream
tuples from S1 and input them into S3, or it may
batch them up before sending them to S3
Query execution

Responsible for the execution of the physical
query plan


Dispatches the queries to the individual sources
through the wrappers and combines the results as
specified by the query plan.
Also may ask the optimizer to reconsider its
plan based on its monitoring of the plan’s
progress (e.g., if source S3 is slow)
Referências





Chapter 2, Draft of the book on “Principles of Data
Integration” by AnHai Doan, Alon Halevy, Zachary Ives
(in preparation).
Slides of the course: CIS 550 – Database &
Information Systems, Univ. Pennsylvania, Zachary
Ives)
Raghu Ramakrishnan and Johannes Gehrke,
“Database Management Systems”, 3rd edition,
McGraw-Hill
A. Silberchatz et al, “Database System Concepts”,
5th edition, McGraw-Hill
S. Abiteboul et al, “Foundations of Databases”,
Addison Weley, 1995