Efficient Management of Inconsistent and Uncertain Data
Renée J. Miller
University of Toronto

Contributors
• Ariel Fuxman, PhD Thesis
  • Microsoft Search Labs
  • Jim Gray SIGMOD 2008 Dissertation Award
• Periklis Andritsos, PhD
• Jiang Du, MS
• Elham Fazli, MS
• Diego Fuxman, Undergrad

Dirty Databases
• The presence of dirty data is a major problem in enterprises
• Traditional solution: data cleaning
[Cartoon caption: “No. I don’t see any problem with the data”]

Limitations of Data Cleaning
• Semi-automatic process
  • Time consuming
  • Requires highly qualified domain experts
• May not be possible to wait until the database is clean
  • Operational systems answer queries assuming clean data

Our Work
• Identify classes of queries for which we can obtain meaningful answers from potentially dirty databases
• Show how to do it efficiently and by reusing existing database technology

Why is this Business Intelligence?
• Business intelligence (BI) refers to technologies, applications, and practices for the collection, integration, analysis, and presentation of information.
• The goal of BI is to support better decision making, based on information.
• DBMS should provide meaningful query answers even over data that is dirty

Outline
• Introduction
• Semantics for dirty databases
• Contributions
• Conclusions

A Data Integration Example
Integrating customer data…
[Diagram: Sales, Shipping, Customer Support, Web Forms, and Demographic Data feed into an Integrated Customer Database]

Matching and Merging
Matching and merging are two fundamental tasks in data integration

Web
Custid  Name          Address             …  Income
Peter   Peter Yarrow  276 College Street  …  40K
Paul    Paul Stookey  100 Bloor Street    …  400K
Mary    Mary Travers  20 Union Street     …  110K

Sales
Custid  Name          Address             …  Income
Peter   Peter Yarrow  276 College St.     …  200K
Paul    Paul Stookey  100 Bloor St.       …  400K
Mary    Mary Travers  20 Union St.        …  130K

True Disagreement Between Sources
What’s Peter’s salary?

Web
Custid  Name          Address             …  Income
Peter   Peter Yarrow  276 College Street  …  40K
Paul    Paul Stookey  100 Bloor Street    …  400K
Mary    Mary Travers  20 Union Street     …  110K

Sales
Custid  Name          Address             …  Income
Peter   Peter Yarrow  276 College Street  …  200K
Paul    Paul Stookey  100 Bloor St.       …  400K
Mary    Mary Travers  20 Union St.        …  130K

Inconsistent Integrated Databases
In the absence of complete resolution rules…

Web (satisfies custid key)
custid  …  income
Peter   …  40K
Paul    …  400K
Mary    …  110K

Sales (satisfies custid key)
custid  …  income
Peter   …  200K
Paul    …  400K
Mary    …  130K

Inconsistent Integrated Database (violates custid key)
custid  …  income
Peter   …  40K
Peter   …  200K
Paul    …  400K
Mary    …  110K
Mary    …  130K

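The key violation in the integrated table can be detected with an ordinary grouping query. The following is a minimal sketch, assuming a two-column table named customer (the table name, the use of SQLite, and incomes stored in thousands are illustrative assumptions, not part of the talk):

```python
import sqlite3

# Integrated customer table from the slide; income is in thousands (40 = 40K).
rows = [
    ("Peter", 40), ("Peter", 200),
    ("Paul", 400),
    ("Mary", 110), ("Mary", 130),
]

conn = sqlite3.connect(":memory:")
# No PRIMARY KEY is declared, so conflicting tuples can coexist.
conn.execute("CREATE TABLE customer (custid TEXT, income INTEGER)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)

# A custid violates the key if it appears in more than one tuple.
violations = conn.execute(
    """
    SELECT custid, COUNT(*) AS n
    FROM customer
    GROUP BY custid
    HAVING COUNT(*) > 1
    ORDER BY custid
    """
).fetchall()

print(violations)
```

Only Peter and Mary are reported, each with two conflicting tuples; Paul has a single tuple because both sources agree on his income.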
Querying Inconsistent Databases
Example: Offering a Platinum credit card…
Query: “Get customers who make more than 100K”

custid  income  source
Peter   40K     web
Peter   200K    sales
Paul    400K    sales/web
Mary    110K    web
Mary    130K    sales

Answer: Peter, Paul, Mary
Are we sure that we want to offer a card to Peter?

Querying Inconsistent Databases
• Aggressive: Get customers who possibly make more than 100K
  • Peter, Paul, Mary
• Conservative: Get customers who certainly make more than 100K
  • Paul, Mary

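For this single-table selection, the two readings can be computed by grouping tuples on the key: a customer is a possible answer if any of their tuples satisfies the condition, and a certain answer if all of them do. A minimal sketch, assuming incomes in thousands (the function name is hypothetical, and this shortcut is only claimed for a selection over one relation with a key):

```python
from collections import defaultdict

# Integrated database from the slides: (custid, income in thousands).
customer = [
    ("Peter", 40), ("Peter", 200),
    ("Paul", 400),
    ("Mary", 110), ("Mary", 130),
]


def possible_and_certain(tuples, predicate):
    """Return (possible, certain) custids for a selection on one keyed relation."""
    incomes_by_key = defaultdict(list)
    for custid, income in tuples:
        incomes_by_key[custid].append(income)
    # Aggressive reading: at least one conflicting tuple satisfies the condition.
    possible = {k for k, vals in incomes_by_key.items() if any(predicate(v) for v in vals)}
    # Conservative reading: every conflicting tuple satisfies the condition.
    certain = {k for k, vals in incomes_by_key.items() if all(predicate(v) for v in vals)}
    return possible, certain


possible, certain = possible_and_certain(customer, lambda income: income > 100)
print(sorted(possible))  # ['Mary', 'Paul', 'Peter']
print(sorted(certain))   # ['Mary', 'Paul']
```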
Formal Semantics
• Related to semantics for querying incomplete data [Imielinski Lipski 84, Abiteboul Duschka 98]
  • Possible world: “complete” databases
• Consistent answers
  • Proposed by Arenas, Bertossi, and Chomicki in 1999
  • Corresponds to conservative semantics
  • Possible world: “consistent” databases

Consistent Answers
Key: custid

Inconsistent database
custid  income  source
Peter   40K     web
Peter   200K    sales
Paul    400K    sales/web
Mary    110K    web
Mary    130K    sales

Repairs
Repair 1: Peter 40K,  Paul 400K, Mary 110K
Repair 2: Peter 40K,  Paul 400K, Mary 130K
Repair 3: Peter 200K, Paul 400K, Mary 110K
Repair 4: Peter 200K, Paul 400K, Mary 130K

Consistent Answers
Query q = “Get customers who make more than 100K”

Repair                               q(repair)
Peter 40K,  Paul 400K, Mary 110K     Paul, Mary
Peter 40K,  Paul 400K, Mary 130K     Paul, Mary
Peter 200K, Paul 400K, Mary 110K     Peter, Paul, Mary
Peter 200K, Paul 400K, Mary 130K     Peter, Paul, Mary

Consistent answers: answers obtained no matter which repair we choose
Consistent answer = {Paul, Mary}

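The definition can be checked directly on the running example by enumerating every repair (one tuple per custid), running the query on each, and intersecting the results. This is a brute-force sketch for illustration only, with hypothetical function names and incomes in thousands; it is not how ConQuer computes answers:

```python
from collections import defaultdict
from itertools import product

# Inconsistent database from the slides: (custid, income in thousands).
customer = [
    ("Peter", 40), ("Peter", 200),
    ("Paul", 400),
    ("Mary", 110), ("Mary", 130),
]


def repairs(tuples):
    """Enumerate all repairs under the key custid: keep exactly one tuple per key value."""
    groups = defaultdict(list)
    for custid, income in tuples:
        groups[custid].append((custid, income))
    return [list(choice) for choice in product(*groups.values())]


def query(db):
    """Get customers who make more than 100K."""
    return {custid for custid, income in db if income > 100}


all_repairs = repairs(customer)
# A consistent answer is one returned on every repair.
consistent = set.intersection(*(query(r) for r in all_repairs))

print(len(all_repairs))    # 4
print(sorted(consistent))  # ['Mary', 'Paul']
```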
When We Started…
• Semantics well understood
• Problem
  • Potentially HUGE number of repairs!
  • Negative results [Chomicki et al 02, Arenas et al. 01, Cali et al 04]
  • Few tractability results [Arenas et al. 99, Arenas et al. 01]
• Logic programming approaches [Bravo and Bertossi 03, Eiter et al. 03]
  • Expressive queries and constraints
  • Computationally expensive
  • Applicable only to small databases with small number of inconsistencies

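To see why the number of repairs can be huge: under a key constraint, a repair picks one tuple per key value, so the number of repairs is the product of the group sizes. A small sketch of that count (the function name is hypothetical):

```python
from collections import Counter
from math import prod


def count_repairs(key_values):
    """Number of repairs = product, over key values, of the number of tuples sharing that key."""
    return prod(Counter(key_values).values())


# Running example: Peter and Mary each have 2 conflicting tuples, Paul has 1.
example = count_repairs(["Peter", "Peter", "Paul", "Mary", "Mary"])

# 50 key values with 2 conflicting tuples each already give 2**50 repairs.
larger = count_repairs([k for k in range(50) for _ in range(2)])

print(example)  # 4
print(larger == 2 ** 50)  # True
```

The count grows exponentially in the number of conflicting key values, which is why enumerating repairs does not scale.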
Our Proposal: ConQuer
[Diagram: an SQL query q and the keys are input to ConQuer’s rewriting algorithm, which produces a rewritten SQL query Q*. A commercial database engine evaluates Q* over the inconsistent database and returns the consistent answer to q.]

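To make the idea of a rewriting concrete, here is a hand-written sketch for the running query “Get customers who make more than 100K” with key custid. The rewritten query below is an illustration of the general approach (filter out any key value that has a conflicting tuple failing the condition); it is not the output of ConQuer’s algorithm, and the table name, SQLite, and incomes in thousands are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (custid TEXT, income INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("Peter", 40), ("Peter", 200), ("Paul", 400), ("Mary", 110), ("Mary", 130)],
)

# Original query q, evaluated directly on the inconsistent database.
ORIGINAL = "SELECT DISTINCT custid FROM customer WHERE income > 100 ORDER BY custid"

# Hand-written rewriting Q*: keep a custid only if no tuple with the same
# key value fails the selection condition.
REWRITTEN = """
SELECT DISTINCT c.custid
FROM customer c
WHERE c.income > 100
  AND NOT EXISTS (
      SELECT 1
      FROM customer c2
      WHERE c2.custid = c.custid
        AND c2.income <= 100
  )
ORDER BY c.custid
"""


def run(sql):
    return [row[0] for row in conn.execute(sql)]


original_answer = run(ORIGINAL)
consistent_answer = run(REWRITTEN)
print(original_answer)    # ['Mary', 'Paul', 'Peter']
print(consistent_answer)  # ['Mary', 'Paul']
```

The rewritten query is plain SQL, so it runs on the database engine without enumerating repairs.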
Class of Rewritable Queries
• ConQuer handles a broad class of SPJ queries with
  • Set semantics
  • Bag semantics, grouping, and aggregation
• No restrictions on
  • Number of relations
  • Number of joins
  • Conditions or built-in predicates
  • Key-to-key joins
• The class is “maximal”

Why not all SPJ queries?
• Some SPJ queries cannot be rewritten into SQL
  • Consistent query answering is coNP-complete even for some SPJ queries and key constraints
• Maximality of ConQuer’s class
  • Minimal relaxations lead to intractability
• Restrictions only on
  • Nonkey-to-nonkey joins
  • Self joins
  • Nonkey-to-key joins that form a cycle

Example: A Rewritable Query
• TPC-H Query 10

    SELECT c_custkey, c_name,
           sum(l_extendedprice * (1 - l_discount)) AS revenue,
           c_acctbal, n_name, c_address, c_phone, c_comment
    FROM customer, orders, lineitem, nation
    WHERE c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND o_orderdate >= '1993-10-01'
      AND o_orderdate < date('1993-10-01') + 3 MONTHS
      AND l_returnflag = 'R'
      AND c_nationkey = n_nationkey
    GROUP BY c_custkey, c_name, c_acctbal, c_phone,
             n_name, c_address, c_comment
    ORDER BY revenue DESC

Rewritings Can Get Quite Complex
[Figure: rewriting of TPC-H Query 10]
Can this rewriting be executed efficiently?
• 1.7x overhead on a 20 GB database with 5% inconsistency

Experimental Evaluation
• Goals
  • Quantify the overhead of the rewritings
  • Assess the scalability of the approach
  • Determine sensitivity of the rewritten queries to the level of inconsistency of the instance
• Queries and databases
  • Representative decision support queries (TPC-H benchmark)
  • TPC-H databases, altered to introduce inconsistencies
  • Database parameters
    • database size
    • percentage of the database that is inconsistent
    • conflicts per key value (in inconsistent portion)

Scalability
[Figure: overhead (Time rewriting / Time original) versus database size (GB); 5% inconsistent tuples, 2 conflicts per inconsistent key value]
• Worst case: 5.8 overhead, selectivity 98.56%
• Best case: 1.2 overhead, selectivity 0.001%

Contributions – Theory
• Formal characterization of a broad class of queries
  • For which computing consistent answers is tractable under key constraints
  • That can be rewritten into first-order/SQL
• Query rewriting algorithms for a class of Select-Project-Join queries
  • With set semantics
  • With bag semantics, grouping, and aggregation
• Maximality of the class of queries

Contributions – Practice
• Implementation of ConQuer
  • Designed to compute consistent answers efficiently
  • Multiple rewriting strategies
• Experimental validation of efficiency and scalability
  • Representative queries from TPC-H
  • Large databases

Uncertain Data
Provenance information (e.g., source reputation): Web 0.3, Sales 0.7

Web (0.3)
custid  …  income
Peter   …  40K
Paul    …  400K
Mary    …  110K

Sales (0.7)
custid  …  income
Peter   …  200K
Paul    …  400K
Mary    …  130K

Integrated Database
custid  …  income  probability
Peter   …  40K     0.3
Peter   …  200K    0.7
Paul    …  400K    1
Mary    …  110K    0.3
Mary    …  130K    0.7

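One way to use these numbers at query time is sketched below. It assumes the values attached to the integrated tuples are read as probabilities and that tuples sharing a custid are mutually exclusive alternatives; under those assumptions, the probability that a customer is in the answer is the sum over that customer's qualifying tuples. The slide does not spell out this semantics, so treat the function (a hypothetical name) as an illustration only:

```python
from collections import defaultdict

# Integrated database from the slide: (custid, income in thousands, probability).
integrated = [
    ("Peter", 40, 0.3), ("Peter", 200, 0.7),
    ("Paul", 400, 1.0),
    ("Mary", 110, 0.3), ("Mary", 130, 0.7),
]


def answer_probabilities(tuples, predicate):
    """Sum the probabilities of each customer's tuples that satisfy the predicate."""
    probs = defaultdict(float)
    for custid, income, p in tuples:
        if predicate(income):
            probs[custid] += p
    return dict(probs)


probs = answer_probabilities(integrated, lambda income: income > 100)
for custid in sorted(probs):
    print(custid, round(probs[custid], 2))
```

Under this reading, Paul and Mary qualify with probability 1 and Peter with probability 0.7, which refines the yes/no distinction between certain and possible answers.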
Publications and Demo
• These and other contributions appear in
  • ICDT05/JCSS06
  • SIGMOD05
  • ICDE06
  • PODS06/TODS06
  • VLDB06
• Demo given at VLDB05
  • http://queens.db.toronto.edu/project/conquer/demo2/

A Virtuous Cycle
[Diagram: a cycle between Data Integration and Query Answering]
• Data Integration: recognize and characterize inconsistent data
• Query Answering: use knowledge about inconsistencies to
  • give better answers
  • suggest ways to clean the database

Beyond the Enterprise
• Can we apply principled models of inconsistency or uncertainty to the Web?
• Different assumptions
  • Uncertainty in queries
  • There’s never a “true” answer
• Challenge
  • Build models based on user preferences
  • Leverage massive repositories of user behavior data

THANK YOU
Plug: Discovering Data Quality Rules, Fei Chiang
Thursday 11:15am Research Session 33