Semantics for dirty databases

Efficient Management of Inconsistent and Uncertain Data
Renée J. Miller
University of Toronto

Contributors
• Ariel Fuxman, PhD Thesis
  • Microsoft Search Labs
  • Jim Gray SIGMOD 2008 Dissertation Award
• Periklis Andritsos, PhD
• Jiang Du, MS
• Elham Fazli, MS
• Diego Fuxman, Undergrad

Dirty Databases
• The presence of dirty data is a major problem in enterprises
• Traditional solution: data cleaning
[Cartoon caption: "No. I don't see any problem with the data"]

Limitations of Data Cleaning
• Semi-automatic process
  • Time consuming
  • Requires highly-qualified domain experts
• May not be possible to wait until the database is clean
• Operational systems answer queries assuming clean data

Our Work
• Identify classes of queries for which we can obtain meaningful answers from potentially dirty databases
• Show how to do it efficiently, reusing existing database technology

Why is this Business Intelligence?
• Business intelligence (BI) refers to technologies, applications and practices for the collection, integration, analysis, and presentation of information. The goal of BI is to support better decision making, based on information.
• DBMS should provide meaningful query answers even over data that is dirty

Outline
• Introduction
• Semantics for dirty databases
• Contributions
• Conclusions

A Data Integration Example
Integrating customer data…
[Diagram: Sales, Shipping, Customer Support, Web Forms, and Demographic Data, together with an Integrated Customer Database]

Matching and Merging
Matching and merging are two fundamental tasks in data integration
[Tables: Web and Sales customer records (Custid, Name, Address, …, Income) for Peter Yarrow, Paul Stookey, and Mary Travers; the two sources write addresses differently, e.g. "100 Bloor Street" and "100 Bloor St."]

True Disagreement Between Sources
What's Peter's salary?

Web
Custid  Name          Address             …  Income
Peter   Peter Yarrow  276 College Street  …  40K
        Paul Stookey  100 Bloor Street    …  400K
        Mary Travers  20 Union Street     …  110K

Sales
Custid  Name          Address             …  Income
Peter   Peter Yarrow  276 College Street  …  200K
        Paul Stookey  100 Bloor St.       …  400K
        Mary Travers  20 Union St.        …  130K

Inconsistent Integrated Databases
In the absence of complete resolution rules…

Web and Sales each SATISFY the custid KEY

Web
custid  …  income
Peter   …  40K
Paul    …  400K
Mary    …  110K

Sales
custid  …  income
Peter   …  200K
Paul    …  400K
Mary    …  130K

Inconsistent Integrated Database (VIOLATES custid KEY)
custid  …  income
Peter   …  40K
Peter   …  200K
Paul    …  400K
Mary    …  110K
Mary    …  130K

Querying Inconsistent Databases
Example: Offering a Platinum credit card…
Query: "Get customers who make more than 100K"
Answer: Peter, Paul, Mary
Are we sure that we want to offer a card to Peter?

Custid  income  source
Peter   40K     web
Peter   200K    sales
Paul    400K    sales/web
Mary    110K    web
Mary    130K    sales

Querying Inconsistent Databases
• Aggressive: Get customers who possibly make more than 100K
  • Peter, Paul, Mary
• Conservative: Get customers who certainly make more than 100K
  • Paul, Mary

Formal Semantics
• Related to semantics for querying incomplete data [Imielinski Lipski 84, Abiteboul Duschka 98]
  • Possible world: "complete" databases
• Consistent answers
  • Proposed by Arenas, Bertossi, and Chomicki in 1999
  • Corresponds to conservative semantics
  • Possible world: "consistent" databases

Consistent Answers
Inconsistent database (Key: custid)
custid  income  source
Peter   40K     web
Peter   200K    sales
Paul    400K    sales/web
Mary    110K    web
Mary    130K    sales

Repairs
Repair 1: Peter 40K, Paul 400K, Mary 110K
Repair 2: Peter 40K, Paul 400K, Mary 130K
Repair 3: Peter 200K, Paul 400K, Mary 110K
Repair 4: Peter 200K, Paul 400K, Mary 130K

Consistent Answers
Query q = "Get customers who make more than 100K"

Repair                                      Result of q
Peter 40K, Paul 400K, Mary 110K             Paul, Mary
Peter 40K, Paul 400K, Mary 130K             Paul, Mary
Peter 200K, Paul 400K, Mary 110K            Peter, Paul, Mary
Peter 200K, Paul 400K, Mary 130K            Peter, Paul, Mary

CONSISTENT ANSWER = {Paul, Mary}
CONSISTENT ANSWERS: answers obtained no matter which repair we choose

When We Started…
• Semantics well understood
• Problem
  • Potentially HUGE number of repairs!
  • Negative results [Chomicki et al 02, Arenas et al. 01, Cali et al 04]
  • Few tractability results [Arenas et al. 99, Arenas et al. 01]
  • Logic programming approaches [Bravo and Bertossi 03, Eiter et al. 03]
    • Expressive queries and constraints
    • Computationally expensive
    • Applicable only to small databases with small number of inconsistencies

Our Proposal: ConQuer
[Diagram: an SQL query q and the keys go into ConQuer's Rewriting Algorithm, which produces a rewritten SQL query Q*; a commercial database engine runs Q* over the inconsistent database and returns the consistent answer to q]

Class of Rewritable Queries
• ConQuer handles a broad class of SPJ queries with
  • Set semantics
  • Bag semantics, grouping, and aggregation
• No restrictions on
  • Number of relations
  • Number of joins
  • Conditions or built-in predicates
  • Key-to-key joins
• The class is "maximal"

Why not all SPJ queries?
• Some SPJ queries cannot be rewritten into SQL
  • Consistent query answering is coNP-complete even for some SPJ queries and key constraints
• Maximality of ConQuer's class
  • Minimal relaxations lead to intractability
• Restrictions only on
  • Nonkey-to-nonkey joins
  • Self joins
  • Nonkey-to-key joins that form a cycle

Example: A Rewritable Query
• TPC-H Query 10

    SELECT c_custkey, c_name,
           sum(l_extendedprice * (1 - l_discount)) as revenue,
           c_acctbal, n_name, c_address, c_phone, c_comment
    FROM customer, orders, lineitem, nation
    WHERE c_custkey = o_custkey
      and l_orderkey = o_orderkey
      and o_orderdate >= '1993-10-01'
      and o_orderdate < date('1993-10-01') + 3 MONTHS
      and l_returnflag = 'R'
      and c_nationkey = n_nationkey
    GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
    ORDER BY revenue desc

Rewritings Can Get Quite Complex
[Figure: rewriting of TPC-H Query 10]
Can this rewriting be executed efficiently?
• 1.7 overhead
• 20 GB database, 5% inconsistency

Experimental Evaluation
• Goals
  • Quantify the overhead of the rewritings
  • Assess the scalability of the approach
  • Determine sensitivity of the rewritten queries to level of inconsistency of the instance
• Queries and databases
  • Representative decision support queries (TPC-H benchmark)
  • TPC-H databases, altered to introduce inconsistencies
• Database parameters
  • database size
  • percentage of the database that is inconsistent
  • conflicts per key value (in inconsistent portion)

Scalability
[Plot: Time rewriting / Time original versus Size (GB); 5% inconsistent tuples, 2 conflicts per inconsistent key value]
• Worst case: 5.8 overhead, selectivity 98.56%
• Best case: 1.2 overhead, selectivity 0.001%

Contributions – Theory
• Formal characterization of a broad class of queries
  • For which computing consistent answers is tractable under key constraints
  • That can be rewritten into first-order/SQL
• Query rewriting algorithms for a class of Select-Project-Join queries
  • With set semantics
  • With bag semantics, grouping, and aggregation
• Maximality of the class of queries

Contributions – Practice
• Implementation of ConQuer
  • Designed to compute consistent answers efficiently
  • Multiple rewriting strategies
• Experimental validation of efficiency and scalability
  • Representative queries from TPC-H
  • Large databases

Uncertain Data
PROVENANCE INFORMATION (e.g., source reputation): Web 0.3, Sales 0.7

Web (0.3)
custid  …  income
Peter   …  40K
Paul    …  400K
Mary    …  110K

Sales (0.7)
custid  …  income
Peter   …  200K
Paul    …  400K
Mary    …  130K

Integrated Database
custid  …  income
Peter   …  40K     0.3
Peter   …  200K    0.7
Paul    …  400K    1
Mary    …  110K    0.3
Mary    …  130K    0.7

Publications and Demo
• These and other contributions appear in
  • ICDT05/JCSS06
  • SIGMOD05
  • ICDE06
  • PODS06/TODS06
  • VLDB06
• Demo given at VLDB05
  • http://queens.db.toronto.edu/project/conquer/demo2/

A Virtuous Cycle
Data Integration
Recognize and characterize inconsistent data
Query Answering
Use knowledge about inconsistencies to:
• give better answers
• suggest ways to clean the database

Beyond the Enterprise
• Can we apply principled models of inconsistency or uncertainty to the Web?
• Different assumptions
  • Uncertainty in queries
  • There's never a "true" answer
• Challenge
  • Build models based on user preferences
  • Leverage massive repositories of user behavior data

THANK YOU
Plug: Discovering Data Quality Rules, Fei Chiang, Thursday 11:15am, Research Session 33
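
The conservative and aggressive semantics from the Consistent Answers slides can be sketched in a few lines of Python. This is a minimal illustration, not ConQuer itself: it enumerates every repair of the five-tuple customer example (keeping one tuple per custid), intersects or unions the query results, and then checks that a hand-written SQL rewriting returns the same consistent answers without enumerating repairs. The function names, the table name customer, and the NOT EXISTS rewriting are assumptions made for this example; the rewriting is only valid for this simple single-relation selection query and is not the output of ConQuer's algorithm.

```python
import sqlite3
from itertools import product

# Integrated relation from the slides: (custid, income in thousands).
ROWS = [("Peter", 40), ("Peter", 200), ("Paul", 400), ("Mary", 110), ("Mary", 130)]


def repairs(rows):
    """Yield every repair: keep exactly one tuple per custid key value."""
    groups = {}
    for custid, income in rows:
        groups.setdefault(custid, []).append((custid, income))
    for choice in product(*groups.values()):
        yield list(choice)


def query(rows, threshold=100):
    """q: customers who make more than `threshold`K."""
    return {custid for custid, income in rows if income > threshold}


def consistent_answers(rows):
    """Conservative semantics: answers returned by q on every repair."""
    return set.intersection(*(query(r) for r in repairs(rows)))


def possible_answers(rows):
    """Aggressive semantics: answers returned by q on at least one repair."""
    return set.union(*(query(r) for r in repairs(rows)))


def rewritten_answers(rows):
    """Hypothetical rewriting of q, run directly on the inconsistent table.

    A customer is a consistent answer when some tuple satisfies the
    condition and no tuple with the same key violates it.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (custid TEXT, income INTEGER)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)
    sql = """
        SELECT DISTINCT c.custid
        FROM customer c
        WHERE c.income > 100
          AND NOT EXISTS (SELECT 1 FROM customer d
                          WHERE d.custid = c.custid AND d.income <= 100)
    """
    result = {row[0] for row in conn.execute(sql)}
    conn.close()
    return result


if __name__ == "__main__":
    print(sorted(possible_answers(ROWS)))
    print(sorted(consistent_answers(ROWS)))
    print(sorted(rewritten_answers(ROWS)))
```

The enumeration grows with the product of the conflict-group sizes, which is the "potentially HUGE number of repairs" problem from the slides; the rewritten query avoids it by pushing the check into a single SQL statement that an ordinary database engine can run.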
