Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CPT-S 580-06 Advanced Databases Yinghui Wu EME 49 1 Advanced Databases Data quality management: An introduction The data quality problems Central aspects of data quality – – – – – Data consistency Data accuracy Entity resolution (record matching) Information completeness Data currency A real-life encounter Mr. Smith, our database records indicate that you owe us an outstanding amount of £5,921 for council tax for 2007 NI# name AC phone street city zip … … … … … … … SC35621422 M. Smith 131 3456789 Crichton EDI EH8 9LE SC35621422 M. Smith 020 6728593 Baker LDN NW1 6XE Mr. Smith already moved to London in 2006 The council database had not been correctly updated – both old address and the new one are in the database 50% of bills have errors (phone bill reviews, 1992) 3 Customer records country AC phone street city zip 44 131 1234567 Mayfield New York EH8 9LE 44 131 3456789 Crichton New York EH8 9LE 01 908 3456789 Mountain Ave New York 07974 Anything wrong? New York City is moved to the UK (country code: 44) Murray Hill (01-908) in New Jersey is moved to New York state Error rates: 10% - 75% (telecommunication) 4 Another example T.Das|97336o8327|24.95|Y|-|0.0|1000 Ted J.|973-360-8779|2000|N|M|NY|1000 Can we interpret the data? – What do the fields mean? – What is the key? The measures? Data glitches – Typos, multiple formats, missing / default values Metadata and domain expertise – Field three is Revenue. In dollars or cents? – Field seven is Usage. Is it censored? • Field 4 is a censored flag. How to handle censored data? Data in real-life are often dirty Pentagon asked 200+ dead officers to re-enlist 81 million National Insurance numbers but only 60 million eligible citizens Data quality problems are expensive hundreds of billion $$$ each year. 98000 deaths cost each Resolving data quality problems is year, caused by errors often the biggest effort in databases in medical data research • In a 500,000 customer database,500,000 120,000dead customer records people become invalid within 12 months retain active Medicare • Data error rates in industry: 1% - cards 30% (Redman, 1998) Dirty data: inconsistent, inaccurate, incomplete, stale 6 The need for data quality tools Manual effort: beyond reach in practice Data quality tools: to help automatically Reasoning Discover rules Repair Editing a sample of census Detect errors data easily took dozens of clerks months (Winkler 04, US Census Bureau) The market for data quality tools is growing at 17% annually >> the 7% average of other IT segments 2006 7 Conventional Definition of Data Quality Consistency – The data agrees with itself. Accuracy – The data was recorded correctly Uniqueness – Entities are recorded once. Completeness – All relevant data was recorded. Timeliness – The data is kept up to date. • Special problems in federated data: time consistency. Challenges Unmeasurable – Accuracy and completeness are extremely difficult, perhaps impossible to measure. Context independent – No accounting for what is important. E.g., if you are computing aggregates, you can tolerate a lot of inaccuracy. Incomplete – interpretability, accessibility, metadata, analysis, etc. Vague – The conventional definitions provide no guidance towards practical improvements of the data. Central aspects of data quality Data consistency Data accuracy Entity resolution (record matching) Information completeness Data currency 10 Data consistency 11 Inconsistencies of data [country = 44, zip] [street] In the UK, zip code uniquely determines the street country area-code phone street city zip 44 131 1234567 Mayfield NYC EH8 9LE 44 131 3456789 Crichton NYC EH8 9LE 01 908 3456789 Mountain Ave NYC 07974 Observation Do not hold on the entire relation (e.g., customers in the US), but on tuples representing the UK customers only Data dependencies over a fraction of tuples 12 Inconsistent relation type(Reagan, president) [10] married(Reagan, Davis) [10] married(Elvis,Priscilla) [10] spouse(X,Y) & spouse(X,Z) => Y=Z [10] occurs("Hermione","loves","Harry") [3] means("Ron",RonaldReagan) [3] means("Ron",RonaldWeasley) [2] ... Possible World 1: married We are given a set of facts, and wish to find the most plausible possible world. (F. Suchanek et al.: WWW‘09) occurs(X,P,Y) & means(X,X') & means(Y,Y') & R(X',Y') P~R occurs(X,P,Y) & means(X,X') & means(Y,Y') & P~R R(X',Y') spouse(X,Y) & spouse(X,Z) Y=Z Possible World 2: married Data Accuracy CC AC FN LN age 44 131 Mark Smith 65 19 is more accurate occupation city student (HS) EDI zip EH8 9LE Consistency rule: age < 120. The record is consistent. Is it accurate? Context: if we know Mark is going to high school, then age 19 [occupation = student (HS)] [age 19] A uniform framework for improving data consistency and accuracy Entity resolution 15 Entity resolution (Record matching) To identify records from unreliable data sources that refer to the same real-world entity FN Mark LN address Smith 10 Oak St, EDI, EH8 9LE tel DOB gender 3256777 10/27/97 M the same person? FN LN post phn when M. Smith 10 Oak St, EDI, EH8 9LE null 1pm/7/7/09 EDI $3,500 … … … … … … … NYC $6,300 Max Smith PO Box 25, EDI 3256777 2pm/7/7/09 where amount Record linkage, entity resolution, data deduplication, merge/purge, … 16 Knowledge graph construction Dubya 0.9 George W. Bush 0.9 0.6 0.6 0.6 Bush 0.8 President Bush 0.4 0.3 President Bill Clinton Bill Clinton Hillary Clinton George H. W. Bush 17 Data completeness TDD (ln6) 18 Incomplete information: a central data quality issue A database D of UK patients: patient (name, street, city, zip, YoB) A simple query Q1: Find the streets of those patients who were born in 2000 (YoB), and live in Edinburgh (Edi) with zip = “EH8 9AB”. Can we trust the query to find complete & accurate information? Both tuples and values may be missing from D! “information perceived as being needed for clinical decisions was unavailable 13.6%--81% of the time” (2005) 19 Traditional approaches: The CWA vs. the OWA Real world The Closed World Assumption (CWA) – all the real-world objects are already represented by tuples in the database – missing values only The Open World Assumption (OWA) – the database is a subset of the tuples representing real-world objects – missing tuples and missing values database Real world database Few queries can find a complete answer under the OWA Misleading analysis, biased decisions, disastrous consequences (enterprises, finance, scientific,TDD administrative, medical, … ) (ln6) 20 In real-world applications Master data (reference data): a consistent and complete repository of the core business entities of an enterprise (certain categories) OWA Master data The CWA: the master data – an upper bound of the part constrained The OWA: the part not covered by the master data Databases in real world are often neither entirely closed-world, nor entirely open-world 21 Relative information completeness Partially closed databases: partially constrained by master data; neither CWA nor OWA Relative completeness: a partially closed database that has complete information to answer a query relative to master data The completeness and consistency taken together: containment constraints Fundamental problems: – Given a partially closed database D, master data Dm, and a Matching complexity bounds are to Dm query Q, decide whether D is complete Q for relatively in place; efficient – Given master data Dmalready and a query Q, but decide whether there exists a partially closed databaseremain D that to is be complete for Q algorithms developed relatively to Dm A theory of relative information completeness 22 Data currency 23 Data currency: another central data quality issue Data currency: the state of the data being current Data get obsolete quickly: “In a customer file, within two years about 50% of record may become obsolete” (2002) Multiple values pertaining to the same entity are present The values were once correct, but they have become stale and inaccurate Reliable timestamps are often not available Identifying stale data is costly and difficult How can we tell when the data are current or stale? 24 Determining the currency of data FN LN address salary status Mary Smith 2 Small St 50k single Mary Dupont 10 Elm St 50k married Mary Dupont 6 Main St 80k married Bob Luth 8 Cowan St 80k married Robert Luth 6 Drum St 55k married Entities: Mary Robert Identified via record matching Q1: what is Mary’s current salary? 80k Temporal constraint: salary is monotonically increasing Determining data currency in theTDD absence of timestamps (ln6) 25 Temporal constraints FN LN address salary status Mary Smith 2 Small St 50k single Mary Dupont 10 Elm St 50k married Mary Dupont 6 Main St 80k married Bob Luth 8 Cowan St 80k married Robert Luth 6 Drum St 55k married Q2: what is Mary’s current last name? Dupont Temporal constraints: • • Marital status only changes from single married divorced Tuples with the most current marital status also have the most current last name TDD (ln6) Temporal constraints to specify the currency of correlated attributes 26 Data currency Data currency model: • Partial temporal orders, temporal constraints, copy functions Certain current query answering: answering queries with the current values of entities (over all possible “consistent completions” of the partial temporal orders) Fundamental problems: Given partial temporal orders, temporal constraints and copy functions, to decide – whether a value is certainly more current than another? – whether a tuple is a certainFundamental current answer to a query? problems have been – whether the copy functionsstudied; are currency preserving? but efficient algorithms are No matter how the copy functions extended, the certain not yet inare place currency answers remain unchanged There is much more to be doneTDD (ln6) 27 Dependencies: A promising approach Errors found in practice – Syntactic: a value not in the corresponding domain or range, e.g., name = 1.23, age = 250 – Semantic: a value representing a real-world entity different from the true value of the entity, e.g., CIA found WMD in Iraq Dependencies: for specifying the semantics of relational data – relation (table): a setand of tuples (records) Hard to detect fix NI# name AC phone street city zip SC35621422 M. Smith 131 3456789 Crichton EDI EH8 9LE SC35621422 M. Smith 020 6728593 Baker LDN NW1 6XE How can dependencies help? 28 Functional Dependencies (FDs) NI# street, city, zip NI# determines address: for any two records, if they have the same NI#, then they must have the same address for each distinct NI#, there is a unique current address NI# name AC phone street city zip SC35621422 M. Smith 131 3456789 Crichton EDI EH8 9LE SC35621422 M. Smith 020 6728593 Baker LDN NW1 6XE for SC35621422, at least one of the addresses is not up to date Functional dependencies help detect errors in a single relation 29 Inclusion Dependencies (INDs) book[asin, title, price] item[asin, title, price] book item asin isbn title price a23 b32 Harry Potter 17.99 a56 b65 Snow white 7.94 asin title type price a23 Harry Potter book 17.99 a12 J. Denver CD 7.94 Any book sold by a store must be an item carried by the store – for any book tuple, there must exist an item tuple such that their asin, title and price attributes pairwise agree with each other Inclusion dependencies help detect errors across relations 30 More reasons for using dependencies Capturing a fundamental part of the semantics of data – errors and inconsistencies as violations of dependencies Techniques and inference systems are in place for reasoning about dependencies (data quality rules) – remove redundant rules – identify “dirty” rules – … Profiling – automated discovery of dependencies (data quality rules) ... Dependencies should logically become part of data cleaning process 31 However, … Dependencies were developed for improving the quality of schema country area-code phone street city zip 44 131 1234567 Mayfield NYC EH8 9LE 44 131 3456789 Crichton NYC EH8 9LE 01 908 3456789 Mountain Ave NYC 07974 functional dependencies (FDs) country, area-code, phone street, city, zip country, area-code city The database satisfies the FDs, but the data is not clean! A central problem is how to tell whether the data is dirty or clean 32 Summary Data quality: The No.1 problem for data management Real life data are dirty, dirty data are costly – The quest for a principled approach Many challenges remain – – – – – Effective algorithms for certain fixes (minimum user interaction) Efficient algorithms for determining information completeness Efficient algorithms for deciding data currency Data accuracy Putting all together: Interaction between central issues of data quality telecommunication, life sciences, finance, e-government, … Data quality: A rich source of questions and vitality 33