Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data Quality Simon Razniewski In collaboration with Werner Nutt Free University of Bozen-Bolzano Introduction • Simon Razniewski PhD Student at the FUB – Data quality – Data completeness • Werner Nutt Professor in Computer Science at the FUB Focus in research and teaching: – Data management, data modelling – Data integration – Incomplete information Data Quality 10.8.2011 - EURAC 2 Why data quality? • Data are the basis for (scientific) conclusions about the world • Conclusions only as good as the data they are based on • Low-quality data low-quality conclusions Data Quality 10.8.2011 - EURAC 3 Some effects of erroneous data are funny Man invited for pre-natal check Data Quality 10.8.2011 - EURAC 4 Some data errors are long-living Spinach contains much iron 100g of spinach contain 35mg of iron Gustav v. Bunge 1890 100g spinach contain only 3,5mg of iron Data Quality 10.8.2011 - EURAC 5 Some data errors are mysterious Student records in in Georgia (USA), 2009 ?? 19.000 students leave their school to change to another … but arrive nowhere Data Quality 10.8.2011 - EURAC 6 Overview • What are data used for? – Data model the real world • What can go wrong? – Wrong, outdated, missing data • What can one do for – Correctness – Currency – Completeness of data? Data Quality 10.8.2011 - EURAC 7 Data model the real world Real world: Students, teachers, classes Database: Tables HOB Bozen Class 2A Anna Paul Diego Maria We analyze the data (instead of the real world) and draw (scientific) conclusions data determines our conclusions Data Quality 10.8.2011 - EURAC 8 Questions about students • „How many students are there in the class 2A of the HOB Bozen?“ • „What is the average age of the students of this class?“ • „How many students play an instrument?“ Data Quality 10.8.2011 - EURAC 9 Table „Students“ Name Date of birth School Class Paul 7.4.1995 HOB Bolzen 2A Anna 3.8.1959 HOB Bozen 1A Diego ? HOB Meran 2A What is the average age of the students of the class 2A of the HOB Bozen? Data Quality 10.8.2011 - EURAC 10 Many things can go wrong Name Date of birth School Class Paul 7.4.1995 HOB Bolzen 2A Anna 3.8.1959 HOB Bozen 1A Diego ? HOB Meran 2A What is the average age of the students of the class 2A of the HOB Bozen? Data Quality 10.8.2011 - EURAC 11 Typos Name Date of birth School Class Paul 7.4.1995 HOB Bolzen 2A Anna 3.8.1959 HOB Bozen 1A Diego ? HOB Meran 2A date of birth of Anna school of Paul Data Quality 10.8.2011 - EURAC 12 Factual errors Name Date of birth School Class Paul 7.4.1995 HOB Bolzen 2A Anna 3.8.1959 HOB Bozen 1A Diego ? HOB Meran 2A school of Diego (“HOB Meran“ instead of “HOB Bozen“) Data Quality 10.8.2011 - EURAC 13 Outdated entries Name Date of birth School Class Paul 7.4.1995 HOB Bolzen 2A Anna 3.8.1959 HOB Bozen 1A Diego ? HOB Meran 2A class of Anna (“1A“ instead of “2A“) Data Quality 10.8.2011 - EURAC 14 Missing values Name Date of birth School Class Paul 7.4.1995 HOB Bolzen 2A Anna 3.8.1959 HOB Bozen 1A Diego ? HOB Meran 2A date of birth of Diego (“Null value“) Data Quality 10.8.2011 - EURAC 15 Missing records Name Date of birth School Class Paul 7.4.1995 HOB Bolzen 2A Anna 3.8.1959 HOB Bozen 1A Diego ? HOB Meran 2A Maria 12.10.1995 HOB Bozen 2A the record about Maria is missing Data Quality 10.8.2011 - EURAC 16 Missing concepts Name Date of birth School Class Instrument Paul 7.4.1995 HOB Bolzen 2A Cello Anna 3.8.1959 HOB Bozen 1A ? Diego ? HOB Meran 2A ? no possibility to store information about music instruments Data Quality 10.8.2011 - EURAC 17 What can be done? There is a distinction between different dimensions of data quality The most important ones are • Correctness Does the data match the real world? • Timeliness Is the data up-to-date? • Completeness Are all aspects of the domain of interest captured? Further: Comprehensibility, accessability, … Datenqualität 10.8.2011 - EURAC 18 Dimension 1: Correctness IT-techniques: 1. Detecting typos or statistical outliers students born in 1959 2. Recognizing duplicates Mohammad Al Zaïn = Muhamad Alzain 3. Rules for logical consistency no student can visit two schools at the same time Organisation: Special treatment of core data: Master data management For example: students, teachers, schools Datenqualität 10.8.2011 - EURAC 19 Dimension 2: Timeliness • By workflow organisation: Bind workflows onto the IT system Timeliness is guaranteed Example: an enrolment is only valid if it is recorded in the database • Trough data about the currency of the data (metadata) Timeliness can be estimated Example: “All dropouts until 31th of March are recorded“ Datenqualität 10.8.2011 - EURAC 20 Dimension 3/1: Completeness of values • Can be enforced by the IT system Risk: nonsensical entries • Alternative solution: enforce input of less values Record reasons for missing values E.g. “Not applicable” or “Unknown” Datenqualität 10.8.2011 - EURAC 21 Dimension 3/3: Conceptual completeness • Solid design is important, but not everything can be foreseen • Flexible IT: Schema changes if necessary – Space for comments, additional information • Otherwise: Other fields will be abused Example: Gasworks in the USA Warning of dogs for meter-readers Address Mountain Road 102 (Beware of dog) … later they send bills Datenqualität 10.8.2011 - EURAC 22 Dimension 3/2: Table completeness • Events are completely recorded, if they are bound to the IT system Example: Sales in a supermarket • In general, this binding is not possible only parts of the database tables are complete • But: Completeness is only necessary for specific uses Example: school statistics from ASTAT Research Datenqualität 10.8.2011 - EURAC 23 Partial table completeness • Common scenario: We have – Some, but not all data complete – Questions (‘‘queries“) over data • Problems: – Do we have the data that is needed to answer the queries? If not: – What more data do we need? Data Quality 10.8.2011 - EURAC 24 An (intuitive) example • Suppose we have data about all students from – Italian schools – German schools, except of the primary school ‘‘Andreas Hofer“ – Ladin schools, except of the high school “Gherdëna“ • Can we correctly answer questions about the italian students in South Tyrol? Yes, because we have all data about students from italian schools Data Quality 10.8.2011 - EURAC 25 An (intuitive) example (2) • Suppose we have data about all students from – Italian schools – German schools, except of the primary school ‘‘Andreas Hofer“ – Ladin schools, except of the high school “Gherdëna“ • Can we answer questions about the high-school students in South Tyrol? No, because data from the “Gherdëna“ high school is missing We could bug them to submit their data (but maybe the secretary is on holiday) We could ask someone else for the data, e.g., the local district administration Data Quality 10.8.2011 - EURAC 26 Our research • How can one describe that data is complete to a certain extent? • How can one find out whether the data one has is sufficient for a certain use? • How can one find out which data is necessary to serve a certain use? Data Quality 10.8.2011 - EURAC 27 Formal example How many students attend an Italian school? SELECT count(*) FROM student, school WHERE student.school = school.name AND school.language = ‘italian‘; Suppose, we have all Italian students. Can we answer this query completely? Data Quality 10.8.2011 - EURAC 28 How can we formalize table completeness? “We have all students from italian schools“ • We imagine: an ideal database that contains complete information about the world • Completeness statements refer to this ideal database: => All ideal students from Italian schools occur among our real students Data Quality 10.8.2011 - EURAC 29 How can we assert (partial) completeness of tables? “We have all students from italian schools“ Table completeness assertion: real.student CONTAINS( SELECT ideal.student.* FROM ideal.student, ideal.school WHERE ideal.student.school = ideal.school.name AND ideal.school.language = ‘Italian‘) Table completeness assertions constitute a logical theory about real and ideal database Data Quality 10.8.2011 - EURAC 30 What does it mean that our query is complete? Consider two versions: “Real query“ “Ideal query“ SELECT count(*) FROM real.student, real.school WHERE real.student.school = real.school.name AND real.school.language = ‘italian‘; SELECT count(*) FROM ideal.student, ideal.school WHERE ideal.student.school = ideal.school.name AND ideal.school.language = ‘italian‘; Our query is complete if the real and the ideal query return the same results (Can be expressed in logic, too) => Reasoning Data Quality 10.8.2011 - EURAC 31 Our results so far • Formalization • General reasoning procedures for – Single block SQL queries – With comparisons – Group By – Aggregate functions min, max, count, sum • Complexity analysis (sometimes high!) • Architecture for reasoning system • “Inverse reasoning” (see later slide) This is a start, many things are still missing Data Quality 10.8.2011 - EURAC 32 Reasoning with schema information • To draw interesting inferences, we need to take into account – Keys – Foreign keys – Finite domains ~> Reasoning becomes more complicated (Current research) Data Quality 10.8.2011 - EURAC 33 Inverse reasoning • So far: Given: Assertions about table completeness Question: Can query Q be answered completely? • Also interesting: Given: query Q Question: which are the minimal completeness assertions that assure completeness of Q? • Can be answered by applying our inference methods backwards Data Quality 10.8.2011 - EURAC 34 Perspective: Probabilistic completeness management • Our theory so far: Boolean statements: complete/not complete • In practice, it is often sufficient to know “With probability < p, we make an error < ε“ • Probabilistic assertions: “With 90% probability, we are not missing more than 5 students“ => Probabilistic inferences Data Quality 10.8.2011 - EURAC 35 Conclusion • Data quality has several dimensions – Correctness, timeliness, completeness • Our current interest – How can one describe which data are complete – How can one find out which queries can be answered completely? – If not, which additional data is needed? • Perspective: Probabilistic completeness management Data Quality 10.8.2011 - EURAC 36