Data Quality, Data Cleaning and Treatment of Noisy Data
DIMACS Workshop
November 3-4, 2003
Organizer: Tamraparni Dasu, AT&T Labs - Research

Why DQ?
• Data quality problems are expensive and pervasive
  – DQ problems cost hundreds of billions of $$$ each year.
    • Lost revenues, credibility, customer retention
  – Resolving data quality problems is often the biggest effort in a data mining study.
    • 50%-80% of time in data mining projects spent on DQ
  – Interest in streamlining business operations databases to increase operational efficiency (e.g. cycle times), reduce costs, conform to legal requirements

The Data Quality Continuum
• Data/information is not static, it flows in a data collection and usage process
  – Data gathering
  – Data delivery
  – Data storage
  – Data integration
  – Data retrieval
  – Data mining/analysis
• Problems can and do arise at all of these stages
• End-to-end, continuous monitoring needed

Technical Approaches
• Need a multi-disciplinary approach
  – No single approach solves all problems
• Process management
  – Pertains to data process and flows
  – Checks and controls, audits
• Database
  – Storage, access, manipulation and retrieval
• Metadata / domain expertise
  – Interpretation and understanding
• Analysis
  – Data Mining, Statistics
  – Analysis, diagnosis, model fitting, prediction, decision making …

Meaning of Data Quality - 1
• Conventional definitions: completeness, uniqueness, consistency, accuracy etc.
  – Measurable?
• Modernize definition of DQ with respect to the DQ continuum
• Depends on data paradigms (data gathering, storage)
  – Federated, High dimensional, Descriptive, Longitudinal, Streaming, Web (scraped), Numeric, Text data

DQ Meaning - 2
• Depends on applications (delivery, integration, analysis)
  – Business operations, Aggregate analysis, prediction
  – Customer relations …
• Data Interpretation
  – Know all the rules used to generate the data
• Data Suitability
  – Use of proxy data
  – Relevant data is missing
Increased DQ → increased reliability and usability (directionally correct)

Workshop
• Talks cover different aspects of the complex DQ issue
• Outstanding set of speakers from academia, industrial labs and industry
• Cover theoretical, methodological, applied aspects, including case studies!
• From a wide range of disciplines and areas
Welcome!

Renee Miller
• University of Toronto
• Renee is an Associate Professor of Computer Science at the University of Toronto. S.B., Mathematics, MIT. S.B., Cognitive Science, MIT. Ph.D., Computer Science, U. Wisconsin-Madison.
• Heterogeneous databases, data mining, and data warehousing.
• “Managing Inconsistency in Data Exchange and Integration”

Grace Zhang
• Morgan Stanley Institutional Equity Division IT. Master of Philosophy in Computer Science from Columbia University, and a Master's and B.S. in Computer Science from Zhongshan University, China.
• Develops tools to check data quality issues in equity trading data; designs and builds the standard destination referential data repository.
• “Data Quality in Trading Surveillance”

Ted Johnson
• AT&T Labs - Research
• Database Research department. B.S. in Mathematics, Johns Hopkins University. Ph.D. in Computer Science, New York University, 1990.
• Data warehousing and data mining
• “Bellman - A Data Quality Browser”

Ron Pearson
• Daniel Baugh Institute for Functional Genomics and Computational Biology, Thomas Jefferson University. B.S.
in physics from the University of Arkansas at Monticello, and M.S.E.E. and Ph.D. in electrical engineering from M.I.T. in 1982.
• Design and analysis of nonlinear digital filters, exploratory data analysis and the validation of analytical results.
• “The Data Cleaning Problem -- Some Key Issues and Practical Approaches”

Dhammika Amaratunga, Javier Cabrera, Nandini Raghavan
• Johnson & Johnson, Rutgers University, Johnson & Johnson
• “Pre-processing of Microarray Data”

S. Muthukrishnan
• Rutgers University, AT&T Labs - Research
• Associate Professor of Computer Science
• Design and analysis of algorithms
• “Checks and Balances: Monitoring Data Quality Problems in Network Traffic Databases”

T. Bonates, P. Hammer, A. Kogan, and I. Lozina
• RUTCOR, Rutgers University
• Operations Research
• “Maximum Patterns and Outliers in the Logical Analysis of Data (LAD)”

Jiawei Han
• Professor, Simon Fraser University. Currently at the University of Illinois at Urbana-Champaign. Ph.D. from University of Wisconsin, Madison in 1985.
• Data mining (knowledge discovery in databases), data warehousing, spatial databases, multimedia databases, deductive and object-oriented databases, and logic programming
• “Data Mining: A Powerful Tool for Data Cleaning”

Jon Hill
• British Telecommunications
• Jon leads a team of information experts to deliver solutions within asset management, process control and billing assurance. Jon uses a wide range of information quality (IQ) tools within projects and has extensive experience in investigating and solving IQ problems.
• “A $220 Million Success Story”

G. Vesonder, J. Wright & T. Dasu
• AT&T Labs - Research
• Head of Adaptive Systems research
• AI, Knowledge Engineering, Expert Systems
• “Life Cycle Datamining”

Andrew Hume
• AT&T Labs - Research
• Very large data systems, string searching, performance measurement
• Tamed many legacy systems
• “Managing Data Streams”

Bing Liu
• Associate Professor at the National University of Singapore, on leave at University of Illinois at Chicago
• Data mining and knowledge discovery; web, text and image mining; bioinformatics
• “Web page cleaning for web data mining”

R.K. Pearson and M. Gabbouj
• Collaboration with Moncef Gabbouj from the Tampere University of Technology in Finland.
• “Relational Nonlinear FIR Filters”

Thank you!