Data Preprocessing
Opas Wongtaweesap (Aj’OaT)
Intelligent Information Systems Development and Research Laboratory Centre
URL: http://www.facebook.com/OaTCoM, http://twitter.com/OaTCoM
E-mail: [email protected], [email protected]
Data Mining & WEKA

Agenda
• Why preprocess the data?
• Data Cleaning
• Data Integration
• Data Transformation
• Summary

Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– A data warehouse needs consistent integration of quality data

Multi-Dimensional Measure of Data Quality
• A well-accepted multidimensional view:
– Accuracy, Completeness, Consistency, Timeliness, Believability, Value added, Interpretability, Accessibility
• Broad categories:
– intrinsic, contextual, representational, and accessibility

Data Preprocessing Tasks
• Data Cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data Integration
– Integration of multiple databases, data cubes, or files
• Data Transformation
– Normalization and aggregation
• Data Reduction
– Obtains a reduced representation of the data, smaller in volume but producing the same or similar analytical results
• Data Discretization
– Part of data reduction, with particular importance for numerical data
Forms of Data Preprocessing
[Figure: forms of data preprocessing — cleaning, integration, transformation, reduction]

Data Cleaning
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data

Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistency with other recorded data, leading to deletion
– data not entered due to misunderstanding
– certain data not being considered important at the time of entry
– failure to register history or changes of the data
• Missing data may need to be inferred

How to Handle Missing Data?
• Ignore the tuple
– usually done when the class label is missing (assuming a classification task); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
• Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree
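The mean-based fill strategies above can be sketched in plain Python. The dataset, attribute, and function names below are illustrative, not from the slides:

```python
from statistics import mean

# Toy dataset: (class_label, income); None marks a missing value.
rows = [
    ("gold", 30000), ("gold", None), ("gold", 34000),
    ("silver", 18000), ("silver", 20000), ("silver", None),
]

def fill_with_attribute_mean(rows):
    """Replace each missing income with the overall attribute mean."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]

def fill_with_class_mean(rows):
    """Replace each missing income with the mean of the same class (smarter)."""
    by_class = {}
    for c, v in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, class_mean[c] if v is None else v) for c, v in rows]

print(fill_with_attribute_mean(rows)[1])  # ('gold', 25500) — global mean
print(fill_with_class_mean(rows)[1])      # ('gold', 32000) — gold-class mean
```

The class-conditional mean is "smarter" because it conditions the estimate on information that is rarely missing (the class label), so a gold customer's missing income is filled with the typical gold income rather than the population average.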
Data Integration
• Data integration
– combines data from multiple sources into a coherent store
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real-world entities from multiple data sources, e.g., Sex = Gender, StdCode = StudentCode
• Detecting and resolving data value conflicts
– for the same real-world entity, attribute values from different sources are different
– possible reasons: different representations, different scales, e.g., Coke = Coca-Cola, Pepsi

Handling Redundant Data in Data Integration
• Redundant data occur often when integrating multiple databases
– The same attribute may have different names in different databases
– One attribute may be a “derived” attribute in another table
• Redundant data may be detected by correlation analysis
• Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scale values to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction
– New attributes constructed from the given ones
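The correlation analysis mentioned above for spotting redundant attributes can be sketched with the Pearson coefficient. The two attributes below are hypothetical, chosen so one is exactly derived from the other:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; values near +/-1 suggest one
    attribute is redundant given the other."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical attributes: annual_income and a "derived" monthly_income.
annual = [12000, 24000, 36000, 48000]
monthly = [1000, 2000, 3000, 4000]          # exactly annual / 12
print(round(pearson(annual, monthly), 6))   # 1.0 -> clearly redundant
```

A coefficient near +1 or −1 flags a candidate for removal; a value near 0 means the attributes carry independent information and both should be kept.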
Normalization
• min-max normalization:
v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
• z-score normalization:
v' = (v − mean_A) / stand_dev_A
• normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1

Discretization
• Three types of attributes:
– Nominal — values from an unordered set
– Ordinal — values from an ordered set
– Continuous — real numbers
• Discretization:
– divide the range of a continuous attribute into intervals
– Some classification algorithms only accept categorical attributes
– Reduce data size by discretization
– Prepare for further analysis

Discretization Example I
[Figure: discretization example]

Discretization Example II
[Figure: discretization example]

Discretization and Concept Hierarchy
• Discretization
– reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
• Concept hierarchies
– reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior), e.g., Age < 15 is young.

Concept Hierarchies
• A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.
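The three normalization formulas, plus a simple equal-width discretization, can be sketched directly in Python. The sample values passed in at the bottom are illustrative, not taken from the slides:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization: map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, stdev_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / stdev_a

def decimal_scaling(v, j):
    """Decimal scaling: j is the smallest integer such that max(|v'|) < 1."""
    return v / 10 ** j

def equal_width_bins(v, min_a, max_a, k):
    """Equal-width discretization: map v to one of k interval labels 0..k-1."""
    width = (max_a - min_a) / k
    return min(int((v - min_a) / width), k - 1)

print(min_max(73600, 12000, 98000))        # ~0.716
print(z_score(73600, 54000, 16000))        # 1.225
print(decimal_scaling(986, 3))             # 0.986
print(equal_width_bins(73600, 12000, 98000, 3))  # interval label 2
```

Note that min-max normalization needs the observed min/max in advance and is sensitive to outliers, while z-score normalization is the usual choice when the actual range is unknown or outliers dominate.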
– Example (distinct values per attribute):
country (15) < province_or_state (65) < city (3,567) < street (674,339)

Data Mining: A KDD Process
[Figure: the KDD process — cleaning, integration, selection, transformation, mining, evaluation]

Summary
• Data preparation is a big issue for both warehousing and mining
• Data preparation includes
– Data cleaning and data integration
– Data reduction and feature selection
– Discretization
• Many methods have been developed, but this is still an active area of research
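The distinct-value heuristic for automatically generating a concept hierarchy (country above province, street at the bottom) can be sketched as follows; the tiny table is illustrative, standing in for the slide's 15/65/3,567/674,339 counts:

```python
def auto_hierarchy(table):
    """Order attributes by distinct-value count: fewest distinct values at
    the top of the concept hierarchy, most distinct at the lowest level."""
    counts = {attr: len(set(col)) for attr, col in table.items()}
    return sorted(counts, key=counts.get)

# Tiny illustrative table: column name -> list of values.
table = {
    "country": ["TH", "TH", "US", "US"],                      # 2 distinct
    "province_or_state": ["Bangkok", "Bangkok", "CA", "NY"],  # 3 distinct
    "street": ["s1", "s2", "s3", "s4"],                       # 4 distinct
}
print(auto_hierarchy(table))  # ['country', 'province_or_state', 'street']
```

The heuristic works because broader concepts naturally have fewer distinct values, though it can fail for attributes like month-of-year, where 12 distinct values do not imply a high level of abstraction.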