Principles of Knowledge Discovery in Databases
Fall 1999
Chapter 3: Data Preprocessing
Dr. Osmar R. Zaïane, University of Alberta

Summary of Last Chapter
• What is a data warehouse and what is it for?
• What is the multi-dimensional data model?
• What is the difference between OLAP and OLTP?
• What is the general architecture of a data warehouse?
• How can we implement a data warehouse?
• Are there issues related to data cube technology?
• Can we mine data warehouses?

Course Content
• Introduction to Data Mining
• Data warehousing and OLAP
• Data cleaning
• Data mining operations
• Data summarization
• Association analysis
• Classification and prediction
• Clustering
• Web Mining
• Similarity Search
• Other topics if time permits

Chapter 3 Objectives
• Realize the importance of data preprocessing for real-world data before data mining or the construction of data warehouses.
• Get an overview of some data preprocessing issues and techniques.

Data Preprocessing Outline
• What is the motivation behind data preprocessing?
• What is data cleaning and what is it for?
• What is data integration and what is it for?
• What is data transformation and what is it for?
• What is data reduction and what is it for?
• What is data discretization?
• How do we generate concept hierarchies?

Motivation
In real-world applications, data can be inconsistent, incomplete, and/or noisy.
Errors can happen because of:
• Faulty data collection instruments
• Data entry problems
• Human misjudgment during data entry
• Data transmission problems
• Technology limitations
• Discrepancies in naming conventions
The results:
• Duplicated records
• Incomplete data
• Contradictions in the data
Motivation (Con't)
Data flows from the data warehouse, through data mining, to a decision. What happens when the data cannot be trusted? Can the decision be trusted? Decision making is jeopardized. There is a better chance to discover useful knowledge when the data is clean.

Data Preprocessing
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction

Data cleaning attempts to:
• Fill in missing values
• Smooth out noisy data
• Correct inconsistencies

Solving Missing Data
• Ignore the tuple with missing values;
• Fill in the missing values manually;
• Use a global constant to fill in missing values (NULL, unknown, etc.);
• Use the attribute value mean to fill in missing values of that attribute;
• Use the attribute mean for all samples belonging to the same class to fill in the missing values;
• Infer the most probable value to fill in the missing value.

Smoothing Noisy Data
The purpose of data smoothing is to eliminate noise. This can be done by:
• Binning
• Clustering
• Regression
Regression consists of fitting the data to a function. A linear regression, for instance, finds the line that fits two variables so that one variable can predict the other. More variables can be involved in a multiple linear regression.
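As a concrete illustration of the binning option listed above, here is a minimal Python sketch; the function name, the rounding of bin means, and the tie-breaking toward the lower boundary are illustrative assumptions, not part of the slides:

```python
def smooth_by_bins(values, depth, method="means"):
    """Sort, split into equi-depth bins, then smooth each bin locally.

    method="means":      replace each value by its bin's (rounded) mean.
    method="boundaries": replace each value by the closest bin boundary.
    """
    data = sorted(values)                      # get values "in their neighbourhoods"
    out = []
    for i in range(0, len(data), depth):       # equi-depth bins
        bin_ = data[i:i + depth]
        if method == "means":
            rep = round(sum(bin_) / len(bin_))
            out.extend([rep] * len(bin_))
        else:
            lo, hi = bin_[0], bin_[-1]
            out.extend(lo if v - lo <= hi - v else hi for v in bin_)
    return out
```

Running it on the deck's binning example (4, 8, 15, 21, 21, 24, 25, 28, 34 with bins of depth 3) reproduces the smoothed bins shown on the binning slide.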
Data Cleaning
Real-world application data can be incomplete, noisy, and inconsistent.
Values may be missing because there were no recorded values for some attributes, the values were not considered at the time of entry, or random errors occurred. Data cleaning fills in missing values, smooths out noisy data, and corrects inconsistencies.

[Regression figure: a line y = x + 1 fitted to the data; for a point X1, the fitted line gives the smoothed value Y1' in place of the observed Y1.]

Binning
Binning smooths the data by consulting the value's neighbourhood.
First, the data is sorted to get the values "in their neighbourhoods".
Second, the data is distributed into equi-depth bins.
Ex: 4, 8, 15, 21, 21, 24, 25, 28, 34
Bins of depth 3: Bin1: 4, 8, 15; Bin2: 21, 21, 24; Bin3: 25, 28, 34
Third, perform local smoothing (by bin means, bin median, or bin boundaries):
Smoothing by bin means: Bin1: 9, 9, 9; Bin2: 22, 22, 22; Bin3: 29, 29, 29
Smoothing by bin boundaries: Bin1: 4, 4, 15; Bin2: 21, 21, 24; Bin3: 25, 25, 34

Clustering
Data is organized into groups of "similar" values. Rare values that fall outside these groups are considered outliers and are discarded.

Data Integration
Data analysis may require the combination of data from multiple sources into a coherent data store. There are many challenges:
• Schema integration: CID ≈ C_number ≈ Cust-id ≈ cust#
• Semantic heterogeneity
• Data value conflicts (different representations or scales, etc.)
• Redundant records
• Redundant attributes (an attribute is redundant if it can be derived from other attributes)
• Correlation analysis can detect redundant attributes: P(A∧B)/(P(A)P(B)) = 1: independent; > 1: positive correlation; < 1: negative correlation.
Metadata is often necessary for integration.

Data Transformation
Data is sometimes in a form not appropriate for mining: either the algorithm at hand cannot handle it, the form of the data is not regular, or the data itself is not specific enough. Transformations include:
• Normalization (to compare carrots with carrots)
• Smoothing
• Aggregation (summary operations applied to the data)
• Generalization (low-level data is replaced with higher-level data using a concept hierarchy)

Normalization
Min-max normalization: a linear transformation from v to v':
v' = (v - min)/(max - min) × (newmax - newmin) + newmin
Ex: transform $30,000 in [$10,000..$45,000] into [0..1]:
(30,000 - 10,000)/35,000 × (1 - 0) + 0 ≈ 0.571
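The min-max arithmetic above can be checked with a short sketch (the function and parameter names are illustrative):

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    # v' = (v - min)/(max - min) * (newmax - newmin) + newmin
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# $30,000 in [$10,000..$45,000] mapped into [0..1]
print(round(min_max(30_000, 10_000, 45_000), 3))   # → 0.571
```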
Z-score normalization: normalize v into v' based on the attribute's mean and standard deviation:
v' = (v - mean)/standard_deviation

Normalization by decimal scaling: move the decimal point of v by j positions, where j is the minimum number of positions such that the maximum absolute value falls in [0..1]:
v' = v/10^j
Ex: if v ranges between -56 and 9,976, then j = 4 and v' ranges between -0.0056 and 0.9976.

Data Reduction
The data is often too large; reducing it can improve performance. Data reduction consists of reducing the representation of the data set while producing the same (or almost the same) analytical results. Data reduction includes:
• Data cube aggregation
• Dimension reduction
• Data compression
• Discretization
• Numerosity reduction: regression, histograms, clustering, sampling

Data Cube Aggregation
Reduce the data to the concept level needed in the analysis.
[Figure: a sales cube with dimensions Category (Drama, Comedy, Horror, Sci-Fi), Time, and Location; Time is rolled up from quarters (Q1–Q4 of 1997–1999) to years, and Location from cities (Edmonton, Montreal, Toronto, Vancouver) to regions (Maritimes, Quebec, Ontario, Prairies, Western Provinces) to Canada.]
Queries regarding aggregated information should be answered using the data cube when possible.
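To make the roll-up concrete, here is a hedged sketch that aggregates hypothetical quarterly sales facts up to the year level of the Time hierarchy; the records and names are invented for illustration:

```python
from collections import defaultdict

# Hypothetical (quarter, year, amount) sales facts.
sales = [
    ("Q1", 1999, 100), ("Q2", 1999, 120),
    ("Q3", 1999, 90),  ("Q4", 1999, 140),
    ("Q1", 2000, 110), ("Q2", 2000, 130),
]

def roll_up_to_year(facts):
    """Aggregate quarterly facts to yearly totals (one climb of the Time hierarchy)."""
    totals = defaultdict(int)
    for _quarter, year, amount in facts:
        totals[year] += amount
    return dict(totals)
```

A query asking for yearly sales can then be answered from the aggregated cuboid instead of rescanning the base facts.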
Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
• Select only the necessary attributes.
• The goal is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
• There is an exponential number of possible subsets, so heuristics are used: select the local "best" (or most pertinent) attribute, e.g., by information gain.
Heuristic methods:
• Step-wise forward selection: {} → {A1} → {A1, A3} → {A1, A3, A5}
• Step-wise backward elimination: {A1, A2, A3, A4, A5} → {A1, A3, A4, A5} → {A1, A3, A5}
• Combining forward selection and backward elimination
• Decision-tree induction

Decision-Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5}.
[Figure: a decision tree whose internal nodes test A1?, A5?, and A3?, with Class 1 and Class 2 leaves.]
Attributes that do not appear in the induced tree are dropped.
Reduced attribute set: {A1, A3, A5}

Data Compression
Data compression reduces the size of the data: it saves storage space and communication time. There is lossless compression and lossy compression; it is used for all sorts of data, and some methods are data-specific while others are versatile. For data mining, data compression is beneficial if data mining algorithms can manipulate the compressed data directly without uncompressing it.

Numerosity Reduction
Data volume can be reduced by choosing alternative forms of data representation:
• Parametric: regression (a model or function estimating the distribution is stored instead of the data)
• Non-parametric: histograms, clustering, sampling
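As a sketch of the parametric option, a least-squares line reduces a set of 2-D points to just two stored parameters; the helper name is illustrative:

```python
def fit_line(points):
    """Least-squares fit of y = a*x + b.

    The two parameters (a, b) can be stored instead of the full point
    list: parametric numerosity reduction via linear regression."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope
    b = (sy - a * sx) / n                           # intercept
    return a, b
```

On points drawn exactly from the earlier figure's line y = x + 1, such as (0, 1), (1, 2), (2, 3), (3, 4), it returns (1.0, 1.0).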
Reduction with Histograms
A popular data reduction technique: divide the data into buckets and store a representation of each bucket (sum, count, etc.):
• Equi-width (all buckets span the same value range)
• Equi-depth (all buckets hold the same number of values)
• V-Optimal (the histogram with the least weighted variance, ∑(count_b × variance_b))
• MaxDiff (bucket boundaries placed at the largest differences between adjacent values)
Related to the quantization problem.

Reduction with Clustering
Partition the data into clusters based on "closeness" in space, then retain representatives of the clusters (centroids) and the outliers. Effectiveness depends upon the distribution of the data. Hierarchical clustering is possible (multi-resolution).

Reduction with Sampling
Sampling allows a large data set to be represented by a much smaller random sample (subset) of the data.
• How do we select the random sample?
• Will the patterns in the sample represent the patterns in the data?
Common schemes:
• Simple random sample without replacement (SRSWOR)
• Simple random sample with replacement (SRSWR)
• Cluster sample (SRSWOR or SRSWR from clusters)
• Stratified sample (stratum = group based on an attribute value)
Random sampling can produce poor results, so this is an area of active research.
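Three of the sampling schemes above can be sketched with the standard library's random module; the seeds, the per-stratum minimum of one record, and the function names are illustrative assumptions:

```python
import random

def srswor(data, n, seed=42):
    """Simple random sample without replacement (SRSWOR)."""
    return random.Random(seed).sample(data, n)

def srswr(data, n, seed=42):
    """Simple random sample with replacement (SRSWR)."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(n)]

def stratified(data, stratum_of, frac, seed=42):
    """Stratified sample: draw the same fraction from every stratum."""
    rng = random.Random(seed)
    strata = {}
    for rec in data:                       # group records by attribute value
        strata.setdefault(stratum_of(rec), []).append(rec)
    out = []
    for group in strata.values():          # keep at least one record per stratum
        out.extend(rng.sample(group, max(1, round(frac * len(group)))))
    return out
```

Stratification guarantees that small strata are still represented, which plain SRSWOR cannot.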
Discretization
Discretization is used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels are then used to replace actual data values.
Some data mining algorithms only accept categorical attributes and cannot handle a range of continuous attribute values.
Discretization can reduce the data set, and can also be used to generate concept hierarchies automatically.

Discretization and Concept Hierarchies
For numerical data, it is difficult to construct concept hierarchies manually: there is a wide diversity of possible range values, and the data is frequently updated. Hence, automatic concept hierarchy generation based on data distribution analysis:
• Binning (bin representative (mean/median) → recursive binning → concept hierarchy)
• Histogram analysis (recursive, with a minimum interval size → concept hierarchy)
• Clustering (recursive clustering → concept hierarchy)
• Entropy-based (binary partitioning with information gain evaluation → concept hierarchy)
• 3-4-5 data segmentation (uniform intervals with rounded boundaries)

3-4-5 Partitioning
• Sort the data → get Min and Max.
• Determine the 5th and 95th percentiles → get Low and High.
• Determine the most significant digit (msd) and round → get Low' and High'.
• Based on (High' - Low')/msd:
  if it is 3, 6, 7, or 9, partition into 3 intervals;
  if 2, 4, or 8, partition into 4 intervals;
  if 1, 5, or 10, partition into 5 intervals.
• Adjust both end intervals to include 100% of the data.
• Repeat recursively within each interval.
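A minimal sketch of the first pass of the 3-4-5 rule, under the simplifying assumption that the msd is taken from the magnitude of High - Low (the slides leave the msd choice informal):

```python
import math

def top_level_345(low, high):
    """First pass of 3-4-5 partitioning: round Low/High to the most
    significant digit of the range, then split into 3, 4, or 5 uniform
    intervals according to the 3-4-5 rule."""
    m = 10 ** math.floor(math.log10(high - low))   # magnitude of the msd
    low_r = math.floor(low / m) * m                # Low'
    high_r = math.ceil(high / m) * m               # High'
    span = round((high_r - low_r) / m)             # msd units covered
    n = {3: 3, 6: 3, 7: 3, 9: 3, 2: 4, 4: 4, 8: 4, 1: 5, 5: 5, 10: 5}[span]
    width = (high_r - low_r) // n
    return [(low_r + i * width, low_r + (i + 1) * width) for i in range(n)]
```

On the worked example that follows (Low = -159, High = 1,838), this yields the three intervals (-1000, 0), (0, 1000), and (1000, 2000); end-interval adjustment and recursion are the remaining steps.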
3-4-5 Partitioning Example (profit data)
Step 1: sort the data; Min = -$351, Low (5%-tile) = -$159, High (95%-tile) = $1,838, Max = $4,700.
Step 2: msd = 1,000, so Low' = -$1,000 and High' = $2,000.
Step 3: (High' - Low')/msd = 3, so partition (-$1,000..$2,000) into 3 uniform intervals: (-$1,000..$0), ($0..$1,000), ($1,000..$2,000).
Step 4: adjust the end intervals to include Min and Max, giving the range (-$400..$5,000): (-$400..$0), ($0..$1,000), ($1,000..$2,000), ($2,000..$5,000).
Step 5: repeat recursively within each interval:
(-$400..$0): (-$400..-$300), (-$300..-$200), (-$200..-$100), (-$100..$0)
($0..$1,000): ($0..$200), ($200..$400), ($400..$600), ($600..$800), ($800..$1,000)
($1,000..$2,000): ($1,000..$1,200), ($1,200..$1,400), ($1,400..$1,600), ($1,600..$1,800), ($1,800..$2,000)
($2,000..$5,000): ($2,000..$3,000), ($3,000..$4,000), ($4,000..$5,000)