ECT7110 Data Preprocessing
Prof. Wai Lam

Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– A data warehouse needs consistent integration of quality data

Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains a representation reduced in volume that produces the same or similar analytical results
• Data discretization
– Part of data reduction, but of particular importance, especially for numerical data

Forms of Data Preprocessing
[Figure: overview of the forms of data preprocessing listed above]

Data Cleaning
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data

Data Transformation
• Smoothing: remove noise from the data
• Aggregation: summarization, data cube construction
• Normalization: scale values to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction
– New attributes constructed from the given ones

Data Transformation: Normalization
• min-max normalization:
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
• z-score normalization:
v' = (v − mean_A) / stand_dev_A

Normalization Examples
• Suppose that the minimum and maximum values for the attribute income are 12,000 and 98,000, respectively.
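As a quick illustration (added here, not part of the original slides), the two normalization formulas can be written in Python using the slide's income statistics (minimum 12,000, maximum 98,000, mean 54,000, standard deviation 15,000); the function names are my own:

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization: map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score_normalize(v, mean_a, std_a):
    """Z-score normalization: how many standard deviations v lies from the mean."""
    return (v - mean_a) / std_a

# Using the income figures from the slide:
print(min_max_normalize(73600, 12000, 98000))   # about 0.716
print(z_score_normalize(73600, 54000, 15000))   # about 1.307
```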
How to map an income value of 73,600 to the range [0.0, 1.0]?
• Suppose that the mean and standard deviation for the attribute income are 54,000 and 15,000. How to map an income value of 73,600 using z-score normalization?

Data Reduction Strategies
• A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
• Data reduction
– Obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
• Data reduction strategies
– Dimensionality reduction

Dimensionality Reduction
• Feature selection (i.e., attribute subset selection):
– Select a minimum set of features useful for data mining
– Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand

Histograms
• A popular data reduction technique
• Divide data into buckets and store the average (or sum) for each bucket
• Related to quantization problems
[Figure: example histogram with counts on the y-axis and values from 10,000 to 100,000 on the x-axis]

Histograms
[Figure: two histograms of price ($): one using singleton buckets, one using buckets denoting a continuous range of values (1–10, 11–20, 21–30)]

Histograms
• How are the buckets determined and the attribute values partitioned?
– Equiwidth: the width of each bucket range is uniform
– Equidepth: the buckets are created so that, roughly, the frequency of each bucket is constant

Histogram Examples
• Suppose the values for the attribute age are: 13, 15, 16, 16, 19, 20, 20, 21, 21, 22, 25, 25, 25, 25, 30, 30, 30, 30, 32, 33, 33, 37, 40, 40, 40, 42, 42

Equiwidth Histogram:
Bucket range  Frequency
13–22         10
23–32         9
33–42         8

Equidepth Histogram:
Bucket range  Frequency
13–21         9
22–30         9
32–42         9

Sampling
• Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
• Choose a representative subset of the data
– Simple random sampling may have very poor performance in the presence of skew
• Develop adaptive sampling methods
– Stratified sampling:
• Approximate the percentage of each class (or subpopulation of interest) in the overall database
• Used in conjunction with skewed data
• Sampling may not reduce database I/Os (page at a time)

Sampling
[Figure: raw data reduced by SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement), each with n = 4]
[Figure: a cluster sample (m = 2) drawn from clustered tuples, and a stratified sample drawn according to age group (young, middle-aged, senior)]
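The equiwidth/equidepth bucketing and the sampling schemes above can be sketched in Python (a minimal sketch added for this handout; the helper names are my own, and the stratification threshold of age 30 is just an assumption for illustration):

```python
import random
from collections import Counter

# The age values from the histogram example slide.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 21, 22, 25, 25, 25,
        25, 30, 30, 30, 30, 32, 33, 33, 37, 40, 40, 40, 42, 42]

def equiwidth_counts(values, low, high, n_buckets):
    """Equiwidth: every bucket covers a value range of the same width."""
    width = (high - low + 1) / n_buckets
    counts = Counter(int((v - low) // width) for v in values)
    return [counts.get(i, 0) for i in range(n_buckets)]

def equidepth_split(values, n_buckets):
    """Equidepth: every bucket holds (roughly) the same number of values."""
    vals = sorted(values)
    size = len(vals) // n_buckets
    return [vals[i * size:(i + 1) * size] for i in range(n_buckets)]

print(equiwidth_counts(ages, 13, 42, 3))                       # [10, 9, 8]
print([(b[0], b[-1]) for b in equidepth_split(ages, 3)])       # ranges 13-21, 22-30, 32-42

# Simple random sampling, without (SRSWOR) and with (SRSWR) replacement:
srswor = random.sample(ages, 4)
srswr = [random.choice(ages) for _ in range(4)]

# Stratified sampling: sample each stratum separately so its share of the
# data is preserved (here, roughly a third of every stratum).
strata = [[a for a in ages if a < 30], [a for a in ages if a >= 30]]
stratified = [x for group in strata for x in random.sample(group, len(group) // 3)]
```

The printed bucket counts and ranges match the equiwidth and equidepth tables on the histogram-examples slide, which is a handy sanity check for the two partitioning rules.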