Data Preprocessing
Mirek Riedewald. Some slides based on a presentation by Jiawei Han and Micheline Kamber.

Motivation
• Garbage-in, garbage-out – cannot get good mining results from bad data
• Need to understand data properties to select the right technique and parameter values
• Data cleaning
• Data formatting to match the technique
• Data manipulation to enable discovery of desired patterns

Data Records
• Data sets are made up of data records
• A data record represents an entity
• Examples:
  – Sales database: customers, store items, sales
  – Medical database: patients, treatments
  – University database: students, professors, courses
• Also called samples, examples, tuples, instances, data points, objects
• Data records are described by attributes
  – Database row = data record; column = attribute

Attributes
• Attribute (or dimension, feature, variable): a data field representing a property of a data record
  – E.g., customerID, name, address
• Types:
  – Nominal (aka categorical): no ordering or meaningful distance measure
  – Ordinal: ordered domain, but no meaningful distance measure
  – Numeric: ordered domain, meaningful distance measure; continuous versus discrete

Attribute Type Examples
• Nominal: category, status, or “name of thing”
  – Hair_color = {black, brown, blond, red, auburn, grey, white}
  – Marital status, occupation, ID numbers, zip codes
• Binary: nominal attribute with only 2 states
  – Gender, outcome of medical test (positive, negative)
• Ordinal
  – Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types
• Interval
  – Measured on a scale of equal-sized units
  – Values have order, but no true zero-point
  – E.g., temperature in C or F, calendar dates
• Ratio
  – Inherent zero-point
  – We can speak of values as being an order of magnitude larger than the unit of measurement (10m is twice as high as 5m)
  – E.g., temperature in Kelvin, length, counts, monetary quantities

Discrete vs. Continuous Attributes
• Discrete attribute
  – Has only a finite or countably infinite set of values
  – Nominal, binary, and ordinal attributes are usually discrete
  – Integer numeric attributes
• Continuous attribute
  – Has real numbers as attribute values, e.g., temperature, height, or weight
  – Practically, real values can only be measured and represented using a finite number of digits
  – Typically represented as floating-point variables
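As a concrete illustration of the attribute types above, here is a minimal sketch using pandas dtypes; the column names and values are made up for the example, not from the slides.

```python
import pandas as pd

# Hypothetical records illustrating the attribute types above
df = pd.DataFrame({
    # nominal: no ordering, no distance measure
    "hair_color": pd.Categorical(["black", "blond", "brown"]),
    # ordinal: ordered domain, but no meaningful distance
    "size": pd.Categorical(["small", "large", "medium"],
                           categories=["small", "medium", "large"],
                           ordered=True),
    # numeric, continuous (interval scale: no true zero-point)
    "temperature_c": [21.5, 19.0, 23.2],
    # numeric, discrete (ratio scale: counts have an inherent zero)
    "num_purchases": [0, 2, 1],
})

print(df.dtypes)
print(df["size"] > "small")   # ordinal comparison is meaningful
# df["hair_color"] > "black"  # would raise: nominal values are unordered
```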
Data Preprocessing Overview
• Descriptive data summarization
• Data cleaning
• Correlations
• Data transformation
• Summary

Measuring the Central Tendency
• Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Weighted arithmetic mean: $\bar{x} = \sum_{i=1}^{n} w_i x_i \Big/ \sum_{i=1}^{n} w_i$
  – Trimmed mean: set the weights of extreme values to zero
• Median
  – Middle value if odd number of values; average of the middle two values otherwise
• Mode
  – Value that occurs most frequently in the data
  – E.g., unimodal or bimodal distribution

Measuring Data Dispersion: Boxplot
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  – Inter-quartile range: IQR = Q3 − Q1
  – For N records, the p-th percentile is the record at position (p/100)·N + 0.5 in increasing order
    • If not an integer, round to the nearest integer or compute a weighted average
    • E.g., for N=32, p=25: 25/100·32 + 0.5 = 8.5, i.e., Q1 is the average of the 8th and 9th smallest values
• Boxplot: ends of the box are the quartiles, the median is marked, whiskers extend to min/max
  – Outliers are often plotted individually: usually values higher (or lower) than 1.5·IQR from Q3 (or Q1)

Measuring Data Dispersion: Variance
• Sample variance (aka second central moment):
  $s_n^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$
• Standard deviation = square root of the variance
• Estimator of the true population variance from a sample:
  $s_{n-1}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

Histogram
• Display of tabulated frequencies
• Shows the proportion of cases in each category
• The area (not the height!) of the bar denotes the value
  – A crucial distinction when the categories are not of uniform width!

Scatter Plot
• Visualizes the relationship between two attributes, even a third (if categorical)
  – For each data record, plot the selected attribute pair in the plane
• (Example figures in the original slides: “Correlated Data” and “Not Correlated Data”)
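Pulling the summary statistics above together, here is a minimal sketch assuming numpy and made-up sample data; note that numpy interpolates percentiles by default, which can differ slightly from the slide’s (p/100)·N + 0.5 rule.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=10, size=32)   # made-up sample, N = 32

mean, median = x.mean(), np.median(x)
q1, q3 = np.percentile(x, [25, 75])   # numpy interpolates; may differ slightly
iqr = q3 - q1                         # inter-quartile range

var_n = x.var()           # s_n^2: divides by n (second central moment)
var_n1 = x.var(ddof=1)    # s_{n-1}^2: divides by n-1 (population-variance estimator)
std = x.std(ddof=1)

# boxplot-style outlier fences: beyond 1.5*IQR from the quartiles
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < low) | (x > high)]
print(mean, median, iqr, std, outliers)
```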
Why Data Cleaning?
• Data in the real world is dirty
  – Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • E.g., occupation=""
  – Noisy: containing errors or outliers
    • E.g., Salary="-10"
  – Inconsistent: containing discrepancies in codes or names
    • E.g., Age="42" and Birthday="03/07/1967"
    • E.g., was rating “1, 2, 3”, now rating “A, B, C”

Example: Bird Observation Data
• Change of range boundaries over time, e.g., for temperature
• Different units, e.g., meters versus feet for elevation
• Addition or removal of attributes over the years
• Missing entries, especially for habitat and weather
• GIS data based on 30m cells or 1km cells
• Location accuracy
  – ZIP code versus GPS coordinates
  – Walk along a transect but report only a single location
• Inconsistent encoding of missing entries
  – 0, -9999, -3.4E+38 — need context to decide
• Varying observer experience and capabilities
  – Confusion of species (e.g., Hairy vs. Downy Woodpecker), missed present species
• Confusion about the reporting protocol
  – Report max versus sum seen
  – Report only interesting species, not all

How to Handle Missing Data?
• Ignore the record
  – Usually done when the class label is missing (for classification tasks)
• Fill in manually
  – Tedious, and often not clear what value to fill in
• Fill in automatically with one of the following:
  – A global constant, e.g., “unknown”
    • “Unknown” could be mistaken for a new concept by the data mining algorithm
  – The attribute mean, or the mean over all records belonging to the same class
  – The most probable value: inference-based, such as a Bayesian formula or a decision tree
    • Some methods, e.g., trees, can do this implicitly

How to Handle Noisy Data?
• Noise = random error or variance in a measured variable
• Typical approach: smoothing
  – Adjust the values of a record by taking the values of other “nearby” records into account
  – Many approaches
• Recommendation: don’t do it unless you understand the nature of the noise
  – A good data mining technique should be able to deal with noise in the data

Covariance (Numerical Data)
• Covariance computed for data samples (A1, B1), (A2, B2), …, (An, Bn):
  $\mathrm{Cov}(A,B) = \frac{1}{n}\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B}) = \frac{1}{n}\sum_{i=1}^{n} A_i B_i - \bar{A}\,\bar{B}$
• If A and B are independent, then Cov(A, B) = 0, but the converse is not true
  – Two random variables may have a covariance of 0 and yet not be independent
• If Cov(A, B) > 0, then A and B tend to rise and fall together
  – The greater the covariance, the more so
• If the covariance is negative, then A tends to rise as B falls and vice versa

Covariance Example
• Suppose two stocks A and B have the following values in one week:
  – A: (2, 3, 5, 4, 6)
  – B: (5, 8, 10, 11, 14)
  – AVG(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
  – AVG(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
  – Cov(A, B) = (2·5 + 3·8 + 5·10 + 4·11 + 6·14)/5 − 4·9.6 = 4
• Cov(A, B) > 0, therefore A and B tend to rise and fall together

Correlation Analysis (Numerical Data)
• Pearson’s product-moment correlation coefficient of random variables A and B:
  $\rho_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B}$
• Computed for two attributes A and B from data samples (A1, B1), (A2, B2), …, (An, Bn):
  $r_{A,B} = \frac{1}{n-1}\sum_{i=1}^{n} \frac{(A_i - \bar{A})(B_i - \bar{B})}{s_A s_B}$
  where $\bar{A}$ and $\bar{B}$ are the sample means, and $s_A$ and $s_B$ are the sample standard deviations of A and B (using the variance formula for $s_n$)
• Note: −1 ≤ r_{A,B} ≤ 1
• r_{A,B} > 0: A and B are positively correlated (the higher the value, the stronger the correlation)
• r_{A,B} < 0: negatively correlated

Correlation Analysis (Categorical Data)
• χ² (chi-square) test:
  $\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
  – The number of hospitals and the number of car thefts in a city are correlated
  – Both are causally linked to a third variable: population

Chi-Square Example

                           Play chess   Not play chess   Sum (row)
  Like science fiction     250 (90)     200 (360)        450
  Not like science fiction 50 (210)     1000 (840)       1050
  Sum (col.)               300          1200             1500

• Numbers in parentheses are expected counts, calculated based on the data distribution in the two categories
• $\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$
• This shows that like_science_fiction and play_chess are correlated in this group
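A small sketch (assuming numpy) that reproduces the stock covariance example above and also computes Pearson’s r:

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

# covariance as defined on the slide: mean of products minus product of means
cov = (A * B).mean() - A.mean() * B.mean()   # -> 4.0
# equivalently: np.cov(A, B, bias=True)[0, 1]  (bias=True divides by n, not n-1)

r = np.corrcoef(A, B)[0, 1]                  # Pearson correlation, in [-1, 1]
print(cov, r)
```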
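And a sketch of the χ² computation for the contingency table above; the expected counts come from the row and column totals. (scipy.stats.chi2_contingency(observed, correction=False) should give the same statistic.)

```python
import numpy as np

# rows: like / don't like science fiction; columns: play / don't play chess
observed = np.array([[250.0, 200.0],
                     [50.0, 1000.0]])

row_totals = observed.sum(axis=1, keepdims=True)     # 450, 1050
col_totals = observed.sum(axis=0, keepdims=True)     # 300, 1200
expected = row_totals @ col_totals / observed.sum()  # outer product / grand total

chi2 = ((observed - expected) ** 2 / expected).sum()
print(expected)   # [[ 90. 360.] [210. 840.]]
print(chi2)       # ~507.9, matching the slide
```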
Why Data Transformation?
• Make data more “mineable”
  – E.g., some patterns are visible when using a single time attribute (the entire date-time combination), others only when making hour, day, month, and year separate attributes
  – Some patterns are only visible at the right granularity of representation
• Some methods require normalized data
  – E.g., all attributes in the range [0.0, 1.0]
• Reduce data size, in both #attributes and #records

Normalization
• Min-max normalization to [new_min_A, new_max_A]:
  $v' = \frac{v - \min_A}{\max_A - \min_A}(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$
  – E.g., normalize the income range [$12,000, $98,000] to [0.0, 1.0]. Then $73,600 is mapped to
    (73,600 − 12,000)/(98,000 − 12,000)·(1.0 − 0) + 0 = 0.716
• Z-score normalization (μ: mean, σ: standard deviation): $v' = \frac{v - \mu}{\sigma}$
  – E.g., for μ = 54,000 and σ = 16,000, $73,600 is mapped to (73,600 − 54,000)/16,000 = 1.225
• Normalization by decimal scaling: $v' = \frac{v}{10^j}$, where j is the smallest integer such that max(|v'|) < 1

Data Reduction
• Why data reduction?
  – Mining cost often increases rapidly with data size and the number of attributes
• Goal: reduce the data size, but produce (almost) the same results
• Data reduction strategies:
  – Dimensionality reduction
  – Data compression
  – Numerosity reduction
  – Discretization

Dimensionality Reduction: Attribute Subset Selection
• Feature selection (i.e., attribute subset selection):
  – Select a minimum set of attributes such that the mining result is still as good as (or even better than) when using all attributes
• Heuristic methods (due to the exponential number of choices):
  – Select independently based on some test
  – Step-wise forward selection
  – Step-wise backward elimination
  – Combining forward selection and backward elimination
  – Eliminate attributes that some trusted method did not use, e.g., a decision tree

Principal Component Analysis
• Find the projection that captures the largest amount of variation in the data
  – The space is defined by the eigenvectors of the covariance matrix
• Compression: use only the first k eigenvectors
• (Figure in the original slides: data points in the (x1, x2) plane with eigenvectors e1, e2)

Data Reduction Method: Sampling
• Select a small subset of a given data set
• Reduces mining cost
  – Mining cost is usually super-linear in data size
  – Often makes the difference between in-memory processing and the need for expensive I/O
• Choose a representative subset of the data
  – Simple random sampling may perform poorly in the presence of skew
  – Stratified sampling
    • E.g., sample more from small classes to avoid missing them entirely in a small uniform sample

Data Reduction: Discretization
• Applied to continuous attributes
• Reduces the domain size
• Makes the attribute discrete and hence enables the use of techniques that only accept categorical attributes
• Approach:
  – Divide the range of the attribute into intervals
  – Interval labels replace the original data

Summary
• Data preparation is a big issue for data mining
• Descriptive data summarization is used to understand data properties
• Data preparation includes
  – Data cleaning and integration
  – Data reduction and feature selection
  – Discretization
• Many techniques and commercial tools exist, but data preparation remains a major challenge and an active research area
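To close, a minimal sketch (assuming numpy; the income values come from the normalization example above, and the bin edges for discretization are made up) of the transformation steps covered in this section:

```python
import numpy as np

income = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# min-max normalization to [0, 1] over the range [12,000, 98,000]
v_min, v_max = 12_000.0, 98_000.0
minmax = (income - v_min) / (v_max - v_min)   # 73,600 -> ~0.716

# z-score normalization with mu = 54,000 and sigma = 16,000
zscore = (income - 54_000.0) / 16_000.0       # 73,600 -> 1.225

# decimal scaling: divide by 10^j for the smallest j with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max())))
decimal = income / 10 ** j                    # j = 5 here, so 98,000 -> 0.98

# discretization: interval labels replace the original values (hypothetical bins)
edges = np.array([30_000, 60_000, 90_000])
labels = np.array(["low", "mid", "high", "very_high"])
discretized = labels[np.digitize(income, edges)]
print(minmax, zscore, decimal, discretized, sep="\n")
```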
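And a sketch of PCA as described in the dimensionality-reduction slide — projecting onto the eigenvectors of the covariance matrix; the synthetic 2-D data is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
X = np.column_stack([x1, 0.5 * x1 + 0.1 * rng.normal(size=200)])  # correlated 2-D data

Xc = X - X.mean(axis=0)                # center the data
C = np.cov(Xc, rowvar=False)           # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]      # re-sort by variance captured, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 1                                  # compression: keep only the first k eigenvectors
X_reduced = Xc @ eigvecs[:, :k]
print("variance captured:", eigvals[:k].sum() / eigvals.sum())
```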