Week 2 lecture slides

Topics
❙ Data Preprocessing
❙ Descriptive Data Summarization
❙ Data Preprocessing Tasks
❙ Textbook reference: Chapter 2

Data preprocessing
❙ Preprocessing is required because real-world data sets are typically incomplete, noisy and inconsistent.
❙ Data is often sourced from different databases that may use different attribute names for essentially the same thing.
❙ Some attribute values may be missing.
❙ Some have been incorrectly entered, lost or corrupted.
❙ The data set may be far larger than necessary.
❙ It may be necessary to normalize, bin or cull some attributes or some outlier records.
❙ The goal is to get a data set to which data mining techniques can be applied efficiently and effectively.

Descriptive data summarization
❙ These techniques provide an overall view or profile of the data.
❙ They can be used to identify problems, such as the presence of outliers.
❙ They describe the:
❙ Central tendency of the data,
❙ Dispersion of the data.

Central tendency measures
Let the attribute values be x1, x2, .., xn.
❙ The arithmetic mean is the average: mean = (x1 + x2 + .. + xn) / n
❙ The median is the middle value of the ordered set.
❙ The mode is the most frequently occurring value. There may be more than one (a bimodal, or multimodal data set).

Dispersion measures
❙ The range is max(x1, x2, .., xn) - min(x1, x2, .., xn).
❙ Quartiles divide the ordered data into 4 quarters. Q1 is the value xi such that 25% of x1, x2, .., xn are less than or equal to xi.
❙ The interquartile range is Q3 - Q1.
❙ The five-number summary of a distribution is: min(x1, x2, .., xn), Q1, median, Q3, max(x1, x2, .., xn).
❙ Boxplots are illustrations of the five-number summary: Min --- Q1 median Q3 --- Max
❙ Variance is ((x1 - mean)^2 + .. + (xn - mean)^2) / n.
❙ Standard deviation is the square root of the variance.

Data cleaning: incomplete data
Missing value strategies:
❙ Omit the whole record. May lose significant data.
❙ Find the value. Probably impractical.
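The summary measures above can be computed with the standard library alone. The sketch below uses made-up values and follows the slide's population variance (divide by n); note that `statistics.quantiles` implements one of several common quartile conventions, so Q1 and Q3 can differ slightly from a hand calculation.

```python
import statistics

# Hypothetical attribute values, chosen so the outlier (100) is visible.
x = [1, 2, 2, 3, 4, 5, 6, 7, 8, 100]

n = len(x)
mean = sum(x) / n                       # arithmetic mean
median = statistics.median(x)           # middle value of the ordered set
mode = statistics.mode(x)               # most frequent value

# Quartiles from the ordered data (default "exclusive" convention).
xs = sorted(x)
q1, _, q3 = statistics.quantiles(xs, n=4)
iqr = q3 - q1                           # interquartile range

five_number = (min(xs), q1, median, q3, max(xs))

# Population variance, matching the slide's formula (divide by n, not n-1).
variance = sum((v - mean) ** 2 for v in x) / n
std_dev = variance ** 0.5

print(five_number)   # five-number summary, the basis of a boxplot
print(iqr, variance, std_dev)
```

The large gap between the mean (13.8) and the median (4.5) is itself a diagnostic: it signals the outlier that the five-number summary then locates.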
❙ Use a global constant to mean “unknown”. May interfere with knowledge discovery algorithms.
❙ Use the mean of the known values. Skews the data.
❙ Classify the records with known values and use the class mean. Better, but still skews the data.
❙ Construct a predictive model and use the estimate.

Data cleaning: noisy data
Noisy data has random errors or variance. A good example is sensor data that is affected by the background or even the sensing instrument itself.
Strategies for dealing with noise:
❙ Smoothing by binning: change the values to those representing a partitioning of the ordered data set, such as the bin mean, median or boundary value. Rolling up through a concept hierarchy is an example.
❙ Smoothing by regression. Linear regression finds a line of best fit for the data. Values are changed to fit the line.

Data cleaning: removing outliers
Outliers are data objects that don’t comply with the general behaviour of the data set. They will skew the results of knowledge discovery algorithms (such as classification). Outliers could be identified by:
❙ Attribute values more than about 1.5 times (Q3 - Q1) outside the quartiles;
❙ Attribute values with a deviation from the mean of more than about 2.5 times the standard deviation;
❙ Cluster analysis: data objects that don’t conveniently fit in the clusters encompassing the large majority of the data.

Data integration
Data integration is the combining of data from different sources. For example, two different databases may have attributes customer_id and cust_id. Do they mean the same thing? The metadata of a database is the data about the data; it should describe the meaning of the attribute names. Attribute values may be of different data types (e.g. 1 or 1.0), or of different levels in a concept hierarchy (e.g. $3500 or Expensive). The goal is to retain as much information as possible in the combined data set.
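Two of the cleaning steps above are mechanical enough to sketch: mean imputation for missing values and the 1.5 x IQR rule for flagging outliers. The data and the crude quartile positions below are illustrative only, not a fixed convention.

```python
def impute_mean(values):
    """Replace None entries with the mean of the known values.
    Note the slide's caveat: this skews the distribution."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Flag values more than k * (Q3 - Q1) outside the quartiles."""
    xs = sorted(values)
    n = len(xs)
    q1 = xs[n // 4]            # crude quartile positions, fine for a sketch
    q3 = xs[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

data = [3, None, 4, 5, 4, None, 6, 5]
print(impute_mean(data))                               # Nones become 4.5
print(iqr_outliers([1, 2, 2, 3, 3, 3, 4, 4, 5, 40]))   # [40]
```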
Data reduction
Reducing the size of the data set means that knowledge discovery algorithms will run more efficiently. Again, the goal should be to ensure that no significant information is lost.
Strategies for data reduction:
❙ Aggregation, that is, summarizing the data. Data Cubes. A lot more about Data Cubes later.
❙ Attribute reduction. Removing irrelevant attributes.
❙ Dimensionality reduction. Encoding or compression that transforms the data set into one with fewer attributes.
❙ Numerosity reduction. Using bin representations or model parameters instead of the actual values. Using only a sample of the whole data. Rollup through the concept hierarchy. Normalizing attribute values.

Data reduction: attribute correlation
If two attributes in a data set have a strong correlation, then one could be removed without significant loss of information.
Let n be the number of records, ai and bi be the numerical values for attributes A and B in record i, mA and mB be the respective means, and sA and sB be the respective standard deviations of A and B. Then the correlation between the attributes is given by:
CAB = ( SUMi=1..n ai x bi - n x mA x mB ) / ( n x sA x sB )
Note that -1 <= CAB <= +1. If CAB = 0 the attributes are uncorrelated (which does not by itself guarantee independence).
Chi-square analysis is used to test the correlation of attributes with categorical values.

Data reduction: principal component analysis
Assume a data set of m attributes. This can be viewed as points in an m-dimensional space, with each attribute represented by an axis. Principal component analysis attempts to map this space onto a smaller k-dimensional space, thus reducing the number of attributes required to represent the data set. The Karhunen-Loeve algorithm finds components (new attributes) in order of variance among the data. That is, the first shows the most variance, the second the next most, etc.
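The correlation formula above translates directly into code. This sketch uses population standard deviations (divide by n), to match the slide's variance formula, and made-up attribute values; a pair of perfectly correlated attributes gives CAB = +1, the case where dropping one attribute loses nothing.

```python
import math

def correlation(a, b):
    """CAB = (sum(ai * bi) - n*mA*mB) / (n * sA * sB), as on the slide,
    with population standard deviations."""
    n = len(a)
    m_a = sum(a) / n
    m_b = sum(b) / n
    s_a = math.sqrt(sum((x - m_a) ** 2 for x in a) / n)
    s_b = math.sqrt(sum((x - m_b) ** 2 for x in b) / n)
    return (sum(x * y for x, y in zip(a, b)) - n * m_a * m_b) / (n * s_a * s_b)

# Attribute B is exactly 2 * A, so CAB = +1 (up to rounding):
# one of the two could be removed without loss of information.
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]
print(correlation(a, b))
```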
Using only the components with the strongest variance, a good (and smaller) approximation of the data can be constructed.

Data reduction: wavelet transformations
Assume a data set of m = 2^p attributes. Each record can be considered an m-dimensional vector (x1, x2, .., xm). The Discrete Wavelet Transform consists of a low-frequency (smoothing) filter and a high-frequency (difference) filter. Each filter operates on value pairs (xi, xi+1) and produces a series of length m/2. The filters are recursively applied to the previous outputs. Selected output values are called wavelet coefficients. They represent a compressed version of the original data. Wavelet transformations conserve local information (about the relationship between neighbouring values).

The wavelet transformation
X is the original vector. D is the low-frequency filter output. C is the high-frequency filter output. E is the sum of the values (the energies). The energies at each level could be characteristic of classes of data objects, that is, used for classification.

Example of dimensionality reduction using the Haar wavelet transformation
Assume a record has values for 8 attributes. It can be represented as a vector in an 8-dimensional space, e.g.
6, 4, 4, 4, 8, 6, 8, 4
The low-pass filter computes the average (half-sum) of each pair. The high-pass filter computes the half-difference of each pair.
1st low pass: 5, 4, 7, 6    1st high pass: 1, 0, 1, 2
2nd LP: 4.5, 6.5            2nd HP: 0.5, 0.5
3rd LP: 5.5                 3rd HP: -1
Note: LP and HP are decimation filters. They reduce the size of the data at each step.

Data reduction: discretization
Discretization represents continuous values by an interval, group or concept abstraction level.
❙ Binning. Divide the range into equal-width or equal-frequency intervals. Replace values by the bin mean or median.
❙ Histogram analysis represents the frequency of values. Buckets are merged histogram intervals. Replace values by their frequency or frequency range.
❙ Clustering groups similar values.
❙ Entropy-based discretization works on data sets with a class attribute to form a concept hierarchy.
❙ ChiMerge is a bottom-up method that uses the Chi-square test.

Binning examples
Data series: 1, 1, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 8, 8, 8, 9, 10
Equal-width bins:
Bin 1: 1, 1
Bin 2: 3, 4, 4, 4
Bin 3: 5, 5, 5, 6, 6
Bin 4: 7, 8, 8, 8
Bin 5: 9, 10
Represent values by:
❙ Bin number
❙ Bin mean
❙ Bin boundary value

Frequency Histograms

Numerosity reduction
Numerosity reduction reduces the size or even the range of attribute values.
❙ Logarithms: transform using the log function, e.g. log(100) = 2, log(1000) = 3.
❙ Normalization, e.g. Z-score normalization. Let the values of attribute A have mean m and standard deviation s. Then a value v is normalized as u = (v - m) / s.
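The equal-width binning example and the Z-score formula above can both be sketched in a few lines. The binning function below reproduces the five bins on the slide for its data series; the Z-score uses the population standard deviation, matching the slide's variance formula.

```python
def equal_width_bins(values, k):
    """Partition values into k equal-width intervals over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        i = min(int((v - lo) / width), k - 1)   # clamp the max into the last bin
        bins[i].append(v)
    return bins

def z_score(values):
    """Z-score normalization: u = (v - m) / s, with population std s."""
    n = len(values)
    m = sum(values) / n
    s = (sum((v - m) ** 2 for v in values) / n) ** 0.5
    return [(v - m) / s for v in values]

data = [1, 1, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 8, 8, 8, 9, 10]
for i, b in enumerate(equal_width_bins(data, 5), 1):
    print(f"Bin {i}: {b}")      # matches the five bins on the slide
```

Each value can then be replaced by its bin number, the bin mean, or the nearest bin boundary, exactly as listed above.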