Data Cleaning: The data held by many companies contains numerous anomalies and missing values. If a company wants to analyze its data using data mining, data cleaning should be performed first; otherwise, the incorrect patterns generated will be of no use and only an additional burden for the company's management. Data cleaning is an approach that cleans the data by correcting errors, filling in missing values and removing noise. It is one of the most important pre-processing tasks to perform on data before applying data mining techniques. There are many methods for cleaning data.

Replacing missing values: There are many situations where certain columns in the dataset have no values; such records are termed missing-valued tuples. To make correct predictions, the missing values need to be handled and replaced with an approximate value. This can be done by:

a) Replacing by the average: the missing value in a numerical column is replaced by the average of the values already present in that column.
b) Replacing by a constant: every missing value in the dataset is replaced by the same constant value.
c) Replacing manually: the missing values are filled in by hand, which is not the best option, especially for huge datasets.
d) Removing the record: if a record's attributes have many missing values, removing the record can be a good heuristic.
e) Replacing by an approximate value: the missing values are filled with the most probable value of that attribute.

As an example, consider the table below containing 10 records from the IMDB dataset.

Title | Year | ReleaseDate | No of ratings | IMDB Rating | Genre | Director | Cinematographer | Actors
18 Wheels of Justice | 2000 | 12 Jan 2000 | 14 | 4.8 | Action, Crime, Drama, Thriller | Moranville, John B. | Garrett, Jack | Thornhill, Lisa, Hosea, Bobby, Gatrick, Maro
24: Conspiracy | 2005 | ? | 8 | 4.4 | Action, Drama, Short film, Thriller | Ostrick, Marc | ? | Bryant, Beverly, Rider, Amy
2gether: The Series | 2000 | 15 Aug 2000 | 17 | 5.5 | Comedy, Family | Gunn, Mark | Winter, Glen | James, Brenda, Smith, Lauren
The 70's House | 2005 | 5 Jul 2005 | 12 | 5.5 | ? | Sizemore, Robert (I) | Frenzel, Guido | Lee (X), Leggero, Natasha, Stallings
8th & Ocean | 2006 | 7 Mar 2006 | 89 | 4.9 | Documentary | Akhtar, Kabir | Capodice, Nick | Aldridge, Kelly, Aldridge, Sabrina
Amarte asA | 2005 | 6 Apr 2005 | 6 | 8.2 | Drama | de Anda, Heriberto | Suárez, Juan Carlos | Almus, Irene, Ballesteros, Marita
American Embassy, The | 2002 | 11 Mar 2002 | 14 | 9.1 | Drama, Comedy | Coles, John | ? | Aylesworth, Reiko, Bareikis, Arija
American High | 2000 | 2 Aug 2000 | 11 | 5.3 | ? | Cutler, R.J. | Churchill, Joan | Bodle, Kaytee, Komessar, Allie
American Juniors | 2003 | 3 Jun 2003 | 18 | 1.9 | ? | Gowers, Bruce | ? | Dubela, Julie, Gibson, Deborah
As If | 2002 | ? | 19 | 7.9 | Comedy | Grant, Brian | Welland, James | Corrie, Emily, Thoms, Tracie

There are missing values in three different attributes: ReleaseDate, Genre and Cinematographer.

1. Consider the first of these three, ReleaseDate. It has a close relationship with the Year attribute in the table, so if we remove the ReleaseDate attribute, not much important information is lost. In this way, redundant attributes can be removed from the table.

2. Consider the attribute Genre, which gives information about the type of the movie. If we remove this attribute, there will certainly be a huge loss of information; hence, the missing values should be filled in without removing the attribute. As mentioned earlier, we can fill in missing entries with the most probable option. A naïve way to determine the most probable option is to take the most frequent values in the column, which here are 'Drama' and 'Comedy'.
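The replacement strategies above can be sketched in a few lines of plain Python. The `ratings` and `genres` lists below are made-up stand-ins for columns of a table like the one above, with `None` marking a missing value; the function names are hypothetical, not from any library.

```python
from statistics import mean
from collections import Counter

def fill_with_average(column):
    """Replace missing numeric values with the average of the known ones."""
    known = [v for v in column if v is not None]
    avg = mean(known)
    return [avg if v is None else v for v in column]

def fill_with_mode(column):
    """Replace missing categorical values with the most frequent (most probable) one."""
    known = [v for v in column if v is not None]
    most_probable = Counter(known).most_common(1)[0][0]
    return [most_probable if v is None else v for v in column]

ratings = [4.8, 4.4, None, 5.5]              # hypothetical numeric column
genres = ["Drama", "Comedy", "Drama", None]  # hypothetical categorical column

print(fill_with_average(ratings))  # None replaced by the mean of 4.8, 4.4, 5.5
print(fill_with_mode(genres))      # None replaced by "Drama"
```

The same idea scales to a whole table by applying the appropriate function column by column, choosing the average for numeric attributes and the mode for categorical ones.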
In addition, one can always use methods like regression or Bayesian posterior probability estimates to determine the most probable value here.

3. Considering the Cinematographer attribute, we basically have two choices. One is to fill in the missing values of this attribute manually; however, this is a laborious process, so it is left to the user's judgment. The other option is simply to delete this attribute, since in the current table it is of lesser relevance compared with the Director attribute.

After applying the changes listed above and cleaning the table as described, it looks like:

Title | Year | No of ratings | Rating | Genre | Director | Actors
18 Wheels of Justice | 2000 | 14 | 4.8 | Action, Crime, Drama, Thriller | Moranville, John B. | Thornhill, Lisa, Hosea, Bobby, Gatrick, Maro
24: Conspiracy | 2005 | 8 | 4.4 | Action, Drama, Short film, Thriller | Ostrick, Marc | Bryant, Beverly, Rider, Amy
2gether: The Series | 2000 | 17 | 5.5 | Comedy, Family | Gunn, Mark | James, Brenda, Smith, Lauren
The 70's House | 2005 | 12 | 5.5 | Comedy | Sizemore, Robert (I) | Lee (X), Leggero, Natasha, Stallings
8th & Ocean | 2006 | 89 | 4.9 | Documentary | Akhtar, Kabir | Aldridge, Kelly, Aldridge, Sabrina
Amarte asA | 2005 | 6 | 8.2 | Drama | de Anda, Heriberto | Almus, Irene, Ballesteros, Marita
American Embassy, The | 2002 | 14 | 9.1 | Drama, Comedy | Coles, John | Aylesworth, Reiko, Bareikis, Arija
American High | 2000 | 11 | 5.3 | Comedy | Cutler, R.J. | Bodle, Kaytee, Komessar, Allie
American Juniors | 2003 | 18 | 1.9 | Drama | Gowers, Bruce | Dubela, Julie, Gibson, Deborah
As If | 2002 | 19 | 7.9 | Comedy | Grant, Brian | Corrie, Emily, Thoms, Tracie

Handling Noisy Data: Noise is commonly present in raw data, and without its elimination the knowledge mined does not make much sense. Hence, the data should be made free of noise. There are different approaches for eliminating noise from the data, such as binning and clustering.

1. Binning is a technique which smooths the data.
The different methods of binning and how they can be used will be explained under data discretization.

Data Transformation: As explained in the introduction, data needs to be standardized for different applications so that data mining algorithms can be applied to it. Hence, data is normally transformed into the necessary formats. Some of the different methods of transformation are:

Data Normalization: In some real-world numeric datasets, the ranges of values of a few columns are much larger than those of the rest. If the mining algorithm is run directly over such a dataset, it may well be biased. Normalization provides methods to scale the entire numeric data to the interval [0, 1]. The different ways of normalization are:

1. Min-Max Normalization: Let minA and maxA be the minimum and maximum values of the attribute we want to normalize. For every value v of the attribute, we determine a new value v_new that falls in the range [new_minA, new_maxA]:

v_new = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

For the values to fall in [0, 1], we take new_maxA = 1 and new_minA = 0.

2. Z-score Normalization: It is also called zero-mean normalization. The values of attribute X are normalized using the mean and standard deviation of X:

v_new = (v - µ) / σ

where µ is the mean and σ is the standard deviation of attribute X.
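Both formulas translate directly into code. The sketch below is a minimal plain-Python implementation; the function names are made up for illustration and are not part of any library.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map each v to (v - minA)/(maxA - minA) * (new_max - new_min) + new_min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score_normalize(values):
    """Map each v to (v - mean) / standard deviation (population std. deviation)."""
    mu = sum(values) / len(values)
    sigma = (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mu) / sigma for v in values]

# The ZN column from the Boston Housing sample discussed next:
zn = [18.00, 12.50, 12.50, 75.00, 21.00, 25.00, 17.50, 28.00, 30.00, 22.00]
print([round(v, 3) for v in min_max_normalize(zn)])
# [0.088, 0.0, 0.0, 1.0, 0.136, 0.2, 0.08, 0.248, 0.28, 0.152]
```

Note that min-max normalization preserves the relative spacing of the values, while z-score normalization is useful when the minimum and maximum are unknown or when outliers dominate the range.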
Consider an example, a sample from the Boston Housing dataset:

CRIM | ZN | INDUS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV
0.00632 | 18.00 | 2.310 | 0.5380 | 6.5750 | 65.20 | 4.0900 | 1 | 296.0 | 15.30 | 396.90 | 4.98 | 24.00
0.08829 | 12.50 | 7.870 | 0.5240 | 6.0120 | 66.60 | 5.5605 | 5 | 311.0 | 15.20 | 395.60 | 12.43 | 22.90
0.14455 | 12.50 | 7.870 | 0.5240 | 6.1720 | 96.10 | 5.9505 | 5 | 311.0 | 15.20 | 396.90 | 19.15 | 27.10
0.02763 | 75.00 | 2.950 | 0.4280 | 6.5950 | 21.80 | 5.4011 | 3 | 252.0 | 18.30 | 395.62 | 1.98 | 34.90
0.8873 | 21.00 | 5.640 | 0.4390 | 5.9630 | 45.70 | 6.8147 | 4 | 243.0 | 16.80 | 395.56 | 13.45 | 19.70
0.15445 | 25.00 | 5.130 | 0.4530 | 6.1450 | 29.20 | 7.8148 | 8 | 284.0 | 19.70 | 390.68 | 6.86 | 23.30
0.01951 | 17.50 | 1.380 | 0.4161 | 7.1040 | 59.50 | 9.2229 | 3 | 216.0 | 18.60 | 393.24 | 8.05 | 33.00
0.04203 | 28.00 | 15.040 | 0.4640 | 6.4420 | 53.60 | 3.6659 | 4 | 270.0 | 18.20 | 395.01 | 8.16 | 22.90
0.08244 | 30.00 | 4.930 | 0.4280 | 6.4810 | 18.50 | 6.1899 | 6 | 300.0 | 16.60 | 379.41 | 6.36 | 23.70
0.21409 | 22.00 | 5.860 | 0.4310 | 6.4380 | 8.90 | 7.3967 | 7 | 330.0 | 19.10 | 377.07 | 3.59 | 24.80

The two attributes we are going to normalize are ZN and LSTAT.

a) The values of the ZN attribute are {12.50, 12.50, 17.50, 18.00, 21.00, 22.00, 25.00, 28.00, 30.00, 75.00}. We need to scale these values to fall in the interval [0, 1] using min-max normalization.
b) min_A = 12.50, max_A = 75.00, new_minA = 0, new_maxA = 1.
c) Consider 12.50. The value by which we replace it follows from the formula: v_new = ((12.50 - 12.50) / (75.00 - 12.50)) * (1 - 0) + 0 = 0.
d) In the same way, we can determine the new set of values for the ZN column: 12.50 → 0; 17.50 → 0.08; 18.00 → 0.088; 21.00 → 0.136; 22.00 → 0.152; 25.00 → 0.2; 28.00 → 0.248; 30.00 → 0.28; 75.00 → 1.

The values of the LSTAT attribute are {1.98, 3.59, 4.98, 6.36, 6.86, 8.05, 8.16, 12.43, 13.45, 19.15}.
If we want to normalize them using min-max normalization, the same procedure as above can be applied, and the values are mapped to their corresponding new values: 1.98 → 0; 3.59 → 0.094; 4.98 → 0.1747; 6.36 → 0.255; 6.86 → 0.2842; 8.05 → 0.3535; 8.16 → 0.3599; 12.43 → 0.6086; 13.45 → 0.668; 19.15 → 1. The normalized table looks like:

CRIM | ZN | INDUS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV
0.00632 | 0.088 | 2.310 | 0.5380 | 6.5750 | 65.20 | 4.0900 | 1 | 296.0 | 15.30 | 396.90 | 0.1747 | 24.00
0.08829 | 0 | 7.870 | 0.5240 | 6.0120 | 66.60 | 5.5605 | 5 | 311.0 | 15.20 | 395.60 | 0.6086 | 22.90
0.14455 | 0 | 7.870 | 0.5240 | 6.1720 | 96.10 | 5.9505 | 5 | 311.0 | 15.20 | 396.90 | 1 | 27.10
0.02763 | 1 | 2.950 | 0.4280 | 6.5950 | 21.80 | 5.4011 | 3 | 252.0 | 18.30 | 395.62 | 0 | 34.90
0.8873 | 0.136 | 5.640 | 0.4390 | 5.9630 | 45.70 | 6.8147 | 4 | 243.0 | 16.80 | 395.56 | 0.668 | 19.70
0.15445 | 0.2 | 5.130 | 0.4530 | 6.1450 | 29.20 | 7.8148 | 8 | 284.0 | 19.70 | 390.68 | 0.2842 | 23.30
0.01951 | 0.08 | 1.380 | 0.4161 | 7.1040 | 59.50 | 9.2229 | 3 | 216.0 | 18.60 | 393.24 | 0.3535 | 33.00
0.04203 | 0.248 | 15.040 | 0.4640 | 6.4420 | 53.60 | 3.6659 | 4 | 270.0 | 18.20 | 395.01 | 0.3599 | 22.90
0.08244 | 0.28 | 4.930 | 0.4280 | 6.4810 | 18.50 | 6.1899 | 6 | 300.0 | 16.60 | 379.41 | 0.255 | 23.70
0.21409 | 0.152 | 5.860 | 0.4310 | 6.4380 | 8.90 | 7.3967 | 7 | 330.0 | 19.10 | 377.07 | 0.094 | 24.80

Data Discretization: Data discretization can be defined as the process of converting some columns of the dataset from numerical to categorical. This is done to support algorithms which work only on fully categorical datasets. As an example, ten rows from the movies table of the IMDB dataset have been considered. We describe the process in two stages: first we show the 10 samples from the movies table, where the attribute we are going to discretize is the IMDB rating (the first column).

IMDB rating | Release date | Title | Year | Genre | Actors
4.8 | 12 Jan 2000 | 18 Wheels of Justice | 2000 | Action, crime, drama | Moranville, John B., Radler, Robert, Satlof
4.4 | ? | Conspiracy | 2005 | Action, drama, thriller | Ostrick, Marc, Young, Eric (VII), Young, Eric Neal
5.5 | 5 July 2005 | 70's House | 2005 | ? | Sizemore, Robert, Frenzel, Guido, Taylor, Mike L
5.5 | 15 August 2000 | 2gether: The Series | 2000 | Comedy, family | Gunn, Mark, Lazarus, Paul, Pozer
4.9 | 7 March 2006 | 8th & Ocean | 2006 | Documentary | Akhtar, Kabir, Barrett, Kasey, Johnson, Jaymee
8.2 | 6 April 2005 | Amarte Asa | 2005 | Drama | Moser, Alejandro Hugo, Juan Carlos Almus
9.1 | 11 March 2002 | American Embassy, The | 2002 | Drama, comedy | Coles, John David, Cragg, Stephen, Surjik, Stephen
5.4 | 2 August 2000 | American High | 2000 | ? | Ellwood, Alison, Partland, Dan Chinn
1.9 | 3 June 2003 | American Juniors | 2003 | ? | Gowers, Bruce
7.9 | ? | As if | 2002 | Comedy | Grant, Brian, Meyers, Simon, Stok, Witold, Corrie

The IMDB rating can be binned in two ways.

Equi-frequency binning:
Bin1 => 'A' = (1.9, 4.4, 4.8)
Bin2 => 'B' = (4.9, 5.4, 5.5, 5.5)
Bin3 => 'C' = (7.9, 8.2, 9.1)

Equi-width binning (width 2.5 over [0, 10]):
Bin1 => 'a' = (1.9)            [<2.5]
Bin2 => 'b' = (4.4, 4.8, 4.9)  [2.5-5]
Bin3 => 'c' = (5.4, 5.5, 5.5)  [5-7.5]
Bin4 => 'd' = (7.9, 8.2, 9.1)  [7.5-10]

After binning (using the equi-frequency labels), the table becomes:

IMDB rating | Release date | Title | Year | Genre | Actors
A | 12 Jan 2000 | 18 Wheels of Justice | 2000 | Action, crime, drama | Moranville, John B., Radler, Robert, Satlof
A | ? | Conspiracy | 2005 | Action, drama, thriller | Ostrick, Marc, Young, Eric (VII), Young, Eric Neal
B | 5 July 2005 | 70's House | 2005 | ? | Sizemore, Robert, Frenzel, Guido, Taylor, Mike L
B | 15 August 2000 | 2gether: The Series | 2000 | Comedy, family | Gunn, Mark, Lazarus, Paul, Pozer
B | 7 March 2006 | 8th & Ocean | 2006 | Documentary | Akhtar, Kabir, Barrett, Kasey, Johnson, Jaymee
C | 6 April 2005 | Amarte Asa | 2005 | Drama | Moser, Alejandro Hugo, Juan Carlos Almus
C | 11 March 2002 | American Embassy, The | 2002 | Drama, comedy | Coles, John David, Cragg, Stephen, Surjik, Stephen
B | 2 August 2000 | American High | 2000 | ? | Ellwood, Alison, Partland, Dan Chinn
A | 3 June 2003 | American Juniors | 2003 | ? | Gowers, Bruce
C | ? | As if | 2002 | Comedy | Grant, Brian, Meyers, Simon, Stok, Witold, Corrie

Data Generalization: Highly detailed data does not by itself give a useful overall picture; large data sets can be summarized concisely according to the requirement.
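The two binning schemes used in the discretization example can be sketched in plain Python. The function names are made up for illustration; note that with ten values and three bins, equal-frequency bins can only be approximately equal in size, so a tie such as the repeated 5.5 rating may land in a different bin than in the hand-worked grouping.

```python
def equi_frequency_bins(values, k):
    """Sort the values and split them into k bins of (roughly) equal size."""
    s = sorted(values)
    size = len(s) // k
    return [s[i * size:(i + 1) * size] if i < k - 1 else s[(k - 1) * size:]
            for i in range(k)]

def equi_width_bins(values, edges):
    """Assign each value to the interval between consecutive edges."""
    bins = [[] for _ in range(len(edges) - 1)]
    for v in sorted(values):
        for i in range(len(edges) - 1):
            if edges[i] < v <= edges[i + 1]:
                bins[i].append(v)
                break
    return bins

ratings = [4.8, 4.4, 5.5, 5.5, 4.9, 8.2, 9.1, 5.4, 1.9, 7.9]
print(equi_frequency_bins(ratings, 3))
print(equi_width_bins(ratings, [0, 2.5, 5, 7.5, 10]))
# equi-width with edges 2.5 apart reproduces bins 'a' through 'd' above
```

Once the bins are computed, discretization replaces each rating by the label of the bin that contains it, as in the transformed table above.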
Data Generalization is the process of transforming data in a database from a low conceptual level to a higher conceptual level. For example, consider the Census dataset, a sub-table of which looks like:

Age | Workclass | Education | MaritalStatus | Race | Sex
39 | State-gov | Bachelors | Never-Married | White | Male
52 | Self-emp-not-inc | HS-grad | Married-Civ-Spouse | White | Male
32 | Private | Assoc-acdm | Never-Married | Black | Male
49 | Private | HS-grad | Separated | White | Female
57 | Federal-gov | Bachelors | Married-Civ-Spouse | Black | Male
25 | Private | Some-college | Married-Civ-Spouse | Other | Female
30 | Private | HS-grad | Married-Civ-Spouse | Asian-Pac-Islander | Female
48 | Self-emp-not-inc | Some-college | Married-Civ-Spouse | Amer-Indian-Eskimo | Male
18 | Never-Worked | 10th | Never-Married | White | Male
33 | State-gov | Some-college | Divorced | Black | Female

When data generalization is applied to the Age attribute, Age is generalized as:

Age: 0-12: Children; 13-19: Teenage; 20-30: Adolescent-Age; 31-50: Middle-Age; >50: Old-Age

The transformed table is:

Age | Workclass | Education | MaritalStatus | Race | Sex
Middle-Age | State-gov | Bachelors | Never-Married | White | Male
Old-Age | Self-emp-not-inc | HS-grad | Married-Civ-Spouse | White | Male
Middle-Age | Private | Assoc-acdm | Never-Married | Black | Male
Middle-Age | Private | HS-grad | Separated | White | Female
Old-Age | Federal-gov | Bachelors | Married-Civ-Spouse | Black | Male
Adolescent-Age | Private | Some-college | Married-Civ-Spouse | Other | Female
Adolescent-Age | Private | HS-grad | Married-Civ-Spouse | Asian-Pac-Islander | Female
Middle-Age | Self-emp-not-inc | Some-college | Married-Civ-Spouse | Amer-Indian-Eskimo | Male
Teenage | Never-Worked | 10th | Never-Married | White | Male
Middle-Age | State-gov | Some-college | Divorced | Black | Female

Data Aggregation: Data aggregation is the process of summarizing information, and it is mainly used for statistical analysis.
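The Age generalization above is simply a mapping from numeric ranges to concept labels. A minimal sketch, assuming the cut-points used in the example (with age 30 treated as Adolescent-Age, as in the transformed table):

```python
def generalize_age(age):
    """Map a raw age to the higher-level concept used in the census example."""
    if age <= 12:
        return "Children"
    if age <= 19:
        return "Teenage"
    if age <= 30:
        return "Adolescent-Age"
    if age <= 50:
        return "Middle-Age"
    return "Old-Age"

ages = [39, 52, 32, 49, 57, 25, 30, 48, 18, 33]
print([generalize_age(a) for a in ages])
# ['Middle-Age', 'Old-Age', 'Middle-Age', 'Middle-Age', 'Old-Age',
#  'Adolescent-Age', 'Adolescent-Age', 'Middle-Age', 'Teenage', 'Middle-Age']
```

Generalizing a column then amounts to applying this mapping to every value, exactly as in the transformed table above.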
Aggregation is particularly useful in marketing scenarios, where the marketer wants to analyze the purchases made by customers over a particular period and, from the analysis, tries to improve the selling techniques and products. A good example is determining which products sell in a particular time period and then introducing a wider variety of the most-sold items or providing offers on those kinds of products.

Data Reduction: Real-world data is highly diverse and therefore needs to be simplified before mining. Data discretization itself is one method of data reduction. In addition, there are several other ways in which data can be reduced:

1) Numerosity reduction: the data are replaced or estimated by smaller alternative representations, such as parametric models, or by non-parametric methods such as clustering, sampling and histograms.

2) Dimensionality reduction: datasets may contain thousands of features. Not all of them are equally important for the task at hand, and some attributes may be quite irrelevant. For example, if we are constructing a dataset for predicting a disease from features such as body_mass_index, temperature, etc., then the feature telephone_number is likely irrelevant. A straightforward goal of dimensionality reduction is to find the best subset of features with respect to the prediction accuracy of the algorithm. If prediction accuracy with the subset is similar to or better than the original accuracy, the removed features are said to be irrelevant.

3) Clustering: objects can be grouped based on a similarity function, reducing huge datasets to groups of interrelated objects.

4) Sampling: sampling is a typical numerosity reduction technique. There are several ways to construct a sample:

a) Simple random sampling without replacement – performed by randomly choosing n1 data points such that n1 < n.
Recall that n is the number of data points in the original dataset D.

b) Simple random sampling with replacement – we again select n1 < n data points, but draw them one at a time (n1 times). In this way, the same data point can be drawn multiple times into the same subsample.

c) Cluster sample – the examples in D are first grouped into M disjoint clusters; then a simple random sample of m < M clusters can be drawn.

d) Stratified sample – D is first divided into disjoint parts called strata. Then a stratified sample of D is generated by taking a simple random sample from each stratum. This helps in getting a representative sample, especially when the data is skewed (say, many more examples of class 0 than of class 1). Stratified samples can be proportionate or disproportionate.
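Three of the sampling schemes above can be sketched with the standard-library `random` module. The dataset and the `frac` parameter are made up for illustration, and the class labels play the role of strata.

```python
import random
from collections import defaultdict

def srs_without_replacement(data, n1):
    """Simple random sample: each data point is drawn at most once (n1 < n)."""
    return random.sample(data, n1)

def srs_with_replacement(data, n1):
    """Draw n1 points one at a time; the same point may appear multiple times."""
    return [random.choice(data) for _ in range(n1)]

def stratified_sample(data, label_of, frac):
    """Proportionate stratified sample: the same fraction from every stratum."""
    strata = defaultdict(list)
    for x in data:
        strata[label_of(x)].append(x)
    sample = []
    for stratum in strata.values():
        k = max(1, round(frac * len(stratum)))
        sample.extend(random.sample(stratum, k))
    return sample

# Skewed data: many more examples of class 0 than of class 1.
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(10)]
s = stratified_sample(data, label_of=lambda x: x[1], frac=0.1)
print(len(s), sum(1 for x in s if x[1] == 1))  # 10 points, exactly 1 from class 1
```

Unlike a plain random sample, which might easily miss the rare class altogether, the stratified sample is guaranteed to contain points from every stratum.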