Data Preprocessing
This work is created by Dr. Anamika Bhargava, Ms. Pooja Kaul, Ms. Priti Bali, and Ms. Rajnipriya Dhawan and licensed
under a Creative Commons Attribution 4.0 International License.



Suppose you are a manager at AllElectronics and have been
charged with analyzing the company's data with respect to the
sales at a branch. You carefully inspect the company's database
and data warehouse, identifying dimensions to be included, such
as item, price, and units sold.
You notice that several of the attributes for various tuples have
no recorded value. For your analysis, you would like to include
this information.
In other words, the data you wish to analyze by data mining
techniques are incomplete, noisy, and inconsistent.


Data cleaning routines work to "clean" the data by filling in
missing values and smoothing noisy data. If users believe the
data are dirty, they are unlikely to trust the results. Furthermore,
dirty data can cause confusion for the mining procedure,
resulting in unreliable output.
Although most mining routines have some procedures for
dealing with incomplete data, they are not always robust.
Instead, they may concentrate on avoiding overfitting the data
to the function being modeled. Therefore, a useful step is to
run your data through some data cleaning routines.


Some attributes representing a given concept may have
different names in different databases, causing inconsistencies
and redundancies.
For example:
The attribute for customer identification may be referred to as
customer id in one data store and cust id in another.
The same first name could be registered as "Bill" in one
database, "William" in another, and "B" in a third.
Having a large amount of redundant data may slow down or
confuse the knowledge discovery process.

Getting back to your data, you have decided that you would
like to use a distance-based mining algorithm for your analysis,
such as neural networks, nearest-neighbor classifiers, or
clustering. Such methods provide better results if the data to be
analyzed have been normalized, that is, scaled to a common
range such as [0.0, 1.0].

Example: Suppose the data contain the attributes age and
annual salary. The annual salary attribute usually takes much
larger values than age. If the attributes are left un-normalized,
the distance measurements taken on annual salary will
generally outweigh the distance measurements taken on age.
There are a number of strategies for data reduction. These
include data aggregation (e.g., building a data cube), attribute
subset selection (removing irrelevant attributes through
correlation analysis), dimensionality reduction (using
encoding schemes such as minimum length encoding), and
numerosity reduction ("replacing" the data by alternative,
smaller representations). Data can also be "reduced" by
generalization with the use of concept hierarchies, where
low-level concepts, such as city for customer location, are
replaced with higher-level concepts, such as region or state.
Data Cleaning
Data Integration
Data Transformation
(e.g., normalization: -2, 32, 100, 59, 48 → -0.02, 0.32, 1.00, 0.59, 0.48)
Data Reduction
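The transformation example above can be reproduced by dividing each value by the attribute's maximum absolute value (here 100). This is one plausible reading of the slide's example; the slide does not name the exact method used:

```python
# Sketch: reproducing the slide's transformation of
# -2, 32, 100, 59, 48 into -0.02, 0.32, 1.00, 0.59, 0.48
# by dividing each value by the maximum absolute value (here 100).

def max_abs_scale(values):
    """Scale values into [-1, 1] by the largest absolute value."""
    m = max(abs(v) for v in values)
    return [v / m for v in values]

data = [-2, 32, 100, 59, 48]
print(max_abs_scale(data))  # [-0.02, 0.32, 1.0, 0.59, 0.48]
```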
Real-world data tend to be incomplete, noisy, and
inconsistent. Data cleaning routines attempt to fill in
missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.

Missing Values:
Ignore the tuple: This is usually done when the class label is
missing. This method is not very effective, unless the tuple
contains several attributes with missing values. It is especially
poor when the percentage of missing values per attribute varies
considerably.


Fill in the missing value manually: In general, this
approach is time-consuming and may not be feasible
given a large data set with many missing values.
Use a global constant to fill in the missing value:
Replace all missing attribute values by the same
constant, such as a label like "Unknown". If
missing values are replaced by, say, "Unknown," then
the mining program may mistakenly think that they
form an interesting concept, since they all have a
value in common, that of "Unknown." Hence,
although this method is simple, it is not foolproof.



Use the attribute mean to fill in the missing value: For
example, suppose that the average income of AllElectronics
customers is $56,000. Use this value to replace the missing
value for income.
Use the attribute mean for all samples belonging to the
same class as the given tuple: For example, if classifying
customers according to credit risk, replace the missing value
with the average income value for customers in the same
credit risk category as that of the given tuple.
Use the most probable value to fill in the missing value:
This may be determined with regression, inference-based
tools using a Bayesian formalism, or decision tree
induction. For example, using the other customer attributes
in your data set, you may construct a decision tree to predict
the missing values for income.
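The fill-in strategies above can be sketched on a toy data set. The records and the credit-risk classes below are hypothetical, chosen only to show the mechanics:

```python
# Sketch: three of the fill-in strategies above, on a toy data set.
# The records and the credit_risk classes are hypothetical.
from statistics import mean

rows = [
    {"income": 45_000, "credit_risk": "low"},
    {"income": None,   "credit_risk": "low"},
    {"income": 67_000, "credit_risk": "high"},
    {"income": 61_000, "credit_risk": "low"},
    {"income": None,   "credit_risk": "high"},
]

# 1. Global constant: replace every missing income with a sentinel label.
constant_fill = [r["income"] if r["income"] is not None else "Unknown"
                 for r in rows]

# 2. Attribute mean over all tuples that have a value.
overall = mean(r["income"] for r in rows if r["income"] is not None)
mean_fill = [r["income"] if r["income"] is not None else overall
             for r in rows]

# 3. Attribute mean within the tuple's own class.
def class_mean(cls):
    return mean(r["income"] for r in rows
                if r["credit_risk"] == cls and r["income"] is not None)

class_fill = [r["income"] if r["income"] is not None
              else class_mean(r["credit_risk"])
              for r in rows]

print(mean_fill[1])   # overall mean, about 57666.67
print(class_fill[1])  # 53000, the mean of the "low" class only
```

Note how the class-conditional mean (53,000 for the "low" class) differs from the overall mean, which is pulled up by the "high" class incomes.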
Noise is a random error or variance in a measured variable.
Given a numerical attribute such as, say, price, how can we
"smooth" out the data to remove the noise?
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15   Bin 2: 21, 21, 24   Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9   Bin 2: 22, 22, 22   Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15   Bin 2: 21, 21, 24   Bin 3: 25, 25, 34
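The worked example above can be reproduced with a short script; this is a sketch of equal-frequency binning, not a library routine:

```python
# Sketch: the equal-frequency binning example from the slide.

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted

def partition(values, n_bins):
    """Split sorted values into equal-frequency bins."""
    size = len(values) // n_bins
    return [values[i:i + size] for i in range(0, len(values), size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin's mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value with the closer of the bin's min or max."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

bins = partition(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```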
Binning: Binning methods smooth a sorted data value by
consulting its "neighborhood," that is, the values around it.
The sorted values are distributed into a number of "buckets,"
or bins.
Clustering: Similar values are organized into groups, or
"clusters." Values that fall outside of the set of clusters may
be considered outliers.
Regression: Data can be smoothed by fitting the data to a
regression function.
Linear regression involves finding the "best" line to fit
two attributes, so that one attribute can be used to predict the
other.
Multiple linear regression is an extension of linear
regression, where more than two attributes are involved and
the data are fit to a multidimensional surface.
Combined computer and human inspection: Outliers may be
identified through a combination of computer and human
inspection.
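As a sketch of smoothing by linear regression, the least-squares line for a single predictor can be computed directly. The (x, y) data points below are made up for illustration:

```python
# Sketch: smoothing one attribute with a least-squares line fit on
# another. The (x, y) data points below are made up for illustration.

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # noisy measurements, roughly y = 2x

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Ordinary least squares for a single predictor.
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

# Smoothed values: each y is replaced by its point on the fitted line.
smoothed = [intercept + slope * x for x in xs]
print(round(slope, 2), round(intercept, 2))  # 1.99 0.05
```

Replacing each observed y with its fitted value removes the random scatter while preserving the underlying trend.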
Data mining often requires data integration, the merging of
data from multiple data stores.
The data may also need to be transformed into forms
appropriate for mining.
Data Integration: It is likely that your data analysis task will
involve data integration, which combines data from
multiple sources into a coherent data store, as in data
warehousing.
Thank you