ADDIS ABABA UNIVERSITY
BUSINESS INTELLIGENCE, DATA WAREHOUSING, AND DATA MINING
CIT 828: Summer 2009
HW Assignment 2
Due: July 7th 2009

Data preprocessing is a key step when you build data mining models. In real-world business settings, a large proportion of the time you spend on a data mining project will go to tasks such as data consolidation, data cleaning, feature construction/transformation, and feature selection. More importantly, the quality of the resulting predictive models will largely depend on your ability to adequately preprocess the raw data and to create meaningful features from it. In a recent cross-selling application, experiments conducted in a mailing campaign in the publishing industry showed that about 50-70% of the accuracy of the predictive models built in these experiments can be explained, at least indirectly, by data preprocessing decisions (sampling, coding of categorical variables, scaling, etc.).

The purpose of this assignment is to familiarize you with some of the preprocessing tools you may need for your projects. You should already have Weka installed from your first assignment. For this assignment, you will need to download the TRAIN2.arff and TRAIN2.csv datasets from the course website.

PART I: FEATURE CONSTRUCTION

Open the TRAIN2.arff file found on the course website in Weka. You should see 9 attributes in the attributes section on the Preprocess tab. Click on each attribute one by one. You should notice that the statistics in the selected attribute section change according to the attribute you select. You should be able to see information about the number of missing values, the attribute type (Nominal vs. Numeric), the number of unique values, and so on.

Transformation

Now click on pgift. You should see from the selected attribute window that pgift is a numeric attribute. You should also see that the distribution is skewed. Let’s transform this attribute. Go to the filter section and click on the Choose button.
Go to the folder filters.unsupervised.attribute. Click on the word NumericTransform in the white text box. In the filter section, click on the box right next to the Choose button, as indicated in the figure:

In the popup (see the figure below), change the method name to log to take the logarithm of the values in attribute pgift. Set the invertSelection flag to False. Enter the number 4 (for attribute 4) in the attributeIndices text box. Click on the More button to see how you might transform multiple attributes at a time. Click the OK button. Now click Apply. Click on pgift again. You should see that the distribution now looks much closer to a normal (bell-shaped) distribution.

Nominal to Binary

Now click on attribute rfa_2f. In the selected attribute window, you should see that rfa_2f is a nominal attribute and that it has four possible values. Go to the NominalToBinary filter the same way you went to the NumericTransform filter above. Set the parameters as follows. Apply the filter.

Question 1: How many attributes do you now see in the attributes window? What possible values do the new attributes take?

Now click the Undo button to roll back the change. Go back to the NominalToBinary filter and set the binaryAttributesNominal flag to True. Apply the filter again.

Question 2: Now, how many attributes do you see in the attributes window? What possible values do the new attributes take?

Discretize

Now click on attribute firstdate. You will see that its type is Numeric. Let’s discretize this attribute using the Discretize filter. Set the parameters as follows:

Click on the More button to learn more about the parameters that you may set. For now, we will leave the bins setting at its default of 10.

Question 3: Select the attribute and look at the ‘selected attribute’ box. What “type” of attribute do you now have? What is the label for the first category? Which category has the fewest observations?
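The three operations used in Part I can also be sketched outside Weka. The Python snippet below is a minimal illustration of the underlying ideas only: a natural-log transform, one 0/1 indicator column per nominal value, and equal-width binning (assumed here to be the kind of binning the Discretize filter applies with its default settings). It is not Weka's implementation. The function names and the sample values are made up for this example, and Weka may treat details such as bin boundaries differently.

```python
import math


def log_transform(values):
    """Return the natural log of each value (all values must be > 0)."""
    return [math.log(v) for v in values]


def nominal_to_binary(values, categories):
    """Return one row of 0/1 indicators per value, one column per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]


def equal_width_bins(values, bins=10):
    """Assign each value a bin index 0..bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        # All values are identical, so everything falls into the first bin.
        return [0] * len(values)
    # The maximum value would land in bin `bins`, so clamp it to the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


# Hypothetical values, not taken from TRAIN2.arff.
gifts = [1.0, 5.0, 20.0, 100.0]
print(log_transform(gifts))
print(nominal_to_binary(["A", "C"], ["A", "B", "C"]))
print(equal_width_bins([0, 5, 10], bins=2))
```

Comparing the output of a sketch like this with what Weka shows after each filter is a quick way to check that you understand what the filter did to the attribute.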
PART II: SAMPLING

Unsupervised Sampling

Select the Resample instance filter in the filters.unsupervised.instance folder. Notice that the current relation section shows that you have 9541 instances. For the Resample filter, set the parameters as indicated in the figure:

Select OK and Apply.

Question 4: How many instances does Weka show in your dataset after sampling?

Remove an attribute

Click on the check box next to target_d. Now click the Remove button at the bottom of the window.

Supervised Sampling

Weka assumes that the last column in your data is the target variable (NOTE: you can change this to another attribute when running classification and feature selection methods). Our data has a uniform class distribution (I already sampled from a larger data set so that we would have approximately the same number of ones and zeros). However, if your dataset were skewed with respect to your class label, you could perform supervised sampling to bias your sample toward a uniform class distribution. You will now perform supervised sampling. Select the Resample instance filter in the filters.supervised.instance folder, and use the parameter settings as indicated in the figure:

Select OK and Apply. Save your updated .arff file now: click on the Save button and save the file as TRAIN2new.arff.

PART III: FEATURE SELECTION

Now click on the Select Attributes tab in Weka. The default evaluator is CfsSubsetEval and the default search method is BestFirst. Change these to InfoGainAttributeEval and Ranker, respectively. Click Start. The attributes in your data set are ranked by information gain with respect to the class.

Question 5: What are the first three attributes ranked by information gain?

You may also try PrincipalComponents (PC) as the evaluator. Note that PC creates dummy variables for all of the attribute/value pairs before performing the analysis.

Go back to the default settings of CfsSubsetEval and BestFirst (these settings will perform the forward selection method we discussed in class). Press Start.
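As an aside on the ranking step earlier in this part: information gain is the drop in class entropy obtained by splitting the data on an attribute. The Python sketch below illustrates that calculation for nominal attributes only (a numeric attribute would have to be discretized first). It is a teaching sketch, not Weka's code. The function names and the tiny data set are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(feature, labels):
    """Class entropy minus the weighted entropy after splitting on feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_by_info_gain(columns, labels):
    """Return attribute names sorted from highest to lowest information gain."""
    return sorted(columns, key=lambda name: info_gain(columns[name], labels),
                  reverse=True)


# Hypothetical toy data: "useful" separates the classes, "noise" does not.
labels = [1, 1, 0, 0]
columns = {
    "noise": ["a", "b", "a", "b"],
    "useful": ["x", "x", "y", "y"],
}
print(rank_by_info_gain(columns, labels))
```

An attribute whose values split the classes cleanly ranks first, while one whose values are unrelated to the class has a gain of zero.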
Question 6: Which attributes were selected?

Now go back to the Preprocess tab and select the attributes that you found in the step above. You will need to check the check boxes next to those attributes as well as the check box next to your target variable (target_b). Now click on the Invert button at the top of the attributes section. Click the Remove button.

Feel free to go to the Classify tab and play around with some of the classification methods we will discuss in class (Decision Trees, Naïve Bayes, MultiLayerPerceptrons, LogisticRegression, K-Nearest Neighbor, etc.) or some of the unsupervised methods like Clustering. For fun, click through the folders to explore the algorithms that are part of the Weka package.

Question 7: (NOT A QUESTION BUT REQUIRES ACTION) Open the TRAIN2new.arff file you created in a text editor (e.g. MS Word). Copy and paste the first 20 lines of the file into your homework assignment.

PART IV: DATA CLEANING

If you haven’t already noticed, Weka uses .arff data files. If you open the TRAIN2.arff data file in a text editor, you will see that it has the following header:

@relation learn-weka.filters.unsupervised.attribute.Remove-R10weka.filters.supervised.instance.Resample-B1.0-S1-Z10.0
@attribute Income {0,1,2,3,4,5,6,7}
@attribute Firstdate numeric
@attribute Lastdate numeric
@attribute pgift numeric
@attribute rfa_2f {1,2,3,4}
@attribute rfa_2a {A,B,C,D,E,F,G}
@attribute pepstrfl {X,0}
@attribute target_b {1,0}
@attribute target_d numeric
@data

In .arff files, the first line must start with @relation, followed by one line for each attribute. Each attribute line begins with @attribute, followed by the name of the attribute and then either the word numeric (for a numeric attribute) or, for a nominal attribute, a comma-separated set of attribute values enclosed in curly brackets. After the attributes are declared, a line with @data marks the end of the header. Following the header, you will see a comma-delimited set of rows.
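To make the header layout concrete, here is a small Python sketch that reads a simplified .arff text into its relation name, attribute declarations, and data rows. It assumes only the plain layout described above (no quoted names, no sparse rows) and is not a full ARFF parser. The function name parse_arff, the relation name demo, and the two data rows are made up for illustration.

```python
def parse_arff(text):
    """Parse a simplified ARFF text into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank and comment lines
            continue
        lower = line.lower()
        if in_data:
            rows.append([cell.strip() for cell in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                spec = [v.strip() for v in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Hypothetical miniature file; the data rows are invented for this example.
SAMPLE = """@relation demo
@attribute pgift numeric
@attribute rfa_2a {A,B,C}
@attribute target_b {1,0}
@data
5.0,A,1
?,C,0
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation)
print(attributes)
print(rows)
```

Numeric attributes come back with the string numeric as their type, nominal attributes with the list of their allowed values, and a missing value stays as a question mark.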
You may be familiar with comma-separated files from Excel. If not, .csv is a file type that Excel can both save and read. You can save comma-separated (csv) files using Excel and then read them into Weka easily. The nice thing is that Weka will automatically detect most nominal attributes and their corresponding values. Once you read a .csv file into Weka, you can save it in .arff format and edit the header according to your needs in a text editor.

This part addresses the top 5 things that will stump you (in Weka) when working with new dirty data. The error descriptions can be a bit cryptic at times (after all, the software is free). Here are some things to be aware of:

1. Records of different length.
2. Missing values not set to a question mark. All missing values must be denoted by a question mark as opposed to a space. For example, a row with 5 columns and 2 missing values like 4,A,,,B must be formatted as 4,A,?,?,B for Weka.
3. Non-alphanumeric characters, which must be removed.
4. Non-nominal target variable. For classification, you want your target value (the attribute you are trying to predict) to be of type Nominal. If Weka detects your target attribute to be numeric, you can discretize the attribute into two bins. However, you can also make sure that the values are detected as non-numeric from your .csv file by giving the values text names. For example, you can call the positive examples pos and the negative examples neg instead of assigning them values of ones and zeros, respectively.
5. Incompatible training and test sets. You make transformations to attributes in your training set and forget to make the same transformations to the attributes in your test set (we won’t deal with this problem just yet).

Open TRAIN2.csv in Weka. Weka will complain. Open TRAIN2.csv in Excel and inspect the data for errors:

1. Make sure all records have the same length.
2. Make sure there are no blank cells (you may need to find blanks and replace them with ?).
3. Make sure there are no bothersome characters (", *, @, etc.). ALSO, for future reference, note that numeric values containing commas cause major trouble!
4. In the last column, target_b, replace all ones with pos and all zeros with neg.

Question 8: (NOT A QUESTION BUT REQUIRES ACTION) Once the data are clean, open the file in Weka and save it as TRAIN2new2.arff. Open the file in a text editor and copy and paste the header plus the first 4 lines of data into your homework assignment.

These exercises were meant to get you familiar with Weka (not to cause you data cleaning pain). Feel free to play around with additional filters and feature selection methods. Next week we will actually start building some models!
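The four cleaning checks listed in this part can also be scripted instead of being done by hand in Excel. The Python sketch below is one possible way to do that, under stated assumptions: the first row is a header, the target is the last column, and the set of characters to strip (BAD_CHARS) is an illustrative choice rather than a list of what Weka rejects. The function name clean_rows and the sample data are hypothetical.

```python
import csv
import io

# Illustrative set of "bothersome" characters to strip (an assumption).
BAD_CHARS = '"*@'


def clean_rows(rows, target_map):
    """Return (cleaned_rows, problem_line_numbers) for a list of CSV rows.

    Rows whose length differs from the header are reported, not repaired.
    """
    header, body = rows[0], rows[1:]
    cleaned, problems = [header], []
    for line_no, row in enumerate(body, start=2):  # line 1 is the header
        if len(row) != len(header):
            problems.append(line_no)
            continue
        # Strip unwanted characters, then mark empty cells as missing ("?").
        row = ["".join(ch for ch in cell if ch not in BAD_CHARS).strip() or "?"
               for cell in row]
        # Recode the target (last column), e.g. 1 -> pos and 0 -> neg.
        row[-1] = target_map.get(row[-1], row[-1])
        cleaned.append(row)
    return cleaned, problems


# Hypothetical three-column file; the last data row is too short.
raw = "x,y,target_b\n4,A,1\n,B*,0\n7,C\n"
rows = list(csv.reader(io.StringIO(raw)))
cleaned, problems = clean_rows(rows, {"1": "pos", "0": "neg"})

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(cleaned)
print(out.getvalue())
print("Lines with the wrong number of fields:", problems)
```

Reporting short or long records instead of silently fixing them is a deliberate choice here: a record of the wrong length usually needs a human to decide which field is missing.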