CAP 4770 Introduction to Data Mining
Final Project Report
Pedro Exposito
ID: 1826385

1. Introduction and Objectives

How can one find an effective classifier for unclassified test data, given another set of the same data that is already classified correctly? That is the question answered by the procedures in this project. We are given two datasets with the same attributes for the same data relation (genes and disease classes). The train dataset has already been classified with the correct class for each of its instances. The test dataset contains more instances of the same relation, but their classes are not given. Our goal is to classify these test tuples as accurately as possible.

The project's objective is to learn a good classifier from the training data and to use it to predict the classes for the test data. Various preprocessing steps will be taken to obtain a more accurate model from the training set. Both Java and WEKA will be used to accomplish this goal: the data preparation is done through a Java program, and the classification model and class predictions are obtained using WEKA's Explorer.

2. Data Preparation using Java

This section gives a brief overview of the Java program used for data preprocessing and goes over the procedures implemented in it to determine which attributes are most significant for building a good classification model; the least significant ones, according to the procedures explained in sections 2.2 and 2.3, will be removed.

2.1 The Java program

Several preprocessing tasks were implemented in Java for this project. Instead of writing a separate Java program to execute each task on a given data file, all the data preparation tasks are options available in the same Java program.
Therefore, this single Java program can be used to perform all the data preprocessing tasks described in this section. The only exception is discretization of numerical attributes (to convert them to nominal), which is the only preprocessing task done in WEKA; the other methods are implemented in this program. The program consists of three classes: Instances.java, ConverterOption.java, and the main method class, Converter.java. These Java files are included in the .zip file that contains this document. The reader is encouraged to examine this Java program (in the IDE of choice) to analyze the algorithms used to implement the procedures and to see how the preprocessed train data files are obtained.

The Converter class uses the standard console to get the required input from the user and calls the corresponding method from ConverterOption, depending on which option the user chooses. It has the following four conversion options:

1- Convert a .CSV file in the ATTRIBUTE-IN-ROW format into the ATTRIBUTE-IN-COLUMN format
2- Convert a .CSV file in the ATTRIBUTE-IN-ROW format into the .ARFF format
3- Generate train_topN.ARFF files using a .CSV file in the ATTRIBUTE-IN-ROW format
4- Generate a test_bestN.ARFF file using the results generated by Option 3

In order to execute Option 4, Option 3 must be run first and the required input given. Option 3 generates all the top training datasets at once, based on the determined top genes. Option 4 must generate the test data with the same attributes as the train set from Option 3, which is why it requires that Option 3 be executed first (so that the boolean data structure that marks which attributes to exclude already holds the correct information to print the new test file). The program also prompts the user as follows:

    Option selected (1/2/3/4/5): 3
    Input file name: train.csv
    Is the first row of the input file a header?
    (y/n): y
    Insert a last column CLASS in the output file? (y/n): y
    Training class file name: train_class.txt
    Column CLASS inserted with ? value? (y/n): n
    Number of attributes: 7070
    Number of instances: 69

In the above, the available answers for some questions are given in parentheses after the question mark. Note that the input files used (e.g. train.csv) MUST be in the same folder as the Java program for it to work correctly. The program takes the file's name, but that file must already be present in the Java project's folder.

The ConverterOption class has methods to perform each option and keeps the top genes (those with the highest t-values, calculated in Instances) saved in a data structure. The Instances class reads the given input file and stores its data (instance data and attribute names) in its internal data structures. It then uses the stored data to apply the required preprocessing procedures (finding fold differences and marking attributes that have less than 2, finding t-values for all attributes, etc.). Once all the preprocessing is done, it saves one or more output files, depending on the chosen option, in the indicated format. The output files generated by this program are ready to be used directly by WEKA. This is also a generic Java program that can be used to process datasets other than the ones for this project: it can process files with different numbers of attributes and instances (this information is entered by the user), so a different training dataset could be used instead.

2.2 Data cleaning

Data cleaning is done by removing from the training data those instances that are considered outliers, i.e. data that is not consistent with the standard behavior of the entire dataset. This is done in class Instances by "marking" which instances are considered outliers.
The procedure to determine whether an instance is an outlier is to add its numerical values across all attributes. This sum is then used to determine whether the instance is far from the dataset's norm. If the sum is less than the number of attributes times -100, or greater than the number of attributes times 1500, then the instance's total of attribute values is far too high or far too low in comparison with nearly all instances in the dataset. Such an instance is marked as an outlier, excluded from the calculations that use instance values, and removed from the printed output file later on. It is also excluded from the selection of top attributes (section 2.3), since it has already been marked as an outlier.

2.3 Selecting top genes by class

The two methods used in class Instances to determine the genes (attributes) with the lowest statistical significance for building the model are fold differences and t-values.

2.3.1 Fold differences

The formula to obtain the fold difference of an attribute is (max - min) / 2, where max is the maximum value of that attribute across all instances and min is its minimum value (among all instances not marked as outliers). A fold difference of less than 2 means that the range of values barely changes for the attribute. An attribute whose values lie in a very narrow range has tiny variability and barely influences the selection of a class for instances (attributes with high variability influence classification more). Therefore, all attributes with a fold difference below 2 are marked and removed from the train data file that will be used to build the classifier.
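As an illustration, the outlier rule from section 2.2 and the fold-difference formula above can be sketched in self-contained Java. The class and method names here are illustrative only; this is not the project's actual Instances code.

```java
// Sketch of the outlier-marking rule (section 2.2) and the fold-difference
// rule (section 2.3.1). Names are illustrative, not the project's own.
public class Preprocess {

    // An instance is an outlier when the sum of its attribute values falls
    // outside [numAttributes * -100, numAttributes * 1500].
    public static boolean isOutlier(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return sum < values.length * -100.0 || sum > values.length * 1500.0;
    }

    // Fold difference of one attribute: (max - min) / 2 over all instances
    // not marked as outliers. Attributes below 2 are dropped.
    public static double foldDifference(double[] columnValues) {
        double max = Double.NEGATIVE_INFINITY, min = Double.POSITIVE_INFINITY;
        for (double v : columnValues) {
            if (v > max) max = v;
            if (v < min) min = v;
        }
        return (max - min) / 2.0;
    }

    public static void main(String[] args) {
        // Sum 4000 exceeds 2 * 1500, so this instance would be marked.
        System.out.println(isOutlier(new double[]{2000.0, 2000.0}));        // true
        // (12 - 10) / 2 = 1.0 < 2, so this attribute would be removed.
        System.out.println(foldDifference(new double[]{10.0, 11.0, 12.0})); // 1.0
    }
}
```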
2.3.2 T-values

Calculating the t-value for each attribute (based on each class, so each attribute gets a different t-value for each class) is the method used to determine the "top genes", or top attributes, for each class. The genes with the highest t-values for each class have the higher statistical significance for building a good classification model. The formula used in class Instances to calculate the t-value for an attribute is:

    Math.abs( (avg1-avg2) / Math.sqrt( ( (std1*std1)/N1 ) + ( (std2*std2)/N2 ) ) )

Some attributes have negative values, so the absolute value is applied to the final result to account for those. N1 is the number of instances, for the current attribute, that are classified with the class whose t-value is being calculated. For example, if we are calculating the t-value of class MED for attribute x, and this attribute appears in 120 instances classified as MED, then N1 = 120. N2 is the remaining number of instances for attribute x, i.e. those not classified as MED. Avg1 and std1 are analogous to N1: they are the average and standard deviation of this attribute over the instances that have the class whose t-value we want (e.g. MED). Likewise, avg2 and std2 are the average and standard deviation of the remaining values, which belong to instances with classes different from MED.

The generated t-values are used to pick the top 2, 4, 6, 8, 10, 12, 15, 20, 25, and 30 attributes for each class. New output data files are produced for each of those cases from the original train data, and those new "top" train data files will be used and tested to obtain the most accurate classifier.

3. Classification and Prediction using WEKA

The preprocessed train datasets train_top2.arff, train_top4.arff, ..., train_top30.arff are obtained from the Java program by running Option 3 once. The corresponding test_BestN.arff file, generated from the original test.csv file for the train dataset that gives the most accurate classification model (i.e.
if train_top10.arff gives the most accurate model, then we want test_Best10.arff), can be obtained by running Option 4. These "top" train datasets contain only attributes with fold differences >= 2, and they contain the N attributes with the largest t-values for each class. They also contain no outlier instances (those are identified by the algorithm described in section 2.2). Therefore, these sets are ready to be used to build classification models in WEKA, except for one detail: their attributes are still numeric (except the class attribute). They are therefore converted to nominal with the Discretize filter prior to building the model.

3.1 Training a good classification model

This section covers choosing the train dataset that builds the most accurate classifier. To do this, all the top train datasets will be run through the same set of classifiers, and their accuracy percentages from WEKA will be recorded in the table shown in this section. Then, the two models from the dataset that give the highest accuracy percentage will be chosen for a few more tests to get an even better model. The final prediction procedure will apply the most accurate model to the test dataset that has the same attributes as the chosen train set (i.e. test_Best10.arff for train_top10.arff; both are obtained from the program).

First, each train dataset is loaded into WEKA and the Discretize filter is applied to it with parameter bins = 100. The filter is located in weka->filters->unsupervised->attributes->Discretize. The number of partitions was chosen to be 100 because it fits most of the attribute ranges appropriately: one thousand or ten bins would fit larger or smaller ranges better, and most attributes in the dataset have a range (max - min) of around 200.
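The equal-width binning idea behind WEKA's unsupervised Discretize filter can be sketched as follows. This is a simplified illustration under the assumption of plain equal-width intervals; it is not WEKA's actual implementation, and the names are hypothetical.

```java
// Equal-width discretization sketch: splits [min, max] of one attribute into
// `bins` intervals of equal width and maps each value to a bin index. A
// simplified illustration of unsupervised equal-width binning, not WEKA code.
public class EqualWidthBinning {

    public static int[] discretize(double[] values, int bins) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            if (v < min) min = v;
            if (v > max) max = v;
        }
        double width = (max - min) / bins;
        int[] result = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int bin = width == 0 ? 0 : (int) ((values[i] - min) / width);
            if (bin == bins) bin = bins - 1;   // the max value falls in the last bin
            result[i] = bin;
        }
        return result;
    }

    public static void main(String[] args) {
        // An attribute with range ~200, as in the report: 100 bins -> width 2.
        double[] column = {0.0, 50.0, 199.9, 200.0};
        System.out.println(java.util.Arrays.toString(discretize(column, 100)));
        // prints [0, 25, 99, 99]
    }
}
```

With a typical range of about 200 and 100 bins, each interval is about 2 units wide, which is why 100 partitions fit these attributes well.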
For consistency in determining the best train dataset, Discretize is applied with 100 bins to all the train datasets. Afterwards, the sets are ready for building classification models in the Classify tab of WEKA's Explorer. For simplicity, the default parameter values are used for all the classifiers applied, except for IBk, which is run twice with two different KNN values, 1 and 4. The same procedure of discretizing each train dataset with 100 bins and recording the built model's accuracy for each classifier is repeated for all train_top datasets.

The classifiers used in this project to build models are NNge, NaiveBayes, J48, IB1, IBk (with KNN values of 1 and 4), and SMO. These classifiers were picked because they belong to different groups of classifiers and, among the ones in each group, they are known to give good accuracy results. Moreover, since different classification approaches are applied, one will naturally be more accurate for this particular type of training dataset than the others. J48 (decision tree classification) is among the tree classifiers and possibly the most popular one in that group. IB1 and IBk are "lazy" classifiers, which rely on the k nearest neighbors to classify. SMO is in the functions group and uses the concept of support vector machines in its classification algorithm. NaiveBayes is from the bayes group; it uses estimator classes for classification. NNge is from the rules group, but it shares similarities with bayes-like classification because it uses a naive-bayes-like algorithm and an equivalent of if-then rules to classify. It is worth noting that a previous WEKA classification project by this project's author gave results showing SMO, NaiveBayes, and IBk to be more accurate than other classifiers, which is another reason to pick them for this classification procedure as well.
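To illustrate the "lazy" instance-based idea behind IB1 and IBk, here is a minimal 1-nearest-neighbour sketch in plain Java (illustrative names and toy data; not WEKA's implementation):

```java
// Minimal 1-nearest-neighbour classifier: a query instance receives the class
// of the closest training instance by Euclidean distance. This is the core
// idea of "lazy" classifiers such as IB1 and IBk (with knn = 1).
public class NearestNeighbour {

    public static String classify(double[][] train, String[] labels, double[] query) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < train.length; i++) {
            double d = 0.0;                      // squared Euclidean distance
            for (int j = 0; j < query.length; j++) {
                double diff = train[i][j] - query[j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return labels[best];
    }

    public static void main(String[] args) {
        // Toy data with two of the report's class labels.
        double[][] train = {{0, 0}, {10, 10}};
        String[] labels = {"MED", "MGL"};
        System.out.println(classify(train, labels, new double[]{1, 1}));  // MED
        System.out.println(classify(train, labels, new double[]{9, 8}));  // MGL
    }
}
```

There is no model-building step at all: the training instances themselves are the "model", which is why these classifiers are called lazy.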
Finally, the classification results were recorded as all classifiers were applied to every train_top dataset. The final accuracy results of every model built are recorded in the following table:

Table 1: Accuracy of Classification Models

Training dataset  NNge    NaiveBayes  J48     IB1     IBk (knn=1)  IBk (knn=4)  SMO
train_top2        66.7%   62.3%       56.5%   62.3%   66.7%        63.8%        60.9%
train_top4        75.4%   76.8%       56.5%   69.6%   73.9%        69.6%        58.0%
train_top6        78.3%   88.4%       56.5%   73.9%   76.8%        73.9%        59.4%
train_top8        72.5%   89.9%       56.5%   78.3%   81.2%        81.2%        58.0%
train_top10       65.2%   87.0%       56.5%   78.3%   85.5%        82.6%        56.5%
train_top12       71.0%   89.9%       56.5%   82.6%   85.5%        82.6%        56.5%
train_top15       68.1%   92.8%       56.5%   85.5%   88.4%        82.6%        56.5%
train_top20       66.7%   95.7%       56.5%   85.5%   89.9%        88.4%        56.5%
train_top25       68.1%   95.7%       56.5%   84.1%   87.0%        90.0%        56.5%
train_top30       68.1%   95.7%       56.5%   84.1%   87.0%        92.8%        56.5%

The 95.7% accuracy obtained by NaiveBayes above is already very favorable, but in order to improve accuracy even further, an additional test will be done. The two most accurate classifiers from Table 1, NaiveBayes and IBk on the train_top30 training data, will be used to build classification models again from the same train_top30.arff file. This time, however, different KNN values will be used for IBk, and different numbers of bins will be used in the prior discretization of train_top30.arff. These tweaks boost accuracy (as Table 2 shows), and since they are applied to the two best classifiers, we arrive at the setup that yields the most accurate classification model.
Table 2 shows the results of this second round of tests:

Table 2: Further tests with train_top30.arff and the two most accurate classifiers

Discretization bins  NaiveBayes  IBk (knn=1)  IBk (knn=4)  IBk (knn=5)
200                  88.4%       85.5%        84.1%        79.7%
100                  95.7%       87.0%        92.8%        94.2%
50                   95.7%       94.2%        97.1%        100.0%
40                   97.1%       94.2%        97.1%        100.0%
35                   98.6%       97.1%        100.0%       100.0%
25                   97.1%       100.0% *     100.0%       100.0%
20                   97.1%       98.6%        98.6%        98.6%

* = best classification model based on correct classification % and mean absolute error

As Table 2 shows, it was possible to get a classification model as accurate as 100% by tuning the preprocessing parameters to find the best combination. Accuracy is not truly 100%, since some mean absolute error remains (100% means that all tuples were classified correctly, not that the error rate is absolutely zero). Still, for this project's purpose, accuracy near 100% and a classification model capable of classifying all the test data correctly is even better than what was hoped for at first. The best model is the one obtained with IBk (knn=1) after discretizing all numerical attributes to nominal using 25 bins (partitions). The full details of this model are in reference section 5.2, which contains WEKA's output statistics for it. This model's .model file (the format in which WEKA saves models) is also included in the project's .zip file. Although there are several models with 100.0% accuracy in Table 2, the marked one had a lower mean absolute error than the others, making it the most accurate.

3.2 Predicting classes for the test dataset

In order to make the most accurate prediction, we first pick the most accurate classification model obtained, which, according to Table 2, is the one for train_top30.arff using IBk with KNN = 1 and a prior discretization using 25 bins.
First, open test_Best30.arff (generated by the Java program's Option 4) and apply the Discretize filter to it with 25 bins as well. Then save that file as test_Best30_discretized25.arff. (The discretized training data may likewise be saved as train_top30_discretized25.arff, so that there is no need to re-preprocess train_top30 if this step has to be repeated.) Second, reopen train_top30.arff and repeat the procedure to build the best model again, using 25 bins for discretization and the IBk classifier with KNN = 1. WEKA gives the same model results as before. Now, select Supplied test set in Test options, click "Set...", then Open file, and open test_Best30_discretized25.arff. Right-click on the model in the Result list and pick Re-evaluate model on current test set. Next, right-click on the model again, pick Visualize classifier errors, and click Save in the window that pops up with the graph. Save this file and then open it using Notepad or WordPad. The predicted class for each of the 112 tuples of the test_Best30 test data is the label that appears before the "?" symbol at the end of each tuple. A predictedCLASS attribute is also added at the top of the file, in the attribute declarations. These are the predicted classes for the original unclassified test.csv data. The predicted classes for the 112 instances of the test dataset are recorded in the 1826385.txt file of the .zip.

***Warning Note***

There was a last-minute need for a second Java I/O program during this section. When test_Best30_discretized25.arff was supplied as the test set, WEKA refused to classify it using the model built from train_top30_discretized25.arff and gave this error instead:

    "Problem evaluating classifier: Train and test set are not compatible"

The problem is that WEKA requires the supplied test set to have EXACTLY the same attributes as the set used to build the model.
The two files did have the same number of attributes and the same format; however, the numerical ranges of their attributes' partitions were not exactly the same, nor were they split at exactly the same values. WEKA requires the attribute declarations of the train and test datasets to be exact copies before it accepts the test set. The solution was to copy train_top30_discretized25.arff's attribute declarations (with their partitions) and replace the attributes section of test_Best30.arff with them. Then, a Java program was written to replace each numerical value in each instance with its matching partition value declared for that attribute in the attribute declarations above (which were copied from train_top30_discretized25 to ensure both attribute declarations are exact copies). The two Java classes in the .zip that make up this second program are named Discretizer.java and UseDiscretizer.java. This second Java program is not generic like the first: it was written to convert this single file (test_Best30.arff) into the format compatible with train_top30_discretized25.arff, so that prediction with the built model was possible in WEKA.

4. Conclusions

J48 and SMO were clearly ineffective for building a good classifier for this project's datasets. J48 in particular was the worst classifier of the ones picked for this task: it gave the same inaccurate result regardless of variations in the size of the training data. SMO did better at the beginning, but in later tests it also converged to the same result as J48. We can safely conclude that decision tree classification and support vector machines are probably not the most suitable algorithms for the problem solved in this project; these two classifiers were the least accurate.

On the other hand, the most well-rounded classifier for building the accurate model we wanted (out of the ones used) was NaiveBayes.
After the first test with train_top2, in which it was not the best, NaiveBayes consistently gave accuracy results better than those obtained with the other classifiers. Ultimately, it was not the classifier that gave the single best accuracy result, but it consistently produced some of the most accurate models and finished second best (only because it did not achieve a 100% accurate model). Overall, NaiveBayes proved to be a great classifier that is both very fast and very accurate. IBk produced the single most accurate model, but on average it was not better than NaiveBayes.

We can also conclude that although very large datasets are not desirable for building the best possible classification model (they may contain outliers or unhelpful data), tiny subsets of the dataset are not ideal for this task either. Examining Table 1, we see that although the initial results for train_top2 were not bad for such a small set, the best classification models were obtained from the larger subsets (e.g. train_top30).

5. Reference

5.1 Project's Data Files

If the reader wishes to duplicate the results of this project, the original train.csv and test.csv data files, plus the .txt file with the classes for the instances of train.csv, are required. These files can be obtained from the data mining class' website at:

http://users.cis.fiu.edu/~taoli/class/CAP4770-F10/index.html

Scroll down from the top of the website to the schedule table. At the end of the table there is a link to download the Final Project Specification document, which contains all the instructions on how to approach this project and how to download the required files.
5.2 Best Classification Model Statistics (WEKA's Output)

=== Run information ===

Scheme:       weka.classifiers.lazy.IBk -K 1 -W 0
Relation:     dataset-weka.filters.unsupervised.attribute.Discretize-B25-M-1.0-Rfirst-last
Instances:    69
Attributes:   149
              [list of attributes omitted]
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

IB1 instance-based classifier
using 1 nearest neighbour(s) for classification

Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          69              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1
Mean absolute error                      0.022
Root mean squared error                  0.0281
Relative absolute error                  8.5244 %
Root relative squared error              7.8761 %
Total Number of Instances               69

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
  1         0          1          1         1        MED
  1         0          1          1         1        MGL
  1         0          1          1         1        RHB
  1         0          1          1         1        EPD
  1         0          1          1         1        JPA

=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 39  0  0  0  0 |  a = MED
  0  7  0  0  0 |  b = MGL
  0  0  7  0  0 |  c = RHB
  0  0  0 10  0 |  d = EPD
  0  0  0  0  6 |  e = JPA