CAP 4770
Introduction to Data Mining
Final Project Report
Pedro Exposito
ID: 1826385
1. Introduction and Objectives
How can one build an effective classifier for unclassified test data, given another set
of the same kind of data that is already correctly classified? This is the question answered
by the procedures in this project. In this assignment, we are given two datasets with the
same attributes for the same data relation (genes and disease classes). The train dataset
has already been labeled with the correct class for each of its instances. The test dataset
contains more instances of the same relation, but their classes are not given. Our goal is
to classify these test tuples as accurately as possible.
The project’s objective is to learn a good classifier from the training data and to use it to
predict the classes for the test data. In order to build a good model from the train data to
classify the test data, various preprocessing steps will be taken to obtain a more accurate
model from the training set. Both Java and WEKA will be used to accomplish this goal.
The data preparation to obtain a good classifier will be done through a Java program and
the classification model and class predictions will be obtained using WEKA’s Explorer.
2. Data Preparation using Java
This section gives a brief overview of the Java program used for data preprocessing and
goes over some procedures implemented in it to determine which attributes are more
significant to build a good classification model—the least significant ones according to the
procedures explained in 2.2 and 2.3 will be removed.
2.1 The Java program
Different preprocessing tasks were implemented in Java for this project. Instead of writing
a separate Java program for each task, all the data preparation tasks are options within
the same Java program. This single program can therefore perform all the preprocessing
tasks described in this section, with one exception: the discretization of numerical
attributes (to convert them to nominal), which is done in WEKA. All other methods are
implemented in this program.
The program consists of three classes: Instances.java, ConverterOption.java, and the main
method class, Converter.java. These Java files are included in the .zip file where this
document is. The reader is encouraged to examine this Java program (in the IDE of their
choice) to analyze the algorithms that implement these procedures and to see how the
preprocessed train data files are obtained. The Converter class uses the standard console
to get the required input from the user and calls the corresponding method from
ConverterOption, depending on which option the user chooses. It has the following
four conversion options available:
1- Convert a .CSV file in the ATTRIBUTE-IN-ROW format into the ATTRIBUTE-IN-COLUMN format
2- Convert a .CSV file in the ATTRIBUTE-IN-ROW format into the .ARFF format
3- Generate train_topN.ARFF files using a .CSV file in the ATTRIBUTE-IN-ROW format
4- Generate a test_bestN.ARFF file using the results generated by Option 3
In order to execute Option 4, Option 3 must be run first with the required input given.
Option 3 generates all the top training datasets at once, based on the top genes it
determines. Option 4 must generate the test data with the same attributes as the train
set from Option 3, which is why Option 3 must be executed first (so that the boolean
data structure marking which attributes to exclude already holds the correct information
to write the new test file). The program also asks the user for input with the following
questions:
Option selected (1/2/3/4/5): 3
Input file name: train.csv
Is the first row of the input file a header? (y/n): y
Insert a last column CLASS in the output file? (y/n): y
Training class file name: train_class.txt
Column CLASS inserted with ? value? (y/n): n
Number of attributes: 7070
Number of instances: 69
In the above, the available answers for some questions are specified in parentheses after
the question mark. Note that the input files used (e.g., train.csv) MUST be in the same
folder as the Java program for it to work correctly. The program takes the file's name,
but that file must already be present in the Java project's folder.
The ConverterOption class has methods to perform each option and keeps the top genes
(according to highest t-values calculated in Instances) saved in a data structure.
The Instances class reads the given input file and stores its data (instance values and
attribute names) in internal data structures. It then uses the stored data to apply the
required preprocessing procedures (finding fold differences and marking attributes with a
value below 2, computing t-values for all attributes, etc.). Once preprocessing is done,
it saves one or more output files, depending on the chosen option, in the indicated
format. The output files generated by this program are ready to be used directly
by WEKA.
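To illustrate the output format these options target, an ARFF file consists of a small declarative header (@relation, one @attribute per column) followed by the data rows. The following is a minimal sketch, not the actual ConverterOption code; the class and method names are illustrative, and only the class-value set comes from this project's data:

```java
import java.io.PrintWriter;

public class ArffWriter {
    // Writes a tiny ARFF file: @relation, one @attribute per column, a nominal
    // CLASS attribute, then @data rows (one comma-separated line per instance).
    static void writeArff(PrintWriter out, String relation,
                          String[] attrNames, double[][] rows, String[] classes) {
        out.println("@relation " + relation);
        for (String a : attrNames) out.println("@attribute " + a + " numeric");
        out.println("@attribute CLASS {MED,MGL,RHB,EPD,JPA}");
        out.println("@data");
        for (int i = 0; i < rows.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (double v : rows[i]) sb.append(v).append(',');
            sb.append(classes[i]);  // class label goes in the last column
            out.println(sb);
        }
    }

    public static void main(String[] args) {
        PrintWriter out = new PrintWriter(System.out, true);
        writeArff(out, "dataset", new String[]{"gene1", "gene2"},
                  new double[][]{{1.5, 2.0}}, new String[]{"MED"});
    }
}
```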
This is also a generic Java program that can process datasets other than the ones for
this project, because it handles files with different numbers of attributes and instances
(this information is entered by the user), so a different training dataset could be used
instead.
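As an illustration of Option 1's core step, converting ATTRIBUTE-IN-ROW data to ATTRIBUTE-IN-COLUMN amounts to transposing the table read from the CSV. A minimal sketch (the class and method names are illustrative, not the actual Converter code):

```java
public class Transpose {
    // Transposes a rectangular table: rows[i][j] becomes out[j][i],
    // so each attribute row becomes an attribute column.
    static String[][] transpose(String[][] rows) {
        int r = rows.length, c = rows[0].length;
        String[][] out = new String[c][r];
        for (int i = 0; i < r; i++)
            for (int j = 0; j < c; j++)
                out[j][i] = rows[i][j];
        return out;
    }

    public static void main(String[] args) {
        // Two attribute rows of three instances each -> three instance rows.
        String[][] rows = {{"g1", "g2", "g3"}, {"1.5", "2.0", "3.5"}};
        String[][] cols = transpose(rows);
        System.out.println(cols.length);   // 3
        System.out.println(cols[0][1]);    // 1.5
    }
}
```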
2.2 Data cleaning
Data cleaning is done by removing instances from the training data that are considered
outliers, i.e., data that is not consistent with the typical behavior of the entire
dataset. This is done in the Instances class by "marking" which instances are considered
outliers. The procedure to determine whether an instance is an outlier is to add up the
numerical values it has across all attributes. This sum is then used to decide whether
the instance is far from the dataset's norm. If the sum is less than the number of
attributes times -100, or greater than the number of attributes times 1500, then the
instance's total attribute value is far too low or far too high compared with nearly all
other instances in the dataset. Such an instance is marked as an outlier, excluded from
all calculations that use instance values, and removed from the printed output file later
on. It is also excluded from the selection of top attributes (section 2.3).
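The rule described above can be sketched as follows; the class and method names are illustrative, not the actual Instances.java code:

```java
public class OutlierCheck {
    static final double LOW_FACTOR = -100.0;
    static final double HIGH_FACTOR = 1500.0;

    // An instance is an outlier when the sum of its attribute values falls
    // outside [numAttributes * -100, numAttributes * 1500].
    static boolean isOutlier(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        int n = values.length;
        return sum < n * LOW_FACTOR || sum > n * HIGH_FACTOR;
    }

    public static void main(String[] args) {
        double[] typical = {120.0, 300.0, 45.0};      // sum 465, inside [-300, 4500]
        double[] extreme = {5000.0, 6000.0, 7000.0};  // sum 18000 > 3 * 1500
        System.out.println(isOutlier(typical));  // false
        System.out.println(isOutlier(extreme));  // true
    }
}
```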
2.3 Selecting top genes by class
Two methods are used in the Instances class to determine which genes (attributes) are the
least and most statistically significant for building the model: fold differences and
t-values.
2.3.1 Fold differences
The formula used to obtain the fold difference of an attribute is (max - min) / 2, where
max is the maximum value of that attribute across all instances and min is its minimum
value (both taken over instances not marked as outliers). Analyzing the formula, a fold
difference below 2 means the attribute's range of values barely changes. An attribute
whose values stay within a very narrow range has tiny variability and barely influences
the selection of a class for instances (attributes with high variability influence
classification more). Therefore, all attributes with a fold difference below 2 are marked
and removed from the train data file that will be used to build the classifier.
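A minimal sketch of the fold-difference test described above (names are illustrative; the actual computation lives in Instances.java):

```java
public class FoldDifference {
    // Fold difference as defined in the report: (max - min) / 2 over the
    // attribute's values for all non-outlier instances.
    static double foldDifference(double[] values) {
        double max = Double.NEGATIVE_INFINITY, min = Double.POSITIVE_INFINITY;
        for (double v : values) {
            if (v > max) max = v;
            if (v < min) min = v;
        }
        return (max - min) / 2.0;
    }

    public static void main(String[] args) {
        double[] flat = {10.0, 11.5, 12.0};   // (12 - 10) / 2 = 1.0 -> removed
        double[] varied = {5.0, 90.0, 40.0};  // (90 - 5) / 2 = 42.5 -> kept
        System.out.println(foldDifference(flat) < 2);    // true
        System.out.println(foldDifference(varied) < 2);  // false
    }
}
```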
2.3.2 T-values
Calculating the t-value for each attribute (per class, so each attribute gets a different
t-value for each class) is the method used to determine the "top genes" or attributes for
each class. The genes with the highest t-values for each class have the greatest
statistical significance for building a good classification model. The formula used in
the Instances class to calculate the t-value for each attribute is:
Math.abs( (avg1-avg2) / Math.sqrt( ( (std1*std1)/N1 ) + ( (std2*std2)/N2 ) ) )
Some attributes have negative values, so the absolute value is applied to the final
result to account for those. N1 is the number of instances classified with the class
whose t-value is being calculated. For example, if we are calculating the t-value of
class MED for attribute x, and 120 instances are classified as MED, then N1 = 120. N2 is
the number of remaining instances, which are not classified as MED. Similarly, avg1 and
std1 are the average and standard deviation of this attribute over the instances with the
class whose t-value we want (e.g., MED), while avg2 and std2 are the average and standard
deviation over the remaining values, which belong to instances with other classes.
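Putting the pieces together, the per-class t-value computation can be sketched as below. The names are illustrative, and the population standard deviation is assumed here since the report does not state which variant Instances.java uses:

```java
public class TValuePerClass {
    static double mean(double[] xs) {
        double s = 0.0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    // Population standard deviation (an assumption; the sample formula with
    // n - 1 would give slightly different values).
    static double std(double[] xs) {
        double m = mean(xs), s = 0.0;
        for (double x : xs) s += (x - m) * (x - m);
        return Math.sqrt(s / xs.length);
    }

    // t-value of one attribute for one class, mirroring the formula above:
    // inClass holds the attribute's values for instances of that class,
    // rest holds its values for all other instances.
    static double tValue(double[] inClass, double[] rest) {
        double avg1 = mean(inClass), avg2 = mean(rest);
        double std1 = std(inClass), std2 = std(rest);
        int n1 = inClass.length, n2 = rest.length;
        return Math.abs((avg1 - avg2)
                / Math.sqrt((std1 * std1) / n1 + (std2 * std2) / n2));
    }

    public static void main(String[] args) {
        // Well-separated class values yield a large t-value.
        double t = tValue(new double[]{10, 12, 14}, new double[]{2, 4});
        System.out.println(t > 7.0 && t < 8.0);  // true
    }
}
```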
The generated t-values are used to pick the top 2, 4, 6, 8, 10, 12, 15, 20, 25, and 30
attributes for each class. A new output data file is produced from the original train
data for each of these cases, and these new "top" train data files will be tested to
obtain the most accurate classifier.
3. Classification and Prediction using WEKA
Preprocessed train datasets train_top2.arff, train_top4.arff, ..., train_top30.arff are
obtained from the Java program by running Option 3 once. The corresponding
test_BestN.arff file—generated from the original test.csv—for whichever train dataset
gives the most accurate classification model (e.g., if train_top10.arff gives the most
accurate model, we want test_Best10.arff) can be obtained by running Option 4. These
"top" train datasets contain only attributes with a fold difference >= 2, and they
contain the N attributes with the largest t-values for each class. They also contain no
outlier instances (those are identified by the algorithm described in section 2.2).
Therefore, these sets are ready to be used to build classification models in WEKA, except
for one detail: their attributes are still numeric (except the class attribute). They are
therefore converted to nominal with the Discretize filter prior to building the model.
3.1 Training a good classification model
This section consists of choosing the best train dataset for building the most accurate
classifier. To do this, all the top train datasets are run through the same set of
classifiers, and their accuracy percentages from WEKA are recorded in the table in this
section. Then the two models with the highest accuracy are chosen for a few more tests to
obtain an even better model. The final prediction procedure applies the most accurate
model to the test dataset that has the same attributes as the chosen train set (e.g.,
test_Best10.arff for train_top10.arff—both obtained from the program).
First, each train dataset is loaded into WEKA and the Discretize filter with parameter
bins = 100 is applied to it. The filter is located at
weka->filters->unsupervised->attributes->Discretize. The number of partitions was set to
100 because it fits most of the attribute ranges appropriately. One thousand or ten bins
would not be a good choice, because those fit larger or smaller ranges better, and most
attributes in the dataset have a range (max - min) around 200. To keep the comparison of
train datasets consistent, Discretize is applied with 100 bins to all of them.
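For intuition, WEKA's unsupervised Discretize filter performs equal-width binning: it splits each attribute's [min, max] range into the requested number of intervals. A minimal sketch of the bin-index arithmetic (illustrative, not WEKA's implementation):

```java
public class EqualWidthBins {
    // Equal-width binning: splits [min, max] into `bins` intervals of equal
    // width and returns the 0-based index of the interval holding `value`.
    static int binIndex(double value, double min, double max, int bins) {
        if (value >= max) return bins - 1;        // clamp the top edge
        double width = (max - min) / bins;
        int idx = (int) ((value - min) / width);
        return Math.max(0, Math.min(idx, bins - 1));
    }

    public static void main(String[] args) {
        // Range [0, 200] split into 100 bins of width 2: value 150 lands in bin 75.
        System.out.println(binIndex(150.0, 0.0, 200.0, 100));  // 75
        System.out.println(binIndex(200.0, 0.0, 200.0, 100));  // 99
    }
}
```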
Afterwards, the sets are ready for building classification models in WEKA Explorer's
Classify tab. For simplicity, the default parameter values are used for all the
classifiers applied, except for IBk, which is run twice with two different KNN values, 1
and 4. The same procedure of discretizing each train dataset with 100 bins and recording
the built model's accuracy for each classifier is followed for all train_top datasets.
The classifiers used in this project to build models are NNge, NaiveBayes, J48, IB1, IBk
(with knn values of 1 and 4), and SMO. These classifiers were picked because they belong
to different groups of classifiers and, within their groups, are known to give good
accuracy. With several classification approaches applied, one will naturally prove more
accurate than the others for this particular type of training dataset. J48 (decision tree
classification) is among the tree classifiers and possibly the most popular one in that
group. IB1 and IBk are "lazy" classifiers, which classify each instance based on its k
nearest neighbors. SMO is in the functions group and uses support vector machines in its
classification algorithm. NaiveBayes is from the bayes group; it uses estimator classes
for classification. NNge is from the rules group; it is a nearest-neighbor-like algorithm
that uses generalized exemplars, which act as an equivalent of if-then rules when
classifying. It is also worth noting that a previous WEKA classification project done by
this project's author showed SMO, NaiveBayes, and IBk to be more accurate than other
classifiers, which is another reason to pick them for this classification procedure as
well.
Finally, every classifier was applied to every train_top dataset and the accuracy of each
model built was recorded. The final accuracy results are shown in the following table:
Table 1: Accuracy of Classification Models (accuracy % of each classification model)

Training      NNge     NaiveBayes   J48      IB1      IBk       IBk       SMO
Dataset                                               (knn=1)   (knn=4)
train_top2    66.7%    62.3%        56.5%    62.3%    66.7%     63.8%     60.9%
train_top4    75.4%    76.8%        56.5%    69.6%    73.9%     69.6%     58.0%
train_top6    78.3%    88.4%        56.5%    73.9%    76.8%     73.9%     59.4%
train_top8    72.5%    89.9%        56.5%    78.3%    81.2%     81.2%     58.0%
train_top10   65.2%    87.0%        56.5%    78.3%    85.5%     82.6%     56.5%
train_top12   71.0%    89.9%        56.5%    82.6%    85.5%     82.6%     56.5%
train_top15   68.1%    92.8%        56.5%    85.5%    88.4%     82.6%     56.5%
train_top20   66.7%    95.7%        56.5%    85.5%    89.9%     88.4%     56.5%
train_top25   68.1%    95.7%        56.5%    84.1%    87.0%     90.0%     56.5%
train_top30   68.1%    95.7%        56.5%    84.1%    87.0%     92.8%     56.5%
The 95.7% accuracy obtained by NaiveBayes above is already very favorable, but to improve
it further, an additional test is done. The two most accurate classifiers from Table
1—NaiveBayes and IBk on the train_top30 training data—are used to build classification
models again from the same train_top30.arff file. This time, however, different KNN
values are used for IBk, and different numbers of bins are used in the prior
discretization of train_top30.arff. These tweaks do boost accuracy (as Table 2 shows),
and since they are applied to the two best classifiers, they lead to the setup that
yields the most accurate classification model. Table 2 shows the results of this second
round of tests:
Table 2: Further tests with train_top30.arff and the two most accurate classifiers

Discretization   NaiveBayes   IBk        IBk        IBk
Bins Number                   (knn=1)    (knn=4)    (knn=5)
200              88.4%        85.5%      84.1%      79.7%
100              95.7%        87.0%      92.8%      94.2%
50               95.7%        94.2%      97.1%      100.0%
40               97.1%        94.2%      97.1%      100.0%
35               98.6%        97.1%      100.0%     100.0%
25               97.1%        100.0% *   100.0%     100.0%
20               97.1%        98.6%      98.6%      98.6%

* = best classification model based on correct classification % and mean absolute error
As Table 2 shows, it was possible to obtain a classification model with 100% accuracy by
tuning the preprocessing parameters to the best combination possible. Accuracy is not
truly perfect, since some mean absolute error remains (100% just means that all tuples
were classified correctly, not that the error rate is absolute zero). Still, for this
project's purpose, accuracy that is nearly 100%, with a classification model capable of
classifying all the test data correctly, is even better than what we hoped for at first.
The best model is the one obtained with IBk (knn=1) after discretizing all numerical
attributes to nominal with 25 bins (partitions). The full details of this model are in
reference section 5.2, which contains WEKA's output statistics for it. This model's
.model file (the format in which WEKA saves models) is also included in the project's
.zip file. Although a few models in Table 2 reach 100.0%, this one had a lower mean
absolute error than the others, which makes it the most accurate of them.
3.2 Predicting classes for the test dataset
In order to make the most accurate prediction, we first pick the most accurate
classification model obtained: the one for train_top30.arff using IBk with KNN=1 and a
prior discretization with 25 bins, according to Table 2. First, open test_Best30.arff
(generated by the Java program's Option 4), apply the Discretize filter to it with 25
bins as well, and save the result as test_Best30_discretized25.arff. (You may also save
the discretized training set as train_top30_discretized25.arff, so that train_top30 does
not need to be re-preprocessed if this step has to be repeated.) Second, re-open
train_top30.arff and repeat the procedure to build the best model again, using 25 bins
for discretization and the IBk classifier with KNN=1. WEKA gives the same model results
as before. Now select Supplied test set in Test Options, click "Set...", then Open File,
and open test_Best30_discretized25.arff. Right-click the model in the Result list and
pick Re-evaluate model on current test set. Next, right-click the model again, pick
Visualize classifier errors, and click Save in the window that pops up with the graph.
Save this file and open it with Notepad or WordPad. The predicted class for each of the
112 tuples of the test_Best30 test data is the label that appears before the "?" symbol
at the end of each tuple. A predictedCLASS attribute is also added at the top of the file
in the attribute declarations. These are the predicted classes for the original
unclassified test.csv data, and the predictions for the 112 instances of the test dataset
are recorded in the 1826385.txt file of the .zip.
***Warning Note***
There was a last-minute need for a second Java I/O program during this section. When
test_Best30_discretized25.arff was supplied as the test set, WEKA refused to classify it
with the model built from train_top30_discretized25.arff and gave this error instead:
"Problem evaluating classifier:
Train and test set are not compatible"
The problem is that WEKA requires the supplied test set to have EXACTLY the same
attributes as the set used to build the model. The two files did have the same number of
attributes and the same format; however, the numerical ranges of their respective
partitions were not exactly the same, nor were they split at exactly the same values.
WEKA only accepts the test set when the attribute declarations of the train and test sets
are exact copies. The solution was to copy train_top30_discretized25.arff's attribute
declarations (with their partitions) and replace the attribute section of
test_Best30.arff with them, and then write a Java program that replaces each numerical
value in each instance with its matching partition value declared for that attribute
(copied from train_top30_discretized25 to ensure both attribute declarations are exact
copies). The two Java classes in the .zip that make up this second program are named
Discretizer.java and UseDiscretizer.java. Unlike the first program, this one is not
generic: it was written to convert this single file (test_Best30.arff) into the format
compatible with train_top30_discretized25.arff, so that prediction in WEKA was possible
with its built model.
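The core of that replacement step can be sketched as an interval lookup: given the ascending cut points copied from the training file's discretized attribute declaration, find the interval label that a raw test value falls into. The names and label formats below are illustrative, not the actual Discretizer.java code:

```java
public class PartitionMapper {
    // Maps a raw numeric value to the partition label it belongs to.
    // cuts are ascending upper bounds; labels.length == cuts.length + 1.
    static String labelFor(double value, double[] cuts, String[] labels) {
        for (int i = 0; i < cuts.length; i++) {
            if (value <= cuts[i]) return labels[i];
        }
        return labels[labels.length - 1];  // above the last cut point
    }

    public static void main(String[] args) {
        double[] cuts = {-50.0, 0.0, 50.0};
        String[] labels = {"(-inf--50]", "(-50-0]", "(0-50]", "(50-inf)"};
        System.out.println(labelFor(-10.0, cuts, labels));  // (-50-0]
        System.out.println(labelFor(75.0, cuts, labels));   // (50-inf)
    }
}
```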
4. Conclusions
J48 and SMO were clearly ineffective for building a good classifier for this project's
datasets. J48 in particular was the worst of the classifiers picked for this task: it
gave the same inaccurate result regardless of variations in the size of the training
data. SMO did better at the beginning, but in later tests it produced the same result as
J48. We can conclude that decision tree classification and support vector machines are
probably not the most suitable algorithms for the problem solved in this project;
these two classifiers were the least accurate.
On the other hand, the most well-rounded classifier for building the accurate model we
wanted (out of the ones used) was NaiveBayes. After the first test with train_top2—in
which it was not the best—it consistently gave better accuracy than the other
classifiers. Ultimately, it did not produce the single best model, but it consistently
gave some of the most accurate results and finished second best (only because it did not
reach the 100% accurate model). Overall, NaiveBayes proved to be a great classifier that
is both very fast and very accurate. IBk produced the single most accurate model, but on
average it was not better than NaiveBayes.
Moreover, we can also conclude that although very large datasets are not desirable for
building the best possible classification model (they may contain outliers or unhelpful
data), tiny subsets of the dataset are not ideal for this task either. Table 1 shows that
although the initial results for train_top2 were not bad for such a small set, the best
classification models were obtained from the larger subsets (i.e., train_top30).
5. Reference
5.1 Project’s Data Files
If the reader wishes to duplicate the results of this project, then the original train.csv and
test.csv data files, plus the .txt file with the classes for instances of train.csv are required.
These files can be obtained from the data mining class’ website at:
http://users.cis.fiu.edu/~taoli/class/CAP4770-F10/index.html
Scroll down from the top of the website to the schedule table. At the end of the table there
is a link to download the Final Project Specification document. The specification document
contains all the instructions on how to approach this project and how to download the
required files.
5.2 Best Classification Model Statistics (WEKA’s Output)
=== Run information ===

Scheme:       weka.classifiers.lazy.IBk -K 1 -W 0
Relation:     dataset-weka.filters.unsupervised.attribute.Discretize-B25-M-1.0-Rfirst-last
Instances:    69
Attributes:   149
              [list of attributes omitted]
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

IB1 instance-based classifier
using 1 nearest neighbour(s) for classification

Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          69              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1
Mean absolute error                      0.022
Root mean squared error                  0.0281
Relative absolute error                  8.5244 %
Root relative squared error              7.8761 %
Total Number of Instances               69
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
  1         0          1          1         1        MED
  1         0          1          1         1        MGL
  1         0          1          1         1        RHB
  1         0          1          1         1        EPD
  1         0          1          1         1        JPA
=== Confusion Matrix ===
a b c d e <-- classified as
39 0 0 0 0 | a = MED
0 7 0 0 0 | b = MGL
0 0 7 0 0 | c = RHB
0 0 0 10 0 | d = EPD
0 0 0 0 6 | e = JPA