Data Mining of Machine Learning
Performance Data
Remzi Salih Ibrahim
Master of Applied Science
(Information Technology)
1999
RMIT University
Abstract
With the development and penetration of data mining within different fields and
industries, many data mining algorithms have emerged. The selection of a good
data mining algorithm to obtain the best result on a particular data set has become very
important. What works well for a particular data set may not work well on another. The
goal of this thesis is to find associations between classification algorithms and
characteristics of data sets by first building a file of data sets, their characteristics and the
performance of a number of algorithms on each data set; and second applying
unsupervised clustering analysis to this file to analyze the generated clusters and
determine whether there are any significant patterns.
Six classification algorithms were applied to 59 data sets and then three clustering
algorithms were applied to the data generated. The patterns and properties of the clusters
formed were then studied. The six classification algorithms used were OneR (1R), Kernel
Density, Naïve Bayes, C4.5, Rule Learner and IBK. The clustering algorithms used were
K-means clustering, Kohonen Vector Quantization, and Autoclass Bayesian clustering.
The major discovery made by analyzing the generated clusters is that the clusters were
formed based on the accuracy of the algorithms. Each data set was grouped into a cluster
whose error rates were lower than, about equal to, or higher than the average error rates
of the population. This suggests that there are three kinds of data sets among the 59 data
sets considered: ‘easy-to-learn’, ‘moderate-to-learn’, and ‘hard-to-learn’ data sets.
Another discovery made by this thesis is that the number of instances in a data set was
not useful for clustering analysis of the machine learning performance data. It became the
only significant variable in clustering the data sets and prevented analysis based on the
other variables, including those containing the accuracy of each classification algorithm.
While not directly relevant to clustering, it was also found that the number of instances
and number of attributes in the data sets do not have strong influence on the performance
of the data mining algorithms on the 59 data sets considered as high error rates were
obtained for both small data sets with a small number of attributes and large data sets
with a large number of attributes.
Experiments performed for this thesis also allowed the comparison of the performance of
the 6 classification algorithms with their default parameter settings. It was discovered that
in terms of performance, the top three algorithms were Kernel Density, C4.5, and Naïve
Bayes, followed by Rule Learner, IBK and OneR.
- II -
Declaration
I certify that all work on this thesis was carried out between June 1998 and June 2000 and
it has not been submitted for academic award at any other College, Institute or
University. The work presented was carried out under the supervision of Dr. Vic
Ciesielski. All other work in the thesis is my own except where acknowledged in the text.
Signed,
Remzi Salih Ibrahim
June, 2000
- III -
Table of Contents
List of Tables ....................................................................................................................VI
List of Graphs ................................................................................................................. VII
Acknowledgements ........................................................................................................VIII
Chapter 1. Introduction...................................................................................................... 9
1.1 Goals .............................................................................................................................. 10
1.2 Scope............................................................................................................................... 10
Chapter 2. Literature Survey............................................................................................ 11
2.1 Supervised Learning ..................................................................................................... 11
2.1.1 Supervised Algorithms Used in this thesis .................................................................. 12
2.1.1.1 C4.5....................................................................................................................................... 12
2.1.1.2 Rule Learner (PART)............................................................................................................ 16
2.1.1.3 OneR (1R) ............................................................................................................................. 17
2.1.1.4 IBK........................................................................................................................................ 18
2.1.1.5 Naïve Bayes .......................................................................................................................... 18
2.1.1.6 Kernel Density ...................................................................................................................... 19
2.2 Unsupervised Learning................................................................................................. 20
2.2.1 Unsupervised Data Mining Algorithms Used in This Thesis..................................... 20
2.2.1.1 K-means clustering ............................................................................................................... 20
2.2.1.2 Kohonen Vector Quantization............................................................................................... 20
2.2.1.3 Autoclass (Bayesian Classification System) ......................................................................... 21
2.3 Related work on Comparison of Classifiers................................................................ 21
Chapter 3. Data Generation............................................................................................. 24
3.1 Collection of Data sets........................................................................................................ 24
3.2 Selection of Data Mining Algorithms ............................................................................... 24
3.3 Generating Data ................................................................................................................. 24
Chapter 4. Clustering and Pattern Analysis.................................................................... 28
4.1 Results from k-means clustering.................................................................................. 28
4.2 Results from k-means clustering (without Number of Instances)............................. 30
4.3 Results from Kohonen Vector Quantization Clustering............................................ 34
4.4 Results from Autoclass (Bayesian Classification System) Analysis .......................... 38
4.5 Comparison................................................................................................................... 41
4.5.1 Comparison of significant variables.......................................................................................... 41
4.5.2 Comparison of Data sets in Different Clusters.......................................................................... 41
4.5.3 Influence of Characteristics of Data Sets on Performance of Classification Algorithms.......... 42
Chapter 5. Conclusion...................................................................................................... 43
Appendix A. About WEKA.................................................................................................... 45
Appendix B. About Enterprise Miner (Commercial software)........................................... 46
- IV -
Appendix C. Detail Results from K-means Analysis ........................................................... 47
Appendix D. Detail results from Kohonen Vector Quantization/Kohonen ....................... 48
Appendix E. Detail Result from Autoclass Clustering Analysis ......................................... 49
Appendix F. Sample Code used to run Multiple Algorithms on Multiple Data sets......... 52
Appendix G. Sample Output from data generation.............................................................. 53
6. References .................................................................................................................. 54
-V-
List of Tables
TABLE 1: SAMPLE IRIS DATA .......................................................................................................................... 13
TABLE 2: RESULTS OBTAINED FROM APPLYING 6 DATA MINING ALGORITHMS TO 59 DATA SETS. BLANKS
INDICATE SITUATIONS WHERE ALGORITHMS GAVE NO RESULT. ............................................................. 26
TABLE 3: DEFINITION OF VARIABLES USED.................................................................................................... 26
TABLE 4: SUMMARY OF THE DATA GATHERED FROM RUNNING THE 6 DATA MINING ALGORITHMS ON 59 DATA
SETS ....................................................................................................................................................... 27
TABLE 5: IMPORTANCE LEVEL OF VARIABLES IN DETERMINING CLUSTER...................................................... 28
TABLE 6: PROPERTIES OF THE 5 CLUSTERS .................................................................................................... 29
TABLE 7: IMPORTANCE VARIABLE IN DETERMINING CLUSTERS ..................................................................... 30
TABLE 8: GENERAL PROPERTIES OF THE CLUSTERS FROM K-MEANS ANALYSIS ............................................. 31
TABLE 9: GENERAL PROPERTIES OF CLUSTERS AND SIGNIFICANT VARIABLES ............................................... 32
TABLE 10: COMPLETE LIST OF DATA SETS IN EACH CLUSTER AND THE VALUES OF THE SIGNIFICANT
CLUSTERING VARIABLES........................................................................................................................ 33
TABLE 11: IMPORTANCE OF VARIABLES IN DETERMINING CLUSTERS IN KOHONEN VECTOR QUANTIZATION
ANALYSIS .............................................................................................................................................. 34
TABLE 12: GENERAL PROPERTIES OF THE CLUSTERS FROM KOHONEN VECTOR QUANTIZATION ANALYSIS .. 35
TABLE 13: GENERAL PROPERTIES OF CLUSTERS MEAN ERROR RATES OF THE SIGNIFICANT VARIABLES ......... 36
TABLE 14: COMPLETE LIST OF DATA SETS IN EACH CLUSTER AND THE VALUES OF THE SIGNIFICANT
CLUSTERING VARIABLES........................................................................................................................ 37
TABLE 15: SIGNIFICANCE LEVEL OF VARIABLES IN AUTOCLASS CLUSTERING ............................................... 38
TABLE 16. PROPERTIES OF CLUSTERS FROM AUTOCLASS ANALYSIS .............................................................. 39
TABLE 17: COMPLETE LIST OF DATA SETS IN EACH CLUSTER AND THE VALUES OF THE SIGNIFICANT
CLUSTERING VARIABLES........................................................................................................................ 40
TABLE 18: COMPARISON OF SIGNIFICANT VARIABLES FOUND BY THE THREE CLUSTERING ALGORITHMS ...... 41
TABLE 19: SUMMARY OF THE SIGNIFICANCE OF EACH VARIABLE FOR THE 3 CLUSTERING ALGORITHMS. ...... 41
TABLE 20: DATA SETS IN EACH COLUMN WERE GROUPED INTO ONE CLUSTER BY ALL THREE ALGORITHMS .. 42
- VI -
List of Graphs
FIGURE 1: A DECISION TREE PRODUCED FOR THE IRIS DATA SET ................................................................... 14
FIGURE 2: RULE OUTPUT PRODUCED FROM THE SAS ENTERPRISE MINER SOFTWARE................................... 16
FIGURE 3: PARTIAL OUTPUT FROM WEKA ONER PROGRAM ........................................................................ 17
- VII -
Acknowledgements
I would like to thank Dr. Vic Ciesielski for being supportive and very patient during the
progress of my thesis. He has been very understanding of the problems that arise from
working full time, studying part time and still having to fulfill family commitments.
I would like to pass my thanks to Dr. Isaac Balbin for his support and his flexibility with
the deadline.
My thanks also go to the WEKA support team and the staff from SAS Institute Australia
for their support and all staff from the RMIT AI (Artificial Intelligence) group who
inspired me to research in this field.
I would like to thank all members of my family and my friends for their patience during
the progress of my thesis.
- VIII -
Chapter 1. Introduction
In this current age of technology, data has become more readily available than ever.
Using technologies like data warehousing, data is being stored in large quantities. The
availability of such data opened the door for new data analysis techniques to emerge. As
Weiss and Indurkhya [55] explain, as the amount of data stored in existing information
systems mushroomed, a new set of objectives for data management has emerged. Mining
data has become one of the important means of obtaining useful information.
The term data mining is defined by Fayyad, Piatetsky-Shapiro and Smyth [19] as the part
of Knowledge Discovery in Databases (KDD) process relating to methods for extracting
patterns from data. The KDD process involves the complete steps of obtaining
knowledge from data and includes selecting, pre-processing, transformation and mining
of data followed by interpretation and evaluation of patterns.
Data mining has many advantages across different industries. It allows large historical
data to be used as the background for prediction. The interpretation and evaluation of the
patterns obtained by data mining produces new knowledge that decision-makers can act
upon [42]. Data mining provides a means to obtain information that can support decision
making and predict new business opportunities. For example, telecommunications, stock
exchanges, and credit card and insurance companies use data mining to detect fraudulent
use of their services; the medical industry uses data mining to predict the effectiveness of
surgical procedures, medical tests, and medications; and retailers use data mining to
assess the effectiveness of coupons and special events [41].
With the development and penetration of data mining within different fields, many data
mining algorithms emerged. The selection of a good data mining algorithm to obtain the
best result on a particular data set has become very important. What works well for one
data set may not work well on another. Furthermore, the ‘No Free Lunch’ theorem
by Wolpert and Macready [61,page 2] has established that “it is impossible to say that
any technique is better than another over the space of all problems. In particular, if
algorithm A outperforms algorithm B on some cost functions, then loosely speaking,
there must exist exactly as many other functions where B outperforms A”.
An example of the ‘No Free Lunch’ Theorem that has been encountered in this thesis is
the case of the performance of the C4.5 and Rule Learner algorithms on the “EchMonths”
and “Hungarian” data sets (table 2, page 26). While the C4.5 algorithm obtained an error
rate of only 0.6 percent on “EchMonths”, Rule Learner obtained 57.69 percent. But
on the “Hungarian” data set, Rule Learner outperformed C4.5, with an error rate of 19.05
percent against 22.11 percent.
While the ‘No Free Lunch’ Theorem has established that there can be no one `best'
learning algorithm, the question of `What kinds of algorithms are best suited to what
kinds of data?' remains an open question. While there has been some work comparing
different algorithms on a range of data sets (the STATLOG project [12], Lim and Loh
[35]), there has been little work on trying to characterize data sets (for example, big,
-9-
small, numeric, symbolic, mixed) and matching algorithms to data characteristics. With
the emergence of hundreds of data mining algorithms today, such information will help
data mining analysts to make intelligent decisions in choosing an appropriate data mining
algorithm for certain types of data mining files.
1.1
Goals
The major goal of this thesis is to find associations between classification algorithms and
characteristics of data sets by a two-step process:
1. Build a file of data set names, their characteristics and the performance of a number
of algorithms on each data set.
2. Apply unsupervised clustering to the file built in step 1, analyze the generated clusters
and determine whether there are any significant patterns.
1.2
Scope
Due to time limitations for an MBC minor thesis, the scope of this thesis will be
restricted to:
• 6 supervised learning algorithms
• 59 small to medium size data sets with number of attributes ranging from 7 to 76.
• Running of the 6 supervised algorithms on the 59 data sets using only default
settings of the algorithms
• Using 3 unsupervised learning algorithms for cluster analysis
• Characteristics of the data sets limited to only number of attributes and number of
instances
- 10 -
Chapter 2. Literature Survey
Concepts and papers that are relevant to this thesis are discussed in this chapter. First
both supervised and unsupervised learning techniques are discussed followed by the
description of all the algorithms used in this thesis. Finally, three papers that are related
to this thesis are discussed in detail.
Machine learning is described by Witten and Frank [58] as the acquisition of knowledge
and the ability to use it. They explain that learning in data mining involves finding and
describing structural patterns in data for the purpose of helping to explain that data and
make predictions from it. For example, the data could contain examples of customers
who have switched to another service provider in the telecommunication industry and
some that have not. The output of learning could be the prediction of whether a particular
customer will switch to another service provider or not.
There are two common types of learning: supervised and unsupervised.
2.1
Supervised Learning
Learning or adaptation is supervised when there is a desired response that can be used by
the system to guide the learning. Decision trees and neural nets are two common types of
supervised learning. This type of learning always requires a target variable to predict.
Supervised learning algorithms have been used in many applications. For example,
supervised learning has been used in the seismic phase identification in the field of
nuclear science [28] and for the prediction of tornados [36].
Supervised learning involves the gathering of data to be used for data mining, identifying
the target variable, breaking up of the data into training and testing data and developing
the classifier. The training data is used by the data mining algorithm to ‘learn’ the data
and build a classifier. The test data is used to evaluate the performance of the classifier on
new data. The performance of a classifier is commonly measured by the percentage of
incorrectly classified instances on the data used. Train error rate refers to the percentage
of incorrectly classified instances on the training data and test error rate refers to
percentage of incorrectly classified instances on the test data.
One of the problems of supervised learning is overfitting [58]. The classifier works well
on the training data but not on test data. This happens when the model learns the training
data ‘too well’. To get an indication of the amount of overfitting, the model should be
tested using a test data set or cross validation. If, after training, the test error rate is
approximately equal to training error rate, the test error rate is an indication of the kind of
generalization that will occur.
- 11 -
Cross-validation is a method for estimating how well a classifier will perform on new
data and is based on "resampling" [33]. Cross-validation is well suited to small data sets,
as it allows all of the data to be used for training. In k-fold cross-validation, the
data is divided into k subsets of equal size. The model is trained k times, each time
leaving out one of the subsets from training, but using only the omitted subset to compute
the error rate. If k equals the sample size, this is called leave-one-out cross-validation.
Leave-one-out cross-validation often works well for continuous error functions such as
the mean squared error, but it may perform poorly for non-continuous error functions
such as the number of misclassified cases [33]. A value of 10 for k is commonly used and
is also used for this thesis.
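The k-fold procedure just described can be sketched in a few lines. The following is an illustrative Python version (the experiments in this thesis used WEKA's built-in cross-validation, not this code, and `train_fn` is a placeholder for any learning algorithm):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle instance indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_error(instances, labels, train_fn, k=10):
    """Train k times, each time leaving out one fold, and compute the
    error rate only on the omitted instances."""
    errors = 0
    for fold in k_fold_indices(len(instances), k):
        held = set(fold)
        train_idx = [i for i in range(len(instances)) if i not in held]
        classifier = train_fn([instances[i] for i in train_idx],
                              [labels[i] for i in train_idx])
        errors += sum(1 for i in fold if classifier(instances[i]) != labels[i])
    return errors / len(instances)
```

Setting k to the number of instances turns this into leave-one-out cross-validation.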
Some data mining algorithms do not support continuous target variables. In such cases,
binning or discretization is used. Binning is a method of converting continuous targets
into categorical values. For instance, if one of the variables is age, its values
can be transformed from specific values into ranges such as "less than 20 years", "21
to 30 years", "31 to 40 years" and so on [4].
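A minimal binning sketch for the age example above (the quoted ranges leave the value 20 unassigned; this sketch puts it in the first bin, which is an assumption):

```python
def bin_age(age):
    """Discretize a continuous age into the ranges quoted above.
    Ages up to 20 go in the first bin (an assumption; the quoted
    ranges are silent on the value 20)."""
    if age < 21:
        return "less than 20 years"
    lower = (age - 1) // 10 * 10 + 1   # 21, 31, 41, ...
    return f"{lower} to {lower + 9} years"
```

For example, `bin_age(25)` returns `"21 to 30 years"`.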
2.1.1 Supervised Algorithms Used in This Thesis
The basic theory behind each of the 6 classification algorithms and the details of how
each algorithm works are discussed in this section. The parameters that affect the
performance of each of the algorithms are also discussed and where possible, papers that
describe successful applications are also cited.
2.1.1.1 C4.5
C4.5 is a decision tree algorithm devised by Quinlan [43]. Decision trees are used to
classify instances into different categories and are common types of classification
algorithms. First what a decision tree is will be discussed followed by the properties of
C4.5 algorithm.
The ‘iris’ data set will be used to explain how a decision tree works. A sample of the iris
data set is shown in table 1. The data contains the petal length, petal width, sepal length
and sepal width of iris plants. There are three different categories of this plant:
‘Iris-versicolor’, ‘Iris-virginica’ and ‘Iris-setosa’. This is shown as the class variable in table 1
and is the target variable for the iris data set. There are 50 cases of each category in the
data set.
The goal is to determine what distinguishes each category of iris plants from one another
so that it is possible to know to which category an iris plant belongs given the four input
variables. An example of a decision tree produced from analysis of this data is shown in
figure 1. The root node (top node) of the tree in figure 1 shows how many of each
category is found before any analysis is made. There are three leaves in the tree. Each
leaf is assigned the class of the majority of its instances. For example, the
second leaf in the tree in figure 1 is assigned the class ‘Iris-versicolor’ because the
- 12 -
majority class in this leaf is ‘Iris-versicolor’ with 48 instances. The other class in this leaf
has only 4 observations, giving the node an error rate of 7.7%.
The decision tree can be interpreted as follows: an iris plant with ‘petalwidth’ less than
0.8 is classified as ‘Iris-setosa’ and an iris plant with ‘petalwidth’ greater than 0.8
and less than 1.65 is categorized as ‘Iris-versicolor’. All the rest (with ‘petalwidth’
greater than 1.65) are classified as ‘Iris-virginica’. Based on this tree, an unknown iris
plant with ‘petalwidth’ of 1.4 would be classified as ‘Iris-versicolor’. Note that only
‘petalwidth’ is used to classify the instances; all the other input variables have been
determined to be irrelevant.
The classification from the tree misclassified 6 instances out of 150, which is 4%.
Therefore, the error rate for this tree on the training data is 4%.
SEPALLENGTH  SEPALWIDTH  PETALLENGTH  PETALWIDTH  CLASS
5.1          3.5         1.4          0.2         Iris-setosa
4.9          3.0         1.4          0.2         Iris-setosa
4.7          3.2         1.3          0.2         Iris-setosa
4.6          3.1         1.5          0.2         Iris-setosa
6.3          3.3         6.0          2.5         Iris-virginica
5.8          2.7         5.1          1.9         Iris-virginica
7.1          3.0         5.9          2.1         Iris-virginica
6.3          2.9         5.6          1.8         Iris-virginica
6.5          3.0         5.8          2.2         Iris-virginica
5.5          2.6         4.4          1.2         Iris-versicolor
6.1          3.0         4.6          1.4         Iris-versicolor
5.8          2.6         4.0          1.2         Iris-versicolor
5.0          2.3         3.3          1.0         Iris-versicolor
5.6          2.7         4.2          1.3         Iris-versicolor
Table 1: sample iris data.
- 13 -
Root node:
  Iris-virginica   33.3%  (50)
  Iris-versicolor  33.3%  (50)
  Iris-setosa      33.3%  (50)
  Total           100.0% (150)

Split on Petalwidth:
  Petalwidth < 0.8:
    Iris-virginica    0.0%  (0)
    Iris-versicolor   0.0%  (0)
    Iris-setosa     100.0% (50)
    Total           100.0% (50)
  0.8 <= Petalwidth < 1.65:
    Iris-virginica    7.7%  (4)
    Iris-versicolor  92.3% (48)
    Iris-setosa       0.0%  (0)
    Total           100.0% (52)
  Petalwidth >= 1.65:
    Iris-virginica   95.8% (46)
    Iris-versicolor   4.2%  (2)
    Iris-setosa       0.0%  (0)
    Total           100.0% (48)

Figure 1: A decision tree produced for the Iris data set.
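The tree in figure 1 can be read directly as a classification function. The sketch below simply hard-codes its three rules in Python as an illustration; it is not part of the thesis experiments:

```python
def classify_iris(petalwidth):
    """Classify an iris plant using only 'petalwidth',
    following the decision tree in figure 1."""
    if petalwidth < 0.8:
        return "Iris-setosa"
    elif petalwidth < 1.65:
        return "Iris-versicolor"
    else:
        return "Iris-virginica"
```

For example, `classify_iris(1.4)` returns `"Iris-versicolor"`, matching the unknown-plant example above.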
According to [43], the first task for C4.5 is to decide which of the non-target variables is
the best variable to split the instances. In the example above, the ‘petalwidth’ variable
was chosen. To choose this attribute, at a node, the decision tree algorithm considers each
attribute field in turn (for example ‘petalwidth’, ‘petallength’, ‘sepallength’ and
‘sepalwidth’ in the case of the iris data). Then, every possible split is tried. C4.5 uses a
criterion called information ratio to compare the value of potential splits. The information
ratio provides an estimate of how likely a split on a variable is to lead to a leaf which
contains few errors or has low disorder. Disorder is a measure of how pure a given node
is. A node with high disorder contains instances of several different target classes while a
node with low disorder contains instances predominantly of one class. The
information ratio is calculated for all the variables, and the variable with the
largest information ratio is chosen as the split variable.
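The disorder and split-selection computations can be illustrated with a short sketch. Here ‘information ratio’ is interpreted as C4.5's gain ratio (information gain divided by the split's own entropy), which is an assumption about the criterion named above:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Disorder of a node: 0 when all instances share one class."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, left, right):
    """Information gain of a binary split divided by the split's own
    entropy -- a sketch of C4.5's gain-ratio criterion."""
    n = len(labels)
    gain = entropy(labels) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info if split_info > 0 else 0.0
```

A split that separates the classes perfectly scores 1.0; impure splits score lower, so the ‘winner’ variable is the one maximizing this value.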
The tree grows in a similar manner. For each child node of the root node, the decision
tree algorithm examines all the remaining attributes to find candidates for splitting. If the
field takes on only one value, it is eliminated from consideration since there is no way it
can be used to make a split. The best split for each of the remaining attributes is
determined. When all cases in a node are of the same type, then the node is a leaf node.
- 14 -
But how good is this tree at classifying unknown data? Perhaps not very good, as it is
built using training data only, which could lead to overfitting. So how does C4.5 avoid
this problem of overfitting?
C4.5 uses a method called pruning to avoid overfitting. There are two types of pruning:
prepruning and postpruning. Postpruning refers to building a complete tree and
pruning it afterwards. Postpruning makes the tree less complex and probably more
general by replacing a subtree with a leaf or with its most common branch. When this is
done, the leaf will correspond to several classes, but its label will be the most common
class in the leaf (as was the case in figure 1). A parameter that affects postpruning is the
confidence threshold: using a lower confidence value causes more drastic pruning. The
default confidence value is 25%.
Prepruning involves deciding when to stop developing subtrees during the tree building
process. For example specifying the minimum number of observations in a leaf can
determine the size of the tree. The default value of minimum number of instances is 2. By
default, C4.5 uses postpruning only but it can use prepruning.
After a tree is constructed, the C4.5 rule induction program can be used to produce a set
of equivalent rules. The rules are formed by writing a rule for each path in the tree and
then eliminating any unnecessary antecedents and rules.
An example of the rules produced from the decision tree in figure 1 is shown in figure 2.
Rule 1, for example, shows that if ‘Petalwidth’ is less than 0.8 then the instance belongs
in node 2, which has 50 observations and is classified as ‘Iris-setosa’.
- 15 -
IF Petalwidth < 0.8
THEN
  NODE            : 2
  N               : 50
  IRIS-VIRGINICA  : 0.0%
  IRIS-VERSICOLOR : 0.0%
  IRIS-SETOSA     : 100.0%

IF 0.8 <= Petalwidth < 1.65
THEN
  NODE            : 3
  N               : 52
  IRIS-VIRGINICA  : 7.7%
  IRIS-VERSICOLOR : 92.3%
  IRIS-SETOSA     : 0.0%

IF 1.65 <= Petalwidth
THEN
  NODE            : 4
  N               : 48
  IRIS-VIRGINICA  : 95.8%
  IRIS-VERSICOLOR : 4.2%
  IRIS-SETOSA     : 0.0%

Figure 2: Rule output produced from the SAS Enterprise Miner software.
C4.5 is currently one of the most commonly used data mining algorithms and is available
in many commercial data mining products. The ease of its interpretability as well as its
methods for dealing with numeric attributes, missing values, noisy data, and generating
rules from trees make it a very good choice for practical classification.
C4.5 was successfully used in the application of automated identification of bat calls
using 160 reference calls from eight bat species. The automated identification of pulse
parameters led to good results for species with distinct differences in calls, with four out
of eight species classified correctly in 95% of attempts [24].
2.1.1.2 Rule Learner (PART)
The PART algorithm forms rules from pruned partial decision trees built using C4.5’s
heuristics. According to Witten and Frank [58], the main advantage of PART over C4.5
is that, unlike C4.5, the rule learner algorithm does not need to perform global
optimization to produce accurate rule sets. To make a single rule, a pruned decision tree
is built, the leaf with the largest coverage is made into a rule, and the tree is discarded.
This avoids overfitting by only generalizing once the implications are known.
For example, going back to figure 1, PART would consider the first branch in the tree
and build the rule: if ‘petalwidth’ is less than 0.8 then the plant is ‘Iris-setosa’, then
discard all the ‘Iris-setosa’ instances from consideration. It continues with similar rules
for the rest of the tree.
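The separate-and-conquer loop just described can be sketched as follows. The two callables are placeholders standing in for C4.5's pruned-tree construction and for selecting the leaf with the largest coverage; they are not WEKA functions:

```python
def part_rules(instances, build_pruned_tree, best_leaf):
    """PART's rule-building loop (sketch): repeatedly build a pruned
    tree, turn its best leaf into a rule, discard the tree, and drop
    the instances the new rule covers."""
    rules = []
    remaining = list(instances)
    while remaining:
        tree = build_pruned_tree(remaining)   # build a pruned tree
        condition, label = best_leaf(tree)    # best leaf -> one rule
        rules.append((condition, label))
        # discard the tree; keep only the instances the rule leaves uncovered
        remaining = [x for x in remaining if not condition(x)]
    return rules
```

Each iteration commits to one rule and shrinks the training set, which is how PART avoids the global rule-set optimization that C4.5's rule generator performs.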
- 16 -
As for C4.5, the parameters that affect the performance of the algorithm are the minimum
number of instances in each leaf and the confidence threshold for pruning.
Frank and Witten [20] describe the results of an experiment performed on multiple data
sets. The result from this experiment showed that PART outperformed the C4.5 algorithm
on 9 occasions whereas C4.5 outperformed PART on 6.
2.1.1.3 OneR (1R)
OneR is one of the simplest classification algorithms. As described by Holte [26], OneR
produces simple rules based on one attribute only. It generates a one-level decision tree,
which is expressed in the form of a set of rules that all test one particular attribute. It is a
simple, cheap method that often comes up with quite good rules for characterizing the
structure in data [59]. It often gets reasonable accuracy on many tasks by simply looking
at one attribute.
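For nominal attributes the whole algorithm fits in a few lines. This sketch omits the discretization of numeric attributes that the full OneR performs:

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """OneR sketch for nominal attributes: for each attribute, map each
    of its values to the majority class among instances with that value,
    then keep the single attribute whose rule makes the fewest errors."""
    best = None
    attributes = [a for a in rows[0] if a != target]
    for attr in attributes:
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[target]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - max(c.values())
                     for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best  # (attribute, {value: class}, training errors)
```

The returned rule tests only the winning attribute, exactly the one-level decision tree described above.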
An example of a classification performed by OneR on the ‘iris’ data set is shown in
figure 3. As can be seen from the figure, OneR produced rules stating that when
‘petallength’ is less than 2.45 the iris plant is classified as ‘Iris-setosa’; when
‘petallength’ is greater than or equal to 2.45 and less than 4.75 the iris plant is classified
as ‘Iris-versicolor’; and when ‘petallength’ is greater than or equal to 4.75 the iris plant
is classified as ‘Iris-virginica’. This gave 143 correct classifications out of 150 on the
training data, an error rate of 4.7%.
## 1R Rule Output
% rule for 'petallength':
'class'('Iris-setosa') :- 'petallength'(X), X <2.45 % 50/50
'class'('Iris-versicolor') :- 'petallength'(X), X <4.75 % 44/50
'class'('Iris-virginica') :- 'petallength'(X), 4.75 =< X. % 48/50
% 1Rw Error Rate 4.7 % (143/150) (on training set)
Figure 3: Partial output from WEKA OneR program.
A comprehensive study of the performance of OneR algorithm by Holte [26] was
reported on sixteen data sets frequently used by machine learning researchers to evaluate
their algorithms. Cross-validation was used to ensure that the results were representative
of what would be obtained on independent test sets. The research found that OneR
performed very well in comparison with other more complex algorithms and Holte
encourages the use of simple data mining algorithms like OneR to establish a
performance baseline before progressing to more sophisticated learning algorithms.
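The method can be made concrete with a minimal sketch of 1R for discrete attributes (an illustration only, not WEKA's implementation; the function name `one_r` and the toy play/don't-play data are invented for this example):

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """Learn a 1R rule set: for each attribute, map each of its values to the
    majority class, then keep the attribute whose rule makes the fewest
    training errors."""
    attrs = [a for a in rows[0] if a != target]
    best = None
    for attr in attrs:
        # Count class frequencies for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # One rule per attribute value: predict the majority class.
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(n for v, c in counts.items()
                     for cls, n in c.items() if cls != rule[v])
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best  # (attribute, value -> class rule, training errors)

# Invented toy data set.
rows = [
    {"outlook": "sunny",    "windy": "y", "play": "no"},
    {"outlook": "sunny",    "windy": "n", "play": "no"},
    {"outlook": "rainy",    "windy": "y", "play": "no"},
    {"outlook": "rainy",    "windy": "n", "play": "yes"},
    {"outlook": "overcast", "windy": "y", "play": "yes"},
    {"outlook": "overcast", "windy": "n", "play": "yes"},
]
attr, rule, errors = one_r(rows, "play")
```

On this toy data the ‘outlook’ attribute gives the fewest training errors, so its one-level rule set is kept, mirroring how OneR picked ‘petallength’ for the iris data.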
- 17 -
2.1.1.4 IBK
IBK is an implementation of the k-nearest-neighbors classifier. Each case is considered
as a point in multi-dimensional space and classification is done based on the nearest
neighbors. The value of ‘k’ for nearest neighbors can vary. This determines how many
cases are to be considered as neighbors to decide how to classify an unknown instance.
For example, for the ‘iris’ data, IBK would consider the 4 dimensional space for the four
input variables. A new instance would be classified as belonging to the class of its closest
neighbor using Euclidean distance measurement. If 5 is used as the value of ‘k’, then 5
closest neighbors are considered. The class of the new instance is considered to be the
class of the majority of the instances. If 5 is used as the value of k and 3 of the closest
neighbors are of type ‘Iris-setosa’, then the class of the test instance would be assigned as
‘Iris-setosa’.
The time taken to classify a test instance with a nearest-neighbor classifier increases
linearly with the number of training instances that are kept in the classifier. It has a large
storage requirement [59]. Its performance degrades quickly with increasing noise levels.
It also performs badly when different attributes affect the outcome to different extents.
One parameter that can affect the performance of the IBK algorithm is the number of
nearest neighbors to be used. By default it uses just one nearest neighbor.
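The classification step described above can be sketched as follows (a simplified illustration of majority-vote nearest neighbors over Euclidean distance, not the WEKA IBK code; the 2-D points standing in for two iris classes are invented):

```python
import math
from collections import Counter

def knn_classify(train, query, k=1):
    """Majority vote among the k training points nearest to `query`,
    using Euclidean distance (k=1 mirrors IBK's default)."""
    nearest = sorted(train, key=lambda case: math.dist(case[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented 2-D points standing in for two iris classes.
train = [((1.0, 1.0), "Iris-setosa"),   ((1.2, 0.9), "Iris-setosa"),
         ((1.1, 1.1), "Iris-setosa"),   ((4.0, 4.0), "Iris-virginica"),
         ((4.2, 3.9), "Iris-virginica")]
label = knn_classify(train, (1.0, 1.2), k=3)
```

Note that the whole training set must be scanned for every query, which is the linear classification cost and large storage requirement mentioned above.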
IBK has been used for gesture recognition as discussed by Kadus [30]. With 95 signs
collected from 5 people with a total of 6650 instances, the accuracy obtained from this
research was approximately 80 per cent. The signs used were very similar to each other
and an accuracy of 80 percent was considered to be very high. This research also found
that instance based learning was better than C4.5 at tasks involved in the gesture tasks
tested.
2.1.1.5 Naïve Bayes
The Naïve Bayes classification algorithm is based on Bayes’ rule, using probabilities
computed from the training data to make predictions. Naïve Bayes assumes that
the input attributes are statistically independent. It analyses the relationship between each
input attribute and the dependent attribute to derive a conditional probability for each
relationship [11]. These conditional probabilities are then combined to classify new
cases.
An advantage of Naïve Bayes algorithm over some other algorithms is that it requires
only one pass through the training set to generate a classification model.
Naïve Bayes works very well when tested on many real-world data sets [58], and can
obtain results that are much better than those of more sophisticated algorithms. However, if a
particular attribute value does not occur in the training set in conjunction with a given class
value, Naïve Bayes assigns that combination a probability of zero, which can hurt its
performance. It can also perform poorly on some data sets because attributes are treated as
though they are independent, whereas in reality they are correlated.
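The idea can be sketched for categorical attributes as follows (a minimal illustration, not the implementation used in this thesis; the helper names and the toy data are invented). Note how a single pass over the training set collects all the counts, and how a zero count for an attribute-value/class combination zeroes out the product, which is the weakness noted above:

```python
from collections import Counter, defaultdict

def nb_train(rows, target):
    """One pass over the training data collects the class counts and the
    per-class counts of each (attribute, value) pair."""
    class_counts = Counter()
    cond_counts = defaultdict(Counter)
    for row in rows:
        cls = row[target]
        class_counts[cls] += 1
        for attr, val in row.items():
            if attr != target:
                cond_counts[cls][(attr, val)] += 1
    return class_counts, cond_counts

def nb_classify(model, instance):
    """Pick the class maximising P(class) * product of P(value | class)."""
    class_counts, cond_counts = model
    total = sum(class_counts.values())
    best_cls, best_p = None, -1.0
    for cls, n in class_counts.items():
        p = n / total
        for attr, val in instance.items():
            # A zero count zeroes out the whole product for this class.
            p *= cond_counts[cls][(attr, val)] / n
        if p > best_p:
            best_cls, best_p = cls, p
    return best_cls

# Invented toy data set.
rows = [
    {"outlook": "sunny",    "windy": "y", "play": "no"},
    {"outlook": "sunny",    "windy": "n", "play": "no"},
    {"outlook": "rainy",    "windy": "y", "play": "no"},
    {"outlook": "rainy",    "windy": "n", "play": "yes"},
    {"outlook": "overcast", "windy": "y", "play": "yes"},
    {"outlook": "overcast", "windy": "n", "play": "yes"},
]
pred = nb_classify(nb_train(rows, "play"), {"outlook": "sunny", "windy": "y"})
```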
- 18 -
2.1.1.6 Kernel Density
The Kernel Density algorithm works in a very similar fashion to Naïve Bayes. The main
difference is that, unlike Naïve Bayes, Kernel Density does not assume a normal
distribution of the data; instead it fits a combination of kernel functions.
According to Beardah and Baxter [2], Kernel Density estimates are similar to histograms
but provide smoother representation of the data.
Beardah and Baxter [2] illustrate some of the advantages of kernel density estimates for
data presentation in archaeology. They show that Kernel Density estimates can be used as
a basis for producing contour plots of archeological data which lead to a useful graphical
representation of the data.
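For a single numeric attribute, a kernel density estimate places a kernel (here Gaussian) on every observation and averages them, which is what makes the estimate smoother than a histogram. A rough sketch (the function name, bandwidth default, and sample values are invented for illustration):

```python
import math

def kde(xs, x, bandwidth=1.0):
    """Kernel density estimate at x: the average of Gaussian bumps
    centred on each observation -- in effect a smoothed histogram."""
    gauss = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(gauss((x - xi) / bandwidth) for xi in xs) / (len(xs) * bandwidth)

sample = [4.9, 5.0, 5.1, 6.3, 6.4]   # invented measurements
density_near = kde(sample, 5.0)      # high: many observations nearby
density_far = kde(sample, 9.0)       # low: no observations nearby
```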
- 19 -
2.2
Unsupervised Learning
Unsupervised learning deals with finding clusters of records that are similar in some way.
As discussed earlier, unsupervised learning does not require a target variable for analysis.
According to Berry and Gordon [4], unsupervised learning is often useful when there are
many competing patterns in the data, making it hard to spot any single pattern. Building
clusters of similar records reduces the complexity within clusters so that other data
mining techniques are more likely to succeed. In unsupervised learning, the main concern
is in obtaining clusters in data that have useful patterns.
2.2.1 Unsupervised Data Mining Algorithms Used in This Thesis
2.2.1.1
K-means clustering
In k-means clustering, the desired number of clusters (k) is specified first. The
algorithm then selects k cluster seeds (centers), located approximately uniformly in a
multi-dimensional space. Each observation is assigned to the nearest cluster seed to form
temporary clusters. The cluster means are then calculated and used as new cluster
centers, and the observations are reallocated to clusters according to the new centers.
This is repeated until no further change in the cluster centers occurs. The observations
are assigned to clusters so that every observation belongs to at most one cluster [57].
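The iteration described above can be sketched as follows (Lloyd's algorithm with random initial seeds rather than uniformly placed ones; a simplified illustration, not the implementation used in this thesis):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: assign each point to its nearest center,
    recompute each center as its cluster's mean, and repeat until the
    centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initial seeds
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        # New center = mean of the cluster (keep the old seed if empty).
        new = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[j]
               for j, cl in enumerate(clusters)]
        if new == centers:                   # converged
            break
        centers = new
    return centers, clusters

# Two well-separated invented groups of 2-D observations.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = kmeans(points, 2)
```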
According to Weiss and Indurkhya [56], not all the variables are equally important in
determining the clusters. For each variable, an importance value is computed as a value
between 0 and 1 to represent the relative importance of the given variable to the
formation of the clusters. Variables that have the greatest contribution to the cluster
profile have importance values closer to 1. A decision tree analysis can be used to
calculate the relative importance values from a selected sample of the training data. The
first split is most important. It has been discovered that variables having large variance
tend to have more effect on the resulting clusters than variables with small variance.
Some implementations of k-means clustering use these importance values in assigning
cases to clusters [48].
2.2.1.2
Kohonen Vector Quantization
Kohonen Vector Quantization is a clustering method invented by Kohonen [48].
The algorithm is similar to the k-means clustering algorithm, but the original seeds,
called code book vectors, are chosen at random. The algorithm finds the seed closest to each
training case in a multidimensional space and moves that "winning" seed closer to the
training case. The seed is moved a certain proportion of the distance between it and
the training case; the proportion is specified by the learning rate [48].
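A single update step of this scheme can be sketched as follows (an illustration under the description above; the function name and values are invented):

```python
import math

def vq_step(seeds, case, rate=0.1):
    """One Kohonen VQ update: the winning (closest) seed moves `rate`
    of the distance towards the training case; other seeds stay put."""
    win = min(range(len(seeds)), key=lambda i: math.dist(seeds[i], case))
    seeds[win] = tuple(s + rate * (c - s) for s, c in zip(seeds[win], case))
    return win

seeds = [(0.0, 0.0), (10.0, 10.0)]
winner = vq_step(seeds, (2.0, 0.0), rate=0.5)   # moves seed 0 to (1.0, 0.0)
```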
- 20 -
2.2.1.3 Autoclass (Bayesian Classification System)
Autoclass is an unsupervised Bayesian classification system that infers classes based on
Bayesian statistics [14]. It divides the problem into two parts - the calculation of the
number of classes and the estimation of the classification parameters. It uses the
Expectation Maximization (EM) algorithm to estimate the parameter values that best fit
the data for a given number of classes. EM is an approximation algorithm that
converges to a local maximum of the likelihood. By default, Autoclass fits a normal
probability distribution for numeric data and a multinomial distribution for symbolic data.
According to Cheeseman and Stutz [14], Autoclass can consider different underlying
probability distribution types for the numeric attributes and is computationally intensive.
Autoclass was developed at NASA and has been used for extracting useful
information from databases [14]. For example, it has been used to extract information
from Infrared Astronomical Satellite (IRAS) data [21].
2.3 Related Work on Comparison of Classifiers
Lim and Loh [35] compare the prediction accuracy, complexity and training time of
different classification algorithms. The paper discusses the results of a
comparison of twenty-two decision tree, nine statistical, and two neural network algorithms
on thirty-two data sets in terms of classification accuracy, training time and (in the case
of trees) number of leaves.
Some of the twenty-two decision tree algorithms compared are CART, S-Plus tree, C4.5,
FACT (Fast Classification Tree), QUEST, IND, OC1, LMDT, CAL5 and T1. The statistical
algorithms compared include LDA (Linear Discriminant Analysis), QDA (Quadratic
Discriminant Analysis), NN (Nearest Neighbor), LOG (Logistic Discriminant Analysis),
FDA (Flexible Discriminant Analysis), PDA (Penalized LDA), MDA (Mixture
Discriminant Analysis) and POL (the POLYCLASS algorithm). The neural network
algorithms compared include LVQ (Learning Vector Quantization) and RBF (Radial
Basis Function).
This paper revealed that an algorithm called POLYCLASS, which provides estimates of
conditional class probabilities, performed better than the other algorithms, although its
accuracy was not statistically significantly different from that of twenty other algorithms.
Another statistical algorithm, logistic regression, was ranked second with respect to
accuracy. The most accurate decision tree algorithm was QUEST with linear splits, which
was ranked fourth. It was noted that although spline-based statistical algorithms tend to
have good accuracy, they also require relatively long training times; POLYCLASS, for
example, was third from last in terms of median training time.
- 21 -
The research discovered that among decision tree algorithms with univariate splits, C4.5,
IND-CART and QUEST had the best combinations of error rate and speed. It was also
noted that C4.5 tends to produce trees with twice as many leaves as those from IND-CART
and QUEST.
The main conclusion from this research was that the mean error rates of many algorithms
are sufficiently similar that their differences are statistically insignificant and are also
probably insignificant in practical terms. However, as will be discussed later, using
default settings, this thesis discovered that there were significant differences in error rates
among the different algorithms used.
The STATLOG Project [12] has shown the results of evaluation of the performance of
machine learning, neural and statistical algorithms on large-scale, complex commercial
and industrial problems. The overall aim was to give an objective assessment of the
potential for classification algorithms in solving significant commercial and industrial
problems.
Some of the twenty-four algorithms compared in the STATLOG project are Alloc80,
Ac2, BayTree, NewId, Dipol92, C4.5, Cart, Cal5, Kohonen, Bayes, and Cascade.
The data sets used for the STATLOG project are from the UCI repository. On test data, it
was discovered that the algorithm ‘Alloc80’, followed by ‘Ac2’, and ‘BayTree’,
performed better than the rest. ‘Alloc80’ and ‘BayTree’ are statistical classifier
algorithms whereas ‘Ac2’ is a decision tree algorithm.
Salzberg [46] cautions that care is required when comparing different algorithms. The
dangers to avoid and a recommended approach to compare data mining algorithms are
discussed.
The main claims made by the paper are:
• Finding a good classification algorithm requires very careful thought about
experimental design.
• If not done carefully, comparative studies of classification and other types of
algorithms can easily result in statistically invalid conclusions. This is especially
true when one is using data mining techniques to analyze very large databases, which
inevitably contain some statistically unlikely data.
• Comparative analysis is more important in evaluating some types of algorithms than
others.
The key recommendations made by Salzberg [47] regarding the comparison of
algorithms are:
- 22 -
• Data miners must be careful not to rely too heavily on stored repositories such as
the UCI repository, because it is difficult to produce major new results using well-studied
and widely shared data.
• Data miners should follow a proper methodology that allows the designer of a
new algorithm to establish the new algorithm’s comparative merits.
- 23 -
Chapter 3. Data Generation
This chapter discusses the data generation phase of this thesis, which involved
collecting data sets and applying each of 6 supervised algorithms to each of 59 data
sets.
3.1 Collection of Data sets
To achieve the goal of applying multiple data mining algorithms on multiple data sets, a
search for data sets was necessary. Data sets were mainly obtained through the Internet,
particularly from the UCI data set collection. Fifty-nine data sets were collected and used
to perform the experiments. The number of attributes of the data sets used ranged from 3
to 76 while the number of observations ranged from 13 to 8124.
3.2 Selection of Data Mining Algorithms
It was important to select data mining algorithms that could be run on all the data sets
collected, in order to minimize missing values in the file produced by the data
generation phase.
The 6 data mining algorithms chosen for this experiment are Rule Learner, OneR, Kernel
Density, IBK, C4.5 and Naïve Bayes. These algorithms are described in detail in chapter 2.
3.3 Generating Data
Once the data sets and algorithms to use for the experiment were chosen, the actual data
generation for the experiment was conducted. This was done by running the 6 data
mining algorithms on the 59 data sets. Default settings were used for all algorithms. For
the purpose of testing, cross validation with 10 folds was used for all the algorithms.
Once all runs were completed the results were stored in one file, which was later used in
the clustering analysis. The percentage of incorrectly classified instances for each
algorithm on each data set for both training and cross validation was stored in this file.
Also contained in this file is the size and number of attributes of each data set. Table 2
shows the complete result of the data generation process. For example, it shows that for
the ‘anneal’ data set, which has 38 attributes and 898 instances, IBK error rate on training
data was 5.90 percent whereas on the test data it was 5.57 percent.
The definition of each variable is shown in table 3.
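The testing scheme just described can be sketched as follows (a simplified illustration, not the actual WEKA runs; a trivial majority-class learner stands in for the six algorithms, and assigning folds by row index is an assumption, since real runs typically shuffle or stratify the data first):

```python
from collections import Counter

def ten_fold_error(rows, train_fn, predict_fn, folds=10):
    """Percentage of incorrectly classified instances under k-fold cross
    validation: each fold is held out once for testing while the model is
    trained on the remaining folds."""
    errors = total = 0
    for f in range(folds):
        test = rows[f::folds]                                  # held-out fold
        train = [r for i, r in enumerate(rows) if i % folds != f]
        model = train_fn(train)
        errors += sum(predict_fn(model, r) != r["class"] for r in test)
        total += len(test)
    return 100.0 * errors / total

# Majority-class "learner" on an invented 10-row data set.
majority = lambda train: Counter(r["class"] for r in train).most_common(1)[0][0]
data = [{"class": "a"}] * 9 + [{"class": "b"}]
error_rate = ten_fold_error(data, majority, lambda model, row: model)
```

Only the lone "b" row is ever misclassified, so the cross-validated error rate here is 10 percent, the kind of figure recorded for each algorithm/data-set pair in the results file.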
- 24 -
Data_Name Num_Attr Num_Ins IBK_TRAIN
Anneal
38
898
5.90
Audiology
70
226
Balance-scale
5
625
30.24
Breast-cancer
10
286
10.14
Breast-w
11
699
4.01
Colic
28
368
11.14
Credit-a
16
690
13.49
Credit-g
Diabetes
Glass
Heart-c
Heart-statlog
Iris
Kr-vs-kp
Labor
Segment
Sick
Sonar
Soybean
Autos
Heart-h
Hepatitis
Lymph
Mushroom
Primary-tumor
Splice
vehicle
vote
vowel
Waveform-5000
AutoPrice
baskball
bodyfat
bolts
BreastTumor
Cleveland
cloud
cpu
detroit
EchoMonths
elusage
Fishcatch
Gascons
housing
Hungarian
longley
20
9
11
76
13
5
37
16
19
30
61
35
26
76
20
19
22
18
62
18
17
14
21
16
5
13
8
10
14
6
7
14
10
3
8
5
13
13
7
1000
768
214
303
270
150
3196
57
2310
3772
208
683
205
294
155
148
8124
339
3190
946
435
990
5000
159
96
252
40
286
303
108
209
13
130
55
158
27
506
294
16
27.60
23.05
1.87
7.59
6.67
0.00
52.22
0.00
9.61
6.31
2.88
50.80
3.41
12.59
3.26
0.00
18.19
31.27
48.11
28.61
5.06
59.29
31.44
0.63
0.00
11.51
0.00
26.22
11.55
0.00
1.44
0.00
12.31
1.82
0.00
0.00
40.71
12.59
0.00
IBK_TEST C45_TRAIN
5.57
0.22
8.85
20.64
9.28
27.97
24.13
4.72
1.57
22.28
14.13
17.83
9.28
31.20
32.94
29.91
23.76
25.19
0.04
31.57
10.52
15.58
6.12
14.42
11.42
24.89
20.75
0.20
17.57
48.20
59.88
44.80
33.92
5.75
37.98
47.48
29.56
14.58
46.03
0.45
84.97
35.97
56.48
9.09
23.07
69.23
21.82
22.15
0.00
51.98
20.75
0.00
14.50
15.63
3.73
7.92
8.52
0.02
0.34
12.28
1.08
0.34
1.92
3.66
4.89
15.97
7.74
6.76
0.00
38.64
3.67
3.07
2.76
2.12
2.50
9.43
10.42
1.19
2.50
44.40
14.52
12.96
3.83
7.69
22.31
12.72
3.80
0.00
10.67
15.99
0.00
C45_TEST RL_TRAIN RL_TEST NB_TRAIN NB_TEST OR_TRAIN OR_TEST KR_TRAIN KR_TEST
1.56
0.00
0.02
13.36
13.47
16.40
16.40
0.00
0.89
22.12
8.40
18.58
21.68
26.99
53.53
46.46
0.00
0.23
22.24
5.12
19.52
9.12
9.12
36.48
41.12
0.00
0.12
24.83
19.93
28.67
24.83
25.87
27.27
34.62
2.10
27.27
4.72
1.57
5.15
3.86
0.04
7.30
8.15
0.00
4.86
14.13
13.31
17.39
20.38
20.92
18.48
18.48
0.27
20.19
14.06
6.52
16.09
21.74
22.32
14.49
14.49
0.14
18.55
30.30
25.91
32.71
20.79
22.22
4.70
0.47
24.56
2.86
1.35
25.96
7.91
17.56
22.11
20.65
22.93
0.00
59.29
6.01
26.60
3.22
21.62
24.48
31.45
13.54
3.17
0.35
83.92
37.62
42.59
4.78
30.77
0.60
12.72
18.99
11.11
47.23
22.11
0.00
10.30
18.75
9.35
5.61
5.56
2.67
0.25
5.26
0.48
0.37
0.96
3.66
9.76
13.61
4.51
4.73
0.00
38.64
2.76
15.25
2.52
3.23
7.30
8.81
8.33
0.79
0.05
40.91
8.58
10.19
3.83
7.69
23.08
5.45
6.33
0.00
8.30
13.61
0.00
27.90
25.52
34.11
24.75
23.70
5.33
0.78
21.05
2.94
1.48
22.12
8.34
24.89
18.37
19.36
22.30
0.00
61.65
6.24
30.26
3.45
22.63
22.64
33.96
14.58
3.17
37.50
87.06
33.99
40.74
5.26
30.77
57.69
0.20
19.62
11.11
47.43
19.05
0.00
22.80
23.70
44.39
15.84
14.81
0.04
11.67
1.75
19.78
7.03
26.92
6.30
31.22
14.97
14.19
12.84
4.11
43.95
3.04
53.78
9.66
28.79
19.78
26.42
8.33
0.25
0.15
68.18
25.08
27.78
10.05
0.00
53.08
12.72
24.05
0.00
63.44
14.97
0.00
25.10
24.22
51.40
15.51
14.81
0.04
12.36
5.26
19.70
7.26
34.13
7.17
41.95
15.65
16.13
16.21
4.23
51.92
4.45
44.44
9.66
38.48
20.04
32.08
13.54
29.37
0.40
82.87
29.04
36.11
11.48
15.38
66.92
12.72
27.22
7.40
66.40
14.65
0.00
25.80
23.83
38.32
23.43
23.70
4.67
31.66
15.79
29.87
33.30
27.73
43.93
26.07
27.03
5.33
33.57
26.32
36.28
24.04
59.15
31.22
17.69
15.48
24.32
1.48
70.50
0.00
39.01
4.37
56.57
38.90
30.19
10.42
1.19
32.50
69.93
34.65
40.74
10.04
15.38
62.31
9.09
16.46
22.22
45.85
17.69
0.00
36.08
60.47
37.07
18.71
0.20
25.68
1.48
73.16
75.58
47.28
4.37
68.59
46.26
37.11
10.42
1.59
47.50
76.22
39.93
51.85
11.48
53.85
63.85
14.55
20.89
38.32
52.77
18.70
0.00
0.00
0.00
8.41
0.00
0.00
1.33
0.00
0.00
0.00
0.00
0.00
0.17
0.00
0.00
1.29
0.00
29.70
28.12
28.97
24.09
24.81
0.04
3.22
10.53
2.64
3.53
14.90
8.93
22.93
21.43
20.65
17.57
12.39
60.47
0.00
0.23
0.00
31.80
7.13
1.01
3.77
2.08
0.00
0.00
3.85
0.00
0.93
0.48
0.00
13.08
5.45
9.49
7.41
0.00
0.00
0.00
30.82
14.58
0.00
0.45
82.52
34.98
58.33
9.09
23.08
66.15
0.20
22.78
7.41
44.27
21.43
0.00
- 25 -
Data_Name Num_Attr Num_Ins IBK_TRAIN
lowbwt
10
189
0.00
Mbagrade
3
61
6.56
meta
21
528
0.76
Pharynx
9
195
0.00
Pollution
16
60
0.00
PwLinear
11
200
0.00
quake
4
2178
56.38
Schlvote
6
37
0.00
servo
5
167
0.00
sleep
8
58
1.72
strike
6
625
12.80
veteran
8
137
0.00
Vineyard
4
52
0.00
IBK_TEST C45_TRAIN
19.05
14.28
22.95
13.11
0.76
0.76
68.21
54.87
28.33
0.05
61.50
18.50
63.64
38.89
27.03
16.22
22.75
20.36
6.90
6.90
15.84
12.16
48.18
21.90
1.92
1.92
C45_TEST RL_TRAIN RL_TEST NB_TRAIN NB_TEST OR_TRAIN OR_TEST KR_TRAIN KR_TEST
16.93
5.82
22.22
14.29
19.58
15.87
15.87
0.53
19.58
16.40
11.48
16.39
9.84
13.11
13.11
13.11
11.46
14.75
0.76
0.76
0.76
4.17
4.17
0.76
0.76
0.00
0.76
54.87
3.59
66.15
22.56
0.60
5.13
82.05
0.00
66.15
0.20
0.05
0.20
11.67
0.25
16.67
0.20
0.00
0.25
48.50
0.14
49.50
51.50
0.62
0.62
0.38
0.00
0.60
48.30
30.17
52.43
46.10
46.01
43.62
46.74
36.41
47.84
21.62
10.81
16.22
37.84
51.35
16.22
40.54
13.51
18.82
23.35
20.36
22.75
16.17
25.75
35.92
35.92
0.00
23.35
12.07
3.45
12.07
3.45
10.34
6.90
6.90
3.45
6.90
12.80
6.24
12.80
13.60
14.24
0.12
12.80
0.64
14.40
43.07
14.60
44.52
30.66
37.23
35.04
36.50
2.92
47.45
1.92
1.92
1.92
0.00
1.92
0.00
3.85
0.00
1.92
Table 2: Results obtained from applying 6 data mining algorithms to 59 data sets. Blanks indicate situations where algorithms gave no result.
Note that the word ‘TRAIN’ in the table indicates the percentage of incorrectly classified training cases whereas the word ‘TEST’ indicates the
percentage of incorrectly classified cases using cross validation. For example, NB_TRAIN indicates the percentage of incorrectly classified
instances (error rate) for the Naïve Bayes algorithm on training data. The definition of each variable is shown below.
Name       Definition
NB_TRAIN   Naive Bayes Training Error (%)
NB_TEST    Naive Bayes Testing Error (%)
C45_TRAIN  C4.5 Training Error (%)
C45_TEST   C4.5 Testing Error (%)
OR_TRAIN   OneR Training Error (%)
OR_TEST    OneR Testing Error (%)
RL_TEST    Rule Learner Testing Error (%)
RL_TRAIN   Rule Learner Training Error (%)
KR_TRAIN   Kernel Density Training Error (%)
KR_TEST    Kernel Density Testing Error (%)
IBK_TEST   IBK Testing Error (%)
IBK_TRAIN  IBK Training Error (%)
NUM_INS    Number of Instances
NUM_ATTR   Number of Attributes
Table 3: Definition of variables used.
- 26 -
Table 4 shows a summary of table 2. It provides the minimum, maximum, mean, standard deviation,
and missing percentage of each numeric variable. For example, it shows that for the Kernel Density
algorithm on training data, the minimum error rate was 0 percent while the maximum and mean were
36.41 and 2.53 percent respectively. It also shows that 5% of the values were missing for
this algorithm on training data, indicating that no result was found for some data sets.
Table 4 also shows the overall performance of the 6 algorithms in classifying the 59 data sets. The
table is sorted by the mean error rates of each of the algorithms for both train and test cases. Looking
at the training results indicates that Kernel Density (KR_TRAIN) with an average error rate of 2.53
percent, followed by Rule Learner (RL_TRAIN) and C4.5 (C45_TRAIN) with average error rates of
8.06 and 10.47 percent respectively, had lower training errors than the other algorithms. More
importantly, looking at the cross validation results, Kernel Density (KR_TEST) with an average error
rate of 19.88 percent, followed by C4.5 and Naïve Bayes, with average error rates of 20.16 and 21.51
percent respectively, performed better than the other algorithms.
Name       Mean     Min   Max    Std Dev  Missing %
KR_TRAIN   2.53     0.00  36.41  5.89     5%
RL_TRAIN   8.06     0.00  40.91  8.79     0%
C45_TRAIN  10.47    0.00  54.87  11.39    0%
IBK_TRAIN  12.09    0.00  59.29  16.32    2%
NB_TRAIN   19.36    0.00  68.18  16.61    0%
OR_TRAIN   23.83    0.00  70.50  18.27    2%
KR_TEST    19.88    0.00  82.52  19.64    5%
C45_TEST   20.16    0.00  83.92  17.27    0%
NB_TEST    21.51    0.00  82.87  18.50    0%
RL_TEST    21.95    0.00  87.06  18.63    0%
IBK_TEST   26.66    0.00  84.97  20.24    2%
OR_TEST    30.49    0.00  82.05  22.12    2%
NUM_INS    740.47   13    8124   1389.80  0%
NUM_ATTR   18.40    3     76     17.66    0%
Table 4: Summary of the data gathered from running the 6 data mining algorithms on 59 data sets.
- 27 -
Chapter 4. Clustering and Pattern Analysis
For the purpose of analyzing the data generated by applying the 6 data mining algorithms to 59
data sets (table 2, page 26), running unsupervised learning algorithms is necessary. For this
experiment, the 3 algorithms used are k-means clustering using least squares, Kohonen Vector
Quantization, and Autoclass Bayesian analysis. These algorithms are described in section 2.2.1 (pages
21-22).
The results of the unsupervised clustering analysis performed are discussed in the next four sections
followed by the summary and comparison of the results.
4.1 Results from k-means clustering
Table 5 shows the ranking of variables resulting from the application of the k-means algorithm to the
data generated (table 2, page 26). A value of 5 was used for the maximum number of clusters. This
value was chosen as (1) more than 5 clusters in a data set of 59 cases are unlikely to be useful and (2)
preliminary runs of the algorithm suggested there were 3 to 5 clusters.
As shown in table 5, only ‘number of instances’ is significant in determining the clusters. This model
gives five clusters with 52, 3, 2, 1, and 1 observations. The table shows the name, importance, measurement
type and label of each variable. For example, it indicates that the NUM_INS (number of instances)
variable has an importance level of 1 and that it is a numeric interval variable. Numeric variables
containing values that vary across a continuous range are shown as interval variables.
NAME       IMPORTANCE  MEASUREMENT  TYPE  LABEL
NUM_INS    1           interval     Num   Number of Instances
KR_TEST    0           interval     Num   Kernel Density Test
KR_TRAIN   0           interval     Num   Kernel Density Train
OR_TEST    0           interval     Num   OneR TEST
OR_TRAIN   0           interval     Num   OneR TRAIN
NB_TEST    0           interval     Num   Naïve Bayes Test
NB_TRAIN   0           interval     Num   Naïve Bayes Train
RL_TEST    0           interval     Num   Rule Learner Test
RL_TRAIN   0           interval     Num   Rule Learner Train
C45_TEST   0           interval     Num   C45 Test
C45_TRAIN  0           interval     Num   C45 Train
IBK_TEST   0           interval     Num   IBK Test
IBK_TRAIN  0           interval     Num   IBK Train
NUM_ATTR   0           interval     Num   Number of Attributes
Table 5: Importance level of variables in determining cluster.
- 28 -
Cluster  Frequency   Root-Mean-Square  Max Distance     Nearest  Distance To      Number Of   NUM_
         Of Cluster  Std Deviation     From Cluster Seed Cluster  Nearest Cluster  Attributes  INST
1        1           .                 0                 3        1614.82          21          5000
2        1           .                 0                 1        3124.84          22          8124
3        3           90.28             388.32            5        1144.20          43          3386
4        52          74.75             694.33            5        1938.36          17.13       306.11
5        2           34.85             92.22             3        1144.20          11.5        2244

Cluster  IBK_   IBK_   C45_   C45_   RL_    RL_    NB_    NB_    OR_    OR_    KR_    KR_
         TRAIN  TEST   TRAIN  TEST   TRAIN  TEST   TRAIN  TEST   TRAIN  TEST   TRAIN  TEST
1        31.44  47.48  2.5    24.48  7.3    22.64  19.78  20.04  38.9   46.26  2.53   19.88
2        18.19  48.2   0      0      0      0      4.11   4.23   1.48   1.48   2.53   19.88
3        35.54  27.49  1.45   2.61   1.12   2.83   7.24   8.02   18.49  46.54  0.84   8.87
4        9.453  25.29  10.98  21.27  8.35   23.24  19.83  22.22  23.78  29.40  2.02   20.31
5        32.99  39.61  19.98  25.58  15.32  27.68  32.94  32.85  36.74  41.51  18.20  25.24
Table 6: Properties of the 5 clusters. Average values for the last 14 columns were used.
Table 6 shows the properties of the 5 clusters formed using k-means analysis. For example, it shows that the one data set in cluster 1 has 5000
instances while the one data set in cluster 2 has 8124. These numbers are much higher than for the population, which has mean number of
instances of 740.47 (table 4, page 27). This indicates that perhaps these two data sets should be clustered together.
As can be seen from the table, most of the data sets were grouped into cluster 4, with an average number of instances of 306.11.
Generally, it is possible to see that the data sets were clustered based on the number of instances as small, medium, or large. To be able to see the
significance of the other variables, all further runs were carried out with this variable excluded.
- 29 -
4.2 Results from k-means clustering (without Number of Instances)
When the number of instances variable was excluded from the k-means analysis, 3 other variables
emerged as significant. As can be seen from table 7, NB_TRAIN was most significant with
an importance level of 1, followed by KR_TEST and C45_TRAIN with 0.885 and 0.336 respectively.
The rest of the variables have no significance in determining the clusters.
Name       Importance  Measurement  Type  Label
NB_TRAIN   1           interval     num   Naive Bayes Train
KR_TEST    0.885       interval     num   Kernel Density Test
C45_TRAIN  0.336       interval     num   C45 Train
OR_TRAIN   0           interval     num   OneR Train
NB_TEST    0           interval     num   Naive Bayes Test
KR_TRAIN   0           interval     num   Kernel Density Train
RL_TEST    0           interval     num   Rule Learner Test
RL_TRAIN   0           interval     num   Rule Learner Train
C45_TEST   0           interval     num   C45 Test
OR_TEST    0           interval     num   OneR Test
IBK_TEST   0           interval     num   IBK Test
IBK_TRAIN  0           interval     num   IBK Train
NUM_ATTR   0           interval     num   Number of Attributes
Table 7: Importance of variables in determining clusters.
Table 8 shows the properties of the 5 clusters formed using k-means clustering analysis. For example,
the table shows that for the significant variables, cluster 1 has mean error rates lower than those for the
population: NB_TRAIN, KR_TEST and C45_TRAIN have mean error rates of 9.34, 9.85 and
5.66 for cluster 1, compared with population error rates (table 4, page 27) of 19.36, 19.88, and 10.47
respectively.
It can also be seen from the table that cluster 4, which has about average error rates, contains 12
data sets. Cluster 2, which like cluster 1 has lower than average error rates, has 11 data sets. Data sets from
these two clusters (1 and 2) could possibly be put into one cluster. Also, clusters 3 and 5 behave
similarly, having error rates much greater than those for the population. The data sets from these two clusters
could possibly be put into one cluster.
- 30 -
Cluster  Frequency   Root-Mean-Square  Max Distance      Nearest  Distance To      Number Of   Number Of
         Of Cluster  Std Deviation     From Cluster Seed Cluster  Nearest Cluster  Attributes  Instances
1        29          8.44              52.99             2        56.06            13.48       717.10
4        12          9.91              70.46             2        51.25            12.83       368.50
2        11          13.92             67.83             4        51.25            42.81       97.28
3        5           15.33             64.80             5        77.29            11          687.80
5        2           15.53             39.60             4        69.01            7.5         151.50

Cluster  IBK_   IBK_   C45_   C45_   RL_    RL_    NB_    NB_    OR_    OR_    KR_     KR_
         TRAIN  TEST   TRAIN  TEST   TRAIN  TEST   TRAIN  TEST   TRAIN  TEST   TRAIN   TEST
1        4.58   14.14  5.66   9.82   3.98   10.86  9.34   11.39  12.86  14.88  1.613   9.85
4        8.90   33.81  13.90  30.29  12.22  31.02  32.36  32.42  28.17  34.52  2.559   27.06
2        27.93  27.50  5.81   18.58  5.32   17.70  14.37  18.11  32.43  46.06  0.475   12.43
3        33.37  65.94  30.98  47.86  28.22  61.25  54.95  62.82  58.44  62.54  13.146  60.25
5        0      62.34  33.91  48.73  6.89   53.44  25.17  18.35  22.93  66.95  0.465   62.24
Table 8: General properties of the clusters from k-means analysis.
- 31 -
The general properties of all the clusters, with the values of the significant variables, are shown in table 9.
For example, the table shows that cluster 1 has mean error rates of 9.34, 9.85 and 5.66 for
NB_TRAIN, KR_TEST and C45_TRAIN respectively. All of these are lower than the average
error rates for the population (table 4, page 27).
Cluster  Frequency   NB_TRAIN  KR_TEST  C45_TRAIN  Cluster Properties
         Of Cluster  (19.36)   (19.88)  (10.47)    (Compared to error rates for the population)
1        29          9.34      9.85     5.66       Lower than average error rates
4        12          32.36     27.06    13.90      About average error rates
2        11          14.37     12.43    5.81       Lower than average error rates
3        5           54.95     60.25    30.98      Higher than average error rates
5        2           25.17     62.24    33.91      Higher than average error rates
Table 9: General properties of clusters and significant variables. Values in brackets show average values for the
population.
The complete list of the error rates for the 3 significant variables for the 59 data sets, with their
corresponding cluster numbers, is shown in table 10. For example, the table shows that the ‘Anneal’
data set, with 38 attributes and 898 instances, is in cluster 1 and has error rates of 13.47, 0.89 and 0.22
for the three significant variables (NB_TRAIN, KR_TEST and C45_TRAIN). From the table, it is also
possible to see that the error rates of the significant variables in clusters 3 and 5 are very high. This
indicates that for the data sets in these clusters, such as ‘breasttumor’, ‘echomonths’, ‘housing’, ‘quake’,
‘cloud’ and ‘pharynx’, the three algorithms Naïve Bayes, Kernel Density and C4.5 did not perform
well. A closer investigation of these data sets may reveal some similarities among them, but such an
investigation was beyond the scope of this thesis.
- 32 -
DATA_NAME      NUM_ATTR  NUM_INS  NB_TEST  KR_TEST  C45_TRAIN  CLUSTER
Anneal         38        898      13.47    0.89     0.22       1
Breast-W       11        699      0.04     4.86     1.57       1
Colic          28        368      20.92    20.19    14.13      1
Credit-A       16        690      22.32    18.55    9.28       1
Heart-Statlog  13        270      14.81    24.81    8.52       1
Iris           5         150      0.04     0.04     0.02       1
Labor          16        57       5.36     10.53    12.28      1
Segment        19        2310     19.70    2.64     1.08       1
Sick           30        3772     7.26     3.53     0.34       1
Hepatitis      20        155      16.13    20.65    7.74       1
Lymph          19        148      16.21    17.57    6.76       1
Mushroom       22        8124     4.23     19.88    0.00       1
Vote           17        435      9.66     7.13     2.76       1
Baskball       5         96       13.54    14.58    10.42      1
Bodyfat        13        252      29.37    0.00     1.19       1
Bolts          8         40       0.40     0.45     2.50       1
Cpu            7         209      11.48    9.09     3.83       1
Elusage        3         55       12.72    0.20     12.72      1
Fishcatch      8         158      27.22    22.78    3.80       1
Gascons        5         27       7.40     7.41     0.00       1
Hungarian      13        294      14.65    21.43    15.99      1
Longley        7         16       0.00     0.00     0.00       1
Lowbwt         10        189      19.58    19.58    14.28      1
Mbagrade       3         61       13.11    14.75    13.11      1
Meta           21        528      4.17     0.76     0.76       1
Pollution      16        60       0.25     0.25     0.05       1
Sleep          8         58       10.34    6.90     6.90       1
Strike         6         625      14.24    14.40    12.16      1
Vineyard       4         52       1.92     1.92     1.92       1
Audiology      70        226      26.99    0.23     8.85       2
Balance-Scale  5         625      9.12     0.12     9.28       2
Heart-C        76        303      15.51    24.09    7.92       2
Kr-Vs-Kp       37        3196     12.36    3.22     0.34       2
Sonar          61        208      34.13    14.90    1.92       2
Soybean        35        683      7.17     8.93     3.66       2
Heart-H        76        294      15.65    21.43    15.97      2
Splice         62        3190     4.45     19.88    3.67       2
Vowel          14        990      38.48    1.01     2.12       2
Detroit        14        13       15.38    23.08    7.69       2
Primary-Tumor  18        339      51.92    60.47    38.64      3
Breasttumor    10        286      82.87    82.52    44.40      3
Echomonths     10        130      66.92    66.15    22.31      3
Housing        13        506      66.40    44.27    10.67      3
Quake          4         2178     46.01    47.84    38.89      3
Breast-Cancer  10        286      25.87    27.27    24.13      4
Credit-G       20        1000     25.10    29.70    14.50      4
Diabetes       9         768      24.22    28.12    15.63      4
Glass          11        214      51.40    28.97    3.73       4
Autos          26        205      41.95    22.93    4.89       4
Vehicle        18        946      44.44    31.80    3.07       4
Autoprice      16        159      32.08    30.82    9.43       4
Cleveland      14        303      29.04    34.98    14.52      4
Pwlinear       11        200      0.62     0.60     18.50      4
Schlvote       6         37       51.35    18.82    16.22      4
Servo          5         167      25.75    23.35    20.36      4
Veteran        8         137      37.23    47.45    21.90      4
Cloud          6         108      36.11    58.33    12.96      5
Pharynx        9         195      0.60     66.15    54.87      5
Table 10: Complete list of data sets in each cluster and the values of the significant clustering variables.
- 33 -
4.3 Results from Kohonen Vector Quantization Clustering
The second unsupervised algorithm used to analyze the generated data is Kohonen Vector
Quantization. Like k-means, Kohonen Vector Quantization also determines the significant
variables for grouping the data sets together. The maximum number of clusters for this algorithm was
also set to 5. The Kohonen Vector Quantization analysis revealed that, in clustering the data sets, the
variable KR_TEST has the highest significance of 1, followed by IBK_TRAIN, Number of
Attributes (NUM_ATTR), RL_TRAIN and C45_TRAIN with 0.91, 0.64, 0.46 and 0.45
respectively. The rest of the variables have no significance in determining the clusters.
Name       Importance  Measurement  Type  Label
KR_TEST    1           interval     num   Kernel Density Test
IBK_TRAIN  0.91        interval     num   IBK Train
NUM_ATTR   0.64        interval     num   Number of Attributes
RL_TRAIN   0.46        interval     num   Rule Learner Train
C45_TRAIN  0.45        interval     num   C45 Train
NB_TRAIN   0           interval     num   Naive Bayes Train
RL_TEST    0           interval     num   Rule Learner Test
OR_TRAIN   0           interval     num   OneR Train
C45_TEST   0           interval     num   C45 Test
NB_TEST    0           interval     num   Naive Bayes Test
IBK_TEST   0           interval     num   IBK Test
KR_TRAIN   0           interval     num   Kernel Density Train
OR_TEST    0           interval     num   OneR Test
Table 11: Importance of variables in determining clusters in Kohonen Vector Quantization.
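The thesis ran Kohonen Vector Quantization in Enterprise Miner. As a rough illustration of the underlying update rule (sequential competitive learning with a decaying learning rate, mirroring the initial and final rates shown in the Appendix D run), a pure-Python sketch might look like this; the toy data and parameter values are illustrative:

```python
import random

def kohonen_vq(rows, k, steps=1000, lr_initial=0.5, lr_final=0.02, seed=0):
    """Kohonen vector quantization: for each presented case, nudge the
    nearest codebook vector toward it, with a linearly decaying
    learning rate."""
    rng = random.Random(seed)
    codebook = [list(r) for r in rng.sample(rows, k)]
    for step in range(steps):
        row = rng.choice(rows)
        lr = lr_initial + (lr_final - lr_initial) * step / (steps - 1)
        # Competition: find the nearest codebook vector.
        win = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(row, codebook[c])),
        )
        # Update: move the winner a fraction lr toward the presented case.
        codebook[win] = [w + lr * (x - w) for w, x in zip(codebook[win], row)]
    # Final hard assignment of every case to its nearest codebook vector.
    return [
        min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(r, codebook[c])))
        for r in rows
    ]

profiles = [[1.0, 2.0], [2.0, 1.0], [60.0, 65.0], [62.0, 61.0]]
labels = kohonen_vq(profiles, k=2)
```

Unlike batch k-means, the codebook here is adjusted one case at a time, which is why a learning-rate schedule (0.5 down to 0.02 in the Appendix D settings) is needed for the centroids to settle.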
The properties of the 5 clusters obtained from the Kohonen Vector Quantization analysis are shown in
table 12. For example, the table shows that cluster 5 has a mean KR_TEST error rate of 9.32, which is
lower than the average value for the population (table 4, page 27), and a mean IBK_TEST error rate
of 13.75, which is also lower than the population average of 26.66. The values of both RL_TRAIN and
C45_TRAIN, 3.93 and 5.56, are also lower than the average values for the population, 8.06 and 10.47
(table 4, page 27).
- 34 -
Cluster statistics:
Cluster  Freq.  Root-Mean-Square  Maximum Distance  Nearest  Distance To      Number Of
                Standard Deviation  From Seed       Cluster  Nearest Cluster  Attributes
5        28     8.42              53.64             4        60.47            13.5
4        16     12.04             91.88             5        60.47            12.25
1        6      12.45             59.41             4        62.70            29
3        5      15.33             65.42             4        86.98            11
2        4      7.88              33.09             1        63.87            70.75

Mean error rates per cluster:
Cluster  IBK_TRAIN  IBK_TEST  C45_TRAIN  C45_TEST  RL_TRAIN  RL_TEST  NB_TRAIN  NB_TEST  OR_TRAIN  OR_TEST  KR_TRAIN  KR_TEST
5        4.50       13.75     5.56       9.37      3.93      10.40    9.15      11.27    12.48     14.45    1.67      9.32
4        7.09       36.17     15.68      32.12     10.85     33.35    28.34     28.49    26.44     39.31    1.97      31.07
1        45.35      32.31     3.59       13.78     3.72      13.35    13.11     15.27    37.12     54.26    0.87      8.84
3        33.37      65.94     30.98      47.86     28.22     61.25    54.95     62.82    58.44     62.54    13.14     60.25
2        8.78       21.39     8.66       22.74     7.145     20.95    19.85     23.07    29.672    31.83    0         15.16
Table 12: General properties of the clusters from Kohonen Vector Quantization analysis.
- 35 -
The general properties of all the clusters, with the values of the significant variables, are shown in table
13. For example, the table shows that cluster 5 has mean error rates of 9.32, 13.75, 3.93 and 5.56 for
KR_TEST, IBK_TEST, RL_TRAIN and C45_TRAIN respectively. All of these are lower than the
average error rates for the population (table 4, page 27).
The data sets in clusters 5, 1 and 2 could probably be put into one cluster, as they all behave similarly,
having lower than average error rates.
Cluster  Freq. Of  KR_TEST  IBK_TEST  Number Of   RL_TRAIN  C45_TRAIN  Cluster Properties (compared to
         Cluster   (19.88)  (26.66)   Attributes  (8.06)    (10.47)    error rates for the population)
                            (18.40)
5        28        9.32     13.75     13.5        3.93      5.56       Lower than average error rates
4        16        31.07    36.17     12.25       10.85     15.68      Higher than average error rates
1        6         8.84     32.31     29          3.72      3.59       Lower than average error rates
3        5         60.25    65.94     11          28.22     30.98      Much higher than average error rates
2        4         15.16    21.39     70.75       7.14      8.66       Lower than average error rates
Table 13: General properties of clusters: mean error rates of the significant variables. Numbers in brackets show
average values for the population.
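The property labels of this kind simply compare a cluster's mean error rates with the population averages. A small sketch of that comparison (the tolerance threshold is illustrative, not taken from the thesis):

```python
def label_cluster(cluster_means, population_means, tol=0.15):
    """Label a cluster by the ratio of its mean error rates to the
    population averages (tolerance is an illustrative choice)."""
    ratios = [c / p for c, p in zip(cluster_means, population_means)]
    avg_ratio = sum(ratios) / len(ratios)
    if avg_ratio < 1 - tol:
        return "lower than average"
    if avg_ratio > 1 + tol:
        return "higher than average"
    return "about average"

# Clusters 5 and 3 from table 13, against the population averages
# for KR_TEST, IBK_TEST, RL_TRAIN and C45_TRAIN.
pop = [19.88, 26.66, 8.06, 10.47]
label5 = label_cluster([9.32, 13.75, 3.93, 5.56], pop)
label3 = label_cluster([60.25, 65.94, 28.22, 30.98], pop)
```

Applied to the table 13 values, cluster 5 comes out well below the population averages and cluster 3 well above them, matching the property labels in the table.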
Table 14 shows which data sets are clustered together. Also shown is the complete list of the error
rates of the significant variables for the 59 data sets, with their corresponding cluster numbers. For
example, the table shows that the 'Balance-scale' data set, with 5 attributes and 625 instances, is
grouped in cluster 1 and has error rates of 9.28, 0.12, 20.64 and 5.12 for the four error-rate variables
found significant by the Kohonen Vector Quantization analysis (C45_TRAIN, KR_TEST, IBK_TEST
and RL_TRAIN). A closer investigation of the data sets may reveal some similarities among them, but
such an investigation was beyond the scope of this thesis.
- 36 -
DATA_NAME
Balance-Scale
Kr-Vs-Kp
Soybean
Splice
Vowel
Waveform-5000
Audiology
Heart-C
Sonar
Heart-H
Primary-Tumor
Breasttumor
Echomonths
Housing
Quake
Breast-Cancer
Credit-G
Diabetes
Heart-Statlog
Autos
Vehicle
Autoprice
Cleveland
Cloud
Detroit
Pharynx
Pwlinear
Schlvote
Servo
Veteran
Anneal
Breast-W
Colic
Credit-A
Iris
Labor
Segment
Sick
Hepatitis
Lymph
Mushroom
Vote
Baskball
Bodyfat
Bolts
Cpu
Elusage
Fishcatch
Gascons
Hungarian
Longley
Lowbwt
Mbagrade
Meta
Pollution
Sleep
Strike
Vineyard
NUM_ATTR
5
37
35
62
14
21
70
76
61
76
18
10
10
13
4
10
20
9
13
26
18
16
14
6
14
9
11
6
5
8
38
11
28
16
5
16
19
30
20
19
22
17
5
13
8
7
3
8
5
13
7
10
3
21
16
8
6
4
NUM_INS
625
3196
683
3190
990
5000
226
303
208
294
339
286
130
506
2178
286
1000
768
214
270
205
946
159
303
108
13
195
200
37
167
137
898
699
368
690
150
57
2310
3772
155
148
8124
435
96
252
40
209
55
158
27
294
16
189
61
528
60
58
625
C45_TRAIN
9.28
0.34
3.66
3.67
2.12
2.50
8.85
7.92
1.92
15.97
38.64
44.40
22.31
10.67
38.89
24.13
14.50
15.63
8.52
4.89
3.07
9.43
14.52
12.96
7.69
54.87
18.50
16.22
20.36
21.90
0.22
1.57
14.13
9.28
0.02
12.28
1.08
0.34
7.74
6.76
0.00
2.76
10.42
1.19
2.50
3.83
12.72
3.80
0.00
15.99
0.00
14.28
13.11
0.76
0.05
6.90
12.16
1.92
KR_TEST
0.12
3.22
8.93
19.88
1.01
19.88
0.23
24.09
14.90
21.43
60.47
82.52
66.15
44.27
47.84
27.27
29.70
28.12
28.97
24.81
22.93
31.80
30.82
34.98
58.33
23.08
66.15
0.60
18.82
23.35
47.45
0.89
4.86
20.19
18.55
0.04
10.53
2.64
3.53
20.65
17.57
19.88
7.13
14.58
0.00
0.45
9.09
0.20
22.78
7.41
21.43
0.00
19.58
14.75
0.76
0.25
6.90
14.40
IBK_TEST
20.64
31.57
11.42
44.80
37.98
47.48
26.66
23.76
14.42
20.75
59.88
84.97
69.23
51.98
63.64
27.97
31.20
32.94
29.91
25.19
24.89
33.92
29.56
35.97
56.48
23.07
68.21
61.50
27.03
22.75
48.18
5.57
4.72
22.28
17.83
0.04
10.52
15.58
6.12
0.20
17.57
48.20
5.75
14.58
46.03
0.45
9.09
21.82
22.15
0.00
20.75
0.00
19.05
22.95
0.76
28.33
6.90
15.84
RL_TRAIN
5.12
0.25
3.66
2.76
3.23
7.30
8.40
5.61
0.96
13.61
38.64
40.91
23.08
8.30
30.17
19.93
10.30
18.75
9.35
5.56
9.76
15.25
8.81
8.58
10.19
7.69
3.59
0.14
10.81
20.36
14.60
0.00
1.57
13.31
6.52
2.67
5.26
0.48
0.37
4.51
4.73
0.00
2.52
8.33
0.79
0.05
3.83
5.45
6.33
0.00
13.61
0.00
5.82
11.48
0.76
0.05
3.45
6.24
Table 14: Complete list of data sets in each cluster and the values of the significant clustering variables.
- 37 -
CLUSTER
1
1
1
1
1
1
2
2
2
2
3
3
3
3
3
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
4.4 Results from Autoclass (Bayesian Classification System) Analysis
The relative influence of each attribute in differentiating the classes from the overall data set is shown
in table 15. Autoclass discovered that NUM_ATTR, KR_TRAIN, and C45_TEST have the highest
influence.
Description  Importance
NUM_ATTR     1.00
KR_TRAIN     0.87
C45_TEST     0.64
RL_TEST      0.64
KR_TEST      0.55
C45_TRAIN    0.49
RL_TRAIN     0.40
NB_TEST      0.34
IBK_TEST     0.33
OR_TEST      0.29
NB_TRAIN     0.28
OR_TRAIN     0.27
IBK_TRAIN    0.14
Table 15: Significance level of variables in Autoclass clustering.
It was discovered that Autoclass produced 4 clusters, with frequencies of 19, 17, 13 and 10, as shown in
table 16.
Unlike k-means and Kohonen Vector Quantization clustering, Autoclass identifies which of the
variables are probably significant for each cluster. The full result from this algorithm is shown in
Appendix E, page 49, and a summary of the results is shown in table 16. According to Cheeseman and
Stutz [13], the following heuristic can be applied to get the significant variables for each class: first,
20% of the highest influence value for the class is calculated; this value is then used as a threshold to
determine which variables are significant for that class. For example, for class 1 the number of
attributes has the highest influence value, 2.5, as can be seen from Appendix E. Twenty percent of 2.5
is 0.5, and therefore only variables with an influence value greater than 0.5 are probably significant for
this cluster. As shown in table 16, careful analysis of the complete output from this clustering
algorithm shows that cluster 3 has all variables except IBK_TEST and IBK_TRAIN as significant,
whereas cluster 2 has only KR_TRAIN as a significant variable.
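The 20% heuristic described above is mechanical, so it can be sketched directly. The influence values below are the I-jk figures for class 1 from Appendix E; the function name is illustrative:

```python
def significant_variables(influences, fraction=0.2):
    """Cheeseman-and-Stutz-style heuristic: keep the variables whose
    influence value exceeds `fraction` of the largest influence value
    found for the class."""
    threshold = fraction * max(influences.values())
    return {name for name, value in influences.items() if value > threshold}

# Influence values (I-jk) for class 1, taken from Appendix E.
class1 = {
    "NUM_ATTR": 2.503, "IBK_TRAIN": 0.853, "OR_TRAIN": 0.799,
    "RL_TEST": 0.776, "KR_TEST": 0.771, "C45_TEST": 0.761,
    "IBK_TEST": 0.662, "OR_TEST": 0.517, "C45_TRAIN": 0.383,
    "NB_TEST": 0.316, "RL_TRAIN": 0.312, "NB_TRAIN": 0.295,
    "KR_TRAIN": 0.152,
}
# The threshold is 0.2 * 2.503, roughly 0.5, as in the worked example.
sig = significant_variables(class1)
```

For class 1 this keeps the eight variables whose influence exceeds about 0.5, with NUM_ATTR clearly the most significant.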
- 38 -
Cluster means, first group of variables:
Cluster  Freq Of  Number Of   IBK_TRAIN  IBK_TEST  C45_TRAIN  C45_TEST  RL_TRAIN  RL_TEST
         Cluster  Attributes  (19.36)    (26.66)   (10.47)    (20.16)   (8.06)    (21.95)
1        19       11.6        4.26       19.5      10.7       18.6      7.66      19.3
2        17       88.1        20.9       29.4      8.18       23.5      8.59      23.3
3        13       17.8        8.90       13.9      1.23       2.27      1.17      2.48
4        10       10.0        16.7       59.1      26.2       51.8      18.9      53.8

Cluster means, second group of variables, and properties:
Cluster  NB_TRAIN  NB_TEST  OR_TRAIN  OR_TEST  KR_TRAIN  KR_TEST  PROPERTIES (COMPARED TO POPULATION)
         (19.36)   (21.51)  (23.83)   (30.49)  (2.53)    (19.88)
1        15.2      19.1     17.2      21.9     3.24      18.7     Lower than average error rates
2        19.7      24.5     32.1      43.5     0.011     22.9     Much lower than average error rates
3        8.70      9.39     9.09      10.4     0.17      3.18     Lower than average error rates
4        44.8      55.8     46.8      56.2     7.63      55.9     Greater than average error rates
Table 16: Properties of clusters from Autoclass analysis, summarizing the class means from Appendix E. Values in
brackets show average values for the population.
The complete list of the 59 data sets with their corresponding cluster number is provided in table 17.
- 39 -
DATA_NAME
Breast-Cancer
Colic
Credit-A
Heart-Statlog
Labor
Hepatitis
Lymph
Autoprice
Baskball
Elusage
Fishcatch
Gascons
Hungarian
Lowbwt
Mbagrade
Pollution
Schlvote
Sleep
Strike
Audiology
Balance-Scale
Credit-G
Diabetes
Heart-C
Sonar
Soybean
Autos
Heart-H
Splice
Vehicle
Vowel
Waveform-5000
Bolts
Cleveland
Detroit
Servo
Anneal
Breast-W
Iris
Kr-Vs-Kp
Segment
Sick
Mushroom
Vote
Bodyfat
Cpu
Longley
Meta
Vineyard
Glass
Primary-Tumor
Breasttumor
Cloud
Echomonths
Housing
Pharynx
Pwlinear
Quake
Veteran
NUM_ATTR
10
28
16
13
16
20
19
16
5
3
8
5
13
10
3
16
6
8
6
70
5
20
9
76
61
35
26
76
62
18
14
21
8
14
14
5
38
11
5
37
19
30
22
17
13
7
7
21
4
11
18
10
6
10
13
9
11
4
8
KR_TRAIN
2.1
0.27
0.14
0
0
1.29
0
3.77
2.08
5.45
9.49
7.41
0
0.53
11.46
0
13.51
3.45
0.64
0
0
0
0
0
0
0.17
0
0
0
0
0
0
0
0
0
0
1.33
0
0
0
0.23
0
0.48
0
0
0
8.41
12.39
3.85
0.93
13.08
0
0
0
36.41
2.92
C45_TEST
24.83
14.13
14.06
22.22
24.56
20.65
22.93
31.45
13.54
12.72
18.99
11.11
22.11
16.93
16.4
0.2
21.62
12.07
12.8
22.12
22.24
30.3
25.91
20.79
25.96
7.91
17.56
22.11
6.01
26.6
21.62
24.48
0.35
37.62
30.77
23.35
1.56
4.72
4.7
0.47
2.86
1.35
0
3.22
3.17
4.78
0
0.76
1.92
32.71
59.29
83.92
42.59
0.6
47.23
54.87
48.5
48.3
43.07
RL_TEST
28.67
17.39
16.09
23.7
21.05
19.36
22.3
33.96
14.58
0.2
19.62
11.11
19.05
22.22
16.39
0.2
16.22
12.07
12.8
18.58
19.52
27.9
25.52
24.75
22.12
8.34
24.89
18.37
6.24
30.26
22.63
22.64
37.5
33.99
30.77
22.75
0.02
5.15
5.33
0.78
2.94
1.48
0
3.45
3.17
5.26
0
0.76
1.92
34.11
61.65
87.06
40.74
57.69
47.43
66.15
49.5
52.43
44.52
KR_TEST
27.27
20.19
18.55
24.81
10.53
20.65
17.57
30.82
14.58
0.2
22.78
7.41
21.43
19.58
14.75
0.25
18.82
6.9
14.4
0.23
0.12
29.7
28.12
24.09
14.9
8.93
22.93
21.43
31.8
1.01
0.45
34.98
23.08
23.35
0.89
4.86
0.04
3.22
2.64
3.53
7.13
0
9.09
0
0.76
1.92
28.97
60.47
82.52
58.33
66.15
44.27
66.15
0.6
47.84
47.45
CLASS
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
3
3
3
3
3
3
3
3
3
3
3
3
3
4
4
4
4
4
4
4
4
4
4
Table 17: Complete list of data sets in each cluster and the values of the significant clustering variables.
- 40 -
4.5 Comparison
4.5.1 Comparison of significant variables
Table 18 shows the significant variables from each of the clustering algorithms. As can be seen from
the table, KR_TEST and C45_TRAIN appear as significant variables in both the k-means analysis and
the Kohonen Vector Quantization analysis. KR_TEST also appears as one of the top 5 significant
variables for Autoclass. Five clusters were obtained from the Kohonen Vector Quantization and k-means
analyses and four clusters were obtained from Autoclass.
CATEGORY                  K-MEANS    KOHONEN VECTOR QUANTIZATION  AUTOCLASS
Significant variables     NB_TRAIN   KR_TEST                      All. The top 5 are:
(in order of importance)  KR_TEST    IBK_TRAIN                    NUM_ATTR
                          C45_TRAIN  NUM_ATTR                     KR_TRAIN
                                     RL_TRAIN                     C45_TEST
                                     C45_TRAIN                    RL_TEST
                                                                  KR_TEST
Number of clusters        5          5                            4
Table 18: Comparison of significant variables found by the three clustering algorithms.
Table 19 shows which variables were picked as significant by all 3, by 2, or by just 1 of the
clustering algorithms. As can be seen from table 19, the KR_TEST and C45_TRAIN variables were
picked as significant in determining the clusters by all three algorithms.
Variables picked as        K-means             Kohonen Vector Quantization    Autoclass
significant in clustering
data by:
3 algorithms               KR_TEST, C45_TRAIN  KR_TEST, C45_TRAIN             KR_TEST, C45_TRAIN
2 algorithms               NB_TRAIN            IBK_TRAIN, RL_TRAIN, NUM_ATTR  NB_TRAIN, IBK_TRAIN,
                                                                              RL_TRAIN, NUM_ATTR
1 algorithm                -                   -                              KR_TRAIN, C4.5_TEST,
                                                                              RL_TEST, IBK_TEST,
                                                                              OR_TEST, NB_TRAIN,
                                                                              OR_TRAIN, IBK_TRAIN
Table 19: Summary of the significance of each variable for the 3 clustering algorithms.
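The tallies in table 19 follow mechanically from table 18 if Autoclass is treated as picking every variable. A small sketch of that cross-tabulation (names are taken from table 18):

```python
# Significant variables per clustering algorithm, as listed in table 18.
# Autoclass reported all 13 variables as having nonzero influence.
ALL_VARS = {"NUM_ATTR", "IBK_TRAIN", "IBK_TEST", "C45_TRAIN", "C45_TEST",
            "RL_TRAIN", "RL_TEST", "NB_TRAIN", "NB_TEST", "OR_TRAIN",
            "OR_TEST", "KR_TRAIN", "KR_TEST"}
picks = {
    "kmeans": {"NB_TRAIN", "KR_TEST", "C45_TRAIN"},
    "kohonen": {"KR_TEST", "IBK_TRAIN", "NUM_ATTR", "RL_TRAIN", "C45_TRAIN"},
    "autoclass": ALL_VARS,
}

# Count, for every variable, how many algorithms picked it.
counts = {v: sum(v in chosen for chosen in picks.values()) for v in ALL_VARS}
picked_by_all = {v for v, n in counts.items() if n == 3}
```

Running this reproduces the first row of table 19: only KR_TEST and C45_TRAIN are picked by all three algorithms.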
4.5.2 Comparison of Data Sets in Different Clusters
Detailed analysis of the grouping of the data sets from each algorithm revealed that all the data sets
listed in one column of table 20 were actually grouped together by all three clustering algorithms.
- 41 -
For example, the ‘colic’ data set was grouped by all three clustering algorithms as having lower than
average error rates. There are total of 24 data sets in this group. The data sets in column 2 of table 20
were found to have higher error rates than average by the three clustering algorithms. There are 5 data
sets in this group. The data sets in column 3 have about average error rates and there are 5 data sets in
this group. This gives a total of 34 data sets clearly classified to different groups.
Data sets having error rates lower than average:
Colic, Credit-A, Heart-Statlog, Labor, Hepatitis, Lymph, Baskball, Elusage, Fishcatch, Gascons,
Hungarian, Lowbwt, Audiology, Heart-C, Sonar, Soybean, Heart-H, Splice, Vowel, Detroit,
Mbagrade, Pollution, Sleep, Strike
Data sets having error rates higher than average:
Primary-Tumor, Breasttumor, Echomonths, Housing, Quake
Data sets having error rates about average:
Credit-G, Diabetes, Autos, Cleveland, Servo
Table 20: Data sets in each group were grouped into one cluster by all three algorithms.
4.5.3 Influence of Characteristics of Data Sets on Performance of Classification Algorithms
Looking at table 20, it can be seen that both the 'Credit-G' and 'Servo' data sets were grouped
together as having about average error rates, even though 'Servo' has 5 attributes and 167 instances
whereas 'Credit-G' has 20 attributes and 1000 instances. On the other hand, both the 'Breasttumor' and
'Quake' data sets were grouped together as having higher than average error rates; 'Breasttumor' has
10 attributes and 286 instances whereas 'Quake' has 4 attributes and 2178 instances. This gives an
indication that the number of attributes and the number of instances (the characteristics of the data sets
used here) do not have a strong influence on the performance of the classification algorithms.
- 42 -
Chapter 5. Conclusion
As was stated in chapter 1, the major goal of this thesis was to find associations between
classification algorithms and characteristics of data sets by a two-step process:
1. Build a file of data set names, their characteristics and the performance of a number of algorithms
on each data set.
2. Apply unsupervised clustering to the file built in step 1, analyze the generated clusters and
determine whether there are any significant patterns.
The major discovery made by analyzing the generated clusters is that the clusters were formed based
on the accuracy of the algorithms. The data sets were grouped as belonging to clusters having low,
medium or high average error rates. This suggests that there are three kinds of data sets among the 59
used: 'easy-to-learn data sets', 'moderate-to-learn data sets', and 'hard-to-learn data sets'.
It was also found that the number of instances was not useful in clustering the data sets: before it was
excluded from the generated data set, it was the only significant variable in clustering, which prevented
any analysis based on the other variables, including those containing the accuracy values of each
classification algorithm. It was therefore excluded.
While not directly relevant to clustering, it was also shown that the number of instances and the number
of attributes of the data sets do not have a strong influence on the performance of the data mining
algorithms, as high error rates were obtained both for small data sets with a small number of attributes
and for large data sets with a large number of attributes.
Experiments performed for this thesis also allowed a comparison of the performance of the 6
classification algorithms when the default settings were used for each algorithm. It was discovered
that in terms of performance the top three algorithms were Kernel Density, C4.5 and Naïve Bayes,
followed by Rule Learner, IBK and OneR. However, it is to be noted that training times for both
Kernel Density and Naïve Bayes were considerably higher than for the other algorithms.
5.1 Further Work
While very limited in scope, this investigation has revealed a number of interesting clusters in machine
learning performance data. It suggests that a larger investigation, as outlined below, using more data
sets and data set characteristics would be worthwhile.
• Use more data sets:
The use of a larger number of data sets would increase the size of the data set generated for
clustering analysis. This would allow the clustering algorithms to consider more cases in the
formation of clusters.
• Use more data set sources:
The data sets used in this thesis came mainly from the UCI data collection. The use of a larger
variety of real data sets from different industries may allow the formation of clusters which reveal
patterns between different industries and data mining algorithms.
- 43 -
• Use small to very big data sets:
In the data mining industry the size of the data to be analyzed can be very large. The maximum
size of the data sets used in this thesis was 8124 instances. It would be useful to see what kind of
performance is obtained and what types of clusters are formed when large data sets are used.
• Use more classification algorithms:
The use of more classification algorithms would allow larger data sets to be generated for
analysis by clustering.
• Use more clustering algorithms:
The use of more clustering algorithms would allow greater confidence in the consistency of the
cluster formation.
• Use optimal parameter values by fine-tuning the settings of each algorithm:
As well as using the error rates obtained with the default settings of the different classification
algorithms, the use of error rates obtained by fine-tuning the different options available for
optimal classification performance may allow the formation of different types of clusters and may
also produce new significant variables for clustering.
• Use more characteristics of data sets:
Only 'number of instances' and 'number of attributes' were used in this thesis. Characteristics of
the data sets such as whether they contain numeric, symbolic or mixed values, or missing
values, could be useful.
• Use visualization tools to analyze the generated data set:
Visualization of the generated data set may provide important information and may allow better
analysis of the clusters formed.
- 44 -
Appendix A. About WEKA
Weka is a collection of machine learning algorithms for solving real-world data mining problems. It is
written in Java and runs on almost any platform. The algorithms can either be applied directly to a data
set or called from Java code. Weka is also well suited for developing new machine learning schemes.
Weka is open source software issued under the GNU General Public License. Implemented schemes
for classification include decision tree inducers, rule learners, Naive Bayes, decision tables, locally
weighted regression, support vector machines, instance-based learners, logistic regression and voted
perceptrons.
Implemented schemes for numeric prediction include linear regression, model tree generators, locally
weighted regression, instance-based learners and decision tables.
Implemented "meta-schemes" include bagging, stacking, boosting, regression via classification and
classification via regression.
More details can be found at http://www.cs.waikato.ac.nz.
- 45 -
Appendix B. About Enterprise Miner (Commercial software)
Enterprise Miner is an integrated software product that provides an end-to-end business solution for
data mining. A graphical user interface (GUI) provides a user-friendly front-end to the SEMMA
(Sample, Explore, Modify, Model, Assess) process.
All of the functionality needed to implement the SEMMA process is accessed through a single GUI.
The SEMMA process is driven by a process flow diagram (pfd), which can be modified and saved.
However, the GUI is designed in such a way that the business technologist with little statistical
expertise can quickly and easily navigate through the SEMMA process, while the quantitative expert
can go "behind the scenes" to fine tune the analytical process.
SAS Enterprise Miner contains a collection of sophisticated analysis tools with a common user-friendly
interface that enables you to create and compare multiple algorithms. Statistical tools include
clustering, decision trees, linear and logistic regression, and neural networks. Data preparation tools
include outlier detection, variable transformations, random sampling, and the partitioning of data sets
(into train, test, and validate data sets) [48].
More details can be found at http://www.sas.com
- 46 -
Appendix C. Detailed Results from K-means Analysis
FASTCLUS Procedure: Replace=FULL Radius=0 Maxclusters=5 Maxiter=1
Initial Seeds
CLUSTER  NUM_ATTR  IBK_TRAIN  IBK_TEST  C45_TRAIN  C45_TEST  RL_TRAIN  RL_TEST  NB_TRAIN  NB_TEST  OR_TRAIN  OR_TEST  KR_TRAIN  KR_TEST
1        7.0000    0.0000     0.0000    0.0000     0.0000    0.0000    0.0000   0.0000    0.0000   0.0000    0.0000   0.0000    0.0000
2        35.0000   50.8000    11.4200   3.6600     7.9100    3.6600    8.3400   6.3000    7.1700   59.1500   60.4700  0.1700    8.9300
3        10.0000   26.2200    84.9700   44.4000    83.9200   40.9100   87.0600  68.1800   82.8700  69.9300   76.2200  3.8500    82.5200
4        11.0000   0.0000     61.5000   18.5000    48.5000   0.1400    49.5000  51.5000   0.6200   0.6200    0.3800   0.0000    0.6000
5        9.0000    0.0000     68.2100   54.8700    54.8700   3.5900    66.1500  22.5600   0.6000   5.1300    82.0500  0.0000    66.1500
Statistics for Variables
VARIABLE   TOTAL STD  WITHIN STD  R-SQUARED  RSQ/(1-RSQ)
NUM_ATTR   17.665590  13.579017   0.449894   0.817832
IBK_TRAIN  16.187436  12.263900   0.465600   0.871256
IBK_TEST   20.064596  12.113216   0.660669   1.946973
C45_TRAIN  11.392017  7.787043    0.564980   1.298743
C45_TEST   17.271645  11.762713   0.568169   1.315723
RL_TRAIN   8.790095   5.570234    0.626126   1.674696
RL_TEST    18.635989  10.262284   0.717675   2.542021
NB_TRAIN   16.614650  9.026326    0.725207   2.639105
NB_TEST    18.501144  11.202011   0.658681   1.929810
OR_TRAIN   18.116768  12.560708   0.552459   1.234433
OR_TEST    21.936017  13.732877   0.635100   1.740480
KR_TRAIN   5.744748   4.852061    0.335835   0.505649
KR_TEST    19.132048  9.951635    0.748098   2.969797
OVER-ALL   16.770235  10.715786   0.619867   1.630659
- 47 -
Appendix D. Detailed Results from Kohonen Vector Quantization Analysis
FASTCLUS Procedure: Replace=RANDOM Radius=0 Maxclusters=5 Maxiter=100 Converge=0.02
Initial Seeds
CLUSTER  NUM_ATTR  IBK_TRAIN  IBK_TEST  C45_TRAIN  C45_TEST  RL_TRAIN  RL_TEST  NB_TRAIN  NB_TEST  OR_TRAIN  OR_TEST  KR_TRAIN  KR_TEST
1        35.0000   50.8000    11.4200   3.6600     7.9100    3.6600    8.3400   6.3000    7.1700   59.1500   60.4700  0.1700    8.9300
2        26.0000   3.4100     24.8900   4.8900     17.5600   9.7600    24.8900  31.2200   41.9500  31.2200   37.0700  0.0000    22.9300
3        5.0000    30.2400    20.6400   9.2800     22.2400   5.1200    19.5200  9.1200    9.1200   36.4800   41.1200  0.0000    0.1200
4        4.0000    0.0000     1.9200    1.9200     1.9200    1.9200    1.9200   0.0000    1.9200   0.0000    3.8500   0.0000    1.9200
5        7.0000    0.0000     0.0000    0.0000     0.0000    0.0000    0.0000   0.0000    0.0000   0.0000    0.0000   0.0000    0.0000
Minimum Distance Between Initial Seeds = 7.044665
Kohonen Learning Rate: Initial=0.5 Final=0.02 Steps=1000
Kohonen VQ: Maxsteps=10000 Maxiter=100 Converge=0.0001
FASTCLUS Procedure: Replace=RANDOM Radius=0 Maxclusters=5 Maxiter=100
Converge=0.02
Statistics for Variables
VARIABLE   TOTAL STD  WITHIN STD  R-SQUARED  RSQ/(1-RSQ)
NUM_ATTR   17.665590  9.538179    0.728581   2.684339
IBK_TRAIN  16.187436  8.862687    0.720912   2.583105
IBK_TEST   20.064596  13.065480   0.605220   1.533053
C45_TRAIN  11.392017  8.626314    0.466155   0.873204
C45_TEST   17.271645  11.900984   0.557957   1.262225
RL_TRAIN   8.790095   5.660908    0.613855   1.589698
RL_TEST    18.635989  10.682617   0.694074   2.268766
NB_TRAIN   16.614650  9.844509    0.673133   2.059346
NB_TEST    18.501144  11.674383   0.629288   1.697513
OR_TRAIN   18.116768  12.274027   0.572655   1.340031
OR_TEST    21.936017  14.221397   0.608678   1.555437
KR_TRAIN   5.744748   4.874310    0.329730   0.491935
KR_TEST    19.132048  11.494007   0.663964   1.975872
OVER-ALL   16.770235  10.540830   0.632179   1.718710
Pseudo F Statistic = 23.20
- 48 -
Appendix E. Detailed Results from Autoclass Clustering Analysis
Class Listings
These listings are ordered by class weight --
* j is the zero-based class index,
* k is the zero-based attribute index, and
* l is the zero-based discrete attribute instance index.
Within each class, the covariant and independent model terms are ordered by their term influence
value I-jk. Covariant attributes and discrete attribute instance values are both ordered by their
significance value. Significance values are computed with respect to a single class classification,
using the divergence from it, abs(log(Prob-jkl / Prob-*kl)), for discrete attributes and the relative
separation from it, abs(Mean-jk - Mean-*k) / StDev-jk, for numerically valued attributes. For the
SNcm model, the value line is followed by the probabilities that the value is known, for that class
and for the single class classification.
Entries are attribute type dependent, and the corresponding headers are reproduced with each class.
In these --
* num/t denotes model term number,
* num/a denotes attribute number,
* t denotes attribute type,
* mtt denotes model term type, and
* I-jk denotes the term influence value.
CLASS 1- weight 19 normalized weight 0.322 relative strength 1.56e-03 *******
class cross entropy w.r.t. global class 9.10e+00 *******
Model file: /tmp/wekaAC-112131-14862-16325.model - index = 0
Numbers: numb/t = model term number; numb/a = attribute number
Model term types (mtt): (single_normal_cn SNcn)
(single_normal_cm SNcm)
ATTRIBUTE   I-JK  (MEAN-JK  STDEV-JK)  |MEAN-JK - MEAN-*K|/STDEV-JK  (MEAN-*K  STDEV-*K)
NUM_ATTR    2.503 ( 1.16e+01 6.35e+00) 3.68e+00 ( 3.49e+01 1.26e+02)
IBK_TRAIN   0.853 ( 4.26e+00 4.96e+00) 1.58e+00 ( 1.21e+01 1.61e+01)
OR_TRAIN    0.799 ( 1.72e+01 5.80e+00) 1.36e+00 ( 2.51e+01 1.82e+01)
RL_TEST     0.776 ( 1.93e+01 5.30e+00) 6.35e-01 ( 2.27e+01 1.78e+01)
KR_TEST     0.771 ( 1.87e+01 5.98e+00) 7.50e-01 ( 2.32e+01 1.90e+01)
C4.5__TEST  0.761 ( 1.86e+01 5.20e+00) 6.74e-01 ( 2.21e+01 1.72e+01)
IBK_TEST    0.662 ( 1.95e+01 7.20e+00) 1.16e+00 ( 2.78e+01 1.93e+01)
OR_TEST     0.517 ( 2.19e+01 9.43e+00) 1.05e+00 ( 3.18e+01 2.07e+01)
C45_TRAIN   0.383 ( 1.07e+01 5.10e+00) 1.92e-02 ( 1.06e+01 1.11e+01)
NB_TEST     0.316 ( 1.91e+01 9.91e+00) 5.72e-01 ( 2.48e+01 1.85e+01)
RL_TRAIN    0.312 ( 7.66e+00 4.31e+00) 1.88e-01 ( 8.47e+00 8.50e+00)
NB_TRAIN    0.295 ( 1.52e+01 8.77e+00) 5.64e-01 ( 2.01e+01 1.59e+01)
KR_TRAIN    0.152 ( 3.24e+00 4.04e+00) 1.76e-01 ( 2.53e+00 5.80e+00)
- 49 -
CLASS 2- weight 17 normalized weight 0.289 relative strength 7.31e-05 *******
class cross entropy w.r.t. global class 7.25e+00 *******
Model file: /tmp/wekaAC-112131-14862-16325.model - index = 0
Numbers: numb/t = model term number; numb/a = attribute number
Model term types (mtt): (single_normal_cn SNcn)
(single_normal_cm SNcm)
ATTRIBUTE   I-JK  (MEAN-JK  STDEV-JK)  |MEAN-JK - MEAN-*K|/STDEV-JK  (MEAN-*K  STDEV-*K)
KR_TRAIN    3.908 ( 1.13e-02 4.86e-02) 5.19e+01 ( 2.53e+00 5.80e+00)
NUM_ATTR    0.565 ( 8.81e+01 2.20e+02) 2.41e-01 ( 3.49e+01 1.26e+02)
RL_TEST     0.450 ( 2.33e+01 7.55e+00) 8.85e-02 ( 2.27e+01 1.78e+01)
C4.5__TEST  0.413 ( 2.35e+01 7.64e+00) 1.89e-01 ( 2.21e+01 1.72e+01)
C45_TRAIN   0.337 ( 8.18e+00 5.59e+00) 4.31e-01 ( 1.06e+01 1.11e+01)
IBK_TEST    0.290 ( 2.94e+01 1.03e+01) 1.53e-01 ( 2.78e+01 1.93e+01)
OR_TEST     0.289 ( 4.35e+01 1.41e+01) 8.31e-01 ( 3.18e+01 2.07e+01)
KR_TEST     0.268 ( 2.29e+01 1.02e+01) 2.73e-02 ( 2.32e+01 1.90e+01)
IBK_TRAIN   0.188 ( 2.09e+01 1.83e+01) 4.78e-01 ( 1.21e+01 1.61e+01)
RL_TRAIN    0.173 ( 8.59e+00 5.25e+00) 2.39e-02 ( 8.47e+00 8.50e+00)
NB_TEST     0.160 ( 2.45e+01 1.17e+01) 2.26e-02 ( 2.48e+01 1.85e+01)
OR_TRAIN    0.136 ( 3.21e+01 1.43e+01) 4.92e-01 ( 2.51e+01 1.82e+01)
NB_TRAIN    0.073 ( 1.97e+01 1.18e+01) 3.82e-02 ( 2.01e+01 1.59e+01)
CLASS 3- weight 13 normalized weight 0.220 relative strength 1.00e+00 *******
class cross entropy w.r.t. global class 1.84e+01 *******
Model file: /tmp/wekaAC-112131-14862-16325.model - index = 0
Numbers: numb/t = model term number; numb/a = attribute number
Model term types (mtt): (single_normal_cn SNcn)
(single_normal_cm SNcm)
ATTRIBUTE   I-JK  (MEAN-JK  STDEV-JK)  |MEAN-JK - MEAN-*K|/STDEV-JK  (MEAN-*K  STDEV-*K)
C4.5__TEST  2.520 ( 2.27e+00 1.64e+00) 1.21e+01 ( 2.21e+01 1.72e+01)
RL_TEST     2.450 ( 2.48e+00 1.78e+00) 1.13e+01 ( 2.27e+01 1.78e+01)
C45_TRAIN   2.196 ( 1.23e+00 1.07e+00) 8.71e+00 ( 1.06e+01 1.11e+01)
KR_TRAIN    2.182 ( 1.70e-01 3.63e-01) 6.50e+00 ( 2.53e+00 5.80e+00)
NUM_ATTR    1.973 ( 1.78e+01 1.07e+01) 1.60e+00 ( 3.49e+01 1.26e+02)
KR_TEST     1.906 ( 3.18e+00 2.60e+00) 7.71e+00 ( 2.32e+01 1.90e+01)
RL_TRAIN    1.885 ( 1.17e+00 1.14e+00) 6.39e+00 ( 8.47e+00 8.50e+00)
NB_TEST     0.827 ( 9.39e+00 7.54e+00) 2.04e+00 ( 2.48e+01 1.85e+01)
OR_TEST     0.772 ( 1.04e+01 1.15e+01) 1.86e+00 ( 3.18e+01 2.07e+01)
NB_TRAIN    0.688 ( 8.70e+00 6.88e+00) 1.66e+00 ( 2.01e+01 1.59e+01)
OR_TRAIN    0.618 ( 9.09e+00 1.04e+01) 1.54e+00 ( 2.51e+01 1.82e+01)
IBK_TEST    0.308 ( 1.39e+01 1.57e+01) 8.87e-01 ( 2.78e+01 1.93e+01)
IBK_TRAIN   0.069 ( 8.90e+00 1.31e+01) 2.45e-01 ( 1.21e+01 1.61e+01)
- 50 -
CLASS 4- weight 10 normalized weight 0.170 relative strength 5.75e-08 *******
class cross entropy w.r.t. global class 1.65e+01 *******
Model file: /tmp/wekaAC-112131-14862-16325.model - index = 0
Numbers: numb/t = model term number; numb/a = attribute number
Model term types (mtt): (single_normal_cn SNcn)
(single_normal_cm SNcm)
ATTRIBUTE   I-JK  (MEAN-JK  STDEV-JK)  |MEAN-JK - MEAN-*K|/STDEV-JK  (MEAN-*K  STDEV-*K)
NUM_ATTR    3.111 ( 1.00e+01 3.46e+00) 7.20e+00 ( 3.49e+01 1.26e+02)
KR_TEST     1.605 ( 5.59e+01 1.33e+01) 2.45e+00 ( 2.32e+01 1.90e+01)
RL_TEST     1.586 ( 5.38e+01 1.36e+01) 2.29e+00 ( 2.27e+01 1.78e+01)
C4.5__TEST  1.572 ( 5.18e+01 1.26e+01) 2.36e+00 ( 2.21e+01 1.72e+01)
NB_TEST     1.493 ( 5.58e+01 1.31e+01) 2.36e+00 ( 2.48e+01 1.85e+01)
IBK_TEST    1.428 ( 5.91e+01 1.32e+01) 2.37e+00 ( 2.78e+01 1.93e+01)
NB_TRAIN    1.227 ( 4.48e+01 1.35e+01) 1.83e+00 ( 2.01e+01 1.59e+01)
C45_TRAIN   1.098 ( 2.62e+01 1.50e+01) 1.04e+00 ( 1.06e+01 1.11e+01)
RL_TRAIN    0.903 ( 1.89e+01 1.20e+01) 8.73e-01 ( 8.47e+00 8.50e+00)
KR_TRAIN    0.893 ( 7.63e+00 1.02e+01) 5.02e-01 ( 2.53e+00 5.80e+00)
OR_TEST     0.791 ( 5.62e+01 1.48e+01) 1.65e+00 ( 3.18e+01 2.07e+01)
OR_TRAIN    0.721 ( 4.68e+01 1.80e+01) 1.20e+00 ( 2.51e+01 1.82e+01)
IBK_TRAIN   0.075 ( 1.67e+01 1.86e+01) 2.46e-01 ( 1.21e+01 1.61e+01)
Appendix F. Sample Code Used to Run Multiple Algorithms on Multiple Data Sets
Run-one.bat
The script that captures the parameters and runs a single data mining algorithm is shown below:
#!/bin/csh
#Expects to be invoked from run-lots.bat
#Create a name for the output file by removing `java weka.classifiers.'
#and all spaces from the command line
set outfile=`echo $argv[1] |sed -e "s/java weka.classifiers.//"|sed -e "s/ //g"`
#Change this line to set the name of the output directory
set outfile=/research/ai/ribrahim/thesis/results/$outfile
/bin/echo $outfile
/bin/echo "$argv[1]" > $outfile
$argv[1] >> $outfile
Run-multiple.bat
A sample of the script that runs multiple algorithms on multiple data sets is shown below:
run-one.bat "java weka.classifiers.j48.J48 -o -t vote.arff"
run-one.bat "java weka.classifiers.IBk -o -W 200 -t vote.arff"
run-one.bat "java weka.classifiers.j48.PART -o -t vote.arff"
run-one.bat "java weka.classifiers.NaiveBayes -o -t vote.arff"
run-one.bat "java weka.classifiers.OneR -o -t vote.arff"
run-one.bat "java weka.classifiers.KernelDensity -o -t vote.arff"
run-one.bat "java weka.classifiers.j48.J48 -o -t vowel.arff"
run-one.bat "java weka.classifiers.IBk -o -W 200 -t vowel.arff"
run-one.bat "java weka.classifiers.j48.PART -o -t vowel.arff"
run-one.bat "java weka.classifiers.NaiveBayes -o -t vowel.arff"
run-one.bat "java weka.classifiers.OneR -o -t vowel.arff"
run-one.bat "java weka.classifiers.KernelDensity -o -t vowel.arff"
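Because every data set block above follows the same fixed pattern, a file like Run-multiple.bat can be generated rather than typed by hand. A minimal sketch (a hypothetical generator, not part of the original thesis tooling; the classifier flags and the two data set names mirror the examples above):

```shell
#!/bin/sh
# Hypothetical generator: emits one run-one.bat line per
# (classifier, data set) pair into run-multiple.generated.bat.
: > run-multiple.generated.bat
for data in vote.arff vowel.arff; do
  for clf in "j48.J48 -o" "IBk -o -W 200" "j48.PART -o" \
             "NaiveBayes -o" "OneR -o" "KernelDensity -o"; do
    echo "run-one.bat \"java weka.classifiers.$clf -t $data\"" \
      >> run-multiple.generated.bat
  done
done
```

Extending the experiment to all 59 data sets would then only require adding file names to the outer loop.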
Appendix G. Sample Output from Data Generation
java weka.classifiers.j48.PART -o -t glass.arff
=== Error on Training data ===
Correctly Classified Instances       194   (90.65 %)
Incorrectly Classified Instances      20   ( 9.35 %)
Mean absolute error                    0.04
Root mean squared error                0.15
Relative absolute error               20.98 %
Root relative squared error           45.92 %
Total Number of Instances            214

=== Stratified cross-validation ===
Correctly Classified Instances       141   (65.89 %)
Incorrectly Classified Instances      73   (34.11 %)
Mean absolute error                    0.1056
Root mean squared error                0.2943
Relative absolute error               49.8563 %
Root relative squared error           90.6776 %
Total Number of Instances            214
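To build the meta-level file of error rates, figures like these must be pulled out of each saved result file. A minimal sketch of such an extraction step (a hypothetical helper, not the parser actually used in the thesis), run here against a simplified inline copy of the two relevant lines:

```shell
#!/bin/sh
# Hypothetical extraction of the training and cross-validation
# error percentages from a saved Weka result file. The two
# "Incorrectly Classified Instances" lines carry the error rates.
cat > sample-result.txt <<'EOF'
=== Error on Training data ===
Incorrectly Classified Instances     20 (9.35 %)
=== Stratified cross-validation ===
Incorrectly Classified Instances     73 (34.11 %)
EOF
# Keep only the percentage inside the parentheses, one per line.
grep 'Incorrectly Classified Instances' sample-result.txt \
  | sed -e 's/.*(\([0-9.]*\) %.*/\1/' > error-rates.txt
```

The resulting pairs, one per data set, correspond to the *_TRAIN / *_TEST error values that are clustered in the experiments.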
6. References
1. D. Aha and D. Kibler, Instance-Based Learning Algorithms, Machine Learning, vol.6, pp. 37-66,
1991.
2. C. Beardah and M. Baxter, The Archaeological Use of Kernel Density Estimates, Internet
Archaeology, http://www.intarch.ac.uk, 1996.
3. J. Berger, Statistical Decision Theory and Bayesian Analysis, Springer-Verlag, 1985.
4. M. Berry and G. Linoff, Data Mining Techniques for Marketing, Sales and Customer Support,
John Wiley & Sons, New York, 1997.
5. A. Berson, S. Smith, Data Warehousing, Data Mining, and OLAP, McGraw Hill, 1997.
6. M. Bertero, T. Poggio and V. Torre, Ill-Posed Problems in Early Vision, Proceedings of the IEEE,
Vol. 76, No. 8, pp. 869-902, 1990.
7. M. Berthold and D. Hand, Intelligent Data Analysis: An Introduction, Springer-Verlag, 1999.
8. P. Brazdil, J. Gama and R. Henery, Characterizing the Applicability of Classification Algorithms
Using Meta-Level Learning, in Machine Learning: ECML-94, Springer-Verlag, pp. 83-102, 1994.
9. P. Brazdil and R. Henery, Analysis of Results, Machine Learning, Neural and Statistical
Classification, Ellis Horwood, pp. 175-212, 1994.
10. C. Brodley and P. Smyth, Applying Classification Algorithms in Practice, Statistics and
Computing, Vol. 7, pp. 45-56, 1995.
11. D. Brand and R. Gerritsen, Naïve Bayes and Nearest Neighbor,
http://www.dbmsmag.com/9807m07.html, 1997.
12. P. Brazdil and J. Gama, The STATLOG Project - Evaluation/Characterization of Classification
Algorithms, http://www.ncc.up.pt/liacc/ML/statlog/, 1998.
13. P. Cheeseman, On Finding the Most Probable Model, Computational Algorithms of Discovery and
Theory Formation, Morgan Kaufmann Publishers, San Francisco, pp. 73-96, 1990.
14. P. Cheeseman and J. Stutz, Bayesian Classification (AutoClass): Theory and Results, in Advances
in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.
15. P. Cheeseman, J. Stutz, M. Self, J. Kelly, W. Taylor, and D. Freeman, Bayesian Classification,
in Proceedings of the Seventh National Conference on Artificial Intelligence (AAAI-88), Morgan
Kaufmann Publishers, San Francisco, pp. 607-611, 1988.
16. J. Culberson, On the Futility of Blind Search: An Algorithmic View of 'No Free Lunch',
Evolutionary Computation, Vol. 6, 1998.
17. B. Efron and R. Tibshirani, An Introduction to the Bootstrap, Chapman and Hall, London,
1993.
18. W. Emde and D. Wettschereck, Relational Instance-Based Learning, in Proceedings of the 13th
International Conference on Machine Learning, pp. 122-130, 1996.
19. U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, Knowledge Discovery and Data Mining: Towards a
Unifying Framework, in Proceedings of the Second International Conference on Knowledge
Discovery and Data Mining (KDD-96), p. 82, AAAI Press, 1996.
20. E. Frank and I. Witten, Generating Accurate Rule Sets Without Global Optimization, Machine
Learning: Proceedings of the Fifteenth International Conference, Morgan Kaufmann Publishers,
San Francisco, pp. 144-151, 1998.
21. J. Goebel, K. Volk, H. Walker, F. Gerbault, P. Cheeseman, M. Self, J. Stutz, and W. Taylor, A
Bayesian Classification of the IRAS LRS Atlas, Astron Astrophys, Vol. 222, pp. 5-8, 1989.
22. R. Hanson, J. Stutz and P. Cheeseman, Bayesian Classification with Correlation and Inheritance,
Proceedings of 12th International Joint Conference on Artificial Intelligence, San Francisco, pp.
692-698, 1991.
23. R. Hecht-Nielsen, Neural Networks for Image Analysis, in Neural Network for Vision and Image
Processing, Carpenter and Grossberg, 1992.
24. A. Herr, N. Klomp and J. Atkinson, Identification of Bat Echolocation Calls Using a Decision
Tree Classification System, Complexity International, Volume 4, January 1997.
25. J. Hjorth, Computer Intensive Statistical Methods: Validation, Model Selection, and Bootstrap,
Chapman and Hall, London, 1994.
26. R. Holte, Very Simple Classification Rules Perform Well on Most Commonly Used Datasets,
Machine Learning, Kluwer Academic Publishers, Boston, Vol. 11, pp. 63-91, 1993.
27. A. Izenman, Recent Developments in Nonparametric Density Estimation, Journal of the American
Statistical Association, Vol. 86, No. 413, pp. 205-224, 1991.
28. G. Jang, A Comparison of Neural Networks Performance for Seismic Phase Identification, J.
Franklin Institute, Vol. 330, No. 3, pp. 505-524, 1993.
29. W. Jung, J. Oglesby, and H. Kirk, Data Mining Primer: Overview of Applications and
Methods, SAS Institute, Cary, 1998.
30. W. Kadous, The Use of Symbolic Learning Algorithms and Instance-Based Learning for Gesture
Recognition, http://www.cse.unsw.edu.au/~waleed/thesis/node69.html, University of NSW, 1995.
31. M. Kearns and U. Vazirani, An Introduction to Computational Learning Theory, The MIT Press,
London, 1994.
32. R. Kohavi, Holte's OneR, http://www.sgi.com/Technology/mlc/util/util/node14.html, 1996.
33. R. Kohavi, A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model
Selection, in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), pp. 1137-1145, 1995.
34. A. Lensu and P. Koikkalainen, Profiling of Text Documents through Self-Organizing Maps,
http://www.ferin.math.jyu.fi, University of Jyväskylä, 1997.
35. T. Lim and W. Loh, A Comparison of Prediction Accuracy, Complexity, and Training Time of
Thirty-Three Old and New Classification Algorithms, Technical Report, Department of Statistics,
University of Wisconsin-Madison, No. 979, 1997.
36. C. Marzban and G. Stumpf, A Neural Network for Tornado Prediction Based on Doppler Radar-derived Attributes, Journal of Applied Meteorology, Vol. 35, p. 617, 1996.
37. T. Masters, Advanced Algorithms for Neural Networks: A C++ Sourcebook, NY, John Wiley and
Sons, 1995.
38. G. McLachlan, and T. Krishnan, The EM Algorithm and Extensions, Wiley and Sons, 1997.
39. V. Nalwa, A Guided Tour of Computer Vision, Addison Wesley, 1993.
40. E. Parzen, On Estimation of a Probability Density Function and Mode, The Annals of
Mathematical Statistics, pp. 1065-1076, 1962.
41. W. Potts, Introduction to Predictive Data Mining Using Enterprise Miner Software, SAS Institute,
Cary, 1997.
42. D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann, San Francisco, 1999.
43. J. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo,
California, 1993.
44. J. Ramon and L. Raedt, Instance Based Function Learning, Lecture Notes in Computer Science,
Vol. 1634, pp. 268-275, 1999.
45. R. Rao, T. Voigt, and T. Fermanian, Data Mining of Subjective Agricultural Data, in Proceedings
of the Tenth International Conference on Machine Learning, Morgan Kaufmann, San Mateo, 1999.
46. S. Salzberg, On comparing classifiers: Pitfalls to avoid and a recommended approach, Data
Mining and Knowledge Discovery, Vol. 1, No.3, 1997.
47. S. Schubert, Data Mining Projects, SAS Methodology, SAS Institute Australia, Melbourne, 1998.
48. S. Schubert, The Data Mining Challenge: Turning Raw Data Into Business Gold,
http://www.sas.com, Cary, 1999.
49. D. Scott, Multivariate Density Estimation: Theory, Practice and Visualization, Wiley, New York,
1992.
50. J. Shavlik, R. Mooney and G. Towell, Symbolic and Neural Learning Algorithms: An
Experimental Comparison, Machine Learning, Vol. 6, pp. 111-143, 1991.
51. B. Silverman, Density Estimation, Chapman and Hall, London, 1986.
52. B. Silverman, Kernel Density Estimation using the Fast Fourier Transform, Applied Statistics,
Vol. 31, No.1, pp. 93-99, 1982.
53. P. Smyth, A. Gray and U. Fayyad, Retrofitting Decision Tree Classifiers Using Kernel Density
Estimation, in Proceedings of the 12th International Conference on Machine Learning, pp. 506-514, 1995.
54. W. Venables and B. Ripley, Modern Applied Statistics with S-Plus, Springer-Verlag, New York, 1994.
55. S. Weiss and N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann
Publishers, San Francisco, 1998.
56. S. Weiss and I. Kapouleas, An Empirical Comparison of Pattern Recognition, Neural Nets, and
Machine Learning Classification Methods, In Proceedings of the 11th International Joint
Conference on Artificial Intelligence, pp. 781-787, 1989.
57. C. Westphal and T. Blaxton, Data Mining Solutions: Methods and Tools for Solving Real-World
Problems, John Wiley & Sons, New York, 1998.
58. I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, San Francisco, 2000.
59. D. Wolpert and W. Macready, No Free Lunch Theorems for Search, Santa Fe Institute Technical
Report No. SFI-TR-95-02-010, 1995.
60. A. Upal, Autoclass, http://www.cs.ualberta.ca/~upal/cluster/p2/node11.html, 1997.