Data Mining of Machine Learning Performance Data

Remzi Salih Ibrahim

Master of Applied Science (Information Technology)
RMIT University
1999

Abstract

With the development and penetration of data mining within different fields and industries, many data mining algorithms have emerged. The selection of a good data mining algorithm to obtain the best result on a particular data set has become very important: what works well for one data set may not work well on another. The goal of this thesis is to find associations between classification algorithms and characteristics of data sets by first building a file of data sets, their characteristics and the performance of a number of algorithms on each data set, and second applying unsupervised clustering analysis to this file to analyze the generated clusters and determine whether there are any significant patterns. Six classification algorithms were applied to 59 data sets, and three clustering algorithms were then applied to the data generated. The patterns and properties of the clusters formed were then studied. The six classification algorithms used were OneR (1R), Kernel Density, Naïve Bayes, C4.5, Rule Learner and IBK. The clustering algorithms used were k-means clustering, Kohonen Vector Quantization, and Autoclass Bayesian clustering.

The major discovery made by analyzing the generated clusters is that the clusters were formed based on the accuracy of the algorithms. The data sets were grouped into clusters having lower than average, about average, or higher than average error rates relative to the population. This suggests that there are three kinds of data sets among the 59 considered: 'easy-to-learn', 'moderate-to-learn' and 'hard-to-learn' data sets.

Another discovery made by this thesis is that the number of instances in a data set was not useful for the clustering analysis of the machine learning performance data. Because it was the only significant variable in clustering the data sets, it prevented analysis based on the other variables, including the variables that contain values for the accuracy of each classification algorithm.

While not directly relevant to clustering, it was also found that the number of instances and number of attributes in the data sets do not have a strong influence on the performance of the data mining algorithms on the 59 data sets considered, as high error rates were obtained both for small data sets with a small number of attributes and for large data sets with a large number of attributes.

The experiments performed for this thesis also allowed the comparison of the performance of the 6 classification algorithms with their default parameter settings. It was discovered that, in terms of performance, the top three algorithms were Kernel Density, C4.5 and Naïve Bayes, followed by Rule Learner, IBK and OneR.

Declaration

I certify that all work on this thesis was carried out between June 1998 and June 2000 and that it has not been submitted for an academic award at any other College, Institute or University. The work presented was carried out under the supervision of Dr. Vic Ciesielski. All other work in the thesis is my own except where acknowledged in the text.

Signed,
Remzi Salih Ibrahim
June, 2000

Table of Contents

List of Tables
List of Graphs
Acknowledgements
Chapter 1. Introduction
  1.1 Goals
  1.2 Scope
Chapter 2. Literature Survey
  2.1 Supervised Learning
    2.1.1 Supervised Algorithms Used in This Thesis
      2.1.1.1 C4.5
      2.1.1.2 Rule Learner (PART)
      2.1.1.3 OneR (1R)
      2.1.1.4 IBK
      2.1.1.5 Naïve Bayes
      2.1.1.6 Kernel Density
  2.2 Unsupervised Learning
    2.2.1 Unsupervised Data Mining Algorithms Used in This Thesis
      2.2.1.1 K-means clustering
      2.2.1.2 Kohonen Vector Quantization
      2.2.1.3 Autoclass (Bayesian Classification System)
  2.3 Related Work on Comparison of Classifiers
Chapter 3. Data Generation
  3.1 Collection of Data Sets
  3.2 Selection of Data Mining Algorithms
  3.3 Generating Data
Chapter 4. Clustering and Pattern Analysis
  4.1 Results from k-means clustering
  4.2 Results from k-means clustering (without Number of Instances)
  4.3 Results from Kohonen Vector Quantization Clustering
  4.4 Results from Autoclass (Bayesian Classification System) Analysis
  4.5 Comparison
    4.5.1 Comparison of significant variables
    4.5.2 Comparison of data sets in different clusters
    4.5.3 Influence of characteristics of data sets on performance of classification algorithms
Chapter 5. Conclusion
Appendix A. About WEKA
Appendix B. About Enterprise Miner (commercial software)
Appendix C. Detailed results from k-means analysis
Appendix D. Detailed results from Kohonen Vector Quantization
Appendix E. Detailed results from Autoclass clustering analysis
Appendix F. Sample code used to run multiple algorithms on multiple data sets
Appendix G. Sample output from data generation
References

List of Tables

Table 1: Sample iris data
Table 2: Results obtained from applying 6 data mining algorithms to 59 data sets; blanks indicate situations where algorithms gave no result
Table 3: Definition of variables used
Table 4: Summary of the data gathered from running the 6 data mining algorithms on 59 data sets
Table 5: Importance level of variables in determining clusters
Table 6: Properties of the 5 clusters
Table 7: Importance of variables in determining clusters
Table 8: General properties of the clusters from k-means analysis
Table 9: General properties of clusters and significant variables
Table 10: Complete list of data sets in each cluster and the values of the significant clustering variables
Table 11: Importance of variables in determining clusters in Kohonen Vector Quantization analysis
Table 12: General properties of the clusters from Kohonen Vector Quantization analysis
Table 13: General properties of clusters and mean error rates of the significant variables
Table 14: Complete list of data sets in each cluster and the values of the significant clustering variables
Table 15: Significance level of variables in Autoclass clustering
Table 16: Properties of clusters from Autoclass analysis
Table 17: Complete list of data sets in each cluster and the values of the significant clustering variables
Table 18: Comparison of significant variables found by the three clustering algorithms
Table 19: Summary of the significance of each variable for the 3 clustering algorithms
Table 20: Data sets in each column were grouped into one cluster by all three algorithms

List of Graphs

Figure 1: A decision tree produced for the Iris data set
Figure 2: Rule output produced from the SAS Enterprise Miner software
Figure 3: Partial output from the WEKA OneR program

Acknowledgements

I would like to thank Dr. Vic Ciesielski for being supportive and very patient during the progress of my thesis. He has been very understanding of the problems that arise from working full time, studying part time and still having to fulfill family commitments. I would like to pass my thanks to Dr. Isaac Balbin for his support and his flexibility with the deadline. My thanks also go to the WEKA support team and the staff from SAS Institute Australia for their support, and to all staff from the RMIT AI (Artificial Intelligence) group who inspired me to research in this field. I would like to thank all members of my family and my friends for their patience during the progress of my thesis.

Chapter 1. Introduction

In the current age of technology, data has become more readily available than ever. Using technologies like data warehousing, data is being stored in large quantities. The availability of such data has opened the door for new data analysis techniques to emerge. As Weiss and Indurkhya [55] explain, as the amount of data stored in existing information systems has mushroomed, a new set of objectives for data management has emerged. Mining data has become one of the important means of obtaining useful information. The term data mining is defined by Fayyad, Piatetsky-Shapiro and Smyth [19] as the part of the Knowledge Discovery in Databases (KDD) process relating to methods for extracting patterns from data. The KDD process involves the complete steps of obtaining knowledge from data and includes selection, pre-processing, transformation and mining of data, followed by interpretation and evaluation of patterns.

Data mining has many advantages across different industries. It allows large historical data to be used as the background for prediction. The interpretation and evaluation of the patterns obtained by data mining produces new knowledge that decision-makers can act upon [42].
Data mining provides a means to obtain information that can support decision making and predict new business opportunities. For example, telecommunications, stock exchanges, and credit card and insurance companies use data mining to detect fraudulent use of their services; the medical industry uses data mining to predict the effectiveness of surgical procedures, medical tests and medications; and retailers use data mining to assess the effectiveness of coupons and special events [41].

With the development and penetration of data mining within different fields, many data mining algorithms have emerged. The selection of a good data mining algorithm to obtain the best result on a particular data set has become very important: what works well for one data set may not work well on another. Furthermore, the 'No Free Lunch' theorem of Wolpert and Macready [61, page 2] has established that "it is impossible to say that any technique is better than another over the space of all problems. In particular, if algorithm A outperforms algorithm B on some cost functions, then loosely speaking, there must exist exactly as many other functions where B outperforms A". An example of the 'No Free Lunch' theorem encountered in this thesis is the performance of the C4.5 and Rule Learner algorithms on the "EchoMonths" and "Hungarian" data sets (table 2). While the C4.5 algorithm obtained an error rate of only 0.6 percent on "EchoMonths", Rule Learner obtained 57.69 percent. But on the "Hungarian" data set, Rule Learner outperformed C4.5, with an error rate of 19.05 percent against 22.11 percent.

While the 'No Free Lunch' theorem has established that there can be no single 'best' learning algorithm, the question of what kinds of algorithms are best suited to what kinds of data remains open. While there has been some work comparing different algorithms on a range of data sets (the STATLOG project [12], Lim and Loh [35]), there has been little work on trying to characterize data sets (for example, big, small, numeric, symbolic, mixed) and matching algorithms to data characteristics. With the emergence of hundreds of data mining algorithms today, such information would help data mining analysts make intelligent decisions in choosing an appropriate data mining algorithm for certain types of data mining files.

1.1 Goals

The major goal of this thesis is to find associations between classification algorithms and characteristics of data sets by a two-step process:

1. Build a file of data set names, their characteristics and the performance of a number of algorithms on each data set.
2. Apply unsupervised clustering to the file built in step 1, analyze the generated clusters and determine whether there are any significant patterns.

1.2 Scope

Due to time limitations for an MBC minor thesis, the scope of this thesis is restricted to:

• 6 supervised learning algorithms
• 59 small to medium size data sets, with the number of attributes ranging from 3 to 76
• Running the 6 supervised algorithms on the 59 data sets using only the default settings of the algorithms
• Using 3 unsupervised learning algorithms for cluster analysis
• Characteristics of the data sets limited to only the number of attributes and the number of instances

Chapter 2. Literature Survey

Concepts and papers that are relevant to this thesis are discussed in this chapter. First, both supervised and unsupervised learning techniques are discussed, followed by descriptions of all the algorithms used in this thesis.
Finally, three papers that are related to this thesis are discussed in detail.

Machine learning is described by Witten and Frank [58] as the acquisition of knowledge and the ability to use it. They explain that learning in data mining involves finding and describing structural patterns in data for the purpose of helping to explain that data and make predictions from it. For example, the data could contain examples of customers in the telecommunications industry who have switched to another service provider and some who have not. The output of learning could be the prediction of whether a particular customer will switch to another service provider. There are two common types of learning: supervised and unsupervised.

2.1 Supervised Learning

Learning or adaptation is supervised when there is a desired response that the system can use to guide the learning. Decision trees and neural nets are two common types of supervised learning. This type of learning always requires a target variable to predict. Supervised learning algorithms have been used in many applications, for example in seismic phase identification in the field of nuclear science [28] and in the prediction of tornados [36].

Supervised learning involves gathering the data to be used for data mining, identifying the target variable, breaking the data up into training and testing data, and developing the classifier. The training data is used by the data mining algorithm to 'learn' the data and build a classifier. The test data is used to evaluate the performance of the classifier on new data. The performance of a classifier is commonly measured by the percentage of incorrectly classified instances on the data used. The train error rate refers to the percentage of incorrectly classified instances on the training data, and the test error rate refers to the percentage of incorrectly classified instances on the test data.

One of the problems of supervised learning is overfitting [58], in which the classifier works well on the training data but not on test data. This happens when the model learns the training data 'too well'. To get an indication of the amount of overfitting, the model should be tested using a test data set or cross-validation. If, after training, the test error rate is approximately equal to the training error rate, the test error rate is an indication of the kind of generalization that will occur.

Cross-validation is a method for estimating how well a classifier will perform on new data and is based on "resampling" [33]. Cross-validation is particularly useful when the amount of data is small, because it allows all of the data to be used for training. In k-fold cross-validation, the data is divided into k subsets of equal size. The model is trained k times, each time leaving out one of the subsets from training and using only the omitted subset to compute the error rate. If k equals the sample size, this is called leave-one-out cross-validation. Leave-one-out cross-validation often works well for continuous error functions such as the mean squared error, but it may perform poorly for non-continuous error functions such as the number of misclassified cases [33]. A value of 10 for k is commonly used and is also used in this thesis.
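To make the procedure concrete, the following is a minimal sketch of 10-fold cross-validation in Python using scikit-learn, a stand-in for the WEKA implementations actually used in this thesis; the choice of data set and classifier is illustrative only.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Load the iris data: 150 instances, 4 numeric attributes, 3 classes.
X, y = load_iris(return_X_y=True)

# A CART-style decision tree stands in for C4.5 here.
classifier = DecisionTreeClassifier(random_state=0)

# 10-fold cross-validation: train on 9 folds, test on the held-out fold,
# repeating so that every instance is used for testing exactly once.
accuracies = cross_val_score(classifier, X, y, cv=10)

# Express the result as an error rate (%), matching the thesis tables.
print(f"10-fold CV error rate: {100 * (1 - accuracies.mean()):.2f}%")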
Some data mining algorithms do not support continuous target variables. In such cases, binning or discretization is used. Binning is a method of converting continuous values into categorical ranges. For instance, if one of the independent variables is age, the values can be transformed from specific values into ranges such as "less than 20 years", "21 to 30 years", "31 to 40 years" and so on [4].

2.1.1 Supervised Algorithms Used in This Thesis

The basic theory behind each of the 6 classification algorithms and the details of how each algorithm works are discussed in this section. The parameters that affect the performance of each algorithm are also discussed and, where possible, papers that describe successful applications are cited.

2.1.1.1 C4.5

C4.5 is a decision tree algorithm devised by Quinlan [43]. Decision trees are used to classify instances into different categories and are a common type of classification algorithm. First, what a decision tree is will be discussed, followed by the properties of the C4.5 algorithm.

The 'iris' data set will be used to explain how a decision tree works. A sample of the iris data set is shown in table 1. The data contains the petal length, petal width, sepal length and sepal width of iris plants. There are three different categories of this plant: 'Iris-versicolor', 'Iris-virginica' and 'Iris-setosa'. The category is shown as the class variable in table 1 and is the target variable for the iris data set. There are 50 cases of each category in the data set. The goal is to determine what distinguishes each category of iris plant from the others, so that it is possible to know which category an iris plant belongs to given the four input variables.

An example of a decision tree produced from analysis of this data is shown in figure 1. The root node (top node) of the tree in figure 1 shows how many of each category are found before any analysis is made. There are three leaves in the tree. Each leaf is assigned the class of the majority of the instances in it. For example, the second leaf in the tree in figure 1 is considered to be of class 'Iris-versicolor' because the majority class in this leaf is 'Iris-versicolor', with 48 instances. The other class in this leaf has only 4 observations, and the node has an error rate of 7.7%.

The decision tree can be interpreted as follows: an iris plant with 'petalwidth' less than 0.8 is classified as 'Iris-setosa', and an iris plant with 'petalwidth' greater than 0.8 and less than 1.65 is categorized as 'Iris-versicolor'. All the rest (with 'petalwidth' greater than 1.65) are classified as 'Iris-virginica'. Based on this tree, an unknown iris plant with a 'petalwidth' of 1.4 would be classified as 'Iris-versicolor'. Note that only 'petalwidth' is used to classify the instances; all the other input variables have been determined to be irrelevant. The classification from the tree made an error of 6 out of 150, which is 4%. Therefore, it can be said that the error rate for this tree on the training data is 4%.

SEPALLENGTH  SEPALWIDTH  PETALLENGTH  PETALWIDTH  CLASS
5.1          3.5         1.4          0.2         Iris-setosa
4.9          3.0         1.4          0.2         Iris-setosa
4.7          3.2         1.3          0.2         Iris-setosa
4.6          3.1         1.5          0.2         Iris-setosa
6.3          3.3         6.0          2.5         Iris-virginica
5.8          2.7         5.1          1.9         Iris-virginica
7.1          3.0         5.9          2.1         Iris-virginica
6.3          2.9         5.6          1.8         Iris-virginica
6.5          3.0         5.8          2.2         Iris-virginica
5.5          2.6         4.4          1.2         Iris-versicolor
6.1          3.0         4.6          1.4         Iris-versicolor
5.8          2.6         4.0          1.2         Iris-versicolor
5.0          2.3         3.3          1.0         Iris-versicolor
5.6          2.7         4.2          1.3         Iris-versicolor

Table 1: Sample iris data.
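As a hedged illustration of tree learning on this data, the sketch below assumes scikit-learn, whose tree learner is CART rather than C4.5, so the exact splits may differ slightly from figure 1; the depth limit is an illustrative choice to keep the tree small.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Limit the depth so the tree stays as simple as the one in figure 1.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned splits; the petal measurements dominate,
# much as 'petalwidth' does in the thesis example.
print(export_text(tree, feature_names=list(iris.feature_names)))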
Root: 150 instances (50 Iris-setosa, 50 Iris-versicolor, 50 Iris-virginica)
  petalwidth < 0.8    -> leaf 'Iris-setosa'     (50 instances; 100.0% Iris-setosa)
  petalwidth >= 0.8:
    petalwidth < 1.65  -> leaf 'Iris-versicolor' (52 instances; 92.3% Iris-versicolor, 7.7% Iris-virginica)
    petalwidth >= 1.65 -> leaf 'Iris-virginica'  (48 instances; 95.8% Iris-virginica, 4.2% Iris-versicolor)

Figure 1: A decision tree produced for the Iris data set.

According to [43], the first task for C4.5 is to decide which of the non-target variables is the best variable on which to split the instances. In the example above, the 'petalwidth' variable was chosen. To choose this attribute, at a node the decision tree algorithm considers each attribute field in turn (for example 'petalwidth', 'petallength', 'sepallength' and 'sepalwidth' in the case of the iris data). Then every possible split is tried. C4.5 uses a criterion called the information gain ratio to compare the value of potential splits. The gain ratio provides an estimate of how likely a split on a variable is to lead to a leaf that contains few errors, or has low disorder. Disorder is a measure of how pure a given node is: a node with high disorder contains instances of multiple target classes, while a node with low disorder contains instances of predominantly one target class. The gain ratio is calculated for all the variables, and the 'winner' variable, the one with the largest gain ratio, is chosen as the split variable.

The tree grows in a similar manner. For each child node of the root node, the decision tree algorithm examines all the remaining attributes to find candidates for splitting. If a field takes on only one value, it is eliminated from consideration, since there is no way it can be used to make a split. The best split for each of the remaining attributes is determined. When all cases in a node are of the same type, the node is a leaf node.

But how good is this tree at classifying unknown data? Perhaps not very good, as it is built using training data only, which could lead to overfitting. So how does C4.5 avoid the problem of overfitting? C4.5 uses a method called pruning. There are two types of pruning: prepruning and postpruning. Postpruning refers to building a complete tree and pruning it afterwards. Postpruning makes the tree less complex, and probably more general, by replacing a subtree with a leaf or with the most common branch. When this is done, the leaf will correspond to several classes, but its label will be the most common class in the leaf (as was the case in figure 1). A parameter that affects postpruning is the confidence value: using a lower confidence value causes more drastic pruning. The default confidence value is 25%. Prepruning involves deciding when to stop developing subtrees during the tree building process. For example, specifying the minimum number of observations in a leaf can determine the size of the tree. The default value for the minimum number of instances is 2. By default, C4.5 uses postpruning only, but it can use prepruning.

After a tree is constructed, the C4.5 rule induction program can be used to produce a set of equivalent rules. The rules are formed by writing a rule for each path in the tree and then eliminating any unnecessary antecedents and rules. An example of the rules produced from the decision tree in figure 1 is shown in figure 2.

IF   Petalwidth < 0.8
THEN NODE: 2,  N: 50,  IRIS-VIRGINICA: 0.0%,  IRIS-VERSICOLOR: 0.0%,   IRIS-SETOSA: 100.0%

IF   0.8 <= Petalwidth < 1.65
THEN NODE: 3,  N: 52,  IRIS-VIRGINICA: 7.7%,  IRIS-VERSICOLOR: 92.3%,  IRIS-SETOSA: 0.0%

IF   1.65 <= Petalwidth
THEN NODE: 4,  N: 48,  IRIS-VIRGINICA: 95.8%, IRIS-VERSICOLOR: 4.2%,   IRIS-SETOSA: 0.0%

Figure 2: Rule output produced from the SAS Enterprise Miner software.

Rule 1, for example, shows that if 'Petalwidth' is less than 0.8 then the instance belongs in node 2, which has 50 observations and is classified as 'Iris-setosa'.
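C4.5's two pruning controls, the confidence value and the minimum number of instances per leaf, have rough analogues in other tree learners. The sketch below, again assuming scikit-learn's CART-style learner rather than C4.5 itself, shows the minimum-leaf-size parameter shrinking the tree:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# min_samples_leaf plays roughly the role of C4.5's "minimum number
# of observations in a leaf" prepruning parameter (C4.5 default: 2).
lightly_pruned = DecisionTreeClassifier(min_samples_leaf=2).fit(X, y)
heavily_pruned = DecisionTreeClassifier(min_samples_leaf=20).fit(X, y)

print(lightly_pruned.get_n_leaves(), "leaves with at least 2 instances per leaf")
print(heavily_pruned.get_n_leaves(), "leaves with at least 20 instances per leaf")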
C4.5 is currently one of the most commonly used data mining algorithms and is available in many commercial data mining products. The ease of its interpretability, as well as its methods for dealing with numeric attributes, missing values and noisy data, and for generating rules from trees, make it a very good choice for practical classification. C4.5 was successfully used in the automated identification of bat calls using 160 reference calls from eight bat species. The automated identification of pulse parameters led to good results for species with distinct differences in calls, with four out of eight species classified correctly in 95% of attempts [24].

2.1.1.2 Rule Learner (PART)

The PART algorithm forms rules from pruned partial decision trees built using C4.5's heuristics. According to Witten and Frank [58], the main advantage of PART over C4.5 is that, unlike C4.5, the rule learner algorithm does not need to perform global optimization to produce accurate rule sets. To make a single rule, a pruned decision tree is built, the leaf with the largest coverage is made into a rule, and the tree is discarded. This avoids overfitting by only generalizing once the implications are known. For example, going back to figure 1, PART would consider the first branch in the tree and build the rule: if 'petalwidth' is less than 0.8 then the plant is 'Iris-setosa'. It would then discard all the 'Iris-setosa' instances from consideration and continue with similar rules for the rest of the tree.

As for C4.5, the parameters that affect the performance of the algorithm are the minimum number of instances in each leaf and the confidence threshold for pruning. Frank and Witten [20] describe the results of an experiment performed on multiple data sets. The results showed that PART outperformed the C4.5 algorithm on 9 occasions, whereas C4.5 outperformed PART on 6.

2.1.1.3 OneR (1R)

OneR is one of the simplest classification algorithms. As described by Holte [26], OneR produces simple rules based on one attribute only. It generates a one-level decision tree, which is expressed in the form of a set of rules that all test one particular attribute. It is a simple, cheap method that often comes up with quite good rules for characterizing the structure in data [59], and it often achieves reasonable accuracy on many tasks by simply looking at one attribute.

An example of a classification performed by OneR on the 'iris' data set is shown in figure 3. As can be seen from the figure, OneR produced rules stating that when 'petallength' is less than 2.45 the iris plant is classified as 'Iris-setosa'; when 'petallength' is greater than or equal to 2.45 and less than 4.75 the plant is classified as 'Iris-versicolor'; and when 'petallength' is greater than or equal to 4.75 the plant is classified as 'Iris-virginica'. This gave 143 correct classifications out of 150 on the training data, an error rate of 4.7%.

## 1R Rule Output
% rule for 'petallength':
'class'('Iris-setosa')     :- 'petallength'(X), X < 2.45.   % 50/50
'class'('Iris-versicolor') :- 'petallength'(X), X < 4.75.   % 44/50
'class'('Iris-virginica')  :- 'petallength'(X), 4.75 =< X.  % 48/50
% 1Rw Error Rate 4.7 % (143/150) (on training set)

Figure 3: Partial output from the WEKA OneR program.
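Because OneR is so simple, its core idea fits in a few lines. The following hypothetical implementation handles categorical attributes only (the real algorithm also discretizes numeric attributes such as 'petallength', which is omitted here); the function name and toy data are invented for illustration.

from collections import Counter, defaultdict

def one_r(instances, classes):
    """Pick the single attribute whose one-attribute rule set makes the
    fewest errors on the training data, as OneR does."""
    best = None  # (errors, attribute index, rules)
    for a in range(len(instances[0])):
        # For each value of attribute a, predict the majority class.
        by_value = defaultdict(Counter)
        for row, cls in zip(instances, classes):
            by_value[row[a]][cls] += 1
        rules = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
        errors = sum(cls != rules[row[a]] for row, cls in zip(instances, classes))
        if best is None or errors < best[0]:
            best = (errors, a, rules)
    return best

# Toy weather-style data: (outlook, windy) -> play
data = [("sunny", "false"), ("sunny", "true"), ("rainy", "true"), ("overcast", "false")]
labels = ["yes", "no", "no", "yes"]
print(one_r(data, labels))  # 'windy' alone classifies this toy data perfectly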
A comprehensive study of the performance of the OneR algorithm by Holte [26] reported on sixteen data sets frequently used by machine learning researchers to evaluate their algorithms. Cross-validation was used to ensure that the results were representative of what would be obtained on independent test sets. The research found that OneR performed very well in comparison with other, more complex algorithms, and Holte encourages the use of simple data mining algorithms like OneR to establish a performance baseline before progressing to more sophisticated learning algorithms.

2.1.1.4 IBK

IBK is an implementation of the k-nearest-neighbors classifier. Each case is considered as a point in multi-dimensional space, and classification is done based on the nearest neighbors. The value of 'k' can vary; it determines how many cases are considered as neighbors when deciding how to classify an unknown instance. For example, for the 'iris' data, IBK would consider the 4-dimensional space defined by the four input variables. A new instance would be classified as belonging to the class of its closest neighbor using the Euclidean distance measure. If 5 is used as the value of 'k', then the 5 closest neighbors are considered, and the class of the new instance is taken to be the class of the majority of those neighbors. If 5 is used as the value of k and 3 of the closest neighbors are of type 'Iris-setosa', then the class of the test instance would be 'Iris-setosa'.

The time taken to classify a test instance with a nearest-neighbor classifier increases linearly with the number of training instances kept in the classifier, and the method has a large storage requirement [59]. Its performance degrades quickly with increasing noise levels. It also performs badly when different attributes affect the outcome to different extents. One parameter that can affect the performance of the IBK algorithm is the number of nearest neighbors used; by default it uses just one nearest neighbor.

IBK has been used for gesture recognition, as discussed by Kadous [30]. With 95 signs collected from 5 people, for a total of 6650 instances, the accuracy obtained was approximately 80 percent. The signs used were very similar to each other, so an accuracy of 80 percent was considered to be very high. This research also found that instance-based learning was better than C4.5 at the gesture recognition tasks tested.

2.1.1.5 Naïve Bayes

The Naive Bayes classification algorithm is based on Bayes' rule, which is used to compute the probabilities needed to make predictions. Naïve Bayes assumes that the input attributes are statistically independent. It analyses the relationship between each input attribute and the dependent attribute to derive a conditional probability for each relationship [11]. These conditional probabilities are then combined to classify new cases. An advantage of the Naïve Bayes algorithm over some other algorithms is that it requires only one pass through the training set to generate a classification model.

Naïve Bayes works very well when tested on many real world data sets [58], and can obtain results that are much better than those of more sophisticated algorithms. However, if a particular attribute value does not occur in the training set in conjunction with every class value, then Naïve Bayes may not perform very well. It can also perform poorly on some data sets where attributes are treated as though they were independent but are in reality correlated.
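The one-pass property can be seen in a small sketch. This hypothetical implementation handles categorical attributes only and adds Laplace smoothing to sidestep the zero-frequency problem mentioned above; the names, toy data and smoothing denominator (which assumes two possible values per attribute) are illustrative.

from collections import Counter

def train_naive_bayes(rows, labels):
    # A single pass over the training data collects the class counts
    # and the per-attribute value counts needed for Bayes' rule.
    class_counts = Counter(labels)
    value_counts = Counter()  # keyed by (attribute index, value, class)
    for row, cls in zip(rows, labels):
        for a, value in enumerate(row):
            value_counts[(a, value, cls)] += 1
    return class_counts, value_counts

def classify(row, class_counts, value_counts):
    total = sum(class_counts.values())
    best_cls, best_score = None, 0.0
    for cls, n in class_counts.items():
        score = n / total  # prior P(class)
        for a, value in enumerate(row):
            # Conditional P(value | class) with Laplace smoothing, so an
            # unseen attribute/class combination does not zero the score.
            score *= (value_counts[(a, value, cls)] + 1) / (n + 2)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
print(classify(("rainy", "hot"), *train_naive_bayes(rows, labels)))  # -> 'yes'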
2.1.1.6 Kernel Density

The Kernel Density algorithm works in a very similar fashion to Naïve Bayes. The main difference is that, unlike Naïve Bayes, Kernel Density does not assume that numeric attributes follow a normal distribution; instead, it tries to fit a combination of kernel functions. According to Beardah and Baxter [2], kernel density estimates are similar to histograms but provide a smoother representation of the data. Beardah and Baxter [2] illustrate some of the advantages of kernel density estimates for data presentation in archaeology. They show that kernel density estimates can be used as a basis for producing contour plots of archaeological data, which lead to a useful graphical representation of the data.

2.2 Unsupervised Learning

Unsupervised learning deals with finding clusters of records that are similar in some way. As discussed earlier, unsupervised learning does not require a target variable for analysis. According to Berry and Linoff [4], unsupervised learning is often useful when there are many competing patterns in the data, making it hard to spot any single pattern. Building clusters of similar records reduces the complexity within clusters, so that other data mining techniques are more likely to succeed. In unsupervised learning, the main concern is obtaining clusters in the data that have useful patterns.

2.2.1 Unsupervised Data Mining Algorithms Used in This Thesis

2.2.1.1 K-means clustering

In k-means clustering, the number of clusters (k) desired is specified first, and the algorithm then selects k cluster seeds (centers), which are located approximately uniformly in multi-dimensional space. Each observation is assigned to the nearest cluster seed to form temporary clusters. The cluster mean positions are then calculated and used as new cluster centers, and the observations are reallocated to clusters according to the new centers. This is repeated until no further change in the cluster centers occurs. The observations are assigned to clusters so that every observation belongs to at most one cluster [57].

According to Weiss and Indurkhya [56], not all the variables are equally important in determining the clusters. For each variable, an importance value between 0 and 1 is computed to represent the relative importance of that variable to the formation of the clusters. Variables that have the greatest contribution to the cluster profile have importance values closer to 1. A decision tree analysis can be used to calculate the relative importance values from a selected sample of the training data; the first split is the most important. It has been observed that variables having large variance tend to have more effect on the resulting clusters than variables with small variance. Some implementations of k-means clustering use these importance values in assigning cases to clusters [48].
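A minimal sketch of this loop, assuming scikit-learn rather than the implementation used for the experiments in this thesis; the toy 'performance file' values are invented for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the performance file: rows are data sets, columns are
# error rates for two hypothetical algorithms.
X = np.array([[5.0, 8.0], [6.0, 7.5], [40.0, 45.0], [42.0, 50.0], [20.0, 22.0]])

# k-means with k=2: seeds are placed, cases are assigned to the nearest
# seed, cluster means are recomputed, and the loop repeats until the
# assignments settle.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster membership of each data set
print(km.cluster_centers_)  # final cluster means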
2.2.1.2 Kohonen Vector Quantization

Kohonen Vector Quantization is a clustering method invented by Kohonen [48]. The algorithm is similar to the k-means clustering algorithm, but the original seeds, called code book vectors, are completely random. The algorithm finds the seed closest to each training case in multi-dimensional space and moves that "winning" seed closer to the training case. The seed is moved a certain proportion of the distance between it and the training case; the proportion is specified by the learning rate [48].
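The seed update just described can be sketched directly. The following is a hypothetical single-pass version on random toy data; real implementations typically make several passes and decay the learning rate.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))    # toy training cases
seeds = rng.normal(size=(3, 2))  # random initial code book vectors
learning_rate = 0.1

# For each training case, find the closest ("winning") seed and move it
# a fraction of the distance toward the case.
for x in X:
    winner = np.argmin(np.linalg.norm(seeds - x, axis=1))
    seeds[winner] += learning_rate * (x - seeds[winner])
print(seeds)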
2.2.1.3 Autoclass (Bayesian Classification System)

Autoclass is an unsupervised Bayesian classification system that infers classes based on Bayesian statistics [14]. It divides the problem into two parts: the calculation of the number of classes, and the estimation of the classification parameters. It uses the Expectation Maximization (EM) algorithm to estimate the parameter values that best fit the data for a given number of classes. EM is an approximation algorithm that finds a local maximum of the likelihood of the data under the model. By default, Autoclass fits a normal probability distribution for numeric data and a multinomial distribution for symbolic data. According to Cheeseman and Stutz [14], Autoclass can consider different underlying probability distribution types for the numeric attributes and is computationally intensive. Autoclass was developed at NASA and has been used for extracting useful information from databases [14], including information extracted from Infrared Astronomical Satellite (IRAS) data [21].

2.3 Related Work on Comparison of Classifiers

Lim and Loh [35] discuss the comparison of the prediction accuracy, complexity and training time of different classification algorithms. The paper presents the results of a comparison of twenty-two decision tree, nine statistical, and two neural network algorithms on thirty-two data sets in terms of classification accuracy, training time and (in the case of trees) number of leaves. Some of the twenty-two decision tree algorithms compared are CART, S-Plus tree, C4.5, FACT (Fast Classification Tree), QUEST, IND, OC1, LMDT, CAL5 and T1. The statistical algorithms compared include LDA (Linear Discriminant Analysis), QDA (Quadratic Discriminant Analysis), NN (Nearest Neighbor), LOG (Logistic Discriminant Analysis), FDA (Flexible Discriminant Analysis), PDA (Penalized LDA), MDA (Mixture Discriminant Analysis) and POL (the POLYCLASS algorithm). The neural network algorithms compared include LVQ (Learning Vector Quantization) and RBF (Radial Basis Function).

This paper revealed that the POLYCLASS algorithm, which provides estimates of conditional class probabilities, performed better than the other algorithms, although its accuracy was not statistically significantly different from that of twenty other algorithms. Another statistical algorithm, logistic regression, was second with respect to the accuracy criteria. The most accurate decision tree algorithm was QUEST with linear splits, which was ranked fourth. It was noted that although spline-based statistical algorithms tend to have good accuracy, they also require relatively long training times; POLYCLASS, for example, was third last in terms of median training time.

The research discovered that among decision tree algorithms with univariate splits, C4.5, IND-CART and QUEST had the best combinations of error rate and speed. It was also noted that C4.5 tends to produce trees with twice as many leaves as those from IND-CART and QUEST.

The main conclusion from this research was that the mean error rates of many algorithms are sufficiently similar that their differences are statistically insignificant, and probably also insignificant in practical terms. However, as will be discussed later, this thesis discovered that, using default settings, there were significant differences in error rates among the different algorithms used.

The STATLOG project [12] presented the results of an evaluation of the performance of machine learning, neural and statistical algorithms on large-scale, complex commercial and industrial problems. The overall aim was to give an objective assessment of the potential of classification algorithms for solving significant commercial and industrial problems. Some of the twenty-four algorithms compared in the STATLOG project are Alloc80, Ac2, BayTree, NewId, Dipol92, C4.5, Cart, Cal5, Kohonen, Bayes and Cascade. The data sets used for the STATLOG project are from the UCI repository. On test data, it was discovered that the algorithm 'Alloc80', followed by 'Ac2' and 'BayTree', performed better than the rest. 'Alloc80' and 'BayTree' are statistical classifier algorithms, whereas 'Ac2' is a decision tree algorithm.

Salzberg [46] cautions that care is required when comparing different algorithms, discussing the dangers to avoid and a recommended approach for comparing data mining algorithms. The main claims made by the paper are:

• Finding a good classification algorithm requires very careful thought about experimental design.
• If not done carefully, comparative studies of classification and other types of algorithms can easily result in statistically invalid conclusions. This is especially true when data mining techniques are used to analyze very large databases, which inevitably contain some statistically unlikely data.
• Comparative analysis is more important in evaluating some types of algorithms than others.

The key recommendations made by Salzberg [47] regarding the comparison of algorithms are:

• That data miners must be careful not to rely too heavily on stored repositories such as the UCI repository, because it is difficult to produce major new results using well studied and widely shared data.
• That data miners should follow a proper methodology that allows the designer of a new algorithm to establish the new algorithm's comparative merits.

Chapter 3. Data Generation

This chapter discusses the data generation phase of this thesis, which involved collecting data sets and applying each of 6 supervised algorithms to each of 59 data sets.

3.1 Collection of Data Sets

To achieve the goal of applying multiple data mining algorithms to multiple data sets, a search for data sets was necessary. Data sets were mainly obtained through the Internet, particularly from the UCI data set collection. Fifty-nine data sets were collected and used to perform the experiments. The number of attributes of the data sets used ranged from 3 to 76, while the number of observations ranged from 13 to 8124.

3.2 Selection of Data Mining Algorithms

Selecting appropriate types of data mining algorithms, to ensure that they could be run on all the data sets collected, was very important in order to minimize missing values in the file produced by the data generation phase. The 6 data mining algorithms chosen for this experiment were Rule Learner, OneR, Kernel Density, IBK, C4.5 and Naïve Bayes. These algorithms are described in detail in chapter 2.
3.3 Generating Data Once the data sets and algorithms to use for the experiment were chosen, the actual data generation for the experiment was conducted. This was done by running the 6 data mining algorithms on the 59 data sets. Default settings were used for all algorithms. For the purpose of testing, cross validation with 10 folds was used for all the algorithms. Once all runs were completed the results were stored in one file, which was later used in the clustering analysis. The percentage of incorrectly classified instances for each algorithm on each data set for both training and cross validation was stored in this file. Also contained in this file is the size and number of attributes of each data set. Table 2 shows the complete result of the data generation process. For example, it shows that for the ‘anneal’ data set, which has 38 attributes and 898 instances, IBK error rate on training data was 5.90 percent whereas on the test data it was 5.57 percent. The definition for each variable is shown on table 3. - 24 - Data_Name Num_Attr Num_Ins IBK_TRAIN Anneal 38 898 5.90 Audiology 70 226 Balance-scale 5 625 30.24 Breast-cancer 10 286 10.14 Breast-w 11 699 4.01 Colic 28 368 11.14 Credit-a 16 690 13.49 Credit-g Diabetes Glass Heart-c Heart-statlog Iris Kr-vs-kp Labor Segment Sick Sonar Soybean Autos Heart-h Hepatitis Lymph Mushroom Primary-tumor Splice vehicle vote vowel Waveform-5000 AutoPrice baskball bodyfat bolts BreastTumor Cleveland cloud cpu detroit EchoMonths elusage Fishcatch Gascons housing Hungarian longley 20 9 11 76 13 5 37 16 19 30 61 35 26 76 20 19 22 18 62 18 17 14 21 16 5 13 8 10 14 6 7 14 10 3 8 5 13 13 7 1000 768 214 303 270 150 3196 57 2310 3772 208 683 205 294 155 148 8124 339 3190 946 435 990 5000 159 96 252 40 286 303 108 209 13 130 55 158 27 506 294 16 27.60 23.05 1.87 7.59 6.67 0.00 52.22 0.00 9.61 6.31 2.88 50.80 3.41 12.59 3.26 0.00 18.19 31.27 48.11 28.61 5.06 59.29 31.44 0.63 0.00 11.51 0.00 26.22 11.55 0.00 1.44 0.00 12.31 1.82 0.00 0.00 40.71 12.59 0.00 IBK_TEST C45_TRAIN 5.57 0.22 8.85 20.64 9.28 27.97 24.13 4.72 1.57 22.28 14.13 17.83 9.28 31.20 32.94 29.91 23.76 25.19 0.04 31.57 10.52 15.58 6.12 14.42 11.42 24.89 20.75 0.20 17.57 48.20 59.88 44.80 33.92 5.75 37.98 47.48 29.56 14.58 46.03 0.45 84.97 35.97 56.48 9.09 23.07 69.23 21.82 22.15 0.00 51.98 20.75 0.00 14.50 15.63 3.73 7.92 8.52 0.02 0.34 12.28 1.08 0.34 1.92 3.66 4.89 15.97 7.74 6.76 0.00 38.64 3.67 3.07 2.76 2.12 2.50 9.43 10.42 1.19 2.50 44.40 14.52 12.96 3.83 7.69 22.31 12.72 3.80 0.00 10.67 15.99 0.00 C45_TEST RL_TRAIN RL_TEST NB_TRAIN NB_TEST OR_TRAIN OR_TEST KR_TRAIN KR_TEST 1.56 0.00 0.02 13.36 13.47 16.40 16.40 0.00 0.89 22.12 8.40 18.58 21.68 26.99 53.53 46.46 0.00 0.23 22.24 5.12 19.52 9.12 9.12 36.48 41.12 0.00 0.12 24.83 19.93 28.67 24.83 25.87 27.27 34.62 2.10 27.27 4.72 1.57 5.15 3.86 0.04 7.30 8.15 0.00 4.86 14.13 13.31 17.39 20.38 20.92 18.48 18.48 0.27 20.19 14.06 6.52 16.09 21.74 22.32 14.49 14.49 0.14 18.55 30.30 25.91 32.71 20.79 22.22 4.70 0.47 24.56 2.86 1.35 25.96 7.91 17.56 22.11 20.65 22.93 0.00 59.29 6.01 26.60 3.22 21.62 24.48 31.45 13.54 3.17 0.35 83.92 37.62 42.59 4.78 30.77 0.60 12.72 18.99 11.11 47.23 22.11 0.00 10.30 18.75 9.35 5.61 5.56 2.67 0.25 5.26 0.48 0.37 0.96 3.66 9.76 13.61 4.51 4.73 0.00 38.64 2.76 15.25 2.52 3.23 7.30 8.81 8.33 0.79 0.05 40.91 8.58 10.19 3.83 7.69 23.08 5.45 6.33 0.00 8.30 13.61 0.00 27.90 25.52 34.11 24.75 23.70 5.33 0.78 21.05 2.94 1.48 22.12 8.34 24.89 18.37 19.36 22.30 0.00 61.65 6.24 30.26 3.45 22.63 22.64 33.96 14.58 3.17 37.50 
87.06 33.99 40.74 5.26 30.77 57.69 0.20 19.62 11.11 47.43 19.05 0.00 22.80 23.70 44.39 15.84 14.81 0.04 11.67 1.75 19.78 7.03 26.92 6.30 31.22 14.97 14.19 12.84 4.11 43.95 3.04 53.78 9.66 28.79 19.78 26.42 8.33 0.25 0.15 68.18 25.08 27.78 10.05 0.00 53.08 12.72 24.05 0.00 63.44 14.97 0.00 25.10 24.22 51.40 15.51 14.81 0.04 12.36 5.26 19.70 7.26 34.13 7.17 41.95 15.65 16.13 16.21 4.23 51.92 4.45 44.44 9.66 38.48 20.04 32.08 13.54 29.37 0.40 82.87 29.04 36.11 11.48 15.38 66.92 12.72 27.22 7.40 66.40 14.65 0.00 25.80 23.83 38.32 23.43 23.70 4.67 31.66 15.79 29.87 33.30 27.73 43.93 26.07 27.03 5.33 33.57 26.32 36.28 24.04 59.15 31.22 17.69 15.48 24.32 1.48 70.50 0.00 39.01 4.37 56.57 38.90 30.19 10.42 1.19 32.50 69.93 34.65 40.74 10.04 15.38 62.31 9.09 16.46 22.22 45.85 17.69 0.00 36.08 60.47 37.07 18.71 0.20 25.68 1.48 73.16 75.58 47.28 4.37 68.59 46.26 37.11 10.42 1.59 47.50 76.22 39.93 51.85 11.48 53.85 63.85 14.55 20.89 38.32 52.77 18.70 0.00 0.00 0.00 8.41 0.00 0.00 1.33 0.00 0.00 0.00 0.00 0.00 0.17 0.00 0.00 1.29 0.00 29.70 28.12 28.97 24.09 24.81 0.04 3.22 10.53 2.64 3.53 14.90 8.93 22.93 21.43 20.65 17.57 12.39 60.47 0.00 0.23 0.00 31.80 7.13 1.01 3.77 2.08 0.00 0.00 3.85 0.00 0.93 0.48 0.00 13.08 5.45 9.49 7.41 0.00 0.00 0.00 30.82 14.58 0.00 0.45 82.52 34.98 58.33 9.09 23.08 66.15 0.20 22.78 7.41 44.27 21.43 0.00 - 25 - Data_Name Num_Attr Num_Ins IBK_TRAIN lowbwt 10 189 0.00 Mbagrade 3 61 6.56 meta 21 528 0.76 Pharynx 9 195 0.00 Pollution 16 60 0.00 PwLinear 11 200 0.00 quake 4 2178 56.38 Schlvote 6 37 0.00 servo 5 167 0.00 sleep 8 58 1.72 strike 6 625 12.80 veteran 8 137 0.00 Vineyard 4 52 0.00 IBK_TEST C45_TRAIN 19.05 14.28 22.95 13.11 0.76 0.76 68.21 54.87 28.33 0.05 61.50 18.50 63.64 38.89 27.03 16.22 22.75 20.36 6.90 6.90 15.84 12.16 48.18 21.90 1.92 1.92 C45_TEST RL_TRAIN RL_TEST NB_TRAIN NB_TEST OR_TRAIN OR_TEST KR_TRAIN KR_TEST 16.93 5.82 22.22 14.29 19.58 15.87 15.87 0.53 19.58 16.40 11.48 16.39 9.84 13.11 13.11 13.11 11.46 14.75 0.76 0.76 0.76 4.17 4.17 0.76 0.76 0.00 0.76 54.87 3.59 66.15 22.56 0.60 5.13 82.05 0.00 66.15 0.20 0.05 0.20 11.67 0.25 16.67 0.20 0.00 0.25 48.50 0.14 49.50 51.50 0.62 0.62 0.38 0.00 0.60 48.30 30.17 52.43 46.10 46.01 43.62 46.74 36.41 47.84 21.62 10.81 16.22 37.84 51.35 16.22 40.54 13.51 18.82 23.35 20.36 22.75 16.17 25.75 35.92 35.92 0.00 23.35 12.07 3.45 12.07 3.45 10.34 6.90 6.90 3.45 6.90 12.80 6.24 12.80 13.60 14.24 0.12 12.80 0.64 14.40 43.07 14.60 44.52 30.66 37.23 35.04 36.50 2.92 47.45 1.92 1.92 1.92 0.00 1.92 0.00 3.85 0.00 1.92 Table 2: Results obtained from applying 6 data mining algorithms to 59 data sets. Blanks indicate situations where algorithms gave no result. Note that the word ‘TRAIN’ in the table indicates the percentage of incorrectly classified training cases whereas the word ‘TEST’ indicates the percentage of incorrectly classified cases using cross validation. For example NB_TRAIN indicates percentage of incorrectly classified instances (error rate) for the Naïve Bayes algorithm on training data. The definition of each variable is shown below. 
Name        Definition
NB_TRAIN    Naive Bayes Training Error (%)
NB_TEST     Naive Bayes Testing Error (%)
C45_TRAIN   C4.5 Training Error (%)
C45_TEST    C4.5 Testing Error (%)
OR_TRAIN    OneR Training Error (%)
OR_TEST     OneR Testing Error (%)
RL_TEST     Rule Learner Testing Error (%)
RL_TRAIN    Rule Learner Training Error (%)
KR_TRAIN    Kernel Density Training Error (%)
KR_TEST     Kernel Density Testing Error (%)
IBK_TEST    IBK Testing Error (%)
IBK_TRAIN   IBK Training Error (%)
NUM_INS     Number of Instances
NUM_ATTR    Number of Attributes

Table 3: Definition of variables used.

Table 4 shows a summary of table 2. It provides the minimum, maximum, mean, standard deviation and percentage of missing values for each numeric variable. For example, it shows that for the Kernel Density algorithm on training data, the minimum error rate was 0 percent, while the maximum and mean were 36.41 and 2.53 percent respectively. It also shows that 5% of the values were missing for this algorithm on training data, indicating that no result was found for some data sets.

Table 4 also shows the overall performance of the 6 algorithms in classifying the 59 data sets. The table is sorted by the mean error rates of the algorithms for both train and test cases. The training results indicate that Kernel Density (KR_TRAIN), with an average error rate of 2.53 percent, followed by Rule Learner (RL_TRAIN) and C4.5 (C45_TRAIN), with average error rates of 8.06 and 10.47 percent respectively, had lower training errors than the other algorithms. More importantly, looking at the cross-validation results, Kernel Density (KR_TEST), with an average error rate of 19.88 percent, followed by C4.5 and Naïve Bayes, with average error rates of 20.16 and 21.51 percent respectively, performed better than the other algorithms.

Name       Mean     Min    Max     Std Dev.  Missing %
KR_TRAIN   2.53     0.00   36.41   5.89      5%
RL_TRAIN   8.06     0.00   40.91   8.79      0%
C45_TRAIN  10.47    0.00   54.87   11.39     0%
IBK_TRAIN  12.09    0.00   59.29   16.32     2%
NB_TRAIN   19.36    0.00   68.18   16.61     0%
OR_TRAIN   23.83    0.00   70.50   18.27     2%
KR_TEST    19.88    0.00   82.52   19.64     5%
C45_TEST   20.16    0.00   83.92   17.27     0%
NB_TEST    21.51    0.00   82.87   18.50     0%
RL_TEST    21.95    0.00   87.06   18.63     0%
IBK_TEST   26.66    0.00   84.97   20.24     2%
OR_TEST    30.49    0.00   82.05   22.12     2%
NUM_INS    740.47   13     8124    1389.80   0%
NUM_ATTR   18.40    3      76      17.66     0%

Table 4: Summary of the data gathered from running the 6 data mining algorithms on 59 data sets.
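The data generation run summarized above (every algorithm applied to every data set with 10-fold cross-validation, results collected in one file) can be sketched as follows. The sketch assumes Python with scikit-learn stand-ins for the WEKA implementations; the data sets, algorithms and file name are illustrative, not those used in the thesis.

import csv
from sklearn.datasets import load_breast_cancer, load_iris, load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-ins for the 59 data sets and 6 algorithms.
datasets = {"iris": load_iris(), "wine": load_wine(), "breast-w": load_breast_cancer()}
algorithms = {"C45": DecisionTreeClassifier(), "NB": GaussianNB(),
              "IBK": KNeighborsClassifier(n_neighbors=1)}

with open("performance.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Data_Name", "Num_Attr", "Num_Ins"] +
                    [f"{name}_TEST" for name in algorithms])
    for ds_name, ds in datasets.items():
        row = [ds_name, ds.data.shape[1], ds.data.shape[0]]
        for alg in algorithms.values():
            # 10-fold cross-validation error rate (%), as in the thesis runs.
            err = 100 * (1 - cross_val_score(alg, ds.data, ds.target, cv=10).mean())
            row.append(round(err, 2))
        writer.writerow(row)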
Chapter 4. Clustering and Pattern Analysis

To analyze the data generated by applying the 6 data mining algorithms to the 59 data sets (table 2), it is necessary to run unsupervised learning algorithms. The 3 algorithms used for this experiment are k-means clustering using least squares, Kohonen Vector Quantization, and Autoclass Bayesian analysis. These algorithms are described in section 2.2.1. The results of the unsupervised clustering analyses are discussed in the next four sections, followed by a summary and comparison of the results.

4.1 Results from k-means clustering

Table 5 shows the ranking of variables resulting from the application of the k-means algorithm to the generated data (table 2). A value of 5 was used for the maximum number of clusters. This value was chosen because (1) more than 5 clusters in a data set of 59 cases are unlikely to be useful, and (2) preliminary runs of the algorithm suggested there were 3 to 5 clusters. As shown in table 5, only 'number of instances' is significant in determining the clusters. This model gives five clusters with 52, 3, 2, 1 and 1 observations. The table shows the name, importance, measurement type and label of each variable. For example, it indicates that the NUM_INS (number of instances) variable has an importance level of 1 and is a numeric interval variable. Numeric variables containing values that vary across a continuous range are shown as interval variables.

NAME       IMPORTANCE  MEASUREMENT  TYPE  LABEL
NUM_INS    1           interval     Num   Number of Instances
KR_TEST    0           interval     Num   Kernel Density Test
KR_TRAIN   0           interval     Num   Kernel Density Train
OR_TEST    0           interval     Num   OneR Test
OR_TRAIN   0           interval     Num   OneR Train
NB_TEST    0           interval     Num   Naïve Bayes Test
NB_TRAIN   0           interval     Num   Naïve Bayes Train
RL_TEST    0           interval     Num   Rule Learner Test
RL_TRAIN   0           interval     Num   Rule Learner Train
C45_TEST   0           interval     Num   C4.5 Test
C45_TRAIN  0           interval     Num   C4.5 Train
IBK_TEST   0           interval     Num   IBK Test
IBK_TRAIN  0           interval     Num   IBK Train
NUM_ATTR   0           interval     Num   Number of Attributes

Table 5: Importance level of variables in determining clusters.

Cluster  Frequency   Root-Mean-Square  Maximum Distance   Nearest  Distance To      Num. Of     Num. Of
         Of Cluster  Std. Deviation    From Cluster Seed  Cluster  Nearest Cluster  Attributes  Instances  IBK_TRAIN
1        1           .                 0                  3        1614.82          21          5000       31.44
2        1           .                 0                  1        3124.84          22          8124       18.19
3        3           90.28             388.32             5        1144.20          43          3386       35.54
4        52          74.75             694.33             5        1938.36          17.13       306.11     9.453
5        2           34.85             92.22              3        1144.20          11.5        2244       32.99

Table 6: Properties of the 5 clusters. Average values were used for the last three columns.

Table 6 shows the properties of the 5 clusters formed using k-means analysis. For example, it shows that the one data set in cluster 1 has 5000 instances, while the one data set in cluster 2 has 8124. These numbers are much higher than for the population, which has a mean number of instances of 740.47 (table 4). This indicates that perhaps these two data sets should be clustered together. As can be seen from the table, most of the data sets were grouped into cluster 4, with an average number of instances of 306.11. Generally, it can be seen that the data sets were clustered, based on the number of instances, as small, medium or large. To be able to see the significance of the other variables, all further runs were carried out with this variable excluded.
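This dominance is a consequence of scale: the number of instances has a far larger variance than the error rate variables (standard deviation 1389.80 against roughly 6 to 22 in table 4), and, as noted in section 2.2.1.1, variables with large variance tend to dominate the resulting clusters. A sketch of the effect with invented data, assuming scikit-learn:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy performance file: column 0 is a number of instances (13..8124),
# column 1 is an error rate (0..60). The instance counts dwarf the
# error rates numerically.
num_ins = rng.integers(13, 8124, size=59)
error = rng.uniform(0, 60, size=59)
X = np.column_stack([num_ins, error])

with_ins = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
without_ins = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X[:, 1:])
print(with_ins.labels_)     # clusters follow the instance counts
print(without_ins.labels_)  # clusters follow the error rates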
4.2 Results from k-means clustering (without Number of Instances)

When the number of instances variable was excluded from the k-means analysis, 3 other variables emerged as significant. As can be seen from table 7, NB_TRAIN was most significant with an importance level of 1, followed by KR_TEST and C45_TRAIN with 0.885 and 0.336 respectively. The rest of the variables have no significance in determining the clusters.

Name       Importance  Measurement  Type  Label
NB_TRAIN   1           interval     num   Naive Bayes Train
KR_TEST    0.885       interval     num   Kernel Density Test
C45_TRAIN  0.336       interval     num   C45 Train
OR_TRAIN   0           interval     num   OneR Train
NB_TEST    0           interval     num   Naive Bayes Test
KR_TRAIN   0           interval     num   Kernel Density Train
RL_TEST    0           interval     num   Rule Learner Test
RL_TRAIN   0           interval     num   Rule Learner Train
C45_TEST   0           interval     num   C45 Test
OR_TEST    0           interval     num   OneR Test
IBK_TEST   0           interval     num   IBK Test
IBK_TRAIN  0           interval     num   IBK Train
NUM_ATTR   0           interval     num   Number of Attributes

Table 7: Importance of variables in determining clusters.

Table 8 shows the properties of the 5 clusters formed using k-means clustering analysis. For example, the table shows that for the significant variables, cluster 1 has mean error rates lower than for the population: NB_TRAIN, KR_TEST and C45_TRAIN have mean error rates of 9.34, 9.85 and 5.66 for cluster 1. Comparing these with the population error rates of 19.36, 19.88 and 10.47 respectively (table 4, page 27), we discover that they are all lower. It can also be seen from the table that cluster 4, which has about average error rates, contains 12 data sets. Cluster 2, which like cluster 1 has lower than average error rates, has 11 data sets; data sets from these two clusters (1 and 2) could possibly be put into one cluster. Also, clusters 3 and 5 behave similarly, having error rates much greater than for the population, and the data sets from these two clusters could possibly be put into one cluster.

Cluster  Frequency   RMS Std  Max Dist   Nearest  Dist to       Number of   Number of
         of Cluster  Dev      from Seed  Cluster  Nearest Seed  Attributes  Instances
1        29          8.44     52.99      2        56.06         13.48       717.10
4        12          9.91     70.46      2        51.25         12.83       368.50
2        11          13.92    67.83      4        51.25         42.81       97.28
3        5           15.33    64.80      5        77.29         11          687.80
5        2           15.53    39.60      4        69.01         7.5         151.50

Cluster  IBK_TRAIN  IBK_TEST  C45_TRAIN  C45_TEST  RL_TRAIN  RL_TEST  NB_TRAIN  NB_TEST  OR_TRAIN  OR_TEST  KR_TRAIN  KR_TEST
1        4.58       14.14     5.66       9.82      3.98      10.86    9.34      11.39    12.86     14.88    1.613     9.85
4        8.90       33.81     13.90      30.29     12.22     31.02    32.36     32.42    28.17     34.52    2.559     27.06
2        27.93      27.50     5.81       18.58     5.32      17.70    14.37     18.11    32.43     46.06    0.475     12.43
3        33.37      65.94     30.98      47.86     28.22     61.25    54.95     62.82    58.44     62.54    13.146    60.25
5        0          62.34     33.91      48.73     6.89      53.44    25.17     18.35    22.93     66.95    0.465     62.24

Table 8: General properties of the clusters from k-means analysis.

The general properties of all the clusters, with the values of the significant variables, are shown in table 9. For example, the table shows that cluster 1 has mean error rates of 9.34, 9.85 and 5.66 for NB_TRAIN, KR_TEST and C45_TRAIN respectively. All of these results are lower than the average error rates (table 4, page 27).

Cluster  Frequency   NB_TRAIN  KR_TEST  C45_TRAIN  Cluster Properties (compared to
         of Cluster  (19.36)   (19.88)  (10.47)    error rates for the population)
1        29          9.34      9.85     5.66       Lower than average error rates
4        12          32.36     27.06    13.90      About average error rates
2        11          14.37     12.43    5.81       Lower than average error rates
3        5           54.95     60.25    30.98      Higher than average error rates
5        2           25.17     62.24    33.91      Higher than average error rates

Table 9: General properties of clusters and significant variables. Values in brackets show average values for the population.
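The labels in table 9 follow mechanically from comparing each cluster's mean error rates on the significant variables with the population means of table 4. The sketch below shows that comparison, assuming a data frame perf and an array labels of cluster assignments as in the earlier sketches; the 5-percentage-point band used to separate 'about average' from 'lower' and 'higher' is an arbitrary illustrative threshold, not one taken from the analysis.

import pandas as pd

def label_clusters(perf, labels, variables):
    # Mean error rate of each cluster on the significant variables only.
    means = perf[variables].groupby(labels).mean()
    population = perf[variables].mean()
    # Average deviation of the cluster means from the population means.
    diff = (means - population).mean(axis=1)
    return diff.apply(lambda d: "lower than average" if d < -5
                      else "higher than average" if d > 5
                      else "about average")

# e.g. label_clusters(perf, labels, ["NB_TRAIN", "KR_TEST", "C45_TRAIN"])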
The complete list of the error rates for the 3 significant variables for the 59 data sets, with their corresponding cluster number, is shown in table 10. For example, the table shows that the 'Anneal' data set, with 38 attributes and 898 instances, is in cluster 1 and has error rates of 13.47, 0.89 and 0.22 for the three variables shown (NB_TEST, KR_TEST and C45_TRAIN). From the table, it is also possible to see that the error rates of the significant variables in cluster 3 and cluster 5 are very high. This indicates that for the data sets in these clusters ('primary-tumor', 'breasttumor', 'echomonths', 'housing', 'quake', 'cloud' and 'pharynx'), the three algorithms Naïve Bayes, Kernel Density and C4.5 did not perform well. A closer investigation of the data sets may reveal some similarities among these data sets but, because of the scope of this thesis, such an investigation was not done here.

DATA_NAME      NUM_ATTR  NUM_INS  NB_TEST  KR_TEST  C45_TRAIN  CLUSTER
Anneal         38        898      13.47    0.89     0.22       1
Breast-W       11        699      0.04     4.86     1.57       1
Colic          28        368      20.92    20.19    14.13      1
Credit-A       16        690      22.32    18.55    9.28       1
Heart-Statlog  13        270      14.81    24.81    8.52       1
Iris           5         150      0.04     0.04     0.02       1
Labor          16        57       5.36     10.53    12.28      1
Segment        19        2310     19.70    2.64     1.08       1
Sick           30        3772     7.26     3.53     0.34       1
Hepatitis      20        155      16.13    20.65    7.74       1
Lymph          19        148      16.21    17.57    6.76       1
Mushroom       22        8124     4.23     19.88    0.00       1
Vote           17        435      9.66     7.13     2.76       1
Baskball       5         96       13.54    14.58    10.42      1
Bodyfat        13        252      29.37    0.00     1.19       1
Bolts          8         40       0.40     0.45     2.50       1
Cpu            7         209      11.48    9.09     3.83       1
Elusage        3         55       12.72    0.20     12.72      1
Fishcatch      8         158      27.22    22.78    3.80       1
Gascons        5         27       7.40     7.41     0.00       1
Hungarian      13        294      14.65    21.43    15.99      1
Longley        7         16       0.00     0.00     0.00       1
Lowbwt         10        189      19.58    19.58    14.28      1
Mbagrade       3         61       13.11    14.75    13.11      1
Meta           21        528      4.17     0.76     0.76       1
Pollution      16        60       0.25     0.25     0.05       1
Sleep          8         58       10.34    6.90     6.90       1
Strike         6         625      14.24    14.40    12.16      1
Vineyard       4         52       1.92     1.92     1.92       1
Audiology      70        226      26.99    0.23     8.85       2
Balance-Scale  5         625      9.12     0.12     9.28       2
Heart-C        76        303      15.51    24.09    7.92       2
Kr-Vs-Kp       37        3196     12.36    3.22     0.34       2
Sonar          61        208      34.13    14.90    1.92       2
Soybean        35        683      7.17     8.93     3.66       2
Heart-H        76        294      15.65    21.43    15.97      2
Splice         62        3190     4.45     19.88    3.67       2
Vowel          14        990      38.48    1.01     2.12       2
Detroit        14        13       15.38    23.08    7.69       2
Primary-Tumor  18        339      51.92    60.47    38.64      3
Breasttumor    10        286      82.87    82.52    44.40      3
Echomonths     10        130      66.92    66.15    22.31      3
Housing        13        506      66.40    44.27    10.67      3
Quake          4         2178     46.01    47.84    38.89      3
Breast-Cancer  10        286      25.87    27.27    24.13      4
Credit-G       20        1000     25.10    29.70    14.50      4
Diabetes       9         768      24.22    28.12    15.63      4
Glass          11        214      51.40    28.97    3.73       4
Autos          26        205      41.95    22.93    4.89       4
Vehicle        18        946      44.44    31.80    3.07       4
Autoprice      16        159      32.08    30.82    9.43       4
Cleveland      14        303      29.04    34.98    14.52      4
Pwlinear       11        200      0.62     0.60     18.50      4
Schlvote       6         37       51.35    18.82    16.22      4
Servo          5         167      25.75    23.35    20.36      4
Veteran        8         137      37.23    47.45    21.90      4
Cloud          6         108      36.11    58.33    12.96      5
Pharynx        9         195      0.60     66.15    54.87      5

Table 10: Complete list of data sets in each cluster and the values of the significant clustering variables.

4.3 Results from Kohonen Vector Quantization Clustering

The second unsupervised algorithm used to analyze the data generated is Kohonen Vector Quantization. Like k-means, Kohonen Vector Quantization analysis also determines the significant variables for grouping the data sets together. The maximum number of clusters for this algorithm was also set to 5. The Kohonen Vector Quantization analysis revealed that, in clustering the data sets, the error rates of the variable KR_TEST have the highest significance of 1, followed by IBK_TRAIN, Number of Attributes (NUM_ATTR), RL_TRAIN and C45_TRAIN with 0.91, 0.64, 0.46 and 0.45 respectively. The rest of the variables have no significance in determining the clusters.
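Kohonen Vector Quantization was likewise run in SAS (the settings are reproduced in Appendix D). The core of the method is a competitive-learning update in which the seed nearest to each randomly drawn case is nudged toward that case under a decaying learning rate. The following NumPy sketch illustrates that update rule with the learning-rate schedule shown in Appendix D; it is an illustration of the idea, not the SAS implementation.

import numpy as np

def kohonen_vq(X, n_clusters=5, steps=10000,
               lr_initial=0.5, lr_final=0.02, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the cluster seeds with randomly chosen cases.
    seeds = X[rng.choice(len(X), n_clusters, replace=False)].astype(float)
    for step in range(steps):
        x = X[rng.integers(len(X))]
        # Learning rate decays from lr_initial to lr_final.
        lr = lr_initial + (lr_final - lr_initial) * step / steps
        # Move the nearest seed a fraction lr of the way toward the case.
        nearest = np.argmin(((seeds - x) ** 2).sum(axis=1))
        seeds[nearest] += lr * (x - seeds[nearest])
    # Final assignment: each case goes to its nearest seed.
    return np.argmin(((X[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2),
                     axis=1)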
Name       Importance  Measurement  Type  Label
KR_TEST    1           interval     num   Kernel Density Test
IBK_TRAIN  0.91        interval     num   IBK Train
NUM_ATTR   0.64        interval     num   Number of Attributes
RL_TRAIN   0.46        interval     num   Rule Learner Train
C45_TRAIN  0.45        interval     num   C45 Train
NB_TRAIN   0           interval     num   Naive Bayes Train
RL_TEST    0           interval     num   Rule Learner Test
OR_TRAIN   0           interval     num   OneR Train
C45_TEST   0           interval     num   C45 Test
NB_TEST    0           interval     num   Naive Bayes Test
IBK_TEST   0           interval     num   IBK Test
KR_TRAIN   0           interval     num   Kernel Density Train
OR_TEST    0           interval     num   OneR Test

Table 11: Importance of variables in determining clusters in Kohonen Vector Quantization.

The properties of the 5 clusters obtained from the Kohonen Vector Quantization analysis are shown in table 12. For example, the table shows that cluster 5 has a mean KR_TEST error rate of 9.32, which is lower than the average value for the population (table 4, page 27), and a mean IBK_TEST error rate of 13.75, which is also lower than the population average of 26.66. The values of both RL_TRAIN and C45_TRAIN, 3.93 and 5.56, are also lower than the average values for the population, 8.06 and 10.47 (table 4, page 27).

Cluster  Freq. of  RMS Std  Max Dist   Nearest  Dist to       Number of
         Cluster   Dev      from Seed  Cluster  Nearest Seed  Attributes
5        28        8.42     53.64      4        60.47         13.5
4        16        12.04    91.88      5        60.47         12.25
1        6         12.45    59.41      4        62.70         29
3        5         15.33    65.42      4        86.98         11
2        4         7.88     33.09      1        63.87         70.75

Cluster  IBK_TRAIN  IBK_TEST  C45_TRAIN  C45_TEST  RL_TRAIN  RL_TEST  NB_TRAIN  NB_TEST  OR_TRAIN  OR_TEST  KR_TRAIN  KR_TEST
5        4.50       13.75     5.56       9.37      3.93      10.40    9.15      11.27    12.48     14.45    1.67      9.32
4        7.09       36.17     15.68      32.12     10.85     33.35    28.34     28.49    26.44     39.31    1.97      31.07
1        45.35      32.31     3.59       13.78     3.72      13.35    13.11     15.27    37.12     54.26    0.87      8.84
3        33.37      65.94     30.98      47.86     28.22     61.25    54.95     62.82    58.44     62.54    13.14     60.25
2        8.78       21.39     8.66       22.74     7.145     20.95    19.85     23.07    29.672    31.83    0         15.16

Table 12: General properties of the clusters from Kohonen Vector Quantization analysis.

The general properties of all the clusters, with the values of the significant variables, are shown in table 13. For example, the table shows that cluster 5 has mean error rates of 9.32, 13.75, 3.93 and 5.56 for KR_TEST, IBK_TEST, RL_TRAIN and C45_TRAIN respectively. All of these results are lower than the average error rates (table 4, page 27). Data sets in clusters 5, 1 and 2 could probably be put into one cluster as they all behave similarly, having lower than average error rates.

Cluster  Freq. of  KR_TEST  IBK_TEST  Number of   RL_TRAIN  C45_TRAIN  Cluster Properties (compared to
         Cluster   (19.88)  (26.66)   Attributes  (8.06)    (10.47)    error rates for the population)
                                      (18.40)
5        28        9.32     13.75     13.5        3.93      5.56       Lower than average error rates
4        16        31.07    36.17     12.25       10.85     15.68      Higher than average error rates
1        6         8.84     32.31     29          3.72      3.59       Lower than average error rates
3        5         60.25    65.94     11          28.22     30.98      Much higher than average error rates
2        4         15.16    21.39     70.75       7.14      8.66       Lower than average error rates

Table 13: General properties of clusters and mean error rates of the significant variables. Numbers in brackets show average values for the population.
Table 14 shows which data sets are clustered together. Also shown is the complete list of the error rates of the significant variables for the 59 data sets, with their corresponding cluster numbers. For example, the table shows that the 'Balance-scale' data set, with 5 attributes and 625 instances, is grouped in cluster 1 and has error rates of 9.28, 0.12, 20.64 and 5.12 for the four error-rate variables shown from the Kohonen Vector Quantization analysis (C45_TRAIN, KR_TEST, IBK_TEST, RL_TRAIN). A closer investigation of the data sets may reveal some similarities among these data sets but, because of the scope of this thesis, such an investigation was not carried out.

DATA_NAME      NUM_ATTR  NUM_INS  C45_TRAIN  KR_TEST  IBK_TEST  RL_TRAIN  CLUSTER
Balance-Scale  5         625      9.28       0.12     20.64     5.12      1
Kr-Vs-Kp       37        3196     0.34       3.22     31.57     0.25      1
Soybean        35        683      3.66       8.93     11.42     3.66      1
Splice         62        3190     3.67       19.88    44.80     2.76      1
Vowel          14        990      2.12       1.01     37.98     3.23      1
Waveform-5000  21        5000     2.50       19.88    47.48     7.30      1
Audiology      70        226      8.85       0.23     26.66     8.40      2
Heart-C        76        303      7.92       24.09    23.76     5.61      2
Sonar          61        208      1.92       14.90    14.42     0.96      2
Heart-H        76        294      15.97      21.43    20.75     13.61     2
Primary-Tumor  18        339      38.64      60.47    59.88     38.64     3
Breasttumor    10        286      44.40      82.52    84.97     40.91     3
Echomonths     10        130      22.31      66.15    69.23     23.08     3
Housing        13        506      10.67      44.27    51.98     8.30      3
Quake          4         2178     38.89      47.84    63.64     30.17     3
Breast-Cancer  10        286      24.13      27.27    27.97     19.93     4
Credit-G       20        1000     14.50      29.70    31.20     10.30     4
Diabetes       9         768      15.63      28.12    32.94     18.75     4
Glass          11        214      3.73       28.97    29.91     9.35      4
Heart-Statlog  13        270      8.52       24.81    25.19     5.56      4
Autos          26        205      4.89       22.93    24.89     9.76      4
Vehicle        18        946      3.07       31.80    33.92     15.25     4
Autoprice      16        159      9.43       30.82    29.56     8.81      4
Cleveland      14        303      14.52      34.98    35.97     8.58      4
Cloud          6         108      12.96      58.33    56.48     10.19     4
Detroit        14        13       7.69       23.08    23.07     7.69      4
Pharynx        9         195      54.87      66.15    68.21     3.59      4
Pwlinear       11        200      18.50      0.60     61.50     0.14      4
Schlvote       6         37       16.22      18.82    27.03     10.81     4
Servo          5         167      20.36      23.35    22.75     20.36     4
Veteran        8         137      21.90      47.45    48.18     14.60     4
Anneal         38        898      0.22       0.89     5.57      0.00      5
Breast-W       11        699      1.57       4.86     4.72      1.57      5
Colic          28        368      14.13      20.19    22.28     13.31     5
Credit-A       16        690      9.28       18.55    17.83     6.52      5
Iris           5         150      0.02       0.04     0.04      2.67      5
Labor          16        57       12.28      10.53    10.52     5.26      5
Segment        19        2310     1.08       2.64     15.58     0.48      5
Sick           30        3772     0.34       3.53     6.12      0.37      5
Hepatitis      20        155      7.74       20.65    0.20      4.51      5
Lymph          19        148      6.76       17.57    17.57     4.73      5
Mushroom       22        8124     0.00       19.88    48.20     0.00      5
Vote           17        435      2.76       7.13     5.75      2.52      5
Baskball       5         96       10.42      14.58    14.58     8.33      5
Bodyfat        13        252      1.19       0.00     46.03     0.79      5
Bolts          8         40       2.50       0.45     0.45      0.05      5
Cpu            7         209      3.83       9.09     9.09      3.83      5
Elusage        3         55       12.72      0.20     21.82     5.45      5
Fishcatch      8         158      3.80       22.78    22.15     6.33      5
Gascons        5         27       0.00       7.41     0.00      0.00      5
Hungarian      13        294      15.99      21.43    20.75     13.61     5
Longley        7         16       0.00       0.00     0.00      0.00      5
Lowbwt         10        189      14.28      19.58    19.05     5.82      5
Mbagrade       3         61       13.11      14.75    22.95     11.48     5
Meta           21        528      0.76       0.76     0.76      0.76      5
Pollution      16        60       0.05       0.25     28.33     0.05      5
Sleep          8         58       6.90       6.90     6.90      3.45      5
Strike         6         625      12.16      14.40    15.84     6.24      5
Vineyard       4         52       1.92       1.92     .         .         5

Table 14: Complete list of data sets in each cluster and the values of the significant clustering variables. A dot marks a missing value.

4.4 Results from Autoclass (Bayesian Classification System) Analysis

The relative influence of each attribute in differentiating the classes from the overall data set is shown in table 15. Autoclass discovered that NUM_ATTR, KR_TRAIN, and C45_TEST have the highest influence.

Description  Importance
NUM_ATTR     1.00
KR_TRAIN     0.87
C45_TEST     0.64
RL_TEST      0.64
KR_TEST      0.55
C45_TRAIN    0.49
RL_TRAIN     0.40
NB_TEST      0.34
IBK_TEST     0.33
OR_TEST      0.29
NB_TRAIN     0.28
OR_TRAIN     0.27
IBK_TRAIN    0.14

Table 15: Significance level of variables in Autoclass clustering.
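Autoclass fits a Bayesian finite mixture model and searches over the number of classes itself. A rough modern stand-in for numeric data is a Gaussian mixture fitted by EM, with the number of components chosen by an information criterion. The sketch below uses scikit-learn's GaussianMixture and BIC; this is a simplification of Autoclass's fully Bayesian search, and perf.csv is again a hypothetical file name.

import pandas as pd
from sklearn.mixture import GaussianMixture

data = pd.read_csv("perf.csv", index_col="DATA_NAME").drop(columns=["NUM_INS"])

# Fit mixtures with 1 to 5 components and keep the one with the lowest BIC,
# a crude analogue of Autoclass's search over the number of classes.
models = [GaussianMixture(n_components=k, random_state=0).fit(data)
          for k in range(1, 6)]
best = min(models, key=lambda m: m.bic(data))
classes = best.predict(data)
print(best.n_components, pd.Series(classes).value_counts())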
It was discovered that Autoclass produced 4 clusters with frequencies of 19, 17, 13 and 10, as shown in table 16. Unlike k-means and Kohonen Vector Quantization clustering, Autoclass identifies which of the variables are probably significant for each cluster. The full result from this algorithm is shown in Appendix E, page 49, and a summary of the results is shown in table 16. According to Cheeseman and Stutz [13], the significant variables for each class can be obtained with the following heuristic: first, 20% of the highest influence value in the class is calculated; then this value is used as a threshold to determine which variables are significant for that class. For example, for class 1, the number of attributes has an influence value of 2.5, as can be seen from Appendix E. Twenty percent of 2.5 is 0.5, and therefore only variables with an influence value greater than 0.5 are probably significant for this cluster; the number of attributes itself, with an influence value of 2.5, is significant because 2.5 is greater than 0.5. As shown in table 16, careful analysis of the complete output from this clustering algorithm shows that cluster 3 has all variables except IBK_TEST and IBK_TRAIN as significant, whereas cluster 2 has only KR_TRAIN as a significant variable.

Class  Freq. of  Number of   IBK_TRAIN  IBK_TEST  C45_TRAIN  C45_TEST  RL_TRAIN  RL_TEST
       Cluster   Attributes  (12.09)    (26.66)   (10.47)    (20.16)   (8.06)    (21.95)
1      19        11.6        4.26       19.5      .          18.6      .         19.3
2      17        .           .          .         .          .         .         .
3      13        17.8        .          .         1.23       2.27      1.23      2.48
4      10        10.0        .          59.1      26.2       51.8      18.9      53.8

Class  NB_TRAIN  NB_TEST  OR_TRAIN  OR_TEST  KR_TRAIN  KR_TEST  Properties (compared to population)
       (19.36)   (21.51)  (23.83)   (30.49)  (2.53)    (19.88)
1      .         .        17.2      21.9     .         18.7     Lower than average error rates
2      .         .        .         .        0.011     .        Much lower than average error rates
3      8.7       9.39     9.09      10.4     0.17      3.18     Lower than average error rates
4      44.8      55.8     46.8      56.2     7.63      55.9     Greater than average error rates

Table 16: Properties of clusters from Autoclass analysis. Values in brackets show average values for the population; a dot marks a variable that is not significant for that class.
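Applied to the influence values reproduced in Appendix E, the Cheeseman and Stutz heuristic described above comes down to a few lines of code. The influence values below are transcribed from the class 1 listing.

# Influence values (I-jk) for class 1, from Appendix E.
influence = {"NUM_ATTR": 2.503, "IBK_TRAIN": 0.853, "OR_TRAIN": 0.799,
             "RL_TEST": 0.776, "KR_TEST": 0.771, "C45_TEST": 0.761,
             "IBK_TEST": 0.662, "OR_TEST": 0.517, "C45_TRAIN": 0.383,
             "NB_TEST": 0.316, "RL_TRAIN": 0.312, "NB_TRAIN": 0.295,
             "KR_TRAIN": 0.152}

# Threshold: 20 percent of the largest influence value in the class.
threshold = 0.2 * max(influence.values())
significant = [name for name, value in influence.items() if value > threshold]
print(threshold, significant)  # 0.5006 and the eight variables above it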
The complete list of the 59 data sets with their corresponding cluster number is provided in table 17.

DATA_NAME      NUM_ATTR  KR_TRAIN  C45_TEST  RL_TEST  KR_TEST  CLASS
Breast-Cancer  10        2.1       24.83     28.67    27.27    1
Colic          28        0.27      14.13     17.39    20.19    1
Credit-A       16        0.14      14.06     16.09    18.55    1
Heart-Statlog  13        0         22.22     23.7     24.81    1
Labor          16        0         24.56     21.05    10.53    1
Hepatitis      20        1.29      20.65     19.36    20.65    1
Lymph          19        0         22.93     22.3     17.57    1
Autoprice      16        3.77      31.45     33.96    30.82    1
Baskball       5         2.08      13.54     14.58    14.58    1
Elusage        3         5.45      12.72     0.2      0.2      1
Fishcatch      8         9.49      18.99     19.62    22.78    1
Gascons        5         7.41      11.11     11.11    7.41     1
Hungarian      13        0         22.11     19.05    21.43    1
Lowbwt         10        0.53      16.93     22.22    19.58    1
Mbagrade       3         11.46     16.4      16.39    14.75    1
Pollution      16        0         0.2       0.2      0.25     1
Schlvote       6         13.51     21.62     16.22    18.82    1
Sleep          8         3.45      12.07     12.07    6.9      1
Strike         6         0.64      12.8      12.8     14.4     1
Audiology      70        0         22.12     18.58    0.23     2
Balance-Scale  5         0         22.24     19.52    0.12     2
Credit-G       20        0         30.3      27.9     29.7     2
Diabetes       9         0         25.91     25.52    28.12    2
Heart-C        76        0         20.79     24.75    24.09    2
Sonar          61        0         25.96     22.12    14.9     2
Soybean        35        0.17      7.91      8.34     8.93     2
Autos          26        0         17.56     24.89    22.93    2
Heart-H        76        0         22.11     18.37    21.43    2
Splice         62        0         6.01      6.24     .        2
Vehicle        18        0         26.6      30.26    31.8     2
Vowel          14        0         21.62     22.63    1.01     2
Waveform-5000  21        .         24.48     22.64    .        2
Bolts          8         0         0.35      37.5     0.45     2
Cleveland      14        0         37.62     33.99    34.98    2
Detroit        14        0         30.77     30.77    23.08    2
Servo          5         0         23.35     22.75    23.35    2
Anneal         38        1.33      1.56      0.02     0.89     3
Breast-W       11        0         4.72      5.15     4.86     3
Iris           5         0         4.7       5.33     0.04     3
Kr-Vs-Kp       37        0         0.47      0.78     3.22     3
Segment        19        0.23      2.86      2.94     2.64     3
Sick           30        0         1.35      1.48     3.53     3
Mushroom       22        0.48      0         0        .        3
Vote           17        0         3.22      3.45     7.13     3
Bodyfat        13        0         3.17      3.17     0        3
Cpu            7         0         4.78      5.26     9.09     3
Longley        7         8.41      0         0        0        3
Meta           21        12.39     0.76      0.76     0.76     3
Vineyard       4         3.85      1.92      1.92     1.92     3
Glass          11        0.93      32.71     34.11    28.97    4
Primary-Tumor  18        13.08     59.29     61.65    60.47    4
Breasttumor    10        0         83.92     87.06    82.52    4
Cloud          6         0         42.59     40.74    58.33    4
Echomonths     10        0         0.6       57.69    66.15    4
Housing        13        36.41     47.23     47.43    44.27    4
Pharynx        9         2.92      54.87     66.15    66.15    4
Pwlinear       11        .         48.5      49.5     0.6      4
Quake          4         .         48.3      52.43    47.84    4
Veteran        8         .         43.07     44.52    47.45    4

Table 17: Complete list of data sets in each cluster and the values of the significant clustering variables. A dot marks a missing value.

4.5 Comparison

4.5.1 Comparison of significant variables

Table 18 shows the significant variables from each of the clustering algorithms. As can be seen from the table, KR_TEST and C45_TRAIN appear as significant variables in both the k-means analysis and the Kohonen Vector Quantization analysis. KR_TEST also appears as one of the top 5 significant variables for Autoclass. Five clusters were obtained from the Kohonen Vector Quantization and k-means analyses and four clusters were obtained from Autoclass.

CATEGORY                  K-MEANS    KOHONEN VECTOR QUANTIZATION  AUTOCLASS
Significant variables     NB_TRAIN   KR_TEST                      All. The top 5 are:
(in order of importance)  KR_TEST    IBK_TRAIN                    NUM_ATTR
                          C45_TRAIN  NUM_ATTR                     KR_TRAIN
                                     RL_TRAIN                     C45_TEST
                                     C45_TRAIN                    RL_TEST
                                                                  KR_TEST
Number of clusters        5          5                            4

Table 18: Comparison of significant variables found by the three clustering algorithms.
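Whether the three algorithms group the data sets in the same way can be checked mechanically by cross-tabulating their cluster assignments; the data sets that share a cell in every pairwise cross-tabulation are exactly those reported in table 20 of the next section. A sketch, assuming three hypothetical pandas Series kmeans_cl, kohonen_cl and autoclass_cl holding the cluster label of each data set, indexed by data set name:

import pandas as pd

# How the clusters of two algorithms overlap.
print(pd.crosstab(kmeans_cl, kohonen_cl))

# Data sets grouped together by all three algorithms: group by the triple of
# cluster labels and list the members of each group, as in table 20.
triples = pd.concat([kmeans_cl, kohonen_cl, autoclass_cl], axis=1,
                    keys=["kmeans", "kohonen", "autoclass"])
for labels, group in triples.groupby(["kmeans", "kohonen", "autoclass"]):
    print(labels, list(group.index))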
Table 19 shows which variables were picked as being significant by all 3, or 2, or just 1 of the clustering algorithms. As can be seen from table 19, the KR_TEST and C45_TRAIN variables were picked as significant variables in determining the clusters by all three algorithms.

Variables picked as        K-means             Kohonen Vector       Autoclass
significant in clustering                      Quantization
data by:
3 algorithms               KR_TEST, C45_TRAIN  KR_TEST, C45_TRAIN   KR_TEST, C45_TRAIN
2 algorithms               NB_TRAIN            IBK_TRAIN, RL_TRAIN, NB_TRAIN, IBK_TRAIN,
                                               NUM_ATTR             RL_TRAIN, NUM_ATTR
1 algorithm                                                         KR_TRAIN, C45_TEST,
                                                                    RL_TEST, NB_TEST,
                                                                    IBK_TEST, OR_TRAIN,
                                                                    OR_TEST

Table 19: Summary of the significance of each variable for the 3 clustering algorithms.

4.5.2 Comparison of Data Sets in Different Clusters

Detailed analysis of the grouping of the data sets from each algorithm revealed that all the data sets listed in one column of table 20 were actually grouped together by all three clustering algorithms. For example, the 'colic' data set was grouped by all three clustering algorithms as having lower than average error rates. There are a total of 24 data sets in this group. The data sets in column 2 of table 20 were found by the three clustering algorithms to have higher than average error rates; there are 5 data sets in this group. The data sets in column 3 have about average error rates, and there are 5 data sets in this group. This gives a total of 34 data sets clearly assigned to the different groups.

Data sets having error rates  Data sets having error rates  Data sets having error rates
lower than average            higher than average           about average
Colic          Audiology      Primary-Tumor                 Credit-G
Credit-A       Heart-C        Breasttumor                   Diabetes
Heart-Statlog  Sonar          Echomonths                    Autos
Labor          Soybean        Housing                       Cleveland
Hepatitis      Heart-H        Quake                         Servo
Lymph          Splice
Baskball       Vowel
Elusage        Detroit
Fishcatch      Mbagrade
Gascons        Pollution
Hungarian      Sleep
Lowbwt         Strike

Table 20: Data sets in each column were grouped into one cluster by all three algorithms.

4.5.3 Influence of Characteristics of Data Sets on Performance of Classification Algorithms

Looking at table 20, it can be seen that both the 'Credit-G' and 'Servo' data sets were grouped together as having about average error rates. From table 2 on page 26, it can be seen that 'Servo' has 5 attributes and 167 instances whereas 'Credit-G' has 20 attributes and 1000 instances. On the other hand, both the 'Breasttumor' and 'Quake' data sets were grouped together as having higher than average error rates. 'Breasttumor' has 10 attributes and 286 instances whereas 'Quake' has 4 attributes and 2178 instances. This gives an indication that the number of attributes and number of instances (characteristics of the data sets) do not have a strong influence on the performance of the classification algorithms.

Chapter 5. Conclusion

As was stated in chapter 1, the major goal of this thesis was to find associations between classification algorithms and characteristics of data sets by a two-step process:
1. Build a file of data set names, their characteristics and the performance of a number of algorithms on each data set.
2. Apply unsupervised clustering to the file built in step 1, analyze the generated clusters and determine whether there are any significant patterns.

The major discovery made by analyzing the generated clusters is that the clusters were formed based on the accuracy of the algorithms. The data sets were grouped as belonging to clusters having low average error rates, medium average error rates or high average error rates. This suggests that there are three kinds of data sets among the 59 used: 'easy-to-learn data sets', 'moderate-to-learn data sets', and 'hard-to-learn data sets'.
It was discovered that the number of instances was not useful in clustering the data sets: it was the only significant variable in clustering the data sets before it was excluded from the generated data set, and it prevented analysis based on other variables, including the variables that contain values for the accuracy of each classification algorithm.

While not directly relevant to clustering, it was also shown that the number of instances and number of attributes of the data sets do not have a strong influence on the performance of the data mining algorithms, as high error rates were obtained both for small data sets with a small number of attributes and for large data sets with a large number of attributes.

Experiments performed for this thesis also allowed the comparison of the performance of the 6 classification algorithms when default settings were used for each algorithm. It was discovered that in terms of performance, the top three algorithms were Kernel Density, C4.5, and Naïve Bayes, followed by Rule Learner, IBK and OneR. However, it is to be noted that training times for both Kernel Density and Naïve Bayes were considerably higher than for the other algorithms.

5.1 Further Work

While very limited in scope, this investigation has revealed a number of interesting clusters in machine learning performance data. It suggests that a larger investigation, as outlined below, using more data sets and data set characteristics would be worthwhile.

• Use more data sets: The use of a larger number of data sets would increase the size of the data set generated for clustering analysis. This would allow the clustering algorithms to consider more cases for the formation of clusters.

• Use more data set sources: The data sets used in this thesis came mainly from the UCI data collection. The use of a larger variety of real data sets from different industries may allow the formation of clusters which reveal patterns between different industries and data mining algorithms.

• Use small to very big data sets: In the data mining industry the size of the data to be analyzed can be very large. The maximum size of the data sets used in this thesis was 8124 instances. It would be useful to see what kind of performance is obtained and what types of clusters are formed when large data sets are used.

• Use more classification algorithms: The use of more classification algorithms would allow larger data sets to be generated for analysis by clustering.

• Use more clustering algorithms: The use of more clustering algorithms would allow the consistency of the cluster formation to be assessed.

• Use optimal parameter values by fine-tuning the settings of each algorithm: As well as using the error rates obtained with the default settings of the different classification algorithms, using the error rates obtained by fine-tuning the different options available for optimal classification performance may allow the formation of different types of clusters and may also produce new significant variables for clustering.

• Use more characteristics of the data sets: Only 'number of instances' and 'number of attributes' were used in this thesis. Characteristics of the data sets such as whether they contain numeric, symbolic or mixed values, and whether they contain missing values, could be useful.

• Use visualization tools to analyze the generated data set: Visualization of the generated data set may provide important information and may allow better analysis of the clusters formed.

Appendix A. About WEKA
Weka is a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform. The algorithms can either be applied directly to a data set or called from Java code. Weka is also well suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License. Implemented schemes for classification include decision tree inducers, rule learners, Naive Bayes, decision tables, locally weighted regression, support vector machines, instance-based learners, logistic regression and voted perceptrons. Implemented schemes for numeric prediction include linear regression, model tree generators, locally weighted regression, instance-based learners and decision tables. Implemented "meta-schemes" include bagging, stacking, boosting, regression via classification and classification via regression. More details can be found at http://www.cs.waikato.ac.nz.

Appendix B. About Enterprise Miner (Commercial Software)

Enterprise Miner is an integrated software product that provides an end-to-end business solution for data mining. A graphical user interface (GUI) provides a user-friendly front-end to the SEMMA (Sample, Explore, Modify, Model, Assess) process. All of the functionality needed to implement the SEMMA process is accessed through a single GUI. The SEMMA process is driven by a process flow diagram (PFD), which can be modified and saved. The GUI is designed in such a way that the business technologist with little statistical expertise can quickly and easily navigate through the SEMMA process, while the quantitative expert can go "behind the scenes" to fine-tune the analytical process. SAS Enterprise Miner contains a collection of sophisticated analysis tools that have a common user-friendly interface that enables you to create and compare multiple algorithms. Statistical tools include clustering, decision trees, linear and logistic regression, and neural networks. Data preparation tools include outlier detection, variable transformations, random sampling, and the partitioning of data sets (into train, test, and validate data sets) [48]. More details can be found at http://www.sas.com.

Appendix C. Detail Results from K-means Analysis
FASTCLUS Procedure: Replace=FULL Radius=0 Maxclusters=5 Maxiter=1

Initial Seeds

CLUSTER  NUM_ATTR  IBK_TRAIN  IBK_TEST  C45_TRAIN  C45_TEST  RL_TRAIN  RL_TEST  NB_TRAIN  NB_TEST  OR_TRAIN  OR_TEST  KR_TRAIN  KR_TEST
1        7.0000    0.0000     0.0000    0.0000     0.0000    0.0000    0.0000   0.0000    0.0000   0.0000    0.0000   0.0000    0.0000
2        35.0000   50.8000    11.4200   3.6600     7.9100    3.6600    8.3400   6.3000    7.1700   59.1500   60.4700  0.1700    8.9300
3        10.0000   26.2200    84.9700   44.4000    83.9200   40.9100   87.0600  68.1800   82.8700  69.9300   76.2200  3.8500    82.5200
4        11.0000   0.0000     61.5000   18.5000    48.5000   0.1400    49.5000  51.5000   0.6200   0.6200    0.3800   0.0000    0.6000
5        9.0000    0.0000     68.2100   54.8700    54.8700   3.5900    66.1500  22.5600   0.6000   5.1300    82.0500  0.0000    66.1500

Statistics for Variables

VARIABLE   TOTAL STD  WITHIN STD  R-SQUARED  RSQ/(1-RSQ)
NUM_ATTR   17.665590  13.579017   0.449894   0.817832
IBK_TRAIN  16.187436  12.263900   0.465600   0.871256
IBK_TEST   20.064596  12.113216   0.660669   1.946973
C45_TRAIN  11.392017  7.787043    0.564980   1.298743
C45_TEST   17.271645  11.762713   0.568169   1.315723
RL_TRAIN   8.790095   5.570234    0.626126   1.674696
RL_TEST    18.635989  10.262284   0.717675   2.542021
NB_TRAIN   16.614650  9.026326    0.725207   2.639105
NB_TEST    18.501144  11.202011   0.658681   1.929810
OR_TRAIN   18.116768  12.560708   0.552459   1.234433
OR_TEST    21.936017  13.732877   0.635100   1.740480
KR_TRAIN   5.744748   4.852061    0.335835   0.505649
KR_TEST    19.132048  9.951635    0.748098   2.969797
OVER-ALL   16.770235  10.715786   0.619867   1.630659

Appendix D. Detail Results from Kohonen Vector Quantization

FASTCLUS Procedure: Replace=RANDOM Radius=0 Maxclusters=5 Maxiter=100 Converge=0.02

Initial Seeds

CLUSTER  NUM_ATTR  IBK_TRAIN  IBK_TEST  C45_TRAIN  C45_TEST  RL_TRAIN  RL_TEST  NB_TRAIN  NB_TEST  OR_TRAIN  OR_TEST  KR_TRAIN  KR_TEST
1        35.0000   50.8000    11.4200   3.6600     7.9100    3.6600    8.3400   6.3000    7.1700   59.1500   60.4700  0.1700    8.9300
2        26.0000   3.4100     24.8900   4.8900     17.5600   9.7600    24.8900  31.2200   41.9500  31.2200   37.0700  0.0000    22.9300
3        5.0000    30.2400    20.6400   9.2800     22.2400   5.1200    19.5200  9.1200    9.1200   36.4800   41.1200  0.0000    0.1200
4        4.0000    0.0000     1.9200    1.9200     1.9200    1.9200    1.9200   0.0000    1.9200   0.0000    3.8500   0.0000    1.9200
5        7.0000    0.0000     0.0000    0.0000     0.0000    0.0000    0.0000   0.0000    0.0000   0.0000    0.0000   0.0000    0.0000

Minimum Distance Between Initial Seeds = 7.044665
Kohonen Learning Rate: Initial=0.5 Final=0.02 Steps=1000
Kohonen VQ: Maxsteps=10000 Maxiter=100 Converge=0.0001
FASTCLUS Procedure: Replace=RANDOM Radius=0 Maxclusters=5 Maxiter=100 Converge=0.02

Statistics for Variables

VARIABLE   TOTAL STD  WITHIN STD  R-SQUARED  RSQ/(1-RSQ)
NUM_ATTR   17.665590  9.538179    0.728581   2.684339
IBK_TRAIN  16.187436  8.862687    0.720912   2.583105
IBK_TEST   20.064596  13.065480   0.605220   1.533053
C45_TRAIN  11.392017  8.626314    0.466155   0.873204
C45_TEST   17.271645  11.900984   0.557957   1.262225
RL_TRAIN   8.790095   5.660908    0.613855   1.589698
RL_TEST    18.635989  10.682617   0.694074   2.268766
NB_TRAIN   16.614650  9.844509    0.673133   2.059346
NB_TEST    18.501144  11.674383   0.629288   1.697513
OR_TRAIN   18.116768  12.274027   0.572655   1.340031
OR_TEST    21.936017  14.221397   0.608678   1.555437
KR_TRAIN   5.744748   4.874310    0.329730   0.491935
KR_TEST    19.132048  11.494007   0.663964   1.975872
OVER-ALL   16.770235  10.540830   0.632179   1.718710

Pseudo F Statistic = 23.20

Appendix E. Detail Results from Autoclass Clustering Analysis

Class Listings

These listings are ordered by class weight.
* j is the zero-based class index,
* k is the zero-based attribute index, and
* l is the zero-based discrete attribute instance index.
Within each class, the covariant and independent model terms are ordered by their term influence value I-jk. Covariant attributes and discrete attribute instance values are both ordered by their significance value. Significance values are computed with respect to a single class classification, using the divergence from it, abs(log(Prob-jkl / Prob-*kl)), for discrete attributes and the relative separation from it, abs(Mean-jk - Mean-*k) / StDev-jk, for numerically valued attributes. For the SNcm model, the value line is followed by the probabilities that the value is known, for that class and for the single class classification. Entries are attribute type dependent, and the corresponding headers are reproduced with each class. In these:
* num/t denotes model term number,
* num/a denotes attribute number,
* t denotes attribute type,
* mtt denotes model term type, and
* I-jk denotes the term influence value.

CLASS 1 - weight 19, normalized weight 0.322, relative strength 1.56e-03
class cross entropy w.r.t. global class 9.10e+00
Model file: /tmp/wekaAC-112131-14862-16325.model - index = 0
Model term types (mtt): (single_normal_cn SNcn) (single_normal_cm SNcm)

ATTRIBUTE   I-JK   (MEAN-JK  STDEV-JK)   |MEAN-JK - MEAN-*K|/STDEV-JK  (MEAN-*K  STDEV-*K)
NUM_ATTR    2.503  (1.16e+01 6.35e+00)   3.68e+00                      (3.49e+01 1.26e+02)
IBK_TRAIN   0.853  (4.26e+00 4.96e+00)   1.58e+00                      (1.21e+01 1.61e+01)
OR_TRAIN    0.799  (1.72e+01 5.80e+00)   1.36e+00                      (2.51e+01 1.82e+01)
RL_TEST     0.776  (1.93e+01 5.30e+00)   6.35e-01                      (2.27e+01 1.78e+01)
KR_TEST     0.771  (1.87e+01 5.98e+00)   7.50e-01                      (2.32e+01 1.90e+01)
C4.5_TEST   0.761  (1.86e+01 5.20e+00)   6.74e-01                      (2.21e+01 1.72e+01)
IBK_TEST    0.662  (1.95e+01 7.20e+00)   1.16e+00                      (2.78e+01 1.93e+01)
OR_TEST     0.517  (2.19e+01 9.43e+00)   1.05e+00                      (3.18e+01 2.07e+01)
C45_TRAIN   0.383  (1.07e+01 5.10e+00)   1.92e-02                      (1.06e+01 1.11e+01)
NB_TEST     0.316  (1.91e+01 9.91e+00)   5.72e-01                      (2.48e+01 1.85e+01)
RL_TRAIN    0.312  (7.66e+00 4.31e+00)   1.88e-01                      (8.47e+00 8.50e+00)
NB_TRAIN    0.295  (1.52e+01 8.77e+00)   5.64e-01                      (2.01e+01 1.59e+01)
KR_TRAIN    0.152  (3.24e+00 4.04e+00)   1.76e-01                      (2.53e+00 5.80e+00)

CLASS 2 - weight 17, normalized weight 0.289, relative strength 7.31e-05
class cross entropy w.r.t. global class 7.25e+00
Model file: /tmp/wekaAC-112131-14862-16325.model - index = 0
Model term types (mtt): (single_normal_cn SNcn) (single_normal_cm SNcm)
ATTRIBUTE   I-JK   (MEAN-JK  STDEV-JK)   |MEAN-JK - MEAN-*K|/STDEV-JK  (MEAN-*K  STDEV-*K)
KR_TRAIN    3.908  (1.13e-02 4.86e-02)   5.19e+01                      (2.53e+00 5.80e+00)
NUM_ATTR    0.565  (8.81e+01 2.20e+02)   2.41e-01                      (3.49e+01 1.26e+02)
RL_TEST     0.450  (2.33e+01 7.55e+00)   8.85e-02                      (2.27e+01 1.78e+01)
C4.5_TEST   0.413  (2.35e+01 7.64e+00)   1.89e-01                      (2.21e+01 1.72e+01)
C45_TRAIN   0.337  (8.18e+00 5.59e+00)   4.31e-01                      (1.06e+01 1.11e+01)
IBK_TEST    0.290  (2.94e+01 1.03e+01)   1.53e-01                      (2.78e+01 1.93e+01)
OR_TEST     0.289  (4.35e+01 1.41e+01)   8.31e-01                      (3.18e+01 2.07e+01)
KR_TEST     0.268  (2.29e+01 1.02e+01)   2.73e-02                      (2.32e+01 1.90e+01)
IBK_TRAIN   0.188  (2.09e+01 1.83e+01)   4.78e-01                      (1.21e+01 1.61e+01)
RL_TRAIN    0.173  (8.59e+00 5.25e+00)   2.39e-02                      (8.47e+00 8.50e+00)
NB_TEST     0.160  (2.45e+01 1.17e+01)   2.26e-02                      (2.48e+01 1.85e+01)
OR_TRAIN    0.136  (3.21e+01 1.43e+01)   4.92e-01                      (2.51e+01 1.82e+01)
NB_TRAIN    0.073  (1.97e+01 1.18e+01)   3.82e-02                      (2.01e+01 1.59e+01)

CLASS 3 - weight 13, normalized weight 0.220, relative strength 1.00e+00
class cross entropy w.r.t. global class 1.84e+01
Model file: /tmp/wekaAC-112131-14862-16325.model - index = 0
Model term types (mtt): (single_normal_cn SNcn) (single_normal_cm SNcm)

ATTRIBUTE   I-JK   (MEAN-JK  STDEV-JK)   |MEAN-JK - MEAN-*K|/STDEV-JK  (MEAN-*K  STDEV-*K)
C4.5_TEST   2.520  (2.27e+00 1.64e+00)   1.21e+01                      (2.21e+01 1.72e+01)
RL_TEST     2.450  (2.48e+00 1.78e+00)   1.13e+01                      (2.27e+01 1.78e+01)
C45_TRAIN   2.196  (1.23e+00 1.07e+00)   8.71e+00                      (1.06e+01 1.11e+01)
KR_TRAIN    2.182  (1.70e-01 3.63e-01)   6.50e+00                      (2.53e+00 5.80e+00)
NUM_ATTR    1.973  (1.78e+01 1.07e+01)   1.60e+00                      (3.49e+01 1.26e+02)
KR_TEST     1.906  (3.18e+00 2.60e+00)   7.71e+00                      (2.32e+01 1.90e+01)
RL_TRAIN    1.885  (1.17e+00 1.14e+00)   6.39e+00                      (8.47e+00 8.50e+00)
NB_TEST     0.827  (9.39e+00 7.54e+00)   2.04e+00                      (2.48e+01 1.85e+01)
OR_TEST     0.772  (1.04e+01 1.15e+01)   1.86e+00                      (3.18e+01 2.07e+01)
NB_TRAIN    0.688  (8.70e+00 6.88e+00)   1.66e+00                      (2.01e+01 1.59e+01)
OR_TRAIN    0.618  (9.09e+00 1.04e+01)   1.54e+00                      (2.51e+01 1.82e+01)
IBK_TEST    0.308  (1.39e+01 1.57e+01)   8.87e-01                      (2.78e+01 1.93e+01)
IBK_TRAIN   0.069  (8.90e+00 1.31e+01)   2.45e-01                      (1.21e+01 1.61e+01)

CLASS 4 - weight 10, normalized weight 0.170, relative strength 5.75e-08
class cross entropy w.r.t. global class 1.65e+01
Model file: /tmp/wekaAC-112131-14862-16325.model - index = 0
Model term types (mtt): (single_normal_cn SNcn) (single_normal_cm SNcm)

ATTRIBUTE   I-JK   (MEAN-JK  STDEV-JK)   |MEAN-JK - MEAN-*K|/STDEV-JK  (MEAN-*K  STDEV-*K)
NUM_ATTR    3.111  (1.00e+01 3.46e+00)   7.20e+00                      (3.49e+01 1.26e+02)
KR_TEST     1.605  (5.59e+01 1.33e+01)   2.45e+00                      (2.32e+01 1.90e+01)
RL_TEST     1.586  (5.38e+01 1.36e+01)   2.29e+00                      (2.27e+01 1.78e+01)
C4.5_TEST   1.572  (5.18e+01 1.26e+01)   2.36e+00                      (2.21e+01 1.72e+01)
NB_TEST     1.493  (5.58e+01 1.31e+01)   2.36e+00                      (2.48e+01 1.85e+01)
IBK_TEST    1.428  (5.91e+01 1.32e+01)   2.37e+00                      (2.78e+01 1.93e+01)
NB_TRAIN    1.227  (4.48e+01 1.35e+01)   1.83e+00                      (2.01e+01 1.59e+01)
C45_TRAIN   1.098  (2.62e+01 1.50e+01)   1.04e+00                      (1.06e+01 1.11e+01)
RL_TRAIN    0.903  (1.89e+01 1.20e+01)   8.73e-01                      (8.47e+00 8.50e+00)
KR_TRAIN    0.893  (7.63e+00 1.02e+01)   5.02e-01                      (2.53e+00 5.80e+00)
OR_TEST     0.791  (5.62e+01 1.48e+01)   1.65e+00                      (3.18e+01 2.07e+01)
OR_TRAIN    0.721  (4.68e+01 1.80e+01)   1.20e+00                      (2.51e+01 1.82e+01)
IBK_TRAIN   0.075  (1.67e+01 1.86e+01)   2.46e-01                      (1.21e+01 1.61e+01)

Appendix F. Sample Code Used to Run Multiple Algorithms on Multiple Data Sets
Run-one.bat

The file which is used to capture the parameters and run the data mining algorithms is shown below:

#!/bin/csh
#Expects to be invoked from run-lots.bat
#Create a name for the output file by removing 'java weka.classifiers.'
#and all spaces from the command line
set outfile=`echo $argv[1] |sed -e "s/java weka.classifiers.//"|sed -e "s/ //g"`
#Change this line to set the name of the output directory
set outfile=/research/ai/ribrahim/thesis/results/$outfile
/bin/echo $outfile
/bin/echo "$argv[1]" > $outfile
$argv[1] >> $outfile

Run-multiple.bat

A sample of the file which runs multiple algorithms on multiple data sets is shown below:

run-one.bat "java weka.classifiers.j48.J48 -o -t vote.arff"
run-one.bat "java weka.classifiers.IBk -o -W 200 -t vote.arff"
run-one.bat "java weka.classifiers.j48.PART -o -t vote.arff"
run-one.bat "java weka.classifiers.NaiveBayes -o -t vote.arff"
run-one.bat "java weka.classifiers.OneR -o -t vote.arff"
run-one.bat "java weka.classifiers.KernelDensity -o -t vote.arff"
run-one.bat "java weka.classifiers.j48.J48 -o -t vowel.arff"
run-one.bat "java weka.classifiers.IBk -o -W 200 -t vowel.arff"
run-one.bat "java weka.classifiers.j48.PART -o -t vowel.arff"
run-one.bat "java weka.classifiers.NaiveBayes -o -t vowel.arff"
run-one.bat "java weka.classifiers.OneR -o -t vowel.arff"
run-one.bat "java weka.classifiers.KernelDensity -o -t vowel.arff"

Appendix G. Sample Output from Data Generation

java weka.classifiers.j48.PART -o -t glass.arff

=== Error on Training data ===
Correctly classified instances     194 (90.65 %)
Incorrectly classified instances   20 (9.35 %)
Mean absolute error                0.04
Root mean squared error            0.15
Relative absolute error            20.98 %
Root relative squared error        45.92 %
Total number of instances          214

=== Stratified cross-validation ===
Correctly classified instances     141 (65.89 %)
Incorrectly classified instances   73 (34.11 %)
Mean absolute error                0.1056
Root mean squared error            0.2943
Relative absolute error            49.8563 %
Root relative squared error        90.6776 %
Total number of instances          214

6. References

1. D. Aha and D. Kibler, Instance-Based Learning Algorithms, Machine Learning, Vol. 6, pp. 37-66, 1991.
2. C. Beardah and M. Baxter, The Archaeological Use of Kernel Density Estimates, Internet Archaeology, http://www.intarch.ac.uk, 1996.
3. J. Berger, Statistical Decision Theory and Bayesian Analysis, Springer-Verlag, 1985.
4. M. Berry and G. Linoff, Data Mining Techniques for Marketing, Sales and Customer Support, John Wiley & Sons, New York, 1997.
5. A. Berson and S. Smith, Data Warehousing, Data Mining, and OLAP, McGraw-Hill, 1997.
6. M. Bertero, T. Poggio and V. Torre, Ill-Posed Problems in Early Vision, Proceedings of the IEEE, Vol. 76, No. 8, pp. 869-902, 1990.
7. M. Berthold and D. Hand, Intelligent Data Analysis: An Introduction, Springer-Verlag, 1999.
8. P. Brazdil, J. Gama and R. Henery, Characterizing the Applicability of Classification Algorithms using Meta Level Learning, in Machine Learning - ECML-94, Springer-Verlag, pp. 83-102, 1994.
9. P. Brazdil and R. Henery, Analysis of Results, in Machine Learning, Neural and Statistical Classification, Ellis Horwood, pp. 175-212, 1994.
10. C. Brodley and P. Smyth, Applying Classification Algorithms in Practice, Statistics and Computing, Vol. 7, pp. 45-56, 1995.
11. D. Brand and R. Gerritsen, Naïve Bayes and Nearest Neighbor, http://www.dbmsmag.com/9807m07.html, 1997.
12. P. Brazdil and J. Gama, The STATLOG Project - Evaluation/Characterization of Classification Algorithms, http://www.ncc.up.pt/liacc/ML/statlog/, 1998.
13. P. Cheeseman, On Finding the Most Probable Model, in Computational Models of Scientific Discovery and Theory Formation, Morgan Kaufmann Publishers, San Francisco, pp. 73-96, 1990.
14. P. Cheeseman and J. Stutz, Bayesian Classification (AUTOCLASS): Theory and Results, in Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.
15. P. Cheeseman, J. Stutz, M. Self, J. Kelly, W. Taylor, and D. Freeman, Bayesian Classification, in Proceedings of the Seventh National Conference on Artificial Intelligence (AAAI-88), Morgan Kaufmann Publishers, San Francisco, pp. 607-611, 1988.
16. J. Culberson, On the Futility of Blind Search: An Algorithmic View of 'No Free Lunch', Evolutionary Computation, Vol. 6, 1998.
17. B. Efron and R. Tibshirani, An Introduction to the Bootstrap, Chapman and Hall, London, 1993.
18. W. Emde and D. Wettschereck, Relational Instance Based Learning, in Proceedings of the 13th International Conference on Machine Learning, pp. 122-130, 1996.
19. U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, Knowledge Discovery and Data Mining: Towards a Unifying Framework, in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), p. 82, AAAI Press, 1996.
20. E. Frank and I. Witten, Generating Accurate Rule Sets Without Global Optimization, in Machine Learning: Proceedings of the Fifteenth International Conference, Morgan Kaufmann Publishers, San Francisco, pp. 144-151, 1998.
21. J. Goebel, K. Volk, H. Walker, F. Gerbault, P. Cheeseman, M. Self, J. Stutz, and W. Taylor, A Bayesian Classification of the IRAS LRS Atlas, Astronomy and Astrophysics, Vol. 222, pp. 5-8, 1989.
22. R. Hanson, J. Stutz and P. Cheeseman, Bayesian Classification with Correlation and Inheritance, in Proceedings of the 12th International Joint Conference on Artificial Intelligence, San Francisco, pp. 692-698, 1991.
23. R. Hecht-Nielsen, Neural Networks for Image Analysis, in Neural Networks for Vision and Image Processing, Carpenter and Grossberg (eds.), 1992.
24. A. Herr, N. Klomp and J. Atkinson, Identification of Bat Echolocation Calls Using a Decision Tree Classification System, Complexity International, Vol. 4, January 1997.
25. J. Hjorth, Computer Intensive Statistical Methods: Validation, Model Selection, and Bootstrap, Chapman and Hall, London, 1994.
26. R. Holte, Very Simple Classification Rules Perform Well on Most Commonly Used Data Sets, Machine Learning, Kluwer Academic Publishers, Boston, Vol. 11, pp. 63-91, 1993.
27. A. Izenman, Recent Developments in Nonparametric Density Estimation, Journal of the American Statistical Association, Vol. 86, No. 413, pp. 205-224, 1991.
28. G. Jang, A Comparison of Neural Networks Performance for Seismic Phase Identification, Journal of the Franklin Institute, Vol. 330, No. 3, pp. 505-524, 1993.
29. W. Jung, J. Oglesby, and H. Kirk, Data Mining Primer: Overview of Applications and Methods, SAS Institute, Cary, 1998.
30. W. Kadus, The Use of Symbolic Learning Algorithms and Instance-Based Learning for Gesture Recognition, http://www.cse.unsw.edu.au/~waleed/thesis/node69.html, University of NSW, 1995.
31. M. Kearns and U. Vazirani, An Introduction to Computational Learning Theory, The MIT Press, London, 1994.
32. R. Kohavi, Holte's OneR, http://www.sgi.com/Technology/mlc/util/util/node14.html, 1996.
33. R. Kohavi, A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, in Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 1137-1145, 1995.
34. A. Lensu and P. Koikkalainen, Profiling of Text Documents through Self-Organizing Maps, http://www.ferin.math.jyu.fi, University of Jyväskylä, 1997.
35. T. Lim and W. Loh, A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms, Technical Report No. 979, Department of Statistics, University of Wisconsin-Madison, 1997.
36. C. Marzban and G. Stumpf, A Neural Network for Tornado Prediction Based on Doppler Radar-Derived Attributes, Journal of Applied Meteorology, Vol. 35, p. 617, 1996.
37. T. Masters, Advanced Algorithms for Neural Networks: A C++ Sourcebook, John Wiley and Sons, New York, 1995.
38. G. McLachlan and T. Krishnan, The EM Algorithm and Extensions, Wiley and Sons, 1997.
39. V. Nalwa, A Guided Tour of Computer Vision, Addison-Wesley, 1993.
40. E. Parzen, On Estimation of a Probability Density Function and Mode, The Annals of Mathematical Statistics, pp. 1065-1076, 1962.
41. W. Potts, Introduction to Predictive Data Mining Using Enterprise Miner Software, SAS Institute, Cary, 1997.
42. D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann, San Francisco, 1999.
43. J. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, California, 1993.
44. J. Ramon and L. De Raedt, Instance Based Function Learning, Lecture Notes in Computer Science, Vol. 1634, pp. 268-275, 1999.
45. R. Rao, T. Voigt, and T. Fermanian, Data Mining of Subjective Agricultural Data, in Proceedings of the Tenth International Conference on Machine Learning, Morgan Kaufmann, San Mateo, 1999.
46. S. Salzberg, On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach, Data Mining and Knowledge Discovery, Vol. 1, No. 3, 1997.
47. S. Schubert, Data Mining Projects, SAS Methodology, SAS Institute Australia, Melbourne, 1998.
48. S. Schubert, The Data Mining Challenge: Turning Raw Data Into Business Gold, http://www.sas.com, Cary, 1999.
49. D. Scott, Multivariate Density Estimation: Theory, Practice and Visualization, Wiley, New York, 1992.
50. J. Shavlik, R. Mooney and G. Towell, Symbolic and Neural Learning Algorithms: An Experimental Comparison, Machine Learning, Vol. 6, pp. 111-143, 1991.
51. B. Silverman, Density Estimation, Chapman and Hall, London, 1986.
52. B. Silverman, Kernel Density Estimation Using the Fast Fourier Transform, Applied Statistics, Vol. 31, No. 1, pp. 93-99, 1982.
53. P. Smyth, A. Gray and U. Fayyad, Retrofitting Decision Tree Classifiers Using Kernel Density Estimation, in Proceedings of the 12th International Conference on Machine Learning, pp. 506-514, 1995.
54. W. Venables and B. Ripley, Modern Applied Statistics with S-Plus, Springer-Verlag, New York, 1994.
55. S. Weiss and N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann Publishers, San Francisco, 1998.
56. S. Weiss and I. Kapouleas, An Empirical Comparison of Pattern Recognition, Neural Nets, and Machine Learning Classification Methods, in Proceedings of the 11th International Joint Conference on Artificial Intelligence, pp. 781-787, 1989.
57. C. Westphal and T. Blaxton, Data Mining Solutions: Methods and Tools for Solving Real-World Problems, John Wiley & Sons, New York, 1998.
58. I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco, 2000.
59. D. Wolpert and W. Macready, No Free Lunch Theorems for Search, Technical Report No. SFI-TR-95-02-010, Santa Fe Institute, 1995.
60. A. Upal, Autoclass, http://www.cs.ualberta.ca/~upal/cluster/p2/node11.html, 1997.