Rheinisch-Westfälische Technische Hochschule Aachen
Chair of Data Management and Data Exploration
Prof. Dr. T. Seidl

Proseminar Paper
Weka Data Mining Software

Kai Adam
June 2012

Supervisors: Prof. Dr. T. Seidl, Dipl.-Ing. Marwan Hassani

Contents

1 Introduction
2 Basics and Motivation
  2.1 The ARFF Data Set Format
  2.2 Pre-Processing
  2.3 Classification
3 Essential Application Types
  3.1 The Explorer
    3.1.1 Pre-Processing with Weka
    3.1.2 Classification with Weka
    3.1.3 Cluster, Associate, Select Attributes, Visualization
  3.2 The Experimenter, Knowledge Flow and Simple CLI
4 Review
5 Related Works
6 Summary
Bibliography
A List of Abbreviations

List of Figures

2.1 Weka Explorer: Preprocess tab with available filters
2.2 Flow of a rough classification
3.1 Explorer: Pre-Processing tab with balance-scale.arff
3.2 Explorer: Classify tab
3.3 Explorer: Classify
3.4 Explorer: Classify with FT algorithm
3.5 Explorer: FT tree visualisation
3.6 Explorer: Cluster
3.7 Explorer: Select Attributes
3.8 Experimenter
3.9 Knowledge Flow
5.1 Data Mining Tools Usage [2]
Abstract

Machine learning software packages are becoming increasingly important in science, because they are used to evaluate results and methods from experiments. The Weka workbench bundles such packages and makes it possible to test many different algorithms on different kinds of data, in order to compare the accuracy and efficiency of the algorithms on the data at hand. This is helpful for training several algorithms on similar data sets and selecting the most suitable one, judged by efficiency, accuracy and other factors. In this paper the algorithms are compared by their accuracy.

Chapter 1
Introduction

Weka is a data mining suite written in Java; it was developed at the University of Waikato in New Zealand by Ian H. Witten, Eibe Frank and Mark A. Hall. The name Weka is an acronym for Waikato Environment for Knowledge Analysis; the weka, a bird native to New Zealand, is its symbol. Weka is a collection of different data mining tools and provides a suitable answer to most real-world data set problems. Researchers who work with data sets have to choose between many algorithms and filters, because no single one fits all data mining problems. The Weka workbench offers users a simple GUI with four different views, with panels for pre-processing, classification, clustering, attribute selection, association rules and visualization. It works with several data set formats such as ARFF, C4.5, CSV and comparable ones. One of its main advantages is that users can classify attributes of a relational data set without deep knowledge of data mining. It allows users to analyse data sets and, by means of a visualization of the evaluated data, to find a proper algorithm that delivers the desired results. The following parts of this term paper describe the essential application types of Weka - primarily the Explorer and its great diversity in data processing.
At the conclusion there will be a short review of the Weka workbench and a short discussion of related works.

Chapter 2
Basics and Motivation

2.1 The ARFF Data Set Format

ARFF is a common file format that stores data in terms of a relation, declared by '@relation <filename>'. Attributes are introduced by '@attribute <Description> <Type>' and can take different types of values: nominal, numeric, string, date or comparable ones. A special attribute type is 'class'. At the end of an ARFF file there is a data section, introduced by '@data', which consists of the instances of the relation. Instances are represented as rows of data in the body of the file. Listing 2.1 displays an ARFF data set with 625 instances that is used below to show how pre-processing, classification and other data mining methods work.

Listing 2.1: ARFF example: balance-scale.arff

% Name Of The Relation
@relation balance-scale

% 4 Attributes With 1 Class-Attribute
@attribute left-weight real
@attribute left-distance real
@attribute right-weight real
@attribute right-distance real
@attribute class {L,B,R}

@data
1,1,1,1,B
1,1,1,2,R
1,1,1,3,R
...

The data set describes the results of a psychological experiment on a balance scale. The aim of the experiment was to find a model that determines the correct class. Attributes for this model are right-/left-weight and right-/left-distance, while "L" (tip to the left), "R" (tip to the right) and "B" (balanced scale) are the class values. [1, Siegler, R.S.]

2.2 Pre-Processing

Data pre-processing is a generic term for all methods that can process raw data which does not have the needed quality. Insufficient quality can be caused by incomplete data resulting from hardware problems or from differing opinions between analysts. Another reason can be noisy or inconsistent data produced by flaws in transmission and by discrepancies in names, codes or duplicated records.
In order to prevent this loss of quality, researchers have to process the raw data to obtain task-relevant data.

Figure 2.1: Weka Explorer: Preprocess tab with available filters

2.3 Classification

Classification works with training data sets that are given by different observations and measurements. Such a recorded data collection contains several attributes and one class attribute per instance and is used to find a model, or classifier, for the class attribute. Essential for this classifier are the values of the other attributes, which are needed to find a model or rule set that can predict the class attribute of an unknown data set as accurately as possible. In order to check the accuracy of the model it is necessary to have a testing data set that is unknown to the classifier - normally there are no separate training and testing data sets. Instead, one data set is split into a training part and a testing part. To extract the maximum of information for the classifier, the training data set is made bigger than the testing one; the precise split depends on the selected test options and the number of instances.

Figure 2.2: Flow of a rough classification - a learning algorithm is trained on the training data and the resulting classifier is tested on the testing data; with good accuracy the classifier is used for similar unknown data sets, with bad accuracy a different learning algorithm or different parameter settings are chosen.

Chapter 3
Essential Application Types

3.1 The Explorer

The Explorer is the most powerful tool in the Weka workbench and gives users access to multiple functions. These can be used to fulfil a certain task, such as selecting a data set, pre-processing it, choosing a machine learning algorithm, classifying an attribute and evaluating the data produced by different algorithms against each other. This will be explained in the next two subsections based on the balance-scale data set.
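Before turning to the Explorer panels, the rough classification workflow from Section 2.3 (Figure 2.2) can be condensed into a few lines of code. The sketch below is purely illustrative - the toy labels, the 70% split and the majority-class baseline are assumptions for illustration, not anything Weka prescribes:

```python
from collections import Counter

def train_majority(train_labels):
    """'Train' the simplest possible classifier: always predict the
    class label that occurs most often in the training data."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda instance: majority

def accuracy(classifier, instances, labels):
    """Share of correctly classified instances in a testing set."""
    correct = sum(1 for x, y in zip(instances, labels) if classifier(x) == y)
    return correct / len(labels)

# Toy data set: for this baseline only the class labels matter.
instances = list(range(10))
labels = ['R', 'R', 'L', 'R', 'B', 'R', 'L', 'R', 'R', 'L']

# Keep the training part bigger than the testing part (here 70%/30%),
# as described in Section 2.3.
cut = int(0.7 * len(instances))
classifier = train_majority(labels[:cut])
print(accuracy(classifier, instances[cut:], labels[cut:]))  # 0.6666666666666666
```

With a bad accuracy one would, as Figure 2.2 suggests, go back and try a different learning algorithm or other parameter settings.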
3.1.1 Pre-Processing with Weka

Pre-processing in Weka is easy to handle, because the workbench provides several features to process raw data. Figure 3.1 displays the Preprocess panel with a bar of buttons on top for loading a data set into the workbench. It offers not only the option to open a file, but also the opportunity to load data from a database or from a URL. The panel is divided into four parts: the current relation and its properties, a list of attributes, the properties of the selected attribute, and a visualization window that displays the selected attribute in relation to a second, freely selectable one. A further possibility is to edit the data directly via the Edit button or to modify it with one of several filters. These are shown in Figure 2.1 and can clean the data of missing values or inconsistencies, transform the data into other attribute types, or reduce the data by deleting useless instances or duplicates. A distinction is made between supervised and unsupervised filters: supervised filters take the class attribute into account, whereas unsupervised filters process the data without reference to any class labels.

Figure 3.1: Explorer: Pre-Processing tab with balance-scale.arff

3.1.2 Classification with Weka

After pre-processing, the data set should be ready for determining a classifier. Weka enables classification with several algorithms and test options. The GUI, shown in Figure 3.2, is split into four parts. On top is a classifier tab, where a Choose button opens a pop-up, displayed in Figure 3.3, with several categories of classifiers such as bayes, functions, rules and trees. Two tree classifiers will be briefly explained later. The test options frame determines how training and testing sets are obtained, following the scheme explained in Section 2.3. Here users have the following options:

• Use training set: This should be chosen if the actual data set is used as both training and testing set.
• Supplied test set: This is the right option if the actual data set is used as training set and a separate testing set is available.

• Cross-validation: Cross-validation provides the opportunity to work with a single data set. It splits the data set into m folds, uses m-1 folds as training set and the remaining fold as testing set, and repeats this for every fold.

• Percentage split: Splits the actual data set at n percent into a training and a testing set.

Figure 3.2: Explorer: Classify tab

Figure 3.3: Explorer: Classify - (a) filter choice, (b) results with the J48 algorithm

In this example the two chosen tree algorithms are J48 and FT, shown in Figure 3.3(a). The J48 algorithm is based on the C4.5 decision tree algorithm, which is outlined in pseudo code in Listing 3.1.

Listing 3.1: C4.5 pseudo code as described by Quinlan

1. Check for any base cases.
2. For each attribute a:
   find the normalized information gain from splitting on a.
3. Let a_best be the attribute with the highest normalized information gain.
4. Create a decision node that splits on a_best.
5. Recurse on the sublists obtained by splitting on a_best and add those nodes as children of the node.

In contrast to the J48 algorithm, the FT algorithm builds a functional tree with oblique splits and linear functions at the leaves. If these two algorithms are applied to the class attribute, the result list displays the executed algorithms, and a left-click shows the results in the right frame, called the classifier output. Here, Figure 3.3 displays the results of the J48 algorithm and Figure 3.4 the results of the FT algorithm.

Figure 3.4: Explorer: Classify with FT algorithm

A comparison of the correctly classified instances shows that the FT algorithm, with 90.72% correctly classified instances, is a better choice here than J48 with 76.64%.
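These percentages come from Weka's evaluation output. As a rough illustration of what lies behind such figures (the labels below are made up for the three balance-scale classes and are not the actual J48 or FT predictions), the share of correctly classified instances and the accuracy by class can be computed from a confusion matrix:

```python
from collections import defaultdict

def confusion_matrix(actual, predicted):
    """counts[a][p] = number of instances of class a that were predicted as p."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    return counts

def correctly_classified(actual, predicted):
    """Overall share of correctly classified instances."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def per_class_accuracy(counts, cls):
    """Accuracy for one class (the TP rate): correct predictions for cls
    divided by the number of instances that actually belong to cls."""
    row = counts[cls]
    return row[cls] / sum(row.values())

# Made-up predictions over the three balance-scale classes L, B, R.
actual    = ['L', 'L', 'R', 'R', 'R', 'B', 'B', 'L']
predicted = ['L', 'R', 'R', 'R', 'L', 'B', 'L', 'L']

cm = confusion_matrix(actual, predicted)
print(correctly_classified(actual, predicted))  # 0.625
print(per_class_accuracy(cm, 'R'))             # 0.6666666666666666
```

Weka's classifier output reports the same kind of statistics, together with error measures, in its detailed accuracy-by-class and confusion matrix sections.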
Furthermore, the error rates of the FT are lower and the accuracies by class are higher than those in the J48 results. Weka also allows the use of different visualizations, such as visualizing trees, classifier errors and others. These can be opened by right-clicking an entry in the result list; Figure 3.5 shows the tree visualization of the FT results. There it can be seen that the leaves represent class values and the nodes are functions derived from the values of the attributes. This task and the following ones could also be implemented on every platform that runs Java. For this it is necessary to import the Java packages that Weka provides for each task into any Java file.

Figure 3.5: Explorer: FT tree visualisation

3.1.3 Cluster, Associate, Select Attributes, Visualization

The Cluster panel is similar to the Classify panel: users can choose an algorithm which determines a majority class in each cluster and compares the matches with the preassigned class (Figure 3.6). Another panel is Associate, which allows users to choose one of six algorithms to determine association rules between the attributes. The Select Attributes panel is more complex, because it provides the opportunity to select an attribute evaluation algorithm and a search algorithm, which are used to find the best matching attributes in dependence on a preassigned attribute (Figure 3.7). The Visualize panel visualizes the data set itself rather than the results of a classification or clustering; it helps to understand the dependencies between the attributes and their distribution.

Figure 3.6: Explorer: Cluster
Figure 3.7: Explorer: Select Attributes

3.2 The Experimenter, Knowledge Flow and Simple CLI

The Experimenter view, shown in Figure 3.8, allows users to run algorithms against each other on different data sets.
It provides three panels to set up, run and analyse the output and should be used to evaluate algorithms. In order to use an algorithm that is not in the list, users are able to implement their own learning algorithm, which must be written in Java.

Figure 3.8: Experimenter

The Knowledge Flow and the Simple CLI do not replicate the full functionality of the Explorer, but they offer better support for working with large data sets. For example, the Knowledge Flow helps users to configure processing steps and outsource them to different kinds of memory in order not to overload the main memory. It enables users to design a data flow to analyse the algorithms and data sets. Furthermore, the Simple CLI gives access to options of the algorithms that are hidden in the Explorer. A short classification with the J48 algorithm on the balance-scale data set is represented in Figure 3.9.

Figure 3.9: Knowledge Flow

Chapter 4
Review

This chapter deals with the strengths and weaknesses of the Weka workbench. Weka is distributed under the GNU General Public License; therefore it can be freely used and developed further by third parties under the terms of the GNU GPL. A simple GUI makes it easy for the user to manage an evaluation of data sets. It provides useful features split into self-explaining panels - users only have to choose between these panels and the methods to evaluate. One unpleasant aspect of Weka, however, is that it is not compatible with multi-relational data sets. Users who process these types of data sets have to use the MARFF extension, which converts a multi-relational into a one-relational data set. In conclusion, Weka is a good choice for people who want to process and evaluate their data sets easily by using predefined filters and machine learning algorithms. In addition, Weka offers an excellent opportunity to integrate its features into any platform that runs Java. This makes Weka portable.
Chapter 5
Related Works

Data mining tools are in high demand in various scientific areas, and scientists have to decide between several tools. Relevant properties are, for example, free availability, a large number of useful features, and an easy way to implement new features. An overview of the data mining tools that scientists used in 2011 is shown in Figure 5.1 below.

Figure 5.1: Data Mining Tools Usage [2]

As one can see in Figure 5.1, RapidMiner was the most commonly used data mining tool in 2011. It belongs to the freely available software applications and surpasses the Weka workbench, which is placed in position seven. RapidMiner was started by Ralf Klinkenberg, Ingo Mierswa and Simon Fischer at the Artificial Intelligence Unit of the University of Dortmund in 2001. It was first named YALE and is still under active development. The project has integrated the Weka learning schemes and additionally offers modeling and visualization methods such as 3D plots.

Chapter 6
Summary

Weka is a great choice for researchers who have to classify, cluster, search for association rules, pre-process, select attributes and visualize data sets, because it makes it possible to find an optimized algorithm out of a large collection. With the ability to add new algorithms and tabs, the workbench keeps growing and becomes more and more comfortable to handle. Moreover, its Java implementation allows Weka and its features to be integrated into every platform that runs Java, which makes it portable and leads to an even greater awareness.

Bibliography

[1] Siegler, R.S. (1976). Three Aspects of Cognitive Development. Cognitive Psychology, 8, 481-520. Data set generated to model the reported experiments: http://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/

[2] Data Mining Tool Usage. http://www.kdnuggets.com/polls/2011/tools-analytics-data-mining.html

[3] Szugat, Martin: Im Datenrausch: Praktische Einführung in das Data Mining mit Weka 3.4.
http://bioweka.sourceforge.net/download/LE_1.05_75-79.pdf

[4] Witten, Ian H.; Frank, Eibe; Hall, Mark A.: Data Mining: Practical Machine Learning Tools and Techniques. Third Edition. Morgan Kaufmann, 2011.

[5] Frank, Eibe; Hall, Mark; Trigg, Len; Holmes, Geoffrey; Witten, Ian H.: Data mining in bioinformatics using Weka. Volume 20, Number 20, Pages 2479-2481, 2004.

[6] Seidl, Thomas: Slides from the Data Mining lecture, 2011.

Appendix A
List of Abbreviations

ARFF   Attribute Relation File Format
CLI    Command Line Interpreter
CSV    Comma Separated Values
GNU    GNU is not Unix
GPL    General Public License
GUI    Graphical User Interface
IT     Information Technology
MARFF  Multi Attribute Relation File Format
YALE   Yet Another Learning Environment