Rheinisch-Westfälische Technische Hochschule Aachen
Chair of Data Management and Data Exploration
Prof. Dr. T. Seidl
Proseminar Paper
Weka
Data Mining Software
Kai Adam
June 2012
Supervisor: Prof. Dr. T. Seidl
Dipl.-Ing. Marwan Hassani
Contents
1 Introduction
2 Basics and Motivation
  2.1 The ARFF Data Set Format
  2.2 Pre-Processing
  2.3 Classification
3 Essential Application Types
  3.1 The Explorer
    3.1.1 Pre-Processing with Weka
    3.1.2 Classification with Weka
    3.1.3 Cluster, Associate, Select Attributes, Visualization
  3.2 The Experimenter, Knowledge Flow and Simple CLI
4 Review
5 Related Works
6 Summary
Bibliography
A List of Abbreviations
List of Figures
2.1 Weka Explorer: Preprocess tab with available filters
2.2 Flow of a rough classification
3.1 Explorer: Pre-Processing Tab with balance-scale.arff
3.2 Explorer: Classify Tab
3.3 Explorer: Classify
3.4 Explorer: Classify with FT algorithm
3.5 Explorer: FT tree visualisation
3.6 Explorer: Cluster
3.7 Explorer: Select Attributes
3.8 Experimenter
3.9 Knowledge Flow
5.1 Data Mining Tools Usage [2]
Abstract
Machine learning software packages are becoming increasingly important in science, because they are used to evaluate results and methods from experiments. The Weka workbench bundles such packages and allows users to test a large number of varying algorithms on different kinds of data, in order to assess and compare the accuracy and efficiency of the algorithms on the received data. This is helpful for training several algorithms on similar data sets and selecting the most suitable one, which can be identified by its efficiency, accuracy and other factors. In this paper the algorithms are judged by their accuracy.
Chapter 1
Introduction
Weka is a data mining suite written in Java and was developed at the University of Waikato in New Zealand by Ian H. Witten, Eibe Frank and Mark A. Hall. The name Weka is an acronym for Waikato Environment for Knowledge Analysis; the weka, a bird native to New Zealand, is its symbol. It is a collection of different data mining tools and provides a suitable answer to most real-world data set problems.
Researchers who work with data sets have to decide between many algorithms and filters, because there is no single one that fits all data mining problems. The Weka workbench offers users a simple GUI with four different views, and panels for pre-processing, classification, clustering, attribute selection, association rules and visualization. It works with several data set formats such as ARFF, C4.5, CSV or comparable ones.
One of the main advantages is the ability to classify attributes of a relational data set without deep knowledge of data mining. It allows users to analyse data sets and, by means of a visualization of the evaluated data, to find a proper algorithm that delivers the desired results.
The following parts of this term paper describe the essential application types of Weka, primarily the Explorer and its great diversity in data processing. The conclusion gives a short review of the Weka workbench and a short discussion of related works.
Chapter 2
Basics and Motivation
2.1 The ARFF Data Set Format
ARFF is a common file format that stores data in terms of a relation declared with '@relation <filename>'. Attributes are introduced by '@attribute <Description> <Type>' and can take different types of values: Nominal, Numeric, String, Date or comparable ones. A special attribute type is 'class'. At the end of an ARFF file is a data field, marked by '@data', which consists of instances of the relation. Instances are written as rows of data in the body of an ARFF file.
Listing 2.1 displays an ARFF data set with 625 instances that is used to show how pre-processing, classification and other data mining methods work.
Listing 2.1: ARFF Example: balance-scale.arff
% Name Of The Relation
@relation balance-scale

% 4 Attributes With 1 Class Attribute
@attribute left-weight real
@attribute left-distance real
@attribute right-weight real
@attribute right-distance real
@attribute class {L, B, R}

@data
1,1,1,1,B
1,1,1,2,R
1,1,1,3,R
...
The data set describes results of a psychological experiment on a balance scale. The aim of the experiment was to find a model that determines the correct class. The attributes for this model are right-/left-weight and right-/left-distance, while the class attribute takes the value "L" for a tip to the left, "R" for a tip to the right and "B" for a balanced scale. [1, Siegler, R.S.]
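The same data set can also be loaded programmatically with Weka's Java API. The following minimal sketch is not part of the original paper; the file path is an assumption, and the class attribute is taken to be the last column, as in Listing 2.1.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadBalanceScale {
    public static void main(String[] args) throws Exception {
        // Assumed path to the ARFF file from Listing 2.1
        Instances data = DataSource.read("balance-scale.arff");
        // The class attribute is the last one in this data set
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Relation:   " + data.relationName());   // balance-scale
        System.out.println("Attributes: " + data.numAttributes());  // 5
        System.out.println("Instances:  " + data.numInstances());   // 625
    }
}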
2.2 Pre-Processing
Data pre-processing is a generic term that includes all methods which can process raw data if it does not have the needed quality. Such a loss of quality is caused by incomplete data resulting from hardware problems or differing opinions between analysts. Another reason could be noisy or inconsistent data, which is produced by faults in transmission and by discrepancies in names, codes and duplicated records. In order to prevent this loss of quality, researchers have to process raw data to obtain task-relevant data.
Figure 2.1: Weka Explorer: Preprocess tab with available filters
2.3 Classification
Classification works with training data sets that are given by different observations and measurements. Such a recorded data collection contains several attributes and one class attribute per instance and is used to find a model, or classifier, for the class attribute.
Essential for this classifier are the values of the other attributes, which are needed to find a model or rule set that is able to predict the class attribute on an unknown data set as accurately as possible. In order to check the accuracy of the model, it is necessary to have a testing data set that is unknown to the classifier. Normally there are no separate training and testing data sets; instead, one data set is split into a training and a testing part. To extract the maximum of information for the classifier, the training data set is bigger than the testing one. The exact split depends on the selected test options and the number of instances.
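As an illustration of such a split (an addition, not part of the original paper), the following sketch divides one data set into a training and a testing part; the 66/34 ratio and the file name are assumptions.

import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainTestSplit {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("balance-scale.arff");   // assumed file name
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));                             // shuffle before splitting
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);
        System.out.println("Training instances: " + train.numInstances());
        System.out.println("Testing instances:  " + test.numInstances());
    }
}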
[Figure 2.2 is a flowchart: one data set is split into training and testing data; a learning algorithm is trained on the training data and yields a classifier, which is applied to the testing data; with good accuracy the classifier is used for similar unknown data sets, with bad accuracy a different learning algorithm or different parameter settings are chosen.]

Figure 2.2: Flow of a rough classification
Chapter 3
Essential Application Types
3.1 The Explorer
The Explorer is the most powerful tool in the Weka workbench and gives users access to multiple functions. These can be used to fulfill a certain task, such as selecting a data set, pre-processing the data set, choosing a machine learning algorithm, classifying an attribute and evaluating the produced data of different algorithms against each other. This will be explained in the next two subsections based on the balance-scale data set.
3.1.1 Pre-Processing with Weka
Pre-processing in Weka is easy to handle, because the workbench provides several features to process raw data. Figure 3.1 displays the Preprocess panel with a bar of buttons on top for integrating a data set into the workbench. It provides not only the option to open a file, but also the opportunity to load a file from a database or a URL. Moreover, users can inspect four different parts: the current relation and its properties, a list of attributes, the properties of the selected attribute, and a visualization window that displays the selected attribute in connection with a freely selectable second attribute.
A further opportunity is to edit the data with the Edit button or to modify it with one of several filters. These are shown in Figure 2.1 and are able to clean the data from missing values or inconsistent entries, transform the data into other attribute types, or reduce the data by deleting useless instances or duplicates. However, a distinction is made between supervised and unsupervised filters: supervised filters take the class labels of a data set into account, while unsupervised filters operate without using class labels.
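Programmatically, such a filter can be applied with Weka's Java packages. The following sketch is not from the paper; ReplaceMissingValues is only one example of the unsupervised filters shown in Figure 2.1, and the file name is an assumption.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class FilterExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("balance-scale.arff");    // assumed file name
        ReplaceMissingValues filter = new ReplaceMissingValues();  // fills missing values with means/modes
        filter.setInputFormat(data);                               // tell the filter about the data format
        Instances cleaned = Filter.useFilter(data, filter);        // apply the filter to all instances
        System.out.println("Filtered instances: " + cleaned.numInstances());
    }
}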
Figure 3.1: Explorer: Pre-Processing Tab with balance-scale.arff
3.1.2 Classification with Weka
After pre-processing, the data set should be ready for determining a classifier. Weka enables users to classify with several classifiers and test options. The GUI, shown in Figure 3.2, is split into four different parts. On top is a classifier tab, where a Choose button leads to a pop-up, displayed in Figure 3.3, with several kinds of classifiers such as bayes, functions, rules, trees and others. Two tree classifiers will be briefly explained later.
The test options frame determines how the training set, as explained in Section 2.3, is obtained. At this point users get the following options:
• Use training set:
This should be chosen if the actual data set is used as both training and testing set.
• Supplied test set:
This is the option if the actual data set is used as training set and a separate testing set is available.
• Cross-Validation:
Cross-validation provides the opportunity to use a single data set. It splits the data set into m folds and uses m-1 folds as training set and the remaining fold as testing set; a sketch of this option in code is given after this list.
• Percentage split:
Splits the actual data set into a training and a testing set at a chosen percentage n.
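The following sketch (an addition for illustration, not part of the paper) performs the cross-validation option in Java: a 10-fold cross-validation of the J48 classifier on the balance-scale data set. The number of folds, the random seed and the file name are assumptions.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("balance-scale.arff");
        data.setClassIndex(data.numAttributes() - 1);
        J48 tree = new J48();                                     // the C4.5-based decision tree learner
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));   // m = 10 folds
        System.out.println(eval.toSummaryString());
        System.out.printf("Correctly classified: %.2f%%%n", eval.pctCorrect());
    }
}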
Figure 3.2: Explorer: Classify Tab
(a) Filter
(b) Results with J48 algorithm
Figure 3.3: Explorer: Classify
In this example the two chosen tree algorithms are the J48 and FT, shown
in Figure 3.3(a). The J48 algorithm is based on the C4.5 decision tree algorithm, which is explained in pseudo code in Listing 3.1.
Listing 3.1: C4.5 Pseudo code described by Quinlan
1. Check for any base cases.
2. For each attribute a:
   a. Find the normalized information gain from splitting on a.
3. Let a_best be the attribute with the highest normalized information gain.
4. Create a decision node that splits on a_best.
5. Recurse on the sublists obtained by splitting on a_best,
   and add those nodes as children of the node.
In contrast to the J48 algorithm, the FT tree algorithm builds a functional tree with oblique splits and linear functions at the leaves. If these two algorithms are applied to the class attribute, the result list displays the executed algorithms, and a left-click shows the results in the right frame, called the classifier output. At this point, Figure 3.3 displays the results of the J48 and Figure 3.4 the results of the FT algorithm.
Figure 3.4: Explorer: Classify with FT algorithm
A comparison of the correctly classified instances implies that the FT algorithm, with 90.72% correctly classified instances, is a better choice than the J48 with 76.64%. Furthermore, the error rates of the FT are lower and its accuracies by class are higher than those of the J48 results.
Weka also allows the use of different visualizations, such as visualizing trees, classifier errors and others. These can be applied with a right-click on the result list; Figure 3.5 displays the tree visualization of the FT results. There it can be seen that the leaves represent class values and the nodes are functions which are derived from the values of the attributes.
This task and the following ones can also be implemented on any Java-based platform. To do so, the Java packages that Weka provides for each task have to be imported into a Java file.
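As a sketch of this (not part of the original paper), the comparison of the two tree classifiers could look as follows. The J48 class is part of every Weka distribution; the FT class ships with Weka 3.6 and may have to be installed as a separate package in newer releases, and the file name is again an assumption.

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.FT;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareTreeClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("balance-scale.arff");
        data.setClassIndex(data.numAttributes() - 1);
        Classifier[] classifiers = { new J48(), new FT() };   // the two tree learners from the text
        for (Classifier c : classifiers) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%s: %.2f%% correctly classified%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}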
Figure 3.5: Explorer: FT tree visualisation
3.1.3 Cluster, Associate, Select Attributes, Visualization
The "Cluster" panel is similar to those for "Classify", where users are able
to choose an algorithm which determines a priority class in each cluster and
compares their matches with the preassigned class. (Figure 4.6)
Another panel is the "Associate", which allows users to choose one of six
algorithms to determine association rules between the attributes.
The selecting panel is more complex, because it provides us the opportunity to select an attribute evaluation algorithm and a search algorithm, that
are used to nd an best matching attribute in dependencies of a preassigned
attribute. (Figure 4.7)
The visualization panel visualizes the data set and not the results of an
classication or clustering. It helps to understand the dependencies between
the attributes and their distribution.
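A small sketch (an assumed example, not from the paper) of what the Cluster panel does: the class attribute is removed, SimpleKMeans is run, and the clustering summary is printed. Choosing three clusters, one per class value, is an assumption.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClusterExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("balance-scale.arff");   // assumed file name
        Remove remove = new Remove();
        remove.setAttributeIndices("last");          // drop the class attribute before clustering
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3);                    // one cluster per class value (L, B, R)
        kmeans.buildClusterer(noClass);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(kmeans);
        eval.evaluateClusterer(noClass);
        System.out.println(eval.clusterResultsToString());
    }
}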
Figure 3.6: Explorer: Cluster
Figure 3.7: Explorer: Select Attributes
3.2 The Experimenter, Knowledge Flow and Simple CLI
The Experimenter view, shown in Figure 3.8, allows users to run algorithms against each other on different data sets. It provides three panels to set up, run and analyse the output and should be used to evaluate algorithms. In order to use an algorithm that is not included in the list, users are able to implement their own learning algorithm, which must be written in Java.
Figure 3.8: Experimenter
Knowledge Flow and Simple CLI do not have the same functionality as the Explorer, but they offer better opportunities to work with large data sets. For example, the Knowledge Flow helps users to outsource and configure methods on different kinds of memory in order not to overload the main memory. It enables users to develop a data flow to analyse the algorithms and data sets. Furthermore, the Simple CLI gives the possibility to use different options of the algorithms that are hidden in the Explorer.
A short classification with the J48 algorithm on the balance-scale data set is represented in Figure 3.9.
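In the Simple CLI, such a classification could be started with a single command; the following line is only an assumed example (the data set path depends on the local installation) and runs a 10-fold cross-validation of J48 on the given training file:

java weka.classifiers.trees.J48 -t balance-scale.arff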
Figure 3.9: Knowledge Flow
Chapter 4
Review
This chapter deals with strengths and weaknesses of the Weka workbench.
Weka is distributed under the GNU General Public License. Therefore it can be freely used and further developed by third-party suppliers under the GNU GPL.
A simple GUI makes it easier for the user to manage an evaluation of data sets. It provides some useful features and splits them into self-explanatory panels - users only have to decide between these panels and methods to evaluate.
However, one unpleasant aspect of Weka is that it is not compatible with multi-relational data sets. Users who process these types of data sets have to use a MARFF extension that converts a multi-relational data set into a single-relational one.
In conclusion, Weka is a good choice for people who want to process and evaluate their data sets easily by using predefined filters and machine learning algorithms. In addition, Weka provides an excellent opportunity to integrate its features into any Java-based platform. This makes Weka portable.
Chapter 5
Related Works
Data mining tools are in high demand in various scientific areas, and scientists have to decide between several tools. These differ, for example, in properties such as free availability, the number of useful features or how easily new features can be implemented. A short list of data mining tools that scientists used in the year 2011 is shown in Figure 5.1 below.
Figure 5.1: Data Mining Tools Usage [2]
As one can see in Figure 5.1, 'RapidMiner' was the most commonly used data mining tool in 2011. It belongs to the freely available software applications and surpasses the Weka workbench, which is placed at position seven.
'RapidMiner' was started by Ralf Klinkenberg, Ingo Mierswa and Simon Fischer at the Artificial Intelligence Unit of the University of Dortmund in 2001. It was first named 'YALE' and is still being developed. The project has integrated the Weka learning schemes and allows users to take advantage of additional modeling methods such as 3D plots.
Chapter 6
Summary
Weka is a great choice for researchers who have to classify, cluster, search for association rules, pre-process, select attributes and visualize several data sets, because it enables them to find an optimal algorithm out of a collection. With the ability to implement new algorithms and new tabs, the workbench keeps growing and is becoming more comfortable to handle. Moreover, the underlying Java implementation allows Weka and its features to be integrated into any Java-based platform and makes it portable, which leads to even greater awareness.
Bibliography
[1] Generated to model psychological experiments reported by Siegler, R.S. (1976). Three Aspects of Cognitive Development. Cognitive Psychology, 8, 481-520. http://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/
[2] Data Mining Tool Usage. http://www.kdnuggets.com/polls/2011/tools-analytics-data-mining.html
[3] Szugat, Martin: Im Datenrausch: Praktische Einführung in das Data Mining mit Weka 3.4. http://bioweka.sourceforge.net/download/LE_1.05_75-79.pdf
[4] Witten, Ian H. and Frank, Eibe and Hall, Mark A.: Data Mining: Practical Machine Learning Tools and Techniques. Third Edition. Morgan Kaufmann, 2011.
[5] Frank, Eibe and Hall, Mark and Trigg, Len and Holmes, Geoffrey and Witten, Ian H.: Data mining in bioinformatics using Weka. Volume 20, Number 20, Pages 2479-2481, Year 2004.
[6] Prof. Dr. Thomas Seidl, Slides from the Data Mining lecture, Year 2011.
Appendix A
List of Abbreviations
ARFF Attribute Relation File Format
CLI Command Line Interpreter
CSV Comma Separated Values
GNU GNU is not Unix
GPL General Public License
GUI Graphical User Interface
IT Information Technology
MARFF Multi Attribute Relation File Format
YALE Yet Another Learning Environment