Introduction to WEKA (1)
Learning Objectives
• What is WEKA?
• How to start?
• GUI Interface
• Weka Explorer application
– Walk through the 3 phases of DM using the iris database
What is WEKA?
• Open source tool created by researchers at the University of Waikato in New Zealand
• Started in 1993 (Tcl and C based)
• First Java version was in 1997
• In 2005 received the SIGKDD Data Mining and Knowledge Discovery Service Award
• Latest version is 3.7 (developer version)
• We use the stable version, 3.6.x
What is WEKA?
• Core is a collection of open source machine learning algorithms
o Pre-processing
o Classifiers
o Clustering
o Association rules
• Both GUI and command-line interfaces
How to start - Installation
• The lab computers have WEKA installed
• To install on your laptop (do it after class please)
– Download the software (same version as in the lab) from
http://www.cs.waikato.ac.nz/ml/weka/
– If Java VM 1.6 is not installed on your laptop, choose the download that includes the Java VM
How to start – Get some data
• From campus website:
– http://192.168.10.91/insu/CP3300/index.html
– http://192.168.2.91/insu/CP3300/index.html
(Later) Would you like more choices for the assignment?
• Explore more from well known repositories:
– http://mlearn.ics.uci.edu/MLRepository.html
– http://www.kdnuggets.com/datasets/index.html
How to start – Get some data
• Supported data file formats
– Flat text file formats: arff, csv, libsvm …
– Database: remote SQL database (using a JDBC driver)
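To make the arff format more concrete, here is a minimal Python sketch that reads a tiny ARFF-style text. The header keywords (@relation, @attribute, @data) belong to the ARFF format; the helper name parse_arff and the two sample rows are illustrative assumptions, and the sketch ignores quoting, sparse data, and other features that WEKA's own loader handles.

```python
def parse_arff(text):
    """Parse a simplified ARFF text into (relation, attribute names, rows).

    Hypothetical helper for illustration only: it handles comments,
    the three header keywords, and comma-separated data rows.
    """
    relation = None
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            # "@attribute <name> <type>": keep only the name here
            attributes.append(line.split(None, 2)[1])
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Two illustrative rows written in the style of the iris data.
SAMPLE = """\
% simplified example, not the full iris file
@relation iris
@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, attributes, len(rows))
```

The values stay as strings here; converting numeric attributes to floats is left out to keep the sketch short.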
GUI Interface
• Launch from Program -> Weka 3.6.x -> Weka (with console)
Explorer
• Where we conduct the 3 phases of data mining:
– Data pre-processing
– Data mining
– Presenting & interpreting the result
KnowledgeFlow
• Visual user interface in which data sources, classifiers, etc. are connected graphically
Experimenter
• Makes it easier to compare the performance of different learning schemes
• Can save results to a file or database
• Suitable for classification and regression problems
Explorer – Load database
• Open Explorer from the Weka GUI Chooser
• Click the ‘Open file…’ button. In the pop-up, navigate to
C:\Program Files\Weka-3-6\data and select the file iris.
• Click Open.
Note: under ‘Files of type’ you should see Arff data files (*.arff)
Explorer – Iris database
• Fisher’s Iris data is a classic reference in the pattern recognition literature.
• The database has 150 instances (entries, records) and contains 3 classes of iris plant.
• Each class has 50 instances.
• 5 attributes in total
– Sepal length
– Sepal width
– Petal length
– Petal width
– Class (Setosa, Versicolour and Virginica)
Explorer – Examine data summaries
• The purpose is to identify potential data problems and apply
suitable pre-processing techniques for optimal evaluation.
• Select an attribute and examine
1. Summary statistics (data type, any missing data, …)
2. Visualization
Explorer – Build Cluster
• Weka provides models for popular clustering algorithms
• Clustering assigns a set of instances to a small number of subsets
according to certain criteria
• Example: from the iris database we know there are 3 types of iris.
Can we build a model (a set of criteria) that assigns the 150 instances
to 3 subsets? And how accurately can we do so?
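As a rough illustration of what a k-means clusterer does, the following Python sketch alternates between assigning points to the nearest centroid and recomputing centroids. It is a simplified sketch, not WEKA's SimpleKMeans: the nine (petal length, petal width) points are made up for illustration, and the starting centroids are fixed by hand rather than chosen from a random seed.

```python
import math


def kmeans(points, initial_centroids, max_iterations=500):
    """Plain k-means: alternate assignment and centroid update until stable."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # nothing moved, so we have converged
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment, centroids


# Illustrative (petal length, petal width) pairs, not rows from the iris file.
points = [
    (1.4, 0.2), (1.3, 0.2), (1.5, 0.3),
    (4.7, 1.4), (4.5, 1.5), (4.9, 1.5),
    (6.0, 2.5), (5.9, 2.1), (5.6, 2.4),
]
assignment, centroids = kmeans(points, [points[0], points[3], points[6]])
print(assignment)
print(centroids)
```

With real data the result depends on the starting centroids, which is why a clusterer usually exposes a seed setting.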
Explorer – Build Cluster
1. From the Cluster tab, click the Choose button
2. Select SimpleKMeans
Explorer – Build a Cluster
3. Right-click inside the scheme box
4. In the pop-up window, set numClusters to 3
5. Click the Start button
Explorer – Build Cluster
Result (3 sections in total)
Select a result from the result list and right-click
Save the result to a file
Explorer – Build Cluster
• Interpret the result (sections 1 and 2)
=== Run information ===
Scheme:       weka.clusterers.SimpleKMeans -N 3 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
              class
Test mode:    evaluate on training data
=== Model and evaluation on training set ===
Explorer – Build Cluster
• Interpret the result (cont.)
kMeans
======
Number of iterations: 3
Within cluster sum of squared errors: 7.817456892309574
Missing values globally replaced with mean/mode
Cluster centroids:
                                 Cluster#
Attribute       Full Data                0              1                2
                    (150)             (50)           (50)             (50)
==========================================================================
sepallength        5.8433            5.936          5.006            6.588
sepalwidth          3.054             2.77          3.418            2.974
petallength        3.7587             4.26          1.464            5.552
petalwidth         1.1987            1.326          0.244            2.026
class         Iris-setosa  Iris-versicolor    Iris-setosa   Iris-virginica
Clustered Instances
0      50 ( 33%)
1      50 ( 33%)
2      50 ( 33%)
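One way to read the centroid table is to ask which centroid a given flower is closest to. The Python sketch below uses the four numeric centroid values from the output above. It is a simplified assumption: it compares raw Euclidean distances over the numeric attributes only, whereas the run above also included the class attribute and WEKA's distance function may normalise attribute ranges. The sample measurement is hypothetical.

```python
import math

# Centroids of the four numeric attributes, copied from the output above:
# (sepallength, sepalwidth, petallength, petalwidth)
CENTROIDS = {
    0: (5.936, 2.77, 4.26, 1.326),
    1: (5.006, 3.418, 1.464, 0.244),
    2: (6.588, 2.974, 5.552, 2.026),
}


def nearest_cluster(instance):
    """Return the cluster whose centroid is closest in raw Euclidean distance."""
    return min(CENTROIDS, key=lambda c: math.dist(instance, CENTROIDS[c]))


# Hypothetical measurement, not a row taken from the slides.
flower = (5.0, 3.4, 1.5, 0.2)
print(nearest_cluster(flower))
```

A flower with short, narrow petals lands in cluster 1, whose class value in the table is Iris-setosa.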
Explorer – Build Cluster
• Visualise result
• Y axis – Mining result
• X axis – value of class attribute from iris database
Which type of iris does the red cluster correspond to?
Right-click the red cluster.
Explorer – Build Cluster
• What if we want to use only 2 attributes, petal width and petal length,
to build a clusterer?
1. Go back to the Preprocess tab
2. Click the Choose button under the Filter section
3. Expand unsupervised -> attribute and select Remove
4. Click on Remove
5. In the pop-up window, enter 1-2 for attributeIndices
6. Click OK to go back to Preprocess
Explorer – Build Cluster
• What if we want to use only 2 attributes, petal width and petal length,
to build a clusterer?
7. Click the Apply button in the Filter section
8. We should now see only 3 attributes (petallength, petalwidth, class)
9. Switch to the Cluster tab. Click the Start button to build a new clusterer
To revert the attribute removal, click the Undo button
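The effect of removing attributes 1-2 can be sketched in a few lines of Python. This only illustrates the idea of dropping a 1-based, inclusive column range; the function name remove_attributes and the single sample row are assumptions made for the example, not part of WEKA.

```python
def remove_attributes(header, rows, first, last):
    """Drop the 1-based, inclusive attribute range first..last from a table."""
    keep = [i for i in range(len(header)) if not (first - 1 <= i <= last - 1)]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


header = ["sepallength", "sepalwidth", "petallength", "petalwidth", "class"]
rows = [[5.1, 3.5, 1.4, 0.2, "Iris-setosa"]]  # one illustrative row

new_header, new_rows = remove_attributes(header, rows, 1, 2)
print(new_header)
print(new_rows)
```

Removing indices 1-2 drops the two sepal attributes and leaves the two petal attributes plus class.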
Explorer – Build Cluster
• What if we want to use only petal width and petal length
to build a clusterer?
10. Observe the three sections of output and compare them with those of
the clusterer built with both petal and sepal attributes.
Explorer – Build Classifier
• Weka provides models for predicting nominal or numeric quantities
• Example: suppose we build a model from the given iris data.
Given data on 100 new iris flowers, can our model tell us the type
of each new flower? And how accurately?
Explorer – Build Classifier
• The iris database has 150 instances.
• Let’s assume we have collected 100 instances of iris data to build a model.
• Use the remaining 50 instances to test how ‘good’ our model is.
• Weka comes with many classification algorithms; we use J48. Details of
this algorithm are covered in the lectures in weeks 6-8.
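The run shown later in these slides uses a 66% percentage split and reports 51 test instances, so the split is close to, but not exactly, 100/50. A small Python sketch of that arithmetic follows; it assumes the training size is the percentage rounded to the nearest whole instance, which is consistent with the 51 test instances reported but is not taken from WEKA's source.

```python
def percentage_split(num_instances, train_percent):
    """Return (train size, test size) for a percentage split.

    Assumes the training size is the percentage rounded to the nearest
    whole instance; the remainder is used for testing.
    """
    train_size = round(num_instances * train_percent / 100)
    return train_size, num_instances - train_size


train_size, test_size = percentage_split(150, 66)
print(train_size, test_size)
```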
Explorer – Build Classifier
Result (3 sections in total)
Select a result from the result list and right-click
Save the result to a file
Explorer – Build Classifier
• Interpret the result
=== Run information ===
Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
              class
Test mode:    split 66.0% train, remainder test
Explorer – Build Classifier
• Interpret the result (cont.)
=== Classifier model (full training set) ===
J48 pruned tree
------------------
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)
Number of Leaves  : 5
Size of the tree  : 9
Time taken to build model: 0 seconds
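The pruned tree above reads as a set of nested if/else tests on the two petal attributes. The Python sketch below transcribes it branch by branch; the function name classify_iris and the sample measurements are illustrative assumptions. In each leaf, the first number in brackets is the count of training instances that reach the leaf and the second, where present, is how many of them are misclassified.

```python
def classify_iris(petal_length, petal_width):
    """Follow the pruned J48 tree printed above, one comparison per branch."""
    if petal_width <= 0.6:
        return "Iris-setosa"
    if petal_width <= 1.7:
        if petal_length <= 4.9:
            return "Iris-versicolor"
        if petal_width <= 1.5:
            return "Iris-virginica"
        return "Iris-versicolor"
    return "Iris-virginica"


# Hypothetical measurements (petal length, petal width) for illustration.
for sample in [(1.4, 0.2), (4.5, 1.5), (5.0, 1.5), (6.0, 2.5)]:
    print(sample, classify_iris(*sample))
```

The function has five return statements, one per leaf, matching "Number of Leaves : 5".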
Explorer – Build Classifier
• Interpret the result (cont.)
=== Evaluation on test split ===
=== Summary ===
Correctly Classified Instances          49               96.0784 %
Incorrectly Classified Instances         2                3.9216 %
Kappa statistic                          0.9408
Mean absolute error                      0.0396
Root mean squared error                  0.1579
Relative absolute error                  8.8979 %
Root relative squared error             33.4091 %
Total Number of Instances               51
Explorer – Build Classifier
• Interpret the result (cont.)
=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 1         0         1           1        1           1          Iris-setosa
                 1         0.063     0.905       1        0.95        0.969      Iris-versicolor
                 0.882     0         1           0.882    0.938       0.967      Iris-virginica
Weighted Avg.    0.961     0.023     0.965       0.961    0.961       0.977
=== Confusion Matrix ===
  a  b  c   <-- classified as
 15  0  0 |  a = Iris-setosa
  0 19  0 |  b = Iris-versicolor
  0  2 15 |  c = Iris-virginica
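Several of the reported figures can be recomputed from the confusion matrix alone. The Python sketch below derives accuracy, the kappa statistic, and per-class precision, recall and F-measure using the standard textbook definitions; the function name summarize is an assumption for the example, and the sketch makes no claim about how WEKA computes these values internally.

```python
# Confusion matrix copied from the output above: rows are actual classes,
# columns are predicted classes, in the order setosa, versicolor, virginica.
CLASSES = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
MATRIX = [
    [15, 0, 0],
    [0, 19, 0],
    [0, 2, 15],
]


def summarize(matrix):
    """Compute accuracy, Cohen's kappa, and per-class precision/recall/F1."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]

    accuracy = correct / total
    # Agreement expected by chance, from the row and column totals.
    expected = sum(row_sums[i] * col_sums[i] for i in range(n)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)

    per_class = []
    for i in range(n):
        tp = matrix[i][i]
        precision = tp / col_sums[i] if col_sums[i] else 0.0
        recall = tp / row_sums[i] if row_sums[i] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        per_class.append((precision, recall, f1))
    return accuracy, kappa, per_class


accuracy, kappa, per_class = summarize(MATRIX)
print(f"accuracy = {accuracy:.4%}, kappa = {kappa:.4f}")
for name, (precision, recall, f1) in zip(CLASSES, per_class):
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} F={f1:.3f}")
```

The accuracy (49 of 51 correct) and kappa agree with the 96.0784 % and 0.9408 in the summary, and the two misclassified virginica flowers explain the versicolor precision of 0.905 and the virginica recall of 0.882.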
Revisit the output, in text and other graphical formats, when classification is
taught in lectures