Download WEKA Slides - San Diego Supercomputer Center

Document related concepts
no text concepts found
Transcript
An Introduction to WEKA
As presented by PACE
8/9/2012
SAN DIEGO SUPERCOMPUTER CENTER
Content
• What is WEKA?
• The Explorer Application
•
•
•
•
•
•
Preprocess
Classify
Cluster
Associate
Select Attributes
Visualize
• Weka on Trestles
• References and Resources
2
SAN DIEGO SUPERCOMPUTER CENTER
5/8/2017
What is WEKA?
• Weka is a bird found only in New Zealand.
• Waikato Environment for Knowledge Analysis
• Weka is a data mining/machine learning tool developed by
Department of Computer Science, University of Waikato, New
Zealand.
• Weka is a collection of machine learning algorithms for data
mining tasks.
• The algorithms can either be applied directly to a dataset or
called from Java code.
• Weka contains tools for data pre-processing, classification,
regression, clustering, association rules, and visualization. It is
also well-suited for developing new machine learning schemes.
• Weka is open source software in JAVA issued under the GNU
General Public License
3
SAN DIEGO SUPERCOMPUTER CENTER
5/8/2017
Download and Install WEKA
• Website:
http://www.cs.waikato.ac.nz/~ml/weka/index.html
• Support multiple platforms (written in java):
• Windows, Mac OS X and Linux
• Datasets(iris.arff, weather.arff)
• Available on Trestles at: /home/diag/opt/weka/data
• Available with Download: …../weka/data/
4
SAN DIEGO SUPERCOMPUTER CENTER
5/8/2017
Main Features
•
•
•
•
•
5
49 data preprocessing tools
76 classification/regression algorithms
8 clustering algorithms
3 algorithms for finding association rules
15 attribute/subset evaluators + 10 search
algorithms for feature selection
SAN DIEGO SUPERCOMPUTER CENTER
5/8/2017
Main GUI
• Three graphical user interfaces
• “The Explorer” (exploratory data analysis)
•
•
•
•
•
•
pre-process data
build “classifiers”
cluster data
find associations
attribute selection
data visualization
• “The Experimenter” (experimental environment)
• used to compare performance of different learning
schemes
• “The KnowledgeFlow” (new process model
inspired interface)
• Java-Beans-based interface for setting up and running
machine learning experiments.
• Command line Interface (“Simple CLI”)
6
More at: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
5/8/2017
SAN DIEGO SUPERCOMPUTER CENTER
Content
• What is WEKA?
• The Explorer:
•
•
•
•
•
•
Preprocess
Classify
Cluster
Associate
Select Attributes
Visualize
• Weka on Trestles
• References and Resources
7
SAN DIEGO SUPERCOMPUTER CENTER
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
8
5/8/2017
WEKA:: Explorer: Preprocess
• Data format
• Uses flat text files to describe the data
• Data can be imported from a file in various formats:
ARFF, CSV, C4.5, binary
• Data can also be read from a URL or from an SQL
database (using JDBC)
SAN DIEGO SUPERCOMPUTER CENTER
WEKA:: ARFF file format
@relation heart-disease-simplified
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
age numeric
sex { female, male}
chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
cholesterol numeric
exercise_induced_angina { no, yes}
class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
A more thorough description is available here
http://www.cs.waikato.ac.nz/~ml/weka/arff.html
SAN DIEGO SUPERCOMPUTER CENTER
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
1
1
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
1
2
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
1
3
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
1
4
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
1
5
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
1
6
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
1
7
5/8/2017
WEKA:: Explorer: Preprocess
• Used to define filters to transform
Data.
• WEKA contains filters for:
• Discretization, normalization, resampling,
attribute selection, transforming, combining
attributes, etc
SAN DIEGO SUPERCOMPUTER CENTER
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
1
9
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
2
0
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
2
1
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
2
2
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
2
3
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
2
4
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
2
5
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
2
6
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
2
7
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
2
8
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
2
9
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
3
0
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
3
1
5/8/2017
WEKA:: Explorer: building “classifiers”
• Classifiers in WEKA are models for
predicting nominal or numeric quantities
• Implemented learning schemes include:
• Decision trees and lists, instance-based
classifiers, support vector machines, multi-layer
perceptrons, logistic regression, Bayes’ nets, …
• “Meta”-classifiers include:
• Bagging, boosting, stacking, error-correcting
output codes, locally weighted learning, …
SAN DIEGO SUPERCOMPUTER CENTER
Decision Tree Induction: Training Dataset
age
<=30
<=30
31…40
>40
>40
>40
31…40
<=30
<=30
>40
<=30
31…40
31…40
>40
income student credit_rating
fair
no
high
excellent
no
high
fair
no
high
fair
no
medium
fair
yes
low
excellent
yes
low
excellent
yes
low
fair
no
medium
fair
yes
low
fair
yes
medium
excellent
yes
medium
excellent
no
medium
fair
yes
high
excellent
no
medium
buys_computer
no
no
yes
yes
yes
no
yes
no
yes
yes
yes
yes
yes
no
This follows an example of Quinlan’s ID3
SAN DIEGO SUPERCOMPUTER CENTER
May 8, 2017
33
Output: A Decision Tree for “buys_computer”
age?
<=30
31..40
overcast
student?
no
no
yes
yes
yes
SAN DIEGO SUPERCOMPUTER CENTER
>40
credit rating?
excellent
fair
yes
May 8, 2017
34
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
3
6
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
3
7
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
3
8
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
3
9
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
4
0
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
4
1
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
4
2
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
4
3
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
4
4
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
4
5
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
4
6
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
4
7
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
4
8
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
4
9
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
5
0
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
5
1
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
5
2
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
5
3
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
5
4
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
5
5
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
5
6
5/8/2017
Explorer: Select Attributes
• Panel that can be used to investigate which
(subsets of) attributes are the most predictive
ones
• Attribute selection methods contain two parts:
• A search method: best-first, forward selection, random,
exhaustive, genetic algorithm, ranking
• An evaluation method: correlation-based, wrapper,
information gain, chi-squared, …
• Very flexible: WEKA allows (almost) arbitrary
combinations of these two
5
7
SAN DIEGO SUPERCOMPUTER CENTER
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
5
8
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
5
9
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
6
0
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
6
1
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
6
2
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
6
3
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
6
4
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
6
5
5/8/2017
Explorer: Visualize
• Visualization very useful in practice: e.g. helps
to determine difficulty of the learning problem
• WEKA can visualize single attributes (1-d) and
pairs of attributes (2-d)
• To do: rotating 3-d visualizations (Xgobi-style)
• Color-coded class values
• “Jitter” option to deal with nominal attributes
(and to detect “hidden” data points)
• “Zoom-in” function
SAN DIEGO SUPERCOMPUTER CENTER
5/8/2017
66
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
6
7
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
6
8
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
6
9
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
7
0
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
7
1
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
7
2
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
7
3
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
7
4
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
7
5
5/8/2017
University of Waikato SAN DIEGO SUPERCOMPUTER CENTER
7
6
5/8/2017
Using Weka On Trestles
SAN DIEGO SUPERCOMPUTER CENTER
Using Weka on Trestles
•
•
•
•
•
Shared Resources
Batch and Interactive
Use GUI and Command Line
Use GUI on login nodes to create command line
Use command line to run interactive or batch
jobs on production nodes
SAN DIEGO SUPERCOMPUTER CENTER
Weka Gui
• To launch Weka Gui on a:
• Windows machine
• to run software on remote machine with GUI requires a secure shell
with x forwarding enabled to establish a remote connection and an X
Server to handle the local display.
– Suggested software putty and Xming
» http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
» http://www.straightrunning.com/XmingNotes/
• Linux and MAC OS X support X Forwarding
• Mac users need to run Applications > Utilities > Xterm
• ssh –Y [email protected]
• Load weka module
• Weka installation available at: /home/diag/opt/weka
• At command prompt > weka
SAN DIEGO SUPERCOMPUTER CENTER
PBS Script
SAN DIEGO SUPERCOMPUTER CENTER
Output file
SAN DIEGO SUPERCOMPUTER CENTER
Hands On with Weka
SAN DIEGO SUPERCOMPUTER CENTER
The Weather Data Set
.arff file
Weather.arff file
•
•
•
Available on Trestles at: /home/diag/opt/weka/data
On line: http://www.hakank.org/weka/
With Weka download
Data Set:
@relation PlayTennis
@attribute
@attribute
@attribute
@attribute
@attribute
@attribute
day numeric
outlook {Sunny, Overcast, Rain}
temperature {Hot, Mild, Cool}
humidity {High, Normal}
wind {Weak, Strong}
playTennis {Yes, No}
@data
1,Sunny,Hot,High,Weak,No,?
2,Sunny,Hot,High,Strong,No,?
3,Overcast,Hot,High,Weak,Yes,?
4,Rain,Mild,High,Weak,Yes,?
5,Rain,Cool,Normal,Weak,Yes,?
6,Rain,Cool,Normal,Strong,No,?
7,Overcast,Cool,Normal,Strong,Yes,?
8,Sunny,Mild,High,Weak,No,?
.
SAN DIEGO SUPERCOMPUTER CENTER
The Problem
• Each instance describes the facts of the day and
the action of the observed person (played or no
play).
• The Data Set
• 14 Instances
• 6 attributes (day, outlook, temp, humidity, wind, play
tennis)
• Based on the given records we can assess
which factors affected the person's decision
about playing tennis.
SAN DIEGO SUPERCOMPUTER CENTER
The Question
• Use j48 decision tree learner to model for class
attribute play tennis
• Make prediction for “play”.
• Make predictions for the ‘temperature’ attribute. Do you
need to do any additional data preparation?
SAN DIEGO SUPERCOMPUTER CENTER
Result
SAN DIEGO SUPERCOMPUTER CENTER
References and Resources
• References:
• WEKA website:
http://www.cs.waikato.ac.nz/~ml/weka/index.html
• WEKA Tutorial:
• Machine Learning with WEKA: A presentation demonstrating all graphical
user interfaces (GUI) in Weka.
• A presentation which explains how to use Weka for exploratory data
mining.
• WEKA Data Mining Book:
• Ian H. Witten and Eibe Frank, Data Mining: Practical Machine
Learning Tools and Techniques (Second Edition)
• WEKA Wiki:
http://weka.sourceforge.net/wiki/index.php/Main_Page
SAN DIEGO SUPERCOMPUTER CENTER