Download Machine Learning with WEKA

Document related concepts
no text concepts found
Transcript
„
Machine Learning with
WEKA
„
WEKA: A Machine
Learning Toolkit
The Explorer
•
•
Bernhard Pfahringer
•
(based on material by Eibe Frank, Mark
Hall, and Peter Reutemann)
Department of Computer Science,
University of Waikato, New Zealand
•
•
„
„
„
„
Classification and
Regression
Clustering
Association Rules
Attribute Selection
Data Visualization
The Experimenter
The Knowledge
Flow GUI
Other Utilities
Conclusions
WEKA: the bird
The Weka or woodhen (Gallirallus australis) is an
endemic bird of New Zealand. (Source: WikiPedia)
Copyright: Martin Kramer ([email protected])
5/14/2007
University of Waikato
2
WEKA: the software
„
„
„
„
Machine learning/data mining software written in
Java (distributed under the GNU Public License)
Used for research, education, and applications
Complements “Data Mining” by Witten & Frank
Main features:
Comprehensive set of data pre-processing tools,
learning algorithms and evaluation methods
‹ Graphical user interfaces (incl. data visualization)
‹ Environment for comparing learning algorithms
‹
5/14/2007
University of Waikato
3
History
„
Project funded by the NZ government since 1993
‹
‹
‹
5/14/2007
Develop state-of-the art workbench of data mining tools
Explore fielded applications
Develop new fundamental methods
University of Waikato
4
History (2)
„
„
„
Late 1992 - funding was applied for by Ian Witten
1993 - development of the interface and infrastructure
‹ WEKA acronym coined by Geoff Holmes
‹ WEKA’s file format “ARFF” was created by Andrew Donkin
ARFF was rumored to stand for Andrew’s Ridiculous File Format
Sometime in 1994 - first internal release of WEKA
‹ TCL/TK user interface + learning algorithms written mostly in C
‹ Very much beta software
‹ Changes for the b1 release included (among others):
“Ambiguous and Unsupported menu commands removed.”
“Crashing processes handled (in most cases :-)”
„
October 1996 - first public release: WEKA 2.1
5/14/2007
University of Waikato
5
History (3)
„
„
„
„
July 1997 - WEKA 2.2
‹ Schemes: 1R, T2, K*, M5, M5Class, IB1-4, FOIL, PEBLS, support
for C5
‹ Included a facility (based on Unix makefiles) for configuring and
running large scale experiments
Early 1997 - decision was made to rewrite WEKA in Java
‹ Originated from code written by Eibe Frank for his PhD
‹ Originally codenamed JAWS (JAva
JA Weka System)
May 1998 - WEKA 2.3
‹ Last release of the TCL/TK-based system
Mid 1999 - WEKA 3 (100% Java) released
‹ Version to complement the Data Mining book
‹ Development version (including GUI)
5/14/2007
University of Waikato
6
TCL/TK interface of Weka 2.1
The GUI back then…
5/14/2007
University of Waikato
7
WEKA: versions
„
There are several versions of WEKA:
WEKA 3.4: “book version” compatible with
description in data mining book
‹ WEKA 3.5.5: “development version” with lots of
improvements
‹
„
This talk is based on a nightly snapshot of WEKA
3.5.5 (12-Feb-2007)
5/14/2007
University of Waikato
8
WEKA only deals with “flat” files
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
5/14/2007
University of Waikato
9
WEKA only deals with “flat” files
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
5/14/2007
University of Waikato
10
java weka.gui.GUIChooser
5/14/2007
University of Waikato
11
5/14/2007
University of Waikato
12
5/14/2007
University of Waikato
13
java -jar weka.jar
5/14/2007
University of Waikato
14
Explorer: pre-processing the data
„
„
„
„
Data can be imported from a file in various
formats: ARFF, CSV, C4.5, binary
Data can also be read from a URL or from an SQL
database (using JDBC)
Pre-processing tools in WEKA are called “filters”
WEKA contains filters for:
‹
5/14/2007
Discretization, normalization, resampling, attribute
selection, transforming and combining attributes, …
University of Waikato
15
5/14/2007
University of Waikato
16
5/14/2007
University of Waikato
17
5/14/2007
University of Waikato
18
5/14/2007
University of Waikato
19
5/14/2007
University of Waikato
20
5/14/2007
University of Waikato
21
5/14/2007
University of Waikato
22
5/14/2007
University of Waikato
23
5/14/2007
University of Waikato
24
5/14/2007
University of Waikato
25
5/14/2007
University of Waikato
26
5/14/2007
University of Waikato
27
5/14/2007
University of Waikato
28
5/14/2007
University of Waikato
29
5/14/2007
University of Waikato
30
5/14/2007
University of Waikato
31
5/14/2007
University of Waikato
32
5/14/2007
University of Waikato
33
5/14/2007
University of Waikato
34
5/14/2007
University of Waikato
35
5/14/2007
University of Waikato
36
Explorer: building “classifiers”
„
„
Classifiers in WEKA are models for predicting
nominal or numeric quantities
Implemented learning schemes include:
‹
„
Decision trees and lists, instance-based classifiers,
support vector machines, multi-layer perceptrons,
logistic regression, Bayes’ nets, …
“Meta”-classifiers include:
‹
5/14/2007
Bagging, boosting, stacking, error-correcting output
codes, locally weighted learning, …
University of Waikato
37
5/14/2007
University of Waikato
38
5/14/2007
University of Waikato
39
5/14/2007
University of Waikato
40
5/14/2007
University of Waikato
41
5/14/2007
University of Waikato
42
5/14/2007
University of Waikato
43
5/14/2007
University of Waikato
44
5/14/2007
University of Waikato
45
5/14/2007
University of Waikato
46
5/14/2007
University of Waikato
47
5/14/2007
University of Waikato
48
5/14/2007
University of Waikato
49
5/14/2007
University of Waikato
50
5/14/2007
University of Waikato
51
5/14/2007
University of Waikato
52
5/14/2007
University of Waikato
53
5/14/2007
University of Waikato
54
5/14/2007
University of Waikato
55
5/14/2007
University of Waikato
56
5/14/2007
University of Waikato
57
5/14/2007
University of Waikato
58
5/14/2007
University of Waikato
59
5/14/2007
University of Waikato
60
5/14/2007
University of Waikato
61
5/14/2007
University of Waikato
62
5/14/2007
University of Waikato
63
5/14/2007
University of Waikato
64
5/14/2007
University of Waikato
65
5/14/2007
University of Waikato
66
5/14/2007
University of Waikato
67
5/14/2007
University of Waikato
68
5/14/2007
University of Waikato
69
5/14/2007
University of Waikato
70
5/14/2007
University of Waikato
71
5/14/2007
QuickTime?and a TIFF (LZW) decompressor are needed to see this picture.
University of Waikato
72
5/14/2007
QuickTime?and a TIFF (LZW) decompressor are needed to see this picture.
University of Waikato
73
5/14/2007
University of Waikato
74
5/14/2007
University of Waikato
75
5/14/2007
University of Waikato
76
5/14/2007
University of Waikato
77
QuickTime?and a TIFF (LZW) decompressor are needed to see this pictu
5/14/2007
University of Waikato
78
5/14/2007
University of Waikato
79
5/14/2007
University of Waikato
80
5/14/2007
University of Waikato
81
5/14/2007
University of Waikato
82
5/14/2007
University of Waikato
83
5/14/2007
University of Waikato
84
5/14/2007
University of Waikato
85
5/14/2007
University of Waikato
86
5/14/2007
University of Waikato
87
5/14/2007
University of Waikato
88
5/14/2007
University of Waikato
89
5/14/2007
University of Waikato
90
5/14/2007
University of Waikato
91
5/14/2007
University of Waikato
92
Explorer: clustering data
„
„
WEKA contains “clusterers” for finding groups of
similar instances in a dataset
Some implemented schemes are:
‹
„
„
k-Means, EM, Cobweb, X-means, FarthestFirst
Clusters can be visualized and compared to “true”
clusters (if given)
Evaluation based on loglikelihood if clustering
scheme produces a probability distribution
5/14/2007
University of Waikato
93
5/14/2007
University of Waikato
94
5/14/2007
University of Waikato
95
5/14/2007
University of Waikato
96
5/14/2007
University of Waikato
97
5/14/2007
University of Waikato
98
5/14/2007
University of Waikato
99
5/14/2007
University of Waikato
100
5/14/2007
University of Waikato
101
5/14/2007
University of Waikato
102
Explorer: finding associations
„
WEKA contains the Apriori algorithm (among
others) for learning association rules
‹
„
Can identify statistical dependencies between
groups of attributes:
‹
„
Works only with discrete data
milk, butter ⇒ bread, eggs (with confidence 0.9 and
support 2000)
Apriori can compute all rules that have a given
minimum support and exceed a given confidence
5/14/2007
University of Waikato
103
5/14/2007
University of Waikato
104
5/14/2007
University of Waikato
105
5/14/2007
University of Waikato
106
Explorer: attribute selection
„
„
Panel that can be used to investigate which
(subsets of) attributes are the most predictive ones
Attribute selection methods contain two parts:
A search method: best-first, forward selection,
random, exhaustive, genetic algorithm, ranking
‹ An evaluation method: correlation-based, wrapper,
information gain, chi-squared, …
‹
„
Very flexible: WEKA allows (almost) arbitrary
combinations of these two
5/14/2007
University of Waikato
107
5/14/2007
University of Waikato
108
5/14/2007
University of Waikato
109
5/14/2007
University of Waikato
110
5/14/2007
University of Waikato
111
5/14/2007
University of Waikato
112
5/14/2007
University of Waikato
113
Explorer: data visualization
„
„
Visualization very useful in practice: e.g. helps to
determine difficulty of the learning problem
WEKA can visualize single attributes (1-d) and
pairs of attributes (2-d)
‹
„
„
„
To do: rotating 3-d visualizations (Xgobi-style)
Color-coded class values
“Jitter” option to deal with nominal attributes (and
to detect “hidden” data points)
“Zoom-in” function
5/14/2007
University of Waikato
114
5/14/2007
University of Waikato
115
5/14/2007
University of Waikato
116
5/14/2007
University of Waikato
117
5/14/2007
University of Waikato
118
5/14/2007
University of Waikato
119
5/14/2007
University of Waikato
120
5/14/2007
University of Waikato
121
5/14/2007
University of Waikato
122
5/14/2007
University of Waikato
123
5/14/2007
University of Waikato
124
Performing experiments
„
„
„
„
„
„
Experimenter makes it easy to compare the
performance of different learning schemes
For classification and regression problems
Results can be written into file or database
Evaluation options: cross-validation, learning
curve, hold-out
Can also iterate over different parameter settings
Significance-testing built in!
5/14/2007
University of Waikato
125
5/14/2007
University of Waikato
126
5/14/2007
University of Waikato
127
5/14/2007
University of Waikato
128
5/14/2007
University of Waikato
129
5/14/2007
University of Waikato
130
5/14/2007
University of Waikato
131
5/14/2007
University of Waikato
132
5/14/2007
University of Waikato
133
5/14/2007
University of Waikato
134
5/14/2007
University of Waikato
135
The Knowledge Flow GUI
„
„
„
„
„
Java-Beans-based interface for setting up and
running machine learning experiments
Data sources, classifiers, etc. are beans and can
be connected graphically
Data “flows” through components: e.g.,
“data source” -> “filter” -> “classifier” -> “evaluator”
Layouts can be saved and loaded again later
cf. Clementine ™
5/14/2007
University of Waikato
136
5/14/2007
University of Waikato
137
5/14/2007
University of Waikato
138
5/14/2007
University of Waikato
139
5/14/2007
University of Waikato
140
5/14/2007
University of Waikato
141
5/14/2007
University of Waikato
142
5/14/2007
University of Waikato
143
5/14/2007
University of Waikato
144
5/14/2007
University of Waikato
145
5/14/2007
University of Waikato
146
5/14/2007
University of Waikato
147
5/14/2007
University of Waikato
148
5/14/2007
University of Waikato
149
5/14/2007
University of Waikato
150
5/14/2007
University of Waikato
151
Sourceforge.net – Downloads
5/14/2007
University of Waikato
152
Sourceforge.net – Web Traffic
WekaDoc Wiki introduced – 12/2005
WekaWiki launched – 05/2005
5/14/2007
University of Waikato
153
Projects based on WEKA
„
„
„
45 projects currently (30/01/07) listed on the WekaWiki
Incorporate/wrap WEKA
‹ GRB Tool Shed - a tool to aid gamma ray burst research
‹ YALE - facility for large scale ML experiments
‹ GATE - NLP workbench with a WEKA interface
‹ Judge - document clustering and classification
‹ RWeka - an R interface to Weka
Extend/modify WEKA
‹ BioWeka - extension library for knowledge discovery in biology
‹ WekaMetal - meta learning extension to WEKA
‹ Weka-Parallel - parallel processing for WEKA
‹ Grid Weka - grid computing using WEKA
‹ Weka-CG - computational genetics tool library
5/14/2007
University of Waikato
154
WEKA and PENTAHO
„
„
„
„
„
Pentaho – The leader in Open Source Business
Intelligence (BI)
September 2006 – Pentaho acquires the Weka project
(exclusive license and SF.net page)
Weka will be used/integrated as data mining component in
their BI suite
Weka will be still available as GPL open source software
Most likely to evolve 2 editions:
‹
‹
5/14/2007
Community edition
BI oriented edition
University of Waikato
155
Limitations of WEKA
„
„
„
Traditional algorithms need to have all data in
main memory
==> big datasets are an issue
Solution:
Incremental schemes
‹ Stream algorithms
MOA “Massive Online Analysis”
(not only a flightless bird, but also extinct!)
‹
5/14/2007
University of Waikato
156
Conclusion: try it yourself!
„
ƒ
ƒ
WEKA is available at
http://www.cs.waikato.ac.nz/ml/weka
Also has a list of projects based on WEKA
(probably incomplete list of) WEKA contributors:
Abdelaziz Mahoui, Alexander K. Seewald, Ashraf M. Kibriya, Bernhard
Pfahringer, Brent Martin, Peter Flach, Eibe Frank, Gabi Schmidberger,
Ian H. Witten, J. Lindgren, Janice Boughton, Jason Wells, Len Trigg,
Lucio de Souza Coelho, Malcolm Ware, Mark Hall, Remco Bouckaert,
Richard Kirkby, Shane Butler, Shane Legg, Stuart Inglis, Sylvain Roy,
Tony Voyle, Xin Xu, Yong Wang, Zhihai Wang
5/14/2007
University of Waikato
157
Related documents