Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
An Introduction to WEKA As presented by PACE 8/9/2012 SAN DIEGO SUPERCOMPUTER CENTER Content • What is WEKA? • The Explorer Application • • • • • • Preprocess Classify Cluster Associate Select Attributes Visualize • Weka on Trestles • References and Resources 2 SAN DIEGO SUPERCOMPUTER CENTER 5/8/2017 What is WEKA? • Weka is a bird found only in New Zealand. • Waikato Environment for Knowledge Analysis • Weka is a data mining/machine learning tool developed by Department of Computer Science, University of Waikato, New Zealand. • Weka is a collection of machine learning algorithms for data mining tasks. • The algorithms can either be applied directly to a dataset or called from Java code. • Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. • Weka is open source software in JAVA issued under the GNU General Public License 3 SAN DIEGO SUPERCOMPUTER CENTER 5/8/2017 Download and Install WEKA • Website: http://www.cs.waikato.ac.nz/~ml/weka/index.html • Support multiple platforms (written in java): • Windows, Mac OS X and Linux • Datasets(iris.arff, weather.arff) • Available on Trestles at: /home/diag/opt/weka/data • Available with Download: …../weka/data/ 4 SAN DIEGO SUPERCOMPUTER CENTER 5/8/2017 Main Features • • • • • 5 49 data preprocessing tools 76 classification/regression algorithms 8 clustering algorithms 3 algorithms for finding association rules 15 attribute/subset evaluators + 10 search algorithms for feature selection SAN DIEGO SUPERCOMPUTER CENTER 5/8/2017 Main GUI • Three graphical user interfaces • “The Explorer” (exploratory data analysis) • • • • • • pre-process data build “classifiers” cluster data find associations attribute selection data visualization • “The Experimenter” (experimental environment) • used to compare performance of different learning schemes • “The KnowledgeFlow” (new process model inspired interface) • Java-Beans-based interface for setting up and running machine learning experiments. • Command line Interface (“Simple CLI”) 6 More at: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html 5/8/2017 SAN DIEGO SUPERCOMPUTER CENTER Content • What is WEKA? • The Explorer: • • • • • • Preprocess Classify Cluster Associate Select Attributes Visualize • Weka on Trestles • References and Resources 7 SAN DIEGO SUPERCOMPUTER CENTER 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 8 5/8/2017 WEKA:: Explorer: Preprocess • Data format • Uses flat text files to describe the data • Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary • Data can also be read from a URL or from an SQL database (using JDBC) SAN DIEGO SUPERCOMPUTER CENTER WEKA:: ARFF file format @relation heart-disease-simplified @attribute @attribute @attribute @attribute @attribute @attribute age numeric sex { female, male} chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina} cholesterol numeric exercise_induced_angina { no, yes} class { present, not_present} @data 63,male,typ_angina,233,no,not_present 67,male,asympt,286,yes,present 67,male,asympt,229,yes,present 38,female,non_anginal,?,no,not_present ... A more thorough description is available here http://www.cs.waikato.ac.nz/~ml/weka/arff.html SAN DIEGO SUPERCOMPUTER CENTER University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 1 1 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 1 2 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 1 3 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 1 4 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 1 5 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 1 6 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 1 7 5/8/2017 WEKA:: Explorer: Preprocess • Used to define filters to transform Data. • WEKA contains filters for: • Discretization, normalization, resampling, attribute selection, transforming, combining attributes, etc SAN DIEGO SUPERCOMPUTER CENTER University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 1 9 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 2 0 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 2 1 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 2 2 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 2 3 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 2 4 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 2 5 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 2 6 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 2 7 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 2 8 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 2 9 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 3 0 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 3 1 5/8/2017 WEKA:: Explorer: building “classifiers” • Classifiers in WEKA are models for predicting nominal or numeric quantities • Implemented learning schemes include: • Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, … • “Meta”-classifiers include: • Bagging, boosting, stacking, error-correcting output codes, locally weighted learning, … SAN DIEGO SUPERCOMPUTER CENTER Decision Tree Induction: Training Dataset age <=30 <=30 31…40 >40 >40 >40 31…40 <=30 <=30 >40 <=30 31…40 31…40 >40 income student credit_rating fair no high excellent no high fair no high fair no medium fair yes low excellent yes low excellent yes low fair no medium fair yes low fair yes medium excellent yes medium excellent no medium fair yes high excellent no medium buys_computer no no yes yes yes no yes no yes yes yes yes yes no This follows an example of Quinlan’s ID3 SAN DIEGO SUPERCOMPUTER CENTER May 8, 2017 33 Output: A Decision Tree for “buys_computer” age? <=30 31..40 overcast student? no no yes yes yes SAN DIEGO SUPERCOMPUTER CENTER >40 credit rating? excellent fair yes May 8, 2017 34 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 3 6 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 3 7 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 3 8 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 3 9 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 4 0 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 4 1 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 4 2 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 4 3 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 4 4 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 4 5 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 4 6 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 4 7 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 4 8 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 4 9 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 5 0 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 5 1 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 5 2 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 5 3 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 5 4 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 5 5 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 5 6 5/8/2017 Explorer: Select Attributes • Panel that can be used to investigate which (subsets of) attributes are the most predictive ones • Attribute selection methods contain two parts: • A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking • An evaluation method: correlation-based, wrapper, information gain, chi-squared, … • Very flexible: WEKA allows (almost) arbitrary combinations of these two 5 7 SAN DIEGO SUPERCOMPUTER CENTER 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 5 8 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 5 9 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 6 0 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 6 1 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 6 2 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 6 3 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 6 4 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 6 5 5/8/2017 Explorer: Visualize • Visualization very useful in practice: e.g. helps to determine difficulty of the learning problem • WEKA can visualize single attributes (1-d) and pairs of attributes (2-d) • To do: rotating 3-d visualizations (Xgobi-style) • Color-coded class values • “Jitter” option to deal with nominal attributes (and to detect “hidden” data points) • “Zoom-in” function SAN DIEGO SUPERCOMPUTER CENTER 5/8/2017 66 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 6 7 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 6 8 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 6 9 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 7 0 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 7 1 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 7 2 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 7 3 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 7 4 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 7 5 5/8/2017 University of Waikato SAN DIEGO SUPERCOMPUTER CENTER 7 6 5/8/2017 Using Weka On Trestles SAN DIEGO SUPERCOMPUTER CENTER Using Weka on Trestles • • • • • Shared Resources Batch and Interactive Use GUI and Command Line Use GUI on login nodes to create command line Use command line to run interactive or batch jobs on production nodes SAN DIEGO SUPERCOMPUTER CENTER Weka Gui • To launch Weka Gui on a: • Windows machine • to run software on remote machine with GUI requires a secure shell with x forwarding enabled to establish a remote connection and an X Server to handle the local display. – Suggested software putty and Xming » http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html » http://www.straightrunning.com/XmingNotes/ • Linux and MAC OS X support X Forwarding • Mac users need to run Applications > Utilities > Xterm • ssh –Y [email protected] • Load weka module • Weka installation available at: /home/diag/opt/weka • At command prompt > weka SAN DIEGO SUPERCOMPUTER CENTER PBS Script SAN DIEGO SUPERCOMPUTER CENTER Output file SAN DIEGO SUPERCOMPUTER CENTER Hands On with Weka SAN DIEGO SUPERCOMPUTER CENTER The Weather Data Set .arff file Weather.arff file • • • Available on Trestles at: /home/diag/opt/weka/data On line: http://www.hakank.org/weka/ With Weka download Data Set: @relation PlayTennis @attribute @attribute @attribute @attribute @attribute @attribute day numeric outlook {Sunny, Overcast, Rain} temperature {Hot, Mild, Cool} humidity {High, Normal} wind {Weak, Strong} playTennis {Yes, No} @data 1,Sunny,Hot,High,Weak,No,? 2,Sunny,Hot,High,Strong,No,? 3,Overcast,Hot,High,Weak,Yes,? 4,Rain,Mild,High,Weak,Yes,? 5,Rain,Cool,Normal,Weak,Yes,? 6,Rain,Cool,Normal,Strong,No,? 7,Overcast,Cool,Normal,Strong,Yes,? 8,Sunny,Mild,High,Weak,No,? . SAN DIEGO SUPERCOMPUTER CENTER The Problem • Each instance describes the facts of the day and the action of the observed person (played or no play). • The Data Set • 14 Instances • 6 attributes (day, outlook, temp, humidity, wind, play tennis) • Based on the given records we can assess which factors affected the person's decision about playing tennis. SAN DIEGO SUPERCOMPUTER CENTER The Question • Use j48 decision tree learner to model for class attribute play tennis • Make prediction for “play”. • Make predictions for the ‘temperature’ attribute. Do you need to do any additional data preparation? SAN DIEGO SUPERCOMPUTER CENTER Result SAN DIEGO SUPERCOMPUTER CENTER References and Resources • References: • WEKA website: http://www.cs.waikato.ac.nz/~ml/weka/index.html • WEKA Tutorial: • Machine Learning with WEKA: A presentation demonstrating all graphical user interfaces (GUI) in Weka. • A presentation which explains how to use Weka for exploratory data mining. • WEKA Data Mining Book: • Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition) • WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/Main_Page SAN DIEGO SUPERCOMPUTER CENTER