Download Introduction to WEKA

Introduction to WEKA (1)

Learning Objectives
• What is WEKA?
• How to start?
• GUI Interface
• Weka Explorer application
  – Walk through the 3 phases of data mining (DM) using the iris database

What is WEKA?
• Open source tool created by researchers at the University of Waikato in New Zealand
• Started in 1993 (Tcl and C based)
• First Java version was in 1997
• In 2005 received the SIGKDD Data Mining and Knowledge Discovery Service Award
• Latest version is 3.7 (developer version)
• We use the stable version 3.6.x

What is WEKA?
• Core is a collection of open source machine learning algorithms
  o Pre-processing
  o Classifiers
  o Clustering
  o Association rules
• Both GUI and command line interfaces

How to start – Installation
• Lab computers have WEKA installed
• To install on your laptop (do it after class please)
  – Download the software (same version as in the lab) from http://www.cs.waikato.ac.nz/ml/weka/
  – If Java VM 1.6 is not installed on your laptop, choose the download that includes it

How to start – Get some data
• From the campus website:
  – http://192.168.10.91/insu/CP3300/index.html
  – http://192.168.2.91/insu/CP3300/index.html
• (Later) Would you like more choice for the assignment? Explore well-known repositories:
  – http://mlearn.ics.uci.edu/MLRepository.html
  – http://www.kdnuggets.com/datasets/index.html

How to start – Get some data
• Supported data file formats
  – Flat text file formats: arff, csv, libsvm …
  – Database: remote SQL database (use a JDBC driver)

GUI Interface
• Launch from Program -> Weka 3.6.x -> Weka (with console)

Explorer
• Where we conduct the 3 phases of data mining:
  – Data pre-processing
  – Data mining
  – Present & interpret result

KnowledgeFlow
• (Visual) user interface in which data sources, classifiers, etc. are connected graphically

Experimenter
• Makes it easier to compare the performance of different learning schemes
• Can save results to a file or database
• Suitable for classification and regression problems

Explorer – Load database
• Open Explorer from the Weka GUI Chooser
• Click the ‘Open file…’ button. In the pop-up, navigate to C:\Program Files\Weka-3-6\data and select the file iris.
• Click Open. Note that under Files of type you should see Arff data files (*.arff)

Explorer – Iris database
• Fisher’s Iris data is a classic reference in the pattern recognition literature.
• The database has 150 instances (entries, records) and contains 3 classes of iris plant.
• Each class has 50 instances.
• 5 attributes in total
  – Sepal length
  – Sepal width
  – Petal length
  – Petal width
  – Class (Setosa, Versicolour and Virginica)

Explorer – Examine data summaries
• The purpose is to identify potential data problems and apply suitable pre-processing techniques for optimal evaluation.
• Select an attribute and examine
  1. Summary statistics (data type, any missing data, …)
  2. Visualization

[Screenshots: Explorer – Load database, with callouts 1 and 2]

Explorer – Build Cluster
• Weka models for popular clustering algorithms
• Clustering assigns a set of instances to a few subsets according to certain criteria
• Example: From the iris database we know there are 3 types of iris. Can we build a model (a set of criteria) that assigns the 150 instances to 3 subsets? And how accurately can we do so?

Explorer – Build Cluster
1. From the Cluster tab, click the Choose button
2. Select SimpleKMeans
3. Right-click inside the scheme box
4. In the pop-up window, set numClusters to 3
5. Click the Start button

Explorer – Build Cluster
• Result (3 sections in total)
• Select a result from the result list and right-click
• Save the result to a file

Explorer – Build Cluster
• Interpret the result (sections 1 and 2)

    === Run information ===
    Scheme:       weka.clusterers.SimpleKMeans -N 3 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
    Relation:     iris
    Instances:    150
    Attributes:   5
                  sepallength
                  sepalwidth
                  petallength
                  petalwidth
                  class
    Test mode:    evaluate on training data

    === Model and evaluation on training set ===

Explorer – Build Cluster
• Interpret the result (cont.)

    kMeans
    ======
    Number of iterations: 3
    Within cluster sum of squared errors: 7.817456892309574
    Missing values globally replaced with mean/mode

    Cluster centroids:
                                  Cluster#
    Attribute      Full Data      0                  1              2
                   (150)          (50)               (50)           (50)
    ==========================================================================
    sepallength    5.8433         5.936              5.006          6.588
    sepalwidth     3.054          2.77               3.418          2.974
    petallength    3.7587         4.26               1.464          5.552
    petalwidth     1.1987         1.326              0.244          2.026
    class          Iris-setosa    Iris-versicolor    Iris-setosa    Iris-virginica

    Clustered Instances
    0      50 ( 33%)
    1      50 ( 33%)
    2      50 ( 33%)

Explorer – Build Cluster
• Visualise the result
• Y axis – mining result
• X axis – value of the class attribute from the iris database
• Which type of iris does the red cluster refer to? Right-click the red cluster.

Explorer – Build Cluster
• What if we want to use only 2 attributes, petal width and petal length, to build a clusterer?
1. Go back to the Preprocess tab
2. Click the Choose button under the Filter section
3. Expand unsupervised -> attribute and select Remove
4. Click on Remove
5. In the pop-up window, enter 1-2 for attributeIndices
6. Click OK to go back to Preprocess
7. Click the Apply button in the Filter section
8. We should now see only 3 attributes
9. Switch to the Cluster tab. Click the Start button to build a new clusterer
   To revert the attribute removal, click the Undo button
10. Observe the three sections of output and compare them with the output of the clusterer built with both petal and sepal attributes.

Explorer – Build Classifier
• Weka models for predicting nominal or numeric quantities
• Example: Suppose we build a model from the given iris data. Given data on 100 new iris flowers, can our model tell us the type of each new flower? And how accurately?

Explorer – Build Classifier
• The iris database has 150 instances.
• Let’s assume we have collected 100 instances of iris data to build a model.
• Use the remaining 50 instances to test how ‘good’ our model is.
• Weka comes with many classification algorithms; we use J48. Details of this algorithm are covered in the lectures in weeks 6-8.

[Screenshot: Explorer – Build Classifier, with numbered callouts]

Explorer – Build Classifier
• Result (3 sections in total)
• Select a result from the result list and right-click
• Save the result to a file

Explorer – Build Classifier
• Interpret the result

    === Run information ===
    Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
    Relation:     iris
    Instances:    150
    Attributes:   5
                  sepallength
                  sepalwidth
                  petallength
                  petalwidth
                  class
    Test mode:    split 66.0% train, remainder test

Explorer – Build Classifier
• Interpret the result (cont.)

    === Classifier model (full training set) ===

    J48 pruned tree
    ------------------
    petalwidth <= 0.6: Iris-setosa (50.0)
    petalwidth > 0.6
    |   petalwidth <= 1.7
    |   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
    |   |   petallength > 4.9
    |   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
    |   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
    |   petalwidth > 1.7: Iris-virginica (46.0/1.0)

    Number of Leaves  : 5
    Size of the tree  : 9
    Time taken to build model: 0 seconds

Explorer – Build Classifier
• Interpret the result (cont.)

    === Evaluation on test split ===
    === Summary ===
    Correctly Classified Instances        49        96.0784 %
    Incorrectly Classified Instances       2         3.9216 %
    Kappa statistic                        0.9408
    Mean absolute error                    0.0396
    Root mean squared error                0.1579
    Relative absolute error                8.8979 %
    Root relative squared error           33.4091 %
    Total Number of Instances             51

Explorer – Build Classifier
• Interpret the result (cont.)

    === Detailed Accuracy By Class ===
                   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                   1         0         1           1        1           1          Iris-setosa
                   1         0.063     0.905       1        0.95        0.969      Iris-versicolor
                   0.882     0         1           0.882    0.938       0.967      Iris-virginica
    Weighted Avg.  0.961     0.023     0.965       0.961    0.961       0.977

    === Confusion Matrix ===
      a  b  c   <-- classified as
     15  0  0 |  a = Iris-setosa
      0 19  0 |  b = Iris-versicolor
      0  2 15 |  c = Iris-virginica

Revisit the output in text and other graphic formats when classification is taught in lectures.
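
The headline figures in the evaluation output can be cross-checked by hand from the confusion matrix alone. The sketch below is not WEKA code. It is a small Python illustration, and the names `summarize`, `matrix` and `labels` are chosen here for illustration only. It assumes that rows are the actual classes and columns are the predicted classes, as the "classified as" header of the matrix suggests.

```python
# Confusion matrix copied from the J48 test-split output
# (assumption: rows = actual class, columns = predicted class).
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
matrix = [
    [15, 0, 0],
    [0, 19, 0],
    [0, 2, 15],
]


def summarize(matrix):
    """Return (accuracy, kappa, per_class) for a square confusion matrix."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    # Observed agreement: fraction of instances on the diagonal.
    accuracy = sum(matrix[i][i] for i in range(k)) / n

    # Agreement expected by chance, from the row and column totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected)

    per_class = []
    for i in range(k):
        tp = matrix[i][i]
        fp = col_totals[i] - tp
        fn = row_totals[i] - tp
        tn = n - tp - fp - fn
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        fp_rate = fp / (fp + tn) if fp + tn else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        per_class.append(
            {
                "tp_rate": recall,
                "fp_rate": fp_rate,
                "precision": precision,
                "recall": recall,
                "f_measure": f_measure,
            }
        )
    return accuracy, kappa, per_class


accuracy, kappa, per_class = summarize(matrix)

print(f"Accuracy: {accuracy * 100:.4f} %")
print(f"Kappa:    {kappa:.4f}")
for name, stats in zip(labels, per_class):
    print(
        f"{name:16s} precision={stats['precision']:.3f} "
        f"recall={stats['recall']:.3f} F={stats['f_measure']:.3f}"
    )
```

The accuracy is 49/51 and the kappa statistic works out to 1620/1722, which agree with the 96.0784 % and 0.9408 reported in the summary. The ROC area and the error measures (mean absolute error and so on) cannot be derived this way, because they depend on the predicted class probabilities rather than on the final class counts.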
