Introduction to WEKA (1)

Learning Objectives
• What is WEKA?
• How to start?
• GUI Interface
• Weka Explorer application
  – Walk through the 3 phases of DM using the iris database

What is WEKA?
• Open source tool created by researchers at the University of Waikato in New Zealand
• Started in 1993 (TCL and C based)
• First Java version was in 1997
• In 2005 received the SIGKDD Data Mining and Knowledge Discovery Service Award
• Latest version is 3.7 (developer version)
• We use the stable version 3.6.x

What is WEKA?
• Core is a collection of open source Machine Learning algorithms
  o Pre-processing
  o Classifiers
  o Clustering
  o Association rule
• Both GUI and Command Line interfaces

How to start - Installation
• Lab computer has WEKA installed
• To install on your laptop (do it after class please)
  – Download the software (same version as in the lab) from http://www.cs.waikato.ac.nz/ml/weka/
  – If Java VM 1.6 is not installed on your laptop, choose the download that includes it

How to start – Get some data
• From campus website:
  – http://192.168.10.91/insu/CP3300/index.html
  – http://192.168.2.91/insu/CP3300/index.html
• (Later) Would you like more choice for the assignment? Explore more from well known repositories:
  – http://mlearn.ics.uci.edu/MLRepository.html
  – http://www.kdnuggets.com/datasets/index.html

How to start – Get some data
• Data file formats supported
  – Flat text file formats: arff, csv, libsvm …
  – Database: remote SQL database (use JDBC driver)

GUI Interface
• Launch from Program -> Weka 3.6.x -> Weka (with console)

Explorer
• Where we conduct the 3 phases of data mining:
  – Data pre-processing
  – Data mining
  – Present & interpret result

KnowledgeFlow
• (Visual) user interface in which data sources, classifiers, etc. are connected graphically

Experimenter
• Makes comparing the performance of different learning schemes easier
• Can save results into a file or database
• Suitable for classification and regression problems

Explorer – Load database
• Open Explorer from the Weka GUI Chooser
• Click the ‘Open file…’ button. From the pop-up, navigate to C:\Program Files\Weka-3-6\data and select the file iris.
• Click Open. Note that ‘Files of type’ should show Arff data files (*.arff)

Explorer – Iris database
• Fisher’s Iris data is a classical reference in the pattern recognition literature.
• The database has 150 instances (entries, records) and contains 3 classes of iris plant.
• Each class has 50 instances.
• Total 5 attributes
  – Sepal length
  – Sepal width
  – Petal length
  – Petal width
  – Class (Setosa, Versicolour and Virginica)

Explorer – Examine data summaries
• Purpose is to identify potential data problems and apply a suitable pre-processing technique for optimal evaluation.
• Select an attribute and examine
  1. Summary statistics (data type, any missing data, …)
  2. Visualization

[Figure: Explorer – Load database (screenshots; callouts 1 and 2)]

Explorer – Build Cluster
• Weka provides models for popular clustering algorithms
• Clustering is a process that follows certain criteria to assign a set of instances to a few subsets
• Example: From the iris database we know there are 3 types of iris. Can we build a model (a set of criteria) to assign the 150 instances to 3 subsets? And how accurate can we be?

Explorer – Build Cluster
1. On the Cluster tab, click the Choose button
2. Select SimpleKMeans

Explorer – Build a Cluster
2. Right-click inside the scheme box
3. In the pop-up window, set numClusters to 3
4. Click the Start button

Explorer – Build Cluster
• Result (total 3 sections)
• Select a result from the result list and right-click
• Save result to file

Explorer – Build Cluster
• Interpret the result (sections 1 and 2)

    === Run information ===

    Scheme:       weka.clusterers.SimpleKMeans -N 3 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
    Relation:     iris
    Instances:    150
    Attributes:   5
                  sepallength
                  sepalwidth
                  petallength
                  petalwidth
                  class
    Test mode:    evaluate on training data

    === Model and evaluation on training set ===

Explorer – Build Cluster
• Interpret the result (cont.)

    kMeans
    ======

    Number of iterations: 3
    Within cluster sum of squared errors: 7.817456892309574
    Missing values globally replaced with mean/mode

    Cluster centroids:
                                    Cluster#
    Attribute      Full Data               0             1               2
                       (150)            (50)          (50)            (50)
    ======================================================================
    sepallength       5.8433           5.936         5.006           6.588
    sepalwidth         3.054            2.77         3.418           2.974
    petallength       3.7587            4.26         1.464           5.552
    petalwidth        1.1987           1.326         0.244           2.026
    class        Iris-setosa Iris-versicolor   Iris-setosa  Iris-virginica

    Clustered Instances

    0       50 ( 33%)
    1       50 ( 33%)
    2       50 ( 33%)

Explorer – Build Cluster
• Visualise result
• Y axis – mining result
• X axis – value of the class attribute from the iris database
• Which type of iris does the red cluster refer to? Right-click the red cluster.

Explorer – Build Cluster
• What if we want to use only 2 attributes, petal width and petal length, to build a clusterer?
  1. Go back to the Preprocess tab
  2. Click the Choose button under the Filter section
  3. Expand unsupervised -> attribute, select Remove
  4. Click on Remove
  5. In the pop-up window, enter 1-2 for AttributeIndices
  6. Click OK to go back to Preprocess
  7. Click the Apply button in the Filter section
  8. Now we should see only 3 attributes
  9. Switch to the Cluster tab. Click the Start button to build a new clusterer
  10. Observe the three sections of output and compare them with the output of the clusterer built with both petal and sepal attributes.
• To revert the attribute removal, click the Undo button

Explorer – Build Classifier
• Weka provides models for predicting nominal or numeric quantities
• Example: Suppose we build a model from the given iris data. Given data for 100 new iris flowers, can our model tell us the type of each new flower? And how accurately?

Explorer – Build Classifier
• The iris database has 150 instances.
• Let’s assume we have collected about 100 instances of iris data to build a model.
• Use the remaining instances (about 50) to test how ‘good’ our model is. With the 66% split shown in the output below, Weka tests on 51 instances.
• Weka comes with many classification algorithms; we use J48. Details of this algorithm are covered in the lectures in weeks 6-8.

[Figure: Explorer – Build Classifier (screenshot; callouts 1, 2 and 3)]

Explorer – Build Classifier
• Result (total 3 sections)
• Select a result from the result list and right-click
• Save result to file

Explorer – Build Classifier
• Interpret the result

    === Run information ===

    Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
    Relation:     iris
    Instances:    150
    Attributes:   5
                  sepallength
                  sepalwidth
                  petallength
                  petalwidth
                  class
    Test mode:    split 66.0% train, remainder test

Explorer – Build Classifier
• Interpret the result (cont.)

    === Classifier model (full training set) ===

    J48 pruned tree
    ------------------

    petalwidth <= 0.6: Iris-setosa (50.0)
    petalwidth > 0.6
    |   petalwidth <= 1.7
    |   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
    |   |   petallength > 4.9
    |   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
    |   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
    |   petalwidth > 1.7: Iris-virginica (46.0/1.0)

    Number of Leaves  :     5

    Size of the tree :      9

    Time taken to build model: 0 seconds

Explorer – Build Classifier
• Interpret the result (cont.)

    === Evaluation on test split ===
    === Summary ===

    Correctly Classified Instances          49               96.0784 %
    Incorrectly Classified Instances         2                3.9216 %
    Kappa statistic                          0.9408
    Mean absolute error                      0.0396
    Root mean squared error                  0.1579
    Relative absolute error                  8.8979 %
    Root relative squared error             33.4091 %
    Total Number of Instances               51

Explorer – Build Classifier
• Interpret the result (cont.)

    === Detailed Accuracy By Class ===

                   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                   1         0         1           1        1           1          Iris-setosa
                   1         0.063     0.905       1        0.95        0.969      Iris-versicolor
                   0.882     0         1           0.882    0.938       0.967      Iris-virginica
    Weighted Avg.  0.961     0.023     0.965       0.961    0.961       0.977

    === Confusion Matrix ===

      a  b  c   <-- classified as
     15  0  0 |  a = Iris-setosa
      0 19  0 |  b = Iris-versicolor
      0  2 15 |  c = Iris-virginica

Revisit the output, in text and other graphical formats, when classification is taught in lectures.
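
Checking the evaluation figures by hand
• The summary and per-class figures in the classifier output can be recomputed from the confusion matrix alone. The sketch below is plain Python, not part of Weka; the function name summarize and the variable names are hypothetical, chosen only for illustration. It assumes rows are actual classes and columns are predicted classes, as in the "classified as" layout of the output.

```python
# Confusion matrix copied from the J48 output on the iris test split.
# Rows = actual class, columns = predicted class (assumed layout).
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
matrix = [
    [15, 0, 0],
    [0, 19, 0],
    [0, 2, 15],
]


def summarize(matrix):
    """Return (accuracy, kappa, per_class) computed from a square confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]

    # Accuracy: fraction of instances on the diagonal.
    accuracy = sum(matrix[i][i] for i in range(n)) / total

    # Kappa: agreement beyond what the row/column totals would give by chance.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)

    per_class = []
    for k in range(n):
        tp = matrix[k][k]
        fp = col_totals[k] - tp          # predicted k, actually another class
        fn = row_totals[k] - tp          # actually k, predicted another class
        tn = total - tp - fp - fn
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0   # same as TP rate
        fp_rate = fp / (fp + tn) if fp + tn else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        per_class.append({"tp_rate": recall, "fp_rate": fp_rate,
                          "precision": precision, "f_measure": f_measure})
    return accuracy, kappa, per_class


accuracy, kappa, per_class = summarize(matrix)
print(f"Accuracy: {accuracy * 100:.4f} %")
print(f"Kappa:    {kappa:.4f}")
for name, m in zip(labels, per_class):
    print(f"{name:16s} TP={m['tp_rate']:.3f} FP={m['fp_rate']:.3f} "
          f"P={m['precision']:.3f} F={m['f_measure']:.3f}")
```

• Running it on the matrix above should give the same accuracy (49 of 51 correct) and kappa as the Weka summary, and per-class values that agree with the Detailed Accuracy By Class table after rounding. The ROC Area column cannot be recovered this way, because it depends on the predicted class probabilities, not just the counts.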