Weka: Just do it
Free and Open Source ML Suite
Ian Witten & Eibe Frank
University of Waikato, New Zealand

Overview
• Classifiers, regressors, and clusterers
• Multiple evaluation schemes
• Bagging and boosting
• Feature selection:
  – right features and data are key to successful learning
• Experimenter
• Visualizer
• Text not up to date.
• They welcome additions.

Learning Tasks
• Classification: given examples labelled from a finite domain, generate a procedure for labelling unseen examples.
• Regression: given examples labelled with a real value, generate a procedure for labelling unseen examples.
• Clustering: given a set of examples, partition them into “interesting” groups. What scientists want.

Data Format: IRIS
    @RELATION iris
    @ATTRIBUTE sepallength REAL
    @ATTRIBUTE sepalwidth REAL
    @ATTRIBUTE petallength REAL
    @ATTRIBUTE petalwidth REAL
    @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
    @DATA
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    Etc.
General form: @ATTRIBUTE attribute-name REAL or list of values

J48 = Decision Tree
    petalwidth <= 0.6: Iris-setosa (50.0)
    petalwidth > 0.6
    |   petalwidth <= 1.7
    |   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
    |   |   petallength > 4.9
    |   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
    |   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
    |   petalwidth > 1.7: Iris-virginica (46.0/1.0)
Leaf counts (n/m): n = # under node, m = number wrong

Cross-validation
• Correctly Classified Instances 143 95.3 %
• Incorrectly Classified Instances 7 4.67 %
• Default is 10-fold cross-validation, i.e.
  – Split data into 10 equal-sized pieces
  – Train on 9 pieces and test on the remainder
  – Do for all possibilities and average

J48 Confusion Matrix
Old data set from statistics: 50 of each class
      a  b  c   <-- classified as
     49  1  0 | a = Iris-setosa
      0 47  3 | b = Iris-versicolor
      0  3 47 | c = Iris-virginica

Precision, Recall, and Accuracy
• Precision: probability of being correct given your decision.
  – Precision of Iris-setosa is 49/49 = 100%
  – Positive predictive value in the medical literature
• Recall: probability of correctly identifying a class.
  – Recall for Iris-setosa is 49/50 = 98%
  – Sensitivity in the medical literature
• Accuracy: # right / total = 143/150 ≈ 95%

Other Evaluation Schemes
• Leave-one-out cross-validation
  – n-fold cross-validation where n = number of training instances
• Specific train and test set
  – Allows for exact replication
  – OK if the train/test sets are large, e.g. in the 10,000 range

Bootstrap sampling
• Randomly select n instances with replacement from the n available
• Expect about 2/3 to be chosen for training
  – Prob. of not being chosen = (1 - 1/n)^n ≈ 1/e
• Test on the remainder
• Repeat about 30 times and average
• Avoids partition bias
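
The bootstrap procedure above can be sketched in a few lines of Python. This is a minimal illustration, not Weka's own implementation: the function names `bootstrap_split` and `mean_chosen_fraction` are hypothetical, and n = 150 is assumed only because the iris data set has 150 instances. It draws n indices with replacement, holds out the unchosen instances for testing, and compares the observed fraction chosen with 1 - 1/e.

```python
import math
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; indices never drawn form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


def mean_chosen_fraction(n, repeats=30, seed=0):
    """Average, over `repeats` rounds, the fraction of distinct instances chosen."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(repeats):
        _, test = bootstrap_split(n, rng)
        fractions.append(1 - len(test) / n)
    return sum(fractions) / repeats


n = 150  # assumed size: the iris data set
p_not_chosen = (1 - 1 / n) ** n
print(f"P(not chosen) = {p_not_chosen:.4f}  vs  1/e = {math.exp(-1):.4f}")
print(f"mean fraction chosen over 30 rounds = {mean_chosen_fraction(n):.3f}")
```

In a real experiment each round would train a classifier on the `train` indices and score it on the `test` indices, then average the 30 scores.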