Weka
Just do it
Free and Open Source
ML Suite
Ian Witten & Eibe Frank
University of Waikato
New Zealand
Overview
• Classifiers, Regressors, and clusterers
• Multiple evaluation schemes
• Bagging and Boosting
• Feature Selection:
– right features and data key to successful learning
• Experimenter
• Visualizer
• Text not up to date.
• They welcome additions.
Learning Tasks
• Classification: given examples labelled
from a finite domain, generate a procedure
for labelling unseen examples.
• Regression: given examples labelled with a
real value, generate a procedure for labelling
unseen examples.
• Clustering: given a set of examples,
partition them into “interesting”
groups. What scientists want.
Data Format: IRIS
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
Etc.
General form
@ATTRIBUTE attribute-name REAL or {list of values}
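To make the format concrete, here is a minimal sketch of a reader for files shaped like the iris example above. The function name `parse_arff` and the `SAMPLE` constant are hypothetical, not part of Weka. The sketch assumes unquoted attribute names, comma-separated data rows, and `%` comment lines; it does not attempt the full format.

```python
# Illustrative sample in the same shape as the iris file shown above.
SAMPLE = """@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
"""


def parse_arff(text):
    """Return (relation, attributes, rows) for a simple ARFF document.

    attributes is a list of (name, kind) pairs, where kind is either the
    type string (e.g. "REAL") or a list of allowed nominal values.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            rows.append([field.strip() for field in line.split(",")])
            continue
        keyword, _, rest = line.partition(" ")
        keyword, rest = keyword.upper(), rest.strip()
        if keyword == "@RELATION":
            relation = rest
        elif keyword == "@ATTRIBUTE":
            name, _, kind = rest.partition(" ")
            kind = kind.strip()
            if kind.startswith("{") and kind.endswith("}"):
                # Nominal attribute: keep the list of allowed values.
                kind = [value.strip() for value in kind[1:-1].split(",")]
            attributes.append((name, kind))
        elif keyword == "@DATA":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE)
```

Keywords are upper-cased before matching, so `@attribute` and `@ATTRIBUTE` are treated alike in this sketch.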
J48 = Decision Tree
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
| petalwidth <= 1.7
| | petallength <= 4.9: Iris-versicolor (48.0/1.0)
| | petallength > 4.9
| | | petalwidth <= 1.5: Iris-virginica (3.0)
| | | petalwidth > 1.5: Iris-versicolor (3.0/1.0)
| petalwidth > 1.7: Iris-virginica (46.0/1.0)
Leaf counts (a/b): a = number of instances under the node, b = number of those classified wrongly.
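The tree above is just a sequence of threshold tests, so it can be transcribed by hand into a small function. This is a sketch of that one printed tree, not of how J48 builds or stores trees; the function name `classify_iris` is hypothetical.

```python
def classify_iris(petallength, petalwidth):
    """Apply the printed J48 tree to one flower (measurements as in the data file)."""
    if petalwidth <= 0.6:
        return "Iris-setosa"
    if petalwidth > 1.7:
        return "Iris-virginica"
    # Here 0.6 < petalwidth <= 1.7.
    if petallength <= 4.9:
        return "Iris-versicolor"
    # Here petallength > 4.9 as well.
    if petalwidth <= 1.5:
        return "Iris-virginica"
    return "Iris-versicolor"
```

Only two of the four attributes appear in the tree, so the sepal measurements are not needed to classify a flower with it.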
Cross-validation
• Correctly Classified Instances 143 95.33%
• Incorrectly Classified Instances 7 4.67%
• Default: 10-fold cross-validation, i.e.
– Split the data into 10 equal-sized pieces
– Train on 9 pieces and test on the remaining one
– Repeat so that each piece is the test set once, then average
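The three steps above can be sketched as follows. This is a plain, unstratified split written for illustration, not Weka's implementation; `k_fold_indices`, `cross_validate`, and the `evaluate` callback are hypothetical names.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation over n instances."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)        # fixed seed makes the split repeatable
    folds = [indices[i::k] for i in range(k)]   # k pieces, sizes differ by at most 1
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validate(n, evaluate, k=10, seed=0):
    """Average evaluate(train, test) over the k folds.

    evaluate is an assumed callback that trains on the train indices and
    returns a score (e.g. accuracy) measured on the test indices.
    """
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k, seed)]
    return sum(scores) / len(scores)
```

With the 150 iris instances and k = 10, each fold trains on 135 instances and tests on 15.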
J48 Confusion Matrix
Old data set from statistics: 50 of each class
a b c <-- classified as
49 1 0 | a = Iris-setosa
0 47 3 | b = Iris-versicolor
0 3 47 | c = Iris-virginica
Precision, Recall, and Accuracy
• Precision: probability of being correct given
your decision, i.e. given the class you predicted.
– Precision of Iris-setosa is 49/49 = 100%
– Positive predictive value in the medical literature
• Recall: probability of correctly identifying
a member of the class.
– Recall for Iris-setosa is 49/50 = 98%
– Sensitivity in the medical literature
• Accuracy: # right/total = 143/150 ≈ 95.3%
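As a worked check of these definitions, the sketch below recomputes them from the J48 confusion matrix shown earlier (rows are actual classes, columns are predicted classes). The names `MATRIX`, `precision`, `recall`, and `accuracy` are illustrative.

```python
LABELS = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows: actual class. Columns: class assigned by the classifier.
MATRIX = [
    [49, 1, 0],
    [0, 47, 3],
    [0, 3, 47],
]


def precision(matrix, c):
    """Of the instances predicted as class c, the fraction that really are c."""
    predicted = sum(row[c] for row in matrix)   # column total
    return matrix[c][c] / predicted if predicted else 0.0


def recall(matrix, c):
    """Of the instances that really are class c, the fraction predicted as c."""
    actual = sum(matrix[c])                     # row total
    return matrix[c][c] / actual if actual else 0.0


def accuracy(matrix):
    """Diagonal (correct) count divided by the total number of instances."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total
```

Precision divides by a column total and recall by a row total, which is why Iris-setosa has precision 49/49 but recall 49/50.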
Other Evaluation Schemes
• Leave-one-out cross-validation
– n-fold cross-validation where n = number of training
instances
• Specific train and test set
– Allows for exact replication
– OK if the train and test sets are large, e.g. in the 10,000-instance range.
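Both schemes can be sketched as index splits. The function names are hypothetical, and the seeded shuffle in `fixed_split` is an assumption about how one might produce a repeatable split; in practice the train and test sets may simply be given as two separate files.

```python
import random


def leave_one_out(n):
    """Yield (train, test) index lists; each instance is the test set exactly once."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]


def fixed_split(n, test_fraction=0.3, seed=42):
    """Return one (train, test) split that is identical on every call.

    test_fraction and seed are illustrative defaults, not values from the slides.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(n * (1 - test_fraction)))
    return indices[:cut], indices[cut:]
```

Leave-one-out needs n training runs, so it is costly on large data sets, whereas a fixed split needs only one.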
Bootstrap sampling
• Randomly select n instances, with replacement, from the n available
• Expect about 63% (roughly 2/3) of the distinct instances to be chosen for training
– Prob. an instance is never chosen = (1 - 1/n)^n ≈ 1/e ≈ 0.37
• Test on the remainder
• Repeat about 30 times and average.
• Avoids partition bias
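The procedure and the 1/e estimate can be checked with a short simulation. This is a sketch under the assumptions stated on the slide (n draws with replacement, test on the instances never drawn); `bootstrap_split` and `average_unique_fraction` are hypothetical names.

```python
import math
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement for training; test on those never drawn."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


def average_unique_fraction(n, repeats=30, seed=0):
    """Average fraction of distinct instances that land in the training sample."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(repeats):
        train, _test = bootstrap_split(n, rng)
        fractions.append(len(set(train)) / n)
    return sum(fractions) / repeats


# Expected fraction of distinct instances in training: 1 - 1/e, about 0.632.
expected_fraction = 1 - 1 / math.e
```

For large n the simulated fraction should sit close to `expected_fraction`, leaving a bit over a third of the instances for testing in each repeat.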