Evaluation of Results (classifiers, and beyond)

Sources: Biplav Srivastava
• [Witten&Frank00] Witten, I.H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
• [Lim et al 99] Lim, T.-S., Loh, W.-Y. and Shih, Y.-S. A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms. Machine Learning, forthcoming. (The appendix contains complete tables of error rates, ranks, and training times; the data sets can be downloaded in C4.5 format.)

Evaluation of Methods
Ideally: find the best method. Practically: find classes of comparable methods.
• Classification accuracy
  – Effect of noise on accuracy
• Comprehensibility of result
  – Compactness
  – Complexity
• Training time
• Scalability with increase in sample size

Types of classifiers
• Decision trees
• Neural networks
• Statistical
  – Regression
    • Pi = w0 + w1*x1 + w2*x2 + ... + wm*xm
    • Use regression on the data set: minimize Sum_i (Ci - (w0 + w1*x1 + ... + wm*xm))^2 to get the weights
  – Classification
    • Perform one regression per class, with target output 1 for instances of class Ci and 0 otherwise, giving one linear expression per class
    • Given a test instance, evaluate each linear expression and choose the class whose expression gives the largest value

Assumptions in classification
• The data set is a representative sample
• The prior probability of each class is proportional to its frequency in the training sample
• Categorical attributes are changed to numerical values: if attribute X takes k values C1, C2, ..., Ck, use a (k-1)-dimensional vector d1, d2, ..., d(k-1) such that di = 1 if X = Ci and di = 0 otherwise; for X = Ck the vector is all zeros

Error Rate of Classifier
• If the data set is large
  – Use the error rate on held-out test data
  – The test data is usually 1/3 of the total data
• If the data set is small
  – Use K-fold cross-validation (usually K = 10); see the sketch after the rank-analysis slide below
    • Divide the data set into K roughly equal sets
    • Use K-1 sets for training and the remaining set for testing
    • Repeat K times and average the error rate

Error Rate of Classifier (cont.)
  – The choice of folds can affect the result
    • STRATIFICATION: make the class ratio in each fold the same as in the whole data set
    • Run K-fold cross-validation K times to overcome random variation in the fold choice
  – Leave-one-out: N-fold cross-validation
  – Bootstrap (also sketched below):
    • Training set = N data items selected at random with replacement; the probability that a given item is never selected is (1 - 1/N)^N ≈ 1/e ≈ 0.368
    • Test set = data set - training set
    • e = 0.632 * e(test) + 0.368 * e(training)

Evaluation in [Lim et al 99]
• 22 decision-tree, 9 statistical, and 2 neural-network classifiers
• Time measurement
  – Use SPEC marks to rate the platforms and scale the timing results based on these marks
• Error rates
  – Calculation
    • If the test set has more than 1000 cases, use its error rate
    • Otherwise, use 10-fold cross-validation
  – Does not use multiple 10-fold runs

Evaluation in [Lim et al 99] (cont.)
• Acceptable performance
  – If p is the minimum error rate over all classifiers, those within one standard error of p are accepted
  – Standard error of p = sqrt(p*(1-p)/N)
• Statistical significance of error rates
  – Null hypothesis: all algorithms have the same mean error rate -> test differences of mean error rates (hypothesis REJECTED)
  – Tukey method: a difference between mean error rates is significant at the 10% level if it exceeds 0.058

Evaluation in [Lim et al 99] (cont.)
• Rank analysis (no normality assumption); see the sketch below
  – For each data set, sort the classifiers by ascending error rate and assign ranks (ties get the average rank):
    Error rate: 0.1  0.342  0.5  0.5  0.5  0.543  0.677  0.789
    Rank:       1    2      4    4    4    6      7      8
  – Take the mean rank of each classifier across all data sets
• Statistical significance of ranks
  – The null hypothesis that the mean ranks do not differ (Friedman test) is REJECTED
  – A difference in mean ranks greater than 8.7 is significant at the 10% level
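A minimal Python sketch (not the authors' code) of the rank analysis just described, using hypothetical error rates; scipy.stats.rankdata assigns average ranks to ties, matching the example above, and scipy.stats.friedmanchisquare tests the null hypothesis of equal mean ranks:

import numpy as np
from scipy.stats import rankdata, friedmanchisquare

# Rows = data sets, columns = classifiers (hypothetical error rates).
errors = np.array([
    [0.10, 0.342, 0.50, 0.50, 0.50],
    [0.20, 0.25,  0.30, 0.30, 0.40],
    [0.15, 0.15,  0.25, 0.35, 0.45],
])

# rankdata's default 'average' method matches the slide's example:
# the three 0.5 values share ranks 3, 4, 5 and each receives 4.
ranks = np.vstack([rankdata(row) for row in errors])
mean_ranks = ranks.mean(axis=0)
print("mean rank per classifier:", mean_ranks)

# Friedman test: each argument is one classifier's error rates
# across all data sets.
stat, p = friedmanchisquare(*errors.T)
print(f"Friedman chi-square = {stat:.3f}, p = {p:.3f}")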
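Returning to the Error Rate of Classifier slides above, the following sketch shows K-fold cross-validation and the 0.632 bootstrap. Here fit and error_rate are hypothetical stand-ins for any classifier's training and scoring routines, and the folds are drawn at random without stratification:

import numpy as np

def kfold_error(X, y, fit, error_rate, k=10, seed=0):
    # Average error over k random folds (not stratified).
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errs.append(error_rate(model, X[test], y[test]))
    return float(np.mean(errs))

def bootstrap632_error(X, y, fit, error_rate, seed=0):
    # One 0.632-bootstrap estimate: e = 0.632*e(test) + 0.368*e(training).
    rng = np.random.default_rng(seed)
    n = len(X)
    train = rng.integers(0, n, size=n)         # N items drawn WITH replacement
    test = np.setdiff1d(np.arange(n), train)   # ~36.8% of items are left out
    model = fit(X[train], y[train])
    return (0.632 * error_rate(model, X[test], y[test])
            + 0.368 * error_rate(model, X[train], y[train]))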
Evaluation in [Lim et al 99] (cont.)
• Classifiers equivalent to the best one (POL), restricted to training time <= 10 min
  – 15 found using mean error rate
  – 18 found using mean rank
• Training time is reported in orders of magnitude slower than the fastest classifier (a rating of x means 10^(x-1) to 10^x times slower)
  – Decision-tree learners (C4.5; FACT, which uses statistical tests to choose splits) and regression methods train fast
  – Spline-based statistical methods and neural networks are slower

Evaluation in [Lim et al 99] (cont.)
• Size of trees (for decision trees)
  – Use 10-fold cross-validation
  – Noise typically reduces tree size
• Scalability with data set size
  – If the data set is small, use bootstrap re-sampling
    • N items are drawn with replacement
    • The class attribute is randomly changed, with probability 0.1, to a value selected uniformly from the valid set
  – Otherwise, use the given data size

Evaluation in [Lim et al 99] (cont.)
  – log(training time) increases linearly with log(N)
  – Decision trees usually scale
    • C4.5 with rules does not scale, but C4.5 trees do
    • QUEST and FACT (multi-attribute) scale
  – Regression methods usually scale
  – Neural-network methods were not tested

[Lim et al 99] Summary
• Use the method that fits the requirement: error rate, training time, ...
• Use decision trees with univariate splits (a single attribute per split) when the results must be interpreted
  – C4.5 trees are big, and C4.5 rules don't scale
• Simple regression methods are good: fast, easy to implement, and scalable

What more?
• Precision/Recall, based on the confusion matrix (see the sketch below):

                   PREDICTED (+)ve   PREDICTED (-)ve
  ACTUAL (+)ve     True (+)ve        False (-)ve
  ACTUAL (-)ve     False (+)ve       True (-)ve

• Cost-based analysis
• Demonstration from WEKA
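A minimal Python sketch of precision and recall computed from the confusion-matrix cells above; the label vectors are hypothetical examples, with 1 marking the positive class:

import numpy as np

actual    = np.array([1, 1, 1, 0, 0, 0, 1, 0])
predicted = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = np.sum((actual == 1) & (predicted == 1))  # true positives
fp = np.sum((actual == 0) & (predicted == 1))  # false positives
fn = np.sum((actual == 1) & (predicted == 0))  # false negatives
tn = np.sum((actual == 0) & (predicted == 0))  # true negatives

precision = tp / (tp + fp)  # of the predicted positives, how many are real
recall    = tp / (tp + fn)  # of the actual positives, how many were found
print(f"precision = {precision:.2f}, recall = {recall:.2f}")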