EVALUATION
Class 15
CSC 600: Data Mining

Today…
• Evaluation Metrics: Misclassification Rate, Confusion Matrix, Precision, Recall, F1
• Evaluation Experiments: Cross-Validation, Bootstrap, Out-of-time Sampling

Evaluation Metrics
How should a classifier be quantitatively evaluated?
Misclassification Rate:
misclassification rate = (# of incorrect predictions) / (total # of predictions)

Evaluation Metrics
Issues with misclassification rate (or, equivalently, accuracy)?
Example: in the credit-card fraud domain, there are many more legitimate transactions than fraudulent transactions.
Presume only 1% of transactions are fraudulent.
Then a classifier that predicts every transaction as GOOD would have 99% accuracy!
Seems great, but it's not…

Alternative Metrics
Notation (for binary classification): + (positive class), - (negative class)
Class imbalance: rare class vs. majority class
For binary problems, there are 4 possible outcomes:

Confusion Matrix: Counts

                         Predicted +                Predicted -
  Actual +     f++ (True Positive)       f+- (False Negative)
  Actual -     f-+ (False Positive)      f-- (True Negative)

True Positive (TP): number of positive examples correctly predicted by the model
False Negative (FN): number of positive examples wrongly predicted as negative
False Positive (FP): number of negative examples wrongly predicted as positive
True Negative (TN): number of negative examples correctly predicted

Confusion Matrix: Percentages
True Positive Rate (TPR): fraction of positive examples correctly predicted by the model; also referred to as sensitivity. TPR = TP / (TP + FN)
False Negative Rate (FNR): fraction of positive examples wrongly predicted as negative. FNR = FN / (TP + FN)
False Positive Rate (FPR): fraction of negative examples wrongly predicted as positive. FPR = FP / (TN + FP)
True Negative Rate (TNR): fraction of negative examples correctly predicted; also referred to as specificity. TNR = TN / (TN + FP)

Precision and Recall
Widely used metrics when successful detection of one class is considered more significant than detection of the other classes.
Precision, p = TP / (TP + FP)
Recall, r = TP / (TP + FN)

Precision
Precision: the fraction of records that are actually positive among the records the classifier predicted as positive.
Interpretation: the higher the precision, the lower the number of false positive errors.

Recall
Recall: the fraction of positive records that are correctly predicted by the classifier.
Interpretation: the higher the recall, the fewer positive records misclassified as the negative class.

Baseline Models
Often naïve models.
Example #1: classify every instance as positive (the rare class).
What is the accuracy? Precision? Recall?
Precision = poor; Recall = 100%.
Baseline models often maximize one metric (precision or recall) but not the other.
Key challenge: building a model that performs well on both metrics.

F1 Measure
F-measure: combines precision and recall into a single metric, using the harmonic mean.
The harmonic mean of two numbers tends to be closer to the smaller of the two…
…so the only way F1 is high is for both precision and recall to be high.
F1 = 2rp / (r + p) = 2·TP / (2·TP + FP + FN)
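As a quick illustration, here is a minimal Python sketch of these metrics applied to the fraud-style baseline above; the function names and the tiny dataset are illustrative, not part of the lecture material.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, TN for a binary classification problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fn, fp, tn

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = their harmonic mean."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Naive baseline on a 1%-positive dataset: predict every instance as positive.
y_true = [0] * 99 + [1]      # one positive (rare-class) record in a hundred
y_pred = [1] * 100           # "always positive" baseline
print(precision_recall_f1(y_true, y_pred))   # (0.01, 1.0, ~0.02): recall 100%, precision poor
```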
Evaluation Experiments
The available data is split into a Training Set (used to learn the model), a Validation Set (used to tune and select the model), and a Test Set (used for the final evaluation of predictions).

[Figure: a learning algorithm induces a model from a labeled training set (records 1-10, attributes plus class labels); the learned model is then applied (deduction) to predict the class of unlabeled test records (11-15).]

Evaluation
How to divide the dataset into a training set, validation set, and testing set?
Sample (randomly partition).
Problems?
• Requires enough data to create suitably large sets.
• A "lucky split" is possible:
  1. Difficult instances were chosen for the training set.
  2. Easy instances were put into the testing set.

Cross-Validation
Widely used alternative to a single Training Set + Validation Set.
Multiple evaluations using different portions of the data for training and validating/testing.
k-fold Cross-Validation: k = number of folds (an integer).

k-Fold Cross-Validation
The available data is divided into k equal-sized folds (partitions).
k separate evaluation experiments are run:
• Training set: k-1 folds
• Test set: 1 fold
Repeat k-1 more times, each time using a different fold for the test set.

10-fold Cross-Validation
1/10th of the data × 9 = 90% used for training, 10% used for testing.
• Repeat k times
• Average the results
• Each instance will be used exactly once for testing

Cross-Validation
More computationally expensive than a single split.
k = 10 is common.

Leave-One-Out Cross-Validation
The extreme case: k = n, where n is the number of observations in the training+validation set.
Useful when the amount of available data is too small to allow big enough training sets in a k-fold cross-validation.
Significantly more computationally expensive.
n folds: each training set has n-1 instances; each testing set has a single instance.

Cross-Validation Error Estimate
CV_k = (1/k) * Σ_{i=1}^{k} ErrorRate_i
Average the error rate over the k folds (a short code sketch of the procedure follows below).
In leave-one-out cross-validation, since each test set contains only one record, the variance of the estimated performance will be high: each fold's error rate is either 100% or 0%.

ε0 Bootstrapping
Also called the .632 bootstrap.
Preferred over cross-validation approaches in contexts with very small datasets (< 300 instances).
Performs multiple evaluation experiments using slightly different training and testing sets.
Repeat k times; typically k is set greater than 200, many more iterations than the folds used in k-fold cross-validation.
[Figure: iterations 1, 2, 3, …, k-1, k, each with its own sampled training/test split.]

Out-of-time Sampling
Some domains have a time dimension: a chronological structure in the data.
One time span is used for the training set, another time span for the test set.
[Figure: along the time axis, the training set covers an earlier span and the test set a later span.]
It is important that an out-of-time sample uses time spans large enough to take into account any cyclical behavior patterns.
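A minimal Python sketch of the k-fold cross-validation procedure described above, assuming the caller supplies a `train_and_score` function; that callable is a hypothetical stand-in for whatever learner and error metric are being evaluated.

```python
import random

def k_fold_cross_validation(data, k, train_and_score, seed=0):
    """Estimate error by averaging the error rate over k folds.

    data            : list of instances (e.g. (features, label) pairs)
    train_and_score : hypothetical callable(train, test) -> error rate on test
    """
    instances = list(data)
    random.Random(seed).shuffle(instances)        # random partition of the data
    folds = [instances[i::k] for i in range(k)]   # k roughly equal-sized folds

    error_rates = []
    for i in range(k):
        test = folds[i]                                                    # 1 fold for testing
        train = [x for j, f in enumerate(folds) if j != i for x in f]      # k-1 folds for training
        error_rates.append(train_and_score(train, test))

    return sum(error_rates) / k   # CV_k = (1/k) * sum of the fold error rates

# Leave-one-out cross-validation is the special case k = len(data).
```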
Average Class Accuracy
Two models. Which is better? We are really interested in identifying the churn customers.
Model 1 Accuracy: 91%. Model 2 Accuracy: 78%.
Note the imbalanced test set.

Average Class Accuracy
Use average class accuracy instead of classification accuracy: average the recall for each possible level of the target class.
Average class accuracy = (1 / |levels(t)|) * Σ_{l ∈ levels(t)} recall_l

Average Class Accuracy
Two models. Which is better? We are really interested in identifying the churn customers.
Model 1 Accuracy: 91%, Average Class Accuracy: 55%.
Model 2 Accuracy: 78%, Average Class Accuracy: 79%.

Average Class Accuracy (Harmonic Mean)
Average class accuracy can also be computed using the harmonic mean instead of the arithmetic mean:
Average class accuracy_HM = 1 / ( (1 / |levels(t)|) * Σ_{l ∈ levels(t)} (1 / recall_l) )
• Arithmetic means are susceptible to the influence of large outliers.
• The harmonic mean results in a more pessimistic view of model performance than the arithmetic mean.

Measuring Profit and Loss
Should all cells in a confusion matrix have the same value?

                         Predicted +                Predicted -
  Actual +     f++ (True Positive)       f+- (False Negative)
  Actual -     f-+ (False Positive)      f-- (True Negative)

It is not always correct to treat all outcomes equally.

Measuring Profit and Loss
A small cost is incurred by the company on false positives.
A large cost is incurred by the company on false negatives.
No cost on true positives and true negatives.
Take into account the costs of different outcomes…

Measuring Profit and Loss
Calculate a profit matrix: the profit or loss associated with each outcome in the confusion matrix.
Use it in the overall model evaluation.
Example: two models with average class accuracy_HM of 83.824% and 80.761%, respectively, compared using their profit matrices.
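To make these last two ideas concrete, here is a small Python sketch; the dictionary layouts and all of the numbers below are illustrative assumptions, not values from the lecture.

```python
def average_class_accuracy(recalls, harmonic=False):
    """Average per-class recall, arithmetically or with the harmonic mean.

    recalls : dict mapping each target level to its recall, e.g. {"churn": 0.62, "non-churn": 0.95}
    Assumes every per-class recall is non-zero when harmonic=True.
    """
    values = list(recalls.values())
    if harmonic:
        # Harmonic mean: more pessimistic, dominated by the worst per-class recall.
        return len(values) / sum(1.0 / r for r in values)
    return sum(values) / len(values)

def total_profit(confusion_counts, profit_matrix):
    """Weight each confusion-matrix cell by its profit or loss and sum.

    confusion_counts : counts per outcome, e.g. {"TP": 62, "FN": 38, "FP": 45, "TN": 855}
    profit_matrix    : profit or loss per outcome, e.g. {"TP": 10, "FN": -50, "FP": -5, "TN": 0}
    """
    return sum(confusion_counts[cell] * profit_matrix[cell] for cell in confusion_counts)

# Illustrative (made-up) numbers showing each cell weighted by its own profit or loss.
recalls = {"churn": 0.62, "non-churn": 0.95}
print(average_class_accuracy(recalls))                  # arithmetic mean of per-class recalls
print(average_class_accuracy(recalls, harmonic=True))   # harmonic mean (lower, more pessimistic)
print(total_profit({"TP": 62, "FN": 38, "FP": 45, "TN": 855},
                   {"TP": 10, "FN": -50, "FP": -5, "TN": 0}))
```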