Chapter 4: Evaluating Classification & Predictive Performance
2011 Data Mining, IISE, SNUT
Pilsung Kang, Industrial & Information Systems Engineering, Seoul National University of Science & Technology

Steps in Data Mining, revisited
1. Define and understand the purpose of the data mining project
2. Formulate the data mining problem
3. Obtain/verify/modify the data
4. Explore and customize the data
5. Build data mining models
6. Evaluate and interpret the results
7. Deploy and monitor the model

Why Evaluate?
Over-fitting to the training data:
• A model can fit the training data almost perfectly yet generalize poorly — do not memorize the data!
• For honest evaluation, the data are therefore split into training, validation, and test sets.
[Figure: two decision boundaries fit to the same training data — is the complex red boundary really better than the simpler blue one?]

Multiple methods are available to classify or predict:
• Classification: naïve Bayes, linear discriminant analysis, k-nearest neighbors, classification trees, etc.
• Prediction: multiple linear regression, neural networks, regression trees, etc.
For each method, multiple choices are available for its settings:
• Neural networks: number of hidden nodes, activation functions, etc.
To choose the best model, we need to assess each model's performance:
• Best setting (parameters) among various candidates for an algorithm (validation set).
• Best model among various data mining algorithms for the task (test set).

Classification Performance

Example: gender classification
Classify a person based on his/her body fat percentage (BFP).

BFP      10.0  21.7  8.9  19.9  23.4  28.9  15.7  21.6  21.5  23.2
Actual    M     F    M     F     F     F     M     F     M     M

Simple classifier: if BFP > 20 then female, else male.
How do you evaluate the performance of this classifier?

Confusion Matrix
Summarizes the correct and incorrect classifications that a classifier produces for a given data set. For the example above:

                   Predicted
                   F     M
Actual     F       4     1
           M       2     3
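A minimal sketch in plain Python of how this confusion matrix can be tabulated for the threshold classifier above; the counts are accumulated by hand, with no library assumed:

```python
# Sketch: the BFP threshold classifier and its confusion matrix.
bfp    = [10.0, 21.7, 8.9, 19.9, 23.4, 28.9, 15.7, 21.6, 21.5, 23.2]
actual = ["M", "F", "M", "F", "F", "F", "M", "F", "M", "M"]

# Simple classifier: if BFP > 20 then female, else male.
predicted = ["F" if x > 20 else "M" for x in bfp]

# Count each (actual, predicted) pair; "F" is treated as the positive class.
counts = {(a, p): 0 for a in "FM" for p in "FM"}
for a, p in zip(actual, predicted):
    counts[(a, p)] += 1

print("            Pred F  Pred M")
print(f"Actual F    {counts[('F','F')]:6d}  {counts[('F','M')]:6d}")
print(f"Actual M    {counts[('M','F')]:6d}  {counts[('M','M')]:6d}")
# Matches the matrix above: 4 1 / 2 3
```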
Performance measures from the confusion matrix
General form, with class 1 as the positive (+) class:

                     Predicted
                     1 (+)    0 (-)
Actual     1 (+)     n11      n10
           0 (-)     n01      n00

• Sensitivity (true positive rate, recall) = n11 / (n11 + n10)
• Specificity (true negative rate) = n00 / (n01 + n00)
• Precision = n11 / (n11 + n01)
• Type I error (false negative) = n10 / (n11 + n10)
• Type II error (false positive) = n01 / (n01 + n00)
• Misclassification error = (n10 + n01) / (n11 + n10 + n01 + n00)
• Accuracy = 1 − misclassification error = (n11 + n00) / (n11 + n10 + n01 + n00)
• Balanced correction rate (BCR) = sqrt(sensitivity × specificity)
• F1 measure (harmonic mean of recall and precision) = (2 × Recall × Precision) / (Recall + Precision)

For the gender example (F as the positive class):
• Sensitivity: 4/5 = 0.8, Specificity: 3/5 = 0.6
• Recall: 4/5 = 0.8, Precision: 4/6 ≈ 0.67
• Type I error: 1/5 = 0.2, Type II error: 2/5 = 0.4
• Misclassification error: (1 + 2) / (4 + 1 + 2 + 3) = 0.3, Accuracy = 0.7
• Balanced correction rate: sqrt(0.8 × 0.6) ≈ 0.69
• F1 measure: (2 × 0.8 × 0.67) / (0.8 + 0.67) ≈ 0.73

Cut-off for classification
A new classifier: if BFP > θ then female, else male.
Sort the data (a fresh sample of ten people) in descending order of BFP:

No.   BFP    Gender
1     28.6   F
2     25.4   M
3     24.2   F
4     23.6   F
5     22.7   F
6     21.5   M
7     19.9   F
8     15.7   M
9     10.0   M
10    8.9    M

How do you decide the cut-off θ? Performance measures for different cut-offs (recomputed in the sketch below):

θ     Misclass. error   Accuracy   BCR    F1
24    0.4               0.6        0.57   0.50
22    0.2               0.8        0.80   0.80
18    0.2               0.8        0.77   0.83

In general, classification algorithms produce a likelihood for each class — a probability, a degree of evidence, etc. — and classification performance depends strongly on the chosen cut-off. For model selection and model comparison, cut-off-independent performance measures are therefore recommended: lift charts, the receiver operating characteristic (ROC) curve, etc.

Lift charts: an example
Cancer diagnosis:
• A total of 100 patients; 20 of them are malignant (malignant ratio: 0.2).
• The model predicts each patient's probability of malignancy, P(Malignant).
[Table: the 100 patients sorted by P(Malignant) in descending order, from patient 1 (0.976, malignant) down to patient 100 (0.002, benign), with actual status 1 = malignant, 0 = benign.]
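Before turning to the lift-chart computations, a minimal sketch in plain Python that reproduces the cut-off comparison table above; the data and the three θ values are taken from the example:

```python
import math

# Sorted BFP data from the cut-off example; "F" is the positive class.
data = [(28.6, "F"), (25.4, "M"), (24.2, "F"), (23.6, "F"), (22.7, "F"),
        (21.5, "M"), (19.9, "F"), (15.7, "M"), (10.0, "M"), (8.9, "M")]

def metrics(theta):
    n11 = n10 = n01 = n00 = 0
    for bfp, gender in data:
        pred_pos = bfp > theta          # predict female when BFP > theta
        if gender == "F":
            n11, n10 = n11 + pred_pos, n10 + (not pred_pos)
        else:
            n01, n00 = n01 + pred_pos, n00 + (not pred_pos)
    sens, spec = n11 / (n11 + n10), n00 / (n01 + n00)
    prec = n11 / (n11 + n01) if n11 + n01 else 0.0
    err  = (n10 + n01) / (n11 + n10 + n01 + n00)
    bcr  = math.sqrt(sens * spec)
    f1   = 2 * sens * prec / (sens + prec) if sens + prec else 0.0
    return err, 1 - err, bcr, f1

for theta in (24, 22, 18):
    err, acc, bcr, f1 = metrics(theta)
    print(f"theta={theta}: error={err:.2f} acc={acc:.2f} BCR={bcr:.2f} F1={f1:.2f}")
# theta=24: error=0.40 acc=0.60 BCR=0.57 F1=0.50
# theta=22: error=0.20 acc=0.80 BCR=0.80 F1=0.80
# theta=18: error=0.20 acc=0.80 BCR=0.77 F1=0.83
```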
Confusion matrix at cut-off 0.9
Malignant if P(Malignant) > 0.9, else benign:

                   Predicted
                   M     B
Actual     M       6     14
           B       3     77

• Misclassification error = 0.17
• Accuracy = 0.83
Is it a good classification model?

Confusion matrix at cut-off 0.8
Malignant if P(Malignant) > 0.8, else benign:

                   Predicted
                   M     B
Actual     M       10    10
           B       10    70

• Misclassification error = 0.2
• Accuracy = 0.8
Is it worse than the previous model?

Lift charts
• Useful for assessing performance in terms of identifying the most important class.
• Compare the performance of the DM model to "no model, pick randomly."
• Measure the ability of the DM model to identify the important class, relative to its average prevalence.
• The charts give an explicit assessment of results over a large number of cut-offs.

Lift charts: preparation
Benchmark model (B): randomly assign "malignant" with probability 0.2. Compute the number of malignant patients in each decile, for the DM model (A) and the benchmark (B):

          Non-cumulative     Cumulative
Decile    A      B           A      B
1         6      2           6      2
2         4      2           10     4
3         3      2           13     6
4         2      2           15     8
5         2      2           17     10
6         1      2           18     12
7         1      2           19     14
8         1      2           20     16
9         0      2           20     18
10        0      2           20     20

Lift charts: plots
Plot the number of cases, the relative ratio (lift), or the proportion of malignant patients for each decile.
[Figures: per-decile case counts and relative ratios for model A vs. benchmark B.]

Proportion (non-cumulative):
• Patients in the top 20~30% of predicted probability: 30% of them are malignant.
• Lift = 0.3 / 0.2 = 1.5

Proportion (cumulative):
• Patients in the top 0~30% of predicted probability: 43.33% of them are malignant.
• Cumulative lift = 0.43 / 0.2 = 2.17

Gain chart
Compare two models for each cumulative decile:
• Top 0~30%: 65% of all malignant patients (13 of 20) belong to this group; equivalently, cumulative lift = 0.65 / 0.30 = 2.17.
• Cumulative lift chart (y-axis): (malignant / total patients) in the group.
• Gain chart (y-axis): (malignant in the group) / total malignant.
A sketch of these computations follows.
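A minimal sketch that recomputes the cumulative proportion, cumulative lift, and gain for model A from the non-cumulative decile counts above (10 patients per decile, since there are 100 patients):

```python
# Cumulative lift and gain for the cancer-diagnosis example.
hits_per_decile = [6, 4, 3, 2, 2, 1, 1, 1, 0, 0]   # malignant found by model A
total_patients, total_malignant = 100, 20
base_rate = total_malignant / total_patients        # 0.2, the "no model" rate

cum_hits = 0
for d, hits in enumerate(hits_per_decile, start=1):
    cum_hits += hits
    patients_seen = 10 * d
    proportion = cum_hits / patients_seen           # y-axis of cumulative lift chart
    lift = proportion / base_rate
    gain = cum_hits / total_malignant               # y-axis of gain chart
    print(f"decile {d}: cum. proportion={proportion:.3f}, "
          f"cum. lift={lift:.2f}, gain={gain:.2f}")
# e.g. decile 3: cum. proportion=0.433, cum. lift=2.17, gain=0.65
```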
Receiver operating characteristic (ROC) curve
• Sort the records in descending order of P(interesting class).
• Compute the true positive rate and the false positive rate while varying the cut-off.
• Draw a chart whose x-axis is the false positive rate and whose y-axis is the true positive rate.
[Table: cumulative true positive and false positive rates as the cut-off is lowered past each patient, e.g. after patient 1 (P(Malignant) = 0.976, malignant): TPR = 0.050, FPR = 0.000; …; after patient 18 (P(Malignant) = 0.833, benign): TPR = 0.500, FPR = 0.100.]

[Figure: the ROC curve. X-axis: false positive rate (1 − specificity); y-axis: true positive rate (sensitivity). The ideal classifier passes through the top-left corner; the random classifier lies on the diagonal.]

Lowering the cut-off raises both the true positive rate and the false positive rate. A point with a high true positive rate and a low false positive rate is good; high TPR with high FPR is so-so; low TPR with high FPR is bad.
[Figure: the confusion matrices corresponding to points along the ROC curve, from a low cut-off to a high cut-off.]
[Figure: side-by-side comparison of the ROC curve, the lift chart, and the gain chart.]

Area under the ROC curve (AUROC)
• The area under the ROC curve; can be a useful metric for parameter/model selection.
• 1 for the ideal classifier, 0.5 for the random classifier.
A sketch of the computation follows.
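A minimal sketch of ROC-point computation and a trapezoidal-rule AUROC. The scores are the top ten P(Malignant) values from the example; the labels are illustrative toy values, not a faithful copy of the full 100-patient table:

```python
# Sketch: ROC points and AUROC from scores sorted in descending order.
scores = [0.976, 0.973, 0.971, 0.967, 0.937, 0.936, 0.929, 0.927, 0.923, 0.898]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 1]   # 1 = malignant, 0 = benign (toy values)

pairs = sorted(zip(scores, labels), reverse=True)   # descending by score
P = sum(labels)            # number of positives
N = len(labels) - P        # number of negatives

tpr_list, fpr_list = [0.0], [0.0]
tp = fp = 0
for score, label in pairs:   # lower the cut-off past each record in turn
    if label == 1:
        tp += 1
    else:
        fp += 1
    tpr_list.append(tp / P)
    fpr_list.append(fp / N)

# AUROC by the trapezoidal rule over the (FPR, TPR) points.
auroc = sum((fpr_list[i] - fpr_list[i - 1]) * (tpr_list[i] + tpr_list[i - 1]) / 2
            for i in range(1, len(fpr_list)))
print(f"AUROC = {auroc:.3f}")   # 1.0 for an ideal ranking, 0.5 for random
```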
Asymmetric misclassification costs
In many cases it is more important to identify members of one class:
• Cancer diagnosis, tax fraud, credit default, response to a promotional offer, etc.
In such cases, we are willing to tolerate greater overall error in return for better identifying the important class for further attention:
• The cost of making a misclassification error may be higher for one class than for the other(s).
• The benefit of making a correct classification may be higher for one class than for the other(s).

Example: response to a promotional offer
Suppose we send an offer to 1,000 people, with a 1% average response rate ("1" = response, "0" = non-response).

"Naïve rule": classify everyone as "0".

                   Predicted
                   1     0
Actual     1       0     10
           0       0     990

• Misclassification error = 1%
• Accuracy = 99%

DM model: correctly classifies eight 1's as 1's, at the cost of misclassifying twenty 0's as 1's and two 1's as 0's.

                   Predicted
                   1     0
Actual     1       8     2
           0       20    970

• Misclassification error = 2.2%
• Accuracy = 97.8%
Is it worse than the naïve rule?

Profit/cost matrix
Assign a profit/cost to each cell of the confusion matrix. Example:
• $10: net profit per responder if the offer is sent.
• $10: net cost of not sending the offer to a responder.
• $1: net cost of sending an offer.

                   Predicted
                   1       0
Actual     1       $9      -$10
           0       -$1     $0

• Total profit for the naïve rule: 10 × (-$10) = -$100
• Total profit for the DM model: 8 × $9 + 2 × (-$10) + 20 × (-$1) = $32

Profit/cost matrix for cancer diagnosis
Can we assign a net cost to classifying a malignant patient as benign?

                   Predicted
                   1 (malignant)                       0 (benign)
Actual     1       saving a life — can we measure it?  misdiagnosis cost
           0       misdiagnosis cost                   $0

• This is why doctors' diagnoses are usually very conservative.

Cost ratio
In general, actual costs and benefits are hard to estimate:
• Express everything in terms of costs (i.e., the cost of misclassification per record).
• The goal is to minimize the average cost per record.
A good practical substitute for individual costs is the ratio of misclassification costs:
• Misclassifying a responder costs 10 times more than misclassifying a non-responder.
• Misclassifying a fraudulent firm is 5 times worse than misclassifying a solvent firm.

Evaluation using the cost ratio, where q1 and q0 are the costs of misclassifying a positive (class 1) and a negative (class 0) record, respectively:

Expected misclassification cost per record
= (q1·n10 + q0·n01) / n
= q1 · [n10 / (n11 + n10)] · [(n11 + n10) / n] + q0 · [n01 / (n01 + n00)] · [(n01 + n00) / n]
= q1 · (1 − sensitivity) · p(C1) + q0 · (1 − specificity) · p(C0),

where n = n11 + n10 + n01 + n00 and p(C1), p(C0) are the class priors.

Oversampling for asymmetric costs
[Figure: a decision boundary when misclassification costs are equal.]
[Figure: the boundary shifts when costs are unequal — here, misclassifying an "o" costs 5 times as much as misclassifying an "x".]
Oversampling achieves the same effect through the data:
• Generate four synthetic "o" instances around each "o", so that each "o" effectively counts five times.
[Figure: the oversampled data and the resulting boundary.]

Confusion matrix for over-sampled data
Assume the original data contain 2% class-1 and 98% class-0 records, and over-sampling was conducted so that the two classes have equal numbers of records (500 each). After oversampling:

                   Predicted
                   1      0
Actual     1       420    80
           0       110    390

• Misclassification rate = (80 + 110) / 1,000 = 19%

Rescaling to the original data:
• Number of records in the entire data set: 0.02 × X = 500, so X = 25,000.
• Number of class-0 records: 25,000 × 0.98 = 24,500, i.e. the actual-0 row scales by 24,500 / 500 = 49.

                   Predicted
                   1        0
Actual     1       420      80
           0       5,390    19,110

• Misclassification rate = (80 + 5,390) / 25,000 = 21.9%
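A minimal sketch of the cost arithmetic and the oversampling rescale above. The confusion-matrix counts come from the slides; the final cost values q1 = 10 and q0 = 1 are illustrative assumptions (responder errors 10× worse), not from the slides:

```python
# Sketch: expected misclassification cost per record, and rescaling an
# oversampled confusion matrix back to the original class ratio
# (2% class 1, 98% class 0).

def expected_cost(n11, n10, n01, n00, q1, q0):
    """q1 / q0: cost of misclassifying a class-1 / class-0 record."""
    n = n11 + n10 + n01 + n00
    return (q1 * n10 + q0 * n01) / n

# Oversampled matrix (500 records per class); rows: actual 1, actual 0.
n11, n10, n01, n00 = 420, 80, 110, 390
print((n10 + n01) / 1000)                  # 0.19 on the oversampled data

# Rescale the actual-0 row: the original data has 500 class-1 records
# (0.02 * X = 500 -> X = 25,000) and 24,500 class-0 records.
scale0 = 24_500 / (n01 + n00)              # = 49
n01o, n00o = n01 * scale0, n00 * scale0    # 5,390 and 19,110
print((n10 + n01o) / 25_000)               # 0.2188 ~ 21.9% on the original data

# With the assumed costs q1 = 10, q0 = 1:
print(expected_cost(n11, n10, n01o, n00o, q1=10, q0=1))   # 0.2476 per record
```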
Prediction Performance

Example: predict a baby's weight (kg) based on his age.

Age   Actual weight (y)   Predicted weight (ŷ)
1     5.6                 6.0
2     6.9                 6.4
3     10.4                10.9
4     13.7                12.4
5     17.4                15.6
6     20.7                21.5
7     23.5                23.0

Average error
Indicates whether the predictions are, on average, over- or under-predictions:
Average error = (1/n) Σ (yᵢ − ŷᵢ) ≈ 0.343

Mean absolute error (MAE)
Gives the magnitude of the average error:
MAE = (1/n) Σ |yᵢ − ŷᵢ| ≈ 0.829

Mean absolute percentage error (MAPE)
Gives a percentage score of how much the predictions deviate, on average, from the actual values:
MAPE = 100% × (1/n) Σ |yᵢ − ŷᵢ| / |yᵢ| ≈ 6.43%

(Root) mean squared error ((R)MSE)
The standard error of estimate; RMSE is in the same units as the predicted variable:
MSE = (1/n) Σ (yᵢ − ŷᵢ)² ≈ 0.926
RMSE = sqrt( (1/n) Σ (yᵢ − ŷᵢ)² ) ≈ 0.962

The sketch below computes all four measures on this example.
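A minimal sketch in plain Python computing the four prediction-error measures on the baby-weight example above:

```python
import math

# Baby-weight example: actual vs. predicted weight (kg) for ages 1-7.
y    = [5.6, 6.9, 10.4, 13.7, 17.4, 20.7, 23.5]   # actual weight
yhat = [6.0, 6.4, 10.9, 12.4, 15.6, 21.5, 23.0]   # predicted weight
n = len(y)

avg_err = sum(a - p for a, p in zip(y, yhat)) / n                    # 0.343
mae     = sum(abs(a - p) for a, p in zip(y, yhat)) / n               # 0.829
mape    = 100 * sum(abs((a - p) / a) for a, p in zip(y, yhat)) / n   # 6.43
mse     = sum((a - p) ** 2 for a, p in zip(y, yhat)) / n             # 0.926
rmse    = math.sqrt(mse)                                             # 0.962

print(f"ME={avg_err:.3f} MAE={mae:.3f} MAPE={mape:.2f}% "
      f"MSE={mse:.3f} RMSE={rmse:.3f}")
```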