Data Mining Chapter 5 Credibility: Evaluating What’s Been Learned Kirk Scott 1 2 3 4 • This set of overheads consists of just one section in the book: • 5.7 Counting the Cost • There are approximately 200 overheads • This is a long and important section 5 5.7 Counting the Cost • Preliminary note: • The book discusses cost in this section • It also discusses things that don’t really involve cost • Incidentally, it also discusses things that would involve (differential) cost, but simplifies by using cost coefficients of 1 and 0 rather than anything more complicated 6 • In a way, the contents of this section might come across as a mish-mash • Hopefully the overheads will add clarity to the book’s presentation • Whether mish-mash or not, ultimately the topics covered will be important because they are aspects of what is available in Weka 7 Fundamental Cost Ideas • For any prediction scheme there will be successes and failures, that is, successes and errors • Consider a binary classification scheme where the output is yes/no, true/false, positive/negative 8 Two Kinds of Successes • Correct predictions of positive, true positives, TP • Correct predictions of negative, true negatives, TN • Quite often, the cost (i.e., the benefit) of the two kinds of successes is taken to be the same • You can deal with TP + TN together 9 Two Kinds of Errors • Incorrect predictions of positive, false positives, FP • Incorrect predictions of negative, false negatives, FN • In virtually every applied situation the costs of false positives and false negatives materially differ 10 Difference in Error Costs • The book illustrates the idea with several examples • One is direct mail advertising, which it uses again later in the section • Consider a mailing to a predicted potential customer who doesn’t respond • This is a false positive • The individual cost of the unsuccessful mailing is low 11 • Consider a mailing that was never sent to what would have been a customer • This is a false negative • The 
algorithm incorrectly predicted negative for that potential customer • The cost of the lost business is high compared to the cost of a mailing 12 • In order to stay in business, the cost of mailing has to be less than the cost of business generated • You can afford lots of false positives, although in aggregate their cost adds up • In this domain the cost of lost opportunity is invisible, but in practical terms, the cost of a few false negatives quickly outweighs the cost of many false positives 13 Confusion Matrices • The idea of a confusion matrix will come up more than once in this section • A confusion matrix is a graphical way of summarizing information about successes and failures in prediction 14 • The simplest of confusion matrices, one for two classifications, is illustrated in Table 5.3, shown on the following overhead • The entries in the matrix are the counts of the different kinds of successes and failures for a given situation 15 16 • Next the book presents Table 5.4, which consists of parts (a) and (b) • I will break up the presentation • Consider Table 5.4a, shown on the following overhead • It shows the confusion matrix resulting from the application of a non-binary classification scheme 17 18 • The rows represent the actual class for the instances in a data set • The columns represent the class predictions given by the classification scheme • The true predictions are on the main diagonal of the matrix • The errors are off of the main diagonal 19 False Positives • Look at column a, for example • The entry for row a, 88, is the count of the number of correct predictions • The entries for rows b and c, 14 and 18, are the counts of false positives with respect to class a • The algorithm falsely predicted this many instances of b and c were of class a 20 False Negatives • Now look at row a • The entries for columns b and c, 10 and 2, are the counts of false negatives with respect to class a • The algorithm falsely predicted this many instances of class 
a were in classes b and c 21 Food for Thought • Now look at row b • The entry for column a, 14, is a false negative with respect to class b • This entry was identified as a false positive with respect to class a 22 • So for a classification that isn’t binary, whether something is a false positive or false negative is a matter of perspective • Geometrically, it depends on whether you’re looking from the perspective of a row or a column 23 • The rows represent the actual classification • Taken from the perspective of a row, an error entry is a failure to correctly place something into that classification • This is a false negative 24 • The columns represent the predicted classification • Taken from the perspective of a column, an error entry is an incorrect placement of something into that classification • This is a false positive 25 • The row b, column a entry is an error, false positive or false negative, depending on perspective • It is simply a unique error case, distinct from the other error entries in the table 26 • It might have a cost associated with it that is distinct from the cost of any other error entry in the matrix • At this point, the kind of error doesn’t really matter • However, the kind of error may eventually be figured into the calculation of the cost 27 Evaluating Classification Performance with a Confusion Matrix • Next, the book shows a way of evaluating the performance of a classification scheme • This evaluation method has nothing to do with costs • It’s simply an application built on top of, and graphically illustrated by, confusion matrices 28 • Previous sections of the book evaluated data mining algorithm performance by estimating error rates of classification schemes • It also considered comparing the performance of two different schemes 29 • The idea presented now is that the performance of a data mining algorithm can be evaluated by comparing its results to the results of random classification • The results of an actual predictor will 
be compared with a hypothetical predictor that does classification randomly 30 • Table 5.4 (a) and Table 5.4 (b) combined illustrate the idea • Table 5.4 (a) represents the actual predictor • Table 5.4 (b) represents the hypothetical predictor • These tables are shown on the following overhead • Explanations follow that 31 32 • First consider the bottom row of 5.4 (a) • The values 120, 60, and 20 are the totals predicted for classes a, b, and c overall • Now consider the rightmost column of 5.4 (a) • 100, 60, and 40 are the counts of the actual occurrence of instances of classes a, b, and c 33 • Now consider Table 5.4 (b) • The bottom row of 5.4 (b) is the same as the bottom row of 5.4 (a) • The totals predicted for the classes are the same • The rightmost column of 5.4 (b) contains the same values as the rightmost column of 5.4 (a) • The count of the actual number of 34 instances of each class does not change • Table 5.4 (a) represents the actual predictor we are trying to evaluate • Table 5.4 (b) represents a hypothetical predictor that we are comparing it against • The fact that the bottom rows of the two tables are the same means that the hypothetical predictor gives the same overall count of predictions for each class 35 • It’s the bodies of the two tables that differ • They differ in how the correct and incorrect predictions for each class are distributed • The hypothetical predictor’s values work in this way: • The values in the rows for every actual class are distributed in proportion to the overall totals in the bottom row 36 • 120 is to 60 is to 20 • As 60 is to 30 is to 10 • As 36 is to 18 is to 6 • As 24 is to 12 is to 4 • These sets of values represent the so-called random predictor • (It’s not exactly random. It’s based on the prediction counts for the actual scheme.) 37 • The random predictor is the point of comparison • Can you make a quantitative comparison between the performance of the actual classification scheme and the random predictor? 
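The comparison worked through on the following overheads can be sketched in code. This is a minimal Python sketch using the counts from Table 5.4 (a); the proportional "random predictor" is derived from the row and column totals, exactly as described above:

```python
# Sketch: comparing an actual classifier to the proportional "random predictor"
# using the counts from Table 5.4 (a). Rows are actual classes a, b, c;
# columns are predicted classes.

actual = [
    [88, 10, 2],   # actual a
    [14, 40, 6],   # actual b
    [18, 10, 12],  # actual c
]

n = sum(sum(row) for row in actual)              # 200 instances in all
col_totals = [sum(col) for col in zip(*actual)]  # predictions per class: 120, 60, 20
row_totals = [sum(row) for row in actual]        # actual counts per class: 100, 60, 40

# Random predictor: distribute each row's total in proportion to the column totals
random_pred = [[row_totals[i] * col_totals[j] / n for j in range(3)]
               for i in range(3)]

observed = sum(actual[i][i] for i in range(3))       # 140 correct predictions
expected = sum(random_pred[i][i] for i in range(3))  # 82 "correct" by chance

# Fraction of the random predictor's errors that the actual scheme gets right
kappa = (observed - expected) / (n - expected)
print(round(kappa, 3))  # → 0.492
```

This ratio is the Kappa statistic discussed on the overheads that follow.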
• If your classification scheme isn’t better than random, you’ve got a big problem • Your predictor is sort of an anti-predictor 38 Comparing with the Random Predictor • Sum the counts on the main diagonal of the matrix for the actual scheme • It got 88 + 40 + 12 = 140 correct • Do the same for the random predictor • It got 60 + 18 + 4 = 82 correct • Clearly the actual predictor is better • Can this be quantified? 39 • There are 200 instances altogether • The random predictor had 200 – 82 = 118 incorrectly classified instances • The actual predictor got 140 – 82 = 58 more correct than the random predictor • Of the 118 incorrectly classified by the random predictor, the actual predictor classified 58 of them correctly 40 • You measure performance by finding the ratio 58/118 = 49.2% • This ratio is known as the Kappa statistic • For a predictor that is no worse than random, the value of the statistic can range from 0 to 1 • 0 means no better than random • 1 means perfect prediction 41 Counting as Opposed to Cost • The method given here doesn’t count the cost of FP or FN • It is similar to previous approaches • It simply counts the successes vs. 
the errors and compares with the hypothetical case 42 • What we just looked at were confusion matrices • These structures contained counts of correct and incorrect classifications • It was possible to compare these values for an actual and a hypothetical predictor 43 Cost-Sensitive Classification and Cost Matrices • The next step will be to include cost in the performance calculation • Suppose you don’t assign different cost values to different success types (TP, TN) and different failure types (FP, FN) • Cost of success = 0 • Cost of failure = 1 44 • The simple cost matrices for the two-class and three-class cases are given in Table 5.5, shown on the following overhead 45 46 Performance Evaluation with Costs • Suppose you have a data mining scheme that gives straight classification predictions • With a test set, you have both a prediction and the known classification • Up until now, performance was evaluated on the basis of a count of errors 47 • Potentially, each element of the cost matrix could contain a different decimal value which was the weight for that case • Successes (benefits) could be positive • Failures/errors (costs) would be negative • You would evaluate performance by summing up the total of successes and errors multiplied by the corresponding weight in the cost matrix 48 Complete Financial Analysis • Data mining doesn’t exist in a vacuum • It is a real technique applied in businesslike situations • Suppose you were comparing the cost of two different algorithms • Not only might the costs associated with the predictions be taken into account 49 • The better performing algorithm might have data preprocessing or computational costs that are significant • Its predictions might be X dollars more valuable than those of another scheme • But the comparative costs of running the algorithm may be greater than X • A complete analysis would take this into account when picking an algorithm 50 Classification Including Costs • This section represents a shift of 
focus • So far we’ve been concerned with doing a cost evaluation of prediction performance after the fact • Now we want to try and take cost into account when making the prediction 51 • Suppose you’re using a classification scheme that generates probabilities • Up until now, if you had to classify under this condition, you picked the classification with the highest probability • With cost information, another possibility presents itself 52 • For a given instance that you want to classify: • pi is the predicted probability that the instance falls into the ith class • Then for each class, (1 – pi) is the probability of a false positive • In other words, classify on the basis of minimum cost 53 • Multiply each of the (1 – pi) times the cost factor of a misclassification (false positive) • Do not classify according to the largest pi • Instead, give the instance the classification where the product of the probability of a misclassification and the cost of a misclassification is lowest 54 Probabilities vs. 
Straight Classification • The book notes that there are ways of converting the results of straight classification schemes into probabilities • It doesn’t give details (although it’s pretty clear this would involve counting and dividing in Bayesian style) • The idea is that you could then apply cost metrics when deciding classification for those schemes 55 Summary of the previous idea • The approach for factoring cost into predictions given thus far could be summarized as follows: • When running the data mining algorithm, do not take cost into account when deriving the rules • However, after the rules are derived, take cost into account when using them to make a classification 56 Cost-Sensitive Learning • Now we would like to try and take cost into account when deriving the rules • It is possible to follow a line of reasoning that explains how this is accomplished • Not all the details will be given • However, the approach can be outlined based on concepts that have already been presented 57 There are Schemes that give Probabilities as Results • We have repeatedly been told that some data mining algorithms will give probabilities instead of classifications • Naïve Bayes gave these kinds of results • We infer that there are other, more sophisticated data mining schemes that also give probabilistic results • In other words, probabilistic based schemes have general application 58 What about the Loss Function? 
• In this chapter the idea of a loss function was introduced • The statement was made that the way to improve a classification scheme was to minimize the loss • In fact, we have not yet been told how data mining schemes can be guided to give better results by a process of minimizing loss 59 Cost and Loss • Cost is external to a classification scheme • There are costs in the real world • Total cost depends on how successful or unsuccessful classification is 60 • Loss is internal to a classification scheme • It is a measure of how greatly predicted classification probabilities differ from observed actual classifications • Cost and loss are not the same, but they are related 61 • When evaluating a scheme, the cost is calculated by multiplying the probability of an outcome times the cost of that outcome • If the mining scheme is guided by minimizing loss, that means it’s guided by improving the probability estimates • This will tend to minimize cost 62 Making the Jump—Bias to Cost • This is where you make a jump • Recall this idea from earlier in the chapter: • You achieve quality evaluations of a scheme by cross-validation • If you want accurate performance estimates, the distribution of classifications in the test set should match the distribution of classifications in the training set 63 • If there was a mismatch between training and test set some test set instances might not classify correctly • The mismatch in the sets means that the training set was biased • If it was biased against certain classifications, this suggests that it was biased in favor of other classifications 64 • Note that bias is closely related to the concept of fitting or overfitting • In nonjudgmental terms, there is simply a mismatch between training and test set • In more judgmental terms, the training set is overfitted or biased in favor of giving some predictions or probabilities which are not as prevalent in the test set 65 Including Cost in Rule Derivation by Intentionally Using Bias • 
The approach to including cost in the derivation of classification rules is this: • Intentionally bias the training set by increasing the representation of some classifications • The algorithm will produce rules more likely to correctly predict these classifications 66 • The practical questions are these: • Which classifications to over-represent? • In general you want to over-represent a classification when erroneously predicting that classification has a low cost (in the example below, erroneously predicting no is a cheap false negative) • How much to over-represent them? • The answer to this is pretty much just empirical rule of thumb 67 Concretely… • Take this as a concrete example: • Suppose that false positives are more costly than false negatives • Over-represent “no” instances in the training set • When you run the classification algorithm on this training set, it will overtrain on no 68 • That means the rules derived by the algorithm will be more likely to reach a conclusion of no • Incidentally, this will increase the number of false negatives • However, it will decrease the number of false positives for these classifications • In this way, cost has been taken into account in the rules derived, reducing the 69 likelihood of expensive FP results • The book notes that you aren’t limited to one way of accomplishing this over-representation • You can allow duplicate instances in the training set (like sampling with replacement) • In practice, some data mining schemes just allow heavier or lighter weights to be assigned to instances of given classifications 70 Cost and Loss, Again • Technically, we still didn’t use an algorithm that was guided by loss • Skewing the sample was the proxy for tuning loss, which in turn was also the proxy for cost in the rule derivation 71 Lift Charts • The term lift describes certain techniques based on the following idea • You may not be able to perfectly classify instances and identify only those with the desired classification • However, you may be able to identify subsets of a data set where the probability of the desired 
classification is higher 72 • Consider the direct mailing example which was mentioned quite a while back • Suppose you know as a starting point that out of a population (training set/test set) of 1,000,000, there will be 1,000 yeses • This is a response rate of .1% 73 Defining the Lift Factor • Now suppose you have a data mining algorithm that gives probabilities of yes for the instances • This means you can identify people more likely to respond yes and those less likely to respond yes • You can identify subpopulations (subsets, samples) with a greater likelihood of yes 74 • The proportional increase in the response rate is known as the lift factor • The response rate = count(yeses) / count(number of instances in sample) • The lift factor is response rate(sample) / response rate(population) • This idea is summarized in the table on the following overhead 75

Sample      Yeses   Response Rate   Lift Factor
1,000,000   1,000   .1%             1 (baseline)
400,000     800     .2%             2
100,000     400     .4%             4

76 • Assuming that you can do this, note this obvious result: • You’ve increased the response rate—good • You’ve accomplished this by reducing the sample size—also good • On the other hand, the absolute number of yeses has dropped • In other words, you are excluding some yeses from consideration 77 Finding Lift Factors • Given a training set or population, how do you come up with the figures for lift factors? 
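The counting described on the next overheads can be sketched in a few lines of Python. The ranked list here is a small hypothetical example (probability, actual class), not the book's Table 5.6 data:

```python
# Sketch: lift factor for a sample taken from the top of a probability ranking.
# Each entry is (predicted probability of yes, actual class); the list is
# already sorted by probability, highest first. Hypothetical data.

ranked = [
    (0.95, "yes"), (0.93, "yes"), (0.91, "no"), (0.88, "yes"),
    (0.84, "no"),  (0.80, "yes"), (0.77, "no"), (0.73, "no"),
    (0.68, "yes"), (0.60, "no"),
]

# Response rate of the whole population: 5 yeses out of 10 instances
population_rate = sum(1 for _, c in ranked if c == "yes") / len(ranked)

def lift_factor(sample_size):
    """Response rate of the top sample_size instances over the population rate."""
    top = ranked[:sample_size]
    sample_rate = sum(1 for _, c in top if c == "yes") / sample_size
    return sample_rate / population_rate

# Top 4 instances contain 3 yeses: (3/4) / (5/10) = 1.5
print(lift_factor(4))  # → 1.5
```

Running the same computation for every possible sample size gives exactly the points that a lift chart plots.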
• The data mining algorithm gives a probability of a yes • The training data also tells you whether a given instance is actually yes 78 • Rank all of the instances by their probability and keep track of their actual class • In essence, you choose your new, “lifted” subset, by taking instances from the top of the probability rankings • (Don’t worry, a more complete explanation is coming) 79 • Note that we will be operating under the (reasonable) assumption that the data mining algorithm really works • In other words, we assume that the higher the predicted probability, the more likely an instance really is to take on a certain value 80 • Table 5.6, given on the following overhead, shows instances ranked by their predicted probability, and with their actual classification also listed 81 82 • Using a table like this, you can find the lift factor for a given sample size • For example, for a sample size of 10, take the top 10 instances • The lift ratio depends on the ratio of yeses to no’s in that sample 83 • Using a table like this, you can also find the sample size for a given lift factor • For a desired lift factor, go down the table tallying the ratio of yeses to no’s • When you get the desired lift factor, the sample size is the number of instances you tallied 84 Finding the Lift factor for a Given Sample Size • In summary: • If you want the lift factor for a particular sample size, just count down the list until you reach that sample size • The response rate = count(yeses) / count(number of instances in sample) • The lift factor is response rate(sample) / response rate(population) 85 Finding the Sample Size for a Given Lift Factor • In summary: • If you want a particular lift factor, keep a running computation of response rate and lift factor as you go down the list • When you reach the desired lift factor, the number of instances you’ve gone through is your sample size • (Note that response rate is the key figure) 86 Making a Lift Chart • To make a lift chart, 
you need the lift factor for each possible sample size • In other words, work your way down the ranked list one instance at a time • For each one, you can calculate the response rate so far and the lift factor so far 87 • The x-axis is the sample size • This could be given as a percent of the population • The y-axis is the lift factor • This could be given as the actual number of yes responses generated • The resulting plot is what is known as a lift chart 88 • Figure 5.1, shown on the following overhead, illustrates the idea for the direct mailing example • Recall that the example had really nice numbers • Because the total yes responses for the population was 1,000, when reading the y-axis, you can readily interpret the value as a percent 89 90 What a Lift Chart Tells You • The lift chart tells you something about your data mining algorithm • The diagonal line represents the situation where, if you take m% of the population, you get m% of the yeses • This is what you would expect of a random sample from the population 91 • An effective data mining algorithm should produce a lift chart curve above the diagonal • The higher and further to the left the hump is, the better 92 • In general, the closer the lift chart curve comes to the upper left-hand corner, the better • A minimal sample size and a maximum response rate is good • The hypothetical ideal would be a mailing only to those who would respond yes, namely a 100% success rate, with no one left out (no false negatives) 93 • If the curve is below the diagonal, the algorithm is determining the yes probability for instances with a success rate lower than a random sample • This is similar to other situations, where the success rate is below 50%, for example • A data mining algorithm with this kind of performance is an “anti-predictor” 94 Cost Benefit Curves • The costs and benefits of different sample sizes/mailing scenarios will differ • As noted earlier, the cost of the algorithm itself can be a factor • But 
for the time being we will ignore it • The assumption is that we have already done the work of running the algorithm • We want to know what this completed work can tell us 95 • The cost of interest now is the cost of mailing an individual item • We’ll assume that the cost is constant per item • The benefit of interest is the value of the business generated per yes response • We’ll assume that the benefit is constant per positive response 96 • We now have three data items: • A lift chart • A cost • A benefit • From this information it is possible to form a cost/benefit curve across the same domain (percent of sample size) as the lift chart 97 • Figure 5.2 was produced by Weka • It shows lift charts on the left and cost/benefit curves on the right for various scenarios • It is given on the following overhead and will be discussed in greater detail on the overheads following it 98 99 Comments on Figure 5.2 • For Figure 5.2, a benefit of +15.00 per positive response is assumed • The charts at the top assume a mailing cost per item of .50 • A lift chart is shown on the left and a cost/benefit curve is shown on the right • The cost/benefit curve keeps on rising all the way to 100% of the population, so it makes sense to send mail to everybody if 100 you can afford it • The charts at the bottom assume a cost per item of .80 • A lift chart is shown on the left and a cost/benefit curve is shown on the right • The cost/benefit curve has a peak in the middle • This means you only want to mail to this proportion of the population 101 • Keep in mind that the sample is selected by ranking by the algorithm, not randomly • Roughly speaking, gauging by the shape of the cost/benefit curve: • You would like to mail to those prospective customers in the top half of the data set • Mailing to more than these gives diminishing returns 102 • In summary: • The practical result of this is to tell you what portion of the sample it is cost-effective to contact • The effect of the changed 
cost on the curves generated by the two scenarios is notable 103 What Else is in Figure 5.2? • The details of Figure 5.2 are impossible to see • In brief, the GUI includes the following: • A confusion matrix • A cost matrix where you input the costs of TP, TN, FP, FN 104 • A slider representing the x-axis, the percent of the population • This allows you to get specific cost results for a given sample size, as well as the graphical representations of the results across the whole range 105 A Side Note on Where You Might Be in Exploring Weka • The fact that this figure was produced by Weka is a sign: • If you haven’t downloaded and installed Weka yet, it’s not too soon to do so • The book also has links to the example data sets, already in ARFF format • Looking through the GUI, you can find the options to run various data mining algorithms 106 • You can also find the options to run various tools that allow you to explore and compare the results of running the algorithms • There are no homework assignments, so it’s up to you to start looking into these things on a schedule that agrees with you 107 • There are two goals: • To understand the discussions in class in preparation for the second test • To start getting ready to do the final project 108 A Preliminary Note on the Project • The syllabus contained some information on the general form of the project • Here is a little more preliminary detail • You can do the project using the UCI data sets or any of the other data sets made available on the Weka Web site • If you do so, you are on the hook for the full 8 combinations of data and algorithm 109 • If you use another data set, the number of combinations you have to run will be reduced • In either case, for full points you will also have to do some exploring and comparing of results • Further details on these aspects of the project are posted separately 110 ROC Curves • The acronym ROC stands for receiver operating characteristic • This term originated in the 
analysis of signal-to-noise ratio in a communication channel • It is a way of comparing true positives to false positives 111 • In general terms, you can think of it as “pushing” for greater true positives and measuring your success against the false positives generated • An ROC curve is similar to a lift chart • It provides a way of drawing conclusions about the results of a data mining algorithm 112 • Multiple ROC curves can be used to compare data mining algorithms • ROC curves also graphically illustrate a means of getting the best possible results over a range of parameter values by combining two different data mining algorithms 113 Thinking about Data Values and Probability-Based Predictions • Consider a data mining algorithm that gives prediction probabilities • For a binary attribute, you ultimately want a yes or a no classification • For those instances with a predicted probability > .5 the prediction is yes 114 • Technically, for those instances with a predicted probability < .5 the prediction is no • In the discussion that follows, this second half of the picture will not be explicitly covered • We’ll restrict ourselves to those instances where the prediction is > .5 115 • Now, take a look at Table 5.6 again • It is repeated on the following overhead • Notice that it basically goes down to the level of a prediction probability of .5 and ignores the instances with lower probabilities 116 117 • Essentially what we’re seeing is that instances 1-17 are predicted yes • Then in the right-hand column, those instances that are actually yes are TP’s • Those instances that are actually no are FP’s • In this way, both the TP’s and the FP’s are clearly identified • [You could do a similar thing (in reverse) for those instances with probabilities < .5] 118 • You can break things down in this way: • Actual negatives = FP + TN • Actual positives = TP + FN 119 Forming the ROC Curve • Let a set of data be given that has been sorted by the probability prediction like in Table 5.6 • The ROC curve plots the FP 
rate on the x-axis and the TP rate on the y-axis 120 • FP (negative classified as positive) rate • = FP / Actual negatives • = FP / (FP + TN) * 100 • TP (positive classified as positive) rate • = TP / Actual positives • = TP / (TP + FN) * 100 121 • The ROC curve for the Table 5.6 data is given in Figure 5.3, shown on the following overhead • The jagged line represents a discrete plot of the values in the table • The dashed line obviously represents an approximate curve • The fact that the curve is bowed up is good—TP is winning over FP 122 123 Similarity between the ROC Curve and the Lift Chart • The ROC curve looks similar to the lift chart curve shown earlier • Lift chart x-axis ~ sample size as a percent of the population • Lift chart y-axis ~ TP • ROC curve x-axis ~ # of FP as a % of the actual number of negatives • ROC curve y-axis ~ TP 124 • The y-axes are essentially the same--TP • The x-axes are similar in this way: • Lift chart: increase the sample size • ROC curve: the proportion of false positives to actual negatives that you would get by random sampling tracks with an increased sample size in a lift chart 125 • There is a close similarity between the lift chart and the ROC curve for the direct mailing example • The majority of the population are actual negatives • Increasing the number of negatives tracks directly with increasing the sample size 126 The Upper Left is Good • With a different example an ROC curve and a lift curve may not be as similar • However, there is still an overall similarity: • The upper left-hand corner of the graph is “good” • The graph displays TP (y) vs. 
FP (x) • You’d like to maximize TP and minimize FP, which would occur in the upper left 127 Arriving at the Smooth Curve • Figure 5.3 is repeated on the following overhead for reference • Note again the smooth curve represented as a dashed line 128 129 • The jagged line represented a single display of one data set • For general analytical purposes, you might assume that across the whole problem domain, with different data sets, the relationship between TP and FP would be a smooth curve • This would be what the dashed line represents 130 • The book gives several explanations of how to get the smooth curve • The basic idea is to use cross-validation • To me the book’s detailed explanations are foggy • I’d like to appeal to a simple visual argument 131 • Suppose you did 10-fold cross-validation • You’d have 10 test sets • If you graphed TP vs. FP for each test set, you’d get 10 different jagged lines • Each of these is an approximate curve 132 • Since the axes are percents, the shape of all of the curves would be the same • If you averaged the y values for each x value, the results would approximate a smooth curve • This would be your approximation for the population as a whole 133 ROC Curves for Different Learning Schemes • At the outset, the following was noted: • You can use approaches like ROC curves to compare two data mining schemes • You can use ROC to combine the best of two schemes • That’s the topic now 134 • Figure 5.4, given on the following overhead, shows ROC curves for two data mining schemes • Scheme A is further to the upper left for smaller sample sizes • Scheme B is further to the upper left for larger sample sizes • You can compare and choose between the two if you had a preference based on 135 sample size 136 • The shaded area on the graph is referred to as the convex hull of the two curves • Part of the hull lies outside the boundaries of the curves for A and B • However, the edge of the shaded region still represents an achievable result • For 
sample sizes in that range, a linear combination of A and B can give these ROC curve results 137 • The outer boundary of the hull is tangent to both A and B • Let (x1, y1) and (x2, y2) be the tangent points for A and B, respectively • Suppose you wanted to find the point 1/nth of the way from (x1, y1) to (x2, y2) 138 • xnew = x1 + (1/n)(x2 – x1) • xnew = (1/n)x2 + (x1 – (1/n)x1) • xnew = (1/n)x2 + (1 – 1/n)x1 • ynew = y1 + (1/n)(y2 – y1) • ynew = (1/n)y2 + (y1 – (1/n)y1) • ynew = (1/n)y2 + (1 – 1/n)y1 139 • 1/n and (1 – 1/n) can work like probabilities • They sum to 1 • You can use a Monte Carlo-like approach to achieving the best possible ROC value for any point between the tangent points to A and B 140 • This would be cross-validation-like in nature • For multiple data sets, apply A at the ratio of FP to FN that gives its tangent point • For multiple data sets, apply B at the ratio of FP to FN that gives its tangent point 141 • For a given problem where you want an ROC point (a ratio of FP to FN) that lies 1/nth of the way from one tangent point to the other: • Let the ratio of B data sets to A data sets be 1/n to (1 – 1/n) • Combine the results of the computations generated in this proportion and it will be the desired point on the line 142 • Note that the foregoing overheads are my attempt to explain what the book says • I don’t claim that they are right • I did find the book’s explanation kind of incomplete • My review of it is no better • There seem to be some missing links 143 Recall-Precision Curves • Even though the section heading is recall-precision curves, we won’t actually see a graph in this section… • The general idea behind lift charts and ROC curves is trade-off • They are measures of good outcomes vs.
unsuccessful outcomes • This is cost-benefit in simple terms (not in terms of a cost factor and cost equation) 144 • Trying to measure the tradeoff between desirable and undesirable outcomes occurs in many different problem domains • Different measures may be used in different domains • In the area of information retrieval, two measures are used: • Recall and precision 145 • Recall = (# of relevant docs retrieved) / (total # of relevant docs) • Precision = (# of relevant docs retrieved) / (total # of docs retrieved) 146 • Notice the general idea illustrated by these measures and inherent in the other ones we’ve looked at: • You can increase the number of relevant documents you retrieve by increasing the total number of documents you retrieve • But as you do so, the proportion of relevant documents falls 147 • Consider the extreme case • How would you be guaranteed of always retrieving all relevant documents? • Retrieve all documents in total • But you’ve gained nothing • If you retrieve all documents, the task still remains of winnowing the relevant documents from the rest 148 Discussion • The book notes that in non-computer fields, like medical testing, the same kinds of ideas crop up • For a given medical test: • Sensitivity = proportion of people with the disease who test positive • Specificity = proportion of people without the disease who test negative • For both of these measures, high is good 149 • Incidentally, these concepts also remind me of association rule mining • Support and confidence seem to have something in common with the dichotomies presented in this section 150 One-Figure Measures • In addition to 2-dimensional graphs or curves like lift charts and ROC curves, there are also techniques for trying to express the goodness of a scheme in a single number • For example, in information retrieval there is the concept of “average recall” measures 151 • Average recall measures are
actually averages of precision over several recall values • Three-point average recall is the average precision for recall figures of 20%, 50%, and 80% • Eleven-point average recall is the average precision for recall figures of 0%-100% by 10’s 152 • The book also cites the F-measure: • F = (2 x recall x precision) / (recall + precision) • = (2 x TP) / (2 x TP + FP + FN) • If you go back to the definitions you can derive the 2nd expression from the 1st • Looking at the second expression, the simple point is that a larger value is better 153 • The simplest one-figure measure of goodness is simply the success rate: • (TP + TN) / (TP + FP + TN + FN) • One-figure measures don’t contain as much information as graphs • By definition, graphs tell you the outcome over a range 154 • Table 5.7, shown on the following overhead, summarizes lift charts, ROC curves, and recall-precision curves • Even though we haven’t been given a graph of a recall-precision curve, this is where such curves are put in context with lift charts and ROC curves 155 156 Cost Curves • In the previous section two ROC curves were graphed together • Their performance on samples of different sizes could be compared • This comparison didn’t include cost • It is not straightforward to see how cost can be incorporated into lift charts or ROC curves 157 • The book introduces error and cost curves as a way of including cost in scheme comparisons • Even though error and cost curves have a purpose similar to lift charts and ROC curves, they are quite different from them • In trying to understand error curves and cost curves it’s important not to confuse them with lift charts and ROC curves 158 The Error Curve • The book presents an error curve and a cost curve together in Figure 5.5 • I will deal with the two curves one at a time • The presentation of error curves will come first • Consider the error curve, Figure 5.5a, shown on the following overhead 159 160 • The error curve shown is based on binary classification • In
the figure, a yes classification is symbolized by “+”, so the probability of a yes is p[+] • By extension a no is symbolized by “-” • The solid curve (line) shows the performance of a classification scheme, A 161 The Axes of the Graph • The x-axis is the probability, p[+], of a yes classification in a data set that A is applied to • In other words, A’s performance may differ for different probabilities of yes in the data set • The domain is the interval of probabilities from 0 to 1 162 • The y-axis, the performance measure, is the expected error • This is the probability of a misclassification by A for a given value of p[+] • The range is the interval of expected errors from 0 to 1 163 • The expected error is not the count of errors, but the error rate from 0 to 100% • The error rates for false negatives and false positives are given using italics as fn and fp, respectively 164 Characteristics of the Curve for Data Mining Algorithm A • A is shown as linear • It is not known whether classification scheme performance is necessarily linear, but for the purposes of discussion, this one is • Also note that the slope of A is not zero • A’s performance does differ for different values of x 165 • Under the assumption that A is linear, A is not hard to plot • The left-hand endpoint, the y intercept, is defined as the false positive rate, fp, that A gives when the probability of yes in the data set is 0 • The y value for the right-hand endpoint is defined as the false negative rate, fn, that A gives when the probability of a yes in the data set is 1 166 • For a given probability of yes on the x-axis between 0 and 1: • The y value would be the expected misclassification rate, whether fp or fn, for that probability p[+] 167 • A is lower on the left than on the right • For data sets more likely to contain no, A has fewer errors • For data sets less likely to contain no, A has more errors • In a sense, you might say that the error curve (line) shows that A is biased • A is
apparently more likely to give a classification of no than yes 168 Other Things Shown in the Figure • The book indicates other things in the figure • The horizontal line at the bottom, the x-axis, would represent the performance of a perfect predictor—expected error = 0 • The horizontal line at the top would represent the performance of a predictor that was always wrong—expected error = 1 169 • The dashed diagonal from the origin to the upper right would represent the performance if you always predicted no (expected error = p[+]) • The dashed diagonal from the upper left to the lower right would represent the performance if you always predicted yes (expected error = 1 – p[+]) 170 • The two diagonals seem uninteresting, but they bring out a useful point • You don’t have to use the same predictor across the whole domain • Below a certain point, simply predicting no outperforms A • Above a certain point, simply predicting yes outperforms A 171 • In theory, you could take a random sample of a data set and see what the probability of a yes was • You could then choose which predictor to use accordingly 172 The Cost Curve • An example of a cost curve, Figure 5.5b, is shown on the following overhead • It looks somewhat similar to the error curve of Figure 5.5a • This is not an accident, but the axes of the curves are different • As the name implies, this curve contains information about costs, not just error rates 173 174 • It is not reasonable to try to derive or prove the formulas the book uses in creating the cost curve • It’s also not reasonable to think we’ll come to much of a theoretical or intuitive understanding of the formulas 175 • My goal in presenting this will be the following: • Try to figure out what practical outcomes in the graph result from the formulas used • Then try to figure out how these practical outcomes make it possible to read the graph and understand what it represents overall, even if the derivations are not given in detail 176 The Axes of the Cost Curve— The X-Axis • The starting point for
understanding the cost curve has to be the axes • The x-axis is labeled as the probability cost function • The book’s notation for this is pc[+] • This is an extension of the notation p[+], the probability of a yes, which was the x-axis of the error curve 177 • By way of introduction, consider the following: • Yes, we want a curve that graphs cost • You might assume that you graph cost on the y-axis against something else entirely different on the x-axis 178 • It turns out that cost characteristics are going to be part of the x-axis • Part of the goal of this discussion will be to explain how a cost curve can be based on an x-axis that contains cost information • The book’s expression for the x-axis, along with an explanatory version, are given on the following overhead 179 • pc[+], the x-axis of the cost curve • pc[+] = (p[+] x C[-|+]) / (p[+] x C[-|+] + p[-] x C[+|-]) • = (prob(yes) x cost(FN)) / (prob(yes) x cost(FN) + prob(no) x cost(FP)) 180 • We don’t know what this “means” or “where it came from”, but certain things about it can be observed • pc[+] is a transformation of p[+] • p[+] appears in both the numerator and denominator • You might interpret p[-] as (1 – p[+]) 181 • If you did a little simple graphing of the function, you would find that it has curves and discontinuities (where the denominator goes to 0) • So the x-axis of the cost curve is a function of the x-axis of the error curve 182 • If you substitute p[+] = 0 and p[+] = 1, you get the following • If prob(yes) = p[+] = 0, pc[+] = 0 • If prob(yes) = p[+] = 1, pc[+] = 1 • Expressing the x-axis as a function, this is: • pc[+] = fcn(p[+]) • For p[+] = 0, fcn(p[+]) = 0 • For p[+] = 1, fcn(p[+]) = 1 183 • In words, the x-axis of the cost curve is a transformation of the x-axis of the error curve • The endpoints of the interval of interest of the error curve map to the same, desired endpoints for the interval of interest
for the cost curve, 0 and 1 • In between these endpoints, pc[+] is actually a curvilinear function of p[+] 184 The Y-Axis • With the x-axis established, what is the y value being mapped against it? • The y-axis represents what is known as the normalized expected cost • The normalized expected cost is a function of the probability cost function • y = NEC = f(pc[+]) = f(x) 185 • This is the formula for the normalized expected cost, namely the formula given in the book for the function that defines the y coordinate of the graph: • NEC = fn x pc[+] + fp x (1 – pc[+]) • Note that fn and fp are small letters in italics • These italicized things are not the counts, FP and FN 186 • fn and fp are the rates or ratios of false negatives and false positives • This was also the notation used in the error curve graph • In the error curve, fn and fp varied according to p[+] 187 • The formula given for the y coordinate of the cost curve appears quite simple • However, it would be wise to take into account that when calculating it, fn and fp would themselves vary across the range of p[+] 188 • In practical terms, what have we got? • The normalized expected cost is the sum of two terms • Each term consists of two factors • Each of the factors falls between 0 and 1 189 • fp and fn each fall between 0 and 1 • You hope they’re both well below 1 • If both were 1, all of your predictions would be in error • The sum of pc[+] and (1 – pc[+]) is clearly 1 190 • These arithmetic facts mean that the normalized expected cost, a weighted average of fn and fp, can only fall between 0 and 1 • This is why it’s called normalized • From a purely pragmatic point of view this normalization is the reason (perhaps among others) why the x-axis is defined the way it is and the function y = f(x) is defined the way it is 191 What Does the Graph Portray? • The cost curve is shown again on the following overhead as a memory refresher • With the axes defined, what information can be read from the graph?
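The pc[+] and NEC formulas above are simple enough to sanity-check numerically. Here is a minimal sketch; the function names and all of the cost and rate values are mine for illustration, not the book's or Weka's:

```python
# Sketch of the cost-curve quantities defined in the overheads
# (hypothetical inputs; variable names are mine, not the book's).

def probability_cost_function(p_yes, cost_fn, cost_fp):
    """pc[+] = (p[+] x C[-|+]) / (p[+] x C[-|+] + p[-] x C[+|-])."""
    p_no = 1.0 - p_yes
    return (p_yes * cost_fn) / (p_yes * cost_fn + p_no * cost_fp)

def normalized_expected_cost(fn_rate, fp_rate, pc_yes):
    """NEC = fn x pc[+] + fp x (1 - pc[+])."""
    return fn_rate * pc_yes + fp_rate * (1.0 - pc_yes)

# Endpoint checks from the overheads: pc[+] = 0 when p[+] = 0,
# and pc[+] = 1 when p[+] = 1, whatever the (positive) costs are.
assert probability_cost_function(0.0, 2.0, 1.0) == 0.0
assert probability_cost_function(1.0, 2.0, 1.0) == 1.0

# With fn and fp each between 0 and 1, NEC is a weighted average
# of the two rates, so it also lands between 0 and 1 (normalized).
pc = probability_cost_function(0.3, cost_fn=2.0, cost_fp=1.0)
nec = normalized_expected_cost(fn_rate=0.2, fp_rate=0.1, pc_yes=pc)
assert 0.0 <= nec <= 1.0
```

Note that when the two costs are equal, pc[+] reduces to p[+], which fits the earlier observation that the cost-curve x-axis is a transformation of the error-curve x-axis.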
192 193 • The line A represents the cost curve for a particular data mining algorithm • The x-axis, the probability cost function, contains information on two things at once: • The probability of a yes or no for instances of various different data sets • The cost of FP and FN for a particular problem domain 194 • The x-axis doesn’t include any information about a data mining algorithm or its performance • The x-axis contains information about the domain only (as it should) 195 • The y-axis is the dimension along which you graph algorithm dependent information • The fp and fn in the function being graphed are performance measures for an algorithm at given probabilities of yes/no in the data set 196 • Remember, for example, that if an algorithm has been trained to lean towards no, but the data set leans towards yes, you will tend to get more FN’s and fn will be higher • This is the kind of information that the factors fn and fp include in the formula 197 • After all this blah blah blah, presumably the following simple description of the cost curve will make some sense: • The cost curve graphs the normalized cost of applying a data mining algorithm to data sets where the p[+] ranges from 0 to 1 198 What Else is in the Figure? • Given the explanation so far, what else is in the figure and what does it signify? 
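One small observation before looking at the figure again: if a scheme's fp and fn are treated as fixed, the NEC formula from the previous overheads is linear in pc[+], running from fp at pc[+] = 0 to fn at pc[+] = 1. (The overheads note that fn and fp can themselves vary with p[+], so this is only a sketch, with made-up rates.)

```python
# A small check, with made-up rates of my own, that the NEC formula
# traces a straight line when a scheme's fp and fn are held fixed:
# its value is fp at pc[+] = 0 and fn at pc[+] = 1.

def nec(fn_rate, fp_rate, pc_yes):
    # Normalized expected cost: NEC = fn x pc[+] + fp x (1 - pc[+])
    return fn_rate * pc_yes + fp_rate * (1.0 - pc_yes)

fn_rate, fp_rate = 0.4, 0.1  # hypothetical rates for some scheme

left = nec(fn_rate, fp_rate, 0.0)   # left endpoint = fp_rate
right = nec(fn_rate, fp_rate, 1.0)  # right endpoint = fn_rate
mid = nec(fn_rate, fp_rate, 0.5)    # should be the segment midpoint
assert left == fp_rate and right == fn_rate
assert abs(mid - (left + right) / 2) < 1e-9
```

This may be one way to make sense of the straight, solid lines in Figure 5.5b, though the book itself does not spell the reason out.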
• Once again, the figure is repeated as a reminder of what’s in it 199 200 • The performance cost function of a given algorithm is shown as a straight, solid line • The book still doesn’t say why the lines would necessarily be straight, but we’ll take them as they’re given • There are lines for A and B in the figure, plus shadowy lines for other hypothetical data mining algorithms 201 • A and B cross and have different performance/cost characteristics over different ranges of p[+] or pc[+] values • The always yes/always no predictors are shown as crossed, dashed, diagonal lines • The curved, dotted line is referred to as an envelope 202 • The idea behind the shadowy lines and the envelope is that you might have multiple data mining algorithms • If you used each algorithm only in the range of pc[+] where it was the best, the envelope would be the minimum cost curve over the whole problem domain 203 • Finding the best algorithm to use, given this envelope, is not hard to accomplish • A statistical sample will tell you the p[+] for any given data set out of a problem domain • Then use the algorithm that performs best for that value 204 The End 205
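As a closing footnote to the envelope discussion: the idea of using whichever algorithm is cheapest at the measured pc[+] can be sketched in a few lines. The (fp, fn) pairs below are invented for illustration; only the always-yes and always-no entries follow from the definitions (always no never predicts positive, so fp = 0 and fn = 1, and vice versa):

```python
# Sketch of the envelope idea: given cost lines for several schemes
# (hypothetical (fp, fn) pairs of my own), the envelope at any pc[+]
# is the minimum NEC over the schemes, and you would use whichever
# scheme attains that minimum there.

def nec(fn_rate, fp_rate, pc_yes):
    # Normalized expected cost: NEC = fn x pc[+] + fp x (1 - pc[+])
    return fn_rate * pc_yes + fp_rate * (1.0 - pc_yes)

# name -> (fp, fn); the last two match the dashed diagonals in the figure
schemes = {
    "A": (0.1, 0.4),
    "B": (0.3, 0.2),
    "always no": (0.0, 1.0),
    "always yes": (1.0, 0.0),
}

def best_scheme(pc_yes):
    """Return (name, cost) of the minimum-cost scheme at this pc[+]."""
    return min(
        ((name, nec(fn, fp, pc_yes)) for name, (fp, fn) in schemes.items()),
        key=lambda pair: pair[1],
    )

# Different pc[+] values favor different schemes, as the envelope suggests:
# trivial predictors win at the extremes, A and B in the middle ranges.
for pc, expected in [(0.05, "always no"), (0.4, "A"),
                     (0.6, "B"), (0.95, "always yes")]:
    assert best_scheme(pc)[0] == expected
```

A statistical sample of a data set would supply the p[+] (and hence pc[+]) at which to evaluate `best_scheme`, per the last overhead above.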