Data Mining
Chapter 5
Credibility: Evaluating What’s
Been Learned
Kirk Scott
1
2
3
4
• This set of overheads consists of just one
section in the book:
• 5.7 Counting the Cost
• There are approximately 200 overheads
• This is a long and important section
5
5.7 Counting the Cost
• Preliminary note:
• The book discusses cost in this section
• It also discusses things that don’t really
involve cost
• Incidentally, it also discusses things that
would involve (differential) cost, but
simplifies by using cost coefficients of 1
and 0 rather than anything more
complicated
6
• In a way, the contents of this section might
come across as a mish-mash
• Hopefully the overheads will add clarity to
the book’s presentation
• Whether mish-mash or not, ultimately the
topics covered will be important because
they are aspects of what is available in
Weka
7
Fundamental Cost Ideas
• For any prediction scheme there will be
successes and failures, successes and
errors
• Consider a binary classification scheme
where the output is yes/no, true/false,
positive/negative
8
Two Kinds of Successes
• Correct predictions of positive, true
positives, TP
• Correct predictions of negative, true
negatives, TN
• Quite often, the cost (i.e., the benefit) of the two kinds of successes is taken to be the same
• You can deal with TP + TN together
9
Two Kinds of Errors
• Incorrect predictions of positive, false
positives, FP
• Incorrect predictions of negative, false
negatives, FN
• In virtually every applied situation the
costs of false positives and false negatives
materially differ
10
Difference in Error Costs
• The book illustrates the idea with several
examples
• One is direct mail advertising, which it
uses again later in the section
• Consider a mailing to a predicted potential
customer who doesn’t respond
• This is a false positive
• The individual cost of the unsuccessful
mailing is low
11
• Consider a mailing that was never sent to
what would have been a customer
• This is a false negative
• The algorithm incorrectly predicted
negative for that potential customer
• The cost of the lost business is high
compared to the cost of a mailing
12
• In order to stay in business, the cost of mailing has to be less than the value of the business generated
• You can afford lots of false positives,
although in aggregate their cost adds up
• In this domain the cost of lost opportunity
is invisible, but in practical terms, the cost
of a few false negatives quickly outweighs
the cost of many false positives
13
Confusion Matrices
• The idea of a confusion matrix will come
up more than once in this section
• A confusion matrix is a graphical way of
summarizing information about successes
and failures in prediction
14
• The simplest of confusion matrices, one
for two classifications, is illustrated in
Table 5.3, shown on the following
overhead
• The entries in the matrix are the counts of
the different kinds of successes and
failures for a given situation
15
16
• Next the book presents Table 5.4, which
consists of parts (a) and (b)
• I will break up the presentation
• Consider Table 5.4a, shown on the
following overhead
• It shows the confusion matrix resulting
from the application of a non-binary
classification scheme
17
18
• The rows represent the actual class for the
instances in a data set
• The columns represent the class
predictions given by the classification
scheme
• The true predictions are on the main
diagonal of the matrix
• The errors are off of the main diagonal
19
False Positives
• Look at column a, for example
• The entry for row a, 88, is the count of the
number of correct predictions
• The entries for rows b and c, 14 and 18,
are the counts of false positives with
respect to class a
• The algorithm falsely predicted this many
instances of b and c were of class a
20
False Negatives
• Now look at row a
• The entries for columns b and c, 10 and 2,
are the counts of false negatives with
respect to class a
• The algorithm falsely predicted this many
instances of class a were in classes b and
c
21
Food for Thought
• Now look at row b
• The entry for column a, 14, is a false
negative with respect to class b
• This entry was identified as a false positive
with respect to class a
22
• So for a classification that isn’t binary,
whether something is a false positive or
false negative is a matter of perspective
• Geometrically, it depends on whether
you’re looking from the perspective of a
row or a column
23
• The rows represent the actual
classification
• Taken from the perspective of a row, an
error entry is a failure to correctly place
something into that classification
• This is a false negative
24
• The columns represent the predicted
classification
• Taken from the perspective of a column,
an error entry is an incorrect placement of
something into that classification
• This is a false positive
25
• The row b, column a entry is an error, false positive or false negative, depending on perspective
• It is simply a unique error case, distinct
from the other error entries in the table
26
• It might have a cost associated with it that
is distinct from the cost of any other error
entry in the matrix
• At this point, the kind of error doesn’t really
matter
• However, the kind of error may eventually
be figured into the calculation of the cost
27
Evaluating Classification Performance
with a Confusion Matrix
• Next, the book shows a way of evaluating
the performance of a classification scheme
• This evaluation method has nothing to do
with costs
• It’s simply an application built on top of,
and graphically illustrated by, confusion
matrices
28
• Previous sections of the book evaluated
data mining algorithm performance by
estimating error rates of classification
schemes
• It also considered comparing the
performance of two different schemes
29
• The idea presented now is that the
performance of a data mining algorithm
can be evaluated by comparing its results
to the results of random classification
• The results of an actual predictor will be
compared with a hypothetical predictor
that does classification randomly
30
• Table 5.4 (a) and Table 5.4 (b) combined
illustrate the idea
• Table 5.4 (a) represents the actual
predictor
• Table 5.4 (b) represents the hypothetical
predictor
• These tables are shown on the following
overhead
• Explanations follow that
31
32
• First consider the bottom row of 5.4 (a)
• The values 120, 60, and 20 are the totals
predicted for classes a, b, and c overall
• Now consider the rightmost column of 5.4
(a)
• 100, 60, and 40 are the counts of the
actual occurrence of instances of classes
a, b, and c
33
• Now consider Table 5.4 (b)
• The bottom row of 5.4 (b) is the same as
the bottom row of 5.4 (a)
• The totals predicted for the classes are the
same
• The rightmost column of 5.4 (b) is the
same values as the rightmost column of
5.4a
• The count of the actual number of instances of each class does not change
34
• Table 5.4 (a) represents the actual predictor we are trying to evaluate
• Table 5.4 (b) represents a hypothetical
predictor that we are comparing it against
• The fact that the bottom rows of the two
tables are the same means that the
hypothetical predictor of interest gives the
same overall count of predictions for each
class
35
• It’s the bodies of the two tables that differ
• They differ in how the correct and incorrect
predictions for each class are distributed
• The hypothetical predictor’s values work in
this way:
• The values in the rows for every actual
class are distributed in proportion to the
overall totals in the bottom row
36
• 120 is to 60 is to 20
• As 60 is to 30 is to 10
• As 36 is to 18 is to 6
• As 24 is to 12 is to 4
• These sets of values represent the so-called random predictor
• (It’s not exactly random. It’s based on the prediction counts for the actual scheme.)
37
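As a rough illustration of how Table 5.4 (b) can be built from the marginals of Table 5.4 (a), here is a minimal Python sketch. The row and column totals are the ones quoted above; the dictionary layout is just for illustration.

```python
# Row totals = actual instances per class; column totals = overall predictions
# per class (the bottom row of Table 5.4 (a)).
actual_totals = {"a": 100, "b": 60, "c": 40}
predicted_totals = {"a": 120, "b": 60, "c": 20}
grand_total = sum(actual_totals.values())  # 200

# Each actual class's instances are spread across the predicted classes in
# proportion to the overall prediction counts.
random_predictor = {
    actual: {pred: actual_totals[actual] * predicted_totals[pred] / grand_total
             for pred in predicted_totals}
    for actual in actual_totals
}

for actual, row in random_predictor.items():
    print(actual, row)
# a: 60, 30, 10   b: 36, 18, 6   c: 24, 12, 4 -- the body of Table 5.4 (b)
```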
• The random predictor is the point of
comparison
• Can you make a quantitative comparison
between the performance of the actual
classification scheme and the random
predictor?
• If your classification scheme isn’t better
than random, you’ve got a big problem
• Your predictor is sort of an anti-predictor
38
Comparing with the Random
Predictor
• Sum the counts on the main diagonal of
the matrix for the actual scheme
• It got 88 + 40 + 12 = 140 correct
• Do the same for the random predictor
• It got 60 + 18 + 4 = 82 correct
• Clearly the actual predictor is better
• Can this be quantified?
39
• There are 200 instances altogether
• The random predictor had 200 – 82 = 118
incorrectly classified instances
• The actual predictor got 140 – 82 = 58
more correct than the random predictor
• Of the 118 incorrectly classified by the
random predictor, the actual predictor
classified 58 of them correctly
40
• You measure performance by finding the
ratio 58/118 = 49.2%
• This ratio is known as the Kappa statistic
• For a predictor that is no worse than
random, the value of the statistic can
range from 0 to 1
• 0 means no better than random
• 1 means perfect prediction
41
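The Kappa calculation itself is short; here is a minimal Python sketch using the counts quoted above (140 correct for the actual predictor, 82 for the random predictor, 200 instances in total).

```python
total = 200
observed_correct = 88 + 40 + 12   # main diagonal of Table 5.4 (a) = 140
expected_correct = 60 + 18 + 4    # main diagonal of Table 5.4 (b) = 82

kappa = (observed_correct - expected_correct) / (total - expected_correct)
print(round(kappa, 3))            # 0.492, i.e., roughly 49.2%
```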
Counting as Opposed to Cost
• The method given here doesn’t count the
cost of FP or FN
• It is similar to previous approaches
• It simply counts the successes vs. the
errors and compares with the hypothetical
case
42
• What we just looked at were confusion
matrices
• These structures contained counts of
correct and incorrect classifications
• It was possible to compare these values
for an actual and a hypothetical predictor
43
Cost-Sensitive Classification
and Cost Matrices
• The next step will be to include cost in the
performance calculation
• Suppose you don’t assign different cost
values to different success types (TP, TN)
and different failure types (FP, FN)
• Cost of success = 0
• Cost of failure = 1
44
• The simple cost matrices for the two-class
and three-class cases are given in Table
5.5, shown on the following overhead
45
46
Performance Evaluation with
Costs
• Suppose you have a data mining scheme
that gives straight classification predictions
• With a test set, you have both a prediction
and the known classification
• Up until now, performance was evaluated
on the basis of a count of errors
47
• Potentially, each element of the cost
matrix could contain a different decimal
value which was the weight for that case
• Successes (benefits) could be positive
• Failures/errors (costs) would be negative
• You would evaluate performance by
summing up the total of successes and
errors multiplied by the corresponding
weight in the cost matrix
48
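As a sketch of that weighting idea, here is a small Python example. The confusion matrix counts and the cost weights are hypothetical, with successes weighted 0 and a false negative assumed to be more expensive than a false positive.

```python
# Rows = actual class (yes, no); columns = predicted class (yes, no).
confusion = [[80, 20],    # actual yes: 80 true positives, 20 false negatives
             [30, 70]]    # actual no:  30 false positives, 70 true negatives

# Cost matrix in the same layout: successes cost 0, errors carry
# illustrative positive weights.
cost = [[0.0, 5.0],
        [1.0, 0.0]]

total_cost = sum(confusion[i][j] * cost[i][j]
                 for i in range(2) for j in range(2))
print(total_cost)   # 20 * 5.0 + 30 * 1.0 = 130.0
```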
Complete Financial Analysis
• Data mining doesn’t exist in a vacuum
• It is a real technique applied in businesslike situations
• Suppose you were comparing the cost of
two different algorithms
• Not only might the costs associated with
the predictions be taken into account
49
• The better performing algorithm might
have data preprocessing or computational
costs that are significant
• Its predictions might be X dollars more
valuable than those of another scheme
• But the comparative costs of running the
algorithm may be greater than X
• A complete analysis would take this into
account when picking an algorithm
50
Classification Including Costs
• This section represents a shift of focus
• So far we’ve been concerned with doing a
cost evaluation of prediction performance
after the fact
• Now we want to try and take cost into
account when making the prediction
51
• Suppose you’re using a classification
scheme that generates probabilities
• Up until now, if you had to classify under
this condition, you picked the classification
with the highest probability
• With cost information, another possibility
presents itself
52
• For a given instance that you want to
classify:
• pi is the predicted probability that the
instance falls into the ith class
• Then for each class, (1 – pi) is the
probability of a false positive
• In other words, classify on the basis of
minimum cost
53
• Multiply each of the (1 – pi) by the cost factor of a misclassification (false positive)
• Do not classify according to the largest pi
• Instead, give the instance the classification where the product of the probability of a misclassification and the cost of a misclassification is lowest
54
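Here is a minimal sketch of that decision rule for one instance, assuming hypothetical class probabilities and per-class misclassification costs.

```python
# Predicted probabilities for one instance and the cost of a false positive
# for each class. All values are made up for illustration.
probs = {"a": 0.5, "b": 0.3, "c": 0.2}
fp_cost = {"a": 10.0, "b": 2.0, "c": 3.0}

# Expected misclassification cost of predicting class i is (1 - p_i) * cost_i.
expected_cost = {c: (1 - probs[c]) * fp_cost[c] for c in probs}
prediction = min(expected_cost, key=expected_cost.get)

print(expected_cost)   # {'a': 5.0, 'b': 1.4, 'c': 2.4}
print(prediction)      # 'b' -- not the most probable class, which is 'a'
```

The point of the example is that the minimum-cost class need not be the most probable one.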
Probabilities vs. Straight
Classification
• The book notes that there are ways of
converting the results of straight
classification schemes into probabilities
• It doesn’t give details (although it’s pretty
clear this would involve counting and
dividing in Bayesian style)
• The idea is that you could then apply cost
metrics when deciding classification for
those schemes
55
Summary of the previous idea
• The approach for factoring cost into
predictions given thus far could be
summarized as follows:
• When running the data mining algorithm,
do not take cost into account when
deriving the rules
• However, after the rules are derived, take
cost into account when using them to
make a classification
56
Cost-Sensitive Learning
• Now we would like to try and take cost into
account when deriving the rules
• It is possible to follow a line of reasoning
that explains how this is accomplished
• Not all the details will be given
• However, the approach can be outlined
based on concepts that have already been
presented
57
There are Schemes that give
Probabilities as Results
• We have repeatedly been told that some
data mining algorithms will give
probabilities instead of classifications
• Naïve Bayes gave these kinds of results
• We infer that there are other, more
sophisticated data mining schemes that
also give probabilistic results
• In other words, probabilistic based
schemes have general application
58
What about the Loss Function?
• In this chapter the idea of a loss function
was introduced
• The statement was made that the way to
improve a classification scheme was to
minimize the loss
• In fact, we have not yet been told how
data mining schemes can be guided to
give better results by a process of
minimizing loss
59
Cost and Loss
• Cost is external to a classification scheme
• There are costs in the real world
• Total cost depends on how successful or
unsuccessful classification is
60
• Loss is internal to a classification scheme
• It is a measure of how greatly predicted
classification probabilities differ from
observed actual classifications
• Cost and loss are not the same, but they
are related
61
• When evaluating a scheme, the cost is
calculated by multiplying the probability of
an outcome times the cost of that outcome
• If the mining scheme is guided by
minimizing loss, that means it’s guided by
improving the probability estimates
• This will tend to minimize cost
62
Making the Jump—Bias to Cost
• This is where you make a jump
• Recall this idea from earlier in the chapter:
• You achieve quality evaluations of a
scheme by cross-validation
• If you want accurate performance
estimates, the distribution of classifications
in the test set should match the distribution
of classifications in the training set
63
• If there was a mismatch between training
and test set some test set instances might
not classify correctly
• The mismatch in the sets means that the
training set was biased
• If it was biased against certain
classifications, this suggests that it was
biased in favor of other classifications
64
• Note that bias is closely related to the
concept of fitting or overfitting
• In nonjudgmental terms, there is simply a
mismatch between training and test set
• In more judgmental terms, the training set
is overfitted or biased in favor of giving
some predictions or probabilities which are
not as prevalent in the test set
65
Including Cost in Rule Derivation by
Intentionally Using Bias
• The approach to including cost in the
derivation in classification rules is this:
• Intentionally bias the training set by
increasing the representation of some
classifications
• The algorithm will produce rules more
likely to correctly predict these
classifications
66
• The practical questions are these:
• Which classifications to over-represent?
• In general you want to over-represent
classifications where misclassification has
a low cost
• How much to over-represent them?
• The answer to this is pretty much just
empirical rule of thumb
67
Concretely…
• Take this as a concrete example:
• Suppose that false positives are more
costly than false negatives
• Over-represent “no” instances in the
training set
• When you run the classification algorithm
on this training set, it will overtrain on no
68
• That means the rules derived by the
algorithm will be more likely to reach a
conclusion of no
• Incidentally, this will increase the number
of false negatives
• However, it will decrease the number of
false positives for these classifications
• In this way, cost has been taken into account in the rules derived, reducing the likelihood of expensive FP results
69
• The book notes that over-representing a class doesn’t require collecting additional genuine instances
• You can allow duplicate instances in the training set (like sampling with replacement)
• In practice, some data mining schemes
just allow heavier or lighter weights to be
assigned to instances of given
classifications
70
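A minimal sketch of the duplication idea, assuming a toy training set of (features, label) pairs; the oversample helper and the factor of 3 are hypothetical choices, not anything prescribed by the book or Weka.

```python
import random

training = [(("f1",), "yes"), (("f2",), "no"),
            (("f3",), "no"), (("f4",), "yes")]

def oversample(data, label_to_boost, factor):
    """Duplicate instances of one class so the learner leans toward it."""
    boosted = []
    for features, label in data:
        copies = factor if label == label_to_boost else 1
        boosted.extend([(features, label)] * copies)
    random.shuffle(boosted)
    return boosted

# False positives are taken to be the expensive error, so "no" instances
# are over-represented before the rules are derived.
biased_training = oversample(training, "no", factor=3)
print(len(biased_training))   # 2 yes + 2 * 3 no = 8 instances
```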
Cost and Loss, Again
• Technically, we still didn’t use an algorithm
that was guided by loss
• Skewing the sample was the proxy for
tuning loss, which in turn was also the
proxy for cost in the rule derivation
71
Lift Charts
• The term lift describes certain techniques
based on the following idea
• You may not be able to perfectly classify
instances and identify only those with the
desired classification
• However, you may be able to identify
subsets of a data set where the probability
of the desired classification is higher
72
• Consider the direct mailing example which
was mentioned quite a while back
• Suppose you know as a starting point that
out of a population (training set/test set) of
1,000,000, there will be 1,000 yeses
• This is a response rate of .1%
73
Defining the Lift Factor
• Now suppose you have a data mining
algorithm that gives probabilities of yes for
the instances
• This means you can identify people more
likely to respond yes and those less likely
to respond yes
• You can identify subpopulations (subsets, samples) with a greater likelihood of yes
74
• The proportional increase in the response
rate is known as the lift factor
• The response rate = count(yeses) /
count(number of instances in sample)
• The lift factor is response rate(sample) /
response rate(population)
• This idea is summarized in the table on
the following overhead
75
Sample       Yeses   Response Rate   Lift Factor
1,000,000    1,000   .1%             1 (baseline)
400,000      800     .2%             2
100,000      400     .4%             4
76
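The lift factor arithmetic from the table can be checked with a few lines of Python; the numbers are the ones in the table above.

```python
population_size, population_yeses = 1_000_000, 1_000
population_rate = population_yeses / population_size     # 0.001, i.e., .1%

def lift_factor(sample_size, sample_yeses):
    sample_rate = sample_yeses / sample_size
    return sample_rate / population_rate

print(lift_factor(400_000, 800))    # 2.0
print(lift_factor(100_000, 400))    # 4.0
```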
• Assuming that you can do this, note this
obvious result:
• You’ve increased the response rate—good
• You’ve accomplished this by reducing the
sample size—also good
• On the other hand, the absolute number of
yeses has dropped
• In other words, you are excluding some
yeses from consideration
77
Finding Lift Factors
• Given a training set or population, how do
you come up with the figures for lift
factors?
• The data mining algorithm gives a
probability of a yes
• The training data also tells you whether a
given instance is actually yes
78
• Rank all of the instances by their
probability and keep track of their actual
class
• In essence, you choose your new, “lifted”
subset, by taking instances from the top of
the probability rankings
• (Don’t worry, a more complete explanation
is coming)
79
• Note that we will be operating under the
(reasonable) assumption that the data
mining algorithm really works
• In other words, we assume that the higher
the predicted probability, the more likely
an instance really is to take on a certain
value
80
• Table 5.6, given on the following
overhead, shows instances ranked by their
predicted probability, and with their actual
classification also listed
81
82
• Using a table like this, you can find the lift
factor for a given sample size
• For example, for a sample size of 10, take
the top 10 instances
• The lift ratio depends on the ratio of yeses
to no’s in that sample
83
• Using a table like this, you can also find
the sample size for a given lift factor
• For a desired lift factor, go down the table
tallying the ratio of yeses to no’s
• When you get the desired lift factor, the
sample size is the number of instances
you tallied
84
Finding the Lift factor for a
Given Sample Size
• In summary:
• If you want the lift factor for a particular
sample size, just count down the list until
you reach that sample size
• The response rate = count(yeses) /
count(number of instances in sample)
• The lift factor is response rate(sample) /
response rate(population)
85
Finding the Sample Size for a
Given Lift Factor
• In summary:
• If you want a particular lift factor, keep a
running computation of response rate and
lift factor as you go down the list
• When you reach the desired lift factor, the
number of instances you’ve gone through
is your sample size
• (Note that response rate is the key figure)
86
Making a Lift Chart
• To make a lift chart, you need the lift factor
for each possible sample size
• In other words, work your way down the
ranked list one instance at a time
• For each one, you can calculate the
response rate so far and the lift factor so
far
87
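Here is a minimal Python sketch of that procedure: instances are ranked by predicted probability of yes (highest first), and the response rate and lift factor are recomputed at every cutoff. The (probability, actual class) pairs are hypothetical.

```python
ranked = [(0.95, "yes"), (0.93, "yes"), (0.91, "no"), (0.87, "yes"),
          (0.85, "no"), (0.80, "yes"), (0.75, "no"), (0.60, "no")]

population_rate = sum(1 for _, actual in ranked if actual == "yes") / len(ranked)

points = []          # (sample size, lift factor) pairs -- the lift chart data
yes_so_far = 0
for size, (prob, actual) in enumerate(ranked, start=1):
    if actual == "yes":
        yes_so_far += 1
    response_rate = yes_so_far / size
    points.append((size, response_rate / population_rate))

for size, lift in points:
    print(size, round(lift, 2))
```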
• The x-axis is the sample size
• This could be given as a percent of the
population
• The y-axis is the lift factor
• This could be given as the actual number
of yes responses generated
• The resulting plot is what is known as a lift
chart
88
• Figure 5.1, shown on the following
overhead, illustrates the idea for the direct
mailing example
• Recall that the example had really nice
numbers
• Because the total yes responses for the population was 1,000, when reading the y-axis, you can readily interpret the value as a percent
89
90
What a Lift Chart Tells You
• The lift chart tells you something about
your data mining algorithm
• The diagonal line represents the situation
where, if you take m% of the population,
you get m% of the yeses
• This is what you would expect of a random
sample from the population
91
• An effective data mining algorithm should
produce a lift chart curve above the
diagonal
• The higher and further to the left the hump
is, the better
92
• In general, the closer the lift chart curve
comes to the upper left hand corner, the
better
• A minimal sample size and a maximum
response rate is good
• The hypothetical ideal would be a mailing
only to those who would respond yes,
namely a 100% success rate, with no one
left out (no false negatives)
93
• If the curve is below the diagonal, the
algorithm is determining the yes probability
for instances with a success rate lower
than a random sample
• This is similar to other situations, where
the success rate is below 50%, for
example
• A data mining algorithm with this kind of
performance is an “anti-predictor”
94
Cost Benefit Curves
• The costs and benefits of different sample
sizes/mailing scenarios will differ
• As noted earlier, the cost of the algorithm
itself can be a factor
• But for the time being we will ignore it
• The assumption is that we have already
done the work of running the algorithm
• We want to know what this completed
work can tell us
95
• The cost of interest now is the cost of
mailing an individual item
• We’ll assume that the cost is constant per
item
• The benefit of interest is the value of the
business generated per yes response
• We’ll assume that the benefit is constant
per positive response
96
• We now have three data items:
• A lift chart
• A cost
• A benefit
• From this information it is possible to form a cost/benefit curve across the same domain (percent of sample size) as the lift chart
97
• Figure 5.2 was produced by Weka
• It shows lift charts on the left and
cost/benefit curves on the right for various
scenarios
• It is given on the following overhead and
will be discussed in greater detail on the
overheads following it
98
99
Comments on Figure 5.2
• For Figure 5.2, a benefit of +15.00 per
positive response is assumed
• The charts at the top assume a mailing
cost per item of .50
• A lift chart is shown on the left and a
cost/benefit curve is shown on the right
• The cost/benefit curve keeps on rising all the way to 100% of the population, so it makes sense to send mail to everybody if you can afford it
100
• The charts at the bottom assume a cost
per item of .80
• A lift chart is shown on the left and a
cost/benefit curve is shown on the right
• The cost/benefit curve has a peak in the
middle
• This means you only want to mail to this
proportion of the population
101
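As a rough sketch of the cost/benefit arithmetic behind curves like those in Figure 5.2: the benefit per response (15.00) and the two mailing costs (.50 and .80) are the figures quoted above, but the population and response counts below are hypothetical, chosen only to reproduce the two shapes described (rising out to 100% versus peaking in the middle).

```python
def net_benefit(sample_size, yeses, benefit_per_yes, cost_per_item):
    # Value of the responses generated minus the cost of the items mailed.
    return yeses * benefit_per_yes - sample_size * cost_per_item

# (people mailed, yeses reached) at three cutoffs of the ranked list
cutoffs = [(200, 50), (500, 80), (1000, 100)]

for cost in (0.50, 0.80):
    print("cost per item:", cost)
    for size, yeses in cutoffs:
        print("  mail to", size, "->", net_benefit(size, yeses, 15.00, cost))
# At .50 the net benefit keeps rising out to the whole population;
# at .80 it peaks at the middle cutoff.
```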
• Keep in mind that the sample is selected
by ranking by the algorithm, not randomly
• Roughly speaking, gauging by the shape of the cost/benefit curve:
• You would like to mail to those prospective
customers in the top half of the data set
• Mailing to more than these gives
diminishing returns
102
• In summary:
• The practical result of this is to tell you what portion of the sample it is cost-effective to contact
• The effect of the changed cost on the
curves generated by the two scenarios is
notable
103
What Else is in Figure 5.2?
• The details of Figure 5.2 are impossible to
see
• In brief, the GUI includes the following:
• A confusion matrix
• A cost matrix where you input the costs of
TP, TN, FP, FN
104
• A slider representing the x-axis, the
percent of the population
• This allows you to get specific cost results
for a given sample size, as well as the
graphical representations of the results
across the whole range
105
A Side Note on Where You
Might Be in Exploring Weka
• The fact that this figure was produced by
Weka is a sign:
• If you haven’t downloaded and installed
Weka yet, it’s not too soon to do so
• The book also has links to the example data sets, already in ARFF format
• Looking through the GUI, you can find the
options to run various data mining
algorithms
106
• You can also find the options to run
various tools that allow you to explore and
compare the results of running the
algorithms
• There are no homework assignments, so
it’s up to you to start looking into these
things on a schedule that agrees with you
107
• There are two goals:
• To understand the discussions in class in
preparation for the second test
• To start getting ready to do the final
project
108
A Preliminary Note on the
Project
• The syllabus contained some information
on the general form of the project
• Here is a little more preliminary detail
• You can do the project using the UCI data
sets or any of the other data sets made
available on the Weka Web site
• If you do so, you are on the hook for the
full 8 combinations of data and algorithm
109
• If you use another data set, the number of
combinations you have to run will be
reduced
• In either case, for full points you will also
have to do some exploring and comparing
of results
• Further details on these aspects of the
project are posted separately
110
ROC Curves
• The acronym ROC stands for receiver
operating characteristic
• This term originated in the analysis of
signal to noise ratio in a communication
channel
• It is a way of comparing true positives to
false positives
111
• In general terms, you can think of it as
“pushing” for greater true positives and
measuring your success against the false
positives generated
• An ROC curve is similar to a lift chart
• It provides a way of drawing conclusions
about the results of a data mining
algorithm
112
• Multiple ROC curves can be used to
compare data mining algorithms
• ROC curves also graphically illustrate a
means of getting the best possible results
over a range of parameter values by
combining two different data mining
algorithms
113
Thinking about Data Values and
Probability Based Predictions
• Consider a data mining algorithm that
gives prediction probabilities
• For a binary attribute, you ultimately want
a yes or a no classification
• For those instances with a predicted
probability > .5 the prediction is yes
114
• Technically, for those instances with a
predicted probability < .5 the prediction is
no
• In the discussion that follows, this second
half of the picture will not be explicitly
covered
• We’ll restrict ourselves to those instances
where the prediction is > .5
115
• Now, take a look at Table 5.6 again
• It is repeated on the following overhead
• Notice that it basically goes down to the
level of a prediction probability of .5 and
ignores the instances with lower
probabilities
116
117
• Essentially what we’re seeing is that
instances 1-17 are predicted yes
• Then in the right hand column, those
instances that are yes are TP’s
• Those instances that are no are FP’s
• Likewise, the FP’s are clearly identified
• [You could do a similar thing (in reverse)
for those instances with probabilities < .5]
118
• You can break things down in this way:
• Actual negatives = FP + TN
• Actual positives = TP + FN
119
Forming the ROC Curve
• Let a set of data be given that has been
sorted by the probability prediction like in
Table 5.6
• The ROC curve plots the FP rate on the x-axis and the TP rate on the y-axis
120
• FP (negative classified as positive) rate
• = FP / Actual negatives
• = FP / (FP + TN) * 100
• TP (positive classified as positive) rate
• = TP / Actual positives
• = TP / (TP + FN) * 100
121
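A minimal sketch of plotting ROC points from a ranked list, using the rate definitions above; the sequence of actual classes is hypothetical.

```python
# Actual classes, listed in order of decreasing predicted probability of yes.
ranked_actuals = ["yes", "yes", "no", "yes", "no", "yes", "no", "no"]

total_pos = ranked_actuals.count("yes")
total_neg = ranked_actuals.count("no")

tp = fp = 0
points = [(0.0, 0.0)]                  # (FP rate, TP rate), both in percent
for actual in ranked_actuals:          # move the cutoff down the ranking
    if actual == "yes":
        tp += 1
    else:
        fp += 1
    points.append((fp / total_neg * 100, tp / total_pos * 100))

print(points)
```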
• The ROC curve for the Table 5.6 data is
given in Figure 5.3, shown on the following
overhead
• The jagged line represents a discrete plot
of the values in the table
• The dashed line obviously represents an
approximate curve
• The fact that the curve is bowed up is
good—TP is winning over FP
122
123
Similarity between the ROC
Curve and the Lift Chart
• The ROC curve looks similar to the lift
chart curve shown earlier
• Lift chart x-axis ~ sample size as a percent
of the population
• Lift chart y-axis ~ TP
• ROC curve x-axis ~ # of FP as a % of the
actual number of negatives
• ROC curve y-axis ~ TP
124
• The y-axes are essentially the same--TP
• The x-axes are similar in this way:
• Lift chart: increase sample size
• ROC: the proportion of false positives to actual negatives would, if you just randomly sampled, track with an increased sample size in a lift chart
125
• There is a close similarity between the lift
chart and the ROC curve for the direct
mailing example
• The majority of the population are actual
negatives
• Increasing the number of negatives tracks
directly with increasing the sample size
126
The Upper Left is Good
• With a different example an ROC curve
and a lift curve may not be as similar
• However, there is still an overall similarity:
• The upper left hand corner of the graph is
“good”
• The graph displays TP (y) vs. FP (x)
• You’d like to maximize TP and minimize
FP, which would occur in the upper left
127
Arriving at the Smooth Curve
• Figure 5.3 is repeated on the following
overhead for reference
• Note again the smooth curve represented
as a dashed line
128
129
• The jagged line represented a single
display of one data set
• For general analytical purposes, you might
assume that across the whole problem
domain, with different data sets, the
relationship between TP and FP would be
a smooth curve
• This would be what the dashed line
represents
130
• The book gives several explanations of
how to get the smooth curve
• The basic idea is to use cross-validation
• To me the book’s detailed explanations
are foggy
• I’d like to appeal to a simple visual
argument
131
• Suppose you did 10-fold cross-validation
• You’d have 10 test sets
• If you graphed TP vs. FP for each test set,
you’d get 10 different jagged lines
• Each of these is an approximate curve
132
• Since the axes are percents, the shape of
all of the curves would be the same
• If you averaged the y values for each x
value, the results would approximate a
smooth curve
• This would be your approximation for the
population as a whole
133
ROC Curves for Different
Learning Schemes
• At the outset, the following was noted:
• You can use approaches like ROC curves
to compare two data mining schemes
• You can use ROC to combine the best of
two schemes
• That’s the topic now
134
• Figure 5.4, given on the following
overhead, shows ROC curves for two data
mining schemes
• Scheme A is further to the upper left for
smaller sample sizes
• Scheme B is further to the upper left for
larger sample sizes
• You can compare and choose between the two if you had a preference based on sample size
135
136
• The shaded area on the graph is referred
to as the convex hull of the two curves
• Part of the hull lies outside the boundaries
of the curves for A and B
• However, the edge of the shaded region
still represents an achievable result
• For sample sizes in that range, a linear
combination of A and B can give these
ROC curve results
137
• The outer boundary of the hull is tangent
to both A and B
• Let (x1, y1) and (x2, y2) be the tangent
points for A and B, respectively
• Suppose you wanted to find the point 1/nth
of the way from (x1, y1) to (x2, y2)
138
• xnew = x1 + 1/n(x2 – x1)
• xnew = 1/n x2 + (x1 – 1/n x1)
• xnew = 1/n x2 + (1 – 1/n) x1
• ynew = y1 + 1/n(y2 – y1)
• ynew = 1/n y2 + (y1 – 1/n y1)
• ynew = 1/n y2 + (1 – 1/n) y1
139
• 1/n and (1 – 1/n) can work like
probabilities
• They sum to 1
• You can use a Monte Carlo like approach
to achieving the best possible ROC value
for any point between the tangent points to
A and B
140
• This would be cross-validation like in
nature
• For multiple data sets, apply A at the ratio
of FP to FN that gives its tangent point
• For multiple data sets, apply B at the ratio
of FP to FN that gives its tangent point
141
• For a given problem where you want an
ROC (ratio of FP to FN) that lies 1/nth of
the way from one tangent point to the
other:
• Let the ratio of B data sets to A data sets
be 1/n to (1 – 1/n)
• Combine the results of the computations
generated in this proportion and it will be
the desired point on the line
142
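Here is a minimal sketch of the interpolation and of the random choice between schemes; the tangent points, the value of n, and the two stand-in classifiers are all hypothetical.

```python
import random

x1, y1 = 20.0, 60.0    # scheme A's tangent point (FP rate %, TP rate %)
x2, y2 = 50.0, 85.0    # scheme B's tangent point
n = 4                  # target: the point 1/4 of the way from A toward B

t = 1 / n
x_new = (1 - t) * x1 + t * x2
y_new = (1 - t) * y1 + t * y2
print(x_new, y_new)    # 27.5 66.25

def classify_with_a(instance):    # stand-in for scheme A at its tangent point
    return "no"

def classify_with_b(instance):    # stand-in for scheme B at its tangent point
    return "yes"

def combined_classifier(instance):
    # Use B for a 1/n share of the instances and A for the rest; over many
    # instances the combined (FP rate, TP rate) lands near (x_new, y_new).
    if random.random() < t:
        return classify_with_b(instance)
    return classify_with_a(instance)

print(combined_classifier({"age": 30}))
```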
• Note that the foregoing overheads are my
attempt to explain what the book says
• I don’t claim that they are wrong
• I did find the book’s explanation kind of
incomplete
• My review of it is no better
• There seem to be some missing links
143
Recall-Precision Curves
• Even though the section heading is recall-precision curves, we won’t actually see a graph in this section…
• The general idea behind lift charts and
ROC curves is trade-off
• They are measures of good outcomes vs.
unsuccessful outcomes
• This is cost-benefit in simple terms (not in
terms of a cost factor and cost equation)
144
• Trying to measure the tradeoff between
desirable and undesirable outcomes
occurs in many different problem domains
• Different measures may be used in
different domains
• In the area of information retrieval, two
measures are used:
• Recall and precision
145
• Recall = (# of relevant docs retrieved) / (total # of relevant docs)
• Precision = (# of relevant docs retrieved) / (total # of docs retrieved)
146
• Notice the general idea illustrated by these
measures and inherent in the other ones
we’ve looked at:
• You can increase the number of relevant
documents you retrieve by increasing the
total number of documents you retrieve
• But as you do so, the proportion of
relevant documents falls
147
• Consider the extreme case
• How would you be guaranteed of always
retrieving all relevant documents?
• Retrieve all documents in total
• But you’ve gained nothing
• If you retrieve all documents, the task still
remains of winnowing the relevant
documents from the rest
148
Discussion
• The book notes that in non-computer
fields, like medical testing, the same kinds
of ideas crop up
• For a given medical test:
• Sensitivity = proportion of people with the
disease who test positive
• Specificity = proportion of people without
the disease who test negative
• For both of these measures, high is good
149
• Incidentally, these concepts also remind
me of association rule mining
• Support and confidence seem to have
something in common with the
dichotomies presented in this section
150
One Figure Measures
• In addition to 2-dimensional graphs or
curves like lift charts and ROC curves,
there are also techniques for trying to
express the goodness of a scheme in a
single number
• For example, in information retrieval there
is the concept of “average recall”
measures
151
• Average recall measures are actually
averages of precision over several recall
values
• Three-point average recall is the average
precision for recall figures of 20%, 50%,
and 80%
• Eleven-point average recall is the average
precision for figures of 0%-100% by 10’s
152
• The book also cites the F-measure
• (2 x recall x precision) / (recall + precision)
• = (2 x TP) / (2 x TP + FP + FN)
• If you go back to the definitions you can derive the 2nd expression from the 1st
• Looking at the second expression, the
simple point is that a larger value is better
153
• The simplest one-figure measure of
goodness is simply the success rate:
• (TP + TN) / (TP + FP + TN + FN)
• One-figure measures don’t contain as
much information as graphs
• By definition, graphs tell you the outcome
over a range
154
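The one-figure measures from the last few overheads can be written out directly; here is a small Python sketch with hypothetical TP/FP/TN/FN counts.

```python
TP, FP, TN, FN = 40, 10, 45, 5

recall = TP / (TP + FN)                        # relevant retrieved / all relevant
precision = TP / (TP + FP)                     # relevant retrieved / all retrieved
f_measure = 2 * recall * precision / (recall + precision)
f_measure_alt = 2 * TP / (2 * TP + FP + FN)    # the second form of the F-measure
success_rate = (TP + TN) / (TP + FP + TN + FN)

print(round(recall, 3), round(precision, 3))         # 0.889 0.8
print(round(f_measure, 3), round(f_measure_alt, 3))  # both 0.842
print(success_rate)                                  # 0.85
```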
• Table 5.7, shown on the following
overhead, summarizes lift charts, ROC
curves, and recall precision curves
• Even though we haven’t been given a
graph of a recall-precision curve, this is
where such curves are put in context with
lift charts and ROC curves
155
156
Cost Curves
• In the previous section two ROC curves
were graphed together
• Their performance on samples of different sizes could be compared
• This comparison didn’t include cost
• It is not straightforward to see how cost
can be incorporated into lift charts or ROC
curves
157
• The book introduces error and cost curves
as a way of including cost in scheme
comparisons
• Even though error and cost curves have a
purpose similar to lift charts and ROC
curves, they are quite different from them
• In trying to understand error curves and
cost curves it’s important not to confuse
them with lift charts and ROC curves
158
The Error Curve
• The book presents an error curve and a
cost curve together in Figure 5.5
• I will deal with the two curves one at a time
• The presentation of error curves will come
first
• Consider the error curve, Figure 5.5a,
shown on the following overhead
159
160
• The error curve shown is based on binary
classification
• In the figure, a yes classification is
symbolized by “+”, so the probability of a
yes is p[+]
• By extension a no is symbolized by “-”
• The solid curve (line) shows the
performance of a classification scheme, A
161
The Axes of the Graph
• The x-axis is the probability of a yes, p[+],
classification in a data set that A is applied
to
• In other words, A’s performance may differ
for different probabilities of yes in the data
set
• The domain is the interval of probabilities
from 0 to 1
162
• The y-axis, the performance measure, is
the expected error
• This is the probability of a misclassification
by A for a given value of p[+]
• The range is the interval of expected
errors from 0 to 1
163
• The expected error is not the count of
errors, but the error rate from 0 to 100%
• The error rates for false negatives and
false positives are given using italics as fn
and fp, respectively
164
Characteristics of the Curve for
Data Mining Algorithm A
• A is shown as linear
• It is not known whether classification
scheme performance is necessarily linear,
but for the purposes of discussion, this
one is
• Also note that the slope of A is not zero
• A’s performance does differ for different
values of x
165
• Under the assumption that A is linear, A is
not hard to plot
• The left hand endpoint, the y intercept, is
defined as the false positive rate, fp, that A
gives when the probability of yes in the
data set is 0
• The y value for the right hand endpoint is defined as the false negative rate, fn, that A gives when the probability of a yes in the data set is 1
166
• For a given probability of yes on the x-axis
between 0 and 1:
• The y value would be the expected
misclassification rate, whether fp or fn, for
that probability p[+]
167
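Given those two endpoints, the error curve (line) for A is easy to write down; here is a small sketch with hypothetical fp and fn rates.

```python
fp = 0.10   # false positive rate when the data set contains no yeses (p[+] = 0)
fn = 0.30   # false negative rate when the data set is all yeses (p[+] = 1)

def expected_error(p_yes):
    # Straight line from fp at p[+] = 0 to fn at p[+] = 1.
    return (1 - p_yes) * fp + p_yes * fn

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, round(expected_error(p), 3))
```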
• A is lower on the left than on the right
• For data sets more likely to contain no, A
has fewer errors
• For data sets less likely to contain no, A
has more errors
• In a sense, you might say that the error
curve (line) shows that A is biased
• A is apparently more likely to give a
classification of no than yes
168
Other Things Shown in the
Figure
• The book indicates other things in the
figure
• The horizontal line at the bottom, the x-axis, would represent the performance of a
perfect predictor—expected error = 0
• The horizontal line at the top would
represent the performance of a predictor
that was always wrong—expected error =
1
169
• The dashed diagonal from the origin to the upper right would represent the performance if you always predicted no (its expected error is p[+])
• The dashed diagonal from the upper left to the lower right would represent the performance if you always predicted yes (its expected error is 1 – p[+])
170
• The two diagonals seem uninteresting, but
they bring out a useful point
• You don’t have to use the same predictor
across the whole domain
• Below a certain point, simply predicting no
outperforms A
• Above a certain point, simply predicting
yes outperforms A
171
• In theory, you could take a random sample
of a data set and see what the probability
of a yes was
• You could then choose which predictor to
use accordingly
172
The Cost Curve
• An example of a cost curve, Figure 5.5b, is
shown on the following overhead
• It looks somewhat similar to the error
curve of Figure 5.5a
• This is not an accident, but the axes of the
curves are different
• As the name implies, this curve contains
information about costs, not just error
rates
173
174
• It is not reasonable to try and derive or
prove the formulas they use in creating the
cost curve
• It’s also not reasonable to think we’ll come
to much of a theoretical or intuitive
understanding of the formulas
175
• My goal in presenting this will be the
following:
• Try to figure out what practical outcomes
in the graph result from the formulas used
• Then try to figure out how these practical
outcomes make it possible to read the
graph and understand what it represents
overall, even if the derivations are not
given in detail
176
The Axes of the Cost Curve—
The X-Axis
• The starting point for understanding the
cost curve has to be the axes
• The x-axis is labeled as the probability
cost function
• The book’s notation for this is pc[+]
• This is an extension of the notation p[+],
the probability of a yes, which was the x-axis of the error curve
177
• By way of introduction, consider the
following:
• Yes, we want a curve that graphs cost
• You might assume that you graph cost on the y-axis against something entirely different on the x-axis
178
• It turns out that cost characteristics are
going to be part of the x-axis
• Part of the goal of this discussion will be to
explain how a cost curve can be based on
an x-axis that contains cost information
• The book’s expression for the x-axis,
along with an explanatory version, are
given on the following overhead
179
• pc[+], the x-axis of the cost curve
• = (p[+] x C[-|+]) / (p[+] x C[-|+] + p[-] x C[+|-])
• = (prob(yes) x cost(FN)) / (prob(yes) x cost(FN) + prob(no) x cost(FP))
180
• We don’t know what this “means” or
“where it came from”, but certain things
about it can be observed
• pc[+] is a transformation of p[+]
• p[+] appears in both the numerator and
denominator
• You might interpret p[-] as (1 – p[+])
181
• If you did a little simple graphing of the
function, you would find that it has curves
and discontinuities (where the
denominator goes to 0)
• So the x-axis of the cost curve is a
function of the x-axis of the error curve
182
• If you substitute p[+] = 0 and p[+] = 1, you
get the following
• If prob(yes) = p[+] = 0, pc[+] = 0
• If prob(yes) = p[+] = 1, pc[+] = 1
• Expressing the x-axis as a function, this is:
• pc[+] = fcn(p[+])
• For p[+] = 0, fcn(p[+]) = 0
• For p[+] = 1, fcn(p[+]) = 1
183
• In words, the x-axis of the cost curve is a
transformation of the x-axis of the error
curve
• The endpoints of the interval of interest of
the error curve map to the same, desired
endpoints for the interval of interest for the
cost curve, 0 and 1
• In between these points, the cost curve’s x-axis is actually a curvilinear function of the error curve’s x-axis
184
The Y-Axis
• With the x-axis established, what is the y
value being mapped against it?
• The y-axis represents what is known as
the normalized expected cost
• The normalized expected cost is a function
of the probability cost function
• y = NEC = f(pc[+]) = f(x),
185
• This is the formula for the normalized
expected cost, namely the formula given in
the book for the function that defines the y
coordinate of the graph:
• fn x pc[+] + fp x (1 – pc[+])
• Note that fn and fp are written as small letters in italics
• The italicized fn and fp are not the counts FP and FN
186
• fn and fp are the rates or ratios of false
negatives and false positives
• This was also the notation used in the
error curve graph
• In the error curve, fn and fp varied
according to p[+]
187
• The formula given for the y coordinate of
the cost curve appears quite simple
• However, it would be wise to take into
account that when calculating it, fn and fp
would themselves vary across the range of
p[+]
188
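Putting the two formulas together, here is a minimal sketch of computing a point on the cost curve; the probabilities, costs, and error rates are hypothetical.

```python
def pc_plus(p_yes, cost_fn, cost_fp):
    # The probability cost function: the x-axis of the cost curve.
    p_no = 1 - p_yes
    return (p_yes * cost_fn) / (p_yes * cost_fn + p_no * cost_fp)

def normalized_expected_cost(fn_rate, fp_rate, pc):
    # The y coordinate: fn x pc[+] + fp x (1 - pc[+]).
    return fn_rate * pc + fp_rate * (1 - pc)

pc = pc_plus(p_yes=0.3, cost_fn=5.0, cost_fp=1.0)
nec = normalized_expected_cost(fn_rate=0.2, fp_rate=0.1, pc=pc)
print(round(pc, 3), round(nec, 3))   # 0.682 0.168
```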
• In practical terms, what have we got?
• The normalized expected cost—
• Is the sum of two terms—
• Each term consisting of two factors—
• Where each of the factors falls between 0 and 1—
189
• Each of fp and fn falls between 0 and 1
• You hope their sum is less than 1; if it exceeds 1, the scheme is doing worse than chance (an anti-predictor)
• The sum of pc[+] and (1 – pc[+]) is clearly 1
190
• These arithmetic facts mean that the
normalized expected cost can only fall
between 0 and 1
• This is why it’s called normalized
• From a purely pragmatic point of view this
normalization is the reason (perhaps
among others) why the x-axis is defined
the way it is and the function y = f(x) is
defined the way it is
191
What Does the Graph Portray?
• The cost curve is shown again on the
following overhead as a memory refresher
• With the axes defined, what information
can be read from the graph?
192
193
• The line A represents the cost curve for a
particular data mining algorithm
• The x-axis, the probability cost function,
contains information on two things at once:
• The probability of a yes or no for instances
of various different data sets
• The cost of FP and FN for a particular
problem domain
194
• The x-axis doesn’t include any information
about a data mining algorithm or its
performance
• The x-axis contains information about the
domain only (as it should)
195
• The y-axis is the dimension along which
you graph algorithm dependent
information
• The fp and fn in the function being
graphed are performance measures for an
algorithm at given probabilities of yes/no in
the data set
196
• Remember, for example, that if an
algorithm has been trained to lean towards
no, but the data set leans towards yes,
you will tend to get more FN’s and fn will
be higher
• This is the kind of information that the
factors fn and fp include in the formula
197
• After all this blah blah blah, presumably
the following simple description of the cost
curve will make some sense:
• The cost curve graphs the normalized cost
of applying a data mining algorithm to data
sets where the p[+] ranges from 0 to 1
198
What Else is in the Figure?
• Given the explanation so far, what else is
in the figure and what does it signify?
• Once again, the figure is repeated as a
reminder of what’s in it
199
200
• The performance cost function of a given
algorithm is shown as a straight, solid line
• The book still doesn’t say why the lines
would necessarily be straight, but we’ll
take them as they’re given
• There are lines for A and B in the figure,
plus shadowy lines for other hypothetical
data mining algorithms
201
• A and B cross and have different
performance/cost characteristics over
different ranges of p[+] or pc[+] values
• The always yes/always no predictors are
shown as crossed, dashed, diagonal lines
• The curved, dotted line is referred to as an
envelope
202
• The idea behind the shadowy lines and
the envelope is that you might have
multiple data mining algorithms
• If you used each algorithm only in that
range of pc[+] where it was the best, the
envelope is the minimum cost curve over
the problem
203
• Finding the best algorithm to use, given
this envelope, is not hard to accomplish
• A statistical sample will tell you the p[+] for
any given data set out of a problem
domain
• Then use the algorithm that performs best
for that value
204
The End
205