Data Mining
Chapter 5
Credibility: Evaluating What’s
Been Learned
Kirk Scott
1
The Lontara Script, an Example of an Abugida
2
Comparison of various abugidas descended from Brahmi
script. May Śiva protect those who take delight in the
language of the gods. (Kalidasa)
3
Bopomofo system of phonetic notation
for the transcription of spoken Chinese
4
Oracle bone script: (from left) 馬/马 mǎ "horse", 虎 hǔ "tiger", 豕 shĭ "swine",
犬 quǎn "dog", 鼠 shǔ "rat and mouse", 象 xiàng "elephant", 豸 zhì "beasts of prey",
龜/龟 guī "turtle", 爿 qiáng "low table" (now 床 chuáng), 為/为 wèi "to lead"
(now "do or for"), and 疾 jí "illness"
5
5.0 On to the Topic at Hand…
6
• One thing you’d like to do is evaluate the
performance of a data mining algorithm
• More specifically, you’d like to evaluate
results obtained by applying a data mining
algorithm to a certain data set
• For example, can you estimate what
percent of new instances it will classify
correctly?
7
• Problems include:
• The performance level on the training set
is not a good indicator of performance on
other data sets
• Data may be scarce, so that enough data
for separate training and test data sets
may be hard to come by
• A related question is how to compare the
performance of 2 different algorithms
8
• Parameters for evaluation and
comparison:
• Are you predicting classification?
• Or are you predicting the probability that
an instance falls into a classification?
• Or are you doing numeric prediction?
• What does performance evaluation mean
for things other than prediction, like
association rules?
9
• Can you define the cost of a
misclassification or other “failure” of
machine learning?
• There are false positives and false
negatives
• Different kinds of misclassifications will
have different costs in practice
10
• There are broad categories of answers to
these questions
• For evaluation of one algorithm, a large
amount of data makes estimating
performance easy
• For smaller amounts of data, a technique
called cross-validation is commonly used
• Comparing the performance of different
data mining algorithms relies on statistics
11
5.1 Training and Testing
12
• For tasks like prediction, a natural
performance measure is the error rate
• This is the proportion of incorrect
predictions out of all predictions made
• How to estimate this?
13
• A rule set, a tree, etc. may be imperfect
• It may not classify all of the training set
instances correctly
• But it has been derived from the training
set
• In effect, it is optimized for the training set
• Its error rate on an independent test set is
likely to be much higher
14
• The error rate on the training set is known
as the resubstitution error
• You put the same data into the classifier
that was used to create it
• The resubstitution error rate may be of
interest
• But it is not a good estimate of the “true”
error rate
15
• In a sense, you can say that the classifier,
by definition, is overfitted on the training
set
• If the training set is a very large sample, its
error rate could be closer to the true rate
• Even so, as time passes and conditions
change, new instances might not fall into
quite the same distribution as the training
instances and will have a higher rate of
misclassification
16
• Not surprisingly, the way to estimate the
error rate is with a test set
• You’d like both the training set and the test
set to be representative samples of all
possible instances
• It is important that the training set and the
test set be independent
• Any test data should have played no role
in the training of the classifier
17
• There is actually a third set to consider
• The validation set
• It may serve either of these two purposes:
• Selecting one of several possible data
mining algorithms
• Optimizing the one selected
18
• All three sets should be independent
• Test only with the test set
• Train only with the training set
• Validate only with the validation set
• In general, error rate estimation is done on
the test set
• After all decisions are made, it is
permissible to combine all three sets and
retrain on this superset for better results
19
5.2 Predicting Performance
20
• This section can be dealt with quickly
• Its main conclusions are based on
statistical concepts, which are presented
in a box
• You may be familiar with the derivations
from statistics class
• They are beyond the scope of this course
21
• These are the basic ideas:
• Think in terms of a success rate rather
than an error rate
• Based on a sample of n instances, you
have an observed success rate in the test
data set
22
• Statistically, you can derive a confidence
interval around the observed rate that
depends on the sample size
• Doing this provides more complete
knowledge about the real success rate,
whatever it might be, based on the
observed success rate
23
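As a rough sketch of the idea (using the simple normal approximation to the
binomial rather than the book's boxed derivation), a confidence interval for
the true success rate could be computed like this in Python:

import math

def success_rate_interval(successes, n, z=1.96):
    # Observed success rate on a test set of n instances
    f = successes / n
    # Normal-approximation margin; z = 1.96 corresponds to ~95% confidence
    margin = z * math.sqrt(f * (1 - f) / n)
    return f - margin, f + margin

# Example: 75 correct predictions out of 100 test instances
low, high = success_rate_interval(75, 100)
print(f"observed 0.75, true rate roughly in [{low:.3f}, {high:.3f}]")

With a larger test set the interval shrinks, which is the sense in which a
large amount of data makes estimating performance easy.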
5.3 Cross-Validation
24
• Cross-validation is the general term for
techniques that can be used to select
training/testing data sets and estimate
performance when the overall data set is
limited in size
• In simple terms, this might be a
reasonable rule of thumb:
• Hold out 1/3 of the data for testing, for
example, and use the rest for training (and
validation)
25
Statistical Problems with the
Sample
• Even if you do a proper random sample,
poorly matched test and training sets can
lead to poor error estimates
• Suppose that all of the instances of one
classification fall into the test set and none
fall into the training set
26
• The rules resulting from training will not be
able to classify those test set instances
• The error rate will be high (justifiably)
• But a different selection of training set and
test set would give rules that covered that
classification
• And when the rules are evaluated using
the test set, a more realistic error rate
would result
27
Stratification
• Stratification can be used to better match
the test set and the training set
• You don’t simply obtain the test set by
random sampling
• You randomly sample from each of the
classifications in the overall data set in
proportion to their occurrence there
• This means all classifications are
represented in both the test and training
sets
28
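As a rough illustration, a stratified holdout split could be coded by hand
like this in Python (the 1/3 test fraction is just the rule of thumb from the
earlier slide):

import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=1/3, seed=0):
    # Group instance indices by class
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    # Sample the test set from each class in proportion to its size
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(round(len(idxs) * test_fraction))
        test_idx.extend(idxs[:cut])
        train_idx.extend(idxs[cut:])
    return train_idx, test_idx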
Repeated Holdout
• With additional computation, you can
improve on the error estimate obtained
from a stratified sample alone
• n times, randomly hold out 1/3 of the
instances for testing, possibly using
stratification
• Average the error rates over the n times
• This is called repeated holdout
29
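A sketch of repeated holdout, reusing the stratified_holdout function above;
train_and_score is a hypothetical helper that trains the classifier on the
first index list and returns its error rate on the second:

def repeated_holdout(labels, train_and_score, n_repeats=10):
    rates = []
    for seed in range(n_repeats):
        # A different random (stratified) holdout each time
        train_idx, test_idx = stratified_holdout(labels, seed=seed)
        rates.append(train_and_score(train_idx, test_idx))
    # Average the error rates over the n_repeats runs
    return sum(rates) / len(rates)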
Cross-Validation
• The idea of multiple samples can be
generalized
• Partition the data into n “folds” (partitions)
• Do the following n times:
• Train on (n – 1) of the partitions
• Test on the remaining 1
• Average the error rates over the n times
30
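In code, plain (unstratified) n-fold cross-validation might look like the
sketch below; train_and_score is the same hypothetical stand-in for the
learning scheme:

def cross_validation(n_instances, train_and_score, n_folds=10):
    indices = list(range(n_instances))
    # Partition the indices into n_folds folds
    folds = [indices[i::n_folds] for i in range(n_folds)]
    rates = []
    for k, test_idx in enumerate(folds):
        # Train on the other n_folds - 1 folds, test on fold k
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        rates.append(train_and_score(train_idx, test_idx))
    # Average the error rates over the folds
    return sum(rates) / len(rates)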
Stratified 10-Fold Cross-Validation
• The final refinement of cross validation is
to make the partitions so that all
classifications are roughly in proportion
• The standard rule of thumb is to use 10
partitions
• Why?
• The answer boils down to this essentially:
• Human beings have 10 fingers…
31
• In general, it has been observed that 10
partitions lead to reasonable results
• 10 partitions is not computationally out of
the question
32
• The final refinement presented in the book
is this:
• If you want really good error estimates, do
10-fold cross validation 10 times with
different stratified partitions and average
the results
• At this point, estimating error rates has
become a computationally intensive task
33
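If a library is acceptable, this whole procedure is packaged in scikit-learn
(assumed installed here); the data set and classifier below are placeholders
for illustration only:

from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# 10 repetitions of stratified 10-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("estimated error rate:", 1 - scores.mean())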
5.4 Other Estimates
34
• 10 times 10-fold stratified cross-validation
is the current standard for estimating error
rates
• There are other methods, including these
two:
• Leave-One-Out Cross-Validation
• The Bootstrap
35
Leave-One-Out Cross-Validation
• This is basically n-fold cross validation
taken to the max
• For a data set with n instances, you hold
out one for testing and train on the
remaining (n – 1)
• You do this for each of the instances and
then average the results
36
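A sketch of leave-one-out, again with the hypothetical train_and_score helper;
note that each individual error rate is either 0 or 1, since the test set is a
single instance:

def leave_one_out(n_instances, train_and_score):
    rates = []
    for i in range(n_instances):
        # Train on every instance except i, test on instance i alone
        train_idx = [j for j in range(n_instances) if j != i]
        rates.append(train_and_score(train_idx, [i]))
    # Average over all n_instances runs
    return sum(rates) / n_instances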
• This has these advantages:
• It’s deterministic:
• There’s no sampling
• In a sense, you maximize the information
you can squeeze out of the data set
37
• It has these disadvantages:
• It’s computationally intensive
• By definition, a holdout of 1 can’t be
stratified
• By definition, the classification of a single
instance will not conform to the distribution
of classifications in the remaining
instances
• This can lead to poor error rate estimates
38
The Bootstrap
• This is another technique that ultimately
relies on statistics (presented in a box)
• The question underlying the development
and use of the bootstrap technique is this:
• Is there a way of estimating the error rate
that is especially well-suited to small data
sets?
39
• The basic statistical idea is this:
• Do not just randomly select a test set from
the data set
• Instead, select a training set (the larger of
the two sets) using sampling with
replacement
40
• Sampling with replacement in this way will
lead to these results:
• Some of the instances will not be selected
for the training set
• These instances will be the test set
• By definition, you expect duplicates in the
training set
41
• Having duplicates in the training set will
mean that it is a poorer match to the test
set
• The error estimate is skewed to the high
side
• This is corrected by combining this
estimate with the resubstitution error rate,
which is naturally skewed to the low side
42
• This is the statistically based formula for
the overall error rate:
• Overall error rate =
• .632 * test set error rate
• + .368 * training set error rate
• To improve the estimate, randomly sample
with replacement multiple times and
average the results
43
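A sketch of a single 0.632 bootstrap estimate, with the same hypothetical
train_and_score helper; as the slide says, in practice this would be repeated
with different random samples and the results averaged:

import random

def bootstrap_632(n_instances, train_and_score, seed=0):
    rng = random.Random(seed)
    # Sample n instances with replacement as the training set
    train_idx = [rng.randrange(n_instances) for _ in range(n_instances)]
    # Instances never drawn form the test set
    in_train = set(train_idx)
    test_idx = [i for i in range(n_instances) if i not in in_train]
    test_error = train_and_score(train_idx, test_idx)
    resub_error = train_and_score(train_idx, train_idx)  # resubstitution error
    # Weighted combination from the slide
    return 0.632 * test_error + 0.368 * resub_error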
• Like the leave-one-out technique, there
are cases where this technique does not
give good results
• This depends on the distribution of the
classifications in the overall data set
• Sampling with replacement in some cases
can lead to the component error rates
being particularly skewed in such a way
that they don’t compensate for each other
44
5.5 Comparing Data Mining
Schemes
45
• This section can be dealt with quickly
because it is largely based on statistical
derivations presented in a box
• The fundamental idea is this:
• Given a data set (or data sets) and two
different data mining algorithms, you’d like
to choose which one is best
46
• For researchers in the field, this is the
broader question:
• Across all possible data sets in a given
problem domain (however that may be
defined), which algorithm is superior
overall?
• We’re really only interested in the simpler,
applied question
47
• At heart, to compare two algorithms, you
compare their estimated error or success
rates
• In a previous section it was noted that you
can get a confidence interval for an
estimated success rate
• In a situation like this, you have two
probabilistic quantities you want to
compare
48
• The paired t-test is an established
statistical technique for comparing two
such quantities and determining if they are
significantly different
49
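For example, given the per-fold error rates of two schemes measured on the
same folds, a paired t-test could be run with SciPy (assumed installed; the
numbers below are invented purely for illustration):

from scipy import stats

errors_a = [0.21, 0.25, 0.19, 0.23, 0.22, 0.20, 0.24, 0.21, 0.22, 0.23]
errors_b = [0.26, 0.27, 0.24, 0.28, 0.25, 0.24, 0.29, 0.26, 0.27, 0.25]

# Paired t-test on the fold-by-fold differences
t_stat, p_value = stats.ttest_rel(errors_a, errors_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# A small p-value suggests the two error rates differ significantly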
5.6 Predicting Probabilities
50
• The discussion so far had to do with
evaluating schemes that do simple
classification
• Either an instance is predicted to be in a
certain classification or it isn’t
• Numerically, you could say a successful
classification was a 1 and an unsuccessful
classification was a 0
• In this situation, evaluation boiled down to
counting the number of successes
51
Quadratic Loss Function
• Some classification schemes are more
finely tuned
• Instead of just predicting a classification,
they produce the following:
• For each instance examined, a probability
that that instance falls into each of the
classes
52
• The set of k probabilities can be thought of
as a vector of length k containing elements
pi
• Each pi represents the probability of one of
the classifications for that instance
• The pi sum to 1
53
• The following discussion will be based on
the case of one instance, not initially
considering the fact that a data set will
contain many instances
• How do you judge the goodness of the
vector of pi‘s that a data mining algorithm
produces?
54
• Evaluation is based on what is known as a
loss function
• Loss measures how far the predicted
probabilities are from the classification
actually observed
• A good prediction of probabilities should
have a low loss value
• In other words, the difference is small
between what is observed and the
predicted probabilities
55
• Consider the loss for one instance
• Out of k possible classifications, the
instance actually falls into one of them,
say the ith
• This fact can be represented by a vector a
with a 1 in position i and 0 in all others
56
• The data mining algorithm produces a vector
of length k containing the probabilities that
the instance falls into each of the classifications
• This fact can be represented by a vector p
containing values pi
57
• Now for a single instance, consider this
sum over the different classes and their
probabilities:
Σj (pj – aj)²
58
Σj (pj – aj)²
• Ignore the squaring for a moment
• Excluding the ith case, this is a sum of the
probabilities that didn’t come true
• For a good prediction, you’d like this to be
small
59
Σj (pj – aj)²
• For the ith case you have pi – ai = pi – 1
• For a good scheme you’d like the
probability of predicting the correct
classification to be as near to 1 as
possible
• In other words, you’d like the difference
between the probability and 1 to be as
small as possible
60
• Putting the two parts back together again,
you’d like this sum to be small:
Σj (pj – aj)²
• As for the squaring, this is basically just
statistical standard operating procedure
• Squaring makes the function amenable to
calculus techniques
61
Σj (pj – aj)²
• The final outcome is this
• The goal is to minimize the mean square
error (MSE)
• Because of the squaring this is called the
quadratic loss function
• Recall that this discussion was for one
instance
• Globally you want to minimize the MSE
over all instances
62
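A direct transcription of the quadratic loss for one instance, with made-up
numbers for three classes:

def quadratic_loss(p, actual_class):
    # Vector a has a 1 in the position of the true class, 0 elsewhere
    a = [1.0 if j == actual_class else 0.0 for j in range(len(p))]
    # Sum of squared differences between predicted and actual vectors
    return sum((pj - aj) ** 2 for pj, aj in zip(p, a))

# The instance actually belongs to class 0
print(quadratic_loss([0.7, 0.2, 0.1], 0))   # 0.09 + 0.04 + 0.01 = 0.14
print(quadratic_loss([0.3, 0.4, 0.3], 0))   # a worse prediction: 0.74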
Informational Loss Function
• The informational loss function is another
way of evaluating a probabilistic predictor
• Like the discussion of the quadratic loss
function, this will be discussed in terms of
a single instance
• For a whole data set you would have to
consider the losses over all instances
63
• The book explains that the informational
loss function is related to the information
function for tree formation
• It is also related to logistic regression
• Suppose that a particular instance is
classified into the ith category
• For that instance, this is the loss function:
• - log2 pi
64
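The informational loss for one instance is just as short to compute; the
probability vectors below are the same made-up ones used in the quadratic
loss example:

import math

def informational_loss(p, actual_class):
    # -log2 of the probability assigned to the class that actually occurred;
    # the probabilities of the other classes play no role
    return -math.log2(p[actual_class])

print(informational_loss([0.7, 0.2, 0.1], 0))   # about 0.515 bits
print(informational_loss([0.3, 0.4, 0.3], 0))   # about 1.737 bits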
• I will give a simplified information theoretic
explanation at the end
• In the meantime, it is possible to
understand what the function does with a
few basic definitions and graphs
• Remember the following:
• The sum of all of the pi = 1
• Each individual pi < 1
65
• Now observe that the information loss
function involves only pi, the probability
value for the category that the instance
actually classified into
• The information loss function does not
include any information about the
probabilities of the categories that the
instance didn’t classify into
66
• So, restrict your attention to the ith
category, the one the instance classified
into
• A measure of a good predictor would be
how close the prediction probability, pi, is
to the value 1 (perfect prediction) for that
instance
67
• Here is the standard log function
• log2 pi
• Its graph is shown on the following
overhead
68
69
• Here is the informational loss function
• - log2 pi
• Its graph is shown on the following
overhead
• We are only interested in that part of the
graph where 0 < pi < 1, since that is the
range of possible values for pi
70
71
• The goal is to minimize the loss function
• What does that mean, visually?
• It means pushing pi as close to one as
possible
• When pi = 1 the information loss function
takes on the value 0
• An information loss function value of 0 is
the optimal result
72
• In vague information theoretic terms, you
might say this:
• The greater the probability of your
prediction, the more (valid) information you
apparently had to base it on
• Or, the less information you would have
needed in order to make it optimal
• As pi → 1, -log2 pi → 0
73
• Conversely, the less the probability of your
prediction the less (valid) information you
apparently had to base it on
• Or the more information you would have
needed to make it optimal
• As pi → 0, -log2 pi → ∞
74
• The book observes that if you assign a
probability of 0 to an event and that event
happens, the informational loss function
“cost” is punishing—namely, infinite
• A prediction that is flat wrong would
require an unmeasurable amount of
information to correct
75
• This leads to another very informal insight
into the use of the logarithm in the
function, and what that means
• If you view the graph from right to left
instead of left to right, as the probability of
a prediction gets smaller and smaller, the
information needed in order to make it
better grows “exponentially”
76
• In practice, because of the potential for an
infinite informational loss function value,
no event is given a probability of 0
77
• Consider the graph again
• Assume that information was measured in
bits rather than decimal numbers
• A more theoretical statement about the
relative costs would go like this:
• The informational loss function represents
the relative number of bits of information
that would be needed to effect a change in
the probability of a prediction
78
Discussion
• Which is better, the quadratic loss function
or the informational loss function?
• It’s a question of preference
• The quadratic loss function is based on
the probability of cases that didn’t occur as
well as cases that occurred
• The information loss function is based only
on the probability of the case that occurred
79
• The quadratic loss function is bounded (a
finite sum of squares of differences)
• The informational loss function is
unbounded
• Statisticians would probably find the
quadratic loss function (MSE criterion)
familiar and useful
80
• Information theorists would probably find
the informational loss function familiar and
useful
• The book points out that in theory you can
measure the information contained in a
structural representation in bits
• Advanced analyses can be based on
formulas that combine the information in a
representation and the bits involved in
measuring the error rate (bits still needed)
81
The End
82