Evaluation of Learning Models
Evgueni Smirnov
Overview
• Motivation
• Metrics for Classifier’s Evaluation
• Methods for Classifier’s Evaluation
• Comparing the Performance of two Classifiers
• Costs in Classification
  – Cost-Sensitive Classification and Learning
  – Lift Charts
  – ROC Curves
Motivation
• It is important to evaluate a classifier’s
  generalization performance in order to:
  – Determine whether to employ the classifier.
    (For example: when learning the effectiveness of
    medical treatments from limited-size data, it is
    important to estimate the accuracy of the classifiers.)
  – Optimize the classifier.
    (For example: when post-pruning decision trees we
    must evaluate the accuracy of the decision tree at
    each pruning step.)
Model’s Evaluation in the KDD Process
[Diagram: the KDD process. Selection produces target data from the raw data; preprocessing & cleaning produce processed data; transformation & feature selection produce transformed data; data mining produces patterns; interpretation / evaluation turns patterns into knowledge. Model evaluation belongs to the interpretation / evaluation step.]
How to evaluate the Classifier’s
Generalization Performance?
• Assume that we test a classifier on some
  test set and obtain the following confusion matrix:

                    Predicted class
                    Pos     Neg
  Actual   Pos      TP      FN      P
  class    Neg      FP      TN      N
Metrics for Classifier’s Evaluation
• Accuracy  = (TP + TN) / (P + N)
• Error     = (FP + FN) / (P + N)
• Precision = TP / (TP + FP)
• Recall / TP rate = TP / P
• FP rate   = FP / N

                    Predicted class
                    Pos     Neg
  Actual   Pos      TP      FN      P
  class    Neg      FP      TN      N
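As a quick illustration (not part of the original slides), here is a minimal Python sketch that computes these metrics from the four confusion-matrix counts; the function name evaluation_metrics is an assumption, and the example counts are those of Classifier 1 from the ROC slides later in the lecture.

```python
def evaluation_metrics(tp, fn, fp, tn):
    """Compute the slide's metrics from confusion-matrix counts."""
    p, n = tp + fn, fp + tn          # actual positives / negatives
    return {
        "accuracy":  (tp + tn) / (p + n),
        "error":     (fp + fn) / (p + n),
        "precision": tp / (tp + fp),
        "recall":    tp / p,         # TP rate
        "fp_rate":   fp / n,
    }

# Example: 40 TP, 60 FN, 30 FP, 70 TN (Classifier 1 on the ROC slides)
print(evaluation_metrics(tp=40, fn=60, fp=30, tn=70))
```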
How to Estimate the Metrics?
• We can use:
– Training data;
– Independent test data;
– Hold-out method;
– k-fold cross-validation method;
– Leave-one-out method;
– Bootstrap method;
– And many more…
Estimation with Training Data
• The accuracy/error estimates on the training data are
  not good indicators of performance on future data.
  [Diagram: the classifier is both trained and tested on the same training set.]
  – Q: Why?
  – A: Because new data will probably not be exactly the same
    as the training data!
• The accuracy/error estimates on the training data
  measure the degree of the classifier’s overfitting.
Estimation with Independent Test Data
• Estimation with independent test data is used when we
  have plenty of data and there is a natural way of
  forming training and test data.
  [Diagram: the classifier is built on the training set and evaluated on a separate test set.]
• For example: Quinlan in 1987 reported experiments in
  a medical domain for which the classifiers were trained
  on data from 1985 and tested on data from 1986.
Hold-out Method
• The hold-out method splits the data into training data and test
  data (usually 2/3 for training, 1/3 for testing). Then we build a
  classifier using the training data and test it using the test data.
  [Diagram: the data is split into a training set and a test set; the classifier is built on the former and evaluated on the latter.]
• The hold-out method is usually used when we have thousands
  of instances, including several hundred instances from each
  class.
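A hedged sketch of a hold-out evaluation with scikit-learn; the synthetic data and the choice of a decision tree are assumptions made only to keep the example runnable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic placeholder data (an assumption for the example).
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 10))
y = (X[:, 0] + rng.normal(size=3000) > 0).astype(int)

# 2/3 train, 1/3 test, as on the slide.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```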
Classification: Train, Validation, Test Split
[Diagram: the data with known results is split three ways: a training set used by the model builder, a validation set used to evaluate and tune the classifier builder, and a final test set used only for the final evaluation of the resulting classifier’s predictions.]
The test data cannot be used for parameter tuning!
Making the Most of the Data
• Once evaluation is complete, all the data
can be used to build the final classifier.
• Generally, the larger the training data the
better the classifier (but returns diminish).
• The larger the test data the more accurate
the error estimate.
Stratification
• The holdout method reserves a certain amount
for testing and uses the remainder for training.
– Usually: one third for testing, the rest for training.
• For “unbalanced” datasets, samples might not
be representative.
  – Few or no instances of some classes.
• Stratified sample: a more advanced way of
  balancing the data.
– Make sure that each class is represented with
approximately equal proportions in both subsets.
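One possible way to obtain such a stratified hold-out split with scikit-learn (the stratify argument preserves the class proportions); the synthetic, deliberately unbalanced data is an assumption for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data with unbalanced classes (an assumption for the example).
rng = np.random.default_rng(0)
X = rng.normal(size=(900, 5))
y = (X[:, 0] > 1.0).astype(int)          # roughly 16% positives

# Stratified hold-out: class proportions are preserved in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
print(y_train.mean(), y_test.mean())     # roughly equal positive proportions
```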
Repeated Holdout Method
• Holdout estimate can be made more reliable
by repeating the process with different
subsamples.
– In each iteration, a certain proportion is
randomly selected for training (possibly with
stratification).
– The error rates on the different iterations are
averaged to yield an overall error rate.
• This is called the repeated holdout method.
Repeated Holdout Method, 2
• Still not optimal: the different test sets
  overlap, but we would like all instances
  in the data to be tested at least once.
• Can we prevent overlapping?
(Witten & Eibe)
k-Fold Cross-Validation
• k-fold cross-validation avoids overlapping test sets:
  – First step: the data is split into k subsets of equal size;
  – Second step: each subset in turn is used for testing and the
    remainder for training.
• The subsets are stratified
  before the cross-validation.
• The estimates are averaged to
  yield an overall estimate.
  [Diagram: the data is partitioned into k folds; in each round one fold serves as the test set and the remaining folds form the training set.]
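A sketch of stratified k-fold cross-validation with scikit-learn; the synthetic data and the decision tree are illustrative assumptions only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data (an assumption for the example).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Stratified 10-fold cross-validation; the fold estimates are averaged.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```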
More on Cross-Validation
• Standard method for evaluation: stratified 10-fold cross-validation.
• Why 10? Extensive experiments have shown that this is the
  best choice to get an accurate estimate.
• Stratification reduces the estimate’s variance.
• Even better: repeated stratified cross-validation:
– E.g. ten-fold cross-validation is repeated ten times and
results are averaged (reduces the variance).
Leave-One-Out Cross-Validation
• Leave-One-Out is a particular form of
cross-validation:
– Set number of folds to number of training
instances;
– I.e., for n training instances, build classifier n
times.
• Makes best use of the data.
• Involves no random sub-sampling.
• Very computationally expensive.
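Leave-one-out is just cross-validation with as many folds as instances; a brief sketch using scikit-learn, kept small because n classifiers have to be built (the synthetic data is an assumption).

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Small synthetic placeholder data, since leave-one-out builds n classifiers.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", scores.mean())
```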
Leave-One-Out Cross-Validation
and Stratification
• A disadvantage of Leave-One-Out-CV is
that stratification is not possible:
– It guarantees a non-stratified sample because
there is only one instance in the test set!
• Extreme example - random dataset split
equally into two classes:
– Best inducer predicts majority class;
– 50% accuracy on fresh data;
– Leave-One-Out-CV estimate is 100% error!
Bootstrap Method
• Cross-validation uses sampling without replacement:
  – The same instance, once selected, cannot be selected
    again for a particular training/test set.
• The bootstrap uses sampling with replacement to
form the training set:
– Sample a dataset of n instances n times with replacement
to form a new dataset of n instances;
– Use this data as the training set;
– Use the instances from the original dataset that don’t
occur in the new training set for testing.
Bootstrap Method
• The bootstrap method is also called the 0.632
bootstrap:
  – A particular instance has a probability of 1 – 1/n of not
    being picked in a single draw;
  – Thus its probability of ending up in the test data is:

      (1 – 1/n)^n ≈ e^(–1) ≈ 0.368

  – This means the training data will contain approximately
    63.2% of the instances and the test data will contain
    approximately 36.8% of the instances.
Estimating Error with the
Bootstrap Method
• The error estimate on the test data will be very pessimistic
  because the classifier is trained on just ~63% of the
  instances.
• Therefore, combine it with the training error:

    err = 0.632 · e_test instances + 0.368 · e_training instances

• The training error gets less weight than the error on the
  test data.
• Repeat the process several times with different replacement
  samples; average the results.
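A sketch of the 0.632 bootstrap estimate described above, under the assumption that the out-of-bag instances form the test set; the helper name bootstrap_632_error, the synthetic data, and the decision tree are illustrative choices only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap_632_error(X, y, n_repeats=10, seed=0):
    """0.632 bootstrap error estimate, following the weighting on the slide."""
    rng = np.random.default_rng(seed)
    n, estimates = len(y), []
    for _ in range(n_repeats):
        train_idx = rng.integers(0, n, size=n)       # sample n times with replacement
        oob = np.ones(n, dtype=bool)
        oob[train_idx] = False                       # out-of-bag instances form the test set
        clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        e_test = np.mean(clf.predict(X[oob]) != y[oob])
        e_train = np.mean(clf.predict(X[train_idx]) != y[train_idx])
        estimates.append(0.632 * e_test + 0.368 * e_train)
    return float(np.mean(estimates))

# Synthetic placeholder data (an assumption for the example).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)
print("0.632 bootstrap error:", bootstrap_632_error(X, y))
```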
Confidence Intervals for Accuracy
• Assume that the estimated accuracy accS(h) of
classifier h is 75%.
• How close is the estimated accuracy accS(h) to
the true accuracy accD(h) ?
Confidence Intervals for Accuracy
• Classification of an instance is a Bernoulli trial.
  – A Bernoulli trial has 2 outcomes: correct and wrong;
  – The random variable X, the number of correct outcomes of N
    Bernoulli trials, has a Binomial distribution b(x; N, accD);
  – Example: If we have a classifier with true accuracy accD equal
    to 50%, then classifying 30 randomly chosen instances yields
    the following Binomial distribution:
    [Figure: Binomial distribution b(x; 30, 0.5) over the number X of correct classifications.]
Confidence Intervals for Accuracy
The main question:
Given number N of test instances and number x
of correct classifications, or equivalently,
empirical accuracy accS= x/N, can we predict
the true accuracy accD of the classifier?
Confidence Intervals for Accuracy
• The binomial distribution of X has mean equal to N accD
and variance N accD(1- accD).
• It can be shown that the empirical accuracy accS = X / N
  also follows a binomial distribution with mean equal to
  accD and variance accD(1 – accD)/N.
Confidence Intervals for Accuracy
• For large test sets (N > 30),
  the binomial distribution is
  approximated by a normal one
  with mean accD and variance
  accD(1 – accD)/N.
  [Figure: standard normal density with area 1 – α between Z_{α/2} and Z_{1–α/2}.]
• Thus,

    P( Z_{α/2} ≤ (accS – accD) / sqrt( accD(1 – accD)/N ) ≤ Z_{1–α/2} ) = 1 – α

• Confidence interval for accD (the two endpoints):

    ( 2·N·accS + Z²_{α/2} ± Z_{α/2} · sqrt( Z²_{α/2} + 4·N·accS – 4·N·accS² ) ) / ( 2·(N + Z²_{α/2}) )
Confidence Intervals for Accuracy
• Confidence interval for accD (the two endpoints):

    ( 2·N·accS + Z²_{α/2} ± Z_{α/2} · sqrt( Z²_{α/2} + 4·N·accS – 4·N·accS² ) ) / ( 2·(N + Z²_{α/2}) )

• The confidence intervals shrink when we decrease the confidence level:

    1 – α :   0.99   0.98   0.95   0.90   0.80   0.70   0.50
    Z_{α/2}:  2.58   2.33   1.96   1.65   1.28   1.04   0.67

• The confidence intervals become tighter when the number N of
  test instances increases. See below the evolution of the intervals
  at confidence level 95% for a classifier with empirical accuracy 80%:

    N       Confidence interval
    20      [0.58, 0.92]
    50      [0.67, 0.89]
    100     [0.71, 0.87]
    500     [0.76, 0.83]
    1000    [0.77, 0.82]
    5000    [0.78, 0.81]
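The interval formula above can be checked numerically; this small sketch (an illustration, not from the slides) reproduces the N = 100 row of the table for accS = 0.8 and Z_{α/2} = 1.96.

```python
import math

def accuracy_confidence_interval(acc_s, n, z=1.96):
    """Confidence interval for the true accuracy accD (formula from the slide)."""
    center = 2 * n * acc_s + z**2
    half = z * math.sqrt(z**2 + 4 * n * acc_s - 4 * n * acc_s**2)
    denom = 2 * (n + z**2)
    return (center - half) / denom, (center + half) / denom

print(accuracy_confidence_interval(0.8, 100))   # roughly (0.71, 0.87)
```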
Estimating Confidence Intervals of the
Difference of Generalization Performances
of two Classifier Models
• Assume that we have two classifiers, M1 and M2, and we would
like to know which one is better for a classification problem.
• We test the classifiers on n test data sets D1, D2, …, Dn, and we
receive error rate estimates e11, e12, …, e1n for classifier M1 and
error rate estimates e21, e22, …, e2n for classifier M2.
• Using these error rate estimates we can compute the mean error rate e1 for
  classifier M1 and the mean error rate e2 for classifier M2.
• These mean error rates are just estimates of error on the true
population of future data cases. What if the difference between
the two error rates is just attributed to chance?
Estimating Confidence Intervals of the
Difference of Generalization Performances
of two Classifier Models
• We note that the error rate estimates e11, e12, …, e1n for classifier M1
  and the error rate estimates e21, e22, …, e2n for classifier M2 are paired.
  Thus, we consider the differences d1, d2, …, dn, where dj = e1j – e2j.
• The differences d1, d2, …, dn are instantiations of n random
variables D1, D2, …, Dn with mean µD and standard deviation σD.
• We need to establish confidence intervals for µD in order to decide
whether the difference in the generalization performance of the
classifiers M1 and M2 is statistically significant or not.
Estimating Confidence Intervals of the
Difference of Generalization Performances
of two Classifier Models
• Since the standard deviation σD is unknown, we approximate it
  using the sample standard deviation sd:

    sd = sqrt( (1/n) · Σ_{i=1..n} [ (e1i – e2i) – (e1 – e2) ]² )

  where e1 and e2 are the mean error rates defined above.
• Since we approximate the true standard deviation σD, we
  introduce the T statistic:

    T = ( d – µD ) / ( sd / √n )

  where d is the mean of the differences d1, d2, …, dn.
Estimating Confidence Intervals of the
Difference of Generalization Performances
of two Classifier Models
• The T statistic is governed by the t-distribution with n – 1 degrees of
  freedom.
  [Figure: t-distribution density with area 1 – α between t_{α/2} and t_{1–α/2}.]
Estimating Confidence Intervals of the
Difference of Generalization Performances
of two Classifier Models
• If d and sd are the mean and standard deviation of the normally
  distributed differences of n random pairs of errors, a (1 – α)·100%
  confidence interval for µD = µ1 – µ2 is:

    d – t_{α/2} · sd / √n  <  µD  <  d + t_{α/2} · sd / √n,

  where t_{α/2} is the t-value with v = n – 1 degrees of freedom, leaving
  an area of α/2 to the right.
• Thus, if the interval contains 0.0, the difference is not statistically
  significant at significance level α.
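A sketch of this paired comparison in Python (an illustration only); the ten error-rate pairs are made-up placeholder numbers, and the sample standard deviation follows the slide's 1/n formula.

```python
import numpy as np
from scipy import stats

# Hypothetical paired error rates of M1 and M2 on n = 10 test sets (made-up numbers).
e1 = np.array([0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.14, 0.13])
e2 = np.array([0.14, 0.16, 0.13, 0.15, 0.15, 0.17, 0.13, 0.16, 0.15, 0.14])

d = e1 - e2
n = len(d)
sd = np.sqrt(np.mean((d - d.mean()) ** 2))      # sample std with the 1/n factor, as on the slide
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)    # alpha = 0.05, n - 1 degrees of freedom
half = t_crit * sd / np.sqrt(n)
lo, hi = d.mean() - half, d.mean() + half
print("95%% CI for the mean difference: [%.4f, %.4f]" % (lo, hi))
print("difference significant at alpha = 0.05:", not (lo <= 0.0 <= hi))
```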
Metric Evaluation Summary:
• Use test sets and the hold-out method for
“large” data;
• Use the cross-validation method for “middle-sized” data;
• Use the leave-one-out and bootstrap methods
for small data;
• Don’t use test data for parameter tuning - use
separate validation data.
Counting the Costs
• In practice, different types of classification
  errors often incur different costs.
• Examples:
  – Terrorist profiling
    • “Not a terrorist” is correct 99.99% of the time
  – Loan decisions
  – Fault diagnosis
  – Promotional mailing
Cost Matrices
                      Hypothesized class
                      Pos        Neg
  True     Pos        TP Cost    FN Cost
  class    Neg        FP Cost    TN Cost

Usually, TP Cost and TN Cost are set equal to 0!
Cost-Sensitive Classification
• If the classifier outputs a probability for each class, it can be
  adjusted to minimize the expected cost of its predictions.
• The expected cost is computed as the dot product of the vector of
  class probabilities and the appropriate column of the cost matrix.
                      Hypothesized class
                      Pos        Neg
  True     Pos        TP Cost    FN Cost
  class    Neg        FP Cost    TN Cost
Cost Sensitive Classification
• Assume that the classifier returns for an instance the probabilities
  ppos = 0.6 and pneg = 0.4. Then the expected cost if the instance is
  classified as positive is 0.6 * 0 + 0.4 * 10 = 4. The expected cost if
  the instance is classified as negative is 0.6 * 5 + 0.4 * 0 = 3. To
  minimize the cost, the instance is classified as negative.

                      Hypothesized class
                      Pos        Neg
  True     Pos        0          5
  class    Neg        10         0
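The same arithmetic in a few lines of Python (an illustration only); the rows of the cost matrix are the true class and the columns the hypothesized class, as in the table above.

```python
import numpy as np

# Cost matrix from the slide: rows = true class (pos, neg), columns = hypothesized class (pos, neg).
cost = np.array([[0, 5],
                 [10, 0]])

p = np.array([0.6, 0.4])            # class probabilities returned by the classifier
expected_cost = p @ cost            # dot product with each column of the cost matrix
print(expected_cost)                # [4. 3.]
print("predict:", ["pos", "neg"][int(expected_cost.argmin())])   # "neg" minimizes the cost
```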
Cost Sensitive Learning
• Simple methods for cost sensitive learning:
– Resampling of instances according to costs;
– Weighting of instances according to costs.
                      Hypothesized class
                      Pos        Neg
  True     Pos        0          5
  class    Neg        10         0

In Weka, cost-sensitive classification and learning can be applied to any classifier using
the meta scheme CostSensitiveClassifier.
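A minimal sketch of the weighting idea mentioned above (not the Weka CostSensitiveClassifier itself): each training instance is weighted by the cost of misclassifying its class, which is one reasonable weighting scheme and an assumption of this example, as are the synthetic data and the decision tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data (an assumption for the example); 1 = pos, 0 = neg.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)

# With the cost matrix above, a false negative costs 5 and a false positive costs 10,
# so negative instances get the larger weight.
misclassification_cost = {1: 5, 0: 10}
weights = np.array([misclassification_cost[label] for label in y])

clf = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=weights)
```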
Lift Charts
• In practice, decisions are usually made by comparing
possible scenarios taking into account different costs.
• Example: a promotional mailout to 1,000,000 households. If we
  mail to all households, we get a 0.1% response rate (1,000 responses).
• A data mining tool identifies (a) a subset of 100,000 households
  with a 0.4% response rate (400 responses); or (b) a subset of
  400,000 households with a 0.2% response rate (800 responses).
• Depending on the costs, we can make the final decision
  using lift charts!
• A lift chart allows a visual comparison.
Generating a Lift Chart
• Instances are sorted according to their predicted
  probability of being a true positive:

    Rank   Predicted probability   Actual class
    1      0.95                    Yes
    2      0.93                    Yes
    3      0.93                    No
    4      0.88                    Yes
    …      …                       …

• In a lift chart, the x axis is the sample size and the y axis is the
  number of true positives.
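A small sketch (an illustration, not from the slides) of how the lift-chart points can be computed from such a ranking; the function name lift_points is an assumption, and the returned pairs can be plotted with any plotting library.

```python
import numpy as np

def lift_points(y_true, p_pos):
    """Cumulative number of true positives versus sample size, for a lift chart."""
    order = np.argsort(-np.asarray(p_pos))          # sort by predicted probability, descending
    cum_tp = np.cumsum(np.asarray(y_true)[order] == 1)
    sample_size = np.arange(1, len(y_true) + 1)
    return sample_size, cum_tp

# The four ranked instances from the table above (1 = Yes, 0 = No).
print(lift_points([1, 1, 0, 1], [0.95, 0.93, 0.93, 0.88]))
```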
Hypothetical Lift Chart
ROC Curves and Analysis
  Classifier 1 (TPr = 0.4, FPr = 0.3):
            Predicted
  True      pos    neg
  pos       40     60
  neg       30     70

  Classifier 2 (TPr = 0.7, FPr = 0.5):
            Predicted
  True      pos    neg
  pos       70     30
  neg       50     50

  Classifier 3 (TPr = 0.6, FPr = 0.2):
            Predicted
  True      pos    neg
  pos       60     40
  neg       20     80
ROC Space
[Figure: the ROC space with the FP rate on the x axis and the TP rate on the y axis. The ideal classifier sits in the top-left corner, “always positive” in the top-right, “always negative” in the bottom-left, and the diagonal corresponds to chance performance.]
Dominance in the ROC Space
Classifier A dominates classifier B if and only if TPrA > TPrB and FPrA < FPrB.
ROC Convex Hull (ROCCH)
• ROCCH is determined by the dominant classifiers;
• Classifiers on ROCCH achieve the best accuracy;
• Classifiers below ROCCH are always sub-optimal.
Convex Hull
• Any performance on a line segment connecting two ROC points
can be achieved by randomly choosing between them;
• The classifiers on ROCCH can be combined to form a hybrid.
Iso-Accuracy Lines
• Iso-accuracy line connects ROC points with the same
accuracy A:
• (P*TPr + N*(1–FPr))/(P+N) = A;
• TPr = (A*(P+N)-N)/P + N/P*FPr.
• Iso-accuracy lines have slope N/P.
• Higher iso-accuracy lines are better.
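To make the idea concrete, here is a hedged sketch that slides an iso-accuracy line of slope N/P over a set of ROC points and picks the one with the highest accuracy; the function name best_roc_point is an assumption, and the example points are Classifiers 1 to 3 from the earlier slide.

```python
import numpy as np

def best_roc_point(points, P, N):
    """ROC point with maximal accuracy for class counts P and N.

    Equivalent to finding the point touched by the highest iso-accuracy
    line, which has slope N/P.
    """
    points = np.asarray(points, dtype=float)        # rows of (FPr, TPr)
    acc = (P * points[:, 1] + N * (1.0 - points[:, 0])) / (P + N)
    return points[acc.argmax()], float(acc.max())

# Classifiers 1-3 from the earlier slide as (FPr, TPr), uniform class distribution.
roc_points = [(0.3, 0.4), (0.5, 0.7), (0.2, 0.6)]
print(best_roc_point(roc_points, P=100, N=100))     # Classifier 3 wins with accuracy 0.7
```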
Iso-Accuracy Lines
• For uniform class distribution, C4.5 is optimal and achieves
about 82% accuracy.
Iso-Accuracy Lines
• With four times as many positives as negatives, SVM is
  optimal and achieves about 84% accuracy.
Iso-Accuracy Lines
• With four times as many negatives as positives, CN2 is
  optimal and achieves about 86% accuracy.
Iso-Accuracy Lines
• With less than 9% positives, AlwaysNeg is optimal.
• With less than 11% negatives, AlwaysPos is optimal.
How to Construct ROC Curve for
one Classifier
• Sort the instances according to their Ppos.
• Move a threshold over the sorted instances.
• For each threshold, define a classifier with its confusion matrix.
• Plot the TPr and FPr rates of the classifiers.

    Ppos     True class
    0.99     pos
    0.98     pos
    0.7      neg
    0.6      pos
    0.43     neg

  Confusion matrix, e.g. for a threshold between 0.7 and 0.6:

            Predicted
  True      pos    neg
  pos       2      1
  neg       1      1
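A sketch of the threshold-moving procedure on the five instances above (an illustration only; scikit-learn's roc_curve offers the same functionality).

```python
import numpy as np

def roc_points(y_true, p_pos):
    """TPr/FPr pairs obtained by moving a threshold over the sorted scores."""
    order = np.argsort(-np.asarray(p_pos))
    y = np.asarray(y_true)[order]
    P, N = int((y == 1).sum()), int((y == 0).sum())
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for label in y:                      # lower the threshold one instance at a time
        if label == 1:
            tp += 1
        else:
            fp += 1
        tpr.append(tp / P)
        fpr.append(fp / N)
    return fpr, tpr

# The five instances from the slide (1 = pos, 0 = neg).
print(roc_points([1, 1, 0, 1, 0], [0.99, 0.98, 0.7, 0.6, 0.43]))
```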
ROC for one Classifier
Good separation between the classes, convex curve.
ROC for one Classifier
Reasonable separation between the classes, mostly convex.
ROC for one Classifier
Fairly poor separation between the classes, mostly convex.
ROC for one Classifier
Poor separation between the classes, large and small concavities.
ROC for one Classifier
Random performance.
The AUC Metric
• The area under ROC curve (AUC) assesses the ranking in
terms of separation of the classes.
• AUC estimates the probability that a randomly chosen positive instance
  will be ranked before a randomly chosen negative instance.
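A direct sketch of that interpretation (an illustration only): AUC computed as the fraction of positive/negative pairs in which the positive instance gets the higher score, with ties counted as one half.

```python
import numpy as np

def auc(y_true, p_pos):
    """AUC as the probability that a random positive outranks a random negative."""
    y_true, p_pos = np.asarray(y_true), np.asarray(p_pos)
    pos, neg = p_pos[y_true == 1], p_pos[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

print(auc([1, 1, 0, 1, 0], [0.99, 0.98, 0.7, 0.6, 0.43]))   # 5/6, about 0.83
```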
Note
• To generate ROC curves or Lift charts we
need to use some evaluation methods
considered in this lecture.
• ROC curves and Lift charts can be used for
internal optimization of classifiers.
Summary
• In this lecture we have considered:
  – Metrics for Classifier’s Evaluation
  – Methods for Classifier’s Evaluation
  – Comparing Data Mining Schemes
  – Costs in Data Mining
    • Cost-Sensitive Classification and Learning
    • Lift Charts
    • ROC Curves