EVALUATION
Class 15
CSC 600: Data Mining
Today…

Evaluation Metrics:
 Misclassification Rate
 Confusion Matrix
 Precision, Recall, F1

Evaluation Experiments:
 Cross-Validation
 Bootstrap
 Out-of-time Sampling

Evaluation Metrics


How should a classifier be quantitatively evaluated?
Misclassification Rate:

 misclassification rate = (# of incorrect predictions) / (total # of predictions)
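
As a quick illustration (not from the slides), a minimal Python sketch of the misclassification rate; the label lists and function name are hypothetical:

```python
def misclassification_rate(actual, predicted):
    """Fraction of predictions that do not match the actual class labels."""
    incorrect = sum(1 for a, p in zip(actual, predicted) if a != p)
    return incorrect / len(actual)

# Hypothetical labels for five transactions
actual    = ["good", "fraud", "good", "good", "fraud"]
predicted = ["good", "good",  "good", "good", "fraud"]
print(misclassification_rate(actual, predicted))  # 0.2 (1 of 5 wrong)
```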
Evaluation Metrics

Issues with misclassification rate?

If accuracy is used:
Example: in the credit card fraud domain, there are many more legitimate
transactions than fraudulent transactions.
 Presume only 1% of transactions are fraudulent.

Then:
A classifier that predicts every transaction as GOOD would have 99% accuracy!
 Seems great, but it's not…

Alternative Metrics

Notation (for binary classification):
+ (positive class)
- (negative class)

Class imbalance:
 rare class vs. majority class
For binary problems, there are 4 possible outcomes:
Confusion Matrix: Counts
                     Predicted +              Predicted -
Actual +             f++ (True Positive)      f+- (False Negative)
Actual -             f-+ (False Positive)     f-- (True Negative)
True Positive (TP): number of positive examples correctly predicted by model
False Negative (FN): number of positive examples wrongly predicted as negative
False Positive (FP): number of negative examples wrongly predicted as positive
True Negative (TN): number of negative examples correctly predicted
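
A minimal sketch (not from the slides) of tallying these four counts from label lists; the function and variable names below are hypothetical:

```python
def confusion_counts(actual, predicted, positive="+"):
    """Return (TP, FN, FP, TN) counts for a binary classification problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive: tp += 1
            else:             fn += 1
        else:
            if p == positive: fp += 1
            else:             tn += 1
    return tp, fn, fp, tn

actual    = ["+", "+", "-", "-", "-", "+"]
predicted = ["+", "-", "-", "+", "-", "+"]
print(confusion_counts(actual, predicted))  # (2, 1, 1, 2)
```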
Confusion Matrix: Percentages
                     Predicted +              Predicted -
Actual +             f++ (True Positive)      f+- (False Negative)
Actual -             f-+ (False Positive)     f-- (True Negative)

True Positive Rate (TPR): fraction of positive examples correctly predicted by the model
 TPR = TP / (TP + FN)    (also referred to as sensitivity)
False Negative Rate (FNR): fraction of positive examples wrongly predicted as negative
 FNR = FN / (TP + FN)
False Positive Rate (FPR): fraction of negative examples wrongly predicted as positive
 FPR = FP / (TN + FP)
True Negative Rate (TNR): fraction of negative examples correctly predicted
 TNR = TN / (TN + FP)    (also referred to as specificity)
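
A small sketch (not from the slides) deriving these four rates from the confusion-matrix counts; the counts used are hypothetical:

```python
def confusion_rates(tp, fn, fp, tn):
    """Compute TPR, FNR, FPR, TNR from confusion-matrix counts."""
    tpr = tp / (tp + fn)   # sensitivity
    fnr = fn / (tp + fn)
    fpr = fp / (tn + fp)
    tnr = tn / (tn + fp)   # specificity
    return tpr, fnr, fpr, tnr

# Hypothetical counts: 40 TP, 10 FN, 5 FP, 45 TN
print(confusion_rates(40, 10, 5, 45))  # (0.8, 0.2, 0.1, 0.9)
```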
Precision and Recall

Widely used metrics when successful detection of one class is considered
more significant than detection of the other classes.

 Precision, p = TP / (TP + FP)
 Recall, r = TP / (TP + FN)
Precision


Precision: the fraction of records that are actually positive among the set of
records that the classifier predicted as positive.
Interpretation: the higher the precision, the lower the number of false
positive errors.

 Precision, p = TP / (TP + FP)
 Recall, r = TP / (TP + FN)
Recall


Recall: the fraction of positive records that are correctly predicted by the
classifier.
Interpretation: the higher the recall, the fewer positive records are
misclassified as the negative class.

 Precision, p = TP / (TP + FP)
 Recall, r = TP / (TP + FN)
Baseline Models


Often naïve models.
Example #1: classify every instance as positive (the rare class).

 What is accuracy? Precision? Recall?
 Precision = poor; Recall = 100%

Baseline models often maximize one metric (precision or recall) but not the other.
Key challenge: building a model that performs well on both metrics.
F1 Measure

F-measure: combines precision and recall into a single metric, using the
harmonic mean.
 The harmonic mean of two numbers tends to be closer to the smaller of
 the two numbers…
 …so the only way F1 is high is for both precision and recall to be high.

 F1 = 2rp / (r + p) = 2·TP / (2·TP + FP + FN)
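
A minimal sketch (not from the slides) of computing precision, recall, and F1 from confusion-matrix counts; the counts used are hypothetical:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 (harmonic mean of precision and recall)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)   # equivalently 2*TP / (2*TP + FP + FN)
    return p, r, f1

# Hypothetical counts
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.75 0.6 0.667
```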
Evaluation Experiments
[Figure: the standard model-building workflow. A training set of 10 labeled records
(Tid, Attrib1, Attrib2, Attrib3, Class) is fed to a learning algorithm, which induces
a model; the model is then applied to a test/validation set of 5 unlabeled records to
deduce their class labels.]
Evaluation

How to divide the dataset into a training set, a validation set, and a
testing set?

 Sample (randomly partition)

Problems?
1. Requires enough data to create suitably large sets
2. A "lucky split" is possible:
    difficult instances were chosen for the training set,
    easy instances put into the testing set
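
One way to sketch such a random partition in plain Python (not from the slides; the split fractions and names are illustrative):

```python
import random

def random_split(records, train_frac=0.6, valid_frac=0.2, seed=42):
    """Randomly partition records into training, validation, and test sets."""
    shuffled = records[:]                 # copy so the original order is preserved
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test  = shuffled[n_train + n_valid:]
    return train, valid, test

train, valid, test = random_split(list(range(10)))
print(len(train), len(valid), len(test))  # 6 2 2
```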
Cross Validation



Widely used alternative to a single training set + validation set.
Multiple evaluations using different portions of the data for training
and validating/testing.

k-fold Cross Validation:
 k = number of folds (integer)
k-Fold Cross Validation


Available data is divided into k equal-sized folds (partitions).
k separate evaluation experiments:
 Training set: k-1 folds
 Test set: 1 fold

Repeat k-1 more times, each time using a different fold for the test set.
10-fold Cross Validation
Each fold holds 1/10th of the data: 9 folds (90%) are used for training and
the remaining fold (10%) is used for testing.
• Repeat k times
• Average results
• Each instance will be used once in testing
Cross-Validation


More computationally expensive than a single training/test split.

k-fold Cross-Validation:
 k = number of folds (integer)
 k = 10 is common
Leave-One-Out Cross-Validation



Extreme case: k = n, where n is the number of observations in the
training+validation set.
Useful when the amount of data available is too small to allow big enough
training sets in a k-fold cross-validation.
Significantly more computationally expensive.

Leave-One-Out Cross-Validation:
 n folds (one per instance; n instances in total)
 Training set: n-1 instances
 Testing set: 1 single instance
Cross-Validation Error Estimate
 CV_k = (1/k) * Σ_{i=1..k} ErrorRate_i
Average the error rate for each fold.
In leave-one-out cross-validation, since each test contains only one record,
the variance of the estimated performance will be high. (Usually either
100% or 0%.)
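
A minimal sketch (not from the slides) of k-fold cross-validation with the averaged error estimate; the fold construction and the train_and_error evaluation hook are hypothetical stand-ins for a real learner:

```python
import random

def k_fold_cv(records, k, train_and_error, seed=0):
    """Average error rate over k folds.
    train_and_error(train_idx, test_idx) must return the error rate for one fold."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]            # k roughly equal-sized folds
    error_rates = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        error_rates.append(train_and_error(train_idx, test_idx))
    return sum(error_rates) / k                          # CV_k = (1/k) * sum of fold error rates

# Dummy evaluation hook (a real learner would train on train_idx and score test_idx)
print(round(k_fold_cv(list(range(100)), k=10, train_and_error=lambda tr, te: 0.10), 3))  # 0.1
```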
ε0 Bootstrapping


Also called the .632 bootstrap.
Preferred over cross-validation approaches in contexts with very small
datasets (< 300 instances).
Performs multiple evaluation experiments using slightly different training
and testing sets:
 Repeat k times
 Typically, k is set greater than 200
 Many more folds than with k-fold cross-validation

[Figure: bootstrap iterations 1 through k, each drawing its own training and test sets.]
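
A minimal sketch (not from the slides) of one style of bootstrap evaluation: each iteration samples a training set with replacement and tests on the records left out of that sample; the names and the evaluation hook are hypothetical:

```python
import random

def bootstrap_evaluate(records, k, train_and_error, seed=0):
    """Average error over k bootstrap iterations: train on a sample drawn with
    replacement, test on the records left out of that sample."""
    rng = random.Random(seed)
    n = len(records)
    error_rates = []
    for _ in range(k):
        train_idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        chosen = set(train_idx)
        test_idx = [i for i in range(n) if i not in chosen]
        error_rates.append(train_and_error(train_idx, test_idx))
    return sum(error_rates) / k

# Dummy hook returning a fixed 15% error per iteration, for illustration only
print(round(bootstrap_evaluate(list(range(50)), k=200,
                               train_and_error=lambda tr, te: 0.15), 3))  # 0.15
```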
Out-of-time Sampling

Some domains have a time dimension:
 chronological structure in the data.
One time span is used for the training set,
 another time span for the test set.
Out-of-time Sampling
[Figure: timeline with an earlier span used as the training set and a later
span used as the test set.]

Important that the out-of-time sample uses time spans large enough to take
into account any cyclical behavior patterns.
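
A small sketch (not from the slides) of an out-of-time split: records carry a timestamp, everything before a cutoff goes to training and everything at or after it goes to testing; the field names and cutoff are hypothetical:

```python
def out_of_time_split(records, cutoff, time_key=lambda r: r["timestamp"]):
    """Training set: records before the cutoff; test set: records at/after it."""
    train = [r for r in records if time_key(r) < cutoff]
    test  = [r for r in records if time_key(r) >= cutoff]
    return train, test

# Hypothetical records keyed by month number
records = [{"timestamp": m, "label": "+" if m % 3 == 0 else "-"} for m in range(1, 13)]
train, test = out_of_time_split(records, cutoff=10)
print(len(train), len(test))  # 9 3
```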
Average Class Accuracy
Two models. Which is better?
We're really interested in identifying the churn customers.
 Model 1 Accuracy: 91%
 Model 2 Accuracy: 78%
Note the imbalanced test set.
Average Class Accuracy


Use average class accuracy instead of classification accuracy:
the average of the recall for each possible level of the target class.

 average class accuracy = (1 / |levels(t)|) * Σ_{l ∈ levels(t)} recall_l
Average Class Accuracy
Two models. Which is better?
We're really interested in identifying the churn customers.
 Model 1: Accuracy: 91%; Average Class Accuracy: 55%
 Model 2: Accuracy: 78%; Average Class Accuracy: 79%
Average Class Accuracy (Harmonic Mean)

Average class accuracy can be computed using the harmonic mean instead
of the arithmetic mean.

 average class accuracy_HM = 1 / ( (1 / |levels(t)|) * Σ_{l ∈ levels(t)} (1 / recall_l) )

• Arithmetic means are susceptible to the influence of large outliers.
• The harmonic mean results in a more pessimistic view of model performance
than the arithmetic mean.
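
A small sketch (not from the slides) of both versions of average class accuracy, computed from per-level recalls; the recall values used are hypothetical:

```python
def average_class_accuracy(recalls, harmonic=False):
    """Average per-level recalls with the arithmetic or the harmonic mean."""
    n = len(recalls)
    if harmonic:
        return 1 / ((1 / n) * sum(1 / r for r in recalls))
    return sum(recalls) / n

# Hypothetical per-level recalls (e.g. non-churn vs. churn)
recalls = [0.95, 0.15]
print(round(average_class_accuracy(recalls), 3))                 # 0.55
print(round(average_class_accuracy(recalls, harmonic=True), 3))  # 0.259
```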
Measuring Profit and Loss

Should all cells in a confusion matrix have the same value?
                     Predicted +              Predicted -
Actual +             f++ (True Positive)      f+- (False Negative)
Actual -             f-+ (False Positive)     f-- (True Negative)
Not always correct to treat all outcomes equally.
Measuring Profit and Loss




Small cost incurred by company on false positives.
Large cost incurred by company on false negatives.
No cost on true positives and true negatives.
Take into account the costs of different outcomes…
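
As an illustration (not from the slides), a sketch of scoring a model with a profit matrix: each confusion-matrix count is weighted by the profit or loss of that outcome; all values below are hypothetical:

```python
def total_profit(tp, fn, fp, tn, profit_matrix):
    """Weight each confusion-matrix count by its profit/loss and sum."""
    return (tp * profit_matrix["TP"] + fn * profit_matrix["FN"]
            + fp * profit_matrix["FP"] + tn * profit_matrix["TN"])

# Hypothetical profits/losses per outcome (e.g. fraud detection)
profits = {"TP": 0, "FN": -100, "FP": -5, "TN": 0}
print(total_profit(tp=40, fn=10, fp=30, tn=920, profit_matrix=profits))  # -1150
```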
Measuring Profit and Loss

Calculate a profit matrix: the profit or loss from each outcome in the
confusion matrix.
 Used in overall model evaluation.

Example:
 Average class accuracy_HM = 83.824%
 Average class accuracy_HM = 80.761%
Measuring Profit and Loss
References






Fundamentals of Machine Learning for Predictive Data Analytics, 1st edition, Kelleher et al.
Data Science from Scratch, 1st edition, Grus
Data Mining and Business Analytics in R, 1st edition, Ledolter
An Introduction to Statistical Learning, 1st edition, James et al.
Discovering Knowledge in Data, 2nd edition, Larose et al.
Introduction to Data Mining, 1st edition, Tan et al.