Evaluation of Classifiers
– p. 1
Evaluating classifiers
How well can a classifier be expected to perform
on novel data?
Choice of performance measure
How close is the estimated performance to the
true performance?
Comparing Classifiers
– p. 2
Measuring classifier performance
Natural performance measure for classification
problems: error rate or accuracy
Higher accuracy does not necessarily imply better
performance on target task
Implicit assumption: the class distribution among
examples is relatively balanced
Biased in favor of the majority class!
Should be used with caution!
– p. 3
Measuring classifier performance
Consider two-category classification
One method for handling a c-class problem is to
consider c 2-class problems: ωi / not ωi
Fawcett, T. (2003) ROC Graphs: Notes and
Practical Considerations for Researchers
– p. 4
Measuring classifier performance
confusion matrix (also called a contingency table)
TP: Number of True positives
FP: Number of False positives
TN: Number of True Negatives
FN: Number of False Negatives
Accuracy = (TP + TN)/n
(With c classes the confusion matrix becomes a c × c matrix
containing the c correct classifications (the major diagonal
entries) and c² − c possible errors (the off-diagonal entries).)
– p. 5
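To make the counts above concrete, here is a minimal Python sketch (not from the original slides) that tallies TP, FP, TN, FN for a two-class problem and computes accuracy; the label encoding (1 = positive, 0 = negative) and the toy data are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally TP, FP, TN, FN for a two-class problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

# Toy example: 8 labeled instances, 1 = positive class, 0 = negative class
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print((tp + tn) / len(y_true))   # Accuracy = (TP + TN) / n = 6/8 = 0.75
```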
Measuring classifier performance
True positive rate (TPR) (also called sensitivity, hit
rate, and recall)
sensitivity = TP / (TP + FN)
A statistical measure of how well a binary classification
test correctly identifies a condition
Probability of correctly labeling members of the target
class
– p. 6
Measuring classifier performance
False positive rate (FPR) (also called false alarm rate)
FalseAlarm = FP / (TN + FP)
specificity = TN / (TN + FP) = 1 − FalseAlarm
The specificity is a statistical measure of how well a
binary classification test correctly identifies the
negative cases
– p. 7
Measuring classifier performance
Precision (also positive predictive value)
precision = TP / (TP + FP)
Probability that a positive prediction is correct
F-measure (in information retrieval): can be used as a single
measure of performance. The F-measure is the harmonic mean
of precision and recall:
F-measure = (2 × precision × recall) / (precision + recall)
– p. 8
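The measures on the last few slides follow directly from the four counts. A hedged sketch (function and variable names are mine, not from the slides) that assumes none of the denominators is zero:

```python
def binary_rates(tp, fp, tn, fn):
    """Derived measures from the confusion-matrix counts (assumes non-zero denominators)."""
    sensitivity = tp / (tp + fn)        # TPR, hit rate, recall
    false_alarm = fp / (tn + fp)        # FPR
    specificity = tn / (tn + fp)        # = 1 - false_alarm
    precision = tp / (tp + fp)          # positive predictive value
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, false_alarm, specificity, precision, f_measure

print(binary_rates(tp=3, fp=1, tn=3, fn=1))   # counts from the previous sketch
```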
Measuring classifier performance
TP, FP, TN, FN provide the relevant information
No single measure tells the whole story
A classifier with 90% accuracy can be useless if 90% of the
population does not have cancer and the 10% who do
are misclassified by the classifier
Use of multiple measures recommended
Beware of terminological confusion in the literature!
E.g., specificity sometimes refers to precision
When you write: provide the formula in terms of TP etc.
When you read: check the formula in terms of TP etc.
– p. 9
ROC Curve
Receiver Operating Characteristics (ROC) graphs have long
been used in signal detection theory to depict the tradeoff
between hit rates and false alarm rates over a noisy channel
Recent years have seen an increase in the use of ROC
graphs in the machine learning community.
A useful technique for organizing classifiers and visualizing
their performance.
Especially useful for domains with skewed class distribution
and unequal classification error costs.
– p. 10
ROC Space
An ROC curve is a plot of TPR against FPR, which depicts the relative
trade-off between benefits (true positives) and costs (false
positives).
– p. 11
ROC Curve
A discrete classifier produces an (FPR, TPR) pair
corresponding to a single point in ROC space.
Some classifiers, such as a Naive Bayes or a neural
network, naturally yield an instance probability or score, a
numeric value that represents the degree to which an
instance is a member of a class.
Such a ranking or scoring classifier can be used with a
threshold to produce a discrete classifier
Plotting the ROC point for each possible threshold value
results in a curve
– p. 12
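A minimal sketch of how a scoring classifier is turned into an ROC curve: sweep a threshold over the scores, classify everything at or above the threshold as positive, and record one (FPR, TPR) point per threshold. The assumption that higher scores mean "more positive" and the O(n²) loop are simplifications for clarity.

```python
def roc_points(y_true, scores, positive=1):
    """Return (FPR, TPR) points obtained by thresholding the scores."""
    pos = sum(1 for t in y_true if t == positive)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]                       # strictest threshold: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == positive)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t != positive)
        points.append((fp / neg, tp / pos))
    return points                               # most lenient threshold ends at (1.0, 1.0)
```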
ROC Curve
[Figure: two ROC plots comparing classifiers A and B; axes are False Positive rate (x) versus True Positive rate (y).]
– p. 13
Measuring performance-ROC Curve
ROC curves offer a more complete picture of the
performance of the classifier
We often do not know misclassification costs, and class
distributions may change
To compare classifiers we may want to reduce ROC
performance to a single scalar value representing expected
performance.
A common method is to calculate the area under the ROC
curve, abbreviated AUC
AUC(h) > AUC(g): classifier h has better average
performance
– p. 14
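A hedged sketch of reducing ROC performance to a single scalar: the area under the curve computed by the trapezoidal rule over the (FPR, TPR) points produced by the previous sketch. This is one common approximation, not the only way to compute AUC.

```python
def auc(points):
    """Area under the ROC curve via the trapezoidal rule; points are (FPR, TPR) pairs."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Usage: auc(roc_points(y_true, scores)); random guessing gives about 0.5, a perfect ranker 1.0.
```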
Evaluating A Classifier
How well will the classifier we learned perform on novel
data?
We can estimate the performance (e.g., accuracy,
sensitivity) of the classifier using a test data set
Performance on the training data is not a good indicator of
performance on future data
Test set: independent instances that have not been used in
any way to create the classifier
Assumption: both training data and test data are
representative samples of the underlying problem
– p. 15
Evaluating A Classifier
How close is the estimated performance to the true
performance?
Assume the estimated error rate is 25%. How close is this
to the true error rate?
Depends on the amount of test data
There is extensive literature on how to estimate the
performance of a hypothesis from limited samples and how
to assign confidence to estimates – take Stat 430x
Evaluating hypotheses, Chapter 5, Tom Mitchell, Machine
Learning
– p. 16
Estimating Hypothesis Accuracy
General setting
There is some space of possible instances X , with unknown
distribution D that defines the probability of encountering
each instance
Learning task: learn target function f by considering a
space H of possible hypotheses
Each training example is drawn independently according to
D
Choose test sample S of size n according to distribution D
What is the best estimate of the accuracy of a hypothesis h?
What is the error in this estimate?
– p. 17
Estimating Hypothesis Accuracy
The sample error of h with respect to target function f and data
sample S is the proportion of examples h misclassifies
errorS(h) ≡ (1/n) Σ_{x∈S} δ(f(x) ≠ h(x))
where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.
The true error of hypothesis h with respect to target function f
and distribution D is the probability that h will misclassify an
instance drawn at random according to D .
errorD(h) ≡ Pr_{x∈D}[ f(x) ≠ h(x) ]
How well does errorS (h) estimate errorD (h)?
– p. 18
Example
Hypothesis h misclassifies 12 of the 40 examples in S
errorS(h) = 12/40 = .30
What is errorD (h)?
– p. 19
errorS (h) is a Random Variable
errorS (h) is a random variable (i.e., outcome of an experiment)
Compare: Toss a coin n times and it turns up heads r times,
with probability p = P(heads) of each toss turning up heads
(Bernoulli trial)
Randomly draw n samples and h misclassifies r samples,
with probability of misclassification p = errorD(h) (Bernoulli
trial)
errorS(h) = r/n
Probability of observing r misclassified examples:
P(r) = [n! / (r!(n − r)!)] p^r (1 − p)^(n−r)
– p. 20
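For illustration, the binomial probability on this slide written directly in Python (math.comb requires Python 3.8+); the example numbers reuse the earlier slide's 12 errors out of 40 with an assumed true error rate of 0.30.

```python
from math import comb

def prob_r_errors(r, n, p):
    """P(r) = n! / (r! (n - r)!) * p^r * (1 - p)^(n - r)"""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

print(prob_r_errors(12, 40, 0.30))   # probability of exactly 12 misclassifications in 40 draws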
Binomial Probability Distribution
P(X = r) = [n! / (r!(n − r)!)] p^r (1 − p)^(n−r)
Expected, or mean value of X, E[X], is
E[X] ≡ Σ_{i=0}^{n} i·P(i) = np
Variance of X is
Var(X) ≡ E[(X − E[X])²] = np(1 − p)
Standard deviation of X, σ_X, is
σ_X ≡ √( E[(X − E[X])²] ) = √( np(1 − p) )
– p. 21
Bias, Variance
errorS(h) follows a Binomial distribution, with
mean µ_errorS(h) = errorD(h)
standard deviation
σ_errorS(h) = √( errorD(h)(1 − errorD(h)) / n )
           ≈ √( errorS(h)(1 − errorS(h)) / n )
errorS(h) is an unbiased estimator for errorD(h):
bias ≡ E[errorS(h)] − errorD(h) = 0
– p. 22
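A small sketch of the approximate standard deviation of errorS(h), substituting the sample error for the unknown true error as the slide does; the example uses the earlier 12-out-of-40 figures.

```python
from math import sqrt

def std_of_sample_error(sample_error, n):
    """sigma_errorS(h) ≈ sqrt(errorS(h) * (1 - errorS(h)) / n)"""
    return sqrt(sample_error * (1 - sample_error) / n)

print(std_of_sample_error(12 / 40, 40))   # ≈ 0.072 for errorS(h) = .30, n = 40
```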
Confidence Interval
Confidence Interval: one common way to describe the uncertainty
associated with an estimate.
An N% confidence interval for some parameter p is an
interval that is expected with probability N% to contain p
Computation can be quite tedious
Binomial distribution can be closely approximated by a Normal
distribution with the same mean and variance when
np(1 − p) ≥ 5
– p. 23
Normal Probability Distribution
Normal distribution with mean 0, standard deviation 1
[Figure: the standard Normal density on −3 ≤ x ≤ 3.]
p(x) = (1 / √(2πσ²)) · e^(−(1/2)((x − µ)/σ)²)
Expected, or mean value of X, E[X] = µ
Variance of X is Var(X) = σ²
Standard deviation of X, σ_X = σ
– p. 24
Normal Probability Distribution
[Figure: the standard Normal density on −3 ≤ x ≤ 3.]
80% of area (probability) lies in µ ± 1.28σ
N% of area (probability) lies in µ ± zN σ
N%: 50%   68%   80%   90%   95%   98%   99%
zN: 0.67  1.00  1.28  1.64  1.96  2.33  2.58
– p. 25
Confidence Intervals
If S contains n examples, drawn independently of h and
each other, and n ≥ 30 (and p is not too close to 0 or 1) or
np(1 − p) ≥ 5, then
With approximately N% probability, errorS(h) lies in the interval
errorD(h) ± zN √( errorD(h)(1 − errorD(h)) / n )
equivalently, errorD(h) lies in the interval
errorS(h) ± zN √( errorD(h)(1 − errorD(h)) / n )
– p. 26
Confidence Intervals
Approximately,
errorS(h) ± zN √( errorS(h)(1 − errorS(h)) / n )
Or solve for p = errorD(h):
( p̂ + z²/(2n) ± z √( p̂(1 − p̂)/n + z²/(4n²) ) ) / (1 + z²/n)
– p. 27
Confidence Intervals
Examples:
p̂ = 0.75, n = 1000, c = 80% (z = 1.28):
with 80% confidence the true value p lies in [.732, .767]
p̂ = 0.75, n = 100, c = 80% (z = 1.28):
with 80% confidence p lies in [.691, .801]
– p. 28
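A hedged sketch of both intervals: the simple Normal-approximation interval and the solved-for-p form shown two slides earlier. Rounding aside, the printed values should line up with the examples above; z = 1.28 corresponds to 80% confidence.

```python
from math import sqrt

def normal_ci(p_hat, n, z=1.28):
    """errorS(h) +/- z * sqrt(errorS(h)(1 - errorS(h)) / n)"""
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def solved_ci(p_hat, n, z=1.28):
    """Interval obtained by solving for p = errorD(h)."""
    center = p_hat + z * z / (2 * n)
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (center - half) / (1 + z * z / n), (center + half) / (1 + z * z / n)

print(solved_ci(0.75, 1000))   # about (0.732, 0.767)
print(solved_ci(0.75, 100))    # about (0.691, 0.801)
```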
Central Limit Theorem
Central Limit Theorem simplifies confidence interval calculations
Consider a set of independent, identically distributed random
variables Y1 . . .Yn , all governed by an arbitrary distribution with
mean µ and finite variance σ². Define the sample mean
Ȳ ≡ (1/n) Σ_{i=1}^{n} Yi
Central Limit Theorem. As n → ∞, the distribution governing Ȳ
approaches a Normal distribution with mean µ and variance σ²/n.
– p. 29
Central Limit Theorem
Central Limit Theorem implies that whenever we define an
estimator that is the mean of some sample (e.g. errorS (h)), the
distribution governing this estimator can be approximated by a
Normal distribution for sufficiently large n
A common rule of thumb is that we can use the Normal
approximation when n ≥ 30.
– p. 30
Estimating Classifier Performance
Holdout method: use part of the data for training, and the
rest for testing
Usually: one third for testing, the rest for training
Ideally both training set and test set should be large
We may be unlucky – training data or test data may not be
representative
Use stratification: Ensures that each class is represented in
roughly the same proportion as in the entire data set
What to do if the amount of data is limited?
Solution: Run multiple experiments with disjoint training and
test data sets
– p. 31
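A minimal sketch of a stratified holdout split: one third of each class goes to the test set, so the class proportions are preserved. The seed, the split ratio, and working with indices rather than examples are illustrative choices.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=1 / 3, seed=0):
    """Return (train_indices, test_indices) with roughly equal class proportions in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * test_fraction)
        test.extend(indices[:cut])
        train.extend(indices[cut:])
    return train, test
```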
Cross-validation
K-fold cross-validation:
The data set is randomly divided into K disjoint subsets of
equal size, where each part has roughly the same class
distribution
The classifier is trained K times, each time with a different
set held out as a test set.
The estimated error p̂CV is the mean of these K errors
Ron Kohavi, A Study of Cross-Validation and Bootstrap for
Accuracy Estimation and Model Selection, IJCAI 1995.
– p. 32
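A hedged sketch of plain (unstratified, for brevity) K-fold cross-validation as described above; train_and_test is a hypothetical callback standing in for "train the classifier on the training indices and return its error rate on the held-out indices".

```python
import random

def k_fold_cv_error(n, k, train_and_test, seed=0):
    """Estimate p_hat_CV: mean held-out error over K disjoint folds of n instances."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]            # K roughly equal disjoint folds
    errors = []
    for i in range(k):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        errors.append(train_and_test(train_idx, folds[i]))
    return sum(errors) / k
```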
Cross-validation
The standard deviation can be estimated as
σCV = √( p̂CV (1 − p̂CV) / n )
where n is the total number of instances in the dataset
Confidence interval can be computed as before with n being
the total number of instances in the dataset
– p. 33
Leave-one-out approach
K-fold cross validation with K = n where n is the total
number of samples available
n experiments using n-1 samples for training and the
remaining sample for testing
Computationally expensive
Leave-one-out cross-validation does not guarantee the
same class distribution in training and test data!
Extreme case: 50% class 1, 50% class 2
Predict majority class label in the training data
True error 50%; Leave-one-out error estimate 100%!
– p. 34
The bootstrap
The bootstrap uses sampling with replacement to
form the training set
Sample uniformly a dataset of n instances n times
with replacement to form a new dataset of n
instances as the training set
Use the instances from the original dataset that
do not occur in the new training set for testing
Repeat process several times; average the results
– p. 35
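A minimal sketch of one bootstrap round: draw n indices with replacement for training, and test on the instances that were never drawn (the "out-of-bag" instances, roughly 36.8% of the data on average).

```python
import random

def bootstrap_split(n, seed=0):
    """Return (train_indices, test_indices) for one bootstrap replicate of n instances."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]      # sampling with replacement
    drawn = set(train)
    test = [i for i in range(n) if i not in drawn]    # out-of-bag instances
    return train, test
```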
Estimating Classifier Performance
Recommended procedure
Use (stratified) K-fold cross-validation (K = 5 or 10) for
estimating performance measures (accuracy, etc.)
Compute the mean value of each performance estimate, its
standard deviation, and confidence intervals
Report mean values of performance estimates and their
standard deviations or 95% confidence intervals around the
mean
– p. 36
Difference Between Hypotheses
We wish to estimate
d ≡ errorD (h1 ) − errorD (h2 )
Suppose h1 has been tested on a sample S1 of size n1
drawn according to D and
h2 has been tested on a sample S2 of size n2 drawn
according to D
An unbiased estimator
d̂ ≡ errorS1(h1) − errorS2(h2)
– p. 37
Difference Between Hypotheses
An unbiased estimator
d̂ ≡ errorS1(h1) − errorS2(h2)
For large n1 and large n2 the corresponding error estimates
follow a Normal distribution
The difference of two Normally distributed variables is also
Normally distributed
Its variance equals the sum of the variances of the individual
distributions (for independent variables)
– p. 38
Difference Between Hypotheses
σ_d̂ ≈ √( errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2 )
N% of probability mass falls in the interval
d̂ ± zN √( errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2 )
When S1 = S2, the variance is usually smaller than above, and the
confidence interval is still applicable but overly conservative
– p. 39
Hypothesis testing
Is one hypothesis likely to be better than another?
Suppose
n1 = n2 = 100, errorS1 (h1 ) = .30, errorS2 (h2 ) = .20
d̂ = .1, σ_d̂ = .061
Is the error difference statistically significant?
– p. 40
Hypothesis testing
The null hypothesis: the two hypotheses have the same
error rate, d = 0
Significance level of a test (α): The test’s probability of
incorrectly rejecting the null hypothesis
Pr(|d̂| > zN σ_d̂) = α
for a two-sided test with confidence N = 1 − α.
Test statistic
z = |d̂| / σ_d̂
Reject the null hypothesis if the test statistic z > zN
– p. 41
Hypothesis testing
E.g., n1 = n2 = 100, errorS1 (h1 ) = .30, errorS2 (h2 ) = .20
d̂ = .1, σ_d̂ = .061, the test statistic z = 1.64
For significance level α = 0.05, N = 95%, zN = 1.96, there
is not enough evidence to reject the null hypothesis.
– p. 42
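A hedged sketch that reproduces the numbers above: the estimated difference d̂, its approximate standard deviation, and the two-sided z test; z_crit = 1.96 corresponds to α = 0.05.

```python
from math import sqrt

def z_test_difference(e1, n1, e2, n2, z_crit=1.96):
    """Two-sided z test of the null hypothesis d = errorD(h1) - errorD(h2) = 0."""
    d_hat = e1 - e2
    sigma = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    z = abs(d_hat) / sigma
    return d_hat, sigma, z, z > z_crit        # True means: reject the null hypothesis

print(z_test_difference(0.30, 100, 0.20, 100))   # d_hat = 0.1, sigma ≈ 0.061, z ≈ 1.64 -> do not reject
```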
Comparing learning algorithms
Which learning algorithm is better at learning target function f ?
What we’d like to estimate:
E_{S⊂D}[ errorD(LA(S)) − errorD(LB(S)) ]
where L(S) is the hypothesis output by learner L using training
set S
i.e., the expected difference in true error between hypotheses
output by learners LA and LB , when trained using randomly
selected training sets S drawn according to distribution D .
– p. 43
Comparing learning algorithms
But, given limited data D0 , what is a good estimator?
could partition D0 into training set S0 and test set T0 , and
measure
errorT0(LA(S0)) − errorT0(LB(S0))
even better, repeat this many times and average the results
– p. 44
CVed Paired t Test
1. Partition data D0 into k disjoint test sets T1 , T2 , . . . , Tk of
equal size, where this size is at least 30.
2. For i from 1 to k, do
use Ti for test set, and the remaining data for training set Si
Si ← {D0 − Ti }
hA ← LA (Si ), hB ← LB (Si )
δi ← errorTi (hA ) − errorTi (hB )
3. Return the value δ̄, where
δ̄ ≡ (1/k) Σ_{i=1}^{k} δi
– p. 45
Paired t Test
δ̄ is our estimate of the error difference
For large test sets, δi has a Normal distribution with mean d
and variance σ² that we don't know
δ̄ has a Normal distribution with mean d and variance σ²/k if
we pretend the δi's are independent
If we estimate σ² with the sample variance s²_δi, the distribution of
(δ̄ − d) / s_δ̄ is no longer Normal unless k is large
s²_δi = (1/(k − 1)) Σ_{i=1}^{k} (δi − δ̄)²
– p. 46
Paired t Test
(δ̄ − d) / s_δ̄ follows a Student's t-distribution with k − 1 degrees of
freedom, where
s_δ̄ ≡ √( (1/(k(k − 1))) Σ_{i=1}^{k} (δi − δ̄)² )
N% confidence interval estimate:
δ̄ ± t_{N,k−1} s_δ̄
The t-statistic
t = δ̄ / s_δ̄
– p. 47
Performing the test
The null hypothesis: the two learning algorithms have the
same error rate
Fix a significance level (α)
The test’s probability of incorrectly rejecting the null
hypothesis
Two-sided test with confidence N = 1 − α
Look up t_{N,k−1}
Reject the null hypothesis if |t| > t_{N,k−1}.
E.g., for k = 30 trials at the 0.05 significance level, reject if
|t| > 2.045.
– p. 48
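A hedged sketch of the paired t statistic from the per-fold differences δi = errorTi(hA) − errorTi(hB); the critical value t_{N,k−1} still has to be looked up (e.g., 2.045 for k = 30 at the 0.05 level, as on the slide), and the sketch assumes the δi are not all identical.

```python
from math import sqrt

def paired_t(deltas):
    """Return (delta_bar, s_delta_bar, t) for per-fold error differences delta_i."""
    k = len(deltas)
    delta_bar = sum(deltas) / k
    s_delta_bar = sqrt(sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1)))
    return delta_bar, s_delta_bar, delta_bar / s_delta_bar

# Reject the null hypothesis (equal error rates) if |t| > t_{N, k-1},
# e.g. |t| > 2.045 for k = 30 folds at the 0.05 significance level.
```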
Performance evaluation summary
Rigorous statistical evaluation is extremely important in
experimental computer science in general and machine
learning in particular
How good is a learned hypothesis?
How close is the estimated performance to the true
performance?
Is one hypothesis better than another?
Is one learning algorithm better than another on a particular
learning task?
– p. 49