Evaluation of Learning Models

Literature:
- T. Mitchell, Machine Learning, chapter 5
- I.H. Witten and E. Frank, Data Mining, chapter 5

Fayyad’s KDD Methodology

[Figure: Fayyad’s KDD process — data → Selection → Target data → Preprocessing & cleaning → Processed data → Transformation & feature selection → Transformed data → Data Mining → Patterns → Interpretation / Evaluation → Knowledge]

Contents

- Estimation of errors for one hypothesis
- Comparison of hypotheses
- Comparison of learning models
- Practical aspects

Two definitions of error

- The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:

  $\mathrm{error}_D(h) \equiv \Pr_{x \in D}[f(x) \neq h(x)]$

Two definitions of error (2)

- The sample error of hypothesis h with respect to target function f and data sample S is the proportion of examples h misclassifies:

  $\mathrm{error}_S(h) \equiv \frac{1}{n} \sum_{x \in S} \delta(f(x) \neq h(x))$

  where $\delta(f(x) \neq h(x))$ is 1 if $f(x) \neq h(x)$ and 0 otherwise.

Two definitions of error (3)

- How well does errorS(h) estimate errorD(h)?

Problems estimating error

1. Bias: If S is the training set, errorS(h) is optimistically biased:

   $\mathrm{bias} \equiv E[\mathrm{error}_S(h)] - \mathrm{error}_D(h)$

   For an unbiased estimate, h and S must be chosen independently.

2. Variance: Even with an unbiased S, errorS(h) may still vary from errorD(h).

Example

- Hypothesis h misclassifies 12 of the 40 examples in S:

  $\mathrm{error}_S(h) = \frac{12}{40} = 0.30$

- What is errorD(h)?

Estimators

- Experiment:
  1. choose sample S of size n according to distribution D
  2. measure errorS(h)
- errorS(h) is a random variable (i.e., the result of an experiment)
- errorS(h) is an unbiased estimator for errorD(h)
- Given an observed errorS(h), what can we conclude about errorD(h)?

Confidence intervals

- If
  - S contains n examples, drawn independently of h and of each other
  - n ≥ 30
- Then
  - with approximately 95% probability, errorD(h) lies in the interval

    $\mathrm{error}_S(h) \pm 1.96 \sqrt{\frac{\mathrm{error}_S(h)\,(1-\mathrm{error}_S(h))}{n}}$

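A minimal Python sketch of this interval, applied to the 12-out-of-40 example above; the helper name confidence_interval and its defaults are illustrative choices, not from the slides.

    import math

    # Approximate N% confidence interval for errorD(h), given errorS(h) measured
    # on n examples; z = 1.96 gives ~95% confidence (other z_N values follow).
    def confidence_interval(error_s, n, z=1.96):
        margin = z * math.sqrt(error_s * (1 - error_s) / n)
        return error_s - margin, error_s + margin

    error_s = 12 / 40                          # observed sample error = 0.30
    print(confidence_interval(error_s, n=40))  # roughly (0.158, 0.442)
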
Confidence intervals (2)

- If
  - S contains n examples, drawn independently of h and of each other
  - n ≥ 30
- Then
  - with approximately N% probability, errorD(h) lies in the interval

    $\mathrm{error}_S(h) \pm z_N \sqrt{\frac{\mathrm{error}_S(h)\,(1-\mathrm{error}_S(h))}{n}}$

    where

    N%:  50%   68%   80%   90%   95%   98%   99%
    zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58

errorS(h) is a random variable

- Rerun the experiment with a different randomly drawn S (of size n)
- Probability of observing r misclassified examples:

  $P(r) = \frac{n!}{r!\,(n-r)!}\,\mathrm{error}_D(h)^{r}\,(1-\mathrm{error}_D(h))^{\,n-r}$

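A small sketch of this binomial probability in Python; binomial_prob is a hypothetical helper, and the numbers reuse the 12-out-of-40 example from earlier.

    from math import comb

    # Probability of observing exactly r misclassified examples out of n,
    # when the true error rate is p = errorD(h).
    def binomial_prob(r, n, p):
        return comb(n, r) * p**r * (1 - p)**(n - r)

    # E.g. true error 0.30 and a sample of 40: chance of seeing exactly 12 errors.
    print(binomial_prob(12, 40, 0.30))   # ~0.14
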
Binomial probability distribution

- Probability P(r) of r heads in n coin flips, if p = Pr(heads):

  $P(r) = \frac{n!}{r!\,(n-r)!}\, p^{r} (1-p)^{\,n-r}$

  [Figure: binomial distribution for n = 10 and p = 0.3]

Binomial probability distribution (2)

- Expected, or mean, value of X:

  $E[X] = \sum_{i=0}^{n} i\,P(i) = np$

- Variance of X:

  $\mathrm{Var}(X) = E[(X - E[X])^2] = np(1-p)$

- Standard deviation of X:

  $\sigma_X = \sqrt{E[(X - E[X])^2]} = \sqrt{np(1-p)}$

Normal distribution approximates binomial

- errorS(h) follows a binomial distribution, with
  - mean $\mu_{\mathrm{error}_S(h)} = \mathrm{error}_D(h)$
  - standard deviation $\sigma_{\mathrm{error}_S(h)} = \sqrt{\frac{\mathrm{error}_D(h)\,(1-\mathrm{error}_D(h))}{n}}$
- Approximate this by a Normal distribution with
  - mean $\mu_{\mathrm{error}_S(h)} = \mathrm{error}_D(h)$
  - standard deviation $\sigma_{\mathrm{error}_S(h)} = \sqrt{\frac{\mathrm{error}_D(h)\,(1-\mathrm{error}_D(h))}{n}}$

Normal probability distribution

- Density:

  $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$

- The probability that X will fall into the interval (a, b) is given by $\int_a^b p(x)\,dx$
- Expected, or mean, value of X: $E[X] = \mu$
- Variance of X: $\mathrm{Var}(X) = \sigma^2$
- Standard deviation of X: $\sigma_X = \sigma$

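A minimal sketch of the density and of an interval probability in Python, using the standard library's error function instead of numerical integration; the function names are illustrative.

    import math

    # Normal density and the probability of falling in (a, b).
    def normal_pdf(x, mu=0.0, sigma=1.0):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    def normal_prob(a, b, mu=0.0, sigma=1.0):
        cdf = lambda x: 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))
        return cdf(b) - cdf(a)

    print(normal_prob(-1.28, 1.28))   # ~0.80, matching the z_N table on the next slide
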
Normal probability distribution (2)

- 80% of the area (probability) lies in $\mu \pm 1.28\sigma$
- N% of the area (probability) lies in $\mu \pm z_N \sigma$

  N%:  50%   68%   80%   90%   95%   98%   99%
  zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58

Confidence intervals, more correctly

- If
  - S contains n examples, drawn independently of h and of each other
  - n ≥ 30
- Then
  - with approximately 95% probability, errorS(h) lies in the interval

    $\mathrm{error}_D(h) \pm 1.96 \sqrt{\frac{\mathrm{error}_D(h)\,(1-\mathrm{error}_D(h))}{n}}$

  - and errorD(h) approximately lies in the interval

    $\mathrm{error}_S(h) \pm 1.96 \sqrt{\frac{\mathrm{error}_S(h)\,(1-\mathrm{error}_S(h))}{n}}$

Central Limit Theorem

- Consider a set of independent, identically distributed random variables Y1 ... Yn, all governed by an arbitrary probability distribution with mean $\mu$ and finite variance $\sigma^2$. Define the sample mean

  $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$

- Central Limit Theorem: As $n \to \infty$, the distribution governing $\bar{Y}$ approaches a Normal distribution with mean $\mu$ and variance $\sigma^2 / n$.

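A quick simulation sketch of the theorem in Python; the choice of Uniform(0, 1) variables, the sample size n = 50, and the number of trials are illustrative only.

    import random
    import statistics

    # Sample means of n Uniform(0,1) variables: by the CLT they are approximately
    # Normal with mean 0.5 and variance (1/12)/n.
    random.seed(0)
    n, trials = 50, 10_000
    means = [statistics.mean(random.random() for _ in range(n)) for _ in range(trials)]
    print(statistics.mean(means))      # ~0.5
    print(statistics.variance(means))  # ~0.00167, i.e. (1/12)/50
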
Difference between hypotheses

- Test h1 on sample S1, test h2 on S2
1. Pick the parameter to estimate:

   $d \equiv \mathrm{error}_D(h_1) - \mathrm{error}_D(h_2)$

2. Choose an estimator:

   $\hat{d} \equiv \mathrm{error}_{S_1}(h_1) - \mathrm{error}_{S_2}(h_2)$

3. Determine the probability distribution that governs the estimator:

   $\sigma_{\hat{d}} \approx \sqrt{\frac{\mathrm{error}_{S_1}(h_1)\,(1-\mathrm{error}_{S_1}(h_1))}{n_1} + \frac{\mathrm{error}_{S_2}(h_2)\,(1-\mathrm{error}_{S_2}(h_2))}{n_2}}$

Difference between hypotheses (2)

4. Find the interval (L, U) such that N% of the probability mass falls in the interval:

   $\hat{d} \pm z_N \sqrt{\frac{\mathrm{error}_{S_1}(h_1)\,(1-\mathrm{error}_{S_1}(h_1))}{n_1} + \frac{\mathrm{error}_{S_2}(h_2)\,(1-\mathrm{error}_{S_2}(h_2))}{n_2}}$

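A minimal sketch of this interval in Python; the function name and the example error rates and sample sizes are made up for illustration.

    import math

    # N% confidence interval for d = errorD(h1) - errorD(h2), from sample errors
    # measured on independent test sets of sizes n1 and n2.
    def diff_interval(err1, n1, err2, n2, z=1.96):
        d_hat = err1 - err2
        sigma = math.sqrt(err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2)
        return d_hat - z * sigma, d_hat + z * sigma

    print(diff_interval(0.30, 100, 0.20, 100))   # interval centred on d_hat = 0.10
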
Paired t test to compare hA, hB

1. Partition the data into k disjoint test sets T1, T2, ..., Tk of equal size, where this size is at least 30.
2. For i from 1 to k, do

   $\delta_i \leftarrow \mathrm{error}_{T_i}(h_A) - \mathrm{error}_{T_i}(h_B)$

3. Return the value $\bar{\delta}$, where

   $\bar{\delta} \equiv \frac{1}{k} \sum_{i=1}^{k} \delta_i$

Paired t test to compare hA, hB (2)

- N% confidence interval estimate for d:

  $\bar{\delta} \pm t_{N,\,k-1}\, s_{\bar{\delta}}$

  where

  $s_{\bar{\delta}} \equiv \sqrt{\frac{1}{k(k-1)} \sum_{i=1}^{k} \left(\delta_i - \bar{\delta}\right)^2}$

- Note that $\delta_i$ is approximately Normally distributed

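A small Python sketch of this paired procedure; the t value is hard-coded for the assumed case k = 10 folds at 95% confidence (t ≈ 2.262), and the delta values are made-up example numbers.

    import math

    # delta_i is the per-fold error difference from the previous slide; the
    # interval uses the t value for k-1 degrees of freedom.
    def paired_interval(deltas, t_value=2.262):
        k = len(deltas)
        mean = sum(deltas) / k
        s = math.sqrt(sum((d - mean) ** 2 for d in deltas) / (k * (k - 1)))
        return mean - t_value * s, mean + t_value * s

    deltas = [0.02, 0.01, 0.03, -0.01, 0.02, 0.00, 0.04, 0.01, 0.02, 0.01]  # example values
    print(paired_interval(deltas))
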
Comparing learning algorithms LA and LB

- What we’d like to estimate:

  $E_{S \subset D}\left[\mathrm{error}_D(L_A(S)) - \mathrm{error}_D(L_B(S))\right]$

  where L(S) is the hypothesis output by learner L using training set S
- I.e., the expected difference in true error between the hypotheses output by learners LA and LB, when trained using randomly selected training sets S drawn according to distribution D.

Comparing learning algorithms LA and LB (2)

- But, given limited data D0, what is a good estimator?
  - We could partition D0 into training set S0 and test set T0, and measure

    $\mathrm{error}_{T_0}(L_A(S_0)) - \mathrm{error}_{T_0}(L_B(S_0))$

  - Even better: repeat this many times and average the results

Comparing learning algorithms LA and LB (3): k-fold cross-validation

1. Partition the data D0 into k disjoint test sets T1, T2, ..., Tk of equal size, where this size is at least 30.
2. For i from 1 to k: use Ti for the test set, and the remaining data for the training set Si.
3. Return the average of the errors on the test sets.

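A minimal sketch of this k-fold comparison in Python; learner_a, learner_b, and error are placeholder functions (train a classifier, measure its error on a test set), not part of the slides.

    import random

    # k-fold comparison of two learners; returns the average per-fold
    # error difference between learner_a and learner_b.
    def k_fold_difference(data, k, learner_a, learner_b, error):
        data = list(data)
        random.shuffle(data)
        folds = [data[i::k] for i in range(k)]                 # k disjoint test sets
        deltas = []
        for i in range(k):
            test = folds[i]
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            deltas.append(error(learner_a(train), test) - error(learner_b(train), test))
        return sum(deltas) / k                                 # average error difference
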
Practical Aspects

A note on parameter tuning

- It is important that the test data is not used in any way to create the classifier
- Some learning schemes operate in two stages:
  - Stage 1: builds the basic structure
  - Stage 2: optimizes parameter settings
- The test data can’t be used for parameter tuning!
- The proper procedure uses three sets: training data, validation data, and test data
- The validation data is used to optimize parameters

Holdout estimation, stratification

- What shall we do if the amount of data is limited?
- The holdout method reserves a certain amount for testing and uses the remainder for training
  - Usually: one third for testing, the rest for training
- Problem: the samples might not be representative
  - Example: a class might be missing in the test data
- An advanced version uses stratification
  - Ensures that each class is represented with approximately equal proportions in both subsets

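A small sketch of a stratified holdout split in Python; it assumes data is a list of (features, label) pairs, and the function name and defaults are illustrative.

    import random
    from collections import defaultdict

    # Sample the same fraction from every class so that both subsets keep
    # roughly the class proportions of the full data set.
    def stratified_holdout(data, test_fraction=1/3, seed=0):
        random.seed(seed)
        by_class = defaultdict(list)
        for features, label in data:
            by_class[label].append((features, label))
        train, test = [], []
        for examples in by_class.values():
            random.shuffle(examples)
            cut = int(len(examples) * test_fraction)
            test.extend(examples[:cut])
            train.extend(examples[cut:])
        return train, test
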
More on cross-validation

- Standard method for evaluation: stratified ten-fold cross-validation
- Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate
  - There is also some theoretical evidence for this
- Stratification reduces the estimate’s variance
- Even better: repeated stratified cross-validation
  - E.g. ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance)

Issues in evaluation

- Statistical reliability of estimated differences in performance
- Choice of performance measure
  - Number of correct classifications
  - Accuracy of probability estimates
  - Error in numeric predictions
- Costs assigned to different types of errors
  - Many practical applications involve costs

Counting the costs

- In practice, different types of classification errors often incur different costs
- Examples:
  - Predicting when cows are in heat (“in estrus”)
    - “Not in estrus” is correct 97% of the time
  - Loan decisions
  - Oil-slick detection
  - Fault diagnosis
  - Promotional mailing

Taking costs into account

- The confusion matrix:

                       predicted class
                       yes               no
  actual      yes      true positive     false negative
  class       no       false positive    true negative

- There are many other types of costs!
  - E.g.: costs of collecting training data

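A minimal Python sketch that tallies the four cells of such a matrix from "yes"/"no" labels; confusion_counts is a hypothetical helper name.

    # Count true/false positives and negatives from actual and predicted labels.
    def confusion_counts(actual, predicted):
        cells = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
        for a, p in zip(actual, predicted):
            if a == "yes":
                cells["TP" if p == "yes" else "FN"] += 1
            else:
                cells["FP" if p == "yes" else "TN"] += 1
        return cells

    print(confusion_counts(["yes", "yes", "no", "no"], ["yes", "no", "yes", "no"]))
    # {'TP': 1, 'FN': 1, 'FP': 1, 'TN': 1}
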
Lift charts

- In practice, costs are rarely known
- Decisions are usually made by comparing possible scenarios
- Example: promotional mailout
  - Situation 1: classifier predicts that 0.1% of all households will respond
  - Situation 2: classifier predicts that 0.4% of the 10000 most promising households will respond
- A lift chart allows for a visual comparison

Generating a lift chart

- Instances are sorted according to their predicted probability of being a true positive:

  Rank   Predicted probability   Actual class
  1      0.95                    Yes
  2      0.93                    Yes
  3      0.93                    No
  4      0.88                    Yes
  ...    ...                     ...

- In the lift chart, the x axis is the sample size and the y axis is the number of true positives

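A small sketch of how the chart's points could be computed in Python, using the four ranked instances from the table above; lift_points is an illustrative helper name.

    # ranked is a list of (predicted_probability, actual_class) pairs;
    # x is the number of instances taken from the top of the ranking,
    # y the cumulative number of true positives.
    def lift_points(ranked):
        ranked = sorted(ranked, key=lambda pair: pair[0], reverse=True)
        points, positives = [], 0
        for size, (_, actual) in enumerate(ranked, start=1):
            positives += (actual == "Yes")
            points.append((size, positives))
        return points

    print(lift_points([(0.95, "Yes"), (0.93, "Yes"), (0.93, "No"), (0.88, "Yes")]))
    # [(1, 1), (2, 2), (3, 2), (4, 3)]
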
A hypothetical lift chart
Summary of measures
Model selection criteria

- Model selection criteria attempt to find a good compromise between:
  A. The complexity of a model
  B. Its prediction accuracy on the training data
- Reasoning: a good model is a simple model that achieves high accuracy on the given data
- Also known as Occam’s Razor: the best theory is the smallest one that describes all the facts

Warning

- Suppose you are gathering hypotheses that each have a probability of 95% of having an error level below 10%
- What if you have found 100 hypotheses satisfying this condition?
- Then the probability that all of them have an error below 10% is equal to (0.95)^100 ≈ 0.006, i.e. about 0.6%. So, the probability of having at least one hypothesis with an error above 10% is about 99.4%!

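The arithmetic behind this warning, as a one-line Python check:

    p_all_good = 0.95 ** 100
    print(p_all_good)       # ~0.006: chance that all 100 hypotheses really have error below 10%
    print(1 - p_all_good)   # ~0.994: chance that at least one has error above 10%
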