Experimental Evaluation
• In experimental Machine Learning we evaluate the accuracy of a hypothesis empirically. This raises a few important methodological questions:
• Given the observed accuracy of the hypothesis over a limited sample of data, how well does it estimate its accuracy over additional examples?
  (Estimating Hypothesis Accuracy)
• Given that one hypothesis outperforms another over some sample of data, how probable is it that it is more accurate in general?
  (Comparing Classifiers / Learning Algorithms)
• When data is limited, what is the best way to use this data to both learn the hypothesis and estimate its accuracy?
  (Statistical Problems: Parameter Estimation and Hypothesis Testing)
Estimating Hypothesis Accuracy
• Given a hypothesis h and a data sample containing n examples drawn at random according to some distribution D, what is the best estimate of the accuracy of h over future instances drawn from the same distribution?
  PAC: given a sample drawn according to D, we want to make sure that we will be okay on a new sample from D: with high confidence (1 - δ) we will be ε-accurate.
  Here: we observe some accuracy and want to know whether it is typical.
  Note the difference from the (worst case) PAC learning question. Here we are interested in a statistical estimation problem.
• The problem is to estimate the proportion of a population that exhibits some property, given the observed proportion over some random sample of the population.
Estimating Hypothesis Accuracy
• The property we are interested in is (for some fixed target function f)
  The True Error:    error_D(h) ≡ Pr_{x~D}[ h(x) ≠ f(x) ]
• Since we cannot observe it, we perform an experiment: we collect a random sample S of n independently drawn instances from the distribution D, and use it to measure
  The Sample Error:  error_S(h) ≡ (1/n) |{ x ∈ S : h(x) ≠ f(x) }|
• Naturally, each time we run an experiment (i.e., collect a sample of n test examples) we expect to get a different Sample Error.
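As a concrete illustration, here is a minimal Python sketch of measuring the Sample Error; the hypothesis h, target f, and sample S below are hypothetical stand-ins.

```python
# A minimal sketch: the sample error of a hypothesis h against a target f
# on a sample S is the fraction of S on which they disagree.
def sample_error(h, f, S):
    S = list(S)
    return sum(h(x) != f(x) for x in S) / len(S)

# Toy example: a threshold hypothesis versus a threshold target on 0..99.
h = lambda x: x > 10
f = lambda x: x > 12
print(sample_error(h, f, range(100)))   # 0.02 -- h and f disagree on x = 11, 12
```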
Estimating Hypothesis Accuracy
• The distribution of the number of mistakes r is Binomial(n, p) with p = error_D(h):
  Pr(number of mistakes = r) = C(n, r) p^r (1 - p)^(n-r)
  The mean is E[r] = np and the standard deviation is σ = sqrt( np(1 - p) ).
Estimating Hypothesis Accuracy
• The Sample Error error_S(h) is distributed like r/n, where r is Binomial(n, p).
  The mean is E[error_S(h)] = p = error_D(h)
  and the standard deviation is σ = (1/n) sqrt( np(1 - p) ).
• But, due to the central limit theorem, if n is large enough (roughly n ≥ 30) we can assume that the distribution of the Sample Error is approximately Normal, with mean
  E[error_S(h)] = p = error_D(h)
  and standard deviation
  σ_{error_S(h)} = (1/n) sqrt( np(1 - p) ) ≈ sqrt( error_S(h)(1 - error_S(h)) / n ).
Estimating Hypothesis Accuracy
The distribution of the Sample Error:
[Figure: the density Pr(error_S(h)) plotted against error_S(h), centered at error_D(h).]
Consequently, one can give a range on the error of a hypothesis such that with high probability the true error will be within this range.
Given the observed error (your estimate of the true error), you know with some confidence that the true error is within some range around it.
Some Numbers
Assume you test a hypothesis h and find that it commits r = 12 errors on a sample of n = 40 examples.
The estimate of the true error is p = r/n = 0.3.
What is the standard deviation of this error?
(n is fixed; r is a random variable, distributed Binomial(n, 0.3).)
Therefore σ(# of mistakes) = sqrt( 40 · 0.3 · (1 - 0.3) ) ≈ 2.9
and σ(sample error) = 2.9 / 40 ≈ 0.07.
Now assume r = 300 errors on a sample of n = 1000 examples.
The estimate of the true error is again p = r/n = 0.3,
and σ(sample error) = sqrt( 0.3 · (1 - 0.3) / 1000 ) ≈ 14.5 / 1000 ≈ 0.014.
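A quick sketch for checking the arithmetic above, using only the Python standard library:

```python
from math import sqrt

n, r = 40, 12
p = r / n                          # estimated true error: 0.3
print(sqrt(n * p * (1 - p)))       # sigma(# of mistakes)  ~ 2.9
print(sqrt(n * p * (1 - p)) / n)   # sigma(sample error)   ~ 0.072

n, r = 1000, 300
p = r / n                          # 0.3 again
print(sqrt(p * (1 - p) / n))       # sigma(sample error)   ~ 0.0145
```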
Estimating Hypothesis Accuracy
The distribution of the Sample Error:
[Figure: normal approximation of Pr(error_S(h)), centered at error_D(h), with σ ≈ sqrt( error_S(h)(1 - error_S(h)) / n ); about 95% of the samples are within ±2σ of the mean.]
For a normal variable Z with mean μ and standard deviation σ:  Prob( μ - R_N σ ≤ Z ≤ μ + R_N σ ) = N%
Consequently, one can give a range on the error of a hypothesis such that with high probability the true error will be within this range.
With confidence N%:  error_D(h) ∈ error_S(h) ± R_N · σ
  Confidence N%:  50%   68%   80%   90%   95%   98%   99%
  Constant R_N:   0.67  1.00  1.28  1.64  1.96  2.33  2.58
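As an illustration, a minimal sketch that computes the same interval with the exact normal quantile instead of the R_N table (statistics.NormalDist is in the Python standard library):

```python
from math import sqrt
from statistics import NormalDist

def error_confidence_interval(err_s, n, confidence=0.95):
    """N% interval error_S(h) +/- R_N * sigma under the normal approximation."""
    sigma = sqrt(err_s * (1 - err_s) / n)
    r_n = NormalDist().inv_cdf((1 + confidence) / 2)   # e.g. 1.96 for 95%
    return err_s - r_n * sigma, err_s + r_n * sigma

# The r = 12, n = 40 example from the "Some Numbers" slide, at 95% confidence:
print(error_confidence_interval(0.3, 40))   # roughly (0.16, 0.44)
```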
Comparing Two Hypotheses
When comparing two hypotheses, the ordering of their sample accuracies may or may not accurately reflect the ordering of their true accuracies.
[Figure: the two densities Pr(error_S(h)) for error_S1(h1) and error_S2(h2), plotted against error_S(h).]
Interpretation: assume we test h1 on S1 and h2 on S2, and measure error_S1(h1) and error_S2(h2) respectively. These graphs indicate the probability distribution of each sample error. We can see that it is possible that the true error of h2 is lower than that of h1, and vice versa.
Comparing Two Hypotheses
• We wish to estimate the difference between the true errors of these hypotheses.
• The difference of two normally distributed variables is also normally distributed:
  d ≡ error_S1(h1) - error_S2(h2)
[Figure: the density Pr(d) of the difference d, shown relative to 0.]
Notice that the density function of d is a convolution of the original two.
Confidence in Difference
• The probability that error_D(h1) > error_D(h2) is the probability that d > 0, which is given by the shaded area under the density of d to the right of 0.
[Figure: the density Pr(d) of d ≡ error_S1(h1) - error_S2(h2), with the area for d > 0 shaded.]
Confidence in Difference
• Since the normal distribution is symmetric, we can also assert confidence intervals with lower bounds and upper bounds.
[Figure: the density Pr(d) of d ≡ error_S1(h1) - error_S2(h2), with a two-sided interval marked.]
Standard Deviation of the Difference
• The variance of the difference is the sum of the variances:
  σ_d ≈ sqrt( error_S1(h1)(1 - error_S1(h1)) / n1  +  error_S2(h2)(1 - error_S2(h2)) / n2 )
• The mean is the observed difference d.
• Therefore, the N% confidence interval for d is  d ± R_N σ_d.
• What is the probability that error_D(h1) > error_D(h2)?
  This is the confidence that d lies in the one-sided interval d > 0.
• We find the highest value N such that d ≥ R_N σ_d, that is, R_N ≤ d / σ_d, and conclude that error_D(h1) > error_D(h2) with probability (100 - (100 - N)/2)%.
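A minimal sketch of the two-sided interval d ± R_N σ_d, again using the exact normal quantile rather than the table:

```python
from math import sqrt
from statistics import NormalDist

def difference_interval(err1, n1, err2, n2, confidence=0.95):
    """N% interval d +/- R_N * sigma_d for d = err1 - err2 (independent test sets)."""
    d = err1 - err2
    sigma_d = sqrt(err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2)
    r_n = NormalDist().inv_cdf((1 + confidence) / 2)
    return d - r_n * sigma_d, d + r_n * sigma_d

print(difference_interval(0.2, 100, 0.3, 100))   # roughly (-0.22, 0.02)
```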
Hypothesis Testing
• A statistical hypothesis is a statement about a set of parameters of a distribution. We are looking for procedures that determine whether the hypothesis is correct or not.
• In this case we can say that we accept the hypothesis that error_D(h1) > error_D(h2) with N% confidence.
• Equivalently, we can say that we reject the hypothesis that the difference is due to random chance, at a (100 - N)/100 level of significance.
• By convention, in normal scientific practice, a confidence of 95% is high enough to assert that there is a "significant difference".
A Hypothesis Test
• Assume that, based on two different samples of 100 test instances each, we observe:
  error_S1(h1) = 0.2,  error_S2(h2) = 0.3
  σ_d = sqrt( error_S1(h1)(1 - error_S1(h1)) / n1  +  error_S2(h2)(1 - error_S2(h2)) / n2 ) ≈ 0.0608
  Taking d as the observed advantage of h1, d = 0.3 - 0.2 = 0.1, so
  d / σ_d = 0.1 / 0.0608 ≈ 1.644 ≥ R_90 = 1.64, giving confidence 100 - (100 - 90)/2 = 95.
• We can say that we accept the hypothesis that "h1 is better than h2" with 95% confidence, or that the difference is significant at the .05 level.
A Hypothesis Test
• Assume instead that, based on two different samples of 100 test instances each, we observe:
  error_S1(h1) = 0.2,  error_S2(h2) = 0.25
  σ_d = sqrt( error_S1(h1)(1 - error_S1(h1)) / n1  +  error_S2(h2)(1 - error_S2(h2)) / n2 ) ≈ 0.0589
  d / σ_d = (0.25 - 0.2) / 0.0589 ≈ 0.848 ≥ R_50 = 0.67, giving confidence 100 - (100 - 50)/2 = 75.
• We conclude: "h1 is better than h2", but only with 75% confidence.
• We cannot conclude that the difference is significant (since p > .05).
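The two worked examples above can be reproduced with a short sketch that computes the exact one-sided confidence from the normal CDF (slightly more precise than the coarse R_N table):

```python
from math import sqrt
from statistics import NormalDist

def one_sided_confidence(err1, n1, err2, n2):
    """Confidence that the hypothesis with the smaller observed error also
    has the smaller true error, under the normal approximation."""
    d = abs(err1 - err2)
    sigma_d = sqrt(err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2)
    return NormalDist().cdf(d / sigma_d)

print(one_sided_confidence(0.2, 100, 0.3, 100))    # ~0.95, as in the first example
print(one_sided_confidence(0.2, 100, 0.25, 100))   # ~0.80; the table lookup rounds down to 75%
```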
Comparing Learning Algorithms
• Given two algorithms A and B, we would like to know which of the methods is the better method, on average, for learning a particular function f.
• Statistical tests must control several sources of variation:
  - variation in selecting test data
  - variation in selecting training data
  - random decisions of the algorithms
• Algorithm A might do better than B when trained on a particular randomly selected training set, or when tested on a particular randomly selected test set, even though on the whole population they perform identically.
Comparing Learning Algorithms
• An ideal statistical test should derive conclusions based on estimating:
  E_{S⊂D} [ error_D(L_A(S)) - error_D(L_B(S)) ]
  where L(S) denotes the output hypothesis of algorithm L when trained on S, and the expectation is over all possible samples drawn independently from the underlying distribution.
• In practice we usually have a single sample D' from D to work with. The average is therefore taken over different splits of this sample into training/test sets.
• We want methods that
  - identify a difference between algorithms when it exists
  - do not find a difference when it does not exist
Methodology
• Assume a hypothesis (the null hypothesis), e.g., "the algorithms are equivalent".
• Choose a statistic: a figure that you can compute from the data and whose distribution you can estimate given that the hypothesis holds.
  - What value do we expect? (assuming the hypothesis holds)
  - What value do we get? (experimentally)
• What is the probability distribution of the statistic? What is the deviation of the empirical figure from the expected one?
  Decide: is this due to chance? Yes / No / with what confidence?
Distributions
• Normal distribution:  X_i ~ N(μ_i, σ_i²)
• Chi square:
  Consider the random variables X_i ~ N(μ_i, σ_i²) and Z_i = (X_i - μ_i) / σ_i.
  The random variable defined by
    Σ_{i=1}^{n} Z_i² = Σ_{i=1}^{n} (X_i - μ_i)² / σ_i²
  is χ²(n) (chi square with n degrees of freedom).
Distributions
Student's t distribution:
Let W be N(0, 1), let V be χ²(n), and assume W and V are independent.
Then the distribution of the random variable
  T_n = W / sqrt(V / n)
is called a t-distribution with n degrees of freedom.
T_n is symmetric about zero.
As n becomes larger, it becomes more and more like N(0, 1).
E(T_n) = 0,  Var(T_n) = n / (n - 2).
t Distributions
• Originally used when one can obtain an estimate for the mean but not for σ.
• We want a distribution that allows us to compute a confidence interval for the mean μ without knowing σ, only an estimate s of it (based on the same sample that produced the mean).
• The quantity t is given by:
  t = (X̄ - μ) / (s / √n)
• That is, t is the deviation of the sample mean from the population mean, measured in units of the mean's standard error s / √n.
• This is good for small samples, and the tables depend on n.
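As an illustration of how t is used, here is a sketch of a t-based confidence interval for a mean when σ is unknown; it assumes SciPy is available for the t quantile and uses n - 1 degrees of freedom, since s is estimated from the same sample.

```python
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t   # assumes SciPy is installed

def t_confidence_interval(sample, confidence=0.95):
    n = len(sample)
    xbar, s = mean(sample), stdev(sample)            # sample mean and estimate of sigma
    t_crit = t.ppf((1 + confidence) / 2, df=n - 1)   # two-sided critical value
    half = t_crit * s / sqrt(n)
    return xbar - half, xbar + half

# Hypothetical sample of 10 measured error differences:
print(t_confidence_interval([0.02, 0.05, -0.01, 0.03, 0.04,
                             0.00, 0.02, 0.06, 0.01, 0.03]))
```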
K-Fold Cross Validation
• Partition the data D' into k disjoint subsets T_1, T_2, ..., T_k of equal size.
• For i from 1 to k do:
    Use T_i for testing and the rest for training:
    S_i = D' - T_i
    h_A = L_A(S_i),  h_B = L_B(S_i)
    δ_i = error_{T_i}(h_A) - error_{T_i}(h_B)
• Return the average difference in error:  δ̄ = (1/k) Σ_i δ_i
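A minimal Python sketch of this procedure; learn_A, learn_B, and the (x, y) data format are hypothetical stand-ins, where each learner takes a training set and returns a classifier (a callable from x to a predicted label).

```python
import random

def error_rate(h, test_set):
    """Fraction of (x, y) pairs in test_set that classifier h gets wrong."""
    return sum(h(x) != y for x, y in test_set) / len(test_set)

def kfold_paired_deltas(learn_A, learn_B, data, k=10, seed=0):
    """Per-fold differences delta_i = error(h_A) - error(h_B) on each T_i."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]          # k disjoint test sets T_i
    deltas = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]   # S_i = D' - T_i
        h_A, h_B = learn_A(train), learn_B(train)
        deltas.append(error_rate(h_A, test) - error_rate(h_B, test))
    return deltas                                   # delta_bar = sum(deltas) / k
```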
K-Fold Cross Validation Comments
• 10 is a standard number of folds. When k = |D'|, the method is called leave-one-out.
• Every example gets used as a test example exactly once and as a training example k - 1 times.
• Test sets are independent, but training sets overlap significantly.
• The hypotheses are generated using (k - 1)/k of the training data.
• Before, we compared hypotheses using independent test sets. Here, the hypotheses generated by algorithms A and B are tested on the same test set (paired tests).
Paired t Tests
• Paired tests produce tighter bounds, since any difference is due to differences between the hypotheses rather than differences between the test sets.
Significance testing of the paired test:
• Compute the statistic:
  t = δ̄ / sqrt( (1 / (k(k - 1))) Σ_{i=1}^{k} (δ_i - δ̄)² )
• where δ_i is the measured difference between A and B on the i-th fold and δ̄ = (1/k) Σ_i δ_i is their average.
• When k paired tests are performed, the statistic is distributed according to a t-distribution with k - 1 degrees of freedom.
• With k = 30, the 95% critical value is |t| = 2.04: the difference is significant when |t| exceeds it.
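Continuing the k-fold sketch above, the paired-t statistic takes only a few lines:

```python
from math import sqrt

def paired_t_statistic(deltas):
    """t = delta_bar / sqrt( (1/(k(k-1))) * sum_i (delta_i - delta_bar)^2 )."""
    k = len(deltas)
    d_bar = sum(deltas) / k
    s2 = sum((d - d_bar) ** 2 for d in deltas) / (k * (k - 1))
    return d_bar / sqrt(s2)   # compare against a t table with k - 1 degrees of freedom
```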
Paired t Tests
• Paired t tests can be used in many ways.
• Sample the data 30 times. Split each sample into Train and Test. Run A and B on Train and Test; let δ_i be the difference in error. Estimate the same statistic.
  (Most common in Machine Learning; has problems.)
• 10-fold cross validation: the i-th experiment is done on T_i.
  (Better, but has problems due to training set overlap.)
5x2 Cross Validation
• Perform 5 replications of 2-fold cross validation.
• In each replication, the available data is randomly partitioned into S_1 and S_2 of equal size.
• Train algorithms A and B on each set and test on the other.
• Error measures (for one replication):
  e_A^1, e_B^1, e_A^2, e_B^2;   p^1 = e_A^1 - e_B^1,   p^2 = e_A^2 - e_B^2,
  s² = (p^1 - p̄)² + (p^2 - p̄)²   where p̄ = (p^1 + p^2)/2.
• Let s_i² be the variance computed for the i-th of the 5 replications, and let p_1^1 be p^1 from the first replication. Then
  t = p_1^1 / sqrt( (1/5) Σ_{i=1}^{5} s_i² )
• This statistic has (approximately) a t-distribution with k = 5 degrees of freedom.
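A sketch of the 5x2cv statistic, reusing the hypothetical error_rate, learn_A, and learn_B helpers from the k-fold sketch above:

```python
import random
from math import sqrt

def five_by_two_t(learn_A, learn_B, data, seed=0):
    """5x2cv paired t statistic (compare to a t table with 5 degrees of freedom)."""
    rng = random.Random(seed)
    p11, variances = None, []
    for rep in range(5):
        d = list(data)
        rng.shuffle(d)
        S1, S2 = d[: len(d) // 2], d[len(d) // 2:]
        ps = []
        for train, test in [(S1, S2), (S2, S1)]:
            ps.append(error_rate(learn_A(train), test) -
                      error_rate(learn_B(train), test))
        p_bar = (ps[0] + ps[1]) / 2
        variances.append((ps[0] - p_bar) ** 2 + (ps[1] - p_bar) ** 2)
        if rep == 0:
            p11 = ps[0]                      # p^1 from the first replication
    return p11 / sqrt(sum(variances) / 5)
```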
McNemar’s Test
• An alternative to Cross Validation, when the test can be run only once.
• Divide the sample S into a training set R and test set T.
• Train algorithms A and B on R, yielding classifiers A, B
• Record how each example in T is classified and compute the number of:
examples misclassified by both
A and B N 00
examples misclassified by
A but not B N 01
examples misclassified by both
B but not A N 10
examples misclassified by
neither A nor B N 11
where N is the total number of examples in the test set T
N 00  N 10  N 01  N 11  N
Experimental Evaluation
CS446-Spring06
42
McNemar’s Test
• The hypothesis: the two learning algorithms have the same error rate
on a randomly drawn sample. That is, we expect that
N 10  N 01
• The statistics we use to measure deviation from the expected counts:
(| N 01  N 10 | 1) 2
N 01  N 10
2

• This statistics is distributed (approximately) as
with 1 degree
of freedom. (with a continuity correction since the statistics is discrete)
2
• Example: Since 1,0.95
 3.841 we reject the hypothesis with
95% confidence if the above ratio is greater the 3.841
Experimental Evaluation
CS446-Spring06
43
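A sketch of the McNemar statistic, given two trained classifiers and a labeled test set of (x, y) pairs (the names are hypothetical):

```python
def mcnemar_statistic(h_A, h_B, test_set):
    """chi^2 = (|N01 - N10| - 1)^2 / (N01 + N10); compare to 3.841 for 95%."""
    n01 = sum(h_A(x) != y and h_B(x) == y for x, y in test_set)   # A wrong, B right
    n10 = sum(h_A(x) == y and h_B(x) != y for x, y in test_set)   # A right, B wrong
    if n01 + n10 == 0:
        return 0.0                   # the classifiers never disagree
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
```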
Experimental Evaluation - Final Comments
• Good experimental methodology, including statistical analysis, is important in empirically comparing learning algorithms.
• The methods have their shortcomings; this is an active area of research. See Tom Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms" (Neural Computation).
• Artificial data is useful for testing certain hypotheses about specific strengths and weaknesses of algorithms, but only real data can test the hypothesis that the bias of the learner is useful for the actual problem.
• There are a few benchmarks for comparing learning algorithms. The UC Irvine repository is the one most commonly used:
  http://www/ics.uci.edu/mlearn/MLRepository.html