Statistical Comparison of Two Learning Algorithms
Presented by: Payam Refaeilzadeh
Overview

How can we tell if one algorithm can learn better than another?
– Design an experiment to measure the accuracy of the two algorithms.
– Run multiple trials.
– Compare the samples, not just their means: do a statistically sound test of the two samples.
– Is any observed difference significant? Is it due to a true difference between the algorithms or to natural variation in the measurements?
Statistical Hypothesis Testing

Statistical Hypothesis: A statement about the parameters of one or more populations.
Hypothesis Testing: A procedure for deciding whether to accept or reject the hypothesis:
– Identify the parameter of interest
– State a null hypothesis, H0
– Specify an alternate hypothesis, H1
– Choose a significance level α
– State an appropriate test statistic
Statistical Hypothesis Testing (cont.)

Null Hypothesis (H0): A statement presumed to be true until statistical evidence shows otherwise.
– Usually specifies an exact value for a parameter
– Example H0: µ = 30 kg
Alternate Hypothesis (H1): Accepted if the null hypothesis is rejected.
Test Statistic: A particular statistic calculated from the measurements of a random sample / experiment.
– A test statistic is assumed to follow a particular distribution (normal, t, chi-square, etc.)
– That distribution can then be used to test the significance of the calculated test statistic.
Error in Hypothesis Testing

Type I error occurs when H0 is rejected but it is in fact true.
– P(Type I error) = α (the significance level)
Type II error occurs when we fail to reject H0 but it is in fact false.
– P(Type II error) = β
– Power = 1 − β = probability of correctly rejecting H0
– Power = the ability to distinguish between the two populations
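
To make the two error types concrete, the following sketch computes the critical value, β, and power for a one-sided test of the slide's example hypothesis H0: µ = 30 kg. The alternative mean (32 kg), the standard deviation, the sample size, and α are hypothetical values chosen only for illustration.

    # Sketch: Type I / Type II error for a one-sided z-test of H0: mu = 30 kg.
    # sigma, n, and the alternative mean 32 kg are illustrative assumptions.
    import math
    from scipy.stats import norm

    mu0, mu1 = 30.0, 32.0   # H0 mean; hypothetical true mean under H1: mu > 30
    sigma, n = 4.0, 25      # assumed known std dev and sample size
    alpha = 0.05            # significance level = P(Type I error)

    se = sigma / math.sqrt(n)
    critical = mu0 + norm.ppf(1 - alpha) * se   # reject H0 if sample mean > critical

    beta = norm.cdf(critical, loc=mu1, scale=se)   # P(fail to reject H0 | mu = mu1)
    power = 1 - beta
    print(f"critical = {critical:.3f} kg, beta = {beta:.3f}, power = {power:.3f}")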
Paired t-Test

Collect data in pairs:
– Example: Given a training set DTrain and a test set DTest, train both learning algorithms on DTrain and then test their accuracies on DTest.
Suppose n paired measurements have been made.
Assume:
– The measurements are independent
– The measurements for each algorithm follow a normal distribution
Then the test statistic T0 will follow a t-distribution with n−1 degrees of freedom.
Paired t-Test (cont.)

Trial #   Algorithm 1 Accuracy (X1)   Algorithm 2 Accuracy (X2)
1         X11                         X21
2         X12                         X22
…         …                           …
n         X1n                         X2n

Assume: X1 follows N(µ1, σ1) and X2 follows N(µ2, σ2)

Let:
µD = µ1 − µ2
Di = X1i − X2i,  i = 1, 2, ..., n
D̄ = (1/n) Σi (X1i − X2i)
SD = sample standard deviation of the Di

Null Hypothesis:
H0: µD = Δ0

Test Statistic:
T0 = (D̄ − Δ0) / (SD / √n)

Rejection Criteria:
H1: µD ≠ Δ0   reject H0 if |t0| > tα/2,n−1
H1: µD > Δ0   reject H0 if t0 > tα,n−1
H1: µD < Δ0   reject H0 if t0 < −tα,n−1
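
A minimal sketch of this test in Python, assuming ten paired accuracy measurements; the numbers are made up, and scipy is used only to look up the t critical value and to cross-check the hand computation.

    # Sketch of the paired t-test above; the accuracy values are illustrative.
    import numpy as np
    from scipy import stats

    acc1 = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.79])
    acc2 = np.array([0.78, 0.77, 0.82, 0.79, 0.80, 0.76, 0.81, 0.83, 0.78, 0.77])

    d = acc1 - acc2                 # paired differences D_i
    n = len(d)
    d_bar = d.mean()                # D-bar
    s_d = d.std(ddof=1)             # S_D (sample standard deviation of the D_i)

    delta0 = 0.0                    # H0: mu_D = 0 (no difference between algorithms)
    t0 = (d_bar - delta0) / (s_d / np.sqrt(n))

    alpha = 0.05
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # t_{alpha/2, n-1}
    print(f"t0 = {t0:.3f}, critical value = {t_crit:.3f}")
    print("reject H0" if abs(t0) > t_crit else "fail to reject H0")

    # Cross-check with the library routine:
    t_lib, p_lib = stats.ttest_rel(acc1, acc2)
    print(f"scipy: t = {t_lib:.3f}, p = {p_lib:.4f}")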
Cross-Validated t-Test

Paired t-test on the 10 paired accuracies obtained from 10-fold cross validation.
Advantages:
– Large training set size
– Most powerful (Dietterich, 1998)
Disadvantages:
– Accuracy results are not independent (the training sets overlap)
– Somewhat elevated probability of Type I error (Dietterich, 1998)
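
A sketch of the cross-validated t-test under illustrative assumptions: scikit-learn, the two classifiers, and the synthetic dataset are stand-ins, not part of the original slides.

    # Sketch: paired t-test on the 10 per-fold accuracies from one 10-fold CV.
    import numpy as np
    from scipy import stats
    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    kf = KFold(n_splits=10, shuffle=True, random_state=0)
    acc1, acc2 = [], []
    for train_idx, test_idx in kf.split(X):
        clf1 = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        clf2 = GaussianNB().fit(X[train_idx], y[train_idx])
        acc1.append(clf1.score(X[test_idx], y[test_idx]))
        acc2.append(clf2.score(X[test_idx], y[test_idx]))

    # Paired t-test on the 10 fold-wise accuracy pairs (n - 1 = 9 degrees of freedom).
    t0, p = stats.ttest_rel(acc1, acc2)
    print(f"t = {t0:.3f}, p = {p:.4f}")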
5x2 Cross-Validated t-Test

Run 2-fold cross validation 5 times.
Use the results from the first of the five replications to estimate the mean difference.
Use the results from all folds to estimate the variance.
Advantage:
– Lowest Type I error (Dietterich, 1998)
Disadvantage:
– Not as powerful as the 10-fold cross-validated t-test (Dietterich, 1998)
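
A sketch of the 5x2cv statistic as described by Dietterich (1998): the numerator uses only the first difference from the first replication, the variance pools all five replications, and the result is compared against a t-distribution with 5 degrees of freedom. The difference values below are placeholders.

    # Sketch of the 5x2cv paired t-test (Dietterich, 1998).
    # p[i] = (difference on fold 1, difference on fold 2) for replication i;
    # the numbers are placeholders for illustration.
    import math
    from scipy import stats

    p = [
        (0.020, 0.015),
        (0.030, 0.010),
        (0.025, 0.020),
        (0.010, 0.030),
        (0.015, 0.025),
    ]

    # Per-replication variance estimate s_i^2.
    s2 = []
    for p1, p2 in p:
        p_bar = (p1 + p2) / 2
        s2.append((p1 - p_bar) ** 2 + (p2 - p_bar) ** 2)

    # Numerator: first difference of the first replication; denominator pools all five.
    t0 = p[0][0] / math.sqrt(sum(s2) / 5)

    p_value = 2 * stats.t.sf(abs(t0), df=5)   # two-sided, 5 degrees of freedom
    print(f"t = {t0:.3f}, p = {p_value:.4f}")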
Resampled t-Test

Randomly divide the data into train / test sets (usually 2/3 – 1/3).
Run multiple trials (usually 30).
Perform a paired t-test between the trial accuracies.
This test has a very high probability of Type I error and should never be used.
Calibrated Tests

Bouckaert (ICML 2003):
– It is very difficult to estimate the true degrees of freedom because the independence assumptions are being violated.
– Instead of correcting for the mean difference, calibrate on the degrees of freedom.
– Recommendation: use 10 times repeated 10-fold cross validation with 10 degrees of freedom.
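
A minimal sketch of the calibration idea under a simplifying assumption: the 100 paired differences from 10 x 10-fold cross validation are combined into an ordinary paired t statistic, and only the degrees of freedom are changed to 10 when reading off the p-value. Bouckaert (2003) compares several variance estimators, so treat this as an illustration of the recommendation rather than his exact procedure; the differences are randomly generated placeholders.

    # Sketch: calibrated degrees of freedom for 10 x 10-fold cross validation.
    # The 100 differences below are synthetic placeholders.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    diffs = rng.normal(loc=0.01, scale=0.02, size=100)   # 10 runs x 10 folds

    d_bar = diffs.mean()
    s_d = diffs.std(ddof=1)
    t0 = d_bar / (s_d / np.sqrt(diffs.size))

    # Calibrated: evaluate against t with 10 degrees of freedom, not 99.
    p_value = 2 * stats.t.sf(abs(t0), df=10)
    print(f"t = {t0:.3f}, p (df = 10) = {p_value:.4f}")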
References

R. R. Bouckaert. Choosing between two learning algorithms based on calibrated tests. In Proceedings of ICML'03, pp. 51–58, 2003.
T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895–1924, 1998.
D. C. Montgomery et al. Engineering Statistics. 2nd edition. Wiley, 2001.