On Comparing Classifiers:
Pitfalls to Avoid and a Recommended Approach
Paper by Steven L. Salzberg
Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
Presenter: Jiyeon Kim
(April 14th, 2014)
Introduction
• How does a researcher choose which classification
algorithm to use for a new problem?
• Comparing the effectiveness of different algorithms on
public databases – opportunities or dangers?
• Are the many comparisons that rely on widely shared
datasets statistically valid?
Contents
1 Definitions
2 Comparing Algorithms
3 Statistical Validity
4 Conclusions
> Candidate Questions
1 Definitions
• Paired T-Test
• Hypothesis Testing
• Significance Level (α)
• P-value
1/1 Paired T-Test
• Used to determine whether two paired sets differ
from each other in a statistically significant way
• Assumption: the paired differences are
independent and identically normally distributed
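As an aside (not part of the original slides), a minimal sketch of a paired t-test using scipy.stats.ttest_rel; the paired accuracy values for the two hypothetical classifiers are invented for illustration.

    # Minimal sketch: paired t-test on paired accuracies of two classifiers.
    # The accuracy values are illustrative placeholders, not real results.
    from scipy import stats

    acc_a = [0.91, 0.88, 0.93, 0.90, 0.89]   # classifier A, one value per paired trial
    acc_b = [0.89, 0.87, 0.90, 0.88, 0.90]   # classifier B, same trials

    t_stat, p_value = stats.ttest_rel(acc_a, acc_b)   # two-tailed paired t-test
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}")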
1/2 Hypothesis Testing
• Null Hypothesis (H0)
vs. Alternative Hypothesis (H1)
• Reject the null hypothesis (H0)
if the p-value is less than the significance level
• e.g. For the paired t-test:
H0 : There is no difference between the two populations.
H1 : There is a statistically significant difference.
1/3 Significance Level, α
• The percentage of the time the experimenter is
willing to make an error
• Usually the significance level is chosen to
be 0.05 (or equivalently, 5%)
• A fixed probability of wrongly rejecting the
null hypothesis H0 when it is in fact true
( = P(Type I error) )
1/4 P-Value
• The probability of obtaining a test statistic
at least as extreme as the one actually
observed, assuming that the null hypothesis is true
• Reject the null hypothesis (H0)
when the p-value turns out to be less than
the chosen significance level, often 0.05
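Pulling slides 1/2–1/4 together in symbols (my consolidation, not a formula from the deck):

    \alpha = P(\text{reject } H_0 \mid H_0 \text{ true}) = P(\text{Type I error}),
    \qquad
    p = P\big(\,|T| \ge |t_{\mathrm{obs}}|\;\big|\;H_0\,\big),
    \qquad
    \text{reject } H_0 \iff p < \alpha .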
2 Comparing Algorithms
• Much empirical validation work in classification research
has serious experimental deficiencies
• Be careful when concluding that
a new method is significantly better on well-studied
datasets
3 Statistical Validity
< Multiplicity Effect >
e.g. Assume that you run 154 experiments (two-tailed,
paired t-tests) at significance level 0.05
- You have 154 chances to obtain a "significant" result
- The expected number of spuriously significant results is
154 * 0.05 = 7.7
So you can expect roughly 7.7 "significant" findings by chance alone!
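A small simulation (mine, purely illustrative) of the multiplicity effect: even when the two "algorithms" are truly identical, about 154 × 0.05 ≈ 7.7 of the 154 paired tests come out significant at α = 0.05 on average.

    # Illustrative simulation: 154 paired t-tests when there is truly no difference.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_tests, n_folds, alpha = 154, 10, 0.05

    false_positives = 0
    for _ in range(n_tests):
        # Paired accuracy differences drawn from a zero-mean distribution,
        # i.e. the two algorithms really perform the same.
        diffs = rng.normal(loc=0.0, scale=0.02, size=n_folds)
        _, p = stats.ttest_1samp(diffs, popmean=0.0)   # same as a paired t-test on the pairs
        false_positives += p < alpha

    print(f"spuriously significant results: {false_positives} (expected about {n_tests * alpha:.1f})")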
3 Statistical Validity
< Bonferroni Adjustment >
① Let α* be the error rate (significance level) of each individual experiment
② Then (1 - α*) is the chance of reaching the
right conclusion in one experiment
③ If we conduct n independent experiments,
the chance of getting them all right is (1 - α*)ⁿ
④ So the chance of making at least one
mistake is α = 1 - (1 - α*)ⁿ
3 Statistical Validity
< Bonferroni Adjustment >
e.g. (This is not correct usage!!)
Assume again that you run 154 experiments (two-tailed, paired t-tests)
at significance level 0.05
① The significance level for each experiment: α* = 0.05
② Then the chance of a right conclusion per experiment: (1 - α*) = (1 - 0.05) = 0.95
③ The chance of getting them all right: (1 - 0.05)^154 ≈ 0.0004
④ So the effective significance level over all experiments is
1 - (1 - α*)ⁿ = 1 - (1 - 0.05)^154 ≈ 0.9996
Now you have a 99.96% chance of at least one spurious "significant" result!
3 Statistical Validity
< Bonferroni Adjustment >
⇒ " Then, what should we do? "
e.g. (This is 'correct' usage!!)
① Require α = 1 - (1 - α*)^154 ≤ 0.05 in order to obtain
results significant at the 0.05 level over all 154 tests
② This gives α* ≤ 0.0003, which is far more stringent
than the original significance level of 0.05!
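A quick numerical check of the two calculations above (mine, not from the slides); the exact per-test level 1 - (1 - 0.05)^(1/154) is the Šidák form of the correction, and the simpler Bonferroni rule α/n gives nearly the same value.

    # Reproduce the Bonferroni-slide numbers for n = 154 tests and a 0.05 family-wise level.
    n, alpha = 154, 0.05

    # Incorrect usage: testing each comparison at 0.05 individually.
    familywise_error = 1 - (1 - alpha) ** n
    print(f"family-wise error at per-test 0.05: {familywise_error:.4f}")      # ~0.9996

    # Correct usage: pick the per-test level so the family-wise error stays <= 0.05.
    sidak_per_test = 1 - (1 - alpha) ** (1 / n)        # exact form
    bonferroni_per_test = alpha / n                    # simple Bonferroni rule
    print(f"per-test level (exact):      {sidak_per_test:.6f}")        # ~0.000333
    print(f"per-test level (Bonferroni): {bonferroni_per_test:.6f}")   # ~0.000325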
3 Statistical Validity
< Bonferroni Adjustment >
/ CAVEAT /
This argument is very rough
because it assumes that all the experiments are
independent of one another!
3/1 Alternative Statistical Tests
* Recommended Tests
• Simple Binomial Test
• ANOVA (Analysis of Variance)
( with Bonferroni Adjustment )
3/1 Alternative Statistical Tests
To compare two algorithms (A and B),
a comparison must consider four numbers:
① The number of examples that A got right and
B got wrong ⇒ A > B
② The number of examples that B got right and
A got wrong ⇒ B > A
③ The number that both algorithms got right
④ The number that both algorithms got wrong
3/1 Alternative Statistical Tests
Of those four numbers, only the examples on which
the two algorithms disagree matter:
① The number of examples that A got right and
B got wrong ⇒ A > B
② The number of examples that B got right and
A got wrong ⇒ B > A
⇒ a simple but much improved approach: the Binomial Test!
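A hedged sketch (not from the slides) of the binomial (sign) test on the two disagreement counts, using scipy.stats.binomtest; the counts are invented for illustration.

    # Sign/binomial test on the examples where the two classifiers disagree.
    # Under H0 (no real difference) each disagreement is equally likely to favour A or B.
    from scipy.stats import binomtest

    a_right_b_wrong = 30   # illustrative count: A correct, B wrong
    b_right_a_wrong = 18   # illustrative count: B correct, A wrong (ties are discarded)

    n_disagreements = a_right_b_wrong + b_right_a_wrong
    result = binomtest(a_right_b_wrong, n=n_disagreements, p=0.5, alternative="two-sided")
    print(f"p = {result.pvalue:.4f}")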
3/2 Community Experiments
• Even with strict significance criteria and
the appropriate significance tests, some reported results
will be mere 'accidents of chance'
• The most helpful way to deal with this phenomenon
is duplication (independent replication of the result)!
3/3 Repeated Tuning
• Algorithms are often tuned repeatedly on the same datasets
• Whenever tuning takes place, every adjustment
should be considered a separate experiment
e.g. If 10 'tuning' experiments were attempted,
then the significance level should be 0.005 (= 0.05/10) instead of 0.05
3/3 Repeated Tuning
< Recommended Approach >
To establish the new algorithm's comparative merits,
① Choose other algorithms to include in the comparison,
especially those most similar to the new one
② Choose a benchmark data set that illustrates the
strengths of the new algorithm
③ Divide the data set into k subsets for cross-validation
④ Run a cross-validation (see the next slide)
⑤ To compare the algorithms, use the appropriate
statistical test
3/3 Repeated Tuning
< Cross-Validation >
(A) For each of the k subsets of the data set D, create a
training set T consisting of D minus the held-out subset
(B) Divide each training set into two smaller subsets, T1 and T2;
T1 is used for training, and T2 for tuning
(C) Once the parameters are optimized, re-run training
on the larger set T
(D) Finally, measure accuracy on the held-out subset
(E) Overall accuracy is averaged across all k partitions;
these k values also give an estimate of the variance of the
algorithms
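A hedged sketch of steps (A)–(E) (my illustration, not code from the paper), using scikit-learn with two stand-in classifiers and a placeholder parameter grid; a paired t-test on the per-fold accuracies stands in for step ⑤.

    # Illustrative k-fold comparison with an inner train/tune split (steps A-E).
    # Data set, models, and parameter values are placeholders.
    import numpy as np
    from scipy import stats
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import KFold, train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    kf = KFold(n_splits=10, shuffle=True, random_state=0)

    def tuned_fold_accuracy(model_cls, param_name, param_values, train_idx, test_idx):
        """Steps (B)-(D): tune on an inner split, retrain on the full training fold, test on the held-out fold."""
        X_tr, y_tr = X[train_idx], y[train_idx]
        # (B) split the training fold into a training part T1 and a tuning part T2
        X_t1, X_t2, y_t1, y_t2 = train_test_split(X_tr, y_tr, test_size=0.25, random_state=0)
        best = max(param_values,
                   key=lambda v: model_cls(**{param_name: v}).fit(X_t1, y_t1).score(X_t2, y_t2))
        # (C) retrain on the whole training fold with the chosen parameter value
        model = model_cls(**{param_name: best}).fit(X_tr, y_tr)
        # (D) measure accuracy on the held-out fold
        return model.score(X[test_idx], y[test_idx])

    acc_a, acc_b = [], []
    for train_idx, test_idx in kf.split(X):            # (A) the k partitions
        acc_a.append(tuned_fold_accuracy(DecisionTreeClassifier, "max_depth", [2, 4, 8], train_idx, test_idx))
        acc_b.append(tuned_fold_accuracy(KNeighborsClassifier, "n_neighbors", [1, 5, 15], train_idx, test_idx))

    # (E) average accuracy across folds, plus a paired test on the per-fold values
    print(f"A: {np.mean(acc_a):.3f}   B: {np.mean(acc_b):.3f}")
    t_stat, p_value = stats.ttest_rel(acc_a, acc_b)
    print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.3f}")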
4 Conclusions
• No single technique is likely to work best on all
databases
• Empirical comparisons should be done to validate
algorithms, but these studies must be very careful!
- Comparative work should be done in a statistically
acceptable framework
• The contents above are meant to help experimental
researchers steer clear of problems in designing a
comparative study.
> Exam Questions
Q1) Why should we apply the Bonferroni Adjustment
when comparing classifiers?
> Exam Questions
A1) With multiple tests, the multiplicity effect
arises if we use the same significance level for each
individual test as for the whole family of tests. So we
need a more stringent level for each experiment,
obtained via the Bonferroni Adjustment.
> Exam Questions
Q2) Assume that you will do 10 experiments
comparing two classification algorithms.
Using the Bonferroni Adjustment, determine the
criterion for α* (the significance level for each
experiment) in order to get results that are truly
significant at the 0.01 level over all 10 tests.
> Exam Questions
A2)
α = 1 - (1 - α*)^10 ≤ 0.01
(1 - α*)^10 ≥ 0.99
1 - α* ≥ 0.99^(1/10) ≈ 0.9990
∴ α* ≤ 0.0010 (roughly 0.01 / 10 = 0.001)
> Exam Questions
Q3) Explain the difference between the paired t-test and
the simple binomial test when comparing two algorithms.
> Exam Questions
A3)
Paired t-test :
determines whether a difference between the two
algorithms exists, based on the paired differences
in their scores
Binomial test :
compares the number of times 'algorithm A > algorithm B'
versus 'A < B' on individual examples, throwing out
the ties
Thank You.