On Comparing Classifiers: Pitfalls to
Avoid and a Recommended Approach
Author: Steven L. Salzberg
Presented by: Zheng Liu
Introduction
• Data mining researchers often use classifiers to identify important
classes of objects within a data repository. Classification is
particularly useful when a database contains examples that can be
used as the basis for future decision making; e.g., for assessing
credit risks, for medical diagnosis, or for scientific data analysis.
• Researchers have a range of different types of classification
algorithms at their disposal, including nearest neighbor methods,
decision tree induction, error back propagation, reinforcement
learning, and rule learning.
Introduction
• Problem: How does one choose which algorithm to use for a new
problem?
• This paper discusses some of the pitfalls that confront anyone
trying to answer this question, and demonstrates how misleading
results can easily follow from a lack of attention to methodology.
Definition
• T-test:
– The t-test assesses whether the means of two groups are statistically
different from each other.
• P-value:
– The probability of concluding (incorrectly) that there is a difference
between samples when no true difference exists.
– Dependent upon the statistical test being performed.
– P = 0.05 means there is a 5% chance of being wrong if one concludes
that the populations are different.
• Null hypothesis:
– The assumption that there is no difference between two or more populations.
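• A minimal sketch of these definitions in Python, assuming SciPy is
available; the accuracy values below are invented purely for illustration:

# Two-sample t-test on hypothetical accuracy scores of two classifiers.
from scipy import stats

acc_a = [0.81, 0.79, 0.83, 0.80, 0.82]  # hypothetical accuracies of algorithm A
acc_b = [0.78, 0.77, 0.80, 0.79, 0.76]  # hypothetical accuracies of algorithm B

t_stat, p_value = stats.ttest_ind(acc_a, acc_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# Under the null hypothesis (no difference between the populations),
# a p-value below 0.05 leads us to reject the null, accepting a 5%
# chance of claiming a difference when none exists.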
Comparing algorithms
• Classification research, which is a component of data mining
as well as a subfield of machine learning, has always had a
need for very specific, focused studies that compare
algorithms carefully.
• According to [1], a high percentage of new algorithms (29%) were
not evaluated on any real problem at all, and very few (only 8%)
were compared to more than one alternative on real data.
[1]. Prechelt, L. A quantitative study of experimental evaluations of neural network
learning algorithms: Current research practice. Neural Networks, 9, 1996.
Pitfalls
• Classification research and data mining rely too
heavily on stored repositories of data.
– It is difficult to produce major new results using well-studied
and widely shared data.
• Seemingly easy studies require considerable skill.
– Most comparative studies do not propose an entirely new
method; more often they propose changes to one or more
known algorithms and use comparisons to show where and
how the changes improve performance. Although the goal
is worthwhile, the approach is sometimes not valid.
Problem 1: Sharing a small repository of datasets
• Suppose 100 different people are comparing algorithms A and B,
which in fact perform identically.
• On average, 5 of them will obtain results statistically significant at
p <= 0.05 (assuming independent experiments).
• These results are due to chance alone.
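• A minimal simulation sketch of this effect, assuming NumPy and SciPy;
both "algorithms" draw accuracies from the same (invented) distribution,
so every significant result is a false positive:

# 100 independent comparisons of two identical algorithms.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
false_positives = 0
for _ in range(100):
    a = rng.normal(loc=0.80, scale=0.02, size=30)  # accuracies of "A"
    b = rng.normal(loc=0.80, scale=0.02, size=30)  # accuracies of "B" (same distribution)
    _, p = stats.ttest_ind(a, b)
    if p <= 0.05:
        false_positives += 1

print(f"'Significant' results due to chance alone: {false_positives} / 100")
# Expected count: about 100 * 0.05 = 5.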
Problem 2: Statistics were not designed for
computational experiments
• Example: a comparison of classifier algorithms over 154 datasets,
i.e., 154 separate significance tests.
• Differences were reported significant if a t-test produced a
p-value < 0.05; that is, the null hypothesis was rejected at the
0.05 level (not a very stringent threshold).
– With 154 tests at the 0.05 level, roughly 154 × 0.05 ≈ 7.7
"significant" results are expected by chance alone.
Problem 2: Statistics were not designed for
computational experiments
• Let the significance level for each test be α.
• The chance of reaching a correct conclusion in one experiment is 1 - α.
• Assuming the experiments are independent of one another, the
chance of getting all n experiments correct is (1 - α)^n.
• The chance of at least one incorrect conclusion is 1 - (1 - α)^n.
• Substituting α = 0.05 and n = 154, the chance of at least one
incorrect conclusion is 1 - (0.95)^154 ≈ 0.9996.
• To obtain results significant at the 0.05 level across all 154 tests,
we need 1 - (1 - α)^154 < 0.05, i.e., α < 0.0003.
• This adjustment is well known as the Bonferroni adjustment.
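• The arithmetic above can be checked directly; a short sketch in plain
Python, using only the 154 tests and the 0.05 level from the slides:

# Family-wise error rate for n independent tests at per-test level alpha,
# and the per-test level needed to keep the family-wise rate below 0.05.
n = 154
alpha = 0.05

familywise = 1 - (1 - alpha) ** n
print(f"P(at least one false positive) = {familywise:.4f}")      # ~0.9996

per_test = 1 - (1 - 0.05) ** (1 / n)   # exact solution of 1-(1-a)^n < 0.05
print(f"Per-test alpha needed: {per_test:.5f}")                   # ~0.00033
print(f"Bonferroni approximation 0.05/n = {0.05 / n:.5f}")        # ~0.00032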
Problem 3: Experiments are not independent
• The t-test assumes that the test sets for each algorithm are
independent.
• When two algorithms are compared on the same data set,
then obviously the test sets are not independent, since they
will share some of the same examples—assuming the training
and test sets are created by random partitioning, which is the
standard practice.
• The whole framework of using alpha levels and p-values has
been questioned when more than two hypotheses are under
consideration [2].
[2]. Raftery, A. Bayesian model selection in social research (with discussion by Andrew
Gelman, Donald B. Rubin, and Robert M. Hauser). In Peter Marsden, editor, Sociological
Methodology 1995, pages 111–196. Blackwells, Oxford, UK, 1995.
Problem 4: Only considers overall accuracy
on a test set
• A comparison must consider 4 numbers when a common test set is
used to compare two algorithms (A and B):
– A > B: examples that A classifies correctly and B misclassifies
– A < B: examples that B classifies correctly and A misclassifies
– A = B: examples that both classify correctly
– ~A = ~B: examples that both misclassify
• If only two algorithms are compared:
– Throw out the ties.
– Compare the A > B count against the A < B count (see the
sign-test sketch below).
• If more than two algorithms are compared:
– Use analysis of variance (ANOVA).
– A Bonferroni adjustment for multiple tests should be applied.
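• A minimal sketch of the two-algorithm case as a sign (binomial) test,
in plain Python; the win counts below are hypothetical:

# Sign test comparing two classifiers on a shared test set.
# Ties (both right or both wrong) are discarded; only the wins count.
from math import comb

a_wins = 30   # hypothetical: examples A got right and B got wrong
b_wins = 15   # hypothetical: examples B got right and A got wrong
n = a_wins + b_wins

# Two-sided binomial test of H0: wins are equally likely (p = 0.5).
k = max(a_wins, b_wins)
p_value = min(1.0, 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n)
print(f"Sign-test p-value: {p_value:.4f}")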
Problem 5: Repeated tuning
• Many researchers tune their algorithms repeatedly in order to
make them perform optimally on some dataset.
• Tuning can also include the representation of the data, which may vary
from one study to the next, even when the same basic dataset is used.
• Whenever tuning takes place, every adjustment should really
be considered as a separate experiment.
• A greater problem occurs when one uses an algorithm that
has been used before: the algorithm may already have been
tuned on public databases.
Problem 5: Repeated tuning
• The recommended procedure is to reserve a portion of the
training set as a “tuning” set, and to repeatedly test the
algorithm and adjust its parameters on the tuning set. When
the parameters appear to be at their optimal settings,
accuracy can finally be measured on the test data.
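• A minimal sketch of this procedure, assuming scikit-learn; the classifier
and parameter grid are placeholders, not part of the paper:

# Reserve a tuning set from the training data, pick parameters on it,
# and only then touch the test set once.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def tune_and_evaluate(X_train, y_train, X_test, y_test):
    # Split the training data into a core training set and a tuning set.
    X_core, X_tune, y_core, y_tune = train_test_split(
        X_train, y_train, test_size=0.25, random_state=0)

    # Repeatedly adjust the parameter using only the tuning set.
    best_k, best_acc = None, -1.0
    for k in (1, 3, 5, 7, 9):
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_core, y_core)
        acc = accuracy_score(y_tune, clf.predict(X_tune))
        if acc > best_acc:
            best_k, best_acc = k, acc

    # Only when the parameters look optimal is the test set used, once.
    final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    return accuracy_score(y_test, final.predict(X_test))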
Problem 6: Generalizing results
• A common methodological approach in recent comparative
studies is to pick several datasets from the UCI repository (or
other data source) and perform a series of experiments,
measuring classification accuracy, learning rates, and perhaps
other criteria.
• It is not valid to generalize from such experiments to statements about
other datasets. The only way this might be valid would be if the UCI
repository were known to be representative of a larger population of
classification problems.
A recommended approach
1. Choose other algorithms to include in the comparison. Try to
include those most similar to the new algorithm.
2. Choose datasets.
3. Divide each dataset into k subsets for cross-validation.
– Typically k = 10.
– For a small dataset, choose a larger k, since this leaves
more examples in the training set.
A recommended approach
4. Run a cross-validation (see the sketch below).
– For each of the k subsets of the dataset D, create a
training set T = D minus that subset.
– Divide T into T1 (training) and T2 (tuning) subsets.
– Once tuning is done, rerun training on all of T.
– Finally, measure accuracy on the held-out subset.
– Overall accuracy is averaged across all k partitions.
5. Compare algorithms.
• In the case of multiple datasets, a Bonferroni adjustment should be
applied.
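• A minimal sketch of steps 3-5 for a single dataset, assuming scikit-learn
and NumPy arrays; the classifier and its parameter grid are placeholders:

# k-fold cross-validation with an inner training/tuning split per fold.
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier  # placeholder classifier
from sklearn.metrics import accuracy_score

def cross_validated_accuracy(X, y, k=10, param_grid=(1, 3, 5, 7, 9)):
    fold_scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True,
                                     random_state=0).split(X):
        X_T, y_T = X[train_idx], y[train_idx]        # T = D minus held-out subset
        X_hold, y_hold = X[test_idx], y[test_idx]    # held-out subset

        # Divide T into T1 (training) and T2 (tuning).
        X_T1, X_T2, y_T1, y_T2 = train_test_split(
            X_T, y_T, test_size=0.25, random_state=0)

        # Tune the placeholder parameter on T2 only.
        best_p = max(param_grid, key=lambda p: accuracy_score(
            y_T2,
            KNeighborsClassifier(n_neighbors=p).fit(X_T1, y_T1).predict(X_T2)))

        # Once tuning is done, retrain on all of T and measure on the held-out subset.
        clf = KNeighborsClassifier(n_neighbors=best_p).fit(X_T, y_T)
        fold_scores.append(accuracy_score(y_hold, clf.predict(X_hold)))

    # Overall accuracy is averaged across all k partitions.
    return float(np.mean(fold_scores))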
Conclusion
• No single classification algorithm is the best for all
problems.
• Comparative studies typically include at least one new
algorithm and several known methods; these studies
must be very careful about their methods and their
claims.
References
[1]. Prechelt, L. A quantitative study of experimental evaluations
of neural network learning algorithms: Current research
practice. Neural Networks, 9, 1996.
[2]. Raftery, A. Bayesian model selection in social research (with
discussion by Andrew Gelman, Donald B. Rubin, and Robert M.
Hauser). In Peter Marsden, editor, Sociological Methodology
1995, pages 111–196. Blackwells, Oxford, UK, 1995.