COMP527: Data Mining
Classification: Evaluation
M. Sulaiman Khan ([email protected])
Dept. of Computer Science, University of Liverpool
February 23, 2009
COMP527: Data Mining
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam
Today's Topics
Evaluation
Samples
Cross Validation
Bootstrap
Confidence of Accuracy
Evaluation
We need some way to quantitatively evaluate the results of data mining:
- Just how accurate is the classification?
- How accurate can we expect a classifier to be?
- If we can't evaluate the classifier, how can it be improved?
- Can different types of classifier be evaluated in the same way?
- What are useful criteria for such a comparison?
- How can we evaluate clusters or association rules?
There are lots of issues to do with evaluation.
Evaluation
Assuming classification, the basic evaluation is how many correct predictions the classifier makes as opposed to incorrect predictions.
We can't test on the data used to train the classifier and get an accurate result; the result is "hopelessly optimistic" (Witten). Eg: due to over-fitting, a classifier might get 100% accuracy on the data it was trained from and 0% accuracy on other data. This is called the resubstitution error rate: the error rate when you substitute the data back into the classifier generated from it.
So we need some new, but labeled, data to test on.
Validation
Most of the time we do not have enough data to have a lot for training and a lot for testing, though sometimes this is possible (eg sales data).
Some systems have two phases of training: an initial learning period and then fine tuning, for example the Growing and Pruning sets for building trees. It's important not to use that validation set for evaluation either.
Note that this reduces the amount of data that you can actually train on by a significant amount.
Numeric Data, Multiple Classes
Further issues to consider:
- Some classifiers produce probabilities for one or more classes. We need some way to handle these probabilities, so that a classifier can be partly correct. Also, for multi-class problems (eg an instance has 2 or more classes) we need some 'cost' function for getting an accurate subset of the classes.
- Regression/numeric prediction produces a numeric value. We need statistical tests to determine how accurate this is, rather than the true/false of nominal classes.
Hold Out Method
Obvious answer: keep part of the data set aside for testing purposes and use the rest to train the classifier. Then use the test set to evaluate the resulting classifier in terms of accuracy.
Accuracy: number of correctly classified instances / total number of instances to classify.
The ratio is often 2/3rds training, 1/3rd test.
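As a concrete illustration, here is a minimal Python sketch of the hold-out method. It is not taken from the course materials; train_classifier and predict are hypothetical placeholders for whatever learning scheme is being evaluated, and the data is assumed to be a list of (features, label) pairs.

```python
import random

def holdout_split(instances, test_fraction=1/3, seed=42):
    """Shuffle the labelled instances and hold out a fraction of them for testing."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))   # eg 2/3 train, 1/3 test
    return shuffled[:cut], shuffled[cut:]

def accuracy(predictions, labels):
    """Accuracy = correctly classified instances / total number of instances."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Usage sketch (train_classifier and predict are hypothetical placeholders):
# train, test = holdout_split(labelled_data)
# model = train_classifier(train)
# print(accuracy([predict(model, x) for x, y in test], [y for x, y in test]))
```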
How should we select the instances for each section?
Samples
Easy: Randomly select instances.
But the data could be very unbalanced: eg 99% one class, 1% the other class. Then random sampling is likely to not draw any of the 1% class.
Stratified: group the instances by class and then select a proportionate number from each class.
Balanced: randomly select a desired amount of minority class instances, and then add the same number from the majority class.
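A small Python sketch of stratified and balanced selection, assuming each instance is a (features, label) pair; this is illustrative only, not code from the lecture.

```python
import random
from collections import defaultdict

def by_class(instances):
    """Group (features, label) pairs by their class label."""
    groups = defaultdict(list)
    for features, label in instances:
        groups[label].append((features, label))
    return groups

def stratified_sample(instances, fraction, seed=0):
    """Draw the same proportion of instances from every class."""
    rng = random.Random(seed)
    sample = []
    for members in by_class(instances).values():
        sample.extend(rng.sample(members, max(1, round(len(members) * fraction))))
    return sample

def balanced_sample(instances, n_per_class, seed=0):
    """Draw an equal number of instances from each class (minority class size permitting)."""
    rng = random.Random(seed)
    sample = []
    for members in by_class(instances).values():
        sample.extend(rng.sample(members, min(n_per_class, len(members))))
    return sample
```

With a 99%/1% split, stratified sampling keeps the rare class represented in proportion, while balanced sampling deliberately gives both classes equal weight.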
Small Data Sets
For small data sets, removing some instances as a test set and still having a representative set to train from is hard. Solutions?
Repeat the process multiple times, selecting a different test set each time. Then find the error from each split and average across all of the iterations. Of course there's no reason to do this only for small data sets!
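A sketch of this repeated hold-out procedure (again illustrative only; train_fn and predict_fn are hypothetical callables, and instances is a list of (features, label) pairs):

```python
import random

def repeated_holdout_error(instances, train_fn, predict_fn, repeats=10, test_fraction=1/3):
    """Average the test error over several different random train/test splits."""
    errors = []
    for i in range(repeats):
        shuffled = list(instances)
        random.Random(i).shuffle(shuffled)            # a different random split each time
        cut = int(len(shuffled) * (1 - test_fraction))
        train, test = shuffled[:cut], shuffled[cut:]
        model = train_fn(train)
        wrong = sum(1 for x, y in test if predict_fn(model, x) != y)
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)                  # mean error across the iterations
```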
Different test sets might still overlap, which might give a biased estimate of the accuracy (eg if the same 'good' records are randomly drawn multiple times).
Can we prevent this?
Cross Validation
Split the dataset up into k parts, then use each part in turn as the test set and the others as the training set.
If each part is also stratified, we get stratified cross validation, rather than perhaps ending up with a non-representative sample in one or more parts.
Common values for k are 3 (eg matching the 2/3:1/3 hold out split) and 10. Hence: stratified 10-fold cross validation.
Again, the error values are averaged after the k iterations.
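A minimal Python sketch of k-fold cross validation (train_fn and predict_fn are hypothetical placeholders, instances a list of (features, label) pairs); grouping the instances by class before the round-robin split would give approximately stratified folds.

```python
def cross_validation_error(instances, train_fn, predict_fn, k=10):
    """k-fold cross validation: each of the k parts is the test set exactly once."""
    folds = [instances[i::k] for i in range(k)]       # simple round-robin split into k parts
    errors = []
    for i in range(k):
        test = folds[i]
        train = [inst for j, fold in enumerate(folds) if j != i for inst in fold]
        model = train_fn(train)
        wrong = sum(1 for x, y in test if predict_fn(model, x) != y)
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)                  # average error over the k iterations
```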
Cross Validation
Why 10? Extensive testing shows it to be a good middle ground: not too much processing, not too random.
Cross validation is used extensively throughout the data mining literature. It's the simplest and easiest to understand evaluation technique, while having good accuracy.
There are other similar evaluation techniques, however ...
Leave One Out
Select one instance and train on all the others. Then see if that instance is correctly classified. Repeat for every instance and find the percentage of accurate results.
Eg: N-fold cross validation, where N is the number of instances in the data set.
Attractive:
- If 10 is good, surely N is better :)
- No random sampling problems
- Trains with the maximum amount of data
Leave One Out
Disadvantages:
- Computationally expensive: builds N models!
- Guarantees a non-stratified, non-balanced sample.
Worst case: the class distribution is exactly 50/50 and the data is so complicated that the classifier simply picks the most common class. Removing the test instance leaves its class in the minority of the training set, so the classifier will always pick the wrong class: 0% estimated accuracy.
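For completeness, a leave-one-out sketch, ie k-fold cross validation with k = N (train_fn and predict_fn are hypothetical placeholders, as before); the one-model-per-instance loop is what makes it expensive.

```python
def leave_one_out_accuracy(instances, train_fn, predict_fn):
    """Train on all instances except one, test on the held-out instance, repeat for every instance."""
    correct = 0
    for i, (x, y) in enumerate(instances):
        train = instances[:i] + instances[i + 1:]     # everything except instance i
        model = train_fn(train)                       # builds one model per instance: N models
        if predict_fn(model, x) == y:
            correct += 1
    return correct / len(instances)
```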
Bootstrap
Until now, the sampling has been without replacement (eg each instance occurs exactly once, either in the training set or the test set).
However, we could put an instance back to be drawn again: sampling with replacement.
This results in the 0.632 bootstrap evaluation technique.
Draw a training set from the data set with replacement such that the number of instances in both is the same, then use the instances which are not in the training set as the test set. (Eg some instances will appear more than once in the training set.)
Statistically, the likelihood of an instance never being picked is (1 - 1/n)^n, which is approximately 0.368 for large n; hence the name, since about 0.632 of the distinct instances end up in the training set.
Bootstrap
Eg: we have a dataset of 1000 instances. We sample with replacement 1000 times, ie we randomly select an instance from all 1000 instances, 1000 times.
This should leave approximately 368 instances that have never been selected. We set these aside and use them as the test set.
The error rate on this test set will be pessimistic: we are only training on about 63% of the distinct data, with some repeated instances. We compensate by combining it with the optimistic error rate from resubstitution:
error rate = 0.632 * error-on-test + 0.368 * error-on-training
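A sketch of a single 0.632 bootstrap round following this recipe (train_fn and predict_fn are hypothetical placeholders; in practice the whole procedure is usually repeated with different random draws and the results averaged):

```python
import random

def bootstrap_632_error(instances, train_fn, predict_fn, seed=0):
    """One round of the 0.632 bootstrap error estimate."""
    rng = random.Random(seed)
    n = len(instances)
    drawn = [rng.randrange(n) for _ in range(n)]      # n draws WITH replacement
    train = [instances[i] for i in drawn]             # ~63.2% of distinct instances, some repeated
    test = [instances[i] for i in set(range(n)) - set(drawn)]   # the ~36.8% never drawn

    model = train_fn(train)

    def error(data):
        wrong = sum(1 for x, y in data if predict_fn(model, x) != y)
        return wrong / len(data)

    # pessimistic test error combined with the optimistic resubstitution (training) error
    return 0.632 * error(test) + 0.368 * error(train)
```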
Confidence of Accuracy
What about the size of the test set? More test instances should make us more confident that the estimated accuracy is close to the true accuracy. Eg: getting 75% on 10,000 samples is more likely to be close to the true accuracy than 75% on 10.
A series of events that succeed or fail is a Bernoulli process, eg coin tosses. We can count S successes from N trials, and then compute S/N ... but what does that tell us about the true accuracy rate?
Statistics can tell us the range within which the true accuracy rate should fall. Eg: with 750 successes out of 1000 trials, the true accuracy is very likely to be between 73.2% and 76.7%.
(Witten 147 to 149 has the full maths!)
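A sketch of that calculation, using the Wilson score interval for a Bernoulli success rate as given in Witten; with z = 1.28 (roughly an 80% confidence level) it reproduces the 73.2% to 76.7% range quoted above for 750 successes out of 1000.

```python
from math import sqrt

def wilson_interval(successes, trials, z=1.28):
    """Range for the true success rate of a Bernoulli process.

    z is the standard-normal quantile for the chosen confidence level:
    z = 1.28 is roughly 80% confidence, z = 1.96 roughly 95%.
    """
    f = successes / trials                      # observed success rate, S/N
    centre = f + z * z / (2 * trials)
    spread = z * sqrt(f * (1 - f) / trials + z * z / (4 * trials * trials))
    denom = 1 + z * z / trials
    return (centre - spread) / denom, (centre + spread) / denom

low, high = wilson_interval(750, 1000)          # approximately (0.732, 0.767)
```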
Confidence of Accuracy
We might wish to compare two classifiers of different types. We could simply compare the accuracy from 10-fold cross validation, but there's a more rigorous method: Student's t-test.
Method:
- Perform 10-fold cross validation (TCV) 10 times for the first classifier, eg 10 times TCV = 100 models.
- Perform the same repeated TCV with the second classifier.
- This gives us x1..x10 for the first classifier and y1..y10 for the second.
- Find the mean of the 10 cross-validation runs for each.
- Find the difference between the two means.
We want to know if the difference is statistically significant.
Student's T-Test
We then find t by:
t = d / sqrt(σ² / k)
where d is the difference between the two means, k is the number of times the cross validation was performed, and σ² is the variance of the differences between the paired samples.
(The variance is computed from the squared differences between the mean and the individual values.)
Then look up the value in the t-table for k-1 degrees of freedom. (More tables! But printed in Witten pg 155.)
If t is greater than the critical value z in the table, then the difference is statistically significant.
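A sketch of the calculation, assuming x and y hold the ten per-repetition cross-validation accuracies of the two classifiers; the example numbers are made up purely to show the call, and the critical value still has to come from a t-table (or a statistics library).

```python
from math import sqrt

def paired_t_statistic(x, y):
    """t statistic for paired accuracy estimates x[i], y[i] of two classifiers."""
    k = len(x)
    d = [xi - yi for xi, yi in zip(x, y)]                     # per-run differences
    d_mean = sum(d) / k                                       # equals the difference of the means
    variance = sum((di - d_mean) ** 2 for di in d) / (k - 1)  # sample variance of the differences
    return d_mean / sqrt(variance / k)

# Made-up accuracies from ten repetitions of ten-fold cross validation, for illustration only:
x = [0.81, 0.79, 0.83, 0.80, 0.78, 0.82, 0.81, 0.80, 0.79, 0.82]
y = [0.76, 0.78, 0.75, 0.77, 0.74, 0.78, 0.76, 0.75, 0.77, 0.76]
t = paired_t_statistic(x, y)
# Compare |t| with the tabulated critical value for k - 1 = 9 degrees of freedom
# (about 2.262 for a two-sided test at the 5% significance level).
```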
Further Reading
Introductory statistical text books, again
Witten, 5.1-5.4
Han 6.2, 6.12, 6.13
Berry and Browne, 1.4
Devijver and Kittler, Chapter 10