Principles of Data Mining. Published by Springer-Verlag, 2007.

+ Clashes in a Training Set
+ Adapting TDIDT to Deal With Clashes
+ Overfitting Rules to Data

Clashes: two (or more) instances in a training set have the same combination of attribute values but different classifications.

Why do clashes happen?
– One of the instances has its data incorrectly recorded, i.e. there is noise in the data.
– The clashing instances are all correct, but it is not possible to discriminate between them on the basis of the attributes recorded.

+ Strategy (I): the 'delete branch' strategy, which discards the clashing instances.
+ Strategy (II): the 'majority voting' strategy, which assigns the clash set its most common classification.

* The 'delete branch' strategy (a threshold of 100%) and the 'majority voting' strategy (a threshold of 0%) are too extreme for the usual cases. Therefore, we use a clash threshold.

Clash threshold: a percentage from 0 to 100 inclusive.
– Normal usage might be 60%, 70%, 80% or 90%.
– If the proportion of instances in the clash set with the most common classification is at least equal to the clash threshold, classify them all as that class.
– Otherwise, discard them all.
(A sketch of this rule appears at the end of this section.)
EX: credit checking dataset.

* The predictive accuracy for the training data is of no importance: we already know the classifications! It is the accuracy for the test data that matters.
* Under the 'default classification strategy', each unclassified instance is automatically allocated to the largest class.

Overfitting occurs when a model is excessively complex, for example when it has too many parameters relative to the number of observations.

What causes overfitting?
– Noise in the data.
– Too little training data, so that some attributes happen to divide the training data well purely by coincidence.
In the classic curve-fitting illustration, the simple linear function provides the better prediction on unseen data (see the second sketch below).

Consider a typical rule such as
– IF a = 1 and b = yes and z = red THEN class = OK
Specialise (add a term):
– IF a = 1 and b = yes and z = red and k = green THEN class = OK
Generalise (remove terms):
– IF a = 1 and b = yes THEN class = OK
(The third sketch below shows these two operations on a rule.)

Example (a gold digger, named A): give A a present, and A feels happy. Take A for a ride in a Lamborghini, and A feels very happy. Have a one-night stand with A, and A feels extremely happy. (A princess, named B): give her a present, and B feels happy. Take her for a ride in a Lamborghini, and B feels very happy. Have a one-night stand with her, and B regards you as a playboy. Not every girl responds the same way; if a machine learns this wrong formula from one person's examples, it has overfitted.

With the same accuracy, the simplest explanation is the best (Ockham's Razor).
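A minimal sketch of the clash-threshold rule described above (this is not code from the book; the function name resolve_clash and the 'good'/'bad' credit labels are illustrative assumptions):

```python
from collections import Counter

def resolve_clash(classes, threshold):
    """Resolve a clash set using a clash threshold (a percentage, 0-100).

    classes: the classifications of the instances in the clash set.
    Returns the most common class if its share of the clash set is at
    least the threshold, or None to signal the set should be discarded.
    """
    most_common, count = Counter(classes).most_common(1)[0]
    if 100 * count / len(classes) >= threshold:
        return most_common   # classify every instance in the set as this class
    return None              # discard all the clashing instances

# A clash set of 10 instances: 7 classified 'good', 3 classified 'bad'.
clash_set = ["good"] * 7 + ["bad"] * 3
print(resolve_clash(clash_set, 70))   # 'good'  (70% >= 70%)
print(resolve_clash(clash_set, 80))   # None    (70% < 80%, discard)
print(resolve_clash(clash_set, 0))    # 'good'  (threshold 0 = majority voting)
print(resolve_clash(clash_set, 100))  # None    (threshold 100 = delete branch)
```

Note how the two extreme strategies fall out as the special cases: a threshold of 0 always keeps the majority class, and a threshold of 100 always discards a genuine clash set.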
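The point that training accuracy is of no importance, while the simpler function predicts unseen data better, can be reproduced with a small curve-fitting experiment. This is a generic sketch, not an example from the book; the underlying relationship y = 2x + 1, the noise level, and the polynomial degrees are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy training samples of an underlying linear relationship y = 2x + 1.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + 1 + rng.normal(scale=0.2, size=x_train.size)

# Test points drawn from the true (noise-free) relationship.
x_test = np.linspace(0, 1, 100)
y_test = 2 * x_test + 1

for degree in (1, 7):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The degree-7 polynomial has many parameters relative to the ten observations, so it fits the noise: its training error is near zero, yet it typically predicts the unseen test points worse than the straight line.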
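Finally, a sketch of specialising and generalising a rule by adding or removing a term. The dictionary representation and the helper names specialise, generalise, and fires are assumptions for illustration, not the book's notation:

```python
# A rule is a set of attribute = value conditions plus a predicted class,
# e.g. IF a = 1 and b = yes and z = red THEN class = OK.
rule = {"conditions": {"a": 1, "b": "yes", "z": "red"}, "class": "OK"}

def specialise(rule, attribute, value):
    """Add a term: the specialised rule fires on fewer instances."""
    conditions = dict(rule["conditions"], **{attribute: value})
    return {"conditions": conditions, "class": rule["class"]}

def generalise(rule, attribute):
    """Remove a term: the generalised rule fires on more instances."""
    conditions = {k: v for k, v in rule["conditions"].items() if k != attribute}
    return {"conditions": conditions, "class": rule["class"]}

def fires(rule, instance):
    """True if the instance satisfies every condition of the rule."""
    return all(instance.get(k) == v for k, v in rule["conditions"].items())

instance = {"a": 1, "b": "yes", "z": "red", "k": "blue"}
print(fires(rule, instance))                            # True
print(fires(specialise(rule, "k", "green"), instance))  # False: extra term excludes it
print(fires(generalise(rule, "z"), instance))           # True: fewer terms, wider coverage
```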