COMP527: Data Mining
M. Sulaiman Khan ([email protected])
Dept. of Computer Science, University of Liverpool, 2009
Classification: Rules -- February 10, 2009

Module outline: Introduction to the Course; Introduction to Data Mining; Introduction to Text Mining; General Data Mining Issues; Data Warehousing; Classification: Challenges, Basics; Classification: Rules; Classification: Trees; Classification: Trees 2; Classification: Bayes; Classification: Neural Networks; Classification: SVM; Classification: Evaluation; Classification: Evaluation 2; Regression, Prediction; Input Preprocessing; Attribute Selection; Association Rule Mining; ARM: A Priori and Data Structures; ARM: Improvements; ARM: Advanced Techniques; Clustering: Challenges, Basics; Clustering: Improvements; Clustering: Advanced Algorithms; Hybrid Approaches; Graph Mining, Web Mining; Text Mining: Challenges, Basics; Text Mining: Text-as-Data; Text Mining: Text-as-Language; Revision for Exam.

Today's Topics

Introduction
Rule Sets vs Rule Lists
Constructing Rules-Based Classifiers
1R
PRISM
Reduced Error Pruning
RIPPER
Rules with Exceptions

Rules-Based Classifiers

Idea: learn a set of rules from the data, then apply those rules to determine the class of a new instance. For example:

R1. If blood-type=Warm and lays-eggs=True then Bird
R2. If blood-type=Cold and flies=False then Reptile
R3. If blood-type=Warm and lays-eggs=False then Mammal

Hawk: flies=True, blood-type=Warm, lays-eggs=True, class=???
R1 is true, so the classifier predicts that Hawk = Bird. Yay!

A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule. The coverage of a rule is the percentage of records that satisfy its condition.
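To make the definition concrete, here is a minimal Python sketch (the dict representation and the names covers and coverage are illustrative, not from the course):

```python
def covers(rule, instance):
    """A rule covers an instance if every attribute test in the rule's
    condition is satisfied by the instance's attribute values."""
    return all(instance.get(attr) == val for attr, val in rule["if"].items())

def coverage(rule, records):
    """Coverage: the fraction of records satisfying the rule's condition."""
    return sum(covers(rule, rec) for rec in records) / len(records)

# Rule R1 from above: if blood-type=Warm and lays-eggs=True then Bird
r1 = {"if": {"blood-type": "Warm", "lays-eggs": True}, "then": "Bird"}
hawk = {"flies": True, "blood-type": "Warm", "lays-eggs": True}
print(covers(r1, hawk))  # True, so the classifier predicts Bird
```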
The accuracy of a rule is the percentage of covered records whose class matches the rule's conclusion. For example, a rule might cover 10/50 records (coverage 20%), of which 8 are classified correctly (accuracy 80%).

Rule Set vs Rule List

Rules can be grouped either as an unordered set or as an ordered list.

Set: the rules make independent predictions. Every record is covered by 0..1 rules (hopefully exactly 1!).

R1. If flies=True and lays-eggs=True and lives-in-water=False then Bird
R2. If flies=False and lives-in-water=True and lays-eggs=True then Fish
R3. If blood-type=Warm and lays-eggs=False then Mammal
R4. If blood-type=Warm and lays-eggs=True then Reptile

It doesn't matter in which order we evaluate these rules.

List: the rules make dependent predictions, and the first matching rule fires. Every record is covered by 0..* rules (hopefully 1..*!).

R1. If flies=True and lays-eggs=True then Bird
R2. If blood-type=Warm and lays-eggs=False then Mammal
R3. If lives-in-water=True then Fish
R4. If lays-eggs=True then Reptile

It does matter in which order we evaluate these rules.

If every record is covered by at least one rule, the rule set (or list) is said to be exhaustive.

Constructing Rules-Based Classifiers

Covering approach: at each stage, find a rule that covers some of the instances. "Separate and Conquer": choose a rule that identifies many instances, separate them out, and repeat.

But first, a very, very simple classifier called "1R" -- so named because each rule tests exactly one attribute.

1R Classifier

Idea: construct one rule for each attribute/value combination, predicting the most common class for that combination.
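Before working an example, the idea can be sketched in Python (a minimal illustration, assuming records are dicts of attribute values; one_r is a hypothetical name, not the course's reference code):

```python
from collections import Counter, defaultdict

def one_r(records, attributes, target):
    """1R: for each attribute, build one rule per value (predicting the
    majority class for that value), then keep the attribute whose rule
    set makes the fewest errors on the training data."""
    best = None
    for attr in attributes:
        # class distribution for each value of this attribute
        dist = defaultdict(Counter)
        for rec in records:
            dist[rec[attr]][rec[target]] += 1
        rules = {val: counts.most_common(1)[0][0]
                 for val, counts in dist.items()}
        errors = sum(count for val, counts in dist.items()
                     for cls, count in counts.items() if cls != rules[val])
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best  # (attribute, {value: predicted class}, total errors)
```

Ties between majority classes fall to whichever class was seen first, standing in for the slides' random choice.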
Example data (the weather dataset):

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

Rules generated:

Attribute    Rules                Errors  Total Errors
Outlook      sunny » no           2/5     4/14
             overcast » yes       0/4
             rainy » yes          2/5
Temperature  hot » no (random)    2/4     5/14
             mild » yes           2/6
             cool » yes           1/4
Humidity     high » no            3/7     4/14
             normal » yes         1/7
Windy        false » yes          2/8     5/14
             true » no (random)   3/6

Now choose the attribute with the fewest total errors: Outlook and Humidity tie on 4/14, so decide randomly on Outlook. 1R will then simply use the Outlook attribute to predict the class of new instances.

1R Algorithm

foreach attribute
    foreach value of that attribute
        find the class distribution for attribute=value
        conc = most frequent class
        make rule: attribute=value -> conc
    calculate the error rate of the ruleset
select the ruleset with the lowest error rate

Almost not worth wasting a slide on, really!

PRISM Classifier

Instead of always looking at the full data set, after constructing each rule we can remove the instances that the rule covers before looking for a new rule. Start with a high-coverage rule, then increase its accuracy by adding more conditions to it. We want to maximise the accuracy of each rule: maximise p/t, the ratio of positive instances to covered instances.
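Choosing, among candidate tests A=v, the one that maximises p/t can be sketched as follows (illustrative code, not PRISM's reference implementation; best_condition is a hypothetical name, and p/t ties are broken in favour of greater coverage, as in the worked example):

```python
def best_condition(records, attributes, target, positive, rule):
    """Among attribute=value tests on attributes not already used in the
    rule, pick the one maximising p/t (positives covered / total covered);
    break p/t ties in favour of the greater coverage t."""
    best = None  # (ratio, t, attribute, value)
    for attr in attributes:
        if attr in rule:
            continue  # each attribute is tested at most once per rule
        for val in sorted({rec[attr] for rec in records}):
            covered = [rec for rec in records if rec[attr] == val]
            p = sum(1 for rec in covered if rec[target] == positive)
            cand = (p / len(covered), len(covered), attr, val)
            # tuple comparison: ratio first, then coverage t
            if best is None or cand[:2] > best[:2]:
                best = cand
    return best
```

Here records is the set of instances still under consideration and rule is the dict of conditions chosen so far.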
We finish adding conditions when p/t = 1, or when there are no more instances to look at.

PRISM Classifier

Following Witten (pp. 6, 108+), on the contact lens data:

If X then recommendation=hard

Find the condition X with the highest coverage ratio:

Age = Young                   2/8
Age = Pre-presbyopic          1/8
Age = Presbyopic              1/8
Prescription = Myope          3/12
Prescription = Hypermetrope   1/12
Astigmatism = no              0/12
Astigmatism = yes             4/12
Tear-Production = Reduced     0/12
Tear-Production = Normal      4/12

Select Astigmatism = yes (arbitrarily, over Tear-Production = Normal). This rule covers:

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   Yes          Reduced               None
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            Yes          Reduced               None
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   Yes          Reduced               None
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   Yes          Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            Yes          Reduced               None
Presbyopic      Hypermetrope            Yes          Normal                None

Now we need to add another condition to make the rule more accurate:

If astigmatism = yes and X then recommendation = hard

The best condition is Tear-Production = Normal (4/6), giving the new rule:

If astigmatism = yes and tear-production = normal then recommendation = hard

But there is still some inaccuracy... Age = Young (2/2) and Prescription = Myope (3/3) both have a 100% ratio on the remaining instances, so choose the one with the greater coverage:

If astigmatism = yes and tear-production = normal and prescription = myope then recommendation = hard

Repeat the process, removing the instances covered by this rule; then repeat for all classes.

Now try it with the other example data set.
If X then play=yes, over the weather data above.

Outlook=overcast is 4/4: already perfect, so remove those instances and look again.

With the reduced dataset, if X then play=yes:

Outlook  Temperature  Humidity  Windy  Play?
sunny    hot          high      false  no
sunny    hot          high      true   no
rainy    mild         high      false  yes
rainy    cool         normal    false  yes
rainy    cool         normal    true   no
sunny    mild         high      false  no
sunny    cool         normal    false  yes
rainy    mild         normal    false  yes
sunny    mild         normal    true   yes
rainy    mild         high      true   no

Ratios: sunny 2/5, rainy 3/5, hot 0/2, mild 3/5, cool 2/3, high 1/5, normal 4/5, false 4/6, true 1/4.

Select humidity=normal (4/5); the rule is not yet perfect, so look for another condition.

If humidity=normal and X then play=yes:

Outlook  Temperature  Humidity  Windy  Play?
rainy    cool         normal    false  yes
rainy    cool         normal    true   no
sunny    cool         normal    false  yes
rainy    mild         normal    false  yes
sunny    mild         normal    true   yes

If we could use 'and-not' we could write: and-not (temperature=cool and windy=true). But instead the candidates are: rainy 2/3, sunny 2/2, cool 2/3, mild 2/2, false 3/3, true 1/2. So we select windy=false, which maximises t among the 100%-accurate candidates, and add that to the rule.
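The whole separate-and-conquer process just traced can be sketched end-to-end (a simplified, illustrative implementation assuming categorical attributes stored in dicts; grow_rule and prism are hypothetical names):

```python
def grow_rule(records, attributes, target, positive):
    """Grow one PRISM rule for the given class: repeatedly add the
    attribute=value test maximising p/t (ties broken by coverage t)
    until the rule is perfect or no attributes remain."""
    rule = {}
    covered = list(records)
    while covered and any(rec[target] != positive for rec in covered):
        best = None  # (ratio, t, attribute, value, newly covered records)
        for attr in attributes:
            if attr in rule:
                continue
            for val in sorted({rec[attr] for rec in covered}):
                sub = [rec for rec in covered if rec[attr] == val]
                p = sum(1 for rec in sub if rec[target] == positive)
                cand = (p / len(sub), len(sub), attr, val, sub)
                if best is None or cand[:2] > best[:2]:
                    best = cand
        if best is None:  # every attribute already used; give up
            break
        rule[best[2]] = best[3]
        covered = best[4]
    return rule, covered

def prism(records, attributes, target, positive):
    """Separate and conquer: grow a rule, remove the instances it
    covers, and repeat while instances of the class remain."""
    remaining, rules = list(records), []
    while any(rec[target] == positive for rec in remaining):
        rule, covered = grow_rule(remaining, attributes, target, positive)
        rules.append(rule)
        remaining = [rec for rec in remaining if rec not in covered]
    return rules
```

On the weather data this reproduces the trace above: the first rule is outlook=overcast, the second humidity=normal and windy=false.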
PRISM Algorithm

for each class C
    initialise E to the complete instance set
    while E contains instances with class C
        create an empty rule R: if X then C
        until R is perfect (or no more attributes remain):
            for each attribute A not in R, and each value v,
                consider adding A=v to R
            select the A and v that maximise the accuracy p/t
            add A=v to R
        remove the instances covered by R from E

Issues with PRISM

Overfitting. As we saw, we had 4/5 and then gave up one positive instance to get rid of the one negative instance. But with more examples it might have been 199/200, and we might need to give up 40 positives to remove that one negative... that's crazy.

Measure 2: Information Gain

p * (log(p/t) - log(P/T))

where p/t is as before (positives over total covered), and P and T are the positives and total covered before the new condition was added. This emphasises conditions that retain a large number of positive examples. Use it in PRISM in place of maximising p/t.

Over-Fitting Avoidance

We could split the training set into a growing set and a pruning set: grow rules out using the first, then try to cut the rules back with the pruning set. Two strategies:

Reduced-error pruning: build the full rule set, then prune the rules.
Incremental reduced-error pruning: simplify each rule as it is built; the data can be re-split after each rule. Let's look at this one.

Incremental Reduced-Error Pruning

initialise E to the instance set
until E is empty:
    split E into Grow and Prune (ratio 2:1)
    for each class C
        generate the best rule for C from Grow
        using Prune: calculate worth(R) and worth(R - finalCondition)
            while worth(R-) > worth(R), prune the rule
    from the rules for the different classes, select the one with the largest worth(R)
    remove the instances covered by that rule

Rule Worth?

How can we measure the worth of a rule?
(Witten, p. 203.) A first candidate, where p of the t instances covered by the rule are positive (so n = t - p are negative), and N and T are the total negatives and total instances:

(p + (N - n)) / T

i.e. (true positives + true negatives) / total number of instances: the positives covered, plus the negatives left uncovered, over all instances. Since p + (N - n) = 2p - t + N:

p=2000, t=3000 --> (1000 + N) / T
p=1000, t=1001 --> (999 + N) / T

so this metric prefers the far less accurate first rule.

What about p/t itself? Same problem as before: it prefers p=1, t=1 over p=1000, t=1001. Is there a simple but intuitive measure of worth?

Issue with Grow/Prune Splitting

Say we have 1000 examples and split 2:1 for train/test (666/334), then 2:1 again for grow/prune (444/222)... we're building our rules on less than half of our data! Depending on the dataset, classes may be absent from the growing set, the class distributions may be badly skewed, or any number of other statistical problems may arise from sampling this heavily. This is ameliorated in the incremental version, which re-splits often; but we might still want to run the algorithm several times and pick the best result.

RIPPER

A rules-based classifier from industry.

If there are 2 classes, learn rules for one and make the other the default.
If there are more than 2 classes, start with the smallest and work up until 2 remain.
Information gain is used to grow rules.
The measure for pruning is (p - n) / (p + n), where p and n are the positive and negative examples covered in the pruning set.
It uses a 'Description Length' (DL) metric: Ockham's Razor says that the simplest solution is the best, so here the simplest rule set is the best.
(We won't go into how DL is calculated.)

RIPPER Algorithm

Repeated Incremental Pruning to Produce Error Reduction:

split E into Grow/Prune
BUILD: repeat until no examples remain, or DL of ruleset > minDL(rulesets) + 64, or error > 50%:
    GROW: add conditions, chosen by information gain, until the rule is 100% accurate
    PRUNE: prune conditions last-to-first while the worth metric W increases
for each rule R, for each class C:
    split E into Grow/Prune
    remove from Prune all instances covered by other rules
    GROW and PRUNE two competing rules:
        R1 is a new rule built from scratch
        R2 is generated by adding conditions to R
        prune using worth metric A on the reduced dataset
    replace R by whichever of R, R1 or R2 has the smallest DL
    if instances of C remain uncovered, return to BUILD to make more rules
calculate the DL of the ruleset, and of the ruleset with each rule omitted;
    delete any rule whose inclusion increases the DL
remove the instances covered by the generated rules

DL = Description Length; metric W = (p + 1) / (t + 2); metric A = (p + N - n) / T.

Rules with Exceptions

If we get more data after a ruleset has been generated, it might be useful to add exceptions to rules:

If X then class1, unless Y then class2

Consider our humidity rule:

If humidity=normal then play=yes, unless temperature=cool and windy=true then play=no

Exceptions were developed in the Induct system, as 'ripple-down rules'.

Further Reading

Witten, Sections 3.3, 3.5, 3.6, 4.1, 4.4
Dunham, Section 4.6
Han, Section 6.5
Berry and Browne, Chapter 8