Classification: Alternative Techniques
Dr. Hui Xiong, Rutgers University
Introduction to Data Mining

Classification: Alternative Techniques
Rule-based Classifier

Rule-Based Classifier
* Classify records by using a collection of "if...then..." rules
* Rule: (Condition) → y
  - where Condition is a conjunction of attribute tests and y is the class label
  - LHS: rule antecedent or condition
  - RHS: rule consequent
* Examples of classification rules:
  (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
  (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

Rule-based Classifier (Example)
(Figure: an example rule set r1-r5 for classifying vertebrates.)

Application of Rule-Based Classifier
* A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule
  - Rule r1 covers a hawk => Bird
  - Rule r3 covers the grizzly bear => Mammal

Rule Coverage and Accuracy
* Coverage of a rule: fraction of records that satisfy the antecedent of the rule
* Accuracy of a rule: fraction of the records that satisfy the antecedent that also satisfy the consequent

  Tid  Refund  Marital Status  Taxable Income  Class
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

* Example: (Status=Single) → No has Coverage = 40%, Accuracy = 50%

How does a Rule-Based Classifier Work?
* A lemur triggers rule r3, so it is classified as a mammal
* A turtle triggers both r4 and r5
* A dogfish shark triggers none of the rules

Characteristics of Rule-Based Classifier
* Mutually exclusive rules
  - The classifier contains mutually exclusive rules if the rules are independent of each other
  - Every record is covered by at most one rule
* Exhaustive rules
  - The classifier has exhaustive coverage if it accounts for every possible combination of attribute values
  - Every record is covered by at least one rule

From Decision Trees To Rules
* Decision tree: Refund? Yes → NO; No → Marital Status: {Single, Divorced} → Taxable Income: < 80K → NO, > 80K → YES; {Married} → NO
* Classification rules:
  (Refund=Yes) ==> No
  (Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
  (Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
  (Refund=No, Marital Status={Married}) ==> No
* The rules are mutually exclusive and exhaustive
* The rule set contains as much information as the tree

Rules Can Be Simplified
* (Using the same ten-record table as above, with class label "Cheat")
* Initial rule: (Refund=No) ∧ (Status=Married) → No
* Simplified rule: (Status=Married) → No

Effect of Rule Simplification
* Rules are no longer mutually exclusive
  - A record may trigger more than one rule
  - Solution?
    Ordered rule set
    Unordered rule set: use voting schemes
* Rules are no longer exhaustive
  - A record may not trigger any rule
  - Solution?
    Use a default class
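To make the ordered-rule-set idea concrete, here is a minimal Python sketch of a decision list with a default class, using the four rules extracted from the decision tree above. The rule encoding and helper names (`matches`, `classify`) are illustrative, not from the slides.

```python
# Minimal decision-list sketch: rules are checked in priority order and the
# first rule whose condition matches assigns the class; if none fire, the
# default class is returned.

def matches(condition, record):
    """A condition is a dict mapping attribute name -> predicate on its value."""
    return all(pred(record[attr]) for attr, pred in condition.items())

def classify(record, ordered_rules, default_class):
    for condition, label in ordered_rules:   # rules already sorted by priority
        if matches(condition, record):
            return label
    return default_class                     # no rule fired

# Rules derived from the decision tree on the slides:
rules = [
    ({"Refund": lambda v: v == "Yes"}, "No"),
    ({"Refund": lambda v: v == "No",
      "Marital Status": lambda v: v in ("Single", "Divorced"),
      "Taxable Income": lambda v: v < 80_000}, "No"),
    ({"Refund": lambda v: v == "No",
      "Marital Status": lambda v: v in ("Single", "Divorced"),
      "Taxable Income": lambda v: v >= 80_000}, "Yes"),
    ({"Refund": lambda v: v == "No",
      "Marital Status": lambda v: v == "Married"}, "No"),
]

record = {"Refund": "No", "Marital Status": "Single", "Taxable Income": 90_000}
print(classify(record, rules, default_class="No"))   # -> "Yes"
```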
Ordered Rule Set
* Rules are rank ordered according to their priority
  - An ordered rule set is known as a decision list
* When a test record is presented to the classifier
  - It is assigned to the class label of the highest-ranked rule it has triggered
  - If none of the rules fire, it is assigned to the default class

Rule Ordering Schemes
* Rule-based ordering
  - Individual rules are ranked based on their quality/priority
* Class-based ordering
  - Rules that belong to the same class appear together

Building Classification Rules
* Direct method: extract rules directly from data
  - e.g., RIPPER, CN2, 1R, and AQ
* Indirect method: extract rules from other classification models (e.g., decision trees, neural networks, SVMs)
  - e.g., C4.5rules

Direct Method: Sequential Covering
(Figure: the sequential covering algorithm.)

Example of Sequential Covering
(Figures: (ii) step 1; (iii) step 2, producing rule R1; (iv) step 3, producing rules R1 and R2.)

Aspects of Sequential Covering
* Rule growing
  - Rule evaluation
* Instance elimination
* Stopping criterion
* Rule pruning

Rule Growing
* Two common strategies: general-to-specific and specific-to-general
* (a) General-to-specific: start from the empty rule {} (Yes: 3, No: 4) and add the best candidate conjunct, e.g. Refund=No (Yes: 3, No: 4), Status=Single (Yes: 2, No: 1), Status=Divorced (Yes: 1, No: 0), Status=Married (Yes: 0, No: 3), ...; then refine further, e.g. adding Income > 80K (Yes: 3, No: 1)

Rule Evaluation
* The evaluation metric determines which conjunct should be added during rule growing:
  - Accuracy = n_c / n
  - Laplace = (n_c + 1) / (n + k)
  - M-estimate = (n_c + k*p) / (n + k)
  where
  n: number of instances covered by the rule
  n_c: number of instances of class c covered by the rule
  k: number of classes
  p: prior probability of class c

Rule Growing (Examples)
* CN2 algorithm:
  - Start from an empty conjunct: {}
  - Add the conjunct that minimizes the entropy measure: {A}, {A,B}, ...
  - Determine the rule consequent by taking the majority class of the instances covered by the rule
* RIPPER algorithm:
  - Start from an empty rule: {} => class
  - Add the conjunct that maximizes FOIL's information gain:
    R0: {} => class (initial rule)
    R1: {A} => class (rule after adding a conjunct)
    Gain(R0, R1) = t * [ log2(p1/(p1+n1)) - log2(p0/(p0+n0)) ]
    where
    t: number of positive instances covered by both R0 and R1
    p0, n0: number of positive/negative instances covered by R0
    p1, n1: number of positive/negative instances covered by R1
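The evaluation formulas above translate directly into code. A small sketch, with the worked numbers taken from the rule-growing example ({} covers 3 positives and 4 negatives; adding Income > 80K leaves 3 positives and 1 negative); the function names are mine.

```python
import math

# n  : instances covered by the rule
# nc : instances of class c covered by the rule
# k  : number of classes, p : prior probability of class c

def rule_accuracy(nc, n):
    return nc / n

def laplace(nc, n, k):
    return (nc + 1) / (n + k)

def m_estimate(nc, n, k, p):
    return (nc + k * p) / (n + k)

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain for extending rule R0 to R1.
    t = positives covered by both rules; taken as p1 here, since R1
    specializes R0 and therefore covers a subset of R0's positives."""
    t = p1
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

print(laplace(3, 4, 2))        # (3+1)/(4+2) = 0.667
print(foil_gain(3, 4, 3, 1))   # 3 * (log2(3/4) - log2(3/7)) ~= 2.42
```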
Instance Elimination
* Why do we need to eliminate instances?
  - Otherwise, the next rule would be identical to the previous rule
* Why do we remove positive instances?
  - To ensure that the next rule is different
* Why do we remove negative instances?
  - To prevent underestimating the accuracy of a rule
  - Compare rules R2 and R3 in the diagram

Stopping Criterion and Rule Pruning
* Examples of stopping criteria:
  - The rule does not improve significantly after adding a conjunct
  - The rule starts covering examples from another class
* Rule pruning
  - Similar to post-pruning of decision trees
  - Example: reduced error pruning using a validation set:
    Remove one of the conjuncts in the rule
    Compare the error rate on the validation set before and after pruning
    If the error improves, prune the conjunct

Summary of Direct Method
* Initial rule set is empty
* Repeat:
  - Grow a single rule
  - Remove instances covered by the rule
  - Prune the rule (if necessary)
  - Add the rule to the current rule set

Direct Method: RIPPER
* For a 2-class problem, choose one of the classes as the positive class and the other as the negative class
  - Learn the rules for the positive class
  - Use the negative class as the default
* For a multi-class problem
  - Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
  - Learn the rule set for the smallest class first, treating the rest as the negative class
  - Repeat with the next smallest class as the positive class

Direct Method: RIPPER
* Rule growing:
  - Start from an empty rule: {} → +
  - Add conjuncts as long as they improve FOIL's information gain
  - Stop when the rule no longer covers negative examples
  - Prune the rule immediately using incremental reduced error pruning
  - Measure for pruning: v = (p - n) / (p + n)
    p: number of positive examples covered by the rule in the validation set
    n: number of negative examples covered by the rule in the validation set
  - Pruning method: delete any final sequence of conditions that maximizes v

Direct Method: RIPPER
* Building a rule set:
  - Use the sequential covering algorithm
    Grow a rule to cover the current set of positive examples
    Eliminate both positive and negative examples covered by the rule
  - Each time a rule is added to the rule set, compute the new description length
    Stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far
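A sketch of RIPPER's pruning step as described above: compute v = (p - n)/(p + n) on the validation set and delete the final sequence of conjuncts that maximizes it. The rule representation (a list of conjuncts) and the `covered_counts` helper are assumptions for illustration, not RIPPER's actual interface.

```python
def prune_metric(p, n):
    """RIPPER's pruning measure v = (p - n) / (p + n) on the validation set."""
    return (p - n) / (p + n)

def prune_rule(conjuncts, covered_counts):
    """Delete the final sequence of conjuncts that maximizes v.

    `covered_counts(rule)` is assumed to return (p, n): the positive and
    negative validation examples covered by the (possibly truncated) rule.
    """
    best_rule = conjuncts
    best_v = prune_metric(*covered_counts(conjuncts))
    # Try dropping suffixes: rule[:-1], rule[:-2], ..., down to one conjunct.
    for cut in range(len(conjuncts) - 1, 0, -1):
        candidate = conjuncts[:cut]
        v = prune_metric(*covered_counts(candidate))
        if v > best_v:
            best_rule, best_v = candidate, v
    return best_rule
```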
Indirect Methods
(Figure: converting a decision tree into a rule set.)

Indirect Method: C4.5rules
* Extract rules for every path from the root to the leaf nodes
* For each rule r: A → y
  - Consider an alternative rule r': A' → y, where A' is obtained by removing one of the conjuncts in A
  - Compare the pessimistic error rate of r against all r'
  - Prune r if one of the r' has a lower pessimistic error rate
  - Repeat until the pessimistic error rate can no longer be improved

Indirect Method: C4.5rules
* Use class-based ordering
  - Rules that predict the same class are grouped together into the same subset
  - Compute the total description length for each class
  - Classes are ordered in increasing order of their total description length

Example
* Decision tree: Give Birth? Yes → Mammals; No → Live in Water? Yes → Fishes; Sometimes → Amphibians; No → Can Fly? Yes → Birds; No → Reptiles
* C4.5rules:
  (Give Birth=No, Can Fly=Yes) → Birds
  (Give Birth=No, Live in Water=Yes) → Fishes
  (Give Birth=Yes) → Mammals
  (Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
  ( ) → Amphibians

Characteristics of Rule-Based Classifiers
* As highly expressive as decision trees
* Easy to interpret
* Easy to generate
* Can classify new instances rapidly
* Performance comparable to decision trees

Classification: Alternative Techniques
Instance-Based Classifiers

Instance-Based Classifiers
* Store the training records
* Use the training records to predict the class label of unseen cases
(Figure: a set of stored cases with attributes Atr1 ... AtrN and a class label, matched against an unseen case.)

Instance-Based Classifiers
* Examples:
  - Rote-learner
    Memorizes the entire training data and performs classification only if the attributes of a record exactly match one of the training examples
  - Nearest neighbor
    Uses the k "closest" points (nearest neighbors) to perform classification

Nearest Neighbor Classifiers
* Basic idea:
  - If it walks like a duck and quacks like a duck, then it's probably a duck
  - Compute the distance from the test record to the training records and choose the k "nearest" records

Nearest-Neighbor Classifiers
* Requires three things
  - The set of stored records
  - A distance metric to compute the distance between records
  - The value of k, the number of nearest neighbors to retrieve
* To classify an unknown record:
  - Compute its distance to the training records
  - Identify the k nearest neighbors
  - Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

Definition of Nearest Neighbor
* The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
(Figures: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor.)

1-Nearest Neighbor
(Figure: Voronoi diagram induced by the training points.)

Nearest Neighbor Classification
* Compute the distance between two points:
  - Example: Euclidean distance d(p, q) = sqrt( Σ_i (p_i - q_i)² )
* Determine the class from the nearest neighbor list
  - Take the majority vote of the class labels among the k nearest neighbors
  - Or weigh each vote according to distance: weight factor w = 1/d²

Nearest Neighbor Classification...
* Choosing the value of k:
  - If k is too small, the classifier is sensitive to noise points
  - If k is too large, the neighborhood may include points from other classes

Nearest Neighbor Classification...
* Scaling issues
  - Attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes
  - Example:
    height of a person may vary from 1.5 m to 1.8 m
    weight of a person may vary from 90 lb to 300 lb
    income of a person may vary from $10K to $1M

Nearest Neighbor Classification...
* Problem with the Euclidean measure:
  - High-dimensional data: curse of dimensionality
  - Can produce counter-intuitive results:
    111111111110 vs 011111111111: d = 1.4142
    100000000000 vs 000000000001: d = 1.4142
  - Solution: normalize the vectors to unit length

Nearest Neighbor Classification...
* k-NN classifiers are lazy learners
  - They do not build models explicitly
  - Unlike eager learners such as decision tree induction and rule-based systems
  - Classifying unknown records is relatively expensive
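A compact k-NN sketch combining the pieces above (Euclidean distance, majority vote, and the optional 1/d² distance weighting); the toy 2-D training points in the usage example are made up for illustration.

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, test_point, k=3, weighted=False):
    """train: list of (feature_vector, label) pairs."""
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], test_point))[:k]
    votes = Counter()
    for x, label in neighbors:
        d = euclidean(x, test_point)
        # Distance-weighted voting (w = 1/d^2) per the slides; plain
        # majority vote when weighted=False.
        votes[label] += 1.0 / (d * d + 1e-12) if weighted else 1.0
    return votes.most_common(1)[0][0]

# Toy usage: two classes in 2-D.
train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
         ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
print(knn_predict(train, (0.2, 0.1), k=3))   # -> "A"
```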
Example: PEBLS
* PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)
  - Works with both continuous and nominal features
    For nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM)
  - Each record is assigned a weight factor
  - Number of nearest neighbors: k = 1

Example: PEBLS
* Distance between nominal attribute values:
  d(V1, V2) = Σ_i | n1i/n1 - n2i/n2 |
  where n1i (n2i) is the number of records with value V1 (V2) that belong to class i, and n1 (n2) is the total number of records with value V1 (V2)
* Class distributions from the ten-record tax table:

  Marital Status:                      Refund:
  Class  Single  Married  Divorced     Class  Yes  No
  Yes    2       0        1            Yes    0    3
  No     2       4        1            No     3    4

* d(Single, Married)   = |2/4 - 0/4| + |2/4 - 4/4| = 1
* d(Single, Divorced)  = |2/4 - 1/2| + |2/4 - 1/2| = 0
* d(Married, Divorced) = |0/4 - 1/2| + |4/4 - 1/2| = 1
* d(Refund=Yes, Refund=No) = |0/3 - 3/7| + |3/3 - 4/7| = 6/7

Example: PEBLS
* Example records:

  Tid  Refund  Marital Status  Taxable Income  Cheat
  X    Yes     Single          125K            No
  Y    No      Married         100K            No

* Distance between record X and record Y:
  Δ(X, Y) = w_X * w_Y * Σ_{i=1..d} d(X_i, Y_i)²
  where w_X = (number of times X is used for prediction) / (number of times X predicts correctly)
  - w_X ≅ 1 if X makes accurate predictions most of the time
  - w_X > 1 if X is not reliable for making predictions
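The MVDM computation can be checked against the worked distances above. A sketch, with `mvdm` and the `class_counts` structure as illustrative names; the class counts come from the ten-record tax table.

```python
# Modified value difference metric (MVDM) from the PEBLS slides:
# d(V1, V2) = sum_i | n1i/n1 - n2i/n2 |

def mvdm(value1, value2, class_counts):
    """class_counts[v] is a dict: class label -> count for attribute value v."""
    c1, c2 = class_counts[value1], class_counts[value2]
    n1, n2 = sum(c1.values()), sum(c2.values())
    classes = set(c1) | set(c2)
    return sum(abs(c1.get(cls, 0) / n1 - c2.get(cls, 0) / n2)
               for cls in classes)

# Class distribution of Marital Status from the ten-record table:
marital = {
    "Single":   {"Yes": 2, "No": 2},
    "Married":  {"Yes": 0, "No": 4},
    "Divorced": {"Yes": 1, "No": 1},
}
print(mvdm("Single", "Married", marital))    # 1.0
print(mvdm("Single", "Divorced", marital))   # 0.0
```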
Classification: Alternative Techniques
Bayesian Classifiers

Classification: Alternative Techniques
Ensemble Methods

Ensemble Methods
* Construct a set of classifiers from the training data
* Predict the class label of test records by combining the predictions made by multiple classifiers

Why do Ensemble Methods Work?
* Suppose there are 25 base classifiers
  - Each classifier has error rate ε = 0.35
  - Assume the errors made by the classifiers are uncorrelated
  - The ensemble (majority vote) is wrong only if at least 13 base classifiers are wrong, so the probability that the ensemble makes a wrong prediction is
    P(X ≥ 13) = Σ_{i=13..25} C(25, i) ε^i (1 - ε)^(25-i) = 0.06

General Approach
(Figure: the general ensemble procedure: create multiple data sets, build a classifier on each, combine their predictions.)

Types of Ensemble Methods
* Bayesian ensemble
  - Example: mixture of Gaussians
* Manipulate the data distribution
  - Example: resampling methods
* Manipulate the input features
  - Example: feature subset selection
* Manipulate the class labels
  - Example: error-correcting output coding
* Introduce randomness into the learning algorithm
  - Example: random forests

Bagging
* Sampling with replacement:

  Original Data:     1   2   3   4   5   6   7   8   9   10
  Bagging (Round 1): 7   8   10  8   2   5   10  10  5   9
  Bagging (Round 2): 1   4   9   1   2   3   2   7   3   2
  Bagging (Round 3): 1   8   5   10  5   5   9   6   3   7

* Build a classifier on each bootstrap sample
* Each record has probability 1 - (1 - 1/n)^n ≈ 0.632 (for large n) of being included in a given bootstrap sample

Bagging Algorithm
(Figure: pseudocode of the bagging algorithm.)

Bagging Example
* Consider the 1-dimensional data set:

  x:  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
  y:  1    1    1    -1   -1   -1   -1   1    1    1

* The classifier is a decision stump
  - Decision rule: x ≤ k versus x > k
  - The split point k is chosen based on entropy

Bagging Example
* Bootstrap samples (rounds 1-5):

  Round 1: x: 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9 | y: 1 1 1 1 -1 -1 -1 -1 1 1
  Round 2: x: 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1.0 1.0 1.0 | y: 1 1 1 -1 -1 -1 1 1 1 1
  Round 3: x: 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9 | y: 1 1 1 -1 -1 -1 -1 -1 1 1
  Round 4: x: 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9 | y: 1 1 1 -1 -1 -1 -1 -1 1 1
  Round 5: x: 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1.0 1.0 1.0 | y: 1 1 1 -1 -1 -1 -1 1 1 1

Bagging Example
* Bootstrap samples (rounds 6-10):

  Round 6:  x: 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1.0 | y: 1 -1 -1 -1 -1 -1 -1 1 1 1
  Round 7:  x: 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1.0 | y: 1 -1 -1 -1 -1 1 1 1 1 1
  Round 8:  x: 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1.0 | y: 1 1 -1 -1 -1 -1 -1 1 1 1
  Round 9:  x: 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1.0 1.0 | y: 1 1 -1 -1 -1 -1 -1 1 1 1
  Round 10: x: 0.1 0.1 0.1 0.1 0.3 0.3 0.8 0.8 0.9 0.9 | y: 1 1 1 1 1 1 1 1 1 1

Bagging Example
* Summary of the training sets:

  Round  Split Point  Left Class  Right Class
  1      0.35         1           -1
  2      0.7          1           1
  3      0.35         1           -1
  4      0.3          1           -1
  5      0.35         1           -1
  6      0.75         -1          1
  7      0.75         -1          1
  8      0.75         -1          1
  9      0.75         -1          1
  10     0.05         1           1

Bagging Example
* Assume the test set is the same as the original data
* Use a majority vote to determine the class predicted by the ensemble classifier:

  Round  x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
  1       1     1     1    -1    -1    -1    -1    -1    -1    -1
  2       1     1     1     1     1     1     1     1     1     1
  3       1     1     1    -1    -1    -1    -1    -1    -1    -1
  4       1     1     1    -1    -1    -1    -1    -1    -1    -1
  5       1     1     1    -1    -1    -1    -1    -1    -1    -1
  6      -1    -1    -1    -1    -1    -1    -1     1     1     1
  7      -1    -1    -1    -1    -1    -1    -1     1     1     1
  8      -1    -1    -1    -1    -1    -1    -1     1     1     1
  9      -1    -1    -1    -1    -1    -1    -1     1     1     1
  10      1     1     1     1     1     1     1     1     1     1
  Sum     2     2     2    -6    -6    -6    -6     2     2     2
  Sign    1     1     1    -1    -1    -1    -1     1     1     1
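A runnable sketch of bagging with decision stumps on the 1-D data set above. For brevity the stump minimizes training error rather than entropy (the slides choose the split by entropy), and the sampling seed is arbitrary.

```python
import random

# 1-D data set from the slides.
X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]

def fit_stump(xs, ys):
    """Choose split point k and left/right labels for the rule x <= k vs x > k.
    Minimizes training error here; the slides use entropy."""
    best = None
    for k in sorted(set(xs)):
        for left in (1, -1):
            for right in (1, -1):
                err = sum(1 for x, y in zip(xs, ys)
                          if (left if x <= k else right) != y)
                if best is None or err < best[0]:
                    best = (err, k, left, right)
    _, k, left, right = best
    return lambda x: left if x <= k else right

def bagging(X, Y, rounds=10, seed=0):
    rng = random.Random(seed)
    stumps = []
    for _ in range(rounds):
        # Bootstrap sample: n draws with replacement from the n records.
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        stumps.append(fit_stump([X[i] for i in idx], [Y[i] for i in idx]))
    # Majority vote: sign of the summed stump predictions.
    return lambda x: 1 if sum(s(x) for s in stumps) >= 0 else -1

model = bagging(X, Y)
print([model(x) for x in X])   # ensemble predictions on the training points
```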
Boosting
* An iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records
  - Initially, all N records are assigned equal weights
  - Unlike bagging, the weights may change at the end of each boosting round

Boosting
* Records that are wrongly classified will have their weights increased
* Records that are classified correctly will have their weights decreased

  Original Data:      1   2   3   4   5   6   7   8   9   10
  Boosting (Round 1): 7   3   2   8   7   9   4   10  6   3
  Boosting (Round 2): 5   4   9   4   2   5   1   7   4   2
  Boosting (Round 3): 4   4   8   10  4   5   4   6   3   4

* Example 4 is hard to classify
* Its weight is increased, so it is more likely to be chosen again in subsequent rounds

AdaBoost
* Base classifiers: C1, C2, ..., CT
* Error rate:
  ε_i = (1/N) Σ_{j=1..N} w_j δ( C_i(x_j) ≠ y_j )
* Importance of a classifier:
  α_i = (1/2) ln( (1 - ε_i) / ε_i )

AdaBoost Algorithm
* Weight update:
  w_i^(j+1) = ( w_i^(j) / Z_j ) × exp(-α_j)  if C_j(x_i) = y_i
  w_i^(j+1) = ( w_i^(j) / Z_j ) × exp(α_j)   if C_j(x_i) ≠ y_i
  where Z_j is the normalization factor
* If any intermediate round produces an error rate higher than 50%, the weights are reverted to 1/n and the resampling procedure is repeated
* Classification:
  C*(x) = arg max_y Σ_{j=1..T} α_j δ( C_j(x) = y )

AdaBoost Algorithm
(Figure: pseudocode of the AdaBoost algorithm.)

AdaBoost Example
* Consider the same 1-dimensional data set:

  x:  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
  y:  1    1    1    -1   -1   -1   -1   1    1    1

* The classifier is a decision stump
  - Decision rule: x ≤ k versus x > k
  - The split point k is chosen based on entropy

AdaBoost Example
* Training sets for the first 3 boosting rounds:

  Round 1: x: 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1.0 | y: 1 -1 -1 -1 -1 -1 -1 -1 1 1
  Round 2: x: 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3 | y: 1 1 1 1 1 1 1 1 1 1
  Round 3: x: 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7 | y: 1 1 -1 -1 -1 -1 -1 -1 -1 -1

* Summary:

  Round  Split Point  Left Class  Right Class  alpha
  1      0.75         -1          1            1.738
  2      0.05         1           1            2.7784
  3      0.3          1           -1           4.1195

AdaBoost Example
* Weights:

  Round  x=0.1  x=0.2  x=0.3  x=0.4  x=0.5  x=0.6  x=0.7  x=0.8  x=0.9  x=1.0
  1      0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
  2      0.311  0.311  0.311  0.01   0.01   0.01   0.01   0.01   0.01   0.01
  3      0.029  0.029  0.029  0.228  0.228  0.228  0.228  0.009  0.009  0.009

* Classification:

  Round  x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8  x=0.9  x=1.0
  1      -1    -1    -1    -1    -1    -1    -1     1      1      1
  2       1     1     1     1     1     1     1     1      1      1
  3       1     1     1    -1    -1    -1    -1    -1     -1     -1
  Sum     5.16  5.16  5.16 -3.08 -3.08 -3.08 -3.08  0.397  0.397  0.397
  Sign    1     1     1    -1    -1    -1    -1     1      1      1
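A sketch of the AdaBoost updates above. The slides resample the training set according to the weights; here the weights enter the stump's error directly, which is a common equivalent formulation. The stump fitter again minimizes weighted error rather than entropy, and all names are illustrative.

```python
import math

def fit_weighted_stump(X, Y, w):
    """Pick split k and left/right labels minimizing the weighted error."""
    best = None
    for k in sorted(set(X)):
        for left in (1, -1):
            for right in (1, -1):
                err = sum(wi for wi, x, y in zip(w, X, Y)
                          if (left if x <= k else right) != y)
                if best is None or err < best[0]:
                    best = (err, k, left, right)
    _, k, left, right = best
    return lambda x: left if x <= k else right

def adaboost(X, Y, rounds=3):
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []                                  # (alpha_j, C_j) pairs
    for _ in range(rounds):
        C = fit_weighted_stump(X, Y, w)
        eps = sum(wi for wi, x, y in zip(w, X, Y) if C(x) != y)
        if eps == 0:                               # perfect stump: stop early
            ensemble.append((1.0, C))
            break
        if eps >= 0.5:                             # revert weights per the slides
            w = [1.0 / n] * n
            continue
        alpha = 0.5 * math.log((1 - eps) / eps)
        # Increase the weights of misclassified records, decrease the rest.
        w = [wi * math.exp(alpha if C(x) != y else -alpha)
             for wi, x, y in zip(w, X, Y)]
        Z = sum(w)                                 # normalization factor Z_j
        w = [wi / Z for wi in w]
        ensemble.append((alpha, C))
    return lambda x: 1 if sum(a * C(x) for a, C in ensemble) >= 0 else -1

X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]
model = adaboost(X, Y, rounds=5)
print([model(x) for x in X])   # ensemble predictions on the training points
```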
Classification: Alternative Techniques
Imbalanced Class Problem

Class Imbalance Problem
* In many classification problems the classes are skewed (far more records from one class than from another)
  - Credit card fraud
  - Intrusion detection
  - Defective products in a manufacturing assembly line

Challenges
* Evaluation measures such as accuracy are not well suited for imbalanced classes
* Detecting the rare class is like finding a needle in a haystack

Confusion Matrix

                       PREDICTED CLASS
                       Class=Yes   Class=No
  ACTUAL    Class=Yes  a (TP)      b (FN)
  CLASS     Class=No   c (FP)      d (TN)

  a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

Accuracy
* Most widely used metric:
  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Problem with Accuracy
* Consider a 2-class problem
  - Number of Class 0 examples = 9990
  - Number of Class 1 examples = 10
* If a model predicts everything to be Class 0, its accuracy is 9990/10000 = 99.9%
  - This is misleading because the model does not detect any Class 1 example
  - Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects)

Alternative Measures
* Precision (p) = a / (a + c)
* Recall (r) = a / (a + b)
* F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

ROC (Receiver Operating Characteristic)
* A graphical approach for displaying the trade-off between detection rate and false alarm rate
* Developed in the 1950s for signal detection theory to analyze noisy signals
* An ROC curve plots TPR against FPR
  - TPR = TP / (TP + FN), FPR = FP / (TN + FP)
  - The performance of a model is represented as a point on the ROC curve
  - Changing the threshold parameter of the classifier changes the location of the point

ROC Curve
* Notable (TPR, FPR) points:
  - (0, 0): declare everything to be the negative class
  - (1, 1): declare everything to be the positive class
  - (1, 0): ideal
* Diagonal line:
  - Random guessing
  - Below the diagonal line: the prediction is the opposite of the true class

ROC (Receiver Operating Characteristic)
* To draw an ROC curve, the classifier must produce a continuous-valued output
  - The outputs are used to rank the test records, from the most likely to the least likely positive-class record
* Many classifiers produce only discrete outputs (i.e., the predicted class)
  - How to get continuous-valued outputs from decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, and SVMs?

Example: Decision Trees
(Figure: a decision tree whose leaf-node class fractions serve as continuous-valued outputs.)

ROC Curve Example
(Figure: ROC curves for the decision tree example.)

Using ROC for Model Comparison
* Neither model consistently outperforms the other
  - M1 is better for small FPR
  - M2 is better for large FPR
* Area under the ROC curve (AUC)
  - Ideal: area = 1
  - Random guess: area = 0.5

How to Construct an ROC Curve
* Use a classifier that produces a continuous-valued output score(+|A) for each test instance
* Sort the instances according to score(+|A) in decreasing order
* Apply a threshold at each unique value of score(+|A)
* Count the number of TP, FP, TN, FN at each threshold
  - TPR = TP / (TP + FN)
  - FPR = FP / (FP + TN)

  Instance  score(+|A)  True Class
  1         0.95        +
  2         0.93        +
  3         0.87        -
  4         0.85        -
  5         0.85        -
  6         0.85        +
  7         0.76        -
  8         0.53        +
  9         0.43        -
  10        0.25        +

How to Construct an ROC Curve
* Threshold sweep:

  Class          +     -     +     -     -     -     +     -     +     +
  Threshold >=   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
  TP             5     4     4     3     3     3     3     2     2     1     0
  FP             5     5     4     4     3     2     1     1     0     0     0
  TN             0     0     1     1     2     3     4     4     5     5     5
  FN             0     1     1     2     2     2     2     3     3     4     5
  TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
  FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

* Plotting (FPR, TPR) at each threshold yields the ROC curve
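The threshold sweep in the table can be reproduced in a few lines. In this sketch ties are broken with '+' first so the intermediate points match the table exactly; any tie order yields the same curve endpoints for the tied group.

```python
# ROC construction per the slides: sort by score, lower the threshold one
# record at a time, and accumulate TP/FP counts.

def roc_points(scores, labels):
    """labels are '+'/'-'; returns the list of (FPR, TPR) points."""
    P = labels.count('+')
    N = labels.count('-')
    # Descending score; '+' before '-' among ties ('+' sorts before '-').
    ranked = sorted(zip(scores, labels), key=lambda t: (-t[0], t[1]))
    points = [(0.0, 0.0)]            # threshold above every score
    tp = fp = 0
    for score, label in ranked:
        if label == '+':
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']
print(roc_points(scores, labels))    # matches the (FPR, TPR) rows above
```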
Handling the Class Imbalance Problem
* Class-based ordering (e.g., RIPPER)
  - Rules for the rare class have higher priority
* Cost-sensitive classification
  - Misclassifying a rare-class example as the majority class is more expensive than misclassifying a majority-class example as the rare class
* Sampling-based approaches

Cost Matrix
* C(i, j): cost of misclassifying a class i example as class j
* Total cost = Σ_{i,j} C(i, j) × f(i, j), where f(i, j) is the number of class i examples predicted as class j (the confusion matrix counts)

                       PREDICTED CLASS
                       Class=Yes    Class=No
  ACTUAL    Class=Yes  C(Yes, Yes)  C(Yes, No)
  CLASS     Class=No   C(No, Yes)   C(No, No)

Computing the Cost of Classification
* Cost matrix:

  C(i, j)   Predicted +  Predicted -
  Actual +  -1           100
  Actual -  1            0

* Model M1 confusion matrix: TP = 150, FN = 40, FP = 60, TN = 250
  Accuracy = 80%, Cost = 150×(-1) + 40×100 + 60×1 + 250×0 = 3910
* Model M2 confusion matrix: TP = 250, FN = 45, FP = 5, TN = 200
  Accuracy = 90%, Cost = 4255
* M2 has the higher accuracy, yet M1 has the lower cost

Cost-Sensitive Classification
* Example: Bayesian classifier
  - Given a test record x, compute p(i|x) for each class i
  - Decision rule: classify x as class k if k = arg max_i p(i|x)
  - For 2 classes, classify x as + if p(+|x) > p(-|x)
    This decision rule implicitly assumes that C(+,+) = C(-,-) = 0 and C(+,-) = C(-,+)

Cost-Sensitive Classification
* General decision rule:
  - Classify test record x as class k if k = arg min_j Σ_i p(i|x) × C(i, j)
* 2-class case:
  - Cost(+) = p(+|x) C(+,+) + p(-|x) C(-,+)
  - Cost(-) = p(+|x) C(+,-) + p(-|x) C(-,-)
  - Decision rule: classify x as + if Cost(+) < Cost(-)
  - If C(+,+) = C(-,-) = 0, this reduces to: classify x as + if
    p(+|x) > C(-,+) / ( C(-,+) + C(+,-) )

Sampling-based Approaches
* Modify the distribution of the training data so that the rare class is well represented in the training set
  - Undersample the majority class
  - Oversample the rare class
* Each has advantages and disadvantages
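A sketch of the general minimum-expected-cost rule above, using the cost matrix from the M1/M2 example; the function name and dictionary layout are illustrative.

```python
# Cost-sensitive decision rule: predict the class k that minimizes the
# expected cost sum_i p(i|x) * C(i, k).

def min_cost_class(posteriors, cost):
    """posteriors: {class: p(class|x)}; cost[(i, j)]: cost of predicting j
    when the true class is i."""
    classes = posteriors.keys()
    return min(classes,
               key=lambda j: sum(posteriors[i] * cost[(i, j)]
                                 for i in classes))

# Cost matrix from the M1/M2 example:
# C(+,+) = -1, C(+,-) = 100, C(-,+) = 1, C(-,-) = 0.
cost = {('+', '+'): -1, ('+', '-'): 100, ('-', '+'): 1, ('-', '-'): 0}

# False negatives are 100x as costly as false positives, so even a small
# posterior for '+' is enough to predict '+':
# Cost(+) = 0.05*(-1) + 0.95*1 = 0.90 < Cost(-) = 0.05*100 + 0.95*0 = 5.0
print(min_cost_class({'+': 0.05, '-': 0.95}, cost))   # -> '+'
```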