Classification: Definition

Given a collection of records (the training set), find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model.

Training set (Learn Model):

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set (Apply Model):

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Examples of Classification Task
– Predicting tumor cells as benign or malignant
– Classifying credit card transactions as legitimate or fraudulent
– Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
– Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
– Decision Tree based Methods
– Rule-based Methods
– Memory based reasoning
– Neural Networks
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines

Example of a Decision Tree

Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (the splitting attributes are the attributes tested at the internal nodes):

Refund?
– Yes → NO
– No → MarSt?
    – Single, Divorced → TaxInc?
        – < 80K → NO
        – > 80K → YES
    – Married → NO

Another Example of Decision Tree (same training data):

MarSt?
– Married → NO
– Single, Divorced → Refund?
    – Yes → NO
    – No → TaxInc?
        – < 80K → NO
        – > 80K → YES

There could be more than one tree that fits the same data!

Decision Tree Classification Task
– Induction: learn a decision tree model from the training set.
– Deduction: apply the model to the test set to classify unseen records.

Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and, at each internal node, follow the branch that matches the record's attribute value: Refund = No → go to the MarSt node; MarSt = Married → reach a leaf.
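The walk above can be sketched in code. This is a minimal illustration, not from the slides: the dict record representation and the `classify` helper are assumed names, with the structure and the 80K threshold taken from the example tree.

```python
# Sketch: the example decision tree (Refund -> MarSt -> TaxInc) hard-coded
# as a classifier that walks it for one test record. The representation and
# function name are illustrative choices, not part of the original slides.

def classify(record):
    """Return the predicted Cheat label for a single record (a dict)."""
    if record["Refund"] == "Yes":
        return "No"                      # Refund = Yes -> leaf NO
    if record["MaritalStatus"] == "Married":
        return "No"                      # Married -> leaf NO
    # Single or Divorced: test Taxable Income against the 80K cut
    return "Yes" if record["TaxableIncome"] > 80 else "No"

# The test record used on the "Apply Model" slides:
test_record = {"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80}
print(classify(test_record))  # -> No
```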
The Married branch leads to the leaf NO, so the model assigns Cheat = "No" to the test record.

Decision Tree Induction

Many algorithms exist:
– Hunt's Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT

General Structure of Hunt's Algorithm

Let Dt be the set of training records that reach node t. General procedure:
– If Dt contains only records of the same class yt, then t is a leaf node labeled as yt.
– Else, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
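The general procedure above can be sketched as a short recursive function. This is an illustrative implementation under simplifying assumptions (not the slides' code): records are (features, label) pairs with categorical features, record sets are non-empty, and the attribute to test is taken in a fixed order instead of being chosen by an impurity criterion.

```python
# Sketch of Hunt's general procedure under the assumptions stated above.

def hunt(records, attributes):
    labels = {label for _, label in records}
    if len(labels) == 1:                 # Dt holds a single class y_t:
        return labels.pop()              # t becomes a leaf labeled y_t
    if not attributes:                   # no tests left: majority class
        counts = {}
        for _, label in records:
            counts[label] = counts.get(label, 0) + 1
        return max(counts, key=counts.get)
    attr, rest = attributes[0], attributes[1:]
    subsets = {}                         # the attribute test splits Dt
    for feats, label in records:
        subsets.setdefault(feats[attr], []).append((feats, label))
    # recursively apply the procedure to each subset
    return {attr: {v: hunt(sub, rest) for v, sub in subsets.items()}}

toy = [({"Refund": "Yes"}, "No"), ({"Refund": "No"}, "Yes")]
print(hunt(toy, ["Refund"]))  # -> {'Refund': {'Yes': 'No', 'No': 'Yes'}}
```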
Hunt's Algorithm (illustrated on the tax-cheat training data)

Step 1: start with a single leaf predicting the majority class, Don't Cheat.
Step 2: split on Refund: Yes → Don't Cheat; No → Don't Cheat.
Step 3: refine the Refund = No branch by Marital Status: Single, Divorced → Cheat; Married → Don't Cheat.
Step 4: refine the Single, Divorced branch by Taxable Income: < 80K → Don't Cheat; >= 80K → Cheat.

Tree Induction
– Greedy strategy: split the records based on an attribute test that optimizes a local criterion.
– Issues:
    – Determine how to split the records: how to specify the attribute test condition, and how to determine the best split?
    – Determine when to stop splitting.

How to Specify Test Condition?

Depends on attribute type:
– Nominal (no order; e.g., Country)
– Ordinal (discrete, ordered; e.g., S, M, L, XL)
– Continuous (ordered, continuous; e.g., temperature)

Depends on the number of ways to split:
– 2-way split
– Multi-way split
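For a nominal attribute with k distinct values there are 2^(k-1) − 1 candidate two-way partitions, which is why binary splits require a search for the optimal partitioning. A small enumeration sketch (the `binary_partitions` name is an assumed helper, not from the slides):

```python
from itertools import combinations

# Illustrative: list every two-way partition of a nominal attribute's
# value set, skipping mirror duplicates such as (A|BC) vs (BC|A).
def binary_partitions(values):
    values = sorted(values)
    parts = []
    for r in range(1, len(values)):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            if (right, left) not in parts:   # skip mirror duplicates
                parts.append((left, right))
    return parts

carsplits = binary_partitions(["Family", "Sports", "Luxury"])
print(len(carsplits))  # -> 3, e.g. ('Sports',) vs ('Family', 'Luxury')
```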
Splitting Based on Nominal Attributes
– Multi-way split: use as many partitions as there are distinct values, e.g. CarType → Family | Sports | Luxury.
– Binary split: divides the values into two subsets; need to find the optimal partitioning, e.g. {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.

Splitting Based on Continuous Attributes

Different ways of handling:
– Discretization to form an ordinal categorical attribute:
    – Static: discretize once at the beginning.
    – Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
– Binary decision: (A < v) or (A ≥ v); consider all possible splits and find the best cut. This can be more compute-intensive.

How to Determine the Best Split

Before splitting: 10 records of class 0 and 10 records of class 1. Which test condition is the best?
– Greedy approach: nodes with a homogeneous class distribution are preferred.
– Non-homogeneous nodes have a high degree of impurity; homogeneous nodes have a low degree of impurity.
– So we need a measure of node impurity.

Measures of Node Impurity
– Gini Index
– Entropy
– Misclassification error
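The three measures listed above can all be computed from the class counts at a node. A minimal sketch (function names are illustrative):

```python
from math import log2

# Node-impurity measures, each taking a list of per-class record counts.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def misclassification_error(counts):
    n = sum(counts)
    return 1.0 - max(counts) / n

# A 50/50 node (10 records of each class) is maximally impure:
print(gini([10, 10]), entropy([10, 10]), misclassification_error([10, 10]))
# -> 0.5 1.0 0.5
```

A pure node scores 0 under all three measures; they differ in how they weight intermediate distributions.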
How to Find the Best Split

Before splitting, the node has class counts C0 = N00 and C1 = N01, with impurity M0. Splitting on attribute A yields nodes N1 (counts N10, N11) and N2 (counts N20, N21), with impurities M1 and M2 that combine into M12; splitting on attribute B yields nodes N3 (N30, N31) and N4 (N40, N41), with impurities M3 and M4 that combine into M34. Compare Gain = M0 − M12 vs. M0 − M34 and choose the split with the larger gain.

Measure of Impurity: GINI

Gini index for a given node t:

GINI(t) = 1 − Σ_j [p(j|t)]²

(NOTE: p(j|t) is the relative frequency of class j at node t.)
– Maximum (1 − 1/nc) when records are equally distributed among all nc classes, implying least interesting information.
– Minimum (0.0) when all records belong to one class, implying most interesting information.

Examples for computing GINI (6 records, two classes):
– C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 − 0² − 1² = 0.000
– C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Gini = 1 − (1/6)² − (5/6)² = 0.278
– C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Gini = 1 − (2/6)² − (4/6)² = 0.444
– C1 = 3, C2 = 3: Gini = 1 − (3/6)² − (3/6)² = 0.500

Splitting Based on GINI
– Used in CART, SLIQ, SPRINT.
– When a node p is split into k partitions (children), the quality of the split is computed as

  GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

  where n_i = number of records at child i and n = number of records at node p.
– Effect of weighting partitions: larger and purer partitions are sought.

Binary Attributes: Computing the GINI Index

The parent node has C1 = 6, C2 = 6, so Gini(parent) = 0.500. A binary test B? splits the records into N1 (C1 = 5, C2 = 2) and N2 (C1 = 1, C2 = 4):
– Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
– Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
– Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

Stopping Criteria for Tree Induction
– Stop expanding a node when all the records belong to the same class.
– Stop expanding a node when all the records have similar attribute values.
– Early termination (to be discussed later).

Decision Tree Based Classification

Advantages:
– Inexpensive to construct.
– Extremely fast at classifying unknown records.
– Easy to interpret for small-sized trees.
– Accuracy is comparable to other classification techniques for many simple data sets.

Practical Issues of Classification
– Underfitting and overfitting
– Missing values
– Costs of classification

Underfitting and Overfitting
– Underfitting: when the model is too simple, both training and test errors are large.
– Overfitting due to noise: the decision boundary is distorted by noise points.
– Overfitting due to insufficient examples: a lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly; an insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.

Notes on Overfitting
– Overfitting results in decision trees that are more complex than necessary.
– Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
– We need new ways of estimating errors.
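Returning to the GINI split criterion: the weighted-children computation for the binary-split example (parent with class counts 6/6, children 5/2 and 1/4) can be checked in a few lines. Function names are illustrative; note that each child's Gini uses that child's own record count (7 and 5) as the denominator.

```python
# Sketch: GINI of a node and weighted GINI of a split, applied to the
# binary-split example (parent 6/6, children 5/2 and 1/4).

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """children: one class-count list per partition; returns weighted GINI."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

parent = [6, 6]                 # C1 = 6, C2 = 6 at node p
children = [[5, 2], [1, 4]]     # N1 and N2 after the split on B
print(round(gini(parent), 3))          # -> 0.5
print(round(gini_split(children), 3))  # -> 0.371
```

The split is chosen because its weighted child impurity (0.371) is lower than the parent's impurity (0.500).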
How to Address Overfitting

Pre-pruning (early stopping rule): stop the algorithm before it becomes a fully-grown tree.
– Stop if the number of instances is less than some user-specified threshold.
– Stop if the class distribution of the instances is independent of the available features (e.g., using the χ² test).
– Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).

Post-pruning:
– Grow the decision tree to its entirety.
– Trim the nodes of the decision tree in a bottom-up fashion.
– If the generalization error improves after trimming, replace the sub-tree by a leaf node.
– The class label of the leaf node is determined from the majority class of instances in the sub-tree.

Model Evaluation
– Metrics for performance evaluation: how to evaluate the performance of a model?
– Methods for performance evaluation: how to obtain reliable estimates?
– Methods for model comparison: how to compare the relative performance among competing models?

Metrics for Performance Evaluation
– Focus on the predictive capability of a model, rather than on how fast it classifies or builds models, scalability, etc.
– The basic tool is the confusion matrix:
                  PREDICTED CLASS
                  Class=Yes   Class=No
ACTUAL  Class=Yes  a (TP)      b (FN)
CLASS   Class=No   c (FP)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Limitation of Accuracy
– Consider a 2-class problem with 9990 examples of class 0 and 10 examples of class 1.
– A model that predicts everything to be class 0 has accuracy 9990/10000 = 99.9%.
– Accuracy is misleading here because the model does not detect a single class 1 example.

Cost-Sensitive Measures

Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

– Precision is biased towards C(Yes|Yes) & C(Yes|No).
– Recall is biased towards C(Yes|Yes) & C(No|Yes).
– F-measure is biased towards all except C(No|No).
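A sketch of the metrics above computed from the four confusion-matrix counts (the `metrics` helper is an assumed name, not from the slides):

```python
# Illustrative: accuracy, precision, recall, and F-measure from the
# confusion-matrix counts a (TP), b (FN), c (FP), d (TN).

def metrics(a, b, c, d):
    """Assumes a+c and a+b are non-zero so precision/recall are defined."""
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * a / (2 * a + b + c)   # equals 2rp / (r + p)
    return accuracy, precision, recall, f_measure

print(metrics(40, 10, 10, 40))  # -> (0.8, 0.8, 0.8, 0.8)

# The accuracy limitation above: predicting everything as class 0 on
# 9990 class-0 / 10 class-1 examples gives accuracy (0 + 9990) / 10000,
# even though the model finds no class-1 example at all:
print((0 + 9990) / 10000)  # -> 0.999
```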