Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation
Lecture Notes for Chapter 4, Introduction to Data Mining
© Tan, Steinbach, Kumar

Classification: Definition
- Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class:
  find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually the given data set is divided into a training set and a test set, with the training set used to build the model and the test set used to validate it.

Illustrating Classification Task
[Figure: a training set (Tid, Attrib1, Attrib2, Attrib3, Class) is fed to an induction algorithm (Learn Model), producing a model; the model is then applied (Apply Model) to a test set whose Class values are unknown ("?").]

Examples of Classification Task
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
- Decision tree based methods
- Rule-based methods
- Memory based reasoning
- Neural networks
- Naïve Bayes and Bayesian belief networks
- Support vector machines
- Logistic regression

Example of a Decision Tree

Training data:

Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         |  70K           | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       |  95K           | Yes
 6  | No     | Married        |  60K           | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         |  85K           | Yes
 9  | No     | Married        |  75K           | No
10  | No     | Single         |  90K           | Yes

Model (a decision tree; the splitting attributes are Refund, MarSt, and TaxInc):

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 <  80K -> NO
                                 >= 80K -> YES

Another Example of Decision Tree
The same training data is also fit by a different tree, which splits on marital status first:

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 <  80K -> NO
                                 >= 80K -> YES

There could be more than one tree that fits the same data!
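As a concrete illustration of learning such a model from the table above, the following is a minimal sketch (not from the slides) that fits a decision tree with scikit-learn; the numeric encoding of Refund and Marital Status is an assumption made for the example.

# Minimal sketch: fit a decision tree to the ten-record "Cheat" table above.
# The feature encoding (Refund as 0/1, Marital Status one-hot) is assumed for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

records = [
    # Refund, Single, Married, Divorced, TaxableIncome(K), Cheat
    (1, 1, 0, 0, 125, "No"),
    (0, 0, 1, 0, 100, "No"),
    (0, 1, 0, 0,  70, "No"),
    (1, 0, 1, 0, 120, "No"),
    (0, 0, 0, 1,  95, "Yes"),
    (0, 0, 1, 0,  60, "No"),
    (1, 0, 0, 1, 220, "No"),
    (0, 1, 0, 0,  85, "Yes"),
    (0, 0, 1, 0,  75, "No"),
    (0, 1, 0, 0,  90, "Yes"),
]
X = [r[:5] for r in records]
y = [r[5] for r in records]

model = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(model, feature_names=["Refund", "Single", "Married", "Divorced", "Income"]))

The printed tree need not match the one drawn above: as the slide notes, more than one tree can fit the same data.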
Decision Tree Classification Task
[Figure: the same induction/deduction loop as before, with the learned model now explicitly a decision tree — Learn Model on the training set, Apply Model on the test set.]

Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

- Start from the root of the tree and test Refund: the record has Refund = No, so follow the "No" branch.
- At the MarSt node, the record has Marital Status = Married, so follow the "Married" branch.
- The "Married" branch ends in the leaf NO, so assign Cheat to "No".

Decision Tree Induction
- Many algorithms:
  - Hunt's Algorithm (one of the earliest)
  - CART
  - ID3, C4.5
  - SLIQ, SPRINT

General Structure of Hunt's Algorithm
- Let Dt be the set of training records that reach a node t.
- General procedure:
  - If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  - If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
  - If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset (a sketch of this recursion follows below).
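A compact sketch of the procedure just described (the record layout and the choose_test helper are assumptions; how to pick a good attribute test is exactly the question the following slides address):

# Hunt's algorithm, following the three cases listed above.
# `records` is a list of (attributes, label) pairs; `choose_test` returns a
# (test_fn, outcomes) pair describing an attribute test condition.
def hunt(records, default_class, choose_test):
    if not records:                              # Dt empty: leaf labeled with default class yd
        return {"leaf": default_class}
    labels = {label for _, label in records}
    if len(labels) == 1:                         # all records in Dt share one class yt
        return {"leaf": labels.pop()}
    test_fn, outcomes = choose_test(records)     # attribute test used to split Dt
    majority = max(labels, key=lambda c: sum(1 for _, y in records if y == c))
    children = {}
    for outcome in outcomes:                     # split into smaller subsets, recurse on each
        subset = [(x, y) for x, y in records if test_fn(x) == outcome]
        children[outcome] = hunt(subset, majority, choose_test)
    return {"test": test_fn, "children": children}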
Hunt's Algorithm (example)
Applied to the training data above (Dt is the set of records reaching each node), the tree is grown step by step:
- Start with a single node containing all ten records; the default prediction is Don't Cheat.
- Split on Refund: the Refund = Yes subset contains only Don't Cheat records and becomes a leaf; the Refund = No subset still contains both classes.
- Split the Refund = No subset on Marital Status: the Married subset becomes a Don't Cheat leaf; the Single/Divorced subset still contains both classes.
- Split the Single/Divorced subset on Taxable Income: < 80K becomes a Don't Cheat leaf, >= 80K becomes a Cheat leaf.

Tree Induction
- Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
- Issues:
  - Determine how to split the records
    - How to specify the attribute test condition?
    - How to determine the best split?
  - Determine when to stop splitting

How to Specify Test Condition?
- Depends on attribute types: nominal, ordinal, continuous.
- Depends on number of ways to split: 2-way split, multi-way split.

Splitting Based on Nominal Attributes
- Multi-way split: use as many partitions as there are distinct values, e.g. CarType -> {Family}, {Sports}, {Luxury}.
- Binary split: divides the values into two subsets; need to find the optimal partitioning, e.g. {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}. (A small sketch that enumerates these candidate partitions appears at the end of this subsection.)

Splitting Based on Ordinal Attributes
- Multi-way split: use as many partitions as there are distinct values, e.g. Size -> {Small}, {Medium}, {Large}.
- Binary split: divides the values into two subsets that respect the order, e.g. {Small, Medium} vs {Large}, or {Medium, Large} vs {Small}.
- What about the split {Small, Large} vs {Medium}? It does not preserve the order of the attribute values.

Splitting Based on Continuous Attributes
- Different ways of handling:
  - Discretization to form an ordinal categorical attribute
    - Static: discretize once at the beginning.
    - Dynamic: ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
  - Binary decision: (A < v) or (A >= v)
    - Consider all possible splits and find the best cut.
    - Can be more compute intensive.
[Figure: a binary test on a continuous attribute (e.g. Taxable Income > 80K?) versus a multi-way split of the attribute into value ranges.]
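As referenced above for nominal attributes, here is a minimal sketch (the function name and return format are mine) that enumerates the candidate two-subset partitions a binary split must choose among; each candidate can then be scored with the impurity measures introduced next. For an ordinal attribute, only the order-preserving candidates would be kept.

# Enumerate candidate binary partitions of a nominal attribute's values,
# e.g. CarType in {Family, Sports, Luxury}.
from itertools import combinations

def binary_partitions(values):
    values = sorted(values)
    seen, parts = set(), []
    for r in range(1, len(values)):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            if right not in seen:          # skip mirror images of partitions already listed
                seen.add(left)
                parts.append((left, right))
    return parts

print(binary_partitions({"Family", "Sports", "Luxury"}))
# [(('Family',), ('Luxury', 'Sports')),
#  (('Luxury',), ('Family', 'Sports')),
#  (('Sports',), ('Family', 'Luxury'))]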
How to Determine the Best Split
- Before splitting: 10 records of class 0 and 10 records of class 1. Different candidate attribute tests split these records into children with different class mixtures — which test condition is the best?
- Greedy approach: nodes with a homogeneous class distribution are preferred.
- We therefore need a measure of node impurity: a non-homogeneous node (classes evenly mixed) has a high degree of impurity; a homogeneous node has a low degree of impurity.

Impurity Measures
A useful impurity measure
- should be at a minimum when all examples belong to a single class,
- should be at a maximum when all classes have the same relative frequency,
- should be symmetrical in the class frequencies.

Measures of Node Impurity
- Gini index
- Entropy
- Misclassification error

How to Find the Best Split
- Compute the impurity M0 of the node before splitting (class counts N00, N01).
- For each candidate attribute test (say A? and B?), compute the impurity of the resulting children (N1 ... N4) and combine them into M12 and M34.
- Compare the gains, Gain = M0 - M12 versus M0 - M34, and choose the test with the larger gain.

Measure of Impurity: GINI
- Gini index for a given node t:

  GINI(t) = 1 - sum_j [p(j|t)]^2

  (Note: p(j|t) is the relative frequency of class j at node t.)
- Maximum (1 - 1/nc, where nc is the number of classes) when records are equally distributed among all classes, implying the least interesting information.
- Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples for computing GINI

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1;  Gini = 1 - 0 - 1 = 0.000
C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6;          Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6;          Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
C1 = 3, C2 = 3:                                     Gini = 0.500

Splitting Based on GINI
- Used in CART, SLIQ, SPRINT.
- When a node p is split into k partitions (children), the quality of the split is computed as

  GINI_split = sum_{i=1..k} (n_i / n) * GINI(i)

  where n_i is the number of records at child i and n is the number of records at node p.
- Effect of weighing the partitions: larger and purer partitions are sought for.

Binary Attributes: Computing GINI Index
- The split divides the records into two partitions.
- Example: a parent node with C1 = 6, C2 = 6 (Gini = 0.500) is split by test B? into
  N1 with C1 = 5, C2 = 2 and N2 with C1 = 1, C2 = 4:
    Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
    Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
    Gini(children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371

Categorical Attributes: Computing Gini Index
- For each distinct value, gather the counts for each class in the dataset.
- Use the count matrix to make decisions.

Multi-way split on CarType:
        Family  Sports  Luxury
  C1       1       2       1
  C2       4       1       1
  Gini = 0.393

Two-way splits (find the best partition of values):
        {Sports, Luxury}  {Family}
  C1           3             1
  C2           2             4
  Gini = 0.400

        {Family, Luxury}  {Sports}
  C1           2             2
  C2           5             1
  Gini = 0.419

Continuous Attributes: Computing Gini Index
- Use binary decisions based on one value.
- Several choices for the splitting value: the number of possible splitting values equals the number of distinct values.
- Each splitting value v has a count matrix associated with it: the class counts in each of the partitions A < v and A >= v.
- Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. This is computationally inefficient — it repeats work.
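A short sketch of the two Gini formulas above (the helper names are my own), which reproduces the binary-attribute example:

# GINI(t) = 1 - sum_j p(j|t)^2, and the weighted quality of a k-way split.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children_counts):
    n = sum(sum(c) for c in children_counts)
    return sum(sum(c) / n * gini(c) for c in children_counts)

print(gini([6, 6]))                  # parent node: 0.5
print(gini([5, 2]), gini([1, 4]))    # N1 ~ 0.408, N2 = 0.320
print(gini_split([[5, 2], [1, 4]]))  # weighted split quality ~ 0.371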
Continuous Attributes: Computing Gini Index (continued)
- For efficient computation, for each attribute:
  - Sort the attribute on its values.
  - Linearly scan these values, each time updating the count matrix and computing the Gini index.
  - Choose the split position that has the least Gini index.
- Example (Taxable Income from the training data; the class is Cheat):

  Sorted values:   60   70   75   85   90   95  100  120  125  220
  Cheat:           No   No   No   Yes  Yes  Yes  No   No   No   No
  Split position:  55   65   72   80   87   92   97  110  122  172  230
  Gini:          0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

  The best split position is 97 (Gini = 0.300).

Alternative Splitting Criteria based on INFO
- Entropy at a given node t:

  Entropy(t) = - sum_j p(j|t) log2 p(j|t)

  (Note: p(j|t) is the relative frequency of class j at node t.)
- Measures the homogeneity of a node:
  - Maximum (log nc, where nc is the number of classes) when records are equally distributed among all classes, implying the least information.
  - Minimum (0.0) when all records belong to one class, implying the most information.
- Entropy based computations are similar to the GINI index computations.

Examples for computing Entropy

C1 = 0, C2 = 6:  P(C1) = 0, P(C2) = 1;       Entropy = -0 log 0 - 1 log 1 = 0
C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6;   Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6;   Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92

Splitting Based on INFO: Information Gain

  GAIN_split = Entropy(p) - sum_{i=1..k} (n_i / n) Entropy(i)

  where the parent node p is split into k partitions and n_i is the number of records in partition i.
- Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
- Used in ID3 and C4.5.
- Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.

Splitting Based on INFO: Gain Ratio

  GainRATIO_split = GAIN_split / SplitINFO,  where  SplitINFO = - sum_{i=1..k} (n_i / n) log(n_i / n)

  and the parent node p is split into k partitions with n_i records in partition i.
- Adjusts the information gain by the entropy of the partitioning (SplitINFO): higher-entropy partitioning (a large number of small partitions) is penalized.
- Used in C4.5.
- Designed to overcome the disadvantage of information gain.
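A sketch of the INFO-based criteria above (the function names are assumptions; child counts are assumed to cover all parent records), reproducing the entropy examples 0, 0.65, and 0.92:

# Entropy, information gain, and gain ratio for a candidate split.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def info_gain(parent_counts, children_counts):
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

def gain_ratio(parent_counts, children_counts):
    split_info = entropy([sum(c) for c in children_counts])  # entropy of the partition sizes
    return info_gain(parent_counts, children_counts) / split_info

print(entropy([0, 6]), entropy([1, 5]), entropy([2, 4]))  # 0.0, ~0.65, ~0.92
print(gain_ratio([6, 6], [[5, 2], [1, 4]]))               # gain ratio for the earlier B? split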
Splitting Criteria based on Classification Error
- Classification error at a node t:

  Error(t) = 1 - max_i P(i|t)

- Measures the misclassification error made by a node.
- Maximum (1 - 1/nc, where nc is the number of classes) when records are equally distributed among all classes, implying the least interesting information.
- Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples for Computing Error

C1 = 0, C2 = 6:  P(C1) = 0, P(C2) = 1;       Error = 1 - max(0, 1) = 1 - 1 = 0
C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6;   Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6
C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6;   Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3

Comparison among Splitting Criteria
[Figure: for a 2-class problem, Gini, entropy, and misclassification error plotted as a function of the fraction p of records in one class; all three are 0 at p = 0 and p = 1 and reach their maximum at p = 0.5.]

Misclassification Error vs Gini
- Example: a parent node with C1 = 7, C2 = 3 (Gini = 0.42) is split by test A? into
  N1 with C1 = 3, C2 = 0 and N2 with C1 = 4, C2 = 3:
    Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
    Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.489
    Gini(children) = 3/10 * 0 + 7/10 * 0.489 = 0.342
- Gini improves (0.42 -> 0.342) even though the weighted classification error is unchanged (0.3 before and after the split) — one reason Gini and entropy are generally preferred over classification error for growing the tree.

Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class.
- Stop expanding a node when all the records have the same attribute values.
- Early termination (to be discussed later).

Decision Tree Based Classification
- Advantages:
  - Inexpensive to construct.
  - Extremely fast at classifying unknown records.
  - Easy to interpret for small-sized trees.
  - Accuracy is comparable to other classification techniques for many simple data sets.

Example: C4.5
- Simple depth-first construction; uses information gain.
- Sorts continuous attributes at each node.
- Needs the entire data set to fit in memory.
- Unsuitable for large datasets: needs out-of-core sorting.
- Implemented in Weka as J48.
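C4.5/J48 itself is not part of scikit-learn, but a rough stand-in for an information-gain-based tree can be sketched by switching the earlier classifier to the entropy criterion; this is my own illustration, not the J48 implementation.

# Rough stand-in for an information-gain-based (C4.5-style) tree in scikit-learn.
# Note the differences from J48: scikit-learn grows binary trees and uses plain
# information gain rather than gain ratio, so this only approximates the behavior above.
from sklearn.tree import DecisionTreeClassifier

c45_like = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=2)
# c45_like.fit(X, y) on the training table from the beginning of these notes,
# then c45_like.predict(...) on new records.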