Knowledge Discovery and Data Mining, Unit # 3
Sajjad Haider, Fall 2013 (9/11/2013)

Acknowledgement
• Most of the slides in this presentation are taken from course slides provided by
  – Han and Kamber (Data Mining: Concepts and Techniques) and
  – Tan, Steinbach and Kumar (Introduction to Data Mining)

Classification: Definition
• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Classification: Motivation

  age      income  student  credit_rating  buys_computer
  <=30     high    no       fair           no
  <=30     high    no       excellent      no
  31...40  high    no       fair           yes
  >40      medium  no       fair           yes
  >40      low     yes      fair           yes
  >40      low     yes      excellent      no
  31...40  low     yes      excellent      yes
  <=30     medium  no       fair           no
  <=30     low     yes      fair           yes
  >40      medium  yes      fair           yes
  <=30     medium  yes      excellent      yes
  31...40  medium  no       excellent      yes
  31...40  high    yes      fair           yes
  >40      medium  no       excellent      no

Decision/Classification Tree
• age?
  – <=30 → student?
    • no → no
    • yes → yes
  – 31...40 → yes
  – >40 → credit rating?
    • excellent → no
    • fair → yes

Illustrating Classification Task
• A learning algorithm is applied to the training set to learn a model (induction); the model is then applied to the test set to assign classes (deduction).

Training Set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  1    Yes      Large    125K     No
  2    No       Medium   100K     No
  3    No       Small    70K      No
  4    Yes      Medium   120K     No
  5    No       Large    95K      Yes
  6    No       Medium   60K      No
  7    Yes      Large    220K     No
  8    No       Small    85K      Yes
  9    No       Medium   75K      No
  10   No       Small    90K      Yes

Test Set:
  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?
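The decision tree above can be written down directly as a short Python function; the function name `classify` and the string encodings are mine, but the splits (age at the root, then student and credit rating) and the training table come from the slides. A minimal sketch:

```python
# Decision tree from the slides: root splits on age, then student / credit rating.
def classify(age, income, student, credit_rating):
    """Predict buys_computer for one record using the slide's tree."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"     # student? branch
    if age == "31...40":
        return "yes"                                   # middle age group: yes leaf
    return "yes" if credit_rating == "fair" else "no"  # >40: credit rating? branch

# Training table from the "Classification: Motivation" slide:
# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30",    "high",   "no",  "fair",      "no"),
    ("<=30",    "high",   "no",  "excellent", "no"),
    ("31...40", "high",   "no",  "fair",      "yes"),
    (">40",     "medium", "no",  "fair",      "yes"),
    (">40",     "low",    "yes", "fair",      "yes"),
    (">40",     "low",    "yes", "excellent", "no"),
    ("31...40", "low",    "yes", "excellent", "yes"),
    ("<=30",    "medium", "no",  "fair",      "no"),
    ("<=30",    "low",    "yes", "fair",      "yes"),
    (">40",     "medium", "yes", "fair",      "yes"),
    ("<=30",    "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no",  "excellent", "yes"),
    ("31...40", "high",   "yes", "fair",      "yes"),
    (">40",     "medium", "no",  "excellent", "no"),
]

# This particular tree reproduces every label in the training set.
correct = sum(classify(a, i, s, c) == label for a, i, s, c, label in data)
print(correct, "of", len(data))  # 14 of 14
```

Note that the tree never tests income: the other three attributes are enough to separate the classes on this data.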
Example of a Decision Tree

Training data:
  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Model (decision tree; Refund is the first splitting attribute):
• Refund?
  – Yes → NO
  – No → MarSt?
    • Married → NO
    • Single, Divorced → TaxInc?
      – < 80K → NO
      – > 80K → YES

Another Example of Decision Tree
• MarSt?
  – Married → NO
  – Single, Divorced → Refund?
    • Yes → NO
    • No → TaxInc?
      – < 80K → NO
      – > 80K → YES
• There could be more than one tree that fits the same data!

Decision Tree Classification Task
• The same induction/deduction workflow as before: a tree induction algorithm learns a decision tree model from the training set, and the model is then applied (deduction) to the test set.

Apply Model to Test Data
• Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
• Start from the root of the tree.
• Refund = No → take the No branch to MarSt.
• MarSt = Married → take the Married branch.
• A leaf is reached: assign Cheat = "No".

Tree Induction
• Greedy strategy
  – Split the records based on an attribute test that optimizes a certain criterion.
• Issues
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting

How to Specify Test Condition?
• Depends on attribute types
  – Nominal
  – Ordinal
  – Continuous
• Depends on number of ways to split
  – 2-way split
  – Multi-way split

How to Determine the Best Split
• Greedy approach:
  – Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:
  – C0: 5, C1: 5 → non-homogeneous, high degree of impurity
  – C0: 9, C1: 1 → homogeneous, low degree of impurity

Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error

Measure of Impurity: GINI
• Gini index for a given node t:

  GINI(t) = 1 - Σ_j [p(j|t)]²

  (NOTE: p(j|t) is the relative frequency of class j at node t.)
• Maximum (1 - 1/n_c) when records are equally distributed among all n_c classes, implying the least interesting information
• Minimum (0.0) when all records belong to one class, implying the most interesting information

  C1: 0, C2: 6 → Gini = 0.000
  C1: 1, C2: 5 → Gini = 0.278
  C1: 2, C2: 4 → Gini = 0.444
  C1: 3, C2: 3 → Gini = 0.500

Examples for Computing GINI
• C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Gini = 1 - P(C1)² - P(C2)² = 1 - 0 - 1 = 0
• C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 - (1/6)² - (5/6)² = 0.278
• C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6
  Gini = 1 - (2/6)² - (4/6)² = 0.444

Classification: Motivation
• (The buys_computer training table shown earlier is the running example for the GINI computations below.)

Binary Attributes: Computing GINI Index
• Splits into two partitions
• Effect of weighting partitions:
  – Larger and purer partitions are sought.
• Student? → Yes: Node N1 (6 yes, 1 no); No: Node N2 (3 yes, 4 no)

  Gini(N1) = 1 - (6/7)² - (1/7)² = 0.24
  Gini(N2) = 1 - (3/7)² - (4/7)² = 0.49
  Gini(Student) = 7/14 × 0.24 + 7/14 × 0.49 = ??

GINI Index for Buy Computer Example
• Gini (Income):
• Gini (Credit_Rating):
• Gini (Age):
(left as an exercise)
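The Gini examples above, the Gini(Student) value the slide leaves as "??", and the income/credit rating/age values left blank as an exercise can all be checked with a few lines of Python. The function names `gini` and `gini_split` are mine; the class counts are read straight off the buys_computer table. A minimal sketch:

```python
# Gini index of a node from its class counts: GINI(t) = 1 - sum_j p(j|t)^2
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Weighted Gini of a split: each partition weighted by its share of the records.
def gini_split(partitions):
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Node-impurity examples from the slides, as (C1, C2) counts:
print(round(gini((0, 6)), 3))  # 0.0
print(round(gini((1, 5)), 3))  # 0.278
print(round(gini((2, 4)), 3))  # 0.444
print(round(gini((3, 3)), 3))  # 0.5

# Student split on the buys_computer data: N1 = (6 yes, 1 no), N2 = (3 yes, 4 no)
print(round(gini((6, 1)), 2))                  # 0.24
print(round(gini((3, 4)), 2))                  # 0.49
print(round(gini_split([(6, 1), (3, 4)]), 3))  # 0.367, the "??" on the slide

# (yes, no) counts per attribute value, read off the buys_computer table:
splits = {
    "age":           [(2, 3), (4, 0), (3, 2)],  # <=30, 31...40, >40
    "income":        [(2, 2), (4, 2), (3, 1)],  # high, medium, low
    "student":       [(6, 1), (3, 4)],          # yes, no
    "credit_rating": [(6, 2), (3, 3)],          # fair, excellent
}
for name, parts in splits.items():
    print(name, round(gini_split(parts), 3))
# age 0.343, income 0.44, student 0.367, credit_rating 0.429
```

Age gives the lowest (best) weighted Gini of the four attributes, which is consistent with the earlier slide's tree being rooted at age.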