Classification II

Data Mining Overview
- Data Mining
- Data warehouses and OLAP (On-Line Analytical Processing)
- Association Rules Mining
- Clustering: hierarchical and partitional approaches
- Classification: decision trees and Bayesian classifiers

Setting
- Given old data about customers and payments, predict a new applicant's loan eligibility.
- [Figure: previous customers (Age, Salary, Profession, Location, Customer type) are fed to a classifier, which learns decision rules such as "Salary > 5 L" and "Prof. = Exec"; the rules label a new applicant's data as Good or Bad.]

Decision trees
- A tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
- [Figure: example tree with internal nodes "Salary < 1 M", "Prof = teaching", "Age < 30" and leaves labelled Good or Bad.]

SLIQ (Supervised Learning In Quest)
- A decision-tree classifier for data mining.
- Design goals:
  - able to handle large, disk-resident training sets;
  - no restrictions on training-set size.

Building the tree

    GrowTree(TrainingData D)
        Partition(D);

    Partition(Data D)
        if (all points in D belong to the same class) then
            return;
        for each attribute A do
            evaluate splits on attribute A;
        use the best split found to partition D into D1 and D2;
        Partition(D1);
        Partition(D2);

Data Setup
- One list for each attribute.
- Entries in an attribute list consist of:
  - attribute value
  - class list index (CLI)
- A list for the classes, with pointers to the tree nodes.
- Lists for continuous attributes are kept in sorted order.
- Attribute lists may be disk-resident; the class list must be in main memory.

Data Setup (example, all records initially at node N1)

    Training data            Attribute lists              Class list
    Age  Car Type  Risk      Age  CLI   Car Type  CLI     idx  Risk  Leaf
    23   family    High      23   0     family    0       0    High  N1
    17   sports    High      17   1     sports    1       1    High  N1
    43   sports    High      43   2     sports    2       2    High  N1
    68   family    Low       68   3     family    3       3    Low   N1
    32   truck     Low       32   4     truck     4       4    Low   N1
    20   family    High      20   5     family    5       5    High  N1

    After sorting the continuous attribute:

    Age  CLI
    17   1
    20   5
    23   0
    32   4
    43   2
    68   3

Evaluating Split Points
- Gini index: if data D contains examples from c classes,
      Gini(D) = 1 - sum_j (p_j)^2
  where p_j is the relative frequency of class j in D.
- If D is split into D1 and D2 with n1 and n2 tuples each,
      Gini_split(D) = (n1/n) * Gini(D1) + (n2/n) * Gini(D2)
- Note: only class frequencies are needed to compute the index.

Finding Split Points
- For each attribute A, evaluate splits on A using its attribute list.
- Key idea: evaluating a split on a numerical attribute normally requires sorting the data at each node. But if all attributes are presorted, no sorting is needed during the tree-construction phase.
- Keep the split with the lowest Gini index.

Finding Split Points: Continuous Attributes
- Consider splits of the form value(A) < x, e.g. Age < 17.
- Evaluate this split form for every value in the attribute list.
- To evaluate splits on attribute A for a given tree node:

    initialize the class histograms of the left and right children;
    for each record in the attribute list do
        find the corresponding entry in the class list (its class and leaf node);
        evaluate the splitting index for value(A) < record.value;
        update the class histogram in the leaf;

- Example (node N1, Age list sorted as above). Class histograms for the candidate splits:

    Split      Left (High, Low)   Right (High, Low)   Gini
    Age < 17   0, 0               4, 2                undefined
    Age < 20   1, 0               3, 2                0.33
    Age < 32   3, 0               1, 2                0.22
    Age < 43   3, 1               1, 1                0.5

  The split Age < 32 has the lowest Gini index and is kept.
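The histogram-based evaluation above can be illustrated with a short in-memory sketch. This is only a minimal Python illustration, not SLIQ itself: real attribute lists are disk-resident and the class list holds leaf pointers, and names such as evaluate_continuous_splits are invented for this example.

    # Minimal sketch of SLIQ-style split evaluation for one presorted
    # continuous attribute, assuming everything fits in memory.
    def gini(hist):
        """Gini index of a class histogram {label: count}."""
        n = sum(hist.values())
        return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in hist.values())

    def gini_split(left, right):
        """Weighted Gini index of a binary split, given the child histograms."""
        n1, n2 = sum(left.values()), sum(right.values())
        return (n1 / (n1 + n2)) * gini(left) + (n2 / (n1 + n2)) * gini(right)

    def evaluate_continuous_splits(attr_list, class_list):
        """attr_list: [(value, class_list_index)] sorted by value.
        class_list: class label per record. Evaluates splits value(A) < x."""
        right = {}                                  # all records start on the right
        for label in class_list:
            right[label] = right.get(label, 0) + 1
        left = {label: 0 for label in right}

        best_value, best_gini = None, float("inf")
        for value, cli in attr_list:
            if sum(left.values()) > 0:              # skip the degenerate empty split
                g = gini_split(left, right)         # candidate split: value(A) < value
                if g < best_gini:
                    best_value, best_gini = value, g
            label = class_list[cli]                 # this record moves to the left side
            left[label] += 1
            right[label] -= 1
        return best_value, best_gini

    # Age attribute list and Risk class list from the example above.
    age_list = [(17, 1), (20, 5), (23, 0), (32, 4), (43, 2), (68, 3)]
    risk = ["High", "High", "High", "Low", "Low", "High"]
    print(evaluate_continuous_splits(age_list, risk))   # -> (32, 0.222...)

Running it on the Age list returns Age < 32 as the best split, as on the slide; the Gini values it computes follow the weighted formula given earlier.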
Finding Split Points: Categorical Attributes
- Consider splits of the form value(A) in {x1, x2, ..., xn}, e.g. CarType in {family, sports}.
- Evaluate this split form for subsets of domain(A). (A code sketch of this subset evaluation appears after the MDL slides below.)
- To evaluate splits on attribute A for a given tree node:

    initialize the class/value matrix of the node to zeroes;
    for each record in the attribute list do
        increment the appropriate count in the matrix;
    evaluate the splitting index for the various subsets using the constructed matrix;

- Class/value matrix for the Car Type list at node N1:

              High  Low
    family     2     1
    sports     2     0
    truck      0     1

- Candidate splits and the resulting child histograms:

    Split                 Left (High, Low)   Right (High, Low)   Gini
    CarType in {family}   2, 1               2, 1                0.444
    CarType in {sports}   2, 0               2, 2                0.333
    CarType in {truck}    0, 1               4, 1                0.267

Updating the Class List
- The next step is to update the class list with the new nodes.
- Scan the attribute list used for the split and update the corresponding leaf entries in the class list:

    for each attribute A used in the split, traverse its attribute list
        for each value u in the attribute list
            find the corresponding entry e in the class list;
            find the new class c to which u belongs;
            update the class of entry e to c;
            update the node reference in e to the node corresponding to class c;

Preventing Overfitting
- A tree T overfits if there is another tree T' that gives higher error on the training data yet lower error on unseen data.
- An overfitted tree does not generalize to unseen instances.
- Overfitting happens when the data contains noise or irrelevant attributes and the training set is small.
- Overfitting can reduce accuracy drastically: by 10-25%, as reported in Mingers' 1989 Machine Learning study.

Approaches to Prevent Overfitting
- Two approaches:
  1. Stop growing the tree beyond a certain point.
  2. First overfit, then post-prune (more widely used).
- With the second approach, tree building is divided into two phases:
  - growth phase
  - prune phase
- Because it is hard to decide when to stop growing the tree, the second approach is more widely used.

Criteria for Finding the Correct Final Tree Size
- Three criteria:
  - Cross-validation with separate test data.
  - Use a criterion function to choose the best size, e.g. the minimum description length (MDL) criterion.
  - Statistical bounds: use all data for training, but apply a statistical test to decide the right size.

The Minimum Description Length Principle (MDL)
- MDL is a paradigm for statistical estimation, particularly model selection.
- Given data D and a class of models M, the goal is to choose a model m in M such that the data and the model can be encoded using the smallest total length:
      L(D) = L(D|m) + L(m)
- How do we find the encoding length? The answer comes from information theory.
- Consider the problem of transmitting n messages, where p_i is the probability of seeing message i.
- Shannon's theorem: the minimum expected length is achieved by assigning -log p_i bits to message i.

Encoding the Data
- Assume t records of training data D.
- First send the tree m using L(m|M) bits.
- Assume everything except the class labels of the training data is already known; the goal is to transmit the class labels using L(D|m) bits:
  - if the tree correctly predicts an instance, 0 bits;
  - otherwise, log k bits, where k is the number of classes.
- Thus, with e errors on the training data, the total cost is e log k + L(m|M) bits.
- A complex tree has a higher L(m|M) but a lower e.
- Question: how do we encode the tree?
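As a rough numerical illustration of this trade-off, the sketch below compares the total cost e * log2(k) + L(m|M) for two hypothetical trees. The error counts and tree-encoding lengths are made-up values, and the question of how to actually encode the tree is left open, as on the slides.

    import math

    # Total MDL cost of a candidate tree: e * log2(k) bits for the
    # misclassified labels plus L(m|M) bits for the tree itself.
    def mdl_cost(errors, num_classes, tree_bits):
        return errors * math.log2(num_classes) + tree_bits

    k = 2                                                 # e.g. Good / Bad
    small_tree = mdl_cost(errors=12, num_classes=k, tree_bits=20)
    large_tree = mdl_cost(errors=2,  num_classes=k, tree_bits=90)
    print(small_tree, large_tree)   # 32.0 92.0 -> MDL prefers the smaller tree here

With different (equally hypothetical) encoding costs the comparison can flip, which is exactly why the choice of tree encoding matters.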
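Returning to the categorical-split slide above, here is a minimal in-memory sketch that builds the class/value matrix and evaluates candidate subsets with the Gini index. The exhaustive subset enumeration and the function names are illustrative choices for this example, not SLIQ's actual subsetting strategy for large domains.

    from itertools import combinations

    # Same Gini helpers as in the continuous-attribute sketch.
    def gini(hist):
        n = sum(hist.values())
        return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in hist.values())

    def gini_split(left, right):
        n1, n2 = sum(left.values()), sum(right.values())
        return (n1 / (n1 + n2)) * gini(left) + (n2 / (n1 + n2)) * gini(right)

    def evaluate_categorical_splits(attr_list, class_list):
        """attr_list: [(value, class_list_index)], class_list: label per record."""
        labels = set(class_list)
        counts = {}                       # class/value matrix: counts[value][label]
        for value, cli in attr_list:
            counts.setdefault(value, {l: 0 for l in labels})[class_list[cli]] += 1

        best_subset, best_gini = None, float("inf")
        values = list(counts)
        for r in range(1, len(values)):   # every non-empty proper subset as left child
            for subset in combinations(values, r):
                left = {l: 0 for l in labels}
                right = {l: 0 for l in labels}
                for v, hist in counts.items():
                    side = left if v in subset else right
                    for l, c in hist.items():
                        side[l] += c
                g = gini_split(left, right)
                if g < best_gini:
                    best_subset, best_gini = set(subset), g
        return best_subset, best_gini

    cartype = [("family", 0), ("sports", 1), ("sports", 2),
               ("family", 3), ("truck", 4), ("family", 5)]
    risk = ["High", "High", "High", "Low", "Low", "High"]
    print(evaluate_categorical_splits(cartype, risk))   # -> ({'truck'}, 0.266...)

On the Car Type data it reports CarType in {truck} (equivalently, its complement) with Gini 0.267, matching the table above.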
Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules.
- One rule is created for each path from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction.
- The leaf node holds the class prediction.
- Rules are easier for humans to understand.
- Example (see the rule-extraction sketch at the end of this section):

    IF age = "<=30" AND student = "no" THEN buys_computer = "no"
    IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
    IF age = "31…40" THEN buys_computer = "yes"
    IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
    IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"

SPRINT
- An improvement over SLIQ.
- Does not need to keep any list in main memory.
- The parallel version is straightforward.
- Attribute lists are extended with a class field, so no separate class list is needed.
- Uses hashing to assign records to classes and nodes.

Pros and Cons of Decision Trees
- Pros:
  + reasonable training time
  + fast application
  + easy to interpret
  + easy to implement
  + can handle a large number of features
- Cons:
  - cannot handle complicated relationships between features
  - simple decision boundaries
  - problems with lots of missing data

More information: http://www.recursive-partitioning.com/
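The sketch below illustrates the rule extraction referenced earlier in this section: each root-to-leaf path is turned into an IF-THEN rule by conjoining the tests along the path. The Node class and the tiny example tree are invented for illustration and are not the slides' buys_computer tree.

    # Turn each root-to-leaf path of a decision tree into an IF-THEN rule.
    class Node:
        def __init__(self, test=None, children=None, label=None):
            self.test = test                  # attribute tested at an internal node
            self.children = children or {}    # branch value -> child Node
            self.label = label                # class prediction at a leaf

    def extract_rules(node, conditions=()):
        """Return a list of IF-THEN rule strings, one per leaf."""
        if node.label is not None:                        # reached a leaf
            body = " AND ".join(conditions) if conditions else "TRUE"
            return [f'IF {body} THEN class = "{node.label}"']
        rules = []
        for branch, child in node.children.items():
            rules += extract_rules(child, conditions + (f'{node.test} = "{branch}"',))
        return rules

    # Tiny illustrative tree: split on age, then on student for the "<=30" branch.
    tree = Node(test="age", children={
        "<=30": Node(test="student", children={
            "no":  Node(label="no"),
            "yes": Node(label="yes"),
        }),
        "31…40": Node(label="yes"),
        ">40": Node(label="yes"),
    })

    for rule in extract_rules(tree):
        print(rule)
    # IF age = "<=30" AND student = "no" THEN class = "no"
    # IF age = "<=30" AND student = "yes" THEN class = "yes"
    # ...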