Decision Tree Learning
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
August 25, 2014

Example: Age, Income and Owning a flat
[Figure: scatter plot of the training set – Age (0–70) on the x-axis, monthly income in thousand rupees (0–250) on the y-axis; • owns a house, • does not own a house; two split lines, L1 (horizontal) and L2 (vertical), are drawn.]
If the training data was as above – could we define some simple rules by observation?
– Any point above the line L1: owns a house
– Any point to the right of L2: owns a house
– Any other point: does not own a house

Example: Age, Income and Owning a flat (continued)
[Same scatter plot, now read as a decision tree.]
Root node: split at Income = 101
– Income ≥ 101: Label = Yes
– Income < 101: split at Age = 54
  – Age ≥ 54: Label = Yes
  – Age < 54: Label = No

Example: Age, Income and Owning a flat (continued)
[Same scatter plot, without the split lines.]
Approach: recursively split the data into partitions so that each partition becomes purer, till …
– How to decide the split?
– How to measure purity?
– When to stop?

Approach for splitting
What are the possible lines for splitting?
– For each variable, midpoints between pairs of consecutive values for the variable
– How many? If N = number of points in the training set and m = number of variables, about O(N × m)
How to choose which line to use for splitting?
– The line which reduces impurity (~ heterogeneity of composition) the most
How to measure impurity?

Gini Index for Measuring Impurity
Suppose there are C classes.
Let p(i|t) = fraction of observations belonging to class i in rectangle (node) t.
Gini index: Gini(t) = 1 − Σ_{i=1}^{C} p(i|t)²
If all observations in t belong to one single class, Gini(t) = 0.
When is Gini(t) maximum?

Entropy
Average amount of information contained.
– From another point of view: average amount of information expected, hence amount of uncertainty
– We will study this in more detail later
Entropy: Entropy(t) = − Σ_{i=1}^{C} p(i|t) log2 p(i|t), where 0 log2 0 is defined to be 0.

Classification Error
What if we stop the tree building at a node?
– That is, do not create any further branches for that node
– Make that node a leaf
– Classify the node with the most frequent class present in the node
Classification error as a measure of impurity: ClassificationError(t) = 1 − max_i p(i|t)
[Figure note: this rectangle (node) is still impure.]
Intuitively – the fraction of observations in the rectangle (node) that do not belong to its most frequent class.
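The three impurity measures above can all be computed directly from the class proportions p(i|t) in a node. Below is a minimal Python sketch (not part of the original slides; the function name impurity_measures and the example counts are my own) that evaluates Gini, entropy and classification error for a node, given the labels of the training points it contains:

    from collections import Counter
    from math import log2

    def impurity_measures(labels):
        """Gini index, entropy and classification error of one node,
        given the class labels of the training points in that node."""
        n = len(labels)
        p = [count / n for count in Counter(labels).values()]   # p(i|t) for each class i
        gini = 1 - sum(pi ** 2 for pi in p)                     # Gini(t) = 1 - sum_i p(i|t)^2
        entropy = -sum(pi * log2(pi) for pi in p if pi > 0)     # 0 log2 0 treated as 0
        class_error = 1 - max(p)                                # 1 - max_i p(i|t)
        return gini, entropy, class_error

    # Example: a node with 5 points of class Y and 2 of class B
    # (the same node reappears in the pruning slides below).
    print(impurity_measures(["Y"] * 5 + ["B"] * 2))
    # -> roughly (0.408, 0.863, 0.286)

A pure node gives (0, 0, 0) for all three measures; each measure is largest when the classes in the node are evenly mixed.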
The Full Blown Tree
Recursive splitting: suppose we don't stop until all nodes are pure.
The result is a large decision tree with leaf nodes having very few data points.
– Does not represent the classes well
– Overfitting
[Figure: a tree grown from a root of 1000 points; successive splits (400/600, 200/200, 160/240, …) end in leaves containing only 2, 1 and 5 points – such tiny leaves are statistically not significant.]
Solution:
– Stop earlier, or
– Prune back the tree

Prune back
Pruning step: collapse leaf nodes and make the immediate parent a leaf node.
Effect of pruning:
– Lose purity of nodes
– But were they really pure, or was that noise? Too many nodes ≈ noise
– Trade-off between loss of purity and gain in complexity
[Figure: a decision node (Freq = 7) with two leaves – label = Y, Freq = 5 and label = B, Freq = 2 – is pruned into a single leaf with label = Y, Freq = 7.]

Prune back: cost complexity
Cost complexity of a (sub)tree: a trade-off between the classification error (based on training data) and a penalty for the size of the tree:
CC(T) = Err(T) + α L(T)
– Err(T) is the classification error
– L(T) = number of leaves in T
– Penalty factor α is between 0 and 1; if α = 0, there is no penalty for a bigger tree
(A small worked example of this trade-off appears at the end of these notes.)
[Same pruning figure as above: the Freq = 5 and Freq = 2 leaves are collapsed into a single Freq = 7 leaf.]

Different Decision Tree Algorithms
Chi-square Automatic Interaction Detector (CHAID)
– Gordon Kass (1980)
– Stop subtree creation if not statistically significant by a chi-square test
Classification and Regression Trees (CART)
– Breiman et al.
– Decision tree building by the Gini index
Iterative Dichotomizer 3 (ID3)
– Ross Quinlan (1986)
– Splitting by information gain (difference in entropy)
C4.5
– Quinlan's next algorithm, improved over ID3
– Bottom-up pruning, both categorical and continuous variables
– Handling of incomplete data points
C5.0
– Ross Quinlan's commercial version

Properties of Decision Trees
Non-parametric approach
– Does not require any prior assumptions regarding the probability distribution of the class and attributes
Finding an optimal decision tree is an NP-complete problem
– Heuristics used: greedy, recursive partitioning, top-down construction, bottom-up pruning
Fast to generate, fast to classify
Easy to interpret or visualize
Error propagation
– An error at the top of the tree propagates all the way down

References
Introduction to Data Mining, by Tan, Steinbach, Kumar
– Chapter 4 is available online: http://wwwusers.cs.umn.edu/~kumar/dmbook/ch4.pdf
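To make the cost-complexity trade-off from the "Prune back: cost complexity" slide concrete, here is a small worked Python sketch (my own illustration, not part of the slides). It uses the node shown in the pruning figure – two pure leaves with 5 points of class Y and 2 of class B – and, as a simplifying assumption, measures Err on just the 7 training points under that node, which is enough to compare keeping the subtree against pruning it:

    def cost_complexity(err, n_leaves, alpha):
        """CC(T) = Err(T) + alpha * L(T): training error plus a penalty per leaf."""
        return err + alpha * n_leaves

    subtree_err = 0 / 7   # the two leaves are pure: no training error
    pruned_err = 2 / 7    # one leaf labelled Y misclassifies the 2 points of class B

    for alpha in (0.0, 0.1, 0.3, 0.5):
        keep = cost_complexity(subtree_err, n_leaves=2, alpha=alpha)
        prune = cost_complexity(pruned_err, n_leaves=1, alpha=alpha)
        print(f"alpha={alpha:.1f}  keep={keep:.3f}  prune={prune:.3f}  ->",
              "prune" if prune <= keep else "keep")

Pruning wins as soon as 2/7 + α ≤ 2α, i.e. α ≥ 2/7 ≈ 0.286: a larger penalty factor α trades a little training purity for a smaller tree, which is exactly the trade-off described on the pruning slides.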