Decision Tree Learning
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
August 25, 2014

Example: Age, Income and Owning a flat
[Figure: scatter plot of the training set – Age (0–70) on the x-axis, monthly income in thousand rupees (0–250) on the y-axis; • owns a house, • does not own a house; two split lines, L1 (horizontal) and L2 (vertical), are drawn.]
If the training data was as above – could we define some simple rules by observation?
– Any point above the line L1: owns a house
– Any point to the right of L2: owns a house
– Any other point: does not own a house

Example: Age, Income and Owning a flat (continued)
[Same scatter plot, now read as a decision tree.]
Root node: split at Income = 101
– Income ≥ 101: Label = Yes
– Income < 101: split at Age = 54
  – Age ≥ 54: Label = Yes
  – Age < 54: Label = No

Example: Age, Income and Owning a flat (continued)
[Same scatter plot, without the split lines.]
Approach: recursively split the data into partitions so that each partition becomes purer, till …
– How to decide the split?
– How to measure purity?
– When to stop?

Approach for splitting
What are the possible lines for splitting?
– For each variable, midpoints between pairs of consecutive values for the variable
– How many? If N = number of points in the training set and m = number of variables, about O(N × m)
How to choose which line to use for splitting?
– The line which reduces impurity (~ heterogeneity of composition) the most
How to measure impurity?

Gini Index for Measuring Impurity
Suppose there are C classes.
Let p(i|t) = fraction of observations belonging to class i in rectangle (node) t.
Gini index: Gini(t) = 1 − Σ_{i=1}^{C} p(i|t)²
If all observations in t belong to one single class, Gini(t) = 0.
When is Gini(t) maximum?

Entropy
Average amount of information contained.
– From another point of view: average amount of information expected, hence amount of uncertainty
– We will study this in more detail later
Entropy: Entropy(t) = − Σ_{i=1}^{C} p(i|t) log2 p(i|t), where 0 log2 0 is defined to be 0.

Classification Error
What if we stop the tree building at a node?
– That is, do not create any further branches for that node
– Make that node a leaf
– Classify the node with the most frequent class present in the node
Classification error as a measure of impurity: ClassificationError(t) = 1 − max_i p(i|t)
[Figure note: this rectangle (node) is still impure.]
Intuitively – the fraction of observations in the rectangle (node) that do not belong to its most frequent class.
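The three impurity measures above can all be computed directly from the class proportions p(i|t) in a node. Below is a minimal Python sketch (not part of the original slides; the function name impurity_measures and the example counts are my own) that evaluates Gini, entropy and classification error for a node, given the labels of the training points it contains:

    from collections import Counter
    from math import log2

    def impurity_measures(labels):
        """Gini index, entropy and classification error of one node,
        given the class labels of the training points in that node."""
        n = len(labels)
        p = [count / n for count in Counter(labels).values()]   # p(i|t) for each class i
        gini = 1 - sum(pi ** 2 for pi in p)                     # Gini(t) = 1 - sum_i p(i|t)^2
        entropy = -sum(pi * log2(pi) for pi in p if pi > 0)     # 0 log2 0 treated as 0
        class_error = 1 - max(p)                                # 1 - max_i p(i|t)
        return gini, entropy, class_error

    # Example: a node with 5 points of class Y and 2 of class B
    # (the same node reappears in the pruning slides below).
    print(impurity_measures(["Y"] * 5 + ["B"] * 2))
    # -> roughly (0.408, 0.863, 0.286)

A pure node gives (0, 0, 0) for all three measures; each measure is largest when the classes in the node are evenly mixed.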
The Full Blown Tree
Recursive splitting: suppose we don't stop until all nodes are pure.
The result is a large decision tree with leaf nodes having very few data points.
– Does not represent the classes well
– Overfitting
[Figure: a tree grown from a root of 1000 points; successive splits (400/600, 200/200, 160/240, …) end in leaves containing only 2, 1 and 5 points – such tiny leaves are statistically not significant.]
Solution:
– Stop earlier, or
– Prune back the tree

Prune back
Pruning step: collapse leaf nodes and make the immediate parent a leaf node.
Effect of pruning:
– Lose purity of nodes
– But were they really pure, or was that noise? Too many nodes ≈ noise
– Trade-off between loss of purity and gain in complexity
[Figure: a decision node (Freq = 7) with two leaves – label = Y, Freq = 5 and label = B, Freq = 2 – is pruned into a single leaf with label = Y, Freq = 7.]

Prune back: cost complexity
Cost complexity of a (sub)tree: a trade-off between the classification error (based on training data) and a penalty for the size of the tree:
CC(T) = Err(T) + α L(T)
– Err(T) is the classification error
– L(T) = number of leaves in T
– Penalty factor α is between 0 and 1; if α = 0, there is no penalty for a bigger tree
(A small worked example of this trade-off appears at the end of these notes.)
[Same pruning figure as above: the Freq = 5 and Freq = 2 leaves are collapsed into a single Freq = 7 leaf.]

Different Decision Tree Algorithms
Chi-square Automatic Interaction Detector (CHAID)
– Gordon Kass (1980)
– Stop subtree creation if not statistically significant by a chi-square test
Classification and Regression Trees (CART)
– Breiman et al.
– Decision tree building by the Gini index
Iterative Dichotomizer 3 (ID3)
– Ross Quinlan (1986)
– Splitting by information gain (difference in entropy)
C4.5
– Quinlan's next algorithm, improved over ID3
– Bottom-up pruning, both categorical and continuous variables
– Handling of incomplete data points
C5.0
– Ross Quinlan's commercial version

Properties of Decision Trees
Non-parametric approach
– Does not require any prior assumptions regarding the probability distribution of the class and attributes
Finding an optimal decision tree is an NP-complete problem
– Heuristics used: greedy, recursive partitioning, top-down construction, bottom-up pruning
Fast to generate, fast to classify
Easy to interpret or visualize
Error propagation
– An error at the top of the tree propagates all the way down

References
Introduction to Data Mining, by Tan, Steinbach, Kumar
– Chapter 4 is available online: http://wwwusers.cs.umn.edu/~kumar/dmbook/ch4.pdf
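To make the cost-complexity trade-off from the "Prune back: cost complexity" slide concrete, here is a small worked Python sketch (my own illustration, not part of the slides). It uses the node shown in the pruning figure – two pure leaves with 5 points of class Y and 2 of class B – and, as a simplifying assumption, measures Err on just the 7 training points under that node, which is enough to compare keeping the subtree against pruning it:

    def cost_complexity(err, n_leaves, alpha):
        """CC(T) = Err(T) + alpha * L(T): training error plus a penalty per leaf."""
        return err + alpha * n_leaves

    subtree_err = 0 / 7   # the two leaves are pure: no training error
    pruned_err = 2 / 7    # one leaf labelled Y misclassifies the 2 points of class B

    for alpha in (0.0, 0.1, 0.3, 0.5):
        keep = cost_complexity(subtree_err, n_leaves=2, alpha=alpha)
        prune = cost_complexity(pruned_err, n_leaves=1, alpha=alpha)
        print(f"alpha={alpha:.1f}  keep={keep:.3f}  prune={prune:.3f}  ->",
              "prune" if prune <= keep else "keep")

Pruning wins as soon as 2/7 + α ≤ 2α, i.e. α ≥ 2/7 ≈ 0.286: a larger penalty factor α trades a little training purity for a smaller tree, which is exactly the trade-off described on the pruning slides.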