Further Data Mining: Building Decision Trees

Nathan Rountree
first presented 28 July 1999

Classification is often seen as the most useful (and lucrative) form of data mining. Although every pattern recognition technique under the sun has been used to do classification (e.g. rough sets, discriminant analysis, logistic regression, multilayer perceptrons), decision trees are the most popular. This may be because they form rules which are easy to understand, or perhaps because they can be converted easily into SQL. While not as “robust” as neural nets and not as statistically “tidy” as discriminant analysis, decision trees often show very good generalisation capability. This document describes how to create and prune decision trees, concentrating on the SPRINT technique introduced by the QUEST research group.

Contents

1 Overview
2 Growing Trees
3 Pruning Trees
4 Final Thoughts
5 References

1 Overview

Decision trees are built by choosing an attribute and a value for that attribute which splits the data set. The attribute and value are chosen to minimise the diversity of class labels in the two resulting sets (an alternative way of looking at this is to maximise information gain, or to minimise entropy). The first split is unlikely to be perfect, so we recursively split the sets it creates until every set we have consists of only one class. Creating the decision tree is then simply a matter of collating the splits in the correct order. The trick in data mining (where we may be dealing with big datasets, possibly even too big to fit into memory) is to find that attribute and value with the minimum number of passes through the database.

Having created a tree on a training set, how good a classifier is it? The answer depends on several things: are we testing the classifier on the training set (in which case it should be perfect), or on another test set drawn from the same population? If we test the tree on a new set, we get an idea of its capacity for generalisation.
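The recursive splitting described above can be sketched in a few lines. This is an in-memory toy for illustration only, not SPRINT's out-of-core algorithm, and every name in it (gini, best_split, grow) is my own:

```python
from collections import Counter

def gini(labels):
    """Diversity of a list of class labels: 1 - sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every (attribute, value) pair and return the one that minimises
    the weighted gini of the two resulting subsets."""
    best = None
    for attr in range(len(rows[0])):
        for value in sorted({row[attr] for row in rows}):
            left = [i for i, row in enumerate(rows) if row[attr] <= value]
            right = [i for i in range(len(rows)) if rows[i][attr] > value]
            if not left or not right:
                continue  # a split must produce two non-empty sets
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, attr, value, left, right)
    return best

def grow(rows, labels):
    """Recursively split until each subset contains only one class."""
    if len(set(labels)) == 1:
        return labels[0]  # pure node becomes a leaf
    _, attr, value, left, right = best_split(rows, labels)
    return (attr, value,
            grow([rows[i] for i in left], [labels[i] for i in left]),
            grow([rows[i] for i in right], [labels[i] for i in right]))

print(grow([[1], [2], [3], [4]], ['a', 'a', 'b', 'b']))  # -> (0, 2, 'a', 'b')
```

The point of SPRINT, discussed below, is to get the same result without repeatedly rescanning (or even holding) the whole dataset in memory.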
[Figure 1: Overfitting. Error plotted against model complexity, with one curve for the training set and a higher one for the test set; the point of best generalisation is marked where the test-set error is lowest.]

Now consider Figure 1: the lower line represents the drop-off in error on the training set as we add more nodes to the tree. The top line represents the equivalent error on a test set. Note that the error is always likely to be a bit higher for the test set, since the tree has not been “trained” on it. Note also that this is exactly what happens if you overtrain a neural network. The complexity of the model has to be just right: too little or too much results in poor generalisation.

The solution to overfitting a tree is to grow as big a tree as you can, and then “prune” it back according to some criterion. As you prune, you see the reverse of the top line of Figure 1: the error on a test set starts quite high (because of overfitting), then drops as the tree is pruned back and its generalisation ability improves; then, as the tree gets too small, the error creeps up again. If you create a sequence of pruned trees, the one to choose is the one just before this upsweep occurs, since it is the smallest tree with the best generalisation.

2 Growing Trees

We will concentrate on the SPRINT method of growing trees, since it is fast, efficient, and requires the same amount of memory regardless of the number of tuples in the database. The first thing we need to know is how to measure the “goodness” of a split; what constitutes “minimisation of diversity”? The gini index provides a measure of this (sorry, I’m not sure what “gini” stands for). For a dataset S,

    gini(S) = 1 - sum_j (p_j)^2

where p_j is the relative frequency of class j in S. This measures the diversity of class labels within a set; the worst situation (an even distribution of 2 classes) produces a value of 0.5. The value decreases as a split favours one class: for instance, a 70/30 distribution produces 0.42.
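The quoted values are easy to check. A minimal sketch of the gini calculation from class counts (the helper name is my own):

```python
def gini(counts):
    """gini(S) = 1 - sum_j (p_j)^2, where p_j is the relative frequency of
    class j, computed here from a list of per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([50, 50]), 2))    # even 2-class split   -> 0.5
print(round(gini([70, 30]), 2))    # 70/30 split          -> 0.42
print(round(gini([1, 1, 1]), 2))   # even 3-class split   -> 0.67
```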
If you have more than 2 classes, the maximum possible value increases; for instance, the worst possible diversity for three classes is an even split of 33 percent each, producing a gini value of 0.67.

OK, so how do we measure the goodness of a split point? Well, if the decision splits the database S into S1 and S2, the gini value of the divided data is given by

    gini_split(S) = (n1/n) gini(S1) + (n2/n) gini(S2)

where n1 and n2 are the number of rows in S1 and S2 and n = n1 + n2. All we need to know to calculate this is what the distribution of the classes would be if we split the data at each point. So this is how we do it:

1. Number the rows of the database.
2. Break the database into its separate columns. To each item, add its row number and the class label for that row number.
3. Sort each column on attribute value.
4. Create 2 histograms of the class labels; call one “above” and the other “below”. We know how many of each label there are, because we counted them while doing the sort. Put these numbers in the “above” histogram and set the numbers in the “below” histogram to zero. Set a pointer to the first (lowest) value of the first attribute.
5. Increment the pointer. At each increment, note what label we are looking at. Increment that label in the “below” histogram and decrement it in the “above” histogram.

The histograms contain all the information we need to calculate the gini index at each value of each attribute. Each time we see a gini value that is smaller than the best so far, we save the attribute and value the pointer is currently on. We have to swap the values in the histograms every time we move on to a new attribute. Once we have calculated a split point, we add it to the tree.

Now we have to actually split the data. Splitting the attribute that produced the best gini value is easy: just test each value against the decision, sending the data into the “left” file if it matches the decision and into the “right” file if it doesn’t. As this happens, note the row numbers and put every row number which goes left into a hash table (which can be written to disk if necessary).
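The histogram scan in steps 4 and 5 can be sketched for a single numeric attribute as follows. This is an in-memory illustration of the idea rather than SPRINT's disk-based attribute lists, and the function and variable names are my own:

```python
from collections import Counter

def gini(hist, n):
    """Gini of a class histogram containing n items in total."""
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in hist.values())

def best_split_point(attribute_list):
    """Scan a presorted attribute list of (value, row_id, label) tuples,
    maintaining 'above' and 'below' class histograms, and return the
    (value, gini_split) pair with the smallest weighted gini."""
    n = len(attribute_list)
    above = Counter(label for _, _, label in attribute_list)  # everything starts above
    below = Counter()                                         # nothing below the pointer yet
    best_value, best_gini = None, float('inf')
    for (value, _, label), nxt in zip(attribute_list, attribute_list[1:]):
        below[label] += 1          # move one tuple below the pointer...
        above[label] -= 1          # ...and out of the "above" histogram
        if value == nxt[0]:
            continue               # can only split between distinct values
        k = sum(below.values())
        g = (k / n) * gini(below, k) + ((n - k) / n) * gini(above, n - k)
        if g < best_gini:
            best_value, best_gini = value, g
    return best_value, best_gini

# Rows sorted by attribute value, as they are after SPRINT's one big sort.
attr_list = [(1.0, 0, 'a'), (2.0, 1, 'a'), (3.0, 2, 'b'), (4.0, 3, 'b')]
print(best_split_point(attr_list))  # -> (2.0, 0.0): "value <= 2.0" separates the classes
```

Note that the full database is never consulted during the scan; the two histograms are all the state required, which is what keeps SPRINT's memory use independent of the number of tuples.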
Now go back and do the other attributes: if the row number is in the hash table, go left; otherwise go right. The nice thing about this is that the sorted order of each attribute is maintained, so we only ever do a big sort once, at the very beginning. This technique builds the tree breadth first, so it is easier to implement it iteratively rather than recursively. In fact it is possible to have only four files open at once for the data to “pour” into; how this is done is left as an exercise for the (very) interested.

3 Pruning Trees

Recall what we said in the overview about overfitting; now that we have a tree, we need to prune it back. There are two main schools of thought on this, similar in spirit but with quite different effects in practice: pruning by Minimum Description Length (MDL) or by Minimum Cost Complexity (MCC). Here goes.

In both cases we store extra information at each tree node: specifically, which class is most highly represented there, and what proportion of the items at that node actually are of that class. So we know what would happen if we snipped a branch and replaced it with a leaf: the leaf would be labelled with the most highly represented class, and we could recalculate the accuracy on the training set.

MDL: Descend recursively to the leaf nodes. On the way back up, observe the accuracy of each node and calculate, according to “message length”, the “cost” of each branch. If the cost of a branch is larger than the error we would introduce by snipping it, then snip it.

Advantages: Very fast. We can tune the pruning to be more or less harsh by adding a “severity” parameter to the message-length calculation. If the severity is high, large subtrees will be more likely to be “punished”.

Disadvantages: Produces one tree only (alternatively, you could run it several times with differing severity parameters). Very sensitive to whatever formula you use to calculate “message length”.
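The bottom-up snipping idea can be sketched as follows. Everything here is my own assumption for illustration: the tree layout (a dict holding the class counts of the training items reaching each node), and the branch cost of errors + severity * leaves, which is only a crude stand-in for a real message-length calculation:

```python
from collections import Counter

def leaf_errors(counts):
    """Training errors if a node became a leaf labelled with its majority
    class: every item that is not of the majority class."""
    return sum(counts.values()) - max(counts.values())

def prune(node, severity=1.0):
    """Bottom-up MDL-style pruning.  A leaf is {'counts': ...}; an internal
    node also has 'children'.  Returns (pruned_node, errors, leaves)."""
    if 'children' not in node:
        return node, leaf_errors(node['counts']), 1
    new_children, errors, leaves = [], 0, 0
    for child in node['children']:
        c, e, l = prune(child, severity)
        new_children.append(c)
        errors += e
        leaves += l
    # Snip if replacing the branch with a leaf costs no more than keeping it.
    if leaf_errors(node['counts']) <= errors + severity * leaves:
        return {'counts': node['counts']}, leaf_errors(node['counts']), 1
    return {**node, 'children': new_children}, errors, leaves

tree = {'counts': Counter(a=7, b=3), 'children': [
    {'counts': Counter(a=7, b=1), 'children': [
        {'counts': Counter(a=7)}, {'counts': Counter(b=1)}]},
    {'counts': Counter(b=2)}]}

pruned, _, leaves = prune(tree, severity=0.5)
# The nearly-pure left branch collapses to a leaf; the root survives.
```

Raising the severity parameter makes large subtrees more expensive to keep, so more of the tree gets snipped, exactly the tuning knob described above.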
MCC: Define “cost-complexity” as a linear combination of “cost” (classification error) and “complexity” (the number of leaves on a branch). Add a tuning parameter t which we can increase or decrease to make complexity more or less “expensive”. Now imagine slowly increasing t: some branch will be the first for which complexity becomes more costly than accuracy. Define this as the weakest branch, and snip it. Continue until you have only one node left. We can do this by calculating a t-value for every node; whichever node has the smallest t-value is the one to prune.

Advantages: This method produces a whole list of pruned trees, and we can test each one against the test set to see which is best. It is an aggressive form of pruning, so it often produces the smallest possible tree; unfortunately, you don’t know which one that is until you have tested the whole lot.

Disadvantages: Slow to do and slow to test. However, it is still several orders of magnitude faster than building the tree in the first place.

4 Final Thoughts

How would you handle categorical attributes with this scheme? How do you avoid creating lots of open files? How would you parallelise the building of a decision tree? What is the order of the tree-building algorithm in terms of the number of database rows? Columns?

5 References

Compulsory:

J. C. Shafer, R. Agrawal, and M. Mehta, “SPRINT: A Scalable Parallel Classifier for Data Mining”, Proc. of the 22nd Int’l Conference on Very Large Databases, Mumbai (Bombay), India, September 1996.

M. Mehta, R. Agrawal, and J. Rissanen, “SLIQ: A Fast Scalable Classifier for Data Mining”, Proc. of the Fifth Int’l Conference on Extending Database Technology, Avignon, France, March 1996.

Suggested:

M. Mehta, J. Rissanen, and R. Agrawal, “MDL-based Decision Tree Pruning”, Proc. of the 1st Int’l Conference on Knowledge Discovery in Databases and Data Mining, Montreal, Canada, August 1995.

L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees, Chapter 3, Wadsworth Inc., 1984.