Further Data Mining: Building Decision Trees
Nathan Rountree
first presented 28 July 1999
Classification is often seen as the most useful (and lucrative) form of data mining. Although every
pattern recognition technique under the sun has been used to do classification (e.g. rough sets,
discriminant analysis, logistic regression, multilayer perceptrons), decision trees are the most popular. This may be because they form rules which are easy to understand, or perhaps because
they can be converted easily into SQL. While not as “robust” as neural nets and not as statistically
“tidy” as discriminant analysis, decision trees often show very good generalisation capability. This
document describes how to create and prune decision trees, concentrating on the SPRINT
technique introduced by the QUEST research group.
Contents
1 Overview
2 Growing Trees
3 Pruning Trees
4 Final Thoughts
5 References
1 Overview
Decision trees are built by choosing an attribute and a value for that attribute which splits the data
set. The attribute and value are chosen to minimise diversity of class label in the two resulting sets
(an alternative way of looking at this is to maximise information gain or to minimise entropy). The
first split is unlikely to be perfect, so we recursively split the sets created until all the sets we have
consist of only one class. Creating the decision tree is simply a matter of collating the splits in the
correct order.
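To make the recursion concrete before looking at SPRINT, here is a minimal sketch in Python for a small in-memory dataset, using the gini measure described in Section 2 as the diversity measure; the data layout and helper names are mine, not SPRINT's.

    # Minimal sketch of recursive tree growing for a small in-memory dataset.
    # A row is assumed to be (attribute_values, class_label).

    def gini(labels):
        """Diversity of class labels: 1 - sum of squared class frequencies."""
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    def best_split(rows):
        """Try every attribute/value pair; keep the split whose two subsets
        have the lowest weighted class diversity."""
        best = None
        for attr in range(len(rows[0][0])):
            for values, _ in rows:
                left = [r for r in rows if r[0][attr] <= values[attr]]
                right = [r for r in rows if r[0][attr] > values[attr]]
                if not left or not right:
                    continue
                score = (len(left) * gini([c for _, c in left]) +
                         len(right) * gini([c for _, c in right])) / len(rows)
                if best is None or score < best[0]:
                    best = (score, attr, values[attr], left, right)
        return best

    def grow(rows):
        """Split recursively until each subset holds only one class."""
        labels = [c for _, c in rows]
        split = best_split(rows)
        if len(set(labels)) == 1 or split is None:
            return {"class": max(set(labels), key=labels.count)}
        _, attr, value, left, right = split
        return {"attr": attr, "value": value,
                "left": grow(left), "right": grow(right)}

SPRINT produces the same kind of tree, but finds each split with a fixed number of passes over sorted attribute lists, as described next.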
The trick in data mining (where we may be dealing with big datasets; possibly even too big to fit
into memory) is to find that attribute and value with the minimum number of passes through the
database.
Having created a tree on a training set, how good a classifier is it? The answer depends on several
things: are we testing the classifier on the training set (in which case it should be perfect) or on
another test set drawn from the same population? If we test the tree out on a new set, we get an
idea of its capacity for generalisation.
Figure 1: Overfitting. (Error versus model complexity for the training set and a test set, with the point of best generalisation marked.)
Now consider Figure 1: the lower line represents the drop-off in error on the training set as we add
more nodes to the tree. The top line represents the equivalent error on a test set. Note that the
error is always likely to be a bit higher for the test set, since the tree has not been “trained” on it.
Note also that this is exactly the same thing that happens if you overtrain a neural network. The
complexity of the model has to be just right: too little or too much results in poor generalisation.
The solution to overfitting a tree is to grow as big a tree as you can, and then “prune” it back
according to some criterion. As you prune it, you see the reverse of the top line of Figure 1: the
error on a test set starts quite high (because of overfitting) then drops as the tree is pruned back.
As the tree gets smaller the error drops off (as the generalisation ability improves) then, as the tree
gets too small, the error creeps up again. If you create a sequence of pruned trees, the one to choose
is the one just before this upsweep occurs, since it is the smallest tree with the best generalisation.
2 Growing Trees
We will concentrate on the SPRINT method of growing trees since it is fast, efficient and requires
the same amount of memory regardless of the number of tuples in the database.
The first thing we need to know is how to measure the “goodness” of a split; what constitutes
“minimisation of diversity?” The gini index provides a measure of this (sorry, I’m not sure what it
stands for). For a dataset S, gini(S) = 1 - Σ_j p_j², where p_j is the relative frequency of class j
in S. This measures the diversity of class labels within
a set; the worst situation (even distribution of 2 classes) produces a value of 0.5. This decreases as
a split favours one class: for instance a 70/30 distribution produces 0.42. If you have more than 2
classes, the maximum possible value increases; for instance the worst possible diversity for three
classes is a 33 percent split, producing a gini value of 0.67.
OK, so how do we measure the goodness of a split point? Well, if the decision will split the
database into S1 and S2 (holding n1 and n2 of the n rows respectively), the gini value of the divided data is given by

gini_split(S) = (n1/n) gini(S1) + (n2/n) gini(S2)
and all we need to know to calculate this is what the distribution of the classes would be if we split
the data at this point.
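As a quick check of those numbers, here is a short Python sketch; the helper names and the example class counts are invented for illustration.

    # Small sketch of the gini calculations above; counts are class-label counts.

    def gini(counts):
        """gini(S) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j."""
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def gini_split(counts_s1, counts_s2):
        """Weighted gini of a two-way split: (n1/n) gini(S1) + (n2/n) gini(S2)."""
        n1, n2 = sum(counts_s1), sum(counts_s2)
        n = n1 + n2
        return (n1 / n) * gini(counts_s1) + (n2 / n) * gini(counts_s2)

    print(gini([50, 50]))        # even 2-class distribution  -> 0.5
    print(gini([70, 30]))        # 70/30 distribution         -> 0.42
    print(gini([33, 33, 33]))    # even 3-class distribution  -> about 0.67
    print(gini_split([40, 5], [10, 45]))   # one possible split of a 50/50 set:
                                           # well below the 0.5 of the unsplit set

The gini_split call shows the weighted combination used to score a candidate split point: the lower the value, the better the split.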
So this is how we do it:
1. Number the rows of the database.
2. Break the database into its separate columns. Add the row number to each item and the
class label for that row number.
3. Sort each column on attribute value.
4. Create 2 histograms of the class labels; call one “above” and the other “below”. We know
how many of each label there are, because we counted them while doing the sort. Put these
numbers in the “above” histogram and set the numbers in the “below” histogram to zero.
Set a pointer to the first (lowest) value of the first attribute.
5. Increment the pointer. At each increment, note what label we are looking at. Increment that
label in the “below” histogram and decrement it in the “above” histogram. The histograms contain all the information we need to calculate the gini index at each value for each attribute.
Each time we see a gini value that is smaller than the best so far, we save what attribute and value
the pointer is currently on. We have to swap the values in the histograms every time we
move on to a new attribute. (A sketch of this scan appears just after this list.)
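Here is a minimal in-memory sketch of that scan for one numeric attribute, assuming the attribute list has already been sorted as in steps 1 to 3; the function and variable names are illustrative rather than SPRINT's own.

    from collections import Counter

    def gini(hist):
        """gini of a histogram of class counts."""
        n = sum(hist.values())
        return 1.0 - sum((c / n) ** 2 for c in hist.values())

    def best_split_for_attribute(attribute_list):
        """Scan one sorted attribute list [(value, class_label, row_id), ...]
        and return (best_gini_split, split_value).

        "above" starts with all class counts, "below" starts empty; each step
        moves one record's count from "above" to "below" and scores the split
        "attribute value <= current value"."""
        above = Counter(label for _, label, _ in attribute_list)
        below = Counter()
        n = len(attribute_list)
        best = None
        for i, (value, label, _) in enumerate(attribute_list):
            below[label] += 1
            above[label] -= 1
            n_below, n_above = i + 1, n - (i + 1)
            if n_above == 0:          # nothing left above: not a real split
                break
            score = (n_below / n) * gini(below) + (n_above / n) * gini(above)
            if best is None or score < best[0]:
                best = (score, value)
        return best

Running this once per attribute list and keeping the overall minimum gives the split point for the node; a fuller version would also avoid proposing a split between two records that share the same attribute value.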
Once we have calculated a split point, we add it to the tree. Now we have to actually split the
data. Splitting the attribute that produced the best gini value is easy; just test the value against the
decision; send the data into the “left” file if it matches the decision and into the “right” file if it
doesn’t. As this happens, note the row numbers and put every row number which goes left into
a hash table (which can be written to disk if necessary). Now go back and do the other attributes;
if the row number is in the hash table, go left; otherwise go right. The nice thing about this is that
the order of the attributes is maintained, so we only ever do a big sort once at the very beginning.
This technique builds a tree breadth first, so it is easier to implement it iteratively rather than
recursively. In fact it is possible to have only four files open at once for the data to “pour” into;
how this is done is left as an exercise for the (very) interested.
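Below is a minimal sketch of the hash-table split described above, for in-memory attribute lists, assuming the winning attribute and its split value are already known; the names and data layout are illustrative, not SPRINT's own.

    def split_attribute_lists(lists, winning_attr, split_value):
        """lists maps attribute name -> sorted list of (value, class_label, row_id).

        Split the winning attribute's list directly, remembering which row_ids
        went left; then route every other attribute list through that set of
        row_ids, so each list stays sorted and never needs re-sorting."""
        left_rows = set()
        left, right = {}, {}

        # 1. Split the winning attribute by testing its value against the decision.
        left[winning_attr], right[winning_attr] = [], []
        for value, label, row_id in lists[winning_attr]:
            if value <= split_value:
                left[winning_attr].append((value, label, row_id))
                left_rows.add(row_id)          # the hash table of "left" rows;
            else:                              # real SPRINT can spill it to disk
                right[winning_attr].append((value, label, row_id))

        # 2. Route the remaining attributes by row_id lookup; order is preserved.
        for attr, records in lists.items():
            if attr == winning_attr:
                continue
            left[attr] = [r for r in records if r[2] in left_rows]
            right[attr] = [r for r in records if r[2] not in left_rows]
        return left, right

The row numbers that go left are the only extra state needed; as the text notes, that hash table can be written to disk if it grows too large.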
3 Pruning Trees
Recall what we said in the overview about overfitting; now that we have a tree we need to prune
it back. There are two main schools of thought on this: pruning by Minimum
Description Length (MDL) or by Minimum Cost Complexity (MCC). The two sound fairly similar
and are easy to confuse, but they actually have quite different effects. Here goes.
In both cases we store extra information at each tree node; specifically, what the highest represented class is and what proportion of items at that node actually are that class. So now we know
what would happen if we snipped a branch and replaced it with a leaf; the leaf would be labelled
with the highest represented class and we could recalculate the accuracy on the training set.
MDL: Descend recursively to the leaves. On the way back up, observe the accuracy of each
node and calculate the “cost” of its branch in terms of “message length”. If the cost of the
branch is larger than the error we would introduce by snipping it, then snip it.
Advantages: Very fast. We can tune the pruning to be more or less harsh by adding a “severity”
parameter to the message length calculation. If the severity is high, large subtrees will be more
likely to be “punished”.
Disadvantages: Produces one tree only (although you could run it several times with differing
severity parameters). Very sensitive to whatever formula you use to calculate “message length”.
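The exact message-length formula is given in the MDL pruning paper in the references; purely to make the bottom-up pass concrete, here is a sketch in which a deliberately simplified, made-up cost (severity times the number of leaves in the branch) stands in for the real message length.

    # Illustrative only: "cost" here is a stand-in (severity * leaves), not the
    # message-length formula of Mehta, Rissanen and Agrawal.

    class Node:
        def __init__(self, left=None, right=None, majority=None, errors=0):
            self.left, self.right = left, right   # both None for a leaf
            self.majority = majority              # highest represented class
            self.errors = errors                  # training rows at this node
                                                  # not of the majority class

    def leaves(node):
        if node.left is None:
            return 1
        return leaves(node.left) + leaves(node.right)

    def subtree_errors(node):
        if node.left is None:
            return node.errors
        return subtree_errors(node.left) + subtree_errors(node.right)

    def mdl_prune(node, severity=1.0):
        """Bottom-up: prune a branch when its (stand-in) cost exceeds the extra
        errors that turning it into a leaf would introduce."""
        if node.left is None:
            return node
        node.left = mdl_prune(node.left, severity)
        node.right = mdl_prune(node.right, severity)
        extra_errors = node.errors - subtree_errors(node)
        branch_cost = severity * leaves(node)
        if branch_cost > extra_errors:
            node.left = node.right = None   # snip: becomes a leaf labelled
        return node                         # with node.majority

Raising the severity makes branch_cost grow faster than the errors it saves, so larger subtrees become more likely to be snipped, which matches the behaviour described above.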
MCC: Define “cost-complexity” as a linear combination of “cost” (misclassification error) and “complexity”
(number of leaves on a branch). Add a tuning parameter t which we can increase or decrease
to make complexity more or less “expensive”. Now imagine that we slowly increase t: one
branch will be the first for which the complexity penalty outweighs the accuracy it contributes. Define
this as the weakest branch, and snip it. Continue until only one node is left. We can do this by
calculating a t-value for every internal node; whichever node has the smallest t-value is the one to prune.
Advantages: This method produces a list of pruned trees. We can test each tree against the test set
to see which one is the best. This is an aggressive form of pruning, so often produces the smallest
possible tree. Unfortunately, you don’t know which one that is until you test the whole lot.
Disadvantages: Slow to do and slow to test. However, it is still several orders of magnitude faster
than building the tree in the first place.
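One standard formulation of the weakest-link calculation is Breiman et al.'s (see the references): a node's t-value is the increase in training error from collapsing its branch, divided by the reduction in the number of leaves. A sketch, using the same hypothetical node layout as the MDL sketch above:

    # t = (errors as a leaf - errors of the branch) / (leaves in branch - 1)

    class Node:
        def __init__(self, left=None, right=None, majority=None, errors=0):
            self.left, self.right = left, right   # both None for a leaf
            self.majority = majority              # highest represented class
            self.errors = errors                  # training errors if this node
                                                  # were a leaf

    def leaves(node):
        return 1 if node.left is None else leaves(node.left) + leaves(node.right)

    def branch_errors(node):
        if node.left is None:
            return node.errors
        return branch_errors(node.left) + branch_errors(node.right)

    def weakest_link(node):
        """Return (t_value, node) for the internal node cheapest to collapse."""
        if node.left is None:
            return None
        t = (node.errors - branch_errors(node)) / (leaves(node) - 1)
        candidates = [(t, node)]
        for child in (node.left, node.right):
            c = weakest_link(child)
            if c is not None:
                candidates.append(c)
        return min(candidates, key=lambda pair: pair[0])

    def prune_sequence(root):
        """Repeatedly snip the weakest branch. Each yielded tree is the same
        object pruned further, so evaluate it before asking for the next one."""
        while root.left is not None:
            _, node = weakest_link(root)
            node.left = node.right = None
            yield root

Each tree in the sequence can then be scored against the test set, and the smallest tree just before the test error starts to climb is the one to keep, as described in the overview.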
4 Final Thoughts
How would you do categorical attributes with this scheme?
How do you avoid creating lots of open files?
How would you parallelise the building of a decision tree?
What is the order of the tree-building algorithm in terms of the number of database rows?
Columns?
5 References
Compulsory:
J.C. Shafer, R. Agrawal, M. Mehta, “SPRINT: A Scalable Parallel Classifier for Data Mining”, Proc.
of the 22nd Int’l Conference on Very Large Databases, Mumbai (Bombay), India, Sept. 1996.
M. Mehta, R. Agrawal and J. Rissanen, “SLIQ: A Fast Scalable Classifier for Data Mining”, Proc. of
the Fifth Int’l Conference on Extending Database Technology, Avignon, France, March 1996.
Suggested:
M. Mehta, J. Rissanen, and R. Agrawal, “MDL-based Decision Tree Pruning”, Proc. of the 1st Int’l
Conference on Knowledge Discovery in Databases and Data Mining, Montreal, Canada, August, 1995.
L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees, Chapter 3,
Wadsworth Inc., 1984.