Decision Trees
Slide 1 of 39
DECISION TREE INDUCTION
• What is a decision tree?
• The basic decision tree induction procedure
• From decision trees to production rules
• Dealing with missing values
• Inconsistent training data
• Incrementality
• Handling numerical attributes
• Attribute value grouping
• Alternative attribute selection criteria
• The problem of overfitting
P.D.Scott
University of Essex
Slide 2 of 39
LOAN EVALUATION
When a bank is asked to make a loan it needs to assess how
likely it is that the borrower will be able to repay the loan.
Collectively the bank has a lot of experience of making loans
and discovering which ones are ultimately repaid.
However, any individual bank employee has only a limited
amount of experience.
It would thus be very helpful if, somehow, the bank’s collective
experience could be used to construct a set of rules (or a
computer program embodying those rules) that could be used
to assess the risk that a prospective loan would not be repaid.
How?
What we need is a system that could take all the bank’s data
on previous borrowers and the outcomes of their loans and
learn such a set of rules.
One widely used approach is decision tree induction.
Slide 3 of 39
WHAT IS A DECISION TREE?
The following is a very simple decision tree that assigns
animals to categories:
Skin Covering
  Scales:   Fish
  Feathers: Beak
              Hooked:   Eagle
              Straight: Heron
  Fur:      Teeth
              Sharp: Lion
              Blunt: Lamb
Thus a decision tree can be used to predict the category (or
class) to which an example belongs.
Slide 4 of 39
So what is a decision tree?
A tree in which:
Each terminal node (leaf) is associated with a class.
Each non-terminal node is associated with one of the
attributes that examples possess.
Each branch is associated with a particular value that the
attribute of its parent node can take.
What is decision tree induction?
A procedure that, given a training set, attempts to build a
decision tree that will correctly predict the class of any
unclassified example.
What is a training set?
A set of classified examples, drawn from some population of
possible examples. The training set is almost always a very
small fraction of the population.
What is an example?
Typically decision trees operate using examples that take the
form of feature vectors.
A feature vector is simply a vector whose elements are the
values taken by the example’s attributes.
For example, a heron might be represented as:
  Skin Covering:  Feathers
  Beak:           Straight
  Teeth:          None
Slide 5 of 39
THE BASIC DECISION TREE ALGORITHM
FUNCTION build_dec_tree(examples, atts)
  // Takes a set of classified examples and a list of
  // attributes, atts. Returns the root node of a decision tree.
  Create node N;
  IF examples are all in same class
    THEN RETURN N labelled with that class;
  IF atts is empty
    THEN RETURN N labelled with modal example class;
  best_att = choose_best_att(examples, atts);
  label N with best_att;
  FOR each value ai of best_att
    si = subset of examples with best_att = ai;
    IF si is not empty
      THEN
        new_atts = atts – best_att;
        subtree = build_dec_tree(si, new_atts);
        attach subtree as child of N;
      ELSE
        Create leaf node L;
        Label L with modal example class;
        attach L as child of N;
  RETURN N;
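The pseudocode above translates almost line for line into Python. The sketch below is one illustrative rendering, not the original program: the dictionary tree representation and the helper names (entropy, modal_class, choose_best_att) are assumptions, and, unlike the pseudocode, it only creates branches for attribute values that actually occur among the examples.

from collections import Counter
from math import log2

def entropy(examples):
    # examples: list of (feature_dict, class_label) pairs
    counts = Counter(label for _, label in examples)
    return -sum(n / len(examples) * log2(n / len(examples)) for n in counts.values())

def modal_class(examples):
    return Counter(label for _, label in examples).most_common(1)[0][0]

def choose_best_att(examples, atts):
    # the attribute whose split leaves the least average uncertainty,
    # i.e. the one with the greatest information gain
    def remainder(att):
        total = 0.0
        for v in {f[att] for f, _ in examples}:
            subset = [(f, c) for f, c in examples if f[att] == v]
            total += len(subset) / len(examples) * entropy(subset)
        return total
    return min(atts, key=remainder)

def build_dec_tree(examples, atts):
    classes = {label for _, label in examples}
    if len(classes) == 1:                     # all examples in the same class
        return classes.pop()
    if not atts:                              # no attributes left: use the modal class
        return modal_class(examples)
    best = choose_best_att(examples, atts)
    node = {"attribute": best, "children": {}}
    for v in {f[best] for f, _ in examples}:  # only values that actually occur
        subset = [(f, c) for f, c in examples if f[best] == v]
        node["children"][v] = build_dec_tree(subset, [a for a in atts if a != best])
    return node

Applied to the weather training set shown on a later slide (with each example's features in a dictionary keyed by attribute name), this sketch reproduces the Cloud Cover / Wind tree of the worked example.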
Slide 6 of 39
Choosing the Best Attribute
What is “the best attribute”?
Many possible definitions.
A reasonable answer would be the attribute that best
discriminates the examples with respect to their classes.
So what does “best discriminates” mean?
Still many possible answers.
Many different criteria have been used.
The most popular is information gain.
What is information gain?
Slide 7 of 39
Shannon’s Information Function
Given a situation in which there are N unknown outcomes.
How much information have you acquired once you know
what the outcome is?
Consider some examples when the outcomes are all equally
likely:
Coin toss
2 outcomes
1 bit of information
Pick 1 card from 8
8 outcomes
3 bits of information
Pick 1 card from 32
32 outcomes 5 bits of information
In general, for N equiprobable outcomes
Information = log2(N) bits
Since the probability of each outcome is p = 1/N, we can also
express this as
Information = -log2(p) bits
Non-equiprobable outcomes
Consider picking 1 card from a pack containing 127 red and 1
black.
There are 2 possible outcomes but you would be almost
certain that the result would be red.
Thus being told the outcome usually gives you less
information than being told the outcome of an experiment with
two equiprobable outcomes.
Slide 8 of 39
Shannon’s Function
We need an expression that reflects the fact there is less
information to be gained when we already know that some
outcomes are more likely than others.
Shannon derived the following function:
Information = - Σ_{i=1..N} p_i log2(p_i) bits
where N is the number of alternative outcomes.
Notice that it reduces to -log2(p) when the outcomes are all
equiprobable.
If there are only two outcomes, it takes this form:
[Graph: information (bits) plotted against the probability of one of the two outcomes; the curve rises from 0 at probability 0 to a maximum of 1 bit at probability 0.5 and falls back to 0 at probability 1.]
Information is also sometimes called uncertainty or entropy.
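The figures quoted above are easy to check. A minimal sketch in Python (the function name information is illustrative):

from math import log2

def information(probs):
    # Shannon information in bits; 0 x log2(0) is taken to be 0
    return -sum(p * log2(p) for p in probs if p > 0)

print(information([0.5, 0.5]))         # coin toss: 1.0 bit
print(information([1/8] * 8))          # pick 1 card from 8: 3.0 bits
print(information([1/32] * 32))        # pick 1 card from 32: 5.0 bits
print(information([127/128, 1/128]))   # 127 red, 1 black: about 0.07 bits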
Slide 9 of 39
Using Information to Assess Attributes
Suppose
You have a set of 100 examples, E
These examples fall in two classes, c1 and c2
70 are in c1
30 are in c2
How uncertain are you about the class an example belongs
to?
Information = -p(c1)log2(p(c1)) - p(c2)log2(p(c2))
= -0.7 log2(0.7) - 0.3 log2(0.3)
= -(0.7 x -0.51 + 0.3 x -1.74) = 0.88 bits
Now suppose
A is one of the example attributes with values v1 and v2
The 100 examples are distributed thus
        v1    v2
  c1    63     7
  c2     6    24
What is the uncertainty for the examples whose A value is v1?
There are 69 of them; 63 in c1 and 6 in c2.
So for this subset, p(c1) = 63/69 = 0.913
and p(c2) = 6/69 = 0.087
Hence
Information = -0.913 log2(0.913) - 0.087 log2(0.087) = 0.43
Slide 10 of 39
Similarly for the examples whose A value is v2?
There are 31 of them; 7 in c1 and 24 in c2.
So for this subset, p(c1) = 7/31 = 0.226
and p(c2) = 24/31 = 0.774
Hence
Information = -0.226 log2(0.226) - 0.774 log2(0.774) = 0.77
So, if we know the value of attribute A
The uncertainty is 0.43 if the value is v1.
The uncertainty is 0.77 if the value is v2.
But 69% have value v1 and 31% have value v2.
Hence the average uncertainty if we know the value of
attribute A will be 0.69 x 0.43 + 0.31 x 0.77 = 0.54.
Compare this with the uncertainty if we don’t know the value
of A which we calculated earlier as 0.88.
Hence attribute A provides an information gain of
0.88 – 0.54 = 0.34 bits
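A quick check of this calculation (the helper name info is illustrative):

from math import log2

def info(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

before = info([0.7, 0.3])                        # about 0.88 bits

# splitting on A: value v1 holds 63 c1 + 6 c2, value v2 holds 7 c1 + 24 c2
inf_v1 = info([63/69, 6/69])                     # about 0.43 bits
inf_v2 = info([7/31, 24/31])                     # about 0.77 bits
after = (69/100) * inf_v1 + (31/100) * inf_v2    # about 0.53 (0.54 with rounded inputs)

print(round(before - after, 2))
# 0.35 before rounding; the slide's 0.34 comes from subtracting the rounded 0.54 from 0.88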
Slide 11 of 39
AN EXAMPLE
Suppose we have a training set of data derived from weather
records.
These contain four attributes:
  Attribute        Possible Values
  Temperature      Warm; Cool
  Cloud Cover      Overcast; Cloudy; Clear
  Wind             Windy; Calm
  Precipitation    Rain; Dry
We want to build a system that predicts precipitation from the
other three attributes.
The training data set is:
[ warm, overcast, windy; rain ]
[ cool, overcast, calm;  dry  ]
[ cool, cloudy,   windy; rain ]
[ warm, clear,    windy; dry  ]
[ cool, clear,    windy; dry  ]
[ cool, overcast, windy; rain ]
[ cool, clear,    calm;  dry  ]
[ warm, overcast, calm;  dry  ]
Slide 12 of 39
Initial Uncertainty
First we consider the initial uncertainty:
p(rain) = 3/8; p(dry) = 5/8
So Inf = -(3/8 log2(3/8) + 5/8 log2(5/8)) = 0.954
Next we must choose the best attribute for building branches
from the root node of the decision tree.
There are three to choose from.
Information Gain from Temperature Attribute
Cool Examples:
There are 5 of these; 2 rain and 3 dry.
So p(rain) = 2/5 and p(dry) = 3/5
Hence Infcool = -(2/5 log2(2/5) + 3/5 log2(3/5)) = 0.971
Warm Examples:
There are 3 of these; 1 rain and 2 dry.
So p(rain) = 1/3 and p(dry) = 2/3
Hence Infwarm = -(1/3 log2(1/3) + 2/3 log2(2/3)) = 0.918
Average Information:
5/8 Infcool + 3/8 Infwarm = 0.625 x 0.971 + 0.375 x 0.918 = 0.951
Hence Information Gain for Temperature is
Initial Information – Average Information for Temperature
= 0.954 – 0.951 = 0.003.
(Very small)
Slide 13 of 39
Information Gain from Cloud Cover Attribute
A similar calculation gives an average information of 0.500.
Hence Information Gain for Cloud Cover is
Initial Information – Average Information for Cloud Cover
= 0.954 – 0.500 = 0.454.
(Large)
Information Gain from Wind Attribute
A similar calculation gives an average information of 0.607.
Hence Information Gain for Wind is
Initial Information – Average Information for Wind
= 0.954 – 0.607 = 0.347.
(Quite large)
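All three gains can be verified directly from the eight training examples; a minimal Python sketch (the data structure and names are illustrative):

from collections import Counter
from math import log2

data = [  # (temperature, cloud cover, wind) -> precipitation
    (("warm", "overcast", "windy"), "rain"),
    (("cool", "overcast", "calm"), "dry"),
    (("cool", "cloudy", "windy"), "rain"),
    (("warm", "clear", "windy"), "dry"),
    (("cool", "clear", "windy"), "dry"),
    (("cool", "overcast", "windy"), "rain"),
    (("cool", "clear", "calm"), "dry"),
    (("warm", "overcast", "calm"), "dry"),
]

def info(examples):
    counts = Counter(c for _, c in examples)
    return -sum(n / len(examples) * log2(n / len(examples)) for n in counts.values())

def gain(examples, att_index):
    subsets = [[(f, c) for f, c in examples if f[att_index] == v]
               for v in {f[att_index] for f, _ in examples}]
    after = sum(len(s) / len(examples) * info(s) for s in subsets)
    return info(examples) - after

for name, i in [("Temperature", 0), ("Cloud Cover", 1), ("Wind", 2)]:
    print(name, round(gain(data, i), 3))
# prints roughly 0.003, 0.454 and 0.348
# (the slide's 0.347 for Wind comes from the rounded values 0.954 and 0.607)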
The Best Attribute: Starting to build the tree
Cloud cover gives the greatest information gain so we choose
it as the attribute to begin tree construction
Cloud Cover
  Clear:    warm,clear,windy: dry
            cool,clear,windy: dry
            cool,clear,calm: dry
  Overcast: warm,overcast,windy: rain
            cool,overcast,calm: dry
            cool,overcast,windy: rain
            warm,overcast,calm: dry
  Cloudy:   cool,cloudy,windy: rain
Slide 14 of 39
Developing the Tree
All the examples on the “Clear” branch belong to the same
class so no further elaboration is needed.
It can be terminated with a leaf node labelled “dry”
Similarly the single example on the “Cloudy” branch
necessarily belongs to one class.
It can be terminated with a leaf node labelled “rain”
This gives us:
Cloud Cover
  Clear:    Dry
  Cloudy:   Rain
  Overcast: warm,overcast,windy: rain
            cool,overcast,calm: dry
            cool,overcast,windy: rain
            warm,overcast,calm: dry
The “Overcast” branch has both rain and dry examples.
So we must attempt to extend the tree from this node.
Slide 15 of 39
Extending the Overcast subtree
There are 4 examples: 2 rain and 2 dry
So p(rain) = p(dry) = 0.5 and the uncertainty is 1.
There are two remaining attributes: temperature and wind.
Information Gain from Temperature Attribute
Cool Examples:
There are 2 of these; 1 rain and 1 dry.
So p(rain) = 1/2 and p(dry) = 1/2
Hence Infcool = -(1/2 log2(1/2) + 1/2 log2(1/2)) = 1
Warm Examples:
There are also 2 of these; 1 rain and 1 dry.
So again Infwarm = -(1/2 log2(1/2) + 1/2 log2(1/2)) = 1
Average Information:
1/2 Infcool + 1/2 Infwarm = 0.5 x 1.0 + 0.5 x 1.0 = 1
Hence Information Gain for Temperature is zero!
Slide 16 of 39
Information Gain from Wind Attribute
Windy Examples:
There are 2 of these; 2 rain and 0 dry.
So p(rain) = 1 and p(dry) = 0
Hence Infwindy = -(1 x log2(1) + 0 x log2(0)) = 0 (taking 0 x log2(0) as 0)
Calm Examples:
There are also 2 of these; 0 rain and 2 dry.
So again Infcalm = -(1 x log2(1) + 0 x log2(0)) = 0
Average Information:
1/2 Infwindy + 1/2 Infcalm = 0.5 x 0.0 + 0.5 x 0.0 = 0
Hence Information Gain for Wind is 1 – 0 = 1.
Note: This reflects the fact that wind is a perfect predictor
of precipitation for this subset of examples.
The Best Attribute:
Obviously wind is the best attribute so we can now extend
the tree.
Slide 17 of 39
Cloud Cover
  Clear:    Dry
  Cloudy:   Rain
  Overcast: Wind
              Calm:  cool,overcast,calm: dry
                     warm,overcast,calm: dry
              Windy: warm,overcast,windy: rain
                     cool,overcast,windy: rain
All the examples on both the new branches belong to the
same class so they can be terminated with appropriately
labelled leaf nodes.
Cloud Cover
  Clear:    Dry
  Cloudy:   Rain
  Overcast: Wind
              Windy: Rain
              Calm:  Dry
Slide 18 of 39
FROM DECISION TREES TO PRODUCTION RULES
Decision trees can easily be converted into sets of IF-THEN
rules.
The tree just derived would become:
IF clear THEN dry
IF cloudy THEN rain
IF overcast AND calm THEN dry
IF overcast AND windy THEN rain
Such rules are usually easier to understand than the
corresponding tree.
Large trees produce large sets of rules.
It is often possible to simplify these considerably by applying
transformations to them.
In some cases these simplified rule sets are more accurate
than the original tree because they reduce the effect of
overfitting – a topic we will discuss later.
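The four rules above translate directly into code. A minimal sketch (the function name is illustrative):

def predict_precipitation(cloud_cover, wind):
    # a direct encoding of the four rules derived from the tree
    if cloud_cover == "clear":
        return "dry"
    if cloud_cover == "cloudy":
        return "rain"
    if cloud_cover == "overcast" and wind == "calm":
        return "dry"
    if cloud_cover == "overcast" and wind == "windy":
        return "rain"
    return None   # combination not covered by the training data

print(predict_precipitation("overcast", "windy"))   # rain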
Slide 19 of 39
REFINEMENTS OF DECISION TREE LEARNING
The basic top down procedure for decision tree construction is
exemplified by ID3.
This technique has proved extremely successful as a
method for classification learning.
Consequently it has been applied to a wide range of
problems.
As this has happened, limitations have emerged and new
techniques have been developed to deal with them.
These include:
Dealing with Missing Values
Inconsistent Data
Incrementality
Handling Numerical Attributes
Attribute Value Grouping
Alternative Attribute Selection Criteria
The Problem of Overfitting
Note.
Many of these problems also arise in other learning
procedures and statistical methods.
Hence many of the solutions developed for use with
decision trees may be useful in conjunction with other
techniques.
Slide 20 of 39
MISSING VALUES
Missing values are a major problem when working with real
data sets.
A survey respondent may not have answered a question.
The results of a lab test may not be available.
There are various approaches to this problem.
Discard the training example.
If there are many attributes, each of which might be
missing, this may almost wipe out the training data.
Guess the value.
e.g. substitute the commonest value.
Obviously error prone but quite effective if missing values
for a given variable are rare.
More sophisticated guessing techniques have been
developed.
Sometimes called “imputation” (a minimal sketch follows at the end of this list).
Assign a probability to each possible value.
Probabilities can be estimated from remaining examples.
For training purposes, treat missing value cases as
fractional examples in each class for gain computation.
This fractionation will propagate as the tree is extended.
Resulting tree will give a probability for each class rather
than a definite answer.
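A minimal sketch of the imputation option, substituting the commonest observed value (the helper name and the representation of examples as (feature_dict, class) pairs are assumptions):

from collections import Counter

def impute_modal(examples, attribute, missing=None):
    # examples: list of (feature_dict, class_label) pairs; "missing" marks
    # an absent value for the given attribute
    observed = [f[attribute] for f, _ in examples if f[attribute] != missing]
    modal_value = Counter(observed).most_common(1)[0][0]
    for features, _ in examples:
        if features[attribute] == missing:
            features[attribute] = modal_value
    return examples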
Slide 21 of 39
INCONSISTENT DATA
It is possible (and not unusual) to arrive at a situation in which
the examples associated with a leaf node belong to more than
one class.
This situation can arise in two ways:
1. There are no more attributes that could be used to
further subdivide the examples.
e.g. Suppose the weather data had contained a ninth
example
[ cool, cloudy, windy; dry ]
2. There are more attributes but none of them is useful in
distinguishing the classes.
e.g. Suppose the weather data had included the above
example and another attribute that identified the
person who recorded the data.
No More Attributes
A decision tree program can do one of two things.
Predict the most probable (modal) class.
This is what the pseudocode given earlier does.
Make a set of predictions with associated probabilities.
This is better.
Slide 22 of 39
No More Useful Attributes
In this situation the system cannot build a subtree to further
discriminate the classes because the unused attributes do not
correlate with the classification.
This situation must be handled in the same way as No More
Attributes.
But first the program must detect when the situation has
arisen.
Detecting that Further Progress is Impossible.
This requires a threshold on information gain or a statistical
test.
We will discuss this when we consider pre-pruning as a
possible solution to overfitting.
Slide 23 of 39
INCREMENTALITY
Most decision tree induction programs require the entire
training set to be available at the start.
Such programs can only incorporate new data by building
a complete new tree.
This is not a problem for most data mining applications.
In some applications, new training examples become
available at intervals.
e.g. Consider a robot dog learning football tactics by
playing games.
Learning programs that can accept new training instances
after learning has begun, without starting again are said to be
incremental.
Incremental Decision Tree Induction
ID4 is a modification of ID3 that can learn incrementally.
Maintains counts at every node throughout learning:
Number of each class associated with the node and its
subclasses.
Numbers of each class having each possible value of
each attribute.
When a new example is encountered, all counts are
updated.
Where counts have changed, the system can check whether the
current best attribute is still the best.
If not, a new subtree is built to replace the original.
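A sketch of the kind of per-node counts such a system might maintain. This is an illustration of the idea only, not the actual ID4 data structures:

from collections import defaultdict
from math import log2

class NodeCounts:
    # Statistics kept at one node: class counts, and class counts broken down
    # by each attribute value. Gain can be recomputed from these after every
    # new example without revisiting old data.
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.av_counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        # attribute -> value -> class -> count

    def update(self, features, label):
        self.class_counts[label] += 1
        for att, value in features.items():
            self.av_counts[att][value][label] += 1

    def gain(self, att):
        def info(counts):
            n = sum(counts)
            return -sum(c / n * log2(c / n) for c in counts if c)
        total = sum(self.class_counts.values())
        before = info(list(self.class_counts.values()))
        after = sum(sum(by_class.values()) / total * info(list(by_class.values()))
                    for by_class in self.av_counts[att].values())
        return before - after

    def best_attribute(self):
        return max(self.av_counts, key=self.gain)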
Slide 24 of 39
Building Decision Trees with Numeric Attributes
The Problem
Standard decision tree procedures are designed to work with
categorical attributes.
No account is taken of any numerical relationship between
the values.
The branching factor of the tree will be reasonable if the
number of distinct values is modest.
If the same procedures are applied to numerical attributes we
run into two difficulties:
The branching factor may become absurdly large.
Consider the attribute “income” in a survey of 1000
people.
It is not unlikely that every individual will have a
different annual income.
All the information implicit in the ordering of values is
thrown away.
The Solution
Partition the value set into a small number of contiguous
subranges and then treat membership of each subrange as a
categorical variable.
The result is in effect a new ordinal attribute.
A reasonable branching factor.
Some of the ordering information has been used.
Slide 25 of 39
Discretization of Continuous Variables
Two possible approaches to partitioning a range of numeric
values in order to build a classifier:
Divide the range up into a preset number of subranges.
For example, each subrange could have equal width or
include an equal number of examples.
Use the classification variable to determine the best way
to partition the numeric attribute.
The second approach has proved more successful.
Discretization using the classification variable
In principle this is straightforward:
Consider every possible partitioning of the numeric
attribute.
Assess each partitioning using the attribute selection
criterion.
In practice this is infeasible:
A set of m training examples can be partitioned in 2^(m-1)
ways
However, it can be proved that if two neighbouring values
belong to the same class they should be assigned to the
same group.
This reduces the number of possibilities but the number of
partitionings is still infeasibly large.
Two solutions are possible:
Consider only the m-1 binary partitions (e.g. C4.5); a sketch of this option follows below.
Use heuristics to find good multiple partitions
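A minimal sketch of the binary-partition option: scan the cut points between neighbouring sorted values and keep the threshold with the greatest information gain (the function names and example figures are illustrative):

from collections import Counter
from math import log2

def info(labels):
    counts = Counter(labels)
    return -sum(n / len(labels) * log2(n / len(labels)) for n in counts.values())

def best_binary_split(values, labels):
    # try every cut point between neighbouring sorted values and keep the
    # threshold that gives the largest information gain
    pairs = sorted(zip(values, labels))
    base = info(labels)
    best_gain, best_threshold = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # no cut between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        after = (len(left) * info(left) + len(right) * info(right)) / len(pairs)
        if base - after > best_gain:
            best_gain, best_threshold = base - after, threshold
    return best_threshold, best_gain

# e.g. annual incomes labelled with loan outcomes (figures made up)
print(best_binary_split([12, 25, 31, 40, 58, 90],
                        ["bad", "bad", "good", "good", "good", "good"]))
# -> (28.0, 0.918...): the best cut falls between 25 and 31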
Slide 26 of 39
ATTRIBUTE VALUE GROUPING
Discretization of numeric attributes involves finding subgroups
of values that are equivalent for the purposes of classification.
This notion can usefully be applied to categorical variables:
Suppose
A decision tree program attempts to induce a tree to
predict some binary class, C.
That the training set comprises feature vectors of nominal
attributes A1, A2, ..., Ak.
That the class C is in fact defined by the following
classification function:
C = V2,1 ∧ (V5,2 ∨ V5,4) ∧ V8,3
where Vi,j denotes that attribute Ai takes its jth value.
Suppose finally that each attribute has four possible
values and the program selects attributes in the order of
A5, A2, A8.
The resulting tree will be:
A5
  V5,1: N
  V5,2: A2
          V2,1: A8
                  V8,1: N   V8,2: N   V8,3: Y   V8,4: N
          V2,2: N
          V2,3: N
          V2,4: N
  V5,3: N
  V5,4: A2
          V2,1: A8
                  V8,1: N   V8,2: N   V8,3: Y   V8,4: N
          V2,2: N
          V2,3: N
          V2,4: N
Slide 27 of 39
There is a great deal of duplication in the structure of this tree.
If branches could be created for groups of attribute values, a
much simpler tree could be constructed:
A5
  V5,1, V5,3: N
  V5,2, V5,4: A2
                V2,2, V2,3, V2,4: N
                V2,1: A8
                        V8,1, V8,2, V8,4: N
                        V8,3: Y
The original tree:
Has 21 nodes
Divides the example space into 16 regions
The new tree
Has 7 nodes
Divides the example space into 4 regions
Their classification behaviours are identical.
Attribute Value Grouping Procedures
Heuristic attribute value grouping procedures, similar to those
used to discretize numeric attributes, can be used to produce
such tree simplification.
Slide 28 of 39
ALTERNATIVE ATTRIBUTE SELECTION CRITERIA
Hill Climbing
The basic decision tree induction algorithm proceeds
using a hill climbing approach.
At every step, a new branch is created for each value of
the “best” attribute.
There is no backtracking.
Which is the Best Attribute?
The criterion used for selecting the best attribute is therefore
very important, since there is no opportunity to rectify its
mistakes.
Several alternatives have been used successfully.
Information Based Criteria
If an experiment has n possible outcomes then the amount of
information, expressed as bits, provided by knowing the
outcome is defined to be
I = - Σ_{i=1..n} p_i log2(p_i)
where pi is the prior probability of the ith outcome.
This quantity is also known as entropy and uncertainty.
For decision tree construction, the experiment is finding out
the correct classification of an example.
Slide 29 of 39
Information Gain
ID3 originally used information gain to determine the best
attribute:
The information gain for an attribute A when used with a set of
examples X is defined to be:
Gain(X, A) = I(X) - Σ_{v ∈ values(A)} (|Xv| / |X|) I(Xv)
where |X| is the number of examples in set X.
This criterion has been (and is) widely and successfully used
But it is known to be biased towards attributes with many
values.
Why is it biased?
A many-valued attribute will partition the examples into
many subsets.
The average size of these subsets will be small.
Some of these are likely to contain a high percentage of
one class by chance alone.
Hence the true information gain will be over-estimated.
Slide 30 of 39
Information Gain Ratio
The information gain ratio measure incorporates an additional
term, split information, to compensate for this bias:
It is defined:
SplitInf(X, A) = - Σ_{v ∈ values(A)} (|Xv| / |X|) log2(|Xv| / |X|)
which is in fact the information imparted when you are given
the value of attribute A.
For example, if we consider equiprobable values:
If A has 2 values, SplitInf(X,A) = 1
If A has 4 values, SplitInf(X,A) = 2
If A has 8 values, SplitInf(X,A) = 3
The information gain ratio is then defined:
GainRatio(X, A) = Gain(X, A) / SplitInf(X, A)
The gain ratio itself can lead to difficulties if values are far
from equiprobable.
If most of the examples are of the same value then
SplitInf(X,A) will be close to zero, and hence GainRatio(X,A)
may be very large.
The usual solution is to use Gain rather than GainRatio
whenever SplitInf is small.
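A minimal sketch of the split information and gain ratio calculations. The fallback threshold for "small" SplitInf is an arbitrary illustration; the slides do not specify a value:

from math import log2

def split_info(subset_sizes):
    # information imparted by the value of A alone
    total = sum(subset_sizes)
    return -sum(n / total * log2(n / total) for n in subset_sizes if n)

def gain_ratio(gain, subset_sizes, min_split_info=0.5):
    s = split_info(subset_sizes)
    # fall back on plain gain when SplitInf is small
    return gain if s < min_split_info else gain / s

print(split_info([50, 50]))          # 2 equiprobable values -> 1.0
print(split_info([25, 25, 25, 25]))  # 4 equiprobable values -> 2.0
print(gain_ratio(0.454, [4, 1, 3]))  # the weather Cloud Cover split -> about 0.32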
Slide 31 of 39
Information Distance
SplitInf can be regarded as a normalisation factor for gain.
Lopez de Mantaras has developed a mathematically sounder
form of normalization based on the information distance
between two partitions.
This has not been widely adopted.
The Gini Criterion
An alternative to the information theory based measures.
Based on the notion of minimising misclassification rate.
Suppose you know the probabilities p(ci) of each class ci for
the examples assigned to a node.
Suppose you are given an unclassified example that would be
assigned to that node and decide to make a random guess of
its class with probabilities p(ci).
What is the probability that you will guess incorrectly?
G = Σ_{i ≠ j} p(ci) p(cj)
where the sum is taken over all pairs of distinct classes.
G is called the Gini criterion and can be used in the same
ways as information to select the best attribute.
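A minimal sketch of the Gini calculation (the function name is illustrative):

def gini(probs):
    # probability of guessing wrongly when classes are picked at random
    # with the probabilities p(ci) themselves: sum over i != j of p(ci)p(cj)
    return sum(p * q for i, p in enumerate(probs)
                     for j, q in enumerate(probs) if i != j)

print(gini([0.7, 0.3]))    # 2 x 0.7 x 0.3 = 0.42
print(gini([0.5, 0.5]))    # 0.5, the worst case for two classes
# equivalently, gini(probs) == 1 - sum(p*p for p in probs)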
Slide 32 of 39
SO WHAT IS THE BEST ATTRIBUTE SELECTION CRITERION?
A good attribute selection criterion should select those
attributes that most improve prediction accuracy.
It should also be cheap to compute.
Which criteria are widely used?
Information gain ratio is most popular in machine learning
research.
The Gini criterion is very popular in the statistical and pattern
recognition communities.
The evidence
Empirical evidence suggests that choice of criteria has little
impact on the classification accuracy of the resulting trees.
There are claims that some methods produce smaller trees
with similar accuracies.
Slide 33 of 39
THE PROBLEM OF OVERFITTING
Question
What would happen if a completely random set of data were
used as the training and test sets for a decision tree induction
program?
Answer
The program would build a decision tree.
If there were many variables and plenty of data it could be
quite a large tree.
Question
Would the tree be any good as a classifier?
Would it, for example, do better than the simple strategy of
always picking the modal class?
Answer
No.
Note also that if the experiment were repeated with a new set
of random data we would get an entirely different tree.
Questions
Isn’t this rather worrying?
Could the same sort of thing be happening with non-random
data?
Answers
Yes and yes.
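The random-data experiment is easy to reproduce. The sketch below uses scikit-learn's CART-style decision tree rather than ID3, with made-up random binary attributes and labels (all names and figures are illustrative, and scikit-learn must be installed):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 20))   # 20 random binary attributes
y = rng.integers(0, 2, size=1000)         # completely random class labels

X_train, X_test = X[:500], X[500:]
y_train, y_test = y[:500], y[500:]

tree = DecisionTreeClassifier().fit(X_train, y_train)
print(tree.tree_.node_count)              # a sizeable tree is built anyway
print(tree.score(X_train, y_train))       # fits the training data almost perfectly
print(tree.score(X_test, y_test))         # about 0.5: no better than guessing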
Slide 34 of 39
What is going on?
A decision tree is a mathematical model of some population
of examples.
But the tree is built on the basis of a sample from that
population – the training set.
So what a decision tree program really does is build a model
of the training set.
The features of such a model can be divided into two groups:
1. Those that reflect relationships that are true for the
population as a whole.
2. Those that reflect relationships that are peculiar to the
particular training set.
Overfitting
Roughly speaking what happens is this:
Initially the features of a decision tree will reflect features
of the whole population.
As the tree gets deeper, the samples at each node get
smaller and the major relationships of the population will
have already been incorporated into the model.
Consequently any further additions are likely to reflect
relationships that have occurred by chance in the training
data.
From this point on the tree becomes a less accurate
model of the population: typically 20% less accurate.
This phenomenon of modelling the training data rather than
the population it represents is called overfitting.
Slide 35 of 39
Eliminating Overfitting
There are two basic ways of preventing overfitting:
1. Stop tree growth before it happens: Pre-pruning.
2. Remove the parts of the tree due to overfitting after it has
been constructed: Post-pruning.
Pre-pruning
This approach is appealing because it would save the effort
involved in building then scrapping subtrees.
This implies the need for a stopping criterion: a function
whose value determines when a leaf node should not be
expanded into subtrees.
Two types of stopping criteria have been tried:
Stopping when the improvement gets too small.
Typically stop when the improvement indicated by the
attribute selection criterion drops below some pre-set
threshold.
The choice of threshold is crucial. Too low and you still get
overfitting. Too high and you lose accuracy.
This method proved unsatisfactory. It wasn’t possible
to choose a threshold value that worked for all data sets.
Stopping when the evidence for an extension becomes
statistically insignificant.
Quinlan used χ² testing in some versions of ID3.
He later abandoned this because results were
“satisfactory but uneven”.
Slide 36 of 39
Chi-Square Tests
Chi-square testing is an extremely useful technique for
determining whether the differences between two distributions
could be due to chance.
That is, whether they could both be samples of the same
parent population.
Suppose we have a set of n categories and a set of
observations O1…Oi…On of the frequency that each
category occurs in a sample.
Suppose we wish to know if this set of observations could
be a sample drawn from some population whose
frequencies we also know.
We can calculate the expected frequencies E1…Ei…En of
each category if the sample exactly followed the
distribution of the population.
Now compute the value of the chi-square statistic defined:
χ² = Σ_{i=1..n} (Oi – Ei)² / Ei
Clearly χ² increases as the two distributions deviate.
To determine whether the deviation is statistically significant,
consult chi square tables for the appropriate number of
degrees of freedom – in this case n-1.
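A minimal sketch of the calculation, using one branch of the weather example as an illustration (the framing of the expected counts is an assumption; the 3.84 critical value for 1 degree of freedom at the 5% level is a standard table entry):

def chi_square(observed, expected):
    # sum over categories of (O_i - E_i)^2 / E_i
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Windy branch of the overcast node: observed (rain, dry) counts are (2, 0).
# If Wind were irrelevant we would expect (1, 1) from the parent's 2:2 split.
stat = chi_square([2, 0], [1, 1])
print(stat)   # 2.0, below the 5% critical value of 3.84 for 1 degree of freedom,
              # so this perfect-looking split is not statistically significant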
Slide 37 of 39
Post-Pruning
The Basic Idea
First build a decision tree, allowing overfitting to occur.
Then, for each subtree:
Assess whether a more accurate tree would result if the
subtree were replaced by a leaf.
(The leaf will choose the modal class for classification)
If so, replace the subtree with a leaf.
Validation Data Sets
How do we assess whether a pruned tree would be more
accurate?
We can’t use the training data because the tree has
overfitted to this.
We can’t use the test data because then we would have
no independent measure for the accuracy of the final tree.
We must have a third set used only for this purpose.
This is known as a validation set.
Notes:
Validation sets can also be used in pre-pruning.
In C4.5, Quinlan uses the training data for validation but
treats the result as an estimate and sets up a confidence
interval.
This is statistically dubious because the training data
isn’t an independent sample.
Quinlan justifies it on the grounds that it works in
practice.
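A minimal sketch of validation-set (reduced-error) post-pruning, using the dictionary tree representation from the earlier build_dec_tree sketch. This simplified scheme is an assumption and is not Quinlan's confidence-interval method described in the note above:

from collections import Counter

def modal_class(examples):
    counts = Counter(c for _, c in examples)
    return counts.most_common(1)[0][0] if counts else None

def classify(tree, features):
    # leaves are plain class labels; internal nodes are dicts as in the
    # earlier build_dec_tree sketch
    while isinstance(tree, dict):
        tree = tree["children"].get(features[tree["attribute"]])
    return tree

def accuracy(tree, examples):
    if not examples:
        return 0.0
    return sum(classify(tree, f) == c for f, c in examples) / len(examples)

def prune(tree, train, validation):
    # bottom-up: prune the children first, then consider collapsing this node
    if not isinstance(tree, dict):
        return tree
    att = tree["attribute"]
    for value in list(tree["children"]):
        tree["children"][value] = prune(
            tree["children"][value],
            [(f, c) for f, c in train if f[att] == value],
            [(f, c) for f, c in validation if f[att] == value])
    leaf = modal_class(train)
    # keep the subtree only if it beats the modal-class leaf on validation data
    if accuracy(leaf, validation) >= accuracy(tree, validation):
        return leaf
    return tree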
Slide 38 of 39
Refinements
Substituting Branches
Rather than replacing a subtree with a leaf, it can be replaced
by its most frequently used branch.
More Drastic Transformations
More substantial changes, possibly leading to a structure that
is no longer a tree, have also been used.
One example is the transformation into rule sets in C4.5:
Generate a set of production rules equivalent to the tree
by creating one rule for each path from the root to a leaf.
Generalize each rule by removing any precondition whose
loss does not reduce the accuracy.
This step corresponds to pruning, but note that the
structure may no longer be equivalent to a tree.
An example might match the LHS of more than one
rule.
Sort the rules by their estimated accuracy.
When using the rules for classification, this accuracy is
used for conflict resolution.
Slide 39 of 39
Suggested Readings
Mitchell, T. M., (1997),“Machine Learning”,McGraw-Hill. Chapter 3.
Tan, Steinbach & Kumar (2006) “Introduction to Data Mining”.
Chapter 4
Han & Kamber (2006), “Data Mining: Concepts and Techniques”.
Section 6.3
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J., (1984)
Classification and Regression Trees. Wadsworth, Pacific Grove, CA.
(This is a thorough treatment of the subject from a more statistical
perspective – an essential reference if you are doing research in the
area – usually known as “The CART book”.)
Quinlan, J. R., (1986), Induction of Decision Trees. Machine
Learning, 1(1), pp 81-106. (A full account of ID3).
Quinlan, J. R., (1993), C4.5: Programs for Machine Learning. Morgan
Kaufmann, Los Altos, CA. (A complete account of C4.5, the successor
to ID3 and the yardstick against which other decision tree induction
procedures are usually compared.)
Dougherty, J., Kohavi, R. and Sahami, M., (1995), Supervised and
Unsupervised Discretisation of Continuous Features, in Proc. 12th
Int. Conf. on Machine Learning, Morgan Kaufmann, Los Altos, CA.,
pp 194-202. (A good comparative study of different methods for
discretising numeric attributes).
Ho, K. M. and Scott, P. D. (2000) Reducing Decision Tree
Fragmentation Through Attribute Value Grouping: A Comparative
Study. Intelligent Data Analysis, 6, pp 255-274.
Implementations
An implementation of a decision tree procedure is available as part of
the WEKA suite of data mining programs. It is called J4.8 and closely
resembles C4.5.