DECISION TREE INDUCTION
P.D.Scott, University of Essex

What is a decision tree?
The basic decision tree induction procedure
From decision trees to production rules
Dealing with missing values
Inconsistent training data
Incrementality
Handling numerical attributes
Attribute value grouping
Alternative attribute selection criteria
The problem of overfitting

LOAN EVALUATION

When a bank is asked to make a loan it needs to assess how likely it is that the borrower will be able to repay it.

Collectively the bank has a lot of experience of making loans and discovering which ones are ultimately repaid. However, any individual bank employee has only a limited amount of experience.

It would thus be very helpful if, somehow, the bank's collective experience could be used to construct a set of rules (or a computer program embodying those rules) that could be used to assess the risk that a prospective loan would not be repaid.

How? What we need is a system that could take all the bank's data on previous borrowers and the outcomes of their loans and learn such a set of rules. One widely used approach is decision tree induction.

WHAT IS A DECISION TREE?

The following is a very simple decision tree that assigns animals to categories:

    Skin Covering
      Scales   -> Fish
      Feathers -> Beak
                    Hooked   -> Eagle
                    Straight -> Heron
      Fur      -> Teeth
                    Sharp -> Lion
                    Blunt -> Lamb

Thus a decision tree can be used to predict the category (or class) to which an example belongs.

So what is a decision tree?
A tree in which:
  Each terminal node (leaf) is associated with a class.
  Each non-terminal node is associated with one of the attributes that examples possess.
  Each branch is associated with a particular value that the attribute of its parent node can take.

What is decision tree induction?
A procedure that, given a training set, attempts to build a decision tree that will correctly predict the class of any unclassified example.

What is a training set?
A set of classified examples, drawn from some population of possible examples. The training set is almost always a very small fraction of the population.

What is an example?
Typically decision trees operate on examples that take the form of feature vectors. A feature vector is simply a vector whose elements are the values taken by the example's attributes. For example, a heron might be represented as:

    Skin Covering: Feathers   Beak: Straight   Teeth: None

THE BASIC DECISION TREE ALGORITHM

    FUNCTION build_dec_tree(examples, atts)
      // Takes a set of classified examples and a list of attributes, atts.
      // Returns the root node of a decision tree.
      Create node N;
      IF examples are all in same class THEN
        RETURN N labelled with that class;
      IF atts is empty THEN
        RETURN N labelled with modal example class;
      best_att = choose_best_att(examples, atts);
      label N with best_att;
      FOR each value ai of best_att
        si = subset of examples with best_att = ai;
        IF si is not empty THEN
          new_atts = atts - best_att;
          subtree = build_dec_tree(si, new_atts);
          attach subtree as child of N;
        ELSE
          Create leaf node L;
          label L with modal example class;
          attach L as child of N;
      RETURN N;
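The following Python sketch illustrates the same top-down procedure; it is an illustration, not ID3 itself. It assumes each training example is an (attribute-dictionary, class-label) pair, and it selects the splitting attribute by the information gain criterion developed on the following slides.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon information (in bits) of a list of class labels."""
        n = len(labels)
        return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

    def choose_best_att(examples, atts):
        """Pick the attribute with the largest information gain."""
        def remainder(att):
            rem = 0.0
            for value in {ex[att] for ex, _ in examples}:
                subset = [c for ex, c in examples if ex[att] == value]
                rem += len(subset) / len(examples) * entropy(subset)
            return rem
        base = entropy([c for _, c in examples])
        return max(atts, key=lambda a: base - remainder(a))

    def build_dec_tree(examples, atts):
        """examples: list of (attribute_dict, class_label) pairs; atts: attribute names."""
        classes = [c for _, c in examples]
        if len(set(classes)) == 1:                 # all one class -> leaf
            return classes[0]
        modal = Counter(classes).most_common(1)[0][0]
        if not atts:                               # no attributes left -> modal-class leaf
            return modal
        best = choose_best_att(examples, atts)
        branches = {}
        for value in {ex[best] for ex, _ in examples}:
            subset = [(ex, c) for ex, c in examples if ex[best] == value]
            branches[value] = build_dec_tree(subset, [a for a in atts if a != best])
        # Values of best_att never seen in this subset fall back to the modal class.
        return {"att": best, "branches": branches, "default": modal}

Leaves are represented by bare class labels and internal nodes by dictionaries; the "default" entry plays the role of the empty-subset leaf in the pseudocode.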
Choosing the Best Attribute

What is "the best attribute"?
Many possible definitions. A reasonable answer would be the attribute that best discriminates the examples with respect to their classes.

So what does "best discriminates" mean?
Still many possible answers. Many different criteria have been used. The most popular is information gain.

What is information gain?

Shannon's Information Function

Given a situation in which there are N unknown outcomes, how much information have you acquired once you know what the outcome is? Consider some examples in which the outcomes are all equally likely:

    Coin toss             2 outcomes    1 bit of information
    Pick 1 card from 8    8 outcomes    3 bits of information
    Pick 1 card from 32   32 outcomes   5 bits of information

In general, for N equiprobable outcomes

    Information = log2(N) bits

Since the probability of each outcome is p = 1/N, we can also express this as

    Information = -log2(p) bits

Non-equiprobable outcomes
Consider picking 1 card from a pack containing 127 red cards and 1 black. There are 2 possible outcomes but you would be almost certain that the result would be red. Thus being told the outcome usually gives you less information than being told the outcome of an experiment with two equiprobable outcomes.

Shannon's Function

We need an expression that reflects the fact that there is less information to be gained when we already know that some outcomes are more likely than others. Shannon derived the following function:

    Information = -\sum_{i=1}^{N} p_i \log_2 p_i   bits

where N is the number of alternative outcomes. Notice that it reduces to -log2(p) when the outcomes are all equiprobable.

[Figure: for two outcomes, information plotted against the probability of one outcome; it rises from 0 at probability 0 to a maximum of 1 bit at probability 0.5 and falls back to 0 at probability 1.]

Information is also sometimes called uncertainty or entropy.

Using Information to Assess Attributes

Suppose you have a set of 100 examples, E, which fall into two classes, c1 and c2: 70 are in c1 and 30 are in c2. How uncertain are you about the class an example belongs to?

    Information = -p(c1)log2(p(c1)) - p(c2)log2(p(c2))
                = -0.7 log2(0.7) - 0.3 log2(0.3)
                = -(0.7 x -0.51 + 0.3 x -1.74)
                = 0.88 bits

Now suppose A is one of the example attributes, with values v1 and v2, and the 100 examples are distributed thus:

          v1    v2
    c1    63     7
    c2     6    24

What is the uncertainty for the examples whose A value is v1?
There are 69 of them: 63 in c1 and 6 in c2. So for this subset, p(c1) = 63/69 = 0.913 and p(c2) = 6/69 = 0.087. Hence

    Information = -0.913 log2(0.913) - 0.087 log2(0.087) = 0.43

Similarly, for the examples whose A value is v2:
There are 31 of them: 7 in c1 and 24 in c2. So for this subset, p(c1) = 7/31 = 0.226 and p(c2) = 24/31 = 0.774. Hence

    Information = -0.226 log2(0.226) - 0.774 log2(0.774) = 0.77

So, if we know the value of attribute A:
  The uncertainty is 0.43 if the value is v1.
  The uncertainty is 0.77 if the value is v2.
But 69% of the examples have value v1 and 31% have value v2. Hence the average uncertainty if we know the value of attribute A will be

    0.69 x 0.43 + 0.31 x 0.77 = 0.54

Compare this with the uncertainty if we don't know the value of A, which we calculated earlier as 0.88. Hence attribute A provides an information gain of

    0.88 - 0.54 = 0.34 bits
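The arithmetic above can be checked with a few lines of Python; the info helper below is simply Shannon's function applied to a list of class probabilities.

    from math import log2

    def info(probs):
        """Shannon's function: -sum p*log2(p) over the class probabilities."""
        return -sum(p * log2(p) for p in probs if p > 0)

    base = info([0.7, 0.3])                    # about 0.88 bits

    # Attribute A: v1 covers 69 examples (63 c1, 6 c2); v2 covers 31 (7 c1, 24 c2).
    info_v1 = info([63 / 69, 6 / 69])          # about 0.43 bits
    info_v2 = info([7 / 31, 24 / 31])          # about 0.77 bits
    average = 0.69 * info_v1 + 0.31 * info_v2  # about 0.53 (0.54 with the rounded figures above)
    gain = base - average                      # about 0.35 (0.34 with the rounded figures above)
    print(round(base, 2), round(average, 2), round(gain, 2))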
AN EXAMPLE

Suppose we have a training set of data derived from weather records. These contain four attributes:

    Attribute       Possible Values
    Temperature     Warm; Cool
    Cloud Cover     Overcast; Cloudy; Clear
    Wind            Windy; Calm
    Precipitation   Rain; Dry

We want to build a system that predicts precipitation from the other three attributes. The training data set is:

    [ warm, overcast, windy; rain ]
    [ cool, overcast, calm;  dry  ]
    [ cool, cloudy,   windy; rain ]
    [ warm, clear,    windy; dry  ]
    [ cool, clear,    windy; dry  ]
    [ cool, overcast, windy; rain ]
    [ cool, clear,    calm;  dry  ]
    [ warm, overcast, calm;  dry  ]

Initial Uncertainty

First we consider the initial uncertainty: p(rain) = 3/8 and p(dry) = 5/8, so

    Inf = -(3/8 log2(3/8) + 5/8 log2(5/8)) = 0.954

Next we must choose the best attribute for building branches from the root node of the decision tree. There are three to choose from.

Information Gain from the Temperature Attribute
Cool examples: there are 5 of these; 2 rain and 3 dry. So p(rain) = 2/5 and p(dry) = 3/5. Hence
    Inf_cool = -(2/5 log2(2/5) + 3/5 log2(3/5)) = 0.971
Warm examples: there are 3 of these; 1 rain and 2 dry. So p(rain) = 1/3 and p(dry) = 2/3. Hence
    Inf_warm = -(1/3 log2(1/3) + 2/3 log2(2/3)) = 0.918
Average information:
    5/8 Inf_cool + 3/8 Inf_warm = 0.625 x 0.971 + 0.375 x 0.918 = 0.951
Hence the information gain for Temperature is
    initial information - average information for Temperature = 0.954 - 0.951 = 0.003   (very small)

Information Gain from the Cloud Cover Attribute
A similar calculation gives an average information of 0.500. Hence the information gain for Cloud Cover is
    0.954 - 0.500 = 0.454   (large)

Information Gain from the Wind Attribute
A similar calculation gives an average information of 0.607. Hence the information gain for Wind is
    0.954 - 0.607 = 0.347   (quite large)

The Best Attribute: Starting to Build the Tree
Cloud Cover gives the greatest information gain, so we choose it as the attribute with which to begin tree construction:

    Cloud Cover
      Clear:    warm,clear,windy: dry
                cool,clear,windy: dry
                cool,clear,calm:  dry
      Overcast: warm,overcast,windy: rain
                cool,overcast,calm:  dry
                cool,overcast,windy: rain
                warm,overcast,calm:  dry
      Cloudy:   cool,cloudy,windy: rain

Developing the Tree

All the examples on the "Clear" branch belong to the same class, so no further elaboration is needed: it can be terminated with a leaf node labelled "dry". Similarly, the single example on the "Cloudy" branch necessarily belongs to one class: it can be terminated with a leaf node labelled "rain". This gives us:

    Cloud Cover
      Clear:    Dry
      Overcast: warm,overcast,windy: rain
                cool,overcast,calm:  dry
                cool,overcast,windy: rain
                warm,overcast,calm:  dry
      Cloudy:   Rain

The "Overcast" branch has both rain and dry examples, so we must attempt to extend the tree from this node.
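The root-level gain figures above can be reproduced with a short Python sketch (the attribute and class key names used here are ad hoc labels for this illustration):

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

    def gain(examples, att):
        labels = [ex["precip"] for ex in examples]
        rem = 0.0
        for value in {ex[att] for ex in examples}:
            subset = [ex["precip"] for ex in examples if ex[att] == value]
            rem += len(subset) / len(examples) * entropy(subset)
        return entropy(labels) - rem

    data = [  # the eight weather examples
        {"temp": "warm", "cloud": "overcast", "wind": "windy", "precip": "rain"},
        {"temp": "cool", "cloud": "overcast", "wind": "calm",  "precip": "dry"},
        {"temp": "cool", "cloud": "cloudy",   "wind": "windy", "precip": "rain"},
        {"temp": "warm", "cloud": "clear",    "wind": "windy", "precip": "dry"},
        {"temp": "cool", "cloud": "clear",    "wind": "windy", "precip": "dry"},
        {"temp": "cool", "cloud": "overcast", "wind": "windy", "precip": "rain"},
        {"temp": "cool", "cloud": "clear",    "wind": "calm",  "precip": "dry"},
        {"temp": "warm", "cloud": "overcast", "wind": "calm",  "precip": "dry"},
    ]

    for att in ("temp", "cloud", "wind"):
        print(att, round(gain(data, att), 3))
    # prints roughly: temp 0.003, cloud 0.454, wind 0.348 -> Cloud Cover is chosen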
Extending the Overcast Subtree

There are 4 examples: 2 rain and 2 dry. So p(rain) = p(dry) = 0.5 and the uncertainty is 1. There are two remaining attributes: Temperature and Wind.

Information Gain from the Temperature Attribute
Cool examples: there are 2 of these; 1 rain and 1 dry. So p(rain) = 1/2 and p(dry) = 1/2. Hence
    Inf_cool = -(1/2 log2(1/2) + 1/2 log2(1/2)) = 1
Warm examples: there are also 2 of these; 1 rain and 1 dry. So again
    Inf_warm = -(1/2 log2(1/2) + 1/2 log2(1/2)) = 1
Average information:
    1/2 Inf_cool + 1/2 Inf_warm = 0.5 x 1.0 + 0.5 x 1.0 = 1
Hence the information gain for Temperature is zero!

Information Gain from the Wind Attribute
Windy examples: there are 2 of these; 2 rain and 0 dry. So p(rain) = 1 and p(dry) = 0. Hence (taking 0 x log2(0) to be 0)
    Inf_windy = -(1 x log2(1) + 0 x log2(0)) = 0
Calm examples: there are also 2 of these; 0 rain and 2 dry. So again
    Inf_calm = -(0 x log2(0) + 1 x log2(1)) = 0
Average information:
    1/2 Inf_windy + 1/2 Inf_calm = 0.5 x 0.0 + 0.5 x 0.0 = 0
Hence the information gain for Wind is 1.

Note: this reflects the fact that Wind is a perfect predictor of precipitation for this subset of examples.

The Best Attribute
Obviously Wind is the best attribute, so we can now extend the tree:

    Cloud Cover
      Clear:    Dry
      Overcast: Wind
                  Windy: warm,overcast,windy: rain
                         cool,overcast,windy: rain
                  Calm:  cool,overcast,calm:  dry
                         warm,overcast,calm:  dry
      Cloudy:   Rain

All the examples on both of the new branches belong to the same class, so they can be terminated with appropriately labelled leaf nodes:

    Cloud Cover
      Clear:    Dry
      Overcast: Wind
                  Windy: Rain
                  Calm:  Dry
      Cloudy:   Rain

FROM DECISION TREES TO PRODUCTION RULES

Decision trees can easily be converted into sets of IF-THEN rules. The tree just derived would become:

    IF clear THEN dry
    IF cloudy THEN rain
    IF overcast AND calm THEN dry
    IF overcast AND windy THEN rain

Such rules are usually easier to understand than the corresponding tree. Large trees produce large sets of rules; it is often possible to simplify these considerably by applying transformations to them. In some cases these simplified rule sets are more accurate than the original tree because they reduce the effect of overfitting, a topic we will discuss later.
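A sketch of this conversion for the nested-dictionary tree representation used in the earlier Python sketch (one rule per root-to-leaf path; rule simplification is not attempted here):

    def tree_to_rules(tree, conditions=()):
        """Turn a nested decision tree into a list of IF-THEN rule strings.
        A leaf is a class label; an internal node is
        {"att": name, "branches": {value: subtree, ...}}."""
        if not isinstance(tree, dict):                 # leaf: emit one rule
            lhs = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
            return [f"IF {lhs} THEN {tree}"]
        rules = []
        for value, subtree in tree["branches"].items():
            rules += tree_to_rules(subtree, conditions + ((tree["att"], value),))
        return rules

    # For the weather tree built earlier this yields rules such as
    #   IF cloud = clear THEN dry
    #   IF cloud = overcast AND wind = windy THEN rain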
REFINEMENTS OF DECISION TREE LEARNING

The basic top-down procedure for decision tree construction is exemplified by ID3. This technique has proved extremely successful as a method for classification learning, and consequently it has been applied to a wide range of problems. As this has happened, limitations have emerged and new techniques have been developed to deal with them. These include:

  Dealing with missing values
  Inconsistent data
  Incrementality
  Handling numerical attributes
  Attribute value grouping
  Alternative attribute selection criteria
  The problem of overfitting

Note: many of these problems also arise in other learning procedures and statistical methods. Hence many of the solutions developed for use with decision trees may be useful in conjunction with other techniques.

MISSING VALUES

Missing values are a major problem when working with real data sets. A survey respondent may not have answered a question; the results of a lab test may not be available. There are various approaches to this problem.

Discard the training example.
  If there are many attributes, each of which might be missing, this may almost wipe out the training data.

Guess the value, e.g. substitute the commonest value.
  Obviously error prone, but quite effective if missing values for a given variable are rare. More sophisticated guessing techniques have been developed; this is sometimes called "imputation".

Assign a probability to each possible value.
  Probabilities can be estimated from the remaining examples. For training purposes, treat missing-value cases as fractional examples in each class for the gain computation. This fractionation will propagate as the tree is extended, and the resulting tree will give a probability for each class rather than a definite answer.

INCONSISTENT DATA

It is possible (and not unusual) to arrive at a situation in which the examples associated with a leaf node belong to more than one class. This situation can arise in two ways:

1. There are no more attributes that could be used to further subdivide the examples.
   e.g. Suppose the weather data had contained a ninth example: [ cool, cloudy, windy; dry ]

2. There are more attributes but none of them is useful in distinguishing the classes.
   e.g. Suppose the weather data had included the above example and another attribute that identified the person who recorded the data.

No More Attributes
A decision tree program can do one of two things:
  Predict the most probable (modal) class. This is what the pseudocode given earlier does.
  Make a set of predictions with associated probabilities. This is better.

No More Useful Attributes
In this situation the system cannot build a subtree to further discriminate the classes, because the unused attributes do not correlate with the classification. This situation must be handled in the same way as "no more attributes", but first the program must detect that the situation has arisen.

Detecting that Further Progress is Impossible
This requires a threshold on information gain or a statistical test. We will discuss this when we consider pre-pruning as a possible solution to overfitting.

INCREMENTALITY

Most decision tree induction programs require the entire training set to be available at the start. Such programs can only incorporate new data by building a complete new tree. This is not a problem for most data mining applications, but in some applications new training examples become available at intervals; consider, for example, a robot dog learning football tactics by playing games. Learning programs that can accept new training instances after learning has begun, without starting again, are said to be incremental.

Incremental Decision Tree Induction
ID4 is a modification of ID3 that can learn incrementally. It maintains counts at every node throughout learning:
  the number of examples of each class associated with the node and its subtree;
  the number of examples of each class having each possible value of each attribute.
When a new example is encountered, all the counts are updated. Where counts have changed, the system can calculate whether the current best attribute is still the best. If not, a new subtree is built to replace the original.
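The sketch below is a minimal illustration of this kind of bookkeeping; it is not ID4 itself, just a node-level count structure from which information gain can be recomputed after each new example arrives.

    from collections import defaultdict
    from math import log2

    class NodeCounts:
        """Per-node counts for incremental induction: class counts and
        per-attribute-value class counts."""

        def __init__(self, atts):
            self.atts = atts
            self.class_counts = defaultdict(int)
            # value_class[att][value][cls] -> number of examples seen
            self.value_class = {a: defaultdict(lambda: defaultdict(int)) for a in atts}

        def update(self, example, cls):
            """Fold one new classified example into the counts."""
            self.class_counts[cls] += 1
            for a in self.atts:
                self.value_class[a][example[a]][cls] += 1

        def gain(self, att):
            """Information gain of att, computed from the stored counts alone."""
            def H(counts):
                n = sum(counts)
                return -sum(c / n * log2(c / n) for c in counts if c)
            total = sum(self.class_counts.values())
            rem = sum(sum(d.values()) / total * H(d.values())
                      for d in self.value_class[att].values())
            return H(self.class_counts.values()) - rem

    # After each update, comparing gain() across attributes shows whether the
    # attribute currently used at this node is still the best; if not, the
    # subtree below it is rebuilt.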
Building Decision Trees with Numeric Attributes

The Problem
Standard decision tree procedures are designed to work with categorical attributes: no account is taken of any numerical relationship between the values. The branching factor of the tree will be reasonable if the number of distinct values is modest. If the same procedures are applied to numerical attributes we run into two difficulties:
  The branching factor may become absurdly large. Consider the attribute "income" in a survey of 1000 people: it is not unlikely that every individual will have a different annual income.
  All the information implicit in the ordering of the values is thrown away.

The Solution
Partition the value set into a small number of contiguous subranges and then treat membership of each subrange as a categorical variable. The result is in effect a new ordinal attribute, with a reasonable branching factor and with some of the ordering information retained.

Discretization of Continuous Variables

There are two possible approaches to partitioning a range of numeric values in order to build a classifier:
  Divide the range into a preset number of subranges. For example, each subrange could have equal width or include an equal number of examples.
  Use the classification variable to determine the best way to partition the numeric attribute.
The second approach has proved more successful.

Discretization Using the Classification Variable
In principle this is straightforward:
  Consider every possible partitioning of the numeric attribute.
  Assess each partitioning using the attribute selection criterion.
In practice this is infeasible: a set of m training examples can be partitioned into contiguous subranges in 2^(m-1) ways. However, it can be proved that if two neighbouring values belong to the same class they should be assigned to the same group. This reduces the number of possibilities, but the number of partitionings is still infeasibly large. Two solutions are possible:
  Consider only the m-1 binary partitions (e.g. C4.5), as in the sketch below.
  Use heuristics to find good multiple partitions.
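A sketch of the binary-partition search for a single numeric attribute, in the spirit of the C4.5 approach mentioned above: each candidate threshold lies between adjacent distinct values, and the one giving the highest information gain is kept.

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

    def best_binary_split(values, labels):
        """Try every threshold midway between adjacent distinct values
        (the m-1 binary partitions) and return (threshold, gain) for the best."""
        pairs = sorted(zip(values, labels))
        base = entropy(labels)
        best_gain, best_thr = -1.0, None
        for i in range(1, len(pairs)):
            lo, hi = pairs[i - 1][0], pairs[i][0]
            if lo == hi:
                continue                  # no boundary between equal values
            left = [c for v, c in pairs[:i]]
            right = [c for v, c in pairs[i:]]
            rem = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
            if base - rem > best_gain:
                best_gain, best_thr = base - rem, (lo + hi) / 2
        return best_thr, best_gain

    # e.g. best_binary_split([20, 35, 52, 61, 75], ["dry", "dry", "rain", "rain", "dry"])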
ATTRIBUTE VALUE GROUPING

Discretization of numeric attributes involves finding subgroups of values that are equivalent for the purposes of classification. This notion can usefully be applied to categorical variables.

Suppose
  a decision tree program attempts to induce a tree to predict some binary class, C;
  the training set comprises feature vectors of nominal attributes A1, A2, ..., Ak;
  the class C is in fact defined by the following classification function:

      C = V2,1 ∧ (V5,2 ∨ V5,4) ∧ V8,3

  where Vi,j denotes that attribute Ai takes its jth value.
Suppose finally that each attribute has four possible values and the program selects attributes in the order A5, A2, A8.

[Figure: the resulting tree tests A5 at the root; under each of the values V5,2 and V5,4 it repeats an identical subtree testing A2 and then A8, reaching a "Y" leaf only via V2,1 and V8,3; every other branch ends in an "N" leaf.]

There is a great deal of duplication in the structure of this tree. If branches could be created for groups of attribute values, a much simpler tree could be constructed:

    A5
      V5,2 or V5,4:        A2
                             V2,1:                A8
                                                    V8,3:                Y
                                                    V8,1, V8,2 or V8,4:  N
                             V2,2, V2,3 or V2,4:  N
      V5,1 or V5,3:        N

The original tree has 21 nodes and divides the example space into 16 regions. The new tree has 7 nodes and divides the example space into 4 regions. Their classification behaviours are identical.

Attribute Value Grouping Procedures
Heuristic attribute value grouping procedures, similar to those used to discretize numeric attributes, can be used to produce this kind of tree simplification.

ALTERNATIVE ATTRIBUTE SELECTION CRITERIA

Hill Climbing
The basic decision tree induction algorithm proceeds using a hill-climbing approach: at every step, a new branch is created for each value of the "best" attribute, and there is no backtracking.

Which is the Best Attribute?
The criterion used for selecting the best attribute is therefore very important, since there is no opportunity to rectify its mistakes. Several alternatives have been used successfully.

Information-Based Criteria
If an experiment has n possible outcomes, then the amount of information, expressed in bits, provided by knowing the outcome is defined to be

    I = -\sum_{i=1}^{n} p_i \log_2 p_i

where p_i is the prior probability of the ith outcome. This quantity is also known as entropy and uncertainty. For decision tree construction, the experiment is finding out the correct classification of an example.

Information Gain
ID3 originally used information gain to determine the best attribute. The information gain for an attribute A when used with a set of examples X is defined to be

    Gain(X, A) = I(X) - \sum_{v \in values(A)} \frac{|X_v|}{|X|} I(X_v)

where |X| is the number of examples in the set X. This criterion has been (and is) widely and successfully used, but it is known to be biased towards attributes with many values.

Why is it biased?
A many-valued attribute will partition the examples into many subsets, so the average size of these subsets will be small. Some of them are likely to contain a high percentage of one class by chance alone. Hence the true information gain will be over-estimated.

Information Gain Ratio
The information gain ratio measure incorporates an additional term, split information, to compensate for this bias. It is defined:

    SplitInf(X, A) = -\sum_{v \in values(A)} \frac{|X_v|}{|X|} \log_2 \frac{|X_v|}{|X|}

which is in fact the information imparted when you are given the value of attribute A. For example, if we consider equiprobable values:
  If A has 2 values, SplitInf(X,A) = 1
  If A has 4 values, SplitInf(X,A) = 2
  If A has 8 values, SplitInf(X,A) = 3
The information gain ratio is then defined:

    GainRatio(X, A) = Gain(X, A) / SplitInf(X, A)

The gain ratio itself can lead to difficulties if the values are far from equiprobable: if most of the examples have the same value then SplitInf(X,A) will be close to zero, and hence GainRatio(X,A) may be very large. The usual solution is to use Gain rather than GainRatio whenever SplitInf is small.

Information Distance
SplitInf can be regarded as a normalisation factor for gain. Lopez de Mantaras has developed a mathematically sounder form of normalisation based on the information distance between two partitions. This has not been widely adopted.

The Gini Criterion
An alternative to the information-theory-based measures, based on the notion of minimising the misclassification rate. Suppose you know the probabilities p(c_i) of each class c_i for the examples assigned to a node. Suppose you are given an unclassified example that would be assigned to that node and decide to make a random guess of its class with probabilities p(c_i). What is the probability that you will guess incorrectly?

    G = \sum_{i \ne j} p(c_i) p(c_j)

where the sum is taken over all pairs of distinct classes. G is called the Gini criterion and can be used in the same way as information to select the best attribute.
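A sketch computing gain, split information, gain ratio and the Gini criterion for a categorical attribute, using the definitions above (examples are attribute dictionaries and cls_key names the class attribute; both are ad hoc conventions for this illustration):

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

    def gini(labels):
        """Gini criterion: chance that a random class guess drawn from the
        class distribution misclassifies a random example (= 1 - sum p_i^2,
        which equals the pairwise sum in the definition above)."""
        n = len(labels)
        return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())

    def split_scores(examples, att, cls_key):
        """Return (gain, split information, gain ratio) for attribute att."""
        labels = [ex[cls_key] for ex in examples]
        n = len(examples)
        rem, split_inf = 0.0, 0.0
        for value in {ex[att] for ex in examples}:
            subset = [ex[cls_key] for ex in examples if ex[att] == value]
            w = len(subset) / n
            rem += w * entropy(subset)
            split_inf -= w * log2(w)
        g = entropy(labels) - rem
        # In practice, fall back to plain gain when split_inf is very small.
        ratio = g / split_inf if split_inf > 0 else float("inf")
        return g, split_inf, ratio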
SO WHAT IS THE BEST ATTRIBUTE SELECTION CRITERION?

A good attribute selection criterion should select those attributes that most improve prediction accuracy. It should also be cheap to compute.

Which criteria are widely used?
  Information gain ratio is the most popular in machine learning research.
  The Gini criterion is very popular in the statistical and pattern recognition communities.

The evidence
Empirical evidence suggests that the choice of criterion has little impact on the classification accuracy of the resulting trees. There are claims that some methods produce smaller trees with similar accuracies.

THE PROBLEM OF OVERFITTING

Question: What would happen if a completely random set of data were used as the training and test sets for a decision tree induction program?
Answer: The program would build a decision tree. If there were many variables and plenty of data it could be quite a large tree.

Question: Would the tree be any good as a classifier? Would it, for example, do better than the simple strategy of always picking the modal class?
Answer: No. Note also that if the experiment were repeated with a new set of random data we would get an entirely different tree.

Questions: Isn't this rather worrying? Could the same sort of thing be happening with non-random data?
Answers: Yes and yes.

What is going on?

A decision tree is a mathematical model of some population of examples, but the tree is built on the basis of a sample from that population: the training set. So what a decision tree program really does is build a model of the training set. The features of such a model can be divided into two groups:
1. Those that reflect relationships that are true for the population as a whole.
2. Those that reflect relationships that are peculiar to the particular training set.

Overfitting
Roughly speaking, what happens is this:
  Initially the features of a decision tree will reflect features of the whole population.
  As the tree gets deeper, the samples at each node get smaller and the major relationships of the population will already have been incorporated into the model. Consequently any further additions are likely to reflect relationships that have occurred by chance in the training data.
  From this point on the tree becomes a less accurate model of the population: typically 20% less accurate.
This phenomenon of modelling the training data rather than the population it represents is called overfitting.

Eliminating Overfitting

There are two basic ways of preventing overfitting:
1. Stop tree growth before it happens: pre-pruning.
2. Remove the parts of the tree due to overfitting after it has been constructed: post-pruning.

Pre-pruning
This approach is appealing because it would save the effort involved in building and then scrapping subtrees. It implies the need for a stopping criterion: a function whose value determines when a leaf node should not be expanded into subtrees. Two types of stopping criteria have been tried (a sketch of the first appears below):

  Stopping when the improvement gets too small. Typically, stop when the improvement indicated by the attribute selection criterion drops below some pre-set threshold ε. The choice of ε is crucial: too low and you still get overfitting; too high and you lose accuracy. This method proved unsatisfactory because it wasn't possible to choose a value of ε that worked for all data sets.

  Stopping when the evidence for an extension becomes statistically insignificant. Quinlan used χ² testing in some versions of ID3. He later abandoned this because the results were "satisfactory but uneven".
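A sketch of the threshold-based stopping criterion grafted onto the earlier induction sketch; gain_fn(examples, att) and modal_fn(examples) are assumed to be supplied (for instance, built from the entropy helpers in the earlier sketches).

    def build_pre_pruned(examples, atts, epsilon, gain_fn, modal_fn):
        """Top-down induction with pre-pruning: a node is not expanded when
        the best available gain falls below epsilon."""
        classes = {c for _, c in examples}
        if len(classes) == 1:
            return classes.pop()
        if not atts:
            return modal_fn(examples)
        gains = {a: gain_fn(examples, a) for a in atts}
        best = max(gains, key=gains.get)
        if gains[best] < epsilon:
            # Stopping criterion: the improvement is too small to be trusted.
            # Too low an epsilon still overfits; too high loses accuracy.
            return modal_fn(examples)
        branches = {}
        for value in {ex[best] for ex, _ in examples}:
            subset = [(ex, c) for ex, c in examples if ex[best] == value]
            branches[value] = build_pre_pruned(
                subset, [a for a in atts if a != best], epsilon, gain_fn, modal_fn)
        return {"att": best, "branches": branches, "default": modal_fn(examples)}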
Chi-Square Tests

Chi-square testing is an extremely useful technique for determining whether the differences between two distributions could be due to chance; that is, whether they could both be samples of the same parent population.

Suppose we have a set of n categories and a set of observations O_1 ... O_i ... O_n of the frequency with which each category occurs in a sample. Suppose we wish to know whether this set of observations could be a sample drawn from some population whose frequencies we also know. We can calculate the expected frequencies E_1 ... E_i ... E_n of each category if the sample exactly followed the distribution of the population. Now compute the value of the chi-square statistic, defined:

    \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}

Clearly χ² increases as the two distributions deviate. To determine whether the deviation is statistically significant, consult chi-square tables for the appropriate number of degrees of freedom (in this case n-1).

Post-Pruning

The Basic Idea
First build a decision tree, allowing overfitting to occur. Then, for each subtree:
  Assess whether a more accurate tree would result if the subtree were replaced by a leaf (the leaf will choose the modal class for classification).
  If so, replace the subtree with a leaf.

Validation Data Sets
How do we assess whether a pruned tree would be more accurate?
  We can't use the training data, because the tree has overfitted to this.
  We can't use the test data, because then we would have no independent measure of the accuracy of the final tree.
  We must have a third set used only for this purpose. This is known as a validation set.

Notes:
  Validation sets can also be used in pre-pruning.
  In C4.5, Quinlan uses the training data for validation but treats the result as an estimate and sets up a confidence interval. This is statistically dubious because the training data isn't an independent sample; Quinlan justifies it on the grounds that it works in practice.

Refinements

Substituting Branches
Rather than replacing a subtree with a leaf, it can be replaced by its most frequently used branch.

More Drastic Transformations
More substantial changes, possibly leading to a structure that is no longer a tree, have also been used. One example is the transformation into rule sets in C4.5:
  Generate a set of production rules equivalent to the tree by creating one rule for each path from the root to a leaf.
  Generalize each rule by removing any precondition whose loss does not reduce the accuracy. This step corresponds to pruning, but note that the structure may no longer be equivalent to a tree: an example might match the LHS of more than one rule.
  Sort the rules by their estimated accuracy. When the rules are used for classification, this accuracy is used for conflict resolution.
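A sketch of bottom-up post-pruning against a held-out validation set, using the nested-dictionary tree representation from the earlier sketches; this is a reduced-error-style illustration, not C4.5's confidence-interval method.

    from collections import Counter

    def classify(tree, example):
        """Classify with the nested-dictionary tree representation used earlier."""
        while isinstance(tree, dict):
            tree = tree["branches"].get(example[tree["att"]], tree["default"])
        return tree

    def accuracy(tree, dataset):
        return sum(classify(tree, ex) == c for ex, c in dataset) / len(dataset)

    def post_prune(tree, training, validation):
        """Replace a subtree by a modal-class leaf whenever the leaf does at
        least as well on the validation examples reaching that node."""
        if not isinstance(tree, dict):
            return tree
        att = tree["att"]
        for value, subtree in list(tree["branches"].items()):
            sub_t = [(ex, c) for ex, c in training if ex[att] == value]
            sub_v = [(ex, c) for ex, c in validation if ex[att] == value]
            if sub_t:    # prune each child using the data that reaches it
                tree["branches"][value] = post_prune(subtree, sub_t, sub_v)
        leaf = Counter(c for _, c in training).most_common(1)[0][0]
        if validation and accuracy(leaf, validation) >= accuracy(tree, validation):
            return leaf
        return tree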
Suggested Readings

Mitchell, T. M. (1997), Machine Learning, McGraw-Hill. Chapter 3.

Tan, Steinbach & Kumar (2006), Introduction to Data Mining. Chapter 4.

Han & Kamber (2006), Data Mining: Concepts and Techniques. Section 6.3.

Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984), Classification and Regression Trees. Wadsworth, Pacific Grove, CA. (A thorough treatment of the subject from a more statistical perspective and an essential reference if you are doing research in the area; usually known as "the CART book".)

Quinlan, J. R. (1986), Induction of Decision Trees. Machine Learning, 1(1), pp 81-106. (A full account of ID3.)

Quinlan, J. R. (1993), C4.5: Programs for Machine Learning. Morgan Kaufmann, Los Altos, CA. (A complete account of C4.5, the successor to ID3 and the yardstick against which other decision tree induction procedures are usually compared.)

Dougherty, J., Kohavi, R. and Sahami, M. (1995), Supervised and Unsupervised Discretisation of Continuous Features. In Proc. 12th Int. Conf. on Machine Learning, Morgan Kaufmann, Los Altos, CA, pp 194-202. (A good comparative study of different methods for discretising numeric attributes.)

Ho, K. M. and Scott, P. D. (2000), Reducing Decision Tree Fragmentation Through Attribute Value Grouping: A Comparative Study. Intelligent Data Analysis, 6, pp 255-274.

Implementations

An implementation of a decision tree procedure is available as part of the WEKA suite of data mining programs. It is called J4.8 and closely resembles C4.5.