Data Mining with Decision Trees and Decision Rules
C. Apte and S.M. Weiss
Future Generation Computer Systems, November 1997

Chidanand Apte
T.J. Watson Research Center, IBM Research Division, Yorktown Heights, NY 10598

Sholom Weiss
Department of Computer Science, Rutgers University, New Brunswick, NJ 08903
Abstract

This paper describes the use of decision tree and rule induction in data mining applications. Of the methods for classification and regression that have been developed in the fields of pattern recognition, statistics, and machine learning, these are of particular interest for data mining since they utilize symbolic and interpretable representations. Symbolic solutions can provide a high degree of insight into the decision boundaries that exist in the data, and the logic underlying them. This aspect makes these predictive mining techniques particularly attractive in commercial and industrial data mining applications. We present here a synopsis of some major state-of-the-art tree and rule mining methodologies, as well as some recent advances.

Keywords: Decision Tree, Rule Induction, Data Mining

1 Introduction

The use of computer technology in decision support is now widespread and pervasive across a wide range of business and industry. This has resulted in the capture and availability of data in immense volume and proportion. There are many examples that can be cited. Point of sale data in retail, policy and claim data in insurance, medical history data in health care, and financial data in banking and securities are some instances of the types of data that are being collected. The data are typically a collection of records, where each individual record may correspond to a transaction or a customer, and the fields in the record correspond to attributes. Very often, these fields are of mixed type, with some being numerical (continuous valued, e.g. age) and some symbolic (discrete valued, e.g. color).

Given this layout of data, one may model a field in terms of all the other fields in the data. With an acceptably accurate learning method, it is possible to develop predictive applications, i.e., using the solution to predict the expected value of the field of interest, given all the others. For example, a credit company may use its customer data base to model delinquency. Given a new application profile, this solution may be used to predict the likelihood that the applicant may default.

Classification and regression are critical types of prediction problems. When the goal of prediction is discrete valued, a classification solution is developed. When the goal is numerical and continuous, a regression solution is developed.

This type of data analysis has been an active area of research in many scientific areas for quite some time. These domains, ranging from medicine and geology to astronomy and physics, have relied upon the investigation of data gathered from experiments and observations to formulate predictive models. These solutions are essentially approximation functions for describing the behavior of a response or objective variable of interest in terms of independent or input variables or features. Modeling methods have been developed drawing upon techniques from statistics, pattern recognition, and machine learning [18, 21, 27].

A classical example of a predictive modeling application in medicine is the gathering of extensive observations of patients with and without diagnosis of a particular disease. The modeling task then attempts to formulate a model that describes the likelihood of a patient having the disease as a function of all the independent observations about the patient. Once an accurate predictive model is formulated, it serves as an additional piece of diagnostic machinery that may be used for predicting the likelihood of a disease given a patient's data.

Until recently, these techniques had been restricted in their applications, given the volume of data and computational resource that was required for the modeling. However, with the increasing availability of high volume data in business and industry, and the significant drop in computational cost, many of these techniques are beginning to be applied in commercial applications, with many demonstrated successes.
There are many issues related to the problem of formulating a predictively accurate model. These have mainly to do with the nature of the data and the representation language for the solution. Data characteristics generally dictate the complexity of the mining task. The data may be noisy, incomplete, or incorrect. The representation language for the solution will usually limit the scope of the functions that can be formulated.

Classification and regression learning has been evolving over a considerable period of time, with contributions coming from statistics, pattern recognition, and more recently, the field of machine learning. One of the earliest methods developed for classification modeling was the technique of linear discriminants [14]. An early technique that came into existence for regression was linear regression [23]. Each has its own limitations. Since then, a slew of techniques and methods have been developed, including k-nearest-neighbor, decision tree, rule induction, neural networks, etc. [27]. For the remaining part of this paper, the problem of classification modeling will be examined from a decision tree and rule induction perspective. Tree and rule based regression modeling will then be briefly introduced, and finally the conclusion will discuss general issues and directions for symbolic solution mining.
2 Symbolic methods for classification modeling

The problem of classification modeling will be examined here through a simple, hypothetical example. Assume that length and diameter data is available for a variety of pegs that are being manufactured on an assembly line. Pegs are either square, star, or diamond shaped. A classification solution that characterizes the peg variety as a function of the length and diameter can be useful in understanding how these varieties differ, for designing visual inspection systems and automated sorting machinery. Figure 1 illustrates this data. The figure also shows the two axis parallel lines, one at Length = 0.75, and the second at Diameter = 3.00, that seem to completely partition the three peg varieties into three different sub-areas. Decision tree solution methods provide automated techniques for discovering these types of axis parallel partitions, in their maximally general forms.

Figure 1: Peg Data

Figure 2 illustrates a decision tree that corresponds to the partitions shown in Figure 1. At the top level of the tree, there is the root node, at which the classification process begins. The test at the root node tests all example instances for Length <= 0.75. Examples that satisfy this test are passed down the left (TRUE) arc to a leaf node, indicating that all examples belong to a single class (SQUARE) and no more tests are needed. The right (FALSE) arc from the root node receives all examples that fail the test at the root node. These examples are not yet purely from one class, so further testing is required at this intermediate node. The test at this node is for Diameter <= 3.00. Examples that satisfy this test are all in one class (STAR) and those that don't are also all in one class (DIAMOND), and so both arcs from this node lead to leaf nodes, and the decision tree solution is complete. Note that this example illustrates a binary tree, where each intermediate node can split into at most two sub-trees. Decision trees may be non-binary also, where each node may split into more than two sub-trees, by performing tests that result in more than two outcomes (e.g. subset membership or interval membership tests).
Length <= 0.75?
  True  -> SQUARE
  False -> Diameter <= 3.00?
             True  -> STAR
             False -> DIAMOND

Figure 2: Classifying Pegs with a Decision Tree

Closely related to decision tree solutions are rule based solutions. A rule may be constructed by forming a conjunct of every test that occurs on a path between the root node and a leaf node of a tree. The collection of all such rules obtained by traversing every unique path from root node to leaf node is a corresponding rule based solution for classification. For example, for the peg data that was used to illustrate the decision tree, an equivalent rule solution is:

If   (Length <= 0.75)
Then Square

If   (Not (Length <= 0.75)) &
     (Diameter <= 3.0)
Then Star

If   (Not (Length <= 0.75)) &
     (Not (Diameter <= 3.0))
Then Diamond

Some rule induction programs are add-ons to decision tree solutions, whereby a tree is first generated, and then translated into a set of rules. However, techniques that directly generate rules from data are also available, which overcome some of the drawbacks of decision tree modeling. Often, in more complex data sets, disjuncts of rules form the description for a class, and hence the rule based solutions are more generally identified by the term DNF (Disjunctive Normal Form), or decision rules.

Rules that are created by translating a decision tree into DNF expressions are typically mutually exclusive in nature, since a decision tree essentially partitions a data space into distinct disjoint regions via axis parallel surfaces created by its top-down sequence of decisions. For certain data spaces, this nature of partitioning may not always be capable of producing compact solutions. For example, decision trees cannot easily model simple exclusive-or functions. On the other hand, if algorithms are employed that directly generate DNF expressions from data, it is possible to create rules that capture such decision surfaces, via non-mutually exclusive decision rules. These rules essentially correspond to decision regions that overlap each other in the data space.

Once a decision tree or decision rule solution is generated from data, it can be used for estimating or predicting the response or class variable for a new case. The application of a decision tree to a data example is a straightforward top-down decision process, controlled by evaluating the tests and taking the appropriate branch, beginning at the root, and terminating when a leaf node is reached. The process of applying decision rules to data examples is determined by the style in which the rules were generated. A rule generation algorithm may induce ordered rule sets, i.e., induced by ordering all the classes which are present, and then using a fixed sequence, such as the smallest to the largest class, with rules for every class being discovered under the assumption that the only classes that need to be discriminated between are the ones that are remaining in the sequence. With this rule set, the rules have to be applied to a new data example in exactly the same sequence as they were generated. A rule generation algorithm may also induce un-ordered rule sets, in which rules for every class are generated under the assumption that all other classes that are present in the data need to be discriminated against during the induction process. With this type of a rule set, application to a new data example can be order independent and more flexible in creating different rule application strategies.

Decision tree and decision rule solutions offer a level of interpretability that is unique to symbolic models. The solutions may be directly inspected to understand the decision surfaces that exist in the data. This particular aspect, which makes these solutions easy to digest even for a non-technical end-user, makes these techniques very appealing in decision support related data mining activities, where insight and explanations are of critical importance. What makes this approach technically viable is the fact that most modern symbolic modeling methodologies succeed in formulating solutions that are also competitive in predictive accuracy when compared to more non-intuitive or quantitative techniques, such as neural networks. This is an important reason for the increased attention to and use of decision rule modeling techniques that generate rules directly from data [9, 1, 2, 8].

Classification modeling algorithms are designed with several objectives. Perhaps the most well known criteria by which these algorithms are evaluated are accuracy, speed, and interpretability. Solutions derived using different approaches can thus be compared in terms of their predictive accuracy on unseen data, on
the computational cost involved in generating the solution, and the level of understanding and insight that is provided by the solution. Both decision tree and decision rule modeling systems score high on interpretability. Accuracy and speed vary from algorithm to algorithm, and in most instances these two issues are coupled, i.e., improving predictive accuracy tends to require increased computational effort.

Learning algorithms take a variety of factors into account while computing the classification solution. There is inherent noise in real-world data that needs to be handled. The prior distributions of classes in the training set may affect the solution generation. There may be explicit penalties associated with misclassification that need factoring in. These and related issues will be described by going through brief descriptions of some actual decision tree and decision rule modeling algorithms in the following sections.
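The hypothetical peg example above can be written out directly; the following Python sketch (ours, not from any of the systems discussed) shows the Figure 2 tree as nested tests and the equivalent three DNF rules, which are mutually exclusive and therefore order independent.

# Hypothetical peg example from Figures 1 and 2: classify a peg from its
# length and diameter using the induced decision tree, and using the
# equivalent DNF rule set obtained by conjoining the tests on each
# root-to-leaf path.

def classify_with_tree(length, diameter):
    # Root node test.
    if length <= 0.75:
        return "SQUARE"          # left (TRUE) arc reaches a pure leaf
    # Intermediate node test for examples failing the root test.
    if diameter <= 3.00:
        return "STAR"
    return "DIAMOND"

# The same solution as unordered, mutually exclusive rules: each rule is a
# conjunction of the tests on one root-to-leaf path, so exactly one fires.
RULES = [
    (lambda l, d: l <= 0.75, "SQUARE"),
    (lambda l, d: not (l <= 0.75) and d <= 3.00, "STAR"),
    (lambda l, d: not (l <= 0.75) and not (d <= 3.00), "DIAMOND"),
]

def classify_with_rules(length, diameter):
    for condition, label in RULES:
        if condition(length, diameter):
            return label
    return None  # cannot happen here: the three rules cover the whole space

if __name__ == "__main__":
    print(classify_with_tree(0.5, 2.0))   # SQUARE
    print(classify_with_rules(1.2, 2.5))  # STAR
    print(classify_with_rules(1.2, 4.0))  # DIAMOND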
2.1 Error estimation and evaluation criteria

Estimating the true accuracy of a decision tree or rule model is one of the most important aspects of the modeling process. A solution generated from a set of training examples will almost always be highly accurate on the same data, but far less accurate on new data. A sample of cases contains noise and varies from other samples, leading a learning method astray in its predictions. To handle this shortcoming, most modeling techniques employ a two-fold strategy in the model generation process, where the first step involves the generation of the model from the training data, and the second step involves testing the proposed solution on independent cases, sometimes pruning it to compensate for the over-fitting of the first step.

The essential problem to be solved in the pruning step may be specified as follows: given a set of sample examples, S, where each example is composed of observed features and the class labels, the problem is to find the best model RSbest such that the error rate on new examples, Errtrue(RSbest), is minimum. Given an over-fitted model RS on a set of examples S, a derivative of RS needs to be determined that satisfies the above criteria.

Several pruning techniques have been devised that fit the above paradigm, and will be explained in more detail in the following sections. Broadly speaking, these techniques usually either employ the cross-validation approach or the train-and-test approach. Cross-validation is usually preferred when the modeling is being done with small samples, so one repeatedly breaks up the data into different combinations of train and test partitions. The training partition is used to generate an over-fitted model, while the test partition is used to generalize this model to the best possible derivative. An averaging process across the many different train-test combinations (hence the term cross-validation) is used to select the final RSbest from the many candidates that are available. When the data set that is available is large, a single train-test partition is sufficient for evaluation and to select RSbest using the pruning approach.

3 Decision Tree Modeling

Decision trees are generated from training data in a top-down, general-to-specific direction. The initial state of a decision tree is the root node that is assigned all the examples from the training set. If it is the case that all examples belong to the same class, then no further decisions need to be made to partition the examples, and the solution is complete. If examples at this node belong to two or more classes, then a test is made at the node that will result in a split. The process is recursively repeated for each of the new intermediate nodes until a completely discriminating tree is obtained. A decision tree at this stage is potentially an over-fitted solution, i.e., it may have components that are too specific to noise and outliers that may be present in the training data. To relax this over-fitting, most decision tree methods go through a second phase called pruning that tries to generalize the tree by eliminating sub-trees that seem too specific. Error estimation techniques play a major role in tree pruning. Most modern decision tree modeling algorithms are a combination of a specific type of splitting criterion for growing a full tree, and a specific type of pruning criterion for pruning the tree.

3.1 Growing a Full Tree

CART [4] is a binary decision tree modeling algorithm that has been in extensive use. The evaluation function used for splitting in CART is the GINI index. For a given current node t, this index is defined as Gini(t) = 1 - sum_i p_i^2, where p_i is the probability of class i in t. For each candidate split, the impurity (as defined by the GINI index) of all the sub-partitions is summed and the split that causes the maximum reduction in impurity is chosen. For the candidate splits, CART considers all possible splits in the sequence of values for continuous valued attributes ((n - 1) splits for n values) and all possible subset splits for categorical attributes ((2^(n-1) - 1) splits for n distinct values)
if n is small, and equivalence splits for categorical attributes (n splits for n distinct values) if n is large. At
each node, CART determines the best split for each
attribute and then selects the winner from this short
list, utilizing the GINI index.
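As a concrete illustration of the Gini computation just described (a minimal sketch, not CART's implementation), the following Python fragment scores the candidate thresholds on a continuous attribute and keeps the one with the largest weighted impurity reduction; the tiny data set is made up for illustration.

from collections import Counter

def gini(labels):
    # Gini index 1 - sum_i p_i^2 over the class proportions at a node.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_numeric_split(values, labels):
    # Try the candidate thresholds between consecutive distinct attribute
    # values and return the one giving the largest weighted Gini reduction.
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini(labels)
    best = (None, 0.0)  # (threshold, impurity reduction)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        reduction = parent - weighted
        if reduction > best[1]:
            best = (threshold, reduction)
    return best

# Hypothetical peg-like data: split on Length.
lengths = [0.5, 0.6, 0.7, 1.0, 1.1, 1.2]
shapes = ["SQUARE", "SQUARE", "SQUARE", "STAR", "STAR", "DIAMOND"]
print(best_numeric_split(lengths, shapes))  # threshold 0.85, strong reduction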
C4.5 [20] is another popular decision tree modeling system, a variant and extension of an earlier well known decision tree modeling system, ID3. ID3 utilizes entropy criteria for splitting nodes. Given a node t, the splitting criterion used is Entropy(t) = - sum_i p_i log p_i, where p_i is the probability of class i within node t. An attribute and split are selected that minimize entropy. Splitting a node produces two or more direct descendants. Each child has a measure of entropy. The sum of each child's entropy is weighted by its percentage of the parent's cases in computing the final weighted entropy used to decide the best split.
In C4.5, given a node t, the splitting criterion used is the GainRatio(t) = gain(t) / SplitInformation(t). This ratio expresses the proportion of information generated by a split that is helpful for developing the classification, and may be thought of as a normalized information gain or entropy measure for the test. A test is selected that maximizes this ratio, as long as the numerator (the information gain) is larger than the average gain across all tests. The numerator in this ratio is the standard information entropy difference achieved at node t, expressed as gain(t) = info(T) - info_t(T), where info(T) = - sum_{i=1}^{k} (|C_i| / |T|) log (|C_i| / |T|) over the k classes C_i in the example set T, and info_t(T) = sum_{i=1}^{s} (|T_i| / |T|) info(T_i) over the s subsets T_i produced by the split at t.
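A small sketch of the quantities just defined, with classes represented simply as label lists; this is an illustrative computation of entropy, information gain, split information, and their ratio, not C4.5 itself.

import math
from collections import Counter

def entropy(labels):
    # Entropy(t) = -sum_i p_i log2 p_i over class proportions at a node.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent_labels, child_label_lists):
    # Information gain of a split, and its normalization by split information.
    n = len(parent_labels)
    # gain(t) = info(T) - info_t(T): parent entropy minus weighted child entropies.
    info_t = sum(len(c) / n * entropy(c) for c in child_label_lists if c)
    gain = entropy(parent_labels) - info_t
    # SplitInformation: entropy of the partition sizes themselves.
    split_info = -sum((len(c) / n) * math.log2(len(c) / n)
                      for c in child_label_lists if c)
    return gain, (gain / split_info if split_info > 0 else 0.0)

parent = ["SQUARE"] * 3 + ["STAR"] * 2 + ["DIAMOND"]
children = [["SQUARE"] * 3, ["STAR"] * 2 + ["DIAMOND"]]  # split on Length <= 0.75
print(gain_ratio(parent, children))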
Bayes' classification rule assigns an example to the class with the highest conditional probability. This rule is known theoretically to be the optimum, i.e., it minimizes the total classification error. Formally, if there are C classes, then Bayes' rule is to assign an example to class i where P(C_i | x) > P(C_j | x) for all j != i. The success of this rule is in the underlying fact that all information that can be had about class membership is contained in the set of conditional probabilities. In practice, it is not possible to compute conditional probabilities for high dimensional data sets, for an enormous number of examples will be required to correctly assess conditional probabilities of the type P(C_i | x). Limited practical derivations of Bayes' rule exist, which include linear discrimination, and more recently, Bayes' tree [5].
Bayes' tree requires knowledge of prior class probabilities (empirically derivable from class proportions in the training data). Associated with a tree is a posterior probability of correct classification. The decision to grow a tree from a node is based upon increasing the posterior probability of the resulting tree. Of all the candidate splits that can be made, the one chosen is the one that causes the maximum increase to the posterior probability.

SLIQ [16], a recent decision tree building system, pays attention to scalability issues, utilizing data structures and processing methods that allow it to be applied to very large data sets that are not required to be memory resident. In the tree building phase, SLIQ utilizes the Gini index, as in the CART system.

For more discussions on splitting criteria, see [15] in this issue.
3.2 Pruning a Full Tree

CART's pruning mechanism is an effort to get to the right sized tree that minimizes the true misclassification error estimate. A fully grown decision tree will have an apparent error rate of zero or close to zero on the training data from which the tree was built. However, its true error rate, measured by evaluating the misclassifications when the tree is applied to a test data set, may be much higher. The goal of the pruning process is to find that sub-tree that produces the least true error, taking into account the size (complexity) of the tree. Utilizing a formulation for the cost complexity of a tree, which is a function of the misclassification of the tree on the training data and the size (e.g. total number of leaves), one can derive a sequence of trees of decreasing cost complexity, starting from the fully grown tree. This sequence is recursively created by picking the last tree in the sequence (initially, the full tree), examining each of its non-leaf sub-trees, picking the one with the least cost-complexity metric, and making that the next sub-tree in the sequence. The process stops when the final sub-tree is just the root node.

Once this sequence of decreasing cost-complexity sub-trees is produced, their individual true error rates can be determined by applying each sub-tree to a holdout data set. Typically, it is observed that initially, as the cost-complexity decreases, so does the true error rate, until one reaches a minimum. Beyond that, as the cost-complexity decreases, the true error starts increasing again. Obviously, one chooses the sub-tree corresponding to the minimum true error rate as the final pruned version. This is similar to the process described in [13].
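The selection step described above can be sketched as follows, assuming the sequence of pruned sub-trees is already available as prediction functions; the cost-complexity machinery that generates the sequence is not shown, and the candidate trees and holdout cases below are hypothetical.

def error_rate(predict, holdout):
    # Fraction of holdout cases (x, y) that the candidate tree misclassifies.
    wrong = sum(1 for x, y in holdout if predict(x) != y)
    return wrong / len(holdout)

def select_pruned_tree(tree_sequence, holdout):
    # tree_sequence: candidate sub-trees (as predict functions) in order of
    # decreasing cost-complexity, from the full tree down to the root alone.
    # Returns the candidate with the minimum estimated true error.
    scored = [(error_rate(t, holdout), i, t) for i, t in enumerate(tree_sequence)]
    best_err, _, best_tree = min(scored)
    return best_tree, best_err

# Hypothetical usage: three pruned versions of the peg tree, evaluated on a
# small holdout sample of (features, label) pairs.
full   = lambda x: "SQUARE" if x[0] <= 0.75 else ("STAR" if x[1] <= 3.0 else "DIAMOND")
middle = lambda x: "SQUARE" if x[0] <= 0.75 else "STAR"
root   = lambda x: "SQUARE"
holdout = [((0.5, 2.0), "SQUARE"), ((1.2, 2.5), "STAR"), ((1.1, 4.0), "DIAMOND")]
tree, err = select_pruned_tree([full, middle, root], holdout)
print(err)  # 0.0 for the full tree on this tiny sample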
In contrast to the cost-complexity pruning of CART, in which the true error rate of a tree and its subtrees is predicted from a separate set of examples that are distinct from the training examples, C4.5 uses a significance test that compares a parent node to its children. Starting with a fully grown tree, when the results of the children are not found significantly different than those of the parent, the children are pruned. A significance test can make this comparison without requiring independent holdout cases. For example, a significance test at two standard errors is usually effective. However, holdout cases can improve pruning performance when pruning is performed at varying significance levels and the most predictive solution is selected.
Pruning Bayes' tree also relies upon enhancing the posterior probability of the resulting tree. A test set is used for determining the probabilities. Of all trees resulting from pruning a node from a tree, the one that results in maximum posterior probability is chosen.

SLIQ employs an alternate scheme for decision tree pruning, based on the MDL (minimum description length) principle. As per the MDL principle, the total cost of description is the sum of the cost of describing a model, and the cost of describing data that are exceptions to this model. Given alternate models, the MDL principle further states that the best model is the one with the least description cost. In the case of decision trees, the alternate models may be viewed as the set of sub-trees made available as a result of pruning, and the data is the set of examples from which the full tree is initially built.

SLIQ utilizes the classification error as the cost for encoding data, given a tree. The cost for describing a tree is formulated as a recursive combination of the cost of encoding a node and the cost of encoding the split at that node. The total cost at each node in a fully grown tree is then used to decide whether to prune the node back to a leaf node, to prune its left or right sub-tree, or to leave it unchanged. SLIQ employs a two-phase pruning strategy; the first phase does a balanced pruning in which internal nodes either get fully converted to leaf nodes or left unchanged. In the second phase the sub-tree obtained from the first phase is re-examined to prune back nodes by eliminating partial (either left or right) sub-trees.
4 Decision Rule Induction

Decision rules, in disjunctive normal form (DNF), may be induced from training data in a bottom-up specific-to-general style, or in a top-down general-to-specific style, as in decision tree building. This section will highlight methodologies dealing with bottom-up specific-to-general approaches to rule induction. The initial state of a decision rule solution is indeed the collection of all individual instances or examples in a training data set, each of which may be thought of as a highly specialized decision rule. Most decision rule modeling systems employ a search process to evolve this set of highly specific and individual instances to more general rules. This search process is iterative, and usually terminates when rules can no longer be generalized, or some other alternate stopping criterion is satisfied. As in the case of decision tree building, noise in the data may lead to over-fitted decision rules, and various pruning mechanisms have been developed to deal with over-fitted decision rule solutions.

Rule induction methods attempt to find a compact "covering" rule set that completely partitions the examples into their correct classes. The covering set is found by heuristically searching for a single "best" rule that covers cases for only one class. Having found a "best" conjunctive rule for a class C, the rule is added to the rule set, and the cases satisfying it are removed from further consideration. The process is repeated until no cases remain to be covered.

The AQ [17] family of algorithms is influenced and motivated by methods used in electrical engineering for simplifying logic circuits. Using the AQ terminology, a test on an attribute is called a selector, a conjunct of tests is called a complex, and a disjunction of complexes is called a cover. If a rule satisfies an example, it is called a cover for the example. Initially, every example is itself a complex in the model. Complexes are then examined, and selectors are dropped as long as the resulting complex remains consistent (matching only examples of the same class and none of any other class). Complexes are thus produced, one at a time. Combining generalized complexes produces covers that are also complete (all examples of a class are covered).

In the search process for creating complexes, an evaluation function is used for ordering and determining which selectors to drop (or generalize). Although this evaluation function can be set by an external entity (such as an end-user or calling function), the one that is normally used is the ratio of examples correctly classified by a complex to the total examples classified by that complex.

CN2 [6] may be regarded as a system that extends AQ in terms of its ability to deal with noise in the data. Specifically, CN2 retains a set of complexes during its search that are deemed to be statistically covering a large number of instances of a class, even if they also cover instances of other classes. Additionally, CN2 executes a general to specific search, as opposed to AQ's strict specific to general approach. Each specialization step either adds new selectors to a complex, or removes an entire complex.

CN2 employs two types of heuristics in the search for the best complexes: significance and goodness.
Significance is a threshold such that any complexes below the threshold will not be considered for selecting the best complex. To test significance, CN2 uses the entropy statistic 2 sum_{i=1}^{n} p_i log(p_i / q_i), where the distribution p_1, ..., p_n is the observed frequency distribution of examples among classes satisfying a given complex and q_1, ..., q_n is the expected frequency distribution of the same number of examples under the assumption that the complex selects examples randomly. This statistic provides an information-theoretic measure of the distance between the two distributions. Any complex whose entropy statistic falls below a pre-specified threshold is rejected. Goodness is a measure of the quality of the complex that is used for ordering complexes that are candidates for inclusion in the final cover. The commonly used measure of goodness in CN2 is the Laplacian error estimate (n - n_c + k - 1) / (n + k), where n is the total number of examples covered by the rule, n_c is the number of positive examples covered by the rule, and k is the number of classes in the data being modeled.
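A minimal sketch of the two CN2 heuristics just described, with covered-class counts and class priors supplied directly; the statistic is computed in count form with natural logarithms, which is one common reading of the formula above.

import math

def cn2_significance(covered_counts, class_priors):
    # Likelihood-ratio statistic 2 * sum_i f_i * ln(f_i / e_i), where f_i is the
    # number of covered examples of class i and e_i is the count expected if the
    # complex selected the same number of examples at random.
    n = sum(covered_counts.values())
    stat = 0.0
    for cls, f in covered_counts.items():
        if f == 0:
            continue
        e = n * class_priors[cls]
        stat += f * math.log(f / e)
    return 2.0 * stat

def laplace_error(n_covered, n_positive, n_classes):
    # Laplacian error estimate (n - n_c + k - 1) / (n + k).
    return (n_covered - n_positive + n_classes - 1) / (n_covered + n_classes)

# A complex covering 20 examples: 18 of class "star", 2 of class "square",
# in data where the two classes have priors 0.5 each.
counts = {"star": 18, "square": 2}
priors = {"star": 0.5, "square": 0.5}
print(cn2_significance(counts, priors))   # large value: far from random coverage
print(laplace_error(20, 18, 2))           # about 0.136 estimated error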
ITRULE [24] also employs a bottom-up search process to formulate rules directly from data. The algorithm generates a set of R rules, where R is a user-defined parameter. This set of rules is considered to be the R most informative rules as defined by the J-measure. The J-measure evaluates the average information content of a rule, and can be used for both generalization and specialization of individual rules until they reach an optimal information content level. The algorithm proceeds by first finding R rules, calculating their J-measures, and then iterating through a process whereby rules with J-measures higher than the rule with the least J-measure are introduced into the list at the expense of the latter. The J-measure and its use are analogous to the entropy statistic and its use in CN2, although theoretically shown to be more robust.
Swap-1 [25] uses local optimization techniques to dynamically revise and improve its covering set. Once a covering set is found that separates the classes, the induced set of rules is further refined. Using train and test evaluation methods, the initial covering rule set is scaled back to the most statistically accurate subset of rules. Rules for two classes can potentially be satisfied simultaneously. Such conflicts are resolved by inducing rules for each class according to a class priority ordering, with the last class considered a default class. Unlike the 1-level lookahead employed in constructing tests (such as Gini and entropy based methods), Swap-1 constantly looks back to see whether any improvement can be made before adding a new test. The following steps are taken to form the single best rule: (a) make the single best swap from among all possible rule component swaps, including deleting a component; (b) if no swap is found, add the single best component to the rule, where "best" is evaluated as predictive value, i.e. percentage correct decisions by the rule. For equal predictive values, maximum case coverage is a secondary criterion. Swapping and component addition terminate when 100% predictive value is reached. Finding the optimal combination of attributes and values for even a single fixed-size rule is a complex task. However, there are other optimization problems, such as the traveling salesman problem, where local swapping finds excellent approximate solutions.
Given a set of samples S and a covering rule set RS, RS can be progressively weakened so that it becomes increasingly less complex, though decreasing in accuracy. The objective is to select the rule set RSbest from {RS_1, ..., RS_i, ..., RS_n}, a collection of rule sets in decreasing order of complexity, such that RSbest will make the fewest errors on new cases T. In practice, the optimal solution can usually not be found because of incomplete samples and limitations on search time. It is not possible to search over all possible rule sets of complexity Cx(RS_i), where Cx is some appropriate complexity measure, such as the number of components in the rule set.

If the set {RS_1, ..., RS_i, ..., RS_n} is ordered by some complexity measure Cx(RS_i), then the best one is selected by min[Err(RS_i)]. Thus to solve this problem in practice, a method must induce and order {RS_i} by Cx(RS_i) and estimate each rule set's error rate, Err(RS_i). A rule set's error rate is defined as the fraction of misclassified cases to the total classified cases as a result of applying the rule set. Pruning methods adapted to rule induction can be used to prune a rule set and form {RS_i}. Let the rule set RS_1 be the covering rule set. Each subsequent RS_{i+1} can be found by pruning RS_i at its weakest link. A rule set can be pruned by deleting single rules or single components. The application of a form of pruning known as weakest-link pruning results in an ordered series of decreasing complexity rule sets, {RS_i}.
The RAMP rule generation system [12] generates "minimal" classification rules from tabular data sets where one of the columns is a "class" variable and the remaining columns are "explanatory" features. The data set is completely discretized (i.e., continuous valued features are discretized into a finite set of discrete values, while categorical features are left untouched) by an optimal numerical discretization step prior to rule generation.

While the RAMP approach to generating classification rules is similar to techniques that directly generate rules from data, its primary goal is to strive for a "minimal" rule set that is complete and consistent with the training data. Completeness implies that the rules cover all of the examples in the training data, while consistency implies that the rules cover no counter-examples for their respective intended classes. The RAMP system utilizes a logic minimization methodology, called R-Mini, to generate "minimal" complete and consistent rules. This technique was first developed for programmable logic array circuit minimization (MINI), and is considered to be one of the best known 2-level logic minimization techniques.
The merits of minimality have been well discussed [22]. The principal hypothesis here is that a simpler solution tends to have higher accuracy. Thus, if two different solutions (in the same representation) both describe a particular data set, the less complex of the two will be more accurate in its description. Complexity is measured differently for different modeling techniques. For decision rules, it would be the total number of rules and the total number of tests in all the rules. A smaller description will tend to be better in its predictive accuracy, and this has been borne out in our extensive evaluations.

A data set with N features may be thought of as a collection of discrete points (one per example) in an N-dimensional space. A classification rule is a hypercube (a "complex" in AQ terminology) in this space that contains one or more of these points. When there is more than one cube for a given class, all the cubes are Or-ed to provide a complete logical classification function for the class. Within a cube the conditions for each part are And-ed, thereby giving the DNF representation for the overall classification solution. The size of a cube indicates its generality, i.e., the larger the cube, the more vertices it contains, and the more example-points it can potentially cover. RAMP's minimality objective is driven first by the minimal number of cubes, and then by the most general cubes. The most general cubes are prime cubes that cannot be further generalized without violating the consistency of that cube.

The minimality objective translates to finding a minimal number of prime cubes that cover all the example-points of a class and cover no example-points of any counter-class. This objective is similar to many switching function minimization algorithms.
The core heuristics used in the RAMP rule generation system for achieving minimality consist of iterating (for a reasonable number of rounds) over two key sub-steps:

1. Generalization step, R-EXPAND, which takes each rule in the current set (initially each example is a rule) and opportunistically generalizes it to remove other rules that are subsumed.

2. Specialization/Reformulation step, R-REDUCE, which takes each rule in the current set and specializes it to the most specific rule necessary to continue covering only the unique examples it covers. Redundant cubes disappear during this step.

This annealing-like approach to rule generation (via iterative improvements) may potentially run indefinitely. A limit is used that controls how long the system should keep iterating without observing a reduction. If no reduction takes place within this limit, the minimization process may be stopped. In practice, it has been observed that RAMP rule generation satisfactorily converges the rule set once it has gone through at least 5-7 iterations without performing a reduction.
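The iteration just described can be sketched as a simple loop; r_expand and r_reduce below are hypothetical placeholders standing in for the actual R-EXPAND and R-REDUCE procedures, and the stopping rule is the stagnation limit mentioned above.

def minimize_rule_set(rules, r_expand, r_reduce, patience=7, max_rounds=100):
    # Alternate generalization (R-EXPAND) and specialization/reformulation
    # (R-REDUCE) passes, stopping after `patience` consecutive rounds with no
    # reduction in the number of rules, or after `max_rounds` rounds.
    best_size = len(rules)
    rounds_without_reduction = 0
    for _ in range(max_rounds):
        rules = r_expand(rules)   # generalize rules, dropping subsumed ones
        rules = r_reduce(rules)   # shrink each rule to what it uniquely covers
        if len(rules) < best_size:
            best_size = len(rules)
            rounds_without_reduction = 0
        else:
            rounds_without_reduction += 1
        if rounds_without_reduction >= patience:
            break
    return rules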
RAMP takes a slightly different approach to overcoming the over-fitting bias in the modeling process. Instead of pruning a solution, RAMP generates multiple solutions. The heuristics that control the rule generation process in RAMP use randomization, and therefore multiple generations from the same training set can result in solutions that model the decision surface using different combinations of rules. Although each individual solution models some regions correctly and some incorrectly (due to over-fitting), the union of multiple solutions enhances the correctness, and smooths out the over-fitting. Benchmark tests have indicated that combining up to five solutions produces solutions with very competitive predictive accuracies.
5 Tree and Rule-based Regression

Regression is the problem of approximating the values of a continuous variable. Given samples of an output (response) variable y and input (predictor) variables x = {x_1, ..., x_n}, the regression task is to find a mapping y = f(x).

The classical approach to the problem is linear least-squares regression [23]. Linear regression has proven quite effective for many real-world applications. However, the simple linear model has its limits, and more complex models often fit the data better. Nonlinear regression models have been explored and many new effective methods have emerged, including projection pursuit and MARS. Neural networks trained by back-propagation are another alternate nonlinear regression model. An overview of many different regression models, with application to classification models as well, is available [21]. Most of these methods produce solutions in terms of weighted models.

The CART program induces both classification and regression trees. These regression trees are strictly binary trees. In terms of performance, regression trees are often competitive with other regression methods [4]. Regression trees are noted to be particularly strong when there are many higher order dependencies among the input variables. The advantages of the regression tree solution are similar to the advantages enjoyed by classification trees over other models. On the negative side, decision trees cannot compactly represent many simple functions, for example linear functions. A second weakness is that the regression tree solution is discrete, yet predicts a continuous variable. For function approximation, the expectation is a smooth continuous function, but a decision tree provides discrete regions that are discontinuous at the boundaries. All in all though, regression trees often produce strong results, and for many applications their advantages strongly outweigh their potential disadvantages.
5.1 Regression by Tree Induction

Like classification trees, regression trees are induced by recursive partitioning. The solution takes the form of equation 1, where the R_i are disjoint regions, the k_i are constant values, and y_ji refers to the y-values of the training cases that fall within the region R_i.

    if x in R_i then f(x) = k_i = median{y_ji}    (1)

Regression trees have the same representation as classification trees except for the terminal nodes. The decision at a terminal node is to assign a case a constant y value. The single best constant value is the median of the training cases falling into that terminal node, because for a partition, the median is the minimizer of mean absolute distance.

For regression tree induction, the minimized function, i.e. absolute distance, is a satisfactory splitting criterion for growing the tree. At each node, the single best split that minimizes the mean absolute distance is selected. Splitting continues until fewer than a minimum number of cases are covered by a node, or until all cases within the node have the identical value of y.

The pruning strategies employed for classification trees are equally valid for regression trees. Like the covering procedures, the only substantial difference is that the error rate is measured in terms of mean absolute distance. For weakest-link pruning, a tree is recursively pruned so that the ratio delta/n is minimized, where n is the number of pruned nodes and delta is the increase in error. Weakest-link pruning has several desirable characteristics: (a) it prunes by training cases only, so that the remaining test cases are relatively independent; (b) it is compatible with resampling.

An interesting extension to regression trees is exemplified in [19], wherein the tree may be terminated at each of its leaf nodes by a linear regression model. Thus the linearity in a decision surface is modeled at the leaves, while the non-linearity in the decision surface is modeled by the actual tree.

5.2 Regression by Rule Induction

Both tree and rule induction models find solutions in disjunctive normal form, and the model of equation 1 is applicable to both. Each rule in a rule-set represents a single partition or region R_i. However, unlike the tree regions, the regions for rules need not be disjoint. With non-disjoint regions, several rules may be satisfied for a single sample. Some mechanism is needed to resolve the conflicts in k_i, the constant values assigned, when multiple rules (R_i regions) are invoked. One standard model is to order the rules, as in Figure 3. Such ordered rule-sets have also been referred to as decision lists. The first rule that is satisfied is selected, as in equation 2.

    x1 <= 3  ->  y = 10
    x2 >= 1  ->  y = 2
    Otherwise    y = 5

    Figure 3: Example of Regression Rules

    if i < j and x in both R_i and R_j then f(x) = k_i    (2)

Given this model of regression rule sets, the problem is to find procedures that effectively induce solutions. For rule-based regression, a covering strategy analogous to the classification tree strategy could be specified. A rule could be induced by adding a single component at a time, where each added component is the single best minimizer of distance. As usual, the constant value k_i is the median of the region formed by the current rule. As the rule is extended, fewer cases are covered. When fewer than a minimal number of cases are covered, rule extension terminates. The covered cases are removed and rule induction can continue on the remaining cases. This is also the regression analogue of rule induction procedures for classification.

    1. Generate a set of pseudo-classes.
    2. Generate a covering rule-set for the transformed classification
       problem using a rule induction method such as Swap-1.
    3. Initialize the current rule set to be the covering rule set and save it.
    4. If the current rule set can be pruned, iteratively do the following:
       a) Prune the current rule set.
       b) Optimize the pruned rule set and save it.
       c) Make this pruned rule set the new current rule set.
    5. Use test cases or cross-validation to pick the best of the saved rule sets.

    Figure 4: Swap-1R Method for Learning Regression Rules
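Before turning to Swap-1R, a minimal sketch of how such an ordered regression rule set (a decision list, per equations 1 and 2) would be applied: the first satisfied rule supplies the prediction, which is the median of the y-values it covered; the rules and values below are hypothetical, loosely following the style of Figure 3.

from statistics import median

# Each rule: (condition over the feature vector, y-values of the training
# cases it covers). The ordered list mirrors Figure 3's style of rule set.
regression_rules = [
    (lambda x: x[0] <= 3.0, [9.0, 10.0, 11.0]),   # -> k_1 = median = 10
    (lambda x: x[1] >= 1.0, [2.0, 2.0, 3.0]),     # -> k_2 = median = 2
]
default_ys = [4.0, 5.0, 6.0]                      # "Otherwise" region

def predict(x):
    for condition, covered_ys in regression_rules:
        if condition(x):
            return median(covered_ys)   # first satisfied rule wins (equation 2)
    return median(default_ys)

print(predict((2.0, 0.0)))   # 10: first rule fires
print(predict((5.0, 2.0)))   # 2: second rule fires
print(predict((5.0, 0.0)))   # 5: default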
The Swap-1R [26] system for inducing decision regression rules works by mapping the regression problem into a classification problem. Let {C_i} be a set consisting of an arbitrary number of classes, each class containing an approximately equal number of the values {y_i}. To solve a classification problem, the classes are expected to be different from each other, and it is assumed that rules can be found to distinguish these classes. Classes formed by an ordering and discretization of {y_i} form the classification problem.
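A minimal sketch of the pseudo-class construction just described: the y-values are ordered and cut into bins of approximately equal size; the number of classes used here is an arbitrary assumption, since the text does not fix it.

def pseudo_classes(y_values, n_classes=3):
    # Order the y-values and assign each case to one of n_classes bins of
    # approximately equal size; returns a class label per original case.
    order = sorted(range(len(y_values)), key=lambda i: y_values[i])
    labels = [0] * len(y_values)
    for rank, idx in enumerate(order):
        labels[idx] = (rank * n_classes) // len(y_values)
    return labels

y = [3.1, 0.4, 9.9, 5.0, 1.2, 7.7]
print(pseudo_classes(y))   # [1, 0, 2, 1, 0, 2]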
In practice, one learning model is not always superior to others, and a learning strategy that examines the results of different models may do better. Moreover, by combining different models, enhanced results may be achieved. A general approach to combining learning models is a scheme referred to as stacking [28]. The models could be completely different, such as combining decision trees with linear regression models. Different models are applied independently to find solutions, and in a subsequent layer yet another model is used for combining the solutions into a single solution. This layer may be a simple weighted vote, as per equation 3, or something more sophisticated. This method of model combination is in contrast to the usual approach to evaluation of different models, where the single best performing model is selected.
    y = sum_{k=1}^{K} w_k M_k(x)    (3)
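Equation 3 amounts to a weighted sum over the component models; a one-function sketch, with hypothetical component models and weights:

def stacked_prediction(models, weights, x):
    # y = sum_k w_k * M_k(x): a simple weighted vote over component models.
    return sum(w * m(x) for w, m in zip(weights, models))

# Hypothetical component models: a rule-set predictor and a linear model.
models = [lambda x: 10.0 if x[0] <= 3.0 else 2.0,
          lambda x: 1.5 * x[0] + 0.5]
print(stacked_prediction(models, [0.7, 0.3], (2.0,)))  # 0.7*10 + 0.3*3.5 = 8.05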
While stacking has been shown to give improved results, a major drawback is that properties of the combined models are not retained. Thus when interpretable models are combined, the result may not be interpretable at all. It is also not possible to compensate for weaknesses in one model by introducing another model in a controlled fashion.

A modified technique for combination of alternate solutions is to retain the interpretable nature of rules, while at the same time addressing the problem of symbolic regression solutions assigning a constant value as the predictor once a region is identified. The following strategy is used to determine the y-value of a case x that falls in region R_i: instead of assigning the single constant value k_i for region R_i (where k_i is determined by the median y-value of training cases in the region), assign y_knn^i(x), the mean of the k-nearest (training set) instances of x in region R_i.

An interesting aspect of this strategy is that k-nearest neighbor results need only be considered for the cases covered by a particular partition. This hybrid approach alleviates the weakness of partitions being assigned single constant values. Moreover, some of the global distance measure difficulties of the k-nn methods may also be relieved because the table lookup is reduced to partitioned and related groupings.
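A minimal sketch of the hybrid prediction just described, assuming plain Euclidean distance: the y-value for a case falling in region R_i is the mean of its k nearest training cases among those covered by that region.

import math

def knn_region_mean(x, region_cases, k=3):
    # Mean y of the k nearest (training set) instances of x within one region.
    # region_cases: list of (feature_vector, y) pairs covered by the region.
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(region_cases, key=lambda case: dist(case[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

region = [((1.0, 2.0), 10.0), ((1.5, 2.2), 12.0), ((0.8, 1.9), 9.0), ((3.0, 5.0), 30.0)]
print(knn_region_mean((1.1, 2.0), region, k=3))   # mean of 10, 12, 9 = 10.33...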
Another decision rule based regression approach is exemplified by the RAMP system. This system also supports post-processing for regression, by computing additional metrics for the rules as they are generated on a pre-transformed pseudo classification problem, based upon example data-points that are covered by each rule in the training data. Three parameters are attached to each rule: mu, the mean of all the original class values of training examples covered by that rule; sigma, the standard deviation of these values; and N, the total number of training examples covered by that rule.
For regression estimation, two straightforward averaging approaches are made available, the simple and the weighted approach. In the simple averaging approach, the simple average of the mu values of all rules that cover an example is computed as its predicted value. Therefore, for each example in the test data, if M is the total number of rules that cover the example, its predicted class value is (1/M) sum_{i=1}^{M} mu_i. In the weighted average approach, there are several options available for weighting the rules. One of these options is to compute and assign a prediction of the weighted average, e.g., sum_{i=1}^{M} sqrt(N_i) mu_i / sum_{i=1}^{M} sqrt(N_i). In general, weighted averaging usually leads to smoother correlations between predicted and actual values. It has been observed that no unique combination of weighting and error estimation seems to be uniformly applicable. The user has the ability to evaluate these combinations on test data, determine which one tends to be most accurate, and then utilize that metric for fine tuning the solution.
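A small sketch of the two estimation options just described, given the (mu, sigma, N) statistics of the rules covering a test example; the sqrt(N) weighting in the weighted variant is just one of the possible weighting options mentioned above, not a definitive statement of RAMP's scheme.

import math

def simple_average(covering_rules):
    # covering_rules: (mu, sigma, N) triples for every rule covering the example.
    return sum(mu for mu, _, _ in covering_rules) / len(covering_rules)

def weighted_average(covering_rules):
    # Weight each rule's mean by sqrt(N), so rules backed by more training
    # examples pull the estimate harder (assumed weighting scheme).
    num = sum(math.sqrt(n) * mu for mu, _, n in covering_rules)
    den = sum(math.sqrt(n) for _, _, n in covering_rules)
    return num / den

rules = [(10.0, 1.0, 100), (14.0, 2.5, 4)]   # (mu, sigma, N) per covering rule
print(simple_average(rules))     # 12.0
print(weighted_average(rules))   # pulled toward 10.0: about 10.67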
6 Discussion

The rise in attention and focus on decision support solutions using data mining techniques has refueled a strong interest in classification and regression modeling, particularly symbolic techniques [10]. This paper has attempted to provide the reader with the key issues of decision tree and decision rule modeling techniques, two important approaches to symbolic modeling.

While some aspects of this technology have reached maturity and become stable, there are also many aspects that remain open. Symbolic modeling approaches that remain consistently robust across a wide variety of data sets are not yet well understood. Additionally, though some of these techniques are conceptually robust and elegant, they prove to be computationally challenging when applied to large scale business and industrial data sets.

Many new approaches are under active research and development. It is well recognized that these techniques cannot be applied in a black box mode to data. Intensive application analysis and knowledge engineering continue to play an important role. Techniques such as preprocessing of raw data and feature extraction contribute greatly to improving the accuracy of the symbolic modeling process. Techniques to handle missing and erroneous values in data are also critical. Techniques and methods for feature selection [15] play a useful role in pre-pruning the search space. Characteristics of data, such as too many categories for a variable, extreme bias in class proportions, or hierarchies in attributes, can significantly affect the modeling process, and a catalog of methodologies to address these issues is also slowly emerging. Scalability is emerging as a key factor in coupling these techniques to the extremely large volumes of data that are becoming available today. Systems and techniques that focus on resolving this key problem are beginning to emerge [7, 16].
Techniques such as stacking [28], bagging [3], and boosting [11] also prove to be extremely useful in improving the modeling process, by using hybrid approaches that either combine solutions from multiple approaches, or combine multiple solutions using the same approach on different training samples. Utilizing the bagging approach, predictive performance can be improved, sometimes very substantially, by finding many solutions on different random samples taken from a large data warehouse. For classification, the many answers for a new case are voted; for regression, they are averaged. This process is illustrated in Figure 5.

Figure 5: Multiple Solutions for Maximizing Accuracy (samples drawn from a data warehouse each yield a rule/tree solution; an averaging or voting mechanism combines their answers for a new example into a single classification or regression result)
N cases may be randomly drawn from a large data base, or even simulated by resampling from a smaller dataset. If a learning method is fast, it is not difficult to generate new solutions for each new sample of N cases. The most obvious candidate for this approach to learning is the decision tree. In the boosting approach, error cases are sampled with greater frequency in subsequent modeling iterations.

For a single solution and sample, a decision tree usually yields good results for most learning problems. Solutions are found quickly, typically much faster than with most other learning methods. The predictive performance of decision trees is often weaker than that of some other learning methods, such as neural nets. However, studies show that when answers for solutions found on many random samples are voted, the resulting performance can approach optimal predictive performance [3]. The predictive performance is often significantly increased, though the clarity of presentation may sometimes be compromised.
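A compact sketch of the bagging-style process of Figure 5: bootstrap samples are drawn from the data, a solution is learned on each (the learner below is a stand-in one-split "stump", not any particular tree or rule package), and the answers for a new case are voted for classification or averaged for regression.

import random
from collections import Counter

def bagged_models(data, learn, n_models=5, sample_size=None):
    # Draw n_models bootstrap samples from `data` and fit one model on each.
    # `learn` maps a list of (x, y) cases to a prediction function.
    size = sample_size or len(data)
    return [learn([random.choice(data) for _ in range(size)])
            for _ in range(n_models)]

def vote(models, x):
    # Classification: majority vote over the models' answers.
    return Counter(m(x) for m in models).most_common(1)[0][0]

def average(models, x):
    # Regression: average the models' numeric answers.
    return sum(m(x) for m in models) / len(models)

# Hypothetical learner: a one-rule "stump" on the first feature, predicting the
# majority class on each side of a threshold taken from the sample.
def learn_stump(cases):
    threshold = sorted(x[0] for x, _ in cases)[len(cases) // 2]
    left = Counter(y for x, y in cases if x[0] <= threshold)
    right = Counter(y for x, y in cases if x[0] > threshold)
    left_label = left.most_common(1)[0][0] if left else right.most_common(1)[0][0]
    right_label = right.most_common(1)[0][0] if right else left_label
    return lambda x: left_label if x[0] <= threshold else right_label

data = [((0.5,), "SQUARE"), ((0.6,), "SQUARE"), ((1.1,), "STAR"), ((1.3,), "STAR")]
models = bagged_models(data, learn_stump, n_models=7)
print(vote(models, (0.55,)))   # usually SQUARE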
Finally, an open issue that continues to be explored is the characterization of datasets, using either simple measures, statistical measures, or information theoretic measures, that will allow an educated mapping of the most appropriate mining technique to a dataset for maximizing the accuracy of the resulting solution.
References

[1] C. Apte, F. Damerau, and S. Weiss. Automated Learning of Decision Rules for Text Categorization. ACM Transactions on Information Systems, 12(3):233-251, July 1994.

[2] C. Apte and S.J. Hong. Predicting Equity Returns from Securities Data with Minimal Rule Generation. In Advances in Knowledge Discovery, pages 541-560. AAAI Press / The MIT Press, 1995.

[3] L. Breiman. Bagging Predictors. Machine Learning, 24:123-140, 1996.

[4] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Monterrey, CA, 1984.

[5] W. Buntine. Learning Classification Trees. Statistics and Computing, 2:63-73, 1992.

[6] P. Clark and T. Niblett. The CN2 Induction Algorithm. Machine Learning, 3:261-283, 1989.

[7] W. Cohen. Fast Effective Rule Induction. In The XII International Conference on Machine Learning, pages 115-123, 1995.

[8] M. Craven and J. Shavlik. Using Neural Networks for Data Mining. 1997. In this issue.

[9] U. Fayyad, S.G. Djorgovski, and N. Weir. Automating the Analysis and Cataloging of Sky Surveys. In Advances in Knowledge Discovery, pages 471-493. AAAI Press / The MIT Press, 1995.

[10] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI Press / The MIT Press, 1995.

[11] Y. Freund and R. Schapire. Experiments with a New Boosting Algorithm. In Proceedings of the International Machine Learning Conference, pages 148-156. Morgan Kaufmann, 1996.

[12] S.J. Hong. R-MINI: An Iterative Approach for Generating Minimal Rules from Examples. IEEE Transactions on Knowledge and Data Engineering, 1997. To appear.

[13] J. Hosking, E. Pednault, and M. Sudan. A Statistical Perspective on Data Mining. 1997. In this issue.

[14] M. James. Classification Algorithms. John Wiley & Sons, 1985.

[15] I. Kononenko and S.J. Hong. Attribute Selection for Modelling. 1997. In this issue.

[16] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A Fast Scalable Classifier for Data Mining. In Proceedings of the Fifth International Conference on Extending Database Technology, 1996.

[17] R. Michalski, I. Mozetic, J. Hong, and N. Lavrac. The Multi-Purpose Incremental Learning System AQ15 and its Testing Application to Three Medical Domains. In Proceedings of AAAI-86, pages 1041-1045, 1986.

[18] D. Michie, D. Spiegelhalter, and C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.

[19] J. Quinlan. Combining Instance-Based and Model-Based Learning. In International Conference on Machine Learning, pages 236-243. Morgan Kaufmann, 1993.

[20] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[21] B. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.

[22] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific Series in Computer Science, 15, 1989.

[23] H. Scheffé. The Analysis of Variance. John Wiley & Sons, 1959.

[24] P. Smyth and R. Goodman. An Information Theoretic Approach to Rule Induction from Databases. IEEE Transactions on Knowledge and Data Engineering, 4(4):301-316, August 1992.

[25] S. Weiss and N. Indurkhya. Optimized Rule Induction. IEEE EXPERT, 8(6):61-69, December 1993.

[26] S. Weiss and N. Indurkhya. Rule-Based Machine Learning Methods for Functional Prediction. Journal of Artificial Intelligence Research, 3:383-403, 1995.

[27] S. Weiss and C.A. Kulikowski. Computer Systems That Learn. Morgan Kaufmann, 1991.

[28] D. Wolpert. Stacked Generalization. Neural Networks, 5:241-259, 1992.