Data Mining with Decision Trees and Decision Rules
C. Apte and S.M. Weiss
Future Generation Computer Systems, November 1997

Chidanand Apte
T.J. Watson Research Center, IBM Research Division, Yorktown Heights, NY 10598

Sholom Weiss
Department of Computer Science, Rutgers University, New Brunswick, NJ 08903
Abstract

This paper describes the use of decision tree and rule induction in data mining applications. Of the methods for classification and regression that have been developed in the fields of pattern recognition, statistics, and machine learning, these are of particular interest for data mining since they utilize symbolic and interpretable representations. Symbolic solutions can provide a high degree of insight into the decision boundaries that exist in the data, and the logic underlying them. This aspect makes these predictive mining techniques particularly attractive in commercial and industrial data mining applications. We present here a synopsis of some major state-of-the-art tree and rule mining methodologies, as well as some recent advances.

Keywords: Decision Tree, Rule Induction, Data Mining

1 Introduction

The use of computer technology in decision support is now widespread and pervasive across a wide range of business and industry. This has resulted in the capture and availability of data in immense volume and proportion. There are many examples that can be cited. Point of sale data in retail, policy and claim data in insurance, medical history data in health care, and financial data in banking and securities are some instances of the types of data that are being collected. The data are typically a collection of records, where each individual record may correspond to a transaction or a customer, and the fields in the record correspond to attributes. Very often, these fields are of mixed type, with some being numerical (continuous valued, e.g. age) and some symbolic (discrete valued, e.g. color).

Given this layout of data, one may model a field in terms of all the other fields in the data. With an acceptably accurate learning method, it is possible to develop predictive applications, i.e., using the solution to predict the expected value of the field of interest, given all the others. For example, a credit company may use its customer data base to model delinquency. Given a new application profile, this solution may be used to predict the likelihood that the applicant may default.

Classification and regression are critical types of prediction problems. When the goal of prediction is discrete valued, a classification solution is developed. When the goal is numerical and continuous, a regression solution is developed.

This type of data analysis has been an active area of research in many scientific areas for quite some time. These domains, ranging from medicine and geology to astronomy and physics, have relied upon the investigation of data gathered from experiments and observations to formulate predictive models. These solutions are essentially approximation functions for describing the behavior of a response or objective variable of interest in terms of independent or input variables or features. Modeling methods have been developed drawing upon techniques from statistics, pattern recognition, and machine learning [18, 21, 27].

A classical example of a predictive modeling application in medicine is the gathering of extensive observations of patients with and without diagnosis of a particular disease. The modeling task then attempts to formulate a model that describes the likelihood of a patient having the disease as a function of all the independent observations about the patient. Once an accurate predictive model is formulated, it serves as an additional piece of diagnostic machinery that may be used for predicting the likelihood of a disease given a patient's data.

Until recently, these techniques had been restricted in their applications, given the volume of data and computational resource that was required for the modeling. However, with the increasing availability of high volume data in business and industry, and the significant drop in computational cost, many of these techniques are beginning to be applied in commercial applications, with many demonstrated successes.
There are many issues related to the problem of formulating a predictively accurate model. These have mainly to do with the nature of the data and the representation language for the solution. Data characteristics generally dictate the complexity of the mining task. The data may be noisy, incomplete, or incorrect. The representation language for the solution will usually limit the scope of the functions that can be formulated.

Classification and regression learning has been evolving over a considerable period of time, with contributions coming from statistics, pattern recognition, and more recently, the field of machine learning. One of the earliest methods developed for classification modeling was the technique of linear discriminants [14]. An early technique that came into existence for regression was linear regression [23]. Each has its own limitations. Since then, a slew of techniques and methods have been developed, including k-nearest-neighbor, decision tree, rule induction, neural networks, etc. [27]. For the remaining part of this paper, the problem of classification modeling will be examined from a decision tree and rule induction perspective. Tree and rule based regression modeling will then be briefly introduced, and finally the conclusion will discuss general issues and directions for symbolic solution mining.
2 Symbolic methods for classification modeling

The problem of classification modeling will be examined here through a simple, hypothetical example. Assume that length and diameter data is available for a variety of pegs that are being manufactured on an assembly line. Pegs are either square, star, or diamond shaped. A classification solution that characterizes the peg variety as a function of the length and diameter can be useful in understanding how these varieties differ, for designing visual inspection systems and automated sorting machinery. Figure 1 illustrates this data. The figure also shows the two axis parallel lines, one at Length = 0.75, and the second at Diameter = 3.00, that seem to completely partition the three peg varieties into three different sub-areas. Decision tree solution methods provide automated techniques for discovering these types of axis parallel partitions, in their maximally general forms.

Figure 1: Peg Data

Figure 2 illustrates a decision tree that corresponds to the partitions shown in Figure 1. At the top level of the tree, there is the root node, at which the classification process begins. The test at the root node tests all example instances for Length <= 0.75. Examples that satisfy this test are passed down the left (TRUE) arc to a leaf node, indicating that all examples belong to a single class (SQUARE) and no more tests are needed. The right (FALSE) arc from the root node receives all examples that fail the test at the root node. These examples are not yet purely from one class, so further testing is required at this intermediate node. The test at this node is for Diameter <= 3.00. Examples that satisfy this test are all in one class (STAR) and those that don't are also all in one class (DIAMOND), and so both arcs from this node lead to leaf nodes, and the decision tree solution is complete. Note that this example illustrates a binary tree, where each intermediate node can split into at most two sub-trees. Decision trees may be non-binary also, where each node may split into more than two sub-trees, by performing tests that result in more than two outcomes (e.g. subset membership or interval membership tests).
Length <= 0.75?
  True  -> SQUARE
  False -> Diameter <= 3.00?
             True  -> STAR
             False -> DIAMOND

Figure 2: Classifying Pegs with a Decision Tree

Closely related to decision tree solutions are rule based solutions. A rule may be constructed by forming a conjunct of every test that occurs on a path between the root node and a leaf node of a tree. The collection of all such rules obtained by traversing every unique path from root node to leaf node is a corresponding rule based solution for classification. For example, for the peg data that was used to illustrate the decision tree, an equivalent rule solution is:

If   (Length <= 0.75)
Then Square

If   (Not (Length <= 0.75)) &
     (Diameter <= 3.0)
Then Star

If   (Not (Length <= 0.75)) &
     (Not (Diameter <= 3.0))
Then Diamond

Some rule induction programs are add-ons to decision tree solutions, whereby a tree is first generated, and then translated into a set of rules. However, techniques that directly generate rules from data are also available, which overcome some of the drawbacks of decision tree modeling. Often, in more complex data sets, disjuncts of rules form the description for a class, and hence the rule based solutions are more generally identified by the term DNF (Disjunctive Normal Form), or decision rules.

Rules that are created by translating a decision tree into DNF expressions are typically mutually exclusive in nature, since a decision tree essentially partitions a data space into distinct disjoint regions via axis parallel surfaces created by its top-down sequence of decisions. For certain data spaces, this nature of partitioning may not always be capable of producing compact solutions. For example, decision trees cannot easily model simple exclusive-or functions. On the other hand, if algorithms are employed that directly generate DNF expressions from data, it is possible to create rules that capture such decision surfaces, via non-mutually exclusive decision rules. These rules essentially correspond to decision regions that overlap each other in the data space.

Once a decision tree or decision rule solution is generated from data, it can be used for estimating or predicting the response or class variable for a new case. The application of a decision tree to a data example is a straightforward top-down decision process, controlled by evaluating the tests and taking the appropriate branch, beginning at the root, and terminating when a leaf node is reached. The process of applying decision rules to data examples is determined by the style in which the rules were generated. A rule generation algorithm may induce ordered rule sets, i.e., induced by ordering all the classes which are present, and then using a fixed sequence, such as the smallest to the largest class, with rules for every class being discovered under the assumption that the only classes that need to be discriminated between are the ones that are remaining in the sequence. With this rule set, the rules have to be applied to a new data example in exactly the same sequence as they were generated. A rule generation algorithm may also induce un-ordered rule sets, in which rules for every class are generated under the assumption that all other classes that are present in the data need to be discriminated against during the induction process. With this type of a rule set, application to a new data example can be order independent and more flexible in creating different rule application strategies.

Decision tree and decision rule solutions offer a level of interpretability that is unique to symbolic models. The solutions may be directly inspected to understand the decision surfaces that exist in the data. This particular aspect, which makes these solutions easy to digest even for a non-technical end-user, makes these techniques very appealing in decision support related data mining activities, where insight and explanations are of critical importance. What makes this approach technically viable is the fact that most modern symbolic modeling methodologies succeed in formulating solutions that are also competitive in predictive accuracy when compared to more non-intuitive or quantitative techniques, such as neural networks. This is an important reason for the increased attention to and use of decision rule modeling techniques that generate rules directly from data [9, 1, 2, 8].

Classification modeling algorithms are designed with several objectives. Perhaps the most well known criteria by which these algorithms are evaluated are accuracy, speed, and interpretability. Solutions derived using different approaches can thus be compared in terms of their predictive accuracy on unseen data, on
the computational cost involved in generating the solution, and the level of understanding and insight that is provided by the solution. Both decision tree and decision rule modeling systems score high on interpretability. Accuracy and speed vary from algorithm to algorithm, and in most instances these two issues are coupled, i.e., improving predictive accuracy tends to require increased computational effort.

Learning algorithms take a variety of factors into account while computing the classification solution. There is inherent noise in real-world data that needs to be handled. The prior distributions of classes in the training set may affect the solution generation. There may be explicit penalties associated with misclassification that need factoring in. These and related issues will be described by going through brief descriptions of some actual decision tree and decision rule modeling algorithms in the following sections.
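The hypothetical peg example above can be written out directly; the following Python sketch (ours, not from any of the systems discussed) shows the Figure 2 tree as nested tests and the equivalent three DNF rules, which are mutually exclusive and therefore order independent.

# Hypothetical peg example from Figures 1 and 2: classify a peg from its
# length and diameter using the induced decision tree, and using the
# equivalent DNF rule set obtained by conjoining the tests on each
# root-to-leaf path.

def classify_with_tree(length, diameter):
    # Root node test.
    if length <= 0.75:
        return "SQUARE"          # left (TRUE) arc reaches a pure leaf
    # Intermediate node test for examples failing the root test.
    if diameter <= 3.00:
        return "STAR"
    return "DIAMOND"

# The same solution as unordered, mutually exclusive rules: each rule is a
# conjunction of the tests on one root-to-leaf path, so exactly one fires.
RULES = [
    (lambda l, d: l <= 0.75, "SQUARE"),
    (lambda l, d: not (l <= 0.75) and d <= 3.00, "STAR"),
    (lambda l, d: not (l <= 0.75) and not (d <= 3.00), "DIAMOND"),
]

def classify_with_rules(length, diameter):
    for condition, label in RULES:
        if condition(length, diameter):
            return label
    return None  # cannot happen here: the three rules cover the whole space

if __name__ == "__main__":
    print(classify_with_tree(0.5, 2.0))   # SQUARE
    print(classify_with_rules(1.2, 2.5))  # STAR
    print(classify_with_rules(1.2, 4.0))  # DIAMOND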
2.1 Error estimation and evaluation criteria

Estimating the true accuracy of a decision tree or rule model is one of the most important aspects of the modeling process. A solution generated from a set of training examples will almost always be highly accurate on the same data, but far less accurate on new data. A sample of cases contains noise and varies from other samples, leading a learning method astray in its predictions. To handle this shortcoming, most modeling techniques employ a two-fold strategy in the model generation process, where the first step involves the generation of the model from the training data, and the second step involves testing the proposed solution on independent cases, sometimes pruning it to compensate for the over-fitting of the first step.

The essential problem to be solved in the pruning step may be specified as follows: given a set of sample examples, S, where each example is composed of observed features and the class labels, the problem is to find the best model RSbest such that the error rate on new examples, Errtrue(RSbest), is minimum. Given an over-fitted model RS on a set of examples S, a derivative of RS needs to be determined that satisfies the above criteria.

Several pruning techniques have been devised that fit the above paradigm, and will be explained in more detail in the following sections. Broadly speaking, these techniques usually either employ the cross-validation approach or the train-and-test approach. Cross-validation is usually preferred when the modeling is being done with small samples, so one repeatedly breaks up the data into different combinations of train and test partitions. The training partition is used to generate an over-fitted model, while the test partition is used to generalize this model to the best possible derivative. An averaging process across the many different train-test combinations (hence the term cross-validation) is used to select the final RSbest from the many candidates that are available. When the data set that is available is large, a single train-test partition is sufficient for evaluation and to select RSbest using the pruning approach.

3 Decision Tree Modeling

Decision trees are generated from training data in a top-down, general-to-specific direction. The initial state of a decision tree is the root node that is assigned all the examples from the training set. If it is the case that all examples belong to the same class, then no further decisions need to be made to partition the examples, and the solution is complete. If examples at this node belong to two or more classes, then a test is made at the node that will result in a split. The process is recursively repeated for each of the new intermediate nodes until a completely discriminating tree is obtained. A decision tree at this stage is potentially an over-fitted solution, i.e., it may have components that are too specific to noise and outliers that may be present in the training data. To relax this over-fitting, most decision tree methods go through a second phase called pruning that tries to generalize the tree by eliminating sub-trees that seem too specific. Error estimation techniques play a major role in tree pruning. Most modern decision tree modeling algorithms are a combination of a specific type of splitting criterion for growing a full tree, and a specific type of pruning criterion for pruning the tree.

3.1 Growing a Full Tree

CART [4] is a binary decision tree modeling algorithm that has been in extensive use. The evaluation function used for splitting in CART is the GINI index. For a given current node t, this index is defined as Gini(t) = 1 - sum_i p_i^2, where p_i is the probability of class i in t. For each candidate split, the impurity (as defined by the GINI index) of all the sub-partitions is summed and the split that causes the maximum reduction in impurity is chosen. For the candidate splits, CART considers all possible splits in the sequence of values for continuous valued attributes ((n - 1) splits for n values) and all possible subset splits for categorical attributes ((2^(n-1) - 1) splits for n distinct values)
if n is small, and equivalence splits for categorical attributes (n splits for n distinct values) if n is large. At
each node, CART determines the best split for each
attribute and then selects the winner from this short
list, utilizing the GINI index.
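As a concrete illustration of the Gini computation just described (a minimal sketch, not CART's implementation), the following Python fragment scores the candidate thresholds on a continuous attribute and keeps the one with the largest weighted impurity reduction; the tiny data set is made up for illustration.

from collections import Counter

def gini(labels):
    # Gini index 1 - sum_i p_i^2 over the class proportions at a node.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_numeric_split(values, labels):
    # Try the candidate thresholds between consecutive distinct attribute
    # values and return the one giving the largest weighted Gini reduction.
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini(labels)
    best = (None, 0.0)  # (threshold, impurity reduction)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        reduction = parent - weighted
        if reduction > best[1]:
            best = (threshold, reduction)
    return best

# Hypothetical peg-like data: split on Length.
lengths = [0.5, 0.6, 0.7, 1.0, 1.1, 1.2]
shapes = ["SQUARE", "SQUARE", "SQUARE", "STAR", "STAR", "DIAMOND"]
print(best_numeric_split(lengths, shapes))  # threshold 0.85, strong reduction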
C4.5 [20] is another popular decision tree modeling system, a variant and extension of an earlier well known decision tree modeling system, ID3. ID3 utilizes entropy criteria for splitting nodes. Given a node t, the splitting criterion used is Entropy(t) = - sum_i p_i log p_i, where p_i is the probability of class i within node t. An attribute and split are selected that minimize entropy. Splitting a node produces two or more direct descendants. Each child has a measure of entropy. The sum of each child's entropy is weighted by its percentage of the parent's cases in computing the final weighted entropy used to decide the best split.
In C4.5, given a node t, the splitting criterion used is the GainRatio(t) = gain(t) / SplitInformation(t). This ratio expresses the proportion of information generated by a split that is helpful for developing the classification, and may be thought of as a normalized information gain or entropy measure for the test. A test is selected that maximizes this ratio, as long as the numerator (the information gain) is larger than the average gain across all tests. The numerator in this ratio is the standard information entropy difference achieved at node t, expressed as gain(t) = info(T) - info_t(T), where info(T) = - sum_{i=1}^{k} (|C_i| / |T|) log (|C_i| / |T|) over the k classes C_i in the example set T, and info_t(T) = sum_{i=1}^{s} (|T_i| / |T|) info(T_i) over the s subsets T_i produced by the split at t.
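A small sketch of the quantities just defined, with classes represented simply as label lists; this is an illustrative computation of entropy, information gain, split information, and their ratio, not C4.5 itself.

import math
from collections import Counter

def entropy(labels):
    # Entropy(t) = -sum_i p_i log2 p_i over class proportions at a node.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent_labels, child_label_lists):
    # Information gain of a split, and its normalization by split information.
    n = len(parent_labels)
    # gain(t) = info(T) - info_t(T): parent entropy minus weighted child entropies.
    info_t = sum(len(c) / n * entropy(c) for c in child_label_lists if c)
    gain = entropy(parent_labels) - info_t
    # SplitInformation: entropy of the partition sizes themselves.
    split_info = -sum((len(c) / n) * math.log2(len(c) / n)
                      for c in child_label_lists if c)
    return gain, (gain / split_info if split_info > 0 else 0.0)

parent = ["SQUARE"] * 3 + ["STAR"] * 2 + ["DIAMOND"]
children = [["SQUARE"] * 3, ["STAR"] * 2 + ["DIAMOND"]]  # split on Length <= 0.75
print(gain_ratio(parent, children))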
Bayes' classification rule assigns an example to the class with the highest conditional probability. This rule is known theoretically to be the optimum, i.e., it minimizes the total classification error. Formally, if there are C classes, then Bayes' rule is to assign an example to class i where P(C_i | x) > P(C_j | x) for all j != i. The success of this rule is in the underlying fact that all information that can be had about class membership is contained in the set of conditional probabilities. In practice, it is not possible to compute conditional probabilities for high dimensional data sets, for an enormous number of examples will be required to correctly assess conditional probabilities of the type P(C_i | x). Limited practical derivations of Bayes' rule exist, which include linear discrimination, and more recently, Bayes' tree [5].
Bayes' tree requires knowledge of prior class probabilities (empirically derivable from class proportions in the training data). Associated with a tree is a posterior probability of correct classification. The decision to grow a tree from a node is based upon increasing the posterior probability of the resulting tree. Of all the candidate splits that can be made, the one chosen is the one that causes the maximum increase to the posterior probability.

SLIQ [16], a recent decision tree building system, pays attention to scalability issues, utilizing data structures and processing methods that allow it to be applied to very large data sets that are not required to be memory resident. In the tree building phase, SLIQ utilizes the Gini index, as in the CART system.

For more discussions on splitting criteria, see [15] in this issue.
3.2 Pruning a Full Tree

CART's pruning mechanism is an effort to get to the right sized tree that minimizes the true misclassification error estimate. A fully grown decision tree will have an apparent error rate of zero or close to zero on the training data from which the tree was built. However, its true error rate, measured by evaluating the misclassifications when the tree is applied to a test data set, may be much higher. The goal of the pruning process is to find that sub-tree that produces the least true error, taking into account the size (complexity) of the tree. Utilizing a formulation for the cost complexity of a tree, which is a function of the misclassification of the tree on the training data and the size (e.g. total number of leaves), one can derive a sequence of trees of decreasing cost complexity, starting from the fully grown tree. This sequence is recursively created by picking the last tree in the sequence (initially, the full tree), examining each of its non-leaf sub-trees, picking the one with the least cost-complexity metric, and making that the next sub-tree in the sequence. The process stops when the final sub-tree is just the root node.

Once this sequence of decreasing cost-complexity sub-trees is produced, their individual true error rates can be determined by applying each sub-tree to a holdout data set. Typically, it is observed that initially, as the cost-complexity decreases, so does the true error rate, until one reaches a minimum. Beyond that, as the cost-complexity decreases, the true error starts increasing again. Obviously, one chooses the sub-tree corresponding to the minimum true error rate as the final pruned version. This is similar to the process described in [13].
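The selection step described above can be sketched as follows, assuming the sequence of pruned sub-trees is already available as prediction functions; the cost-complexity machinery that generates the sequence is not shown, and the candidate trees and holdout cases below are hypothetical.

def error_rate(predict, holdout):
    # Fraction of holdout cases (x, y) that the candidate tree misclassifies.
    wrong = sum(1 for x, y in holdout if predict(x) != y)
    return wrong / len(holdout)

def select_pruned_tree(tree_sequence, holdout):
    # tree_sequence: candidate sub-trees (as predict functions) in order of
    # decreasing cost-complexity, from the full tree down to the root alone.
    # Returns the candidate with the minimum estimated true error.
    scored = [(error_rate(t, holdout), i, t) for i, t in enumerate(tree_sequence)]
    best_err, _, best_tree = min(scored)
    return best_tree, best_err

# Hypothetical usage: three pruned versions of the peg tree, evaluated on a
# small holdout sample of (features, label) pairs.
full   = lambda x: "SQUARE" if x[0] <= 0.75 else ("STAR" if x[1] <= 3.0 else "DIAMOND")
middle = lambda x: "SQUARE" if x[0] <= 0.75 else "STAR"
root   = lambda x: "SQUARE"
holdout = [((0.5, 2.0), "SQUARE"), ((1.2, 2.5), "STAR"), ((1.1, 4.0), "DIAMOND")]
tree, err = select_pruned_tree([full, middle, root], holdout)
print(err)  # 0.0 for the full tree on this tiny sample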
In contrast to the cost-complexity pruning of CART, in which the true error rate of a tree and its subtrees is predicted from a separate set of examples that are distinct from the training examples, C4.5 uses a significance test that compares a parent node to its children. Starting with a fully grown tree, when the results of the children are not found significantly different than those of the parent, the children are pruned. A significance test can make this comparison without requiring independent holdout cases. For example, a significance test at two standard errors is usually effective. However, holdout cases can improve pruning performance when pruning is performed at varying significance levels and the most predictive solution is selected.
Pruning Bayes' tree also relies upon enhancing the posterior probability of the resulting tree. A test set is used for determining the probabilities. Of all trees resulting from pruning a node from a tree, the one that results in maximum posterior probability is chosen.

SLIQ employs an alternate scheme for decision tree pruning, based on the MDL (minimum description length) principle. As per the MDL principle, the total cost of description is the sum of the cost of describing a model, and the cost of describing data that are exceptions to this model. Given alternate models, the MDL principle further states that the best model is the one with the least description cost. In the case of decision trees, the alternate models may be viewed as the set of sub-trees made available as a result of pruning, and the data is the set of examples from which the full tree is initially built.

SLIQ utilizes the classification error as the cost for encoding data, given a tree. The cost for describing a tree is formulated as a recursive combination of the cost of encoding a node and the cost of encoding the split at that node. The total cost at each node in a fully grown tree is then used to decide whether to prune the node back to a leaf node, to prune its left or right sub-tree, or to leave it unchanged. SLIQ employs a two-phase pruning strategy; the first phase does a balanced pruning in which internal nodes either get fully converted to leaf nodes or left unchanged. In the second phase the sub-tree obtained from the first phase is re-examined to prune back nodes by eliminating partial (either left or right) sub-trees.
4 Decision Rule Induction

Decision rules, in disjunctive normal form (DNF), may be induced from training data in a bottom-up specific-to-general style, or in a top-down general-to-specific style, as in decision tree building. This section will highlight methodologies dealing with bottom-up specific-to-general approaches to rule induction. The initial state of a decision rule solution is indeed the collection of all individual instances or examples in a training data set, each of which may be thought of as a highly specialized decision rule. Most decision rule modeling systems employ a search process to evolve this set of highly specific and individual instances to more general rules. This search process is iterative, and usually terminates when rules can no longer be generalized, or some other alternate stopping criterion is satisfied. As in the case of decision tree building, noise in the data may lead to over-fitted decision rules, and various pruning mechanisms have been developed to deal with over-fitted decision rule solutions.

Rule induction methods attempt to find a compact "covering" rule set that completely partitions the examples into their correct classes. The covering set is found by heuristically searching for a single "best" rule that covers cases for only one class. Having found a "best" conjunctive rule for a class C, the rule is added to the rule set, and the cases satisfying it are removed from further consideration. The process is repeated until no cases remain to be covered.

The AQ [17] family of algorithms is influenced and motivated by methods used in electrical engineering for simplifying logic circuits. Using the AQ terminology, a test on an attribute is called a selector, a conjunct of tests is called a complex, and a disjunction of complexes is called a cover. If a rule satisfies an example, it is called a cover for the example. Initially, every example is itself a complex in the model. Complexes are then examined, and selectors are dropped as long as the resulting complex remains consistent (matching only examples of the same class and none of any other class). Complexes are thus produced, one at a time. Combining generalized complexes produces covers that are also complete (all examples of a class are covered).

In the search process for creating complexes, an evaluation function is used for ordering and determining which selectors to drop (or generalize). Although this evaluation function can be set by an external entity (such as an end-user or calling function), the one that is normally used is the ratio of examples correctly classified by a complex to the total examples classified by that complex.

CN2 [6] may be regarded as a system that extends AQ in terms of its ability to deal with noise in the data. Specifically, CN2 retains a set of complexes during its search that are deemed to be statistically covering a large number of instances of a class, even if they also cover instances of other classes. Additionally, CN2 executes a general to specific search, as opposed to AQ's strict specific to general approach. Each specialization step either adds new selectors to a complex, or removes an entire complex.

CN2 employs two types of heuristics in the search for the best complexes: significance and goodness.
Significance is a threshold such that any complexes below the threshold will not be considered for selecting the best complex. To test significance, CN2 uses the entropy statistic 2 sum_{i=1}^{n} p_i log(p_i / q_i), where the distribution p_1, ..., p_n is the observed frequency distribution of examples among classes satisfying a given complex and q_1, ..., q_n is the expected frequency distribution of the same number of examples under the assumption that the complex selects examples randomly. This statistic provides an information-theoretic measure of the distance between the two distributions. Any complex whose entropy statistic falls below a pre-specified threshold is rejected. Goodness is a measure of the quality of the complex that is used for ordering complexes that are candidates for inclusion in the final cover. The commonly used measure of goodness in CN2 is the Laplacian error estimate (n - n_c + k - 1) / (n + k), where n is the total number of examples covered by the rule, n_c is the number of positive examples covered by the rule, and k is the number of classes in the data being modeled.
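A minimal sketch of the two CN2 heuristics just described, with covered-class counts and class priors supplied directly; the statistic is computed in count form with natural logarithms, which is one common reading of the formula above.

import math

def cn2_significance(covered_counts, class_priors):
    # Likelihood-ratio statistic 2 * sum_i f_i * ln(f_i / e_i), where f_i is the
    # number of covered examples of class i and e_i is the count expected if the
    # complex selected the same number of examples at random.
    n = sum(covered_counts.values())
    stat = 0.0
    for cls, f in covered_counts.items():
        if f == 0:
            continue
        e = n * class_priors[cls]
        stat += f * math.log(f / e)
    return 2.0 * stat

def laplace_error(n_covered, n_positive, n_classes):
    # Laplacian error estimate (n - n_c + k - 1) / (n + k).
    return (n_covered - n_positive + n_classes - 1) / (n_covered + n_classes)

# A complex covering 20 examples: 18 of class "star", 2 of class "square",
# in data where the two classes have priors 0.5 each.
counts = {"star": 18, "square": 2}
priors = {"star": 0.5, "square": 0.5}
print(cn2_significance(counts, priors))   # large value: far from random coverage
print(laplace_error(20, 18, 2))           # about 0.136 estimated error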
ITRULE [24] also employs a bottom-up search process to formulate rules directly from data. The algorithm generates a set of R rules, where R is a user-defined parameter. This set of rules is considered to be the R most informative rules as defined by the J-measure. The J-measure evaluates the average information content of a rule, and can be used for both generalization and specialization of individual rules until they reach an optimal information content level. The algorithm proceeds by first finding R rules, calculating their J-measures, and then iterating through a process whereby rules with J-measures higher than the rule with the least J-measure are introduced into the list at the expense of the latter. The J-measure and its use are analogous to the entropy statistic and its use in CN2, although theoretically shown to be more robust.
Swap-1 [25] uses local optimization techniques to dynamically revise and improve its covering set. Once a covering set is found that separates the classes, the induced set of rules is further refined. Using train and test evaluation methods, the initial covering rule set is scaled back to the most statistically accurate subset of rules. Rules for two classes can potentially be satisfied simultaneously. Such conflicts are resolved by inducing rules for each class according to a class priority ordering, with the last class considered a default class. Unlike the 1-level lookahead employed in constructing tests (such as Gini and entropy based methods), Swap-1 constantly looks back to see whether any improvement can be made before adding a new test. The following steps are taken to form the single best rule: (a) make the single best swap from among all possible rule component swaps, including deleting a component; (b) if no swap is found, add the single best component to the rule, where "best" is evaluated as predictive value, i.e. percentage correct decisions by the rule. For equal predictive values, maximum case coverage is a secondary criterion. Swapping and component addition terminate when 100% predictive value is reached. Finding the optimal combination of attributes and values for even a single fixed-size rule is a complex task. However, there are other optimization problems, such as the traveling salesman problem, where local swapping finds excellent approximate solutions.
Given a set of samples S and a covering rule set RS, RS can be progressively weakened so that it becomes increasingly less complex, though decreasing in accuracy. The objective is to select the rule set RSbest from {RS_1, ..., RS_i, ..., RS_n}, a collection of rule sets in decreasing order of complexity, such that RSbest will make the fewest errors on new cases T. In practice, the optimal solution can usually not be found because of incomplete samples and limitations on search time. It is not possible to search over all possible rule sets of complexity Cx(RS_i), where Cx is some appropriate complexity measure, such as the number of components in the rule set.

If the set {RS_1, ..., RS_i, ..., RS_n} is ordered by some complexity measure Cx(RS_i), then the best one is selected by min[Err(RS_i)]. Thus to solve this problem in practice, a method must induce and order {RS_i} by Cx(RS_i) and estimate each rule set's error rate, Err(RS_i). A rule set's error rate is defined as the fraction of misclassified cases to the total classified cases as a result of applying the rule set. Pruning methods adapted to rule induction can be used to prune a rule set and form {RS_i}. Let the rule set RS_1 be the covering rule set. Each subsequent RS_{i+1} can be found by pruning RS_i at its weakest link. A rule set can be pruned by deleting single rules or single components. The application of a form of pruning known as weakest-link pruning results in an ordered series of decreasing complexity rule sets, {RS_i}.
The RAMP rule generation system [12] generates "minimal" classification rules from tabular data sets where one of the columns is a "class" variable and the remaining columns are "explanatory" features. The data set is completely discretized (i.e., continuous valued features are discretized into a finite set of discrete values, while categorical features are left untouched) by an optimal numerical discretization step prior to rule generation.

While the RAMP approach to generating classification rules is similar to techniques that directly generate rules from data, its primary goal is to strive for a "minimal" rule set that is complete and consistent with the training data. Completeness implies that the rules cover all of the examples in the training data, while consistency implies that the rules cover no counter-examples for their respective intended classes. The RAMP system utilizes a logic minimization methodology, called R-Mini, to generate "minimal" complete and consistent rules. This technique was first developed for programmable logic array circuit minimization (MINI), and is considered to be one of the best known 2-level logic minimization techniques.
The merits of minimality have been well discussed [22]. The principal hypothesis here is that a simpler solution tends to have higher accuracy. Thus, if two different solutions (in the same representation) both describe a particular data set, the less complex of the two will be more accurate in its description. Complexity is measured differently for different modeling techniques. For decision rules, it would be the total number of rules and the total number of tests in all the rules. A smaller description will tend to be better in its predictive accuracy, and this has been borne out in our extensive evaluations.

A data set with N features may be thought of as a collection of discrete points (one per example) in an N-dimensional space. A classification rule is a hypercube (a "complex" in AQ terminology) in this space that contains one or more of these points. When there is more than one cube for a given class, all the cubes are Or-ed to provide a complete logical classification function for the class. Within a cube the conditions for each part are And-ed, thereby giving the DNF representation for the overall classification solution. The size of a cube indicates its generality, i.e., the larger the cube, the more vertices it contains, and the more example-points it can potentially cover. RAMP's minimality objective is driven first by the minimal number of cubes, and then by the most general cubes. The most general cubes are prime cubes that cannot be further generalized without violating the consistency of that cube.

The minimality objective translates to finding a minimal number of prime cubes that cover all the example-points of a class and cover no example-points of any counter-class. This objective is similar to many switching function minimization algorithms.
The core heuristics used in the RAMP rule generation system for achieving minimality consist of iterating (for a reasonable number of rounds) over two key sub-steps:

1. Generalization step, R-EXPAND, which takes each rule in the current set (initially each example is a rule) and opportunistically generalizes it to remove other rules that are subsumed.

2. Specialization/Reformulation step, R-REDUCE, which takes each rule in the current set and specializes it to the most specific rule necessary to continue covering only the unique examples it covers. Redundant cubes disappear during this step.

This annealing-like approach to rule generation (via iterative improvements) may potentially run indefinitely. A limit is used that controls how long the system should keep iterating without observing a reduction. If no reduction takes place within this limit, the minimization process may be stopped. In practice, it has been observed that RAMP rule generation satisfactorily converges the rule set once it has gone through at least 5-7 iterations without performing a reduction.
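The iteration just described can be sketched as a simple loop; r_expand and r_reduce below are hypothetical placeholders standing in for the actual R-EXPAND and R-REDUCE procedures, and the stopping rule is the stagnation limit mentioned above.

def minimize_rule_set(rules, r_expand, r_reduce, patience=7, max_rounds=100):
    # Alternate generalization (R-EXPAND) and specialization/reformulation
    # (R-REDUCE) passes, stopping after `patience` consecutive rounds with no
    # reduction in the number of rules, or after `max_rounds` rounds.
    best_size = len(rules)
    rounds_without_reduction = 0
    for _ in range(max_rounds):
        rules = r_expand(rules)   # generalize rules, dropping subsumed ones
        rules = r_reduce(rules)   # shrink each rule to what it uniquely covers
        if len(rules) < best_size:
            best_size = len(rules)
            rounds_without_reduction = 0
        else:
            rounds_without_reduction += 1
        if rounds_without_reduction >= patience:
            break
    return rules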
RAMP takes a slightly different approach to overcoming the over-fitting bias in the modeling process. Instead of pruning a solution, RAMP generates multiple solutions. The heuristics that control the rule generation process in RAMP use randomization, and therefore multiple generations from the same training set can result in solutions that model the decision surface using different combinations of rules. Although each individual solution models some regions correctly and some incorrectly (due to over-fitting), the union of multiple solutions enhances the correctness, and smooths out the over-fitting. Benchmark tests have indicated that combining up to five solutions produces solutions with very competitive predictive accuracies.
5 Tree and Rule-based Regression

Regression is the problem of approximating the values of a continuous variable. Given samples of an output (response) variable y and input (predictor) variables x = {x_1, ..., x_n}, the regression task is to find a mapping y = f(x).

The classical approach to the problem is linear least-squares regression [23]. Linear regression has proven quite effective for many real-world applications. However, the simple linear model has its limits, and more complex models often fit the data better. Nonlinear regression models have been explored and many new effective methods have emerged, including projection pursuit and MARS. Neural networks trained by back-propagation are another alternate nonlinear regression model. An overview of many different regression models, with application to classification models as well, is available [21]. Most of these methods produce solutions in terms of weighted models.

The CART program induces both classification and regression trees. These regression trees are strictly binary trees. In terms of performance, regression trees are often competitive with other regression methods [4]. Regression trees are noted to be particularly strong when there are many higher order dependencies among the input variables. The advantages of the regression tree solution are similar to the advantages enjoyed by classification trees over other models. On the negative side, decision trees cannot compactly represent many simple functions, for example linear functions. A second weakness is that the regression tree solution is discrete, yet predicts a continuous variable. For function approximation, the expectation is a smooth continuous function, but a decision tree provides discrete regions that are discontinuous at the boundaries. All in all though, regression trees often produce strong results, and for many applications their advantages strongly outweigh their potential disadvantages.
5.1 Regression by Tree Induction

Like classification trees, regression trees are induced by recursive partitioning. The solution takes the form of equation 1, where the R_i are disjoint regions, the k_i are constant values, and y_ji refers to the y-values of the training cases that fall within the region R_i.

    if x in R_i then f(x) = k_i = median{y_ji}    (1)

Regression trees have the same representation as classification trees except for the terminal nodes. The decision at a terminal node is to assign a case a constant y value. The single best constant value is the median of the training cases falling into that terminal node, because for a partition, the median is the minimizer of mean absolute distance.

For regression tree induction, the minimized function, i.e. absolute distance, is a satisfactory splitting criterion for growing the tree. At each node, the single best split that minimizes the mean absolute distance is selected. Splitting continues until fewer than a minimum number of cases are covered by a node, or until all cases within the node have the identical value of y.

The pruning strategies employed for classification trees are equally valid for regression trees. Like the covering procedures, the only substantial difference is that the error rate is measured in terms of mean absolute distance. For weakest-link pruning, a tree is recursively pruned so that the ratio delta/n is minimized, where n is the number of pruned nodes and delta is the increase in error. Weakest-link pruning has several desirable characteristics: (a) it prunes by training cases only, so that the remaining test cases are relatively independent; (b) it is compatible with resampling.

An interesting extension to regression trees is exemplified in [19], wherein the tree may be terminated at each of its leaf nodes by a linear regression model. Thus the linearity in a decision surface is modeled at the leaves, while the non-linearity in the decision surface is modeled by the actual tree.

5.2 Regression by Rule Induction

Both tree and rule induction models find solutions in disjunctive normal form, and the model of equation 1 is applicable to both. Each rule in a rule-set represents a single partition or region R_i. However, unlike the tree regions, the regions for rules need not be disjoint. With non-disjoint regions, several rules may be satisfied for a single sample. Some mechanism is needed to resolve the conflicts in k_i, the constant values assigned, when multiple rules (R_i regions) are invoked. One standard model is to order the rules, as in Figure 3. Such ordered rule-sets have also been referred to as decision lists. The first rule that is satisfied is selected, as in equation 2.

    x1 <= 3  ->  y = 10
    x2 >= 1  ->  y = 2
    Otherwise    y = 5

    Figure 3: Example of Regression Rules

    if i < j and x in both R_i and R_j then f(x) = k_i    (2)

Given this model of regression rule sets, the problem is to find procedures that effectively induce solutions. For rule-based regression, a covering strategy analogous to the classification tree strategy could be specified. A rule could be induced by adding a single component at a time, where each added component is the single best minimizer of distance. As usual, the constant value k_i is the median of the region formed by the current rule. As the rule is extended, fewer cases are covered. When fewer than a minimal number of cases are covered, rule extension terminates. The covered cases are removed and rule induction can continue on the remaining cases. This is also the regression analogue of rule induction procedures for classification.

    1. Generate a set of pseudo-classes.
    2. Generate a covering rule-set for the transformed classification
       problem using a rule induction method such as Swap-1.
    3. Initialize the current rule set to be the covering rule set and save it.
    4. If the current rule set can be pruned, iteratively do the following:
       a) Prune the current rule set.
       b) Optimize the pruned rule set and save it.
       c) Make this pruned rule set the new current rule set.
    5. Use test cases or cross-validation to pick the best of the saved rule sets.

    Figure 4: Swap-1R Method for Learning Regression Rules
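Before turning to Swap-1R, a minimal sketch of how such an ordered regression rule set (a decision list, per equations 1 and 2) would be applied: the first satisfied rule supplies the prediction, which is the median of the y-values it covered; the rules and values below are hypothetical, loosely following the style of Figure 3.

from statistics import median

# Each rule: (condition over the feature vector, y-values of the training
# cases it covers). The ordered list mirrors Figure 3's style of rule set.
regression_rules = [
    (lambda x: x[0] <= 3.0, [9.0, 10.0, 11.0]),   # -> k_1 = median = 10
    (lambda x: x[1] >= 1.0, [2.0, 2.0, 3.0]),     # -> k_2 = median = 2
]
default_ys = [4.0, 5.0, 6.0]                      # "Otherwise" region

def predict(x):
    for condition, covered_ys in regression_rules:
        if condition(x):
            return median(covered_ys)   # first satisfied rule wins (equation 2)
    return median(default_ys)

print(predict((2.0, 0.0)))   # 10: first rule fires
print(predict((5.0, 2.0)))   # 2: second rule fires
print(predict((5.0, 0.0)))   # 5: default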
The Swap-1R [26] system for inducing decision regression rules works by mapping the regression problem into a classification problem. Let {C_i} be a set consisting of an arbitrary number of classes, each class containing an approximately equal number of the values {y_i}. To solve a classification problem, the classes are expected to be different from each other, and it is assumed that rules can be found to distinguish these classes. Classes formed by an ordering and discretization of {y_i} form the classification problem.
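A minimal sketch of the pseudo-class construction just described: the y-values are ordered and cut into bins of approximately equal size; the number of classes used here is an arbitrary assumption, since the text does not fix it.

def pseudo_classes(y_values, n_classes=3):
    # Order the y-values and assign each case to one of n_classes bins of
    # approximately equal size; returns a class label per original case.
    order = sorted(range(len(y_values)), key=lambda i: y_values[i])
    labels = [0] * len(y_values)
    for rank, idx in enumerate(order):
        labels[idx] = (rank * n_classes) // len(y_values)
    return labels

y = [3.1, 0.4, 9.9, 5.0, 1.2, 7.7]
print(pseudo_classes(y))   # [1, 0, 2, 1, 0, 2]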
In practice, one learning model is not always superior to others, and a learning strategy that examines the results of different models may do better. Moreover, by combining different models, enhanced results may be achieved. A general approach to combining learning models is a scheme referred to as stacking [28]. The models could be completely different, such as combining decision trees with linear regression models. Different models are applied independently to find solutions, and in a subsequent layer yet another model is used for combining the solutions into a single solution. This layer may be a simple weighted vote, as per equation 3, or something more sophisticated. This method of model combination is in contrast to the usual approach to evaluation of different models, where the single best performing model is selected.
    y = sum_{k=1}^{K} w_k M_k(x)    (3)
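Equation 3 amounts to a weighted sum over the component models; a one-function sketch, with hypothetical component models and weights:

def stacked_prediction(models, weights, x):
    # y = sum_k w_k * M_k(x): a simple weighted vote over component models.
    return sum(w * m(x) for w, m in zip(weights, models))

# Hypothetical component models: a rule-set predictor and a linear model.
models = [lambda x: 10.0 if x[0] <= 3.0 else 2.0,
          lambda x: 1.5 * x[0] + 0.5]
print(stacked_prediction(models, [0.7, 0.3], (2.0,)))  # 0.7*10 + 0.3*3.5 = 8.05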
While stacking has been shown to give improved results, a major drawback is that properties of the combined models are not retained. Thus when interpretable models are combined, the result may not be interpretable at all. It is also not possible to compensate for weaknesses in one model by introducing another model in a controlled fashion.

A modified technique for combination of alternate solutions is to retain the interpretable nature of rules, while at the same time addressing the problem of symbolic regression solutions assigning a constant value as the predictor once a region is identified. The following strategy is used to determine the y-value of a case x that falls in region R_i: instead of assigning the single constant value k_i for region R_i (where k_i is determined by the median y-value of training cases in the region), assign y_knn^i(x), the mean of the k-nearest (training set) instances of x in region R_i.

An interesting aspect of this strategy is that k-nearest neighbor results need only be considered for the cases covered by a particular partition. This hybrid approach alleviates the weakness of partitions being assigned single constant values. Moreover, some of the global distance measure difficulties of the k-nn methods may also be relieved because the table lookup is reduced to partitioned and related groupings.
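A minimal sketch of the hybrid prediction just described, assuming plain Euclidean distance: the y-value for a case falling in region R_i is the mean of its k nearest training cases among those covered by that region.

import math

def knn_region_mean(x, region_cases, k=3):
    # Mean y of the k nearest (training set) instances of x within one region.
    # region_cases: list of (feature_vector, y) pairs covered by the region.
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    nearest = sorted(region_cases, key=lambda case: dist(case[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

region = [((1.0, 2.0), 10.0), ((1.5, 2.2), 12.0), ((0.8, 1.9), 9.0), ((3.0, 5.0), 30.0)]
print(knn_region_mean((1.1, 2.0), region, k=3))   # mean of 10, 12, 9 = 10.33...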
Another decision rule based regression approach is exemplified by the RAMP system. This system also supports post-processing for regression, by computing additional metrics for the rules as they are generated on a pre-transformed pseudo classification problem, based upon example data-points that are covered by each rule in the training data. Three parameters are attached to each rule: mu, the mean of all the original class values of training examples covered by that rule; sigma, the standard deviation of these values; and N, the total number of training examples covered by that rule.
For regression estimation, two straightforward averaging approaches are made available, the simple and the weighted approach. In the simple averaging approach, the simple average of the mu values of all rules that cover an example is computed as its predicted value. Therefore, for each example in the test data, if M is the total number of rules that cover the example, its predicted class value is (1/M) sum_{i=1}^{M} mu_i. In the weighted average approach, there are several options available for weighting the rules. One of these options is to compute and assign a prediction of the weighted average, e.g., sum_{i=1}^{M} sqrt(N_i) mu_i / sum_{i=1}^{M} sqrt(N_i). In general, weighted averaging usually leads to smoother correlations between predicted and actual values. It has been observed that no unique combination of weighting and error estimation seems to be uniformly applicable. The user has the ability to evaluate these combinations on test data, determine which one tends to be most accurate, and then utilize that metric for fine tuning the solution.
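A small sketch of the two estimation options just described, given the (mu, sigma, N) statistics of the rules covering a test example; the sqrt(N) weighting in the weighted variant is just one of the possible weighting options mentioned above, not a definitive statement of RAMP's scheme.

import math

def simple_average(covering_rules):
    # covering_rules: (mu, sigma, N) triples for every rule covering the example.
    return sum(mu for mu, _, _ in covering_rules) / len(covering_rules)

def weighted_average(covering_rules):
    # Weight each rule's mean by sqrt(N), so rules backed by more training
    # examples pull the estimate harder (assumed weighting scheme).
    num = sum(math.sqrt(n) * mu for mu, _, n in covering_rules)
    den = sum(math.sqrt(n) for _, _, n in covering_rules)
    return num / den

rules = [(10.0, 1.0, 100), (14.0, 2.5, 4)]   # (mu, sigma, N) per covering rule
print(simple_average(rules))     # 12.0
print(weighted_average(rules))   # pulled toward 10.0: about 10.67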
6 Discussion

The rise in attention and focus on decision support solutions using data mining techniques has refueled a strong interest in classification and regression modeling, particularly symbolic techniques [10]. This paper has attempted to provide the reader with the key issues of decision tree and decision rule modeling techniques, two important approaches to symbolic modeling.

While some aspects of this technology have reached maturity and become stable, there are also many aspects that remain open. Symbolic modeling approaches that remain consistently robust across a wide variety of data sets are not yet well understood. Additionally, though some of these techniques are conceptually robust and elegant, they prove to be computationally challenging when applied to large scale business and industrial data sets.

Many new approaches are under active research and development. It is well recognized that these techniques cannot be applied in a black box mode to data. Intensive application analysis and knowledge engineering continue to play an important role. Techniques such as preprocessing of raw data and feature extraction contribute greatly to improving the accuracy of the symbolic modeling process. Techniques to handle missing and erroneous values in data are also critical. Techniques and methods for feature selection [15] play a useful role in pre-pruning the search space. Characteristics of data, such as too many categories for a variable, extreme bias in class proportions, or hierarchies in attributes, can significantly affect the modeling process, and a catalog of methodologies to address these issues is also slowly emerging. Scalability is emerging as a key factor in coupling these techniques to the extremely large volumes of data that are becoming available today. Systems and techniques that focus on resolving this key problem are beginning to emerge [7, 16].
Techniques such as stacking [28], bagging [3], and boosting [11] also prove to be extremely useful in improving the modeling process, by using hybrid approaches that either combine solutions from multiple approaches, or combine multiple solutions using the same approach on different training samples. Utilizing the bagging approach, predictive performance can be improved, sometimes very substantially, by finding many solutions on different random samples taken from a large data warehouse. For classification, the many answers for a new case are voted; for regression, they are averaged. This process is illustrated in Figure 5.

Figure 5: Multiple Solutions for Maximizing Accuracy (samples drawn from a data warehouse each yield a rule/tree solution; an averaging or voting mechanism combines their answers for a new example into a single classification or regression result)
N cases may be randomly drawn from a large data base, or even simulated by resampling from a smaller dataset. If a learning method is fast, it is not difficult to generate new solutions for each new sample of N cases. The most obvious candidate for this approach to learning is the decision tree. In the boosting approach, error cases are sampled with greater frequency in subsequent modeling iterations.

For a single solution and sample, a decision tree usually yields good results for most learning problems. Solutions are found quickly, typically much faster than with most other learning methods. The predictive performance of decision trees is often weaker than that of some other learning methods, such as neural nets. However, studies show that when answers for solutions found on many random samples are voted, the resulting performance can approach optimal predictive performance [3]. The predictive performance is often significantly increased, though the clarity of presentation may sometimes be compromised.
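A compact sketch of the bagging-style process of Figure 5: bootstrap samples are drawn from the data, a solution is learned on each (the learner below is a stand-in one-split "stump", not any particular tree or rule package), and the answers for a new case are voted for classification or averaged for regression.

import random
from collections import Counter

def bagged_models(data, learn, n_models=5, sample_size=None):
    # Draw n_models bootstrap samples from `data` and fit one model on each.
    # `learn` maps a list of (x, y) cases to a prediction function.
    size = sample_size or len(data)
    return [learn([random.choice(data) for _ in range(size)])
            for _ in range(n_models)]

def vote(models, x):
    # Classification: majority vote over the models' answers.
    return Counter(m(x) for m in models).most_common(1)[0][0]

def average(models, x):
    # Regression: average the models' numeric answers.
    return sum(m(x) for m in models) / len(models)

# Hypothetical learner: a one-rule "stump" on the first feature, predicting the
# majority class on each side of a threshold taken from the sample.
def learn_stump(cases):
    threshold = sorted(x[0] for x, _ in cases)[len(cases) // 2]
    left = Counter(y for x, y in cases if x[0] <= threshold)
    right = Counter(y for x, y in cases if x[0] > threshold)
    left_label = left.most_common(1)[0][0] if left else right.most_common(1)[0][0]
    right_label = right.most_common(1)[0][0] if right else left_label
    return lambda x: left_label if x[0] <= threshold else right_label

data = [((0.5,), "SQUARE"), ((0.6,), "SQUARE"), ((1.1,), "STAR"), ((1.3,), "STAR")]
models = bagged_models(data, learn_stump, n_models=7)
print(vote(models, (0.55,)))   # usually SQUARE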
Finally, an open issue that continues to be explored is the characterization of datasets, using either simple measures, statistical measures, or information theoretic measures, that will allow an educated mapping of the most appropriate mining technique to a dataset for maximizing the accuracy of the resulting solution.
References

[1] C. Apte, F. Damerau, and S. Weiss. Automated Learning of Decision Rules for Text Categorization. ACM Transactions on Information Systems, 12(3):233-251, July 1994.

[2] C. Apte and S.J. Hong. Predicting Equity Returns from Securities Data with Minimal Rule Generation. In Advances in Knowledge Discovery, pages 541-560. AAAI Press / The MIT Press, 1995.

[3] L. Breiman. Bagging Predictors. Machine Learning, 24:123-140, 1996.

[4] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Monterrey, CA, 1984.

[5] W. Buntine. Learning Classification Trees. Statistics and Computing, 2:63-73, 1992.

[6] P. Clark and T. Niblett. The CN2 Induction Algorithm. Machine Learning, 3:261-283, 1989.

[7] W. Cohen. Fast Effective Rule Induction. In The XII International Conference on Machine Learning, pages 115-123, 1995.

[8] M. Craven and J. Shavlik. Using Neural Networks for Data Mining. 1997. In this issue.

[9] U. Fayyad, S.G. Djorgovski, and N. Weir. Automating the Analysis and Cataloging of Sky Surveys. In Advances in Knowledge Discovery, pages 471-493. AAAI Press / The MIT Press, 1995.

[10] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI Press / The MIT Press, 1995.

[11] Y. Freund and R. Schapire. Experiments with a New Boosting Algorithm. In Proceedings of the International Machine Learning Conference, pages 148-156. Morgan Kaufmann, 1996.

[12] S.J. Hong. R-MINI: An Iterative Approach for Generating Minimal Rules from Examples. IEEE Transactions on Knowledge and Data Engineering, 1997. To appear.

[13] J. Hosking, E. Pednault, and M. Sudan. A Statistical Perspective on Data Mining. 1997. In this issue.

[14] M. James. Classification Algorithms. John Wiley & Sons, 1985.

[15] I. Kononenko and S.J. Hong. Attribute Selection for Modelling. 1997. In this issue.

[16] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A Fast Scalable Classifier for Data Mining. In Proceedings of the Fifth International Conference on Extending Database Technology, 1996.

[17] R. Michalski, I. Mozetic, J. Hong, and N. Lavrac. The Multi-Purpose Incremental Learning System AQ15 and its Testing Application to Three Medical Domains. In Proceedings of AAAI-86, pages 1041-1045, 1986.

[18] D. Michie, D. Spiegelhalter, and C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.

[19] J. Quinlan. Combining Instance-Based and Model-Based Learning. In International Conference on Machine Learning, pages 236-243. Morgan Kaufmann, 1993.

[20] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[21] B. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.

[22] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific Series in Computer Science, 15, 1989.

[23] H. Scheffé. The Analysis of Variance. John Wiley & Sons, 1959.

[24] P. Smyth and R. Goodman. An Information Theoretic Approach to Rule Induction from Databases. IEEE Transactions on Knowledge and Data Engineering, 4(4):301-316, August 1992.

[25] S. Weiss and N. Indurkhya. Optimized Rule Induction. IEEE EXPERT, 8(6):61-69, December 1993.

[26] S. Weiss and N. Indurkhya. Rule-Based Machine Learning Methods for Functional Prediction. Journal of Artificial Intelligence Research, 3:383-403, 1995.

[27] S. Weiss and C.A. Kulikowski. Computer Systems That Learn. Morgan Kaufmann, 1991.

[28] D. Wolpert. Stacked Generalization. Neural Networks, 5:241-259, 1992.