International Journal of Automation and Computing, 7(3), August 2010, 372-380
DOI: 10.1007/s11633-010-0517-5

Improving Decision Tree Performance by Exception Handling

Appavu Alias Balamurugan Subramanian (1), S. Pramala (1), B. Rajalakshmi (1), Ramasamy Rajaram (2)

(1) Department of Information Technology, Thiagarajar College of Engineering, Madurai, India
(2) Department of Computer Science, Thiagarajar College of Engineering, Madurai, India
Abstract: This paper focuses on improving decision tree induction algorithms when a kind of tie appears during the rule generation procedure for specific training datasets. The tie occurs when there are equal proportions of the target class outcome in the leaf node's records, which leads to a situation where majority voting cannot be applied. To solve the above-mentioned exception, we propose to base the prediction of the result on the naive Bayes (NB) estimate, k-nearest neighbour (k-NN), and association rule mining (ARM). The other features used for splitting the parent nodes are also taken into consideration.
Keywords: Data mining, classification, decision tree, majority voting, naive Bayes (NB), k-nearest neighbour (k-NN), association rule mining (ARM).
1  Introduction
The process of knowledge discovery in databases (KDD) is defined by Fayyad et al.[1] as "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data". Data mining is the core step of the KDD process and is concerned with the computationally efficient enumeration of patterns present in a database. Classification is a primary data mining task aimed at learning a function that classifies a database record into one of several predefined classes based on the values of the record attributes[2].
The decision tree is an important classification tool, and various improvements over the original decision tree algorithm, such as ID3[3, 4], ID4[5], ID5[6], ITI[7], C4.5[8], and CART[9], have been proposed. Several of these deal with the concept of incrementally building a decision tree in real time. In decision-tree learning, a decision tree is induced
from a set of labelled training instances represented by a tuple of attribute values and a class label. Because of the vast
search space, decision-tree learning is typically a greedy,
top-down recursive process starting with the entire training
data and an empty tree. An attribute that best partitions
the training data is chosen as the splitting attribute for the
root, and the training data are then partitioned into disjoint
subsets satisfying the values of the splitting attribute.
For each subset, the algorithm proceeds recursively until
all instances in a subset belong to the same class. However,
prior decision tree algorithms do not handle the exception that arises "when two or more classes have equal probabilities in a tree leaf". This paper investigates exception handling in decision tree construction. We examine exceptions such as how to produce a classification from a leaf node that contains ties. In such a leaf, each class is represented equally, preventing the tree from using "majority voting" to output a classification prediction. We introduce new techniques for handling these exceptions and use real-world datasets to show that these techniques improve classification accuracy.
This paper is organized as follows. Related work is reviewed in Section 2. The problem statement is illustrated in Section 3. Sections 4 and 5 address several exceptions occurring during decision tree induction. In Section 6, we compare the proposed algorithm to the most common decision tree construction algorithms and evaluate its performance on a variety of standard benchmark datasets. Section 7 concludes the paper and presents a number of issues for future research.
2  Related work
The decision-tree induction algorithm is one of the most successful learning algorithms due to its various attractive features: simplicity, comprehensibility, absence of parameters, and the ability to handle mixed-type data. Decision
tree learning is one of the most widely used and practical methods for inductive learning. The ID3 algorithm[3]
is a useful concept-learning algorithm because it can efficiently construct a well-generalized decision tree. For non-incremental learning tasks, this algorithm is often an ideal choice for building a classification rule. However, for incremental learning tasks, it would be far preferable to accept instances incrementally, without the necessity to build a
new decision tree each time.
There exist several techniques to construct incremental
decision tree based models. Some of the earlier efforts include ID4[5] , ID5[6] , ID5R[10] , and ITI[7] . All these systems
work by using the ID3 style “information gain” measure to
select the attributes. They are all designed to incrementally
build a decision tree using one training instance at a time
by keeping the necessary statistics (measure for information
gain) at each decision node.
Many learning tasks are incremental, as new instances or details become available over time. An incremental decision tree induction algorithm works by building a tree and then updating it as new instances become available. The ID3 algorithm[3] can be
used to learn incrementally by adding each new instance to
the training set as it becomes available and by re-running
ID3 against the enlarged training set. This is however computationally inefficient.
The ID5[6] and ID5R[10] are both incremental decision
tree builders that overcome the deficiencies of ID4. The essential difference is that when tree restructuring is required,
instead of discarding a sub-tree due to its high entropy,
the attribute that is to be placed at the node is pulled up
and the tree structure below the node is retained. In the
case of ID5[6] , the sub-trees are not recursively updated
while in ID5R[10] they are updated recursively. Leaving the
sub-trees un-restructured is computationally more efficient.
However, the resulting sub-tree is not guaranteed to be the
same as the one that would be produced by ID3 on the same
training instances.
The incremental tree inducer (ITI)[7] is a program that constructs decision trees automatically from labelled examples. The most useful aspect of the ITI algorithm is that
it provides a mechanism for incremental tree induction. If
one has already constructed a tree, and then obtains a new
labelled example, it is possible to present it to the algorithm, and have the algorithm revise the tree as necessary.
The alternative would be to build a new tree from scratch,
based on the augmented set of labelled examples, which
is typically much more expensive. ITI handles symbolic
variables, numeric variables, and missing data values. It
includes a virtual pruning mechanism too.
The development of decision tree learning has led to and
is encouraged by a growing number of commercial systems
such as C5.0/See5 (RuleQuest Research), MineSet (SGI),
and Intelligent Miner (IBM). Numerous techniques have
been developed to speed up decision-tree learning, such as
designing a fast tree-growing algorithm, parallelization, and
data partitioning.
A number of strategies for decision tree improvements have been proposed in the literature [11−19]. They aim at "tweaking" an already robust model despite its main and obvious limitation. A number of ensemble classifiers have also been proposed in the literature [20−24], which seem to offer little improvement in accuracy, especially when the added complexity of the method is considered.
3  Problem statements and overview

This paper studies one type of tie in the decision tree. The tie occurs when a leaf node contains the same number of cases for different classes. For this problem, the association rule mining (ARM), k-nearest neighbour (k-NN), and naive Bayes (NB) approaches are integrated into the decision tree algorithm. ARM, k-NN, and NB are used to make predictions for the instances in the tied leaf node. WEKA[25, 26] uses a weak kind of tie breaking, such as taking the first value mentioned in the value declaration part of the WEKA data file, while the proposed solutions try to break ties more intelligently by using ARM, k-NN, and NB. We use ARM, k-NN, and NB to select the class value when majority voting does not work.

The set of attributes used to describe an instance is denoted by A, and the individual attributes are indicated as ai, where i is between 1 and the number of attributes, denoted as |A|. For each attribute ai, the set of possible values is denoted as Vi. The individual values are indicated by vij, where j is between 1 and the number of values for attribute ai, denoted as |Vi|. In the training data, the attributes are denoted Aj, where j = 1, ..., n, the class attribute is denoted C, and the attribute values are denoted Vij, where i = 1, ..., n and j refers to the attribute to which the value belongs. The decision tree induction algorithm may have to handle three different types of training data sets. The general structure of the training data set is shown in Table 1.

Table 1  Structure of the training set

A1    A2    A3    A4    A5    ...   Am    C
V11   V12   V13   V14   V15   ...   V1m   C1
V21   V22   V23   V24   V25   ...   V2m   C2
V31   V32   V33   V34   V35   ...   V3m   C3
...   ...   ...   ...   ...   ...   ...   ...
Vn1   Vn2   Vn3   Vn4   Vn5   ...   Vnm   Cn

After analyzing the decision tree algorithms, it has been found that the concept of majority voting has to handle different types of inputs. Consider an attribute Ak among A1 to Am, and let C be the class attribute. Let the values of Ak be V1k and V2k, and let the values of the class attribute be C1 and C2. Classification can be done if
1) both Ak and C have only one distinct value; or
2) most of the attribute values V1k, V2k of a particular attribute Ak belong to the same class; this case is called majority voting.

The maximum occurrence of a distinct value and its corresponding distinct class label value is obtained from Table 2, which shows a case where majority voting is successful. Consider the training data set shown in Table 3, where majority voting fails.

Table 2  Input format for the majority voting condition

Ak    C
V1k   C2
V2k   C2
V1k   C2
V1k   C1

Majority voting on Table 2 yields the rules

If Ak = V1k then C = C2
If Ak = V2k then C = C2.

Table 3  Example training set where majority voting cannot be applied

Ak    C
V1k   C1
V2k   C2
V1k   C1
V1k   C2
V1k   C2

Under this condition, only one rule can be generated:

If Ak = V2k then C = C2.
In Table 3, attribute Ak has two distinct values. Hence, two rules should be generated from the table, but only one rule is obtained with the help of majority voting; the other rule cannot be completed:

If Ak = V1k then C = ?

The antecedent Ak = V1k is found by majority voting, but its corresponding class value cannot be determined because, among the four records with Ak = V1k in Table 3, the class attribute is equally partitioned between C = C1 and C = C2. Hence, majority voting cannot be applied in this case, and the decision tree induction algorithm does not give any specific solution to handle this problem.
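To make the failure concrete, the following minimal Python sketch (with the records of Table 3 hard-coded; the variable names are ours, not the paper's) counts the class frequencies for each value of Ak and reports where majority voting cannot break the tie.

```python
from collections import Counter

# Records of Table 3: (value of attribute Ak, class label)
records = [("V1k", "C1"), ("V2k", "C2"), ("V1k", "C1"),
           ("V1k", "C2"), ("V1k", "C2")]

for value in ("V1k", "V2k"):
    # Class distribution among the records having this attribute value
    counts = Counter(cls for val, cls in records if val == value)
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        print(f"Ak = {value}: class counts {dict(counts)} are tied; majority voting fails")
    else:
        print(f"Ak = {value}: majority class is {ranked[0][0]}")
```

Running this prints a tie for Ak = V1k (two C1 and two C2 records) and a clear majority (C2) for Ak = V2k, which is exactly the situation described above.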
4  Ties between classes

This paper addresses the problem of tree induction for supervised learning. We have proposed a method to handle the situation where the number of instances for different classes is the same at a leaf node. The basic algorithm for decision tree induction is a greedy algorithm that constructs decision trees in a top-down recursive divide-and-conquer manner.

We have chosen training data from the "Hayes-Roth database" (the data are adapted from [27]). The class attribute has three distinct values (namely, {1, 2, 3}); therefore, there are three distinct classes (m = 3). Let class C1 correspond to "1", class C2 to "2", and class C3 to "3". There are 21 samples with class label "1", 21 samples with class label "2", and 19 samples with class label "3".

To compute the information gain of each attribute, the expected information needed to classify a given sample is first derived:

I(s1, s2, s3) = I(21, 21, 19) = -(21/61) log2(21/61) - (21/61) log2(21/61) - (19/61) log2(19/61) = 1.5833.

Next, we need to compute the entropy of each attribute. For the attribute "hobby", the distribution of "1", "2", and "3" samples for each value of "hobby" needs to be examined, and the expected information for each of these distributions is computed.

The training tuples of the Hayes-Roth dataset are given as input to the decision tree induction algorithm. For the given training data set, the algorithm starts its recursive function of identifying the attribute with the highest information gain. On the first recursion, the attribute "age" gets the highest information gain, and it divides the data set into two halves, which is the core idea behind the algorithm.

The recursive function identifies the splitting attribute at each iteration, and the rules are generated. Majority voting is the stopping condition for the recursive partitioning, and all possible data at one particular iteration are given in Table 4. The possible sets of rules generated are shown in Tables 5 and 6.

Table 4  A sample partitioned dataset

Education level   Hobby   Class
3                 1       2
4                 3       3
1                 1       1
2                 1       2
3                 2       2
1                 2       1
2                 2       2
3                 3       2
3                 1       1
1                 3       1
2                 3       2
3                 2       1
3                 3       1

In Table 4, "education level" gets the highest information gain, and hence it splits the table into four parts, "Education level = 3", "Education level = 4", "Education level = 2", and "Education level = 1"; the recursive function is then applied to each part.

In Table 7, neither information gain nor majority voting can be applied, and hence the decision tree induction algorithm chooses the class label value arbitrarily. Thus, for the same set of training samples, different rules are generated. This type of misconception occurs when recursive partitioning faces a data set that has an equal number of records in each class; the concept of majority voting works correctly only when the numbers of records are unequal. Fig. 1 shows the majority voting failure with the Hayes-Roth dataset.
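As a quick numerical check of the expected-information figure derived in this section, the following small Python sketch (our own illustration, not the authors' code) computes I(s1, ..., sm) from the class counts 21, 21, and 19.

```python
import math

def expected_information(counts):
    """I(s1, ..., sm) = -sum_i (s_i / s) * log2(s_i / s) over the class counts s_i."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

print(expected_information([21, 21, 19]))  # about 1.583, the value derived above
```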
Table 5  Generated rules for the Hayes-Roth dataset with "1" as the first class value

1. If Age = 1 and Marital status = 1 and Education level = 2, then Class = 1
2. If Age = 1 and Marital status = 1 and Education level = 4 and Hobby = 3, then Class = 3
3. If Age = 1 and Marital status = 2 and Education level = 1, then Class = 1
4. If Age = 1 and Marital status = 2 and Education level = 2, then Class = 2
5. If Age = 1 and Marital status = 2 and Education level = 4 and Hobby = 3, then Class = 1
6. If Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 3, then Class = 1
7. If Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 2, then Class = 2
8. If Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 1, then Class = 1
9. If Age = 1 and Marital status = 4, then Class = 3
10. If Age = 1 and Marital status = 3, then Class = 1
Table 6  Generated rules for the Hayes-Roth dataset with "2" as the first class value

1. If Age = 1 and Marital status = 1 and Education level = 2, then Class = 1
2. If Age = 1 and Marital status = 1 and Education level = 4 and Hobby = 3, then Class = 3
3. If Age = 1 and Marital status = 2 and Education level = 1, then Class = 1
4. If Age = 1 and Marital status = 2 and Education level = 2, then Class = 2
5. If Age = 1 and Marital status = 2 and Education level = 4 and Hobby = 3, then Class = 1
6. If Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 3, then Class = 1
7. If Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 2, then Class = 2
8. If Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 1, then Class = 2
9. If Age = 1 and Marital status = 4, then Class = 3
10. If Age = 1 and Marital status = 3, then Class = 1
Fig. 1  The attribute age has the highest information gain and therefore becomes a test attribute at the root node of the decision tree. Branches are grown for each value of age, and the samples are partitioned according to each branch.
Table 7  Example of a situation where majority voting is not possible

Hobby   Class
1       2
2       2
3       2
1       1
2       1
3       1

5  Methods for handling ties between classes

This paper addresses the following research question in the context of decision tree induction: which class should be chosen when, after classifying the records with respect to an attribute, the numbers of records having the different class values are equal, i.e., when majority voting fails? The NB probabilistic model, k-NN, and ARM are used to answer this question.
5.1  Naive Bayes (NB)
In this paper, it is proposed to use the NB classifier to select the class value when majority voting does not work. The presented approach deals with a shortcoming of the information-gain-based decision tree induction algorithm. In order to avoid the arbitrary selection of a class label due to the failure of "majority voting", the usage of NB to assign the class label is proposed. The class label chosen by NB would be better than an arbitrary selection. Thus, decision tree naive Bayes (DTNB) can be viewed as an extension of the decision tree algorithm that does not give contradictory rules.
In the traditional decision tree induction algorithm, a data set would fall under one of two categories: 1) consistent data or 2) inconsistent data. With the DTNB algorithm, however, any data set considered for classification would fall under one of four categories: 1) consistent data identified by alpha rules, 2) inconsistent data identified by alpha rules, 3) consistent data identified by beta rules, or 4) inconsistent data identified by beta rules.

Inconsistent data are those which are not classified in the given training data set. Hence, the efficiency of the algorithm decreases because it cannot classify such data. The records considered inconsistent are identified by alpha rules. Noisy data may also be present in the training data due to human errors, that is, the user records irrelevant values in the data set. The algorithm should be developed in such a way that it is capable of handling noisy data as well.

When a rule is formed by a set of records in which one value of the class label accounts for more than 50 % of the records (i.e., more than its counterpart), that rule is acceptable, and the instances that contradict it are recorded as noisy data. The inconsistent records are those which are identified as noise by beta rules.
Consider the attribute "hobby" in Fig. 1, for which the class label's value cannot be decided. The proposed solution considers the path of the tree as the test data and all the remaining data as the training set for this problem. This training set is given to the probability-based algorithm (simple naive Bayes), and the path of the tree is treated as the test data, i.e.,

If Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 1, then Class = ?
NB is a classification technique[28−31] that is both predictive and descriptive. It analyzes the relationship between each independent variable and the dependent variable and derives a conditional probability for each relationship. When a new case is analyzed, a prediction is made by combining the effects of the independent variables on the dependent variable (the outcome that is predicted). NB requires only one pass through the training set to generate a classification model, which makes it a highly efficient data mining technique.
Bayesian classification is based on Bayes theorem[29, 32]. Let X be a data sample whose class label is unknown, and let H be some hypothesis, such as that the data sample X belongs to a specified class C. For classification problems, according to Bayes theorem, P(H|X) is to be determined, i.e., the probability that the hypothesis H holds, given the observed data sample X. P(X), P(H), and P(X|H) may be estimated from the given data. Bayes theorem is useful in that it provides a way of calculating the posterior probability P(H|X) from P(H), P(X), and P(X|H):

P(H|X) = P(X|H) P(H) / P(X)                (1)

where X = (Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 1).
NB aims at predicting the class value (Class = 1 or Class = 2), and it does so probabilistically. We use the notation P(A) for the probability of an event A and P(A|B) for the probability of A conditional on another event B. The hypothesis H is that the class will be "1" or "2", and our aim is to find P(H|E), where the evidence E is the particular combination of attribute values Age = 1, Marital status = 2, Education level = 3, and Hobby = 1. The classifier evaluates P(X|Ci)P(Ci) for i = 1, 2, where P(Ci), the prior probability of each class, is computed from the training samples, and predicts the class that maximizes this product. For sample X, the naive Bayes classifier therefore predicts Class = 1. The procedure for handling ties between classes using NB is given in Algorithm 1.
Algorithm 1. Handling ties between classes using NB
1) If the attribute list is empty, then
2)   If the training data rule is null,
3)     return N as a leaf node labelled with the most common class in the samples.  // majority voting
       The rule traversed from the root to this leaf is an alpha rule.
       If the records satisfy the alpha rule,
         i) they are correctly classified (as in the normal decision tree induction algorithm);
       else
         ii) they are incorrectly classified (as in the normal decision tree induction algorithm).
4)   Else
5)     return N as a leaf node labelled with the class label value obtained from the probability-based algorithm (the naive Bayes algorithm).
       The rule traversed from the root to this leaf is called a beta rule.  // a unique rule
       If the records satisfy the beta rule,
         they are correctly classified (these records cannot be handled by the plain decision tree induction algorithm);
       else
         they are incorrectly classified.
Hence, the class value is assigned based on the given training data set even where majority voting is not possible. The rules developed in this manner are called beta rules, as they have the unique feature of handling the exception caused by the failure of majority voting.
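For illustration only (this is not the authors' implementation), a minimal categorical naive Bayes with Laplace smoothing can label the tied leaf: the tree path acts as the test instance and the remaining records act as the training set. The attribute names and the tiny training set below are assumptions made for the sketch.

```python
import math
from collections import Counter

def nb_predict(train_rows, train_labels, test_row):
    """Categorical naive Bayes with Laplace smoothing; returns the most probable class."""
    n = len(train_labels)
    best_class, best_score = None, float("-inf")
    for c, nc in Counter(train_labels).items():
        rows_c = [r for r, y in zip(train_rows, train_labels) if y == c]
        score = math.log(nc / n)  # log prior P(c)
        for attr, value in test_row.items():
            distinct = len({r[attr] for r in train_rows})
            count = sum(1 for r in rows_c if r[attr] == value)
            score += math.log((count + 1) / (nc + distinct))  # smoothed log P(value | c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical remaining records (attribute dict, class label) -- not the paper's data
train = [{"Age": 1, "Marital": 2, "Education": 3, "Hobby": 2},
         {"Age": 1, "Marital": 2, "Education": 3, "Hobby": 3},
         {"Age": 1, "Marital": 2, "Education": 1, "Hobby": 1},
         {"Age": 1, "Marital": 1, "Education": 2, "Hobby": 1}]
labels = ["2", "1", "1", "1"]

# The path of the tied leaf acts as the test instance
leaf_path = {"Age": 1, "Marital": 2, "Education": 3, "Hobby": 1}
print(nb_predict(train, labels, leaf_path))  # prints "1" for this illustrative data
```

In a DTNB-style run, the label returned here would replace the arbitrary choice at the tied leaf, producing the beta rule described above.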
5.2  k-nearest neighbour (k-NN)
In this section, the usage of k-NN is proposed to overcome a few problems that result from the failure of majority voting. This is an enhancement of the decision tree induction algorithm to deal with cases of unpredictability. We explain the needs that drive the change by describing specific problems of the algorithm, suggest solutions, and test them with benchmark data sets. If two or more classes have equal probabilities in a tree leaf, assigning the class label using the k-NN algorithm is proposed to solve the problem.
k-NN is suitable for classification models[33−35] . Unlike
other predictive algorithms, the training data is not scanned
or processed to create the model. Instead, the training data
is the model itself. When a new case or instance is presented
to the model, the algorithm looks at the entire data to find
a subset of cases that are most similar to it and uses them to
predict the outcome. For distance measure, we use a simple
metric that is computed by summing the scores for each of
the independent columns. The procedure for handling ties between classes using instance-based learning is given in Algorithm 2.
Algorithm 2. Handling ties between classes using k-NN
1) If the attribute list is empty, then
2)   If the training data rule is null,
3)     return N as a leaf node labelled with the most common class in the samples.  // majority voting
       The rule traversed from the root to this leaf is an alpha rule.
       If the records satisfy the alpha rule,
         i) they are correctly classified (as in the normal decision tree induction algorithm);
       else
         ii) they are incorrectly classified (as in the normal decision tree induction algorithm).
4)   Else
5)     return N as a leaf node labelled with the class label value obtained from instance-based learning (the k-NN algorithm).
       The rule traversed from the root to this leaf is called a beta rule.  // a unique rule
       If the records satisfy the beta rule,
         they are correctly classified (these records cannot be handled by the plain decision tree induction algorithm);
       else
         they are incorrectly classified.
In the traditional decision tree induction algorithm, the data set would fall under one of the categories explained in the previous section. The proposed algorithm considers the path of the tree as the test data and all the remaining data as the training set for this problem. This training data set is given to the instance-based learner (a simple k-NN classifier), and the path of the tree is considered as the test data, i.e., X = (Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 1). The output is y' = 1, i.e.,

If Age = 1 and Marital status = 2 and Education level = 3 and Hobby = 1, then Class = 1.
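As with NB, the following is only an illustrative sketch of the k-NN step: the distance is the number of mismatching attribute values (a simple matching metric in the spirit of the column-wise scoring described above), and the training records shown are hypothetical.

```python
from collections import Counter

def knn_predict(train_rows, train_labels, test_row, k=3):
    """k-NN for categorical data; distance = number of mismatching attribute values."""
    def distance(row):
        return sum(1 for attr, value in test_row.items() if row.get(attr) != value)
    ranked = sorted(zip(train_rows, train_labels), key=lambda pair: distance(pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# The tied leaf's path is the test instance; hypothetical remaining records form the training set
leaf_path = {"Age": 1, "Marital": 2, "Education": 3, "Hobby": 1}
train = [{"Age": 1, "Marital": 2, "Education": 3, "Hobby": 2},
         {"Age": 1, "Marital": 2, "Education": 1, "Hobby": 1},
         {"Age": 1, "Marital": 1, "Education": 2, "Hobby": 1},
         {"Age": 1, "Marital": 2, "Education": 3, "Hobby": 3}]
labels = ["2", "1", "1", "1"]
print(knn_predict(train, labels, leaf_path, k=3))  # prints "1" for this illustrative data
```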
5.3  Association rule mining (ARM)
ARM finds interesting associations and correlations among large sets of data items[36]. If two or more classes have equal probabilities in a tree leaf, the usage of ARM is proposed to assign the class label value. ARM can be used to solve the majority-voting failure problem in the decision tree induction algorithm. After the rules have been mined, the class value for the conflicting data is decided as follows (see the sketch after Algorithm 3):
1) Find the rule having the highest number of attributes.
2) If two or more rules have the highest number of attributes but different class values, count the rules for each of those class values among the longest rules and return the class value supported by the most rules; otherwise, return the class label of the single rule having the highest number of attributes.
The procedure for handling ties between classes using ARM is given in Algorithm 3. Using ARM, we can predict that, if Hobby = 1 and Age = 1 and Educational level = 3 and Marital status = 2, then Class = 1.
Algorithm 3. Handling ties between classes using ARM
Generate a decision tree from the training tuples of data D.
Input: training data tuples D
Output: decision tree
Procedure:
1) Create a node N in the decision tree.
2) If all tuples in the training data belong to the same class C, the node N is a leaf node labelled with class C.
3) If there are no remaining attributes on which the tuples may be further partitioned:
   i) If one class has a higher count than the others,  // majority voting succeeds
      convert N into a leaf node and label it with the majority class C at that instance.  // apply majority voting
   ii) Else  // majority voting fails
      a) apply the ARM algorithm to find the frequent patterns in the training data D for the conflicting rule;
      b) tune the conflicting rule depending on the class value obtained from the frequent patterns.
4) Else apply the attribute selection method to find the "best" splitting criterion.
5) Label the created node with the splitting criterion.
6) Remove the splitting attribute from the attribute list.
7) For each outcome of the splitting attribute, partition the tuples and grow a subtree for each partition.
8) If the set of tuples satisfying an outcome of the splitting attribute is empty, apply majority voting.
9) Else apply the above algorithm recursively to the tuples that satisfy the outcome of the splitting attribute.
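The sketch below illustrates only the rule-selection step from Section 5.3 (steps 1) and 2)); the class-association rules themselves are assumed to have been mined already, and the rules shown are hypothetical.

```python
from collections import Counter

def resolve_tie(rules, leaf_path):
    """Pick a class from mined class-association rules: prefer the longest matching rule;
    if several longest rules disagree, take the class supported by more of them."""
    matching = [(ant, cls) for ant, cls in rules
                if all(leaf_path.get(a) == v for a, v in ant.items())]
    if not matching:
        return None
    longest = max(len(ant) for ant, _ in matching)
    votes = Counter(cls for ant, cls in matching if len(ant) == longest)
    return votes.most_common(1)[0][0]

# Hypothetical mined rules: (antecedent, class); not the paper's actual rule set
rules = [({"Hobby": 1, "Age": 1}, "1"),
         ({"Hobby": 1, "Age": 1, "Education": 3}, "1"),
         ({"Marital": 2}, "2")]
leaf_path = {"Age": 1, "Marital": 2, "Education": 3, "Hobby": 1}
print(resolve_tie(rules, leaf_path))  # prints "1", matching the ARM prediction above
```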
6  Experimental results and performance evaluation
The performance of the proposed methods was evaluated on artificial and real-world domains. Artificial domains are useful because they allow variation in parameters, aid us in understanding the specific problems that algorithms exhibit, and allow conjectures to be tested. Real-world domains are useful because they come from real-world problems and are therefore actual problems on which we would like to improve performance.
All real-world data sets used are from the University of California Irvine (UCI) repository[27], which contains over 177 data sets mostly contributed by researchers in the field of machine learning. The data sets that possess the majority voting problem, i.e., contain records having the same attribute values but different class values, were purposely chosen for experimentation to demonstrate the efficiency of our algorithm. Experiments were conducted to measure the performance of the proposed methods, and it was observed that these data sets are efficiently classified.
Tables 8 and 9 describe the datasets used in the experiments. The performance of the proposed algorithms is compared with that of various existing decision tree induction algorithms. Tables 10 and 11 show, for each data set, the estimated predictive accuracy of the decision tree with k-NN, the decision tree with ARM, and the decision tree with NB versus other traditional decision tree methods. As can be seen from Tables 10 and 11, the predictive accuracy of DT-ARM, DT-KNN, and DT-NB tends to be better than the accuracy of traditional decision tree induction algorithms.
There are four common approaches for estimating the accuracy: using the training data, using separate test data, cross-validation, and percentage splitting[37]. The evaluation functions used here are percentage splitting and 10-fold cross-validation.
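For reference, a plain k-fold cross-validation accuracy estimate can be sketched as below (a generic illustration, not the evaluation code used for Tables 10 and 11); fit_predict stands for any classifier that is trained on the training folds and returns predictions for the held-out fold.

```python
import random

def cross_val_accuracy(rows, labels, fit_predict, k=10, seed=0):
    """k-fold cross-validation accuracy; fit_predict(train_x, train_y, test_x) -> predicted labels."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        preds = fit_predict([rows[i] for i in train_idx],
                            [labels[i] for i in train_idx],
                            [rows[i] for i in fold])
        correct += sum(p == labels[i] for p, i in zip(preds, fold))
    return correct / len(rows)
```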
In Tables 10 and 11, datasets with the majority voting problem are considered, and the predictive accuracy of the proposed methods is compared with that of the traditional decision tree induction algorithm. The proposed algorithm provides better results than C4.5 besides solving the exceptions caused by the failure of majority voting.
This paper presents algorithms based on the decision tree with k-NN, the decision tree with ARM, and the decision tree with NB, which use the decision tree as the main structure and conditionally apply ARM, k-NN, or NB at some of the leaf nodes in order to handle cases where a node has equal probability for more than one class. The nearest-neighbour technique used in the DT-KNN algorithm has a number of drawbacks when compared to the NB technique used in the DT-NB algorithm. These drawbacks include a lack of descriptive output, reliance on an arbitrary distance measure, and the performance implications of scanning the entire data set to predict each new case.
The NB method, on the other hand, has a number of strengths. It is highly efficient with respect to training and can make predictions from partial data.
The biggest theoretical drawback of NB is the need for independence between the columns, which is usually not a problem in practice. From the performance of each classifier, some conclusions can be drawn from this experiment:
1) Using NB, k-NN, or ARM to resolve a tie in which two or more classes have equal probabilities in a tree leaf is better than making a random decision at the node.
2) Using NB to resolve such a tie is better than using ARM or k-NN at the node.
The accuracy across the various classifiers is shown in Fig. 2.

Table 8  The datasets that possess the majority voting problem in the UCI repository

Dataset                         Number of instances   Number of attributes   Associated task
Abalone                         4177                  9                      Classification
Pima-diabetes                   768                   9                      Classification
Flag                            194                   30                     Classification
Glass                           214                   11                     Classification
Housing                         506                   14                     Classification
Image segmentation              210                   20                     Classification
Ionosphere                      351                   35                     Classification
Iris                            150                   5                      Classification
Liver disorder                  345                   7                      Classification
Parkinson                       195                   24                     Classification
Yeast                           1484                  10                     Classification
Table 9  The datasets that possess the majority voting problem in the UCI repository (after discretization)

Dataset                         Number of instances   Number of attributes   Associated task
Blood transfusion               748                   5                      Classification
Contraceptive method choice     1473                  10                     Classification
Concrete                        1030                  9                      Classification
Forest-fires                    517                   13                     Classification
Haberman                        306                   4                      Classification
Hayes-Roth                      132                   6                      Classification
Solar flare 1                   323                   13                     Classification
Solar flare 2                   1066                  13                     Classification
Spect-heart                     267                   23                     Classification
Teaching assistant evaluation   151                   6                      Classification
Table 10  Predictive accuracy (%) of the proposed methods versus C4.5, NB, and k-NN with percentage splitting

Data set                        C4.5    NB      k-NN    DT with ARM   DT with NB   DT with k-NN
Blood transfusion               84.61   86.26   83.51   85.16         90.11        87.36
Solar flare 1                   67.9    56.79   70.37   85.18         91.46        88.89
Glass                           66.04   28.3    69.81   75.47         86.79        81.13
Haberman                        59.57   66.67   54.54   87.88         90.91        87.88
Hayes-Roth                      67.5    72.5    57.5    85.00         92.50        87.50
Liver disorder                  63.95   46.51   66.28   86.04         90.70        87.20
Parkinson                       50      50      33.33   86.67         90.00        86.67
Teaching assistant evaluation   54      70.27   89.19   86.48         94.59        86.48
Mean                            64.20   59.66   65.57   84.735        90.88        86.64
Table 11  Predictive accuracy (%) of the proposed methods versus C4.5, NB, and k-NN with ten-fold cross-validation

Data set                        C4.5    NB      k-NN    DT with ARM   DT with NB   DT with k-NN
Solar flare 1                   71.21   65.01   69.66   85.39         89.16        87.62
Hayes-Roth                      72.73   80.30   57.58   89.23         93.94        91.67
Parkinson                       80.51   69.23   89.74   97.89         98.97        97.89
Teaching assistant evaluation   52.98   50.33   62.25   86.00         93.38        89.40
Mean                            69.36   66.22   69.81   89.63         93.86        91.65
Fig. 2  Predictive accuracy of the classifiers

7  Conclusions

This paper proposes enhancements to the basic decision tree induction algorithm. The enhancements handle the case where the numbers of instances for different classes are the same in a leaf node. The results on the UCI data sets show that the proposed modification procedure has better predictive accuracy than the existing methods in the literature. Future work includes the use of a more robust classifier instead of NB, k-NN, and ARM. Building a tree for each tied feature at lower cost is also an area that demands attention in the near future.

References

[1] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth. From data mining to knowledge discovery: An overview. Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Menlo Park, CA, USA: American Association for Artificial Intelligence, pp. 1–34, 1996.

[2] J. W. Han, M. Kamber. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2006.

[3] J. R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.

[4] J. R. Quinlan. Simplifying decision trees. International Journal of Man-Machine Studies, vol. 27, no. 3, pp. 221–234, 1987.

[5] P. E. Utgoff. Improved training via incremental learning. In Proceedings of the 6th International Workshop on Machine Learning, Morgan Kaufmann Publishers Inc., Ithaca, New York, USA, pp. 362–365, 1989.

[6] P. E. Utgoff. ID5: An incremental ID3. In Proceedings of the 5th International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., Ann Arbor, MI, USA, pp. 107–120, 1988.

[7] P. E. Utgoff. An improved algorithm for incremental induction of decision trees. In Proceedings of the 11th International Conference on Machine Learning, pp. 318–325, 1994.

[8] J. R. Quinlan. C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.

[9] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone. Classification and Regression Trees, Wadsworth and Brooks, 1984.

[10] P. E. Utgoff. Incremental induction of decision trees. Machine Learning, vol. 4, no. 2, pp. 161–186, 1989.

[11] S. A. Balamurugan, R. Rajaram. Effective and efficient feature selection for large scale data using Bayes' theorem. International Journal of Automation and Computing, vol. 6, no. 1, pp. 62–71, 2009.

[12] W. Buntine. Learning classification trees. Statistics and Computing, vol. 2, no. 2, pp. 63–73, 1992.

[13] C. R. P. Hartmann, P. K. Varshney, K. G. Mehrotra, C. L. Gerberich. Application of information theory to the construction of efficient decision trees. IEEE Transactions on Information Theory, vol. 28, no. 4, pp. 565–577, 1982.
[14] J. Mickens, M. Szummer, D. Narayanan. Snitch interactive decision trees for troubleshooting misconfigurations.
In Proceedings of the 2nd USENIX Workshop on Tackling Computer Systems Problems with Machine Learning
Techniques, USENIX Association, Cambridge, MA, USA,
Article No. 8, 2007.
[15] R. Kohavi, C. Kunz. Option decision trees with majority votes. In Proceedings of the 14th International Conference on Machine Learning, Morgan Kaufmann, pp. 161–169,
1997.
[16] R. Caruana, A. Niculescu-Mizil. An empirical comparison
of supervised learning algorithms. In Proceedings of the
23rd International Conference on Machine Learning, ACM,
Pittsburgh, Pennsylvania, USA, pp. 161–168, 2006.
[17] J. C. Schlimmer, D. Fisher. A case study of incremental concept induction. In Proceedings of the 5th National Conference on Artificial Intelligence, Morgan Kaufmann, Philadelpha, USA, pp. 496–501, 1986.
[18] J. C. Schlimmer, R. Granger. Beyond incremental processing: Tracking concept drift. In Proceedings of the 5th National Conference on Artificial Intelligence, vol. 1, pp. 502–
507, 1986.
[19] P. E. Utgoff, N. C. Berkman, J. A. Clouse. Decision tree
induction based on efficient tree restructuring. Machine
Learning, vol. 29, no. 1, pp. 5–44, 2004.
[20] H. A. Chipman, E. I. George, R. E. McCulloch. Bayesian
CART model search. Journal of the American Statistical
Association, vol. 93, no. 443, pp. 935–948, 1998.
[21] R. Kohavi. Scaling up the accuracy of naive Bayes classifiers: A decision tree hybrid. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data
Mining, AAAI Press, pp. 202–207, 1996.
[22] L. M. Wang, S. M. Yuan, L. Li, H. J. Li. Improving the performance of decision tree: A hybrid approach. Conceptual
Modeling, Lecture Notes in Computer Science, Springer,
vol. 3288, pp. 327–335, 2004.
[23] Y. Li, K. H. Ang, G. C. Y. Chong, W. Y. Feng, K. C. Tan,
H. Kashiwagi. CAutoCSD-evolutionary search and optimisation enabled computer automated control system design.
International Journal of Automation and Computing, vol. 1,
no. 1, pp. 76–88, 2006.
[24] Z. H. Zhou, Z. Q. Chen. Hybrid decision tree. Journal of
Knowledge-based Systems, vol. 15, no. 8, pp. 515–528, 2002.
[25] WEKA. Open Source Collection of Machine Learning Algorithm.
[26] I. H. Witten, E. Frank. Data Mining-practical Machine
Learning Tools and Techniques with Java Implementation,
2nd Edition, 2004.
[27] C. L. Blake, C. J. Merz. UCI Repository of Machine Learning Databases, [Online], Available: http://
www.ics.uci.edu/˜mlearn/mlrepository.html, 2008.
[28] E. Frank, M. Hall, B. Pfahringer. Locally weighted naive
Bayes. In Proceedings of Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, pp. 249–256, 2003.
[29] J. Joyce. Bayes Theorem, Stanford Encyclopedia of Philosophy, 2003.
[30] L. Jiang, D. Wang, Z. Cai, X. Yan. Survey of improving
naive Bayes for classification. In Proceedings of the 3rd International Conference on Advanced Data Mining and Applications, Springer, vol. 4632, pp. 134–145, 2007.
[31] P. Langley, W. Iba, K. Thompson. An analysis of Bayesian
classifiers. In Proceedings of the 10th National Conference on Artificial Intelligence, AAAI press and MIT press,
pp. 223–228, 1992.
[32] J. M. Bernardo, A. F. Smith. Bayesian Theory, John Wiley
& Sons, 1993.
[33] D. W. Aha, D. Kibler, M. K. Albert. Instance-based learning algorithms. Machine Learning, vol. 6, no. 1, pp. 37–66,
1991.
[34] T. M. Cover, P. E. Hart. Nearest neighbour pattern classification. IEEE Transactions on Information Theory, vol. 13,
no. 1, pp. 21–27, 1967.
[35] S. M. Weiss. Small sample error rate estimation for knearest neighbour classifiers. IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 13, no. 3, pp. 285–
289, 1991.
[36] T. Fukuda, Y. Morimoto, S Morishita, T. Tokuyama. Data
mining with optimized two-dimensional association rules.
ACM Transactions on Database Systems, vol. 26, no. 2,
pp. 179–213, 2001.
[37] R. Kohavi. A study of cross-validation and bootstrap for
accuracy estimation and model selection. In Proceedings of
the 14th International Joint Conference on Artificial Intelligence, pp. 1137–1143, 1995.
Appavu Alias Balamurugan Subramanian is a Ph. D. candidate in the Department of Information and Communication Engineering at Anna University, Chennai, India. He is also a faculty member at Thiagarajar College of Engineering, Madurai, India.
His research interests include data mining and text mining.
E-mail: [email protected] (Corresponding author)
S. Pramala was an undergraduate student (2009) in the Department of Information Technology at Thiagarajar College of Engineering, Madurai, India, and received the B. Tech. degree in information technology in May 2010.
Her research interests include data mining, where classification and prediction methods dominate her domain area, and the mining of frequent patterns, associations, and correlations existing in test data.
E-mail: [email protected]
B. Rajalakshmi was a final-year undergraduate student (2009) in information technology at Thiagarajar College of Engineering, Madurai, India. She received the B. Tech. degree in information technology in May 2010.
Her research interests include textual data mining, the usage of various classification techniques for efficient retrieval of data, data pruning, and pre-processing.
E-mail: [email protected]
Ramasamy Rajaram received the Ph. D. degree from Madurai Kamaraj University, India. He is a professor in the Department of Computer Science and Information Technology at Thiagarajar College of Engineering, Madurai, India.
His research interests include data mining and information security.
E-mail: [email protected]