Iraqi Journal of Science, 2014, Vol. 55, No. 4A, pp. 1638-1645
Modified Decision Tree Classification Algorithm for Large Data Sets
Ihsan A. Kareem *, Mehdi G. Duaimi
Department of Computer Science, College of Science, University of Baghdad, Baghdad, Iraq.
Abstract
A decision tree is an important classification technique in data mining. Decision trees have proved to be valuable tools for the classification, description, and generalization of data. Work on building decision trees for data sets exists in multiple disciplines such as signal processing, pattern recognition, decision theory, statistics, machine learning, and artificial neural networks. This research deals with the problem of finding the parameter settings of a decision tree algorithm in order to build accurate, small trees and to reduce execution time for a given domain. The proposed approach (mC4.5) is a supervised learning model based on the C4.5 algorithm for constructing a decision tree. The modification of the C4.5 algorithm includes two phases: the first phase discretizes all continuous attributes instead of dealing with numerical values; the second phase uses the average gain measure instead of the gain ratio measure to choose the best attribute. The approach has been evaluated on three data sets, all taken from the popular UCI (University of California at Irvine) data repository. The results obtained from the experiments show that mC4.5 decreases the total number of nodes compared with C4.5 while maintaining, and in most cases increasing, the classification accuracy.
Keywords: Attribute selection measure; C4.5 algorithm; Decision tree
classification.
*Email: [email protected]
Introduction:
Data mining is the process of discovering meaningful new correlations, patterns and trends by
sifting through large amounts of data stored in repositories, using pattern recognition technologies as
well as statistical and mathematical techniques [1]. Classification is a form of data analysis that extracts models describing important data classes. Such models, called classifiers, predict categorical (discrete, unordered) class labels. For example, a classification model can be built to categorize bank loan applications as either safe or risky [2]. Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node [3].
The C4.5 algorithm is an extension of ID3 (Iterative Dichotomiser 3), developed by Ross Quinlan. It is also based on Hunt's algorithm. C4.5 handles both categorical and continuous attributes to build a decision tree. In order to handle continuous attributes, C4.5 splits the attribute values into two partitions based on a selected threshold, such that all values above the threshold form one child and the remaining values form the other. It also handles missing attribute values. C4.5 uses the gain ratio as the attribute selection measure to build the decision tree [4]. An optimization problem of machine learning algorithms consists of finding the parameter settings needed to build accurate, small trees for a given domain. The following points describe the main problems that may arise during decision tree construction for large data sets:
1. Dealing with large data: large data sets increase the size of the search space, and thus increase the chance that the inducer will choose an overfitted classifier that is not valid in general.
2. Attribute selection measure: the information gain measure is the most popular. However, a natural bias of information gain is that it favors attributes with many values, which harms its performance. The gain ratio measure penalizes attributes with many values by incorporating a term called split information. Unfortunately, the gain ratio measure suffers from another unavoidable practical issue: its denominator is sometimes zero or very small.
3. Continuous attributes: if there are N different values of attribute A in the set of instances D, there are N - 1 thresholds that can be used for a test on attribute A. Every threshold gives unique subsets D1 and D2, and so the value of the splitting criterion is a function of the threshold (a sketch of this threshold enumeration follows the list).
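To make point 3 concrete, the following minimal Python sketch enumerates the candidate thresholds of a continuous attribute and scores the binary split induced by each one using information gain. The attribute values and class labels are illustrative toy data, not taken from the paper's experiments.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a collection of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Score every candidate threshold of a continuous attribute (at most N-1 of them)
    by the information gain of the induced binary split and return the best one."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best_thr, best_gain = None, -1.0
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                              # equal adjacent values give no new threshold
        thr = pairs[i][0]                         # split as "value <= thr" vs "value > thr"
        left = [c for v, c in pairs if v <= thr]
        right = [c for v, c in pairs if v > thr]
        gain = base - len(left) / len(pairs) * entropy(left) \
                    - len(right) / len(pairs) * entropy(right)
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr, best_gain

# Illustrative values of a numeric attribute and their class labels:
print(best_threshold([85, 90, 120, 140, 160], ["neg", "neg", "neg", "pos", "pos"]))
```

For this toy input the best threshold falls at 120, exactly between the last "neg" value and the first "pos" value, which is where a clean binary split is possible.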
Related Works:
Several algorithms have been developed for improving decision tree classification of large data sets; a brief survey follows.
Kohavi & John 1995 [5], searched for parameter settings of C4.5 decision trees that would result in
optimal performance on a particular data set. The optimization objective was ‘optimal performance’ of
the tree, i.e., the accuracy measured using 10-fold cross-validation.
Liu X.H. 1998 [6] proposed a new optimized decision tree algorithm. On the basis of ID3, this algorithm considered attribute selection at two levels of the decision tree, and the classification accuracy of the improved algorithm was shown to be higher than that of ID3.
Liu Yuxun & Xie Niuniu 2010 [7] proposed a decision tree algorithm based on attribute importance to address this problem. The improved algorithm uses attribute importance to increase the information gain of attributes with fewer values, and compares ID3 with the improved ID3 through an example. The experimental analysis of the data shows that the improved ID3 algorithm can produce more reasonable and more effective rules. Gaurav & Hitesh 2013 [8] proposed an improved C4.5 algorithm that uses L'Hospital's rule; this simplifies the calculation process and improves the efficiency of the decision-making algorithm. The aim is to implement the algorithm in a space-efficient manner, with response time promoted as the performance measure.
The C4.5 Tree Construction Algorithm
The algorithm builds a decision tree starting from the training set TS, which is a group of instances, or tuples, in the database. Each instance specifies values for a collection of attributes and a class. Each attribute may have either discrete or continuous values, and the special value "unknown" is allowed to denote unspecified values. The class may take only discrete values [9].
The C4.5 algorithm builds a decision tree with a divide-and-conquer strategy. In C4.5, each node in the tree is associated with a set of instances. Instances are also assigned weights that take into account unknown attribute values [10].
Let T be the set of instances associated with the node. The weighted frequency Freq(Ci, T) of the cases in T belonging to each class Ci is computed. If all instances in T belong to the same class Ci (or the number of instances in T is less than a particular value), then the node is a leaf with associated class Ci (respectively, the most frequent class). The classification error of the leaf is the weighted sum of the cases in T that do not belong to that class.
If T contains instances belonging to two or more classes, the information gain of each attribute is calculated. For nominal attributes, the information gain is relative to the splitting of the instances in T into sets with distinct attribute values. For continuous attributes, the information gain is relative to the splitting of T into two subsets, namely, cases with an attribute value not greater than, and cases with an attribute value greater than, a certain local threshold, which is determined during the information gain calculation.
The attribute with the highest information gain is selected as the test at the node. If a continuous attribute is selected, the threshold is computed as the greatest value of the whole training set that does not exceed the local threshold. A decision node has s children T1, ..., Ts, which are the subsets produced by the test on the selected attribute. If a subset Ti is empty, the corresponding child node is directly set to be a leaf, with an associated class (the most frequent class of the parent node) and classification error 0.
If Ti is not empty, the divide-and-conquer approach consists of recursively applying the same operations to the set consisting of Ti plus those cases in T with an unknown value of the selected attribute [11].
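The divide-and-conquer construction described above can be sketched in Python as follows. This is a deliberately simplified skeleton, not the authors' implementation: it handles nominal attributes only, uses plain information gain as the selection measure (gain ratio or average gain could be substituted), and omits C4.5's treatment of continuous thresholds, unknown values, instance weights, and pruning. The toy rows and labels at the end are illustrative.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Information gain of splitting the rows on one nominal attribute."""
    subsets = defaultdict(list)
    for row, c in zip(rows, labels):
        subsets[row[attr]].append(c)
    total = len(labels)
    return entropy(labels) - sum(len(s) / total * entropy(s) for s in subsets.values())

def build_tree(rows, labels, attributes, min_cases=2):
    """Recursively build a decision tree over nominal attributes (simplified C4.5 skeleton)."""
    majority = Counter(labels).most_common(1)[0][0]
    # Leaf: all cases in one class, too few cases, or no attribute left to test.
    if len(set(labels)) == 1 or len(labels) < min_cases or not attributes:
        return majority
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    node = {"attribute": best, "children": {}, "default": majority}
    groups = defaultdict(lambda: ([], []))
    for row, c in zip(rows, labels):
        groups[row[best]][0].append(row)
        groups[row[best]][1].append(c)
    remaining = [a for a in attributes if a != best]
    for value, (sub_rows, sub_labels) in groups.items():
        node["children"][value] = build_tree(sub_rows, sub_labels, remaining, min_cases)
    return node

# Illustrative rows (dicts of nominal attribute values) and class labels:
rows = [
    {"outlook": "sunny", "windy": "false"}, {"outlook": "sunny", "windy": "true"},
    {"outlook": "overcast", "windy": "false"}, {"outlook": "rain", "windy": "false"},
    {"outlook": "rain", "windy": "true"},
]
labels = ["no", "no", "yes", "yes", "no"]
print(build_tree(rows, labels, ["outlook", "windy"]))
```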
Splitting Criteria:
In most decision tree inducers, the splitting functions are univariate, i.e., an internal node is split according to the value of a single attribute [12].
Entropy
Entropy is a measure from information theory that characterizes the impurity of an arbitrary collection of examples. If the target attribute takes on c different values, then the entropy of S relative to this c-wise classification is defined in equation (1):

Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i … (1)

where p_i is the probability that an instance of S belongs to class i. The logarithm is base 2 because entropy is a measure of the expected encoding length in bits [13].
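As a concrete illustration of equation (1), the following small Python sketch computes the entropy of a collection of class labels; the 9-positive/5-negative toy collection is illustrative, not one of the paper's data sets.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a collection of class labels, in bits (equation 1)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

# A toy collection with 9 positive and 5 negative examples:
print(entropy(["pos"] * 9 + ["neg"] * 5))   # about 0.940 bits
```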
Information Gain
Information gain measures the expected reduction in entropy obtained by partitioning the examples according to a given attribute. The information gain Gain(S, A) of an attribute A, relative to the collection of examples S, is defined in equation (2):

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v) … (2)

where Values(A) is the set of all possible values of attribute A, and S_v is the subset of S for which attribute A has value v. This measure can be used to rank attributes and build the decision tree, where each node holds the attribute with the highest information gain among the attributes not yet considered on the path from the root [13].
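Equation (2) can be sketched in the same style. The entropy() helper is repeated so the snippet stands alone, and the outlook/play attribute below is an illustrative toy example, not data from the paper.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)  (equation 2)."""
    subsets = defaultdict(list)
    for v, c in zip(attribute_values, labels):
        subsets[v].append(c)
    total = len(labels)
    return entropy(labels) - sum(len(s) / total * entropy(s) for s in subsets.values())

# Illustrative example: how well does "outlook" separate the class labels?
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(information_gain(outlook, play))   # about 0.667 bits of reduction
```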
Gain Ratio
In order to reduce the effect of the bias resulting from the use of information gain, a variant known as gain ratio was introduced by the Australian academic Ross Quinlan in his influential system C4.5. Gain ratio adjusts the information gain for each attribute to allow for the breadth and uniformity of the attribute values [14]. The default splitting criterion used by C4.5 is the gain ratio, an information-based measure that takes into account different numbers (and different probabilities) of test outcomes [15]. C4.5, a successor of ID3, uses this extension to information gain to attempt to overcome the bias. It applies a kind of normalization to the information gain using a "split information" value, defined analogously to the information gain as:
SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right) … (3)
This value represents the potential information generated by splitting the training data set, D, into v
partitions, corresponding to the v outcomes of a test on attribute A. Note that, for each outcome, it
considers the number of tuples having that outcome with respect to the total number of tuples in D. It
differs from information gain, which measures the information with respect to classification that is
acquired based on the same partitioning. The gain ratio is defined as:
GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)} … (4)
The attribute with the maximum gain ratio is selected as the splitting attribute. Note, however, that
as the split information approaches 0, the ratio becomes unstable. A constraint is added to avoid this,
whereby the information gain of the selected test must be at least as great as the average gain over all tests examined [16].
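Equations (3) and (4) can be sketched as follows. The example at the end uses an assumed "ID-like" attribute with a distinct value per instance to show how the split information damps the inflated gain of many-valued attributes, and the gain_ratio() helper guards against the case where the split information approaches zero.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(attribute_values, labels):
    subsets = defaultdict(list)
    for v, c in zip(attribute_values, labels):
        subsets[v].append(c)
    total = len(labels)
    return entropy(labels) - sum(len(s) / total * entropy(s) for s in subsets.values())

def split_info(attribute_values):
    """SplitInfo_A(D) = -sum_j |D_j|/|D| * log2(|D_j|/|D|)  (equation 3)."""
    total = len(attribute_values)
    return -sum((n / total) * log2(n / total) for n in Counter(attribute_values).values())

def gain_ratio(attribute_values, labels):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)  (equation 4); unstable when SplitInfo -> 0."""
    si = split_info(attribute_values)
    if si == 0:                       # every instance has the same attribute value
        return 0.0
    return information_gain(attribute_values, labels) / si

# An "ID-like" attribute with a distinct value per instance gets a high gain
# but also a high split information, so its gain ratio is damped:
ids  = ["a", "b", "c", "d", "e", "f"]
play = ["no", "no", "yes", "yes", "no", "yes"]
print(information_gain(ids, play), gain_ratio(ids, play))
```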
The mC4.5 Tree Construction Algorithm
The first strategy, mC4.5(1), follows the C4.5 algorithm for the task of building a decision tree. The difference between this strategy and C4.5 is that all continuous attributes are discretized instead of being handled as numerical values. The second strategy, mC4.5(2), likewise follows the C4.5 algorithm; the difference is that the average gain measure is used instead of the gain ratio measure to choose the best attribute.
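The paper does not state which discretization method mC4.5(1) uses, only that continuous attributes are divided into a chosen number of bins (see Table 3). The sketch below therefore assumes simple equal-width binning as one plausible realization; the bin labels and sample values are illustrative.

```python
def equal_width_bins(values, n_bins):
    """Discretize a continuous attribute into n_bins equal-width intervals,
    returning a nominal bin label per value. Equal-width binning is an assumption;
    the paper only states that continuous attributes are discretized into bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0                  # avoid a zero width for constant attributes
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)   # clamp the maximum into the last bin
        labels.append(f"bin_{idx}")
    return labels

# Illustrative use with 3 bins, e.g. for an age-like attribute:
print(equal_width_bins([19, 23, 31, 40, 44, 58, 60], 3))
```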
Architecture of Proposed Approach
The suggested approach consists of two phases: the first phase builds a decision tree using the C4.5 algorithm, while the second phase builds a decision tree using the mC4.5 algorithm.
The first phase (C4.5 algorithm) includes:
Stage 1.1: Selecting a data set as input to the C4.5 algorithm for processing.
Stage 1.2: Calculating the entropy, information gain, and gain ratio of the attributes as in equations (1), (2), (3), and (4).
Stage 1.3: Processing the given input data set using the C4.5 data mining algorithm.
Stage 1.4: Model evaluation: computing the accuracy (using a k-fold cross-validation technique), error rate, tree size, and running time for each process.
The second phase (mC4.5 algorithm) includes:
Stage 2.1: Choosing the same data set used in the first phase.
Stage 2.2: Discretizing all continuous attributes in the data set.
Stage 2.3: Processing the given input data set according to the mC4.5 algorithm.
Stage 2.4: Model evaluation: computing the accuracy (using a k-fold cross-validation method), error rate, tree size, and running time for each process.
The overall structure of the proposed approach is illustrated in figure 1.
Figure 1- Architecture of the Proposed Approach
The Proposed Measure
The average gain is applied as a new strategy for selecting the best attribute, replacing the measure used in the C4.5 algorithm. A notable disadvantage of information gain is that it is biased towards selecting attributes with many values. This motivated Quinlan to present the gain ratio to overcome this disadvantage. But the gain ratio suffers from an unavoidable practical issue: the split information can be zero or very small when an attribute happens to have the same value for nearly all training instances. In response to these facts, this work improves the attribute selection measure simply by applying the average gain for decision tree inductive learning.
This modified attribute selection measure is called the average gain. More precisely, the average gain AverageGain(S, A) of an attribute A, relative to a set of training instances S, is defined by equation (5):

AverageGain(S, A) = \frac{Gain(S, A)}{N} … (5)

where N is the number of values of attribute A and Gain(S, A) is the information gain obtained when attribute A partitions the training instances S.
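A sketch of average-gain attribute selection following equation (5); the candidate attributes below (customer_id, has_contract) are invented for illustration and show how dividing by the number of values counteracts the bias toward many-valued attributes.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(attribute_values, labels):
    subsets = defaultdict(list)
    for v, c in zip(attribute_values, labels):
        subsets[v].append(c)
    total = len(labels)
    return entropy(labels) - sum(len(s) / total * entropy(s) for s in subsets.values())

def average_gain(attribute_values, labels):
    """AverageGain(S, A) = Gain(S, A) / N, with N the number of distinct values of A (equation 5)."""
    n_values = len(set(attribute_values))
    return information_gain(attribute_values, labels) / n_values

def best_attribute(candidates, labels):
    """Pick the attribute with the highest average gain among the candidates."""
    return max(candidates, key=lambda name: average_gain(candidates[name], labels))

# Illustrative candidates: a many-valued attribute vs. a two-valued one.
labels = ["no", "no", "yes", "yes", "no", "yes"]
candidates = {
    "customer_id": ["a", "b", "c", "d", "e", "f"],       # unique per instance
    "has_contract": ["n", "n", "y", "y", "n", "y"],      # two values
}
print(best_attribute(candidates, labels))   # the two-valued attribute wins
```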
Experimental Results
This section contains the experimental results of the proposed approach. The evaluation results obtained from the testing process are presented. In general, the performance of a classifier is mainly measured by the classification accuracy, i.e., how often the classifier correctly predicts the class of a record with an unknown class label, and by the execution time for building the decision tree. The experimental results show that the average gain measure outperforms the gain ratio measure in terms of tree size and training time, while at the same time maintaining, and in some cases exceeding, the classification accuracy obtained with the gain ratio measure. The approach was evaluated on three data sets, all taken from the popular UCI (University of California at Irvine) data repository. Table 1 shows the details of these data files.
Table 1- Data Sets Properties
Data Set Name   | Number of Instances | Number of Attributes | Attribute Types      | Missing Values
Bank Marketing  | 4521                | 13                   | Continuous & Nominal | No
Diabetes        | 768                 | 9                    | Continuous           | No
Credit Approval | 690                 | 16                   | Continuous & Nominal | No
Classifiers Accuracy
Accuracy is a general measure for quantifying the goodness of learning systems. The accuracy of a decision tree is computed using a testing set that is separate from the training set, or by using estimation techniques such as cross-validation. Two-fold cross-validation is applied in this work to evaluate the performance of a decision tree. The basic process is: the complete data set is split into two parts, one part being dedicated to training and the other to testing. The training set is used to train the algorithm and generate the tree, while the testing set is used to evaluate the generated decision tree. This procedure is repeated so that every part of the data set is used for both testing and training. Afterwards, the overall accuracy parameters are calculated as the mean over the individual cross-validation folds. Table 2 shows the model accuracy and error rate in percent for the two classifiers (C4.5 and mC4.5). Figures 2 and 3 show a comparison of classifier accuracy and error rate; in other words, each figure shows the difference in accuracy and error rate for all data sets between the two classification algorithms (C4.5 and mC4.5).
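A minimal sketch of the two-fold cross-validation procedure described above. The train() and predict() callables are placeholders (here a trivial majority-class model) standing in for tree construction and classification; they are assumptions for illustration, not the paper's implementation.

```python
import random

def two_fold_cross_validation(rows, labels, train, predict, seed=0):
    """Split the data into two halves, train on one half and test on the other,
    swap the roles, and return the mean of the two accuracies."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    half = len(indices) // 2
    folds = [indices[:half], indices[half:]]
    accuracies = []
    for test_fold, train_fold in [(folds[0], folds[1]), (folds[1], folds[0])]:
        model = train([rows[i] for i in train_fold], [labels[i] for i in train_fold])
        correct = sum(predict(model, rows[i]) == labels[i] for i in test_fold)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / 2

# Usage sketch with trivially simple placeholder train/predict functions:
train = lambda rows, labels: max(set(labels), key=labels.count)   # majority-class "model"
predict = lambda model, row: model
rows = [{"x": i} for i in range(10)]
labels = ["yes"] * 6 + ["no"] * 4
print(two_fold_cross_validation(rows, labels, train, predict))
```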
Table 2- The Difference in Model Accuracy and Error Rate for All Data Sets
Data Set Name   | Accuracy (%) C4.5 | Accuracy (%) mC4.5(1) | Accuracy (%) mC4.5(2) | Error Rate (%) C4.5 | Error Rate (%) mC4.5(1) | Error Rate (%) mC4.5(2)
Bank Marketing  | 93.98363          | 94.89051              | 94.13846              | 6.01636             | 5.10948                 | 5.86153
Diabetes        | 70.96354          | 71.875                | 72.65625              | 29.03645            | 28.125                  | 27.34375
Credit Approval | 78.98550          | 82.02898              | 84.63768              | 21.01449            | 17.97101                | 15.36231
Figure 2- Comparison of Classifiers Accuracy
Figure 3- Comparison of Classifiers Error Rate
Tree Size
Tree size is a widely used evaluation criterion in decision tree learning. Tree complexity has a crucial effect on accuracy. Generally, decision tree complexity is measured by one of the following metrics: tree depth, the total number of nodes, the total number of leaves, and the number of attributes used. Two metrics are used in this research to measure tree size: the total number of nodes and the total number of leaves. Tree size can be controlled by stopping rules that limit growth. One general stopping rule is to establish a lower limit on the number of instances in a node and not allow splitting below this limit. Another stopping rule is to limit the maximum depth to which a tree may grow. Table 3 shows some of the stopping rules used with all data sets.
Table 3- Parameters to Control Sizes of Tree
Data Set Name   | Minimum number of records in a leaf | Number of bins for discretization
Bank Marketing  | 5                                   | 5
Diabetes        | 3                                   | 2
Credit Approval | 4                                   | 3
Splitting stops, and the search for the best attribute ends, when the number of records at a node falls to the minimum shown in the table (for example, five records for the Bank Marketing data set). The size of the tree is also controlled by the number of bins chosen for discretization: each continuous attribute is divided into a number of groups equal to the number of bins. Table 4 shows the tree complexity of the two classifiers (C4.5 and mC4.5) for all data sets. Figures 4 and 5 show a comparison of the total number of leaves and the total number of nodes.
Table 4- The Difference in Tree Size of All Data Sets
Data Set Name   | Total leaves C4.5 | Total leaves mC4.5(1) | Total leaves mC4.5(2) | Total nodes C4.5 | Total nodes mC4.5(1) | Total nodes mC4.5(2)
Bank Marketing  | 107               | 80                    | 44                    | 165              | 97                   | 56
Diabetes        | 84                | 47                    | 21                    | 167              | 70                   | 31
Credit Approval | 61                | 68                    | 25                    | 92               | 79                   | 30
Figure 4- Comparison of total number of leaves
Figure 5- Comparison of Total Number of Nodes
Conclusions
This work focuses on the C4.5 algorithm. The C4.5 classifier is one of the classic classification algorithms in data mining and one of the most often used decision tree classifiers. The C4.5 algorithm performs well in constructing decision trees and extracting rules from a data set. The objective of this work has been to demonstrate that the technology for building decision trees from examples is fairly robust. The purposes of this work are to increase the classification accuracy and to decrease the total number of nodes and the time needed to build a classification model. Different kinds of data sets are used as input in order to analyze different aspects of the C4.5 algorithm. Cross-validation techniques were adopted in order to evaluate the performance of the C4.5 and mC4.5 algorithms. Two modifications of the C4.5 algorithm are proposed for constructing a decision tree model; some useful conclusions are listed in the following points:
1. A new node-splitting measure has been proposed for decision tree construction. The proposed measure handles the bias towards selecting attributes with many values.
2. The gain ratio measure suffers from the practical issue that its denominator is sometimes zero or very small.
3. Dealing with continuous attributes directly may lead to inappropriate classification.
4. The experimental results show that, for the Bank Marketing data set, the accuracy increased by 1-2% while the tree size dropped dramatically, from 165 to 56 nodes. For the Diabetes data set, the accuracy increased by 1-2% while the tree size decreased from 167 to 31 nodes. For the Credit Approval data set, the accuracy increased by 4-6% while the tree size decreased from 92 to 30 nodes.
References:
1. Daniel T. Larose. 2005. Discovering Knowledge in Data: An Introduction to Data Mining. John Wiley & Sons.
2. Mehmed Kantardzic. 2003. Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons.
3. Sushmita Mitra and Tinku Acharya. 2003. Data Mining: Multimedia, Soft Computing, and Bioinformatics. John Wiley & Sons, Inc.
4. Krzysztof, J.C., Witold, P., Roman, W.S. and Lukasz, A.K. 2007. Data Mining: A Knowledge Discovery Approach. Springer Science+Business Media, LLC.
5. Tea, T. 2007. Optimizing Accuracy and Size of Decision Trees. Department of Intelligent Systems, Jozef Stefan Institute, Jamova 39, SI-1000, pp: 81-84.
6. Weiguo, Y., Jing, D. and Mingyu, L. 2011. Optimization of Decision Tree Based on Variable Precision Rough Set. International Conference on Artificial Intelligence and Computational Intelligence, pp: 148-151.
7. Liu, Y. and Xie, N. 2010. Improved ID3 Algorithm. IEEE, pp: 465-468.
8. Gaurav, L.A. and Hitesh, G. 2013. Optimization of C4.5 Decision Tree Algorithm for Data Mining Application. International Journal of Emerging Technology and Advanced Engineering, 3(3), pp: 341-345.
9. Salvatore, R. 2002. Efficient C4.5. IEEE, 14(2), pp: 438-444.
10. Quinlan, J. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, pp: 1-11.
11. Rong, C. and Lizhen, X. 2009. Improved C4.5 Algorithm for the Analysis of Sales. Information Systems and Applications Conference. IEEE, pp: 173-167.
12. Lior, R. and Oded, M. 2008. Data Mining with Decision Trees: Theory and Applications. Series in Machine Perception and Artificial Intelligence, World Scientific Publishing.
13. Anand, B. 2009. Extension and Evaluation of ID3 - Decision Tree Algorithm. Department of Computer Science, University of Maryland, College Park, pp: 1-8.
14. Max, B. 2007. Principles of Data Mining. Springer-Verlag London Limited.
15. Quinlan, J. 1996. Improved Use of Continuous Attributes in C4.5. Journal of Artificial Intelligence Research, 3(96), pp: 77-90.
16. Jiawei, H., Micheline, K. and Jian, P. 2012. Data Mining: Concepts and Techniques. 3rd Ed. Morgan Kaufmann.