Modified Decision Tree Classification Algorithm for Large Data Sets

Ihsan A. Kareem*, Mehdi G. Duaimi
Department of Computer Science, College of Science, University of Baghdad, Baghdad, Iraq.
Iraqi Journal of Science, 2014, Vol. 55, No. 4A, pp: 1638-1645
*Email: [email protected]

Abstract
A decision tree is an important classification technique in data mining. Decision trees have proved to be valuable tools for the classification, description, and generalization of data. Work on building decision trees for data sets exists in multiple disciplines such as signal processing, pattern recognition, decision theory, statistics, machine learning, and artificial neural networks. This research deals with the problem of finding the parameter settings of a decision tree algorithm that produce accurate, small trees and reduce execution time for a given domain. The proposed approach (mC4.5) is a supervised learning model based on the C4.5 algorithm for constructing a decision tree. The modification of C4.5 comprises two phases: the first phase discretizes all continuous attributes instead of dealing with numerical values directly; the second phase uses the average gain measure instead of the gain ratio measure to choose the best attribute. The approach has been experimented on three data sets, all picked from the popular UCI (University of California at Irvine) data repository. The results obtained from the experiments show that mC4.5 is better than C4.5: it decreases the total number of nodes while at the same time increasing the classification accuracy.

Keywords: Attribute selection measure; C4.5 algorithm; Decision tree classification.

Introduction:
Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques [1]. Classification is a form of data analysis that extracts models describing important data classes. Such models, called classifiers, predict categorical (discrete, unordered) class labels. For example, a classification model can be built to categorize bank loan applications as either safe or risky [2].
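To make the idea of a classifier concrete, the sketch below hand-codes a tiny decision tree for the bank-loan example. It is an illustration only: the attribute names, thresholds, and labels are hypothetical and do not come from the paper.

```python
def classify_loan(applicant: dict) -> str:
    # Each "if" plays the role of an internal node testing one attribute;
    # each returned label plays the role of a leaf. Attribute names and
    # thresholds are hypothetical, chosen only for illustration.
    if applicant["income"] <= 30000:            # root node: test on income
        return "risky"
    if applicant["credit_history"] == "poor":   # second internal node
        return "risky"
    return "safe"

print(classify_loan({"income": 45000, "credit_history": "good"}))  # -> safe
```

A decision tree learner such as C4.5 induces this kind of nested test structure automatically from labeled examples rather than having it written by hand.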
Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node [3].
The C4.5 algorithm is an extension of ID3 (Iterative Dichotomiser 3) developed by Ross Quinlan, and it is also based on Hunt's algorithm. C4.5 handles both categorical and continuous attributes when building a decision tree. To handle a continuous attribute, C4.5 splits the attribute values into two partitions based on a selected threshold, such that all values above the threshold go to one child and the remaining values to the other. It also handles missing attribute values. C4.5 uses the gain ratio as the attribute selection measure for building the tree [4].
An optimization problem of machine learning algorithms consists of finding the parameter settings that build accurate and small trees for a given domain. The following points explain the main problems that may arise in decision tree construction for large data sets:
1. Dealing with large data: large data sets increase the size of the search space, and thus increase the chance that the inducer will choose an overfitted classifier that is not valid in general.
2. Attribute selection measure: the information gain measure is the most popular. However, a natural bias of information gain is that it favors attributes with many values, which harms its performance. The gain ratio measure penalizes attributes with many values by incorporating a term called split information. Unfortunately, the gain ratio measure suffers from another, practically unavoidable, issue: its denominator is sometimes zero or very small.
3. Continuous attributes: if there are N different values of attribute A in the set of instances D, there are N - 1 thresholds that can be used for a test on attribute A. Every threshold gives unique subsets D1 and D2, so the value of the splitting criterion is a function of the threshold (a sketch of this enumeration is given below).
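The following sketch makes problem 3 concrete by enumerating the candidate thresholds for one continuous attribute. It assumes a test of the form A <= t in which each distinct value except the largest can serve as a threshold; the sample data are invented.

```python
def candidate_thresholds(values):
    # With N distinct values there are N - 1 usable thresholds: a test
    # A <= t at each distinct value except the largest yields a unique
    # partition (D1, D2) of the instances.
    distinct = sorted(set(values))
    return distinct[:-1]

def split_on(values, t):
    # The two subsets induced by threshold t: D1 = {x <= t}, D2 = {x > t}.
    return [x for x in values if x <= t], [x for x in values if x > t]

ages = [23, 31, 31, 45, 52]                  # N = 4 distinct values
print(candidate_thresholds(ages))            # [23, 31, 45] -> 3 thresholds
print(split_on(ages, 31))                    # ([23, 31, 31], [45, 52])
```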
Related Works:
Several algorithms have been developed to improve decision tree classification for large data sets.
Kohavi & John 1995 [5] searched for parameter settings of C4.5 decision trees that would result in optimal performance on a particular data set. The optimization objective was 'optimal performance' of the tree, i.e., the accuracy measured using 10-fold cross-validation.
Liu X.H 1998 [6] proposed a new optimized decision tree algorithm. Building on ID3, this algorithm considered attribute selection over two levels of the decision tree, and the classification accuracy of the improved algorithm was shown to be higher than that of ID3.
Liu Yuxun & Xie Niuniu 2010 [7] proposed a decision tree algorithm based on attribute importance to address the bias problem. The improved algorithm uses attribute importance to increase the information gain of attributes that have fewer values, and compares ID3 with the improved ID3 by example. The experimental analysis of the data shows that the improved ID3 algorithm can produce more reasonable and more effective rules.
Gaurav & Hitesh 2013 [8] proposed a C4.5 algorithm improved by the use of L'Hospital's rule, which simplifies the calculation process and improves the efficiency of decision-making algorithms. The aim is to implement the algorithms in a space-effective manner, with the response time of the application promoted as the performance measure.

The C4.5 Tree Construction Algorithm
The algorithm builds a decision tree starting from the training set TS, which is a group of instances (records) in the database. Each instance specifies values for a collection of attributes and a class. Each attribute may have either discrete or continuous values; in addition, the special value "unknown" is allowed to denote unspecified values. The class may have only discrete values [9].
The C4.5 algorithm builds a decision tree with a divide-and-conquer strategy. In C4.5, each node in the tree is associated with a set of instances, and instances are assigned weights to take unknown attribute values into account [10]. Let T be the set of instances associated with a node. The weighted frequency Freq(Ci, T) of the cases in T belonging to class Ci is computed. If all instances in T belong to the same class Ci (or the number of instances in T is less than a particular value), then the node is a leaf with associated class Ci (respectively, the most frequent class). The classification error of the leaf is the weighted sum of the cases in T that do not belong to its class. If T contains instances belonging to two or more classes, the information gain of each attribute is calculated. For nominal attributes, the information gain is relative to the splitting of the instances in T into sets with distinct attribute values. For continuous attributes, the information gain is relative to the splitting of T into two subsets, namely, cases with an attribute value not greater than, and cases with an attribute value greater than, a certain local threshold, which is determined during the information gain calculation. The attribute that achieves the highest information gain is selected for the test at the node. If a continuous attribute is selected, the threshold is computed as the greatest value of the whole training set that is below the local threshold. A decision node has s children T1, ..., Ts, the sets produced by splitting T on the selected attribute. If Ti is empty, the child node is directly set to be a leaf, with the most frequent class of the parent node as its associated class and with classification error 0. If Ti is not empty, the divide-and-conquer approach consists of recursively applying the same operations to the set consisting of Ti plus those cases in T with an unknown value of the selected attribute [11].

Splitting Criteria:
In most decision tree inducers, discrete splitting functions are univariate, i.e., an internal node is split according to the value of a single attribute [12].

Entropy
Entropy is a measure from information theory that characterizes the impurity of an arbitrary collection of examples. If the target attribute takes on c different values, then the entropy of S relative to this c-wise classification is defined in equation (1):

Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i ... (1)

where p_i is the probability that an instance of S belongs to class i. The logarithm is base 2 because entropy is a measure of the expected encoding length measured in bits [13].
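As a minimal sketch, equation (1) translates directly into a few lines of Python; the label lists below are invented examples.

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Equation (1): Entropy(S) = -sum_i p_i * log2(p_i),
    # where p_i is the proportion of labels belonging to class i.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["safe"] * 9 + ["risky"] * 5))   # ~0.940 bits
print(entropy(["safe"] * 7 + ["risky"] * 7))   # 1.0 bit (maximal impurity)
```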
Information Gain
Information gain measures the expected reduction in entropy obtained by partitioning the examples according to a given attribute. The information gain Gain(S, A) of an attribute A, relative to the collection of examples S, is defined in equation (2):

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v) ... (2)

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v. We can use this measure to rank attributes and build the decision tree, where at each node is located the attribute with the highest information gain among the attributes not yet considered on the path from the root [13].

Gain Ratio
In order to reduce the bias resulting from the use of information gain, a variant known as the gain ratio was introduced by the Australian academic Ross Quinlan in his influential system C4.5. The gain ratio adjusts the information gain for each attribute to allow for the breadth and uniformity of the attribute values [14]. The default splitting criterion used by C4.5 is the gain ratio, an information-based measure that takes into account different numbers (and different probabilities) of test outcomes [15]. C4.5, a successor of ID3, uses this extension of information gain in an attempt to overcome the bias. It applies a kind of normalization to information gain using a "split information" value, defined analogously to information gain as:

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|} ... (3)

This value represents the potential information generated by splitting the training data set D into v partitions, corresponding to the v outcomes of a test on attribute A. Note that, for each outcome, it considers the number of tuples having that outcome with respect to the total number of tuples in D. It therefore differs from information gain, which measures the information with respect to classification that is acquired based on the same partitioning. The gain ratio is defined as:

GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)} ... (4)

The attribute with the maximum gain ratio is selected as the splitting attribute. Note, however, that as the split information approaches 0, the ratio becomes unstable. A constraint is added to avoid this, whereby the information gain of the selected test must be at least as great as the average gain over all tests examined [16].
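A sketch of equations (2)-(4) for a nominal attribute follows. Instances are represented as plain dicts (an assumption of this sketch, not the paper's data format), and the guard against a zero split information is illustrative only; C4.5's actual safeguard is the average-gain constraint described above.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_and_ratio(instances, attr, label_key="class"):
    # Partition the instances by their value of `attr` (nominal case).
    parts = defaultdict(list)
    for x in instances:
        parts[x[attr]].append(x[label_key])
    labels = [x[label_key] for x in instances]
    n = len(instances)
    # Equation (2): expected entropy reduction from the split.
    gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in parts.values())
    # Equation (3): potential information of the partition itself.
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in parts.values())
    # Equation (4): unstable as split_info -> 0; this guard is illustrative.
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

data = [{"outlook": "sunny", "class": "no"}, {"outlook": "sunny", "class": "no"},
        {"outlook": "overcast", "class": "yes"}, {"outlook": "rain", "class": "yes"}]
print(gain_and_ratio(data, "outlook"))   # (1.0, ~0.667) on this toy data
```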
The mC4.5 Tree Construction Algorithm
The first strategy, mC4.5(1), is the same as the C4.5 algorithm for the task of building a decision tree. The difference between this strategy and C4.5 is the discretization of all continuous attributes instead of dealing with numerical values. The second strategy, mC4.5(2), is likewise the same as the C4.5 algorithm; the difference is the use of the average gain measure instead of the gain ratio measure to choose the best attribute.

Architecture of Proposed Approach
The suggested approach consists of two phases: the first phase builds a decision tree using the C4.5 algorithm, while the second phase builds a decision tree using the mC4.5 algorithm.
The first phase (C4.5 algorithm) includes:
Stage 1.1: Selecting a data set as input to the C4.5 algorithm for processing.
Stage 1.2: Calculating the entropy, information gain, and gain ratio of the attributes as in equations (1), (2), (3), and (4).
Stage 1.3: Processing the given input data set with the C4.5 data mining algorithm.
Stage 1.4: Model evaluation; computing the accuracy (using a k-fold cross-validation technique), error rate, tree size, and running time of each process.
The second phase (mC4.5 algorithm) includes:
Stage 2.1: Choosing the same data set used in the first phase.
Stage 2.2: Discretizing all continuous attributes in the data set.
Stage 2.3: Processing the given input data set according to the mC4.5 algorithm.
Stage 2.4: Model evaluation; computing the accuracy (using a k-fold cross-validation method), error rate, tree size, and running time of each process.
The overall structure of the proposed approach is illustrated in figure 1.

Figure 1 - Architecture of the Proposed Approach

The Proposed Measure
The average gain is a new strategy for selecting the best attribute, applied in place of the measure used in the C4.5 algorithm. A notable disadvantage of information gain is that it is biased toward selecting attributes with many values; this motivated Quinlan to present the gain ratio to overcome that disadvantage. But the gain ratio suffers from an unavoidable practical issue: the split information can be zero or very small when some attribute happens to have the same value for nearly all training instances. Responding to these facts, this work improves the attribute selection measure simply by applying the average gain to decision tree inductive learning. The modified attribute selection measure is called the average gain. More exactly, the average gain AverageGain(S, A) of an attribute A, relative to a set of training instances S, is defined by equation (5):

AverageGain(S, A) = \frac{Gain(S, A)}{N} ... (5)

where N is the number of values of attribute A and Gain(S, A) is the information gain obtained when attribute A partitions the training instances S.
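The proposed measure is a small change to the gain-ratio sketch above: divide the information gain by the number of distinct values N rather than by the split information. A sketch under the same dict-based data representation:

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def average_gain(instances, attr, label_key="class"):
    # Equation (5): AverageGain(S, A) = Gain(S, A) / N, where N is the
    # number of values of A (here, those observed in S). Unlike SplitInfo,
    # N is never zero, which sidesteps the unstable denominator of the
    # gain ratio.
    parts = defaultdict(list)
    for x in instances:
        parts[x[attr]].append(x[label_key])
    labels = [x[label_key] for x in instances]
    n = len(instances)
    gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in parts.values())
    return gain / len(parts)

# mC4.5(2) selects the attribute maximizing this measure, e.g.:
# best = max(attrs, key=lambda a: average_gain(data, a))
```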
Experimental Results
This section presents the experimental results of the proposed approach and the evaluation results obtained from the testing process. In general, the performance of a classifier is mainly measured by the classification accuracy, i.e., how often the prediction is correct for a record with an unknown class label, and by the execution time for building the decision tree. The experimental results show that the average gain measure safely outperforms the gain ratio measure in terms of tree size and training time, while at the same time maintaining a classification accuracy higher than that of the gain ratio measure. The approach was experimented on three data sets, all picked from the popular UCI (University of California at Irvine) data repository. Table 1 shows the details of these data files.

Table 1 - Data Sets Properties
Data Set Name     Number of Instances   Number of Attributes   Attribute Types        Missing Values
Bank Marketing    4521                  13                     Continuous & Nominal   No
Diabetes          768                   9                      Continuous             No
Credit Approval   690                   16                     Continuous & Nominal   No

Classifiers Accuracy
Accuracy is a general measure for quantifying the goodness of learning systems. The accuracy of a decision tree is computed using a testing set that is separate from the training set, or by using estimation techniques such as cross-validation. Two-fold cross-validation is applied in this work to evaluate the performance of a decision tree. The basic process is as follows: the complete data set is split into two parts, one part dedicated to training and the other to testing. The training set is used to learn the algorithm and generate the tree, while the testing set is used to evaluate the generated decision tree. This procedure is repeated so that every part of the data set is used as both testing data and training data. Afterwards, the overall accuracy parameters are calculated as means over the individual cross-validation subsets. Table 2 shows the model accuracy and error rate, in percent, for the two classifiers (C4.5 and mC4.5). Figures 2 and 3 show a comparison of classifier accuracy and error rate; in other words, each figure shows the difference in accuracy and error rate between the two classification algorithms (C4.5, mC4.5) for all data sets.

Table 2 - The Difference in Model Accuracy and Error Rate for All Data Sets
                  Accuracy (%)                       Error Rate (%)
Data Set Name     C4.5       mC4.5(1)   mC4.5(2)     C4.5       mC4.5(1)   mC4.5(2)
Bank Marketing    93.98363   94.89051   94.13846     6.01636    5.10948    5.86153
Diabetes          70.96354   71.875     72.65625     29.03645   28.125     27.34375
Credit Approval   78.98550   82.02898   84.63768     21.01449   17.97101   15.36231

Figure 2 - Comparison of Classifiers Accuracy
Figure 3 - Comparison of Classifiers Error Rate

Tree Size
Tree size is a widely used evaluation criterion in decision tree learning, and tree complexity has a crucial effect on accuracy. Generally, decision tree complexity is measured by one of the following metrics: tree depth, the total number of nodes, the total number of leaves, and the number of attributes used. Two metrics are used in this research to measure the tree size: the total number of nodes and the total number of leaves. Tree size can be controlled by stopping rules that limit growth. One general stopping rule is to establish a lower limit on the number of instances in a node and not allow splitting below this limit. Another stopping rule is to limit the maximum depth to which a tree may grow. Table 3 shows the stopping rules used with all data sets.

Table 3 - Parameters to Control Sizes of Tree
Data Set Name     Minimum number of records in a leaf   Number of bins for discretization
Bank Marketing    5                                     5
Diabetes          3                                     2
Credit Approval   4                                     3

Splitting stops, and the search for the best attribute ends, when the number of records in a node falls to the configured minimum (five for the bank marketing data set). The tree size is also controlled through the discretization process, which uses a fixed number of bins: each continuous attribute is divided into as many groups as there are bins (a binning sketch follows).
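The paper fixes the number of bins per data set (Table 3) but does not name the binning scheme; the sketch below assumes simple equal-width binning, which is only one plausible choice.

```python
def discretize_equal_width(values, n_bins):
    # Map each continuous value to a bin index 0 .. n_bins - 1 by cutting
    # the observed range [min, max] into n_bins equal-width intervals.
    # Equal-width binning is an assumption; the paper does not specify it.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # constant attribute -> single bin 0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [23, 31, 31, 45, 52]
print(discretize_equal_width(ages, 3))   # [0, 0, 0, 2, 2]
```

After this mapping the attribute can be treated as nominal with n_bins values, which is what the first phase of mC4.5 requires.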
Table 4 shows the tree complexity of the two classifiers (C4.5 and mC4.5) for all data sets. Figures 4 and 5 show a comparison of the total number of leaves and the total number of nodes.

Table 4 - The Difference in Tree Size of All Data Sets
                  Total number of leaves             Total number of nodes
Data Set Name     C4.5   mC4.5(1)   mC4.5(2)         C4.5   mC4.5(1)   mC4.5(2)
Bank Marketing    107    80         44               165    97         56
Diabetes          84     47         21               167    70         31
Credit Approval   61     68         25               92     79         30

Figure 4 - Comparison of Total Number of Leaves
Figure 5 - Comparison of Total Number of Nodes

Conclusions
This work focuses on the C4.5 algorithm. The C4.5 classifier is one of the classic classification algorithms in data mining and among the most often used decision tree classifiers; it performs well in constructing decision trees and extracting rules from a data set. The objective of this work has been to demonstrate that the technology for building decision trees from examples is fairly robust. The purposes of this work are to increase the classification accuracy and to decrease the total number of nodes and the time needed to build a classification model. Different kinds of data sets are used as input in order to analyze the different aspects of the C4.5 algorithm, and cross-validation techniques are adopted to evaluate the performance of the C4.5 and mC4.5 algorithms. Two modifications to the C4.5 algorithm are proposed for constructing a decision tree model; some useful conclusions are listed in the following points:
1. A new node splitting measure has been proposed for decision tree construction. The proposed measure handles the bias toward selecting attributes with many values.
2. The gain ratio measure suffers from the practical issue that its denominator is sometimes zero or very small.
3. Dealing with continuous attributes directly may lead to inappropriate classification.
4. The experimental results reveal the accuracy and tree size for each data set. For the bank marketing data set, the accuracy is increased by 1-2%, while the tree size is decreased dramatically, dropping from 165 to 56 nodes. For the diabetes data set, the accuracy is increased by 1-2%, while the tree size is decreased from 167 to 31 nodes. For the credit approval data set, the accuracy is increased by 4-6%, while the tree size is decreased from 92 to 30 nodes.

References:
1. Larose, D.T. 2005. Discovering Knowledge in Data: An Introduction to Data Mining. John Wiley & Sons.
2. Kantardzic, M. 2003. Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons.
3. Mitra, S. and Acharya, T. 2003. Data Mining: Multimedia, Soft Computing, and Bioinformatics. John Wiley & Sons, Inc.
4. Cios, K.J., Pedrycz, W., Swiniarski, R.W. and Kurgan, L.A. 2007. Data Mining: A Knowledge Discovery Approach. Springer Science+Business Media, LLC.
5. Tea, T. 2007. Optimizing Accuracy and Size of Decision Trees. Department of Intelligent Systems, Jozef Stefan Institute, Jamova 39, SI-1000 Ljubljana, pp: 81-84.
6. Weiguo, Y., Jing, D. and Mingyu, L. 2011. Optimization of Decision Tree Based on Variable Precision Rough Set. International Conference on Artificial Intelligence and Computational Intelligence, pp: 148-151.
7. Liu, Y. and Xie, N. 2010. Improved ID3 Algorithm. IEEE, pp: 465-468.
8. Agrawal, G.L. and Gupta, H. 2013. Optimization of C4.5 Decision Tree Algorithm for Data Mining Application. International Journal of Emerging Technology and Advanced Engineering, 3(3), pp: 341-345.
9. Ruggieri, S. 2002. Efficient C4.5. IEEE Transactions on Knowledge and Data Engineering, 14(2), pp: 438-444.
10. Quinlan, J.R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, pp: 1-11.
11. Rong, C. and Lizhen, X. 2009. Improved C4.5 Algorithm for the Analysis of Sales. Information Systems and Applications Conference, IEEE, pp: 173-167.
12. Rokach, L. and Maimon, O. 2008. Data Mining with Decision Trees: Theory and Applications. Series in Machine Perception and Artificial Intelligence, World Scientific Publishing.
13. Anand, B. 2009. Extension and Evaluation of ID3 - Decision Tree Algorithm. Department of Computer Science, University of Maryland, College Park, pp: 1-8.
14. Bramer, M. 2007. Principles of Data Mining. Springer-Verlag London Limited.
15. Quinlan, J.R. 1996. Improved Use of Continuous Attributes in C4.5. Journal of Artificial Intelligence Research, 4, pp: 77-90.
16. Han, J., Kamber, M. and Pei, J. 2012. Data Mining: Concepts and Techniques. 3rd Ed. Morgan Kaufmann.