Construction of a Decision Tree Using Incremental Learning in a Banking System

Iman Kohyarnejadfard* and Farhad Mazareinezhad
[email protected], Iran University of Science and Technology
[email protected], Shiraz University

Abstract
The decision tree is an important tool in classification and data mining, and various algorithms have been introduced to build decision trees. Owing to developments in electronic banking, recording transaction information has become easier; by analyzing this information, banks can identify customers and allocate resources optimally to profitable customers, improving banking productivity. Since the funds used for granting facilities carry financial costs, lending conditions must be economical for both banks and depositors. Given the currently high volume of bank facilities, their repayment is a great challenge for banks. Clustering data into appropriate classes or categories is one of the important subjects in pattern recognition, and minimizing the number of incorrectly classified samples is essential. There is a variety of clustering methods, each with its own advantages and disadvantages. In this paper, we propose a method that builds a decision tree using multivariate decision functions. The linear combination of decision variables in a decision tree allows the representation of multivariate equations, which yield higher accuracy and a more compact tree than univariate decision functions. Linear programming can be used to find the linear combination that separates two classes at a node of the decision tree.

Key words: data mining, clustering, decision tree, incremental tree learning

1 Introduction
The use of data mining techniques for analyzing customer behavior is a permanent procedure in the global economy.
The analysis and understanding of customer behavior is one of the principles of developing banks' competitive strategies, leading to the attraction and retention of potential customers and the maximization of customer value. Data mining is a technique that helps banks decide which customers to target. Advances in information technology in recent years have transformed relationship marketing into an undeniable reality. Technologies such as data warehouses, data mining, and competition-management software have made customer relationship management an arena for competition. In particular, by extracting hidden patterns from a large database via data mining, organizations can identify valuable customers and predict their future behavior [9,10]. Since lending and granting credit cards are behaviors that carry risk rather than certainty, this paper attempts to rate customer credit using data mining methods. Taking the degree of risk into account in banks' lending decisions is necessary and critical [2,8]: if the bank-customer relationship is not based on the degree of risk, excessive collateral may be demanded from a good customer with a valid account, while a loan or credit may be granted to another customer with low credibility. For this purpose, managers should measure and grade the creditworthiness of real and legal persons, in order to quantify the risk of lending and to set bank-customer relations according to each customer's risk level [2,8]. Decision tree learning is one of the most common and most practical learning methods; it approximates discrete-valued functions. A decision tree is a tree structure together with a set of inference procedures.
The decision tree is an important tool in classification and data mining because its tree structure allows simple inference. Various algorithms have been proposed to construct decision trees, the majority of which use greedy methods [5]. Most algorithms developed for learning decision trees, such as ID3, CART, and C4.5, are variants of a basic algorithm that performs a top-down greedy search through the space of all decision trees. In contrast, multivariate decision functions can be used; these reduce interpretability but create a simpler tree with fewer nodes [3,4]. For this purpose, linear programming can be used to find the best combination of the features at a node of the tree. The linear combination of decision variables allows the representation of multivariate equations, which can increase accuracy and decrease the number of tree nodes [7].

2 The decision tree
In artificial intelligence, trees are used to represent various concepts such as the structure of sentences, equations, game states, and so on. Decision tree learning is a way of approximating objective functions with discrete values. This method is robust to noisy data and can learn disjunctions of conjunctive predicates. It is one of the most popular inductive learning algorithms and has been applied successfully in various applications. A decision tree classifies samples by routing them from the root downward until they reach leaf nodes. Each internal (non-leaf) node is associated with a feature and poses a question about the input; each internal node has as many branches as there are possible answers to that question. The leaves of the tree are labeled with a class or category of answers.
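As a concrete illustration of how a tree with multivariate (linear) decision functions classifies a sample, the following sketch walks such a tree from the root to a leaf. The node structure and labels here are hypothetical, not the paper's implementation:

```python
# Hypothetical sketch: classifying a sample with an oblique (multivariate)
# decision tree, where each internal node tests a linear function w.x <= gamma.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Node:
    w: Optional[np.ndarray] = None   # weights of the separating plane (internal node)
    gamma: float = 0.0               # threshold: go left if w.x <= gamma
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None      # class label (leaf node)

def classify(node: Node, x: np.ndarray) -> str:
    """Walk from the root down to a leaf and return its class label."""
    while node.label is None:
        node = node.left if node.w @ x <= node.gamma else node.right
    return node.label

# Tiny hand-built tree: a single internal node splitting on x0 + x1 <= 1.
root = Node(w=np.array([1.0, 1.0]), gamma=1.0,
            left=Node(label="bad credit"), right=Node(label="good credit"))
print(classify(root, np.array([0.2, 0.3])))   # falls in the left half-space
```

Each internal node asks one "question" (the sign of a linear function), and each answer selects one branch, exactly as described above for the univariate case.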
The advantages of using decision trees instead of other techniques are: 1) the produced rules are extractable and understandable; 2) the decision tree can work with both continuous and discrete data; 3) it uses simple decision regions; 4) unnecessary comparisons are omitted in this structure; 5) different features are used for different samples; 6) data preparation for a decision tree is simple or unnecessary; 7) a decision tree model can be validated using statistical tests; 8) decision tree structures can analyze large data sets in a short time, and the decision tree can identify differences among subgroups.

2.1 The Linear Programming Approach
The optimal linear combination of variables defines a plane that minimizes an error measure over the classification of the samples [6]. In other words, we use linear programming in building the decision tree to minimize the misclassification error. In the remainder of this section, we formulate the problem of finding this plane as a problem solvable by linear programming methods. First, we describe the notation used in this paper. Each sample is represented by a variable x, a vector in the n-dimensional space R^n, where n is the number of features defined for the sample. Assume we have two disjoint sets A and B defined in the n-dimensional space: A is represented by an m × n matrix and B by a k × n matrix, where m and k are the numbers of samples belonging to classes A and B, respectively.

2.1.1 Formulating the problem using linear programming
If the sets A and B are linearly separable, the goal is to find a plane that separates the two sets, that is, Aω > eγ and Bω < eγ.
Here ω is an n-dimensional vector giving the weights of the plane's variables, γ is a real number serving as a threshold, so that the plane is ωx = γ, and e denotes a vector of ones of appropriate dimension. If a sample A_i in the set A is classified correctly, then −A_iω + γ < 0, i.e. (−A_iω + γ)_+ = 0, and otherwise (−A_iω + γ)_+ > 0, where (·)_+ = max(·, 0); the analogous expression (B_iω − γ)_+ holds for the set B [7]. This leads to the minimization problem

  min_{ω,γ} (1/m) ‖(−Aω + eγ)_+‖_1 + (1/k) ‖(Bω − eγ)_+‖_1    (1)

under the condition that ω ≠ 0. The condition ω ≠ 0 is essential, because without it ω = 0, γ = 0 is a feasible solution that does not define any plane. However, this condition takes the problem outside linear programming. To resolve this, the problem is reformulated as

  min_{ω,γ} (1/m) ‖(−Aω + eγ + e)_+‖_1 + (1/k) ‖(Bω − eγ + e)_+‖_1    (2)

This problem always produces a plane ωx = γ that correctly recognizes all samples of the classes A and B, provided the two classes are linearly separable; otherwise, the obtained plane separates the samples of the two classes with minimum average error. Moreover, to ensure that no sample lies exactly on the obtained plane, equation (2) can be rewritten sample-wise as

  min_{ω,γ} (1/m) Σ_{i=1}^{m} (−A_iω + γ + 1)_+ + (1/k) Σ_{i=1}^{k} (B_iω − γ + 1)_+    (3)

For example, for the samples shown in Fig. 1 in two-dimensional space, the plane found in this way is a line that separates the black samples from the white ones; its equation is shown in the figure.

Fig. 1 Separation of two classes using linear programming [7]

The equation of that line is obtained by solving the problem:

  [x, y] [ω_1, ω_2]^T = γ,  i.e.  ω_1 x + ω_2 y = γ    (4)

Fig. 2 shows the same operation for samples that are not linearly separable; the minimum-error plane is drawn.

Fig. 2 Linear combination of variables using linear programming [7]

2.1.2 Construction of the Decision Tree
There are two main approaches to building the decision tree.
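The plane-finding step can be posed as a standard linear program by introducing non-negative slack variables for the misclassification terms. The following is a minimal sketch (not the paper's code) using SciPy's `linprog`, with the convention that class A should satisfy Aω > γ and class B should satisfy Bω < γ; the per-class averaging of the slacks mirrors the 1/m and 1/k weights above:

```python
# Sketch of the plane-finding step as a linear program, solved with SciPy.
# Variable vector v = [w (n entries), gamma, y (m entries), z (k entries)],
# where y, z are non-negative misclassification slacks for classes A and B.
import numpy as np
from scipy.optimize import linprog

def separating_plane(A, B):
    m, n = A.shape
    k, _ = B.shape
    # objective: (1/m) * sum(y) + (1/k) * sum(z)
    c = np.concatenate([np.zeros(n + 1), np.full(m, 1.0 / m), np.full(k, 1.0 / k)])
    # constraints: -A w + gamma + 1 <= y   and   B w - gamma + 1 <= z
    ub1 = np.hstack([-A, np.ones((m, 1)), -np.eye(m), np.zeros((m, k))])
    ub2 = np.hstack([B, -np.ones((k, 1)), np.zeros((k, m)), -np.eye(k)])
    A_ub = np.vstack([ub1, ub2])
    b_ub = -np.ones(m + k)
    bounds = [(None, None)] * (n + 1) + [(0, None)] * (m + k)  # w, gamma free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    w, gamma = res.x[:n], res.x[n]
    return w, gamma

# Two small linearly separable point clouds in the plane (toy data).
A = np.array([[2.0, 2.0], [3.0, 2.5], [2.5, 3.0]])    # class A: want A w > gamma
B = np.array([[0.0, 0.0], [0.5, -0.5], [-1.0, 0.5]])  # class B: want B w < gamma
w, gamma = separating_plane(A, B)
print((A @ w > gamma).all(), (B @ w < gamma).all())
```

When the two sets are linearly separable, the optimal objective value is zero and every sample ends up strictly on its own side of the plane; otherwise the slacks report the minimum average violation.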
In the first, incremental, approach, the tree is built by inserting samples one at a time; as mentioned above, every internal node of the decision tree contains a linear function obtained by solving the sample-separation problem with linear programming. We first place the first sample in the first node, so the resulting tree has a single node labeled with the class of the samples placed in it. Further samples are then inserted into the tree one by one until the combination of samples placed in a node violates the problem condition, namely the required accuracy of each node. For example, if we require 100 percent accuracy at each node, the misclassification rate of each node must be zero, so a leaf cannot accept any sample whose class differs from the leaf's label. When an incoming sample causes such a violation, the tree structure must be examined and, if necessary, updated. We then face a two-class separability problem over that node and the samples placed in it, which, as mentioned before, we solve as a linear program. Two cases are possible. In the first, the samples are linearly separable: the leaf node is replaced by an internal node whose decision function is the obtained line equation, together with two leaves, each labeled with the class of the samples routed to it. In the second case, the samples are not linearly separable: the plane with minimum classification error is found and placed in the root of the subtree, and the samples are then routed into the left or right subtree accordingly. The algorithm is repeated for each left and right subtree until the samples placed in every subtree are linearly separable. The process continues until all samples have been inserted into the tree.
Insert the new sample into the tree
If an error occurred in classification:
    While the classification is not yet correct:
        Find the parent node
        Current height = current height of the sub-tree
        Solve the linear programming problem with all contained samples and build a new tree
        If all samples are classified correctly:
            Replace the sub-tree with the new tree
            Exit the loop
Fig. 3 Proposed algorithm

2.2 The incremental approach to building the tree
Most tree induction algorithms use batch training: if new samples arrive after the tree has been built, the whole tree must be rebuilt to accommodate them. In contrast, a useful feature of competing structures such as neural networks is their capacity for incremental construction: the network weights can be readjusted after training so that new samples are also used. Furthermore, in many cases the information required for learning is not available from the beginning and new data arrive over time, so the decision structures must be revised. Considering this, the algorithm presented in the previous section can be modified so that, when a new sample disagrees with the label of the leaf it reaches, we examine the parent node recursively: using only the samples placed under that parent, we try to modify the coefficients of the plane so that it also classifies the new sample correctly. If this is not possible, we move to the next parent and examine the problem again. If no parent node up to and including the root admits such a modification, we rebuild the whole tree. At each step of this revision, we search only among the samples seen under the same parent node.
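The insertion loop of Fig. 3 can be rendered schematically as follows. This is pseudocode in Python syntax, not a runnable implementation: `solve_lp_tree` stands in for the linear-programming tree builder of Section 2.1, and the node methods (`find_leaf`, `collect_samples`, `replace_with`, etc.) are hypothetical names:

```
# Schematic rendering of the incremental insertion loop (Fig. 3).
def insert_sample(tree, sample, label):
    leaf = tree.find_leaf(sample)
    if leaf.label == label:
        leaf.samples.append((sample, label))      # classified correctly: done
        return
    node = leaf
    while node is not None:                       # climb toward the root
        samples = node.collect_samples() + [(sample, label)]
        candidate = solve_lp_tree(samples)        # re-solve the LP problem locally
        # accept only if the rebuilt subtree classifies everything correctly
        # and does not grow taller than the subtree it replaces
        if candidate.all_correct(samples) and candidate.height <= node.height:
            node.replace_with(candidate)
            return
        node = node.parent
    tree.rebuild_from_scratch()                   # rare: fall back to a full rebuild
```

The key point is locality: each retry only involves the samples seen under the current parent, and a full rebuild happens only when even the root cannot absorb the new sample.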
To avoid an irregular increase in tree height at each step, we must check that the height of the subtree produced by solving the linear programming problem does not exceed the height of the current subtree. If it does, we discard the answer obtained from the linear programming problem and, hoping to classify correctly at the next parent node merely by changing the plane parameters rather than the height of the whole tree, we continue solving the problem for that parent node. If no parent node is found whose coefficients (and children) can be changed to achieve the desired result, and we reach the root node, we rebuild the tree; in practice, relative to the number of samples the tree learns during training, this event happens very rarely. The pseudocode of the above algorithm is presented in Fig. 3.

3 Data set and feature extraction
One of the most important steps in pattern recognition is extracting features, or specific properties, from the received input data and reducing the dimensionality of the pattern vector. This step is often called preprocessing and feature extraction. A data set consists of samples formed from the information of different people or objects. In this project, the data set contains 1000 samples from credit card applicants of a German bank. Each customer record comprises 32 fields, i.e. each row of the data set has 32 columns, of which the first column is the customer number and the last column is the response variable. The response variable indicates customer credit, with good-credit and bad-credit classes. In total, the set contains 700 customers with good credit and 300 customers with bad credit.

Fig. 4 Variable values for the first customers

3.1 The Segmentation of Data
Predictive modeling resembles the human practice of using observations to build a model of the important features of a phenomenon. The model generalizes to the real world and matches new data to a general format. It is created using supervised learning, which comprises a training phase and a testing phase. In the training phase, a model is built using a large number of previously observed samples; these samples are called the training set. In the testing phase, the model is applied to data outside the training set in order to estimate its accuracy.

3.2 Cross validation
Cross validation, sometimes called rotation estimation, is an evaluation method that determines to what extent the results of a statistical analysis on a data set generalize and are independent of the training data. In particular, this technique is used in prediction applications to estimate how useful the considered model will be in practice. In general, a round of cross validation partitions the data into two complementary sets, performs the analysis on one subset (the training data), and validates the analysis on the other subset (the validation or test data). To reduce variance, validation is performed several times with different partitions and the validation results are averaged. When collecting more data is difficult, costly, or impossible, cross validation helps avoid biased conclusions drawn from the current data that would not generalize.

3.2.1 K-fold cross validation
In this validation, the data is partitioned into K subsets. In each round, one of the K subsets is used for validation (testing) and the remaining K−1 subsets are used for training. Finally, the mean of the results of these K validation rounds is taken as the final estimate.
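The k-fold procedure just described can be sketched in a few lines of plain NumPy. The splitting here is unstratified, and the toy majority-vote "model" at the end is purely illustrative, not the paper's classifier:

```python
# Minimal k-fold cross-validation sketch: split the indices into k folds,
# train on k-1 folds, evaluate on the held-out fold, and average the errors.
import numpy as np

def k_fold_scores(X, y, k, train_fn, error_fn):
    idx = np.arange(len(X))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])
        errors.append(error_fn(model, X[test_idx], y[test_idx]))
    return float(np.mean(errors))   # mean of the k validation rounds

# Toy usage: a "model" that always predicts the majority training label.
X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 7 + [1] * 3)
train = lambda X, y: int(np.bincount(y).argmax())
err = lambda m, X, y: float(np.mean(y != m))
print(k_fold_scores(X, y, 5, train, err))
```

A stratified variant, as mentioned below for the classified k-fold, would additionally keep the class ratio of each fold close to that of the whole set.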
Other methods can of course be used to combine the results. Usually, 10-fold cross validation is used. In stratified (classified) k-fold cross validation, each subset is constructed so that the ratio of data in each class matches that of the overall set.

4 Experimental Results
Finally, by comparing the predicted class with the true class, the accuracy of the obtained model can be measured. After full testing on the data set, Table 1 shows the performance of this model on the considered data set.

Table 1 Results of the proposed method on the mentioned dataset
  Time                120.1
  Number of leaves    2
  Experimental error  0.069
  Training error      0.058

As can be seen, the method achieves appropriate training and testing errors and, with a minimal number of leaves and a short runtime, performs well. Table 2 compares this method with other tree-building methods such as C4.5 and CART.

Table 2 Error of different methods in building the decision tree
  Method        Training error   Experimental error
  C4.5          0.198            0.22
  CART          0.17             0.2125
  Inc LP Tree   0.058            0.069

As mentioned before, there are other classification methods besides the decision tree. Here we present the results of classification with two of them: the support vector machine and the neural network. Table 3 shows the results of the support vector machine method.

Table 3 Accuracy of the SVM method
  Method                     C=1     C=0.7   C=0.5   C=0.1
  SVM with linear kernel     0.826   0.852   0.899   0.886
  SVM with Gaussian kernel   0.871   0.899   0.941   0.923

Table 4 Experimental error of the SVM and neural network methods
  Method           Experimental error
  SVM              0.059
  Neural network   0.245

Fig. 5 Accuracy of different methods on the mentioned dataset

5 Conclusion
According to the results, the performance of the SVM method is slightly higher than that of the proposed method.
However, this better performance comes at the cost of increased computational complexity and reduced explainability, and the SVM has a greater computational order than the proposed algorithm. In the proposed method, the tree first tries to perform learning using local data; as we move toward the root node, the amount of involved data increases, until at the root node all data contribute to the process of rebuilding the tree. Finally, as the problem involves more data and the dimensionality of the problem (the feature set of each sample) grows, the incremental decision tree building algorithm improves further. Given the accuracy and speed of linear programming for solving this problem, this is a new approach to building decision trees with a unique feature compared with other methods for building multivariate decision trees: it searches for the global minimum and does not suffer from the problems associated with local minima.

References
1. C. Liang, "Decision Tree for Dynamic and Uncertain Data Streams," pp. 209-224, 2010.
2. J. Yi et al., "A bank customer credit evaluation based on the decision tree and the simulated annealing algorithm," in Proc. 8th IEEE International Conference on Computer and Information Technology (CIT 2008), 2008.
3. M. Last, O. Maimon, and E. Minkov, "Improving stability of decision trees," International Journal of Pattern Recognition, pp. 1-26, 2002.
4. E. Zawadzki and T. Sandholm, "Search Tree Restructuring," 2010.
5. J. Su and H. Zhang, "A fast decision tree learning algorithm," in Proceedings of the National Conference on Artificial Intelligence, 2006.
6. Y. Zhang and H. Huei-chuen, "Decision tree pruning via integer programming," 2005.
7. K. Bennett, "Decision tree construction via linear programming," 1992.
8. P. Giudici, Applied Data Mining: Statistical Methods for Business and Industry, John Wiley & Sons, 2003.
9. D. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, The MIT Press, 2002.
10. M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, 2003.