Construction of a decision tree using incremental
learning in a bank system
Iman Kohyarnejadfard * and Farhad Mazareinezhad
[email protected], Iran University of Science and Technology
[email protected], Shiraz University
Abstract The decision tree is known as an important tool in classification and
data mining, and various algorithms have been introduced for building decision
trees. With the development of electronic banking, recording transaction
information has become easier, and by analyzing this database information we
can identify customers easily and allocate resources to profitable customers
optimally, thereby improving banking productivity. Since funding the granting of
facilities entails financial costs, the lending conditions must be economical for
both banks and depositors. Currently, due to the high volume of bank facilities,
their repayment is a great challenge for banks. Clustering data into appropriate
classes or categories is one of the important subjects in pattern recognition, and
doing so while minimizing the number of incorrectly classified samples is
essential. There is a variety of clustering methods, each with its own
advantages and disadvantages.
In this paper, we propose a method that constructs a decision tree using
multivariate decision functions. A linear combination of the decision variables
allows multivariate equations to be represented at the nodes, which yields
higher accuracy and a more compact tree than single-variable decision
functions. Linear programming can be used to specify the linear combination
that separates two classes at a node of the decision tree.
Key words Data mining, clustering, decision tree, incremental learning of trees
1 Introduction
The use of data mining techniques to analyze customer behavior is now
standard practice in the global economy. Analyzing and understanding
customer behavior is one of the principles of the development of banks'
competitive strategies, as it leads to attracting and retaining potential
customers and maximizing customer value. Data mining is a technique that
helps banks decide who their target customers are. Advances in information
technology in recent years have made relationship marketing an undeniable
reality. Technologies such as data warehouses, data mining, and campaign
management software have turned customer relationship management into an
arena of competition. In particular, by extracting hidden patterns from a large
database via data mining, organizations can determine which customers are
valuable and can predict their future behavior [9,10].
Since lending and granting credit cards are activities that carry risk
rather than certainty, this paper tries to rate customer credit using data mining
methods. Considering and incorporating the degree of risk into banks' lending
decisions is necessary and critical [2,8], because if the relations between a
bank and its customers are not based on the degree of risk, excessive collateral
may be demanded from a good customer with a valid account while, on the
other hand, a loan and credit may be granted to another customer with low
credibility. For this purpose, managers should measure and grade the
creditworthiness of natural and legal persons in order to determine the risk of
lending and to set the relations between the bank and its customers according
to each customer's risk level [2,8].
Decision tree learning is one of the most common and most practical
learning methods. It is a method for approximating discrete-valued functions.
A decision tree is a tree structure together with some inference procedures.
The decision tree is known as an important tool in classification and data
mining because its tree structure allows simple inference. Various algorithms
have been proposed for constructing decision trees, the majority of which use
greedy methods [5]. Most algorithms developed for learning decision trees,
such as ID3, CART, and C4.5, are variants of a basic algorithm that performs
a top-down greedy search through the space of all decision trees. In contrast,
multivariate decision functions can be used; they reduce interpretability but
create a simpler tree with fewer nodes [3,4]. For this purpose, linear
programming can be used to specify the best combination of features placed
at a node of the tree as its decision function. A linear combination of the
decision variables allows multivariate equations to be represented, which
makes it possible to increase accuracy and decrease the number of tree nodes
[7].
2 The decision tree
In artificial intelligence, trees are used to represent various concepts such as
the structure of sentences, equations, game states, and so on. Decision tree
learning is a way of approximating objective functions with discrete values. The
method is robust to data noise and can learn disjunctions of conjunctive
predicates. It is one of the most popular inductive learning algorithms and has
been applied successfully in various applications. A decision tree classifies
samples by passing them from the root downward until they arrive at a leaf
node. Each internal (non-leaf) node is associated with a feature, which poses
a question about the input; each internal node has as many branches as there
are possible answers to that question. The leaves of the tree are labeled with
a class or a category of answers.
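For instance, this root-to-leaf walk can be sketched as follows; the node layout and the credit-themed toy tree are our own illustration, not the paper's construction:

import sys  # standard library only; no external dependencies needed here

def predict(node, sample):
    """Walk from the root to a leaf, answering one question per node."""
    while "label" not in node:              # internal node: ask its question
        answer = sample[node["feature"]]
        node = node["branches"][answer]     # follow the matching branch
    return node["label"]                    # leaf: return its class

# Toy tree: ask about income first, then about collateral for low incomes.
tree = {"feature": "income",
        "branches": {"high": {"label": "good credit"},
                     "low": {"feature": "collateral",
                             "branches": {"yes": {"label": "good credit"},
                                          "no": {"label": "bad credit"}}}}}
print(predict(tree, {"income": "low", "collateral": "yes"}))  # good credit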
The advantages of using a decision tree over other techniques are:
1) the produced rules are extractable and understandable;
2) the decision tree can work with both continuous and discrete data;
3) it uses simple decision regions;
4) unnecessary comparisons are omitted in this structure;
5) different features can be used for different samples;
6) data preparation for a decision tree is simple or unnecessary;
7) a decision-tree model can be verified using statistical tests;
8) decision-tree structures are powerful for analyzing large data sets in a short
time, and the decision tree is also able to identify differences between
subgroups.
2.1 The Linear Programming Approach
The optimal linear combination of variables defines a plane that minimizes an
error criterion for the classification of the samples [6]. In other words, we use
linear programming in building the decision tree to minimize the
misclassification error. In the remainder of this section, we formulate the
problem of finding this plane as a problem solvable by linear programming
methods. For this purpose, we first describe the notation used in this paper.
Each sample is represented by a variable X, a vector in the n-dimensional
space R^n, where n is the number of features defined for each sample. Assume
we have two disjoint sets, A and B, defined in the n-dimensional space; the set
A is represented by an m × n matrix and the set B by a k × n matrix, where m
and k are the numbers of samples belonging to the classes A and B,
respectively.
2.1.1 Formulating a problem using linear programming
If the sets A and B are linearly separable, the goal is to find a plane that
separates the two sets, i.e. Aω > eγ and Bω < eγ, where ω is an n-dimensional
vector of plane weights, γ is a real number acting as the threshold (so the plane
is xω = γ), and e denotes a vector of ones. If a sample A_i of the set A is
classified correctly, then −A_i ω + γ < 0, i.e. (−A_i ω + γ)₊ = 0, where (·)₊ =
max(·, 0); otherwise (−A_i ω + γ)₊ > 0. The same reasoning extends to the set
B [7], which leads to the minimization problem
$$\min_{\omega,\gamma}\ \frac{1}{m}\left\|(-A\omega + e\gamma)_+\right\|_1 + \frac{1}{k}\left\|(B\omega - e\gamma)_+\right\|_1 \qquad (1)$$
subject to the condition ω ≠ 0.
The condition ω ≠ 0 is essential, because without it γ = 0 and ω = 0
would be a feasible solution that represents no plane at all. However, such a
condition cannot be expressed within a linear program. To resolve this, the
problem is modified as follows:
$$\min_{\omega,\gamma}\ \frac{1}{m}\left\|(-A\omega + e\gamma + e)_+\right\|_1 + \frac{1}{k}\left\|(B\omega - e\gamma + e)_+\right\|_1 \qquad (2)$$
Problem (2) always produces a plane xω = γ that correctly recognizes
all samples of the classes A and B, provided the two classes are linearly
separable. Otherwise, the obtained plane separates the samples of the two
classes with minimum error. To additionally ensure that no sample lies exactly
on the obtained plane, equation (2) can be rewritten as follows:
$$\min_{\omega,\gamma}\ \frac{1}{m}\sum_{i=1}^{m}(-A_i\omega + \gamma + 1)_+ + \frac{1}{k}\sum_{j=1}^{k}(B_j\omega - \gamma + 1)_+ \qquad (3)$$
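To make the formulation concrete, the following is a minimal sketch (ours, not the paper's code) of how problem (3) can be cast as a standard linear program and solved with scipy.optimize.linprog: introducing slack vectors with y ≥ −Aω + eγ + e, y ≥ 0 and z ≥ Bω − eγ + e, z ≥ 0 turns the objective into a linear function of (ω, γ, y, z) under linear constraints. The toy point sets at the end are illustrative assumptions.

import numpy as np
from scipy.optimize import linprog

def fit_separating_plane(A, B):
    """Solve problem (3): find the plane x.omega = gamma minimizing the
    averaged positive-part classification error for sets A and B."""
    m, n = A.shape
    k = B.shape[0]
    # Decision vector: [omega (n), gamma (1), y (m), z (k)].
    c = np.concatenate([np.zeros(n + 1), np.full(m, 1.0 / m), np.full(k, 1.0 / k)])
    # y_i >= -A_i.omega + gamma + 1  <=>  -A_i.omega + gamma - y_i <= -1
    rows_A = np.hstack([-A, np.ones((m, 1)), -np.eye(m), np.zeros((m, k))])
    # z_j >=  B_j.omega - gamma + 1  <=>   B_j.omega - gamma - z_j <= -1
    rows_B = np.hstack([B, -np.ones((k, 1)), np.zeros((k, m)), -np.eye(k)])
    A_ub = np.vstack([rows_A, rows_B])
    b_ub = -np.ones(m + k)
    # omega and gamma are free; the slack variables are non-negative.
    bounds = [(None, None)] * (n + 1) + [(0, None)] * (m + k)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    omega, gamma = res.x[:n], res.x[n]
    return omega, gamma, res.fun

# Toy 2-D example in the spirit of Fig. 1: two linearly separable clouds.
A = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
B = np.array([[3.0, 3.0], [4.0, 2.5], [3.5, 4.0]])
omega, gamma, err = fit_separating_plane(A, B)
print("omega =", omega, "gamma =", gamma, "error =", err)

A zero objective value certifies that A and B are linearly separable; otherwise the returned plane is the minimum-error separator of problem (3).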
For example, consider the samples shown in Fig. 1 in a two-dimensional space.
The plane obtained above is a line that separates the two categories of black
and white samples from each other; its equation is given in the figure.
Fig. 1 Separation of two classes using linear programming [7]
The equation of the corresponding line, obtained by solving the problem, is:
$$\begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} \omega_1 \\ \omega_2 \end{bmatrix} = \gamma \qquad (4)$$
In Fig. 2, the same operation is shown for samples that are not linearly
separable, and the minimum-error plane is drawn.
Fig. 2 Linear combination of variables using linear programming [7]
2.1.2 Construction of Decision Tree
There are two main approaches to building the decision tree. In the first, we
build the tree incrementally: as mentioned above, every node of the decision
tree contains a linear function obtained by solving the sample-separability
problem with the linear programming method. For this purpose, we first place
the first sample in the first node; as a result, the tree consists of one node
labeled with the class of the samples placed in it. Further samples are then
inserted into the tree one by one until the combination of the samples placed
in a node violates the problem's condition, which is the required accuracy of
each node. For example, if we require 100 percent accuracy in each node, the
number of misclassifications in each node must be zero, so a leaf cannot
accept any sample whose class differs from the class of that leaf.
When a sample arrives that violates this condition, the tree structure
must be examined and, if necessary, updated. At that node we face a
separability problem between the samples of two classes, the samples already
placed in the node and the new one, which, as mentioned before, we solve in
the linear programming form. Two cases are possible. The first is that the
samples are linearly separable. In this case, the leaf node under consideration
is replaced by an internal node whose decision function is the obtained line
equation, together with two leaves, each labeled with the class of the samples
placed on it. The second case is that the samples are not linearly separable.
In this case, the plane with minimum classification error is found and placed in
the node, and the samples are moved down the tree so that each of them is
transferred to the left or right subtree. The algorithm is then applied to each left
or right subtree until the samples placed in every subtree are linearly
separable. This approach continues until all samples have entered the tree.
Insert the new sample into the tree
If a classification error occurs:
    While the classification is not yet correct:
        Find the parent node
        current height = height of the current sub-tree
        Solve the linear programming problem over all the samples
            contained in that sub-tree and build a new sub-tree
        If all samples are classified correctly
           and the new sub-tree is no taller than the current height:
            Replace the sub-tree with the new one
            Break out of the loop
Fig. 3 The proposed algorithm
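The following Python sketch fleshes out Fig. 3 under our own design assumptions: a Node class holds either a class label (leaf) or a linear test (internal node), and a leaf is split via the fit_separating_plane routine from the sketch in Section 2.1.1 whenever its purity (100 percent node accuracy) is violated. The names and structure are illustrative, not the authors' implementation.

class Node:
    """A leaf holding samples of one class, or an internal node holding a
    linear test: x.omega >= gamma goes right, x.omega < gamma goes left."""
    def __init__(self):
        self.label = None           # class label while this node is a leaf
        self.samples = []           # (sample, label) pairs seen at this leaf
        self.omega = self.gamma = None
        self.left = self.right = None

    def is_leaf(self):
        return self.omega is None

def classify(node, x):
    while not node.is_leaf():
        node = node.right if np.dot(x, node.omega) >= node.gamma else node.left
    return node.label

def insert(node, x, label):
    """Route a new sample to a leaf; if it violates the leaf's purity, replace
    the leaf by the LP's linear test and push the samples down (assumes
    identical samples never carry different labels)."""
    while not node.is_leaf():
        node = node.right if np.dot(x, node.omega) >= node.gamma else node.left
    if node.label is None:          # very first sample seen at this leaf
        node.label = label
    node.samples.append((x, label))
    if label != node.label:         # purity violated: split this leaf
        A = np.array([s for s, l in node.samples if l == node.label])
        B = np.array([s for s, l in node.samples if l != node.label])
        node.omega, node.gamma, _ = fit_separating_plane(A, B)
        node.left, node.right = Node(), Node()
        pending, node.samples, node.label = node.samples, [], None
        for s, l in pending:        # re-route; children split recursively
            insert(node, s, l)

A tree is grown by calling insert(root, x, label) for each arriving sample and queried with classify(root, x); when the stored samples are linearly separable, the LP's zero error yields exactly the two pure leaves described above.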
2.2 The incremental approach to building the tree
Most tree-induction algorithms are batch methods: if new samples arrive after
the tree has been built, the whole tree must be rebuilt to learn the new samples.
In contrast, a useful feature of competing structures such as neural networks
is their capability for incremental construction: the network weights can be
adjusted again after training so that new samples are used as well.
Furthermore, in most cases the information required for learning is not
available from the beginning, and new data arrive over time, so the decision
structures must be revised. Considering this issue, the algorithm presented in
the previous section can be modified so that, when we face a new sample
whose label differs from that of the leaf node it reaches, we examine the parent
nodes recursively: using only the samples placed under that parent, we try to
modify the coefficients of its plane so that the new sample is also covered and
classified correctly; if this is not possible, we go to the next parent and examine
the problem again. If this cannot be achieved at any parent node and we reach
the root node, we rebuild the whole tree. At each step of this revision, we must
search only among the samples that have been seen under the same parent
node. To avoid an irregular increase of the tree height at each step, we must
check that the height of the subtree produced by solving the problem with
linear programming is not greater than the current subtree height. If it is, we
ignore the answer obtained from the linear program and, hoping to classify
correctly at the next parent node just by changing the plane parameters rather
than the tree height, we continue solving the problem at that parent node. If
along this way no parent node is found whose coefficients (and those of its
children) can be changed to achieve the desired result, and we arrive at the
root node, we rebuild the tree; in practice, compared with the number of
samples the tree learns during the learning procedure, this event happens very
rarely. The pseudo-code of the above algorithm has been presented in Fig. 3.
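A rough sketch of this height-guarded revision, layered on the Node, insert, and fit_separating_plane helpers from the previous sketches: it assumes the new sample has been routed to a leaf whose label conflicts, with path the list of nodes from the root down to that leaf. The path-based interface and the in-place splice are our own simplifications.

def collect_samples(node):
    """All (sample, label) pairs stored in the leaves under a node."""
    if node.is_leaf():
        return list(node.samples)
    return collect_samples(node.left) + collect_samples(node.right)

def subtree_height(node):
    return 0 if node.is_leaf() else 1 + max(subtree_height(node.left),
                                            subtree_height(node.right))

def revise(path, x, label):
    """Walk upward from the violated leaf; accept the first re-solved subtree
    that is no taller than the one it replaces, else rebuild at the root."""
    for node in reversed(path[:-1]):            # nearest parent first
        candidate = Node()
        for s, l in collect_samples(node) + [(x, label)]:
            insert(candidate, s, l)             # re-solve the LPs locally
        if node is path[0] or subtree_height(candidate) <= subtree_height(node):
            node.__dict__.update(candidate.__dict__)  # splice subtree in place
            return                              # root case = full rebuild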
3 Data set and feature extraction
One of the most important steps in pattern recognition is extracting the features,
or specific properties, of the received input data and reducing the dimension of
the pattern vector. This step is often called preprocessing and feature
extraction. A data set consists of a set of samples formed from the information
of different people or objects. In this project, the data set contains 1000
samples of credit-card applicants of a German bank. Each customer record
contains 32 fields, i.e., each row of the data set has 32 columns, of which the
first column is the customer number and the last column is the response
variable. The response variable represents the customer's credit, with the
classes good credit and bad credit. In total, there are 700 customers with good
credit and 300 customers with bad credit in this set.
Fig. 4 Variable values for the first customers
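For illustration, a set with this layout could be loaded roughly as follows; the file name german_credit.csv is our assumption (the paper does not name a file), as is the use of pandas.

import pandas as pd

# Hypothetical file with the 32-column layout described above:
# column 0 = customer number, columns 1..30 = features, column 31 = response.
df = pd.read_csv("german_credit.csv")
X = df.iloc[:, 1:-1].to_numpy()   # the 30 feature columns
y = df.iloc[:, -1].to_numpy()     # response: good credit / bad credit
print(X.shape, y.shape)           # expected: (1000, 30) (1000,)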
3.1 The Segmentation of Data
Predictive modeling resembles the way humans use observations to build a
model of the important features of a phenomenon. In this method, the model
generalizes over the real world and can match new data against a general
pattern. The model is created using supervised learning, which consists of two
phases: training and testing. In the training phase, a model is built using a large
number of previously observed samples; these samples are called the training
set. In the testing phase, the model is applied to data that are not in the training
set in order to estimate its accuracy.
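A minimal sketch of this two-phase protocol, assuming scikit-learn and the X, y arrays from the previous snippet, with scikit-learn's standard decision tree standing in for the proposed model:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out 30% of the samples for the testing phase.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)  # training phase
print("test accuracy:", model.score(X_test, y_test))    # testing phase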
3.2 Cross validation
Cross-validation, sometimes called rotation estimation, is an evaluation
method that determines to what extent the results of a statistical analysis on a
data set generalize and are independent of the training data. In particular, this
technique is used in prediction applications to determine how useful the
considered model will be in practice. In general, one round of cross-validation
consists of partitioning the data into two complementary sets, performing the
analysis on one of the subsets (the training data), and validating the analysis
using the other subset (the validation or test data). To reduce variance, the
validation is performed several times with different partitions, and the validation
results are averaged. When collecting more data is difficult, costly, or
impossible, cross-validation helps avoid biased conclusions drawn from the
current data that would not generalize.
3.2.1 K-fold cross-validation
In K-fold cross-validation, the data is partitioned into K subsets. Each time, one
of these K subsets is used for validation (testing) and the remaining K−1
subsets are used for training. Finally, the mean of the results of these K
validations is taken as the final estimate (other methods can of course be used
to combine the results). Usually, 10 folds are used. In stratified K-fold
cross-validation, the ratio of the classes within each subset is kept
approximately equal to the ratio in the whole set, as in the sketch below.
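For example, stratified 10-fold cross-validation could be run with scikit-learn as follows; again the library's own decision tree stands in for the proposed model, and X and y come from the loading snippet above.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))   # one fold's accuracy
print("mean accuracy over 10 folds:", np.mean(scores))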
4 Experimental Results
Finally, by comparing the predicted class with the true class, the accuracy of
the obtained model can be measured. After running the full test on the data
set, Table 1 shows the results of this model on the considered data set.
Table 1 Results of the proposed method on the mentioned data set

Time    Number of leaves    Experimental error    Training error
120.1   2                   0.069                 0.058
As can be seen, this method achieves suitable training and test errors
and, with a minimal number of leaves and a short runtime, performs properly.
Table 2 compares this method with other tree-building methods such as C4.5
and CART.
Table 2 Error of different methods in building the decision tree

Method        Training error    Experimental error
C4.5          0.198             0.22
CART          0.17              0.2125
Inc LP Tree   0.058             0.069
As mentioned before, there are classification methods other than the
decision tree. Here we present the classification results of two such methods,
the Support Vector Machine and the neural network. Table 3 shows the results
of the Support Vector Machine method.
Table 3 Accuracy of the SVM results

Method                      C=1      C=0.7    C=0.5    C=0.1
SVM with linear kernel      0.826    0.852    0.899    0.886
SVM with Gaussian kernel    0.871    0.899    0.941    0.923
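The grid in Table 3 can be approximated with scikit-learn as sketched below; the paper does not report its preprocessing or the Gaussian-kernel width, so the standardization step and the default gamma are our assumptions (rbf is scikit-learn's name for the Gaussian kernel).

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

for kernel in ("linear", "rbf"):
    for C in (1, 0.7, 0.5, 0.1):
        clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=C))
        acc = cross_val_score(clf, X, y, cv=10).mean()  # 10-fold accuracy
        print(f"SVM {kernel:6s} C={C:<4}: accuracy {acc:.3f}")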
Table 4 Experimental error of the SVM and the neural network

Method           Experimental error
SVM              0.059
Neural network   0.245
Fig. 5 Accuracy of the different methods on the mentioned data set (bar chart
comparing C4.5, CART, Inc LP Tree, SVM, and the neural network on a 0–1
accuracy scale)
5 Conclusion
According to the results, the performance of the SVM method is slightly better
than that of the proposed method. However, this better performance comes at
the cost of higher computational complexity and reduced explainability, and the
SVM has a higher computational order than the proposed algorithm. In the
proposed method, the tree first tries to perform the learning procedure using
local data; as we move toward the root node, more data become involved, so
that when we arrive at the root node, all the data contribute to the process of
rebuilding the tree. Consequently, the larger the problem's data set and the
higher its dimension (the feature set of each sample), the more the incremental
decision-tree building algorithm improves. Given the accuracy and speed of
linear programming for solving this problem, the method is a new approach to
building decision trees with a unique feature compared to other methods for
building multivariate decision trees: it searches for the global minimum and
does not suffer from the problems associated with local minima.
References
1. C. Liang, "Decision tree for dynamic and uncertain data streams," pp. 209–224,
2010.
2. J. Yi et al., "A bank customer credit evaluation based on the decision tree and the
simulated annealing algorithm," in Proc. 8th IEEE International Conference on
Computer and Information Technology (CIT 2008), 2008.
3. M. Last, O. Maimon, and E. Minkov, "Improving stability of decision trees,"
International Journal of Pattern Recognition, pp. 1–26, 2002.
4. E. Zawadzki and T. Sandholm, "Search tree restructuring," 2010.
5. J. Su and H. Zhang, "A fast decision tree learning algorithm," in Proceedings of the
National Conference on Artificial Intelligence, vol. 5, 2006.
6. Y. Zhang and H. Huei-chuen, "Decision tree pruning via integer programming,"
August 2005.
7. K. Bennett, "Decision tree construction via linear programming," 1992.
8. P. Giudici, Applied Data Mining: Statistical Methods for Business and Industry,
John Wiley & Sons, 2003.
9. D. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, The MIT Press,
2002.
10. M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John
Wiley & Sons, 2003.